Getting Started with APPy
This guide walks you through how to install and use APPy to parallelize your Python loops on GPUs.
Installation
APPy is available on PyPI as appyc and can be installed using pip:
pip install appyc
Or you can install the latest development version from source:
git clone https://github.com/habanero-lab/APPy.git
cd APPy
pip install -e .
APPy is designed to have minimal dependencies and the appyc package only includes the code generator itself.
To use the triton backend, you will also need to have torch and triton (part of the torch package) installed.
pip install torch
Supported platforms
APPy currently supports Python 3.9+ on Linux platforms with a CUDA-enabled GPU (Compute Capability 8.0 or higher).
Basic example
The easiest way to parallelize a Python/NumPy loop with APPy is to replace range with appy.prange
and annotate the loop with @appy.jit:
import numpy as np
from appy import jit, prange
@jit
def add_one(a):
for i in prange(a.shape[0]):
a[i] += 1
a = np.zeros(10)
add_one(a)
# a is now [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Reductions can be parallelized as well:
@jit
def sum_vector(a, N):
sum = 0
for i in prange(N):
sum += a[i]
return sum
a = np.ones(10)
sum_vector(a, 10)
# sum is now 10
APPy automatically detects reductions and make them work properly in parallel.
Will APPy work for my code?
APPy only supports the following operations in the parallel loop region, which should be sufficient to express a wide range of applications already.
On scalar integer or float values:
Arithmetic operations
Math functions
Bitwise operations
Logical operations
Compare operations
On arrays of integers or floats:
Array indexing (store or load)
Control flows:
Ternary operators
In general, APPy’s usage scenarios are similar to numba.prange which parallelizes Python loops on CPUs.
When can appy.prange be used?
prange may be used if the loop does not have any cross-iteration dependencies, except for reductions which can actually be parallelized.
An example of a cross-iteration dependency is:
def dependence_example(a, N):
for i in range(N-1):
a[i+1] = a[i]
In this code example, every loop iteration depends on the previous loop iteration, so the loop cannot be parallelized (prange cannot be used).
Reduction is a special case of cross-iteration dependency that can be parallelized due to reduction operations being commutative:
@jit
def sum_vector(a, N):
sum = 0
for i in prange(N):
sum += a[i]
return sum
More examples are available in High-Level Programming Interface and Low-Level Programming Interface.
APPy supports both a high-level and a low-level programming interface.
The high-level interface is easy to use - parallelizing a Python loop on GPUs
is as simple as replacing range with appy.prange while
the low-level interface is more flexible and allows for more control over the generated code via pragmas.
Known Limitations/Bugs
As of the current new implementation, automatic reduction supports only scalar variables, not array variables. For example,
@jit
def sum_vector(a, N):
sum = 0
for i in prange(N):
sum += a[i]
return sum
is fine, and the compiler will recognize sum as a reduction variable. However, for loops like
#pragma parallel for
for i in range(M):
#pragma simd
for j in range(N):
y[i] += alpha * A[i, j] * x[j]
The compiler won’t be able to reliably recognize y[i] as a reduction pattern.
Each backend also has some backend-specific limitations. For the “triton” backend, using regular control flows
inside a #pragma simd loop is not supported. Only ternary expressions can be used.