The following are example Sequoia programs, roughly listed in order of increasing complexity. The Vector Add example is also something of a tutorial.
More examples will be posted here in the coming weeks and months, including some "irregular" programs which make use of the scatters and gathers that can be expressed via indexed array blocks. Further, some other examples are provided as a part of the Sequoia release download.
The vector add program simply takes two 1D arrays as input and adds their corresponding elements to produce a 1D output array. The prototype of a Sequoia task that implements a vector add is as follows; the two input arrays A and B are added to produce the output array C.
void task VectAdd(in float A[N], in float B[N], out float C[N]); |
The first step in writing a Sequoia program is deciding on how the problem will be decomposed. Suppose that the code was run with 1-billion element arrays as inputs on a machine (such as the Cell processor) which has only 256KB of local space. Since the entire arrays won't fit, the problem must be decomposed or partitioned into subproblems which will fit.
Vector add has an obvious decomposition strategy: the size-N input arrays can be partitioned into size-T subarrays, and then each pair of subarrays of A and B can be added to produce a subarray of the output array C. As the input is size-N and the subarrays are size-T, there will be N/T subproblems to be solved. Further, each subproblem can be executed in parallel. This is illustrated in the following diagram:
|
Sequoia Source Code, Version 1
There is a straightforward translation of the above problem decomposition strategy into Sequoia: a size-N inner task will chop the arrays into size-T array blocks that are passed recursively to the vector add task, which will perform the vector add operation on the subarrays.
// vectadd.sq (version 1)
// ----------------------
void task VectAdd(in float A[N], in float B[N], out float C[N]);
void task<inner> VectAdd::Inner(in float A[N], in float B[N], out float C[N])
{
tunable T;
mappar (unsigned int i = 0 : N/T) {
VectAdd(A[i*T:(i+1)*T;T], B[i*T:(i+1)*T;T], C[i*T:(i+1)*T;T]);
}
}
void task<leaf> VectAdd::Leaf(in float A[N], in float B[N], out float C[N])
{
for (unsigned int i = 0; i < N; i++) {
C[i] = A[i] + B[i];
}
}
|
Iteration i of the mappar operates on elements in the range i*T to (i+1)*T of each of the three arrays; written in Sequoia's array block notation, this becomes the range [i*T:(i+1)*T;T], where the third field of this expression (;T) denotes the maximum possible size of the range (which in this case is its exact size).
Sequoia Source Code, Version 2
There is an assumption implicit in the above implementation of vector add, namely the fact that the size of the array blocks, T, must evenly divide the size of the input arrays, N. This assumption can be removed by a small variation in the implementation, as follows:
// vectadd.sq (version 2)
// ----------------------
void task VectAdd(in float A[N], in float B[N], out float C[N]);
void task<inner> VectAdd::Inner(in float A[N], in float B[N], out float C[N])
{
tunable T;
mappar (unsigned int i = 0 : (N+T-1)/T) {
VectAdd(A[i*T;T], B[i*T;T], C[i*T;T]);
}
}
void task<leaf> VectAdd::Leaf(in float A[N], in float B[N], out float C[N])
{
for (unsigned int i = 0; i < N; i++) {
C[i] = A[i] + B[i];
}
}
|
There are only two differences between this implementation and that of version 1:
The implementation of version 2 illustrates a common Sequoia idiom: the use of the [start;max] array block notation to express "bounded" array blocks, that is, array blocks which will be exactly a given size if they are "in the middle" of their base array, or smaller if they are "on the edge". This is explained further in the section on array block default values.
Mapping the Program onto a Cell Processor
Sequoia programs are abstract in that they don't refer to any specific machine parameters or mechanisms, enabling them to be portable across machines with very different architectures and yet still yield high performance. Compiling a Sequoia program for a particular target machine involves making a set of mapping decisions, and is referred to as instantiating or specializing the program for the machine.
Suppose that the vector add program was to be mapped onto a Cell processor. One possible way to map the program would be to place an instance of VectAdd::Inner in main memory, to be executed by the PPE, and an instance of VectAdd::Leaf in the LS, to be executed by the SPE. This is illustrated below, with the task instance hierarchy drawn alongside the Cell's abstract machine model.
|
The mapping details that were specified to achieve this were a follows.
Try to visualize how this program will be executed on the Cell processor. The PPE will execute the "i" mappar loop in the VectAdd::Inner instance, and every iteration it will make a task call to the VectAdd::Leaf leaf instance in level 0 (the LS/SPE) of the machine. As dictated by Sequoia's task calling semantics, each task call will result in (1) the subtask's input arguments, which are subarrays of A and B, being copied (i.e. transferred via DMAs) to the LS, (2) the VectAdd::Leaf task instance executing on those input subarrays ro produce its output subarray, which will (3) be copied back to main memory once the VectAdd::Leaf instance has completed. Conceptually, this is a "push" model of execution, with the PPE sending data to the LS every iteration, causing a SPE thread to execute the leaf instance on the data, and then retreiving the results from the LS once the leaf instance is done.
As it turns out, this program will perform very poorly on Cell if mapped this way, since the amount of time that the VectAdd::Leaf instance takes to perform a single size-T vector add operation (which it does every iteration of i) will be dwarfed by the amount of time that it takes for the PPE to transfer the data and make the SPE thread invocation each iteration; this particular mapping of the vector add program will be communication-limited.
An important part of getting the optimal performance out of a Sequoia program is choosing the "right" mapping for a given target machine. An second way that the same vector add Sequoia program could be mapped to the Cell processor is illustrated below.
|
Here, the mappar loop (over i) is mapped onto level 0 of the machine rather than level 2; aside from the change to the i loop, this mapping is identical to the first mapping.
Visualize how this second mapping of the program will execute on Cell. The PPE will make a single thread call to the SPE to initiate the execution of the mappar loop, and from that point on, the SPE will be able to execute all iterations of the mappar without any intervention from the PPE. As opposed to the first mapping that was shown, this represents a "pull" execution model. Each iteration, the SPE will fetch a subarray from memory, execute the VectAdd::Leaf instance on it in the LS, and then store the result back when the leaf instance is done.
This is a much more efficient Cell mapping of the vector add program, as the PPE-SPE thread invocation only happens once, at the start of the mappar, rather than every iteration. Further, as the mappar is executing in level 0, it can be parallelized over the machine's SPEs (this is the "SPMDx8" mapping directive in the diagram), and it can be software-pipelined ("SWP"), also referred to as double-buffering in this context, allowing the DMA transfers from one iteration to be overlapped with the SPE leaf instance execution of another iteration.
Note that the value 8192 was chosen to fit Cell's LS capacity, given the second mapping strategy. The LS must stage three subarrays of floats (A, B, and C), each of length 8192, which collectively consume 3*8192*4 bytes = 96KB. Since the loop is being software pipelined with a depth-2 pipeline, two iterations are being run in parallel, and both of these iterations need 96KB of space to stage their data, resulting in a total demand of 192KB. Cell has 256KB total space in an LS, so this will leave some room for the code and stack. As a general heuristic on Cell, the programmer should make all their subarrays that will be staged in the LS as large as possible without exceeding its capacity, and it's not uncommon for the programmer to try out a few different sizes for the LS subarrays before deciding which yields the best program performance.
XML Mapping File for the Cell Target
Note: The XML mapping file interface will soon be replaced with a new interface which is much less verbose and which is more convenient. The XML mapping example provided here will be replaced with an example that uses the new mapping interface when it is ready.
The XML mapping directives for targeting this program to Cell are as follows; these correspond to the second, more efficient, mapping presented above.
<?xml version="1.0"?>
<mapping>
<machine filename="cell.xml"/>
<entrypoints>VectAddTopInst</entrypoints>
<taskmapping name="VectAdd">
<instance name="VectAddTopInst">
<variant>Inner</variant>
<level>2</level>
<tunable name="T">8192</tunable>
<ctrl>
<level>0</level>
<loop name="i">
<spmd>
<fullrange> 0, 8 </fullrange>
<ways> 8 </ways>
<iterblk>1</iterblk>
</spmd>
<swp> 2 </swp>
</loop>
<callsite name="VectAdd">
<target name="VectAddLeafInst">
</target>
</callsite>
</ctrl>
</instance>
<instance name="VectAddLeafInst">
<level>0</level>
<variant>Leaf</variant>
</instance>
</taskmapping>
</mapping>
|
A Sequoia program expresses an abstract algorithm, while a Sequoia program plus a target-specific set of mapping directives represents a concrete implementation of the program. Multiple different target-specific mappings may be provided, enabling the same Sequoia program to be run on different machines. For example, a P4 cluster mapping specification would result in the Sequoia compiler emitting an implementation of the program for a P4 cluster.
Note that at no point did the programmer need to deal with low-level details such as DMAs in their program; they simply express the mapping of their abstract algorithm to a machine, and the Sequoia compiler and runtime handle all of the machine-specific mechanisms "under the hood".
Linking to an External Program
The mapped Sequoia vector add program constitutes an efficient implementation of the vector add algorithm on a specific target processor. The programmer can call into this Sequoia code from their external application, effectively using the Sequoia vector add program as a module or library in a larger application. The Sequoia API section details this calling convention.
A 2D convolution is simply the application of a mask to a 2D image, conceptually the same as a "blur" operation in computer graphics. The prototype of a Sequoia task which implements a 2D convolution is as follows.
void task Conv2D(in float A[M+U-1][N+V-1], in float H[U][V], out float C[M][N]); |
The output image, C, is computed from the input image, A, and the convolution mask, H, according to the following expression.
|
A 2D convolution problem, producing an output image of size M*N, can be decomposed into a set of parallel 2D convolution subproblems, each computing a non-overlapping region of the output image of size S*T. This is the decomposition that is used to implement this algorithm in Sequoia.
Note that in this formulation of 2D convolution, a size (M+U-1)*(N+V-1) input image is used to compute the size M*N output image, and similarly, a size (S+U-1)*(T+V-1) input subimage is used to compute each size S*T output subimage. In this formulation, the input subimages overlap with each other, even though the output subimages don't.
Sequoia Source Code
The abstract Sequoia implementation of a 2D convolution is as follows.
void task Conv2D(in float A[M+U-1][N+V-1], in float H[U][V], out float C[M][N]);
void task<inner> Conv2D::Inner(in float A[M+U-1][N+V-1], in float H[U][V], out float C[M][N])
{
tunable S, T;
mappar (unsigned int i = 0 : (M+S-1)/S) {
mappar (unsigned int j = 0 : (N+T-1)/T) {
Conv2D(A[i*S;S+U-1][j*T;T+V-1], H, C[i*S;S][j*T;T]);
}
}
}
void task<leaf> Conv2D::Leaf(in float A[M+U-1][N+V-1], in float H[U][V], out float C[M][N])
{
for (unsigned int m = 0; m < M; m++) {
for (unsigned int n = 0; n < N; n++) {
C[m][n] = 0;
for (unsigned int u = 0; u < U; u++) {
for (unsigned int v = 0; v < V; v++) {
C[m][n] += H[u][v] * A[m+u][n+v];
}
}
}
}
}
|
In the above implementation, the tunable values that describe the subarray sizes, S and T, don't need to evenly divide the output image size parameters, M and N respectively, as both the array block expressions and the mappar iteration bounds properly handle this case.
XML Mapping File for the Cell Target
Note: The XML mapping file interface will soon be replaced with a new interface which is much less verbose and which is more convenient. The XML mapping example provided here will be replaced with an example that uses the new mapping interface when it is ready.
The XML mapping directives for targeting this program to Cell are as follows.
<?xml version="1.0"?>
<mapping>
<machine filename="cell.xml"/>
<entrypoints>Conv2DMainInst</entrypoints>
<taskmapping name="Conv2D">
<instance name="Conv2DMainInst">
<variant>Inner</variant>
<level>2</level>
<tunable name="S">32</tunable>
<tunable name="T">256</tunable>
<data>
<array name="H">
<arraysizeprecond type="exact" name="h">5, 5</arraysizeprecond>
</array>
</data>
<ctrl>
<level>0</level>
<loop name="i">
<spmd>
<fullrange> 0, 8 </fullrange>
<ways> 8 </ways>
<iterblk>1</iterblk>
</spmd>
</loop>
<loop name="j">
<swp> 2 </swp>
</loop>
<callsite name="Conv2D">
<target name="Conv2DLeafInst">
</target>
</callsite>
</ctrl>
</instance>
<instance name="Conv2DLeafInst">
<variant>Leaf</variant>
<level>0</level>
</instance>
</taskmapping>
</mapping>
|