Vector Add

The vector add program simply takes two 1D arrays as input and adds their corresponding elements to produce a 1D output array. The prototype of a Sequoia task that implements a vector add is as follows; the two input arrays A and B are added to produce the output array C.

void task VectAdd(in float A[N], in float B[N], out float C[N]);

The first step in writing a Sequoia program is deciding on how the problem will be decomposed. Suppose that the code was run with 1-billion element arrays as inputs on a machine (such as the Cell processor) which has only 256KB of local space. Since the entire arrays won't fit, the problem must be decomposed or partitioned into subproblems which will fit.

Vector add has an obvious decomposition strategy: the size-N input arrays can be partitioned into size-T subarrays, and then each pair of subarrays of A and B can be added to produce a subarray of the output array C. As the input is size-N and the subarrays are size-T, there will be N/T subproblems to be solved. Further, each subproblem can be executed in parallel. This is illustrated in the following diagram:

Sequoia Source Code, Version 1

There is a straightforward translation of the above problem decomposition strategy into Sequoia: a size-N inner task will chop the arrays into size-T array blocks that are passed recursively to the vector add task, which will perform the vector add operation on the subarrays.

// vectadd.sq (version 1)
// ----------------------

void task VectAdd(in float A[N], in float B[N], out float C[N]);

void task<inner> VectAdd::Inner(in float A[N], in float B[N], out float C[N])
{
    tunable T;
    mappar (unsigned int i = 0 : N/T) {
        VectAdd(A[i*T:(i+1)*T;T], B[i*T:(i+1)*T;T], C[i*T:(i+1)*T;T]);
    }
}

void task<leaf> VectAdd::Leaf(in float A[N], in float B[N], out float C[N])
{
    for (unsigned int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

Iteration i of the mappar operates on elements in the range i*T to (i+1)*T of each of the three arrays; written in Sequoia's array block notation, this becomes the range [i*T:(i+1)*T;T], where the third field of this expression (;T) denotes the maximum possible size of the range (which in this case is its exact size).

Sequoia Source Code, Version 2

There is an assumption implicit in the above implementation of vector add, namely the fact that the size of the array blocks, T, must evenly divide the size of the input arrays, N. This assumption can be removed by a small variation in the implementation, as follows:

// vectadd.sq (version 2)
// ----------------------

void task VectAdd(in float A[N], in float B[N], out float C[N]);

void task<inner> VectAdd::Inner(in float A[N], in float B[N], out float C[N])
{
    tunable T;
    mappar (unsigned int i = 0 : (N+T-1)/T) {
        VectAdd(A[i*T;T], B[i*T;T], C[i*T;T]);
    }
}

void task<leaf> VectAdd::Leaf(in float A[N], in float B[N], out float C[N])
{
    for (unsigned int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

There are only two differences between this implementation and that of version 1:

  • The mappar iteration bound was changed from N/T to (N+T-1)/T, taking into account the possibilty that T does not divide N evenly.
  • In the event that T doesn't divide N evenly, then there will be N/T size-T subproblems, and a single subproblem "on the end" that is adding two arrays which are smaller than size-T. This was handled by changing the array block range expressions from [i*T:(i+1)*T;T], which specifies an exact start and end index for the range, to [i*T;T], which only specifies the range's start and its maximum length, allowing the subarray that is "on the end" to be smaller than size-T.

The implementation of version 2 illustrates a common Sequoia idiom: the use of the [start;max] array block notation to express "bounded" array blocks, that is, array blocks which will be exactly a given size if they are "in the middle" of their base array, or smaller if they are "on the edge". This is explained further in the section on array block default values.

Mapping the Program onto a Cell Processor

Sequoia programs are abstract in that they don't refer to any specific machine parameters or mechanisms, enabling them to be portable across machines with very different architectures and yet still yield high performance. Compiling a Sequoia program for a particular target machine involves making a set of mapping decisions, and is referred to as instantiating or specializing the program for the machine.

Suppose that the vector add program was to be mapped onto a Cell processor. One possible way to map the program would be to place an instance of VectAdd::Inner in main memory, to be executed by the PPE, and an instance of VectAdd::Leaf in the LS, to be executed by the SPE. This is illustrated below, with the task instance hierarchy drawn alongside the Cell's abstract machine model.

The mapping details that were specified to achieve this were a follows.

  • Instantiate VectAdd::Inner in memory level 2, operating on dynamically sized arrays, and VectAdd::Leaf in memory level 0, operating on arrays of length 8192.
  • Put the arrays in the VectAdd::Inner instance in memory level 2, and the arrays in the VectAdd::Leaf instance in memory level 0.
  • Have the PPE execute the mappar loop (i).
  • Set the tunable T in the VectAdd::Inner instance to 8192, so that the array blocks passed to the VectAdd::Leaf instance will be the right size.

Try to visualize how this program will be executed on the Cell processor. The PPE will execute the "i" mappar loop in the VectAdd::Inner instance, and every iteration it will make a task call to the VectAdd::Leaf leaf instance in level 0 (the LS/SPE) of the machine. As dictated by Sequoia's task calling semantics, each task call will result in (1) the subtask's input arguments, which are subarrays of A and B, being copied (i.e. transferred via DMAs) to the LS, (2) the VectAdd::Leaf task instance executing on those input subarrays ro produce its output subarray, which will (3) be copied back to main memory once the VectAdd::Leaf instance has completed. Conceptually, this is a "push" model of execution, with the PPE sending data to the LS every iteration, causing a SPE thread to execute the leaf instance on the data, and then retreiving the results from the LS once the leaf instance is done.

As it turns out, this program will perform very poorly on Cell if mapped this way, since the amount of time that the VectAdd::Leaf instance takes to perform a single size-T vector add operation (which it does every iteration of i) will be dwarfed by the amount of time that it takes for the PPE to transfer the data and make the SPE thread invocation each iteration; this particular mapping of the vector add program will be communication-limited.

An important part of getting the optimal performance out of a Sequoia program is choosing the "right" mapping for a given target machine. An second way that the same vector add Sequoia program could be mapped to the Cell processor is illustrated below.

Here, the mappar loop (over i) is mapped onto level 0 of the machine rather than level 2; aside from the change to the i loop, this mapping is identical to the first mapping.

Visualize how this second mapping of the program will execute on Cell. The PPE will make a single thread call to the SPE to initiate the execution of the mappar loop, and from that point on, the SPE will be able to execute all iterations of the mappar without any intervention from the PPE. As opposed to the first mapping that was shown, this represents a "pull" execution model. Each iteration, the SPE will fetch a subarray from memory, execute the VectAdd::Leaf instance on it in the LS, and then store the result back when the leaf instance is done.

This is a much more efficient Cell mapping of the vector add program, as the PPE-SPE thread invocation only happens once, at the start of the mappar, rather than every iteration. Further, as the mappar is executing in level 0, it can be parallelized over the machine's SPEs (this is the "SPMDx8" mapping directive in the diagram), and it can be software-pipelined ("SWP"), also referred to as double-buffering in this context, allowing the DMA transfers from one iteration to be overlapped with the SPE leaf instance execution of another iteration.

Note that the value 8192 was chosen to fit Cell's LS capacity, given the second mapping strategy. The LS must stage three subarrays of floats (A, B, and C), each of length 8192, which collectively consume 3*8192*4 bytes = 96KB. Since the loop is being software pipelined with a depth-2 pipeline, two iterations are being run in parallel, and both of these iterations need 96KB of space to stage their data, resulting in a total demand of 192KB. Cell has 256KB total space in an LS, so this will leave some room for the code and stack. As a general heuristic on Cell, the programmer should make all their subarrays that will be staged in the LS as large as possible without exceeding its capacity, and it's not uncommon for the programmer to try out a few different sizes for the LS subarrays before deciding which yields the best program performance.

XML Mapping File for the Cell Target

Note: The XML mapping file interface will soon be replaced with a new interface which is much less verbose and which is more convenient. The XML mapping example provided here will be replaced with an example that uses the new mapping interface when it is ready.

The XML mapping directives for targeting this program to Cell are as follows; these correspond to the second, more efficient, mapping presented above.

<?xml version="1.0"?>
<mapping>
<machine filename="cell.xml"/>

<entrypoints>VectAddTopInst</entrypoints>

<taskmapping name="VectAdd">

    <instance name="VectAddTopInst">
        <variant>Inner</variant>
        <level>2</level>
        <tunable name="T">8192</tunable>
        <ctrl>
            <level>0</level>
            <loop name="i">
                <spmd>
                    <fullrange> 0, 8 </fullrange>
                    <ways> 8 </ways>
                    <iterblk>1</iterblk>
                </spmd>
                <swp> 2 </swp>
            </loop>
            <callsite name="VectAdd">
                <target name="VectAddLeafInst">
                </target>
            </callsite>
        </ctrl>
    </instance>

    <instance name="VectAddLeafInst">
        <level>0</level>
        <variant>Leaf</variant>
    </instance>

</taskmapping>

</mapping>

A Sequoia program expresses an abstract algorithm, while a Sequoia program plus a target-specific set of mapping directives represents a concrete implementation of the program. Multiple different target-specific mappings may be provided, enabling the same Sequoia program to be run on different machines. For example, a P4 cluster mapping specification would result in the Sequoia compiler emitting an implementation of the program for a P4 cluster.

Note that at no point did the programmer need to deal with low-level details such as DMAs in their program; they simply express the mapping of their abstract algorithm to a machine, and the Sequoia compiler and runtime handle all of the machine-specific mechanisms "under the hood".

Linking to an External Program

The mapped Sequoia vector add program constitutes an efficient implementation of the vector add algorithm on a specific target processor. The programmer can call into this Sequoia code from their external application, effectively using the Sequoia vector add program as a module or library in a larger application. The Sequoia API section details this calling convention.