The following links comprise Sequoia's documentation, in HTML format.
Sequoia is a portable, low-level language which is intended for programming modern machines, both parallel and superscalar, in which managing data allocation within and transfers throughout the machine's memory hierarchy is important for performance. Sequoia is syntactically an extension to C, though the language constructs that Sequoia introduces result in a programming model that is quite different from C.
The following sections provide an introduction to Sequoia.
The C Abstract Machine Model
The following diagram summarizes C's abstract machine model:
|
The model consists of:
The Sequoia Abstract Machine Model
Sequoia's abstract machine model is very different to C's abstract machine model, with the most fundamental difference being the fact that rather than a single global memory space, there are multiple independent memory spaces exposed to the program. Sequoia's abstract model is of a tree of memories, illustrated below:
|
The model consists of:
The basic idea behind the Sequoia machine model is that there are a set of high-performance processing elements which are backed by a hierarchy of memory levels. Programming such a machine requires transferring data from the large, slow outer memory levels into the small, fast local memory levels which the high-performance processing elements can access.
Each memory level has the following properties:
The following diagram illustrates an example machine model for a Cell processor node, with main memory backed by disk. There are 9 processors in the machine: eight SPEs, and one PPE.
|
As a second example, the following diagram illustrates a machine model for a dual-CPU workstation, in which the two CPUs have private L1 and L2 caches but share main memory.
|
In C, the abstract machine model consists of a single processor which can randomly access any memory location in the entire machine. Sequoia is a language that was designed to target machines in which there are multiple distinct memory spaces, with processors only being able to directly access a subset of them, and hence the Sequoia programming model makes explicit the fact that program logic can only access certain program variables and certain memory locations.
Tasks
The key Sequoia language construct that allows programmers to structure their programs to run on the machines that Sequoia targets is a task, a function that can only access its function arguments and locally delcared data. For example, the following task multiplies every element in an input array A by a floating point variable x, resulting in the output array Y.
void task<leaf> VectScale::Kernel(in float A[128], in float x, out float Y[128])
{
for (int i = 0; i < 128; i++)
{
Y[i] = x * A[i];
}
}
|
The body of the task, in this case the for loop and assignment statement, may only refer to the task's local variables, namely A, x, Y, and i.
Mapping Tasks to a Machine
A Sequoia program consists of a set of tasks, such as the example task above. All tasks are "abstract" in that they don't refer to any concrete locations in the machine; rather, they only refer to their local data and arguments.
When a Sequoia program is compiled for a particular machine, it is specialized in a mapping phase, in which the programmer specifies how each task will be executed. The mapping of a task to a target machine consists of two related items:
The separation of the logical algorithm, which is expressed as a set of tasks, from the machine-specific implementation details, results in Sequoia programs being portable across machines with very different architectures.
Communication is Vertical
All communication is "vertical" in the Sequoia abstract machine model, that is, between parent and child memory modules. Asynchronous bulk data transfers, such as DMAs and cache prefetches, will be used to implement these vertical data transfers if the target machine supports them.
The Sequoia programming model only has one method of specifying communication, namely that of task calling. If a task whose local data is mapped to a memory module in memory level i calls a subtask whose arguments are mapped to a module in memory level i-1, then the the arguments that the parent passes to the child task call will be physically copied between the corresponding memory modules. Similarly, data returned from the subtask call in level i-1 will be copied back to the parent (calling) task in level i when the subtask completes. The Sequoia compiler and runtime will use the appropriate hardware mechanisms to implement such copies, taking advantage of DMAs or whatever other data transfer methods are provided.
The Sequoia abstract machine model consists of "vertical" communication between parent and child memory modules. Some machines, however, support efficient "horizontal" communication between "sibling" memory modules (i.e.) memory modules in the same memory level with the same parent. The most obvious example of this is the inter-node communication in a cluster of workstations.
Sequoia's model encompasses such horizontal communication by introducing the concept of a virtual memory level in the machine model. A virtual level represents an aggregation of physical memory modules, and not a single module. The following diagram illustrates the Sequoia machine model for a cluster of 128 dual-CPU workstations with explicitly managed L1 and L2 caches:
|
As explained in the programming model introduction, a Sequoia task is mapped to a target machine, with all of its local data objects being assigned to a memory level in the target machine. The same applies with virtual levels. Logically, a piece of data can be placed within a memory module in a virtual level, just as if it were a "normal" memory module, though in actuality, the Sequoia compiler and runtime will distribute the data across the multiple physical memory modules that are aggregated by the virtual memory module. This process is described in the mapping section.
Data transfers are still vertical in the Sequoia programming model, between parent and child memory modules, and that is no different if a virtual memory module is used. However, a "logical" data transfer between a virtual parent module and one of its physical child modules will actually result in a "horizontal" data transfer between child modules, as the virtual module is merely a logical grouping of the child modules into a single shared namespace.
As a second example, consider the following machine model of a Cell node, in which a virtual memory level was added in between the LS and main memory levels. The Cell processor provides high-throughput inter-SPE communication channels which could support higher bandwidth between different LSes than between an LS and main memory. By using a virtual memory module to aggregate the eight 256KB LSes into a single 2MB virtual memory module, this efficient "horizontal" communication can be exploited.
|
The following pages comprise the Sequoia language reference.
Sequoia is derived from C, and it both limits and extends C syntax.
The following tables enumerate all ANSI C keywords and operators. Entries that are shaded gray are not supported in Sequoia. In general, the excluded features deal with pointers/references, or with memory allocation; Sequoia forbids the use of pointers, and decouples the physical location of data from its logical use, and thus keywords such as volatile don't have as much meaning in the context in which Sequoia operates.
Keywords
| auto | break | case | char | const | continue | default | do |
| double | else | enum | extern | float | for | goto | if |
| int | long | register | return | short | signed | sizeof | static |
| struct | switch | typedef | union | unsigned | void | volatile | while |
In addition, the C++ keyword inline is supported in Sequoia.
Unary Operators
| & | Referencing operator |
| * | Dereferencing operator |
| + - | Unary plus and minus operators |
| ++ -- | Increment and decrement operators |
| ! | Logical negation operator |
| ~ | Bitwise complement operator |
Binary and Ternary Operators
| + - | Binary plus and minus |
| * / % | Multiplicative operators |
| << >> | Bitwise shift operators |
| & ^ | && || | Logical and bitwise operators |
| = += (etc.) | Assignment operators |
| < > <= >= | Relational operators |
| == != | Equality operators |
| . | Structure-access operator |
| -> | Structure-access operator |
| ?: | Conditional operator |
| , | Comma operator |
A Sequoia (.sq) source file will be run through a C preprocessor before the Sequoia compiler sees it, and thus the full range of C comments, defines, macros, includes and other preprocessor features are supported. The C preprocessor is external to the Sequoia system; "g++ -E" is used in the Cell build infrastructure, for example.
Sequoia extends C by adding four categories of constructs.
Tasks
Tasks, similar to C functions, are used to express an independent unit of computation that occurs in an isolated (private) namespace.
General C functions are not supported in Sequoia, although inline C functions are. Such inline functions may only refer to the subset of C that Sequoia supports (and not any of the C extensions which it defines), and may only take scalar arguments.
Abstract Arrays and Array Blocks
Abstract arrays are logical collections of data that are decoupled from their physical locations and layouts, and array blocks are used to refer to sub-arrays in a manner reminiscient of Matlab matrices. These Sequoia abstract arrays and array blocks replace C's arrays, except in the case of arrays contained within a struct or union; these are still C-style arrays.
The built-in copy operator can be used to transfer data between Sequoia array blocks.
Iteration Constructs
Sequoia adds some simple iteration constructs to C's set of loop constructs (while, for, etc.). These iteration constructs are conceptually like the map operator of functional languages, in which a task is mapped over a collection of data elements (which are array blocks, in Sequoia). There is a parallel iteration construct (mappar), a sequential iteration construct (mapseq), and a construct which expresses a parallel reduction (mapreduce).
Parameters
Sequoia parameters are variables whose values are set at compile time, rather than in the Sequoia program code.
Sequoia extends the set of C keywords with the keywords in the following table.
| copy | in | inout | mappar | mapreduce |
| mapseq | out | reducearg | task | tunable |
The copy keyword is used to copy data betwen array blocks, in, out, inout, and task are used in task definitions, mappar, mapreduce, mapseq, and reducearg are part of Sequoia's iteration constructs, and tunable is used to declare Sequoia parameters.
Sequoia has two main categories of data types: scalars, and arrays.
Scalar data objects can be of any of the non-array C types (int, float, struct, etc.). Conceptually, a scalar object is a single, indivisible piece of data that resides in a single location in the machine, spanning a set of contiguous memory addresses. Scalar objects are often referred to as records in Sequoia's nomenclature.
The intent is that scalar data objects in Sequoia are "small" and are used mainly for the purposes of program control (counters, iteration variables, etc.). "Large" data objects should be expressed as Sequoia arrays.
In C, arrays and pointers are closely coupled, and an array variable refers to a specific memory location in the machine. In Sequoia, on the other hand, physical memory locations have been abstracted, and pointers of any sort are not used, and so arrays in Sequoia are more "logical" than "physical". On some machines, an array may be physically distributed over multiple disjoint memory modules or nodes, though in the programming model an array is abstracted as being contained within a single namespace.
Array Declarations
Sequoia natively supports both one- and multi-dimensional arrays using C's "[]" notation. Arrays can be of any scalar type. A Sequoia array is declared in a similar way to a C array, although in Sequoia, arrays must have a value specifying the size of every dimension of the array.
float A[20][10]; |
An array declaration's size value may be an arithmetic expression over constant literals and parameters (see the Parameters section for details), and may have a dynamic size. For example:
double B[100+S][2*T]; |
Accessing Arrays
A Sequoia array element is accessed using the same syntax as for C arrays, and can be used as an rvalue in any C expression, for example:
x += A[i][j];
if (A[2*m][B[n]] == 2) {
...
}
...
|
However, an array element can not, in general be used as an lvalue; this is described in the section on "task types". Put another way, array elements may be freely read, but there are constraints on the manner in which they can be written to.
Note that while a program may refer to array elements in the usual manner (e.g. A[i][j]), unlike with C there is no way to refer to the physical location of an array element in the machine. The array is an abstract collection of scalar records that has been decoupled from physical factors such as where in the machine the array elements reside, whether the array is distributed over multiple machines, and the array's layout in memory (row- or column-major, etc.). (For consistency, all diagrams in this documentation visually present arrays as being row-major, but their physical layout on a target machine need not be).
In addition to single-element indexing, Sequoia adds syntax for referring to collections of array elements that can be moved through the machine together in single bulk data transfers; see the Array Blocks section of the documentation for details.
Sequoia abstract arrays and array blocks, with their decoupling of element locations and data layout from logical structure, completely replace C's arrays in the language, except in one situation: arrays contained within a struct or union are still C-style arrays. Further, a struct or union containing an array is still considered a scalar object.
The reason for this distinction is that the physical layout of an array that is packed within a struct is exposed to the user, while Sequoia's arrays abstract these physical details. Moreover, Sequoia's arrays are intended to be decomposed into sub-arrays, while structures and other scalar objects are treated as indivisible data units.
However, despite being native C arrays, these contained arrays can only be accessed using C's "[]" array indexing and not via pointer notation. In addition, Sequoia's array block notation can't be used on such arrays, nor can they be passed as arguments to task calls or the copy operator.
The following pages describe Sequoia's array blocks.
A Sequoia array is an abstract collection of scalar records, and an array block refers to a subset of the elements in an array. An array block can be considered a "window" or "view" into the base array from which it is derived; it is not a separate copy of the array elements.
Range Array Blocks
Consider the following example; the array block notation used is A[start:end;].
|
The same example, but with a non-unit stride array block, is as follows. Note that the yellow array block only refers to 4 elements of A. The notation used is A[start:end:stride;].
|
The first of the two kinds of array blocks, range array blocks, is depicted above in the two examples. In each dimension of the array, a range of elements is specified, in the form [start:end:stride;max]. The stride field is omitted from the first example, and the max field is omitted from both. (The default values used in the case that certain fields are omitted are described in the Array Block Default Values section).
The four fields of range array blocks are defined as follows.
An N-dimensional array can be used as the source of a N-dimensional range array block; that is, the number of dimensions must match.
Indexed Array Blocks
The second kind of array block is an indexed array block, in which a second "index" array is used to specify which elements of the base array are referred to. Unlike range array blocks, indexed array blocks can only be formed from 1-dimensional arrays, and the index array must also be 1-dimensional. Consider the following example; the notation is BaseArray[IdxArray].
|
Note that as with range array blocks, the indexed array block is a "view" or "window" into its base array; the yellow sub-array in the diagram was just drawn to illustrate that the order of the elements in the indexed array block is not necessarily the same as their order in the base array from which they are selected.
As a second example, consider the following, in which a range array block of the index array is used to specify an indexed array block of array A. Note that indexed array blocks may not be nested; that is, Array1[Array2[Array3]] isn't a valid Sequoia array block. (This is evident in the array block grammar).
|
Clearly, the size of the resultant indexed array block is identical to the size of array block of the index array.
Array blocks can only be used in two ways: as arguments to tasks, or as arguments to the built-in copy operator. In each of these cases, the array block refers to a region of some base array, but the actual transfer of data to or from that region only occurs when the array block is passed as an argument. See these two sections for details, in particular the section on the copy operator, as it contains several array blocks examples.
Sequoia is intended as a language for programming machines with explicitly managed memory hierarchies. That is, there will often be some form of local memory, such as the LS in Cell, that the program is managing, and data must be explicitly allocated in this memory. Given that such local memories are small and of fixed size, it is important for the Sequoia program to reason about the precise amount of space that a collection of data will consume; there are no efficient "spilling" or replacement mechanisms such as a cache provides in conventional architectures to enable a large working set to be staged in a small local memory.
To enable the Sequoia compiler to reason about space, range array blocks have a max field associated with them. The start, end, and stride fields are used to refer to a range of elements, and the max field specifies the largest possible number of elements that the corresponding range could refer to. This becomes important due to the fact that the other three fields may be arbitrary expressions whose values aren't known until runtime. The max field allows the programmer to express to the compiler a program invariant: that the maximum amount of space needed to stage the array block is bounded by some value.
Consider this range array block, in which i and j are dynamic variables:
// start end max A[ (i+j)*32 : j*64 ; 128 ] |
Just by looking at the start and end fields, there is no way to statically determine how large this array block may be, which is a problem if the program instructs that the array block should be copied into a fixed-size local memory. The max field specifies that no matter what the values of the variables i and j are, the array block will contain at most 128 elements of A, enabling the compiler to allocate space for it in the local memory.
At runtime, when the actual values of start and end are known, a check will be perfomed to ensure that the actual size of the array block is indeed at most 128 elements.
Static vs. Dynamic Array Blocks
While start, end, and stride may be arbitrary C expressions, max may only be an arithmetic expression over constant literals and parameters. The description of these parameters is left to the Parameters section, but for the purposes of the description of array blocks in this section, there are two types of parameters: tunable and array size. Tunable parameters are compile-time constants, while array size parameters may or may may not be.
Ideally, the max expression will evaluate to a compile-time constant, and if it only uses tunable parameters, then this will always be the case. An array block whose maximum size in each dimension is statically known is referred to as a static array block.
If, however, the max expression refers to any array size parameters, then it is possible that its value may not be statically determinable, and in this case, max will default to the full size of the base array that the array block is over; the same default applies if the max field is omitted. Further, it is possible that the base array itself is dynamically sized, and thus no static bound will be known for the size of the array block; in this case, the array block is referred to as being dynamic.
Dynamic array blocks are not illegal in Sequoia, however only static array blocks may be transferred to software-managed memories such as the LS in Cell (assuming that they will fit); dynamic array blocks may only be transferred between memory modules with an effectively "infinte" size, such as global memory that is paged to disk.
The following grammar defines the syntax for specifying array blocks.
array-block <- range-array-block-Nd
<- indexed-array-block
range-array-block-1d <- array-ident dim-range0
<- array-ident
range-array-block-Nd <- array-ident dim-range0 { dim-range1 ... }
<- array-ident
indexed-array-block <- array-ident "[" range-array-block-1d "]"
dim-range <- "[" start-expr ":" end-expr ":" stride-expr ";" max-expr "]"
<- "[" start-expr ":" end-expr ";" max-expr "]"
<- "[" start-expr ";" max-expr "]"
max-expr <- const-param-expr
<-
start-expr <- expr
end-expr <- expr
stride-expr <- expr
|
The three base terms in the grammar are:
There are six different range array block syntax styles, each allowing different fields to be omitted; these are described by the array block grammar. Note that in a multi-dimensional range array block, different styles may be used in different dimensions, for example A[5:8;][i:j:2;10]. The six styles are listed below, with the latter five essentially being short-hand forms of the first style.
Consider an array with Ni elements in its ith dimension. The endi, stridei, and maxi fields may all be omitted, depending on which short-hand notation is used, and if they are, then they will default to certain values in the following manner.
Note that Ni may be an unbounded value, if the base array is dynamically sized, and maxi and endi will both default to an unbounded value in this case, resulting in a dynamic array block.
The utility of omitting the end field is that it allows the expression that an array block covers at most a certain range; if the end field is present, then it specifies that the block exactly covers a precise range. This is convenient when handling boundary cases, such as tiling an array into array blocks when the array block's size doesn't evenly divide the base array's size in some dimension, leaving an array block at the "edge" of the base array which is smaller than the rest of the array blocks. The following diagram shows some example array blocks which use this technique:
|
Trivial Array Blocks
An entire array may be used as an array block, un-annotated by the "[::;]" range syntax. The above defaults for omitted fields apply, in addition to the default that starti <- 0 in each dimension.
Consider an array with Ni elements in its ith dimension. The valid values for the range array block fields starti, endi, stridei and maxi are as follows:
Further, the following relationships must hold, else the array block isn't valid:
An array block can actually refer to zero elements of its base array, if any of the following are true:
The built-in copy operator copies data between two array blocks. Its syntax is as follows:
copy( dest-array-block, src-array-block ); |
The constraints on the two arguments to this operator are:
Recall that array blocks only describe a "view" or "window" into their base array. The copy operator causes data to be transferred from the base array of the source array block to the base array of the destination array block, between corresponding array block elements.
Examples
A copy between two rectangular regions:
|
A similar copy between two rectangular regions, in which the source array block is the entire array B:
|
A copy that scatters the data into A, spacing it out with a stride of 2 in each dimension:
|
A copy that gathers elements from array B and places them in a contiguous region of A.
|
A copy that gathers elements from array B and scatters them into A.
|
Sequoia programs are structured as collections of tasks, where a task is a function which can only access its local, private data.
Tasks are void functions that can only access their arguments and their locally declared data. In the Sequoia language, tasks replace C-style functions (although external functions written in other languages, such as C/C++, can be called from Sequoia code; see the Sequoia API section for details).
A simple example task is as follows, with Sequoia keywords in bold:
void task<leaf> VectScale::Kernel(in float A[128], in float x, out float Y[128])
{
for (int i = 0; i < 128; i++) {
Y[i] = x * A[i];
}
}
|
This task takes as inputs an array A and scalar value x, and outputs the array Y which results from multiplying each element of A by x. The only data objects which the body of this task may access are its task arguments (A, x, and Y) and its local variables (i). (The meaning of the task<leaf> keyword is described in the Task Types section, and the naming convention for tasks is described in the Task Variants section).
Sequoia adds three keywords for specifying whether a task's argument is an input (in), an output (out), or both an input and an output (inout). Exactly one of these keywords must be the first term in the type expression for each of the task's arguments. The above example used in and out, and the following modified example uses inout, and accumulates the vector-scalar product into the output array.
void task<leaf> VectScaleAccum::Kernel(in float A[128], in float x, inout float Y[128])
{
for (int i = 0; i < 128; i++) {
Y[i] += x * A[i];
}
}
|
An in argument will have been initialized before the task is called, an out argument will be uninitialized at the start of the task's execution and will be copied back to the task's caller when it completes, and an inout argument is both initialized before the task begins and copied back to the task's caller when it returns.
A constraint on the use of in arguments is that they can only be read from; they cannot be written to. This applies to both "single-element" accesses, such as "x = A[i];", and also to accessing the array via array blocks passed to tasks or to the copy operator. There is no constraint on inout or out arguments, which can be both read from and written to.
An important property of Sequoia's tasks is that they execute in "isolation" of the rest of the program. Put another way, the data that a task can access (its arguments and local variables) are private to the task during its entire execution, and there is no way for concurrent tasks to communicate with each other. This is explained more fully in the Calling Semantics section.
There are three types of task in Sequoia: inner, leaf, and external tasks. (The reason for these names will be made clear in the Task Variant Call Graph section.) These types are distinguished by their intent as well as their syntax.
First, recall the abstract machine model that Sequoia assumes. A characteristic of machines which fit this model is that there are a number of efficient processing elements (PEs) that operate out of small, local memories, which are backed by one or more levels of larger/slower memories, in a memory hierarchy. For example, in Cell, the SPEs only access data in the LSes, which stage data transferred into them from main memory.
A leaf task is intended to execute directly on the machine's efficient PEs (the SPEs in Cell) and a leaf task's dataset must completely fit into the local memory. Leaf tasks perform the "actual computation" of a program, and operate on individual array elements.
Inner tasks, on the other hand, are intended to operate on large datasets, simply decomposing them such that they can be passed to subtasks. Inner tasks express the decomposition of an algorithm, and manipulate array blocks rather than individual array elements. On Cell, an inner task could execute in the PPE, and its two main roles would be to transfer blocks of data between the LS and memory, and to invoke leaf tasks in the SPEs that will execute out of the data in the LS.
An external task is any function written outside of Sequoia that will be called from within Sequoia code. The external implementation must conform to the Sequoia API.
The type of a task is specified in its definition; note that external tasks don't have bodies.
void task<inner> TaskName::VariantName( ... ) { ... }
void task<leaf> TaskName::VariantName( ... ) { ... }
void task<ext> TaskName::VariantName( ... );
|
The following table enumerates the capabilities and limitations of inner and leaf tasks. As a quick summary of this table, leaf tasks (mostly) only use Sequoia's supported subset of C, while inner tasks use Sequoia's extensions to C, in addition to this C subset. One important difference is that leaf tasks may read and write their array variables directly using single-element indexing, while inner tasks may only read them in this manner.
| Inner tasks | Leaf tasks | |
| Sequoia's supported C syntax | ||
| Use all of the C keywords and operators that Sequoia supports, including the use of C control flow (for, while, etc.) | X | X |
| Call C inline functions (passing only scalar variables to them) | X | X |
| Call functions provided externally to Sequoia, using the Sequoia API | X | X |
| Freely read and write their scalar arguments and local scalar variables, including C-style arrays that are contained within scalar structs | X | X |
| Read from their array arguments and local array variables using single-element accessing (e.g.) x += A[i]; | X | X |
| Write to their array arguments and local array variables using single-element accessing (e.g.) A[i] = 2 * x; | X | |
| Sequoia's extensions to C | ||
| Use array blocks | X | |
| Call Sequoia tasks (inner, leaf, or external) | X | |
| Use the copy operator | X | |
| Use Sequoia's iteration constructs | X | |
| Use parameters (both tunable and array size parameters) | X | X |
Sequoia allows multiple algorithmic variants of tasks to be defined. Task variants are different implementations of the same task, and variants of a task all have the same function type signatures.
The variant name is just an arbitrary label that follows the "::" symbols in the full task identifier; the programmer is free to choose any variant name, and isn't limited to those used in the examples in this documentation ("Leaf", "Inner", etc.). No two variants of the same task may have the same variant name.
Note that different variants need not have the same symbolic names for their arguments, nor the same symbolic names for their array size parameters; just as with C function signatures, Sequoia task signatures only relate to the type of the arguments, not their symbolic names.
The utility of having multiple variants relates to the way that tasks are called in Sequoia. A task callsite only refers to the base name of the task and not to any specific task variant, allowing the target-specific process of mapping the Sequoia program to select which variant to call, based on which would be a more appropriate implementation for the target machine.
All tasks must have a prototype, similar to a C function prototype. There is a single prototype per task, not per task variant, and the prototype has no variant specifier or task type specifier. For example:
// Prototype
void task MyTask(in float A[P][Q], inout int x);
// Inner task variant definition #1
void task<inner> MyTask::InnerVariant1(in float A[P][Q], inout int x)
{
...
}
// Inner task variant definition #2
void task<inner> MyTask::InnerVariant2(in float A[P][Q], inout int x)
{
...
}
// Leaf task variant definition
void task<leaf> MyTask::LeafVariant(in float A[P][Q], inout int x)
{
...
}
// External task variant definition; has no body
void task<ext> MyTask::ExtVariant(in float A[P][Q], inout int x);
|
All task variant definitions are required to have an identical signature to the task's prototype.
A property of inner tasks, but not leaf tasks, is that they can call other tasks, as in the example below, in which a 128x128 matrix A is scaled by variable x, resulting in the matrix Y. The MatrixScale128::Inner task calls the MatrixScale64::Leaf task 4 times, once for each of the top-left, top-right, bottom-left, and bottom-right regions of its argument arrays A and Y, passing the corresponding 64x64 array blocks to the subtask.
void task<inner> MatrixScale128::Inner(in float A[128][128], in float x, out float Y[128][128])
{
for (unsigned int i = 0; i < 2; i++) {
for (unsigned int j = 0; j < 2; j++) {
// Task call:
MatrixScale64( A[i*64;64][j*64;64], x, Y[i*64;64][j*64;64] );
}
}
}
void task<leaf> MatrixScale64::Leaf(in float A[64][64], in float x, out float Y[64][64])
{
for (unsigned int i = 0; i < 64; i++) {
for (unsigned int j = 0; j < 64; j++) {
Y[i][j] = A[i][j] * x;
}
}
}
|
Task calling syntax resembles C-style function calling syntax, though the only arguments that may be passed to a task call are array blocks and scalars.
Callee Variant not Specified
Note that a task callsite only specifies the "base" name of the task being called, and not the specific task variant. The choice of which variants to call at a task's callsites is made during the process of mapping a Sequoia program to a target machine; see the task instances section for more details.
Copy-In, Copy-Out
The task calling semantics are call-by-value-result (CBVR), also referred to as "copy in, copy out". The data that the calling (parent) task passes as arguments to the call are copied into the child (callee) task's namespace. The copies that take place between the parent and child task namespaces in the above example are illustrated in the following diagram:
|
If the child (callee) task's argument is in or inout, then a copy from the parent's namespace to the child's namespace occurs before the child task begins executing, and if the child task's argument is inout or out, then a copy from the child's namespace back to the parent task's namespace occurs after the child completes.
Recall that array blocks are "views" or "windows" into their base arrays. Passing an array block as an argument to a subtask call implies that copies will occur between the base array of the array block and the subtask's namespace. Hence, due to the constraint that a task may not modify its in arguments, such arguments may not be passed to subtask calls if the subtask's corresponding argument type is inout or out.
Array Block Aliasing
Just as the copy operator doesn't permit its source and destination array blocks to overlap if they are derived from the same base array, array blocks passed as task arguments are similarly prohibited from overlapping if passed to an inout or out argument of the callee subtask. That is, input-output and output-output aliasing of task arguments is not permitted, though it is the responsibilty of the programmer to ensure that this doesn't occur. The section on undefined behavior has more details.
Input-input aliasing is permitted, however; overlapping array blocks (of the same base array) may be passed to different subtask arguments, provided that they are all in.
In the case of scalar variables, the compiler will indicate an error if the same variable is passed to more than one subtask argument, unless all of the subtask arguments that the scalar variable is passed to are in.
Concurrent Execution
Child tasks do not run concurrently with their parent tasks. Rather, when the parent task makes a task call, it will wait until the child task completes before continuing. The only way for tasks to run concurrently in Sequoia is via its parallel iteration constructs mappar and mapreduce, and in these cases, such tasks are "sibling" tasks, rather than "parent" and "child".
Semantic Copies != Physical Copies
An important point to understand is that the copies of data between task namespaces that are implied by the calling semantics of tasks do not necessarily correspond to physical copies within the machine; this is covered in the section on mapping a Sequoia program to a task machine. In general, copies only physically occur when absolutely required, such as when the calling task passes an array that is located in one memory module to a callee subtask whose argument data lives in a different memory module.
Relationship to the Built-in copy Operator
Passing array blocks to subtask calls and using them as arguments to the copy operator are the only two ways that array blocks can be used in Sequoia, and in fact, these two usages have the same semantics.
Consider the following example of using the copy operator to copy a 1D array block of the source array A to a 1D array block of destination array B, where each array is an array of float scalars.
copy(B[IdxB], A[0:100:2;50]); |
This could have been implemented by the programmer by writing their own MyCopy task, such as:
void task<leaf> MyCopy::Leaf(out float Dst[N], in float Src[N])
{
for (unsigned int i = 0; i < N; i++) {
Dst[i] = Src[i];
}
}
|
and then calling the task as follows, rather than the above call to the copy operator:
MyCopy(B[IdxB], A[0:100:2;50]); |
(The value N in the MyCopy::Leaf task example is an array size parameter; these are described in the Array Size Parameters section).
Sequoia tasks have two different types of parameters, tunable parameters and array size parameters.
Tunable parameters are locally declared variables which have the tunable type modifier. For example:
void task<inner> VectScale::Inner(in float A[128], in float x, out float Y[128])
{
tunable BLKSIZE;
unsigned int numBlks = 128/BLKSIZE;
for (unsigned int i = 0; i < numBlks; i++) {
VectScale( A[i*BLKSIZE;BLKSIZE], x, Y[i*BLKSIZE;BLKSIZE] );
}
}
|
These "tunable" variables are not set to any value in the Sequoia source program. Rather, they are set at compile time, during the process of mapping the Sequoia program to the target machine. Tunable parameters can be used in the same manner as any other variables, except for the fact that they can not be written to, similarly to the way that const variables are treated. That is, tunable variables can be C rvalues, but not lvalues.
Tunable parameters are non-negative integers whose range is effectively unbounded. Conceptually, tunable parameters are very similar to preprocessor constants, with the main difference being that the compiler understands that they are integers and is able to perform certain optimizations with them, while preprocessor constants are just textually replaced and are "opaque" to the compiler.
The most common usage of tunable parameters in Sequoia is to set the size of array blocks in inner tasks, such as in the above example.
Consider the following example task:
void task<leaf> VectScale::Leaf(in float A[N], in float x, out float Y[N])
{
for (int i = 0; i < N; i++) {
Y[i] = x * A[i];
}
}
|
The variable N is an array size parameter. Just as with tunable parameters, array size parameters are not set to any values in the Sequoia source code, and also similarly to tunable parameters, array size parameters can be rvalues in any C expressions, but not lvalues. However, while tunable parameters are set at compile time, array size parameters are set to the actual size of the array argument that is passed to the task call, at runtime.
As with tunables, array size parameters are non-negative integers with an effectively unbounded range.
As an example, suppose the above task was called with a 1024-element array A passed as its first argument; in this case, the value of N would become 1024 for the lifetime of that task execution. Later in the same program run, the same task could be called again, perhaps with a 2048-element array A, resulting in N becoming 2048 for the duration of the second task execution.
One point to note about the above task is that both the A and Y arrays have the same array size parameter, N. This expresses a constraint that when this task is called, the array passed as its A argument and the array passed to its Y argument must be of the same size. In general, if the same array size parameter is used more than once in a task definition, then the value it becomes at runtime when the task is called must be consistent across all of its uses.
Array Size Expressions
Each dimension of a task's argument array can actually contain an expression over array size parameters and constants, such as in the following 1D convolution example:
void task<leaf> Conv1d::Leaf(in float A[N+U-1], in float H[U], out float C[N])
{
for (unsigned int n = 0; n < N; n++) {
C[n] = 0;
for (unsigned int u = 0; u < U; u++) {
C[n] += H[u] * A[n+u];
}
}
}
|
At runtime in this example, the U and N values will be determined by the sizes of H and C, respectively, and these values will then be used to determine the value of the expression N+U-1, which will be compared against the actual size of A at runtime, yielding an error if they don't match. This syntax permits constraints between the sizes of different arrays to be expressed, in this case that len(A) = len(C) + len(H) - 1.
A rule which must be followed is that any array size parameter which is a part of an array size expression must also exist as a "unit" expression for some other array dimension's size. In the above example, both the U and N array size parameters express the size of some array's dimension, and thus they may be used in an expression that describes the size of some other array dimension.
Array size expressions must be of the form:
c0 * p0 + c1 * p1 + ... + cn * pn + c
where:
As a final example, the following is a task that implements a 2D convolution and uses array size parameter expressions in multiple dimensions.
void task<leaf> Conv2d::Leaf(in float A[M+U-1][N+V-1], in float H[U][V], out float C[M][N])
{
for (unsigned int m = 0; m < M; m++) {
for (unsigned int n = 0; n < N; n++) {
C[m][n] = 0;
for (unsigned int u = 0; u < U; u++) {
for (unsigned int v = 0; v < V; v++) {
C[m][n] += H[u][v] * A[m+u][n+v];
}
}
}
}
}
|
Sequoia task variants are parameterized in several ways.
All of these program parameters are left unspecified in the Sequoia source code and are set by the programmer at compile time during the process of mapping the Sequoia program to a specific target machine.
The process of filling in all of a task variant's parameters is called instantiating the task variant, and a task variant which has all of its parameters filled in is called a task instance. A task instance is a task variant that has been specialized for a specific target machine.
The same task variant can be instantiated multiple times to become multiple different task instances.
See the Sequoia Mapping Reference for details of how Sequoia's abstract tasks are instantiated for a target machine.
A Sequoia program consists of a set of abstract, parameterized task variants. The task variant call graph is a figure in which the parameterized variants are nodes and edges represent possible caller-callee relationships. Note that each subtask callsite in a task's body just refers to the "base" task name, and not to the specific variant which is being called, and hence there may be a number of task variants that may be called from any given callsite.
Consider the following example Sequoia program (which implements a 1D FFT using a recursive, divide-and-conquer algorithm). The leaf task bodies aren't shown.
typedef struct { float re; float im; } cmplx;
void task<inner> FFT::InnerVariant(inout cmplx A[N], in cmplx W[N-1])
{
Butterfly(A[0:N/2;], A[N/2:N;], W[0:N/2;]);
for (int i = 0; i < 2; i++) {
FFT(A[i*N/2;N/2], W[N/2:N-1;]);
}
}
void task<leaf> FFT::LeafVariant(inout cmplx A[N], in cmplx W[N-1])
{
// Leaf task body not shown
}
void task<inner> Butterfly::InnerVariant(inout cmplx A[N], inout cmplx B[N], in cmplx W[N])
{
tunable T;
for (int i = 0; i < N/T; i++) {
Butterfly(A[i*T;T], B[i*T;T], W[i*T;T]);
}
}
void task<leaf> Butterfly::LeafVariant(inout cmplx A[N], inout cmplx B[N], in cmplx W[N])
{
// Leaf task body not shown
}
|
This is represented as the following task variant call graph.
|
The "leaf" and "inner" task terminology comes from this notion of a task call graph:
The following sections describe Sequoia's iteration constructs.
Sequoia supports all of the C iteration constructs (while, for, and do). Although some optimizing compilers are able to analyze such loops in simple cases and parallelize them, in general, these are all sequential, and possibly data-dependent iteration constructs. Consider the following example inner task, which blocks its argument arrays into length-T subarrays that are passed to subtask calls.
void task<inner> VectScale::Inner(in float A[N], in float x, out float Y[N])
{
tunable T;
unsigned int numBlks = (128+T-1)/T;
for (unsigned int i = 0; i < numBlks; i++) {
VectScale( A[i*T;T], x, Y[i*T;T] );
}
}
|
Given that for is a sequential loop, the semantics here are that the VectScale subtask will be called on array blocks A[0;T] and Y[0;T] in iteration i=0, and when this subtask returns, VectScale will again be called, this time on array blocks A[T;T] and Y[T;T] in iteration i=1, and so on.
Clearly, all of the VectScale subtask calls made in the above program are independent, since (1) they operate on non-overlapping ranges of arrays A and Y, and (2) it is known that the subtask calls are side-effect-free as a consequence of the fact that all tasks run in isolation on private copies of data, and thus the VectScale subtask calls should be able to run in parallel, if there are parallel exectuion resources available in the hardware.
To make it explicit when subtasks can be executed in parallel, and to allow the programmer to express their iteration spaces using compiler-analyzable loops in general, Sequoia adds three iteration constructs to the set that C provides: mappar, mapseq, and mapreduce. These are conceptually similar to the map and reduce operators of functional languages: they express a task call being applied to every member of a collection of array blocks.
The same example task as in the introduction section, but using the Sequoia mappar construct ("map in parallel") instead of a C for loop, is as follows:
void task<inner> VectScale::Inner(in float A[N], in float x, out float Y[N])
{
tunable T;
unsigned int numBlks = (128+T-1)/T;
mappar (unsigned int i = 0 : numBlks) {
VectScale( A[i*T;T], x, Y[i*T;T] );
}
}
|
This syntax specifies that each VectScale subtask call can occur in parallel, or in any arbitrary sequential order. Each subtask call has the same "copy in, copy out" task calling semantics as if it were called directly by its parent task (rather than inside a mappar), with the copies between the parent task's and each child's namespace implied by these semantics all happening in parallel.
In the example above, the VectScale task instance correponding to i=0 will copy the array block A[0;T] and the scalar x to its local namespace, compute their product, and copy the result back into the array block Y[0;T], while potentially simultaneously, the i=1 VectScale task instance will copy A[T;T] and x to its local namespace, compute their product, and copy the result back into Y[T;T], and so on for the rest of the mappar iterations.
Syntax
The mappar iteration range specifies how many parallel task calls to make. Its syntax is as below; this is explained in detail in the grammar section.
mappar(type loopvar = start : end) { taskcall(); }
|
There is an implicit "loopvar++" in the construct, and start must be less than or equal to end for there to exist any iterations to be executed (although it's not an error if start > end). As with the element ranges in the array block notation, the iteration range is non-inclusive of the end value. For example, if the range was (int i = 5 : 9), then the loop variable i would take on the values 5,6,7,8.
Syntactic Rules
Some syntactic rules for mappar are as follows:
Unlike scalar subtask arguments, array block subtask arguments may be in, inout, or out, though the programmer has the responsibility to ensure that there is no input-output or output-output aliasing of array blocks, either within an iteration (which is the same constraint that applies to any task call), or across iterations; see the section on undefined behavior for details.
The mapseq (map sequential) iteration construct is syntactically the same as mappar, with the difference being that its iterations are executed serially, and not in parallel. It is essentially a more constrained version of a C for loop.
The syntactic rules for mapseq are the same as mappar except for the rule dealing with scalar arguments of the called subtask; in mappar iteration spaces, subtasks may only have in arguments, while in mapseq iteration spaces, subtasks may also have inout and out arguments. Since the subtask calls occur sequentially, with each iteration's call completing before the next iteration's call begins, the resultant values in scalar variables passed to such subtask arguments are well defined.
Array blocks that are passed as arguments to a subtask inside a mapseq construct may freely overlap across iterations, but overlapping array blocks that are passed to the same iteration's subtask call will still result in undefined behavior if there is input-output or output-output aliasing, just as with any task call; see the section on undefined behavior for details.
The main utility of a mapseq iteration construct over a C for loop is its ability to be used in a Sequoia nested iteration space.
Background
Sequoia's third iteration construct, mapreduce, is inspired by the reduce operator of functional languages, in which a collection of data is "collapsed" or "reduced" to a single data item. An example of a reduction is summing the numbers of an array: the input is a sequence of numbers, and the output is a single number, into which every number in the array was summed.
A reduction has two components: a collection of data items, and a combiner function which take two inputs, a data element from that collection and a value representing a partial aggregation of data elements, and produces as a single output an updated partial aggregation. In the example of summing the numbers of an array, the combiner function takes a single array element and a running tally as inputs, adds the array element to the running tally and returns it as an updated running tally. Once all the array elements have been processed, the running tally value will be the full sum of the array's elements.
An important property of reductions is that they can be done in parallel, since the order that elements are aggregated isn't important. This assumes that the combiner function is both associative and commutative.
Sequoia Reductions
In Sequoia, the mapreduce construct allows the epxression of such reductions, though in a slightly different way. A mapreduce statement is syntactically similar to a mappar statement, and it runs in two phases. First, the mapreduce iterations are run in parallel, just as with a mappar, and second, once all iterations have completed, a combining phase occurs, in which a combining function is applied to all parallel results of the first phase, collapsing them into single elements. Only after the combining phase is complete will control return to the calling task.
Combiner functions may be either inner tasks or leaf tasks; see the section on mapping the Sequoia program to the target machine for details.
As an example, consider the histogram program below. This program takes as input an array of numbers in the range 0:B, and counts (or "bins") how many occurrences of each number there are in the array, returning these counts in the bins array.
void task<inner> Histo::Inner(in int data[N], inout int bins[B])
{
tunable T;
unsigned int numBlks = (N+T-1)/T;
mapreduce (int i = 0 : numBlks) {
Histo(data[i*T;T], reducearg<bins, HistoCombiner>);
}
}
void task<leaf> Histo::Leaf(in int data[N], inout int bins[B])
{
for (int i = 0; i < N; i++) {
bins[data[i]]++;
}
}
void task<leaf> HistoCombiner::Leaf(in int bins1[B], inout int bins2[B])
{
for (int i = 0; i < B; i++) {
bins2[i] += bins1[i];
}
}
|
To get more concrete, suppose that this program was run on a machine with 16 parallel PEs, and there were numBlks=1024 iterations to the mapreduce construct in the Histo::Inner task. The program behavior would be as follows.
Reducing Multiple Arguments
More than one argument can be reduced in a single mapreduce construct, such as in the following example, which both histograms numbers, as above, and also sums them all up.
void task<inner> HistoAndSum::Inner(in int data[N], inout int bins[B], inout int sum)
{
tunable T;
unsigned int numBlks = (N+T-1)/T;
mapreduce (int i = 0 : numBlks) {
HistoAndSum(data[i*T;T], reducearg<bins, HistoCombiner>, reducearg<sum, SumCombiner>);
}
}
void task<leaf> HistoAndSum::Leaf(in int data[N], inout int bins[B], inout int sum)
{
for (int i = 0; i < N; i++) {
bins[data[i]]++;
sum += data[i];
}
}
void task<leaf> HistoCombiner::Leaf(in int bins1[B], inout int bins2[B])
{
for (int i = 0; i < B; i++) {
bins2[i] += bins1[i];
}
}
void task<leaf> SumCombiner::Leaf(in int sum1, inout int sum2)
{
sum2 += sum1;
}
|
Note that each reduction argument of the task call is combined independently, using its own combiner function.
Reduction Arguments
Notice that the arguments being reduced in the task called inside the mapreduce construct are inout. This is required for reduction arguments. Further, the combiner task must have a signature like that in the examples above: it must have two arguments of the same type as the reduction argument, the first being in and the second being inout.
If the mapreduce is parallelized over multiple PEs, then the local copy of the reduction variable (that is being aggregated into) will be initialized to the value passed in to the inout argument. In the above example, this would imply that the bins array and the sum scalar should both be zeroed in the code that makes the initial call to HistoAndSum::Inner.
Syntactic Rules
The syntactic rules for mapreduce are as follows; these are the same as for mappar, except for the way in which reduction arguments are handled.
Unlike scalar subtask arguments, array block subtask arguments may be in, inout, or out, though the programmer has the responsibility to ensure that there is no input-output or output-output aliasing of array blocks, either within an iteration (which is the same constraint that applies to any task call), or across iterations; see the section on undefined behavior for details.
Sequoia's three iteration constructs may be nested around a single task call, as in the following example program:
void task<inner> MatrixMult::Inner(in float A[M][P], in float B[P][N], inout float C[M][N])
{
tunable Mblk, Pblk, Nblk;
mappar (int i = 0 : M/Mblk) {
mappar (int j = 0 : N/Nblk) {
mapreduce (int k=0 : P/Pblk) {
MatrixMult(A[i*Mblk; Mblk][k*Pblk; Pblk],
B[k*Pblk; Pblk][j*Nblk; Nblk],
reducearg |
In addition, as a shorthand, multiple loops may be defined within a single iteration statement, such as the following (equivalent) version of the above example:
void task<inner> MatrixMult::Inner(in float A[M][P], in float B[P][N], inout float C[M][N])
{
tunable Mblk, Pblk, Nblk;
mappar (int i = 0 : M/Mblk,
int j = 0 : N/Nblk) {
mapreduce (int k=0 : P/Pblk) {
MatrixMult(A[i*Mblk; Mblk][k*Pblk; Pblk],
B[k*Pblk; Pblk][j*Nblk; Nblk],
reducearg |
The rules for such nests of iteration constructs are as follows:
The following grammar details the syntax of the three Sequoia iteration constructs.
iter-stmt -> mappar-stmt
-> mapseq-stmt
-> mapreduce-stmt
mappar-stmt -> "mappar" "(" iter-range { "," iter-range } ")" "{" body-stmt "}"
mapseq-stmt -> "mapseq" "(" iter-range { "," iter-range } ")" "{" body-stmt "}"
mapreduce-stmt -> "mapreduce" "(" iter-range ")" "{" reduce-task-call-stmt "}"
body-stmt -> iter-stmt
-> task-call-stmt
iter-range -> loopvar-type loopvar-ident "=" start-expr ":" end-expr
loopvar-ident -> scalar-var-ident
start-expr -> expr
end-expr -> expr
task-call-stmt -> task-ident "(" task-arg { "," task-arg } ")" ";"
reduce-task-call-stmt -> task-ident "(" reduce-task-arg { "," reduce-task-arg } ")" ";"
task-arg -> array-block
-> scalar-var-ident
reduce-task-arg -> task-arg
-> "reducearg" "<" task-arg "," task-ident ">"
scalar-var-ident -> var-ident
|
The base terms in this grammar are:
As a result of the semantics of the language, there are a number of situations in which a program may result in undefined behavior, despite the compiler not raising any errors. These are as follows.
Aliasing Array Arguments to copy Operator
If the source and destination array blocks passed to the copy operator have the same base array and there are any elements of that base array that are referred to by both array blocks, then undefined behavior will result. Depending on the specifics of the overlap pattern and the order that data is transferred by the underlying runtime implementation, the value in the destination array may or may not be what the programmer "expects".
Aliasing Array Block Elements
If an array block is passed as a "destination" argument, either to the copy operator or to an out or inout task argument, and the array block is an indexed array block, then undefined behavior will result if any of the indices are repeated. The final value in the array once the operation has completed is undefined.
Note that element aliasing causes no problems in the case of array blocks used as the source of the copy operator or the input to a task call.
Another way of stating this: it's OK to have repeated indices in a gather, but not in a scatter.
Aliasing Array Arguments to a Task Call
Array blocks which have the same base array and are passed to a task call will result in undefined behavior if they overlap and if at least one of the overlapping array blocks is an out or an inout argument. That is, input-output and output-output aliasing of task array arguments results in undefined behavior.
In the case of input-output aliasing (but not output-output aliasing), whether or not the "expected" result is written to the output array depends on target-specific details, such as whether the task finished reading the array as an input before it started writing to it as an output.
Note that in the case of scalar task arguments, the compiler will check that no scalar variable is passed to a task "unsafely". (It is harder to check in the case of arrays since the compiler may have no way of knowing whether two different array blocks over the same base array overlap or not). For scalars, the rule is the same as for arrays: no input-output or output-output "aliasing" of a scalar variable is permitted in a task call. For example, given the following task definition:
void task<inner> MyTask::Inner(in int x1, in int x2, inout int x3, out int x4) { ... }
|
the following holds:
int a,b,c,d; MyTask(a,b,c,d); // Legal MyTask(a,a,b,c); // Legal MyTask(a,b,c,c); // Illegal MyTask(a,b,b,c); // Illegal |
Aliasing mappar or mapreduce Array Arguments
In addition to input-output and output-output aliasing of arrays causing undefined behavior in a single task call, input-output and output-output aliasing of arrays across iterations of a mappar or mapreduce statement also result in undefined behavior. That is, for an array which is written to by a task in a mappar or mapreduce, any given element of the array may only be written to once over the entire set of parallel task calls, and if an array element is written to by any task call, then it may not be read from by any task call.
External Tasks
External tasks, by their very nature, are outside of the Sequoia system, and yet certain assumptions have been made about how any task operates. In particular, it is assumed (and is enforced for inner and leaf tasks) that no task will write to any of its in arguments, though since an external task is just passed a pointer to each argument, it is able to freely violate this assumption. In general, external tasks will result in undefined behavior if they don't "play by the rules".
A Sequoia program is a collection of abstract tasks that do not contain any machine-specific constructs (unless they are external tasks). The abstract Sequoia program is specialized or instantiated for a particular machine (these terms are used interchangeably) at compile-time in a process referred to as mapping the program to the machine.
A Sequoia program expresses an abstract algorithm, while a Sequoia program plus a target-specific set of mapping directives represents a concrete implementation of the program. Multiple different target-specific mappings may be provided, enabling the same Sequoia program to be run on different machines.
Note that the current mapping interface that is described here and that the Sequoia compiler supports is an early prototype, and more convenient interfaces are under active development. Sequoia will always give a programmer as much control as they desire when mapping their program to a machine, and in that sense the current mapping concepts, with a few improvements, can be thought of as a "backbone" of the Sequoia system; this detailed control over mapping will always be present. In addition, however, some partial automation of the mapping process will be supported as a programmer convenience on top of the current low-level interface.
The concepts used in mapping a Sequoia program to different target machines are independent of any specific mapping syntax. The Sequoia compiler now supports an alternate mapping file interface that is designed to be more human-readable. The mapping information provided via the alternate syntax is the same as is described in the Generic Mapping Concepts section.
Alternate Mapping File Grammar
The following is a formal grammar for the alternate mapping file syntax. The notation is that of a standard regular expression. Literal open and close braces are indicated as \[ and \], respectively. Variables are indicated by capital letters.
#include "MACHINE_FILE"
[
task TASK_NAME [: entrypoint(INSTANCE_NAME)]?
{
[
instance INSTANCE_NAME::VARIANT_NAME([level LEVEL]?) [:]? [unique overlay|external("FILENAME")]*
{
[
tunable TUNABLE_NAME[\[LEXNUM\]]? = TUNABLE_VALUE;
|
data([level LEVEL]?) [: spaceshare(ARRAY_NAME [, ARRAY_NAME]*)]?
{
[
array ARRAY_NAME[\[LEXNUM\]]?([level LEVEL]?)
{
[
[elements|pitch] [<|=|>] VALUE [, VALUE]*;
|
blockcyclic { [grid = VALUE [, VALUE]*; | blocksize = VALUE [, VALUE]*;]* }
]*
}
]*
}
|
control([level LEVEL]?)
{
[
loop LOOP_NAME[\[LEXNUM\]]?([level LEVEL]?) [: flat]?
{
[
swp = STAGES;
|
unroll = DEPTH;
|
spmd { [fullrange = LOWER,UPPER; | ways = [WAYS|auto]; | iterblk = BLOCKS;]* }
]*
}
|
callsite TASK_NAME[\[LEXNUM\]]?()
{
[
target INSTANCE_NAME() [:]? [dynamic|copy(ARRAY_NAME [, ARRAY_NAME]*)]*
{
[
ARRAY_NAME.elements [<|=|>] VALUE [, VALUE]*;
|
instantiations [<|<=|=|>=|>] VALUE;
]*
}
]*
}
]*
}
]*
}
]*
}
]*
|
As with the XML mapping file syntax, if there are multiple objects in a task with the same name, then the LEXNUM field can be used to identify which one is being referred to,
The following sections describe the conceptual process of mapping a Sequoia program to a target machine. While mapping syntax and machine options may be target-specific, the same mapping concepts apply to all targets.
A Sequoia program is comprised of a set of abstract, parameterized task variants, and the process of mapping such a Sequoia program to a target machine consists of instantiating the abstract tasks, yielding a set of task instances.
Tasks contain four types of contructs which must be mapped to a target machine.
The process of instantiating a task consists of making all of the above parameter choices and mapping decisions. A task instance is a task variant which has been specialized for a target machine by mapping all the above constructs.
As a running example for the explanations in the subsections of this section, consider the following implementation of matrix multiplication:
void task<inner> MatMul::Inner(in float A[M][P], in float B[P][N], inout float C[M][N])
{
tunable Mblk, Pblk, Rblk;
mappar (int i = 0 : M/Mblk,
mappar (int j = 0 : N/Nblk) {
mapseq (int k = 0 : P/Pblk) {
MatMul(A[i*Mblk;Mblk][k*Pblk;Pblk],
B[k*Pblk;Pblk][j*Nblk;Nblk],
C[i*Mblk;Mblk][j*Nblk;Nblk]);
}
}
}
}
void task<leaf> MatMul::Leaf(in float A[M][P], in float B[P][N], inout float C[M][N])
{
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < P; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
|
The above implementation multiplies input matrices A and B and accumulates the result into matrix C, decomposing the algorithm by splitting up the problem of multipling an M*P matrix by a P*N matrix into subproblems which multiply submatrices of sizes Mblk*Pblk and Pblk*Nblk together.
The task variant call graph corresponding to this program is as follows:
|
The first step in instantiating a program's tasks is actually specifying what instances are going to be created. Suppose that the matrix multiplication running example program is to be mapped to a Cell target machine with the following abstract machine model, and that the program will be operating on very large (many-GB) matrices that are staged on disk.
|
The programmer would create three task instances in this case, one instance of MatMul::Inner to decompose the large (on-disk) matrices into medium-sized submatrices that would fit in machine's main memory, a second instance of MatMul::Inner to split the main-memory medium-sized submatrices up into small, LS-sized sub-submatrices, and then an instance of MatMul::Leaf to actually perform a small matrix multiplication operation on the Cell SPEs. These three instances are illustrated in the following task instance call graph:
|
Every program variable must reside in a specific memory level of the target machine, and it's typically the case that the decision about what task instances to create is interlinked with the decision about where data will live and how much space it will take.
In the matrix multiplication running example, the data objects in the task instances will be mapped as follows:
This is illustrated in the updated task instance call graph below:
|
Each task instance will have its own "versions" of the task variant's tunable parameters, and the programmer must specify a compile-time constant for every tunable in every instance when mapping the program to a machine.
The tunables in the matrix multiplication running example are used to set the submatrix sizes that are passed to subtask calls. They will be set as follows:
These numbers were chosen to arrive at arrays in each level that are as large as possible, by the following method.
The selection of tunable values is illustrated in the updated task instance call graph below:
|
Note that the most common use of tunable values in Sequoia is to set array block sizes, and in particular, the "max" field of array blocks may only refer to tunable and array size parameters.
The amount of freedom that a programmer has to choose tunables depends upon the particular Sequoia application and the particular machine being mapped to. For example, a program may be written assuming that a tunable value will evenly divide some other application parameter, and on a machine such as Cell, hardware alignment constraints require that the start addresses of array blocks be aligned on 16-byte boundaries; each of these can constrain the space of "valid" values that the tunables can be set to.
In general, the heuristic that a programmer should use when setting tunable values is to try to make the arrays that reside in fixed size, local memories as large as possible without exceeding their capacity. Also, it's common for a programmer to try out a handful of choices for their program's tunable values before choosing the ones that yield the highest program performance.
Just as data objects in a Sequoia program must be mapped to physical storage locations in the target machine, so must control statements be mapped to processors that will execute them.
Sequoia draws a distinction between control and computation in a program, though the distinction is really a difference of the intent of the code, rather than its syntax.
All inner task control constructs must be mapped to a control processor in the target machine. This is the processor that will execute the statements.
Current Mapping Interface
It would be unwieldy and impractical for a mapping directive to be required for every operation in an inner task, such as the addition and assignment statement "int x = y + z;", and yet statements such as this must be executed somewhere. A general method that would allow the programmer to conveniently map control operations as a group is under development, however there is a mapping interface supported that provides a working solution.
The current mapping interface that Sequoia supports handles the mapping of control by associating a control level with every Sequoia iteration construct (mappar, mapseq, and mapreduce), and then implicitly placing every other type of control operation in the same control level as its "closest" parent iteration construct. The mapping interface will assign a memory level in the target machine to each such Sequoia iteration construct, and this will cause the Sequoia compiler to emit code for that loop into a file that will be executed by a control processor at that memory level. The exception to this rule is a leaf task call; leaf tasks are always "controlled" by the level-0 controller.
If a Sequoia loop that is running on a control processor in level L contains a nested Sequoia loop whose control level is set to level L-1, then this will result in the level-L processor calling into the L-1 processor to execute the contained loop; this call will be emitted as the appropriate communication operation on the target machine by the Sequoia compiler. For example, on a Cell machine, if a mapseq that is running in the PPE has a nested mappar inside it that is set to run in the SPE level, then each PPE mapseq iteration a thread call will be made from the PPE to the SPE to execute the nested mappar contruct.
In the matrix multiplication running example, the control level of the Sequoia iteration constructs would be mapped as in the following updated task instance call graph diagram.
|
Sequoia Loop Transformations
In addition to being assigned to a memory level, mapping a Sequoia loop (mappar, mapseq, or mapreduce) to a machine involves specifying any (or none) of the following set of loop transformations that will be performed on the loop.
Note that if a loop is software-pipelined, then the amount of space that each iteration has to work with will be reduced, since multiple iterations will be running in parallel.
In the matrix multiplication running example, these transformations would be applied as follows, yielding a fully mapped/instantiated program, with all mapping details having been filled in. This mapped program represents a concrete implementation of the MatMul abstract Sequoua code.
|
There are 8 parallel SPEs in level 0 in the Cell machine model, and thus the program will be parallelized 8 ways over these processors, though this is acheived via a 2D parallelization mapping. Further, at both the disk-to-memory and memory-to-LS interfaces there are asynchronous transfer operations provided by the underlying hardware/OS, and so at each interface in the program a loop was software-pipelined to acheive overlap between computation and communication.
Although the descriptions of the mapping of control and data are presented as two separate subsections, they are closely coupled by the following constraint: any control statement that accesses a data object (scalar variable or array element) must be placed in the same memory level as the data object is. If this constraint isn't met by the programmer's mapping directives then a compile-time error will occur.
Arrays can be laid out in various different ways within a memory module, for example in row-major, column-major, or block-major order, and they can be distributed across multiple parallel memory modules with a virtual memory level using the standard array distributions that array languages support (block-cyclic, etc.)
Documentation and examples of these mapping interfaces are yet to be written.
As Sequoia deals with the explicit management of fixed-size memory modules, it is important for the Sequoia compiler to be able to statically determine the maximum possible amount of space that data objects may consume, and this is something that is done during the process of mapping a program to a target machine.
Downward Constant Propagation
Consider the matrix multiplication running example from the section on instantiating tasks.
void task<inner> MatMul::Inner(in float A[M][P], in float B[P][N], inout float C[M][N])
{
tunable Mblk, Pblk, Rblk;
mappar (int i = 0 : M/Mblk,
mappar (int j = 0 : N/Nblk) {
mapseq (int k = 0 : P/Pblk) {
MatMul(A[i*Mblk;Mblk][k*Pblk;Pblk],
B[k*Pblk;Pblk][j*Nblk;Nblk],
C[i*Mblk;Mblk][j*Nblk;Nblk]);
}
}
}
}
void task<leaf> MatMul::Leaf(in float A[M][P], in float B[P][N], inout float C[M][N])
{
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < P; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
|
In this example, the program was mapped on to a Cell processor, yielding the following task instance call graph that illustrates the various mapping decisions that were made.
|
Recall that the semantics of array size parameters are that at runtime, they are computed from the actual sizes of the array arguments that are passed to the task instance call. It is often the case, however, that despite this "dynamic" definition, they can be determined statically by the compiler when the program is mapped to a target machine; this is a deliberate design point of the Sequoia language.
Consider the array size parameters (M, P, and N) in the MatMulInstSmall instance in the above figure. Since (1) this particular mapping of this program has specified that the MatMulInstSmall is only called by the MatMulInstMedium instance, (2) the MatMulInstMedium instance has tunable values Mblk = Pblk = Nblk = 64, and (3) these tunable values are used to set the array block "max" sizes of the arguments to the MatMulInstSmall call, the compiler is able to propagate information "downwards" through this call chain and determine that the array size parameters in the MatMulInstSmall instance must are constrained such that M <= 64, P <= 64,and N <= 64. By a similar downward propagation, the array size parameters of the MatMulInstMedium instance can be statically determined to be M <= 8192, P <= 8192, and N <= 8192, however there is no information available for the compiler to determine the array size parameters of the MatMulInstLarge instance; these parameters are "dynamic". This is illustrated in the following updated task instance call graph for this example program.
|
After the programmer's mapping decisions have been applied, resulting in instantiated tasks with tunables set to constant values, the compiler performs a "downward constant propagation" pass, where for each task instance, starting at the program entry points and working "downwards" through the task instance call graph from caller to callee, the "max" values of all array blocks that are passed as arguments to a subtask call will be propagated to the subtask's instance's array size parameters, if possible. And once a program's array size parameters are bounded to some statically known value, the maximum possible sizes of the arrays in a task will also be known, since (1) argument arrays are sized exlusively by array size parameters, and (2) locally declared arrays within a task may only be sized by array size or tunable parameters, and the latter are compile-time constants. Once the maximum possible sizes of a task's arrays are known, the compiler/runtime can statically allocate space for them in explicitly managed, fixed-size local memories (if they fit).
Array Size Preconditions: Trust But Verify
The above compiler pass will not always be successful in determining static (compile-time) values for all array size parameters in the system. The following example illustrates two different cases where this isn't possible.
void task<inner> MyTask::Inner(inout float A[N])
{
MyTask2(A);
mappar (unsigned int i = 0 : 4) {
MyTask3(A[i*N/4;N/4]);
}
}
|
Suppose that this task has been instantiated, to MyTaskInst (say). If the above "downward" constant propagation compiler pass was able to statically bound the value of the argument's array size parameter N, then a static bound will be able to be propagated to the calls to the MyTask2 and MyTask3 instances. If, however, N is "dynamic", just as the array size parameters of the MatMulInstLarge instance in the above matrix multiplication example are, then niether of these task calls will enable the compiler to propagate any static bounds do the callee instances, since in the call to MyTask2, the full (dynamically sized) array A is passed as the argument, and in the call to MyTask3, the size of the argument is at most N/4, which doesn't help since this expression is not staically bounded if N isn't.
In general, there are occasions when the downward propagation of constants by the compiler isn't sufficient to statically bound all array size parameters (and hence all array sizes) in the program. This fact could be problematic, however, due to the requirement that only data objects whose size is statically bounded can be copied to a fixed-size, software-managed memory such as the Cell's LS. To solve this dilemma, the Sequoia mapping interface provides the ability to specify an array size precondition on a task instance's argument array.
An array size precondition on a task instance simply specifies that the instance may only be called if its argument array's size would be at most (or exactly) a certain value (at runtime). If this condition isn't true, then it would be a runtime program error to call the task instance.
This same mapping interface also supplies a static bound for the size of an array argument to the task; the programmer "plugs in" a constant value for a bound on an array's size, just as they do when setting tunables. The compiler will then propagate this constant size downward just as it would any other array size that it had statically determined a bound for.
In the above example, supposing that the MyTaskInst instance was called with a dynamically sized array, the programmer could provide an array size precondition for the MyTaskInst instance which would force the compiler to treat its argument A[N] as having a statically known bound on its size, which would in turn permit the MyTask2 and MyTask3 instances called by MyTaskInst to have a static bound propagated to their argument arrays.
The compiler has a "trust but verify" approach with array size preconditions; it will statically take it on faith that the arrays will be bounded by the size the programmer tells it, but at runtime, checks will be performed to ensure that the array does in fact satisfy the instance's precondition.
Recall that a callsite in a Sequoia task doesn't refer to the specific instance that is being called; rather, it just refers to the "base" task name, and the programmer will specify which instance is to be called by each callsite when instantiating the task, during the process of mapping the program to a target machine.
The Sequoia mapping interface actually allows the programmer to specify a list of possible instances which may be called from a given callsite, with calling conditions on each instance (not to be confused with the array size preconditions described in the section on statically bounding array sizes).
The semantics of providing a list of instances as possible targets are that at runtime, the calling conditions of each target instance will be evaluated, starting from the head of the list, until one of the instances has its calling conditions evaluate to true; this is the instance which will be called.
Calling conditions may be on the size of an array in the calling task, for example, "call instance InstX at this callsite if array A has fewer than 1024 elements". In addition, other types calling conditions may be defined in the mapping interface for a specific target.
Note that the compiler will still perform its downward propagation of constants pass through callsites which have a list of possible targets with calling conditions, by evaluating the calling conditions statically where possible.
The Sequoia compiler and runtime is one particular implementation of the Sequoia language, and it's possible that there could exist completely different systems which take Sequoia code as an input language. This section describes the optimizations which any Sequoia implementation must perform; these can be considered a "contract" between the programmer and the system, and the programmer may rely on these optimizations being performed when writing their code. (The Sequoia compiler performs these optimizations, and more.)
CBVR Copy Elimination of Stride-1 Range Array Blocks
The task calling semantics of the Sequoia language are call-by-value-result (CBVR), also known as "copy in, copy out". Clearly, if every task call in the program required a physical memory copy of array arguments then performance would be very poor, but the copies may only be semantic copies; that is, as far as the program is concerned, the copy did occur, but "under the hood", the Sequoia compiler or runtime system may have optimized the copy away, converting it instead into an alias or reference to another array if it proves that it is safe to do so.
There is a very common case of argument passing, in which a multi-dimensional stride-1 range array block is passed as an argument to a subtask call, such as in the following example.
void task<inner> MyTask1::Inner(in float A[M][N], out float B[M][N])
{
tunable T, U;
MyTask2(A, B);
MyTask2(A[0;T][u0:u1;U], B[0;T][u0:u1;U]);
}
void task<inner> MyTask2::Inner(in float X[U][V], out float Y[U][V])
{
...
}
|
Suppose that these tasks were instantiated as in the following diagram:
|
In the event that the mapping interface specifies that the MyTask1Inst instance's array A and the MyTask2Inst instance's array X are in the same memory level, the semantic copy from the namespace of MyTask1Inst into MyTask2Inst that occurs on a task call would actually be a copy within the same memory module, and further, the calling semantics of tasks specify that when MyTask1Inst executes the task call "MyTask2(A);" (say), it will actually pause there until the child task returns, and thus, the compiler can actually alias array X to array A and not perform an underlying physical copy of the data. As far as the program is concerned, however, arrays X and A are completely different physical objects. Similarly, arrays Y and B can be aliased to the same physical array object.
In essence, the compiler is taking a language with copy semantics on task calls and optimizing away unnecessary copies when it is safe to do so. The Sequoia compiler actually eliminates copies in other cases too (if safe), though the basic copy elimination of stride-1 range array arguments to task calls is a guaranteed feature of any Sequoia implementation.
Copy Hoisting
Consider the following example, consisting of an inner task implementation from a Sequoia 2D convolution program.
void task<inner> Conv2D::Inner(in float A[M+U-1][N+V-1], in float H[U][V], out float C[M][N])
{
tunable S, T;
mappar (unsigned int i = 0 : (M+S-1)/S) {
mappar (unsigned int j = 0 : (N+T-1)/T) {
Conv2D(A[i*S;S+U-1][j*T;T+V-1], H, C[i*S;S][j*T;T]);
}
}
}
|
The language semantics are that every iteration of the nested mappar statement, the Conv2D task is called, passing array blocks of A and C and the entire array H as arguments. Suppose that the instance called at the Conv2D callsite was in a different memory level to the calling instance of Conv2D::Inner: the semantics of this code would be that every iteration, the two input array arguments would be physically copied between memory levels (which could be an expensive memory transfer operation), the subtask instance would run, and then the subtask's output data would be copied back to the calling task's C array.
Clearly, the exact same array H is passed every iteration, and it would be inefficient to recopy it every time. The compiler will perform an optimization known as "loop hoisting", also known as "loop-invariant code motion", and only perform the copy of H once, at the start of the nested mappar construct, with every iteration's Conv2D call operating on this copy.
The contractual optimization that any Sequoia implementation must perform is to identify the iteration variables that are used to describe an array block passed to a task call inside a nest of Sequoia iteration constructs and "hoist" out the copy of any arguments as far as possible.
As a second example, consider the following inner task from a Sequoia implementation of matrix multiplication.
void task<inner> MatMul::Inner(in float A[M][P], in float B[P][N], inout float C[M][N])
{
tunable Mblk, Pblk, Rblk;
mappar (int i = 0 : M/Mblk,
mappar (int j = 0 : N/Nblk) {
mapseq (int k = 0 : P/Pblk) {
MatMul(A[i*Mblk;Mblk][k*Pblk;Pblk],
B[k*Pblk;Pblk][j*Nblk;Nblk],
C[i*Mblk;Mblk][j*Nblk;Nblk]);
}
}
}
}
|
In this example, the array block C[i*Mblk;Mblk][j*Nblk;Nblk] doesn't depend on the k loop variable, only i and j, and so rather than copying C on every (i, j, k) iteration, it is only copied on every (i, j) iteration, with each k iteration using the same copy.
The concepts used in mapping a Sequoia program to different target machines are independent of any specific mapping syntax. The Sequoia compiler currently supports an XML mapping file interface, although other interfaces are being actively developed, as the XML syntax is extremely verbose and inconvenient to use. The mapping information provided via the XML syntax is the same as is described in the Generic Mapping Concepts section.
XML Mapping File Template
The following is a template for an XML mapping file which includes all of the currently supported mapping directives. As shown in the example programs, an application need only adapt this template to its specific needs, picking and choosing which XML entries to use. Further, the Sequoia compiler is capable of producing an XML template which has been tailored to a Sequoia program, with the programmer then only having to "fill in the blanks".
<?xml version="1.0"?>
<mapping>
<!-- The machine description file to use. -->
<machine filename="cell.xml"/>
<!-- Do codegen for this top-level instance; this is the task instance -->
<!-- which may be called from external code. -->
<entrypoints> task_inst_name </entrypoints>
<!-- Each "base" task has a <taskmapping> entry. -->
<taskmapping name="task_name">
<!-- Each task instance has an <instance> entry. -->
<!-- There may be multiple instances defined of a single task. -->
<instance name="task_inst_name">
<!-- The task variant that this instance is instantiating. -->
<variant> variant_name </variant>
<!-- For external tasks only: -->
<!-- The name of the file containing the external implementation. -->
<filename> ext_file_name </filename>
<!-- For leaf tasks only, when the target is Cell: -->
<!-- On Cell, program code must be put into overlays. This flag, if present, -->
<!-- will force the object code for this leaf task to be in an overlay -->
<!-- by itself; without this flag, the compiler may try to pack some other -->
<!-- code into the overlay with the leaf object code. -->
<!-- Use this to reduce the size of an overlay. -->
<cellforceuniqueoverlay />
<!-- An integer constant assigned to a tunable parameter. -->
<tunable name="tunable_name" lexnum="0"> 64 </tunable>
<!-- Specifies a default memory level for all ctrl and data. -->
<!-- This will be overrided by "nested" <level> entries. -->
<level> 2 </level>
<!-- A <data> block specifies mapping details for program arrays. -->
<data>
<!-- Specifies a default memory level for all data objects. -->
<!-- This overrides an instance-wide <level> specification, -->
<!-- and is itself overrided by any per-array <level> entries. -->
<level> 2 </level>
<!-- Applies to leaf tasks only: -->
<!-- Specify that multiple arrays in the same instance will be -->
<!-- physically packed at the same memory address; this is in general -->
<!-- not "safe". An example of when it is safe is if an input and an -->
<!-- output array are traversed together sequentially, with an input -->
<!-- array element only being read before the corresponding output -->
<!-- array element is written. -->
<spaceshare> array1, array2, array3 </spaceshare>
<!-- An <array> block specifies mapping details for a single array. -->
<array name="array_name" lexnum="2">
<!-- Specifies the memory level of this array, overriding -->
<!-- any <level> entries "above" this. -->
<level> 2 </level>
<!-- An array precondition is used to assert that the -->
<!-- arguments are a certain size; before a task is called, -->
<!-- the precondition will be checked, and if false, the program -->
<!-- will raise an error. -->
<!-- There are two types of array preconditions: "numelmt", -->
<!-- and "pitch"; these correspond to the fields of the -->
<!-- same names in the sqArray_t structure in the Sequoia API. -->
<!-- It is not used to decide which task to call -->
<!-- Types are "boundabove", "exact", and "boundbelow". -->
<arraynumelmtprecond type="boundabove"> 1024,128 </arraynumelmtprecond>
<arraypitchprecond type="boundabove"> 1024,128 </arraypitchprecond>
</array>
</data>
<!-- A ctrl block specifies mapping details for program loops and callsites. -->
<ctrl>
<!-- Specifies a default memory level for all loops. -->
<!-- This overrides an instance-wide <level> specification, -->
<!-- and is itself overrided by any per-loop <level> entries. -->
<level> 0 </level>
<!-- A <loop> block specifies mapping details for a single -->
<!-- Sequoia loop: mappar, mapseq, or mapreduce.-->
<loop name="j" lexnum="0">
<!-- Specifies the memory level of this loop, overriding -->
<!-- any <level> entries "above" this. -->
<level> 0 </level>
<!-- Software-pipeline (SWP) this loop. The number specified -->
<!-- is the max number of stages; the compiler may use less -->
<!-- stages than this. -->
<!-- "1" would be the same as not software-pipelining at all. -->
<!-- Only mappar and mapreduce loops may be SW-pipelined. -->
<swp> 2 </swp>
<!-- Unroll this loop. The number specified is the unroll -->
<! count; 1 = not unrolled, 3 = three versions of loop, etc. -->
<unroll> 3 </unroll>
<!-- Flatten this loop. This means completely unrolling it, -->
<!-- so it's a loop of 1 iteration, and then actually removing -->
<!-- the loop and inlining its contained ops. -->
<!-- Only loops with constant iteration bounds can be flattened. -->
<flatten />
<!-- Spatially parallelize this loop, in a SPMD manner. -->
<!-- If the loop is mapped to memory level M and level M has -->
<!-- parallel control units or PEs, then the loop can be -->
<!-- SPMD-ized over them; this will distribute the iteration -->
<!-- space of the loop, with each parallel processor getting -->
<!-- an equal share the iterations. -->
<spmd>
<!-- The first time this is "seen" it sets the full SPMD width -->
<!-- within a memory level, and it is subsequently ignored. -->
<!-- This directive is necessary if N-way SPMD-ness will be -->
<!-- "built up" over multiple loops; for example, loop "i" -->
<!-- could be 2-way, and nested loop j could be 4-way, resulting -->
<!-- an 8-way parallelization. The outer loop in the nest (i) -->
<!-- would specify the <fullrange> of the SPMD-ization. -->
<fullrange> 0,8 </fullrange>
<!-- Number of ways to parallelize this loop, up to the full -->
<!-- SPMD width. -->
<ways> 4 </ways>
<!-- Number of ways can also be "auto", and will be filled in to -->
<!-- bring the SPMD-ness up to the fill SPMD width. -->
<!-- (Either specify an integer, or "auto", for <ways>, not both.) -->
<ways> auto </ways>
<!-- The loop iterations are divided over the parallel PEs -->
<!-- in chunks of "iterblk" iterations; it's a block-cyclic -->
<!-- distribution. -->
<iterblk> 16 </iterblk>
</spmd>
</loop>
<!-- A <callsite> block specifies mapping details for a -->
<!-- task callsite in the Sequoia code. Multiple candidate instances -->
<!-- can be specified, with calling conditions selecting which to call. -->
<callsite name="callee_task_name" lexnum="2">
<!-- A candidate task instance to call at this callsite. -->
<!-- If there are multiple candidates, then the conditions -->
<!-- of each are tried, in the order they are listed here, -->
<!-- until one of the candidate's conditions evaluate to true, -->
<!-- and that is the instance which will be called. -->
<target name="task_inst_name">
<!-- Conditions; call this target if they are ALL true. --%gt;
<!-- That is, it's an "AND". -->
<!-- Only call this instance if the specified array in the -->
<!-- Sequoia task has a certain size. -->
<!-- Note that the array referred to is an array in the calling -->
<!-- task, not the callee task. -->
<!-- Arraysizecond types are "boundabove", "exact", "boundbelow" -->
<arraysizecond name="array_name1" type="boundabove"> 16,32 </arraysizecond>
<!-- A condition on number of times this instance has been "called", -->
<!-- in top-down, runtime program order. -->
<!-- Note that by "called", it means that this instance was -->
<!-- selected to be called at a callsite, and in the case of -->
<!-- a callsite in a loop, the selection only happens once, with -->
<!-- every iteration calling the same instance. -->
<!-- Types are "lt", "leq", "eq", "geq", "gt" -->
<numinstcond type="leq"> 3 </numinstcond>
<!-- Override to specify to always call the task dynamically. -->
<!-- Never statically expand the call graph in the compiler -->
<!-- through this callsite. -->
<forcedynamiccall />
<!-- Override to specify that the CBVR copy of this argument -->
<!-- mustn't be eliminated by the compiler. -->
<forcecopy name="array_name1" />
</target>
</callsite>
</ctrl>
</instance>
</taskmapping>
</mapping>
|
Below is the same template, without all the comments.
<?xml version="1.0"?>
<mapping>
<machine filename="cell.xml"/>
<entrypoints> task_inst_name </entrypoints>
<taskmapping name="task_name">
<instance name="task_inst_name">
<variant> variant_name </variant>
<filename> ext_file_name </filename>
<cellforceuniqueoverlay />
<tunable name="tunable_name" lexnum="0"> 64 </tunable>
<level> 2 </level>
<data>
<level> 2 </level>
<spaceshare> array1, array2, array3 </spaceshare>
<array name="array_name" lexnum="2">
<level> 2 </level>
<arraynumelmtprecond type="boundabove"> 1024,128 </arraynumelmtprecond>
<arraypitchprecond type="boundabove"> 1024,128 </arraypitchprecond>
</array>
</data>
<ctrl>
<level> 0 </level>
<loop name="j" lexnum="0">
<level> 0 </level>
<swp> 2 </swp>
<unroll> 3 </unroll>
<flatten />
<spmd>
<fullrange> 0,8 </fullrange>
<ways> 4 </ways>
<ways> auto </ways>
<iterblk> 16 </iterblk>
</spmd>
</loop>
<callsite name="callee_task_name" lexnum="2">
<target name="task_inst_name">
<arraysizecond name="array_name1" type="boundabove"> 16,32 </arraysizecond>
<numinstcond type="leq"> 3 </numinstcond>
<forcedynamiccall />
<forcecopy name="array_name1" />
</target>
</callsite>
</ctrl>
</instance>
</taskmapping>
</mapping>
|
Naming Program Objects
The mapping file needs to be able to name objects in the Sequoia source code, such as loop variables, arrays, callsites, and tunable parameters. In the case that these names are unique within a Sequoia task variant, the symbolic name itself can be used to identify the program object that is being mapped. For example, if the array A was being mapped, then the following would be the XML code to refer to this array:
<array name="A">
...
</array>
|
If there are multiple objects in a task with the same name, then the lexnum field can be used to identify which one is being referred to, where lexnum="X" would refer to the (X+1)th version of the symbol, in the program's lexical order (starting counting from 0). For example, the following would be a mapping entry for the third array B defined in a task:
<array name="B" lexnum="2">
...
</array>
|
The Sequoia API defines the interface through which external code can call Sequoia tasks, and through which external functions can be called from within Sequoia code.
Types
The sequoia.h header file contains the Sequoia API types and functions that are implemented for every target.
#define SQ_MAX_DIMS 9
typedef unsigned long sqSize_t;
// A struct to wrap an array pointer, containing information about the array's
// layout in memory.
typedef struct {
unsigned int numDims;
int contigDim;
sqSize_t offset[ SQ_MAX_DIMS ];
sqSize_t pitch[ SQ_MAX_DIMS ];
sqSize_t numElmts[ SQ_MAX_DIMS ];
sqSize_t elmtSize;
void *ptr;
} sqArray_t;
|
The sqArray_t structure is used to describe a Sequoia array; it wraps a pointer to the block of memory that holds the array's data, and contains various fields which describe the layout of the data within that block of memory. Its fields are as follows:
The following diagram illustrates the usage of these fields on a 2D array of floats. The blue (inner) array corresponds to the logical Sequoia array that the Sequoia program accesses, while the yellow (outer) array corresponds to its physical layout in memory.
|
Allocating Arrays
The sequoia.h file also gives the interfaces to Sequoia API functions for allocation and de-allocation of arrays.
// Functions to allocate an array.
sqArray_t *sqAllocArray1D(sqSize_t elmtSize, sqSize_t d0);
sqArray_t *sqAllocArray2D(sqSize_t elmtSize, sqSize_t d0,sqSize_t d1);
sqArray_t *sqAllocArray3D(sqSize_t elmtSize, sqSize_t d0,sqSize_t d1,sqSize_t d2);
// ... (up to 9D)
// Function to free an array that was allocted by a sqAllocArray*D() call.
void sqFreeArray(sqArray_t *p);
|
The allocation functions allocate a block of memory and wrap it in a sqArray_t structure which describes its physical layout in a machine. The array need not be in row-major order, nor need it even be local to one memory module.
Although a programmer is free to allocate their memory in any way they like and wrap it in a sqArray_t structure, it is recommended that these allocation functions be used instead, as their implementation on each platform will be cognizant of machine-specific requirements, such as data alignment and padding constraints, etc. If the programmer does allocate their own data externally, then they have the onus of ensuring that the data's layout is correct for the machine that they're running on, and that the sqArray_t structure's contents are filled in appropriately.
Accessing Array Elements
The Sequoia API also provides functions for accessing elements of arrays that are wrapped in sqArray_t structures.
// Functions to return a pointer to a specified element in an array. void *sqElmt1D(sqArray_t *a, sqSize_t i0); void *sqElmt2D(sqArray_t *a, sqSize_t i0,sqSize_t i1); void *sqElmt3D(sqArray_t *a, sqSize_t i0,sqSize_t i1,sqSize_t i2); // ... (up to 9D) |
These return pointers to the requested element. The sqElmtND function, which returns a pointer to an element of an N-dimensional Sequoia array, is implemented via the following expression in the case that the array is laid out in a contiguous row-major order.
sqElmtND(A, i0,...,iN-1) = (char *)(A->ptr)
+ ( (i0 + A->offset0) * A->pitch1 * ... * A->pitchN-1
+ (i1 + A->offset1) * A->pitch2 * ... * A->pitchN-1
+ ...
+ (iN-2 + A->offsetN-2) * A->pitchN-1
+ (iN-1 + A->offsetN-1)
) * A->elmtSize;
|
The overhead of using the above relatively heavy-weight functions for every access to an array could be large, and so the intended usage model is that the sqElmtND function can be used to return a pointer to a contiguous chunk of the array, and then C pointer arithmetic can be used to traverse the elements in that contiguous chunk. This is where the contigDim field of the sqArray_t structure comes in handy; it specifies which dimension of the array (if any) is contiguous in its layout in memory. For example, if a 3D array has a row-major layout, then it will have contigDim=2, and the sqElmt3D function can be used to return a pointer to the first element of a row which can then be traversed using efficient pointer arithmetic.
The process of calling Sequoia tasks from external code is as follows:
As an example, suppose that the following Sequoia task will be called from an external program.
void task<inner> MyTask::MyVariant(in int A[T], out float B[U][V], in int x, inout float y); |
The Sequoia compiler will emit a header file containing the following prototype; to call the above Sequoia task, simply call this C function.
void MyTaskInst(sqArray_t *A, sqArray_t *B, int x, float *y); |
The rules for determining the signature of the compiler-generated prototype for a given Sequoia task are simple:
There are three types of Sequoia tasks: inner, leaf, and external. Inner and leaf tasks are both implemented natively in Sequoia, while external tasks only have their interfaces defined in the Sequoia code, not their bodies. The purpose of external tasks is to link to an externally provided function, enabling Sequoia code to call into a non-Sequoia function.
Any external function called by Sequoia code must have a specific interface. Consider the following example external task declaration:
void task<ext> MyTask::MyVariant(in int A[T], out float B[U][V], in int x, inout float y); |
Suppose that the compile-time mapping directives directed that this task should be instantiated with the instance name "MyTaskInst", and that an external function was to be used to implement this instance. The external function's signature must be:
void MyTaskInst(sqArray_t *A, sqArray_t *B, int x, float *y); |
The rules for determining the type signature that the external function must have for a given external Sequoia task declaration that it will implement are simple, and are the same as for the case of calling a Sequoia task from external code:
External functions are commonly used in Sequoia to provide optimized, target-specific leaf task implementations that can be called instead of generic Sequoia leaf task implementations. It's often the case that much better program performance can be obtained by optimizing leaf task code using SIMD intrinsics and other target-specific low-level operations.
Compiler-Generated API: sequoia_ext.h
The Sequoia compiler also generates the sequoia_ext.h header file containing the following macros for each Sequoia external task instance. This interface is not necessary to be able to write an external function that will interface to Sequoia, rather it is provided for convenience and efficiency. (The implementation of these macros isn't shown here). All of these macros may only be called from within the body of the external function.
// Evaluates to a pointer to a block of local scratch space that was allocated // for the external function by the compiler (as a result of a mapping // directive). #define sqLocalSpace(instNameSym) // A macro that will compute the values of the array size paramters from the // actual array sizes. #define sqCalcParams(instNameSym) // A macro that the compiler defines that will evaluate to the value of a // parameter of the function, for each parameter "paramNameSym". // The sqCalcParams() macro must be called before the sqParam() macro may be // used, and as the sqCalcParams() macro internally creates a scalar variable // for each array size parameter, the sqParam() macro may only be used in the // scope of the sqCalcParams() call. // The compiler may also hard-code the sqParam() implementations for some of the // array size parameters for an external task instance by making them evaluate // to a preprocessor constant, if their values are known statically at compile // time. #define sqParam(instNameSym, paramNameSym) // Macros that provide specialized implementations of the sqElmt*D() functions, // for each array argument "arrayNameSym" of the external task. These can be used // to more efficiently access array elements. #define sqElmtSpec1D(instNameSym, arrayNameSym, i0) #define sqElmtSpec2D(instNameSym, arrayNameSym, i0,i1) #define sqElmtSpec3D(instNameSym, arrayNameSym, i0,i1,i2) // ... |
As an example, the above MyTaskInst external function could use these macros as follows:
void MyTaskInst(sqArray_t *A, sqArray_t *B, int x, float *y)
{
char *localScratchSpace = sqLocalSpace(MyTaskInst);
sqCalcParams();
unsigned int T = sqParam(MyTaskInst, T);
unsigned int U = sqParam(MyTaskInst, U);
float *B_elmt = sqElmtSpec2D(MyTaskInst, B, 10, 0);
}
|
The following are example Sequoia programs, roughly listed in order of increasing complexity. The Vector Add example is also something of a tutorial.
More examples will be posted here in the coming weeks and months, including some "irregular" programs which make use of the scatters and gathers that can be expressed via indexed array blocks. Further, some other examples are provided as a part of the Sequoia release download.
The vector add program simply takes two 1D arrays as input and adds their corresponding elements to produce a 1D output array. The prototype of a Sequoia task that implements a vector add is as follows; the two input arrays A and B are added to produce the output array C.
void task VectAdd(in float A[N], in float B[N], out float C[N]); |
The first step in writing a Sequoia program is deciding on how the problem will be decomposed. Suppose that the code was run with 1-billion element arrays as inputs on a machine (such as the Cell processor) which has only 256KB of local space. Since the entire arrays won't fit, the problem must be decomposed or partitioned into subproblems which will fit.
Vector add has an obvious decomposition strategy: the size-N input arrays can be partitioned into size-T subarrays, and then each pair of subarrays of A and B can be added to produce a subarray of the output array C. As the input is size-N and the subarrays are size-T, there will be N/T subproblems to be solved. Further, each subproblem can be executed in parallel. This is illustrated in the following diagram:
|
Sequoia Source Code, Version 1
There is a straightforward translation of the above problem decomposition strategy into Sequoia: a size-N inner task will chop the arrays into size-T array blocks that are passed recursively to the vector add task, which will perform the vector add operation on the subarrays.
// vectadd.sq (version 1)
// ----------------------
void task VectAdd(in float A[N], in float B[N], out float C[N]);
void task<inner> VectAdd::Inner(in float A[N], in float B[N], out float C[N])
{
tunable T;
mappar (unsigned int i = 0 : N/T) {
VectAdd(A[i*T:(i+1)*T;T], B[i*T:(i+1)*T;T], C[i*T:(i+1)*T;T]);
}
}
void task<leaf> VectAdd::Leaf(in float A[N], in float B[N], out float C[N])
{
for (unsigned int i = 0; i < N; i++) {
C[i] = A[i] + B[i];
}
}
|
Iteration i of the mappar operates on elements in the range i*T to (i+1)*T of each of the three arrays; written in Sequoia's array block notation, this becomes the range [i*T:(i+1)*T;T], where the third field of this expression (;T) denotes the maximum possible size of the range (which in this case is its exact size).
Sequoia Source Code, Version 2
There is an assumption implicit in the above implementation of vector add, namely the fact that the size of the array blocks, T, must evenly divide the size of the input arrays, N. This assumption can be removed by a small variation in the implementation, as follows:
// vectadd.sq (version 2)
// ----------------------
void task VectAdd(in float A[N], in float B[N], out float C[N]);
void task<inner> VectAdd::Inner(in float A[N], in float B[N], out float C[N])
{
tunable T;
mappar (unsigned int i = 0 : (N+T-1)/T) {
VectAdd(A[i*T;T], B[i*T;T], C[i*T;T]);
}
}
void task<leaf> VectAdd::Leaf(in float A[N], in float B[N], out float C[N])
{
for (unsigned int i = 0; i < N; i++) {
C[i] = A[i] + B[i];
}
}
|
There are only two differences between this implementation and that of version 1:
The implementation of version 2 illustrates a common Sequoia idiom: the use of the [start;max] array block notation to express "bounded" array blocks, that is, array blocks which will be exactly a given size if they are "in the middle" of their base array, or smaller if they are "on the edge". This is explained further in the section on array block default values.
Mapping the Program onto a Cell Processor
Sequoia programs are abstract in that they don't refer to any specific machine parameters or mechanisms, enabling them to be portable across machines with very different architectures and yet still yield high performance. Compiling a Sequoia program for a particular target machine involves making a set of mapping decisions, and is referred to as instantiating or specializing the program for the machine.
Suppose that the vector add program was to be mapped onto a Cell processor. One possible way to map the program would be to place an instance of VectAdd::Inner in main memory, to be executed by the PPE, and an instance of VectAdd::Leaf in the LS, to be executed by the SPE. This is illustrated below, with the task instance hierarchy drawn alongside the Cell's abstract machine model.
|
The mapping details that were specified to achieve this were a follows.
Try to visualize how this program will be executed on the Cell processor. The PPE will execute the "i" mappar loop in the VectAdd::Inner instance, and every iteration it will make a task call to the VectAdd::Leaf leaf instance in level 0 (the LS/SPE) of the machine. As dictated by Sequoia's task calling semantics, each task call will result in (1) the subtask's input arguments, which are subarrays of A and B, being copied (i.e. transferred via DMAs) to the LS, (2) the VectAdd::Leaf task instance executing on those input subarrays ro produce its output subarray, which will (3) be copied back to main memory once the VectAdd::Leaf instance has completed. Conceptually, this is a "push" model of execution, with the PPE sending data to the LS every iteration, causing a SPE thread to execute the leaf instance on the data, and then retreiving the results from the LS once the leaf instance is done.
As it turns out, this program will perform very poorly on Cell if mapped this way, since the amount of time that the VectAdd::Leaf instance takes to perform a single size-T vector add operation (which it does every iteration of i) will be dwarfed by the amount of time that it takes for the PPE to transfer the data and make the SPE thread invocation each iteration; this particular mapping of the vector add program will be communication-limited.
An important part of getting the optimal performance out of a Sequoia program is choosing the "right" mapping for a given target machine. An second way that the same vector add Sequoia program could be mapped to the Cell processor is illustrated below.
|
Here, the mappar loop (over i) is mapped onto level 0 of the machine rather than level 2; aside from the change to the i loop, this mapping is identical to the first mapping.
Visualize how this second mapping of the program will execute on Cell. The PPE will make a single thread call to the SPE to initiate the execution of the mappar loop, and from that point on, the SPE will be able to execute all iterations of the mappar without any intervention from the PPE. As opposed to the first mapping that was shown, this represents a "pull" execution model. Each iteration, the SPE will fetch a subarray from memory, execute the VectAdd::Leaf instance on it in the LS, and then store the result back when the leaf instance is done.
This is a much more efficient Cell mapping of the vector add program, as the PPE-SPE thread invocation only happens once, at the start of the mappar, rather than every iteration. Further, as the mappar is executing in level 0, it can be parallelized over the machine's SPEs (this is the "SPMDx8" mapping directive in the diagram), and it can be software-pipelined ("SWP"), also referred to as double-buffering in this context, allowing the DMA transfers from one iteration to be overlapped with the SPE leaf instance execution of another iteration.
Note that the value 8192 was chosen to fit Cell's LS capacity, given the second mapping strategy. The LS must stage three subarrays of floats (A, B, and C), each of length 8192, which collectively consume 3*8192*4 bytes = 96KB. Since the loop is being software pipelined with a depth-2 pipeline, two iterations are being run in parallel, and both of these iterations need 96KB of space to stage their data, resulting in a total demand of 192KB. Cell has 256KB total space in an LS, so this will leave some room for the code and stack. As a general heuristic on Cell, the programmer should make all their subarrays that will be staged in the LS as large as possible without exceeding its capacity, and it's not uncommon for the programmer to try out a few different sizes for the LS subarrays before deciding which yields the best program performance.
XML Mapping File for the Cell Target
Note: The XML mapping file interface will soon be replaced with a new interface which is much less verbose and which is more convenient. The XML mapping example provided here will be replaced with an example that uses the new mapping interface when it is ready.
The XML mapping directives for targeting this program to Cell are as follows; these correspond to the second, more efficient, mapping presented above.
<?xml version="1.0"?>
<mapping>
<machine filename="cell.xml"/>
<entrypoints>VectAddTopInst</entrypoints>
<taskmapping name="VectAdd">
<instance name="VectAddTopInst">
<variant>Inner</variant>
<level>2</level>
<tunable name="T">8192</tunable>
<ctrl>
<level>0</level>
<loop name="i">
<spmd>
<fullrange> 0, 8 </fullrange>
<ways> 8 </ways>
<iterblk>1</iterblk>
</spmd>
<swp> 2 </swp>
</loop>
<callsite name="VectAdd">
<target name="VectAddLeafInst">
</target>
</callsite>
</ctrl>
</instance>
<instance name="VectAddLeafInst">
<level>0</level>
<variant>Leaf</variant>
</instance>
</taskmapping>
</mapping>
|
A Sequoia program expresses an abstract algorithm, while a Sequoia program plus a target-specific set of mapping directives represents a concrete implementation of the program. Multiple different target-specific mappings may be provided, enabling the same Sequoia program to be run on different machines. For example, a P4 cluster mapping specification would result in the Sequoia compiler emitting an implementation of the program for a P4 cluster.
Note that at no point did the programmer need to deal with low-level details such as DMAs in their program; they simply express the mapping of their abstract algorithm to a machine, and the Sequoia compiler and runtime handle all of the machine-specific mechanisms "under the hood".
Linking to an External Program
The mapped Sequoia vector add program constitutes an efficient implementation of the vector add algorithm on a specific target processor. The programmer can call into this Sequoia code from their external application, effectively using the Sequoia vector add program as a module or library in a larger application. The Sequoia API section details this calling convention.
A 2D convolution is simply the application of a mask to a 2D image, conceptually the same as a "blur" operation in computer graphics. The prototype of a Sequoia task which implements a 2D convolution is as follows.
void task Conv2D(in float A[M+U-1][N+V-1], in float H[U][V], out float C[M][N]); |
The output image, C, is computed from the input image, A, and the convolution mask, H, according to the following expression.
|
A 2D convolution problem, producing an output image of size M*N, can be decomposed into a set of parallel 2D convolution subproblems, each computing a non-overlapping region of the output image of size S*T. This is the decomposition that is used to implement this algorithm in Sequoia.
Note that in this formulation of 2D convolution, a size (M+U-1)*(N+V-1) input image is used to compute the size M*N output image, and similarly, a size (S+U-1)*(T+V-1) input subimage is used to compute each size S*T output subimage. In this formulation, the input subimages overlap with each other, even though the output subimages don't.
Sequoia Source Code
The abstract Sequoia implementation of a 2D convolution is as follows.
void task Conv2D(in float A[M+U-1][N+V-1], in float H[U][V], out float C[M][N]);
void task<inner> Conv2D::Inner(in float A[M+U-1][N+V-1], in float H[U][V], out float C[M][N])
{
tunable S, T;
mappar (unsigned int i = 0 : (M+S-1)/S) {
mappar (unsigned int j = 0 : (N+T-1)/T) {
Conv2D(A[i*S;S+U-1][j*T;T+V-1], H, C[i*S;S][j*T;T]);
}
}
}
void task<leaf> Conv2D::Leaf(in float A[M+U-1][N+V-1], in float H[U][V], out float C[M][N])
{
for (unsigned int m = 0; m < M; m++) {
for (unsigned int n = 0; n < N; n++) {
C[m][n] = 0;
for (unsigned int u = 0; u < U; u++) {
for (unsigned int v = 0; v < V; v++) {
C[m][n] += H[u][v] * A[m+u][n+v];
}
}
}
}
}
|
In the above implementation, the tunable values that describe the subarray sizes, S and T, don't need to evenly divide the output image size parameters, M and N respectively, as both the array block expressions and the mappar iteration bounds properly handle this case.
XML Mapping File for the Cell Target
Note: The XML mapping file interface will soon be replaced with a new interface which is much less verbose and which is more convenient. The XML mapping example provided here will be replaced with an example that uses the new mapping interface when it is ready.
The XML mapping directives for targeting this program to Cell are as follows.
<?xml version="1.0"?>
<mapping>
<machine filename="cell.xml"/>
<entrypoints>Conv2DMainInst</entrypoints>
<taskmapping name="Conv2D">
<instance name="Conv2DMainInst">
<variant>Inner</variant>
<level>2</level>
<tunable name="S">32</tunable>
<tunable name="T">256</tunable>
<data>
<array name="H">
<arraysizeprecond type="exact" name="h">5, 5</arraysizeprecond>
</array>
</data>
<ctrl>
<level>0</level>
<loop name="i">
<spmd>
<fullrange> 0, 8 </fullrange>
<ways> 8 </ways>
<iterblk>1</iterblk>
</spmd>
</loop>
<loop name="j">
<swp> 2 </swp>
</loop>
<callsite name="Conv2D">
<target name="Conv2DLeafInst">
</target>
</callsite>
</ctrl>
</instance>
<instance name="Conv2DLeafInst">
<variant>Leaf</variant>
<level>0</level>
</instance>
</taskmapping>
</mapping>
|
This documentation section is yet to be written. For now, see the README inside the release download for installation and usage instructions.