Mapping Control

Just as data objects in a Sequoia program must be mapped to physical storage locations in the target machine, so must control statements be mapped to processors that will execute them.

Sequoia draws a distinction between control and computation in a program, though the distinction is really a difference of the intent of the code, rather than its syntax.

  • A Sequoia leaf task encapsulates a program's computation, in that the intent is that the code in a leaf task will be executed on the efficient processing elements of the target machine, such as the SPEs in Cell.
  • Sequoia inner tasks can also use arbitrary C expressions and control flow statements, including C loops (for, while, etc.) arithmetic operations ("+", "-", etc.), and in addition, they can use Sequoia's iteration constructs. In the Sequoia world, all of these inner task statements are generally referred to as being control, as the intent is that the logic in an inner task will serve the purpose of decomposing a problem to a small enough size that it will fit into a local memory and a leaf task will be able to perform the "actual computation" of the program.

All inner task control constructs must be mapped to a control processor in the target machine. This is the processor that will execute the statements.

Current Mapping Interface

It would be unwieldy and impractical for a mapping directive to be required for every operation in an inner task, such as the addition and assignment statement "int x = y + z;", and yet statements such as this must be executed somewhere. A general method that would allow the programmer to conveniently map control operations as a group is under development, however there is a mapping interface supported that provides a working solution.

The current mapping interface that Sequoia supports handles the mapping of control by associating a control level with every Sequoia iteration construct (mappar, mapseq, and mapreduce), and then implicitly placing every other type of control operation in the same control level as its "closest" parent iteration construct. The mapping interface will assign a memory level in the target machine to each such Sequoia iteration construct, and this will cause the Sequoia compiler to emit code for that loop into a file that will be executed by a control processor at that memory level. The exception to this rule is a leaf task call; leaf tasks are always "controlled" by the level-0 controller.

If a Sequoia loop that is running on a control processor in level L contains a nested Sequoia loop whose control level is set to level L-1, then this will result in the level-L processor calling into the L-1 processor to execute the contained loop; this call will be emitted as the appropriate communication operation on the target machine by the Sequoia compiler. For example, on a Cell machine, if a mapseq that is running in the PPE has a nested mappar inside it that is set to run in the SPE level, then each PPE mapseq iteration a thread call will be made from the PPE to the SPE to execute the nested mappar contruct.

In the matrix multiplication running example, the control level of the Sequoia iteration constructs would be mapped as in the following updated task instance call graph diagram.

Sequoia Loop Transformations

In addition to being assigned to a memory level, mapping a Sequoia loop (mappar, mapseq, or mapreduce) to a machine involves specifying any (or none) of the following set of loop transformations that will be performed on the loop.

  • SPMD: If the loop is mapped to a memory level, such as level 0 in the Cell model being used in the running example, that has parallel control processors, then the loop can be parallelized over these processors. The Sequoia compiler/runtime will assign an equal share of the loop's iterations to each processor. Note that only mappar and mapreduce can be parallelized; mapseq cannot. This process is also referred to a "SPMD-izing" the loop, since the same program code will be run on each of the parallel control processors; that is, they will all run the same loop, with different processors executing different iterations.
  • SWP: A loop can be software-pipelined (also referred to as multi-buffered), a transformation which can enable asynchronous operations to overlap with each other by running multiple iterations simultaneously (if the hardware supports asynchronous operations). An example of the utility of this is software-pipelining level-0 loops in the running example Cell target; the hardware supports asynchronous DMA operations that can be issued by the SPE processors, and if a level-0 parallel loop is software-pipelined, then the DMA operations from one iteration can potentially be overlapped with leaf task executions from other iterations, allowing the program to effectively utilize the machine's resources.

Note that if a loop is software-pipelined, then the amount of space that each iteration has to work with will be reduced, since multiple iterations will be running in parallel.

In the matrix multiplication running example, these transformations would be applied as follows, yielding a fully mapped/instantiated program, with all mapping details having been filled in. This mapped program represents a concrete implementation of the MatMul abstract Sequoua code.

There are 8 parallel SPEs in level 0 in the Cell machine model, and thus the program will be parallelized 8 ways over these processors, though this is acheived via a 2D parallelization mapping. Further, at both the disk-to-memory and memory-to-LS interfaces there are asynchronous transfer operations provided by the underlying hardware/OS, and so at each interface in the program a loop was software-pipelined to acheive overlap between computation and communication.