Unleashing the power of the Cell Broadband Engine
16 Nov 2005
This paper from the MPR Fall Processor Forum 2005 explores programming models for the Cell Broadband Engine (CBE) Processor, from the simple to the progressively more advanced. With nine cores on a single die, programming for the CBE is like programming for no processor you've ever met before. Read why.
The Cell Broadband Engine (CBE) Processor offers the potential for increased processor performance for a broad variety of applications. However, coming anywhere close to the theoretical performance capability of the processor requires a good understanding of the processor's capabilities, and the choice of a programming model which matches the processor's architecture.
This paper reviews the basic architecture of the CBE processor and some of the programming models which fit well with its design.
Figure 1. The CBE block diagram

The CBE processor contains a PowerPC® processor element, used as a primary processor, and eight "synergistic" processor elements. These are called the PPE and SPEs, respectively (see the sidebar on acronyms). The CBE processor also provides a great deal of internal bandwidth. Each of the eight SPEs has 256KB of local storage for code and data, and 128 registers, each 128 bits wide. The instruction set for the SPEs is designed to favor SIMD processing. The SPEs have no hardware cache of main memory.
Because each SPU sits so close to its 256KB of local store, the abstract similarity to a cache is easy to see. SPE programmers can manage the local store to keep frequently used pieces of data close at hand. However, from the hardware-architecture point of view, they are not the same. Caches hold temporary copies of physical memory; the local storage on each SPE is not associated with any region of physical memory. It is a private, non-coherent, local store. Data can be transferred directly from one SPE's local storage to another, without going through physical memory. The internal bus, called the Element Interconnect Bus, has a bandwidth of roughly 100GB/sec; for more information on the ways data moves around the CBE, see the related article, "Unleashing the Cell Processor: The Element Interconnect Bus," to be published next week.
Each SPE can communicate with the PPE through mailboxes, which are registers available to a given SPE and the main processor. The three mailboxes are: an inbound non-interrupting mailbox, an outbound non-interrupting mailbox, and an outbound interrupting mailbox. Interrupts allow quick and efficient notification and communication.
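As an illustration, the SPE side of a simple mailbox exchange might look like the following sketch. It assumes the channel intrinsics from the Cell SDK's spu_mfcio.h header and the SDK convention that an SPE program's main receives speid, argp, and envp; the echo behavior is purely illustrative.

```c
/* SPE side of a mailbox exchange: a minimal sketch assuming the
   spu_mfcio.h channel intrinsics from the Cell SDK. */
#include <spu_mfcio.h>

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    /* Block until the PPE writes to our inbound mailbox... */
    unsigned int value = spu_read_in_mbox();

    /* ...then reply through the non-interrupting outbound mailbox
       (this stalls if the PPE has not drained a previous entry). */
    spu_write_out_mbox(value + 1);
    return 0;
}
```

On the PPE side, the corresponding libspe calls would be spe_write_in_mbox() to post a value and spe_read_out_mbox() to collect the reply.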
Several kinds of parallelism are available when developing applications for the CBE. For starters, both the PPE and the SPEs have SIMD instructions available, so a single instruction can perform multiple simultaneous operations. They are also superscalar, capable of executing two instructions per clock cycle. This level of parallelism is already familiar to PowerPC developers from the VMX/AltiVec instructions available on recent PowerPC processors.
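For instance, a single SPE instruction can operate on four 32-bit floats at once. Here is a minimal sketch, assuming the vector types and intrinsics from the SDK's spu_intrinsics.h header (the saxpy4 routine itself is illustrative):

```c
/* Four single-precision multiply-adds in one instruction: a minimal
   SIMD sketch assuming the spu_intrinsics.h header from the Cell SDK. */
#include <spu_intrinsics.h>

vector float saxpy4(float a, vector float x, vector float y)
{
    /* spu_splats replicates the scalar across all four lanes;
       spu_madd computes a*x + y element-wise. */
    return spu_madd(spu_splats(a), x, y);
}
```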
Each processing element can work on a different ongoing task, which allows for task-level parallelism. The PPE is dual-threaded, and there are eight SPE cores, allowing a total of ten tasks at once (two on the PPE, and one on each SPE). Each task, in turn, could be using SIMD instructions to process large amounts of data.
While all of this processing is going on, the DMA engines (MFCs) on each SPE can also be moving data around. This is a separate component of the architecture, and need not prevent the processors from operating on the data already available to them. Thus, the CBE on-chip cores don't need to spend much time moving data around.
Finally, you can have multiple CBE processors in a system, or even multiple systems in a cluster. This level of parallelism is fairly well understood and is not specific to Cell. For more on data coherency, see "Unleashing the Cell Processor: The Element Interconnect Bus," to be published next week.
The array of options here is potentially bewildering. You should adopt programming models designed to make efficient use of the available resources. A good understanding of the ways in which you can use the CBE processor makes it much easier to develop efficient, reliable code that can be delivered on a useful schedule. A good programming model makes good use of the huge computational capacity and bandwidth the CBE provides, dividing work up among the various processing components available. Furthermore, use of consistent models allows the development of language constructs, libraries, frameworks, or even operating system support to simplify development.
This paper reviews both "small" (local-store only) and "large" (using external code or data) single-SPE programming models, as well as multi-SPE parallel models. At the end, it also describes the multitasking aspect of sharing the SPEs.
The role of the PPE
Before getting into the details of programming the SPEs, you need to understand the role of the PPE. The PPE is a 64-bit PowerPC core, designed to run general-purpose code and to coordinate the SPEs. The CESOF object format (see the Acronyms sidebar) is used to store a chunk of code for delivery to an SPE; each SPE image is associated with a handle which can be used at runtime to load specific code on an SPE. The PPE handles memory mapping and exception handling, loads code onto the SPEs, and starts and stops their execution. In general, but not always, scheduling of the SPEs is left up to the PPE, and OS services (such as file I/O) also run on the PPE.
Small single-SPE programming models
The simplest way to use a single SPE is to load a single chunk of code on it, along with the data it needs to process, and let it run. The SPE runs entirely out of its local data store, with no access to (or bandwidth load on) main memory. Many workloads can be handled entirely this way, but code and data together must fit in 256KB.
In this model, input and output are always explicit. The SPE program receives its input as arguments to its main function and returns an exit status. It can also communicate through mailboxes or system calls. This model might be supported by an IDL. SPE executables are compiled and linked separately, then embedded as read-only data in the PPE executable using the CESOF format. (The CESOF object is bundled into a section of the ELF object file for the PPE executable.) At runtime, the PPE loads and initializes the SPE, then starts it running on the code.
See also the "SPU Application Binary Interface Specification," listed in the Resources section of this document.
The following code listings show how this works:
Listing 1. A program to run on the SPE
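A minimal sketch, assuming the SDK convention that an SPE program's main receives speid, argp, and envp as 64-bit values chosen by the PPE; the computation shown is illustrative:

```c
/* spe_program.c -- built with spu-gcc and embedded in the PPE
   executable (for example, with the embedspu tool). */
#include <stdio.h>

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    /* argp and envp are values (often effective addresses) passed
       in by the PPE when it starts this program. */
    printf("SPE 0x%llx started with argp=0x%llx\n", speid, argp);
    return 42;   /* the exit status is reported back to the PPE */
}
```

Listing 2. PPE code to load and run the SPE program

A matching PPE-side sketch, assuming the libspe 1.x calls spe_create_thread() and spe_wait(); error handling is kept minimal:

```c
/* ppe_program.c -- built with gcc on the PPE, linked against libspe. */
#include <stdio.h>
#include <sys/wait.h>
#include <libspe.h>

/* The handle generated when the SPE executable is embedded as
   read-only data in the PPE binary. */
extern spe_program_handle_t spe_program;

int main(void)
{
    int status = 0;

    /* Load the CESOF image onto a free SPE and start it running. */
    speid_t id = spe_create_thread(0, &spe_program, NULL, NULL, -1, 0);
    if (id == 0) {
        perror("spe_create_thread");
        return 1;
    }

    /* Block until the SPE program exits. */
    if (spe_wait(id, &status, 0) < 0) {
        perror("spe_wait");
        return 1;
    }
    printf("SPE returned %d\n", WEXITSTATUS(status));
    return 0;
}
```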
Note that the result of spe_wait and the status code returned by the SPE program are distinct; if spe_wait succeeds, then status will be filled in with the value returned from the SPE program. Interaction between the SPE and the PPE is not needed until execution completes. The spe_wait operation blocks until the SPE program exits.
Another option is to load multiple pieces of code and data to a single SPE; this can be the basis of a primitive multitasking environment on the SPE, useful for multiple small jobs which do not need a dedicated processor. You cannot provide memory protection between tasks running on a single SPE, and such multitasking is necessarily cooperative, not preemptive. However, in cases where the tasks are small enough, and complete reliably enough, this can dramatically improve performance, freeing up other SPEs for dedicated tasks. The cost of transferring a small program to the SPE might be too high in some cases, however.
Large single-SPE programming models
When all your code and data together cannot fit within 256KB, you might need to use a "large" model. In this model, the PPE reserves chunks of effective address space for use by the SPE program; memory mapping is set up to give the SPE a secondary memory store, which it accesses using the DMA engine (MFC).
You can use this model in many ways. One is the streaming model: load a regular chunk of sequential data, modify it, write it back to main memory, and repeat. Another is to use the local store as a temporary buffer, copying randomly accessed data back and forth as needed.
In some cases, the same techniques can be used for code, not just data: the primary code loaded on the SPE can use overlay segments, loaded from main memory as needed. The compiler can generate automatic overlaying code to handle this case. CESOF, and the toolchains in use, allow SPE code to refer to objects defined in the effective address space.
The simple LS-resident multitasking possible on a single SPE becomes somewhat more flexible when combined with automatic loading and storing of job code and data; a small kernel running on the SPE can load tasks, or swap them out when blocked or completed, allowing the SPE to manage multiple tasks from a job queue. Once again, this is still not preemptive.
One consideration when using DMA to reach the effective address space from an SPE is that it imposes significant latency. With that in mind, prefetching (also known as double-buffering or multibuffering, particularly in game programming) becomes an important technique. If, while buffer N is being processed, buffer N-1 is being written out and buffer N+1 is being read in, the processor can execute continuously, even if the time required to transfer the data is a substantial fraction (up to half) of the time it takes to perform an operation.
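A minimal double-buffering sketch for SPE code, assuming the DMA intrinsics from the Cell SDK's spu_mfcio.h header; the block size, the process() routine, and the ea/nblocks parameters are illustrative, not part of any standard interface:

```c
#include <spu_mfcio.h>

#define BLOCK 16384   /* bytes per block: 16-byte multiple, 16KB DMA maximum */

static volatile char buf[2][BLOCK] __attribute__((aligned(128)));

extern void process(volatile char *data, unsigned int size);

void stream(unsigned long long ea, unsigned int nblocks)
{
    unsigned int i, cur = 0;

    /* Prime the pipeline: start fetching block 0 on tag 0. */
    mfc_get(buf[0], ea, BLOCK, 0, 0, 0);

    for (i = 0; i < nblocks; i++) {
        /* Kick off the fetch of block i+1 on the other tag... */
        if (i + 1 < nblocks)
            mfc_get(buf[cur ^ 1], ea + (i + 1) * (unsigned long long)BLOCK,
                    BLOCK, cur ^ 1, 0, 0);

        /* ...then wait only for block i before computing on it. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process(buf[cur], BLOCK);

        cur ^= 1;   /* the MFC keeps moving data while we compute */
    }
}
```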
Parallel programming models
It is possible to combine the work of multiple SPEs, but synchronization then becomes an issue. Options include MFC atomic update commands, mailboxes, SPE signal notification registers, events and interrupts, or even just polling of shared memory. As with large single-SPE models, a compiler (for example, one supporting OpenMP) might transparently manage access to shared memory with proper critical-section access controls.
The job queue is a popular model for programming the CBE: any idling SPE can obtain another task quickly, providing automatic load balancing. One special case that optimizes particularly well for regular, sequential chunks is the streaming model. If a given piece of data can be processed quickly enough by a single SPE, but there is too much sequential data of that sort for a single SPE to keep up with, multiple SPEs can be assigned to process the array of data, pulling new blocks of data off a FIFO queue and processing them simultaneously.
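One way to implement the shared part of such a queue is with the MFC atomic update commands. The following is a hedged sketch of an SPE pulling the next job index off a shared counter in main memory, using the getllar/putllc load-and-reserve pair from spu_mfcio.h; the counter layout and the counter_ea parameter are illustrative, and counter_ea must point to a 128-byte-aligned line:

```c
#include <spu_mfcio.h>

/* The reservation granule is a 128-byte line; here the counter
   occupies its first word. */
static volatile unsigned int line[32] __attribute__((aligned(128)));

unsigned int next_job(unsigned long long counter_ea)
{
    unsigned int index;
    do {
        mfc_getllar(line, counter_ea, 0, 0);  /* load line, set reservation */
        mfc_read_atomic_status();             /* drain the getllar status */
        index = line[0];
        line[0] = index + 1;
        mfc_putllc(line, counter_ea, 0, 0);   /* store only if still reserved */
    } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS);
    return index;                             /* retry if reservation was lost */
}
```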
Another option for streaming is the pipeline, where each SPE handles part of a task, processing the output from the previous SPE. This might make heavy use of direct LS to LS DMA, bypassing main memory to reduce bandwidth use. This allows SPEs to perform tasks which need too much code to leave any room for efficient data handling. The code can be split up among multiple SPEs, allowing data to be handed across quickly. This is a good example of the use of the DMA channels for message-passing. On the down side, load balancing is much harder than it is with the previously mentioned streaming setup. If a given chunk of data cannot be processed quickly enough by a single SPE, code overlays might be a better fit.
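To set up such a pipeline, the PPE can tell each stage where its successor's local store sits in the effective address space, so that stage can mfc_put() results directly there. A sketch, assuming the libspe 1.x calls spe_get_ls() and spe_write_in_mbox(), and assuming a 32-bit PPE address space so the address fits in a single mailbox entry:

```c
#include <libspe.h>

void link_stages(speid_t ids[], int n)
{
    int i;
    for (i = 0; i < n - 1; i++) {
        /* Effective address of the next stage's mapped local store. */
        void *next_ls = spe_get_ls(ids[i + 1]);

        /* Hand it to stage i through the inbound mailbox. */
        spe_write_in_mbox(ids[i], (unsigned int)(unsigned long)next_ls);
    }
}
```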
SPE multitasking
An OS running on the PPE can manage and allocate SPEs, providing and arbitrating access to them. Making the SPEs available like this allows preemptive multitasking of more tasks than there are SPEs to run them. Running tasks or threads can be mapped onto SPEs, paused and copied out, resumed, and so on. Because the context-switch cost is relatively large (the whole 256KB of local store, the 128x16-byte register file, and the DMA command queues must be saved and restored), a run-to-completion policy is strongly favored, but the option of preemption is there. Preemption can also provide memory protection between tasks, because a task is swapped out completely before another task is loaded onto the same SPE.
Development considerations
Everything you've ever been told about application optimization applies even more strongly to the CBE architecture. Choice of algorithms, interactions between algorithms, and related issues are crucial to effective development. Budget some time for experimentation; partition the algorithm and program, see how it works, and be prepared to try again. Start with the code that will run on the PPE, then offload specific tasks to SPEs. Switch to SIMD code on the SPEs if you need to, but be sure your overall algorithm partitioning is working before you spend a lot of time vectorizing code that will have to be rewritten anyway.
When targeting a CBE processor, you have to budget both computation and bandwidth. There's plenty of both, but the sheer volume of computation available can swamp your bandwidth, and the sheer volume of data available can swamp your processing. If you have bandwidth crunches, look for ways to calculate data rather than copying data, and look for ways to do more calculations before passing the data onward.
Look for bottlenecks when benchmarking your code. The PPE can easily become a bottleneck if you rely too heavily on precalculating what you want the SPEs to do. If an operation is taking too long on a single SPE, look into splitting the work.
Conclusion
A good choice of programming model and a clear understanding of the many models possible on the CBE architecture can reduce development cost while improving performance. Abstractions, such as streaming and job-queue programming models, and development tools are crucial to reliable and efficient development. Don't be afraid to mix programming models; you may find the best design has two SPEs running unique tasks, two SPEs streaming a common task, and a four-SPE pipeline handling a particularly complicated task. You are not required to use all the SPEs in the same way. New applications may suggest new programming models; budget time for experimentation. Streaming can emulate the function of pipelining, although it might impose a slight additional cost.
The CBE architecture makes it virtually certain that a fairly easy development effort will result in an impressive performance gain over a more traditional processor architecture, but achieving top performance is difficult.
This article was adapted by Peter Seebach, working from the original presentation "Unleashing the power of Cell Broadband Engine: A programming model approach," presented at MPR Fall Processor Forum 2005 by Alex Chow of IBM. Peter would like to thank Tim Kelly, Alex Chow, and Daniel Brokenshire for technical and editorial review during the writing process.
Resources
Learn
- This paper is based on a presentation given at Fall Processor Forum 2005: The Road to Multicore. See the rest in this series.
- The Cell Broadband Engine project page at IBM Research offers a wealth of links, diagrams, information, and articles.
- Introduction to the Cell multiprocessor (IBM Journal of Research and Development, 2005) has a good discussion of the history of the CBE project.
- The IBM Semiconductor Solutions Technical Library Cell Broadband Engine documentation section lists specifications, user manuals, and articles of general interest.
- Check out the "Unleashing the power" discussion of programming large FFTs on CBE at Power.org.
- The SPU Application Binary Interface Specification V1.3 discusses register usage and calling conventions, data type sizes and alignment, low-level system and language binding information, information on loading and linking, and coding examples. This specification defines the system interface for SPU-targeted object files to help ensure maximum binary portability across implementations.
- Find related articles, downloads, discussion forums, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things CBE.
- Keep abreast of all the CBE news: subscribe to the Power Architecture Community Newsletter
Get products and technologies
- Get CBE: Contact IBM E&TS.
- Get the alphaWorks Cell Broadband Engine downloads.
- See all Power Architecture-related downloads on one page.
Discuss
- Participate in the discussion forum.
- Take part in the IBM developerWorks Power Architecture Cell Broadband Engine discussion forum.
- Send a letter to the editor.
The developerWorks Power Architecture editors welcome your comments on this article. E-mail them at dwpower@us.ibm.com.