Hello, everyone. I'm Hema Reddy. I've been with IBM for about six and a half years now, and I've been working on the Cell processor for close to one and a half to two years. I moved over from System i and p series, where I used to write device drivers and firmware microcode, and then took a position programming on the Cell processor. That being said, I'll quickly go over what exactly is included in the development kit. Right now, this is the most accessible vehicle available to everyone, because not everyone has a Cell blade, and I think the Sony PlayStation 3 is not available yet in India. So if you have to buy it over here, I guess you have to go to eBay or something like that, and God forbid something goes wrong, there's no help or support. So far, Sony and Toshiba have not committed on whether those plans are going to change. So the only form factor that's available in India would be purchase of a blade; that's the QS20 product offering from the IBM website. Meanwhile, if you want to get your feet wet, we have a very useful kit, the development kit, that's available via the developerWorks website, through the links that Duke showed earlier in his presentation. The reliability of its results is very close to an actual Cell blade, up to 99.99% accuracy. In other words, if you have some code that is compiled and built on the Cell system simulator and you're getting some performance results out, it cannot be a big application, because obviously the Cell system simulator can only take you so far. But just to test out general routines and algorithms, the results are pretty accurate. When you take the same code, the same executable, and run it on a Cell blade, you get comparable results. And talking of Cell blades, we do have something called the virtual loaner program. So if you are curious about testing your code, if you have some algorithm that you want to run on a Cell blade, then do get in contact with us.
We can get you set up with the loaner program so you can get a taste of how the hardware runs. So today I will be covering the development kit and also some basic programming concepts, some communication mechanisms on Cell, why it's different from, say, Intel multi-core architectures, the strong points, and some synchronization mechanisms. Obviously, when there's communication involved in a real application, there will be some synchronization overhead, so we'll go over that, followed by some examples on the Cell system simulator in the programming hands-on session in the afternoon. The development kit has support for x86, but it has to be Linux; there's no toolchain yet for Windows development. So the requirement is a Linux, x86, 64-bit operating system, or any PowerPC box; those are supported. It's got lots of software libraries. In the afternoon, we will be going over the directory tree structure for the simulator and see where everything is and what libraries are currently there. There are lots of workloads that have already been optimized, so in the afternoon, when you have your systems in the lab, you can look at quite a few of those optimized workloads and see what Cell code looks like. It's got support for some profiling tools and debug tools. Recently, with SDK 2.0, we released an Eclipse-based IDE environment that makes things a little more GUI-based. So for those of you who are more comfortable with GUI-based development, there is an interface available that you can use to your advantage. Talking of libraries, there are a few SIMD math libraries for math functions. This is something that was not there earlier that we recently added. There are some already-tuned functions, some trigonometric functions and other math functions, that are available that you can just directly link to and invoke in your code.
There is a feedback-directed program restructuring tool that we'll be going into in further detail, in great detail actually, tomorrow. So the typical process is: you build an application, and there is a utility that allows you to take that application from your development sandbox over to the Cell system simulator. It's a local loopback mount feature that you can use to take the executable from your development sandbox and, once the system simulator comes up (it is actually creating a virtual Cell environment), copy it over there. It's as simple as just copying it over and running it. Supported languages are C/C++. With the Fortran compiler, we're not completely there yet; the Fortran compiler is in the works, it's not released yet, it's not stable yet. And there's an assembler. And there's a full-fledged GNU toolchain. It's a standard GCC toolchain, except that it's been modified to include support for the Cell hardware. Therefore you would see ppu-gcc, and ppu-g++ for C++ code. The ppu32 variants would be for 32-bit configurations, 32-bit Linux. There should be a ppu64, but I believe in SDK 2.0 it's been linked directly, so the default for ppu-gcc is 64-bit. There are C/C++ language extensions that we can take a quick look at, the header file, after this presentation is over. That's a heavily used one. Any time standard, simple scalar code needs to be converted to Cell code, there's a list of instructions. For example, instead of doing a plus for add, you have to use the intrinsic spu_add. All these instructions are documented in the API. It's a very simple, user-friendly API that we can take a quick look at after the presentation. There's an application binary interface document and an instruction set architecture document. All of these documents are very easily available on the IBM Microelectronics website. And in general, like the regular ld tool and the as utility, whatever tools come with any development environment are available for Cell.
The XLC compiler is an IBM proprietary compiler, and now it includes support for Cell. The reason there are two compilers for Cell, GCC and XLC, is that GCC doesn't have a lot of optimization support. XLC has got a lot of different levels of optimization, O1 all the way through O5, and from O2 onwards there are some automatic optimizations that go into effect once you build with that flag: automatic loop unrolling, insertion of no-ops, code restructuring, function inlining. Some of these things happen automatically. Therefore, if you're compiling some code in the Cell software development kit on your x86 box with GCC, and you see the performance numbers (we'll go over all those in the lab sessions), and then you convert over to XLC and just build the same code with XLC, you're bound to see some performance gain, assuming there are no false dependencies in the loops. There are two different kinds of simulation models, mutually coexisting. One is functional-only simulation, for code development and debugging. Again, we'll go over all these in the lab sessions. There are also extensive tools available for performance simulation. In other words, you can look at the queues, you can look at the pipelines, you can look at the performance statistics: after you run a simple program, how many cycles did it take? What is the CPI, the cycles-per-instruction count? How many stalls were there? How many latencies were there? All those things come up in a nice graphical interface that allows you to evaluate your application and see what the performance roadblocks are. The SPE runtime management library provides all the function calls that allow you to create threads. When I say thread: as Duke went over this morning, the PPEs and SPEs mutually coexist. There's a reason why SPEs are called synergistic processing elements. Why are they synergistic? Because the PPE is basically just a control-intensive processor.
It cannot be used as a workhorse for number-crunching algorithms. The SPEs are the workhorses. So once your main application starts, the main application resides on the PPE. It looks at the data buffers, and then the PPE controls creation of all these threads. There are eight SPEs that you can use to your benefit, so the PPE will create threads, eight threads. And if it's a full Cell blade with two Cell processors, 16 SPEs, so you can create up to 16 threads. The PPE basically cannot do computationally intensive tasks; it's good at control-intensive tasks and at task switching. The SPEs are more adept at handling, say, tens of millions of loop iterations, repeated computations. But they cannot run an operating system on their own. So the PPE needs the SPEs and the SPEs need the PPE; they mutually coexist, and that's why the SPEs are called what they're called, synergistic processing elements. So there are a few runtime APIs that this runtime management library supports. The question was: how much does the programmer have to take responsibility when it comes to creating these tasks or assigning tasks to these numerous processing elements? The answer is good and bad. If you're a bare-metal programmer, a lot of control exists in your hands. In other words, say it's an image processing algorithm and there are three different stages to it; you can decide that you want four SPEs to work on one stage. So you know that your application is using up four SPEs, and you're left with four. For the second stage, if there are no dependencies between the first and second, you can use another two, because you know that this task does not have a lot of data and code to be split up between SPEs, and probably there are no dependencies. So use two, and then for the third stage, use two.
Now, that load-balancing responsibility, assigning tasks, splitting the data up, moving the data to the SPEs, is all in the hands of the programmer. So unfortunately, yes, there is a lot of responsibility, and there are a lot of design issues to be considered. The more time spent on these observations, on data movement, locality of data, dependencies, everything, the better it is once the code is executed. That being said, we are continuously working with tools vendors, and also on in-house development of tools, to take these responsibilities out of the hands of the programmer. A lot of times in the industry, people just want to write quick, high-performance applications, but don't really want to deal with: oh, God, I have to allocate memory, I have to send it over there, I have to make sure that it arrives, and then I have to write it back to main memory. So there are some tools. One example is the RapidMind development platform. We have joined hands with RapidMind. They are a new company, but they have secured quite a bit of VC-backed funding. Their programming model is basically one program fits all processing elements. In other words, the same code can run on GPUs; take the same code and move it over to Cell processors, and it'll run just fine; move it over to some Intel multi-core architecture, and it'll run just fine, and you don't have to change the code. On top of that, with this platform you can keep your same development setup. You don't need to change your compilers. All you need to do is link with their library and convert your code to the RapidMind development platform APIs; in other words, use their structures to create arrays of data and things like that. It has got a backend detector, which basically evaluates the platform to see what hardware it is.
And at runtime, it just makes sure that the executable is converted to the right format to run on that hardware. Also, they take care of splitting up data. The platform is based on data-level parallelism, and they do the optimizations automatically. So you don't really need to take care of any kind of memory management, DMA operations, or synchronization issues, nothing. You just take your scalar program and use their data types; you don't even have to parallelize it. Obviously, there are some things that you have to keep in mind, like the structuring of data: there's a way that you define and initialize the arrays so that parallelism is possible. So things like that have to be kept in mind, but it makes programming one notch easier. And there's another library that also helps. The SIMD math library has got some math functions that you can use, and there's also the Mathematical Acceleration Subsystem (MASS) library, which is an IBM proprietary library that was initially supported on Blue Gene and AIX. Now it has support on Cell too; it's just been released, and there is a prototype version out there. Again, for those of you who are involved with HPC applications that use a lot of math functions, please feel free to try this library out. These libraries have both vector and scalar functions. They're completely thread-safe. They support both 32- and 64-bit compilations. And in order to use these functions, all you need to do is link to the library, and you should be able to use all their APIs. And these are the sample packages that are there in the system simulator. We will go over in the lab sessions where all these workloads are. For example, if you just want to see what the FFT workload looks like using the Cell API, there is a workload already there which is tuned and optimized to work on a Cell blade, and it's gotten very impressive results. And that was the benchmarks slide that just went by.
There's a sample image processing code. There's a newly included software-managed cache, which is a very useful library. Using this software-managed cache, there are a few workloads that IBM has demoed at different conferences. We can show you a few demos; we have them on our laptops, sometime in the afternoon, good demos. And again, all these workloads are there in the system simulator for you to take a look at, with the source code. Samples, again, has some sample code. Then the performance utilities: we will go into detail about the SPU timing tool and how it is used for static code analysis. Currently, the output that it produces involves a lot of assembly. So if you're not from an assembly background, it might be a little hard for you, but then the option is VPA, the Visual Performance Analyzer, that Duke will cover tomorrow, which gives you an option to see the source code along with the disassembly. The Cell IDE is just based on the Eclipse development tool. So this is pretty similar; it's the same Eclipse IDE, it's just got added support for Cell development. In other words, you can write code and build your application in the IDE itself. And it's got support for debugging also. It's the standard GDB interface, the same GDB tool that's there for common Linux OSes: the same commands, the same syntax for triggering breakpoints, listing variables, all the same. These are the installation requirements for when you're ready to install the Cell development platform. These are some instructions that you can refer to, and all this material is on your CD for reference. It's also available on numerous websites. These are some dependencies. So sometimes, if you are running, say, Red Hat, you might need to upgrade a Tcl or Tk library, or download a different version of GCC. So these are some heads-up instructions that we wanted to give you.
You might need to upgrade to make 3.8, depending upon what your enterprise version of Linux is. And similarly, all these are package dependencies that you might want to watch out for. Anytime you have some kind of question or a technical issue, we have something called the developerWorks forums. A lot of people (game developers, university faculty, researchers, Cell developers, Linux kernel folks) visit these sites, and they are also actively monitored by IBM Cell developers. So you can post a question over there and you can definitely expect to get a response. Again, here is some information on the install components for x86 platforms and PPC64 platforms. This list of RPMs in SDK 2.0 is all for your reference, so that you can anytime pull up your CD and refer to it. It's a very easy way of installing and uninstalling; cellsdk is the install script, basically. So all you need to do to install the development kit is download the ISO file (there's one place on developerWorks where the ISO file is available), run the script with the install option, and that's it: your development kit is installed. And again, clear instructions for installing the Cell IDE are also here. The makefile environment: the development kit comes with a standard makefile environment. We will go over that in the afternoon also. But basically, there is a set of default makefile structures that you can use: a standard base-level makefile for any application. Say, for example, we are writing a simple application to do DMA transfers. There will be a standard base-level makefile, and there will be two directories: the SPU code resides in one directory and the PPU code resides in another. So the SPU code will have its own makefile, because it's a completely different architecture; there cannot be one makefile for both kinds of processing units.
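As a sketch, the split build just described might look roughly like this. The variable names (PROGRAM_spu, PROGRAM_ppu, LIBRARY_embed, IMPORTS) are the SDK conventions covered on the next slides; treat the exact paths and the make.footer include location as illustrative rather than definitive:

```make
# sample/spu/Makefile -- builds the SPU side and archives it
# so it can be embedded into the PPU executable.
PROGRAM_spu   := sample_spu
LIBRARY_embed := sample_spu.a
include $(CELL_TOP)/make.footer

# sample/ppu/Makefile -- builds the PPU side and links in the
# embedded SPU archive via IMPORTS.
PROGRAM_ppu := sample
IMPORTS     := ../spu/sample_spu.a
include $(CELL_TOP)/make.footer
```

The two makefiles are separate files in separate directories; they're shown in one listing here only for compactness.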
So there's another makefile for the PPU, and ultimately, in the final binary that you get, the SPU executable is embedded in the PPU binary; they're all clubbed into one binary, and that's all you need to run on the system simulator or a Cell blade. And again, there are environment variables to switch the compilers from GCC to XLC and back. And like a typical makefile environment, there's a make.footer and a make.header with all the standard environment variables. A lot of times, and this happens a lot in real applications, you cannot use those default makefiles; you can write your own makefiles in that case. Okay. These are the environment variables to switch the compilers. make.env is at the top level of the development kit, with all these variables defined, and you can change the values either in the make.env file or at the command prompt; you can do an export PPU_COMPILER=xlc to switch from GCC. And if you are using the standard default makefiles that come with the development kit, it's recommended that you export the variable CELL_TOP. This is used in a lot of sample programs, so if this variable is not set, make won't be able to find the header file locations. So do export this value, so that when you just go to the samples directory and do make, the paths are all visible to the makefiles. And these are the standard makefile variables. PROGRAM_ppu defines the name of the PPU executable; if it's a 64-bit executable, use the PROGRAM_ppu64 variable accordingly. LIBRARY_embed basically creates a link library from an SPU program, as it says, to be embedded into the final PPU executable. I want to go over why that's done later on, when we cover the basic programming concepts. Because we want the variables that are there in the PPU executable to be visible to the SPU and vice versa, we came up with something called CESOF.
That's the Cell Embedded SPE Object Format. Basically, when a PPU application is built this way, it creates a global reference with the same name as the SPU executable. Because of this global reference, anything that's there in the SPU environment will be visible to the PPU, so it's possible to move data back and forth. Again, we'll see that via program code and additional diagrams in the next presentation. Here's a sample makefile for a project. Say sample is the directory where your code will exist. There'll be an spu directory, and say, for example, the sample code name is sample_spu. The makefile will look like this: PROGRAM_spu defines the sample_spu name, LIBRARY_embed creates the archive file, and INCLUDE gives the include path. And you include the make.footer from the standard makefile environment that comes with the development kit. Similarly for the PPU: if you look in the ppu directory, there is a variable called IMPORTS, which links with the archive file that is there in the spu directory. So this is how it embeds sample_spu into the final sample executable that is produced as a result of the compilation. So finally, what's run on the simulator or a Cell blade is the sample file, but it's got the SPU side of the executable embedded in it also. Now, the Linux kernel support is similar to traditional Linux support, with just a few differences. There are a few similarities and a few differences, obviously. The PPE's Linux kernel is similar to Linux on Power, and as we went over this morning, the PowerPC core is based on the PowerPC 970, but it's a really stripped-down version of the PowerPC 970.
One of the reasons it is stripped down is basically to save on cost and heat, the power factor, and to simplify the hardware in general, so that while we're making this ascent in performance, the wattage stays low. Cell, I believe, is 32 watts, which is pretty good compared to market standards. So the Linux kernel has support for virtual memory; there's a diagram coming up later on that will explain the effective address space versus the local store address space. It has support for large pages. By default, pages are 4K, just like usual, but you can turn on large-page support using hugetlbfs. I think you can go up to 64KB, 1MB, and at maximum 16MB. So PPE code can create something called a Linux task. In a way, the application can create different Linux tasks, and these tasks can in turn create SPE threads. Now, an SPE thread is nothing but a group of data with its own main program, and once you create the thread, it just runs asynchronously. So the PPE has no control once the SPE thread is created, and it has a unique identifier, just like a normal pthread would. And all these SPE threads can be created in one group; in other words, you can create all these SPE threads and assign them to one group. Then you can query status on that group on completion of a particular task, and each SPE thread has to belong to at least one thread group. And this is a table from the programming handbook which basically defines, when we say Linux thread versus SPE task or SPE thread, how they differ. Basically, a Linux thread is just a thread running on the PPE Linux OS. For the Cell Broadband Engine, the Linux task is just a personalized way of saying that, yes, this is my task;
it can create two or three SPE threads, so it's basically modularizing your application to a certain extent. Now, the SPE thread has its own environment. Once the thread is created, it uses all the resources it has available: it has got its own program counter; it has got its own memory flow controller, which has a DMA controller inside it and a channel interface to send messages back and forth between the PPE and the SPE; there are some synchronization commands for communication from SPE to SPE; and it has got its own 128-entry, 16-byte register file. And MFC command queues are nothing but queues where you can queue up DMA commands. That's a very neat feature: if you have to issue three, or maybe ten, DMA commands at one time, they can be queued. Now, a PPE-versus-SPE comparison; let's look at a few differences. Both of them have SIMD operation support, single instruction, multiple data. In the presentation this morning, there was a chart which showed the way the data is laid out. When we talk about vectors, a vector is nothing but a bunch of consecutive memory locations. In other words, if we have an array of, say, float A[4], that's 16 bytes of memory. So instead of saying float A[4], you would say vector float A. That's basically equivalent to a float four times, 16 bytes. It's just a name given to address 16 bytes of consecutive memory locations, and there's no difference in the way they are positioned in memory; it's the same as the scalar case. It's just that when we define a vector and try to load it, it loads 16 bytes at a time. That's where the benefit comes in: compared to scalar programming, in vector programming you're not doing four separate loads, and that itself gives a huge benefit, because the fewer loads and stores you do, the better. If you're doing a computation on, say, 16 bytes of data, in a normal scalar program you would do four loads. So both the SPE
and the PPE have SIMD operation support. On the PPE side it's the VMX instruction set; VMX is Vector Multimedia eXtension, something that was created with traditional PowerPC architectures. IBM is one of the foremost, actually the foremost, in multi-core technology: the two-core POWER4 that it released was the first of its kind in the industry. And then there's the Cell Broadband Engine, and that's the reason Sony and Toshiba came to IBM: yes, you're the chip guru. If you think about the gaming consoles, the Nintendo is an IBM chip, the Microsoft Xbox is an IBM chip, and finally the Cell Broadband Engine is an IBM chip. For the Cell Broadband Engine, we are using the standard VMX instruction set, so the PowerPC unit does have VMX instruction support; some of the instructions from the traditional PowerPC 970 may not be supported, but a lot of them are. The SPEs have a different kind of vector instruction support available, and again, we'll be going over some of those instructions very soon. And because they're so different architecturally, they need different compilers. The PPE has 32 SIMD registers, whereas the SPE has got 128 16-byte-wide registers. The SPE has got a unified register file, whereas the PPE has got different registers for handling fixed-point computations, floating-point arithmetic, and the vector multimedia registers. Load latency on the PPE is variable, because there's an L1 cache and an L2 cache and then there is main memory, so any load operation cannot be deterministic; you don't know where the data is going to come from, whether there will be a cache miss or a hit. Versus the SPE, which is a very deterministic environment: no exceptions, no caching, very simple. In other words, computationally calculating the performance number for any given application is very straightforward; in theory, you can construe what the performance number will look like. The SPE is currently optimized for single-precision floating point. It does not have
the same strength in double-precision floating-point support. Currently, the single-precision floating-point support is very good, over 200 gigaflops. The double precision is also good, comparable to the other platforms, but it doesn't give as good a number as single precision yet. We have a roadmap we're working on; in the next few years we'll have better double-precision floating-point performance. As of now it's just comparable, but we cannot boast of a 30x or 20x performance benefit when it comes to double-precision applications. Again, this is a table; most of the data types are supported: unsigned char, signed char. It's just that every variable will be preceded by the word vector. In other words, a vector unsigned char will hold 16 bytes of data, so when you create a vector unsigned char, keep in mind that you're creating 16 consecutive bytes in memory. Similarly, unsigned long long support is there, and double. Standard data types are also supported on the SPEs. It's just that when we say vector unsigned short, there are eight short variables in that vector; when we say vector unsigned char, there are 16 bytes of scalar data inside that one vector; and when we say vector double, there are two double variables inside that vector. And this is the language extension. When I say language extension, it's just a different API, an instruction set that you have to use if you want to convert a scalar application to a vector application. Now the communication part. There are three different mechanisms by which you can communicate with the SPE. Take a real-world example: you have your main program that you have built on the PPE, and there are these data buffers that you need to move over to the SPE to do some really rigorous computation. In order to send this data from the PPE to the SPE, we have something called DMA. It's nothing but a simple command which says: from this effective address in main memory, send this many bytes over to the local store in this SPE, SPE 0 through 7. So pick an SPE and
send this many bytes. Once those bytes are there, do the computation on the SPE, then do an mfc_put, a memory flow controller put: once that data buffer is used, put the results back in main memory. Sometimes what you could do is use shared I/O buffers to save buffer space: when you send a buffer of data to the SPE, once the results are computed, you overwrite that buffer and send it back to the PPE. When we use things like this, there comes a need for synchronization. Sometimes you want to know if a particular section of data in main memory has already had some task done on it. One SPE might be dependent on stage one of image processing, or trying to do something on a particular data set, and, say, SPE N cannot even start until that stage is done. So then there need to be synchronization variables or mailbox messages, a simple 32-bit message that you can send over a channel interface to the SPE to say: I'm done, and then you can trigger off the next stage. There are two mailboxes; one is the SPU write outbound mailbox. Whenever we say write or read, inbound or outbound, get or put, the reference point is always the SPE. So if I say put, it means putting from the SPE into the main memory that the PPE is seeing. The PPE accesses main memory with loads and stores, from main memory over to the register file that it has. The SPE accesses main memory only via DMA operations. In other words, the SPE cannot do loads and stores to main memory directly; it needs to get the data from main memory into the local store, and instruction fetch happens from the local store too. There is also something called signal notification channels, and you can configure them to be atomic. And this illustrates what I just mentioned: when we say get, it basically means transferring data into an SPE, so get into the SPE; when we say put, we mean putting the data back toward the PPE. And this is the storage domain picture; it puts into perspective whatever we
have discussed so far. So here is the PPE; it has got its own processing units and its own register files. Here is the main memory, the DRAM memory. And this is one typical SPE. The SPE has got some processing units, the arithmetic processing units, and a local store, which is just flat memory, 256 kilobytes of flat memory. It is single-ported: there is only one read/write port, so all the loads and stores, DMA operations, and instruction fetches all happen through that one port. Now, the way the PPE communicates: this is the EIB, the element interconnect bus, a very, very high-bandwidth bus. This is a great asset to the whole architecture, because this is the bus that makes it possible to use Cell in a cluster environment and form the basis of supercomputing nodes. And it is a fully coherent bus; the whole DRAM memory is fully coherent. The way we implement coherency in memory and on the bus is via the snoop protocol. The way the PPE communicates with the SPE is by writing to these memory-mapped I/O registers; it does reads and writes to the memory-mapped I/O registers to send or receive data. The way the SPE sends data to the PPE is by writing to its DMA memory flow controller through the channel commands. This is also the way the SPE communicates with other SPEs, and the PPE with the other SPEs as well. So the PPE just directly does loads and stores from the DRAM memory; the SPE has to do a DMA operation, get the data into the local store, and then read it from the local store. The local store is 256 kilobytes of ECC-protected, single-ported, non-caching memory. Instruction prefetches are 128 bytes per cycle; data access bandwidth is 16 bytes per cycle. And all loads and stores on the SPE have to be aligned on 16-byte boundaries. If an access is not aligned on a 16-byte boundary, there won't be a trap, there won't be anything; it will automatically truncate the last 4 bits, so it will force it
so it is just automatically enforced. So we really recommend the alignment attributes; there are macros we can use to our benefit when we're writing code, so that when we define an array it is automatically placed on a 16-byte boundary. A word is 32 bits, and a quadword is 16 bytes in this case; a cache line is 128 bytes. All cache-line transfers are on a 128-byte boundary, and all loads and stores are on a 16-byte boundary. So a cache-line transfer has to be aligned on a 128-byte cache-line boundary, and a load or store always has to be aligned on a 16-byte boundary. And again, the SPU can only fetch instructions from its own local store, and obviously there's no address translation there; privileged software on the PPE can set up effective-address aliases to the local store.