The objective of this workshop is to bring you into the Cell BE. We want to give you an overview of the Cell BE system, the software development environments, and the Cell SDK 2.0. Then we will talk about the Cell BE software model and application code: how to port applications over from regular, traditional architectures to the Cell architecture. We also want to walk through it with you and show you how to write programs on the Cell BE. We start with a very simple program and convert it into a Cell program. What is the difference? We will point out the differences, the key differences between this architecture and other architectures. Vectorizing, the SIMD (single instruction, multiple data) model, is the main feature of this architecture. We go through the vectorizing exercises, and then we show you the different kinds of development tools that we have. The integrated development environment, which we call the IDE, is Eclipse-based. The SPU timing analysis. Some of the performance tools that allow you to look at and analyze the performance behavior of the system. And also some of the tools in SDK 2.0, such as the Feedback Directed Program Restructuring tool, FDPR-Pro. You give it just the binary of your code, and the tool analyzes your program. It asks you to run the program a first time to collect some data, and then, using that data, it optimizes your program by restructuring it. This is a little different from some of the tools we had before from open source, like the VPA, the Visual Performance Analyzer. The VPA only analyzes your binary code; it does not optimize or restructure it. FDPR-Pro restructures the code for you. Then we look at OProfile. OProfile is open source, available in the Linux distros. This is the system software enablement we have been doing since 2005. As you see in the timeline here, in 2005 we released the SDK, SDK 1.0 and then 1.1. We supported the single-board bring-up at the time, right? Then we integrated the board into a system, which we call the cell blade, and we integrated that cell blade into a blade chassis so we can form a server. And then we can talk about clustering. By the end of last year we talked about how you can have a Cell Broadband Engine here, and in some other blade you can run a traditional CPU, for example an Opteron or an Intel-based system, and let those systems talk to the Cell. Since the Cell performs very specific functions computationally fast, extremely fast, exceptionally fast, the systems talk to each other and use the Cell BEs as accelerators, right? In this project here, IBM signed up with Los Alamos National Laboratory to provide about 16,000 of these Cell BEs and 16,000 AMD Opterons, connected as one whole cluster. So we form a very, very large computer system that offers a very large number of flops. That is the progression of the software: from the beginning, supporting a single-board bring-up, to the system, and now to the whole clustering world. Here is the progression of the SDK. As of now, we have SDK 2.0.
We released SDK 1.0 to support our board bring-up and the base cell blade, and now we have moved to SDK 2.0, which provides the full-blown XL C compilers; these are the IBM compilers. We are also working with the open source community, which provides a GCC compiler. So we have two compilers available at all times on the Cell BE. We also work on the Linux kernel with the Linux kernel teams from the IBM labs in Germany, and also in Australia, to provide fixes and patches. We brought into mainstream Linux 2.6.18 the base kernel support for running on the cell blade, and that is also the basis of the mainline. You have an IDE here. You have performance analysis tools. You also have Mambo, which is the simulator. This is an IBM simulator that was used to simulate POWER3, POWER4, POWER5, and POWER6. It is not a new simulator; it is an existing simulator that has been proven and used by IBM to simulate hardware whenever we develop new hardware. We ship the simulator with SDK 2.0, so you can use it to write an application, compile it, and run it on the simulator. SDK 2.0 runs on different platforms: Fedora Core 5 on x86 and x86-64, Linux on Cell, and also Linux on Power. This is the Cell-based software stack, with Fedora Core 5 as the Linux base. We have the applications, working with ISVs, universities, and different labs around the world, and some sector-specific libraries for application sectors like A&D (aerospace and defense), the financial sector, and other sectors like exploration and so on. Below that we have the tools, the environment, the compilers, the libraries, the operating systems, and so on. This is the rest of the software stack, provided by IBM and some of its partners. And these are the kinds of basic features that we put into SDK 2.0: overlay support, the Accelerated Library Framework (ALF), performance tooling, OProfile, and so on. Now, this is the roadmap from the hardware point of view. In 2006 the Cell BE was officially released and made commercially available for you to purchase. We can integrate it into the cell blade and the blade chassis, so we can form a cluster of those. So in 2006 we released the Cell BE, which is one PPE and eight SPEs; we call it nine cores, using a 90-nanometer SOI process, silicon on insulator. Thank you very much, I never remember that one. SOI. Then in 2008, and we are working on it now, we will release a version we call the enhanced Cell BE, which has improved performance in double precision. Okay? And then by 2010 we release another processor. By that time we reduce the feature size to 45-nanometer SOI and reach a one-teraflop processor. The current one runs at about 200 gigaflops, so that is about a five-times improvement in performance, while cutting the size of the chip in half. In terms of software, we released SDK 2.0 in 2006. This runs on two Cell BE processors here, with extremely high performance on single-precision floating point, and one gigabyte of memory, 512 megabytes for each processor. We connect both of them together and form a system, a board, with two processors. Two processors here means you have two PPEs and 16 SPEs.
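To make the two-compiler toolchain concrete, here is a minimal sketch of how a Cell program is typically split and built with the GNU cross-toolchain from the SDK; the XL C compilers follow the same split. The tool spellings (spu-gcc, ppu-gcc, ppu-embedspu) and the -lspe2 link flag are my assumptions of the SDK 2.0 names, so check your installation; the file names are hypothetical.

    /* spu_hello.c: the SPU side of a minimal Cell program.
     * Built with the SPU cross-compiler, then embedded into the
     * PPE executable. speid/argp/envp are passed in by the
     * PPE-side runtime when the SPE program starts. */
    #include <stdio.h>

    int main(unsigned long long speid,
             unsigned long long argp,
             unsigned long long envp)
    {
        printf("hello from SPE 0x%llx, argp=0x%llx\n", speid, argp);
        return 0;
    }

    /* Build sketch (tool names assumed from SDK 2.0):
     *   spu-gcc spu_hello.c -o spu_hello                    # SPU binary
     *   ppu-embedspu spu_hello spu_hello spu_hello-embed.o  # wrap as PPE object
     *   ppu-gcc ppu_main.c spu_hello-embed.o -lspe2 -o hello
     */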
So you have a system on the same board with about 18 cores; we call it 18 cores, 18 CPUs. And then we move on to SDK 2.0, just released, on the advanced Cell BE here, by 2007. We improve the memory here from one gigabyte to two gigabytes, and we support a 16x PCI Express card. By 2008 we will provide support for the improved double-precision performance, improving it from what we have now to almost double, right? And we will support up to 16 gigabytes of memory. So for those applications that require a lot of memory (this is the case, for example, for Java and some other applications, where you have to make the memory available and keep it resident in system memory), we will provide a system by 2008 to satisfy those requirements. Okay, where do you get Cell BE information? You have a lot of places to go; IBM has spread out a lot of information. The key site is the Cell BE Resource Center at developerWorks. We call it developerWorks, the Power Architecture zone, /cell. We also maintain what we call Cell BE corners at Power.org, the Cell BE project at IBM Research, alphaWorks, and so on. Okay, and for Cell BE education, we maintain a website from which you can download some of the materials; actually, you can download all of the materials. All of those are available to the public, so if you want to know what is going on, you can go up there and download the files. We have online courses at IBM Learning. We are in the process of putting a lot of our courses, such as this one, on IBM Learning, so this will be made available to the public as well. We have podcasts at Power.org and on-site classes at various IBM Innovation Centers. Documentation: here I have a list of documents with respect to the Cell Broadband Engine: the architecture, the programming handbook, the registers, the C/C++ language extensions to support the SPU, the SPU instruction set architecture (the ISA), the SPU application binary interface, and the assembly language specification. For documentation on using the SDK, we have an installation guide, a programmer's guide, and so on. We also have the documentation for the cell blade, the QS20, for the compilers, for the simulator, and for the PowerPC base, because our system is based on the PowerPC architecture. And here is a list of technical articles; this is a more detailed description that we will go through as time permits. The key document that I recommend all of you read before you do any programming is the Cell BE Programming Handbook. This book is about 857 pages; consider it the Cell BE cookbook, right? It describes everything. If you need to know the programming, the architecture, anything about it, just go there. Don't print it, okay? To print that one, you would kill a lot of trees. Just get a soft copy and read it through. Okay? Let's move on. developerWorks here, architecture: this is just a detailed description of what I described on the previous pages. We maintain some information at the Barcelona Supercomputing Center, the BSC. On that site we maintain all of the open source software.
The GNU compilers, the fixes and patches for the Linux kernel, the installation scripts, all of those, and the Linux tools and GNU tools, we maintain at the BSC, because those are open source and public, and IBM does not want to get in the middle of that area. Okay? Let me talk about the Cell BE and give you an introduction. I have a lot of pages here; I will cut it short because we are short on time, so this is just an introduction. You will see something here, and when I go through the details later, if it reminds you of something, stop me and we will go back: say, you mentioned something here, give me some more details. Okay? Overview of the Cell BE history, microprocessor highlights, hardware components, software environment: these are the objectives of this presentation, and this is the agenda. History-wise, IBM, Sony, and Toshiba joined forces in 2000, and by 2001 they picked a design center site: Austin, Texas, right here. The first single Cell BE was operational in spring 2004, the two-way SMP came up in the summer, and then several releases followed. Notably, the cell blade was announced by IBM in February 2006. SDK 1.1 was made available in July 2006, then the GA of the IBM BladeCenter QS20 in September 2006, and SDK 2.0 became available in 2006. 2006 marks a very, very special year for IBM in terms of releasing the products and making them commercially available, okay? And here are the locations of the labs that are working on the Cell: we have them in North America here, in Europe, in Germany, some place over here, and we have a large lab in India as well, okay? This is the size of the Cell processor, right? There is a push pin for scale, and a small hand holding the chip; this lady here has a very small hand, so we used hers, right? Okay, basic design. As we moved along with processor designs and different architectures, over the course of time we recognized three gating factors that inhibit our designs. The first one is power: the power dissipation, the power consumption of the processor. The second one is memory. As we move data around, we design the processor to support the movement of the data, to always make sure we have the data available when we need it. We create a memory hierarchy around the CPU: level one, level two, level three caches, and so on. By doing that, if everything goes well, we are okay. But if something happens and the data is not there, we have to fetch the data, we have to flush the cache, we have to do extra work to bring the data in. That causes great latency. On the POWER5, we are looking at 75 cycles, 150 cycles, if we have a miss. So we look at the power, we look at the memory, and we look at the frequency. We cannot keep pushing the frequencies higher: greater than 3.2 gigahertz, 4 gigahertz, 4.5, 6. Look at the history of the last couple of years: about five years ago we had Intel and AMD coming out with 2.4 gigahertz on a new core, then 2.6, 2.8. They are not going to 3 or above, because they cannot control it. This is a reverse, or what we call an inverse, relationship.
As you pump up your frequency, you have to generate more power: you have to switch faster to attain the higher frequency, and switching faster generates more power. We looked at this in an IEEE publication, from about 2001 I think, which looks at different processes and frequencies. To support something like 1 gigahertz, you see the copper materials here, then the SOI materials, then the low-k dielectrics, and so on. All of those materials have been used to support higher frequencies. Now, Moore's Law, of course: transistor density doubles every 18 to 24 months. But Moore's Law did not say that the power dissipation, the power consumption, of those gates would not rise as well. He talked about the number of transistors. We increase the transistors; transistors consist of gates; and those gates consume power when they switch. Some of the gates also consume power even when they are inactive. So we have active gates consuming power and inactive gates consuming power, and that adds up to a lot of power generated. We know how to design the processor, but we do not know how to control the power. That is the key point here. Where have all the transistors gone, right? Go back to the first days, about 15 years ago and more, when IBM created the first PC. The first CPU was very simple: 64 KB of memory only, no level 1 cache, no level 2 cache, nothing. There was no virtual memory support, no superscalar microarchitecture, and so on. Since then we have introduced the caches, the layers around the CPU; we introduced the superscalar microarchitecture to support more units; the deep pipelines. Look at some of the current architectures: the pipelines went from 10 stages, 12 stages, to something like 30 stages on the Intel base. Thirty stages is a lot of stages, right? You may gain some performance there, but you introduce a long delay, and it costs a lot of real estate on your die as well. And then we talk about out-of-order processing. Most systems say: give me an instruction, I will decode that instruction, I will bring it in, and if the data is there, we can run, right? Now, if something happens, I will set that one aside, line up another one, and process it. Everything is out of order; I don't care, okay? But out-of-order support is very expensive as well, okay? And here is the chart we discussed: the lithography improvements, increased cache sizes, deep pipelines, superscalar, out-of-order processing, and threading. All of those techniques were used by various microprocessor designers to help improve the performance of the processor. Okay? And here is a chart with a little more detail on the different memory organizations and so on. The key point is here: chip-level multiprocessors, okay? We do not have to go through the traditional design. We can reduce the superscalar logic, reduce the pipeline depth, reduce the speculation. Instead of out-of-order, we just execute in order. It is easier: when an instruction comes in, we bring it in, decode the instruction, get the data, and so on, and then we execute that instruction to completion, without out-of-order execution. We also bring in something we call SIMD, the vectors, so we can do a lot of operations in parallel in every cycle.
Instead of waiting for one element to complete per cycle, we vectorize the instructions so we can bring in a number of elements and process those elements together. And then a different memory organization. Remember, the basic organization is that when you need to execute some instruction, you get some data, and you have to load that data into the registers, finally, at the end of the whole chain, and then you execute. Why not provide a set of registers right next to your memory, instead of having all of those intermediate steps? Eliminate the intermediate steps: go straight between the registers and the memory, which we call here the local memory. Load the instructions and your data into the local memory, and when it is time to execute those instructions, load into the registers and, voila, perform the operations. So we talked about all of these concepts. The other basic concept we mention here is the accelerator hardware, the accelerator concept. We accelerate the operations using this approach: let's say we have a level two cache, and we have system memory that is coherent with that level two cache, okay? We maintain the coherency with that level two cache, right? So everything we do on the main processor, we do very fast; however, anything that we do locally on the satellite processors, the remote processors, we can do in parallel as well, and we maintain the coherency, so the integrity of the data is maintained throughout the system, okay? And here we say: all right, by providing some of these features, we can win over the memory wall, the frequency wall, and the power wall. Provide a very simple set of processors, and move the data from the local memories to the registers right away instead of going through a whole hierarchy of memory. And we do not need to make it run that fast; we do not need to go to four gigahertz or five gigahertz, right? We stay at some frequency, but we no longer look at execution time per unit of work. Look at the number of transactions completed per unit of time. Look at the throughput, because we have provided a multi-core, multi-processor system here, and each core goes ahead and does something. The throughput is what counts, right? At the end, how many transactions did we complete? That is what counts. Some other microarchitecture decisions: the large shared register file, the local store size trade-off, and so on; those, I think, we have talked about. We can issue two instructions every cycle. Software branch prediction: we removed the hardware support for branch prediction and replaced it with software branch prediction, and we leave the decision of whether or not a branch is taken to you, the programmer. You, the programmer, will have to make a lot of decisions when you program in this environment. You decide how to break up your application, you decide how to partition your data, how to optimize your sections of code, and you also decide when a branch is likely to be taken and when it is not. Some of you may say: well, we leave that to the compiler. Yes, the compiler will do that for you as well.
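To give a feel for what software branch prediction looks like in source code, here is a minimal sketch. With the GNU toolchain you can use __builtin_expect to tell the compiler which way a branch usually goes, and on the SPU the compiler can turn that into branch-hint instructions. The function and constant names here (accumulate, process_sample, OVERFLOW_LIMIT) are hypothetical, made up for illustration.

    /* Sketch: steering branch layout with __builtin_expect.
     * Hypothetical example; compiles with spu-gcc or plain gcc. */
    #define OVERFLOW_LIMIT 4096

    static int process_sample(int v) { return v * 2; }

    int accumulate(const int *data, int n)
    {
        int i, sum = 0;
        for (i = 0; i < n; i++) {
            /* Tell the compiler the error path is rare, so the
             * common path falls through and gets the branch hint. */
            if (__builtin_expect(data[i] > OVERFLOW_LIMIT, 0))
                return -1;   /* rare error path */
            sum += process_sample(data[i]);
        }
        return sum;
    }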
The second design point: it is based on the 64-bit Power architecture. We build on the 64-bit PowerPC so we can leverage the applications, and the investment, that IBM and its partners have been putting in for many, many years; the Power series goes back something like 15 years, right? And we increase efficiency and performance by going to a non-homogeneous, coherent chip multiprocessor. Okay, we have nine cores. We stream the data: we provide a mechanism so we can stream the data between components extremely fast. Okay, and then we provide an interface to the outside world through what we call the FlexIO connection, which allows us either to connect two Cell processors together to form an SMP system, or to go out to the I/O world and collect the data from the I/O devices. Okay, hardware components. Let me give you some highlights of the Cell BE here. This is the die, all right? And we have the eight SPEs here. Now, if you look at this one, it is still dated wrong here; Hemman and I were in different time zones, so forgive us if we screw up anything, right? So this is the Power processor element, and this is the level one cache, and the level two cache right here, and the eight SPEs, and the element interconnect bus, and so on. Now, this chip here consists of about 241 million transistors. The area is about 235 square millimeters, comprising nine cores and ten threads. We have simultaneous multithreading on the PowerPC, so we actually support two threads on the PowerPC, and each of the SPEs is represented by another thread. So we have eight threads for the SPEs and two threads for the PPE, a total of ten threads. Simultaneous multithreading, right? We have single-precision performance greater than 200 gigaflops, and double precision about ten times less than that, about 20 gigaflops. Memory bandwidth is about 25 gigabytes per second, and the I/O bandwidth about 75 gigabytes per second. The EIB, which is the bus that connects all the elements together, runs at about 300 gigabytes per second. The top frequency is greater than four gigahertz: some technical publications, on the IBM site and at some conferences, mention 5.2 gigahertz, and internally the labs were able to achieve 5.2 gigahertz, but here we release a system running at 3.2 gigahertz only. And here is the diagram that describes our system. We have the lower portion here: the PPE, which is a traditional processor. It is based on the 970, the processor of the Apple G5, but it is not the PowerPC 970. It keeps some of its features, but most of the features of the 970 were removed; for example, parts of the branch prediction support and the out-of-order support were removed. It has level 1 and level 2 caches, and it supports VMX, the vector multimedia extensions (also known as AltiVec), which is similar to the vector instruction support on the Intel architecture base. In addition to the PPE, we also have eight SPEs. Each SPE is defined by an execution unit, a local store, and the memory flow controller, which we call the MFC. This MFC is very important, because it is hardware-based and each SPE has one. It handles the flow of traffic and the transfer of data: between SPE and SPE, between the SPEs and the PPE, between the SPEs and memory, and to the bus interface controller here, which goes to the outside world.
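Because each SPE owns its local store, all data movement between main memory and that local store is an explicit DMA request queued on the SPE's MFC. Here is a minimal SPU-side sketch of that pattern, using the mfc_get and tag-status intrinsics from spu_mfcio.h; the buffer size and the use of argp as the source effective address are assumptions for illustration.

    /* SPU-side sketch: pull one buffer from main memory into the
     * local store through the MFC, then wait for it to arrive.
     * DMA addresses and sizes must be at least 16-byte aligned;
     * 128-byte alignment performs best. */
    #include <spu_mfcio.h>

    #define BUF_SIZE 16384   /* assumed size for this example */
    static volatile char buf[BUF_SIZE] __attribute__((aligned(128)));

    int main(unsigned long long speid,
             unsigned long long argp,   /* effective address from the PPE */
             unsigned long long envp)
    {
        unsigned int tag = 0;   /* DMA tag group 0 */

        /* Enqueue the DMA: local store <- main memory */
        mfc_get(buf, argp, BUF_SIZE, tag, 0, 0);

        /* Block until every transfer in tag group 0 completes */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        /* ... compute on buf ... */
        return 0;
    }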
All of those transfers can be done concurrently and in parallel. We are not gated: our transfers are not gated by a single MFC, because there is an individual MFC for each of the SPEs here. A very important concept right here. The bus here runs up to 96 bytes per cycle, which translates to a top maximum bandwidth of about 300 gigabytes per second. Here is the chart describing the Cell processor. In the middle here we have the PowerPC, which we call the PPE: level 1 and level 2 caches, two-way hardware multithreaded, coherent loads and stores, VMX instruction support, running at 3.2 GHz. Around the outside here we have the SPEs, 128-bit-wide SIMD units. Each of the SPEs has 128 registers, and each register is 128 bits wide. What does that mean? That means that if your data type is 32 bits, you can have 4 data elements in that register; you can load 4 elements every time, so you can perform 4 operations concurrently in every cycle. If your data type is char, 8 bits, you can have 16 of them running together. So it is very powerful here. And we also have the interconnect right here, which connects all the elements together, and the notable thing about it is that we can have greater than 100 outstanding memory requests at any given instant of time going through this EIB. Memory management: we have the concept of the effective address on the PowerPC, which here is a 64-bit address. Each of the SPEs runs with 32-bit addresses, and the instructions here are 32-bit instructions. So we have a 64-bit address space on the PPE and a 32-bit address space on the SPE. Addresses on the PPE have to be translated for the SPE, and the SPE memory addresses have to be translated to 64-bit addresses if we want to work with 64-bit programs on the PPE. All of that translation and memory management is performed by the MFC, with the MMU unit here, and all of the SPE DMA accesses, meaning the data transfers between the two, are protected by the MFC and the MMU. When you do a memory transfer, you go through this MFC. External connections: we have 25.6 gigabytes per second of memory bandwidth here, and we have two connections here. One is for connecting the processors together; the other one connects the I/O devices to the system. Synergistic: Cell is not a single processor. Let's say it is a non-homogeneous multi-core system, and each core performs a very specific function. The operating system running on it is Linux, in different versions of the Linux distros: Fedora Core 5. We have also seen it run Yellow Dog; I can also run Debian, I can run Ubuntu, I can run almost any Linux version. That Linux runs on the PPE. The SPEs are dedicated to computation only. So there is a very clear distinction between the two. Do we do any scheduling, any timer tasks, and so on, down on the SPE? No. Everything runs on the PPE only. The software environment: we provide the development environments. On the left side, the tools: the debug and performance tools and so on, the ones that you have seen before. On the right-hand side, we have sample workloads, shipped in your SDK, and the SPE management libraries, so we can manage the SPEs: create an SPE task, queue it, release it, create an SPE group, for example.
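To show what those SPE management libraries look like from the PPE side, here is a minimal sketch in the style of the libspe2 API (spe_context_create, spe_program_load, spe_context_run). I am assuming the libspe2 interface that shipped around SDK 2.0, and reusing the hypothetical spu_hello image from the earlier build sketch; the older libspe 1.x calls look different.

    /* PPE-side sketch: load one SPU program and run it to completion.
     * spu_hello is the hypothetical SPU image embedded at link time. */
    #include <stdio.h>
    #include <libspe2.h>

    extern spe_program_handle_t spu_hello;   /* embedded SPU binary */

    int main(void)
    {
        spe_context_ptr_t ctx;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        ctx = spe_context_create(0, NULL);   /* one SPE context */
        if (!ctx) { perror("spe_context_create"); return 1; }

        spe_program_load(ctx, &spu_hello);   /* load the SPU image */

        /* Blocks the calling thread until the SPU program exits; a real
         * application spawns one pthread per SPE to run several at once. */
        spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);

        spe_context_destroy(ctx);
        return 0;
    }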
We have Linux PowerPC64 with Cell extensions that runs on the Cell BE here, hypervisor support on the PowerPC, the hardware system-level simulator, which you use to develop your code, and the language extension APIs. The system simulator: this is the simulator we talked about, the full-system simulator. Here is an overview of it. There are two user interfaces for the simulator: a command-line interface and a GUI. As a programmer, when you bring up the simulator, it boots a Linux version, 2.6.18. That Linux represents a fully implemented Linux on the Cell. Now, on a native system, you of course do not need the simulator. The simulator exists for when you do your cross-development using an x86-based machine as the cross-development platform. You write the application on your native x86 or x86-64 box, you compile it, you bring the binary into the simulator, and then you run it. That is what is provided here. There are different simulator environments; we will go into more detail on this. These are some screenshots; let me go back, that was a little too fast. When you bring it up, this is the GUI for the simulator. On the left side, you have the resources of the Cell; here it shows two PowerPC threads, PPC0 and PPC1, and then eight SPUs. On the right-hand side, you have the controls of the simulator, where you issue commands to control the execution: running, stopping, collecting data, and so on. Here is the simulator environment. In this window here, when you bring the simulator up, it boots Linux and displays it in this window; we call it the console window. And there is data you can look at, from the statistics point of view or from the resource point of view. You can set a break point. There are some nice graphics; you can see those as well. All right. Linux on the Cell is a set of patches to the 2.6.18 PowerPC64 kernel, and it supports SPU threads through an API. So everything has been extended into Linux, and it is part of the mainline as well. If you go to kernel.org and download your Linux there, and you are a member of the Linux open source community, as one of the developers you will see a lot of patches from the Cell that went into the mainline. All the fixes are also available from the Barcelona Supercomputing Center. SPE management libraries: I mentioned these before; they let us manage the SPU threads under the Linux environment. Some workloads are shown here: a physics simulation here; subdivision surfaces on a mesh, refining that mesh to show a really nice rendered image of a head here. The terrain rendering engine is another example that we show here: it takes a satellite image, which has an elevation layer, superimposes it on top of a flat terrain image, and renders that image in 3D on one of the displays; a really nice display here. Performance tools: static analysis with SPU timing; dynamic analysis, running on the system simulator; OProfile; and FDPR-Pro, which I mentioned before. The Cell BE IDE is Eclipse-based: it is built on top of the Eclipse SDK and the C/C++ Development Toolkit, the CDT, and we put in the plug-ins for the Cell to create an environment for you to write your programs. Really, really nice. Software design considerations.
We have two levels of parallelism here in Cell. We have the vector engines, right? The PPE and each SPE have vector engines, and each can run an instruction over a number of data elements concurrently. And then, on top of that, we have the concept of tasking, of threading. A task on Linux can be partitioned into different threads as well. So we have the high level and the low level: different levels of concurrency, of parallelism, running all the time. Computation-wise, we have the SIMD engines; we have parallel sequences distributed over eight SPEs and one PPE; and then we have a 256 KB local store for SPU usage. The SPE, remember, consists of the SPU and the local store, and the local store is only 256 KB. So we have to deal with that. We have to think: hey, this memory is so small, how can I fit my program in? That is one of the issues. Communication-wise, we have the DMA and the bus bandwidth. We have mechanisms to transfer data as fast as possible. How do we leverage that transfer speed, and how do we supply the data to all these tasks running on our SPEs? The typical software development flow: we go through the algorithmic complexity study; the traditional flow does the same thing. We go to data layout and locality and do some data-flow analysis, the dependency studies on your data and on the control flow of your program. The same thing, right? So basically, we do the same things up to this point. However, then we say: now we have a PPE and we have SPEs. Let's take the PPE program first. We develop the PPE control code, right? And the first code will be scalar code: develop that code on the PPE as scalar code. Now that we have scalar code, we partition that scalar code down to the SPEs. We partition the SPE code, and then we take care of communication, synchronization, latency handling: how do we hide the delays within our program? And then we transform the scalar code into SIMD code, into vector code. So, compared to the traditional flow: we study the algorithmic complexity the same way; we take that code and develop it on the PPE; then we partition the application and the data so we can send work to each of the SPEs, providing some mechanism to control the synchronization; and then we transform that code into SIMD, vectorize it, rebalance, run it a first time, a second time, using the performance analysis tools here. Analyze the data and the behavior, and then finalize your code. Performance tuning is a never-ending task, right? So we keep on tuning and then finalizing, checking on the vectorized code, and so on. This is the typical Cell software development flow that we normally recommend you go through. Recommending is one thing; you do not have to go through all of it, of course, right? Performance-wise, we have a listing here of floating point single precision, floating point double precision, and integers, 16-bit and 32-bit, right? And the red one over here, floating point single precision: extremely fast, extremely good on the Cell. This chart compares the performance of the PowerPC 970MP, two cores running at 2.5 gigahertz, against the Cell Broadband Engine running at 3.2 gigahertz. We are cheating a little bit here, between 2.5 and 3.2, but it is not a whole lot.
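To make the step "transform the scalar code into SIMD code" concrete, here is a minimal sketch of how a scalar loop is typically rewritten with the SPU intrinsics from spu_intrinsics.h: four floats fit in each 128-bit register, so the loop advances four elements at a time. The function names are mine, and I assume n is a multiple of 4 and the arrays are 16-byte aligned.

    #include <spu_intrinsics.h>

    /* Scalar version: one multiply-add per iteration. */
    void saxpy_scalar(int n, float a, const float *x, float *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* SIMD version: four multiply-adds per iteration.
     * Assumes n % 4 == 0 and 16-byte-aligned x and y. */
    void saxpy_simd(int n, float a, const vec_float4 *x, vec_float4 *y)
    {
        int i;
        vec_float4 va = spu_splats(a);        /* a in all four slots */
        for (i = 0; i < n / 4; i++)
            y[i] = spu_madd(va, x[i], y[i]);  /* fused multiply-add */
    }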
But we see a tremendous jump in the floating point single precision performance, from the red right here to the blue right here. Integer multiplication, 16-bit or 32-bit: very good. The performance of double precision here remains almost the same, slightly better, but the improvement is not as high as in the other cases. The reason why is that the processor was designed to support the game industry, right? The focus then was the performance of single precision. To improve the performance of double precision, we are working in the labs now, and as the hardware roadmap chart showed you, we release a version with enhanced double-precision performance next year, within 2007-2008. So we do recognize some performance issues, and we respond to the demands of the market: some markets do not need very high double-precision performance, but some really require it, and we will meet those requirements as well. This chart here shows you the performance difference between a 3.2 GHz general-purpose processor and the 3.2 GHz Cell Broadband Engine in different application areas: high-performance computing, graphics, security, video processing. These are the types of algorithms we are going through, and this is the performance improvement over the 3.2 GHz GPP. For example, in this case here, running on the Cell, we were able to get a performance improvement of about eight times, 8X, depending on how many SPEs we run: in this case we run eight SPEs, in this case one SPE. Here it is 30 frames versus one frame in this environment. And then the G5, which is the PowerPC 970, running against the Cell here: a performance improvement of about 12X on the Cell. So we did see real performance improvements, and in some respects very significant ones. Key performance characteristics: what sets us apart from the traditional architectures? If you compare single processor against single processor, we are about the same. The key point is that we offer eight SPEs. So whenever you are running something and you have distributed your workload over the eight SPEs, we can see an immediate improvement of 8X right away. And with some tweaking, some tuning, some rewriting or retuning of your application, we can see 20X, 50X, and in some cases we have seen it go up to 100X. The cell blade: this is the cell blade we have now. We have two Cell processors on the blade, and we have one gigabyte of XDR memory. And this is the way that we put the cell blade into the blade chassis: we have 14 slots, each cell blade takes two slots, and we have seven cell blades in this chassis here. In this configuration, this is the same configuration, this is the QS20. And then this is how we connect the systems together, either to form an SMP system or to form a cluster. Okay, and this is the Roadrunner compute rack. This is the one where we plan to connect all of the AMD Opteron-based systems to the Cell BEs as accelerators. Application affinities: this is the kind of application we look at. On the left panel here you have the different features of the Cell, and these are the kinds of functions that we believe could be accelerated, could be improved, could run faster on the Cell.
And here we look at different groups of applications, okay? And here are the different industry targets. We look at A&D, aerospace and defense; the petroleum industry, seismic data processing; public sector and finance, FSS; industrial, semiconductors and LCDs here; medical imaging: key application areas for the Cell Broadband Engine here. Communications, consumer, digital content creation, media platforms, video surveillance: these are very, very key applications, and when you port those applications to the Cell, you see a tremendous jump in performance. The future of the Cell BE: we are moving from the punch cards of 20, 30 years ago up to client-server here, and then we get into the immersive interaction of client gaming, okay? And these are the kinds of applications: from the windowed click-and-wait, we move into the immersive; we move to real-time and distributed, where there is a lot of data that we need to access everywhere, right? On your palm, on your cell phone, for example. And the wireless, always on, and text messaging, blogs. And now we look at the digital world that we are dealing with. From the text-based search engines, like Google, like Inktomi, now we move to graphics-based, image-based searching. Given something like a picture going across your screen, right? Capture that picture, capture the text, and then retrieve the data. What does it mean to us, right? From this piece of data, say at the bottom of your TV screen, running through with some pictures: capture that, tell us what it is, and save it so it can be retrieved later on. Natural information processing, like a human brain. We should not need the mouse. I still need the mouse over here, right? You have seen me from the beginning until now: my finger is still on the mouse, pointing here, moving the mouse. Why can I not point over here and move my finger, right, and have some camera capture my gesture? That is my natural gesture, instead of my unnatural gesture over here on the mouse. So this is the kind of application we are moving towards, and we can see that certain application areas are candidates for the Cell here. Dynamic increase of data volumes: we are facing a huge growth in data volumes, every one of us. Here, or in the US, or anywhere I go, right, I flash my credit card, and the data is there, ready for me. I swipe it through and then I sign; that is all. Behind that, they store my data: my records, my date of birth, my bank records, where I have been, how much I have left on my card. And sometimes they say: well, I am sorry, I cannot process your card here. Oops. Those things happen, but they capture my data everywhere, right? Every year, every single moment. You can see it here: each of us creates something like 800 megabytes of data every year, and every year we increase that. And that is just the text-based data, not including my pictures. A picture of my family, posted somewhere on the web: you can download that one as well. How do you process all that? An X-ray of the human body: a large set of data; 20 megabytes is on the low end, right? Digital video: 10 megabytes per minute. This video camera here captures a lot of data, right? Put it somewhere, retrieve that data, and process that data. And that is just imagining this one camera.
You have a thousand cameras watching you at the airport, right? Or in a city, anywhere in the city, except maybe Mumbai; maybe there are no cameras there, right? You can turn left and right, nobody cares, right? I am not sure about Brazil. Oh, the system is all over here too, cameras on the highway as well. If you exceed the speed limit at night, flash: you see the flash? The camera took your picture, a picture of your car, of your license plate, at 9 p.m., in the dark. You see? So that set of data goes somewhere, somehow, right? We have to capture it. Another way to look at it: as of now, when you buy a PC, the AMD ones and so on, you go and buy a nice graphics card from NVIDIA or some others, ATI and so on, so you can process that data, right, those nice graphics. It has a GPU in there, it has memory, and so on. On the Cell base, we said: all of those things can be processed within the Cell, inside the Cell, right? We have the accelerators. We do the processing: we process the data of the image and render the image for you, instead of having a separate GPU, a graphics processing unit, processing the image for you, as we have now. Here is a look at the traditional graphics-based setup, right? We have the CPU here, we have memory, we have a north bridge connecting to the GPU, the graphics processing unit, over here, and the south bridge connected to the I/O, right? And accelerators: some companies are working on accelerators to help the CPU here perform some specific operations. On the Cell side, we have a 64-bit PowerPC right here, and we have the memory controller. The memory coupling is very tight, right? We use very high-speed XDR memory, okay? And we have the set of processors, the SPEs: eight of them can perform those operations concurrently, and we have a multi-core system right there without adding anything like external accelerators. All of your accelerator functions are performed by each of those SPEs. And then you have the FlexIO to connect to your I/O, and so on. So, some things for you to look at in the future: specialization in computer architectures; beyond this application, what specialization makes sense in general-purpose chip design, okay? And also the programming paradigms; programming is changing. We are not programming a single-core system, a dual-core system, an SMP system, or even a NUMA system, right? We are moving to a different programming environment here. A lot of you, I think, are working on multi-core and programming multi-core on the Intel base and the AMD base, and that is a little bit different from the multi-core in this area here, okay? Here we have a set of eight homogeneous cores and one different one, so we can partition our tasks. On the other hand, there you have one set of cores, two or four on a single socket, or two sockets tied together, okay? And each of them is still a replica of a single core, a full implementation with level one and level two caches and so on. That environment is different from this environment. Programming in this environment is a little bit different from what I just described. And there are new types of applications as well. All right, so that is my summary. Do you want to take a break now?