This is a really quick overview of the current solutions-based applications we have. We have mentioned before that the Cell Broadband Engine has applications in digital video surveillance, in the aerospace and defense sector, in financial services and banking, and also in seismic computing and high-performance computing. Let's see what I'm missing: electronic design automation, medical imaging, and digital media and digital content creation. These are some examples of workloads IBM has already demonstrated, and some that are in progress. Each of these areas is being worked on by a different team in IBM.

The first application, F@H, Folding@home, has applications in life sciences and biomedical computing, and they have already demonstrated results on the Sony PlayStation 3. So imagine having a $500 to $600 appliance and being able to do something really useful with it beyond gaming.

You might have heard about the very famous deal we made with one of the US's leading defense laboratories, LANL, the Los Alamos National Laboratory. There are basically two major defense labs in the US: one is Los Alamos and the other is Lawrence Livermore. We signed a deal with Los Alamos, and the final goal by 2009 or 2010 is to produce one petaflop of performance, with 1.6 petaflops of peak double-precision floating-point performance. It's basically a hybrid environment where, as the diagram shows, the Cell Broadband Engine is used in a Linux cluster with 8,000-plus blades, so that's 8,000 times 16 SPEs, and likewise for the Power cores. An AMD Opteron system is used as the master cluster. The reason we had to team up with AMD is that the main bottleneck on the Cell Broadband Engine, which we cannot avoid, is the PowerPC core, right? It's not really powerful; it's just a puppet master that initiates the application, tries to offload work, and pulls the right threads. The AMD Opteron processor proved to be much better for that role, and for this particular application the Cell Broadband Engine is used more like an accelerator.

There are also finite element solvers and computational fluid dynamics; some workloads in the fluid dynamics area have already been implemented on Cell. This is in progress; I don't think the FEM solver is available in the market today, but it's something we are working on. There are some smaller workloads, not a full end-to-end solution proven yet, but for large objects, as it says, we have demonstrated performance of 52 gigaflops using 8 SPUs. There is a website at the bottom; we are working with a company called Digital Medics in Germany, and that website will give you more details.

DVS, digital video surveillance, is another major play, with a few billion dollars of revenue potential for IBM. Think about how security cameras work today in airports or major stores. There may be five, six, maybe ten or twelve cameras in different areas, and all the data captured from these cameras is stored on that many disks. By the time you look at a disk, you rewind it and try to find something; there's a person sitting in an office manually reading through each of these recordings, trying to see if there was any activity. Naturally, the productivity of the human eye decreases after watching so many tapes back to back.
So with Cell, what we're proposing is all these different things that can be done: object detection, 2D object tracking, classification, multi-scale tracking, multi-camera handoff, face cataloging, metadata representation. The Cell Broadband Engine would basically be used for the video compression and analysis phase. With the Cell Broadband Engine there could be maybe 50 or 100 cameras in different areas, or maybe more than that; I don't know the current numbers. They would all be there in the airport, and in real time the system would be reading the video input from the cameras, compressing it, analyzing it, and sending alerts via the surveillance middleware to an external terminal. In other words, within a minute you would know that there was some suspicious activity, and it would raise an alarm, versus the bomb already going off and then trying to do recovery afterwards. With the Cell processors, this model, the right middleware, and the end-user equipment, we can actually do real-time alerts and a lot more than what's currently happening in the industry. You can do things like "show all the cars that drove north on 10th Avenue," or, if you want to backtrace activity, define a trip line and say "beyond this line in the parking lot, show me all the cars with a license plate from Texas," or something like that. A lot of neat things can be done.

There's a website for this. It's based on the IBM Smart Surveillance System, which was basically the PeopleVision project from IBM Research. It's very well recognized; there was recently a really nice news article about it, and it's getting worldwide recognition. Detailed descriptions are available on the IBM Research website; I believe it's an external site, not internal. If you need to know more, please do let us know. There are more details there about the MARVEL system; it's called MARVEL, the Smart Surveillance Engine.

Next is modeling of data and producing high-resolution 3D graphics images from it. Most of the output we have today is two-dimensional, and the whole world is progressing towards 3D output, right? 3D graphical displays, high-dimensional data. So iPlus is another solution area we're currently working on, where you can take observed input, model it, and reproduce it on the computer in a three-dimensional manner.

And again, there are some numbers if you're from the banking sector, or working on applications for it. We've already produced workloads for Monte Carlo simulation; the team in Germany did really good work on this, and we have benchmark results available for that algorithm. The way the Cell processor deals with random number generators matters here: random number generators are a big part of any play in the financial sector, and the Cell processor is very good at that (a small illustrative sketch follows this paragraph). And here are some results, comparing eight cores versus sixteen cores, and the peak gigaflop performance for FSS, financial services sector, solutions.
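As a rough illustration of the kind of Monte Carlo workload being described, here is a minimal sketch in plain C of pricing a European call option. The parameters, the Box-Muller generator, and the function names are all illustrative assumptions, not IBM's actual benchmark code; a Cell version would replace rand() with a vectorized RNG and run many paths per SPE.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Box-Muller transform: turn two uniform samples into one Gaussian. */
    static double gaussian(void) {
        double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
        return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
    }

    int main(void) {
        /* Illustrative market parameters (assumptions, not benchmark values). */
        const double S0 = 100.0, K = 105.0, r = 0.05, sigma = 0.2, T = 1.0;
        const int N = 1000000;
        double sum = 0.0;

        for (int i = 0; i < N; i++) {
            /* Simulate one terminal stock price under geometric Brownian motion. */
            double ST = S0 * exp((r - 0.5 * sigma * sigma) * T
                                 + sigma * sqrt(T) * gaussian());
            double payoff = ST - K;
            sum += (payoff > 0.0) ? payoff : 0.0;
        }
        /* Discounted average payoff = Monte Carlo price estimate. */
        printf("MC call price: %f\n", exp(-r * T) * sum / N);
        return 0;
    }

Each path is independent, which is exactly why this class of workload partitions so naturally across SPEs.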
And this is the company we have already talked about: RapidMind joined hands with Realtime Technology, RTT, a Europe-based company. RTT has a product called DeltaGen, which does complex 3D visualization for cars, really upscale cars. I don't know if they're doing something else for shoes and such, but I think it's mostly for the automotive, aerospace, and consumer goods sectors. So yes, consumer goods also. And they demonstrated a really nice application at the SIGGRAPH conference last year, in 2006.

Next, the accelerator hierarchy for DCC, digital content creation. This is just demonstrating how the Cell Broadband Engine works when used as an accelerator: there will obviously be an encoding device and a video capture device, and getting the data from that device and moving the computational part of it over to the Cell Broadband Engine can give excellent results when it comes to 3D visualization. The front-end device just does the data capture.

And here, I wish we had an image of a real BladeCenter; I think we had one in one of the day-one presentations that Duke went over. But this is exactly what you would buy from IBM. You would buy this chassis, and the blades go into 2U slots; right now it's a 2U slot, and next year we're going to make it 1U. The BladeCenter chassis is roughly this size, very portable basically, just this high, and you can have seven blades plugged in modularly. The chassis itself has the I/O connectors, the network connectors, everything else, and there's a management module you can open up to very neatly configure things: find out how many blades are there, what resources are available, how to configure the network, give them all different IP addresses. This one blade has two Cell processors and 16 SPUs, and it's just this thin enclosure holding the processors.

Again, there are more details on how you do the rendering: read the data here, move it over there for the data management part, the storing and anything else that does not involve really heavy computation, and when it comes to the computation, move it over to the Cell blades to do the analysis and compression. This device here looks like a System p or that kind of server, or an HP server or something; it will read all the input data, put it into frames, and it is the device that sends the Cell Broadband Engine the frames, frame by frame, to process.

Again, more details on remote visualization. And this is a terrain rendering engine; I think we have a demo, and I will show that demo to explain this further. There's been a lot of progress in satellite imagery recently, right? So you have all these high-resolution satellite images of terrain. What the Cell Broadband Engine can do is take that data, and on the blade it will do the server-side rendering. To produce the output on this monitor, it employs a technique called ray casting: it determines the depth along each ray and combines it with the colors reflected from the surfaces to produce the final image (a small sketch of the idea follows this paragraph).
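To make the ray-casting idea concrete, here is a minimal sketch, in plain C, of marching a ray across a terrain heightfield until it dips below the surface. The heightfield layout, step size, and shading are illustrative assumptions, not the actual terrain engine.

    #include <math.h>

    #define W 1024
    #define H 1024

    extern float height[H][W];   /* terrain elevation samples (assumed layout) */
    extern float color[H][W];    /* per-sample surface color/intensity */

    /* March one ray from (ox,oy,oz) along (dx,dy,dz); return the color of the
     * first terrain cell the ray falls below, or 0 (sky) if none is hit. */
    float raycast(float ox, float oy, float oz,
                  float dx, float dy, float dz)
    {
        const float step = 0.5f;          /* assumed march step */
        for (float t = 0.0f; t < 4096.0f; t += step) {
            float x = ox + dx * t, y = oy + dy * t, z = oz + dz * t;
            int ix = (int)x, iy = (int)y;
            if (ix < 0 || ix >= W || iy < 0 || iy >= H)
                return 0.0f;              /* ray left the terrain: sky */
            if (z <= height[iy][ix]) {
                /* Hit: attenuate the surface color by distance (crude depth cue). */
                return color[iy][ix] / (1.0f + 0.001f * t);
            }
        }
        return 0.0f;
    }

Every pixel's ray is independent, so a blade can hand each SPE its own block of the output image.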
And this slide is just showing the advance of processing, how we have been on a steep slope, making significant advances in information processing. It started with mainframe batch; batch processing was the coolest thing back in the 1970s. Then we moved to minicomputers and multitasking, and Linux brought multitasking to everyone. Then, in the 2000s with the dot-coms, everybody moved over to the internet. And now the revolutionary concept is the Cell Broadband Engine, or so we say.

Okay, so, developing an ecosystem around Cell. Our mission is to develop programs, be it programmability tools, be it a scheduler, be it a debugger, be it an IDE or a development kit, anything that can help the programming community and make programming a few steps easier. And also to fill out the solutions stack. Take an application area, say imaging, right? There are a few core algorithms that are needed in the imaging area, and there are a few things you might call plugins or encoders or decoders, be it anything. In every solution area, we want to complete the solutions stack; that's how we're trying to build the Cell Broadband Engine ecosystem. The tools must address the creation of new code, and any tools we need also have to address the porting of existing code: portability tools, demos, proofs of concept and proofs of technology. That's another thing we do for different workloads. In other words, suppose I want to see whether the Black-Scholes algorithm, or, let's pick an Euler integration formula for example, is a good fit for Cell or not. To determine that, we do something called a proof of concept: we put it on Cell, develop it, and get the performance numbers and benchmarks (see the small Euler sketch at the end of this section). It becomes a benchmark when we're actually comparing it apples to apples with another architecture that is roughly equivalent, either a comparable G5 or a dual-core processor or something like that. So we do a lot of POCs and proofs of technology.

So what is the strategy? Provide developer training; we did our part, step one. Leverage academic interest wherever there is interest from the academic community. Produce demos and papers; that is the key for any technology, have as many papers as possible, right? Verify the performance potential of the Cell BE: in other words, if there is an application which I think is a good fit, verify that it is and see if we can get the results out. Produce libraries and example code, and make sure everybody gets a chance to play with the development kit. And then target ISVs in key markets. The more assets we have, the better presence we have; it's very important for any technology's survival to have ISV presence, academic interest, and open-source assets in the community.

So we do have lots of university programs: high-touch, medium-touch, and low-touch. Well, actually, let me go one by one. For the high-touch universities, we have one focal point that works with each university, and we give out loaner systems to the universities. Again, we're not saying we do that with every university; it depends on what kind of collaborative research we can do with them.
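Going back to the proof-of-concept point for a moment: as an example of the kind of kernel such a POC might start from, here is a minimal forward-Euler integrator in plain C. The ODE, step count, and names are illustrative assumptions; a real POC would then vectorize this and partition many independent trajectories across the SPEs.

    #include <stdio.h>

    /* Right-hand side of the ODE dy/dt = f(t, y); here f = -y as a stand-in. */
    static float f(float t, float y) {
        (void)t;
        return -y;
    }

    int main(void) {
        float y = 1.0f;            /* initial condition y(0) = 1 */
        const float h = 0.001f;    /* step size (assumed) */
        const int steps = 10000;

        /* Forward Euler: y_{n+1} = y_n + h * f(t_n, y_n). A single trajectory
         * is serial; the SIMD win comes from integrating many independent
         * trajectories side by side, one per vector lane. */
        for (int n = 0; n < steps; n++)
            y += h * f(n * h, y);

        printf("y(%f) = %f\n", steps * h, y);
        return 0;
    }

The POC question is then whether the vectorized, SPE-partitioned version of this loop beats the reference architecture apples to apples.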
We definitely give remote access. In other words, we have some QS20 blades available in what is called the virtual loaner program. So if you have an application that you've developed and you want to really run it on real hardware, send us an email and we will try to get you an account with something called PartnerWorld and set you up with remote access, so you can log directly into a Cell blade to develop and benchmark your application. We have something called SUR grants, Shared University Research, and faculty awards. And of course we provide training worldwide; Duke is the lead for worldwide training and does an excellent job. He has already conducted workshops in numerous places: in China, in Shanghai and Beijing, in Taiwan, Singapore, Malaysia, Brazil, all over Europe. So anytime you have training suggestions or additional venues, please do let Duke know.

For medium-touch universities, we provide VLP support, the virtual loaner program, where students can log into the hardware remotely. We have forums where anyone can post a question, and there are developers actively monitoring them who will respond. The forum is a really healthy community; you can see all kinds of really nice technical questions and discussions going on about any subject, be it A&D, be it the financial sector, be it anything, and lots of really low-level technical questions too: "I'm trying to do this DMA operation, it's not working." You can ask your DMA question if what we provided was not satisfactory enough. Any question where the bulb goes on and you're like, "oh, how does this work?", just go ahead and post it in the forum and somebody will respond to you. If they don't, please do write to us and let us know.

There's a low-touch approach also, where we provide online education. Rukh has set up this online education system, which is called IBM Education Assistant. There is a website link; I think it was in the initial programming slides, and you have it on your CD. You can also purchase equipment: if you're trying to build a cluster, if you're trying to do something for your university and you want to purchase a blade, please do let us know.

A sample of events we've done: recently, one-day workshops at Georgia Tech and Columbia University, and a two-day workshop at Minnesota. Again, worldwide workshops everywhere. We have the university challenge, which is getting a lot of recognition; in one month we already have over 30,000 hits.

Third-party development tools: we are working with a number of third parties. In our continuous effort to make Cell programming easier, we are working with a number of tool ISVs to provide programmability tools. So if you don't want to do the DMA yourself, don't want to worry about all the data splitting and everything, we are working with vendors that are providing better programming models and better interfaces to make it easier for you, to provide more abstraction levels so you don't have to worry about so many low-level details. Over the course of this year and next year we will have a lot more tools that we can talk about and showcase around this technology; some of these are in the pipeline. The latest libraries are the MASS library, FFT, and FFTW, and we already have support for OpenMP and Open MPI. Several multi-core frameworks are available.
Mercury Computer Systems has produced quite a few useful programmability tools: something called TATL, and also something called the MultiCore Framework, which basically takes the responsibility out of your hands for memory management, data streaming, data partitioning, and DMA operations. In other words, you don't have to worry about synchronization and memory management. We are continuously working on debuggers and performance analysis tools, and we also have a play in embedded systems: we are working with our vendor, Mentor Graphics, on developing a real-time OS around Cell. In other words, we are targeting everything from the smallest level, embedded systems, where we need real-time OS support, all the way up to the highest-performing clusters and hybrid programming environments. The Cell SDK is the starting point for all of this.

Then we have something called the Block Management Library, BML. This was developed by one of our colleagues at IBM who works with me and Duke. He produced this library and we are releasing it into open source. It's basically another abstraction layer: it takes away the responsibility of splitting the data buffers up and doing memory management like DMAs at all. It's a very user-friendly, high-level API that you can use. We are currently in the process of getting the open-source licensing sorted out and releasing it into the community, so if you are interested in BML, do let us know. It's just a high-level API that removes the need for you to do the DMAs, the synchronization, and everything yourself. Alongside it is the Pipe library: BML is a two-sided data transfer, Pipe is a one-sided data transfer. Both of these came as requirements from the A&D, aerospace and defense, sector. We were working with Raytheon, and they needed something that was not available, so these proved very useful for their requirements. Pipe is also a framework that takes synchronization out of your hands. Again, there are documents and documentation available about all of these in the SDK, and if you don't find documentation for a tool, let us know. There's another thing called ALF, the Accelerated Library Framework; it serves a similar purpose to BML, just at a lower level.

Again, the Mercury tools are available today, and RapidMind is complete abstraction. You can think of this as a line: it starts from the low-level tools, runs through the high-level ones, and ends at complete abstraction. Today, in the environment RapidMind provides, you don't even need to know how to vectorize; you don't need to know how to split up tasks or anything. You just take your scalar program, convert the data types, build it with your favorite compiler, and that's it.

Okay, ideal Cell applications. Right now, this is the best list we have; we don't have a list of projects at the moment, but we will certainly follow up with you. Target areas, obviously: digital media, image processing, media processing; any applications or algorithms that fall under these categories are an extremely good fit. And if you look, DSP is one of them over here. Also graphics, floating-point-intensive applications, and pattern matching; this year pattern matching is not our highest priority, but it is still an area where we have already identified over 4x performance benefit. So, we will cover programming tips and techniques really quickly, some porting methodologies, and then move over to SPU overlays.
Overlays are another effective technique, because the main challenge, the big question in everybody's mind, is: I will definitely have more than 256 kilobytes of code and data, so what do I do? Duke will cover that. There's also a very good paper: Dan Brokenshire, another Senior Technical Staff Member at IBM, has written a developerWorks article with 25 tips for exploiting the Cell Broadband Engine. It's a very neat article. You can go to developerWorks and search for papers, articles, and proof points that we've already developed; you'll find a lot of neat material. There's no dearth of material on anything in IBM.

So, we've covered the strong points. Everybody by now knows that 8 plus 1 is the magic number, and actually it's 8 plus 2, since the PPE is two-way multithreaded. These SIMD engines are decoupled; that's another key thing. It's a very simple architecture, no complex hardware-based features: decoupled SIMD engines, the two-way multithreaded PPU, and a very small chip area. And that keeps the power consumption very small too; not the chip area so much as the simplifications we have made in the hardware. Separating the PPU and the SPUs, plus the memory architecture, all of those things add up to lower power consumption. Because with the advance of technology, the limitation is not CPU power anymore; you can go on building deeper and deeper pipelines. The main roadblock now is memory latency, hundreds of cycles or more in many cases. That's our key challenge, and that's what we're trying to address with this technology: as my performance goes up, my power needs to stay low. So: 128 registers, the local store, memory bandwidth of 25.6 gigabytes per second, up to 16-way SIMD operation, fully coherent buses and memory, and a peak EIB bandwidth of 204.8 gigabytes per second.

Programming tips. We now know that the level of programming is not the most high-level, right? It's not Java. So a certain intelligence and awareness of the hardware is needed; not the hardware as such, but the hardware features. Again, as Duke mentioned, this is a new technology; we are evolving right now, working to make more and more tools that address programmability commonly available to every developer. So: we have dual-issue pipelines. I have to overlap DMA with computation (a minimal double-buffering sketch follows this paragraph). Design for the limited local store; the code has to be compact. With loop unrolling, keep in mind that you cannot just keep unrolling the loops, right? It increases the code size, so it's a delicate balance between how many times you unroll the loop and the performance benefit you get out of it. The shuffle-bytes instruction is a very key feature. Avoid scalar code, because when you're using scalar code, what is the disadvantage? You're not using all 128 bits, right? You're not using the full lines. So avoid scalar code, choose the right SIMD strategy, and load and store only by quadword. On instruction scheduling: different instructions have different latencies and stalls. For example, single precision has zero stall and a latency of six cycles; an integer multiply is seven cycles; double precision is seven plus six, so a 13-cycle latency right now. We're trying to change that.
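Picking up the point above about overlapping DMA with computation, here is a minimal double-buffering sketch on an SPU. The mfc_get and tag-status calls are the standard SDK intrinsics from spu_mfcio.h, but the structure, chunk size, and the process() kernel are illustrative assumptions.

    #include <spu_mfcio.h>

    #define CHUNK 4096  /* bytes per DMA chunk (assumed) */

    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process(volatile char *data, int n);  /* hypothetical compute kernel */

    void stream(unsigned long long ea, int nchunks)
    {
        int cur = 0;
        /* Prime the pipeline: fetch the first chunk with tag 0. */
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

        for (int i = 0; i < nchunks; i++) {
            int next = cur ^ 1;
            /* Kick off the DMA for the next chunk before computing on this one. */
            if (i + 1 < nchunks)
                mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, next, 0, 0);

            /* Wait only for the current buffer's tag, then compute on it while
             * the next transfer proceeds in the background. */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);

            cur = next;
        }
    }

The whole point is that the wait is per-tag: the SPU only stalls if the compute on one buffer finishes before the transfer into the other.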
And there is pipe zero, the even pipeline, and pipe one, the odd pipeline; all the memory-related instructions go down pipe one. Now, on the SPE there's a high penalty for branch misses, right? Eighteen cycles. You can use branch prediction; there is software-assisted branch prediction available, and if it's successful, it avoids the branch penalty altogether. So use software-assisted branch prediction, and I think we have an example coming up very soon here.

Say you have an if-else block. Normally, today, you would write it like this: if a is greater than b, then c += 1, else c = a + b. This is definitely a branch, and if you don't give any kind of branch hint, the branch will happen, and it has its own penalty. We want to avoid that altogether, right? So what do we do? First of all, we do a compare-greater-than on a and b; it finds out whether a is greater than b. The result is stored in bit form as a select mask. Then we do spu_add(c, 1) and spu_add(a, b). In other words, in the scalar code, the normal code, you only do one of the two computations: either c += 1, or c = a + b. Here we're saying do both computations, right, and leave it to the select mask at runtime. This avoids the branch. Then c = spu_sel: depending on the mask, it selects whether to take the a-plus-b value and store it in c, or the c-plus-one value and store it in c (a code sketch of this transformation appears below). Now, one key thing to remember here is that sometimes it's not as simple as c += 1, right? There may be a whole bunch of code in the if branch versus the else branch. That's again the programmer's decision: is it safe, and is it worth it, to do both computations? Does it take more cycles to do both computations, versus just letting the branch happen?

Unroll loops. Again, remember the code size: every time you unroll a loop, you are saving cycles, but you are also increasing the code size. So always measure the code size. If you unroll a loop eight times, go back, compile your code, and see "oh, God, it's 300K," go back and reduce the unroll factor. Unroll SPU loops to reduce dependencies.

The transform-light workload has already been demonstrated. There is a paper out on developerWorks, again, which demonstrates different workloads and how they were done: if there was a matrix multiplication workload that showed a 10x or 12x benefit, how did they get the benefit? They go over the strategies they used. There's a paper on FFT, both from Mercury and from three developers at IBM; we showed the figures from it too. They demonstrate, "okay, if I produced these results, here is how I did my optimizations," the approach and everything. There are different papers available for different suggested workloads; again, all this material is already out there. And this is some data using the XL C compiler, some demonstrated results for the transform-light workload. Again, this is an example workload, and it's hardly this small; the code is, what, not 12 lines, maybe 20, 25, 30, less than 50 lines of code. So it's a small workload, but a really crucial one in a lot of areas.
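Returning to the if-else example above, here is a minimal sketch of the branchless version using the standard SPU intrinsics; the vector types and variable names are illustrative.

    #include <spu_intrinsics.h>

    /* Branchless equivalent of: if (a > b) c += 1; else c = a + b;
     * applied lane-by-lane to four ints at a time. */
    vec_int4 select_example(vec_int4 a, vec_int4 b, vec_int4 c)
    {
        vec_uint4 sel = spu_cmpgt(a, b);   /* all-ones lanes where a > b  */
        vec_int4  c1  = spu_add(c, 1);     /* the "then" result: c + 1    */
        vec_int4  ab  = spu_add(a, b);     /* the "else" result: a + b    */
        /* spu_sel takes bits from the second operand where the mask is 1,
         * so lanes with a > b get c1 and the rest get ab -- no branch at all. */
        return spu_sel(ab, c1, sel);
    }

Both results are computed unconditionally, which is exactly the trade-off discussed above: a few extra arithmetic cycles in exchange for never paying the 18-cycle miss penalty.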
By demonstrating a workload like this and putting huge amounts of data through it, you can take a sizable program and show that, yes, by optimizing with techniques like these, you see the resulting performance.

Function inlining is another thing. If there is a function that you can inline, we highly recommend you inline it. It eliminates the two branches associated with function-call linkage: the first is branch-and-set-link, to store the address it's branching away from, and then the indirect branch for the function-call return. But over-aggressive use of inlining can result in code that consumes a lot of local store space.

Then there are techniques like software pipelining; this is just a general concept, not particular to Cell programming. A lot of times we try to enforce pipelining where we're doing two things in one loop. The way you write the code is basically just like double buffering: while you're doing this part, you're also doing a second part of the computation at the same time.

Avoid integer multiplies. The SPU only has 16-bit integer multiply support, so if you want to do a 32-bit multiply, you need five instructions: three 16-bit multiplies and two adds. So avoid 32-bit multiplies if you can. Keep array element sizes to a power of two. And casting: in order to keep a multiply at 16 bits, cast the operands to unsigned short before you do the multiplying (see the casting sketch below). And because the SPU only supports quadword loads and stores, and it always does them on 16-byte boundaries, try to avoid scalar code. SPU code should always be vectors, vectors, and vectors; only when absolutely required, say for a loop counter or something, should you use scalar variables. There are instructions available to switch from scalar to vector and back, to store values into particular slots of a vector or to extract values out of one.

Choose a SIMD strategy appropriate for your algorithm, and evaluate an array-of-structures versus structure-of-arrays organization: basically, do you want to store all the x values in one vector, or do you want to store x, y, z, w values together? It's your trade-off; well, actually, it's not your choice so much as the algorithm's choice as to which is the better strategy. Make all these design decisions early on, and spending real time on them will really yield a good result. And this is another example, showing what kind of results you would expect from a vector-across computation versus a parallel-array organization: one subdivision point at a time for four independent triangles, versus four subdivision vertices at a time for a single triangle. It's just another demonstrated workload, and there's more documentation available for it on the web. And again, there are different strategies for partitioning and work allocation; I think the next slide covers the porting techniques, where your programming model can vary.
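First, the promised sketch of the unsigned-short casting trick, assuming the values are known to fit in 16 bits; the function name is illustrative.

    /* If both operands are known to fit in 16 bits, casting them to unsigned
     * short lets the compiler emit a single SPU 16-bit multiply instead of
     * the five-instruction sequence a full 32-bit multiply expands to. */
    unsigned int mul16(unsigned int a, unsigned int b)
    {
        return (unsigned int)(unsigned short)a * (unsigned short)b;
    }

The casts tell the compiler the upper halfwords are zero, which is what makes the single-instruction form legal.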
Broadly, there are two programming models: PPE-centric versus SPE-centric, the master-and-slave kind of configuration, and the fork-and-join kind of computation. FFTs largely use fork and join, right, because there are numerous small computations; you break the problem into its simplest pieces and at the end combine the results from all those computations into one result. Offload as much work onto the SPEs as possible. Accommodate potential data-type differences: look at your data and see if you really need 64-bit doubles, or floats versus 32-bit versus 16-bit integer operations. Exploit the SIMD units: write your code so that there are no false dependencies, and let the compilers do the auto-SIMDization; if there are false dependencies in your code, the compiler's hands are tied. Write assembly; if you're an assembly person, write assembly. Use programmer-specified hints and directives; there are features available, like branch hints, function inlining, loop unrolling. The techniques are there, and it's quite interesting to deploy them and see the difference they can make in your code's performance.

Use spulets for quick prototyping. For example, you just want to know: "I just want to see if the SPE can do this computation fast enough for me. I have very high expectations, right?" You don't want to deal with compiling your program on the PPU and sending the data from the PPU down to the SPU; you just want to compute on a small bunch of data and see. Or maybe you just want to check the vectorizing part of it. So you can write a spulet, which is a standalone SPU program: the data is already there in the SPU main program, so you don't have to do the sending and everything, and that gives you a quick read (a minimal spulet sketch appears below). Then, if it's good enough, you can add the DMA part and everything else.

There's also software cache handling for applications, the software-managed cache. The source is available in one directory of the SDK, documentation in another, and there are examples too. So we're giving you everything: documentation, the source code, and test programs that demonstrate its effectiveness.

So, several levels of parallelization are available on the Cell processor. Now, this part of the presentation deals with a step-by-step approach to porting an application, with some real-world examples of what problem we had in hand and how we actually tackled it. One level is SIMD processing: we have the ability to vectorize our code. Dual issue is there, so we can make sure that we're not executing only one instruction at a time; we can execute up to two instructions per cycle, basically. Multithreading support is available. There are multiple execution units with a heterogeneous architecture, right? Then we have shared-memory multiprocessing, and then distributed-memory multiprocessing; this is the cluster kind of configuration, where you can use Cell Broadband Engines in a cluster. The EIB, with its high bandwidth, is part of what enables us to do that. So, next: the basic steps for parallelizing a program.
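For reference, a spulet is just an SPU program with its own main(), built with spu-gcc and run directly, with no PPU-side driver code or DMA. Here is a minimal sketch using the SDK's standard SPU main signature; the kernel and working set are illustrative.

    /* Minimal spulet: standalone SPU program, data already in local store. */
    #include <spu_intrinsics.h>
    #include <stdio.h>

    static vec_float4 data[256];  /* illustrative working set, lives in LS */

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        (void)speid; (void)argp; (void)envp;

        /* Prototype kernel: scale every element, entirely in local store. */
        vec_float4 k = spu_splats(2.0f);
        for (int i = 0; i < 256; i++)
            data[i] = spu_mul(data[i], k);

        printf("spulet done\n");
        return 0;
    }

Because everything stays in the local store, what you're timing is purely the compute and vectorization question, with no transfer effects mixed in.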
First of all, when we have an application in hand, we have to find out whether there is concurrency, right? If there is no concurrency, if the whole program is serial, with a serial flow of data, then it's really hard to parallelize. So the key questions are: what is the data flow? How is the data coming in, and how is the data getting out? What are the dependencies? We have to find the dependencies and whether there's any concurrency in the data; in other words, whether there is one section of code that repeatedly operates on a big bunch of data several million times. If there is still room for concurrency but the application is written in a very serial manner, restructure the application so that the concurrency is either increased or you have the means to exploit it.

The top inhibitors of performance are data dependencies and synchronization overhead: try to reduce the dependencies, and try to reduce the synchronization you really have to do. The less synchronization you do, the more time you save. In other words, even if everything else is fast, the memory transfers are fast, you're doing double buffering, you're vectorizing the code, if in order to synchronize one task with another task and the other SPEs you're doing a lot of back-and-forth DMAs, mailbox messages, and synchronization variables, it will negate, or at least neutralize, the benefit you're getting from everything else. So watch out for that; that's the message. And when you're partitioning work, make sure you make at least some kind of initial decision about load balancing. And obviously, if you're migrating from another platform, there are differences in bus bandwidth and topology, so observe all those differences and plan accordingly.

So, locating the concurrency is the first step: analyze the program and its algorithms and data structures. What is the overhead of synchronizing? How much synchronization do you really need? Is there any way you can get around these dependencies? Identify the program hotspots. Have you ever used gprof, the GNU profiler? It's a very neat profiling tool. Once you run it, it gives you an output with the number of times each function is called and the total time spent in each function; it produces a flat profile with all these different parameters. You look at the parameters and you're like, "oh yeah, so this is the code that is being entered maybe 100 times or 1,000 times. This is the part that I really need to optimize; the rest of it I don't need to care about. This is the high-traffic part." (A tiny example of that workflow appears below.) So tools like that are available to help figure out the hotspots in the scalar code.

Then ask: do all processor elements update shared data? Is there any shared data, and how is shared memory used for it? Does one task need another task's data? Are there any race conditions, where data might be read before the other processor has written it back? Is there too much synchronization overhead? Are the interactions among tasks synchronous or asynchronous? And are the processing advantages worth the data communication cost? We look at all these things to locate the concurrency, and then structure the application to exploit it. There are several considerations, right? Task- or service-level parallelism, data partitioning, task grouping, how can you group the tasks together, and divide and conquer: divide the work into several tasks, and assuming there are few dependencies between those tasks, it can be a very effective approach.
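On the gprof point above, here is a toy C program with the usual gprof steps in a comment; the hotspot function is obviously contrived.

    /* Build and profile (standard gprof workflow):
     *   gcc -pg -O2 hotspot.c -o hotspot
     *   ./hotspot            # writes gmon.out alongside the binary
     *   gprof hotspot gmon.out
     * The flat profile then shows hot() dominating the run time, which tells
     * you where to focus the SIMD and partitioning effort. */
    #include <stdio.h>

    static double hot(int n)             /* the deliberate hotspot */
    {
        double s = 0.0;
        for (int i = 1; i <= n; i++)
            s += 1.0 / i;
        return s;
    }

    static double cold(void)             /* called once; barely registers */
    {
        return 42.0;
    }

    int main(void)
    {
        double s = cold();
        for (int i = 0; i < 1000; i++)
            s += hot(100000);
        printf("%f\n", s);
        return 0;
    }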
It could be a master-slave arrangement, it could be fork-and-join. The considerations for distributing application workload and data are the processing load distribution, the program structure, the data flow, and the data access patterns; observe all of these. And I'm sure there are lots of open-source tools available to help with this, so that we don't have to stare at hundreds and hundreds of lines of code. It's impossible to look at 10,000 or a few million lines of code and say, "okay, where are my hotspots?", right? It's very hard to do, it's unrealistic, so there are tools available that will do it for you.

Again, application partitioning can be basically PPE-centric, where the main application resides on the PPE, and it either stages the work like a pipeline, you do this part, then that part, then this part, or runs parallel stages. With parallel stages, the PPE has one task and it spawns, say, three SPE threads that all work at the same time (a minimal PPE-side sketch of this appears below); with the pipeline, there are dependencies from the first stage to the second, so the first SPE thread finishes and only then can the next SPE thread be started. The steps of parallelizing a program: first understand the program, choose the programming tools and technology, pick a high-level parallelization strategy, then a low-level parallelization strategy, then design the data structures, iterate and refine, fine-tune.

So, we have this video, right? This is about video surveillance. We have a video and we are monitoring it. There are techniques called segmentation and background subtraction by which you can detect activity. In other words, the video looks like this, and you try to build up the background over a period of time so that you can find the differences. How do you monitor activity, how do you monitor a change in activity? The only way is to find all the differences between the foreground and the background, right? So there is this background frame that gets built over a period of time, and via segmentation you can detect the foreground objects and differentiate them from the background image. The end result would be something like this; there was probably something over here, I don't know why she's cut off like that.

Now, the complexity, right? There are several complexities with this. In general, in the real world, with security cameras on top of all these buildings, there are moving rainbow artifacts at sharp edges; it's called the NTSC effect or something like that. Then you have to adjust for weather changes: there might be wind, there might be very low visibility, fog, and rain. So the analysis has to adjust for all of this; not the camera, but the data analysis. In other words, even if there is rain, there has to be some mechanism by which you can still analyze the data. Then there's handling swaying cameras, if the camera just moves: how do you make sure the software is capable of detecting that and re-analyzing the images? And handling things like auto gain control, I don't know exactly what auto gain control is, and auto white balance. So basically, in this BGS application, where BGS stands for background subtraction, you build the background model over time.
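Here is the promised minimal PPE-side sketch of the parallel-stages idea, using the SDK's libspe2 API. The embedded program handle name (bgs_spu) and the per-thread argument scheme are illustrative assumptions.

    #include <libspe2.h>
    #include <pthread.h>
    #include <stdio.h>

    /* SPU program embedded into the PPE binary at link time; the name is an
     * illustrative assumption for this sketch. */
    extern spe_program_handle_t bgs_spu;

    static void *run_spe(void *argp)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        unsigned int entry = SPE_DEFAULT_ENTRY;

        spe_program_load(ctx, &bgs_spu);
        /* Blocks this pthread until the SPE program exits; argp typically
         * carries the effective address of this thread's work descriptor. */
        spe_context_run(ctx, &entry, 0, argp, NULL, NULL);
        spe_context_destroy(ctx);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[3];
        static int work[3] = {0, 1, 2};   /* illustrative per-thread work ids */

        /* Spawn three SPE threads that run concurrently (parallel stages). */
        for (int i = 0; i < 3; i++)
            pthread_create(&t[i], NULL, run_spe, &work[i]);
        for (int i = 0; i < 3; i++)
            pthread_join(t[i], NULL);

        puts("all SPE stages done");
        return 0;
    }

A pipelined variant would instead run the contexts one after another, which is exactly where the load-balancing problem mentioned later comes in.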
And in image processing, or in imaging in general, there's no such thing as an empty frame, right? No scene is empty; it's always a mixture of pixels of some kind, with grayscale or 8-bit color values. So you are tracking the illumination changes and adjusting to them accordingly. There are all these stages in a full-fledged application. This is the input image: you do smoothing to clean it up, compute motion energy, process it for shadows, highlights, and texture, find the texture differences and pixel differences, and then there's all this further processing you could do, quiescent area addition, slow-update blending; I don't know all of it, but some parts of it, yes, I have worked on, like stationary healing, and then we get our background. All these stages go in, and even when the almost-final stage comes out, you still have to do statistical thresholding, contour smoothing to smooth out the edges, and component pruning, and finally come out with a mask. There are algorithms and detailed information available; if you are interested in this particular workload, we can provide you some documentation.

So now, our phase was to do that video compression and analysis part, right? We had to do the image preprocessing part, the salience detection part, and the mask generation, and at the same time do model maintenance, which then gives us the final objects (a sketch of the core subtraction step appears after this section). Now, at that time there were many design choices, right? We had this full application with a few hundred thousand lines of code. Many functions were involved, and the functions tend to access multiple frames; every single function needs access to some kind of frames at different times. And some frames contain color images, which makes it much more complicated, because we're not dealing with just 0-to-255 grayscale pixels anymore but with 8-bit color, maybe more, 16-bit. The original code was written in C++, and processing one frame touches four to six megabytes of data, for just one frame.

So we had three alternatives, right? Complete the BGS, the background subtraction, in a single SPE for one video stream, so take one stream and finish it on one SPE; or organize the functions into groups, with one group per SPE; or partition individual BGS functions across multiple SPEs. So what did we finally do? These were the alternatives. Take one video stream and give it to one SPE: in this case we needed code overlays, because we're trying to give the entire processing responsibility to one SPE at a time. This is where we used code overlays, which Duke will cover in the next session, and it made a real difference. The other option was to arrange it as a pipeline; this didn't work very well, because load balancing comes in, and some phases take longer than others. The third option was data partitioning across SPEs; in this particular application that was not so easy, because there were lots of dependencies and lots of functions, and all these functions were accessing all these frames at different times, so this we could not do. We weighed all these design considerations and went with the first alternative.

So this is what we finally did. Data partitioning was not feasible for all functions, and all image accesses are stored in and served from main memory. The code partitioning criteria, code size and compute time, are the two key criteria for finding the code partitioning strategy. We figured out there are four compute-intensive pieces plus the coordinator, each module ranging from 180 kilobytes to 240 kilobytes with the code and the temporary buffers we used.
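Here is a minimal sketch of the per-pixel background-subtraction step itself, using SPU intrinsics on 8-bit grayscale pixels. The threshold value and function name are illustrative assumptions, and a real BGS pipeline does far more (blending, shadow handling, pruning).

    #include <spu_intrinsics.h>

    /* Compare one quadword (16 pixels) of the current frame against the
     * background model: |frame - background| > threshold marks foreground. */
    vec_uchar16 bgs_mask(vec_uchar16 frame, vec_uchar16 background)
    {
        vec_uchar16 thresh = spu_splats((unsigned char)25); /* assumed threshold */
        vec_uchar16 diff   = spu_absd(frame, background);   /* per-byte |a - b| */
        /* 0xFF in lanes where the difference exceeds the threshold (foreground),
         * 0x00 elsewhere (background). */
        return spu_cmpgt(diff, thresh);
    }

Sixteen pixels per call is exactly the kind of data-level parallelism that made the one-stream-per-SPE alternative pay off.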
And this was the final process we came out with. First, allocate the data block in system memory; obviously we have to bring it up into RAM. Second, pass the address of the data block to the SPE function. Third, the SPE initiates the DMA to transfer the data into its local store. Fourth, the SPE initiates the DMA to transfer the results back. Then the SPE notifies the PPE of job completion. So that was the rough structure. This is one critical application where IBM has a big play, with surveillance for airports and stores and everything.

Another example is image compositing. We have the base image B, an overlay image A, and we want to come out with this resultant image. For all the images there are RGB and alpha values, and this was the formula to be applied to all the pixels, to all the RGB and alpha values. This was to be done for several hundreds of thousands of pixels, and this is how we arranged the data flow for it. In other words, read A and B: read image A and image B (if you go back to this image, we have the base image B and then the overlay image A, right?) and apply the formula, but do it such that you can overlap the different stages and form a pipeline. It's just a basic pipeline; we're enforcing the pipelining concept into our code right here. So while you compute A = A over B, you write the previous result back to main memory, and at the same time you fetch the next two blocks of images A and B. While you're doing the write, you're doing the compute, A = A over B, on the next two blocks, and at the same time initiating another DMA operation (a minimal sketch of the A-over-B operator appears at the end of this section). This is how we bring good performance out of some applications, and these are the performance results: 120 "A over B" images per second on four SPEs, with some more numbers for more images as well. So, multi-buffering helps.

Now, remember: you do have VMware, but I strongly, strongly recommend that if you have an x86 PC, partition it. It's high time you switched to Linux: create a Linux partition (there are several partitioning tools available) and install SDK 2.0 in a native environment. VMware is good; it gets your feet wet and gets you up to speed real quick, but when you want to do real applications, real workloads, it's slow. You will realize that you can't even open up a browser and check your email, because it takes forever to come up; VMware takes all your memory. So really, if you want to get down to developing a real workload, please do it natively; otherwise there's too much overhead. There's also overhead in creating threads, so the fewer threads you create, the better. Okay, that's it.
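To close the compositing example, here is a minimal sketch of the per-pixel "A over B" operator with SPU intrinsics, assuming non-premultiplied float channels in [0,1]; the exact formula used in the demonstrated workload may differ.

    #include <spu_intrinsics.h>

    /* Porter-Duff "A over B" for one quadword of a single color channel:
     * out = alphaA * A + (1 - alphaA) * B, four pixels per call. */
    vec_float4 a_over_b(vec_float4 a, vec_float4 b, vec_float4 alpha_a)
    {
        vec_float4 one = spu_splats(1.0f);
        /* spu_madd(x, y, z) computes x*y + z in one fused operation. */
        return spu_madd(alpha_a, a,
                        spu_mul(spu_sub(one, alpha_a), b));
    }

In the pipeline described above, this kernel runs on the current pair of blocks while the DMA writes the previous result out and fetches the next blocks in.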