So today we will cover some of the good stuff: basically how to get the juice out of the hardware, how to see what the performance numbers are, how to performance-monitor your applications, some tools, some GUI interfaces which are pretty user friendly. Let's start the morning with a demo. This was a workload run on the Cell Broadband Engine using the tools of one of our business partners, RapidMind, whom I mentioned yesterday. The reason we're showing this is that it's one real application, demoed at several conferences, on crowd simulation, basically particle simulation. In this video we show a demonstration program written with RapidMind. The person talking in the background is Stefanus Du Toit; the company originated in Canada, came into existence only a year back, so it's a pretty new company, and its platform targets processors such as the Cell Broadband Engine and GPUs, as shown here. In this demonstration over 16,000 chickens are being simulated simultaneously. Each chicken looks to its neighbors to decide on what's going on. This quickly leads to the formation of flocks of chickens. Let us follow a single chicken in the crowd. We'll take control of the chicken to show how the other chickens react. In order to give the other chickens some incentive to follow us, we'll change our controlled chicken into a rooster. What makes this demo interesting is that, because it was written with the RapidMind development platform, it can run unmodified on any number of high-performance parallel processors, such as the Cell, graphics processing units, or multi-core CPUs. The developer of this application did not have to know anything about these processors while writing the program. In fact, the basic algorithm for the simulation was written in a single day. We can vary how the chickens react by changing some basic parameters used in the simulation. Let's save our rooster by making the chickens more scared of one another.
The change we've made to the chicken behavior quickly has a drastic effect on the overall shape of the flock. By zooming out to see the entire pen, we can watch how the chickens spread out. To mimic an online gaming environment, the simulation is computed on a server separate from the graphical client. The client can view one of four zones, all of which are computed simultaneously by the server. Note that both the server-side simulation and the client-side display were implemented using standard C++ with the RapidMind development platform. Our platform integrates directly with C++, allowing developers to use their existing C++ skills and tools. For more information on RapidMind, please visit our website at www.rapidmind.net. The other demo that we have is on medical imaging: a comparison of a Cell solution versus a PC solution. This is how the PC renders and reproduces the data, and that's the speed of the Cell processor. You can see that the PC solution takes two seconds per slice, six minutes to render the entire volume, versus the Cell, which took just two seconds to do the whole thing. This was something we also demoed at our conference in Germany, and we got pretty good feedback from all the people attending the event. So that's another demonstrated workload. All right, so let's elaborate more on the SIMD part: how do we get the performance out? This is one of the key features. One thing I want to say is that it's not realistic to expect that, because we have eight plus one, nine cores, every application will see nine- or ten-way performance. We cannot. If an application is not a good fit for data-level parallelism, or if it has a lot of random memory accesses and a lot of branches that need to be taken, then you probably won't even see a 2x performance benefit.
So it's a certain type of application: high-performance computing, certain algorithms in HPC, in seismic processing, or in aerospace and defense. It's that category of applications, the ones suited to supercomputing environments, that are a good fit for Cell. We're not saying any and every application will definitely see a speedup; it's possible it will even be a negative speedup. It depends from application to application. We have identified a whole lot of applications with real-world relevance and emerging-technology aspects that have been proven to run really well on the Cell Broadband Engine. So this is the first topic: SIMD programming, single instruction, multiple data. How do we save on clock cycles? How do these multiple cores give us the performance benefit? That's what we'll cover in this presentation. Via SIMD we are exploiting data-level parallelism. Yesterday I went over the idea that we operate on vectors. We're not doing 16 separate loads for 16 characters, 16 bytes of data; we operate on vectors, so we do only one load. That saves you 15 more cycles to go do something else. And not only the load part: even the compute part, the arithmetic, we don't have to do 16 additions to compute on 16 characters or 16 bytes. We do one cycle, be it an add or a multiply or a subtract or whatever arithmetic operation it is, instead of 16 cycles. You save time on loads, you save time on compute, and that's only one part of it; many more aspects come together. The SIMD concept is fully exploited on the Cell processor. The SPUs are very well equipped to do all kinds of vector operations. As we have seen, each SPU has 128 registers; they're all vector registers, and they are 128 bits wide.
So they're 16 bytes in length, and you can store four full words (32 bits each), eight half words, or 16 bytes. Again, to illustrate the simple addition point: in one cycle we're dealing with, say, four integers. Register A, vector A, will contain 16 bytes, that is four floats or four integers. We have already loaded 16 bytes at a time, four integers, from each input, and then in one cycle we're doing four adds. Compare that with a scalar architecture: what would you do? You would do A0 plus B0 equals C0 in one cycle. In the next cycle, A1 plus B1 equals C1. In the next cycle, A2 plus B2 equals C2. That's a normal, typical sequential program. With the Cell processor, or really with this concept of vectorizing, or SIMDizing as it's also known, we're trying to save on cycles everywhere. Let's pick up another SIMD concept. All data is stored on 16-byte boundaries, right? Loads and stores are 16-byte aligned, and all accesses and computations work on 16-byte boundaries. So let's pick up one vector, B0 through B3, storing four 4-byte-wide values. Register R1 has B0 through B3, and R2 has C0 through C3. That's one vector and the second vector. When we do an add, they're always aligned. In other words, if you want B0 plus C2, simply adding R1 plus R2 won't put the result in the correct location; we'll see another example of that. Whenever you add R1 plus R2, the elements are always added in order: B0 can only be added to C0, B1 to C1, B2 to C2, and B3 to C3. Whenever we combine two vectors, the elements are always combined at the same alignment, the same offsets. For moving data between slots there are the SIMD cross-element instructions: we have support for shifts and rotates, permute and shuffle. It's a similar concept.
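The lane-aligned behavior just described can be modeled in plain C. This is a sketch, not an SPU intrinsic: the `vec4i` type and `vec4i_add` helper are hypothetical names standing in for a 128-bit register and a vector add, and the point is simply that lanes combine strictly by position and never cross.

```c
#include <assert.h>

/* Plain-C model of a four-way SIMD add (hypothetical helper, not an
 * SPU intrinsic): lanes combine strictly by offset -- B0+C0, B1+C1,
 * B2+C2, B3+C3. Lanes never cross. */
typedef struct { int lane[4]; } vec4i;

static vec4i vec4i_add(vec4i b, vec4i c)
{
    vec4i r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = b.lane[i] + c.lane[i]; /* same offset in both inputs */
    return r;
}
```

On the real hardware all four lane additions happen in a single instruction; the loop here just spells out which lane pairs up with which.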
In traditional vector processing you call it permute. So let's pick up this register. Whenever we load, we load 16 bytes; if we want to load just B1, we cannot do that. Whether we want it or not, it will always load 16 bytes at a time. So we really have to find out where the data is located. The offset of the data within a vector is really critical when it comes to SIMD operations. Sometimes data does not reside in the place where we want it. For example, suppose you want B2 to be located in B0's slot and not over here. If you want to move things around, how do we do that? Obviously the architecture has to provide some options; those are shifts and rotates. You can do a rotate or shift to move a byte of information to another slot, or move four bytes at a time to another slot. You can give indexes and say, okay, I want this data to be residing right here. This is particularly important in a lot of image-processing applications and gaming applications where you want to pack and unpack values. You have RGB values. When you're operating in a vector fashion on RGB values, you want to apply, say, some kind of shading to all the red values, all the green values, and all the blue values. Once you're done doing that computation, you want to store the data as RGB, RGB, RGB, and not as all reds, then all greens, then all blues. The image has to appear like an image and not rectangles of red and green and blue. So when we have to combine all these values back, or even for a grayscale image with pixels from 0 through 255, whatever computations we do, at the end of the day we want to combine them together. That's where permute and shuffle are very low-overhead, very lightweight, extremely useful operations.
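The RGB pack/unpack idea can be sketched in plain C. On the SPU the pack and unpack would be done with shuffle masks; here the byte movement is written out scalar-by-scalar so the data rearrangement itself is visible. The helper names are made up for illustration.

```c
#include <stddef.h>

/* Sketch: split interleaved RGBRGB... pixels into separate R, G, B
 * planes (so each plane can be processed vector-style), then merge
 * them back into interleaved form for display. */
static void unpack_rgb(const unsigned char *rgb, size_t npix,
                       unsigned char *r, unsigned char *g, unsigned char *b)
{
    for (size_t i = 0; i < npix; i++) {
        r[i] = rgb[3 * i + 0];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}

static void pack_rgb(const unsigned char *r, const unsigned char *g,
                     const unsigned char *b, size_t npix, unsigned char *rgb)
{
    for (size_t i = 0; i < npix; i++) {
        rgb[3 * i + 0] = r[i];
        rgb[3 * i + 1] = g[i];
        rgb[3 * i + 2] = b[i];
    }
}
```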
So one of our goals at the end of the day is to make sure that at least the shuffle operation, which is a pillar, a very important concept in vector programming, is well understood by everyone. And we'll see an example of shuffle very soon. Basically, shuffle picks bytes out of the source registers under the control of a mask: I want this byte from this register in this location, I want that byte from that register in that location. You create a mask and apply it to your input data, and the resultant data will be arranged exactly the way you want. Here is an example of the shuffle operation. We have registers VA and VB, and we want the data arranged like this. Vector VA has got all these values, A0, A1, A2 and so on, in order. But you may be a person who says, oh, I'm not that organized, I don't want such sequential, boring data, I want some excitement, I want to move data around: in this location I want element one, and over here I want elements four and eight. This is the rearrangement you want. How do we do that? Very simply, very straightforward. We create a mask register, VT, and initialize it byte by byte. Remember, these mask entries are all byte-wide, split into two four-bit halves: four bits here and four bits here. The convention is that the first input register, VA, is always considered register 0, and the second input register, VB, is always register 1. So to say "from the 0th vector, that is the first vector, pick up the first element," the high four bits are 0 and the low four bits hold the element index; to say "from the B vector, which is 1, pick up the fourth element," or "from vector B, pick up the eighth element," the high four bits become 1. In other words, the first four bits of each mask byte will either be 0 or 1. So in the real world, when you initialize vector VT, it looks like hex values.
Hex 01 would be the first byte, the second byte would be hex 14, and the fourth byte would be hex 10. By observing the value of the first four bits you find out which register you're supposed to pick the value from, and by observing the next four bits you find out which byte you should pick from that register. So when we look at A6 over here, we're saying: in this same location right here, we want to pick up the sixth byte from vector A, which is here, and place it in this location. From vector B we want to pick up the fifth element, B5, and put it right here; from vector B we want to pick up the ninth element and put it right here. So as you define the mask byte by byte, each mask byte says which vector you're picking from and which element you're picking from that vector, and the result lands in the same location as that mask byte. It's purely byte-oriented, as the name says. In native SIMD programming there are intrinsics, the instructions we discussed yesterday, and they look like this: spu_add will add. If A and B are vectors of integers, it will do four adds, right? Vector A will contain four integers at a time and vector B will contain four integers at a time. Data types and control flow all look the same as in the traditional programming language. Vectors are always aligned on 16-byte boundaries. Now if we say vector float star P, P is pointing to a memory location that holds 16 bytes of data. So what will P plus 1 be? You've basically jumped ahead another 16 bytes, right? We have to keep that in mind: when we do P plus plus, it's pointing to the next 16 bytes. And again, these are the instructions that we looked at yesterday.
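The byte-select rule just described can be modeled directly in plain C. This is a sketch of the behavior, with a hypothetical `shuffle16` helper standing in for the intrinsic: each of the 16 mask bytes uses its high nibble to choose the register (0 = first input, 1 = second) and its low nibble to choose the byte index within that register.

```c
/* Plain-C model of the shuffle described above: mask byte 0x01 means
 * "byte 1 of register A", 0x14 means "byte 4 of register B", and so
 * on. out[i] is filled from whatever source byte mask[i] names. */
static void shuffle16(const unsigned char va[16], const unsigned char vb[16],
                      const unsigned char mask[16], unsigned char out[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned char m = mask[i];
        const unsigned char *src = (m & 0x10) ? vb : va; /* which register */
        out[i] = src[m & 0x0F];                          /* which byte     */
    }
}
```

With mask bytes 0x01, 0x14, 0x18, 0x10 (the values from the walkthrough), the first four result bytes come from A's byte 1, B's byte 4, B's byte 8, and B's byte 0.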
spu_splats takes a scalar value and replicates it across all the elements of a vector. If it's a vector float, and you have the value two and you want to initialize with it, it replicates two into all four elements: two, two, two, two. That's what splats does, and similarly for characters. We have coding examples that we'll cover; this we saw yesterday. Let's pick up a simple example now. How do we do vector operations? This is a simple scalar code, right? In the normal real world, until now, this is how we write programs, scalarly. We have an input variable in1, an input variable in2, an output variable out, and the number of iterations n that we want to do. We run the loop from 0 through n, and inside it, out[i] equals in1[i] times in2[i], a simple multiplication, run n times. Now, how does Cell give its power to this program? To start with, we convert all the scalar input variables into vector values. It's as simple as that: just a simple typecast to convert them into vectors. Take your scalar inputs in1, in2, and out, and convert them into vector float vin1, vin2, and vout. And now let's come to the loop. Since all these variables now point to four elements at a time, four values instead of one, if we do a vector multiplication, do we have to run the loop n times? No: n by 4, right? So we have already saved some loop iterations. Loops, even though they're highly predictable, still carry some overhead in the branch, because there are at least two instructions: you save the current position, branch, and then you go back to where you last left off. So reducing loop overhead is also one important key to getting performance out of applications.
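The scalar-to-vector transformation just described can be sketched in plain C. On the SPU each group of four would be a single vector multiply on `vector float` data; here the four lanes are written out explicitly so the drop in trip count from n to n/4 is visible. Function names are made up for illustration, and n is assumed to be a multiple of 4.

```c
/* Scalar version: the loop runs n times, one multiply per trip. */
static void mul_scalar(const float *in1, const float *in2, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in1[i] * in2[i];
}

/* "SIMD-style" version: four elements per trip, so the loop only runs
 * n/4 times. On the SPU these four multiplies would be one vector op. */
static void mul_by_four(const float *in1, const float *in2, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        out[i + 0] = in1[i + 0] * in2[i + 0];
        out[i + 1] = in1[i + 1] * in2[i + 1];
        out[i + 2] = in1[i + 2] * in2[i + 2];
        out[i + 3] = in1[i + 3] * in2[i + 3];
    }
}
```

Both produce identical results; the second simply pays the loop branch a quarter as often.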
So we have saved on the loops. Another concept that we'll look at later is that instead of doing this one multiplication per iteration, we can also process the next four elements in the same iteration: in one loop iteration you do one multiplication on four elements, and then another four multiplications on the next four elements. That's called loop unrolling. In that case you'll be dividing n/4 by another two, because you're doing two of these operations per trip around the loop. That's another concept we'll be covering. So this is how we achieve data-level parallelism: instead of operating on one element at a time, we're operating on four elements at a time in one clock cycle. Here's another simple example. In the scalar code, we loop i equal to 0 through 4K, incrementing i, and we're just incrementing each element of the destination array. In the SIMD fashion, first of all you create another vector and initialize it to 1, 1, 1, 1, because we don't want to increment one element at a time; we want to increment four elements at one time. So we do spu_add of vdest, which is the destination typecast to a vector, plus the vector v1, which consists of all ones. So each iteration increments four values of the array. Now, there's a matrix example over here. Say you have a 4 by 4 matrix, processed in a four-way SIMD fashion. Instead of storing one value in one vector, we're storing four matrix values per vector. In this case it's a dot product. What we can assume is that each row of the matrix is one vector, right? Because there are four values, you can store a row in one vector, assuming they're all integer or floating-point data.
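The loop-unrolling idea above can be sketched in plain C: two four-wide groups per trip, so the branch is taken n/8 times instead of n times. The function name is made up for illustration, and n is assumed to be a multiple of 8.

```c
/* Unrolled-by-two sketch: each trip does two four-wide groups of
 * multiplies (each group would be one vector op on the SPU), so the
 * loop branch is paid only n/8 times. */
static void mul_unrolled2(const float *in1, const float *in2, float *out, int n)
{
    for (int i = 0; i < n; i += 8) {
        /* first four-wide group */
        out[i + 0] = in1[i + 0] * in2[i + 0];
        out[i + 1] = in1[i + 1] * in2[i + 1];
        out[i + 2] = in1[i + 2] * in2[i + 2];
        out[i + 3] = in1[i + 3] * in2[i + 3];
        /* second four-wide group, same trip around the loop */
        out[i + 4] = in1[i + 4] * in2[i + 4];
        out[i + 5] = in1[i + 5] * in2[i + 5];
        out[i + 6] = in1[i + 6] * in2[i + 6];
        out[i + 7] = in1[i + 7] * in2[i + 7];
    }
}
```

Unrolling also gives the compiler more independent work per iteration to schedule, which feeds into the software pipelining mentioned later.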
So we can store the x values in one vector register and the y values in another. Multiply the row register, this whole row, by the x vector register and perform a vector reduction on the product. Again, this is all data-level parallelism: instead of doing one product at a time in the reduction, we're doing four at a time. And another approach to this: the x vector is in one vector register, the y vector is in another. You can copy a scalar element into all four slots, just like the splat operation earlier: create a vector, replicate the scalar value into all four element locations, and use that as one input register to the vector operation. So there are different strategies for doing the same thing. Another point is that we can store data either in a vector-across manner or a parallel-array manner. In other words, say you have vertices with x, y, z, and w components. How do you store them in data arrays? You can create one vector which contains x, y, z, and w, all 16 bytes in one vector; then the stream of vectors is a mixture of all the vertices v0 through vn. Or, in one vector we can store only x values, x0, x1, x2, x3, so there will be a list of vectors containing only the x values, a list containing only the y values, and one containing only the z values. Depending on how your computation is done and which manner is simpler for you, you can pick your approach. There are opportunities for loop unrolling and software pipelining, as we discussed. So what's the porting approach? Basically, you first write the SIMD program for the PPU. Again, it's your preference.
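The two layouts just described are often called "array of structures" (one vertex's x, y, z, w together in one 16-byte vector) and "structure of arrays" (all x values in one stream, all y values in another, and so on). A plain-C sketch, with hypothetical type and function names:

```c
#define NVERT 4

/* Vector-across layout: one vertex per 16-byte vector. */
struct aos { float x, y, z, w; };

/* Parallel-array layout: one component per array, four vertices each. */
struct soa {
    float x[NVERT], y[NVERT], z[NVERT], w[NVERT];
};

/* Convert vertex-at-a-time storage into component streams, so that
 * e.g. all four x values can be processed as one vector. */
static void aos_to_soa(const struct aos *v, struct soa *s)
{
    for (int i = 0; i < NVERT; i++) {
        s->x[i] = v[i].x;
        s->y[i] = v[i].y;
        s->z[i] = v[i].z;
        s->w[i] = v[i].w;
    }
}
```

Which layout wins depends on the computation: per-vertex work favors the first, same-component work across many vertices favors the second.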
Sometimes you may just say, no, I don't want to worry about writing code on the PPU and then porting it over to the SPU. Or what you can do is take a scalar program, convert it fully into vector form, and run it on the PPU using the vector instructions. Once you have the entire application running on the PPU, then observe: okay, what are the parts that I can break off as a module, put into an SPE thread, and fire off? Let that SPE do the work. So you can pick the approach that's most convenient. And when we do try to split the work across the SPUs, you have to think about strategies. You may be splitting the data up into 64K chunks and sending them over to the SPU. What is the code size? Once you build the application, how big is the code coming out? Depending on that, you have to divide the data. Say I'm using up 200K for my code alone; then you're only left with 56K for data. Obviously there's something wrong: too much code and very little room for data. Then you have to see: maybe I don't need this code running on the SPU, maybe this part can run on the PPU. Only the code that does hard-core computation should be offloaded to the SPU, so that there's a decent balance between the code size and the data you can send it for computation. Let's pick another example: complex multiplication. This is a really good example because it gives us a real view of how to do multiplications and how to use the SPU intrinsics. So (A + iB) times (C + iD): this is something we learned in what, fifth standard, right? Fifth or sixth. (A + iB) times (C + iD) is (AC minus BD) plus i(AD plus BC). In this case, B and D are the imaginary values. So in scalar form, in your simple code, your input array is input1, consisting of 2n elements.
The reason it's 2n elements is that it's storing A, the real part, and immediately following the real part is the imaginary number: real number, imaginary number. So input1 consists of A0, B0, A1, B1, A2, B2, A3, B3, and so on. Input vector 2 consists of C0, D0, C1, D1, C2, D2, C3, D3, and so on. They're interleaving the real and imaginary parts in the two input arrays, and similarly the result will contain the real and imaginary parts in interleaved fashion. So in the scalar code, this is what we do, right? AC minus BD, n times. We compute AC by multiplying input1[i] by input2[i], then compute BD, and then do AC minus BD to fill the output array. Now let's see the harder part. That's how simple the scalar code looks; however, it's not as efficient. How do we make it, you know, a hundred more lines but more efficient? That's what the SPU code will look like. The first step is i1 = spu_shuffle(A1, A2, i_perm_vector). Let's understand this step. This is how the input data looks in memory: we're storing A1, B1, A2, B2 in one vector. At most, how many elements can we store in one vector if they're four-byte-wide values? Four, because we have 16 bytes in one vector. So at a time we can only have A1, B1, A2, B2 in one vector; let's call that vector A1. The next vector, A2, will contain A3, B3, A4, B4: A3 is the real part, B3 is the imaginary part, then A4 is the real part and B4 is the imaginary part. Similarly for the second input, vector B1. So this side is all from input A, and this side from input B: the A-side vectors hold A1, B1 and so on up to n/4 vectors, and the B-side vectors hold C1, D1, C2, D2 and so on. So in this case, now we want to compute the products AC.
In other words, we want to multiply all the A values with the corresponding C values. However, in memory they are stored interleaved as A1, B1, A2, B2. So if we multiply vector A1 by vector B1 directly, what will the result be? A1C1 and then B1D1, interleaved. Instead, what we want is to gather all the A values together so that in one clock cycle we can do four multiplications: multiply A1, A2, A3, A4 by C1, C2, C3, C4. So we have to create a shuffle mask. How do we create the shuffle mask? This is the input shuffle mask. We say 0 through 3: pick up the first four bytes from input vector 0, which is A1. Then pick up bytes 8 through 11, which is what? A2. Picture it: the first register covers byte indices 0 through 15 (vector A1 is 16 bytes, so that's 0, 3, 7, and so on up to 15), and the second register covers 16 through 31. You get the picture? When we say pick up 0 through 3, we're saying from the 0th vector pick up bytes 0 through 3, which is value A1; 8 through 11 is A2. Then 16 through 19: if you convert that into hex, the first four bits yield a 1, which means the second input vector. These are input vectors A1 and A2. So pick up 16 through 19, and then pick up 24 through 27, which is A4. Now apply the mask and do the shuffle operation: spu_shuffle(input vector A1, input vector A2, i_perm_vector), where i_perm_vector is the shuffle mask; it is always a vector unsigned char. So now in our resultant vector we have all the A1, A2, A3, A4 values together. Doing the same on the other vectors, we have all the A values together, all the B values together, all the C values together, and all the D values together. So now we want to compute the BD products, and for that we use the nmsub instruction.
Basically, it does a multiplication and negates while subtracting, which gives us the minus BD terms, right? And then we do a multiply-add to accumulate the AC terms: A1C1, A2C2, A3C3, and A4C4. This is why we needed to break out all the real and imaginary parts separately. So it is just computing stage by stage, right? What are the essential parts that we need? We need AC minus BD for the real parts, and AD plus BC for the imaginary parts. So you compute all these pieces: one vector ends up with A1C1 minus B1D1, A2C2 minus B2D2, A3C3 minus B3D3, and A4C4 minus B4D4, because we did the multiplication and the negate right here, over here. And we also have the A1D1 plus B1C1 values computed, and so on for the other elements. Now we need to store them back into memory. So we create another shuffle pattern, because in the resultant vector we want to put the product elements back together. In other words, when we store it back in memory we want the real part, A0C0 minus B0D0, and the next value has to be what? The imaginary part, exactly, interleaved just like the input. So we want to store it in memory like that, and that is what is happening over here: we just use the shuffle mask to our benefit and finally get the end result. One thing to note down: this example program is also there in the hands-on session, and when you look at the code it all comes together better than on a PowerPoint slide.
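The whole recipe above can be sketched end-to-end in plain C for one group of four products. This is a model, not the SPU code: the shuffle steps are represented by de-interleaving into a[], b[], c[], d[], the nmsub/madd steps by the two fused expressions, and the final shuffle by re-interleaving the results. The function name is made up for illustration.

```c
/* Complex multiply of four (a+ib)(c+id) pairs. Inputs interleave
 * real/imaginary parts: in1 = a0,b0,a1,b1,...  in2 = c0,d0,c1,d1,...
 * Output is interleaved the same way: re0,im0,re1,im1,... */
static void cmul4(const float *in1, const float *in2, float *out)
{
    float a[4], b[4], c[4], d[4];
    for (int i = 0; i < 4; i++) {      /* "shuffle": de-interleave    */
        a[i] = in1[2 * i];
        b[i] = in1[2 * i + 1];
        c[i] = in2[2 * i];
        d[i] = in2[2 * i + 1];
    }
    for (int i = 0; i < 4; i++) {      /* compute, then re-interleave */
        out[2 * i]     = a[i] * c[i] - b[i] * d[i]; /* real: ac - bd */
        out[2 * i + 1] = a[i] * d[i] + b[i] * c[i]; /* imag: ad + bc */
    }
}
```

On the SPU the de-interleave and re-interleave are each one shuffle, and each compute line is one fused multiply instruction across all four lanes, which is where the cycle savings come from.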
We can go over it one more time, but the key thing to notice is that we are trying to parallelize and cut out a lot of clock cycles. That's the thing to understand here: instead of doing scalar computations n times, we're trying to do them n/4 or n/8 or n/16 times, reducing the number of arithmetic operations.