What I would like to do: I also have a presentation to give you on the overlay, but I would like to wrap things up a little first. So we will take you through some examples that I have for the class, covering things from the beginning, when we start with a very, very simple assignment. Then we go through SIMDizing, and through the DMA: how we do the DMA, how we do the DMA list. We will discuss the case where we have a large array of data in main memory: how do we DMA that memory down to the local store so our local store program can run on it, and through which techniques (the DMA list, multi-buffering). There is also the discussion of what happens after we exceed 256K bytes or 200K bytes, and who is going to take care of that. Don't forget, the DMA operations are handled by the MFC. Once we issue the MFC list commands, the MFC hardware handles the supplying, using the data and addresses from that list. It will supply the data to your program, and we will show you how to do that.

Okay, let me take you on a tour here, and bear with me; some of these things we may have seen before, or maybe not. Let's say we have a very straightforward program. We all understand this code. We have floats a, b, and c, and then we have a main program, so we do a equal to b plus c and it returns zero. Now I will take you through it and ask you to show me how to convert this program to an SPU program or a PPU program, or the embedded program we talked about over the last two days. Today, if you have any questions, Hema and I will respond and answer them. So we quit here. All right, this is an ordinary program. What do we need to do in order to make this one a PPU program?
Just compile it as is. Which compiler do we use: xlc, gcc, spu-gcc? On t2.c, correct? t1.c, fine, that's good. Okay, I have t1.c, and we will show you what is in it. All right, so we are okay: do a listing, and we have an a.out. Can we execute this program now? No, not yet; we have to bring it over to the simulator. Fine, we won't worry about that.

Now let's make this practice program a PPU and SPU program, and tell me how to do it. Let me bring the program back, t1.c. Which portion do we want to make the SPU part, and which the PPU part? Let's say that this loop here, these operations, will become the SPU program. So we save this file as t1_ppu.c, and then, the lazy way, we write another copy out as t1_spu.c.

So now we go back in here. This is our SPU program. Remember float a, b, and c: we just bring them down here, so we have an SPU application with its own data; we keep the data down in the SPU. That is fine. For main here, the same thing: insert the arguments. This is an SPU program, and there are three arguments that we need to have. Number one is an unsigned long long, the SPE ID, which we need because this is an SPU program; then unsigned long long again for the argument pointer, argp; and unsigned long long for the environment pointer, envp. So we have a, b, and c. Do we need to do anything else here? The data is still here; I brought float a, b, c down to here. So we are finished. Let me make sure that I do the correct
programming here. All right, so we write and quit. Now vi our PPU file, t1_ppu.c. I now have the data down in my SPU, so I just go ahead and delete these declarations. This operation here now belongs to the SPU, so I comment it out, and I still need the return. So this is my PPU code.

The first thing I need to do is create an SPE thread: spe_create_thread, correct? The first argument is my SPE group ID; I put 0. What is the next argument? My program handle: this is the pointer to my program handle, which is t1_spu. The third argument is my argument pointer, argp; I just put NULL here because I will set it. The next one is also NULL, and this is the environment pointer, envp. Then we have a mask: minus one. We use minus one because it is not working properly as of now; this is my SPE affinity. And then we have zero for the flags.

Then what? Then I have to wait: I always have spe_wait, for my SPE ID, with status and zero. Now, going back up: what does this call return? Yes, it returns the SPE ID. Not too fast, right. So we have this one, so we have to
declare the SPE ID somewhere, correct? speid_t speid. I declare status here as well, so I have int status. Now, in this program I use t1_spu, which is my SPU program, so I have to declare something up here, before main: extern spe_program_handle_t, and the name of the program is t1_spu. That's it; that is my SPE program. All right, write and quit.

Now the makefiles. What does my SPU makefile look like? This is just an example, because I would need to have the directory. So now we have the PPU program and the SPU program, and we need to do the makefiles. The makefile on the PPU side is straightforward. Let's do it; I am just making it up as I go. Well, this one is already made; this is the makefile for the SPU, and we don't have anything in it yet. Let's assume this one is the makefile for the PPU: PROGRAM_ppu here, and t1_ppu, correct? And then we have to import. Very good: IMPORTS. Import what? We assume the file t1_spu.a is right there.

Now suppose this one is the SPU makefile. The program is the SPU program, correct? And then over here we specify an embedded library: LIBRARY_embed equal to what we embed. Let's say the SPU program is under the spu directory, and we embed it. So we have to go back to the PPU makefile and set the directories equal to spu. Those are details, you know,
okay: spu/t1_spu.a, and then -lspe, correct? So what I have shown you is typical for PPU programs. One thing is missing; what did I miss here? The include. #include what? libspe.h. This is needed because I do the thread creation; if I did not do any of that, I would not need it. So we have the t1.c program; so far so good.

Let's take a look at the t.c program. Same program, but now we make three float arrays of 1,000 elements each, and in main we now do a loop over the loop count. So far we are okay.

Let me go back here, to the day three material, and show you what I have. I have a directory of seven different test cases: one is PPU only; one is a simple DMA with one SPU, where we do the transfer; then another DMA transfer example; then the DMA list transfer; then multi-SPU, where we split the task across different SPUs; then multi-buffering; and then the software cache. The software cache I mentioned yesterday, but we did not have time to go through it. Forgive me, because I want to bring you up to speed so that we can wrap up: to remind you what we have been through, and now to take some real examples.

In this example, this is the .h file. I use TextPad because it is quicker. What do we have in this .h file? This is an open source version, so it is going to help. Okay, I am going to bring it up. This is my PPU program. Bear with me; I will walk through it a little slowly here, and then I will move on quite fast. stdio.h we know. Is it clear enough for all of you in the back? Okay. So what we have in
here: we define an array size of 1024 and a num of 512, and I use this macro so I can define the alignment of array A, array B, and array C. So it is a little bit of a change, but nothing else. My align macro takes the variable definition and the alignment, and then we define the attribute for it. I use it to define array A, array B, and array C, each an array of 1024 elements aligned to 128 bytes. What I am going to do is add the elements of A and B together and put the result in C.

This is the routine I use to initialize an array, nothing else. I call it init_array, and it goes through all the elements and sets array[i] to i times num, i times num times two, and so on. Clear so far? We haven't done anything yet.

Let's look at the main program. There are some tricks, some techniques that I would like to share with you; it is up to you whether to use them. In here we initialize the arrays, we do an add_array, C equal to A plus B, and that's it. Instead of two numbers I now have arrays. Then we print out the array, some of the numbers. That is my PPU program. So far, are you with me? It is a standard PPU program; we haven't called anything and we haven't split it yet. So the PPU-only program is the only thing we have so far.

Let's take a look at the next ones. In step one, in array_add.h, I define a structure that I call the control block. In it I have a, b, and c, and I also have size, so each of the elements is four bytes, four bytes, four bytes, four bytes. We align this structure to 128 bytes, so I have to pad it here; I have to pass that address, so
I have 128 bytes. So that is my control block. We are finished with that one.

Now we look at step one. Step one corresponds to the program that has not been SIMDized yet, and step two is the SIMDized version, so you have a scalar version and a SIMDized version to look at when you get the time, this afternoon or later; by then you will have my source.

Let's look at the SPU program. I bring this one up in TextPad, and I would like to bring up the SPU program as well, so I can put them side by side and we can go over what is going on. I tile the windows vertically. On my left side is my PPU program, and on the right side is the SPU program, correct?

On my PPU program, I go ahead and define the alignment of my arrays A, B, and C, and then initialize them here; nothing changed. Then the main program initializes the arrays and creates an SPU thread to process them. In this example, I send the addresses through the mailbox, which I really did not want to show you yet, but I will go ahead and show you here. I create an SPU thread, and this is my SPU program. The mailbox is normally used for sending messages; in this case, instead of sending a message, I send an address. I use the SPE ID here, the ID of the SPE thread that I just created, and I send the address of array A; it is put
in that mailbox. This is the address of array A, this is the address of array B, and this is the address of array C. Why do I want to do that? I can send one address when I create the thread, but I have three arrays, so I would have to do the same thing three times. Or I can send the address of a control block and let the control block contain the addresses of my arrays A, B, and C. Or I can do it this way. That is why I want to show you the two different techniques. This is the first technique: using the mailbox to send down the address of an array. By the end of this, I have sent the three array addresses down to the SPE.

So now on the SPE program, I have the control block declared here; this control block, cb, is aligned and local to my SPE program. I still declare arrays A, B, and C right here. I get the control block using the argument pointer of this program: I take the effective address from argp, then the size of the control block, and the tag here is 31. So now I have the control block. Then, using mfc_get, I give it the addresses of the local A and B and the a and b components of the control block, which are the effective addresses of array A and array B. I download them to my local arrays and then perform the calculation right here, the same thing, C equal to A plus B. When finished, I send it back with mfc_put here: array_c is the local array, the effective address is cb.c in my control block, and then the size and so on.

So what we did: before, we had the addition operation on the PPU, and now
we shift it down to the SPU. Instead of handing over the data itself, we keep the data somewhere at an effective address, and we send the control block so that the SPU can acquire the addresses of arrays A, B, and C and then use them to DMA the data.

Now the control block technique: no mailbox, nothing else. We have the control block we defined before in array_add.h, that header file. We declare the arrays here with the array size and 128-byte alignment, and we have a function here to initialize an array. Look at the main program: we initialize the arrays, and then we check the number of SPEs working. This is extra carefulness, if you want to be really careful; otherwise all is the same. And here, I think, it is much clearer for you: this is cb.a, this is cb.b, and this is cb.c. We take the address of array A, its first location, and assign it to cb.a; we take the address of array B and assign it to the b component; then the address of C; and then the size. We set up this structure, and then we create an SPE thread, again with the same things: the group ID, the name of the program, and here is what we send: the control block. So what we actually send is a structure that contains the address of A, the address of B, the address of C, and the size, and each of these arrays has about a thousand elements. Do we care about breaking the transfer up? No, we don't care yet. After that we print and so on; we don't care about that part.

And here we synchronize. We have talked a lot about this: how do we know when we have finished transferring the data, and how do we sync between the local store and the effective address,
this is the typical instruction we use to sync between the two of them. We do that, and we wait for the SPE.

Okay, so this is the SPU program. In the SPU program I define everything the same as before: the arrays are right here, and this is my local control block, cb. I only have the control block address passed down to me in this argument here, so I go ahead and use that argument, with the size of the control block, to bring it down, and now I have the control block down on my SPU. I then refer to its a component as my effective address and do the two mfc_get calls, one to get A and one to get B, and then write the tag mask and so on. The size is already specified by the size field of the control block that I brought down through the argument pointer passed to me when the thread was created. I do the array add, and I put C back here, to cb.c.

So what we did: we have a control block defined somewhere that contains the addresses of arrays A, B, and C, the size, and so on. On the PPU we load it with those values, and we pass the address of that control block down to the SPU so the SPU can refer to the elements within it. That's it. We use those elements as effective addresses so we can download the data into array A and array B and perform this operation here, and we use the same strategy again to send the result back to the PPU program. That is how we pack up the data, and then how we change it.
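The control block technique described above can be sketched in portable C. This is only an illustration under stated assumptions, not the class source: fake_dma_get and fake_dma_put are hypothetical memcpy stand-ins for mfc_get and mfc_put, the "PPU side" and "SPU side" are ordinary functions in one process, and the address fields are widened to 64 bits so the sketch also runs on a 64-bit host (the lecture's fields are four bytes each).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ARRAY_SIZE 1024

/* Control block: three array addresses plus an element count, padded to
 * exactly 128 bytes. Assumption: 64-bit address fields for host portability. */
typedef struct {
    uint64_t a, b, c;   /* "effective addresses" of the three arrays */
    uint32_t size;      /* number of float elements in each array */
    unsigned char pad[128 - 3 * sizeof(uint64_t) - sizeof(uint32_t)];
} control_block;

/* Hypothetical stand-ins for mfc_get / mfc_put: a plain copy between a
 * "local store" buffer and an address in "main memory". */
static void fake_dma_get(void *ls, uint64_t ea, size_t nbytes) {
    memcpy(ls, (const void *)(uintptr_t)ea, nbytes);
}

static void fake_dma_put(const void *ls, uint64_t ea, size_t nbytes) {
    memcpy((void *)(uintptr_t)ea, ls, nbytes);
}

/* "SPU side": receives only the address of the control block (argp). */
static void spu_side_add(uint64_t argp) {
    static _Alignas(128) control_block cb;
    static _Alignas(128) float la[ARRAY_SIZE], lb[ARRAY_SIZE], lc[ARRAY_SIZE];

    fake_dma_get(&cb, argp, sizeof cb);        /* 1. fetch the control block */
    assert(cb.size <= ARRAY_SIZE);
    size_t nbytes = cb.size * sizeof(float);
    fake_dma_get(la, cb.a, nbytes);            /* 2. fetch A and B */
    fake_dma_get(lb, cb.b, nbytes);
    for (uint32_t i = 0; i < cb.size; i++)     /* 3. C = A + B locally */
        lc[i] = la[i] + lb[i];
    fake_dma_put(lc, cb.c, nbytes);            /* 4. write C back */
}

/* "PPU side": fill in the control block, hand over its address, and
 * return the number of mismatching elements (0 means success). */
int run_control_block_demo(void) {
    static _Alignas(128) float a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE];
    static _Alignas(128) control_block cb;

    for (int i = 0; i < ARRAY_SIZE; i++) {
        a[i] = (float)i;
        b[i] = (float)(2 * i);
    }
    cb.a = (uint64_t)(uintptr_t)a;
    cb.b = (uint64_t)(uintptr_t)b;
    cb.c = (uint64_t)(uintptr_t)c;
    cb.size = ARRAY_SIZE;

    spu_side_add((uint64_t)(uintptr_t)&cb);

    int mismatches = 0;
    for (int i = 0; i < ARRAY_SIZE; i++)
        if (c[i] != a[i] + b[i])
            mismatches++;
    return mismatches;
}
```

Only one value, the address of the control block, crosses from the PPU side to the SPU side; everything else is reached through the block's fields. With 1024 floats each array is 4096 bytes, so each array would fit in a single transfer.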
This is the SIMD version of the SPU program we just talked about; this is the vector version. Here I do the same things: my alignment define is still the same. For my arrays, instead of declaring the full array size, 1024 or whatever, I divide it by four, because I now declare each one as a vector: array A is an array of vectors, and so are array B and array C. The control block is the same. This is an SPU program: I get the control block from my argument list over here, check the tag mask, status, whatever, and get the arrays. Nothing has changed in terms of data transfer; the get and the put do the same thing through the control block.

The actual operation changes. Instead of array_c[i] equal to array_a[i] plus array_b[i], we have spu_add; I have to use intrinsic functions to add these. We get four statements over here because we are unrolling the loop right here. So we not only vectorize it but also unroll it as an optimization, so that we speed it up. You could also leave the loop as it is and just do one vector at a time. Pardon me? Yes, four elements right here.

The story here is that in the vectorized version you have to take care of your data: instead of a single element, it is now four elements per instruction, per array entry, and there is the calculation that you need to take care of. Memory transfer remains the same; nothing changes. So you only deal with the data.

Is it possible to get a single variable out of a vector?
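As a rough illustration of the vectorized, unrolled add described in this part of the walkthrough, here is a portable C sketch. It is written under assumptions: vec4f is a hypothetical stand-in for the SPU's vector float type, and vec4f_add is a hypothetical stand-in for the spu_add intrinsic, so the code shows the shape of the loop rather than real SPU code.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE 1024
#define VEC_COUNT (ARRAY_SIZE / 4)   /* four floats per vector */

/* Hypothetical stand-in for the SPU "vector float" type. */
typedef struct { float f[4]; } vec4f;

/* Hypothetical stand-in for the spu_add intrinsic: element-wise add. */
static vec4f vec4f_add(vec4f x, vec4f y) {
    vec4f r;
    for (int k = 0; k < 4; k++)
        r.f[k] = x.f[k] + y.f[k];
    return r;
}

/* c = a + b over n vectors, with the loop body unrolled four times,
 * so each iteration handles 16 floats. */
void add_arrays_simd(vec4f *c, const vec4f *a, const vec4f *b, size_t n) {
    assert(n % 4 == 0);              /* unrolling by four needs a multiple of 4 */
    for (size_t i = 0; i < n; i += 4) {
        c[i]     = vec4f_add(a[i],     b[i]);
        c[i + 1] = vec4f_add(a[i + 1], b[i + 1]);
        c[i + 2] = vec4f_add(a[i + 2], b[i + 2]);
        c[i + 3] = vec4f_add(a[i + 3], b[i + 3]);
    }
}

/* Fill A and B, add them, and return the number of wrong elements. */
int run_simd_demo(void) {
    static vec4f a[VEC_COUNT], b[VEC_COUNT], c[VEC_COUNT];

    for (size_t i = 0; i < VEC_COUNT; i++)
        for (int k = 0; k < 4; k++) {
            a[i].f[k] = (float)(4 * i + k);
            b[i].f[k] = 1.0f;
        }

    add_arrays_simd(c, a, b, VEC_COUNT);

    int mismatches = 0;
    for (size_t i = 0; i < VEC_COUNT; i++)
        for (int k = 0; k < 4; k++)
            if (c[i].f[k] != (float)(4 * i + k) + 1.0f)
                mismatches++;
    return mismatches;
}
```

The array of 1024 floats becomes 256 vectors, and the unrolled loop runs 64 times. The data transfer code would be unchanged, since the bytes in memory are the same.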
A single element, with extract, right. Okay, so we are finished with that one. Let's take a look at the DMA list.

The array .h file, I think, is the same; I did not change anything. The control block remains the same, and now we add a structure here. Remember, an element of the list is 8 bytes. We have the first bit, the stall bit, then some reserved bits, and the number of bytes to transfer is 16 bits; that determines the bit structure here. The size field is a union of an unsigned int, all32, and this bit structure, and then we define the unsigned int ea_low.

Step one, my PPU program. What am I doing in my PPU program? I still declare my float arrays here. My control block is the same: it contains a, b, c for the arrays, then the size and so on. So nothing changes. Going down, I check for the working hardware, nothing changed here; I set a, b, and c in my control block; and I create the threads. In the PPU program we pass the control block down and we wait for the SPE to finish. Did anything change? Nothing. So far so good.

Now we go over here and pick up the SPU program. No, wrong one; right here, the DMA list. This is my SPU program. My arrays are the same; nothing changes here. For my DMA list I have this structure, which we call the DMA list element, and we call the array of them the list. It has 16 elements. What does 16 mean? A list of 16 DMA commands. Why 16 DMA commands? Because of how many bytes we can transfer each time; what is the maximum? 16K, as we will see. It is aligned to 8 bytes. Then we DMA the control block into local store; we get that control block, so we have everything down in the local store
now. Next, we have what we call transfer large region. We have two routines here. One is transfer_large_region, and we call it like this: we give it the address of array A, which is the local array; the effective address, which is passed to me in my control block; the size, how many bytes to transfer down; and a one, which is one of the flags that I set up here.

This is my transfer_large_region; let me go through it. It has been a couple of weeks since I worked on this one. We have the unsigned tag ID and the list size. If the number of bytes is zero it returns; otherwise, while it is greater than zero, we work out the size here. What I am doing is the same thing as was explained before: I am working through the list. If the remaining size is less than 16,384, the chunk size is the remaining n bytes; otherwise the chunk is 16K. So I move 16K at a time. Then down here, in list[i], I put the size into size.all32, and I put the low part of my effective address into ea_low. Then I reduce the remaining size of my main buffer, the array that I have, advance the address by the chunk size, and go back around the loop. So what I am doing here is transferring 16K at a time, and I use ea_low to move my pointer on to the next 16K, and the next 16K, and the next. Then the list size here is the number of elements times the size of a DMA list element.

So now I have all of this laid out for me, and I do the actual work. I pass either 1 or 0 here. If it is 1, it is a get, as I
defined GET somewhere, either 1 or 0. Depending on the flag I set there, I do the get list or the put list. For the get list, I have the local address right here, the effective address over here, the list, given by its first location, the first element of the DMA list, and then the list size. The put list looks exactly the same. You see that all of these calls, the MFC get and the MFC put, have basically the same argument list: we specify the local store address, the effective address, and then the list address, and then we let it go. Do we care how often it transfers and when each piece finishes? No. We just use this subroutine to get the data. We use it to move the data into array A and into array B, then we perform this function here, and then we transfer the result back, using this flag to say that this one is a put. So we use the get list to bring the data in and the put list to send it out, and all the data is described in the list, so we do not have to worry about each transfer individually. That is the different technique we use.
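The list-building loop walked through above can be sketched as follows. This is a simplified illustration under assumptions, not the class source: dma_list_elem collapses the stall bit and reserved bits into a plain 32-bit size field, build_dma_list is a hypothetical name, and no actual transfer is issued; the sketch only shows how a large region is cut into 16K pieces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DMA_BYTES  16384u   /* 16K per list element, as in the walkthrough */
#define MAX_LIST_ELEMS 16       /* 16 elements x 16K = 256K */

/* Simplified 8-byte list element. Assumption: the real element packs a
 * stall bit and reserved bits into the first word next to the size. */
typedef struct {
    uint32_t size;     /* bytes moved by this element */
    uint32_t ea_low;   /* low 32 bits of the effective address */
} dma_list_elem;

/* Fill list[] so that it covers nbytes starting at effective address ea.
 * Returns the number of elements used; returns 0 if there is nothing to
 * transfer or the request needs more than MAX_LIST_ELEMS elements. */
size_t build_dma_list(dma_list_elem *list, uint64_t ea, size_t nbytes) {
    size_t i = 0;
    while (nbytes > 0) {
        if (i == MAX_LIST_ELEMS)
            return 0;                       /* does not fit in one list */
        /* Take 16K, or whatever is left if that is smaller. */
        uint32_t chunk = nbytes < MAX_DMA_BYTES ? (uint32_t)nbytes
                                                : MAX_DMA_BYTES;
        list[i].size = chunk;
        list[i].ea_low = (uint32_t)ea;      /* keep only the low 32 bits */
        ea += chunk;                        /* move on to the next piece */
        nbytes -= chunk;
        i++;
    }
    return i;
}
```

For example, a 40,000-byte region produces three elements of 16,384, 16,384, and 7,232 bytes, and the list size handed to the list command would be the element count times sizeof(dma_list_elem), here 24 bytes.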
We have seen both examples doing exactly the same thing. Even in this list example, since you are putting the data into already defined arrays, you are not exceeding 256K at all, so the question of 16K times 2K does not arise; you will never exceed that, because you are putting it into this particular array. So what is the advantage of a get list over a single call that gets me the full array?

Okay, if I have a single data transfer that is more than 16K, then I can use the list.

I had a different question in mind. Basically the list does it in chunks. So even if the list is not finished, could I actually start processing the first chunk? Is that the advantage? But then there is no mechanism for me to know which chunk has finished.

No.

Is there any way of knowing that? When I do an MFC get, I check the status to see that it is over. Inside the program, can I find out the status: the first list item is over, the second is over, the third is over?

No.

But there was a tag bit; each list element, the 8-byte element, has a tag bit.

It is the same tag; the whole list is under one category.

So I thought that was used for completion or non-completion.

No, it is for the whole list as one; the whole list completes at one later time.

Okay, this is an example of double buffering. On the PPU side, the program is the same; nothing changed, straightforward: initialize the arrays, and so on. This here is the SPU program. Check for the working hardware, and then here I define the buffers. I have the array size divided by 8 as the offset into my array, and then this 64-byte offset here; increase that one, and start the array at the cache line; get the control block; going down. Okay, let me do this: let me go and get the SPU program; the SPU is better. Tea break, five
minutes, and then I will take some questions offline. Okay, come. Let me bring up the SPU program as well.