When we talk about performance numbers on FPGAs, on GPUs, on the Cell processor, or on other multi-core architectures, there is a caution worth repeating. A very wise man once wrote about the "twelve ways to fool the masses" (the reference is David Bailey's "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers"): when people write about performance numbers, some of them very conveniently overlook the overhead of getting the data in and out of the processor. A kernel may look very fast in isolation, but in the real world, if the measurement does not include end-to-end data movement, it is not really a good benchmark. That is one of the things to watch out for when we read benchmarks and other product specifications released in papers and articles.

So when it comes to the Cell processor, we try to keep that in mind and come out with effective techniques that hide this latency, the overhead of moving data, behind computation. DMA lists and double buffering are the two effective techniques we deploy in applications to overcome this data-communication overhead.

One technique we have already seen is the simple DMA transfer: PPU to SPU, SPU to PPU, one buffer at a time. Depending on the application's needs, that may not be enough. What the application on the SPU side wants to do is initiate one huge chunk of data transfer, just initiate it and then forget about it: it keeps working on the data it already has on the SPU side while the list transfer goes to main memory, picks up data from wherever it is available, and asynchronously keeps collecting it and sending it over to the SPU. The important thing to notice is that while this data-transfer activity is going on, we are doing computation in the background.

Let's look at the theory. There are basically two important aspects to it. First, you have to specify how many transfer elements there are. Every transfer element is 8 bytes, and each element specifies one transfer. Each transfer can be up to the maximum size of a single DMA operation between the PPU and SPU, which is 16 kilobytes; no single DMA operation can move more than that. So every list element can specify up to 16 kilobytes, and there can be up to 2K (2048) such list elements, so on the whole a total of 32 megabytes can be transferred from main memory over to the local store by one list command. It is of course a streaming model, because we can't fit all that data in the local store: the local store is only 256 kilobytes, but via this DMA list the SPU can have access to as much as 32 megabytes of data, collected asynchronously in the background.

Second, you have to specify the starting effective address where the transfer needs to start collecting the data; then, incrementally, as the data keeps getting collected, that offset keeps increasing. And of course, as with any DMA, each list element can transfer 1, 2, 4, or 8 bytes, or a multiple of 16 bytes, up to a total of 16 kilobytes. The local store address increments in the same way: the transfer needs a starting point in the effective address space and a starting point in the local store.
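To make the element format concrete, here is a sketch of one list element as the MFC sees it. The field widths follow the mfc_list_element_t layout as I recall it from the SDK's spu_mfcio.h, so treat the exact split as an approximation of the documented format:

```c
#include <stdint.h>

/* One DMA list element: 8 bytes, describing one transfer of up to 16 KB. */
typedef struct mfc_list_element {
    uint64_t notify   :  1;  /* stall-and-notify bit                   */
    uint64_t reserved : 16;  /* must be zero                           */
    uint64_t size     : 15;  /* list transfer size (LTS), <= 16384     */
    uint64_t eal      : 32;  /* low 32 bits of the effective address   */
} mfc_list_element_t;
```

Up to 2048 such 8-byte elements, each moving up to 16 KB, is how one list command can cover 32 MB (2048 × 16 KB) of main memory while the local store holds only 256 KB at a time.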
Once you specify the starting local store address that will receive all this data, it keeps getting incremented every time data is fetched through a list element. And this is how a list element looks: there is a list transfer size (LTS), there is an effective address low, and optionally there is a stall-and-notify bit that can be set. Once in a while the DMA list might need to say: after this list element is fetched, stop, because the SPU needs to be notified that we have fetched, say, 32 kilobytes or 48 kilobytes of data, and it probably needs to process that data before going ahead with the remaining list transfers.

The effective address high and the size of the list are specified only once, in the command itself, because once they are specified the MFC only needs to keep incrementing as each list transfer happens. That is the structure definition, struct dma_list_elem, with the size and the address. The command to start this kind of DMA operation is mfc_getl: instead of mfc_get we just do mfc_getl, "get list". It looks much the same: you specify the beginning local store address, the effective address, this whole list that you have created, the size of the list, the tag ID (just as before), and then the transfer-class and replacement-class IDs (tid and rid).

All right, let's look at an example. As we just covered, there is a stall bit in the structure, some reserved bits, and a field for the number of bytes; these bit fields are typedef'd together as "bits" and wrapped in a union so the whole size word can also be written at once, and the structure pairs that union with the effective address low. On the DMA list element array we have to use the aligned attribute; alignment is very important.

Then we get to the main part that actually does the list transfer. As you can see, using a single DMA list command we are able to transfer a large region from main memory over to the local store. Every single list transfer is limited to 16 kilobytes, so for each element the size is computed as follows: if the remaining size is less than 16K, the element is set to that many bytes; if it is more than 16K, it is truncated to 16K, because a bigger transfer won't go through anyway. The element's size field is set to the value we have just computed, the effective address is initialized, and then we decrement the remaining size by the number of bytes we just assigned, so that the next element gets the remaining bytes, and we increment the effective address, because the next element must fetch data from the new offset. Finally, the list size becomes i times sizeof(struct dma_list_elem), which is the total size of the list, and we issue the list command. (The slide shows the old API, spu_mfcdma32; now it is just mfc_getl, you don't need to use spu_mfcdma32.)
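Here is a sketch of that example, reconstructed from memory along the lines of IBM's get_large_region tutorial code. The names (dma_list_elem_t, nbytes, stall, ea_low, get_large_region) and the exact bit-field split are illustrative rather than authoritative; mfc_getl is the real SDK call:

```c
#include <spu_mfcio.h>

typedef struct dma_list_elem {
    union {
        unsigned int all32;           /* write the whole size word at once */
        struct {
            unsigned stall    :  1;   /* stall-and-notify bit              */
            unsigned reserved : 15;
            unsigned nbytes   : 16;   /* bytes for this element, <= 16 KB  */
        } bits;
    } size;
    unsigned int ea_low;              /* low 32 bits of effective address  */
} dma_list_elem_t;

/* List elements must be 8-byte aligned -- alignment is very important. */
static volatile dma_list_elem_t list[16] __attribute__((aligned(8)));

/* Gather nbytes starting at main-storage address ea_low into dst in the
   local store, using a single DMA list command. */
void get_large_region(volatile void *dst, unsigned int ea_low,
                      unsigned int nbytes)
{
    unsigned int i = 0, tagid = 0, listsize;

    if (nbytes == 0)
        return;

    while (nbytes > 0) {
        /* Each list element moves at most 16 KB: take the remainder if it
           is smaller, otherwise truncate to 16 KB. */
        unsigned int sz = (nbytes < 16384) ? nbytes : 16384;
        list[i].size.all32 = sz;      /* stall = 0, nbytes = sz */
        list[i].ea_low = ea_low;
        nbytes -= sz;                 /* next element gets the rest...  */
        ea_low += sz;                 /* ...starting at the new offset  */
        i++;
    }

    /* Total size of the list in bytes, then kick off the gather. */
    listsize = i * sizeof(dma_list_elem_t);
    mfc_getl(dst, 0, (void *)&list[0], listsize, tagid, 0, 0);  /* EAH = 0 */
}
```

With 16 elements of up to 16 KB each, this particular list tops out at 256 KB, the size of the local store; a longer list (up to 2048 elements) would be consumed as a stream instead.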
OK, so double buffering. Building on top of the DMA techniques, this is another technique where we try to overlap ongoing DMA with compute. Again, this is a standard technique, not a new invention by IBM; it is done on other architectures also. We are just showing, via the double buffering technique, how it happens on the Cell architecture.

The whole key is to overlap the DMA transfer with computation on the previous chunk of data that has already been fetched; the overlap alternates between computing on data and initiating the next DMA. While you are waiting for the first DMA to complete, when you are getting the first chunk of data, there is obviously no overlap; but from the second chunk onward, while you are waiting for the second DMA to complete, you are working on the first DMA's data, which you have already fetched.

Contrast this with the simple, serial scheme: start a DMA transfer from main storage to buffer B, wait for the transfer to complete, use the data in buffer B, and repeat. Let's look at a diagram to get a good idea. In the first iteration over here, these portions signify compute and this portion signifies DMA input, strictly one after the other. The main purpose of double buffering is to maximize the time spent in the compute phase of the program and minimize the time spent waiting for DMA transfers to complete.

A key ingredient is multiple local store buffers. Double buffering uses two buffers and triple buffering uses three; there are also applications that go one further, with four buffers (quadruple buffering), and try to create a pipeline across all four. In fact, we did have an application which used triple buffering: the Extreme Blue project Accelerated Vision. And the same rules apply for these DMAs as for all the get and put operations: you can use fences and you can use barriers with all the double buffering techniques as well.

So this is a good diagram; let's dig deeper into it. We first initiate the DMA transfer into buffer B0, then immediately initiate the DMA transfer into buffer B1. We wait on the DMA transfer into B0 to complete, use the data in B0, and initiate the next DMA transfer into B0. Then we wait for the DMA transfer into B1 to complete, use the data in B1, and so on, alternating between the two buffers. A sketch of this loop follows.
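Here is a minimal sketch of that loop on the SPU side. The chunk size, the process_stream and compute names, and the two-buffer layout are hypothetical scaffolding; mfc_get, mfc_write_tag_mask, and mfc_read_tag_status_all are the standard SDK calls, with one tag per buffer:

```c
#include <spu_mfcio.h>

#define CHUNK 16384                    /* one 16 KB chunk per DMA (assumed) */

/* Two local store buffers; DMA buffers must be at least 16-byte aligned
   (128-byte alignment gives the best bandwidth). */
static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void compute(volatile char *data, unsigned int n);  /* hypothetical */

void process_stream(unsigned long long ea, unsigned int nchunks)
{
    unsigned int i, cur = 0, nxt = 1;

    if (nchunks == 0)
        return;

    /* Prime the pipeline: start filling buffer 0 (tag 0). No overlap is
       possible for this very first chunk. */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);
    ea += CHUNK;

    for (i = 1; i < nchunks; i++) {
        /* Kick off the next transfer into the other buffer (tag nxt)
           before touching the current one. */
        mfc_get(buf[nxt], ea, CHUNK, nxt, 0, 0);
        ea += CHUNK;

        /* Wait only on the current buffer's tag, then compute on it while
           the other transfer proceeds in the background. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        compute(buf[cur], CHUNK);

        /* Swap the roles of the two buffers. */
        cur ^= 1;
        nxt ^= 1;
    }

    /* Drain: wait for and process the last outstanding buffer. */
    mfc_write_tag_mask(1 << cur);
    mfc_read_tag_status_all();
    compute(buf[cur], CHUNK);
}
```

Where ordering against other transfers on the same tag matters, the fenced and barriered variants (mfc_getf, mfc_getb, and the put equivalents) slot into this same pattern.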