Okay, so look at the different models here: the PPE programming models, the SPE programming models, the parallel programming models for those of you who want to work with parallel models, and then multi-tasking SPEs and the Cell software development flow. The focus of the first part is why we need these kinds of programming models: we have massive computational ability in the Cell architecture, and we have huge communication bandwidth. We move data at a rate of roughly 300 GB/s, and we move it independently between the SPEs and also between the SPEs and the PPE. Our resources are distributed. We distribute resources across different levels: the low-level functional components of the PPE and SPEs, tasking, threading, and synchronization.

Look carefully at the programming models. We have one PPE and eight SPEs, and we have models based on the memory requirements of the application, so you can map the memory models onto the Cell architecture. You can have a situation where the whole address space of your program fits into the local store of one SPE. On the other hand, that address space can spread across the effective addresses (EAs) of your main memory and the local store of an SPE as well. Another scenario is that the address space of your SPE program spans different SPEs; it may even cover a group of SPEs. Then at the BE level, there is the case where the address space covers the EAs of main memory plus the local stores of different SPEs. And the last situation is PPE threads running on the different hardware threads of the PPE.

Let's take a look at the PPE programming model. We went through an example yesterday, a very simple example, just to show you that you can build a PPE program whose job is nothing else but to control the execution of your SPE programs. Controlling and managing the SPUs is the key function, the main function, of this PPE program. We provide what we call CESOF, the Cell embedded SPE object format, a linkage format that allows you to refer to the same object using a symbolic name across the EA space and the local store. So you don't have to do any mapping, you don't have to do any translation, and you don't have to worry about 64 bits versus 32 bits on the Cell. The role of the PPE program, of course, is to manage and allocate system resources as well as Cell resources. It does the printf, the file I/O, all of those I/O requests for your programs. Do we have anything down in the SPU to do any printing? No, we don't. All we have there is a vector machine running computations. That's all.

For single-SPE programs we have a small programming model and a large programming model. The small single-SPE model we refer to as a single-task environment: small enough to fit into the 256 KB of local store, and sufficient for many dedicated workloads. How sufficient is it? What size of program do you think can run in that 256 KB of local store? How many lines of code: 20, 30, 100, 500? You can run a program of 500 lines of code in that 256 KB. So it is small, but not that small. Next we will see how to accommodate a very large program.
Then we mentioned the two address spaces, the EA space and the local store. There is a line that delimits the address spaces. The mapping of that address space is done by your MMU and the MFC. Everything is done for you through DMA and the memory transfer functions, through the MMIO interface, for example. Those addresses are translated automatically for you, so you don't have to be concerned about it.

So in the small single-SPE programming model, everything is nice and dandy. The toolchain environment gives you the compilers, some of the libraries, and the linkers and loaders that initialize, load, start, and stop your program. This is the typical environment that you have, with a pair of programs like this one here, which we call spe_foo.c: a C program that we compile to produce the binary spe_foo. Its main takes a 64-bit SPE ID, a 64-bit argument pointer, and a 64-bit environment pointer. We have seen this before, right? This is the SPE thread. We do something here based on the argument that we passed, then printf whatever, and return 0 or return i. Now this is the PPU program: we declare the SPE ID as an speid_t, do the thread creation, do a wait here, and return. We have to declare the program spe_foo, which is an SPE program, as an external spe_program_handle_t. We have seen that many times, right? That is basically it.

However, what about a large program, the case where the address space of a program is larger than 256 KB? What are we going to do? In this environment, we can partition the program into different segments of the address space. We have the code segment, we have the data segment, we have the stack, and the CESOF sections as well. We partition the program into small pieces, and then we DMA in the pieces that we need, whenever we need them, to run on the SPE. The control, the breaking up and managing of these pieces, rests with the PPE. So here we use the PPE to break the code and the data into small pieces, DMA the data down to the SPE program, perform some calculation, and return it back to the PPE. For I/O we do the same thing: we DMA the data. For example, in this case an integer array of 32 entries: we feed it to the function, perform the operation, generate the output, and DMA it back to the system memory where the main program is running.

Okay, some of you were telling me yesterday: you showed me that every time I do a DMA, I issue the DMA request, I wait for the data to come back some cycles later, and then I continue to work. What happens to that latency time? What am I going to do while I am waiting for the data? The answer is that we can provide you a mechanism we call the software managed cache. A software managed cache is nothing else but a cache implemented in software. It is the same concept as the hardware cache you had before in other architectures: we use a small area where we store some frequently referenced data. When we need the data, we go to that location and search for it, and if the data is there, we load it.
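Before we go further with the cache, here is a minimal sketch of the spe_foo pair just described, written against the libspe 1.x interface shown on the slides. Error handling is omitted, and the body of the SPE's work is a placeholder.

```c
/* spe_foo.c -- the SPE side; compiled with spu-gcc into the binary spe_foo. */
#include <stdio.h>

int main(unsigned long long speid,  /* ID of this SPE thread            */
         unsigned long long argp,   /* 64-bit argument pointer (an EA)  */
         unsigned long long envp)   /* 64-bit environment pointer       */
{
    /* ... do something based on the argument passed in argp ... */
    printf("hello from SPE 0x%llx\n", speid);
    return 0;
}

/* ppu_main.c -- the PPE side; it does nothing but control the SPE program. */
#include <libspe.h>

extern spe_program_handle_t spe_foo;   /* the embedded SPE program image */

int main(void)
{
    int status = 0;
    speid_t id = spe_create_thread(0, &spe_foo, NULL, NULL, -1, 0);
    spe_wait(id, &status, 0);          /* block until the SPE thread exits */
    return status;
}
```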
Back to the cache: if the data is not there, we raise something like a cache miss, we go outside to the next level of the memory hierarchy, and we bring the data in. On the hardware side, we have to deal with cache replacement algorithms, the size of the cache, and all of the organization: how do we organize the cache, how many cache lines, how do we group the lines into sets? We deal with the same things here. However, here we give you, the programmer, the flexibility to organize the cache. You decide the name of the cache, the type of data that you keep in that cache, and the replacement algorithm that it uses, and then you store your data there. Storing means that this cache will do the DMA for you. Which DMA technique does the cache use, double buffering or a DMA command list? You don't care. The software cache implementer implements that scheme for you. All you need to know is: I have this data type, I define a cache of this data type, and I have this interface to load data from the cache or store data into the cache. That's it. When I need the data, I go to this cache, and the data is supposed to be ready for me. Refreshing the data is another function of this software cache.

How many software caches can I have? Well, how many hardware caches can you have on your system? One, right? Can you say, I don't like that cache, can I replace it with a different size or whatever? No way, right? On the hardware side, go to Intel and say: I don't like that cache size, or the size of your cache line, or how you organize the cache; make me another cache. No way. Once the decision on the size and the implementation of that cache was made in hardware, you cannot change it. On this end over here, you do have the opportunity to change the cache, which is nothing else but a buffer: the size of the buffer, the data type you put in it, and the organization of the cache. Plus, how many caches can you have? On the hardware side, one cache. Here? Two, three, four, five, right? Subject to what? Subject to the size of your local store and, performance-wise, to some things that you have to live with. That means you have 256 KB; you reserve some buffers for your software cache, and the rest is for your instructions and for your data.

Coming back: how do I know the size of my instructions and so on? Through the linker and through some of the options provided by the compiler, which spit out the sizes of our code segment and data segment, so we know. The programmer has a lot of freedom to manipulate, to set up the caches and the buffers down in the local store here. You want to use raw DMA to transfer the data by yourself? Fine, you have that option. You want to use something already provided that does the DMA for you and gets the data for you? Use the software cache. At the low level, we provide some functions we call the low-level API, where we define cache lookup by line or by byte, returning a local address, handling a cache miss, and so on.
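To make that concrete, here is a toy direct-mapped, read-only lookup. This is not the SDK's actual implementation, just the shape of what such a low-level layer has to do; the sizes and names are assumptions.

```c
#include <spu_mfcio.h>

#define NSETS  64                      /* number of cache lines           */
#define LINESZ 128                     /* bytes per line (DMA friendly)   */

static char lines[NSETS][LINESZ] __attribute__((aligned(128)));
static unsigned long long tags[NSETS]; /* EA of the line held in each set */
static int valid[NSETS];

/* Return a local-store pointer for effective address ea, DMA-ing the
   line in on a miss.  Direct mapped: the set is chosen by address bits. */
void *cache_lookup(unsigned long long ea, unsigned int tag)
{
    unsigned long long line_ea = ea & ~(unsigned long long)(LINESZ - 1);
    unsigned int set = (unsigned int)((line_ea / LINESZ) % NSETS);

    if (!valid[set] || tags[set] != line_ea) {   /* cache miss            */
        mfc_get(lines[set], line_ea, LINESZ, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();               /* wait for the DMA      */
        tags[set]  = line_ea;
        valid[set] = 1;
    }
    return &lines[set][ea - line_ea];            /* offset within line    */
}
```

A real implementation also has to handle write-back of dirty lines and associativity; that is exactly the complexity the high-level API hides.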
Those are very low-level APIs that we provide for the software guys who implement the cache strategy themselves. On top of that, we also provide what we call the high-level API. With the high-level API, you give some definitions in a header file, spe_cache.h here, where I define a load and a store. Load means I read some piece of data out of the cache at some address; write means I write some piece of data into the cache. Then in my SPE program, when I refer to the data, I do the load, I read something into c, and I store c into some destination. And that's it. Underneath, the scheme is very, very complicated, but you as a programmer deal only with these definitions. You define these guys here, and then when you use them, you just load from and store to the cache. Okay? We hide all of the details of the DMAs and all the rest of the complexity of the DMAs.

On to the large single-SPE programming model. We have a code segment and data segments, and we know the size of each segment ahead of time. When we sum up those sizes, we say: wow, the sum of the sizes of all my segments exceeds 256 KB. Can I break them up, in such a way that I can load one of them, or two, or three of them at the same location, using overlay techniques? This overlay approach is not new. We dealt with 64 KB memories when the first PCs were invented, and we had to load programs larger than 64 KB, or 128 KB, or 256 KB. What did we do then? We broke the code into small pieces called segments, looked at whether they were independent of each other or referred to each other, reserved a specific segment of the address space in memory, and loaded those segments only when we needed them, instead of loading the whole program at startup. The same concept applies right here in the Cell programming environment: when you load an SPE program, you load everything, so if you have a large address space, you cannot fit the program into the local store.

Break it up into small pieces. For example, in this case we break the program into different functions, A, B, C, D, E, and F, plus the main program. The main program is sort of like the root segment; it has to be there all the time. Main and the function F go together: main always resides in the local store, and F is loaded into its segment whenever we need it. Then in this region over here, we define the load point for functions B and C, so we can load either B or C when one of them is needed. And in this region here, we define the load point so we can load A, D, or E. Break the program up into segments, define the load points for those segments, and overlay the segments as you load them. How difficult is it? You will see later on when we cover overlays. The chart here shows an example with two overlay regions; we call them region 1 and region 2.
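In linker-script terms, the layout being described looks roughly like this: a hypothetical fragment in the GNU ld OVERLAY syntax, with made-up object file names, where each OVERLAY statement is one region whose members share a load point.

```
SECTIONS
{
  /* root segment: main() stays resident, F loads alongside it */
  .text : { main.o(.text) }

  OVERLAY :               /* region 1: B and C share one load point   */
  {
    .seg_b { b.o(.text .data) }
    .seg_c { c.o(.text .data) }
  }

  OVERLAY :               /* region 2: A, D and E share one load point */
  {
    .seg_a { a.o(.text .data) }
    .seg_d { d.o(.text .data) }
    .seg_e { e.o(.text .data) }
  }
}
```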
Region 1 loads its segments, segment 1 or segment 2, each with its own text and data from the object files in the ELF executable; and overlay region 2 likewise loads its segments, for B and C and so on. Using the features provided by the SPU linker, we can feed it this linker script and ask the SPU linker to link my segments in such a way that each segment is loaded only when I need it, starting at exactly the address defined by the overlay region I defined. The overlay manager oversees, manages, and loads the segments for you.

The other way to look at it is from system memory: you have a program, and again you can divide it into different code and data segments and queue them up. Then, depending on execution, an SPU kernel that you provide (this is not a Linux kernel or an OS kernel; it is a small SPU program running on the SPU that controls the execution of these jobs) runs, looks for the code of segment N and the data of segment N, asks for them, executes that one, and works on through the job queue.

We also talked about another technique that allows you to hide latency. This technique we call double buffering. In your traditional programming model, if you want to work with a large set of data, or you want to access data continuously, in such a way that any time you need the data, you have it, you can either load the data into a single buffer and access the buffer one piece at a time, or you can pack the data into two buffers. You access buffer one, get the data, process that data, work with it, then access the second buffer and work on it, and then come back to buffer one. You can do that provided you loaded the data into buffer one and buffer two the first time, right? So we say: okay, before I start doing anything, let me load data into buffer one and also load data into buffer two. While the data for buffer one and buffer two is in flight, I have to spend some time waiting. Then, once I have the data in buffer one, I process it, and before I get out, I initiate another request for the next data into buffer one. While that activity is going on, I switch to buffer two; now I have the data in buffer two and I work on it, and before getting out of buffer two, I initiate another request for its next data and switch back to buffer one. I switch back and forth, so I always guarantee that I have the piece of data I need.

Some of you may object: what happens if I have a different set of data, and my data takes longer in buffer one or buffer two? Possible. You can extend double buffering to multibuffering, right? Where are your buffers? Your buffers are down in the local store. What is the drawback of that scheme? The drawback is the memory constraint again: you have 256 KB. No matter what scheme you apply, you still have to sacrifice a certain amount of space down there to implement it.
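Here is a minimal sketch of that double-buffering loop on the SPE, using one MFC tag group per buffer. The chunk size and the compute() routine are placeholders.

```c
#include <spu_mfcio.h>

#define CHUNK 4096
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void compute(char *data, int n);   /* your per-chunk work (stub) */

/* Process nchunks chunks starting at effective address ea, always
   fetching the next chunk while computing on the current one. */
void process_stream(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);          /* prime buffer 0     */

    for (int i = 0; i < nchunks; i++, cur ^= 1) {
        if (i + 1 < nchunks)                      /* start the next DMA */
            mfc_get(buf[cur ^ 1],
                    ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, cur ^ 1, 0, 0);

        mfc_write_tag_mask(1 << cur);             /* wait only for the  */
        mfc_read_tag_status_all();                /* current buffer     */

        compute(buf[cur], CHUNK);                 /* work on it         */
    }
}
```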
Multibuffering, double buffering, or a software cache: they all still cost some space. You are trading space for time, okay?

And CESOF: we define structures here so you don't have to resolve the symbols by yourself. On the effective-address side we define a structure, say a 512-byte array g_foo, and we refer to it through an EAR, an effective address reference, with a structure named something like _EAR_g_foo. In the SPE program we then refer to a local counterpart, say char ls_foo[512]. Once we define these structures through the CESOF EAR mechanism, we can DMA between those two locations without worrying about which one is the effective address and which one is the local store; I will show a rough sketch of this below. Remember, when we do an MFC get or an MFC put, we need to supply the local store address, the effective address, the size, and the tag. In this mechanism, in this implementation, you don't have to supply the two addresses yourself.

Parallel programming models: the traditional parallel programming models from your existing architectures apply here as well. On top of the single-SPE programs, we implement parallel SPE programs. For synchronization mechanisms, we provide the MFC atomic update commands. Remember when we drew the architecture of the MFC, we had various components: the MMU, which does the memory address translation for us; the atomic unit, which handles cache coherency for us; and some buffers in there. We use the mailboxes with their FIFO buffers: a very simple, very neat organization. And here it shows that we use the cache-line-based MFC atomic update mechanism, which is implemented in the MFC. We also use the signal notification registers, and events and interrupts: any events and interrupts can be handled by the SPE and then transferred back to the PPE to handle; all of those interrupt handlers run on the PPE.

Shared memory: we can use a shared-memory multiprocessor model. In this scheme, the EA memory, or a segment of the address space, can be reserved and shared, and the access mechanism is synchronized by the typical synchronization mechanisms. Okay, compiler OpenMP support: we use OpenMP to provide parallelism for SMP systems where we connect those Cell BEs together. And we use MPI, the message passing interface protocol, to support clustering. Both the OpenMP and the MPI support will be available in SDK 3.0.

One of the models we would like to touch on and spend some time with is the streaming model. In this model, we assume we have a set of inputs, call them I0, I1, I2, I3, I4, I5: our data split into different sets of inputs. The application running in main memory sends the data to the SPE programs. The first SPE receives I0, processes that set of data, and sends it back; at the same time, I1 is sent out to SPE 1, processed, and sent back; and at the same time, I3, I4, and I5 are sent out to SPEs 3, 4, and 5. So what we are doing is splitting the data into small pieces, streaming the sets of data down to the different SPEs, and having each SPE perform its function.
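Going back to the CESOF structures for a moment, here is a rough sketch of that pattern on the SPE side. The _EAR_ naming and the types here are illustrative, not the exact object format.

```c
/* On the PPE/EA side there is an ordinary global, say:
       char g_foo[512];
   The CESOF toolchain embeds an effective address reference (EAR)
   for it in the SPE image, resolved when the program is loaded. */
#include <spu_mfcio.h>

extern const unsigned long long _EAR_g_foo;   /* EA of g_foo (illustrative) */

static char ls_foo[512] __attribute__((aligned(128)));  /* local-store copy */

void fetch_foo(unsigned int tag)
{
    /* we supply only the size and the tag; both addresses come
       from the symbols themselves */
    mfc_get(ls_foo, _EAR_g_foo, sizeof(ls_foo), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}
```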
Back to the streaming model: we don't care about the completion times. When an SPE finishes, it sends its output back, O0, O1, O2, whatever, and then we process the next set of data. That is how we perform matrix multiplication: we subdivide the matrix into submatrices and divide them across SPEs 0, 1, 2, 3, depending on how we create the threads. We create a thread, fill in the data, give it pointers to the address space of this piece of the data structure, DMA the data down to the SPE, and perform the operation.

Another model: if we know ahead of time the completion times of our SPE tasks, we can implement something like the pipeline model, where we say: bring my data down to the first SPE, let it finish, and move the data on to the next SPE; when that SPE finishes its work, send it to the next SPE, and so on. A very beautiful concept, right? And dangerous, because we need to know when these guys finish: when SPE 0 finishes, when the second SPE finishes. One of the things we always want from the SPEs is a known completion time, what we call in real-time terminology the response time. If we can guarantee the same response time every time, then this scheme is very good for you.

On to tasking. At the Linux level, either at the kernel level or at the user-space level, we can pack our workload into different tasks: tasks A, B, C over here. With each task we have an event, and we have a queue of events here, and depending on the occurrence of an event, we run the corresponding task. We need some program we call the event dispatcher, similar to what we had before, some SPE kernel, which is an SPE program, to manage those events: control the events, recognize them when they happen, field them, and dispatch the task that serves each event. Self-managed multitasking is the same again: we have an SPE kernel running here, and the job queue has a number of tasks queued up; based on the events that arrive down at the SPE kernel, we download the data and the code to the SPE and run it through the multitasking SPE kernel. The chart here is an illustration of what I just described, and below is a small sketch of the dispatcher idea.
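A minimal sketch of such an SPE-side event dispatcher, using the inbound mailbox as the event queue; the event codes and the task functions are made up.

```c
#include <spu_mfcio.h>

/* hypothetical event codes the PPE writes into our inbound mailbox */
enum { EVENT_SHUTDOWN = 0, EVENT_TASK_A = 1, EVENT_TASK_B = 2 };

extern void task_a(void);   /* the tasks being dispatched (stubs) */
extern void task_b(void);

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    for (;;) {
        /* blocks until the PPE (or another SPE) posts an event */
        unsigned int event = spu_read_in_mbox();

        switch (event) {             /* recognize, field, dispatch */
        case EVENT_TASK_A:   task_a(); break;
        case EVENT_TASK_B:   task_b(); break;
        case EVENT_SHUTDOWN: return 0;
        }
    }
}
```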
Okay, some of the programming models here reduce development cost while achieving higher performance. This is true for the software cache, and it is true for the multibuffering schemes; and some of the frameworks we provide here will be helpful to your productivity and to your time to solution: developing your code is a lot quicker. But the models we propose here are not the only models, right? The models we discussed apply in the environment where we have a Cell: 1 PPE and 8 SPEs, or 2 PPEs and 16 SPEs. What happens in an environment where you have a traditional system running and using the Cell as an accelerator? You have to come up with a model for that as well. What happens when that traditional CPU, a traditional PC or a traditional system, is running fully loaded, handling a database, fielding data from your financial services, handling different I/O devices and bringing the data in? That model is not addressed here, right? How do we take the data coming in through that traditional CPU, send it down to one of our PPEs, and then send it down to the SPEs? And what happens if you integrate, hook up, and build a cluster of those traditional CPUs together and use Cells as accelerators, one or more, ten, whatever number of Cells, for very specific functions: rendering, image processing, processing a set of financial data, your stock options, for example? Look at the hedge funds: you build up your funds today based on real-time data. So this sort of accelerator model exists now; we call it the hybrid model, and it is not addressed here.

What are we doing about it? We built a model we call ALF, the accelerator library framework, and that framework was introduced in the SDK to address the situation where you have to support the hybrid model between a traditional CPU and the Cell as an accelerator. We may not go into the details of that model, because it is still being worked on and there are a lot of details involved. But let me ask you a question about that model: what do you think the requirements are? Let's say you have an application running on your main x86-based machine, your traditional CPU, and you want to send some piece of code out here so it can be accelerated on the Cell BE. What do you think you need to do? Why do you want this Cell accelerator, and what do you want it doing? You have to project your big task over here into small tasks you can send out: application partitioning. You have to partition your application into pieces, the same model as before, but now deciding which piece goes to this SPU, which piece goes to this PPU, which piece goes to this Cell, because you may have more than one Cell accelerator.

Take the PS3: you can build a cluster with PS3s. If you followed some of the online news lately, in the past week or so, some students built a cluster of PS3s; and if you couple that cluster of PS3s with a cluster of traditional CPUs, you build a supercomputing center, a very powerful system. So application partitioning is one area. You also have a lot of data coming in, so you do data partitioning as well: you have to think about your scheme for partitioning the data, and then about receiving the data. You just do the computation; the ALF framework has to provide something so that you, the programmer, can feed the data in easily. You are not going out there as a programmer doing the DMA by yourself, issuing one DMA to get the data, or doing a DMA list, or rolling a software cache to do it better; the framework has to provide that. Those are the features of ALF: you will see data partitioning and application partitioning, you will see that it does the double buffering for you now, and in the future it will do multibuffering for you as well, and of course DMA lists; I think we cover DMA lists tomorrow.

Those are the kinds of programming models that IBM is working on, and we are not stopping at those programming models. As you work on different models, you may see that some model here does not fit, and then it comes down to the programming aspect of the project you are working on: is there anything I have used before, anything in my course or my project here, that I can use, or should I invent something else? Nobody stops you from inventing something else besides these models.
These are just our models. We think: we have this environment, we have a local store, we have an effective-address space here, and we came up with some schemes that we feel comfortable with. But as we go out and expand the horizon of these architectures and features, we encounter more questions. Your questions about the hybrid models are coming up: you have thought about it, this model is fine, but I have a traditional CPU, I can hook up this Cell, this PS3; how do I do that? The hybrid model is a very important model. Nobody is going to use this machine by itself except me and you and some of the game players, right? But we are going to put it together in the real world and build a real supercomputing center. Is it possible? Yes, potentially, and that is what IBM is doing now with the Los Alamos lab on the Roadrunner project. And nothing stops you from building your own project here with a cluster of PS3 Cell boxes.

Okay, the software managed cache. We mentioned this is the technique that allows you to reserve and allocate a segment of your local store and use it as a cache, managed by your program. You may ask: why do you ask me to do so many things? I write programs, and now I have to write the cache too? I have to manage the cache too? This machine was designed by the hardware guys, and those hardware guys give you a lot of freedom. The software guys say: either I take that freedom or not; I am waiting for ALF and some of the models to help me. But for those of you who really want to get to the lower levels, who want control of that 256 KB: you can control it. I have my own cache here; I have 3 caches; I have any number of caches I can play with. Every time I add another cache, I can see some improvement. It was proven with a demo program we built with 3 caches, 4 caches, 5 caches, and so on, and every time we added more caches, the performance improved, jumped another 5x, another 6x, whatever. What falls out of this? If we relocate the data, or make the data available to our SPU, then we just load and store, load and store, and run. That's all we do, right? We don't have to pay any latency in terms of memory latency.

The question came up: if you reserve one SPE, the local store of one SPE, as a cache, and let the other SPEs access that cache, how efficient is it? In order to access that local store from any other SPE, you have to provide the SPE ID of the thread, and you go over to the EA address and fetch the data from there, because the local store is memory-mapped into the EA space, right? So implementation-wise, nobody can say no: you can access the local store, and the LS of SPE 0 here can be accessed by SPE 1, nobody says no. But since we say that we use the SPEs as computational engines, we prefer not to use them that way.

Okay, here are the basic cache concepts. Everybody is familiar with this one, so I will skip going through caches and cache lines, right? We use a cache with a size and all these N ways and sets, how many ways we partition the cache lines into. I mean, when we look at new architectures, if you are not the designer, if you are just a programmer, when you look at the description of a new Intel CPU and so on, you look at the cache.
You see the cache size, and then what? You sit down and write your program, right? Here, you say: okay, I have to know a little bit about that organization; I have to know a little bit about what it means to me. We have fully associative mapping, N-way set associative mapping, and direct mapping. With direct mapping, we map every memory location to one specific cache line; with fully associative mapping, we map any memory location to any cache line. Your cache consists of a number of cache lines, right? So you have the cache lines, and the question is how you organize those cache lines, in groups or singly, and how you map a cache line or group of cache lines to a set of locations in your memory, so you cache the buffers or the data of that memory, and that allows you to access the cache. This one is applicable to the local store only.

The benefits are the same benefits that you observe when you have a hardware cache: it simplifies the programming model, since your load/store effective-address model can be used; it decreases the time to port code to the SPE; you get the advantage of locality of reference, which is the traditional one, that's why we have caches, right? And it can be easily optimized to match data access patterns. This part is new here, right? A software managed cache provides the ability to apply sophisticated replacement algorithms and so on. Did we invent the software managed cache? I don't think so. If you search through Google, a lot of people have worked on software managed caches, right? This is not IBM's invention; it was invented by many, many people in the technical community, and the concept fits this environment, so we use it. So we suggest: if you see something new, or something being used elsewhere, you can retrofit the concept onto the Cell and use it, as long as you can exploit certain features of the Cell. We exploited this one because we said we want to prefetch the data: we need data to supply that SPU, we want to prefetch it, and we were searching for how to implement the cache. This cache concept had been discussed in IBM for four or five years, since the beginning; it was actually implemented and released only in SDK 2.0. That shows how long it takes, a couple of years, three or four years, to implement a scheme like this. We talked a lot about the software cache in the SDK 1.0 days, but only in 2.0 do we have an actual implementation, and we do have an example to show you.

Okay, the header file is cache-api.h here, and in it the cache attributes come: the name, the associativity, the line size, read-only or read-write, the type of data object to cache (integers, chars, floats, and so on), and the number of sets. The user can define multiple caches by redefining the attributes: every definition, every software cache, is defined by including the cache API header together with its attributes. For a second software cache, you repeat the section, the cache API include and the attributes, and so on, so you can define as many software caches as you want. The attributes are the cache name, the type of your data, the size, the sets and the N ways, the access type (read-write or read-only), and the statistics option, which counts each read and write in the code so the cache can give you statistics. Defining a cache: I define a cache name, MY_CACHE; a cached type, my type called t; four ways and the sets; read-write; the cache line size; and then the stats.
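Putting that together, here is a minimal sketch of defining such a cache and then using it for the swap example that follows, patterned on the SDK's cache-api.h. Only the required attributes are set, and the effective addresses are kept 32-bit in this sketch.

```c
/* Define the cache's attributes, then pull in the generated code.
   Only the name and the cached type are required; line size, sets,
   ways, access type and statistics have defaults you can override
   with further #defines before the include (see cache-api.h). */
#define CACHE_NAME  MY_CACHE       /* this cache's name            */
#define CACHED_TYPE int            /* type of object being cached  */
#include <cache-api.h>

/* Swap the two ints at effective addresses ea_a and ea_b.  Each
   cache_rd/cache_wr may trigger MFC DMA behind the scenes; we never
   issue an mfc_get or mfc_put ourselves. */
void swap_ea(unsigned int ea_a, unsigned int ea_b)
{
    int a = cache_rd(MY_CACHE, ea_a);
    int b = cache_rd(MY_CACHE, ea_b);
    cache_wr(MY_CACHE, ea_a, b);
    cache_wr(MY_CACHE, ea_b, a);
}
```

For a second cache, you #undef these attributes, define a new name and type, and include cache-api.h again; that is the "repeat the section" step mentioned above.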
Enable the stats, include the cache API, and you get cache read, cache write, cache touch, cache wait: read and write. These are the interfaces that make life simple and easy for you when you use these features. You are required to supply only the type and the name of the cache; all the rest, the sets, the ways, whatever, leave it to us, and we set a default. You can of course change them, but if you don't want to specify anything, leave it to us: we define the defaults for you, we define the replacement for you, and you just care about the name and the type of your cache.

Here is the example that uses the cache to swap two values, as in the sketch above. cache_rd here is an interface provided by the software cache utilities: read from the cache specified by MY_CACHE at effective address a, and read again at b. Then write: write the value of b into the cache at effective address a, and write the value of a at b. We read A, we read B, we put B into A and A into B: we swap the two values. Behind the scenes, what are we actually doing? We would have to issue an MFC get to the effective address of a or b, bring that data down to the SPU program, exchange the data, and MFC put it back: a number of MFC commands we would have to issue. Here, we just do cache read and cache write. That's it.

Write-back algorithms, some of these algorithms here, are the same things we talked about for hardware when we learned our architecture class. This is the chart showing the implementation of the cache: the lookup checks the valid bits and the directories of your cache, looks up from the index, and so on. Just a performance-wise view.

Okay, this chart shows the size of the cache line, the line size, against the hit rate for the quicksort program: hit rate versus line size, with cache sizes of about 16K, 32K, 64K, 128K. As we can see, the hit rate of about 75% improves to the low-to-mid 80s, almost 85%: a very high rate with the 32-byte line size. However, as we go on and the lines get longer and longer, bigger and bigger, we see the performance really decrease. So with a small line size, 128 bytes or so, this line over here being the line size of our cache, you will see some improvement in terms of the cache size. The hit rate improves a little bit with the large 128 KB cache size. But 128 KB is half of my local store. Would you dedicate half of your local store to your cache? No way, my friend, right?
I mean, maybe it is about 16 KB, something like that, right? So again: different programs, different approaches, different techniques, different results. This is the quicksort run time versus line size, and this is the hit rate versus line size. Heapsort is different: heapsort's behavior depends more or less on the cache size as well, and we can see the hit rate jump between 76, 77% and 80-something percent. Okay, run time versus line size again. And the cache sets: we organize the cache lines in terms of sets. How many sets do we have, 8 sets, 7 sets, 1 or 2, whatever? It differs a little bit, right? The hit rate, the hit percentage here, is almost the same, 94% or 91%, based on the reads and writes per set, on quicksort with the prefetch and write buffers: almost the same. So experiment with your programs and with your caches.

All right, any questions before we close up here? Okay, ladies and gentlemen, let's wrap it up and go to the labs.