In this session we talk about the Cell BE architecture. The objective here is to give you some idea of what the architecture looks like: how it differs from the PowerPC, and what the PPE, the SPEs, and all of the elements of the system are, the memory flow controllers, the EIB, the element interconnect bus. In general, we go through a current microprocessor design first, and then we look at the Cell components. This is the layout of a typical processor die; I took this one from the IBM Power5+. On this die we can see the execution units: the FXUs, the fixed-point units; the instruction units; and the FPUs, the floating-point units. There are two FPUs because this is a dual core, so we have replications of the components: the IDUs, the load and store units. So this is the typical real estate of a Power5+. We also have the L2 cache right here, and then the L3 directory and control, which connect the chip to the off-chip level-3 memory. Now compare that with the Cell BE. On the Cell BE we have the SPEs and the PPE processing element. If we take the Cell BE die and drop it on top of the IBM Power5+, we can see how much real estate the Cell BE takes compared to the Power5+. We eliminate all of those L3 memory controllers, we eliminate almost half of the L2, and we eliminate the duplicated FPUs and so on. Architecture-wise, we said that the Cell BE is based on the 64-bit PowerPC. This is the layout of the 64-bit PowerPC: we have the Power Instruction Set Architecture core right here, with the memory management units and the bus interface unit talking to the memory and the I/O system through a coherent bus. Then, starting from this architecture, we add the local store memory plus the memory flow control.
Then we have the MMU and the DMA engine, which are part of the MFC, the memory flow controller, sitting alongside the local store. So here we have added an SPE; this is the memory portion of the SPE, okay? Then we asked: how do we provide a mechanism for each of these SPEs to talk to the other SPEs? We alias the local stores into the main memory, the main memory of the PPE, right? So you have two layers of memory: the effective address space, which we call the main memory, and the local stores of the SPEs. The local store memories are aliased onto the effective address space, which allows you to address any location in memory. In addition to the memory portion, we now add the processing portion here, which we call the SPU, right? And this SPU has its own instruction set, its own instruction units. It is different from the PPC units right here, completely different, okay? We call this the offload model: we DMA the data into and out of the local store, and that activity is equivalent to what the load and store units do running on the Power side. The same goes for the loads and stores of the vector registers, okay? We also support a shared memory model with Power-architecture-compatible addressing. You have a large address space, you can dedicate a set of addresses to use for your shared memory, and you can provide synchronization mechanisms to coordinate access to that shared memory as well. Okay, now this is the overview of the Cell processor, right? Again we have the L2, we have the SPEs, and so on. In this environment, if we look at the PPE core, we have the VMX unit, a level-one cache, a level-two cache, and two-way SMT, as we said before. This is the block diagram of the PPE.
We have the instruction fetch units over here, we have the VMX units over here, and we also have the load and store units together with the fixed-point units; we have a branch execution unit, and the fixed-point units share a pipe with the load and store execution units here. For instructions, including the VMX instructions, the flow is: fetch the instructions, go through decoding, and dispatch. Because this core runs two threads, part of dispatch is deciding which thread each instruction belongs to, okay? Then we decode, resolve the data dependencies, issue the instructions either to these units here or to the VMX, perform the computations, and write the results back; there is also the instruction flush path, okay? Here are the elements of the SPE: we have a register file of 128 registers, each 128 bits wide, we have the local store, 256 KB, and we have the MFC. This processor can be run in isolation mode, to support security on the chip. When an SPE is running in isolation mode, only a specific window of its local store can be accessed from outside; the SPE is otherwise completely isolated from the rest of the processing units. SPE organization, we saw it before: we have the SPU core, we have the DMA unit right here, we have the local memory, and we also have what we call the channel unit. The channel unit is a set of 32 channels, or registers, that we use to carry the data when we move data between the SPE and the PPE, okay? Now, the local store is very, very different in the Cell organization here.
We have the concept of never missing on the local store. Why? Because there is no prefetching and no virtual memory associated with the local store. What we have is a physically addressed local store of 256 KB, with no protection whatsoever, right? It is up to you to protect that address space. If your program is running and something intrudes on it, tries to defeat your program, it can destroy your data; that is up to you as the programmer. Remember, there is no virtual memory mapping here, right? And we have what we call software-managed caching, to provide you some mechanism to access the data when you need it, okay? The latency between requesting the data and receiving the data can be hidden if you use one of the mechanisms we propose here, one of which is called the software-managed cache. We provide a large register file, and we can also move data from one local store to another. There is no address translation, as we said, and no memory mapping; the local store can be mapped into system memory, but there is no virtual memory mapping. One of the key points, which I didn't mention, is the predictable real-time behavior, right? We fetch the data, we have the data, we always run without a second tier of memory, so we never miss. We do a branch, and there is no hardware predictor to miss: we only give hints, and the branch hint tells the hardware which target will be taken, so it behaves exactly as specified. So what we have here is predictable performance for the real-time applications that you run. You are always guaranteed the same data, the same performance, the same timing coming back every time you run that application. DMA and multi-buffering: we use what we call the DMA concept to move the data between the SPE and the PPE, okay? For latency, as we talked about, we use the software-managed cache model.
Along with the software-managed cache model, we also talked about multi-buffering. You learned in programming basics that if you want to optimize data movement from one location to another while continuously streaming a set of data, you have to provide more than one buffer, so you can flip back and forth between the buffers receiving your data and always have data available to you, right? It's the same concept here. Channels: we use the channels to pass data between the SPE and the PPE, okay? From the SPE to the PPE we use the channels, and from the PPE to the SPE we use the MMIO, the memory-mapped I/O registers, to pass the data. This chart shows the SPE organization, okay? We have different units here: the floating-point units, the byte units, and the fixed-point units on one pipe; on the other pipe, we have the channel unit, the permute unit, the branch unit, and the load and store unit. All of the arithmetic execution units reside in one pipe, okay, and all of the memory-related units reside in the other pipe, so you have two pipes. What this means is that if you have two instructions, one following the other, independent of each other, not related, and one uses the arithmetic units while the other uses the memory-related units, those two instructions can be issued in parallel in the same cycle, and of course, once issued, they can execute in parallel as well, okay? That's what we call the dual-issue organization right here, okay? So remember, we have two pipes in the SPU, and each pipe performs very specific functions, arithmetic-related or memory-related, and those can execute in parallel as well. We use DMA to transfer the data, at very high bandwidth, okay? And this is a block diagram showing the pipes, with the floating-point units and the fixed-point units on one side.
Remember, on the PPE side, in the PPE layout, you see the branch unit and the fixed-point unit in the same pipe, right? And then you have the VMX separately. In this organization here, you look and you don't see a VMX, because every SPU instruction is a vector instruction; it's not VMX. VMX refers to the vector instructions on the PPE only. The SPU instructions are a different instruction set, different from the PPE instructions and from VMX, okay? And here we have the two pipes: the floating-point unit and fixed-point units on one, and the permute, load/store, branch, and channel units on the other, okay? Instructions come in through the DMA unit here. The local store has a single port, with 128-byte reads and 128-byte writes, okay? Data comes in from the DMA unit, then we issue instructions, go to the register file, load the operands from the register file, and run. The local store is single-ported SRAM. The key point here is that this single-ported SRAM has only one port to supply everything. You have to supply instructions to the SPU to run, right? The SPU requires instructions. You have to do the DMA, transfer the data as well, right? And you also have to do the loads and stores; you use this same port to do the loads and stores to the registers, okay? So this single-port SRAM serves three functions, and access to it is arbitrated among those three functions: fetching the instructions, receiving the DMA transfers, and the loads and stores for the vectors. Each of those operations has a different priority, okay? The highest priority is the DMA: when you do DMA, you need the data right away. Next is the load and store. And finally, instruction fetch: bringing the instructions up from the local store is the third priority.
So, you will see those priorities, and how we play with those priorities, later on in some of the later presentations. Now let's talk about SPE issue. It is in order: every instruction executed must complete in order, all right? No out-of-order here. Dual issue, two pipes; and if we are running something and we do have a dependency, okay, we drop back to single issue. That is, we can only issue one instruction at a time instead of two. And here are the pipes, presented on the left and the right depending on which instruction we have; we call them the odd and even pipes, okay? Here is a diagram of the timing at the different stages. Each stage represents a cycle. So if you look at the left side, a simple fixed-point FX instruction, for example, runs on pipe zero, right? Pipe zero is our even pipe, okay? And this FX here requires two cycles to complete. The shift FX; the single-precision FP, six cycles; floating-point-to-integer, seven cycles, okay? Byte, four cycles; permute, four cycles; load and store, six cycles; branch and channel. So on the right side, each block represents a stage you go through, and each stage requires one cycle. An SPE branch misprediction costs you about 18 cycles if you miss, right? And that is not bad for an 11-FO4 processor. The point here is that it will cost you some cycles if you mispredict your branch. We provide an instruction, the branch hint, to help you predict the branch target, okay? And 18 cycles for this machine is not a whole lot, because we can hide those cycles as well. On traditional systems, a miss may cost about 80 cycles, 120 cycles, for example on some machines like the Power5, okay? So there are branch penalty avoidance techniques, which we will go into in detail later on.
For example, unroll your loops, and load the branch targets ahead of time using the branch hint instructions. SPE instructions: scalar processing is supported on a data-parallel substrate, right? All instructions are data parallel; every instruction on your SPE is a vector instruction, and they operate in parallel, okay? So if you have scalar data, we put it into what we call the preferred slot. Each of these slots is very well defined, depending on the data type that you have: a byte, a halfword, a 32-bit address, or a 32-bit word, and so on. That determines which bytes it goes into, say the first four bytes, okay? Memory mapping: this is the memory management unit. We have 16 bytes per cycle, and the bandwidth is about 25.6 gigabytes per second. Okay, this is the detail of the MFC. We have the load and store, we have the SPU here, we have the DMA engine, atomic facilities that allow us to implement cache coherency, the MMU for memory translation, the RMT, the replacement management table, the DMA queues, and the MMIO. We'll discuss more on these. Here are some of the SPE resources from the hardware point of view, listing the commands that let you interface with the hardware components. Okay, here we list the 4K physical page boundaries. And here are the SPE resources as well: the 128 registers of 128 bits, the signal notification, and the external events. Some of those events are more or less just hardware events, and they just allow you to control the resources, okay? From the programmer's point of view, to move the data in and out between the SPE and the PPE, okay, we use DMA commands like put and get. Remember, if you look at this one, you see only two commands, a get command and a put command, right? And then on top of that we add suffixes, like puts, putr, putl, putrl, right? These provide different semantics, to identify the different variations of the same basic operation, okay?
We also provide some cache management commands, and some synchronization commands for when you issue a set of DMA commands and want to make sure that all of the data you asked for comes back in the order and in the fashion you want it to come back, all right? The EIB: the EIB is a very fast switch, okay? We have 96 bytes per cycle of bandwidth, okay? The hardware implementation of this EIB consists of four rings: you have two rings running counterclockwise and two rings running clockwise. Each of them has its own bandwidth, okay, and each of them can carry more than one transfer at any given instant of time. On speed, we talk about 307.2 gigabytes per second of peak bandwidth between the units; realistically, we are looking at about 204 gigabytes per second, okay? And here is the hardware layout of the EIB. You have the two counterclockwise rings and the two clockwise rings here, and we have the data arbiter here to manage the data movement between those components. Here is an example where you have the different rings, ring zero, ring one, ring two, ring three, in different colors, and you have the different components: the MIC, the SPEs zero, one, two, three, four, the bus interface right here, and the I/O interface right here. Okay, so look at the red one over here: we have a data movement from SPE zero going straight through here to the PPE. At the same time, still on the red ring, we have a data movement from SPE one to SPE three, and from IOIF one here, going down to SPE four. So on this one ring you can transfer between the components, between the I/O devices and the different PPE and SPEs, at the same time. How can we do that? Because each of the SPEs has its own independent MFC.
The integration of the EIB and the MFCs is what allows us to do these activities. I/O: we have the I/O interfaces, 16 bytes per cycle. We have two of them, using FlexIO: one for the I/O and one for the memory controller, okay? And here are the typical configurations when you put a system together. On the left over here, you have just the XDR memory and the Cell BE processor; this is the typical single Cell BE processor. On the right side, at the top here, you put two together, okay? And then at the lower end here, you put four together; these go through a switch, and we form an SMP system. This is the maximum SMP configuration, about four processors hooked up together. If you go beyond four, okay, you use the I/O interface to hook the systems together and form a cluster. Using that clustering concept and that I/O interface, you can form as big a cluster as possible, based on the application, of course.