So, as we covered earlier, SPUs have no direct access to main memory. An SPU cannot directly address main memory; it goes through DMA only. When aliasing is set up — when an SPE's local store is aliased into the effective address space — then that SPE will do the address translation. But by default there is no address translation and no visibility into main storage apart from DMA accesses. Here is another diagram showing the access methods. The blue lines indicate the DMA transfer paths between one SPE and another SPE or the PPE, and the channels between the SPU and its local store, which handles 16-byte loads and stores. Through the local store an SPE can also talk to the I/O, to any memory-mapped I/O devices. On coherence and synchronization: all transfers between local store and main storage are coherent, and that is a necessary feature for this architecture, because there are eight processing elements, all of them mapped into the effective address space and all able to write into main memory. So there was an obvious need for a hardware-enforced coherence protocol. Whenever there is a load request for a particular address, all cached copies of data residing at that address are checked; if a modified cached copy exists, that value is first written back to main memory and then the load completes — basically a snoop mechanism. Each DMA command is tagged with a 5-bit tag ID, which is just an identifier for one particular DMA transfer. All SPEs can initiate DMA transfers, and the PPE can also initiate them; however, DMAs initiated from the SPEs are what we recommend. There are a few reasons for that. First, there are more SPEs, and the DMA command queue on the SPE side is bigger: 16 entries versus eight entries for the PPE command queue. Also, there is only one PPE, and it is already dealing with everything — controlling tasks, context switching, all of it, right?
If there is any context save and restore to be done, the PPE is the only entity that can do it — so why load the PPE with the additional task of initiating DMAs? Another reasoning we have is that consumer-initiated requests are more manageable: the SPE, being the consumer of the data, should always be the one initiating the DMA transfers. They are also more effective; peak bandwidth is higher when the DMA is initiated by the SPEs and the transfer size is a multiple of 128 bytes, because then the transfers generate full cache-line requests — the address offset in main memory and the local store address offset both need to be multiples of 128 bytes. In other words, we are not generating any partial cache-line requests; all cache lines are 128 bytes. So that is the reasoning behind it. And these are the registers on the PPE: there are separate registers to handle floating-point computation versus fixed-point, there is a condition register — a typical register layout, like in an Intel architecture. In the PowerPC instruction set, all instructions are four bytes long and aligned on word boundaries. It supports byte, half-word, word, and double-word accesses to all the general-purpose registers, and word and double-word operand accesses between storage and the set of floating-point registers. Signed integers are always represented in two's-complement form, and there is an extensive vector, or multimedia, instruction set — in other words, we can write vector code on the PPU also. However, what we really recommend is to do all the vector computation on the SPEs, because when you are writing vector code you are trying to write effective high-performance code, and the SPEs are more adept at handling compute-intensive code. Again, these are the instruction types, for reference.
So there are cache control instructions available, flow control instructions, general load and store instructions, and memory synchronization instructions, typical of a PowerPC architecture. Some more user-mode instructions: again, the processing units in the PPU — the fixed-point unit, the floating-point unit, the vector unit — have some more instruction types. Basically, the point of all this information is that support for all these instructions is available. So your application could be anything: an algorithm from the A&D space, from Electronic Design Automation, seismic, image processing — it can be any application. The instruction set is capable of handling a diverse range of applications. The message we are continuously trying to send by doing all these workshops and contests and ecosystem-related activities is that the Cell is not just a PS3 or gaming chip. When it was designed, the architects had it in their minds that it had to take technology and high-performance computing to a totally different level, through a breakthrough design that beats the competition we face from all these SMP architectures and other multi-core architectures. And when you design that kind of breakthrough technology, obviously the programming will not be as similar or common as mainstream programming. That is why you need this different kind of programming, with DMAs and synchronization — it takes a step more effort to get the performance out of this hardware. The PPE has its own C/C++ language extensions with vector data types. Even on the PPE, the vector data type is 128 bits, 16 bytes, so it can hold sixteen 8-bit values, signed or unsigned, eight 16-bit values, four 32-bit values, or four single-precision IEEE floating-point values. And there are three categories of intrinsics for these instructions, the VMX instructions.
Specific intrinsics have a one-to-one mapping with a single assembly-language instruction. Generic intrinsics can map to one or more. Predicates are conditional intrinsics: basically, they compare two values and generate a mask, and that mask can be used directly as a value, to detect a condition, or for a branch. And this is how a VMX program would look. The top part of the program defines a union consisting of a four-element integer array and a vector type, typedef'd to vecVar. The main program is initializing a vector — this is how you would initialize one. vConst is of type vecVar, the union we just defined, and vConst, myVec, is initialized to four floating-point values: 2, 2, 2, 2. v1 is another vector of that same union type, initialized with different values, and then we do a vec_add. That vec_add is called a vector intrinsic, a VMX intrinsic. So basically, with one instruction, in one cycle, we are doing four additions, which is the whole premise of data-level parallelism, or single instruction, multiple data: with one single instruction you operate on four vector elements. The result is stored in myVec, and you can print it out the way it is shown in this program. The SPU register file is much simpler than the PPE register model: 128 registers, all alike — unified registers, so the same registers are used for floating-point and fixed-point arithmetic. The SPU executes both single-precision and double-precision floating-point operations. Single-precision operations are performed in a four-way SIMD fashion, just like operations on signed or unsigned ints, and they are fully pipelined. Double-precision operations, however, are only partially pipelined. This morning we covered the even and odd pipelines, where you can issue two instructions in one clock cycle.
One is the even pipeline, for arithmetic, and the other one is for memory operations. With double-precision instructions, though, an instruction cannot be run alongside any other instruction in the same cycle, so dual issue is not possible. Apart from that, once you execute a double-precision instruction, for the next six cycles you cannot execute instructions of any other type. So there is a 13-cycle latency: seven cycles to perform the double-precision operation, and then six more as a stall. Double precision on the current Cell BE hardware is therefore a little expensive — not as good as single precision — but it is still pretty much on par with other architectures. Only one rounding mode is supported: round toward zero. Denormal operands are treated as zero, and there is no support for infinity or NaN — no infinity, no not-a-number. And just as I mentioned, no other instructions are dual-issued with double-precision instructions; they have a 13-cycle latency, only the final seven cycles are pipelined, and after you execute a double-precision instruction you cannot execute anything else for six cycles. The SPU local store holds both instructions and data, so in the 256 kilobytes of memory that you have, you have to fit the code and the data sections. An instruction prefetch will deliver at least 17 instructions. And because a DMA transfer touches the local store only once every eight cycles — in other words, roughly one cycle in sixteen for a DMA read and one in sixteen for a write — DMA does not really monopolize the single-ported memory. There is only one port for reads and writes, and three different kinds of operations have to go through that local store: DMA reads and writes, SPU load and store operations, and instruction prefetch. The highest priority goes to DMA reads and writes, then loads and stores, then instruction prefetch.
The reason loads and stores have a higher priority than instruction prefetch is that a load or store will at least help program execution make forward progress. Instruction prefetch, on the other hand, is speculative, right, most of the time. As for the even and odd pipelines and which instructions issue in each: there is the Programming Handbook on the Microelectronics website, and it fully covers everything from the architecture to coding standards, optimization mechanisms, thermal and power monitoring, performance counters, virtual storage, decrementers, hardware clocks — everything is covered in the handbook. A good part of the handbook also goes over the instructions, their clock-cycle latencies, and which pipeline they are issued in. All data is big-endian on the PPU and SPU, which means the MSb is bit 0 and the LSb is bit 127. Again, for the SPU instruction data types, the names look the same: it is always vector, followed by the data type. There are about 204 instructions. Among the shift and rotate instructions, there is no rotate right; when you have to rotate right, you just give the left rotate a negative count. There are also synchronization- and ordering-type instructions. And in the SPU language extensions there are again three classes: specific, generic, and composite. The third category is different: composite means a sequence of two or more specific or generic intrinsics. Specific maps directly to one assembly-level instruction, and generic is basically one or more assembly-level instructions. The naming shows it too: any specific intrinsic is prefixed with si_, and a generic is always prefixed with spu_ — spu_add, subtract, multiply, everything. And this is how data is stored in memory: the first four bytes are called the preferred slot. So, in other words, we highly discourage the use of scalar data types on the SPUs.
We always say: just use vector, because that is the way to optimally use all the resources available to you. If you do have to use scalar data types, always make sure the data resides in the preferred slot, the first four bytes. I think in the third presentation we go over the SIMD operations, but basically this helps with alignment — it is less overhead for the compiler, and it is just better in terms of execution. So, in other words, this is one vector consisting of 16 bytes, which can store four integers, or four floats, or two double words of eight bytes each, or one quad word. These are typical example instructions: spu_insert will insert a scalar into a vector, spu_promote will promote a scalar to a vector, and spu_extract pulls an element out — you can give it an index. Going back: say we are dealing with four unsigned ints in one vector. Bytes 0 through 3 would be integer element 0, and bytes 4 through 7 would be integer element 1. Sometimes you might need to extract element 2 or element 3 from a vector and print it; you can do that via spu_extract, and there is a whole set of instructions like this. All loads and stores are one quad word at a time — 16 bytes at a time. There are built-in compiler directives: __builtin_expect is used for branch prediction. Remember, the SPU does not have any hardware support for branch prediction. We tried to keep the SPU hardware as simple as possible — less silicon, less heat, less complication — and enforcing 16-byte loads and stores removes the necessity of dealing with exceptions or traps. And the way we enforce that all data types are aligned on 16-byte boundaries is: when you define, say, a float vector, use the aligned(16) attribute to align it for efficient data transfer.
And you can also give the compiler hints with _align_hint. This is an example of an SPU program, similar to the other example that we saw: a union consisting of four integer values and one vector signed int, with three such vectors defined. Then we initialize the first vector with four values, initialize the second vector with four values, and to add them, instead of vec_add, on the SPU you do spu_add. So there is the local store alongside the SPU processing unit, and then there is the memory flow controller. The memory flow controller consists of a DMA queue, where you can queue up DMA requests; a DMA engine, basically a DMA controller; an atomic facility; an MMU for any accesses that need translation; and the RMT, the replacement management table. And then there are memory-mapped I/O registers to be read and written by the PPE. On MFC commands: the memory flow controller is the whole and sole entity responsible for all communication with any other devices from the SPEs. The purpose of MFC commands is twofold: to access main storage, and to maintain synchronization with other processors and devices in the system. Now, they can be issued either by the SPU or by the PPE — the commands go through the MFC either way. From the SPU side, these are all channel instructions: any time you are issuing a command, it is basically reads and writes of channels. From the PPE side, though, it is always memory-mapped I/O. As we covered in the previous diagram, whenever the SPU needs to communicate with the MFC — the red line here is the data bus — it uses channel instructions.
And whenever the outside entity, the PPE, sitting somewhere over here, needs to talk to this MFC, it uses memory-mapped I/O. All MFC commands are queued in the MFC SPU command queue; the proxy command queue is for the PPE-initiated, memory-mapped I/O commands. And here is some detail for your reference. All operations on any given channel — channels are strictly unidirectional; they are just message-passing interfaces, basically — are always done in program order, and you can query the status of a channel with reads and writes. Again, whenever we talk about any DMA request, mailbox communication, or synchronization command, the reference point is always the SPE: when we say get, it is into the SPE, and put is out of the SPE. The internal instruction behind a DMA command is a write-channel instruction. The composite intrinsic — we just covered that a composite is a sequence of one or more specific or generic intrinsics — is spu_mfcdma32. But what we actually use is an external wrapper, a much simpler wrapper routine, mfc_get. This is the command we will most commonly use in all the code, and these are all defined in spu_mfcio.h. This is the syntax of mfc_get, a simple DMA transfer command. Basically you give it the local store pointer, where you want the data to land once it is fetched from main memory; the effective address in the main memory space you want to fetch from; how many bytes; and a tag ID, so that once you have initiated the DMA request, you can query whether it is done or not. The tag ID can be used to query the status of a particular DMA transfer. And then there is the transfer class ID, TID, which is used when you want to manipulate the transfer mechanism on the bus; usually it is just the default.
I'm not sure if it is enabled in the current SDK yet. RID is the replacement class ID; again, that is something we use with the software-managed cache. There are features for fences and barriers, like the fences and barriers a typical Linux OS supports — nothing different here, except that we have now combined the fence and barrier feature with the DMA command, so we have more control over I/O. There is a 5-bit DMA tag on all DMA commands, so we can create up to 32 tags. Here is an example of fence versus barrier. Say the green slot is a DMA transfer that you have issued with the barrier option: this barrier command will not execute until all preceding DMA transfers are done, and no succeeding DMA commands can execute until this barrier command itself is done. A fence, by contrast, only requires all the preceding DMA operations to be done before this particular DMA transfer executes; any DMAs issued after the fenced command can just proceed fine. That is the difference between a fence and a barrier — like a typical fence and barrier. Transfer sizes always have to be 1, 2, 4, 8, or a multiple of 16 bytes per DMA, with a maximum of 16 kilobytes per DMA transfer, and 128-byte alignment is preferable to get optimum performance. The command queue has 16 entries for SPU-initiated requests; for the PPU it is an 8-entry queue. Then there is something called a DMA list, which is an excellent feature provided on the SPEs of the Cell Broadband Engine. It is basically a scatter-gather mechanism by which the SPE can initiate a list of transfers: a DMA list says, asynchronously, gather memory from wherever it sits in the effective address space.
And that happens asynchronously while the program is executing on the previously fetched data. The advantage of the DMA list is that it is a kind of solution to fragmentation in the effective address space: it will just gather the data from wherever it sits, maintain it in a list, and process the whole transfer asynchronously while execution continues on the previously fetched data. A list can contain up to 2K transfer requests, and each transfer request can be up to 16 kilobytes, so in total it can fetch 32 megabytes of data. That is the API, mfc_get — the simpler API where you just specify the local store address, the effective address, the size, and usually those last two fields are zeros. You can check the status of a DMA command with the read-tag-status command, and mfc_write_tag_mask is the API to set the tag mask. Basically, a tag status bit of one indicates that no DMA requests tagged with that specific ID are still in progress — in other words, the group has completed. You can create multiple DMA requests with the same tag ID, so in a way you are grouping them in one category. Sometimes you want to make sure a whole bunch of DMAs can be collectively addressed as one — it is like creating threads in one thread group. Similarly, by issuing all the DMA transfer requests under one tag ID, you can query on that tag ID to check whether all of those DMA requests are done. So you can check the status of one tag or of all tags. Here is a typical example of a DMA from memory to local store: define the tag mask and then do an mfc_get operation. It reads the contents from mem_addr in the effective address space, puts them at the address pointed to by ls_addr, and then you check the status and wait for all the DMA operations to be done. So that was the get.
And this is the put. When you are done with the computation on the SPE side, you can use the mfc_put command to write the result back to main memory. Now, how do you transfer data between one SPE and another? One SPE does not know the other SPE's addresses. But the PPE knows all the SPE IDs and their address offsets. Say I am SPE2 and I want to send data over to SPE4. There is no point sending the data over to the PPE and having the PPE send it on to SPE4. Instead, the SPE can use a mailbox or some other mechanism to obtain the local store address of SPE4. Once you have the local store address of SPE4, you use the same API as before: instead of a main memory address, you give it the local store address of the other SPE. A tip for achieving peak bandwidth: always use quad-word-aligned data requests — they always have to be aligned on a 16-byte boundary. Mailboxes: they are always 32 bits in length. Mailboxes are a really neat feature — a really lightweight mechanism to, say, query status, send error codes, or return codes for program completion. Some kind of message that says: okay, I'm done, you start; or there's an error; or wait, something is wrong. Any kind of short message you can send via mailbox — 32-bit messages. The SPE write outbound mailbox queue, for all the messages the SPE wants to write out to the PPE, is one deep. There is also a write outbound interrupt mailbox queue, which is the same except that after the data is written it generates an interrupt. And then there is the SPE read inbound mailbox queue, which the PPE writes to — any messages the PPE wants to send, the SPE can read from there — and it is four deep: four 32-bit messages, 16 bytes in all.
This is how the PPE works with mailboxes — again, it has to go through the MMIO registers. And this is the API: pretty straightforward, really simple read and write calls. There is one PPE mailbox, one PPE interrupt mailbox, and then the SPE mailbox. Any time the SPE wants to read from its inbound mailbox what the PPE has written for it, it can do a mailbox read — and all SPE reads and writes to mailboxes are blocking. In other words, if the SPE tries to write to an outbound mailbox that is already full, it will halt — it will stall, basically. So it is always a good idea to check the status first: read the channel count and see whether there is already data in the mailbox, whether it is already full; if it is full, you wait. Likewise, if the SPE tries to read from an empty mailbox, it will stall until the PPE writes something into it. So, to avoid these stalls on the SPE, it is always a good idea to read the mailbox count and check that it is not empty, or not full, as the case may be. The PPE is different: if the PPE tries to write to a mailbox that is already full, it will just overwrite the last entry — which, again, is not a good thing. It will not halt the PPE, because halting or stalling the PPE is a bad, bad idea, right? Your main application is running over there. So you want to check the status and make sure you do not overwrite data. Again, this is just the API description: if the outbound mailbox is full, the channel count will be zero, and if the SPE tries to write to a full mailbox, it will remain stalled until the PPE reads a message — leaving one 32-bit entry vacant, so there is room for the write. And when the mailbox is read through the memory-mapped I/O address by the PPE, the channel count is incremented.
Similarly, the write outbound interrupt mailbox is the same as the write outbound mailbox except that when the write is done, an interrupt is raised — provided interrupts are enabled. So first the interrupt is serviced, and then the PPE goes and reads the MMIO register to get the data value out of it. So, to avoid an SPE stall, always check the count before writing or reading, at least on the SPE side. And on the PPE side, if you do not want to read stale data or overwrite data, make sure you check the status. Similarly, the read inbound mailbox channel will stall on an empty mailbox: when the SPE wants to read a message waiting for it in the inbound mailbox from the PPE, make sure the mailbox is not empty, because if the SPE tries to read from an empty inbound mailbox, it will stall until the PPE writes something to it. And this is the way to create a thread. When the PPE wants to create an SPE thread, this is how it does it: spe_create_thread. The first argument is the thread group — if you have created a thread group before calling this API, you pass that in — and then there is the pointer to the program. Duke is going to cover that hands-on in the lab, where you can look at the code and see the SPE program. Basically, this SPE load image has to have the same name that is defined in the SPU makefile under program_spu, so that it becomes a global reference. Again, to check the status of a mailbox there is a simple API: the out-mailbox status call — you pass it the SPE ID and find out if there is any data. The write outbound interrupt mailbox works the same way. When the PPE calls it, it passes in the SPE ID; when the SPE calls it, obviously there is no SPE ID — when it does the write, it just sends the data over to the PPE.
So in the lab there will be a very simple example. Our premise is basically to get the message across about the basic mechanism — how the transfer happens and how the APIs need to be used — and we hope you will build on top of it. Obviously, in the real world it will not be as simple as this; it will be much more complicated. But if we set the right groundwork, we can be sure that real application development will come more easily. Again, we just saw this in the previous presentation: reading into the local store uses the mfc_get operation. So for example one, there will be a getbuf.c when we cover these examples. All it does is: it is an SPU program that transfers data into an SPU buffer from the PPU — from the effective address space, using the mfc_get operation. It gets the data into the SPU and then prints the data in the buffer to verify the transfer. This is how the PPU program would look. First all the includes: libspe is the SPE runtime management library, and then just standard header files. One thing I want everyone to notice in the PPU program: you will always have to define one variable for the SPE ID. This is the buffer that we are creating, and the malloc call is not a standard malloc call — it is malloc_align. The support for this is in the libmisc.h header. So malloc_align here says: allocate 128 bytes and align the allocation on a 128-byte boundary, 2 to the power 7 — the 7 argument is just the log2 of the alignment. Into this buffer we write the data, "good morning", and then we create the thread. In this case we have not created the thread under a group, so the first parameter is 0, and getbuf_spu is the program handle for the SPU program, which we will see right now.
The makefile will look like this: the base-level makefile, with the subdirectory SPU, and the PPU program having this name. And if you notice, in the imports section you are linking to the SPE runtime library and the misc library for all the memory operations, and then putbuf_spu — which should be the same as that parameter; I think it is incorrect over here, it has to be getbuf. So this name has to be the same name as the program handle over here, and it is declared over here: extern spe_program_handle getbuf_spu. Essentially, in the PPU symbol space this program handle — whatever the SPU references are — will be visible to the PPU, because it is embedded in the PPU executable itself. And this is the smartest way we could come up with to entertain two different architectures inside a single binary, and also to save on things like inter-process calls. We do not need any IPC, we do not need any messaging mechanisms; we wanted to keep it simple, so the symbols are visible in the global address space and we do not need any explicit commands to make SPU symbols visible to the PPU and vice versa. So getbuf_spu is the program handle, a reference to the SPU executable. The buffer is passed as a parameter to spe_create_thread, and then there is the environment pointer, which you could use to send data over to the SPU code if you wanted to. So this is how we say: okay, I have created a thread, and this is what I want the thread to do — because getbuf_spu is not just a program handle, it is a program that gives the thread a defined role. You can create a thread, but what is the thread going to do? What is its code portion? By passing the handle when you create the thread, you are actually giving it its code portion.