The next presenter has recently finished his honours in computer and electrical engineering at the University of Adelaide. Please welcome Joel Stanley.

Hello everyone, thanks for bearing with me. My first apology is about the title there. To change my slides I have to actually regenerate the bitstream for my FPGA, so it's not easy to change my slides, so I used some previous ones. My project is titled Exploring the Communication Architecture of Multi-Processor Systems-on-Chip, a bit of a mouthful. What we're all about is looking at modern computer systems, which these days have lots of processors. There's an entire miniconf stream dedicated to this topic this year, which I'd probably be almost better off being at, but I've been designing rockets all afternoon, so there you go.

So we're going to talk a little bit about why this project was selected, why I did it. Then talk a little bit about the FPGA and the different hardware blocks I developed for it. We might even talk about what an FPGA is, just so the crowd knows what we're talking about. And then we'll talk about the different applications that I put on the FPGA. Unfortunately I'm not demoing my project as part of my presentation like what normally happens. Normally I play my slides, I use my legitimate, original NES controller to change the slides, and then do a demo at the end, but unfortunately it's not working. Maybe a lightning talk later this week.

So why do we have multi-processor systems-on-chip? Why do we need lots of processors? Systems are getting smaller. We're integrating more and more functionality into fewer and fewer chips. So on the left (on my right, your left) there we have a Nokia from the late 90s, and this device has a number of ICs and very limited functionality. It's a very small number of pixels on the screen. It essentially only makes phone calls and sends texts. Whereas on the other side we have the iPhone 4. Everyone knows what that can do.
It's essentially a computer in your pocket, able to not only make calls, and therefore do the digital signal processing associated with calls, but also play games, which are computationally intensive, play back video, again computationally intensive, and also perform general-purpose computing tasks.

So when we're designing these multi-processor systems-on-chip, there are kind of two design strategies that the engineers take. One of them is ad hoc: you buy different components from different vendors and integrate them into the one design, and each time you come up with a new design you're going to be doing that integration work over and over again. The other downside is that it's quite static. Once you've produced this system, all your different blocks can only talk together in a previously configured way. So you can't change your mind and later on decide you want to have some innovative new functionality that involves this part and that part, because it's quite static. The example up here is the OMAP 3430. If any of you own an N900, you've got this chip inside your phone. A number of other things as well, which don't come to mind right now. The BeagleBoard as well; that's the other big one that uses the same chip.

The other design strategy is a tiled architecture. Students are quite familiar with the concept of cut and paste; engineers use the same thing. You create one processing element and you cut and paste that many times, so you're reusing your designs a lot. The other way it differs is that you not only have multiple things which perform the same functionality, but the interconnect between them is quite malleable. It's just a memory interface, in this case, that you can reprogram, and you can reconfigure all these different elements however you like. And so it's not decided at design time what the eventual functionality of the system will be. This architecture here is the Cell Broadband Engine, commonly known as the PlayStation 3 processor.
And so this is used for gaming on the PlayStation, obviously, but also for supercomputing applications as well. That kind of shows how you can get into different things, not necessarily what it was designed for, after the fact, unlike the other architecture that I talked about just before.

So, my project. When I keep referring to my project, I'm talking about the final year project that you have to do at the end of your engineering degree. It's a year-long project, usually done as part of a team; I did mine by myself, essentially. And so at the start we decided it was a good idea to understand the hardware we were working with. That's this board I've got at the front here. At the start of last year it was brand new, fresh from the factory, you know, only a couple of hundred around the world. And essentially we were debugging it. You buy it for cheap because the software doesn't quite work yet and there aren't many applications for it. So, things like getting it to talk to a PC through the USB interface: there was no firmware for the USB controller, so we had to write that before we could talk to the PC and put our data on there. Et cetera, et cetera. So: understand the hardware, understand what it can do, what its limitations are. That was the first goal.

The second goal was to implement a simple multi-processor communication application. It's the kind of thing you learn a bit about if you've done operating systems at uni. I hadn't done operating systems at the time, so I learned about that; learned about the kinds of communication paradigms people use for talking between two processes. The same things generally apply whether it's two processes running on an operating system or, as in this case, code running on two separate CPUs. And finally, I developed a killer application, something fancy to show off at conferences like this, when the hardware actually works. So, as I said, this is the board.
And on the board... so, rewinding a tiny bit: who knows what an FPGA is? Okay, so about half. An FPGA, at its simplest, is a chip that can be programmed to act like another chip. It's quite a bit more complex than that, but that will get you through this talk. You create designs using an IDE, essentially, and then download them to the device, and it will then act like essentially any other processor you want. The limitations are clock speed and the number of programmable logic elements that are on there. This particular board, the FPGA inside it, has been used to emulate the entire Intel Atom, the first-generation Intel Atom, and that uses up almost all the logic elements. That gives you some perspective on how complex the designs can get: quite complex.

But in my application I used a processor called MicroBlaze. This is a RISC-like architecture; if anyone has studied Hennessy and Patterson, it's similar to the DLX machine. Quite simple. It doesn't have much grunt, much less grunt than the simplest netbook you could buy these days. But there are some ways to get around that, and that's essentially what my project was about. It wasn't about powerful processors; it was about optimizing the communication, making the communication between the different processors efficient.

And then there are a bunch of other parts that make up the FPGA system. There are different kinds of memories. There's the DDR3, the off-chip memory like you have in your laptop, and there are on-chip memories. The difference between the two is essentially latency: how long does it take to go and fetch something from, or put something in, that memory? Ideally you'd have unlimited amounts of on-chip RAM and everything would be really, really fast. The problem is then your chips would be infinite in size, consume infinite amounts of power and put out infinite amounts of heat, and that's not going to happen. So you have to make a trade-off.
Where are you going to put the data that's most important to you, and where are you going to put the data that you access a bit less frequently? Deciding which data falls into which category is one of the problems that multi-core computing has to solve. These different memories are connected by different types of buses with different properties; I won't go into that now, but the different buses result in different kinds of latency. Some of them are shared, so if there's lots of other traffic across them, the latency goes up. Others are dedicated, so much lower latency.

So I've spoken a bit about memory latency; I've mentioned it a few times already. That was the first part of the project: understanding how slow the different memories were, and the different buses that connect them. The problem is that the only way to measure it is from inside the FPGA itself. So you might sit there and talk to a memory-mapped timer and say, start recording now, but that command has to go across the very bus you're trying to measure the performance of. So you don't really know when that start-now command is going to get there, and that's the hazy line here between start and then time. So the first part of the project was reading assembly dumps and working out exactly what the latencies were, and producing lots of graphs like this for putting in reports.

This shows the order of magnitude of the different memory latencies. As you can see, some of the latencies move up and down; those are on the shared buses I talked about just before. So when the display controller is sitting there copying the frame buffer out to be displayed on the screen, the latency is much higher because there's much more contention on the bus. And then at other points, in the blanking interval, if you know anything about how VGA works, there's no traffic, you're not clocking anything out, and so the latency drops a fair bit.
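The measurement problem described here can be sketched in C. This is an illustration of the approach, not the project's actual code: the timer read is hidden behind a function pointer because the real counter register address comes from the XPS-generated memory map and is board-specific.

```c
#include <stdint.h>

/* Sketch of the latency measurement described above. The catch from
 * the talk applies: on real hardware, both timer reads must cross the
 * very bus being measured, so the result is a fuzzy upper bound, not
 * an exact latency. read_timer is a callback so this can run
 * off-target; on the MicroBlaze it would be a volatile load from the
 * XPS timer's counter register (address assumed known). */
typedef uint32_t (*timer_read_fn)(void);

static uint32_t measure_access_ticks(timer_read_fn read_timer,
                                     volatile uint32_t *target)
{
    uint32_t start = read_timer();  /* "start recording now" */
    (void)*target;                  /* the memory access being timed */
    uint32_t stop = read_timer();   /* second bus crossing */
    return stop - start;            /* includes timer-access overhead */
}
```

On the real system, as described, the way around that fuzziness was reading the assembly dumps and counting cycles by hand.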
And these are the kind of challenges that your application has to contend with if there are lots of things going on at once.

So the first, trivial demo application I put on there was a JPEG decoder. This is the JPEG pipeline: there are essentially five stages. We won't go into them today, but they perform various signal processing and decompression tasks. It's a good application for multi-core programming because it can be split up in two ways: different stages running on different processors, or you can split up the image itself, say one processor decodes one quadrant of the image and you split it across four processors that way.

This is the first system I built: the two big squares are processors, each with their own local memories, a shared bus, pushing decoded JPEG images out to a display controller, and you debug it all through the serial port that's there. Looking at JPEG decoding, the first iteration took about 300 million cycles to decode a full-screen image. The first optimization I performed was, instead of having to fetch that source image from DDR each time, to stream it in via the USB link, and as you can see, this doubled the speed. So this is one of the challenges in multi-core programming on systems that have non-uniform memory architectures: some of the memories are going to be a lot slower than others, so if you can get the data closer to the consumer, your system is going to go much faster.

That brought me up to about halfway through my project. I had essentially met all the academic goals, so I decided to do something a little bit more fun. This is the multi-core Game Boy emulator. We took gngb, an open source Game Boy emulator (apt-get install gngb; you can run it on your laptop now), and this emulates the Game Boy Color hardware. The Game Boy Color is an overclocked Game Boy.
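The quadrant split mentioned above can be sketched as follows. `quadrant_for_core` is a hypothetical helper, not code from the project, and it glosses over the fact that a real JPEG bitstream needs restart markers before cores could enter the entropy-coded data independently.

```c
/* Hypothetical work division for the data-parallel JPEG split:
 * four cores, each decoding one quadrant of the image. Assumes even
 * image dimensions; a real JPEG would also need restart markers so
 * each core can start decoding its quadrant independently. */
typedef struct { int x0, y0, x1, y1; } rect_t;

static rect_t quadrant_for_core(int core_id, int width, int height)
{
    int half_w = width / 2, half_h = height / 2;
    rect_t r;
    r.x0 = (core_id % 2) ? half_w : 0;  /* cores 1,3 take the right half */
    r.y0 = (core_id / 2) ? half_h : 0;  /* cores 2,3 take the bottom half */
    r.x1 = r.x0 + half_w;
    r.y1 = r.y0 + half_h;
    return r;
}
```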
It's twice as fast, an 8 MHz Z80, essentially, with some dedicated sound hardware, a very, very small frame buffer, and a limited number of colors on screen at once. So we took the 15,000 lines of code and threw most of it out: lots of abstractions for running on different platforms, code we didn't need. It's written to talk to libraries like SDL, whereas on this system we're running without an operating system, straight on the processor. When you're talking to, say, the frame buffer, you're writing pixels directly to a memory region; when you're reading the button input, you're reading from a register. So you don't need all those abstractions that SDL provides.

The system I used has an input device, being the NES controller; some sound output, which we developed because it wasn't on the board; a frame buffer; and four processors, as the title of my talk mentions. The Game Boy system started off with a single tile: a single processor with some local memory. These local memories are on the order of about 128 kilobytes each, so quite small; I mean, that's about how much cache your little Intel Atom processor has these days. So not much memory at all. And I cut and pasted, tiled architecture style: four processors, each with a dedicated functionality in this case, glued together across that big shared bus I spoke about before.

Initially the bootloader loads the ROM image from the CompactFlash card into memory, and then kicks off the Game Boy instruction set simulator. That's what the system was initially, and we got about 10% performance with a system like that. Nowhere near playable; not much fun to play at all. So we split off the sound processing to another processor. Inside the Game Boy itself that's just a register write, so that's the interface between the sound core and the instruction decoding.
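That bare-metal style of I/O, writing pixels straight into a memory region and reading buttons from a register, looks roughly like this in C. The base pointers are parameters because the real addresses come out of the XPS-generated memory map, and the RGB565 pixel format is an assumption for the sketch.

```c
#include <stdint.h>

/* Bare-metal I/O in the style described: no OS, no SDL. The frame
 * buffer is just a memory region and the controller is just a
 * register. Base pointers are passed in (real addresses are
 * board-specific); pixel format assumed RGB565 for illustration. */
static void put_pixel(volatile uint16_t *fb, int stride_pixels,
                      int x, int y, uint16_t rgb565)
{
    fb[y * stride_pixels + x] = rgb565;  /* direct store, nothing in between */
}

static uint8_t read_buttons(const volatile uint32_t *pad_reg)
{
    return (uint8_t)(*pad_reg & 0xffu);  /* low byte: one bit per button */
}
```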
So it's very easy to split across two processors, because you've got that well-defined interface. The other bit of optimization we did was moving out the video processing, partially to a color space converter, a bit of VHDL to shuffle the pixels around a little bit, and then to a video DMA core that essentially just copies the pixels in and pushes them up to the display, so that the instruction set simulator is only doing the work of playing the game.

This is what the inside of the FPGA looks like with that design on there. Each of the different colored blobs is one of those tiles I showed you just before. It shows you how little of the chip we're using, even for a four-core system. And now is about when I'd usually press the start button and show the Game Boy demo. Unfortunately I can't do that, because I couldn't get the FPGA working, but hopefully I'll get it working later this weekend or later on. This is what I planned to do this summer; I didn't do any of this stuff, I decided to find a job instead. But there are some tasks that people are going to pick up next year as they continue the project at the uni. Any questions? No? One more thing: yeah, maybe I'll get it working then.

I did an FPGA course last semester. We did a bit of Handel-C and VHDL programming, so I'm quite interested in how you actually went about coding and what tools you used, because it sounds like there wasn't actually much available. We were using the Xilinx tools that were provided for you, so we didn't have to deal with any of the actual underlying hardware stuff. Could you talk about that for a bit?

Yes. I'll just bring up my slides again. There we go, this one. So most of the IP initially was just the stuff that Xilinx provides with XPS, the stuff you would have used. That's the processors, the large squares. And then the first step, when things weren't going fast enough, was to customize the hardware that they gave us.
So one of the first things I wanted to do was increase the resolution of the display controller that it comes with. It only does 800 by 600 or so and doesn't look very good, right? And also, that's a fairly small amount of pixels; you essentially double the amount of copying you have to do by going up to the next resolution, XGA. So that's the first hack I did: it was a Verilog module, and I changed it to do twice the resolution. If I were showing it on my board, you'd see how nice it looks. So that was the next step: customizing the hardware they give you.

Then the step after that was, if the hardware was too hard to customize, or if it wasn't performing the functionality we required, I'd write it from scratch. That was things like the color space converter and the NES controller input; that's just a state machine that serializes the NES buttons. And the I2S generator we actually got from the net. It's a nice bit of open-source hardware; there's not much of that in the hardware space, so it was good to see. Yeah, so writing VHDL, if you haven't done any HDL or any of that kind of programming, is tricky, as you would have discovered, but we got there in the end. I didn't use Handel-C, so I'd be interested to hear how you found it. My supervisor didn't have a very high opinion of it.

Yeah, it was interesting. It sounds like it's basically dead and people are switching over to SystemC now; that's what I've heard, just if you're interested. We wrote a DES core, and the idea was you could have multiple boards. We had these Spartan starter boards that come with a VGA output and sound output and all that kind of stuff, and you'd connect two of them together, one encrypting, one decrypting, with the VGA output and all that, written by hand, essentially, not using any of the predefined cores. So it was an interesting project.

Cool, we'll talk about it later.
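The NES pad state machine just mentioned was a bit of VHDL in the project. The same protocol sketched in C, with the GPIO lines behind callbacks so it is only an illustration of the latch-then-clock sequence rather than real driver code, looks like this (real pads drive the data line active-low; the inversion is omitted here):

```c
#include <stdint.h>

/* The NES controller protocol the state machine implements: pulse
 * LATCH to snapshot the buttons into the pad's shift register, then
 * clock out 8 bits (A, B, Select, Start, Up, Down, Left, Right).
 * Pin access is behind callbacks so this sketch runs off-target; in
 * the project this was a small hardware state machine, not software. */
typedef struct {
    void (*set_latch)(int level);
    void (*set_clock)(int level);
    int  (*read_data)(void);     /* serial data line from the pad */
} nes_pins_t;

static uint8_t nes_read_pad(const nes_pins_t *pins)
{
    uint8_t state = 0;
    pins->set_latch(1);
    pins->set_latch(0);                  /* buttons now latched */
    for (int i = 0; i < 8; i++) {
        state |= (uint8_t)((pins->read_data() & 1) << i);
        pins->set_clock(1);              /* shift the next button out */
        pins->set_clock(0);
    }
    return state;
}
```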
I've got a video of it, if I can load it up on YouTube. Any other questions while that loads?

Actually, I have a question. You talked about splitting out the video processing, the audio processing and so forth, and I'm not a low-level hardware person so I may not entirely know what I'm talking about, but did you look at, or consider, trying to parallelize, say, the video processing, so that you could have a couple of smaller cores processing different parts of the video, to make more use of the available resources?

So, I mean, that's one of the objectives when you're trying to parallelize something across multiple processors. In this case, all the video CPU itself was doing was taking in data from a register read and pushing it out to the frame buffer, so it was just a very overcomplicated DMA. If there was some parallelization there, we would have tried it, but essentially the division you saw up on the board, the three different cores in this application, it doesn't really map very well. The JPEG maps really well to lots of cores: you can split it across the five different pipeline blocks, and you can also do different parts of the image. But in this case the emulator is kind of designed to run as one program.

I'll just show you this video briefly. This is my supervisor playing Mario on a projector; this is the first time we saw it working, it was kind of exciting. I hope it stays on that monitor. It does, there you go. It's very shaky; it's taken with my G1. How does YouTube work? Why can't I rewind it? So it gives you a bit of an idea of what it looks like. We decode the JPEG when you first boot up, essentially the bootloader decodes that, so it's just a picture sitting on the screen. And the demo is going to show it running at about 90% speed. It's really interesting the way humans perceive video versus sound: when you're watching video, if the frame rate drops a tiny bit you generally don't
notice, but if you miss a sound sample out to the DAC, you'll hear a click straight away. That's why I pointed out that it runs at 90%: especially in more complex scenes, you hear the clicking as it's not producing sound samples often enough. There's actually another hardware module in there which illuminates LEDs as you miss sound samples, so you see the counter go higher and higher as the system performs worse.

Okay, thank you very much, Joel. Thanks, everyone.