...as a leader in FPGA embedded Linux technology, working with a wide range of commercial and public sector organizations, including providing research support to NASA for its Reconfigurable Scalable Computing project. English is my fourth language, as you can imagine. At the university he held teaching and research positions, and then led the successful spin-out of PetaLogix from the University of Queensland into a privately held company in 2007. John holds honours degrees in electronic engineering and information technology, and a PhD, awarded in 2001, for his work in 3D computer vision and image processing. So, I've finished with your CV and we're not ready yet. I think we're almost ready to go. And after him we have Tim Ocum, project manager, programmer and IT entrepreneur. And after him we have Dr. Wayne Kelly, who has been researching the field of parallel computing since 1990.

Well, in the meantime, I would say we're almost there. Well, you can start. You can start. I think we're about 10 seconds away. You don't need slides to start, do you? Okay, let me tell you a little bit about what's expected from the panel, whose topic is: which industry applications need parallelization today? And that's just a title to trigger a discussion, so I honestly welcome comments about everything, because without you guys the miniconf doesn't exist. So you can start thinking about what you want to discuss. As you can see, there are no seats in front of everyone, so there are no sacred cows; you'll have to pay attention. That's getting better. This is actually my secret plan to make sure that even the latecomers hear my entire talk. Okay. So, okay. There is a nice device, from Apple, or I don't know, from HP or whoever. It's a small projector that costs about $2,000 or $3,000. You just plug it in and off you go. Okay. I like strong-minded people. So, who wants to speak at the multicore miniconf 2012? You can start planning. Why not? Why not? What should we be talking about a year from now? What do you expect? Okay. Let's rephrase it and forget that. It's like this microphone. Right. It's just a PDF, by the way. Look at that. Any chance I could see it down here, or is that asking too much? Somehow I'm reminded of that quote in Jurassic Park where the girl gets on that PC and goes, "Cool, this is Unix, I can do this", but somehow we're still... Okay. Just an announcement about Sam's short lightning talk on hip-hop on TVV: there will be a full presentation on Friday at 11.30. Okay. We're good to go. Are you ready? Yes, we are ready. John, wait for the... Yeah, 20 minutes. I'm sorry. Yeah, that's all right. Actually, I was worried about running short, so that could be good.

So, look, thanks very much for the introduction. My name's John Williams. I sort of have two roles. Twenty per cent of my time, I'm a lecturer at the University of Queensland, here in Brisbane. I lecture operating systems, and some of my students are actually here. One of them is introducing me at my other talk on Thursday, so I'm expecting a bit of a roasting, because he's graduated now, so he's out of my power. The rest of my time I spend in my company, PetaLogix, and we do something which actually has nothing to do with multicore at all: we do embedded Linux on FPGA systems. That's what I'm actually reasonably expert at. So, that's my disclaimer: I'm not a multicore expert. My focus is on embedded Linux, and that's where I feel most comfortable. So, I'm a little out of my comfort zone here.
However, Nicholas did ask me, as someone who's got some experience in the local area of working with FPGAs. FPGAs are kind of a hot topic, I guess, in multicore and high-performance computing. So he asked me to come along and see if I could maybe throw in some ideas. So, maybe if I could get a show of hands: hands up if you know what an FPGA is. Hands up if you know how they work. Hands up if you've ever programmed one. Oh, shit, I'm in trouble. There are actually people in here who know this stuff. Anyway, I'll do my best.

So, here is my agenda. I am admittedly and deliberately being a little bit flippant in some of my topics here today. But what I'm hoping to do is maybe just expose you to some things that hadn't occurred to you before, both in terms of ways of thinking about CPUs and what some of the alternatives are. I'll tell you why I think CPUs are awesome, and then why they stink. And I'm going to do the same thing with FPGAs. And then, well, the question, you know: what to do, what to do? People want to do more computation. They want to do it more quickly, with less power, more productively, et cetera, et cetera. So that's broadly where I'm going today.

This, primarily, is why CPUs are awesome: because you can write this, compile it, run it, and it works. And it's really this whole stack of levels of abstraction, software models, libraries, virtual memory, all of the things that you're so familiar with that you don't even think about them anymore when you write software. So, in my mind, the point is that this is three lines. I mean, do you have any idea how many lines of code actually have to conspire to make this work and cause those, what, eleven characters to appear on a terminal? You have no idea how much is going on under the hood, but that level of abstraction lets you just forget all of that and write code. And it's this model that has led to the massive productivity in the software world over the last, whatever, 30, 40, 50 years. This is a multicore stream, so I thought I'd put up the multicore version. So, that's about the extent of my multicore ability. But you get the point, right? So, again, with a few extra lines of code (something like the sketch below), in theory, if I'm running on a multi-core machine, I'm now using all those cores. And this is great. I mean, it's not even an embarrassingly parallel program, it's just an embarrassing program, I think. But anyway, that's my point. So, again, I have this moment where I'm SSH-ing in from home to my workstation at the office and doing something, and every now and again you just have this moment where you realize how many incredible layers of abstraction you're actually tunneling through: you press the key here, and it goes through the wire, through SSH, blah, blah, blah, through the phone system, and up at the other end it echoes the character back. And really, it's abstraction in software programming models which has allowed that.

Why do CPUs stink? Part one. So, this is some data. It's obviously pretty old, and you see where the x-axis terminates, at the Pentium 4. It's from a guy, John Wozniak, at Berkeley. On the y-axis is MOPS per megahertz per million transistors. So, what's that? Millions of operations per second, but normalized with respect to CPU clock frequency and with respect to the number of transistors. All right? This is really a measure of efficiency.
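Backing up a moment: a guess at what that multicore hello world slide might have shown. The transcript doesn't preserve the actual code, so this is a minimal sketch, assuming OpenMP (built with gcc -fopenmp):

```c
#include <stdio.h>
#include <omp.h>

/* Hypothetical reconstruction of the slide: the classic three-line
 * hello world plus a few extra lines so that, in theory, every core
 * prints its own greeting. */
int main(void)
{
    #pragma omp parallel   /* run the next statement once per thread, one per core */
    printf("Hello, world from thread %d\n", omp_get_thread_num());
    return 0;
}
```

And spelled out as a formula, the y-axis metric (my notation; the talk gives only the units) is:

\[
\text{efficiency} \;=\; \frac{\text{MOPS}}{(f_{\text{clk}}/\text{MHz}) \times (N_{\text{transistors}}/10^{6})}
\]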
How much work per transistor are we getting out of each CPU? Well, how much work are we getting per transistor, plotted effectively over time, but really pegged at particular CPU generations? And what you see, basically, is that with every successive CPU generation, CPUs are getting less and less efficient. It's just being masked, because they're going faster and faster and faster, and there are all these poor, overworked engineers at Intel who spend their entire lives trying to basically hide this from you. So: hyperthreading, caches, multiple issue, et cetera, et cetera, right? Pretty much every innovation in CPU architecture for the last, what, 30 years, ever since von Neumann sort of thought it up, has really been there to hide from you, in some sense, this precipitous decline in efficiency when you consider work per transistor at the die level. I would really love to see where this graph goes with modern families, but I don't have that information. Do it as a log plot, if only I knew my history of Intel. I should get in touch with John and ask him if he's updated it. It would be interesting. But I think the point remains, even if it's not quite as dramatic. Don't go to sleep, please. No, we're back. Oh, I'm back. Are we back? There we go. No, we're back. It's all right. I'll just move faster and stop talking so much.

Okay, why do CPUs stink? Part two. So, here is a schematic of a kind of textbook CPU architecture. I stole this from simpleCPU.com; I don't know if it's still there, I pulled this slide together a couple of years ago. So, question: where does the computation happen? Where do numbers get crunched in this CPU? Call it out. It's the ALU, right? So, it's this thing in here. All that other stuff is just, pardon me, but it's crap that you've got to have around the ALU just to get two numbers to the inputs of the ALU so we can add them. And then we're going to go all the way back out through the data memory, blah, blah, blah. And when you look at it this way, you can see what a terribly inefficient computing machine a CPU really is. There are all these transistors, and they're not doing anything useful; they're just getting the data to where we can add stuff, multiply stuff, et cetera. And so, as I said before, all these other innovations in computer architecture, caches, multiple issue, hyperthreading, et cetera, are really just trying to, I mean, among other things as well, I'm being a bit simplistic, but really just trying to get around the inherent inefficiencies of this venerable architecture, which is, what, 60 years old? Von Neumann dreamed it up in the '40s or '50s. And this architecture has got us to where we are today, but now people are realizing that we've hit a bit of a wall and things are getting a bit difficult.

So, here's a die shot of an AMD Phenom II. I don't even keep up with desktop CPUs. Is this a modern CPU? Reasonably? Okay, fine. I just did a Google image search for, you know, "Intel die mask" or something like that. So, this is a six-core CPU, so we could play some games here. That stuff over on the right, that's probably going to be L2 cache. Those purple pads, you can see each of the six cores. There's a nice kind of symmetry: you see the four on the left, then they're mirrored on the right. This is sort of, you know, modern art, I think. But where's Wally? Where's the ALU in here? I actually don't know.
So, if anyone does know, can you tell me: where are the transistors in here that actually add stuff and subtract stuff and multiply stuff for the purposes of computation, where the simulations run, et cetera? All right, that sort of random stuff in the middle there. Right, right. Okay. Anyway, again, I'm just laboring the point that there's all this technology, all these transistors. They're all switching, they're all generating heat, there's capacitance on all these lines in the silicon that's slowing down clocks, et cetera, et cetera. And the ALU, the stuff that actually does something useful, is almost lost in the noise.

So, one of the observations that I offered to Vint Cerf, Dr. Cerf, this morning when he mentioned FPGAs was that the thing about an FPGA, or one of the things that makes them so powerful, is that the computation and the memory can be embedded together. I did a show of hands before, so most of the audience know what an FPGA is. It's basically a high-density array of very regular, structured (that's the key there: regular, very regular structure) logic elements, with a couple of bits of memory beside them. And so, at the lowest level, you program an FPGA at the bit level. The data path is designed and developed at that level. The hardware is intrinsically parallel: logic gates aren't sequential unless you deliberately build them that way. So, I had an example I was going to put up, an implementation of something like an FIR filter, where you want to do a loop of multiply-accumulate operations (a sketch of such a loop appears below). That's a loop, maybe partially unrolled, in a CPU implementation. But in the hardware implementation, you can just put down n multiply units and an adder at the end, and you can get a new result every clock cycle. And if you can clock that fast enough, you're going to do some very serious number crunching.

So, FPGAs have massive amounts of compute capability. And mountains of IO: all these modern FPGA architectures, you can plug them into various fiber-optic modules and InfiniBand, and they've got DDR3 memory controllers sitting there on the die, et cetera. So the compute potential in an FPGA is just tremendous. And people a lot more knowledgeable than I am have done various benchmarks and comparisons and so on. Basically, for certain classes of problems, I mean DSP (I'm more of an embedded guy), but for DSP, audio, radar, all that sort of stuff, an FPGA will just wipe the floor with anything else any day of the week. There's a reason that if you go and visit the radio telescope folks at CSIRO down in Sydney, they've just got racks and racks and racks of boards that are just tiled with FPGAs. It's very, very impressive hardware. It generates a lot of heat, but actually less heat than a rack full of Xeon servers would probably generate. But yeah, for those kinds of workloads, you can't beat an FPGA.

Here are some numbers, well, some funny numbers. How do I get rid of that? Exit. Oh, yeah, okay, thank you. Not that way. Is it going to go away? Yeah, there we go. So these are some results that were published in HPCwire fairly recently, I guess in the last couple of months. Disclaimer: everyone involved in this study works for Xilinx or is funded by Xilinx, right? I also receive funding and income from Xilinx, they're among my customers, so I should have disclosed that too.
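Here is that FIR sketch: a minimal C version of the multiply-accumulate loop being described, assuming float data (the talk doesn't show actual code at this point). On a CPU, these n multiplies run sequentially, perhaps partially unrolled; a hardware implementation can lay down n multipliers feeding adders and produce a new result every clock cycle.

```c
#include <stddef.h>

/* FIR filter inner loop: n multiply-accumulate operations.
 * x points at the n most recent input samples, coeff at the n filter taps. */
float fir(const float *x, const float *coeff, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += coeff[i] * x[i];   /* one multiply-accumulate per tap */
    return acc;
}
```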
So we've got gigaflops on the y-axis. On the x-axis, we've got a bunch of technologies: dual-core, quad-core, six-core, and so on. And I believe what they've plotted there for those CPU architectures is a sort of peak performance: maximum clock frequency, maximum throughput, no cache misses assumed, et cetera, et cetera. And then they've got performance results for various Xilinx FPGA families, which are basically getting bigger and faster with every product generation. And in the same way that peak performance numbers for a CPU are a little bit dodgy, because it's very rare that you can actually achieve that sort of figure, there's an equation which they've used to calculate these peak performance values for the FPGAs as well. I don't know the formula; it's related to how many multiply-accumulate blocks are in the FPGA, how much distributed memory is in the FPGA, the number of logic resources. And it's pretty much making the assumption that you can get every single logic element in that device crunching your problem, which is a pretty big assumption, but it's actually a lot more realizable than the certainty that in a CPU you will never have every single transistor crunching on your computation, because of all that junk around the outside that I pointed to.

The interesting thing about this as well, I didn't bring up the other graphs, is that this is the 64-bit results, but they've also got the same computation, or the same estimates, or however you want to put it, for 32-bit data widths and for 24-bit data widths. And the interesting thing is that I actually chose the kindest of these comparisons, because look particularly at the 24-bit data width: if you want to do 24-bit computation on a modern workstation CPU architecture, that's not a native bit width for the architecture. I'm not an expert in the whole SSE, MMX stuff; I don't know if you can configure those vector units into a 24-bit-wide mode or not. But what you see, or more accurately what they claim, is that with these sorts of non-standard word sizes you get an even greater performance disparity between a CPU implementation and an FPGA implementation, because you can tune the FPGA implementation to exactly 24 bits. You just throw down enough logic for a 24-bit by 24-bit multiplier and so on, so you don't have all these wasted bits at the high end. And there's an interesting thing you see when people are trying to port their code to FPGAs: they've written all their floating-point code, and it's all double-precision floating point, and they're very precious, they want all of those bits, and you say, where did that number come from? Oh, it's a 12-bit analog-to-digital converter. And they want to maintain 80 bits of floating point through this entire chain of computation, and the only reason they want to do that is because they basically just want to run diff on the results at the end and get no diffs, right? Those bits are actually meaningless, so an FPGA allows you to tune that computation.

Now, in the time available, obviously I've got a limited number of arguments I can make, so I'm now going to go against my own argument: well, why do FPGAs stink? There is a reason that they call it hardware. Are there any electronic engineers in the room? Hey, there we go.
So there is a reason that they call this stuff hardware: creating large, complex, high-performance digital circuits at the level of abstraction traditionally used for FPGAs is very, very difficult. Can you imagine, gee, I don't know, like an MPEG-4 decoder or something. Sorry, I'm going to throw all sorts of embedded examples at you, because that's just where I spend my time. But imagine creating an MPEG-4 decoder at a bit level of abstraction, designing at that sort of level and working up. And people do this. There are people who do this for a living, and that's why they do it once and then they sell it, and hopefully people can reuse it. Now, it's not that bad. There are still structured concepts of modular design and hierarchical design and so on, so a lot of the software development concepts that have allowed us to write million-line software applications and systems with some non-zero probability of success also apply in hardware design. But traditionally, the way these things... am I getting time issues, or are you just wondering? We're good? Okay.

So here's an example of VHDL hello world. I actually trimmed some stuff out of this. This is blinking a light; this is how you blink an LED on a board in VHDL, which is one of the main hardware description languages. And what this is actually doing is instantiating a counter, a binary counter that increments by one every positive clock edge, and halfway through that count it toggles an LED. I'm not going to spend time here; you can go back and have a look through it. But this is really no way to design a big, complex computation system. And of course, the other thing is that if you are in climate modeling, you just want to write climate models. You don't want to write VHDL, and your organization probably can't afford to hire the kind of hardware designers who can then sit down with you for six months, talk to you about climate modeling algorithms, go and implement them, and so on. So this is a really very unproductive way to work. So what I'm doing is setting up a problem, and the problem is the tools. For a long time, developer efficiency in hardware design has been poor.

So I want to rip very quickly through a couple of interesting architectures; a bit of a change of tack now. Some of you might have seen the news, this hit Slashdot a couple of weeks ago, I guess: an Intel Atom die with a PCI Express link to a little Altera FPGA. This, to me, is actually not a particularly interesting architecture from a compute perspective, and that's because the programmable fabric, which is all wonderful like I've been telling you, is stuck away from the CPU. It's got no sort of cache-coherent access; you can't really program that FPGA fabric to be a peer in any high-performance computation. Really, this is for embedded system designers. They want to put custom IO on the FPGA, and they want to run Windows Embedded or whatever, people do that sort of stuff apparently, embedded Linux hopefully, MeeGo for example. So it's an interesting architecture in the embedded space, but actually not going to be very interesting at all in the high-performance computing space, I don't think. But it's interesting to see that Intel are getting in the game. This next one was an interesting one. The concept is a couple of years old now, but it's still around, and various commercial vendors have picked it up.
So Xilinx figured out how to make one of their high-end FPGAs talk the Intel front-side bus protocol. What they did is they got a big, great FPGA and put it on a board with the, whatever, thousand pins or something that you need to sit in a Xeon socket. So you get an Intel quad-socket Xeon motherboard, you put a couple of Xeons in there, you leave one socket empty, and instead you plug in an FPGA. Now, the reason why this is really interesting is that the front-side bus is a fully cache-coherent tap into the primary system memory. So now you can create FPGA-based hardware designs which are just fully fledged computational peers. This FPGA design can just go and fetch stuff from memory, and it knows it's getting it from the local cache if necessary. So this starts to get really interesting from a high-performance computing perspective, because you're not stranded all the way at the end of a PCI Express link or something like that; you're right there in the core of the computing environment. You can buy this stuff from a couple of vendors who picked it up and pushed the model a bit. And Intel have a library, I think it's called AAL or something, designed for this kind of model: helping people to write software that can communicate with co-processors.

This one's very interesting to me. This doesn't actually exist yet, but it will. This is kind of Xilinx's next big thing; more interesting from an embedded perspective, actually, that's its target market. This is a dual-core Cortex-A9 system-on-chip with some FPGA fabric hanging off the back of it. Those of you who've played with Xilinx before will remember they've put CPUs on their FPGAs before. This is a completely different beast. This guy boots Linux, or can boot Linux, or will boot Linux, as soon as you turn the power on, so you don't even need to configure the FPGA before it can do something useful. For the interconnect between the CPU and the fabric, Xilinx settled on the AXI bus protocol, which is ARM's bus protocol. And so again, you've got this high-bandwidth, cache-coherent fabric interconnect, so you can implement high-performance computing stuff in the FPGA fabric, and it's just a peer with the CPU. It's on that same level; it's not stuck away on some high-latency data path. Almost done. And there are lots of others: for years you've been able to buy big cards with lots of FPGAs and stick them on a PCI bus, et cetera.

So, just to wrap up, and maybe even make a point: VHDL is no way to build the kind of systems that people actually want to use this for. When you've got scientists and engineers, people who actually have legitimate, big computational problems to solve, they are not going to be designing their systems in VHDL. So there's a topic called high-level synthesis. The Holy Grail is: just take my dusty-deck C program, turn it into hardware, and make it run n times faster. And this has been a Holy Grail for a long time, actually, and there have been a lot of tools around over the years; some of them are very weak, very tightly constrained. What does a function pointer mean to an FPGA? That's my question for you to ponder. And often the only people who could actually get anything good out of these tools were the people who wrote them, which is not very useful. There is some serious commercial money in this space right now, because they've realized that this is useful not only for designing FPGAs but also ASICs.
So there's a bunch of companies there; there have been some mergers and acquisitions over the last 12 months. We're talking serious dollars here: Synfora, that's about $100,000 for the FPGA license. So we're not going to see that stuff anytime soon. There are a few open-source ones. The only one that really seems to be alive is C-to-Verilog, and it's licensed under the GPL. It uses LLVM as the back end. For anyone who's looked at LLVM, I think that's some of the coolest technology to come out in a while; we're going to see a lot of interesting things there. C is a terrible language for expressing parallelism. I was really pleased to come after the previous talk, which was about all these functional languages, because it turns out, don't ask me why, I'm not a theorist in this area, that functional languages map really nicely onto circuits. Inputs and outputs, right? Single assignment and so on. So this is a really good opportunity, but you'd probably recognize that you're still pushing shit up a hill trying to get anyone to actually program in a functional language. So maybe there are some dual opportunities there. There are also graphical tools, right? National Instruments, et cetera. A lot of domain-specific tools. For some of these problems you can create a generic architecture which you then customize to a particular instance, but a general-purpose framework is actually really difficult.

Some other thoughts. If I were going to put money on where this is all going: OpenCL is pretty interesting. It's kind of a generalization of CUDA, I guess, in a very simplistic way, but it's intended for heterogeneous computing environments, and this is the key: CPUs, FPGAs, GPUs, et cetera. So maybe OpenCL, using these high-level synthesis tools to convert the C-like kernels down to FPGA implementations, and then bringing that into a CUDA-like framework (a sketch of such a kernel appears below). That sounds like it could make sense. I'm not going to talk about that further. FPGA versus multicore: it was a deliberately false title. It's not a question of versus; it's going to be "and". These will be part of the answer, and I think the key really is the tools. But while those tools remain so phenomenally expensive, there's going to be limited adoption. So maybe, if you've got some research students you'd like to push in a direction, get them onto some of the open-source stuff. That's my sprint done.

We won't see FPGAs in Dick Smith's? FPGAs where? In Dick Smith's. I don't know, maybe you will. Yeah, sure, in a finished product. There are FPGAs in LED-backlit TVs; they're doing the dynamic LED backlighting stuff. John, I'm sorry to cut you off. Please, Tim, come and set up your stuff. And please come over here, John, if there is a question.
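To make the OpenCL point above concrete: a minimal sketch of the kind of C-like kernel being described, an elementwise multiply-accumulate over a vector. Nothing like it appears in the talk itself; the kernel name and signature are hypothetical, but this is standard OpenCL C, the sort of thing a high-level synthesis flow could in principle map onto an FPGA's multiply-accumulate blocks.

```c
/* Hypothetical OpenCL C kernel: elementwise multiply-accumulate.
 * One work-item per element; on an FPGA target, a synthesis tool
 * could lay down one MAC unit per parallel work-item. */
__kernel void vmac(__global const float *a,
                   __global const float *b,
                   __global float *out)
{
    size_t i = get_global_id(0);  /* this work-item's element index */
    out[i] += a[i] * b[i];        /* maps naturally onto a DSP MAC block */
}
```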