Good afternoon everyone, am I audible? My name is Ritinder, and I lead this project, LEAP, in a start-up sort of environment; this is about Logic Engines for Accelerated Processing. I will be talking a little bit about that in the context of big data.

Just to get some interest going: you might have heard that about a year back, Microsoft announced an experiment for their Bing search engine in which they used devices called FPGAs to accelerate the page-ranking algorithm used in Bing. Then Intel announced that it is coming out with a Xeon chip with an FPGA device in the same package. And just about last month, Intel announced that it is buying Altera, a company that makes FPGAs, for almost 17 billion dollars. So, what is going on here? Intel invests billions of dollars in making microprocessors, and it is very successful there. Why, all of a sudden, does it need to get into FPGAs? What are FPGAs, and how are they useful?

To get some idea of what is going on, before taking a look at FPGAs, let us take a look at a microprocessor, the kind of thing that Intel makes. Before that, a quick outline: I will start with a mini tutorial on what FPGAs are. Then, since this is a short talk, I will give a taste of what FPGAs are and how they are useful for big data. Finally, I will give a more concrete example, something we have been working on, where I will show what kind of speed-ups one can obtain by doing the computation in hardware on these FPGA devices compared to software.

So, what does a modern microprocessor look like? This is a die photograph of an AMD Bulldozer, but Intel chips look similar. The first thing that leaps out at you is that about half of the die area is taken up by cache: this is the L2 cache, and if you look closely you can see the L1 and L2 caches. If you look at the history of die photographs, you will see that the cache size keeps increasing, and the reason is something called the von Neumann bottleneck. What is essentially happening is that the speed of main memory, the DRAM chips, is not increasing as fast as that of the microprocessor. So you need all this paraphernalia of the cache hierarchy, branch prediction buffers, and all sorts of things just to keep the parts of the chip that do useful work busy.

So, what are the parts of the chip that do useful work? One is the floating-point unit, which you can see down here, and the other is the arithmetic and logic unit. It is difficult to find on this die photograph, but if you look closely you can see it here: these are the integer datapaths. If you consider these to be the useful units, one can make the case that barely around 10 percent of the die, and consequently of the transistors, is doing "useful" work; the rest of the chip exists just to keep that part busy. That does not seem like a very efficient way of doing things. What I am highlighting here is one of the significant inefficiencies in the way a microprocessor does computation, and this comment is valid not only for big data but for whatever computation one does: only around 10 percent of the area is actually doing the useful work.
Coming specifically to big data: a significant number of big data applications, if not the majority of them, are streaming applications, in the sense that you have lots and lots of data that you feed through, and each data item is typically processed just once. Search is a very good example: in general, you look at each byte of your data just once. For such applications the cache itself is not very useful; in fact, I would go as far as saying it is almost useless. And there are many big data applications that do not have much floating-point computation. The point is that for at least some big data applications, a microprocessor is a significantly inefficient way of doing computation, and that trend seems to be getting worse.

So, what does an FPGA offer us instead? We will get into the details, but what I am trying to show here is an intuitive idea of what an FPGA's structure looks like. There is a huge chip, you can configure what each of these logic elements does, and you can get 10 to 100 times the performance. The marketing slogan is: hardware-like speeds with software-like flexibility.

To give you a taste of it, let us look at the various building blocks. The key building block of an FPGA is called a lookup table. It is essentially an 8-to-1 MUX whose data inputs are connected to SRAM cells (you can think of these as flip-flops). Suppose I write these values into those cells, and I drive the three select lines with my three inputs; this is the output. The only time I will get a 1 at the output is when all three inputs are 1. So essentially what I have done is configure a 3-input AND gate. And if I made that bit a 0 and put 1s in all the other cells, I would have a 3-input OR gate. That is the essence of FPGA logic configuration: by writing appropriate configuration bits into the SRAM cells, you can make whatever logic gates you want. In current FPGAs the MUX is 64-to-1, so you can make 6-input gates.

The other elements are switches. Here again you have a couple of MUXes: you have two inputs and two outputs, and by writing appropriate configuration bits you can make it behave like a 2x2 crossbar switch and make arbitrary connections. There are millions of such switches in an FPGA, and by writing these bits you can set up whatever connectivity you need.

The third block: the lookup table gives you combinational logic, but you need sequential logic also. So you add a flip-flop, and the output can be taken either via the flip-flop or directly, controlled by a configuration bit. That way one has sequential logic as well. This unit is called a logic block.

Now, coming back to the layout: an FPGA consists of a 2D array of these logic blocks, embedded in an interconnection network of lots and lots of wires with lots and lots of switches in the switch matrices. By writing appropriate configuration bits, one can make it do whatever the application requires, and this will be much faster because it operates at the low logic level. The array is surrounded by I/O blocks. To give you some idea of the size of this thing: what I have shown here is a 3x3 array.
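To make the logic block concrete, here is a minimal Verilog sketch of a 3-input LUT with the optional output flip-flop, written under the assumptions above; the module and port names are my own for illustration, and real vendor primitives look different. The eight configuration bits play the role of the SRAM cells at the MUX data inputs: CONFIG = 8'b1000_0000 gives the 3-input AND gate from the example, and 8'b1111_1110 gives the OR gate.

```verilog
// Hypothetical LUT + flip-flop "logic block" (illustrative only;
// actual FPGA primitives differ). The 8 configuration bits stand in
// for the SRAM cells at the data inputs of the 8-to-1 MUX.
module logic_block #(
    parameter [7:0] CONFIG     = 8'b1000_0000, // truth table; this value = 3-input AND
    parameter       REGISTERED = 1'b0          // 0: combinational out, 1: registered out
) (
    input  wire       clk,
    input  wire [2:0] in,   // the three inputs drive the MUX select lines
    output wire       out
);
    // The 8-to-1 MUX: the inputs select which stored configuration bit appears
    wire lut_out = CONFIG[in];

    // Optional flip-flop for sequential logic
    reg ff = 1'b0;
    always @(posedge clk)
        ff <= lut_out;

    // Another configuration bit chooses the direct or the registered path
    assign out = REGISTERED ? ff : lut_out;
endmodule
```

Instantiating it as, say, logic_block #(.CONFIG(8'b1111_1110)) u_or (.clk(clk), .in({a,b,c}), .out(y)) gives the OR gate, and setting REGISTERED to 1 routes the output through the flip-flop, which is exactly the choice the configuration bit makes in the real logic block.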
A modern FPGA has about a million such logic blocks, so you can imagine this as a 3000x3000 array of logic blocks. Sprinkled among these logic blocks you also have RAM blocks; each of these can hold around 32K of memory, and they are dual-ported, so you can make really fast finite state machines and the like. You also have DSP units, which are multiplier-adder-accumulator units. On one of the top FPGAs, each of these does 25x18-bit multiplies, and there are a few thousand of them on the device. So FPGAs are quite powerful beasts, so long as one knows how to program them.

Let us look at programming them. The process is essentially the same as designing for an ASIC: you start with the logic design in an HDL, synthesize it to get a netlist, and do place and route. The last step is that instead of generating fabrication masks, as you would for an ASIC, you get the configuration bits that you load onto the FPGA. The whole flow takes anywhere from a few minutes to several hours. To give you an idea: in the HDL you would specify a full adder as shown here in Verilog. Synthesis converts it into a gate-level netlist: this gives you the adder output, and this gives you the carry-out. When you do place and route, you get the configuration bits as a bit sequence, and finally you load them into your lookup tables. In fact, if you look at the contents of these two lookup tables carefully, they do implement these two logic structures. That is essentially how you program an FPGA.
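Here is roughly what that full adder looks like in Verilog; this is my reconstruction of the slide's example rather than the exact code shown. Each output is a function of the same three inputs, so the design fits in two of the 3-input LUTs sketched earlier: taking {a, b, cin} as the LUT index, the configuration bits would be 8'b1001_0110 for the sum and 8'b1110_1000 for the carry-out.

```verilog
// Full adder: the starting point of the FPGA design flow
// (a reconstruction of the slide's example, not the exact code).
module full_adder (
    input  wire a, b, cin,
    output wire sum, cout
);
    assign sum  = a ^ b ^ cin;               // maps to one 3-input LUT
    assign cout = (a & b) | (cin & (a ^ b)); // maps to a second 3-input LUT
endmodule
```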
Now, moving on to big data: I piqued your curiosity by talking about Intel and Microsoft, so let's talk about them. First of all, let's look at how you would use an FPGA today. Typically it comes on a PCI Express board that slots into a standard slot on your motherboard, and via the I/O hub chipset it talks to the CPU, which is connected to the DRAM. This is very efficient, and bandwidth is not a problem at all; if you look at the PCI Express standard, you have lots and lots of bandwidth. This is the way things currently work, and they work pretty well.

What Intel is trying to do is the next step: they have announced that they are going to come out with a single package containing a Xeon and an FPGA connected by the Intel QuickPath Interconnect, and this FPGA will have complete access to the entire Xeon memory hierarchy. They have announced it and released some boards to the FPGA R&D community, and if things go well, next year one expects some commercially available products based on this.

[In response to an audience question about QuickPath versus HyperTransport:] Sorry, sure. As far as I understand, they are both essentially the same idea: AMD came out with HyperTransport first, and then Intel, using the engineers who had worked on the DEC Alpha, that group in Massachusetts I think, developed QuickPath. Both are point-to-point links for high-speed bandwidth. Intel claims they get 2x performance for this Xeon-FPGA combo just because they are using QuickPath here. It reduces latency and really simplifies deployment: you don't need a PCI Express board.

Moving on to what Microsoft is doing, I'll summarize it here. They have developed a mini board on which they put an FPGA with around 8 GB of RAM, and they use a rack of 48 servers. These are standard 1U servers, and on each of them they put one of these tiny daughter cards. They connect the cards in a 6x8 2D array using high-speed point-to-point SAS links, each, I think, around 10 Gbps, to connect a pair of FPGAs. They have deployed this in an experiment on around 1,600 servers, and what they were able to show is a doubling of throughput with only modest increases in total cost of ownership and power. So they have been effective in demonstrating that, for data-center needs of high performance, low power, flexibility, and low cost, FPGAs can play an effective role. From what one hears, Microsoft is now expanding these efforts.

Now I'll try to give you a brief idea of the power of these devices, going back to a little bit of theory. Look at a regular expression: this one matches any string that is a sequence of a's and b's with the last two characters c and d. One can convert this into an NFA, and I'll show you the power of FPGAs by showing how easy it is to map this kind of logic structure directly onto the device. All you have to do is replace each state with a flip-flop and put in AND gates for the transitions, and you get a structure that will do the matching. By the way, I have put in these colored links, since I didn't want to clutter the presentation; they point to papers and other material in which you can find more references, so you can click and download for more details.

I'll skip this, as I'm running out of time, and just talk a little bit about what we have done. We have developed an XML processing solution. It uses something called tree automata, which are a better fit than pushdown automata for this, and we use this board; I'll skip the processing details. This is a schema validation example, and we compared it against the fastest software we could find, running on an Intel Xeon. This is the result we get: the x-axis shows file sizes from 64 KB to 4 MB, and the y-axis shows throughput. Software runs at around 300 megabits per second, and hardware goes up to around 3-plus gigabits per second, so we are getting around a 10x speedup using these FPGA devices. And we are utilizing only this much of the resources on the FPGA; this is the current FPGA we are using, and you can get FPGAs as big as this.

To conclude: you can get dramatically more efficient computation, but at present one needs a certain amount of expertise in the area to leverage it. Devices and tools are improving, though, and these devices have the potential to be at least one of the workhorses for big data processing. I'm sorry about going over time. Thanks. Any questions?