Our highly esteemed colleague Uli Drepper says this is going to be really groundbreaking, and it's my great pleasure to introduce to you Ahmed Sunala, one of our PhD interns in the Collaboratory, to talk about FPGAs. Thank you.

Good afternoon, everyone. Thank you for taking the time to attend this talk. My name is Ahmed Sunala. I'm a PhD student at BU. I've been working with FPGAs on and off for about six years now. A couple of years ago, we started looking at FPGAs in the cloud and the role they have to play in moving the technology forward. It started off as my dissertation topic, but the more we explored it, the more we realized this is something much bigger than a PhD thesis: FPGAs have a major role to play in the next generation of cloud computing. And my aim today is to convince you of the same, not just to convince you of the importance of FPGAs in the cloud, but also to convince you that you, the open source community, have a very large role to play in this. Because if you look at 27 years of Linux, we've taken it from somewhere down there to somewhere up there, thanks to the contributions of the open source community. FPGAs are the same: they have the same significance, and they stand to reap the same rewards and benefits from an open source development model. So, in this talk, we're going to talk about the advantages that FPGAs have to offer.

But before we get into that, let's start from the first step: what does the cloud look like today? The modern cloud has a network; it could be Ethernet, it could be InfiniBand, they're all good. It has a CPU that talks to that network, or rather, a bunch of CPUs that talk over this data center network. And then you have accelerators such as GPUs and FPGAs, and storage, that hang off the CPU. So the CPU is at the heart of everything you do. It implements all your critical functions, all your critical operations, and everything you would want to get out of a data center has to happen via the CPU. And that was great for many years, but not anymore. Because Moore's law is slowing down, the rate at which you can pack more transistors into a chip is decreasing, which means you have to maximize the efficiency with which your transistors operate. You cannot afford idle time for your transistors, because every cycle spent idle, you're wasting a precious amount of compute resource. Similarly, Dennard scaling has ended. It did not take leakage current into account, and that means your transistors, your chips, while they're getting smaller, cannot continue to operate at ever-increasing frequency. Let me throw some numbers at you. Say we had continued to increase the frequency of our CPUs beyond four gigahertz: today, we would need a nuclear reactor to power a single CPU. Continue further, and we would need the sun to power our CPUs. That is the massive amount of power a CPU would consume if you kept increasing the frequency.
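(As an aside, the physics behind that power wall is easy to state; this is the standard CMOS power relation, not something specific to this talk. Dynamic power roughly follows

    P_dynamic ≈ α · C · V² · f

where α is the switching activity, C the switched capacitance, V the supply voltage, and f the clock frequency. Dennard scaling assumed V could keep dropping as transistors shrank, holding power density constant; once leakage current stopped V from scaling any further, every additional increase in f became a near-linear increase in power, which is why clock speeds stalled around four gigahertz.)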
So, because we cannot do that much with a single CPU, we need a cluster, a data center of CPUs. And that's where a new set of problems arises, because things have to talk to each other: these CPUs have to communicate. And if you remember Amdahl's law, a million CPUs does not give you a speedup of a million times, because there's a communication bottleneck, because some of the work moves sequentially. (As a refresher, Amdahl's law says the speedup on N processors is 1 / ((1 - p) + p/N), where p is the fraction of the work that can be parallelized; if even 1% of the work is serial, no number of CPUs gets you past a 100x speedup.) Unless we can effectively mask those latencies, unless we can effectively cover that up, we're not going to get the speedup for which we're dedicating all this resource.

So what is the solution? Well, if you have a lot of time, say a few years; if you have a lot of people, say a couple of dozen highly qualified engineers working for you; and if you have a lot of money, which, for 10 nanometer, is about $1 billion (somebody's got that in their pocket right now, I assume); then you can make a custom chip, like Google did with their Tensor Processing Unit. And that's all well and good, because nothing is more efficient than an ASIC: something designed specifically to do a certain function makes maximum use of its transistors. But this is not a one-time investment, because when you make an ASIC, when you make a chip, you commit to a certain technology, you commit to a certain way of doing things. And if you have to change something, you have to go all the way back and make a new chip. Google came up with the TPU, they wrote a paper on it, but before they even got to releasing it publicly, they had to go back and make the TPU 2. And we don't know how long it will be before they have to go back and make a TPU 3, because they have to stay ahead of the curve, and the technology is advancing at a rapid pace. A second solution, one that requires less money, less time, and fewer people, is off-the-shelf components, like GPUs and FPGAs, both of which make excellent use of the available transistors. In this talk, we're going to focus on FPGAs, because FPGAs serve a wider pool of applications. And companies like Microsoft have already started investing in this; they've already started putting FPGAs in their Azure cloud. Today, you can boot up a VM, attach an FPGA to it, and get much better network performance than a standard VM. You can't put anything on the FPGA yourself for now, it's just a black box, but you can still use it to improve your network performance.

So, before we go any further, before we look at how the FPGA is changing the world, we need to understand what's happening inside the FPGA. It is critical to know how the FPGA does what it does; everything that follows will then become very intuitive. Inside the FPGA, we have the basic building block, and we've got millions of these, actually many millions of these: they are called configurable logic blocks. This is what implements your basic digital logic functions. These logic blocks are made up of lookup tables, and if you look inside one of them, let's take the example of a three-input lookup table. If it's a three-input lookup table, you have eight possible answers. So you have eight memory cells, each holding a specific answer that you can select. You have a configuration module which loads these ones and zeros into the cells when you configure the FPGA, and then, using your three inputs, a network of muxes selects one of those eight possibilities and puts it at the output.
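To make that concrete, here is a minimal C sketch of what a three-input lookup table does conceptually; the names and structure are mine, purely for illustration, not any vendor's actual primitive.

    #include <stdio.h>
    #include <stdint.h>

    /* A 3-input LUT is just 8 configuration bits selected by the inputs.
       Configuring the FPGA means writing these 8 bits; after that, any
       3-input boolean function is a single table lookup. */
    typedef struct {
        uint8_t config; /* bit i holds the answer for input pattern i */
    } lut3;

    static int lut3_eval(const lut3 *l, int a, int b, int c) {
        int index = (a << 2) | (b << 1) | c;  /* the mux-tree select lines */
        return (l->config >> index) & 1;
    }

    int main(void) {
        /* Configure the LUT as a majority function: output 1 when at
           least two of the three inputs are 1 (patterns 3, 5, 6, 7). */
        lut3 majority = { .config = 0xE8 };  /* binary 11101000 */
        for (int i = 0; i < 8; i++)
            printf("%d%d%d -> %d\n", (i >> 2) & 1, (i >> 1) & 1, i & 1,
                   lut3_eval(&majority, (i >> 2) & 1, (i >> 1) & 1, i & 1));
        return 0;
    }

Loading a different byte into config gives a different three-input function, which is exactly what the configuration module does across millions of these blocks.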
Other than the configurable logic, we also have ASIC blocks, specialized hard circuits, because there are certain functions that do not change across use cases, things like on-chip memory, and fast multipliers and dividers. Those are important almost regardless of what you're doing, and so they are implemented as a few thousand hard-coded blocks which perform much better than the configurable logic. Then you have, finally, the switch boxes, which route connections between all the modules within the FPGA to effectively create your architecture.

Let's do an example. Say we want to implement this particular function on the FPGA. We start by looking at all the computations that need to be done. There are adders in there, there's a multiplier, there's a comparison being done, and if the result is true, you want to invert the bit, to get zero at the output. So let's start by placing all these components in their respective blocks. The adders, the comparator, and the invert operation are low complexity, so we can do them using lookup tables. The multiplier is far more complex, so we put it in an ASIC block; we could put it in the configurable logic, but the ASIC block gives much better performance. Once that is set up, we route the network between them, starting from an ASIC memory block and ending in another ASIC memory block. And that is very important, because you're not doing load, process, then store back into the same memory; that would be sequential. No, you're doing loads, loads, loads, and at the same time, in a separate place, you're doing compute, compute, compute, and in a separate memory block you're doing all your stores. So all of these operations are happening at the same time, in parallel. And if at any point you want to use the output data as input, you can just swap those memory blocks; you have a configurable network that you can work with. If you look at one of these connections, we see that it's not one wire, it's a lot of them. FPGAs can move data at extremely high bandwidth between these logic blocks. That means you have pipelines operating at very high efficiency, because almost every cycle they have data available to process; they spend minimal time waiting for data to arrive.

So we saw that hardware programming is not like software. It's not a bunch of instructions that execute sequentially; it's a lot of things happening at the same time on the same chip. And to really drive home the idea that hardware programming is different from software, let's take an example. Say we have a set of test values and a reference value, and we want to find how many times the reference occurs. In a software environment, we load a value into memory, do the comparison, and if it's true, we go out, fetch the value of the counter, increment it, and store it back. And this continues to loop until the whole process is done. If we had another set of test values, it would be processed sequentially after this: if it took n seconds to do the first set, it would take another n to do the second, and another n to do the third. If we do this in hardware, we load everything in from memory at the same time, process it at the same time, and get the result out. If you look inside the FPGA, the way this happens is that you assign a comparator to each input value, perform all the comparisons simultaneously, then add up the results and get your final answer.
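As a rough software analogy (my own illustration, with made-up names), the two approaches look like this; the second version is what the FPGA effectively synthesizes, with every comparison becoming its own comparator feeding an adder tree, all evaluating in the same clock cycle rather than as a sequence of instructions.

    #include <stdio.h>

    #define N 8

    /* Sequential, CPU-style: one comparison per loop iteration,
       plus a counter fetch, increment, and store each time. */
    int count_sequential(const int *vals, int ref) {
        int count = 0;
        for (int i = 0; i < N; i++)
            if (vals[i] == ref)
                count++;
        return count;
    }

    /* "Hardware-style": every comparison written out explicitly.
       On an FPGA, each term is its own comparator and the sum is
       an adder tree, so all of it happens in parallel. */
    int count_parallel(const int *v, int ref) {
        return (v[0] == ref) + (v[1] == ref) + (v[2] == ref) + (v[3] == ref)
             + (v[4] == ref) + (v[5] == ref) + (v[6] == ref) + (v[7] == ref);
    }

    int main(void) {
        int vals[N] = {3, 7, 3, 1, 3, 9, 3, 2};
        printf("sequential: %d\n", count_sequential(vals, 3));
        printf("parallel:   %d\n", count_parallel(vals, 3));
        return 0;
    }

On a CPU the second version is still mostly sequential instructions; on the FPGA the eight comparators really do fire in the same cycle, which is exactly the gap being described here.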
And if we want to process more test values, we don't have to wait for this batch to get done. We can have it move on to the comparison stage and get a pipeline going. So while the first set is in the comparators, we can load up a new set of values and start working with those. And when that first set moves on to the adder, the second set moves into the comparators, and we load up yet another set of values. So we're working in a flow-based model here, where data is continuously being streamed in and continuously being processed: you're doing loads all the time, you're doing stores all the time, you're doing compute all the time, things are happening in parallel. And that is very important. That is where the performance benefits of an FPGA come from.

Now, before I go any further, this was a lot to take in, so I'm going to take a small pause. If you have any questions about how the FPGA works, I will be happy to take them now, because I want us all to be on the same page going forward, when we start to see how this affects the cloud. So if you have any questions, I will take them.

I'm sorry, could you repeat? Is there like a figure of merit? It could be, it is application-specific. So the question was: is there a fixed figure for FPGA speed with respect to CPU speed, or is that a variable figure, if I understood correctly? There's even more to that. With FPGAs, it's not the case that any way of programming them will give the same performance. It's about how you lay things out; mapping onto the FPGA itself is a very complicated mechanism. We're constructing a synchronous circuit on the FPGA which runs at a certain frequency, and that frequency is the one you would want to compare. But it depends heavily on what the individual operations you're performing look like and on the level to which you break them up. There's always the tradeoff, with CPUs and GPUs as well, between latency and throughput: a highly pipelined, high-throughput machine can scale up its frequency, but it has higher latency. You see this in CPUs as well, but overall there's a lot of generic logic in an FPGA which takes time, so this is not like an ASIC. Take the ASIC as the ultimate benchmark here: we can design ASICs to run at extremely high frequencies, as we do with CPUs, which are ASICs. But an FPGA has to be generic enough to work reliably in many situations, so the base frequency of an FPGA is usually quite low. Low-end ones run at, say, 80 megahertz or something like that; high-end ones can do a few hundred megahertz. But the thing is, unlike on the CPU, you're doing perhaps 30 operations at the same time, or 30,000 operations at the same time. So you cannot compare them directly. All right, yes. What open tools are you using? So currently we're using the Intel tool flows, but I will get to that towards the end of the talk; there is a section on what we're doing to program the FPGA and how we're doing it.

So what does the future cloud look like? We have the FPGA moving to the heart of the data center. And because all traffic between the network and the CPU, and even the storage or the GPU, is now going to go through the FPGA, we call this a bump-in-the-wire FPGA. And this has a number of advantages, so many that I probably can't cover all of them in this talk. So I'm going to pick three and talk about them in a little more detail. But trust me, there's a lot more. So let's look inside the FPGA.
One benefit we get is a reduced network stack. An application running on one FPGA can communicate with an application on another FPGA at very low latency. You can move data around very quickly, which goes a significant way toward reducing the communication bottleneck between your applications. And that is besides the point that the FPGA is an accelerator, so the application itself runs faster on the FPGA than on the CPU.

Next, say you have something trivial you're performing on the CPU. It takes hundreds of lines of code to do, and then, just for fun, let's throw in interrupts: your performance is going nowhere. You could continue with that, and that gives you poor performance for your cloud, or you could offload it to the FPGA, doing something like, say, memcached. You stream in keys and values and process them using the flow model we talked about before, and all that processing overhead almost disappears. You're doing things at very high speed and processing data at very high rates now.

Now let's look at something a little less trivial. When we work in a cloud environment, we can't hand over nodes directly to new tenants. We first have to attest the firmware, because that is one point of attack that can be compromised. To do that, we typically load an operating system onto the CPU, then an attestation client on that operating system, and that is what attests your non-volatile storage. If you have an FPGA, because it has general-purpose I/O pins, it can interface with that non-volatile storage directly and attest it without bringing up an operating system. And it can do that not just for the CPU firmware but for the firmware of other devices, such as your power supply, which is also a point of attack. And because it happens without bringing up an operating system, this is more elastic than the current way of doing attestation.

So, now that we've seen at a high level what an FPGA is, what it does, and where it gets placed in the data center of the future cloud, let's look at what's happening inside the FPGA. A typical bump-in-the-wire architecture is composed of a network stack that interfaces with the data center network, plus a simple switch. That switch could be hard-coded, could be round robin: something very trivial that just makes sure data can move toward one of the many destinations it can go to. One destination is through another network stack to the CPU. Another possible destination is provider offloads, such as the attestation client we saw before, or a memcached running on the FPGA. Another is user logic, some application that you placed on that bump-in-the-wire FPGA; that's another place data can move to. And this is great if you're looking at a single-tenant FPGA or a private cloud. But in public clouds, in a multi-tenant environment where multiple people reside on a single device, this is going to fail, because there's nothing protecting users from providers, users from each other, or anybody from an external threat. And so what we need is not just this simple architecture, but something more advanced, something able to provide more functionality, for this to start to work in a cloud.
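For a sense of how trivial that baseline switch is, and why it offers no protection on its own, here is a hedged C sketch of a hard-coded destination dispatch; the packet field, port numbers, and destination names are hypothetical, purely for illustration.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical destinations hanging off the bump-in-the-wire switch. */
    enum dest { TO_CPU, TO_PROVIDER_OFFLOAD, TO_USER_LOGIC };

    /* A hard-coded switch: no permissions, no isolation, it just looks
       at one header field and forwards. In hardware this is a handful
       of LUTs; in a single-tenant FPGA it is all you need. */
    static enum dest simple_switch(uint16_t dst_port) {
        if (dst_port == 11211) return TO_PROVIDER_OFFLOAD; /* e.g. memcached */
        if (dst_port >= 50000) return TO_USER_LOGIC;       /* tenant range  */
        return TO_CPU;                                     /* default path  */
    }

    int main(void) {
        uint16_t ports[] = {22, 11211, 51000};
        for (int i = 0; i < 3; i++)
            printf("port %u -> %d\n", ports[i], simple_switch(ports[i]));
        return 0;
    }

Nothing in this dispatch asks who is allowed to reach what, which is exactly the gap the shell has to close.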
We call that the shell. The shell involves replacing that simple switch with something smarter, something with the capability of enforcing permissions and security protocols, something that can ensure one user cannot carry out a malicious attack on another user, or for that matter on the provider logic in the FPGA. Similarly, we also need programming logic. We don't want a situation where, if one user wants to program their assigned piece of the FPGA, everybody else has to wait while that operation is carried out. We want everyone to continue normally while that one piece of logic over here is being programmed. And so we need some method of getting in there and programming the user logic, or even the provider logic.

One last thing on this topic: OpenShell is the new consortium we are proposing for bump-in-the-wire shell development for FPGAs. Like I mentioned, this has the same importance, the same significance, the same rewards, the same benefits from an open source development model as Linux did. And as you've seen thus far, it's all about flexibility: the idea that you can create hardware that suits what you want to do, that there exists a marketplace where cloud providers and cloud users don't have to compromise on how they want to build their applications and services. And that cannot happen in a proprietary environment, because if you have a proprietary shell, and there are a lot of them on the market right now, you are forced to change your application, forced to compromise on flexibility in order to make use of whatever shells are available. So it is very important that we open source the development of these shells, so that everyone can start to take advantage of FPGAs in the cloud.

So, last part. As part of my internship at Red Hat this summer, we worked on taking the first steps in this direction, creating a shell that supports open source development. For now, we worked in the context of bare metal, but the things we talk about here apply to virtualized environments as well. So what does our shell look like? We have the network stack for the data center network. We have the smart switch. We have another network stack that talks to the CPU over the NIC. For the cloud provider functions, we use soft cores. Soft cores are tiny CPUs that you synthesize on the FPGA using the configurable logic blocks. They take very little resource, so you can have a lot of them, and the cloud provider can implement a different function on each soft core: one for attestation, one for something else. So why did we use soft cores? Because soft cores can be programmed in C. There's a lot of legacy software out there that was not developed overnight and cannot be ported overnight; it takes time to make the transition from software programming to hardware programming. We wanted people to be able to use this now, today. And because of that, we support a model through which you can take your high-level-language code, apply a few wrappers to it, and have it run on the FPGA. You do not have to start coding in hardware description languages today. Plus, another benefit: because software is simple to change, updating a service just requires updating the instruction memory. There's no hardware reprogramming required; you don't have to reprogram the FPGA every time you want to do something different. You just load the new instructions, reset the soft core, and it starts executing the new program.
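As a flavor of what a provider service on one of these soft cores might look like, here is a small hedged sketch in C. The memory-mapped register addresses and helper names are invented for illustration, not the actual interfaces from this project; note the polling loop, which, as comes up in the Q&A later, is how the design avoids interrupts.

    #include <stdint.h>

    /* Hypothetical memory-mapped FIFO interface exposed to the soft
       core by the shell; the addresses are placeholders only. */
    #define RX_STATUS  ((volatile uint32_t *)0x80000000u) /* nonzero = data  */
    #define RX_DATA    ((volatile uint32_t *)0x80000004u) /* next input word */
    #define TX_DATA    ((volatile uint32_t *)0x80000008u) /* output word     */

    /* Stand-in for whatever per-word processing the provider service
       does (filtering, a table lookup, and so on). */
    static uint32_t process(uint32_t word) {
        return word ^ 0xFFFFFFFFu;
    }

    /* The service is just a bare-metal polling loop: no interrupts,
       no operating system. Changing the service means loading new
       instructions and resetting the core; the fabric is untouched. */
    int main(void) {
        for (;;) {
            while (*RX_STATUS == 0)
                ;                    /* poll until a word arrives */
            *TX_DATA = process(*RX_DATA);
        }
    }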
To give a proof of concept, we implemented two very simple cloud provider services. One was a routing table that monitors IP addresses and makes sure you're not going somewhere you're not allowed to go. And similarly, an AES-128 encryption block that is able to encrypt, for now, a 512-byte payload every clock cycle, and it can do a lot more than that. This is by no means comprehensive security. There are flaws in there that even I have no idea about, so I'm not saying this is perfect security; this is just to get the ball rolling, a proof of concept that you can start to implement security features on the FPGA and not just use it for service or application acceleration. For the users, we program the user logic over the PCIe bus, and I'll get to why that's the case in a bit.

If you look inside the modules we've implemented, we've tried to use hardware description languages and open source IP cores. We're trying our best not to use proprietary software or proprietary IP cores, so that this stays in line with the open source development model. We prefer HDL wherever possible, because that means you are vendor agnostic: you can take HDL and run it on almost any FPGA irrespective of who made it, whether Intel or Xilinx. But where something was far too complex, or we didn't want to spend too much time on it right now, we've used open source IP cores instead.

Now, the question of how we program the user logic, what tool flow we use. Well, for now, we're using OpenCL. It is a proprietary tool flow. The reason we're using it is that it has a ton of documentation, it has a ton of optimized libraries already available, it implements some very important things like cache coherency, and, most importantly, it lets you program in C. Just like for cloud providers, we know the transition from software to hardware is not going to happen overnight; it will be gradual for cloud users too. And that's why, by supporting a tool flow that works mainly with C99, we can have people start to use the cloud today and then slowly make the transition over to HDL, which is more complex to code in but has better performance. This also highlights the importance of having an open source tool flow for doing this, and we are moving toward developing something that lets users still program in C, in high-level languages, without using proprietary tool flows such as the Intel OpenCL SDK.

In terms of connectivity, we've used standard parameterized interfaces to make sure everything remains compatible when you change any of these blocks. And by the way, this is modular, so you don't have to change everything; if you want to update something, just change that particular block. One thing to highlight is the interface between the user application and the smart switch, which is also standardized. So you can start to draw a parallel: the FPGA shell is like a hardware-based operating system with a fixed API through which users interact with it. The connections in light blue support clock crossing. This is a very low-level hardware feature: if you design some logic, you're not bound to a particular clock frequency; you use the one best suited to your architecture, and we make sure it's supported in the shell.
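To illustrate what "programming in C" looks like for user logic under an OpenCL flow, here is a generic OpenCL C kernel, a toy example of mine rather than code from this project; a vendor tool flow synthesizes a kernel like this into a pipeline on the FPGA fabric.

    /* count_matches.cl: count how many input values equal a reference,
       the same example as earlier in the talk, written as an OpenCL
       kernel. On an FPGA, unrolling replicates the comparison logic
       into parallel comparators feeding an adder tree. */
    __kernel void count_matches(__global const int *values,
                                const int reference,
                                const int n,
                                __global int *result) {
        int count = 0;
        #pragma unroll 8    /* hint to replicate the comparison logic */
        for (int i = 0; i < n; i++)
            count += (values[i] == reference);
        *result = count;
    }

The same source stays plain C99 at heart, which is the point: users can write this today and move to HDL later, where the performance justifies it.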
Finally, the things in purple, those have back pressure. Back pressure means that if you implement something that is slower than the rest of the system, the system will not crash. If it simply crashed, it would be hard to figure out where the problem was. Instead, the system continues to operate; it just slows everything down, but at least it enables you to get in there and figure out where the problem is. And so, with that, thank you for listening. I hope I did justice to the topic. Thank you.

Shall I take this one? Ahmed is working on the big FPGAs and so on, and for those we don't yet have anything, so I hope we soon get the tool chains working for them as well. But, oh sorry, the question was about proprietary tool chains. Ahmed mentioned that everything he does at the moment, which is correct, uses the proprietary Intel tool chains, from the synthesis tools up to the OpenCL compiler and so on. So the question was: is there an alternative to this? For the big cores, this does not currently exist, but we are hopefully getting there; we have a beginning. At DevConf in Brno, I gave an introductory talk about FPGAs, and I actually showed a completely freely available tool chain with which I was able to program the FPGAs I had with me. These are small FPGAs, mostly used in the embedded realm: the Lattice iCE40-based ones. And there's a guy, who now has a company formed around him, who took the time to reverse engineer the bit streams, and he wrote the synthesis tools and the PNR tools, the place-and-route tools, for these devices. They are now working on Xilinx FPGAs as well to get this up. And as soon as this works, hopefully we get other vendors to disclose enough information, or we reverse engineer enough of the bit streams, to cover all the bigger parts as well. The thing is, the story we heard from Xilinx and from Altera and so on for the longest time was: it's far too complex for you newbies out there, so you don't get this kind of information, you don't need it. What has been proven now is that this is not the case. The free software community can actually produce these tools and will not be stopped. So my hope is that within the next one or two years or so, we actually have freely available tools up to the highest scales, for the million-LUT FPGAs, and then this whole thing becomes even more important.

So the question was how the OpenCL tool flow works. It synthesizes the C code. It picks up the functions you're trying to implement, it has IP blocks that it replaces those functions with, connects them together to form your system, and then passes that on to the Quartus tool flow, your standard Intel tool flow, which synthesizes it into hardware.

So the question was: when you have all these tenants in the cloud, the cloud user, the cloud provider, and all these functions in the shell, how do you partition them? How do you make sure they're not able to talk to each other? If you remember the slide with the switch boxes, that's where the key lies. You have these switch boxes which control the connectivity. So if a switch box is not set, you are effectively isolated from each other in hardware. And so the only connectivity you can possibly have is through the smart switch, and in there we have protocols and permissions being enforced, so you can ensure that data is not going to a place it's not allowed to go to.
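To make that isolation answer concrete, here is a hedged sketch of the kind of permission check a smart switch could apply before forwarding between tenants; the tenant IDs and table layout are invented for illustration, and the project's actual protocols are richer than this.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_TENANTS 4

    /* allowed[src][dst] is set by the provider when tenants are placed.
       It plays the same role as the switch-box configuration: if the
       entry is not set, the two parties are effectively isolated. */
    static bool allowed[MAX_TENANTS][MAX_TENANTS];

    static bool smart_switch_forward(uint8_t src, uint8_t dst) {
        if (src >= MAX_TENANTS || dst >= MAX_TENANTS)
            return false;              /* unknown party: drop */
        return allowed[src][dst];      /* enforce provider policy */
    }

    int main(void) {
        allowed[0][1] = true;          /* provider lets tenant 0 reach 1 */
        printf("0 -> 1: %s\n", smart_switch_forward(0, 1) ? "forward" : "drop");
        printf("1 -> 0: %s\n", smart_switch_forward(1, 0) ? "forward" : "drop");
        return 0;
    }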
Yes, so that's called partial reconfiguration, where you can define a part of the FPGA and reprogram it at runtime. A full reconfiguration, where you change the entire FPGA hardware, takes a few seconds to get done. Partial reconfiguration, and that's how the OpenCL tool flows do it today, is very fast. What you do is define a region of the FPGA and say: this is where all my logic will exist, at least the logic I want to be able to modify at runtime, and then you can go in and modify just that part.

Sorry, for the network side, what was the question? So for now everything is hard-coded, but we're looking to expand that to full network features, full network compatibility. For now, because this was a prototype, just a proof of concept, most of that is hard-coded. Almost definitely, definitely. I mean, there's a whole class of FPGA boards, called NetFPGA, designed for exactly these networking tasks. Networking on FPGAs is a wide-open field; in fact, we have a PhD student in our group who is working on exactly the networking side of the FPGA. So it is a big field. Yes.

So the question was: we're using soft cores to ease the transition from legacy software onto the FPGA, so what is the overhead of those wrappers? The wrappers mainly deal with I/O, because this is like a regular CPU; it will do all your basic instructions with an ALU. It's only when you want to interface with something on the FPGA that you need specific addresses, specific pointers, and a specific interface for how this works. So the wrappers just deal with interrupts and with I/O; those parts need to be modified, and the rest of the code in all likelihood will stay the same. I'm sorry. So interrupts are optional; you can do this with polling. Currently, we use a polling-based mechanism to look for network data and to process network data on the soft cores. Interrupts are optional and we avoid them wherever we can, because they really slow things down.

In the long run, my wish is that we will use RISC-V. So the question was: what architecture are we using for the soft cores? Currently we're using Altera's Nios II cores, but in the future we're going to look toward moving to RISC-V processors on the FPGA.

So folks, next up we're going to talk about Malleable Metal, which is a lot about how we are implementing attestation and boot-from-volume in Foreman. Thanks.