Good afternoon, everyone. My name is Ahmed Sanaullah. I'm a fourth-year PhD student at Boston University, and I work with a wonderful set of mentors and advisors: Martin Herbordt from BU, Orran Krieger from BU, and Uli Drepper from Red Hat. My talk today is on FPGAs. We're seeing these pop up a lot in data centers, and this is not some marketing gimmick or a novelty meant to draw people to the cloud. They are there for a very good reason: we're living in a world where our traditional compute systems are simply not enough to meet the demands being placed on our data centers and our CPUs.

So let's start from the beginning; well, not that far back. There was a time, not too long ago, when a single computer was good enough. Moore's law was going strong, Dennard scaling was going strong, and we could do all our computation on a single machine. But the complexity of our workloads scaled faster than the hardware could, and as a result we had to grow: take multiple machines, connect them together, and distribute the workload to meet those demands. It didn't stop there. Workloads kept getting larger, and we kept needing more compute than you could get by simply wiring a few machines together. That's how we arrived at data centers: take a whole lot of compute, a whole lot of storage, and really high-speed interconnects, connect it all together, let people access it remotely, and let them share the resources. Everybody gets the impression of having huge amounts of compute power, storage, and communication bandwidth.

But did we meet the demands then? Of course not. Demand keeps growing; it's a trend that will continue, and it will probably get worse, because we live in a data-centric world. Everything we do, and everything we don't do, generates data, and that data needs to be processed. If it doesn't get processed immediately as it streams in, you build up a backlog that you eventually have to throw away, because you can never catch up while new data keeps arriving. The utility of processing it is lost.

So how do you solve that? Today, you take data from storage, bring it into a compute server, process it, and send it back. How much time did you spend doing actual work? Only the time it was in the CPU. All the other cycles were wasted just moving the data to where it had to be computed. That's a big problem, a lot of wasted cycles, and we need to reclaim them. We need to compute on data while it sits in storage. We need to compute on data while it moves through the network. And if we can stop it from moving altogether, that's even better. We need to start computing everywhere in the data center, not just on one set of compute servers.

How do we do that? Well, this is already starting to happen: we're putting in FPGAs. Where does one put an FPGA in a data center? You could put them at the entry points of nodes, where data comes in off the network switch. Let the FPGA do the computation instead of the CPU if it can. Or, if the data really has to go to the CPU, put an FPGA inside the NIC and specialize it: make it application-specific, specific to how you process those network packets, instead of relying on a standard NIC architecture to do that for you.
You could take the Microsoft Catapult approach, where you connect a bunch of FPGAs together with high-speed interconnects in a secondary back-end network, so you can go from chip to chip in about 100 nanoseconds, which is incredibly fast. You could put FPGAs on top of storage, so you compute on data while it sits in the storage servers; for trivial operations, say adding numbers together, you're not hauling petabytes of data across the network just to do that, you do it right there. And because the FPGA is a resource in its own right, it doesn't need a CPU: you can have a bunch of FPGAs connected directly to the network, program them remotely, and use them as part of your allocation in the data center. All wonderful things you can do with an FPGA. So that's a summary of the ways you can find FPGAs in data centers today.

But in real life, as much as we'd like to believe FPGAs were made for data centers, they weren't. The major FPGA market was actually routers, Cisco routers. Why does that matter? Because to use the FPGA effectively today, we have to understand why the device became important in the first place, and then draw on those positive aspects, as we'll see in more detail. Cisco routers carry FPGAs so that protocols can be updated over time. And of course, if you have an FPGA in the router, you can do more than that: high-frequency trading companies, for example, can execute trades inside the router, which is incredibly fast and something you need today to stay competitive in that space.

As with all technology, FPGAs kept getting bigger and better, and there comes a point where you think: this device has lots of DSP units and lots of memory; maybe I could use it as a traditional accelerator, the way GPUs are used. Maybe I could pull out my GPU and replace it with an FPGA. Sure, for some applications you can. But that's not what the FPGA is really about. It's not just an accelerator for your matrix multiplications or vector adds. It can do plenty of complicated computation, but more importantly, it gives you this: you do not have to treat communication and computation as separate, explicit operations, with memory buffer copies again and again to move data from the network into kernel space and then user space, all the things you do on a CPU. On an FPGA you can tightly couple them. You can take data from the network straight into your adder and straight back out, with none of the overhead processing that costs so much time on a CPU.

For the longest time in our lab, this was our main use case for FPGAs: tightly coupling communication and computation. One of our classic model applications is molecular dynamics, where we model an N-body system of particles. Using this idea of tightly coupled communication and computation on a dedicated FPGA cluster, we got an order of magnitude better performance than a traditional CPU/GPU cluster, and we were within one order of magnitude of dedicated ASIC clusters that cost millions of dollars to build, which is an insanely good result.
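To make that tight coupling concrete, here is a minimal sketch of what such a kernel looks like when written in an HLS style. Everything here is illustrative: istream_t, ostream_t, and the stream helpers are local stand-ins defined so the sketch compiles on its own, not a vendor API. In a real flow the loop would become a pipeline sitting directly on the network port, so a word arrives, is added, and leaves again with no kernel buffers or memcpy steps in between.

    /* Sketch only: the stream types and helpers are stand-ins for
     * vendor HLS stream primitives, defined locally so this compiles. */
    typedef struct { const float *data; int pos; } istream_t;
    typedef struct { float *data;       int pos; } ostream_t;

    static float stream_read(istream_t *s)           { return s->data[s->pos++]; }
    static void  stream_write(ostream_t *s, float v) { s->data[s->pos++] = v; }

    /* Data comes off the wire, through the adder, and straight back
     * out; communication and computation are fused in one pipeline. */
    void stream_add(istream_t *a, istream_t *b, ostream_t *out, int n)
    {
        for (int i = 0; i < n; i++)      /* one result per cycle when pipelined */
            stream_write(out, stream_read(a) + stream_read(b));
    }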
And of course, you cannot talk about FPGAs without talking about machine learning. We also looked at neural networks, specifically convolutional neural network training. What we saw was that you can take the idea of layers within the model and not be confined to the traditional mapping of one layer per FPGA, where one FPGA does its work and passes the result on to the next FPGA holding the next layer. You can be more fluid than that: you can split a layer across multiple FPGAs, and because of that 100-nanosecond chip-to-chip connectivity, it behaves as if everything were on the same FPGA. Doing so, you can scale up to 80-plus FPGAs and still get 98% utilization of each chip.

But that was only half the picture. Microsoft came along and showed: you're doing application support on your FPGAs; why not split those two concerns apart? Why not provide not just application support for the one thing you're running, but system support for all applications that will run on the machine? So they implemented a whole bunch of important data center system functionality, like encryption and software-defined networking, on their FPGAs. And they showed you can not only save CPU cycles, which you can hand over to tenants and make more money from, but also get quite a bit of speedup, because you're doing these things without going through the CPU stack. If that stack is interrupt-based, you burn a whole bunch of cycles on interrupts; if it's polling-based, you're back to dedicating an entire CPU thread to polling, and that's not something you can afford.

So we started looking at this too: system support. What can we do to improve data movement for an arbitrary application running on the machine? One thing we looked at was lossy compression. As data moves through the network, you don't need it at full floating-point precision; you can compress it and throw away information that isn't really relevant to the computation (a toy illustration of the idea follows below). What we saw was that ten FPGAs working together can outperform the number one supercomputer on the IO500 list, which was an insane result. We also looked at collectives, operations where you gather data from many nodes and operate on it together; except instead of waiting for everything to come together at one node, you compute on the data as it flows through the network. You add numbers while they move through the router instead of waiting for them to reach a CPU. Again, doing this in software takes thousands of instructions; in hardware you can do it at line rate. Same story with an offloaded MPI layer: try to do this in software and it takes too long; do it on the FPGA and it's 100 milliseconds. And by the way, MPI is core to a lot of the scientific computation that happens today.

So, given a node with FPGAs, both as separate boards and embedded in NICs and so on, what can you do with each FPGA? You can take the FPGA with a dedicated back-end network and do finance there, or bioinformatics. You can use that same FPGA for the molecular dynamics and machine learning we saw before, which benefit from that back-end network. You can take the FPGA that's tightly coupled with the memory on the CPU board and do protocol handling there, or some compression on the SmartNIC, and a whole bunch of other things. So each FPGA can be used for a whole range of tasks.
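Since the lossy-compression point above is easy to make concrete, here is about the simplest possible version of the idea in C: zero out low-order mantissa bits of each float as it streams past, trading precision the computation does not need for bandwidth. This is a toy sketch, not our actual compressor; real designs use proper compression schemes, and how many bits you can afford to drop is application specific.

    #include <stdint.h>
    #include <string.h>

    /* Keep the top keep_bits of the 23-bit float mantissa (0..23)
     * and zero out the rest. In hardware this is just a mask applied
     * at line rate; in software it costs cycles per value. */
    float truncate_mantissa(float x, int keep_bits)
    {
        uint32_t u;
        memcpy(&u, &x, sizeof u);                    /* view the raw bits      */
        uint32_t mask = ~((1u << (23 - keep_bits)) - 1u);
        u &= mask;                                   /* drop low mantissa bits */
        memcpy(&x, &u, sizeof u);
        return x;
    }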
Every FPGA we're putting in there has important use cases, which brings me to the second part of my talk: we have FPGAs in the data center, but are we using them effectively? How do we use them effectively? Because there's a difference between providing a resource and providing everything that has to happen for an ordinary developer to actually be productive with it; or, forget productivity, to even get around to using it at all. That's the first step, and even that is really complicated with FPGAs today.

That's why we're working on a framework we call Harness: hardware as a reconfigurable, elastic, and specialized service. There's meaning to the name beyond the acronym. First, you don't see the word FPGA in it. Why? Because the FPGA is one manifestation of specialized hardware technology, and we don't expect FPGAs to remain the only way to achieve this outcome. We don't want all the effort that goes into this framework to be so specialized to this particular technology, or this particular formulation of it, that it gets thrown out the window the moment something comes along to replace FPGAs as reconfigurable logic.

Second, look at the three individual terms. Reconfigurable: you want to be application-specific; you want to be able to change the hardware based on requirements. Why list specialized separately from reconfigurable? Because being reconfigurable doesn't necessarily mean that, in a data center, you can be specialized to your use case, and we're seeing more and more of this. Take machine learning as an example: you may be provided with an FPGA that can implement a neural network model, but that model may not be exactly what you're trying to build. It may just be the closest available thing, because effectively you're not being given control of the entire chip; you're given a set of IP cores to choose from to implement your design. You're getting the best available case as opposed to the best possible one. So the two are different: being reconfigurable does not guarantee that you're specialized. We want both. We want to give people hardware they can change, and the capability to make it exactly what the application needs to work in the best possible way. And then elastic: FPGAs are big devices now, no longer tiny chips that can do only one small thing. You can run multiple things on an FPGA, so it's important that everyone plays nice and shares the FPGA to get maximum utilization out of the chip.

Now, programming FPGAs is hard, but it's important to identify where that difficulty comes from. Look at the traditional way first: HDL, which is essentially assembly programming for FPGAs. It is not the equivalent of C; it's the equivalent of assembly, because you're pretty much specifying where each and every wire goes, unless you use behavioral constructs that buy you some abstraction. It takes a lot of time to write the code, a lot of time to compile it; you optimize based on the feedback you get; and then you have to integrate that FPGA computation into the overall flow of your workload, which will involve CPUs and maybe even GPUs.
If you use high-level synthesis instead, which is your C code (or Python, or whatever high-level abstraction) turned into hardware, you get rid of most of that up-front development time. But you still have to compile, and often that's harder, because you're constrained in what you're trying to build. Optimization is even harder here, because you don't have that low-level control: you cannot go in and change one bit of one particular part of the FPGA; you can only nudge things by rewriting high-level language code. Integration, though, is easy, because that's how HLS is built: you have board support packages and all the infrastructure around them to help the result slot into your overall flow.

Libraries are the best possible case: if what you need already exists as hardware, just use it. It's wonderful, it's beautiful. But what if it's not an exact match? That's what happens most of the time: what you're looking for is not exactly what's available, because there are so many small variations between what you want and the general case people build for. And so you're back to square one, building a system around the library to compensate for the things missing from the core, and that sets you back months.

What we're looking to do with Harness is take all of this down to less than a day: going from idea to an actual hardware implementation in less than a day. How do we do that? We start with C code. Simple C code. Not "I'm writing this for an FPGA, hence I must structure it differently," but your regular C code. Of course, you're not going to use things like recursion, because that doesn't work well on FPGAs. But it's really basic, naive C, where at most you're optimizing for cache, which happens to work well for FPGAs too, and maybe adding some hints you think will help the compiler along. There's no requirement for you to have ever picked up an FPGA board, or even to know that FPGAs exist, before writing it.

The first step is that we turn the C code you give us into something that will actually map well to hardware. We make one assumption here: that you want to generate high-performance, efficient hardware. That assumption lets us tune the code: we look for design patterns in the given C that map well to an FPGA, extract them, and replace them with more efficient versions. And because the process is systematic, it can be implemented as a set of pre-compiler passes (a sketch of the flavor of these rewrites follows at the end of this section). Then we compile the C to HDL, and the HDL to a bitstream. Then we provide software support for taking that bitstream, packaging it into a job, having drivers and OS support place that job on the FPGA, interfacing with it for the duration of its life, and moving data between the FPGA board and the CPU. And finally, just as you have an OS on the CPU, you need one on the FPGA as well: you've got all these resources, and multiple people sharing them, so you need some way of managing that sharing, managing everything that happens outside of the application you're putting on the FPGA.
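To give a flavor of the pre-compiler passes mentioned above, here is a representative rewrite; it is illustrative only, and the actual passes in the toolchain may differ. A single accumulator chains every addition behind the previous one, which pipelines poorly on an FPGA; splitting it into independent partial sums exposes an adder tree that the hardware can evaluate in parallel.

    /* What the developer writes: naive, cache-friendly C. */
    float sum_naive(const float *x, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += x[i];                   /* loop-carried dependence */
        return s;
    }

    /* What a pass might emit: independent partial sums that map
     * onto a small adder tree in the generated hardware. */
    float sum_transformed(const float *x, int n)
    {
        float p[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            p[0] += x[i];
            p[1] += x[i + 1];
            p[2] += x[i + 2];
            p[3] += x[i + 3];
        }
        for (; i < n; i++)               /* leftover elements */
            p[0] += x[i];
        return (p[0] + p[1]) + (p[2] + p[3]);
    }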
For the last part of the presentation, I'll go over some of the pieces of the toolchain we just saw. Starting with the C-to-C compiler: this is an example we did for, I believe, Needleman-Wunsch, a DNA sequence alignment application. We started with the code at the top; below it is what we got after running it through all those transformations, systematically replacing constructs with ones that map well to the FPGA. We compiled the result with the Intel OpenCL compiler and got over a hundred times speedup from one version to the other.

C-to-HDL is something that hasn't received much attention before, but it's a really important part of the problem. The way it's currently done, you take your source code, break it into chunks, into sequences, and then look up which blocks in a library of IP blocks can implement that functionality. Seems simple, right? Just a basic lookup. Except functionality overlaps. You might think: I need an adder there, so let me pick an adder. Except an adder can be implemented in multiple ways; how do you choose the best one? With a large library full of overlapping blocks, the problem gets harder and harder to optimize: how do you select the best configuration of IP blocks, connected in the best possible way, to give the best performance? That's a really hard problem to solve, and it's why the HDL compilation back end today tends to be inefficient and, from personal experience, prone to failure. It fails a lot of the time: it tries to build this huge map and ends up using about 100 gigabytes of RAM on a very simple compilation that should occupy maybe 30% of the chip, which is nothing. It just goes off the rails and cannot converge to a solution. What we want to do instead is take smaller code sequences and very basic building blocks, think gates, think very basic adders, and use those to construct the design, so we're not spending time hunting for the best pre-existing IP. We build it from scratch, tuned to the exact application being put onto the FPGA.

Another thing the compiler should do is determine whether compilation to hardware is even necessary. When we do molecular dynamics, for example, the problem has two parts. The first part, at the top, takes almost no time; it's a really simple O(N) problem. The one at the bottom is almost O(N^2), not quite, but almost, so you can see it's the one that will take a while. If you finish the cheap part really quickly, you then just sit waiting for the expensive part to get done, and you've wasted that resource. Instead, you can put the cheap part on a soft core, a tiny CPU synthesized inside the FPGA, right next to the pipelines, so you still get really fast communication, but running at a lower frequency, with fewer resources, mapping the computation temporally rather than spatially. And it ends up finishing at exactly the same time as the expensive part that's mapped directly onto those pipelines and LUTs and takes longer to compute. That analysis, working out when things will finish and mapping accordingly, is something the C-to-HDL compiler should be able to do automatically.
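In toy form, the two parts of that molecular dynamics timestep look something like the code below; the real force laws are far more involved, and this only shows the asymmetry in work. The scheduling goal is that the O(N) pass on a slow soft core finishes at the same time as the near-O(N^2) pass on the dedicated pipelines, so neither resource sits idle.

    /* Cheap O(N) pass: fine on a soft core at a low clock frequency. */
    void integrate(float *pos, const float *vel, int n, float dt)
    {
        for (int i = 0; i < n; i++)
            pos[i] += vel[i] * dt;
    }

    /* Near-O(N^2) pass: the part worth mapping directly onto
     * pipelines and LUTs. The interaction is a placeholder. */
    void pair_forces(const float *pos, float *force, int n)
    {
        for (int i = 0; i < n; i++) {
            float f = 0.0f;
            for (int j = 0; j < n; j++)
                if (j != i)
                    f += pos[j] - pos[i];   /* toy interaction, not physics */
            force[i] = f;
        }
    }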
In terms of software support, our primary concern is that things deprecate really quickly today. Pick up a driver provided by a vendor a year ago and it won't work anymore; and it goes without saying it won't work for any other vendor. So you're stuck: nothing works across vendors, nothing works across time, and you're left with old, redundant, useless pieces of software. Those become bottlenecks, because the more CPUs and FPGAs work together, with data transfers happening all the time, the more you need something better and faster. And you cannot guarantee you'll only ever have one kind of FPGA in your data center: you might have a Xilinx board, an Intel board, a Lattice board. You shouldn't be locked into one piece of software just because one vendor's boards are installed and another vendor's boards can't be used. So we're building a set of software support that's independent of the vendor. Really, think about it: the FPGA sits across a standard bus. Why should the driver be vendor-specific when you're just talking over a standard PCIe bus? That's one of the things we're looking at.

Then there's the operating system I talked about. We're trying to share the FPGA among multiple tenants, and even logic from the cloud provider, providing support for reconfiguration, for communication, and for multiplexing the resources between these users within the same FPGA fabric. We're calling that operating system Morpheus.

And of course, we cannot have this discussion without this slide: security. Whether we share the FPGA or not, this is hardware that can be reconfigured, and that is scary stuff if you specialize in security. But it's a question that needs to be answered, and we're exploring it as well. How do you make this thing secure? How do you tell people they can use this hardware, with a neighbor using the same hardware, and still be safe? What guarantees do we give to actually convince people that it will be fine, that they won't lose their data? And with that, that's my talk. Thank you for listening; I'll be happy to take questions.

So before we do that, let me add a couple of words. First of all, when he said "we are working," it's really mostly him. Ahmed has a small child, and he has still been working on basically three hours of sleep for almost a year on these things. The second thing concerns the slide about soft cores. You may recall he said we can explicitly run some things on the soft core while the rest runs in the fabric. There's actually a bit more behind that. Even in the HPC world today, we're already writing C/C++ code and annotating it; currently that takes the form of OpenMP and similar directives, through which we give the tools additional pieces of information. And that's exactly what we need to guide the compiler in deciding which part of the code is the kernel, the performance-relevant part, and which part is just supporting infrastructure. The supporting infrastructure is not performance relevant, so we can compactly represent it as normal executable code.
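A sketch of how such annotations could guide the split is below. The "#pragma harness kernel" directive is invented for illustration; the real mechanism builds on OpenMP-style annotations as just described. The annotated loop is the performance-relevant kernel and would be transformed into HDL, while the surrounding housekeeping compiles unchanged for the soft core sitting next to it.

    #include <stdio.h>
    #define N 1024

    int main(void)
    {
        static float in[N], out[N];

        /* Supporting infrastructure: not performance relevant,
         * runs as ordinary code on the soft core. */
        for (int i = 0; i < N; i++)
            in[i] = (float)i;

        /* Performance-relevant kernel: becomes a hardware pipeline.
         * The directive name is hypothetical. */
        #pragma harness kernel
        for (int i = 0; i < N; i++)
            out[i] = in[i] * in[i] + 1.0f;

        printf("%f\n", out[N - 1]);      /* soft-core housekeeping again */
        return 0;
    }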
So in this case we have selected RISC-V as that soft core, because we can synthesize appropriate soft cores very easily, and legally. The rest of the code, the kernels, we integrate by transforming them into HDL, and because of how the RISC-V instruction set and everything around it is structured, the two sides can interact with each other very efficiently. If you're interested in these kinds of things, we can talk about them; it's a bit too detailed to go into right now. But I think that, together with the Morpheus operating system infrastructure, we have found a way to make an FPGA look more like a traditional computer, where parts of it can be completely reconfigured when we actually need that. We have processors running on the FPGA that are part of the operating system infrastructure, and when someone comes along with a good use for actual HDL, for IP blocks we can instantiate on the FPGA, we can accommodate that as well, and for multiple tenants at the same time.

On that topic, Ahmed has dug up a couple of papers in which people have expressed concern about multi-tenancy. You can imagine: these are electrical wires sitting very close to each other, and a hostile workload somewhere on the chip can actually monitor the electrical signals on the wires connecting other tenants' IP blocks. So this is not just a theoretical problem, and we're looking into getting some form of guarantees there. That will require us to have control not just over the operating system and those kinds of things, but over the toolchain as well. That's another thing we're working on: there are already, to some extent, open toolchains out there, and we are currently paying a company to come up with a much more complete solution. It will cover some simple FPGAs at the beginning, but we hope it will lead to the same kind of support for everything on an FPGA, not just the small ones but the bigger ones too. Then we'd have an ecosystem that extends the free-software world we know today on the Linux side down to the hardware side, to reconfigurable hardware. And then comes what Ahmed mentioned: perhaps in the future we won't have explicit FPGAs; we might have new types of computers where the distinction between hardware and software becomes much more fluid, and we'll have something available that is able to handle and program them.

Yeah, so just to reiterate: when you say soft core, if you need a soft core you are actually using RISC-V, so you can effectively just compile to it? Yes, it's just normal code; we have a compiler toolchain for it already. We don't plan on using hard cores: they carry additional cost, they lack transparency, and we don't need them. Anything else? Otherwise I would have Ahmed go back to work; he has a deadline to meet.