Awesome. So it turns out names don't really matter, because I changed the name of the talk. So, a dog by any other name: the original was, you know, a rose by any other name. Shakespeare, right? Names don't really matter. And then I realized, as I was putting the talk together, that I had a really cool metaphor about dogs, so I changed the name. The real punchline is about understanding computational and hardware complexity in software defined radio. Marcus gave us a wonderful perspective from the CPU side, in terms of what we're seeing with scheduling, and hopefully I'll pitch in and provide a complementary view. Unfortunately, it's not going to be as technical or as advanced. But the basic idea is: if you are new to SDR, or new to putting things on FPGAs, what are the things that go awry? Why isn't this working? Or, if you have had experience with an FPGA and felt really bad afterwards, maybe I can tell you what happened. My primary thesis, which underlies almost everything I do, is that algorithms change to suit the hardware context. In part I'm an evangelist: I am here to convince you, and give you some of my religion, that you can't just take the stuff that works on a CPU and run it on an FPGA. That's not how this works. Hopefully by the end you'll agree with me. If not, at least there'll be a tiny little voice in the back of your head. So who am I to say this? I'm just a guy who wandered in an hour ago. Well, I'm a digital hardware engineer. I've contributed to chips at publicly traded companies, big chips: Itaniums, an 8-core Xeon, Intel's really cool GPU that never really took off, and an actual GPU later. And then I taped out my own chips to examine radiation effects. So I'm kind of all-consumed with how things work. Long ago, before I did all this, I started out with hardware-software co-design.
So one of the challenges somebody gave me was: if we wanted to take a GPU and have it render movies like Pixar does, how do we do that? The answer is very small triangles, and it involves changing all of the hardware and not doing it any of the way the Pixar people did it. I moved on from that to domain-specific languages that compile to RTL for image processing: how do I take high-level descriptions of an image processing application and turn them into hardware? So I spent a lot of time thinking about abstractions and how to make this good. Everything ends up being line buffers and local buffers, and the things that don't fit in that format are bad ideas. That's the general rule of thumb. Every time somebody goes, "but I want something else," I go, "great, but you need to pay for it." So we wrote this up: this was actually my thesis, and we had a publication at SIGGRAPH describing the abstraction. The much better version of this is called Halide, which actually has all of this implemented. We talked to the guys doing that work, and then they made it real and made it free and open source, so I'd encourage you to look at it. One of their solutions to the scheduling problem is to let the user directly describe the schedule. They've sort of thrown their hands up and said, "it's not my problem; my job is just to implement what you tell me to implement." As part of my thesis work, we were constantly demoing things to DARPA on FPGAs. And wouldn't you know, two weeks before the demo it would be dog slow. And I'd be super sad, because that wasn't my job: I was doing the ASIC stuff, and somebody else was in charge of the FPGA stuff. But I'm the one who looks silly in front of DARPA. So we spent a lot of time reading through the manuals to figure out why it was slow. And it turned out, in most cases, it was because we just didn't understand how the primitives worked.
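The line-buffer pattern mentioned above is easy to sketch in software. Here's a minimal Python model (the 3-tap vertical blur, the function name, and the row format are all illustrative assumptions, not from the talk): stream the image row by row, keep only the last two rows in a small buffer, and never materialize the full frame.

```python
# Minimal line-buffer sketch: a 3-tap vertical blur that streams rows
# through a two-row buffer instead of holding the whole image.
from collections import deque

def vertical_blur_3tap(rows):
    """rows: iterable of equal-length lists of ints. Yields blurred rows."""
    line_buffer = deque(maxlen=2)  # models two BRAM-resident rows
    for row in rows:
        if len(line_buffer) == 2:
            top, mid = line_buffer
            # average of the three vertically adjacent pixels per column
            yield [(a + b + c) // 3 for a, b, c in zip(top, mid, row)]
        line_buffer.append(row)
```

The point of the abstraction is exactly this shape: the working set is two rows plus the incoming one, regardless of image height.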
We had some fundamental misconceptions, and some of this talk is that. That work has also been open sourced: Jing Pu did the Halide-to-HLS flow, and he has a wonderful paper exploring further topics in that area. Currently we're working on the general problem. That's actually how we got interested in GNU Radio: it's a wonderful set of applications with wonderful abstractions for looking at what's going on. Given a set of code written by domain experts, can you identify kernels, procedurally label those kernels in some format, discover a taxonomy of kernels, and predict an architecture? This might be the solution for when somebody gives me a GNU Radio block that is the entire processing application in one block: how would I break it into little blocks, and maybe produce FPGA code? We're currently focusing on discovering kernels in the code. The big problem is that if you look at the code statically, you have no idea what it does. What you see is little colored blocks of incomprehensible assembly. If you went back and relearned assembly, you might eventually be able to piece it together, but as far as I'm concerned it's just little pieces of the rainbow. So you run it, and you get little pieces of the rainbow smeared all over the place: you see your basic blocks executed in time. And eventually, from that, you can figure out how they cluster and what the data flow is. So that's the five-second pitch for some ongoing work, which we've open sourced, on how you do this tracing and how you isolate the kernels. We have a Code Ocean capsule with some of our data set and the actual code, and we have a preprint if you want all the cool details about what's going on. But we view this as a procedural way to discover the things I'm about to talk about. So, on to the metaphor: FPGAs are like puppies. Puppies are really cute. You see a puppy, and you want a puppy.
Right? And cats, I think. Definitely cats, too. You look at them and you go, I've got to have this. I've got to have this in my system. So why do you think this about FPGAs? What I did was go back over maybe the past 15 years and look at a lot of the published numbers for different things. It turns out that, in general (bottom left here is the good part: bottom is area, left is energy), FPGAs are a good 25x better than processors. They're not better than ASICs, but you don't want an ASIC, because an ASIC does one thing. You want something programmable, and your choices seem to be DSPs and FPGAs. We could give a whole other talk about DSPs, but FPGAs seem a little more reasonable for the time slot. So you get about 25x, and that's just if you put them on equal footing. And in all of the systems you're going to deal with, the resources have been disproportionately allocated to the FPGA: they've given the processor one watt, and they've given the FPGA ten watts. So take that 40x, and now you have 400x. So you should be expecting two to three orders of magnitude improvement when you go to run your applications. But often, as we were just hearing, that just doesn't happen. The other thing is that people are selling really cool toys. This is a LimeSDR; I thought it was just a pretty board. They're neat, you know; as an electrical engineer, I love things with little black boxes all over them. The problem is, you go buy your FPGA, and now you have a puppy. And now you're confronted with the reality that you've got to feed and take care of this thing. It's not all cool pictures with matching hoodies. So these are roughly your system pitfalls. There's a whole architecture metaphor about the feeding and care of your accelerators, and that's sort of what's here too.
The other problem is you need to figure out how to work with this animal. It turns out that dogs do not like to be yelled at. If you just keep yelling "roll over," they will never roll over. The way you teach dogs to do things is to work through small atoms of a trick: a rollover is like six steps that you have to teach a dog, and if you work through those six things and get the dog to sequence them, it can do it. FPGA design is the same way. Once you learn what the atoms of the FPGA are, you're fine, and after that you can model what's going to happen with your accelerators by putting those atoms together. And it turns out an FPGA doesn't have that many, so it's not too awful to learn. So what are the system pitfalls? We're going to start really basic, and some of you are just going to groan. You really need to understand your application's composition. You have an application: a bunch of red stuff happening, and a purple bit. You decide the purple is really cool, and that's the thing you sped up. And it didn't matter. You got a little bit, but it's underwhelming. I was just talking about 400x. Where's my 400x? It just didn't happen. You need to optimize the things that matter: take all the red things and speed those up on the FPGA, and then life is better. If you don't actually know what's in your application, mapping it to an FPGA is going to be very, very hard. It's not going to be a straight code dump sort of thing. For most folks this is kind of obvious, but it bears saying. The interesting thing here, though, is that if you were hoping for at least 40x, you have to map a minimum of 98% of your runtime to get it, and that's assuming the accelerator runs infinitely fast, you pay no latencies, and you deal with no overheads. 98% is a lot, right?
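That 98% figure is just Amdahl's law. A quick Python check (the function name is mine):

```python
# Amdahl's-law arithmetic behind the 98% claim: with fraction f of the
# runtime mapped to an accelerator that is s times faster, the overall
# speedup is 1 / ((1 - f) + f / s).
def overall_speedup(f, s=float("inf")):
    return 1.0 / ((1.0 - f) + f / s)

# Even an infinitely fast accelerator caps out at 1 / (1 - f):
assert round(overall_speedup(0.98)) == 50     # 98% mapped -> at most 50x
assert round(overall_speedup(0.975)) == 40    # 40x needs 97.5% mapped
assert round(overall_speedup(0.999)) == 1000  # "three nines" -> 1000x cap
```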
So in a real application, you should be looking at something like three nines of your runtime being mapped: nothing running on the CPU, everything running on the FPGA. First, understand your sources of latency. Your source of latency is that these are very complicated things. There are tons of wires everywhere, tons of channels, and all the channels have latency. This is from the Zynq UltraScale+ spec. Don't worry about what's in it; the real point is that it's an atrocious figure to try to figure out what's going on when your processor is communicating with your programmable logic. It's even worse if you're using one of the USB devices, because you're going through the backplane of your motherboard, through the USB controller, through the USB, and it's just totally gross and awful. Oops, that's not mine. This one's mine. For context, we go back to our thing that was totally awesome, and now we've added latency. You're comms people, right? You should have noticed that all of my communication lines were horizontal. I was committing a sin: there's no free communication. So all of a sudden I'm approaching the original speed I had, even though I'm totally sped up. Where does this all come from? For dramatic effect (this isn't how most things work, but this is how bad it can be): your user makes a blocking call to the OS to use the driver. Don't do this. Get rid of your operating system. The OS forces the caches to flush, not because it's a good idea, but because security protocols require flushing caches now. The driver then forces a data copy, because the memory spaces aren't shared between user space and driver space, so it actually has to do a memory copy on the processor side before it can do any work. The driver then needs to send a command to the DMA engine in the programmable logic to ask for the data it wants. Then the driver needs to poll.
Then, finally, the driver can send a command to the accelerator that it's allowed to read from the DMA to make forward progress. Then the driver needs to poll the accelerator for completion. Then it commands the other DMA to get the data back, and needs to poll that DMA to see if it's done, because it's really hard to know when these things are done. And at that point the OS can restore the user thread, and all of its locality in the caches is gone, because everything has been flushed to DRAM. And then there was more; I'm sure it was awful. So what are the tricks here? These are solved problems. We all have GPUs, and most of us play games. The GPU has this problem. Why don't we encounter these things there? Their trick is to hide the latency: they overlap transfers and operations. If I just overlap the communications and keep multiple things in flight, all of a sudden my life gets a lot better. But now my CPU has a full-time job. It is now the mother for the FPGA. That is all it's doing; that is all it has time to do. You have lost a processing resource to gain a processing resource. That is truly gross and awful. So instead, you should sequence the operations in the PL so that you don't need to do any of this handshaking yourself. It should just be taken care of on the other side, set up in the original configuration blast. So it looks a little more like this: you send one packet over configuring the whole process, and then it runs. This is one of the deep insights of OpenGL: when the driver does a launch, you actually send the full pipeline configuration in one go, and then it runs. After that, you only need really low-bandwidth, high-latency comms to talk to it; other than that, it's just ripping through. The catch is that you need to avoid blocking in that path. So you need to skip your false memory barriers.
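The latency-hiding trick above is easy to model. A toy timing sketch (the stage times are made-up units, not measurements): serialized execution pays transfer-in, compute, and transfer-out for every block, while an overlapped, double-buffered pipeline pays the full latency once and then only the slowest stage per block.

```python
# Toy latency-hiding model: per-block times for transfer-in, compute,
# and transfer-out.
def serialized(n, t_in, t_comp, t_out):
    # every block pays all three stages back to back
    return n * (t_in + t_comp + t_out)

def overlapped(n, t_in, t_comp, t_out):
    # fill and drain the pipeline once, then one block per slowest stage
    return (t_in + t_comp + t_out) + (n - 1) * max(t_in, t_comp, t_out)

assert serialized(100, 3, 4, 3) == 1000
assert overlapped(100, 3, 4, 3) == 406  # transfers almost fully hidden
```

In steady state only the compute time shows; that is the GPU trick, and it's what sequencing everything in the PL buys you without burning a CPU core on handshaking.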
Everybody who uses libraries likes to do one library call, wait till it's done, and then do the next library call, when you could start as soon as you have the data. In an FFT, you can start as soon as you have half the data, if you decide to do the butterflies pairing the top and the middle. If you decide to start with the two top ones, you can start as soon as you get two data points. So: start as soon as you have everything you need. There are also true memory barriers, and we could talk about all the cool tricks to hide those, but we need to move on. You should understand your application's data rate, because after you've fixed all your latency problems, you need to feed the beast. It likes to eat data, and it's hard to get the data there. And all of the applications we like to use have really bad compute-to-bandwidth ratios. This is rough, right? The FFT has sort of a log-n-to-one ratio, which is bad. I love my matrix multiplies, because I send a little bit of data and I need to do a lot of work. It's beautiful. And why do I think like this? Well, I go look at the resource bandwidths. I have DRAM ports: two of them, each 128 bits, clocked at about 300 megahertz. I get 77 gigabits per second if I'm a god. That's never going to happen. I have BRAM ports: about 2,000 18-kilobit BRAMs, each with a 32-bit read channel (we'll just worry about the read side, not the write side), running at 600 megahertz. They go at 38 terabits per second. The real problem is that my DSPs run at 114 terabits per second. So they're faster still than my local memories, and way, way faster than my DRAM. So anything I'm doing needs to live on the FPGA. It doesn't get to leave. The FPGA is Hotel California. The registers help a bit, but the registers are everywhere, they're single-bit, and they don't actually run at 600 megahertz; that's only if you're really, really good.
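The back-of-the-envelope numbers above check out. Here's the arithmetic (port widths and clock rates as stated in the talk; the ~2,000 BRAM count is my assumption, chosen to match the stated 38 terabit figure):

```python
# Reproducing the talk's bandwidth arithmetic, in bits per second.
dram_bw = 2 * 128 * 300e6      # two 128-bit DRAM ports at 300 MHz
bram_bw = 2000 * 32 * 600e6    # ~2,000 BRAM read ports, 32 bits at 600 MHz

assert dram_bw == 76.8e9       # ~77 Gbit/s, and only "if you're a god"
assert bram_bw == 38.4e12      # ~38 Tbit/s aggregate on-chip bandwidth
assert bram_bw / dram_bw == 500  # locality buys roughly 500x the bandwidth
```

The 500x gap between on-chip and off-chip bandwidth is the quantitative version of "the FPGA is Hotel California."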
But we'll talk about how to get close to that at the end. So you need to exploit locality. We've been calling this the memory wall. Registers mitigate your bandwidth limit into BRAM; BRAM mitigates your bandwidth limit into DRAM. And avoid writing back intermediates: if the next processing stage is going to consume it, don't write it anywhere. I don't know what it is with some computer scientists, but they're hoarders. They like to have their data in DRAM for no good reason. They're never going to look at it, but they feel really comfortable knowing where it is. Stop it. So how does this locality trick work? Think of this big block as your BRAM: it's your local cache, where I keep some of my immediate locality. These little blocks are registers, where I keep local values. This is one kernel, or one processing block, running, and this is its neighboring block. They are eating out of the same buffer. They're not telling each other when the buffer is done; they immediately grab things right off the bat. As one is processing, it chunks data in, and the other one slides its window along. It just moves along, and it's pretty. Data gets evicted, and I never actually need to hold what you might think is the real working set: I don't need to hold a full image, just the last thing in and the thing I'm evicting. And if I'm really clever: this is a window, and if you look at the window, it shares six squares every time it slides. So I store those six squares in little boxes and only update the three new squares. That's how I solve some of my bandwidth problems out of my cache. Thinking about where these localities exist, and the different layers of locality, is really important for getting high performance out of your accelerators. And this is how we're going to feed the beast. So these are some of the system pitfalls.
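The six-squares observation can be sketched directly. A toy Python model (the 3x3 window, the summing operation, and the names are illustrative assumptions): as the window slides one column right, six of its nine values stay in registers and only three new ones are read from the line buffer.

```python
# Window-reuse sketch: compute the sum of each 3x3 window across three
# buffered rows, reading only one new column (three values) per slide.
def window_sums(rows3):
    """rows3: three equal-length rows. Yields the sum of each 3x3 window."""
    cols = list(zip(*rows3))   # columns, as the line buffer would feed them
    window = list(cols[:3])    # nine values held in "registers"
    yield sum(map(sum, window))
    for new_col in cols[3:]:
        window = window[1:] + [new_col]  # keep six old values, fetch three
        yield sum(map(sum, window))
```

Per output, the cache sees three reads instead of nine; that is exactly the bandwidth problem the registers solve.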
There are a good number of other ones, but I want to get through it. First we're going to train individual actions, and then we're going to train tricks. This is a trick I call sitting pretty. It has two steps: sit on the mat and cross your legs. It's dumb. It brings me great pleasure. To get her to do that, I needed to teach her both things independently, work through it a large number of times, and then work them together. FPGAs are similar. There's not that much in an FPGA; we can sort this out. So we need to think structurally about design. What is an FPGA? Well, roughly, it's tiles. They're organized so that you get a pair of BRAMs, each 18 kilobits, with a 36-bit read/write interface. You get two DSPs that, for some reason, even though each is a MAC, have four ports in and one port out. That's because it's not just a MAC. Then I have some number of LUTs and a local crossbar. My back-of-the-envelope for the part I was looking at was about thirty 5-to-2 lookup tables and about thirty 1-bit state elements. So there's not a whole lot in this tile. But this is your view of the world: I get a DSP that gets to talk to these BRAMs. I do get to go north, and I get to go south, because they've designed this cookie-cutter: they took the shape and stamped it out a bunch of times, so going north and south is really easy. There is a global interconnect; it's truly gross and awful. Some minor misconceptions that I'm not really going to get into: hopefully at this point you're convinced that an FPGA is not a Verilog accelerator. It looks a lot more like somebody stopped doing their job on a systolic array partway through and said the rest is up to you. There are actually slices; it's not just LUTs and registers. The tool loves to lie to you. The slices actually have a macro with tons of cool stuff buried in it. It's worth going and looking in the manual at what it does. It also changes every generation, so you need to reread the manual every year.
Wires are expensive, not gates. But for some reason the tool only reports the gates; it never reports the wires. It drives me nuts. You tell them, and they go, "we can't figure out how to report it." I know it's a problem, and I have ASIC tools that tell me how it's a problem. It's not that hard. Global wires are worse. So what's my takeaway? Pipeline your DSPs. For some reason everybody wants to do a single multiply-accumulate per cycle. The macro that's hard-baked in there has four pipe stages in it. If you ask for the four-pipe-stage DSP, you get a four-pipe-stage DSP. That is how you get this to run at 600 megahertz. If you do not do that, it runs at 100 megahertz. That's sad. That's a regret factor of six. Also really neat: it has a pre-adder. That seems kind of helpful, because if you're doing complex multiplies, sometimes you do a pre-add. I'm pretty sure that's the reason it's there. So that was a caricature. That's not what it really is. It's this. It's way worse. There are all these wires going everywhere. There are muxes. There are extra flip-flops I didn't even talk about. The first picture came from the front of the manual, where they're trying to convince you this is a good thing. This one came from the back of the manual, but it's more helpful. Pipeline your BRAMs. Everybody wants single-cycle memory lookups. Some people like a memory lookup while they're doing math. That's craziness. These have input flops for the addresses and output flops for the data. They also have input flops for the incoming write data. If you do not use them, it runs dog slow. This was actually the thing that killed us on most of our image processing problems, probably because we had a ton of these everywhere. But this will be one of the biggest contributors to running slow, because it would just never occur to you that memory accesses are pipelined. This is how your caches work: in most caches, L1 is two pipe stages.
Your L2 is like four or five pipe stages. This is just how memory is now. If I build a custom ASIC, I'm going to try to make this as deep as humanly possible. So when I'm thinking structurally, my goal is to maximize resource utilization. I roughly have three assets. I have DSPs, which have rates and locations. This is kind of a weird view of the world, because I have DSPs that are local to memories and DSPs that are not, so that specific DSP is a resource. BRAMs are weird, because they have capacity, ports, bandwidth, and location, and oftentimes I'm picking and choosing which of those I want; we'll talk about how that works with the FFT. For DRAM, I have bandwidth and I have ports. It turns out I often need to share the DRAM with other people, and it's really hard, so sometimes I have to build a controller just to arbitrate who gets the DRAM. Generally, my goals are: spatial utilization between 70 and 90%. Below 70%, somebody's not doing what they're supposed to be doing. At 70%, you're at the point where most industry people will say you did a good job, but you didn't. Roughly, the difference between 70 and 90 is space: if you know where things are, you can get to 90; if you didn't want to look at where things are, you're going to sit at 70. Temporal utilization: if you've allocated something, use it 99% of the time. Most people get high spatial utilization, but everything sits idle. What was the point? And then minimize your regret: don't utilize something just to run six times slower than you would otherwise. That is sad. So how do we plan a trick? Let's do an in-place FFT: a radix-2 butterfly on 32-bit floats. In order to service this, I need bandwidth to support 128 bits in and 128 bits out. So the limiting factor is going to be the bandwidth on my BRAMs: one, two, three, four.
Now, I could tie them straight in, but I have a problem. In the early part of the FFT, it's an even-odd calculation: I'm going to grab every other element. So I could stride the accesses. But at the back end, they're all going to be on the left side, so now I have a problem. They call this the conflict problem in FFT allocation, and it's really hard to do in-place FFTs this way. The trick is that as you write things back, you move them. If you've dealt with the Xilinx accelerators, the way they solve this problem is to unroll it in space and actually have a dedicated unit to reshuffle the data. This is also going to be super deeply pipelined: this part is going to be three cycles; this is roughly going to be 12 cycles; and this, in order to get 600 megahertz out of the programmable logic (because all of this needs to be implemented in programmable logic, in a set of state machines controlling that sequencer), is probably going to be 10 cycles. It's not going to make any sense that it's 10 cycles. But roughly, every time you have a 5-to-2 lookup table, you have a register, so if you don't use the register, you're throwing it away. You always want to deeply pipeline your programmable logic. If I could leave you with any one big takeaway about how to build these accelerators, it's that on the FPGA, they want to be pipelined in an irritating way: like two or three logic levels per pipe stage. It sounds really weird to say as an ASIC person, because there I shoot for 20 logic levels per pipe stage; that's sort of a good place to be. But this really wants two or three logic levels, maybe four at the most. Part of the reason is that the logic is not expensive, the wire is, and any time you touch the logic, it has to run around some crossbar to get where it's going. So now I'm going to end with sort of an argument, to get your opinion and hear where you are, or to say something controversial.
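The bank-conflict problem described above is easy to demonstrate. A toy model (assuming four BRAM banks and a naive bank = address mod 4 mapping; the 16-point size and the mapping are illustrative, not from the talk): whenever a stage's butterfly stride is a multiple of the bank count, both operands of every butterfly land in the same bank and the two reads serialize.

```python
# Toy model of in-place radix-2 FFT bank conflicts with four banks.
BANKS = 4

def conflicting_butterflies(n, stride):
    """Count butterfly pairs (i, i + stride) whose two operands fall in
    the same bank and therefore cannot be read in the same cycle."""
    pairs = [(i, i + stride) for i in range(n) if (i // stride) % 2 == 0]
    return sum(1 for a, b in pairs if a % BANKS == b % BANKS)

assert conflicting_butterflies(16, 1) == 0  # adjacent pairs: no conflicts
assert conflicting_butterflies(16, 4) == 8  # stride 4: every pair conflicts
assert conflicting_butterflies(16, 8) == 8  # stride 8: every pair conflicts
```

This is why the write-back has to move the data: pick the destination bank so that the next stage's stride never puts both operands of a butterfly in one bank.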
You wouldn't have had to listen to everything I just said if there were a good Verilog library, right? You could blow me off and go, "dude, there's this library," and ignore everything I said. You don't actually need to know how to write a really good FFTW. I mean, a really good FFT. At this point I just say the library name, FFTW, instead of actually thinking FFT; it's one-to-one in my head now. So why isn't it like that for Verilog? Why can't we use best practices from software? Why can't it have compact APIs? Why can't it have parametric calls? Why can't we parameterize these things? Why isn't it well documented? I think there's something about electrical engineers, and specifically digital engineers, and documentation. I hate doing it. My students hate me making them do it. It's all sorts of awful. And I have yet to see anybody actually benchmark code they've given out as Verilog, but that's standard practice in a software library. So these are all things where, you know, how do we get partway there? How do we even figure this out? And there are really awful catch-22s built into this too. Right now, anything free in terms of Verilog is crap. Unmitigated, awful crap. If it's free, it's bad, right? It's like finding a donut in the garbage: if they gave it away, they thought it had no value. The other problem is that nobody will use anything they find in the wild, even if they think it's good, because there's a chance it wasn't used in a tape-out. So if you want to argue with me more, or talk with me more, about how awful FOSS hardware is, I will be around. So thank you. Thoughts, concerns, hopes, dreams, desires? I think we have quite a queue before the satellite talk. Yeah, go ahead. So certainly a lot of the devices that couple the FPGA with the DACs directly, so that you're streaming from the DACs directly into the FPGA, solve a lot of these problems.