I'm excited to present here. I'm Michael Taylor, a professor at the University of Washington, also known as UW. And this is Max Rutenberg, one of my PhD students, who will be presenting the second half of the talk. The talk is entitled "The HammerBlade RISC-V Manycore," and it's essentially a programmable, scalable RISC-V fabric that in some sense is equivalent to a GPGPU, but is completely open source.

For motivation, especially in this room: we're experiencing an open-source renaissance. We have open-source ISAs, open-source CAD tools, open-source processors, open-source libraries and RTL. I think we're going to see an enormous transformation in the industry because of all of this open-source work in hardware. At the same time, we have all of these new application domains that are enabled, essentially, by Moore's law winding down. The key technologies for enabling these domains are the development of new DSLs, domain-specific languages, which make the process of compiling an application and getting it to run in parallel feasible, and the availability of new parallel compute fabrics. Those are the things that will give us the energy efficiency we need to get high performance within a given thermal envelope. The HammerBlade Manycore is seeking to be the base class of the parallel compute fabrics needed for these new application domains.

So what is the HammerBlade Manycore? It's a highly programmable, highly energy-efficient spatial fabric for mixed sparse/dense compute. The idea is to address not just codes that would run reasonably well on a traditional GPU, but also new codes, like graph codes, that are much less well behaved and much more challenging to get parallel speedup on. At the very heart of the Manycore we have a super-high-efficiency compute tile: a one-instruction-per-cycle RISC-V engine. Each tile has an instruction cache and a local data scratchpad. You can adjust the size of these memories, but generally you want them to be small, because the smaller they are, the more cores you can fit on the chip. So there's a trade-off there, and we'll show you how the architecture lets you flexibly change that trade-off. Each core also has an FPU and a little router to talk to the other cores. And it's very scalable: you just stamp out as many cores as you want. If you have a given silicon area, you stamp them out until you fill up the area.

So we have very good efficiency for the Manycore. This is a picture of a floorplan in TSMC 16 nanometer, and this is a die photo from the actual chip — a chip we presented at the fourth RISC-V Summit, I think it was in 2017. What's amazing about this core is that even with super-small instruction memory and data memory, the memories themselves occupy 64% of the tile area. So it's a kind of existence proof: you could try to refactor this core and squeeze it down a little more, but the maximum improvement you could possibly get would be about 36%. It's so small, in fact, that you can fit 40 of these per square millimeter in 16 nanometer, and at 7 nanometer you'd be able to fit about 120 per square millimeter.
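As a rough worked example of what that density implies (the 200 mm² array size here is a hypothetical figure, not one from the talk): at 40 tiles per square millimeter in 16 nm, a 200 mm² compute array would hold about 40 × 200 = 8,000 tiles, and at 120 tiles per square millimeter in 7 nm, roughly 120 × 200 = 24,000 tiles.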
Zooming out, we have this array of cores, and we need to integrate them into a parallel memory system. The architectural model is that you have the sea of cores, and at the edge you have these L2 victim caches, which are then connected to many parallel memory channels. It could be, for example, HBM2 — in GPU systems you can have 64 parallel memory channels these days. These caches are also adaptive and can essentially change their behavior on the fly as you learn more about the workload and the input data set.

Now, it's pretty easy to create a sea of cores; the real special sauce is how you weave these things together. The interesting thing about how this is done in our system is that every single memory location — every local core scratchpad, every location in the L2, every location in the DRAMs — is addressable by all the cores. That means the cores can very easily collaborate just by doing load and store instructions, which will automatically get routed over to whichever core or cache owns that particular location. A core can issue many repeated loads and stores; those go out into the network in parallel and may come back out of order, and the core can keep executing instructions. It's only when it actually tries to use the register that is the result of a load that it might stall waiting for the data.

One of the concepts we have for mapping software onto the manycore is that we have this big array of tiles and we want to group them together, for two reasons. One is that we may want to execute many different programs at the same time on the manycore. So we have this concept of a tile group, and here we show a 4x4 tile group, where we've allocated 16 tiles on the manycore, and the computation running inside the tile group can share all the collective memories of those tiles. This is a way for you to manage not only how much parallelism you have, by using more or fewer cores, but also how much working memory you have. If you have a big working set, you can use a larger group of cores to increase the amount of local memory available. And because those cores are located near each other, the latency to access that data is very low, whereas if you go off-chip, the latency would be higher.

To program this, we of course have DSLs, which are more user-facing, but for writing very high-performance library code underneath those DSLs we have something we call CUDA-Lite, which is very analogous to CUDA. The idea is that there's a big developer base that is familiar with CUDA's programming constructs, and we'd like to take the knowledge that people have embedded in CUDA code and relatively easily move it over to executing on the manycore. This is an example comparing a snippet of device code that runs on CUDA and the equivalent code in CUDA-Lite. It's fairly one-to-one in terms of being able to move code over. There are certainly some differences in terms of optimization and what would be preferable on one architecture or the other, but in terms of getting something up and running, it's fairly straightforward.
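To make that comparison concrete, here is a minimal sketch of what a vector-add device kernel might look like in a CUDA-Lite style. The identifiers (`__bsg_id`, `bsg_tile_group_barrier()`, and the kernel naming) are illustrative stand-ins, not necessarily the exact CUDA-Lite API; the point is the CUDA-like pattern of self-assigned work followed by a tile-group-wide synchronization.

```cpp
// Hypothetical CUDA-Lite-style device kernel (illustrative names, not the exact API).
// Each tile self-assigns a slice of the vector, computes it, then joins a
// tile-group barrier so the group finishes as a unit.

extern int __bsg_id;                       // assumed: this tile's index within the tile group
extern int __bsg_tiles_X, __bsg_tiles_Y;   // assumed: tile-group dimensions

extern "C" void bsg_tile_group_barrier();  // assumed: group-wide sync primitive

extern "C" int kernel_vector_add(const float *A, const float *B, float *C, int N) {
    int num_tiles = __bsg_tiles_X * __bsg_tiles_Y;

    // Strided self-assignment of work, analogous to a CUDA grid-stride loop.
    for (int i = __bsg_id; i < N; i += num_tiles) {
        // A, B, and C live in the global address space, so these plain loads
        // and stores are routed to whichever cache or scratchpad owns them.
        C[i] = A[i] + B[i];
    }

    bsg_tile_group_barrier();  // wait for every tile in the group
    return 0;
}
```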
So of course, you have the manycore, and these cores are highly specialized for dense compute, but we would like to have a control processor of some form. There are two flavors of this in our system. The first is that we support PCIe-attached acceleration: we have a manycore chip on a PCIe board, and we connect it to a Xeon server. We have this up and running on AWS F1 — in F1's case, there's an FPGA board that is simulating our manycore, and we have our host code running on the Xeon, so you have x86 code talking to the manycore. In the longer term, our goal is to have a BlackParrot integrated on the same SoC as the manycore. For example, right now we would run PyTorch on the Xeon and it offloads calls to the PCIe board; eventually, when PyTorch runs on RISC-V, we would run it on a BlackParrot.

I'd like to mention that we've been through many iterations. We have a very agile methodology: we do tapeouts, we build software on the devices, we gain experience, and then we develop the next generation. We're actually entering our fifth silicon iteration of the manycore. We started out in 180 nanometer, and then we did two chips in 16 nanometer that had 511 RISC-V cores. That broke the world record for RISC-V performance — it should probably still hold it — and also for CoreMark, for any ISA. The latest system, Hammer1, is where we've really been focusing on programmability improvements to grow the user base, and also on floating-point support.

In addition to the manycore itself, my group has a website, bjump.org, with a bunch of stuff that might be of interest to you if you're developing open-source hardware. One of them is something we call BaseJump STL — Dan mentioned this in the last talk. This is a library of very high-quality implementations of almost every hardware primitive you can think of: routers, arbiters, caches, networks, high-speed links to go off your chip. We've developed all of these over the last 10 years in our group, we've really iterated on their quality and results, and it's all been validated. We also have open-source motherboards: you design your ASIC to a particular interface standard that we've specified, and then you can take your ASIC, plug it into a socket, and it will just work with this board — you don't have to design a board in order to use the silicon you developed. We also have open-source BGA packages, which have much higher pinout than you can get off the shelf. So if you design your chip to this ASIC-socket interface standard, you can get much higher bandwidth out of your piece of silicon than you could on your own.

To wrap up the hardware part, I also wanted to mention that we have developed a methodology for taking cores out of this fabric and replacing them with accelerators. In fact, our collaborators at Cornell have been developing several accelerators and have already done proofs of concept with hybrid manycore-accelerator systems. Now I'm going to turn it over to Max, who's going to talk about some of our HammerBlade software stacks. Thank you.

So yeah — hi, I'm Max. I'm going to give an overview of our software stack.
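To give a feel for the PCIe-attached offload model, here is a minimal host-side sketch of what launching a manycore kernel from the Xeon (or, later, a BlackParrot) might look like. The function names (hb_device_init, hb_device_malloc, hb_kernel_launch, and so on) are hypothetical placeholders modeled on the CUDA-style flow described in the talk, not the actual HammerBlade host runtime API.

```cpp
// Hypothetical host-side offload flow (placeholder API names).
#include <cstdint>
#include <vector>

// Assumed runtime calls -- stand-ins for whatever the real host library provides.
struct hb_device;
hb_device *hb_device_init(const char *binary_path);      // open device, load RISC-V binary
uint64_t   hb_device_malloc(hb_device *d, size_t bytes);  // allocate device DRAM
void       hb_memcpy_to_device(hb_device *d, uint64_t dst, const void *src, size_t bytes);
void       hb_memcpy_to_host(hb_device *d, void *dst, uint64_t src, size_t bytes);
void       hb_kernel_launch(hb_device *d, const char *name,  // enqueue kernel on a tile group
                            int tg_x, int tg_y, std::vector<uint64_t> args);
void       hb_device_finish(hb_device *d);

int main() {
    const int N = 1024;
    std::vector<float> A(N, 1.0f), B(N, 2.0f), C(N);

    hb_device *dev = hb_device_init("kernel_vector_add.riscv");

    uint64_t d_A = hb_device_malloc(dev, N * sizeof(float));
    uint64_t d_B = hb_device_malloc(dev, N * sizeof(float));
    uint64_t d_C = hb_device_malloc(dev, N * sizeof(float));

    hb_memcpy_to_device(dev, d_A, A.data(), N * sizeof(float));
    hb_memcpy_to_device(dev, d_B, B.data(), N * sizeof(float));

    // Launch on a 4x4 tile group, mirroring the tile-group concept above.
    hb_kernel_launch(dev, "kernel_vector_add", 4, 4, {d_A, d_B, d_C, (uint64_t)N});

    hb_memcpy_to_host(dev, C.data(), d_C, N * sizeof(float));
    hb_device_finish(dev);
    return 0;
}
```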
So one of the primary goals of HammerBlade is programmability, and portability of code that already exists in higher-level frameworks. CUDA-Lite, our low-level programming API, is the building block for these higher-level frameworks. Our collaborators at Cornell are working on a backend for PyTorch for HammerBlade. We at UW are also working on DGL, which is a Python library for applying machine learning techniques to graph-structured data, and on TVM, which is a machine learning IR. And we're also targeting GraphIt, which is a DSL for high-performance graph analytics.

A quick introduction to GraphIt: its main feature is that it decouples the semantics of the program from the optimizations that are applied. This is important because with graph algorithms the field of optimizations is pretty wide, and they don't always work on every input — sometimes the optimizations that work are highly dependent on the input. The two main data types are edge sets and vertex sets, and then there's a scheduling language, which you can see down here, which specifies which optimizations are applied.

We ported this to HammerBlade, and here's an example. This snippet of code is taken from BFS: the frontier is updated by traversing the existing frontier, applying an update function, and then creating a new frontier from the nodes that were updated. You can see in the scheduling language that we've asked for the dense-pull direction and that we want to generate HammerBlade code. Down here you can see the host side that is generated — this runs on the x86 or the BlackParrot — and it's just the outer loop; we're offloading the work. Here's the generated RISC-V code; it's C++ code. The work is self-assigned — we have a local range function — and then we do a parallel dense update. Finally, when all the work is done, we do a tile-group sync.

Graphs are challenging because they are typically memory-intensive, so taking full advantage of all your cores can be tricky: they might spend a lot of time waiting for memory. One way you might address this on HammerBlade is to do something similar to what we might do on GPUs: tile our memory accesses. In this case, the proposed plan would be to take the vertex data and access it in a blocked fashion, so a group of tiles would pull one block of vertex data into their local memory and then do sparse updates from the edges. You would keep the edges partitioned across the DRAM channels, which maximizes your memory transfer rate, and you would restrict your updates to a given vertex range. This keeps your accesses local, keeps you from thrashing your cache, reduces the amount of time you spend waiting for memory, and maximizes the parallelism of your cores. A rough sketch of this blocking idea follows below.

So now I've given you an overview of mapping an application to HammerBlade; I want to tell you how you can get involved. Our main simulation environment is a C/C++ co-simulation environment, which allows you to simulate your entire stack: the host software and the manycore software. We use Synopsys as our RTL simulator. I should note that most of the code and RTL in our group is Verilator-friendly, so it would be a straightforward and solid contribution to get Verilator working. The code is all up on GitHub.
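Here is the promised sketch of the blocked-update idea. It is not the actual GraphIt-generated HammerBlade code; it's an illustration under assumed data structures (a CSR-style edge array in DRAM and a per-tile-group scratchpad buffer), and names like local_vertex_block and BLOCK_SIZE are hypothetical.

```cpp
// Illustrative sketch of blocked vertex updates (hypothetical names and layout,
// not the actual GraphIt-generated HammerBlade code).
//
// Idea: a tile group stages one block of vertex data in its local scratchpads,
// then streams edges (partitioned across DRAM channels) and applies only the
// updates whose destination falls inside the staged block.

constexpr int BLOCK_SIZE = 1024;   // vertices staged per tile group (assumed)

struct CsrGraph {                  // assumed CSR-style layout in DRAM
    const int *offsets;            // per-vertex edge offsets
    const int *neighbors;          // destination vertex of each edge
    int num_vertices;
};

void blocked_update(const CsrGraph &g, const float *vertex_data_dram,
                    float *local_vertex_block /* tile-group scratchpad */,
                    int block_start) {
    int block_end = block_start + BLOCK_SIZE;

    // 1. Stage this block of vertex data into local memory (fast, near the tiles).
    for (int v = block_start; v < block_end && v < g.num_vertices; v++)
        local_vertex_block[v - block_start] = vertex_data_dram[v];

    // 2. Stream edges from DRAM; apply only the updates that land in the staged
    //    block, so every update hits local memory instead of a remote location.
    for (int src = 0; src < g.num_vertices; src++) {
        for (int e = g.offsets[src]; e < g.offsets[src + 1]; e++) {
            int dst = g.neighbors[e];
            if (dst >= block_start && dst < block_end)
                local_vertex_block[dst - block_start] += 1.0f;  // placeholder update
        }
    }
}
```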
You clone a repository called Blade Runner, you get the related sub-projects, and then there are instructions in the README about how to get up and running. If you have the tools and the environment, it's fairly straightforward. We also support deployment on AWS; for that you need the Xilinx Vivado tools, but it's all part of the same project, and we have detailed instructions in the README on how to build the FPGA image, how to build the machine image, and how to get up and running. The machine image will have everything: runtime libraries, development tools for RISC-V, the FPGA bit image, all of it.

Lastly, I'd like to suggest some directions you could take to help contribute to the HammerBlade software stack. All of the software frameworks we've ported are works in progress, everything down to CUDA-Lite. But there are some other really interesting directions we haven't explored yet. Halide is for image processing; we haven't looked at that yet, but we're pretty confident it would be a good fit. TensorFlow — right now we have PyTorch, but TensorFlow would also be excellent. FFTW is a library for doing FFTs on CPUs, and there's also cuFFT for CUDA; we would really love one for the HammerBlade Manycore as well. On top of that, Cornell has been working on building their own accelerators and adding them to our mesh, and it would be great for people to add their own accelerators and expand our application domains.

So we have a full-stack team that works very, very hard on this project, and I'd like to thank them before concluding. On behalf of the HammerBlade team, we salute you all, we really hope you contribute, and thank you very much.

So the question was: how many of these can you fit on an FPGA? We haven't done too much with small FPGAs, but we can fit 192 cores on a big FPGA, so a reasonable number of cores would fit on a small FPGA.