Good morning. My name is Luca Carloni. I'm a professor at Columbia University in the City of New York, and I'm here with Davide Giri, who is a PhD student and one of the architects of ESP. ESP is an open-source research platform for the design and programming of systems-on-chip. It's the result of over seven years of research and teaching at Columbia University, and we made it publicly available under the Apache 2.0 license a few months ago. On the ESP website you will find not only the current release, as we keep making new components available, but also links to our Twitter and YouTube channels, where we have an ongoing set of hands-on videos and tutorials on how to use ESP.

So, why ESP? ESP is really the result of the belief that motivated our research: across computing systems, from the edge to the machines in the cloud and the data center, even to supercomputers, you are going to see more and more engines that are heterogeneous. Heterogeneous means that they are made of components of a different nature: first and foremost, general-purpose processors that execute software, and specialized hardware accelerators that provide higher performance and better energy efficiency than software for key computational kernels. Now, when you integrate heterogeneous components, and in particular specialized hardware accelerators with general-purpose processors, the task you face is hard. Doing so in a scalable way is very hard, and keeping the system simple to program is even harder. The ultimate goal of ESP is to make this easy.

As I said, ESP is an open-source research platform. A platform is the combination of an architecture and a methodology, and ESP combines a scalable architecture with a flexible design methodology. In order for ESP to help SoC designers and developers integrate heterogeneous components and take care of the hardware/software interface, we leverage a variety of different design flows, and in particular we leverage components from the open-source community, both hardware components and CAD tools, which we combine with commercial tools. The ultimate vision of ESP, which is fairly ambitious, is to allow application developers to design their own SoCs.

Today, you can take advantage of, and contribute to, a library of heterogeneous components, and you can develop new components, particularly accelerators, with different design flows; we don't say that you have to use just one. For instance, you can develop components with a traditional RTL design flow: you specify the hardware in Verilog, VHDL, or Chisel, and then use logic synthesis to generate your component. You can also, and for accelerators we advocate this, develop components at a higher level of abstraction, using C or SystemC together with commercial high-level synthesis tools that we combine with our automation flow. Also, for one particular domain, which is an important one, namely embedded machine learning and particularly inference, we leverage another open-source project, one to which we contribute but which we do not lead, called hls4ml. We have integrated hls4ml with ESP.
That means you can leverage hls4ml to go from a model for embedded machine learning written in Keras/TensorFlow or PyTorch, automatically generate a C/C++ representation that can be synthesized with Vivado HLS, and then leverage ESP to integrate the result into a bigger system. Another thing you have in ESP is that the library allows you to take advantage of open-source components coming from different sources. For instance, we have integrated the RISC-V Ariane processor from ETH Zurich, and we have also integrated the NVDLA accelerator, which comes from NVIDIA. ESP comes with a graphical user interface that allows you to do interactive floor planning, and it has capabilities for rapid FPGA-based prototyping using Xilinx flows.

ESP is efficient because the methodology has been developed together with the architecture. The architecture is tile-based, and we have been working on it for several years. This is an example with one particular matrix of tiles, a 4x4 grid. You have different tiles: the main tiles are accelerator tiles, processor tiles, and memory tiles, and they are connected by a multi-plane network-on-chip which is highly configurable. Overall, you get a system that is distributed, scalable, modular, and heterogeneous, and that gives processors and accelerators similar importance. This is a very big difference with respect to other offerings in the RISC-V community in terms of integration platforms: ESP is system-centric, not processor-centric.

In the processor tile, the choice you have at design time is between the RISC-V Ariane 64-bit core and the SPARC 32-bit core. These cores come with a level-1 cache, and we wrap them with a tile socket that adds a level-2 cache as well as capabilities for coherence and interrupt requests. This integration is completely transparent to software, which means that no ESP-specific patches are necessary in order to boot Linux.

The memory tile is very important. At design time you can decide how many memory tiles you want, each with an interface to external DRAM. You can have one, two, or four, change that with push-button ease, and the last-level cache and directory are automatically generated and distributed for you. You can also have different degrees of coherence for the accelerators.

The accelerator tile is where a lot of the innovation is. Here, the accelerator designer focuses on the design of the datapath and of the private local memory of the accelerator. These accelerators are loosely coupled hardware blocks that execute coarse-grained tasks, exchanging a lot of data with main memory, typically through a DMA channel (a sketch of this structure follows below). DMA and virtual-memory services are implemented in the socket that encapsulates the accelerator, which can be designed with any of the very different design flows I mentioned before. That means the developer of the accelerator can focus on building the best hardware for the datapath and the private local memory, knowing that they can rely on services that are automatically generated for them; you don't have to reinvent the wheel to get memory-mapped configuration registers, DMA, virtual-memory translation, and interrupt requests. Now, one important point here: the moment we integrate an accelerator that was not designed with the flows we provide, some of these capabilities may already be implemented by the accelerator itself.
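Before turning to a concrete integration example, here is a minimal sketch of the load/compute/store structure that such a loosely coupled accelerator follows, written as plain C in the style you would feed to high-level synthesis. All names here (accelerator_top, plm_in, PLM_SIZE) and the toy kernel are illustrative assumptions, not the actual ESP templates; in the real flow the generated skeleton provides this structure, and the socket provides the DMA behind the load and store phases.

```c
/*
 * Illustrative sketch of a loosely coupled, ESP-style accelerator
 * written for high-level synthesis. Names and the toy kernel are
 * hypothetical; ESP's generated skeleton provides this structure.
 */
#define PLM_SIZE 1024          /* capacity of the private local memory */

static int plm_in[PLM_SIZE];   /* private local memory, input buffer  */
static int plm_out[PLM_SIZE];  /* private local memory, output buffer */

void accelerator_top(int *mem, int len, int offset)
{
    /* Load phase: the socket's DMA service streams data from
       main memory into the private local memory. */
    for (int i = 0; i < len; i++)
        plm_in[i] = mem[offset + i];

    /* Compute phase: the designer-provided datapath; a toy
       element-wise kernel stands in for the real one here. */
    for (int i = 0; i < len; i++)
        plm_out[i] = plm_in[i] * 2;

    /* Store phase: DMA the results back to main memory. */
    for (int i = 0; i < len; i++)
        mem[offset + len + i] = plm_out[i];
}
```

The point of this structure is that the compute phase only ever touches the private local memory, so the designer's datapath stays decoupled from how data actually moves through the rest of the system.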
So, for example, when we integrated the NVIDIA NVDLA accelerator, some of the configuration registers and the DMA capability were already present. The good thing is that our socket is modular, so you can decide at design time which services you want to have or not. This was an important moment, because we could take a fairly complex, coarse-grained, loosely coupled accelerator like the NVDLA and integrate it into the rest of the system, on the network-on-chip, like the other accelerators we had.

I mentioned services. Some of these are low-level services, but the principle of having tiles with sockets that encapsulate them, and that decouple the design of the tile from its integration with the rest of the system, holds across all types of tiles. So you have various services that you can decide at design time whether to add or not, and some of them give you reconfiguration capabilities at runtime. For instance, you can choose different coherence models for the accelerators, from non-coherent all the way to fully coherent if you really want it. You can also have, at the granularity of each tile, dynamic voltage and frequency scaling and monitoring capabilities, and you can configure the network-on-chip to set up data flows within the network at runtime.

Now, one final thing about the software. You have a hardware socket, and correspondingly you have a software socket, which can be automatically generated from templates: you get the device driver as well as a unit-test application. Invoking an accelerator is fairly easy: there is an ESP library with system calls that invoke the software in the device driver. In an application that can invoke different functions, all you have to do is change one line of code in your software to have a function executed in hardware, on the specialized accelerator, as opposed to executing in software on your processor (a sketch of this idea follows below). This is the key for design space exploration.

Now Davide will take over and give you a quick demo, which is a shorter version of a tutorial that we have available online. You will see the whole vertical flow, going from the design of an accelerator in C or SystemC down to FPGA emulation. In the first step, shown in blue, ESP generates almost all the code base that you need: your accelerator skeleton, your testbench, your device driver, and your test applications. In the next steps, the red ones, it's the designer who has to fill in a few things in the code of the accelerator, mainly the computation part, plus the data preparation and data validation in the testbenches. At that point you have the accelerator, and we have make targets for simulation and for running the HLS.

This is the first step; I will skip through it pretty fast, but the idea is that you input some characteristics of your accelerator and you get all the code generated. At that point you fill in the missing parts, the computation and some data validation. Now you have your accelerator and you run the synthesis: from C or SystemC you generate your RTL, and then you're already ready to run the unit-test simulation. This simulation target runs the behavioral simulation of the C or SystemC, and then it also runs the simulation of the generated RTL. In the case where you did some design space exploration, so you have multiple possible implementations of your accelerator, the simulation target is going to simulate all of them.
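To make the "change one line of code" point from earlier concrete, here is a hedged sketch of what such an application looks like. The function esp_run_mac() stands in for the call into the ESP library and device driver; its name and signature are illustrative assumptions, not the authoritative ESP API.

```c
/*
 * Hedged sketch of the "one line of code" idea: the same kernel
 * can run in software or be offloaded to an accelerator by
 * swapping a single call. esp_run_mac() is a hypothetical
 * stand-in for the ESP library call.
 */
#include <stdio.h>

/* Software implementation of the multiply-accumulate kernel. */
static void multiply_accumulate_sw(const int *a, const int *b,
                                   int *out, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    *out = acc;
}

/* Stand-in for the ESP library call that offloads the same task
   to the accelerator through its device driver (illustrative). */
extern void esp_run_mac(const int *a, const int *b, int *out, int n);

int main(void)
{
    int a[64], b[64], out;
    for (int i = 0; i < 64; i++) { a[i] = i; b[i] = 1; }

    /* Design space exploration: swap this single call to move
       the kernel between software and hardware. */
    multiply_accumulate_sw(a, b, &out, 64);
    /* esp_run_mac(a, b, &out, 64);   <-- hardware version */

    printf("result = %d\n", out);
    return 0;
}
```

Because the software and hardware versions share the same interface, moving a kernel between the processor and an accelerator is a one-line edit, which is exactly what makes design space exploration cheap.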
Once you're done with these simulations, you have your accelerator, you have tested it, and you're ready to move forward and start working on the system integration and then on the FPGA emulation, which is the next part. Online, we divided this tutorial in two parts: the first part is mainly focused on the accelerator design, and in the second part we talk more about the SoC and the FPGA emulation flow.

You'll see that now we move inside the folder called socs, and then you choose a sub-folder which is specific to the FPGA target that you want to work with; in this case it's a Xilinx VCU118. Then you open the ESP GUI. In this case we are doing a very simple SoC, only a 2x2 grid, but you can configure it and make it bigger. The GUI immediately knows the accelerators that you have available in your system, so we choose the multiply-and-accumulate accelerator that we just generated, and with just a click we have generated the RTL for the whole system. In this case we have the Ariane core in the CPU tile. We are already ready to generate the bitstream, so we launch logic synthesis with Xilinx Vivado, run it in the background, and move to a new window for the full-system RTL simulation. Here we compile a bare-metal application that will run on the Ariane core in simulation and will test the accelerator tile. This is running in batch mode; with a slightly different target you can open the GUI of the simulator, in this case ModelSim, but we also support other RTL simulation tools. You see that the first step is scanning the device tree to find the accelerator, and then it invokes the accelerator to perform a simple task; a pretty short test, for time reasons.

The full-system simulation went well, and let's say we already have the bitstream, so now we are ready to test on FPGA. We already compiled the bare-metal application, and here we compile the Linux user application that we will use for testing: so, bare-metal application, Linux application, and then we compile Linux itself. At this point we open a serial interface for communicating with the FPGA and we program the FPGA with the bitstream. We first run the bare-metal test, which is the same one we ran in simulation; you can probably recognize the outputs. Then we are ready for testing the user-space application in Linux, so we boot Linux. We compiled the Linux image earlier, and what you are seeing now is the transfer of the Linux image from the host machine to the FPGA; Linux will be running on the RISC-V core that we put in our SoC on the FPGA. During the boot you will see that if there are accelerators available in the system, they are discovered and their device drivers are registered. In the end, we are ready to invoke the accelerator with a simple test to verify its functionality (a sketch of such a test follows below). So this was, very quickly, the whole flow, and I really recommend watching the full tutorial that we have online, because there are a lot more details and a lot more insights.
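As a side note on what that Linux test is doing, here is a generic sketch of a user-space unit test that talks to an accelerator through its character-device driver, in the spirit of the tests ESP generates. The device node /dev/mac.0, the descriptor struct, and the ioctl command are hypothetical placeholders, not the actual ESP driver interface.

```c
/*
 * Generic sketch of a Linux user-space unit test driving an
 * accelerator through its character-device driver. The device
 * path, ioctl command, and descriptor are hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct mac_access {         /* hypothetical configuration descriptor */
    unsigned len;           /* number of input elements              */
    unsigned src_offset;    /* input offset in the shared buffer     */
    unsigned dst_offset;    /* output offset in the shared buffer    */
};

int main(void)
{
    int fd = open("/dev/mac.0", O_RDWR);  /* hypothetical device node */
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    struct mac_access desc = { .len = 64, .src_offset = 0,
                               .dst_offset = 64 };

    /* The driver behind this call programs the socket's memory-mapped
       registers, starts the accelerator, and blocks until the
       accelerator's interrupt signals completion. */
    if (ioctl(fd, /* MAC_IOC_ACCESS */ 0, &desc) < 0)
        perror("ioctl");

    close(fd);
    return EXIT_SUCCESS;
}
```

This is also where the socket services from earlier pay off: the memory-mapped registers, DMA, and interrupt handling the driver relies on are the ones ESP generates for you.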
I'll give the floor back to Luca. Thank you, Davide. So, in summary: ESP, as we said, is an open-source research platform for the design and programming of systems-on-chip, a contribution to the RISC-V community in particular and, more generally, to the open-source hardware community. We think that what distinguishes ESP from many other important projects is that we focus on the realization of complete systems-on-chip, based on a scalable architecture that integrates heterogeneous components from different sources, using a flexible design methodology which doesn't dictate one particular design flow; we accommodate many different design flows. ESP is the result, as I said, of many years of research at Columbia, in particular seven years of research and teaching on the ESP project, building on principles that had been studied for years before that. We hope that you find ESP useful, and we hope that you will want to contribute to the ESP community. Thank you very much. Any questions? I think we have three minutes.

The first question: this is definitely very interesting and needed work to combine heterogeneous systems, but where in the modular system are the IO components? When one has to add interfaces like InfiniBand, SpaceWire, or CAN bus cores, are they treated like an accelerator? So, the question is where in the system the interface components go. First of all, we can answer with a key example: the DRAM controller. That's an interface, and it goes into a memory tile. Other IO interfaces would go into IO tiles. Then, of course, it depends on how you implement the system: if you implement it on an FPGA, you rely on the memory controller that you have in the FPGA, which is what we do; if you do an SoC, you need a DRAM controller.

Next question: does the framework work with automatically generated AXI buses? That's an excellent question. It depends. The Ariane RISC-V core, as well as the NVDLA accelerator, speak AXI. Our network-on-chip in itself doesn't speak AXI; it is more general than that. But inside the tile we have automatically generated adapters from AXI, either master or slave, to the network-on-chip. The principle is always to try to accommodate different standards and different design flows as much as possible. Of course, it is a growing battle, if you will, because there are always new interesting standards that may have to be accommodated, and for this the community can help and contribute.

Question there; I think we have one minute. Very good question: can the user change the interconnect, the NoC? In one minute: the NoC is very important, at design time and at runtime, because of the services. It is the result of a lot of research we have done at Columbia; you can check out our papers on network-on-chip design. It is a fairly simple NoC, a 2D mesh, but it is multi-plane: it allows you to support coherence, and it allows you to change the planes and configure them. Let's say you want to change the NoC because you don't want a 2D mesh but, I don't know, a 2D torus: well, you're welcome to contribute that to ESP. And it is possible, because everything is done in a compositional way: as long as you respect the interfaces with the tiles, or implement those interfaces, you can change the design of the NoC.

Maybe ten seconds for one more. Different cores? I think you're referring to processor cores, a big.LITTLE kind of thing. You can do that: right now we have two types of cores which are very different, a 64-bit core with one ISA and a 32-bit core with another ISA, and other cores can be developed. And of course, what also matters to a large extent for SoCs, particularly for embedded applications, are the accelerators. Thank you.