Okay, so welcome everybody who made it here to this room. Also welcome everybody on the live stream. We're here to talk about communication between Linux host systems and FPGA-based subsystems. My name is Alex. I'm based in Munich, Germany. I'm an electrical engineer. I've been tackling problems with software for 20-plus years, and for at least the last 10 years I've been drifting more and more towards embedded software, and especially systems which combine a Linux part and an FPGA part. These days I'm wearing two hats. I work as an R&D engineer at Mynaric. We make laser-based communication for aerospace, free-space communication, pretty cool technology. But what I've been doing for a while longer is MPSI Technologies. It's a startup with the goal to provide the embedded community with developer tools that replace repetitive and boring coding tasks with model-based source code generation. The tools are out there; in terms of adoption, there is still work to be done.

Now, on this slide I will show you a couple of scenarios, most of which I have come across, where you have communication between a Linux host and an FPGA-based subsystem. The legend is common for all: typically the purpose can either be that you want to invoke a command from the host on the FPGA and get a reply, or that you want to shovel data between the two systems, buffer transfers. In terms of bandwidth, there are different levels to it. Depending on the application, you will need more or less bandwidth; you will immediately see this in the different scenarios which I will be showing.

So a very simple but very justified reason to talk between a host system and an FPGA-based subsystem is that you want to debug what is happening in the FPGA-based subsystem in a bit more comfortable way than with JTAG, counting zeros and ones. But this is an application which doesn't require a lot of bandwidth, so you would probably pick a UART-over-USB connection. FPGAs are very good at reducing data, so you can have a source like an ADC that produces lots of data, and instead of bothering the Linux system with it, you can put the high-performance algorithm, the data-crunching algorithm, in the FPGA and just transport the result to the host system. Then, of course, there are also systems where you want, or need, a high amount of data to be transferred to the Linux host. And then you would need a high-bandwidth interface here, PCIe, and it's intentional that the connected SRAM is already shown on this graph, because typically you would go over the memory to talk to the host system, probably using DMA.

A quite common variant of Linux host system and FPGA subsystem for around 10 years now is FPGA SoC devices, which combine compute cores that run Linux and FPGA fabric in one device. That simplifies things, because then the interconnect is already given. It's typically, no, actually in all devices I know of, an AXI interconnect. But again, the principle remains the same. Something which is quite tempting with such devices is also that you look at the FPGA and see, oh, it has so many free IOs, let's route some low-level interfaces through these pins. There is nothing wrong with that, and that's where I show these additional lines where you make connections available as drivers, but they are routed through the FPGA fabric.
Finally, something quite popular recently is the so-called, well, the accelerator scenario, where the FPGA isn't even connected to an external data source like an ADC or Ethernet, something which physically provides data, but where the FPGA is only used for acceleration. I will not talk about this here, for two reasons. First of all, I know nothing about it, but second, you are then talking about high-level synthesis or the tightly integrated neural network frameworks, where the vendors will provide you with something where you don't even need to think about the fact that you have an FPGA there.

Now, the reason why we've been looking a lot at the interface between Linux host and FPGA-based subsystem is in the context of our developer tools for model-based code generation. We have a demo project, which is this tabletop 3D laser scanner. You have a turntable, then a five-megapixel camera and two line lasers, controlled by some kind of embedded system. And we have lots of variants of that system, so I list the ones which correspond to our scenario here. We have one where we have a Lattice device connected via UART over USB to a laptop, where obviously only control and results can be transferred through such low bandwidth. But then we also have a device where we chose a different architecture and pass high-bandwidth camera data directly to the Linux host, and then we need PCIe. Another one is a quite recent SoC, the Microchip PolarFire SoC, and then the usual contenders as well, Xilinx Zynq. And also some other devices, which however don't fit the pattern of this presentation.

I'm showing a bit of software functionality for this project because I will be using this project throughout the presentation. So, preview image acquisition: we are just doing binning. That's a very good algorithm to be run on the FPGA and not on a Linux system. We have another one, this is a bit hard to see, I guess: it's identifying the corners of a checkerboard. They're shown in blue, a bit hard to see, sorry. And then, for the 3D laser scanner operation, finally we have identification of the line lasers in the image, so we make differential images between laser on and laser off. Each of these algorithms can be performed either on the Linux host or on the FPGA. So we have something to play with here, and of course, as you can imagine, this will greatly vary the load on the interconnect: the less data is processed on the CPU, the less bandwidth you need.

So with that as an introduction, I would like to give you the outline of the things I want to be talking about. Similar to the network layers, in this scenario you can say you have at the bottom the physical layer, then a hardware abstraction layer, where it does not matter anymore what the physical interface is, then a protocol layer and an application layer. So I'm starting with the physical layer. These are some typical bandwidth results you can achieve without doing any tweaking. Of course you can get some high-performance chips here and there which will probably push the numbers up, but without putting in a lot of effort, this is the typical bandwidth you can achieve in the communication between the two sides. On the lower end you get UART; most embedded, I mean actually all SoCs I have seen will have some sort of UART connectivity. UART over USB is particularly interesting for debugging, so you don't have to worry about long cable lengths: USB will handle that, and then only the last piece which goes into your FPGA is UART.
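To make that concrete, here is a minimal sketch of such a UART-over-USB debug link as seen from the Linux side. The /dev/ttyUSB0 node, the 115200 baud rate and the "status" command are assumptions for illustration only, not part of the project.

```cpp
// Minimal sketch: talking to an FPGA debug UART through a USB-serial adapter.
// Device path, baud rate and command string are assumptions; adjust to your board.
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    int fd = open("/dev/ttyUSB0", O_RDWR | O_NOCTTY);   // assumed device node
    if (fd < 0) { perror("open"); return 1; }

    termios tio{};
    if (tcgetattr(fd, &tio) != 0) { perror("tcgetattr"); return 1; }
    cfmakeraw(&tio);                    // raw 8N1, no echo, no line editing
    cfsetispeed(&tio, B115200);         // assumed baud rate
    cfsetospeed(&tio, B115200);
    tio.c_cc[VMIN]  = 0;                // read() may return empty after timeout
    tio.c_cc[VTIME] = 10;               // 1.0 s read timeout
    if (tcsetattr(fd, TCSANOW, &tio) != 0) { perror("tcsetattr"); return 1; }

    const char cmd[] = "status\n";      // hypothetical debug command
    write(fd, cmd, strlen(cmd));

    char reply[128];
    ssize_t n = read(fd, reply, sizeof(reply) - 1);
    if (n > 0) { reply[n] = '\0'; printf("FPGA says: %s\n", reply); }

    close(fd);
    return 0;
}
```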
SPI similarly, and then the last three are the higher-speed interconnects. So if you use AXI Lite, which is the interconnect you have in all FPGA SoCs, you get to 50 megabytes per second, which is already quite something with 32-bit words and 100 megahertz clocks. AXI Lite always has a lot of handshaking around each word which is transferred, so it's not really optimized on that side, but it's easy to deal with. PCIe is a typical solution if you have two different devices, maybe even separate PCBs, talking to each other. You need differential pair routing and some control, but you can get to 250 megabytes per second per lane in the PCIe Gen 1 configuration. And finally, if you use the full power of the AXI interconnect, here in an example 64 bits at a time in the burst configuration, then you can get to 700 megabytes per second or higher.

Next up, hardware abstraction. From here on I'm separating between the host side and the FPGA side. To me, hardware abstraction on the host side, which is Linux, is, well, what kind of driver do you have? And the good news is that you can really get very far with simple character device drivers. The transfers are controlled by the host, the Linux host, and in the end you have open, read, write, close; that's really handy. This is possible for all three of these types of interfaces, and I just listed some typical device files you would find on your system. If you have a board with a good board support package they should already be there. And then the AXI Lite category is also easily implementable like that; you can write a very simple driver. I guess it's a bit hard to read, but if you get an address space of only four consecutive addresses, you can make a very simple handshake where you notify the FPGA side, okay, a read transfer is starting, a write transfer is starting, then you send the actual data, you implement a timeout, and in the end you can provide this as a character device driver. It's very simple. I'm mentioning this because the vendors will typically also provide you with something, and before you know it you are vendor-locked; here is maybe the moment to avoid that, because such a very simple driver can be cross-vendor. I have very good experience with that.

Then there is the advanced category, in which I count the high-bandwidth interfaces, PCIe and AXI4 in the full-bandwidth configuration, which, if you think about it, typically only make sense if you use DMA and interrupts, because, well, Linux has a scheduler, it decides when it has time, and if you want to transfer huge chunks of data it's better to write them into memory and then have the Linux side work on that data after an interrupt. UIO already has features for that integrated with PCIe, and there is also a helpful article which helped me to write a driver using this for PCIe. And what I also saw is that tomorrow there will be a talk about debugging PCI Express, so that's certainly interesting. In my case, I experimented with an i.MX 6 as the host and the Lattice device as the client, and fortunately there was, well, not so much to debug.

Now switching over to the FPGA side: again, for the simple protocols, UART, SPI and, on the next slide, also AXI Lite, it's very simply implemented. There is no need to use any vendor core for this. You can easily write, or download, I will give you the sources and references at the end, very simple VHDL modules that handle UART RX, UART TX, SPI slave and AXI Lite, so you just have to react to some of the handshake signals of the AXI Lite bus, and that is really not difficult.
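Coming back to that advanced host-side category for a moment, here is a minimal sketch of the UIO pattern: map the device's registers into user space and block on read() until the FPGA raises an interrupt. The /dev/uio0 node, the map size and the "start" register are assumptions for illustration; the real values come from your board support package and device tree.

```cpp
// Minimal UIO sketch: mmap the register window of map 0 and wait for an interrupt.
// Node name, window size and register meaning are assumptions, not a real BSP.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
    int fd = open("/dev/uio0", O_RDWR);              // assumed UIO device node
    if (fd < 0) { perror("open"); return 1; }

    // Map 0 of the UIO device (offset 0) gives access to the registers.
    const size_t map_size = 0x1000;                  // assumed size of the register window
    void* mem = mmap(nullptr, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }
    volatile uint32_t* regs = static_cast<volatile uint32_t*>(mem);

    uint32_t enable = 1;                             // (re)enable the interrupt if the
    write(fd, &enable, sizeof(enable));              // driver implements irqcontrol

    regs[0] = 1;                                     // hypothetical "start DMA transfer" register

    // read() blocks until the next interrupt and returns the interrupt count.
    uint32_t irq_count = 0;
    if (read(fd, &irq_count, sizeof(irq_count)) == (ssize_t)sizeof(irq_count))
        printf("interrupt %u: DMA buffer is ready\n", irq_count);

    munmap(mem, map_size);
    close(fd);
    return 0;
}
```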
Now for PCIe. PCIe is, on the FPGA side, a bit more advanced, because it typically connects directly to the vendor-specific gigabit transceivers, so it cannot really be fully cross-vendor. The good news is that from the three vendors these blocks which provide the PCIe handshake are free of charge, although not all of this is necessarily open source. This is what I found a couple of weeks ago, so you can at least work with this.

Now I want to show you the higher layers, not one by one, but by example: first showing how a transfer from host to FPGA could look, an interaction that's triggered by the host, and then the way back afterwards. So, starting here on the application layer, I'm using the example of a terminal which you can use to invoke commands. You basically type in a command on your host side, and then we will follow it all the way until it is received and acknowledged on the FPGA. This is a functionality where I can, in a small web user interface, click my commands together, and I pick the command stepper motor move to. If you remember the reference application with the tabletop laser scanner, it has a turntable that's driven by a stepper motor, so a natural command for the module that controls the stepper motor in the FPGA is move to. It's visible there as a command sequence, also a bit small. It has two arguments: you give an angle, and t_step would basically be how fast the motor should turn.

So from there, what has to happen next? You want to invoke this command, and in the scenario I'm showing here there is an automatically generated library on the host side which translates this command into a bytecode that is then passed to the FPGA. It is listed a bit here, in the first section of the zoom-in: there are two transactions, where you first tell the FPGA, okay, I'm invoking a command, this is the target, and I'm sending five bytes of parameter data; this is then the second transmission, and then you receive back an acknowledge if everything went well. The second command is stepper get info, that's the second block over there: you would poll the angle, which the FPGA knows because it counts the steps it puts into the stepper motor until the target is reached, and that's basically it.

So we have seen the host side, the Linux side, and now, what happens on the FPGA side? Well, you might know that in an FPGA typically everything is hierarchical, but that need not necessarily be the logical representation of the system. In this case you would start from the left-hand side: there is somewhere the AXI PS-PL switch, then you have your little FPGA VHDL module that decodes the AXI transfer. That's passed through an RX/TX block, and in the host interface module the decoding of the byte string takes place; it will then know to give a handshake, or a request, to the stepper motor VHDL module. So this is the endpoint, so to say, of our command invocation: there is a request/acknowledge signal, and then below it the parameters are passed. Yeah, so that would be a very simple example of command transfers between the Linux host and the FPGA, and for the way back I have chosen instead something which requires higher bandwidth, and that is preview image transmission.
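Before we get to that, to make the command path just described a bit more concrete, here is a hypothetical sketch of what the generated host-side library call could boil down to. The controller and command IDs, the 5-byte parameter packing and the acknowledge value are made up for illustration; the actual wire format comes from the generated code and is not shown here.

```cpp
// Hypothetical sketch of a host-side move_to(angle, t_step) invocation:
// one header transfer, one 5-byte parameter transfer, then read the acknowledge.
// All IDs, offsets and the ACK value below are assumptions for illustration.
#include <fcntl.h>
#include <unistd.h>
#include <array>
#include <cstdint>
#include <cstdio>

bool stepper_move_to(int fd, uint32_t angle, uint16_t t_step) {
    // First transfer: "command invocation" header - target controller,
    // command ID, and the number of parameter bytes that follow (5).
    std::array<uint8_t, 3> header = {0x03 /*stepper ctrl*/, 0x01 /*move_to*/, 5};
    if (write(fd, header.data(), header.size()) != (ssize_t)header.size()) return false;

    // Second transfer: five bytes of parameter data (3-byte angle, 2-byte t_step).
    std::array<uint8_t, 5> params = {
        uint8_t(angle >> 16), uint8_t(angle >> 8), uint8_t(angle),
        uint8_t(t_step >> 8), uint8_t(t_step)};
    if (write(fd, params.data(), params.size()) != (ssize_t)params.size()) return false;

    // Finally read back the acknowledge byte from the FPGA.
    uint8_t ack = 0;
    return read(fd, &ack, 1) == 1 && ack == 0x06;   // hypothetical ACK value
}

int main() {
    int fd = open("/dev/fpga_ctrl0", O_RDWR);       // hypothetical character device node
    if (fd < 0) { perror("open"); return 1; }
    bool ok = stepper_move_to(fd, 90, 200);         // turn the table to 90 degrees
    printf("move_to %s\n", ok ? "acknowledged" : "failed");
    close(fd);
    return 0;
}
```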
So, again left to right: the camera supplies five-megapixel images, which in the end implies a bandwidth of 150 megabytes per second, but then you shrink them, and you can transfer them to the host over a regular AXI Lite interface. There are some modules involved again. Starting from cam acquisition, ACQ, the third from the left, where the actual binning takes place, there is an A/B buffer situation, or ping-pong buffer, where always one buffer is being written to while the other is being read out to the host. It goes again through the host interface that handles the handshake with the Linux host, and then it's basically the reverse of what we had seen on the previous slide. This is definitely too small to read, sorry about that. So what the host does is basically poll the buffer status, that's again a command invocation if you will, and if it changes to A buffer full or B buffer full, it will initiate the buffer transfer. And then, okay, then it goes all the way to a web-based user interface, but that is definitely beyond the application layer, which is what I mean by the little table shown there in the upper right corner.

I can't deny that I'm a big fan of model-based source code generation, so for these examples I have used our open source tool to generate, from one single source of truth, one model file, all the glue code, if you want, that was required between the place where we invoke the method in C++ and where it arrives and is handled on the FPGA side in VHDL code; and also for the way back this model-based approach is quite useful. How it's done, just very quickly: we have model input files for the modular structure of the FPGA side. Here it's shown in Xilinx Vivado how this hierarchical structure then looks, and for some of those modules we say these are special modules, which we named controllers, and you can attribute a command set to each of these controllers, like the one we had seen, stepper, which had the commands move to and get info. Once we tell this to the tool, in a second model file, we say what the commands are and what the invocation and return parameters are, and based on that description it is possible to generate all the code on the C++ side and on the VHDL side to make such transfers work, commands and buffer transfers.

Here I would also like to show three different concepts of how, on the application layer, above the hardware abstraction layer, you can interact between the Linux host and the FPGA-based subsystem. The one in the first column we had seen in the stepper motor example and also the buffer example; I call it simple. You are on the Linux side and you have blocking access to the FPGA resource for all requests, for buffer transfers or commands. And in the FPGA, as you had seen, there is one host interface module, so it handles requests one at a time. An advantage of that is that it requires quite a low FPGA footprint, but as I also indicated, you may have to poll the status, like in our case, how far has the stepper motor already turned; you have to poll that all the time. That's maybe not ideal, especially in view of the fact that FPGAs are ideal candidates for parallel processing, so something like that is not really great.
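What such host-side polling looks like in practice, for the preview buffer transfer from before, could be sketched roughly like this; the device node, the status codes and the preview size are assumptions for illustration, not the project's actual API.

```cpp
// Sketch of the host-side preview loop: poll the buffer status via a command
// invocation, and when one of the ping-pong buffers is full, read the binned
// preview image through the character device. All constants are hypothetical.
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

enum BufStatus : uint8_t { EMPTY = 0, A_FULL = 1, B_FULL = 2 };  // hypothetical codes

static uint8_t poll_buffer_status(int fd) {
    const uint8_t cmd[] = {0x05 /*cam ctrl*/, 0x02 /*get_buf_status*/, 0};
    write(fd, cmd, sizeof(cmd));
    uint8_t status = EMPTY;
    read(fd, &status, 1);
    return status;
}

int main() {
    int fd = open("/dev/fpga_ctrl0", O_RDWR);        // hypothetical device node
    if (fd < 0) { perror("open"); return 1; }

    // Binned preview: assume 640 x 480 at 1 byte per pixel after binning.
    std::vector<uint8_t> preview(640 * 480);

    while (true) {
        uint8_t status = poll_buffer_status(fd);
        if (status == A_FULL || status == B_FULL) {
            // Initiate the buffer transfer for whichever buffer is ready;
            // the other one keeps being filled by the FPGA in the meantime.
            ssize_t n = read(fd, preview.data(), preview.size());
            printf("received %zd bytes from buffer %c\n", n, status == A_FULL ? 'A' : 'B');
            break;                                   // hand off to the web UI etc.
        }
        usleep(10'000);                              // back off before polling again
    }
    close(fd);
    return 0;
}
```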
The second option is callback-based: here the idea is that you launch a command, for example the move to command for the stepper motor, and then you get replies to that from the FPGA side until the target is reached. This can be solved by having command invocation and return buffers on the FPGA side; that requires a larger FPGA footprint, but it is also much more powerful if you have many things going on at the same time. Something which you might have missed here is something which is typically done: maybe some of you have worked with these cameras where, for configuring them, you have to write hundreds of registers with specific values in a specific order. That's something which, if you have an FPGA and you have Linux on the other side, I would not recommend, although it's industry standard and good for bare metal; the problem is that it is not really well defined what happens if you mistakenly write to registers in the wrong order.

Yes, that brings me to my first conclusion, okay? Linux host to FPGA connectivity is not rocket science. Why is that so, in my opinion, what helps? On the hardware side there are not so many different standards in use, although there are several vendors out there which you could consider, and these hardware interfaces have been supported by the Linux kernel for a very long time; and on the FPGA side they are either very easy to implement, or the IP, as for PCIe, is free of charge and in some cases, not all, also open source. For this specific aspect I also think that model-based generation of source code for this interaction is helpful; it simplifies life, because in the end it even allows you to have interface-agnostic communication, and you're also avoiding vendor lock-in.

Before I finally conclude, I would like to stress that all of the code which we have come up with throughout our experiments is open source. You will find it in the Git repositories. If you're looking for a specific interconnect and you cannot immediately find it, feel free to drop me a line. And with that, I thank you for your attention, and I'm open for questions. Questions? I think the next sessions begin, one second, yeah, 5:45 are the next sessions. Have a good day, everyone. Thank you.