Okay, so it works. Welcome, everyone. I'm Edgar Iglesias, and I'll be talking today about how we at AMD and Xilinx use QEMU and open source technologies to do RTL co-simulation. So, I'm Edgar. I've been with Xilinx and moved on to AMD through the acquisition that just happened, and I manage the QEMU, or virtual platforms, team at AMD. First I'll go through how we use QEMU at AMD Xilinx — and this mostly applies to the Xilinx legacy stuff; AMD teams may do things a little bit differently. Then we'll talk about the different emulation technologies typically used in SoC development, what co-simulation is, and why and how we do it. We'll also cover a bit of what we did in the DARPA POSH program to develop some of this stuff, go through some more details about different co-simulation setups, and at the end we'll have a live demo — we'll see if that works. Okay. So this is a slide I usually show for people who don't know what QEMU is. You guys already know, but anyway, it's good to go over the value and why we use this stuff. QEMU is primarily used as a virtual platform for software developers. There are other use cases that are becoming increasingly popular, like SoC verification and validation, but the primary use is software development. QEMU being open source, it has a very scalable distribution and cost model. What I mean by that is that it's very easy to scale up and give QEMU as a platform to hundreds of developers — you don't have to worry about licensing costs and things like that. It's also easy for us to distribute it to our customers, and they can in turn distribute it to their customers; there are no issues there, it scales very well. It's also popular in the open source community, so there's a whole bunch of open source software projects that tend to use QEMU as a platform for testing or development, and it's good for us to enable those communities.
It's been used at Xilinx since about 2009, with a whole variety of architectures, including MicroBlaze, all the Arm stuff, and also x86 setups. QEMU is a transaction-level emulator: it's very fast, but it's not cycle accurate — we'll come back to that later. It can boot our Arm platforms into user space in just a couple of seconds, and through the whole user space to a login prompt in about a minute, maybe two, depending on the setup. It's got tons of debugging and profiling features, and we have a co-simulation framework for it, which is SystemC based and allows us to do hybrid co-simulation setups using QEMU and RTL simulation, and also to use QEMU with hardware emulators and FPGA prototyping. I'll come to that in a moment. The real value is shifting the software development left in the program, so we can start early and get more and more done sooner. It used to be that when we received samples for a new SoC, we wanted to have everything ready to go on day one. Now we actually need to get everything going — all the ports of boot loaders, firmware, kernels — before tape-out, basically. And that shift is going more and more left, so we need to do more and more. In terms of users, we have both internal and external ones. The main users are of course the boot ROM teams; they start very early, since their software needs to actually be part of the tape-out. We have the system software teams doing operating systems, hypervisors, all of that stuff; system validation teams writing test cases for validating the SoCs; and the verification teams. We also distribute QEMU in our customer-facing tools: it's part of PetaLinux, and it's also part of Vivado and Vitis, where it goes under the name hardware emulation. And we also distribute it on GitHub for people who want to roll their own solutions.
And there are actually plenty of those who take our QEMU and the co-simulation framework, package it up, and sell it to other customers — including to Xilinx, which is fun. Okay, so let's look at the different technologies for emulation. I just picked some here; there are of course more, and I've generalized a lot — there are exceptions to this. But the point I want to make is that there's a spectrum of technologies, and there are trade-offs to be made. We've color-coded the virtual ones in red. What I mean by virtual is that they don't use the RTL to do the emulation, so we have to write separate models for them. The green ones are RTL based, so they're actually using the real source for a device. On the far left, we have QEMU. It has a lot of capacity in terms of being able to emulate large designs or large systems: we can emulate a full PC with all the PCIe cards; we can even have two PCs, or an x86 and an Arm. We can do very big stuff. It's also very fast — especially if we run in KVM mode, we can reach gigahertz speeds. But it has low visibility into the actual implementation. Because we're using a separate model, that model may not even have the same state as the real hardware. It has some visibility: if you look at the CPU, you can see the registers; if you look at a device model, you can see register state. But you may not be able to see the signaling within the actual block. It also has low accuracy, in the sense that it's not cycle accurate and it doesn't have the RTL to do the exact real thing. As we move to the right, we get to SystemC, which moves a little bit towards RTL, but if you're using the TLM modeling styles, it's still very high level. SystemC is a little bit special because the language allows you to model at the RTL level or higher, but here I'm focusing on the TLM modeling style.
Then we have FPGA prototyping, which is when you take the RTL, put it on an FPGA, and try to emulate your system that way. That's quite fast, but it has a capacity problem unless you use multiple FPGAs, which people do — there are bigger prototyping systems with many FPGAs, so you can go that way — but it's still small to medium capacity. You can't really model, say, multiple PCs or multiple huge SoCs; you're bound by the capacity. The visibility is pretty good, but typically you need to set up windows for tracing — you can't see everything — so I put medium here, though you could argue about that. It's very high accuracy: it's cycle accurate, of course, since it's based on the RTL. Then we have the hardware emulators, which are sort of accelerated simulators. They support fairly large designs, but they're slower — they go down to megahertz kinds of speeds. They have high visibility — you can trace and get waveforms and lots of detail — and very high accuracy. Then we have RTL simulation, which is also a bit special here, because it can actually be quite fast for small designs, but it tends to slow down a lot as the design you're simulating grows. Okay, so what is co-simulation? Co-simulation is when, instead of using one single technology to emulate or simulate things, we mix them up: we can put some stuff in one technology and other stuff in another. The example I have here is QEMU, SystemC, and RTL simulation all co-simulating a full system together. So we get a mix of speeds depending on what runs where — a mix of speed, capacity, visibility, and accuracy.
And I just want to mention that the way we do this is typically centered around SystemC, because many of the tools, especially the RTL ones, all support SystemC — let's say as a glue — or have bindings to SystemC, so SystemC can access the signals at the RTL level. SystemC is here as the glue; I'll come back to that later. Okay, so I'll show a few more examples of why this can be very useful. Here's an example of the Xilinx Versal co-simulation that's built into the Vitis tool. Our customers that use Versal get this FPGA, or SoC, where part of it is hardened — the whole Arm subsystem, the Ethernet MACs, a whole bunch of stuff that is hard — and then there's the programmable logic, where the users write their RTL, load it, and run it. In terms of simulation, they don't really care about simulating the hard stuff with very much accuracy; they care about their own RTL, and if they can get a simulator of the hardened parts that runs very fast, that's great. So that's why we put our hard blocks into QEMU, while the soft blocks — the user's RTL — go into RTL simulation. This way the user gets high accuracy and high visibility into their own stuff, and high speed for the stuff that Xilinx provides. Here's another one, also a real example from Versal development: hooking up QEMU to a hardware emulator, which is typically capacity bound and slower. By moving some parts of the SoC out to QEMU, we could increase the capacity by 60%, which means we could, for example, double the number of users, or model a system that wouldn't be possible without QEMU. You can also increase the speed of test runs: if you're running software-driven tests in QEMU to exercise specific RTL, you can get a 240x speedup on those tests. That's pretty significant. Okay, so how do we do this bridging between the different technologies?
Let's take as an example how we do it between QEMU and SystemC. We have this remote-port bridge. Remote-port is a protocol — like a network protocol — that we can run over Unix sockets or TCP/IP sockets, and really what it does is just serialize transactions. There's a remote-port device in QEMU — multiple of them, actually — for example, one for memory-mapped transactions. When a transaction comes in, we serialize it and move it over to SystemC, and on the SystemC side it gets converted to a TLM generic payload and injected further into the ports there. We move memory-mapped transactions both ways, and wires both ways — wires could be interrupts, a reset, GPIOs, whatever. There are also ways to move MMU translations, or requests, to support things like ATS for PCIe. There are stream transactions for AXI Stream, for example, and we can also move network packets. And there's PCIe, which is more like a composition of the memory-mapped transactions and wires — there's a remote-port PCIe device in QEMU that allows you to emulate a PCIe endpoint on the SystemC side or as RTL. Okay, so how is the bridging done between SystemC and RTL simulation? It's a little bit different here. On this side, we have a TLM generic payload, which is just a structure, basically, that describes a transaction. The structure has things like: what address am I reading from? How many bytes am I reading? Then there are the attributes: is this transaction secure? What is the master ID? What are the cacheability attributes? All of that is described in the structure. The job of the bridge here is basically to replay that transaction on the RTL signaling. If we're talking AXI, that means replaying the various phases of the AXI protocol, such that the RTL-simulated logic just responds.
Okay, so let's look at how we do it for FPGA prototyping. In this case, the bridge is actually RTL logic that gets synthesized into the FPGA and sits between the DUT and the rest of the simulation. The job of the RTL bridge is to replay transactions: much as I said before with the TLM generic payload, you configure all the details of a specific transaction — the address, the width, how many beats you want, all the attributes — and the bridge will go and replay it towards your DUT. And this works both ways: if the DUT wants to DMA back into the SystemC environment, that also works. Okay, so that was a little intro to how we do these things. In 2018, DARPA ran a project they called POSH; it stands for Posh Open Source Hardware. We participated in that project, extending some of our SystemC frameworks. The library that mostly got worked on — there was other stuff too — was the libsystemctlm-soc library. It's a horrible name, but there you go. The idea is to use open source technologies to achieve the kind of co-simulation setups I just described, and as you see here, SystemC is really at the center of things. Right, so in that project, we developed bridges for RTL simulation for a whole variety of RTL protocols — all the AMBA stuff: APB, AXI, all kinds of AXI — and all of this is open source and works with Verilator, the plain Accellera SystemC, and QEMU. We also developed the FPGA prototyping RTL bridges and the necessary drivers for them, which are also in the library as open source. And there's a whole bunch of demos and examples showing how to get going, so if you're interested in learning RTL and Verilator, this is actually a very easy way to get started. Okay, so a little bit more about the setups. In this example, we have QEMU on the left, for example emulating a ZynqMP.
Whenever a transaction comes in to a memory region that targets the remote-port device, that transaction moves over to the SystemC TLM side. There it hits what I refer to here as the POSH SystemC bridges — those are the bridges I was talking about before — the generic payload gets translated to RTL signaling, and the DUT does its work. And this is all done with open source tools. Here's another, more complete setup where we have two QEMUs that talk remote-port between each other; to the right, it looks the same. Okay, so here's how the FPGA prototyping works in more detail. We have QEMU again, let's say issuing a transaction. The transaction moves to the SystemC side and becomes a TLM generic payload. The generic payload hits the POSH bridge, and the POSH bridge in this case is different, because it's actually more like a driver — a VFIO driver that takes ownership of the POSH bridge that sits on the FPGA side, behind a PCIe card, like an Alveo card. So the SystemC-side bridge will, via VFIO, take over that FPGA bridge and drive it — in this case, set up the registers for a DMA transfer and move the transaction over to the FPGA bridge, which then replays it to the DUT. In the other direction, the SystemC side pre-allocates a ring of TLM generic payloads and the buffers for them, and those buffers get programmed into the FPGA bridge, so that whenever the DUT makes a DMA transaction, that transaction is captured by the bridge, DMA'd into the generic payload buffers, and signaled to the SystemC side: okay, you've got a transaction that needs to be injected. Okay, I just want to mention here a question that often comes up: why don't you just do device pass-through? Well, device pass-through would be super fast, but if you just do pass-through, you can't actually capture the DMA transactions, as an example, right?
If the DUT DMAs into the simulated system and you're doing DMA pass-through, you're DMAing straight into memory — you can't capture the transaction. And, for example, say that transaction targets an emulation model of, I don't know, a UART: you need that transaction to get captured and re-injected into SystemC as a TLM generic payload or such, right? And the other way — why don't we just bridge the transactions from QEMU over to the DUT at the hardware level? Well, if you do that over PCIe, you'll lose all the attributes, because PCIe may not have a TrustZone secure/non-secure bit, or it may not have the cacheability attributes. And you wouldn't be able to control the exact timing of the transaction — how many beats, the size, etc. So that's why you need these bridges: to give you that control. Okay, so this is one way of doing the bridging when you're using hardware emulators. When you use a hardware emulator, the vendor gives you the emulator and the interface to it. It may or may not be PCIe, and you may or may not have control over it. So in this case, we have to use the vendor's verification IP to move transactions. This works okay, but there are some cons, one of them being that you don't have much control over how the AXI transactions get replayed, because it's the vendor's bridge. Here's another way to do it, where you use the vendor's bridge only as infrastructure and put the open source POSH bridges around it, wrapping it. Now we get full control over what the transactions look like. The con for this one is that it's slower, because you're basically tunneling transactions over another channel. Okay, so I thought I was going to do a demo — we'll see if it works. This is a demo we did at one of the POSH events. It'll show QEMU emulating the Xilinx ZynqMP, and we're just going to boot Linux, the PetaLinux image.
On the SystemC side, we have the SystemC wrapper for the ZynqMP, which basically makes the whole remote-port-and-QEMU thing look like an ordinary SystemC module — it just wraps it up. We have a bunch of SystemC models for DMA controllers — two of them. And the green part is RTL for the LeWiz LMAC. The LMAC is an Ethernet controller that was developed in the POSH program as well; it's open source, it's on GitHub, you can go find it. We've used Verilator to make it SystemC compatible — converted it to SystemC, you could say — but it's still at the RTL level. And we're going to use the POSH bridges to connect that into a full-system simulation. We also have a virtual PHY, which is an interesting module here. I mentioned before that we have remote-port net devices: basically a network interface emulated in QEMU, but all the packets that go in and out of it go over remote-port to SystemC as generic payloads. So on the SystemC side we get all the Ethernet traffic, and we can send and receive live off the internet, right? And we have this little virtual PHY that takes a generic payload in one direction and replays it as the RTL signaling of GMII, and the reverse the other way. So now we can talk to the MAC. The MAC thinks it's just talking to a PHY, but it's all going via QEMU, out onto the internet, live. Okay, so let's see how this goes. I hope you can read this. I'm going to start the SystemC side here, and I'm running QEMU in the left window. Okay, so let's see — it discovered the LMAC. Yeah, there it is. I'm going to take down the first Ethernet interface, because that's the built-in hard one in the ZU+, and instead bring up the LMAC one. So now the LMAC is up and we got an address, and I can go ahead and just download something. Right, so that's pretty fast — and the Ethernet MAC is now RTL simulated.
And as you can see, it works pretty well for software — fast enough for doing software development. I'm SSHing back into the host; it's pretty responsive. Okay, and I can still do all the usual stuff: I'm going to connect GDB, set a breakpoint on the receive interrupt, continue, do the download again, and see that hit in a second — you know, do the usual debugging you do when developing drivers. Okay, let's stop this. Another cool thing with this is that QEMU has a feature where you can dump all the network traffic into a pcap trace, which we can then analyze with Wireshark. So if you're debugging Ethernet stuff, you can also look at those traces; that gives you pretty good visibility. And now that we have an RTL-simulated Ethernet MAC, we can also go down and look at the signaling side. I've selected a couple of signals here — this is not all of them. I don't know if this will be visible for you, but anyway: what we have here is basically the clock, so we can see what happens at each clock cycle, and the wires between the virtual PHY and the Ethernet MAC. So for each cycle we see, for example, the idle sequences on the data lanes; then we have a preamble and the packet starts; and then we see each and every beat for each clock cycle, until the line goes idle again. Then nothing happens for a little while — the MAC is processing the packet. And then we see the AXI Stream interface from the Ethernet MAC to the DMA controller: we see data for each cycle, we see the valid signal every beat, until TLAST gets set, signaling that this is the last beat of the transaction. We see the strobes — the byte enables, basically. Okay, so that's really neat, because now you get full visibility. If you find an issue, you can track it down; you can even see all the signaling within the Ethernet MAC for each and every cycle. Okay, I lost it there.
Okay, anyway, that was my last slide. I guess we can take questions now. No questions? Right, okay — so the question is, Paolo was saying that typically the machine models in QEMU for embedded devices are very static. I think what you mean is also that there's no structured bus to hot-plug and connect things into, and you're asking how we compose the machine, basically. Yeah, so we have some patches to QEMU — to our QEMU, which we've tried to upstream in several forms, but with no success yet — that essentially take in a device tree file that describes the machine for us. Now, you can argue that device tree is not the best format — yes and no. We don't actually use the same device tree bindings that the kernel does; we have more of a one-to-one mapping between the device tree and QEMU. Every node is a QOM object, and every DTS property gets set as if it were a QOM property. There are some exceptions to that, but that's essentially how we do it. We have a way to take the GPIOs, the reset signals, and the interrupts, put it all together, and compose the machine. Does that answer your question? ... Yeah, so there's a version of the demo I just showed — a version of this demo, but using a RISC-V CPU — and there's basically a README file to walk you through each step to reproduce it. So yes, there are demos. There are many demos without QEMU, just SystemC co-simulating with RTL; that's a good place to get started, and QEMU would be another add-on. But yeah, there's plenty of documentation, also on the Xilinx wiki pages. ... Right, that's a very good question. So the question is: typically when you do co-simulation, you mix these technologies, so by nature you get a difference in speed — how do you deal with that? There are several ways to handle it.
One way is to just let them run decoupled — free-running — and we find that's the most popular way, but there are other ways too. We have another mode where we run QEMU in icount mode, which has a lot of overhead, but anyway, you can do that. Then you get a sense of timing, and we can run the SystemC side sort of in lockstep with it. So with RTL simulation, you can get some kind of lockstepping. With FPGA prototyping, we don't have a way to do any of that today, so it's all free-running. I see a question? Yeah — so with Verilator, and I'm sure the commercial vendors have features like that, but with Verilator we cannot do that on the RTL side. You can do it on the software side, though: if you do it software-driven, you can sort of single-step the CPU. And if you're running in lockstep, like in the earlier question, then you can single-step the CPU and get one instruction's worth of RTL execution, and step through that way. ... Right, no, you can't. That would be a great feature, but I'm not aware of such a feature. ... Right, so there are several ways to do that, but typically the DMA goes back into QEMU. There's another setup we use sometimes, where the SystemC side can do a shared mapping of all the memory. I think this feature is also available in upstream QEMU: you can set it up so QEMU backs each RAM with a file, and those files can then be opened by another application — basically shared memory, mapped across both processes — and then both applications can write to the memory. That's used in some scenarios to get a bit better speed. We used to have an issue, though: if you were trying to do atomic instructions across remote-port, that could be problematic. I actually haven't tried it, because it's not really a use case that shows up for us.
We tend to emulate, say, the Cortex-A clusters all in QEMU — we don't split them up — but there may be other scenarios where you need the atomic instructions. I guess that with MTTCG, now that we actually emit something for the memory barriers, it possibly works; I haven't tried it. Okay, it looks like we're done. Thank you very much, guys.