Hi everybody. I would like to talk to you about a new Xen feature called cache coloring that greatly improves the determinism of real-time workloads deployed on Xen. First, let me start by talking about why we are doing real-time in the context of hypervisors. Hypervisors in embedded are used for consolidation of different workloads, or for splitting up a single unit into multiple separate components that need to be isolated, deploying them all on a single SoC. Now, isolation has many degrees, under different aspects. One aspect is security: you don't want something running in one VM to be able to steal data from, or get inside, another virtual machine. Another dimension is reliability: we want an application running in one VM, even if it crashes, not to be able to affect the rest of the system. It's often the case that not all software is made equal; some is more reliable, some is less reliable. When you put them together on a single SoC, you want to make sure the less reliable software doesn't affect the more reliable software. This is true even without safety certification requirements, but in many environments, such as automotive, industrial, and aviation, safety certifications are actually required. In that case, one of the VMs might be safety certified, and you definitely cannot afford the non-safety-certified VM affecting the safety-certified VM. These are all dimensions of isolation, and I would like to add one more to the list: real-time isolation. If you have one workload with real-time or latency requirements, it needs to work correctly without interference, no matter what you're running on the side.
You could be running something like a much larger OS alongside in another VM, and that larger OS should definitely not be able to have any negative impact on your RTOS. In this context, Xen has always had a pretty decent set of features, one of which is CPU pools. Xen is able to configure groups of CPU cores, called pools, and run different schedulers on different pools. In this example, on a four-core SoC, we create two pools: the blue pool at the top and the orange pool at the bottom. At the bottom we are running the default scheduler, credit, which is a data-center scheduler made to run more vCPUs than the physical CPUs available. At the top, we are running the null scheduler which, as the name suggests, really does nothing, nothing at all. That's in fact the best possible scheduler for IRQ latency, because the scheduler doesn't interrupt VMs and doesn't really do any scheduling. You can do better than credit even if you have more virtual CPUs than physical CPUs, because there are other schedulers that come with Xen, such as RTDS, a real-time scheduler that still allows you to run more virtual CPUs than physical CPUs on the platform while still providing real-time guarantees. That said, there is nothing better than the null scheduler, simply because it does nothing: if you want the minimum possible latency and to fully assign a CPU core to a VM, the null scheduler does exactly that. I see a question about how you change schedulers. It's very easy to do at boot time: you just pass the option `sched=null`, and that changes the scheduler used for all CPUs from the default, credit, to null. So in fact, if you want to try real-time, that is the quickest and easiest way possible: just switch to null. Now, this is where things get interesting.
You would think that using the null scheduler is more than enough to get real-time. Unfortunately, things are more complicated than that. On most ARM SoCs, including the reference board I used for this presentation and the demos you'll see later, the Xilinx Zynq UltraScale+ MPSoC, there is a shared L2 cache on the board, plus a private L1 cache for each core. The L2 cache is shared across all cores. What that means is that one application running on one core, doing heavy memory operations, can affect the performance of another application running on a different core. In this picture, for instance, something running on core 4 doing heavy memory operations, like memcpys in a loop, ends up causing cache-line evictions. Therefore, when your real-time application on core number 1 runs, it might actually end up missing a deadline because of that. We started the presentation saying that hypervisors should guarantee interference-free systems, right? They should be free from interference even from a real-time point of view. But due to the L2 cache, unless we do something specific and special, there is no way to avoid L2 cache interference. This has nothing to do with hypervisors per se, by the way; it's just a property of the SoC. Exactly the same thing could happen if you run an application on Linux on one core and another application on the same Linux on a different core, just because a shared cache is present. So what's the solution? Ideally, we want to split the cache into separate subsets and give each VM one subset. The hypervisor should be able to split this L2 cache into subgroups and dedicate one group to each VM. That way, one VM cannot cause interference to another VM. This technology is called cache coloring, and let me provide a bit more detail.
The idea is really to understand and study the way memory addresses relate to cache lines in L2, and then make it so that one VM always ends up using the same set of cache lines, unique and different from all the others. That is implemented by carefully choosing the memory allocated to each VM. If we choose the memory so that, when the VM actually does memory accesses, it always ends up using the set of cache lines we chose, then the problem is solved. On the Xilinx MPSoC, it ends up being one page every 16: allocating for one VM page 0, page 16, and so on, skipping the ones in the middle. I'll answer the other question at the end; let me keep going with the presentation for the moment. So, the way we implemented cache coloring in Xen: first of all, we automatically detect the way size, which can be done via registers; it can also be passed via a command line option to override it, or to make sure the way size is set correctly. On the Xilinx MPSoC, the way size is 64 KB. Now, 64 KB is 2^16, which means you have 16 bits of address to play with. 12 bits are dedicated to the offset within the page, because we cannot allocate memory at a finer granularity than a page. That leaves four bits to specify the color, the color being the minimum set of cache lines that we can allocate to a VM. If you look here at the slide, you see the color bitmask is 0xF000: these four bits are the ones that decide the color. This is how a memory address ends up being one color or another. Because there are four bits, there are 16 possible colors. We call them by numbers, so on the Xilinx Zynq UltraScale+ MPSoC the colors go from 0 to 15. And again, the memory allocation for a single color is page 0, page 16, page 32, and so forth. The next thing we did is write a new memory allocator.
The traditional memory allocator in Xen is called the buddy allocator. What we did is add a new memory allocator that is colored, so it understands colors. It keeps pages on lists, one list per color. So again, on the Xilinx MPSoC, we have 16 lists, one for each color, with pages ordered by address on each list. Basically you ask, "please give me a page of color number 3", and you pick the topmost page on the list for color 3. The colored allocator can be used both for Xen memory and for virtual machine memory; when I say Xen memory, I mean the memory used by the hypervisor itself. Okay. Now, if you're interested in trying this out, this is the right time to pay attention. On the slide, you see the configuration needed to get cache coloring working. The way size, here on the first line, is an optional command line parameter for Xen to specify the way size; optional because under normal circumstances it can be automatically detected. The next two parameters are really fundamental. The Xen colors parameter specifies the colors to be used for Xen's own allocations, for Xen memory that is going to be used by the hypervisor only. The hypervisor does not use much memory at all, so one color is more than enough. The Dom0 colors parameter specifies the colors to be used for Dom0 memory allocations. Now, I think this is the right time to stop for a second and explain how colors and memory allocation interact. On the Xilinx MPSoC, each color roughly corresponds to 256 megabytes of RAM. That means if you specify color 0 for Xen, you're giving up to 256 megabytes of RAM for Xen memory, which is way more than enough. And when you specify colors 1 to 6 for Dom0, it means you have up to 1.5 gigabytes for Dom0 (six colors at 256 megabytes each).
This does not automatically mean that you are allocating 1.5 gigabytes to Dom0, because you still have a separate parameter, the good old dom0_mem parameter, to specify the amount of memory for Dom0. The reason we kept them separate is to have maximum flexibility. This way, for instance, you can decide not to use all the memory available from these colors, because you want to reuse a color in another VM. You could even create a VM that partially shares Dom0's colors and partially does not. A reason for sharing colors with Dom0 could be to set up a memory ring for communication, to exchange data, and so on. That way you can select the memory allocation and the colors the way you like, and you really have the maximum possible flexibility in setting up the system. Now, of course, if you choose colors 1 to 6 for Dom0, that covers 1.5 gigabytes; if you then ask for Dom0 to be allocated 2 gigabytes of memory, that is not going to work. You'll get a failure, an error message, and you'll have to configure the system again. Next, dom0-less domUs. Dom0-less domUs are other virtual machines that are started directly at boot by Xen, booted in parallel to Dom0 and without Dom0 having to do anything at all. These dom0-less domUs have their own configuration, specified in the device tree. Like all the other parameters for these dom0-less domUs, we added one option in the device tree to specify the colors as a bitmask. In this case, the value 0x80 actually means color number 7. That specifies color 7 for this dom0-less VM, which, as you'll see later in the demo setup, is the one actually doing the latency measurements. Finally, we have one option for regular domUs.
These are the VMs you start normally, after Dom0 has fully booted, from the command line with the xl tool; you do `xl create` to start these VMs. There is one more option, called colors, to specify the ranges and lists of colors you'd like to dedicate to these domUs. All right. I want to show you a bunch of benchmarks and results that really highlight the value of this technology. The benchmarks run on the Xilinx MPSoC. The VM taking the measurements is a little bare-metal application. A bare-metal application is just an app running in kernel space, written in C, with a few libraries and drivers for accessing the UART. Basically, it's not even an OS; it's smaller than a unikernel, really very tiny. That makes it a great testing environment, because it's very controlled. We are also using a bare-metal application to cause stress on the system: memcpys and so on. The color configuration is here on the slide. We chose two colors per VM, for example, because we had some to spare, but actually one is more than enough: colors 4 and 7 for the motor-control domU, 8 and 11 for the first stress domU, 12 and 15 for the other stress domU. And then we also had a Linux VM doing stress as well. But let's jump to the results. First, something very important: as you try to reproduce these numbers, at the first attempt you might see something slightly different, and that could very well be because you end up using the L1 cache all the time instead of the L2 cache. This is critical. The applications we use for tests tend to be very small; as we said, they are bare-metal applications, so they might end up fitting entirely into the L1 cache. If so, the L1 cache is going to mask the interference effects on the L2 cache. To actually measure the L2 cache interference, you need to make sure your L1 is not masking the interference effects.
One way to do that is a bit rough, but it works: issue cache clean operations at the right time during the benchmark cycle. All right. The first set of results is about motor-control execution time. We're running an example routine to control a motor, and we are measuring the time it takes to run this routine. The first slide here shows the results without interference; it's really just here as a reference. The column on the left is the bare-metal application alone. The second column is Xen without cache coloring, i.e. the bare-metal application running as a domU on Xen. And the column on the right is the bare-metal application running as a domU on Xen with cache coloring enabled. Lower is better, because it's the time it takes to run the operation, in nanoseconds. The leftmost column, the bare-metal application alone, is actually quite large, and that's because there is a bit more variance in the results compared to the bare-metal application on Xen. Strangely enough, without interference it's better on Xen than alone. That's a bit strange, but it can happen sometimes with virtualization. I think the really important takeaway from this slide is that they are all below 2,000 or 2,050 nanoseconds. Now, let's add interference. With interference, we see that the bare-metal application goes above 6,000 nanoseconds, so it gets far worse. The bare-metal application running on Xen without cache coloring also gets worse, close to 8,000 nanoseconds. While the bare-metal application running on Xen with cache coloring enabled, so not alone, with another VM doing interference, still shows the same value, still below 2,000 nanoseconds. Let me take the opportunity to explain how we added interference. In the case of the bare-metal application alone, we are running another thread on a separate processor doing interference.
In the case of Xen, we are starting a different domU, a different VM, running the bare-metal interference application. We can add one more. This is with two VMs doing interference, so two processors being used to do memcpys. The bare-metal application still gets worse. The bare-metal application running on Xen without cache coloring, the second column, gets close to 10,000 nanoseconds, so again worse. But the bare-metal application running on Xen with cache coloring enabled is still fine; the effect is minimal, still around 2,000 nanoseconds. We can increase the interference yet again by adding one more processor doing memcpys, and we see the bare-metal application get worse, the bare-metal application on Xen without cache coloring get close to 12,000 nanoseconds, while the bare-metal application on Xen with cache coloring enabled is still below 2,000 nanoseconds. We have a summary slide here where you can see all the results you have just seen on a single slide. The first two columns are Xen without cache coloring and Xen with cache coloring, without interference. The next two columns are Xen without cache coloring with interference, and the final column is Xen with cache coloring and with interference. Here you can really see on a single diagram how Xen without cache coloring goes from below 2,000 nanoseconds to close to 8,000 nanoseconds, while Xen with cache coloring enabled is basically stable; there's a minimal, minimal effect. The second benchmark we ran might interest you even more: interrupt latency, in the sense of the difference between the time when the interrupt service routine runs and the time the interrupt was expected to fire. The timer we used is a timer in programmable logic, but you should be able to reproduce these results with any of the timers on the Xilinx MPSoC, including the TTC timer and the arch timer.
These are, as before, the results without interference. The bare-metal application alone is slightly below 500 nanoseconds. The bare-metal application on Xen is around 2,500 nanoseconds. The column in the middle is Xen without cache coloring; the column on the right is Xen with cache coloring. The reason why, without interference, the interrupt latency is higher on Xen is that on ARM, interrupts get delivered first to the hypervisor, and then the hypervisor injects a corresponding virtual interrupt into the virtual machine. The code path is longer when running with Xen, or any other hypervisor, so the IRQ latency is going to be a bit higher. Again, the takeaway here is not really the small difference, but that they're all below 3,000 nanoseconds, around 2,500, and that you see a massive jump when we add interference. Adding interference causes Xen without cache coloring to go above 10,000 nanoseconds, while with cache coloring, as before, it stays the same, at 3,000 nanoseconds. The effect is really small, really minimal. Note that the bare-metal application also jumps significantly in proportion to its 500 nanoseconds without interference, but in absolute numbers it's lower than with a hypervisor. We then added one more thread causing interference, doing memcpys, and again Xen without cache coloring goes close to 15,000 nanoseconds while Xen with cache coloring is still at 3,000. And yet again we add one more source of interference; I actually don't have the number for the bare-metal application, but the number for Xen without cache coloring goes a bit higher again, and Xen with cache coloring is extremely stable. This is just a diagram, as before, to see all the results on a single chart.
As you can see, the bare-metal application gets a bit higher, Xen with cache coloring is basically stable at 3,000, and Xen without cache coloring really suffers from interference; the latency really gets higher. In conclusion, Xen with cache coloring offers much lower execution times for your motor-control routine under stress. It also offers much lower IRQ latency under stress and, even more importantly, much improved determinism in the results: in fact, the results have far, far lower variance compared to Xen without cache coloring. Xen cache coloring is effective at reducing cache interference, and I think it does enable the deployment of highly sensitive real-time workloads on Xen. Now, before showing you a demo, I'm going to give you a quick update on the latest status of cache coloring. All the patches are public and online in the Xilinx Xen tree. The latest branch is the 2021 release and has all the cache coloring patches. Make sure to enable cache coloring in the Kconfig of the hypervisor if you want to use it. The upstreaming has yet to start, but it should start soon. The other limitation we currently have is a missing component in Linux. Today, Linux expects to be mapped one-to-one when it's running as Dom0 on Xen on ARM, and that is not the case when you enable cache coloring. As a consequence, PV drivers don't quite work correctly. This is one of the things we're going to solve in the remaining months of the year; by the end of the year we're aiming to fix this issue. But today, if you're doing a static partitioning setup, with all resources directly assigned, it will work just fine. Only if you use PV drivers, like PV network or PV block, can you run into issues. Okay, let me show you the demo setup. There is Xen running, and I assigned color 0 to Xen.
Colors 1 to 6 go to Dom0. Dom0 is just a normal Yocto Linux build. The bare-metal latency-measurement application is a tiny application built with the Xilinx SDK; the Xilinx SDK comes with a way to build these tiny bare-metal apps in C that run in kernel space, and I wrote one that measures latency based on the TTC timer. There is also another bare-metal application that we're going to start by hand from Dom0 to cause memory stress. The bare-metal memory-stress application has basically no output, no console, nothing; it's just doing memcpys. The bare-metal latency-measurement application, the one on the right, the blue one, has the TTC timer directly assigned and the second UART directly assigned, so that you can see the output. On the screen I just copy-pasted the interrupt service routine, to give you an idea of how the latency is measured in the system when the interrupt fires, and I also copy-pasted the interference main loop which, as you can see, is actually quite trivial. All right, it's time for the demo, so let me share my screen. It's going to take some time. Here I am booting first the system without cache coloring enabled. While the system boots, I can also show you the configuration for later, when I'm going to boot the system with cache coloring enabled. This is a U-Boot boot script that I'm using to pass the configuration parameters at boot. As you can see here, I gave one gigabyte and 200 megabytes to Dom0, with one virtual CPU. I chose the null scheduler. I also set vwfi=native, which is a very important parameter if you want to minimize latency. I specified the way size equal to 64 KB, although it should be automatically detected, so it's not really necessary; and then color 0 for Xen and colors 1 to 6 for Dom0. Here is the configuration for the dom0-less domU and, as I mentioned on the slide, I gave color 7 to the dom0-less domU. Let's see, the system is starting; I'm TFTP-ing the binaries. So here Xen is starting, then Dom0.
In parallel, in the meanwhile, the dom0-less VM has already started, and you can see it's taking measurements in nanoseconds. This is my second UART; I'm going to go back to the first UART. Dom0 has finished booting. If I look, there is one VM running without a name, because this is the dom0-less VM; it has 128 megabytes of RAM. I'm going to start now the additional VM causing interference, and the colors here really don't matter, because I compiled out the cache coloring support, but this is what you're going to see later when I do it with cache coloring enabled. Now, this domU is not outputting anything because it really has no console, but you're going to see the effect quite soon. Here it's jumping from 4,300 to 10,000, then 16,000, and it's basically staying at 15,000 nanoseconds: a very significant jump in interrupt latency. Now I'm going to reboot the system, this time booting with cache coloring enabled. It's going to take some time, so bear with me, but what you'll see is that there is close to no effect on the interrupt latency when cache coloring is enabled. The board is running here on my desk at home, by the way; if you hear fan noise, that's what it is. I'm TFTP-ing all the binaries, which is really not the fastest way to boot, but it's extremely convenient for development and testing, very easy to replace the binaries as you need. All right, now I'm going to scroll up and show you that the Dom0 boot, sorry, the Xen boot, is going to print all the interesting configuration, including the way size, the color bits in the address, the maximum number of colors, and the size of each color in terms of memory. The color chosen for Xen is printed as a bitmask, so that's color 0; the colors chosen for Dom0, which are 1 to 6; and the colors chosen for the dom0-less domU, here.
Yes, so here the interrupt latency is actually already lower than before, 2,900 nanoseconds; but most importantly, we're going to start the stress VMs as before. We can see the VMs running now and, as you can see, there is barely any jump in latency. All right, I'm going to stop sharing now and take any questions you might have. The first question is about whether there is any benefit to segmenting the cache. Yes, there is definitely a benefit, as you saw from the benchmarks, and it really shows when there is interference; this is very important, actually. If you don't have any interference in the system, not only is there no difference, but it's actually better to allocate the memory contiguously: the more contiguous the allocation, the better the caching effects. So if you have a system where you know there is no interference, you do the measurement, and you are surprised that cache coloring is slightly worse: that's not surprising, that's normal. The purpose of cache coloring is to prevent interference effects, so you need a source of interference in a separate VM to really see the benefits of this technology. The slides are definitely available for download; I already provided them to the Linux Foundation, so you'll be able to download them as soon as the talk ends. Are there any other questions or observations on this technology or the benchmarks? I know there was a lot of data on the screen, so if you want to go back to one of the slides, just ask. "Xen"? So, Xen is the name of the hypervisor this work is based on. Technically, the full name is Xen Project, but in the community we typically refer to the hypervisor simply as Xen, and the name goes back 15 to 20 years now. So when I say Xen cache coloring, what that really means is the cache coloring feature for the Xen hypervisor.
Any other questions on the slides, the benchmarks, or anything? I see one. So yes, hi Yan: we colored Xen too. Xen itself is colored; the text of the Xen hypervisor gets relocated in a colored fashion, so that the colors are respected. Otherwise, yes, you're completely right: without coloring Xen, by using any of the hypervisor's services you might end up causing interference that way. That is prevented and ensured by coloring Xen itself, and it was actually quite difficult to do, to be honest; maybe the most interesting and challenging part of the project, because it involves, as you can imagine, copying the Xen text in a striped fashion and creating a contiguous mapping of it. It's very technical. "Bare-metal is higher"? If you're referring to the IRQ latency, that's normal... ah, you mean the other one, the motor-control routine, I bet you mean this. Yes, it's surprising that bare-metal alone is higher than bare-metal on Xen without interference. However, this is the data, and the University of Modena guys who ran this benchmark were extremely thorough; they ran thousands of iterations. This is not the first time I've seen results like that, where running something on Xen is faster than running it bare-metal. It's not that Xen is magical; it's just that when you enable virtualization, you enable some characteristics of the micro-architecture that end up speeding something up. It's surprising, yes, I agree, but the other thing that is surprising is the variance: look at how the variance goes from 1,700 to 2,100, a huge variance in the results, while on Xen the variance is minimal. That is surprising too. "Does cache coloring affect cache locality?" Yes, it does affect the cache, and it does affect performance; that's normal. Of course, normally, when you have a bare-metal application running natively, the whole cache is available to you;
if you use cache coloring, you effectively limit the size of the cache for the given application, so you're really going from a much larger cache size to a much smaller cache size, and that does affect performance. In these results it's not actually visible, because it's in fact faster, but if you run a more complex or larger benchmark you should see a bit of a slowdown. This is what I was saying before: if you run a benchmark without interference, you should expect cache coloring to be slightly slower. If cache coloring is faster, it's surprising; it should be slower without interference. It's when you add interference that cache coloring really makes a difference, because it prevents the interference from having an effect. Yeah, it's surprising. As a reference, this is the second or third time I've seen something running on a hypervisor being faster than not running on a hypervisor; it happened on x86 too in the past. So, seeing it with my own eyes: it's rare, but it can happen. All right, I hope you enjoyed this presentation. If you are interested in more, first of all, yes, download the slides; we'll make them available immediately after the presentation. And stay tuned on xen-devel, where we are going to post the cache coloring patches soon, probably in the next month or two. You can already try them out; the link with all the patches is shown here, and you can test it yourself at home. You should definitely be able to reproduce these results yourself. That's it, and thank you all for joining.