Thanks. Hi all, I'm Anastassios Nanos. This talk is about lightweight hypervisors in the cloud and at the edge, and about our work in this area. We are a young SME doing research on hypervisors and the low-level parts of the stack. We're about seven people; most of the team is here, right there in front, so you can ping us anytime to grab a coffee or a beer. We work on the lower layers of the stack: hypervisors, containers, container runtimes, storage, unikernels, and we also work on accessing hardware accelerators from within VMs, containers, or unikernels. We're based in the UK, in Greece, and in Spain.

So this talk is about how we should handle the hypervisor and virtualization layers in general at the edge. People are using microservices-based approaches: platform as a service, software as a service, function as a service, the serverless stuff, and underneath they have been using hypervisors, sandboxed environments, or hybrid approaches. We have been working on this kind of stuff, trying to understand which solution is best for which workloads people are actually running. We will present a minimal lightweight VMM which interfaces directly with KVM, and some of the work we did to port Firecracker to the Raspberry Pi.

As I mentioned, people are using containers, sandboxes, microVMs, micro-VMMs, or unikernels; the naming isn't really settled. And we have seen the community introduce lower-overhead VMMs that can host these approaches. People run unikernels inside containers; they use sandboxing, for example seccomp, to contain a specific workload within a specific enclave; they use KVM to isolate workloads. There's work on Solo5, the former ukvm effort; there's Firecracker, rust-vmm, crosvm, all that stuff. But if we try to run this kind of stuff on a low-power device, a Raspberry Pi or an NVIDIA Jetson board, we run into issues: there seems to be a lot of contention on these boards. Additionally, these boards ship extra hardware to accelerate specific applications, like NPUs or GPUs meant to handle ML workloads and things like that.

So if we try to run a VM on this kind of device, we will probably end up with either Xen or KVM. I'm a Xen guy, I've been working with Xen for quite some time, so I'm more biased towards that approach. Xen is a bare-metal hypervisor and VMM; there are no mode switches between the monitor and the hypervisor. KVM is a kernel module that exposes the hardware virtualization extensions and is paired with an emulator in user space. There have been approaches to slim down the VMM: QEMU-lite, crosvm, rust-vmm, Firecracker, Solo5 with the hvt tender, at least. So we think that, on the one hand, KVM is really useful as the interface to the hardware extensions for VMs: it's easier to debug than Xen because most of the VMM is in user space, and Linux already has all the available device drivers, so there's no custom porting needed. On the other hand, Xen has the VMM and the hypervisor in the same context, so it's a bit faster in terms of spawning a guest and handling the hardware extensions for memory management, but we need a driver domain to handle device drivers and interface with the world outside the node.

Here's a figure of how KVM handles a running VM.
There are device drivers in the Linux kernel, with the network and block stacks, and there are the functions KVM offers to provide isolation through the hardware extensions. On the left side a VMM runs in user space: it handles loading the binary of the guest OS, some of the memory management, the PV driver backends, and the exit handler for when the guest executes a privileged instruction. On the right side is non-root mode, where the VM is running. So entering a VM means a switch from root mode to non-root mode, at least on the x86 architecture. The application runs in guest user space, on top of the runtime and the drivers that interface with the monitor; it switches to kernel space in non-root mode, then the guest's PV drivers communicate with the PV drivers of the monitor, there's a world switch there with a VM exit, and then another mode switch into KVM to handle all the hardware-specific stuff, as in the network or block case. All these paths seem to produce a lot of overhead when handling I/O. In the Xen case, the good thing is that there's no redundant mode switch, but we still need a helper driver domain to handle device access, so in practice there is an extra mode switch and an extra world switch.

So we have been trying to understand what the overheads are when running stuff on low-power devices. To do that we explored several lightweight hypervisors and executed them on a number of low-power devices, ARM-based and x86. We used a lot of tools from the community: the Rumprun unikernel and the MirageOS unikernel. We ran Redis, NGINX, and the DNS example on the KVM tender of Solo5 (Solo5 hvt), on Firecracker, and on QEMU. Solo5 worked out of the box on a Raspberry Pi; it's the first thing we tried. We then tried to run Firecracker, but it didn't work because there was no support for the GICv2 version of the interrupt controller. Solo5 worked because it does no interrupt handling.

So we ran this simple unikernel example. The blue bars are QEMU, with the clients running on the same host and the VM running on a single core. These are requests per second, so higher is better. The red bars show Solo5 hvt. There is some improvement, but we still think it's not enough. So we wanted to see how Firecracker would respond to this setup. Firecracker is a user-space monitor written in Rust, introduced by Amazon AWS. There was already support for ARM devices, so what we had to do was add GICv2 support. We managed to upstream this effort, which is good: Firecracker now supports the Raspberry Pi 4. We ran the same experiment and saw that Firecracker behaves a bit like QEMU, while Solo5 is still better than both. But we think there is still room for improvement in this space.

So we say that the requirements for a workload running on such a low-power device are these: we need a driver for the hardware virtualization extensions; we need access to the hardware, that is, device drivers for it; we need to define the ABI for the workload, whether it will be a unikernel, a Linux, or something else; and we need some software that interfaces with the hardware extensions and handles VM enters and VM exits. We think KVM is fine; it's a great choice for the hardware-extensions driver on most platforms.
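To make that exit-handling role concrete, here is a minimal sketch of the ioctl flow any KVM-based monitor (QEMU, Firecracker, Solo5 hvt) goes through: open /dev/kvm, create a VM, give it memory, create a vCPU, and loop on KVM_RUN handling exits. Loading a guest image, setting up registers, and error handling are all omitted, so this is illustrative rather than a working monitor.

```c
/* Minimal sketch of driving a guest through /dev/kvm.
 * Guest image loading, register setup (KVM_SET_REGS/SREGS) and
 * error handling are omitted for brevity. */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

    /* Back the guest's "physical" memory with an anonymous mapping
     * and tell KVM about it. */
    size_t mem_size = 2 << 20;
    void *mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = 0,
        .memory_size     = mem_size,
        .userspace_addr  = (unsigned long)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    /* One vCPU plus the shared kvm_run structure through which KVM
     * reports exit reasons back to the monitor. */
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    /* The exit loop: every privileged operation in the guest bounces
     * control back here (root mode) until we re-enter with KVM_RUN. */
    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);
        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            /* PV/emulated I/O (the hypercall path) is served here. */
            break;
        case KVM_EXIT_HLT:
            return 0;
        default:
            fprintf(stderr, "unhandled exit %d\n", run->exit_reason);
            return 1;
        }
    }
}
```

The cost we are measuring above is exactly these round trips: every return from KVM_RUN is a non-root to root switch, and for PV I/O there are additional kernel/user crossings in the host on top of that.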
We think the Linux kernel is the most widely used option, so we have drivers for almost every device available. Regarding the binary interface, we're experimenting with Solo5; we also want to try microVM, the Firecracker approach, or even the Linux ABI. And for the VMM we introduce our own monitor. KVMM is a VMM running inside the Linux kernel. It handles VM enters and VM exits in just the same way as QEMU or Solo5 hvt, and it interfaces with the network and block stacks directly in the kernel. This is the equivalent figure of how a VM runs: everything is in kernel space, so there is no user/kernel mode switch on VM exits and VM enters, nothing that interrupts the execution. The VM runs on the right part of the graph, and its PV drivers interface with the VMM in the Linux kernel, so there is only a root-mode to non-root-mode switch.

It's work in progress. We have support for x86 and ARM, and we are able to run Solo5 applications and everything that runs on top of Solo5, so Rumprun unikernels, MirageOS, Unikraft. Recently we integrated an OCI-compatible runtime, forking the work the Nabla Containers team has done, so it's as easy as running docker run with our runtime and a specific image.

We ran the same experiment with Redis; I should mention these results are on x86. We can see there is improvement compared to QEMU and Firecracker, and also improvement compared to the generic Solo5 approach. Regarding latency, we ran a ping microbenchmark to measure latency on x86 and on ARM. This is time, so lower is better, and we can see that KVMM outperforms all the other approaches. The same stands for ARM; for the ARM case we ran it on a Khadas VIM3 board.

We could do a small demo to showcase our approach; I have a screencast. I'm not sure if I can zoom. Can you see? I'm not sure how to zoom on that, so let me just skip it and say what the demo is about. We log in to an NVIDIA Jetson Nano board and execute docker run with the container image and our runtime. If we point a DNS client to the IP address that the runtime has given us, we see that queries are resolved as they should. So it's showcasing our VMM on an ARM device.

Okay, let me conclude. To sum up: KVMM is a minimal virtual machine monitor residing in the Linux kernel. We currently support the Solo5 ABI, and we're planning a pre-alpha release sometime next month. Our next steps are to support more ABIs, extending our ability to run more applications. We would like to work on the container integration: we have been doing some work on integrating container storage with unikernel storage, bridging the gap when we run unikernels as containers. And we would also like to try some recent approaches from the community, like VM forking or pre-allocation of vCPUs, to get instant spawn times. I should mention that this project is partially funded by a Horizon 2020 project. I'd also like to take the opportunity to invite you to submit a paper to our workshop at ISC this year; ISC runs June 21st to 25th, and the workshop is on the 25th. The deadline is April 5th. Thanks very much. Feel free to check out our blog at blog.cloudkernels.net, or our GitHub. Sure.
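For the Solo5 ABI mentioned above, the guest side of a hypercall is essentially a port write carrying the guest-physical address of an argument structure; the monitor, whether it lives in user space or, as with KVMM, in the kernel, catches the resulting exit and services it. The sketch below only illustrates that dispatch shape: the port base, hypercall numbers, struct layout, and the gpa_to_hva/net_send helpers are assumptions made for the example, not the actual Solo5 or KVMM definitions.

```c
/* Illustrative sketch of hvt-style hypercall dispatch in a monitor.
 * Port base, hypercall numbers, struct layout and helpers are assumed
 * for the example; they are not the real Solo5/KVMM interfaces. */
#include <stdint.h>

#define HYPERCALL_PIO_BASE 0x500            /* assumed I/O port base */
enum { HC_WALLTIME = 0, HC_PUTS, HC_NET_WRITE, HC_HALT };

struct hc_net_write {                       /* guest fills this in and passes
                                               its guest-physical address */
    uint64_t data;                          /* gPA of the frame to send */
    uint64_t len;
    int64_t  ret;                           /* monitor writes the result here */
};

/* Stand-ins for monitor services: translating guest-physical addresses
 * into something the monitor can touch, and pushing a frame to the
 * backend (a tap device, or the kernel network stack directly). */
static uint8_t guest_mem[2 << 20];
static void *gpa_to_hva(uint64_t gpa) { return &guest_mem[gpa]; }
static long net_send(const void *buf, uint64_t len) { (void)buf; return (long)len; }

/* Called when the exit loop sees a write in the hypercall port range. */
int handle_hypercall(uint16_t port, uint64_t gpa_of_args)
{
    switch (port - HYPERCALL_PIO_BASE) {
    case HC_NET_WRITE: {
        struct hc_net_write *args = gpa_to_hva(gpa_of_args);
        args->ret = net_send(gpa_to_hva(args->data), args->len);
        return 0;                           /* resume the guest */
    }
    case HC_HALT:
        return 1;                           /* guest asked to shut down */
    default:
        return -1;                          /* unknown hypercall number */
    }
}
```

The difference KVMM is after is where this handler runs: in a user-space tender each such call costs a non-root to root switch plus a kernel/user crossing, whereas an in-kernel handler can hand the frame to the network or block stack without leaving the kernel.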
So is it there just because you're not already using an existing interface that does the same thing? Yeah, so the question is, if we were using vhost, why are we better than the QEMU or Firecracker alternatives. So we are using vhost, that's one thing. The second thing is that for these kinds of benchmarks we are on the latency side, and vhost is not helping with latency at all. It should be. I'm not that sure; we can take this offline. Yes. But the typical approach with vhost is to just keep spinning until you finally see a new request coming in, so you don't need any exit at all and you end up faster, and on these small devices you usually have plenty of cores to spare. OK, we can chat about it offline, but I think that in the vhost case, if the message size is large, then you do get the same result; if the message size is small, I think you do get an exit. That's really just because you're not going through the fast path, because you waited too long for the packet to come in, and that's just a tunable for vhost. Maybe. Yeah, sure. But I can think of different ways to do this more elegantly. Sure. Yes.

And just to be clear, we're doing the exact same thing as Solo5 or QEMU for the exit and the I/O handling: there's a hypercall, you wake up in the monitor, and you do the send message or receive message if it's for the network. And actually, we have been thinking that we don't have to use a virtio approach or the Solo5 approach, which just polls; we could use something more elegant and tailored to our design. I mean, if you are already in the kernel, then you don't have to do anything complicated like I/O rings; you could just forward the request from the guest to the network or to storage. So, yeah, anyway.

Ideally, in a VM world, you don't want to exit at all; that's the main thing you should optimize for. Sure. As soon as you're exiting, you're already adding latency. So what I was missing in the charts was the baseline. What you always want to optimize for in VM worlds is to not have exits, because any exit is always slow; it always goes through the slow path, whatever you do. Avoiding exits is exactly what all of the modern interfaces are about, like vhost or io_uring; even the SmartNICs do the same thing. The whole point is to never exit the VM at all when you're doing I/O. Yes, you might end up sacrificing a core, but again, especially on edge devices, you have in-order cores just lying around; you could dedicate one of those to keep handling I/O in the background. If you could integrate with what I suggested earlier, io_uring, or maybe just improve vhost to do the polling on your behalf, you might literally get the same performance numbers as bare metal, most likely. Yeah, could be. The issue is that we need to exit to ensure isolation in these cases. No? Do you disagree? I disagree; we don't need to exit. You need to make sure you only have access to the data you actually need from the polling side, but that doesn't have to be the same entity as the VMM; it can be different. Either way, I don't think we have the time to go through this in detail, so let's take this offline. Okay, thanks very much.