Welcome everyone. My name is Sergio Lopez, I'm a software engineer in the virtualization team at Red Hat, and we're going to talk a bit about lightweight VMs for serverless and containers.

To be able to talk about lightweight VMs we need to define what a lightweight VM is and in which way it differs from other kinds of VMs, which leads us to the first question: how can we classify virtual machines? There are of course many answers to this question, but we are going to focus mainly on three aspects.

The first one is the guest operating system that is going to be running inside the VM. We don't really care about the guest operating system in a general sense, so we are not going to be talking about Windows VMs versus Linux VMs versus whatever VMs, but we do care about the limitations that the guest imposes on the VM: whether we need to support hardware quirks or we need to emulate certain kinds of legacy devices, because in virtualization, compatibility is a big burden for development. The best scenario is when you have a fully customizable operating system that you can build from source, changing it in whatever way you want, and the worst, of course, is when you have to run a legacy, binary-only, proprietary operating system.

We are also going to care about the physical resources that a virtual machine requires, because the number of vCPUs may have an impact on how we present them to the VM, and depending on the architecture we may need to emulate one kind of specialized device or another. The amount of RAM also makes the virtual machine monitor overhead more or less relevant: if, for example, I'm running a 32-gigabyte virtual machine and my virtual machine monitor requires an additional gigabyte of RAM for its own purposes, well, it's not such a big deal, it's 32 gigabytes against one; but if I plan to run many small VMs with one or two gigabytes of RAM each and my monitor still requires an additional gigabyte of RAM, then I have a problem. For those not familiar with the terminology, a virtual machine monitor is the software that implements the user-space side of the hypervisor, so for KVM we have multiple VMMs like QEMU, kvmtool, crosvm, Firecracker, Cloud Hypervisor and many more. It is also important to determine whether we need to implement virtual NUMA: if we do expect to run large VMs that are potentially larger than a single NUMA node in the host, we are going to need to implement virtual NUMA to be able to partition the guest into multiple nodes and avoid cross-node accesses, and, likewise, some devices may require special treatment.

The final aspect we are going to care about is the lifecycle of the VM itself. Do we need the VM to outlive the host? If so, we are going to need some kind of migration mechanism, potentially also storage migration. And if the VM is going to be persistent, we are going to care more about the block layer requirements: we probably are going to need snapshots to be able to easily do backups, we're going to need sparse files for thin provisioning, and probably linked images.

So, once we have clarified those aspects, let's define what traditional VMs are. In a general sense, traditional virtualization was driven by the needs of the users at the time, which totally makes sense, and those needs were mainly consolidating services, supporting legacy systems on newer hardware, and improving the availability of services to tolerate hardware issues.
Those needs translate into supporting legacy operating systems, as we said before, with hardware quirks and potentially legacy devices; implementing features to support high-availability strategies, like live migration; supporting potentially large VMs, which implies having to support virtual NUMA and potentially specialized devices depending on the architecture; and prioritizing reliability, flexibility and manageability over performance in the block layer. A good example of this last point is that it's a well-known fact that raw images have better performance than feature-rich formats like qcow2, but in fact most users are actually using the feature-rich formats, because they are willing to sacrifice that performance in exchange for the features that those images provide.

So now that we've talked about traditional VMs, what's different with lightweight VMs? Lightweight virtualization is driven mainly by one need, which is strong isolation. There are different use cases and there may be secondary goals, but the main one is strong isolation, which means that the traditional set of requirements is obsolete. The guest is assumed to be fully customizable; in fact, in all the use cases we are going to explore here it can be completely altered in any way, so we don't need to support legacy operating systems. They are usually short-lived, so there is no need for live migration, no need to outlive the host. They are expected to be small, which means we don't need to support things like virtual NUMA, but on the other hand the footprint of the virtual machine monitor becomes very important. And they are usually ephemeral, which means that we probably don't need so many features in the block layer.

Okay, so now that we have established that lightweight virtual machines have different needs than traditional virtual machines, which translate into different requirements, we're going to see how some virtual machine monitors have evolved to better fulfill those requirements, and we're going to do that by looking through some use cases for lightweight VMs.

The first use case is microservices and functions-as-a-service, which technically are two different use cases but share most of the requirements, which are: running lots of almost identical small VMs, which implies that the virtual machine monitor must have a very small footprint, and getting to user space as quickly as possible, which basically means booting very fast. As a sample of this kind of use case we have AWS Lambda, which is very interesting because it's basically running lots of VMs using Firecracker. Inside each of those VMs you have a minimalist Linux kernel, you have a minimalist root file system, and then you have the customer-provided binary implementing the microservice. You also have some network endpoints and an API to discover those endpoints, but that's basically it; it's quite simple, and that's why I think that Lambda is not only a good example of the use case but also a good example of quite pragmatic engineering.
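To make that Lambda-style flow a bit more concrete, here is a minimal sketch of how a microVM is typically configured and started through Firecracker's API socket. The socket path, kernel path and rootfs path are placeholders, and it assumes a Firecracker process was already launched with `--api-sock /tmp/firecracker.socket`; the endpoints themselves (`/boot-source`, `/drives`, `/actions`) are the documented Firecracker API.

```python
import http.client
import json
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP client over a Unix socket (Firecracker only listens on one)."""
    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)

def api_put(socket_path, resource, body):
    conn = UnixHTTPConnection(socket_path)
    conn.request("PUT", resource, body=json.dumps(body),
                 headers={"Content-Type": "application/json"})
    resp = conn.getresponse()
    assert resp.status in (200, 204), resp.read()
    conn.close()

SOCK = "/tmp/firecracker.socket"   # placeholder; must match --api-sock

# Minimal guest kernel plus the serial-console boot args suggested by Firecracker's docs.
api_put(SOCK, "/boot-source", {
    "kernel_image_path": "/var/lib/microvm/vmlinux",        # placeholder path
    "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
})

# Root filesystem containing the minimalist userland and the customer-provided binary.
api_put(SOCK, "/drives/rootfs", {
    "drive_id": "rootfs",
    "path_on_host": "/var/lib/microvm/rootfs.ext4",         # placeholder path
    "is_root_device": True,
    "is_read_only": False,
})

# Boot the microVM.
api_put(SOCK, "/actions", {"action_type": "InstanceStart"})
```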
Another example is unikernel-based microservices. This was kind of a trend like two years ago, then it faded away, and now it's making a comeback, or at least I think it's making a comeback, because I was asking people around here at the conference and I was the only one with that impression, so maybe it's just me, maybe I'm mixing my desires with reality again. The good thing about unikernels is that they go one step beyond containers: basically the idea is that you are statically linking your application not only against the libraries but also with the kernel, resulting in a single image that provides everything, so you get the lowest possible footprint; you just throw that into a VM and that's sufficient.

As examples of specialized virtual machine monitors for this use case we have Firecracker and QEMU, with the recent introduction of the microvm machine type.

Let's see some highlights of Firecracker. Firecracker is written in Rust and based on crosvm, which is the Chrome OS virtual machine monitor. This is kind of interesting because crosvm isn't intended for this kind of use case; in fact, as far as I know, they are simply using it for running a single VM inside Chrome OS to run Linux applications. I don't know why they needed to write a VMM just for that, but that's the thing. Firecracker builds into a static binary, which is nice because it allows you to do things like statically analyzing the binary, seeing what kinds of things that code can potentially do, and then creating a very precise seccomp filter for the binary, which is kind of cool. It implements a minimalist machine type: for example, it doesn't implement ACPI, and instead of PCI it uses virtio-mmio as the transport. This is fine thanks to the fact that it doesn't need to be compatible with legacy operating systems, but it also has some drawbacks, the main one being the lack of hotplug support. It needs a customized but unpatched Linux kernel image, which means that you can't just take a pre-built kernel image from a distribution and boot it in Firecracker, but on the other hand you don't need to patch the Linux kernel source code; you just need a special configuration. The advantages for this use case are that you have a smaller code base and a smaller binary, you have a reduced footprint, and the guest operating system boots fast thanks to the minimalist machine type. I want to put emphasis on this last point, because when Firecracker was initially announced, people on Hacker News, Lobsters and wherever were assuming that the guest operating system booted fast because of some kind of magical optimization in the VMM, in Firecracker, which is not the case. What happens here is that you are presenting fewer devices to the guest operating system, and you are presenting devices that are easier to configure, so it boots faster.

Another example of a specialized VMM for this use case is the microvm machine type for QEMU, which, as the name says, is simply a new QEMU machine type. QEMU can be built with only microvm support to obtain a smaller binary that links with fewer dynamic libraries and has lower startup times. It's heavily inspired by Firecracker, and basically it implements the same kind of machine type with small differences; as a result it also lacks hotplug support, and it also needs a customized but unpatched kernel image.
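As a rough illustration of what a minimal microvm invocation looks like, here is a small launcher sketch. The kernel and rootfs paths are placeholders, and this is just one reasonable combination of flags (modeled on the invocation style in QEMU's microvm documentation), not the only possible one.

```python
import subprocess

# Placeholder artifacts: a kernel configured with virtio-mmio support, and a raw rootfs image.
KERNEL = "/var/lib/microvm/vmlinux"
ROOTFS = "/var/lib/microvm/rootfs.img"

cmd = [
    "qemu-system-x86_64",
    "-M", "microvm",                      # the minimalist machine type discussed above
    "-enable-kvm", "-cpu", "host",
    "-m", "512m", "-smp", "1",
    "-kernel", KERNEL,
    "-append", "console=ttyS0 root=/dev/vda rw",
    "-nodefaults", "-no-user-config", "-nographic",
    "-serial", "stdio",
    # virtio-mmio device variants; there is no PCI bus on microvm
    "-drive", f"id=root,file={ROOTFS},format=raw,if=none",
    "-device", "virtio-blk-device,drive=root",
]

subprocess.run(cmd, check=True)
```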
The advantage for this use case is that the guest boots fast, again thanks to the minimalist machine type. I have some numbers for this VMM: it boots, from QEMU's entry point to user space, on a VM with a virtio-net device and a virtio-block device, in roughly 60 milliseconds for Linux and 3 milliseconds for OSv, which is a unikernel. And it has a smaller footprint than other QEMU machine types, which is nice.

Okay, so let's do this: Firecracker versus QEMU microvm. Which one is better? As judged by the guy who wrote the QEMU microvm machine type, so I'm totally impartial. We've seen that there are different aspects in both VMMs, and we've also seen some similarities. So I put the pros and the cons in a spreadsheet, I ran some complex algorithm on that spreadsheet, and I reached the conclusion that the best one is the one that better fits your use case, obviously. In general, Firecracker is expected to have a lower footprint, but QEMU, on the other hand, provides more features and usually better performance. But this depends on a large number of factors: the guest operating system, your kind of workload, your host. So the best thing you can do, if you are considering this use case, is to try both, evaluate the pros and the cons, and see which one works better for you. And the important thing here is that now we have more VMMs than ever; we can learn from each other, we can improve the KVM ecosystem and liberate the world from those pesky closed-source hypervisors, which is the goal in the end.

And with this we go to the second use case, which is additional isolation for containers. The requirements are basically being able to run lots of different small-to-medium-sized VMs. We also need to support hotplug for growing the VMs. This is because in Kubernetes terminology each VM doesn't hold a single container but a pod, and a pod can be composed of one or more containers. The way this works is basically that when you create the first container, the VM is created with only the requirements of that particular container, and when a second container is going to run in the same pod, the VM is expanded by hotplugging vCPUs and RAM (see the QMP sketch below). That's the reason why you need hotplug. The main example is obviously Kata Containers, and it has specialized VMMs: we can count QEMU, after incorporating the NEMU and qemu-lite improvements (we'll see something about this in a minute), and Cloud Hypervisor, which is still in early stages.

What's the thing about NEMU and qemu-lite? Basically, Intel ran two friendly forks of QEMU for experimenting with the features they needed, which were mainly faster boot times and a lower footprint. As of today, those improvements have been merged into mainline QEMU, so QEMU is now the preferred VMM for running Kata Containers once again. And this is an example of how QEMU has evolved, through friendly forks, to better suit this use case.

The second VMM we are going to talk about is Cloud Hypervisor, which is an experimental VMM sponsored by Intel for this particular use case. It's also written in Rust and it's based on rust-vmm, which is an umbrella project holding crates that implement functionality required by Rust-based VMMs; it's basically a way for Firecracker, Cloud Hypervisor and potentially other Rust VMMs to share code and avoid duplicating efforts. It emulates a machine similar to the one emulated by Firecracker and QEMU's microvm, but with a device model that includes ACPI and PCI to support hotplug.
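To give an idea of how that "pod grows, so the VM grows" flow looks with QEMU today, here is a sketch of hotplugging one extra vCPU over QMP. It assumes a guest started with a standard machine type (not microvm, which as mentioned lacks hotplug), with spare vCPU slots (for example `-smp 1,maxcpus=2`) and a QMP socket at a placeholder path; Kata Containers drives the equivalent operations from its runtime rather than from a script like this.

```python
import json
import socket

QMP_SOCK = "/tmp/qmp.sock"   # placeholder; e.g. started with: -qmp unix:/tmp/qmp.sock,server,nowait

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect(QMP_SOCK)
qmp = sock.makefile("rw")

def execute(command, arguments=None):
    """Send one QMP command and return its reply, skipping asynchronous events."""
    msg = {"execute": command}
    if arguments:
        msg["arguments"] = arguments
    qmp.write(json.dumps(msg) + "\n")
    qmp.flush()
    while True:
        reply = json.loads(qmp.readline())
        if "return" in reply or "error" in reply:
            return reply

json.loads(qmp.readline())        # consume the QMP greeting banner
execute("qmp_capabilities")       # leave capabilities-negotiation mode

# Find a CPU slot that is not occupied yet (no "qom-path") and plug a vCPU into it.
slots = execute("query-hotpluggable-cpus")["return"]
free = next(s for s in slots if "qom-path" not in s)
execute("device_add", {"driver": free["type"], "id": "hotplugged-cpu0", **free["props"]})
```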
Apart from that, it also builds into a statically linked binary.

Before going to the next use case, I want to stop for a minute to talk a bit about hotplug. If you remember, we highlighted that Firecracker doesn't support hotplug, QEMU microvm doesn't support it either, and Cloud Hypervisor was modified on purpose to add hotplug support. What's happening here? The thing is that nowadays hotplug support is implemented after the way in which you would do it on a physical machine. So if you are virtualizing x86, you need to do the same things you would do on a physical x86 machine, which is not as straightforward as it could be. That complexity makes sense when you are dealing with physical hardware, but it doesn't make as much sense when you're dealing with a virtual machine. So it's technically possible to implement a hotplug model specific to virtual machines that would simplify the process and allow many optimization tricks. Work in that direction is still in early stages, and there is some opposition because it would mean duplicating existing functionality, and it requires changes in both the VMMs and the guest kernels, but it seems that there are more arguments in favor of doing it, so perhaps it's something that is worth a try. That would give us hotplug support in both Firecracker and QEMU microvm without ACPI and PCI.

At this point, some of you may be wondering why ACPI and PCI are a problem for microVMs and why we prefer not to implement them. The main reason is that both ACPI and PCI are very complex, for good reasons when you are dealing with physical hardware, but those reasons are not as strong when you are running virtual machines. That complexity translates into a larger codebase to support those devices, and it also translates into more cost at boot time, so the guest takes more time to boot. That's the reason why we prefer not to implement those devices in microVMs.

The last use case is implementing a trusted execution environment in a VM. The requirements are: running VMs with hardware-assisted memory encryption like AMD SEV, providing remote attestation for the contents of the VM, and providing functionality similar to the one provided by Intel SGX. The main example is Enarx, which is in early stages. I'm not going to go much into the details about how a TEE works, because it's very complex and there was a talk about this yesterday, but I think it's nice to talk about Enarx in this presentation to illustrate how different things can be when you don't need to support legacy operating systems and you have full control of both the VMM and whatever is running in the VM. For example, in the case of Enarx, inside the VM there will be a micro- or unikernel (still to be defined), there will be a WebAssembly JIT, and there will be a program built against the WebAssembly System Interface; and instead of emulating devices, which is kind of interesting, Enarx will implement a mechanism for the kernel to request syscalls from the host. So if you need network access, instead of providing a virtio-net device, which would give you a network interface in the guest, what Enarx will provide is a mechanism for the kernel to request opening a connection to some host and port, to read from the socket, write to the socket, poll the socket, whatever. The syscall will be parsed by the VMM and served in part by the VMM and in part by the host through the VMM. As I said, it shows how different things can be.
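This is not Enarx's actual interface, which as just said is still being defined; purely as a toy illustration of the general idea, here is a sketch of a VMM-side handler that performs a proxied "connect" and "write" on behalf of the guest instead of exposing a network device. All names and the little JSON-style request format are made up for this example.

```python
import json
import socket

def serve_proxied_syscall(req, open_conns):
    """Host/VMM-side handler: parse a guest request and perform the real syscall on the host."""
    if req["op"] == "connect":
        s = socket.create_connection((req["host"], req["port"]))
        handle = len(open_conns)
        open_conns[handle] = s
        return {"ok": True, "handle": handle}
    if req["op"] == "write":
        n = open_conns[req["handle"]].send(req["data"].encode())
        return {"ok": True, "written": n}
    return {"ok": False, "error": "unsupported op"}

# The "guest" never sees a NIC; it just asks for the operations it needs.
conns = {}
print(serve_proxied_syscall({"op": "connect", "host": "example.org", "port": 80}, conns))
print(serve_proxied_syscall({"op": "write", "handle": 0,
                             "data": "GET / HTTP/1.0\r\nHost: example.org\r\n\r\n"}, conns))
```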
Conclusions: VMs are cool again. To be honest, traditional virtualization was becoming a bit boring, for good reasons, because it's a mature technology, which means that we have products that are more stable, which is great; it makes customers happy, that's cool. But from an engineering point of view it's a bit boring, because there are no breakthroughs; evolution is slow and incremental. The fun thing is that we've seen that breaking the ties with compatibility, with live migration, with the storage requirements opens a lot of possibilities. We've seen that in the serverless and container spaces, lightweight VMs will play a significant role providing additional isolation. We know for a fact that in the public cloud, lightweight VMs will be essential for hardware-assisted, remotely attested memory encryption, which is very complex, but it's basically the best way we have right now to ensure that your data is not accessible by anyone, not even the cloud provider. And with all these things in mind, let's work on improving the lightweight VM landscape, let's find new use cases, let's maybe even create new VMMs, do weird things that weren't possible when you were tied to compatibility, and in general have fun. So that's all I have to say. I hope you enjoyed it, and thank you. Any questions?

Sorry, I can't hear you. So the question is whether the best way to initiate a shutdown is to generate a triple fault from the guest. Yeah, that's true. This is going a bit into detail, but basically what you do is override the interrupt vectors and generate a page fault; that turns into a double fault, then into another fault, and then you have a triple fault and that's it. Thank you. Any more questions?

Well, it's not hard by itself, but it means implementing support for a new thing. It was just an example of how Firecracker and other cloud hypervisors can get away with removing functionality and still work. And it's not the only one; we could name many other features that other hypervisors implement and these ones don't need. It was just an example. You're right, it's a good point, it's not really that hard to implement; it's just another thing you need to implement.

So the question is about how to do DMA to the devices. Basically, nothing changes here: you have the IOMMU in the same way, you can implement it. Without PCI you can do basically the same things. The main issue, which is outside the topic of the presentation, is about out-of-process device emulation, so you can have the device drivers outside the main VMM process and communicate with them with some kind of protocol; but that's outside this presentation, for the next one, perhaps.

Sorry, there was some noise and I couldn't hear. Well, I'm not aware of many of them. The question is about other security extensions you could use instead of running a VM. Well, I don't know much about the details, but I know that Google's gVisor tries to do something like that; they can use ptrace or they can use KVM. But in the end they are basically running a very small kernel inside the VM, and it can seem that there is no operating system, but there is an operating system in fact, just a very small one. Any more questions? No, we're out of time. We're finished. Thank you.