Hi everybody, I'm Sebastian. I'm working for Intel and I'll be talking today about building a VMM in Rust for the Edge. So let's get started with the need for security. We're going to look at the reasons why we need security in the virtualization use case. Then we'll look at the two pillars we rely on to achieve the security that we're looking for, which are Rust and rust-vmm. Then we'll look at the different use cases that we have at the Edge, namely the pet versus cattle VM use cases, and we'll finish the presentation diving into the details of the Cloud Hypervisor project as well as the Rust Hypervisor Firmware.

Let's look at this need for security. If you look at a virtualization stack, there are two possible types, type 1 and type 2. In the case of type 1 you have the hardware on which a hypervisor is running, and because this hypervisor has access to the full hardware, since it's running in ring 0, it can completely interact with it and create virtual machines on demand. There are multiple projects implementing type 1 hypervisors, such as Xen, ACRN, ESXi from VMware and Hyper-V from Microsoft. On the other side we have the type 2 case. It's slightly different because on top of the hardware here it's not the hypervisor directly but an operating system. We'll take the example of Linux with KVM: the hypervisor is basically a kernel module inside your kernel, and on top of this you need a user space layer, the VMM, to interact with the guest and, more than that, to actually create the VMs. The point is to let user space run virtual machines directly, and when you look at it, a lot of things, such as the device model and the communication between host and guest, go through this VMM. That's why it's so important to get the VMM properly secured. And just as a quick comparison between the two types, we can see that in the type 2 case the VMM is quite an important piece of the virtualization stack.

So we talked about why the VMM is important; now let's talk about what we're protecting against. We have this typical situation with all CSPs, and by CSP I mean cloud service provider of course: a CSP owns a large amount of hardware, and on these servers they're basically running loads of VMs for different tenants or customers. What they don't want is a potentially malicious guest being able to escape from the virtualization layer and attack the hardware, which would open up denial-of-service scenarios, or being able to access data from other VMs, which means it could steal data from other customers. So they definitely want to strengthen the whole stack that they offer to their customers.

Let's now look at the pillars that we're going to rely on to achieve this security. The first one is Rust. The main point about Rust is that it's kind of the new kid on the block, even if it's not really new because it's been around for 10 years, but it's only recently that big companies have started acknowledging that Rust is the right language for critical projects which require high security. The main selling point for Rust is the memory safety that's provided by the Rust compiler. There's a concept of memory ownership that basically gives you memory safety for free, and the program won't even compile if you don't follow the ownership rules. Because of this there is no more issue with your program trying to access memory after it's been freed, or trying to access uninitialized memory. All these matters are taken care of by the compiler.
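To make that ownership point concrete, here is a tiny sketch of mine (not from the slides) showing the kind of bug the compiler simply refuses to build:

```rust
// Minimal illustration of Rust ownership: once a value has been moved or
// dropped, any later use is a compile-time error, so use-after-free and
// similar memory bugs never reach production.
fn main() {
    let buffer = vec![0u8; 4096];   // `buffer` owns the allocation
    let guest_ram = buffer;         // ownership moves to `guest_ram`
    // println!("{}", buffer.len());    // error[E0382]: borrow of moved value `buffer`
    drop(guest_ram);                // memory is freed here, deterministically
    // println!("{}", guest_ram.len()); // error[E0382]: use of moved value `guest_ram`
}
```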
The second very interesting thing about Rust is that it's highly performant, and you can compare that to a simple C program: it's just as efficient because there is no garbage collection involved in Rust. So we get security and we get performance. The last interesting thing about Rust is how it manages dependencies. Dependencies in Rust are called crates, and there's a tool provided by Rust called Cargo. This tool helps developers with the maintenance burden of updating crates, which means that you get security kind of for free by making sure that your project is easily updated to the latest versions of the dependencies you have.

The second big player in building a secure VMM is rust-vmm. This is a project started three years ago by the different stakeholders in this field, where basically we had Google, AWS and Intel all trying to build a VMM in Rust. We obviously had different use cases and scopes in mind, and that's why we couldn't contribute to each other's projects, but what we wanted to do was to rely on the same core components, and that's what rust-vmm really is: a set of common components that can be used across multiple VMM projects. Here you can see the list of crates that are provided by rust-vmm, and in the context of Cloud Hypervisor we rely on some of them already. There are some that we're not interested in, but the ones you can see with the red dashes are the ones we're actually working on, so that we will soon rely on those in the project too, because they're very important in the context of a VMM (there's a small example of using one of these crates just below).
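As a flavor of what those shared components look like, here's a small hedged sketch using the rust-vmm vm-memory crate, one of the building blocks a VMM like Cloud Hypervisor consumes; the crate version, the `backend-mmap` feature and the exact method names should be checked against the current vm-memory documentation:

```rust
// Hedged sketch: allocating and accessing guest RAM through the rust-vmm
// `vm-memory` crate (assumes vm-memory with the "backend-mmap" feature in Cargo.toml).
use vm_memory::{Bytes, GuestAddress, GuestMemoryMmap};

fn main() {
    // One 64 KiB region of guest memory starting at guest physical address 0.
    let mem: GuestMemoryMmap =
        GuestMemoryMmap::from_ranges(&[(GuestAddress(0), 64 * 1024)])
            .expect("failed to create guest memory");

    // Every access is bounds-checked against the guest address space.
    mem.write_slice(&[0xde, 0xad, 0xbe, 0xef], GuestAddress(0x100)).unwrap();
    let mut buf = [0u8; 4];
    mem.read_slice(&mut buf, GuestAddress(0x100)).unwrap();
    assert_eq!(buf, [0xde, 0xad, 0xbe, 0xef]);
}
```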
So let's now look at the two main use cases that we can get in the edge context. On one side we have what's called a pet VM. It's usually a manually built and manually managed VM, and it has a very specific purpose from the user's perspective; sometimes it's about making sure that you can run a 5G stack. That's why we need accelerators like DPDK, and things like SPDK in the storage context as well. We also need support for some kind of management layer, and the most common one is actually libvirt; that's why it's important to support the libvirt API to cover this pet VM use case. The last point is about making sure that you can boot those pet VMs, because they're based on cloud images. Those cloud images rely on EFI boot, so having the support for an EFI firmware is very convenient when you want to boot that kind of image.

The other case is called the cattle VM, or container VM as I actually prefer to call it. It's basically the cloud native context. You will not manage the VMs directly; as an end user what you really want is to be able to run containers, and you want to run those containers securely. That's why they're eventually run in VMs, but it's totally transparent to you. The entire management of those VMs is done directly by the management layer, which in that case goes basically from something like Kubernetes down to Kata Containers and then down to the VMs; it's all automatically managed, and that's why it's completely different from the pet VM use case we saw earlier. In that context you can think about what happens when you create a pod. The pod has multiple containers in it. If you're trying to extend the pod at some point in time by adding more containers to it, you're effectively going to need the VM, because the VM is the pod, to also be updated in terms of resources, and that's why we need support for dynamic resource resizing.

The need for direct kernel boot basically comes from the fact that you know about the environment: Kata Containers handles that, and the management layer knows exactly what is going to run inside the VM, so there is no need to overload the boot with an EFI firmware layer. With direct kernel boot we can achieve a pretty fast boot for each container. The last point is about being able to communicate with what's usually called a guest agent: there's a program running in the guest managing all the containers on behalf of the host, which means that the host and the guest must be able to communicate through some virtual socket (you'll see how both of these show up on the command line just below).

Okay, so now that we've looked at all the reasons why we need the security, how we can achieve that security, and also the different use cases we can have in the edge context, let's look at the Cloud Hypervisor project. Cloud Hypervisor is a project that was started in 2019. I kind of said it earlier, but it basically came from Google's project called crosvm and AWS's project called Firecracker. It's kind of a fork from those projects, and then we added a bunch of things on top. The reason why we didn't contribute to those projects is because they have a very different scope in mind; they're trying to tackle different issues, and that's why we couldn't contribute directly to them. We are working closely together on the rust-vmm project though. So I think you've got that by now: Cloud Hypervisor is written in Rust and relies heavily on rust-vmm.

The main ideas behind the Cloud Hypervisor project are about being able to run modern cloud workloads. Basically, we don't want someone to bring a legacy image or legacy workload and expect this workload to run with a floppy disk, because we actually refuse to implement legacy devices such as a floppy disk. You simply won't be able to run that kind of legacy workload on Cloud Hypervisor. Instead we want to rely on what's more modern, like EFI-based images, relying heavily on paravirtualization. That leads to my second point, which is making sure we have a minimal device emulation in terms of device model. We want to make sure that this device model of course covers all the use cases we enumerated earlier, but we want to use paravirtualized devices as much as possible, and when I'm talking about paravirtualized I'm basically talking about virtio devices. Unfortunately we still have a few legacy devices, such as the CMOS or the serial port, but that's because if you look at OSes like Linux and Windows, we can't make them boot without those devices. They just assume they will be there, so there is no other alternative than providing those legacy devices directly. The third point is that, yes, we're very opinionated and pragmatic about the set of features that we accept and want to implement, simply because if there's not a clear need for a feature, we will refuse to implement it, and if the feature we're talking about is not following the modern cloud idea that we mentioned, we're not implementing it either. So we've already talked about security, but security is not only about Rust and rust-vmm: it's also about the way the VMM has been designed, and by this I mean the devices that we chose to actually implement.
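To tie the direct kernel boot, virtio and vsock points together, here's roughly what launching such a VM looks like from the command line. This is a hedged sketch of mine: the paths and sizes are made up, and the exact option syntax should be checked against `cloud-hypervisor --help`:

```sh
# Hypothetical invocation: direct kernel boot, a virtio-block disk and a vsock device.
cloud-hypervisor \
    --api-socket /tmp/ch.sock \
    --kernel ./vmlinux \
    --cmdline "console=hvc0 root=/dev/vda1 rw" \
    --disk path=./cloud-image.raw \
    --vsock cid=3,socket=/tmp/ch.vsock \
    --cpus boot=2 \
    --memory size=1G
```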
In this context you can definitely see that we're focusing almost entirely on paravirtualization, because that lets us concentrate on strengthening only the virtqueues, which are the communication channel between guest and host, and we don't have to secure every single legacy device that we would otherwise have to support, which makes everything way easier in terms of security. Similarly, we decided not to implement vhost devices, where the backend lives in the kernel, because we thought that would give too large an attack surface to the guest. Say the guest managed to escape and get access to the host kernel: that would be kind of dramatic, and that's also one of the reasons why we chose not to implement devices like the vhost kernel ones, even if they actually bring better performance. And of course we're trying to cover the two VM use cases that we talked about.

Cloud Hypervisor supports multiple architectures, x86_64 and aarch64. We started by supporting x86_64 because the project was initiated by Intel, obviously, but it's important to note that we had very nice contributions from our friends from Arm. They've been working super hard on making sure that Cloud Hypervisor can run on aarch64 platforms as well, and I think we can say that we've reached parity in terms of features supported by both architectures right now.

Cloud Hypervisor, again, has been designed with the idea of running in a type 2 use case on top of KVM, but there's been some interest recently from Microsoft, with some contributions basically pushing for the support of MSHV. MSHV is a kernel module at the same level as KVM, but the difference is that it's not the hypervisor itself: MSHV is just an abstraction layer to reach out to the Hyper-V hypervisor underneath, the type 1 hypervisor that's living underneath. So that allows Cloud Hypervisor to support both the type 2 and, kind of, the type 1 use case; it's a sort of hybrid type 1 here. This work on supporting multiple hypervisors was also the opportunity to create a proper hypervisor abstraction layer in the Rust code that we have.

Cloud Hypervisor supports multiple guest OSes. We started with Linux, but then, with some help from Microsoft folks as well, we've been able to boot Windows guests, and that's something we're pretty excited about, because it helped us broaden the scope and the relevance of Cloud Hypervisor for our potential customers, since we find multiple workloads which expect to run on Windows. So that's a good point.

Then we have migration. Migration is quite complex, but it's always been a strong requirement from CSPs, and the reason is that CSPs have customers running VMs, and they must be able to migrate a running VM of one of their customers within their infrastructure, so that they can update the hardware or the software stack running underneath. For all these reasons they have a strong need for migration support, and so we do support live migration in Cloud Hypervisor, as well as snapshot/restore in case you actually want to clone an existing VM that you have running.
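To give a rough idea of how that snapshot/restore flow is driven, here's a hedged example; the socket and snapshot paths are made up, and the exact syntax should be checked against the Cloud Hypervisor snapshot/restore documentation:

```sh
# Pause the running VM, then snapshot it to a directory (hypothetical paths).
ch-remote --api-socket /tmp/ch.sock pause
ch-remote --api-socket /tmp/ch.sock snapshot file:///var/lib/vms/snap1

# Later, or on another machine with access to the same storage,
# bring up a new VMM instance restored from that snapshot.
cloud-hypervisor --api-socket /tmp/ch-restored.sock \
    --restore source_url=file:///var/lib/vms/snap1
```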
We're now going to talk about NUMA. It's really when it comes to performance that we start looking at NUMA, especially in the context of CSPs, where the hardware they have very often relies on multiple sockets. They have huge servers with at least two sockets, meaning that accesses from a core on socket one to the RAM on socket two take much more time than accessing the RAM on the same socket. For that reason Cloud Hypervisor makes sure we give full control to the user: through our options in the CLI or the HTTP API, we give the user all the control they need to expose NUMA nodes correctly, so that guest workloads run efficiently and don't try to reach out to the RAM of another socket from a different CPU, which would otherwise cause a huge drop in performance.

The direct device assignment use case is something that's also called device passthrough. It is important when, as a customer, you care about high performance in the context of virtualization. You're running a VM and you want to have access to, for instance, a network card, to run a workload that needs to process network packets very quickly; or in the AI computing context you might want to do some machine learning computation in your VM, so you want access to a GPU directly. Direct device assignment means that you attach this device to the VM so that it's no longer available from the host and it belongs entirely to the VM. It's secure because the device sits behind a physical IOMMU, which ensures that all the DMA transfers between the device and the guest RAM are validated before they actually happen. And in terms of performance, because everything happens directly between the device and the RAM, and since there is no VM exit involved, especially when the platform supports posted interrupts for the interrupt handling, you can pretty much achieve native performance for this device within the VM. So obviously that's a very important use case for us, which we support in Cloud Hypervisor.

We talked about dynamic resource resizing in the container VM use case, and here we'll look at it in a bit more detail. First, we want to be able to hot plug CPUs: that means being able to add or remove CPUs in an existing VM, because, for example, you're adding a new container that requires two more CPUs, so let's add two more to the VM so that the container will have access to those CPUs. The mechanism we rely on is ACPI based, to notify the guest about a CPU being added or removed. We have the same kind of thing for memory: we want to be able to extend or shrink the guest memory depending on the needs of the pod, as containers are added or removed. Here we have multiple ways of performing this memory hotplug: either an ACPI based mechanism, the same as for CPUs, or we can use two virtio devices called virtio-mem and virtio-balloon. virtio-balloon has an interesting feature as well, called deflate-on-OOM, which means that if the guest runs out of memory, the balloon will deflate to give memory back to the guest, so that's also something people might be interested in. The last important point is being able to add or remove devices. Here we're talking about PCI devices, and again it's based on the ACPI mechanism to notify the guest that a new virtio-block device, for instance, has just been added or is about to be removed.
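From the user's point of view, all of this resizing is driven through the CLI and the HTTP API that I'll come back to in a minute. As a hedged example (socket path, sizes and field names are from memory of the API documentation, so double-check them):

```sh
# Grow a running VM to 4 vCPUs and 4 GiB of RAM via ch-remote (hypothetical socket path).
ch-remote --api-socket /tmp/ch.sock resize --cpus 4 --memory 4G

# The equivalent request sent straight to the HTTP API over its UNIX socket.
curl --unix-socket /tmp/ch.sock -i \
     -X PUT 'http://localhost/api/v1/vm.resize' \
     -H 'Content-Type: application/json' \
     -d '{"desired_vcpus": 4, "desired_ram": 4294967296}'
```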
Now let's talk about the interfaces to interact with Cloud Hypervisor. We have the command line interface, the CLI, but we also have an HTTP API that initially comes from the Firecracker code base and that we ported over, and it's very convenient. You might start your VM with the CLI, but if during the runtime of your VM you want to update it, as we just saw with the dynamic resizing, you need to interact with the VMM to update the VM, and it's through this HTTP API that you can ask Cloud Hypervisor to update your VM, or get information about your VM, directly. The HTTP API is also interesting because it's the interface used by both libvirt and Kata Containers when it comes to managing the Cloud Hypervisor VMM, and the underlying VM as well. In the case of libvirt there's a libvirt driver that relies on this API, and Kata Containers interacts directly with the API too. So that's the main interface provided to our users.

Next is confidential computing. Intel SGX is the first item. This is something that was merged quite quickly into the upstream Linux kernel, and we do have the support in Cloud Hypervisor, which means the guest can create memory enclaves that are encrypted, so that part is handled. On the other side there is an even more exciting technology coming from Intel, which is Intel TDX. It's still under active development, and that's why we only have experimental support for it, but the big selling point of Intel TDX is that we will be able to run VMs where the guest has full control of its memory, meaning that unless the guest gives the host access to memory regions, the VMM will not be able to see what the guest can see. That is really confidential computing, because here we're saying that even if somehow a guest was able to compromise the host, the host would have no way of stealing data from the other guests running on the same machine.

Back to security, we also have support for seccomp filtering. If you look at Cloud Hypervisor in terms of processes and threads, we have the main thread being the VMM one, and then it spawns a bunch of different threads for each component. Here's a simple example where we run a VM with a virtio-console, two virtio-block devices and two vCPUs, and you can see that each thread is separately constrained by a list of seccomp filters. That means it's only authorized to issue a specific list of system calls, and if it tries to issue a system call that's not in the list, the entire application, so the entire VMM, will be killed by the host kernel.
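To give a feel for what one of those per-thread filters looks like, here's a hedged sketch using the rust-vmm seccompiler crate; the real Cloud Hypervisor filters are generated per thread and are far more complete, and the exact seccompiler API details here are from memory (it assumes `seccompiler` and `libc` as dependencies):

```rust
// Rough sketch: build and apply an allow-list seccomp filter with the
// rust-vmm `seccompiler` crate. The syscall list is illustrative only.
use seccompiler::{apply_filter, BpfProgram, SeccompAction, SeccompFilter, SeccompRule};
use std::collections::BTreeMap;
use std::convert::TryInto;

fn main() {
    // An empty rule vector means "allow this syscall unconditionally".
    let rules: BTreeMap<i64, Vec<SeccompRule>> = [
        (libc::SYS_read, vec![]),
        (libc::SYS_write, vec![]),
        (libc::SYS_exit_group, vec![]),
    ]
    .into_iter()
    .collect();

    let filter = SeccompFilter::new(
        rules,
        SeccompAction::KillProcess, // any syscall not in the list kills the VMM
        SeccompAction::Allow,       // listed syscalls are allowed through
        std::env::consts::ARCH.try_into().unwrap(),
    )
    .unwrap();

    // Compile to a BPF program and install it on the current thread.
    let bpf: BpfProgram = filter.try_into().unwrap();
    apply_filter(&bpf).unwrap();
}
```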
We've been talking about paravirtualization a lot so far, and of course I've been mentioning virtio as being this paravirtualization layer, so here's the list of the virtio devices that we do support. We have virtio-balloon and virtio-mem, which we already talked about for resizing the guest RAM. There's virtio-block when it comes to exposing a block device; usually it's the image that we expose to the guest VM. Then virtio-console, to get input and output from and into the VM. virtio-iommu is a pretty recent one, it's been merged in kernel 5.14 actually, and it allows a guest to see a virtual IOMMU, which means we can attach any device behind the virtual IOMMU, and eventually, in the nested use case, we can also use VFIO, the direct device assignment feature, to pass the device into a second layer of virtualization. virtio-net is of course about providing a network interface to the guest. virtio-pmem is about creating a persistent memory device inside the guest; it shows up as a block device as well. virtio-rng is the random number generator device, which is usually backed by /dev/urandom on the host. And virtio-vsock is the virtual socket I was talking about earlier, when I mentioned we needed a communication channel between the host and the agent running in the guest in the container use case.

After virtio we have vhost-user. vhost-user is basically exposing virtio devices to the guest, but on the host side things are slightly different, because it allows running the backend in a separate process, and it's very convenient because it gives us flexibility. We can tell our customers: hey, if you want to bring your own backend for the block device, your own implementation, because you have specific needs, then you can do that; you just use the vhost-user-block implementation of Cloud Hypervisor and you plug your backend into it (there's a small command-line sketch of this just after the vfio-user part). If we come back to the discussion about DPDK and SPDK, they both rely on the vhost-user protocol, so if you want to use those software accelerators you're actually going to need the vhost-user support. And the last point, which is actually pretty important, is in terms of security: by having vhost-user devices, so by having this backend running in a different process, you give your users and customers the opportunity to run this separate process with a specific set of constraints. If they want to constrain it a specific way they can do that; it's not hardcoded in the Cloud Hypervisor implementation itself, so it gives even more security to our users.

Then vfio-user, that's a feature I really wanted to talk about. It's very recent, it's not stable yet, so the implementation is experimental in Cloud Hypervisor, but it's very interesting because it kind of takes the benefit of vhost-user, where you run your backend implementation in a different process, so you get all the flexibility and the security, but on top of this, here we're not exposing virtio devices to the guest. Instead those backends are going to show up as PCI devices inside the guest. That means that if you come up with, say, an NVMe backend, you plug this into Cloud Hypervisor and you will see this NVMe device in your VM, attached directly to the PCI bus. And so the big benefit, on top of the flexibility and the security, is really to tell our customers: hey, if you actually want to implement legacy devices, now you kind of can. We will not be responsible for making sure that those legacy devices are properly implemented, because you come up with the implementation, but we make sure that the vfio-user implementation is strong and secure, and then you can come up with any device you need. So we don't compromise on our motto of running a legacy-free VMM, but we still give the opportunity to potential users to have these legacy devices showing up in the guest.
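Coming back to the vhost-user split for a moment, here's a hedged illustration of that two-process model. The binary name and option syntax are from memory of the Cloud Hypervisor docs and may have drifted, and the paths are made up, so treat this as a sketch:

```sh
# Run the block backend as its own, separately constrained process...
vhost_user_block --block-backend path=./data.img,socket=/tmp/vub.sock

# ...and point the VMM at its vhost-user socket instead of emulating the device
# itself (vhost-user requires the guest memory to be shared, hence shared=on).
cloud-hypervisor \
    --kernel ./vmlinux \
    --cmdline "console=hvc0 root=/dev/vda1 rw" \
    --memory size=1G,shared=on \
    --disk vhost_user=true,socket=/tmp/vub.sock
```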
And here is a summary of the features we went through. We can see different things: the features, yes, but also the abstraction layers that we have in the code right now. The CPU manager, the memory manager and the device manager are the three big components, and the hypervisor abstraction is the one we talked about with MSHV. That's pretty much it for the overview of the virtualization stack and how Cloud Hypervisor defines interfaces in there.

Before I'm done with the Cloud Hypervisor section, I just wanted to give a big shout out to all our contributors. If the project is where it is today, it's partially because of them; they've been contributing a lot to different features, as I already mentioned when we dug through the features, but I just wanted to give them a big thank you, because they helped a lot, and I'm really looking forward to more contributions from them. That's what I really like about open source, actually.

So I'll finish this presentation by quickly going through the Rust Hypervisor Firmware. The idea of this project was to create a very minimal firmware that would support modern cloud images, again, and that would be as secure as possible; that's why we wrote it in Rust. But it was also important for this to be very minimal, and so the only use case we wanted to support was booting EFI images, because that's the standard for most cloud images right now, at least the modern ones. So we want to have this minimal EFI support that you can find in bigger projects like TianoCore EDK2, but again, that was not the goal, we didn't want to end up with a complex implementation; our firmware is only a very small one. We do have support for the boot loader specification, so if you have an image that relies on systemd-boot, we will be able to boot this image from the firmware. It's loaded as an ELF binary, and that's important, because if you look at the Cloud Hypervisor or even the QEMU command line, when you usually do direct kernel boot you provide the kernel image; the point here is to avoid modifying any of the command line. You can just replace the kernel image with this firmware image directly, and from your disk image, which contains the EFI bits, the Rust Hypervisor Firmware will be able to boot directly. Of course, this Rust Hypervisor Firmware contains a PVH section so that the VMM is able to boot it that way. There is support for the GUID partition table (GPT) in order to boot from different partitions, and the last thing is that we support only one type of block device, and that's basically virtio-block over PCI. So if you look at the --disk option, which will basically create a virtio-block device for you, you pass your image through this backend; otherwise the firmware won't be able to find the EFI image and you won't be able to boot from it.
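In practice the firmware simply takes the place of the kernel in a direct boot. A hedged example (the image names are made up, the options are as I remember them from the project README, so verify against it):

```sh
# Boot an EFI cloud image by loading the Rust Hypervisor Firmware in place of a kernel.
cloud-hypervisor \
    --kernel ./hypervisor-fw \
    --disk path=./focal-server-cloudimg-amd64.raw \
    --cpus boot=2 \
    --memory size=1G
```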
So I'm done with the presentation now. I hope you enjoyed it, and I hope you're now interested in learning even more about Cloud Hypervisor, and maybe starting to play with it and, of course, contributing to it. I'm just giving you a few links: the first two are the repositories for the two projects, Cloud Hypervisor and the Rust Hypervisor Firmware, and the third one is the link to the GitHub organization that gathers all the rust-vmm crates; you'll find all of them in that organization. And the last link is there in case you have more questions for us: you can reach out on Slack at this URL. I'm done now, so I'm going to thank you for attending today, and see you later.