Okay, so let's start. I'm here to present our work as an alternative to nested virtualization. The name of the presentation is "No More Turtles"; it's a reference to the original nested virtualization paper, the Turtles project.

So what's the problem here? The problem is that we have cloud customers who want to spawn VMs. For example, on KubeVirt you have VM-based workloads, where the workers are VMs, or with Kata Containers, where you use virtual machines to strengthen isolation boundaries. Right now you only have two options.

You can use virtual machines on bare-metal systems, but bare-metal systems are quite expensive to rent and you don't have many options for customizing them. You cannot really choose a small bare-metal system: you either get tons of cores and tons of RAM, or you don't get it at all.

Or you can use nested virtualization: you rent standard virtual machines and use them to spawn nested virtual machines. But that has other issues. You miss some features with nested virtualization, for example confidentiality: you cannot use hardware encryption such as Intel TDX, AMD SEV, or IBM PV. You also have a problem in terms of security, because the code base needed to enable nesting is quite large, and a larger code base means a higher chance of bugs. And you still lose a bit of performance with nested VMs compared to standard VMs. For these reasons most cloud providers don't allow it, so even if you wanted to use it, you can't, because it's not available to you.

What we thought of instead was to flatten the hierarchy. Basically, we create a special VM, which we call the primary VM, that is able to talk to the host and ask for things. The VM is booted with some resources: a number of CPUs, some storage, some memory, and so on. It can ask the host to take some of these resources away from it and use them to spawn secondary virtual machines, and it is then able to control them: access them, delete them, pause them, modify them, and so on.

KVM already does some hierarchy flattening: it has VMCS shadowing for nesting and shadow page tables that accelerate some operations. That solves some of the performance problems, but not all of them. Most of all, I/O is not as good, latency is not as good, and you are still missing features like hardware encryption.

So this is where we are with nesting: the host runs virtualization software that spawns a standard virtual machine, which carries its own copy of the virtualization software and uses it to spawn nested VMs. We want to get to a situation where the host's virtualization software spawns both the primary VM and the secondary VMs, and the primary VM just has access to, and some basic control over, the secondaries.

What are the challenges of this? First of all, in terms of security, you need to make sure the rest of the system is unaffected. Exactly like with nested VMs, where it doesn't matter how many you spawn, the host sees a single virtual machine, here too it shouldn't matter to the host how many secondary VMs you spawn. That's why the primary VM only has access to a set of predefined actions, not the whole control plane; a small sketch of that idea follows below. The other challenge is the reverse: not only do you need to protect the host from the primary and secondary VMs, you also need to protect the primary VM from the rest of the system.
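To make the predefined-actions point concrete, here is a minimal sketch, not our actual code, of what the daemon's command surface could look like. All names here are hypothetical; the point is only that the daemon dispatches a fixed allow-list of operations, and anything else coming from the primary VM is rejected.

```python
# Hypothetical sketch: the daemon exposes a fixed allow-list of actions.
# Nothing outside this set is reachable from the primary VM, so the host's
# control plane is never exposed directly.
from enum import Enum

class Action(Enum):
    SPAWN = "spawn"      # boot a secondary VM from an image on the secondary drive
    PAUSE = "pause"      # pause a running secondary VM
    RESUME = "resume"    # resume a paused secondary VM
    DESTROY = "destroy"  # tear down a secondary VM and return its resources

def handle_request(request: dict) -> str:
    """Validate and dispatch a single request coming from the primary VM."""
    try:
        action = Action(request["action"])
    except (KeyError, ValueError):
        # Anything not in the predefined set is refused outright.
        return "error: action not in the allowed set"
    # Each handler only ever touches resources inside this primary VM's pool.
    handlers = {
        Action.SPAWN: lambda r: f"spawning {r.get('name')}",
        Action.PAUSE: lambda r: f"pausing {r.get('name')}",
        Action.RESUME: lambda r: f"resuming {r.get('name')}",
        Action.DESTROY: lambda r: f"destroying {r.get('name')}",
    }
    return handlers[action](request)

if __name__ == "__main__":
    print(handle_request({"action": "spawn", "name": "secondary-1"}))
    print(handle_request({"action": "migrate", "name": "secondary-1"}))  # rejected
```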
Concretely, if you have other VMs, you need to be sure that they have no access to the communication channel between the host, the primary, and the secondary VMs; and the secondary VMs should behave like nested VMs, so other VMs should not be able to see them.

So this is where we start. On the left we have level zero, the host with its virtualization software, and on the right we have the primary VM it spawned. To allow communication, we add a component which we call the secondary VM daemon. Actually, we have one daemon per primary VM, to avoid interference and potential voluntary or involuntary DoS attacks. Then we have a template that the host can use to spawn every secondary VM it is asked for.

Now we have one final problem: how do we get the image from the primary VM to the host to boot? We cannot just copy it, because then any primary VM could keep sending data to the host and fill up the host's storage. That's why we add a secondary drive, a virtual drive where the primary VM can put the images of the secondary VMs. Finally, we encapsulate everything in a pool of resources, and those are the only resources they are allowed to use; the daemon is also part of this pool.

So now a primary VM that wants to spawn a secondary VM sends the request to the daemon. The daemon unplugs the secondary drive, takes the image out, shrinks the drive to its new size, which is the old size minus the size of the secondary VM image, and plugs it back in. Now it has all the information it needs, the request, the template, and the image, so it just starts the secondary VM and connects it to the primary VM.

This slide is a comparison between our solution and the current ones, standard VM plus nesting and bare metal plus standard VM. Our solution should be as cheap as a standard VM, because for a cloud provider it doesn't change much whether it spawns one or the other. It has some flexibility limitations: your storage is split between the secondary drive and your primary drive, and there are some constraints on memory, but you still have much more choice than with a bare-metal system. The full hardware feature set is available to the secondary VMs, so you get encryption and device passthrough; with nesting you do get a virtual IOMMU, but you take quite a big hit in performance there, and performance is the whole point of passthrough anyway. And secondary VMs are as fast as standard VMs, because they are standard VMs.

So now let's go through the components. The most important one is the daemon: it is the one that talks to the primary VM. The channel we use is vsock. Vsock was an easy choice because it is only visible in virtualized environments, so there is less chance of other software interfering with it. We can use the vsock CID of the primary VM to tell the traffic of the different primary VMs apart. That also gives us a way to demote a primary VM to a standard VM: you remove the vsock device, you remove the secondary drive, and you have a standard VM again. The daemon is the main component that does all the work on behalf of the primary VM.

Another important piece is the resource pool. The main enforcement mechanism for the resource pool is cgroups: we use cpuset to ensure the number of cores stays the same, and the memory controller's max limit to ensure the pool never goes over the memory it started with. A rough sketch of these two controls follows below.
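As an illustration of that enforcement, here is what configuring those two cgroup v2 controls could look like from the host. The pool name, path, and values are made up for the example, and the real implementation goes through systemd and libvirt rather than writing sysfs files by hand.

```python
# Hypothetical sketch: pinning a primary VM's pool with cgroup v2.
# Assumes a unified cgroup hierarchy at /sys/fs/cgroup; the pool name
# "primary-vm-1" and the limits are invented for the example.
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")

def setup_pool(pool: str, cpus: str, mem_bytes: int) -> None:
    """Create a cgroup for the pool and cap its CPUs and memory."""
    group = CGROUP_ROOT / pool
    group.mkdir(exist_ok=True)  # requires root on a real host
    # cpuset: the pool (primary VM + daemon + secondaries) only ever runs
    # on these cores, no matter how many secondary VMs get spawned.
    (group / "cpuset.cpus").write_text(cpus)
    # memory.max: the pool as a whole can never exceed the memory the
    # primary VM was booted with.
    (group / "memory.max").write_text(str(mem_bytes))

if __name__ == "__main__":
    # e.g. 4 cores and 8 GiB for the whole pool
    setup_pool("primary-vm-1", "0-3", 8 * 1024**3)
```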
Right now those are the two controls we have implemented, but we will add others to make sure everything else stays proportional, for example block I/O scheduling weights, to be sure that all pools get the same bandwidth no matter how many secondary VMs you spawn. One problem cgroups don't solve is storage, as we discussed; that's where we do the unplug, shrink, and replug trick. Finally, networking: every time you spawn a primary VM, you create a virtual network and connect everything to it, so it is easy for them to talk to each other while the rest of the system knows nothing about them.

Up to this point I have talked about concepts rather than a specific software layer, and that's because this is meant to be a general solution. It doesn't matter which virtualization stack you use: you can implement your own version of the daemon and your own version of the client, and you should be good to go. But we did build a proof of concept on libvirt, because it's open source; it already supports multiple hypervisors, not only KVM; and it has systemd integration, and through systemd we get access to cgroups, so it was quite easy to add our limits. It also makes it easy to attach and detach devices at runtime, for example network and storage.

Now let me show you a demo. On the left we have the host, which has a single primary VM running, and on the right we have three terminals logged into that same primary VM. The primary VM is encrypted with AMD SEV, and you can see vdb, the secondary disk, which is 50 gigabytes; inside that disk we already have the images of the secondary VMs we want to spawn.

Now, from the second terminal, we send a message to the host: we want a secondary VM with two CPUs and four gigabytes of RAM, the image is raw, and we want it encrypted. As you can see, right now it is a simple plain-text string parsed by the host; this will change in the future, but for a proof of concept it's fine. (A sketch of what the daemon does with such a request follows right after the demo.) On the host side, you can see we have spawned the secondary VM. We keep a sense of hierarchy, so this is a secondary of primary 1; on the left you can see the IDs of the VMs.

Now we spawn another one, this time not encrypted, again by sending plain text to the host. It's done, and again it's a secondary of the primary VM with ID 1. You can see that the disk is now 28 gigabytes, which means the secondary VM images totaled 22 gigabytes. Now we can access them over the network. You can see the first one is in fact encrypted, like we asked; we can use encryption in the primary and the secondary VMs at the same time. The last one we can also access, and we can verify that it is not encrypted, because we didn't ask for it to be.

Finally, we want to show resource usage. On the host we run a resource monitor, and we launch a few CPU and memory stressors on the primary VM and on the secondary VMs. As you can see, only four CPUs are being utilized, because those are the resources the primary VM was given. It doesn't matter how many stressors you launch on however many virtual machines; that's all the host sees.

One final thing: the hierarchy is a strong hierarchy, so you cannot have orphaned secondary VMs. Let's log out. If you kill the primary VM, all its secondary VMs get destroyed automatically. No more turtles.
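To connect the dots, here is a rough sketch of the flow the daemon goes through when a spawn request like the one in the demo arrives, in the shape of our libvirt proof of concept. The XML, domain names, and helper bodies are simplified placeholders, not our actual code.

```python
# Hypothetical sketch of the daemon's spawn flow, using the libvirt Python
# bindings. XML, names, and helper bodies are invented placeholders.
import libvirt

# Placeholder XML describing the secondary drive attached to the primary VM.
SECONDARY_DRIVE_XML = """
<disk type='file' device='disk'>
  <driver name='qemu' type='raw'/>
  <source file='/pool/primary-vm-1/secondary-drive.img'/>
  <target dev='vdb' bus='virtio'/>
</disk>
"""

def extract_image_and_shrink(image_name: str) -> str:
    """Placeholder: pull the image off the drive, then shrink the drive
    to its old size minus the image size."""
    return f"/pool/primary-vm-1/{image_name}"

def render_template(request: dict, image_path: str) -> str:
    """Placeholder: fill the host-side secondary VM template."""
    return f"<domain type='kvm'><name>{request['name']}</name>...</domain>"

def spawn_secondary(conn: libvirt.virConnect, primary_name: str, request: dict) -> None:
    primary = conn.lookupByName(primary_name)
    # 1. Unplug the secondary drive from the running primary VM.
    primary.detachDeviceFlags(SECONDARY_DRIVE_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)
    # 2. Take the requested image out and shrink the drive accordingly.
    image_path = extract_image_and_shrink(request["image"])
    # 3. Plug the now-smaller drive back into the primary VM.
    primary.attachDeviceFlags(SECONDARY_DRIVE_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)
    # 4. Boot the secondary VM from the template, inside the primary VM's
    #    resource pool and virtual network.
    secondary = conn.defineXML(render_template(request, image_path))
    secondary.create()
```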
This is where we are right now, and we plan to do more in the future. The most important item is a better-defined way for the primary VM and the host to talk. We settled on vsock for the reasons I gave before, but we still need to decide on the message format, probably some form of XML or JSON (a sketch of what a JSON request could look like follows below), and on encryption: the channel is invisible from the outside, but adding encryption is better.

Then we will clean up our code, which is in an early state, and try to upstream it, so that this becomes a baked-in libvirt feature: you would add a simple "primary" tag in your domain XML, and that would do everything for you, booting a VM that acts as a primary VM and can talk to the host using whatever client it has.

Another thing we want to improve is the cgroup side, like I said: we want to add other limits, block I/O limits and so on. We also want to improve the isolation of the primary VM on the host. Most importantly, we found that the cgroup cpuset doesn't work as well as the isolcpus boot parameter, which is what we were using in the beginning, so we will try to make it act more like isolcpus.

Finally, we want to investigate the storage solution further, that is, how to get the secondary VM images to the host. We tried a few approaches before settling on unplug-and-replug. We tried NFS, running an NFS server in the primary VM so that the host can access the data directly and boot from it, but that is quite slow in terms of I/O performance in the secondary VMs. We also tried hacking the qcow2 layer and making the secondary VM image an overlay of the primary VM's image, so that for the host the secondary VM is a snapshot of the primary VM that you can boot; but that still grows the storage used on the host, and that's not allowed. The final thing we tried was for the primary VM to tell the host which physical blocks the image is located on; on the host you then create a fake disk with a filesystem that mirrors the filesystem in the primary VM, so you can create an inode that points to the correct blocks and access the file directly. But then you need the two kernels to stay synchronized, and I don't think that's ever going to be accepted by the kernel community.
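As a taste of where the message format could go, here is a hypothetical JSON version of the plain-text request from the demo, sent over vsock. The field names and the port number are invented; this is a sketch of the kind of structured request we have in mind, not a settled protocol.

```python
# Hypothetical sketch: a structured JSON spawn request over vsock.
# Field names and the port number are invented; Linux-only (AF_VSOCK).
import json
import socket

HOST_CID = socket.VMADDR_CID_HOST  # CID 2: always addresses the host
PORT = 9999                        # invented port for the per-VM daemon

request = {
    "action": "spawn",
    "name": "secondary-1",
    "cpus": 2,
    "memory_gib": 4,
    "image": {"path": "secondary-1.img", "format": "raw"},
    "encrypted": True,             # ask for SEV, like in the demo
}

# From inside the primary VM: connect to the daemon on the host and send
# the request. The daemon tells primary VMs apart by their vsock CID.
with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as sock:
    sock.connect((HOST_CID, PORT))
    sock.sendall(json.dumps(request).encode())
    print(sock.recv(4096).decode())  # daemon's reply, e.g. "ok"
```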
Yeah, that was it. If you have questions, go ahead.

So the question is: can a secondary VM boot another secondary VM? We thought about it. We do have a notion of hierarchy on the host, so it's possible; we didn't implement it yet, but in theory I don't see any problem. They are all at the same level from the virtualization perspective, but on the host you can see which one belongs to which.

On use cases: well, OpenShift and Kubernetes are the ones that come to mind. They usually have worker nodes inside VMs, and since the user is given a VM to begin with, they don't have access to spawning nested VMs in the cloud, so they need to do it somewhere, and this might be one way to do it. And yeah, maybe not the next step, but having a daemon version of this inside KubeVirt is something we wrote down somewhere and will investigate.

Is the daemon in user space? Yes, it's all user space.

On cost: if you spawn multiple VMs on a cloud host, you pay for each one of them, and that might be something you don't want to do; this way you rent one VM and just use the same resources over and over again.

And yeah, the problem with the qcow2 approach, this was one of the things we tried for getting the image from the primary VM: we tried to take the qcow2 and have it as an overlay of the primary VM's qcow2 file, so you have the primary VM's qcow2 with a snapshot, which is not really a snapshot, it is the secondary VM, and you just boot that. But a snapshot is a diff between those files, so it was still quite large, and that's not something we can allow, because you fill up the host's space the more secondary VMs you boot.

And yes, that is something we will have to discuss with the cloud providers. Probably there will be a limit on how many encrypted VMs you are allowed to boot: unencrypted, as many as you want; encrypted, you will have a limit.

I have a question from the chat, can you hear me okay? Okay, it says: so the secondary VM does not consume any memory of the primary; how do you shrink the memory size of the primary VM when a secondary VM starts? Yes, you can cap the total sum via a host cgroup, but the primary VM will happily use the memory until the cgroup kills one of the processes. So the question is really about cgroup capping and the OOM killer versus using the existing shrinkers, which might have different end results: one is a fatal kill, maybe of the primary, ouch, versus simply not being able to start the secondary VM.

Yeah, so when the secondary VM is not encrypted, we didn't find any problem. When it's encrypted, the memory is pinned, and we do remove the memory: we use a mechanism similar to the disk one. There are slots of memory; you remove those virtual slots and assign them to the secondary VM. So, contrary to the CPUs, the primary VM actually does see it: the primary VM always sees the same number of CPUs, but for memory, when you spawn a secondary VM, you see your memory shrinking. Great, thank you.

And another question here: if you're starting to remove resources from the primary VM, and let's say Kubernetes is running there, how do you let the scheduler know that it actually has fewer resources to schedule its pods on? Yeah, that's an issue that came up. On the CPU side you don't really have a problem, because the CPUs are virtual and can simply be shared. On the memory side, yes, that's an issue, and you will need to change the worker node a bit so that it knows its memory is shrinking.

Any more questions? No? Well, thank you. If you have other questions, you can find our emails and write to anyone on the team. Yeah, thank you.