Hello. Thank you for being here. My name is Hugo, and I'll be talking a little bit about hardware virtualization and what changed in the last decade. I'm a systems engineer at a small cloud provider called DigitalOcean — sorry, had to put this somewhere. So why do we virtualize hardware? Apart from the answers that CTOs and CIOs like — server consolidation, faster provisioning, faster recovery — from the geeky side, some operational tasks became much easier. We can live migrate virtual servers between physical servers, so we can have virtually zero-downtime maintenance windows. Virtual machines are really nice environments; at the very least, rebooting a VM is much faster than rebooting a physical computer. So I'll be talking a little bit about why hardware virtualization support is useful, because it's not strictly required — well, nowadays it practically is. The easiest way to virtualize is to just emulate everything, literally everything: we do instruction-by-instruction interpretation, and all the hardware state is emulated in software. RAM is just a block of allocated memory, registers are just variables, and so on. This is extremely simple — it could be easy enough to be a university assignment, as long as you're not implementing a complete instruction set — but it's really slow. Nevertheless, it's very useful for development. The most famous emulator is probably Bochs, which started in 1994 — and the last commit was one day ago, so I believe people are still using it. Instruction-by-instruction emulation is not feasible in a production environment, because there we expect the slowdown to be insignificant, and with emulation it is anything but. So there are faster ways to virtualize, even without hardware support. But first, let's make a little detour. Who here has heard the term virtual machine monitor? Interesting. Any idea in which decade that term was coined?
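The core loop described above — registers as variables, RAM as a plain allocation, one instruction interpreted at a time — can be sketched in a few lines. This is a toy, entirely invented three-instruction ISA for illustration, nothing like a real emulator:

```python
# Toy instruction-by-instruction emulator. All hardware state is software:
# RAM is a bytearray, registers are just dictionary entries.

def run(program, max_steps=100):
    ram = bytearray(256)          # "physical" memory is just an allocation
    regs = {"a": 0, "pc": 0}      # registers are just variables

    for _ in range(max_steps):
        op = program[regs["pc"]]
        if op == 0x00:            # HALT
            break
        elif op == 0x01:          # LOAD imm -> a
            regs["a"] = program[regs["pc"] + 1]
            regs["pc"] += 2
        elif op == 0x02:          # STORE a -> ram[addr]
            ram[program[regs["pc"] + 1]] = regs["a"]
            regs["pc"] += 2
        else:
            raise ValueError("illegal opcode")
    return regs, ram

# LOAD 42; STORE it at address 7; HALT.
regs, ram = run(bytes([0x01, 42, 0x02, 7, 0x00]))
```

The dispatch overhead of a loop like this on every single instruction is exactly why plain interpretation is too slow for production.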
So, in the 70s — in 1974, actually — Popek and Goldberg wrote a paper titled "Formal Requirements for Virtualizable Third Generation Architectures". The abstract mentions machines like the PDP-10 and the IBM 360. In this paper, Popek and Goldberg established the base requirements for something to be considered a virtual machine monitor. Fidelity means that the behavior observed from inside the guest should be essentially identical to running on the host: the guest is not supposed to perceive in any way that it's a guest, that it's being virtualized. Performance is a little subjective, but in the paper it was very clear that plain emulation was not an option — the virtualization overhead is supposed to be close to negligible. And safety is obvious: the guest VMs must be isolated from each other and must not interfere with the virtual machine monitor or with resources they're not supposed to have access to. Those are the requirements. Popek and Goldberg, in the same paper, also came up with a couple of theorems, and we're going to focus on the first one: for any conventional third-generation computer, a virtual machine monitor may be constructed if the set of sensitive instructions for that computer is a subset of the set of privileged instructions. My question about this was: what is a third-generation computer? Since the paper was written in the 70s — well, different expectations. But it's made very explicit in the paper. It should have some sort of relocation mechanism, which makes sense, because several VMs are supposed to share the same physical address space. The instruction set is expected to have different privilege levels, so a supervisor and a user mode. And a trap mechanism, or a set of trap mechanisms, is also required. Now, what's the meaning of sensitive instructions and privileged instructions?
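The theorem itself is a simple set condition, which can be stated in two lines of code — the instruction names below are made up for illustration:

```python
# Popek & Goldberg, Theorem 1: a classic VMM can be constructed if every
# sensitive instruction is also privileged (i.e. traps in user mode).
def classically_virtualizable(sensitive, privileged):
    return sensitive <= privileged   # subset test

# A hypothetical well-behaved ISA passes:
ok = classically_virtualizable({"MOV_CR", "HLT"}, {"MOV_CR", "HLT", "IN"})

# An ISA with a sensitive instruction that does not trap fails —
# as we'll see, x86's POPF is exactly such a case:
bad = classically_virtualizable({"MOV_CR", "POPF"}, {"MOV_CR"})
```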
So, privileged instructions are those that trap if the processor is in user mode and do not trap in system (supervisor) mode. Sensitive instructions are those that read or affect system state, and when they trap, control must pass to the virtual machine monitor. The proposed way of virtualizing is what we nowadays call classic virtualization, and the concept is very, very simple: we directly execute the safe instructions — by safe we mean they don't interfere with the virtual machine monitor or with any other VM — and we trap and emulate the privileged instructions. Now, how does this apply to real hardware? A curiosity: is any of you going to Prague next week? Okay, this might be safe then. Will you be attending Embedded Linux Conf? Okay. So, about hardware. ARM exists, but we'll be ignoring it for today, and this is going to be very specific to Intel. I'm not going to mention AMD; they have very similar virtualization capabilities, even if the two are incompatible at the instruction-set level. So, how could classic virtualization work on x86? x86 already has four privilege levels, which is very useful. Those privilege levels go from ring zero to ring three: ring zero is the privileged level, and rings one to three are unprivileged. This means that if we try to execute a privileged instruction in rings one to three, it traps and the virtual machine monitor regains control. When that happens, the VMM is supposed to emulate the instruction that caused the trap and give control back to the virtualized OS. So, from the privilege levels' perspective, the virtual machine monitor — and, on bare metal, the operating system kernel — runs in ring zero, the real privileged mode; userland runs in ring three; and the virtualized OS could run in ring one. That virtualized OS, whichever OS it might be, would use ring three as well, to run its own userland code.
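The classic trap-and-emulate loop can be sketched like this — `Trap`, `vcpu`, and the instruction names are all illustrative stand-ins, not a real API:

```python
# Trap-and-emulate sketch: safe instructions run directly at native speed;
# privileged ones trap into the VMM, which emulates their effect on the
# guest's *virtual* CPU state instead of the real machine's.

PRIVILEGED = {"cli", "mov_cr3"}        # toy set of privileged instructions

class Trap(Exception):
    def __init__(self, insn):
        self.insn = insn

def execute_directly(insn, vcpu):
    if insn in PRIVILEGED:             # hardware would trap here (ring 1)
        raise Trap(insn)
    vcpu["work"] += 1                  # stand-in for native execution

def emulate(insn, vcpu):               # VMM-side emulation of the trap
    if insn == "cli":
        vcpu["interrupts_enabled"] = False   # touch virtual state only

def run_guest(instructions, vcpu):
    for insn in instructions:
        try:
            execute_directly(insn, vcpu)
        except Trap as t:              # the VMM regains control...
            emulate(t.insn, vcpu)      # ...emulates, then resumes the guest

vcpu = {"work": 0, "interrupts_enabled": True}
run_guest(["add", "cli", "add"], vcpu)
```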
I should mention that, in terms of data access, ring zero has access to everything, ring one can see and access the data in rings two and three, and so on. So let's pick a random x86 instruction — okay, it's not really random: POPF. POPF is a very interesting instruction because when we run it in privileged mode, it changes both the ALU flags and the system flags, and when we run it in non-privileged mode, only the ALU flags are changed. So its behavior depends on the current privilege level, which makes it challenging, because there is no way for the virtual machine monitor to intercept its execution. The ideal scenario would be for the execution to be intercepted by the VMM, which would check the guest's current privilege level and emulate the instruction in either privileged or non-privileged mode from the guest's perspective. But since this instruction doesn't trap, that is not possible. x86 has, in total, 17 sensitive non-privileged instructions — instructions that don't trap although we would need them to, at least for classic virtualization to work. Additionally, the guest operating system can easily tell whether it is being virtualized, because the current privilege level is stored in the lower bits of the CS register. This is bad news, because it means x86 can't be classically virtualized without hardware support. However, we had fast virtual machine monitors way before hardware virtualization was available, and that was done with binary translation. The idea is extremely simple — the implementation, not so much: there is a mechanism inside the VMM that dynamically rewrites the sensitive instructions so that, instead of the guest running them directly, the virtual machine monitor emulates their behavior and mimics the side effects of that execution for the guest.
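A minimal sketch of that rewriting step, with an invented intermediate representation — real binary translators work on machine code, handle control flow, and chain translated blocks, none of which is modeled here:

```python
# Dynamic binary translation sketch: scan a guest code block and rewrite
# the sensitive-but-non-trapping instructions (POPF and friends) into
# explicit calls into the VMM; everything else passes through untouched.

SENSITIVE = {"popf", "sgdt", "smsw"}   # a few of x86's 17 offenders

def translate(block):
    out = []
    for insn in block:
        if insn in SENSITIVE:
            out.append(("call_vmm", insn))   # emulated by the VMM
        else:
            out.append(("direct", insn))     # runs natively
    return tuple(out)

# Translated blocks must be cached so hot code is translated only once;
# a real VMM also invalidates entries on self-modifying code.
cache = {}
def get_translation(block):
    key = tuple(block)
    if key not in cache:
        cache[key] = translate(block)
    return cache[key]
```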
As an optimization — a required optimization — the translated blocks have to be cached, which sounds simple but actually isn't once we consider corner cases like self-modifying code. We also need a few other things, like shadow page tables — I'm going to talk a little bit about those in a few minutes — and all the I/O is emulated as well. So it's not trivial, but it was a fast solution for some years. I guess the first actually working, really fast virtual machine monitor using binary translation was VMware Workstation, released in 1999. Eventually, a few years later — 2005, 2006 — Intel started shipping support for hardware virtualization. What they did was add a new mode of execution called VMX root mode: the virtual machine monitor runs in root mode, and the guests run in non-root mode. The guests are totally unaware of their execution mode. Intel VT-x also adds a new set of instructions, and the guests can't run any of those instructions without trapping back into the VMM — except one, actually, but that's not important for now. So yeah, the guests are totally unaware of what's going on; they just run their code and don't need to do anything else. Now, how does VT-x actually work? Each guest has its own control structure that holds all its state and a little bit more. The transitions between root and non-root mode are atomic. The virtual machine monitor has a fairly simple job: it creates a VM control structure (VMCS) — not a trivial structure, but it fills in things like the initial guest state and what the registers look like — it provides some memory context, then it sets a program counter and just launches the VM. When it launches the VM, the processor transitions to non-root mode, and it stays in non-root mode, running the guest's code, until a VM exit happens. I forgot to mention, back here: the VMCS also specifies under which conditions a VM exit should happen.
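The VMCS lifecycle described above — create, fill in guest state, configure exit conditions, launch, handle the exit, re-enter — can be modeled very roughly like this. The VMCS is a dict and `vmlaunch()` is a stand-in for the real VMLAUNCH/VMRESUME instructions; every name here is illustrative:

```python
# Rough model of a VMM's per-vCPU lifecycle around VT-x.

def make_vmcs():
    return {
        "guest_state": {"rip": 0x1000, "rsp": 0x8000},  # initial registers
        "exit_controls": {"hlt", "io", "cr_access"},    # what forces an exit
        "exit_reason": None,
    }

def vmlaunch(vmcs, pending_events):
    # Stand-in for entering non-root mode: the CPU runs guest code until
    # one of the configured conditions is hit, then exits atomically.
    vmcs["exit_reason"] = pending_events.pop(0)
    return vmcs["exit_reason"]

def run_vcpu(vmcs, pending_events):
    handled = []
    while pending_events:
        reason = vmlaunch(vmcs, pending_events)   # VM entry ... VM exit
        if reason in vmcs["exit_controls"]:
            handled.append(reason)                # emulate, then re-enter
    return handled
```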
And this makes things very, very interesting: we can fine-tune the guest to exit only under very specific conditions. I'm going to talk a little bit later about the typical exit reasons. So, we can have multiple VMCSs, but at a single point in time there is only one active VMCS per core — and I'm talking about cores, not processors, because each core can have its own active VMCS. At any point in time, a core is either in root mode or in non-root mode. If it's in non-root mode, it's running a guest; once it VM-exits back to root mode, the VMM might decide to schedule a different virtual machine onto that same core — it just points the VMCS pointer at another VMCS structure and starts or resumes that virtual machine. In this graph, I'm using "VM entry" as a generalization of VMRESUME and VM start (actually, VMLAUNCH). So we have an easy way to run guest code safely, and this can be considered classic virtualization — well, kind of. So, we can virtualize the CPU; now we need to talk about the MMU. In the initial release of VT-x — the first one, in 2006 — the MMU could not be virtualized, and that was bad, because something has to be managing the page tables. Page tables are simply the mapping from virtual memory to physical memory. But since we have a virtual machine inside something that is real, we need two layers of translation: from the guest's virtual memory to the virtual machine's physical memory — and that "physical" memory seen by the virtual machine is actually virtual memory from the VMM's perspective — and then, at the host (VMM) level, from that virtual memory to actual physical memory. The solution at the time was shadow page tables, and it had lots of performance implications.
Shadow page tables work like this. Ideally, the guest operating system would manage its own page tables and everything would just work out of the box. But since there is no MMU virtualization, the VMM has to take the mappings provided by the guest operating system, map those mappings to actual physical memory, and point the VM's page-table register at the resulting shadow page tables — this is going to make more sense in a second. This mapping has to be kept in sync: every time the page tables change in the guest, the VMM has to update the shadow page tables. That means every time the page-table register changes in the guest, it causes a VM exit. And it gets worse: when a new memory page is added to the page tables, the VMM has to keep track of changes to that page table, so every time the guest maps something that changes a tracked page table, it also causes a VM exit. That is lots of VM exits. I should also mention that every context switch in the guest operating system changes the page-table register — here I wrote CR3, which is that register on x86. So yes, lots of VM exits. And in the earlier versions of VT-x, a VM exit plus VM resume would take around 4,000 CPU cycles — we could measure a VM exit/VM resume round trip in microseconds, which is pretty uncool. So, eventually, MMU virtualization arrived. Intel called it Extended Page Tables (EPT); the formal, fancy name is second-level address translation, and some people also call it nested page tables. It was introduced in 2008, a couple of years after VT-x first appeared, and it does all the work for us: when a TLB miss happens, the hardware walks the guest page-table entries and the host page-table entries for us. And that's all we wanted, right? And obviously, it doesn't require as many VM exits as before.
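The two translation layers, and the difference between the two approaches, can be sketched with page tables as plain dictionaries (4 KiB pages, invented mappings):

```python
# Guest-virtual -> guest-physical (guest page tables), then
# guest-physical -> host-physical (EPT). With shadow page tables the VMM
# pre-composes the two maps into one table the hardware can use directly;
# with EPT the hardware walks both layers itself on a TLB miss.

PAGE = 4096

def walk(tables, addr):
    page, off = addr // PAGE, addr % PAGE
    return tables[page] * PAGE + off      # a KeyError would be a page fault

guest_pt = {0x10: 0x2}    # GVA page 0x10 -> GPA page 0x2 (guest-managed)
ept      = {0x2: 0x77}    # GPA page 0x2 -> HPA page 0x77 (VMM-managed)

def translate_nested(gva):
    """What EPT hardware does on a TLB miss: walk both layers."""
    return walk(ept, walk(guest_pt, gva))

def make_shadow(guest_pt, ept):
    """What the VMM maintains by hand without EPT: the composed map.
    Every guest page-table change forces this to be rebuilt (VM exits)."""
    return {gva: ept[gpa] for gva, gpa in guest_pt.items()}
```

Both paths produce the same host-physical address; the difference is that the shadow map has to be re-synchronized by software on every guest page-table change, which is where all those VM exits come from.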
So, the guest manages its own page tables, and that doesn't cause a VM exit. The performance gains compared to shadow page tables were very obvious. So, we're in 2008. Situation report: we have really nice CPU virtualization, and the MMU is guest-friendly. So, everything is awesome — I still find that song extremely catchy. Well, sort of. There are multiple situations that can cause a VM exit. The typical reasons are CPU exceptions, external interrupts, the execution of sensitive instructions, VT-x instructions, I/O instructions — lots of things. The VMM can avoid some of the exits, because it can literally cherry-pick plenty of the conditions that trigger a VM exit, but some of them can't be ignored. Usually, when a VM exit happens, the virtual machine monitor looks into the virtual machine control structure, which has an exit-reason field — something like an exit status code — so the VMM knows what triggered the exit. But usually that is not enough: the instruction has to be emulated, the guest state has to be updated somehow, something has to happen. This means that to handle VM exits properly, we need an x86 emulator in our VMM, and those are not exactly trivial. A curiosity: have any of you compiled an x86 emulator in the last month? Interesting. Any of you compiled a kernel in the last month? Okay. Any of you compiled an x86 emulator without knowing it in the last month? I guess you did. So, the KVM code that is in the kernel — yeah, I can talk a little bit about KVM — actually has a minimal emulator that handles VM exits for us. The KVM module does most of the work in terms of keeping track of virtual machine state, and it abstracts the different implementations: Intel VT-x has been improved over the years, and then there is the incompatible AMD implementation, and the KVM module handles all of that for us. And it's really simple to use. We just need to...
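The exit-handling loop — read the exit reason, dispatch to a handler, resume the guest — is the heart of any VMM built on VT-x, and is roughly what QEMU does around KVM. A toy model of that dispatch, with invented handler stubs:

```python
# VM-exit dispatch sketch: each exit carries a reason code; the VMM maps
# it to a handler, emulates whatever is needed, and re-enters the guest.

def handle_io(vm):    vm["io_ops"] += 1       # emulate the port access
def handle_mmio(vm):  vm["mmio_ops"] += 1     # emulate the memory access
def handle_hlt(vm):   vm["running"] = False   # guest went idle

HANDLERS = {"io": handle_io, "mmio": handle_mmio, "hlt": handle_hlt}

def vcpu_loop(vm, exits):
    for reason in exits:                      # each item = one VM exit
        HANDLERS[reason](vm)                  # dispatch on the exit reason
        if not vm["running"]:
            break                             # otherwise: VM resume

vm = {"io_ops": 0, "mmio_ops": 0, "running": True}
vcpu_loop(vm, ["io", "mmio", "io", "hlt"])
```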
There is a device file, /dev/kvm, and we can drive it using ioctls. That's what QEMU uses. So, now let's talk a little bit about I/O. All the devices exposed to the guest are emulated. Since one of the requirements is that the guest should be able to run as a complete, unmodified system, we need to provide devices that the guest knows how to deal with. Usually this means emulating very old devices, like the Intel e1000, the RTL8139, or the NE2000, because pretty much every operating system has built-in drivers for those. The way this I/O works is actually very simple: when the guest does any I/O operation — and it can be PIO or MMIO — the VMM gains control, emulates whatever is expected to happen on the emulated device, and gives control back to the guest. This means a single I/O operation might require multiple VM exits, mostly because we are emulating devices that were never designed to be used in a virtualization environment. So high throughput is impossible in this scenario. This didn't use to be a big deal until a few years ago, with the whole cloud boom — mostly because, even as a cloud provider, you don't want a virtual machine spending 30% of its time in a VM exit/VM resume cycle just because the user inside that guest is downloading a file at five megs per second. Yes, it's that bad. So, about a decade ago, a smart guy called Rusty Russell proposed something called virtio, which is paravirtualized I/O. I remember I mentioned that we didn't want to change our guests, but this happened for a good reason. Instead of emulating a full device, virtio suggests that a very simple PCI device is enough, and that it can cover all, or most of, the device classes we could be interested in accessing from the guest. The way it works is very simple.
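To see why a single I/O operation costs several exits, here is a toy emulated NIC where each register access the guest driver makes is a separate trap into the VMM. The register layout is invented; it is not the real e1000 programming model:

```python
# Why legacy device emulation is slow: the guest driver programs the
# device one register write at a time, and every write is a VM exit.

class EmulatedNIC:
    def __init__(self):
        self.exits = 0
        self.tx = []                              # frames "put on the wire"
        self.regs = {"TX_ADDR": 0, "TX_LEN": 0}

    def pio_write(self, reg, value, guest_ram):   # one VM exit per call
        self.exits += 1
        self.regs[reg] = value
        if reg == "TX_LEN":                       # doorbell: send the frame
            a, n = self.regs["TX_ADDR"], value
            self.tx.append(bytes(guest_ram[a:a + n]))

ram = bytearray(b"\x00" * 16 + b"hello")          # frame at offset 16
nic = EmulatedNIC()
nic.pio_write("TX_ADDR", 16, ram)                 # exit #1
nic.pio_write("TX_LEN", 5, ram)                   # exit #2
```

Two exits for one five-byte frame here — a real device model needs far more register pokes per packet, which is the overhead virtio was designed to remove.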
So, there is a PCI device the guest knows how to deal with, and it exposes one or n ring buffers; those ring buffers are shared with the VMM and are used as the communication channel between guest and host. The VMM — the backend side of that device — performs the actual I/O operation. Virtio became supported by multiple hypervisors, and I guess any major operating system nowadays has virtio drivers, at least for network cards and probably something else — but I like network cards. This improved things significantly. Performance-wise it's still not as good as native hardware, but it was much better than an emulated device: one could easily reach something like a 20x speedup network-wise, which was a lot. Eventually, hardware I/O virtualization became a requirement — I don't have slides about this, so I'm going to do some talking. The basic functionality just allows the VMM to safely assign, or dedicate, a device to a virtual machine. That by itself is not very useful — okay, it's useful if you want to play Quake in a VM on your desktop using your graphics card, but in a cloud environment the number of VMs usually scales much faster than the number of devices you could actually assign. I guess since the first released implementation of native hardware I/O virtualization, things were already quite optimized, and we could get to a state where there were virtually no VM exits triggered by I/O. There are a few tricks to make that happen, and a few exceptions, but it was already technically possible — we could get, for example, close to line rate network-wise — and a new family of virtualization-aware devices started appearing since then. But that by itself is not that interesting for the whole cloud thing. So the PCI-SIG came up with a really nice thing called Single Root I/O Virtualization (SR-IOV), and it allows us to do things like safely assigning one NIC to multiple VMs.
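The shared-ring idea at the core of virtio can be sketched as follows. This models only the ring logic — the real virtqueue layout (descriptor table, available and used rings, notifications) is more involved:

```python
# Virtio-style shared ring sketch: the guest posts buffers into a ring
# that lives in shared memory; the backend in the VMM drains them and
# performs the real I/O. No per-register trap is needed per buffer.

class Ring:
    def __init__(self, size=8):
        self.slots = [None] * size
        self.head = 0                 # backend's read position
        self.tail = 0                 # guest's write position
        self.size = size

    def guest_post(self, buf):        # guest side: publish a buffer
        assert (self.tail + 1) % self.size != self.head, "ring full"
        self.slots[self.tail] = buf
        self.tail = (self.tail + 1) % self.size

    def backend_poll(self):           # host side: drain posted buffers
        out = []
        while self.head != self.tail:
            out.append(self.slots[self.head])
            self.head = (self.head + 1) % self.size
        return out

ring = Ring()
ring.guest_post(b"pkt1")              # many buffers can be queued...
ring.guest_post(b"pkt2")              # ...before the backend is notified once
```

Batching many buffers per notification is what cuts the exit count so dramatically compared with the emulated-NIC model.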
And I know this is going to sound very weird said like this, but some specific NICs — by now, pretty much any NIC you can buy for a server — support SR-IOV, which allows us to split the NIC into multiple virtual functions. This means the NIC has the physical function we expect a network card to have, but it can also expose secondary virtual functions that behave like network devices, with some limitations — and those limitations exist to guarantee safety. We can get NICs that support up to 256 virtual functions, and we can assign a virtual function to a virtual machine, so those virtual machines get access to real hardware. And that's pretty cool. I guess my time is pretty much over. So: virtualization is pretty cool, and that's all. Questions? No questions?

In the case of these virtualization-aware network interfaces, would each VM be assigned a different, let's say, MAC address?

That's a very good question. There are some options there. I'm going to say yes, but the right answer is a little bit more complex than that. You can give a VM — one that you might or might not trust — direct access to the wire. But most of the NIC brands, at least Solarflare and Intel, support a mechanism to filter the packets that pass through, in and out. That's called queue splitting, and you're allowed to assign specific TX and RX queues to that function. So yes, you can do whatever you want at L2, but there is safety there.

Okay, thanks. Oh, here. Is there another class of devices, in addition to network cards, which can be used that way?

Not that I'm aware of, actually. The entity that comes up with this is the PCI Special Interest Group, and every time I start reading their specs I kind of fall asleep. I mentioned network cards because that's something quite important for my company and something I deal with daily.
I guess doing the same with GPUs would be pretty cool as well, but with NICs it's doable now.

Where are most of the remaining performance limitations in virtualization? Or where is most of the work happening in terms of optimizing for performance?

Right now, with hardware I/O virtualization, you can get very close to native performance, and at that point it's mostly about the drawbacks: for example, by doing device assignment you lose some things like live migration, and for some use cases that's quite important. But in terms of speed, you can get very close to native. Thank you.