So, can you hear me, is the sound working? Thanks, good. I am Radim Krčmář, I am a KVM developer at Red Hat, and what we do is virtualize Linux on x86.

The talk will begin with what self-virtualization means. Most of the talk will be spent on a description of the hardware that allows us to implement it. It's going to be on x86, but this presentation will only cover VMX, which is Intel's implementation of hardware virtualization on x86. And in the last part, I'll show you what the Linux implementation looks like.

What is this self-virtualization? The basic idea is that when you normally boot Linux, or any operating system, it just has direct access to all the hardware it needs. Virtualization introduces a layer in between that can intercept those accesses to hardware and forbid or modify them. And self-virtualization means that the guest and the host are basically the same entity, so it's going to be one part of Linux that controls another part of itself. In the diagram you can see that Linux just splits and installs itself over the real hardware.

When the host is also the guest, we have some advantages over standard virtualization, which needs the two to be really isolated. The host can access guest data, so if you need to inject some packets or whatever, that's already provided. It doesn't need full isolation, which means you can get better performance. Device sharing is easier, because the locking is already implemented for you if you wish to use it. And you can implement it really fast: the demonstration is just 750 lines of code, and that's including blocks copy-pasted from KVM, because I didn't refactor all of the shared code.

Now you are maybe asking what it's good for. Well, the main reason I brought this up was to show you how the VMX extensions work, so we get a better understanding of virtualization and its possible pitfalls. But we can also do it for hardware enhancement. A guest usually doesn't have access to all the hardware that the host has, but on the other hand, we can emulate some hardware in the host, or hypervisor. For example, new Intel processor features like trapping CPUID while in user space, or Supervisor Mode Access Prevention, which prevents the Linux kernel from unintentionally accessing user-mode data, can be implemented even on hardware that doesn't support them, just by trapping from the guest part of our Linux into the host, or hypervisor, part. Another option is that if you implement proper isolation of the host and guest layers, you basically have Intel Kernel Guard Technology, which can prevent or lock any access to hardware. And the third one, which is maybe most interesting, is that guests have some features that are not available to the host, for example TSC scaling. If you had a host that set up the guest in a way that triggers almost no VM exits and has no performance overhead, then you could use this feature, and any time read you do would be faster, because currently you have to scale the TSC in software, which means you have to multiply it by a frequency factor and then add some offset, and the hardware would do this for you.
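As a rough illustration (not code from the demo), this is the shape of the multiply-and-add that a software TSC read pays today and that a guest-visible TSC-scaling feature would move into the RDTSC itself; the names are made up for this example:

    /* Software TSC scaling: the multiply-and-add the kernel does on every
     * TSC read today, and what hardware TSC scaling would do for a guest
     * automatically.  mult/shift/offset stand for calibration values. */
    static inline u64 scaled_tsc(u64 tsc, u32 mult, int shift, u64 offset)
    {
        return (u64)(((unsigned __int128)tsc * mult) >> shift) + offset;
    }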
But the main problem of virtualization is that if you want more control, you lose performance. Every time you need to implement more features, you can be sure it's going to cost you, and the guest is going to run slower, because it needs to trap into the host more often to do what you want and the hardware doesn't support. In this talk we'll focus on a minimal-overhead implementation. In the ideal situation you would set up the split into host and guest once, then run the guest and it would basically never exit: the host would just do the setup, the guest would run as if nothing had happened, and performance wouldn't be affected. This by itself would be completely useless, but then you could pick any single feature from the hardware extensions, trigger just the VM exits you are interested in, and not have any additional overhead. In reality the hardware features don't just give you the VM exits you want; they also force you to take VM exits on things you are not interested in.

So how is the virtualization actually implemented in hardware? With Intel Virtualization Technology, usually called Virtual Machine Extensions (VMX) if you read the programming documentation, for example the SDM, the Software Developer's Manual.

A general overview of what a virtual machine is: hardware has some state, usually called registers, and if you have a virtual machine, you want to emulate this hardware for some other operating system. You want to switch all those hardware registers to belong to the virtual machine, and then switch them back again. So you need at least two sets of these hardware registers that are swapped as you go back and forth between the virtual machine and your bare metal. And then there are usually virtualization controls, which decide when the hardware is going to exit. If you configured it so that the hardware never exits, you would never get control back from the guest, which would basically mean that the guest registers would stay in the hardware registers forever.

The VMCS, or virtual machine control structure, is Intel's implementation of this general idea. It has VMX controls, VMX information, host state, and guest state. The VMX controls mostly affect when the VM exits. The host state is a subset of host registers, and so is the guest state, but the guest state has more registers because it needs to be more complete: the guest doesn't know it's running in a virtual machine, but the host does, so the host doesn't need all of its registers handled transparently. The VMX information is provided by the hardware to tell you, when an exit happens, what the guest did and what you can do to resume the guest, for example the instruction length, so you don't need to implement a decoder in your hypervisor.

Self-virtualization with VMX is going to be done in six steps. You first enable the extensions, then you set up the control structure, and then the important instruction that switches from the host to the guest is VMLAUNCH. After VMLAUNCH you'll eventually get a VM exit, which you handle, and then VMRESUME enters the guest again; the exit-and-resume loop continues as long as the machine is powered on.

Enabling VMX is fairly straightforward: you first should detect that you have it, then set the VMX-enable bit in control register 4, then reserve a memory area for VMX operation, and execute VMXON.
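A minimal sketch of that enable sequence, roughly what KVM does when it enables the hardware, assuming kernel helpers like cpu_has_vmx(), cr4_set_bits() and rdmsrl(); error handling is left out:

    #include <linux/errno.h>
    #include <linux/gfp.h>
    #include <asm/msr.h>
    #include <asm/page.h>
    #include <asm/processor-flags.h>
    #include <asm/tlbflush.h>
    #include <asm/virtext.h>                /* cpu_has_vmx() */

    static int enable_vmx(void)
    {
        u64 basic, pa;
        u32 *vmxon_region;

        if (!cpu_has_vmx())                 /* CPUID.1:ECX.VMX */
            return -ENODEV;

        cr4_set_bits(X86_CR4_VMXE);         /* the VMX-enable bit in CR4 */

        /* A page-aligned VMXON region, tagged with the VMCS revision
         * identifier from IA32_VMX_BASIC.  (Real code must also make sure
         * the firmware's IA32_FEATURE_CONTROL setting permits VMXON.) */
        vmxon_region = (u32 *)__get_free_page(GFP_KERNEL);
        rdmsrl(MSR_IA32_VMX_BASIC, basic);
        vmxon_region[0] = (u32)basic;

        pa = __pa(vmxon_region);
        asm volatile("vmxon %0" : : "m"(pa) : "cc", "memory");
        return 0;                           /* now in VMX root operation */
    }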
After VMXON, Intel describes the state as VMX root operation, which we are also going to call the host, or hypervisor; VMX root operation is the one that can execute all the other VMX instructions.

After enabling VMX you can set up the VMCS. You also allocate some memory for it, then you load it into the hardware, because the hardware has some extra registers for it as well, and then you access it with VMREAD and VMWRITE. The setup has three main parts: you set up the VMX controls, the host state, and the guest state.

The VMX controls mostly control VM exits, so luckily most of them are going to be disabled for our self-virtualization, because we are not interested in any exits right now and want maximum performance. These controls are the pin-based VM-execution controls and the primary processor-based VM-execution controls, but there are two features we actually need in order to eliminate more VM exits. One is MSR bitmaps, which say that for all those MSRs, don't do any VM exits. The other is that we activate the secondary controls, and in the secondary controls we enable three instructions: INVPCID (invalidate PCID), RDTSCP (read the TSC and processor ID), and XSAVES (save the FPU state). If we didn't do that, these instructions would VM exit and we would have to emulate them in the host, and there is no point in that if the hardware can do it. In the VM-exit controls we just set the host address-space size bit, and in the VM-entry controls, because we are going to enter in 64-bit mode already, we set the IA-32e mode guest bit.

The host state is very simple: you basically have hardware registers that are on the host and you want to self-virtualize them, so you just copy them into the host state of your VMCS. The VMCS doesn't have entries for all hardware registers; it just has segment registers like the code segment or stack segment, control registers like CR0, and RFLAGS, RIP and RSP. Because on the host side your hypervisor has to have some code that becomes the new host after the split into guest and host, the host RIP points at your exit handler. Its stack also cannot be the same as the original stack, because that would overwrite and disturb the execution of your normal Linux operation, so you just allocate new memory and set it as the stack. The good part about this state is that it doesn't change from now on, because the host is not going to be modified while the guest is running.

The initial guest state begins the same, because you have only one set of hardware registers at the point where you start to self-virtualize, so you just copy them all into the guest state in the VMCS. It has a few more registers for you, like SMBASE, DR7 or RFLAGS. The important part is that if you want to enable VMX in the guest, you need to hide that you have already enabled VMX in the host, and this is about the only modification we do to the guest we are going to run: it is done by masking the VMXE bit in CR4. The guest state also has interruptibility and activity states, and it's going to change a lot; this is just the initial state for our first run, and then as the guest, the original Linux, continues its boot, it is going to update this structure, and on every VM exit the hardware will save it again.
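A sketch of this setup, assuming the field and control-bit names from arch/x86/include/asm/vmx.h (the same definitions KVM uses) plus tiny VMPTRLD/VMWRITE/VMREAD wrappers; hypervisor_entry and hyp_stack_top are placeholder names for the exit handler and its fresh stack, and a real setup copies much more of the current state:

    #include <asm/processor-flags.h>
    #include <asm/special_insns.h>          /* native_read_cr4() */
    #include <asm/vmx.h>                    /* field encodings and control bits */

    extern char hypervisor_entry[];         /* exit handler stub, sketched later */
    extern unsigned long hyp_stack_top;     /* top of the newly allocated host stack */

    static void vmcs_load(u64 vmcs_pa)
    {
        asm volatile("vmptrld %0" : : "m"(vmcs_pa) : "cc", "memory");
    }

    static void vmcs_write(unsigned long field, unsigned long value)
    {
        asm volatile("vmwrite %1, %0" : : "r"(field), "r"(value) : "cc", "memory");
    }

    static unsigned long vmcs_read(unsigned long field)
    {
        unsigned long value;

        asm volatile("vmread %1, %0" : "=r"(value) : "r"(field) : "cc");
        return value;
    }

    static void setup_vmcs(u64 vmcs_pa, u64 msr_bitmap_pa)
    {
        vmcs_load(vmcs_pa);

        /* Controls: almost everything stays zero.  MSR bitmaps so MSR accesses
         * don't exit, secondary controls for INVPCID/RDTSCP/XSAVES.  Real code
         * also ORs in the bits that the IA32_VMX_* capability MSRs require. */
        vmcs_write(MSR_BITMAP, msr_bitmap_pa);
        vmcs_write(CPU_BASED_VM_EXEC_CONTROL,
                   CPU_BASED_USE_MSR_BITMAPS |
                   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS);
        vmcs_write(SECONDARY_VM_EXEC_CONTROL,
                   SECONDARY_EXEC_ENABLE_INVPCID | SECONDARY_EXEC_RDTSCP |
                   SECONDARY_EXEC_XSAVES);
        vmcs_write(VM_EXIT_CONTROLS, VM_EXIT_HOST_ADDR_SPACE_SIZE);
        vmcs_write(VM_ENTRY_CONTROLS, VM_ENTRY_IA32E_MODE);

        /* Host state: where a VM exit lands, on the freshly allocated stack. */
        vmcs_write(HOST_RIP, (unsigned long)hypervisor_entry);
        vmcs_write(HOST_RSP, hyp_stack_top);

        /* Guest state: a copy of the current registers, with VMXE hidden. */
        vmcs_write(GUEST_CR4, native_read_cr4() & ~X86_CR4_VMXE);
        /* ... plus segments, CR0/CR3, RFLAGS and the rest copied the same way;
         * GUEST_RIP/GUEST_RSP are filled in by the launch code. */
    }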
You could have noticed that there are more hardware registers than can be put into the guest state. These are mainly the general-purpose registers, RAX and RBX through R15, and the model-specific registers, of which there are a lot and some of them might even be undocumented, so Intel just decided that all of them are going to be preserved across VM entry and VM exit. And because we don't actually want to modify anything when we enter, we don't have to do anything: we just ignore that these registers exist and can VM enter.

Now we have done the enabling of VMX and the setup of the VM control structure, so we are ready to enter the guest. We have prepared it, it has all the information from the hardware registers it needs to just resume operation, and VMLAUNCH is the instruction that is going to do it. VMLAUNCH just loads the state from the VMCS into the hardware registers, and because we have set the instruction pointer to begin just after the VMLAUNCH instruction, the guest is going to think that VMLAUNCH maybe wasn't even executed and will continue execution as if nothing happened. That means it will return to its normal boot process, finish booting, and you have your Linux that is self-virtualized.

But things are not that simple, because in VMX non-root mode, which is the processor state after VMLAUNCH, some instructions and operations behave differently than in VMX root mode, the state after VMXON and before VMLAUNCH. One of the main differences is that you can get a VM exit. There are 64 basic exit reasons: a lot of instructions have their own, you can get interrupted and get a VM exit, or you can set up a timer to expire in VMX and get a VM exit. We have disabled almost all of them, but some VM exits can't be avoided. The first is VMCALL, an instruction whose only purpose is to force a VM exit, so that's quite understandable. Then there is the CPUID instruction, which could have been emulated in hardware, but that would probably require too much information to be passed into the hardware, so the designers of VMX just decided to exit into the hypervisor and let it handle it. Then there are MSR reads and writes: there are a lot of MSRs, and most of them are covered by our MSR bitmaps, which say we don't care about these MSRs, so they are read and written directly in hardware, but accessing MSRs outside of that still causes a VM exit. And then there are VMXON and VMPTRLD, the instructions we use to actually virtualize ourselves: hardware doesn't support or accelerate nested hardware virtualization, which gets us a VM exit every time we try to virtualize our already self-virtualized system.

When a VM exit happens, the most important part is that the guest state is saved into the VMCS and the host state is loaded from the VMCS, so execution begins with the host state again; in our case it's going to be some part of Linux that handles this VM exit. But not all hardware registers are saved: for example, the general-purpose registers are still only in the hardware, and our VM exit handler, which is entered with the loaded VMX host state, must first save the registers it is interested in and will probably want to use during its execution. General-purpose registers are used pretty much all the time, so just after a VM exit you save them into some memory area, and you can't use any general-purpose register to do that, so you basically only have the stack pointer and your instruction pointer. After all the general-purpose registers are saved, you can read the exit reason, and if you know what to do, which you should in the hypervisor, you can perform an action based on the exit reason.
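A sketch of that dispatch, reusing the vmcs_read()/vmcs_write() wrappers from the setup sketch; gp_regs mirrors the save area filled by the assembly stub, the exit reasons are the constants from the kernel's vmx.h, handle_cpuid() anticipates the reflection described just below, and handle_msr_access()/inject_ud() are left as hypothetical helpers:

    #include <linux/bug.h>
    #include <linux/types.h>
    #include <asm/processor.h>              /* cpuid_count() */
    #include <asm/vmx.h>                    /* EXIT_REASON_* and field encodings */

    /* Layout of the save area filled by the assembly exit stub. */
    struct gp_regs {
        unsigned long rax, rbx, rcx, rdx, rsi, rdi, rbp;
        unsigned long r8, r9, r10, r11, r12, r13, r14, r15;
    };

    static void handle_msr_access(struct gp_regs *regs, bool write);   /* hypothetical */
    static void inject_ud(void);                                       /* hypothetical */

    /* Execute CPUID on the guest's behalf and copy the outputs back. */
    static void handle_cpuid(struct gp_regs *regs)
    {
        u32 eax, ebx, ecx, edx;

        cpuid_count(regs->rax, regs->rcx, &eax, &ebx, &ecx, &edx);
        regs->rax = eax;
        regs->rbx = ebx;
        regs->rcx = ecx;
        regs->rdx = edx;
    }

    /* Let the guest continue after the instruction that caused the exit. */
    static void skip_emulated_instruction(void)
    {
        vmcs_write(GUEST_RIP,
                   vmcs_read(GUEST_RIP) + vmcs_read(VM_EXIT_INSTRUCTION_LEN));
    }

    static void handle_vm_exit(struct gp_regs *regs)
    {
        u32 reason = vmcs_read(VM_EXIT_REASON) & 0xffff;    /* basic exit reason */

        switch (reason) {
        case EXIT_REASON_CPUID:
            handle_cpuid(regs);
            skip_emulated_instruction();
            break;
        case EXIT_REASON_MSR_READ:
        case EXIT_REASON_MSR_WRITE:
            handle_msr_access(regs, reason == EXIT_REASON_MSR_WRITE);
            skip_emulated_instruction();
            break;
        case EXIT_REASON_VMCALL:
            inject_ud();            /* no hypercalls defined: reflect #UD */
            break;
        default:
            BUG();                  /* an exit we did not ask for */
        }
    }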
For the instructions whose VM exits cannot be disabled: there is VMCALL, where you can just inject an undefined-instruction exception, or do whatever you want; maybe you want to have some action tied to VMCALL. Then there are the CPUID and MSR accesses, and because you can't do anything better than executing them in the host and passing the results to the guest, this is usually what happens: the guest reaches a CPUID instruction, it VM exits, the host executes the CPUID instruction for the guest, loads the outputs of the instruction into the guest's registers, and resumes, so the guest thinks it executed the instruction successfully. And the last part is the VMX instructions, which sadly cannot simply be executed the way CPUID is, because that would override our self-virtualization, so we also need to implement nesting.

Nesting is a weird beast and it's often confusing to talk about. When you run your first Linux host on hardware, then run a Linux guest, and then run its guest's guest, we usually say that the host running directly on the hardware is L0, the next level is L1, then L2, and so on. Usually we don't go much higher than L2 unless you really hate performance, because, as we'll see, the main drawback of nested virtualization is that the hardware has almost no acceleration for it. You cannot just exit from L2 directly to L1, which is what you would like: if L2 executed CPUID, you know it's going to be handled in L1, because L1 is the hypervisor that controls L2. But due to the lack of hardware features, L0 must never lose control of the hardware virtualization, so every VM exit that happens must always go to L0. L0 then takes a look, sees that it's an exit from L2 that L1 is going to handle, so it VM enters L1 and passes along the exit information it got. And when L1 finishes handling the VM exit, it cannot directly enter back into L2 but must go through L0 again. This overhead becomes quite significant when you start adding more levels.

Now that we roughly know what we want to do on nested VMX exits, we can implement them. We can't use the hardware checks the way the other instructions did, so for example for VMXON we just perform all the checks ourselves, meaning whether VMX is enabled in CPUID, whether VMXE in CR4 is set, and so on, and if the checks pass we just remember that L1, the guest that's trying to be a hypervisor, entered root operation. VMXON also provides a VMXON region, so we remember where it is. For VMPTRLD we again check everything and just remember which VMCS is active right now in L1, but we don't execute VMPTRLD ourselves. And VMREAD and VMWRITE are luckily emulated by the hardware, so we don't have to do that; otherwise they would just be simple memory accesses.
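A sketch of that VMXON emulation, with invented struct and helper names (read_vmx_operand() stands in for decoding the instruction's operand); the point is only that nothing is executed, we run the checks and record what L1 did:

    /* State we keep for the guest hypervisor (L1). */
    struct nested_vmx {
        bool vmxon;             /* L1 is in emulated VMX root operation */
        u64  vmxon_region;      /* physical address given to VMXON */
        u64  current_vmptr;     /* what the guest's VMPTRLD made active */
    };

    static u64 read_vmx_operand(struct gp_regs *regs);     /* hypothetical decoder */
    static void vmfail(struct gp_regs *regs);              /* hypothetical */

    static void emulate_vmxon(struct nested_vmx *nested, struct gp_regs *regs)
    {
        /* The checks real hardware would have done. */
        if (!(vmcs_read(GUEST_CR4) & X86_CR4_VMXE)) {
            inject_ud();        /* VMXON with CR4.VMXE clear is #UD */
            return;
        }
        if (nested->vmxon) {
            vmfail(regs);       /* VMXON while already in VMX operation fails */
            return;
        }

        /* Nothing is executed: just remember that L1 is now a hypervisor and
         * where its VMXON region lives. */
        nested->vmxon = true;
        nested->vmxon_region = read_vmx_operand(regs);
        skip_emulated_instruction();
    }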
Now to the complicated part. Because L0 must never lose control, the host state in the VMCS that actually gets executed by VMLAUNCH must always be L0's. So when L1 wants to VMLAUNCH its guest, it traps into L0, and L0 verifies that the VMCS provided is sane, or at least that its host state is sane, and then loads everything except the host state into a newly created VMCS, which can live in the VMXON area provided earlier. When this new VMCS has been merged out of the L0 VMCS and the VMCS in L1, then L1 can do a VMLAUNCH that enters L2. For better clarity, these VMCSs have names: vmcs01 is the one that enters from L0 to L1, vmcs12 goes from L1 to L2, and the merged result, vmcs02, goes from L0 to L2.

After launching L2 from L0, every exit is going to go back to L0, but we know that L1 is the correct one to handle it, which means we just have to pass the exit we got back into L1, and this is also an operation that combines three VMCSs. Now we have the exit information we got through vmcs02, which is the VMX information, and we copy it into vmcs01, which is going to launch L1. If L2 had exited directly into L1, the hardware would have loaded L1's host state into the hardware registers, so now, when we are launching L1 from L0 to deliver the exit, that host state is copied into the guest state of the new VMCS we are going to launch. The VMX controls and host state come from the original vmcs01 that launches L1, and we just use them, so again this VMCS is going to exit back to L0; its host state is the same one vmcs02 has.

There are some instructions that we haven't used in our self-virtualization but still need to implement if we want to self-virtualize, because a Linux guest is going to use them. These are, for example, VMCLEAR, which just forgets that a VMCS is active and flushes any changes that might have been pending to it, and VMXOFF, which disables the VMX instructions.

Now we know what happens after VMLAUNCH and how the first VM exit is handled, but we still don't know how to continue the execution forever. VMRESUME is basically the same as VMLAUNCH; we just have one additional task to do, because on the VM exit the guest had some state in its general-purpose registers, we saved it, and now on VMRESUME we must restore it, because the hypervisor clobbered it in between. After this is done, we are sufficiently well featured to implement self-virtualization: from now on it's just going to VMRESUME, and after every VM exit, whatever the reason we handle, we will VMRESUME again, and this will continue forever.
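A sketch of that loop as the exit stub it usually is: hypervisor_entry is the address placed into HOST_RIP, so every VM exit lands here on the fresh host stack, saves the guest's general-purpose registers (the stack is the only thing we can touch without clobbering them), calls the dispatcher sketched earlier, restores the registers and goes straight back in with VMRESUME. The push order matches the gp_regs layout above:

    /* File-scope assembly in a C file; handle_vm_exit() is the dispatcher
     * from the earlier sketch, assumed to live in the same file. */
    asm(
    "       .global hypervisor_entry\n"
    "hypervisor_entry:\n"
    "       push %r15; push %r14; push %r13; push %r12\n"
    "       push %r11; push %r10; push %r9;  push %r8\n"
    "       push %rbp; push %rdi; push %rsi\n"
    "       push %rdx; push %rcx; push %rbx; push %rax\n"
    "       mov  %rsp, %rdi\n"              /* struct gp_regs pointer for C code */
    "       call handle_vm_exit\n"
    "       pop  %rax; pop  %rbx; pop  %rcx; pop  %rdx\n"
    "       pop  %rsi; pop  %rdi; pop  %rbp\n"
    "       pop  %r8;  pop  %r9;  pop  %r10; pop  %r11\n"
    "       pop  %r12; pop  %r13; pop  %r14; pop  %r15\n"
    "       vmresume\n"
    "       ud2\n"                          /* only reached if VMRESUME failed */
    );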
So how is this implemented in Linux? We already have most of the definitions of those VMX structures and other useful functions, like VMXON and VMLAUNCH, thanks to KVM and its internal implementation, so we can just include its header files and either refactor the code or copy-paste it. Then we only have to insert our entry point for self-virtualization somewhere, which is best done during boot of the guest, and there are two considerations. You want to do it as early as possible to get maximum control, but you also don't want to do it too early, because then you might not have some features you would like, or it would create more complication for the guest side: you are allocating memory, and if you had to tell the guest not to touch it, or make it unavailable from the hypervisor, that would be harder than letting the kernel allocate it for you, knowing it won't hand it out a second time, and then you have no problems. Another nice thing is that you can use tracing mechanisms like printk for ease of debugging and development, and you can also, if you decide to, switch back and merge the two again, stripping the self-virtualization to regain performance if you like. It always depends on what you want to do with the self-virtualization.

From quick tests with the implementation: because VMX doesn't allow completely exitless operation, we got one MSR VM exit and about 32,000 CPUID VM exits during boot, and the overhead of a VM exit is about 800 cycles on a Haswell machine, so the boot is going to be about 10 milliseconds longer than it would have been, which I guess isn't noticeable, because just one extra daemon is going to take much longer. If you used TSC scaling or some other feature that actually eliminates some work from the guest, in this case a multiply and an addition, then you could get a performance gain from using it on bare metal. But if you then go into nested VMX, you're not going to get very good performance, so just remember that nesting is evil.

If you'd like to play with the code, I have pushed a working version to GitHub, and I think we have a few minutes left, so I can go through the code. Currently it's a very simple implementation. The main init point is here in kernel_init, right at the end, which is probably later than it needs to be, because we don't need that many kernel features, but it's still doable to move it somewhere else. Let me zoom in; sorry, that might be too much; that's good, OK, sorry for that, but now I'll need to search. The actual code itself is in arch/x86/sv, and you just enable it in Kconfig; it's under the section self-virtualization, and don't do it if you value your hardware. The problem is that the current nested VMX in KVM has a bug and doesn't work with this, because it doesn't pass through interrupts; the configuration that the self-virtualization uses is quite rare, and no guest has used it before, so the nested VMX implemented in Linux just didn't think of it, and sadly it will take a while to get fixed, as it's not really an important feature. So you have to run this on bare metal, and, well, any bug you introduce requires you to reboot.

This is the main part, which takes care of the VMLAUNCH: you just load your state, and that's a big, long, ugly function that we might get to; you write your guest entry, where guest_entry is the instruction just after VMLAUNCH here, then put a fresh stack into the host state and hypervisor_entry into the host RIP, and then you just preserve your current stack pointer and launch.
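A compressed sketch of that launch path, reusing the earlier wrappers; hyp_stack and the asm-goto trick are inventions of this example, and the real code in the repository is the long, ugly function mentioned above. Capturing RSP, writing GUEST_RIP and executing VMLAUNCH in a single asm block keeps the guest's first stack identical to the one this function runs on:

    #include <linux/errno.h>
    #include <linux/printk.h>

    extern char hypervisor_entry[];         /* the assembly exit stub above */

    static int sv_launch(void *hyp_stack, unsigned long stack_size)
    {
        vmcs_write(HOST_RIP, (unsigned long)hypervisor_entry);
        vmcs_write(HOST_RSP, (unsigned long)hyp_stack + stack_size);

        asm volatile goto(
            "   mov     %%rsp, %%rax\n\t"
            "   vmwrite %%rax, %[rsp_f]\n\t"        /* GUEST_RSP = current stack */
            "   lea     %l[guest_entry](%%rip), %%rax\n\t"
            "   vmwrite %%rax, %[rip_f]\n\t"        /* GUEST_RIP = label below */
            "   vmlaunch\n\t"
            : : [rsp_f] "r"((unsigned long)GUEST_RSP),
                [rip_f] "r"((unsigned long)GUEST_RIP)
            : "rax", "cc", "memory"
            : guest_entry);

        /* Fell through: the VM entry failed, report it and stay unvirtualized. */
        pr_err("sv: VM entry error %lu\n", vmcs_read(VM_INSTRUCTION_ERROR));
        return -EIO;

    guest_entry:
        /* VMLAUNCH worked: from here on we are the guest, and boot continues
         * as if nothing had happened. */
        return 0;
    }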
After the launch, execution is going to continue right here. In case you mess something up, you'll get a VM entry error and you'll continue just fine, and if it worked, you won't get any error, it continues right here, and the boot will resume at the point where you started to self-virtualize. If you get a VM exit, then it goes into hypervisor_entry, which starts here and is just a block of code that saves the general-purpose registers and then enters the VM exit handler, which only reads the VM exit reason, and if that fails, it BUGs. The BUG has a nice feature that on the current Linux implementation you just fall back out of the self-virtualization into normal operation, so you could try again if this were a module. If not, then you just handle the VM exit and VMRESUME, and this is going to continue all the way around. Probably the simplest VM exit handler is handle_cpuid, which, as you can see, basically just reads what the input to CPUID would have been in the guest, executes the CPUID itself, and then loads the output back.

I think I'm almost running out of time, and it's still not perfect. There were some questions about how to get all these VM controls to zero: when you don't want any VM exits, you disable most of the VM controls, and there are quite a lot of them. If you wanted to intercept interrupts or exceptions, you would just enable them back. For example, right now we don't intercept any exceptions, because the exception bitmap is zero, but each bit in it represents one exception, and by flipping a bit from zero you would get a VM exit for that exception. Similarly, you can trigger VM exits for almost anything you can imagine by configuring these VMCS controls.

So that's all from me. I'd like to reiterate that virtualization is not a tool for every problem you might have. It can help you get better performance or isolation, but if you then want to virtualize on top of it again, you lose all the performance, and normal guests running under VMX usually trigger far more VM exits than the self-virtualization does, so the overhead is higher; I would say 2,500 cycles for a minimal VM exit. So beware that any workload you run on bare metal might not run as well in a virtual machine. Thank you; if you have any questions? Yes, please.

Well, I guess I could do it. If I planned to submit it upstream, I guess I could do it on some April 1st next year, but the most upstreamable part of this is the refactoring of KVM's VMX code to make it reusable for other projects, and then finding bugs, which we already did, and fixing them in the emulation of VMX in KVM. But we'll see; maybe if you want to make it upstreamable or build something out of it, it's going to happen. Any other questions? I guess we're done, so have a nice day and make it to lunch safely.