All right, we're going to get started. I'm pleased to introduce, from the Xen project, Ross Lagerwall, I'll get that right someday, from Citrix. Thank you, Ross. Hi there. So today I'm going to be talking about the state of live patching in the Xen hypervisor. This is a project that I did with Konrad Wilk. The high-level view is that we've got support for live patching in Xen 4.7, and it was extended in 4.8 to support ARM in addition to x86. So the first question you might ask is: what's the point of live patching in Xen when you've already got live migration support? There are a few reasons. For a start, a VM might be using a pass-through device like a network card or a GPU; that's tied to the hardware itself, so it's not possible to live migrate it. Also, some VMs have fairly serious real-time constraints and can't take the second or two of downtime that you might get with live migration. Some VMs just don't fit anywhere else: you might have a terabyte-sized VM that doesn't fit on any other host in the system. And if you're a cloud provider, you generally want to avoid downtime as much as possible. If you've got 10,000 VMs, sorry, 10,000 hosts, and each host takes 15 minutes to patch and reboot, then that's a lot of computing time that's just wasted. So I think all of these mean it's good to support live patching in addition to live migration. For the basics of what we're doing: it would be possible to do in-place replacement of hypervisor code, just overwriting existing code in memory with some new code. The problem is that we usually want to use this for security issues, and security fixes tend to add an extra check or something, in which case the function gets longer and then wouldn't fit in place of the old one.
So the approach we take is to insert a trampoline into the old function, which then allows you to jump from the old function to the new, replacement function. It means that the live patching we have is function-level granularity, so the smallest thing we can patch is a function. This is just like Linux live patching. We talk about payloads in live patching with Xen: a payload is a set of replacement functions that together make up the live patch, plus metadata which describes which functions need to be replaced with what. It's packaged as a relocatable ELF file, just like a kernel module, and just like a kernel module it has symbols that need to be resolved. The approach we took was to add a simple sort of linker into Xen to resolve those symbols at load time. We took this approach because Xen already has symbol tables built into it and can access them, and it just seemed simpler to do all the module loading in one place, in the hypervisor. To complete the module loading, we implemented support for performing relocations, applying alternative instructions, and the other things that are done for Linux kernel module loading. So this was very much inspired by the Linux kernel module loader. It's useful to be able to apply multiple payloads, or live patches, to a hypervisor. Take the security example: if you have a security fix that you patch one week and then two weeks later another one comes out, it's useful to be able to apply one live patch on top of another live patch. The problem is that live patches depend completely on the hypervisor's internal ABI. They access offsets into structs and call functions, so they depend completely on that ABI, and Xen's internal ABI is not stable. The approach that we took to resolve this is to embed a build ID in each hypervisor and live patch; this is like a checksum of its binary contents.
Each payload then declares, at build time, which build ID it needs to be applied on. This creates a model where you have a stack of live patches, with the original hypervisor at the base. So this is how we allow applying multiple payloads. It's not the most flexible of models and we could consider changing it in the future, but this is what we've got for the moment. Can you repeat the question, please? Oh, sorry, okay. So, does it replace the build ID of the original hypervisor if you load a new patch? No, it doesn't, but when you apply a patch you always check against the last build ID at the top of the stack. So in effect, I mean, it kind of does, but they're all still available. Now, once the payload is loaded into memory, it still needs to be applied for it to actually take effect. Because we don't want to modify code while it's being executed, the approach we take is to effectively quiesce the system, or pause it. We have one CPU that does the patching work and all the other CPUs are temporarily stopped. We do this by checking for any live patches that need to be applied when the hypervisor is about to exit to start running the guest. We do this because at that point it basically discards its stack, so there are no hypervisor functions that could be in use at the time; the only functions that could be in use are the patching functions themselves. And there's just a small subset of functions that are known to not be patchable. This avoids having to check the stacks for any functions that are in use, which is what some of the approaches for Linux live patching do. So if we look at the application process in more detail, the first CPU to arrive at the patching code is designated the master, and it then just waits for all the other CPUs to arrive. Once they arrive, all the CPUs disable interrupts to avoid getting interrupted and executing other functions.
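Going back to the build-ID dependency model for a second, the stacking rule can be sketched roughly like this. This is an invented illustration, not Xen's actual code; the names and the fixed-size stack are assumptions made for the example:

```c
#include <string.h>

/* Sketch of the stacked dependency model: each payload names the build ID
 * it must be applied on top of, and that must match whatever is currently
 * at the top of the stack (initially the hypervisor's own build ID). */
#define MAX_PATCHES 8

static const char *stack[MAX_PATCHES];
static int depth;

static void stack_init(const char *xen_build_id)
{
    stack[0] = xen_build_id;
    depth = 1;
}

/* Returns 0 on success, -1 if the payload was built against the wrong
 * base and must be refused. */
static int apply_payload(const char *depends_on, const char *build_id)
{
    if (depth == MAX_PATCHES || strcmp(stack[depth - 1], depends_on) != 0)
        return -1;
    stack[depth++] = build_id;   /* this payload is the new top */
    return 0;
}
```

The point of the check is exactly what the question touched on: a second patch built against the original hypervisor's build ID is refused once a first patch is already on the stack.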
Those CPUs then just spin and wait for the patching to complete. The master then disables write protection so that it can modify code in memory. Then there are some hook functions, which I'll talk about later. Then, for each function that needs to be replaced, we calculate the offset in memory between the existing function and the new function, save the first five bytes of the old function somewhere else in memory, and write a 32-bit near, relative jump into the beginning of the old function, which redirects it straight to the new function. This basically allows us to avoid having to modify all the callers of the old function: if you just replace the function, the only thing you're changing is effectively a pointer at the beginning of the old one. After all the functions are done, we re-enable write protection, flush the CPU cache because we've modified executable code, and re-enable interrupts. This then signals to the slaves that the patching is complete and they do the same. The whole process usually takes just a few microseconds, so it doesn't really interrupt the system at all; scheduling latency is usually more than that. If you need to revert a live patch, which is possible to do, the same thing happens, but instead of writing a jump, you just restore the five bytes that you stashed when you were applying the patch in the first place. We also had to extend the hypervisor interface to allow applying live patches. So there are a few sub-ops for one of the hypercalls, which are like Xen's system calls, for uploading payloads, listing payloads and their current state, applying and reverting, and all the operations that you want to do. And they're controllable through XSM, so you can control who is able to do what, although by default dom0 can do everything and no other domain can do anything. We've also got a tool for accessing this functionality, which comes with Xen, and it's available through libxc.
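As a rough illustration of those apply and revert steps, here's a sketch operating on a plain byte buffer standing in for code memory. The encoding is the standard x86 5-byte "jmp rel32", but the names and structure are invented for the example; real Xen additionally clears write protection, flushes caches and rendezvouses all CPUs around this, none of which is shown:

```c
#include <stdint.h>
#include <string.h>

/* Per-function patch state: where the old function starts, and the
 * stashed original bytes needed to revert later. */
struct func_patch {
    uint8_t *old_fn;        /* start of the function being patched */
    uint8_t saved[5];       /* stashed original bytes, for revert */
};

/* Write "jmp rel32" (opcode 0xE9) redirecting old_fn to new_fn.  The
 * displacement is relative to the end of the 5-byte instruction. */
static void patch_apply(struct func_patch *p, uintptr_t new_fn)
{
    int32_t disp = (int32_t)(new_fn - ((uintptr_t)p->old_fn + 5));
    memcpy(p->saved, p->old_fn, 5);    /* save the old prologue */
    p->old_fn[0] = 0xE9;
    memcpy(p->old_fn + 1, &disp, 4);   /* little-endian displacement */
}

/* Revert just puts the stashed bytes back. */
static void patch_revert(struct func_patch *p)
{
    memcpy(p->old_fn, p->saved, 5);
}
```

This is why callers never need touching: they still call the old address, and the five overwritten bytes forward them to the replacement.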
So if we compare this with the way that Linux live patching is done, it's very similar to the kpatch model, where basically you run stop_machine in Linux, all the CPUs are paused, and then it does the patching. The difference there is that kpatch has to check stacks, because there can be hundreds of processes in the system executing code at different points, so functions may be in use, and it's not really possible to patch a function that's in use safely. But because of the way that we do the patching in Xen, we don't need to do the stack checking, which reduces the complexity. Also, for VMs, latency is not super critical, so we don't use the kGraft model, where there's no stop machine and no temporary pause; instead, each process is moved between the old world and the new world individually when it exits to user space. We just didn't need that complexity. We don't have any ftrace infrastructure or profiling calls in Xen, so we just overwrite the start of the function instead. The advantage of this is that there's minimal overhead, it's just a jump, whereas Linux has to call into the ftrace infrastructure and look up the handler, and only then redirect to the new function. It also means that we can patch basically any function, rather than only those that are ftrace-able, which is usually somewhat fewer. So that's the hypervisor implementation. The other reasonably complex part is actually building live patches themselves. Now, it is possible to stick some replacement functions in a file, compile it with the correct options, and create a live patch by hand. The main problem is compiler optimizations. If you change one function, then GCC is free to inline it, and you don't really know which functions that change will actually end up in. Also, GCC will often clone functions under different names.
Some parameters are held constant in these clones, and so this makes it more complex, as does just changing a macro in a header file: you don't know where on earth that might get used, and it's quite difficult to tell. So the approach we took is to use the unimaginatively titled livepatch-build-tools. This is based on the kpatch-build tool, which was developed by Red Hat for Linux live patching. These tools are pretty simple to run, and pretty quick as well. They take the source tree of the running version of the hypervisor and the config that was used for it. They also take the build ID onto which the patch needs to be applied; this is what implements the stacked dependency checking that we have in the hypervisor. And then, most importantly, they take the source code patch, which is the interesting part. So if you look at how it actually builds the live patch: it first does a complete build of Xen, to match the existing build of Xen that's currently running on some host somewhere. We then apply the source code patch and build Xen again with a magic GCC option (-ffunction-sections) to split each function into its own ELF section, which makes comparing the object files a lot easier later. We then revert the source patch and build Xen again with the same option. Each time we do this, we use a compiler wrapper to capture which object files have changed. We then run a diff tool on those changed object files, and that creates the live patch. So going into that diff process in more detail: for each pair of original and patched object files, once they're loaded into memory, you have to demangle them a bit, because GCC is somewhat non-deterministic in how it names the optimized function clones that it creates and static local variables; there are some heuristics to demangle these names. The sections, symbols, and functions are all then correlated between the two, so things with the same names are basically matched up into pairs.
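The core of that pairing-and-comparison step can be sketched like this. The types and function names are invented for illustration, not livepatch-build-tools' real ones:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A function extracted from an object file: its name and its bytes. */
struct func {
    const char *name;
    const uint8_t *bytes;
    size_t len;
};

/* Find a function by name, pairing entries between the two objects. */
static const struct func *find_func(const struct func *fns, size_t n,
                                    const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(fns[i].name, name) == 0)
            return &fns[i];
    return NULL;
}

/* A patched function is "new" if it has no counterpart in the original
 * object, and "changed" if its counterpart's bytes differ; either way
 * it must be included in the live patch. */
static int func_needs_patch(const struct func *orig,
                            const struct func *patched)
{
    return orig == NULL ||
           orig->len != patched->len ||
           memcmp(orig->bytes, patched->bytes, patched->len) != 0;
}
```

Splitting each function into its own section is what makes this per-function byte comparison possible in the first place; otherwise a one-function change would perturb offsets throughout the whole text section.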
For each of these pairs, we just do a simple binary comparison to determine whether they're the same, changed, or new. Then each changed function is marked as needing to be included, and we recursively include anything new, or any changed function, that it references. There's some special logic to handle things like exception tables and bug frames, because we need to include the relevant portions, any of the changed portions, basically, in the live patch, so they need to be split out. It then creates some metadata which the hypervisor uses to determine which functions need to be replaced; this is how it iterates through that list. We then output an ELF file for each of those pairs and link them all together into a relocatable module, which looks rather like a Linux kernel module, and that's what gets loaded into the hypervisor. There's some trickiness in handling live patches with data. If a live patch creates new data, say global static data or new data like strings, that's easy: it's included in the live patch and everything's handled correctly. If you try to change the initialization of data, or change the format of a data structure, there's no automatic way of creating a live patch for that, so the tool just prevents you from doing it. The approach that we took instead was to allow hook functions to be executed while the patch is being applied. This allows you to dynamically transform data, so you can iterate over the list of domains and do whatever transformation is needed, or do one-off initializations. We also have shadow variables, which are a way of attaching new data, or some sort of data structure, to an existing address in memory. The new data structure or member variable is stored in a separate hash table, so it's kind of a way of attaching a new member to an existing data structure.
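A toy sketch of that shadow-variable idea, using a simple linked list in place of the hash table; the names are made up, and Xen's real implementation differs:

```c
#include <stddef.h>
#include <stdlib.h>

/* New per-object state is kept in a side table keyed by the address of
 * the existing structure plus an id, instead of changing the structure's
 * layout (which a live patch cannot safely do to objects that already
 * exist in memory). */
struct shadow_var {
    const void *obj;          /* address of the existing structure */
    int id;                   /* which "new member" this represents */
    void *data;               /* the attached data */
    struct shadow_var *next;
};

static struct shadow_var *shadow_list;

/* Attach new data to an existing object.  Returns 0 on success. */
static int shadow_attach(const void *obj, int id, void *data)
{
    struct shadow_var *s = malloc(sizeof(*s));
    if (!s)
        return -1;
    s->obj = obj;
    s->id = id;
    s->data = data;
    s->next = shadow_list;
    shadow_list = s;
    return 0;
}

/* Look up the attached data, or NULL if none was attached. */
static void *shadow_get(const void *obj, int id)
{
    for (struct shadow_var *s = shadow_list; s; s = s->next)
        if (s->obj == obj && s->id == id)
            return s->data;
    return NULL;
}
```

The patched replacement functions then call the lookup wherever the original code would have read the new member, which is why this fits naturally with the hook functions that run at apply time.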
But this then needs to be done in the hook functions, or we might need to modify the patch somewhat. In some cases it's possible to rewrite the patch in such a way that it doesn't actually need to use a new data member. Mostly, when patches are written, they're not written with live patching in mind, but it's often possible to do that. So compare this with Linux live patching: they have general-purpose modules, and their patching mechanism tries to hook into that at various points, but it can be a bit tricky trying to get it to hook into the right points. Xen does not support loadable code modules in general, so the patching process is quite straightforward and it integrates quite easily. Supporting loadable modules also creates quite a lot of issues, like if you try to patch a module that's not loaded, or a patch needs to reference a symbol that's not exported from a module; all these things create quite a few issues, and it's possible for modules to have the same symbols, which makes them difficult to resolve. So we just sidestep a lot of those issues. kGraft, as far as I know, does not have a tool to automatically create live patches. They have a tool to diff builds using the DWARF info, I think, and then you're meant to just stick the replacement functions in a file and build that using the normal kernel module build system. As for the kpatch people themselves, even though they created this tool that we repurposed for Xen, I believe they're also moving towards not using it for one reason or another, although as far as I can tell it's still a lot easier than creating live patches by hand. So that's kind of where we're up to. As for future work, there's an issue at the moment with handling NMIs and MCEs during the critical patching region.
So at the moment they're just ignored, but we want to change it so that they're recorded and then replayed after the critical section is left, although hopefully there's a very low chance that you get an NMI in that couple of microseconds. We also want to add support for signing payloads. This would prevent, or make it more difficult, to accidentally or intentionally load a malicious payload, especially if you consider the vendor distribution scenario. I've got some patches to do that which I need to send out. Then, finally, I think we want to remove the experimental tag in Xen and declare it a supported feature. Part of this work is adding it to the upstream osstest testing infrastructure. There are some patches on the list to do that, but I don't think they're complete yet. But I think once that happens, we'd be prepared to call it supported, and that's kind of done. So yeah, that's pretty much the state of where we are. For any questions I think we've got a minute or two; otherwise, feel free to find me at the Xen booth, where I've got a demo of it running, so you can see it there. But you do need to have the same toolchain, basically an identical build system. So if you're using Mock or something like that to build Xen, it's pretty straightforward. Any other questions? Thank you very much.