Hi everyone, I'm James. I apologize for being remote; I wish I could make it there, but the scheduling just didn't work out. I'm presenting along with my colleague, David Woodhouse. We are both hypervisor engineers in EC2, specifically working on the virtualization aspects of EC2. This is perhaps a bit poorly or confusingly titled, but it's essentially an RFC presentation discussing some options for making guest memory and state persistable, so they can be recreated or restored after a live update, where the live update is a kexec, and doing this in the presence of some virtualization requirements like device passthrough, memory oversubscription, and some other aspects which we'll go into.

Roughly, we'll go through the problem and requirements, implementation options, a comparison of the approaches, and then have a discussion afterwards. I'm most keen on the discussion and on getting people's opinions on this.

Fundamentally, a live update of the hypervisor is done by a kexec: the state of the virtual machines is serialized, a kexec is done into the new version of the hypervisor, the userspace VMM processes are started up, the guest state is deserialized, and then the virtual machines are allowed to run, basically picking up from where they left off. We need to persist guest memory across this kexec, so that when the guests pick up after the kexec their memory is still there. As far as I can tell, that implies this memory can't be kernel-managed in the traditional sense; it needs to be managed separately, and we'll go into options for that.

We also need to support memory overcommit, where some sort of control process can reclaim pages from an instance, a guest virtual machine, and put them somewhere else, like a swap-out workflow, or reclaim them using balloon or free page reporting, that sort of thing. That's another requirement here. Also the ability to support sidecar virtual machines, where a guest can donate a portion of its memory to run another virtual machine. And the ability to deliver page faults to userspace, for example when doing post-copy live migration, or in the case of a swap workflow when you need to swap in a page that has been swapped out. Another requirement, for device passthrough, is to be able to hand over small slices of PCI BARs to virtual machines. So these are the constraints: we're looking at how to build something that handles these various requirements for guest memory and these other aspects.

As I mentioned, the guest memory probably should not be touched by the new kernel, so something like a kernel command line parameter could carve out a large portion of the host memory space for guest memory and have the kernel manage only a small portion of it. We also need to play nicely with something like userfaultfd for faults; only anonymous memory is supported with userfaultfd currently, I believe, so we need to figure out how to do that if we are not using anonymous, kernel-managed memory. And I think I've mentioned the PCI BARs already.
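To make that concrete, here is a minimal sketch of what backing a KVM memslot with carved-out, non-kernel-managed memory could look like. It assumes the reserved region is exposed as a devdax character device at a hypothetical /dev/dax0.0; the device path and sizes are illustrative only, and error handling is trimmed.

```c
/*
 * Sketch: back a KVM memslot with a mapping of reserved, non-kernel-managed
 * memory (assumed here to be a devdax device) instead of anonymous memory.
 */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define GUEST_MEM_SIZE (1ULL << 30)     /* 1 GiB of guest RAM */

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

    /* Map the carved-out memory; none of this comes from the page allocator. */
    int dax = open("/dev/dax0.0", O_RDWR);          /* hypothetical path */
    void *mem = mmap(NULL, GUEST_MEM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, dax, 0);
    if (kvm < 0 || vm < 0 || mem == MAP_FAILED) {
        perror("setup");
        return 1;
    }

    /* Tell KVM that guest physical 0..1 GiB is backed by that mapping. */
    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = 0,
        .memory_size     = GUEST_MEM_SIZE,
        .userspace_addr  = (uint64_t)(uintptr_t)mem,
    };
    if (ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
        perror("KVM_SET_USER_MEMORY_REGION");
        return 1;
    }

    /* ... create VCPUs, restore serialized guest state, run ... */
    return 0;
}
```

The live update question is then: what provides that mapping, who remembers which parts of the reserved region belong to which guest, and how the same layout is re-established after the kexec.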
There's also a requirement for a privileged process to be able to do mandatory access control on the guest memory, where, for example, when supporting sidecar VMs, or when doing something like a swap workflow, we need to be able to remove pages from the guest memory space and treat the VMM process as essentially an untrusted process. So a control process needs mandatory access control to modify the guest memory mappings. Also just note that we don't need struct pages for any of this; there's already been work upstream to remove the need for struct pages for much of the KVM guest memory mappings.

Some other things to keep in mind: IOMMU mappings need to be kept in sync with the userspace mappings. As guest memory is reclaimed through something like a memory oversubscription component, or mapped back in, the IOMMU also needs to be kept in sync. And the IOMMU page tables need to persist so that DMA can keep running during the kexec. Another potential use case is doing faster kexecs by passing state between the old kernel and the new kernel. So this is what we're looking at developing a solution for, and specifically we want to develop something upstream to solve it. Before I move on to possible implementations, are there any questions or clarifications on the problem, on what we're actually trying to do here?

Yes, one small question: you said struct pages aren't required. Does that also imply that you don't need any...

Yes, that is correct. I think that in fact with the TDP MMU that requirement was completely removed, if I'm correct. So that's not necessary; essentially we can treat all of this as PFN-mapped memory.

Thanks. I was just trying to see if there was overlap with something else I'm thinking of.

Sure. I'm going to share this... okay, I'm not sure how to make this go away, but I'll move on. Okay, any other questions? Unfortunately I can't see you, so anyone feel free to interrupt at any point.

Okay, so now let's get onto some possible implementation options, ways we're thinking of doing this. One option is basically to use something like a fully-fledged file system. By a fully-fledged file system I mean a file system in the kernel that is responsible for doing the page allocations and all those sorts of things, like a traditional file system does, but on this reserved chunk of memory which the kernel is not otherwise managing. So you have a file system manage that. Just like any file system, it would keep metadata about which pages of which files have been allocated where in that memory. You could potentially create and mount this file system on top of something like a DAX device, or supply kernel command line parameters to indicate that this file system should work on this large reserved chunk of memory. Then the files in this file system could be used for guest memory. This is probably something like PKRAM on top of DAX. The PKRAM patches were, I think, posted again sometime last year, but I haven't seen much movement on them, so I'm not sure what the status of PKRAM is. If we tried to do something like PKRAM on top of DAX, it's not too clear to me how we would handle other sorts of memory, like PCI BARs. If we need to pass through a portion of PCI BAR space, how do you convince this file system to do that? I'm struggling to see that.
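Setting the PCI BAR question aside for a moment, the appeal of this first option is that guest memory becomes an ordinary named file that survives the kexec, so the VMM simply reopens the same path afterwards and the file system's own metadata records which reserved pages back it. A minimal sketch, assuming a hypothetical persistent file system mounted at /persist:

```c
/*
 * Sketch of the "fully-fledged file system" option: guest memory is a plain
 * file on a hypothetical persistent file system mounted over the reserved
 * region (here at /persist). The same open()+mmap() is used both before and
 * after the kexec; the file system's metadata, kept in the reserved memory,
 * records which pages back which file.
 */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define GUEST_MEM_SIZE (1ULL << 30)

void *open_guest_memory(const char *path)   /* e.g. "/persist/vm0/ram" */
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, GUEST_MEM_SIZE) < 0) {
        close(fd);
        return NULL;
    }

    /* A DAX-style file system might additionally want MAP_SYNC here. */
    void *mem = mmap(NULL, GUEST_MEM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);
    return mem == MAP_FAILED ? NULL : mem;
}
```

The mapping returned here would then be registered with KVM exactly as in the earlier memslot sketch.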
There are also the mandatory access control aspects, where we want to carve out a chunk of a guest's memory and use it to spawn another guest, or map portions to a scratch page, or trigger some sort of userspace-driven swap workflow if a portion of the file is accessed. I think we would struggle with the file system semantics there; it's not clear how we would do those mandatory access control parts if the kernel were fully managing this file system itself. So that's one option we're looking at.

The second option is to have the kernel do less of the file system work itself and punt it to userspace through something like a FUSE protocol. FUSE today is in the data path, where it has to move the actual data for read and write operations on files. Instead of that, when a guest tries to access a portion of one of these files, the kernel would bubble up something like a fault to userspace and say: this portion of this file has been accessed, what would you like to do? And userspace could say: I would like to map it to this PFN, or something like that. So you still have a sort of file system interface with files, but the actual mappings and allocations, which part of which file is backed by which memory page, those decisions would be made by a userspace control process.

In this case userspace probably also needs to keep track of where all the mappings are, in other words which parts of which file correspond to which bit of memory, so that after the kexec, once you've done your live update and those files start getting accessed again, userspace can replay those same mappings into the kernel and the same memory can be provided. Where was I? Yes, and the control process would also be able to do the more advanced things that you would struggle to do with the first, fully-fledged file system option: the control process could tell the file system to unmap this page, and if it gets accessed again we need to trigger a swap-in workflow. It could probably also handle things like the PCI BARs, where if a portion of a file is accessed, this userspace FUSE-like thing could say, well, you need to back this with a particular PCI address now, and set those mappings up, so that would then be installed in the PFN-mapped page tables. Maybe the allocations could be assisted by the file system itself keeping metadata about allocations, or maybe everything would just be recreated from userspace after the live update; I'm not sure. The thing that makes me a bit nervous about keeping all the allocations inside the kernel, inside this file system, is that if you want to add new node types you need things like translators to be able to roll forwards and backwards, and that gets a bit hairy. Passing that sort of state from one kernel to another could be a bit hairy, and perhaps just getting userspace to re-inject those mappings is the way to go.
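As a purely illustrative sketch of that FUSE-style flow: none of the structures or helpers below exist today; they are hypothetical and only show the shape of a userspace control process that receives fault notifications for a guest-memory file and replies with the host PFN that should back the faulting offset.

```c
/*
 * Hypothetical sketch only: these structures and helpers do not exist.
 * The idea is that a fault on a guest-memory file is bubbled up to a
 * userspace control process, which answers with the backing it wants:
 * a host PFN in the reserved region, a PCI BAR page, or a "swap it in
 * first" action.
 */
#include <stdint.h>

struct gmem_fault {                /* hypothetical fault notification */
    uint64_t file_id;              /* which guest-memory file */
    uint64_t offset;               /* faulting offset within it */
};

struct gmem_map_reply {            /* hypothetical reply */
    uint64_t file_id;
    uint64_t offset;
    uint64_t host_pfn;             /* reserved RAM, or a PCI BAR page */
};

/* The control process keeps its own persistent record of offset-to-PFN
 * mappings; this is the record it replays after a kexec. */
extern uint64_t lookup_mapping(uint64_t file_id, uint64_t offset);
extern int read_fault(int chan_fd, struct gmem_fault *f);       /* hypothetical */
extern int send_reply(int chan_fd, struct gmem_map_reply *r);   /* hypothetical */

void control_loop(int chan_fd)
{
    struct gmem_fault fault;

    while (read_fault(chan_fd, &fault) == 0) {
        struct gmem_map_reply reply = {
            .file_id  = fault.file_id,
            .offset   = fault.offset,
            .host_pfn = lookup_mapping(fault.file_id, fault.offset),
        };
        send_reply(chan_fd, &reply);
    }
}
```

The same loop could instead decide to trigger a swap-in workflow or refuse the access entirely, which is where the mandatory access control would come in.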
And the third option is something much more like a raw memory device: a character device with just file descriptors, no file system involved, just raw FDs, something like a dev memfd or even a memfd. These FDs could be initialized, or instantiated, with backing memory, and that backing memory could be something like a DAX device, or even /dev/mem, or a PCI BAR, something like that. The control process could then do ioctls on the file descriptor that gets handed to a particular virtual machine and say which offsets in that file descriptor correspond to which actual host addresses. So in a way it's similar to the FUSE option, in that userspace is basically programming in mappings, saying which host physical addresses back which parts of a file, but instead of a file system interface you just have an FD. It's possibly easier to do this without the constraints of a file system: just having raw FDs means you can have a new device type and things like that, instead of having to work within the constraints of a file-system-style interface. Once again, userspace would need to persist the mappings across the kexec so it can re-inject the same mappings when those FDs start getting accessed again. This is something we have built and are running using this sort of interface, but it is very tailored at the moment to this one particular use case we have, and it involves clumsily passing file descriptors around, so we're exploring other options for how we can actually build something upstream to support this live update use case.

Okay, any questions so far before we start looking at a comparison between these approaches?

Maybe I have one question. You're speaking about restoring mappings and so on, but after the kexec you perhaps also have to restore other things, like open files and similar. To me it resembles checkpoint/restore, which already exists, where they want to save all the system state, all the open file descriptors, pipes, perhaps even networking connections, and then restore it in another container or another virtual machine. Here you want to restore it on the same machine, just under a different kernel, but the principle seems very similar, so I'm wondering what the intersections are here?

Sure, yeah. I think one thing that's different from what you're describing is that we're not trying to get everything recreated automatically. The idea is to kexec into a completely fresh kernel and restart all of the userspace processes; they need to come up, open their files again, create their KVM virtual machines again, all that sort of stuff. So you really are creating everything from scratch, but what you want is to allow those virtual machines to pick up from where they left off, in terms of having the same memory and the same state. The way we think about live update is that it's like live migration in time instead of space: instead of moving the state of the machine and all its memory to a different host and starting up a new VMM on the new host, you kexec into a new kernel, a complete file system, a new version of the VMM, et cetera, on the same host, and then just pick up your state from memory, and it's like completing the live migration from there. So there are no file descriptors, no userspace state to carry over. In this particular case there isn't any networking at all; we don't even build CONFIG_NET, because it's all passthrough devices.
So there's nothing of that kind, nothing of the checkpoint/restore style, to be moved across.

Okay, thanks.

Yeah, thanks for the question.

Are the guests doing the equivalent of a suspend before the kexec and a resume after?

The guests are actually mostly oblivious to this. The guests get paused, their state is serialized and stashed somewhere in this persistent memory we're talking about, which is not kernel-managed. After the kexec, the KVM VMs are created fresh, but then that state, such as the VCPU state, where all the registers are, everything, gets re-injected back into KVM, and the VCPU is resumed from basically exactly the instruction pointer where it left off. So other than a short pause while the actual kexec happens and the userspace VMM process is restarted, the guest is unaware of this live update. It basically experiences some steal time, like it might in live migration. A little more, unfortunately, but we're working on that.

But that ties into the other thing we spoke of as future work, right? We talked about a way to pass information from one kernel to the next. Any driver or subsystem could theoretically pass a blob of information, a bit like a piece of device tree, to the incarnation of that same driver in the next kernel. The classic one is loops_per_jiffy, right? Why bother recalibrating when you can just be told? Any driver can be given "this is the exact state of your hardware that I left it in for you half a second ago" and treat that as an opaque blob for that driver. It can decide that it wants to consume it, hey, I don't have to go and re-initialize the hardware or enumerate what disks are on the other side of it, or it can decide to throw it away and say, I don't trust that old kernel, I'm going to re-initialize for myself. So you can actually get the pause time down quite significantly with tricks like that, which may or may not eventually tie into the stuff we want to persist from one kernel to the next, but we're trying not to go down that rat hole too far right now.

Yeah, I'm trying to keep this comparatively simple initially: how do we get guest memory, within all these virtualization constraints, across the kexec. Okay, any other questions before we... well, I guess I'll just keep going, but as I said, feel free to interrupt.

So, looking at these different options. The nice thing about having the fully-fledged file system in the kernel is that all your allocations, mappings, and so on are done by the kernel; you don't need to switch to userspace. You get the advantages of a file system, which is really nice: you can see all your guest memory files and use file system semantics. And if all the state, allocations, and mappings are persisted within the kernel in this persistent memory, it can be available immediately after the kexec without needing to reprogram any mappings from userspace. But the more advanced cases, how to do mandatory access control, deliver page faults, handle PCI BARs and IOMMU page tables, I think become a lot more complicated, and it's unclear how to do them. With the FUSE-style file system you're pushing more messages and more control to userspace; there's a better path for getting PCI BARs and faults across to userspace so it can make decisions about them.
Although it could still be difficult to get some of the mandatory access control logic, or interfaces I suppose, convincing your file system to do this kind of thing using traditional file system semantics. The raw memory device, where we move away from a file system completely and just use file descriptors, is probably the most flexible: a custom device, ioctls, all that sort of thing. And we're not bound by file system semantics, which has pros and cons. But I think the main disadvantages are that we have to pass file descriptors around and we lose the introspectability of a file system.

So those are the options we're thinking about, with advantages and disadvantages to each. We also need to think about what we're going to do about IOMMU remapping. To the best of my knowledge it's still the case that doing device passthrough with VFIO causes all the memory to be pinned, and if we want to do this dynamic memory stuff, where we reclaim pages from virtual machines and map them back in later, we need dynamic IOMMU remapping as well, where the userspace page tables and the IOMMU's page tables are kept in sync. I almost think that's an orthogonal thing we need to solve somehow, although it is also a requirement we need to figure out an approach for.

I want to comment on that. Hi, this is David. For the VFIO IOMMU, you might want to take a look at virtio-mem. virtio-mem is able to reconfigure that, but it's something different from your traditional free page reporting: it's really about adding and removing memory from a virtual machine, and it is able to reprogram the IOMMU accordingly. Maybe that gives you some clue about what you might or might not want to do. Just a heads up.

Okay, that's great. So you're saying it can actually reprogram the host's IOMMU when pages are removed?

Yes, you essentially have some kind of sparse memory area that you expose to the virtual machine, but you program the IOMMU only for the parts that actually have semantics for the guest, meaning the parts that are currently usable by the guest, and you update the IOMMU as that changes. The downside is that you have to do that in chunks of some kind, because I think you can only have a maximum of 64K mappings in the VFIO IOMMU. That sets a limit on how many different mappings you can actually have, and on the granularity at which you can add or remove memory from the virtual machine.

Okay, so it operates in these chunks. If we're doing this kind of individual-page-level thing I suspect we will need a more granular interface, but it'll definitely be interesting to look at, just to see how that's done.

Okay, just carrying on. Obviously we need to decide what approach we're going to attempt to build, and work with upstream collaborators to develop something. Just as a straw man, the FUSE-style file system with raw PFN mappings from userspace seems like maybe the most appealing way to do it. We would like to get an RFC patch series together in the coming months, based on what we think is the best approach to pursue, and hopefully later this year present some more polished patches and approaches at KVM Forum. So that's what I have. I'll open the floor for comments and feedback. I have some ideas for feedback here, but yeah, the floor is open to questions.
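For reference, the kind of VFIO map/unmap flow David describes for virtio-mem can be sketched with the existing VFIO type1 API: ranges that are currently usable by the guest are DMA-mapped, and reclaimed ranges are unmapped again, which is also where the limit on the number of simultaneous mappings and the chunk granularity comes from. A rough sketch, assuming container_fd is an already-configured VFIO container:

```c
/*
 * Rough sketch: keep the VFIO IOMMU in sync with the userspace view of
 * guest memory, in the spirit of what virtio-mem does. Ranges currently
 * usable by the guest are DMA-mapped; reclaimed ranges are unmapped.
 * container_fd is assumed to be an already-configured VFIO container.
 */
#include <linux/vfio.h>
#include <stdint.h>
#include <sys/ioctl.h>

int plug_range(int container_fd, void *vaddr, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uint64_t)(uintptr_t)vaddr,
        .iova  = iova,
        .size  = size,
    };
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

int unplug_range(int container_fd, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = iova,
        .size  = size,
    };
    return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}
```

Whether this per-chunk model is granular enough for page-level reclaim is exactly the open question raised above.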
I'll at least cop to looking at the PKRAM patches, saying "why?" to myself, and then going away without saying anything on the list. The FUSE and raw device options seem less scary. They seem like, okay, I can review the small amount of semantics that's needed on the kernel side, and the rest of the use case and complexity can be pushed to userspace, so that sounds appealing at first glance. Personally, I was scared away by the complexity of PKRAM.

Yeah, I also took a skim through those patches. Does anyone happen to know what the story with PKRAM is? It seems pretty quiet.

I don't have any details, but I assume that when they sent it the last time it wasn't exactly greeted with joy. There was basically not much feedback, because the issue with anything like that, and also with what you're discussing, is that it's very specific to big cloud vendors doing very advanced stuff, so to say. And of course, the more complicated it gets, the harder it is for us to digest, if you get what I mean. PKRAM was, in my humble opinion, something like that.

Yeah, those are the things we're trying to simplify, right? The concept of, say, FUSE for DAX, that's something simple, and you can imagine we'd find plenty of other sensible use cases for it that are not just applicable to big cloud vendors, right? And we have... Yeah, and that was one of my...

On DAX: there's somebody who has done DAX support for FUSE. I think it was Vivek, right? Maybe not exactly what you want, but at least I know somebody has been looking at DAX support for FUSE.

That'll be a good start, thanks. That's great. David, did you catch that name? I didn't.

I'm probably going to get the wrong person, but I think it's Vivek Goyal from Red Hat.

Okay, thank you. Yeah, just on that topic, the use-case-specific thing: that was one of the points I was keen to solicit feedback on from the room. Are there other uses of this sort of infrastructure that we should keep in mind? Because we would obviously prefer not to build something that's very use-case specific.

Usually whenever you talk about machines like this, the other big use case is databases, because they have a lot in common. So I'd wonder, for example, whether you could use something like this for very big in-memory databases. I've heard of setups where it takes something like 12 hours, or even longer, to boot a machine and load all the data. If you can do a live update on such a machine, with whatever, 12 terabytes or even more, you might see quite significant benefits. But I guess that would be a question for the database folks.

Yeah, we've carefully tried to pick out just the live update part here; I talked about it a bit earlier and tried to keep things separate. But yeah, absolutely, and it's not just checkpoint/restore across live update; there's a whole bunch of other things we can do to speed that up. Some of the big database platforms also just take minutes, and the disks are one of the other things we can clean up, for example. So yes, we have definitely started looking at that internally as well, from the point of view of what we would want to build into the guest kernel for people to use on that kind of platform.

It was Vivek, and he's working on something called virtiofs, and then DAX support for virtiofs, which is based on FUSE.

Virtiofs, okay.
So what is that? Okay, so file system semantics, but exposed to the guest, right? The guest would be able to access a file system that is backed by a virtio channel, something like that. Okay.

What they're attacking is that virtiofs is there to speed up file access in your virtual machine, and it tries to avoid the page cache. What they're thinking about, I believe, is sharing the page cache between the hypervisor and the guest in a DAX window, meaning you have a PCI BAR that exposes it to the guest operating system, and then access is faster, something along those lines. I think there's some kind of DAX support for that, but I don't have too much knowledge about that part.

That sounds quite different, right? I mean, we're not trying to use...

I don't think it is. I don't think it's importantly different. The point is that virtiofs is implemented in userspace; the kernel provides FUSE with DAX, or DAX support for FUSE, basically. So yes, there's an implementation of a userspace thing, which presumably has a virtio connection over which it talks the file system protocol, and which then just tells the kernel: you want that page of this file? It's at this PFN. And that's the key, right? We would go and write some other userspace thing that comes up with other answers. But I think this is the kernel side of what we need, at least the basis of it. We need to look at the precise revoke semantics, that's the key part. But yeah, this is a good start, I think.

Okay, I believe I'm out of time. Any other last questions or observations? Thank you very much.

Thank you. Thanks, James.