So, this talk is going to be about memory passthrough for virtual machines. It came out of a project we've been working on, which is basically to move some workloads from a container environment into virtual machines. I'll begin with a very brief explanation of how extended page tables work, or SLAT, or stage-2 page tables; it looks like every vendor decided on their own name for this. When we translate a virtual address, we look it up in the TLB, and if we hit, we get a physical address. In a virtual machine, a TLB hit works exactly the same as when running natively. But if we have to go through the second-level page table, through two-dimensional paging as we call it in the Linux kernel, the picture is a little more complicated, because we can miss both in the virtual machine's page table and in the second-level page table. A miss in the second-level page table, in the SLAT, usually happens soon after boot when we first touch memory, if the virtual machine's memory wasn't backed by preallocated physical pages; but it can also happen because of ballooning or for other reasons. What happens is that the translation cost is not just the sum of walking the two page tables. It is actually more, because the guest page table itself sits in the guest physical address space, so each of its levels also has to be translated to the host physical address space. The actual number of loads needed to walk the two page tables equals n times m plus n plus m, where n and m are the depths of the two page tables. If both are four-level page tables, n and m are both four. But we now also have five-level page tables, so they could be four and five, or both five, and so on. So it's somewhat more expensive.

This slide summarizes the various combinations of page sizes and how many loads are needed to perform the translation when we walk the page tables. As you can see, with 4K in the guest and 4K on the host it's 24 loads with four-level page tables. 4K on both sides is probably not what is typically used; it's more likely that the guest memory is backed at least by transparent huge pages, so the host has 2 megabyte pages and the guest has 4K. In that case it's 19 loads. And with gigabyte hugetlb pages in the guest, we're down to 11 loads with a 2 megabyte host backing, or actually 14 loads if the host backs it with 4K.

When we started working on this project, we started thinking about how we could optimize this, and we came up with several ideas to look into. The first is that memory management is basically duplicated in a virtualized environment. We allocate pages in the guest: we search the free list, take pages off the free list, fault them in, zero them. But then we might repeat the same thing on the host if the physical pages weren't already available, so we might zero the memory twice. We also have a double footprint, because memory management has overhead from struct pages and other metadata, so we have the host overhead and then the guest overhead on top of it. Then there is the access cost I just talked about, the extra translation going through the SLAT page tables, which can add up to 29 loads once five-level page tables are involved.
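To make that arithmetic concrete, here is a minimal sketch, not from the talk, of the n·m + n + m formula applied to the combinations above, assuming 4K, 2M, and 1G mappings correspond to 4-, 3-, and 2-level walks on x86-64 with four-level tables.

```c
#include <stdio.h>

/* Loads needed for a two-dimensional (nested) page walk: each of the n guest
 * levels is itself a guest-physical access that costs an m-level host walk,
 * plus the final host walk for the data address:
 * (n + 1) * (m + 1) - 1 = n*m + n + m. */
static int walk_loads(int guest_levels, int host_levels)
{
    return guest_levels * host_levels + guest_levels + host_levels;
}

int main(void)
{
    /* Levels walked per page size with four-level tables: 4K -> 4, 2M -> 3, 1G -> 2. */
    printf("guest 4K / host 4K : %d loads\n", walk_loads(4, 4)); /* 24 */
    printf("guest 4K / host 2M : %d loads\n", walk_loads(4, 3)); /* 19 */
    printf("guest 1G / host 2M : %d loads\n", walk_loads(2, 3)); /* 11 */
    printf("guest 1G / host 4K : %d loads\n", walk_loads(2, 4)); /* 14 */
    return 0;
}
```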
And then there is also an opportunity for optimization compared to containers in boot performance: a guest takes time to boot, while a container is usually just readily available. So we figured we could try to do something like a memory passthrough, which basically assumes that the guest memory is always available: the guest never misses a page, and we provide the guest with one gigabyte pages. We treat the physical address space of the guest as a virtual address space, and all the translation work, the misses and everything, happens on the host. By providing the guest with gigabyte pages on the guest side, we solve the problem of the SLAT walk, so translations are quicker. But we also do not waste memory. Usually what is associated with one gigabyte pages is fragmentation: you allocate the full gigabyte page and you need to use the whole page. But if you do that in the guest and only a subset of that page is actually allocated on the host, then we do not have this problem of wasted memory. So as you can see, the translation becomes somewhat simpler.

We've also done some performance measurements of the translation cost, both for a hit within the page table and for a page fault, which is when we have to allocate a new page and insert it into the page table. The first three columns are bare metal: 4 kilobyte, 2 megabyte, and 1 gigabyte pages. A hit in the page table for 4 kilobyte pages, that is, a TLB miss resolved from the page table, was 465 ns in our measurements. For one gigabyte pages it was 225 ns, because there are fewer page table levels to load. With SLAT, which is the next three columns, backed by 4K pages on the host but with one gigabyte pages in the guest, we get about the same performance as bare-metal 4K: 465 ns there versus 493 ns for the one gigabyte case. Page fault performance is also about the same, because we do not zero that gigabyte page in the guest. We expect the host to zero the memory when it faults it: once we start using that gigabyte page and touch regions within it, the pages are faulted on the host and zeroed on demand.

To achieve that, we wrote a new driver called memctl. memctl is basically a driver that enlightens the host about how the guest wants its memory to be backed. memctl sends commands from the guest to the host: there is an agent in the VMM and a driver in the guest. The driver in the guest accepts a guest physical range and a command, sends it to the host, and the host synchronously executes that command. The commands we pass include map and unmap, protect, and a prctl-like command which allows us to set a name for a given part of the guest physical range, so that there is better observability from the host side of how a particular region of guest memory is actually used. For example, if the guest allocated a particular region of its memory for tcmalloc or something, that's visible from the host side. This is synchronous, meaning that when a vCPU performs this memctl call, it VM-exits, waits until the command is finished, and then returns to the caller.
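Since the talk doesn't show the actual interface, here is a purely hypothetical sketch of what such a guest-to-host request could look like; the real driver's structures, command names, and transport may differ.

```c
/* Hypothetical sketch of the guest-to-host memctl protocol described above;
 * all names here are illustrative, not the real driver's. */
#include <stdint.h>

enum memctl_cmd {
    MEMCTL_ADVISE,    /* e.g. "back this range with 4K / 2M pages" */
    MEMCTL_DONTNEED,  /* drop host backing; refault zeroed pages on next touch */
    MEMCTL_PROTECT,   /* change host-side protection of the range */
    MEMCTL_SET_NAME,  /* label the range for host-side observability */
};

struct memctl_request {
    uint64_t gpa;   /* start of the guest physical range */
    uint64_t size;  /* length of the range in bytes */
    uint32_t cmd;   /* one of enum memctl_cmd */
    uint32_t arg;   /* command-specific argument, e.g. desired backing page size */
};

/*
 * The call is synchronous: the guest driver fills in a request, the vCPU exits
 * to the VMM agent, the agent applies the command to its own mapping of guest
 * memory (relying on KVM_CAP_SYNC_MMU to keep the SLAT in sync with that
 * mapping), and only then does the vCPU resume and the driver return.
 */
```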
So we rely on the KVM_CAP_SYNC_MMU property, which basically synchronizes, on the host, the mapping between what the VMM allocates and how that memory is mapped in the SLAT table. This diagram shows what I've described: we have a user process which can perform the memctl commands. For now we use hugetlb to get the one gigabyte pages in the guest. We've modified hugetlb to not zero those pages, but instead rely on a memctl command to do an MADV_DONTNEED on the host prior to providing the gigabyte pages to the user application. That way those gigabyte pages are very quickly available; there is no time spent zeroing the full gigabyte. Then we use whatever commands are needed, the same way an allocator would use madvise, mprotect, and so on. So this is the end of my slides. I wanted to go very quickly through the slides because I actually have several topics for discussion and several questions, and we have 15 minutes, so that's good. I'd like to find other use cases for memctl, I'd like to figure out what the upstreaming path for memctl would be, and to answer any other questions there might be. Yes, is there another mic?

So my question is, when this is enabled, is the second-level page table still enabled in the guest in general?

Yes, the second-level page table is enabled, but we use gigabyte pages in the guest and 2 megabyte or 4K pages on the host, and that's why the translation itself is fast, since there is less walking. If you go back to this slide and look at what it is for 1 gigabyte over 2 megabyte, that's 11 loads to do the translation. But normally, to achieve those 11 loads, we would use gigabyte pages on the host, and that might waste memory, considering that the workload we are trying to move into the virtualized environment has previously been running in containers and needs to be dynamic; it shouldn't be wasting memory on the host.

Okay, thanks. I was wondering because, before two-dimensional page tables existed, KVM had shadow paging, and basically what it does is try to squash the guest page table and the host page table together into a shadow page table, which can also avoid the two-level walk. Basically, it reduces the n times m plus n plus m walk down to roughly n as well.

Yeah. My understanding is that the problem was keeping them synchronized: it was just expensive to update those shadow page tables every time the guest updates its page tables, and that's why these hardware optimizations were added.

What I recall is that there were some paravirtualized approaches, and this is all paravirtualized in a way, where the guest would actually tell the hypervisor what it updates, so you don't have to monitor guest memory for changes in the page table layout; you would just send it, paravirtualized, to the hypervisor.

So yes, this is partly paravirtualization, because we enlighten the host about the guest memory, but it's the guest itself that tells the host how it wants its memory to be backed. We use two things here. First, we enable the guest to tell the host, look, this memory is something that I don't access very frequently, so please back it by 4K pages instead of 2 megabyte pages. That's just one example: it could also be, drop this range of memory, or do something else. Basically we allow the guest to manipulate the way its physical memory is backed on the host.
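Going back to the user-process flow at the top of this answer, the guest application mapping gigabyte pages from the hugetlb pool, a minimal guest user-space sketch of that mapping might look like the following. It assumes 1G pages have been reserved in the guest (e.g. hugepagesz=1G hugepages=N), and the non-zeroing behaviour comes from the modified hugetlb path described in the talk, not from anything in this snippet.

```c
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000
#endif
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)

int main(void)
{
    size_t len = 1UL << 30;
    /* Map one 1 GiB page from the guest's hugetlb pool. In the setup from the
     * talk, the guest hands this out without zeroing it, issuing a memctl
     * DONTNEED-style command instead, so the host backs it lazily with zeroed
     * 4K/2M pages as the application touches it. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(1G huge page)");
        return 1;
    }
    p[0] = 1;   /* first touch: the host faults in and zeroes a page on demand */
    munmap(p, len);
    return 0;
}
```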
And the second thing that we are doing is treating guest physical memory as virtual memory: we do not assume that all of the physical memory is actually backed by physical pages. That enables us to use those gigabyte pages in the guest without wasting memory, and using gigabyte pages in the guest also gives us the quicker translations.

So what is the guest perception of this odd thing? Like, do you have a one gig folio that you stick in the VMA to get a one gig page, or what?

So today we use hugetlbfs to get the gigabyte pages in the guest. I want us to start using something called single-owner memory, which I'll talk about tomorrow, but that's a different topic. We used DAX initially, then we switched to hugetlbfs for various reasons. Emulated PMEM and DAX worked fine, but there were some particular problems associated with the VMM that we've been using for this project, so we decided to use hugetlbfs, which had fewer problems. But basically we don't really care where the gigabyte pages are coming from.

But I mean, what's the whole experience in the guest? Like, what if I do gup on this one gig page and I target something that hasn't actually been mapped in the hypervisor, what happens?

The same thing as what would happen if you do gup on a hugetlb page today. You get the page back.

You get the page. But there's no memory there, you said.

So that page gets faulted, as a 4K page or 2 megabyte page. We allocate 1 gigabyte, we tell the host to back that gigabyte; we do an madvise saying this gigabyte should be backed by 4K or 2 megabyte pages, and then when we touch that region, when we start reading or writing to it, that's when the host actually faults that memory and we get a 4K or 2 megabyte page within that 1 gigabyte.

So is the actual optimization that you're doing then essentially that you don't zero out the hugetlb page in the guest? Because if you were zeroing it out, you would be touching everything.

That's correct. We do not zero it. We do an MADV_DONTNEED before providing the 1 gigabyte pages to the clients.

Okay. So I wonder if you need the fully blown feature set here. That could be achieved using free page reporting: if you're using free page reporting, whatever you reported, say using a balloon, will get lazily allocated on demand the next time you touch it. So you can just use a gigabyte page, and if you don't zero it out, it will still be freed in the hypervisor. I mean, I get that you have these other optimizations, telling whether you want 4K, 2 megabyte, or 1 gig, but I think this is a separate set of optimizations, if I'm not wrong. The core optimization that I consider valuable is that you don't do the double zeroing, and that when you have a gigabyte page in your guest, you're not actually allocating all of that memory in your hypervisor immediately.

That's right. We do not allocate all of that memory immediately. That's the core optimization. And also, when we return that gigabyte page in the guest, we do an MADV_DONTNEED on the host side so that the memory is returned immediately back to the host.

Okay, like free page reporting, where the guest reports that memory is now unused and the hypervisor does MADV_DONTNEED or whatever.
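A tiny stand-alone demo of the host-side primitive that this double-zeroing argument relies on; this is just ordinary madvise behaviour on anonymous memory, not the memctl code itself.

```c
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

/* After madvise(MADV_DONTNEED) on a private anonymous mapping, the backing
 * pages are dropped and the next touch refaults zero-filled pages on demand,
 * so zeroing happens once, lazily, on the host rather than eagerly in both
 * the guest and the host. */
int main(void)
{
    size_t len = 2UL << 20;            /* stand-in for part of guest RAM */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 0xab, len);              /* the guest "used" this memory */
    madvise(p, len, MADV_DONTNEED);    /* guest freed it: drop the backing */
    assert(p[0] == 0);                 /* next touch refaults a zeroed page */
    munmap(p, len);
    return 0;
}
```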
So yeah, I think maybe it could somehow be done using existing mechanisms, like somehow mangling that into virtio-balloon. But I only liked the optimization for the memory part, not everything else that you covered.

So yeah, memory ballooning I guess could provide some of this. I'm not sure that memory ballooning allows sending the other commands.

No, no, that doesn't work. No.

Yeah, like for example advising the size of the page on the host, whether we want 4K or 2 megabyte pages; we don't have this set-VMA ability, to set the name for a given region for example, and other things. So memory ballooning does not give us everything.

Like I said, I think the most valuable one is the lazy allocation that you want to have. For the other ones, I agree that you would need some other interface, and how controversial that would be, I cannot tell.

But is it really, like you said, hugetlbfs? It's not really hugetlbfs, it's got this extra stuff on top of it where it's doing hypercalls to get rid of, you know, deallocated memory.

So there is a page clear call in the kernel, and then there is a page clear for the one gigabyte pages, gigantic, I don't remember what it's called, something like gigantic page clear. Inside that call, we added the call to memctl to do the MADV_DONTNEED. So instead of doing the memset over the whole range, and therefore faulting all the pages on the host and zeroing all the pages twice, once on the host and once in the guest, we simply tell the host: remove everything for this physical range in the guest, and then it's lazily allocated when the guest starts using it.

Yeah, I'm just thinking about the talk that we had yesterday, about something kind of similar, not the same purpose, but the same kind of abuse of hugetlbfs to add new functionality.

Mm-hmm.

Well, that's all I'm going to say.

So Cloud Hypervisor said that they are okay to take the memctl agent part once it's upstream in Linux. So my problem is who should take this, how to upstream it, what kind of security implications we might have, and where the decision for this kind of agent should take place, in what setting. Should it be at the VMM, should it be discussed on LKML? What would be the right path to upstream this kind of feature? Is there a microphone? What was that?

So, the changes to hugetlbfs: I do not propose any changes to hugetlbfs at the moment, because this is just a temporary solution, it's basically just a pool of gigabyte pages that we use. What I'd like to try to upstream is the memctl itself: basically an ability for the guest to tell the host how it would like parts of its memory to be managed. So the guest can hint to the host that this part of my physical memory is very hot, so please back it by huge pages, and this part is not very hot, it's okay to back it by 4K pages and save some memory, and so on. That is just a separate driver, just files in the drivers directory. There are no changes to the MM layer with this.
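The gigantic-page clearing hook mentioned in this answer could look roughly like the following; the names are made up for illustration, the real change lives in the guest kernel's hugetlb clearing path, and memctl_dontneed() is a hypothetical stand-in for the real driver call.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the real driver call; always reports failure here
 * so the sketch compiles and falls back to plain zeroing. */
static int memctl_dontneed(uint64_t gpa, size_t size)
{
    (void)gpa;
    (void)size;
    return -1;
}

/* Instead of memset-zeroing a gigabyte page (and thereby faulting and zeroing
 * every backing page on the host), ask the host to drop the backing for the
 * guest physical range so it is refaulted zeroed on demand. */
static void clear_guest_gigantic_page(void *vaddr, uint64_t gpa, size_t size)
{
    if (memctl_dontneed(gpa, size) == 0)
        return;              /* host dropped the backing; no guest-side zeroing */
    memset(vaddr, 0, size);  /* fallback: zero in the guest as before */
}

int main(void)
{
    static char page[4096];  /* tiny placeholder, not an actual 1 GiB page */
    clear_guest_gigantic_page(page, 0, sizeof(page));
    return 0;
}
```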
And then there is also a counterpart on the hypervisor side, and that one depends on the hypervisor. We wrote it for crosvm, we have a prototype for Cloud Hypervisor, and I could also add it for QEMU, but Cloud Hypervisor basically said that they can't take that agent before it's upstream in Linux. Since I've never upstreamed anything that has two different parts, one in the VMM and one in the Linux kernel, I was wondering what the right path would be to upstream this kind of feature. And another thing I'm worried about is that after upstreaming it, the interface between the VMM and the kernel becomes somewhat stabilized, so I'd like to make sure that we do not add any kind of security problems.

Well, you can always copy everyone. It looks like you invented hypercalls, so the most logical way is to post it to the KVM folks and have the user-space counterpart ready for review, but you don't want to merge the user-space part before you have the KVM bits there.

That's the thing, yes. And another thing is that currently I depend on an implementation detail: the actual communication between the VMM and the guest goes through a so-called hyperchannel, which hasn't been upstreamed either; it's also a Google-internal thing. So I'll need to figure out whether we can upstream that, or whether I should go back and rewrite everything to use virtio, and whether virtio is as efficient for synchronous calls, because my understanding is that it's not, and that's the main reason we don't use it.

So yeah, I think virtio might be the way to go, arguing that virtio-balloon is the dumping ground for all such things.

My concern with virtio is that it's queue-based, and for synchronous calls it's not really efficient.

I think there are ways to make synchronous calls, if I'm correct; it's usually queue-based, but I think there are some paths around that. But what I would suggest you look into first, and I think that has already been discussed at one point, is that whenever we free a hugetlb page in the guest, you would simply trigger free page reporting to clear that memory in the hypervisor, and then you mark the hugetlb page as pre-zeroed. You would essentially get the benefit on each and every system that uses hugetlb, without any additional memctl or whatsoever. True, maybe that's not what you want to do, but you can get it running on existing systems.

That's a potential optimization.

Okay, I'm done.

Thank you everyone. One more question? Yes.

What would be the use case in the OSS domain? Do we need to change any user-space memory allocator, like jemalloc, for this new guest setup, or do we need to make modifications to those popular open source databases so that they can take advantage of this feature?

So basically, whoever wants to take advantage of this would need to send the commands to memctl, and that could be any kind of user-space allocator.

So are you proposing everybody use the syscalls directly, or do you think it would be better to have a user-space library to share for everybody?

Okay, so with single-owner memory, which could be optimized for this environment, we could do the memctl calls right inside the Linux kernel, and then we wouldn't need to make changes to the user applications, except telling them to map the single-owner memory device and get those optimized gigabyte pages right away.
But I think more advanced applications that want to say, this particular memory is very hot, please back it by transparent huge pages, or give some other hints, would still need to do the memctl calls themselves.

Sounds good. Thank you.