Hi everyone. I've never been to LSFMM before, so not many of you know me. I'm James Houghton and I work on live migration at Google. I'm going to talk to you about HugeTLB high-granularity mapping (HGM): why we want it, and the various things that come with doing it the way that has been posted. And I think Mike Kravetz is on the call. "I'm here." Hey Mike. Cool. All right, I'll get started.

So I'm going to talk about the problem we're trying to solve with HGM, the current approach we took, various challenges with that approach, and then the future. I don't want to just talk at you, so if I say anything wrong, or if there's anything you want to question me about, please interrupt me.

We at Google want to use the largest page size we can to back our VMs, to maximize VM performance, and we want to guarantee that we get those pages. The obvious conclusion is to use HugeTLB; we certainly can't use THPs for this. We also want to live migrate those VMs. To do post-copy (we're talking about conventional live migration here; I know different kinds have been discussed), so after you've moved execution to the target but not all the memory is present there yet, you have to be able to catch vCPU accesses to memory that you don't yet know is safe to access, and we want to use userfaultfd for that. But with HugeTLB, either the entire one-gigabyte page is mapped or none of it is, so you can only get page faults for the entire 1 GiB page or none of it. To do post-copy, you would then have to fetch one gigabyte at a time, since the whole gigabyte is either up to date or not, and we want to do as small a fetch as possible, ideally 4 KiB. And at the end, we want to be able to collapse all the high-granularity mappings we've made. I'll talk about that.

The current approach for HGM is just to implement it as an extension of HugeTLB. There's no THP-like splitting, so a lot of the walking code ends up looking a lot like PTE-mapped THPs. An example of this kind of difference is memory poisoning: we don't split. THPs are always split before doing any unmapping, but for HugeTLB we can't do that, or at least we don't do that in the current approach. HugeTLB is already implemented as a bunch of special cases in the MM logic, and all of those are still here; we just do more things in those special cases.

The most up-to-date implementation that's been sent out is the v2 on the mailing list, sent in February. It implements support for MAP_SHARED only, for x86 only, and it implements the first application, the live migration one: it allows UFFDIO_CONTINUE to work at 4 KiB instead of 1 GiB. Memory-poison extensions were posted last week, also supporting only MAP_SHARED, but they could support MAP_PRIVATE; we want the same memory-poison semantics for MAP_SHARED and MAP_PRIVATE. Doing UFFDIO_CONTINUE with MAP_PRIVATE is a little more complicated, still possible, but we don't care about it; Google doesn't care about it, and I don't think many people do. The arm64 implementation has been written, but it hasn't been posted, because it's even more complicated than the x86 code.
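To make the post-copy flow described above concrete, here is a minimal userspace sketch. It assumes the HGM patches under discussion, which are not upstream: madvise(MADV_SPLIT) to opt in, and UFFDIO_CONTINUE accepted at 4 KiB granularity on a 1 GiB hugetlb mapping. The fetch_from_source() helper is hypothetical; everything else is the existing userfaultfd minor-fault API.

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define PAGE_4K 4096UL

    /* Hypothetical helper: fetch one 4 KiB page from the migration source. */
    extern void fetch_from_source(void *dst, unsigned long offset, size_t len);

    static int uffd;
    static void *guest;   /* mapping the vCPUs run against (uffd-registered) */
    static void *alias;   /* second mapping of the same file, used to fill pages */

    static void postcopy_setup(int hugetlb_fd, size_t size)
    {
        guest = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, hugetlb_fd, 0);
        alias = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, hugetlb_fd, 0);

        /* MADV_SPLIT comes from the HGM patches (not upstream): it tells the
         * kernel that userspace may operate on this range at 4 KiB granularity. */
        madvise(guest, size, MADV_SPLIT);

        uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = {
            .api = UFFD_API,
            .features = UFFD_FEATURE_MINOR_HUGETLBFS,
        };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
            .range = { .start = (unsigned long)guest, .len = size },
            .mode  = UFFDIO_REGISTER_MODE_MINOR,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);
    }

    /* Resolve one minor fault by fetching and mapping a single 4 KiB page,
     * leaving the rest of the 1 GiB huge page absent in the guest mapping. */
    static void serve_minor_fault(unsigned long fault_addr)
    {
        unsigned long off = (fault_addr - (unsigned long)guest) & ~(PAGE_4K - 1);

        fetch_from_source((char *)alias + off, off, PAGE_4K);

        struct uffdio_continue cont = {
            .range = { .start = (unsigned long)guest + off, .len = PAGE_4K },
        };
        ioctl(uffd, UFFDIO_CONTINUE, &cont);
    }

As described later in the discussion, without the madvise() call the series only accepts UFFDIO_CONTINUE for the whole huge page; the advice is what allows userspace to drive 4 KiB operations.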
So I have a whole long list of challenges; this isn't the full list. The first one is the userspace API. It's not exactly a kernel thing, but we have to settle on something that will work. What we have now, which isn't necessarily the final thing, is that to enable it we have madvise(MADV_SPLIT), which only tells the kernel that userspace understands that 4 KiB mappings could be produced from this range. It doesn't force the kernel to use 4 KiB mappings all the time. That sounds a little different from what MADV_SPLIT hints at, but that's the idea right now. Maybe it should be MADV_ENABLE_HGM, but that sounds a little too... not generalizable. MADV_SPLIT could mean something for THPs in the future. Yes, David.

Just a question: when we talk about this hardware poison handling, does that mean that only a process that actually set up MADV_SPLIT would get a 4 KiB mapping? That's a good question. No; we certainly wouldn't want that to be the case. We want to be able to go to 4 KiB at any time. So there is a distinction between advising and the kernel automatically enabling HGM. If you advise, then things like UFFDIO_CONTINUE can work, but the kernel is free to enable HGM itself for memory poisoning. All the kernel bits that handle high-granularity mappings operate if HGM has been enabled for any reason, not just advised; advising just controls the API, whether or not userspace is allowed to do certain things. So would MADV_SPLIT immediately split it, or would it just enable it? No, it just allows userspace to do a high-granularity UFFDIO_CONTINUE. That's highly confusing. Yeah, it is confusing, and I don't think this is going to be the final API. Another idea, which was the original API, was to make a userfaultfd feature just for this, but then it doesn't really extend to something like MADV_DONTNEED: if you wanted to allow that at 4 KiB, you would need something more generalizable than a userfaultfd feature. I personally still prefer the userfaultfd feature, but there was a discussion about it.

No, Peter has a question, but yeah, go ahead. Go ahead, Jason. For userfaultfd, maybe you could, for something like MADV_DONTNEED, if you wanted to support that... I mean, Mike Kravetz objected, and I don't care too much, but a userspace program might assume it's using HugeTLB, and if it accidentally does MADV_DONTNEED on 4 KiB, then disallowing that has come up in the past. I don't know; if it's something like that, maybe we should just do it. Well, I don't know about that. I don't particularly care, but I think Mike probably cares, or Andrew probably cares. Yeah, I think the only objection on MADV_DONTNEED is that we're not limiting it to huge page sizes today, so it's kind of a crapshoot what you get out there. What I want to say is just that if we worry about breaking the ABI (for example, we used to not be able to install a small page, and right now we can, so that's an ABI change), I don't think it really matters. But if we do worry about that, and we want to avoid making madvise confusing, we can have a global flag in /proc or whatever, or a prctl or something, so any application can try to enable it. We can make it something else so we can avoid madvise; I don't know. Sure. Again, I don't particularly care. Anything from Mike or Andrew? You have a file descriptor, right?
Because it's hugetlbfs, so you opened... You don't necessarily have to have one, because HugeTLB supports anonymous mappings. Yeah, but for your application, do you have a file descriptor? Yes. So maybe an ioctl on the file descriptor. Sometimes we have a huge file, and maybe in this process's mapping we want to map it small, and in the other one we want to map it huge, right? I think that's the case. I mean, for hardware poison it's not. Also, this isn't really a property of the file; it's more a property of the mappings. So as you can see, this is a pretty big challenge, which is why I put it first, but for the sake of time I think we should move on. That was a taste of why the userspace API is a big challenge.

The next ones are all implementation related. Right now with HugeTLB, we just pass around the PTE pointer, for, say, a one-gigabyte PUD, and the size information is passed around as the hstate, so it's always going to be the same size; and you can always determine the page table lock basically right before you are going to write or read the PTE. With HGM, the size of the PTE is variable, and the page table lock you're supposed to use can be difficult to determine. For example, if you have a 4 KiB PTE, you have to know the value of the PMD above you, and depending on where you are in the code, that might be difficult to figure out. The way generic MM does this is just by passing around the PMDs everywhere, but with HGM that's not how it was done: you pass around this generic struct hugetlb_pte, which could be pointing to a 4 KiB PTE, but it doesn't know what the PMD above it is. It was a design choice, but that's just how it was done: we figure out the PTE and store it in this thing. Have you considered changing it to align with the generic walker approach? We can talk about that; that would be great if somebody wanted to take that particular challenge. We're definitely not going to get through all these slides, but there are only four, I think. Let's talk about the potential unification in a bit.
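Roughly, the descriptor the series passes around looks like the sketch below. The field names here are reconstructed from the talk rather than quoted from the posted patches, so treat them as assumptions.

    /* A sketch of the descriptor HGM passes around instead of a bare
     * pte_t * plus an hstate. Field names are assumptions, not a quote
     * of the v2 series. */
    enum hugetlb_level {
        HUGETLB_LEVEL_PTE,
        HUGETLB_LEVEL_PMD,
        HUGETLB_LEVEL_PUD,
    };

    struct hugetlb_pte {
        pte_t *ptep;              /* may be a PUD-, PMD-, or PTE-level entry */
        unsigned int shift;       /* size of the region this entry maps */
        enum hugetlb_level level; /* the level ptep actually lives at */
        spinlock_t *ptl;          /* the lock protecting *ptep */
    };

    static inline unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
    {
        return 1UL << hpte->shift;
    }

    /* The point made above: for a 4 KiB entry, the correct lock is the split
     * PTL of the page table page it sits in, which a bare pte_t * cannot tell
     * you, so the walk has to capture the lock when it finds the entry. */
    static inline spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
    {
        return hpte->ptl;
    }

The alternative raised in the question above would be to pass PMD pointers around the way generic MM does, instead of capturing the lock in the descriptor.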
The next one is walking the page tables. Because this is implemented separately, the architectures have to implement their own HugeTLB stepping-down. HugeTLB provides hugetlb_alloc_pmd and hugetlb_alloc_pte, functions for the architectures to implement their own walking, and it looks a lot like their implementations of huge_pte_offset and huge_pte_alloc; it's just another version of those. Potentially, huge_pte_alloc and huge_pte_offset could be implemented by this one function. That's not done in the current series, but it could be.

The next one is contiguous PTE support. There are two things I want to call out there, and I think this applies to generic MM too. The first is the page table lock: say you have a PTE that could be part of a contiguous-PTE block; the page table lock you use for all of those PTEs must be the same. Basically, you have to be careful not to overwrite anything. Say you want to install a contiguous block of PTEs: you want to make sure not to overwrite some PTE that has been made present since you checked whatever it is you were checking. If it's for userfaultfd, you want to check that it's still blank, still pte_none. Right now, naively, you might only be checking the first PTE of the block, assuming they're always updated together. That's why the page table lock has to be the same, whether you're dealing with one PTE inside the block or the whole block at once (see the sketch below).

Also, the HugeTLB architecture API doesn't pass the size of the PTE you're dealing with; it only passes the hstate, which doesn't necessarily correspond to the size of the PTE. So we have to change that too; we have to pass the size information, and by that I mean we just have to pass the hugetlb_pte struct. And the hugetlb_pte struct could be more generic; it doesn't necessarily have to have HugeTLB in the name. It could just be a generic sized-PTE thing. I don't know.
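As an illustration of the locking point above (this is not code from the series, just a hedged sketch of the invariant, reusing the struct hugetlb_pte sketched earlier), installing a contiguous block only works if every entry in the block is covered by the same lock, so that the emptiness check and the write happen atomically with respect to, say, a racing UFFDIO_CONTINUE on one 4 KiB page inside the block:

    /* Kernel-side, illustrative only. Install `nr` contiguous entries, but
     * only if none of them has been populated since the caller last looked. */
    static int hgm_install_contig_block(struct mm_struct *mm, unsigned long addr,
                                        struct hugetlb_pte *hpte,
                                        const pte_t *entries, unsigned int nr)
    {
        unsigned int i;

        /* One lock must cover the whole block; otherwise a 4 KiB entry could
         * be installed under a different lock between the check and the write. */
        spin_lock(hpte->ptl);

        for (i = 0; i < nr; i++) {
            if (!pte_none(hpte->ptep[i])) {
                spin_unlock(hpte->ptl);
                return -EEXIST;   /* part of the block is already mapped */
            }
        }

        for (i = 0; i < nr; i++)
            set_pte_at(mm, addr + i * PAGE_SIZE, hpte->ptep + i, entries[i]);

        spin_unlock(hpte->ptl);
        return 0;
    }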
So one of the most difficult things about actually implementing HGM is that we have to update everywhere that assumes HGM doesn't exist.

Yes. Yeah. So I'm trying to build up the whole picture you're presenting, and the deeper you dig into it, the more I think you just don't want to use HugeTLB as your backing storage at all. You can use the same allocator, because for the gigabyte pages you are relying on CMA or boot-time preallocated pages anyway. Essentially, just find a different way to access that pool, and don't take on all the baggage that hugetlbfs has, because you're not going to use a large part of it anyway. It's backing storage that is fully preallocated when you start your guest, so you don't use reservations; you don't use a large part of the complexity of hugetlbfs. You want that memory to be not swappable, presumably, which you can achieve with mlock. So essentially, I think it would be much easier and less painful to simply find a different way to access that memory.

Two points. One is that we do use reservations, in that we still have to guarantee that we get these huge pages. I don't know if reservations are strictly required for that, but that's an important part of HugeTLB and how it gives userspace the guarantee that it can provide these resources. The other thing is that to do what you're describing, we would have to merge hugetlb_fault, hugetlb_no_page, the HugeTLB CoW bits; we would have to merge all of that into the generic MM code. Sorry, why? You just have that reliable source of gigantic pages, right? You could create an ad hoc driver that accesses a CMA pool, or a kernel parameter for gigabyte pages but named differently, so that regular hugetlbfs doesn't access it and can't steal it from you. It could be done that way. But whatever way you do it has to support userfaultfd. Maybe it could be done like that. For example, if you had something like /dev/mem and you just left a large amount of unmanaged memory and passed it through that, but userfaultfd doesn't really support /dev/mem. So I don't know; that could work. Michal, are you just talking about creating, in essence, a simpler version of hugetlbfs just for this purpose? That's what I want to get at, whether it's a slimmed-down hugetlbfs you're proposing. I actually have a separate question, but let's finish this discussion first.

I mean, it sounds like you're saying: implement hugetlbfs using generic MM mechanisms instead of this stuff. Yeah. Yeah, I like that too. Actually, what I'm saying is to just have a regular mmap() with a different page allocation backend behind it. I don't know if that necessarily works, because generic MM doesn't support anything higher than PMDs. Yeah, which could be beneficial for other reasons, but it would be less special, less convoluted. And if I'm not mistaken, there was a proposal for one-gigabyte THP-style huge pages; that was from Zi Yan. Okay, but this is not that. One way you could do it, and I'll just skip to this, is that if hugetlbfs implements the vm_ops fault or huge_fault handler, you could put all the reservation logic in there and in that way guarantee huge pages.

Well, I think the way the MM is going is that we're getting folios now, and folios are going to be everywhere. If you imagine you have folios, maybe they're 16 MiB or maybe they're one gigabyte, I think the generic MM should be able to take a folio, install it in the VMA, and install it in the page tables optimally for all of the sizes across all of the architectures. That should just be good generic code, right? It shouldn't be weird hugetlbfs special stuff. I agree with that, right? So if you could get us closer to that with your project, that'd be fantastic. Yeah, I agree with that. If HGM were done that way, I think it wouldn't be at all controversial.

So my question is actually: how does HGM work with the vmemmap optimization? Right, so it depends on how you do mapcount. Basically, it depends on whether we need to use the subpage struct pages for anything, and the only reason, as HGM stands today, that we would need to use them is potentially for mapcount. If you use the THP-like way of doing mapcount, then if you have a bunch of 4 KiB PTEs mapping the page, you increment the mapcount for those subpages, but if you have a one-gigabyte PUD mapping, you just increment the compound mapcount, just to simplify things. I've heard a rumor that Matthew's working on this. Yes. Matthew, I think, is doing, or attempting to do, number three, and with that we would get the vmemmap optimization back. So my slides for tomorrow say "up next: the meaning of mapcount; see James, yesterday", so I was hoping you would cover all that and then I wouldn't have to talk about it in mine. Well, basically we have an option, because I think right now mapcount means two things: the number of page table entries that map your 4 KiB page, and the number of VMAs that map your 4 KiB page. Is it just the former? Okay. Yeah. And I'm talking about maybe changing it to the latter, kind of, sort of, if you squint at it. There are possibilities for what it could mean. But you have a somewhat special case, because your one-gigabyte pages are necessarily aligned to one gigabyte. Yes. Whereas for a general THP you have to consider: what if it's misaligned? What if it covers two page tables? What if it's one gigabyte and covers a little bit of a PTE table, then some number of PMDs, then some more PTEs at the end? What should mapcount be for that? And now what if you unmap a little bit in the middle? It's hard. Well, yeah. So, VMA splitting.
So if we're talking about it representing the number of VMAs that have this page mapped at all, then that's fairly clear; it goes up to two. But if we're talking about the number of page table pages that have a reference to this page, then that becomes hard. I mean, this all becomes hard, right? Yeah. So I don't know that we have the obviously correct answer yet, particularly when you consider the number of times you need to acquire page table locks to answer some of these questions. Yeah, and I don't know the correct solution for how mapcount should be handled. The way we were thinking, I guess Mike, Peter, and I, was to just use the THP-like way, deal with the fact that we lose the vmemmap optimization, and then optimize it later. If we went with number three, we could do something special with the fact that HugeTLB is always aligned, and maybe we could apply that to aligned THPs as well, so we only have to deal with a potentially slower case when THPs just aren't aligned to the page tables.

I mean, one of the reasons the THP-like mapcount exists is all of these cases where you can remap part of it: you could share only part of it with a child process, you could partially unmap it in a child process. Ideally, in your special case, you would really only count that you have one mapping of that page somewhere, and it wouldn't matter whether it's mapped by a single PTE or a single PMD, or which parts are unmapped; you know exactly how it's aligned, so you should somehow be able to just say it's mapped once, even if a single 4 KiB page is not mapped in that particular process. That would make your life easier. Once you go the THP way, it's a little bit like: we maybe made some mistakes in the past, let's make the same mistakes again. I mean, it's different; more challenges. This is one of the biggest challenges, I think.

Yeah, I think Peter has a question. Not really; it was about the suggestion that Michal mentioned, whether we can just reuse the huge pages in some other form. I think there can be some complexity in converting them, because we still need the huge page properties, like, as you said, reservations and so on. We have a VM running, or it will be running, on HugeTLB already, so we want the conversion from small mappings to huge mappings to be transparent. And one thing that quickly pops into my mind is that PageHuge is set, and it's definitely the same page, so the conversion won't be that straightforward; at least we'd need to clear all the page flags and everything. It seems to be just not feasible, or maybe... Yeah, I don't know; pretty hard. So what I understood was converting pages to and from being HugeTLB pages? Yeah, and it seems very hard to do. Currently your solution still keeps everything. I'm not sure you would need to do that, though. I think even if you merged the MM logic, ditching huge_pte_alloc and all that, you could still have the concept of a HugeTLB page. And it could be that the end game for HugeTLB is just that it means a naturally aligned huge page; that could be what HugeTLB means. That would be cool if it ended up like that. But I guess, to get back to why HGM was implemented like this: I claim that adding all this support to generic MM will take a lot of work.
And no, I agree, but it could be many, many years of work, whereas with HGM, because it's all contained within HugeTLB, there's an easier path forward, and we can use HugeTLB to demonstrate what generic MM has to do to get there... I don't know, maybe that's a stretch. But something like the introduction of the hugetlb_pte, that kind of thing in generic MM, I guess it's not necessarily the case that you'd need something like that. I don't know. Yeah, I'm just worried that you would end up doing the very same amount of work, just differently, so you would have to spend twice as much time, maybe triple. Yeah, that could be the case.

But if I understand you right, Michal, rather than doing the generic MM work, you're just saying: write your own character device that has one-gigabyte mappings, and these one-gigabyte mappings can sometimes be mapped at 4 KiB, and then add userfaultfd support for it. It's not really generic MM, right? Is this a custom device driver? All the GUP pieces, all the page walk pieces would be generic MM stuff, using the generic way of walking page tables instead of the HugeTLB way, because that stuff is a big pain. And we know these things are coming: CXL is probably going to need big pages, and DAX needed huge pages and kind of did some weird stuff, or we gave up on it, right? But I mean, that's the point: we plumbed GUP for PUD mappings for DAX, so we have that. Well, yeah, aside from that, it shows that you can force that kind of crap into the kernel. But in terms of having struct page support for a device-owned physical address range... that's what device DAX is. I mean, I wouldn't use device DAX; I would just make another thing, another character device. What he wants is a character device that has a folio. The folio happens to be one gigabyte, and when he takes a fault, he wants to return a one-gigabyte folio to the generic MM, and the generic MM optimally installs it in the page tables, right? That's what we all want, all over the place. So that'd be great if you can get there.

And that sounds like making the MM support PUDs, installing PUD mappings and things like that, and keeping... basically, HugeTLB has its own CoW logic, a bunch of its own logic, that would have to be merged in with generic MM. But we have PUD CoW... we have this. I don't think it's as big as you think it is. That might be true. I know that UFFDIO_CONTINUE, for example, only supports installing PTEs; that would have to change for generic MM. And yeah, maybe it's time to reevaluate and see if it is simpler to actually do the correct thing and extend generic MM. Yeah, HugeTLB has such large historical baggage that you would have to deal with, or you would just have to prohibit things like page table sharing. Well, I think we would need to have that as well; I know Mike has users who do care deeply about PMD sharing. Yeah, absolutely. But then you have to deal with the fact that some portion of them might not like it to be split. And so... Well, yeah, I think the people who care about PMD sharing don't care at all about HGM, but if we're going to merge HugeTLB with generic MM, generic MM has to support PMD sharing. Yeah, I mean, right now, when you do any userfaultfd operation, or should I say almost any userfaultfd operation, you disable PMD sharing.
Peter can correct me if I'm wrong, but he added that code, I'm pretty sure. Yeah; MISSING is fine, MINOR is the problem, anyway. But I think the point is that HugeTLB supports things like contiguous PTEs, while generic MM supports everything else, swap and all that. HugeTLB and generic MM just support different things, and so we would need something that supports both. You have a pretty narrow use case, though; you just need generic MM to support what you want, right? You don't need the stuff you talked about in hugetlbfs, so you don't need to mess with it. I think that's Michal's point. Okay, yeah, I understand. You can live without page splitting, without page reclaim, without copy-on-write, without many of the other things that really make THP very complex and hugetlbfs pretty complex in a different way. So it's just about plumbing, I believe. Okay, "just".

I mean, just some last words: would you like to conclude? Because we are... Yeah, I don't have anything to say at the end, but thank you all for spending time thinking about this and for helping me and Mike come up with something that could actually be merged in the future.