Hello, okay, so I'd like to talk about single-owner memory. This is a very specific use case for Google, and we want some optimizations around this use case, but I'd like to get opinions about the proposal for how we can address it, and also to see if there are others who have a similar problem. Basically, single-owner memory is a type of anonymous memory that is never shared. Examples would be a malloc() with no fork after that, or an mmap() followed by something like madvise(MADV_DONTFORK). At Google, over 90% of the memory in our fleet is never shared. The reason is that we don't use processes, we use threads, and when we do fork, it is followed by an exec. Single-owner memory also works in virtualized environments, because the VMM can use something like single-owner memory for a virtual machine, and inside the virtual machine there can be all kinds of allocations; it doesn't really matter after that.

So what are the problems? There are actually two problems for us. The main problem is the overhead. Today we use struct pages and some other metadata, and overall the overhead is about 1.6 percent just to manage pages. This overhead amounts to petabytes and petabytes of DRAM at Google, and DRAM is the most expensive part of the servers, so it's a very expensive overhead. We know that we use the memory in this single-owner mode, but this 1.6 percent overhead is there to manage all kinds of memory.
It can be shared, named, and so on. So, can we reduce this overhead? That's basically the first question.

The second question is security. We've had several problems where, accidentally, by inspection, and also because of crashes, we detected falsely shared pages. Some of those problems were detected accidentally, where someone would look at the crash of an application and find code that belonged to another application; other times it would be a full system crash. Others we detected using the page table check, which I added in 5.17. It's a mitigation technique that checks that anonymous pages are not inserted into the page tables of another application. Some of the problems were caused by bugs in refcounting. Other problems were caused because a driver would allocate some memory, map it into user space, and later free that memory when the driver was unloaded, but without removing the entry from the user-space application, so that page could be allocated again. Basically, there are rare cases where memory can get falsely shared.

The other problem is performance. Today, most of the memory that we use is single-owner memory, but we have no optimizations to use it with 1GB pages, because there is no transparent huge page support for them. With single-owner memory, I was thinking to guarantee that pages are migratable and can always be assembled into 1GB chunks. And the last performance point is that memory hotplug is an expensive operation. It's slow, because during hotplug you have to allocate a lot of metadata and initialize it, which takes a lot of time, such as the initialization of struct pages. SOM can be implemented such that hotplug is not required. The SOM design would consist of two parts.
The first part is the memory pool, which is always managed in 1GB chunks, basically 1GB pages. The source for that pool can be hugetlb pages, because they have the vmemmap optimization where we don't have struct pages for the tail pages; or DAX, which has a similar optimization; or even kernel-external memory, where the kernel does not manage that memory at all, it's just physical addresses that were inserted into the machine, such as via CXL. We could also have separate pools for movable and unmovable memory, because we want to be able to support long-term pins.

The second part is the SOM driver. The SOM driver would take the 1GB pages from the pool. The 1GB pages are managed by bitmaps, so it's very memory efficient, not much overhead, and because of the large chunks, the performance for concurrency and other concerns should be reasonable. The SOM driver manages memory in 2MB granules. Again, it doesn't have to be 2MB; it could be 4MB or 8MB, but 2MB sounds like a good enough size, and it also equals the huge page size on x86 and on arm with 4K base pages. The granule lists can be something like per mm_struct, or per process; that is, the SOM driver keeps a data structure of granule lists attached to the process. The SOM driver also provides a new VMA type that has the PFN-map flag set, so we never actually try to look up struct pages for its mappings. SOM handles faults; they can be 4K at the minimum, and they could also be folio-sized, as discussed in previous talks.
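As a rough illustration of the pool bookkeeping described above, tracking 1GB chunks in a bitmap costs one bit per chunk, versus 64 bytes of struct page per 4KB page (64/4096 ≈ 1.6%, which matches the overhead figure quoted earlier). All names here are hypothetical sketches, not from any posted code:

```c
#include <stdint.h>

#define SOM_CHUNK_SHIFT 30   /* 1GB chunks */
#define SOM_POOL_CHUNKS 512  /* a 512GB pool, for the example */

/* Hypothetical pool bookkeeping: one bit per 1GB chunk instead of
 * 262,144 struct pages per chunk. */
struct som_pool {
    uint64_t phys_base;                    /* first byte of the pool */
    uint64_t bitmap[SOM_POOL_CHUNKS / 64]; /* 1 bit per 1GB chunk */
};

/* Find a free 1GB chunk, mark it used, return its physical address
 * (0 means the pool is exhausted). */
static uint64_t som_pool_alloc(struct som_pool *pool)
{
    for (unsigned int i = 0; i < SOM_POOL_CHUNKS; i++) {
        if (!(pool->bitmap[i / 64] & (1ULL << (i % 64)))) {
            pool->bitmap[i / 64] |= 1ULL << (i % 64);
            return pool->phys_base + ((uint64_t)i << SOM_CHUNK_SHIFT);
        }
    }
    return 0;
}

/* Return a 1GB chunk to the pool. */
static void som_pool_free(struct som_pool *pool, uint64_t phys)
{
    uint64_t i = (phys - pool->phys_base) >> SOM_CHUNK_SHIFT;

    pool->bitmap[i / 64] &= ~(1ULL << (i % 64));
}
```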
The fault size could be order 1, 2, 3; it depends. SOM would also handle SOM-to-SOM migration and SOM-to-outside migration, so a page can be migrated from single-owner memory into a regular, core-MM-managed page. And it should add support, not for everything that is listed here, but at least for what is needed, and it can be expanded later: some madvise() calls, migrate_pages(), move_pages(), mlock(), mprotect(), prctl() — for prctl() there is only one memory-related setting, which is naming the VMA — plus mbind(), set_mempolicy(), and so on.

There are of course some problems. Overall this sounds cool, except that when you go into the details, you need more metadata to manage some features. The main problem that I see is handling page aging, because without LRU pointers that's kind of hard. We've had some discussion internally, and one proposal was to basically still have something like struct pages, with a new flag saying these are actually tiny struct pages which have the LRU pointers to handle the aging. So far I've been resisting that, but I don't have a better solution yet, so that's something that needs to be thought through. Then there's GUP support for the PFN-map-type VMAs, so as not to look up struct pages when we do long-term pins and short-term pins. Swapping can be implemented independently, so we could add handlers to support just some very specific swap backends. Then there's NUMA support; that's another thing.
NUMA support is optional for now, but for example with CXL, if the single-owner memory is handled by CXL and we assume the same latency for the newly attached memory, NUMA wouldn't be an issue; still, we will probably want to have at least a pool per node or something like that. We would also need to handle hardware poisoning, and finally, somehow add support for the coming memdesc work. In the future, when memdescs are around, they solve one of the main problems that we have, which is the memory overhead, but they don't solve some of the other problems that I listed. We have this problem now and we want to solve it more quickly, but memdescs are not going to be available for several years, so we need some solution in between, sooner, with your help.

Seriously, this feels like: okay, what if I get rid of the VM, what do I still need to re-implement? And it feels like you're re-implementing almost everything in the VM. Well, I wouldn't go so far as to say you're writing a whole new kernel, but it really feels similar to hugetlbfs, where it's like, okay, we're going to do something really special, and the really special thing ends up being a completely parallel shadow VM. And I don't understand what you end up winning, because by the time you've added back in LRU pointers and swap and NUMA, it's like: where's the win here?

Yes, and that's why I've been resisting having these reduced struct pages and LRU pointers, because then I agree with you.
There I don't see a big win. But if you reduce the supported features — no swapping, no aging, just assume hugetlb-like behavior, which basically supports none of that — then there is a win: we can very quickly and dynamically add memory into the system, for example using CXL, without hotplug; we can use kernel-external memory through this driver; and all of it with a very small overhead. But I agree, it's a concern, because today we do need that support: today at Google we do swapping, we do reclaim, we do aging, and that is an important part of the infrastructure, so we cannot just get rid of it, unless we support this only on a very limited set of machines which don't need all of that because of their special configurations. And regarding re-implementing hugetlb: yes, it's very similar to that, except it will live in its own driver. It's not going to touch the core MM. I think it's similar to what Michelle proposed in the earlier talk, to truly rewrite hugetlb as a driver to support some kind of use case. Here we have a very specific use case, and that's what I'm thinking to do.

So I'm curious. Most of what you want to achieve is that you don't want the MM to manage your memory; you want to do it on your own. But then why do you need anything beyond what hugetlb gives you? Why don't you just set aside some memory and manage it yourself?
Kernel-external memory can be supported too; it doesn't have to be hugetlb. What I'm saying is that the pool can be anything.

Yeah, right, and the natural way would be: if you want to do it your own way, you set aside memory and you manage it. But for me it gets confusing as soon as you then try to reuse some of the core infrastructure, like migrating back and forth and things like that. Maybe this migrating from SOM to outside is just like hot-plugging memory, and then fake-removing it from Linux?

So, page migration: an example of a reason to support page migration is to support a core MM feature that is not supported by SOM. For example, if you want to swap a page, you can migrate it to be a normal page and then swap it. It's just an example.

Okay, so you would actually want some kind of transition: I as a driver managed that memory, but now I'm going to hand it back to core MM?

You're not handing the memory back, you're handing the page back. You actually copy the content of the page to a new place that is managed by the OS.

If you have a page? Yes, if you have a page, that's right. Okay, thanks.

And this is just an example of why migration might be needed. It's not that this particular example is going to be used, because I don't think it's an efficient way to do swapping. Another example would be a long-term pin: if a user wants to do a long-term pin, today we migrate the page to ZONE_NORMAL. So here we would migrate, for example, a SOM page, which is always movable, to ZONE_NORMAL and then pin it. Those are examples of why we want to migrate from SOM to outside.
And then we also want SOM-to-SOM migration, which is just to do defragmentation and to always be able to allocate gigabyte pages.

This somehow reminds me of earlier approaches to do swapping in user space, I think using some fancy userfaultfd; don't ask me about the details. But essentially, you as the driver manage the memory, so you would also try to find a different way to handle swapping, rather than letting core MM do that for you. It's hard to imagine how that could look and how it could not look hacky. To me it sounds like a very hacky thing to do: I'm going to manage the memory myself, but every now and then I'm going to make some weird calls into the kernel and tell it, oh yeah, this is not yours, I'm managing it myself, but now would you please swap it, and so on. It sounds very hacky to me, and hard to imagine how you could get it clean. That's all I'm saying; that's the problem I'm having.

Yeah, that's a reasonable concern, I agree. So unless we limit the number of features — again, if the number of features is limited... Yesterday I talked about migrating jobs from containers to VMs, which is another project we have, and this SOM driver would be perfect as a way to provide memory to those kinds of VMs, but with a limited subset of functionality.

Yeah, but the features will never be limited. Once you have the basic driver, you'll get requests for more and more and more features, and it will never stop, right?

Okay. Yes, that's fair.
The agreement with Dave Hansen when device DAX went in was: do not add more features, this is all you get, don't add more. Because it's like: don't become hugetlbfs. And this seems like we want to start a ramp towards hugetlbfs, but not get there all the way.

What you described sounds to me like a GPU driver. Basically, a GPU has GPU memory, right, and it manages its own memory; it's single-owner. Also, it registers an MMU notifier so that it can interoperate with the host MM. That's something you definitely want to do, because you want to swap, you want to migrate pages. So you might achieve some of your goals by writing your own driver, reserving part of the RAM as your private memory, and you don't really need to change the MM at all.

No, I don't plan to change the MM code, except for some hooks that I need to handle some of the madvise() and migrate_pages() and similar calls. The majority of the code will live in its own driver, which will show up to user space as /dev/som, and to get pages from the driver, the user would mmap() /dev/som with a specified length. That's how the pages are going to be allocated. And the SOM driver will have its own VMA type. We have a number of drivers that implement their own VMAs, right, so the VMA-specific operations are going to be handled by this driver; faults are going to be handled by it.

So then my question would be: what prevents you from achieving your goals by just adding a new driver, very similar to a GPU driver? Right now I probably wouldn't call it a DRAM driver, but you reserve part of the DRAM as private memory and then export it to user space; user space can map it and use it whatever way it wants.

Yeah, like DRM?
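The /dev/som usage path described above would be a plain open() plus mmap(). Since /dev/som is hypothetical and does not exist in any kernel, this sketch takes the device path as a parameter; against a stand-in device like /dev/zero, the same call simply yields shared anonymous memory:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes via a character device's mmap handler. In the
 * proposal the path would be the hypothetical "/dev/som", whose mmap
 * handler would install a PFN-mapped VMA backed by 2MB granules taken
 * from the 1GB pool. Returns NULL on failure. */
static void *som_map(const char *dev, size_t len)
{
    void *p;
    int fd = open(dev, O_RDWR);

    if (fd < 0)
        return NULL;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);              /* the mapping holds its own reference */
    return p == MAP_FAILED ? NULL : p;
}
```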
Okay, the DRM driver, yes, the DRM driver, right, that's the term.

This is a sub-allocator; it wants to be a sub-allocator. But you mentioned GPU drivers, and I'm tech lead for part of a GPU driver, and we have found that we want everything. We don't just have a cordoned-off area of "here's GPU memory"; we integrate completely with the memory management system, and that makes us happy: we get all the features, and if they don't work properly, we try to submit patches. And I can't even imagine trying to do this, because there are portions of old drivers where, of course, every huge company wanted to attempt variations on this, and it just doesn't go well. You want to integrate with MM and then fix core MM. So I just don't want this held up as an example of "this is how it's done in GPU", because it's not.

Where is the biggest overhead? You said it's the 1.6 percent for struct pages. So maybe, can we reduce the size of struct pages? For example, can we make struct pages less than 64 bytes? Can we remove...

The plan is to get them to eight bytes.

I understand that's memdesc; I'm talking about the short term.

No, there is absolutely no way to get them smaller in the short term.

Well, what about... Not that either. Yeah, I looked at the fields; it's not like I'm just making stuff up. I went field after field. But the one that I've been thinking about is: is there a reasonable way to remove one of the pointers from it?

You have to maintain that somewhere; it used to be outside, and it was worse. Also, struct page has to be aligned, so pretty much...

Yeah, the struct page alignment.
That's the actual issue. So you want to make struct page smaller, but today I got an email from Kent saying that the code-tagging people decided to put a new pointer into struct page, so it will get larger. But actually, as for the struct page alignment, I don't think there's a requirement for it to be 64-byte aligned.

What is it, then? 16 bytes?

Okay. The reason I'm saying that is that it doesn't have to be 64-byte aligned, because we can have bigger struct pages depending on various compilation configs, but it does have to be 16-byte aligned. So can we remove a couple of things? For example, _mapcount is not needed for single-owner memory.

Being an array, I think struct pages also have their advantages, right? The lookup is so efficient. I understand, nobody likes the overhead part, but I don't think we should overlook the good bits of this large, flat array.

So, yeah, we're out of time; it's actually lunch time. Thank you, everyone, and we can discuss offline.