Okay, thanks everyone for coming to my talk. My name is Matthew Wilcox, I work for Oracle. I don't intend to be talking about any general product directions, but in case I slip up, that's got the lawyer satisfied. I am going to simplify some things during this talk; my slides certainly simplify some things for clarity. If I say something that's not clear, please stick your hand up and shout at me and I'll try and explain it better. If I say something you think, well, that's not quite true — save it, nobody cares.

Since Linux's creation, Linux has managed memory in pages, and memory was managed in pages long before Linux existed too. Pages are a concept used by the CPU: the CPU can map memory in pages into user space. It's a very convenient chunk of memory to call our basic, fundamental unit of memory. The problem is that systems have got a little bit larger since 1991. The first computer I had had four megabytes of memory, so about a thousand four-kilobyte pages. These days, we're looking at one and a half billion pages. That's a few orders of magnitude.

Fortunately, we don't actually try and manage the whole thing as one and a half billion pages; we split it up. The particular Oracle bit of hardware I'm talking about here is a 5U system. It sits in a rack. It's a nice big machine. We actually split it up into what we call NUMA nodes — basically, we split the machine up into eight and have them all cooperate with each other. Each of them is only managing a mere 192 million pages. That's still a few orders of magnitude larger than the thousand or so pages we were managing on my first machine.

We manage them in a number of ways. One of the ways is that we need to know which pages are going to be the least valuable to us — which ones can we send to disk so we can recycle and reuse those pages for something more valuable? To do that, we have something called a least recently used (LRU) list. This is a fairly classic data structure; a lot of operating systems use this concept, and not just for memory, of course, but for many other things.

Having such a long LRU list is really inefficient. The list is protected by a lock. Whenever you go to access this list and modify it — take things off it, add things to it — you've got to grab this lock. That lock is shared between all the CPUs within a NUMA node, and we could have perhaps 50 CPUs or more all contending on this one lock. It's really bad, because not only are they holding this lock, but you tend to get a cache miss, and when you get a cache miss you may spend thousands of cycles waiting for the cache line to come to your CPU. That really amplifies the bad effect of having all these CPUs, because you're now holding the lock for far longer than just reading the code would make you think. You look at the C code and it says take lock, modify list, drop lock, and you think, oh, that's a couple of instructions. No, it's thousands, because the list happens to be cache cold, and you're waiting for some other CPU to give you the cache line you actually wanted.

There have been various efforts in Linux to reduce the length of the LRU list and to reduce the length of time the lock is held. There have been some very, very clever optimizations suggested for how to improve the LRU lock contention. I'm here to say: don't do any of that, just shrink the length of the list by managing memory in bigger chunks.
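To make that lock contention concrete, here is a minimal sketch of the pattern being described — simplified, made-up structures, not the kernel's real LRU code — showing why every add or remove on the list serialises all the CPUs on the node:

    #include <linux/list.h>
    #include <linux/mm_types.h>
    #include <linux/spinlock.h>

    /* Sketch only -- not the kernel's actual LRU implementation. */
    struct lru_sketch {
    	spinlock_t	 lock;	/* one lock shared by every CPU on the NUMA node */
    	struct list_head pages;	/* potentially hundreds of millions of entries */
    };

    static void lru_add_sketch(struct lru_sketch *lru, struct page *page)
    {
    	/*
    	 * Looks like a couple of instructions, but if the lock's cache
    	 * line currently lives on another CPU, acquiring it can cost
    	 * thousands of cycles before we even touch the list.
    	 */
    	spin_lock(&lru->lock);
    	list_add(&page->lru, &lru->pages);
    	spin_unlock(&lru->lock);
    }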
It turns out we've actually been talking about managing memory in bigger chunks since at least 1999 — that was the earliest discussion I found. It was with some people who've been with Linux for a very long time, and some of them are still quite prominent: Eric Biederman, Ingo Molnar, Andrea Arcangeli, and then later, perhaps around 2003, Ben LaHaise, Hugh Dickins, Nadia Chambers, and later on Kirill Shutemov have all had a good go at trying to manage memory in larger chunks. A number of different ways of doing this have been proposed.

Memory is used for various different purposes. Sometimes you use it to cache files — that's the kind I'm most interested in and know the most about. It's also used for anonymous memory: when your program in user space calls malloc and gets some memory, that's called anonymous memory because it doesn't have a name.

One of the things we found when I was working at Intel was that for some workloads, ARM performed significantly better than x86. That was a cause of some concern, as you might imagine, to Intel. The benchmarking team determined it was because ARM was managing memory in 64K chunks while Intel was managing memory in 4K chunks. That's a real problem; we should do something about it. That's when I started getting involved. Then I left Intel, went to Microsoft, and now I'm at Oracle, and I'm still working on this problem because it turns out to be a hard problem. I think it's one that's going to benefit all of us, not just Intel.

We could do what ARM is doing. What ARM does is say: we're going to manage all memory in 64K chunks. That has downsides. It wastes some memory, and it's really, really hard to make sure that user space doesn't notice you've started playing this trick, because user space likes to be able to map memory on 4K boundaries. We've made that guarantee to them and we don't want to break it. So I discarded that approach; I said that's probably not going to work out all that well. It's kind of simple in a way — it's very tempting because it's a simple solution, and we do like to have simple solutions, but sometimes simple solutions don't work all that well.

Perhaps we'll manage just the page cache in 64K chunks? That has similar problems; it's still really hard not to break user space. We actually tried it. Back in 1999 we put in these macros and said, hey, maybe the page cache has bigger pages than the rest of the system. We ripped that out in 2016 because we hadn't actually done it in 17 years. We started to think about, can we do it? And it was, well, it probably doesn't actually work, so we just ripped that code out again. I believe it was Kirill who ripped it out. I was grateful, because I was trying to get my head around it and I was failing too.

What I decided to do was to use large pages adaptively, because I don't know what the correct size is. I don't know where 16K might actually outperform 64K — there are definitely workloads where it will. If it's adaptive, then nobody can blame me for getting it wrong. More to the point, you can't blame your sysadmin for misconfiguring your kernel, or your operating system distributor for building you a kernel that's got 64K pages instead of 4K pages. We might have a bad algorithm for deciding when we're going to adapt, but then we can just fix the algorithm — that's just a bug, we can fix that. It's much harder to fix "oh, Red Hat needs to make this change and then Debian needs to make that change".

When you allocate memory in Linux — when you allocate a page of memory in Linux — you don't just get the four kilobytes.
You also get this little metadata structure called the struct page, which tells you a little bit about the page you've got. Given the struct page, you can say: where is this really in memory, what's its physical address, what's its virtual address, which NUMA node does this page belong to? All that kind of good information is stored inside the struct page. But when you allocate a page, you actually get to use chunks of the struct page for yourself — you've got roughly 60 bytes of memory available to you on the side. Several of the parts of the kernel that allocate memory do in fact do that: the page cache does it, the slab cache does it, something called zsmalloc that I don't really know much about does it. This is allowed; it's a guarantee the page allocator has made to the rest of the kernel: you can use this memory for your own purposes.

One of the things the page allocator does is allow you to allocate not just a single page, but what we call a compound page. A compound page is two to the n pages that are glued together and behave more or less as if they were a single page. This is what I based my work on — the fact that we have these struct pages, and that we were already doing some stuff with compound pages in the page cache. The tmpfs file system used them, and there are a couple of other ways you could actually get compound pages. The fun thing is that this was done in order to use the CPU's ability to chop off a level of the page table. If you were just in the last talk, he was talking about how page tables work: if you allocate a two megabyte page, then instead of filling in the bottom level of the page table, you can just put in a single entry, and the CPU knows this is a two megabyte page, treats it specially, and it's more efficient.

People get these confused. What I'm talking about is making Linux more efficient at managing memory. What a lot of people hear is that this is a way for us to get all the way up to two megabyte pages, and then everything gets more efficient because the CPU is using two megabyte pages. No, it's not about that. If we never get above 64K pages, that's going to be fine; we're still going to have a significant reduction in LRU list length. There are other side benefits as well, but my primary focus is just managing memory more efficiently, not using the CPU's ability to treat very large pages specially.

When you allocate a compound page — I didn't want to show you an illustration of an order-nine page because it wouldn't work very well — this is the kind of page I'll be allocating during this talk: an order-three page. That means you've got eight struct pages that are glued together and behave as if they were a single page, because they're a single allocation. You allocate it, you do stuff with it, and then you free it. It's always glued together as eight consecutive pages.
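As a concrete illustration — a minimal sketch, not code from the series being discussed — this is roughly how you would ask the page allocator for that order-three compound page; alloc_pages(), __GFP_COMP and __free_pages() are the real interfaces, the wrapper functions are made up:

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Sketch: allocate an order-3 compound page (8 contiguous 4K pages). */
    static struct page *alloc_order3_example(void)
    {
    	/*
    	 * __GFP_COMP glues the eight pages together into one compound
    	 * page: the first struct page is the head page, the other
    	 * seven are tail pages.
    	 */
    	struct page *page = alloc_pages(GFP_KERNEL | __GFP_COMP, 3);

    	if (!page)
    		return NULL;

    	/* Use all 32K as a single unit... */
    	return page;
    }

    static void free_order3_example(struct page *page)
    {
    	/* ...and free the whole thing as a single unit too. */
    	__free_pages(page, 3);
    }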
We call the first page the head page. The head page is where all the real action goes on. The tail pages are occasionally visible and almost always cause confusion when people see one. Some of our functions understand what to do with a tail page, but other functions say: oh no, no, no — if you're going to pass me a page, it had better be either a single page allocation or the head page. Some people try to indicate what kind of page they expect by saying, oh, I want a head page. Even that's confusing, because if you've done a single page allocation, you can pass that struct page in, but it's not a head page — it's just a normal page, a plain page. We're suffering from a naming problem here: we don't really have a good way to say "a single page or a head page, anything that isn't a tail page". That was where I came in with "we need a new name", and that's how we got the name folio.

If you look at the various definitions of the word folio, you find there are really two. In book binding, you take a very large piece of paper, fold it a number of times, cut the edges off, and it's several pages. The other is that it's a large page size. I thought having these two different meanings, being able to look at it two different ways, kind of made it the perfect name. We had a lot of arguments about naming, and nobody really came up with a better one.

Inside Linux now, we are transitioning from using pages to using folios. There are still cases where you need to refer to a precise page: in the page fault handler, you need to know exactly which page within a folio is the one that corresponds to the page fault. But if you are referring to the entire memory allocation, you should use the struct folio that contains the page you're looking at. Of course, many folios will contain a single page. That's fine — it's still beneficial even if you are only dealing with single pages, for reasons I don't need to go into here.

Let's expand struct folio here a bit. This is basically the same as what's in struct page — it's a cut-down version of what's in struct page. As I was saying earlier, when you allocate a page, you are permitted — encouraged, even — to use the contents of the struct page for whatever you like. Whereas with folios we're saying: if you have a struct folio, it's either an anonymous page or it's a file page. It's not being allocated to slab; it's not being allocated to anything else. With folios, these fields in the struct actually refer to what they really are. It's not the case that somebody has hijacked some of these elements — that's not really a pointer to an address_space, it's a pointer to something else entirely, or it's actually an integer and we've just cast it because this is C and we can do that. We're saying we'll try to keep it clean and tidy.

One of the ways we did that is that we went to the slab allocator. As it happens, one of my collaborators on this is a slab maintainer, and he was more than happy to help — I was very grateful to have his help. What we did was — I'm just going to flick back and forth between these two slides so you can see there's really very little difference between these two data structures. Some of these fields change their names, some of them change type a little bit, but they are almost identical. They're actually using the same memory; that's the important thing to know. When you convert from having a struct page to having a struct folio, or from having a struct page to having a struct slab, you aren't following a pointer, you aren't going off and looking at a completely different data structure. These data structures all overlap in memory, and that's for efficiency. That's how things were anyway — we haven't changed how Linux is managing memory in that way yet. We're still working with these 64-byte data structures that describe what memory is; it's just that now we've changed the types.
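As a small sketch of what that overlap means in practice — illustrative only; page_folio(), folio_page() and compound_head() are the real helpers, the function wrapped around them is made up — converting between the types is a reinterpretation of the same memory, not a pointer chase:

    #include <linux/mm.h>

    /* Sketch: the folio and the head page are the same bytes in memory. */
    static void folio_overlap_example(struct page *page)
    {
    	/* Find the folio containing this page, whichever page it is. */
    	struct folio *folio = page_folio(page);

    	/* No pointer was chased: the folio *is* the head page's memory. */
    	WARN_ON((void *)folio != (void *)compound_head(page));

    	/* And when you need a precise page again, you can get one back. */
    	WARN_ON(folio_page(folio, 0) != compound_head(page));
    }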
The compiler can say: hey, you're passing a struct slab pointer to something that's expecting a struct folio pointer — you probably didn't mean to do that, and I'm going to throw a warning or an error. That's very useful, because it means you can't pass a struct page pointer to something that's expecting a struct folio. You can't pass a tail page to a function which is expecting a folio. The whole question of what a function should do when it sees a tail page just disappears, because it will never see a tail page. We also have runtime assertions in case somebody has decided to lie to the C compiler and say, oh no, I'm just going to cast this tail page to a folio. Those are only enabled by people who are debugging the memory manager; they wouldn't be enabled by normal users, and they certainly aren't enabled by distributions, so there's no concern in terms of overhead.

There are two other benefits to doing this work with the slab allocator. One is that we're actually able to make this data structure entirely private to the slab allocators, which means that unless you're working on the slab allocator, you don't even need to know this data structure exists. That's really significant, because it used to be about a third of the definition of struct page, and we've now entirely deleted it from the headers — we've just shrunk the definition of struct page by about 30%. If you want to scare people, get them to read through the definition of struct page. It is over 200 lines; it's really quite intimidating. It used to be worse than it currently is, but that's no excuse. We need to make it an awful lot better.

So what have we got upstream? It turns out quite a lot. We started upstreaming this work back in August 2021, so we've been at this almost a year. You'll notice that we skipped 5.15 — I had a big pull request ready and it got NAKed. So 5.16 was when we really got the initial folio APIs, and then we've been slowly working our way through various different parts of the kernel.

Now, you may remember the whole point of this was so that we could have large pages caching our files. Well, it turns out that file systems generally are not prepared to see folios which are larger than page size. So we now allow a file system to say whether it does in fact support folios larger than a single page. We've done that for XFS; we enabled it back in March. So if you have XFS on any of your machines and you run a relatively recent kernel, you can actually start testing this. It seems to be fairly stable. We've had a few bug reports; they've all been addressed as far as I know, so I don't know of any current problems, and they've tended to be races. What I love about working with the XFS people is that they live and die by their test suite, and they run thousands of machine-hours of tests per week — they run it far harder than customers would tend to. So I like to hope that we've got most of the bugs out. The bugs we're finding now are weird ones, right? They're races: oh, if you truncate this file at the same time as you're reading from it, and then you also write to it from a third thread, then this races with that, and it's complicated. Even just writing it up in prose, to explain why a patch needs to happen, takes a few lines to really describe what's going on.

One of the interesting things is that we got a lot of the support in before we actually enabled creating large folios. That didn't come in until 5.18, in May 2022.
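That per-file-system opt-in mentioned above looks roughly like this — a sketch modelled on what a file system such as XFS does when it sets up an inode; the function here is hypothetical, but mapping_set_large_folios() is the real opt-in helper:

    #include <linux/pagemap.h>

    /* Sketch: how a file system opts in to large folios for a file. */
    static void example_fs_setup_inode(struct inode *inode)
    {
    	/*
    	 * Tell the page cache it may cache this file in folios larger
    	 * than a single page.  File systems that never call this keep
    	 * getting order-0 folios, exactly as before.
    	 */
    	mapping_set_large_folios(inode->i_mapping);
    }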
So yeah, we had a lot of the support in place, and of course we were testing it, right? This is just when we landed it — we'd actually been creating large folios and testing them with XFS for about a year before that. It's just that this is when the code landed and became available to people who weren't willing to run a random kernel from a random website. And you can see 5.19, July 2022, hasn't happened yet; we're anticipating the release of 5.19 around the end of next month. We've already got work queued up for 5.20. We're hoping to add one new file system, AFS, the Andrew File System. And we're just continuing to work through converting old code that uses struct page to use struct folio instead. At this point it's a gradual-performance-improvement kind of situation. It works, right? Even though we're going through converting between folios and pages and back to folios and back to pages, it all works; it's just that we can do it more efficiently by not going through the conversion process so many times.

I've kind of skated over how we decide what size folios to create — what's our algorithm? There were those who said, oh, we should take hints from user space. If you look around, you can find various articles describing how user space gets hints wrong all the time. I think I saw a paper from Google that said something like 90% of their programs are using huge pages when they shouldn't be, and so they're looking to ignore those hints. Some people said, oh, just leave it up to the file system. Well, the file system developers are not going to like me saying that. The file system just doesn't have enough information; it can't tell whether it would be useful to use a larger unit to cache a given file. It can sometimes say, yeah, just shift the boundaries around a bit — I can be more efficient if you move the boundaries around a bit — and we've given file systems that degree of control. But other than saying "no, I can't handle large folios at all", we haven't given them any kind of detailed control, any way of saying "I would really like you to handle them this way". No — keep it simple.

What I settled on was using the readahead code, because we're already trying to predict the future in the readahead code. We're looking at the access pattern the program has had to the file, by calling read or possibly by taking page faults, and we're deciding how far ahead into the file we should be prefetching — should we be starting reads now so that the data will be here by the time the program wants it? We're already trying to do some degree of that, so that just seemed to me to be the right place to say, okay, let's start increasing the size. Say you start out doing one-kilobyte-sized reads: it will start out allocating you four-kilobyte pages, because it doesn't know yet — is this going to be successful, is it going to be worth doing anything more than a four-kilobyte page? But as your reads continue, and it becomes obvious that really you're just scanning through the entire file one kilobyte at a time, we'll start to allocate larger and larger pages, just to get the overheads down. Because allocating one 16K page is way better than allocating four 4K pages — even before we start talking about the LRU list, just in terms of the immediate benefit, it's much better to allocate a single 16 kilobyte page. The current algorithm is extremely stupid: it quadruples the size of the page every time it decides that, yes, this was a successful readahead.
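Roughly, the shape of that heuristic is sketched below. This is deliberately simplified and is not the kernel's actual readahead code; the function name and the order cap are assumptions made just for illustration.

    #include <linux/types.h>

    /* Assumed cap for the sketch: order 9 is 2MB with 4K base pages. */
    #define SKETCH_MAX_ORDER	9

    /*
     * Sketch of the "quadruple it each time" idea: start with order 0
     * (a single 4K page) and bump the order by 2 (i.e. 4x the size)
     * every time the previous readahead turned out to be useful.
     */
    static unsigned int next_readahead_order(unsigned int prev_order,
    					     bool readahead_was_useful)
    {
    	unsigned int order = 0;

    	if (readahead_was_useful)
    		order = prev_order + 2;	/* 4K -> 16K -> 64K -> 256K ... */

    	return order > SKETCH_MAX_ORDER ? SKETCH_MAX_ORDER : order;
    }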
I would hope that somebody comes along and does it better. I'm kind of expecting that in 20 years' time I'm going to go back and look and it's still going to be there, because nobody's bothered. But if anyone's looking for a project, analysing how successful this algorithm is and replacing it with a better one would be much appreciated, because that's not something I'm interested in doing — I've got a whole bunch of other projects on my list, and that seems like something somebody else can do.

Something I simply haven't implemented yet is creating large folios when writing to a file. If you're writing to something you've already read, that's fine — it will just use whatever pages are in the cache, no matter what size they are. But if it's having to create new pages in the cache, perhaps because you're appending to the file, or you've just opened the file, seeked to a bit of it and started writing there, it will just use order-zero pages, and that's pure laziness on my part. I simply haven't got around to it yet. Again, that's something somebody else could do, but realistically it's probably something I'm going to get annoyed at myself for not having done and just go off and do one day.

I said I wasn't going to take hints from user space, but then I started looking at the mmap code, and the mmap code already uses this madvise hint, MADV_HUGEPAGE. If you go and look at the madvise man page, you will see it means "I really want you to use two megabyte pages". And I said, well, that hint already exists — fine, I will honour it. That's actually the hint the Google paper says not to use, but file pages are a bit different from anonymous pages, they're used in different ways, so maybe it will work. And at least user space asked for it. Maybe we're giving them what they asked for instead of what they really need, but it is at least somewhat justifiable.
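For reference, honouring that existing hint just means the ordinary madvise(2) call from user space — this is a plain usage example, with the file descriptor and length assumed to come from elsewhere:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    /* Map a file and ask the kernel to use huge pages for it if it can. */
    static void *map_file_with_huge_hint(int fd, size_t len)
    {
    	void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

    	if (addr == MAP_FAILED)
    		return NULL;

    	/* The existing hint now honoured for file-backed memory. */
    	if (madvise(addr, len, MADV_HUGEPAGE) != 0)
    		perror("madvise(MADV_HUGEPAGE)");

    	return addr;
    }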
Something I haven't done is go above two megabytes. There's nothing intrinsic to what I've done that says you can't go above two megabytes, but I believe we've already got most of the benefit by the time we get to two megabytes. And there are places in the kernel which assume that if a page is at least two megabytes, it is exactly two megabytes — technically PMD size, but if we're talking about Intel, that's two megabytes. I just don't feel like going and finding all those places and fixing them. When I notice them I fix them, but I wouldn't dare say I've found them all, and given everything else I was working on, I just didn't want to go off and fix all of those at the same time, when it's not even clear to me that there's any benefit to it.

Because there are downsides to using large folios, right? If you mark one as dirty — that is, you write to it — the whole folio has to be written back. We don't keep track of each little bit of the folio to decide whether or not it needs to be written back. So you write one byte into a two megabyte folio, and we do a two megabyte write to get that byte back to storage. Now, that's not necessarily a bad thing: some file systems and some storage devices work much better when you give them very large chunks of work to do. And if your workload exhibits reasonable locality, then you're probably not just dirtying one byte, you're probably dirtying a lot of bytes. And so the trade-off we're making is "now we do fewer IOs, but they're larger", rather than "now we're doing the same number of IOs and each one is larger, so we're doing that much more work on the disk".

But not all workloads have good locality, and I don't necessarily want us to bump all the way up. If we go to a ludicrous extreme, we could create one gigabyte pages — we could manage everything in terms of one gigabyte. Well, the fastest SSD I can find does five gigabytes a second, and if you're only able to do five IOs per second, that's 200 milliseconds to wait for a page to be read in. That's ridiculous. A glitch of 200 milliseconds in your audio — that's forever, it's so noticeable. So I don't think we ever want to go all that large. Now, maybe I'm going to be wrong — maybe SSDs will get even faster than they currently are; perhaps we'll see orders-of-magnitude improvement in SSDs as we have in the past. I'm just not counting on it. But when we do, when we're seeing PCIe gen 20, we can adapt this, right? Folios will be there, ready and waiting; we'll just need to find the assumptions and fix them then. Until then, I just don't think it's worth it. I think we're going to get most of the benefit by going as far as 64K, and if we only go as far as 256 kilobytes, I'm sure we will have reaped almost all of the benefit.

Speaking of benefits, I've got some performance results. These are mostly from the Intel kernel test robot, although we also got some interesting results from the Phoronix test suite and some others. Some of these are absolutely stunning: a 241% performance improvement for that particular benchmark, and, you know, 60%, 20%, 45%. These are really good numbers. Now, if you go and click that link, it will tell you about the 18% performance regression on some other benchmark, and these performance improvements are a little note at the bottom — oh yeah, by the way, we also see some improvements on other tests. That 18% performance degradation was a bug, and it got fixed; now we're only left with the improvements.

So, the benefits. We got a shorter LRU list, which translates into shorter lock hold times, and fewer cache misses because we're dealing with memory in larger chunks — we don't have one struct page per four kilobytes, we've got one struct page per 64 or 256 kilobytes, so fewer cache misses to manage the same amount of memory. We actually reduced memory fragmentation, because the file system now allocating memory in large chunks increases the pressure on the memory allocator to keep memory around in larger chunks. There was a big fear of memory fragmentation when I started introducing this, but actually it's beneficial to fragmentation, and we have the measurements to prove it. We do larger IOs, which tends to benefit both disks and file systems. We've possibly opened the door to getting rid of the 16 kilobyte and 64 kilobyte page size configurations — I'm still arguing with Oracle's distribution team about whether we can actually ditch the 64 kilobyte page size. Maybe we can, maybe we can't; there are various considerations that go into that.

One of the things we're doing is getting rid of calls to compound_head(), and that is shrinking the kernel. Every time you call some of these functions and pass them what is potentially a tail page, there's a call to this function compound_head() that says: go and get me the compound head of this page.
Even if it's already the head page, and the compiler knows it's the head page, it still goes and fetches the compound head. Very annoying. I tend to put those memory-reduction wins into the commit messages, and if you add them all up, it's tens of kilobytes of kernel text that get removed as a result of getting rid of these calls to compound_head(). I aim to get rid of about five to ten kilobytes per kernel release just by submitting patches that convert things that use struct page to use struct folio; all those calls to compound_head() just disappear.

There's a whole bunch of future work, which I might come back to in just a couple of minutes, because I've been told I only have a couple of minutes remaining. But I wanted to thank my contributors: I could not have done this without all of these people. People here work on XFS, people here work on the network file systems, on memory management — all kinds of different parts of the kernel have been affected. You can see this has been a cross-company collaboration. Many of these people just said: yes, I want to help, I want to be part of this. That's been absolutely fantastic. I would like you to thank all of them with a round of applause, because they've done so much work.

I think I've got about two minutes to talk about what I might want to do next. Just like we did for slab, we can split out other struct page users, and that's going to benefit everybody by clarifying what we're actually using struct page for. Once we've done all that, we might be able to dynamically allocate memory descriptors, and that will shrink the amount of memory you're using. We can make tmpfs more efficient. We need to start converting file systems from the old buffer head mechanism to use iomap, and once we do that, they will magically get the ability to use large folios themselves. For the file systems we don't want to convert, we can still do the work to convert them to use folios. It's much better to convert file systems to use iomap, because iomap does so much more of the work for them, but if we can't do that, at least let's convert them to use folios so that we can get rid of these uses of struct page — again, maybe not large folios, but at least folios. I think I may have regressed the performance of O_SYNC; that's something I need to look at. I want to handle the mapcount of folios better: right now you have to iterate over every single tail page, finding out how many times each tail page is mapped, in order to find out how many times the folio is mapped. That's a ridiculous way to do it, but I haven't quite got to a better idea yet. When we take a page fault, we know we've got a page that's in a folio; we could actually map each of the pages in that folio into user space at the same time and cut down on page faults that way. I haven't written that code yet — that's an easy, separate task, and if somebody else wants to take it on, that would be fantastic. Oh, and we need to do large folios for anonymous memory. All I've been doing is folios in general, and large folios for the page cache; we need to also do this for anonymous memory. I don't know anonymous memory very well — please, somebody else do that, that would be fantastic. And that's my time. Thank you all for coming.