Good morning everybody. My name is Matthew Wilcox and I'm a technical advisor at Oracle, in the Linux kernel development team. I've been a Linux kernel developer for about 20 years now. I'm here to talk to you today about using large pages in Linux.

So when we talk about managing memory, the default unit that we manage memory in is pages, and on an x86 CPU those pages are 4 kilobytes in size. So when you look at a computer like the Oracle Server X8-8, an 8-socket system with up to 6 terabytes of memory, that's one and a half billion 4 kilobyte pages. That's a lot of memory, and a lot of objects to be managing. Now, we do divide things up a bit; we actually divide things up per NUMA node, and so there are only 192 million pages per node. But still, 192 million pages is a lot. And in case you're thinking that this only affects large computers, it doesn't. My laptop has 16 gigabytes of RAM; that's 4 million pages. My phone has 4 gigabytes of RAM, so even my phone is dealing with managing a million pages.

As you can expect, that takes a lot of CPU time. We use this thing called a least recently used (LRU) list. Every time we access a page, we remove it from its current position in the list and move it to the front. That involves touching a lot of cache lines, modifying a lot of cache lines, and if you're on a system that has something better to do with its caches, those lines are going to be evicted, so we're going to incur a lot of cache misses. And those list manipulations are done with a spinlock held. There are several patches out there at the moment trying to address the fact that the lock protecting the LRU lists is heavily contended. If our LRU lists were shorter, we would not see the same kind of contention that we do today.

So how can we solve this? Well, one of the ways that's been discussed is to manage memory in larger pages: just increase the size of a page from the CPU's native page size to some larger size, say 64 kilobytes. Obviously, this approach is going to waste some memory, because if you have a file that's only, perhaps, eight kilobytes in size, all of a sudden instead of using exactly eight kilobytes to cache it, you'll be using 64 kilobytes. Maybe that's not a problem on a larger system, but this also runs on phones, remember, and so phones may well have a problem with it. A bigger concern is that it's hard to maintain the user API, because user space thinks it can mmap at a 4 kilobyte boundary. If we're using 64 kilobyte pages and we're trying to map at 4 kilobyte granularity, that starts to get quite tricky to do. So this approach has never really found much favour.

Another approach, attempted early on, is to manage just page cache memory in large pages. This has some of the difficulties of the previous approach of managing all memory in large pages. It has some benefits, and not quite as many disadvantages, but it does confuse people, because people are never quite sure when they're supposed to be using PAGE_SIZE and when they're supposed to be using PAGE_CACHE_SIZE. And since the two were never actually different, for a long time nobody noticed when people got it wrong. So we had a flurry of patches which tried to fix some of this, and in the end, in 2016, we just ripped out the PAGE_CACHE_SIZE macros entirely.
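As an aside, to make that LRU bookkeeping concrete: on every access the kernel does, in effect, something like the following minimal sketch. The names here are hypothetical; the real code lives in mm/swap.c and is considerably more involved.

```c
#include <linux/list.h>
#include <linux/spinlock.h>

/* Hypothetical, simplified types: the real LRU state lives in
 * struct lruvec and the real per-page state in struct page. */
struct lru_sketch {
	spinlock_t lock;	/* the heavily contended LRU lock */
	struct list_head list;	/* most recently used entries at the head */
};

static void lru_touch(struct lru_sketch *lru, struct list_head *entry)
{
	spin_lock(&lru->lock);		/* every activation serialises here */
	list_move(entry, &lru->list);	/* unlink + relink: dirties several cache lines */
	spin_unlock(&lru->lock);
}
```

Fewer, larger pages means fewer calls to something like lru_touch(), which is the contention argument in a nutshell.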
So the approach that I chose to investigate here was an adaptive use of large pages. That is, keep the page size the same, but allocate pages in multiples whenever we can. And the real question to understand is: how do we decide when it's going to be beneficial? I'll get into that a bit later.

The kernel already has the capability of allocating something called a compound page. A compound page is necessarily a power of two in size. If you allocate a single page, that's an order-0 page; if you allocate a pair of pages, that's an order-1 page, and so on. If you want to read more about this, it's called the buddy allocator; there's a lot of good documentation on it out there. The first page is called the head page, and all the other pages are tail pages, which point to the head page. If you call a function like lock_page() on a tail page, it will actually redirect to the head page and lock the head page. This can confuse people when they try to lock one tail page and then lock the next tail page, and then find that they've just deadlocked because they've tried to lock the same page twice. I fixed a couple of bugs along those lines.

So we've had the ability to allocate compound pages, and we've used them for a while in the form of transparent huge pages. In 2016, when Kirill added them to tmpfs, that was when the page cache first got the ability to store transparent huge pages. In 2019, Song Liu added support for read-only huge pages, building on the work that Kirill had done for tmpfs, but now it was also available for file systems which use block storage, or indeed network storage. The way this is implemented is that you allocate compound pages. But it was very limited, in that only two sizes were supported: order-0 pages (normal pages, base pages) and PMD-size pages. On x86 a PMD-size page is an order-9 page, which is two megabytes of memory.

So around the beginning of this year, I started working on something that I called new transparent huge pages. I was recycling a lot of the transparent huge page work that had been done and generalising it, so a transparent huge page is now no longer just two megabytes in size: it can be almost any power of two in size. Because tmpfs is a fully in-memory file system, little attention had been paid to things like dirtiness, or whether the page was up to date, or whether the page was under writeback; those things don't make any sense for an in-memory file system, so we didn't bother to track any of that. I had to make a few changes to track some of that page state for transparent huge pages.

Then I introduced a bunch of new APIs, which you can see there, which do, I think, obvious things: thp_nr_pages() tells you how many pages are in a transparent huge page, thp_size() tells you how many bytes are in it, and thp_order() tells you its order. These three are all linked to each other, but some code wants to know one thing and some code wants to know another, so I provided all three. thp_head(), given any struct page, will give you its head page. And offset_in_thp() gives you a byte offset within this particular page of whatever you pass to it; in general, that's a file offset, so it tells you how far into a particular page an address is.
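To illustrate how those helpers fit together, here is a small sketch. The thp_*() helpers are the ones just described; describe_thp() itself is a hypothetical example, and the exact details may have shifted in later kernels.

```c
#include <linux/mm.h>
#include <linux/huge_mm.h>

/* Illustrative only: what the new helpers tell you about an arbitrary
 * struct page that may be a base page or part of a THP. */
static void describe_thp(struct page *page, loff_t pos)
{
	struct page *head = thp_head(page);	/* head of the THP, or page itself */
	unsigned int order = thp_order(head);	/* 0 for a base page, 9 for 2MB on x86 */
	unsigned long nr = thp_nr_pages(head);	/* == 1 << order */
	unsigned long bytes = thp_size(head);	/* == nr * PAGE_SIZE */
	unsigned long off = offset_in_thp(head, pos);	/* how far pos falls into this THP */

	pr_info("order=%u pages=%lu bytes=%lu offset=%lu\n",
		order, nr, bytes, off);
}
```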
I got a very long way with this approach. It was really only when I came to merge it, explaining it to other people, trying to justify what I'd done and persuading maintainers that they should take it, that I really started to wonder whether I had made the right decision. It was definitely expedient to use the transparent huge page code paths. But it turns out that file system authors are generally unfamiliar with transparent huge pages, because the only file system that's really supported them up to now has been the shared memory file system, tmpfs. Other file system authors have paid no attention to it at all. tmpfs is kind of a special file system by itself; regular file system developers don't look at it, whereas file system authors tend to look at other file systems like theirs to gain inspiration. If you talk to somebody who works on NFS, they may well know things about several different network file systems. And if you talk to, say, an ext4 developer, they'll probably be quite familiar with how btrfs and XFS do things, because we're all looking at each other's code, trying to figure out the best way to do something; somebody comes up with a good idea, then let's do that.

The other thing I noticed was that, obviously, I was working on x86, but only some architectures support transparent huge pages. It's actually something an architecture has to opt into. That makes a lot of sense if you're talking about the original use of transparent huge pages: doing things like inserting a TLB entry that covers a much larger page. That is something an architecture needs to support; it's something for which you actually need to write code. For example, the Alpha CPU has hardware support for larger page sizes, but nobody's ever written the code for it, so it's not activated: Alpha does not support transparent huge pages. Even though the hardware does, the Linux port doesn't. But I don't care; I'm not writing this code to make more efficient use of hardware TLBs. I'm writing it to reduce the software overhead, because that software overhead has now got to a point where it is beyond acceptable. So why shouldn't m68k and Alpha and PA-RISC and all the other architectures have support for using large pages in the page cache? One way to do that would be to make some of these things not depend on transparent huge pages, or to split the transparent huge pages config option, so you could say: use it for PMD-size pages, or just use it in general.

But then, as I was trying to explain to file system people exactly what a transparent huge page is, it became increasingly difficult to justify some of the choices I'd made. One of those being: when we call the file system's readpage operation, do we want to read an entire transparent huge page, or do we want to read a specific subpage? Do we just want to read 4k, or do we want to read the whole thing? A file system, or any piece of code, can ask: is this page a head page? Is it a tail page? Is it a base page, an order-0 page? But actually, that doesn't settle it. Just because it's a tail page doesn't tell you "I just want to read this tail page"; it might mean "I'm really interested in this tail page, but I want to read the whole transparent huge page that contains it".
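To make the ambiguity concrete, this is roughly the only question a struct page can answer on its own. The classify() helper is a hypothetical illustration built on the real PageHead()/PageTail() tests.

```c
#include <linux/page-flags.h>

/* Hypothetical helper: what a struct page can say about itself. Note
 * that none of these answers tells you whether the caller wants just
 * these 4KB or the whole compound page containing them. */
static const char *classify(struct page *page)
{
	if (PageTail(page))
		return "tail page";	/* inside a compound page */
	if (PageHead(page))
		return "head page";	/* first page of a compound page */
	return "base page";		/* order-0, stands alone */
}
```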
And there's a similar kind of question with the page fault handler. Do we want to satisfy the page fault with an entire transparent huge page? Do we want to satisfy just this specific page? Do we want to say: put in as many TLB entries as we can around this page? Exactly what are we dealing with? It becomes very tricky. What I had done was put in assertions at a lot of points that a page definitely was a head page. That worked out well for my prototyping work, but I didn't really want to leave all that in for production code, and I also don't want to have to replicate it in other file systems when they add support for this. And finally, it's very hard to document exactly what you want. I frequently caught myself saying "oh, this must be a head page", and then people would turn around and say "well, yes, but can't it also be a base page, when you're not guaranteed it's going to be a transparent huge page?" Yes, okay: what I really mean is that it's either a base page, or the head page of a transparent huge page. And that quickly made the documentation very messy.

So I have decided to adopt a new approach, which I call page folios. When an interface says "this parameter is a folio", that means it is either a base page, or the head page of a transparent huge page. And in either case, whatever operation the function is doing, it is definitely doing it on the entire folio. We're not referring to a specific page of the folio; we're referring to the whole thing. That way, when somebody passes in a struct page for an operation, you know you're referring to a specific four kilobytes, or, depending on the architecture, maybe eight or sixteen kilobytes. But if somebody passes you a folio, you're dealing with whatever size that folio happens to be.

So we have to replicate a lot of the page infrastructure that we have. We have a page dirty macro; now we need a folio dirty macro, because the entire folio has to be tracked as a single unit. And similarly for up-to-date, active, writeback, locked, and the dozen or so other flags that we set on pages. The next step will be modifying some APIs. For example, the readpage operation I was talking about is going to want to take a struct folio, to indicate that it operates on the entire folio. And the new APIs I added just recently for transparent huge pages, I'm going to have to pull those back out and put in folio equivalents. There are a couple of new ones here that don't have equivalents on the previous slide. One is folio_page(): given a struct folio and an index within the file, it returns the exact struct page for that particular index. And page_folio() does the opposite: given this struct page, return me the folio which this page is in.

You'll notice that struct folio is not a particularly big structure. Indeed, it's not really a structure of its own; its purpose is to tell the C compiler, and indeed the programmer, that this struct page is definitely not a tail page. So you literally can just cast between a struct folio and a struct page. But if you do that, you are throwing away the type information which struct folio gives you. Sometimes that's the right thing to do; sometimes you really do want to cast it away. But usually I don't want to see that in lots of file systems, and indeed the core VFS generally shouldn't be doing it either.
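Here is a sketch of the core idea, close in spirit to the proposal; the exact definitions in the patch series may differ.

```c
#include <linux/mm.h>

/* A folio is just a struct page that the type system guarantees is
 * not a tail page. Sketch of the idea, not the exact upstream code. */
struct folio {
	struct page page;
};

/* page_folio(): find the folio containing an arbitrary struct page.
 * compound_head() maps a tail page to its head, so the cast is safe. */
static inline struct folio *page_folio(struct page *page)
{
	return (struct folio *)compound_head(page);
}

/* folio_page(): given a file index that falls within this folio,
 * return the exact struct page caching that index. (Real code would
 * need nth_page() to cope with discontiguous memmaps.) */
static inline struct page *folio_page(struct folio *folio, pgoff_t index)
{
	return &folio->page + (index - folio->page.index);
}
```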
That cast should really only be done by the folio_page() and page_folio() macros themselves; it's not something that should be done elsewhere, except maybe in some debug code.

So once we've got all that out of the way, we can return to our earlier interesting problem: when should we allocate large pages? Some people say we should allow user space to hint to us; we should have an API where the program opens a file and then calls fcntl and says "I want to use large pages; here's the size of page I want you to use to cache this file". And we could do that. But often the user doesn't know. And if they do know, sometimes they tuned on one particular piece of hardware, and when they move to different hardware a different decision would have been better. There's also the case that files are not usually used by just one program: they may be written by one program and then read by three other programs, and if all four of those programs set different hints, we don't know which hint we should be using. And I don't like hints, because they mean user space has to change. I think this is a general system-level optimisation, and it should just work, naturally, without anybody changing anything.

Another important thing, I think, is that file systems should not have to develop their own heuristics. This is a caching decision being made by the VFS; it's not something that a file system should necessarily get involved with. There are a few cases where file systems want to tell the VFS "hey, that's a bad decision for me, can we modify that decision slightly?", but they shouldn't have to do it all themselves.

So I was looking around at how this should be done, and I settled on page cache readahead, because that already makes caching decisions: it decides how many pages to read ahead, currently capped at around 256 kilobytes on most systems today. I enhanced it to also decide how large those pages should be. It actually ramps up fairly aggressively in my testing, just because I wanted to test with a lot of large pages. Originally, I thought maybe this is too aggressive; I'm going to want to dial it back some, not ramp up large pages quite so quickly. But now I'm starting to think, well, maybe it's right; maybe we do want to increase it quite aggressively.

I mean, there are downsides to using large pages. The larger the page you use, the larger the write amplification: if you dirty one byte in a 64 kilobyte page, you'll write back 64 kilobytes of data, even though most of it was unchanged. That said, most devices handle that quite well; there aren't really a lot of bandwidth limitations on most systems. You can find systems that don't have enough bandwidth on their PCIe bus or something, but it is generally not a problem. And you find that a lot of storage devices actually perform better with larger sizes. I recently bought a new SSD and discovered that it actually takes longer to read a 512 byte sector than it does to read 4 kilobytes. And that's weird: I would understand if it took about the same length of time, but it actually takes longer, which I find really quite weird. So that's an indication that storage vendors are optimising their devices for larger I/Os, and that smaller I/Os are inefficient these days.

I haven't done anything to allocate large pages for writes yet; that, to me, is future work. I want to get the work I've done upstream more than I want to have writes use large pages.
It is a deficiency, but I want to push out the benefits that I currently see so that everyone can take advantage of them.

Page faults. The readahead code works for both reads and page faults. One of the patches that I did recently looks at an existing hint, madvise(MADV_HUGEPAGE). If you do have that set, then it will use PMD-size pages, and it will also start to read ahead in PMD-sized units, so two megabytes at a time. So the first time you take a page fault, it will actually read four megabytes: the two megabytes you're looking at, and then the next two megabytes. And if you fault in that next two megabytes, it will start to read ahead the two megabytes after that. So it's got the pipelining effect of readahead, but it's doing two megabytes at a time rather than 256 kilobytes at a time.

One of the things that I had to do in order to make the readahead code work was to change the API between the VFS and the file system, so that the readahead code would add the pages to the page cache itself and then tell the file system: here are the pages I've added, go off and fill them. We used to have the readahead code allocate some number of pages, and then the file system would read into them and add them to the page cache. But compound pages have to be naturally aligned: if you allocate a two megabyte page, it has to be at offset zero, two megabytes, four megabytes, six megabytes; you can't put a two megabyte page at a one megabyte offset in the file. Because of that kind of restriction, it was going to be very, very hard for file systems to know about those restrictions and try to add the pages themselves. So it's much easier to do all that work up in the readahead code and then let the file system just take care of doing the I/O, which it already knows how to do. It was also a nice cleanup. I have not finished this conversion yet; it's very hard to convert the file systems which use fscache. So that cleanup is still in progress, and once it's done we'll be able to delete a few hundred more lines of code, because we'll be able to remove the VFS support for the old readpages operation and just leave the new readahead operation.

One of the things that I think I said earlier was missing was support for doing I/O to large pages: because shmem, or tmpfs, is an in-memory file system, there had really been no need to support it. Well, Ming Lei added support for multi-page bvecs. That was an optimisation, because the pages in an I/O were often physically contiguous, and if they were, we could describe the I/O with many fewer data structures, with many fewer entries in the array; that's where the performance boost was coming from. But it turned out to be critical for me, because it meant that I didn't have to split up I/Os: I could do a single I/O and say "here is a 64 kilobyte page, please read all of it". It turned out not to quite work, so I had to fix a bug, but in the end I got support for large pages into the iomap code. Now, the iomap code used to be part of XFS; it got moved out into its own code, and it's now shared with zonefs as well. We all hope, I think, that file systems will be converted to use iomap. So far many of them have not been, and if somebody's looking for a project to take on, converting your favourite file system to use iomap would be a great, and very large, project.
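Returning to that readahead API change: the file-system side of the new interface ends up looking roughly like this. The readahead_control structure and readahead_page() helper are the interfaces that came out of this work; myfs_readahead() is a hypothetical example, and a real implementation would unlock pages from its I/O completion handler rather than inline.

```c
#include <linux/fs.h>
#include <linux/pagemap.h>

/* Hypothetical example: by the time this is called, the VFS has already
 * allocated the pages (possibly compound), locked them, and inserted
 * them into the page cache at correctly aligned indices; the file
 * system only has to fill them with data. */
static void myfs_readahead(struct readahead_control *rac)
{
	struct page *page;

	while ((page = readahead_page(rac))) {
		/* ... submit I/O to fill this (possibly compound) page ... */
		unlock_page(page);	/* in real code, done on I/O completion */
		put_page(page);		/* drop the ref readahead_page() took */
	}
}

/* Wired up via the address space operations, e.g.:
 *	static const struct address_space_operations myfs_aops = {
 *		.readahead = myfs_readahead,
 *	};
 */
```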
Then, in collaboration with Dave Howells, we're working on support for network file systems, and this actually works with the fscache that I was talking about on the previous slide. Part of the problem, or the opportunity, is that Dave Howells is also working on an fscache rewrite at the same time, and so he and I have been collaborating a lot over the last few months, making sure that fscache is going to make it possible to support THPs. There'll be a certain amount of work to be done in each file system, but the fscache work will insulate the file systems from the details of how the page cache works, and so we may actually see support for transparent huge pages in all five of these network file systems before we see support in many block-based file systems.

And here's a good long list of file systems which use buffer heads, and that's not even all of them. Basically every block file system uses buffer heads, except for XFS and btrfs. And btrfs has its own set of problems: it doesn't currently support block sizes not equal to the page size, and so there are a lot of assumptions that one struct page deals with exactly one block on disk. That is being worked on, not by me, and I look forward to collaborating with the btrfs people whenever they're ready to work with me on it. They have more important things to do right now; that's fine, I'm busy enough without working on that at the moment.

So with the I/O path taken care of, we can move on to looking at what it takes to actually support large pages in the page cache. A lot of the patches that I've already got in simply remove the assumption that a compound page is exactly the size of a PMD. So instead of saying "is this page a compound page? Yes? Then do two megabytes of stuff", we ask the page: how large are you? Currently the answer always comes back as either "I'm a 4 kilobyte page" or "I'm a 2 megabyte page", because none of the other code is upstream yet, but the support is there. That was perhaps 50 patches, to rip out all of the assumptions that a compound page was necessarily PMD-sized.

Another strand of development was supporting operations which just aren't used by tmpfs, so nobody had ever done the work to make them support transparent huge pages. tmpfs doesn't support direct I/O; nobody tries to deduplicate files on tmpfs; and stable pages are a concept which only makes sense for block-based or network file systems, not for a file system with no backing store, so nobody had ever looked at the stable pages code to ask "hey, does this work with a transparent huge page?".

Shadow entries are fun. They are used when you have a page in the page cache, it goes onto the LRU, and then it falls off the end of the LRU due to memory pressure. It would still have been on the LRU if the list were purely time-based, but because there is sometimes a lot of memory pressure, the page gets removed from RAM. Rather than just saying "oh well, it fell off the end of the LRU list", we effectively extend the LRU list by putting some summary information into what's called a shadow entry. If we then refault on this page, we look at its shadow entry and ask: would this page still have been on the LRU list if the LRUs were infinitely long? If so, we treat it as a refault; and if it would have been taken off the LRU list by now because it's so old, we treat it as a new fault. This helps us track the working set of a process, and helps us make better decisions about when the process is going to need which kinds of pages.
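That refault decision boils down to something like the following. This is an illustrative sketch of the logic in mm/workingset.c, not the code itself; the function and parameter names are mine.

```c
#include <linux/types.h>

/* Illustrative sketch: the shadow entry effectively records how many
 * evictions had happened when this page fell off the LRU. Comparing
 * that distance with the size of the active list answers "would this
 * page still be resident if the LRU were longer?" */
static bool would_have_stayed_resident(unsigned long refault_distance,
				       unsigned long nr_active)
{
	/* If fewer pages were evicted since this one than the active
	 * list holds, a modestly longer LRU would have kept it: treat
	 * the fault as a refault and activate the page immediately. */
	return refault_distance <= nr_active;
}
```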
Another big chunk of work I had to do was making it possible to find large pages which were tagged. If you call something like msync(), and the range of the file that you're looking at is within a transparent huge page, and that transparent huge page is dirty, we must find that transparent huge page even though you're starting after the beginning of the page. That proved to be very hard to do, and it involved changing how the page cache works, so that one isn't upstream yet; it is still pending.

Yet another problem I had is that various page cache functions were returning individual pages rather than just the head page, or, as I should now say, the folio (they don't return the folio, they return the struct page). So when we had a compound page and we said "give me all the pages between here and there", it would actually return each of the individual pages, whereas what almost every user wants is to be given just the head pages. So I had to go and audit all the places that use those functions and make sure that they were actually okay with receiving just the head page. Some of those patches are currently in, some are not; there are, you know, about 50 patches still outstanding.

And there are some things I haven't fixed; I haven't figured out how to do them all. One of them is splitting a folio. You'd want to do this if, for example, you have a cached file in the page cache and then you punch a hole in the middle of it, or you truncate it down and need to trim off the end of the page. A lot of file systems keep sub-page information: often the block size of a file system will be smaller than the page size, and particularly if you're using a transparent huge page, the block size will definitely be smaller than the page. So file systems allocate small data structures; buffer heads, for example, are used, among other things, to keep track of individual disk blocks and whether they individually are dirty or up to date. So when we come to split a page, we somehow have to redistribute the information that the file system has stored in the head page amongst all of the sub-pages that the huge page is being split into, and that is very hard to do. I'm trying to figure out various ways to make that more efficient than it currently is. I'm currently working around it by saying we only allocate huge pages on readahead, and that makes my life a lot easier, because then a page cannot be partly up to date and partly not, so I don't have to distribute as much information between the sub-pages. But it's definitely a problem that needs to be fixed before I do the write support, because then you can definitely have pages which are partially up to date and partially dirty, and it gets to be very, very complicated. And you can't lose any of this information, because then you end up actually losing users' data, and people don't like it when you lose data.

And of course there are many, many bugs to fix; that's basically my entire life right now, fixing bugs in this patch set. And then, of course, there are things to do like allocating folios on write.
For example, we have this kernel thread called khugepaged, which tries to combine individual pages into a PMD-size transparent huge page. Right now, when it sees that we have large pages which are not PMD-size, it just gives up; it says "no, I don't know how to deal with that". It probably wouldn't be a lot of work to do; I just haven't prioritised doing it. So if somebody wants a task to take on, that's something that definitely could be done.

Despite saying earlier that I'm doing this entirely for software reasons and not for hardware reasons, there are architectures which support TLB entries that are larger than order-0 but smaller than PMD-size. One example is that ARM supports 64 kilobyte pages, and right now we don't have a good API for telling the architecture "hey, I've allocated this 64 kilobyte page; I want you to put all 64 kilobytes of it into the page tables". It would be nice to have that; again, this has just not made it to the top of my list yet.

I would also like to support folios in btrfs and ext4. There are problems with both of those file systems, but not insurmountable ones; we absolutely can use folios, under some circumstances, in both of them. It's just that I chose to start working on XFS, and it's the only one currently working. Actually, I shouldn't say the only one, because Dave Howells has AFS working with transparent huge pages as well, and I think the other network file systems are not too far behind it. I want to take a crack at NFS fairly shortly. But, you know, I have to fix the bugs in the code that I currently have before I start writing any more.

So, was it worth it? Well, kernbench is a benchmark which just compiles the kernel, which is obviously the most important workload. A colleague of mine was kind enough to run the benchmark for me, and he saw a six percent reduction in kernel time. The actual elapsed time went down, I think, from maybe 210 seconds to 204 seconds, but you have to remember that most of that time is spent in user space running GCC. The bit that we can control, i.e. kernel time, went down substantially: six percent. That's pretty good for a benchmark which we already optimise for so heavily. I've put the Git tree in my slides, so you can go off and take a look at what I currently have.

So, to summarise: we get shorter LRU lists, shorter lock hold times, fewer cache misses. A tangential effect is that it's very hard at the moment to allocate larger pages; we actually allocate larger pages for a lot of things, but it can be quite tricky, and there's a lot of fragmentation. A lot of that fragmentation is actually due to the page cache: the page cache will just allocate order-0 pages all the time, and it will make it very hard for larger pages to be created. If the page cache is allocating larger pages, then the page cache can free larger pages too, and it will make the entire system less fragmented. We do larger I/Os to storage. And maybe we can get rid of the 16 kilobyte and 64 kilobyte page size configurations.

While I'm the main developer on this, doing most of the work, this would not be possible without the help of so many people. These are the people who I was able to think of this afternoon while I was writing these slides; I know I missed some off this list, and I'm sorry for that. But yes, we have contributions from all over the landscape: memory management
people, people who are just generally interested in this kind of thing. Lots of people are giving their time to help make this possible, and I want to thank all of them. It's been really fun working on this project. We've covered a lot of ground, so thank you all for attending. If you want to ask questions, I'm looking at the chat window right now; I believe we have about five minutes left in this presentation. Well, I don't see any questions, so thank you all for attending, and I will stop the broadcast now. Please enjoy the rest of the conference.