So if you build an ARM kernel with a 64K page size instead of a 4K page size, some benchmarks run significantly better. There's this big list of reasons that managing memory in 4K chunks is inefficient. Sometimes it's really handy to only write back the 4K chunks which are actually dirty, so small pages will sometimes be a win, but just having to scan each individual 4K page and ask "are you dirty?" is quite a performance loss.

The underlying problem is that struct page is a shared data structure. Everybody uses it, and so if you grow it for one use case, you grow it for all use cases, when some users would very much like to shrink it. It's very hard to change it from its current size because somebody's always going to push back on you. So what we're doing here is largely splitting it out for each use case.

And here's where I think we're going to end up. We're going to end up with struct page shrinking to a single pointer, an encoded pointer. The bottom few bits will tell you what kind of thing we're pointing to, what kind of data structure we're pointing to, and the rest of it will actually be the pointer itself. And here are some examples of where we're going to get to. Slab is already split out; thank you, Vlastimil. We're going to do page tables, zsmalloc, hardware poison becomes its own thing, struct folio (the progression is already underway here), netmem, free mem, and dev mem. These are the ones I anticipate; exactly what will happen, we'll see. Maybe we'll have reserved memory too, or maybe reserved memory doesn't actually need a data structure allocated to it, and it will simply be a special representation in this field. But yes, you're quite right, we don't need the page flag. So we can free up the PG_reserved page flag, and we'll replace it with a reserved memdesc.

So how close are we to being able to do this? When people ask me, I generally say, oh, two years. I don't know; how long is a piece of string? I was a little bit nervous about putting up this metric, because as soon as you put up a metric and you say "here's how we can measure how we're doing", people then start trying to game that metric and say, well, we're doing so much better than last year, because look at this metric. I'm going to use a different metric next year. So don't start submitting patches saying "I've reduced the number of struct pages in the kernel by 200" just to look good when Matthew measures again next year; I'm going to choose something else. I have lots of different things I can use to measure how we're doing towards eliminating struct page from the kernel. So: we have over 1,000 commits which mention folios.
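[Editor's note: a minimal sketch of what that encoded pointer could look like. The type names, tag values, and helpers here are illustrative only, not the final design described in the talk.]

```c
/* Hypothetical sketch of the end state: struct page shrinks to one
 * word.  The low bits tag which kind of memory descriptor the rest
 * of the word points to.  All names and values are illustrative. */
struct page {
	unsigned long memdesc;
};

enum memdesc_type {			/* stored in the bottom few bits */
	MEMDESC_SLAB	 = 1,		/* points to a struct slab */
	MEMDESC_PGTABLE	 = 2,		/* points to a page table descriptor */
	MEMDESC_FOLIO	 = 3,		/* points to a struct folio */
	MEMDESC_HWPOISON = 4,		/* hardware poison tracking */
	/* ... zsmalloc, netmem, reserved, ... */
};

#define MEMDESC_TYPE_MASK	0xfUL

static inline enum memdesc_type memdesc_type(const struct page *page)
{
	return page->memdesc & MEMDESC_TYPE_MASK;
}

static inline void *memdesc_ptr(const struct page *page)
{
	/* Mask off the tag bits to recover the descriptor pointer. */
	return (void *)(page->memdesc & ~MEMDESC_TYPE_MASK);
}
```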
That's not to say that we have 1,000 commits which are doing the work of converting from struct page to struct folio. About 100 of them have the tag "Fixes:" in them, the normal fixing of bugs in earlier conversions; many of the others are fixing unrelated bugs but contain the word folio somewhere in the commit message.

But here are some of the more significant things that we have managed to do since the start of the folio project. We converted the page cache to folios. We split out slab. We recently moved all the tail page components of struct page into the folio; we no longer have a section that says "these elements are only valid in the first tail page". That's all gone. You don't have to worry about looking at some of these things and wondering what they do. We have shrunk the size of the data structure so that people are less confused when they are looking at it, which is always a good thing. There is an asterisk by the address space operations conversion because there are, I think, three functions in the address space operations which still take a struct page, but that's because we are going to delete them rather than convert them. So don't send me patches to convert the three remaining functions; the idea is that we are going to get rid of them. I know Christoph is working hard on one of them, and the others will go away when we switch all the file systems over to iomap. Pause for laughter. Thank you.

So we've converted five file systems entirely; well, three file systems and two support layers. The iomap and netfs support layers, and XFS, AFS and EROFS, all support multi-order large folios. So that's really great. We have converted ext4 and NFS to use single-page folios. So we're getting some advantage here, in that we are making progress towards using folio APIs and away from using page APIs, but we aren't addressing the fundamental problem that led us to start this project, which is that we need to manage memory in larger chunks; those file systems can still only do single-page folios. That's regrettable, but there are underlying problems that I know about for ext4. I haven't spoken to the NFS people about why they went with a single-page folio approach.

tmpfs and hugetlbfs are a little bit different to each other. They both already supported large folios, in the form of either transparent huge pages or just huge pages, so those conversions are ongoing. tmpfs at some point needs to be able to support arbitrary size folios. Right now it supports two sizes: single-page, or THP size, which is two megabytes on x86 and maybe 16 megabytes on some architectures, something in that order. But it doesn't support arbitrary sizes yet. That's just work that needs to get done at some point and hasn't been a high priority for anybody to do. So that's an open project if anyone wants to tackle it.

We have get_user_pages() converted to use a folio internally. The external API for that is still page-based, but internally it's now using folios. We converted large chunks of the memory management subsystem to use folios, at least internally. Some parts now have only folio-based APIs; some still have compatibility functions. But in general, most of the memory management subsystem is now converted, to the point where somebody came up to me last night and said, aren't you basically done? Well, no: 77,997. No, we're not nearly done. There's a lot of code outside of MM that is using struct page, perhaps more than many of us recognized. Me? I had no idea we were in quite this deep.
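[Editor's note: for readers who haven't followed the conversion, the difference between a single-page folio and a multi-order folio is visible through real helpers such as folio_order() and folio_size(). A hedged sketch; the function name is mine.]

```c
#include <linux/mm.h>

/* A filesystem that supports multi-order folios must size its loops
 * from the folio itself, not from PAGE_SIZE.  These helpers are real
 * kernel APIs; this wrapper is purely illustrative. */
static void describe_folio(struct folio *folio)
{
	unsigned int order = folio_order(folio);	/* 0 for a single page */
	size_t size = folio_size(folio);		/* PAGE_SIZE << order  */
	long nr = folio_nr_pages(folio);		/* 1 << order          */

	pr_info("order %u, %zu bytes, %ld pages\n", order, size, nr);
}
```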
I didn't think I was taking on a nine-thousand-mentions-of-struct-page problem. Maybe I should have looked into it a bit more first. Fortune favours the bold, I'm told.

We've started the buffer head conversion. This was partly for ext4's benefit and partly for the benefit of just trying to remove various users of the compatibility APIs. Every time we do a conversion from using a page API to using a folio API, we remove code from the text segment of the kernel. We actually shrink the number of instructions executed by the kernel, because we're no longer calling compound_head(). We remove 60 bytes worth of instructions each time we remove a call to compound_head(); on x86, it's something like 15 instructions, which is just terrifyingly huge. It's always justifiable to convert code from using pages to using folios, just because you can then guarantee "I'm definitely working on the head page" and ignore all of that compound_head() business. Every time we call put_page(), we call compound_head(), and we call put_page() everywhere; we have thousands of callers of put_page(). So if you can convert a put_page() to a folio_put(), it turns out to be significant.

We have a number of patch sets that are out for review. I'm as guilty as anyone of not reviewing patches; these patch sets at least weren't written by me. I've had a lot of review on my patch sets, and I'm very grateful to those who have been reviewing mine and the patch sets from others. Two I wanted to call out in particular as outstanding for review: adding the zsmalloc memory descriptor and adding the page table memory descriptor. Both of those patch series could do with further review from people who know that area of the kernel. I have the netmem series out; that's actually had a fair amount of review, so that one's on me: I need to fix up all the comments and get it reposted. Similarly for the set_ptes() patch set.

set_ptes() is a new API for architectures to use in order to set up multiple page table entries at once. This is kind of significant for architectures like arm64, which support using 64K TLB entries while using four kilobyte page table entries. This allows us to use the resources of the processor more efficiently; it is somewhat similar to being able to use THPs. This was not the original goal; this wasn't why we were doing it. A lot of memory management people and architecture people said, oh, this is the primary reason for doing it, this will increase our efficiency by x%. There is efficiency to be gained here, but it's not the primary reason we're doing it. There are also benefits on AMD CPUs: as of, I believe, the Zen 2 microarchitecture, the CPU will transparently use a 32 kilobyte TLB entry if you put eight compatible entries together. It will notice that it can use a larger TLB entry, and it will transparently do it. That's different from ARM; ARM requires this kind of interface in order to set the magic bits which tell the CPU, hey, it's safe to do this. On AMD systems, they simply notice. I'm sure there are other systems with other requirements for using larger TLB entries, and hopefully this set_ptes() interface, as opposed to just having set_pte() called N times, will be good enough to support everyone. If it's not, now is a really good time to let me know.

What should we be talking about today? This is a development conference; we have a number of talks on the schedule to tackle some of the bigger issues. Sunsetting buffer heads is next, and then we have supporting large block sizes.
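[Editor's note: to make the compound_head() point concrete, here is roughly what every put_page() call has to do today. This is a sketch of the pattern, not the exact kernel source; the _sketch suffixes are mine.]

```c
#include <linux/mm.h>

/* put_page() cannot trust that it was handed a head page, so it must
 * resolve the head page first, via page_folio()/compound_head().
 * That lookup is the ~15 instructions mentioned in the talk. */
static inline void put_page_sketch(struct page *page)
{
	folio_put(page_folio(page));	/* page_folio() calls compound_head() */
}

/* A folio is never a tail page, so folio_put() can skip the lookup
 * and go straight to the refcount operation. */
static inline void folio_put_sketch(struct folio *folio)
{
	if (folio_put_testzero(folio))
		__folio_put(folio);
}
```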
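[Editor's note: and a hedged sketch of the set_ptes() batching idea. The set_ptes() signature matches the kernel API described here; the wrapper function and the count of 16 are my illustration.]

```c
#include <linux/mm.h>
#include <linux/pgtable.h>

/* Mapping a 16-page (64K) folio with one call instead of sixteen.
 * Because the architecture sees the whole run at once, arm64 can set
 * the contiguous-PTE hint that lets the TLB use a single 64K entry. */
static void map_folio_sketch(struct mm_struct *mm, unsigned long addr,
			     pte_t *ptep, pte_t first_pte)
{
	set_ptes(mm, addr, ptep, first_pte, 16);

	/* The old way, opaque to the architecture: N separate calls,
	 *   for (i = 0; i < 16; i++)
	 *       set_pte_at(mm, addr + i * PAGE_SIZE, ptep + i, ...);
	 * with no way to know the entries form one contiguous run. */
}
```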
Both of these are enabled by the folio work, supporting large block sizes especially; sunsetting buffer heads is just going to be a good idea anyway. If we can get rid of buffer heads, that's a whole bunch of code that doesn't need to be converted. We have better interfaces available for get_user_pages(), or we hope to have them, and Jason's going to be talking about that this afternoon. Luis is going to be talking about converting more file systems over to using multi-page folios, and that's tomorrow.

The one I'm most interested in is multi-page anonymous memory. Most of the work that's been done so far is file system based, because that's the code that I knew the best, but it's also going to be very important to have anonymous memory managed in larger chunks. I hope to not do any of that work; I think I've demonstrated quite thoroughly over the last couple of years that I have no idea what I'm doing with anonymous memory, so I'm really excited to see other people stepping in and doing this work. I was hoping, in James' talk yesterday, we might get to the meaning of mapcount, and of course we had a grand argument about it and came to no conclusions, which didn't really surprise me. Trying to understand how mapcount works is tricky.

Just for Mike: Mike pointed out yesterday that we have a shortage of GFP flags and blamed me for it, so I thought I'd help out by talking about how we could get one of them back, and that is to reclaim the __GFP_COMP bit. The MM people at this point are thinking, why is he explaining this, and the file system people are all thinking, what on earth is this? Bear with me. The top illustration here represents what happens if you don't use __GFP_COMP and you say, I want to allocate an order-2 page. You get, well, very little. You get your memory, but the memory management system does not fill in very much, and this annoys the hardening people, actually, because they want to know: does this copy_to_user() extend beyond the boundaries of the allocation? If you look at the bottom one, that's what you get if you do set the __GFP_COMP bit. You can look at an address and say, what page does this belong to? Maybe it's the second one. For that second page, you can see that it is a tail page: it has compound_head set. Then you can go from your tail page to your head page, and from your head page you can see the order of the page, which happens to be stored on the second page, but you didn't need to know whether you were on the second page or third page or fourth page. So from the order you get the size of the allocation, and the hardening code can check: does this memcpy() or copy_to_user() extend beyond the boundaries of this allocation? Try to do that with a non-compound allocation and it doesn't work. This has actually been a problem: they tried to do it, the bug reports flowed in, and so the hardening people disabled the check. We can't do this at this point. So Kees Cook has been on me for a while: can you fix this? Can you fix this? Can you fix this? Well, we've got a lot of users to go through who are doing non-zero order allocations, making sure they do in fact set the compound flag.

There are some other grotty details with doing a non-compound higher-order allocation. For one thing, if you do a compound allocation, you can just use put_page() or folio_put() in order to get rid of it, in order to free it again.
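[Editor's note: a hedged sketch of the difference the slide illustrates. The allocator APIs are real; the demo function and the specific order are mine.]

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/* With __GFP_COMP, any page of the allocation leads back to the head
 * page, where the order (and thus the allocation size) is recorded. */
static void compound_demo(void)
{
	struct page *page = alloc_pages(GFP_KERNEL | __GFP_COMP, 2);
	struct page *tail, *head;
	unsigned int order;

	if (!page)
		return;

	tail = page + 3;			/* last of the four pages */
	head = compound_head(tail);		/* == page, via the stored link */
	order = compound_order(head);		/* == 2 */

	/* Hardening code can now bounds-check a memcpy()/copy_to_user()
	 * against PAGE_SIZE << order.  Without __GFP_COMP, compound_head()
	 * just returns the tail page itself and no order is recorded, so
	 * no such check is possible. */
	__free_pages(page, order);
}
```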
Whereas if you're doing a non-compound allocation, you have to call free_pages(), and you have to tell free_pages() what the order of the allocation is, and maybe you got that wrong, and now you have corruption. And there was a race that we had until about three years ago, when I fixed it, and I fixed it badly, and somebody else fixed my fix, and it's all really grotty code trying to figure out how to handle spurious refcounts. So what I would like to do is get rid of non-compound allocations. I do want to say it is perfectly safe to set the __GFP_COMP bit on order-zero allocations; the page allocator is smart enough to not do anything in that case. So you don't have to worry that doing an order-zero allocation with __GFP_COMP set is going to start corrupting other pages; that doesn't happen. So we could just say __GFP_COMP is now always enabled, which would free up a flag and make Mike happy.

So I thought I'd talk about some of the pitfalls that we have. We've changed some of the semantics, hopefully for the better. For example, some functions now return a bool or an errno when they used to return something different, and that's sometimes tripped people up who are doing conversions and not thinking about the conversion very thoroughly; they're just doing an automated replacement rather than looking at the function signatures and modifying the code correctly. But hopefully they're better semantics. Generally, when we've changed the semantics, it's been for a good reason; we didn't do it lightly. Sometimes when people are doing conversions, they need to think a little bit more carefully about what they're doing. For example, in some file systems, we check for the end of file by shifting the byte address down and checking whether that is equal to the index of the page that it's in. Well, that doesn't work for a multi-page folio. If you shift the offset in the file down, it now needs to lie within the range of the folio. And we have a helper function for that, because that's a fairly common thing to want to do: we can ask, is the size of the file within this folio? We just need to know to use it. Rather than doing the mechanical conversion, we need to think about what the function is doing and use the appropriate helper that has been provided (see the sketch after this paragraph).

When you're doing conversions, if you see almost anything that has the word page in it, particularly in upper case, you probably need to think carefully about exactly what it's doing. So if you see PAGE_SIZE, it's like, well, maybe I can just replace that with folio_size(). But if you see MAX_BUF_PER_PAGE, that's how many blocks there can be in a page, and we have several places where that's used to size an array that's allocated on the stack. You can't do that for an arbitrary size folio. I mean, yes, C, and GCC in particular, does support having variable size arrays on the stack, but if you have a two megabyte folio and a 512 byte block, that's going to be... what, sorry? Yeah, stack overflow, exactly. And we actually already have compiler warnings on the hexagon architecture for those functions, because that has a 256 kilobyte page size config option, and that already blows the stack out of proportion. So we're just going to be fixing up bad code here if we can get rid of those arrays.

Yes, James. OK, I'll just repeat the question. The question was: in SCSI, we just need some memory, so we allocate a single page; what is the intended replacement for that, rather than just calling get_free_pages()?
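[Editor's note: a hedged sketch of that end-of-file check conversion. folio_contains() and i_size_read() are real kernel helpers; the wrapper function name is mine, and whether this is the exact helper the speaker meant is an assumption.]

```c
#include <linux/pagemap.h>

/* Old page-based pattern:  (i_size >> PAGE_SHIFT) == page->index.
 * That equality test breaks for a multi-page folio: the EOF index
 * merely has to lie somewhere within the folio's range, which is
 * exactly what folio_contains() tests. */
static bool folio_holds_eof(struct inode *inode, struct folio *folio)
{
	return folio_contains(folio, i_size_read(inode) >> PAGE_SHIFT);
}
```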
The question is, is it kmalloc()? And the answer is yes, it's kmalloc(). A lot of this code comes from legacy times, when it was faster just to do what you were doing; it was the right decision at the time. But since then we've improved the slab allocators, and it is just as fast to call kmalloc(). You actually get back a pointer instead of an unsigned long, freeing it is better, you get error checking; there's just so much. It's just worth doing. Yeah, so, I mean, you can say, well, that's code that hasn't kept up with the times, but honestly, we have, what, 11 million lines of code in the kernel these days; there's so much code that needs to be updated. Should he continue to use kmalloc(PAGE_SIZE)? Sure. Why not? If it works for you, then keep doing it. Just because you're seeing a red flag doesn't mean it's not a relationship worth entering. Moving on.

So we do have a problem with kmap_local_folio(), and I wanted to expand on that, but honestly it's almost a topic for an entire talk by itself. I would really love us to use highmem a lot less. There are definitely places where I'm going to be talking very seriously to file system people: hey, maybe we just stop supporting putting directories in highmem. Perhaps directories can always live in lowmem these days, given the current state of 32-bit systems that actually run a modern Linux. I'm running out of time, so I just want to say: if you're asking the question of which of these five functions to use, you've probably made some poor life decisions, and if you can't figure it out for yourself, come talk to me. Don't just do a direct replacement if you happen to come across one of these functions. Most of us won't.

Last slide, I promise. I've heard from some file system people saying, oh, are we going to run into locking problems, because now we only have one lock per folio instead of one lock per page? And I don't quite know how to answer this question, because I don't quite understand what question they're asking. There has only ever been one lock per folio; sorry, one lock per compound page. So if you call lock_page() on a tail page, it has always locked the head page. But in a different sense, it is a very legitimate question, because it used to be that you got one lock per four kilobyte chunk of memory, and now you're getting one lock per chunk of however much the VFS has decided to allocate for you to read into. But the real answer to this is that the page lock is not highly contended. The longest hold that we have on the page lock, or now the folio lock, is while we're doing IO in order to bring the folio up to date. So it is taken at the point that the folio is allocated, and it's freed when the read completes; I'm sorry, the lock is released when we bring the folio up to date. So the contention that you're seeing, particularly from the memory management point of view, when you're saying, oh my god, the mmap lock is held and we need to drop the mmap lock in order to wait on the IO: yes, but that's not really the problem. The problem is you've just got to wait for IO to complete. It's not that we really have significant lock contention on that lock.

Similar kinds of things apply to the fact that we only track accessed or dirty per folio instead of per page. We can still track dirtiness at a per-block level, but that's now up to the file system. The page cache is keeping track of it per folio, and will tell you, yes, this folio contains dirty blocks; then it's up to the file system.
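[Editor's note: a hedged sketch of the replacement discussed in that answer. Both APIs are real; the demo function is mine, and whether a particular driver should convert is the judgment call the speaker describes.]

```c
#include <linux/gfp.h>
#include <linux/slab.h>

static void buffer_demo(void)
{
	/* Legacy pattern: returns an unsigned long, no slab bookkeeping,
	 * and freeing requires matching the allocation exactly. */
	unsigned long old = __get_free_page(GFP_KERNEL);
	void *buf;

	if (old)
		free_page(old);

	/* Modern equivalent: a real pointer, error checking from the
	 * slab allocator, and just as fast these days. */
	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
	if (buf)
		kfree(buf);
}
```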
It can choose whether or not it's going to keep track of dirtiness at a per-block level. I've seen buggy patches that think the hardware poison flag is per folio. It's not; it's per page. This is a misfeature of our APIs. I need to figure out how to poison the API so that you can't actually ask "does this folio have the poison bit set?", because you're actually only testing the head page of the folio. That's a simple matter of programming. I left this bullet on the slide, so I'm deleting it; sorry about that. That is everything I have, and that is my time. But yes, James?

You began this by comparing ARM on 64K pages versus 4K pages, and you said some workloads work better. After you've done all of this (because on x86 we can't do this), do you think that performance gap will come down sufficiently, or should we also be trying to pressure Intel to use larger pages for the CPU? Because, as you said, if we can manage memory in larger chunks, and perhaps two megabytes is too large a chunk to manage memory in, should we have intermediate page sizes?

I don't think we need to pressure Intel. I think people who buy CPUs are going to be pressuring Intel. Intel have been aware of this problem for a while; I was at Intel when they noticed that they had this problem, and that was a few years ago.

Well, we can't directly influence the CPU design, but we can make larger allocations happen so much more frequently that Intel have no choice but to add the feature to their CPUs. Is there still going to be pressure on Intel, or is managing pages better going to improve things enough?

What I have seen in benchmarks on other CPUs is that about half of the performance improvement comes from just managing memory in larger chunks, and half comes from actually using the 64 kilobyte TLB entries. So there is still going to be significant pressure on Intel to do better than they currently do.

I mean, what you said earlier, that AMD performs it implicitly...

Yes, Intel can choose to do exactly what AMD are doing. Once we use folios to manage everything in larger chunks, they can do the optimization in their CPU without us needing to do anything.

So, just a comment. I think for some cases it's really important that we still have small pages around. Think about page tables: a page table occupies 4 kilobytes, so if you manage your memory in 64 kilobyte chunks, you end up eventually wasting a lot of memory, or you have to play all of these hacks that some of the architectures do to squeeze multiple page tables into a single page. If we could find a way to keep small pages and still get the benefit you described by using these higher-order folios, just to make the hardware happy and fast, it would be awesome. The best of both worlds, I would say.

So that basically is what we are doing. We are not going to suddenly start allocating page tables in 64K chunks. I mean, if you build an ARM kernel with a 64K page size configured, it does then start allocating page tables in 64K chunks, but we are not talking about doing that. We are talking about keeping the base page size at 4K. You can still choose to allocate 4K pages; page tables will still be 4K in size. But you can choose to allocate your anonymous memory and your page cache memory in whatever size actually makes sense for that file. So if you are mmapping a 4 kilobyte file, we are still going to allocate a 4 kilobyte page; we are still going to manage that particular 4K chunk.
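[Editor's note: since hardware poison is tracked per page, as the speaker says near the start of this passage, a correct check has to walk every page in the folio. A hedged sketch; the helper name is mine, and PageHWPoison() depends on CONFIG_MEMORY_FAILURE.]

```c
#include <linux/mm.h>
#include <linux/page-flags.h>

/* HWPoison is a per-page flag, so testing only the head page (which
 * is what a naive folio-level test does) can miss a poisoned tail
 * page.  Walk every constituent page instead. */
static bool folio_any_hwpoison(struct folio *folio)
{
	long i;

	for (i = 0; i < folio_nr_pages(folio); i++)
		if (PageHWPoison(folio_page(folio, i)))
			return true;
	return false;
}
```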
More importantly, if you are mapping a larger file, or call malloc() and get some anonymous memory, that can still be mapped at an arbitrary page granularity. We are not breaking backwards compatibility, unlike when you build with a 64K page size and all of a sudden you can only map things at 64 kilobyte granularity. So, just like with a THP, you can choose to map it any way you want to.

That's perfect. I guess we should press Intel or whoever to support other page table layouts, but not to support higher-order page granularity, essentially. Like, we want 4K pages; everybody wants 4K pages. That's kind of baked into the computer industry at this point.

The Android team is looking at changing the page size from 4K to 16K, because we expect it will significantly improve performance of the memory management system. The disadvantage we are aware of is that it will increase the IO overhead, because when a page is dirty, it will now be written back at 16K granularity. Is this something that would be interesting for users of Intel CPUs: support for a page size of 16K?

Well, I don't control what Intel does, and this is kind of not what we're talking about, right? We're talking about improving software. When it happens to improve hardware performance as well, that's great, but really we're trying to improve how our software works.

Bart, just to answer you: I guess quite a few of these things will be answered if and when we are able to do large block IO, because that's precisely it. Then you could evaluate what the benefits of using large block IO are, and you can switch over or not depending on the results.