Well, I'm Mike from IBM, and I'm going to talk about direct map fragmentation. Originally the talk was intended to be about how we implement some extensions to the page allocator to reduce direct map fragmentation. After running some benchmarks, I'm not convinced we need this; it's more that we probably don't really care about direct map fragmentation. Probably there is a question, yeah.

So there are a few use cases that fragment the direct map. These are all the places that allocate memory for code; there is secret memory (secretmem); and there are potentially SEV-SNP and TDX, although I'm not sure they would be able to get away with it, because they don't preallocate memory but rather fragment it as they go, and I don't really see how they can get around that. There was also a use case of PKS protection for page tables that Intel pushed a while ago and then stopped; probably they'll get back to it, I don't know. And there was a patch set for a protected malloc, but that was mostly targeting code allocations, so it might not be really relevant for general usage.

When I talked with Michal about the __GFP_UNMAPPED patch that I posted, I realized that people in MM probably don't really know much about how code is allocated, so here is a brief recap. Essentially every place that allocates memory for code, like ftrace, kprobes, BPF and modules, uses module_alloc(), which is re-implemented by most of the architectures. It essentially does a vmalloc inside a restricted virtual address range, because every architecture has its own restrictions on where code can be placed relative to the kernel image itself. And whenever strict module RWX is enabled, all the memory that was allocated for the whole ELF image of the module, code, data and everything else, is then split into 4K chunks so that read-only data gets its read-only attributes, executable code becomes read-execute and so on. The attributes are set not only in the vmalloc address space but also in the direct map alias, which causes a split of large pages in the direct map, and allegedly that reduces system performance (there is a rough sketch of this flow below).

So a while ago I sent a patch to create yet another cache of 2M pages to reduce the fragmentation of the direct map: once we need to allocate a page for something that needs protections different from the defaults, we allocate a 2M page in one go and then distribute that 2M page between the users. The whole patch and discussion is there; this is just a brief recap.

And then I went to get the numbers. This is a description of the benchmark setup: an AMD Zen 3 machine with 256G of RAM and two sockets, running SUSE Leap, because SUSE is the easiest system to run mmtests on. The benchmarks compared vanilla 6.3-rc4 with 6.3-rc4 plus __GFP_UNMAPPED applied. I used the page allocator microbenchmarks from mmtests, and as I said, the run also included several database benchmarks and kernbench, plus one more benchmark added to the mix because, when I did similar tests a long time ago, it was the most sensitive to differences in the direct map representation. For every run I had a background job doing modprobe and modprobe -r on 37 megs worth of netfilter modules, just to keep the system busy fragmenting the direct map.

And the results are peculiar. There were tests that showed an advantage to using __GFP_UNMAPPED, and there were tests that showed it introduces regressions.
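For the MM crowd, here is a minimal, hypothetical sketch of the flow described in the recap above, loosely modelled on the x86 module_alloc() path. The example_* names are made up for illustration, the rest are existing kernel interfaces, and real module loading does considerably more than this:

/* Hypothetical sketch, not the actual module loader code. */
#include <linux/vmalloc.h>
#include <linux/moduleloader.h>
#include <linux/set_memory.h>
#include <linux/mm.h>

static void *example_module_alloc(unsigned long size)
{
	/*
	 * A vmalloc restricted to the architecture's module area, so the
	 * code lands within branch range of the kernel image.
	 * VM_FLUSH_RESET_PERMS makes vfree() restore default permissions.
	 */
	return __vmalloc_node_range(size, MODULE_ALIGN,
				    MODULES_VADDR, MODULES_END,
				    GFP_KERNEL, PAGE_KERNEL,
				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
				    __builtin_return_address(0));
}

static void example_protect_text(void *text, unsigned long size)
{
	unsigned long addr = (unsigned long)text;
	int pages = PAGE_ALIGN(size) >> PAGE_SHIFT;

	/*
	 * With strict module RWX, each ELF section gets 4K-granular
	 * permissions.  set_memory_*() updates both the vmalloc mapping
	 * and the direct-map alias, and changing a 4K sub-range of the
	 * alias is what splits the 2M/1G direct-map pages.
	 */
	set_memory_ro(addr, pages);
	set_memory_x(addr, pages);
}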
Most of them had a terrible signal-to-noise ratio. The whole set of results can be downloaded from there; here are just quick examples. The microbenchmarks are generally not very sensitive: every difference is below 10%, and the standard deviation is usually not higher than 50%. But when you go to something more complex, like pgbench or mutilate with memcached, you get a more interesting spread, and here it's the most interesting: the standard deviation for both benchmarks is way larger than it should be. It kind of confirms the set of tests that Intel folks ran a couple of years ago, where they just used 4K or 2M pages for the entire direct map and ran a set of benchmarks. What they got was that it's largely better to use 1G or 2M pages for most of the use cases, but it's not necessarily true, and those tests were as noisy as these.

So, my takeaways from this. Another thing that seems really important to people is that they do want 2M pages for code, especially Thomas and Peter. You hear them constantly repeating that we need 2M pages for code and that we are better off with lower iTLB pressure. I can confirm that lower iTLB pressure is not bad: a while ago, with a more disruptive test, fragmentation in the kernel image mapping actually brought memcached benchmarks down by 10-20%. And there were benchmarks Song ran for his execmem_alloc proposal that showed a couple of tens of percent improvement in iTLB misses. I don't know how that would really translate into real benchmark or real workload improvements.

So Thomas claims that the iTLB is very small. The twin assumptions here are that the iTLB itself is small, and that if we take a TLB miss it's a very expensive process to fill it. One reason for your results, and for the huge error bars, might be that underneath, the CPU is actually doing prefetching to cover up for this; that might be why the error bars are so big, because you can't see what the CPU is doing in the background. The other thing is, I thought most CPUs had moved away from separate iTLB and dTLB, and they all use a combined TLB. It depends on which level. So is Intel telling us that they still have iTLB/dTLB separation? As far as I know, yes. And the results I have are not related to the iTLB; the results I have are about data, so only about the dTLB. So I don't know why the iTLB would be better. But Thomas said that if you put all the hot code into 2M pages it will be just way better; I don't know what benchmark he ran and what results he had, it's just a quote from the mail discussion. That's a rule of thumb. I mean, if you've got 2M pages in the iTLB, you've got one entry. And the question is whether you get them; you'll take a miss whenever you get into the kernel anyway.

Vlastimil? Yeah, I just wanted to note that you ran the benchmarks on AMD and not Intel, right? Yes. So that's kind of also... I don't think there's that significant a difference; I ran them on Intel once and it was quite similar. So in answer to James's question, I'm looking at my 11th gen Core i7 here: it has separate L1 iTLB and dTLB, and it actually splits the dTLB further into a store-only TLB and a load-only TLB. So they're adding more, not unifying. Well, that's what it reports.

So another question, if you can put that slide back, yeah. What is Peter's solution for everybody using two megabytes? Can you repeat that, Vlastimil?
What is Peter Zijlstra's solution that you linked there, the one everyone should use? I hear him repeating this mostly on IRC and in different threads, that we need it. There was a solution in one of the earlier execmem_alloc patches that Song proposed: Peter suggested using two allocation areas, one for module data and one for module code. Then everything that goes into module code can be mapped with 2M pages, and that solves the issue. Yeah.

So my question would be, have you tried to compare your results with the worst case, where you split everything into 4K pages? I did that last year, yes. What I measured previously, when I just made the entire direct map flat 4K or 2M, was a degradation of a single-digit percentage in most of the benchmarks. Sometimes 4K showed better results for some reason, and most of the time 2M pages won on the system I had then, but the difference never went above 5%.

Another question. Since you use the background module load and unload to perturb the system, did you also try not doing that, to see how much these perturbations actually affect things? Because the differences between results were more or less the same. No, I mean not between the non-patched and patched kernels, but between non-perturbed runs and perturbed runs, whether it even does something to the workload, to the benchmark. I don't think I did that; I can do it, it will take a couple of weeks, yes. I didn't check this, but I did the test with just fragmenting everything to the... Of course, it can have more effects than just the fragmentation, because it can be taking some locks and using one core.

Yeah, an additional question. When you run that perturbation workload in the background, is it going through the slow path, where you unmap and map again so you also have the overhead of the TLB flushes, or are you using your caching capabilities? Whatever happens on the vmalloc side is pretty much unchanged: module_alloc() calls vmalloc, vmalloc finds an area and then gets the pages. The only thing that changes on that path is alloc_pages(): instead of going to the normal page allocator, it goes to the cache that implements __GFP_UNMAPPED (there is a rough sketch of that idea below). The rest of the allocation and mapping and unmapping is the same. Okay, so you are not benefiting from the caching with respect to TLB flushes when you are changing page tables? Well, the TLB flushes in the vmalloc area are pretty much the same. So eventually it could be even faster if you used the caching to the full extent, so that you do not have to change page tables because it's already unmapped? You can optimize some of the TLB flushes there, yes.

So my takeaways were that for data allocations we don't really care about direct map fragmentation. We probably do care for code, and again, it needs to be benchmarked; the Facebook folks say they get about 1.6 to 1.9 percent improvement when they do not fragment the direct map. And Thomas and Peter want modules, everything with code, in 2M pages. So maybe we should just concentrate on implementing code allocations in large pages, the way Peter suggested and the way Song is working on now: changing how module_alloc() works so that it differentiates between different types of memory, so that executable, read-only and so on are not all allocated in a single chunk. And, well, my motivation was that we should certainly enable secretmem by default, no question about it.
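To make the caching idea more concrete, here is a rough, x86-centric sketch of what such an alloc_pages() hook could sit on top of: grab a 2M block, drop it from the direct map in one go, and hand out 4K pages from it. The unmapped_cache_* names are invented for illustration and this is not the posted __GFP_UNMAPPED code; a real implementation also has to restore the direct map when pages are freed back to the allocator.

/* Illustrative sketch only; not the posted __GFP_UNMAPPED implementation. */
#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/set_memory.h>

#define UNMAPPED_CACHE_ORDER	9	/* 2M worth of 4K pages on x86 */

static LIST_HEAD(unmapped_cache);	/* free 4K pages, already unmapped */

static int unmapped_cache_refill(void)
{
	struct page *page = alloc_pages(GFP_KERNEL, UNMAPPED_CACHE_ORDER);
	unsigned long addr;
	int i;

	if (!page)
		return -ENOMEM;

	/*
	 * Drop the whole range from the direct map at once.  Because the
	 * block is 2M aligned, this clears one PMD instead of splitting
	 * the surrounding large mapping into 4K PTEs piecemeal.
	 */
	addr = (unsigned long)page_address(page);
	set_memory_np(addr, 1 << UNMAPPED_CACHE_ORDER);

	/* Turn the high-order block into 512 order-0 pages to hand out. */
	split_page(page, UNMAPPED_CACHE_ORDER);
	for (i = 0; i < (1 << UNMAPPED_CACHE_ORDER); i++)
		list_add(&page[i].lru, &unmapped_cache);

	return 0;
}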
And whenever we see that we actually need to deal with direct map fragmentation, we can go back to __GFP_UNMAPPED and take a look at it again. And something else, a side note I noticed. That wasn't me.

So what he's talking about here is the radix tree. The radix tree uses the top three bits of the GFP flags to encode various things; it doesn't really matter exactly what. The point is that the XArray doesn't need this, and if we finish the conversion from the radix tree to the XArray, you can have those bits back (a rough illustration of the encoding follows below). It's just that there are like 150 different things using the radix tree right now and they need to stop. They really need to stop. But, I mean, you know, I need more minions to... Oh, Vishal, you're right here. Vishal, nice. Yeah. No, I mean, it's the kind of thing that makes a great outreach project, but it really needs somebody to actually go in and learn how each subsystem is using the radix tree in order to do the conversion, because I've done these tree-wide kinds of conversions for the XArray and it's really easy to make a mistake and break who knows how many subsystems. You've got to know the code that you're converting; that's the problem. So the question is, what happens if we increase GFP to 64 bits? We should just not abuse GFP flags for every random single-purpose thing. Like we have three flags for KASAN tagging, right? But that one seemed actually appropriate, you know.

Did you run any benchmarks using two-megabyte pages for secretmem? Would that actually make a difference for some of the secretmem workloads you imagine, like we would have for code, or is it just to avoid fragmentation rather than to benefit? The major pushback against secretmem was "you're breaking our direct map", right, because it doesn't need the direct map; it wants to be unmapped from the direct map. But when you map it into user space, it gets a page cache entry, right, and it gets mapped into user space. It doesn't really matter, I suppose, because it won't be hot memory anyway.

Okay, but what I understand is that we actually have different classes of direct map users: some of them don't want write access, some of them don't want any access. Are there any other users? The direct map is largely read-write; that's the default. Sometimes we remove the write. Sometimes we add execute and then remove the write; that's mostly what code allocation does, module_alloc() does all these tricks. I don't think there are a lot of other places that do anything to the direct map; if you grep for set_memory_something, there are not many of them, and most of them are there for some security reason. Like the page table thing, where you most probably don't want write access. Well, page tables are read-write anyway. Yeah, I think you mentioned... Yeah, there was a patch set from Intel about using PKS protection for page tables, where they protected those pages so that the direct map mapping is read-only and you could access page tables for write only through helpers that switched PKS on and off, something like that.

Okay, so what I'm trying to see is whether there is any generic way we could have something that achieves that. As we learned, using a GFP flag is most probably the wrong way of doing it, maybe. But could we have some other allocator on top of that which is able to provide such classes of memory, or does it not make any sense? For most cases, the free path gets really bad in this.
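As a rough illustration of the radix-tree point above: the tree root keeps its allocation mask and its tag bits in one gfp_t word, with the tags packed above __GFP_BITS_SHIFT. The macro below is a simplified stand-in for the real definitions (see include/linux/xarray.h and lib/radix-tree.c), just to show why new GFP bits would collide with it until the remaining radix-tree users are converted:

#include <linux/gfp.h>

/*
 * Simplified illustration, not the kernel's actual definition: tag N of a
 * radix-tree root lives at bit __GFP_BITS_SHIFT + N of the root's gfp_t
 * flags word, so three tags occupy the bits just above the real GFP flags.
 */
#define EXAMPLE_ROOT_TAG(tag) \
	((__force gfp_t)((1U << __GFP_BITS_SHIFT) << (unsigned)(tag)))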
Could I ask a more provocative question? Basically the whole of our theory about using allocators like this, and the problems they cause, exists because we have a theory about the way we think the TLB works. But these results seem to be showing us that the TLB doesn't quite work the way we think it does. If you look at how machines have evolved, I think we're suffering from two things. One is that the huge cost of a TLB miss comes from the hardware page table walker, which we assume is really expensive because it's all indirectly mapped; but I think both Intel and AMD are speculating on that, so the cost of a TLB miss might not be as high as we think. And the other thing is I have a funny feeling that this TLB is not 4K, 2M, 1G; I bet it's extent based. That depends on the microarchitecture. I agree, but we attach a huge amount of memory management to a theory about TLBs that may have been right 10 years ago but may not be right today, and I think we could do with getting some CPU people under torment to clarify exactly what we're supposed to be caring about in the TLB before we start evolving elaborate theories about how to reduce TLB pressure. My theory, one of my theories, was that normally you have most of your TLB allocated to user space anyway, so whenever you get into the kernel you take a TLB miss anyway.

I mean, Intel actually tells us how the TLB works, right? There are the L1 TLBs and the L2 TLB. The L2 is actually unified, and the L1 has separate instruction TLBs for 4K pages and for two-megabyte pages. They're really pretty open about how it works, because this is how you get good performance, and if they were lying about it, people would have figured that out by now. They definitely have things in the microarchitecture that are not public, but it's all caching, not additional features. Actually, what they do have, and it's public, is that there are partial caches for the page walks, so you walk only the last one or two levels and not all five of them when you miss in the TLB, because the partial walk is also cached. And they have an additional cache for page walks as well.

But I was also wondering, besides the TLB, what about the memory overhead of needing the extra page tables? Is it a concern, or are they pre-allocated anyway? For the module use case, I don't think it's so much that we really care; 4K per two megabytes is not that much (see the back-of-the-envelope numbers below). What do you mean? The extra page table level you need to do the splits. We do the splits anyway at that point. But if you avoid them... No, today we do the splits; if we do some caching, we might do fewer splits. We will have to split anyway, but let's say with __GFP_UNMAPPED for modules you'll have fewer splits, because the memory will be unmapped in the direct map and only mapped in vmalloc. But I don't think it's that large an amount of page tables. Yeah, but maybe if we enable secretmem, can it happen that a few secretmem pages get spread all over the direct map, and then each of those two-megabyte chunks needs a new four-kilobyte page... I wouldn't expect massive usage of secretmem; it's limited anyway, so you can't get more than 64K unless you're root and you really want it. So there will be some overhead, but adding another cache to the page allocator is probably not worth it, and increasing the pageblock flags by four bits, like... We'll offset that from the other side.
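To put a number on the "4K per two megabytes" remark, here is a trivial back-of-the-envelope calculation (userspace C, assuming x86-64's 8-byte PTEs and 512-entry page tables):

#include <stdio.h>

int main(void)
{
	const unsigned long pte_size = 8;		/* bytes per x86-64 PTE */
	const unsigned long ptes_per_table = 512;	/* entries in one 4K table */
	const unsigned long table_bytes = pte_size * ptes_per_table;	/* 4096 */
	const unsigned long region_bytes = 2UL << 20;			/* 2M */

	/* One extra page-table page per 2M mapping that gets split. */
	printf("overhead per split: %lu bytes (%.2f%% of the 2M region)\n",
	       table_bytes, 100.0 * table_bytes / region_bytes);
	return 0;
}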
I wonder, how much does the cache prevent the direct map from being fragmented? It's an order of magnitude. I didn't bring the numbers, but if you look at the number of 2M and 1G splits in /proc/vmstat (there is a small sketch of checking those counters below), it brings them down order by order of magnitude. And without the cache, is it fragmented into all-4K mappings? With the cache it isn't necessarily fragmented, because the 2M pages are removed from the direct map and then, let's say for modules, the same memory is mapped in the vmalloc area. The vmalloc area is mapped with 4K pages anyway, and the direct map doesn't need to be fragmented because the entire 2M is removed from it. I mean, without the cache, when you run the background job, how much does it get fragmented? For as long as the page_alloc microbenchmarks run, I get several hundreds of splits, and with the __GFP_UNMAPPED cache I get several tens of splits. That says it's time, so thank you.
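As a closing pointer, the split counters referred to above can be watched with something like the snippet below; this assumes an x86 kernel that exposes direct_map_level2_splits (2M to 4K) and direct_map_level3_splits (1G to 2M) in /proc/vmstat:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	/* Print only the direct-map split counters. */
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "direct_map_level"))
			fputs(line, stdout);
	fclose(f);
	return 0;
}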