So, we're being recorded. Well, I just want to introduce myself. My name is Mike Kravetz, and I started working on Linux in, I think, the year 1999 or 2000. Since this is a mentorship series, maybe I'll share a little experience that I had: my very first patch, which I thought would be a good idea to submit to the Linux kernel mailing list, was a complete rewrite of the Linux scheduler in one monolithic patch. I'm not suggesting that anybody try that; as you can imagine, it didn't go over very well. But that was my introduction to Linux and to open source, and even though I was flamed quite a bit for it, I continued on and have been working in the Linux kernel for, I guess, 22 years now.

For the last several years I've been working in the area of memory management. Specifically, I'm now the maintainer of the hugetlbfs file system, which is a file system that makes huge pages available to user applications. So when I was asked to talk about memory management in this series, I thought the best thing to talk about is something that I know a lot about, which is the hugetlbfs file system. But to get to huge pages in Linux, you kind of have to go back to the very beginnings of virtual memory. So for the topics covered in this talk: I'm going to start with some very basic concepts of virtual memory, and this may be very basic and very elementary for many people on the call, but I think it's important to establish a baseline. Once we talk about those basics of virtual memory, we can talk about how huge pages fit into that virtual memory model, and then finally get to how huge pages are exposed to applications and used in Linux. And at the very end, we can talk hugetlbfs specifics, which is really my area of expertise. So one of the expectations to set here, as we talk about virtual memory and the virtual memory model:
I am by no means an expert in this area. I know the concepts, and I know how they fit in with huge pages and with the work that I do, but I may not be able to answer all questions in this area; I will help as much as I can. Another thing to keep in mind is that a lot of the examples I'm using will be based on Intel architecture, simply because that is what I am most familiar with. Again, my expertise really is in the hugetlbfs file system area, although I have pretty decent general knowledge of Linux memory management.

So with that, let's get going and talk about the basics of virtual memory. Every system has memory, RAM, associated with it, and we have processes that we want to run. So the basic question is: how do we make that system memory available to processes? We use what's called a virtual memory, or on-demand memory, model to do this. The first thing that we do is split system memory up into little chunks, which we call pages. On x86, I'm sure you all know, a page is 4K, four kilobytes, in size. One thing to also notice is that for each page, each little page in the system, we have a small data structure called a struct page which describes it. The combination of all of those struct pages put together is something that we call the memory map: there's a struct page for each page in the system, and it describes that page. Those struct pages have some fields that describe the page. There are flags which describe the state of that particular page, things like whether it's locked, dirty, active, up to date, whether it is poisoned (has a memory error associated with it), or PG_head, which is for compound pages. It also has a reference count, how many holders actually have a reference to that page, and a map count, which is a count of how many times that page has been mapped into user space. Now let's look at the struct page definition in Linux.
It's a huge union data structure, but it's really only 64 bytes in size. And again, there's one of these for each page in the system.

So if we have system memory divided into pages, and we want to make that system memory available to processes, then it makes sense that a process's virtual address space is also divided into pages. No big deal there, but we can also have multiple processes trying to get at that system memory. In this simple chart here we have three processes, all with a virtual address space, perhaps even the same virtual address space; in other words, the virtual addresses for process A could be the same as the virtual addresses for process B and process C, and they all want to get to system memory. So how do we do that? How do we translate process virtual addresses to system memory? The way that you do that is via something called page tables.

This next slide is a high-level view of what page tables look like. I'm going to go over this in a little more detail, but let's talk about some of the key things. Over on the left-hand side there's an mm_struct, which is a structure associated with each process, and within that structure there's a pointer to something called a PGD, or page global directory. That's simply one page which has a bunch of entries, as you can see here, that point to another level in the page table, which points to the next level, and the next level, and the next level. And finally, down here at the end, we get to actual user data, these 4K pages in system memory. That's a lot of information, but let's go over some of the key pieces. If we go back, we see that there are these levels in the page table: PGD, page global directory; PUD, page upper directory; PMD, page middle directory; and PTE, page table entry. And each one of those tables is one page in size.
And those tables are really just an array of entries, each one word in size. Doing the arithmetic here, on x86-64 as an example: a page is 4K and a word is eight bytes, so we end up with 512 entries per page for each one of those levels in the page table. There are definitions out there in the Linux source, pointers per PGD, pointers per PUD, pointers per PMD, pointers per PTE, and those are all 512.

One thing that I should also mention, and I'm going to go back to the chart real quickly here, is that it shows a four-level page table. There have actually been up to five levels; there's another level in there that's commonly used on x86 today. The number of levels in the page table is somewhat architecture dependent. I'm just trying to give a high-level overview here, so we'll use a four-level page table that fits nicely on a slide.

I'll go back to this page table diagram. Like I say, the tables at each level are all one page in size with 512 entries, but they contain these pgd_t's, pud_t's, pmd_t's and pte_t's, so what are those things? Those are typically called page table entries, and they're typed for the level within the page table where they live; that just provides some error checking within the Linux source code. Within each page table entry there's a pointer to the next table page, for example from a PGD to the PUD, or, at the very last level, a pointer to user data. Those pointers are either a page frame number or an actual physical address, where a page frame number is just the number of that page in physical, or system, memory. One thing to notice is that since those entries always point to a page, there are some extra bits available down at the end; in other words, we don't need to use those page-shift bits at the end. So those entries also contain flags, such as: is the page actually present, is it readable, writable, is it dirty. Or, something that's kind of important for huge pages, is this page
PSE, page size extension, which means that maybe at this level the entry really points to user data, where you might expect it to point to another level in the page table. So that's a high-level view of what those page table entries contain.

So we have processes, with virtual addresses, going through the page tables to get to system memory. But how does that actually happen? In a process we have a virtual address, which can be split into various components. The very high bits of the virtual address tell us which PGD entry to look for: we have this PGD shift of 39 bits, this is x86-64 specific, so if we shift the virtual address to the right by 39 bits, we can calculate the index into the PGD. The same for the PUD, the same for the PMD and PTE, and finally the page-shift bits at the end tell us where, within that user data page, the data that the virtual address corresponds to actually is.

What I wanted to do was take a very quick walk through how we get from a virtual, or linear, address in a process to the actual physical page in system memory. As you can see here, from the very high bits of the virtual or linear address we get this offset into the PGD; the next bits give us an offset into a PUD, then an offset into a PMD, etc. So let's take a very quick walk through of how that works. We have an address in a process's address space, a virtual address. The first thing we do, down here at the bottom, is take that virtual address, shift it right by PGD_SHIFT bits, mask off any upper bits, and we get this PGD offset, which tells us where within the PGD page a pgd_t entry exists. If you notice, that PGD offset is really just an index into that PGD page, some value between zero and 511. And once we get that pgd_t entry, it points to a PUD page, the next level in the page table.
If we follow that, then we take the virtual address, do a PUD shift to the right, mask off the upper bits, and we get this PUD offset, the index to a PUD entry. Again, just like at the upper level, that's an index into the page, and it points to the next level in our page table, the PMD page. This should be no surprise: for the PMD offset we take the virtual address, do a PMD shift right, mask off the upper bits, and we find this pmd_t entry, which points to a PTE page. And within the PTE page, again, take that virtual address, shift right by the page shift, mask off the upper bits, and we get to the PTE entry. Finally, that points to a page that contains user data, the page that we're ultimately looking for. Then, at the very end, we use the offset within the page to actually get at the data we're looking for.

So yay, but as you can imagine, that's quite a bit of work to traverse, isn't it, to go from a virtual address to system memory? Think about all of those calculations that were needed in there, and everything that happened in the page tables: getting the PGD offset, the PUD offset, and all the rest, traversing the entire page table. That's quite a bit of work, and it's actually required every time we make a memory access, to figure out where that physical page is within the system. Because that takes such a long time, CPUs typically have something called a translation lookaside buffer, or TLB. And a TLB is really just a cache of virtual-to-physical translations, or in other words, the kind of information that's kept in the page tables. So, instead of traversing the entire page tables to go from a virtual address to a physical address, the TLB contains a quick way to say: this virtual address is associated with this physical address. One thing to note, though, is that, like all caches, the TLB is a small resource; there's not an unlimited number of entries that we can keep in the TLB.
So, if we go back to our question of how we get from virtual addresses to system memory, physical addresses: we basically start with a virtual address. Is it in the TLB? Actually, the hardware does this for us. If it's in the TLB, we immediately get the physical address. And if it's not, in other words we take a miss on that TLB, we actually have to go and traverse the page tables again to get to system memory. Typically, what happens when we do this traversal is that the entry gets put in the TLB, the translation that we just did to go from a virtual address to a physical address. How that happens is really hardware and architecture dependent, but typically this is the case: if we take a miss on the TLB, and we have to traverse and look up that physical address, then a TLB entry is created with that translation.

So, as I mentioned, the TLB, this cache of virtual-to-physical translations, is somewhat of a limited resource. Let me share some of the sizes of the TLBs, the number of entries for the various page sizes, on Intel processors. We can see here that for instructions, when you're actually executing an instruction, the number is around 128 entries on most generations of Intel processors for 4K pages. If you're trying to use 2 MB pages, that number drops considerably, and similarly for data pages. And if you notice, for 1 GB pages the number drops quite a bit. These over here are the second-level TLB sizes, kind of a second level of caching, and the sizes there are somewhat bigger, but still not huge.

I've talked mostly about how we get from a virtual address in a process to the underlying physical address in the system, and I just want to point out that the kernel itself also uses virtual memory. Most kernel data is actually addressed with virtual memory.
There's a set of page tables that translates kernel virtual addresses to physical addresses as well. So this isn't just limited to processes; the kernel itself actually uses virtual memory for most of its data. Obviously, it sets up all of this data, so there's a bootstrap process, and there are other areas where it has to deal directly with physical addresses. But in general, a lot of kernel code really deals with virtual addresses.

Mike, yes, there are two questions I think that are relevant to the previous slide you were showing with virtual memory. The previous slide, the last one, the table showing the TLB sizes. The first question is: depending upon the processor, how much virtual memory is there, how much system memory, or how big is the virtual address space of a process?

I'm thinking, and please chime in, anonymous, whoever asked this question, but I'm thinking that they're asking about the virtual memory size. So, the virtual memory size of a process is quite large. I do not know the specifics, and I'm going to get this wrong, but it doesn't use the entire 64 bits of a word; it's maybe 52 to 54 bits in size. But it is a huge number as far as the virtual address space of a process goes. The physical memory actually supported by systems, I can't really give you that number; I'm not an expert in that area. I know I have worked on machines with multiple terabytes of physical memory associated with them, and the virtual address space of any one process could easily address all of that physical memory.

Looks like there is a follow-on question: is the number mostly the same for Apple processors, the M1 Max, M1 Ultra, and M2? Can you repeat that, the same for, I'm sorry? Apple processors. Apple processors. I honestly have no specifics on what Apple is doing.
I was around a long time ago when Apple used PowerPC processors; they tend to switch, and I'm not sure. They were using Intel for quite a while, so this would hold for Intel processors. I'm sure you're asking about the latest Arm processors, and I'm not sure about the specifics there. Okay, that sounds good.

Another question, Mike: you mentioned that various processes can have the same virtual addresses. In this case, how would they access their own physical data? So, if we go back, I'm going to change slides here real quickly. You all should see a slide with a translation going through the page tables, where you start with a virtual address in a process. Each process has its own pointer to this upper-level PGD page, the page global directory. So each process has a unique start of the translation, and by the time you get down to physical memory, all of these are different. Each process has a different starting point here in the page table traversal, and each process has its own set of page tables. Page tables are unique to a process. Hopefully that answers the question.

One more question: how are page tables and the TLB accessed? Are they directly addressable in physical memory? Yes, they are. The kernel actually sets up the page tables. I don't want to get too deep into this, but let's go back here. When a process actually wants a page of physical memory, the kernel is responsible for setting up the translation: the kernel has to first allocate a page of memory, figure out which physical page of memory it wants to use, and then set up a translation from the virtual address in the process to that physical page. And that's with a page table, as we've described before. So, think about this: the very first page used by a process. We have this PGD pointer over here at the beginning, and the PGD page doesn't even exist at this point, so the kernel has to actually allocate that page.
It fills in this PGD pointer, and then allocates a PUD page, a PMD page and a PTE page, until it finally can point to the page that it allocated for the user data associated with that virtual address. So yes, the pages are allocated and accessed as physical pages as needed. The creating of TLB entries is somewhat architecture dependent. On some architectures you can actually populate those yourself; on x86 that's typically not done, the hardware does it after a TLB miss. The kernel itself is more interested in flushing those TLB entries when address translations are no longer valid, but that's a topic you could spend a whole session on, and I don't think we have time to do that today, nor do I have the expertise. Hopefully that answers the question. That's good, I think that answers the question. Okay, I assume it does.

And another question: can multiple processes use the same physical memory, and how? Yes, they can. If you think about it, if you have shared memory, a shared memory segment, what happens is that each process does have its own page tables, but those page table entries at the very end, I'm going to go back again, these pte_t's, actually point to the same user data page for multiple processes. So how you get there is via the page tables, but the actual target can be the same in multiple processes, if it's set up that way.

There is another question, Mike: could virtual addresses be the same for two different processes? If so, how does hardware handle conflicts? Yes, actually, typically they are the same. And if you think about it, it kind of makes sense. Otherwise, if they weren't the same, well, you can have tens, maybe even hundreds or thousands of processes out there, and you would run out of virtual address space if each process had to have a unique virtual address space.
So, processes typically have the same virtual address space. And can you say the question again? I forgot what the second part of the question was, I'm sorry. Okay, that's okay. The second part of the question is: if so, how does hardware handle conflicts, when two virtual addresses are the same? So what happens is, if we get to this slide here: what you really want to happen, to go from a virtual address to system memory, is to go through this translation lookaside buffer, or else we go through the page tables, and as I've already mentioned, page tables are unique to a process. We start at this PGD pointer, and that's unique to every process, so the page tables are unique. The TLB actually gets flushed on each context switch, so if we go from running process A to process B, we have to get rid of all of those TLB entries, because they cache virtual-to-physical translations. This is another level where you can go deep, because with all of these Spectre and Meltdown things we actually use tagging within the TLB, but that's too deep for right now. Essentially, the page tables and the translations in the TLB are unique to a process. That's how you can have multiple processes with the same virtual addresses: we know what process we're running, we use the page tables associated with that process, and we clear the TLB between running different processes.

Looks like we have a few more questions; would you like to take them now, or do you want to make progress on a few slides? Um, yeah, well, no, let's take a few questions now. Okay. So, just for clarification, what do I, D, and S stand for: instruction, data and stack? I think on one of your slides you might have had, yes, this one here, the TLB sizes. So, the I in ITLB stands for instruction translation, in other words the instruction itself, whereas the D stands for the data which is the target of the instruction.
The S over here is for second level. If you think about caches, where you have a first-level, second-level, third-level cache: these TLB entries are, as I mentioned, a cache. The first ones are the lowest level, right on the processor itself, and these are the second level of that cache. Like an L2 cache and an L3 cache? Yes, it's the same concept.

All right, the other question on this slide is: what do 4K and 2M stand for? Yeah, that's the page size: either 4K pages, 2 MB pages, or 1 GB pages. One more question: is the PGD always going to be one page per process, or can a process have more PGD pages? It's just one today. Great.

And then: is virtual memory stored in secondary storage? If so, then what is the maximum limit of virtual memory? I think you answered this question earlier, Mike, but yeah, go ahead. Yeah, I'm not really sure if virtual addresses are stored in secondary storage; I don't know how to answer, or how to even interpret, that question. Do you have, can you elaborate on your question? I don't think your question is very clear. It looks like secondary storage means an HDD. I don't think that really happens, Mike; virtual memory doesn't get stored on any secondary storage, correct? Correct, not in the way that the person asking the question is concerned about. Obviously we can swap, we can write memory to a swap device or something like that, but that's a whole other topic, managed in a totally different way, and we don't have the time to go into that today. Okay, great.

And I think this is the last question: if two processes share shared memory, then when they are switched, does the memory also switch, and where does it go? If two processes share memory, as I mentioned before, it's the lowest-level page table entry,
I'm going to go back to this slide, this pte_t, for two different processes, pointing at the same user data page. So, if we're running in one process and its pte_t points to a user data page, and another process's pte_t points to that same user data page, then as we switch to the other process, the data page stays there. That's part of system memory; it doesn't go away, per se. When we switch to the other process, we just have a different set of page tables that ultimately point to that same user data page. Thank you, Mike, I think that's the last question we have. Okay.

And with all of that said, let's finally talk about huge pages. Huge pages are typically associated with a page table level, either a PMD or a PUD in the page tables. The sizes are architecture dependent, and there's MMU and TLB support; if you remember that previous slide about Intel architectures and TLB sizes, there were entries for different page sizes. But in general, huge pages are contiguous areas of physical memory, and they have to be aligned to the huge page size.

I'm not sure how familiar people are with memory allocation within the kernel, but there's something called the buddy allocator within the kernel for allocating pages, and as long as your huge pages are less than MAX_ORDER in size, they get allocated from the buddy allocator. If not, there's something called alloc_contig_pages, an interface within the kernel that tries to allocate contiguous areas of an arbitrary size. The physical memory for huge pages can also be allocated in other ways. There's the contiguous memory allocator, the CMA allocator, which is funny, kind of redundant, and it can be used to allocate huge pages. There's the memblock allocator, which is the early, boot-time allocator, and it can be used to set aside memory for huge pages.
And on certain architectures, even the firmware can be used to get contiguous areas of memory for use as huge pages.

What is a huge page, then? We've gone over this page table slide a bit: how to get from a virtual address to an actual page in the system, a physical page. If you notice, down here the PTE page points to a user data page, and in this case the offset is within a 4K page, or what would be a 4K page on Intel architecture. So we could say that that page is at the PTE level. For a huge page, we can instead have a page at the PMD level. The page table looks similar, but the PMD, instead of pointing to a PTE page, actually points to a user data huge page. One of the things to note here is: well, how would you know that in the page tables? It's one of those flags. The PMD entry has this PSE flag set that says: I actually point to a user page, instead of what I would normally point to, which is a PTE page, another level in the page table. This huge page here, since it's pointed to by a PMD entry, is a PMD-size page. On Intel, this would be a 2 MB page, so you have a single page that contains 2 MB worth of data, as opposed to the 4K page at the PTE level.

If we take that one step further, we can also have huge pages at the PUD level. In this case the pud_t again has that PSE flag set, saying it points to a user data page, as opposed to what it would normally point to, a PMD page in the page table hierarchy. And in this case the user data page is actually 1 GB in size. So that's, in a nutshell, how huge pages fit into the page tables.

The next question is: okay, that's all good and well, but is it worthwhile to even try to use huge pages? One thing that I want to stress on this slide is that huge pages may increase performance.
As a matter of fact, there are certain applications, certain use cases, where they really do. If you think about all of our talk about virtual memory, page translations, TLBs, and so on, the advantage of using huge pages is that there are fewer translation entries. Let me go back to the slide for a huge page at the PMD level: we skip this whole extra level here, so going from the PGD down to the actual user data takes fewer steps, and the amount of data that you can bring in with each translation is much greater. So you can spend a whole lot less time servicing TLB misses. But again, this is all dependent upon how your application accesses the data.

Huge pages can also be a bad thing, in that we have a less granular page size. If you set up to use a 1 GB page and, let's say, you're only using 2K of that 1 GB page, that's kind of a waste for you. And if you look at that chart of Intel TLB sizes, you'll notice that there were fewer TLB entries for larger page sizes, for huge page sizes. Again, it really depends upon the access pattern of your application.

I can't stress this enough: there is no blanket statement that says using huge pages is going to increase performance; it could even hurt your performance in some cases. So it's really up to the application developer, or someone who wants to use huge pages, to test this out. Benchmark, benchmark, benchmark, to know whether or not this is actually an advantage for you. And not only benchmarking: you have to have really good knowledge of your application and how you expect its memory access patterns to be. So again, as I was pointing out, if we look at the number of TLB entries for huge pages,
You can see, comparing 4K to 2M to 1 GB, that the number goes down as you increase the huge page size; there are just fewer entries in the TLB for those. So you could end up with potentially more TLB misses. It's a trade-off: the pages are of a greater size, but you have fewer entries to cache the translations to those pages. One thing to note here is that one of the more recent generations of Intel processors, Ice Lake, actually has quite a few TLB entries for 1 GB pages. Up until Ice Lake, most applications did not make heavy use of 1 GB huge pages, simply because there were not that many TLB entries available for them.

Mike, there are three questions for you in the Q&A, would you like to. Sure, let's go to those, this might be a good time. Can TLB misses be monitored, and if so, can you monitor them at the process level or only system-wide? Yeah, you can monitor those. I've done it using the perf utility, and you can monitor those at an application level. Does that answer your question, Matt? Okay, if not, ask another question.

Another question, Mike: the 2M and 1 GB TLB entries are separate and limited; how are these entries designed, fully associative, two-way associative? Um, I don't know off the top of my head, to tell you the truth. And then I have another question here for you: for an application like DPDK, would huge pages be helpful at the kernel level? Well, you're going to have to educate me on DPDK; I'm not immediately familiar with that term. Neeraj, what does DPDK stand for in this context? While you come back to us, Neeraj, with the DPDK, I have another question, Mike: is the Linux kernel architecture based on any particular processor, whether an AMD or Intel processor, and can it be used to design the Windows or macOS architecture? I'm not sure I even follow the question, but does that make sense to you, Mike? No, it doesn't make much sense to me.
Okay, so, Neeraj, go ahead and tell us what DPDK means in this context. Okay, I think Intel provides the DPDK library, which uses huge pages to map packet memory. That seems to be the context, Mike. Okay, so, I will say that the kernel actually makes use of huge pages itself; for some of its data structures, it has determined that it is advantageous to use huge pages. I am not familiar with all of the kernel code and all of the instances where huge pages are used; I'm familiar with a few of those, but certainly not all of them. But yes, to answer that question, it could be that there are cases, places within the kernel, where using huge pages does make sense. And if it does, and it's not doing that today, that could be modified.

What kind of applications require huge pages, and when is it needed? That is another question. Yeah, so one thing to notice is that using huge pages is not required; it's a performance improvement. I will tell you that some of the more prominent use cases for huge pages are large applications. My company, Oracle: their database makes heavy use of huge pages. They have this shared global area that can be huge, up to 75 to 80% of system memory, and they use huge pages to map that. JVMs actually make use of them. Virtual machines: when setting aside memory to back virtual machines, people are actually using huge pages for that today. So it's not that huge pages are required; it's just that people have found that performance increases when they use huge pages in these use cases.

And when we have large amounts of data and processes sharing the data, for example in a database situation, using huge pages might do two things: one is that it will probably save on footprint, because we don't have as many PTE entries. Would that be accurate?
Yes, that is accurate. You're going to steal my thunder; there are actually even more advantages to huge pages than that, but we'll get to them later. Okay, thank you, I won't jump the gun. Then another question I have here: does the kernel use huge pages for its own data, as it helps reduce TLB miss latency? Yes, it does. As a matter of fact, on the very second slide we showed the memory map for the kernel, this big array of struct pages, with a data structure for each page in the system. The kernel actually uses huge pages for that array, because it accesses them frequently and it's a large amount of data. So yes, there are various things the kernel itself uses huge pages for; the kernel text, the actual executable instructions, is mapped with huge pages as well. I think somebody came back and said DPDK stands for Data Plane Development Kit; thank you for that answer. Let me see, I have about three more questions, Mike. Would you like to continue with questions or would you like to cover some material? Yeah, let's continue with the questions. Does the huge page size determine at what level of the page table we point to user data, PUD versus PMD? I'm not sure I fully understand the question, but yes, the size of the huge page in use determines which level in the page table actually points to that user data page. The next question is, can the usage be dynamic; does the kernel switch from normal pages to huge pages during execution, and when is the decision made? Within the kernel, no, but that's a perfect lead-in for our next topic, transparent huge pages at the application level, so the actual answer will come soon. Okay, so is it safe to tune the huge page size to increase my application's performance without impacting kernel performance?
Yes; that really is specific to the application. There may be some ways you could degrade kernel performance in some small way, but there's no direct way to do that that I'm aware of. So again, it's just a matter of experimenting with your application's use of huge pages and figuring out what gives the best performance for your application. That's all the questions relevant to this presentation for now, Mike. Okay, thank you. So now we're going to switch to talking about the huge page APIs that are available to applications. There are really two specific ways to use huge pages. The first is transparent huge pages, or THP. As you can imagine, since "transparent" is in the name, its use should be mostly transparent to the application. Someone asked whether the kernel can automatically determine if it's good to use huge pages and then do so; that is exactly what THP was designed for: with no intervention from the application, or very little, decide whether it makes sense for the application to use huge pages, and if so, do it automatically. The other API is hugetlbfs, which is what I maintain and where my expertise lies. It's a much older technology; it was perhaps one of the earliest ways that huge pages were made available to applications. It actually requires application modification: if you want to use hugetlbfs, you've got to modify your application to do it. Typically hugetlbfs also requires some type of sysadmin intervention or setup. In other words, it's not just somebody running an application; whoever manages the system has to be aware that some applications want to use hugetlbfs and set that up at a system level. So, again, back to scoping my area of expertise.
I think THP is really the new and cool technology, because applications don't need to think much about it; it is designed to just work, with applications getting a performance benefit when using THP. Unfortunately, that is not my area of expertise, so we're just going to cover it briefly and then dive into the specifics of hugetlbfs, which is where my expertise does lie. So, THP. It's primarily used for anonymous memory; think of the data within your application, the heap that you have there. It can also be used for tmpfs: most Linux distros use tmpfs, a memory-backed file system, for /tmp, and you could theoretically configure that so it's always backed by THP pages. And there's actually some support for file mappings, for executable code or files. That only works with XFS right now and is marked as an experimental config option, but it seems to work quite well. THP is limited to PMD-size support today; in other words, on Intel, THP will give you 2 MB pages, and that's all it does. People have been working on extending that to 1 GB page support as well, but that is not in the kernel code today. There's one big control file for THP, the sysfs file I have here on the slide, and there are three modes THP works in. There's "always", which means for every application, always try to use THP. There's "madvise", which means use THP as the application gives hints via the madvise system call. And then there's "never", which means never use THP. The default on my desktop system here is madvise; that's why "madvise" is in brackets there. If you actually want to use THP via madvise, there are two advice flags within the madvise system call.
I'm not sure how familiar people are with the madvise system call, but it takes an address and length, a range of virtual addresses, and you provide some advice on what to do with that range. You can pass MADV_HUGEPAGE, which means try to use huge pages in this address range, and MADV_NOHUGEPAGE, which means don't ever use huge pages in this address range. That's how you work with the madvise style of THP enablement. If you want to play with THP, there are lots of tunables out there, all under that sysfs directory. There could be a whole mentorship session just on THP and how to use it in your applications, and we're not going to go over that today; honestly, other people have much more knowledge about this than I do, but I just wanted to point out where the tunables are. So, any quick questions on THP before we jump into hugetlbfs? I may not be able to answer them. Okay, so let's get into hugetlbfs, where I can answer questions. Hugetlbfs is, as I say, an older way of using huge pages in your application, and it does require application modification. Huge pages are generally preallocated via some sysadmin control. It has been used by databases for many, many years; I think support for hugetlbfs went in somewhere around the 2.4 kernel timeframe. More recently it's been used quite successfully to back virtual machines. Virtual machine code like QEMU uses THP by default, but more people have been setting up hugetlbfs and using that to back virtual machines. Hugetlbfs supports multiple huge page sizes, which is really defined by the architecture itself; it typically supports anything the architecture does. One thing to note is that hugetlb is very old and very simple, and the concept is that you reserve or set aside a pool of huge pages, a pool of memory that can only be used as huge pages, and your application makes use of that pool.
Setting aside a pool of huge pages means they're only available for application use as hugetlbfs pages. They can't be used by the kernel, they can't be used by other applications, and they can't be shared in any other way; they're just set aside to be used as hugetlbfs pages. That's what I was getting at when I said it requires intervention by a sysadmin to set up these pools. It has to be done with privilege; not just any user can do it. And it really requires not just application knowledge but sysadmin knowledge and some pre-planning to make it all work. I just wanted to share the huge page sizes from an example; not an Intel example, but an arm64 system with a 4K base page size. In this example the supported huge page sizes are 64K, 2 MB, 32 MB, and 1 GB. As a general note for hugetlbfs, to see the sizes supported on a system you can look in sysfs: /sys/kernel/mm/hugepages will have a directory for each huge page size that is supported. As you can see, there are four huge page sizes supported on this system. But then the question is, what is the default huge page size; there's also the concept of a default. There's the file /proc/meminfo, and the Hugepagesize field in that file tells you the default hugetlbfs huge page size on the system. The default is important because if you just say "I want to use hugetlbfs" and do not specify a huge page size, you get the default huge page size. So I just want to point that out here. As I mentioned, with hugetlbfs you populate these huge page pools, which is usually done by a system administrator.
There are two ways to do that. You can do it at boot time on the kernel command line; there are options where you say what the huge page size of the pool is that you want to populate and how many huge pages to put into that pool. Recently, support was even added for a per-node format, something like hugepages=N:X, where N is a NUMA node number and X is a number of huge pages, so you can actually specify how many huge pages to allocate on each NUMA node. hugetlb_cma can also be specified at boot time; what that does is reserve an area for the contiguous memory allocator from which huge pages can be allocated later, and again, that can be specified on a per-node basis as well. At boot time you can also change the default huge page size for the system by specifying that on the kernel command line. So that's one way to populate huge page pools. The other way is to wait until the system is up and running, and you can just write into a sysfs or procfs file to populate the pools that way. So you might ask, why would I do it one way or the other? One thing is that, as mentioned earlier, huge pages have to be contiguous physical memory. As the system runs, we typically don't use huge pages; we allocate 4K pages, and system memory becomes fragmented. We grab a 4K page here and a 4K page there, and pretty soon there aren't any big areas of contiguous memory left from which to create huge pages. Via the memory management mechanisms called compaction and migration we can try to create huge pages, but the longer the system is up and running, the more fragmented memory becomes and the harder it is to create huge pages on the fly. So doing things at runtime becomes more difficult.
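To make the two setup paths concrete, here is a sketch of what the boot-time and runtime knobs mentioned above look like. The counts and sizes are illustrative values I chose, not recommendations, and the runtime write requires root:

```
# Boot-time: kernel command line (illustrative values)
default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=512

# Per-NUMA-node form of the count (node:count pairs)
hugepagesz=2M hugepages=0:256,1:256

# Reserve CMA at boot for allocating 1G huge pages later
hugetlb_cma=2G

# Runtime alternative: write the pool size into sysfs as root
echo 512 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
```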
That's why some people may want to do it at boot time, before the system gets up and running; they can set aside these huge page pools there much more easily. So how do you make use of these huge page pools? Well, hugetlbfs is a file system and was originally designed that way, so you can actually mount a hugetlbfs file system, just like you would mount any other file system. When you mount that file system, all files in it are backed by huge pages. One thing to keep in mind, though, is that this is a memory-based file system. When you first mount a hugetlbfs file system there's nothing in it, and it goes away when you unmount it; there's no persistent storage, no persistent state, for that file system. All files in the file system are backed by huge pages, and the pagesize mount option says what size of huge pages to use for this file system; it uses those out of the pool that was created ahead of time. Most file system operations are supported on hugetlbfs. One notable exception is that the write system call is not supported; by that I mean you cannot do a write(2) to a hugetlbfs file. You can still populate the contents of a file, but you have to do that by mapping the file and then storing data into that mmapped area. You can also use hugetlbfs via System V shared memory. There's the shmget system call, which creates a shared memory segment, and if you pass in the SHM_HUGETLB flag, that segment gets backed by hugetlbfs pages. You can pass in additional flags to specify the size of the huge pages backing that shared memory segment. So that's another way applications can make use of hugetlbfs, via System V shared memory segments. The other way is the mmap system call.
If you have a hugetlbfs file system mounted, you can open a file within it and pass the file descriptor to mmap; since the file lives in a hugetlbfs file system, it is of course backed by huge pages from that file system. You can also call mmap with the MAP_ANONYMOUS flag along with the MAP_HUGETLB flag, and you will get anonymous memory backed by huge pages. Just like with System V shared memory, you can specify the huge page size backing that mmapped anonymous memory. The thing to note with the mmap calls is that addresses and offsets within files have to be aligned to the underlying huge page size. That's not always a requirement for non-hugetlbfs mmap calls, but it is a requirement if you're using hugetlbfs to back mmap. With hugetlbfs, as I mentioned, you set up these pools ahead of time and your application makes use of the huge pages in those pools. So, one question is what happens when you run out of huge pages in those pools. I'm sure you all have applications that make use of memory, and typically, if you run out of memory on the system, the first thing that happens is the kernel tries to reclaim and reuse some of the memory that's in use; in the worst case you get an out-of-memory error and your process is killed. Huge pages, though, are not swappable and not reclaimable, so when we're out of huge pages, we're out. If an application tries to use a huge page and none is available, it takes a page fault, tries to bring in a huge page, finds none available, and gets a SIGBUS, which is usually fatal; most applications don't have a handler for SIGBUS. So that's not so good: take a page fault, have no huge page available, get a SIGBUS, and your process gets killed.
Early on in hugetlbfs development, the concept of huge page reservations was introduced to try to mitigate running out of huge pages. What happens is that there's a reservation count, and that count is per pool. Every time an application does an mmap, or creates a shared memory segment with the shmget call, a reservation of that size is set aside within the pool. So if you mmap, let's say, 50 hugetlb pages, a reservation for those 50 pages is taken at mmap time, so that as the application actually faults in or allocates those pages, the 50 pages are guaranteed to be available. What that also means is that at mmap or shmget time, if there are not enough pages available in the pool to reserve the pages needed, you get an ENOMEM right then. So that per-pool reserve count is incremented at mmap and shmget time and decremented each time you fault in or allocate a page. There's actually more than just a global count; as you can imagine, keeping track of whether the reservation for a given virtual address has been consumed or not requires internal data structures associated with each hugetlb mapping, whether anonymous or file-backed, that track reservations. If you look at the global counters available in sysfs or /proc, you'll see things like HugePages_Free and HugePages_Rsvd. Keep in mind that you really want to think about the huge pages actually available, which is HugePages_Free minus HugePages_Rsvd. If 50 pages are free but 50 pages are reserved, there aren't any huge pages available for a new process to use, except for those processes that have already made the reservations.
Hugetlbfs does have some unique features that are not implemented in THP or in general virtual memory at this time. One of those features is something called PMD sharing, where processes can actually share PMD pages in their page tables. As I said before, when people asked how processes share access to the same data, typically each process has its own page tables that walk down to point at the same user pages. With hugetlbfs we can actually share the PMD pages themselves in the page tables, for either file or anonymous mappings. But since we're sharing a PMD page, the range that we share has to be at least one gigabyte in size, in other words PUD-size, and aligned. A quick example of why this is important: as I mentioned, databases like to use huge pages for shared areas, and it might not be uncommon to have a one-terabyte shared mapping with 10,000 processes sharing that mapping. Each 4K PMD page maps one gigabyte of the shared mapping, so a terabyte mapping needs 1024 PMD pages, 4 MB of page tables, per process. With PMD sharing, the other 9,999 processes don't need their own copies; we still need one set of PMD pages out there for everyone to point to, but we save about 39 GB of memory in this example. So what does that look like? Here we have two processes; their PGDs point to their own PUDs, but the PUD entries in both processes point to the same PMD page, so they share that PMD page, which points to the actual underlying huge pages. The ability to share page tables at all is a unique feature of hugetlbfs. There is some work underway to perhaps do more page table sharing in Linux, but this is unique to hugetlbfs today.
Another feature unique to hugetlbfs is something called vmemmap freeing, which was recently added in the 5.14 kernel. The vmemmap is a virtually mapped memory map, so that the struct page entries are virtually contiguous, and this freeing is a memory-saving feature. Going back to the early slide from the virtual memory discussion, I had a diagram saying that system memory is divided into pages and we have this memory map, a map with a struct page for each page in the system. It's a little bit of a lie to show the memory map as one big contiguous area. What typically happens is that the sparse memory model is used: system memory is divided into sections, and a piece of the memory map is associated with each section. As you can see here, we have four distinct pieces of the memory map, and they are not physically contiguous, but through the use of virtual memory we can map them so that they appear virtually contiguous within the kernel. And if they are all virtually contiguous, then the struct pages for a huge page are contiguous too. So if we look at a 2 MB huge page, we have 512 struct pages describing it (again, this is x86-64 Intel architecture): a head page followed by a whole bunch of tail pages. For those 512 struct pages at 64 bytes per struct page, we end up taking eight 4K pages to contain all of the struct pages for one 2 MB huge page. And if we look more closely, all of those tail pages don't really contain useful data; all they do is point back to the head page. For a 1 GB huge page we have even more of them. So what vmemmap freeing does is free all of those extra pages of struct pages (sorry, there are so many uses of the word "pages" here) down to the minimum needed to describe a huge page. For a 2 MB huge page, it deletes pages two through seven.
It remaps those back onto the second page of struct pages, so we basically end up freeing six pages back to the system for other use. That's pretty good. It's even better for 1 GB huge pages, where today we can free 4094 of those extra pages back to the system. And a change going into the next version, I think 5.17, is going to say that we only need one page of struct pages to describe a huge page, so we'll get back another one. This has been shown to be a real way to save memory. It's unique to hugetlbfs today, although it's actually being added for persistent memory as well, possibly in 5.17. If we free up these struct pages used to describe huge pages, there is one downside: those freed pages have to be reallocated before a huge page can be removed from the pool and given back to system memory. In other words, we need the struct pages to describe the individual 4K pages once we give a huge page back to the system. So vmemmap freeing is opt-in; you have to specify it on the kernel command line. The main reason is that you can now be in a situation where you may not be able to free huge pages back to the system if you cannot allocate the vmemmap pages for them. There are some config options and a boot option out there. I honestly do not know whether distros are going to enable this option, but it is available in the kernel source today. Just as a quick summary on huge pages, I want to go back to that first slide I talked about: huge pages may increase performance, and it really depends on your application's access patterns. And to repeat again, the best approach is to know your application and benchmark, benchmark, benchmark; play around with these things and they may help you out. So, that's all I have; open to questions.
Sorry, some of that was very confusing, especially the vmemmap freeing, but I'll take any questions now. Okay, we have a few questions. We're at the end of our presentation, but we'll spend a few minutes on questions if that's okay with you. Sure. Okay, the first question is about the 64-kilobyte huge page size; I think the person is wondering how you can configure a 64K huge page size. Okay, so that slide I had was showing the huge page sizes available on an arm64 architecture. The huge page sizes that are available really are architecture-dependent, so you can't do a 64K huge page size on x86 Intel architecture today; it's just not easily supported by the hardware. That's why I had that slide; let me see if I can go back to it. Go out and look at what size huge pages are available on your system: look under /sys/kernel/mm/hugepages and see what sizes are supported. It's really up to the kernel code when it boots to figure out what's available from the architecture, what it supports, and to populate those sysfs files; those are then the sizes available to you. I mean, mine, I have AMD, and it just shows the first two options. Yeah, you probably have a 64K base page size, maybe. Great, thank you. The next question, I'm not sure it's relevant to this presentation, Mike, but I'll read it out and let you make the choice: is there an enable option for the buddy allocator similar to CMA, and what are the pros and cons of using the buddy allocator versus CMA? I'm thinking this is not in the scope of this presentation, but I'll let you decide, Mike. Well, I talked about using CMA to allocate huge pages, and I'll tell you why that was done.
As I mentioned, the longer the system is up and running, the more fragmented it gets, so it's hard to allocate physically contiguous areas. What the CMA allocator does is set aside an area that cannot be used for allocations that cannot be easily reclaimed; in other words, the kernel typically doesn't satisfy long-standing allocations out of that contiguous memory area. So we reserve an area of CMA so that we can allocate huge pages out of it later on, instead of trying to use the buddy allocator to do that. It just gives us a greater chance of finding a physically contiguous area for creating huge pages after the system has been up and running for a while. Okay, I have another question. Can a huge page setup done at boot time be overridden dynamically, for example, reducing the size or number of pages dynamically? Oh, can you repeat that again? Okay: can a huge page setup done at boot time be overridden dynamically? Yes, it can. If you set up something at boot time, you can change those numbers after the fact by going in and writing to those sysfs or procfs files, as shown on the slide. Another question: how does copy-on-write interact with huge pages, and with shared huge pages? Two different questions, but copy-on-write for huge pages follows the same semantics as regular pages. The code to do that is obviously different, but the goal is to follow the same semantics; if you don't see that happening, let us know. And what was the second part? The same question about copy-on-write interacting with shared huge pages. Okay, well, if you're sharing huge pages, there really is no copy-on-write, so I don't understand that part of the question.
Copy-on-write really has to do with private mappings, mappings where two different processes start out with the same version of the data and, when they write, get different versions. But if you're sharing a mapping, sharing a page, and are designed to do that among processes, copy-on-write doesn't really come into play; everyone is getting the same copy of the data. Okay. Yeah, go ahead, Mike. Yeah, there may be some more subtleties the person is asking about, but in general that's the answer. Megan, do we have time for a couple more questions, or should we ask them to reach out? Okay. So the presentation will be made available on our website after the presentation ends; there were several questions about that, and hopefully that answers them. And then I'm going to go to a question from Matt: do ksmd or numad attempt to merge or manage huge pages? I can't say off the top of my head, to tell you the truth. That's fine. Then let me pick through the questions for the ones that haven't been covered in some way. Okay, there's one more: is there going to be a presentation on transparent huge pages in the future? Okay, we'll decide on that; if Mike is able to do another presentation, we can schedule one. And then there's another question, Mike: PMD sharing is feasible only when all PMD entries have the PSE flag set, which means all entries point to huge pages; how is this guaranteed? Well, first, it only happens when you set up hugetlb mappings, so by definition the hugetlb mappings have PSE set in all of their PMD entries. Hopefully that helps; but yes, PMD sharing only works, is only enabled, for hugetlb mappings, not for mappings in general. Okay. And we have one more question: can we use huge pages of multiple sizes at the same time, like two different applications using two different huge page sizes?
Yes, you can. In fact, you could even have one application using two different huge page sizes in two different mappings. The huge page size is unique to a mapping, say an mmap call or an shmat call. So yes, with hugetlbfs you can do that; with transparent huge pages you are limited to the PMD-size page, so there's really only one size available there, but with hugetlbfs you can have multiple sizes across multiple processes, or even multiple sizes within one process. Okay. Why can huge pages not be freed and reused like normal pages during out-of-memory scenarios? Well, I guess they were never designed to be that way; they reside in memory only, so if they were freed at out-of-memory time, you would lose their contents. Think about a file system containing data: we run low on memory and start throwing away the contents of the file system. That's not good. There has been talk that if we have huge pages that are not being used by anybody and the system is running low on memory, it might be a good idea to throw away some of those unused huge pages. Perhaps, but that's a policy decision and still something that could be discussed; we don't do that today. Okay. Do huge pages directly affect the database parameters, processes, and sessions? Directly? Yes, I mean, databases, Java; applications have options to say use huge pages or not during startup. They may even have intelligence built into their startup programs to determine whether huge pages are available and use them. That's really up to how the application does it; I'm not aware of the details, but I do know that a lot of applications have options to say use huge pages if available. Great.
Can multiple processes with shared huge pages all have write access to those shared huge pages? Yes, they can. As a matter of fact, that's the way some of the sharing works in a database: all of those processes can write to the shared area. Can virtual machines share data using huge pages? I'm not sure share with whom; certainly you can set up a virtual machine and, within that virtual machine, configure and use huge pages. Typically what people have been doing is, before they set up a virtual machine, you have to give it memory on your host to back that virtual machine, and people have been using huge pages to back that memory. So I'm not sure exactly what sharing the question was asking about. Right, and in some cases you probably don't want to share memory between virtual machines, for isolation and so on. Yeah, typically you do not at all. Right. So, I'm sorry, we are getting really close to time; I'm just going to do the last question here. Is there any protection for huge pages against the OOM killer? No. I think this may be related to a question someone asked earlier, but when the OOM killer kicks in, it will not try to do anything with huge pages. As a matter of fact, when you get an out-of-memory report written to the system log, it will tell you how many huge pages are out there, and part of the reason is that you may have misconfigured your system; you may have created too many huge pages and not have enough memory available for other uses. So no, the OOM killer does not touch them; it just reports how many huge pages are in use, and it doesn't modify, reclaim, throw away, or do anything like that. Thank you so much, Mike, for taking extra time to do this, and there are several requests for you to come back and do another presentation, on transparent huge pages for sure. We'll have to get a transparent huge pages expert to do that, right?
That is true. Megan, take it away. Oh my gosh, no, thank you so much, Mike; this has been amazing. Thank you all for participating and asking your questions. We will be sure this recording ends up on our YouTube channel as well as on the webinar website, and Mike will share a copy of this presentation with us as well. I hope to see you back at a future webinar. Thank you all so much.