Hello everyone. I'm Khalid. I'm a consulting Linux kernel engineer at Oracle. I've been doing kernel development for a fairly long time; I have worked on various flavors of kernel, and I have been working with the Linux kernel for 20 years or so. I have worked on many different subsystems in the Linux kernel, and for the last many years I've been working on the memory management subsystem. So I'm going to talk about the memory management subsystem today, and specifically about how you go about debugging a problem in the memory management subsystem. I'll talk about how you approach a problem, but what is more interesting is talking about a real problem. There was a real customer problem that came my way recently, and I had to work through it, figure out where the problem was and how to fix it. So I'll talk about that problem, and in the process I'll also talk about the subsystems that play a role in the problem I was debugging, then work my way through to where the problem might be, how to zero in on where the actual problem is, and, once we know what the problem is, how to go about solving it. Beyond that, I'll also talk about general tools and other tips and techniques you can use to solve a problem that happens to be in the MM subsystem.

Debugging is part art and part science. You end up having to come up with a slightly different approach every time, depending upon what the problem is, but there are general tips that seem to work. So let's just go through the customer problem that I worked on, and then we'll talk about other approaches you can take to a problem in the MM subsystem. Feel free to ask questions if something I'm talking about or something on the slide doesn't make sense.

So how do you go about solving a problem in the MM subsystem? First of all, something doesn't work right. Very often the problem comes to a developer described at a fairly high level, and from there you have to narrow it down to where the problem might be, and then go further down and confirm where the problem is. A problem might come to me as "kernel panic with an out-of-memory message" - but what does that point to in terms of where the problem might be? So the first step is to look at what the failure is. What information is available to you? Do some failure analysis. Once you have done some basic failure analysis, you at least have an idea of where to start looking in the kernel. And once you know which code paths in the kernel could be part of this failure, you can start developing a strategy for narrowing the problem down. One thing I do when I'm developing code, if I see a failure and I have narrowed down which code paths the problem is potentially in, is add some dynamic observability to those code paths, so I can get more information when the failure does happen. One of the favorites for many long-time kernel developers is to just add printks. Printks have their downsides, but they also work; you have to use them judiciously. You can use a kernel debugger; that works at times. There are also tracepoints available in the kernel - two to three thousand tracepoints are already coded in. You can enable them selectively and then trace where the code is executing.
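To make the "add some dynamic observability" idea concrete, here is a minimal sketch of what I mean: a simple counter plus a rate-limited printk dropped into a suspect code path. The function and counter names here are made up for illustration; they are not from the actual bug.

    /* Hypothetical instrumentation of a suspect allocation path. */
    #include <linux/slab.h>
    #include <linux/printk.h>
    #include <linux/atomic.h>

    static atomic_long_t my_path_hits = ATOMIC_LONG_INIT(0);  /* how often we get here */

    static void *my_suspect_alloc(size_t size)
    {
            void *p = kmalloc(size, GFP_KERNEL);

            atomic_long_inc(&my_path_hits);
            if (!p)
                    /* rate-limited so a flood of failures does not drown the console */
                    pr_warn_ratelimited("my_suspect_alloc: %zu bytes failed (hits=%ld)\n",
                                        size, atomic_long_read(&my_path_hits));
            return p;
    }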
Sometimes, depending upon the problem, you may want to know the number of times certain events happen. So you can add a counter, or you can use the existing counters that are in place already. The MM subsystem keeps track of a large number of counters. You can take a look at the values of those counters, and how those counters are changing can give you an idea of how the system is behaving. So that's what I can do when I come across a problem while I'm developing the code.

The big issue when confronted with a bug is: can I reproduce it? If it happens during development, obviously I can reproduce it; it's happening on my test system. But what if the problem is reported by someone else, especially by a customer? At that point you may not have access to the system, and you may not have the option to reproduce the problem. You may be able to get access to the system to gather some information, but even that can be difficult at times. Some customer systems are very locked down and there's simply no access, so you have to come up with a different strategy. If you don't have access to the system, the whole set of dynamic observability approaches goes out the window; you obviously cannot make any changes to the system. Even if you had access to the customer system, some customers will not allow changes to be made to it. Enabling a tracepoint is likely to have an impact on the customer system. These customer systems could very well be running critical workloads, and while trying to debug a problem, I cannot impact their workload. So when you find yourself in a situation where the problem happened on a customer system and you have no access to it, it pretty much comes down to what static information you can get from the system. You can't watch live events happening on the system, you can't add an observability point; all you can do is get the logs.

So that is the kind of problem that came my way. It was a customer system, something happened, and I had extremely limited access to the system - nothing more than "we can give you console logs." So let's talk about that problem. First of all, the system: it was a two-processor system with a total of 96 cores and 256 gigabytes of memory, so somewhat of a large system. It runs a workload that uses pretty much the entire resources of the system: CPU usage is maxed out, and almost all of the memory is in use by the workload. The kernel running on the system was an Oracle enterprise kernel, based on a 5.4.17 kernel. This is not an unusual situation. Most of the customer workloads out there are running some sort of enterprise kernel that may have been delivered by one of the vendors, or may even have been developed in-house. Very few systems running customer workloads will be running the latest upstream community kernel, so you often have to debug a problem that's happening on a somewhat older kernel.

On this system, when the customer needs to reboot, they don't do a full reboot, because of the availability requirements they have: they cannot spend a lot of time rebooting the system. So they do a kexec reboot, and I'll talk more about kexec reboot later. What a kexec reboot does is let you reboot a system a whole lot quicker than a full shutdown and restart.
During some testing they were doing on one of their systems, where they were applying kernel patches, updating kernels, and updating the user-space workload they were running, many of those changes required a reboot once applied, so they would do a kexec reboot. They were going through accelerated testing on one of the systems where they applied a whole bunch of these patches and updates and did lots of kexec reboots. And they found that after they had done about 90 to 100 kexec reboots, the system simply failed to boot. It printed an out-of-memory message on the console, and that was it; at that point the system panicked. So they simply did a cold reboot, brought the system back up, put it back in service, and I started looking at what had happened on the system.

Now, the interesting thing is that this system had booted multiple times from cold state, as well as through kexec reboots, and it had rebooted successfully every time without ever running out of memory. So the workload was not over-provisioned; it was sized correctly for the amount of memory on the system, yet the system ran out of memory. What I got was the stack trace from the console, and it looked like this. There was a clear message saying the kernel ran out of memory, and the stack trace showed that the kernel was trying to initialize the mm_struct for a process it was about to launch and tried to allocate a page, but there was no memory available. Since it was a critical process, the kernel panicked at that point. So that was the information I had in my hand, and I had to start from there. Any questions at this point before I move on to how I debugged this? There are no questions in the chat or Q&A. If anybody has a question, just put it in the chat or Q&A. But there is none at the moment, Khalid. Okay, I'll continue.

So I started digging through the console log. When the system failed to reboot, the customer had captured the console log, so I had the entire console log in my hand. I started looking through it, and just walking through the kernel boot-up sequence, I could see that the kernel mounted the root file system and then started launching services: systemd had started, and systemd had started launching services. Very soon after that, we ran out of memory and the kernel panicked. So the kernel never got to the point where we could see a login prompt; there was no way to log into the system. The system was dead at that point, which is why the customer simply cycled power. If you can't even log into the system and the system has to be power cycled, there is not a whole lot you can do; the console log was the only thing I could work with.

Looking through the console log: one of the things the kernel does very early on is print the kernel command line it was booted with, and that can provide some clue as to how the kernel is being configured, so it's useful to take a look at it. I looked at the command line that was captured in the console log. There are lots of options in the command line, but nothing extraordinary, nothing that would indicate the kernel is being configured incorrectly. Looking further down the console log, the kernel also prints a message saying how much physical memory it has detected: it prints the message "total RAM covered" followed by the amount of memory it detected.
That message said "total RAM covered: 262080M", which is 256 gigabytes. So the kernel did see all of the physical memory, and we know the workload is sized to fit in that memory, yet we still ran out of memory. So something else went wrong, and we keep going.

At this point, what are the possible causes that would make the kernel run out of memory? One possibility, of course, is that a DIMM simply failed. If a DIMM fails, all of a sudden your workload is sized for 256 gigabytes but the system actually has less memory than that, and obviously we are going to try to allocate more memory than we have. But that's not the case here, because if a memory module fails and you reboot the system, going all the way down to firmware and booting back up from there, that memory module is not going to come back. Sometimes it can come back if it is an intermittent hardware failure, but that's a rare situation. Since the customer could reboot the system and see all 256 gigabytes of memory, the kernel booted up fully, and all the services started normally, we can rule out a physical memory module failure. The next possibility is: did the kernel not see all of the physical memory on the system? We have verified that's not the case either, because in the console log from the failed boot the kernel does report that it saw 256 gigabytes of memory. So the kernel did detect all of the physical memory.

That leads us to the third possibility. As the kernel boots up, it detects how much physical memory there is on the system, and then it has to do some massaging of that information: it learns where this memory is mapped, the address ranges - we have got memory from this address to this address, and so on. The kernel takes all of that and builds its own memory map, and the memory map it builds is what it uses to allocate memory to all the processes, drivers, stacks, and everything else on the system. So let's look at that possibility.

When the kernel detects all of the physical memory, it manages the memory in units of pages, not down to the byte level. Each page has a certain size, and the page size depends upon the processor. In this talk I'm going to focus primarily on the x86-64 processor, which has a base page size of 4K. So the kernel takes all the memory that's available, chops it up into 4K pages, and determines: this is how many pages I have got, and this is where they all live. If you look through the console log, at some point, after the kernel has gone through the process of detecting all the pages and entering them in its memory map, it prints a message saying what the last pfn it saw was. A pfn is a page frame number. If you look at the address map of a processor, it can have addresses starting from zero all the way to whatever is the maximum the processor can address. If you divide that entire address range into pages - in our case 4K pages - each one of those ranges is a physical frame. A physical frame may or may not have physical memory backing it. A physical frame may not even be in use, or it may be in use but not by physical memory; it might be used for something else. IO might be mapped in there; you might have external devices that map onto it. So it may not be physical memory; it can be an IO device as well. That's why we talk about a physical page versus a physical frame.
A physical frame can host a physical page, and a physical frame can be empty as well. The kernel reports the last populated physical frame number it saw. Looking at the boot that failed, the kernel reported that last_pfn was hex 0x70000. 0x70000 is page frame 458,752, and if you convert that to gigabytes, that's 1.75 gigabytes. That doesn't sound right, because the system is supposed to have 256 gigabytes of memory. The customer was saving the console log every time they rebooted the system, so I had a couple of other console logs from them. One of them was from the good boot they did right after they ran into this failure and reset the system. Looking at the console log from that successful boot, the last_pfn reported was a very different number, hex 0x4080000. If I translate that into gigabytes, it comes to about 258 gigabytes, which sounds about right for a system with 256 gigabytes of memory. So we are already starting to see that somewhere the kernel is missing something: it sees all of the memory, but then reports a last_pfn that is far too small.

So let's continue with what happened. As the kernel initializes its memory ranges and figures out where all the memory is mapped, at the very end it prints all the memory that it found and the addresses at which it found it. If the system is a NUMA system, which happens to be the case with most modern systems, it has memory attached to each processor separately. Processor zero might have some memory, which is NUMA node zero, and then on NUMA node one, processor one has some memory, and the processors can access each other's memory. There is a cost to accessing memory from one processor to the other, which is why it's important to know where the memory lives - which NUMA node it is on. So at the end of initialization, the kernel prints the address ranges it saw for the memory on each node. When I look at the console log from a good boot, I see that NUMA node zero saw memory starting from address 0x1000 up to some large number, and similarly NUMA node one saw memory ranging from one address to another. If I translate those addresses into gigabytes, on node zero the kernel saw 128 gigabytes and on node one the kernel saw 128 gigabytes. So we got our 256 gigabytes of memory, and that's from a successful boot.

Now, if we look at what happened in the boot where the kernel failed to detect all of the memory, the kernel printed the address range for node zero from 0x1000 to some number, and for node one the range it printed was from zero to zero, which means it detected no memory on node one at all. And the node zero memory, if you take that address range and convert it to gigabytes, translates to only 1.75 gigabytes. So for a system that is provisioned to run off of 256 gigabytes of memory, making use of all of that memory, if you try to boot it up with only 1.75 gigabytes, you are bound to run out of memory very quickly. And that's exactly what happened: the kernel could not start services and things just went wrong from there onwards.

Okay, so we now know the kernel detected all the memory but failed to add all of it to its memory map. Let's talk about how the kernel goes about that process, so we can start to understand what might have gone wrong.
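Before moving on, as a quick sanity check of the last_pfn arithmetic above, here is a small stand-alone program; the two pfn values are the ones from the failed and the good console logs, and the page size is the 4K base page we are assuming throughout.

    #include <stdio.h>

    int main(void)
    {
        const double page = 4096.0;                 /* 4K base page size          */
        unsigned long failed_last_pfn = 0x70000;    /* from the failed kexec boot */
        unsigned long good_last_pfn   = 0x4080000;  /* from the good cold boot    */

        printf("failed boot: last_pfn 0x%lx -> %.2f GiB\n",
               failed_last_pfn, failed_last_pfn * page / (1024 * 1024 * 1024));
        printf("good boot:   last_pfn 0x%lx -> %.2f GiB\n",
               good_last_pfn, good_last_pfn * page / (1024 * 1024 * 1024));
        return 0;
    }

Running it prints 1.75 GiB for the failed boot and 258.00 GiB for the good one, matching the numbers above.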
The way the kernel detects physical memory is this: when a system boots up, the first thing that runs is firmware. Firmware goes through the process of initializing all of the hardware on the system. Firmware also detects where memory modules are populated, and then puts together a list of the address ranges those memory modules map onto, because where a memory module maps depends upon which physical DIMM slot you put the memory in. So it builds a table: I found memory ranging from this address to this address, and then from this address to this address, and so on. Firmware makes this information available to the kernel. When the kernel starts, it reads through the memory map the firmware provided, and as it parses it, it sanitizes it. We'll talk some more about sanitization, but essentially the kernel has to make sure that the memory map it got from the firmware looks good and has no inconsistencies.

How does this memory map come to the kernel in the first place? When a system starts booting, once firmware has completed its initialization, the firmware invokes the bootloader, which happens to be GRUB on a lot of Linux systems. The bootloader starts and reads in the kernel image from the disk, or wherever it needs to get the kernel image from. Before the bootloader passes control to the kernel and starts running it, it also builds a boot parameters page, called the zero page. It puts a whole bunch of information in the zero page, including the kernel command line that we saw earlier. And part of the information on the zero page is what we call the E820 table. E820 is a legacy name that comes from the BIOS days, but the E820 table is the actual memory map that firmware provided. So the bootloader reads the memory map from the firmware, restructures it, and puts it in a form suitable for the kernel inside the zero page. If you want to see what the zero page looks like, I have pointed to the definition of struct boot_params up here; it's in the bootparam.h file, and you can see the member in that structure for the E820 table. The E820 table is really just an array of address ranges: each entry gives the start of the address range, the size of the range, and what type of memory it is.

Now, the thing is, the zero page is a single page, and on x86 a page is 4K, so we can hold only 4K of information in the zero page. That's usually fine: for the E820 table, the zero page can hold up to 128 of these entries. If we need more than that - which actually doesn't happen very often - we have the ability to store any additional ranges in another structure, a struct setup_data. There is a setup_data node of type SETUP_E820_EXT, so we just stash all of the entries beyond 128 in that structure, which can live on another page, and we put a pointer to that page in the zero page. So we can handle more than 128 memory entries. But just keep in mind that 128 entries can cover a lot; it takes a very large system with a very sparse memory map to run out of 128 entries.
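To show roughly what that looks like, here is a simplified sketch of the relevant pieces of the zero page. The real definitions live in arch/x86/include/uapi/asm/bootparam.h and the E820 headers and contain many more fields than shown here; this is just the shape of the data, not the literal layout.

    #include <linux/types.h>

    struct boot_e820_entry {            /* one address range               */
            __u64 addr;                 /* start of the range              */
            __u64 size;                 /* length in bytes                 */
            __u32 type;                 /* RAM, reserved, ACPI data, ...   */
    } __attribute__((packed));

    #define E820_MAX_ENTRIES_ZEROPAGE 128

    struct setup_data {                 /* linked list of extra boot data  */
            __u64 next;                 /* phys addr of the next node      */
            __u32 type;                 /* e.g. SETUP_E820_EXT, SETUP_EFI  */
            __u32 len;
            __u8  data[];               /* E820 entries 129..N go in here  */
    };

    struct boot_params {                /* the zero page, heavily trimmed  */
            /* ... */
            __u8  e820_entries;         /* number of valid entries below   */
            /* ... */
            struct setup_header hdr;    /* contains, among other things,
                                           the phys addr of the first
                                           struct setup_data               */
            /* ... */
            struct boot_e820_entry e820_table[E820_MAX_ENTRIES_ZEROPAGE];
            /* ... */
    } __attribute__((packed));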
Khalid, sorry to interrupt. I have a couple of questions that might be good to answer at this time. The first one is in the Q&A box; would you like me to read it out, or can you see the question? I can see it. So the question is: how does the firmware make this information available? Is there a certain interrupt to get that information? It depends upon the firmware. On the old BIOS systems, you had software interrupts that you could use to make calls into the firmware and read the data. On more modern systems, which are EFI based, a lot of this information is passed via ACPI: the firmware has ACPI interfaces and publishes what those interfaces are. So what the bootloader does is make a call into the ACPI subsystem, which is essentially a function call, and the function returns a pointer to the array.

There is a second question in the chat. I answered it, but if you would like to add more: what is node 0 or 1, and what is meant by "the firmware detects memory"? Okay, so let's talk about node 0 and node 1 first. When you have a system with a single processor and a single memory controller, all of the physical memory attached to the system is connected to that single memory controller, and the processor can access any of the memory connected to it. Memory controllers used to be external to the processor, but now they have moved into the processor. So when you attach memory to a memory controller, you are attaching the memory to the memory controller on that processor. If you have a two-processor system, each with its own memory controller, you could lay out your motherboard so that some of the DIMM slots are connected to one processor - its memory controller talks to that set of memory modules - while the memory controller on the other processor talks to the remaining set of memory modules. So what we have is essentially two memory controllers, and it can scale up; you can go to many more than that. These memory controllers control their own memory, but they also allow one processor to talk to the memory controller on the other processor through an interconnect fabric between the processors. A processor can use that interconnect fabric to send a message to the other memory controller and say, I want to read this particular memory that you have access to.

On a system like this, you have memory connected to multiple memory controllers. When you access memory connected to your local memory controller, you obviously have fairly fast access. But if you have to go to the other processor and ask it to read or write memory connected to its memory controller, you incur a certain delay. Such systems are called NUMA systems - non-uniform memory access - because the cost of accessing memory differs based upon where the memory lives. If you scale the system up to, say, eight or sixteen nodes, then depending upon the connectivity between the processors, it may take two or three hops to get to certain memory, and the cost starts to go up. That cost matters, because if you're running a program on one processor and its memory lives on a memory controller that's three hops away, you're incurring a significant cost. So that's what node zero and node one refer to.
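One way to see those relative access costs on a running NUMA system is the per-node distance files that the kernel exposes under sysfs. A small sketch that prints node 0's view of every node (on a two-node box the file typically reads something like "10 21"):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/devices/system/node/node0/distance", "r");
        int node = 0, cost;

        if (!f) {
            perror("node0/distance");   /* not a NUMA kernel, or node sysfs missing */
            return 1;
        }
        /* One value per node: relative cost of node 0 reaching that node's memory */
        while (fscanf(f, "%d", &cost) == 1)
            printf("node0 -> node%d: %d\n", node++, cost);
        fclose(f);
        return 0;
    }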
The kernel then simply numbers all of these nodes - each memory node comprising a memory controller and its memory modules - and assigns each one a number. So if we are on node zero and we need to access memory connected to node one, we know it is non-local memory, which means it may have extra cost associated with it. So that's what the nodes are.

Let me see the second part of the question: what is meant by "firmware detects memory"? If you look at the actual hardware, where the slots are, when you push a memory module into a DIMM slot, the hardware itself can detect whether a memory module is inserted into that slot. I'm not an electrical engineer, so I can't give you the exact details of how it determines that. There are potentially voltage or current sensors that can detect that a DIMM slot is consuming current, so there is memory inserted there. There are low-level protocols as well, where the processor can send queries and an individual controller - the memory controller, for instance - might be able to talk to the DIMM and come back with information: is there a module inserted in this slot or not? So the low-level hardware does the detection of whether a module is inserted in a DIMM slot, and firmware uses those low-level hardware interfaces to detect where the memory is. That's what I mean by "firmware detects memory." The actual mechanism, I'm sure, has changed over time as new protocols have evolved; I2C is one of them, and it has been used to detect what is out there in the hardware.

Okay, there is another question: at which point would you be able to tell that the issue may not be the kernel, and may be bad memory DIMMs? I think you have covered that on one of the slides. Exactly. If it is a bad DIMM slot or a memory module that has gone bad, simply rebooting the system will not bring the memory back. So that's the simplest test: you reboot the system. You can even power it all the way down, just in case there was a power event that caused a memory module to stop responding; power it all the way down, power it back up. If you see all the memory, it's likely not a bad module. But if the same address range is missing no matter how many times you reboot the system, then you know the DIMM corresponding to that address range has gone bad. At that point you have to go to a lower level than the kernel to find out which DIMM module has gone bad, and typically the firmware will tell you which module it is. Okay, there is just one comment in the Q&A saying that the serial presence detect (SPD) method is used to detect DIMMs. Yes, that is correct; I have seen SPD used before. Thank you for that.

Okay, so: 128 memory entries is what we can fit in the main zero page, and the rest of the entries go into the extended node. Now, as the kernel parses this memory map information, it builds its memory map, and that memory map is visible to the user. For one thing, it is printed as part of the kernel log, so it will be in the console log. And even after the system has booted up and is fully functional, if you ever want to see how the kernel detected memory and what ranges it saw, the kernel publishes the memory map at /sys/firmware/memmap. If you look into /sys/firmware/memmap, it has a bunch of directories.
Each one is one of those E820 entries, and each entry has start, end, and type files, so you can simply cat those files and see what the address range and type are. At the same time, if you look through the kernel log, the dmesg log, you will see that the kernel prints a message saying "BIOS-provided physical RAM map": this is the memory map it got from the firmware. If there are more than 128 entries, it prints those as well; there is a message that says "extended physical RAM map", followed by the rest of the entries, starting from the 129th.

I'm just going to walk through some of the code, as I did when figuring out where the problem might be. Where is the memory setup done in the kernel, first of all? There is the setup_arch() function in arch/x86/kernel/setup.c. If you go through it, you will see a lot of hardware initialization, and much of the kernel data structure initialization is also done in setup_arch(). In setup_arch() there is a call to e820__memory_setup(). e820__memory_setup() is the one that looks at the E820 table, and as it sanitizes it, it makes sure that no address range it got from firmware is empty - it doesn't go from address 0 to 0, or from address x to address x, the same start and end address. It ensures there are no overlapping entries (if there are, it has to resolve them), and that there are no entries where the size of the address range is negative, because again that's not correct; it's potentially a bug in the firmware or something else went wrong. The kernel doesn't want to panic simply because it got bad data from firmware. So it sanitizes all of those address ranges and uses the e820__update_table() function to clean up the E820 table it got from the firmware. And there might be more than 128 entries. If that's the case, the e820__memory_setup_extended() function is called; it parses the setup_data structure on that other page, finds the SETUP_E820_EXT node, and adds all the entries it finds in there to the E820 table as well.

Once we have cleaned up the E820 table we got from the firmware, the kernel adds all of these address ranges to the memblock allocator. The memblock allocator is the early memory allocator: as the kernel comes up, it's going to get requests to allocate memory - drivers are being initialized, the kernel is initializing its own data - so it has to add all of those address ranges to the memblock allocator so it can start allocating memory. It does that with a call to e820__memblock_setup() in setup_arch(). Now, before we can add all of these address ranges to the memblock allocator, we have to make sure we mark the memory correctly: some of the address ranges, even though they're present on the system, are in use by something else. So the kernel goes through the process of marking address ranges - this one is available, this one is reserved, this one is in use by the kernel, and so on. All of these address ranges are handed to the memblock allocator, which does the initial allocations, and then at some point, when we are done with initialization, all of this memory is handed over to the buddy allocator. From there the buddy allocator takes over, and it does the memory allocation for the system from then onwards.
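Putting those pieces together, the early memory setup path looks roughly like the outline below. This is a simplified sketch of the flow in arch/x86/kernel/setup.c and arch/x86/kernel/e820.c, not the literal code; the real setup_arch() does a lot more in between these calls.

    void __init setup_arch(char **cmdline_p)
    {
            /* ... */
            e820__memory_setup();          /* copy the E820 table out of the zero page
                                              and sanitize it via e820__update_table() */
            /* ... */
            parse_setup_data();            /* walk the setup_data list; a SETUP_E820_EXT
                                              node is handled by
                                              e820__memory_setup_extended()             */
            e820__reserve_setup_data();    /* mark ranges holding setup_data reserved   */
            /* ... */
            e820__memblock_setup();        /* feed the usable ranges to memblock, the
                                              early allocator; later it is all handed
                                              over to the buddy allocator               */
            /* ... */
            initmem_init();                /* NUMA init; uses the ACPI SRAT table to
                                              tie memory ranges to nodes                */
            /* ... */
    }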
If at any point you want to see what the physical memory map on the system looks like, just do a cat of /proc/iomem. /proc/iomem shows you how the entire address range is in use on the system. It covers not just RAM; it also shows the address ranges allocated to PCI devices, or to any other devices - say sensor devices - connected to the system. You can see all of those address ranges in /proc/iomem.

I mentioned NUMA systems before, and one of the things that happens towards the end, before we finalize the memory map, is NUMA initialization. The numa_init() routine does that initialization, and it uses the ACPI SRAT table to figure out which memory is attached to which NUMA node. So now we not only know where all the memory is, we also understand how it is configured and what the cost of accessing each piece of memory would be.

Looks like there's a question here; let me address that before I move on. What is the sanitization of the E820 table? What all happens during this sanitization? Sanitization is just that: make sure the address ranges are consistent. We don't have an address range that starts at one address but ends at a lower address - that's not okay. We don't have a range with a negative size, we don't have an empty address range, and we don't have two address ranges that overlap. That's the sanitization. Another question: is the sanitization only required on multiprocessor systems, or could it be required on a single-processor system? It is required on all systems. Single processor, multiprocessor - it doesn't matter; we have to make sure that the memory information we are getting from the firmware is sane. Okay, so moving on.

So we talked about how the kernel detects physical memory and how it manages it. Now let's talk about kexec, because that was the other critical part of this issue. Normal boots were just fine: if you do a full shutdown and come back up, the system never misbehaved. It was only kexec reboot that caused problems. So what is a kexec reboot? In the normal reboot sequence for a system, you are running the kernel normally and you do a restart from there. The kernel goes through the process of shutting down all the services running on the system, and once all the services and all the drivers have been shut down, the kernel does a reset: it goes into the firmware and invokes reset. On reset, the system jumps to an entry point in the firmware, and the firmware takes over at that point. Firmware goes through the process of doing all of its initialization and detection, and when it is done, it invokes the bootloader. The bootloader takes a little bit of time too: it loads the kernel image, builds the zero page, and then transfers control over to the kernel, and now we start booting the kernel. If you look at these steps, firmware itself can take a significant amount of time, and then of course there is a little bit of time the bootloader takes as well.

We have customer systems out there that have very strict availability requirements. For instance, people will promise five nines or six nines of uptime. Five nines is 99.999% uptime, which works out to roughly five minutes of allowed downtime per year; six nines allows only somewhere between 31 and 32 seconds per year. That's not a whole lot of time.
At that level, that is all the time a customer has for any scheduled and unscheduled reboots, so it is extremely important that reboots be very, very quick. What kexec does is provide a way to shorten the reboot sequence we just looked at. Kexec is essentially a bootloader that lives in the kernel. What kexec can do is, while you are still running the system and all the services are up, load another kernel image and prepare it for execution. It does everything a bootloader would do to set that kernel image up, and then we are ready to just jump over to the new kernel. So that's what happens with a kexec reboot: you are currently in the running kernel, you do a kexec load of a new kernel image, and then you do a quick restart where the kernel shuts down all the services and makes a direct jump to the new kernel. We don't go through a reset, we don't go through firmware, we don't go through the bootloader; we just continue with the new kernel. That can shorten the sequence significantly.

To give you an idea of how much shorter it can be: when I implemented kexec on IA-64, way back when the IA-64 processor had just come out, I implemented it on the McKinley processors, and firmware on the McKinley processors at that time took a long time. The total reboot time on the system was about three minutes, and the bulk of that time was spent in firmware initialization. Once I implemented kexec in the IA-64 Linux kernel, the reboot time on the system dropped to five seconds. So kexec can make a very significant difference. If you can reboot your system in five seconds, you can fit half a dozen reboots per year into even a six-nines downtime budget.

So what are the requirements on kexec? Kexec behaves like a bootloader; it's just an internal bootloader. One of the things it has to do is make sure that the hardware itself is in the same state for the next kernel as it would have been if we had actually come from firmware into the new kernel. It takes some work to do that, but there are enough functions implemented in the kernel to accomplish it. Essentially, all of the drivers have shutdown routines, which undo all of the changes made by the driver and put the hardware in the same state it was in when we got it from the firmware. Firmware has already done the initialization, so we don't need to re-initialize the hardware; the hardware would only need to be re-initialized if you were to power the system down and power it back up, in which case it is not in its initialized state, and that is the job of the firmware. Since we didn't power down, as long as we can undo all the changes made by the current kernel, we are good to go for the next kernel. So that's part of the kexec requirement.

The other big one is that kexec has to prepare the zero page that was prepared by the bootloader earlier, and that zero page should have all the same information that would have been available to the kernel if it had gotten control from the firmware instead. Why do we do that? We don't want to special-case a kexec reboot in the kernel. That would be fairly invasive, potentially a nightmare to maintain. So the kernel shouldn't really need to know whether it got control transferred from firmware, or whether it was a kexec reboot.
We have a few small special cases, but for the most part the kernel behaves the same whether it gets control from firmware or from kexec. One of the big things is the E820 table: we have to make sure we get that right. Now, as we discussed earlier, the kernel sanitizes the E820 table, so it is making changes to the original E820 table it got from the firmware. But we know we may need to do a kexec later. As a result, the kernel maintains three copies of the E820 table. There is e820_table_firmware, the original table as it came from the firmware. Then there is e820_table_kexec, which is essentially a copy of e820_table_firmware, but with possible modifications. One of them is that we could stick an MP table in there for old, old systems from before ACPI came around: the MP table implemented the multiprocessor specification, which is how the kernel learned the topology of a multiprocessor system. If the kernel needs to fake up an MP table, e820_table_kexec is the copy that holds the entry for that fake MP table. So essentially those two tables are mostly the same. And then there is e820_table, the one the kernel actually uses, which is the one that has been sanitized and cleaned up. When a kexec happens, we take e820_table_kexec and copy it into the zero page as the E820 table for the next kernel.

So now that we know how kexec works and how the memory map is sanitized: what happened to all the memory? When we look at the failed boot, we did see the kernel report "total RAM covered" as 256 gigabytes. Since the customer had console logs from a few reboots, including some of the successful kexec reboots, I could go through them and make sure that in every one of those reboots the kernel did see 256 gigabytes of memory. Then I started comparing the E820 maps reported by the kernel, just to make sure the E820 map was consistent, because now we know that the table provided by the firmware does get modified somewhat by the kernel before it is passed to the next kexec'd kernel. When I compared the two, I noticed that in the console log for the successful boot after power cycling the system, there were 15 entries in the map, covering the 256 gigabytes of memory. But in the failed kexec reboot, there were 128 entries. That is odd: we went from 15 entries to 128 entries, and that looks suspicious. The number 128 itself is very suspicious, because we know the maximum number of entries you can put in the zero page is 128. So something had gone wrong in the kexec cycle itself.

Then, looking at the BIOS E820 table the kernel printed during the boot after the system was power cycled, when I look at the last two entries, the very last entry covers the bulk of the RAM: its address range translates to about 254 gigabytes. All the entries above it are much, much smaller. Now, when I look at the failed kexec reboot, the last entry printed by the kernel was the entry just before the one that covers the bulk of the RAM; there was no entry in the reported E820 table that would cover those 254 gigabytes. The entry shown here under the failed kexec reboot was the 128th entry, so the 129th entry is the one that would have covered all of the RAM. And we need that 129th entry, which should be in the extended set of entries, in the setup_data node. So now I know it's the 129th entry that the kernel is not seeing. That's where our memory has gone.
Why did the 129th entry not make it to the kernel? I started looking through the kexec code. In the kexec code there is a function setup_e820_entries(); that's the function that sets up the E820 entries in the zero page that then gets passed to the next kernel. At that point it went very quickly: as soon as I started going through the code for setup_e820_entries(), I came across a comment that said "TODO: Pass entries more than E820_MAX_ENTRIES_ZEROPAGE in bootparams setup_data". Well, that explains why the 129th entry was not passed to the next kernel: the code is not done yet. The kexec code is not complete in that sense. So that tells me what happened: somehow we ended up with 129 entries in the E820 table, the 129th entry was never passed to the kexec'd kernel, that memory disappeared, and we ran out of memory. That solved part of the mystery.

The second mystery is: why do we have 129 entries at this point? We started with only 15 entries; we should never have needed 129. So I started looking at the successive boots for which I had console logs - I had two or three of those. I just cut out the E820 table reported by the kernel for each of those boots and started comparing them. When I did, I noticed that from one boot to the next kexec reboot, there was one entry in the first boot that had been split into three entries in the next one. That sounds suspicious again, because we are expanding the number of entries for some reason. And this stayed consistent: I was able to get another console log for a further kexec reboot, and the number of entries had jumped by two again. So we go from 15 to 129 entries as we keep doing successive kexec reboots, because an entry is getting split every time. It takes, I computed, about 57 reboots: on the 57th kexec reboot, the number of entries has gone from 15 to 129, and at that point you start dropping entries and losing memory. So we know entries are being split; that is why the table is growing.

Now the question is: why is the entry being split? I started looking through the code again, and I found the function that does the splitting of an E820 entry: __e820__range_update(). So I looked at which functions call it to split an entry, and it is called by two others. One is e820__range_update() and the other is e820__range_update_kexec(). The difference between the two is that e820__range_update() updates the e820_table pointer that the kernel maintains for its current boot - the sanitized version of the E820 table - while e820__range_update_kexec() operates on the kexec copy of the E820 table. We know the problem happens only with kexec, so the function of interest is e820__range_update_kexec(), and that's where I can start to see which entry it is splitting and why.

So I looked at who calls e820__range_update_kexec(). One caller is e820__memblock_alloc_reserved_mpc_new(). That one is of no interest, because it is called only to create the fake MP table entry, which we don't have here: this is an EFI system, it has ACPI, so we don't need an MP table. So what else? The other caller is e820__reserve_setup_data(). e820__reserve_setup_data() is called early on in setup_arch(), as we are processing all of these address ranges that we saw in the E820 table. The kernel is also trying to figure out: is there an address range that's currently in use by something else, and should we mark it reserved?
And what's happening is that as the kernel goes through this data, it also looks at the setup_data. The setup_data is passed via the zero page, and the setup_data itself consumes a little bit of memory. So as we go through the address ranges, if we see that setup_data lives at a certain address, we know the address range we got from the E820 table potentially has portions that are currently in use by the setup_data. We don't want that setup_data overwritten, so we want to mark that portion reserved, so that the memblock allocator does not end up handing out that part of the memory. What the kernel does in e820__reserve_setup_data() is take that entry and split it: if the setup_data is somewhere in the middle, the top part of the range stays available, the middle part where the setup_data lives becomes reserved, and the bottom part is available again. At some point the kernel is done with initialization and doesn't need that setup_data anymore, so it frees that range, and the range is marked available again. That's why, when I looked at /proc/iomem, or even the BIOS E820 table, I couldn't see this range as reserved; it was marked usable, because all the data the kernel needed from the setup_data had already been consumed and there was no need to keep it around.

Now, in the kexec case, the kernel has to do something very similar, and I started wondering: could it be reserving space for the EFI setup_data, which we don't need anymore because we are done with it? Could we be splitting an entry in the kexec copy of the E820 table when we don't need to? Since I'm looking at an enterprise kernel, which is based on an older kernel, there have potentially been changes in the upstream kernel since then. That's one of the things one should always do when working with the Linux kernel: it's moving constantly, lots of bugs are found, and upstream fixes a number of them. So any time I come across something like this, the first thing I do is check: has there been a change in the upstream kernel? Has someone seen this problem and solved it, or has someone fixed a different problem which happens to fix this one as well? Essentially, don't reinvent the wheel; if the fix is available, let's use it.

The kernel I was looking at was based on 5.4.17, and the upstream kernel at that point was 5.18. So I compared the two versions of e820__reserve_setup_data(), and I immediately saw a change in the upstream kernel, with a comment saying that SETUP_EFI is supplied by kexec and does not need to be reserved. That's because kexec synthesizes that EFI setup_data. We have already thrown it away after initialization; later, when we are getting ready to kexec, we can synthesize that data again, which is why the kernel threw it away earlier. When kexec synthesizes this data, it just reserves a new memory range and puts it there. So what happens on the older kernel is that we reserve space for this setup_data in the kexec E820 table, and then we also create a new EFI setup_data anyway: we split a range for no good reason, because we were going to synthesize that data regardless. That told me that in the older kernel we are reserving a range, and splitting one of the E820 entries, when we didn't need to. And that splitting is what is causing the table to grow from 15 to 129 entries. So I looked at the commit that brought in that code change. The commit was simply to not reserve the EFI setup_data in the kexec E820 table.
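Just to show the shape of that change, here is a hedged sketch of the loop in e820__reserve_setup_data() with the upstream idea applied. This is simplified and is not the literal upstream patch, but it captures the point: keep reserving the range in the running kernel's own table, and skip the kexec-table update for the EFI setup_data node, because kexec will synthesize a fresh one anyway.

    /* Simplified sketch of e820__reserve_setup_data(), not the actual diff */
    pa_data = boot_params.hdr.setup_data;
    while (pa_data) {
            data = early_memremap(pa_data, sizeof(*data));
            len  = sizeof(*data) + data->len;

            /* Always protect the range in the kernel's own, sanitized table */
            e820__range_update(pa_data, len, E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);

            /* SETUP_EFI is supplied by kexec and does not need to be reserved,
             * so do not split an entry in e820_table_kexec for it             */
            if (data->type != SETUP_EFI)
                    e820__range_update_kexec(pa_data, len,
                                             E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);

            pa_data = data->next;
            early_memunmap(data, sizeof(*data));
    }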
When I went through the details in the commit log, it described a problem where you end up reserving the EFI setup_data range over and over again with subsequent kexec reboots. It didn't talk about the system ultimately running out of memory, but it sounds very much like what I'm seeing, except that in my case we ultimately run out of memory. So it's a simple problem at that point: fixing it is easy, just take the upstream commit and backport it. That was the solution to the problem, and that was my journey through this bug. Now I'll talk more about some of the other tools, tricks, and techniques for debugging the MM subsystem. But before I move on, if there are any questions on this, let me know. I don't see any questions. Yeah, there are no questions in the chat or Q&A at the moment. Okay.

So we will move on to how you would go about debugging an issue in the MM subsystem. Very often you will find yourself in a situation where you can't really make changes to the kernel in order to debug a problem, so you have to know what information you can extract from the currently running kernel. The kernel does make a lot of information available; you just have to know where to find it, how to interpret it, and how it might be relevant to the problem you're looking at. The kernel exposes information through procfs, sysfs, and debugfs. On most systems, procfs is already mounted, on /proc. sysfs is typically mounted on /sys. debugfs is not always mounted; most systems do mount it, but it depends. A lot of vendor distros, I believe, do mount debugfs, and you'll typically find it at /sys/kernel/debug. The files under these file systems contain lots of counters that count events that have happened on the system, and counts of objects that are currently allocated. Looking at these numbers can give you an idea of what's happening in the MM subsystem and what its current state is, and if you have some historical information, you can also see how the state of the system has changed.

To understand the files, counters, and events you see in these file systems, there is documentation. If you go into the kernel source tree (kernel.org also has all of this documentation online), /proc is documented under Documentation/filesystems/proc.txt, and sysfs is documented under Documentation/filesystems as well. There is also the directory Documentation/admin-guide/mm, with a whole bunch of files in it; it's worth going through those. Most of the files are fairly up to date. Some might be missing a counter here and there, but that's no big deal; you can figure out what that counter means. One important thing to know is that the files under /proc, /sys, and /sys/kernel/debug are updated dynamically: at the moment you do a cat, you get the current state of those counters. So I'll talk about some of those files and where those counters come from, so that you can understand what they mean. Because you see a number - you look at /proc/meminfo and there is a number there - what does that number mean? If you can correlate it with a variable in the source code, you can understand what that number means.

So let's look at /proc/meminfo. The /proc/meminfo data is populated by the meminfo_proc_show() function, and that function lives in fs/proc/meminfo.c.
So if you do a cat of /proc/meminfo and you see a counter, just go look at that function and you can see which variable the kernel read to populate that number. There are a few very interesting counters; I use them all the time when I'm looking at an anomaly in the MM subsystem. I can cover only a few here.

MemFree tells you how much memory is sitting on the free lists. MemAvailable is a slightly different number: MemFree is the memory that can be allocated immediately, but there's a lot of memory on the system that can be tied up in the buffer cache and page cache, and a lot of that memory is reclaimable. I gave another talk about memory reclamation, so you can go through that talk to see what memory reclamation means, but essentially it's memory that is currently allocated but not really in use, and it can be reclaimed and made available. That's what MemAvailable says: here's how much memory I can make available through reclamation. Cached tells you how much memory is consumed by the page cache. Mlocked: processes can call mlock(), and when they do, they take a range of pages and lock them in memory, which means you cannot reclaim them and you cannot swap them out. It's good to know how much memory is locked up like that and cannot be touched; Mlocked gives you that number. Slab is another interesting one. A slab is essentially a cache of objects. If the kernel is going to allocate a type of object over and over again, which it does for lots of its data structures, it could go to the general-purpose allocator every time, but that ends up being inefficient. So the kernel makes a slab allocator available: you create a cache of objects and then simply ask for yet another instance of an object. The slab allocator grabs a chunk of memory based on the size of the objects in the cache and keeps that cache handy; any time you want space for an object, you ask for it, and you return it when you're done. The Slab counter tells you how much memory is consumed by slab objects. And then, of course, HugePages_Total is a useful one. This is the pool used by HugeTLBFS: the pages allocated for HugeTLBFS are essentially not available for anything else - they are used only by HugeTLBFS - so knowing how much memory is consumed there is important.

Khalid, there are a few questions now. There is a question in the chat about mm_init; can you see that one? Yes, I can see it. Sure. Okay. The question is: mm_init(), after setup_arch() - is this the point where the kernel gets access to all memory, as it walks through all the pages and marks them? The kernel does walk through all the ranges, which it also does in mm_init(); is it right to say this impacts boot time? So, I'm not sure what you mean by the kernel walking all the memory, because it's not going to walk it in the sense of reading and writing all of the memory reported by the E820 table. You can force it to do that, but that's not what the kernel typically does. It simply adds all the pages it finds in those address ranges to the memblock allocator, which then finally gets handed over to the buddy allocator. Pages typically are not zeroed out by the kernel until they are allocated.
It's at allocation time that we walk through the page and write zeros to all of it, to make sure we don't leak data: if a user comes along and asks for memory and we give them a page without zeroing it out, we obviously leak data from the last user of that page to the new process. So the kernel really just walks the ranges in the sense that it ensures each range is valid, looks good, is consistent, and that there are no overlaps.

Okay, and then there's a question in the Q&A. There is also a hand up - Sumitra Sharma, would you like to ask your question, or would you like to type it in? Okay, while Sumitra types her question, there is the one in the Q&A. This one is: we are seeing issues where we don't lose memory, but the system caches so aggressively, and memory gets so highly fragmented, that when the time comes to allocate a large block of memory, the OOM reaper gets activated. Yeah, I have seen those. It's an ARM embedded system, 4.14.2, with 768 megabytes of memory. I covered some of this in my other talk about memoptimizer. That's a tool I have written; memoptimizer is now called adaptivemm because it does more than the original memoptimizer did. Essentially, what you're describing is that as memory is allocated, some of it ends up in the buffer cache or page cache. Even though the reference count on the pages sitting in the cache is zero, those pages are just sitting there idle and have not been reclaimed. Until those pages are reclaimed, you cannot put them into a contiguous run of free pages. So, for instance, if you have heavy fragmentation and you try to allocate a huge page - a huge page can be two megabytes or even a gigabyte on x86 - you might be able to get a 4K page, but to get a two-megabyte huge page you need 512 consecutive 4K pages. If you are in a situation where you have no 512 contiguous 4K pages and you try to allocate a huge page, you are going to OOM.

The solution to that, of course, is for the kernel to take the free pages and essentially compact them, so it keeps as many contiguous runs of pages as possible. This applies not just to huge pages; it also applies to higher-order pages, because not all allocations are order-zero pages. The order is a buddy allocator concept. What the buddy allocator does is keep lists of pages that are contiguous. An order-zero page is a single page, and those order-zero pages are potentially scattered non-contiguously through memory. An order-one allocation is two contiguous pages; the order is really the power of two, so order two would be four contiguous pages. The buddy allocator keeps a list for each of these sizes of contiguous pages. Some of the subsystems in the kernel - I have seen drivers, especially RDMA drivers - request higher-order pages: instead of allocating order-zero pages, many of these drivers allocate order-two, order-three, order-four pages. If you don't have enough contiguous pages because of high fragmentation, the solution is compaction of the memory. Compaction is done by a kernel thread, kcompactd, and kcompactd can at times fail to keep up with the memory demand. When that happens, you end up with enough memory available, but none of it contiguous, and an allocation fails. So somehow kcompactd has to stay ahead of the memory allocation.
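To make the page-order arithmetic from that answer concrete, here is a tiny program that prints what each buddy order means in terms of contiguous 4K pages; a 2 MB x86 huge page corresponds to order 9, that is, 512 contiguous base pages.

    #include <stdio.h>

    int main(void)
    {
        const unsigned long page_kib = 4;   /* 4K base pages assumed throughout */

        for (int order = 0; order <= 10; order++) {
            unsigned long pages = 1UL << order;    /* order n = 2^n contiguous pages */
            printf("order %2d = %4lu contiguous pages = %6lu KiB\n",
                   order, pages, pages * page_kib);
        }
        return 0;
    }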
That was the topic of my talk on reclamation and compaction. The adaptivemm tool that I wrote is a daemon that looks at the memory consumption pattern of the system right now, and based on that it projects what the memory consumption pattern is going to be in the future. Then it computes whether we are going to end up in a situation where we don't have enough memory at the time an allocation request comes in. When that allocation request comes in, you are going to go either into a stall, waiting for memory to become available through compaction or reclamation, or you will OOM. So adaptivemm essentially looks for the potential of these events in the future and will kick kcompactd and kswapd to reclaim memory and compact it proactively. Yes, the recording of the previous talk is on the LF Live Mentorship Series webpage.

Okay, the next question is: as a sysadmin, from time to time I have seen runaway processes that consume memory and seem to continue holding on to that memory even after the process is long gone. Which counters are best for investigating this issue, or issues related to reclamation of memory, or is rebooting the only option? Well, it depends on where the memory is locked up. If a process dies and its pages are left in memory with a nonzero reference count, that's typically a kernel bug. Any page with a nonzero reference count cannot be reclaimed, because that reference count says it's in use, leave it alone. What happens more often is that a process dies, its pages were in the page cache, and the reference count for all the pages it held in the page cache gets reset to zero, but those pages are still sitting in the page cache; we haven't reclaimed them yet. The way those pages will be reclaimed is through kswapd. It's kswapd's job to go through the cache and reclaim this memory. You can also kick the system into reclaiming those pages. If you look in /proc, under /proc/sys/vm there is a file called drop_caches, and you can echo a number to drop_caches. Check the documentation, because sometimes I get them reversed, but if you echo 1 to it, it will reclaim the page cache; if you echo 2, it will reclaim the slab caches that can be reclaimed, things like dentries and inodes; and you can combine the two and echo 3, and it will reclaim everything. It's a fairly destructive operation, because if you echo 3 to drop_caches the system will first sit there and go through the entire cache and reclaim every page it can, so the system may stall for a little bit while it does that. The second thing is that as we do I/O to the file system, we bring pages in and hold them in the buffer cache. Even after whoever asked for a page is done with it, we tend to keep these pages around because someone may ask for that page again; why get rid of it? That's part of how the kernel maximizes the use of memory: if there's memory available and nobody is using it, why not cache some data, so the next time it's asked for it's already there in memory? Reading from the disk, or going over the network if it's, say, RDMA, is expensive. When you echo 3 to drop_caches, the kernel goes through and throws away all of these pages as well, which means if someone asks for that data from the disk again, the kernel just threw away its cached copy, so it has to go back to the disk.
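As a rough sketch of what that looks like from a root shell (the numbers you get back will obviously depend on your workload):

    sync                                  # write dirty pages back first so they become reclaimable
    echo 1 > /proc/sys/vm/drop_caches     # drop clean page cache
    echo 2 > /proc/sys/vm/drop_caches     # drop reclaimable slab objects (dentries, inodes)
    echo 3 > /proc/sys/vm/drop_caches     # drop both
    grep -E 'MemFree|MemAvailable|Cached' /proc/meminfo   # see how much memory came back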
So there are side effects of doing that, but that's one way to reclaim all the memory, and I do use it as a debugging step. If someone says I don't see free memory available, then after I've done a lot of other analysis I might ask them: if the memory is locked up in page cache, let's see if we can release it; do that. If all the memory became available, okay, it was simply tied up in page cache, no problem, move on.

So I have a question, Khalid. You might have a bug report coming in and you don't always have a reproducer; you might just have a scenario, this is what we see that causes this problem. You have to reproduce it, obviously. So do you go through the process of coming up with a reproducer yourself, writing a test or a reproducer that could also be used as a regression test later? Could you elaborate on that a little bit? Definitely, because if you can reproduce a problem on your own test system, then you're golden; now you can play with that test system. So when a customer describes a problem, one of the things I do is try to reproduce it on my own system. In this specific case, the bug I talked about, I could not reproduce it on my test system. We have a test team, and they could not reproduce it on their systems easily either. When you end up in that situation, your options are a lot more limited. But you take your understanding of what the workload was and what steps the customer took that ultimately led to this situation, and combine that with the data you are seeing from the system currently. If you can come up with a way to reproduce the problem, that is the most helpful thing you can do for yourself. Now, there's the other aspect to it. Say this is a nasty problem that results in a kernel panic. If it is because of a kernel bug, we don't want such kernel bugs to go out into the field. So being able to take a reproducer and turn it into part of a regression test suite is definitely useful. There's a selftest suite in the kernel sources already, so anytime you can find a reproducer and rewrite it in a way where it becomes usable as a regression test, that's a very good thing to do.

Great. I'm not sure if this question in the Q&A from Anthony is answered. Yeah, I did answer that question. Rebooting is not the only option; you can also use that /proc/sys/vm/drop_caches file to drop the cache and see if you can recover the memory. How to investigate this issue, that was the other part. There are some counters that can be useful. /proc/meminfo will tell you how much memory is tied up in cache, so that's a useful one. Mlocked is another useful one, because if a runaway process locked pages in memory and then went away but the pages somehow never got unlocked, Mlocked will tell you how much memory is being consumed with no real user. Another one I find useful is in /proc/vmstat: there are numbers under allocstall_*, a whole bunch of these, and similarly there is compact_*. These give you counters of how many times we got stuck waiting to allocate memory. When an allocation request comes in and the system doesn't have enough memory available right away on the free list, it has to go and reclaim memory. What happens is the process that requested memory goes into a stall while the kernel goes and finds memory; the kernel does the reclamation, comes back, and says, okay, I have a page available for you now. But that is a stall, and it causes one of these counters to increment.
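A quick way to keep an eye on these — a minimal sketch, assuming your kernel is recent enough to export the per-type allocstall_* and compact_* counters in /proc/vmstat — is something like:

    grep Mlocked /proc/meminfo                     # memory pinned by mlock() that cannot be reclaimed
    grep -E 'allocstall|compact_' /proc/vmstat     # stalls waiting for reclaim, and compaction activity

Run it twice a few seconds apart; it is the rate of increase that matters, not the absolute value.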
If the allocstall counter is constantly going up, especially at a high rate, you know you are running into allocation stalls very often, which means the kernel is constantly having trouble finding free memory. So that's a good counter to keep track of. And then compact_stall is another one; it tells you when a process requested a higher-order page — it may have come through a driver subsystem — and a higher-order page was not available, so the kernel had to go into direct compaction, which forces the requester to wait. Essentially the allocation is not going to return; instead the kernel jumps to the function that scans memory, finds the free pages, and moves pages around, which is an expensive process, because to build a contiguous free block it has to move the contents of in-use pages elsewhere. It goes through this whole compaction process until enough higher-order pages become available, and only then can the kernel satisfy that allocation request and return the pointer to the requester. Whenever that happens, the MM subsystem also increments the compact_stall counter. So looking at these two counters, you can get a feel for how often you are running into these situations. If you have runaway processes that are constantly requesting memory and never freeing it, these numbers will keep going up. And then there are a couple of other files; I think I have them listed here. The last file on this slide is /sys/kernel/debug/extfrag/extfrag_index. That's a useful one that tells you the current level of fragmentation on the system. There's a documentation file that explains how to interpret those numbers; take a look at that to see how bad the fragmentation is on the system. Okay, I'm just going to go over these quickly since we are running out of time. /proc/vmstat is another useful one; take a look at the data in there. /proc/zoneinfo is very useful as well, especially if you are looking at watermarks. If you're curious about what watermarks are, I explained that in my adaptivemm talk. And then these are some of the other interesting files you can look at. They all have a bunch of data, and it's useful to go through them. Most of the data is explained either in a documentation file, or you can just look at the code. And then there are a few debug tools. ftrace and kprobe trace events are available in the kernel; they are documented in the kernel Documentation directory. Using these you can trace how often the kernel is hitting certain function entry points. drgn, that's an interesting one, and a more recent one. It is a scriptable debugger that lets you do lots of very interesting things. Essentially it lets you run scripts against the running kernel, and you can print counters, you can print events, you can even read kernel variables using drgn. So take a look at that; there's a talk from a couple of years ago on drgn that goes into more detail. And then of course, Brendan Gregg's website has lots of performance monitoring and tracing tools. So since we are at 10:27, I hope there are no questions left unanswered at this point. I don't see any. I'll hand it back to Candace for the wrap-up. Thank you so much, Khalid, and Shuah, for your time today. And thank you, everyone, for joining us.
As a reminder, this recording will be on the Linux Foundation's YouTube page later today. And a copy of the presentation slides will be added to the Linux Foundation website. We hope you are able to join us for future mentorship sessions. Have a wonderful day. Thank you.