Okay, hello everybody, and welcome to the last talk before lunch. I will try to explain a bit of the internals of kexec and kdump.

My name is Matthias Brugger. I work for SUSE, mainly on ARM64: I work on the enterprise products of SUSE for ARM64 and also contribute to openSUSE. As part of this work I was working on kexec and kdump for ARM64, and I want to explain a bit of what I learned. Apart from that, I'm the maintainer of the MediaTek system-on-chips in the Linux kernel, and I do a lot more stuff.

Okay, what will we talk about today? I will quickly introduce the use cases that exist for kexec and kdump, and then we will dive into the internals of user space and kernel space. I will describe a bit what support is given in openSUSE, which for sure exists in other distributions as well, and then we will have a quick demo in QEMU.

Okay. So basically there are three use cases. The first is, for me, the most important one, and for sure it is why this was developed in the first place: debugging a system. At SUSE, for example, we have a lot of customers that have hardware which we maybe don't have in-house, or they are running some legacy code on their systems, so that we don't get a good reproducer when their system crashes. It might be that we don't have good logs, because the kernel crashes so badly that the logs are actually not usable. For this case we can use kdump to create a dump file, which you can inspect afterwards to see what actually happened and why the system crashed.

The next use case is to boot a new kernel without rebooting the system. For me, when I first heard it, that was a bit weird. But the idea behind this is that you have, for example, a really big machine which takes a long time to boot up, because the firmware has to enumerate all the hardware, etc.
Or you have a pre-production system where even the firmware is under development, and you have some beta version of the firmware. Then you can have bugs in the firmware, so it might not be certain that the system comes up every time you boot it. So if you work on some kernel stuff, it might be useful to just reboot the kernel from the kernel and skip all the firmware initialization.

There was also a talk yesterday about using Linux to boot Linux, and that is exactly what is done on IBM s390, which are some really big machines: they have Linux as a bootloader, and then they kexec into the distribution kernel.

Throughout the talk I will talk about the first use case, which is debugging a system. The other two use cases are technically basically the same, and I won't go into detail about them. kexec and kdump have some generic infrastructure, but there are some architecture-specific parts, and where we go deep into the architecture-specific parts I will talk about arm64, because that's what I know best.

So, the idea when you debug a system: you have a production system with your production workload running, some really bad bug in the kernel happens, and then a capture system gets started. There are different names for this capture system: crash system, panic system. I will try to stick with capture system or capture kernel in this presentation, to not confuse too much.

So what does this look like? I made a little crappy graphic. So imagine this is... You can see this here, right? Yeah. Imagine this is the memory of your system.
So you have, somewhere in memory, your production kernel, which is running, which this arrow shows, and then you have your System RAM. When the production kernel crashes, your capture kernel gets started, and the capture kernel then takes all the memory and the kernel of the production system and creates a dump file.

So which parts are involved in this? There's a user space part, which is called kexec-tools, that is basically used to prepare the capture system. Then, of course, the kernel itself, which needs to execute the capture kernel on a crash. And there are some other user space tools like makedumpfile, which is used to create the actual dump file from the capture system. To inspect these dumps you can use crash, or crash-python, which is an implementation effort done by SUSE so that you can use Python scripting to inspect the dumps. And then there are some distro programs to make it easier to set things up.

Okay, so kexec-tools. There are different tools in the repository, but the most important one is called kexec. So this command line, what it actually does: it loads a kernel and an initrd and reuses the boot parameters of the kernel of your system. Then, with -e, you can execute this new kernel, and it will kexec into the new kernel. What it internally does is call reboot with some magic number; the kernel recognizes this magic number and says: okay, I don't have to reboot into the firmware, but reboot into the loaded kernel. And now the confusion starts, as you can also load a capture kernel, which is done with -p, which stands for panic.
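The "magic number" passed to reboot can be made concrete with a small sketch. The constants below are not given in the talk; they are the values from the kernel's reboot UAPI header as far as I know them, and the helper names `reboot_args` and `cmd_as_ascii` are made up for this illustration. Nothing is actually rebooted.

```python
# Values as defined in the kernel's UAPI header include/uapi/linux/reboot.h
# (not quoted in the talk; treat them as background information).
LINUX_REBOOT_MAGIC1 = 0xFEE1DEAD
LINUX_REBOOT_MAGIC2 = 672274793
LINUX_REBOOT_CMD_RESTART = 0x01234567  # ordinary restart through the firmware
LINUX_REBOOT_CMD_KEXEC = 0x45584543    # restart into the kernel loaded with kexec


def reboot_args(into_loaded_kernel):
    """Return the (magic1, magic2, cmd) triple a reboot(2) caller would pass.

    Hypothetical helper for illustration only: it just builds the arguments.
    """
    if into_loaded_kernel:
        cmd = LINUX_REBOOT_CMD_KEXEC
    else:
        cmd = LINUX_REBOOT_CMD_RESTART
    return (LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, cmd)


def cmd_as_ascii(cmd):
    """Decode a 32-bit command value as four ASCII characters."""
    return cmd.to_bytes(4, "big").decode("ascii")
```

Decoding the kexec command value as ASCII shows that the number was chosen as a mnemonic rather than at random.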
And you can unload the kernels, and there are some architecture-specific options, which are not relevant for us now.

Okay, let's have a look under the hood at how kexec-tools actually works. We remember that the capture kernel will make a dump file out of the production system, so the question is: how does it do that, and what does it need to know? It needs to know where the actual capture kernel is, so that it can load the capture kernel when the production system crashes. We must know where the usable memory for the capture kernel is, because we can't just use the memory that your production system uses: you would overwrite that memory, and maybe overwrite the hints you need to find out what really happened. You need to know where the user space of your capture system is, and you need to know where the production kernel and its memory can be found.

So what kexec-tools, or kexec in general, does: when you boot your production system you pass a kernel parameter called crashkernel, which creates a reserved memory area. So if these boxes down here are your memory, then the gray part is the reserved memory area, which is not used by the production system. It can be a bit tricky to size it, because you don't want this memory area to be too big, since it is taken away from your production system, but you don't want it to be too small either, because then maybe your capture system won't start.

You can see there are different segments in this reserved memory area, and these are called kexec segments. This is a data structure which is shared between user space and the kernel, and it is used in the following way. When you, for example, load the kernel with kexec-tools, kexec-tools allocates a buffer at this address, well, at this pointer, with the size bufsz, and then copies the kernel into this
buffer. Then it searches the reserved memory area for a hole where it can actually copy the kernel to later, and points this void *mem pointer to that location, so it would be to here. That's what it does with the kernel and the initrd, and also for the ELF core header, the device tree blob and the purgatory. I will explain now what these are and what they are doing.

So the ELF core header is like the core part of kdump: through it the capture system knows where the memory of the production system is. It is an ELF header which has different program headers, which then point into the memory of the production system. Remember, this ELF core header lives in the reserved memory area.

The first program headers are one for each CPU, for what are called crash notes. A crash note is a small part of memory in the production system that is reserved for every CPU, so that when the CPU crashes it can write the register state, PID, etc. in there, and you can later find out in which state every CPU was when the crash happened. From user space you can read this file in sysfs to find out, for every CPU, where this memory is, and what kexec-tools does is it actually
writes the address into the program header.

Then there's another important data structure, which is called vmcoreinfo. This is a data structure, also in the production kernel, which holds all the information about the data structures of the kernel: for example the page size of your system, the offset of flags in some structs, etc. You will need this information later to actually understand, in your dump file, what the kernel looks like. This can also be read by kexec-tools through this sysfs file, which has the address and the size, and these also get copied in here. And yeah, that's it.

Last but not least, you have the memory of the kernel itself. What kexec-tools does is parse /proc/iomem, where you can find the System RAM parts, and in which part of the System RAM the kernel got loaded. It takes these addresses as well and puts them into some program headers. These structures are later used to create a file in the capture system that's called /proc/vmcore, from which you can then create a dump file.

So the next segment is the device tree, or device tree blob. Who knows what a device tree is?
Quite some people, nice. So device tree is like the equivalent of ACPI, and it is used by Arm and by PowerPC. Arm64 actually has the possibility to boot with either device tree or ACPI. To get around this confusion, what the system does is: when the production system boots up, the EFI stub creates a small flattened device tree, which just holds some really basic information about UEFI, about where the initrd can be found, and the boot parameters.

kexec-tools reads this file and then updates the initrd entry and points it to the initrd in the reserved memory area. Then it adds a value for the ELF core header, which points to this segment, and the usable memory range, which is all the rest of your reserved memory area that is not used by the kexec segments. That is the memory your capture system can actually use to boot up and do all the stuff you want it to do.

Yeah, and the last segment we have is the purgatory. The purgatory decides over heaven or hell. Heaven, of course, is that you are lucky and can boot your capture kernel to actually create a dump. Hell is that your system is so broken that you can't even boot that capture system. And how does it do that?
Well, the purgatory, when it gets set up by kexec-tools, holds the hashes of the other four kexec segments. The purgatory is the entry point after a crash, and the first thing it does is check whether the hashes are correct. Because if not, you can't be sure that the capture system you boot up really holds information that's valid. For example, imagine that the ELF core header is broken somehow and points somewhere else: then the information you have is not correct, and you won't be able to reliably debug the system. There's also a command line option for kexec-tools to ignore these checks, which is nice, because then you always go to heaven. Yeah.

Okay, what the purgatory actually does. For arm64 it looks like this; I hope you can read it more or less. This is the assembler. It's not the whole file, I deleted some lines. Basically, it sets up a small stack, then it calls purgatory, which is a C function, and the only thing that actually does is check the hashes of the different segments. It then loads the kernel address into a register and the device tree address into a register, and boots the kernel. This part is just the normal way to boot arm64, so nothing special in here. What kexec-tools does with this file is update the kernel address and the device tree address to actually point to the kernel and the device tree in the reserved memory area.

Yeah, so in the end our knowledge tree, let's call it like this, looks like this. The purgatory, which is the first thing to be executed after a crash, knows where the kernel and the device tree are. It starts the kernel, the kernel can read the device tree address from the register, and the device tree holds the information about the ELF core header, the initrd and the usable memory that it can use. So that's good.
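The hash check that the purgatory performs can be sketched roughly as follows. This is only an illustration of the idea, not the code from kexec-tools: the function names, the per-segment dictionary layout and the `ignore_checks` parameter are assumptions made for this sketch, and SHA-256 is used here as a plausible choice of hash.

```python
import hashlib


def record_hashes(segments):
    """At load time: remember one digest per segment (name -> hex digest).

    `segments` maps a segment name to its bytes; this layout is hypothetical.
    """
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in segments.items()}


def purgatory_check(segments, recorded, ignore_checks=False):
    """After the crash: decide whether the capture kernel may be started.

    Returns True ("heaven") if every recorded digest still matches, or if
    the checks were disabled when the capture system was loaded.
    """
    if ignore_checks:
        return True
    return all(
        hashlib.sha256(segments.get(name, b"")).hexdigest() == digest
        for name, digest in recorded.items()
    )
```

If the crashing production kernel scribbled over any of the segments, the comparison fails and the capture kernel is not started.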
So now we have set up all these segments in kexec-tools, and we have to pass this to the production kernel, so that it can actually prepare for a crash and start the purgatory of the capture system. This is done using the kexec_load system call. There's also another call, kexec_file_load, which I won't explain here, because it works a bit differently: it does most of the stuff we have seen here in kernel space. So it passes the entry point of the purgatory to the kernel, and an array of the segments, and the number of segments that exist. That's all we have to do in user space. We have passed all this information to the kernel, and now I will talk a bit about what the kernel actually does.

In the kernel internals we will look at three parts. The first part is how the production kernel prepares the capture system through the kexec_load system call. Then we will have a look at what actually happens when your production system crashes. And then we will see what happens when the capture kernel boots up.

So when you load your capture system, what the kernel actually does: it checks that you're root, because you don't want just any user to be able to load a new capture system on your machine. It checks some flags, which basically means that you are not allowed to load another capture system if you already have one loaded.
So you have to unload it first and then load the new one. The reason behind this is that if you get a crash while updating the capture system, everything is lost.

Then it checks the number of segments. There is a limit, which I think is 16, on the number of segments you can use. We use only a handful, so we are totally fine here. It then allocates a structure in the kernel which is called kimage, which holds the kexec segments and the entry point of the purgatory, and it allocates a control page, which is later used to actually load the purgatory. We will see that in a minute.

Well, then the kernel will do some sanity checks. Basically it checks that the segments are not overlapping, because that would mean that something went really wrong, that they are page aligned, and that they're in the reserved memory area. It also checks that memsz is bigger than or the same as the buffer size. The reason behind this is that the memory you use in the reserved memory area is page aligned, and this buffer is created by malloc in user space, so it's not page aligned.
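The sanity checks described so far can be sketched like this. It is a simplified model and not the kernel's code: the `Segment` class mirrors only three fields of the kexec segment structure, a 4 KiB page size is assumed, the reserved area is treated as a half-open range, and the name `check_segments` is made up for the example.

```python
from dataclasses import dataclass
from typing import List

PAGE_SIZE = 4096    # assumption: 4 KiB pages
MAX_SEGMENTS = 16   # the limit mentioned in the talk


@dataclass
class Segment:
    bufsz: int  # size of the user space buffer
    mem: int    # destination address in the reserved memory area
    memsz: int  # size reserved at the destination


def check_segments(segments: List[Segment], res_start: int, res_end: int) -> List[str]:
    """Return a list of problems; an empty list means all checks passed.

    The reserved area is modelled as [res_start, res_end).
    """
    problems = []
    if len(segments) > MAX_SEGMENTS:
        problems.append("too many segments")
    for i, seg in enumerate(segments):
        end = seg.mem + seg.memsz
        if seg.mem % PAGE_SIZE or end % PAGE_SIZE:
            problems.append(f"segment {i}: not page aligned")
        if seg.mem < res_start or end > res_end:
            problems.append(f"segment {i}: outside the reserved area")
        if seg.memsz < seg.bufsz:
            problems.append(f"segment {i}: memsz smaller than bufsz")
    # Overlap check: after sorting by address, neighbours must not intersect.
    ordered = sorted(segments, key=lambda s: s.mem)
    for a, b in zip(ordered, ordered[1:]):
        if a.mem + a.memsz > b.mem:
            problems.append("segments overlap")
            break
    return problems
```

A 5000-byte buffer, for example, needs a destination of two pages (8192 bytes), which is why memsz may exceed bufsz but never the other way round.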
So memsz can be bigger than the buffer size, but it can't be smaller. And last but not least, it checks that all the memory needed by your kexec segments is at most as big as half of the RAM that is available on your system.

So when that's all okay, it calls copy_from_user and copies these buffers, with the buffer size, to the address of each segment in the reserved memory area. Then it clears the PTE valid bit for the pages of these segments, which basically means that the MMU won't access them. So that's what it does. Well, one last thing, which is not on the slides: it actually sets a flag to say that we have a capture system loaded.

Then, when the kernel crashes, it realizes that a capture system is loaded. It starts by disabling the IRQs and saving the CPU registers, and it also writes the time of the crash into the vmcoreinfo data structure, so that when you have a crash dump you can also read out when the system actually crashed. Then it sends a signal to all the other processors in your system to shut down, but to shut down using the CPU crash stop, which is a special handler that basically writes all the CPU registers, all the CPU information, into the crash notes, as we have seen beforehand, disables the IRQs, and then tells the firmware to actually shut down the CPU. So in the end you have only one CPU running. That is what you hope, so it checks whether only one CPU is running, which is the CPU that crashed. If this is not the case, then you have a problem, because it can be that the capture system won't boot up correctly. But it just throws a warning and goes on, because you already crashed your production system. So what can you lose?
Yeah. Then it uses some relocation code that gets copied into the control page. This code shuts down the MMU and disables the caches, and then it checks whether it has to relocate the kernel. That's the case if you just want to boot a new kernel, so that's not the case here. And then it jumps to the purgatory.

We have seen what the purgatory does beforehand: it checks the kexec segments and then starts the kernel. The loaded kernel can then read the device tree entries, which point to these kexec segments, to know where and how big the memory it can use is, where it can find the ELF core header, and where it can find the initrd, and then it can start the initrd.

It reserves some memory into which it copies the ELF core header. Remember, the ELF core header only holds pointers, so it copies the pointers and not what they point to. And it creates a file called /proc/vmcore, which is a file that you can then use to create a dump file. This file has a special read function which dereferences the pointers, so that when you read this file you actually read the memory here, and not just the addresses.

Okay, so, the distribution part. So, to set this all up...
It's a bit difficult, because apart from deciding how big your reserved memory area should be, you need a special initrd, and what goes into it depends, for example, on where you want to save your dump file in your capture system. If you want to save it, for example, via FTP to an FTP server, then you will need all the drivers for your network card, a network stack, and an FTP program in your initrd. Or you want to save it to disk, which is not what you should do, because you could have had a crash in the file system, and then you can't write to disk. But you can do this, and then you will need the file system kernel module in your initrd, etc. So there are tools to help you set that up, so that you don't need a full-blown initrd which would just waste memory that you don't really need.

Normally these distribution parts also have a way to say that the dump is created and stored somewhere automatically, so you don't have to do any manual intervention, and that the system then reboots, so that at least the system comes up again and your downtime is minimal, while you can then investigate what really happened.

At SUSE and openSUSE we use a program that's called Kdump, with a capital K, which is the Swiss army knife for setting up kdump. It has basically two parts. There's one part for the production system, which is some dracut scripts to create the initrd, and then scripts to load the capture system, which basically call kexec -p with the kernel, the initrd, et cetera, et cetera. And there's a tool to approximate the size of the reserved memory area, so that you don't have to do trial and error. On the capture system side, you can configure how to create the dump.
This is done mainly by using makedumpfile, where you can decide, for example, whether you want to add free pages to your dump, or whether you care about pages that are all zero, etc. Because if you just copy all your memory, that might be a really, really huge dump file, and there might be parts of the memory that you're not interested in. You can also define where the dump gets stored. So this is a configuration file.

And if you're really lazy and you don't want to understand this configuration file, you can use a tool that's called YaST, which is the openSUSE or SUSE configuration tool for the system, and there's a module called Kdump, which you can see here. It looks nicer if you start it from a GUI, right? This is from the serial console. So you can decide, for example, whether you want to compress the dump, or which pages you want to include in the dump, and you can also, for example, set up email notifications, so you get an email: hey, one of your systems crashed badly, so you'd better have a look.

Okay, so I will give a quick demo. I hope you can read this, because it's black and I wasn't able to change this. So what I am doing here is starting a virtual machine with QEMU. This is an arm64 machine, although my laptop is not arm64. Then we can have a look, for example, at how many CPUs we have: two CPUs. We have 1.8 gigabytes of RAM. We can have a look at this script. I wrote a small script; well, the script is just a one-liner, right?
It actually loads the kernel image with an initrd. It skips the checks, because we want to be fast, because we all want to go to lunch afterwards. And it appends some command line options: the command line options here are that you just start one CPU and that you reset the devices. One thing that I wanted to show: you can see in the command line of the production system, in the kernel boot parameters, that we actually have 180 megabytes of RAM reserved for the crash system.

So if we now load this... We can... let's see. So you can see there are different sysfs files where you can see whether a crash system is loaded, or whether a normal kernel is loaded for just rebooting the kernel and not caring about crashes, and the size of it. It should be loaded by now, because we loaded it. Yes, great.

So what we can do now is write the value c to the sysrq-trigger. What that does is artificially crash your kernel, and it is used just to check that this actually works. What we expect now is that the capture system comes up and that we can then do some fancy stuff. So we can see here... Yes, we can see here.
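The sysfs files checked in the demo before triggering the crash can also be read from a script. The transcript does not name the files; the sketch below assumes they are `kexec_loaded`, `kexec_crash_loaded` and `kexec_crash_size` under `/sys/kernel`, and the function name `read_kexec_state` is made up for the example. The directory is a parameter so the function can be tried against a scratch directory.

```python
from pathlib import Path


def read_kexec_state(sysfs_root="/sys/kernel"):
    """Read the kexec-related sysfs values as integers.

    A value is None if the file is missing, unreadable or not a number,
    for example on a kernel built without kexec support.
    """
    root = Path(sysfs_root)

    def read_int(name):
        try:
            return int((root / name).read_text().strip())
        except (OSError, ValueError):
            return None

    return {
        "kexec_loaded": read_int("kexec_loaded"),              # kernel loaded for a plain kexec reboot
        "kexec_crash_loaded": read_int("kexec_crash_loaded"),  # capture kernel loaded
        "kexec_crash_size": read_int("kexec_crash_size"),      # reserved memory area, in bytes
    }
```

On the demo machine one would expect the crash entry to read 1 after the capture kernel was loaded, and the size to correspond to the reserved 180 megabytes.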
Okay, so we triggered a crash through sysrq, which is a kernel NULL pointer dereference, and blah blah blah, where it comes from, and so on. Okay, it's stopping all the secondary CPUs, that was just one, and then it starts the crash kernel. And here you can see how the capture kernel comes up. So the first step went really well.

We can now look at /proc/cpuinfo, which should only have one CPU, because we passed maxcpus=1. That's the case. We had 1.8 gigabytes of RAM in the production system, so we would expect something like 180 megabytes now. It's actually 147 megabytes, because part of the reserved memory area is used by the kexec segments.

And what else can we do? We can have a look at the device tree blob that got created and then updated by kexec-tools. You can see we have here the UEFI memory map, we have here the initrd, and up here we have the usable memory range that kexec-tools added, and the ELF core header and where it can be found.

The last thing I want to do: there's another tool in kexec-tools, which is called vmcore-dmesg, which basically takes a dump file and extracts the kernel log from it. This can be used to have a quick look at what actually happened. We can do this here, and then you can see: okay, it crashed, and it crashed because we sent the sysrq trigger for a crash. Okay, so our capture system is working, so we're really happy.

And I added some references here for you, in case you want to read a bit more about it.
This is the source code of kexec-tools; it's on git.kernel.org. There is some SUSE documentation on how to set things up. This is the source code of the kdump program used by SUSE to make things easier. And there's also a nice blog entry that explains more or less the same as I explained here, so if what I explained wasn't understandable, or you don't trust me, then you can read it there.

Okay, maybe quickly some takeaways; I think we are on time. Oh, not too much. Okay, so basically how this works, the key points that you should remember: you have a reserved memory area, and the capture system gets saved in this area. The ELF core header is used, which points to all the memory locations in your production system where the information can be found, and the capture system then creates the dump file out of this information.

Okay, thanks a lot, and if you have any questions, please feel free.

Is the capture kernel built with any changes from the production kernel, or can you have one built ahead of time that you just reuse, that's compatible with your target?

Yeah, well, it depends. What we do at SUSE and openSUSE is reuse the kernel, because we have the kernel with basically everything built as a module, and then we just put the modules that we need into the initrd. That means it's not something you need to rebuild the kernel for. If you want to, you could also build the drivers that you need into the kernel and use that, but it's not necessary. So, no.

Okay, no more questions. So, thank you.

Yeah, thank you. See you.