Hi, my name is Lakit Yagi. I work with Samsung Semiconductor India R&D Center, and I have eight years of experience in embedded systems development. I welcome you all to my presentation, "Oops! Do Not Panic", here at OSS North America 2022. I hope you all enjoy it. And on that note, let's begin.

We'll start the discussion with a basic introduction to the Linux kernel and its major subsystems. Then we'll move on to kernel panics and their different causes. After that we'll discuss the tools available to us to debug a kernel panic and make sense of a kernel oops message. And then finally, we'll have a short summary of the whole discussion.

To begin with, what is the Linux kernel? As listed here, it is a free, open-source, Unix-like operating system kernel. It is the core interface between the processes running on a system and the underlying hardware, and it manages resources as efficiently as possible; we could say that is the main job of any kernel. The kernel is so named because, like a seed inside a hard shell, it exists within the OS and controls all the major functions of the hardware, whether it's a phone, laptop, server, or any other kind of computer. As we can see here, the kernel lies between the application layer and the resource layer. By resources, we mean the CPU, that is the processor, the memory, and the devices connected to the system. On the application layer we have the user-space applications or processes, and those applications send requests to the kernel via system calls, ioctls, or any other interface available to them. The kernel tries to resolve those requests as fast, as efficiently, and as securely as possible.

Listed here are the five major subsystems of the Linux kernel. The first one is the process scheduler, which is responsible for fairly distributing CPU time. What that means is that any process running in user space gets a fair share of processing power from the underlying CPU, and it is the job of the process scheduler to make sure of it; how much each process gets also depends on the priority we have set for it. Next is the memory management unit, which is responsible for the proper distribution of the memory resources available to the system. Then comes the virtual file system, which provides a unified interface for accessing stored data. What that means is that irrespective of the underlying storage technology, whether it is an SSD, an HDD, or a USB flash drive, this subsystem makes sure that the interface given to the user remains the same: we use the cp command to copy data, and vim or cat to access it. Next is the networking unit, which allows a Linux system to connect to other systems over a network. And finally we have the inter-process communication unit, which allows processes to communicate with each other or with the kernel to coordinate their activities, for example through ioctls, system calls, or pipes.

Working with Linux kernel code brings with it its own unique set of challenges. For example, we cannot debug Linux kernel code the way we would debug a normal C application or a user-space application. This is because the Linux kernel is not a process, but a set of functionalities.
Also, there are cases where we may have a working and perfectly stable Linux kernel, but because it allows us to load modules at runtime, we can introduce new bugs or new faults through loadable modules. Most of the time the Linux kernel is self-sufficient enough to handle these faults, and in such cases it usually kills the process that happened to be using the faulty module at the time. Not all faults lead to a kernel panic: most of the time the kernel kills the process that was using the faulty module and the rest of the system goes on. In such cases we can reload the kernel module and try to reproduce the issue or debug it. However, there are some cases where the fault leaves the hardware in an unstable or unknown state, or corrupts kernel memory at random places. In such cases the system stays in an unreliable state, and it is advisable to reboot.

As listed here, a kernel panic is a safety measure taken by the kernel upon detecting an internal fatal error. We also call it a kernel panic when the kernel cannot load properly and the system therefore fails to boot. An oops is a Linux kernel problem that is bad enough to affect system reliability, and in those cases it is mostly advisable to reboot the system. The various causes of an oops or a kernel panic can be a software bug in the OS, a hardware failure, malfunctioning RAM, an incompatible device driver, a corrupt root file system, or an init process that fails to execute or terminates. The panic routine is responsible for handling a kernel panic. Its job is to output an error message to the console, dump an image of kernel memory to disk for debugging, and after that either wait for a manual reboot or start an automatic reboot. The oops message displays information about the processor state at the time of the kernel panic: the CPU register contents, the instruction pointer that caused the fault, the process that was executing or using the faulty module at the time, the CPU number on which the kernel panic happened, and a dump of the stack trace of the function calls that ultimately led to the panic.

This is a sample module which we will be using to trigger a kernel panic. In this module we register two functions, init and exit, using the module_init and module_exit macros. Inside the oops_init function we print an "oops init" message and then make a call to the function do_oops. You will notice that inside do_oops we dereference an invalid pointer. Almost every address used by the Linux kernel is a virtual address, mapped to a physical address via the page table structures. When an invalid pointer is dereferenced, the paging mechanism fails to map it to a valid physical address and a kernel panic happens. This is the behaviour we use to our advantage to trigger a kernel panic with this module. When we insert our faulty module into a working Linux kernel at runtime, as expected the kernel reports "unable to handle kernel NULL pointer dereference".
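The slide with the module source is not reproduced in this transcript, so here is a minimal sketch of such a module, assuming the function names oops_init and do_oops mentioned in the talk; the source shown on the slide may differ in detail.

    // oops.c: a minimal sketch of the panic-triggering module described above.
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>

    static void do_oops(void)
    {
            volatile int *p = NULL;   /* invalid (NULL) pointer */
            *p = 1;                   /* dereference it: the paging mechanism cannot
                                         map it, so the kernel reports an oops */
    }

    static int __init oops_init(void)
    {
            pr_info("oops init\n");
            do_oops();                /* never returns normally */
            return 0;
    }

    static void __exit oops_exit(void)
    {
            pr_info("oops exit\n");
    }

    module_init(oops_init);
    module_exit(oops_exit);
    MODULE_LICENSE("GPL");

Inserting it with insmod oops.ko (or modprobe) immediately produces the "unable to handle kernel NULL pointer dereference" oops described above, and the inserting process gets killed.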
When we are confronted with the huge dump of an oops message, the most important thing to look at is the instruction pointer. As listed here, it is pointing to the do_oops function, and the instruction that caused the kernel panic is at an offset of 0x8 inside do_oops. You will also notice that we have a dump of all the CPU registers here, and that the oops message points to CPU 0; this tells us, if multiple cores are available, on which CPU the kernel panic happened. It also reports which process was using the faulty module, and as shown here it was insmod. And at the end you will notice that the process has been killed. One more important thing: it is not necessary to use the insmod command; we can also use modprobe, which is another way to insert or remove modules at runtime. You will also notice that the call trace has dumped the stack trace of all the function calls that ultimately led to the kernel panic: init called oops_init, oops_init at this offset made a function call to do_oops, and in do_oops at this offset the instruction caused the kernel panic.

Now that we have successfully created a kernel panic and got an oops dump from the Linux kernel with all the necessary information, we will discuss the different tools available to us for the debugging process. First we have the printk function, which is the standard function for printing messages and usually the most basic or primitive way of tracing and debugging. If we go with printk, we usually add printk statements at different stages of our module and then check up to which point it was able to print before the kernel panic. Then we can use the addr2line tool, which translates addresses into file names and line numbers; the objdump tool, which displays information from the object file (the object file can be the .ko file in the case of a loadable module or the .o file in the case of a built-in module); and lastly the GNU debugger, GDB. These are the tools we will be discussing today.

In case we choose to use addr2line, the command is addr2line -e oops.o; notice that it can be oops.o or oops.ko, depending on whether it is a built-in or a loadable module. We then pass the offset we obtained from our kernel oops message, that is, do_oops at an offset of 0x08, the same place the instruction pointer was pointing at. Once we do that, it tells us the location of the .c file and the line number in that file which caused the issue.

In case we choose to use objdump, we first have to find the virtual memory offset of the module through cat /proc/modules. Once we have that, we use the command objdump -dS oops.ko, where -d disassembles and -S interleaves the source, and we tell it to adjust the virtual memory addresses of our kernel module by that value. What happens then is that all the addresses objdump shows in its output are offset by that value, so, as seen here, once we execute this command it points directly to the instruction that caused the kernel panic.
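A rough sketch of those two commands, assuming the module file is oops.ko and was built with debug information; the actual addresses come from your own oops message and /proc/modules, so the placeholders are left as placeholders:

    # addr2line: translate a code offset inside the module into file:line.
    # Pass the offset of the faulting instruction within the object file
    # (for example, the start of do_oops plus 0x8).
    addr2line -e oops.ko <offset-of-do_oops+0x8>

    # objdump: disassemble with interleaved source and shift the printed
    # addresses by the module's load address so they match the oops output.
    sudo grep oops /proc/modules            # shows the module's load address
    objdump -dS --adjust-vma=<load-address> oops.ko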
In case we choose to use GDB, make sure that you are using it with the vmlinux image, because the vmlinux image contains the debug information and all the symbols we will need for debugging. Once we execute this command we get a GDB prompt, and from the GDB prompt we can simply execute list * followed by the function name and the offset. Now don't be alarmed that we are using a different function name here: we had compiled a different kernel and a different file with different function names for this step, but the structure of the .c file was the same. So for the do_panic function at an offset of 8, it dumps the instruction, or the code snippet, that caused the kernel panic. As seen here, the .c file is located at lib/test_panic.c, and the instruction at line number 8 is what caused the kernel panic.

Next we will be talking about the kexec and kdump utilities. Kexec is a tool used to boot into another kernel while we have an existing, running kernel. Kdump is a crash-dumping mechanism built on top of kexec. To enable these we have to enable the configurations listed here in our kernel and recompile it. You will also notice that along with the kexec and crash-dump configurations we have enabled the magic SysRq key; this is because we will be triggering the kernel panic via magic SysRq. Magic SysRq is usually invoked with a combination of the Alt and SysRq keys on a PC keyboard, or with special keys on other platforms, and it is available on the serial console as well. We will be triggering the SysRq kernel panic via the serial console. Along with Alt and the SysRq key, we press a third key which determines what kind of command we are sending to the kernel. For example, s forces the kernel to sync all disks, u forces the kernel to unmount all disks and remount them read-only, b triggers a reboot, p prints the processor register states, and m prints the memory information.

On the target system, which in this case is the same as the development system, we have to make sure that all the installed packages are updated to the latest versions. Along with them, gcc, make, and binutils have to be installed, which, if your development and target systems are the same, will mostly already be the case. We also have to install the Linux headers for the kernel version we are currently running, the kdump-tools utility, the crash utility, and the debug symbols for the currently running kernel. Once we have done that, we have to enable kexec to handle reboots and enable kdump to run and load at system boot. To configure kdump we have two files available, located in the /etc/default and /etc/default/grub.d directories. Once we have configured kdump and kexec, we restart the system, and then to verify that the kdump utility has started we check the dmesg logs with a grep for "crash" (it can be upper case or lower case). To do a dry-run test of the kdump utility we can execute sudo kdump-config test, and it will give us the status of the kdump service running in the background.
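As a rough sketch of that setup on a Debian or Ubuntu style system; the package names and file locations are assumptions and vary by distribution:

    # Build tools, matching kernel headers, kdump-tools and the crash utility.
    sudo apt update
    sudo apt install gcc make binutils "linux-headers-$(uname -r)" kdump-tools crash
    # The debug symbols (a vmlinux with debug info) for the running kernel also
    # need to be installed; on Ubuntu they come from the dbgsym/ddebs packages.

    # kdump is configured in /etc/default/kdump-tools, and the crashkernel=
    # memory reservation usually lands in a file under /etc/default/grub.d.
    # After editing those files, reboot.

    # Verify that a crash kernel has been reserved and that kdump is ready.
    sudo dmesg | grep -i crash
    sudo kdump-config test       # dry run: shows what kdump would do on a panic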
Once we have configured our kernel with the options listed in the previous slide and have rebooted into it using the kexec utility, we trigger a kernel panic using SysRq. As shown here, we send the character c to the /proc/sysrq-trigger file, and that triggers a kernel panic. After that, the kernel loaded via kexec, that is, the kernel we had compiled with those configurations enabled, saves the state of the system, memory, CPUs, loaded modules, and much more into a dump file, and after that it reboots automatically back into a functional state, meaning the kernel we had booted into via the kexec utility comes back up. Once it is back in a functional state, we can check the /var/crash directory, as listed here, for a dump of our kernel image with all the debug information we need, in a file named after the date at which it was created.

To read the debug information present in this file we use the crash utility, with the command sudo crash followed by the dump file name, and then we give the path under /usr/lib/debug where the kernel debug information is stored. Once we do that, it prints the information that the kdump mechanism captured when the kernel panic happened. The information available here includes where the vmlinux image is located, what kind of dump file we have, how many CPU cores we have in our system, the date at which the panic happened, and, apart from that, the source of the kernel panic and the process ID of the process that caused it. In our case, as you will notice here, we had piped the character c to tee, and tee wrote it into /proc/sysrq-trigger; so it reports that the process that caused this kernel panic was the tee command, that it happened on CPU core number 2, and it also shows the running state of the task at the time of the SysRq. Once we have executed the crash utility we get a prompt, and at that prompt we can give the command bt to get a back trace of the function calls that happened before the kernel panic was triggered.

Next we will talk about the tools KDB and KGDB, which are used to debug the Linux kernel. KDB is an in-kernel debugger, and KGDB is the kernel's remote debugger, used together with GDB. KDB was merged into the mainline Linux kernel in version 2.6.35, after KGDB, which was merged in version 2.6.26, and KDB uses the same backend as KGDB. It is also possible to use either of these debuggers and dynamically transition between them at kernel runtime, but this will only work if we have configured the kernel properly before compilation. Now, KDB is not a source-level debugger; that means it does not need the vmlinux file or any other debug information. It runs within our executing Linux kernel and provides a simplistic shell-style interface which we can use on a system console. We can use it to inspect memory, the CPU registers, the list of processes executing on our system, and the dmesg logs. We can also use it to set breakpoints to stop at a certain location. It is mainly aimed at doing simple analysis to aid development or to diagnose kernel problems.
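Before going further with KDB and KGDB, here is a rough sketch of the kdump round trip just described; the dump-file name and the debug-symbol path are assumptions and will differ on your system:

    # Trigger a panic via magic SysRq on the kdump-enabled kernel.
    echo c | sudo tee /proc/sysrq-trigger

    # After the automatic reboot, the dump lands in a date-stamped directory.
    ls /var/crash/

    # Open the dump with the crash utility; the vmlinux path depends on where
    # your distribution installs kernel debug symbols.
    sudo crash /var/crash/<date>/dump.<timestamp> \
         /usr/lib/debug/boot/vmlinux-$(uname -r)

    # At the crash> prompt, 'bt' prints the back trace leading up to the panic.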
One limitation of KDB is that we cannot step through instructions the way we do with a normal debugger while debugging ordinary C or user-space applications. KGDB, on the other hand, is a source-level debugger for the Linux kernel and is used along with GDB. The process of debugging the Linux kernel through KGDB is similar to how we would debug a normal C application on our development system. It is possible to place breakpoints in the kernel code using KGDB, and we do have the kind of functionality where we can step through instructions. However, we need two machines to use KGDB: the KGDB instance runs inside the kernel on our target system, and GDB, running on our development system, connects to that KGDB instance on the target machine. In GDB, the developer specifies the connection parameters and connects to KGDB. These parameters determine what kind of connection we have with KGDB; it can be over TCP, or it can be over a serial console.

The different configuration options we have to enable to use KGDB or KDB are listed here (a sketch of a typical fragment follows below). You will see that if we are using KGDB, we have to make sure that read-only kernel data is disabled; that is because the debugger needs to modify kernel memory, for example to plant breakpoints, which is not possible if the kernel text and read-only data are write-protected. We also have to enable the frame pointer, KGDB, and the KGDB serial console to be able to see the KGDB output on our serial terminal. You will also notice that KDB pulls in all the configurations which are needed for KGDB, and along with that it enables KDB itself and the KDB keyboard option. This allows us to send input to a kernel that has hit a panic through a keyboard connected to a USB port.

Next, we will talk about the kernel debugger boot arguments, namely the three main ones: kgdboc, kgdbwait, and kgdbcon. kgdboc, as listed here, is the primary mechanism to configure how GDB communicates with KGDB, as well as which devices we want to use to interact with the KDB shell. kgdbwait makes the kernel wait for a debugger connection during boot. What that means is that if we have passed this boot argument, then as soon as the Linux kernel starts booting it halts and waits for a connection request from the GDB on our development system. kgdbcon, on the other hand, allows us to see printk messages inside GDB while GDB is connected to the kernel. Usually the debug logs appear on the KGDB side, on the target console; if we have passed the kgdbcon boot argument, those logs are pushed over the GDB connection to the GDB prompt on our development system.

Moving on, kgdboc can be configured as a built-in or as a loadable module, but be aware that the kgdbwait boot argument can only be used if we have configured kgdboc as built-in. That is because if it is a loadable module, even passing kgdbwait will not halt the Linux kernel, since kgdboc is not present at that point in the boot. To enable kgdboc as a built-in, we pass the kgdboc boot argument in this format, and the most important parameters here are the serial device and the baud rate at which GDB will communicate over that serial device.
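The slide with the exact configuration options is not reproduced in this transcript; a typical fragment for enabling KGDB and KDB looks roughly like the sketch below. The option names are my assumption and can vary between kernel versions.

    # Kernel .config fragment (a sketch; option names vary by kernel version)
    CONFIG_FRAME_POINTER=y          # reliable back traces
    CONFIG_KGDB=y                   # core KGDB support
    CONFIG_KGDB_SERIAL_CONSOLE=y    # kgdboc: KGDB over a serial console
    CONFIG_KGDB_KDB=y               # the KDB front end on top of KGDB
    CONFIG_KDB_KEYBOARD=y           # let KDB take input from an attached keyboard
    CONFIG_MAGIC_SYSRQ=y            # needed to trigger the panic via SysRq
    # Write protection of kernel text/rodata must be off so breakpoints can be
    # patched in (STRICT_KERNEL_RWX, or DEBUG_RODATA on older kernels):
    # CONFIG_STRICT_KERNEL_RWX is not set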
As an example of the kgdboc boot argument: kgdboc=ttyS0,115200, where 115200 is a standard baud rate for any UART or serial terminal. If instead we have compiled kgdboc as a loadable module, then we can pass the parameter at the modprobe step itself, in the format modprobe kgdboc kgdboc=<tty device>,<baud rate>. Or, if the module has already been inserted, we can use the sysfs entries to set these parameters. Now, do notice that in the sysfs case we are not mentioning any baud rate; here the baud rate is optional, because once the kernel is up and running the baud rate is already set for that serial console and we don't have to mention it explicitly. All we have to mention in the kgdboc module parameter is which serial console to use: if we are enabling it, we give the name of the serial device, and if we want to disable it, we just pass it an empty string.

kgdbcon, as mentioned, is the option used to push the kernel's printk messages to the GDB that our development system has connected to the target. Now, this cannot be used with KDB, because KDB does not involve a GDB connection to the target system. Again, there are two ways to activate it: through the kernel boot arguments or through the sysfs entries. We do have to make sure that we enable it after we have enabled kgdboc; first we configure kgdboc and after that kgdbcon, because kgdboc establishes the connection with GDB and kgdbcon then pushes the printk messages over that connection to the GDB on our development system. We cannot use kgdboc and kgdbcon on a tty that is the active system console, and the example given here is an example of an invalid configuration. However, it is possible to use these options together if we configure kgdboc on a tty that is not the system console. In that case, the boot parameters change to console=ttyS0,115200 kgdboc=ttyS1 kgdbcon, and then we can use the two together.

To use KDB, we first have to boot the kernel with the arguments mentioned here (sorry, this should be ttyS1), or we can configure kgdboc through the sysfs entry as listed here. Then we have to trigger a kernel panic using SysRq or a faulty module, and after the kernel has dumped its panic message the KDB prompt comes up on its own. From the KDB prompt we can run the help command to get the list of commands available to us for debugging. Some sample commands: lsmod shows the kernel modules which are loaded; ps displays the active processes on the current core, and ps A shows all the processes; summary shows the kernel information and the memory usage related to it; bt dumps the back trace of the current process; dmesg helps us in dumping the kernel syslog buffer; and go continues the system in case we have introduced a breakpoint during the KDB debugging steps.
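Before moving on to KGDB, here is a rough sketch of that KDB flow, assuming, as in the example above, that ttyS0 is the system console and ttyS1 carries the debugger:

    # Kernel command line for KDB over a serial port that is not the system console:
    #   console=ttyS0,115200 kgdboc=ttyS1,115200
    # Or configure kgdboc at runtime through sysfs:
    echo ttyS1 | sudo tee /sys/module/kgdboc/parameters/kgdboc

    # Drop into KDB by triggering a panic:
    echo c | sudo tee /proc/sysrq-trigger

    # At the kdb> prompt, for example:
    #   help     list the available commands
    #   lsmod    loaded kernel modules
    #   ps       active processes on the current core (ps A for all)
    #   summary  kernel and memory usage information
    #   bt       back trace of the current process
    #   dmesg    dump the kernel syslog buffer
    #   go       resume execution after a breakpoint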
To use KGDB, the process is somewhat similar: we have to pass the kgdboc kernel debugger boot argument, or we can configure it through the sysfs entry. Again, do make sure that before you configure kgdbcon, kgdboc is enabled and configured. Then we trigger a kernel panic through SysRq, and as soon as the kernel panic happens we get the oops dump; after the oops dump, we get the KGDB console on our target system. From our development system, we have to execute GDB with the vmlinux image, because the vmlinux image has all the debug information which is needed, and then we have to connect to the KGDB instance on the target system. That is done through the commands mentioned here: set remotebaud 115200, and then target remote followed by the serial device we are going to use. If we are connected to the target system via TCP, say on port number 2012, then we can use the command target remote followed by the IP of our target on the local network and the port through which the connection is to be made. Once connected, we can debug the kernel similarly to how we would debug any normal C application through a debugger. And to make GDB verbose about its target communication, we can give the command set debug remote 1. This is because connecting from GDB to KGDB is prone to failures, and to see exactly at which step it failed, or which response it did not get from the KGDB on our target system, we can set GDB to be verbose about it and then figure out which step failed or which parameter was not passed correctly.

It is also possible to switch between KGDB and KDB during runtime; this is the interoperability functionality between the two. To switch from KGDB to KDB, we either send the packet $3#33 from the GDB prompt or give the command maintenance packet 3; these two are essentially the same. Do make sure this is done in the GDB terminal on our development system, where the GDB instance is running with the vmlinux image. To switch from KDB back to KGDB, we issue the command kgdb from the KDB prompt on our target system.

And yeah, that is it. This talk is intended for developers who have just begun their journey on the Linux kernel development path. A general methodology for debugging a kernel panic was discussed here: after triggering a simple soft panic, a standard approach was followed, explaining the various debug tools and their uses to root-cause the issue.

And there is a famous exchange recalled by Tom Van Vleck, where he remarked to Dennis Ritchie that easily half the code he was writing in Multics was error recovery code. In response, Dennis said: "We left all that stuff out. If there's an error, we have this routine called panic, and when it is called the machine crashes, and you holler down the hall, 'Hey, reboot it.'" This is again just to showcase the simplicity and the keep-it-simple philosophy of Linux development.

Any questions? Thank you. Thank you all for listening.