Okay, so I'm going to get started on the talk. Looks like we got all of our AV stuff solved just in the nick of time. This talk is about using serial KDB and KGDB to debug the Linux kernel. The reason I'm giving this talk is that I've seen lots of people mystified by KGDB and KDB over the years. People always had "figure out how to get it set up" on their to-do list, and after hearing that over and over again I decided it was probably time to do a talk on it.

So, about me: I'm a random kernel engineer at Google working on Chrome OS. I like debugging — it's probably my favorite thing to do — and I like debuggers: working on them, using them, having them solve problems. I'm not the author of any of the debugging tools in Linux, so I'm not the author of KDB or KGDB. I fix bugs occasionally, but I'm not the expert on anything I say here; I'm just a user of these tools. You'll also notice that my deep experience is with ARM32 and ARM64, so I'm going to be using ARM here. There are ways to use KDB and KGDB to debug x86 and plenty of other architectures — I've seen patches from people using KGDB on those — but this talk is focused on ARM, just because that's what I know.

As I was writing this, I felt like I was writing an online dating profile, so you'll notice it's a little tongue in cheek. But basically, if you're here at this talk, presumably you're a kernel engineer and you sometimes see your device crash, because it doesn't always work perfectly. The key thing for this talk is that you have a serial connection to your device. You can make KGDB talk over ethernet; I've never done it, and I have no idea how well it works. You can certainly debug the kernel with JTAG; that's not what this talk is about. This talk is about debugging over a serial port. And presumably you'll get the most out of this talk if you like to go for long romantic walks through the woods.

Here's generally what we're going to talk about. I'll give a bunch of introduction that covers what these tools are, what they're good for, and how to get set up. Then I have a Chromebook here with a serial connection, and assuming the demo gods behave, I'll demo debugging a bunch of crashes. I timed myself on this talk and went slightly over the allotted time, so if we run out of time, there's still plenty of good information here anyway — we'll just see how far we get. Don't be surprised if suddenly I'm like, okay, I've done enough demoing, bye, I'm hungry.

So, what are these tools I keep talking about, and why do I keep saying KDB and KGDB? I've pointed to the docs here, or at least some version of the docs. Basically, KDB is the kernel debugger. KDB is a simple shell that lets you get some diagnostic information out of a stopped kernel. When you're in the KDB shell, you interact with it directly over the serial port: you type simple human-readable commands and get text results back. So if your device crashes and you're at the KDB prompt, you can type dmesg and it will print the kernel log, you can type ps and it will list all the processes, and you can type ftdump and it will dump your ftrace buffer. It's really just a text-based debugging thing. It is not a source-level debugger.
It's not even an assembly-level debugger. At some point in time I think KDB was an assembly-level debugger, and you'll see some vestiges of that, but today KDB is not any of those things. All it does is let you print some information about a crashed system.

KGDB is the kernel's GDB server. KGDB doesn't do anything useful on its own. When you're in KGDB, it provides a binary interface over the serial port for some other tool — presumably GDB — to talk to and debug the device. You can poke memory, you can poke CPU registers, you can look at information about the other processes that are running. So it speaks the GDB protocol and allows GDB to debug the device, but KGDB is useless without GDB.

The first thing you might ask when you hear about KDB and KGDB is: which do I want, and why does this talk keep mentioning both? Before I was involved in using these things, there were two efforts for what the kernel debugger should be — this is the story as I know it, anyway — and you had to pick: did you want the KDB patches or the KGDB patches? Now you basically enable them and you kind of get both. If you go look at the kernel config options, they're integrated on top of each other and the code is all intertwined. You can turn on KGDB without KDB support — in other words, provide the GDB server without the extra text-based tools that interpret the kernel's structures — but I don't really know why you would. KDB has some nice tools; you might as well enable them if you're enabling KGDB. I'll also mention here that KGDB doesn't know anything about the Linux kernel: it mostly just provides peeks and pokes and things like that to GDB. There are ways to make GDB act a little more like KDB, so you can do some of those commands even without KDB, and I'll get into that a bit later.

At crash time, KDB provides a whole bunch of things you can do to inspect the state of the system. You can list processes, dump dmesg, dump the ftrace buffer, backtrace anything. These are all things you can mostly do at the command line too, right? You use your device normally and you can do those things — well, with KDB you can do them at crash time. You can send sysrqs and you can peek and poke memory, though I've never really peeked and poked memory from KDB, because I'd rather go into KGDB, where I can actually figure out what memory I should peek and poke. Mostly what I do when I end up in KDB is run dumpall: it just spews a whole bunch of stuff, and then I stick it in a text file and search it.

So what is KGDB good at? KGDB is good at debugging code — or it's as good at debugging code as GDB is. I sound non-enthusiastic because sometimes GDB is not very good at debugging code. I don't want to be discouraging, but occasionally when you use GDB on optimized code, GDB will just keep telling you that the variable you wanted to read is optimized out, and this other variable is optimized out, and that other one too, and you just want to throw something at your computer. But it's still way better than nothing, and I'll show some examples of cases where it says optimized out and cases where it's actually pretty amazing. I'll also say that if you're envisioning using KGDB to single-step through code and understand the Linux kernel that way, it's not going to do that for you. Maybe someone can make it a little bit better.
It's a little quirky and crashy and achy-breaky right now when you start stepping through code. But overall, trying to debug something like the Linux kernel with a stop-mode debugger — meaning that when you're in GDB the entire system grinds to a halt, no interrupts are firing, no CPUs are running — doesn't really work that well for something as complex as the Linux kernel. All the other peripherals in your system may time out, you may not be able to service interrupts that some critical part of your system needs, watchdogs may fire. It's not really great for stepping through code. So what I'm really focusing on here is that it's good for debugging when your device crashes. Personally, I just always run with KDB and KGDB enabled when I'm doing development, and then when I get a crash I don't have to scratch my head and wonder why it crashed.

I wanted to do a few comparisons to other tools and debugging techniques too, just so you know where these tools fit in. Comparing these tools to JTAG: there's a lot of overlap, for sure. JTAG does a lot of the things that KGDB does — it lets you peek and poke your memory and inspect the state of all your CPUs really well. If you happen to have a JTAG setup that works well for you, and hardware that has JTAG, keep using it; it's great. But JTAG is often hard to get set up. Usually you need special hardware, and when you get your form-factor board, there's no more JTAG. JTAG software is often expensive, and the cheaper packages can maybe only debug CPU cores that existed five years ago. JTAG also has no KDB: the KDB/KGDB combo gives you those text commands, where JTAG just gives you the peek and poke — though JTAG certainly has extra features of its own. So the summary is that a lot of what people use JTAG for, you can use KDB and KGDB for, and you don't need a special setup; you can keep using it as long as you've got that serial port.

Someone else I talked to about KDB and KGDB said: if you're just using it to debug crashes, what good is it? When my device crashes, it dumps a bunch of stack crawls into dmesg, and I just go look at the dmesg, or I go look at console-ramoops. That's a fair criticism. The answer is that you tend to not have everything. You don't dump all of the memory of the system when you get a kernel crash; you dump a little teeny bit, and if that has the stuff you need, great — and if it doesn't, you're out of luck, because the system has already rebooted. So getting real GDB, with all of memory available, can be really nice. Someone also mentioned that there's this cool thing called kdump, where you can actually dump the state of all of memory at crash time. That sounds really cool to me, but I don't know how to use it. So if someone knows how to use kdump, make a presentation at ELC next year and I'll come. I promise.

So, setting up. The first thing you need is a serial port. I said this already, but some people came in late, and some people might have been distracted thinking about romantic walks through the woods, so maybe you forgot: we need a serial port. And not just a serial port — we also need a serial driver in the Linux kernel that can run in polling mode.
Remember I said that when you're in KGDB and KDB, the whole system has stopped and all interrupts are disabled — and that includes the interrupts for the serial port. So you need to make sure you have a serial driver that implements the two polling callbacks, and yours probably does. When I started working on this years ago, I think the driver used on Exynos didn't have them. I found a patch on the mailing list where someone else had added them, made a tiny fix-up, and got it landed, and now the Exynos driver has them, and most other drivers have these functions too. If somehow things aren't working for you, you can just go search and figure out how to add them.

The next thing on the getting-set-up list is kdmx. This is technically not needed, but in my mind you really want something like it. We're talking about running KDB and KGDB over the serial port, but you're probably already using your serial port for something else — console, or maybe a login from an agetty or something like that. You can't use your serial port for the debugger if you're already using it for something else. What kdmx does is essentially create two virtual ports on top of that one serial port, so you can have the debugger attached without quitting the program that's watching your console. It creates two pseudo-terminals: one for the console and agetty, and the other for GDB. It works pretty well. Every once in a while it gets a little confused — I think I mention here that if you start seeing it echoing dashes or pluses, just quit kdmx and start it again and you're fine. Maybe one of these days someone will figure out why it does that and fix it, but I think it just gets mixed up occasionally.

I point at a URL here, and people say, oh, I have to install a tool, that's too hard. It's really not. This is literally the full set of steps for building kdmx: it doesn't even require autoconf; you just run make. It works great for me. I'm happy to paste these instructions into a terminal for you if you want, but I already have it running on mine, so I'll probably skip that. Basically: you clone it, you go in there, you make it, and you point it at your port. I have it write out the names of the two PTYs it creates to files — one for GDB and one for the terminal. Then I point my terminal emulator — I use cu in this case — at the terminal pseudo-terminal, and I tell GDB to use the GDB pseudo-terminal, and it just works.

As I said, you also need GDB to do anything really useful here — and you need a cross-compiled version of GDB. That's probably not a big deal: if you're building and debugging the kernel, you already have cross-compiling tools. You're running on an x86 machine and maybe targeting an ARM64 machine, so you have a GCC or a Clang that runs on your x86 box and generates ARM64 code. Well, you need a GDB that runs on an x86 box and understands ARM64 code, and it probably comes from the same place your compiler comes from — hopefully it does, because GDB is too hard to set up yourself. I'm sure someone can figure out how to do it, and probably the same people who made your compiler did, but setting it up is way beyond the scope of this talk.
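For reference, here's roughly what that plumbing looks like end to end. This is a sketch, not the exact commands from the slides: the repository location, serial device, PTY file names, and the aarch64-linux-gnu-gdb cross GDB name are all examples from a typical setup, so adjust them for yours.

```sh
# Build kdmx (it lives inside the agent-proxy project; no autoconf needed).
git clone https://git.kernel.org/pub/scm/utils/kernel/kgdb/agent-proxy.git
cd agent-proxy/kdmx
make

# Run kdmx against the real serial port (see its README for the exact flags)
# and have it record the names of the two PTYs it creates -- say, in
# /tmp/kdmx_trm and /tmp/kdmx_gdb (made-up file names).

# Console side: point the terminal emulator at one PTY.
cu --nostop -l "$(cat /tmp/kdmx_trm)"

# Debugger side (another window): point a cross GDB at the other PTY.
aarch64-linux-gnu-gdb vmlinux -ex "target remote $(cat /tmp/kdmx_gdb)"
```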
And if you don't have a GDB that works, find professional help. And if you do know how to set GDB up yourself, you need professional help, because it's hard.

Kernel config. To get KDB and KGDB working, you have to enable them in your kernel. You probably don't want to do this for production, so you'll need to add some special configs to whatever your kernel is in order to make KGDB work. Here I show the whole pile of configs that we turn on in Chrome OS when we want KGDB. You probably don't need every one of these, but it's at least the set I use. I believe VT and VT console are just a hard dependency of KGDB. Having a panic timeout of zero makes it so the machine won't reboot when you get a panic. Turning off CONFIG_RANDOMIZE_BASE is pretty important, because you can't debug if your kernel has moved to a random location — there are ways to turn that off on the command line too. Personally, I turn off the hardware watchdog: the hardware watchdog needs to be petted from user space, and when you're in the debugger, user space stops running, so if you don't turn off the watchdog your system will just reboot on you. I like having magic SysRq enabled to get into the debugger, but you don't have to. I believe CONFIG_DEBUG_KERNEL and debug info are required. I turn on DWARF 4 debug info because it seemed like a nice thing, though I've never seen it make a difference. Some of the others enable the GDB scripts we'll see later, and the frame pointer just makes debugging better — they all help GDB. (There's a sketch of this setup collected below.)

Not only do you need the kernel config to turn KGDB on, you also need to tell KGDB which port to talk over — just enabling it, how would it know which serial port to use? So you add a kernel command line parameter that names the port. In this case my serial port is ttyS2, so I have kgdboc=ttyS2. The "oc", I think, is supposed to stand for "over console", and if you look at the docs it says you don't actually have to use a console port for this, but it's still the same kernel command line parameter. I list a few extra things that happen to be on my command line because I also use this port for console. I think panic=0 is just me being extra paranoid about making sure it's not going to reboot when it panics. And then that other one.

Demo time — now we're going to find out if the demo gods were kind to us after I just rebooted my machine. So, dropping into the debugger. The easiest way, if you just want to drop into the debugger and poke around, is magic SysRq. For those not familiar with it: if you have an external keyboard hooked up to your machine, you can press Alt-PrintScreen-g (Alt-SysRq-g). If you happen to have a working command line at the time, you can do echo g > /proc/sysrq-trigger. And in some cases you can send a break followed by g over the serial port. A break is special signaling over the serial line — it's not a normal character; it's holding the serial line low, I believe, for a certain amount of time, and rather than a character the other end interprets that as a break. So it's sometimes a little tricky to actually get a break sent to your device. I happen to have a convenient way to do it with the program I use to talk to my device — I have a special shortcut — and kdmx has shortcuts that might help you, where you can type ~B.
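Here's the setup from the last couple of slides pulled together as a sketch. The config list is a subset of what I described, written with the standard Kconfig symbol names, and ttyS2 is just my board's port; adjust everything for your own setup.

```sh
# Kernel config fragment (roughly what was described above):
#   CONFIG_KGDB=y
#   CONFIG_KGDB_SERIAL_CONSOLE=y
#   CONFIG_KGDB_KDB=y
#   CONFIG_MAGIC_SYSRQ=y
#   CONFIG_DEBUG_KERNEL=y
#   CONFIG_DEBUG_INFO=y
#   CONFIG_DEBUG_INFO_DWARF4=y
#   CONFIG_GDB_SCRIPTS=y
#   CONFIG_FRAME_POINTER=y
#   CONFIG_PANIC_TIMEOUT=0
#   # CONFIG_RANDOMIZE_BASE is not set
#   # CONFIG_WATCHDOG is not set
#
# Kernel command line addition (plus panic=0 for extra paranoia):
#   kgdboc=ttyS2

# Ways to drop into the debugger on a running system:
echo g > /proc/sysrq-trigger   # from a shell on the device
# ...or Alt-SysRq-g on an attached keyboard, or a serial break followed by
# 'g' (with kdmx, ~B sends the break for you).
```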
But if break isn't working, you may have to find other ways to drop into the debugger, which gets into some of the other options here. You can hard-code a breakpoint into your code with kgdb_breakpoint(). You can do something that causes the device to panic or oops. And my favorite one: if I'm debugging a particularly tricky thing, I'll find something I can do on the system and just stick a breakpoint in that path. For a little while, when I was totally stuck and couldn't get SysRq working, I made a patch called "who needs an SD card when you can have a debug trigger". In that patch I just went into the SD card card-insert handler and put a kgdb_breakpoint() there, so every time I plugged an SD card into the device, it dropped into the debugger. It was great, because it was a super simple place and super reliable. And yes, at the time I was debugging the SD card insertion routine — but you can go and inspect the rest of the system from there too.

OK, so, debugging our first problem — oh, actually, I guess I should try just dropping into the debugger first. We're going to find out if this actually works. I've got the terminal for my Chromebook here, and I'll even log in, even though I don't have to. Or maybe I won't, because I typed the password wrong. We'll just send break, then g. And you can see here it says sysrq: DEBUG, entering kdb — so I dropped into the debugger. I'm now at the KDB command line, I can type help, and I can do all the stuff I'd normally do. Then I can type go, and now I'm back to trying to type my password again, which I'll probably get wrong again. So that was the simple way of dropping into the debugger. I could try the other ways, but we'll see them shortly.

So, debugging our first problem. I'm using simulated bugs here, because it's really hard to come up with an interesting real bug to debug on stage. I suppose I could have purposely introduced some bugs into this kernel, but the simulated ones are pretty much the same. There's something you may have turned on in your kernel — we have it on ours for debugging — that allows you to provoke a crash on demand. So I'm going to try that as root; this time I really do need to log in, so I have to type my password. Yes. OK: /sys/kernel/debug/provoke-crash, and I said echo WRITE_KERN to DIRECT. Poof, it died. You can see — I can scroll up — it yells that it's doing a direct entry, it does something, and then it gets a crash, and then I end up in KDB.

And of course, as I said, in KDB I can do all the things. I can backtrace. I can run dmesg and see the device's whole kernel log — so if I attach to a device where I didn't have the log for whatever reason, maybe I forgot to hook up the serial port and found it sitting in the debugger, I can still get that stuff. And here we go: I said I'd demo bt. And then, the thing I usually do when I end up in KDB: dumpall. You can see on the slide it runs off the bottom, and that doesn't even come close to everything dumpall prints. So I'll run it and we'll see. It just starts spewing. It includes dmesg, but it also includes backtraces for every single thing in the system and some extra summary stuff. In this case my system was idle, so it didn't actually do too much, but it has so much stuff you couldn't possibly fit it on a slide, and a lot of it I don't look at.
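For reference, the crash trigger and the KDB commands from this demo in one place, as a sketch — it assumes your kernel has the LKDTM provoke-crash interface enabled, like ours does.

```sh
# Provoke the crash (run as root on the target):
cd /sys/kernel/debug/provoke-crash
echo WRITE_KERN > DIRECT

# The machine drops into kdb on the serial console. At the kdb> prompt:
#   bt       - backtrace the crashing task
#   dmesg    - dump the kernel log
#   ps       - list processes
#   dumpall  - spew dmesg plus backtraces of everything in the system
#   go       - resume, if the system can survive it
```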
But dumpall does find all the processes and dump their stacks. Like I said, the way I normally use this is I literally scroll all the way up to the beginning and paste the whole thing into a text editor of some sort. I got convinced to like Visual Studio Code at the last ELC, and everyone can make fun of me for it, but I'll paste it into that just because it's nice. Now I can go figure out what might have happened: I can search for lkdtm and find out, hey, there's lkdtm_do_action — basically, "searching is nice" is the summary. I also often save the dumpall output: even if I don't have time to debug a crash, I'll take the dumpall, save it to a text file, and reboot the device, so that next time I see a similar crash I can look for patterns.

Another cool thing you can do in KDB is run a sysrq. I wouldn't say this is 100% guaranteed to work perfectly in every single case, but if I type sr m, for instance, it runs the magic sysrq for m, which shows you memory information. So even if KDB doesn't have a command for something, you might be able to get the information by running a sysrq.

Okay, so that was mostly the quick summary of KDB. You can see that KDB has a few things, but it's not really going to give you all the debugging you want. What people here probably want to see is KGDB. In theory I can type kgdb at this prompt, and if I do, you'll see it says: please attach a debugger. You don't actually often need to type kgdb — you can just start talking, and something about KDB recognizes that GDB is talking to it — but if you want to be extra paranoid, you can type kgdb to switch into KGDB mode explicitly.

Here I have, at least for my system, the magic command to load vmlinux. You can see I'm running our cross-compiled GDB, targeting the vmlinux that I stashed ahead of time. This is not a vmlinux that comes from the device: I had to save these debug symbols ahead of time so that I can do my debugging now. And then I tell it to target the serial port — the PTY — that kdmx created for me. So I'll go over and hopefully find this, and we'll see whether — oh boy, I don't have my normal GDB command saved. We'll hope I can actually type this. Okay, so I'm going to type it in; you'll see it in real time; it lost my history. aarch64... oh, ha, that's why — I forgot to press enter. There you go. Here is my command, and you can see it's almost the same thing I had on the slide. I think I have my kdmx files in a slightly different place, and I have 8891 in there because I usually have more than one device hooked up to the same machine and I identify them by port. And — ah, I even screwed up: BOARD=kevin, because — oh, this is going to be toast. Oh no, it worked, okay.

Okay, so I got in here, and you can see I'm now in the debugger. And actually — it's funny, I got to a point I didn't mean to get to yet, but we'll jump ahead to it anyway. You noticed that when I screwed up, I had forgotten to set the board variable, which meant I wasn't pointing at the vmlinux symbols, right? So GDB said: I'm talking to your device, but the symbols you told me to load weren't there, so I'll just give you no symbols and you can debug with no help. And so I quit GDB. I said, get out of here, GDB, let me try that again.
When you quit GDB, it actually tells the target: okay, I'm done debugging, go ahead and continue. And what do you think happens when you say "go ahead and continue" to a device that's in a panicked state? It was not very happy. So if you accidentally quit GDB the normal way, it can really put your target in a bad state. The way I should have quit GDB is to suspend it and then kill -9 the job, which is just terrible — it's terrible to have to remember that. Maybe someone can add an option to the kernel that says: if GDB says it's quitting, don't take it seriously, don't actually continue. I don't know, maybe someone can find a better way, but I always screw that up and you probably will too. And you can see here: kernel panic, not syncing, fatal exception. I'm going to reboot my device just to get back on script. Let's see if this works. There you go — it's rebooting again, and we'll go on to the next slide. But we saw that we attached.

So once it finishes and I get back into the same state, we can actually see a backtrace in KGDB. Log in, type the password wrong because I always do, then in /sys/kernel/debug/provoke-crash, echo WRITE_KERN to DIRECT. There you go. And this time I'm not going to type kgdb, and you'll watch, and it'll still work. So I just attach... and it worked, I hope. There you go. And you can actually see — the whole point of kdmx, remember I said the muxing is really handy, maybe not required? If I didn't have kdmx running, I would have had to quit my terminal program over here, figure out what the port was, go give it to GDB, and then when I was done with GDB, recreate my terminal program and start it back up again. kdmx means I don't have to do any of that, and you can see my terminal program is still running over here.

Okay, so I can do bt here — let me see if I can make my window a little bigger so we can see more. You can see this looks like a backtrace just like you'd expect in GDB. I can go up various functions and look at frames. Let's see which one I said I'd do next — oh, disassemble, ha, okay, we'll do that one first. We'll go back down to where we crashed and I'll try disassemble /s. You can see GDB can disassemble where you are. In this case it's disassembling assembly source, because the place we ended up crashing happens to be memcpy, and memcpy is often implemented in assembly. So here disassemble /s is showing assembly source against the disassembly, although it's still helpful, because if we keep going enough and figure out where we are — we were at, oh boy, info registers — our PC ends in 83c, and if we look for that... it's often hard to find, but there it is, 83c. There you go — there's this hidden little arrow that shows where we are; it's often easier to search for your address than to look for that hidden arrow. But you can see that the disassembly was actually handy in this case: the source said something like "store tmp1w to dst and advance", and that got translated into something that looks quite different — a store of w3 into the memory x6 points at. So it's helpful there. And now, here's all that stuff, and I can look at it if I want to: print/x $x6 — and I can look at all the CPU registers.
I can also refer to individual CPU registers, and I can find out about address ranges, because this is GDB. People here probably know GDB better than me and know lots more commands; I know one or two handy ones, so I'll show you one more and then take your question. info symbol $x6 — and it says that this address is actually the function do_overwritten. So that might give us a hint about this crash if we were trying to figure it out.

You had a question? Yeah — he's asking about the muxing: the terminal emulator is still running, so is it still live? It is still live. kdmx tries to know which parts of the communication are relevant for GDB versus the terminal, so it splits just the GDB traffic over to the GDB side. If I say go in the debugger, I can actually have both of them running at the same time — I'll do that after I demo this — but basically you can hit continue in GDB, type some things on the terminal, and end up back in the debugger again. It's pretty neat.

Okay, so as I was doing all that assembly stuff, I was imagining where I'd probably be if I were sitting in the audience, which is: holy crap, I have to know assembly. And the answer is no, you don't — I mean, you need to know assembly about as much as you do when debugging any program in a debugger, especially optimized code, or when you end up in places that are written in assembler, like memcpy. You don't really need it, but it's good not to be too afraid of it. It's nice to recognize certain kinds of instructions and what they might do. For instance, this thing way up at 83-something: it's good to know that that instruction is probably writing something — maybe what's in w3, which is some register — into something close to what's in x6, right? You don't have to understand every detail of that whole instruction, but then I knew: hey, let me go find out about x6. Oh, it's trying to write something to the location do_overwritten. So even if you don't know everything, it's still useful, and it can often help you figure out what's going on. If you're in GDB and it says a value is optimized out, sometimes you can use a little bit of assembly to figure out the value of that variable anyway. I already did a little of this demo — info registers lets you see all the CPU registers, which is one of the handy things if you do end up in assembly.

Now let's get back to C code. We'll go up one frame — we can type frame 1, or up — and then we can type list, and we can see where we are, and we've got C code. You can see what it was doing: attempting to write something bad to some address. You can see that it was calling memcpy, copying from do_nothing and writing to ptr. And in theory — let me see if I remember what's next in the talk — I can say print ptr, and you can see, yep, it's trying to write to do_overwritten. And we can print size, and you can see it's trying to write a lot of bytes. This one was actually sort of interesting to me: I asked, why is it trying to write so many bytes? And you can kind of see what it's doing here. — It's negative in 64-bit, yes.
He's found the punchline, essentially, which is that we can have GDB do the same math the code is doing: print do_overwritten - do_nothing. Again, for those who know GDB I'm probably just demoing things you already know, but GDB can do math for you. It's a really powerful debugger; you can just say, hey, do some math for me and print the result, and it does — it says, yeah, that was negative 32. But size, you'll notice, is unsigned, of course. I don't really know whether whoever wrote this test meant to copy negative 32 bytes — I guess it didn't really matter, they were trying to crash the machine anyway, so mission accomplished — but it's interesting that you can quickly find out things like that with GDB.

And GDB is pretty handy in general here, right? I can do all the things GDB can do. I can say print crashtype, or even print *crashtype, and it does all the things GDB does: it dereferences your structures for you. These are things that with just a kernel crash, or in KDB without the KGDB part, you would not be able to do — or you could, assuming you had the memory contents and could do the disassembly and figure out where everything was. You could eventually figure out that the name stored in this structure was WRITE_KERN, and that the function pointer happened to point to the lkdtm WRITE_KERN handler, but GDB does it for you. And you can go further — I think this is just more of the same — you can say print *file and you get a massive spew of stuff. Personally that's too much for me, so I often do set print pretty on — I always type that wrong — set print pretty on, set pagination off, and then I try printing the file again, and now it takes a lot more space, but suddenly I can understand it. If you wanted GDB tips and tidbits, this is the talk for you.

The other thing you can do with KGDB is that KGDB can integrate with KDB. One sort of funny thing you can do here — let's see if I actually get it to work; am I going to go off script? what do you guys think? okay — maintenance packet 3. It's crazy that you have to type this, but maintenance packet 3 switches us back over to KDB. And remember you asked if the console side was still live — it's still live, I'm in KDB now. And then maybe I can type kgdb — will I end up back here? No, I think it ended up with a timeout, so I'll do the kill -9 and try reattaching. Come on, come on — yes! It still works and I'm still where I was. So you can see it was in fact still live, but that was kind of a pain, right? Because I could have gone over there, and at the KDB prompt typed dmesg or ftdump or any of the things KDB knows how to do.

You can also type monitor to run a KDB command without leaving KGDB. If I do monitor help, there are all my KDB commands from within KGDB. The one I used to use all the time was monitor lsmod — oops, a typo doesn't do anything, but lsmod will. I would use this command all the time when I wanted to know where a module was loaded, so I could load the symbols for a kernel module — because remember, at the beginning of this, when I loaded GDB I just pointed it at vmlinux, not at my kernel modules. So I don't have symbols loaded for my kernel modules right now.
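Here's the GDB side of that session condensed into one batch invocation, just for reference — a sketch, not the exact slide commands. The cross GDB name, vmlinux path, and kdmx PTY file are from my setup; ptr and size are the locals in the crashing lkdtm code we just looked at.

```sh
# Re-attach and poke at the WRITE_KERN crash (adjust paths for your setup).
aarch64-linux-gnu-gdb vmlinux \
  -ex "target remote $(cat /tmp/kdmx_gdb)" \
  -ex "bt" \
  -ex "frame 1" -ex "list" \
  -ex "print ptr" -ex "print size" \
  -ex "print do_overwritten - do_nothing" \
  -ex "set print pretty on" -ex "set pagination off" \
  -ex "monitor lsmod"
```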
Using that information from lsmod, I could load the symbols for the kernel modules by hand. I'll get into why I don't actually have to do that in a little bit — but go ahead, yes. Okay, so the question is asking: what is this monitor command, and why does it exist? I don't know the history of the monitor command, but this is not some special version of GDB. It's part of the GDB remote protocol: GDB has a way, as part of that protocol, to send commands to whatever is running the GDB server on the other side. I think it just sends a string and expects a string back. (I think your question isn't being picked up by the recording, unfortunately, so we can certainly talk more afterwards.) But the basic summary, for those listening to the recording, is: yes, monitor sends the command to whatever is implementing the GDB protocol on the other side, and it's totally up to that other side to interpret it and figure out what to do with it. In this case, KDB/KGDB happens to implement a bunch of commands that way. If you have something else implementing the GDB protocol, it may implement its own monitor commands.

Okay, where were we. So I said you can get at a bunch of smarts by using the monitor command: KDB provides a bunch of Linux knowledge — I can do lsmod, I can do dmesg, I can do all these other cool things — but that requires something running on the target that knows how to answer those text commands. Some people recently have been doing a lot of work on the so-called lx scripts. You saw when I showed my kernel config that I had CONFIG_GDB_SCRIPTS turned on, and that's what builds and installs these scripts. You can find them in the kernel sources, and I think you may have to run a special install target to get them installed. Basically, if you get them installed, and you put the vmlinux-gdb.py next to your vmlinux, and you allow GDB to load it — you have to do something special in your .gdbinit for that — then it all works. I can show you where that stuff lives on my system: cd into build/kevin, usr/lib/debug/boot. You can see here's my vmlinux, and next to it there's vmlinux-gdb.py and a bunch of scripts, and you can run the same find command I did and see that there's a bunch of Python in there. And you can look at my .gdbinit, where I allow those scripts to be loaded from everywhere.

Okay, so let's go back into GDB. I'll type help just to get going. And I can type lx and hit tab, and I can see the list of all the commands that were automatically loaded from the kernel's lx scripts. What these scripts do — they got installed with my kernel, and they run entirely in GDB. They don't use any KDB support at all. They just use the peeks and pokes of memory that GDB can do, plus GDB's ability to interpret kernel data structures and find global variables, to show you all kinds of cool stuff. So remember, last time I typed monitor lsmod, and that asked KDB to do an lsmod. I can also do lx-lsmod. It's a little slower, but it shows you mostly the same information — and it gets it in a vastly different way.
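To recap how those lx commands became available, here's a sketch of the setup on my machine. The /build/kevin paths are Chrome OS specific; yours will be wherever your kernel build puts vmlinux.

```sh
# CONFIG_GDB_SCRIPTS installs vmlinux-gdb.py (plus a pile of Python) next to vmlinux:
ls /build/kevin/usr/lib/debug/boot/vmlinux-gdb.py

# GDB refuses to auto-load random Python unless you whitelist the directory:
echo "add-auto-load-safe-path /build/kevin/usr/lib/debug/boot" >> ~/.gdbinit

# After that, inside GDB, "lx-" plus tab completion lists the commands:
# lx-lsmod, lx-ps, lx-dmesg, lx-symbols, lx-clk-summary, and so on.
```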
With monitor lsmod, we sent a string to the device, something on the device saw "you sent me lsmod", and it printed out all the stuff itself. lx-lsmod literally goes to the top-level global variable where it knows the kernel stores information about loaded modules, starts walking the kernel data structures, reads and interprets the strings, and then prints you this nice table — and it's all written in Python, running on the host.

Yes, go ahead. The question is: would these scripts work through JTAG? The answer is yes. In fact, I think at an ELC some number of years ago, the person who started writing a lot of these scripts demoed them over JTAG. They all work over JTAG, or over whatever else you might have GDB talking to. And there are some really cool ones — things that aren't even implemented as KDB commands but could be. lx-clk-summary is one I kind of like, just because I often end up looking at the clock summary, and yeah, you can see all your clocks — basically the same thing as the clk_summary file in debugfs. I won't claim that I wrote any of these things; I'm just demoing them for you. If you have praise, please send it to the people who wrote these scripts, not me. I'm just showing you what other people have done.

Okay, so we're going to demo a different type of crash now. So I'm going to kill — actually, here we go, let's try this. We'll even be — no, I don't want that — monitor reboot, ha, and then the kill -9. monitor reboot told my device to reboot, because there's a KDB command called reboot, so I sent the KDB reboot command; then of course GDB gets really unhappy because it can't talk to the target anymore, so I just killed GDB.

Okay, the second thing I was going to test is a soft lockup. So we go, and I'll get the password right — yes; every time I type that password in I have a little jump-for-joy celebration, it has too many zeros in it. Okay, so now I'm going to echo SOFTLOCKUP to DIRECT — oh God — okay, and once I've done that, I'm going to break into the debugger myself. So what I did here is tell it to simulate a soft-lockup kind of crash, where a task just sits there and loops, and then I purposely broke into the debugger using magic SysRq. The question in this case is: can we find what locked up the system? Imagine something was misbehaving and I wanted to figure out what locked up. The answer is yes. Actually, as I was putting this demo together, I realized that I couldn't — and I thought, gosh, I've seen this before when I've tried to use KDB: I do a stack trace and somehow it doesn't seem to show the right thing sometimes. So for this talk, I debugged it and sent a patch upstream. So assuming you have a KDB with that fix, you can now do btc and it will do a stack trace on all of the CPUs. I happen to have six CPUs in this system, so in this particular case I can scroll through them all, and if I look really carefully I'll find where I am: lkdtm SOFTLOCKUP, right there. I can see that the rest of these CPUs probably aren't doing anything useful — this one's sitting in cpuidle, this one's sitting in cpuidle, this one's in cpuidle, you can kind of guess what they're all doing — and this one was doing something different, so maybe that's the one to go look at. So I could debug it and find out what was going on on a different CPU that was locked up, using KDB.
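The soft lockup demo boils down to this sketch (again assuming the LKDTM provoke-crash interface is enabled):

```sh
# Simulate a task that spins forever on one CPU:
echo SOFTLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT

# Then break into the debugger yourself (Alt-SysRq-g, or ~B then g through
# kdmx) and, at the kdb> prompt, backtrace what's running on every CPU:
#   btc
```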
Now, in this case I happened to know that the process that locked up was running constantly on one of the CPUs, so btc worked: btc shows you what is actively running on each CPU in your system. If something is stuck but blocked — imagine it's misbehaving but it's blocked on a mutex, for instance, and is therefore sleeping — btc is not going to show that to you, but dumpall will. dumpall shows you the stack trace of everything in the system. It is a little tedious looking through the stacks of everything in your system, but if you were really desperate to figure out why you crashed, you might do it. You can also try other ways to find this information. You can type ps and show tasks in certain states — if you go look at that function you can see what the state letters might be. I happened to try sr w, which in this case shows nothing, but if you did have a blocked task, sysrq w would show you that something is blocked on something. So you may be able to find information about other tasks that way. And if you happen to have the process ID of something that's blocked, you can do btp — backtrace a PID — as you can see here.

Okay, so I don't want to just look at that other crashed CPU in KDB; I want to see it in KGDB, because that's the more powerful debugger for figuring out what's going on. So I reconnect with GDB. The way the GDB server in Linux represents processes in the GDB protocol is as threads: the processes in the Linux kernel are thought of by GDB as threads. So if I want to debug one of the other things in the system, I can type info threads, and it starts spewing out information about all the processes in the system. And it's a little weird, because it gives you an ID here, and then it gives you another number: that second number is the Linux kernel process ID, and the first number is the one you have to give GDB if you want to talk about that process. Then there's kind of a name for it. It's basically confusing, all these numbers. So maybe it's easier to look at it this way: I'm going to quit that and try monitor ps — monitor commands are pretty quick — and let's randomly pick one of these. monitor ps shows you the Linux PID, right here, and let's look at bash, because I think that's probably the one that crashed anyway. You can see bash is PID 1479 — remember that, because I'll forget it. Then I'll try running info threads. Actually, before I do that: I almost always do set pagination off before info threads, and then I go get a coffee. I'm probably not going to go up to the fourth floor right now, so I'll just let it run, and we're looking for 1479... 1479... At some point we'll see something that says Thread 1479 — and even though it says Thread 1479, that's the process ID; when I see it, I look at the number to the left, and that's how I tell GDB to talk about that process. Luckily there aren't too many things in the system, so it shouldn't take too long. I think it starts going really fast once it gets past the longer-lived ones, because the rest are short-lived things. Come on, come on, almost there... oh, and of course they're in a different order now. Yay, there it was: 1479 was number 172. So now I can say thread 172.
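Condensed, the PID-to-thread dance looks roughly like this. The PID (1479) and GDB thread number (172) are just what this particular boot happened to produce, and the paths are from my setup.

```sh
# monitor ps first, to note the Linux PID you care about (bash = 1479 here);
# then find which GDB thread number lists that PID (172 in this demo).
aarch64-linux-gnu-gdb vmlinux \
  -ex "target remote $(cat /tmp/kdmx_gdb)" \
  -ex "monitor ps" \
  -ex "set pagination off" \
  -ex "info threads" \
  -ex "thread 172" \
  -ex "bt"
```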
And yeah, it's uber confusing that even though it said Thread 1479, I then typed thread 172 — totally confusing, but that's the way it is. Now you can see it said "switching to thread 172 (Thread 1479)" — so it even tells you right there how confusing it is; it's the capital-T Thread, of course, I should have known. But then you can see I'm actually debugging the exact place I wanted to debug: I can go up, and I can see — huh, I wonder, maybe someone shouldn't have put an infinite loop in here, and that's what locked up my system. So I managed to debug that pretty easily.

Now, you'll probably hate me for this, but I made you do it the hard way this time. Remember we had to go search for 1479 and wait through that whole info threads output and all that. If you look really carefully at the whole thread list, you'll actually see that the first several entries are shadows of the CPUs. So I could get to the exact same place by typing thread 3. It's just part of the weird way the whole GDB stub works. Let's see if this works — thread 3 — come on — yay, okay. And we're in exactly the same place; in fact we're debugging the same process. It's just this quirk of how the GDB stub works: now we've told it to talk to a shadow CPU instead of going by process ID, and it works the same. They're the same damn thing.

Oh, and I should mention, as I'm looking at this: when you do a backtrace in GDB, you'll almost always see a scary message at the end that you'll think is a bug and totally your problem — and it's not. Backtraces in GDB almost always stop with some arcane message, because they had to stop somewhere, right? GDB has no idea where your stack ends, and at some point it will stop tracing and just say it can't access any more memory, or the stack frame looks duplicated, or looks corrupt, or looks bad. No — it probably just stopped because it got to the end of the stack. So, info threads — I showed you all this stuff and told you how to get there.

Okay, my next demo here is a failing of KGDB — a place where KGDB falls on its face. You noticed that when I did this demo, I purposely dropped into the debugger quickly, before the system detected the soft lockup on its own. Now I'll say continue, and let the system go back to running, and if I'm lucky and the demo gods go my way — there you go — at some point it yells at me: oh God, I detected a soft lockup. I happen to have the soft lockup detector turned on, and you can see: BUG: soft lockup, CPU#2 stuck for 11 seconds, it must be broken. And it gives you this nice trace of exactly what was stuck. Cool. So now we say: but we want to debug this in GDB, not in KDB — we want to actually look at the local variables and see what was failing. Annoyingly, we fail. The problem is that at the moment, KGDB — on ARM and ARM64 devices, anyway — can't trace past an IRQ. In this particular case, we happen to panic, and there's an IRQ frame in between where we actually got stuck and where we panicked, and GDB just doesn't know how to get past it, because it's a little bit of assembly code. I'll show you my bt in here, too — bt — and you can see, yeah, it stops at el1_irq, and it doesn't show anything about the soft lockup or about where we actually got stuck.
Well, someone can submit a patch for that and fix it. I think I tried one time and didn't spend enough time on it — I think we need CFI annotations in the assembly somewhere.

Okay, another demo — how are we doing on time? We have 20 minutes left. So, remember I told you that KGDB is better for debugging crashes than for single-stepping and things like that. And yet I've shown you cases where I was in GDB, hit continue, and the system kept working, right? The system is actually pretty robust — not perfectly, but pretty robust. And so breakpoints do sometimes work okay. You're welcome to try them if you want to. When I tested, sometimes I'd delete a breakpoint, hit continue, and find myself at that breakpoint again — I don't know why — but they sometimes work, so I'll give it a try anyway.

So let's get ourselves back to a clean state: monitor reboot, and then our fun kill -9, and — oh, it did not work. Did I type monitor reboot incorrectly? Why didn't you tell me? There you go, it's rebooting. So basically I'll show you another way to do breakpoints, and also show you some other interesting techniques as we do it. I'm going to drop into the debugger with SysRq-g, get into GDB, and set a breakpoint on pci_try_reset_function — just because when I tested, even tab completion worked on that, which is nice. Then I say continue. Okay, so my device continues to work, and of course we're not trying to reset a PCI function right now, so it's just running normally. And then, if I'm lucky, this will work: echo 1 into /sys/kernel/debug, the Marvell Wi-Fi directory, mlan0, reset. Cross your fingers... and we ended up at our breakpoint. Cool — so they do work, a little bit.

Now, the reason I was showing you breakpoints is that I had an ulterior motive. Remember earlier I told you that modules don't have their symbols loaded, right? And I happen to know that we compile the Marvell Wi-Fi driver as a module. So if I do a backtrace now, I get nothing. We can load our module symbols manually and then we'll get something. Remember I showed you monitor lsmod — I could go through that output and walk through it and manually load each of our modules, and I show here what that looks like, but I may skip it this time because there's actually a much cooler, easier way to do this, using one of the lx commands.

You had a question? Sorry — the module symbols. He asked whether I meant load the module or load the module symbols. I mean load the module symbols manually. The module, of course, is already loaded — it's running right now — but I would need to load the module's symbols into GDB manually. And lx-symbols is a really cool command that will load the symbols for all of the modules. I think I have a patch — probably in Andrew Morton's tree right now — that makes this work with the way Chrome OS does things, which is split kernel debug symbols. If you don't have split kernel debug symbols, it works magically today. If you happen to be on Chrome OS doing split kernel debug symbols, it shouldn't be hard to find a patch from me about this; I'm happy to provide pointers. But I will type this in: lx-symbols, and I happen to know my symbols are in build/kevin/usr/lib/debug.
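For reference, both ways of getting module symbols into GDB, as a sketch. The paths are from my Chrome OS setup, and the manual way needs the module's text address, which comes out of monitor lsmod.

```sh
# The easy way: re-attach and let lx-symbols find and load everything.
aarch64-linux-gnu-gdb vmlinux \
  -ex "target remote $(cat /tmp/kdmx_gdb)" \
  -ex "lx-symbols /build/kevin/usr/lib/debug"

# The manual way would be one add-symbol-file per module inside GDB, pointing
# at the module's debug file and the .text address that "monitor lsmod" reports.
```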
And we'll see whether we get lucky. Loading vmlinux, loading symbols... you can see it's going through and loading them. It's slow, but it's still faster than me manually matching addresses, loading symbol files, and copying and pasting paths and things like that. And of course I have pagination turned back on, so it's extra slow. Yes, you can ask. So here's where I would have — I didn't actually walk through this, just because we don't have tons of time — but here is add-symbol-file: you give it the path to the module's debug information and the text address, and it does one module at a time. Maybe you could — I don't know, I've never tried it; I haven't experimented a lot with lx-symbols. The question was whether you could have lx-symbols not load the symbols for all the modules. It takes about 15 seconds to load the symbols for all the modules I have, and typing in the name of the one module I wanted would take me 10 seconds, so I don't really think about it — I just run the script and it works. But if I type bt now — remember, before, it basically said "I don't know how to trace any more, I'm just going to show you garbage" — now I can actually see all of the stuff I wanted to. I get all my debugging for where I am.

Okay, so, another case where KGDB fails — because it's important to know not just what's good about this debugger but also where it doesn't work, so you're not surprised when you hit it. KGDB cannot stop CPUs that it can't talk to. On most architectures — ARM and ARM64 are the ones I know best — the way KGDB stops all of the CPUs when you break into the debugger is by sending them an IPI, which is basically just an interrupt: it sends an IRQ to all the CPUs. And if a CPU is looping with interrupts disabled, it won't stop for the debugger. So if you happen to be inside a spin_lock_irqsave region and you're looping forever, or you've crashed there, you can't debug that CPU with KGDB. This is where JTAG really saves you. You may still be able to use other techniques, and KGDB may still be able to help you, but you cannot debug that particular case. Maybe people will figure out how to solve this in the future — you could imagine doing something fancy where GDB gets a command to break spin locks and let you drop into the debugger to see where everyone is; maybe you could imagine someone implementing FIQs on ARM devices; maybe there's some other fancy way — but right now you're just out of luck. So I'll demo this for you so you can see exactly how terrible it is.

Okay, so we're booting again. Okay — can you guys hear me now? How much was lost? Just a little bit? Okay. The microphone died, for those watching the recording; luckily it was right at a time when we were waiting, and I guess the funny joke was that my microphone died. Okay, so what I was demoing is what fails in the case where you can't stop an unstoppable CPU. Here I did a hard lockup, which basically makes a CPU totally unable to be interrupted anymore. And when you look, it says: panic, hung task, KGDB timed out waiting for secondary CPUs. If I type now — info, oops, what is it — CPUs? btc will show you.
Basically, btc says: CPU status — currently on 5; 0 is idle, 2 through 4 are idle, 5 is running, and 1 is out to lunch. And if you go look at all of these stack traces now, you won't find anything about the CPU that's dead. Like I said, it's really unfortunate — if someone can come up with a fix, I would just love it — because this is the CPU that's got the problem, and if we could find some way to debug that CPU it would be totally awesome, but we can't right now.

Okay, we'll just continue on — like I said, we'll keep working on random stuff and other tips and tricks until we run out of time or we run out of slides. This one is some tricks for working with optimized code. I said that often when you end up in GDB you end up kind of cursing at it, because it will tell you something is optimized out. So we'll see whether — I'm going to reboot this guy; I didn't even end up in GDB last time — we'll see whether I didn't screw up and this function is still optimized out. Every time I prepared this demo, I'd sync to a new code base, recompile, and something different would be optimized out. Hopefully it's still the same one that was optimized out when I wrote the slides, but we'll find out. So we'll drop into the debugger and end up in GDB. And I was using the cros_ec high-priority transfer function: br cros_ec_xfer_high_pri, continue. And we end up in the debugger.

And I said that in frame 5 something was optimized out. What was it? Ah — well, if nothing else, work is optimized out, but I think I said param was optimized out too. I'm not even in the same frame. See, in frame 5, cros_ec console log work. No symbol "param". Wow, that's awesome. What's that? Yeah, list. This guy. Did I somehow end up in a totally different place? It thinks I'm in cros_ec_console_log_work. Oh, you know what it is — let's continue and see if I happen to get lucky. You know what it is? In this case — here, frame 5 was cros_ec console log... nope, okay. Here I said param was optimized out, but it really was work. Okay, I think I might have mixed something up — like I said, I did the slides twice, changed versions, and it all got scrambled. work is in fact the one that's optimized out: print work — optimized out.

So the first trick is one you're going to tell me is obvious, but I'll still tell it to you anyway: sometimes you can just look one frame up. So I go up again, and you can see that work was the first parameter to cros_ec_console_log_work, and in fact I can find work here: print work, and there it is. GDB is often really stupid, and humans can be smart. If you end up debugging optimized code and you're stuck, try being creative: look for other places you might be able to find the same value. In this case it's a really obvious place to find the same value, but the point is that sometimes it's not that hard. Sometimes you have to work a little bit harder. I think I said: let's look at this one. Let's go back to frame 5, and let's list cros_ec_console_log_work, and I think I wanted to get debug_info here: print debug_info. It's a local variable — you can see it at the top of this function — and it's optimized out too. Yes, you had a question? So — I'm just going with these frame numbers because I did the slides ahead of time and they're the same ones.
If I do bt, you can see the backtrace shows me all these different frame numbers, and I happen to know that frame 5 was the one in the backtrace that had these optimized-out problems. He asked how and why I was picking five — it's just that I know, arbitrarily, that the optimized-out problems were in that stack frame.

So, as I said, debug_info was optimized out. Sometimes you can get fancier. In this case, I knew where I could find it. Let's list this again: you can see that debug_info is container_of of to_delayed_work(work), into the struct cros_ec debugfs thing, blah blah blah, all of this stuff. Sometimes you've got to work harder and figure out what container_of and to_delayed_work actually do, and then you can still find your value. In this case, if you go look at offsetof, you find that it basically casts zero to a pointer to the structure, takes the address of the member on that cast-zero pointer, and that's the offset — and you can have GDB do that same super crazy math for you. This is where the power of GDB is great; you do have to work a little harder, but you can basically see that the work structure is 0x88 bytes into this structure, right? I used GDB to tell me that, by asking: if the original thing pointed at NULL, what would the address of the member be? And then I go up to frame 6, and I know that in frame 6 — let's see, what did I say — oh, work is what I had, right? So I'm going to paste this, because it's probably too hard to type. Oh, I can't paste — huh. Okay, well — oh, I think I can do this and paste it. Is it going to work? It's a lot of typing and a lot of chances to do it wrong. Paste — there you go. So I did that magic: I figured out that 0x88 was the magic offset, I figured out where the thing it was an offset into started, and then I could in fact find the value I wanted. I could find this big debug_info structure even though GDB thought it was an optimized-out value. And I'm not asking you to understand every last detail of this. I'm basically just saying: sometimes it's easy, and sometimes you've got to work a little bit harder, but you can often still use GDB to find things that give you a lot of information about what you want. Go ahead.

So the question is: rather than doing all this work, another option is to compile your kernel without optimization. I don't think you can actually build the kernel without optimization entirely, but one trick I have heard — and I might have it here, yes, two slides down — is that sometimes you can in fact save yourself some time by compiling with less optimization. Often the way to do this is to look up the pragma for your particular compiler — I think GCC and Clang both have one — that says: turn off optimization for this file, or for this section of code, and then your life can be a lot easier. And that works. The downside, of course, is that I often end up using these tools because I'm doing something like running 20,000 reboots overnight, and I come in in the morning and one of them has crashed, right? And I can't necessarily know that that crash is going to be there again, and I want to go figure out what happened. In that case you're just out of luck and you're forced to debug optimized code.
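Since that container_of gymnastics goes by fast, here's the general shape of the trick as GDB expressions — a sketch only. struct foo, its work member, and the 0x88 offset are placeholders standing in for whatever your code actually uses.

```sh
# Inside GDB, with the frame that has the valid "work" pointer selected:
#   (gdb) print &((struct foo *)0)->work
#       -> this is offsetof(struct foo, work); it came out to 0x88 in the demo
#   (gdb) print *(struct foo *)((char *)work - 0x88)
#       -> container_of(work, struct foo, work) done by hand: step back by the
#          offset and cast to the containing structure
```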
But certainly, coming back to the less-optimization idea: if you're using this to debug something that happens over and over again and you're just baffled, then if you can turn optimization off, it can save you time. You just have to make the trade-off.

I think I'm pretty much at the end of the slides and two minutes away from the end of the talk. I will mention that, like I said, knowing a little bit of assembly is good, and knowing the calling conventions for the particular system you're debugging is useful, so you know when you can trust the registers. And that's it. Conclusion and next steps: you can debug real crashes with KGDB and KDB. It's not perfect — I've tried to show you some of the things you'll probably stumble over — but it's not that hard. And that's it.