Hello everyone, thanks for joining this session of the Open Source Summit Latin America conference. Today I'm going to talk about tools and techniques to debug an embedded Linux system. The idea is to show different ways and different techniques we can apply when trying to identify and solve bugs in software, with a focus on embedded Linux, although most of the concepts can also be applied on, let's say, a desktop system.

A little bit about myself: my name is Sergio Prado, I live in Sao Paulo, Brazil, and I've been working with embedded systems for 25+ years. For the last 12 years I've been working with my company, Embedded Labworks, where I provide consulting and training services. I'm also an open source contributor, I contribute to a few open source projects, and I blog in my free time.

This is our agenda. I'm going to quickly introduce some important concepts related to debugging, and then we're going to jump into real situations where we need to analyze bugs on an embedded Linux system. I created several different kinds of situations, and I'm going to show lots of commands, explain how they work, and show what kind of techniques we can apply to find the root cause when we are debugging problems on an embedded Linux system. I hope you enjoy it.

So let's start with a short introduction to debugging. This slide is a joke, but I really like it, because usually when we write software we think our software doesn't have any bugs, and that's really not true. All software has bugs; probably the only software without bugs is the software that was never written. Software will always have bugs; it's a matter of finding them.

This is again a joke, representing the six stages of debugging. It starts with denial: there is no bug there, I'm pretty sure, I tested it. Then you start realizing there might be an issue, and you say "it doesn't happen on my machine", and so on, until you ask yourself: how did that ever work?

Debugging is the process of removing bugs. "Bug" is a word we have been using since the early days, and by this term we mean a problem: an unexpected issue in a piece of software. Another phrase I like a lot: in the software development process, we spend 50% of the time putting bugs into the software and the other 50% debugging them.

If we think about what the process of debugging software looks like, we can come up with five steps, or at least I came up with five. It starts with understanding the problem, which I guess is one of the most important steps, because if you don't understand the problem, how can you debug it? The second step is trying to reproduce the problem. That's also very important, because if you cannot reliably reproduce the problem, how can you be sure you fixed it when you apply some kind of fix? The problem should be reproducible so that, in the end, you can be sure you really fixed it. The third step is identifying the root cause. That's what usually takes time, depending on the issue, but if you know the tools and apply the right techniques for each type of problem, you can be more assertive here and find the issue faster.
The fourth step is fixing the problem, and that's the easy part, I would say, because as soon as you identify the root cause, fixing usually means changing the code, compiling it, deploying and testing. Easy compared to the other steps. If you fixed it, you celebrate and move on to the next issue; if not, you go back to step one to see what went wrong.

Thinking about the problems we can have in software, I came up with five categories. First, crash issues: the software just crashes for some reason. In the kernel you can have an oops; in user space you can have segmentation faults. You get crash messages that you can investigate. Second, lockup issues, where the software just hangs, usually because of some error in the implementation, for example a synchronization problem in an application based on multiple threads. The third kind I'm calling logic or implementation problems: the software runs, but it doesn't do what it is supposed to do. You have an input, some processing and an output, and the output is not the expected one; it's a logic problem. The fourth type is resource leakage: you allocate a resource, but after using it you never release it, so the resource leaks. And the last one is lack of performance: everything works, but the performance is bad. With performance issues the user usually gets a perception of bad quality; the usability is not good.

And I could also come up with five groups of tools and techniques to debug those kinds of problems. The first one is our brain, probably the most important tool we have. We sometimes need to think a lot about a problem, about the different ways it could happen. And of course technical knowledge is also important: to understand, for example, what a segmentation fault is, we need to understand that it happens because of the memory isolation provided by the MMU, and so on.

The second class of tools and techniques I'm calling post-mortem analysis. That's where you collect some information and analyze it; you don't do the debugging at runtime, you do it later. Collecting logs and analyzing them, collecting memory dumps, like a core dump, and analyzing those: that's what I'm calling post-mortem analysis.

The third is tracing, which also relates to profiling. This is a technique where you trace the code at runtime. It's probably one of the most used techniques, because everyone knows how to debug an application by putting prints in the code, right? When you add prints to the code, you are tracing the application. But it's important to know that sometimes you don't need to put prints in the code; sometimes there is a complete infrastructure to trace the application. For example, the kernel has ftrace, a complete infrastructure to trace kernel function calls and profile functions; you can do lots of different things with ftrace. In user space you also have a few nice tools like strace, ltrace and perf that you can use to trace user space applications.
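Just to give a feel for what these user space tracers look like on the command line, here is a minimal sketch; the program name is only a placeholder, not one of the examples from the slides:

```
strace ./myapp                   # show every system call the process makes
strace -f -o trace.log ./myapp   # also follow child processes, save the output to a file
ltrace ./myapp                   # show the library calls
ltrace -S ./myapp                # library calls plus system calls
```

We'll see real uses of these tools in the examples later on.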
Another technique to debug applications is interactive debugging with GDB. You can run your application step by step, interact with it, inspect the memory; that's another way to debug an application. The last one I'm calling debugging frameworks. That's where you use tools that were built to debug a specific kind of problem. The kernel has a few debugging frameworks; I'm going to mention them later. And a very well-known debugging framework is Valgrind, a framework for memory-related tools; Valgrind is used, for example, to find memory leaks and other kinds of memory-related problems in applications.

So far we have talked about the different kinds of problems we can have and the different kinds of tools that can help us debug those problems. Now we're going to go over each of these main types of problems, see some real cases on an embedded Linux device, and talk about how we can debug those kinds of issues.

Let's start with post-mortem analysis. Post-mortem analysis, as I mentioned, is a technique where you extract information from a system, in our case an embedded Linux device, and then you analyze that information. It could be logs from the device, or some kind of memory dump, like an application core dump; you take it to your machine and analyze it. Let's see how that is done.

Here I have a kernel oops message. The kernel prints this message when something very bad happened in kernel space. It's kind of scary for those who don't know what is happening, but there is a lot of useful information here. For example, in the first line we can see the reason why the kernel crashed: a NULL pointer dereference inside the kernel. We have information about where the program counter was at the moment the issue happened; with this information we can find the line of code in the Linux kernel that caused the crash. And at the end we have a backtrace of the problem. A backtrace is the stack of functions that were called until the problem happened. Because of the size of the slide I cannot show the complete backtrace, but here we have all the functions that were called until the crash: this is the last one, the function that crashed, that function was called by this one, and so on.

How can we analyze this kind of message? We can use several tools here, for example addr2line or even GDB. What do we need to debug this kind of problem? We need the kernel source code, because the idea is to take that address and resolve it to a symbol. We have an address here, and we can take this address, or the function name plus the offset into the function that caused the problem (storage_probe+0x60); they're basically the same thing. We can take this and try to convert it to a line of code. For that we need the kernel source code, and we need the Linux kernel image in ELF format, the vmlinux file. When you build the kernel you get this vmlinux file, and you need it with debugging symbols; that's important. We can see here that this ELF file has debug info. So you have to compile the Linux kernel with debug info; there is an option for that in the Kernel hacking menu, so you just recompile the kernel with debugging symbols.
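Roughly, the recipe looks like the sketch below; the address and the toolchain prefix are placeholders, not the exact ones from the slide, and the menuconfig path can vary a bit between kernel versions. I'll walk through it next.

```
# In menuconfig:
#   Kernel hacking ---> Compile-time checks and compiler options --->
#     [*] Compile the kernel with debug info        (CONFIG_DEBUG_INFO)

# check that vmlinux really carries debugging symbols
file vmlinux     # ... ELF 32-bit LSB executable, ARM ... with debug_info, not stripped

# resolve the program counter address from the oops to a source file and line
arm-linux-gnueabihf-addr2line -e vmlinux -f -p 0xc0123456

# or the same thing with GDB
arm-linux-gnueabihf-gdb -batch vmlinux -ex 'list *0xc0123456'
```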
So you're going to have this vmlinux file with the debugging symbols, and the kernel source code. With that you can run, for example, addr2line from your toolchain. In my case I'm running the kernel on an ARM device, so I'm using my cross-compiler toolchain. With addr2line I can just take that address, the program counter address, give it the ELF file with the debugging symbols, and it will show me the function, the source file and the line that caused the crash. I can do the same with GDB from my cross-compiler toolchain: I just pass it the vmlinux file, the kernel image in ELF format with the debugging symbols, and then I can run the list command asking GDB to convert the address to a line of code. I get the function, the source file and the line (the same result), and I can see that, in this case, this was the offending line, the line that caused the crash. That doesn't mean we should remove this line to fix the bug, but it does mean we know where to start the analysis: if we look at this code, we can see that we are dereferencing a pointer here, this pointer is NULL, and we need to find out why it is NULL.

What about a crash in user space? Let's say we run a program and it segfaults. When that happens, it means the application tried to access a memory address that it is not allowed to access. To do this kind of post-mortem analysis in user space, we need to generate a core dump file. By default, the kernel doesn't generate core dump files for user space applications; we have to enable it, and there is the ulimit tool that we can use for that. This tool is able to configure process limits, and one of the limits is the size of the core file. With this command we can set an unlimited size for the core file, and then, after running this command and running the program again, the core will be dumped. In my specific case, where I don't have anything special configured, the core file is generated in the same directory where I ran the application. It is a kind of snapshot of the memory, so I can take this snapshot and open it in GDB to analyze the problem.

What do I need to analyze the core dump? I need the source code of the application and the application with debug symbols; that's also important. Having those, I can take the core file to my machine and open it with GDB. Here I'm using GDB with the application (the binary with debug symbols) and the core file. As soon as I open the core dump with GDB, GDB will tell me that the application crashed with a segmentation fault, and it will also tell me the line of code that caused the crash; I can see the line right here. What is nice is that this is, again, a snapshot of the memory, so I can inspect the memory and even go up and down the stack if I want. Here I'm listing the source code and I can see the line that caused the crash, and I can ask myself: what happened here? Why the crash? It could be the options pointer, it could be the argv pointer. Then I can just inspect the memory: I print options and it seems to be a valid pointer, but when I print the argv pointer I can see that it's NULL. And then I can see why it crashed: the argv pointer is NULL.
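To recap the core dump workflow, here is a minimal sketch; the program name and its argument are placeholders:

```
# on the target: allow core files to be generated, then reproduce the crash
ulimit -c unlimited
./myapp somefile     # it segfaults and leaves a "core" file in the current directory

# on the host: open the core with the cross GDB, using the unstripped binary and the sources
arm-linux-gnueabihf-gdb ./myapp core
(gdb) bt             # backtrace: where did it crash?
(gdb) list           # source code around the offending line
(gdb) print options  # this one looks like a valid pointer...
(gdb) print argv     # ...but this one turns out to be NULL
```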
And then I can keep the investigation going inside GDB: I can go up the stack to try to identify why argv is NULL, and so on. This is very useful, especially when you don't have physical access to the device. If you have physical access, maybe doing interactive debugging is better, because then you can run the code step by step to try to identify the issue; but if you don't have physical access, you can just tell someone "please send me the core file" and then analyze it on your machine.

Another technique to debug is tracing. As I mentioned, tracing is very popular: when you add prints to the code, you are tracing the application. You are basically adding trace points to the application, what we usually call static trace points, because you are adding them at build time. There are also tools and frameworks that are able to add trace points at run time, and that's very nice too. There are tracing tools you can use in kernel space and also in user space. Tracing is very useful to identify specific kinds of issues: for example, to measure time; tracing is very useful to do measurements and find performance issues, why a function takes a lot of time, things like this. Sometimes tracing is also very useful to find lockup issues, when the application is just locked up: you can trace it and try to identify where the execution is stopping.

Here I have a few use cases. First, a latency issue in kernel space. I added a bug to the kernel (actually this is not a lockup, it's an issue where something takes a lot of time to run) so that when you do something very simple, like setting the brightness of an LED to turn it on, it takes four seconds. Tracing is a good option to debug this problem, and to trace the kernel the best option we have is ftrace, a very nice framework to trace the Linux kernel. If you go to the kernel configuration menu, inside the Kernel hacking menu you're going to see the ftrace (kernel tracing) options, I don't remember the exact name, and there are a lot of options there that you can enable. That's basically what you need. When you enable ftrace, you get the trace filesystem, tracefs. You can mount it anywhere, but usually we mount it on /sys/kernel/tracing, and you're going to have several files there to configure the tracing. You can do that by hand, writing to those files, but there is a nice tool called trace-cmd, which is a kind of front end for ftrace and the trace filesystem, and that's what I'm using here. With trace-cmd I say I want to record a function_graph trace of this command; the tool will configure ftrace to trace the kernel functions executed for this specific application, and save the result in a trace.dat file that I can open with trace-cmd report. That's what I'm doing here, and then we can see all of the kernel functions that were called during the execution of that application. Looking at the functions, I can see why it takes so long: I can see this msleep function that took four seconds to run. There is also another nice tool called KernelShark, which is basically a graphical tool to parse that file and show you what happened during the trace.
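The trace-cmd part of that workflow looks roughly like this; the LED sysfs path is just standing in for the "slow" operation from my example:

```
# mount tracefs if it is not mounted yet
mount -t tracefs nodev /sys/kernel/tracing

# record a function_graph trace, limited to this one command
trace-cmd record -p function_graph -F \
    sh -c "echo 1 > /sys/class/leds/led0/brightness"

# text report of the recorded trace.dat (or open trace.dat with KernelShark)
trace-cmd report | less
```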
And because it's graphical, you get a better visualization, and you can also see the issue here: the msleep function that caused the delay in the execution of that process.

Now in user space, another nice tool to trace a Linux system is strace. Let's say you run a command and it just fails and you don't know why: just run that command with strace. strace is a tool to trace system calls, so it will show you all of the system calls executed by the application, and sometimes that's very useful. For example, here I'm running the netcat tool and it's failing with this error: could not set up a listening socket. If I run the same command with strace, I can see all of the system calls and I can see why it's failing. This is the failure, the write of the error message, and two system calls above we have the actual error, the bind() call: we can see the arguments it was passed and that it failed, and that's why we get this message. So just by running the application with strace (you don't need anything else, just strace) you can inspect what's happening and figure out what's going on. Very useful.

strace is somewhat limited in the sense that it only traces system calls. There is another nice tool called ltrace that traces library calls and can also trace system calls. But let's say you want to trace your own application, the functions of your application. For that there is a framework in the kernel called uprobes that you can use. It's kind of hard to use the uprobes framework by hand, but you can use perf for that. So here I have another problem: I run the application and it just freezes, hangs. I have uprobes enabled in the kernel and perf installed on the system. What do I need for tracing? The application with debug symbols, because perf will collect the symbols from the application and add dynamic trace points to those symbols. I can do that with this command: here I'm running perf probe on this binary, so it will collect all of the symbols, find their addresses, and add dynamic trace points to them using the kernel's uprobes framework. After running this command I can list all of the probes that perf added, and then I can just run the application and record the result. It will generate a perf.data file that I can open with perf script, and I can see all of the functions called in the application up to the one where it freezes. Then I can open the source code and try to find out why the application is freezing. Notice that I didn't have to open the source code and add prints to it, because there are tools that do that for me.
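Before we move on, the perf/uprobes sequence I just described looks roughly like this; the binary name and the probed function are placeholders, and the binary must carry debug symbols:

```
perf probe -x ./myapp --funcs       # list the functions that can be probed
perf probe -x ./myapp my_function   # add a dynamic trace point (uprobe) on one of them
perf probe --list                   # confirm the probes that were added

# run the application and record the probe hits
# (the event group name usually follows the binary name)
perf record -e probe_myapp:my_function ./myapp
perf script                         # show the recorded events
```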
Another technique to debug a Linux system is interactive debugging. It's a very useful technique when you want to understand what any piece of software is doing: you connect to it, you run it step by step, you inspect variables, and you can do all of this with the software at execution time. The main tool for that is GDB, and you can do interactive debugging very effectively in both kernel space and user space on a Linux device. The only issue is that it's a remote debugging architecture: the binary you want to debug runs on the target, and the source code and the tools are on the host.

So the host has all of the source code and the tools, the target has the binary you want to debug, and you need some kind of connection between them: it could be a serial port, but usually it's an Ethernet connection. Then you can start a remote debugging session using the protocol defined by the GDB project: there is a GDB server and a GDB client; you start the server on the target device and run the client on the host machine, which connects to the server and sends commands that are executed by the server. That's the main idea and the architecture of this kind of remote debugging, and you can do it in user space or in kernel space.

Here I'm doing a kernel space remote debugging session. I run these commands and it is not working; I just want to understand why, when I set the heartbeat trigger on the LED, it doesn't work. What do I need to do this kind of interactive debugging of the Linux kernel? I need KGDB enabled. KGDB, kernel GDB, is basically a GDB server implementation in the Linux kernel: the GDB protocol implemented in the kernel. So I need to enable it, and also the serial driver support, because currently the kernel only supports a serial connection, so I will need a serial connection to the kernel to do this kind of remote debugging. After that I have to put the kernel in debug mode. That can be done at boot time, by passing a specific argument to the Linux kernel, or at runtime. Here I'm doing it at runtime: I configure which serial port I will use for remote debugging, and then I move the kernel to debug mode by writing "g" to the sysrq trigger. The kernel then just freezes, stopping everything, waiting for a connection to start debugging.
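On the target side, putting the kernel into debug mode at runtime looks roughly like this; the serial port name is board-specific:

```
# tell KGDB which serial port to use (kgdboc = "kgdb over console")
echo ttymxc0,115200 > /sys/module/kgdboc/parameters/kgdboc

# stop the kernel and wait for the debugger to connect
echo g > /proc/sysrq-trigger

# the same can be done at boot time with kernel parameters, for example:
#   kgdboc=ttymxc0,115200 kgdbwait
```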
Then I go to the host. To do the debugging on the host, what do I need? Of course the kernel source code, a kernel image with the debugging symbols (the vmlinux file), and GDB from my toolchain. So I open the kernel image with debugging symbols, the ELF file, with GDB; after that I connect to the target device, and that's it: I'm connected to the kernel and I can just debug it. I can run "continue" and the kernel will run, I can set breakpoints, I can run the code step by step. Here I'm setting a breakpoint on the LED trigger write function; after that I continue the execution, the kernel stops on that function, and then I can run the code step by step, line by line, to see what's happening in that specific function.

In user space it's simpler: on the device you just need gdbserver. Let's say you run an application and it hangs; that's the case here, I'm running this application and it is hanging. What I have to do is run this application with gdbserver. Here I'm using a network connection, so it will open a local port, 1234, and wait for a connection to debug this application. On the host, what do I need? The source code of the application, the application with debug symbols, and GDB. Then I run GDB passing it the application (I like to start GDB in TUI mode), and I connect to the device; again I'm using a network connection, so I use the IP address of the device and the port that gdbserver is listening on in the target. After that I'm connected, and I can just continue the execution, because the application is hanging, frozen. I let the application run, then I stop it with Ctrl+C, and I can basically see where it is frozen, because if it is stuck on a specific line of code I can see that, just like here.
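This user space remote session, end to end, is roughly the following; the application name, port and IP address are placeholders:

```
# on the target: run the hanging application under gdbserver, listening on TCP port 1234
gdbserver :1234 ./myapp

# on the host: cross GDB, the binary with debug symbols, TUI mode
arm-linux-gnueabihf-gdb -tui ./myapp
(gdb) target remote 192.168.0.100:1234
(gdb) continue
# ...when it hangs, hit Ctrl+C and look at where it stopped:
(gdb) backtrace
```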
There are also several debugging frameworks that can help with debugging in kernel space or in user space. By debugging frameworks I mean tools that were built to solve specific debugging problems. For example, you want to identify locking problems in kernel space? There is a debugging framework for that. You want to identify memory leaks in kernel space? There is a debugging framework for that too. Usually these debugging frameworks add some overhead to the system, so you should not enable them by default, only when you want to debug a problem. A very classic example of a debugging framework is Valgrind, a framework created to debug memory issues, very popular, for example, for finding memory leaks in applications.

Again I have a few use cases using this kind of debugging framework tool. First, a problem in kernel space: I'm trying to cat /proc/uptime and it just hangs. So I go to the kernel configuration and enable the lockup detectors, a debugging framework in the kernel to identify lockups in kernel space, when something just locks up or hangs in kernel space. Then I run the application and wait until I get a kernel warning: the kernel will say that the CPU got stuck for N seconds, and I can see where it got stuck (the function, the offset, the address), and I also get a backtrace. I can then do the same kind of post-mortem analysis as before: take this information, for example the program counter, and use addr2line or GDB to find the line of code that is causing the lockup. For that I need the kernel source code and the kernel image with the symbols. I can use addr2line again, give it the address, and get the source line that is hanging in the kernel. With GDB it's even easier because it shows me the source code: I can just run the list command and it will tell me it's line 37, this one, that is hanging. Very nice.

Another situation: a memory leak in user space. I'm running this cpuload tool and memory leaks while it runs. How can I identify where in the source code the memory is leaking? I can use Valgrind for that. I need Valgrind installed, and, since I want Valgrind to tell me where in the source code of the application the memory is leaking, I also need to install the application with debug symbols on the device. Then I just run Valgrind, asking for a full check of memory leaks, and passing it the application. Valgrind will run the application on a kind of virtual CPU (it's basically an emulator), so the application will run very slowly and its timings will be impacted; that's important to know, but that's how it is able to monitor the memory and see what's going on, and it works very well. You have to let the application run and then stop it at some point, and as soon as you stop it, it will show a report of what happened. You can see the total heap usage, the number of allocations and frees and the amount of memory, and this field is very important: "definitely lost" means memory that leaked during execution. And here we have a kind of backtrace of the leak: some symbol in the libc called this function, which called this one, which called this one, so this malloc that leaked was called by this function, print_cpu_load, in cpuload.c at line 79. With this information we can open the source code and try to find out why the memory allocated by this malloc call, inside this function, at this line, is never freed. So very easily we could find the malloc call that was causing the leak, just by running the application with Valgrind.
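The Valgrind invocation itself is as simple as this sketch; the binary name comes from my example and, again, it should carry debug symbols so the report can show file names and line numbers:

```
valgrind --leak-check=full ./cpuload
# let it run for a while, then stop it (Ctrl+C); the leak summary shows something like:
#   definitely lost: <N> bytes in <M> blocks
# followed by a backtrace pointing at the malloc() call that leaked
```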
So my point here is that we usually do a lot of debugging by adding prints to the code, and sometimes that's useful, but I would say that most of the time there are better tools and techniques for debugging. Adding prints to the code might help in a specific situation, but most of the time you might want to apply other techniques. We have seen several different techniques here: post-mortem analysis, tracing, interactive debugging, debugging frameworks. And these different techniques apply to different problems. For example, you would not want to use GDB to do performance analysis, because it will impact the performance of the application; it's better not to use GDB for that. GDB is a perfect tool for a logic problem, for example (the application not doing what it's supposed to do): you run it with GDB step by step to try to identify what it is doing wrong. But not for performance analysis. Tracing, on the other hand, is a very nice technique for performance analysis, because you can trace the code, find out timings, measure things. That's my point here: I added green, yellow and red faces to this table, and that's my opinion; you could have other opinions on which tools are better for which kind of problems. But in the end, the point is that we should know all of these tools and all of these techniques, to be more productive when debugging an embedded Linux system. And sometimes we have to get out of our inertia: we think it's going to take a lot of time to understand and learn GDB, so we just add prints to the code. We should try to get out of that state of mind and really think about learning new stuff: learn ftrace, learn perf, learn GDB, learn all of those tools, because one day you're going to need them, and you're going to be much more productive using those tools to debug problems on a Linux system. I hope you enjoyed this talk. These are my contacts; feel free to reach out, you can send me an email or reach me on LinkedIn or Twitter. I hope again you enjoyed this talk, and until next time, bye-bye!