Very good. Thanks, Candice. Yeah, let's start. Welcome, everyone. I hope you enjoy this presentation — we're going to have some fun debugging on embedded Linux.

Let me first introduce myself. I've been working with embedded systems for 25-plus years now. I'm located in São Paulo, Brazil, and I've been working with my company for 12-plus years, doing consulting and training for customers in Brazil, and I also have a few customers in Europe and in the US. I'm also an open source contributor — occasionally, I would say — to Buildroot, Yocto, the Linux kernel and a few other open source projects. You can follow me on LinkedIn and Twitter, and you can also follow my blog, sergioprado.blog.

So, our agenda: I'm going to quickly introduce this topic, debugging, and then I'm going to talk about some of the most important techniques and tools to debug any kind of software. Of course, our focus here is embedded Linux. I hope to have lots of hands-on during the presentation. All bugs are intended for this presentation — hopefully we don't have unintended bugs, let's see.

Well, what is debugging, right? Debugging is the process of identifying and solving bugs — that is, errors in software or hardware. I find this topic fascinating because it's something we usually don't learn in school, but it's something we do almost every day, so it's something we kind of improve over time. One of the ideas of this presentation is to understand that there are several different techniques and processes to debug. There is one very common technique, which is adding prints to the code, and when we start our career that's what we do. But over time we learn other techniques that are usually more productive. So let's talk about that in this presentation.

It's hard to pin down a good process for debugging software. It's like you are a detective, right? And usually you are also the one you're searching for — the criminal. You are the criminal and the detective at the same time, because you are searching for bugs in the software that you wrote. But anyway, I usually follow this process to debug any kind of software error.

First, I try to understand the problem. That's very important. Let's say you are having a kernel oops or a kernel panic: you need to understand what a kernel panic is, and you need to understand how to analyze a kernel oops message. If you don't understand that, it's going to be much more difficult for you to find the issue. So that's the first part.

The second step is to try to reproduce the problem. That is also very important, because if you cannot reproduce the problem, how will you say that you solved it in the end, right? Before you even start working on the problem, you need to find a way to reproduce it. Sometimes it's very easy — you run the software and it crashes. But sometimes it's not that easy: you might have intermittent problems, and then you're going to have to find a way to reproduce an intermittent problem. That's an important part of the debugging process.

The third step is to identify the root cause. Now you understand what is happening — the kernel is crashing — and you can reproduce the problem: if you write to this file, the kernel crashes. Now you try to find the root cause, and that's usually what takes most of your time, right?
Searching for the bug. As soon as you find the root cause, then it's a mostly manual process of applying the fix — or the possible fix, because sometimes you don't know if it's the correct fix; sometimes it's trial and error. If it fixes it — because you know how to reproduce the problem and you cannot reproduce it anymore — then you are, in theory, good to go. If not, you go back to the first step and start again. So that's the process for debugging any kind of problem in software.

I also usually classify software problems into five categories. You might find another category, I don't know, but usually you can take any kind of bug and classify it in one of these five.

First, crash problems. That's when the software just crashes: it stops execution unexpectedly, abruptly. A user space application usually crashes because of an invalid memory access — then you have the segfault, the segmentation fault error. But there are other reasons for crashes, like an invalid instruction — the software tried to execute an invalid instruction — or a division by zero, which on some architectures will crash the software, and so on. The kernel might also crash too, with a kernel panic, for example.

Second, lockups, or hangs. The software just hangs — and when I say "the software", that could be the kernel in our case, since we're talking about an embedded Linux system where we have the Linux kernel and then user space applications. The kernel might hang because of some bug in the kernel, or a user space application might hang. For example, say you have a multi-threaded application, and because of some kind of deadlock in the application it hangs: two threads, each waiting for a mutex to be released by the other — and then you have a deadlock.

Third, logic or implementation problems. Any software is a kind of system where you have an input, some processing, and an output. With this kind of error, everything is kind of working, but the output is not what you expect. So it's a logic problem or an implementation problem.

Fourth, resource leaks. The software leaks resources for some reason. The most common resource that leaks is usually memory, especially if you are working with dynamic allocation and don't have a garbage collector or that kind of thing. But you might leak other resources. For example, you open a file and forget to close it; if you stay in a loop opening files without closing them, you end up with no file descriptors left, because you are leaking file descriptors and the kernel cannot allocate file descriptors for you anymore.

The last type of error in my five major categories is lack of performance, or performance issues. Everything is working, but you are using too much memory or too much CPU. Of course, this is very relative, right? Say you have software running on an embedded Linux device with, I don't know, 156 megs of RAM. If you take the same software and run it on a PC with lots of memory, you might not have a performance issue, but on the embedded Linux device you might. So usually a performance issue is relative to the system you are running the software on. Or not — you might really have a problem in the application or in the kernel that causes the usage of too much CPU, memory, or maybe energy, and so on. Very good.
And I also usually say that we have five tools, or techniques, to solve these kinds of problems.

I would say the first and probably the most important tool is our brain. That means our knowledge — and not only knowledge, but also skills, right? You need to understand, and also know how to apply that understanding. That's a tool that we should always keep improving.

Another very important technique to debug software is post-mortem analysis. It's a technique where you do the analysis on some kind of information that you extracted from the device. Like logging: you might extract the kernel logs to see what's going on in kernel space, or you might extract a user space log — say from journald or something else. You might create dumps from memory to analyze later; for example, you can generate a core dump from an application that is crashing and then do the analysis.

Tracing and profiling is another debugging technique, and adding prints to the code falls inside that category. Probably most of us know how to do tracing, because when you add prints to the code, you are tracing the code with your own messages. But the point here is that there is an infrastructure in the kernel, and also several tools and frameworks in user space, to trace applications without the need to add messages to the application. That's important, and I'm going to talk about this during the presentation.

Another technique is interactive debugging, like with GDB. GDB is a tool that allows us to interactively debug software: you can stop the execution, run the code line by line, and so on. That's another way to debug software.

And last but not least, there are several — I usually call them debugging frameworks — to debug software. One very well-known debugging framework is called Valgrind, for debugging memory-related problems. These are tools created so you can better debug one specific kind of problem.

So my idea now is to show you examples of each one of these techniques. We're going to see problems and how to apply those techniques to those kinds of problems. I'm going to start with post-mortem analysis, and after that tracing, interactive debugging and debugging frameworks. Let's see how it goes. I'm planning to do lots of hands-on here, so we're going to see my terminal, and hopefully everything will work well. You're going to have access to these slides later — everything that we're going to do here in the terminal is on the slides, so you can check it out later if you want. But I think it's nicer, instead of just reading the slides, to open the terminal and play with it.

So, post-mortem analysis is a technique where you do the analysis after the bug happens. You're going to need some way to collect information from the device to analyze later. This information can be logs — logs from the kernel, logs from a user space application — or it can be dumps from memory. There is a way to create, for example, a dump from the Linux kernel: there is a group of tools called kdump and a system call in the kernel called kexec that allow you to create a core dump from the kernel. But we're not going to talk about that here. We're going to look at a core dump from a user space application, and also at how to analyze a kernel oops message.
I usually say that post-mortem analysis is a very good technique to analyze crashes and logic problems; it's usually very useful for those two kinds of problems. Let's see a few examples. We don't have much time to go over every detail, but I'm going to focus on the most important things, and I hope that will be useful for you. If later you want more information about a specific part of this presentation, you can just write to me.

So, a kernel crash. I have here this board, as you can see on my camera. I'm running a small embedded Linux system that I created with Buildroot. I'm booting over the network — everything is over the network: it's downloading the kernel and the device tree and mounting the root file system. The boot should be fast because it's a very small system. Let me know if the size of the font is not good for you.

The first situation I'm going to show you is a kernel crash. I added a few bugs to the kernel and also to some user space applications. So there's a bug in the kernel: if I connect a USB stick, the kernel crashes. Nice. So this is a kernel crash. Are you afraid of kernel crashes? It looks like there are only numbers here, but no — there is lots of useful information.

This is the beginning of the kernel oops message. You can see the reason for the crash: unable to handle a kernel NULL pointer dereference. As soon as kernel developers convert everything to Rust, we're not going to have this problem anymore, right? But I'm not sure that will happen soon, at least. Anyway, there is a wild pointer in the kernel that is causing this.

More useful information: here we can see the location of the program counter. This kind of information really depends on the architecture — if you run this on x86 or MIPS or PowerPC or any other architecture (this is ARM, 32 bits), you might see different registers, but the information will be there. So this is the program counter, and this is the location of the crash: you have the name of the function and the offset inside the function where it crashed. You can use this to find the line of code that caused the crash. Here you also have the address in memory that caused the crash — that's basically the same thing as the function-plus-offset information.

And below we have the backtrace: the functions that were called until the crash. So kthread called worker_thread, which called process_one_work, which called hub_event, going up to here, which crashed. This might be useful for you to debug — you can see all of the functions that were called. Nice.

How can we debug this? We need three things: the source code of the Linux kernel; tools from the toolchain — there are two tools you can use here, GDB and addr2line; and the kernel image in ELF format with debugging symbols. I have all of this here. This is the kernel source code, and this vmlinux file is the kernel image in ELF format. That's not the image that we boot, but it is generated when you build the kernel, and we can see that this ELF file has the debugging symbols. That means we can convert those addresses to symbols — to lines of code. Let's do this.
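In command form, what we're about to do boils down to a couple of lines. This is a minimal sketch — the toolchain prefix and the address are placeholders for whatever your toolchain and oops message give you:

```
# Resolve a crash address from the oops to a source location.
# -f prints the function name, -e selects the ELF image with debug info.
arm-linux-gnueabihf-addr2line -f -e vmlinux c04f8d10

# The same lookup can be done with the toolchain's GDB:
arm-linux-gnueabihf-gdb vmlinux -batch -ex 'info line *0xc04f8d10'
```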
So one tool to do this is addr2line. You can see that I'm using my cross-compiler toolchain — you need to use the toolchain you used to build the kernel. I'm passing -f to also print the function name. What I want to do is convert that address — the address of the crash, the program counter — into a line of code. So I pass the kernel image and the address. Oops, sorry. Here we go.

So now I have the function where the crash happened — the storage probe function — plus the source file and the line. Now I can open that file. Line 1118: this is the line that crashed. And if you want to see why, we can see that this pointer is NULL here — it was never initialized. I added that code there to make it crash. But you can see how easy it was to find the line of code that caused the crash.

Finding the root cause is sometimes not that easy — those are two different things. One thing is to find the line of code that caused the issue; the other is to find out why that line crashed. In this case it was kind of easy, but it might not be, right? Still, it's better than just opening the kernel — that's 30 million lines of code — and starting to add prints to it. Much better.

So this was an example of doing post-mortem analysis on kernel code. I now want to show another example of doing this on a user space application. Again, everything that I do here in the terminal is in these slides, so you can see all of the commands and refer to the presentation later — it's going to be on YouTube, so you can watch it again if you want to review this process.

I have another example here: this fping application is crashing. Let me check if I have any questions — oops, the board is in a boot loop, probably because I forgot to remove the pen drive. And fping is crashing every time I run it.

There are a couple of questions. The first question is — I did answer it, but you can add your take on it as well — "the more you test, the less you debug, am I wrong?" I would not say the more you test, the fewer bugs you have, but I would say the better your test coverage, the fewer bugs you'll have in the code for sure — because sometimes you test, but you test the wrong thing, or you don't cover everything in the software. This talk is not focused on avoiding or preventing bugs, but there are several techniques you could apply to prevent adding bugs to the code, like working with unit tests and testing before releasing. It's impossible not to add any bugs — probably the only code that doesn't have bugs is the code that doesn't have any lines — but the more you cover your software from the testing perspective, the better the quality of the code will be, for sure.

There is another question about the materials and the images: can we get the training materials, such as the sources and images, for the steps that I'm showing?
Yeah, yeah — this talk is actually part of my training on debugging, and everything's open, so I can provide you with links so you can download this material and try it yourself. To run all of these hands-on exercises you're probably going to need this development board, or you can adapt them to, I don't know, a Raspberry Pi or any other maker board. I'm going to send the slides with the links for my training. Do you think you can share the links together with the slides? Yeah, we can definitely share the links — you can just include them in the slides and we'll share them.

And the other question is how to debug issues that are difficult to reproduce on the systems we have, but that show up on a different system — meaning a driver works perfectly on one piece of hardware but doesn't on another SoC — and also, what if these are remote? Yeah, in that case you can just blame the hardware, right? No — that's a tricky question. Usually, when I have these kinds of intermittent issues — you can reproduce the issue here but not there, or the issue only happens on Sunday — what I try to do is write tools to make it easier for the issue to happen. You usually have to play with statistics, right? Say there is an issue in a protocol that happens every 20 messages or so that you send: with some tools and automation you can increase the probability of the issue happening. Then you leave the tool running for several hours or maybe days to build statistical data — "I ran my tool for three days and hit the issue three times." Then you work on fixes and keep running the tool until you cannot see the issue anymore. After that, you go to production with the fix and see if it's working. It's kind of tricky to work with those intermittent issues and try to understand the root cause of the problem. Very good.

So I want to show you now another situation where we do post-mortem analysis. This is an application that is crashing — a bug I added, of course. How can we debug it? We could start a GDB session here — that could also be effective — but I'm going to do post-mortem analysis. What I want to do is create a core dump of the application. A core dump is a kind of snapshot of the application at the moment it crashed.

So let's do that. There is a tool called ulimit that makes it possible to configure the limits of an application: how many threads it can create, how many files it can open, that kind of stuff. One of the limits is the size of the core dump, and you can configure it with the -c parameter. By default it's zero — it depends on your system and your distribution, but if you don't change it, the default is usually zero. I want to increase that to unlimited, so now I can create unlimited-size core dump files. That might be a problem for an embedded Linux system: say the application is using one gig of memory — the core dump will be about that size, and you need space to create that file on your system. In my specific case this is no problem, because my whole file system is over the network, so I have virtually unlimited space to store files.
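As a sketch, the target-side setup is just this (the core_pattern line is optional, and the path in it is only an example):

```
ulimit -c unlimited                       # lift the core file size limit (often 0 by default)
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern   # optional: where/how to name the core
./fping                                   # run the crashing program; the kernel writes the core file
```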
Now I'm going to run the application again, and we can see that the core was dumped. That means I now have a core file here. This is an ELF file — basically a snapshot of the memory of the process that crashed. Now I can take this file to my host machine and analyze it.

So let's do that. What do I need? Well, I built my system with Buildroot. To analyze the core dump, I need the source code of the application, I need my toolchain, of course — I'm going to use its GDB — and I also need the binary with the debugging symbols. With Buildroot I have everything under output/build for the application: the binary generated by Buildroot with the debugging symbols is inside the src directory, and I have the source code right there too. So I can do everything.

How do we do this core dump analysis? First of all, I need the core dump, so I'm going to get that file from the target to the host. But since my whole root file system is inside this directory here — I'm booting over the network, so the root file system is already on my machine — the core dump is right here. I'm going to copy this file, and I need to change the permissions so my user can read it, because by default only root can. So now I have the core file that was created on the device, and I'm going to do the post-mortem analysis on the host.

I just open GDB with the application and the core dump: gdb with the binary, and -c core. The first thing it shows me is the last line of code that was executed — the line of code that caused the crash. Here we can see the problem: program terminated with signal SIGSEGV. That's the signal the kernel sent to the application, because the application accessed an invalid memory address. And that's the line of code that caused the crash: optparse.c, line 217.

Sorry — there is one question that's probably relevant to answer now: is it possible to specify where to save the core dump file? Yes, it is. There is a file — /proc/sys/kernel/core_pattern — that you can use to specify where you want this file to be generated, the name of the file, and so on. The specifics of the question: they have an application running in a Docker container and would like to save the core dump in a volume shared with the host. That's kind of what I have in my setup right now, and yeah — with core_pattern it's the same idea.

There is a nice feature of GDB that I like: the TUI mode. It shows you a better view of the source code. If you start GDB with -tui, you get a window with the source code, which is nice. Here we can see the line that crashed, and down here the same line: 217. Another nice thing about core dumps is that you can really inspect the memory, because it's a snapshot of the memory. So you can print stuff. I want to see the value of this options pointer: print options — and you can see it's not NULL, so probably it was not that one that caused the crash.
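For reference, the whole host-side session condenses to something like this — the paths are from my Buildroot tree, so adapt them to your setup:

```
# Open the unstripped binary together with the core file, in TUI mode.
arm-linux-gnueabihf-gdb -tui output/build/fping-*/src/fping -c core
(gdb) bt             # backtrace: the call chain that led to the crash
(gdb) frame 1        # select a stack frame, e.g. the caller in main
(gdb) print options  # inspect a variable from the memory snapshot
```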
We can print the value of the argv pointer inside this options structure to see if it's NULL — and we can see it is zero. That means this was the pointer that caused the crash. We can actually print the complete structure: if I dereference this pointer, I can see the complete content of the structure. As you can see, we really have access to the contents of memory. I can do a backtrace to see that main called this function, and this function crashed. I can go to main — frame 1 — and see the call to the function that crashed.

So I really can inspect the software. Of course, I cannot run the software — that's why it's called post-mortem analysis; you are not doing an interactive debugging session. I'm going to do that in a few moments, but not now. Post-mortem analysis is very useful for cases where you don't have access to the device: "Can you please just run these commands, generate the core dump, and send it to me?" Then I can analyze it here. That can be very useful.

There is one question about GDB: sometimes GDB shows "optimized out" — why? That's because when you compile the software with optimization flags, the compiler will optimize the code. That means the compiler might remove lines of code because they don't need to run, or it might reorder instructions to improve, say, the speed of the software or its size. So if you compile the software with optimization flags, you might have problems debugging it. In that case, what I usually recommend is to try to debug with optimizations on, because that's what is running on your production system. But if that's causing you problems — you really want to see this or that variable and the compiler optimized it away — then you'll have to turn off the optimizations or decrease the optimization level. On GCC, if you open the man page you're going to see the -O options. Nowadays the recommended flag to build software with a sane optimization level that doesn't impact debugging is -Og. You might build the software with, say, -O3, or -Os, which optimizes the size of the binary, and that might impact debugging. That's the trade-off.

There is another question: can we detect memory leaks using GDB, like with Valgrind? Probably — but I never use GDB to debug leaks. I'm going to show you in a few slides how to do that with Valgrind; that's what I use for this kind of problem. The trade-off is that Valgrind runs the software slowly. Let me get there, because I'm going to talk about Valgrind — maybe that doesn't fully answer your question yet.

So, we saw a few examples of debugging using this post-mortem analysis. I'm not sure how we are doing time-wise — we're about in the middle of the presentation, so let me know if I should go faster. But hopefully you are enjoying it.

Tracing is another technique to debug software. It's a kind of specialized logging: tracing enables you to trace the execution of the software and also do some kinds of runtime analysis. You can measure the execution of functions, you can measure latencies — because you are tracing all functions and that kind of thing. There are different techniques and frameworks to implement tracing.
There are two kinds of trace points: static and dynamic. When you add a print to the code, you are adding a static trace point. And in the Linux kernel, for example, there are frameworks for dynamic tracing — meaning you can add trace points dynamically, at runtime, not at build time, both in kernel space and in user space. That's very nice.

Tracing is useful for different kinds of problems: it helps when you want to measure performance or latency, and it also helps with lockups — the software just locks up for some reason, and tracing might help because you can identify the function that froze the execution.

I want to show you a few examples. In kernel space there is a very nice framework called ftrace. You can do everything with ftrace — it's the tracing infrastructure in the kernel. In this example I'm tracing an application that is taking a while to execute. But first let me show what I'm using. To use ftrace, you go to the kernel configuration menu, into this very nice menu called "kernel hacking" — there are several nice features there; if you have some time, go there and look — and then there is this "Tracers" menu. If you enable it, you enable ftrace, and you get several different tracers to trace the execution of the Linux kernel — and not only the Linux kernel, but also user space applications.

If you enable that, you get the tracefs file system as the interface, and you can just mount it. We usually mount tracefs at /sys/kernel/tracing, and in there you'll see several files to control tracing. I'm going to show you a quick example of how that works. We have here, for example, this file called available_tracers — the tracers I enabled for this kernel. I'm going to enable this one, the function_graph tracer; it shows me all of the functions executed in kernel space in a graph view. To enable it, I write the name of the tracer into this current_tracer file. Now it's enabled — the kernel is tracing itself and writing into a buffer in memory that we can read through this trace file.

So here you are seeing all of the kernel functions being executed. This can be very useful, because you don't need to add prints to the code like "I'm executing this function, now I'm executing that one" — everything is already here; you don't need to do anything. And there are several files here to manage it, because it's a lot of information: you can filter by process, by function, by subsystem, by events. You can really play a lot with this.

Here is an example of using ftrace to debug an application that is taking too long to execute. I added a bug to the kernel code where turning on an LED takes four seconds. Usually it doesn't take four seconds to blink an LED, right? So something is going on here. I mounted tracefs, and here I'm using a command called trace-cmd. It's a command that makes it easier to use ftrace — I could do this by writing directly to those files, but I would need to write to several files, and trace-cmd just does that for me.
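Roughly, the raw interface and the trace-cmd shortcut look like this — a sketch; the traced command name is made up:

```
mount -t tracefs nodev /sys/kernel/tracing      # usually mounted already
cat /sys/kernel/tracing/available_tracers       # which tracers this kernel was built with
echo function_graph > /sys/kernel/tracing/current_tracer
head /sys/kernel/tracing/trace                  # read the in-memory trace buffer
echo nop > /sys/kernel/tracing/current_tracer   # stop tracing again

# Or let trace-cmd drive those files for a single command:
trace-cmd record -p function_graph -F ./blink-led   # writes trace.dat
trace-cmd report > trace.log
```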
I'm asking trace-cmd to do a function_graph tracing of this command. It will run the command, trace the kernel, and generate a trace.dat file with all of the functions executed by the kernel. Then I can take this trace file and ask trace-cmd to generate a report — to parse it and show it to me in a human-readable way. I'm redirecting that to a log and then cat-ing the log, and I can see all of the kernel functions executed when I ran that command. And if I now do some analysis on this, I can see that this function called this other function, which called the msleep function — and that took four seconds to run. So I can see the cause of the delay in the execution: an msleep call inside this function. I didn't have to do anything — just enable ftrace, generate the trace, and do some analysis on the data.

I can do that also with a graphical tool called KernelShark. Very nice, for those who like graphical applications. There are some nice graphs here about CPU usage and the functions that were called — you have the same information, but you can filter and do the kinds of things you can do with a graphical application. If you want to learn more about ftrace, I recommend Steven Rostedt's talks — they're always very funny. Just go to YouTube and search for ftrace; you'll find several talks by Steven, and you're going to have some fun tracing the kernel with ftrace.

I have a few questions — would you like to take them now? Yeah, yeah, let's do it, why not?

So: sometimes you get a core file from production, with optimized-out values on the error line, and there is no way to reproduce — then what to do? Right. Well, if you have a core dump, it's a snapshot of the memory; and if the software was compiled with some optimization level, you cannot really match the lines of code with the memory addresses. There is really not much you can do but try to reproduce the problem later and generate another core dump from the application built without optimizations — or at least a lower level. I don't know how else you would debug that kind of situation. It's always a trade-off, right? If you want to improve security, you usually decrease debuggability; if you want to improve resource usage, you usually decrease debuggability too — it's harder to debug. You try to find a good balance between them, but as far as my knowledge goes, there is not much you can do except reproduce the problem with a different optimization level where you can really debug it.

The second question: is GDB for user space only — can we debug the kernel? Let's do it in a moment, after the commercials.

Can Valgrind catch memory leaks in dynamic libraries? Yes. Valgrind basically emulates the software, and if the leak is happening at the library level, Valgrind will be able to see it. Now, whether Valgrind is able to resolve the symbols from the library that is causing the leak — that depends on whether you have the library with debugging symbols installed on the target. I'm going to talk a little bit more about that shortly.

The other question: it's worth mentioning that Yocto and other distros usually have tracing support enabled? Yes, some of them do — you can just check.
And if not, you have to enable it. I think most of them have it enabled now.

Is it possible to filter the output of ftrace to follow a single device driver running on the system? How does that work? You can do different kinds of filtering. You can filter by function — you can say "I want to see only those five functions" — so if you just want to debug a driver, you can filter by its functions. And you can also filter by module: you have a .c file, that's a module, and you can say "I want to see just the functions from that kernel driver, that .c file." So yes. In my example I did the filtering at the process level: I asked trace-cmd to trace that user space application, so in the trace I have all of the functions called at the kernel level from that process ID — that was the filter trace-cmd created here. But you can filter, as I mentioned, by function, by file, by subsystem. It's pretty flexible.

I think we have one more question here: do we have any limitations or disadvantages with Valgrind? Yes — Valgrind will run the software several times slower. The way Valgrind works to capture problems in memory is that it emulates the software. It's kind of like QEMU: it emulates your application — it has implementations of synthetic CPUs inside it — and that means the software runs much slower. If you don't care about that — and usually, when you are debugging memory allocations or deadlocks, you might not care about the execution speed of the application — that's fine. But sometimes the speed matters; if you want to really execute the application, not emulate it, then Valgrind might not be the right tool for you, and you might want to use other tools to debug that specific problem. You just need to be aware of that trade-off: the software will run slower.

Okay, there are two more questions. There are some scenarios where drivers don't care about function return values and so can fail silently — how does one catch that, if the silent failure bites us long after it actually happened? I'm not sure I follow the question. So you have a function that doesn't have a return value, and if it fails — well, maybe it's a problem in the design of the function, because if it doesn't have a return, in theory it doesn't want to report failures. I'm not sure I understood the question; can you rephrase it? Return values are not checked in some cases. Oh, okay — so the function has return values, but they are not checked, and the question is how to debug that. Yes, I think that is the question. Essentially, the question is about silent failures: you don't have good bread crumbs to follow while trying to debug the problem — you don't have enough information to say what could have happened. I would say the best tool for that is GDB — doing it interactively, because you cannot collect information; there is no information, it is just failing, right? That's in the category of logic problems, where the software just fails but you don't have much information about it.
So I would just run the code step by step to try to get to the function that is failing without providing any feedback. And in some cases, if the problem goes away under GDB — sometimes you cannot reproduce the problem inside GDB — you have to find other ways to figure out what's happening. GDB is useful for analysis; it's not always useful for reproducing the problem — I think that would be accurate to say. Yeah, I would agree. So, Dhruva, hopefully that answered your question; if not, come back to us. There are a couple more questions — is it okay to ask? Maybe I'll move on a little bit first, and then you can ask more. So we'll get back to your questions — just hang on.

Good. So here I have a few more examples of tracing, but in user space. There are two tools that I like to use at the user space level for tracing: strace and ltrace. strace is a tool to trace system calls; ltrace is a tool to trace library calls. These tools are easy to use and very helpful — they can help you understand what a binary is doing and understand the result of the execution of a binary. In the situation from that question — you run a specific piece of software and it just fails without giving you any feedback — maybe strace can help, because strace will show you all the system calls. For example, say an application tries to open a configuration file, the configuration file is not there, and it fails without showing you any message: with strace, you can see that. So it's very, very useful.

Here I have an example with this netcat application — I just added a bug to it. You run netcat and it fails: "Couldn't set up listening socket." For some reason it cannot listen on this TCP port. If you run it under strace, you see all the system calls executed, and if you read from the bottom up, you can find the error: the bind call returned this error, "bad address", and then you got that message. So the bind call failed — and if you look at the interface of the bind call, you can see that this parameter is wrong: we're passing NULL here, but it should not be NULL. So now we have a good indication of why the software is failing: there is a bug in the software that passes the wrong argument to the bind call. Just by running strace you get this kind of feedback, without opening any source code — you don't need the source code for this, because strace just captures the system calls, parses them, and shows you the name of each system call, the parameters, and the return value. You have the same thing with ltrace, but for library calls. Yeah, those are two very nice tools.
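Condensed, that session looks something like this — a sketch; the port is arbitrary, and the bind line just illustrates the kind of error the real log showed:

```
strace -o nc.log nc -l -p 1234   # capture every system call into nc.log
tail nc.log                      # read from the bottom up to find the failure
# ...
# bind(3, NULL, 16) = -1 EFAULT (Bad address)   <- the wrong argument

ltrace -o nc-lib.log nc -l -p 1234   # same idea, but for library calls
```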
Another example here — I'm not going to run this live now, because it might take a while and I want to cover the last part, the GDB stuff. Here I'm using a kernel infrastructure called uprobes — user space probes — to probe user space applications. strace and ltrace can help you trace user space applications, but you cannot use them to trace the functions of the application itself — only the system calls and library calls. If you want to trace the functions of the application itself, you can use uprobes for that. And there are a few tools and scripts that make it easier to use uprobes, because they're not that easy to use directly. One tool that I find very easy to use is perf.

So here I have an example I created with perf to trace this tool. In the first command here, I'm adding the trace points: I have a loop that collects all the functions from this binary and enables a trace point for each one. That means I'm taking all of the functions from this binary and adding trace points in the kernel, so when I run the software, with those trace points enabled, the kernel will let me know. That's the idea: I'm enabling tracing of all the functions inside this software with this command line. I can see this if I run perf probe --list — all of the probes that I created, the trace points that I added. Then I can use perf record to run the application and capture all of those trace points. The result is a perf.data file with the result of tracing that application — all of the functions called by the application are inside that file. Then I use another perf command, perf script, to parse it, and I can see all of the functions that were called.

There are lots of other things I could talk about with perf here — you can create graphs, backtraces, and all that stuff. This is just a simple example, and there is a lot to explore. Just try perf; it's a very nice tool, not only for profiling but also for tracing. So those are the two kinds of techniques that I usually use for tracing user space applications.
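The perf workflow sketches out like this, with a hypothetical binary /usr/bin/mytool standing in for the tool from the demo:

```
perf probe -x /usr/bin/mytool --funcs      # list the functions we could probe
perf probe -x /usr/bin/mytool main         # add a uprobe on one function (loop over the list for all)
perf probe --list                          # show the trace points we just created
perf record -e 'probe_mytool:*' /usr/bin/mytool   # run the tool, capturing every probe hit
perf script                                # parse perf.data into readable events
```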
Let's talk now about interactive debugging. That's one of the main techniques we use for debugging, and nowadays GDB is still our default tool for it. I want to show you two examples of interactive debugging: one in kernel space — there was a question there, someone asked, "can I use GDB on the kernel?"; yes, you can, let's see it — and one at the user space level.

The point here is that you need a client-server architecture, because you have the source code, the tools, and the binary with the debugging symbols on your host machine, while on your target machine — your device — you have only the software running. So what you need is a GDB server running on the target device, communicating with a client on your host machine, receiving commands from the client and executing them on the target. The client says things like "add a breakpoint at this address" or "run the next line of code", and the server just executes that. That's how it works. And it happens that the kernel has this kind of server implementation built in; it's called KGDB. To use it you enable CONFIG_KGDB, plus a driver for the KGDB communication — we usually use the serial port for that.

I want to show you how this works now. The first thing is to set up the communication infrastructure, and since I'm using the serial port for the console, I'm already connected over the serial port — I cannot just reuse it for GDB. So what I usually do is use an agent that acts as a kind of proxy for the communication over the serial port. There is one provided by the community, called agent-proxy — I think it lives on the kernel.org servers. It's a very simple application that acts as a proxy: you run it, telling it the serial port of the device and two TCP ports. The first port is for the console; the second one is for GDB. Every communication passes through this proxy, which distributes the messages: if it's a GDB message, it goes to the GDB port; if it's a console message, it goes to the console port. If you are not using the serial port for the console, you don't need this, okay? But I am, so I'm going to use it.

So I started the proxy. Now I start a telnet connection on the first port, 5550 — you can see that I'm connected; I'm in the console, but over telnet — and I reboot the device. Now I have this infrastructure set up and working, and I have the kernel with KGDB enabled. What I can do now is put the kernel in debug mode. I can do that at boot time or at runtime; I'm going to do it at runtime. I created a small shell script to do that for me, but I'll show you what it does: it just sets up the parameter for the communication — the serial port — and then uses the sysrq mechanism to put the kernel in debug mode. If I run this, the kernel is now in debug mode: it is basically waiting; everything is frozen.

Now I can connect to the kernel over GDB. I go to my kernel source code on my host machine and start a GDB session using the kernel in ELF format, the vmlinux image — and I'm going to use the TUI mode for better visualization. Now I have to connect to the device: there is a command in GDB, target remote, and I connect to localhost, to the proxy, which forwards the messages to the device — this part here is the port for GDB. Now I'm connected to the kernel, and I can put breakpoints anywhere in the kernel — including in interrupt service routines — and debug the kernel.

I'm going to debug this problem here. Let me grab one of the functions — this one — and set a breakpoint on it: break, then the function name. So I added a breakpoint on a function inside the LED framework of the kernel. Then I ask GDB to continue the execution. You can see on the console that the execution was frozen — nothing was happening there — but when I type continue here, the kernel is executing again; if I go to the console, it works. I'm controlling the execution of the kernel. Now I'm going to write to the LED file — let me grab the command, this one — and see what happens. The garbage you see on the console is coming from GDB. Now, if I go to GDB: there it is, stopped at the function where I added the breakpoint. And now I can run the kernel code step by step — next, next, next — I can do a backtrace; I can do everything I want in kernel space. So yes, you can do interactive debugging of the Linux kernel with GDB. It might take a while for you to set up this infrastructure, but for complicated problems it can be very, very helpful.
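The whole KGDB setup condenses to a few commands — a sketch with hypothetical device names, ports and breakpoint; adjust them to your board:

```
# Host: split the board's serial port into a console port and a GDB port.
agent-proxy 5550^5551 0 /dev/ttyUSB0,115200 &
telnet localhost 5550                      # the target console, now over TCP

# Target: point kgdboc at its serial port, then drop into debug mode.
echo ttyS0,115200 > /sys/module/kgdboc/parameters/kgdboc
echo g > /proc/sysrq-trigger

# Host again, from the kernel source tree:
arm-linux-gnueabihf-gdb -tui vmlinux
(gdb) target remote localhost:5551
(gdb) break led_trigger_event              # just an example symbol from the LED framework
(gdb) continue
```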
To finish this interactive debugging part, I want to quickly show interactive debugging at the user space level. I have here another situation: the tree command is hanging, and I want to debug it. I could trace it — that would be one option — or I could add prints to the code to see what's happening. But anyway, let's do it with GDB.

How can we debug this with GDB? Again, it's a client-server architecture, so I need to start the server on the target and run the client on the host. There is a command, gdbserver, for that: you run gdbserver and give it the connection and the command. So I'm asking gdbserver: please open port 1234 to debug this tree /var command. And it's there, waiting for a connection.

Now I go to the source code of the application — I need the source code of the application, I need GDB from my toolchain, and I need the application with the debugging symbols. That's what I need. Then I start GDB with my application — I'm going to use the TUI mode again, that's nice. Now I connect to the target: target remote, then the IP address of the target. In the kernel example I used the serial port to transport the communication; now I'm using a network connection, so I use the IP address of the device and the port I set up there. Now I'm connected. It stops at the first line of code — and since I didn't load the symbols for the libraries, I cannot see the source here, but I don't care about that. I just want to put a breakpoint at the main function — break main — and continue the execution.

Now I'm debugging the software, so I can run it line by line. Since it is hanging, I can just continue the execution, wait a few seconds, hit Ctrl-C to stop it — and then we can see the line where it is hanging. There is a call in there that should not be there.
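The user space recipe, condensed — the IP address and paths are examples:

```
# Target: start the server, wrapping the command we want to debug.
gdbserver :1234 tree /var

# Host: toolchain GDB plus the unstripped binary.
arm-linux-gnueabihf-gdb -tui output/build/tree-*/tree
(gdb) target remote 192.168.0.10:1234
(gdb) break main
(gdb) continue
# When it hangs: Ctrl-C, then 'bt' shows the line where it is stuck.
```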
I can see we have 15 minutes left and I want to finish the presentation, so just a few more slides, and then we can open for questions until the end.

To finish: debugging frameworks. That's the name I give to these kinds of tools — it's not formal. It's when someone creates a tool, an infrastructure, to debug one specific kind of problem. There are several debugging frameworks in the Linux kernel — to debug memory leaks, to debug hangs and lockups — and there are frameworks at the user space level; Valgrind is one of them.

A few examples. Let's say you run a command and the kernel just freezes — it's hanging. The kernel has lockup detectors that can detect most of the situations where code hangs in kernel space, if you enable them. That's usually not recommended for a production system, because it adds some overhead, but for debugging you can enable it, no problem. Then, when you run something that hangs in kernel space, after a few seconds — usually 30 or so — you're going to see a kind of oops, like any other oops, reporting the lockup. You get the program counter, the function that hangs, and with that information plus the Linux kernel source code and the toolchain, you can find the line of code that caused the hang.

This next one is using Valgrind. I have a problem in this cpuload command: it has a memory leak, and I want to debug it. So I added Valgrind to the system, and I want Valgrind to resolve symbols for me, so I also updated the program with the debugging symbols. If you don't do that, Valgrind is still able to find the issue, but it is not able to resolve the symbols — it will not tell you which line of code allocated the memory that was never freed. It still works, but if you have the possibility, it's better to build the binary with debugging symbols when running it under Valgrind.

Then you run Valgrind. There are a few parameters; for memory leaks I usually use --leak-check=full, and then the program. You leave the program running for some time, you close it, and Valgrind generates a report for you as soon as it finishes. Then you see the information about the leaks. In this case: "definitely lost" — that means memory was allocated and never freed anywhere — nine allocations of this size here, 36 K of RAM. And it generates a backtrace for the leaks: this function called that function, which called malloc, and that malloc'd memory was never freed. If you open the code there, you'll see the root cause of the leak. So if you run Valgrind with a binary with debugging symbols, you get this kind of very useful information — source files and lines of code. If you don't, you'll have to resolve the symbols on the host; you can do that.
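As a sketch, the run itself is one line, and the tail of the report is where the leak shows up (the numbers here just mirror the demo):

```
valgrind --leak-check=full ./cpuload
# ==1234== 36,000 bytes in 9 blocks are definitely lost in loss record ...
# ==1234==    at 0x...: malloc (...)
# ==1234==    by 0x...: <the allocation site, with file:line when debug symbols are present>
```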
Very good. So, my conclusion, before we open for questions. When we start our career, it's very common to use only one debugging technique: adding prints to the code. And when I say adding prints, maybe you're not even adding prints — you instrument the code somehow; sometimes you don't have a way to see a print, so you blink an LED, you use an LED as a way to get feedback, and you do this iteratively: you change the code — "if I get here, blink this LED." But instrumenting the code is only one technique. We need to understand the others, and most of the time, try to identify the best approach for each situation, for each problem. For those who are starting: try to learn more about all of these tools. For those who are not starting — like me — we keep learning every day about new tools; things are always improving, and it's very important to keep improving the way we do things, to be more productive.

This slide shows that for each kind of problem we have different options, and specific techniques may be better. Just as a simple example: interactive debugging is very bad for performance analysis, because it will probably impact the performance of the system if you run it step by step, line by line — so it's not good for that specific case. And depending on the situation, when you instrument the code you might change its behavior: the simple fact that you add a print changes the execution timing, and that might hide problems — so tracing, for example, might not be good in some situations. My point is exactly that: there are several different techniques and tools; I showed you a few of them during the presentation — try to explore them. We really just scratched the surface; there is a lot more. This material is from my three-day training on this subject, and I could really turn it into a five-day training if I wanted — there is a lot to talk about in debugging, tracing and performance analysis alone.

So I hope all of this was helpful. Let me open for questions until the end of the presentation.

Okay, let's start with a question in the Q&A box: how can we extract the trace log in case of a kernel panic, when the system gets stuck? Good question. The kernel has a feature to persist logs, called pstore. If you want that, have a look at pstore: it's a framework you can use to persist kernel logs — and ftrace buffers — to analyze later. It has an architecture where you can define the backend where you want to store this. Usually, of course, it's a persistent storage device, like a flash device, but you can also use memory: if you have a kind of soft reset, you can — for example, on ARM — go to the device tree, reserve a part of RAM for pstore, and that part of memory will be used to persist the logs. Then, if you have a soft reset — say a kernel panic followed by a reboot — on the next boot you are able to collect the kernel oops message from the previous boot. So have a look at pstore.

There is one more question — this could be the very last one: sometimes GDB's backtrace shows a corrupt stack frame — why? Again, it can be some kind of optimization done by the compiler. You might also have, for example, security flags enabled when you compile the software, and those security flags can add specific logic to handle the stack frames, and that might confuse GDB — that's kind of the whole point, because you don't want an attacker to see the stack. So that might be one of the reasons: some exception-handling or security flags that were enabled when the software was compiled. Those are a few reasons for the corrupted frames you see in GDB.

Okay — how are we doing on time? We have two minutes left. Good. How to debug corrupted kernel memory when JTAG probe watchpoint support is not available? The only way I know to debug that situation is by using a kernel feature called KDB. KDB is a built-in debugger in the Linux kernel. It's not related to GDB — it doesn't use GDB. It's a debugger built into the kernel that you can use to debug at the assembly or machine level. You can enable KDB, and as soon as you have some kind of crash, a KDB prompt pops up. You can debug memory, you can really inspect and manipulate everything — memory, processes, everything. So if the problem you're having doesn't impact the behavior of KDB itself, you can use it to debug this kind of situation. That's the only way I see if you don't have JTAG for it.

One last question, about debugging threads: can you debug each kernel thread independently — not each CPU, but each thread? A kernel thread? Right. Yes, you can. As soon as you attach to the kernel, you can add breakpoints on symbols from specific threads and debug them. I have never had problems with that — what you can do at the user space level, you can also do at the kernel space level, for sure. What I find impressive is that you can even do it with interrupt service routines: in interrupt context, you can debug and run the code step by step.
So if you can do that in interrupt context, for sure you can do it at the kernel thread level. Great — I think that's all we have for questions. Thank you for doing this session; lots of good questions. Candice, back to you. Perfect. Thank you, Sergio and Sushila, for your time today, and thank you everyone for joining us. As a reminder, this recording will be on the Linux Foundation's YouTube page later today, and a copy of the presentation slides will be added to the Linux Foundation's website. We hope you join us for future mentorship sessions. Have a wonderful day. Thank you.