Hello everybody, welcome to my talk. My name is Lev Estrovich, and I work at D. E. Shaw Research, and today we're going to be talking about a disciplined approach to debugging.

First, a quick word about myself and my company. I work at D. E. Shaw Research. It's an independent research lab founded in 2002 by David Shaw. He used to manage a hedge fund, also called D. E. Shaw, but now he wants to advance science, specifically using molecular dynamics. The high-level goal is to make an impact in the fields of biology and chemistry using computer science and computer hardware. Specifically, we built a supercomputer called Anton for molecular dynamics simulation. I'm not going to be talking about that very much in this talk. What I do is system software, basically operating-system-level software for this supercomputer that we built, and in the course of that a lot of things break and you have to debug a lot of things. I've had about 20 years of experience in the field, working at places like HP and Ciena, a networking company, so I've worked on everything from little embedded devices to, obviously, supercomputers. Hopefully I can provide some good background on debugging techniques and tools, specifically under Linux, and embedded Linux more particularly.

So first of all, why do we want to talk about debugging at all? Well, if you're a programmer, about half your time is spent on administrative tasks: filing bug reports, talking to other people, HR, and all the other things that go along with working in an office, or at home, as the case may be now. And of the time you actually spend programming, about half of that is spent debugging. You should probably take these statistics with a grain of salt, but there are about 1.5 million software developers in the US, and the total wages estimated by the US Bureau of Labor Statistics are about $150 billion, so at least around $37 billion is spent on debugging in the US. And yet there aren't many resources that formally teach debugging; if you look at the MIT course catalog, the word debugging appears only three times out of 10,000 words.

I'm seeing some questions that people can't hear me. Can everybody speak up, or type in a question if anybody else can't hear me? OK, sorry about that. Great, thank you very much; I see some responses that you can hear me, so whoever can't, there must be something on your end. Sorry about that. OK, let's continue.

Before we even start talking about debugging, you really want to start with testing. Even before you do any debugging, you should do testing: unit tests, integration tests. Basically, the more expensive your software, the more testing you want to do, because debugging is always more expensive than testing up front. And about writing software in general, I love this quote from Brian Kernighan, one of the authors of the original book on C: "Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?" On the question of whether these slides will be available: yes, I will ask the Linux Foundation organizers and we'll make them available after the talk on the conference schedule page.

So, my approach to debugging; I wanted to give it a clever title.
Reproduce, observe, and bisect, which spells ROB; apparently that's also the name of a Nintendo video game character, R.O.B. The idea is: we want to reproduce the problem that we're seeing, we want to observe and get as much visibility as we can into the problem, and bisect is how we find where the problem lies.

So why do you want to reproduce instead of just starting to debug right away? Generally, when users report bugs, or when you see a bug happen, you're missing a lot of information; there's a very low signal-to-noise ratio. Users don't really know what you want to know about the bug or why it happens; they just say, oh, this broke. So to properly instrument and observe and figure out what happened, the most important step is to reproduce the problem. In fact, even if you think you can look at the code and find a fix and say, I'm pretty sure this is what happened, you don't really know that you fixed it until you've reproduced it, then tested it after the fix and confirmed that, yes, this really fixes the problem. A lot of times you can think you found the problem when you actually didn't. Until you've reproduced it, you can't know for sure that you fixed it.

So what do I mean by observe? Basically, you want visibility. The first and most important thing is to reproduce the problem, and the second is to have visibility into the program, into what you're trying to debug. The picture here is the very first bug found in a computer, back in 1947 in the Harvard Mark II: it was an actual insect that got caught in one of the relays, and the little bug was taped to the log page.

To find these bugs, you need visibility into the program, and you can get it in lots of different ways. The simplest one is just to put a lot of print statements in your program. A better way is to use a logging framework; I have links here to some very nice ones, for example Google's glog for C++, and Python has a great logging module included. You can log to the system log from the shell using the logger command, or just redirect all your output to a file. You don't want things to disappear, so whenever you have something that could produce output, you want to drop it somewhere. And there are log aggregators like Splunk or Elasticsearch; I really like Splunk, it's very nice. They're a way to aggregate logs from lots of machines, so you can see what's happening on many different machines at the same time if you're running on a cluster and so forth. And the last point: asserts are your friends. If you're sure that something is true at a certain point in the program, put an assert there, because you never know what can go wrong, and it will provide a useful stopping point if things go off the rails.

There are lots of great tools in Linux. Basically, in Linux you have tons and tons of visibility tools, and the biggest problem is choosing which tool to use to get visibility into your program. The first thing I generally do when I need to debug a Linux application is to use strace, which traces all the system calls a program is making. GDB, of course; there are lots of great resources about GDB, and GDB is also useful for debugging core files if a program crashes. Don't forget to compile with debug symbols; that's the -g option to GCC.
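To make those last few points concrete, here is a minimal shell sketch. The program name and flags to it are hypothetical stand-ins; the tool options shown (gcc -g, logger -t, strace -f -o) are standard.

```bash
# Keep debug symbols: -g costs binary size, not runtime speed.
gcc -g -o myprog myprog.c

# Don't let output disappear: redirect stdout and stderr to a file.
./myprog --verbose > run.log 2>&1

# Drop a breadcrumb into the system log from the shell.
logger -t myprog "run finished with exit code $?"

# Trace every system call (including those of child processes) into a file.
strace -f -o myprog.strace ./myprog --verbose
```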
You can do a man gcc and it will describe all the various debugging options. In general, debug symbols add a little bit of size to your executable but don't slow down your program at all, so unless you're on a very tight embedded system you really do want to compile with symbols. Another trick, if executable space is very tight on your platform, is to compile twice: once with -g and once without. The two binaries will be interchangeable, meaning that if you use gdbserver on the target and the GDB client on your development machine, you can use the binary with symbols to debug the one without. You can load the symbols from a different file using the file command in GDB: if you started running the binary without symbols, you can then use the file command in GDB to load the symbols from the other one.

I have a question from Pankaj, I believe: how can we debug if Linux is stuck while booting? For that you need KGDB. There are good resources on that; you'll need a second computer to attach to the first one, or you can use the kernel's built-in KDB debugger. Just search for KGDB and follow the instructions there.

Always use -Wall and -Werror, that is, turn on all warnings and treat warnings as errors in GCC. They're generally super useful and it doesn't hurt to turn them on. The things they find sometimes seem pedantic, but it's generally a good idea to fix all of them.

A word on reverse debugging: it's pretty cool, but in practice I haven't found it super useful. Once you know what you're looking for, you can restart your program and get to that place from the beginning instead of going backwards. Other people may have had different experiences. GDB does support some level of reverse debugging, and there's also a commercial tool from Undo. We'll also talk about Wireshark and tcpdump, and if you're debugging any hardware, you want to get visibility into that protocol too. We'll skip ahead.

This is the Linux performance observability tools diagram from Brendan Gregg. He's an engineer at Netflix, and he has great resources on his web page about all the possible visibility tools in Linux. There's way too much to go into right now, but you can definitely go to his website; he gives great overviews of what to look at.

And finally, bisect. The idea is to do a binary search: once you've found that there is a problem, reproduced it, and can observe it, you want to binary-search to find where in the code the problem actually is. One way to do this is to make an educated guess; if you have no idea where it is, just guess, first half or second half of the program. Then you can disable one half, run it again, see if you still hit the bug, and iterate. You can also switch other things up: compiler optimization levels, tool versions. Whenever you do these things, change one at a time; don't switch too many things at once, because then you won't know which one of them actually caused the problem (a trivial sketch of this is below).
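As a tiny illustration of changing one variable at a time, here is a sketch assuming a hypothetical Makefile that honors CFLAGS and a hypothetical run_tests script:

```bash
# Baseline build.
make clean && make CFLAGS="-O2 -g" && ./run_tests

# Change exactly one thing: drop optimization, same compiler, same sources.
make clean && make CFLAGS="-O0 -g" && ./run_tests

# If the behavior changes, suspect optimization-sensitive code (or the compiler);
# if not, restore -O2 before trying the next variable (tool version, library, ...).
```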
You also generally want to keep in mind what I like to call the trust stack. You should trust your own program the least; the libraries you're using have probably been tested by more people than just you, so they're more trustworthy; and the compiler, the OS, and the hardware, in that order, are even less likely to be at fault. You always want to blame the compiler or the hardware or the OS, but most of the time that's not the case. Although sometimes it is.

I like to classify bugs into hard and soft errors. The hard errors are actually easy to find: a hard error is when you have a crash or an unrecoverable error, things just go off the rails and break, and it's generally pretty easy to reproduce and to figure out what happened. A soft error is an intermittent error, and those are terrible, because you don't really know that you fixed it; maybe you just got lucky and it didn't happen again. So what you want to do is try as hard as you can to reproduce the error and to turn that soft error into a hard error. How do you do that? Tracing and logging can give you an idea. You can look at external events: logs, network, CPU load. You can try to replicate the error message that you see. You can stress-test the application, or, of course, you can use GDB or another debugger. So I'll give a few examples; I see there are a lot of questions, and I'm going to try to answer most of them at the end.

This was a weird error that we saw; it looks like an I/O error in the program. It's highlighted there in red, you can see the output: we run an executable and it says boost filesystem reports an input/output error. Input/output error is errno EIO, and this is generally very hard to reproduce, because it's usually only caused by bad hardware or bad NFS network shares, and it's intermittent, meaning even with bad hardware it won't always give you an EIO. So how do you reproduce something like this? I actually first tried to reproduce it by just removing the file, but that does not result in the same failure: the call works just fine, says the file is not there, and doesn't cause an I/O error or anything that makes the program crash. So I looked at errno.h, which you can find under /usr/include in your Linux distribution, for an error code that's similar and easy to simulate. There's one called ELOOP, too many levels of symbolic links. You can simulate that by making two links: link file A to file B, and file B back to file A, and then when you cat A you get the error "Too many levels of symbolic links" (sketched below). So I could try running the program against that. I wrote a little test program that does the failing call, boost::filesystem::exists, first on a non-existent file foo and then on the too-many-symbolic-links file A. When we run the program, the first call behaves normally, foo does not exist, but the second one fails with "terminate called", which shows you that C++ threw an exception.

Now, why would checking for the existence of a file throw an exception? If you look at the Boost documentation, it turns out that somebody decided it was a really good idea to return one if something exists, zero if it doesn't exist, and, in some cases spelled out on a whole separate, long, legalese page about error reporting, to throw an error. This is a really terrible idea, I think, but we found what the error was, and there's a different version of the exists call that does not throw exceptions, so we just switched to using that. I don't know why they decided to do it that way.
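To recap the symlink trick as a quick shell sketch: the symlink loop is exactly as described above, while ./exists_test is a hypothetical stand-in for the little boost::filesystem::exists test program.

```bash
# Build a symlink loop: a -> b -> a, so any access returns ELOOP.
ln -s a b
ln -s b a
cat a               # cat: a: Too many levels of symbolic links

# Point the test program at a missing file and then at the loop.
./exists_test foo   # reports that foo does not exist; no exception
./exists_test a     # boost::filesystem::exists throws; "terminate called ..."
```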
Now, a different bug. We start a program and it fails to start, and it gives us a useful error message saying that Intel MKL, which is the Intel math library, could not load a particular library. Well, that's straightforward enough, but then you check: is that library actually there? Oh yes, it is. What? Why is it failing to load a library that's present? The library is not on NFS, it's on a regular disk. So why would the program fail to start when that library is actually there?

The second step is to observe, to try to figure out what is going on here. Like I mentioned before, strace is great for this: if the problem is reproducible, you can run your program under strace and see what it's trying to do. So I took an strace of a working run of this program and of a failing one. Where they begin to diverge is the first red line, where it does a getcwd. getcwd is a system call that gets the current working directory; when you type pwd in your shell, that's what it does, it asks where we are right now in the file system. And it's getting ENOENT, no such file or directory. So that's weird: it's checking its current directory, and the current directory does not exist. Soon after, you see the write call that prints the Intel MKL fatal error, which is the one we're looking for. Why did that happen? It looks as though something in the MKL library really, really wants to know which directory you're in, and if it gets ENOENT on the current working directory, it fails with a really confusing error message.

So let's check that this is really the problem. When you bisect, you make an educated guess; let's make an educated guess and write a test program. This little bash script (sketched below) makes a directory, test. The parentheses say: in a subshell, wait for one second and then run the program, the one that was failing for us; and the ampersand says, do this in the background. So it's going to execute a subshell, and in that subshell it waits a little bit and then runs the program in that directory. Meanwhile, in the main script, we remove the directory that the subshell is sitting in, and then we wait for everything to exit. And sure enough, when we run this shell script, we get the failing error message. So now we've reproduced it and found the actual cause: MKL really doesn't like it when you remove its working directory out from under it. Homer Simpson is very upset.

So, GDB. GDB is really great at debugging basically anything: probably any language you can think of, anything compiled by GCC, which is no longer just the GNU C Compiler but the GNU Compiler Collection and compiles basically any language, and it can also debug things built with LLVM, if I'm not mistaken. You can debug a live process as well: you can start a new program under GDB, or, if you have a process that's already running, you can attach GDB to it. You do have to be careful, because attaching stops the process, so if it's something that's important to keep running, you probably don't want to run GDB on it directly. However, what you can do is use the gcore tool, which will generate a core dump. A core dump is basically an image of the registers and the memory of the running process. gcore will make the program dump core, generating that image without terminating the process, and afterwards you can use GDB to debug it.
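Stepping back to the MKL bug for a moment, here is roughly the reproducer script that was just described; /path/to/failing_prog is a stand-in for the application that printed the MKL error.

```bash
#!/bin/bash
mkdir -p testdir

# Background subshell: cd into the directory, wait a second, then start the
# program while sitting in that (soon to be deleted) directory.
(cd testdir && sleep 1 && /path/to/failing_prog) &

# Main shell: remove the directory out from under the subshell.
rm -rf testdir

# Wait for the background job; it fails with the confusing
# "Intel MKL FATAL ERROR" message, reproducing the bug.
wait
```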
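And here is a minimal sketch of the gcore workflow just mentioned; 1234 stands in for the PID of the running process and ./myprog for its binary.

```bash
# Snapshot the process's memory and registers without killing it.
gcore -o /tmp/myprog 1234        # writes /tmp/myprog.1234

# Core files are huge but full of zeros and repeated text, so they compress well.
gzip /tmp/myprog.1234

# Later, load the binary plus the core into GDB and poke around.
gunzip /tmp/myprog.1234.gz
gdb ./myprog /tmp/myprog.1234    # then: info threads, thread apply all bt, ...
```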
In fact, from the core file GDB can even pick up which executable it was, so even if you don't remember which version of the executable you ran, the path will actually be in the core file.

GDB has really good thread support. You can look at the man pages; there are commands to look at all the threads or to apply a command to all threads. For example, the bottom one here is thread apply all backtrace: the backtrace command will then be run on every thread in your program.

It has very good support for debugging embedded programs, which is particularly relevant for this conference. You run something called gdbserver, which is a very small, several-kilobyte stub that just traces the program, and all the command-line handling and all the other processing in GDB is done on the client. You connect to it with GDB's target remote command; the gdbserver manual pages explain exactly how to do this (a sketch of the workflow is below).

A very useful feature if you have memory corruption, and this happens a lot, where you know that some memory has been trashed, is watchpoints. These are like breakpoints, except they break when something writes or reads a certain address in memory. This is super useful when you know that something in your program is going to corrupt memory but you don't know what it is. Keep in mind that what you really want are hardware watchpoints. The x86 architecture provides four hardware watchpoints; other architectures provide different numbers, and some don't have any. If you use more than four, or your architecture doesn't provide hardware watchpoints, your program will get really slow, because GDB then has to intercept basically every memory access and check whether it touches the address you're watching. With hardware watchpoints, the processor runs at native speed, checks every memory access itself, and breaks if you've tripped a watchpoint.

Conditional breakpoints are also super useful. You can break on a certain location, just like a regular breakpoint, and give it a condition: any condition that GDB can evaluate, say my_variable == 5. So you say, break at this location if that condition holds. Just keep in mind that GDB will actually take the breakpoint and evaluate the condition every time, so it will slow things down if you're going to hit that breakpoint many, many times. You can also run commands at a breakpoint: if you want to print out a bunch of things every time you hit a breakpoint, use the commands keyword and give it a list of commands to execute at that breakpoint.

Finally, a useful feature: catch. This is useful for C++ specifically, but it can also be used to catch forks or signals. For C++ it's very useful for catching exceptions: it will stop you as soon as anything throws any exception.
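Here is a rough sketch of the gdbserver workflow and of the breakpoint features just described. The program, port, board name, variable and function names are all hypothetical; the gdb commands themselves (target remote, watch, conditional break, commands, catch throw) are standard.

```bash
# --- On the embedded target: run the program under the tiny gdbserver stub ---
gdbserver :2345 ./myprog

# --- On the development host: drive it from a full GDB, using the binary
# --- that was built with -g (myprog.debug) for symbols.
cat > debug.gdb <<'EOF'
target remote target-board:2345       # connect to the stub on the target
watch g_counter                       # hardware watchpoint: stop when g_counter is written
break process_item if g_counter == 5  # conditional breakpoint
commands                              # run these each time that breakpoint hits
  silent
  print g_counter
  continue
end
catch throw                           # stop whenever any C++ exception is thrown
continue
EOF
gdb -q ./myprog.debug -x debug.gdb
```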
So here's an example of debugging a hanging process with GDB. We had a periodic hang: we run this big application across something like 10,000 jobs, and each run takes about 30 minutes. Periodically we see a hang where it just stops, and we have a timeout after two hours that kills the application. The first thing we tried to do was reproduce it, but we couldn't figure out which inputs would actually cause it to hang. So we added code at the scheduler level, that is, in the process that starts our application, to monitor the application: if we see no output for about 10 minutes, we use gcore to grab a core dump of the application, and we can debug it offline later. Now, the core files produced are really big, because the application uses a lot of memory. However, they zip very well: when your program runs it uses a lot of memory, but a lot of that memory is zeros or repeating bits of text, and gzip compresses it really well.

So we used GDB to analyze the core dump of this hung process. When you get into GDB, type info threads, which tells you where each thread of the application is. This one has lots of threads, and only one of them is doing something interesting: the rest are in pthread_cond_wait, waiting on some pthread condition variable, and one of them is stuck in read. Now, read is a blocking system call, meaning when you call read, it does not return until it has the result of the read. So that's interesting to follow up; that could be the reason it's hanging. What is the read blocking on?

So we switch to thread number three, the one that was blocked, and see what's going on. You can see this is inside the libpthread implementation, which was not compiled with debug symbols, so GDB can't tell us what the arguments to the read call currently are. Oh, I have a question: how do we know it's stuck in read? Well, we don't really know it's stuck in read, but we can look at the things it could plausibly be stuck in. blob64 is the only other function here that's not pthread_cond_wait, and blob64 is a very simple function, so that doesn't seem like something it would be stuck in; read is the only function here that should take any amount of time.

So, looking at the read, what you can do is type info registers. When a core dump is generated, it also dumps all the registers at that time. And if you look at the Wikipedia page for the x86-64 calling convention, it tells you which registers are used to pass arguments to a system call, or in fact to any function call: the first three are RDI, RSI, and RDX. I've highlighted them here in green, blue, and purple, and read takes three arguments. If we look at those registers in that thread: the green one, RDI, is zero; the blue one, RSI, the second argument, is some pointer; and the purple one, RDX, is four. That corresponds to file descriptor zero, a pointer to a buffer, and a count of four. And in fact the application does have a read of four bytes, but it's supposed to be reading from a socket, whereas file descriptor zero is standard input. That's a good explanation of why it's waiting: it's waiting for input on standard input, which is not actually connected to anything; this is an offline application.

Looking at our source, it turns out what happened was this: in this application, which is a server, there was a cleanup function that closed the socket, and a different thread that sat in a loop, with the cleanup code waiting for that thread to exit. Every time through the loop, the reader thread would do a read; if the read returned an error it would exit, otherwise it would go and process the data. And basically there was a race condition on exit: if you close the socket and then clean up the C++ object, the memory holding the m_socket member variable can be reused by something else, and the thread that was looping would then use that m_socket value.
It turns out that after the main thread exited, something reused the memory and set the m_socket variable to zero, which is just a garbage value as far as the reader thread is concerned. A read from almost any other garbage value would have returned an error right away and the reader thread would have exited; however, a read from zero, from standard input, gets stuck waiting for somebody to type something on the console. The fix here was easy: the two green lines add an exit flag and a wait for the reader thread to exit.

So, I promised I'd talk a little bit about Wireshark; let me try to get a bit of that in. Wireshark is great for debugging any sort of network issue. Whenever you have to deal with the network, you can use tcpdump, which is a command-line tool to capture network packets, and then visualize them with Wireshark. Keep in mind that the language used to specify what you're capturing is slightly different between tcpdump and Wireshark; in fact, Wireshark uses the same libpcap, the underlying library that tcpdump uses, so the capture language in Wireshark is also going to be different from its display filter language for viewing. You can do man 7 pcap-filter for capturing and man 4 wireshark-filter for viewing; they're very similar but slightly different, as you can see in the slides. So you can capture a specific port and a specific host and then analyze it later (there's a capture sketch below). You want to capture as much as you can without overwhelming the kernel while it's capturing all the packets, so that it doesn't drop packets, and then you can analyze it later and see what you see. Conveniently, when you open a capture, Wireshark is a nice GUI: it breaks down all the packets for you, shows the various things in different colors, and it can even follow TCP streams and give you the whole conversation. Over here, I'm not sure if anybody can see it, it's pretty busy, but the idea is to look at the red and black entries, which are the errors. Over here there's an HTTP GET, which Wireshark conveniently decodes and prints, circled in blue, and a few frames later we see a TCP retransmission, highlighted by Wireshark, with a little red highlight I've put around it. When you see a TCP retransmission, that means the previous packet was lost, so we can see there's packet loss on this link.

Finally, a word about hardware debugging. Again, observability is what you really, really want: if you're debugging I2C, PCIe, USB, any sort of hardware interface, get the tool that gives you visibility into that hardware interface. For I2C, I've listed several debuggers; the Saleae and Beagle analyzers are both good. For USB, a little-known fact is that Wireshark can actually display USB traffic, and you can snoop traffic on the USB bus with the following commands: you do modprobe usbmon, and then tcpdump with the interface set to usbmon1, usbmon2, and so on, depending on the number of your USB bus. So your computer can snoop traffic on its own USB bus and output it to a file; just like a regular tcpdump on the network, it dumps the USB packets, and Wireshark supports looking at those packets afterwards. If you're using PCIe or NVMe, SATA, any sort of high-speed interface, you need a hardware protocol analyzer. LeCroy makes great protocol analyzers for PCIe; I've used them, they're super useful, and also super expensive.
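Before the hardware example, here is a small capture sketch for the tcpdump/Wireshark and usbmon points above; the interface names, host and port are just examples, and both commands generally need root.

```bash
# Capture only traffic for one host and port (pcap-filter(7) syntax) and save
# the raw packets to a file for later inspection in Wireshark.
tcpdump -i eth0 -w web.pcap 'host 10.0.0.5 and tcp port 80'

# Snoop the USB bus: load the usbmon module, then capture bus 1.
modprobe usbmon
tcpdump -i usbmon1 -w usb.pcap
```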
So, I'll give one example of hardware debugging. We had a Xilinx FPGA that was controlling an I2C bus, and once every several days in production, devices would just stop responding and we didn't know what was going on. Step one was to reproduce it and figure out what on earth was happening on this I2C bus. So we increased the polling frequency of the temperature sensors and other devices on the I2C bus, sampling them every tenth of a second instead of every 10 seconds, which increased the probability of hitting the error by about a factor of a hundred. Now we got hangs every few minutes, we could actually reproduce this in the lab, and we could attach an I2C analyzer, a logic analyzer, to it. We couldn't get visibility inside the Xilinx FPGA because the IP block was encrypted, so we used an external logic analyzer and just put the probes on the wires of the I2C bus; I2C is slow enough that you can do this with a regular logic analyzer.

Now, this is not the capture from our actual analyzer, but a very similar problem I found online. On the top, in yellow, you can see the clock; I2C has two lines, clock and data. The clock just toggles up, down, up, down, and the data line carries the ones and zeros, as you can see in green at the bottom. If you look at the zoomed region highlighted in red, the clock here is rising pretty slowly, and as it rises past the threshold where the clock switches from a zero to a one, there's a little glitch; you can see where it glitches past. A lot of I2C devices are supposed to deglitch this, they're supposed to allow for slow rise times, but not all of them really do. If you get glitches like this on your clock line, your hardware can detect an extra clock cycle: the receiver thinks another clock has been issued on the bus, and then everything is off by one, because the device driving the I2C signal thinks it has clocked once while the device receiving it thinks it got clocked twice, and everything breaks. Once we saw this, there was a very easy solution, which was to strengthen the pull-up resistors on the bus so the line rises faster. There's a good link here to an I2C debugging document from Texas Instruments that discusses similar problems. So again, the point is that you want as much observability into the bus you're working with as you can get; we would never have caught this purely in software, without actually looking at the logic analyzer.

So that's all I've got in the time allotted; now let me try to go through some of the questions. I think I already answered some of them, but we'll go through the rest. Apologies in advance if I mispronounce your name. Let's see here, we have some comments; one second. OK, are the slides going to be available? Yes, we'll post the slides. There's a comment from Steve that assert is actually really easy to use and doesn't really have a runtime cost, because it compiles away in a release build. That's a very good point. If you have a release build where you really do want the assert to stay in, you'll want your own implementation of assert that won't get compiled away in the release build. Leandro asks how LLVM/Clang compares to GCC in terms of debugging; GDB does support Clang-built binaries.
I think there's basically no difference with LLVM/Clang, but I could be wrong; I don't have that much experience with LLVM. Steve also points out that there are other useful warnings in GCC, like -Wshadow, and that -Wall doesn't actually turn them all on. That's true: -Wall is basically a good starting point and turns on most of the warnings you care about, there are other warnings you can be more pedantic about, and there are also static analysis tools like Coverity and others that you can run on your code to catch a bit more.

Larkin, I think, asked if I can elaborate on reverse debugging. Reverse debugging basically allows you to step your program backwards: as your program executes, it records the state of basically everything, and then you can step backwards, so instead of doing next in GDB you step back in time. This does incur a cost, because it has to record the state of your program at every point, and it can't undo things like packets you sent out on the network; it can't take those back. I'm not sure how it handles file writes either; that also seems difficult in practice. I really haven't found it to be super useful.

Brafadumbal, sorry if I mispronounce your name, asks: will gcore work even when the process is hung? Yes, gcore will work even if your process is hung, and in fact that's very useful. If your process is hung, though, you can also just attach to it with GDB directly; you don't need to take a core dump, but if you do need one, you can use gcore.

Gavin asks: I'm seeing a rare, roughly 3% memory corruption that is caused by a task race condition on a MIPS processor; is this a good case for gdbserver and hardware watchpoints? Yes, if you have memory corruption, watchpoints are great. Make sure your hardware supports watchpoints, because otherwise you're going to impact the timing and your program will run really slowly. You don't need gdbserver if you can run GDB natively on the target; if your target is too small, run your program under gdbserver and then use a GDB client to connect to it, and it'll work just like regular GDB on the client side, it's just a bunch of setup you have to do.

F.Umson asks what correlates the two log entries; I think you're referring to the TCP entries here. What correlates them is the addresses of the sender and the receiver: look at the source and destination IP addresses and the ports. Basically, the tuple in TCP is source address, destination address, source port and destination port, and those define a connection. Wireshark can also filter on this: you can right-click on any one of these entries and say Follow TCP Stream, and it'll show only the packets that belong to that particular TCP conversation.

Joe asks whether any packet capture library supports this on Windows. I honestly don't know; I don't have that much Windows experience, I'm sorry.

Shiva Marthi asks if I'm covering ftrace. Not in this talk, but there are lots of great resources: you can look on Brendan Gregg's website, like I mentioned before, or just Google it. There was also a good debugging tutorial talk yesterday that covered ftrace; if there's an offline way to watch it, I'd suggest watching that.

Pankaj asks why the glitches on the SCL line were occurring. So, this is on an I2C bus.
Basically, I2C is a passive bus: the active device pulls the line down, and when it's not driving the line it lets go and lets the line float back up to its full voltage. That's why there are pull-up resistors on those lines: they pull the voltage back up from zero to the line voltage. The more devices you put on the bus, the more capacitance there is on the line, and the longer it takes to get from zero volts back up to your desired voltage. So you need a lower pull-up resistance on the line, basically, to pull it up quicker. This happens quite a lot on I2C buses, and there are different solutions discussed in that TI document. He also says the drive strength of the GPIO could have changed the rise time. That's true, but again, I2C is driven low, not driven high, so the drive strength affects the falling edge; here the falling edge is pretty fast but the rise time is slow.

Rachel asks: do I use fault injection testing to support debugging? Yes. In fact, the TCP error shown here was produced using fault injection. I find Docker containers super useful for this: you can start things up in a Docker container and cause it to drop packets, power off the machine entirely, hang it, or cause all sorts of other failures. They are a bit messy and annoying to set up, and they're not super useful on embedded systems, because it's hard to set up a Docker container for a different architecture on your host architecture. But if you're debugging on the same architecture, I think containers are super useful for fault injection testing.

Jorge asks how to analyze a core dump. Basically, a core dump is a snapshot in time. You run GDB with your program and the core dump, and GDB loads your program as though it were at the point in time when the core dump was taken; then you can look at the registers, you can look at memory. You can't advance time, because it's just a snapshot, but you can use it to look at the state of the entire program, examine all the variables and registers and threads and see what was going on at that time. Backtraces will work: for each thread you can see the backtrace, who called whom, and what your stack looks like at that point.

Emil asks for recommendations for tools for debugging multi-threaded applications. GDB is really good at thread support. If you have race conditions and things like that, you can use Valgrind and Helgrind; Helgrind is one of the tools within Valgrind that tries to detect race conditions in your code, and it's generally pretty good. It does slow your code down a lot, by something like a hundred times, because it's effectively almost emulating the CPU.

Jeremy asks: are there any tools for debugging PCIe transactions besides a PCIe analyzer, anything that can record data on the bus as seen by the root complex? I'm not aware of anything. You need really fast snooping capability to log what's going on on the PCIe bus, and most processors don't have that kind of visibility into the PCIe bus. So other than a PCIe analyzer, I'm not aware of anything that would capture all the transactions on a PCIe bus, unfortunately.
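Two quick sketches for the core dump and race condition answers above; ./myprog and core.1234 are stand-ins for your own binary and core file.

```bash
# Post-mortem: load the program and the core, dump threads, stacks and registers.
gdb -batch -ex 'info threads' -ex 'thread apply all bt' -ex 'info registers' \
    ./myprog core.1234

# Race hunting: run the program under Helgrind (expect a very large slowdown).
valgrind --tool=helgrind ./myprog
```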
We have a comment from Thomas: GDB has a lot of hidden power, but there aren't very many accessible and clear resources for discovering it. GDB has a super useful help interface: just type help in GDB and peruse the different sections; the sections on breakpoints, tracepoints and watchpoints are all really useful. The online manual is actually quite good, so I'd also check out the GDB documentation on the GNU website, and there's a good O'Reilly book on GDB as well.

Andreas asks if there is any tool to dump and observe HTTPS traffic. Unfortunately, because of the nature of HTTPS, unless you capture it before it's encrypted there's really no good way to observe it; it's designed to be secure precisely to prevent you from observing it.

Orlin asks if there are any GUI tools on top of GDB. Yes, there are several GUIs on top of GDB. The classic one is DDD, the Data Display Debugger, I believe; it's pretty old and clunky, but workable. Eclipse supports GDB as a backend, so it can act as a GDB GUI, and Microsoft Visual Studio Code can also be a front end for GDB. Lots of tools drive GDB on the backend instead of the plain GDB text interface. I just like to use the text interface, but I know other people prefer a GUI. In fact, GDB has a kind of text-mode GUI of its own, the -tui option, I believe; it shows your program in a curses window, so it's a combination of text mode and GUI.

Andree asks whether I recommend using debugging flags in production builds. You should always use -g; it doesn't add any runtime cost, it just adds symbols, so it adds a little bit to your image size. Compiling with less optimization, however, will make it easier to debug, because things will be more sane as you step through them, but it will slow your programs down a lot; I don't recommend doing that in production.

Lechen asks whether, from a general standpoint, there are any formal techniques for debugging a program: instead of test-and-observe, can we formally infer a bug from how a program was written? There is some research about it; I don't know of anybody who actually does this much. There are static analysis tools that try to catch bugs, and what you're probably really looking for is a way to write code that's less buggy in the first place; there is some research on that, but it's beyond the scope of this talk, as you mentioned.

Encore asks for recommendations about debugging Linux kernel driver and interrupt related issues. KGDB, which is built into the kernel, is useful. Oh, one interesting thing when debugging the kernel: there is a /proc/kcore file that gives you a live core image of the kernel, and if you have your Linux kernel build, you can use GDB to actually debug the live kernel. You can't step through it, obviously, but you can look at what is effectively a current core dump of the kernel and see what its state is. Or you can use the built-in KGDB tool.

I'm going to do one more question, and then I think we're out of time. CRA points out that in some cases the process may get hung in the kernel, which can prevent a user-space dump from either happening or completing, and that a full kernel dump should be used in cases like that. Right, that is a good point. You can try to use /proc/kcore, which gives you the kernel core file. Well, if you're in a state where your kernel is hung, it's pretty bad.
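A rough sketch of the /proc/kcore idea mentioned above: you need a vmlinux with debug info that matches the running kernel, and the path below is just one common location, it varies by distribution.

```bash
# Inspect the live kernel's memory through /proc/kcore (read-only, no stepping).
sudo gdb /usr/lib/debug/boot/vmlinux-$(uname -r) /proc/kcore
# then, for example:  (gdb) print jiffies_64
```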
Generally, there are watchdogs in the kernel, tasks that get scheduled periodically, that will print backtraces into your kernel logs, /var/log/messages (or your distribution's equivalent) or the dmesg output; those are the kernel logging locations. You can try to analyze from those what happened, for bugs in drivers and so forth.

So I think we're coming to the end of the time allotted for the talk. There are a few more questions that I wasn't able to get to; you can continue the conversation in Slack. There's a Slack channel for this track, 2-track-embedded-linux, which I'll monitor for a while, and if anybody has specific questions, they can ask me there. Thank you very much for attending. I hope you learned something during this presentation, and I hope to see everybody a year from now at an actual in-person conference. Thank you very much.