Hey, everyone. My name is James Golick. Let me just make sure this thing is on. I go by James Golick everywhere online — Twitter, GitHub, Instagram, Freenode — and my blog is jamesgolick.com, so I'm very easy to find. I work 24/7, so you can always reach me there. I work for a company called Computology. We make a product called packagecloud, where we do apt, yum, and RubyGems repositories as a service. If you have a need for public or private package repositories, or both, definitely check us out. And come talk to me if you want to talk about packaging — I would venture a guess that I'll get more excited than anyone else in this room about talking about packaging.

So people say this, right? Programmers say this somewhat often, and there's been a lot of talk on my Twitter recently, and on some blogs, about whether we should stop saying it — or suggesting that we should stop saying it. I think it's important to distinguish between people who say this in a moment of frustration — an outburst, because what we do for a living can be very frustrating: there's pressure on, you've got to ship something, something's broken, you get frustrated, you tweet something, whatever — and people who actually, legitimately believe that everything is terrible.

I was going to pull my iPhone out of my pocket right now — I'm not allowed to have it on stage — and show it to you and say: I have a supercomputer in my pocket that can summon nearly any media that was ever created, over the air. That's not terrible. That's fucking awesome. But as much as I think you're wrong if you legitimately believe that everything is terrible, I think you're at the very least naive, and possibly being a little bit disingenuous, if you don't admit that everything is broken. And I don't mean that nothing works — obviously stuff works — but the reality is that software is buggy, and it's flaky, and it's unreliable, despite our best efforts to make it better.

And this actually makes sense. We're innovating really fast. Software engineering is a relatively new field; we're still figuring out how to do it, and we just haven't caught up with the pace of innovation and growth in our industry. We're still working on it. So it makes sense that everything is broken.

And as a result of everything being broken, when you get engineers together in a room or online, one of the big topics of discussion is always: how do we write better code? How do we produce software that's more reliable, more correct — that works better, that handles edge cases better? There are all these different techniques for doing that. There's testing, obviously very popular in the Ruby world; there's static analysis; and then there's stuff like what MRB was talking about this morning — new languages with more sophisticated type systems that are capable of more accurately expressing the constraints of the different units inside our programs.

But one thing we don't talk about that much — or at least, in my opinion, not enough — is how we cope with software when it doesn't work, and what the strategies are for dealing with it, whether it's our code or someone else's. I said this the other day on Twitter: if you want to deploy high-quality software that performs, you should expect to fix bugs at every level. There's a very simple reason why.
Bugs exist at every level. Given enough time and enough complexity, you're going to run into those bugs, and either you fix them or they stay broken. Over the last few years I've fixed a bunch of bugs at a bunch of different levels of the stack — obviously lots of bugs in my own code, but also bugs in the Ruby VM, in memory allocators, in MySQL, all kinds of places. And I always get this question: how did you find that bug? How do you go about finding a bug in a codebase you're unfamiliar with? How do you go about finding a bug in a language you don't know very well? What I realized over the years is that the methodology I use for debugging is always the same. It doesn't matter where in the stack I'm looking, it doesn't matter what the language is, it doesn't even matter whether I really know the language. It's always the same methodology, and it's very, very simple. That's what this talk is about.

Every good debugging session starts with this quote. It's a mantra among programmers. Someone — maybe it's your boss, maybe it's a user, maybe it's a friend — comes to you and reports a defect in something. You pull up the code you think is the offender, you stare at it, and you're like: I don't understand how this is possible. This can't be possible. This can't be happening. And you reread the code, and you read it over and over again, trying to understand how it's possible, and you keep saying this. We're going to come back to that. By the way, there's no Ruby in this talk, so sorry — but not sorry.

So this is a true story about a debugging session I engaged in a few years ago. I'm from Toronto, Canada, and I have a friend from my hometown who was running a PHP site. He's non-technical; he had a staff of people working with him, but I don't know where they were at the time. He called me up one day and said: hey, my site is down. And I'm like, okay, cool — so why don't you just get your team to fix it? And he's like, well, they're not here, because of reasons. Can you fix it?

So: I didn't have the source code. I knew nothing about the system — I'd never seen the source, never even really talked to any of his developers. It was written in PHP, and I had last written PHP maybe five years earlier, so I had very little familiarity with the language. I did happen to have SSH access to his servers, because I had diagnosed some other, unrelated thing for him at some point. So he's like, yeah, can you fix it? And I'm like, I don't know, I guess I can take a look.

So I SSH into one of his servers and figure, okay, he's probably running PHP under Apache, so I'll take a look in the Apache error logs — that's where I figured the PHP errors would be. And of course there's nothing in there. It's funny, because you might think that this is sort of a worst-case scenario — oh no, the site is down and there's nothing in the logs — but the fact of the matter is that there's never anything in the logs. And even if there is something in the logs, you'll be really lucky if it's useful. I mean, if the program knew why it was broken, it probably wouldn't be fucking broken, right?

So, cool. What now? Here's what I did. I knew that the PHP code was probably executing in one of those Apache processes, so I found the PID of one of those Apache processes.
Then I ran a program called strace to attach to that running process and give me some debugging output. If you're not familiar with strace, it's a program that gives you a trace of all the system calls executed by a program you attach it to. And if you don't know what a system call is: system calls provide the interface between userland programs — the kind of programs most of us in this room write most of the time — and the operating system. System calls are used for all kinds of things, like writing to files or to sockets, or allocating memory — essentially all the services that the operating system provides to userland.

strace output looks like this. What strace does — it's a very simple little program — is capture the system call information using a kernel API and then reconstruct those system calls, in ASCII text, to look like C function calls. So you have the name of the function, the arguments in parentheses, an equals sign, and then whatever that system call returned. In this case, we're writing to file descriptor 1, which is standard output, from a buffer, which in this case is a C string that says "hi" and has a newline character at the end. The third argument to write is the number of bytes to write from that buffer to the file descriptor, and the return value is the number of bytes that were successfully written.

Most of the system calls that are going to provide useful information when you're debugging, say, a Ruby program have really simple names like write or open or read. But some of them have really obscure names that probably won't mean anything to you if you've never done this kind of programming before. If you get confused about what a system call is and what it does, they're all documented in section 2 of the manual, so they're really easy to find: man 2 and then the name of the system call, and most of them are documented pretty well.
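To make that line concrete, here's a tiny C program — purely illustrative, not from the actual debugging session — whose only interesting work is that one write call, along with the strace line it produces (here run directly under strace rather than attached to a running process):

```c
/* hi.c -- illustrative only: a minimal program whose only interesting
 * work is a single write(2) system call.
 *
 * Build and run it under strace with something like:
 *     gcc hi.c -o hi && strace ./hi
 */
#include <unistd.h>

int main(void) {
    /* Write 3 bytes from the buffer "hi\n" to file descriptor 1 (stdout).
     * Amid all the process-startup noise, strace shows this call as:
     *
     *     write(1, "hi\n", 3) = 3
     *
     * i.e. the name of the syscall, its arguments, and its return value --
     * here, the number of bytes that were actually written. */
    write(1, "hi\n", 3);
    return 0;
}
```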
I've taught a lot of people how to use strace, and this is usually where it falls down. You attach strace to some process and you're like, all right, I'm going to find a bug — and then you get this. And this is a really small amount of strace output; I've seen many megabytes of strace output for a small number of requests on a web server. So you look at it and go, well, what the hell do I do now? But it turns out there's actually a really straightforward methodology for finding the causes of problems in strace output.

First of all, you always work backwards: start at the bottom and go up. The first step is to find where the failure is actually being reported. In this example, that's Apache writing a 500 HTTP response back to a socket — socket 12, presumably. Once you've found the error being reported, you know that everything beneath it is probably not very interesting, because the cause of the failure obviously happened before the failure got reported. Then you work back up. Usually, if you're going to find the cause of your bug in the strace output at all, it's going to be relatively close to where the error actually gets reported — to the client, or to wherever it's being reported.

So in this case we work back up and find this failing call to open: there's a file called /var/www/db.in.php that's missing, and we get the 500 error right afterwards. Okay — a hypothesis is starting to form. Maybe someone typoed that filename and deployed bad code, or edited the code right on the server — I don't really know how they work. So you slowly keep working backwards from there through the output until you find something that looks like it may be the offender. Just above that — I don't know if you can all see the green — just before that failing open call, there's a successful open of /var/www/index.php. So you can imagine: you hit the root of the site, Apache attempts to load a file called index.php, which has a typoed include in it, and that's what's causing the 500 error. That's a sensible hypothesis for what's causing this outage.

So then you try to prove your hypothesis. Look in the index.php file: does the first test of the hypothesis hold? Yes — we are attempting to include something called db.in.php. Then look to see whether that file is actually there — maybe the permissions were just wrong. It turns out that file is not there, but there is a file called db.inc.php. So that makes sense: there's bad code on the server, and that's what's causing the 500 error. The next step is to fix the bug and then, you know, feel good about yourself.

I think the total time it took me to fix this outage — from the time my friend in Toronto called me until his site was back up — was about three minutes. I felt pretty good about myself, and he was really impressed. He was like, wow, how did you do that? It's not like he sent me flowers or anything, but he seemed appreciative. I would have liked some flowers.

Later that night I started reflecting on that debugging session: why was it so effective? I never find bugs in my own code that quickly. When it's my code, something stupid like that takes me an hour to find, or longer — I'm searching through files, trying to see the error. And I realized it's because when you come into a debugging session with all these assumptions, they usually lead you astray. If your assumptions were right, you probably would have written the right code in the first place, and there wouldn't be a bug.

So that's the zeroth rule of my formula for how to debug anything: forget everything you think you know, because it's all wrong. And the first rule follows from that: get a third-party opinion. If you don't know anything, if you're blind, then you need to ask someone for help — you need to ask for information about what's actually happening, as opposed to what you think is happening, because if what you thought was happening were right, there would be no bug.
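One really cheap way to get that kind of third-party opinion is to reproduce just the operation you suspect in a tiny standalone program, completely outside the app. For the story above, a few lines of C — hypothetical, just to show the idea; only the path comes from the actual incident — would have confirmed the missing include on their own:

```c
/* check_open.c -- hypothetical reproducer: try the exact open() that the
 * strace output showed failing, and report what the operating system says.
 */
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    int fd = open("/var/www/db.in.php", O_RDONLY);
    if (fd == -1) {
        /* With the typoed filename this prints something like:
         *     open /var/www/db.in.php: No such file or directory
         * which lines up with the failing open we found in the trace. */
        perror("open /var/www/db.in.php");
        return 1;
    }
    printf("opened fine, fd = %d\n", fd);
    return 0;
}
```

The point isn't the program itself — it's that you're asking the operating system directly, instead of trusting your assumptions about what the app is doing.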
In this example, and for most of this talk, I'm talking about this tool called strace. It's extremely useful for debugging programs on Linux — such a great first thing to look at — but there are a whole bunch of other ways to get third-party opinions. This is a great diagram. It's a little bit intimidating, but a lot of these tools, depending on what kind of thing you're trying to debug, can be very useful, and many of them are worth learning — if not all of them — depending on what kind of software you actually write and have to debug. I'm going to put these slides up online, so you can look at this diagram if you want to. There are other ways of getting third-party opinions, too. For example, if you suspect that the bug might be in the operating system, finding another program that does what your program is supposed to do, running it, and seeing whether you get the same behavior can be a first step toward confirming that the bug is in the operating system, or in another layer, rather than in your program. That can be a very useful technique as well, and I'm sure there are others that I haven't thought of or have never used.

So the next bug I want to talk about — it's kind of debatable whether it was a bug in packagecloud or a bug in apt, the package manager — was an interesting debugging session that lasted way too long. It started when a customer tried to install a packagecloud repository on the latest Ubuntu. I don't know if you can see this output, but beside many of those URLs are the three letters IGN, which means ignore: apt couldn't find anything there, the request failed, something like that, so it's ignoring those files. Many of these are critical files, so the package index wasn't working. Now, this package index was working on every other version of Ubuntu and every other version of Debian — we just hadn't gotten around to testing on the latest Ubuntu yet. So we were pretty confused.

After staring at our own code completely uselessly for a while, we pulled out strace. This is a different way of invoking strace from the one I showed in the last example: there, I was attaching to an existing process; here, you actually start the process underneath strace and get all of the output from that process. The output from this was really long — a couple of megabytes — so it was a lot to decipher. But using our trusty methodology, we work backwards to find a failure. This is apt writing to standard output what we saw — ignore — and then a local copy of packagecloud (a local VM IP address at port 3000; it's a Rails app), then trusty, which is the distribution, and then the name of the file that apt was failing to find.

So we started working back up from there, and we found this line. If you can't read it, it says read(6, and then a long string that says 400 URI Failure; then the URI, which is a signed S3 URL — we redirect to those in packagecloud; then a newline; and then Message: Bad header line. Okay, so it seems like S3 is returning a 400. That's kind of odd — wonder what's up there. Try to confirm the hypothesis: make a curl request to that same exact signed S3 URL, and get back a 200. Cool. So here we have a case where what apt is reporting seems to disagree with what's actually happening. I didn't have space for it in these slides, but we also ran tcpdump, which dumps out all the network traffic for that request, to confirm that apt was in fact receiving a 200 from S3. So what now?
Well, this is where things start to get a little bit more real, and you have to actually download the source for whatever you're trying to debug and figure out how that thing works and where the problem is. Now, this step is actually a lot harder than it might sound. Knowing where to find the source for the packages that are installed on your system — depending on what flavor of Linux or whatever operating system you're using — might be trickier than it sounds, and I've had a lot of late nights and early mornings caused by thinking I was debugging the version of the source code that was running on my computer when it turned out to be not even a remotely similar version. Distributions like Red Hat, for example, heavily patch source code, so the version number on the package can be completely different from the actual source code they compiled to build that package. If you check the OpenSSL version of the vendor package that gets installed from the Red Hat package repos, it will look vulnerable right now, but it's actually not, because they maintain their own set of patches. If you're on an apt-based distribution, apt-get source followed by the name of the package will unpack, into your home directory, the actual source that was used to create the exact package that's on your system. This is really, really useful, and it's worth knowing how to do it for whatever platform you deploy on.

Now, apt is a lot of lines of C++ — enough lines that it's not conceivable you could just wander into the source directory, even if you're the best C++ programmer in the world, and just find the bug. You need a starting point in that codebase, especially if you're unfamiliar with the codebase, and especially if you're unfamiliar with the language. The key here is to locate some kind of hook: some string or sequence of bytes that you think might be hard-coded into the source code. In my case, I was guessing that the error message Bad header line was probably somewhere in the apt source, and that proved to be true. It's contained in a bunch of these translation files that handle internationalization for apt — it's translated into a bunch of different languages — and then it's also in this C++ file.

So this is where it gets a little tricky, especially if you don't know C++. But the fact of the matter is, if you can read Ruby metaprogramming, I would venture a guess that with a small amount of effort you can understand a few C++ methods. If you stare at this for long enough — and it did take us a while — you'll see that it's looking for malformed headers, or what it believes are malformed headers. Taking one step back from there: when you first read this code, you're like, wait, apt implements its own HTTP client? That's kind of weird. Turns out it does. And it turns out that the way it processes headers is a lot stricter than most other HTTP clients — probably because a lot of other HTTP clients have been made more and more relaxed over the years, with the protocol being so popular and with lots of misbehaving HTTP servers in the wild — but apt's HTTP parser is just way stricter than everyone else's. HTTP headers are the name of the header, a colon, the value of the header, and then a newline — and what this code is looking for is a header that has no value: name, colon, and then newline.
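Just to make that rule concrete, here's a rough sketch of that kind of strictness. This is not apt's actual code — the file and function names are made up — just a simplified illustration, in plain C, of a parser that rejects a header whose value is empty:

```c
/* header_check.c -- a simplified, hypothetical sketch of the kind of
 * strictness described above: a header line has to look like "Name: value",
 * and "Name:" followed by nothing is treated as a bad header line.
 * This is an illustration, not apt's real parser.
 */
#include <stdio.h>
#include <string.h>

/* Returns 1 if the header line looks acceptable, 0 if it should be rejected. */
static int header_line_ok(const char *line) {
    const char *colon = strchr(line, ':');
    if (colon == NULL)
        return 0;                          /* no "Name:" part at all */

    const char *value = colon + 1;
    while (*value == ' ' || *value == '\t')
        value++;                           /* skip whitespace after the colon */

    return *value != '\0';                 /* empty value => bad header line */
}

int main(void) {
    printf("%d\n", header_line_ok("Content-Length: 1234")); /* prints 1 */
    printf("%d\n", header_line_ok("Content-Type:"));        /* prints 0 */
    return 0;
}
```

Most HTTP clients will shrug at a header with no value; a parser doing a check like this one refuses it, which matches the behavior we were seeing from apt.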
It turns out, if we look back at our curl output — maybe you can't see it there — there is an empty header: the Content-Type header. That's what was causing this thing to trip up and report this error. We fixed it by actually setting a content type on all of our S3 resources, which is probably a good idea anyway, obviously. Cool — so the bug is fixed.

So, rule two: locate the correct source code. Easier said than done. If you're deploying software on a platform, learn how to do that. It won't take that long to figure out, and knowing how to find the correct source code will save your ass, I guarantee it. The next rule: identify a hard-coded string, or some other hook, that you can use to find a starting point in the source code from which to work. This is really important, because — I remember one time I was debugging MySQL, which is, I don't know, two million lines of code or something like that — if you don't have a starting point, you're not finding anything, period.

If debugging is a funnel — if we think about it like a customer-acquisition funnel — I think step four is where most people drop out. They come to some code written in a language they're not familiar with; they're a Ruby programmer and they come upon some C or some C++. But the reality is that with enough effort you can learn this stuff, and it's really not that much effort. A lot of these codebases, especially if you follow a good methodology for reading through them, are not that hard to understand, particularly if you're just trying to find and understand some small defect in them. I would encourage everyone not to fall out of the funnel here — to dive into codebases and languages they don't know, because that's how you learn. A lot of people ask me where to start learning C, or where to start learning about systems programming, and the answer is that debugging is a really, really great way to get your feet wet in that stuff.

Hopefully you get to this step. If you get to this step, that's a good time to have a beer. Fix whatever's broken. So those are my steps for how to debug anything: forget everything you know; get a third-party opinion; locate the correct source code; identify a hook; stare at the code until it makes sense — maybe learn a new language in the process — and then fix whatever's broken. Questions?