This is A Practical Taxonomy of Bugs and How to Squash Them, and I just want to thank Katrina Owen for her excellent keynote this morning, because it's got everyone in a good head space for something that's very important to my talk too. Since this is a practical taxonomy of bugs, this presentation is kind of my personal field guide to debugging. When we talk about debugging, we talk a lot about debugging skills, primarily; that's kind of the angle we take it from. And debugging software is an extremely useful skill, right? It's a Swiss Army knife in any developer's tool belt. Really adept debuggers seem to be able to troubleshoot any issue, all the way down from databases that refuse connections, all the way up to "undefined is not a function," right? It's like they always know the exact path to take or the next question to ask. Maybe they don't always know the right answer, but they know the next place to look. And when we watch them debug, their deft actions make it really easy to believe that debugging comes from instinct or intuition. Having worked on a couple of different consulting projects when I first started programming, and now working at a big product company, I've heard this kind of phrase recently about debugging a new application. It's generally along these lines: once you get familiar with the application, once you get familiar with the code base, you'll start to build up some debugging instincts. Or, maybe even more upsetting: you'll build up some calluses around this. I like that one because it's disgusting; it's quite gross to think that the work would make me callous, and I like that. But I don't think it's actually true. Or sometimes people say, oh, I knew to look there when this happened again because I have scars from the last time. And then you hear specifics like this.
"Whenever I see something like this happening, the first thing I do is check the logs and see if this process is completing, or maybe it's sending a weird message on exit." So you can take that at face value, write it down, and memorize it: when X happens, check the logs. You know, develop your "debugging instincts," as I like to call them. Or you can disabuse yourself of the notion that instincts are magic and that you just build them up over time through witchcraft, or just proximity to other debugging witches. Instincts are mystifying, right? When you see someone debugging as if on instinct, you feel like you're watching MacGyver. Does this person look familiar? MacGyver? No? Okay. It might be because I am not very good at drawing, but MacGyver, if you haven't heard of that person before: this was a television show in America, and he was hired as a troubleshooter for the government. He would just go places, he'd have a ridiculous problem, and he would just fix it, because he was MacGyver. He just knew how to fix everything. But that person on your team who knows how to fix everything? They're not MacGyver, and you're not MacGyver, and you won't ever be. Sorry, it was a television show. Yeah, welcome to the real world; this talk is just about bringing people down to a bad, low place. Instincts let us make heroes of each other, and I think it's fine to praise heroic efforts. If someone does a really good job, we should praise them and let them know: hey, thank you, that was excellent. But when we turn someone into a hero, into a MacGyver, we make them the source of truth instead of our code. Instincts kind of let us pass off the hard work of communicating in favor of just relying on the heroes of our team: "that always breaks, and Kate's the only one who knows how to fix it, so you're just going to have to call her, because she knows how to fix it and I can't help you." And then finally, instincts don't scale.
You can't just spin up a new instance of your resident MacGyver whenever you have a problem. So let's look at the statement from before, maybe a little more generally. What I see when I look at that statement is: whenever I see X, I always check Y. It's just a conditional, right? And then, as someone... there's a vulgar misspelling in my presenter notes, I'm not going to tell you what it is, but it's very funny to me, so please know I'm having a nice time. As someone who really loves logic programming, who really likes conditionals and rule sets, in my experience these instincts or scars or gut feelings, when you boil them down to the base statement, are just internalized rule sets. And we can turn instincts and these rule sets into patterns by observing them. But to do this, we have to keep a couple of things in mind; our research methods, if you will. First, sometimes we have to contain a bug before we can squash it. That just means we have to stop the bad thing from happening before we can do a retrospective and really figure out what's going on and why it happened. And that's okay, because that's just how it is in the real world: we have to get the app back online and get things working, and we can figure out why when we have a little more time. Next, we can only work with facts. You can't say, "I think it's this," or "I feel like it might be that." You have to say, "what I see is this." And finally, due to timing, because I wanted this to be a 25-hour talk and I submitted a proposal for a 25-hour talk, we can't squash every bug in the entire world in this talk. We really just do not have time. And as much as I'd like to, I can't just hand you my Swiss Army knife, but I can show you how to build your own. So what's practical, I think, is to learn how to identify bugs by their behavior. Luckily for us, such a branch of science already exists.
Biologists use a form of taxonomy called phenetics to identify and group living organisms by their observable attributes. Bugs in the natural world can be identified by their observable attributes, and bugs in software can too. To facilitate this, though, I'm going to contrive some highly convenient scenarios so that I can focus just on identifying attributes. And I'm also going to focus on the kinds of bugs that we usually see in live production applications. So first, let's define the two major types of bugs. I call them the upsettingly observable and the wildly chaotic. Upsettingly observable bugs are those that make you smack yourself when you see them. You say, how could this have happened? Or, shouldn't unit tests have caught this? Or, how could this have happened again? So you've seen this type. They usually live in your code. They're often under-tested or untested. Sometimes they hang out in your server configuration too. Wildly chaotic bugs break everything, everywhere. You can't reproduce them locally, and frankly, you're terrified to reproduce them on production. Like, how can this port be open but simultaneously rejecting connections? It doesn't make sense. Sort of an up-is-down, black-is-white scenario there. So now we have an idea of what the two major types of bugs are, and we can actually talk about how to squash them. This is the good part of the talk. If you were snoozing, this is the good part now. We're in the good part. Yeah, so let's get down to business. Let's look at upsettingly observable bug number one, and here are its observable attributes. Is the bug observable on production? Can you reproduce the bug easily in your local environment? Does the bug seem to be restricted to one area, like a specific workflow, or maybe a specific data state? If you answered yes to most of these questions, you might have a Bohr bug on your hands.
Bohr bugs are named such because, like the Bohr model of the atom, they're simple and deterministic. When I say they're deterministic, I mean they always produce the same output for a given input, and they're characterized by their repeatability and their reliable manifestation. Bohr bugs are commonly found in code, but they can sometimes be hiding out in your server configuration too. Their favorite hiding spot, where they like to lurk, is in functions or classes or config files with complex branching logic. Somewhere you might see this in the wild, in your application, is validation. Validation can be a minefield for Bohr bugs. Validator classes are great classes to unit test because of their inherent complexity. This also means it's really easy to heavily unit test this functionality using test fixtures without really considering all the things that are required to get data into that specific state. If the consequences of a failing validation aren't planned for, then the failures can't be rescued and can't fail in a noticeable way. And validations that fail silently will exit silently. The lack of errors will lead you to believe that there's nothing wrong with your flow, and your unit tests will lead you to believe there's nothing wrong with your validation. Until, that is, someone tries to send an email or maybe make a purchase, and the UI tells them that they're in a good state, but the action that they wished to trigger was never triggered. The Bohr bug is the friendliest of bugs to catch. I start with this one because we're going to use it as a model for everything else. Replicate the Bohr bug locally in a test; because it has the same output for the same given input, it's pretty easy to test. Write the simplest possible solution, and then rewrite the code to be highly readable. Readable code is your best defense against bugs sneaking back in.
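The silently failing validation described above can be sketched in a few lines of Ruby. This is a minimal, hypothetical example (the class and method names are mine, not from the talk): the validator itself is deterministic and easy to unit test, but the workflow swallows the failure, so the UI reports success.

```ruby
# Hypothetical sketch of a Bohr bug in validation: deterministic,
# trivially unit-testable in isolation, and silent in the full flow
# because the failure is swallowed by a rescue.
class PurchaseValidator
  def valid?(purchase)
    purchase[:amount].positive? && !purchase[:email].to_s.empty?
  end
end

def submit_purchase(purchase, validator = PurchaseValidator.new)
  raise ArgumentError, "invalid purchase" unless validator.valid?(purchase)
  :purchased
rescue ArgumentError
  # Swallowed: the flow "completes" and the UI reports a good state,
  # but the purchase action never fired.
  :ok
end

submit_purchase(amount: 10, email: "a@example.com")  # => :purchased
submit_purchase(amount: 10, email: "")               # => :ok, silently wrong
```

Because the same input always produces the same (wrong) output, this is the easy case: reproduce it in a test, then make the rescue either re-raise or log loudly.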
It means that the next time someone else has to make a change to this class or function or config or whatever, and they will, don't kid yourselves, they'll know exactly what to do and where to change things. So here's the Bohr bug again. And there's a little bit of audience involvement here. This is a practical taxonomy of bugs and how to squash them, so I'm going to squash it, but feel free to react however makes you feel good. Any kind of squashing; you could just yell "squash." So here we go. Excellent. Y'all are going to like the rest of this talk, because we've got a lot more of that coming. All right, so let's take a look at upsettingly observable bug number two. Here are the observable attributes. How does this work? Wait, does this work? Wait, what is this even testing? Did this ever work? If you answered yes, or huh, or what to all of these, you may have found a Schrödinbug. A Schrödinbug is a bug that, like Schrödinger's infamous thought experiment, we cannot confirm the validity of without observing it directly. So this bug kind of looks like a stick, but if we really stare at it, we can see it's actually a bug. It likes to pretend to be working code, right? But on close inspection of the code itself, it reveals itself to be a bug. It manifests itself after you read the code. And once you look, you notice that maybe it never should have worked in the first place. Fun. There are two main types of Schrödinbugs: those that never worked, and those that don't work how you thought they did. Schrödinbugs that never worked often reveal themselves because of side effects. A common hiding spot for the Schrödinbug is in return values. When you complete a function or save something and then end the function with a return value, you obscure the result of the function. So a user reports that the UI shows them their template or whatever has been updated.
But then when they check again later, the changes aren't there. And this could have been hiding in your code for years. Isn't that awesome? The second type is similar, but a little different. It appears to work; you can see that the data reaches the correct state eventually. This manifests itself in code in a lot of interesting ways, but a common place to find this bug hiding is in call counts. Maybe you look at a piece of code to add something new, and as you're stepping through, you realize that a function is being called several times, over and over. Maybe the first time the function is called, the data is getting set up and put in the right state, and finally, on the second or third or fourth time the function is called, the data is in place and now it's actually saved. This means that your code is running the exact same path several times before executing correctly. When you have a bug that seems to be doing the right thing but doesn't, logs are your best friend. Add logging on save and update failures to indicate why the value isn't saved. You can also add logging at the various stages of data manipulation to see what value your data has and why. When you have a Schrödinbug that seemed to work at some point previously but doesn't work now, you can use git bisect to binary search between your current bad state and some earlier good state to find out when the code worked. And you're probably thinking: I don't care about that metadata, I just need to get this fixed. This is actually extremely important. Doing this will prevent you from introducing, or reintroducing, another bug back into your code. So if you thought it worked at some point, a binary search to find out what made it stop working is extremely valuable. So now, reproduction and resolution. Reproduce this broken state locally and in test. Add log statements until you can safely confirm what causes the broken state.
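The return-value hiding spot described above can be sketched like this. Everything here is hypothetical (a hand-rolled `Template` class, not a real ORM): the point is that `save` reports failure through its return value, and the caller throws that report away.

```ruby
# Hypothetical sketch of a Schrödinbug hiding in a return value:
# `save` returns false on failure, but the caller returns the record
# itself, so failures look exactly like successes.
class Template
  def initialize(name)
    @name = name
    @saved = false
  end

  def save
    return false if @name.to_s.empty?  # silent validation failure
    @saved = true
  end

  def saved?
    @saved
  end
end

def update_template(template)
  template.save   # return value discarded; the result is obscured
  template        # caller always gets a template back, so it "worked"
end

update_template(Template.new("Welcome email")).saved?  # => true
update_template(Template.new("")).saved?               # => false, yet the UI said "updated"
```

Logging inside the failure branch of `save`, as suggested above, is exactly what turns this from "worked for years" into something you can see.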
If the bug did work at some point, find out at which point it worked, and make sure you don't reintroduce some bugs from that state too. Then write tests to represent the configuration or flow of the fixed state. Are you ready? Those two were fun softballs. You're probably like, I don't need this talk; let's get to the fun stuff. Wildly chaotic bug number one. Here are the observable attributes. Does it appear non-deterministic? That means maybe it doesn't appear on every single one of your servers, or it doesn't appear on every transaction, or it doesn't appear every time the page is loaded. Does it seem to disappear once you observe or debug it? If you answered yes to those questions, you might be dealing with a Heisenbug. A Heisenbug is characterized by its seeming inability to be reproduced. Once you try to observe it through recreation, you may not be able to find it. There are two main types of Heisenbug that I think you should look out for. First, there's the Heisenbug that lives in code. This often occurs because debugging tools, like print statements, can actually modify the code, and they sometimes change the timing of the execution. This can be the result of a statement or a function in your code that's lazily evaluated, so it's only evaluated when it's called. Adding a print statement forces evaluation, and it makes it look like the statement was called when maybe it was only called by the print. Or, if it changes the timing, maybe now your processes are lined up, aligned to the stars perfectly, so that it actually does execute; but that's not the reality of what's happening in your application. Another common Heisenbug to look out for is one that lives in data. This type of bug seems hard to reproduce because you can't, and you really shouldn't, and I know you do, but please don't, download users' data onto your machine to recreate the steps.
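The lazy-evaluation case above can be sketched with a hand-rolled thunk. `LazyValue` is a hypothetical class, not from any library: its side effect only fires when the value is forced, so a "harmless" debug print changes the program's behavior, which is exactly the Heisenbug pattern.

```ruby
# Hypothetical sketch of a Heisenbug born from lazy evaluation.
# The block passed to LazyValue only runs when #value is called,
# so printing the value for debugging forces evaluation.
class LazyValue
  def initialize(&block)
    @block = block
    @forced = false
  end

  def value
    unless @forced
      @result = @block.call
      @forced = true
    end
    @result
  end

  def forced?
    @forced
  end
end

counter = 0
lazy = LazyValue.new { counter += 1; :done }

lazy.forced?     # => false; nothing has run yet
puts lazy.value  # the debug print itself forces evaluation
lazy.forced?     # => true; observing the program changed it
```

With the print statement removed, `counter` never increments, so the bug only "exists" while you are looking at it.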
So you probably can't tell what's causing the problem, because you can only see the bug on production, in that data set, which is either of a huge size or in an odd, specific configuration. Since you can't replicate these bugs on your computer, you have to rely on some tooling. I recommend profiling. Profiling measures the memory, complexity, time, and callers of a program as it's being profiled. I like to use Vdebug for Vim, which outputs these cachegrind files, and then you can use them with a tool called QCacheGrind or KCacheGrind. I don't know why it has two names, but it does, so you can find it under either of those names. It puts out these excellent files, which you can see at the top: they indicate what file is called, what function is being called, and then the line numbers and the length of time it takes for these calls to complete. This is obviously very easy to read in a high-stress situation; I could just read this for hours. You can read it this way, or you can output it to a flame graph using KCacheGrind or QCacheGrind. This indicates the initial caller, the callers that it calls, and all those callers up the stack, and then the length of time that each of those calls takes to execute. For Heisenbugs in code, this lets us observe what's being called and by whom without affecting the code with print statements or anything else that would adjust the output of evaluations or how long things take. For Heisenbugs in data, profiling will reveal how much time we're spending trying to return maybe some huge data set, or on some time-expensive calculation. Reproduction and resolution: use profiling to find the trigger state. Use the app, not fixtures or database manipulation, to try to get the data into this state. If possible, recreate that state in a test. And now that you can make this bug deterministic, now that you know for what input you receive a certain output, follow the Bohr bug instructions.
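The talk's tooling (Vdebug emitting cachegrind files, viewed in KCacheGrind or QCacheGrind) is editor- and runtime-specific. As a lightweight stand-in for the same idea, finding out where the time actually goes, here is a sketch using Ruby's stdlib Benchmark; both workload methods are invented for illustration.

```ruby
require 'benchmark'

# Two hypothetical workloads: one cheap, one deliberately expensive.
# Measuring them separately shows which caller is eating the time,
# without sprinkling print statements through the code under test.
def cheap_lookup
  (1..1_000).sum
end

def expensive_calculation
  (1..200_000).inject(0.0) { |acc, n| acc + Math.sqrt(n) }
end

cheap     = Benchmark.measure { cheap_lookup }.real
expensive = Benchmark.measure { expensive_calculation }.real

puts format("cheap_lookup:          %.5fs", cheap)
puts format("expensive_calculation: %.5fs", expensive)
```

A real profiler additionally attributes time to the whole call stack, which is what the flame-graph view above gives you; this sketch only shows the measuring half of the idea.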
All right, wildly chaotic bug number two. Here are the observable attributes. Is everything broken? Literally all of it? Please send help? If you answered yes to all of the above, you may have a Mandelbug on your hands. The Mandelbug is named after its resemblance to the Mandelbrot set, a mathematical set that's seemingly random when plotted or graphed. The set itself is a collection of complex numbers, but those numbers, and those numbers alone, are the single factor in creating the highly convoluted and sometimes beautiful, depending on the artist, fractal. So keep that in mind: there's usually a single factor. The key attribute of the Mandelbug is that it seems like everything is unrelatedly broken at once, and as a result, you're probably under very high pressure to get it fixed right now. I have good news and bad news. The good news is that I really do think it's a single point of failure, and it's not an issue with your code. You're probably thinking: excellent, I'm off the hook. But you're not; it's actually something in your system that you're responsible for. And you're thinking: I didn't sign up for this, I'm a web developer, not an ops wizard who can fix this. MacGyver's on vacation, but they said you could call if it's an emergency. I think that you can fix this, so let's talk about what's really happening. Maybe the bug is huge and it's everywhere, all at once: SQL can't connect, jobs won't run, emails won't send, every submit button on the site is fatal-erroring. That's okay, because you have a toolkit. Let's just start with the simplest thing: let's see if we can access your servers. Okay, yes, perfect. I have an idea. Jobs aren't running, emails aren't sending, submits catch fire, and nothing saves. Remember when we set up all that logging? Let's check the logs. When was the last log entry? Oh, around the same time that the bugs started showing up. That's not very helpful, Kylie. Actually, it is: something happened around the same time that all your bugs started popping up, such that your
logs are no longer writing. I have one more idea, one more tool that I think will help you, and it's disk usage. I see, maybe I'm hearing, that this is something others have experienced; this is frighteningly common. So let's check the disk usage using df -h. The -h flag is supposed to give us human-readable output, for some definition of human and some definition of readable. Let's just see if any one process, or any one write, is using up all of our storage, and see who they might be. df -h will show us the used and free disk space. So the logs stopped writing, everything stopped writing, all at once; that's a pretty clear flag to me. Hey, we can't save things. And why? Maybe it's because we're out of space to save things. So now, hopefully, df -h and this contrived solution have revealed to us that we wrote everything to the logs, and now we have no disk storage left. Now we can do reproduction and resolution on this. Let's attempt to connect to the server and view the logs again. If the logs are completely full, maybe everything else is as well, so use the disk usage to figure that out. And then, unfortunately, if you do experience downtime, that's part of life; it's okay. Can the offending process be restarted, rotated, or killed at this time? In this contrived example, yes it can, so please go ahead. Thank you. This is a practical taxonomy of bugs, from the upsettingly observable Bohr bug and Schrödinbug to the wildly chaotic Heisenbug and Mandelbug. And I have really, really good news for you: you don't need a jerk MacGyver who just fixes everything because they just know everything. You don't need debugging instincts, and you don't need MacGyver. You have a toolkit, a Swiss Army knife. You know how to observe and classify. Proactively observe and classify with structured logging and monitoring. Keep your tools sharp by sharing this knowledge: you should have an on-call list, and you should have a playbook that documents this information. These are some resources
that I think are excellent about debugging, and about debugging at a higher level, on a systems level, rather than just "here's Pry, y'all." So I recommend you read these. Some of them are a bit dry; Julia Evans is an excellent resource. Again, my name is Kylie Stradley, and I'm here from Atlanta, Georgia. If they don't hook me off the stage, I want to tell you a little bit about my company, MailChimp. A lot of people, if you've heard of MailChimp, think it's a hip internet company and that our app is written in Ruby. It's actually not: MailChimp is actually two PHPs stacked on top of each other wearing a trench coat. That being said, we write highly object-oriented code, we commit only with tests, and we have a huge, diverse development team. So if you would like to hear more about that, or you just want some of the swag that I brought with me, you can find me at lunch and I will give you some. Thank you.