I'm Paul Gross, I work at Braintree Payments, and I'm here to talk about bugs. Bugs are a serious problem in software. Hopefully this comes as no surprise. We all deal with bugs every day, and we read about them online, everything from Y2K to showing the wrong text on a web page. They're a constant part of software development. So I'm going to start by going through some of the notable bugs I remember from my own lifetime.
One of the worst is the Mars Climate Orbiter, in 1999. A computation produced values in the wrong units, and the orbiter basically approached Mars on the wrong trajectory and burned up. This is one of the worst bugs, right? Probably years of work and millions of dollars just gone. The orbiter was destroyed. So those are the stakes. Most of us here probably work more on the web development side, but the stakes for software are very high. You can destroy a satellite.
The next one is a little more lighthearted. I don't know if people remember this, but it was called the Corrupted Blood plague. In 2005, a bug essentially caused a pandemic in World of Warcraft. There was a spell that was very deadly but was only supposed to affect people within a certain area. A bug allowed pets and minions to carry the disease out of that area, and then it spread through the world the same way a real virus would in the real world. It caused a pandemic. It swept through and killed tons of people, virtual people, and it was pretty devastating for the people who played World of Warcraft. This one actually had a bit of a silver lining, though. Because it was a virtual world with tons of data, researchers were able to study the path of the disease through the world and learn a lot about how viruses travel and spread. So it's kind of a fun upside to a bad bug.
Okay, and then this one was just a few years ago. Knight Capital had a bug in their trading platform, and I believe within 45 minutes they lost $440 million. There was a bad deployment and some deprecated code that wasn't supposed to be there, and I guess the two interacted in a weird way. This bug just lost money as fast as possible. So these are serious issues, right? You put a bad bug out and you can cause a company to collapse. Bugs are a problem.
So what's our strategy for bugs today? We all have them, and we all fight against them. Well, the primary answer is testing. I think tests are great. I hope we can all agree that we should be testing, regardless of philosophy or strategy, whether you do TDD, test first, or test last. We can debate the merits of the various philosophies, but hopefully we can all agree that you should be doing some kind of testing. There are lots of talks here about other kinds of testing, so I don't want to diminish testing. That's not the goal of my talk.
But I believe that testing is not enough. Well-tested software still has bugs. We haven't been able to get to bug-free software. Those three examples I showed at the beginning were probably extremely well tested. They're large, complicated systems, and they probably had tons of tests, but they still had these very serious bugs.
So here are a few quotes I like that really drive home the point for me. "If debugging is the process of removing bugs, then programming must be the process of putting them in." As soon as we open our editor and start writing code, we're adding bugs. That's the unavoidable truth of software. And then: "Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence." And I think this is really the point, right?
Testing is great, but if you don't write the test, then you don't catch the bug. And sometimes you don't know which tests you're missing. It's one of those "you don't know what you don't know" situations. You should write tests for all the things you do know about, but there are always going to be cases you don't know about. Components can interact in ways you don't expect, or you can receive input you never even thought about. So there's no way, that I know of at least, to be totally comprehensive. There are always going to be bugs slipping in at the corners.
So then the question is, if you can't remove bugs through testing, what can you do? What are the other strategies in our toolbox? I believe one of the main strategies is mitigation. We can sometimes detect when a bug has happened or when something is wrong, and we can fail fast. Rather than continuing to operate under bad assumptions or bad data, hopefully we can fail right away. And I think we can also structure our systems to reduce the severity of bugs. A life jacket doesn't prevent you from falling in the water. It just mitigates the severity of it. Once you fall in the water it will help save your life, but it doesn't keep you on dry land. I think we can take some of those practices over to software and structure our systems to be a little bit safer.
Okay, so let's start with a real-world example. This is not software. I don't know if anyone here has ever had surgery, but one of the things they sometimes do beforehand is the doctors will actually come in and initial your body. They will take a Sharpie and write their initials on the part of your body they're about to operate on. And this is essentially mitigation, right? There could be a bug in the chart software, or someone could simply have written the wrong thing down.
But when they have that scalpel in their hand and they go to cut you open, if they don't see their initials on the part of the body they're about to cut, they know something has gone wrong and they can fail fast. It's almost silly, but it's actually really powerful. There are tons of horrible stories about errors in people's charts leading surgeons to amputate the wrong arm or operate on the wrong eye. This is a very simple way for them to mitigate those issues.
So, next example: a hospital management system. At one point in my career I worked on a hospital management system, which is the software that essentially runs the whole patient side of a hospital. It manages the patient records, the charts, the medical results, all that kind of stuff. One part in particular that we worked on was called clinical notes. The idea is that a doctor can add notes to a patient's chart, and later on, when they're talking to the patient or pulling up the chart, they can see all these notes. So it's a very simple corner of the application.
The app wasn't written in Ruby or Rails, but let's say it was, and the clinical notes might have a controller like this that has a show action. It's super simple. Based on the ID we get, we look up the note and then display it in the view. For the view, just imagine some kind of typical Rails app. This is all fine and good, but it has a problem: we don't know that the ID we're pulling up is actually for the right patient. So maybe you modify the code to look like this. Now we look the note up by the patient ID and the ID of the note. And this works. The bug is removed and the code goes on, but code evolves over time.
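The slides with the controller code are not part of this transcript. As a rough illustration of the two lookups just described, here is a minimal sketch in plain Ruby rather than Rails. The names `ClinicalNote`, `NOTES`, `find_note`, and `find_note_for_patient` are made up for the example, and an in-memory array stands in for the database table.

```ruby
# Hypothetical in-memory stand-in for the clinical notes table.
ClinicalNote = Struct.new(:id, :patient_id, :body)

NOTES = [
  ClinicalNote.new(1, 100, "Allergic to penicillin"),
  ClinicalNote.new(2, 200, "Follow up in two weeks")
].freeze

# First version: looks the note up by ID alone, so any patient's note can come back.
def find_note(id)
  NOTES.find { |note| note.id == id }
end

# Second version: the patient ID is part of the lookup, so a note that belongs
# to a different patient is simply not found.
def find_note_for_patient(patient_id, id)
  NOTES.find { |note| note.patient_id == patient_id && note.id == id }
end
```

With this data, `find_note(2)` hands back patient 200's note no matter whose chart is open, while `find_note_for_patient(100, 2)` finds nothing because note 2 does not belong to patient 100.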
And what starts as a very simple lookup might evolve over time into something more like this. Let's say finding the notes becomes really complicated. It could be a series of really complicated queries. It could be actually looking at other systems. Maybe our data is stored in some kind of note service or patient record service, something that's outside the scope of our app entirely. So now you have this issue where the code looks fine, we're passing in the patient ID, but we have no guarantee that it's used correctly somewhere along the way. What if we accidentally ignore the second param, or, even worse, it gets truncated or changed in some way?
When I was working on this app, there was this idea they called hazard mitigation. The idea is that you add extra code that shouldn't be necessary but that mitigates the hazards of bugs. The really simple example would look like this: right before you return that note to the view, you check the patient one more time. If there's a bug in that clinical note service, this doesn't help you find the bug or catch the bug. It's not a test. What it does is mitigate the severity of the bug. It's better to just blow up and not show the doctor anything about that patient than to show the wrong information. The wrong information is worse than nothing.
When we built this system, we structured things like this all over the place. At the last possible moment, hopefully after all the other logic is done, let's just check a few things that we know are really important. So when we think about how to write those kinds of hazard mitigations, or hacks, we can look at the strategy we used.
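The slide showing the last-moment patient check is not captured here either. A minimal sketch of that kind of hazard mitigation, in plain Ruby, might look like the following. `BuggyNoteService`, `PatientMismatchError`, and `show_note` are hypothetical names, and the service deliberately ignores the patient ID to stand in for a bug somewhere in a complicated lookup.

```ruby
ClinicalNote = Struct.new(:id, :patient_id, :body)

# Raised when the invariant "only show this patient's data" is violated.
class PatientMismatchError < StandardError; end

# Hypothetical stand-in for a complicated lookup (several queries, a remote
# service, and so on). It has a deliberate bug: it ignores patient_id.
class BuggyNoteService
  NOTES = { 7 => ClinicalNote.new(7, 200, "Follow up in two weeks") }.freeze

  def find(_patient_id, note_id)
    NOTES.fetch(note_id)
  end
end

# The "show" step: after all other logic has run, check the patient one more time.
def show_note(service, patient_id, note_id)
  note = service.find(patient_id, note_id)
  unless note.patient_id == patient_id
    raise PatientMismatchError,
          "note #{note.id} belongs to patient #{note.patient_id}, expected #{patient_id}"
  end
  note
end
```

The check does nothing to locate the bug inside the service. It only guarantees that a note for the wrong patient never reaches the view, which is the trade the talk describes: an error page instead of wrong information.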
So the first piece of that strategy is that you want to figure out the invariants of the system. These are the things that never change. In the hospital management system, one of the invariants is that you only deal with one patient at a time. You never show a different patient's data when you're showing something about a patient. Once you know that invariant, you can add runtime checks for it. That's like the one I showed: right at the last possible moment, we check that the patient we have is the one we expect. And then the next thing is, once you know that something has gone wrong, you want to fail fast, and then you want to alert so that someone can actually go fix the bug. The mitigation code doesn't fix anything. It just blows up. So you want to make sure you alert someone, so that you can fix the real bug and get the software working again. That's the first set of strategies we would think about.
So, next example. I work for Braintree. We are a payment gateway. The idea is that if you want to accept payments online or in a mobile application, you integrate with us. You send us your credit cards, your payment info. We vault them, we return tokens to you, and then you charge against those tokens. We are a multi-tenant system as well. We have what we call merchants. Lots of different companies, or merchants, integrate with us, and we keep all of their data separate, in their own logical vaults. Similar to the clinical note, we have the idea of a customer. You vault a customer with us, we give it back to you, and then we can charge that customer. So again, it would look like this. This is obviously heavily simplified slide code, but when you want to show the customer in the control panel or in the API, we would look it up by the token, right?
But this again has the same problem. We're not sure that the token is for the right merchant. So in reality, we would write code like this. Instead of finding the customer directly, we look up the merchant first, and then from there we go through the merchant's customers to find the one we want. And again, this is right. This code is, at least in this sense, bug free. It does the right thing. But we have this case where we know what the wrong code is and we know what the right code is. So how do we prevent people from making the wrong decision? Everyone makes mistakes. You might accidentally type out the wrong version, because it is so common to look things up like that in Ruby and Rails. So, knowing the right version and the wrong version, what can we do to prevent these mistakes?
At Braintree, we actually added something we call the scoped find hook. We don't let you look up objects unscoped. If you were to look up a customer directly like that, it would blow up and say, "finds must be scoped." If you wanted to actually look up the customer, you'd have to do something like this: look up the merchant first, and then chain the lookup from there. This is a hack we added to our code base that works across almost all of our associations and basically prevents you from doing the wrong thing accidentally. Obviously we have tests for these code paths, but if there was a code path that wasn't properly tested, or an edge case that didn't have a test, at least this code would blow up in production. In production, you would get that runtime error, which would prevent the action from actually returning data. So it's certainly not ideal to catch things at runtime, but it's a last resort, right?
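The talk does not show how the scoped find hook is implemented, and the real one hooks into ActiveRecord. Purely to illustrate the behaviour described, here is a minimal plain-Ruby sketch with an in-memory collection. `CustomerStore`, `ScopedCustomers`, `for_merchant`, and `UnscopedFindError` are invented names, not Braintree's API.

```ruby
class UnscopedFindError < StandardError; end

Customer = Struct.new(:token, :merchant_id)

# Hypothetical in-memory "table" of customers.
class CustomerStore
  def initialize(customers)
    @customers = customers
  end

  # Direct lookups are refused outright.
  def find_by_token(_token)
    raise UnscopedFindError, "finds must be scoped"
  end

  # Lookups have to go through a merchant, which narrows the candidates first.
  def for_merchant(merchant_id)
    ScopedCustomers.new(@customers.select { |c| c.merchant_id == merchant_id })
  end
end

# The only place where a token lookup is allowed to succeed.
class ScopedCustomers
  def initialize(customers)
    @customers = customers
  end

  def find_by_token(token)
    @customers.find { |c| c.token == token }
  end
end

store = CustomerStore.new([Customer.new("abc", 1), Customer.new("xyz", 2)])
own   = store.for_merchant(1).find_by_token("abc") # merchant 1's own customer
other = store.for_merchant(1).find_by_token("xyz") # nil: belongs to merchant 2
```

Calling `store.find_by_token("abc")` raises `UnscopedFindError`, so the wrong version cannot quietly return another merchant's data even on a code path that no test covers.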
We try to catch everything we can in tests, but if we miss something, at least we have the runtime safety. Like I said, this works pretty much everywhere we do lookups, but there are cases where you need to look things up directly, so we do allow you to bypass it. Along with the scoped find hook, we have something we call allow unscoped find. If you need to look up a customer directly, you add this and then you can get past the hook. It serves two purposes. One, it allows you to bypass the hook for the cases where you really do need to do something different. But it also adds a bit of a visual indicator to the code base. When a developer is reading the code and sees this allow unscoped find, they know that something is different about this code. This code does not go through the merchant path. Maybe there's a good reason, and someone should think about why, or maybe there's not a good reason and it should be removed. In any case, it's a note to the developer that this code is special and we should look at it a little more carefully.
Okay, so the scoped find hook works pretty well. It forces us to look things up through our associations. But what about code like this? Our gateway is an evolving Rails app, and we have a lot of data. There are cases where we need to look things up with SQL for performance, or even in other systems. There are other ways to look things up. So what happens if you look something up directly with SQL? How can you still verify that safety? Well, we have a nice advantage: we've scoped our URLs by merchant. Our URLs generally look like merchant, slash, the merchant ID, and then the thing you're doing. So in this case, customers, slash, the token.
So what this means is that for a given request, we know what merchant you're supposed to be operating on for the entire scope of that request. So we can codify that. In our application controller, we added an around filter that looks the merchant up by that ID and stores it in this merchant consistency check. Essentially we're storing it globally for the scope of that request. So now, around every action in our controllers, we know what merchant we're supposed to be using, and if you go look something up for the wrong one, we can actually blow up. At the time when ActiveRecord loads objects, we walk the object trees and verify that every object belongs to the merchant we've stored globally. It's another way we can add safety to the code. At runtime we can do these checks to guard against bugs like this. It doesn't take away the tests, but it adds another layer of safety on top of our testing. Okay, so that's that example.
Looking forward, let's think about other strategies you want to use for these kinds of hacks to help with bug prevention. The next big piece of strategy is that we want to focus on the validity of the data. Data is extremely difficult to fix once it becomes corrupted. In our case you'd have to write scripts, go through your database, and figure out the extent of what went wrong. And then it gets even worse. If you have an API and you return data to the merchant or to the customer, it's really hard to even fix that data. The data is not even in your system anymore. If we returned, let's say, the wrong token, for a customer that belonged to a different merchant, we would have to reach out to our merchants and get them to fix their databases. So data corruption is extremely difficult to recover from. But code is much easier to fix, right?
We can change code, we can redeploy. Code is under our control, and data isn't always under our control. So generally, when we think about these hacks, we want to focus on the data even more than the code.
So, an example: recurring billing. Braintree has a recurring billing system. At its simplest, the idea is that you tell us, "I've got a customer in your vault, and I want to charge that customer $10 every month until I say stop." Every month we charge it, and everything works well. Recurring billing, although seemingly simple, is actually really complicated, and there are tons of edge cases. For example, if a subscription can't bill that month, say because the credit card is no longer valid or it's maxed out, we mark the subscription as past due. Let's say that happens for a few months in a row. Now you've got subscriptions in a past due state, and the balance is just growing on them. They're not billing every month because they're past due. Then let's say the merchant wants to either wipe the balance away and say, "I'm going to forgive this amount," or they update the credit card and want to charge it again. Then we have to do this kind of crazy fast-forwarding thing where we need to update the balance, the next billing date, the billing period, and all these other fields that hold the state of the subscription. Because it's complicated, there are edge cases, and bugs creep in. We've certainly had our fair share of bugs in this area.
Since we know the code is really complicated, and we already have tests for the code, we try to focus on the validity of the data. What we did is add a bunch of data consistency checks to our subscriptions. You can imagine this running in a before save.
So we say something like, if the start date is after the end date, we know something went wrong, so we should raise. Similarly, if the correct next billing date does not equal the next billing date. We actually compute the next billing date again, in a different way, using different logic, and compare it to the one that was computed in the normal code path. If these don't match, something went wrong. And we have tons of these. We have a whole set of these consistency checks. The nice thing is that, because we run them in a before save, if a subscription gets into a weird state, we can blow up, we can alert, and then we don't save the subscription. So we never actually save the corrupted data. We basically roll the whole thing back to its previous state. We get these alerts, we look at what happened, we trace the code, and we try to figure out the bug. Once we fix the bug, we can redeploy and then rerun recurring billing, either later that day or the following day. So it's a way to prevent the data from becoming corrupted. We blow up at the point of save so that we can preserve the data. And that's, I think, a powerful pattern for us.
Okay, so the next set of strategies we like to think about. One of the big things is that you want to write your checks outside of the normal flow of the code. You don't want the same bugs in the code and in the checking code. That would be really bad, because then you're not providing any extra safety. Looking at that picture as an analogy, you want to reinforce the fence not by making the vertical pillars stronger, but by adding something perpendicular. You want something different that adds some kind of safety that's not on the normal code path.
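The subscription checks from a moment ago are an instance of this idea, since the next billing date is recomputed by a separate route. The real checks run in an ActiveRecord before-save callback that the transcript does not show. The sketch below uses plain Ruby, and its `Subscription` fields, the monthly billing rule, and `check_consistency!` are all assumptions made for illustration.

```ruby
require "date"

class InconsistentSubscriptionError < StandardError; end

# Hypothetical subset of the fields that hold a subscription's state.
Subscription = Struct.new(:start_date, :end_date, :periods_billed, :next_billing_date)

# Independent recomputation: derive the date from the start date and the number
# of monthly periods billed, rather than trusting what the billing code stored.
def expected_next_billing_date(subscription)
  subscription.start_date >> subscription.periods_billed
end

# Meant to run just before saving; raising here keeps the bad record from being written.
def check_consistency!(subscription)
  if subscription.end_date && subscription.start_date > subscription.end_date
    raise InconsistentSubscriptionError, "start date is after end date"
  end

  expected = expected_next_billing_date(subscription)
  unless subscription.next_billing_date == expected
    raise InconsistentSubscriptionError,
          "next billing date #{subscription.next_billing_date} should be #{expected}"
  end

  subscription
end
```

A subscription that started on 15 January and has billed twice passes only if its stored next billing date is 15 March. If the fast-forwarding code produced anything else, the check raises and the save is abandoned.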
And then, because it's not in the normal code path, you want to keep it really simple. The more complicated it is, the more chance that it will also have bugs. So if you're writing something outside of the normal flow, you want it to be very simple.
An example of that for us is settlement. You run credit card transactions through our system all day long, and then at the end of the day, at night, we run these complicated settlement processes, and this is what actually moves the money. The logic is very complicated. It depends on what country you're in, what the card types are, and all these other factors. So all the American Express cards would settle to Amex at one time of the night. All of the credit cards for a different processor in a different country settle at a different time. So we have all these hourly jobs that say, is it time to settle these things? Okay, go do them. Is it time to settle these other things? Okay, go do them. So again, it's a very complicated system, and bugs creep in.
But if we think about what our invariants are and try to keep the checks really simple, we can write something pretty obvious. In this case, we have a Nagios check. Nagios is just a system that runs checks periodically and alerts out of band. In our case, we count the number of transactions that were submitted for settlement, so they should have settled, and are more than a day old. Our invariant is that everything should settle within a day. If that is not true, this Nagios check will fail and alert someone. It doesn't have any of the complicated code around what should settle when or how it does it. It just knows that everything should settle within a day, and if it doesn't, something has gone wrong.
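The check itself is not shown in the talk. As a minimal sketch of how small it can be, the plain-Ruby version below counts overdue transactions in an in-memory list and returns an exit code and status line in the usual Nagios plugin convention (0 for OK, 2 for CRITICAL). The `Transaction` struct, the `:submitted_for_settlement` status name, and both method names are assumptions for the example.

```ruby
# Hypothetical subset of a transaction record.
Transaction = Struct.new(:id, :status, :submitted_at)

ONE_DAY = 24 * 60 * 60 # seconds

# Counts transactions that were submitted for settlement more than a day ago
# and still have not settled. It knows nothing about card types, countries,
# or settlement schedules.
def overdue_settlements(transactions, now)
  transactions.count do |t|
    t.status == :submitted_for_settlement && now - t.submitted_at > ONE_DAY
  end
end

# Returns [exit_code, message] in the Nagios plugin convention (0 = OK, 2 = CRITICAL).
def settlement_check(transactions, now = Time.now)
  overdue = overdue_settlements(transactions, now)
  return [0, "OK: all submitted transactions settled within a day"] if overdue.zero?

  [2, "CRITICAL: #{overdue} transaction(s) unsettled for more than a day"]
end
```

A wrapper script would print the message and exit with the code. The point of the design is that the check encodes only the invariant, so a bug in any one of the hourly settlement jobs shows up without the check having to understand that job.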
And then we can go look into it, figure out which of those crazy processes didn't work correctly, and fix it. So that's just an example of how your check code can actually be a lot simpler than your real code. And that's the goal.
Okay, so those are all examples of hacks, or checks, at runtime. We also try to add hacks at development time. The earlier we can catch an issue, the better. We don't want to catch issues in production if we don't have to. So we have lots of little hacks at development time as well.
One example we call safe migrations. Like I said, we're a Rails app, and most of our data is stored in Postgres. So we might have a migration like this. We use ActiveRecord migrations. Let's say we want to start searching for credit card transactions based on the last name. This seems simple, but it's actually got a huge problem. Postgres, like most databases, takes a lock on a table when you add an index, and it prevents writes to that table for the duration of the index build. On a large table like ours, and we actually have many databases but they're all pretty large, this can lock for, let's say, minutes. That means that if we were to actually run this code, it would cause a partial outage. You wouldn't be able to create any new transactions for minutes.
We know that this is a problem, and we don't want developers to write this. So what we did is add a hack at development time. If you were to write this and run it, that is, migrate your local database, it would blow up. It would say, "add index is not safe. Use safe add concurrent index instead." Postgres has a way to build indexes concurrently that doesn't lock the table against writes. It's called a concurrent index, and we just want to make sure we use it. So we basically changed add index to blow up, and now if you want to add an index, you do this.
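The migration slides are missing from the transcript. In Rails itself, a concurrent build is requested with `add_index ..., algorithm: :concurrently` inside a migration that calls `disable_ddl_transaction!`. How Braintree wraps that is not shown, so the sketch below only models the guard in plain Ruby. `FakeMigration` is a made-up stand-in that records SQL instead of running it, and the method name `safe_add_concurrent_index` is a guess at the spelling of the helper named in the error message.

```ruby
class UnsafeMigrationError < StandardError; end

# Hypothetical stand-in for a migration base class. It records the statements
# it would run instead of talking to a database.
class FakeMigration
  attr_reader :statements

  def initialize
    @statements = []
  end

  # The ordinary helper is overridden to fail as soon as a developer runs it locally.
  def add_index(_table, _column)
    raise UnsafeMigrationError,
          "add_index is not safe. Use safe_add_concurrent_index instead"
  end

  # The safe variant asks Postgres to build the index without blocking writes.
  def safe_add_concurrent_index(table, column)
    @statements << "CREATE INDEX CONCURRENTLY index_#{table}_on_#{column} " \
                   "ON #{table} (#{column})"
  end
end
```

Because the failure happens when the migration is run against a local database, the mistake is caught long before a deploy, and the `safe_` prefix tells a reviewer at a glance which operations have been vetted.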
And this is safe, so we prefixed it with safe. We've done this for all of the operations that we believe are safe on our database. And, similar to the other hacks, there are times when we need to run an unsafe operation in a migration, so we do allow you to bypass. But we prefix those with unsafe. Removing a column is unsafe, because we don't take any downtime. In order to remove a column, you have to make sure the code can handle the column being both present and absent, because the code is running while the column just disappears. So you have to make sure a bunch of other things happen in the code to handle that. We have a way to mark columns as deleted so that we tell Rails to essentially ignore the column. But it's essentially an unsafe operation, so we categorize it as such. Now when you're going to do a deploy, you can look at the migrations and see that there are unsafe ones. As a double check before you deploy, you can make sure that we've done all of the steps necessary to handle them. So we essentially went through all of the ActiveRecord migration methods and added either safe or unsafe to them. It's just another clue to the developer.
Okay, so another example at development time. We call them sanity specs. This is a grab bag of random tests we write. Anytime we know what the right code is and what the wrong code is, and we can codify that in a test, we try to. Here's a fairly big example. Almost all of our tables have a public ID column. That's the ID we expose to merchants, and it's generally the ID we look things up by. If you want to find a transaction, you find it by its public ID. So, since we look things up by that ID, we need to make sure it's indexed.
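A rule like "every public ID column is indexed" can be checked mechanically against the schema. In a Rails app the inputs could come from the database connection (`tables`, `columns`, and `indexes`). The sketch below instead uses a hand-written hash so that it runs on its own. The `SCHEMA` layout, the table names, and `tables_missing_public_id_index` are all hypothetical.

```ruby
# Hypothetical schema description: table name => its columns and its indexed columns.
SCHEMA = {
  "transactions" => { columns: %w[id public_id amount], indexed: %w[id public_id] },
  "customers"    => { columns: %w[id public_id email],  indexed: %w[id] },
  "audit_logs"   => { columns: %w[id message],          indexed: %w[id] }
}.freeze

# Returns the names of tables that have a public_id column but no index on it.
def tables_missing_public_id_index(schema)
  schema.select { |_name, table|
    table[:columns].include?("public_id") && !table[:indexed].include?("public_id")
  }.keys
end

missing = tables_missing_public_id_index(SCHEMA)
# A sanity spec would assert that `missing` is empty and fail the build otherwise.
```

With this sample data, `missing` contains only "customers": that table has a public ID but no index on it, while "audit_logs" is ignored because it has no public ID at all.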
So, since we know we always want it to be indexed, we actually just wrote a test for it. Now, if you add a new table with a public ID and don't add an index, a test will fail. And this is in the suite of tests that we run before we push commits. Or if you add a public ID to an existing table and you don't add the index, it will fail. This makes sure we don't accidentally ship code with a public ID that we use but haven't indexed, where all of a sudden the queries take a really long time. Again, this is just one example of the type of sanity specs we write, but we have tons of these. Like I said, anything we can codify into a test, we do. That way, when a developer is writing code, they get a failure right away, ideally before pushing any commits, and then they can go fix it and push. Because we all make mistakes, right? It's very easy to add a new table and forget to add an index. That's why we want hacks and tests to make sure we don't do something dumb. So, that's pretty much it.
To summarize: you want to check for invariants. You want to think about your systems and find the invariants, the things that never change. And then you want to write checks for those. Those are the safety you want to add to the system. Above all else, we think you should focus on data validity. Code is easy to fix and change and redeploy. Data is extremely hard to do that with. It could mean restoring from backups. It could mean manually repairing records in a database. Or, like I said, it could mean reaching out to all of your customers and telling them that the data you gave them is wrong and they need to go update it. So, in any case, data validity is extremely important. And then you want to make sure to write the checks outside the normal flow.
You want your checks to be extremely simple and not susceptible to the same bugs as the main part of your code. In some cases, this means writing something twice with two different algorithms. Or it means writing something a different way, but much more simply. And then the biggest one is that you want to find your own examples. I've shown a bunch of examples here, but you definitely should not take these home and try to implement them in your own systems. You should think about your own invariants and your own data, and write your own checks.
And just thinking about this in terms of the bugs I talked about at the beginning: if you think about that trading system, maybe they could have written a hack that said, if it loses more than a certain amount of money in a certain amount of time, fail fast and just shut off the trading system. Obviously these are big, complicated systems, and I don't really know anything about them, but you can imagine a system where if you lose more than $50 million in 10 minutes, or something like that, the system will just halt and say it needs user input to continue. Someone has to give a manual sanity check to keep going. If you think about World of Warcraft, maybe there's an extra hack they could have added. Again, these are really complicated systems, but maybe for the most deadly spells, right before they take effect on a user, you do one last check in the code: am I in the location where I think I should be, and is this person a valid target for this spell? If those are both true, then you can carry out the rest of the spell. Maybe you add that extra check somewhere along the way. But think about your own systems, and think about your own safety. And that's it for me. Are there any questions? All right, thank you.