All right, welcome. I am James Thompson. I'm a principal software engineer at Nav. We are working to reduce the death rate of small businesses in the United States. If that sounds like something you would be interested in, please come and talk to me. We are hiring, and we are looking for remote engineers. Now, I am here to talk to you today about building for gracious failure. How can we make failure something that doesn't ruin our days, that doesn't ruin our nights, and that doesn't ruin our weekends? I am not a fan of overtime. I have a personal rule that I will not work overtime unless I absolutely have to, and I will never take someone else's word for whether or not I absolutely have to. But the reality is failure happens. It's unavoidable. We will have infrastructure go down. We will have people delete production databases. We will have people deploy services that they should not. And so we have to plan for failure. We have to find ways to manage failure. That's the best we can hope for. We can never eliminate failure. None of us will ever write a perfect system. And so we have to plan for our failures. We need to identify techniques, processes, and ways that we can make failure manageable. That's the goal. Now, I'm gonna share a few stories about failures that I've dealt with, and all of these are from the not very distant past. All of them are failures that I've had to deal with over the last year. And the first one I wanna talk about is probably the one that bugs me the most. And that's the reality that we can't fix what we can't see. If we don't know something has gone wrong, it's incredibly challenging, if not impossible, to actually resolve that thing. And if your users are your notification system for when something has gone down, unless you're an incredibly small startup, you're probably doing something wrong. Visibility is the first step to aiding us in managing failure.
If we don't know that our systems are failing, we're not gonna be prepared to respond to that failure. And instrumentation is one of the best ways to get the information that we need to be able to act on and prioritize and deal with the failures that happen in our system. I recently changed teams at Nav. I took over what is now being called our data sourcing team. We are responsible for the ingestion of data from credit bureaus, Experian, Equifax, TransUnion, Dun & Bradstreet, as well as a number of other data sources. And we have to deal with a lot of garbage. And in particular, we have to deal with this garbage asynchronously. We have to deploy systems, jobs, workers that are able to go through and update credit reports on a regular basis, that are able to fetch alerts from these various bureaus and bring them together in a sane way. And so we have a job processor that was written in-house. It is very similar to Sidekiq or Resque or Delayed Job or any of the typical worker systems that you might be familiar with, but it was written in-house. And when I picked up this project, I noticed that the only visibility we had into what was going on were the logs. And we were running this system in Kubernetes, so we don't have a static number of instances running. We have, in the production environment, I believe at the moment, about 30 instances of this application running. Collating 30 systems' worth of logs and figuring out if something is going awry is not what I ever want to spend my time doing. I don't know about any of you, but I do not fancy the idea of sitting down in a comfy armchair with a cup of coffee and scrolling through 30 services' worth of logs. That sounds like a horrible way to spend any amount of time. And so I needed to figure out a way to stop having to deal with logs. And I figured it out within the first day. I decided to use Bugsnag, because I didn't know how many errors we were having.
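The talk doesn't show code for this, but the wrap-and-notify pattern is small enough to sketch. Here `ErrorTracker` and `run_job` are hypothetical stand-ins (in real code you'd call something like `Bugsnag.notify` from the bugsnag gem); the point is that every job failure gets reported somewhere visible instead of scrolling past in 30 instances' worth of logs:

```ruby
# Hypothetical stand-in for an error tracker like Bugsnag, Airbrake, or
# Rollbar. A real tracker ships these notifications to a hosted dashboard.
class ErrorTracker
  def self.notifications
    @notifications ||= []
  end

  def self.notify(error, context = {})
    notifications << { error: error.class.name, message: error.message, context: context }
  end
end

# Wrap each job so that no failure escapes unseen.
def run_job(name)
  yield
rescue StandardError => e
  ErrorTracker.notify(e, job: name)
  raise # still fail the job; the goal is visibility, not swallowing errors
end
```

The re-raise matters: the tracker is for seeing failures, not for hiding them from the job system's own retry logic.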
I knew we were having errors, but I didn't know if we were having an unusual volume of errors, or if we were having anything that I really needed to care about. And so by using Bugsnag, I was able to go from the picture on the left, except much, much longer and ever growing, to the picture on the right, where I can at least say, okay, I know how many errors I have. And I have a little bit of insight into how regular they are. And the reality is, I don't know if this is normal. I only have a week's worth of data here. I can see that there's a little bit of variability in that one where we have 231,000 errors. I can see that there's a lot more variability in this one that we only have 2,000 errors for. And we've got this one that's just shy of 10,000 that is super stable. It's just lots of errors, all the time. But I have absolutely no idea if any of this is normal. Now thankfully, this is our staging environment, so I'm not really worried that we're not delivering to our customers. But when I look at this and see all of these errors, not really knowing if this is normal or abnormal, I can't trust this code. I don't feel comfortable deploying this service into production anymore, because if that 231,000 errors happens in production, that's going to be a horrible day for me and my entire team. And so having this kind of visibility is the first step to being able to manage failure. I need to know that it's happening. And so tools like Bugsnag, Airbrake, and Rollbar give you that first level of visibility. But I still don't know if these are worth working on. I can go and talk to my product owner and try to ask, hey, this is the error that's happening. This is what I suspect is causing it. Is this affecting other teams? Is this something we need to prioritize and address?
But I don't have enough information to be able to say, yeah, this is definitely something we need to address. And I'm not trying to convince my product owner; I'm actually going to them trying to have them convince me that it's worth working on. And I don't like that when it comes to errors, especially errors of this kind of volume. And so there's another step in terms of visibility that I think is really, really important, and that is metrics. This was something that we just got deployed at the end of last week. This is an actual graph from our staging environment, and as for the numbers on the left, I don't know what the heck SignalFx is doing there. It's supposed to just be counting, and I don't know how we have fractional numbers of jobs, so I'm not sure what's going on there. This is where we're still trying to get our instrumentation right. But something that this did reveal to me, when I know that we have thousands upon thousands of errors happening, is that the blue line, which is jobs started, and the red line, which is jobs failed, are following each other. Almost every job that starts in our staging environment fails. Now I know I can't trust the code before I ship it into production, because if we can't run it in a staging environment, how in the hell am I supposed to know that it's safe to run in production? And so visibility is the very first thing that you need in order to manage failure. This is kind of the table stakes of managing failure. Before you can deal with anything else, you need to be able to visualize and track and investigate your errors, and not through logs, because logs don't provide you enough information to be able to act reliably. And so that's where we have to start. We have to start by making errors visible, by building a process by which we discover that failures have happened, whether or not they're meaningful, and whether or not the rate of failure is significant. The first step is visibility.
And so there's tooling for this. If you're working in Ruby, you've got lots of options here. New Relic provides a good bit of this in one package. You have systems like SignalFx and Keen that provide just metric tracking. But this is something that you need to be doing. If your systems don't have a good way for you to know when errors happen, when failures occur in your code, and to be able to tell whether or not those errors are actually occurring at an anomalous rate, you're already behind. You need to catch up. And this is stuff that is very easy to implement. Now, the service I'm talking about is actually written in Go, and God bless Go, it is not a friendly environment to implement this kind of instrumentation in, because there's no way, especially if you're running a concurrent system, to catch everything that's happening across the concurrent goroutines. But Ruby is stupid simple. And so please instrument your code. Track metrics: not just the errors, but how many jobs are starting in your system, how many jobs are succeeding, how many are failing, how many HTTP requests you're getting, how many are returning different classes of error codes, whether they're 400s or 500s, and how many are successful. You'll then be able to establish a baseline for what is normal, what is typical, and then you can do anomaly detection on top of that. But you can't hope to do that kind of anomaly detection until you have a baseline, and until you have visibility into your system. If you do nothing else but leave here and implement Bugsnag, or SignalFx, or New Relic, or any solution that gives you this kind of visibility, you will have benefited your team greatly, and you will likely have saved yourself at least several hours of having to deal with a failure that just arises out of nowhere because you didn't see it coming.
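A minimal sketch of that kind of counting, with a hypothetical `Metrics` registry standing in for a real client like SignalFx or StatsD (the talk names the metrics, not the code):

```ruby
# Bare-bones counter registry. A real metrics client would flush these
# counters to a service like SignalFx instead of keeping them in memory.
class Metrics
  def self.counters
    @counters ||= Hash.new(0)
  end

  def self.increment(name)
    counters[name] += 1
  end
end

# Instrument a worker loop: count starts, successes, and failures so a
# baseline (and later, anomaly detection) becomes possible.
def process(job)
  Metrics.increment("jobs.started")
  job.call
  Metrics.increment("jobs.succeeded")
rescue StandardError
  Metrics.increment("jobs.failed")
end
```

With counters like these flowing, a graph where `jobs.failed` tracks `jobs.started` one-for-one, as in the staging example above, becomes an obvious red flag rather than a surprise.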
Now, to move on from this, I want to talk a bit about some techniques for making your systems more gracious in the face of failure. How the services that we build can be made more forgiving, not only in terms of how they respond to different circumstances, but also in what they afford for other systems that depend on them. So the first one of these affordances that I think we need to make is that we need to get into the habit of returning what we can. And I have another story here, and this is one of an unexpected error. Whenever I started at Nav, I had the task of figuring out how to deal with, or how to build, a service that we were calling business profile. We keep records on lots of small businesses, and with those small businesses, we have to track a whole bunch of different data points: when they were founded, whether or not they're incorporated, what their annual revenue is, whether they accept credit cards, all kinds of different facets. We have about a dozen or so of these fields that we track. And the business profile service is responsible for maintaining a record of those fields over the course of time. Now, there was a service that existed prior to the work I did. It was a prototype that was shipped into production and then got abandoned, like all good prototypes that get shipped into production. And so in the process of coming on and looking at this, I had to assess, okay, are we gonna keep this service and try to make it work, or are we going to just start fresh? And I made the decision to start fresh. I'm still not sure whether that was a mistake or not, but a year later, having worked on the same service for a whole year, we have made the transition to this new system. And in the process of doing that, we had to bring over all of the historical records from that prototype system. We needed to bring over about nine million independent data points.
They were all in a single table, and we needed to migrate them over so that we could maintain history. And we were able to do that successfully. We were able to do an ETL on that and bring all of that data over. But then, as we started to transition other services to rely on business profiles rather than the old service, we started seeing 500 errors. There were some folks who were asking for data from business profiles, and business profiles was returning a 500. Because one of the first things I did in this project was install Bugsnag, we were able to figure out what was causing those 500 errors, and we identified that there was corrupt data in the database. Now, it wasn't corrupt as far as the database was concerned. It was a string, and the database thought it looked fine. That's why the migration worked. But whenever the application tried to read this string out of the database, it said, I don't know what that means. It was trying to understand the format that string was in, and it just choked. And so we had a situation where the data that had come over from that legacy system had somehow been handled okay in that system, but when we brought it over into the new system, we weren't able to parse it. And upon doing some more investigation, we realized, oh, this data's never been valid. It's just that the other system was way more tolerant of reading garbage out of the database. And so I was able to add a rescue clause to catch this specific parsing error and get the system to where it was no longer returning a 500 error. And what we then decided is that we were comfortable returning an empty value, returning null, rather than returning a 500. Because we had other data, and even with this corrupt field, all of the fields in the system are independent of each other. They are related, but only loosely so. And so returning some of that data was still meaningful and still valuable.
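A sketch of that rescue-and-return-null approach. The actual field and format from the talk aren't specified, so this uses a hypothetical `founded_on` date field; the shape is what matters, a rescue clause that degrades one unparseable field to `nil` instead of failing the whole request:

```ruby
require "date"

# Hypothetical reader for one profile field. If the stored value can't be
# parsed, return nil for that field rather than raising and causing a 500.
def read_founded_on(raw)
  Date.iso8601(raw)
rescue ArgumentError, TypeError
  nil # corrupt or missing value: degrade this one field, not the response
end

# Build a response from independent fields; corrupt ones come back as null,
# everything else comes back just fine.
def profile_response(record)
  {
    founded_on: read_founded_on(record[:founded_on]),
    annual_revenue: record[:annual_revenue]
  }
end
```

Note the rescue is deliberately narrow: it catches the specific parsing errors, not every exception, so genuinely unexpected failures still surface.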
And so when the system encounters errors where it can't parse or can't deal with a particular value, it'll return null for that value, and it'll return everything else just fine. And so that was an example of being able to make this service return what we could. It now means that collaborators with this service don't have to worry about whether or not I'm gonna 500 if a piece of data is corrupt. It's one less case in which they have to worry about my system doing something that they don't expect. And this is something that we should think about in our systems. There are lots of occasions where returning some data is far better than returning no data, or even worse, returning an error. And so can we think about the values in our system that are separable from each other? Can we think about the values that have to go together, say a currency and a money amount, and the values that move completely independently of each other, like the date that a business was founded and its annual revenue? Those two values have nothing to do with each other. If one's blank and one's corrupt, that's okay. We can still return something. And so we should think about the values we have in our system and how we can be tolerant of those kinds of cases. Another case that we have is accepting what we can. Now, that business profile service I was referring to has, like I said, about a dozen data points that it can accept. But again, because they all move independently of each other and they are separable, we don't have to get all of them at once. And in fact, collaborators don't even have to send any value, or acknowledge that a value exists, when they submit an update. They can just send a JSON payload with just the fields they wanna update, and we'll accept that and it's fine. But we discovered that sometimes we'll be sent strings instead of numbers. And our service doesn't like that.
And so we made a decision that we still wanted to accept as much data as we could. If we were sent four fields and one was not what we expected, we wanted to go ahead and record the values of the three that were fine, and then let the user know that that fourth value had something wrong with it. And so we decided to adapt this service so that it could accept whatever it can, and it will still notify the user that, hey, something was wrong, but I still accepted the updates for the things that I could. And so we need to be forgiving with what other systems send to our services. We need to be able to accept what we can. Partial acceptance is often much better than total rejection. And so we again have to think about what values must go together and what values we can reasonably separate and allow to be accepted independently. All of this gets to the point of trying to make our systems tolerant and tolerable. We may not know whether other folks are just testing, and we may not know whether they're expecting certain behavior, but if we can tolerate and be tolerable to other systems, it'll make our entire environment, our entire system, more resilient. Now, another approach that I think is very important is that we need to trust carefully. And this is one that can be applied both to third-party services and to services within an existing service ecosystem. The reality is that depending on others, and depending heavily on other services, can make their failures your failures. And this again was another case in which business profiles ended up being a problem for us. It wasn't so much business profiles' fault, except that business profiles was at the bottom of the stack, and it was the thing returning the 500 error when it couldn't read values out of its database.
But what went wrong was that the service collaborating with business profiles saw that 500 error and said, I'm just gonna forward this on. I'm not going to intercept that 500 error, I'm not gonna do anything about it, I'm just gonna pretend like, yeah, whoever's upstream from me, they'll know what to do with nothing. And sure enough, the service that was up a layer had no idea what to do with nothing. It had no idea what to do with a 500 error. And so it also returned a 500 error, until eventually we got all the way to the user interface, and by the time we got there, we had an outage for an entire feature of our site, all because, out of the not quite half dozen services involved, none of them had been built to tolerate the services they trusted down the stack not responding appropriately. Now, we had this situation because of a 500 error, but it could have just as easily been caused by a network partition, or by the service actually going offline and being unreachable. The impact would have been the same: we would have had an outage for an entire feature, and a really nasty error message for our users, all because nowhere along the way could we intercept and deal with this in a gracious way. And so trust carefully. You need to be careful who you trust and how you trust them. This is most prevalent in a microservice or service-based environment, where a lot of times we assume trust between services. That's wrong. Pivotal actually tweeted out an illustration a little while ago on the eight fallacies of distributed computing, and of course they're things like the network has zero latency, bandwidth is unlimited, all kinds of things that are absolutely not true. But in our service systems we tend to assume that all the services we interact with within our boundaries are trustworthy. That's not true.
Sometimes prototypes get shipped into production. Sometimes you're having to talk to a legacy Java app that no one wants to talk to, but they have to, because there's no choice. Sometimes you have systems that are completely untested, but as long as we don't look at them or touch them, it'll be fine. You can't trust the other systems that are running in your ecosystem, and so you need to build with that in mind. Whenever possible, don't return 500 errors if you're dealing with a service that you have control over, but expect that other services are going to return 500 errors, or something far, far worse. We need to assume that failure is going to be a reality, because it is. We have to get into the mindset, we have to get to the place, where we expect failure. That's the big takeaway. We have this mindset that arises out of chaos engineering and the notion of Chaos Monkey and the Simian Army and the ability to simulate infrastructure failures. But the reality is, for most of us, our infrastructure isn't the most likely thing to go down. It's the crappy code that I and y'all write. We are much more fallible than an automated script, unless that automated script requires user input, which can then take down an entire AWS region. And so we need to get to the place where we expect failure. We need to get to the point, not only in regards to our infrastructure but also in regards to the systems that we build, where we anticipate the ways they can fail, and where we build in mechanisms and processes to raise the visibility of failures and to tell whether those failures are meaningful within our constraints. We must prepare for failure, otherwise we'll always be stuck suffering from it. If we're not prepared for failures, they will always take us by surprise, or worse, they won't take us by surprise, but they'll still mess up our day, our night, our weekend, our month, because we have nothing else to do other than just firefight. And the first step is to raise that visibility.
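One hedged sketch of that defensive posture, using a hypothetical `fetch_profile` wrapper (none of these names come from the talk): rather than forwarding a downstream 500 or exploding on a network error, it degrades to a caller-supplied fallback, so a single flaky service can't take out a whole feature up the stack:

```ruby
# Hypothetical wrapper around a call to a downstream service. The client is
# any object responding to #get and returning { status:, body: }. The point:
# never blindly forward a collaborator's 500, and never let a network failure
# propagate as our own error. Degrade to a fallback value instead.
def fetch_profile(client, business_id, fallback: {})
  response = client.get("/businesses/#{business_id}/profile")
  return fallback unless response[:status] == 200

  response[:body]
rescue StandardError # timeouts, refused connections, partitions, etc.
  fallback
end
```

In a real system you would likely also report the failure to your error tracker and metrics before falling back, so the degradation stays visible even while the feature keeps working.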
Once we've gotten to the point where we have our failures visible to us and we can analyze them, then we can start figuring out how we can make our systems more forgiving, more tolerant, and more tolerable within our environments. Now, we have some time for questions. I wanted to make sure that we left some time here, and there is not a mic to run around, so you will need to be comfortable yelling at me, and then I will repeat as best I can what you asked and provide you something resembling an answer. Right, yeah, so the question is, how do we balance being tolerant, particularly with data ingestion, versus being strict and making sure that we don't end up with garbage in our system? Does that summarize it? Okay, so that's going to come down to a business case. In the case of business profiles, the example that I've used, we made the determination that accepting partial data in particular was valuable, but we also made the determination that we did not want to accept garbage data. So if someone tries to send us a string where we want a number, we're going to tell them that's not acceptable. And so that's something that's going to come down to a service-by-service basis, where we have to make the assessment of what's acceptable to us. Now, I don't think it's a good idea to build services that just accept whatever gets sent to them and store it, because then you end up in a situation where you just have garbage data, and your BI and data science folks will absolutely hate you for that. So don't do that. Don't just take whatever is sent to you and store it. Make sure you're performing some basic validations on it to ensure that even if you're doing partial acceptance, you're still rejecting outright garbage. I think that's actually a really good place to start, but figuring out whether you can deal with partial acceptance or not is entirely case by case, service by service, and that needs to be vetted by the business side of things. Yeah, absolutely.
And that is a situation where, when I say partial data, I don't mean accepting some data that's valid and some data that's invalid. If the data can be ruled out and said, this is invalid, always reject that. Yes, so he was making a note about strong parameters as being a system in Rails that allows you to do type checking, but also raising the concern over accepting invalid data versus valid data. And that is a point that is worth clarifying: don't accept invalid data. If you can look at it and say this is absolutely not acceptable, by all means reject it. The point that I wanna make about partial acceptance is that if you're in a situation where some of the values in a record don't necessarily have to be accepted with other values in the record, take what you can. In many cases, and this is something that actually comes out of the notion of event sourcing, the present state of a system is discoverable by replaying all of history and seeing how changes over time have altered the state to get to where you are now. And in that kind of a system, because you still have valid data that can be used to reconstruct other parts of a record, a partial update can still be accepted and still keep that record in a valid state. And so the business profile service that I'm using as an example is one of those where all the values can be thought of independently, and so updating one is just as meaningful as updating all of them. And so being able to accept what you can, when you can, still delivers value in certain cases, and it's absolutely something that needs to be vetted. In the far corner. Yeah, okay, so in the case of returning a partial response, do we still return a 200 status code? In the case of this particular system, we do still return a 200.
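A sketch of what such a partially accepting update might do. The field names and validators here are hypothetical (the real service's validation rules aren't shown in the talk): apply the fields that pass validation, reject outright garbage, and report the rejects in an errors array alongside the 200:

```ruby
# Hypothetical per-field validators: unknown fields and badly typed values
# are rejected, everything else is applied independently.
FIELD_VALIDATORS = {
  "annual_revenue" => ->(v) { v.is_a?(Numeric) },
  "founded_year"   => ->(v) { v.is_a?(Integer) && v > 1800 }
}.freeze

# Partial acceptance: record what we can, report what we can't.
def apply_update(record, params)
  errors = []
  params.each do |field, value|
    validator = FIELD_VALIDATORS[field]
    next errors << { field: field, message: "unknown field" } unless validator

    if validator.call(value)
      record[field] = value
    else
      errors << { field: field, message: "invalid value" }
    end
  end
  { status: 200, record: record, errors: errors }
end
```

Consumers of a service like this have to know to check the errors array on every response, which, as the next answer notes, is a contract that needs to be communicated explicitly.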
But along with that, we have notified all of the consumers of this service that they need to always check the errors parameter that we return to them, which will always contain an array of any values that were not acceptable. There are other ways you can handle that, depending on, again, the constraints of your environment. And so because we did accept the data and it was okay, we do return a 200 code, but that may not always be appropriate depending on the circumstance, yeah. Yeah, so the question is, in some cases the partial data may not be ideal, and so do we provide opportunities to undo that? In our particular use case, we don't have a situation where partial data needs to be rolled back. The most common way that data enters this particular system is either from bureaus, where we're scraping specific pieces of information out of a report and using that to populate the record (and in that case we know what fields are available in those reports, and if there are values, we bring them in), or where we're accepting user input. And in those situations, this was actually something that came up recently: we discovered that if a user went in and updated their business's industry, their annual revenue, and a couple other pieces of data, then, at least from the front end's perspective, they were not expecting it to accept partial data and to leave the value that was incorrectly formatted unchanged. And that was a failure in communication on our part, in letting the front end know that this is how this service behaves.
And so that was one case where they were not expecting it to do a partial update, they were expecting a hard error. And in talking with them and communicating with them, we said, no, you just need to make sure that you're always checking that errors array to know if there were any fields that weren't handled. But we've not yet had a use case where we need to roll back, just because of the nature of this particular service that I've been working closely with. Yeah, so how do we distinguish valid use of the system, since we do have partial responses, from folks who are trying to exploit the system? In this particular case, business profiles is buried deep in our infrastructure. It sits behind at least three other systems that handle access control, and so it is actually an insanely trusting service from the access control standpoint, which is why it is not publicly routable. It's only exposed through other services that provide that access control, but that would absolutely be something you would need to mitigate against if you were gonna have a publicly accessible service that returns partial data. You would wanna make sure that it has appropriate access controls in place to make sure you don't leak data. Yeah, so the question is, do we keep metrics on these situations where reading data out of the database is not possible, when we have those error cases? And yes, we do. We don't have SignalFx metrics for it (we're not tracking it that way), but we do still have a Bugsnag notify call that will record the context: we will actually fetch the raw value from the database, put that into the payload, and then notify on Bugsnag, so that we can see what values are causing this and hopefully discover where they're coming from.
Up to this point, we've been able to identify that all of those corrupt values came from the migration, where the other system was just much more accepting of input than the new system happens to be. Yeah, so once we have visibility on the errors and failures that are happening in our system, how do we prioritize which ones we wanna work on? And that's something where you need a good product owner. You need someone who understands where the business value is, what the impact or the potential impact is on users, whether there is any, and of course you need to help them by providing as much detail as you can in terms of what you know. But that's ultimately, in my opinion, a product decision, and it's something where we as engineers need to collaborate with the product owners to determine which fires we can let burn and which ones we need to put out. And of course, with the examples that I gave earlier in Bugsnag, as soon as we got the instrumentation in place, I turned all of those into tickets, added as much detail as I could, and then I notified my product owner and said, hey, here's what we've got, here's what I think is the problem, and can you do the legwork to figure out: are other teams impacted by this, are any customers impacted by this, and how much do we care about these? Because it has to come down to, will fixing this deliver business value, or, if it won't provide new business value, will it restore business value we're currently missing? And so until you can answer that question, it's difficult to prioritize from a technical standpoint, other than the fact that, looking at this one that's 230-something thousand, I hate getting the every-new-10,000th email from Bugsnag. That's really annoying, but that's my only metric right now. So eventually I'll probably fix that one just because I don't want those emails anymore.
Yeah, so, particularly with errors like the stuff you see in Bugsnag, where all it does is provide you with what's gone wrong, how do you distinguish background noise from the things you actually need to work on? And that's where I think Bugsnag and systems like it are not enough. That's why I really like SignalFx or Keen or something like that, where you're able to actually see not only that an error happened (which you have to have separate metric tracking for), but how significant that error rate is in comparison to the total volume of traffic coming through your system. And so, like, whenever I showed that slide earlier where the failure rate is perfectly tracking, just on a slight lag, the number of jobs started, that's a huge red flag, because there's no gap between failures and starts. But you will need to look at what that gap is, and that's where other visualization tools, actual pure metrics libraries, will allow you to get an idea of how big the problem is. And of course Bugsnag can help you there, because in some cases, depending on the way your application is structured, it will actually tell you the number of users affected. But because of where business profiles and the worker system that I showed earlier sit in the stack, there's no way to identify the user that triggered certain errors. So we have no idea what the impact is until we start asking people. And so that's where the separate metrics, to track jobs started, jobs finished, jobs failed, HTTP requests, and the different error and status codes, become very important, to allow you to sanity-check whether or not the errors you're dealing with are actually affecting enough of your user base to be worthy of inspection and further follow-up.
Yeah, and so some more input there on how to prioritize bugs: taking into account severity, taking into account frequency, and being able, again, to provide more detail on how impactful a particular bug is. And of course, the more information you have, the easier it is to figure out how impactful a given failure actually is in your environment. All right, well, I think we are out of time now. Thank you all for coming.