It's the last day of RailsConf, and you have come to my talk, so thank you. I am James Thompson. I'm a principal software engineer at MavenLink. We build project management software, primarily for professional services companies, and we are hiring in both Salt Lake City and San Francisco, so if you're looking for work, come and talk to me. We'd love to get you on our team.

Now, this talk is roughly similar to the one I gave in Los Angeles at RubyConf, but I have added a little bit of material, so if you did happen to catch my talk on this subject at RubyConf, you'll be getting a little extra, and I've tried to change up the focus a bit. What we're gonna talk about today is failure: how to deal with failure, and how we can cope with the failures that happen in our systems. I wanna start with a very simple question: how many of y'all, when writing software, have ever written software that failed?

Yeah. It fails in all kinds of ways. Sometimes it's because of a hardware issue. Sometimes it's because the code we wrote isn't quite perfect, or isn't even close to perfect, or is just garbage. Sometimes our systems fail for reasons completely outside our control. I'm gonna be talking about a few ways to deal with those kinds of failures in particular: the kinds that are difficult to foresee, but that can be mitigated.

The first thing we have to come to terms with is that failure happens. Everything fails some of the time. It's unavoidable. We need to plan for it, we need strategies in place to help us mitigate it, and we need to think about these things ahead of time as much as possible. We can't foresee the future, but we can plan for reasonable outcomes, reasonable ways in which things may fail. We'll get into some maybe unreasonable expectations with some of the stories I have to share, but for a lot of the systems we build, we can plan for the kinds of failures that happen in them.

Now, not everything I talk about today is gonna be immediately applicable to the projects you're working on. If you work predominantly in a monolith, some of the specific stories I have come out of a microservice ecosystem, so there won't be a perfect one-to-one correspondence. But I'm going to try to present ideas that have general application regardless of the type of environment you're working in, the programming languages you work in, and the frameworks you use in your day-to-day development.

I wanna start with what I think is the most basic and fundamental practice, and what I hope is an uncontested truth: we can't fix what we can't see. I'm gonna tell a story about this, but I hope nobody here thinks they have such perfect omniscience that they can fix something they have no awareness of. (And of course, if you have no awareness of something, you clearly don't have omniscience.) As we think about our systems, we need to be looking for ways to gain visibility into them. The first step in being able to deal with failure, and to cope with the ways our systems are gonna fail, is to gain visibility into those failures, and there are lots of different ways we can do that.
This visibility helps us manage our failures by giving us a window into when, how, and to what degree our systems are failing. Beyond error reporting and instrumentation, there are metrics-capture systems that can go a long way toward providing richer context to help you understand why, how, and to what extent your systems are failing. To illustrate this, I'm gonna share a story.

We know that low visibility is dangerous in many contexts: whether you're sailing, flying, or driving, low visibility is a hazard. In software it is similarly hazardous, although not to quite the same life-threatening degree in most cases. I was working for a company a little while back, in a microservice environment, on a service written in Go. Not my favorite language in the world, but I was able to get up to speed on it, and we had very extensive logging coming out of this service. We thought we knew a lot about what was going on. We had alerts set up on the logs, and we were able to monitor what was going on, or what we thought was going on, by keeping track of the logs coming out of the system.

But then we started to notice something strange in our staging environment. Processes that we thought should be completing weren't actually finishing. We saw data coming into the system, and we could see in the logs that the system said it was processing that data, but we were not seeing the results come out the other end the way we expected. This revealed that while we thought we had adequate visibility, clearly something was missing. Clearly we were missing some part of the picture.

In connection with this service, I had started rolling out some additional tooling that was already in other parts of our ecosystem: specifically Bugsnag for error reporting and SignalFx for metric tracking. Bugsnag gave us a more context-aware way to see the errors in our system, and SignalFx let us track some very simple metrics, specifically the jobs that started, the jobs that succeeded, and the jobs that failed. Rolling out those simple changes gave us an immense amount of visibility that we did not have previously. Fundamentally, we were able to go from what you see on the left here to what you see on the right.

Now, how many of y'all just love staring at log files, trying to figure out what's going wrong? I hate them. I actually find that log files are damn near useless when you're trying to figure out what's going on in a system. Almost every log file is incredibly noisy, with a very low signal-to-noise ratio compared to what you're actually trying to get out of it. Tools like Bugsnag, and there are lots of tools like this, including some on the vendor floor here, give you a much better picture of what's going on in your environment: when you start to see certain errors, how many are happening, and which environments they're happening in. Having that kind of visibility changes the way we interact with our applications. It gives us more information to decide what we need to work on and when. But that's not the only thing you need in large systems, or even in moderately sized systems.
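As an aside, those three counters take almost no code to emit. Here's a minimal sketch using the statsd-ruby and bugsnag gems; the job class, metric names, and process method are hypothetical, and I'm assuming the StatsD datagrams get relayed to SignalFx by a StatsD-compatible agent, which is typically how that wiring works.

```ruby
require 'statsd'  # statsd-ruby gem
require 'bugsnag' # assumes Bugsnag.configure ran at boot

STATSD = Statsd.new('localhost', 8125)

# Hypothetical background job, instrumented with the three counters
# that mattered: started, succeeded, failed.
class ProfileIngestJob
  def perform(payload)
    STATSD.increment('jobs.started')
    process(payload)
    STATSD.increment('jobs.succeeded')
  rescue StandardError => e
    STATSD.increment('jobs.failed')
    Bugsnag.notify(e) # keep the rich per-error context too
    raise
  end

  private

  def process(payload)
    # the actual work goes here
  end
end
```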
Just knowing that there are several thousand, or tens of thousands, of errors in a certain system may not actually tell you that much. In the case of this system, even though we had what seemed like high error rates, we didn't actually know whether that was normal or abnormal. One of the reasons we didn't know is that one of our data sources was a credit bureau, and credit bureau data is God-awful. It is just the most horrible dumpster fire in terms of formatting and consistency. A numeric field might contain a number, or a blank string, or the string "NA", or occasionally something completely different that's not documented at all. That kind of inconsistency means we didn't know how much we should be failing. We knew we should fail occasionally, because the data we were dealing with was garbage, but we didn't know how much.

That's where we brought the metrics tooling to bear. This is where we used SignalFx in particular to get a graph like this, and this graph scared the crap out of us. The blue line is how many jobs we were starting. The orange line is how many jobs were failing. And there is no green line, because none of the jobs were succeeding. This gave us a really quick window to know that we had messed something up badly. Thankfully, this was happening in our staging environment. We had not yet rolled out a series of changes to production, so we knew we had something in staging that we needed to fix desperately. But if we had just been looking at Bugsnag, or even the logs, which were not telling us exactly how many of our requests were failing, which of course ended up being 100%, we would not have realized how severe the problem was, and we might have ended up chasing down other bugs and errors that we thought were more important but actually weren't our key problem.

So this is an area where having visibility gives you greater context: to figure out not just what is failing, but why it's failing, why what's failing matters, why the system you're dealing with is important, and how big the scope of your failures is.

Now, there's some additional tooling that I have come to love recently, from a company called LogRocket, that can show you, down to the user interaction, where errors are cropping up in your system. It can tie user interactions back to Bugsnag or Sentry or Airbrake or other error-reporting services, so you can dive into an error and actually see what the user did that triggered it, whether it's client side or server side. Even that additional level of visibility is incredibly powerful for figuring out: does this error matter?

So visibility is kind of the table stakes when it comes to dealing with errors. When it comes to figuring out how to deal with failure in a reasonable way, we have to start by raising the visibility of our errors and giving ourselves as much context as we can about them, so that we can deal with them in a sane way. We need to pick tools that give us visibility not only into the health of our systems and the raw error details, but also into the broader context of exactly how bad an error is, and that gets into things like metrics tools.
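The general lesson: the ratio matters more than the raw count. In practice the alerting lived in the metrics tool, but here's a toy sketch of the kind of check that graph effectively gave us, with a made-up 10% baseline for "normal" garbage-data failures:

```ruby
# Toy check over the three job counters: raw error counts mean little,
# but failures as a fraction of starts tell you how bad things really are.
def failure_report(started:, succeeded:, failed:)
  return 'no jobs ran in this window' if started.zero?

  rate = failed.to_f / started
  status =
    if succeeded.zero? && failed.positive?
      'nothing is succeeding; page someone'
    elsif rate > 0.10 # made-up baseline for "the source data is just garbage"
      'failing more than our normal baseline'
    else
      'within expected noise'
    end

  format('%.1f%% of %d jobs failed (%s)', rate * 100, started, status)
end

puts failure_report(started: 500, succeeded: 0, failed: 500)
# => 100.0% of 500 jobs failed (nothing is succeeding; page someone)
```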
Ensure that you're collecting as much of the context around your errors as possible. This will help you know how to prioritize your efforts. And again, raising visibility alone will give you a huge advantage when things go sideways, as they will in your apps. Do these kinds of things and you will greatly improve the paths available to you for resolving errors, and even more, you'll be better equipped to know which errors actually matter to your customers. The more context you have, the better the information you have to act on.

That leads to the next thing I wanna talk about: we need to fix what's valuable. How many of y'all have ever worked with compiled languages? All right. Now, how many of you are familiar with the mantra that we should treat warnings as errors? All right, a few of y'all. This was something I originally encountered in the 90s, and I thought, hey, that sounds like a good idea. I don't know any better. Let's treat every warning as an error. Our code will certainly end up being better. That was dumb. And in the age of dynamic languages, looking at things like JavaScript and linters, oh my goodness, how we can just navel-gaze forever on linters. We have the ability to warn ourselves about things that absolutely do not matter. Even if you have an error-reporting system and you have bugs legitimately coming up in your system, they do not all matter, and they do not all require an immediate response.

That's why it's important for us to think about what is actually valuable in our system. What is it about our system that gives it value to our customers, to the consumers, to the collaborators that work with it? Let's make sure we're prioritizing our effort based on the value we're trying to create, or the value we're trying to restore, for those people. This is one of those areas where outdated dependencies, or security vulnerabilities, or countless other issues that we tend to lump into the category of technical debt often aren't real problems, because they don't touch any value. They're not depriving anyone of value. They're not causing anyone to stay up at night. We need to be better about thinking through whether an error we encounter in our system is actually an error worth fixing. And this is where that visibility comes in: being able to figure out what is going on, how big its impact is, and whether it's affecting people in a way that actually deprives them of the usefulness of what we've built.

So if you have product and customer service teams in your organization, before you start working on some error you've encountered or that has been reported to you, cross-check with them whether there's another mitigation strategy. Could customer service help your users use the system in a way that won't bring about this error case? Are there other ways to address the problems in our system that don't require us to invest one of the most expensive resources most organizations have, their engineering teams, in fighting every little fire as it comes up? Instead, can we let some of these things burn for just a little while, so we can focus on the things that have real and demonstrable value?
So again, the more we can focus on value, the better we will do at satisfying the people who actually need our software systems to work. They want to use our systems for a reason, and we need to preserve that reason, not our own egos, when choosing where to focus our attention.

Now I want to get into some stories that deal with unusual error conditions. The first connects to a principle that I think can be applied in some cases: return what we can. This story is about a different microservice at the same company I described previously, and a unique issue that came up with it. In this situation, we had one generation of a microservice that we were replacing with a new generation. Fundamentally, they did the same thing: they stored individual data points, tracked over time, about what it means to be a business in our ecosystem. They tracked things like the business's name, when it was founded, when it was incorporated, its estimated annual revenue, whether or not it accepted credit cards, all kinds of individual data points, tracked over time so that we could try to develop a profile of business health.

In the process of deploying this new service, we had to come up with a migration strategy, because we needed to bring over several million data points from the old system. We didn't want to lose that historical context. So we designed a migration strategy that let us bring over that data, deal with the fact that the database structure of the new service was fundamentally different from that of the old service, and preserve that history. That migration went well. We brought over all the data, all the cross-checks came back fine, and we migrated all of the service's collaborators over; they all got up and running on the new generation of the service. Nothing went terribly wrong in the actual migration from the old service to the new.

What went wrong came later, when one of these collaborators added a new capability that needed data from our service, specifically some historical data. In doing this, they discovered that some of the data we had migrated from the previous generation of the service was garbage. It was corrupt; it had never been valid. In the process of bringing it over to the new system it was still definitely not valid, but because of how we had translated that data, when we went to read it from the database we now got hard application-level errors.

Part of the reason for this was that we had encoded these previous data values into a YAML-serialized column. We did this because it was easy, and because we didn't know exactly what shape this data structure would need to take long-term, so we used the easiest thing available to us, and that was a YAML-serialized column. (We eventually migrated to a native JSON column in Postgres, but for this initial rollout we had a problem.) When we started asking for this data, we were getting YAML parser errors. Whenever we tried to deserialize these garbage values out of the database, the YAML was not well formed, we'd get a Psych error (Psych being Ruby's YAML parser), and our service returned a 500. That seemed like reasonable behavior for our service: when it had data it couldn't handle, it errored.
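To make the failure mode concrete, here's roughly the shape of that setup. This is a sketch, not our actual code; the model and column names are hypothetical, and it uses the older serialize call signature (newer Rails spells it serialize :data_points, coder: YAML).

```ruby
class BusinessProfile < ApplicationRecord
  # Each row stores a YAML blob of point-in-time facts about a business.
  # With no coder argument, serialize defaults to YAML, i.e. Psych.
  serialize :data_points
end

# A healthy row round-trips fine:
#
#   profile = BusinessProfile.create!(data_points: { 'annual_revenue' => 250_000 })
#   profile.reload.data_points  #=> { "annual_revenue" => 250_000 }
#
# But a row whose column holds corrupt bytes (for us, values that were
# garbage before the migration and garbage after it) raises at READ time,
# deep inside deserialization:
#
#   corrupt.data_points
#   #=> Psych::SyntaxError: (<unknown>): did not find expected key ...
#
# Unrescued, that exception bubbled up and out of the API as a 500.
```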
The problem was that when our service errored, its collaborator also errored, which caused another upstream collaborator to error, which took down an entire section of our site, one that was responsible for actually making us money. So we had a cascading failure that resulted from this very low-level issue of database values not being the way we expected them to be.

Now, we had multiple problems that needed fixing in this context, but the easiest one was to simply rescue that Psych parsing error, and that's what we did. We looked at the data, we looked at how it was corrupted, and we realized that because the data is corrupt, there is absolutely nothing special about it. It's essentially an empty value, and because it's an empty value, we can simply return null any time we encounter this corrupt data. So that's what we did: we started returning null any time we ran into this parsing error. Yeah, we might occasionally miss some data just because it's malformed, but the reality is that if we can't parse it, it's fundamentally worthless to us. Returning null was a valid option, so it's the one we pursued.

This is an example of where we were able to return what we could. We had lots of different data points, and if we encountered just one that was corrupt or that we couldn't understand, we could return null for it. In many cases, especially in a microservice environment, and in most application contexts I'm aware of and have worked in, returning something is better than returning nothing, and it's absolutely better than returning a hard error in most cases. Very rarely does all of our data actually have to be complete in order to be useful and valuable, but we're accustomed to thinking about it as an all-or-nothing proposition. We need to be thinking about how to have less dependency between parts of our systems, not more. In those kinds of situations, a great way to start decoupling and loosening the connections between our systems is to ask: how little can we give and have this data still be useful? So I think there's a lot to be said for returning what you can, and coming up with sane blank values to return for the things that you can't.
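In spirit, the fix was a handful of lines. Here's a sketch continuing the hypothetical model above: a forgiving coder that turns unparseable bytes into nil instead of letting Psych::SyntaxError take the request down. I'm assuming a Rails version where serialize accepts a coder object as its second argument, and the permitted classes are a guess at what the real data needed.

```ruby
require 'yaml'

# A YAML coder that treats corrupt bytes as an empty value.
class ForgivingYAMLCoder
  def self.load(payload)
    return nil if payload.nil?
    YAML.safe_load(payload, permitted_classes: [Symbol, Date, Time])
  rescue Psych::SyntaxError => e
    # Unparseable data carries no information we can use; note it and move on.
    Rails.logger.warn("dropping unparseable data_points: #{e.message}")
    nil
  end

  def self.dump(value)
    YAML.dump(value)
  end
end

class BusinessProfile < ApplicationRecord
  serialize :data_points, ForgivingYAMLCoder
end

# Corrupt rows now read as nil, the rest of the record still comes back,
# and the collaborators upstream stay up.
```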
Another principle I wanna talk about is that we also need to be accepting. Just as we need to be somewhat cautious about what we return, and try to maintain a working state whenever people ask us for information, we also need to be very generous about what we're willing to accept. This is the notion that acceptance is often better than total rejection.

Now, this same service had many collaborators, half a dozen or so, and each of them had only a small piece of the total picture of what a business was. We had some data coming from credit reporting agencies, some data supplied by users through parts of our interface, and other data coming from other sources, and none of them gave us a complete picture of what a business was, as far as we were concerned. So we needed to make sure we could accept as much or as little data as each of these collaborators was willing and able to give us. We designed the system so that it does not require you to send a complete version of the business profile object: it'll accept one field if that's what you've got, and it'll accept all the fields if you have them.

In doing that, we made this system very permissive about what you can send it. As long as you send it some data, it will figure out what it can save and what it can't. We also made it resilient in such a way that if you sent it five fields that were great and one field that was total garbage, it would accept what it could and tell you what it couldn't accept. That was a very important detail, because instead of saying, well, you've sent us one data field out of this set that we don't understand, so we're gonna return a hard error to you, we made the decision that accepting some of the data was better than accepting none of it. We would be forgiving about what was sent to us, accept everything we could understand, and warn the collaborator about what we couldn't.

This gave us a system that was much more resilient to the variety of sources we were going to receive data from, sources we had no direct control over, with varying degrees of quality in what they sent us. So as we build services, we want to be forgiving with their collaborators. We wanna think about the data that has to go together, and the data we can accept independently. If we do that, our services become more resilient to failure, and we have to fail much less often.
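Here's a sketch of what that kind of permissive contract can look like at the edge of a service. The field names and validations are hypothetical and far simpler than the real thing; the shape is what matters: save what you understand, warn about what you don't, and only reject outright when nothing is usable.

```ruby
require 'date'

# Hypothetical per-field validators for a partial business-profile update.
KNOWN_FIELDS = {
  'name'                 => ->(v) { v.is_a?(String) && !v.strip.empty? },
  'founded_on'           => ->(v) { Date.iso8601(v) rescue false },
  'annual_revenue'       => ->(v) { v.is_a?(Numeric) && v >= 0 },
  'accepts_credit_cards' => ->(v) { [true, false].include?(v) }
}.freeze

# `profile` is assumed to be a record responding to update! (e.g. ActiveRecord).
def apply_partial_update(profile, params)
  accepted = {}
  warnings = {}

  params.each do |field, value|
    validator = KNOWN_FIELDS[field]
    if validator && validator.call(value)
      accepted[field] = value
    else
      warnings[field] = 'unknown field or malformed value'
    end
  end

  # Only a hard failure when there's nothing at all we can use.
  return { status: 422, warnings: warnings } if accepted.empty?

  profile.update!(accepted)
  { status: 200, saved: accepted.keys, warnings: warnings }
end
```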
Now, the final principle I wanna talk about is this idea of trust, and that we should trust carefully. The reason trust matters here is that in any software system we're going to have dependencies, and those dependencies imply a level of trust, whether it's another service we're collaborating with or a dependency like a Ruby gem we bring into our project. We are trusting that thing to act reasonably and not break the rest of our system. But that trust is very rarely deserved. When we depend on others, their failures can very easily and very quickly become our own.

This gets back to the story I told earlier about the cascading failures within our system. We'd essentially built a distributed monolith, not a collection of microservices: a failure in one service caused a failure in another, which cascaded all the way up and caused an outage in a portion of our site for all of our users. There was too much trust baked into that system. And it was definitely too much trust, because of course we had multiple teams working very loosely together, which is the idea behind microservices: it gives teams the ability to move somewhat independently of each other. But all of these teams had been very trusting about how our services were going to respond and how reliable they were, and that led to a cascading failure and an outage for our actual customers.

We need to be mindful of this. Whenever we use somebody else's services, whether inside our organization or outside, when we use Google or Facebook or Auth0 or anyone else to do authentication for us, their failures can very easily become ours. If we use S3 or some other storage backend to host files for our system, their failures can very quickly become ours. So we need to be careful who we trust, and we need mitigation strategies for when they fail, not if they fail. Eventually AWS will have another big outage. Eventually Google will have another big outage. Eventually our own systems will have another big outage. We need to think about how to build our systems to trust a little bit less, and to respond well in the face of these kinds of failures. How can we design them so they don't fall over when one portion of what they depend on stops working the way we expect?

This is also one of the reasons I think most teams are nowhere near ready to adopt microservices: they are not yet ready to deal with the fact that you are trading method calls for distributed-systems latency. These are inherently complex systems, and in most cases teams do not yet have the skill or the knowledge to deal with that inherent complexity. And this is a problem that can come up in any kind of collaborative environment. No matter who you depend on, no matter the scale or the scope, there is an opportunity for other people's failures to become yours.

Now, this is not an argument for avoiding dependencies, although sometimes that is absolutely the right decision, and you should not depend on someone else for a critical service or feature. Rather, the point is to be careful about who we trust, and why, and to what degree. We wanna make sure the systems we collaborate with, the systems we depend on, are as reliable as they can be, but also that we make plans for when they do break. So trust carefully. Expect others to become unavailable. Respond graciously and gracefully in those cases. Don't return a 500, or any other kind of error, if you can avoid it. Try to preserve the user experience in some meaningful way. Think about the customers, the other developers, whoever it is that relies on your services, as you build your systems and think through the ways they might fail.

The big takeaway is that we need to expect failure. This is the key idea at the heart of chaos engineering, and it's a concept I think we all need to grapple with more deliberately and more intentionally. Even at small scales, we can't avoid things breaking, so we need to plan for how and when they're going to break. And of course, the first step is making sure we have good visibility into our systems: not just log files, not just error-reporting services, but meaningful application-level metrics that give us a better, more complete picture of the scope and scale of our failures when they happen. This is how we can be better at preserving and restoring value for our customers when our systems break.

Thank you. You can find me online. I'm not gonna do Q&A, but if anyone has questions, I will be hanging out down here afterwards. Come and talk to me, and if you're looking for work, also come and talk to me. Again, thank you for coming out, and there's the link to my slides as well.