 I'm the only thing standing between you and lunch, so I just love this position. So yeah, it's gonna be a three-hour talk, no breaks. I'm Paul Hinze. You can find me on the tubes as phinze, that's my handle most places. I'm the director of infrastructure at HashiCorp, but today I'm just here as Paul Hinze, guy who hangs out near prod and pokes his head up when things break. So yeah, I am super psyched about this topic. I probably have way too many slides, and we'll kinda see how it goes. This is all brand new content, so I'm gonna find my way through it here. So: systems break in various interesting ways. I'm gonna break down the skill of systems failure analysis into four sections, and you could just summarize them this way. If you haven't noticed, the theme of the talk is big-ass emoji, so we'll have some fun. We'll start with feeling, then move into thinking and acting, which hold the concrete stuff, and then we'll bring it back up to collaborating, all around failure and failure analysis. Okay, so systems fail, it turns out. We, more than most groups, understand this. So let's start with a couple of definitions just to ground ourselves in this topic. What is a system? You'll notice we actually do this a lot in this talk, and it's one of the skills of system failure analysis: understanding the boundaries of your current context, the assumptions you're making, the inputs to the thing you're asserting to be true. So we'll do that here with just this simple statement. What do we mean when we say system? If you stop for a second, you're like, wait, what is a system? A system is a set of connected things or parts forming a complex whole. And in computers, we have systems on systems on systems, right? 
And that's sort of the hallmark of modern failure: the vast complexity we deal with on a moment-by-moment basis when we're trying to interact with computers. And what do we mean when we say fail? To be unsuccessful in achieving one's goal. There's some nuance to this as well, because when you're dealing with a system not performing or behaving, and there's so much complexity involved, oftentimes you can actually lose track of the original thing that went wrong. Because you can dive into the complexity of a system and start diagnosing things, it's really important to remember: what was the failure? And we'll get to that as we dig into the meat here. Okay, so systems fail. Now what? What do we do? Probably the most important thing I can talk to you about, in terms of what I've learned around system failure analysis, is attitude: how you treat failure in your head, with your teams, in your company. So that's the first section, feelings about failure. I've done a lot of thinking about this. I've watched teams react to failure. I've watched teams try to fix prod when it's down. I've watched teams get lost in an incident trying to figure out what the actual issue is. And over time I've got this working framework, involving a couple of memes, that anchors the context for how we treat failure. I learned that in the academic world you can say "towards a...", and that's basically beta for ideas. This might not be fully baked yet, but I think I've got something going here, and it's been useful for me, so we'll walk through it. It all starts with everybody's favorite dog. This is the This Is Fine dog. He doesn't have a name. He's from KC Green's comic. I'm sure everybody here is familiar with this dog. I love this dog so much. 
But I also have kind of a complicated relationship with this dog, which we'll get into. The basis of the comic is: here is a character amidst a burning building, and he's just totally fine with it. He's unable to process that the building is burning, and by the end he himself is completely destroyed. It's usually only cited as the first two panels, but it's a six-panel comic, and he gets real gross by the end. So this is a central icon of DevOps. It was popularized earlier in operations, right? It's from 2013, I think. And then the world caught on; it's become a sort of mascot for a generation, a little bit. But we had it first, us operations folk. So let's talk about the feelings associated with this guy. I have his counterpoint: this is Fix-It Felix Jr. He's from a fictional video game in the Disney movie Wreck-It Ralph, an animated picture. This character is really useful as the antithesis of that dog. So much so that I've started to think about the way people react in failure scenarios on a spectrum that starts with the This Is Fine dog and ends with Fix-It Felix. And Fix-It Felix, that is his entire character, essentially. He's the hero of the video game. He's super happy-go-lucky, he's always positive; the town has given him a medal that says "hero" on it. And then Wreck-It Ralph is the Donkey Kong-style guy who breaks stuff, and Fix-It Felix fixes it. Felix is essentially the protagonist, but he's kind of boring. And this is where I started to get a complicated relationship with this dog, because people share this dog. Sometimes it's healthy and helpful. 
It's people's way of processing an adverse situation. But sometimes there's a lot of negativity under that dog, right? It's helplessness, it's hopelessness, it's apathy. And watching the spectrum of ways that people react to failure, I've started to ask: how can I be a little more like Fix-It Felix than like that dog in a failure scenario? Now, with any spectrum, you can flip it and get some interesting nuance. There is some value to chillness, especially in failure, and you can take Fix-It Felix as the one who just overdoes it, the one you would roll your eyes at. Nobody wants the person who's endlessly and artificially positive in a failure scenario; that's not healthy either. So you do wanna take a page from the book of our dog. And then, introducing an additional dimension, we have Walter. Walter is a character in the movie The Big Lebowski who believes he's the only one who gives a shit about the rules. If you don't know the scene, it's basically an argument about a frame of bowling; he pulls out a gun, and he has become a meme for being the only one who gives a shit about the rules. This is a danger for people who do try to react positively to failure and who find themselves frustrated. I've been this person. I've been the person who believes they're the only one out there fixing the real problems in the system. So I'm arranging these into a little bit of a framework, giving us some images and touchstones we can use to think about how we react in failure scenarios, how our teams react in failure scenarios. And I assert to you that, generally speaking, you want to try to move your teams towards Fix-It Felix. 
And so, okay, here's a framework. Let's talk about what a productive posture towards failure might look like. I'm gonna break it down into three things that have been useful for me to anchor myself in failure scenarios. So: systems fail. Why do you care? I think this is important. The things that can bring Fine Dog out for us are the places where we lose touch with the reason we're doing this in the first place. When something's on fire, it's stressful, you don't know what's happening. Your tendency to react in a positive or a negative way is going to be affected by how anchored you are to your reason for doing what you do. For some of us, that's understanding what good the system you work on does for other people, or for the world. Maybe I'm saving people time. Maybe I'm making it easier for people to do something. Maybe I'm making the markets more efficient. For me, this has been true. In my current job it's more abstract than in my prior jobs, so this is actually harder. But the concept of, oh, I'm helping small businesses accept payments; oh, I'm helping teachers and students, their lives will be a little bit easier: staying anchored to that is important, because it's what motivated me to fix the thing when it broke. Beyond that, there's a content-agnostic posture you can take that is also productive, which is just: this is what we do. I don't care what the system does; it's not gonna fall over on my watch. And that's a perfectly valid way of anchoring yourself into a posture that reacts productively to failure. In reality, you're gonna have some of both. Sometimes you wanna help the humans that are using the system or being affected by it; sometimes you're just like, no, this is just wrong and I have to help. 
So that's what gets you from apathy up into a place of motivation: a rocket using the fuel of giving a shit. I like that. The other really big one is more recent for me, because I didn't process it until more recently: how do you deal with the depression of things breaking all the time, of hanging out near the stuff that breaks? Because the dog comes back again here, right? He's kind of papering over this idea that everything's breaking all the time, and people are mad, and we're losing money, and this is real. And it happens disproportionately in operations, because we're the ones nearest to the systems that are breaking. For me, the just incredible solve here is to totally transform the way you think about failure. It's kind of central to the DevOps movement, but I like to illustrate it with a slide. So, the next slide: if you're not really cool with intense visual imagery (not audio, just visual), you can look away. But anyways, come on, let's see if it works. Cool, okay. So we're transforming the way we think about failure. It's not something breaking; it's the system coming back together and believing in itself. Come on, buddy, it did it before; based on my prior knowledge of this system, I can... yes, okay, cool. So this guy, Nassim Nicholas Taleb, is crazy smart, and kind of a snot cake. I think he would agree with this; if you read his books, you will get this. He's the kind of person who would take that as a descriptor and be like, I'm a snot cake, but in kind of a cool way. He thinks it's in kind of a cool way; I think it's just in a snot cake way. But he's real smart. So, okay, this guy is a former market person, financial person, statistician, economist. 
And now he's sort of just an essayist, talking about uncertainty and randomness and chaos and chance, and how bad we are at dealing with them. What really helped me transform my feelings about failure was The Black Swan, his first book on this. There's a lot in there, but the thesis is essentially: we are really bad at predicting what he calls a Black Swan event, an event that is not very probable but has a large impact, and that afterwards we weave into a narrative. It's like the start of World War I: oh, it's just that one guy got shot, as if, had that one guy not been shot, we probably wouldn't have had the war. He goes over this again and again: how bad we are at predicting. And Antifragile is the application of that idea to systems: how can we think about exposure to chaos, to failure and stress, as something that makes our systems better, with systems actually designed to react to that? I think that's very much woven into the core of DevOps, the iterative-improvement concept. The other helpful metaphor that's transformed the way I see failure is this concept of health as a resource. This might be a specific reference, but if you play video games or card games or tabletop games, there's a game-strategy idea which says: if the game ends and you still have 100 life, you haven't lost anything, but you've also not played the game optimally. You should think about your health as a resource. And this is very much linked to the way Google SRE has the concept of error budgets, right? If your system is beating its SLO, if you've got 9s of uptime to spare, you're actually doing it wrong. You're wasting your time. 
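To put numbers on the error-budget idea, here's the arithmetic for a couple of illustrative SLO targets (the percentages are examples, not anyone's real numbers); the gap between your target and 100% is downtime you're allowed, even expected, to spend:

```shell
# Downtime budget implied by an SLO over a 30-day month (43,200 minutes).
# The SLO values below are just examples; plug in your own target.
for slo in 0.999 0.9999; do
  awk -v slo="$slo" 'BEGIN {
    minutes = 30 * 24 * 60                  # minutes in a 30-day month
    printf "SLO %.2f%% -> %.1f minutes/month of allowed downtime\n",
           slo * 100, minutes * (1 - slo)
  }'
done
# SLO 99.90% -> 43.2 minutes/month of allowed downtime
# SLO 99.99% -> 4.3 minutes/month of allowed downtime
```

If you never come anywhere near spending those minutes, that's a signal you could be shipping faster.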
And then, kind of summing it all up: failure as an opportunity. We can't predict all the things that are gonna happen to our system. We know that if we were to try to prepare for the entirety of whatever could happen, we would be wasting our time. Therefore, every time the system fails, it's teaching us something. It's showing us something. It's a gift, and if we accept that gift with the right attitude, then we can treat it as something great. Not all great, right? You still need a little bit of Walter, a little bit of everything-is-on-fire, this-is-fine dog. But internally, I think the sustainable way to think about failure is as a lesson, a lesson the system is teaching us. That grab bag of stuff is what's gotten me to understanding failure as new data and an opportunity to make the system better. Cool, all right. So, why me? This is the part of the posture where Walter comes in. Why is it always me that has to deal with this? For me, the best antidote is collective ownership, radical collective ownership. Use the word we every time you're talking about the system, every time you're talking about the code, every time you're talking about anything having to do with the system and its job and the things it does. Language matters, and by doing that you really are going to create a culture of improvement, of positivity. So the super version of Fix-It Felix is a we: one that says not I can fix it, but we can fix it. This image is the one I think of when I'm in the doldrums or I'm freaking out, and I'm like, no, we gotta pull out the we-can-fix-it Felix. 
And if you're balking at the idea of using we, if you're like, no, I'm the only one down here in the corner doing all this stuff, I encourage you to start using we anyways, because you can think about your future self and your past self as just other members of the team. You're not communicating very well with six-months-ago self. We all know this; it's a classic ops joke, right? Oh, who broke this? Oh, it was six-months-ago me. So you can think about the multiple yous across time that all collectively own the system, and you can start showing some empathy, because it's easier to feel empathy for yourself: oh well, I didn't know what I was doing at the time, but it's okay. That's a starting point for the posture that I think is healthy. So: collective ownership is what gets you from frustration to empowerment. That's feelings about failure. When systems fail, we can learn, and then we can make the system smarter. So now we've got the right attitude; let's start thinking about failure. This is where it's become so fun for me, so rewarding. Once you get into that attitude, you strap into your chair: the system is doing something wrong, what are we gonna learn today? In that mindset, you can learn so much. So let's talk about how to address failure. In order to understand a failure, you have to understand the system, and then you have to understand the failure. Let's break that down a little. Systems fail. Do you understand the system that failed? I was gonna ask how many people here feel like they understand the systems they work on, but I've already given the thing away, so you're not gonna raise your hand. 
So here's a question: how much do you need to understand the system? Specifically, how much do you need to understand the system in order to understand the failure? This is the key insight: you don't necessarily need to understand all of it. But let's talk about understanding; give me a couple more slides here. What does understand the system even mean? I can just go through the layers and we can find something that you don't know, right? Do you understand how the CPU is pipelining the instructions? We can go down low, we can go up high. Do you understand how the BGP router is making a decision about whether the packet goes to your server? There are too many layers to understand. Brains are limited; systems are complex. The superset of systems that actually delivers the value of your software is just insanely complex. And so we need tools to help us with this. There's this concept in computer science of a working set: the amount of data that a process needs at hand in order to get its job done. You generally want to keep your working set in resident memory (the pages actually in RAM are your resident set), but your working set can be a little bigger than that if some of it is kept close by; there's all sorts of research about managing this. And it's useful to think about the working set of your own brain, which is tiny, right? There are those studies where you're asked to remember some number of digits, and you get to about five and then you start forgetting. This is what we're working against every time we're dealing with system failure, because the system is too big to keep in our head at any one time. 
So the skill you can develop is one of maintaining this working set more actively, more thoughtfully, and giving yourself a leg up in context switching through the system. Because in most systems you're never gonna live at all layers at once. We can't do that. Maybe eventually, superhumans. So what do we do? We model. A model is a simplified description of a system or process to assist calculations, predictions, or understanding. We use these all the time, right? It's a simplification of the system. So here's the system that I currently work on, and here's a data flow diagram of it. You don't need to see the whole thing; it's just an example. This is Terraform Enterprise, a HashiCorp product, and a data flow diagram is just one kind of model: let's name all the pieces in the system and look at how data flows from the web tier down to the data stores, in this case Postgres, RabbitMQ, Redis, and S3. So it's a view. Data flow diagrams are interesting in that they provide a really abbreviated view. That's the whole thing, and it's like, oh my god, it has so many pieces. It's true. But by sitting with it and learning what the names in each of the boxes mean, you can get a loose understanding of the whole system, which is useful. That's your high-level minimap. Sequence diagrams. I'm using the word diagram here; diagrams are reifications of models. Models are in our heads; models are the abstract, diagrams are the visual. A sequence diagram is super useful to model a process, and that's really important in failures: you usually have some sort of process taking certain steps that you need to understand in order to debug. So sequence diagrams are useful. Domain models. 
A domain model is an abstraction of the objects in your system: the important objects and how they relate. So, do you have these models for your system? In your head? On paper? In diagrams? You can build some. Let me plug a couple tools. I'm really into Mermaid. Mermaid's awesome; if you're not familiar, go look it up. It's like Graphviz plus Markdown, and it can do sequence diagrams, Gantt charts, all of the stuff we just saw, with text, which is really cool. It's versionable: you can check it in alongside your code. You check in the PNG so people can see it, but then you can change it and version it and all that good stuff. Here's the Mermaid source for that diagram we just saw: 50 lines, not too bad. And it's a little less obtuse than Graphviz, which is a bit harder. I mean, Graphviz is cool too; if you want to learn Graphviz, go for it. Same idea: text that gives you diagrams. Typora is the Markdown editor that I'm into, because it has native Mermaid support. You can pop in there (I think there's an option you have to turn on), type triple-backtick mermaid, start typing, and you start seeing a diagram. The cycle from concept to reality is very quick, and then you can share it, because it's an image. So these are tools for the question: how much do you need to understand the system in order to understand the failure? You can't understand the whole thing at once; understand the part that matters to you. In this system, for example, slow Terraform runs might be a problem. This is a system that runs Terraform for people; if runs are slow, that's bad. So here's what my brain does almost immediately: okay, that's Packer stuff, that's web stuff, probably not web stuff, that's blob stuff over there. 
Let's zoom in on this part of the system. First I'll check RabbitMQ, see if it's okay. Then I'm going to look at these two services that I know are associated with Terraform builds. And as I work through the failure, I might expand that scope. Maybe: oh crap, this all looks fine; I'm going to expand the boundary of my analysis out to the web stuff, because maybe it is in the web tier, maybe it is the blob storage. Maybe my diagram is missing an edge, which I realized it was when I made this: there needs to be an edge from the workers over to the blob storage, because they actually fetch binaries out of it. A failure could totally be the trigger for me to update that model. So: use models to anchor your working set of system understanding to the failure at hand. That's the superpower you need to be able to dance around your system and solve problems. Again: no models? Build some. A great place to start would be the last failure the system had. We had this thing break; let's create a sequence diagram for it and really map out that last failure. Because the other thing you have to realize is that your working set is small, and your working set is not synced with the working sets of your team members. Clear and useful communication is so difficult, and it's such an art form in system analysis and systems generally, but especially when talking about failures. Okay, so I'm running over time. I've got a bunch of stuff about responding to failure that I will dance through. So what is a failure? Given an input, I expected X; instead I observed Y. It's a mystery, and diagnosing failure is the process of narrowing it down. The nice thing about skipping through these slides is that it's all just code. 
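As a sketch of what that "map the last failure" exercise can look like in Mermaid, here's a made-up sequence diagram for a slow-run investigation; the participant names are hypothetical, not the real Terraform Enterprise internals:

```mermaid
sequenceDiagram
    participant User
    participant Web
    participant Queue as RabbitMQ
    participant Worker
    participant Blob as Blob store
    User->>Web: queue Terraform run
    Web->>Queue: enqueue build job
    Queue->>Worker: deliver job
    Worker->>Blob: fetch Terraform binary
    Note over Worker,Blob: the edge my diagram was missing
    Worker-->>Web: stream run output
    Web-->>User: show run status
```

A dozen lines of text, checked in next to the code, and the next person debugging a slow run starts with a map instead of a blank page.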
So I can just put it online and you can look at it. The big point I wanted to make here is that you don't need fancy tools to diagnose a system. The best tools that I use are SSH, shell loops, scratch files, and shell history. Set your shell history to something huge and Ctrl-R your way to victory. Unix text processing tools and Google: you don't have to keep all the Unix text processing tools in your head; just search for "unix tools, how many lines say this", and you can get really far. Failure analysis is all about scoping, and I have a bunch of examples here that you can pop through. Like: I don't even know where this app is running. Snag all of the IPs for the system, loop over them, run ps, and look for it. That's the thesis of this section: you can take a couple of loops and figure out answers to questions like, is this happening to all users? Read the logs. Even if the logs aren't aggregated anywhere, you can go snag the logs from the servers. Log aggregation is great, aggregated logs are cool, but you can also just grab the last two hours of logs and slice them up on your computer. How many HTTP responses in the last two hours? How many HTTP responses by code? How many in the last two hours, by minute? The vitals of the system are important to check, and you can do this in loops too: just check all the things that can be exhausted in a system. I'll get to this last point and then I'll let you have lunch. I'm gonna make a controversial statement here, which is: read the code. As somebody who's diagnosing a failure with the system, in the context of your team, with the help of the application folks near you: jump into the code. You don't know the language? Learn the language. If you have a little box in a diagram that you do not understand, that's a great opportunity for learning. If this thing breaks, go understand why it breaks. 
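The loops described above might look something like this; `hosts.txt`, the process name `myapp`, and the nginx-style `access.log` layout (status code in field 9, timestamp in field 4) are all assumptions to swap for your own system:

```shell
#!/bin/sh
# "Scoping with shell loops": no fancy tools, just SSH, awk, sort, and uniq.
# Assumptions: hosts.txt lists one IP per line, the process is called "myapp",
# and access.log is an nginx-style log (status = field 9, timestamp = field 4).

# Where is this app even running? Loop over every host and look for the process.
if [ -f hosts.txt ]; then
  while read -r host; do
    echo "== $host =="
    ssh "$host" 'ps aux | grep "[m]yapp"'   # [m] keeps grep from matching itself
  done < hosts.txt
fi

if [ -f access.log ]; then
  echo '--- responses by status code ---'
  awk '{ print $9 }' access.log | sort | uniq -c | sort -rn

  echo '--- responses per minute ---'
  # Chop a timestamp like [12/Mar/2019:14:05:17 down to the minute.
  awk '{ print substr($4, 2, 17) }' access.log | uniq -c
fi
```

Run against two hours of snagged logs, the last two pipelines answer "is this all users or one?" and "when did it start?" in a few seconds.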
I have some slides on a minimum-viable set of code-reading editor features, illustrated with what my Vim setup does, but you can find those same few features in any editor, and then you have an editor you can use to go learn about code. Use your models to stay anchored, and then solve the problem. And the last point is about collaborating, which is essentially: once you have this posture of learning around failure, it's infectious. Once you can go to other people on your team and say, I don't understand how this part of the system works, but it's failing and I don't know why, that is a great place for you to create new knowledge with that person. Learning is contagious, ownership is contagious, and failure can make everybody stronger. That's what I have. Thank you.