So, we're going to be talking about learning from failure this morning, this idea of failure as success. And as Jason said, I go to a lot of DevOps Days events. Denver is by far my favorite, because I actually grew up just up the road in Fort Collins, which means it's the only DevOps Days that my mother attends. So, Mom, stand up, wave to everyone. That's right: DevOps, it's for the entire family. Also, by the way, Mom, if I cuss, I apologize in advance. Sorry. Okay.

One of the first talks I ever did was at DevOps Days Silicon Valley, a few years ago now, and it was titled "Is Your Team Instrument Rated?" The idea was: if we look at some of the other operational endeavors that humans undertake and model development and operations on them, what would that look like? So I said, well, if we think of pilots as developers and operations people as air traffic controllers, what can we learn from aviation, and from how they work together, that would be applicable to how we work together as developers and operations people?

So I'm going to start with a thought experiment in the JFK data center. This is a clip from JFK Airport, actually. And I'll tell you up front: I'm going to ask you questions about it later. Okay? All right, let's see. Hopefully it's not too loud.

So that sounds pretty stressful, doesn't it? It's not really the conversation you want to overhear while you're on a plane. Who had to deal with Heartbleed? Anybody else have to deal with Heartbleed? I heard this happened with a lot of operations and development teams during Heartbleed: ops would say, no, this is important, we need to fix this right now, and development would say, no, it's not that big a deal, and ops would say, no, we're actively getting hacked right now. So this kind of conversation might be familiar to some of us.

Now, I have a question, as part of the thought experiment: was this event a success or a failure? Raise your hand if you think it was a success. Couple hands. Okay, raise your hand if you think it was a failure. A couple. How many people want to know whether the plane landed before they'll answer? Right. Okay. We're going to be talking about that, because that's a fairly interesting thought: we actually want to know what happened before we'll commit to whether or not it was a success or a failure. So, come to think of it, what is success? Is it just the absence of failure, or is it something else entirely? That's what we're going to be talking about today.

This is who I am. I'm not going to go through this slide except for a couple of things. I'm J. Paul Reed on Twitter, so you can tweet things at me if I say something wrong or whatever. That's cool; I love the Twittersphere. The other thing I would point out is that I'm currently a Master of Science candidate in human factors and system safety. I mention that because a lot of what we're going to talk about comes from the safety sciences and the research that's been done over the past 50 to 80 years, and we're going to look at why that might be applicable to the environments we work in. The other thing I want to mention about this talk is that we're going to talk a lot about failure, and we all have our own ideas about failure and how failure affects us.
And it has a lot to do with what environment we're in and how we were raised around failure. I mean, if you fell off the horse and your parents said, get right back on the horse, versus, no, maybe you should go play the clarinet or something, you're going to react differently to failure. So keep in mind that this is a very personal discussion we're having; it can feel vulnerable at times, and that's going to factor into what you see here.

The other thing I would point out is that public postmortems have sort of become a thing. Connor's talk, right before mine, was about GitLab doing their public postmortem, and it's funny: almost a year ago, it would have seemed insane to do a postmortem in public. Here, I just searched for Chef's public postmortems. They've actually been holding their postmortems in public for a few years now, and they even run them as a Google Hangout. So if you want to watch them do that, you can log on. You can even participate, though if you weren't there, they might mute you. So that's kind of interesting.

Quick question, and I meant to ask this earlier: how many people do retrospectives? Like agile retrospectives, sprint retrospectives. Let's start with sprint retrospectives. Okay. How many people do incident retrospectives? Cool, all right. Would you like to post them publicly on the internet? No? I see some heads shaking no.

The other weird thing: who's heard of this whole idea of blameless postmortems? Does that make you feel a little uncomfortable? Like, I don't really understand what this blameless thing is? I get that a lot. And I've got to tell you, I stole this slide from Jason Hand, and I love it because everybody's in harmony, all "yay." I'm based in San Francisco, and Jason's based here in Colorado, and I get this question: blameless postmortems, what are you smoking? And it's like, well, actually, about that. So yeah, we'll talk a little bit about that too.

The other thing is there's this kind of weird failure fetish in Silicon Valley, right? This idea of startups pivoting. So we already have this weird interaction with failure. I actually kind of like this picture of a plaintive developer. Plaintive, or maybe just sad that he's paying $10,000 a month in rent.

So let's talk a little bit about the mindset here. You hear these words a lot: accidents, blame, success, retrospective. Human error is a big one that we hear a lot. And what we're really talking about is safety, right? This is what I was talking about before when I said the safety sciences. We're going to look at the research that's been done over the past 75 years or so. Now, a lot of this comes from studying industrial accidents, or accidents in transportation, that sort of thing. And you might think to yourself: we're in technology, right? What does safety mean if I'm running a website? Why is this even applicable to me? Well, there's obviously the whole Google self-driving cars thing, and the Internet of Things is, like, a thing. So that's one way the work you're doing might touch on some of that. Who here has heard of Knight Capital? I see a couple of hands going up.
All right, for those of you who haven't heard of Knight Capital: they did a deployment on a Wednesday. They opened for trading on Thursday at 9 a.m., and they ceased trading at 9:45 a.m. Because of a problem with that deployment, they were losing $170,000 a second for 45 minutes, which works out to about $460 million ($170,000 times roughly 2,700 seconds). So, $460 million out the door in 45 minutes. And on Friday, they were out of business. So when we're talking about these things, it really can be: we did a deployment, the next day we thought everything was okay, and the day after that, we're out of business. These things can actually move that quickly. That's something to think about.

Now, you might say, well, we don't do high-frequency trading, so we're fine, it's cool, this doesn't really impact my life. Who's heard of Petnet, the online pet feeder? Well, they had an outage. It's this little feeder where you can schedule feedings for your fluffy dog or cat with your smartphone or whatever, right? And what was interesting is that they said they were experiencing "some minor difficulties with a third-party server." Does anybody use that cloud thing? I think they use that cloud thing, and they were having some problems. Now, this outage left people's pets unfed. And it's interesting to consider where that dependency sat in the architecture: what happened such that, because S3 went down, we can't do our pet feedings? It's interesting to think about how that happened. So if you're using SaaS or cloud, again, that's a thing, I guess I've heard, this is probably relevant to your life.

So the first thing we're going to talk about is the conception of safety. Safety as energy and barriers. This is a model that comes from the early industrial, post-World-War-II era, where the processes were very linear and you could see them. It comes from factories, right? As things go from station to station on the factory floor, you can actually see it all. And the idea is that failure is a release of energy, a potentially uncontrolled release of energy: a chemical plant explosion, or nuclear, or something like that. Although back then, they weren't really talking about nuclear. So the idea was: okay, well, we can design safety in. We have this system; we can put guard rails on things; we can move stations further apart if they shouldn't be that close together; that type of thing. And of course then you can have defense in depth, right? How do you become safer? What do you do? You add more barriers, and you make them bigger and stronger and thicker.

So this worked relatively well when you could see everything. But then something happened in the late '70s. Any guesses? Three Mile Island happened. And what they really found when they started studying the Three Mile Island accident is that we're dealing with a different type of complex system. It isn't a linear process anymore. It's not like the factory floor, where all of these other models made sense. One of the interesting things about Three Mile Island, and in fact most of the nuclear plants in this country, is that they're basically really scaled-up versions of nuclear submarine reactors. And then we bolted a generator onto them and figured, well, it's fine.
If you just make it bigger, it's totally fine; we won't have any problems. So we're dealing with a couple of problems. First, it's more complex. Second, we scaled these systems up thinking they'd just be fine at scale, and it turns out there's something emergent there that we didn't account for. And the other problem, which is actually very relatable to those of us doing operations and development, is that there's no direct observability. In a nuclear plant, you can't actually go into the core and take the temperature, because you would die. So there's a bunch of sensors to handle that problem. But then we have this problem: what if the sensors disagree? And then we get into this whole monitoring discussion, right? How do we monitor this effectively? How do we resolve those sorts of inconsistencies?

One of the interesting anecdotes about this event: anybody remember those old line printers, the really big ones where the head would move and the paper would roll out? They had a computer attached to the reactor that would print events onto printer paper, so they had a record of them. And when this event started, the printer started printing. After about eight hours of printing, it had gotten through the first four minutes of the event. So they just turned the printer off and rebooted the computer. That, again, is how fast these events can happen.

So it's like: all right, this model clearly doesn't work. And this incident made that point pretty starkly. Then a sociologist named Charles Perrow came along, and he came up with this idea of normal accidents. He modeled systems on the degree of linearity versus complexity: the process is either very linear or it's very complex, right? And he put that on an axis against how tightly coupled the system is: are there lots of degrees of freedom? Are the components really tightly coupled together, or are they further apart? And he said that certain systems, the complex, tightly coupled ones, are just going to have accidents. He said this was normal, hence the idea of normal accidents. And he said this was unavoidable: there's nothing you can do to prevent it other than make your process simpler and decouple the components.

This is directly out of his book, and this is the concept on a chart. We have loosely coupled at the bottom and tightly coupled above; and then on the left we have linear, and on the right, complex. As a couple of examples: as loosely coupled systems that are linear, he gives the DMV, which is kind of an odd example (single-goal agencies). For a system that's linear, so fairly simple, but tightly coupled, his example was dams. As someone who lives in California, if you heard about the Oroville Dam crisis, this suddenly became very relevant: Oroville is the tallest dam in the United States, and its spillway cracked open and they thought the dam might fail. So that's kind of an interesting one. He also put "military adventures" under complex but loosely coupled, which is kind of interesting: military adventures, what does that mean? And then of course, up in complex and tightly coupled, we've got nuclear plants and also nuclear weapons accidents.
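To make Perrow's two axes a little more concrete for those of us who think in code, here's a minimal sketch of the quadrant idea, my own illustration rather than anything from Perrow's book; the example systems are the ones from his chart, and the little classification function is just a stand-in for his argument:

```python
# A rough sketch of Perrow's interaction/coupling matrix.
# Axes: interactions (linear vs. complex) and coupling (loose vs. tight).
# Example systems come from Perrow's chart; the function below is
# only an illustration of his claim, not his formalism.

PERROW_QUADRANTS = {
    ("linear", "loose"): ["DMV / single-goal agencies"],
    ("linear", "tight"): ["dams"],
    ("complex", "loose"): ["military adventures"],
    ("complex", "tight"): ["nuclear plants", "nuclear weapons systems"],
}

def normal_accident_prone(interactions: str, coupling: str) -> bool:
    """Perrow's claim: complex, tightly coupled systems will have accidents."""
    return interactions == "complex" and coupling == "tight"

print(normal_accident_prone("complex", "tight"))  # True
print(normal_accident_prone("linear", "loose"))   # False
```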
Now, I have a question for you to think about: where would you put your organization, or your main product's architecture, on this diagram? It's something interesting to think about.

So at some point, another set of researchers, from Berkeley, said: we're not seeing accidents at the rate we would expect in environments that we know are complex and tightly coupled. Why is that? One of them was a guy named Rochlin, and he started looking at a bunch of different systems. The most famous one is the deck of an aircraft carrier. It's fairly complex, with a lot of stuff going on, and fairly tightly coupled, because of course you don't have a lot of space, but also because there are a lot of interdependent activities going on at once. So that's an environment where they weren't seeing a lot of these accidents, and they started asking: that's odd, why is that?

Well, they found a few things on the decks of the carriers. First, constant active learning. The people on the deck of the aircraft carrier are doing training pretty much all the time for whatever tasks they're doing, and they do refresher training pretty constantly as well.

They also found decentralized active review. That's the idea that if I'm resetting the hook for a plane to land or something like that, the person next to me, who might be running the fuel pump, will actually look at my work and let me know if I've done something wrong. They'll say, hey, you forgot this or you forgot that, right? So people upstream and downstream of what I'm doing will review my work in a very decentralized manner.

Rank was de-emphasized, oddly. This was the military, so this is one of those things that surprised them. They found that the lowest-ranked person on the deck of the carrier can stop operations. They can say, you know what, we don't want to land any more planes. And they would do that; the officers would listen to that and then deal with the situation that person had brought up. So that was kind of surprising to them.

Crew rotation: they rotate the crews through, I think the research said anywhere from six to twelve weeks. And if you think about it, that actually makes sense, because if I'm supposed to check the work that you do but I've never done your job, then how can I do that, right? So they would rotate people through fairly regularly. There's actually a quote in the research where one of the people on the deck of the carrier says something like, I feel as soon as I get really good at a job, I'm moved to somebody else's job and I'm learning that one. They actually thought it was a good thing, but it was an interesting perspective.

And finally, maybe most importantly: success may in fact actually be failure. This was the idea that the plane may land. It may taxi off to its parking spot. The pilot gets out and says, hey, I'm here, everything's fine. And everybody who was involved in landing that plane says: we didn't feel safe doing that. There's no damage to the aircraft. No one died. There wasn't even really a service impact; the planes behind could land, it was totally fine. But we got way too close to the edge of the envelope of what we as a group think is reasonable. And so we consider this a failure, despite the fact that anybody looking at it from outside would say it's a success.
An admiral might say, oh, but it's fine, right?

So let's summarize. We have this progression from energy and barriers, to normal accidents, to high reliability organizations. And the general thing you see is a move from static processes and repeated defenses toward active defenses and processes: processes that are changing constantly, that are able to react to what's actually going on. You also see a major shift in thinking, from technical and engineering solutions to the problem (so again, design in more safety, more process, more relays, more checks on things) toward solutions inherent to people, the organization, and really the way that we do work. That was the big shift.

Now, what does this mean in technology? How many people have, like, a backup DR site that they operate themselves still? I see some hands going up, yeah. Sometimes with clients, I used to go, hey, can I go to your primary site and just unplug cables? And they're like, no. And I'm like, yeah, but you have a DR site. Aren't you happy? Can't we just fail over? And they're like, no, let's not do that. So that's always kind of funny.

Anybody coding CORBA or COM? Remember that sadness? Okay, if you do: COM was an attempt at decomposing systems into reusable components. We want to decouple our systems as much as we can and try to simplify them by reusing components. And we see that again in microservices, a real push to decompose things. And then, finally, we're now starting to see drilled incident response, retrospectives, red teaming, and this idea of value streams. Do people know what red teaming is? Yeah, red teaming is where you hire a bunch of people and pay them to hack you, and then see what happens. Hilarity always ensues.

So, the methods. How do you do this? Well, injecting failure. Who's heard of the Simian Army, Netflix's Simian Army? If you haven't heard of it, there are a bunch of blog posts; it's basically Chaos Monkey, right? This monkey goes around and shuts off instances in the cloud. We've all heard of that. The reason I mention the Simian Army is that they have a bunch of different monkeys: they've got a Security Monkey and a Latency Monkey. So it's not just about shutting off instances; it's about battle-testing your systems.

The other thing I've started seeing organizations do is human chaos monkeys. What I mean by that is they go to someone and say, hi, it's Thursday. I'm taking your laptop and your VPN fob. You get a four-day weekend, and the rest of us are going to drill stuff tomorrow while you're gone and see what happens. Because teams find, when they do incident response: oh, you know, Bob is the only person who knows that, or Jill is the only person who knows how that AWS thing is set up. And when those people are gone, well, hilarity ensues. So they want to test that.

The other one, I actually have this in here as hardware chaos monkeys, but platform chaos monkeys is a better name. Let's say you work on a Mac, right? So they'll come and take your Mac away, give you a Windows laptop, and say, go run an incident. Because how many of us have our SSH keys on our laptop, and our GPG keys, and all of these things that we need to actually do our job? Well, what happens if that laptop gets stolen and then we have an incident, right? So they drill that kind of stuff too.
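Just to give a flavor of what "injecting failure" can look like in code, here's a minimal sketch of a chaos-monkey-style loop. To be clear, this is my own hypothetical stand-in, not Netflix's actual implementation: the instance list, the terminate_instance function, and the opted_out set are all invented for illustration.

```python
import random

# Hypothetical inventory of running instances, tagged by service.
# In a real setup this would come from your cloud provider's API.
instances = [
    {"id": "i-0a1", "service": "checkout"},
    {"id": "i-0b2", "service": "search"},
    {"id": "i-0c3", "service": "recommendations"},
]

# Services still in a grace period; new services shouldn't be
# enrolled automatically (more on dampening failure later).
opted_out = {"recommendations"}

def terminate_instance(instance_id: str) -> None:
    """Stand-in for a real cloud API call that terminates a VM."""
    print(f"chaos: terminating {instance_id}")

def unleash_chaos() -> None:
    # Only pick victims from services that haven't opted out.
    candidates = [i for i in instances if i["service"] not in opted_out]
    if not candidates:
        return  # nothing eligible
    victim = random.choice(candidates)
    terminate_instance(victim["id"])

unleash_chaos()  # run on a schedule, during business hours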
So you're seeing this sort of injecting of failure and chaos in a bunch of different ways these days.

Then there's incident command and crews; you see a lot of discussion about how we come together as an organization and as a team to respond to an incident. I could do a whole presentation just on this, so I won't say much about it except this: when I see organizations working on incident command, one of the things I notice is that they get really good at coming together to deal with the incident, but they aren't as good at dissolving the crew in a reasonable way. The analogy I make is the fire department. Your house is on fire. They come out, they get the hoses out, they put the fire out, and then they all take off their gear, drop it on the floor, leave the hoses out, and walk away with the fire truck sitting in your front yard, right? No, they don't do that, obviously. They roll the hoses back up, they drive the fire truck back, they back it into the firehouse so they can leave on the next call. But a lot of teams are really good at responding to the fire, and then once it's put out, it's like, okay. They're not as good at unwinding that incident response.

Postmortems? Yes, postmortems. We should do those. Blameless postmortems: I asked earlier how many people find this confusing or uncomfortable, like it doesn't make sense. It's okay if it doesn't make sense; I won't blame you. I actually wrote a post about this, and the headline is a little click-baity, but the point I was trying to make is that "blameless" is actually not a thing. There's a researcher named Brené Brown who says blame is a way to discharge pain and discomfort. What she means is that, in her research, we're actually neurologically hardwired to use blame to get rid of pain and discomfort. So when I see people talking about blameless postmortems, the analogy I always make is that it's like going into a meeting where we're all going to pretend we don't have arms. I can see my arms, and I can see your arms, but we're just going to pretend we don't have arms, right? So in this article, and you can read it for the nuance if you're interested, I make the case for what I call blame-aware postmortems: we're aware of the tendency to blame. Although somebody did say that name makes it sound like we need to decide who to blame first.

So, anyway: debrief the actors. You want to do this as close to the incident as possible. And this is really the distinction between accountability and responsibility. We talk a lot about accountability; we want everyone to be accountable by providing an account. That's what the word means, right? Providing an account of their behavior and what they did.

Gather the data together. We all do monitoring, right? We have everything monitored. So gather that together. And the goal here is really to put together a timeline. There are multiple levels of timeline, but in this one you can see they have the monitoring and metrics at the top, they've got that accountability we were talking about in the tasks, and then, if you've got ChatOps (I know there's an Ignite talk on ChatOps, so if you don't know what that is, stick around; I think it's tomorrow), you have a record of what people were saying as they were doing things, and that's useful.
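To make that "gather the data and build a timeline" step concrete, here's a minimal sketch that merges monitoring alerts and ChatOps messages into one ordered timeline. The alert and chat records here are hypothetical exports invented for illustration; in practice they'd come from your monitoring system and your chat tool's logs.

```python
from datetime import datetime

# Hypothetical exports from monitoring and chat.
alerts = [
    ("2017-04-26T09:02:11", "ALERT", "p99 latency > 2s on checkout"),
    ("2017-04-26T09:14:40", "ALERT", "5xx rate spike on api-gateway"),
]
chat = [
    ("2017-04-26T09:03:05", "CHAT", "jill: seeing latency, looking at checkout"),
    ("2017-04-26T09:16:22", "CHAT", "bob: rolling back deploy 4cb9"),
]

def build_timeline(*sources):
    """Merge event sources into one list, ordered by timestamp."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: datetime.fromisoformat(event[0]))

# Print the interleaved timeline; this is the raw material for
# picking out the capital-E Events discussed next.
for ts, kind, text in build_timeline(alerts, chat):
    print(ts, f"[{kind}]", text)
```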
And the whole goal behind this, if you see these T1, T2 vertical lines, is to come up with Events: capital-E Events that were turning points in the incident.

Of course, to do this you need what I call retrospective-ready infrastructure. Those are tools that cover the infrastructure (if we were talking about a plane, that would be the plane and the engines and things like that), the environment (that would be altitude, airspeed, stuff like that), and the operational aspect (that's the cockpit voice recorder). We want our architectures to account for all of that. Again, ChatOps: there's going to be more on that, so stick around; I won't go into it. tmate is a way to watch what somebody else is doing in a terminal, so you can do that pilot/co-pilot, "turn your key, sir" sort of thing while you're working an incident. That can be helpful if you're doing, I guess not pair programming, but pair incident responding. And then of course there are incident response management tools. A lot of those are coming up that are really upping the way they address not just paging people, but getting them involved in the incident. Tools that do that are really important.

All right: the landmines. Organizational incompatibility. I don't want you to go back to your job on Wednesday, walk into the data center, start pulling cables out, and then, when your boss comes in and says, what are you doing?, say, oh, this DevOps Days speaker said I should do this. Don't do that.

Anybody read this book, The No Asshole Rule? Robert Sutton is a professor at Stanford Business School, and this is a really good read, by the way. He talks about something called the total cost of assholes, which works out to something like $160,000 a year per asshole. So there are real costs associated with this. But he also said the best single question for testing your organization's character is: what happens when people make mistakes? That's an interesting thing to ask about your own organization. Good book.

Other landmines: only certain people, or only certain groups, get to fail, right? So when developers fail, it's a bug, but when operations fails, it's "you're fired because the site went down." I used to get this a lot as a build engineer: we as a build team could never fail. That's an organizational landmine to be aware of.

Then there's the anti-pattern of "stopping the line is a privilege." Those arrows are pointing at andon cords. Andon cords are from the Toyota Production System. The idea is: something's going wrong, you pull the cord, the line stops, and somebody comes to help you with it, right? Well, GM brought this over to test it here, and they built a plant with andon cords. In the first week, one of the workers pulled the andon cord, and the manager came over and said, why did you pull the cord? What's wrong? What did that manager just teach that person to never, ever, ever do again? Pull the andon cord, right?

Forgetting to dampen failure where possible. So again, people talk about injecting failure and chaos, but at Netflix, when you write a new microservice, you're not automatically enrolled in Chaos Monkey.
You have a window of about three or four months to opt out, because they don't want to be inhumane to the teams that are still trying to figure out how to operate the microservice they just wrote, right? So they dampen failure there. I also don't recommend doing chaos days without letting people in your company know. That's a good thing to do. And in fact, I've seen people who do chaos days keep a kind of andon cord for them: teams can cry uncle and say, no, no, we're failing at this incident response, and then the other teams will come and help them. So it's actually kind of team building; it's not just "let's inject this chaos and see what happens."

Only reviewing failure. We hold postmortems for incidents; we seldom hold postmortems for why the site is up. That's kind of interesting. Why do we do that? So, this is a landing in Denver, I think, and that landing looks kind of weird, but I was like, oh, that's okay. But then we might push the system a little further and get something like this. Okay, fine. You look at this, and this pilot got themselves into a pretty bad situation; you can see the wing actually touch the runway and scrape across it. But he got himself out of it, so it's fine, whatever, right? But then we have a nice sunny day, lined up on the runway, no weather, everything looks fine, and then that happens. Oops, yeah. Now, I doubt any of you want to be on any of those planes, but which one got the postmortem? The last one. Yep.

One of the other landmines is forgetting about bias. What I just demonstrated is called outcome bias, by the way. It means we bias our perception of the severity of an incident based on the outcome, whether it was a bad outcome or a good one. The ugly stepsister of outcome bias is hindsight bias. That one is really easy to spot in postmortems: why did you run that command? Why did you delete the database? Why did you run that S3 command with --turn-s3-off? Don't ever run that command with dash-dash-turn-S3-off, right? The problem is that it's not particularly useful to have those conversations, because they're about a reality that doesn't exist. It's almost as if we had a time machine and could stand over our own shoulders and say, do you really want to run that command? Are you sure? Yes, failure seems very improbable to me at this time. Oh, expletive deleted. Those are called counterfactuals, by the way, and they're fairly easy to spot in retrospectives.

Correspondence bias is one of my favorites. It's often called fundamental attribution error. I'm just going to read it: the tendency to place undue emphasis on internal characteristics to explain someone else's behavior in a given situation. This is the quintessential DevOps bias, right? Because we have development and operations, and on the slide they're smiling: I want change, I want stability. And literally the first time they have to talk to each other, it's frowny faces. The operations people: oh, those developers, why can't they just run everything on CentOS 4? It's fine, there's no problem with it. And the developers: oh, those operations people, they want me to deploy twice a year, right?

We as humans are awash in a sea of biases. This slide is just from searching Wikipedia. My favorite one, by the way, is the IKEA effect.
The IKEA effect is when we place more value on something because we built it. So it's this really crappy particle-board furniture, but I built it, so it's the most wonderful dresser or bookshelf ever. The thing here is that bias is really, really hard, because bias is built into the way our brains function. I have a clip from WHYY's The Pulse; let's listen to this. Isn't that wacky? It's kind of like a bell: you can't unring it after you've heard it.

The last really big landmine I see is deprioritizing retrospectives and learning processes. Here's the pattern. There's an incident on a Wednesday. We resolve it; it's horrible; the site is down; everybody is tired when it's done, and they're like, let's go home, we'll do the retrospective on Friday. Then, on Friday morning, the Band-Aid fix we put in for Wednesday's incident causes another incident that eats all of Friday. So when we're done with that, it's, okay, we fixed it right this time, and we're going to the bar; we'll have the retrospective on Monday. Then on Monday, team member number one is gone on a scheduled vacation, so we'll wait until they're back on Thursday. And on Thursday there's an all-hands meeting for the whole day, so we're doing it the week after that. And suddenly we're holding the retrospective for this incident two, three, four weeks later. People really do run into this problem. And I can tell you, I've seen those retrospectives: if you're holding a retrospective more than about 72 hours after an incident, it's just garbage. It's just bias. You'll be looking at the timeline, and you may have metrics, but you've lost all of that data about why did I do what I did, all of that context from when you were in the middle of the incident. That's the biggest thing that I see.

Now, back to our thought experiment for a moment. This was the clip I played for you about half an hour ago, and I told you I was going to ask questions about it. Who remembers what the aircraft's call sign was? Okay: what type of approach was it? Were they cleared for an ILS or a visual? To what runway? Or was it actually two runways? What heading did the controller tell the plane to fly before the emergency? What radio frequency did the controller hand the plane off to? Right? Half an hour ago, and most of us have forgotten all of this.

Now, a lot of people say, okay, well, this is not really fair, because it's aviation, I don't know all those weird words, blah, blah, blah. Raise your hand if you think this is kind of unfair. It's okay, my feelings won't be hurt. Nobody thinks it's unfair. Yes. Okay, one person. Perfect. But here's what I would say: how many times do we go into a retrospective meeting thinking that, because we have some sort of idea about how things work, we know what the database team dealt with, right? We go in with that preconceived notion of what the database does, so when they say, well, you know, I was thinking about this, we say, well, why didn't you think about that? And the reason I use aviation is that most of us have flown on an airplane. We understand: you get on the plane.
You go from point A to point B and you get off the plane, right? But there are a lot of activities that happen as part of that process. And when we do retrospectives, I see this with teams all the time: they think they understand some layer in the stack, so they react as if they do, and they may react poorly.

So, last couple of slides here. What we're really shooting for is what's called the Rasmussen safety triangle. Jens Rasmussen was a researcher who studied Three Mile Island; that's where he did most of his work. The idea is that there are three boundaries to this triangle. There's the boundary of financially acceptable behavior: that's cheaper, better, faster, right? The business is going to push toward that. There's the boundary of unacceptable workload: that's the idea that humans are lazy, and I don't mean that insultingly; we all try to get the most bang for our buck out of the glucose in our brains, so we try to do things efficiently. And then there's the boundary of acceptable risk. And you'll notice that, since all of us are pushing "let's do this as easily as possible," and the business is pushing us in that other direction, where are we all trotting toward? That risk boundary. And when we go over that boundary, that's when we have an incident. So in a lot of these triangles there's a dotted line inside that boundary, and the idea is that we actually want to find out, as a team, when we're coming up on that dotted line, so we can say, oh, we're getting close to that edge, close to an incident, and pull the system back as a team.

So, we've talked about a lot of things. These are the last three takeaways. Takeaway number one, the key to reframing failure: stop thinking about incidents as events that went wrong, and start thinking about incidents in terms of your team's response to them. Another way of thinking about this is: develop your incident immune system. It gets better the more you actually use it. Two, to make this practical, your operations and infrastructure need to be retrospective-ready. They need to have those black-box recorders, whatever that means for your particular system and organization. And probably the most important one: the only thing we directly control in the complex environments we operate in is our reaction. In high-tempo situations in complex systems, it's hard, if not impossible, to control the system directly. Thanks.