complex system failures and blameless retrospectives. Let's talk disasters. And just so you know, there's no loss of life, there's no injury in this one. It's just a disaster for a bridge. So Seattle has the two longest floating bridges in the world. This is a picture of the I-90 bridge over Lake Washington. Some of you who are locals may have seen it or even gone over it. In 1990, one of the spans of this bridge sank while it was being repurposed. Things are more likely to break when you are fucking with them. So they were doing hydrodemolition on the surface of the bridge: they were using pressurized water to blast off the paving. They were basically gonna change it from a two-way span to a one-way span and then build another span alongside it. In the process of doing this, the EPA was like, you can't let that crap run off into the lake. That water is hazardous at this point. You gotta do something with it. So the engineers came up with a plan to catch the water before it ran into the lake and then truck it away, basically. But in order to catch it and then truck it, you've gotta store it in between. Now, they had discovered that the pontoons for the bridge were over-engineered and the bridge had been floating higher than it needed to its entire life. See, the pontoons are hollow. They're made of concrete, but they're hollow. That's how they float. So they decided to put the wastewater from the demolition into the pontoons. This sounds stupid, but it actually wasn't. They looked at it super carefully. They examined the integrity of the pontoons. They determined how high the bridge was floating. They determined how much leeway they had. They added extra buffers. They were super careful about how they designed this and how much water it was gonna be able to store, and they set up a whole schedule for pumping the water away. Should have been fine.
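Just to make that leeway calculation concrete, here's a back-of-the-envelope version of the kind of math the engineers would have had to do. All the numbers here are made up for illustration; they are not from the actual bridge or the investigation.

```python
# Hypothetical freeboard calculation -- illustrative numbers only,
# not from the actual I-90 bridge investigation.

WATER_DENSITY = 1000.0  # kg per cubic meter (fresh water)

def max_storable_water_kg(length_m, width_m, freeboard_m, safety_margin_m):
    """How much extra mass a floating pontoon can take on before its
    freeboard drops below the chosen safety margin.

    By Archimedes' principle, each meter the pontoon sinks displaces
    (length * width * 1 m) of lake water, supporting that much mass.
    """
    usable_draft = freeboard_m - safety_margin_m
    if usable_draft <= 0:
        return 0.0
    waterplane_area = length_m * width_m               # square meters
    displaced_volume = waterplane_area * usable_draft  # cubic meters
    return displaced_volume * WATER_DENSITY            # kilograms

# A made-up pontoon, 100 m x 18 m, floating 0.6 m higher than it
# needs to, keeping a 0.3 m buffer:
extra = max_storable_water_kg(100, 18, 0.6, 0.3)
print(f"{extra / 1000:.0f} tonnes of wastewater")  # prints "540 tonnes of wastewater"
```

The point of the sketch is that the margin looks generous on paper; the failure came from loads that weren't in this simple model, which is where the story goes next.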
The biggest national holiday on the American calendar is a four-day weekend which many of you are probably familiar with called Thanksgiving. There was a big storm. Nobody was working on the bridge, of course, because it was Thanksgiving. One of the pontoons started taking on lake water, resulting in the slow sinking of eight pontoons. It was slow enough that they got it on video. And I'm only gonna play a part of this in the interest of time. Built in 1940... Water level inside the pontoons was carefully monitored so the structure would not be compromised. Over the Thanksgiving weekend, with bridge crews shorthanded, a heavy storm hit the Seattle area. Rainwater drained into the open pontoons. Winds from the storm whipped up the surface of the lake, propelling splash water into the pontoons as well. On November 25th, eight waterlogged pontoons sank into the lake, taking those sections of the bridge down with them and severing 12 anchoring cables of a bridge that was being constructed nearby. While luckily no one was injured, the cost of damage was estimated at $69 million. An investigation of the accident came to a sobering, if unsurprising, conclusion. After a very extensive study and very sophisticated computer modeling methods, the conclusion was that concrete pontoons simply don't float when they're full of water. So that's a really nice sound bite, right? Very vivid. But it's really reductionist. This was a giant disaster. It required a giant investigation. They checked all that shit. I'm not gonna read it. You don't have to read it. They did a lot of investigation. There's a paper put up by UW about this disaster and the investigation they did. I read it. It was both fascinating and boring, because it's an academic paper. This is a lot of text, but I'm gonna read it to you, because it turns out to be the shortest way to describe how the bridge sank.
The loads that created significant leakage were the combined effects of all accumulations of water, including rain after the windstorm, longitudinal flow, that's like water flowing along the surface of the bridge, and pumping through November 24th, 1990. These loads caused static moments, pressures, that exceeded the threshold for leakage. Existing cracks were opened sufficiently to allow water to leak into the pontoon. Progressive and accelerated sinking began at this time. So that's a lot of stuff, right? Like a lot of things had to go wrong to get here, even though people were putting water in the pontoons. And that was almost a good retrospective. They didn't go far enough. There were five things required for this failure. Extra water in the pontoons over the amount that they had determined to be safe. Uneven pumping out of the pontoons, causing uneven stress on the bridge. Rainwater accumulation in the pontoons on top of the water they were already storing. Existing cracks in the concrete that were non-leaking until all of these other loads were added to the bridge. And wind and wave stress from the storm. All of those things were required to sink this bridge. It wasn't just that one person screwed up; it wasn't one stupid error. But the official investigation missed something else, even with all of that. It mentions that crews had been having trouble sticking to the pumping schedule before the storm happened. That's why there were uneven stresses on the pontoons: there was high water in some places and low water in other places, and so the bridge was getting twisted a little bit by the loads in different locations. But we don't know why they were having trouble sticking to the schedule. That was never investigated. It doesn't mean the pumping crews were at fault. It doesn't mean the schedule was at fault. It doesn't mean the engineers were at fault, necessarily, right? It could have been any of these things.
Complex systems have complex failures. It means that the whole human system was somehow at fault. But since they didn't investigate that, we'll actually never know what happened. Things that could have been going on: the trucks might have had less capacity than they thought. They might not have had enough workers for the pumps. They might have mismeasured something about where the water was. And those are just some guesses off the top of my head, right? We won't know. But the important takeaway here is that no one human action could be the sole cause of a failure like this in a complex system. So let's talk about doing better. This is gonna be the highlights reel, because I usually give this talk in about an hour and I've got about 30 minutes. First, I wanna talk about blame. Blame is the best way to not find out what happened, which means it's the best way to not get better and not be able to do better next time. Medicine, aviation, fire response, other fields have been talking about this for years. They all agree that if you're gonna do this, you can't blame anybody. You have to be looking at the system and figuring out why people did the things that they did at the time, remembering that people mostly don't do things that don't make sense. But we're all super good at blaming. English is super good at blaming people. We use "you" a lot, and "you" is very blaming language. So we have to put conscious effort toward not blaming, the same way that we put conscious effort toward inclusive language and not using slurs, and not using accidental slurs like "lame." It's hard. Most of this talk is gonna be pointers about that. So the old way that we used to do this was invented by a guy called Sakichi Toyoda. It was used and publicized by Toyota Motor Corp. It's called the Five Whys. If you ask why something happened, you will get an answer that involves a person doing a thing. Why did the vase break? Because I'm clumsy and I knocked it over.
Think about "because I'm clumsy and I knocked it over" for a sec. There's a lot going on there, and a lot of it is about how I'm not very good at something, maybe walking. That's called agentive language. Agentive language is strongly remembered by English speakers, and it has blame built in. Because I did a thing. English is slanted; it sort of tends toward agentive language. It's a way that we're all really used to speaking. So asking why questions gets you this blaming language that you remember really strongly, and what you remember is that I'm a klutz who breaks things I care about. Asking why questions gets you answers in agentive language. So let's ask a different question. How did the vase break? I tripped over the cat while looking for my glasses and I knocked it off the shelf. There's a lot more to work with here, right? If we want to make sure that I don't break any more vases, there's a whole bunch of things that we can look at. We can look at how it was that I couldn't find my glasses. We could lock the cat out on the balcony. We could put earthquake gel under the vase so that it can't be knocked over. Next question. How was I able to knock the vase off the shelf? Well, it's just sitting there, right? You hit it with your elbow or something, it falls down. There is this thing, which I mentioned a minute ago, called earthquake gel. You stick it on the bottom of the vase, and that way, when you've got your bookshelf bolted to the wall, which I don't, because I'm a bad earthquake-country dweller, and there's that blaming language again, the bookshelf doesn't fall down and the vase doesn't fall down because it's stuck to the shelf. It's kind of like that poster putty that you can take off and put back on. So we could ask, how is it that I don't have earthquake gel under this vase? Well, I bought it recently. I haven't had time to use it yet. I've never used it before.
I haven't read the instructions in the packaging yet, so that's a task I want to do when I have a lot of concentration, since I'd be using a new tool for the first time. And instead, I was writing this talk. This is the vase, by the way, that inspired the story. It is not broken, but I would be heartbroken if it were. It was made by a dear friend of mine, a really amazing artist. Come find me later if you want to see his Flickr or buy any of his stuff. So: human error is not a root cause. If you say, what happened here is the human made a mistake, that's not useful. We can't fix that. It's also, again, that blaming language, right? So the questions to ask are things like, how did we allow the human to make this mistake? How did the system not help the human do better? We can't prevent humans making mistakes. It's built in, right? We're organisms, and we do things, and sometimes we haven't slept or haven't eaten and we're hangry as shit. Stuff happens, right? You can't make stuff not happen, but you can try to make it better when stuff does happen. Try to make it easier to not have a really bad screw-up. No one does things they think are gonna blow up the world, right? No one does stuff on purpose that they think is going to cause a disaster. I'm not talking about terrorism; I'm talking about retrospectives. People do the best they can with what they know and the resources that they have, right? So whatever it is that you're doing in your life, your professional life and your personal life, you're probably not trying to screw it up, right? You're probably doing everything that you can given what you know, but maybe you don't know the things that you need to know. Asking about what a person knew and what the conditions were at the time that something happened will give you good insight for effective change, right? We had that discussion about how I was looking for my glasses and I tripped over the cat.
Those are two really important things to understand if we want to prevent me breaking more stuff that I really care about. Everything that happens has multiple factors. You wanna try to investigate all of them. I think about this as using the right-hand rule to walk through a maze. This is a mathematical concept where, if you have what they call a true maze, which is one where all the paths are connected and there are no isolated paths, you can put your right hand on the wall and keep walking, and eventually you will find your way out. You might have to traverse all of the branches of the maze first, but you will find your way to the exit if it's a maze with all the walls connected. And I think about it like that because asking these questions turns into a tree. You start with "how did the vase break?" and then you move on to "how did you lose your glasses?" or "how were the glasses not where you thought they were?", some question like that. Keep going, right? And then we get to: I was writing this talk instead of putting earthquake gel on my vases, and that's the end of one branch. We could go down the branch about cats. Like, can we change cats? Well, we could, but it might not be very ethical. It also might not help very much. They might just get worse. So maybe the end of that branch is not very useful, but we explored it, right? We talked about it. The end of a branch should contain either a thing that we can do differently, or something that's too much effort compared to the size of the change that we're trying to make. Like genetically engineering cats: that's a lot of effort, and it's probably not gonna help in my lifetime. That's how you know you've found the end: either you've found something that's ridiculous, or you've found a thing that can actually be done. Remediation items, the things at the end of these branches, what we could do differently next time, need to be actions or tasks which can be completed.
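The branch-walking idea above is really just a depth-first traversal: each "how" question is a node, each answer opens child questions, and a branch ends when you hit something actionable or something absurdly out of proportion. Here's a toy sketch using the vase example; the tree contents are made up to match the talk, not any real retrospective tool.

```python
# Depth-first walk of a "how did this happen?" question tree.
# The tree itself is hypothetical, echoing the vase example.

tree = {
    "How did the vase break?": [
        "How could I not find my glasses?",
        "How did the cat end up underfoot?",
        "How was the vase able to fall off the shelf?",
    ],
    "How was the vase able to fall off the shelf?": [
        "How is there no earthquake gel under it?",
    ],
    # Leaves: either an actionable change, or a dead end that costs
    # more than the problem is worth (like re-engineering cats).
    "How could I not find my glasses?": [],
    "How did the cat end up underfoot?": [],
    "How is there no earthquake gel under it?": [],
}

def walk(question, depth=0, out=None):
    """Visit every branch, like keeping your right hand on the maze wall."""
    if out is None:
        out = []
    out.append("  " * depth + question)
    for child in tree.get(question, []):
        walk(child, depth + 1, out)
    return out

for line in walk("How did the vase break?"):
    print(line)
```

Like the right-hand rule, this visits every branch exactly once before it's done, which is the property you want from an investigation: no path left unexplored.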
You can't say things like "try harder." They need to have measurable end states or goals. And if your idea includes something like "stop doing X" or "start doing Y," you need to figure out what tools and processes in your organization are going to help you stop or start doing those things. You can't just say, humans, stop doing the thing, or humans, start doing the thing. There are reasons why the people in your organization are doing what they're doing and not doing what they aren't doing. So you have to figure out how to effect a change that is lasting, by changing the environment and helping people move onto this new path that you have imagined. "Try harder" is not something that you can do that has a measurable end state. You can't really tell if someone's trying harder or less hard. The fallacy is, well, now that we know about this kind of mistake, we can just not make it again. But hanger happens. The same goes for "work harder," "pay more attention," "be more thoughtful," anything that's "keep doing what you were doing, just do it better." It's not an action item, right? You can't just wake up tomorrow morning and be like, I'm better now. That's not how it works. And when you're conducting a retrospective, don't be afraid of silence. Wait until you're uncomfortable, and then wait a little longer. Sometimes it takes people a while to find their voice. I notice this when people are waiting for questions at the end of a session at one of these conferences: they wait just long enough for someone who's not sure if they wanna speak to start raising their hand, then close the computer and leave. So what's my point? Humans created everything in your infrastructure. The systems, the other humans, the tools, the software, the hardware, the desks, arguably the cats, they're domesticated. Which means that all problems are human problems. Literally everything that exists around you is the way that it is because of a human making one decision or many decisions.
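If you wanted to enforce the "measurable end state" rule mechanically, it could look something like this toy lint for proposed remediation items. The phrase list and the fields are my own invention for illustration, not an actual process from the talk.

```python
# Hypothetical lint for retrospective remediation items: flag
# "try harder"-style items that have no completable end state.

VAGUE_PHRASES = (
    "try harder", "work harder", "pay more attention",
    "be more careful", "be more thoughtful",
)

def lint_action_item(description, done_condition):
    """Return a list of problems with a proposed remediation item."""
    problems = []
    lowered = description.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            problems.append(f"vague exhortation: {phrase!r}")
    if not done_condition.strip():
        problems.append("no measurable end state given")
    return problems

print(lint_action_item("Try harder during deploys", ""))
# flags both the vague phrasing and the missing end state

print(lint_action_item(
    "Add a pre-deploy checklist to the runbook",
    "Checklist merged and linked from the deploy docs",
))
# prints "[]" -- a concrete task with a checkable done-condition passes
```

The real filter is human judgment, of course; the point of the sketch is just that "completable" is a property you can actually check for, and "trying harder" is not.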
Retrospectives are how we learn about how all of these decisions have unexpected consequences. Learning that allows us to change the systems that we're part of so we can make different mistakes next time. Let's go make bigger and more interesting mistakes. So if you're interested in how we do things: I work for Heroku. We have an incident retrospective template which my manager posted in Heroket Hub. I will tweet this as well. I'm hash-octothorpe on Twitter. And if you want to come and talk to me, again, this was the highlights reel, so if you wanna talk in more detail, come and grab me. Or I think I have time for questions. That was shorter than I expected, so I will wait. Sakichi Toyoda, with a D. Oh, I'm sorry. She asked who the Five Whys came from. It's Sakichi Toyoda with a D: instead of a T-A, it's a D-A. That's a great question. So the way that we do it, and I think this is, oh, I'm sorry, yes. What's the smallest group that should do a retrospective? I mean, the sort of snarky answer, but also the true answer, is one. I find myself doing mental retrospectives for things all the time. That's kind of how the vase story happened. I didn't actually knock over a vase, but I almost did. In the workplace, what we do is we collect the people who responded to an incident. So in our case, that's what we call the incident commander, the person who sort of orchestrates the response; the communications person, who writes the public post that we make telling people what's wrong; and whatever engineers were engaged to fix the problem. And it's useful sometimes to have extra folks sit in if they want to listen, but it's the job of the retrospective facilitator to make sure that they don't derail the conversation. And we try to avoid having a retrospective without any of those people.
Like if we're missing one of those critical people, we try to reschedule, because you can't answer all the questions about how something happened if you don't have all the people who were present when it happened. Oh, what a good question. Do we exclude people with hiring and firing power, managers or whatever, from retrospectives? Does that summarize it? Okay. No, and actually what we usually do is the manager of the team whose component failed facilitates the retrospective. And we started doing that because, one, that person knows the system and the people involved pretty well. And two, having my team, the SRE team, do all of the retrospectives meant we were doing absolutely nothing else with our lives, and that was sort of boring. But we have a culture where, in some ways, my team and I personally are responsible for making sure that these retrospectives occur in a good way. And sometimes we have to step in when someone's like, I screwed up, and say, no, this is not the place for "I screwed up"; this is the place for "what happened?", "why did that seem like a good decision to you?", "how can we get you better information next time?". And I think one of the reasons this works is that my team owns the process and staffs the incident commander role, so one of us is in every retrospective. So, after the retrospective, how do we take what we learned and what we want to do and put that into future designs and product work? That's still a fairly tough part for our organization. We don't have a super cohesive planning strategy across the organization. But what happens is that each team creates usually a Trello card, sometimes GitHub issues depending on the team, and feeds that into their backlog. The major problem that we're working on right now as an organization is what to do with the things that are not smaller changes, that aren't just pull requests, but that are projects.
My team is actively working with our product team right now to try to figure out how to feed that into our quarterly planning process better. My request, which I hope we'll get to maybe in the next quarter, is that we ask teams to devote at least like 15% of their project planning each quarter to remediation items. That is an excellent question. The question is, is this discoverable if there's legal action, right? Is this documentation, could it be part of an investigation? Are we protecting our individual people? And I don't know the answer to that, so thank you for asking. I'm gonna go find out. Are we out of time? I'm sorry. Come and find me later. Woo!