Man, these old letters sure were getting rickety. I'm just happy I was able to change them out before they reached the end of their shelf life. When I was an engineering undergrad, many of my classes emphasized the steep price of failing to design systems with enough rigor or margin for error. The Hyatt Regency walkway collapse, the Quebec Bridge disaster, the Boeing 737 MAX. History is littered with sobering reminders of just how badly things can go wrong if an engineer gets a little too comfortable and lets their guard down. It's a good lesson for engineering students: every decision we make, or don't make, matters, and failing to think through and account for possible failures isn't just negligent, it can have tragic or catastrophic consequences. But these engineering horror stories are often told in a way that implies something like: if you're a good enough engineer, if you're diligent enough to catch the singular issue that leads to disaster, everything will be fine. And that's not always how these awful things happen.

Take the 1979 nuclear accident at Three Mile Island, a disaster so frightening that it halted construction of new nuclear plants in the US for over 30 years. Some technicians performing a routine cleaning procedure accidentally got a bubble stuck in a sensor, which eventually caused the main coolant pumps to stop circulating water. Valves to the auxiliary pumps had been closed for maintenance, in violation of the plant's operating procedures. So even though the reactor detected a problem and went into emergency shutdown mode, there was no water circulation and nowhere for the heat to go. A pressure relief valve triggered to vent the overheating reactor coolant system, then suffered a mechanical failure and got stuck open. The control room indicator light for that valve only showed whether the valve's closing mechanism was powered on or off, not whether the valve itself was open or closed. And the open valve slowly leaked away the water needed to cool the reactor. There was one coolant temperature sensor spiking in a remote corner of the control room, but the crew's training for abnormal incidents directed them to examine other sensors, which said that everything was fine, or even that there might be too much water in the system rather than not enough.

The sheer number of things that went wrong at the same time before the plant suffered a meltdown and vented some radioactive material to the surrounding area is absurd. It's really a testament to the designers that even with this comedy of errors, it took 11 hours to go from the root cause to meltdown. Well, I say root cause, but if you really think about it, that's not a useful way of parsing this sequence of events. The bubble in the sensor wasn't the beginning and end of it; if anything else in the causal chain had behaved differently, Three Mile Island would probably still be pumping megawatts of clean energy into the Pennsylvania electrical grid.

This sort of event, sometimes called a normal accident, is the wheelhouse of Dr. Richard I. Cook, a physician and researcher who spent a lot of time thinking about how and why things like nuclear power plants, medical devices, spacecraft, and other important systems with lots of moving parts stop working. After picking apart dozens of high-profile disasters like Three Mile Island, he developed a compelling model of system failure that's much different from the way I'm used to thinking about things.
Cook's high-level analysis of how systems fail goes something like this. Humans don't build complex systems just for the hell of it. We build them in response to some perceived need that lies beyond our unassisted capabilities. And the systems we really worry about are built to render some intrinsically hazardous thing, like a nuclear reactor, safe for humans to work with. All the thought that goes into such a system, every moving part and subroutine, is designed with the ultimate goal of allowing us to perform some risky task while making sure nobody gets hurt. And unless it was designed by idiots, it's going to have a fair amount of redundancy and margin for error to keep that hazard in check.

The people using the system usually aren't idiots either. If something looks broken, they're likely to notice and fix it when they can to keep everything working smoothly. But maintenance and production are usually at cross purposes: if you shut everything down every time you have to change a light bulb, you're never going to get anything done. Usually, the people running the system lean on the robustness of its design to keep things rolling in spite of these issues, and repairs happen when they're necessary, or when there's a convenient opportunity. This means that complex systems run in a constant state of slight disrepair, with a stochastic pattern of small things breaking and getting fixed. And as we saw with Three Mile Island, catastrophic failure only occurs when enough of these small errors line up in exactly the wrong way, like the world's least welcome game of Tetris.

This is an interesting way of looking at complex system failures, but the rubber of Cook's theory really meets the road after a disaster has occurred, when we're picking up the pieces and trying to figure out what went wrong. Hindsight bias is a well-documented error in human cognition. When we know the outcome of a series of events, we tend to contextualize those events with that outcome in mind, imagining that it's the only thing that could have possibly happened. If someone gets into a car accident and you learn they were fiddling with their stereo while driving, your brain helpfully draws a crystal-clear arrow of causation between those facts. Obviously, if they hadn't fiddled with the stereo, the accident never would have happened. Never mind that people fiddle with their stereos all the time without issue, or that the accident might have occurred regardless. Your brain is looking to explain some event, and will happily latch onto anything remotely plausible as the reason why. Doubly so if it means we get to hold someone accountable.

When we're looking at the contributing factors to a disaster after it's already happened, that hindsight bias is ready to recontextualize everything in the timeline as obviously contributing to it, in a way any idiot should have seen coming. And as per Cook's model, there's always going to be a ready supply of smoking guns that seem to indicate negligence or incompetence, just because in a complex system, there's always something going wrong. The auxiliary pumps were switched off in violation of protocol. The indicator wasn't measuring the state of the valve. The operators didn't check the other gauges. These all seem like red flags that someone should have picked up on, but we don't ask how many times those same issues didn't lead to a meltdown.
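To put some toy numbers on that "errors lining up" idea, here's a tiny Monte Carlo sketch of my own. None of this comes from Cook's paper, and every parameter (fault rates, repair rates, how many overlapping faults it takes to count as a catastrophe) is made up purely for illustration:

```python
import random

# Toy illustration of the "normal accident" picture (my own sketch, not
# Cook's): components quietly drift into faulty states and get repaired a
# little while later. A catastrophe only happens on the rare step where
# several latent faults happen to overlap at the same time.

def simulate(n_components=20, p_fault=0.01, p_repair=0.2,
             overlap_needed=5, steps=200_000, seed=1):
    rng = random.Random(seed)
    faulty = [False] * n_components
    steps_with_faults = 0   # steps where at least one thing is broken
    catastrophes = 0        # steps where enough faults line up at once

    for _ in range(steps):
        for i in range(n_components):
            if faulty[i]:
                # Operators notice and patch problems opportunistically.
                if rng.random() < p_repair:
                    faulty[i] = False
            elif rng.random() < p_fault:
                faulty[i] = True

        broken = sum(faulty)
        if broken > 0:
            steps_with_faults += 1
        if broken >= overlap_needed:
            catastrophes += 1

    print(f"steps with something broken: {steps_with_faults / steps:.1%}")
    print(f"steps with a catastrophic overlap: {catastrophes / steps:.3%}")

if __name__ == "__main__":
    simulate()
```

With these made-up numbers, something is broken on well over half of the steps, but the full five-fault alignment shows up only a small fraction of a percent of the time. That's the environment the red flags live in: almost all of them, almost all of the time, are just background noise.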
We also ignore the 50 other things that weren't working properly but didn't contribute to the disaster, at least not directly. We just see a big arrow pointing at the inevitable catastrophe and wonder why nobody did anything about it. That hindsight bias is aggravated by the way we tend to compartmentalize safety and risk. When everything's going well, we're happy to evaluate people based on how well they produce. But in the context of an accident, we suddenly switch gears and judge them as gatekeepers of safety. Both roles are real. Every actor in a complex system is constantly weighing risk against their nominal goals as a producer, but we rarely think about how they're always making these judgment calls unless they happen to make the wrong one. I didn't get a good night's sleep last night. Should I call in sick and piss off my boss, or will I be fine with a second cup of coffee? The forklift has a seatbelt, but it makes it harder to get in and out quickly. Should I use it or bypass it? The organizations running these systems can make big noises about safety being a priority, but without well-defined and enforced safety procedures that are routine for everyone who deals with the system, we can't act surprised if, at the end of a long shift, the 50th judgment call someone makes about safety versus expediency happens to be the wrong one.

So complex systems and their operators are constantly experiencing and fixing small, usually uneventful errors. When those errors happen to align, things blow up, at which point we instinctively blame any humans in the chain of causality for making unsafe choices. For Cook, the idea that we can pin an entire catastrophe on one bad choice leads to all sorts of counterproductive nonsense. We add more moving parts to the system to take humans out of the loop. We change the policies and processes operators have to follow, often without retraining them. If failure of a complex system is a stochastic process, all we're doing with this stuff is mixing in a big batch of new variables that might fail and expecting things to get better.

Cook's model stands in stark contrast to the way I was taught about engineering disasters in school. Rather than emphasizing individual decisions as guarantees of safety, he stresses the continuous, dynamic nature of keeping failures in check and adapting around them. He suggests that it's impossible to make a system free from all errors forever, but that empowering people to spot and fix errors as they crop up makes the system more adaptive and robust. That means training and experience, providing ample resources and opportunities to patch things, and understanding what pressures we're applying to the individuals who make those moment-to-moment decisions about safety. I can do my absolute best to engineer things so they won't break or confuse the people who use them, but I don't get to decide what will go wrong and when. All I can do is give the people using my designs the best possible shot at keeping the lights on.

What do you think of Richard Cook's model of complex system failure? Are there any systems you can think of that might benefit from this sort of analytical lens? Please leave a comment below and let me know what you think. Thank you very much for watching. Don't forget to subscribe, like, and share.