Y2K and other disappointing disasters. A warning up front: I'm going to talk about disasters and people dying. Not just little disasters, disasters where people die, so if that's not a thing you can handle right after lunch, no shame in leaving. I thought I'd warn you ahead of time.

So, how many of you were working in software or technology in 1999? The few, the proud, the brave, the grizzled. It was December 1999 and a young sysadmin of my acquaintance had just bet her boss that their systems wouldn't go down. The stakes of this bet were her job. If anything went down, her boss said, she did not need to bother coming back to work. He wanted her to stay on the servers all night and watch them to make sure they stayed up, but she was 20-something and wanted to drive from Des Moines, over here, up to Minneapolis, four hours away, and go to a party with all of her friends. And she wasn't worried about Y2K.

So, a lot of you may get pages. This is how a pager used to look. This was actually the modern version; the old ones just showed the numbers along the top. Back then, you wore it on your belt until you dropped it in your toilet, and then you got a better belt. And when it went off, you had to go find a pay phone so that you could call the person back, because all you got was a page with a number and maybe a little tiny snippet of data about what went wrong. So you'd go to a club and there'd always be some harassed-looking person hanging on a phone outside the club like this, going, it went down? Yeah? No, come in. Sometimes you'd actually have to hang out of cars to talk into pay phones. They were not actually super convenient and they were always located somewhere super busy. And it didn't really matter what you got paged about, because you always had to go and fix it in person with your actual fingers, because remote administration was in the future and we weren't in the future yet, because it wasn't 2000 yet.

So, our sysadmin had her pager, but she knew it wasn't going to do any good four hours away. It might as well have been the end of the world; it didn't matter. But she left anyway, because she knew nothing was going to go wrong. And she knew nothing was going to go wrong because she had spent all of 1999 upgrading and hardening systems. And this didn't mean a trivial deploy. This meant hand-wiring in a floppy drive so that you could put the floppy in the floppy drive and upgrade your router. That was what it meant. You had to write to people and ask them to send you physical pieces of software. Magazines were full of patches and fixes and alarmism. It was roughly an equal distribution. Maybe a little more alarmism than everything else. We spent a lot of time preparing for Y2K.

So, our sysadmin had patched all her scripts and installed all the software updates and traced back every piece of hardware, software, firmware, wetware, and cabling. If we had had DRM then, it would have been even more nightmarish than it was. Traced it all back, made sure it all worked, tested it. She knew nothing was going to go wrong.

As we get further from the reality of what it was like to do Y2K mitigation, we've been forgetting. How many of you were in school during Y2K? Right. If you never experienced this, this was not your problem. We get further and further from the millions of dollars and person-hours that went into reducing our risk. Utilities did fail, but they failed in, like, July of 1999, in isolation from the grid, because they were testing. And they figured out how to mitigate.
And so we did not have a cascade failure of all of our power systems, because they knew that it was going to stay up. And probably somewhere between a quarter and a half of the people you know who were in the industry did not go out and party like it was 1999. They stayed in their server rooms. They stayed in their war rooms. They stayed on conference bridges to make sure that the world did not end.

Back then we didn't really have the idea of push updates, and we didn't really have the idea of regular updates. The longer your system stayed up, the better it was. If you had to take it down to patch it, something had gone wrong. People would count their server uptime in years, as a point of pride. So it was a really different paradigm to take everything down and patch it. The space program, which we think of as this epic collection of labor and material that got a man to the moon, we think of that as amazing. It's a tiny fraction of how nightmarish it was to do Y2K.

So it's easy now for us to say that it was overblown because nothing happened. Obviously it wasn't that big a deal. Why were we so het up about this? It wasn't a real thing. Well, it wasn't a real thing because we worked really hard to make it not a real thing. And it's easy to forget the disasters that don't happen. Nobody remembers all the times that they almost backed into somebody or almost ran over a kid in the road. It's scary for a while and then you move on.

All of this Y2K preparation was risk reduction. We knew that there was something that was going to cause a problem and we did everything that we could to prevent it from happening. We do risk reduction in a lot of ways: we do vaccinations so we don't get diseases, and we do anti-lock brakes so we don't hit the kid in the road, and we do train gates so people stop parking on train tracks. We understand how to reduce the risk of things.

And how do we do this? What are the essential elements of reducing risk? It's impossible to avoid all your risk. You can't do it. We make trade-offs with risk all the time, and those trade-offs are complicated and sometimes illogical. Air travel is much safer than driving, but we still prefer to drive the vast majority of the time because we feel like we're in control. There are lots of subtle things that we do to reduce risk, but rather than becoming overwhelmed at the idea of trying to avoid all the risk in our lives, we can figure out what is in our control and reduce the risk there. And we have to decide how much risk we can tolerate in each zone.

For instance, I'm a cyclist, and I cycle on the roads in the winter in Minneapolis, and that's a risk I'm willing to take, and people die doing it every year. Cyclists die because they get hit, because people don't see them. Well, I try to mitigate that by wearing bright clothes, but I also carry a lot of life insurance. I've accepted that for this lifestyle I'm just going to have to deal with that in other ways, and if I die, my family needs to be protected. That's how I'm securing my zone. In another example, I work directly adding documentation to our APIs, and there's a risk that I could break a customer's integration if I do something wrong, but we have a standard way of preventing that, which is that somebody goes over my commits before we actually push them and makes sure I haven't done something insanely stupid, which only happens once in a while, but you really want to avoid it.
This is so standard that we don't usually think of it as a risk reduction strategy. There are lots of risk reduction strategies that we've just built into our lives, like feature flags and development branches and test instances and canary launches and deployments, all sorts of things that we think of as just best practices, but we'd never trace that best practice back to avoiding risk.

Another way to think about reducing your risk is to predict the states something can end up in. I love state machines. I am the English major most likely to draw a little heart emoji around a state machine, and the formal way to think about all the ways your stuff could end up is called a finite state machine. What are the transitions, what are the end states, and how do they get there? You can draw this all out for almost every piece of software, and the more horrible and complicated it is to draw out, the more you should be thinking about microservices, because once you get a finite state machine that's, like, this big, it's very hard to secure properly. So to use this to assess risk, you enumerate all the possible states your system could be in and what would make that happen, and then you can address the transitions that would cause that.

So here's an example. I want somebody to put a zip code in. I know you can't read this, it's okay. I only want them to put in a five-digit zip code. I don't want extras, I don't want unders, I don't want anything invalid. So my state is an analysis of what they entered. If it has letters, if it is more than five characters, if it is fewer than five characters, I reject it. Try again. Enter a real zip code this time, you know what it is. Only five numbers will get through my filter. Those are the only two possible states. So having figured out what my two possible states are, I can derive what the risk of somebody doing it wrong is and fix that. And if I forget one, I can add it to the list, and be like, oh yeah, you also can't use special characters. You've now reduced your risk of invalid shipping and/or weird injection attacks.

Harm mitigation is the flip side of risk reduction. Harm mitigation doesn't matter until something goes wrong. No matter how much we try to avoid risk, something is going to go wrong. The truth is that bad things do happen all the time. This is why we wear seat belts and life jackets and install fire sprinklers and smoke detectors and doors that open outward. Usually we don't have house fires in the first place, but if we do, we'd like to survive them. So the most concrete example of harm mitigation that I'm going to talk about is building codes. You can also think of seat belts as harm mitigation. We hope you don't get in a car accident, but if you do, we hope you don't go through the windshield. Stuff has already gone wrong. You're already in an accident. How can we make that less bad? RAID arrays are built on the assumption that some of your inexpensive disks are going to fail. You just have to have the data distributed across enough disks that you can accept that risk.

But building codes. Building codes save lives. In 1985, an 8.0 earthquake devastated Mexico City. At least 10,000 people died. Mexico City is built on a lake bed that was sort of, kind of, filled in. It's really prone to liquefaction, and when a quake hits it, everything jiggles like a bowl full of jelly. It's not an ideal substrate for any kind of seismically active zone. And this was the hospital. That was one of the better-built buildings.
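To make the zip-code state machine from a moment ago concrete, here's a minimal sketch in Python. The two end states and the rejection rules are just the ones described above; the function and variable names are illustrative, not anybody's real validation code.

```python
VALID = "valid"
REJECTED = "rejected"

def classify_zip(entry: str) -> str:
    """The only two end states: a real five-digit zip code, or try again."""
    if len(entry) != 5:        # extras and unders both get rejected
        return REJECTED
    if not entry.isdigit():    # letters and special characters get rejected
        return REJECTED
    return VALID               # only five numbers get through the filter

# Every rejection is a transition back to "enter a real zip code this time."
for attempt in ["5540", "554011", "five!", "55401"]:
    print(attempt, "->", classify_zip(attempt))
```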
The whole of Mexico City pretty much looked like that hospital. And this is the earthquake that happened this year; that's pretty much the worst picture I could find of what happened. In the intervening 30 years, Mexico City insisted on rebuilding with stronger, better, more robust building codes. Nobody put up seismically unsound buildings, because they knew what was going to happen. And they had this really fresh, vivid memory of thousands and thousands of people dying in building collapses. So when Mexico got hit by a 7.1 earthquake, the modern buildings survived well, and it looks like the Mexico City death toll will be 228. And this is not because Mexico City is smaller. Its population is 25 million. They've even installed an earthquake early-warning system that detects the super-low-frequency vibrations that arrive earlier. People in Mexico City have a 10-second warning before they get hit by actual tremors that they can feel. Mexico City knows that they are in a bad place, and they are just doing as much harm mitigation as they can. They're going to get hit by earthquakes, but they can mitigate how bad it is.

The last example of building codes is even harder for me. On your right is Grenfell Tower. It's a 24-story residential building. We still don't know how many people died in this fire. We haven't been able to get in and find them all. A lot of things went wrong, but one of the primary things that went wrong was building codes. It had a flammable cladding that wicked the fire up the side of the building, defeating the fireproof concrete-box construction. The cladding outgassed cyanide when it burned, and there was one evacuation stairway for 24 floors. And it was blocked. There were refrigerators, there were mop buckets, there was all sorts of crap in the stairway, in a way that would make an American flinch. You don't think about it, but every one of the doors that you hit from here to outside will open outward, because we have a trauma about pictures of people piled up in front of doors that open inward. You can think about the Triangle Shirtwaist Factory pictures and how people were blocked by locked doors. You can think about the Ghost Ship fire: 50 people were in that building, and 36 of them didn't make it out, because there were no fire exits. But think about it: yesterday, when the fire alarm went off, you knew where to go. Even if you weren't sure that there was a fire, the first thing that happened was you hit the doors and they swung open outward. And you hit the next set of doors and they swung open outward. And in an emergency, that little bit of mitigation saves so many lives.

This other picture, this nighttime picture, is the sadly named Torch Tower in Dubai. It had the same kind of cladding, the same flammable aluminum cladding. But it had four exit stairways, and it's 78 stories tall, and everyone got out okay. No deaths, no injuries, because of building codes. It's not that these didn't both catch fire in a really dramatic way. It's that building codes mandate that you be able to exit. So think about that when you're thinking about what harm mitigation does: we accept failure and we make it as not-bad as possible.

So the first step to mitigating harm is acknowledging that we are never going to be able to fully eliminate risk, which is kind of horrible for us. As a parent, I would love to be able to wrap my children up and make it so that nothing bad ever happened to them. But it's not really a workable technique.
So all I can do is sort of life-proof them: give them bumpers and let them learn what hurts and what doesn't. Once we accept that there's risk in everything we do, we can identify it and figure it out instead of just hiding from it.

So assume that everything is going to fail. If you don't know how your product is going to fail, you are missing something. If you don't have an idea of what your failure modes are, it's going to be really difficult for you to make that better. For example, failure modes can have a lot of different meanings for different people. If I leave my phone unlocked and my partner sees my chat history, I will get teased, because it's silly. If somebody who has an abusive or controlling partner leaves their phone unlocked and that partner sees their chat history, they could end up dead. Like, why are you talking to a women's shelter? Are you trying to do something? Are you trying to get away from me? Your threat model needs to include a variety of people for every failure circumstance. In a software-as-a-service setting, what may seem like a harmless bug around graphics could break an important custom analytics module, or it could be something that nobody uses at all. Only by using analytics are you going to be able to tell what people are actually doing with your stuff.

Testers are the amazing people who never believe in the happy path, and we need more of them in our lives. It's important that we do our own testing, but it is also important that we hire the professionally paranoid, because they have seen all the ways that things fail and they are excited about finding more of them. And if that's not what drives your life's purpose, hire a tester to do it for you, because those people really exist and they're amazing. So think about what it is that failure means in your product.

When we try to mitigate harm, we need to think about what it is we're trying to protect. For example, nuclear power plants are designed to fail safe. In the absence of power or positive control, the control rods drop and the nuclear reactions stop. That is the ideal. Usually it works. We hear about the times it doesn't, but for the most part, if your control rods can drop at all, they will drop as soon as a power cut happens. On the other hand, time-lock safes aren't designed to protect people. They're designed to protect money. You're not supposed to be inside a time-lock safe pretty much at any point. And so you can't open it without a combination and a set time. And that protects the money and the person, because you can't coerce them into giving up a code and have it do any good. So what we're trying to protect there is what's inside the safe and also what's outside the safe.

In most computer cases, the thing that we're trying to protect is data or state. When a laptop runs out of power, it shuts down pretty gracefully. It's like, oh, I'm almost out of power, I'm going to wind down now. That's actually pretty graceful. It could be so much worse; if you've ever had somebody trip over your desktop cable, you know what I'm talking about. That never happens. But laptops also have this cool thing now where if you knock them off a table, they will attempt to save the state of the data that they know about in the moment that they are falling. How cool is that? Somebody has figured out that we care so much about our data that even as the computer is falling to its probable screen death, it's still saving whatever it needs to to the hard drive or the SSD. So you have to figure out what you care about.
I worked on BitLocker at Microsoft for a while, and it was really interesting, because we realized we didn't care about the laptops. Hardware is easy to replace. Access. Access was our real risk point. The number of executives who leave laptops in taxi cabs is really large. As a person who has not lost her laptop but has lost other things, I sympathize. But we said, that's disposable. It's a $2,000 laptop, we don't care. It's so much less than the data breach of somebody being able to hack into our network with physical access to your computer. So we secured it so that it was easier to accidentally wipe the data than it was to recover it. We're like, we're going to assume your data's in the cloud. Nothing is worth somebody else getting into our system. That is our risk model.

Again, predicting the possible states allows you to understand where you can make a difference, by focusing your attention on transitions and states that could be altered to be better. You really need to think about what you can do, as continuous improvement on your product, to make it more risk tolerant. The easiest path in this for software is the kill switch. We used to do this by redeploying a known build, but that's painfully slow when you're in the middle of an ongoing crisis. Instead, you can always have a transition state of hit the kill switch. And then you can just sort of walk away like a cat that just fell off a bed. I meant to do that. That was never deployed, nobody saw me here. You want to be able to, say, turn something off in a hurry without having to redeploy everything.

Failure is inevitable. We are all going to fail. We fail every day. We forget to bring home the milk. We typo something, something happens. We fail all the time. Disasters are a concatenation of failures that add up to a catastrophe. We can avoid disasters.

Why is predicting states so complicated and hard? My simplistic examples don't make it seem like it's super difficult to do this, but I think even though we've designed for redundancy and failure, our systems are sufficiently complex that we can't understand all of the interactions. We saw this with LinkedIn. Who here remembers when LinkedIn managed to turn on all of their features at once? No? I was the only one who saw that? It was terrible. It's not as if they'd been shy about nagging you before that. Well, they got their stuff sorted out. Turning on all of your stuff at once is not a thing anybody could have anticipated, because there's no reasonable reason to do that. You have to have that mindset of what could possibly go wrong, no matter how weird, to catch something like that.

In modern software, it's gotten really difficult to say, I can follow a message from one location to another. There's a real "internet magic happens here" in the middle, especially in software as a service. You can't be sure which node it's going through. You can't be sure which channel it's going through. You can only predict that it's probably going to come out the other end. This makes testing kind of difficult, because you can't build with just the nodes, just the round parts of the Tinker Toys, and you can't build with just the sticks of the Tinker Toys. You have to have both the software and the channels, the microservices and the connections. Multiplying combinations grows super complex super rapidly. If you tried to test every possible permutation of a Tinker Toy object, you'd quickly run out of computational power.
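Going back to the kill switch for a second, here's a minimal sketch of that transition state: a flag that gets checked at request time, so you can turn a feature off without redeploying anything. The flag store and all the names here are illustrative, not any particular product's API.

```python
# Hypothetical flag store: in real life this would live in a config service,
# a database, or a feature-flag product, not a module-level dict.
FLAGS = {"new-recommendations": True}

def flag_on(name: str, default: bool = False) -> bool:
    """Check the switch at request time, so flipping it needs no redeploy."""
    return FLAGS.get(name, default)

def recommendations(user_id: str) -> list[str]:
    if not flag_on("new-recommendations"):
        return []  # kill switch thrown: degrade gracefully instead of failing
    return new_recommendation_engine(user_id)

def new_recommendation_engine(user_id: str) -> list[str]:
    # The risky new code path that we might want to turn off in a hurry.
    return [f"suggested-item-for-{user_id}"]

# In a crisis, setting FLAGS["new-recommendations"] = False shuts the feature
# off for every request that follows, with no redeploy.
```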
That Tinker Toy animation is just two objects rotating around a central point. Imagine what it's like when you have 60 microservices and they have more than one channel each. You can't test all of that. So you're going to have to test each of them: test your microservice, test your channel, and then when something goes wrong, trace it back and find out where it actually fell over. But you have to give up on the idea of 100% test coverage to do that. It's not a thing. I don't think 100% test coverage is really possible for software as a service, certainly not for any kind of complex microservices ecosystem.

So a disaster is a combination of failures that causes a system to stop working, usually in a way that costs lives or money. Every disaster that you can think of, from natural disasters to engineering disasters, has several contributory factors, right? In a hurricane, high winds and storm surge damage buildings and take off roofs. Okay, that's bad. But it's really bad when the power is also out and people can't get water and they can't dispose of waste. And it's not just one thing that fails. It's not just that your roof is gone. It's also that you don't have clean water, or a place to store your medication, or a place to sleep that's safe. Disasters are about a multiplicity of failures.

1,200 miles from here, at the end of the navigable Mississippi, there's an interstate bridge. It is not this interstate bridge. This interstate bridge fell into the river. It was pretty unpleasant to be in Minneapolis right then. It was a beautiful summer day and it fell very suddenly. There was almost no warning. And 13 people died, which is a miraculously low number, because it was the fourth-busiest bridge in Minneapolis-St. Paul. The engineering report says the primary cause of the collapse was the undersized gusset plates, at 0.5 inches thick. Contributing to that design or construction error was the fact that two inches of concrete had been added to the road surface over the years, increasing the static load by 20%. Another factor was the extraordinary weight of construction equipment and material resting on the bridge just above its weakest point at the time of the collapse. That was another 262 tons, consisting of water, sand, and vehicles.

It wasn't just that the gusset plates were too thin, although that was the main cause. They'd stayed up for 40 years by that point. It wasn't just that there was a lot of concrete on it. There'd been a lot of concrete on it for 10 years. It wasn't just that there were construction vehicles on it. It had had three major resurfacings. It's that all of those things came together at one time to contribute to a disaster. And it was by the grace of over-engineering that it stayed up for 40 years. Because if it had been, like, spec'd properly, it would never have fallen down. And if it had been built shoddily, it would have fallen down much sooner. It was only when you got this combination of factors that we exceeded the engineering tolerances keeping it up.

When we fish our own disasters out of our metaphorical river and piece them back together, we find out that it's the same. This cluster had a bad shard, but that didn't matter until the other cluster blipped and threw a bunch of traffic over to the first cluster, and then the shard failed to replicate. It's always going to be something like that. So the only thing you can do is build in engineering tolerances that allow you to recover from these problems.
We have to give Netflix credit for popularizing this idea of tool-ifying chaos to head off disaster: causing minor failures and seeing if anybody notices. And if anybody notices, your engineering tolerance is not big enough. You need a bigger tolerance. You need more servers. Because you set your parameters for Chaos Monkey and it takes servers out. And if it takes out enough servers that somebody notices, then either your parameters are not an accurate portrayal or you need to add more servers.

Here are some general concepts that I'm going to go over really quickly, but if you want them all on one page, this is the time to take the picture. Here are things that cause disasters to fizzle into mere failures.

Use microservices and very loose coupling. Chad mentioned this in his talk. Not pets, not even cattle. Think of them as bacteria. The system has to survive the death of any one bacterium. Your server doesn't matter. It needs to not matter. It needs to be a replaceable element. Use very loose coupling. Use internal APIs so that if something goes down, it doesn't chain-react and cause other things to come down with it. If your APIs continue to ping something that's down, you will in fact have a problem, because now you have DDoSed yourself. Which is unpleasant and unfortunately common, because you're like, why aren't you up? Why aren't you up? Why aren't you up? I'm down.

One of the common mistakes that we make in mitigation is assuming that systems can always talk to each other smoothly. If there's a problem, it sometimes takes out the messaging too. How many of you remember when the status page for S3 was inaccurate? Yeah, right? Because one of the elements on the status page for S3 was hosted on S3. Your messaging about failure needs to tolerate being part of the failure. So if you have a status page, make sure you're not hosting it yourselves. StatusPage.io will host it for you and say, our site is down and everything's on fire, yo. And then people will stop calling your tech support to find out if your page is down and everything is on fire.

Storage is cheap. Make sure you have your data in more than one place, because nothing and nowhere is safe. Delta, I am looking at you. Delta had a perfectly nice data center in Atlanta. And they had, when they acquired Northwest Airlines, a perfectly nice data center in Minneapolis. And then the Atlanta one caught on fire, which would have been fine if Minneapolis had been up, but they had decommissioned it. They had decided that it was not worth the cost to keep a redundant data center up. And what are the odds of your data center catching on fire? Right? Get your data in more than one place.

Test your features in production. You cannot test any modern software fully in test. It is impossible. We are talking billions of transactions and messages, and you just can't do that. Absolutely do your unit tests and your integration tests and your big tests and, you know, have a test server. I'm not saying don't do that. I'm saying, for the love of God, you're going to have to test in actual production where there's actual traffic. No script on earth is going to write you enough messages to make it look like Black Friday. That's just not a thing. So roll out your stuff to a few people at a time, make sure it goes okay, and increase your rollout as you go, because that's going to make it a lot safer to test in production than just flipping the switch on.

Humans are the slow, fallible meat part of almost any system.
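One way to picture that roll-it-out-to-a-few-people-at-a-time idea is a percentage rollout: bucket each user deterministically and compare against a dial you turn up as things look okay. This is just a sketch of the concept; the hashing scheme and the names are illustrative, not any specific tool's implementation.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically assign a user to a bucket 0-99, then compare to the dial."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Start small, watch what happens, then turn the dial up: 1, 5, 25, 100.
print(in_rollout("user-42", "new-checkout", percent=5))
```

Because the bucketing is a hash of the user and the feature, the same users stay in (or out of) the rollout as the percentage grows, which is what makes "make sure it goes okay, then increase it" observable.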
That slowness is a great design when you're trying to prevent cascade failures of overreaction, but when we automate our world to react, we also need to automate our ability to shut things down. A circuit breaker in electricity stops an overload from just circulating around and causing fires and other unpleasant things, and we can install circuit breakers in our software with kill switches that say, hey, if somebody is pinging you like 500 times a second about whether or not you're up or down, you can just kill them. Kill that line. You don't need to listen to it. And automating our circuit breakers and our kill switches is going to give us a lot more safety, because it's better to shut something down than to have that one failure spin up into a giant disaster.

If you have a runbook for how to get a system back up, why haven't you automated it? You know the system went down, okay, that's good. You don't necessarily know why. And because of the mystical power of how computers work, sometimes turning it off and turning it on again makes it feel better. I would love to tell you that we have solved this problem, that we're actually good at diagnosing why things go down, but a large percentage of the time, turning it off and turning it on makes it feel better. So if it's already fallen over and turned itself off, let's automate turning it back on by itself, because again, we are the slow meat part of the system. There's no reason somebody has to wake up at three in the morning and hit the switch and run the runbook to turn something on. If it falls over again, sure, wake them up, but try the first time automated.

Test for load and stress and outage. Like I said, you can't fully test modern software in a test environment. You have to test it under load, and you have to figure out what your peak load is and then calculate beyond that. In the old days, and again, I'm going to date myself here, we had this phenomenon called getting slashdotted. You would get mentioned on this extremely popular site and then everybody in the world would come look at your site, your tiny little site that was used to getting 50 hits a day, and it would get thousands and hundreds of thousands, and your server would fall over in a dead faint like a goat. And then people would be angry because they wanted to see this cool thing that Slashdot mentioned and the site was down. We don't have as much trouble with that in an elastic-capacity world, but we still need to be testing for Slashdot-level events happening to us, because it's easy to mitigate, it's easy to prepare for. Maybe you need to throttle inbound traffic. I'm not saying you necessarily need to be staffed up like Amazon at all times, but you need to think about what would happen if you got Amazon-level traffic.

So if this talk was too long and you read Twitter instead of listening to me: expect failure, because it's going to happen. Make your systems less rigid and more nimble. Plan for disasters and expect that they will happen too, and degrade your service gracefully so that in an emergency situation you can still offer something instead of a whole lot of 404s.

I work for a cool company called LaunchDarkly. We do feature flags as a service, and if you would like a free t-shirt, I am too lazy to haul them around the country, but you can take a picture of this and go to the site and we will send you one. Thank you.