All right, we're back. So now we'll have Heidi Waterhouse talking about disaster resilience the Waffle House way. So take it away.
Hey, folks. I'm glad to see you here today in that I can't see you. But I know you're here. And I'm super excited to talk to you about resilience and hurricanes and rubber ducks. Disaster resilience the Waffle House way. Flat Pops, feature flags, and finite state machines.
All right, hey, team, I'm going to talk to you about disaster resilience this way because I'm super interested in how things work and how they don't work. So let's get going. The summer after college, I worked in Alaska and it didn't look like this. It looked like this. I worked at the busiest Dairy Queen in Anchorage, Alaska, and you wouldn't think that was very busy, but there is, in fact, a major military base up there. It's super busy. And I worked there all summer and one interesting thing happened, which is that we lost power during lunch rush. And there are hundreds of people that come through during lunch rush. And I figured we were going to have to shut down because we'd lost power. But it turns out that a fast food place can continue operating without power for quite a while. And in fact, we operated all through lunch rush and made people food, not all the food, but a lot of food, without any power. Because even if you can't microwave anything or the refrigerators have to be kept closed, you can still cook things in the fryers; although the timer doesn't work, the coils do. You can still make hamburgers. You can still serve a lot of food. The frosty machine is definitely out though. That's all electrical. So when I was thinking about resilience, I thought about the fact that even though we think power is sort of essential to providing food, it isn't. And this leads me to think about the difference between failure and disaster. Failure is when something goes wrong. 
But for disaster to happen, you really need some teamwork. A disaster is a collection of failures that continue to escalate. And when we're talking about failure and disaster, one of the things we want to do is stop things when they're still at the failure stage before they get to the disaster stage. So when I think about this, I think about this in a chart. So on the one side, there's a success and a failure. And sometimes they're not very far apart. This is sort of what we're talking about when we talk about minimum viable product. It's hard to tell whether something is a big success or a big failure with only a couple of data points. But the more data points you get, the easier you can tell whether something is succeeding or failing. And I'm really interested in that zone between the two of them, between flawless success and critical failure. When I think about this, I think about left-pad. If you remember what happened, a developer pulled something out of the npm library and that broke a lot of things that had dependencies. And the modern comparison is Seth Vargo pulled some of his Chef Sugar code from Chef when he disagreed with their policies on ICE. And that broke a lot of dependencies. Chef had to scramble and fix a lot of things that depended on code that Seth had written many years ago. But nobody had thought about it because it was just part of the product. Dependencies are always going to be in our products like that because our products are too complex for us to have built them entirely by ourselves. So when we're thinking about success, it's not just like, did something work okay? It's a flawless victory thing. So a single success is like a single rubber duck. It's there, it's a very nice duck. You can talk to it, it's gonna help you code. You can ask it questions. That's a single success and it's exciting. But what you really want for like flawless victory is to pile up the nines. You want an entire house full of rubber ducks. 
You wanna be able to say, I can't even count how successful I am because I know that these rubber ducks are just going to keep arriving. That's what flawless victory feels like. But the same way one success is not the same as a whole house full of successes, not the same as all the nines, there are also degrees of failure. There are things that happen that are not ideal but they're not catastrophic. So this is that zone between flawless success and abject misery. And I tried to come up with a good name for it but it didn't come to me. So it's just that bit in the middle where it's neither perfect nor so broken that it doesn't work. And that's what brings us to Waffle House. For those of you who don't know, Waffle House is a Southern institution where you go and you can order the waffles but that's not what anybody is actually there for. They're there for the hash browns. That's the best thing that Waffle House makes. Waffle House has a hurricane center, legit. They have a war room and they watch hurricane tracks and they decide whether or not they're going to be able to continue operating and which Waffle Houses will be impacted by the path of the hurricane. And then they station preparedness trucks just outside the zone of damage of the hurricane tracks so that they can get people in right away. And so they have this whole team arrangement that says we're in the South. We know that there will be hurricanes at some point. And we are committed to staying open as much as possible, as much as is compatible with life and safety. And when they say that, they mean it. Like FEMA has a thing that says they know that if the Waffle House is closed in a town, that town is very badly affected. And if they're still serving some kind of food, that's not the place you send the National Guard to first. So they have different ways of operating depending on what's broken. If there's no power and they still have gas and water on, the thing that they can do is use the cook tops. 
So you still get your hash browns but you're not going to get waffles because those are made in waffle makers and you're going to be able to do the dishes. So if you walk into a Waffle House and they're operating in a storm condition where they don't have power they might have trouble taking your money like you might need cash but you're still going to get fed. If there's no gas but the power and water are on they can still make things with the fryer and they can still cook waffles but they're not going to be able to do anything that involves the cook top because that's gas fired. If there's no water that's the hardest one but they can still operate with no water which I find fascinating. Like they've obviously done some intensive training. I've done food service training and all of it is predicated on having potable water. And if there's no water then you have to do a whole bunch of sanitization and use paper plates and like manage all of the things that we do for food safety in a different way. Like I said, they prep at the edges. This is a picture of a whole bunch of people who do shallow water rescue heading into Louisiana as a storm is hitting. They call themselves the Cajun Navy and it's a bunch of volunteers who have boats and they show up and do shallow water rescue and some rapid water rescue because no Coast Guard, no National Guard can station enough things in place to do post hurricane recovery. So these volunteers drive to the edges of where the storm is going to hit and then as soon as it's remotely safe they're already close enough to get in. Instead of waiting at home until there's a call for them to go out they go prep at the edges. So what have we learned about Waffle House? Well, they have a plan for degraded service. They establish the restart at the edges based on how close they think they can get to the storm margins without endangering anyone and they focus on key deliverables. 
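One way to picture that plan for degraded service is as a lookup from "which utilities are still on" to "what can still be served." The sketch below is only an illustration: the item names and their requirements are loosely drawn from the description above and are assumptions, not Waffle House's actual procedures, and `Utilities`, `REQUIREMENTS`, and `available` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utilities:
    """Which utilities are currently working at one location."""
    power: bool
    gas: bool
    water: bool


# Hypothetical requirements per deliverable, loosely following the talk.
REQUIREMENTS = {
    "hash browns": {"gas"},       # cook top is gas fired
    "waffles": {"power"},         # waffle makers are electric
    "fryer items": {"power"},
    "card payments": {"power"},   # without power you might need cash
    "washed dishes": {"water"},   # otherwise paper plates
}


def available(u: Utilities) -> list[str]:
    """Return the deliverables whose requirements are all still met."""
    on = {name for name in ("power", "gas", "water") if getattr(u, name)}
    return sorted(item for item, needs in REQUIREMENTS.items() if needs <= on)


if __name__ == "__main__":
    # No power, but gas and water are on: hash browns yes, waffles no.
    print(available(Utilities(power=False, gas=True, water=True)))
```

Writing the mapping down ahead of time is the point: each outage has a known, reduced menu instead of an improvised one.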
I'm pretty sure they feed a lot of people for free when they aren't able to take cards. I'm sure they want to make money during a hurricane but based on the stories coming out of it they mostly wanna feed people. The thing that we know about how we react to things is that we're always reacting to the last outage. We're always reacting to the last bad thing that happened and not the next bad thing because we know where we failed last time but we don't know where we will fail next time. We need to think about where our vulnerable places are that could fail next time so we can start doing mitigation. So we can start doing prep on the edges so that we can be prepared for when the next failure happens because perfection, although it's a noble goal is not a thing that actually happens in the world. So there are two kinds of problems that we deal with in technology, local and distributed problems. For instance, sometimes what we have is some connection. We have a little bit of bandwidth. We don't have everything that we're looking for but we have some way to talk to each other. So how do we deal with that? How do we say what's important? Sometimes we have regional outages. This is more of a distributed problem. Like if you have the Great Northeast blackout how do you stop that from failing across the entire electrical grid across the country? That was the action of circuit breakers to keep it from getting any worse than it did and it was the action and effect of interconnectedness that made it as bad as it was because all of these utilities that were nominally independent were sharing. But the thing is if you're sharing excess you can also share excess load and it could take things down in sequence. Another one is slow failure. Slow failures are extremely difficult to diagnose because it sort of sneaks up on you. 
What happens is you start to get laggier response times and nothing is catastrophically down but it just keeps getting slower and slower and it's very difficult to diagnose because there isn't anything obviously on fire. So when we talk about failure modes we often forget to talk about slow failure as a thing that can happen because we feel like we can see it coming. My personal favorite is observation changes state. This is the thing where your computer is running slowly and you would like to run a monitor to figure out what process is slowing it down but the act of spinning up a monitor makes everything run more slowly and sometimes you can't even get the monitor up because it's running so slowly so you can't figure out what to shut down. It's a problem that we encounter a lot not just on our personal computers but in our data centers and in our distributed networks that changing how you're looking at something changes the state of it. And when we do that it makes it really difficult for us to determine what the original problem was because now we've multiplied it. And then there's the human factor. I'd like to say that humans are part of the safety net and we are but we're also part of the problem because humans make the best decisions they can with the resources they have at the time available. So given all of these problems how do we build resiliency? How do we make our systems more robust to deal with the fact that failure is inevitable? I am the English major most in love with state machines. I super love state machines because I think they're a very elegant way to address a lot of problems. What you do with a state machine is figure out all the states that something can end up in what can happen. So if you're using an ATM you could get money or you could be rejected or it could eat your card or the ATM could catch on fire. That's sort of an edge case but you never know, right? 
And so you map out all of the things that are possible, all of the states that it could end up in, and you don't have to know yet how you get there, because the next thing you do is you take all of those end points, all of those places that a machine can end up, and you look at how they got there. Once you have a map of how they got there you're gonna be able to see that there are commonalities in how we route to an end point and we can put our reinforcements and our hardening on those commonalities. So every time we look at it we can say the thing that I want is to be able to reliably end up in this safe failure state rather than a bad failure state. I would rather the lights in the house go out than start an electrical fire. So when we're doing this I think it's really useful to think about a state machine as a way to predict what kinds of failures are possible and then route to the ones that are least damaging. So I've given you all of these ideas on how to make it work or how things could go wrong, but like what are we gonna do about this now? I don't wanna leave you with just like this looming sense of dread that failure is inevitable, although it is. The first thing I wanna say is somebody told me this and it was really impactful. If trying harder was going to work it would have. If just trying to be more perfect was going to make for better systems it would already have done that. If trying to be on time was going to make someone on time they would have already shown up on time. There's something that is blocking that, and in both humans and systems we have to accept that trying harder is not really a useful answer. So one of the things that we learned from Waffle House is that you can offer reduced service. What if instead of serving your whole webpage, if you're dealing with a system slowdown, you serve only the essentials: no pictures, no ads, just the core business value of what you're trying to get through. 
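The two ideas here, mapping out end states and serving only the essentials during a slowdown, can be combined in a small finite state machine. This is a minimal sketch under assumptions: the states, events, page parts, and the names `State`, `Event`, `TRANSITIONS`, `step`, and `PAGE_PARTS` are all hypothetical. The one deliberate design choice is that any transition nobody mapped routes to the safe failure state.

```python
from enum import Enum, auto


class State(Enum):
    FULL = auto()             # whole page: content, pictures, ads
    ESSENTIALS_ONLY = auto()  # reduced service: core content only
    SAFE_STOP = auto()        # safe failure: lights out, no fire


class Event(Enum):
    SLOWDOWN = auto()
    RECOVERED = auto()
    CORE_FAILURE = auto()


# Explicitly mapped transitions (hypothetical).
TRANSITIONS = {
    (State.FULL, Event.SLOWDOWN): State.ESSENTIALS_ONLY,
    (State.FULL, Event.RECOVERED): State.FULL,
    (State.ESSENTIALS_ONLY, Event.SLOWDOWN): State.ESSENTIALS_ONLY,
    (State.ESSENTIALS_ONLY, Event.RECOVERED): State.FULL,
    (State.SAFE_STOP, Event.RECOVERED): State.ESSENTIALS_ONLY,
}

# What gets served in each state (hypothetical page parts).
PAGE_PARTS = {
    State.FULL: ["core content", "pictures", "ads"],
    State.ESSENTIALS_ONLY: ["core content"],
    State.SAFE_STOP: ["status notice"],
}


def step(state: State, event: Event) -> State:
    """Advance the machine; unmapped combinations fall to the safe state."""
    return TRANSITIONS.get((state, event), State.SAFE_STOP)


if __name__ == "__main__":
    s = step(State.FULL, Event.SLOWDOWN)
    print(s.name, PAGE_PARTS[s])
```

Recovery from `SAFE_STOP` goes through `ESSENTIALS_ONLY` first, so the system comes back with the core value before the extras.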
Maybe most of what you're doing can be stripped down. I think this is a really good place for people to talk about what their core business value is. Like what are we providing? What is the most important thing to get through? Because once you identify that the most important thing is to feed people in an emergency you can do a bunch of things to make that happen. But if you identify the most important thing is to save money then you would close Waffle House all the time. You wouldn't keep it open. That would be the safer thing to do if that was your core business value. So if you're thinking about the robustness and resilience of your system you have to be serving your business value whatever that is and you have to know what it is. And I find that a lot of us in IT and a lot of us in software don't spend a lot of time thinking about that. We're thinking about features and not value. And that's something I want you to go home and think about like if I could only serve one part of what I'm making, what would it be and how would it help the company? How would it help people? Think beyond the binary. When I think about this this is a very gender-essentialist joke but I love this because women are taught a bunch of different color names and men are taught not as many color names. There's the same number of colors but we could be thinking about them in different ways just like we could be thinking about things in a non-binary success and failure in that middle space. I want us to think about like the spectrum of success and failure. Talk to the business owners. I already said that but I'm gonna say it again. You need to know what it is that you're supposed to be delivering and business owners are the people who know what that is because business is not the enemy of development. 
They are the subject matter experts on customers and I'm in favor of having customers because that's how we continue to live inside and eat food: somebody is giving us money to produce something. And so when I think about this I really want all of us to make better friends with business and understand what they're trying to do because they are just as technical as we are. It's just that their technology is a lot more people-centric than ours. But if you've ever looked at Salesforce it is a terrifyingly complex set of people interactions and I'm super impressed by anybody who can manage that. And another thing that we can do when we're thinking about this is think about the service levels that we are agreeing to provide because every plane you've been on has been a little bit broken. Everything that you deal with is a little bit broken and what we're relying on is the fact that there is a minimum standard. There is an error budget that says if it's more than this percent broken please pull it out of service. But if it's only an ordinary amount of broken that's okay. And I'm a person who lives in the world and I'm a little bit broken and I want my software to operate in the same way, to be robust enough to work around the fact that not all of the connections are working all of the time. More like the TCP protocol, less like a one-to-one connection. Another thing that you can do is use circuit breakers and safety valves. Humans are the slow meat part of any system. And we invented circuit breakers because we are slower than electricity and we would like to not die from electricity. So the thing that we do is put in a circuit breaker, a fuse that says, no, something is going wrong here and electricity is going where it shouldn't. So let's stop and figure out what's happening. Even if that means everything goes dark, dark is better than dead. 
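In software, the same idea is usually called the circuit breaker pattern: after repeated failures, stop calling the struggling dependency and fail fast until a cool-down has passed. The sketch below is a minimal, single-threaded illustration, not a production implementation; the class name, the parameter names `max_failures` and `reset_after`, and their default values are assumptions chosen for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are refused."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Dark is better than dead: refuse without touching func.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow a trial call. One more failure
            # reopens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

A caller would wrap each call to the fragile dependency, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function talks to that dependency, and treat `CircuitOpenError` as the signal to serve a reduced response.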
So when you're building a system, think about the fact that all of our computer software is electricity and that we are the slow and fallible part. And maybe like spend some time thinking through how to set up circuit breakers so that humans can't break the system, at least not easily, not trivially, not by just like sticking a fork in it. Traffic is pressure. When you think about designing a system for traffic you have to sort of think about pressure vessels and how it is that you can constrain and funnel and release traffic and do load shedding in a way that's not detrimental to your business values so that you're not losing stuff. Like maybe load shedding a request is fine because there's nothing important in there, but you don't wanna be like load shedding checkout carts because then you're gonna lose money. When you think about traffic as pressure it helps you understand that you have to build both the pipes and the receptacles to a certain degree and then you have to keep checking that the pressure is okay because it's possible, as you see in this picture of an exploded steam engine, to overpressure anything. Think about modular delivery. And I've said this sort of a couple of different ways when I talked about how Waffle House does waffles and not hash browns, hash browns and not waffles, but think about this for your software. Microservices and modules are a way to say, hey, some things are down but not everything is down. Maybe one part of my system isn't working but the other parts are working and you can get done what you need to get done. And this is really interesting because frequently as people have been moving from monoliths to microservices they've been designing something that I can only identify as a distributed monolith where there are still dependencies and breakpoints if any one microservice goes down. 
And then you don't have any of the advantages of microservices. Like if you only have one database as a microservice you just have a distributed monolith because if the database goes down you can't write to it and you can't read from it. You have to have like more than one in order for it to be a valid microservice architecture. And I work for a vendor that sells feature management and one of the things that we like to be able to do is turn things on and off to test, A, if they're really isolated and, B, if turning them on breaks things. So one of the things that I suggest is people either buy something that does feature management or roll their own so that as you finish a feature you can deploy it to production but not show it to everybody. Do some testing in production before you put it out in front of the real world. And a lot of people say, but that's what I use staging for. And the thing is staging is basically a comforting lie because nothing is as weird as production. Production is full of human data that has people with single-letter first names and triple-hyphenated last names. And I joke about this but I think we need to be ready for the new generation of the emoji-named children. I think they're coming. And a couple of weeks ago I saw an error message go by in our log that said, hey, I can't delete a flag that's emoji-named. I'm like, you named your flag with an emoji. Ha, I'm psychic and I can see the future. We're gonna name things with emojis. So the more support that you have for being able to test things in production, where everything is weird and expensive, without breaking the user experience, the more confident you're going to feel when you actually roll it out all the way. So none of the resilience matters if you can't define what victory looks like because it's not the same as not failing. So what is a minimum victory? What is barely enough to serve some kind of business value? What is just enough to get by? What's not bad? 
What is like, I could live with this. I want to improve it. I want to iterate on it, but it's okay. Like the middle state of success. These are the costuming ideas that your mom came up with when you told her that you had a play in a day. What is good enough? If you think about that and you aim your robustness at that, you're probably gonna achieve it because good enough is much easier to hit than perfect. And then what's your best case? Because if you don't have a best case, you can't be aiming for it. What is a flawless victory? What does it look like? What is the latency? What is the timing? How could you continue to iterate your system so that you can get closer and closer to it? Because that would be exciting. How can you be victorious if you don't define victory? I think a lot of us define how our features should work, but we're not doing a great job defining it from the user's perspective, defining it from the way that we want to have it work for other people. Because I want all of us to be rainbow unicorn Pusheen and we can't do that unless we say I want to be rainbow unicorn Pusheen. So if this talk was too long and you read Twitter instead: make failure more nuanced, make it more than binary, make success bigger, hold more things in the pool that you call success, including successful enough, not too bad, I guess it works. And definitely when you go to Waffle House, order the hash browns. So if you are interested in a t-shirt and you are in the United States, I'm sorry, go ahead and follow this link and we will send you a t-shirt and some information. Thank you all. You can switch me back to video now, which I can't see. So thank you all for your time and attention. I really appreciate you doing this with me and I hope it's been a good time. Yeah, thank you for joining us. It was great. And so we're gonna get ready for our next talk now. And I forgot to say that we have a new co-host now here, Maria Nagaga. Hello. And so thank you so much, Heidi. Thank you. 
Thank you.