Yeah, so I'm going to talk about managing fires and leadership's role in incident management. Specifically, I won't necessarily touch on what you should do as an individual contributor during incident management, but on what an incident commander or leader can do in those situations.
So this is me, now you can all see. I am from Connecticut, where I work at Cage Data. We do DevOps consulting, helping organizations that want to make a transition but don't know how to do it. We help navigate that change. I'm also organizing DevOps CT and DevOps Days Hartford, so if you know people on that side of the country, please let them know. And I'm crazy on Twitter: I don't like when people can spell things correctly the first time, so I make it difficult.
So yeah, little fires are not really the big incidents we're talking about. Those you can handle with tooling. These are the kinds of things that go off in the middle of the night, someone responds, and it's usually an individual that can fix it. We've got tools, like context about what's going on. That's really important for fixing these small incidents. If you understand why you're getting this alert or what it means for your company, you can analyze it and make a good decision on how to act. Runbooks are critical, and very simple: if I have a small fire, here's how I put it out, and if I follow these steps, it will go away. So these are critical tools for managing small incidents and making sure they get taken care of quickly. If we've got them clear, a lot of times we can build automation out of them as well. Whether it's ChatOps or some other automated script that kicks off, some incident happens, and some computer can analyze it, respond, and fix it without having to wake someone up. That's always good stuff.
But eventually, there's some sort of escalation. This is where we also want to have things documented. I believe this escalation process was from brush fires.
This is the actual escalation process and what it looks like. It starts out small: there's a fire, but we've got it under control, don't worry about it. Then, okay, now we need to look at it. And then, now we've got an emergency, you probably need to evacuate, you need to take action. The same is true of our incidents, right? You get a small fire. Maybe it's nothing and it's handled by automation. Maybe we need to get a couple of people involved to look at it. Eventually it's going to escalate. These things happen.
Crises require a team and leadership. If there's a large fire, they don't have one fireman go in. They don't send in a few without an incident commander actually managing that fire, directing people, and telling them where to go. So we want to talk about what the good qualities of leadership are in that situation. What's the leader's actual role during that incident?
In case no one believes that I actually had up my slides the day of, there you go. Transformational leadership. These are great qualities to have. These are really good. These are kind of like peacetime leadership qualities. It's sometimes easy to say, well, yeah, but if there's an incident, how do we manage it? There we go. My job's done. Thanks, Scott. No, but seriously, these are actually really good qualities to talk about. I'm going to try to make them a little bit more practical. The key points around this are in that B column. Being a servant leader who's adaptive, creative, and a global communicator is major in any sort of incident management, along with being able to share information.
Before you actually get to a crisis, you want to know who the leader is. Because if you're in the middle of a crisis and you now have to figure out who the leader is for this incident, you have two crises, because now you've got an argument on top of it. It's critically important to understand who the actual points of reference are and who should be making decisions, and not to have a question about it.
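As a rough illustration of writing the escalation path down before the crisis, here is a minimal Python sketch. It assumes three severity levels that mirror the brush-fire stages described above, and a pre-agreed commander roster. The names `Severity`, `POLICY`, `COMMANDER_ROSTER`, and `who_leads` are hypothetical and not taken from any real tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical levels mirroring the brush-fire escalation stages."""
    LITTLE_FIRE = 1  # handled by automation or a runbook
    NEEDS_EYES = 2   # a couple of people need to look at it
    CRISIS = 3       # requires a team and an incident commander


@dataclass(frozen=True)
class EscalationStep:
    responder: str
    needs_commander: bool


# Assumed policy: who responds at each level, decided ahead of time.
POLICY = {
    Severity.LITTLE_FIRE: EscalationStep("automation", False),
    Severity.NEEDS_EYES: EscalationStep("on-call engineer", False),
    Severity.CRISIS: EscalationStep("response team", True),
}

# Hypothetical roster, agreed on before any incident happens.
COMMANDER_ROSTER = ["alice", "bob"]


def who_leads(severity, roster=COMMANDER_ROSTER):
    """Return the pre-designated incident commander, or None if not needed."""
    step = POLICY[severity]
    if not step.needs_commander:
        return None
    if not roster:
        # Discovering this mid-incident is the "second crisis".
        raise ValueError("no incident commander was designated in advance")
    return roster[0]
```

The point of the sketch is only that the lookup is trivial when the decision was made in advance; how the roster gets filled is up to the organization.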
There might be some democratic process for passing that incident commander role along. It's going to depend on your organization. But it's crucial to know that if I say, hey, we've got a major problem and I want to escalate this to a major incident or a major crisis, who's going to take over that control?
The key to that is that whoever is the incident commander should be authorized to make business decisions. Sometimes really big ones, but always business decisions, to clarify that. If your major production database is down, they need to be able to say, hey, can we spend twice as much for at least the next two weeks to light up another copy of it somewhere and analyze it? We're having errors: can we bring down the production system right now to look at it or reset it? These are decisions that are going to affect the bottom line for the business. If they can't make them quickly, the delay from trying to get approvals and trying to talk to people is potentially going to make that incident blow up even worse. If I've got a small error where I can stop everything, fix it, and bring it back up fine, but it might balloon into corrupted data, I want to be able to bring it down while it's still safe.
That doesn't mean all this power should rest with one person. You want to spread it out as much as possible throughout your team. So if I'm an incident commander, I still want to hand authority over individual tasks to different people. I might say, hey, I've got the big picture, but I want one person to be in charge of database recovery. And they should be able to make the decisions about what database recovery looks like, because now I'm delegating that choice. Choices can happen faster. We can iterate faster on that.
And this is sort of troubleshooting 101. If I have a big problem, I need to start with the smallest components and work through them. It's just how that works, right?
First check DNS, because it's always DNS, and then work at the next smallest component up the line until you figure out how it fits together. It might be a connection that's broken, but it could be just this one small piece that you only find out about by working incrementally, which is my next slide.
Take it incrementally. I mean, that's exactly how you build Legos, right? You start small and you eventually make something more interesting from all these component parts. That's how we build systems, and that's how we need to troubleshoot systems. And if we can put people and teams in charge of individual pieces, to make all the decisions about checking them out, then instead of working through steps 1, 2, 3, 4 in order, maybe we can work on 1, 10, and 15 and assemble these all at once. So it just speeds up our actual recovery time.
This is a good point too. All of this actually comes from a story. And this is a great quote from Rahm Emanuel. The story this comes from is something that we did at Cage Data. We had an outage with a client who was making a transition. They had old infrastructure, a single server running all of these things: directory services, domain control, file storage, and email, all in one place. It's aging. It's not under warranty, which is terrifying. The software is probably end of life within a week, which is terrifying. So we've got an improvement plan in place and we're working on it, but it's not quite there. You've got to schedule downtime to extract these monolith services, because it's Microsoft.
So what happens when the server goes down? We just get those alerts: "not reporting." Well, that's not good. We happened to have someone nearby and said, hey, could you go to the client and actually check on this? Is this actually down? Because we've got a major issue if it is. He's like, yeah, it's down. I'm like, can you bring it back up? He said, no, it's looping. It's not coming up.
It's crashing continually, over and over again. We said, great. So we went in, we took care of it. We spread it out. We had a high availability structure we were building, and everything worked really well.
But what happened was that at a critical point in this, I had a decision to make. We could either work for 16 hours to recover the data and come back to how it was, which would be a recovery. Or I could take 16 hours, push through the migration that I had planned, and come up with something better. So this crisis, this downtime, gave me an opportunity to actually improve their services across the board, because I had this great moment to make use of: I could focus my efforts on improvement rather than just getting back to the status quo. So having that leadership, having someone who understands where you're at, where you can go, and what you have in place, is critical. I would much rather go through a crisis and come out better on the other side than come out back where I was and now have to work to improve it on top of that. So don't let a good crisis go to waste.
So that's part of it too, right? This is what it comes down to. We could work on what will make it work, and that's not great, right? I can still use the toilet right now, but I'm not really happy about it. It's good to know that, because sometimes you have to implement these changes. If my toilet's backed up and I really need to use it, I would rather have this than nothing, but it's important to know that's not the resting point. My toilet isn't fixed yet.
The other side is what will make it whole. What will actually restore it back to what it was beforehand, or better? What will actually make it complete again? So that's a distinction to make when you're making decisions in a major crisis: hey, we need to do this to fix it, but here's our plan afterwards to actually bring it back to a whole service again.
So communication was a big thing with this as well, and there's a great quote. If you know John Hodgman, you may know him as PC from the Mac and PC commercials or as a correspondent from The Daily Show. He's now a fake internet judge who dispenses fake internet justice, but he's got this great quote. He says, brevity is the soul of wit, but specificity is the soul of narrative. I sort of adapted that, because as humans, we communicate in stories. It's what we've been doing for millennia. So even when we're trying to coordinate efforts, we're building this narrative of what we're telling people to do and what we're going to do. It's simple stuff. "These are the next three steps I'm going to take." That's still a story.
So being specific is key to being clear: hey, I'm going to run a restoration on this system, then I'm going to run these tests, then I'm going to do that. Now nobody has to question what's going on. Nobody has to duplicate effort by asking, hey, you're restoring that, did you test it? We just don't have to do all that if we're specific in our communication.
And this is how we do it as well. We make work actually visible, so bring it out in the open. Instead of having your own section of work that you're working on and then bringing it back when it's all done, we want to actually work publicly, especially when we're trying to coordinate as a team and get it all completed quickly. This is kind of like committing to master, right?
Fortunately, we've got really awesome tools for all of this. These are things that just exist. Trello, which was still free, I think, last time I checked, is an awesome Kanban board where you can put everyone's work. Here's what we're working on. You can actually assign it to people. Great visibility. Or we might have actual ChatOps running.
If we're actually running our work inside of a chat window, it becomes really apparent who's working on what. So that's really big stuff that a leader has to coordinate. Sorry, I talk too fast and then I run out of... So yeah, you're setting that example, right? You're that clear point for communication.
These tools aren't all there is, right? If you have just Post-it notes and a wall that you can stick them on, you've got a good way to make your work visible. And sometimes you have to do a conference call, and chatting about everything isn't practical. The key is to make sure everyone's in communication.
This is actually the other side of it. We had a similar incident where we had an outage with a client and there wasn't good communication. The people who were in leadership were actually off-site at the time, so we weren't looped in on what was happening. It didn't get escalated cleanly, but what it came down to was a communication issue. We just didn't know what was going on. Our fears and the reality were just not lining up. Our retrospectives afterwards brought that up, and this became incredibly important for what we value as a company.
So, getting to servant leadership. Leaders handle the human interaction problems in an incident. That isn't to say that your incident commanders can't be technical or can't contribute. It's just that your role as commander of that situation is to coordinate all of the group efforts. It's not necessarily to come up with the fix that makes you the hero. It's to actually coordinate everybody.
Machines are great because they run on electricity and they can run all the time, but as meat bags, we run on carbon and water. So as a leader, make sure that issue is handled. This sounds really dumb, but if you're making a lot of critical decisions, maybe late at night, and someone says, hey, what do you want to eat? That's like throwing a wrench in everything else.
Like, oh, now I have to decide about that too. I'm trying to make critical business decisions, but now I also have to figure out what I want to eat so that I can stay focused. Honestly, I've been there on both sides of it, and just having someone say, "Hey, I'm going to get food here, is that good?" is way better than asking, "Where do you want to get food?" Or just have it show up: here is pizza, now eat. Great, everyone's going to eat it. Have an array of things that everyone can eat, and we don't have to worry about it. It allows the engineers to focus on what they need to do without having to make more decisions on top of their problem.
And sorry for the sports metaphor at a tech conference, but you need to manage your people as well. It's just like in sports: you don't put your best people in all of the time, for every game, for the whole game, right? We need to make sure there are people that can fill in. We swarm an incident when it happens. Everyone wants to get involved, everyone wants to help out. We've all been there: hey, how can I help? How can I help? How can I help? But if everyone's giving 100% right from the start, you have no one left at hour 10, hour 12, hour 24. Who's going to jump in at that point?
So it's important to actually manage the people. Make sure you have people available so that if you say, hey, I need you to go home, someone else can come in. Personality-wise, a lot of individuals, and I'll own that I do this at least, have a tendency to work on a problem until it's fixed. That might mean that I'm not making good decisions if it's taking too long to fix. So it matters to have that knowledge of your team, how they work, and who those people are, and to say, hey, I know you really want this fixed, but I really need you to go home and rest right now, because I want you to be better in a few hours, not just okay right now. So it's key to make people aware of themselves.
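To make the "don't put everyone in for the whole game" idea concrete, here is a minimal Python sketch of a round-robin shift plan. It is an illustration under stated assumptions, not a description of how any team in this talk actually schedules people: the function name `plan_shifts`, the responder names, and the 8-hour shift length are all hypothetical.

```python
def plan_shifts(responders, per_shift, shift_hours, total_hours):
    """Split an incident of total_hours into shifts, rotating crews round-robin.

    Returns a list of (start_hour, end_hour, crew) tuples. If the roster has
    at least 2 * per_shift people, nobody is assigned two shifts in a row.
    """
    if not 1 <= per_shift <= len(responders):
        raise ValueError("per_shift must be between 1 and the roster size")
    if shift_hours <= 0:
        raise ValueError("shift_hours must be positive")

    shifts = []
    start = 0
    next_person = 0  # index of the first responder for the next crew
    while start < total_hours:
        crew = [
            responders[(next_person + i) % len(responders)]
            for i in range(per_shift)
        ]
        end = min(start + shift_hours, total_hours)
        shifts.append((start, end, crew))
        next_person += per_shift
        start = end
    return shifts


# Hypothetical roster: four responders, two on at a time, 8-hour shifts.
schedule = plan_shifts(["ana", "ben", "cho", "dev"], 2, 8, 24)
for start, end, crew in schedule:
    print(f"hours {start}-{end}: {', '.join(crew)}")
```

With four people and two per shift, each pair gets a full shift off before coming back, which is the property the rotation is meant to protect.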
There's this: the sun keeps coming, right? It's going to happen no matter what. Whether or not the problem's fixed, tomorrow is going to be here and time's going to keep happening, so I need people that are there to take over.
So, the other part about communication. There's communication internally, but there's also public communication going back out of the organization or the response team. The leader might not be the person communicating, but you as a leader can control that message and that phrasing. You control what comes in and out, the input and output there.
The main thing is you want to make sure you, or someone you've delegated to, is communicating quickly. If there's an incident, I want to know that it's happening. I think it was LastPass or something like that: I was trying to log in for like two hours and couldn't get in because they had an outage with their login service, but it was not communicated inside of their application. Extremely frustrating two hours of my life that I do not get back. Please communicate quickly, and in a way that reaches people quickly as well.
And we want to communicate often. I love status pages. They're wonderful when I'm trying to find out about a service I'm waiting for, because you have customers, whether they're internal to your business or external, who are waiting for you to come back up. Communicate often about what's going on. Just knowing that it's going to take another 16 hours allows me to plan my day, or if it's going to take five minutes, I can just say, oh, I can wait five minutes. Knowing the status of things is important.
And I think a key here is to be open and honest about the problems as well. When we have failures, it's really easy to react and say, oh, I don't want people to know that we screwed up. But I think we've got two companies, almost a year apart, that deleted production databases. "Yep, that was us. One of our engineers just deleted it." And that was what they said happened.
That was awesome. They were just totally honest about it and said, that's what occurred. So Gliffy did it, and GitLab did it, what, a month ago at this point? But what's interesting is that it was rewarded. No one said, how dare he. There's camaraderie here. Everyone's been there. Everyone's made stupid mistakes, and so when it happened, everyone said, oh, we feel for you, right? I hope all the best for you. We want this to improve. So everyone rallies around that honesty. You actually get this bonus effect for your company, as opposed to everyone saying, oh, how could you, how could you be dumb about that? So that's a major interesting thing.
This is kind of the summation of all of it, right? As a leader, your job isn't necessarily to fix the problem. Your job is to empower your team through the communication tools. Your job is to remove the barriers. You're trying to remove everything that's preventing them from doing the best they can. So as a leader in a crisis, you want to empower your team and make sure they can do the best job they can of bringing everything back up and running. That's my summation of it. Thank you.