paged in the middle of the night for a service you didn't build. So you're not entirely sure how to diagnose the problem, let alone fix it. So you start fishing for people across teams, adding everybody to a conference line, trying to find someone who can help, slowly getting more frustrated and resentful. And then it's total chaos. You've got 20 people on a conference line, everyone's talking over each other, no one knows who's doing what. People are doing things at the same time, so it's impossible to know what's actually helping the situation. Everyone's exhausted and confused. Panic sets in and people are yelling at each other. Hours go by and you still don't have a solution. Finally, the CEO gets word of this outage, tracks down that conference line, dials in and demands to know why there isn't a fix yet. Does any of this sound familiar? When you have an outage like this, every minute counts. Downtime can mean losing customers, which affects your bottom line. So it's important to have a streamlined approach to get to the fastest resolution, minimizing as much as possible your time to resolution and the cost of the outage. So let's get some definitions out of the way. A major incident is an unplanned service outage that actively affects customers' ability to use your product. Now some more operationally mature companies may have a formalized incident response process. This is basically a runbook for how to behave during an outage. It defines who needs to be involved, the steps you should take, how decisions are made and so on. And an incident commander is a formalized role in an incident response process. They're responsible for leading the response. They keep communication flowing, they're the final decision maker, they coordinate everyone and so forth. It's often abbreviated to IC, which I know is confusing because that also means individual contributor. 
But in this talk, if I slip and say IC instead of incident commander, I'm talking about incident commanders, not individual contributors. So whether or not you have a formalized process like that, it's common to think the person best suited to lead during a major incident would be senior technical leadership. They have the domain expertise, they have the authority and the credibility to call the shots. The problem is this isn't scalable. If your CTO or even a small group of tech leads needs to respond to every major incident, they're going to burn out. Not to mention, your tech leads are probably on call for their own teams and their own services. If you've also got them on an incident commander rotation, that duplicates their on call load, which is just terrible. So I'm here to tell you, you can streamline and scale your incident response process by welcoming non-technical people to serve as incident commanders. And I'm proof that this can be done without compromising response effectiveness. So a little bit about me, my background is in project and product management. I joined PagerDuty as a scrum master and I'm now leading our business intelligence or reporting team. And I'm PagerDuty's first non-technical incident commander. So I go on call to respond to major incidents. So I'm gonna talk about two main things today. First, what are the skills that make a really good incident commander? And then how can you train them so you can get more incident commanders to relieve that on call load? So incident commander skills. I'm gonna tell you the top skills that you should look for to target more people in your organization that could make really good incident commanders to tell them more about what this process is, get them up to speed and reduce that on call load. None of these skills are technical skills. 
Now although I am arguing that an incident commander does not need to be technical, I will say it is important to have a high level understanding of your system architecture. You do need to understand what your services do at a basic level, how they affect the product and your end users. For me, that was staring at a system diagram for a really long time. I even made some flashcards to try to memorize the silly names of our services, what they do in relation to our product and which team is responsible for each service. So I knew who to call when that happened. But the point here is you don't have to know how to build or fix anything in your system. You should just know at a basic level how the tubes flow, what it does at a very basic level. So barring that. First major skill for a really good incident commander is comfort with a structured process. So you need someone who is really good at remembering a process, following and enforcing the specific steps that need to happen. You wanna find someone who can remember those formalized roles in the process, who should be doing what? Because they're responsible for keeping things flowing, for following the process and leading the response. So you need someone who has just a knack for thinking through process. You want someone who can remember communication rules, because they are responsible for keeping communication flowing in a specific way. You wanna make sure they're really good at remembering all those rules for how you communicate during an incident. Because they are the final decision maker during an incident, they should remember the rules of how decisions are made. There is a formal way that we come to decisions during an incident. So this is a core competency for project managers, obviously. This was sort of my introduction of, hey, it's this crazy formalized process. I like crazy formalized processes. Maybe this is something that I can learn and get more familiar with. I wanna learn more about that. 
But it doesn't have to be project managers. Certain people just thrive in a more structured environment. They're more productive, they can communicate better when you have those formalized rules. So think about people in your organization who are comfortable in this sort of formalized, structured situation. Next top skill is communication. And it's a specific kind of communication. An incident commander should be really good at directive communication. They need to be able to tell people what to do. Even if it means telling your boss what to do. Or even the CEO to get off the call if they're being disruptive. Because that's the incident commander's job, to keep things on track, keep it moving and to be the director, the leader of the response. You need someone who is able to facilitate in a particular way that prevents just debating forever what we should be doing. They need to facilitate the sharing of information but always be driving towards decision making. We're always moving forwards with the response. We're not stalling because again, every minute counts. So they need to be able to facilitate in that particular way. Another really important communication pattern is being very specific about who you want to do what. So you don't just say, hey, could somebody look into this thing? You say, hey, Tom, I need you to look into X, Y, and Z. And you get that verbal confirmation that Tom is going to look into that. So very specific directive communication. To leave no openness about who might be doing what. You're making sure it's gonna happen. This is my natural communication style, which is not always the best approach at work. Especially I would say as a scrum master, I think you often need to take the opposite approach. So as I learned more about what an incident commander is, how they're supposed to behave, a lot of it really resonated with me and I realized that this could be a really good outlet for me to use my natural skills with this communication style. 
So think about people in your organization that may tend towards this more directive leadership style. They don't have to be a senior executive to have this style. They might be really good at picking up these kinds of communication patterns that are important for an incident commander. Next skill is time management. So the incident commander is using directive communication skills to specifically assign a task, and you wanna time box every assignment that you give. So the responder says, I will get back to you in five minutes with that. Always remember to give a time limit for what you're assigning, and then watch the clock and actually circle back at the end of that time box to get that report back. And the report may be, I'm not done yet, I need five more minutes. That's fine, but you give a time box and you follow up within that time box. Really important. And since the incident commander is also responsible for keeping that regular cadence of communication, you need someone who's really good at regularly checking in, always giving an update internally or to the stakeholders. Whatever the rules of your incident response define, there's usually some sort of timeline that you need to adhere to that the incident commander is responsible for keeping. Again, another core competency for project managers. But think about people in your organization that are really timely, that can keep a regular cadence of communication, that are good at time boxing tasks, reporting back when they need more time, et cetera. That is another really good skill for an incident commander. Now I just talked about a few skills that are very structured and directive, which are all important for an incident commander, but you do need to balance all of that with active listening skills. Especially when you're dealing with an incident commander that may not be highly technical themselves, they don't know how to fix the thing, they need to rely on the expertise of the responders on the call. 
So they need to be able to actively ask for and listen to the suggestions of the experts on the call. So for me, because I'm a structured person, follow a pattern of asking, are there any proposals for what we should do? What are the implications of those proposals? And based on that information and really understanding what they're saying, then I'm able to make a decision. So you have to be able to listen to the feedback of experts. You can't just, I'm the leader, I'm gonna decide. You have to get that input. You also have to be flexible. It's not about just strictly following a process. You have to be able to, as you get new information, change the plan. We're not just, I'm setting the plan and this is what we're doing forever. You get new information, you learn and you shift as needed. So it's not 100% structured. You do need to have this flexibility and active listening to all of the responders. The other important thing about listening during an incident response is even when you have this incident commander role trying to keep a handle on things and this formalized process, it can still be really chaotic. There's a lot going on. You may have different people assigned to different tasks. You're supposed to be checking in with different groups. So you really need to be able to take in a lot of signal at once and be able to keep up with this is what's happening. This is what these people are doing. I remember that. So I'm able to coordinate, lead, follow up, communicate among all of the stuff that's happening. That really requires some strong listening skills to keep up. So here again are those key skills for a really good incident commander. None of them are technical skills. So think about maybe junior people on your team or people outside of engineering functions, business functions may have these in spades. Project managers are an obvious choice in my mind. Product managers can make really good incident commanders. 
They often have that directive style of communication. People managers could be really good at this. Maybe even QA engineers could be really good at following that structured process day to day. So try to start thinking outside the box. It doesn't have to be your most senior engineers that can lead a response. You want people with these skills and that may not even be in engineering. So after you've found some people that you think have what it takes that could be a really good fit for an incident commander, you got to train them up. It's a critical role. As we said, money's on the line, every minute counts. So you really need to get them up to speed and confident with leading during a response. And that's hard because even if you're within engineering, maybe you're a senior engineer, you've been involved in plenty of major incidents. It's still really scary. It's still really hard. And I would argue maybe especially for people outside of engineering, it's even more of a black box. It's easy to think I couldn't ever understand what happens, I couldn't ever help. So you really need to find an inclusive way to open the door to those people that common knowledge, even within engineering, may be telling them, this isn't for you, this isn't a place for you. But you want more people to be able to scale your process. So you need to find a way to welcome and encourage them and get them up to speed to be effective during this critical time. So I'm gonna share with you the training process that I went through. And the key thing is throughout the process, I consistently felt welcomed and encouraged as part of the incident commander community, which really made a difference, made it so I can get through this and become a real incident commander. So remember that, that is very important. Okay, so first step in training, you need to specifically define the role of incident commander, step zero. 
And at PagerDuty, this took a lot of trial and error, we've really refined it over time. And the key distinction here is the incident commander is not responsible for finding a solution. They shouldn't be doing any mitigation or investigation tasks themselves at all. Their role is to coordinate the response, keep communication flowing and delegate all tasks. Delegate means they're not doing the tasks. They're telling other people to go and do those things. And just coordinating the response is enough cognitive load for any one person. It's not possible for any human, no matter how senior or technical or awesome you are, to keep a handle on this major response effort and be trying to fix it yourself. It's just too much. So this division of labor is really important to be effective. And by specifically defining the incident commander role in this way, you've eliminated the need for them to be highly technical at all. So that opens it up to potentially non-technical people to become incident commanders. So how do you snag them? At PagerDuty, we host regular incident commander office hours and we try to get people interested because when there's an outage, everyone knows. It affects the entire company. It's not just an engineering problem. So this is really an open space for us to explain what happens, how that might affect you, and eventually how you might help. So it's the time for you to explain the process, how we behave during a major incident, and explain the role of incident commander, emphasizing those skills, especially communication skills, and really trying to show people that you don't have to be technical to help. And to try to hit home the need for more help. On-call load on a very small group of incident commanders is rough, especially when you're duplicating it with responding to your own systems, maybe at a SEV-2, SEV-3 level. So this is a time for you to try to convince people, hey, you can do this and I really need your help. 
Would you like to learn more? So once you've hooked a prospective incident commander that understands the need and expectations for their help, put them on a schedule to start shadowing the current incident commanders. Now especially if you're working with non-technical incident commanders, it's likely they've never been on-call before. So hooking them up on a schedule will give them a feel for what that's like. They'll get paged when the real incident commanders get paged. They get woken up and interrupted to really build that empathy and to start to really drive home the need for more help. The more people we have on the rotation, the shorter your on-call shift is going to be. So they start to feel that pain themselves and it starts to motivate them, oh yeah, you do need help. If I can get added here, it'll be less load for all of us. But we're not just putting them on the schedule so they can feel the pain and know what it feels like. This is for them to learn. So when they get paged, they need to dial in to the response line. Just following along on Slack is not enough. You don't really get what's happening. So they dial in, but make clear they need to be a silent observer to avoid distracting from the response. It's critical time. We don't have time to answer questions when we're trying to resolve a major incident. So they dial in and they listen silently. But then after resolution, make yourself available to answer any questions they may have to make sure that they feel like their learning is being supported. And they observed this thing but they maybe didn't understand everything that happened. It's important for you to fill in the gaps for them after resolution. So after your trainee has listened in maybe to a few response calls, you can invite them to start helping as a scribe. So scribe is another formalized role in an incident response process. They're responsible for keeping a real-time log of everything that's happening during the incident. 
And it's a really low-risk way to start helping because you're still a silent participant. You're just logging everything that's happening. This is really helpful to analyze in the post-mortem to see what did we do? Why did we make that decision? How did we make that decision? What worked well? What didn't work so well? But it's not the end of the world if they miss one thing. We've also found that during longer incidents, it's really helpful to have people hand off roles. So this is actually the first way that I got involved in an incident response. We were having a longer incident and the person who had been scribing for about an hour or two was getting a little toasty, really needed a break. So he just kind of pinged me in the background, because I'd been around, I was sitting in the room watching everybody, and asked if I could take over as scribe so he could have a bio break. And I was a little scared at first but I just followed along, typed along, and it really made me realize that I can help. I can be involved. That was the first step where it became less scary because I was actively helping during the response. And just writing down everything that happens really cements your knowledge of that process because you're writing down each step. You're writing down the communication patterns. You're writing down the IC did this thing so I'm gonna remember that the IC does that thing. So it's a really great way to learn and to start building a little confidence. Another way that you can have your IC get up to speed is to have them practice during a failure Friday. So you may be familiar with Netflix's Chaos Monkey. It's the process of intentionally introducing failure into your systems to test resiliency. And we do this at PagerDuty every Friday. It's a great way to test your systems. It's also a really great way to test your processes and the human factor. So when we do a failure Friday we invoke our incident response process to test the process as well. 
And we always ask our current trainees who is available and willing to practice being an incident commander during a failure Friday. It's a very close simulation to a real scenario that they might encounter. So it's really excellent practice. And it's again, pretty low risk. It's during business hours, we're all sitting in a room together. You're probably gonna have one, two, three other incident commanders there to remind you what you should be doing. So this is a solid way to practice being an IC and again, gaining that confidence. You're one to one simulating what you would do in a real scenario. So it really makes clear, I can totally do this. Next is reverse shadowing. So at PagerDuty, we always have both a primary and a secondary incident commander on call because we're talking about major incidents. Redundancy is important in that kind of critical situation. So we always have two on call. The primary on call is the incident commander. There aren't two incident commanders. The primary is the incident commander. And the secondary on call is their deputy. And they assist the incident commander however needed. Most often that's tracking down additional responders that are needed for the particular situation. They often serve as the scribe. So to set a trainee up to reverse shadow, you just put them on the main schedule, which is great because then they're on the main schedule forever. So you set them up on the schedule and you reassure them you're never going to be alone. Not during this reverse shadow phase, not ever. There will always be a secondary with you to support you. Try to talk them down a little bit. Yes, it's the real schedule but you will have help. So when they get paged, both join the call and the trainee gets to be the incident commander. You announce on the call, I am the incident commander, so everyone knows. And the secondary should be helping them in the background. It's very important. You need to let them lead. 
Don't threaten their authority on the call. That becomes confusing and demoralizing. So you can send them tips and reminders in the background via like a direct message. This is also a great way for the trainee to rely on their deputy in the background to ask those questions of hey, am I understanding this correctly? I think I need to say this next, is that right? And the deputy is there to say yes, you're doing great. Just keep doing that. It goes a really long way to make you feel comfortable. And after I did this for the first time, it was a super intense experience. It's exhausting, you're super drained afterwards. But it was amazing, just having a buddy and doing it for real as a team. The sense of accomplishment and pride you feel after resolving a major incident makes all the stress leading up to it and during it totally worth it. So I thought it was a really great way to have an impact with the company. And that's it. Your trainee has successfully led an incident response. They've now graduated to a fully fledged incident commander. Celebrate them. They're having a direct impact on the health of your services, the happiness of your customers, all the while distributing on call load internally. That means less burnout, better outcomes. So here again is a summary of that training process. You gotta start by specifically defining the role. Remember, the incident commander is not responsible for finding the solution. They coordinate and delegate. Host regular office hours to demystify the process, to welcome people that aren't familiar with this world, to make clear the need for more help and the fact that you don't have to be highly technical to help in this way, to try to recruit those people. Then hook them up to shadow current ICs, to build some empathy and observe silently. Maybe invite them to help as a scribe, just keeping that real-time log to cement their knowledge of the process. 
Maybe practice as an incident commander during a failure Friday exercise and finally reverse shadow with the support, background support of a deputy to help them along and make them feel comfortable and supported. And I wanna reiterate the key thread throughout all of this. This is still really hard and scary. You need to make sure your trainee feels encouraged and supported, that they're part of the community, that you believe in them, you believe that they can do this because it's gonna feel like you can't for a really long time even when you're demonstrating that you can, it's gonna feel like you can't. So that encouragement, that consistent encouragement is really important to get these people through the end of training and get them on your real schedule. Now I understand why the other ICs at PagerDuty were so encouraging of me now that I'm on the schedule, I'm recruiting real hard for more incident commanders because it means a better schedule for me. It means more people, shorter rotation for me as well. So it's in everyone's best interest to train people really thoroughly to make them believe that they can do it. So DevOps is all about building empathy, breaking down functional silos. And I think we should expand that collaborative mission beyond just development and operation teams but across the entire company. There's no reason technical and non-technical people can't help each other for better business outcomes. Come and find me if you wanna talk more about incident response or how to welcome non-technical people into the DevOps community. If you wanna read more about PagerDuty's incident response process, we've actually open sourced most of our documentation. You can find it at response.pagerduty.com and that's it for me. I think we have a little time for questions. All right, just curious. Are ICs a full-time position or do you guys have other responsibilities? They're not a full-time position. It's just a schedule that you're on. 
Oh, and if you could repeat the question, sorry. How do you avoid badgering the troubleshooters? So we talked a lot about a regular cadence of communication. It's important to keep updates. It's important to time box things. Another pattern is silence is okay sometimes. Sometimes you just need to be quiet and let people focus on the work. So you give an update, you assign someone to do a thing, you give them a reasonable time box. You may even ask directly, is that enough time for you to get back to me with more information? Get a verbal confirmation that yes, I will get back to you in 10 minutes and then you be quiet so they can do it and you don't follow up until you get to those 10 minutes. Have you found any particular tooling or technologies that make it easier for the incident commander to do their job or alternatively have you found good ways for ICs to develop that tooling and bring it into the process? That's a good question. The super advanced tooling that some of us use is notebook paper. Just keeping track of, okay, I told this person to do this thing in five minutes and this is the time now, to sort of keep that log for myself. You also have the scribe basically doing that for you. So instant messaging platforms are super awesome. HipChat, Slack, to keep track of what did I just say? When am I getting back to that person? But I like having my own log right here. Yeah, not super exciting. For failure Friday, who is in charge of that? Is that the incident commander who kind of comes up with the whole scenario and makes that go? I think we have like a volunteer in engineering that really cares about the process that kind of leads it, makes sure it's scheduled in the calendar, tries to recruit people. We've got an incident commander chat room where we'll usually post, hey, reminder, failure Friday, this is the service we're taking down, any trainees available? 
So it is important to have a champion to make sure it happens, but it doesn't have to be the incident commander. That may be a really good idea, but you just need someone to make sure it's really happening to build that culture. So what about failing over incident commanders, like someone's sick or they have to go to the bathroom, or even you have someone managing an incident and they're not good at it? How do you pick someone from the current team or do you find a previous incident commander? That's another reason having both a primary and a secondary on call is so important because maybe the primary's in the shower and they don't hear the page. The backup's gonna join and be like, hey, primary, you here? No, okay, I beat you, I am the incident commander. And in terms of maybe someone's floundering, they're not doing a really good job. Anyone can take command if absolutely necessary or you can pass command. Like I'm really struggling, I feel like I'm treading water. Eric, can you become incident commander? Verbal confirmation, yes, I am now incident commander. Thank you for your help. That's totally okay. Besides the incident commander, who are your default attendees for the start of an incident response? Yeah, so we have the people that own the service that went down, so the on call for that service joins. We have the primary and secondary incident commander. We have a customer support person always join. They go on call for major incidents as well and they're our customer liaison. They're responsible for tweeting out to customers if we need to, and for following up on related tickets that come in. So they're on the call and have that information and can really represent the customer on the call. I think those are the big ones that we're working with now. We've been thinking about maybe an internal liaison as well. Usually the customer liaison is also responsible for internal updates. 
We found that that was a bit too much for the incident commander to do themselves but we may separate that role further as we scale as an organization. You may want a dedicated person to make sure executives and other stakeholders are aware of what's going on so the customer liaison can focus on customers and the incident commander can focus on the responders. Hi, I have two questions. The first one is... I'll allow it. Do you have a company policy that allows you to prioritize the IC role and drop anything that you're doing? And the second one is, do you get compensated in any way for your IC role? We do have guidelines for what qualifies as a SEV-1 or a SEV-2 based on the way that the product is impacted and the number of customers that are impacted. We have thresholds for that and incident response gets invoked when it's a SEV-1 or a SEV-2. If it's lower than that, the team just handles it based on their rules. In terms of compensation, I don't get a bonus or anything. I'm really motivated by reducing on-call load for my colleagues. I thought it was horrific that a lot of these engineers are on call for their teams and also as incident commanders. That's unacceptable, so I really wanted to help with that. I don't have a day-to-day on-call rotation so I want to help them. And it was a really great way to get visibility in the organization, to have a really direct impact. I'm helping resolve a major incident that is causing thousands of dollars of loss and really making our customers angry. I'm directly impacting that and that can be really exciting and rewarding. Awesome. Rachel, thank you. That was, I'm so impressed. That was awesome.
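As a footnote to the time-boxing pattern described in the talk (assign a task to a named person, agree a time box, stay quiet until it expires, then follow up), here is a minimal sketch of the kind of personal log the speaker says she keeps on notebook paper. It is only an illustration, not PagerDuty's tooling: the names `Assignment`, `TimeboxLog`, `assign`, `due_for_follow_up` and `extend` are all hypothetical, and the current time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Assignment:
    """One delegated task: who owns it, what it is, and its time box."""
    owner: str
    task: str
    assigned_at: datetime
    timebox: timedelta

    @property
    def due_at(self) -> datetime:
        # The moment the incident commander should circle back.
        return self.assigned_at + self.timebox


@dataclass
class TimeboxLog:
    """Hypothetical stand-in for the incident commander's paper log."""
    assignments: List[Assignment] = field(default_factory=list)

    def assign(self, owner: str, task: str, minutes: int, now: datetime) -> Assignment:
        # Always name a specific owner and always attach a time box.
        assignment = Assignment(owner, task, now, timedelta(minutes=minutes))
        self.assignments.append(assignment)
        return assignment

    def due_for_follow_up(self, now: datetime) -> List[Assignment]:
        # Stay silent until the time box expires; only then follow up.
        return [a for a in self.assignments if now >= a.due_at]

    def extend(self, assignment: Assignment, minutes: int, now: datetime) -> None:
        # "I need five more minutes" is fine: restart the time box from now.
        assignment.assigned_at = now
        assignment.timebox = timedelta(minutes=minutes)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 3, 0)
    log = TimeboxLog()
    log.assign("Tom", "look into X, Y, and Z", minutes=5, now=start)
    for item in log.due_for_follow_up(start + timedelta(minutes=5)):
        print(f"Follow up with {item.owner}: {item.task}")
```

The one design choice worth noting is that nothing is flagged before the time box expires, which mirrors the "silence is okay sometimes" advice from the Q&A.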