So I'm going to get this started, and we'll probably have some time for war stories if you all feel like it, if you don't leave me for happy hour. My name is Lisa and I am the VP of Service Reliability Engineering for Fastly. I started in tech about 20 years ago, just had my birthday. I started out on the systems administration route with ISPs and kind of moved on; it used to be called ops, then it was tech ops. I did database administration and leadership, and worked for a company called LiveJournal, where memcached was developed, so you're welcome, internet. From there I went on to other social media sites, worked for Six Apart, which had TypePad and Movable Type, and eventually went to Twitter, where my job was to kill the fail whale. If you don't know what the fail whale is, then I did my job, so that was cool.

Today I'm going to be talking to you about Fastly, which is where I am now. We have a team of about 17 SREs, and I'm going to be talking about how we run our incidents. If you don't know what Fastly is: we are... wait, hold on... our Edge Cloud platform enhances web and mobile security and content delivery. We basically make the web faster. We're distributed globally, with over 10 terabits of network capacity across five continents. Chances are, if you're online on a given day, you've hit our network at some point, because we cache some of the world's largest websites, or sites that are aspiring to be.

Before I get started on incident management and how we've approached it at Fastly, I'm going to take you through a story of something that happened to me over the holidays. I woke up on a plane to an image like this. Having been on call in ops for 20 years, I can wake up pretty quickly and get my thinking cap on. Within seconds I was awake and made sure my shoes were on. And then I basically thought to myself: wow, what an amazing opportunity to think about how incidents are managed. Because here we are on a plane that's supposed to get me to a destination, with the people flying the plane and the people serving me drinks, or trying to be nice to me, who suddenly have to go from their regular jobs to being incident responders. They're on a plane; there isn't a group of people sitting in the back waiting for this scene to happen. So I was thinking, this is amazing, because not only did they get the plane to land (I'm here today), they also made sure that I was calm and felt safe. Well, I mean, I didn't feel safe. I felt like I was about to die. But besides that, at the end of it, I felt like I was in safe, capable hands the whole time.

In addition to what the people on the flight were doing, there was also a whole bunch of communication going on behind the scenes, because as we landed, we sped immediately to the gate, so all the other planes had to be rerouted, and a fire engine met us. There was a whole variety of other incident responders and processes happening within seconds. So not only did I leave alive, I also left feeling very positive about Delta, my preferred airline, because I don't know what would have happened if this was on United. Thankfully, I don't have to deal with life-and-death situations in my job. I do, however, have to deal with managing all the types of incidents that you all do if you're working in tech.
We host folks like Twitter and The New York Times, folks that need their data and their platform available constantly, because while my job isn't about life and death, those platforms are providing information that might be life-or-death, critical information for people. It can also be a platform for critical information during, for example, an election, where something can be changing minute to minute, second to second.

When we first looked at why you'd want to have an incident management program (I came in a year and a half ago), one of the first steps, if you're looking at doing this yourself, is to think about all the ways that you anticipate your service or a service dependency is going to fail. Don't be afraid to sit there and categorize them all, because chances are they will happen at some point, and that's the first step in understanding how to be that incident responder.

So what's the goal of an incident program? For us, it was about helping us make decisions, which should also lead to reduced time to repair; making sure we're communicating internally and to our customers; and making sure we're constantly improving for the next time. We decided to borrow from FEMA's incident management program, which again is built for much more critical, life-and-death situations, but also has a lot of parallels to the unexpected types of issues that we see on the internet.

If you don't have this right now and you're starting out, these four basic areas are the places to start. You've already categorized all the ways you know things are going to fail; think about what each of those would mean to you, to your business, and to your customers. What's the impact of that failure? Start from there. Then determine what the severity looks like, as in, how severe is it? This is a traditional term from ITIL. We're in DevOps land and we try not to take too much of the traditional stuff with us, but some things are useful. SEV0 through SEV3 give us a vocabulary, so we can all understand how we're supposed to respond right now. From that impact and severity, you can set up roles and the steps you would take to respond, and then define how you're going to deal with it afterwards.

Coming back to the idea of impact, here's why we talk about impact. By the way, that's what I say every day: what's the impact to the customer? Our customers, and our customers' customers, care about how this impacts them. I don't care whether or not you understood the bug; I care about how the bug is impacting the customers. Without that, and maybe you have this in your own situations if you don't have a NOC or an ops team just sitting there waiting for this to happen, it can be confusing for engineers to understand how much attention they should be paying to something. And sometimes you get executives who say, "I think this is the most important thing right now," but it may not actually be; it might just be the issue that person decided they cared about today. You have to set yourself up with an understanding that things are going to fail, how they're going to fail, and what the impact is. Otherwise you're in that sort of constant state of confusion. When the smoke happened on the plane, to me that seemed very serious. But the flight attendants had seen it before. They knew what to do.
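If you're doing that first step yourself, it can help to write the catalogue down somewhere shared. Here's a minimal sketch of that exercise, with entirely hypothetical services, teams, and severities (none of these come from Fastly's tooling): anticipated failure mode, what it means for customers, and the severity you'd assign up front.

```python
# Hypothetical "catalogue your failure modes first" sketch: anticipated
# failure -> customer impact -> agreed severity -> owning team.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FailureMode:
    name: str          # what breaks (your service or a dependency)
    impact: str        # what it means for customers and the business
    severity: str      # shared vocabulary: "SEV0" .. "SEV3"
    owning_team: str   # who gets paged when it happens

FAILURE_CATALOG = [
    FailureMode("upstream object store outage (e.g. a vendor like S3)",
                "customers' origins unreachable, elevated errors at the edge",
                "SEV1", "sre"),
    FailureMode("transit provider drops offline",
                "higher latency and packet loss in one region",
                "SEV2", "network-engineering"),
    FailureMode("status page stale during an incident",
                "customers can't tell whether we know about the problem",
                "SEV3", "customer-engineering"),
]

def playbook_for(failure_name: str) -> Optional[FailureMode]:
    """Look up the pre-agreed impact and severity so responders don't have
    to debate them in the middle of an incident."""
    return next((f for f in FAILURE_CATALOG if f.name == failure_name), None)
```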
They knew we still had to assume the emergency landing position, which turns out to be a great way to keep a plane full of people from panicking and yelling: it gives them something to do. On the internet, if everything is an incident, then nothing's an incident. You just have chaos every day. That's why it's really important, and if you're trying to launch this at your own company, it's really important to get your team and your leadership to buy into this idea.

Here's an example of how we define the severity levels at Fastly. The reason we take the time to distinguish a SEV3 from a SEV0 is that you can learn as much from an incident with smaller impact as you can from a larger one. Sometimes postmortems, and urgency in general, are reserved for the really big issues. But if more of the SEV3s go through a process, you'll actually find yourself with fewer SEV1s. In our case, we review our incidents, SEV0 through SEV3, every week, regardless of whether they impacted a large number of customers.

An example of a big incident was the S3 outage. Did anyone go through that a couple of months ago? Yeah. That one was gigantic in terms of the impact to our customers, but it had nothing to do with us; it didn't actually impact our own infrastructure. We knew right away, though, that our customers would be impacted and that we needed a response right away. So we called an incident and treated it just the same as if something had broken within our own infrastructure. And because we run this process so often, we posted a status update to our customers 40 minutes before Amazon updated their dashboard.

This slide is actually the core of my beliefs about how technology and our industry need to change as we think about incident management, which is that ultimately, these are people. And we have needs. We need to sleep and eat. We need time to ourselves so that we can be creative and think about how to avoid incidents in the future. The business has to keep running despite internet weather, but if you don't take the time to set up a process like this and establish things such as on-call schedules, then you may be requiring your engineers to be on call implicitly rather than explicitly. I'll talk a little bit about how to get engineers to be on call, but I think that's the number one issue, the number one question I get: how do you actually get people to answer their phones? How do you get people to get engaged? Because this is a 24-by-7 business. And the answer is, well, you're nice to them. And over time you can make the case that, rather than feeling guilty because you didn't answer your phone that one time, it's better to just call out what's real: who's actually going to be the one that fixes this issue? Are they the only person who can fix it? If that's true, let's not have it be only one person. But if it is true, why don't we just write it down? Why don't we declare it and document it? That shows leadership: hey, we only have one person who's on call; we only have one person who knows how to fix this. And through that process, especially if you can hire more people or figure out other ways to address the single points of failure, you will be taking better care of the people who work with you and for you.
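The severity definitions themselves can live in a small shared table. Here's a minimal, hypothetical sketch of what a SEV0 through SEV3 matrix might look like; the impact descriptions and response steps are illustrative assumptions, not Fastly's actual definitions. The one thing it does take from the talk is that every severity, SEV3 included, goes through the weekly review.

```python
# Hypothetical severity matrix: a shared vocabulary for "how are we
# supposed to respond right now." Thresholds below are illustrative only.
SEVERITY_MATRIX = {
    "SEV0": {"example_impact": "widespread customer-facing outage",
             "public_status_post": True,
             "notify_executives": True},
    "SEV1": {"example_impact": "significant degradation for many customers",
             "public_status_post": True,
             "notify_executives": True},
    "SEV2": {"example_impact": "degradation with a workaround, or limited scope",
             "public_status_post": False,
             "notify_executives": False},
    "SEV3": {"example_impact": "minor issue, little or no visible customer impact",
             "public_status_post": False,
             "notify_executives": False},
}

# The point from the talk: every severity still gets a written report and
# goes through the weekly incident review, not just the big ones.
for definition in SEVERITY_MATRIX.values():
    definition["weekly_review"] = True
```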
So Fastly doesn't have a NOC, a traditional NOC. Once again, bucking the trend. With the DevOps movement, a lot of companies are moving away from traditional NOCs and moving away from traditional ops. You still have problems, though, so you still need people who are paying attention and fixing stuff. In our case, we have a global customer service engineering organization, and our SREs are also globally distributed. Through that, we have basically around-the-clock coverage of customer tickets as well as our internal monitoring systems.

We also have decentralized on-call and monitoring. For every service, for every project at Fastly, there is a team listed and an on-call schedule for that team. What that means is that at any point, anybody, whether a customer service engineer, an SRE, or a network engineer, can reach out and page the appropriate team, and they're empowered to do so. Our customer service engineers are smart, and they can probably debug issues faster than a traditional NOC could. It also means we save time: in a traditional NOC setup, it often takes longer to address an issue because you have to go through those hops. In this model, because engineers are empowered to do their own monitoring and have their own on-call, we make sure the right people get the alert in the fastest way. All the engineers are also included in our post-mortem process and incident review. So if we're going through an incident review and find that it took 10 hours for someone to respond to an alert, that's an opportunity for that engineering group to improve, and we encourage them to do that.

With this decentralized on-call, there still has to be someone calling the shots. So we do have a pager rotation for an incident commander. It's a volunteer position, a week-long rotation, and it's all people who have other jobs: VPs, directors, managers, in customer service, sales engineering, network engineering, SRE, all over the place. The reason for that is we believe that diversity of people and roles acting as incident commander gives us different perspectives, so we learn more. When a customer service engineer is the IC, we learn more about the customer's experience. When it's a network engineer, we learn a lot more about the network with each incident. And as long as they stick with the process we have, which is really about coordination and not about actually hacking on the keyboard, we learn more.

To be in that role, though, a person needs to have trust. You can't hire someone and throw them into this role the next week. They did try to do that with me, and it sucked. It's good for them to know how your service runs. They need to be comfortable posting information for customers. They need to be comfortable talking to our executives and giving them a summary of how something failed. This is another thing they can do: they can tell people what to do and when, and they can tell people to stop doing things. That's another question I get: how do you get engineers to stop making changes, which we all know is the worst thing to do in the middle of an incident when it's not coordinated? And this person, because they have that trust, is able to just tell engineers to knock it off.
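As a rough illustration of that decentralized model, here's a hypothetical sketch of a service-to-owner registry with a helper that pages the owning team directly. The pager client is a stand-in for whatever your paging provider exposes (PagerDuty, Opsgenie, a CLI), not a real API call, and the service and team names are made up.

```python
# Hypothetical sketch: every service has a listed owning team and an
# on-call escalation, and anyone can page that team directly.
from typing import Protocol

class Pager(Protocol):
    def page(self, escalation_policy: str, summary: str) -> None: ...

# Decentralized ownership: service -> owning team -> that team's escalation.
SERVICE_OWNERS = {
    "edge-cache":  {"team": "sre",                 "escalation": "sre-oncall"},
    "api-gateway": {"team": "platform",            "escalation": "platform-oncall"},
    "billing":     {"team": "billing-eng",         "escalation": "billing-oncall"},
    "backbone":    {"team": "network-engineering", "escalation": "neteng-oncall"},
}

def page_owning_team(pager: Pager, service: str, summary: str) -> str:
    """A CSE, SRE, or network engineer routes a page straight to the team
    that owns the affected service, skipping a central NOC hop."""
    owner = SERVICE_OWNERS.get(service)
    if owner is None:
        # A service with no listed owner is itself a gap worth raising
        # at the weekly incident review.
        raise KeyError(f"no on-call owner registered for service {service!r}")
    pager.page(owner["escalation"], summary)
    return owner["team"]
```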
I'm not going to go into too much detail on this one, but if you have questions, we can talk about it some more. This is the actual process that we follow. We see that there's a problem, either because our customers notice or because we notice; we always want to notice before our customers, but sometimes we get a ticket. We have internal monitoring, we have external monitoring, we have like a million monitoring systems, and those get escalated. Everyone in the company is empowered to page the incident commander, so if someone thinks there could be even a SEV3 issue, they can page the incident commander. We immediately move into a communications channel, a text-based channel, or sometimes a video bridge. We mitigate the problem first; we triage the patient. And at the same time, in parallel, we have another track handling customer communication and a status post. Those three things are the things we practice the most, and that is the reason we're able to update our status as quickly as we do.

After we've mitigated the problem, the IC's job is basically to say: this is actually over now, let's move everything else somewhere else. Because you know what happens: you get all the people in the room, and they're like, "No, but you know what we should do? We should totally never rely on S3 for anything ever again," and then you get that conversation going, and everyone's like, "But we're in the middle of an incident, why are we talking about that right now?" Or someone says, "Let me know if you want me to ping my friend at Google," and it gets very confusing. Our incident command process is super transparent; everyone in the company can watch as we're doing our incident response, and they do. So we have to keep things concise and not confusing, and the IC's job is to say, OK, this is done, let's move the rest to other areas for communication.

Then they make sure they clean up. Do we notify the executives? Do we need an FSA, which is our service advisory? Then we write the incident report: the timeline, the reason for the outage. We do the five whys until you get to the point where you're like, "I don't know. Why is the internet?" Then we document it for a weekly review. In the weekly review, we basically go through and ask all the questions again that the person who wrote the incident report asked. That's how much detail we put into each of our incidents. And then, as I said, continuous improvement: each of those follow-up items is a JIRA ticket, and we're assigning them and reviewing them weekly to see whether anyone has made progress. Do we really need to remove the S3 dependency? Those kinds of things are talked about weekly.
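To make that flow concrete, here's a rough sketch of the lifecycle as plain data rather than a real workflow engine; the field names and steps are assumptions for illustration, not our actual tooling. The shape it tries to capture is that mitigation and customer communication run as parallel tracks, and everything ends in follow-ups that get reviewed.

```python
# Hypothetical incident lifecycle sketch: detection -> IC paged -> parallel
# mitigation and customer comms -> close-out -> report and weekly review.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Incident:
    severity: str                   # "SEV0" .. "SEV3"
    impact_summary: str             # what the IC keeps current for customers
    mitigated: bool = False
    status_posts: List[str] = field(default_factory=list)
    followups: List[str] = field(default_factory=list)  # become JIRA tickets

def run_incident(incident: Incident) -> Incident:
    # 1. Detection already happened (monitoring, customer ticket, etc.)
    # 2. Page the incident commander; open the text channel or video bridge.
    # 3. Two tracks in parallel: mitigate, and keep customers informed.
    incident.status_posts.append(f"Investigating: {incident.impact_summary}")
    incident.mitigated = True       # stand-in for the actual triage work
    incident.status_posts.append("Mitigated; monitoring for recurrence.")
    # 4. IC closes it out: notify executives? publish a service advisory?
    # 5. Incident report, then the weekly review of every open follow-up.
    incident.followups.append("Write incident report (timeline, five whys)")
    incident.followups.append("Review follow-up tickets at weekly incident review")
    return incident
```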
Something I think is also unique to Fastly is that we involve everyone: not just the incident commanders, but marketing, legal, HR, IT. Because we're so transparent in our incident command, we find out things we wouldn't otherwise have known. We basically crowdsource our timeline and our service advisory. Maybe someone in marketing would phrase something a little differently in our FSA; we always aim for transparency, but there might be a way to phrase it that won't be as confusing for customers. It also takes the pressure off engineers having to know how to do everything; other people can answer those questions for you, and I think they appreciate being involved with the process.

One thing about this whole transparent, involve-everyone approach is that you will get a lot of volunteers in the company who want to help. And you love it, their hearts are in the right place, it's well-intentioned, but it's not the right time for everyone to come in, give their opinion, and offer to help. So there are ways we train and talk about how to use those different people, because we don't want to turn away offers of help.

Do you remember the Dyn outage that happened last year? Anyone get hit by that? Yeah. I was the incident commander that day. This is a story that was on the CNN front page; it's pretty big news. So of course everyone in the company wanted to be involved, every engineer, or half the engineers. In that case, once we had the core group of people working on our own mitigation, I set up multiple video chats and multiple working Slack channels, and basically gave each group a leader and a job. Research what happens if Dyn never comes back. Research BIND: how many people here still remember BIND? I remember a story I heard from one of our customers, that they had moved everything to the cloud and no one knew how to use BIND anymore, because they were all just young cloud engineers, and they had to rely on the one person in the corner, you know, the Unix sysadmin I'm talking about. So we had a group of people working on technology decisions, and at appointed times we'd check in with them. What that meant was the core group could focus without a bunch of extra opinions, and we were already working on our plans A through G in parallel. That's what I can suggest you do if you have too many people interested in helping you.

I can't stress the continuous improvement enough; I already talked about this slide, I jumped ahead. And finally, as part of this process, I like to say that every incident is practice for the next incident. But it's also good to regularly do exercises where you walk through what you would do in the case of whatever incident. Once a quarter, we get lunch and get a group of people around a big conference table, and it's basically like a big game of D&D. Somebody comes up with a scenario, we all have our characters, and we walk through trying to figure out how to solve the problem and see if we can actually solve it. At the end, we write up what our follow-ups would be if it were a real event, and then actually do those as if we really had that incident.

We do onboarding for new incident commanders, which is also a great way to surface what you do in practice that isn't obvious. We just had a couple of new incident commanders recently, and there were gaps, areas where things weren't as smooth as they usually are. That's not a problem; that's actually an opportunity for you to improve how you train. Be willing to have it go a little less smoothly so that you can gain another incident commander, or gain insight into your process. Don't throw them out the window. I think it's really common in ops to decide that only three people know how to run an incident and therefore never let anyone else do it. But that's not how we improve. That's not what DevOps is.
So, yeah, start with your basics. Take the time to be really honest with your company about what breaks. Do the thing where you go, "oh, well, I guess we only have one copy of this data," or "we have thousands of copies of this data in the cloud; what would happen if that got compromised?" Go through those questions if you have time. Otherwise, you'll just go through it when it happens. Remember that your engineers are people and that you should be empowering them to deal with their workload. They're not going to do you any favors if they're overworked. They're going to quit, because they can get another job in this industry. Or you're not going to have awesome products, because they're too busy working all night on boring, uninteresting incidents. Always be improving, and what that means is being comfortable with admitting what's not working. Partner with everyone who's a stakeholder; it's not your job alone to make sure the internet never breaks. And let those incidents teach you about yourself and about your company. That's it for me. It's a short one. Does anyone have any questions or stories?

I have one quick question. You mentioned that during an incident you have two teams, one mitigating and one providing customer-facing communications. How is the customer-facing team informed at a level where they know what and how to communicate? Because what we run into sometimes is that the people trying to fix the problem are being asked to provide details about the problem while they're fixing it.

Yeah, that's definitely super common. The role of the incident commander is to always understand the impact, which is essentially what gets posted to customers. If you ask an engineer, they might say, "oh, there's a bug in our cache" (that's not a great example, that doesn't happen a lot) or, "this network just dropped offline on the internet." That's not something we can tell customers, because it's not relevant to them, basically; it's not that we couldn't tell them. So it's the incident commander's job, and this is why they also have to be fairly technical. They don't have to know how everything works, but they have to understand what the impact to a customer would be if that network engineer just told me that this transit provider dropped offline. It is the role of the incident commander to always know the current impact, and then we have another team actually doing the communication. So the incident commander, in our Slack channel, which is often where we coordinate, is posting that impact every few minutes: here's what's going on, here's what we've done. And the other thing is that when we post to the status page, it automatically updates into the Slack channel where the coordination is happening, so it helps people understand what the last thing we told customers was. If we have customer engineers who are reaching out directly to a customer, in a Slack channel or in a support ticket, that customer support engineer's job is to handle the communication with that customer, and they know through their own training how the CDN works. So that's another element: they're all fairly technical people handling customer coordination and communication. And yeah, it doesn't really work to ask an engineer every few minutes what the heck is going on, so I'm familiar with that.
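For the status-page-into-Slack piece specifically, a minimal sketch of a relay might look like the following. The status provider's webhook payload shape here is made up, and the Slack incoming-webhook URL is a placeholder you'd configure yourself; posting a JSON body with a "text" field to a Slack incoming webhook is the real part.

```python
# Hypothetical relay: mirror every public status-page update into the
# incident coordination channel so responders always see the last thing
# customers were told.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def relay_status_update_to_slack(status_event: dict) -> None:
    """Post a status-page update into the coordination channel."""
    text = (f"Status page updated ({status_event.get('status', 'unknown')}): "
            f"{status_event.get('body', '')}")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Whatever receives the status provider's webhook would call something like:
# relay_status_update_to_slack({"status": "investigating",
#                               "body": "Elevated errors for some services."})
```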
Any other questions? I can get into tools. I can get into how you get people to agree to be on call; that's a fun one. I can get into what to do if people just don't want to do this. Yeah.

Yeah, so the question's about the tools we use. To help me remember, I'm going to go back through the process. Monitoring-wise, I'll start from that layer: we have Nagios, Ganglia, Datadog, New Relic, Catchpoint, Sumo Logic, there's more. Oh, Graphite. The reason I'm naming all of those is that one of the things that's hard about tools is that everyone, especially in DevOps land, likes to pick their own tool, and it's difficult to have a coordinated view of the world when you have all those different tools. So we have guidelines on what pages people versus what is just informational. And then of course we have dashboards for the high-level, critical-path data that tells us about the health of the network. So there's that.

I mentioned Slack for coordination. We use PagerDuty for paging. And we also use a video conferencing tool for when we need real-time conversations that are more like brainstorming. One of the challenges with that, and we're actually tackling this right now, is how you handle some people being only on Slack and some people on the video. We're looking at changing that, because we want the real-time ability to troubleshoot and brainstorm, but what you need is a transcriber who writes down all the decisions that get made in the video conference. So that's a job that we have as well.

JIRA is used to collect data about incidents. When there's an issue, SEV0 through SEV3, a JIRA ticket gets created for tracking purposes, but that does not kick off the process. The process is really kicked off with a page and with the Slack channel. We have the ability to page people through Slack as well, and we're adding to that functionality now; we really do try to keep as much in one kind of room as possible. We're also exploring tools like VictorOps to get all the alerts and incident response in one web interface, basically. They're a pretty good option. What else do we do? We write up our incident reports on the wiki. We don't use email; it drives me crazy, but nobody uses email anymore. So all of the communication is really over Slack and Confluence, basically. Are there specific tools that you want to know about?

Yeah, yeah. There's a fair amount of tooling I think we could be introducing. For example, I think some companies do a Slack channel per incident instead of there being one. You do that, yeah; I've been curious about how you coordinate that. In our case, the IC can just see what's going on in the one channel. So what we've been thinking about doing is keeping the main incident Slack channel, and having it just link out: OK, go over to this other Slack channel for this particular incident, with some sort of standard naming scheme. I'd be curious to hear your thoughts on that, though. The ticket name is the channel name. Okay, that's a good idea. Yeah.
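If you did go the channel-per-incident route, a minimal sketch using Slack's Web API could look like this. conversations.create and chat.postMessage are real Slack Web API methods, but the token, ticket key, channel names, and naming scheme are assumptions for illustration, and error handling is omitted.

```python
# Hypothetical "channel per incident, named after the ticket" sketch.
import json
import urllib.request

SLACK_TOKEN = "xoxb-placeholder-token"  # placeholder bot token

def slack_api(method: str, payload: dict) -> dict:
    req = urllib.request.Request(
        f"https://slack.com/api/{method}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {SLACK_TOKEN}"},
    )
    return json.loads(urllib.request.urlopen(req).read())

def open_incident_channel(ticket_key: str, impact_summary: str) -> str:
    """Create e.g. #inc-sre-1234 from a JIRA key like SRE-1234, and announce
    it in the main incident channel so people know where to follow along."""
    name = "inc-" + ticket_key.lower()
    slack_api("conversations.create", {"name": name})
    slack_api("chat.postMessage", {
        "channel": "#incidents",  # the single main channel links out per incident
        "text": f"Tracking {ticket_key} in #{name}: {impact_summary}",
    })
    return name
```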
The good thing about the JIRA tickets, even if you don't like JIRA (I think they won, though), is the categorization, which I didn't talk about. We categorize as much as possible: whether something was a capacity issue, a network issue, a vendor issue. Which, again, goes back to what people used to do back in the day, because over time we can look at trends and say, OK, what percentage of our incidents over the last six months were related to a particular vendor issue, or to a capacity issue? That helps tell us where to invest our time in reliability, although we don't have as many data points as you might like statistically for that to be useful, which is a good thing.

Any other questions? All right, you get more of your evening back. You can still hit up happy hour. Feel free to reach out to me if you have any more questions. Always exciting to talk about incidents. Have a good night. Thank you.
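As a footnote on that categorization point, here's a small illustrative roll-up of incident categories over a time window. The incident records, keys, and category names are made up for the example; in practice they'd come from an export of the categorized tickets.

```python
# Illustrative trend roll-up: percentage of incidents per category since a
# given date, using hypothetical data.
from collections import Counter
from datetime import datetime

incidents = [
    {"key": "SRE-101", "opened": datetime(2017, 2, 28), "category": "vendor"},
    {"key": "SRE-112", "opened": datetime(2017, 3, 14), "category": "capacity"},
    {"key": "SRE-120", "opened": datetime(2017, 4, 2),  "category": "network"},
    {"key": "SRE-131", "opened": datetime(2017, 5, 19), "category": "vendor"},
]

def category_share(incidents, since: datetime) -> dict:
    """Percentage of incidents per category since a given date."""
    recent = [i["category"] for i in incidents if i["opened"] >= since]
    counts = Counter(recent)
    total = sum(counts.values()) or 1
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}

print(category_share(incidents, since=datetime(2017, 1, 1)))
# {'vendor': 50.0, 'capacity': 25.0, 'network': 25.0}
```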