So we'll go ahead and get started. Hi, my name is Ilan Rabinovitch, and I'm with Datadog. Today we'll be chatting about postmortems and some tips for doing them effectively within your IT organizations, although, as I've been chatting with folks in the audience while we waited to get started, a lot of these lessons apply to really any endeavor you're attempting. I was chatting with Jeremy in the back about postmortems for conference organizing; we get to do those a lot as well. So even if you don't own some sort of production service, hopefully some of these lessons work for you in other places.

Again, I'm Ilan, and I'm Director of Technical Community at Datadog. I get to engage with the open source community on our various open source projects: we have an open source agent that takes plugins, we have various SDKs and libraries, lots of different pieces. I chat with folks about expanding that community, getting people engaged to build and contribute, and capturing interesting stories around monitoring and metrics. My background is primarily in web operations, monitoring systems, and automation tools, at places like Ooyala and Edmunds.com, building monitoring systems, often failing, and then finding things like Datadog to save the day. In addition, I organize a number of open source community events; SCALE, down in Los Angeles, is one of them, and we run an annual postmortem for those events too.

I'll keep the Datadog advertisement very short, but hopefully you all have a chance to meet us at the booth on the expo floor at some point. We're a hosted monitoring solution. We collect metrics and events from all of your applications and infrastructure: everything from your containers and your schedulers like Kubernetes, Mesos, and Docker, to cloud providers, to open source projects like Redis, Cassandra, and MySQL, all the way up and down your stack. The idea is to give you insightful dashboards and intelligent alerts that avoid the pager fatigue we all hate, and to collect all that data in a way that you can use it for these types of postmortems, as well as for making data-based decisions in your environment. We're collecting about a trillion metrics a day at this point, to give you a sense of our scale. And we're always looking to hire folks to work on our open source projects, and to help make other open source projects more monitorable, maybe by extracting more metrics from them. So if you're looking for an opportunity, datadog.com/careers, or I'm happy to chat with folks afterwards.

So, postmortems. Let's dive into the actual talk. This is a quick quote from an internal Datadog developer guide discussing some of the challenges of building these large distributed systems, and it applies to more than just Datadog. The systems we engage with these days are distributed and complex, more so than ever. We've got schedulers moving the pieces out from underneath us, from one server to the next, maybe from one region to the next. We've got auto scaling going on in our cloud environments.
We've got crazy n-tier architectures with SOA and REST APIs and all the different jargon you hear as you wander around these tech conferences. And all the pieces interact in ways that are much more complex than they might have been ten years ago, when you had a clear three-tier architecture or a static website. There are a lot more pieces that can break or interact in unintended ways.

So the problems we work on are often really hard, and they don't necessarily have obvious solutions. We need to hone our skills so that we know how to troubleshoot, how to investigate, and then how to learn from our failures afterwards so that we don't repeat the same mistakes; there are enough new mistakes to make that we don't need to repeat the old ones. Encountering problems is going to be part of your job, especially in an operations environment, whether that's for a conference, a large-scale monitoring system, or whatever else you might be working on.

Before I keep going, just out of curiosity: how many folks in the room run some type of postmortem after their failures? Cool. How about after what you might call a success? You might call that a retrospective instead, but it's a similar idea, right? Cool.

As we go through these troubleshooting exercises, it's important that we're embracing and learning from these failures. There's a quote from Henry Ford: "The only real mistake is the one from which we learn nothing." Never lose an opportunity to learn or to better yourself as a result of an incident, an outage, or a failure. It's an opportunity to fix that technical debt you missed in the past and knew you had to address. It might be an opportunity to fix some communication challenges within your team. It might be an opportunity to be even more successful at the thing you already did well: how do we repeat that? So it's important to come back and look at how we did, whether or not we'd classify it as a complete failure.

So what do we mean by postmortem? It's a discussion of an event, held soon after it occurred, especially in order to determine why, if it was a failure. Some people feel the term postmortem is a little negative, so, as I said, retrospectives or reviews are also fine names. Either way, the goal is to learn so that we prevent repeated failures.

Along the same lines, one of the things I like to do in my talks is define the focus area, and one of the things I like to define really quickly is DevOps. How many folks would say they engage in some sort of DevOps, for whatever they define it to mean? Okay, lots of hands. These two guys in the picture on screen, John Willis and Damon Edwards, coined the acronym CAMS: Culture, Automation, Metrics, and Sharing, as the four pillars of DevOps. As I start talks, I like to steer into which of those four corners we're going to focus on. To give you a sense: culture is the idea that we're working together, seeing the problem as the enemy rather than each other, and looking to collaborate. Automation is the idea of scaling with code rather than with people.
Metrics means measuring what we're doing so that we know if we're getting better or worse, and sharing is the idea that we take our learnings back and help each other be more successful in the future.

We're mostly going to skip over automation, but I do like to point out this example: there's not a lot of automation here, yet these folks have been repeatedly raising barns for a couple hundred years without many failures, and a lot of it comes down to culture. I want to say there's no automation, though I suppose pulleys and ropes are automation compared to doing this by hand. The point is: we don't need to argue about our Puppet versus our Bash, or our Datadog versus our Nagios. Let's first get on the same page about what we want to do as a team, because then we can do incredible things like you see here. The reason they're able to be successful is that they approach it collaboratively; many hands make light work, and there's not a lot of infighting over who gets to hold this rope, or whether we're using a rope made of camel hair versus synthetic twine. They're just getting it done. They're also not fighting over whether they need specialists or generalists in their barn-raising DevOps. In some cases they do have specializations: not everybody off the street gets to work the joystick or certain pulleys. So no matter what your flavor of DevOps, there are lessons to be learned here. And once you're successful and get that crane, build yourself a bunch of barns really fast, but you'll still be communicating, right? The guy in that crane has a walkie-talkie somewhere, and he's letting somebody know he's about to drop a big container so they don't get crushed. We all know that people, not technology, are the bottleneck. I assume we all have open headcount; I don't know about you, but I can't hire people fast enough to fill the roles I have. That's my biggest challenge in general.

So again, we're going to focus on culture and sharing; those are the pillars this talk generally covers. Postmortems in general fall under sharing and the idea of learning across our teams and our organization. Blameless postmortems, which we'll get to a little later, fall into the culture bucket: working together rather than against each other.

So what do I mean by blameless postmortems? It's a term coined by John Allspaw; he's got some great writing on it, and I've got links later in the resources. But what does it mean to be blameless? Somebody made a mistake, and you still want to be able to have a conversation about it. It's not "oops, I spilled the milk, that's it, we're done," and you don't confess your sins in a postmortem and find yourself absolved. If there are mistakes, we need to address them and make things better, because otherwise we're not actually continuously improving, and that was the whole point.
Having a blameless postmortem actually means that the engineers whose actions contributed to the accident or the issue can give a detailed account of what actions they took, what they observed, and what made them take those actions. What was the context that led to the mistakes they made or the outages they encountered? What assumptions did they make about the business or the technology? What was their understanding of the timeline of events as they occurred? And the most important part, where the word blameless comes in (or "blame-aware," as some people are calling it these days), is that they need to be able to do this without fear of the pitchforks coming out. People can't fear punishment or retribution. It can't be: I made a mistake, I fat-fingered something, I took down the website, and now I fear that when I admit it, I'm going to lose my job. We need to go back and ask: why was I able to do that? Why did I make that mistake? Why did I think those were the right actions to take? So put away the pitchforks; it should never be about blame.

This is where the culture piece comes around. When you first join a team, you're going to observe the culture around you. If you're in a situation where there's a lot of infighting, and you feel like you'll get fired every time you admit a mistake, you're not going to be successful in what you're attempting, because you're going to fear that failure, and more importantly, you're going to fear explaining why you failed and learning from it. If you're in an organization where that's a problem, there's a serious culture issue you need to address first, before you can get into the practice of holding these postmortems effectively. Otherwise you're going to get in a room and start lobbing blame grenades around, and whoever catches the most of them loses: they lose their job and they move on. A lot of this just comes down to culture. I can't give you culture, but I can point you in the direction of what I feel is important, point you to some resources, and hopefully as an organization you can agree that you want to work together, not against each other. It's a challenge; I've been in organizations on both ends of the spectrum, and there's a middle ground, as always.

Again, John Allspaw has done a lot of writing on blameless postmortems. There's also a great book from O'Reilly by Dave Zwieback called The Human Side of Postmortems. Both are great resources if you want to dive deeper. The slides will be up online later, so don't feel like you have to take pictures of slides or jot down bit.ly links.

Talking about it all is great, but how do we measure what we're doing, how do we measure whether we're being successful, and how do we use that to inform our postmortems? The key part, of course, is metrics: the third of the four DevOps pillars we talked about earlier.
If you don't have metrics, whether they're about monitoring your systems, your teams, or your response times to incidents, you really don't have a sense of whether you're getting better, getting worse, or staying the same. It's like driving down the street with your headlights off, or your wipers off in the rain: you're doing it blind, and it's not responsible. Let's start collecting that data.

There's a hilarious account on Twitter, Honest Update; you should follow it. These are some tweets that reinforce the point: "A postmortem would require us having some idea of what just happened," and "Our metric collection failed during what you're calling an incident, so as far as we're concerned, it didn't happen." There are a dozen of these a day, and they're always spot on. How many of you have had incidents where the only action item in your postmortem is "I promise next time I'll have some monitoring in place so that I catch it and it doesn't happen a third time"? Is that a question, or are you raising your hand to admit it? Cool; usually people are kind of sheepish under the table.

Collecting data is cheap. Ten terabytes of S3 storage, with a PUT every three seconds, runs something like $315 a month. Stop pretending we need to throw data away; collect as much of it as you can. If you don't, it's going to be expensive to generate again later: going back and trying to recreate the events of a security incident, a technical outage, or what you did or didn't say on a conference call is nearly impossible. So it's very expensive when you don't have the data, and cheap to collect when you do. One research survey of 300 companies put the average revenue loss for an hour of downtime at $21,000. Amazon's numbers are a bit higher, as you can imagine: if Amazon.com goes down, I believe it's about $117,000 to $118,000 for a one-minute outage. Your organization may not be at that scale, but there's definitely a cost, whether it's to trust within your organization or to your customers. Collecting this information so that you can identify what happened, and when, is key.
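To make that storage math concrete, here's a quick back-of-the-envelope sketch in Python. The prices are my assumptions (roughly 2016-era S3 list prices), not figures from the talk; check current pricing before relying on this:

```python
# Back-of-the-envelope cost of keeping your data, using ASSUMED
# 2016-era S3 list prices.
S3_STORAGE_PER_GB_MONTH = 0.03   # USD per GB-month (assumed)
S3_COST_PER_1000_PUTS = 0.005    # USD per 1,000 PUT requests (assumed)

storage_gb = 10 * 1000                 # 10 TB of metrics and events
puts_per_month = 30 * 24 * 3600 / 3    # one PUT every 3 seconds

storage_cost = storage_gb * S3_STORAGE_PER_GB_MONTH
put_cost = puts_per_month / 1000 * S3_COST_PER_1000_PUTS

print(f"storage: ${storage_cost:,.2f}/month")            # ~$300
print(f"puts:    ${put_cost:,.2f}/month")                 # ~$4
print(f"total:   ${storage_cost + put_cost:,.2f}/month")  # in the ballpark above

# Compare to flying blind: at the surveyed average of $21,000 per hour
# of downtime, one extra hour of outage pays for years of storage.
print(f"one hour of downtime: ${21_000:,}")
```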
So how do we recommend doing it? Our first step is to categorize the metrics from your organization and your team into three buckets: work metrics, resource metrics, and events. To give you a sense: work metrics are the things your customers come to you for. It's an API call, it's a car, it's a widget you sell. It's the thing customers will wake you up in the middle of the night to complain about, not necessarily all the things behind it. Those are what we call work metrics; they're the things you really have incidents about, the things that impact your SLAs. Resources are the things that go into producing that work, and events are the things that provide context.

In this example, we're a donut factory. Work metrics indicate the top-level health of our systems; they let us measure their useful output. We break work metrics into four types. Throughput: requests per second; maybe how many donut orders I've gotten today, or how many people are sitting in line at the register. Success: in the case of a web server, that might be 200 responses; in the case of donuts, how many showed up with the holes intact and the glaze they needed to have. Errors: how many tasted bad and had to be thrown out, or came out a little clunky. Performance: response time; when somebody walked up to my register and ordered a donut, how long did it take before they got it? Or for your API call, how long did it take before we responded?

Resource metrics are all the things that go into it. Continuing the donut store example, these are the inputs to making those donuts. Utilization: how much flour do I have? How many of the ovens and conveyor belts am I using? Saturation: how much queued work do I have? Availability: do I have enough of the resources I need available, whether that's the baking staff or the equipment? These are really important for diagnosing problems as we respond to incidents, as well as for building our timelines afterwards. Without them, we can't tell a complete story.

Events, on the other hand, are things that happen to us: discrete occurrences that provide context for what happened. If Homer here is being force-fed a bunch of donuts because we've got to keep up with the devil's orders, we've got to generate a ton of donuts right now; that might cause a large backlog and an increase in orders, and it changes how the system behaves. If you're doing a code deployment, you've now changed the behavior of your environment, and you want to know what happened and when. Events are context: they explain why something occurred in your organization or in your metrics, and they're important for telling the story of why your system's behavior, or its useful output, changed.

There's a whole other talk about which of these to alert on; the short of it is that work metrics are the things you want to wake up for. Use the resources and the events to troubleshoot, and then to build your postmortems afterwards and figure out how you could have done better.
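As a concrete illustration of those three buckets, here's a minimal sketch using the datadogpy DogStatsD client. It assumes a Datadog agent listening locally on the default DogStatsD port, and the metric names and tags are hypothetical, invented for the donut example:

```python
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

# Work metrics: the output customers actually care about.
statsd.increment("donuts.orders", tags=["register:1"])               # throughput
statsd.increment("donuts.served", tags=["register:1"])               # success
statsd.increment("donuts.rejected", tags=["reason:no_glaze"])        # errors
statsd.histogram("donuts.order_latency", 2.7, tags=["register:1"])   # performance

# Resource metrics: what goes into producing that work.
statsd.gauge("bakery.ovens.in_use", 4, tags=["site:downtown"])       # utilization
statsd.gauge("bakery.orders.queued", 12, tags=["site:downtown"])     # saturation

# Events: discrete occurrences that provide context.
statsd.event(
    "Deployed bakery v2.3",
    "Rolled out new glazing code to all registers.",
    alert_type="info",
    tags=["deploy", "site:downtown"],
)
```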
So what are the qualities of a good metric? They need to be well understood: you should be able to quickly look at a metric and determine how it was captured and what it represents, and it should be named well so you know what it means. They should be common to everybody involved: if you use different terminology or different units of measure across your teams, you'll have your own Mars Climate Orbiter disaster, where things crash into planets because you thought feet and somebody else thought meters.

They should be granular. If you're collecting metrics too infrequently, or you're taking averages, you'll find that a lot of the time your metrics are lies. The average number of requests across the whole day of the Super Bowl? Probably not that high. The average during the 15-minute window around the ad break where your ad played? Probably pretty high. You need to be granular enough, in your monitoring systems and in whatever other data capture locations you work with, that you don't lose important behavior to averaging.

Make sure you're providing enough facets on your metrics as well. You want to be able to look at them by region, by data center, by task, or by person, and figure out how they all play together. Make sure you can slice and dice, drill down, and look at the data from the angle that's most important to you.

And keep them around for a while. A lot of times we have incidents that hurt us slowly: the response time got worse with every release until one day we breached our SLA and didn't know why; a very slow death of an incident. Keeping metrics around also lets us look at seasonal cycles and figure out whether we're better this year than last, especially around big events. You come to LinuxCon with a booth, you sell a ton of whatever it is; you want to make sure your site can handle that next year.

So the idea is that you figure out what your work metrics are, figure out what your resources are, and recurse. The most important thing, as you work on your team, is knowing what you're responsible for. If I have a web application and my job is to return API calls, or to return donuts, that's my work metric; that's what I measure my success on. But I should know what my resources are, which might be a data tier below me, or a bag of flour I need from the supply guy. Those resources are, in turn, his work metrics, and his resources are the people who supplied the bits he needed to provide that service upstream. If we all know what our work metrics and resource metrics are, and where they point, we build a dependency graph that's quite useful for troubleshooting, planning, and figuring out our postmortems. As you work an incident, you examine work metrics, dig into the resources behind them, figure out what changed in terms of events, and keep going.
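Here's a toy sketch of that idea: a hand-maintained dependency graph where one tier's resources are the next tier's work metrics, plus a walk that turns a failing work metric into a set of resources to investigate. The service names are invented for illustration:

```python
# Each service's work depends on these resources, which are themselves
# services with their own work metrics. All names are hypothetical.
DEPENDENCIES = {
    "storefront-api": ["order-service", "cache"],
    "order-service": ["postgres", "cache"],
    "cache": ["ec2-network"],
    "postgres": ["ebs-volumes"],
}

def resources_to_check(service, seen=None):
    """Walk the graph from a failing work metric down to candidate resources."""
    seen = seen or set()
    for resource in DEPENDENCIES.get(service, []):
        if resource not in seen:
            seen.add(resource)
            resources_to_check(resource, seen)
    return seen

# storefront-api latency spiked: which resource metrics should we pull up?
print(resources_to_check("storefront-api"))
# {'order-service', 'cache', 'postgres', 'ec2-network', 'ebs-volumes'}
```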
The next key lesson around postmortems is knowing when to have them. Your primary goal during the incident itself needs to be restoring your service. Don't start postmorteming during the incident. If somebody here is taking notes about how I messed up in this talk (and I'm sure there's a list; I'm happy to take it later), don't come up right now to tell me about it, because I still have to finish the talk; I still have to finish the work product in front of us. I'll come back to it later, and we'll have a conversation. It's fine to take notes and investigation points for later, but don't derail the actual investigation with your postmorteming. Trying to conduct it during the incident will only distract us from getting our jobs done and getting back online.

So now that the event is over and we're collecting data for the postmortem, who should be doing it? It's easy to say everyone, but to use an analogy my colleague Jason likes: the responders are like the police. The identifiers are the witnesses; they're the people, or the monitoring systems, that saw the outage happen and let you know. Affected users are the victims of the crime. You want them all involved. A detective working a traffic accident is never going to talk only to the one guy who was behind the wheel; he's going to talk to everybody involved and figure out what was going on. It should be the same as you build your postmortem, whoever ends up being the coroner writing the report.

But what are we trying to collect? As John Allspaw says, we want to collect all the information that will let us know whether we made the right choices during the incident, not just why the incident occurred. What did people do? Why did they do it? Why did they think it was the right thing to do? And how does it all connect? It's important to identify the false indicators: how we reached the wrong conclusions that may have extended the outage, or caused it to begin with. Make sure you're not asking yes-or-no questions, because you can't get nuance that way; ask open questions. People like to talk about the "five whys" from Toyota's lessons in their factories, but "how" and "what" are also quite important, so dive into those aspects as well. The key is that we're not looking to blame any individual as we do this. People need to feel safe as they tell their story, and we want to share the blame as a team, or avoid it entirely.

The other thing: technical issues often have non-technical causes. It may be a human misunderstanding, or an event outside the context of your applications or services; there are lots of aspects that may have nothing to do with code that changed. Here's a real-world example. Back in 2014, a cloud provider, Joyent, who tend to do a lot in the container space (I'm sure you've heard of them), experienced a big human-induced outage: they shut down an entire data center on the East Coast, and it made all the headlines. They wrote a great postmortem, which I've linked in the slides, explaining what happened. The short answer is that a sysadmin fat-fingered a command and shut down all of the servers, rather than a subset of new servers. We've all been there at some point in our careers. But one thing you'll note, if you look at that postmortem, is that nowhere in the action items will you see blame or pitchforks. You're not going to see an action item of "fire that guy, he's careless." What you'll see is the why: a review of the context that led the engineer to run the command; he thought he was doing maintenance on some of their controllers. And the how: a review of the tooling failure that allowed the mistake to happen. In this case, there was no validation on the command: when the engineer didn't provide the scoping arguments the tool expected, it simply shut everything down.
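This is not Joyent's actual tooling, but here's a sketch of the kind of guardrail that postmortem points toward: a destructive admin command that refuses to default to "everything" and demands explicit, validated scoping before it acts. Everything here is illustrative:

```python
import argparse
import sys

def parse_args(argv):
    parser = argparse.ArgumentParser(
        description="Reboot compute nodes (illustrative only).")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--nodes", nargs="+",
                       help="explicit list of node IDs to reboot")
    group.add_argument("--all", action="store_true",
                       help="reboot EVERY node (dangerous)")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    if args.all:
        # Fleet-wide destructive action: force a typed confirmation
        # instead of blindly doing what we're told.
        answer = input("This reboots ALL nodes. Type 'reboot everything' to proceed: ")
        if answer.strip() != "reboot everything":
            print("Aborted.")
            return 1
        targets = ["<every node>"]
    else:
        targets = args.nodes
    print(f"Rebooting {len(targets)} node(s): {targets}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```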
And then: what's the next action, the fix? These are all key things to have in your postmortem. In this case, it was a tool that should have had a bit more validation: check that the arguments you expected were actually provided, rather than blindly doing what you're told. And the lessons they drew were explicitly not about educating or chastising the engineer. They call out in their blog posts and postmortems that the guy already felt bad; he was beating himself up before he could even stop the command from running. There's no win in making him feel worse. And you can't prevent typos. We're all going to make mistakes: you didn't sleep last night, you didn't get your coffee, you literally fat-fingered something as you typed. These mistakes are going to happen; you can't prevent them, so we have to build tools that stop them from becoming outages. That's very clear in this postmortem: everything they focused on is how to make sure this won't happen again from a tooling perspective, not a human one.

So let the people involved in the postmortem write down their story in more than just bullet points. This helps them clear their thoughts and make sure that what they're thinking is actually what they mean to say. It turns out that when you write something down, you start to crystallize the actual message, rather than leaving it at a couple of words. This is not Twitter; we're not looking for 140 characters, we're looking for the real story of what happened.

There's a Chinese proverb, or, depending on who you ask, possibly a 1950s advertisement for a car dealership, that a picture is worth ten thousand words. Make sure you're enriching your story and your timelines with graphs and snapshots from your monitoring tools. That might be time series graphs from something like Datadog or Ganglia or one of the many other time series monitoring tools out there. Pull those metrics together and use them to tell a story. Caption them with what you think happened at what time, and overlay events on top of them so you know where things are going. As an aside, Datadog has a new notebooks feature that lets you snip portions of your graphs and drop them into a timeline to tell a story; happy to demo that for somebody at some point. But even if you have a physical notebook and you're just jotting things down and pasting graphs in, that's fine as well. They're really going to help provide context for your story.
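If your tooling doesn't do this for you, even a few lines of matplotlib can produce a captioned, event-annotated graph for the timeline. This is a generic sketch with made-up data, not output from any incident in this talk:

```python
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

# Hypothetical p95 latency samples, one per minute, around an incident.
start = datetime(2016, 3, 1, 12, 0)
times = [start + timedelta(minutes=m) for m in range(60)]
latency_ms = [120] * 20 + [450] * 25 + [130] * 15

# Hypothetical context events to overlay on the graph.
events = {"deploy v2.3": start + timedelta(minutes=18),
          "cache nodes replaced": start + timedelta(minutes=44)}

fig, ax = plt.subplots()
ax.plot(times, latency_ms, label="p95 latency (ms)")
for label, when in events.items():
    ax.axvline(when, linestyle="--", color="red")
    ax.annotate(label, xy=(when, max(latency_ms)),
                rotation=90, va="top", fontsize=8)
ax.set_xlabel("time")
ax.set_ylabel("latency (ms)")
ax.set_title("Checkout latency around the incident (illustrative)")
ax.legend()
fig.savefig("incident-timeline.png")
```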
So when do we want to do this? As soon as you can, again without doing it in the middle of the incident. Memory drops off exponentially after about 20 minutes, and it keeps going from there. I went to a training event run by J. Paul Reed, a consultant in the DevOps space, and at the start of the session he played us a recording of an airplane landing that involved a failure, telling us to pay attention to what was happening. At the end of the one-hour session, he asked us to write down the timeline of events. Not a single person in the room got it right, and we had been told to pay attention. In your incident, you'll have the same problem if you don't write things down quickly. It's very easy to fall into the trap of "man, I'm tired today, I'll do it tomorrow"; then, oh, Bob's on vacation, we'll do it when he gets back; and the next thing you know another incident arises and you push it back again. Now you've pushed it out days, weeks, months, and you don't actually remember what you did or why. You may have the metrics, but you don't have the context of what was going on. So get in the habit of doing this as soon as possible; don't let yourself slip on it.

As you gather the data, make sure you're not putting people in an awkward position. People are obviously stressed; they've just been dealing with a crisis. Putting them on the spot and having it feel like an inquisition is not ideal. Be sensitive to that, but get at the data. If they're stressed or concerned, as we said earlier, they're not going to give you their honest view of what happened; they're going to try to protect themselves. Sleep deprivation contributes to memory loss too: if you've been waking up at three in the morning every night to deal with these outages, you're probably not going to remember exactly what happened either, and burnout falls into the same bucket.

Biases are also a key thing. Even when people want to give you a correct view of what happened, it's easy to give a biased response unintentionally. These are just a few of the biases; there are a ton more. I was looking at the types of biases that might be relevant here on Wikipedia, and there's something like a 200-item list of different biases that have been identified in humans. Anchoring: we rely too heavily on a single piece of information; we're sure it's this one thing over here, and we've totally missed the big set of issues over there that we could have learned from. Hindsight: we think things are obvious or evident now that wouldn't have been at the time; it's 20/20 hindsight, Monday-morning quarterbacking: "of course we should have done that, it would have been obvious." It turns out maybe it wasn't obvious from the data, and identifying that bias tells us what to address in our tooling so that we're collecting the right data next time. Availability bias: we overestimate the value of events we can easily recall and underestimate the ones we can't; again, a reason to take good notes as you go. And the bandwagon effect: one of our colleagues said it must have been this memory leak that caused the outage, so we all want to go focus on the memory leak, joining the bandwagon rather than investigating all the aspects.

So let's talk a little about how we do a lot of this at Datadog, which we think is particularly effective. Hopefully this aligns with some of what you're doing, and with ways your organizations can run similar postmortems.
One of the things that's important is having these postmortems occur on a regular basis, so that people are accustomed to them. They don't have to happen only for the P0, world-stopped outages; they can happen for smaller things, as we said earlier. Having them regularly builds up good habits, lets us learn from our peers, and helps us avoid the cultural issue we were asked about before: what if people are afraid of these? Well, if there's one every week, and people see that nobody is losing their job over having made a mistake, they quickly adapt to that culture.

The way we do it: when a postmortem is compiled, it's emailed company-wide. Primarily engineering is reading it, but also product teams and other parts of the organization. You want to be fairly open with these; this is the sharing part. I realize not every organization can do this, and sometimes there are compliance reasons to be a little more circumspect about what occurred, but ideally you share as much as you can, because people can't learn without it, and otherwise they'll make the same mistake you did. The other thing we do is schedule a recurring review meeting. Once a month the whole team gets together, and the leads on any given postmortem report to the whole company, in person, what they believe happened, the lessons learned, the action items that came out of it, and where we stand on delivering them. That means people get to have a conversation with that person and ask questions about things the core postmortem team might have missed as it developed the postmortem.

So what do these postmortems contain? The primary thing is impact on our customers. We always want to keep our eye on the business goals and make sure we're addressing our customers' needs; otherwise, as a company, what's the point of doing what we do? We try to describe what happened at a very high level: a short but specific summary, like the abstract of a scientific paper. It doesn't need to go into all the details, but it should cover the impact on customers, the severity of the outage, what was affected, and ultimately what it took to resolve. Again, this is a summary; it's the abstract at the front, not the paper itself.
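Pulling together the sections described in this talk, here's one possible skeleton, expressed as a Python string you could drop into whatever tooling files your postmortems. The headings and wording are my arrangement, not Datadog's official template:

```python
# A postmortem skeleton based on the sections described in this talk.
# The wording and ordering are illustrative, not an official template.
POSTMORTEM_TEMPLATE = """\
# Postmortem: <incident name> (<date>)

## What happened (summary)
One short, specific paragraph: customer impact, severity,
what was affected, and what it took to resolve.

## How was it detected?
Which metrics and monitors caught it? How long from start
to alert, to notification, to status page update?

## How did we respond?
Who was involved and who owned the incident. Timeline with
graphs, chat archive links, what went well, what didn't.

## Why did it happen?
Prose, not bullet points: context, contributing factors,
and the false indicators that slowed us down.

## Action items
- Now:   drop everything (it happens again tomorrow otherwise)
- Next:  next sprint or two
- Later: next month or two; beyond that, quarterly planning
"""

print(POSTMORTEM_TEMPLATE)
```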
Here's an example. We had an incident in March that resulted in some impact to our customers, and we were fairly open about this postmortem; there's a link to it at the end. As a company called Datadog that focuses on monitoring and teaches people to run postmortems, we feel it's important to make ours available to our customers. In this case, two sets of applications (the code names don't particularly matter) were blocked on accessing a cache they needed, which resulted in increased latency and 500 errors. Our monitoring system caught it: you can see we brought Pingdom data into Datadog and were looking at that graph over time. Yes, we do use Datadog to monitor Datadog. You can also see there are no events overlaid on this graph, so there were no changes on our end at this particular moment, at least in this view; it wasn't something we did that caused it to occur. Everything seemed normal, except that some cache nodes seemed overloaded. So at the top we have a quick summary: what we believe happened, how it impacted our customers, what we did to resolve it, and the length of the outage.

We then have additional sections that go into more detail on the impact. Customers might have intermittently seen a down page as they tried to access the site. We considered it a major outage and classified it as such; we have SLAs to adhere to, and we want our customers to know whether or not we're breaching them. And then we talked about how we solved the problem: we replaced some of the cache nodes with larger ones, and the problem went away.

The next question to ask yourself: how was this detected? Did you have the tools in place to figure it out, so you can catch it the next time? Metrics and monitors are how we improve our response time. We tend to focus on mean time to resolution, not mean time between failures: failures are going to happen; you just want to make sure that when they do, you catch them quickly and resolve them quickly. This section helps you understand where you can improve. Did we have the right monitors in place to know what was going on? Was there a metric that showed the outage, and a monitor on that metric? How long did it take us to declare the incident? In another snippet from that same postmortem: we had multiple metrics that caught it, with links to the graphs that show it. Was there a monitor on the metric? Yep, and we were alerted almost immediately; it took us less than three minutes from the start of the incident to sending out notifications and updating our status page. That's pretty solid. And you can see we've got graphs all across the board showing the metrics around this.
The next section, part three of our template, is how we responded. We list who was involved, who owned the incident, and who drove it to resolution. We grab archives from our chat service; we use Slack, but whether you use IRC, Jabber, HipChat, or Mattermost doesn't really matter. You pick the tool, but capture wherever your incident call was occurring. We talk about what went well and what didn't, and outline that here. In this particular case, names have generally been changed to protect the innocent, or the guilty, whichever it may be. But we cover who responded, with links to tagged graphs, a quick timeline, and the chat archives.

As an aside: if you're not adopting ChatOps in your organization, I encourage you to go look at what it is. I know my initial reaction was, "I don't want to do all of this in chat, that's absurd." It turns out that doing all of your work in an open environment like that means people aren't left wondering what happened when. If you can run your commands directly from your chat service (we actually do a lot of our deployments by talking to a bot in our chat room), there's no question as to who clicked what and when; it's all right there, in the order it occurred. Everybody has visibility into what happened, during the incident and after it, and having it all logged is fantastic for running those incident calls. If you run your calls by phone, that's also fine; just make sure you have a scribe writing down what's going on during the incident so you can remember later. Either way, it makes it super easy to build timelines.

On tracking learnings as we go: we talked about the incident not being the time to run the postmortem, but it's fine to track the learnings as you're working. One of the things we do is drop a message in the chat room tagged with the hashtag #postmortem or #lesson. People know it's not the time to debate it right now; I may or may not agree, but when we come back to the chat archives later, there's a marker saying here's a thing we should have thought about and should discuss further.
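Those tags make the later timeline-building step almost mechanical. Here's a toy sketch that pulls tagged messages out of a Slack-style JSON export; the field names ("ts", "user", "text") follow Slack's export format, but treat the whole thing as an assumption about your own chat tool:

```python
import json
from datetime import datetime, timezone

TAGS = ("#postmortem", "#lesson")

def tagged_timeline(export_path):
    """Yield (time, user, text) for messages tagged during the incident."""
    with open(export_path) as f:
        # Assumes a list of {"ts": ..., "user": ..., "text": ...} objects.
        messages = json.load(f)
    for msg in sorted(messages, key=lambda m: float(m["ts"])):
        if any(tag in msg.get("text", "") for tag in TAGS):
            when = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
            yield when, msg.get("user", "?"), msg["text"]

if __name__ == "__main__":
    # "incident-channel.json" is a placeholder path for illustration.
    for when, user, text in tagged_timeline("incident-channel.json"):
        print(f"{when:%Y-%m-%d %H:%M:%S} {user}: {text}")
```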
And then, finally: why did it happen? This is the technical deep dive, where you start to get into what caused the incident, and where you're writing prose at this point instead of bullet points. Our CTO, Alexis, has a great presentation where he walks through the postmortem I've mentioned here; there are links on the slide, and at a meetup he actually ran the postmortem in front of folks so they could see how it works. I recommend you take a look. The TL;DR is that this one took a while to find the root cause: we had to narrow it down to Redis and the network, then dig into CPU, then dig into the maximum capacity of EC2 instances. It was not a short investigation, but we managed to capture it all and build a very accurate timeline using the graphs and metrics we had collected, which was quite helpful in preventing it from occurring in the future, adjusting our alerts, and so on. There's a lot of circumstantial evidence in many of these cases that will point you in entirely the wrong direction. It's easy to say "that looks like the easy one," check the box, go back to your day job, and then just have the incident occur again.

Which brings us to the most important part of all of this: how do we prevent it from happening again? Make sure you're filing issues. We use GitHub and Trello to track our work; you might use JIRA, ServiceNow, something else, or an actual physical notebook. Make sure you're creating these issues and cards and plans, and tracking them through to completion; otherwise, what was the point of this entire exercise? If there are related issues you discovered along the way, note those as well so they can be tracked.

But be honest with yourself; action items should be legitimate. I've been in postmortems with the action item "we should just retire that app." Well, you've been trying to retire that app for five years and you haven't done it yet. That's not an immediate postmortem action item; that's an aspirational goal, and it doesn't belong in this postmortem. Instead, break your action items into three categories. Now: we're going to drop everything we're doing and address this issue; these tend to be things where you know the incident will happen again tomorrow if you don't. Next: maybe in the next sprint or two, whatever your sprint cycles look like. Later: the closest to aspirational I'll get, maybe in the next month or two. If it's outside that scope, it's likely not a fit for your postmortem action items; it's something to track in a quarterly or long-term plan for your team. It's not directly actionable, your team isn't going to focus on it, and it will go into a backlog that festers and grows endlessly, like a queue.

Make sure your business owners are part of these conversations and discussions as well. They shouldn't feel blindsided when you add a bunch of things to "now" and "next" and suddenly all the features they wanted aren't coming through. You need to have that conversation with the wider organization.
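For the GitHub side of that, here's a hedged sketch that files bucketed action items through GitHub's issues API via the requests library. The repository name, token variable, label names, and issue titles are all placeholders for illustration:

```python
import os
import requests

REPO = "example-org/example-app"    # placeholder repository
TOKEN = os.environ["GITHUB_TOKEN"]  # a personal access token with repo scope

# Action items from the postmortem, bucketed as described above.
ACTION_ITEMS = [
    ("Add validation to node-reboot tooling", "now"),
    ("Add monitor on cache TCP retransmits", "next"),
    ("Evaluate larger cache node types by default", "later"),
]

for title, bucket in ACTION_ITEMS:
    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/issues",
        headers={"Authorization": f"token {TOKEN}",
                 "Accept": "application/vnd.github+json"},
        json={"title": title,
              "body": "Action item from the 2016-03 cache outage postmortem.",
              "labels": ["postmortem", f"action:{bucket}"]},
    )
    resp.raise_for_status()
    print(f"filed #{resp.json()['number']}: {title} [{bucket}]")
```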
Here are some examples from that postmortem. We had tickets to add some kernel and Redis tuning tweaks, with links not only to the tickets but also to the commits once they were done, so people could come back and see the learnings. We added monitors on TCP retransmits and error rates for that particular application. And even there, the conversation continued: somebody popped in later and said, "I don't know that I agree that's the right monitor; maybe we should look at something different." These remain a conversation even after the incident has occurred; it's a continuous learning exercise.

So, a quick recap: we want to cover what happened, how we detected it, how we responded, why it happened, and what we're going to do next. And always focus on the impact to customers; fixing things for the sake of fixing things doesn't win you anything.

Some additional resources. We talked about John Allspaw earlier; he has some great writing on his blog and on the Etsy blog, and there are short links for those. We mentioned blameless postmortems, and somebody was asking how you avoid blaming people for failures; J. Paul Reed will argue you should be having blame-aware postmortems, where it's okay to identify who was at fault, but not to lob those grenades at each other, and to make sure we understand how to work together. He has some great writing there as well. And of course our CTO Alexis has written the series of posts called Monitoring 101, describing how to monitor your environment.

So with that, we can start the postmortem on this particular talk. I'm happy to answer any questions you might have about my decisions or what caused me to make them, and we can go from there. Thanks for sticking with me through the end of LinuxCon; you're definitely dedicated for being here at the 4:30 hour rather than at the bar. The slides are not up yet, but I'll upload them in the next couple of minutes, to the LinuxCon site as well as the SlideShare and Speaker Deck sites.

[Audience question about feeding in data from other monitoring tools.] Sure, the front row here, and then we'll work backwards. We have a Zabbix integration that will bring data into Datadog; we have one for Nagios, and one for New Relic. We want to bring all of your metrics into one place. Yeah, we have somewhere between 150 and 200 integrations with various open source projects; that could be a talk in and of itself, although it would be more of an advertisement, so I try not to. Zabbix is one way; you could also use Datadog to replace it entirely, but if you already have things in place, it's a great way to feed data in from other monitoring systems. And I'm happy to chat afterwards if you'd like to talk about Datadog more.

Somebody else had questions? [Audience question about spotting patterns across postmortems.] Yeah, there are a couple of things packed into that question. The first is: how do you identify patterns across postmortems? That's one of the things our monthly review gives us. About every month we get together as an entire engineering team and review all of our postmortems, and as each person presents, you very quickly go, "wait, I didn't realize we've hit that multiple times," or "I didn't think those things were connected, and now I see it." The other thing is that sometimes the issue isn't that one network link between two hosts was down, but rather: why did I care? That's another one to look at. Maybe we need to ask why an issue in that particular window could have impacted me at all, rather than the load being distributed across multiple points of failure. A lot of this relates to your own architecture and environment, but making yourself more resilient to failures is often a better use of your time than investigating the individual system failures.

I think that was you, yeah? [Audience question about who owns postmortems.] Sure. Within Datadog, if you built it, you own it, for the most part. So when I say operations, I mean that's a bucket of responsibilities within your engineering team. If you build the API that takes in metrics, you own it. That was similarly the case where I worked at Ooyala. But just because I own something that had an outage doesn't mean that somebody else downstream doesn't have postmortem work to do so that I don't have that outage again. And that's where my point about involving business owners is important. I think one of the biggest mistakes folks make as they tackle incident reviews and postmortems in general is that they don't include their product managers. At the end of the day, everybody wants to make the right decision for the company, whether they're the product manager, the ops person, or the developer, in every role. That's really the lesson of DevOps: we want to work together as a team rather than in silos with competing, locally optimized goals. And if you include your product manager, who's sort of the CEO of your application, he wants to make his service, his feature, his segment of the company profitable.
He wants it to be successful; he wants customers to use it. He's going to help you prioritize the truly important, immediate action items ahead of feature work, and if he has to make a compromise, you'll get to have a conversation about why that compromise matters. The decisions you make as a five-person startup are going to be very different from the ones you make with an established, existing service. Maybe addressing the immediate outage is not nearly as important as getting out the door the feature that will let the company exist tomorrow. Every organization has a different tolerance for these risks and failures, and so you have to have that conversation with the business team, not just within the technical team.

Was there another question? I thought I saw a hand, but it looks like no. Awesome. Well, I hope you all enjoyed the talk. I'm happy to chat with folks afterwards about Datadog or postmortems or anything else. I'm @irabinovitch on Twitter; you saw that, I think, at the beginning of the deck. Feel free to tweet at me if you have questions. Otherwise, enjoy the gala tonight, and I hope to see you at some of the parties here at LinuxCon. Thanks.