This is the title, so you should know where you are. Clearly, that gentleman did not, so he left. As we get started, a little bit about me: I'm Jason. Also, hi. I'm one of your track chairs for the DevOps track, along with Ricardo here, who's sitting in the front to heckle me. If you're on Twitter, I'm gitbisect, and on Drupal.org I'm jyee. I've been involved in pretty much every DrupalCon since Denver, so since 2012. I'm also a prolific traveler and a chef.

I work at Datadog. Datadog is a SaaS-based monitoring company, so we monitor your infrastructure and your applications. If you're on Twitter, we're Datadog HQ. Don't follow Datadog, though; Datadog is actually a black Labrador Retriever, and if you tweet at him, he will make fun of you and tease you for it. We handle about a trillion data points per day. Just to give you some idea of the scale, we have thousands of clients, some large, some small, but we do a trillion points of data every day. And we are hiring. If you're looking for something interesting and creative, and you want to solve some interesting technical problems, take a look. We don't particularly use Drupal, but if you're in ops and like to build really complicated systems, we are hiring.

Speaking of complicated systems, this is actually from our internal developer guide. It says the problems that we work on at Datadog are hard and often don't have obvious, clean-cut solutions, so it's useful to cultivate your troubleshooting skills no matter what role you work in. That's from our internal guide, but I think it applies to everything that we do. Whether you're a Drupal person or just an ops engineer, the systems we're setting up today are really complicated. If you've ever tried to put together a system on AWS, it's not just "I have a server, I install the LAMP stack." It's "I have RDS, and I'm connecting that to Route 53," and all these other pieces. And the problem with having all these complex systems, particularly because they're distributed in the cloud, whether that's private cloud, hybrid cloud, or public cloud, is that a lot can go wrong. That's essentially the job if you're in operations: fixing the things that break.

I love this quote from Henry Ford: the only real mistake is the one we don't learn anything from. The whole point of postmortems is to learn. The dictionary definition of a postmortem is an analysis of an event to figure out why it was a failure.

We're here in the DevOps track, and I'm pretty sure that if I asked any of you for the definition of DevOps, you would all give me very different answers, because DevOps is slightly hard to define. But these two gentlemen, Damon Edwards and John Willis, have been around the DevOps community for a long time, and they coined the acronym CAMS: culture, automation, metrics, and sharing. A lot of the time in DevOps we get caught up in the automation, the tooling. A colleague of mine, Alon, likes to point out these guys. These are the Amish, a religious community in the United States built around simple, ascetic living. One of the interesting things they do is shun the use of electricity and modern tools, yet they build these amazing buildings simply with the power of community. They all get together with handsaws and ropes and pulleys, and they can put up an amazing barn in a day or two. And they don't have modern tooling.
So it's interesting when we think about making buildings: you have cranes and bulldozers and builders who get together with electric saws. But the point, when you think about it, is that DevOps really is about culture. You can have all the tooling you want, but if you don't have the right culture, you can have huge incidents. Think about the crane: if these guys aren't communicating the way those Amish builders communicate, you can start dropping rather large payloads on people's heads. Or think about the automation in your systems. It's really great that you can autoscale, but if you don't have the culture and some of the background pieces set up in your organization, that automation makes it really easy for things to get really bad really quickly. So, setting aside the automation side of DevOps, what we're really here to talk about is culture and sharing. That's traditionally what's been talked about for postmortems: the culture side, in that we often talk about blameless postmortems, and the sharing side, in that when we get together to learn about failure and incidents, the idea is to share that information.

Speaking of blameless postmortems: the concept of blameless really got started with a guy named John Allspaw. John is now the senior VP of technology at Etsy, the online store where people sell their handmade goods. So what does blameless mean? A lot of people talk about blameless postmortems and think it's the idea that nobody's to blame, that it's just not my fault. But that isn't the idea. We're all adults; we should all take responsibility. The notion is that with a blameless postmortem, the focus is on our observations about what happened, and on the idea that the decisions we make, and the decisions that others make, were made with particular information and in particular circumstances that we need to understand. So if this is your normal postmortem, the idea of blameless is: put your pitchforks away. Let's talk about things. Let's focus on the learning rather than on figuring out who did what in order to put blame on them. If you want more on blameless, John's original post is there, and you can read his thoughts. There's another guy, Dave Zwieback, who's written quite a bit about blameless postmortems; definitely worth the read.

I mentioned CAMS before when we talked about the culture, the automation, and the sharing, but one thing that often gets missed is the metrics. That's why I titled this talk Data-Driven Postmortems: what we want to get to is that data and how we can use it. Anybody recognize this screen? Hopefully some true nerds, yes. This is a screen from the Battlestar Galactica TV show, which you should all watch; it's amazing. It's the reboot. The FTL on the screen is a reference to the FTL drives they use; FTL stands for faster than light. They're out in space, getting chased by these Cylons who are trying to destroy humanity and wipe them out. And as they're running, they have the Battlestar Galactica and a bunch of other ships that have this faster-than-light drive.
And the sequence of the show, beyond all the drama around other things, really boils down to this: you plot a jump, you make the jump, and you start the clock, because they know the Cylons are on their tail, and being machines, the Cylons come like clockwork so many hours after each jump. Then they verify their position using the stars: are we where we think we are? Then they count the other ships: has anybody been left behind? Because at this point there are only a few tens of thousands of humans left. And then they start plotting the next jump.

This is a perfect example of what we should be doing in DevOps and with infrastructure and operations. You need to have a plan, and a lot of people have that: they plan out their architecture. That's the "plot the jump" part. Then you make the jump: you have this plan for what you're going to build with your infrastructure, and then you build it. And people generally leave it at that: hey, I built this thing. We're missing the part where we actually verify our position. How do you verify that you built what you thought you built? And then the monitoring part, where you count the ships: is everybody there? Are you leaving anything behind? The problem is we don't do that.

So often we get this. If you're not following @honest_update on Twitter, you really should; the tweets are absolutely hilarious. But they're hilarious because they're pretty true, things like "our metric collection failed during what you're calling an incident, so it didn't really happen." And the problem with them being true is that collecting data is extremely cheap, but not having it, particularly when you need it, is very, very expensive.

Just as an example, let's say you were going to build your own monitoring or start storing your own metrics. S3 is a good place to store that on AWS. Ten terabytes in S3, with one PUT every second, runs about $315, or 280 euros, a month. Annually that's about $3,800, or 3,400 euros, a year. It's super cheap; it's nothing compared to your salary. On the flip side, way back in 2010, and this is an old statistic that actually makes things worse because the numbers would be higher now, TRAC Research surveyed 300 large corporations, Fortune 500-level companies, and estimated the lost revenue for a single hour of downtime at about 19,000 euros. So roughly five to six years' worth of stored data costs about the same as one hour of downtime. And if you're larger, like Amazon: someone estimated that a one-minute outage costs Amazon 105,000 euros. Saving data: really cheap. Not collecting it: very expensive when you have outages.
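To put rough numbers on that storage-versus-downtime comparison, here is the back-of-the-envelope arithmetic as a quick sketch. The figures are the approximate ones quoted above (2016-era S3 pricing and the 2010 TRAC estimate), not current prices:

```python
# Back-of-the-envelope: metric-storage cost vs. downtime cost,
# using the rough figures quoted in the talk (not current AWS pricing).

storage_cost_per_month_eur = 280                     # ~10 TB in S3 plus ~1 PUT/second
storage_cost_per_year_eur = storage_cost_per_month_eur * 12   # ~3,360 EUR, the "3,400 a year" above

downtime_cost_per_hour_eur = 19_000                  # TRAC Research estimate for large companies
downtime_cost_per_minute_amazon_eur = 105_000        # the Amazon estimate mentioned above

years_of_storage_per_hour_down = downtime_cost_per_hour_eur / storage_cost_per_year_eur

print(f"Metric storage: ~{storage_cost_per_year_eur} EUR/year")
print(f"One hour of downtime buys ~{years_of_storage_per_hour_down:.1f} years of metric history")
# -> roughly five to six years of metric history for the price of a single hour down
```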
So let's talk briefly about what sort of data, what sort of metrics, you should be storing. I like this framework because it helps you think about these things and prioritize. There are really two types of metrics, work metrics and resource metrics, and then there are events. Events are things like your logs, like Drupal's watchdog, which you should also be collecting.

Let's start with work metrics. Work metrics are your high-level, top-level health indicators. These are the things you should be paging on, the alerts that wake you up in the middle of the night. I like donuts, so let's think about donut production to make this easy. You've got throughput: how many donuts are coming off the line? You have success: how many good donuts are there, donuts we can sell or eat? Errors: how many bad donuts, things we just have to chuck in the rubbish bin, that we made and wasted? And finally performance: how efficient are we at making donuts?

On the flip side, you have the resource metrics, and all too often people get focused on these, particularly in tech; we monitor things like CPU usage and think that's what matters. In donut terms, resource metrics are things like utilization: how many raw ingredients do we have? Saturation: can we make more? Do we have the capacity to make more? Errors: did we accidentally buy bags of salt instead of bags of sugar, because they kind of look the same? And availability: if we need more ingredients, more resources, can we get them?

I mentioned CPU because a lot of people like to track things like that. But what does tracking CPU actually tell you? You don't necessarily want to get paged on high CPU, because if you're hosting in a public cloud, that's a good thing: you're using all of your resources and getting your money's worth. If your CPU is low and you're in the cloud, you're just paying Amazon for capacity you're not using. On the flip side, if you're hosting your own infrastructure and your CPU is super high, then maybe that is a bad thing.

The final piece is events, and the point there is correlation, because metrics don't change for no reason. You want to see what happened around the time your metrics changed. That could be code changes: someone deployed, and that's why all of your resources are suddenly getting used up. You want alerts on things like scaling events, things that actually matter to your changing environment.

Now, a note about these metrics we're collecting for our postmortems: not all metrics are created equal. The metrics you collect should have four qualities. The first is that they should be well understood. You should be able to look at your metrics and easily know what they stand for and why they're important. And more than just you: your coworkers, everyone across the organization who looks at your metrics, needs to know why they're important and what they mean. This slide is a nice image of a rocket explosion. Back in 1999, NASA, the US space agency, was working with a private contractor, Lockheed Martin. NASA traditionally uses metric units, as everybody should, but because Lockheed Martin was in the US, their side was working in imperial units. Everybody thought they were on the same page, except they weren't. That meant that when the Mars Climate Orbiter reached Mars, instead of settling into orbit, it was lost, never to be heard from again.
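One cheap way to guard against that class of confusion is to make the unit part of the metric's name, so everyone reading the number knows exactly what it means. A minimal sketch of the idea; the record helper and the metric names are hypothetical, not any particular client's API:

```python
# Put the unit in the metric name so consumers never have to guess what they're reading.
LBF_S_TO_N_S = 4.448222  # pound-force seconds -> newton seconds, the conversion the orbiter team missed

def record(metric_name, value):
    """Stand-in for whatever metrics client you actually use."""
    print(f"{metric_name} = {value:.2f}")

record("request.latency", 250)        # ambiguous: seconds? milliseconds?
record("request.latency.ms", 250)     # unambiguous: the unit travels with the name
record("thruster.impulse.newton_seconds", 15.3 * LBF_S_TO_N_S)  # convert at the boundary, store one unit
```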
Number two of the good qualities of metrics: granularity. We recently had the Olympics down in Brazil, and these are some of the times from the men's 50-meter freestyle. Depending on your granularity, I guess all of these guys won gold; we're all winners, because they all finished in 21 seconds. Granularity matters: how fine-grained your timing is makes a huge difference, and here it takes two decimal places to separate them. Granularity in your systems matters in the same way. Just as an example, take the most popular public cloud systems out there. AWS is great, and you can send all the custom metrics you want to CloudWatch, but it only stores them at one-minute granularity; whatever you send, it just gets averaged out. Azure is a bit different: they keep one-minute granularity for up to 24 hours, and once you get beyond 24 hours they start aggregating, compiling it down to one hour for the week, and after the week is done they roll it up by day. So say you have a 15-second outage, which could be important. First off, it gets averaged into the minute: hey, you were down for 15 seconds, but over a minute you were still up 75% of the time, you did okay. Past that day, it's aggregated into an hour, so it looks like you didn't have an outage at all. You lose this information in the roll-ups, which is really bad if you're trying to compile data for a postmortem and figure out what went wrong. When you're figuring out what went wrong and Azure says nothing went wrong because the data isn't there anymore, that's bad. Similarly, Google Stackdriver does one-minute aggregations.

The third quality of good metrics: they should be tagged and filterable. You really need to know where your metrics are coming from and be able to slice and dice them. Back to the donut example: if we had a donut factory, it would be really nice to know that all of our standard glazed donuts are perfect while our chocolate donuts are all coming out as garbage. Similarly, with your infrastructure, you need to be able to tag by availability zone if you're geographically distributed: are your servers here in Europe doing great while your ones in Asia are constantly falling down? You also need to be able to split by role and other dimensions: are all of your PHP back ends up and running while your database servers keep falling over, making the whole cluster slow?
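To make the tagging point concrete, here's what it looks like in practice. This sketch assumes the DogStatsD Python client (the datadog package) talking to a local agent, but any StatsD-style client that supports tags works the same way; the tag values are just the examples from above:

```python
# Tag every data point with the dimensions you'll want to slice by later:
# flavor for the donut example, availability zone and role for real infrastructure.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)  # assumes a local Datadog agent is running

statsd.increment("donuts.produced", tags=["flavor:glazed", "quality:good"])
statsd.increment("donuts.produced", tags=["flavor:chocolate", "quality:reject"])

# Same idea for infrastructure metrics: tag by zone and role so you can ask
# "is eu-west fine while ap-southeast is falling over?" months later.
statsd.gauge("app.response_time.ms", 182,
             tags=["availability_zone:eu-west-1a", "role:php-backend"])
```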
And then finally, metrics need to be long-lived. We saw with the Azure example that they roll data up, so you start losing it. Amazon CloudWatch, which I didn't have up there, only keeps things around for two weeks. It's really hard to see long-term trends and cyclical patterns if you're not keeping your data around for a long time.

The important part of these metrics, as we talk about postmortems, is being able to find the problems. You're going to use these metrics to find the problems and then report on them in your postmortems. The way you do that is to recurse through them. The work metrics, as I mentioned, are the important ones; they're essentially your end business metrics. Are you doing what you're supposed to be doing? What are your customers seeing? You take the information from those metrics, for example "my website's slow" or "my website's down," and then you look at the resource metrics associated with them. So let's say my Drupal site is slow. Well, I can take a look at my NGINX, if I'm running that as a reverse proxy. What does NGINX look like? That's the resource. Okay, NGINX is probably fine; there are no events correlated with it. So then I recurse down. NGINX is working fine, so my internal work metric becomes whether my reverse proxy can actually hit all of my web endpoints; those endpoints are now my resources. I look at my web endpoints and correlate events: someone updated my Chef recipe, and oh, a bunch of my web endpoints are now responding slowly or aren't up at all. You keep recursing down to find what the issues are.

One note as you're doing this: when you're actually in that situation, trying to resolve an issue, it's not a good time for a postmortem. I get asked about this a lot: okay, I'm solving things, maybe I should be writing them up and we should be discussing them. When you're actually resolving an incident, focus on that. Get things fixed first. A good example is this traffic cop here: he's examining the accident after it's happened. If there are still cars flying by, crashing and exploding, you probably don't want to be walking around figuring out why the crash happened.

That covers some of the technical side: gathering the data from our infrastructure and systems. But all technical issues have human elements, because we work in technology and we're the ones building and developing these systems. So let's talk about us for a bit. As we think about the postmortem and collecting our human data, the first question is who. Who should be involved in postmortems? It's really easy to say everyone. But when you're thinking about "everyone," I like to think of it, as with the cop I mentioned before, the way police or detectives would approach their work. You have your responders: the engineers who actually worked to resolve or mitigate the issue. Obviously you want them involved. But in the same way that a police officer wouldn't just talk to the other police officers who showed up on the scene, you need to talk to other people too. You have the identifiers: the witnesses to your outage or incident. Often they might be your customers, the people who said, "I went to your website and this didn't work right." Getting their story is very important. And then, continuing the crime-scene analogy, talk to your affected users, the people who would be the victims in that case: who actually suffered? If you're running an e-commerce store, the guy whose order got completely screwed up and who got charged a bunch of money has a lot to say about what happened. Talking to everyone involved, and thinking about the different roles they played, is critical.

So we know who we need to talk to and get data from. What are we actually collecting from these people? We're collecting their perspective: not only what they did, but what they thought, and particularly why they thought it. Referencing John Allspaw again, whom I mentioned earlier for blameless postmortems: John likes to point out that people try to make the right choices. Hopefully you don't have anyone working at your company who's intentionally trying to sabotage it.
You trust that your HR people and your engineering managers are hiring the best people they can, and that those people are trying to do a good job. We all take pride in our work, and we try to make the best decisions we can with the information we have. John likes to say that nobody would have done what they did if they had known it would crash your systems. The key is to try to uncover what sort of false indicators led them to the wrong conclusion. A good way to do that is to ask open-ended questions. Rather than asking questions they can answer with a yes or no or a single word, ask things that get them discussing what they were thinking and telling their story.

Another good tip is not to do this only through discussion, but to actually have people write their stories down. There's a great line from Dick Guindon: writing is nature's way of letting you know how sloppy your thinking is. Writing things out forces you to think logically about your story and what actually happened. Also, try to include pictures in those stories. We've all heard that a picture is worth a thousand words; when people write their stories, have them include graphs and metrics from your dashboarding and monitoring systems. It's one thing for someone to write that they saw a spike in disk I/O at a certain timestamp, but that doesn't convey nearly as much as actually seeing the graph of the ramp-up or the spikiness of that disk I/O. As an aside, this screenshot is from Datadog's notebooks feature, which we've rolled out to support exactly this, but you could really use anything: if you're using Google Docs, great, Google Docs lets you put images in there.

So we've covered who and what. Let's talk about when. It's best to gather data as soon as possible. Back in 1885, a researcher named Ebbinghaus was the first to really study human memory, and he noted that you lose memory roughly exponentially for about the first two days. If you wait two days to collect information, it's nearly the same as waiting a month or a year, so try to collect within that first two-day window. The other thing is that while memory loss drops off sharply over those first couple of days, there's an inverse effect where you become susceptible to false memories. Since I've been talking about crime scenes: the police have noted this a lot. Just as people forget a great deal within the first two days, they also start to remember things that didn't actually occur. That's another reason to start collecting data as quickly as possible.

That said, if you're dealing with an outage, particularly if you're one of the responders working to mitigate it, that's pretty stressful, and stress can skew and corrupt the data you collect. Nobody wants to go straight from putting out a fire, from having to rebuild all your infrastructure because something failed, to being questioned about what happened. So definitely be sensitive to stress. And if it's a longer outage, sleep deprivation comes into play, because you may have spent multiple days working on it without sleeping.
And then there's burnout: the longer incidents go on, or the more frequently they happen, the more burnout becomes part of the picture. So there's a balance. Try to collect within those first two days, but don't be so keen that you force people to do it immediately after an outage.

A few more notes on data skew and corruption when you're collecting human data. We talked about blameless: the number one way to get false data, or no data at all, is for people to be afraid that you're going to fire them for what they did. That's why blameless matters. But there are also some natural biases we all have. There's anchoring: taking a single piece of information and focusing too heavily on it when you're thinking about what happened. There's hindsight bias, which is essentially the make-believe of an alternate reality. Hindsight bias is when you ask people, "Why didn't you do this? Shouldn't you have done that? Couldn't you have noticed this issue?" None of those things happened; you're not talking about reality, so it's not even worth bringing up that sort of language. There's outcome bias, the bias that says when you do something and something bad happens, it must have been a bad decision, and when you do something and something good happens, it must have been a good decision. Clearly that doesn't hold up. I love whiskey, and I'm here in Ireland where there's great whiskey. I could go out tonight, get pissed, get in the car, and drive around on the wrong side of the street because I'm an American, and if I don't get into a crash, that doesn't make it a good decision. Similarly, if I go out completely sober, drive on the correct side of the street, and still get into a crash, that doesn't necessarily mean driving around here was a bad decision. Your decisions and the outcomes that follow are often not related, and it's good to be aware of that bias.

A few more. Availability bias, also known as recency bias, says that we tend to place heavier weight on experiences that happened more recently. If we did something last time and it caused an outage, and this incident looks somewhat similar, it's really easy to say, "Oh, that's like last time. Who did it this time?" when the two can be completely unrelated. Availability bias often actually contributes to incidents, because people associate what they're seeing with what just happened, even when the situations are completely different and only the metrics look somewhat similar. And finally, the bandwagon effect. Most of us work in teams, and it's really easy to just join in. If some people think an issue had a certain cause, or a certain set of causes, it's really easy to agree with the group even when we see outliers where it might be important to point out the differences.

So, as I mentioned, at Datadog we handle over a trillion data points a day, we have thousands of customers, and we have very complex systems. That means we have outages and incidents, which means we do postmortems. I'm going to talk a little bit about how we do postmortems, because I think we do them really well. A few notes. When we do a postmortem, we email the postmortem document company-wide. Almost company-wide, I guess.
It doesn't go to everybody, but it does go to our developer list, and since we're a technology company, that's a very large part of the company. The postmortems are also posted online, so anybody who's not on the developer list, people like our sales team, can go look at them if they're interested. The other thing we do is schedule a recurring postmortem meeting. That's important not only for getting the whole organization together to discuss postmortems outside of the individual teams involved, but also because when you schedule something, it becomes a routine. Postmortems go from being a reminder of a failure, of a stressful time in your life, to just a regular learning session. It lets you practice really good habits, so I definitely recommend recurring meetings.

Now into our postmortem template. We have five key sections, and the first is the summary: what happened? It's really just describing the incident at a high level. What was the impact on customers? It's always important to keep your end business goals in mind. What was the severity of the outage? A lot of people only do postmortems on their really critical incidents, but it's worth drawing that line lower, because you want to know about the mid-level and non-critical incidents that may be pointing at issues that could build up and cause bigger incidents in the future. What components were affected, so you know what was involved? And then, ultimately, a summary of what resolved the outage. Again, this is just a summary: as you share it across your organization, people outside the team, managers who might not be technical, can read the summary, get a quick highlight of what happened, and be better informed.

Here's an actual example. Back in March, we had a fairly severe outage, caused essentially by one of our Redis clusters having issues. But as you'll notice, this is a summary at a high level, really just talking about our caching, and it has the basic timestamps of when it happened. And again, it includes pictures; these are our metrics from Pingdom, and as you can see, everything is going along fine and then it just explodes. Then there's the textual side, keeping the customers and end goals front and center: the impact on customers was that they were shown a down page unless they were already logged in. As a SaaS-based company, people being able to log in and get to our services is highly critical, so this was a major-severity incident. We note the components affected: McNulty, which is one of our internal systems, along with snapshots and the crawlers we use to gather metrics. And ultimately, what resolved it was that we swapped in a larger AWS instance.

Part two of the template: how was the outage detected? We want to make sure that if an incident comes up again, we can detect it, and detect it early. The questions here are: did we have a metric that showed the outage? Was there a monitor on that metric? And how long did it take us to declare an outage? For this particular incident, we had multiple metrics that let us know; the Pingdom check we saw before was one of them. Was there a monitor?
Was there something actually giving us automated alerts? In this case, yes: we had HAProxy with a service check, and the service check flagged that the service was down. And how long did it take to declare an outage? This one took three minutes, which is pretty good. If it's taking you a long time to discover things, that's a good sign that either you don't have metrics and you should, or you don't have alerts and you should, or you need to adjust them to warn you sooner. And again, we embed even more graphics: you can see some of our systems there, like our AWS ELB in the top middle, where it's very clear that latency on the load balancer went through the roof, and you can see the same pattern on a bunch of the other metrics.

Part three: we've covered the summary and how we detect things, so how did we actually respond? This is the area where we start talking to people and capturing data from the people involved. Who was the incident owner, who took control of the incident and managed it, and who else was involved? And we start to build a timeline. If you're using Slack or another chat-ops tool, whether that's HipChat, IRC, or Mattermost, the great thing about these tools is that they give you a nice timeline of what people were saying at the time. In particular, we have a Slack channel called Outage; whenever there's an incident, people jump in there, we know it's logged, and it's free of the noise and crosstalk of other channels. And then we ask what went well and what didn't go well. The whole point of a postmortem is to learn: where were your pain points, what can you start to address, and what went really well that you can praise and recognize people for?

Jumping back into our postmortem, you can see we list the people who were involved in the incident, and we link directly to the Slack archives. You should be monitoring, so we also link to our monitoring dashboards for the incident, which is really nice for people coming in later to see what the engineers were seeing. And as mentioned, we have notebooks, which are our way of capturing a story along with some of those charts; for you that could be Google Docs or whatever you're using. Then we recreate the timeline of events. The Slack archive gives us the actual transcript of what people were saying, but we really want to add in the events that happened around it. I already mentioned Slack, but if you have an integration into Slack from your monitoring service, it's really nice to be able to post your monitors, metrics, and graphs directly into the channel; it makes it much easier for people to see what's being discussed. Now, I said earlier that you shouldn't be trying to run a postmortem while you're trying to resolve an issue, but as you go along, if there are points you want to highlight, be sure to note them down. If you're using a chat tool, it's really easy to tag something like "Dan didn't get paged in this incident." That's a pretty important thing to make note of, and something you might miss if you only come back to it the day after, when you were stressed out about actually solving the problem. So as Dan got pulled into this incident and wondered, "Why didn't I get paged?" he made a quick note: hey, I didn't get paged. Then you can come back later, search for those notes in your transcripts, and compile them.
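One trivial way to make those in-the-moment notes easy to compile afterwards is to agree on a marker string and then just search the exported channel log for it. A rough sketch; the "#pm-note" marker and the plain-text export file are made up for illustration:

```python
# Collect every chat line tagged with an agreed-upon marker so the postmortem
# timeline can be compiled straight from the incident channel's export.
MARKER = "#pm-note"

def collect_notes(transcript_path):
    with open(transcript_path, encoding="utf-8") as f:
        return [line.strip() for line in f if MARKER in line]

if __name__ == "__main__":
    for note in collect_notes("outage-channel-export.log"):  # hypothetical export file
        print(note)
    # e.g. "[09:41] dan: #pm-note I never got paged for this"
```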
So, on to the fourth part of our postmortem template, and this is where you get technical: why did it happen? This is the deep dive into the incident. If you're curious about the incident I'm using as an example, we published it in two places. The first link is to our status page on statuspage.io, the public side, which has a full write-up of the incident. For the second, because we like to be transparent, Alexis, our CTO, actually went and gave a presentation on the issue. The TL;DR, in case you're really interested, is that we narrowed it down to Redis and a network issue. We scaled up because we initially thought it was a CPU issue, but it turns out it might actually be a VM issue. There's a lot of evidence pointing to that, though it's really hard to get something conclusive. Essentially, if a VM receives too many packets, the hypervisor starts to choke.

The last section is the most important one, because with postmortems the main goal is to learn and to prevent a repeat. This is where you create tasks, and you want to make sure they're actionable tasks, not just "hey, philosophy change, we should all think about how to do this better" or "we should all be more careful." You want concrete tasks that people can say, yes, they did. Link directly to GitHub issues if you're using them, or to Trello cards or JIRA tickets, whatever system you use, so that people have something concrete and know what they're working on. We like to divide these into now, next, and later. Now is the stuff we absolutely have to do right now, otherwise this will happen again in the near future. Next is our next steps: what can we start implementing to prevent what happened? And later is for the larger architectural changes. It's quite possible that the way you've architected your system, something you've invested a lot in, has some flaws, but it's not like you can say you'll completely re-architect all of your systems tomorrow. And then also keep follow-up notes: as people work through these tasks, they'll uncover other things about your systems, and it's important to note those too.

This is that section from our postmortem. We've got some checklists there, and you'll notice the first item is to ship some fixes and make some Redis tweaks; there's a link to the two ticket numbers and to a GitHub commit, so that one is checked off and done. But as I mentioned, we add follow-up notes, so on that second item you'll notice there's a comment, and the comment is that the problem doesn't actually appear to be specific to the system we thought it was. The engineer working on it was reluctant to call it done, so she didn't check it off. Your tasks will evolve, and you should let them, because you should be working in an agile way, taking the information you learn and adjusting the tasks you're doing.

So, just to recap the five key sections you should have in your postmortem template. What happened: the high-level summary. How did you detect it? How did you respond? Why did it happen: the deep dive, where you include the technical details so other engineers can learn from them. And then always the last one: actionable next steps.
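If you want starting a postmortem to be frictionless, you can stamp out a new document from those five sections automatically. A minimal sketch; the wording just paraphrases the template above, and you'd adapt the output to Google Docs, a wiki, or whatever your team actually uses:

```python
# Generate a skeleton postmortem document with the five key sections.
from datetime import date

TEMPLATE = """\
# Postmortem: {title} ({day})

## What happened (summary)
Impact on customers / severity / components affected / what resolved it.

## How was the outage detected?
Did we have a metric? Was there a monitor on it? How long until we declared an outage?

## How did we respond?
Incident owner, who else was involved, timeline (link the chat archive), what went well, what didn't.

## Why did it happen?
Deep technical dive so other engineers can learn from it.

## Actionable next steps
Now / next / later tasks, each linked to a concrete ticket, plus follow-up notes.
"""

def new_postmortem(title):
    return TEMPLATE.format(title=title, day=date.today().isoformat())

if __name__ == "__main__":
    print(new_postmortem("Redis cache outage"))
```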
A few more resources. I've mentioned John Allspaw a couple of times; a couple of years ago he wrote a blog post called The Infinite Hows. Some of you may be using something called the five whys, the notion that you ask five "why" questions, and whatever answers the fifth one is your root cause. But we all work with very complex systems, and there's rarely ever a single root cause. So John proposed The Infinite Hows: constantly asking how something happened is a good way to figure out what the multiple contributing causes are and how they worked together to cause an incident. I've also mentioned blameless a bunch of times; my friend Paul Reed wrote "Why blameless postmortems don't work," which is a fantastic read. He doesn't say you shouldn't do blameless, but he does note that psychologically we've all sort of evolved to blame, and he talks about how to deal with that innate urge. And finally, to get more into the metrics, the numbers, and the data you should be collecting on the technical side, Alexis, our CTO, wrote a fantastic series of blog posts called Monitoring 101. That's a really great place to read more about which metrics to collect and how to collect them.

I know I've had a ton of links and resources, so in case you didn't catch all of them, my slides are available here, just bitly, decon-postmortems. I'll leave this up; my next slide is just questions. So let's get to some questions, and I'll leave the slide link up so you can copy it down. If anybody has questions, I think you're supposed to go to the mic in the middle, but if you're trapped in your row, just shout it out and I'll repeat it. No questions?

Okay, repeating the question for those watching at home later: the question is, in terms of blameless, do you think we just shouldn't hold people accountable? Well, yes and no. It depends on what you mean by holding people accountable. With the transcripts, we do capture the names of the people in the timeline and what they did, so there is a record of what people have done. I know some people like to do postmortems truly blamelessly and in that section will just put something like "engineer one shut down server," so it really depends. I like to actually have some tracking there, partly because if someone has an incident in the future, they can go talk to that engineer, know who did what, and ask, "Hey, I was reading the postmortem," or "I'm in this incident and you did this thing; why did you do it? Because I think I'm seeing something similar." As far as keeping people accountable, it really depends. You don't want anything punitive against them, because they really were making the best decisions they could, so it's crucial to figure out why they made those decisions. If they actually were being malicious, that's another case: yes, hold them accountable, and if it was intentionally malicious, absolutely fire that person; you don't need that person on your team. But if it was a case of "I thought this was what was going on and I made the wrong choice," then no.
As far as accountability in terms of something punitive: absolutely do not punish that person. I mentioned John Allspaw at Etsy; they actually do the exact opposite. If you were the person in an incident who accidentally made the wrong choice, congratulations, you just helped the company. It's essentially like security programs that pay hackers who find security holes: they actually reward that person with an extra bonus. It's a good way to think about things if you're trying to learn.

Audience: That actually prompts another question. I don't know if you mentioned it in your presentation: are these postmortems available to the whole company, to other parts of the company?

Yes, we do make them available to everybody. We started with Hackpad, and that was available to anybody who could hit the URL; as we've moved to Google Docs and now to our notebooks, they're all still available to everybody in the company.

Audience: Okay, that leads me to a second question. For that to happen, and for you to have the names of the people who were involved in the event, you must have a very good culture inside the whole company, not only in the tech space. What I've seen is that DevOps culture lives more on the tech side; people who are not on the tech side, like administrators, finance, human resources, et cetera, don't understand it, so the moment they see a name, or the names, you know what I mean. What do you do about that?

That's a good question. Largely, I think the people outside of our technical areas don't care. The postmortems are available to them, but I'm not sure how much they actually look at them. If it's affecting them financially, they will care, but even then they don't particularly get a say in how we treat our engineers. You're right, though: it comes down to culture, and we have a fantastic culture that we've promoted within the organization, so people do understand that the stuff we're working on is challenging and bad stuff happens. A sales guy might say, "Hey, you really screwed this one up and now I have an unhappy customer," but at the end of the day we hire really great engineers, the best we can find, so things just happen, and they didn't know it would go wrong. That said, we're in the mid-200s in headcount, so we're not a giant megacorporation. This does become harder as you get larger and larger and people become more detached; they see a name, they have no connection to that person, and it's really easy to just get angry. Scaling culture is hard.

Audience: Do you have a standard approach for new hires? Because somebody who's just started might think, "Oh my God, I'm the one that caused this, and I'm going to get fired for it." Do you have a standard newcomer approach, or how do you get new people into that culture?
Yeah, so with new hires, we do have that quote I showed at the beginning, which is from our internal developer guide; everyone hired on the technical side reads that document, so that covers a little bit of it. But really, part of culture, and there's a great quote about this whose source I can't remember, is that culture is built and evolves by mimicking and copying, and by inventing only when necessary. The thing is, once you establish a culture, maintaining it is really just about finding people who align with it and will adopt those practices. And obviously, if someone in your organization decides they won't, and is just going to do their own thing, then maybe that's not a good fit. Similar to what I said about firing someone who is intentionally and maliciously destroying your systems, you might want to consider letting someone go if they're intentionally being obstinate and not falling in with the culture, particularly when you've built a good one. For new hires who are afraid when they come into our company, we have some really fantastic engineering managers who are great mentors. And we're small enough that when something happens on the people side, it's easy to notice. We're a monitoring company, so we dogfood our own tool; it's really easy to see, for the most part, when something goes wrong, and it's really hard to hide it in our company. I think the key point is that our engineering management team is good about actually expressing that blamelessness in those situations.

Any other questions? We've got five more minutes, or we can all go to lunch early. No more questions? I think everybody wants lunch. All right, thanks for coming. I'll be sticking around, so if you have a question you want to ask later, feel free.