You get about five seconds of hot, pulsing techno music when you come up on stage. It's great. Hello, everyone. This talk is "After Death." My name is Sam Phippen. I'm a member of the RSpec core team and an architect and interim manager at DigitalOcean. Let's get started.

If you've worked in software engineering for any time at all, you'll have heard tales of eventually encountering something that looks like this, or this, or this: the service is trying and trying and trying, and then you do a deploy and it gives up the ghost. Earlier today, not one hour ago, this status page went up on DigitalOcean, and there was a non-zero likelihood that I would be doing incident response right now instead of being on this stage. Which is fun. It happens.

The things we build fail. It's inevitable. The systems that we work on aren't perfect, and neither are the people. And if you hold a pager for any production system, I'm sure you know the feeling of sadness and frustration (what's going on? the clicker's doing two inputs at once) and of anger at being woken up at three o'clock in the morning. Why did my team not care enough to not make that bad deploy? Why is my company not protecting me from failing infrastructure? What on earth has gone wrong?

As I mentioned, I'm a manager now, and I have a covenant with my people that says that this will be as rare as possible, and that I will empower them to prevent it from happening as often as we possibly can. The thing about DigitalOcean is that it's not a Rails monolith. It's a big, complicated services world, and I can tell you from experience that those services contribute just as much to the downtime as any other factor in our infrastructure. And this happens, right? There's really nothing we can do about it. No matter how hard we try, computers will break, the disks on your servers will fill up, a human will write a bug and deploy it to production, and, in the worst case, the underlying hardware literally catches fire. So what can we do about this? What tools do we have at our disposal to make sure that, as often as possible,
we are not affected by these incidents, and that we're getting better? Jess spoke to this very eloquently this morning, and I'd like to throw my hat in the ring. Many of us, when these things happen, do a post-incident review, or a learning review, or something else, but the truth of the matter is that at DigitalOcean we call them post-mortems, and I like this metal font, so that's what we're doing.

The thing about doing these reviews is that they give us a literal system for learning from the failures that we undergo. With that system we're able to really patch things up, find underlying causes, and fix them, and we're able to work with our organizations to understand what risks we might be accepting when we build the systems that we work on. Everyone here who has the ability to deploy to their production environment is accepting a risk every single time they do it. We're all asked to ship features basically as quickly as we can, and while that's a really valuable and important thing, it's important to note that, ultimately, if your product is not available, you can't have any customers, and so this is the thing they care about too.

By no means am I here to tell you that I think I have all the answers. I used to be terrible at this, and working on a cloud-scale system for a couple of years has taught me a lot. Really, we only started getting good at this a year ago, and so I'm just going to share some reckons here. If you disagree, feel free to have a chat with me afterwards; I love talking about this stuff.

To talk about this, I'd actually like to dig into how we post-mortem at DigitalOcean, to give you some understanding of how I think about the process, and to do that I'm going to walk us through DigitalOcean's post-mortem template. Every time we have any kind of severe production incident, one of these gets kicked off and an engineer starts filling it out. We give the incident a name, we take the date when the incident commenced, and then we assign it a severity.

Severities are useful because they tell you how to respond: what kind of response is necessary and who should be involved. At DigitalOcean we have a five-point scale that starts at Sev 0. A Sev 0 incident at DigitalOcean means that there is critical impact to the business; it likely means that everything will end if engineers don't work quickly to fix it. These are the company-ending events that you all dread, and fortunately they're incredibly rare; I think we've had two since the scale was defined. But in a Sev 0 incident, every engineer in the company, multiple directors, infrastructure people, a coordinating response, literally everyone bands together to try and make the problem go away. It's the worst thing that can happen,
and so it gets the highest form of response as a result. A Sev 1 is a major global outage or an entire single product not functioning; if a data center is gone, that's a Sev 1. Again, at this point we get executives involved and there are usually multiple teams coordinating, but we don't believe the business is going to end if we don't deal with it immediately, and that's sort of the core distinction. Sev 2 is the lowest severity of incident that we wake people up in the middle of the night for, and that's basically when a single product has stopped working, or something else has gone wrong that's severe but not severe enough to wake everyone up. Usually there's one engineer doing response, coordinating with our support and communications teams. And then we have 3 and 4, which are basically just bugs and defects that go into Jira and get fixed at some point in the future, hopefully.

If your organization doesn't yet have a mature incident response practice, having a severity scale is a really good place to start. It allows you to communicate with your company about what your incident classes actually are, how bad they are, what you can do to fix them, who needs to be involved, what the procedures should be, and that sort of thing. This is pretty standard across all incident response practice, and so this is a really good thing that you can add to your tools if you don't have one already.
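If it helps to make that concrete, here's a minimal sketch of what a severity scale might look like once you encode it for your own tooling. The level names, descriptions, and paging rules are illustrative only; they are loosely modelled on the scale described above, not DigitalOcean's actual definitions.

```ruby
# Illustrative five-point severity scale. Adjust the descriptions and
# paging rules to whatever your organization actually needs.
SEVERITIES = {
  sev0: { description: "Critical, business-threatening impact",            page: :everyone },
  sev1: { description: "Major global outage or an entire product down",    page: :multiple_teams },
  sev2: { description: "A single product degraded; severe but contained",  page: :on_call_engineer },
  sev3: { description: "Bug or defect; fix soon",                          page: :nobody },
  sev4: { description: "Minor defect; fix eventually",                     page: :nobody }
}.freeze

def wake_someone_up?(severity)
  # Sev 2 is the lowest severity we page a human for in the middle of the night.
  SEVERITIES.fetch(severity).fetch(:page) != :nobody
end

wake_someone_up?(:sev2) # => true
wake_someone_up?(:sev3) # => false
```

Even this much gives you a shared vocabulary: when someone says "this is a Sev 2," everyone knows who gets woken up.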
The next thing that I want to focus on in our post-mortem is the timeline, and the timeline is really one of the most core and critical pieces of these documents. It's the facts of everything that happened that caused the incident to occur, from the very beginning to the very end: everyone who was involved, all of the systems that went wrong, logging, monitoring, and those sorts of things. But there are a couple of rules. Timelines must be hard facts; you shouldn't include any analysis or emotion. What people did is valid; how you thought about what you were doing is not. This is entirely just to get an understanding of what happened. The pro tip I have here is that if you're dealing with an incident like this, you should complete your timeline as close to the incident as possible, because logs rotate out, graphs expire, and memories get fuzzy. So, depending on how severe your incident is, we typically either complete this as soon as response is done or the next morning afterwards. To steal a line from Jess: document everything, no matter how small the thing may seem; it may actually be a critical underpinning of a later piece of analysis. And so the timeline is a really valuable tool.

This is what a typical incident timeline looks like. It includes timestamps; we use UTC, some people use their local time zone, and that's up to you. This is kind of a straw timeline, not a real incident timeline, because if I told you what we did in real incidents that might be problematic, and you might want to stop using us.

This first line here is that a deploy was made which introduced a bug, and here I'm really trying to make the point that your incident starts before you notice it. A deploy happens, a server fails, a database row gets corrupted; at that point no human has actually noticed that something has gone wrong, but something has gone wrong. And so when we think about incident response, we need to document when our incident actually started, which is a separate point from when we noticed it and when incident response began.

The next line says that our graphs indicate something went wrong, and you'll see I have a little Wikipedia-like citation mark on all of these lines. You should add some kind of evidence that explains how this timeline was created, be it a screenshot of a graph, a screenshot of logs, a link to a Slack message, or a link to PagerDuty. That's all really good, and again, remember: log systems and graph systems expire, so make sure you capture it permanently.

The other piece of advice that I have here is that observability is really good and you should have some. The Rails community as a whole, I think, is not great at building observable services, and thinking about what you can do to add graphs, metrics, and logs to your system is a really good idea. At DigitalOcean we make very heavy use of Prometheus and Grafana, but if you're hosting on Heroku, they actually have a really great dashboard built in, and this is probably good enough for most smaller Rails applications. This is really important: the actual impact of your incident (what happened, how many customers, how many tickets, what data was lost) can usually be reconstructed from this information. And so thinking carefully about observability is useful both for incident response and as a generally healthy thing to do, as well as helping you fill out your post-mortems.
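As one concrete way you might start adding that kind of observability to a Ruby app, here's a small sketch using the prometheus-client gem (the 1.x-style API). The metric names and labels are made up for illustration; this isn't DigitalOcean's actual instrumentation.

```ruby
# Gemfile: gem "prometheus-client"
require "prometheus/client"

registry = Prometheus::Client.registry

# Count requests by path and status so you can graph error rates later.
requests = Prometheus::Client::Counter.new(
  :app_http_requests_total,
  docstring: "HTTP requests handled by the app",
  labels: [:path, :status]
)
registry.register(requests)

# Time background jobs so slowdowns show up before customers notice.
job_duration = Prometheus::Client::Histogram.new(
  :app_job_duration_seconds,
  docstring: "Background job duration in seconds",
  labels: [:job]
)
registry.register(job_duration)

requests.increment(labels: { path: "/droplets", status: "200" })
job_duration.observe(1.42, labels: { job: "nightly_backup" })

# Expose the registry on /metrics (e.g. via Prometheus::Middleware::Exporter)
# and point Prometheus, or whatever dashboarding you have, at it.
```

Once something like this is in place, the "our graphs indicate something went wrong" timeline entry comes with its evidence more or less for free.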
The next line you'll see is that Sam got paged for BizOps Internal Tools, which is the name of my team at DigitalOcean, with a link out to PagerDuty. PagerDuty is great; if you have another paging system, that's cool, but I'm really only familiar with PagerDuty. Here you'll see what a PagerDuty incident looks like. The thing we're trying to capture with this line is: when did we actually get a human involved? That's really important because it's the earliest point in your incident at which you could possibly begin clearing it up, if it's bad enough that it's not going to heal itself. So documenting not just when the incident started and when we first noticed it, but also when response began, starts to inform, on a wider basis, how we're dealing with incidents that occur at our company.

Another factor that's really important is who alerted this person. At DigitalOcean we're fortunate enough to have a 24/7 support and operations team that can wake us up in the middle of the night and is able to make technical investigations. If you don't have that, you need some kind of machine alerting, and a thing we have generally observed is that if a machine alerts us, our incident closure time is much faster than if a human alerts us. So having automated alerts is a generally healthy operations practice, and it's something you can build once you've got that observability in your application.

We then come to our next timeline entry, which is the human acknowledging the page, and that's another useful and important point, because that's when the person we reached out to actually began responding. If you're paging someone at three o'clock in the morning and they're in deep sleep, it might take them 40 or 50 minutes to actually wake up and realize what is going on, but understanding that time and that factor is really important. As well as this, in our timelines, where we have multiple responders, we document where everyone got pulled in and what they were doing. The idea here is that this part of the timeline scales to everyone involved in incident response, and should document who was involved and who is doing what.

There's one thing that I have here as a super mega pro tip, and that's announcing presence in Slack. If you're doing incident response, make sure people know you're doing incident response: what you're doing, why you're doing it, when you're doing it, and communicate frequently. As a general rule, I ask primary incident responders that I work with to update in Slack at least once every five minutes with exactly what they're doing. That's not to micromanage and continuously check in on them; it's a heartbeat to make sure that we're still fighting the incident, even if they don't have a status update. It's just useful, healthy, good practice.

And then finally you have whatever you did to fix it, links to GitHub or whatever it might be, and graphs that show resolution. Here we're really asking the questions: how did we fix it? Can we prove that we fixed it? Do we have confirmation of that?
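One way to lower the friction of those five-minute updates is a tiny helper that posts to an incident channel. A minimal sketch might look like the following, assuming a Slack incoming-webhook URL you'd configure yourself; the environment variable name and the example message are placeholders, not anything real.

```ruby
require "net/http"
require "json"
require "time"
require "uri"

# Placeholder: create your own Slack incoming webhook and keep the URL in the
# environment rather than hard-coding it.
WEBHOOK = URI(ENV.fetch("INCIDENT_WEBHOOK_URL"))

# Post a timestamped heartbeat so the channel always knows what the primary
# responder is doing, even when the honest answer is "still staring at graphs".
def incident_update(text)
  Net::HTTP.post(
    WEBHOOK,
    { text: "[#{Time.now.utc.iso8601}] #{text}" }.to_json,
    "Content-Type" => "application/json"
  )
end

incident_update("Rolled back the deploy; error rate trending down, watching the graphs.")
```

As a bonus, those messages become timeline evidence almost for free when you write the post-mortem the next morning.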
So that's our timeline, and that may all seem fairly obvious, but getting that documentation exactly correct is really, really valuable. If you have this ongoing log of everything that happened in all of your incident response, you as an engineer will be able to build up patterns of what happened, why it happened, and how your team is fixing things.

So let's move on to the next section, and this is the single most important part of every one of these post-mortems and learning reviews: the root cause section. Now, the thing about the term "root cause" is that it's inherently misleading. It implies that it's singular, and it also implies that it's the thing closest to the incident. That's how most people I see naively fill these out, and I'm guilty of having done this in the past myself. This is one of the earliest root causes I wrote at DigitalOcean. I absolutely don't expect you to read all of this text; you basically need to get the idea that a deploy happened, there were bad queries, too much scale, and a sort of outdated deploy adding load, and this caused the incident. I deployed something at too much scale, and stuff went wrong. That seems perfectly reasonable at first glance, but it doesn't actually help us solve any problem. To think about how we can, let's dig into exactly what's written there a little more.

There was an Atlantis deployment with a scaling factor of 10, where each instance had a default worker thread count of 25; SELECT statements were being issued to prod-mysql-1a; there wasn't an index in place; and there was an outdated version of the worker. These four things are what should actually be called proximate causes. They're all close to the incident that happened, but not actually causal if you really, really dig. So let's do a real root cause analysis on these four factors and get a better understanding of what is really going on.

To start with, let's take a look at these SELECT statements to prod-mysql-1a. To understand this, you need to understand a little bit about how database architecture works at DigitalOcean. We have one ginormous MySQL cluster called Alpha, and it's made up of a bunch of servers. mysql-1a is the primary and accepts reads and writes; all the other ones are read replicas and only accept read queries. The point of doing this is that we can alleviate traffic from our primary by moving those queries onto the followers, and it's a generally established best practice that if you know a transaction is only going to issue SELECTs, you use a follower instead of the primary database. If I were going to write this today, I would probably say that Atlantis was issuing unnecessary reads to the production primary, but also leave a training note that says words to the effect that we should be careful to use the read replicas wherever possible. This reinforces that best practice and also creates documentation for other people to follow going forward, as a way to avoid this problem.
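If you're in a Rails app, the multiple-database support in Rails 6 and later gives you a straightforward way to act on that best practice. Here's a minimal sketch; the connection names and the model are made up for illustration, and your database.yml would need to define the replica.

```ruby
# config/database.yml would define both connections; the names "primary"
# and "primary_replica" here are illustrative.
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# When you know a unit of work only issues SELECTs, route it to the replica
# so the primary isn't carrying read traffic it doesn't need to.
ActiveRecord::Base.connected_to(role: :reading) do
  Droplet.where(status: "active").count
end
```

The code only helps if people know why it's there, though, which is exactly what the training note is for.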
Then we had this thing about issuing queries that didn't have an index, and this should be fairly self-explanatory: have indexes. Database indexes are really good; they make everything faster, and you almost always want to have them. So let's think about it: we had this missing index, but a query made it all the way to production without us having any ability to detect that it was going to cause a problem. You could argue absence of testing for full table scans, you could argue unrealistic testing data, and you could argue (and this is a really common one) that the staging environment doesn't reflect production. How many of you have staging environments that you genuinely believe are a good reflection of your production environment? That can be one of the most useful silver bullets against operational failure. Try building one; it's a really hard problem.

Running an outdated version of the worker may have contributed to load. Complex production environments, when you run the internet, are hard. Basically, we used to have this production environment where these servers, their Sidekiq workers, were deployed with Chef and Capistrano, and there were four of them. Then the great Kubernetes migration of 2016 came along, we Dockerized everything, and we deployed ten of these pods to our new, shiny Kubernetes environment. The problem is we didn't burn down the Chef nodes initially, because we were worried Kubernetes wasn't going to work, and the deploys weren't in sync. So we had old revisions of the software in production at the same time as new revisions of the software, and that was a nightmare. Don't do it. It's a really bad idea. I would note this and root-cause it now as: running parallel Chef and Kubernetes environments exacerbated the load problem; there was a lack of procedure around decommissioning Chef nodes; and, in general, we needed a little more maturity around thinking about big, complex production transitions like that, which is not something many of the people involved were experienced at doing at the time.

And then, finally, the scaling-factor one. This is really tricky, because the truth of the matter is that if those other three things hadn't gone wrong, this wouldn't have been a problem. So here we have a really good example of an exacerbating factor that, if we'd fixed all of our other problems, wouldn't even have been exposed. But it's worth looking here: we didn't have a great reason to scale this application up. It was just, like, Kubernetes, infinite scale, hooray. Oh no, production's on fire. So I might just note a sort of training and procedure issue about changing the scale of production applications, and move on.

So let's talk about the kind of thinking we did there, because it's a very specific kind of thinking. We're not looking at what, directly and closest to this incident, caused it; we're asking the question: what in our system made this incident possible? The way I phrase this to newer developers is: look for procedural, policy, systems, and training problems that are affecting your organization, in macro ways that you can fix. It's never, ever a single person doing a bad thing that causes a major production incident like this. Not one time. Everyone that you work with is trying really hard and, assumedly (that's not a word), is doing their best. And so I always ask: well, given that this person was trying as hard as they could, what safeguards were missing? Why didn't they know that this was a dangerous operation to do? It's never a person; it's always a system. And I'm just going to shamelessly, directly quote Jess from this morning: if you show me a team that punishes people who make mistakes, I'll show you a team that makes a lot of mistakes.
I think that's really true, and it's sort of the perfect summary of my talk.

This takes time. Getting really good at post-mortems, getting really good at healing the problems in your organization that cause these issues to occur, takes time and practice and care and attention. The truth of the matter is that when I look at the people who write post-mortems around me, the ones who produce the best ones are the people who give the most of a damn about working with their colleagues: the ones who have empathy with everyone they work with, and the ones who understand the pain of not being able to sleep through the night because the computers aren't working.

So that's most of what I've got. If you want to steal our post-mortem template, I will be sharing it on the internet, because that seems like a good thing to do. I am hiring, literally, for my team, so if you want to work with exactly me, that's a thing that's happening. That's the least shill-y way I could possibly try to hire you: if you like me, come work with me. That's all I've got. Thank you.