Thank you, Jeremy. The opening slide is, I guess, something that most of us as developers, and even as product managers, must have done. So, show of hands: anyone who has run either this or a drop table on a production or a non-production system? Non-production is totally fine. OK, I don't see that many. Interesting. You're all great software engineers who never made this mistake. Unfortunately, some of us, like myself, tend to make this mistake, which means that on a production system we tend to believe it's probably a non-prod box, and we make that one-off mistake where we go in and run a command like this. What happens? The consequence can be anything from an hours-long outage involving maybe millions of dollars' worth of customer impact, to something as simple as restarting the box or failing over to another AWS Availability Zone. Today's talk is about what happens right after you run this command. What happens when your complex systems fail?

My name is Aish, and like Jeremy mentioned, I work for a company called PagerDuty. Without talking a lot about myself today, I'm going to dive straight into the topic and talk first about what complex systems are. The title of the talk is about what to do when complex systems fail, so to define a complex system, let's take a line from the English Wikipedia: a complex system is a system composed of many components which may interact with each other. Sounds very, very specific, right? Definitely not. This covers almost any software that we build. Software has this modular principle in which you have classes or objects or functions, different components that talk to each other. That inherently means that almost any piece of code you ship, apart from that one Hello World example, and maybe even that too, is a complex system. You might ask me, what's not a complex system? Well, if you build a bottle opener, that's not a complex system. So unless the system that you build is a bottle opener, what you build is more than likely a complex system.

And again, you might ask, what's the deal with all of these complex systems? But let's first address the elephant in the room. The elephant in the room is: why are you talking about failure? And second, why are all of these emojis around? Well, I am a millennial, hence all of the emojis in the talk. So hold my avocado for a second while we take a detour into the world of academia. This paper, as you can see on the screen, is called How Complex Systems Fail, by Richard Cook, who is a medical doctor, an MD. It was written in 1998 about patient care and healthcare systems. The paper talks about a bunch of scenarios for how things fail. But the objective of this talk is not to describe how things fail, but rather to talk about what to do when things fail. There's a great quote from the man himself, Richard Cook, about failures in general and failures of complex systems. To quote him: failure-free operations require experience with failure. Let's take a pause here and think for a moment. Failure-free operations require experience with failure. This is very, very counter-intuitive. It means that in order to deal with failure, you need to have prior experience of failure. Now wait, isn't this more like a chicken-and-egg problem?
Do you mean that you should first have the experience to go and fix things, which means you have to go and break things first? That's what this talk is about. Here's the structure of today's talk. First, we'll talk about a horror story. The horror story is not one of those high-budget Hollywood movies; it's just an operational nightmare that any one of us could be in, and it's from my own experience, something that happened to me in the past when we did not have a good operational and incident management framework in place. The second part of this talk deals with lessons learned from the story. In the story I'll be telling you, we'll see a bunch of failure modes in dealing with failures: systematic failure of multiple things, including communication, tactical things, and talking to customers. So the second part deals with how to actually deal with these failures in dealing with failures, which is very meta. And the last part is a review of the things we will have talked about.

So first, the story about failure. Chapter one: this is fine. To give you some background, this happened to me while I was an intern. I was still in college, at a small startup somewhere. It was the middle of the night, and I got a phone call. It was the CTO of the company calling to say, hey, it looks like this particular piece of software that you shipped is not working, and this big customer is not getting their reports. I didn't know what to do next. I was told there was some bridge number I needed to dial in to. I was told there was some HipChat room, back then, where I was supposed to go and talk to the other engineers. Being an intern who was still in college, that was the expectation of me. I, for that matter, did not even know what being on call meant. I was roped into this incident call to deal with a systematic failure of a complex system without actually being equipped with the knowledge of how to deal with these things.

Almost every engineer I knew was on the call. This was a 20-person startup, so 16 or 17 people were on the call. Most of them were half asleep; it was a Friday night. Someone had even dialed in from a bar; I could hear the background noise of people talking and laughing in the middle of all of this chaos. Most importantly, we were all trying to do the exact same thing: go to the last commit on GitHub and see what happened. Unfortunately, since all of us were doing the same thing (had we been smart enough, we probably wouldn't have been), we didn't get to the solution. As you guessed correctly, the problem was with a database machine somewhere, and since we had a bunch of machines for different services, we just didn't know what was going on. We had no clue where to start. So: 16 engineers in the middle of a Friday night, someone dialed in from a bar, an intern, and the CTO of the company, all trying to fix reports for this big customer without even knowing what went wrong and how this particular thing had failed. The actual problem, spoiler alert, was that our logging service, the log aggregator, was failing. The database box had run out of disk space, and as a result our reports were not being delivered. That was chapter one. This is chapter two: a dark and stormy night.
These lines: "It was a dark and stormy night; the rain fell in torrents, except at occasional intervals, when it was checked by a violent gust of wind." These lines are a famous anti-pattern in English literature. If you look at them closely, whenever you write an English-language essay, or anything that's not a poem, this is how you should not be starting. Now, why am I drawing a comparison between English literature and an operations call? That's because any engineer you asked on that call, "hey, do you know what we should do next?", gave answers almost as beside the point as these lines. The answers were, "well, you know, there's a wiki somewhere and it could be something, or there could be something out there," but no one definitively had an answer about what the problem was or what they were doing. It was almost as ambiguous as these lines here. Most importantly, there was no clear leader among us. It was like a herd of sheep where everyone was trying to follow each other. There was no one to coordinate.

In the middle of all this, we segue into chapter three: exec soup. The title says it all. The CEO of the company, a tiny startup at that time, jumped onto the call and started asking questions. These questions included things that, as an intern, I was definitely not aware of, but there were also engineers on the call who were aware of them and still didn't have the answers. Now you ask me, what questions are we talking about? The questions were, and this was at 2 a.m., again on a Friday night: can you send me a spreadsheet with a list of affected customers? In the middle of the night, when you are dealing with fires, the last thing you want, I mean literally the last thing you want, is an exec standing over your head asking you to send a list of affected customers. You barely know what the problem is, you barely know how you're going to deal with it, you barely know whether you're even talking to the right set of engineers who know the system inside out, and in the middle of all of this you're trying to get that one spreadsheet of affected customers, apart from that one customer who initially reported that things were not working.

So we were confused about what to do. And you know what? Adding more to the chaos, adding more to the complexity, we decided to do both, which meant we first got the list of affected customers and then went to deal with the actual problem, which meant the total amount of time we spent dealing with this incident was much longer. We spent almost two hours trying to get that list. It was finally at 4 a.m. that we got to know what the problem was, that particular log aggregator not running and those servers running out of disk space, which finally let us go and fix the problem. But the story does not end there. The morning after didn't bring us any hope; it brought some more pain, some more agony. I was blamed for the incident, despite being an intern, and I had to go and do a lot of things beyond just cleaning up: I had to add a cron job to clean up the old logs and add more metrics and monitoring, just because I was blamed. Now, the question you might ask, as we all can see, is: what's wrong with this picture?
It's a cat image that's upside down, so I'm definitely not asking about the cat image. I'm asking: what's wrong with the story? If you follow agile, DevOps, or any of these hipster terms, there's definitely a lot that was wrong in that picture. Despite this being a new-age startup, not one of those mammoth old companies that we as developers tend to stereotype, this was still the case. A lot of things went wrong, but we can categorize them into two distinct buckets. The first: we did not know the difference between a minor incident and a major incident. A minor incident is a recurring thing, something that can be automated, something that does not require you to wake up in the middle of the night at 2 a.m. and, among all things, go and log in to a computer. The second category of things that went haywire was not having a framework or a dedicated method to deal with a major incident, had it been one. We'll talk about what a major incident is in a while, but for now bear with me that there are two different categories: there's a minor incident and there's a major incident.

Before we try to address these problems that we saw, let's move on to the second part of this talk: lessons learned. Lessons learned from this particular horror story, lessons learned from the mistakes we made on that call, and how a good incident management framework can address the concerns and problems we saw. Before we start talking about the framework itself, let's see how we can deal with the first part of the problem, which was not being able to identify whether something is a minor incident or a major incident. The first thing that companies, organizations, and teams need to do is to define, prepare for, and measure what a major incident is. It's critical to define business failure in terms of business metrics. For example, if you're an online retailer, it might be the number of checkouts per second. If you're an online video or audio streaming platform, it could be the number of streams per second. At my current employer, PagerDuty, it's the number of outgoing notifications per second. Defining your most critical business metric and tying it back to the engineering system helps you build an understanding throughout the company of whether we are in a major incident, whether there is massive customer impact or not. The second thing is: get everyone in the company to agree on the metric. This means everyone, from the CTO, the CEO, and all of the execs down to someone like an intern, must agree that this is the metric we're looking at, and that once we cross this threshold, we are in a major incident. In my story, we did not really have a metric to talk about; we were affecting one customer. That was not great, but it did not require the entire company to be awake. Defining these metrics helps you calibrate the amount of time and effort you are going to spend dealing with these kinds of problems. The second part is about preparation. The best organizations prepare for failure beforehand. Like Richard Cook said, and to quote him yet once again in this talk, failure-free operations require experience with failure. Companies have their own versions of simulating failures.
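To make the "define and measure" part a bit more concrete, here is a minimal sketch of what tying a business metric to an agreed major-incident threshold might look like. The metric name, the threshold value, and the fetch_metric() stub are assumptions made for illustration; this is not PagerDuty's actual implementation, just the shape of the idea.

```python
# Hypothetical sketch: compare the agreed business metric against the
# threshold everyone in the company has signed off on.

MAJOR_INCIDENT_THRESHOLD = 100.0  # e.g. outgoing notifications per second (assumed)


def fetch_metric(name: str) -> float:
    """Stub: in practice, query your monitoring backend for this metric."""
    return 120.0  # placeholder value so the sketch runs end to end


def is_major_incident() -> bool:
    # If the critical business metric drops below the agreed threshold,
    # we are in a major incident and major incident response is triggered.
    return fetch_metric("notifications_per_second") < MAJOR_INCIDENT_THRESHOLD


if __name__ == "__main__":
    if is_major_incident():
        print("Major incident: trigger major incident response.")
    else:
        print("Within normal bounds: no need to wake the whole company up.")
```

The point of a check like this is not the code itself but the agreement behind the number: everyone, from the execs to the intern, knows which metric and which threshold mean "this is a major incident."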
A few companies call it game days. A few others call it Chaos Monkey, or Chaos Kong, or one of those chaos buzzwords. You can call it anything. It can be automated, semi-automated, or manual. It could be as simple as restarting servers randomly on a Friday. That's what we do at PagerDuty: we call it Failure Friday, and we run a bunch of simulations. But this is up to you, up to your org, your company, to prepare your people to deal with failure beforehand. The last and most important part of this triad is measuring things. Measuring the impact during these failure simulation exercises helps you go back and redefine those metrics if required, tweak them, and then get the other stakeholders in the business to agree.

Once you complete this triad, you need to make sure that any failure that triggers a human response is unique. If it's not, we should be able to automate the response. For example, in my case, the failure was just that something which was supposed to run on the machine and clear out old log files did not run properly. If the failure is something as simple as freeing up space on a machine, it should not be wasting human time. Human time is precious; if you can automate it, just go and automate it. I'll show a rough sketch of that kind of cleanup automation in a moment. So, like I said, remember: we should only be triggering major incident response if we are in a major incident. Getting those 20 people on a call to solve the problem only makes sense if it's something that could not be automated. If it was a button click away to just go and clear up the log space, we should probably have done that. Well, hindsight aside, let's move on to the meat of this talk.

The meat of this talk is a framework inspired by the National Incident Management System (NIMS) of the United States. This is a framework developed by the Department of Homeland Security, used for dealing with natural calamities and other incidents as classified by the US government. When I use these big words, people generally give me a look and say, wait, aren't we talking about software and IT operations? How can something that was designed to deal with natural calamities be applied to software and IT operations failures? Well, the core of it is dealing with failure, and when you try to categorize failure, the failure modes are quite similar. So the lessons learned from the NIMS framework can for sure be applied to software failure modes as well.

The first thing, as most software developers know, is the single responsibility principle, and the single responsibility principle I'm referring to is not about code reuse, or keeping things DRY (don't repeat yourself), or writing cleaner code. It's about making sure that whenever you get paged, whenever you get a phone call from your CEO or CTO or someone, there is only one person responsible for one task. Do not have redundancy there. Redundancy might be good when you are writing code and deploying it in distributed systems, but when you are talking about people, having the same task done by two people in a redundant way in the middle of the night is definitely not the way to go, particularly not in a major incident which might run for hours and hours.
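Going back to the "if you can automate it, automate it" point from a moment ago: here is a minimal sketch of the kind of cleanup job that could have replaced that 2 a.m. call. The log directory, the one-week retention, and the 80% disk-usage threshold are all assumptions for illustration, not what that startup actually ran.

```python
#!/usr/bin/env python3
"""Hypothetical log-cleanup job, e.g. run hourly from cron:
0 * * * * /usr/local/bin/clean_old_logs.py
Paths and thresholds are illustrative, not from the talk."""
import os
import shutil
import time

LOG_DIR = "/var/log/myapp"        # assumed log directory
MAX_AGE_SECONDS = 7 * 24 * 3600   # keep one week of logs
DISK_USAGE_LIMIT = 0.80           # involve a human only above 80% usage


def clean_old_logs() -> None:
    """Delete plain files in LOG_DIR older than the retention window."""
    now = time.time()
    for name in os.listdir(LOG_DIR):
        path = os.path.join(LOG_DIR, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
            os.remove(path)


def disk_usage_fraction(path: str) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    clean_old_logs()
    # Only page a human if cleanup wasn't enough -- that's the whole point:
    # don't wake people up for something a machine can do on its own.
    if disk_usage_fraction(LOG_DIR) > DISK_USAGE_LIMIT:
        print("Disk still above threshold after cleanup; page the on-call.")
```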
So since I just mentioned the single responsibility principle: based on NIMS, there are different roles that people can take when they join an incident call. And when I say an incident call, I'm referring to a major incident call. That means there's major customer impact, your production database has just been dropped, or, if you're an online retailer, your customers are not able to check out. What do you do? It's an all-hands-on-deck scenario.

The first role that comes to mind is the subject matter expert. The subject matter expert, sometimes called a resolver or a responder for the particular event, is the domain expert. It could be someone from the team that built the service, or someone who knows it well enough to go on call for it and fix things if necessary. In my story, we had 15 engineers redundantly working on the same part of the system, and we were all the so-called SMEs, the subject matter experts. You don't need 15 experts. One person: single responsibility principle. One person per logical component, so as to avoid confusion on an incident call, is sufficient. And this is the mantra of the subject matter expert: never hesitate to escalate. This relates back to my story. Since I was an intern, if I was on call and my CTO called me, what's the first thing I should do? The first thing I should do, as per the framework, is say it out loud: I don't really have enough context on this, so please escalate to the next level, please pull in someone else who knows the system better, so that I am not on call for this thing. So: never hesitate to escalate as an SME.

The next, and the most important, role on an incident call is what we call the incident commander. Before we get into the details of what an incident commander is and what the roles and responsibilities are, let's take a slight detour. The image in the background is of Gene Kranz. He is known for being the flight director for the Apollo 13 rescue mission. If you have seen the movie Apollo 13, you might remember someone coming in with a vest and getting everyone on board to work together as a team to bring a group of astronauts stuck in space back to Earth. Draw a parallel there: this is what an incident commander does. If your prod database has been dropped, or your business is somehow not able to function, in the middle of the night or the middle of the day, it does not really matter, the incident commander is the sole point of contact. The incident commander is the person who drives the entire incident call. This means the incident commander is responsible for single-handedly talking to everyone, from the CEO right down to the stakeholders on the call, and making sure that everyone is working together as a team towards the solution of the problem. So what's the first thing an incident commander does? The first thing an incident commander does is notify the company that it is in a major incident. This is an internal notification, which actually means jumping onto the incident call and saying it out loud: I have been notified that there is a major incident going on, and I'm the incident commander for this call.
Is there anyone else on this call? Which means you're trying to gather subject matter experts. That segues into the next part of the incident commander's roles and responsibilities: you verify that all subject matter experts are present on the incident call. This essentially means asking out loud whether people from the different teams, for whatever is going down, are present. And then you get on with the long-running task of dividing and conquering. What do we mean by divide and conquer? Isn't the incident commander the single point of contact for this? Yes, but the incident commander is not the subject matter expert. An incident commander does not need to know the ins and outs of the system. An incident commander does not have to be a principal engineer or a senior architect. An incident commander is there to coordinate and help people work together as a team to get to the solution of the problem. So the incident commander's responsibility is to delegate all repair actions and not act as a resolver.

The other key thing about an incident commander and an incident call is to communicate effectively. This also means maintaining order, controlling the chaos that comes out of a tired set of people trying to work towards a systems-level solution. The incident commander needs to bring human empathy to the call, which sometimes translates to just asking people to drop off the call and go spend an hour outside; they don't need to be on the call if someone else can be on it instead. So the incident commander is also responsible for swapping people in and out of an incident call based on their judgment. Effective communication also matters because people can be harsh towards each other. We are all human beings; people sometimes get tired on an incident call, they might shout at each other, they may not use the best words. So the incident commander's responsibility is also to make sure the communication stays good.

Next, the incident commander is also responsible for avoiding the bystander effect on the call. What do I mean by this? If I'm the incident commander, rather than asking for permission to do something ("please say yes if you think it's a good idea to do this"), I take one of the suggestions from a stakeholder, preferably an SME, a subject matter expert on the call, and ask, "is there any strong objection to doing that?" This helps avoid the bystander effect. We have seen, at the place where I work and at other places as well, that this cuts out a lot of the situations where the bystander effect shows up on incident calls when you instead ask for permission to do things.

The next thing is reducing scope. I guess we have all been in one of those situations where something is going on in a production system, your company's core business is affected, and just for the sake of information you leap onto that incident call just to know what's going on. One of the key things about being an incident commander is to reduce scope, which means not allowing people onto the call apart from those who are actually required to be present. And this is done simply so as not to burn people out.
Having more than the necessary number of people on a call just means there's a crowd, a lot of noise, and a lot more confusion. There can be a clash of ideas or opinions; people are opinionated, particularly engineers. So the incident commander's responsibility is to reduce scope, which means removing people from incident calls if you feel, as the incident commander, that they should not be part of that call or that their help is not required. You can politely ask them to leave the call and say, we will add you back to the call if we actually need your help. The next part is maintaining order. This is something we touched upon before, but one of the things we talked about in the communications part was reminding people to talk only one at a time, not having multiple people talk at the same time.

The next role is the deputy. And unlike in the old Western movies, the deputy is not responsible for a lot of things on an incident call. What the deputy does is act as an assistant to the incident commander. This means the deputy is responsible for getting subject matter experts up to speed on what's happening. Imagine you are, again, in the middle of a chaotic incident, something has gone wrong with the production system, and the deputy calls you on your phone; you are a subject matter expert joining the bridge, joining the call. The deputy is the person responsible for giving you that information. You might be added to the incident call five hours after it started, which means you probably have no context about what has been happening. Rather than having the incident commander stop all their other tasks and come back and talk to you, the deputy acts as the backup incident commander: calls you on your phone, or pages you, or whatever, reaches out to you, gets you on the call, and fills you in on what has been happening. In the middle of all of this, the incident commander can carry on with their responsibilities, so their standard workflow is not affected.

The other responsibility of the deputy IC, the deputy incident commander, is to liaise with stakeholders. Remember, at the beginning of this section on the incident commander, I mentioned that the incident commander is responsible for making sure everyone within the company knows that you're in a major incident. Now, in the middle of all of this, the incident commander might get an email, or a message on Slack, or a phone call from someone like the CEO. Rather than having the incident commander interrupted by these external interruptions, the deputy incident commander is responsible for liaising between the stakeholders and the incident commander. So the deputy incident commander acts as the liaison between the incident call and other stakeholders, and this includes people who might be execs in the company, or anyone else who is not part of the actual incident response call but wants to know what's going on. Speaking of not being part of the actual incident call but wanting to know what's going on: there's a dedicated role for that in this incident management framework, and it's called the scribe. What does a scribe do? A scribe documents the timeline of an incident call as it progresses.
This is just someone typing into your chat medium, and it does not even have to be a chat medium: it could be a Google Doc, or Slack, or HipChat, or Skype, any sort of messaging or shared documentation that's accessible to people within the company. This is internal. They document the time at which the incident call was first started, and then they start taking notes about what people said and how things are progressing. This acts as a bridge between people on the call and people off the call. The scribe also tries to get feedback from people who are outside of the call. For example, if I happen to know something that the SME for this particular incident does not know, I can just message the scribe on Slack or Skype and tell them: it looks like whatever you people are doing on the call may not really be accurate, there's an alternative. And the scribe can get that feedback relayed into the actual incident call without you, or anyone else outside of the call, having to jump back into it.

The next role is the customer liaison. In my story, the CEO jumped in and started asking questions about customer-facing things that the engineers did not really know about. The role of the customer liaison is to avoid that whole situation where an exec comes in and starts asking questions about customer-facing things. The customer liaison acts as the bridge between the customers and the incident call. This means keeping track of the customers telling you, all the time, that "you know what, your site is down," keeping track of those messages, and keeping track of all the support queries that may come in during your downtime. So the customer liaison is the person who acts as a bridge between any customer-facing request and the internal incident call. They also talk directly to the IC, the incident commander; rather than talking to the subject matter experts and confusing them with these things, they let the incident commander make the call on things like whether to go get that spreadsheet or whether to focus on the problem first. The incident commander actually makes that call, but the request comes in from the customer liaison.

The customer liaison is also the person responsible for notifying people outside the company about the incident. This involves sending out tweets, "it looks like we're having a problem with some service," something like an API, or the entire site being down, or putting these updates on your status pages. This job is specifically targeted at the customer liaison because they are in constant touch with the outside world, so they have a better picture of what to tell it. How do you translate this internal impact to the outside world? The customer liaison works with the incident commander to frame the particular message that is to be posted externally. And yes, the customer liaison also keeps the incident commander apprised of any relevant customer information.
This means things like bigger customers, or a large number of customers, complaining about a particular thing, so that the incident commander can use their judgment wisely about the situation as the call progresses. Next: the incident commander role sounds a bit heavy, so something the incident response guideline proposes is to allow a graceful transfer of command from one incident commander to another if necessary. What does this mean? It simply means calling in someone who is able to become an incident commander, giving them the information about how far the call has progressed, and signing off as incident commander. This is just to avoid burnout.

The next part deals with the thing we saw in chapter four of my story, the morning after. Blameless post-mortems are something the industry has talked about for years; John Allspaw from Etsy has written some really great stuff about them, so I won't spend a lot of time talking about blameless post-mortems. But the TL;DR version is that post-mortems need to be blameless. It's not a great thing to blame someone for an incident. At the end of the day, we are all human beings, and we are all more or less equally likely to make mistakes. Be it a C-level exec, a CTO, or someone who just started at your company, a new hire, an intern: we're all equally likely to make mistakes and impact the company. So blaming people for things that go wrong with complex systems is pointless. Remember something that great companies know: you really can't fire your way to reliability. Firing people for causing a major incident or a negative business impact is not the way to go.

There are a few gotchas about the role of incident commander. One of the most common questions we get is: who can be an incident commander? Does it have to be someone really senior in the company? The answer is no. Anyone can be an incident commander: anyone who can communicate well, knows the systems well enough, and is confident that they can deal with the chaos. And to make sure you are comfortable being an incident commander, the three-step mantra of define, prepare, and measure comes into play here. If you want to be an incident commander, or, for example, if I were that intern and had wanted to become an incident commander, the first thing is to prepare for it beforehand, which means running those chaos exercises, those game days, those chaos experiments before an actual failure, and acting as incident commander in one of those, so that you have enough experience to deal with an actual failure. So it does not really have to be someone senior in the company.

These are lines from a major incident call; names have been redacted. Say you join an incident call as an SME, you have been prepared and trained to become an incident commander, and suddenly you realize there are probably only four or five SMEs on the call. The first thing you do is ask: is there an incident commander on this call? Then you wait, and if you don't hear anything back, if there are just crickets, you say it out loud: "This is Aish, and I'm the incident commander for this call." Well, you say your own name; I'm just saying mine.
You could also very well say "this is Aish, and I'm the incident commander," but that's probably not going to work out that well. Next is wartime versus peacetime expectations. A lot of the time we don't get paged; things don't fail. Most developers don't go and run sudo rm -rf in production, or run drop table in production without taking a backup. So what is the expectation of an SME, of you, of me, of an engineer who works on a software development team and goes on call for the things they build? Once in a while things fail, but that's fine. What's the peacetime expectation? The peacetime expectation is that you are simply prepared to deal with failure. That's it. You can go have a life; there's nothing about being on call that you should be worried about. And the wartime expectation is to follow the guidelines and stick with them, not to join an incident call just to learn how things work. You can do that offline.

Now, moving on to the last section, the review. These are some key takeaways from the incident we talked about, my first incident, and from the framework. First: shit happens. Prepare for it. Run simulations. Train your people. Make sure your systems can deal with failure; things can go wrong at any time. Just make sure that everyone in your company, the business side as well as the technology side, knows this well enough. Develop on-call empathy. You might have seen the Twitter hashtag #hugops. Things can go wrong for anyone at any time, so it's very important to have empathy towards someone who just got paged and is working on a problem. Don't try to intimidate them. Be a good team player, follow the rules, and have some empathy. If you are an incident commander and you are tired, feel free to step down. There are no gold medals for on-call heroics; no one has received anything like the Victoria Cross for being an on-call hero. So don't try to be an on-call hero, and don't burn yourself out to prove that you know the system well enough and can deal with failures firsthand.

Before I leave you for today, here's the one slide I would keep if I had to condense these slides, this talk, into a minute-long talk or a lightning talk. The key takeaway for most companies and teams that build software is that people are the most valuable asset. Don't burn out your people by having them do something that can be automated, be it cleaning up logs, restarting a server, or just putting up a cron job. These are things that can be done automatically. You don't need to trigger an incident call and have 10 people out there to do something that trivial. People are your most valuable asset. Thank you, that's it.

Nice work. We have lots of questions; that caused lots of discussion and debate, as anyone looking at Slack will know. Firstly, everybody on Slack agrees that there should be a scribe on all calls in all situations. That was an easy one to start with. There's a question about where and when self-healing systems can be used. Could they be used at night time, if an incident in that area happened? Can code roll back?
How much value is there in self-healing systems?

Like most answers, the answer is: it depends. There's no one-size-fits-all solution. The generic approach has to be tweaked to your needs, but more or less it should work out in the scenarios you just mentioned. But, well, it depends; I'd need more specifics to really answer that question.

A follow-up question on Slack for later, then. There was a question about how, logistically, you manage this process in terms of things like ChatOps. Do you think that ChatOps really works as effectively when there's a need for a call about incidents?

Definitely, yes. I can give you an example from my current employer, PagerDuty. We have a ChatOps command to start a major incident call. It's an internal, in-house thing; unfortunately it's not open source yet. If you feel that a minor incident is escalating into a major incident, we have a ChatOps command that automates the process of calling in the incident commander, the customer liaison, the scribe, and the deputy incident commander. All of these people are pulled into the call from their given schedules and get paged. ChatOps definitely helps.

One of the big questions was around the scale of company this is appropriate for; I'll just pull out a couple of examples. In situations where you're working in a really small team, where there might actually be fewer people than there are roles, how does that work? Don't you quickly risk ending up in a situation where you've got 50% of your whole company on a call, and then who's actually out there doing the work? What size team do you need to make this appropriate, do you think?

That's a great question, and one that I get quite often. The most critical role if you're a small company, apart from the subject matter experts, who are the meat of the response, the people actually working towards solving the problem, is the incident commander. You'll need an incident commander, you'll need a scribe, and you'll need a customer liaison; these are the bare minimum. So you'll need at least three people on the call, apart from the people actually working on getting to the solution. Even if you're a 10-person team, I'd still recommend you have those three people there. This means that apart from the customer liaison, who's not from the engineering org but from the customer support org, you have two engineers on call, which is something a decent 10-person company should be able to do, I guess.

So in that situation, are you just removing some of the roles, or do you see the three people on the call as merging all of those individual roles?

Yes, you end up merging these roles, which translates to the scribe also having to act as the deputy incident commander, and some other roles getting merged as well; the customer liaison, for instance, would also help the incident commander. So there is definitely some overlap of responsibilities, but that's better than not having a structure at all.

Okay, final question then. What tools do you use in practice to implement these roles and procedures? Are there parts of it, little bits that are automatable, that you could recommend?
Are there any specifics that you would recommend we go and look at?

Since you asked: a bit of a shameless plug, we use PagerDuty. But apart from PagerDuty, we use some great monitoring tools which help us get the data, and we use a bunch of open source tools, like I said, such as the internal chat plugin that we wrote, which is built on a bunch of open source chatbot commands. So: ChatOps, good monitoring, and PagerDuty for schedules and incident management.

Okay, great, thank you very much. Thank you all.