Welcome, everyone, to Shrinking Production Incidents by Damodran. We're very glad that you can join us today.

Thanks, Siddharth. Good morning, everyone. I'm Damodran Rajalingam; people usually call me Damu. I work for Google in the Sydney office, where I've been for close to seven years, and I've been an SRE or in an operations role for about 15 years.

So what are we going to talk about today? As mentioned in the introduction and on the screen, it's shrinking production incidents: a data-based approach to setting engineering priorities. We will talk about service outages, the different phases of an outage, and what we can learn from them, and then, based on those learnings, how we can set engineering priorities and identify opportunities to improve the reliability of your service. Okay, I'm in the wrong window. Sorry. Okay, cool.

Before we delve into the presentation, I'd like to mention that we'll be discussing this in the context of SRE — for those who haven't heard the term, site reliability engineering. What is SRE? SRE is something Google came up with when Google leadership realized that the traditional ops model was not going to scale. In a traditional ops model, you usually add people linearly as the service grows, and there is also a split between the product and the ops teams, which can lead to friction. SRE converts this ops problem into a software engineering problem: you incentivize people to solve problems through automation, or by making the system automatic, instead of manually handling issues or doing repetitive toil. That is SRE's fundamental goal.

One of the advantages of SRE is that it aligns the incentives of management, development, and operations. As I said, in the ops world, where I have been before, there is usually tension between the product team and the operations team. The product team wants to roll out features as fast as possible, and the operations team thinks that when you make changes to the system, you break the system. There is this natural tension, and there is no common vocabulary between the teams. SRE aligns the incentives of management, development, and operations using a common vocabulary. What comes out of it is that you keep product development velocity — you can move as fast as possible — while also enforcing reliability; you can make trade-offs because the priorities are aligned.

How does SRE achieve that? The fundamental concepts that drive SRE are SLOs, error budgets, and blameless postmortems. An SLO is basically how you measure user satisfaction with your service. The error budget is how much of that SLO you can burn on feature releases, tests, or failures. And blameless postmortems are how you learn from failures. These three are the fundamental concepts of SRE that drive product velocity while keeping reliability. This is a very high-level view; you can visit google.com/sre to learn more, and there are a couple of books that delve much deeper into it.

Okay, so why are we talking about reliability?
Because happy users stay and unhappy users leave. This is especially true now: the internet landscape is quite competitive, and if your users are unhappy, they are going to find a different product. It's easy to move.

So what is user happiness? The obvious thing that comes to mind is features: does the product have this awesome feature, does it have that specific feature? That's what we advertise on. But that's not what I'm going to talk about here. What I'm going to talk about is another feature, one which is not usually advertised but is implicitly assumed to exist: reliability. It's a feature users don't notice when it's present, but it certainly gets noticed and talked about when it's not. If your service is unreliable, users are not going to stay with it. So it is the most important feature, I would say — given that I'm a site reliability engineer — because there's not much point in having awesome features if users are not able to use them when they want to.

Okay, so when we talk about reliability, what disrupts it? A service disruption, or an outage. Systems these days are really large, complex, distributed systems, and in such systems it's not a question of whether there will be an outage, but when it happens and what its impact is. In this presentation we will dissect outages, see what we can learn from them, and apply that to improve the reliability of our systems.

Before going into that, let's look at the different stages of an outage and how they affect our users. Here we see a red bar and a blue bar. When you are in the blue bar, you're above your SLO target, so users are happy using your product. Now let's say you make a release — you're rolling out a new feature — and something starts going bad. Users start seeing issues with your site, and that's when your outage begins. Note that at this point the on-call is still not aware that an outage is happening. They could be having lunch, they could be out playing, or they could be doing other work. After enough of your SLO burns, your monitoring picks it up and the on-call gets paged. Their phone goes off, they acknowledge the page, and they start looking into it: something is wrong with the service, what happened, what do we do now? The time between when the outage starts and when the on-call gets to know about it is the time to detect. Depending on how impactful the event has been, we may have to tighten that interval.

Now the on-call has worked on the issue and mitigated it, the user experience has started to improve, users are happy with the service, and things are getting better. The time from when the on-call gets notified — when the issue gets identified — to when it gets mitigated is what we call the time to mitigate.

Things go fine for some time, and, as we said, outages do happen, so the next outage begins. The time between these two outages is what we call the time between failures. So think about what happens if we shrink the time to detect or the time to mitigate, or stretch the time between failures.
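To make those three measures concrete, here is a minimal sketch of how you might compute them from an incident timeline. The incident records and field names are hypothetical; the point is only that time to detect, time to mitigate, and time between failures fall straight out of the timestamps of each outage.

```python
from datetime import datetime

# Hypothetical incident records: when the outage started, when the
# on-call was paged, and when the user impact was mitigated.
incidents = [
    {"start": datetime(2024, 3, 1, 10, 0),
     "detected": datetime(2024, 3, 1, 10, 25),
     "mitigated": datetime(2024, 3, 1, 11, 40)},
    {"start": datetime(2024, 3, 9, 2, 15),
     "detected": datetime(2024, 3, 9, 2, 20),
     "mitigated": datetime(2024, 3, 9, 3, 5)},
]

for i, inc in enumerate(incidents):
    ttd = inc["detected"] - inc["start"]       # time to detect
    ttm = inc["mitigated"] - inc["detected"]   # time to mitigate
    print(f"incident {i}: TTD={ttd}, TTM={ttm}")

# Time between failures: gap from the end of one outage to the start of the next.
for prev, nxt in zip(incidents, incidents[1:]):
    tbf = nxt["start"] - prev["mitigated"]
    print(f"TBF={tbf}")
```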
Or what if we change the slope of this curve — that is, how quickly the service gets bad? What if we slow it down or flatten it? That's basically the goal of this talk; that's what we mean by shrinking these production incidents.

Okay. So an outage happens. How much does this outage cost you? There are direct costs: loss of revenue; users frustrated with the service who leave; the friends they have talked to about the service who now won't join, so your user base doesn't grow. And these days, depending on the size and impact of the outage, you may even appear in the headlines or be trending on Twitter. These are some of the costs of having an outage.

Now think about this question: how much would you pay to not have this outage? That's the question that pops into our minds. Now let's turn the question around. I asked how much you would pay to avoid the incident; but now you have already paid — we talked about those direct costs. So what can you learn from this outage? What value can you extract from it? Put another way, you can consider the outage an unintentional investment. You were not looking to make it, but there you are: you had an outage, you made this investment with the upfront cost of lost revenue or lost users, and now you want to extract value out of it. That is basically what a postmortem is. Postmortems turn the question into: what can you learn from the outage?

So what are the goals of a postmortem? The first goal is that the incident is documented: that such an incident happened, when it happened, and why it happened. Having a repository of postmortems helps you understand your systems better and helps in prioritizing what you have to work on.

The next important thing is that you understand all the root causes that contributed to the incident — and I want to highlight "all". Usually we think there was one root cause behind the whole incident, but in many cases there is more than one. You can also think about it this way: there could be one root cause that actually triggered the outage, but say that after the on-call got notified, they took a long time to mitigate the issue; now there is also a root cause for why the on-call couldn't mitigate it fast enough. So there can be multiple root causes in one outage, and it's important to identify and document all of them.

Now that you have identified the root causes, the next thing is to figure out how you can avoid the recurrence of these outages in the future. That gets documented as action items. When you're formulating these action items, your goal is not only to prevent the recurrence of this particular outage, but also to prevent the occurrence of the whole class of outages it belongs to.
And it's also very important not to stop at just documenting these action items. We should make sure they get prioritized accordingly in the dev roadmap, the SRE roadmap, or the product roadmap — whoever has to take action has to take these action items on. That's the understanding we have.

Okay, so when do we write postmortems? You write a postmortem on any significant undesirable event: a huge outage, or even an outage that was a near miss — it could have affected the user experience greatly, but it didn't, and you got lucky. The SRE motto at Google is that hope is not a strategy. Just because you got lucky, you shouldn't hope that you will continue to be lucky. So even for near misses, you should write postmortems.

The other thing to realize is that writing a postmortem is not a punishment. For me, it's a creative activity: I dealt with an outage, and now I get to think of different ways to fix that issue. Think about when you're working on a product: many times we implement something because we think it will be a good feature. A postmortem is the other way around — the outage has already shown you that this is an important thing to implement. So writing a postmortem is not a punishment; it's a rewarding activity, personally and for the product.

Okay, so we've talked about what a postmortem is and when we write one. Another important thing about postmortems is that the glue holding all of this together is the culture of blamelessness. Say you are an on-call engineer and you get paged: your system has high memory usage, or users are seeing some sort of failures. The on-call engineer looks at the playbook and thinks, okay, this could be because of memory leaks or something, or based on their past experience they decide to restart the servers — and that ends up being counterproductive. Instead of mitigating the issue, it amplifies it. The point is that when you write the postmortem, you don't point your finger at that on-call person and say they did something wrong. Instead, the postmortem asks: why was the engineer not prevented from restarting so many servers at once? Or you use the postmortem to make sure we clean up the bugs in the software that caused these memory leaks or these errors in the first place.

So in the culture of postmortems, the first thing is that we assume good intentions: we assume the on-call engineer did their best at the time, based on the tools and information available to them. Hindsight is obviously 20/20; after the incident has been resolved and you look back, some things could possibly have been done differently, but the on-call did what they could at the time with the information available to them. That's the first thing: assume good intentions. The next thing is identifying causes without implicating people.
For example, let's say I was the person who restarted the servers and that resulted in an outage. When you're writing the postmortem, it's important not to say "Damu restarted the servers"; instead, you say the on-call engineer had to restart the servers. Or let's say I submitted a change which resulted in an outage: instead of saying "Damu's code", you say change number one-two-three-four had this bug. You don't point fingers at the person; you implicate the causes — what happened, the actions — rather than the person. And the intention here is not to fix people. As we said earlier, we assume they acted with good intentions. The goal is to fix the system: why did the system allow the on-call engineer to do something unsafe? That's the question we ask.

Again, to be clear: not pointing at people for blame doesn't mean the solution can't involve people. What I mean is that the solution doesn't always have to be technical. Sometimes the problem is that the on-call engineers did not have good incident management training. That's something systemic in the organization, and it involves humans: they have to be trained appropriately so that they're able to manage incidents. So although the blame doesn't rest on the people, the solution can involve people.

For a postmortem to be effective, you need to get to the bottom of the issue: why it happened, how it happened, and what can be done. For that to happen, the people who were on call or involved in the outage must not be scared away; they should feel free to express their views and to document exactly what happened. If there is no safe environment for them to record what happened with high fidelity, then the postmortem is not useful, because you haven't found the actual issue. For those of you in upper management, think about how you can drive a blameless culture in your organization, because as an engineer it's very difficult not to pass the blame along if upper management is intent on pinning the blame on someone. It has to be driven from the top. Okay, so that's what a blameless postmortem is.

So we've talked about the different stages of outages and about postmortems — how we record what happened and the action items. Now, where do we focus our efforts? When you have a critical mass of postmortems, patterns start to emerge, and you'll be able to identify where there is a frequent lapse, where the gaps are, and then you can apply engineering work to fix those gaps. The list here is not exhaustive; it's a good representative set of things we apply at Google to reduce the impact and duration of incidents. For the time to detect, the fundamental thing to improve is defining and measuring failure: how do we define what a failure is, and how do we measure it? The time to mitigate is largely a human problem: how are we training people?
Have we given the on-caller a good environment in which to fix the issue, or are they overloaded? That's what we'll focus on for the time to mitigate. And how do we ensure that the service runs without outages, so that the time between outages is longer? That fundamentally boils down to engineering discipline.

Okay, so from here on we will zoom into each stage, starting with the time to detect. As we discussed, the time to detect starts when the outage starts — when users start to see issues — and you get notified when your SLO budget starts burning so fast that you're going to eat up the whole budget. That's when you get paged. The on-call cannot act on issues they don't know about. There is a trade-off here: if the outage was very impactful during this window, then we have to tighten the time to detect. But if you alert too quickly, you'll probably be paging someone at midnight on a transient issue; if you alert too slowly, you've made a lot of users unhappy.

So what are the basic things we need to do to shrink the time to detect? The first is refining the SLIs. What is an SLI? SLI stands for service level indicator. This is how you define a good user experience. It could be availability: you say that if a user is able to send an email, that's a successful operation — that's an SLI. Or it could be latency: how long it takes for their request to be served.

How do you get good SLIs? The first thing is to measure as close as possible to the user. If you can instrument the client and measure latency and availability there, that's good — you want to measure from the user's perspective. Instrumenting the client could mean that users appear to be seeing issues because of flaky networks or other systems outside your control, but that doesn't mean you shouldn't measure at the client. What it gives you is a picture of how often users actually face these outages, and if external dependencies are affecting user happiness, you can prioritize engineering around them. Sometimes it isn't possible to instrument your client; then you can approximate it with probers that simulate clients, or do other sorts of measurements. But be aware that when you do that, you are blind to some aspects of the user experience.

The other thing is to verify against external sources. A good SLI is one that goes up when the user is happy and goes down when the user is unhappy. For example, you have an SLI defined, you get alerted on it, and you check with your users — they don't see any issues, they're happy — which means there's a mismatch between your SLI and user expectations. Or, worse, you've defined an SLI that doesn't page you while users are seeing an outage. So talk to your users, understand what matters to them, and base your SLIs on that.

Now that you have SLIs, you need to set a target for each SLI. That's what we call the SLO, the service level objective. Similar to how you refine SLIs, you also iterate on your SLOs.
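As a minimal sketch of how an SLI, an SLO, and the error budget fit together — the request counts, the window sizes, and the burn-rate threshold here are made up purely for illustration:

```python
# Availability SLI: fraction of requests served successfully, measured as
# close to the user as possible.
def availability(good: int, total: int) -> float:
    return good / total if total else 1.0

# SLO target: 99.9% availability over a 30-day window.
SLO = 0.999
ERROR_BUDGET = 1 - SLO            # fraction of requests allowed to fail

# How much of the budget has been consumed so far in this window?
window_sli = availability(good=9_995_000, total=10_000_000)
budget_consumed = (1 - window_sli) / ERROR_BUDGET
print(f"30-day SLI={window_sli:.4%}, error budget consumed={budget_consumed:.0%}")

# A simple burn-rate check for alerting: over a short recent window, ask how
# many times faster than "exactly on budget" we are failing. Paging on a high
# burn rate catches big outages quickly without paging on every transient blip.
recent_sli = availability(good=98_000, total=100_000)   # e.g. the last hour
burn_rate = (1 - recent_sli) / ERROR_BUDGET
if burn_rate > 10:
    print("page the on-call: burning error budget ~10x faster than sustainable")
```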
The SLO should again be user-focused. There is a cost to setting it: make sure the SLO isn't so relaxed that it doesn't reflect user happiness, but also make sure it's not too tight. Understand that there is a cost to adding more reliability, and that cost can be exponential: moving from 99% to 99.9% could cost you 10x more engineering effort or complexity. So understand your users' requirements and set your SLOs accordingly.

Okay, so now you have SLOs, and the next thing is that your alerts should be effective. When an on-call receives an alert, they should feel confident that the alert actually means something. It should be effective and user-focused, as we've discussed: if you receive an alert, it means there is a user issue. It should be sensitive enough and actionable. It's a fine line — a trade-off you sometimes only find after a few iterations. You may not strike the balance right away: sometimes your monitoring is too sensitive and your alerts aren't actionable because you alert on transient issues, or your alerting is too slow and you only get notified once the outage has become much bigger.

Okay, so we've talked about how to define and measure service failure, which, when improved, will shrink the time to identify the issue. Once you've identified the issue, the next thing is how to shrink the time required to repair or mitigate it. As we said earlier, this is mostly a people issue. You need good policies, you need training, and, of course, another important thing is stress management — it's one of the most important parts of handling an incident.

Another thing I want to highlight: the first thing the on-call has to focus on is mitigating the failure. Sometimes the curiosity in an engineer gets in the way, and while the incident is happening you start digging into the code and the system to find the root cause. But to emphasize: the first action the on-caller has to take is to mitigate the issue. That could be draining the service away or rolling back the update — it could be anything — but debugging the code and trying to fix it comes after the mitigation.

Okay. So what can you do to improve the time to mitigate? The first thing is to train the responders. You should have emergency procedures defined, documented, and tested. It's important that they are tested: there is no point in having an emergency procedure that has never been tested and doesn't work when applied. The procedures should provide enough structure and protocol, but at the same time be flexible enough that the on-caller can do some ad hoc exploration. Also, training doesn't mean training only the on-call. Sometimes an incident spans more than the SRE team: you may have to involve the developers, the product managers, or someone else to manage communication. So you should try to train everyone who could possibly be involved in an incident. The next thing is running practical and theoretical failure drills.
At Google, some teams have a weekly exercise called the wheel of misfortune, where you trigger a failure in a non-production environment, or simulate one, and a volunteer tries to debug the issue. Most often it's not really a volunteer — you get volunteered — and we even have an internal tool that spins a wheel of misfortune and selects a name, so you become the lucky winner who gets paged on this training incident. What this gives you is a team that stays trained on the emergency procedures, which can otherwise get rusty, and it builds the critical thinking to figure out how to debug issues and when to escalate. They get to practice in a safe environment, so that when it actually happens they're confident handling it.

Okay, writing a suite of runbooks. When you get an alert, the first thing the on-caller looks at is the runbook, so there should be a runbook for every alert, and the runbook should have enough detail for the on-caller to figure out how to debug the issue. In my experience, it also shouldn't be overdone: you shouldn't write a research paper or the entire design document in your runbooks (some people call them playbooks). It should have the specific response procedures, links to dashboards, and clues for the investigation, like what to search for in your debug logs.

Another trick we sometimes use inside Google: the alert carries details about the affected entity — this particular job, running in this particular location or environment, is having an issue. When you go to the dashboards, you have to apply these filters to focus the data on that entity. If your tooling allows it, the dashboard links in the runbook should be prefilled with those filters so the on-caller doesn't have to fill them in every time. It makes things easier, and it also avoids human error — you don't put the wrong values in the filter and end up looking at the wrong data. There's a small sketch of this trick a little further below.

The other important thing: when you write a runbook, write it from the perspective of a newbie. As someone very experienced with that alert, you tend to assume the people reading the runbook know all about it, but you will often be surprised when you ask someone else to go through it. Linking this back to the failure drills: a wheel of misfortune is a good time to ask the person doing the drill to follow the playbook and see whether they can actually debug the issue, and whether anything is missing. So make sure the runbook has all the relevant details, and by running failure drills you also regularly verify that the runbooks are up to date.

The next thing is responder fatigue. Being on-call is a stressful job; it's a big ask. Keeping the pager load down so that you don't overwhelm the on-caller is very important. One rule of thumb we have at Google is that two incidents per shift is a manageable pager load. It can differ between companies and teams, but you should have some notion of how much pager load is acceptable, and you should make sure the pager load stays below that limit.
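Going back to the prefilled dashboard links: here is a minimal sketch of the idea. The dashboard URL, the alert label names, and the query parameters are all hypothetical; the point is only that the runbook link carries the alert's own labels, so the on-caller lands on an already-filtered view.

```python
from urllib.parse import urlencode

# Hypothetical alert payload as delivered by the monitoring system.
alert = {
    "job": "mail-frontend-server",
    "location": "sydney",
    "environment": "prod",
}

# Hypothetical dashboard that accepts its filters as query parameters.
DASHBOARD_BASE = "https://dashboards.example.com/d/service-overview"

def prefilled_dashboard_link(alert: dict) -> str:
    """Build a dashboard link already filtered to the alerting entity,
    so the on-caller doesn't retype (or mistype) the job/location filters."""
    params = {
        "var-job": alert["job"],
        "var-location": alert["location"],
        "var-env": alert["environment"],
    }
    return f"{DASHBOARD_BASE}?{urlencode(params)}"

print(prefilled_dashboard_link(alert))
```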
The other thing that contributes to responder fatigue is toil — manual, repetitive tasks. You should automate most of those to reduce toil and reduce the fatigue of the on-caller. The other important thing is shedding load, and I mean that in two ways: one is the number of pages, and the other is too-frequent on-call shifts. Your on-call roster should be well sized so that people don't go on call too often. At Google, we recommend that an on-call team has at least six engineers. If you can't have that many engineers and you have a lot of on-call load, one way to shed it is to have part of your dev team also join the on-call roster, so you take some load off the on-call team and avoid burnout.

Okay. The other important thing that helps in mitigating issues is having good dashboards and logging. It gives the on-caller relevant information for debugging the outage. Another SRE principle is: measure everything you can — log all your response codes, all the error codes, mutex contention, whatever you can. But keep in mind that this has costs, so make sure you have reasonable retention limits so you don't blow up your bill. Good dashboards and logging help the on-caller quickly pinpoint where the issue is.

Okay, so now we've mitigated the issue. How do we keep issues from recurring — or, how do we make sure there is a good amount of time before the next failure occurs? While saying this, I also want to caution that you can get too hung up on avoiding failure, on making the time between failures very long or on never failing, and that is the wrong thing to concentrate on. Failures do happen. It's an important thing to work on, but if you get too hung up on maximizing the time between failures, it has costs: slowing your feature releases, adding engineering complexity, and so on. There's also a side effect: say you have a service that hasn't paged for a long time, and when it suddenly pages, you have an on-caller who isn't experienced in handling that page, which means they will take a lot longer to mitigate the issue. So yes, maximizing the time between failures is important, but don't get too hung up on it.

As we mentioned earlier, maximizing the time between failures basically boils down to engineering discipline. You have to foster a culture of quality: no monkey-patching the code, no cowboy coding. Make sure the code being added is roll-forward and roll-back compatible. Follow good practices and code reviews. Code reviews help you enforce these good practices: you send your code to someone else who can ask questions like, hey, this looks bad, can you fix it? So code reviews are a practice you should follow. Test coverage requirements ensure your code is tested well. At Google, every change list automatically runs every relevant test, and you can submit the change list only if all those tests pass.
So we try to make sure the change that's going in has been tested thoroughly. And looping back to code reviews: when you review code, you also make sure enough tests have been added to cover the functionality.

Okay, the other important thing that helps maximize the time between failures is CI/CD — continuous integration and continuous delivery. Automated testing, which we already talked about: you should have all your tests automated, and every change you make should be tested thoroughly. Having a CI/CD pipeline ensures the tests are run automatically. The other important piece of continuous integration and continuous delivery is gradual rollouts. This is what prevents that steep sadness slope we saw in the graph earlier, where the curve drops sharply and eats all your SLO budget. You release your new feature to a small set of users and compare their errors, latencies, and performance against users still running the old version. Sorry, I think I'm slightly over time — I'll finish in another five minutes. So you gradually roll out to an increasing fraction of the user population, so that if you introduce a bug, you don't affect the entire user population at once. This is a very fundamental technique that prevents huge outages and reduces your blast radius. There's a quick sketch of this canary-style comparison below.

Then, automatic rollbacks. It's very important to make sure the version you're releasing can be rolled back safely. In a distributed system there will always be more than one version running, and you have to ensure the data formats and RPC protocols shared between them are compatible across versions. Another thing some of our teams do is that every release automatically gets tested for rollbacks: you roll out the release, roll back one version, roll back two versions, and make sure you can roll back across your whole compatibility window.

Okay. The other thing, obviously, is to have a robust architecture. What is a robust architecture? It's one where your system is able to satisfy user queries even in the face of significant internal failures. Some of the good practices here: don't have a single point of failure, have redundancy, and have graceful degradation — see if you can serve good partial responses instead of totally failing the output to the user. Say you have 10 APIs behind your frontend and one of them fails; if you can fail just that part and show the remaining information to the user without failing the entire page, you have some graceful degradation.

And finally, chaos engineering. It's a weird term, I would say. Basically, you introduce chaos into the system to find issues before they find you — you try to break the system in a controlled, safer environment. And automated disaster recovery: earlier we talked about wheels of misfortune, and there's another kind of exercise Google runs called DiRT — disaster recovery testing — where we simulate events like data center failures and try out the scenarios that could break the system.
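Going back to the gradual rollout point for a moment, here is a minimal sketch of that canary-style comparison: expose the new version to a small slice of traffic, compare its error rate against the baseline, and roll back instead of widening the rollout if it looks significantly worse. The counters and thresholds are made up for illustration.

```python
# Hypothetical counters collected during a canary stage: the new version
# serves a small slice of traffic, the old version serves the rest.
canary = {"requests": 50_000, "errors": 180}
baseline = {"requests": 950_000, "errors": 950}

def error_rate(stats: dict) -> float:
    return stats["errors"] / stats["requests"]

# A crude comparison: if the canary's error rate is more than 2x the
# baseline's (with a small absolute floor to ignore noise on tiny counts),
# treat the release as bad and roll it back instead of expanding it.
RATIO_THRESHOLD = 2.0
ABS_FLOOR = 0.001

canary_rate = error_rate(canary)
baseline_rate = error_rate(baseline)

if canary_rate > max(baseline_rate * RATIO_THRESHOLD, ABS_FLOOR):
    print(f"canary error rate {canary_rate:.3%} vs baseline {baseline_rate:.3%}: roll back")
else:
    print(f"canary looks healthy ({canary_rate:.3%} vs {baseline_rate:.3%}): expand rollout")
```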
DiRT is mostly geared towards identifying your system's vulnerabilities, and it also trains people on how to handle those situations — it does both. It's also possible to automate your disaster recovery testing: you can use things like Chaos Monkey and make it part of your release tests — you deliberately fail parts of the system and see how it responds. The idea is that you find issues before they blow up in your face: you have certain assumptions about your system, like that it will stay stable under certain circumstances, and you break those assumptions and see whether the system actually holds up.

Okay, this is the final slide. In a sense, what we have discussed is: have SLIs and SLOs — that's how you define your failures and measure the health of the system — and let the postmortems guide you. After every incident, write the postmortem; it will give you insights into where your system fails, how you can improve its resilience, how you can shrink the time to detect and the time to mitigate, and how to prioritize the action items. If you have enough postmortems, you know where to invest your engineering, and hopefully you will shrink your production incidents. Thank you. Sorry about running long. So, do we have any questions?

Thank you so much, Damodran. I think we may have exhausted our time budget for questions — sorry about that. There was a lot to absorb from your talk; I've learned something new today, and I know for sure that there would have been questions. This was a great segue from Dave's talk this morning into a peek behind the curtain: he covered it at a very strategic level, and we got a great look at the tactical view of how to actually implement a lot of what he talked about. So I think that's pretty great.