Hey, so I'm audible, right? Yeah, I think it's perfect. Awesome. So yeah, basically, I just wanted to start off with setting some expectations about this talk. This is going to concentrate more on the why instead of the how, hence this will probably not be a very technical talk. Also, this is kind of an evolving talk, so as things proceed, things may change over here as well. With that, I think Joy pretty much covered everything about me, so let me start. You can pretty much treat this talk as a trip report of this book called Implementing Service Level Objectives. That's where I got the idea for this talk, and lots of the ideas which I discuss over here are discussed in depth in the book itself. Right, so with that, let me set the stage. When we talk about SLIs, we are talking about service level indicators. After that, SLOs are service level objectives, and SLAs are service level agreements. Have you noticed something interesting over here? The term service keeps popping up, right? So what do we mean by this term service? To define a service, I'll first have to define a user. For the purpose of this talk, a user is anything or anyone that relies on a service. It can be a paying customer, an internal user, or even another service. With that in mind, a service is any system that has users. That's how I would define a service for this talk. So what we offer to our customers is a service, right? They give us money and we provide them some infrastructure as a service.
In the next slide, let's try to explain how to think about these services. Reliability means not just availability, but also many other measures, such as quality, dependability and responsiveness. The question "is my service reliable?" is pretty much the same as asking "is my service doing what the user needs it to do?" So I'm going to take this in terms of three different truths. The first truth: a proper level of reliability is the most important operational requirement of a service. The second truth: how we appear to be operating to our users is what determines whether we are being reliable or not. It's not about how things look from our end; it's what the customer sees that matters the most. It doesn't matter if we can point out zero errors in our logs, or perfect availability in our metrics, or incredible uptimes. If our users don't think we're being reliable, then we are not. The third truth: nothing is perfect all the time, so a service also doesn't have to be perfect all the time either. Not only is it impossible to be perfect all the time, but the cost in both financial and human resources, as we get closer to perfection, is very, very steep. Luckily, it turns out that our software doesn't have to be 100% perfect all the time either. Unless it's a very safety-critical system, like a heart pacemaker or something like that. But that's not what I'm going to be concentrating on here. So then how do we manage this situation? How should we think about these truths? What I want to do is introduce people to the reliability stack. Reliability is the most important requirement of a service, and it's okay not to be perfect all the time, hence we need a way to think about these truths. We have limited resources to spend, be it financial, human or political, and one of the best ways to account for these resources is via what is called the reliability stack. And the base of the reliability stack is service level indicators, or SLIs.
So let me start there. An SLI is a measurement determined over a property of a service. What this means is an SLI is any metric over a property of a service which can be measured, and it measures our service from the perspective of our users. It might represent something like whether someone can load a web page quickly enough. People like web pages that load quickly, right? But also, things don't have to be instantaneous. What I mean is, most pages have that thing where you can click and expand into more information about whatever is initially presented to you, right? So in that case, a good SLI would be about how quickly the initial things are brought up; the parts which expand can be filled in later. That's what I wanted to convey at this point. An SLI is most often useful if it can result in a binary good or bad outcome for each event. That is to say, either yes, the service did what the user needed it to do, or no, the service did not do what the user needed it to do. And here is the reason why: once we have a system that can turn an SLI into a good or bad value, we can easily determine the percentage of good events by dividing by the total number of events. So, to drive the point home, here's an example. Let's say for 60,000 visitors to a website in a given day, we are able to measure that 59,982 of those visits resulted in the page loading within two seconds. Here's how we work that out: you take the count of good events and divide it by the total number of events, and you get a ratio which will always be at most 1. You multiply it by 100 and you get this percentage of 99.97% good events. So, after getting our SLIs, we can climb to the next rung of the ladder, which is SLOs. An SLO is the proper level of reliability targeted by the service.
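The good-events arithmetic here can be sketched in a few lines of Python. This is a minimal sketch: the function name is mine, and the numbers are just the ones from the example.

```python
# SLI as a percentage: good events divided by total events, times 100.
def sli_percentage(good_events: int, total_events: int) -> float:
    """Percentage of events that did what the user needed."""
    return good_events / total_events * 100

# The talk's example: 60,000 visits, 59,982 of which loaded within 2s.
print(round(sli_percentage(59_982, 60_000), 2))  # 99.97
```

The same ratio works for any SLI that can be reduced to a binary good/bad outcome per event.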
So, what I mean over here is, let's say we continue with our example. Let's say we didn't hear any complaints about the day when we had that 99.97% of page loads within two seconds. We could infer that users are happy enough as long as we are hitting that 99.97% figure. So in this case, that 99.97% becomes the SLO. And what do we mean by error budget? An error budget is a way of measuring how our SLIs perform against our SLOs over a period of time. It tells us how unreliable we are permitted to be within that period, and serves as a signal of when we need to take corrective action. So, I think it's clear so far. I'll take that as... we do not have any questions right now related to this particular talk. Ah, we do: I had a question on SLIs. What do we do when the SLIs are dependent on things outside our control? So, in this specific case, you said the SLI is whether the page loads within two seconds or not, a count of good events divided by a count of all events. But page loads are not necessarily entirely in our hands, for example. There may be other SLIs that are even more out of our hands. For example, in the case of payments, there'll be customers that initiated a payment but never completed it. What if the SLI is not in your hands? I didn't understand that. So, the service level indicator we took was whether the page loads within two seconds or not. But the customers that are loading these pages may be on a 2G connection. As a consequence, does it change how I approach my SLI in any way? Not really, because you take decisions based on the SLOs, and how you form that SLO is completely under your control. You keep taking into account the fact that the SLIs are not entirely under your control. So when you derive SLOs, you can have a standard operating procedure where you say, okay, if we go over a critical level of whatever, this is what needs to be done.
And that may be going and talking to the third party. Or if you are given a way to tweak that, you can tweak it. Or it can also just be informing them, saying, hey, we noticed this number of bad events and we didn't expect them to be there. So, an SLI is just an indicator. It indicates whether things are good or bad. It can be something which you are in control of, or it can be something which a third party is in control of. I'll give you an example with latencies. Let's say, as a home user, I'm trying to talk to AWS, and suddenly there is a latency spike. I can't do anything personally about that. But it's still better to know that that latency is the reason why my results are bad. So, an SLI is just an indicator, an observation. What you do with an observation is you can take a corrective action, or you can inform other people, or you can manage your customers as well. When a customer comes and complains, you can say, hey, this is the third party, we've informed them, take it up with them. Did that make sense? I'm more curious about when I'm setting my SLO and there's variance in the SLI that I have at hand. For example, an SLI of the number of transactions completed versus the number of transactions total makes sense, but the variance over there may be too high for me to easily set an SLO. Let's do one thing: I probably need to ask you more questions to come up with a more logical answer, so do you want to ask this question towards the end? Yep. Cool. Some of the answers you'll probably also find later on in this presentation. Right. Next, I wanted to drive home a quick differentiation between SLAs and SLOs.
So, an SLA is very similar to an SLO, but differs in a few key ways. SLAs are business decisions that are written into contracts with paying customers. An SLO is a target percentage we use to help us drive decision making, while an SLA is usually a promise to a customer that includes compensation of some sort if we don't hit our targets. SLOs should always be stricter than SLAs, so there's some amount of margin before you start paying your customers. The main differentiation over here is that SLAs are something which you will have with a third party or your paying customers, while an SLO is something which you'll have internally. An SLA is something which your lawyers will draft, and it'll be a written agreement, whereas SLOs are something which you're meant to keep tweaking. As and when situations change, your SLOs will change. Since lawyers are involved in SLAs, it's kind of hard to change those values. So, as engineers or people who are looking at metrics and things like that, I think we should mostly be concerned with SLOs rather than SLAs. A quick recap: a good SLI can be expressed meaningfully in a sentence that all stakeholders can understand. For example, 95% of requests to our service will be responded to with the correct data within 40 milliseconds. Something which can be very clearly defined in a single sentence. An SLO is a target for how often we can fail, or otherwise not operate properly, and still ensure that our users aren't meaningfully upset. So, yeah, some extra information about error budgets. There are two very different approaches we can take to calculating error budgets: event-based error budgets and time-based error budgets. The approach that is right for us will largely depend on the data available, how our systems work, and even personal preference.
The event-based approach tries to figure out how many bad events we might be able to incur during a defined error budget window without our user base becoming dissatisfied. The time-based approach focuses more on the concept of bad time intervals, also called bad minutes. I see people having difficulty grasping this second concept, so I thought I'd spend some time on it. Here also, let's take an example. Say we have a 30-day window and our SLO says our target reliability is 99.9%. This means we can have 0.1% failures or bad events over those 30 days before we exceed our error budget. However, we can also frame this as: we can have about 43 bad minutes every month and still meet our target, since 0.1% of 30 days is approximately 43 minutes. Either way, what we are saying is the same thing; we are just using different implementations of the same philosophy. So, what is the point of all of this? Having error budget left over means you can ship more features. If you are not burning through your error budget, that allows you to concentrate more on shipping features or doing something else, and if you exceed your error budget, you have to focus more on reliability. Basically, you can use your error budget to take decisions. Things to keep in mind: SLOs are just data. Don't treat them as gospel, or as something written in stone. SLOs are not a project; they're an evolving thing. Once you set your SLOs, you will probably want to iterate on them. Take a half-year period or whatever and iterate over your SLOs: see what went badly, what was good about them. Tweak them if you want, because everything is iterative, and iterating is the most important aspect. The world will change, so your SLOs will also have to change and evolve along with the world, because in the end, we are indirectly interacting with humans, and their preferences will change.
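The 30-day, 99.9% arithmetic above, and the ship-versus-stabilize decision it drives, can be sketched like this. It's a hedged sketch: the function and variable names are my own, not from any SLO tooling.

```python
# Time-based error budget: the allowed "bad minutes" in a window
# are simply (1 - target) times the window length in minutes.
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of bad time allowed before the budget is exhausted."""
    return (1 - slo_target) * window_days * 24 * 60

budget = error_budget_minutes(0.999)  # 0.1% of 30 days
print(round(budget, 1))               # 43.2

# Using the budget to take decisions, as described above:
bad_minutes_so_far = 30  # illustrative number from monitoring
if bad_minutes_so_far < budget:
    print("budget remaining: keep shipping features")
else:
    print("budget exhausted: focus on reliability")
```

The event-based approach is the same arithmetic with counts of bad events instead of bad minutes.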
Your own preferences will change, and the way you interact with your customers will change. So just keep that in mind: whatever you are monitoring and whatever you are measuring has to change because of exactly the same factors. Next, quality as a meaningful SLI. It can be easy for us to get caught up in our day-to-day lives and quarterly OKRs and various other business needs while forgetting what it is that our service actually needs to do. That can end up being a side effect of not making users our primary focus. If you make users your primary focus, invariably you will end up with happier users. Thinking in this paradigm might not be easy; initially, we will probably get some resistance from our engineering side. That is something which I wanted to point out. If we can develop meaningful SLIs, the only reason we have to wake up someone at 3 a.m. is when our SLIs are not performing correctly. With good SLIs, our engineers have a single thing to point at in order to determine what constitutes an emergency. Basically, if something is red, then it's an emergency and I'll go ahead and wake up other people in the middle of the night. If it's green, then no, it's not. And if it's somewhere in the yellow, then hopefully there is some documentation which the engineer can look at and figure out what to do, because everyone is going to be groggy at 3 o'clock in the middle of the night, right? Also, our engineering organization can now better align with our product, business, and QA organizations, because everyone is going to be on the same page, speaking the same language, which is invariably going to lead to a happier business. This is something which I've seen happen over a period of time.
So, if you bring everyone together and everyone is trying to speak the same language, then it just ends up being that the business is happier, because in the end, a business is also people. Yeah, there's a reason why I put all of these questions on the same slide. We need to be collecting at least six metrics so that they can be analyzed and examined by us when needed. However, not all of these metrics inform a meaningful SLI directly. Even if we have many things to measure, we can often get meaningful SLIs by measuring only a few of them. Let's start with the question at the bottom of the list: are the payloads of the responses actually what was requested? It turns out that if we can figure out a way to measure this, we are also measuring whether the responses are in the correct data format. From a user's perspective, we can't possibly be receiving the correct data if the data isn't formatted in the way we expect it to be. And if we know that the data is both the data we need and in the correct format, we can also be sure that the responses we are receiving are successes and not just errors. And if we are receiving responses at all, we also know that the service is both up and available. So we go from measuring one thing to measuring five things just by answering that one question. It might be the case that we have to calculate latencies via a second metric, but even then, we are still measuring six things by measuring only two of them. So it doesn't have to be a very complicated thing; that's what I wanted to share over here. Good SLOs generally have traits in common. If we are exceeding our SLO targets, our users are happy with the state of our service. SLOs have to be reviewed periodically: are we overachieving or underachieving? This is a question you have to keep asking. Now, the problems with being too reliable. For an example, let's say we have done our due diligence and we have chosen an SLO target percentage of 99.9%.
As long as we exceed this 99.9% target, no users are complaining, no one is moving elsewhere, and our business is growing and doing well. However, let's now imagine that we are routinely exceeding 99.9%, not just hitting it. Our SLO is published and discoverable by everyone, and we make our target more challenging. In this case, what have we given up? This is a question to the audience. If we change our SLO from 99.9% to 99.99%, what have we given up? Audience, if you want to answer that question, please raise your hand and we can actively participate in the conversation, or you can answer on YouTube as well. Folks on YouTube, please do answer this question. I know what comes on the next slide, and it's pretty surprising; this is a great question to try and answer on your end before you actually see the answer. I don't think anyone is biting, so we'll leave it there. By doing so, we have given ourselves fewer opportunities to fail, but also fewer opportunities to learn. If we are being too reliable, we are missing out on opportunities to experiment, to perform chaos engineering, to ship features quicker than we have before, or even to induce structured downtime to see how our dependencies react. In other words, we have given up a lot of ways to learn about our system. That's the trade-off in this case. There's another thing which I've seen people do: the problem with the number 9. When people talk about SLOs and SLAs, they often think about these in terms of 9s. The most common numbers we might run into are things like 99%, or 99.9%, or 99.99%, even though such targets are generally not attainable as often as people think. These targets are so common that people often refer to them as just two 9s, three 9s, four 9s and five 9s. But hitting these long strings of nines will be more difficult and expensive than people realize.
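To make the "long strings of nines" concrete, here's a small sketch (mine, not from the book) that prints how much downtime each common target allows in a 30-day month:

```python
# Allowed downtime per 30-day month for common "nines" targets.
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes

for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed = (1 - target) * MONTH_MINUTES
    nines = str(target).count("9")
    print(f"{nines} nines: about {allowed:.1f} bad minutes per month")
```

Two 9s allows around seven hours of bad minutes a month; four 9s leaves barely four minutes, which is why each extra 9 costs so much.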
Picking the right target for our service involves thinking about our users, our engineers and our resources. It shouldn't be an arbitrary choice, because it will actually be much more difficult than you think to hit these kinds of percentages if you're just trying to do it right out of the gate. What I propose instead: the difference between 99.9% and 99.99%, or something similar, is often much greater than people realize at first, so start with a comfortable number and go from there. Sometimes it helps to start with a number rather than a percentage; that's what I recommend, starting off with a number instead of a percentage. For choosing targets, past performance is a good guide. Now that we have established that we shouldn't try to make our targets too high, how should we calculate them? The best way to figure out how our service might operate in the future is by studying how it has operated in the past. It's a very good approach to just look at your past and figure out what SLOs you should be setting from there. Before I get to my next point, there's some basic statistics I wanted to dive into. Anyone here know what these are? I think two of them are easily definable, right? I'll let people guess what all five of these mean. I'll wait for five more seconds and let's see if anyone is able to respond. Okay, let's dive into the definition of each of these terms. Mean is just the technical term for average. I think we all know how to talk in terms of averages. It's much easier, at least for me, to talk in terms of the average; I can intuitively figure out what the average of something is, whereas when I have to figure out the mean, I feel like I have to go through the entire calculation. Median is the middle number.
To drive this home: the median of four, one and seven is four, because when the numbers are put in order as one, four and seven, the number four is in the middle, and the median is the thing which is right in the middle. I have a graph coming up which shows all of these concepts pretty clearly. Mode is the most frequently occurring number. If you take all of the numbers in the example, two is the mode over here because two is the number which occurs most frequently. A bit of practical advice: when you're dealing with statistics, it's generally a good idea to arrange your values from the minimum to the maximum. Also, this is something which I didn't cover previously: over here, out of those five values, the minimum would be 0.7 and the maximum would be 21 in this data range. I made a slide to point that out; a friend of mine helped me make it. This is a graph of 100 data points over, I think, 10 minutes. The mean is the average, found by adding all the data points and dividing by the number of data points. In this data, we know that the minimum is 10 and the maximum is 99, and the average over the 10 minutes of data we analyzed was 53.19. This can help us pick better thresholds: if you look at your mean, you can probably figure out where your threshold should be in a better way. The median is found by ordering all the data points and picking the number in the middle. As you see over here, if there are two numbers in the middle, then what you do is take the average of them. I think this graph should be labeled better, but one of those two red lines is at 53 and the other is at 56, so we take the average of 53 and 56, and that's how we end up with 54.5.
So that's how you end up with the median: if your entire data set has an even number of values, then you'll have two values in the middle and you'll have to take the average of both of them. Mode is the most frequently occurring number, the one that occurs the highest number of times. When we have two different values occurring at equal frequencies, the data is called multimodal, so this example would be counted as multimodal. When no value occurs more frequently than the others, the data has no mode. Another important basic statistical concept is the range, which is simply the difference between the minimum and the maximum. The range gives us a great starting point for thinking about how varied our data is, how spread out it is. Percentiles are also a concept which, honestly, I struggled with a lot. I'll take a second. Just give me a second, there's someone at the door and they've been ringing insistently. Pardon, folks. I think we have a work-from-home interruption, as always happens in a lot of our workplaces these days. Anyone want to raise their hand and ask any questions in the meantime? Okay, Gaurav is back already. Yeah, as I was saying, percentiles are also a concept which I struggled with. When developing SLOs, percentiles serve a few important purposes. They give us more meaningful ways to isolate our outliers than the simple concepts of mean and median can. We take the same data set and analyze the P90, P95 and P99; by P90 I mean the 90th percentile, and then the 95th and the 99th percentiles. We even have a graph coming up next. I personally learn well visually, and this was a concept which I had to spend some time thinking about, so if you are struggling over here, hopefully this graph can help you. When we talk about, let's say, the 95th percentile, this is what it looks like.
When you arrange your data in ascending order, above the 95th percentile you are talking about only the top 5%; with 100 data points, that would be the top 5 data points. It's important to concentrate over here, because if, say, you have an SLI monitoring latencies and you concentrate on the top 5% or top 10%, which is the 95th or 90th percentile, that tail is most often the largest contributor to your latencies. Solving those kinds of problems can help you gain a lot, basically. Is this concept of a percentile clear? I'll wait for five seconds; if anyone has questions, you can go ahead and ask. Next: how to use error budgets. We have come up with this well-established definition of SLIs, SLOs and error budgets, but what's the point of all of this? You can use error budgets to decide whether you want to release new features or not. Beyond just deciding whether new features should be shipped, we can also use error budgets to drive the entire focus of a project. If we blow our error budget, it might mean that we can't release any new features and the entire team has to concentrate on stability instead of building new things. Basically, you can't release new features, and there's something objective which you can point to: you can say that until my error budget fills back up, I will not release any new features. Measuring error budgets over time can also create insight into the risk factors that impact our services, for example the frequency of incidents and also their severity. By knowing what kinds of events and failures are bad enough to burn our error budget, even if only momentarily, we can better discover what factors are causing the most problems over time. What do I mean by this? Let me explain with an example. By analyzing our error budget over a year, we might determine that approximately one in every five releases causes some amount of trouble.
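All the statistics walked through in the last few minutes, mean, median, mode, range, and the P90/P95/P99 percentiles, can be computed with Python's standard `statistics` module. A minimal sketch over made-up latency samples (the data is illustrative, not the numbers from the talk's graph):

```python
import statistics

# Made-up latency samples in milliseconds.
latencies = [12, 18, 18, 25, 31, 40, 47, 55, 63, 120]

print(statistics.mean(latencies))       # 42.9  (sum / count)
print(statistics.median(latencies))     # 35.5  (average of the two middle values)
print(statistics.mode(latencies))       # 18    (most frequent value)
print(max(latencies) - min(latencies))  # 108   (range: max - min)

# Percentiles: quantiles(n=100) returns the 99 cut points P1..P99.
q = statistics.quantiles(latencies, n=100, method="inclusive")
p90, p95, p99 = q[89], q[94], q[98]
print(p90, p95, p99)  # the tail that usually dominates latency SLIs
```

Note the single outlier at 120 ms barely moves the median but drags the high percentiles up, which is exactly why percentiles isolate outliers better than the mean or median can.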
A finding like that could point to the fact that we need a smoother release process or better testing. We can also use leftover error budget as a way of discovering failure modes we may not have been familiar with before: any kind of experimentation which would bring down the production infrastructure for a known period of time. The key point over here is "for a known period of time". Error injection is another great option in this space. Load testing and stress testing are techniques to learn more about the reliability and robustness of our system, but they are also, by design, procedures that can break our services. A black hole exercise is where we turn off an entire location for either a single service, or a single environment, or all of our services. This process, of course, works better if we have confidence in our ability to roll back the changes quickly. These are some examples of how to use an error budget if you have some left over. Now, let's say everyone is sold on this idea. You think it's a very good thing and you actually want to implement this in your organization. How would you get buy-in? For engineering: SLOs increase both reliability and feature velocity over time. These are some points you can take back to your engineering team or your engineering lead or manager, to help convince those people to try out this experiment in your company. SLOs give engineering license to take more risks and to be subject to fewer launch freezes, and engineers want to ship more features, right? The fewer the constraints under which engineers can ship newer features, the happier they will be, or should be. And the principled exercise of error budget policies makes people better software engineers, because it creates a real and rapid feedback loop. For your users and your product managers: reliability is a first-class feature of the product. In fact, it's the most important feature.
SLOs will eventually increase the feature velocity of the product because they remove much of the friction from the product release cycle; this ties in with the engineering point as well. Modern PRDs, or product requirement documents, include user journeys. Now, I'm not 100% sure about PRDs myself; this is a point I took from the book, because I'm not a product manager and I don't understand product requirement documents very well. But if anyone over here understands them better, you can enlighten all of us. SLIs are excellent ways to measure and ensure that the user journeys in the PRDs are being monitored and reacted to. So these are some points which you can take back to your product managers to get some buy-in from them and get them on your side as well. Operations: SLOs and error budget policies put operations on an even footing with the engineers. You can use SLOs and error budgets to actually bargain over when to ship new features versus when to concentrate on stability, so it gives some amount of objectivity for operations people as well. SLOs and error budgets also allow operators to remove some of the deployment friction which has been accumulating over the years. Leadership: the biggest objective senior leadership often state is that they want 100% customer satisfaction and 100% uptime, and I've presented some of the reasons why chasing 100% uptime or 100% satisfaction might actually be a bad thing. In fact, there's nothing human-made which has 100% reliability over any reasonable period of time. Also, by this point we already have buy-in from all the other teams, which helps to convince the leadership in a better way. This is also a very, very key point which I've seen. So when is the first test of this entire process?
When we adopt SLOs, the most important moment is not when everyone agrees or when you decide your first targets and policies. The most important moment is the first time you exhaust your budget and need to enforce your policy, and you see whether everyone is actually on board or not. Yeah, this is basically all I have. Awesome, thank you for the talk, Gaurav. I think this sort of answers the perennial question that we have, right? The understanding of what is DevOps and what is SRE. This is a question I get asked a lot, and I think Gaurav also probably would have encountered such questions out in the wild: we have started calling everything DevOps and everything SRE, and there is a weird intermixing of the words. But I think this talk clarifies what SRE is. DevOps creates a pipeline to have sensors monitor your applications and get data from them. SRE is what you do with that data. SRE is about ensuring the inferences derived from the data get pushed back in a feedback loop into the whole system. Whether it's an engineering system or just the DevOps pipelines, it doesn't matter; all the inferences from an SRE practice have to get injected back into the org. So I think this sort of clearly defines that boundary between what we consider SRE out in the wild versus what SRE actually is. So yeah, I'll take up a couple of the questions that we have. We have two questions, I think, on Zoom. There is one question pending a bit of clarification; I would kindly ask the person to ask again. So Anirudh has two questions, okay. How do you design an SLI that is both user centric and realistic for development? I think this again speaks to that whole trade-off thing, what to trade. Gaurav, do you want to answer this, maybe? What I would say in this case is that this is something which operations and engineers both have to sit down on and come up with something which works, right?
So this is not something which I can point to and say, this is the answer to your question, because the answer will be different for different teams, and it will also depend on what you are trying to monitor, right? Yeah. So the biggest point I can make here is to actually be as inclusive as possible. If possible, try to get support, and even business and sales people, involved when you're defining this kind of an SLI, right? Pick something which everyone is comfortable with and move on from there. Of course, this is something which you'll have to keep looking at. I'll give you an example of a user-centric SLI which is slightly offbeat, right? If your application or your website has accessibility options, you can actually track an SLI there on the drop-off rate: of the people who are trying to use your accessibility features, at what point are they facing an issue, and at what point, on which page, are they dropping off, right? Which particular workflow are they dropping off from? And then you can set an indicator and a budget on that, right? An SLO on that. That, okay, this particular app should have this much of an accessibility success percentage, and we'll put a barrier there. If it falls below that, we need to, you know, invest from our UX side in accessibility, right? Or even on the UI side, in both cases, right? So that's the sort of user-centric SLO that I personally can think of, if you want to design it that way. As for being realistic for development, again, it depends on that feature. If your current front-end team is capable of and invested in building accessibility features, right, then you can make that happen. If you do not have that bandwidth, maybe you need to reschedule it for the right point of time in your lifecycle, right? But again, you know that it's a trade-off, right?
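The accessibility drop-off SLI described above could be computed roughly like this. This is a hypothetical sketch: the event field names (`used_accessibility`, `outcome`) and the 95% threshold are assumptions for illustration, not anything from the talk:

```python
# Hypothetical sketch of the accessibility drop-off SLI discussed above.
# Event schema and the SLO threshold are made-up assumptions.

from collections import Counter

def accessibility_sli(events: list) -> float:
    """Fraction of accessibility-feature sessions that completed the workflow."""
    counts = Counter(e["outcome"] for e in events
                     if e.get("used_accessibility"))
    total = counts["completed"] + counts["dropped_off"]
    return counts["completed"] / total if total else 1.0

events = [
    {"used_accessibility": True, "outcome": "completed"},
    {"used_accessibility": True, "outcome": "dropped_off"},
    {"used_accessibility": True, "outcome": "completed"},
    {"used_accessibility": False, "outcome": "completed"},  # ignored by this SLI
]

sli = accessibility_sli(events)
SLO_TARGET = 0.95  # illustrative "barrier" from the discussion above
if sli < SLO_TARGET:
    print("Below SLO: invest in UX/UI accessibility work")
```

The point is just that the "barrier" becomes a number the UX, UI, and engineering teams can all agree on and react to.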
If you do not get it solved, people who need accessibility features are going to drop off from using your site, and that, in general, gives the org a bad rep. But that's where you make the trade, right? What sort of user base are you going for? That, I think, will clarify your decision. Gaurav, was that a good analogy, maybe, from a user-centricity point of view? Joy, you're answering questions like a pro, right? Yeah, basically, this was the entire point of this presentation, right? I want other people to answer these, basically come up with answers and see what they think. Because I've seen so many people answer these kinds of questions in different ways. I think there is no one right answer. This is basically just meant to get the conversation started, right? One person answers the other person's question. So yeah, what Joy says is right: there is no direct way to figure all of this out, right? This is something which you will have to look at and iterate on over time, in the sense that you adjust when you see too many users dropping off, or too many errors, or something like that. So again, this is a slider which you have to tweak; it's not something which you'll set and forget. As you correctly said in the talk, it's a constantly morphing sort of metric, right? Each and every SLI and SLO is constantly morphing and aggregating upwards, right? So, Anirudh, I hope, you know, Gaurav was able to answer that question, and maybe the analogy I made helped. Again, he has a second question as a tag-along. When thinking about spending the error budget to develop a new feature or a set of new features, is it better to think about spending in smaller intervals? For example, only allow spending up to the monthly budget instead of the annual budget. Oh, so basically: do you sandbox error budgets within a time frame or not? I would answer this question by asking you: how comfortable are you?
How comfortable are you with spending your annual error budget in a single period of time? If you're not comfortable, then there you have your answer. If you are comfortable enough spending that error budget in a single instance, then, again, you have your answer. Cool, I think that's a legit answer. See, again, all of these practices are, I think, very subjective. It depends on your current bandwidth, your team's maturity, your ability to, you know, pay off debt and error budgets. So if you have the sort of team that is only able to spend, let's say, monthly error budgets and refresh them the next interval, then go for it. Maybe in a couple of iterations you find out, you know, it doesn't work. Then you reset it back to a larger interval. Maybe you go from monthly to six-monthly blocks, right? And if even that doesn't work, go to, you know, annual intervals, or scale it back down the other way. Maybe start with an annual interval and a very large error budget, with a target of only, you know, 99.9% uptime. And then slowly, as your team matures and your code matures, and your stability and reliability increase, you just start making it harder and harder. It's like a game, right? You're raising your difficulty level as you go, right? But you do not start out at the, you know, hardest difficulty level unless and until you have the best possible gamers, right? That's, I think, the analogy that comes to my mind: you start off at a medium or low difficulty level and then you increase the difficulty, right? So, yeah. Another thought which I had, and this is a question I've seen other people also ask: start small, both in terms of spending your error budget, and in terms of your implementation as well, right? Get everyone to agree on one single SLO, right? Then, if people are comfortable enough, you can set multiple SLOs, but don't go overboard either.
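The monthly-versus-annual sandboxing question above comes down to simple arithmetic on how much downtime a given window allows. A minimal sketch, assuming a time-based availability SLO; the 99.9% target is just the number mentioned in the discussion, and the window sizes are illustrative:

```python
# Hypothetical sketch of "sandboxing" an error budget into smaller windows.
# Assumes a time-based availability SLO; numbers are illustrative.

def downtime_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for a given SLO over a given window."""
    return (1.0 - slo_target) * window_days * 24 * 60

annual = downtime_budget_minutes(0.999, 365)   # one big annual pool
monthly = downtime_budget_minutes(0.999, 30)   # refreshed every month

print(f"Annual budget : {annual:.1f} minutes of downtime")
print(f"Monthly budget: {monthly:.1f} minutes of downtime")
# A monthly window caps any single month's spend, so one bad deploy
# cannot consume the whole year's budget at once.
```

Which window you pick is exactly the comfort question above: a short window limits how much budget one incident can burn, a long window gives more freedom to spend it in one go.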
So again, the reason I have to be vague over here is that the right amount will differ depending on the kind of team. The only advice I can give you is: start small, and as your confidence matures and as the trust between teams matures, take it to whatever heights you want.