Hey, good evening everyone. This is Tazlik here, and I hope you're having a nice day. Today we'll be discussing doing SRE the right way. This is a series in which Piyush and some other folks will be talking about their experiences keeping the lights on over the last few years — experience gained doing production support and whatnot, enabling their companies to run with no outages, or minimal outages, so to say.

Without much ado, a short introduction to Rootconf. I'm sure some of you have been attending Rootconf for the last couple of years; it's where a lot of practitioners gather to discuss their ideas and implementations — how they've been achieving reliability and solving real-world problems inside their organizations. With that in place, a short introduction about myself: I'm your co-host, Tazlik. I work at Gojek at the moment, in the platform team. Before this I was at Rezape as the fourth engineer in the infrastructure team. All right then, Piyush, I'll hand the stage over to you.

Hi, this is the first of a three-part series that I have about doing SRE the right way. — Well, things just happened, and we just lost Piyush. Hey folks, sorry for the inconvenience; I think we're facing a connectivity issue on the host's side. We'll be right back, but in the meantime, can we have a quick show of hands on the chat screen? Just say hi, or give a quick introduction, so we get a sense of the demographics of the audience. If you work in a platform engineering, DevOps, infrastructure, or site reliability engineering team, we'd love to hear what you folks are up to.

Hey Shiva, how are you doing today? Can you hear us, Shiva? "Yeah, I can hear you. Hi, how are you?" Good, thank you. So how did you come to know about Rootconf? "I was just browsing LinkedIn and Facebook, came across it, and registered — the topics looked very interesting. I attend a couple of conferences like this, and this one looked interesting too, so I registered to understand more about the SRE side of things. Currently I'm working as a DevOps lead, taking care of the entire project delivery, along with platform engineering." That's awesome — really interesting. And without much ado, I think Piyush is back. Shiva, we'd love to hear more from you, and I hope we'll all learn a few things. Giving the stage to Piyush; I'll be muting you now. "Yeah, sure. Thank you."

You know, this is the thing: the vendor tells you the systems are reliable; the network never is. Every network will fail, always. Anyway — there's a frequent set of questions I keep getting asked: What is SRE? How is it different from DevOps? How much does reliability cost? How long before I am reliable? Is it too early to automate? What does an SRE do? Is reliability achieved before or after the release? Is it for you, or is it not? You can ask your questions in the comments as well, but these are the general questions I keep getting asked.
Before I go about answering them, I'd like to take a stab at describing the failure cycle of a physical product. A physical product has a burn-in period. Every new car you buy is not really brand new: it has gone through a run-in cycle of a certain thousand kilometers, or been run on rollers, before it becomes usable. It becomes usable, it stays usable, and then it starts wearing out, and the failure rate increases. This is why an old refrigerator, an old car, an old air conditioner, or an old television will always start giving issues. This is called the bathtub curve, and this is how a typical physical product behaves.

Software products, on the other hand, have a very different failure rate cycle. First, when you start, there are obviously a lot of bugs; you keep testing, you keep debugging, and you bring the rate down until the software becomes usable. If you make an observation here, the graph never touches zero failures, because it's almost impossible to eliminate every bug from a software — you don't even know what all the bugs are. So you bring it to a point where it's acceptable and usable for the known cases, and then you release an upgrade. The moment you release an upgrade, there are new failures, because the code is going to meet newer runtime conditions. This may look like a surprise, but how often have you been using an iPhone, a new update lands, and you realize: hey, something is broken, something's not okay? They reduce the bugs, then there's a newer upgrade with more bugs, they reduce those, and the cycle goes on and on — until, over a period of time, the software becomes obsolete.

One fair conclusion we can draw is that obsolete software is almost the most stable software, because there's literally nothing being added to it any further. Also, the first lambda — the first failure-rate peak on that curve — is slightly lower than the last one. What that means is that over a period of time, a software actually accumulates more bugs than it began with. Another conclusion to draw: there is a constant tussle between release velocity and the stability of the system. Many would disagree, but take a moment to understand this. As you release more frequently, you obviously have less time to run that code through production scenarios — multi-AZ scenarios, bad-internet scenarios, the real-world stress when rubber meets the road. So in a way, by saying "I will release more," you are, in hindsight, compromising a bit of stability. It's a seesaw: when you see releases going out too often and actually breaking things, you start favoring stability. This is a constant battle that needs to be handled.

Historically, what have the approaches been for handling reliability in software? The first — my favorite — is ignorance. Don't care about it. I wish I could do that more. The second is throwing more engineers at the problem: bring all hands on deck.
More and more people get pulled in to fix things, which is why a standard on-call incident — I have seen this and been a part of it — starts with one person, and very soon, in a matter of 10 minutes, there are 20 people on the call. If the issue still persists, there will be 30 people on the call, and suddenly everybody on the floor is on the call. But these are ad hoc mechanisms.

What are the more computer-science-grounded principles? The first is what we shamelessly borrow from hardware fault tolerance, and it doesn't work very well for software. It's the same principle we install when we advocate things like Hystrix and so on: if it doesn't work once, retry with exponential backoff, recovery blocks, retry, retry. While this is good for transient issues, any defects baked into the software's DNA are not going to be fixed by it.

An attempt to solve that was made in 1977, with something called N-version programming. The idea was: you take one common problem statement, give it to two different engineering teams, and they produce two different implementations of it all. Each will have unique bugs; you merge those code bases in a way that takes the best of both, and you have almost eliminated the bugs. While this worked on paper, it was later shown — the speaker credits Michael Lyu — that it actually doesn't work in practice: give the same problem statement to different teams, and they produce almost identical code. If the input doesn't change, almost all engineers make the same mistakes.

It's kind of funny, because that follow-up was done back in 1985, and that is when a new research paper was written about self-checking software. In it, the idea is that a software needs to be fed with more and more data, and more and more control instrumentation, so it can take decisions at runtime — simple mechanisms like a running system that can be sent a SIGHUP: you send it an interrupt and it starts behaving differently. The key element is that to achieve this, you require a decent — rather, an insane — amount of observability: a comprehensive set of metrics at every single step of the software, and control, obviously — policies that actually hold up in production. We have static checks for our code bases, but no such thing as dynamic runtime checks: while I'm writing my software, I really do not know what the real-world conditions are going to be, and nothing tells me, as I write my code, that this is going to break.

Interestingly, that's the state of software reliability. But if I were to ask the equivalent questions of my product people — what features work, what features don't work out there — that aspect of software development has advanced pretty well. And the leaf we want to borrow from their book is this: software operations should be treated like a product. What do we mean by that? On the product side there's a product manager, there's customer feedback, they're constantly talking to customers. Install one of those Ghostery-type plugins and you'll realize that every single web page out there is tracking events on every single click of yours.
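Going back to the retry-with-exponential-backoff principle from a moment ago — here is a minimal sketch of what that usually looks like. This is illustrative only: `flaky_call` is a stand-in for any transient-failure-prone operation, and the parameters are arbitrary.

```python
import random
import time

def retry_with_backoff(op, max_attempts=5, base_delay=0.5, cap=30.0):
    """Retry a flaky operation with exponential backoff and jitter.

    Helps with transient faults (network blips, brief overload);
    it does nothing for defects baked into the software itself.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the failure
            # exponential backoff: 0.5s, 1s, 2s, ... capped, with full jitter
            delay = min(cap, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))

# Usage: retry_with_backoff(lambda: flaky_call()), where flaky_call is a
# hypothetical function that sometimes raises on transient errors.
```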
Coming back to the product analogy: like a product person, I can almost tell you which feature was not utilized, which feature gets used most on a Sunday, which gender uses which feature — there's so much tracking happening. With conversion funnels, people know where a person dropped off, at what point a user decided to abandon a cart, at what point they decided to buy — and they can influence those decisions as well. Operations, on the other hand, is still run ad hoc, one failure at a time. And this is vastly different, because when you develop a product and launch a new feature, you do a sprint plan: things are thought over weeks and built over days. Fast forward to runtime system operations: things are thought through within minutes and operations are applied within seconds. These things are bound to fail, because you simply do not have time to think — and when you don't have time, there are going to be mistakes; you are going to introduce new failures.

What it requires is data-driven decisions: prioritizing your failures. If I asked a product manager, "why should I change my sign-up button's location?" or "should I implement a cart checkout?", he or she would have an answer. They would know which one has higher priority or more business value. We don't have the same kind of answers for our losses and failures. How do we prioritize them? There's a disk filling up right now — 90% full at 4 a.m. Should I even wake up, or not? Every bug cannot be high priority. How do I put a loss figure against it? Why should something be up 24/7 — how much business do we lose if it isn't? The lack of these decisions makes operations chaotic and ad hoc, which is why there are memes and jokes that developers hate doing the operations job, because it's stressful. It is.

Let me take a small detour and show you something. This was a poll I ran on Twitter: those of you who scale GKE and AWS, how often do you hit a resource limit before you can scale out? — Am I still on the call? Can somebody give me an affirmation? My internet is weak. "We can hear you." Perfect, good roll call — at least everybody is awake. So: how often do you hit a resource limit before you can scale out? Interestingly, 36% said they don't hit any such issue; around 23% — let's round it to 25% — said "we have a person for that"; 15% said "we have a tool for that"; and 25% said "very often."

When you view this through an SRE's lens, what do you see? You see that the roughly 36% who "don't hit it" simply have not scaled enough yet — because the day they scale out, they are going to hit this limit. Think about it: I've got so much traffic coming in, I need to launch one more server, but I cannot, because I've hit a resource limit, and I need to raise a ticket with AWS to let me spawn one more c5.xlarge. These people just haven't hit that yet. That's what an SRE sees. Taking the first and last options together, 50% have to wait before they can scale out. They are not really automated — because if they were, they would not be hitting these issues, and they would not need a person.
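That "person for that" can usually be replaced with a scheduled check. Here's a hedged sketch of what such a check might look like on AWS, using boto3 and the Service Quotas API. The quota code below (L-1216C47A) and the 80% threshold are assumptions for illustration, and the usage count is simplified.

```python
import boto3

# Assumed quota: running on-demand standard instances (code L-1216C47A).
QUOTA_CODE = "L-1216C47A"
ALERT_THRESHOLD = 0.8  # warn when 80% of the quota is consumed

def check_instance_headroom(region="us-east-1"):
    quotas = boto3.client("service-quotas", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)

    limit = quotas.get_service_quota(
        ServiceCode="ec2", QuotaCode=QUOTA_CODE
    )["Quota"]["Value"]

    # Count currently running instances (simplified: ignores vCPU weighting).
    paginator = ec2.get_paginator("describe_instances")
    running = sum(
        len(r["Instances"])
        for page in paginator.paginate(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        for r in page["Reservations"]
    )

    if running >= ALERT_THRESHOLD * limit:
        # In real life this would page someone, or file the limit-increase
        # request automatically, well before the scale-out event.
        print(f"WARNING: {running}/{limit:.0f} of the instance quota used")

check_instance_headroom()
```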
When they say "I have a person for this," clearly that person is not checking by the second or the minute — "have I hit my resource limit yet? have I hit it now?" What that means is that the next time they hit this issue, they are waiting for a person to solve it. That's not really automated. So, another way to read it: 25% of people are paying a person just to do this, and roughly 36% are going to hit this problem at some point. The point here is that you need to extract data out of every situation, because we are trying to look into the future. As SREs, you always try to look into the future; that's the whole essence of SRE.

I want to take you back to an exercise we did at one of my previous gigs. We had, I would say, around 90% uptime, and one of our first tasks — because we were losing business over it — was to bring it up. So this friend of mine and I were working together, and one day we decided: look, we're going to gather data on every single issue that has been raised, every single error and exception in the system. A daunting task, but we decided to do it anyway, because we needed to measure what had to improve to bring that number from 90 to 99 — or at least to 95 as the first target.

We started collecting data at every single node of the process: one, actual errors being tracked in the logs; two, issues that actually became a Slack alert; then how many of those became a page, a PagerDuty alert; how many of those resulted in a Zendesk ticket, or could be correlated to one; and how many issues were actually closed. Our first ambition was identifying the leakage in this funnel. If you notice, we were building the same conversion funnel that a product team builds. What we were trying to identify was: how many errors leave an exception trace, become a Slack alert, then a PagerDuty alert, then tie themselves to a customer's Zendesk ticket, and then get resolved. If there's a drop anywhere — an error that did not go all the way through — we know that issue is being dropped, and it's going to happen again.

Next, we put a business value on it. We asked the business and product folks how much loss each of these issues actually resulted in. Then we prioritized — the standard 80/20: with 20% of the effort, which 80% of the things can we address first for the maximum gain — and we kept iterating the cycle. Then we handed over a prioritized plan to product and ops. We were an operations team — we called ourselves an operations team — but this was the drill we went through every single fortnight.

We started producing some data. The screen might be slightly too bright and white, but let me walk you through it. We started capturing the mean time to respond for a given issue, and the mean time to close an issue as well.
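Computing those two numbers is simple arithmetic over issue timestamps. A minimal sketch, assuming each issue carries its raised / picked-up / closed times (the field names here are made up for illustration):

```python
from datetime import datetime
from statistics import mean

# Hypothetical issue records: raised -> picked up -> closed.
issues = [
    {"raised": "2020-06-01T04:00:00", "picked": "2020-06-01T12:20:00",
     "closed": "2020-06-01T12:50:00"},
    {"raised": "2020-06-02T09:00:00", "picked": "2020-06-02T17:05:00",
     "closed": "2020-06-02T17:40:00"},
]

def ts(s):
    return datetime.fromisoformat(s)

# Mean time to respond: how long an issue waits before anyone picks it up.
mttr_seconds = mean((ts(i["picked"]) - ts(i["raised"])).total_seconds()
                    for i in issues)
# Mean time to close: how long it takes once someone is actually on it.
mttc_seconds = mean((ts(i["closed"]) - ts(i["picked"])).total_seconds()
                    for i in issues)

print(f"MTTR ~ {mttr_seconds / 3600:.1f} h, MTTC ~ {mttc_seconds / 60:.0f} min")
```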
We also tracked the cloud spend per project, so we could keep a value on it: we knew that by expanding the servers for this, we'd spend this much, and because we already had the data on the losses, we could take a very good financial stab at what it costs to actually solve these bugs. So, fair to say, the question of raising the 90 bar to 95 could have a numerical dollar value put on it. We also captured deployment velocity — how many times a release fails before it goes out cleanly. All of this data was put into graphs, just to make it easy to consume.

An interesting thing we observed in the MTTR-to-MTTC numbers: it took 30,000 seconds for an engineer to actually pick up an issue — a little over 8 hours. And once a ticket was picked up, the time to close it was only 1,800 seconds. This is a very interesting signal, because while running a team I wanted to know where to put more resources, and this told me that once an issue was picked up, it took only half an hour to solve. That meant there was no lack of motivation to solve problems — if there were, the close time would have been large too. There was no lack of tooling — we had built enough tooling by then. It was purely the amount of workload. This is an example of what else you need to capture: the team's burn. And this was a good symptom of it — the backlog had piled up so much that it took around 8 hours to pick up a task that then took only half an hour to close. So we knew where we had to make the next set of investments.

"Sorry to interrupt — we have a quick question if you want to take it right now. Someone is asking how you produced these stats."

Yes — I do cover later how we produce these stats, but let me give a quick preemptive answer now. The first thing we do is plumb together every single source of metrics or information we have into one common system. For this dashboard it's fairly straightforward. We look at our Elasticsearch and count how many errors there are — we take that number. Second, we count how many Slack alerts there were. PagerDuty has an API; we pull that number from there. Zendesk has an API as well. These are simple, straightforward API pulls; a simple cohort analysis of which issues showed up at which stage gives us the funnel of what is being lost. The cloud spend we took from the cloud where it's produced and retroactively attributed to the resources involved; we ran that as a scheduled job. The deployment failures were captured from GitHub and wherever the releases came from. I think that should quickly answer how these dashboards were produced.

"Awesome, I think that answers it, Piyush — thank you for taking the question in the middle of the session. And if you could bring the mic a bit closer; a few attendees are noting the voice is low."
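As a sketch of that plumbing — heavily hedged: pagination, full auth setup, and time-range handling are elided, and while the endpoints are the public ones (Elasticsearch `_count`, PagerDuty `/incidents`, Zendesk search), treat the query details and credentials as illustrative:

```python
import requests

# Illustrative config; real hosts, tokens and date filters omitted.
ES = "http://elasticsearch:9200/logs-*/_count"
PD = "https://api.pagerduty.com/incidents"
ZD = "https://example.zendesk.com/api/v2/search.json"

errors = requests.get(
    ES, json={"query": {"match": {"level": "ERROR"}}}).json()["count"]
pages = len(requests.get(
    PD, headers={"Authorization": "Token token=PD_API_KEY"},
    params={"since": "2020-06-01"}).json()["incidents"])
tickets = requests.get(
    ZD, auth=("agent@example.com/token", "ZD_API_KEY"),
    params={"query": "type:ticket created>2020-06-01"}).json()["count"]

# The funnel: every stage an error fails to reach is a leak that will recur.
for stage, n in [("errors logged", errors), ("paged", pages),
                 ("customer tickets", tickets)]:
    print(f"{stage:>16}: {n} ({100 * n / max(errors, 1):.1f}% of errors)")
```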
Now, if I say all of this is site reliability engineering, the question is: where do we start? The first question we asked ourselves was: what exactly is reliability? We say "this is less reliable, that is more reliable," but that's just a qualitative statement. We need to put a number to it; we need to measure it. So what exactly are we measuring? If I tell you, "hey, you've got to do better in your exams next time," how do I measure that you're doing better? If I say, "become a good person tomorrow," how do you measure that you're a good person? Anything that is an adjective needs to be measured. When we asked ourselves "how do we improve this?", the real question was: which number should actually change? If I have to go to my management and justify that this money was worthwhile — if we have to justify our jobs — what is it that we're improving?

Here's an example. Traditionally we say reliability is mostly about uptime. Uptime means: while your service is up, I'm going to get a response from it. Simple, straightforward. (In my next talk I'll cover why this simple term, uptime, is actually quite tricky — it's one of the simplest metrics, but one of the toughest to get right. For the sake of this talk, let's say there's an uptime.)

Now, here's a live issue in a service — I won't name which project it's from. It returns a 200 even when the response body has an error in it. This issue has been open for a fairly long time and keeps getting repeated; this instance was opened on April 23rd, but it has been coming up in different shapes since, I think, 2014 or 2015. So uptime doesn't mean the service is reliable: the service is clearly returning errors, yet the health check reports success. These errors are pretty tricky to handle, because your standard monitoring tool will not work — you now need to parse the response body as well.

On the right — you may not be able to see it very clearly — is a mobile phone. I'm trying to place an order on Swiggy. They're a great service; I'm not calling them out, just giving an example — everybody faces these errors. Here, the app is working, I'm able to place an order, but I'm not able to get through the payment step of checkout. Quite clearly the standard retry principle has been applied — which is why you see nested boxes, a retry block inside a retry block inside a retry block — but the payment is not going through. So despite being up, is it reliable? Clearly not. It's not just about uptime; there is more to it. Reliability has to have more factors.
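Catching that "200 with an error in the body" class of failure takes a deep health check that parses the response rather than trusting the status code. A minimal hedged sketch — the URL and the body's field names are made up:

```python
import requests

def deep_health_check(url="https://api.example.com/health"):
    """Shallow checks trust the status code; a deep check reads the body.

    Returns True only if the service answered 200 AND the payload itself
    reports no error (field names here are illustrative).
    """
    try:
        resp = requests.get(url, timeout=2)
    except requests.RequestException:
        return False
    if resp.status_code != 200:
        return False
    body = resp.json()
    # The trap from the slide: HTTP 200 with {"error": "..."} in the body.
    return not body.get("error") and body.get("status") == "ok"

print("healthy" if deep_health_check() else "unhealthy")
```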
Now, if I ask a broader question: am I saying it should be up? Or that it should be up and not serve errors? Or that it should not serve errors and also serve correct data? And even if it serves correct data, within what time should it serve — no time at all, 0 seconds, 10 milliseconds? (Should I switch off my video? "Yes, Piyush, let's try switching off your video so the bandwidth issue doesn't recur." Okay, is this any better? "I think so, yes — let's see, do we have a nod from people that they can hear this?" Should I move forward? "I think it's better, yes.") Am I saying it has to be served within X time? Or that it has to serve unlimited transactions per second? Or that all of these requests have to be served concurrently? What exactly am I trying to achieve when I say "this is reliability"?

Faced with so many questions, we started to wonder: what is the real difference between a robust system and a reliable system? Until then, we did not have this clarity. That's when we realized that a robust system is one that holds itself together when you inject erroneous input; a reliable system is one that has to survive all of these runtime behaviors — including ones the system has not even been coded for.

Which ones do we take, and how? Is the answer yes to everything? Before making that decision, understand that reliability is not a buffet lunch. It's à la carte: you pay for every single line item you order, and there are taxes on top. Those taxes come as more infrastructure, more testing infrastructure, more time, a bigger team, management overhead, HR cost, laptop cost, internet cost — everything you can think of. All of that is what you pay when you ask for more and more.

One more thing worth noting: there is never a bug-free system. You cannot have 100% reliable software. It's easy to say, and the reason is this: to claim something works, the proof has to be established — every axiom has a proof. Proving something is 100% complete is impossible, because nobody knows all the error conditions that can happen in this world. Neither I nor anybody else can claim "I know all the errors that can happen, and this system works for all of them." Hence it's almost impossible to say there can be a 100% reliable system.

So how do we get there? To start answering "what reliability do I need," there is a concept called service level indicators. The first step is to monitor the indicators. A service level indicator is a metric that calls out: this is the number of requests I'm serving right now, this is the current latency, and so on — something that gives a quantitative measure of how the service is performing. The next step: for each of those service level indicators, we set a service level objective.

Service level objectives are like the gears of your car. Coming back to the point we've been making: you have to pick between release velocity and the reliability of a system. Every single time we deploy, we expose our system to more and more horrendous conditions, more situations where it may fail — and the service level objectives we set allow us to change the gears of our effort. (When I say "our," I'm speaking on behalf of the operations team here; I'm deliberately not using the words "site reliability engineering team" yet — I'm calling this entire domain operations.) The release velocity needs to be captured, and whether we're doing well against it; we'll go through an example of a service level objective.
How do we set service level objectives? I just wanted to give you an idea of what these objectives are; now, how do we go about setting them? The first key point: an objective has to be measurable. When I say "this is my objective," it has to be a number — something that can be measured against an SLI. Second, it has to be customer-oriented: I can't have something that only pleases me or my team internally; it has to serve the revenue or the business. Third, objectives have to be challenging: setting a service level objective of "we aim to deploy once a month" is not going to achieve anything. They have to be unambiguous — crystal clear, not open to dispute later. And all stakeholders participate. This one is very interesting, because a service level objective is not set by one team: multiple people come together — the product team, the developers, the engineering heads, the business folks, and the operations folks — and say, "our goal is for 99% of our customers to be happy." How do we break that down? We'll cover that. But the release team might say, "look, we can't do this, there's already so much pending release work"; the QA team might say, "we can't introduce these new features because we haven't tested them." Like a convoy, you only move as fast as the slowest vehicle: the lowest confidence among the teams caps the service level objective you can set.

Using all of this information, what do real service level objectives look like? Instead of saying "my service should be up," we come up with a very realistic statement: I should be up through the day, but having looked at my service level indicators, I can deduce that 3:35 to 3:50 a.m. is when I get the least traffic — if you're lucky, no traffic at all at that time (or unlucky, in the business sense) — so this is the only downtime that's allowed, and I have to deploy within this window. This simple statement has multiple implications, because look carefully: it's just a 15-minute window. If I have to get my deployment right in 15 minutes, the preparation behind it is massive. I'll have to do a canary; I'll have to take dumps of production data and run testing systems against them to make sure it works. There's a massive cost to it. Every single service level objective has a cost line item next to it.

Or, instead of the ambiguous "don't serve errors," we say: only 1% of my requests may have a status code of 400 or above — only 1% of errors are allowed. Or I run a deep health check and say only 0.05% of those deep health checks may fail. And we usually say "my system has to be fast: my P95 — or P99, one of the most common terms — has to be 100 milliseconds." This service level objective has to be read inverted: a P99 latency of 100 milliseconds means 1% of your request base is experiencing something worse than 100 milliseconds. That's the amount you have chosen to compromise, and that's the question you ask yourself: is that 1% of the user base one I'm willing to lose? That's the number I set.
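That inversion is easy to see on data. A small sketch — synthetic latencies and an arbitrary 100 ms objective — that reads a P99 target the way the talk suggests: as the fraction of requests you've agreed may be slower.

```python
import random

# Synthetic request latencies in milliseconds (illustrative only).
random.seed(7)
latencies = [random.lognormvariate(3.5, 0.6) for _ in range(100_000)]

def percentile(values, pct):
    ordered = sorted(values)
    # Nearest-rank percentile: smallest value >= pct% of the sample.
    return ordered[max(0, int(len(ordered) * pct / 100) - 1)]

p99 = percentile(latencies, 99)
objective_ms = 100.0
worse = sum(1 for v in latencies if v > objective_ms) / len(latencies)

print(f"P99 = {p99:.0f} ms")
# Reading the SLO inverted: how many requests actually sit past the target?
print(f"{worse:.2%} of requests are worse than {objective_ms:.0f} ms")
```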
And if you say, "no, I can't even lose 1% of my user base," then fine — you lower that number further, and there's a higher cost. The same goes for transactions per second and for concurrency. (In the next talk we'll discuss how TPS differs from concurrency, but if you're already aware of that, you get the gist.)

Let me take one more detailed stab at service level objectives. Example: an SLO of 99% on a service that takes 1 million requests a month allows only 10,000 failures. This is the kind of requirement we get from the business: they're okay with losing 1% of the requests, so we're saying 10,000 failures are allowed. Now, how do we change gears with this? Say I'm 14 days into the month as an operations team, and I realize the number of errors we've already had is 9,000. What this immediately suggests is: we need all hands toward stability, right now. There is nothing we can do about releasing any further, because we have depleted our allowance — our pocket money — and there are no more funds coming in. Move toward stability. On the other side, if 14 days into the month I've had only 500 errors: well, okay, let all hell break loose — release as much as you want. We can actually break software if we want, because we have permission to innovate — so why not innovate? This is what a well-defined SLO can do as a culture: it changes gears. It may change next month; it may change dynamically based on the feedback I'm getting.

Another aspect of defining service level objectives is the time window. This is very important: am I counting these numbers over a week, a month, or a year? A 99% uptime target over a year would allow something like four days of downtime; over a month, only around seven hours. What that means is: if I take it over a year, is it okay for my shop to be shut from Christmas till New Year? No — legitimately I'm within my SLO, but it has an adverse effect on the business. So it's not like you keep collecting this ration and spend it all at the end of the year. You pick a rolling window, and usually you say a month — a week is too short, because you may change something and it takes a week for the results to show up. But when you do a rolling window of a calendar month, there's a challenge: a month is never even — sometimes 31 days, sometimes 28, 29, or 30 — so the number gets imbalanced again. So you pick a fixed, concrete number, like 30 days. And pay attention: these numbers have to be very clinical, because if your business decisions rely on them, they'd better be accurate. It can't be "well, I'm at almost-99-ish uptime." It's like that famous line from the movie Anchorman: "60% of the time, it works every time." Favorite dialogue — I wish I could quote it every time somebody asks me how often my system is up. Choosing the aggregation is also important: do we pick individual metrics or aggregations?
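Pulling that arithmetic together — a hedged sketch of an error-budget "gear check" over a fixed 30-day rolling window. The volumes mirror the example above; the pacing logic and thresholds are arbitrary choices for illustration.

```python
# Error-budget gear check, using the numbers from the example above.
WINDOW_DAYS = 30          # fixed-length rolling window, not a calendar month
slo = 0.99                # 99% success objective
monthly_requests = 1_000_000

budget = int(monthly_requests * (1 - slo))   # 10,000 allowed failures

def gear(errors_so_far, day_of_window):
    """Decide between shipping features and hardening, from budget burn."""
    remaining = budget - errors_so_far
    # Naive linear pacing: how much budget *should* be left at this point.
    expected_remaining = budget * (1 - day_of_window / WINDOW_DAYS)
    if remaining <= 0.1 * budget:
        return "FREEZE: budget nearly gone - all hands on stability"
    if remaining < expected_remaining:
        return "SLOW DOWN: burning budget faster than the window"
    return "SHIP: budget to spare - release, experiment, innovate"

print(gear(errors_so_far=9_000, day_of_window=14))  # -> FREEZE
print(gear(errors_so_far=500, day_of_window=14))    # -> SHIP
```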
Window length we already spoke of. These, along with the aggregation choice, are what define a very good SLO. Notice we haven't spoken of any automation or any engineering so far; we've only spoken of what needs to be done. All of this could have been done on spreadsheets — in fact, that's what we did; I was blessed with a guy who was really good at spreadsheets, and we did almost all of this on them. But beyond a point, spreadsheets and manual effort stop scaling. Engineering's core function is that it lets you do what humans do, only more repeatably, faster, and more economically — pick two of those three, obviously. There's no fourth utility of engineering, and this is where we introduce engineering into this whole site reliability story. We understood what reliability is; we understood why reliability matters; now we understand why engineering — think about crunching and gathering all of this data by the hour, by the minute, by the second. It's simply impossible to do manually.

"Hey Piyush, sorry to interrupt — a quick question, since we're just covering SLOs. Ashutosh asks: does the failure allowance in an SLO mean the same thing as an error budget?"

Yes. "Error budget" is terminology we picked from Google's book, though I don't usually use it; what I call allowed failures is an appetite, which is almost a corollary of saying there's an error budget. So yes — and clearly, Ashutosh, you've done your homework on SRE — it is the error budget. For others who probably haven't read that book: because, as we said, there's a cost to everything, the business says, "right, I get it — it takes me 10 engineers, or two on-call engineers, and this much infrastructure budget; it takes me $5,000 to reach 99%." And these numbers are very interesting, because they don't grow linearly. What we paid to go from 90 to 95 was about a third of what we paid to go from 95 to 99, which was a tenth of what we paid to go from 99 to 99.5. These numbers keep increasing exponentially, because the time you get to react keeps shrinking — 99.99 means roughly four minutes of downtime a month, which is an insane number. That is what an error budget and these numbers suggest: they bring in an appetite, based on a lot of factors. I hope that answers the question.

"I think it does. If you're open to another question on SLOs we can take it now — or do you want me to hold it?" I can take it if it's a quick one, sure. "Ajit asks if there are standard SLOs which you follow." It depends — actually, let me cover that later; there's a slide for it right after this that will answer the question. "Awesome, awesome."

Cool. So what else does an SRE do when an outage happens? These are the three mandates that site reliability engineering is supposed to fulfill — if you want to slap OKRs on an SRE team, they come out of these three things. First: minimize the time. If earlier an outage used to last one hour, that number must keep improving.
Second: where else is this happening? This is a very interesting question. Say I ran into an issue where the payment service failed because my Redis ran out of disk. If I fix it, and the fix requires a restart of the payment service, the very next question that hits our head is: what other service depends on this? Before I move my attention away from this system, I must patch and fix everything in its purview, because that's another failure waiting to happen. Third: how do we prevent these issues from happening again? That's a very real one, and this is where engineering truly comes into play. It's very interesting — we realized, as a number, that 38% of our issues were repeats of the same issue on the same nodes, and we could have done a better job of it. Only once we started capturing it did we start realizing, "hey, this is actually the same issue; we need to do something about it" — and faced with that data, you take better decisions. Those are the three mandates of SRE.

What's the recipe behind it? One: an insane amount of observability. The first thing we do is stop shooting in the dark; we put taps and levers and metrics in every single place, as much as we want. This is an insane amount of data — at one point we were producing a terabyte of metrics a day, a single day; we almost had to keep swapping the Postgres disks out, because we had decided to store it in Postgres for a while. Two: control. One part of reliability is that it requires a massive amount of control, which in turn requires massive observability. Control means everything goes through the same set of principles — and yes, the same set of mistakes, because when the mistakes are the same, they're easier to fix, too. Three: automation. Automate everything you have, because humans will make mistakes. Four: root cause analysis of everything, every time. RCA has become an abused term — in most RCAs I've seen, we just write a description of the issue. Every RCA must yield two things: one, exactly which SLI was hurt, and by how much; two, exactly what is going to change next time. Only then can an RCA be called complete and closed.

Another thing SRE does is cross-org collaboration, which is very important. A developer's scope is limited to their own team, but an SRE has to cater, as an internal product, to the entire organization. If one product has improved its reliability using a common toolchain, another product should also benefit from it — so there's cross-org collaboration happening across multiple teams.

Lastly: guardrails and frameworks. What I typically mean by that: when you're driving, there are guardrails on both sides of the road. Their only purpose is that you don't deviate too much, and in case of an accident the damage is contained. That's exactly the purpose here. To take a tangible example of what a guardrail means: those small bots you've seen — there's a bot for this, a bot for that. If you accidentally open a security group that exposes traffic to the entire internet, there's a bot that will raise an alarm that this has happened.
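A hedged sketch of that first guardrail bot — scanning security groups for rules open to the whole internet with boto3; the alerting side is stubbed out with a print:

```python
import boto3

def find_world_open_rules(region="us-east-1"):
    """Guardrail: flag security-group rules that admit 0.0.0.0/0.

    It doesn't block anything; like a real guardrail, it just contains the
    damage by announcing the deviation early.
    """
    ec2 = boto3.client("ec2", region_name=region)
    offenders = []
    for sg in ec2.describe_security_groups()["SecurityGroups"]:
        for rule in sg.get("IpPermissions", []):
            for ip_range in rule.get("IpRanges", []):
                if ip_range.get("CidrIp") == "0.0.0.0/0":
                    offenders.append((sg["GroupId"],
                                      rule.get("FromPort", "all")))
    return offenders

# Run this on a schedule; wire the output to Slack/PagerDuty in real life.
for group_id, port in find_world_open_rules():
    print(f"ALARM: {group_id} is open to the internet on port {port}")
```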
Similarly for spend: say a DDoS one day makes you deplete your cloud budget massively — you've blown through your budget like anything — there'll be a bot that announces, "hey, I've seen a rapid increase in cloud spend; maybe you should do something about it." These are simple policies and checks constantly running in the background. They prevent failures from becoming catastrophic; they don't eliminate them, they just minimize the losses.

When we speak of automation, people ask: aren't Kubernetes and the cloud automated enough? Why do I need more automation? The reason — and it has been proven over and over — is that if you ask a human to do a thing three times, they'll produce three different results, with three different bugs each time, and at the end of it you get one highly demotivated employee, because you asked them to do mundane work. Now you've got four problems: three bugs and one unhappy employee to address. This is why automation is necessary. Digressing slightly: the sole purpose of engineering is to minimize the deviation in a given task. You want repeatable output, and that's what automation gives you.

"Hey Piyush, before you go to the next slide — a quick question about minimizing downtime. We were talking about reducing failures; does minimizing downtime mean we're simultaneously trying to accelerate the recovery rate?"

It's a good question — how we actually minimize downtime goes slightly into the snippet of my next talk, but what we can cover here is this: minimizing downtime is done through awesome documentation. While you're on call, you create runbooks, and you dogfood your own recipes: I'll use automation, I'll write scripts. Say I'm running into an issue where the Elasticsearch JVM heap blows up every now and then; I see this is happening repeatedly, and there's downtime because of it. I want to ensure that any downtime caused by a JVM heap issue is minimized. So I document it, and I take home the homework that says I need a better alarm on this — that's one way to minimize downtime in that window. And rather than going into every single server to patch it by hand, I write a very small script — using Capistrano, or Fabric, or Ansible, or AWS Systems Manager, anything for that matter — so I can go and quickly fix the thing. This requires quick thinking, and it requires a massive amount of experience as well. Does that give a glimpse of it?

"I think that's a decent answer for now — I'm sure people will ask more. Without much ado…"
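As a sketch of turning that runbook into a script — hedged: the hostnames, log path, and service name are made up, and Fabric is used purely as an example of the tools just mentioned:

```python
from fabric import Connection

# Hypothetical fleet and remediation for the "JVM heap blew up" runbook.
HOSTS = ["es-node-1.internal", "es-node-2.internal", "es-node-3.internal"]

def remediate_heap_issue(host):
    """Runbook step, codified: check for heap symptoms, restart if needed."""
    conn = Connection(host)
    # Crude symptom check: did the JVM log an OutOfMemoryError recently?
    result = conn.run(
        "grep -c OutOfMemoryError /var/log/elasticsearch/es.log || true",
        hide=True,
    )
    if int(result.stdout.strip() or 0) > 0:
        print(f"{host}: heap issue detected, restarting service")
        conn.sudo("systemctl restart elasticsearch")
    else:
        print(f"{host}: healthy")

for host in HOSTS:
    remediate_heap_issue(host)
```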
So, the next question: when do I need that much automation? Here's the thing — servers can scale, people cannot. Well, people do scale, but only horizontally, when they eat potato chips. Okay, nobody should quote me or blame me for that; I was just making a joke. Servers can scale. What that means is that any time we have this problem, we should capitalize on that fact: you cannot keep throwing more bodies at a problem. If these are repeat tasks, computers will do a good job of them, and that's where we start investing in automation.

Which takes us back to a question somebody asked a while back: what is the right spend, and what is the right time, to start focusing on this? The answer: the point at which you realize something in your business is not working the way it should. I'm not just talking about the website being up or down; I'm also talking about customer experience suffering because of massively delayed responses. Imagine I'm trying to watch something on Netflix, but it takes an hour just to download that video — I'm probably going to abandon ship and watch something else, no matter how badly I want to watch it; I'll just torrent it, for that matter. (Not advertising torrents — people just do that.) When the amount of loss is significant enough that, by your calculation, the spend required to recover that loss is less than the loss itself, that's when you start thinking of site reliability. What I mean is: today I'm at 90% — uptime, latency, whatever my SLO metric is, let's stick to one — and because of this I'm losing 10% of valuable business, which my competitor is eating. That business amounts to X dollars. By employing an SRE team — whose people, infrastructure, all those line items, computed and calculated precisely, cost X/2 — I recover that X at a 50% profit. And you say: fine, I need to do this, because there's no other way.

One of the greatest examples lies in history: December 1, 1913 — Ford's assembly line, the single greatest thing automation did for industry. It used to take about 12 hours to manufacture a car end to end. The Model T, the very famous model of the time, had been in manufacturing for years with comparatively modest volumes; after this automation came in, production time collapsed, and Ford went on to sell the Model T in the millions — famously, in any color, so long as it was black. This is what automation can do. At times we believe, "I don't need automation because I don't have that much scale." It's probably worth asking: do I not have that scale because I don't have the right automation — because I have a bad customer experience? These are fair, thought-provoking questions to ask.

"So Piyush, someone has asked: does that mean we should only start thinking of SRE when failures start happening and we can't handle them on our own?"

That's not the way I'd want to advocate it. See, when you call something a failure, it means somebody experienced it — a beta system doesn't have "failures," because there's nobody it hurts yet. So when you reach the point where a failure is going to cause some loss — and this loss doesn't always have to be dollars —
— let's say my only business model for the next two years is to garner Facebook likes. Then by being down, I cannot gather my one million likes: the single metric my business relies on is not working, and I will invest in site reliability. And I'm actually going to cover this part in the next slides, because the answer isn't black and white; it's a gradient. "Right, thank you."

One question — a fairly famous question that I keep getting asked — is about SRE versus DevOps: how is SRE really different? SRE is not different. DevOps is the larger goal — developer operations is a function that has to be performed — and SRE is a way to get there. It's almost like saying functional programming is a way of doing software development: it has its own benefits, but we're not saying it's the only way to do software development. Exactly similarly, DevOps is the function, and SRE is one of the means — a paradigm we've come up with to perform and fulfill the role of developer operations, because we realized the existing operations methodologies were not sufficient for that larger bucket. (And yes, I missed a colon on that slide — fat fingers; small mistake.)

So what exactly did SRE change? Why do we even harp on all of this? Look: feature development is owned by product developers; quality is owned by release managers; who owns the uptime of a system? When something breaks, who do I point to — by the simple principle of authority and responsibility — and say, "you have the authority and the responsibility; if this thing goes down, I'm coming to you"? This is why SRE was born: in the constant tussle between operations and developers, there was nobody who could take single responsibility and say, "I own this" — and own it across the board, not just for one single service. Uptime didn't have owners. Who owns capacity planning and its expenditure? Nobody did. That is exactly why SRE was born.

And the reason it has come up only now — even though it's been almost two decades in the making — is that until a couple of decades ago, shops weren't open 24/7; consumption wasn't 24/7. Software has massively escalated: I want to upload my kitten photo at night, I want to listen to Spotify through the day, I want to watch Netflix through the day — there's no time limit to any of it, and people are across the globe; business is global and round the clock. That's why the need for something like this has kept growing, and that is what SRE changed. We can fairly say: SRE brought to software development what the assembly line brought to manufacturing.

Citing some numbers — because we have a thing for exact numbers, not fluff: over 64% of failures were actually bad deployments, a number we only saw once we started capturing it. Imagine — 64% of the time, it was a deployment that broke things. 38% of issues were repeats; another interesting number. And SRE is not just an expenditure. What became the highlight of two quarters of our operation was that, with SRE practice, we actually cut our operations
cost by 60% within a year. That was a massive number, because we were a company running data centers — imagine 60% actually being saved. That is money you have effectively earned, which you can spend on further product development. So SRE is not always a cost sink: when you utilize it right, doing things the right way eventually saves you money as well.

Do we go all in? Do I eat it all, or do I start slow? You start one by one: SRE has a phased approach. If your organization doesn't have an SRE practice right now, you don't have to go from zero to sixty in one second; you go one step at a time. Everything here correlates to product development — think of the famous skateboard picture, where you build a skateboard first and a car last. Many disagree with it, but it still works, and the approach for SRE is similar.

You start with no dedicated SRE staff — the same engineers you already have. This is tier zero: you put some observability in place. Can you start measuring these numbers? Because unless there is measurement, there cannot be improvement — or rather, there will be improvement that you cannot quantify. Then you start carving out some SRE-type projects, aimed at unifying the data sets across services into a common group: where are we making the same mistakes, where are the common SLIs being hurt, what exactly is common across the products and the APIs? You start with that. Then you move to the next level, where you onboard one service at a time: you set up an on-call duty for it, you set up an on-call schedule — who gets alerted first, what the escalation matrix is — and you watch how teams and services adopt it. This is going to change a bit of your organization's culture and its processes, and it won't happen overnight; give it time. Watching how teams adapt also gives you more data: it will tell you, "hey, this service goes down more often, because of which people have to stay up more often" — and now you need to worry about those things as well. So go slowly on that. Once that model has been satisfied — once SLI improvements have been proven, once you can say the observability you put in improved things by this much — it's a model worth repeating: onboard other services, take the SRE tooling and practices across the organization.

Next up: what team structures are typically followed for SRE? There are multiple models people follow; you may pick one over another at any given point, and it may also change as time changes. The first: "one SRE." One SRE doesn't mean a single SRE person; it means one SRE team — I should have written it that way. It's one common, unified team, because the organization is not that big, so that SRE team performs multiple functions. You give them the responsibilities of what we traditionally call DevOps — CI/CD, capacity planning, architecture reviews, a bit of production testing as well, a bunch of these things — because they want control: if they own the system, they want visibility, so they have to do those things. It becomes part of their
prerogative — and also their responsibility: look, you've got to do these. That's "one SRE." The downside of that approach is that the team suddenly becomes really large, and it may look like this team is bigger than your product team — though technically it's not, because it's just one common, unified team.

The next model you can pick is common tooling: a team whose only purpose is building toolchains. Developers own their own systems; this team gives them more, and better, tooling to do their own jobs. The pro of this structure is that the SRE team stays minimal, building an internal product that the other developer teams use, based on their input and feedback. For example: "we don't have time to take database snapshots every time we release" — so they build a common tool. And remember, SRE is a cross-organization function: they'll also speak with the privacy folks, who say, "no, the numbers have to be masked," so they build a small tool that snapshots a running system at the right time, masks the numbers, and hands the result to developers to test their systems against.

Then there's the embedded team: you have an SRE who is on loan, injected into a dev team, which they consult constantly. Say the team is launching a new key-value store and wants to go with Redis Sentinel; the embedded SRE, having seen the other product teams' setups, says, "I've seen other teams use Redis Labs — why don't we simply use that?" Or somebody says, "I've been using Photox, or some other XYZ project, for this — let's use it." This borrowed, embedded person can drive better decisions and gradually make that product, and that team, more reliable.

Lastly, the outsourced team. This doesn't necessarily mean business outsourcing sitting elsewhere; it could be an offshore team that you consult on a need basis. This is the hardest to pull off, but it has its own benefits: when it's outsourced, you can put clear SLOs and OKRs on their head. It also runs into friction with other teams, so this model has to be carefully balanced; it requires both teams' leaders to be on the same page and a high level of collaboration. But once done, it can give real power and authority to the outsourced SRE team to bring organizational changes faster, because they are not dependent on others to execute.

The next thing I want to pick up — the topic of this talk — is the SRE maturity model. What does the journey look like from one nine to 99.99, where does one fit oneself into it, and what do the responsibilities look like? The nine-X stage of an organization's maturity model looks like this: you're beginning to consider reliability, because the nagging issues have started coming. You are adapting the software to do canary or rolling deployments; you're adding coverage validations, SLI collection, observability; you're setting up an on-call schedule and getting warmed up to it.
Next is the embedded team. What an embedded team does is: you have an SRE who is on loan, injected into a dev team which consults them constantly. Say the team is launching a new key-value store and wants to go with Redis Sentinel, and the SRE, because they have seen the other side in other product teams, says, "hey, I have seen other teams use Redis Labs, why don't we simply use that?" Or somebody says, "I've been using some other project for this, let's use that," and you use it. This borrowed, embedded person can drive better decisions and gradually make that product, that team, reliable.

Lastly, there is the outsourced team. Outsourced doesn't necessarily mean a business-outsourced team sitting elsewhere; it could be an offshore team which you consult on a need basis. This is the hardest to pull off, but it has its own benefits: when it is outsourced, you can put clear SLOs and OKRs on their head. But this model also runs into friction when interacting with other teams, so it has to be carefully balanced; both team leaders need to be on the same page, with a high level of collaboration, to pull it off. Once done, though, it can give a real amount of power and authority to the outsourced SRE team to bring organizational changes faster, because you are not dependent on others to execute.

The next thing I want to pick up is the topic of this talk, the SRE maturity model. What does the journey look like from a single nine to 99.99, where does one fit themselves in, and what do the responsibilities look like? The first stage, the nine-x stage, is where you begin to take reliability seriously: the issues start coming, and you are adapting the software, say to support canary or rolling releases. You are adding coverage validations, SLI collection, observability; you are setting up an on-call schedule and getting warmed up to it; and you are building runbooks. Runbooks are basically a set of repeatable tasks you can perform every time there is a given failure. What that does is let you onboard people relatively quickly, because now there is a set recipe: if X happens, do Y. This is a knowledge base you keep building from that point on.
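That "if X happens, do Y" recipe can even live next to the code. A toy sketch, with invented alert names and steps, of runbooks kept as versioned data rather than tribal knowledge:

```python
# Toy illustration: runbooks as data. Alert names and steps are invented.
RUNBOOKS = {
    "disk_usage_above_90pct": [
        "Check which partition is full: df -h",
        "Rotate and compress logs older than 7 days",
        "If still above 90%, page the storage owner",
    ],
    "jvm_heap_exhausted": [
        "Capture a heap dump for later analysis",
        "Restart the service via the orchestrator, never by hand",
        "File a ticket linking the heap dump",
    ],
}

def steps_for(alert: str) -> list[str]:
    """Return the recipe for a known failure, or a safe default."""
    return RUNBOOKS.get(alert, ["No runbook yet: investigate, then write one."])

if __name__ == "__main__":
    for step in steps_for("disk_usage_above_90pct"):
        print("-", step)
```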
Once you've built this, you start moving towards 99. This is everything that was in the nine-x stage, plus more; you can't lose any of that. Now you're building tolerance proofs of the system: given a downtime, my system will be able to handle it. This is where you're adding distributed-systems robustness, and you are able to set service level objectives; you are done with the practice phase and you're nearly out there. This is also where you start discovering disaster recovery techniques, because here you develop a certain detachment from your data: you start realizing that there is going to be a disaster, and I have to be ready for it at all points of time. This is the first time you start making those decisions. And most importantly, this is where you build the ability to roll back. Don't go about doing automated rollbacks at this point; that's probably over-engineering.

From that point onwards, you take a stride towards being automated. This is where you do chaos engineering, to use the fancy term: you inject failures and see how the system behaves. Does it come back up? Does my disaster recovery kick in? Does my rollback work? Do my runbooks work? You also start doing capacity planning. This is the state of maturity where you know a lot about your system and can start forecasting: when I hit X traffic at the peak I will need this much capacity, but I don't need it in the valley. These are real calls, because at that point you can save a decent amount of dollars and start making really, really good decisions.
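As a toy illustration of that kind of forecasting, with invented traffic numbers and a made-up per-instance capacity, a naive linear fit over weekly peaks gives a first-cut estimate:

```python
# Toy capacity forecast: fit a line to weekly peak RPS and project forward.
# All numbers are invented for illustration.
weekly_peak_rps = [1200, 1350, 1500, 1640, 1800, 1950]  # last 6 weeks
RPS_PER_INSTANCE = 400  # capacity of one instance, measured in a load test

n = len(weekly_peak_rps)
xs = range(n)
x_mean = sum(xs) / n
y_mean = sum(weekly_peak_rps) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, weekly_peak_rps)) \
    / sum((x - x_mean) ** 2 for x in xs)

weeks_ahead = 4
forecast_rps = y_mean + slope * (n - 1 + weeks_ahead - x_mean)
instances = -(-forecast_rps // RPS_PER_INSTANCE)  # ceiling division

print(f"Forecast peak in {weeks_ahead} weeks: ~{forecast_rps:.0f} RPS")
print(f"Provision at least {int(instances)} instances for the peak")
```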
Here you also take up pattern-based re-architecture. No software in the world was written on day one and stayed perfect forever. This is where you start re-architecting parts and pieces: I have realized that a certain part of my system doesn't work under stress, so let me start adding a cache in front of it. These decisions start coming out, and you re-architect things.

From that point onwards, you may realize that even the downtime of 99.9 is not acceptable, and you desire to move to 99.99. Here the systems have to run in automatic mode. There has to be a failure mode and effects analysis report of every single component, thoroughly tested: what happens when Consul goes down, what happens when Postgres goes down, how long does it take for something to be backed up, how long does it take for an application to come back up, and what is the real impact of it? All of these things come into the picture, along with cost analysis: what is the cost of each line item, and what is the expenditure going to be? If-this-then-that automatic healing is where this stage lands, because all those runbooks can now be triggered automatically as hooks: if a JVM fails, you have a runbook for it, so just write a simple trigger. This is where it all comes together.
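A deliberately minimal sketch of that if-this-then-that healing loop; the health endpoint and restart command are invented, and real automation would live in your orchestrator or alerting pipeline rather than a bare script:

```python
import subprocess
import time
import urllib.request

# Minimal "if this, then that" healer: poll a health endpoint and, after
# repeated failures, run the restart step from the runbook automatically.
# The URL and restart command are invented for the example.
HEALTH_URL = "http://localhost:8080/healthz"
RESTART_CMD = ["systemctl", "restart", "my-service"]
FAILURE_THRESHOLD = 3

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def main() -> None:
    failures = 0
    while True:
        failures = 0 if healthy() else failures + 1
        if failures >= FAILURE_THRESHOLD:
            print("health check failed repeatedly; executing runbook step")
            subprocess.run(RESTART_CMD, check=False)
            failures = 0
        time.sleep(10)

if __name__ == "__main__":
    main()
```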
So this is a journey that a maturing organization has to go through. There may be an aberration where you say, look, I'm at the nine-x stage but I want to have the rollback ability; you can pick one. But if you find yourself trying to do something from the far right of the model while your system's maturity is on the left, that should tell you something. This is not a golden rule; it's just a yardstick you can use to measure system maturity, what level of SRE you are aiming at, and what level of SLIs you should be working with.

Now the question will be: where do I find these SREs, and what are the key skills of a good site reliability engineer? Foremost of all, the ability to withstand boredom. It's a very mundane, repetitive job. While you're on call, you're going to do the same thing over and over again, and during that boredom is where you have to think of innovation: how can I avoid doing this again? Your first job is to bring a system back up, and bring it up faster, which means you keep doing the same thing over and over, but you also have to think about how to prevent it from happening at all.

Then, attention to detail. These are soft skills, and I prefer them over any hard programming skills. One of those numbers is going to lie, and another is going to tell you the truth at the same time; there will be a small aberration somewhere, and when you catch it: ah, I get it, this is why it's happening. Say servers are degrading but everything looks fine, and suddenly you realize, hey, wait, IOPS are going down. These details are what you need to pay attention to.

Then, strong Unix skills. Nearly every server out there runs on a Linux system, and you need enough programming skill that you don't have to look around and ask a programmer to build these simple tools for you. An awesome amount of shell scripting is required; you must be really good at it and passionate about it. More importantly, it requires on-call experience, real or assisted, because on-call brings out situations people usually have not faced. Communication skills and documentation skills are a part of that on-call experience, and it can be real or assisted: somebody can coach you, or you acquire them yourself. And at the end of it, good statistical skills: you must be able to crunch numbers really fast.

Now, one may think these people will break the bank, that an SRE is a cost sink. That's not what an SRE is; we talked about how investing in the right automation and reliability can actually help you save a lot, and help you build products that your customers love, and that is more important. It's an investment.

How do you train these SREs? Because if they are expensive, you can't simply go out and hire the best for every job. Based on the organization's size and where you are in your maturity model, you train them on the job, they self-learn through assignments, or you throw them into the ocean. Each of these approaches is unique to your culture. If you are a company which has a lot of time right now and is investing early, you can train them on the job. If you are a company which is not that rich in time, you can give them assignments, but that requires good documentation. Lastly, if you have just hired fresh people, you can throw them into the ocean. These are in decreasing order of how soon that person becomes available: once you throw them into the ocean, they take a while to come out, whereas if you train them on the job you can start utilizing them faster.

I'm going to wrap this up with a few things to think about. There is a cost to reliability: reliability is required, we all know that, but there is a cost that has to be paid towards it. One hundred percent reliability is not a thing; sufficient reliability can and should be achieved, so don't aim for 99.99 on day zero. Reliability is a product which is improved in iterative cycles: take data, take a measured stab at it, see what should improve, see it improve, and give it some time. Reliability needs participation from every single stakeholder; it's not one person's thought and will that "I want the system to be this," because it involves multiple products, engineers, business folks, product folks. It has to be a collective decision. Reliability is a function; somebody just happened to call it site reliability engineering. That's it, thank you. That's my Twitter handle, this is where I've worked, and I'm at Last9.io; we are a reliability company, and you can reach out for anything that you may have.

Hey, thanks for the awesome talk, Piyush. I'm sure people have questions from the discussion and the presentation; it was really informative for me at least, and I learned quite a few things. So folks, let us know if you have any questions on the chat or the Q&A, and feel free to give any feedback you have. We are open for questions, so raise your hand on the chat and I will unmute you so you can ask away. Okay, I can see the first question, by Anikate. Anikate asks: how do you get the alignment of non-technical and business stakeholders on SLOs?

Oh, interesting question. This is where numbers will help a lot. When we take a tiered approach, tier 0, 1, 2, 3, and start adding observability and measurability to each thing, we take a micro stab at improving something and measure the time and the cost of it. The next time you're faced with the question "why did this bug slip through?", you answer with a quantitative number: this bug happened because we had a failure, and why? Because we do not monitor our failure rate when we deploy, so people just come in and keep deploying. I'm just taking an example. What you do is take a number: in the past one week I've seen this many failed deployments. You introduce a small change: for every single deployment, here is a small checklist. People follow that checklist, and you prove that by this simple measurement you reduced the failure count by four: out of ten failed deployments, only six remain. That is an improvement. Now you take the cost it required: I had to spend X amount of time to get this done. You extrapolate that information to say, this is the cost to go from here to there. So the next time you're asked, "I want 100 percent accuracy," you present a similar number: look, this will be very expensive; taking the time into consideration, I'll be able to deliver it within 15 or 20 days of work. You come up with a real estimate, and then the conversation becomes: okay, fine, what does that one percent cost me? And I will give you real numbers here: at one point of time we ran a really, really large team for this, almost around 15 people, a massive engineering team that had to be built, and that was done to gain just one extra percent, which was justified. This is what defines real participation from people: all these numbers have to be calculated and discussed.

Just to add on to what Piyush has said: I think both the business and the engineering teams have to come together and finalize the numbers they want, beyond which you basically have a kill switch saying, hey, let's not have any more deployments once this number has been breached. That number is the error budget, a term we have also come across during the talk. I hope that answers your question, Anikate; feel free to raise it again if you feel we can get into more details. Thanks, Piyush, for answering that.
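The arithmetic behind that conversation is simple enough to sketch. Assuming invented numbers, here is how an availability SLO translates into a monthly error budget, and how the deployment failure-rate improvement from the example can be quantified:

```python
# Toy error-budget arithmetic with invented numbers.
MINUTES_PER_MONTH = 30 * 24 * 60

def downtime_budget_minutes(slo: float) -> float:
    """Allowed downtime per month for a given availability SLO."""
    return MINUTES_PER_MONTH * (1 - slo)

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {downtime_budget_minutes(slo):7.1f} min/month budget")

# Deployment failure-rate example from the answer: ten failed deployments,
# a checklist brings it down to six; measure the improvement, then cost it.
failures_before, failures_after, deploys = 10, 6, 50
improvement = (failures_before - failures_after) / failures_before
print(f"Failure rate: {failures_before/deploys:.1%} -> {failures_after/deploys:.1%} "
      f"({improvement:.0%} relative improvement)")
```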
Another question we had is by Shivanand, who asks: what are the deployment methods we can implement to avoid downtime?

Deployment methods. The first one would be: test your system really well. I know that's not the answer you're expecting to hear, but first, as much as possible, build staging, pre-prod, alpha-production kinds of versions. We don't speak of this as a formal deployment method, but I've seen a lot of people bypass and circumvent it. The incubation period of a bug is caught well in this staged process, and it doesn't mean there's added latency; think of it as releasing to an early set of users. That is one. Now, once you're finally and fully convinced that this is deployable, you may take canary deployments into consideration as well. But, as always, I'm going to answer by saying there's a trade-off. When you pick deployment models like canary, think of a server that is running at optimal load, consuming four out of six cores: a canary won't work there, because you're out of headroom on those servers and you actually need to provision more. The amount of infrastructure you need will spike up for a while, and that's a cost trade-off you have to weigh. Or what you may do is deploy it to only a subset of users, which is also common practice. I think, Tazlik, you are better equipped to answer this one; you handle this on a day-to-day basis.

Yeah, you've partly mentioned a very important point. I do agree that having a pre-prod is paramount to weeding out anything that comes in as a regression in your release cycle. That being said, another thing a lot of people dismiss as unimportant is end-to-end testing. Again, there's no silver bullet for a bug-free release, but personally I've seen that if you have checks like unit tests, smoke tests, integration tests and end-to-end tests, and you also run load tests on your service before pushing it to production, these checks will more or less weed out quite a lot of the problems you would otherwise see by just pushing straight to production. And of course there are deployment patterns where you give X percent of traffic to the change you're trying to introduce, but I feel those come much later, because even a failure affecting some X to Y percent of customers would be a big thing for services under high throughput. We can elaborate more on this, but there are a few more questions, so without much ado: I hope that answers the question fairly decently without overlooking the details.
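As a rough illustration of that percentage-of-traffic pattern (the service names and the 5% weight are invented; real setups usually do this in the load balancer or service mesh, not in application code):

```python
import random

# Toy weighted canary router. In practice this lives in the load balancer
# or mesh; backend names and the 5% weight are invented for the example.
CANARY_WEIGHT = 0.05  # fraction of traffic sent to the new version

def pick_backend() -> str:
    return "orders-canary" if random.random() < CANARY_WEIGHT else "orders-stable"

# Rough check: out of 10,000 simulated requests, ~5% should hit the canary.
hits = sum(pick_backend() == "orders-canary" for _ in range(10_000))
print(f"canary received {hits / 100:.1f}% of traffic")
```

For the "only a set of users" variant, the same idea applies, except the choice is made by hashing a stable user ID instead of rolling a die, so a given user always lands on the same version.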
Right, jumping on to the next question: Shireesh is asking, what are the three most necessary skills needed for an SRE?

Withstanding boredom, as we already discussed. No, seriously, that's a really important one. We usually don't talk about this, but these are really, really boring jobs, and one of the things I encourage in young developers is: you must have the perseverance for mundane work, and only then do you start thinking about what can be improved. That is a soft skill. Second is a passion for things being done true to a protocol; this is also a soft skill, and the third one is the hard skill, which I'll come to. Because you are a person who is going to set protocols and set guardrails, you have to have a passion for following them yourself: if we do things in a certain way, you carry it, follow it, and design a culture around it. That's a very important skill. The third one, the hard skill, is being fluent in a programming language, whether Go, C++ or Python; the rest can be taught. And they must have done some kind of work which interacts with the system, some work of their own on their own machine. I usually prefer engineers who use Linux on a day-to-day basis and have a familiarity with it, because when they face real systems out there it is almost second nature to them; their hands move freely when interacting with those systems. So I gave you one free: you asked for three, and those are the four skills I would say you should look out for.

Just to emphasize one point Piyush mentioned: I personally feel the ability to go headstrong into a problem, while being fully aware that you might not know everything needed to solve it beforehand, matters a lot. A general inquisitiveness and a can-do attitude work wonders in debugging production problems, so that is something I definitely feel is important when production outages happen and production systems go haywire. I think that was a very good question, and Piyush also covered it in the slides.

There's a question by Muthu Kumar: are there any practical steps in general to follow to create a better SRE team?

Yes: fire drills. I remember this one time, we were traveling to our engineering headquarters, and we made the mistake of putting every on-call engineer on the same flight; that's a mistake we never repeated. One of the masterminds on our team decided to create a hoax downtime: it was falsely announced that the system had gone down, while the on-call engineers were on a flight for the next five hours. While there was no actual loss, the chaos it led to taught us the right lessons. These are some of the dry runs that we keep doing. Testing your system and so on, those are things you'll find on the internet; I'm not talking about those, I'm giving you the other things we don't usually talk about: testing whether your teams are fully equipped in terms of documentation, processes, manuals, and runbooks on what to do when a downtime happens. Everybody will eventually figure out a problem; the challenge is how we minimize it, and these are the key elements required to get there.

And this applies to systems as well: you test your systems, you break your systems. A culture that is very important to breed across engineering is that infrastructure is ephemeral. It takes a long time for engineering and product teams to let it sink in that nothing is forever in terms of infrastructure; servers will go away, so do not make changes which are exclusive to one server. Practically speaking, don't write cron jobs which only run on one machine, and don't go in and inject a cron job by hand; spend some time using the right orchestrator for a distributed cron. Things like these are a byproduct of the same learnings and the same culture that you need to percolate down to every single engineer.
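A minimal sketch of one way to avoid the single-machine cron trap, assuming the redis-py client and a reachable Redis (an orchestrator's native scheduled jobs would be the more complete answer): run the job entry on every machine, but let only the instance that wins a short-lived lock actually execute.

```python
import socket
import redis  # assumes the redis-py client is installed

# Illustration only: the same cron entry runs on every machine, but a
# short-lived Redis lock ensures only one instance does the work.
r = redis.Redis(host="localhost", port=6379)

def run_once_across_fleet(job_name: str, ttl_seconds: int = 300) -> bool:
    # SET with nx+ex is atomic: only the first caller within the TTL wins.
    won = r.set(f"cron-lock:{job_name}", socket.gethostname(),
                nx=True, ex=ttl_seconds)
    if not won:
        return False  # another machine is already running this job
    do_the_job()
    return True

def do_the_job() -> None:
    print("taking the nightly snapshot...")  # placeholder work

if __name__ == "__main__":
    if run_once_across_fleet("nightly-snapshot"):
        print("this host ran the job")
```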
Absolutely agree with what Piyush has said. Just to add on to it: I feel automation should not be done prematurely. Even if the whole goal of forming an SRE team is to get away from manual work, that should not be the focus initially; it should be an eventual process. That being said, I feel there are a lot of great takeaways from the SRE book primarily; if someone hasn't read it, there are chapters which go pretty deep into how Google eventually achieved the culture they have. That said, it's not gospel, so it's usually recommended to pick certain things and drop certain things; if something seems to not be working for your organization, iterate and modify it until it works for you. Does that answer your question? Let us know if you have any other questions on this topic.

There's another question, by Central, who asks if there are any tools we would suggest to measure KPIs. I'm going to read that as SLIs, I think, right? Because technically they're service level indicators. I think this is a very new concept, and what happens is that there aren't that many off-the-shelf tools out there. Quite shamelessly I'll say that one of the missions of the startup I'm working on is to make this easier, but I'm not making an endorsement on this talk; I'm just saying there's very limited availability of off-the-shelf tools, and at times SREs have to build their own, because a lot of these numbers are unique to your interpretation and a byproduct of the cross-organizational units and metrics we have spoken about. What I mean by that: if the only number spoken about is expenditure, then your key performance indicators need to be derived in that direction. If the number of clicks or the number of conversions is the talk on the floor of your organization, then the key performance indicators will tend to look at those numbers. So KPIs get derived favoring the talk of the floor; there isn't a fixed set of KPIs you always have to monitor. Obviously there are the engineering KPIs in every respect: latency, throughput, queue depth, correctness, and percentiles of these. Those are the fairly standard ones you will find in any tool; for example, if you're using a load balancer you can preconfigure them, and almost every tool will provide them. But anything which is more than a pure engineering KPI will require a bit of massaging, based on what is understood by the organization.
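For those standard engineering KPIs, the basic building block is a percentile over raw measurements. A minimal sketch, with invented latency samples, of how a p50/p95/p99 latency SLI can be computed:

```python
import math

# Minimal percentile computation over raw request latencies (ms).
# The sample values are invented; real SLIs read these from logs or metrics.
latencies_ms = [12, 15, 11, 300, 14, 13, 18, 950, 16, 12, 17, 14]

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

for p in (50, 95, 99):
    print(f"p{p} latency: {percentile(latencies_ms, p)} ms")
```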
Yep, I think that's a good take on what a service level indicator should be. So there's another question, about what the follow-up session will focus on. The follow-up session is already announced, and it will largely go into the depth of SRE practices: we really talk about technologies and tools, what breaks, what doesn't, and how we make choices. A simple example: I'm doing automation, should I use Pulumi, should I use Terraform, should I use Ansible? Configuration management: is it dead, what is the right way? Those questions, but also deeper insights into how metrics we see as obvious are sometimes actually incorrect and lying to us, and how SLOs can be misleading. Really tangible, implementation-level details to an extent; not boring as an engineering exercise, but assuming one is already doing this, what are the things to watch out for and how can they be improved.

Awesome, I think that's already a good introduction to what the next talk will be like, so that people are aware of what Piyush is going to cover. We have another question by Ajit, which is: will we be covering tools for stats gathering and so on? I can spend half a minute on it only because he's asked, but you'll have to promise your time for the session; I'm sure Ajit will be attending next time. Awesome, he says yes. Cool, 30 seconds for your topic then. Alright, I think that's more or less all the questions we have as of now. I'll just wait another minute before handing it over to Zainab, if there are no more questions for Piyush. I also know that you're hungry, since you mentioned a wrap. Thank you very much, everyone, for joining here today; we promise we will have better audio gear for Piyush next fortnight. For all those of you who are curious, visit hasgeek.com/rootconf, subscribe, follow us, and tell us what topics are of interest to you. Just like Piyush, you can also speak, so drop us a note at rootconf.editorial@hasgeek.com or tweet to us at @rootconf. See you in another fortnight with Piyush and a lot more. Thank you, Tazlik, for moderating the session today, it was fantastic. Thanks, Piyush, and the audience as well. Goodbye and good night; thanks, everyone, for attending. Bye.