You know, as SREs, most of the time we talk about our war stories as the deepest wounds of technology — hey, we had a distributed-systems failure, Kafka wasn't working, Consul wasn't working. I wanted to talk about one of those, but there's an interesting one. We were a relatively young team who were given charge of managing a decent amount of software. We were using Nomad back then for our orchestration. And every once in a while, somebody would go ahead and make a deployment which wasn't either authorized or well tested — young startup, everybody's been there, I guess. And every once in a while, this would result in an outage. I mean, you can refer to the SRE book; it says that almost 60% of errors happen because something changed, either a configuration or a piece of code. Now, there was literally no release manager at that point in time, and we were a smaller team who could not really set up those processes that fast. So once, what happened was, we did a deployment which was unauthorized — or unvalidated, if you want to say it that way — and we ended up erasing customer data. We were a product serving financial institutions, and loss of data is a big egg on the face. So once this happened, we went to the war room — and we used to travel to many countries at that point of time, so the only person who could do anything, even remotely, was actually in flight. Five hours later the person lands; by that time the loss is inevitable, it's gone. So how do we go about fixing this problem? We went to the war room, discussed, and — in hindsight it was one of the dirtiest hacks, but I would say the guy came up with a genius idea. We wrote a small TCP proxy in front of Nomad, so every request would first pass through the TCP proxy. And we distributed two-factor authentication tokens to the product and the SRE engineers; every request had to carry a valid 2FA token concatenated onto it. A very simple hack, 100 lines of code, not more. And measurement-wise, our outages actually went down by 48%, just from that simple 100 lines of code. Now, as SREs, when we went back, we asked ourselves the obvious first questions — if I go back into that room and think about the first lines we were thinking: a lot of people stood up and said, "Hey, I'll take care of it." Bad RCA. A root cause analysis cannot be something which cannot be quantified. Or, "Hey, we need more release managers" — that won't be solved overnight. You hire a release manager today, the person has to get the essence of the system; then you need two of them, because deployments happen every now and then. It's an added budget — we're talking literally two to three people to be added, and they don't come in by the hour; from the day you hire them, it's going to take at least two months. What do we do for the next two months? So not a good ask. What's the third RCA? "Well, we strip the deployment ability from certain people." Not the right way to go about it either. Why? Because now you have a concentration of failure on just a few people who are sitting there just deploying. We have an entire team who has the ability to deploy. It's not that they make deliberate mistakes; it's that every once in a while they fail to look for something, because they are relatively junior.
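For illustration, a minimal sketch of what such a 2FA-gated proxy could look like. The original hack isn't public, so the header name, ports, and the use of the pyotp library here are assumptions, not the actual implementation:

```python
# Hypothetical sketch of the "security camera" proxy described above: a thin
# TCP proxy in front of Nomad that refuses to forward any request unless it
# carries a valid 2FA (TOTP) token. Header name, ports, and the shared secret
# are illustrative, not the original 100 lines.
import socket
import threading

import pyotp  # third-party: pip install pyotp

LISTEN_ADDR = ("0.0.0.0", 14646)       # engineers point their Nomad CLI here
NOMAD_ADDR = ("127.0.0.1", 4646)       # the real Nomad API, reachable only via proxy
TOTP = pyotp.TOTP("JBSWY3DPEHPK3PXP")  # secret distributed to product/SRE engineers


def token_is_valid(request: bytes) -> bool:
    """Look for a hypothetical 'X-Deploy-Token' header and verify it as TOTP."""
    for line in request.split(b"\r\n"):
        if line.lower().startswith(b"x-deploy-token:"):
            code = line.split(b":", 1)[1].strip().decode()
            return TOTP.verify(code, valid_window=1)  # tolerate 30s clock skew
    return False


def handle(client: socket.socket) -> None:
    request = client.recv(65536)
    if not token_is_valid(request):
        client.sendall(b"HTTP/1.1 403 Forbidden\r\n\r\nmissing/invalid 2FA token\n")
        client.close()
        return
    upstream = socket.create_connection(NOMAD_ADDR)
    upstream.sendall(request)              # forward the authorized request as-is
    while chunk := upstream.recv(65536):
        client.sendall(chunk)              # relay Nomad's response back
    upstream.close()
    client.close()


server = socket.create_server(LISTEN_ADDR)
while True:
    conn, _ = server.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()
```

Pointing the deployment tooling at the proxy instead of the real API means an unauthorized deploy fails loudly before it ever reaches the cluster — the "camera" sees every request.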
And that is when we thought, let's go back to actually solving this the right way. We are coders; we can think of a small hack by which we should be able to solve this — just by adding that extra layer of a security camera, not a lock. I've always said this: if you really want to make things secure, don't use locks, use security cameras. Locks can be broken; security cameras cannot be. I mean, there's a higher pressure on you when you're being watched — you move every step cautiously, you know that it's being audited. And on top of this, there are seniors around you who have also been given the authorization token, so they would naturally run through a checklist: hey, did you do this? Did you do that? So yeah, this was one of the most important war stories I remember, with a very quick actionable and an instant drop in our repeat failures — that was a record — without unnecessarily adding an overhead of expenditure, in terms of both time and money. And I'm not saying that this is the process, but it allowed us to set up a process. And one of the byproducts that came out of this, which I really enjoyed, was that we started putting measurability into every single piece of code and every contribution we were making. From there on, we were in a position to actually measure. I would really like to know how you guys do such things at large scale.

So that's a good one — that's an amazing story, thanks, Piyush. With that, we welcome everybody who's online with us. This is a continuation of the past two talks that Piyush has been giving on site reliability: its impact, the key tenets, and how organizations really go through it. We have a panel today, a pretty great one. We obviously have Piyush, who's the founder at Last9; I think he comes with a lot of experience at different scales in terms of site reliability. We have Manjot Pahwa, who's an ex-Googler — or I think they call them Xooglers now. She's seen the life of both engineering and product at Google, and that brings an amazing insight for us. We have Kalyan Sundaram, who does site reliability engineering at LinkedIn — one of the places I can say has been one of the forefront adopters of the whole SRE culture. And for the audience, we have Saurabh Hirani, who does DevOps at Autodesk. He'll be moderating and channeling the questions that you have on the channel. Please, if you have a question during the middle of the talk, we would love to have it — please raise your hand so that we can coordinate. And last, I'm your facilitator today, Rishu Mehrotra. I used to be associated with LinkedIn for a while in the past, and that's where I got the SRE bug; that's where a lot of my SRE chops were honed. I'm no longer there, but I kind of carry that DNA now — I think we all do, from the places where we have been. So yes, with that, welcome once again. Building on what Piyush talked about — site reliability engineering, incidents, outages, revenue loss — whenever we say "an incident", it's a word with a very strong negative connotation. So where I would like to start, and I'll probably start with Piyush, is this: every company in the industry is different — somebody says operations, somebody says DevOps, somebody says site reliability — and eventually a lot of people may confuse site reliability with operations, or the other way around.
But what I would like to understand with an example from you, Piyush, is: what is the lens with which a site reliability engineer actually looks at an incident, as compared to operations? Is it a very living-in-the-moment kind of viewpoint? Is it something where you actually trace it around and plug it, like plugging the actual gap in the boat? What is it that separates an operational outlook from a site reliability outlook? We'd like to really understand that, and this BOF, like we said, is about listening to these experiences from different parts of the industry, from different scales of the industry. So with that, yeah, Piyush, do you want to narrate another incident, or something that touches on these?

So, going back to the same incident and probably highlighting a part of it: I would say the real essence of SRE is threefold. One is, obviously, I've got a problem at hand, I need to fix this. We need to run towards the fire, towards the bug — that is the first principle, the first duty that we have. Second is, how do I minimize the loss? I need to keep the losses as short as possible; that's part of step one itself. Once I have put out that fire, the next question we ask ourselves is: where else is this happening right now, or about to happen? Because there's a pattern to it — this failure is not going to happen in isolation. An incident, to me, has always been a representation of our culture as well. An incident doesn't happen in absolute isolation. There are software bugs, sure, but we're not talking about those; they happen, but they are usually well caught in unit tests and integration tests. Real incidents are actually a byproduct of a lot of our culture and our habits. So: where else is this happening? That's the second question. The third question we have to ask ourselves — the most important one, and where the real "engineering" word in site reliability engineering comes in — is: how do I prevent this from happening again? If this has happened now and there has been a loss, first of all, what measurement needs to be done to see what the loss behind it was? How much did we actually lose? And second, if it was considerable, how am I going to prevent this from happening again? That is where most of the time is spent.

That's a very interesting outlook. The aspects I gather are bias for action, holding the ball, obviously cutting your losses, minimizing losses, identifying any patterns that are there, and culture and habits. But then the question remains, for organizations and teams that are very operational in nature and are moving to site reliability: are you saying that operational teams have to imbibe these tenets in order to graduate, to make that transformation? Because it's a transformational journey. SRE, I believe, is not really so much a skill as it is a cultural game, isn't it?

Absolutely. Absolutely, you got that right. There's a ladder that you need to take, one step at a time. I think one of the first things that you need to do there is measurement. Put a measure on everything — you can't improve something that you can't measure. So the first step would be to actually measure all of these things.
And I mean, this is where I would really love to hear from all of you as well, because you guys are the masters of measurement — books and practices have come out of your organizations.

Yes, well, we do that — that's a small thing. One of the discussions that started in the chat panel was around DevOps versus SRE versus operations. So if we could sort of clear up the buzzwords versus the actual work that goes into being an SRE. Manjot, do you want to take a stab at that? I mean, that's where the book came from.

Thank you so much, Piyush — first of all, for a very interesting story, and also for touching upon what the essence of an SRE is. If I had to summarize what an SRE is, I'd go back to quoting someone I really respect, my manager at the time, who said: the job of an SRE is to eventually be able to automate themselves away. And this, I think, in one line, captures the essence of how SRE is different from, say, a sysadmin or an operational sort of world. Building upon that essence, I used to say I have a day job and a night job — not necessarily because of the time of day, but because of the two things that I did. One is being very, very fast to respond, very sharp to capture things and take steps in the very short term to — not fix, I shouldn't say fix — at least mitigate whatever is happening immediately in front of me when I'm on call. The other part of my job used to be, after the immediate fire has been contained, really looking hard at what were the things that went right, what were the things that went wrong, and most importantly, what are the things for the future that can be worked upon — so that, A, these incidents don't ever happen again; B, if they do happen again for whatever reason, the impact is much less than what it was this time; and C, we are able to catch it way sooner than the amount of time we took. These are some of the ways in which I would characterize the culture of SRE at Google. And one of the most beautiful ways in which this can be demonstrated is to look at a post-mortem by an SRE team at Google. When there is an incident, you obviously have a log, you have a summary line, you maybe have an impact statement. But after that is where I would say the meat is: what went well, what went really well, then what went wrong, and how things can be improved. This is where some of the aspects of SRE culture come in — it's not about blaming, but just about what can be done in the future to make things even better than they are.

That's great. Kalyan, you want to add your pitch?

So I think the question, Bodhisattva, was what is the difference between DevOps and SRE, right? I think the primary objective of an SRE or a DevOps engineer, at the end of the day, is site-up: keeping the product up. These are different names given to it, and people have conceived different ways to achieve site-up. For example, the sysadmin, in the older world, is a person whose expertise is: you give me a black box, I deploy it, I make sure that after deployment the black box works well. If I can't take care of the black box, I put in a ticket, I escalate back to you, and things like that.
In the case of DevOps: changes became difficult, because the system admin wanted reliability in the system — that's what his job is — and the development team, the feature-addition team, wanted to push features. So there was a stalemate, because their goals are different: one team wanted change and the other team did not want change. The whole DevOps practice is supposed to bridge that gap; the DevOps team is a team that tries to bridge the gap between development and operations. But the DevOps team also has to have some level of automation, coding, some level of software engineering practice in them — and that leads to site reliability engineering. So I think it's an evolution of the process. It depends on the organization as well: some organizations will name a team "site reliability engineering" or "DevOps", but the team might still be fighting just to keep the site up. So it depends on the luxury of time. As long as the company's process is to evolve the team to a point where they just add nines to availability, have time to architect, and have time to do a post-mortem and plug those gaps back into the system, they are on the right track. They might not be a site reliability engineering team today, but they will evolve into one going forward. That's probably the difference I would call out, based on my experience. At the end of the day, irrespective of whether the practice is DevOps or system admin or site reliability engineering, the goal is still site-up, and SRE is considered one of the best ways to reach that goal. LinkedIn still calls "site up" the goal of the SRE team, and I think that would be the ultimate goal for any other organization as well.

That's amazing. Saurabh, you have something?

Yeah, I think this links back to the first session, where we mentioned the SRE maturity model. We said that if you want to go from not having any SREs to a complete SRE org, you should not go from zero to everything in one day. There is a maturity model involved. You take one step at a time, so that you also organically evolve and don't just blindly follow all the tools that are floating around.

Correct, that's a very good point. So thanks, everyone, for your opinions. It's amazing how one question can elicit different responses based on our experiences. Moving forward, let's help our audience understand some of these things in practice. These are amazing principles, and I think they always stick, but it would be great to hear some of the war stories you have faced at different points in time. Piyush started with one, and that's a great one. Manjot, you have seen both the product and the engineering side of life while you were at Google, so it would be great to hear something in practice. And this is primarily keeping our audience in mind, because the idea is that while you narrate the story, we touch upon each of these tenets we've talked about: bias for action, keeping the site up, holding the ball, closing gaps, cultural habits. It would be great if you could tell us an anecdote. Absolutely.
So when I was in Nehari, the product that my team and I were responsible for was a mixture of Google Photos, Blogger, and a couple of other social products under Google Plus in those days. We had just launched, I think, the Android auto-backup. The thing about backing up the world's photos and videos is that, well, it's a lot of photos and videos. And the thing about saving all these photos and videos, when you are hosting a global service with millions-plus users, is that it is the perfect recipe for overwhelming systems. How it worked internally was that we were using Bigtable as our metadata store. And we, as Google Photos SRE, were not completely responsible for the uptime of our Bigtable services — there was a whole dedicated Bigtable SRE team responsible for the uptime of the Bigtables that our service used. What we were responsible for was deciding on things like the schema, deciding on how we actually got to use the service, and a couple of other critical things related to it. So on one occasion — actually, on several occasions — what would happen is: if a certain bug suddenly uploaded several more pictures than we had expected, or a certain event occurred upstream that suddenly increased the traffic by multiples, this whole traffic would be directed towards our Bigtable instances. And the thing about how Bigtables were managed in those days was that there were partitions, and these partitions, as I mentioned, were managed by a completely different team. And they obviously wanted to fit several different Google services into a single partition. So when you have a single partition for several different teams, where one team is a behemoth like Google Photos and the other teams are Gmail or some other smaller services which are not as spindle-heavy, the one service that goes rogue has a habit of consuming all the resources in that partition. So not only do you have an outage that affects the product in question, Google Photos — you quite often notice outages in other, related Google products, such as Gmail or Blogger, etc. If I really think about this incident — and I'm going to go back to the post-mortem format — there are a couple of things that went really well and a couple of things that definitely could be improved. At a high level, when I think about this incident and hear some of your other stories, it is clear to me that as the size of a company grows, one of the things that definitely needs to scale up is the set of processes around how incidents are even tackled, how incidents are even communicated. Especially because — I've only spoken primarily about two teams, but it wasn't two teams; there were like five or six involved. There was the front end, there were some other related services, there was YouTube because of the videos aspect of it, etc. So now what we have on our hands is not only an incident that is a little complex to debug — especially for, say, a Gmail SRE, who is basically going, "we didn't do anything, why are we suddenly seeing an outage on our monitoring consoles?" — you also have complexity around being able to communicate this one incident across several of these teams.
So obviously there are challenges around communication: who should be declared the incident commander, how should they be communicating, what should the immediate steps be. And of course there are challenges around what the Bigtable SRE team can immediately do, because these services were managed by them and they were the only ones who could immediately relieve the pressure or put out the fire. So what this incident taught us — some of the immediate things that both teams, Photos and Bigtable, did — was, A, assign Google Photos its own partition, so that at least if they shoot themselves in the foot, they don't shoot other services in the foot. That was one of the immediate steps. The longer-term step was going back and really figuring out some of the root causes of why these incidents kept happening every now and then. And some really amazing things came out of that analysis. One of them, for example, was: you know what, we designed the schema for our Bigtable partition in 2008, when Picasa Web was a really big thing. Maybe some of those assumptions are not valid anymore — auto-backup is something very recent — and maybe we should rethink how we approach some of those design principles. Another thing that came out, something we could again do immediately, was: hey, we've been running a lot of these cron jobs; do you think those batch jobs, suddenly running in the middle of the day, have been causing some of these outages? And immediately we could see a lot of difference when we at least tried to contain when they were run — instead of the peak, move them to a trough — and ensured that they used a smaller number of spindles. And one of the other longer-term projects that came out was: how can we, in general, degrade much more gracefully in outages like these?

So Manjot, those are a couple of very interesting points you bring up — and again, for our audience's benefit: one, I like the part where you said isolation, where the team said, okay, let us limit the damage to ourselves, not outside. Very interesting. The second thing is around command centers, and I'm going to touch on it in a bit, because on the chat we have a question which is very much related to your story — this is, by the way, from Anurag Sharma. Thanks, Anurag. It says: what if the outage was first observed by a team which had nothing to do with the change? Wouldn't that team's SREs spend a lot of time just figuring it out? What are the best practices to organize response teams for teams which share data resources? I think this is a great question — could you throw some more light on it?

Thanks, Anurag, and your question is absolutely spot-on. Let me start by saying we weren't the favorite team out there in terms of outages, and nobody wanted to be in the same partition as us. So yes, this definitely happened, where other SRE teams would spend a lot of time looking at their own consoles, trying to figure out if there was something on their end: why are we receiving more traffic? Is there some abuse happening which is causing this? Some of the things we learned out of that concern what to do when an incident happens that can potentially affect other services.
One is the reactive way to approach it: if the Gmail SRE notices, okay, my Bigtable partition is acting up on me, they at least page the Bigtable SRE on-call, and they immediately find out — okay, it's Photos, and they're already trying to fix it. This is the reactive way, which is obviously not ideal. Then there is the more optimal, more proactive way of communicating these problems: is there a certain console where people can come and learn about common incidents that are happening, which might affect other people? And that is exactly the model we moved towards. For all these incidents we built internal tools — every solution to these communication problems is a combination of process and tooling. We built a whole tool internally to manage these Google-wide incidents, which were communicated to all the on-callers, and if it was something that was affecting you, you were explicitly added to that incident so that you could track what was happening. And some of these measures really helped get around the problem of wasting other people's time, especially when they can't do anything about it.

Just a couple more things to add on to that question, because this opens up a Pandora's box, at least based on my experience also. Sometimes you may be downstream of a given service and you see abnormal patterns in your own service. You go touch all the points in your service, you examine the dashboards you have for your service, you look at the things you know of — and then you're like, you know what, all the vitals look good, but still the system is not behaving as it should. Now there are only two possibilities here: one, either there is a vital that is not being monitored, or two, there's an upstream you have which is causing this ripple effect into your system. So how would you recommend that teams look at this fault isolation or detection mechanism? Because this is like an unknown unknown.

Right. So I'll first start with what measurement and capturing these things looks like for an average service at Google — and obviously this evolved over time. What we finally settled on was that every single service at Google, even before it was launched in production, would have the very core tenets of a service being monitored from day zero: requests coming in, which API endpoints the requests are coming in on, the number of errors, latency — these core metrics monitored from day zero for every single service. So let's say I have a downstream service which is suddenly receiving a lot of traffic. What I can do is try to isolate, on the basis of client, which particular service it is that is sending Google Photos a lot of traffic — is it Gmail, is it YouTube, is it Android versus iOS? So I try, on my end, through my consoles, to figure out which particular upstream service is responsible for the excess traffic. And my next step as an SRE, while I'm trying to debug this issue, would be to figure out who is on call for that particular service.
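Google's internal monitoring stack isn't public, but the idea described here — label every core metric by calling client from day zero so a traffic spike can be attributed in one query — can be sketched with the open-source prometheus_client library. Metric and label names below are illustrative assumptions:

```python
# Illustrative sketch (not Google's internal tooling): instrument a service so
# that request counts, errors, and latency are labeled by calling client and
# endpoint. With this in place, "who is flooding me?" becomes a one-line query
# instead of a debugging session.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "requests_total", "Incoming requests", ["client", "endpoint", "status"]
)
LATENCY = Histogram(
    "request_latency_seconds", "Request latency", ["client", "endpoint"]
)


def handle_request(client: str, endpoint: str) -> None:
    start = time.monotonic()
    status = "ok"
    try:
        ...  # actual request handling goes here
    except Exception:
        status = "error"
        raise
    finally:
        REQUESTS.labels(client, endpoint, status).inc()
        LATENCY.labels(client, endpoint).observe(time.monotonic() - start)


start_http_server(9090)  # expose /metrics for scraping
# A spike from one upstream now shows up directly, e.g. in PromQL:
#   sum by (client) (rate(requests_total[5m]))
# making it obvious whether the flood is "gmail", "youtube", or a client rollout.
```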
Quite often, when you deal with enough outages, you know exactly which friend of yours is on call today for that one service that is always noisy. And we internally, again, built a tool where you could figure out who is on call for the service you found — either through your console, or through your monitoring tool, or through your tracing tool — and then ping that person, try to escalate to them. So the job is: as soon as you are absolutely sure which particular point is responsible — if it's one of your own services, debug it further; if it's not one of your own services, find out who's on call for that particular service on that day and escalate to them.

Right. So we also have a related question. Saurabh, you want to cover that?

Yeah. So Ayush has asked: achieving isolation and using common service platforms also comes at a cost, right? This may be doable for bigger organizations but can be much more expensive for smaller ones. Scale and size also result in designing solutions like graceful degradation and all those things. So, any other solutions that the panelists would have?

I'll interject real quick on this one, because this question actually has multiple layers to it. And I think part of this question is also that sometimes, in our mind map, we look at organizations that are sizable in nature, like Google or Microsoft and so on, and we think: big company, lots of people, dedicated folks per service. But there are a lot of organizations of a much smaller size that still turn out a huge chunk of business and revenue, and they depend on those systems — which brings us to: all of these key tenets we're talking about, how well do they really scale? Piyush, do you want to take a stab at how you've seen site reliability scale, both up and down the maturity and org-size curve?

Right. One of the things that I usually bring up at this point is that SRE also has to be seen like a product. It is no different from an actual product being developed. What I mean by that — and when I say SRE here, I want to distinguish "site reliability engineering" as a common term and say that reliability is a function that an organization has to perform; I'm only going to stick to that, whoever performs it — is that for it to be performed, it needs to be seen in a product cycle. These days, if you go to a product manager and ask, should I work on a logout button or a checkout button, there will be very clear empirical data by which the product manager tells me: this one is going to get more traction, more business value, etc., so do this. A very similar measured, data-driven approach has to go towards reliability as well. So I cannot answer in isolation whether, or how, you build an umbrella which can monitor every single upstream or downstream service and see what is going to get impacted because of me and who I am getting impacted by. I think it starts with measuring: what are the key revenue lines, the important lines, that you're measuring? Let's take an example: if I'm a transactional website and most of my business actually happens through my being a payment gateway or something, and I depend on some external payment processor —
— then that is something of very high value and importance to me, and an image service is probably not as much. Now, once I've identified that, hey, if this goes down it is really going to impact my business and my product flow, this data point needs to come to me, and these monitors need to be constantly built. What I've done in the past — I remember I was doing a content startup back in the days of 2010 to 2014; SRE was still not a common term back then. We used to depend on external content providers, CDNs and the like, where the performance of content — how it was being delivered to the end client — was a real key indicator for us, because if the video was being served at very poor quality, users would drop off. Now, this is an external indicator, but of massive value. So — taking a leaf from industrial practices, you know, these small jigs — we built small tools and deployed them almost all over our toolchain; maybe they're not even reusable at times. We built a very small one which gave us constant feedback on the average latency of media being served to customers from the CDN. And where did we take this data? If we realized that the 480p version of a video was taking too long, it told us that we now needed to start investing in a 240p version as well, because there was a lot of traffic coming from poor-quality internet connections. A simple monitor on an upstream service would actually help me prevent the loss of a customer. Why did we capture that point? Because we realized that, hey, this is the most core essence of our product. So maybe I digress a little here, but when I look at these questions, I say the answer is always: it depends on the situation. Maybe an umbrella product won't help; what matters is what flow path it sits in — how crucial, how critical it is to the business. And I'm speaking mostly in terms of what is the first thing that I go attack and secure, because somewhere we need to start. As a startup, we are always playing a catch-up game, a trade-off of expenditure and the constant maintenance costs around it. So these are the few lines that I will pick and say: okay, I've got to secure these; deploy smaller jigs there which do this job for me. That's how I would interject on this question. Kalyan, please.

This question, Ayush, is something I have asked myself multiple times. Primarily, I worked as an SRE at Media.net before this, and Media.net as a company and I as a professional grew together — I saw the company scaling along with me. Media.net branched itself into multiple products, and each product started replicating its own infrastructure: each product had a separate streaming service, separate database and data services, and so on. (And pardon me — I haven't taken their permission to discuss this here.) So there was always a question for the SRE team: should we merge all of these together and give a centralized streaming service, a centralized storage service, and things like that — which is actually a graduation as such.
The pushback, obviously, was: say one product had a sudden ramp in traffic — would it cascade to the other products, something similar to what happened with Google Photos and Blogger and YouTube? So this comes with a cost. Once we say we are going to have a centralized storage solution, there has to be a cost to the company: storage becomes a separate organization, there have to be separate resources allocated to it, there has to be separate continuous monitoring, continuous deployment, continuous everything — it's not a one-time job. So it's a commitment the organization has to make; the organization has to take the plunge. Once that is made, it becomes the responsibility of the team to subscribe to an SLA, and all the other products within the organization will see the centralized team as a product. This product has to advertise an SLA and stick to it. And coming back to the question of how another team comes to know that it is actually a victim: it's because the upstream is not delivering the guaranteed SLA. Now the team has the right to go and ask: you guaranteed me this SLA — where is it? Whereas in other cases, if there is no agreed SLA, it becomes a nagging thing, like one customer repeatedly complaining just because one 503 comes up. You can't complain about one 503 — that has to be gracefully handled as an exception. You should complain only when there is a real problem, when the SLA is not met. So that brings in processes for how you react to internal customers. It's a big plunge the company has to take; it's a cultural shift in itself. And there has to be a return on investment: because of this team, other products will reduce their repetitive work, and there is a considerable saving of time to improve the product as such. So once the return on investment is quantified — and obviously there will be problems when the storage team first comes up, because you can't have tight controls from day one — the benefits that accrue over time will be real once this system comes in. So it's a graduation. I'd say client-side degradation in the design philosophy would be step one, a short-term solution, and the medium- to long-term solution would be having a dedicated team, if you think it makes sense. Then the dedicated team's charter will cover isolation, quotas, capacities and whatnot, how to scale the system and so on — it's no longer the product team's headache.

So I have a couple of things to ask you. I understand that — again, if I am running, let's say, a production engineering team or a DevOps team or a similar team in an organization which is really big — you're talking about return on investment, which is great, because somebody has to put in the funds, the time, the effort to build these functions, to build this group, and it's a gradual process. So do you guys have any incident or anecdote about how you actually managed to measure the return on investment? Because, let's say, there's a huge SRE team and you still have massive business impacts. We'd love to hear it — along with a war story; it has to be a war story, because, hey, SRE, right?
Life's never easy there. It's fun. But we would like to hear your views on this return on investment, because that's a very good measurement. And also, if we have audience members from the startup world, it serves as a great idea for them: maybe this is how I can seed something like this, because I can convince the higher-ups — the people who sign the checks or give the endorsement — by saying, hey, this is how you can measure the return on investment. I mean, if the program works, it works; else we would know it does not. So I would really like to hear views from you guys, starting with Kalyan, since you were the last one to set that topic in motion: how do you really measure return on investment on this whole massive culture change? It's not easy for an organization to really sign up for — how do you measure it?

So, based on my experience, I would compare two ways. One would be the previous organization, where we actually took the plunge for certain services, to make them centralized. The return on investment there was a bunch of things. One is the availability standpoint: you start with a pilot project, and the team shows that there is increased availability for the whole service and a reduction in the number of outages. These are all things that are generally not measured in the initial phases of an organization, when people just run to keep the site up: the number of outages that happen, the number of times the site has gone down, all the necessary measures that go into calculating an SLA. Say, throwing a 404 is not something that would be an SLA miss for one kind of product, but on a bidding platform, throwing a 404 would probably be something wrong. So, calculating what all goes into an SLA — and all of this, at the end, goes to the higher-ups with a dollar number attached. On the availability front we couldn't always quantify the dollars, but we were able to quantify the reduction in the number of outages and the man-hours spent. There were certain other efforts made, you know, in high-density compute settings with an on-prem cloud, and those did involve a lot of dollar numbers, stating how underutilized the resources were and how investing in a certain set of aspects to make your compute high-density would give better utilization of resources. So that's something I would call out as a way you can measure it: either you show your output as a reduction in toil and a reduction in outages, or you show it as a dollar number, to convince the higher-ups that this is actually a worthy investment to make. And then the consistent follow-up — because something that looks really great at the beginning might need certain course corrections as we go forward. That's where the SLA comes in: we have to show that the SLA is always met and that there is continuous adoption of the new change. LinkedIn's culture also has something called office hours, where, when there is something that is centralized and there is a problem, the consumers come to the office hours and ask questions — this actually smoothens the process of adoption. So there has to be a way to adopt the change.
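As an aside, "showing that the SLA is always met" rests on simple error-budget arithmetic; a standard worked example (not LinkedIn-specific):

```python
# Standard error-budget arithmetic: how much downtime each extra "nine" of
# availability actually allows per 30-day month.
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes

for slo in (0.99, 0.999, 0.9999):
    budget = MONTH_MINUTES * (1 - slo)
    print(f"{slo:.2%} availability -> {budget:.1f} min/month of allowed downtime")

# 99.00% availability -> 432.0 min/month of allowed downtime
# 99.90% availability -> 43.2 min/month of allowed downtime
# 99.99% availability -> 4.3 min/month of allowed downtime
```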
You can't just dictate: from today, we will stop using your MySQL services, there will be a centralized database solution and everybody has to use it. There has to be a way to socialize it. There has to be a way to help people come to the service and adopt it — for example, in the Bigtable case, there has to be schema help: help people understand which is the right schema, the optimum schema, to use. People should have someone to reach, someone to review with — all of this has to be built. Only then, I think, will the whole change be successful. So initially there has to be a goal set — a dollar figure, an availability improvement, or a reduction in toil — then you help people adopt, and measure it as it grows. And finally, at some point, you call the whole migration successful when the goal is met.

Piyush, your take on this — especially the ROI part of setting up SRE, and especially in the context of growing companies, or companies adopting it for the first time?

Right. Well, fortunately or unfortunately, I've been in that shoe multiple times — signing the check myself, and at times convincing others to sign the check. I'll go back to one of the gigs. We had a situation where we didn't really have a thing called reliability as a function, and the first goal was to — I wouldn't say educate; educate is the wrong word here, it has a sense of preaching — I would say introduce the notion of reliability. Now, let's forget the word for a while. What were we solving? Why did we even want to do this? We always have this passion of saying: look, whatever we do, some needle has to move somewhere. If a needle doesn't move, there is no point doing whatever you're doing. What that means is there had to be a measurement of something. So we started looking for the exact thing that the business wanted to improve. Now, the interesting bit here is that at the end of the day, it all comes down to measuring losses. A friend of mine, Nishant — he's my co-founder as well — says a very interesting line: how do you measure a loss in the case of a downtime? Well, it's possible the customer would never have shown up. It's always going to be approximate. So how do we make those quantifications? In the case of one product company, I remember our quantification was the number of negative stars on the app store. When we were running a fintech company, our Zendesk tickets used to be our measurement — because that is the first and foremost thing that touches a real customer. Like I usually say: something broke, no customer complained — is it a downtime? Should it even be fixed? Nobody got affected; why would you even go about fixing it? So, pick out the points which actually touch the customer. We started measuring those. Our next question was: can our system even answer this question — when I look at this data, okay, on day X a customer filed a Zendesk ticket for Y, can I trace this back to the data in my system on why this thing went down and how long it was down? Now, next, we take it to the financial team.
We took it to the product and the sales guys, and we said: look, if this thing had not happened, or could have been avoided, how much revenue do you think you could have secured? And revenue is just one kind of number here — if the company is about the number of likes, how many more likes would you have gotten? If it's about the number of shout-outs or comments, how many more would you have had? They need to put a number to it; they need to sponsor this for me. Because when they put a number to it, they might say: look, I would have sold two more dollars — maybe not worth it; I would probably have to spend 2,000 to secure those two, so it's not really worth solving. Or they could give me a very educated stab and say: look, today it's two, but I'm sure if this happens again it's going to cost me a lot more. So they actually become the sponsors of the entire effort. Now, obviously, this is where we use our experience, and we say: look, my business and my revenue stream could have improved by Y percent if these things had not happened. This is where the business owners take a call and ask: okay, how much time and money do you think it will take to solve this? And this is where, as reliability heads and owners, we put a number to it — obviously after having done our own experiments to see what we lack and what we need to get — and we put a project forth and say: look, I could secure 80% of this by putting in 20% of the investment. It's a very data-driven, empirical conversation, and that's when we get a sponsor from the business and the CEO saying: look, I think this is worth the investment — because at the end of it, the time and energy that we get to spend on the problem is also going to be provided by the business guys, and this is where we get buy-in from them. Okay, this is what we are changing, and that becomes our KPI for the quarter. Now, no questions asked — there are literally no questions going to be asked beyond this about why am I even doing this, what am I going to get out of it. Because nobody has to know that, look, by doing SRE, I will secure... No. I want to increase my revenue line, and this is done by actually doing SRE. SRE becomes an answer, not the question. And what are you going to call it? SRE is just the term we coin — because, like design patterns as engineers, we don't keep narrating the same thing over and over; we use a design pattern. So I'd say, at this point, SRE is just a design pattern that we apply to the problem.

So actually, I would definitely want to dig more here. One of the questions asked by an anonymous attendee was pretty much the same: as an SRE engineer, for ROI, what are the key metrics that you should be keeping an eye on? I think that is very close to what you said in terms of OKRs and KPIs and so on. But are there any general ones, or would you say this purely depends from product to product and on the nature of the problem?

It depends from product to product. As I said, I was citing a couple of examples here, and multiple times the journey looked pretty much the same. First, identify where the customer is feeling the pain, because when a customer feels a pain far too often, they're going to abandon your product.
That is something we need to protect. Somebody then needs to put a quantification on that pain: if this pain could have been avoided, we would have grown, or gained, so much more. Now, interestingly, none of this was done using tools to begin with. For a couple of months, we actually sat down with just Excel sheets — in one of my talks I showed a demo of those Excel sheets; I think it was presentation one. We just dumped data into those sheets. As they say, all businesses can start with a spreadsheet, and this was just another business line inside a business line that I had to start. Then it grew from there: just measuring — okay, we need to do X; what do we not have; how much time will it take; what is the total effort? Then tackle the top of the Pareto: the minimum effort I can put in to get the maximum gain. That's how you start, and that's when you don't need to beg for return on investment, because you're already going in with that approach.

This is really good to hear — how you get started along the curve. Now, jumping to the other extreme end of it: Manjot, during your time at Google, in these mature organizations which have probably already done the Excel work and now have systems to measure it, how does allocation really happen for a new product? For example, say there's a team of five engineers already taking care of a given product or a set of products, and then two more products are incoming. How does it really work there? Just because it's a sizable company — and let's say, hypothetically, funds are not a problem — do you just throw more people at it? Or do you build smarter systems? Or something else? How does it really go?

Actually, I'll say something extremely surprising here. As large as Google became as a company, there was actually a very, very active stress on ensuring that SRE does not have to scale with the number of services in terms of headcount and resources — it doesn't have to scale with the number of services at all. And I've seen this happen in front of my eyes, where the number of services and things we were responsible for grew from a few to hundreds. What we had to do to make that happen with the same number of people was ensure, as I mentioned, that there was a huge, huge stress on automating the parts that could be automated. And I think Piyush and Kalyan both touched upon two very important aspects of deciding how to calculate the ROI of where SREs should be added and what they should be dedicated to work on — Piyush touched upon the business side, and Kalyan touched upon toil and the measurement of toil. From my experience, it's absolutely true that one of the best ways to figure out which teams to dedicate more SREs to, and what resources to add, is to tie it to business metrics. In the case of Google Cloud, for example, when we were deciding which projects our SREs should be working on, we actively tried to see: has this outage caused loss to our customers — actual revenue loss in terms of downtime, in terms of whatever other metrics? Has this caused Google Cloud to lose very important deals, or any other important metric like that?
When I was in Google Photos, of course, it was not about revenue; it was more about: is this a feature that, if it experiences downtime, very actively confuses my user? Or is it just a nice-to-have — for example, some of the Auto Awesomes we created? That was like a gift to the user; you don't really expect it to happen. But if it is an active share by the user with a friend, and the friend is standing right there, and that photo doesn't reach the friend — now that's an experience you don't want to compromise on, because that is a very important part of your user experience. So you really have to tie it in to the critical metrics of the products you are taking care of. And the way to justify the ROI of increasing or decreasing your SRE budget for a particular product is to figure out what those metrics are, actively monitor them, and then measure them both before and after the change. I think this is one of the best ways to show change once it has happened — and even before it has happened, to come up with estimates of what the change should be. As Piyush mentioned, it will always be an approximation before it has happened. To address the point that Kalyan mentioned around toil: this is another thing that we really focused on, because, as I said, we very actively did not scale the SRE org beyond a certain number of people. So every now and then — and I've mentioned a couple of the tools that we built internally already — we would figure out: is this a particular action that a lot of the SRE teams are taking on a daily basis which can truly just be automated away? Should we come up with a tool for incident management? Should we come up with automated monitoring on the basis of a certain framework, instead of our SREs writing Borgmon configs? Anyone who's from Google here will understand how terrible that is. So yeah, every now and then the production leads, the ATLs, the senior SREs would decide: these are things that really should be automated away for the SRE org as a whole, and maybe we should dedicate some part of our SRE teams to building out this tool or this platform and then have everyone use it.

Got it. That's actually very interesting — those are great ones. I think there's a question that somebody has asked. You want to take it, Saurabh?

No, actually, I wanted to add on to the ROI part. One of the things I learned was to understand what does not create value, rather than what creates value. I have worked in teams where people were more happy saying that we use Chef instead of Ansible. "Great job configuring servers this year," said no CEO ever. So, you know, don't be disappointed that using your favorite tool may not create direct value. And one of the things my teammates always used to tell me is that people want to be happy, but more than that, they don't want to be unhappy. So if you have a steady-state system and people are happy with it, don't introduce changes when you can avoid them. And if you're not making money, you're spending money — SRE is sometimes viewed as a cost center, and the direct way to reduce cost, if you look at dollars, is to go looking for waste. That is always the case: every year, there is someone in every team who writes a script to find the untagged instances that were created in 2014 and are still running.
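That perennial script is small enough to sketch — here with boto3 against AWS EC2, where the "Owner" tag convention and the region are assumptions for illustration:

```python
# Illustrative boto3 sketch of the perennial cost-saving script: find EC2
# instances that are running but carry no "Owner" tag, i.e. likely forgotten.
# The required tag key and the single hard-coded region are assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
paginator = ec2.get_paginator("describe_instances")

for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"] for t in instance.get("Tags", [])}
            if "Owner" not in tags:
                # Likely orphaned: no one claims it, yet it bills every hour.
                print(instance["InstanceId"], instance["LaunchTime"])
```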
So proactively looking for these things when you don't have anything better to do is always a good move, and showing that you're saving money is also very important. Sometimes the instinct, for anyone who comes from a certain background, is to just clean up the stuff and forget about it — so I think the marketing aspect of it also matters a lot.

I think there's a question in our chat: are there any tools that Google doesn't build but signs up for? We use Workday. Oh, good lord. There is a focus on building versus buying, and, well, I was one of the people who used to build, so I think I'm happy with that.

So yeah, that also brings us — and this builds on what Saurabh touched on — we've talked about process, measurement, and business, which is all very number-driven; now let's talk about people. They usually say that the most precious asset you have in site reliability engineering is the people. It's a discipline which, in its very concept, was built to prevent burnout, so that people would become more productive and learn a lot more. So, coming down to the people part of it: usually, when organizations start this journey — I know a lot of startups are thinking about making the journey, LinkedIn has already set out on it and is somewhere along it, and Google is considered to be kind of a pioneer, given all the material and literature that gets published — when a group of people is moving on this journey, what kind of stimulus, what kind of mindset change do people really need to have? And this is primarily where — let's say there's a company which is very operational in nature, for whatever reason; reasons could be numerous. Usually it happens that somebody comes and says, "Oh, SRE!" Or, as Piyush said, if somebody is a leader, the person observes the patterns and goes and suggests it. But then the next question is the mindset change and the cultural change. So, would you want to talk about your experiences with the cultural paradigms and the shifts that happen there?

I'll take some of that. When I started, I started with a team — you know, the whole company had like seven SREs. So all our stand-ups were like evening cafeteria snacks: we sit together — this is not working, this doesn't look good — and we sort it out. That's our OKR planning, that's our KPIs, that's our stand-up; everything is done in the cafeteria. Of course, some of our OKR planning was also about how the cafeteria food is not good; that we still didn't figure out a way to fix — no post-mortem helps there. But complaining about cafeteria food together — I think there's employee retention in that; there's an ROI. But going back to the meat: from there, the SRE team rapidly grew. We became a team — I don't want to say the number — we became a team that doesn't fit in one room, looking after multiple products at the same time. So the meetings became chaotic. You can't have even one meeting per week where you discuss what happens in each team. Stand-ups for SRE are difficult, and each SRE team is working closely with its product team.
Half of the discussion doesn't make sense to members of other teams who are sitting in. So there were approaches to calling things out. This was again a young company, so we were figuring out processes: we would circulate a sheet to capture whether somebody needed help or needed representation, and then we would have slots for them to speak, and so on. That's one way of implementing it as you evolve — that's a cultural change, I feel.

Then, once I came to LinkedIn, LinkedIn had multiple smaller groups, and we are talking about — I don't know the exact number, but something like 600 or 700 SREs. You are obviously not going to put all of them in a single room for a stand-up. What happens here is still an evolution of what we were doing at the young company. Here, a newsletter goes out to all your customers stating these are the new features that have been released and this is what's being worked on. So if you are a consumer of the service, the newsletter keeps you updated — that's the crux of how we reach consumers. And for something like the question that was asked — how do I know what issues are happening, so that I know whether I'm a victim or the cause — you have a stand-up where a person comes and proposes the change they are going to make, or where a team responsible for an outage is probed on where things stand and when the situation will return to normal. So it's like a daily stand-up only for teams that want to present their case when something affects a cross-section of teams. The stand-up does not take up something that affects only my team; that can still be sorted within my team's realm. That is how the evolution happens: you have to give up certain things as you grow, and that's a personal change in itself. Earlier you would know what happens in your company end-to-end, in your SRE realm end-to-end, but that doesn't scale as the company grows. In fact, if you still know everything end-to-end after the company has grown, something is missing — the SRE setup is not right. There has to be a clear demarcation of ownership; each team has to own its resources and make sure they work as expected, and all of these together, as machinery, move the company forward.

That's a good one. This is great to hear. At this juncture I'd like to open it up to the audience as an open house for any questions. It's always fun to hear more from you, correct? It would be really amazing if the audience came forward with questions and had our panelists answer them. I think there's one from Rishabh Vora, who's asking: any recommended resources to get started with SRE, or a pathway to get into SRE roles?

The first two talks. Yeah, the first two talks, okay. Manjot, do you have a different take on it, other than the first two talks? Let's not have Piyush run away with it.

I would say — and I'm a person who believes more in learning by doing —
if you can get involved in parts of the production lifecycle. Maybe first, if you're an engineer, you start building tooling for some of these production services. That's one way to get involved and to understand what kinds of problems production systems actively face. Or being an engineer on one of the infrastructure services also gives you visibility into how things at large scale should be, what kinds of problems they run into, and how they should be tackled. So that's one part.

That's really good. All right, audience, please flood in with your questions; we would be very happy to take them. Another anonymous attendee has a question: "I'm joining a team of SREs. What should my approach to work be so that I stand out?"

All right, this is great. Be more curious than your colleagues. And I'd say: have the ability to get bored. The reason I say that is because SRE is really a very boring job at times, and it demands creativity during that boredom. And over the years — I mean, with this whole sysadmin versus DevOps versus SRE thing, I don't even know when I became an SRE. All I know is there were servers; there are servers till date. I don't know when the title changed for me, because I'm still writing the same shell scripts and code. I was really fortunate to work with a lot of people over the years, and this is the trait that stood out: the folks who never got bored made really awesome engineers, the ones I would trust when there was a fire, because even during that boredom they would come up with so many creative ways to improve things. So, while it's a very intangible skill, it's something to really look for: the ability to keep doing mundane work and still chase perfection and improvement.

To the attendees, a quick announcement: you are unmuted, so if you want to call out your questions directly, you can do so now.

Okay, so I have a question. How often do you have to work on technical debt items — for example, upgrading databases? From what I've seen in my organization, it sometimes takes years to do a single upgrade, because we have limited resources and there are performance degradations, but you also need to move on with upgrades — MongoDB, for example. You need to do it because the life cycle is ending, and at the same time the software running on those databases is facing its own tech-debt issues.

Tech debt — I think Rishu and I did a BoF on this a while back at Rootconf. I can see where this comes from, Ayesha, because I was in a similar situation. The mundane tasks Piyush was talking about hit especially hard in stateful systems, like databases. In a stateless system it's comparatively easy: you do a blue-green deployment and swap your resources out. In a stateful system, the data makes that a real problem. I think it boils down to this: it's a repetitive effort over a period of time. After you finish one cycle of upgrades, a quarter or two down the line another cycle is staring at you. So some level of automation should be planned for this, and the ROI you can show is: I have lost X amount of engineering time — and beyond the time, the humans might not be interested in doing so many upgrades repeatedly. The company has to be convinced, and somebody probably has to spearhead how automation can fit in so that the whole upgrade process happens smoothly, ideally without any outage to the system. Something that really helps here: if an outage has happened during one of these upgrades, that's a good lever — "if we dedicate more effort to automating this upgrade process as a whole, we would be avoiding these outages." It mostly boils down to organizational will to change it, but as SREs we are the people who have to show there is an ROI in the organization dedicating people to it. Coming up with a design proposal to the organization stating the benefits of investing people over a period of time, and then pushing to get sponsorship for that design, is the right way to approach it, because sometimes somebody has to bite the bullet. Otherwise it just stays repetitive, and it takes a toll on the employees.
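To make the kind of ROI pitch described above concrete, a back-of-the-envelope sketch might look like this — every number here is hypothetical, purely to show the shape of the argument:

```python
# Hypothetical inputs for the "cost of manual upgrades" pitch.
engineers_per_upgrade = 2
hours_per_engineer = 60        # one upgrade cycle, per engineer
upgrades_per_year = 4          # roughly one cycle per quarter
loaded_hourly_cost = 75        # USD, fully loaded engineering cost

manual_cost = (engineers_per_upgrade * hours_per_engineer
               * upgrades_per_year * loaded_hourly_cost)

automation_build_hours = 300   # one-time investment to automate the cycle
automation_run_hours = 8       # residual human supervision per upgrade
automated_cost = (automation_build_hours
                  + automation_run_hours * upgrades_per_year) * loaded_hourly_cost

print(f"manual:    ${manual_cost:,} per year")      # $36,000 every year
print(f"automated: ${automated_cost:,} first year") # $24,900 once, then far less
```

The dollar figures matter less than the structure: the manual cost recurs every year, while the build cost is paid once — which is exactly the sponsorship argument.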
That's a great one. Any other questions from the folks connected with us?

All right. I just have one thing to add to what was said earlier about how you stand out as an SRE — a couple of quick tips from my end. Whenever you join an SRE team — and this is a personal practice I've kept across multiple organizations — make sure you have a PlayStation or a video game console at your desk, hooked up, and that you get enough time during work hours to conveniently play whatever you want without being disturbed. You're going to become famous real fast, because people will want to know either how you're so terrible at your job that you just don't care about anything and play games, or how good you are that you solve problems so well you can sit there playing games and still get paid for it. And I actually do this, because the more time I get to play video games at work, the more it tells me I need to start looking for newer problems to solve, or look at existing problems and solve them more cleanly. It's just a personal benchmark — try it at your own risk, disclaimers apply.

I'd like to end the conversation with this wonderful panel on one note. We have someone who has asked: they are a small startup with one SRE. What advice would this panel share with the other engineers in the organization, so that they can make the service more robust before they expand the SRE team? It's a bit generic, we get that, but the idea is: are there any tools they could explore, anything they could do? This is a great question because it's so forward-looking.
You have one person who basically runs the show if things go wrong in production. How do you shift left, as you would call it, to inculcate that behavior in the dev teams? Who's taking this one?

Yeah. First, the fact that you can ask this question right now means you're in a great position. This is where things start becoming better — standing up and asking how to improve is one of the biggest things. Now, the biggest learnings I've had — actually two of them. One: failures are going to happen, no matter what. There is no amount of insurance or cover that says "I'll be 100% reliable." There is no such thing. The only thing that matters at that point is how willing you are to embrace that failure and improve the product from there onwards. If that one single SRE were the root cause of all failure, you might as well fire them today — failure is inevitable. The only way to improve from there is to keep building those learnings into the product itself. This is a part Manjot mentioned briefly, and I really liked it: the job of that one single SRE is to make himself or herself invisible in the organization. Everything we've been talking about — building a common internal platform, for instance — is about developers being able to do their own jobs in the most seamless manner, where none of it feels like it's getting in the way.

Tools will only go a certain distance. What goes further is your curiosity to go to the depth of those failures and ask, each time: what can I actually change in the product itself? We talk about measuring downstream failures and upstream failures all the time, but the measurements alone won't help. The real actionable item is that the product itself has to embrace failure. What I mean is: if your system is going to make any number of calls to upstream or downstream services, write the product itself — the code out there — in a way that can adapt and take suitable action on those failures. It comes down to this: you have a checkout button that makes an upstream payment call; that upstream is going to fail, and there are going to be thousands of ways it fails, each one unique. The answer is not to write a condition block or an alert for each one of them. The answer is: if I can detect such failures, then when they happen, how can I switch off, or circuit-break, or move my product to a deferred payment model? You see this all the time — when upstream payment gateways go down, products just move you to deferred payment. That's the single biggest learning I've had, and that's the advice I'd give.
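A minimal sketch of what "the product embracing failure" could look like in code — a simple circuit breaker with a deferred-payment fallback. The `charge_upstream` and `defer_payment` callables and the thresholds are hypothetical stand-ins, not anyone's actual implementation:

```python
import time

class CircuitBreaker:
    """Trips after repeated failures; allows a probe after a cool-off period."""

    def __init__(self, failure_threshold=5, reset_after_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed (healthy)

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let one request probe the upstream after the cool-off.
        return time.monotonic() - self.opened_at >= self.reset_after_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


breaker = CircuitBreaker()

def checkout(order, charge_upstream, defer_payment):
    """charge_upstream and defer_payment are hypothetical callables."""
    if not breaker.allow():
        return defer_payment(order)  # degrade gracefully instead of erroring
    try:
        result = charge_upstream(order)
    except Exception:
        breaker.record_failure()
        return defer_payment(order)  # fall back on this request too
    breaker.record_success()
    return result
```

The point of the pattern is exactly what's described above: one generic detection-and-fallback path instead of a condition or alert per unique failure mode.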
Manjot, sorry — go ahead.

What I'd add here, similar to what you said: make friends with your dev team, and enable them to be friends with you. It is definitely less about tools and a lot more about people. The more vested your dev team is in understanding the value added — whether, at the end of the day, it's business metrics or just the product experience for your users that improves with reliability — the more you improve the entire process, from day zero of working with your dev team.

Kalyan, do you want a go? You've seen a team grow, so there's a really interesting point here.

I'm still, for some reason, stuck on Ayesha's question. I'd say there are certain debts that this single SRE will take on, will leverage, for the organization's asks — and they have to be documented. They have to be known to everybody in the team, so that people understand what they can expect from the system. On day one you can't say, "my system is not scaling — great, let's move all of this to Kubernetes and put it in an auto-scaling environment." It's just one person running the show, so they have taken on some debt. There are better practices, but as long as the debts being taken on are known — "we are taking this risk because this is the deadline we have" — it's fine. And all the asks, all the requirements made of the system, should be within the realm of the debts that were taken. You can't expect something way beyond that; that's against the laws of nature.

So, related to debts and the laws of nature, there's one final question I'd like to take before we close out, from an anonymous attendee: "In my past, there was an application where every release broke the application more and more, and the SRE was always firefighting rather than doing any productive work. How can we actually work with such an application?"

Since this is indirectly about the same thing — I see it partly as a debt issue, but I also see it as a relationship matter. It's the relationship between the dev team and the SRE that matters here. If I were the single person working in such a team and this is what was happening, I'd want to say, "look, I'm not going to deploy any more of your services." But since I have to stay civil and diplomatic — and keep myself functioning, especially in the COVID era — I have to come up with a measurement instead. I should be able to say: "see, this is how my OKRs are suffering; this is the amount of failure happening due to your system." One accounting-based approach is to reduce the number of deployments the dev team is able to make — but primarily, it should be a data-driven way to make the dev team feel the heat. And rather than just feel the heat, they should feel for you: they have to see that this takes a certain amount of toil, and the relationship has to be built such that the team feels that. The dev team should not think that as long as they made the release, they're done. There can be mechanisms for this. For example, something I've seen in my current organization: there is a "release shepherd" — a dev who comes along with the release, alongside the SRE.
It's no longer just an SRE function. The dev team sits in, undergoes on-call, to understand the problems in the whole release cycle. The person who is there gets first-hand experience, and that's how a relationship is built. Then the person goes back to their team: "yes, there are real problems," and before the next release that team comes back with "these are the things we will work on to fix — do you think this will move the needle?" They can have that discussion with the SRE team, and it can be validated during the next release. So building a relationship with the team helps here. Some of the tenets repeated at every SRE town hall at LinkedIn are "act like an owner" and "build relationships." Building relationships with the team is what I think will help here.

That's great. I'd just like to add here that a lot of the role and responsibility of an SRE becomes writing a lot of creative tooling, and shaping philosophy, which you discuss with the devs to come up with ways to avoid these problems. A lot of the time we end up writing tooling to avoid these things, or we share agreed philosophies — "let's do it this way so that you don't end up doing that." But even when we do that, it sometimes becomes hard to convince the other person and make them buy into the tooling and the philosophy. Sometimes you get frustrated: why are you not buying into this?

I can chip in on that one, and it's about process. Everybody hates the P word — by P word I mean the process, not the panelists. With process, the thing is: if I'm going to give you a ten-page checklist to fill in and say only then can you go to production, there is no way in the world you won't hate it and won't look to bypass it. Given how far we've come with infrastructure-as-code and everything-else-as-code, it is high time that as SREs we start looking at process-as-code. If you have a process you've ironed out — yes, it will have gaps, just like code has bugs; that's okay. But the more you can take that process, convert it into a coded, automated system, and make it self-serviceable, the better. In my experience, a lot of the companies that say they're successful at SRE do exactly that. Amazon does it — they've been doing it for years. Google does it; Kalyan can correct me, but LinkedIn is on a similar path; and a lot of startups, at least the ones I've been associated with, ended up taking that route too. Make it lean — and by lean we don't mean converting ten pages to two pages; we mean converting those ten pages into a software-driven system. And that is where SRE becomes really important: I want these checks and balances, and if they're automated and you run into one, fine — things are bad, they're just bad. But can you come back quickly? It shouldn't be that every failed check means another 20 days for you; that's when the development partners start pushing back — "let's not do this." So that, at least, is my view: process as code.
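One possible shape for "process as code" — a sketch that turns checklist items into automated pre-deployment gates. The individual checks here are hypothetical stand-ins for whatever a real organization's ten-page checklist contains:

```python
# Each checklist item becomes a small, automated check instead of a form field.
# The checks below are hypothetical stand-ins for real, org-specific gates.

def has_rollback_plan(release):
    return bool(release.get("rollback_plan"))

def passed_ci(release):
    return release.get("ci_status") == "green"

def change_is_approved(release):
    return len(release.get("approvers", [])) >= 2

CHECKS = [has_rollback_plan, passed_ci, change_is_approved]

def gate(release):
    """Run every check; block the deploy with actionable reasons, not a PDF."""
    failures = [check.__name__ for check in CHECKS if not check(release)]
    if failures:
        raise SystemExit(f"deploy blocked, failed checks: {failures}")
    print("all checks passed, proceeding with deploy")

gate({
    "rollback_plan": "revert to v1.4.2",
    "ci_status": "green",
    "approvers": ["alice", "bob"],
})
```

The key property is the quick comeback mentioned above: a failed gate tells you exactly which check to fix and can be re-run in minutes, instead of restarting a 20-day paper process.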
I don't know what everybody's viewpoint on that is — but on that note, Saurabh, do we have any further questions or chats?

I think there was one question on the YouTube stream about how SREs keep up with the whole picture of product change. I addressed it briefly, but if there's anything we want to call out —

Sure, do you want to take that? One thing I feel when working with product teams is that keeping the product team and the SRE team separate is sometimes a necessary business function. However, building a wall between them — the product team throws something over that you have to deploy, and you throw something back that breaks — never works in the long run. So one of the things we do, as Kalyan also mentioned, is stay in touch with the dev team: at the beginning of, say, quarterly planning, call out the dependencies you have on the dev teams, and have them call out the dependencies they have on you. Keep that communication alive, and not restricted to your own team — see to it that the information flows upstream too, so that the department heads and everyone are aware and that isolation is never created. So, yeah.

That's a great point. All right, on that note, I think we're already at the end of this session. Once again, thanks a lot to all the panelists — thanks, Manjot and Kalyan, and Saurabh, thank you so much for being a great moderator. At the same time, thanks a lot to our audience; it's been some amazing interaction. We'd always love more and more interaction, but hey, there's no ceiling to that. And with that, for this BoF, we'll call it a day — or a night, depending on where you're tuning in from. So thanks, everyone. Stay safe, stay focused, and we hope this session really helped. Thanks a lot, guys.

Yeah, thanks. Thanks. Thanks a lot. Thank you.