Thank you so much, and thank you for the introduction. Really excited everybody's here to learn from our journey, our mistakes, and our expertise at the same time. So, a little bit about who's going to be talking today. I have my buddies Surya and Thor with me. Surya, why don't you give us a little background about yourself, and Thor, after that, maybe a little background about yourself as well. Yeah, sure. Thanks, Ravi. I've been at Harness for the last 18 to 20 months or so. I joined to build out the SRE team at Harness, and before that I was in the same role at Symantec. I started my career as a software engineer and then slowly moved on to OpenStack, then AWS, and now GCP at Harness. With SRE, every day there is something new to learn, and even on this call, during the Q&A, please feel free to ask questions. It's always good to learn, and that's what keeps this whole SRE thing interesting for me. Yeah, so please ask Surya the questions; he's the experienced one in the room with regards to SRE. My name is Thor Taylor. I work in product here at Harness, and I've been here for about nine months. I've been spending my time working on a product called Change Intelligence. Obviously at Harness we've been heavily focused on the DevOps engineer, really the dev lifecycle when we think about CI/CD. I've been working in, I'll say, the data realm, when we talk about logs, metrics, and traces and these types of things, for about 15 years, so I'm bringing a lot of that experience here as we start looking at SREs, understanding the separation that currently exists between DevOps engineers and SREs, and trying to figure out how we can build a product that helps bring those worlds together. Awesome introductions, fellas. And I'm Ravi Lachhman, the evangelist here at Harness, really focused on the ecosystem. I've also been creating outages for Surya to deal with, but now, with all the guardrails we have in place and all the learnings we have, my outages are blameless; they can't pin them on me yet. But let's go ahead and get started. So what are we going to be talking about today? The first thing is an introduction to reliability in general. Reliability has different meanings for different people, but really: what is this SRE role, and how do you use it to further engineering innovation? Then we'll talk about how, a couple of years ago, when we really started focusing on reliability here, it was really a journey. Funny story: Surya's first week here, we had one of our biggest outages, and he's still here, so thank you, Surya, for sticking around. We had to go on a journey to make our platform more reliable for our customers, because our customers depend on us, and we'll also cover the things we've run into along the way. Even as we scale our reliability engineering practices and the robustness of our platform, there's always room for improvement: things we notice in the marketplace, things we notice internally, things we notice externally. We'll put all of that in a pretty package for you in the next 30 or 40 minutes. And again, we'll keep it interactive; keep those questions coming and we'll take them towards the end.
We'll address every single one of them as we get towards the end. So, reliability: what does this actually mean? My car is reliable; it starts every time. But in software land, or even in the physical world, what is reliability? There are three main pillars. As an end user, you would view something as reliable if, mainly, it's trustworthy. It's funny: if I ever think my AT&T fiber is down, I go to cnn.com, because CNN is never down, right? If I can get to that, it means something's wrong with AT&T. They're a trustworthy source. And because they're trustworthy, and maybe authoritative, they're also highly performant. Have you ever gone to one of your favorite websites and it's just not there? When's the last time you saw the fail whale on Twitter, or the error dog looking at you on Amazon, or CNN just not coming up? Having these things be performant makes them appear reliable, whether it's actual reliability or the appearance of reliability. And that's what a lot of site reliability engineering does: it gives you the appearance of reliability, or it mitigates the manifestation of an outage. And if something's trustworthy and performant, clearly it's going to be available to you. So with these three factors, you might say the system or the object is reliable. But there's another problem here. If we only had one server with one static IP and one instance of Apache serving something up, that's pretty easy: if it goes out, we boot it right back up; if someone unplugs the power cord, we plug it back in. As much of a joke as that is, our systems today are quite complex. There are dozens to hundreds of transitive dependencies, highly coupled or decoupled dependencies. If you're leveraging a modern microservice framework, you might have one call from the user that fans out to a dozen backend services to aggregate and consolidate a response. And distributed systems are complex: we might design for the happy path between service A and B, but it needs to call C, D, E, F, and Z. How do we measure A to Z, with the intermediate boundary of B? This gets very, very complex. It's no longer a matter of unplugging the thing and plugging it back in. There are hundreds of services and distributed infrastructure; you might be using containerization, the public cloud, Kubernetes, all the buzzwords. You have different infrastructure, different application requirements, just different expectations. In my background, I used to work for an investment bank as an application owner. I only owned about five endpoints, but what was my contract for a response? We'll get into this in a couple of minutes: how do we know something is performant? What is my agreement with the business, or with the end user? Each one of these could be different, and that negotiation is part of what you navigate in site reliability engineering.
And so this is the adage: slowness is the new down. You might have heard this from a few pundits. It's very rare that a service is completely down, a black hole you can't reach at all. It's more common in distributed systems and technologies that you're getting a very slow response, a degradation of response. There might be a manifestation of a problem that's degrading things, say network constriction or database constriction somewhere. Slow is the new down. To your end users, if they're waiting 20 or 30 seconds for a response, you might as well just be off: attrition kicks in, and adoption drops after a certain point. Or think of shopping cart abandonment, the good old e-commerce example: people abandon their carts if they don't get a response. Same thing. If I don't get a response from AT&T right away, I'm hitting up my favorite site to see if my fiber has actually gone down, and it goes down more than I'd like it to. There's also this old adage my manager used to tell me, and still does to this day: you can't improve what you can't measure. There are subjective and objective ways of measuring system performance and system behavior. Objective might be: can we get a response in 1800 milliseconds? Subjective: is it slow? Well, what's slow for Surya might be fast for Thor, or vice versa; what's slow for Thor might be fast for me. Beauty can be in the eye of the beholder, but you need some way to baseline yourself and to measure: what is reliability? Can we measure those three pillars? Can we measure availability? Sure we can. Can we measure trust? Maybe, depending on user satisfaction. But finding the right measures is also a challenge. And then there's the old cats-versus-dogs argument between two evolving disciplines: the DevOps discipline and the SRE discipline. There are actually lots of similarities, and also differences, between a DevOps culture and an SRE culture or practice. There's certainly overlap. I caught a lot of flak on, pardon, not Twitter, but Reddit (the Reddit gods were smiling at me) when I made a table comparing how a DevOps engineer would handle an outage versus how an SRE would handle an outage, or, say, application clustering: one has to know the consensus algorithm, one has to know the number of nodes. They are similar jobs, but the focus differs, and the connective tissue between these two groups, especially if you're at a smaller shop where you do both the development pipeline and reliability because there's only one or two of you on the team, is information dissemination. These are both expertise roles. How do you disseminate that expertise, and how do you eliminate technical debt?
So, as a software engineer, and jokes aside, since I create the outages here, a lot of Surya's job is actually making sure information is distilled to me so that I don't have to worry about technical debt or how to scale a system; let me write the feature, and scaling will be taken care of by something else. And that's what Surya is going to talk about in a little bit. Then let's talk about some key measurements. As we go along this journey, you might hear a trio of S-words: an SLA, an SLO, and an SLI. Just for level setting, here's what these things are. The first is the SLA. I used to think everything was a service level agreement: uptime SLA, response time SLA. But there are actually nuances as you dig in, if you read the Google SRE handbook or any literature that extends from it, or even talk to thought leaders like Thor and Surya; they'll give you a more concise answer. A service level agreement is basically a commitment to a customer, internal or external, and it's crafted around customer expectation. In a very easy example, let's say I was selling Ravi's lemonade as a service, LaaS. Obviously, as my customer, I'll give you 99% uptime: if you pay me $10 a month for lemonade, I will be there 99% of the time. That's what a service level agreement is: a commitment to a customer, internal or external. Digging into that SLA are the SLOs, service level objectives. An SLO is basically how you're going to meet that SLA. How am I going to achieve my 99% uptime? Well, one particular measure might be response time, or reply time; there are other ways to measure it, but taking one easy-to-measure aspect, it might be: I need to reply in 2000 milliseconds, 99.5% of the time, over a set duration, say 30 days. So if you're my lemonade-as-a-service customer, you'll have your lemonade within two seconds, 99.5% of the time, over 30 days. That's what these SLOs are: they're time-boxed with a duration, and they show how you're going to meet the commitment. One SLA might have multiple SLOs. But there's one more, even more granular level of measurement, which is the SLI, the service level indicator. The definition of a service level indicator is that it measures compliance with an SLO. Given that, if you remember from the previous slide, 99.5% of requests have to be answered in 2000 milliseconds or less, the SLI measures that compliance. You can see here our first request was 1900 milliseconds, so that was a good, valid request. Our second request, let's say someone else asked and for some odd reason I had to think a little bit more, was 2500 milliseconds, and that is a bad, invalid request. It's really about making sure we calculate that correctly: with one bad request out of those two, we would actually be in a poor state based on this indicator for the SLO.
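As a rough sketch of the SLI calculation just described, assuming the 2000 millisecond threshold and 99.5% target from the lemonade example (the latency values and function name are purely illustrative):

```python
# Minimal sketch: measure SLI compliance against a latency-based SLO.
# Assumptions (hypothetical, for illustration only):
#   - SLO: 99.5% of requests answered in <= 2000 ms over a 30-day window
#   - `latencies_ms` is the list of observed request latencies in that window

SLO_THRESHOLD_MS = 2000      # each request should respond within 2000 ms
SLO_TARGET = 0.995           # 99.5% of requests must meet the threshold


def sli_compliance(latencies_ms):
    """Return the fraction of requests that met the latency threshold."""
    if not latencies_ms:
        return 1.0  # no traffic, nothing violated
    good = sum(1 for ms in latencies_ms if ms <= SLO_THRESHOLD_MS)
    return good / len(latencies_ms)


# The two-request example from the talk: 1900 ms is good, 2500 ms is bad.
observed = [1900, 2500]
sli = sli_compliance(observed)
print(f"SLI: {sli:.3f}  (target {SLO_TARGET})")
print("SLO met" if sli >= SLO_TARGET else "SLO violated")  # -> SLO violated
```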
And given there are only two requests there, clearly in a normal system it could be thousands upon thousands of requests over a very short period of time. But this is basically a leading indicator of whether you're going to breach your SLO or adhere to it. Another popular way of measuring, beyond response time, is the four golden signals, again coming out of the Google SRE book. There are four things you can measure: latency, traffic, errors, and saturation. Latency might be the delay in a response. Saturation could be a factor of load: our system is handling all the requests, but we're fully saturated, pegged at 100% memory, 100% CPU; we're oversaturated and pushing the bounds of the system. And clearly you can measure the amount of requests, or traffic, and also errors, as they start to increase or decrease. These are things to measure.
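To make the four golden signals a bit more concrete, here is a minimal, hypothetical sketch of checking a snapshot of those signals against alert thresholds; the field names and threshold values are illustrative assumptions, not anything Harness actually uses:

```python
# Minimal sketch: evaluate the four golden signals against alert thresholds.
# All names and thresholds below are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class GoldenSignals:
    latency_p99_ms: float    # delay in responses (99th percentile)
    traffic_rps: float       # requests per second
    error_rate: float        # fraction of requests that errored
    saturation: float        # e.g. max of CPU/memory utilization, 0.0-1.0


def check_signals(s: GoldenSignals) -> list[str]:
    """Return human-readable alerts for any signal that is out of bounds."""
    alerts = []
    if s.latency_p99_ms > 2000:
        alerts.append(f"latency p99 {s.latency_p99_ms:.0f} ms > 2000 ms")
    if s.error_rate > 0.01:
        alerts.append(f"error rate {s.error_rate:.1%} > 1%")
    if s.saturation > 0.85:
        alerts.append(f"saturation {s.saturation:.0%} > 85%")
    # Traffic is usually compared to a baseline; a sudden drop can also be bad.
    if s.traffic_rps < 1:
        alerts.append("traffic near zero -- possible outage upstream")
    return alerts


snapshot = GoldenSignals(latency_p99_ms=2400, traffic_rps=120,
                         error_rate=0.002, saturation=0.97)
for a in check_signals(snapshot):
    print("ALERT:", a)
```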
But don't just take my word for it. Let me introduce my buddy Surya to talk about what SRE practice at scale looks like at Harness. Surya, take it away, my friend. Absolutely. Thanks a lot, Ravi. So whatever we're going to cover in the next few slides is all practices that are already covered in the Google SRE book, and this is how we do it at Harness, based on the principles defined in the book. There is nothing right or wrong in the way you do things; it varies from team to team and from infrastructure to infrastructure. These are some of the practices we follow at Harness. Again, please feel free to ask me questions after the webinar is done. So, as an SRE, what is your job, pretty much? You want to make sure your system is performing at scale for your end customers, and to know how your system is performing, you need insight into all the observability and visibility metrics Ravi touched on initially. The golden signals Ravi covered are all good, but what is important from a service point of view are the service level metrics, and the service level metrics are what define how your application is performing. At Harness, our engineers work closely with our SREs. The engineers write the code; they know exactly what the service is supposed to do and which metrics we need to measure to say the service is performing well or badly. So we work with engineering to define these application metrics. All of our infrastructure is in GCP, we run on GKE, and we use Stackdriver, Cloud Logging, for our application metrics. So the first bullet point, in summary: work with your engineering team and make sure that for each of the services powering your application, you have proper application-level metrics. The golden signals are well and good, and we do use them for measuring how the application and the services are performing overall, but that is in addition to the application metrics; the golden signals are more on the infrastructure side. Now, one of the things is, you do your deployment and you hope everything is good, and if something is bad, you don't want to hear it from the customer; you should be the one catching what is wrong as soon as you do the deployment. These golden signals definitely give an indication. For us, there have been a couple of times where we deployed to production with a feature flag turned on and saw a huge spike in overall application latency. It didn't cause us to roll back, but it did give us an indication: okay, the latency is too high, maybe the feature flag is the culprit. We turned it off and things went back to normal. So focus on this, as well as on configuration changes, after every deployment. We do deployments almost every day, and we're moving to a model where the engineers will do the deployments for their own services going forward. Now, a config change could be as simple as changing your YAML spec so that a pod that used to scale at 85% CPU now scales at 70%, and that may have an impact in production. So always make sure you have a way to track what configuration changes happened between deployment A and deployment B. That is one of the ways we make sure things are going fine after a deployment.
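As a rough illustration of tracking configuration drift between deployments (the 85% versus 70% CPU-scaling example above), here is a minimal sketch that diffs two config snapshots; the keys and values are hypothetical, not Harness's actual deployment specs:

```python
# Minimal sketch: diff the configuration between deployment A and deployment B
# so that config changes (not just code changes) are visible after a rollout.
# The config keys and values are hypothetical.

def diff_config(old: dict, new: dict) -> list[str]:
    """Return human-readable lines describing what changed between two configs."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key, "<absent>"), new.get(key, "<absent>")
        if before != after:
            changes.append(f"{key}: {before} -> {after}")
    return changes


# Deployment A's config vs deployment B's config (e.g. parsed from YAML specs).
deployment_a = {"hpa.cpu_target_percent": 85, "replicas.min": 2, "replicas.max": 10}
deployment_b = {"hpa.cpu_target_percent": 70, "replicas.min": 2, "replicas.max": 10}

for change in diff_config(deployment_a, deployment_b):
    print("CONFIG CHANGE:", change)
# -> CONFIG CHANGE: hpa.cpu_target_percent: 85 -> 70
```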
The next bullet point: I think everyone on the call knows about alert fatigue. We define alerts left and right; for each service we define so many alerts that in the end you have an overabundance of them. You get them in your email, you get them on Slack, and after a certain point you tend to ignore them. What usually happens then is that you miss the actionable alerts, which ends up causing an incident in production. You don't want that to happen. There is nothing stopping an SRE from defining additional alerts, and we do that as well, but we make sure those additional alerts go to a Slack channel that only SREs monitor. The actionable alerts, the ones telling us something is wrong or is about to go wrong with the system, are visible to the entire engineering org, in a different Slack channel. One of the advantages of these actionable alerts is that they take care of the whole operational underload and knowledge-gap problem. Again, whatever I'm mentioning is covered in the book; please go and refer to it. This underload, or knowledge gap, happens when your system is performing very well. Say you're at four nines, or five nines. What is the SRE's job then? You do the deployment and you call it a day. When you do that, you tend to forget about the system as a whole, and you build up knowledge gaps because of it. With actionable alerts, even if they don't lead to an incident, they help by telling you, for example, that a new query has been introduced in the database. We use MongoDB, and engineers introduce new queries and new indexes left and right; we want to be agile, but that can cause query performance issues in production. Will that have an immediate impact on uptime? Not really. But is it good to chase after these actionable alerts? Absolutely, yes. If you keep doing this, the system will behave well, and there won't be any knowledge gap when you follow this model. And the last one: we started this only recently, the whole exercise of chaos testing, where we introduce failures into the system. We see, okay, service A depends on service B; we take service B down, and what happens to service A? At a very high level, that is what chaos testing is. It could take up its own webinar; there are a whole bunch of tools, and Netflix has covered chaos testing pretty well. If you use a service mesh, you know what this is all about. But plan for chaos testing, even if you only have a few services; plan for it and make sure you know what happens when a service goes down or when a service is overloaded.
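As a toy illustration of the chaos-testing idea (take dependency B down and watch what service A does), here is a small in-process sketch; real chaos experiments run against live infrastructure with dedicated tooling, and the services and fallback behavior here are invented for illustration:

```python
# Toy sketch of a chaos experiment: take dependency B down and observe
# whether service A degrades gracefully. This is only an in-process model;
# real chaos tooling injects faults into the actual running system.

class ServiceB:
    def __init__(self):
        self.up = True

    def lookup(self, key: str) -> str:
        if not self.up:
            raise ConnectionError("service B is down")
        return f"value-for-{key}"


class ServiceA:
    def __init__(self, dependency: ServiceB):
        self.dependency = dependency

    def handle_request(self, key: str) -> dict:
        # Graceful degradation: fall back to a cached/default answer if B fails.
        try:
            return {"status": 200, "data": self.dependency.lookup(key)}
        except ConnectionError:
            return {"status": 200, "data": "stale-or-default-value", "degraded": True}


b = ServiceB()
a = ServiceA(b)

print("baseline:", a.handle_request("user-42"))
b.up = False                      # chaos: take service B down
print("with B down:", a.handle_request("user-42"))
assert a.handle_request("user-42")["status"] == 200, "A should stay up when B is down"
```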
That covers this slide. So, Ravi, can we go to the next one? Okay, the Harness SRE journey. Like Ravi mentioned, I joined in 2020, and in my very first week we had a very bad incident. At that time, the engineers themselves were taking responsibility for running the services in production, and they will take on that responsibility again going forward, but I came in as the SRE and we built a team. Again, whatever I'm describing is our way of doing things; it may be different for different teams. One of the things we did for our infrastructure: we run on GCP, on GKE, the managed Kubernetes, and we also use managed services for our database, MongoDB Atlas. One of the simplest things we did to begin with was to set up VPC peering between our GCP project and Atlas, so that traffic doesn't go over Cloud NAT and Cloud NAT doesn't become a bottleneck. That's just one example of an infrastructure improvement we made. We also run in two regions: US West 1, primarily serving our customers, and US West 2, in active-passive mode. And we exercise that: when we move traffic over to US West 2, how does it perform? We do this activity once or twice every quarter. Similarly for Mongo, for our database, we have US West 1 and US West 2, where US West 2 acts as more of a DR setup. Again, infrastructure is different if you're hosting your own in-house infrastructure, like we did back at Symantec; you have your own set of challenges managing that. So make sure there are continuous improvements. It's not something you do one time and you're done; wherever we find scope, we keep improving. And when we built the team (we're right now four SRE engineers, two in the US, two in India, and two DDS), the SRE mindset was very important. What I mean by the SRE mindset: you always look for ways to automate things; you're okay with chaos, okay with things not being stable; you actually like to be in the thick of things. It's okay to have an incident, but you need to learn from it. It's okay to have an alert. That mindset is very critical. One of the things we do is that for any incident, or anything we want to communicate to our customers, we provide RCAs, and these RCAs are blameless. It's not "this person did something and it happened because of them"; it's about why it happened, what learnings we took from it, and what we're going to do so it doesn't happen again. We publish them all the time. And these SRE practices are not the old-style model where the developers write the code, build the artifacts, throw them over to the SREs, and say, okay, you deploy the code in production, it's your headache. That's not how we operate. We make sure that for a service developed by engineering, we are plugged in. The engineers know the SRE practices; they know they need to get on the incident call when something happens in production. It's not just the SREs handling an incident in production; everyone in engineering is on the call based on which module or which service is responsible. So that's the SRE journey so far at Harness. The last slide I'd like to cover: we try to be as transparent as we can with our customers, and we publish our availability numbers on our status site, no surprise, status.harness.io. If you go there, you'll see that Harness has a whole bunch of modules: we have CD, we have CI, we have Feature Flags, and we publish all of those availability numbers on a weekly basis. In the last quarter, through May, our availability was 99.96%, which means we were down for our customers for about 50 minutes in that 90-day timeframe. In Q2 we did much better, because we put a lot of guardrails in place, and we hit the four-nines target we were aiming for. As for how we calculate these uptime metrics: we use a combination of service level indicators per service, the real user metrics we get from AppDynamics, and a weightage-based calculation to compute uptime. For example, login has a much higher weightage when it comes to uptime, because without logging in you won't be able to do anything. But a dashboard that renders how your deployments are doing, what you deployed last week or so, is useful, yet it won't stop customers' pipelines from being executed, or their deployments, or anything of that sort, so for those we use a lower weightage. Again, it's totally up to you: always think from your customer's point of view. What is the indicator they're looking at to make sure they're not disrupted? That will be your service level indicator, and that will drive your uptime.
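As a rough sketch of the weightage-based uptime calculation Surya describes, where the service names, weights, and availability figures are hypothetical rather than Harness's actual numbers:

```python
# Minimal sketch: compute an overall uptime number as a weighted combination
# of per-service availability, so that critical paths (like login) count more
# than nice-to-have ones (like a reporting dashboard). All values hypothetical.

services = {
    # service: (weight, measured availability over the window)
    "login":     (0.50, 0.9990),   # without login, nothing else works
    "pipelines": (0.35, 0.9998),   # core deployments keep running
    "dashboard": (0.15, 0.9900),   # useful, but doesn't block customers
}


def weighted_uptime(svc: dict) -> float:
    total_weight = sum(w for w, _ in svc.values())
    return sum(w * avail for w, avail in svc.values()) / total_weight


uptime = weighted_uptime(services)
minutes_down = (1 - uptime) * 90 * 24 * 60   # downtime over a 90-day quarter
print(f"weighted uptime: {uptime:.4%}, ~{minutes_down:.0f} minutes of downtime per quarter")
```

The point is simply that the weights encode the customer's view of what matters, which is what Surya means by always thinking from the customer's point of view.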
And like I covered briefly, we publish everything; it's very transparent. You can go to our Medium blog, where you'll see our RCAs and our incidents, and the engineers pretty much keep blogging about the latest things they did over the last quarter, the cool things we're going to do next, and what our deployment strategy is. We even have an architecture diagram over there, so please feel free to go there and check it out. That's about it from my end; it's all yours next. Thank you, Surya. You can go ahead and advance the slide. Yeah. So I do want to quickly level set with everybody. There's been a lot of information thrown at you, but we've tried to break this up into three sections. The first section, which Ravi talked you through, was really about laying the foundation of SRE: what is an SRE, and what can you expect from that role? If you missed it because you joined a little late, that's fine; just rewatch it. What Surya just went through was the state of affairs of SRE today, from his experience here at Harness, and essentially how we're trying to approach it. And what I want to jump into, as a product manager, is the forward-looking view: where is this actually going as we look across the spectrum of both DevOps and SRE? I thought it would be a good opportunity to talk about where the world is now, because we do see a paradigm shift taking place. So why is that paradigm shift happening? Obviously, you all know this; you have a lot of experience, you've been out there working in different capacities. You recognize that whenever changes take place, they can potentially create problems; they can destabilize things, and people start yelling and looking for somebody to blame. I spent a lot of my early years supporting large Fortune 500 companies in some, let's say, extreme conditions, dealing with firewalls, network routing issues, switches, that sort of stuff, so I really worked on the foundational components necessary to run a business. And when I worked with these businesses, the first thing I would always ask, immediately, when they came to me with something urgent, "we're losing a million dollars a minute, we need to fix this now," was: okay, did something change recently? And if something did change recently, what was that change, and does it relate to the problem we're trying to investigate? The issue, obviously, is that the ability to track changes wasn't readily available to the individuals doing the troubleshooting. Most of the time, the people I was working with couldn't answer the question of whether something had changed recently. And that essentially leads us to where the world moved many years ago, and I would say even now it still operates this way, which is the idea that changes are scary, and because they're scary, we need to slow down and do fewer of them. It created this culture of changing less often but doing big, massive changes: you might do it quarterly, you might do it semi-annually, but you want to be careful about how often you change. And so it introduced a number of processes and systems, such as CMDB-type systems and approval processes, and created this whole culture of "let's be careful when we change." Go to the next slide. Yeah. So we've got two groups that are now leading the charge of this paradigm shift. It's the recognition that change actually isn't bad. Sure, it can be a little scary, but if we do it properly, we can actually change much faster. The introduction of services and microservices, breaking applications apart into, like I said, microservices, has introduced the ability to do small, incremental changes rather than big, massive ones. The challenge is that the two groups leading the charge, DevOps and SRE, have really different goals in mind. If we look at DevOps engineers, they're very focused on velocity. They want to get changes out the door quickly. They've adopted things such as the DORA metrics.
Being able to monitor very specific things makes sure the changes they're introducing aren't causing problems, but they're very interested in getting features out the door, which obviously makes customers happy, because customers want features delivered quickly. The SREs, on the other hand, have to look at the broader picture. They're not looking at just the application or the services being delivered; they're looking at all of the underpinning architecture, the platform, the services and systems, the hardware, all of the stuff that goes into making this successful. And so what they do, and this is what Ravi talked about with regards to SLOs, which I'm going to break down a little here, is define SLOs, which are typically designed around the user journey. Unlike DORA metrics, which look at more traditional delivery metrics, SLOs are focused on the user journey. You want to identify the happy path, the things your application or system needs to deliver for a customer, and then measure that. That helps you identify whether customers are able to be successful or not, irrespective of, let's say, a particular service failing. It could be a component the service relies on, like a database, that's having issues, but if you're tracking the entire journey, you'll actually know whether that customer is happy or not. So you have these two spectrums: one focused on change velocity and getting features out, the other heavily focused on reliability, making sure those changes don't destabilize things and the customer stays happy in the process. Okay. So what we look at, as we've been talking to customers and analysts here at Harness, is how to help customers with their change velocity, meaning they can either increase it or decrease it and get much better control of it, obviously without compromising service reliability. And Surya did hit on an important point: he talked about the notion of blamelessness, which I like. It's the idea that it's okay to have some failures, and if you can measure what an acceptable amount of failure is, what we call error budgets in the SRE world, then that's a better way to think about it than saying we're going to be 100%. We'll talk about that in just a second, but these are really the three pillars we've been seeing as we talk with customers. The first is: how do you track changes? Think about your business, your world. The old way of doing it, say a CMDB system, tracks some elements of changes, but doesn't track all of them. You've got your DevOps team, which is tracking some changes. But think of a build that goes out the door and gets deployed into production: you might know a change occurred, but you don't necessarily know what that change was, or what the changes were, the PRs and so on that went into that deployment. So there are these worlds where changes are transpiring that aren't fully tracked. That's one aspect where we see blind spots and challenges for customers in getting a better view of everything happening in a given environment. The second thing is measuring the impact. What this means is, I would say the way companies work today, think of cause and effect, right?
Something happens, your root cause, and then it trickles out and has some blast radius, some impact on the things around it. Then something starts to misbehave or act up, so you look at the metrics for it and see, hey, this metric deviated, so I want to take a look over here. That's your cause-and-effect world. What we traditionally do with monitoring today is look at the effect, the aftermath, and then try to walk backwards to determine what the root cause was. What "measure the impact" essentially means in the change world, really for an SRE, is that if you can track the changes that are happening, you can measure the impact those changes are having on a given environment. This allows you to watch from the point of the root cause and start to see deviations before they impact the business, rather than waiting until everything has been impacted and walking backwards. So that's the second pillar we've identified in working with customers: can you measure the impact of changes once a change happens? And the third one is what we call informed velocity. This is exactly what it sounds like. The idea is that change velocity is something you want to maintain, but there are times when you want to slow it down and times when you want to speed it up. The times to slow down are when things are unstable, your environment is unreliable, and you feel like customers are getting agitated or frustrated. Then you need to inform the teams making changes that they need to slow down, or maybe even stop, ask for approval, different things that might need to happen because reliability is in jeopardy. So informed velocity lets you inform those teams. The other side, and this is again what Surya was talking about with regards to error budgets, is that you might be running at 100%; things might be wonderful, and you might say, look, we actually have some room to take a little bit of risk here because we're not in jeopardy of burning through our agreement with the customer. So let's take some risks. Maybe that change you've been holding off on, updating a database or Kubernetes or something like that, go ahead and do it, because we have some room here without impacting our reliability. So those are our three pillars.
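As a rough sketch of the error-budget idea behind informed velocity, where the SLO target, traffic numbers, and decision threshold are hypothetical:

```python
# Minimal sketch of "informed velocity": use the remaining error budget to
# decide whether a risky change can go ahead or should wait.
# The SLO target and the observed numbers are hypothetical.

SLO_TARGET = 0.999            # e.g. 99.9% of requests must succeed this quarter


def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the quarter's error budget still unspent (can go negative)."""
    allowed_failures = (1 - SLO_TARGET) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1 - (failed_requests / allowed_failures)


def can_take_risk(budget_left: float, threshold: float = 0.25) -> str:
    # Plenty of budget left: take the risky change (DB upgrade, K8s bump, ...).
    # Budget nearly burned: slow down, require approvals, or freeze changes.
    return "ship the risky change" if budget_left > threshold else "slow down / freeze"


total, failed = 10_000_000, 4_000
budget = error_budget_remaining(total, failed)
print(f"error budget remaining: {budget:.0%} -> {can_take_risk(budget)}")
```

In practice the budget would be tracked continuously over the SLO window, but the go or slow-down decision is the same idea.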
And so our goal, fundamentally, is to get the SRE and DevOps teams to cooperate, to tear down the wall, as you saw on the previous slide. There is this wall that exists between the two, and I would say somewhat of a competitiveness that can exist between them. We want to bring down that wall and make it so the DevOps teams and the SREs are actually cooperating, trading information with each other, so that the DevOps team can help the SREs with reliability and the SREs can help the DevOps team with velocity. And with that, I guess we can open it up for questions, Ravi. Yeah, that was awesome. Thank you so much, Surya and Thor, for your background. I was definitely bringing down production services before blameless culture; I kid you not, in the Confluence documents they'd put "at Ravi." Not very blameless: clearly they knew who it was bringing stuff down. But if you want to contact us, feel free to give this a scan if you want to learn more about what we're doing at Harness, or if you want to come to Harness. Now let's go through some of the questions, and we'll provide the answers. Let's take a look at what we've got going on; I can paraphrase these. Okay, the first question is: how are saturation and traffic different? I can take a stab at that, Surya and Thor. No, go ahead, Ravi. Okay, my very basic answer is that traffic is just that: how many requests are coming in, or not coming in. Saturation is how taxed the underlying infrastructure is. If your service is very saturated, you might be handling the traffic, but you're at capacity; you can't take a net new request because you're pegged on CPU, or pegged on memory, or some other infrastructure constraint, or network I/O. But if either of you has a cleaner explanation, I'm happy to hear it. Yeah, I think that's good. Okay, cool, I got one right. All right. I do just want to jump in and say, and Surya did mention the SRE handbook, which we'll point to the link for if we haven't already, that this is a learning process. Being an SRE is kind of a new concept, so we're leveraging a lot of the wisdom of larger organizations like Google, obviously, but as we work with different companies, they find different measurements relevant or not relevant. So when you talk about saturation and traffic, you might find only one of those measurements is relevant for your organization. It is a trial-and-error learning process as you start to think about your SLOs, how you measure things for your business, and whether all of the golden signals, or only some of them, are relevant for you. Go ahead. Perfect. Very funny; I'm skipping to the very last one. Actually, our QR code 404'd. We launched a new website Tuesday, so it's live, and we'll get you a better link; sorry about that. Yeah, live errors, it was the SRE. I know, the redirects just died; silly Apache. All right. Another question we have here is: can we explain what an RCA, or root cause analysis, is? I've had very bad ones before, so I might hand it over to Surya. When we do an RCA here, a root cause analysis, what are some of the goals of the RCA? Yeah. So the whole point of the RCA is: okay, you had an incident, so you want to cover the timeline of the incident, when it happened, how long the incident lasted, what exactly caused it, and the action items we took to make sure the incident won't happen again, or the learnings from it. Now, when we say blameless, we don't want to take anyone's name. It might be an intern doing something, it might be an experienced person doing something; it doesn't matter. It's just an incident that happened and impacted your customers, and we want to document that and share it with the customers. I'll just post a link here in the Zoom chat for everyone; as I mentioned, we post pretty much all of our incidents on the Medium blog.
So when you have some time, please check it out; we describe one of the incidents that happened in that particular link. If you have any further questions, feel free to ask now. Perfect. I think you also posted a link to the Google SRE handbook. Yes, I did; one of the questions was about the link to the Google SRE book, so I posted that. Awesome. So the next question would be: can organizations adopt DevOps, DevSecOps, and SRE practices together? I'll hold my tongue on this one if one of you two wants to take it; how do we handle it here? Yeah, go for it. Was it "do organizations usually adopt both DevOps and SRE"? Yeah, DevSecOps, which is, let's say, just a branch of DevOps; it's a focus. So I'll answer the first part, and then maybe, Surya, you can answer the second question as part of that, since the question is kind of "how do we handle it here?" I'll tell you what I'm seeing with customers, and I'll answer honestly: it's all over the place. It really depends on which team got started first; that becomes the more dominant team within the organization. But I'm definitely seeing that if organizations are larger, where there are two separate organizations, they tend to be siloed, and this is why we've been talking about that connective tissue: how do we bring those organizations to work collectively together rather than separately and in silos? If the organization is smaller, meaning it's a single group, they tend to take on practices closer to how they started. So if they started as DevSecOps and are making their way into SRE, they'll follow practices closer to DevSecOps, and vice versa. But I'll let you, Surya, answer the other half. Yeah, sure. So I think Thor answered half of it. At Harness, I used to manage the security team before I started focusing solely on the SRE side. Security is important; without it there will be incidents in production, and because of that we had one incident where a web application firewall we configured led to an incident in production. But DevSecOps is a very broad term. From the Harness point of view, we have the whole GCP infrastructure security, and we have application security working very closely with SRE, for example around the images that we use. We have both on-prem and SaaS offerings, and some of the security-conscious on-prem customers want to make sure the base images we provide are very secure. So for every release, the security team scans the images, whether it's Ubuntu or Debian or Alpine; we publish the results and post them to our customers. Now, some other companies are more agile in the sense that the whole security scanning and code scanning is part of the build pipelines themselves. We're also taking that approach: we're going to use our own CI, the continuous integration product, have the build pipelines integrate with some of these security scanning tools, and make sure the scans pass before the build or the artifact is generated. So, to answer the question: security and SRE need to work closely together. We don't have a choice, really.
Because if an incident happens in production, you need the security folks on the call, whether it's related to external traffic trying to hack your system, or something very internal where your traffic is blocked by the proxy and you don't know what exactly is going on. So yes, those two need to work closely together, to answer the question. Awesome answer. So, tackling a little bit more, and I know we're coming up on time here, but we have time for one or two more. Tackling this question about the good book, the Google SRE book: the split between DevOps and SRE, or is SRE an extension of DevOps, or vice versa? I'll give my quick, or not-so-quick, explanation. They're actually two different problem sets. I wholeheartedly agree with Thor that one team, the DevOps team, is focused on the development pipeline and on velocity. And absolutely, any time you introduce change to a system, reliability is a concern. If nothing ever changes, it's fairly reliable: hey, I've made zero changes in 20 years, and barring a mechanical or hardware failure, you'd experience the same results over and over. But that's not innovation, that's not dazzling your customer; you'd be stuck in the past. Typically what I've seen, again going back to expertise dissemination, is that the DevOps team will focus on how to get your ideas into production; the persona of a DevOps engineer, in my experience, is a systems engineer focused on the development pipeline. Vice versa with the SRE: my experience is that SREs are software engineers focused on operational problems. For example, the romantic idea of an SRE is: oh no, our application can't scale; no matter if I add another Mongo node, we're not getting any faster, and we're past that point. Who owns that problem? Is it the software engineer? Is it the SRE? It's a very software-engineering-focused problem, but a different flavor of it. So that's my long-winded non-answer answer; I'm not sure if you two have anything to add. I do, yeah. Just talking directly to that question: it's a battle we're actually having internally right now as we build out a product that adds the capabilities to measure and track service levels, SLOs, SLIs, that sort of stuff. There's this debate between what the "bible" says, where you take the Google handbook as your doctrine and this is what it says to do, and what customers are actually doing. Those two things aren't always aligned, and that's the challenge for us, especially as we build out a product: do we meet customers where they're at today, or do we try to push them into the way of thinking the book says they should have? When we put together these slides, it was more in the context of how we see customers operating today, even though, technically, in a perfect world, what the Google handbook says is correct, and maybe DevOps and SRE are all just one thing in the end. But we don't see that currently. That would be awesome, though; it would make our lives a little easier. Yeah, and there have been other renditions of the good book, right?
Like, they came out with a 2020 edition of the book, and they had it free for about a month; they put it up for free. Yeah, trust me, I've been dissecting it quite a bit, and there are areas in there that are more aspirational, where you're just like, man, customers just aren't working this way right now. So we have to meet them where they're at. Cool. I think we're just about out of time. I know there are a couple of open questions, but if you ever want to get in contact with us, we have a community Slack at harness.io; you can hit up anyone in our community Slack and you'll reach us. I'm not sure if there are any closing words from the Linux Foundation. Yes, thank you so much, Ravi, Surya, and Thor, for your time today; that was a great presentation. And thank you, everyone, for joining us. Just a quick reminder that this recording will be added to the Linux Foundation's YouTube page shortly, so you can check back there to review it or send it along to others if you like. We hope you'll join us for future webinars. Thank you so much again, everyone, and have a wonderful day. Cheers. Bye.