So I usually like to start with this quote. It originally had nothing to do with reliability; it was about business management, and it's by Peter Drucker. But I think it applies perfectly to SRE: the notion that you first have to measure something before you can improve it applies directly to reliability. And the reason we care about improving reliability is that, at the end of the day, reliability is what our customers are experiencing. We'll go deeper into why reliability matters so much for customer experience.

Now, a bit of history and how we ended up here. We've seen this a lot in the DevOps context: the wall between developers and operations. This is how Google, in 2003, created the first such team, called production engineering back then. The intention was to bring in software engineers and make sure the Google websites were running reliably, to improve reliability, and all those kinds of things. As a few of you remember, back then there were the people on one side of the wall who wanted to throw features over to the other side, and the people on the other side who cared about making the service more reliable, the kind of things you understand when you are actually responsible for a customer-facing service. So the operators decided: we have to define a few metrics that govern what you can throw over the wall. As long as our service stays reliable, you can throw over whatever you want, and we will be responsible for running it at a specific reliability level. That's how SLIs and SLOs came about. The idea was that, in order to keep our visitors and have people coming back to our websites again and again, we had to promise a specific service level.

And what we cared about back then was simple: we didn't have microservices, we didn't have complicated user journeys. We wanted the website to be up and running. So that was our SLI: service uptime. Then the SLO was defined as a target on that SLI: what kind of uptime do we want to achieve? If we want 99.99%, meaning the website is available 99.99% of the time, that's our SLO, the target. Based on that, we can go to the customer and promise at least this level of service. And usually the SLOs are stricter than the SLAs, because we don't want to promise customers externally exactly what we target internally. But that's the history.

How does it work? How do we define these? We start by thinking about what's important for the business. Back then it was just website availability, plain service uptime, nothing more. Now it's other things: we want to figure out what's going to keep us alive, what's going to keep customers coming back, what's going to keep customers engaged, and so on. That's our SLI. Then, usually using historical metrics and expertise, we think about what our target for that metric should be; that's the SLO. And based on that, we make a promise to our customers that we can offer a service of that level. These can be either time-based or event-based: we can say the service will be available a certain amount of time, say 99.99% of the time, or that 99.99% of the attempts to visit the website will succeed.
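To make the time-based versus event-based distinction concrete, here is a minimal Python sketch; all the numbers in it are made up purely for illustration.

```python
# A minimal sketch of the two ways to express an SLI against an SLO.
# All figures are hypothetical, not from any real system.

# Time-based: what fraction of the month was the service up?
MINUTES_PER_MONTH = 30 * 24 * 60           # 43,200 minutes in a 30-day month
downtime_minutes = 20                       # hypothetical total outage time
time_based_sli = 1 - downtime_minutes / MINUTES_PER_MONTH

# Event-based: what fraction of requests succeeded?
total_requests = 1_000_000                  # hypothetical traffic
failed_requests = 150
event_based_sli = 1 - failed_requests / total_requests

slo_target = 0.9999                         # "four nines"

for name, sli in [("time-based", time_based_sli), ("event-based", event_based_sli)]:
    status = "meets" if sli >= slo_target else "misses"
    print(f"{name} SLI = {sli:.5%} -> {status} the {slo_target:.2%} SLO")
```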
And it's important to set those. But after we get these insights, after we measure and see the SLIs, what do we do with this information? The most important piece of all of this is the error budget. We tend to focus more on the SLOs and the SLIs, but the error budget is where software development teams can actually make sense of them. This is where we start to understand that SLOs are essentially a tool to help us determine what engineering work to prioritize: we use SLOs and the error budget to decide whether to prioritize reliability work or feature work, and that helps us prioritize our backlog. Nowadays this is even automated into our product management process. So the error budget allows for a specific amount of bad behavior in our application. Taking the SLO, the target, we compute 1 minus the SLO: if we want 99.99%, then 0.01% is our error budget. And similarly, with the SLI we track our budget burn. Now, what happened back in the Google SRE days is that if you reached the point where you had burned your whole budget, you could not release anything to production; you had to focus only on work that improved the reliability of the application.
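Here is a minimal sketch of that arithmetic and the release-freeze policy, again with hypothetical numbers:

```python
# A minimal sketch of the error-budget arithmetic described above.

slo = 0.9999                      # target: 99.99% of requests succeed
error_budget = 1 - slo            # 0.01% of requests may fail

total_requests = 2_000_000        # requests so far this period (hypothetical)
failed_requests = 130

allowed_failures = total_requests * error_budget     # ~200 failures allowed
budget_burned = failed_requests / allowed_failures   # fraction consumed

print(f"Error budget: {error_budget:.4%} of requests "
      f"({allowed_failures:.0f} failures allowed)")
print(f"Budget burned so far: {budget_burned:.1%}")

# The classic Google-SRE policy: once the budget is exhausted,
# freeze feature releases and work only on reliability.
if budget_burned >= 1.0:
    print("Budget exhausted -> freeze releases, prioritize reliability work")
else:
    print("Budget remaining -> feature releases may proceed")
```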
So how is this related to customer experience, and how does it all come together? We can define reliability in a few different ways. We can use the technical terms, the probability of failure-free operation and so on, but beyond that, reliability is the quality the customer perceives. Reliability is what cloud customers understand as the quality of our deliverables, and that's why it is so important. As a result, all of our reliability efforts should be customer-centric, and so should our reliability measures. SLOs should focus on representing user impact and user experience; it's not just about service uptime. And reliability engineering should work closely with customer experience engineering, with customer success, and so on. We've seen, especially in enterprise environments, that reliability is what drives cloud adoption.

With that in mind, and with how customer-centric reliability should be, defining SLOs is not just a discussion between engineering teams and architects; it's much more than a technical task. We should keep in mind that customers differ in where, when, and how they use our applications. Customers run workloads of different importance on our applications. Some customers are more important to us than others, whether we like it or not. And customers access our application from different geographies, different devices, all those kinds of things. All of these are things to keep in mind when defining SLOs. We should also focus on meaningful availability measures: on what our customers are experiencing, and on the uptime of specific features and functionalities instead of just overall service uptime. And again, that should be customer-specific too. We should always keep in mind the inherent complexity of distributed systems, which is embedded in modern architectures. And of course, we should integrate our systems and add more context to our data: we have a lot of customer data, so we should combine service availability metrics with other customer data in order to achieve customer success.

So with that in mind, how does the SLO implementation process change? We start by defining critical user journeys and prioritizing them based on business impact. We have to keep in mind which parts of the application our customers use most and care about most. Then we determine metrics around those different journeys; these can be latency, load time, anything like that. And finally, for all those metrics, we define desired targets. Again, using the context of the customer, how each customer is different and from where they actually access the application, we decide how those targets should be calculated: over which periods, whether they are time-based or event-based, whether we do any geo segmentation, those kinds of things. And finally, we have to operationalize those metrics properly, which means putting the tools in place that help us use those insights in our development life cycle. We should always remember that SLOs are tools that help us better understand what to prioritize, and that's why we need to integrate everything: integrate the error budget and better understand when we should freeze releases, when we should prioritize reliability, those kinds of things.

So that's what SLOs are and how we implement them. But what are the common issues we see when we try to implement SLOs at scale? One of the most common is that SLIs are usually defined by engineers, and, as an engineer myself, I haven't always practiced empathy when deciding what to measure. That's really important: we should always try to think from the customer's side, and more often than not, SLIs are not customer-focused. SLOs should have stakeholders: if something goes bad, there should be someone in the SRE team or in engineering leadership who is responsible for mitigating it and for deciding how to proceed if, for example, we burn through our budget. Error budgets are usually used reactively, meaning we get an alert once we have burned the whole budget; but this isn't something to do retrospectively, it's something to use as we go, as we plan our work and in our everyday backlog prioritization. Oftentimes we set unrealistic SLO targets, either far too high or far too low. One thing that has stopped is that back in the day everyone wanted 100% availability; I think we're now at the point where we know 100% doesn't exist, which is why we only talk about nines. And then a big one: the SLI and SLO evaluation process is still really manual. We see people keeping Excel files, or working through incidents and tickets trying to calculate what the downtime was, and that causes big problems too.

Specifically when trying to scale, one issue is that in a really large engineering organization you usually have different teams implementing observability metrics in really different ways, which gives you really different datasets to work with. It's hard to implement SLO standards when you don't have standards in the monitoring data. These divergent results, in combination with manual reporting, create what we call the watermelon problem: everyone reports what they want to report, using the data they want to use.
So we end up with something really red, really bad on the inside, that looks really green and nice on the outside. Another scaling issue, and I mentioned this before, is how microservices and distributed systems have evolved over the years. Back in the day we would mostly care about whether the frontend was available: if it was, that counted as uptime. But if you take a modern e-commerce setup, you don't really care that the frontend is up if the checkout, the payment service, and the cart service don't work. It's the combination of things that matters, and again, we should focus on the availability of features. A final set of scaling issues: when you have a lot of people and a lot of different applications, you lack a common understanding of what the service metrics are and what you're trying to achieve, and you have to stay blameless, using the metrics as feedback and input rather than as a way to see who's doing what wrong and blame one another.

To scale SLIs and SLOs, we have to automate the evaluation process; the Excel files and digging through incident tickets simply can't work at scale. And while we automate the SLOs, we also have to do it in a way that respects the teams using different observability tooling, so we have to be tool-agnostic in that sense.

So how does observability come into play, and how can observability as code and SLOs as code help us scale this? We start with cloud-native observability. The need for it keeps growing, since this is what provides the good, clean metrics to set the standards and automate the processes around this. To do that, we need good real-time insights; we need it to be self-service; we need elasticity in cloud-native environments; we need good alerting across different communication channels. The integration part is really important, since we've been talking about contextual observability and contextual reliability metrics. And finally, an SLO framework that brings everyone onto the same page is a really important aspect. All of this is powered by what we call the MELT stack: metrics, events, logs, and traces. Looking at these data types from the angle of SLOs: metrics are what fuels our SLIs. Metrics are what we measure against our targets, but they also give us the historical data to define those targets, so they help us define the SLOs as well. Logs, if we can add them, give us more context: in our case, we've combined SLI data with logs and we can slice our data differently. We have more contextual SLIs, so we can see SLIs for specific customer groups, impacted customers, those kinds of things. And finally, events and traces give us the intelligent part of observability; that's where we understand where, why, and when we have been burning through our error budgets.
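To make that contextual-SLI idea concrete, here is a hedged Python sketch of joining SLI events with customer context pulled from logs; all record shapes, field names, and customer names are hypothetical.

```python
# A sketch of "contextual SLIs": enriching raw SLI events with customer
# context extracted from logs, so availability can be sliced per customer
# or per geo instead of one global number. Data here is illustrative only.
from collections import defaultdict

# SLI events, e.g. one per request (success flag + request id)
sli_events = [
    {"request_id": "r1", "success": True},
    {"request_id": "r2", "success": False},
    {"request_id": "r3", "success": True},
]

# Context extracted from logs for the same requests (hypothetical fields)
log_context = {
    "r1": {"customer": "acme", "geo": "eu-west"},
    "r2": {"customer": "acme", "geo": "eu-west"},
    "r3": {"customer": "globex", "geo": "us-east"},
}

# Join and aggregate: a per-customer SLI instead of one global number
per_customer = defaultdict(lambda: {"good": 0, "total": 0})
for event in sli_events:
    ctx = log_context.get(event["request_id"], {})
    key = ctx.get("customer", "unknown")
    per_customer[key]["total"] += 1
    per_customer[key]["good"] += int(event["success"])

for customer, counts in per_customer.items():
    sli = counts["good"] / counts["total"]
    print(f"{customer}: SLI = {sli:.2%} over {counts['total']} requests")
```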
And to scale our observability practices themselves, we apply observability as code. This helps us promote standards and best practices and standardize the data types across multiple teams. To do that, you can use a GitOps approach for onboarding new services and creating new metrics, backed by zero trust for a more secure onboarding of new services. Observability as code covers metrics, logs, and traces, dashboards as code, and finally alerting as code across different channels.

This is roughly how it looks. You commit a configuration file; a build is executed that uses the Terraform module, generates the variable files, and produces the specific configuration; it's tested first, the assets are published, and then it runs across the production environments to create the new metrics and so on. And this is what it looks like: this is a Terraform module; there you have a dashboard, a New Relic dashboard, here an alert condition, and this is an alert channel. If you go right now and try to write a Terraform module to implement a dashboard and an alert as code, this is what it's going to look like. The issue with scaling this is that you still need someone who understands what's written here. So, to operationalize and scale it further, what we try to do with observability as code is create modules that use variables, so that a developer only has to write this much: to create a new alert condition, they write "I want an alert condition, and this is the NRQL" (it could be PromQL or any other query language). The pipeline then runs from that, we automatically generate the Terraform resource files, and those are applied together. So the only thing the product engineer needs to know is this and nothing more: the NRQL or PromQL query they have in mind. That's how you can make observability as code even easier.

And as you implement observability as code, standardize, and scale these practices out, you start thinking about SLOs as code. In recent years the OpenSLO specification came up: an open specification for defining SLOs in a vendor-agnostic way. It uses a YAML format, so it should feel familiar to a lot of us, and we'll see how it looks in a bit. Another project I really like is Sloth. Essentially, with Sloth you again write a configuration file, it runs, and it creates the Prometheus rules so that you get alerted on the different metrics you're measuring. So this is what OpenSLO looks like: here you define that you're using Prometheus (it could be Datadog or anything else) and PromQL, and this is the query you'll be using; that's the indicator, the SLI. And here you define the objectives; that's the SLO. It's a pretty simple language, easy for anyone to understand. Now, Sloth uses a similar file, and it can work with or without OpenSLO; this is a Sloth example with OpenSLO. Here we define what our objectives are, and here the queries that find the bad requests (not the good ones) and the total requests, plus the target, and from those it calculates the ratio. When Sloth runs (and of course the real file is much bigger than this), it creates this Prometheus file with all the rules needed for you to get alerted, and it sets Prometheus up that way.
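Since the slide itself isn't reproduced here, this is a hedged sketch of what consuming such a definition could look like in Python. The YAML only approximates the OpenSLO schema (simplified fields, a hypothetical SLO name); consult the spec at github.com/OpenSLO/OpenSLO for the real shape.

```python
# A sketch of parsing an OpenSLO-style SLO definition and deriving the
# error budget from it. Field names are simplified, not the exact schema.
import yaml  # pip install pyyaml

slo_yaml = """
apiVersion: openslo/v1
kind: SLO
metadata:
  name: login-and-launch-desktop   # hypothetical CUJ-based SLO
spec:
  description: Availability of the login-and-launch critical user journey
  budgetingMethod: Occurrences
  objectives:
    - target: 0.999
"""

doc = yaml.safe_load(slo_yaml)
target = doc["spec"]["objectives"][0]["target"]
error_budget = 1 - target

print(f"SLO target: {target:.3%}, error budget: {error_budget:.3%}")
```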
So how would that work in practice? Imagine an environment with multiple Kubernetes clusters and Prometheus monitoring those clusters. Since you have this distributed Prometheus setup, you use Thanos on top of it, and Grafana for dashboarding on top of Thanos. You create the Sloth file, commit it to the repository, and then the pipeline takes over: it runs with the Sloth definition file and creates the rules that Prometheus will evaluate. And similarly to what we saw before with the New Relic dashboard, you can have Grafana dashboards as code; this is what it would look like to have an SLO dashboard per service, where you can see your error budget, your availability, and so on.

Now, getting into more specific territory and bringing it all together: this is what we have been doing for the past few months, and this is what the architecture looks like. We have a multitude of applications; these can be AKS or EKS clusters, EC2 instances, whatever. The ingestion layer is optional; it can provide filtering and rate limiting, and for compliance reasons you might need to add masking of the log data. Everything is triggered by synthetic tests, to implement outside-in monitoring and to add an orchestration layer on top. And finally, we end up at different backends, which goes back to what I said before: to scale this out, you have to respect the different choices different teams have made, so you have to work with Prometheus, Splunk, New Relic, Datadog, whatever there is. And then there is the custom part, where essentially we run the OpenSLO files, those create the rules, and this engine goes and queries and calculates whatever has to be calculated to produce the SLIs. Then we add the contextual data so we can create customer experience reports: we can see, per customer, what they are experiencing per geo, whether they were impacted in any incident, and all of that ends up in either reports or dashboards.

Now, when we reach a point where an SLI is really low and we have eaten through our error budget, we have an integration with the release pipeline: the pipeline has a step where it essentially asks, can I deploy or not? And if you've gone through the error budget, you have to go through an enhanced approval process in order to deploy (a minimal sketch of such a gate follows below). And the visualization part looks like this; it's still a mock-up, not live in production yet. You see the different SLOs, you see the incidents, but most importantly you see the specifically impacted customers. That's where you can actually have these conversations with the customers, knowing exactly how they were impacted, what their experience looked like, and so on.
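Here is a minimal sketch of what that pipeline gate could look like; check_remaining_budget is a hypothetical stand-in for querying the SLO engine, not a real API.

```python
# A sketch of a release-pipeline gate: before a deploy, ask whether the
# error budget for the affected CUJ is exhausted. Names are hypothetical.

def check_remaining_budget(cuj: str) -> float:
    """Return the fraction of the error budget still available (0.0 to 1.0)."""
    # In a real pipeline this would query the SLO/observability backend
    # for the given CUJ; here we return a fixed illustrative value.
    return 0.12  # hypothetical: 12% of the budget left

def can_deploy(cuj: str, freeze_threshold: float = 0.0) -> bool:
    remaining = check_remaining_budget(cuj)
    if remaining <= freeze_threshold:
        print(f"{cuj}: error budget exhausted -> require enhanced approval")
        return False
    print(f"{cuj}: {remaining:.0%} of error budget left -> deploy allowed")
    return True

if __name__ == "__main__":
    can_deploy("login-and-launch-desktop")
```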
Now, throughout this process we've learned a lot, and I tried to pick the three most important things. First: alerting for on-call, alerting or thresholds for reporting, and alerting for SLOs are totally different business cases, and you have to treat them differently. You cannot just create alerting for your reports and use it for on-call as well; that's not going to work. Second: SLOs should be customer-focused. I think I've said that more than anything else in this presentation; if you leave with one thing in mind today, that's probably the most important takeaway. We should always focus on critical user journeys and measure those, instead of measuring purely systemic things. And third: treat this as an ongoing journey. SLOs are always going to keep evolving; start from a simple place, a simple application, and grow it out from there.

So, to wrap up, the most important things we talked about today: how observability has evolved, and how reliability has evolved along with the growth of distributed systems and distributed architectures; how observability is not just about measuring things, but is really important for feeding our reliability work, and how these practices and technologies have adapted over time; and finally, how observability and reliability come together and the role the two of them play in customer experience and customer success. And the last bullet is a copy-paste mistake: if you watch my previous presentations, you can learn how chaos engineering can help you shift reliability left. Thank you. I usually upload the links to the slides on my Twitter, so you can find them there, and I'm going to add a slide with a reading list and such. Feel free to connect on LinkedIn, and thank you for joining me. I think we have 10 minutes for questions.

George, you talked about using the SLO as a way to decide how you consume your error budget, and you also said that when designing an SLO you link it to business data, to business impact. How do you use that, particularly at scale, to decide how big the error budget for a particular customer should be? In other words, how do you size it properly for a particular customer?

Yeah, so the error budget wouldn't really be per specific customer; we use these as two distinct things. On the one hand, we have the availability of a CUJ. In the case of Citrix, a CUJ would be someone logging in and launching a new desktop, and we measure that independently of customers. If that CUJ has a really low SLI, that triggers the error budget block and so on. On the other side, the customer-specific side, we can measure the same SLI, login and launch a new desktop, for the Linux Foundation as a customer, or for the Convention Centre Dublin. So we can look at those individually, but the error budget is on a CUJ basis, not on a customer basis. Does that make sense?

Yeah, it makes sense, but you dodged the question about how you set the size of your error budget.

So essentially the error budget is what remains from the SLO. If the SLO says you have to be four nines available, 99.99%, the error budget comes out to roughly four minutes per month. That part is just a consequence of how you set the SLO. Now, how you set the SLO depends on a few things. In a perfect world, it would come from us discussing and deciding what's the best we can do based on historical data and those kinds of things. In the real world, we also add a sprinkle of knowledge about what the customers can tolerate. If you know a customer can tolerate three and a half nines, then even if you think you could reach four and a half nines, you're probably going to promise three and a half nines. And that's how the error budget then falls out. I hope I didn't dodge it this time. Thank you, that was great. Perfect.
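For reference, the arithmetic behind those numbers, assuming a 30-day month:

```python
# How much downtime per 30-day month each availability target allows.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

for target in (0.999, 0.9995, 0.9999):
    allowed = (1 - target) * MINUTES_PER_MONTH
    print(f"{target:.2%} -> {allowed:.1f} minutes of downtime per month")
# 99.90% -> 43.2 min, 99.95% -> 21.6 min, 99.99% -> 4.3 min
```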
I'll take this one. So, I'm in an SRE team and we have implemented most of the things you talked about in the presentation; basically everything is code and so on. But we are facing one challenge. We provide services internally to our developers, logging, monitoring and such, and we give them SLAs and SLOs on how those services are doing. For some things we want to actually guarantee something: say, that a query to our logging system completes in under 10 seconds some fraction of the time, say 99%, and we want to alert on that all the time. But in certain scenarios, say in the evenings, a single person comes along and runs one really expensive query that just takes longer, and in that time period there aren't enough other queries to balance it out. So we get alerted, because the error budget burns down really fast. I don't know if you have a solution to this; it's kind of a deep question, or maybe not.

Well, generally the approach is to think with the worst case in mind. That person who comes in the evening and breaks your alerting should be your baseline. Instead of what we usually do, focusing on everyone else who is normal, remember that he's also a customer: if you promise an internal SLA, you should cover him as well. Other than that, from what I understand it may also be a problem of right-sizing, maybe auto-scaling, when you have these spikes.

I think it's more a problem of the time window in which the queries get evaluated. That one person running the expensive query is the only one doing anything at that point, so it spikes the budget burn, because there's nothing else to compare against. If a lot of developers were querying the service it wouldn't be a problem, because most queries would come in under the 10 seconds; but if he's the only one in that timeframe, it goes wild.

Just to understand: it's a latency-based SLA you give them, based on query response time? Yeah, basically, we want them to be able to query the logging system and get a response in under some number of seconds. So yeah, it's more of a volume problem than anything. Maybe we can talk later. Okay, yeah, sure.
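One common mitigation for exactly this low-traffic situation, sketched here as an assumption rather than as what either team actually runs, is to require a minimum number of events in an evaluation window before the SLI counts toward alerting at all:

```python
# A sketch of a minimum-traffic guard for windowed SLO evaluation:
# skip windows with too few events for the ratio to be meaningful.
from typing import Optional

MIN_EVENTS = 30  # hypothetical threshold; below this, one slow query dominates

def window_sli(good: int, total: int) -> Optional[float]:
    """Return the SLI for an evaluation window, or None if traffic is too low."""
    if total < MIN_EVENTS:
        return None  # too few events: skip alert evaluation for this window
    return good / total

# One expensive query in an otherwise quiet evening window:
print(window_sli(good=0, total=1))       # None -> no alert fires
# A busy window where the ratio is statistically meaningful:
print(window_sli(good=990, total=1000))  # 0.99
```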
Could we go back to your observability architecture slide? Yep. So, how exactly are you using the OpenSLO spec here? This slide is a bit better for showing that. This is the general pipeline for observability as code, but what happens with the OpenSLO spec is that someone creates a file like this, it goes through the pipeline, and when it reaches deployment it goes over to this green area; that's where it gets deployed, and this service starts monitoring for that SLO. Oh, okay, now I get it. So the observability API is a custom API that you deploy using that pipeline? Yeah, it's a Python server we've built. So it's not your application that you are deploying? No, no. This is specifically for calculating contextual SLIs. And this is not based on Sloth; you built it in-house? No, this is something different. I used the Sloth example because, if you're using cloud-native tools, Kubernetes with Prometheus, or even a distributed setup with Thanos and Grafana, Sloth is the shortest path to implementing SLOs. Got it. And once the observability API is built with the OpenSLO spec, you don't have to go and... Oh, the queries you're using live in the observability API itself, so it doesn't need to go and change Splunk, New Relic, Elastic, or anything else. Yeah, exactly. Got it, thank you.

Hey, first of all, thank you very much; it's a fascinating talk, and I'm glad I'll be able to ask you a million questions next month when you're on the podcast, Open Observability Talks. But one question that I think is very relevant to everyone: you talked about the CUJs, the critical user journeys, which I find the most challenging part, because it's not technology, it's about people and processes. I'm curious how, in an organization as big as yours, you facilitated the discussion and brought all the stakeholders on board to come up with an effective way to define the CUJs, and also gave visibility to those individuals in the organization.

Yeah. As you said, it's less of a technical discussion and more of a customer-empathy discussion. That's where it normally starts, and in most of our cases you would have a PM and an architect talking it through and deciding. Essentially, what you should have in mind is: what's the path a user is going to go down that affects revenue? For us, for example, if launching new desktops isn't available, that's a big deal; that's what someone is paying for. So logging in and launching a new desktop is probably number one. Then: what's the next thing that's going to cost us money, that's going to cost us customers? Those are the questions we set out to answer. And you rightfully mentioned that scale is hard; you have to bring a lot of people to the table. But giving the driver's seat to product management is what solved the problem for us, because as they prioritize the backlog, they have the empathy for our customers; they know exactly what the needs are. So it's easier for them to define what their customers are doing, what they need, what they want. They aggregate the other stakeholders and they already speak with one voice for the customers.

Okay, I think we are out of time; Dotan's was the last question. You can hear us next month, we'll be together on Open Observability Talks, talking more about these topics, customer experience and SLOs and how they relate. Thanks a lot for joining me today. Thank you.