Hi, I'm Joseph Sandoval. I'm with the Adobe Advertising Cloud. I currently run the platform infrastructure team that powers all of our data centers and public cloud, across three continents and six data centers.

I'm Kit Merker, and I'm from Nobl9, and I want to talk about service-level objectives. You know, when you think about reliability, it's all about perception and reputation. If you have a great experience with a piece of software, you might tell one person about it. But if you have a bad experience, you're gonna tell everybody you know. And this is because of some psychology behind remembering bad experiences more vividly than good experiences. On top of the reputation hit of having a bad experience, if you can find a problem in development, it's significantly cheaper than fixing it in production. So when you're thinking about reliability for your organization, the goal is basically to not have people have a bad experience with your software, and you want to keep the cost down by fixing things early, before they get out to customers in the first place. But reliability falls on a spectrum. If your reliability is too low, your customers are gonna find some other way to accomplish their task. If you're below a certain threshold, like an SLA, you may even be on the hook to pay out credits to customers if they tell you about the SLA violation. So low reliability is bad. But on the other end of the spectrum, as you try to achieve 100% reliability, it gets increasingly more expensive to hit that target. We call this the sort of gold-plated infrastructure. One study I found said CTOs estimate 30% of their cloud spend is wasted, which means literally billions of dollars a year are going to cloud providers without necessarily moving the needle in terms of customer perception. So a service level objective, or SLO, is the balance point somewhere between too-low reliability and too-high reliability. And it is defined as the point at which customers go from being unhappy to happy. The result of this is that your team moves from being burnt out chasing the pager to being productive. The gap between the SLO and 100% is referred to as an error budget. We call it a budget because you can spend it, and you can spend that budget on changes that add value to customers. If you say, I want a three-nines SLO, 99.9%, what you're really saying is that 0.1% of error can be safely ignored. And this is the power of SLOs in terms of trying to manage reliability. When we think about the actual definition of an SLO, if you're not familiar, there are different types of SLOs, but generally speaking they're calculated as a proportion: good events, or events that meet a certain threshold, divided by all events. And here you can see some examples of an availability SLO and how you would describe it for implementation, say response codes at a specific load balancer or for a specific API. In the case of latency, ideally you would set your thresholds based on user experience. So you wouldn't just pick some random number of milliseconds; you'd actually say, oh, this is noticeable by customers, or this is an annoying experience for customers, and you would tie your SLOs to that actual user experience. The point of the SLO is to figure out what is gonna make customers happy, and defining the SLOs based on that goal is what gives you the power of improving reliability for your service.
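As a minimal illustration of that proportion idea and the three-nines error budget, here is a small Python sketch; the request counts and the "non-5xx means good" definition are hypothetical, not from the talk.

    # Sketch: an availability SLI as good events / total events, and how much
    # of a 99.9% SLO's error budget those numbers would consume.
    # All figures below are made up for illustration.
    def availability_sli(good_events: int, total_events: int) -> float:
        return good_events / total_events

    slo_target = 0.999                    # three nines
    total_requests = 1_000_000
    good_requests = 999_400               # e.g. non-5xx responses at the load balancer

    sli = availability_sli(good_requests, total_requests)
    error_budget = 1.0 - slo_target       # 0.1% of requests may fail
    budget_consumed = (1.0 - sli) / error_budget

    print(f"SLI: {sli:.4%}")                               # 99.9400%
    print(f"Error budget consumed: {budget_consumed:.0%}") # 60%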
When you're measuring an SLO, one of the key ideas is to bind it to the happiness of users, but how do we actually do that? In practice, there's this idea of the happiness test. I got this from the Google Art of SLOs framework. The idea is that you take a service level indicator, think of it as a service KPI, and you look for correlations to systems that objectively tell you whether customers are happy or the business is happy. So think about how you could have high correlation between your selected indicators and the number of support calls you're receiving, or the conversion rate of customers, or whether people are complaining about you on social media. By using these external signals, you can sharpen your SLI or SLO to be more accurate and to reflect more clearly what's really going on in the minds of customers. The other key part is to try to find an early warning. If it has high correlation, it should happen at the beginning of an outage, not just at the end. When you think about error budgets, as I mentioned before, an error budget is an allocation that you've defined up front for how much error is acceptable in your system. And instead of having one team fighting to make changes and another team fighting for stability, the question becomes, how do we wisely spend the error budget together to achieve the right kind of result? One could argue that we're spending error budget right now on a global pandemic, but there are lots of places you can spend error budget, whether that's shipping new features, making config changes, dealing with a spike in demand, or other kinds of unexpected issues from third-party services. All of those together can be managed as part of this risk management idea of taking service level objectives that are reasonable and applying them to your services to maximize customer happiness and balance the cost of infrastructure. When you think about the newer service frameworks that are out there, the service-oriented architectures or the microservices we see now, there are lots and lots of service dependencies. And so when you think about the SLO, it's almost like an API between engineering teams. When you have teams running all these different services with dependencies, they're all interconnected, so you can never really run a service in isolation; the reliability of a service is, to a large extent, the sum of its dependencies. Now, if we break one of these services and it's very low in the dependency chain, we're gonna see that effect bubbling up throughout the entire system and having customer impact at the edge. But in another scenario, you might see the exact same customer impact, but really the failure is isolated. One of the key challenges teams have is, how do I figure out which service is actually broken? Instead of alerting and paging everybody, find the one that's actually broken. So if you take a step back and you have SLOs across all of your services and service boundaries, by looking at the patterns of which SLOs have been violated, you can actually isolate and more clearly alert the team that has broken infrastructure. And this is a really powerful use case for service level objectives. All right, I wanna hand over to my friend Joseph here, who's gonna talk about SLOs in the context of OpenStack.
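A rough Python sketch of the happiness test Kit describes above, correlating a candidate SLI with an external signal such as support ticket volume; the hourly data here is entirely hypothetical.

    # Sketch: correlate a candidate SLI with hourly support ticket counts.
    # A strong negative correlation suggests the indicator tracks real
    # customer pain. Data is invented for illustration.
    import statistics

    hourly_sli     = [0.999, 0.998, 0.995, 0.970, 0.985, 0.999]
    hourly_tickets = [2,     3,     4,     40,    12,    2]

    corr = statistics.correlation(hourly_sli, hourly_tickets)
    print(f"correlation: {corr:.2f}")  # strongly negative: SLI dips line up with ticket spikes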
Hey, thanks, Kit. Yeah, great setup. One of the things, when we look at our infrastructure at Adobe Advertising Cloud: we have quite a bit on OpenStack, and we have things in Kubernetes. So there are a lot of entry points for us to introduce these service level objectives. For myself and the team, one area we started to examine was, where do we start? We have five data centers, and each of them has quite a bit of compute; overall there are over 200,000 cores. So we could look at it in a couple of different ways. But the first thing we always asked ourselves was, where is our customers' pain? Who are we really serving? I think that really helped to drive it, because as engineers we tend to start digging down deep and trying to find all the minutiae underneath it. Actually, if you take a step back, there's a good article in the OpenStack docs that gives some architectural guidance around service level objectives. What it encouraged was to look at your cloud in terms of the data plane: maybe things like the virtualization, like how long does it take for a VM to spin up? Or maybe it's a team that really needs low latency, or it's storage, so you're probably looking at create events in your logs. Normally these are things that always have to be available, and they're good starting points to measure from. Other environments could be focused more on the control plane; that can depend. So you really have to check and do some homework on what is really meaningful for your customers. The other thing you should be aware of, and I have to call out, is that there are things you can't control. So when you're starting to identify what your targets are, whether it's HTTP requests or events within Ceph storage, be aware that there are some underpinnings you may have to factor in. Then there are planned maintenance windows; some teams include them in their SLOs, others don't. And there are also services that interconnect or have dependencies outside your control. So you want to balance these things out when you start. Now that I've laid the groundwork, what I'd like to show you is how this looks in action. So let's cut over to the demonstration. What we did is we took one of our environments, and we have a tool we use called GoGENA that basically allows us to send traffic. What we wanted to demonstrate is what happens when we first set up a service level indicator and put some thresholds around it. The real power comes in catching these things when they start to escalate. You don't wanna get to the point where something completely breaks and your customer is unhappy; you really wanna see the trend. That's where the power of this comes in, and what you're really trying to evolve into your engineering culture. For our demonstration, at Adobe we've been using Nobl9 to help us track our SLOs in real time. In this example, this is a mock example we're using in one of our Kubernetes clusters that's running on OpenStack.
As you'll notice in here, we've defined the SLO, and the great thing about this format is that it fits our normal way of running Kubernetes applications. One thing to note is that any PromQL query can be inserted here, so depending on what you may already have, you can easily integrate it into the Nobl9 format. You'll see that we have CPU usage as the metric we're using in the demo. What we're doing is sending traffic to it, and when CPU utilization goes over half a CPU, we start eating into our error budget. But mainly we don't wanna get to the point where we break something; the point of SLOs is to help catch these things before they get to that threshold. And as you'll notice, we have set a target of 99%, which helps us be aware, alert, and able to take action if we start to approach that threshold, whether that's adding more compute to that resource pool or maybe rolling something back. Maybe we made a change that increased consumption and we have to revisit it. But ideally this SLO helps us be preemptive before we have actual customer-facing impact where the customer is not happy. So now a change has happened, and we're noticing that all of a sudden the concurrency, the CPU consumption, is starting to go up. The threshold that we set is now starting to be exceeded, and customers are being impacted. Ideally, as you see us hitting that threshold, instead of waiting for it to get to the point where things are burning, you catch it beforehand and decide to take action. One of the things we've been trying to do in our environments is hand this to the people who can actually make those decisions. So maybe in this situation there's an engineering leader who can make a decision like, hey, we need to dedicate more resources to this so we can address exceeding the error budget, or possibly we need to stop doing releases. Maybe it's even a rollback, or we need to focus on reliability because we're burning through our error budget and need to get things back into a healthy state again. So as you can see from that demo, if you're using SLOs in your environment, among engineers and site reliability engineers, you're able to react easily and not let things get to the point where your customers are impacted. But the thing that's very impactful about SLOs, which I've realized as I embrace them more and our team embraces and utilizes them, is that they start to evolve the culture. A lot of SREs start at the bottom, and we realize, hey, this is a way for us to gate and control when change happens, or to determine, yeah, we can keep moving ahead, keep releasing, things are good, and we can ride that line and make sure our customers aren't impacted.
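To make the logic of that demo concrete, here is a simplified Python sketch of the same idea: count CPU samples over a half-CPU threshold against a 99% target and decide what to do as the error budget burns. The sample values and decision thresholds are hypothetical, and this is not the actual Nobl9 configuration.

    # Sketch: CPU-based SLI against a 99% target, with a crude burn check.
    # Samples at or below 0.5 cores count as "good"; values are made up.
    THRESHOLD_CPU = 0.5
    SLO_TARGET = 0.99

    cpu_samples = [0.31, 0.42, 0.47, 0.52, 0.61, 0.44, 0.39, 0.55, 0.48, 0.40]

    good = sum(1 for s in cpu_samples if s <= THRESHOLD_CPU)
    sli = good / len(cpu_samples)
    budget_burned = (1.0 - sli) / (1.0 - SLO_TARGET)

    if budget_burned >= 1.0:
        print("Error budget exhausted: pause releases, roll back, or add capacity.")
    elif budget_burned >= 0.5:
        print("Burning fast: flag it to the engineering leader before customers notice.")
    else:
        print("Healthy: keep shipping.")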
Ideally, if I were starting out, I'd love to say I have a greenfield, meaning starting right from the top: product gives us, as a requirement, what they think are the nines needed for this application or service to make customers happy; that trickles down to the engineers; and by the time it gets to us in the SRE world, we basically know exactly what we need to do, how we need to architect it, how we need to build it to make our customers happy and to build the resilience into the environment. Oftentimes, though, we have to start with an existing environment. And the one thing I've realized is that a lot of us have done great jobs maintaining our infrastructure. I know our teams have maintained nines that, frankly, nobody really paid for, but yet we've got them. So you want to make sure you start a little bit below that, give yourself some room to evolve into it, and the same for your partners, so you can make sure that, hey, maintenance can get done and changes can still happen. You don't want to make it so rigid and difficult to manage that you're paralyzed and can't do anything in your environments when you start managing them by SLOs. The other piece is to establish a cadence. Okay, you've gone in, you've created these definitions, you're tracking them, you've given them maybe 30 or 60 days; now go back and establish a cadence where you're reviewing the environment: are you meeting the SLOs or not? Are you burning up your error budget and need to make some adjustments? This is a normal process, and it's gonna take time to evolve, so if you do this continually you start building it into the culture. The one thing that starts to change is that we're so used to an alert-driven culture, reacting to things when they happen, but what you'll start to see when you implement these into your engineering culture is that you start prioritizing what work gets done around the life of a service. For example, in the previous demo, if we were pushing a lot of change that was causing us to burn through that error budget, that would be a good indication that we need to slow down or stop, and focus on what's causing us to burn through the error budget, maybe reliability-related stories we need to prioritize so we can get ourselves back under the error budget. But when you do these things and implement SLOs into the culture, as I like to refer to it, it is SRE nirvana, and I believe it's engineering nirvana too, because you're all able to work together cohesively and know when you can safely move things forward as well as when you need to slow down.

Thanks, Joseph. So the last piece we want to talk about is the business-technical feedback loop and service level objectives. As Joseph was explaining, and as you saw in the demo, there's this important feedback loop between the business side and the technical side when we think about balancing engineering effort and customer experience. One way to think about this is to come from the top down and ask: how do I define SLOs from the business?
Now, you may have had the discussion with business stakeholders: well, how many nines do you want, how reliable should this really be? And you may hear, well, I want it to be 100% reliable, 200% secure, and with 300% more features than anybody else. Of course, that's not a realistic answer to the question of what the SLOs should be. So I want to share a different method that starts from the business and ends up calculating SLOs for you. The first thing is you've got to identify the top critical user transactions. Not all transactions are created equal. Browsing a catalog is very different from purchasing from a catalog. Browsing stock prices is very different from making a transaction or a trade. So understanding the critical users, who they are and what they're trying to accomplish, is the first step. If you are just calculating an average cost of downtime, you're going to miss the boat, because reliability means something different on Black Friday. It means something different during the Super Bowl. It means something different while market trading hours are open, and every business has critical times. The next question, after you've identified these transactions, is to estimate their volume. Ideally you're going to get this to the closest order of magnitude, so is it 10, 100, 1,000, et cetera, and understand how many transactions are happening within a specific time period, for a specific population of users, for a specific transaction. This will give you a sense of the relative priorities and sizes of impact for an outage in one of those given times. The third piece is to come up with a business metric, either in terms of revenue or maybe satisfaction, but some sort of dollar figure or business impact figure that would occur for each transaction in each case. One way to do this is to say, okay, we make a certain amount of revenue per year. We do 30% of our business during the holiday season. We do 1% of the entire year's business during a specific three hours of a day. Understand these different periods and look at the revenue mix or value mix across them. Now, I wanna walk through an example. Once you've done these three steps, the other thing to look at is, well, what does it cost to engineer a reliable system? So let me walk you through a way of calculating this. Imagine this one transaction in this hypothetical example is $10 million a year in value, okay? What we've done here is plot the blue line against different levels of nines. So one nine means 90%, two nines is 99%, then two and a half, three and a half, and four nines, okay? These are the reliability levels. Simply put, we said, okay, if we took our $10 million and multiplied it by that level of availability, what would the revenue be, assuming that we lost the business during the downtime? The red line represents a potential engineering cost. This might differ in your business, but you're gonna find that it follows this pattern: every nine you add is another zero on your infrastructure bill. So at one nine, that's a machine running under somebody's desk. Two nines, maybe you've finally got it in a VM somewhere. Two and a half, three nines, you start to add in things like failover, clustering, multi-region, et cetera. And to get to higher levels, you need to actually have change control, change management, CI/CD, testing, automated delivery, canaries, rollbacks.
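Here is a back-of-the-envelope Python version of that hypothetical: $10 million a year of transaction value, the value at risk at each level of nines, and an assumed engineering cost that grows roughly tenfold per added nine, scaled so that three and a half nines consumes the full $10 million. The cost curve is an assumption for illustration, not real pricing data.

    # Sketch: value at risk vs. engineering cost at different levels of nines.
    # The cost model (about 10x per added nine, ~$10M at 3.5 nines) is an
    # illustrative assumption, not real pricing.
    annual_value = 10_000_000

    for nines in [1, 2, 2.5, 3, 3.5, 4]:
        availability = 1 - 10 ** (-nines)          # e.g. 3 nines -> 0.999
        value_at_risk = annual_value * (1 - availability)
        eng_cost = 31_600 * 10 ** (nines - 1)      # assumed ~10x per nine
        verdict = "justified" if value_at_risk > eng_cost else "over-engineered"
        print(f"{nines:>4} nines: at risk ${value_at_risk:>12,.0f}, "
              f"cost ${eng_cost:>12,.0f} -> {verdict}")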
The cost gets really, really big, really fast once you get past a certain number of nines. Now, if I'm a business stakeholder, I'm looking at this and trying to decide, well, how many nines do I want? First of all, I can see there's a very clear limit here: at three and a half nines, I spend all of my revenue on reliability. Now, if you're NASA, that might be okay. You might say, look, every dollar we have, we're going to put toward reliability. That might be a valid case, but most businesses are trying to actually drive a profit. So we want to be somewhere to the left of that, and your intuition is going to say it's somewhere in the middle here, because you don't want to miss out on this revenue; you think you need to have this reliability. Well, I'm going to present the exact same data in a different way and help you see the right way to justify the number of nines. So I'm going to switch to just focus on the blue line, and now we're going to plot it logarithmically. The yellow line here represents the value at risk. It's actually the inverse of the revenue line, but because this is logarithmic, the very, very flat line we saw on the linear scale now shows quite a slope, and we see how much money is at risk relative to the different levels of reliability. This, again, is derived directly from the blue line. Now I'm going to plot the engineering cost over this line, and we're going to see where it crosses over. If I add the reliability cost line, it's the exact same figure, it just looks flatter now because we're viewing it logarithmically; an exponential curve looks like a straight line. And what you see is that the crossover point is much further to the left than we expected. I believe this is one of the key disconnects between business and technical teams: the required reliability, the justifiable reliability, for many transactions is actually significantly lower than people expect. And they're pushing for these high-reliability systems, not understanding the exponential cost associated with them. And this creates a massive disconnect between teams. The trick to getting truly reliable services is to understand the limits of reliability and to build and engineer a system that's justifiable. So if you're an engineer trying to get SLOs from your product managers and business stakeholders, go about it this other way. Think about the transactions, prioritize them, and then use that to back into the correct number of nines for a given set of indicators and transactions. Once you've done this process and started to build this experience with your business stakeholders, you can create this feedback loop, right? You see a change happening in your system; you can budget around the risk of that change. You can see if incidents are occurring as a result, meaning that you exhausted the error budget, and you can decide what to do with that. You do a post-mortem and make decisions: should we be working more on technical debt and making more engineering investments, or should we be reducing the reliability expectations? And that is what's going to drive this continuous improvement cycle that leads to higher reliability, customer satisfaction, and cost efficiency for the business, which is super critical for any company that's trying to deliver reliable services and grow.
So just to wrap up, a few takeaways, and Joseph will pile on as well, I'm sure. Number one, what is an SLO? It's a reliability goal that's centered around users. It even serves as an API between engineering teams, to figure out which services are breaking and which ones are working, to help drive better action and better improvement. It helps you become more proactive, so you're getting ahead of incidents before they turn into major outages. And it helps you manage the trade-off between engineering costs and delivering delightful experiences to users. And fundamentally this creates the business-technical feedback loop that lets you drive continuous improvement. Joseph, anything to add?

I can only stand behind everything you just said. For myself, it has definitely helped us evolve how we look at our infrastructure. And for us, moving towards an SLO-based approach, putting SLOs on the same level as incident response, has really transformed our lives and made things a lot easier for the teams; we now feel that we don't have to respond to every little thing. We actually know what the most important things are that matter: it's when things impact our customers. So it is definitely starting to evolve us as an engineering culture. Thanks everybody, it was fun to talk. See you later.