Good afternoon, everyone. We're really glad that you're here. I hope you're super comfortable. In about 15 to 20 minutes, I'm going to make you get up, so enjoy the time that you have in the seat where you are right now. My name is Nathen Harvey. I'm a developer advocate for Google Cloud. If you'd like to find me on Twitter, you can do so. My handle is very easy: it's just my first name and my last name all smushed together, nathenharvey. But my dad misspelled my first name, so it's N-A-T-H-E-N-H-A-R-V-E-Y. So if you want to follow me on Twitter, you can do that. And we're super excited to talk to you about SLOs and SRE today. Jennifer? Hi, everyone. So yeah, I was going to say, we're not seeing our slides at the moment. But you'll sort that while I'm distracting everyone. So I'm Jennifer Petoff. My friends call me Dr. J. You can also find me on Twitter at jennski, J-E-N-N-S-K-I, if you'd care to follow along. I'm a senior program manager on the site reliability engineering team at Google, based in Dublin, Ireland. I lead Google's SRE education program, and I'm one of the co-editors of the SRE book. Yes. Yay. And we have our slides. Yes, I fixed the slides. Yeah. So shall we crack on? We have a limited amount of time and a lot to say. So why SLOs? Why are we here today? Why are we talking about service level objectives? When it comes down to it, it's really about incentives, and how do you incentivize reliability? If you look at the traditional model of software development, there's inherent tension built in because of the incentives. You've got developers on the one hand who are building cool stuff. They're incentivized by agility, by moving as fast as possible, throwing things over the fence to operators, who are then incentivized to keep things running as smoothly and as stably as possible. So this is not necessarily a great setup. It leads to tension. And SLOs are really about setting agreed expectations among various parties.
So devs, ops, product management, leadership: get everybody speaking the same language, and figure out what level of reliability is actually needed to meet the needs of our users. And why are SLOs important? There is one assumption we need to talk about first, and that assumption is that the most important feature of any system is its reliability. Without reliability, users can't see all the shiny new bells and whistles that you've built into your product. And if you're not meeting the reliability expectations of your users, they will leave you. Fickle bunch, right? Yes. And how much reliability do you need? Your first inclination might be to say 100%. More reliability, all the reliability. Unless you're a product owner. OK. Because then you want 110% reliability, right? Yeah, give us more reliability. Just more reliability. So that could be your first gut reaction. But in reality, 100% is the wrong target. 100% is the wrong reliability target for basically everything. I mean, even pacemakers have barely two nines of reliability if you research that. And 100% essentially means that anything that goes wrong is an emergency by definition, even if your users won't notice. And that's a recipe for burnout. It's not sustainable. So let's figure out how to rationalize an acceptable level of failure. All right, but on the other hand, reliability is not a standalone product. You can't sell that on its own without something behind it. So there needs to be a balance. And SLOs really provide a principled way to agree on the desired reliability of a service. What's reliable enough from a user's perspective? And SLOs and error budgets actually go hand in hand: one minus your SLO is your error budget. And an error budget is essentially an acceptable level of unreliability. And it's a budget that can be allocated, so the budget should be spent. It's meant to enable moving at high velocity while you've got budget to spare.
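To make that arithmetic concrete, here's a minimal sketch of turning an SLO into an error budget and checking how much of a window's budget has been spent. The function names, the 28-day framing, and all of the numbers are invented for illustration; they aren't from the talk.

```python
# Error budget basics: budget = 1 - SLO, and bad events spend the budget.
# All names and numbers below are illustrative assumptions.

def error_budget(slo: float) -> float:
    """Fraction of events allowed to fail, e.g. SLO 0.999 -> budget 0.001."""
    return 1.0 - slo

def budget_status(slo: float, total_events: int, bad_events: int) -> dict:
    """How much of the error budget for one window (say, 28 days) is consumed."""
    allowed_bad = error_budget(slo) * total_events
    spent = bad_events / allowed_bad if allowed_bad else float("inf")
    return {
        "allowed_bad_events": allowed_bad,
        "budget_spent_fraction": spent,
        "in_the_red": bad_events > allowed_bad,
    }

# A 99.9% SLO over a window with 10 million requests and 4,000 failures:
status = budget_status(slo=0.999, total_events=10_000_000, bad_events=4_000)
print(status["allowed_bad_events"])    # ~10,000 failures allowed
print(status["budget_spent_fraction"]) # ~0.4 of the budget spent
print(status["in_the_red"])            # still in the black: full steam ahead
```

The point of the `in_the_red` flag is exactly the prioritization signal described next: budget to spare means keep shipping, budget exhausted means shift effort to reliability work.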
But when you're in the red, things need to change. You need to focus on things that improve the reliability of your service. All right, so what we want to talk about here is that delivering higher reliability actually has a cost associated with it. If you're too reliable, you're wasting resources, wasting money that could have been better spent elsewhere. And once your users are happy with your level of reliability, there's diminishing return on investing more along this dimension. So set objectives and try to be slightly over that target. You want to meet the needs of your users, keep them happy, but not too happy. Not too happy? So Jennifer, I know something about budgets. And I know how to spend money. I know how to spend a budget. I've seen you spend it. You've seen me spending money? Spending the money. So if we're going to use the error budget to sort of prioritize what work should we do, what are the consequences that we have to face when we exhaust or overspend that error budget? OK, so this is when you sort of declare bankruptcy, if you will, at least for the time being. Consequences of overspending that error budget could include things like freezing feature releases, focusing more on prioritizing action items from your post mortems, automating your deployment pipelines, improving monitoring and observability, or requiring an SRE consultation. So just having a second set of eyes kind of look over things to see, is this likely to cause problems in production? That's super cool. One of the organizations that I've worked with, I talked to them about their consequences of overspending their error budget. And it was that exact last one, requiring SRE consultations. So what did that mean? In practice, what happened was when they had exhausted their error budget, that was an indication to them that their systems weren't reliable enough. So an SRE came along to all of the development stand-ups.
So their development team practiced stand-ups every day, and they were talking about the work that they were doing. And it was really interesting, because they told me that what would happen is the SRE would come along, and they'd see, oh, here's this call off to an API service. And the SRE would just ask a question. What is your strategy when that API service isn't available? How do you back off on calls that you're making there? What's your retry logic like? Just asking simple questions would help put those developers back in the mindset of building operable, reliable applications. So once they got back into the right range within their error budget, the SREs could stop going to those stand-ups and get back to, I don't know, maybe automating some things. But the development team that had that SRE come along kept those questions in mind, even when the SRE wasn't there. And I think that's a really good outcome in terms of consequences of overspending your error budget. Exactly. You want to focus on things that improve your reliability, paying the price for those issues. Right. Oh, am I running out of power? Oh, no, already. No. Oh, I sure am. Oh, man. Thanks, Brian. Risky, risky. Yeah, that's super risky. Right, so. Let's take care of that first. Yeah, we'll just take care of that. But while we're doing that, I see why. Wrong port. It's not that it wasn't plugged in. It's just that it wasn't plugged in the right place. Thanks, Brian, for the save. Also, I'll make sure that you get that back at the end. So, right, what if I have a lot of budget left, though? Yeah, so what can you spend your error budgets on? Yes. So, yeah, error budgets can accommodate a bunch of different things. It could be releasing new features. It could be dealing with expected system changes, or inevitable failure. I mean, think of Murphy's law. We already talked about how 100% is never the right target.
So you can have inevitable failure in hardware, networks, et cetera, planned downtime, or even risky experiments that you might want to try. So it's just opportunities to move as fast as possible. Right, I look at having budget left over as a signal that, as an organization, we aren't moving fast enough. And if you think back on some of the talks, specifically Aaron's talk earlier, where he talked about the three ways, that third way is that feedback cycle. If we aren't moving fast enough, we aren't getting that feedback loop as tight as possible. We potentially are missing learning opportunities from our customers and from our systems. So how do we create an error budget? How does that work? Great thing to ask about. So we can talk for a minute about implementation mechanics. The idea here is to evaluate performance over a set window of time. So you can perhaps look at a 28-day rolling window. And the idea is that the remaining budget drives prioritization of your engineering effort. If you're in the black, so to speak, full steam ahead, moving as fast as possible. But when you drop into the red, again, you're focusing on toil reduction and just paying down the tech debt, so to speak. We often talk about a rolling window being important, as opposed to, say, defining things on a quarter boundary, because users don't really care about those quarter boundaries or that a quarter boundary has passed. So if something bad happens on that last day of the quarter, it's not like, yay, it's the first of the month, we get a big reset button, right? Yeah, we get a big reset. No one cares. Users don't forget problems overnight. So that 28-day window kind of helps account for that. Right. And, OK, still me, I guess. Yeah, sure. Got the plan. All right, so always fun tag teaming, N plus 1 presenters. So let's talk about service level indicators, or SLIs. So this is a quantifiable measure of service reliability.
So it's thinking about the things that you can monitor. And ideally, you want your SLIs to be correlated with your users' experience of reliability and the reliability of your system. So how do you go about choosing a good SLI? This formula is actually pretty useful for constructing an SLI. An SLI is typically formulated as the number of good events divided by the number of valid events, times 100%. And the reason this is a good formulation is because it ranges from 0 to 100%. It's easy to reason about. If it's 0%, everything is broken and on fire. If it's 100%, everything's grand, as they say in Ireland. And the denominator is valid events, as opposed to all events, because some events might be considered invalid and you might not want to count them against your error budget. So things like 3xx response codes, or even things like 404s, you don't necessarily want to count against yourselves in terms of your reliability goals. OK, so these happy and sad, or hungry emojis, I'm not sure. Hangry, maybe? Hangry, hangry. You don't want hangry customers? Yeah, we do not want hangry customers. But we do want happy customers. So this is really about the key point that understanding the happiness of your users means measuring their experience as directly as you possibly can, obviously considering the costs of doing that as well. Users typically become unhappy when a service doesn't behave according to their expectations. Like, I thought it was going to do a thing. It's not doing a thing. So things like the database is down, or the load balancer is sending requests to bad back ends. Users don't care why stuff is going belly up. They just care: the website doesn't load, I am sad. Or the website loads slowly, and I am also sad. So figuring out how to quantify "the website does not load" or "the website is slow" from your monitoring data is what we're getting at here. Right. And so this session that we're doing this afternoon is actually a workshop.
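The good-events-over-valid-events formula can be sketched in a few lines of code. The classification rules below (treating 3xx and 404 as out of scope, and 5xx as bad) are placeholder assumptions following the examples in the talk, not the workshop's official definitions.

```python
# SLI = good events / valid events * 100%.
# Events here are just HTTP status codes; the rules are illustrative.

def is_valid(status: int) -> bool:
    # Redirects and 404s don't count against the budget at all (one
    # possible choice; you could instead count 404 as a success).
    return not (300 <= status < 400 or status == 404)

def is_good(status: int) -> bool:
    return status < 500  # server errors are the failures we own

def availability_sli(statuses: list[int]) -> float:
    valid = [s for s in statuses if is_valid(s)]
    if not valid:
        return 100.0  # no valid traffic, nothing to hold against us
    good = [s for s in valid if is_good(s)]
    return 100.0 * len(good) / len(valid)

statuses = [200] * 97 + [500] * 2 + [404] * 10 + [301] * 5 + [503] * 1
print(availability_sli(statuses))  # 97 good out of 100 valid -> 97.0
```

Because the result is always a percentage of valid events, it stays in that easy-to-reason-about 0-to-100% range regardless of how much traffic you see.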
I know it doesn't feel very workshoppy. It feels like this is another presentation. But we do have some exercises for you coming up. And so what we're going to do is we're going to help you. And by help you, I mean, we're going to ask you, and then we're going to watch as you do this. We're going to ask you to write some SLIs and some SLOs for an application. Now, it would be awesome if we wrote the SLIs and SLOs for your application. But there are many of you here and many different applications represented in the room. So in order to sort of level set everyone, we're going to give you an application. And by give you an application, what I mean is we're going to draw you a picture of what an application architecture might look like. And so I'm going to just introduce our game really quickly. Now, you don't get to play the game. You get to play the role of the SRE that's responsible for managing this game. So we have this game. It's an online game. You can play it on your phone. It's called Fang Faction. The details of the game don't really matter. The high-level things that you need to know are: each one of you, or all of your customers, could be a player. A player can log in on their phone, on a computer, on a tablet. It doesn't matter. When they come into the application, the first thing they'll hit is the load balancer. And then the load balancer will send them either to the website where the game runs, or if they're actually playing a game right now, they'll be hitting the API servers. Now, the important thing to know is that as you're playing this game, we're keeping score, because what's the point of playing a game without keeping score, right? Yeah, we're all competitive people, right? It doesn't make any sense otherwise, right? So since we're keeping score, you also want to know as a player, how do I fare against other players? So we have this concept of a leaderboard. So when you go and look at your profile, you can see information about yourself.
You can see sort of where you sit in the leaderboard. And if you just imagine all of the different processing flows, like you're playing a game, we capture the score, the score needs to end up on the leaderboard, you come and look at that, there are a lot of moving parts to this application. You definitely don't need to understand all of the details, in fact, to write an SLI and SLO. But this is the application that you're all going to write an SLI or SLO about. And we also have some handouts that we're going to give you in just a minute. OK. Yeah. So I mentioned that you can log in and look at your profile. So here's a typical profile page. I can see a picture of myself. So each user has their own avatar. Again, you can see information like their name, which faction they're a part of, where they sit on the leaderboard, what their top scores are, and so forth. So you can think of this as just, like, a CRUD page. Like, I can come in, I can create a profile, I can update my profile, I can delete my profile potentially, although it looks like our product team hasn't prioritized deleting a profile, just updating. So never mind that bit. We'll get to that later. Yeah. So what might an SLI look like if we were to dig into this particular page, or this particular part of our application? Well, before we get there, I just want to step back and talk about this idea of having a menu of different SLIs, or service level indicators, that you can actually select from. So as you think about your application, I want you to think about how your users interact with your application. Do they follow sort of a request response? Do they do something like data processing? Maybe your application does transcoding of video files. So a user takes a video file, they upload that video file, you take it, you process it, and you spit it out as a different format. Maybe it's an MP3 now. So here's just the audio. So that would be a data processing transaction.
Or maybe you're writing data to disk, or to some cloud storage or something like that. So how does that storage actually work? With each one of those different types of transactions, we actually select or offer a couple of different SLI types that you could create. So in that simplest form, the request response, we might have an SLI, or service level indicator, that looks at: what is the availability of that page or resource that you're requesting? What's the latency? How long does it take to return that information to you? Or maybe even, what is the quality? Like, did we get the right things? No, that was good. You can go ahead. So this is how you start writing SLIs. You pick a critical user journey, a thing that is really important to your users. And your users, that's the users of your system. So you don't say, well, our service level indicator, I heard SLIs are metrics, I know an easy thing to measure is CPU utilization, so my SLI is gonna be CPU utilization. Is that a good one? Do I care about that if I'm using a game? I care about it. I'm a sysadmin, man. I need to know. I know, but I'm playing Fang Faction and, you know, like, I don't know. So you don't care as a customer? No, I'm like, no, sorry. That was easy to measure, but probably the wrong thing. So it is true that every SLI is a metric, but not every metric is an SLI. Not every metric is created equal. That's right. That's true. So we have to sit down and understand the application and our users and what they want to do with it. So we've decided through consultation, not just as engineers, we've invited the product team along, the business people along, whoever those business people are that live in the air quotes, they came along and they told us, this is what's important for our application. A user should be able to view their profile page, and they should be able to do that successfully. But we are engineers. So profile page, view it successfully.
What does that even mean? Let me go to the next slide. So how do we define success? What does it mean to successfully look at that profile page? And where is that success or failure recorded? So we just start iterating on these questions to really flesh out what this service level indicator looks like. As Jennifer mentioned, a good SLI is gonna be some percentage or a proportion. So we say it's the proportion of valid requests served successfully. But again, that's not precise enough for us to put a measure around. What does it mean to be valid? So a valid request might be an HTTP GET request against two specific URLs. So a user makes a GET request to the profile page, and they make a GET request to the avatar that shows up on that page. So let's look at the HTTP status codes for each of them, because we have to figure out what it means to be successful. So what is a good success indicator for an HTTP status code if I go to a page and make a GET request against it? 200. 200. All right, so let's look at that. So 200 is good, 500 is better, right? No, that's not how it works, is it? No, so 500 bad, 200 good. So what is success? Let's actually put some numbers around that. So we might say that a success is a 200, a 300, or even a 400. Like, maybe it's a 404. We just say we don't actually care about those, so we have two choices. We could either say that is a success, or we could cut it out of our valid events, as Jennifer was talking about earlier. But then we also have to ask, well, where are we going to measure this? So remember back to our application architecture, there are a lot of components. There are obviously multiple web servers, although in the diagram, I think there was just one. But we all know that's a lie. So it's an architecture diagram, I guess. Okay, architecture, all of that. So where are we gonna actually measure this? We're gonna measure this at the load balancer.
So we're gonna look at all of these HTTP status codes for these specific URLs, and we're gonna come up with a percentage. How many of them are good? How many of them are bad? What is the total number that we have? And we might do something very similar for latency. So again, with latency, we'd work through all of those same questions, because the product owner said it should load fast. And maybe what we said was, okay, thanks, product owner, that's a great thing that we can do. You can go eat lunch. We're gonna sit here and figure out what the details of "load fast" actually mean. And we're gonna come up with some metrics, which we'll then share with you. So this is the process that we go through to come up with good service level indicators. Great, yeah. All right, so we've talked about SLIs, and that's the way of basically measuring system performance. But in reality, services need SLOs. And that's the thesis of this entire workshop. What is the target that you're actually aiming for? What level of performance is actually good versus not good? And again, service level objectives: this is the reliability target for an SLI. The SLO is your target. The SLI is the measure that tells you if you're meeting your SLO or not. And SLOs are typically just below 100%, so say 99.9%, three nines, et cetera. But whatever's appropriate for your particular set of users. And again, can't emphasize this enough: the SLO should ideally be set from the perspective of your users, rather than just what's easy to measure or convenient to engineer. And also, when you think about three nines, four nines, five nines, don't just add nines because more nines sounds better. Think about your users. What do your users actually care about? Maybe your business is one that has a specific set of users that only utilize your service from 9 to 5 on weekdays, central time. Do you need 100% availability for that? Take that into consideration. You have to think about those users.
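Pulling those threads together, here's a hedged sketch of what measuring this at the load balancer might look like: classify each logged request for the profile URLs, compute an availability SLI and a latency SLI, and compare them to targets. The log format, URL paths, 500-millisecond threshold, and target numbers are all invented for illustration, not the workshop's answers.

```python
# Sketch: profile-page SLIs computed from load-balancer logs.
# Each log entry is (url, http_status, latency_ms). All details assumed.

PROFILE_URLS = {"/profile", "/profile/avatar"}  # hypothetical paths
LATENCY_THRESHOLD_MS = 500  # one way to pin down "load fast" (assumed)

def profile_slis(entries):
    """Return (availability %, latency %) over the valid profile requests."""
    valid = [(status, ms) for url, status, ms in entries if url in PROFILE_URLS]
    if not valid:
        return 100.0, 100.0
    n = len(valid)
    good_avail = sum(1 for status, _ in valid if status < 500)
    good_fast = sum(1 for status, ms in valid
                    if status < 500 and ms < LATENCY_THRESHOLD_MS)
    return 100.0 * good_avail / n, 100.0 * good_fast / n

def meets(sli, target):
    return sli >= target

entries = (
    [("/profile", 200, 120)] * 950         # fast successes
    + [("/profile", 200, 900)] * 40        # slow but successful
    + [("/profile/avatar", 503, 80)] * 10  # server errors
    + [("/health", 200, 5)] * 100          # not a profile URL: not valid
)
avail, fast = profile_slis(entries)
print(avail)  # 990 good of 1000 valid -> 99.0
print(fast)   # 950 fast of 1000 valid -> 95.0
print(meets(avail, 99.95), meets(fast, 90.0))
```

With made-up targets of 99.95% availability and 90% of requests under 500 ms, this traffic sample would miss the availability target while meeting the latency one, which is exactly the kind of signal that feeds the error budget.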
All right, so what goals should we set for the reliability of our journey when we're thinking about the game that we just talked about? So again, your SLOs should have both a target and a measurement window. So you might have an availability SLI type and a latency SLI type for the user profile. An objective could be something like, for the first case, 99.95% successful in the previous 28-day window. And for latency, 90% of requests served in less than 500 milliseconds in the previous 28 days. Right, and as you mentioned, it should be a rolling window. So we're always looking back over some period of time. I think the other thing that's important to note here is that for this one critical user journey, loading up the profile page, we actually have multiple SLOs associated with it. So as you look at a critical user journey, you might have one to three SLOs associated with that user journey. So you need to think about what performance the business actually needs. SLOs typically represent business requirements, or that's what you're aiming for. And if you set an SLO that's sort of sitting at that bound between happy users and sad users, you can think of these as aspirational SLOs. So you might not be meeting those right away when you're first starting your journey and starting to set these things, but it's basically what you wanna work towards with engineering effort over time. So, aspirational SLOs. But the good news is, I guess, user expectations are strongly tied to past performance. So you have data to start with, and start with historical data if you have it. And you can think of these as achievable SLOs. If you look at your performance in the past, you can set an SLO based on what you think you can actually achieve. And then the difference between these aspirational and achievable SLOs is actually a pretty useful signal. If they're close together, you're delivering what the business needs.
If things are far apart, it makes it clear that there's perhaps some work to do to meet those expectations of your users. That's right, okay. So we're mostly done with the talky talky. You survived it, way to go. Awesome, I hope that was insightful. Now we're gonna put you to work. So I'm gonna describe the whole process to you, and then I'll ask you to get up and move. So you don't have to move yet, but- But get ready, yeah. Yeah, get ready. On the front of this stage here, we have four or five things that we'd like you to take. One is this little workbook that says The Art of SLOs on it. This has a ton of great content in here, and some details about that Fang Faction application that we're gonna write SLIs about. And then there's also a worksheet here, and this worksheet is broken down into five sections. The first is the user journey. So within this workbook, we give you more details about that application, and on pages 18 through 22, and it says so here, it describes a couple of different user journeys. So you're going to select one of the user journeys. You'll then look at the SLI menu. Is this a request response journey? Is this a data processing journey? And select an SLI type. You'll then write out an SLI specification. So, like, the profile page should load successfully. That's great, but what does that actually mean? How are we actually going to measure that? And then finally, you'll come up with an objective. So we want the profile page to load successfully 99.5% of the time over the past 28 days, something along those lines. So what I'm going to ask you to do: I want each of you to find a group of, let's call it five other people. So we should have groups of, like, six, seven people, something along those lines. You can, I don't know if I'm allowed to say this, you can move the chairs if you need to, but they might be connected in rows.
It might be impossible, but it doesn't look like it. You can probably move the chairs, it's fine. I encourage you, in your groups, to not all stick together if you're all from one company. Let's get some cross-pollination happening. So, like, I work at Google, Emily works at Microsoft. We'd be great on a team together. Me and Jennifer both work at Google, so we should separate, go to different groups. Let's see, so you're going to do what it says on this slide. Choose a critical user journey, an SLI specification, implementation, and think about some business needs. Now, the last page of this document has some graphs. As Jennifer mentioned, you can set SLOs based on past performance. So here's the past performance. Now, admittedly, you don't have enough information to actually set a good, proper SLO, but guess what? In your organization, for your applications, you don't have enough information right now to set a good, proper SLO either, so this is a good exercise in that. And we also ask you to do this in groups so that you can negotiate with one another and kind of discover this together. So what questions can I answer for you? Yeah, look, I don't want to do the hard work of writing SLIs for my application. I'm going to go run this Fang Faction, but I want you to create SLIs and SLOs for me. So it's not a brainstorm. It's about coming to consensus as a group. Yeah, come to consensus. What are some good SLIs and SLOs? And if we have time at the end, which we probably won't, we'll ask maybe a group or two to share what they came up with. All right, does that answer your question? Yeah, good. Okay, ready? And how long are we giving people? We're going to give you 17 minutes. Whoa, okay. Yeah, which is not enough time. So you should get up now and come get the stuff. Get the stuff. Yeah, there are pens up here as well. And my business card and Jennifer's business card, if you want to take our contact info, yes. So each individual should take a booklet.
There are enough worksheets for each individual to take one. At least each group should have a worksheet. Jennifer and I will also kind of walk around, check in on groups, see how you're doing, ask you questions, and what have you. So find three to seven other good people. They're all good people in the room, by the way. People that you don't work with. All right. All right, so I hate to interrupt, but I do want to be respectful of both your time and the time of the conference. So unfortunately we're out of time. Now, my experience, and maybe this is yours as well, is that when you read about and think about SLIs and SLOs, they just make sense, and they seem super easy. And then when you sit down to actually come up with one, it's maybe a little bit more difficult. Yeah, what did you think? Yeah, cool. Well, we have just a couple of parting words for you. Let me pop along here. Yeah. So we did want to point out that both the Site Reliability Engineering book and workbook are available online at google.com slash sre. So it's all free to check out. And there's lots of great content in there about SLOs, SLIs, et cetera. Absolutely. And just a quick word on those. The SRE book is really, here are all of the practices. And then the SRE workbook is, how do you take those practices and put them to work? So there are exercises and guidance on how to actually put these practices to work in your organization. Exactly. And then finally, there's a course on Coursera if you want to learn more about SLOs. You can check out this particular course created by some of Google's CRE team, if I recall correctly. That's right, yes. And then just the final, final word. I have two things. The first is I would encourage you to stay in your groups if you want to keep working through the exercise. You can certainly move back into the sponsor area.
I also encourage you, as you complete or give up on your worksheet, whatever the case may be, if you don't mind stopping by the Google table and letting us just snap a picture. It's really good for us to see sort of how did you progress, what were you thinking. And of course, we'd love to have a conversation with you about what you came up with. And then the really, really final, last thing I'm going to say is that if you're interested in doing a workshop like this at your organization, reach out to me. We have people within Google, myself included, and folks on my team and other teams that would love to come in and just sit down and talk about how this works, not only with you, but with a larger cross-section of your organization. So thanks so much for joining our workshop today. Yeah, thanks everyone. Yay.