Thank you very much, Kristen, and good afternoon, Linux Foundation. It's wonderful to be with you today. Just a brief note: I'd like to thank all the organizers for everything they've done to make this go as smoothly as it has. It's on me now how smoothly it goes, so let's get on with it.

Welcome to Better Reliability with SLOs. If it seems familiar, this is heavily adapted from Datadog's 2020 SLO workshop, with slightly expanded material, but it's a great primer for understanding how and why to use SLOs, and that's a large part of what we'll cover today. Now, because of the ways that my brain is broken: when talking about SLOs, you also need to discuss SLAs and SLIs. The unofficial subtitle for this talk is "S-L-A, E, I, O, U — and definitely Y." To go inside baseball for a moment, no matter how I tried to write that out, it just didn't make sense on the page; you have to hear it said. And I apologize to anybody who didn't learn English as a first language: in the U.S., "A, E, I, O, U, and sometimes Y" is how we learn the vowels of the alphabet.

So, moving on, the things we're going to cover today are mainly SLIs, SLOs, and SLAs. We'll talk about what it takes to define quality targets, which feeds naturally into what error budgets are, and we'll go through some practical examples.

But you may be wondering who that bearded beauty is and why he's talking to you. Well, I'm Aldo. I used to be a productive member of society before I took this job, but I come from the systems and operations engineering world. Before accidentally going to the right conference, I was a fairly typical bitter sysadmin type. The DevOps community was still pretty new, but attending Velocity in 2011 was revelatory for me; that week literally changed my career. I'd simply never been in an organization that didn't have a contentious, often even antagonistic, relationship between the people who make things and the people who run things. Because of this, and the relationship between DevOps and SRE, I'm here to talk about one of the steps that enables a healthy transition into incorporating SRE and DevOps practices into your organization.

Now, we technical people love acronyms, and in this case we've got three TLAs that are very similar, so it's easy to get them mixed up. We're going to start with some nice tidy definitions just to clear things up and make sure we're on the same page.

We'll start with SLIs. A service level indicator is a quantitative measurement that expresses an aspect of a service, usually a metric. The key word here is quantitative: you have to be able to measure something with a reasonable degree of accuracy. It can't just be a gut feeling. SLIs are the basis of SLOs, so we'll look at those next. A service level objective is a target value for a service as measured via an SLI. The idea is to pick a reasonable value that should be maintained, or striven for, for a given SLI. We'll get into a few examples later on, but the core idea is that an SLO is a target. It's a goal, something to aim for, and — spoiler alert — ultimately a way of measuring the success or failure of a service over time. Now, it can reasonably be argued that SLOs are the basis for SLAs, even if that's not how things have historically run. They should be related to each other, but previously SLAs were generally created without referencing any kind of SLO. An SLA is a service level agreement: a contract that defines the results and consequences of meeting or missing one or more SLO targets.
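To make those three definitions concrete, here's a minimal sketch in Python. The names and fields are my own invention for illustration, not a standard or any product's API:

```python
from dataclasses import dataclass

@dataclass
class SLI:
    """A quantitative measurement of one aspect of a service."""
    name: str          # e.g. "request success rate"
    description: str   # how the metric is computed

@dataclass
class SLO:
    """A target value for an SLI over a time window."""
    indicator: SLI
    target: float      # e.g. 0.999 for "three nines"
    window_days: int   # e.g. a rolling 30-day window

@dataclass
class SLA:
    """A contract: what happens when an SLO target is missed."""
    objective: SLO
    consequence: str   # e.g. "10% service credit to the customer"
```

The point of the structure is the dependency: an SLO is built on an SLI, and an SLA is built on one or more SLOs.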
Words like agreement, promise, and contract can all be used to describe SLAs. The important thing is that an SLA is a statement that codifies what happens when an SLO target is missed. In principle, that result could be anything, but it commonly takes the form of a business decision.

Now, as you may be able to guess, there's more than just monitoring tooling at play here. It's truly a team effort when it comes to identifying, scoping, defining, and ultimately implementing SLIs, SLOs, and SLAs. In all of your thinking about this, I want you to focus on the user experience. Users in this case are customers, whether internal or external.

So let's talk about what a quality target is. Again, focusing on the user: how do they interact with your product? How do they use it? What is their workflow? Which services that you run do they interact with? What do your users want, and what do they expect? Now, we should say up front that not all values make good SLIs. System resources, quorum state, number of lines per commit — these values can have their uses, but your customers don't care about them. Focus instead on things that impact the user experience. In the realm of requests and responses: was the service available? How long did a request take? Do you have enough resources in play so that your customers' requests can be completed efficiently and quickly? For storage: can the data be accessed when it's needed, and how long does it take to read or write it? And if there are failures, hiccups, disturbances: is the data still there when you need it, and how long does it take to rectify the problem? In the category of pipelines: is the right data coming out of your data processing? And freshness: how long does it take for new data to be incorporated?

Now, SLIs are applied values, and your indicators should have a relationship to the user's experience. Look at these two examples: the number of requests to an endpoint that complete successfully, and the number of requests to an endpoint that complete within 500 milliseconds. There's a very subtle but important difference: the second one doesn't explicitly require a successful request, only that it completed in under 500 milliseconds. (I'll show a short code sketch of this in a moment.) Reliable failure conditions are indicators too. You should be able to see a failure and understand why. A failure isn't necessarily bad in itself, but it can point to resiliency that could still be built into your processes and your services.

Now, SLOs are applied SLIs. Objectives have both a target and a time window, and it's not uncommon for these to be complex statements, including percentiles of a metric over time. There's usually a penalty attached to an SLA, often financial, but those kinds of penalties are much more common when money changes hands between companies rather than between teams or organizations within the same company. SLAs address expectations and impacts.

Now, coming to error budgets: when we talk about an SLO's availability over time, the time remaining under the target is, in short, your error budget. We all love to move fast and break things, but when your customers' experience is on the line, you need to move fast and fix things. Failure is unavoidable, but how you respond is really important.
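Before we dig into error budgets, here's the sketch I promised of that subtle SLI distinction. It's a toy computation over made-up request records; the data shape is an assumption, not any monitoring product's format:

```python
# Hypothetical request records: (HTTP status code, duration in milliseconds).
requests = [
    (200, 120), (200, 480), (500, 90),    # note the fast 500 error
    (200, 650), (503, 1200), (200, 300),  # and the slow 200 success
]
total = len(requests)

# SLI 1: requests that completed successfully (2xx/3xx).
success_sli = sum(1 for status, _ in requests if status < 400) / total

# SLI 2: requests that completed within 500 ms, successful or not.
latency_sli = sum(1 for _, ms in requests if ms < 500) / total

print(f"success SLI: {success_sli:.1%}")  # 66.7%
print(f"latency SLI: {latency_sli:.1%}")  # 66.7% -- same ratio, different requests
```

The ratios happen to match here, but they count different requests: the latency SLI treats the fast 500 error as good and the slow 200 as bad. In practice you'd often want "completed successfully within 500 ms" as the indicator.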
The overarching idea is that you want to balance innovation and novelty with reliability and stability. So, when building error budgets, or determining what they should be: an SLO is ultimately identified by the product owner, and the actual objective is measured by a neutral party, like a monitoring system — something that just tells you the facts. The difference between your objective and the total time period is your error budget. For example, a 99.9% availability target over a 30-day month leaves you a budget of about 43 minutes of downtime. And you should spend your budget: like organizations that do annual financial planning, it's use it or lose it. Think of your error budget as time to play, time to innovate, or time to run some experiments. If the SLO is currently being met, you have freedom, you have room to move. So try some new features — maybe something experimental that's not quite ready to be called complete, but that you're ready to get feedback on. Maybe trigger some planned downtime to take care of maintenance that's been sitting on the sidelines. Now, if your budget is zero or negative, you should concentrate on that instead: freeze new feature development and improve your observability story. Make your reliability higher. Look at the ways your service is failing and try to mitigate them.

Now, having talked about that, let's look at some practical examples. We have a brand-new photo sharing site. I'm told that the O character is pronounced "f," so you would say this as photosite.net, which is fun and also, critically, not real. I also suspect that the friend who was collaborating with me on this is subtly messing with me, so we can poke fun at him later. In our photo application, again starting with the user's experience: how do users interact with our product? By clicking and viewing photos, doing searches, uploading photos. Their workflow: they log in, they search photos, they view and download them. What services do they interact with? The accounts back end, the viewing and uploading services, things like that. And what do they want? What do they expect? They want it to be fast and they want it to be accurate.

So, some of the indicators that photosite.net relies on. As far as requests and responses go, one of the key features is availability: could our services respond to the request? Latency: how long did it take to respond to that click? And throughput: do we have enough back end to make sure the requests are handled promptly? In the storage category, we look at availability: can the data that's requested be accessed when it's needed? Latency: how long does it take to read or write the data? And durability: is the data there when it's needed? In the pipeline category, correctness: was the right data returned? If you click on one thing and get another, you're going to be upset. And freshness: how long does it take for a new version of a photo, or processed results, to appear? Resizing is a very common feature — how long did it take to return that?
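To pull photosite.net's candidate indicators together in one place, here's a small illustrative sketch. Every name and threshold here is made up for the fictional site:

```python
# Candidate SLIs for the fictional photosite.net, grouped by category.
PHOTOSITE_SLIS = {
    "request/response": {
        "availability": "fraction of HTTP requests answered with a non-5xx status",
        "latency":      "fraction of requests answered in under 300 ms (assumed threshold)",
        "throughput":   "requests handled per second versus provisioned capacity",
    },
    "storage": {
        "availability": "fraction of photo reads and writes that succeed",
        "latency":      "p95 time to read or write a photo",
        "durability":   "fraction of stored photos retrievable intact",
    },
    "pipeline": {
        "correctness":  "fraction of requests that return the right photo",
        "freshness":    "time until a resized photo appears after upload",
    },
}
```

Each of these maps directly back to something the user feels: a click that works, a photo that loads quickly, a resize that shows up soon after upload.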
But when thinking about SLOs and the services we run, we need to consider all the services we run and depend on — not everything we run is something we wrote. In the Linux Foundation diaspora, I found some projects that fall very squarely into this category, such as continuous integration projects like Jenkins, Jenkins X, and Spinnaker, key value stores like etcd, and data collectors like Fluentd. There were a lot of projects. It had been a while since I made myself familiar with the LF catalog, and there were a lot of things I just hadn't heard of before or that were pretty novel. There was one about transferring hospital data between sites, some of which were offline or didn't have reliable internet access; I thought that was a really interesting project that I need to look into more. But the different services I mentioned — CI, key value stores, and data collection and pipelines — have very different operational profiles.

Now, in my experience, these three types of services fall under one of two operational models: either they're properly managed, or they were thrown together, nobody owns them, and they just run in the background until they fail. Very commonly, CI and CD tools fall into the latter category until a company reaches a certain level of maturity. But these tools are most often triggered by human behavior: new code releases usually come from a human initiating a change, triggering build and testing workflows, and ultimately producing deployable artifacts. So perhaps CI and CD tooling has different SLOs for working hours versus nonworking hours. For instance, if you have a team that's relatively co-located, at least in the same time zone, maybe your SLO for off hours is very different from when people are expected to be working. (I'll sketch what per-window targets might look like in a moment.) If you work on a team that's global, that kind of goes out the window. But if you do have different times when your services need to be available, you should also plan contingencies for situations when, for instance, an emergency response is needed during your typical nonworking hours.

Now, key value stores might not need much in the way of development or configuration changes, but they do need to be kept up to date periodically. How your consuming apps use the data provided by, for instance, etcd matters. You might need only a low degree of availability, such as when processes are generally long-running, there isn't much churn in the values, or values are only checked at startup. On the other end, you may have things that require a high degree of availability, like configurations that change live, or a system like Kubernetes that's constantly spinning up new instances, where quorum is often needed and configuration information is read fairly frequently because individual instances get recycled.

Now, if you're running a data collection or data pipeline tool like Fluentd, you probably work in one of two modes. The first, which I'd consider the more common, is always-on, for continuous or unpredictable ingestion patterns, such as a typical metrics collection service or shopping cart processing. On the other end, you might have predictable but high-burst data collection. In my mind, the best example is the Large Hadron Collider at CERN. They know when they're running experiments, but more importantly, they know when they're not. So there's a lot of downtime between sessions when Fluentd, if that's part of their toolchain, doesn't need to be available, and you can do a lot of maintenance and experimentation in the off time. But when they're going to run an LHC burst test, they need a code freeze, and that service needs to be rock-steady available during those times. These are very different operational profiles, so they should have very different availability targets.
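As promised, here's a minimal sketch of what per-window targets might look like for a co-located team's CI service. It's purely illustrative: the structure, hours, and numbers are my assumptions, not a recommendation or any tool's configuration format:

```python
from datetime import datetime

# Hypothetical availability targets for working vs. nonworking hours.
SLO_WINDOWS = {
    "working_hours":    {"target": 0.999, "hours": range(9, 18)},  # weekdays, 9am-6pm local
    "nonworking_hours": {"target": 0.95,  "hours": None},          # everything else
}

def current_target(now: datetime) -> float:
    """Return the availability target that applies right now."""
    if now.weekday() < 5 and now.hour in SLO_WINDOWS["working_hours"]["hours"]:
        return SLO_WINDOWS["working_hours"]["target"]
    return SLO_WINDOWS["nonworking_hours"]["target"]

print(current_target(datetime(2021, 6, 15, 14, 0)))  # a Tuesday afternoon -> 0.999
```

The point isn't the specific numbers; it's that a human-triggered service can legitimately promise less when nobody is around to trigger it.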
You should allow your teams the flexibility to describe the SLOs for their individual services' needs.

So now that we've covered the main parts, here are a few tips. Datadog has an SLO product that helps you make these decisions more easily, especially the day-to-day question: are we working on feature development or on resiliency? We have monitor-based and event-based SLOs. Monitor-based SLOs are built on monitors, which are typically tied to metrics, so they're values within a set timeframe; for instance, 99% of the time we want latency to be less than 200 milliseconds. Event-based SLOs are more akin to statements like a success ratio; for instance, we want 99% of our requests to have latency less than 20 milliseconds. (I'll come back to this distinction with a quick sketch in a moment.) The four golden signals view of metrics is often a useful guidepost for thinking about whether a metric is useful when choosing an SLO. The four golden signals give you the acronym LETS: latency, errors, traffic, and saturation. There's plenty of information out there, so you don't need to hear me talk about it now.

But no matter what you do, focus on your users. Step back from the interesting engineering problems of the day-to-day, develop your user stories and journeys first, and then figure out what SLIs apply to your thing. Also involve all your stakeholders. That includes, of course, your product managers and product owners and the engineers, but also your management and executive teams, because they need to be on board with the idea that 100% reliability isn't possible — and it's also not really desirable in most cases.

So when you get into SLOs and start on this journey, gain experience by starting small. Build out experiments and see what works for you. Build out data sets so you have reliable, historical data to start figuring out what your baselines look like. And at first, err on the side of naivety. It's a lot easier to start with a relatively loose target, say 80% success over a week, and make it more stringent over time as you find you actually need it. It's much harder, when you've started off saying "we need an SLO of four nines," to convince people that you really don't need that much availability and should go down to two. That will freak people out. So start small.

And remember that SLOs need to change. Reevaluate your SLOs and SLIs as your environment evolves. With that said, SLAs — the agreements with your customers — have to be able to evolve too. This is common; it's not a great practice, but it's common, as when you get updated terms and conditions from any number of services, from your bank to your video game platform. Ideally, being more transparent about those changes is better, but I can't make that decision for you.

Now finally, tooling matters. Having that independent view of what your success looks like at the moment, and being able to share what your current status is, can make a world of difference.
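And here's the quick sketch of that monitor-based versus event-based distinction. Again, it's a toy computation with invented data, not Datadog's implementation:

```python
# Toy data: one monitor check per minute over ~100 minutes, plus the
# individual request latencies (in ms) observed over the same period.
minutes_ok = [True] * 97 + [False] * 3          # monitor status per minute
request_latencies = [150] * 950 + [300] * 50    # 1000 individual requests

# Monitor-based (time-based) SLO: fraction of time slices the monitor was OK.
time_based = sum(minutes_ok) / len(minutes_ok)

# Event-based SLO: fraction of individual requests under the threshold.
THRESHOLD_MS = 200  # illustrative threshold
event_based = sum(1 for ms in request_latencies if ms < THRESHOLD_MS) / len(request_latencies)

print(f"time-based:  {time_based:.1%}")   # 97.0% of minutes were healthy
print(f"event-based: {event_based:.1%}")  # 95.0% of requests were fast enough
```

The two can disagree: a brief burst of slow requests might barely dent the time-based number while consuming a lot of event-based budget. It's worth choosing the flavor that best matches how your users actually feel the service.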
To give an example of that transparency: if you're running an experiment, or having a little downtime, but you still have a lot of error budget to play with, it's useful to be able to share that information with your customers, especially if they're internal, so they're not banging down your door when you're running an experiment or experiencing an outage that you can live with. Being transparent with this information is what helps this become a wildly successful program.

Now finally, for the last section — because, as I teased with our subtitle, it's "S-L-A, E, I, O, U, and definitely Y" — let's cover some more letters. First, SLE, for environments. Now, I've worked in places that didn't have one production environment, or two, but 30. Everything, no matter how customer-facing or not, was considered pager-worthy. This is not a great way to work. If you have multiple environments, then as part of your SLOs and your API documentation you should be clear about, for instance, what your staging environment is for. Is it for your team to test things? Is it for other teams to test integrations with your services? What kind of availability do you promise for it?

Next, we have SLU, for updates. These are not intended to be written in stone. You should review your SLOs and their related metrics to make sure they still make sense. Are you still using the proper metrics? Are you targeting the proper uptimes? These are just some of the questions you should revisit periodically. Now, a common problem I've experienced: has your service become a greater dependency than was intended? Has it become part of a critical path? For instance, you're only running at maybe two nines of availability, but somebody found your service was cool, and before you know it, a four-nines service has a hard dependency on you. The opposite can be true as well; it's surprising how suddenly a proof of concept becomes mission critical, and keeping an eye on that is always healthy. But when you think you could use a service level update, put out a quick survey or an RFC to your customers — this is especially easy if they're internal. And when a decision is made to make a change, update your docs and communicate your decision broadly, especially to those internal customers.

And finally, we come to SLY — or, better stated, the SL-why. Coming up with SLOs is a first step toward bringing sanity to the operation of products. In many places, any outage is considered an emergency. If our businesses demand 100% uptime, we just don't have the time to experiment or even make healthy mistakes. Here's the thing: building individual services to be resilient to faults in their dependencies makes for a better user experience. Additionally, having a healthier relationship with faults — by preventing every fault from becoming a three-alarm emergency — allows more comfort and flexibility for dealing with faults when they do occur. That makes your product teams and your engineering teams more willing to own the resilience and availability of their services.

Now, in closing, the things we've covered today are SLIs, SLOs, and SLAs; how to define quality targets; what error budgets are and why you should use them; and some practical examples. With that, I'd like to thank you and open things up for questions. I'll stop sharing my screen and take a look at the Q&A widget now.
If you have questions that you haven't typed up already, feel free to jump in there. Thank you. Okay, moving my window over and opening the Q&A widget, let's see.

From Kenneth Jones: are there any known tools that can build a dependency chain or tree showing the software systems that underpin an SLI? These would be helpful to understand the full set of services that affect performance. Okay — I'm going to answer this as an evangelist. First, having documentation of how things run, what their dependencies are, and what depends on them is wholly useful, entirely necessary, and often overlooked if it's not done through tooling. It's just hard to keep documentation up to date, because we always have more things to do than time to do them, and documentation is unfortunately one of the first things to fall behind. You're going to have to forgive me for a moment, but with Datadog, when you implement our APM product, you get a service map that shows the active, real-time flow of data in your environment. Give me a second and I'll pull up an example — I'm doing this off-screen, for obvious reasons, and I just need to make sure I'm in the right organization too. All right, sharing my — where'd the widget go? Okay, you should see a window with the Datadog app under APM. When you install the APM agent and configure it for your service, it automatically builds out a service map of all the interactions that are happening: the relationships, as well as how much traffic each service receives. Now, this is a demo app, so this is actually pretty boring as far as traffic goes, but you'll see different indicators change for how many requests services get, or node sizes — this is all configurable — and it shows you a real-time view of what's available, where the dependencies lie, and so on. I apologize for the company-man aspect of that answer, but yes, this is something you can get with a lot of tooling. You can build it yourself if you want to, but it comes for free with our APM product. So, Kenneth, I hope that answered your question. If not, feel free to drop a follow-up in the Q&A.

From Aditya Pandey: do SLIs, SLOs, and SLAs get affected by COVID or any global situation, in terms of performance, finance, and other aspects? I'm not sure I fully understand what you're asking, but I'll try. I would say that your performance targets don't necessarily change because of the world situation, or anything going on in the world that happens meta to your technology. You would still strive for the same availability targets, but depending on your service, you may have less or more traffic. For instance, Zoom was hot on everybody's lips when the pandemic started; they may well have increased their various availability targets because they became so key to life in the pandemic. Whereas on the other hand, if you were running hotels, or say the Disney parks, something along those lines, when foot traffic dropped way off, on the one hand there was less total impact because you didn't have as many customers coming in, but the individual experience mattered more. A small number of failures, expressed as a percentage, became much more impactful. I hope I answered your question.
Again, if that wasn't helpful or didn't make sense, please feel free to add some context around it and I'll be happy to give it another shot.

From Adam Escamilla — thank you for the nice, constructive talk. Thank you for the feedback; I hope it was helpful and that you can have useful discussions going forward. His question is how to estimate error budgets and how to get started. So, another version of this slide deck has the formula for this, and I apologize for not having it in here. Your error budget comes from a very simple ratio: your SLI is the number of good events divided by the number of valid events over your time period, and your error budget is the room left between that ratio and your SLO target. An SLO is defined as a metric, or a ratio, over a time period. So if your target is 99% successful requests within 24 hours, or within a week, or a month, that's a rolling number: you don't burn up all your budget for November specifically; you burn budget over the last rolling time period. If I didn't explain that well, I'd be happy to dig up where I have that formula more properly formulated, but let me know if this answer didn't work for you.
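Here's that ratio as a small worked sketch, under assumed numbers — a toy rolling-window computation, not a formula from any particular product:

```python
# Hypothetical rolling 30-day window for a 99.9% success-rate SLO.
SLO_TARGET = 0.999
valid_events = 1_000_000   # all countable requests in the window
good_events  = 999_400     # requests that met the SLI's "good" criteria

sli = good_events / valid_events                  # 0.9994
allowed_bad = (1 - SLO_TARGET) * valid_events     # 1000 bad events allowed
actual_bad = valid_events - good_events           # 600 bad events so far

budget_remaining = (allowed_bad - actual_bad) / allowed_bad
print(f"SLI: {sli:.4%}, error budget remaining: {budget_remaining:.0%}")
# -> SLI: 99.9400%, error budget remaining: 40%
```

As new events arrive and old ones age out of the rolling window, you recompute, and the budget replenishes or burns accordingly.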
Let's see, from David — sorry, David Dunham: do you use an RMM like Datto to do the SLI metrics? I don't know RMM or Datto, but you can use any metrics provider. Of course I'm going to talk about Datadog, because we're Datadog, but you can use any metric as long as it makes sense, as long as it relates to the customer experience. And again, you want to avoid things like system-level metrics: no customer cares how hot your CPU is running or how many instances you have; they just care that their requests get taken care of. I hope that helps, and we can also take this discussion offline, I suppose.

All right. Now, Rakesh gives us our next question: how do you write good SLOs for a SaaS or PaaS solution? Any white papers or links? I'm assuming you're running a SaaS or PaaS and your customers are external. Now, your SLOs are typically going to be internal. Those are for your product teams who are providing services that solve problems, and ultimately those solutions are in service of the company you're running. So external promises actually get more into SLA territory. As far as white papers or links go, we have a bunch of articles on the Datadog blog. I would also point out — and this is again mainly in service of SLOs — that when you're writing SLAs based on SLOs, your SLOs should be more stringent than your SLAs. For instance, if you look at your internet provider's or a cloud provider's SLA, what they promise for their service is usually a very achievable target, and it's often defined in such a way that they don't suffer a financial penalty even when things go horribly wrong. Another thing I didn't cover much in here is that SLAs are usually the realm of lawyers and company executives, whereas SLOs are much more the realm of engineering and product management. I would also encourage people to take a look at the relevant chapter in Google's SRE book, which came out in 2016; it covers this material very well. And Jennifer Petoff, who works at Google, has also given excellent talks on the implementation and day-to-day usage of SLOs in a real-world context. So I hope that helped.

Kenneth again — thank you, I'm glad the service map helped. I'm glad I thought of it in time. Let's see. Kenneth, awesome, I'm glad I could help.

Next — excuse me, my tongue is not going to do this well — Polianco, again, apologies: are there some guidelines for defining reasonable alert thresholds, to avoid false positives on breaching SLOs? Now, we do have SLO monitoring within the Datadog product. It's a good idea to send alerts when you're about to breach an SLO, but you should already have been notified that there's a problem with the service you're running before that point. So, to your question: yes, it is a good idea to have a monitor at a slightly more stringent level than the SLO itself, for when it looks like you may be about to breach it. I can't give you hard guidelines, and it's also something you can tweak very easily over time as you approach an SLO limit. I hope that helps you, Evan, and again, apologies for my bad pronunciation.

Let's see. Next is from Rassi Karyuki: how does one combine SLOs from automated tests, like uptime measured using synthetics, with service-specific measurements, like successful transactions and response-time percentiles measured from actual traffic? Okay. All of the things you mentioned are valuable SLIs, and you can have multiple SLIs feed into an SLO. But in terms of starting small, I wouldn't begin with a multi-metric SLO. Start small, start simple. You can have multiple SLOs for your service, especially as you're starting out — maybe the first six months, maybe the first year — and judge along various axes. As you refine your numbers, your services, and your workflow, then maybe after a couple of quarters consider combining multiple measurements into one SLO. I definitely wouldn't start off there, though. I hope that answered your question, Rassi; if not, feel free to post a follow-up.

Now, Patricia Funnell asks: could you explain durability a bit more? What types of conditions affect durability, and what's an example of an error budget for durability? Durability in terms of data, I assume. That's an excellent question that doesn't come up very often. The simple matter is: if the live data is bad, how can I get it back to a correct state? That's largely what this comes down to — data storage, availability, and correctness. And if things should go wrong, how do you recover, and what is the latency between identifying a problem and being able to correct it? In that view, I think what your actions or objectives look like is going to be very much a case-by-case basis.
And unfortunately, I hate to say "it depends," but I think this is a very strong case for it: it depends on what kind of storage you're using and what mechanisms are available. I think there's so much to that, it could be its own talk — one you should give, about your environments and your lived experience trying to work out what data storage durability looks like, what kinds of SLOs you looked at and experimented with, and the solutions you used to deal with it. I wish I had a more satisfying answer, but I would love to see you give that talk.

All right. With that, let's see — we have a few more minutes. If people have questions, or follow-ups to earlier questions, I'd be happy to take them now. But if we go a couple of minutes without anything: thank you all very much for attending, and I hope you got something out of this.

Yes — thank you so much, Aldo. That was great. Really appreciate you going through all of those questions. — Oh, thank you. — Thanks, everyone, for joining us, and I hope you all have a great day. Thanks. Bye.