Hello, and thank you for joining me for my talk, "SLIs, SLAs, and SLOs: Learning About Service Uptime from Homer Simpson." My name is Mason Egger, and I serve the developer community at DigitalOcean. If you want to reach out after this presentation, you can find me on Twitter at @masonegger, or you can email me at mason@digitalocean.com. I will also be doing a live Q&A after the initial screening of this presentation at the Open Source Summit, so feel free to save some of your questions for there, or if you have some later, reach out through either of those channels. I love getting questions and answering them to the best of my ability.

Before I get started, some of the resources I used for this talk: the Google SRE books. Google produces great books on how they handle SRE topics, and they are a great read, especially if you are on an SRE team. I am also pulling a lot from my own experience as a site reliability engineer over the last couple of years, before I left my previous job. And if you like my Simpsons memes, or want to use them to make your workplace a little more fun, frinkiac.com is a great place to find Simpsons stills, add text, and create high-quality GIFs. It's absolutely great, and I highly recommend it.

Before we get started, a slight disclaimer: this talk is essentially a giant Simpsons meme with a few helpful tidbits and hints here and there, I hope. When I first started writing it, I was genuinely curious how many Simpsons memes I could fit into a presentation and still get it past the organizers. The answer is a lot; I think the current version has 31 Simpsons GIFs in it.

In reality, the inspiration for this talk came from my past job. Every now and then my teammates and I would get paged because something had gone wrong with the cluster, some weird ambiguity we didn't really understand, and I would send Simpsons memes to lighten the mood, because it can be very stressful when the site goes down or your system starts doing things you don't expect. Eventually I realized I could explain my entire job through Simpsons memes, which gave me the idea: I should write a talk explaining all of site reliability engineering, and by "all" I mean this much, because it's a big topic and we only have a little time, entirely through The Simpsons. So that's where this talk came from. We're going to be talking a lot about SLIs, SLAs, and SLOs today. Without any further ado, here we go.

Cue the classic Simpsons theme as we enter the land of Springfield. This is Homer Simpson. He is a nuclear technician at a nuclear power plant. This may also have been me at some point: sitting around during my turn on the on-call rotation, watching all the monitors, doing a little work here and there. You do get some work done when you're on call, but you're typically the one who has to handle first-level problems, so you really can't dive too deep into anything. And I love spinning around in my chair.
It keeps me entertained, like Homer. So we can pretend this is me, sitting there waiting for something to happen, doing my little tasks here and there. The website is up, traffic is flowing, nothing bad is happening, and then PagerDuty hits. Oh no, now the site is down and we have to do something about it. Luckily it goes down during work hours, so I take first call on it, but some of my co-workers are also there to help. Teamwork really does make the dream work in site reliability engineering orgs.

My architects and the older sysadmins, the folks who have been doing this for years, are totally calm: everything's going to be just fine, nothing's wrong, we're okay. I'm trying to emulate that a little. Everything's going to be fine; don't panic. I'm a little panicky, but doing okay. And then there's the sales org and the PMs and the other people who basically see this as the world ending: the sky is falling, the site is down, and we don't know what we're going to do.

So I do what I do best: I poke blindly at keyboards until the site is back online, hopefully not melting down the nuclear inspection van in the process. In reality we check logs, we investigate, we don't just blindly poke; that's a slight joke. But if you think about it, that is kind of all we do: poke at keyboards and hope that stuff works. So we do that, and hooray, we're the heroes, because the site came back up and everything is going great. You feel good: disaster averted. But then the PMs and other people ask: why did this happen? We need to investigate this. Won't somebody please think of the children? And they're right; I'm not going to knock them on that. It's completely reasonable to wonder why this happened. So now we're in an RCA meeting, doing a root cause analysis, and somebody says: hey, I recently read on Hacker News about this cool site reliability engineering thing that Google does; maybe we should look into that. And it's a pretty good answer.

So what is a site reliability engineer? Google's Site Reliability Engineering book does a fantastic job of defining what can be a very confusing term. To paraphrase the quote on the screen: it is a software engineer who is focused on the entire life cycle of software objects, from creation and deployment, through observation, operation, and refinement, to eventual decommissioning. The SRE discipline is a hybrid of a sysadmin and a software engineer. SREs have to understand the software engineering process and be able to write complex software, but at the same time they need the Linux-y skills: system administration, and an understanding of networking and operating systems. That's what attracted me to the vocation, because it sounds like a lot of fun. I get to do a little bit of everything, and I don't have to write JavaScript.
I'm just personally not a big fan of it. So that's what an SRE is, man. Also known as the Pie Man, for those of you who have seen that Simpsons episode.

But in reality, SREs don't solve all of your problems. With SREs, as with DevOps, grabbing one, dropping them into a team, and going "poof, we have DevOps" does not solve anything. SRE is a discipline, not just an engineer; it's a mindset. What we actually need is a good, accurate way of measuring our uptime, a way to guarantee availability for our users, a way to communicate that availability to our users, and the ability to keep our commitments once we've communicated them.

Before we go any further, we should ask: what is uptime? Essentially, if my service is running and I can ping it, that means it's up, right? But that's not really true. I had a boss at my very first job, when I worked at a university, who enjoyed SSHing into the Linux servers and saying, "look at that uptime, it's been up for six years." That's great and all, but I would ask: how long has the network been down? When did we lose access? When did the front-end service that talks to this back-end database go down? Was the service actually up if nobody could access it?

Or, in a more modern example we deal with a lot today: if our systems are highly distributed and only one of our ten data centers is up, is the service up? If only an eighth, to pick a random number, of our customer base can actually access our service, is that service up? So while uptime is important, the old-school uptime of a single server isn't what matters. What we really care about is availability: is our service available?

So how do we measure availability? Time-based availability is the old school of uptime measurement: it's essentially the uptime divided by the total time, and that gives you the availability. This is a fine kind of availability measurement for small-scale things, such as personal blogs and smaller services that don't matter that much, where all we need to check is: is it up or is it down?
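To make that concrete, here's a minimal Python sketch of time-based availability; the downtime figure is hypothetical:

    # Time-based availability: uptime divided by total elapsed time.
    # Hypothetical numbers: a 30-day month with 43 minutes of downtime.
    total_minutes = 30 * 24 * 60          # 43,200 minutes in the month
    downtime_minutes = 43
    uptime_minutes = total_minutes - downtime_minutes

    availability = uptime_minutes / total_minutes
    print(f"Time-based availability: {availability:.3%}")
    # Time-based availability: 99.900%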
But it's not always great for systems, especially highly available systems. What we really need there is what we call aggregate availability, which is the number of successful requests divided by the number of total requests. This is just one measurement; there are tons of others, but for availability here we're going to go with aggregate: successful requests over total requests. It can also be negated for unsuccessful requests, and that is what gives us an error rate: the number of unsuccessful requests over the total requests. Successful over total is by far not the end-all be-all; it's one potential availability metric among many.

The thing we have to note is that not all requests are created equal. Going back to aggregate availability: just measuring all requests in one bucket is not a good idea, because in complex systems they're not all the same. A new-user sign-up request may not be as critical as a send-message request. Take the example of Slack. Slack has sign-up, and it has an instant messaging function. If the sign-up flow is down, yes, that's kind of a problem; we can't create any new Slack accounts. But it doesn't seem as critical as the message-send requests being down and people not being able to send messages to each other. So requests are not all created equal, and you definitely have to determine which ones are important. Then you split them up and measure them individually, or you split them up and only measure the ones that matter, and the important ones may trigger incidents of different severity when they cross a certain threshold.

Do different types of failures have different effects? If I can't sign up, yes, that might peeve a couple of people, but the likelihood is they'll come back and attempt to sign up later, at least for a service that's extremely popular or well known. If I can't sign up right now, I'll get up, get a cup of coffee, take my dog out, come back, and try again; maybe it's fixed by then. There's rarely much urgency around a sign-up. And remember, all of this is hypothetical; there may actually be urgency in your system, but in the scenario I'm building there isn't. Sending messages, however: PagerDuty used to send me messages via Slack letting me know when things were down, and Datadog would send me messages via Slack letting me know when things were getting into the warning range and I might need to look at them. If I can't get those messages, that's a much bigger effect, because those are potential warning signs that I'm about to have a severe outage. Missing them negatively impacts me in a much bigger way than not being able to onboard a new person right now. So yes, you do have to check whether the failures you're having have different effects.
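Pulling those ideas together, here's a rough Python sketch of aggregate availability measured per request class; the request classes and counts are hypothetical:

    # Aggregate availability: successful requests / total requests,
    # tracked per request class so a low-priority endpoint can't hide
    # an outage in a critical one. All numbers are hypothetical.
    request_counts = {
        "send_message": (998_200, 1_000_000),  # (successful, total)
        "sign_up":      (9_100, 10_000),
    }

    for request_class, (successful, total) in request_counts.items():
        availability = successful / total
        error_rate = 1 - availability  # the negated form: failed / total
        print(f"{request_class}: {availability:.2%} available, "
              f"error rate {error_rate:.2%}")
    # send_message: 99.82% available, error rate 0.18%
    # sign_up: 91.00% available, error rate 9.00%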
So what other service metrics are important to take into account? The number of requests. Read and write times from the disk, which matter if you have a database-heavy or disk-heavy workload. What the CPU is doing, what the RAM is doing. These are all important metrics, and you have to wade through this sea of metrics and figure out which ones you actually want.

So the question is, we have to define availability: as a company, as a new startup, as a person who writes a personal blog, as someone who maintains a public cloud. What counts as available, and how do we know if we're available enough?

What level of service does the user expect? Say I create a website that takes a big image and turns it into a thumbnail. If that site takes five or ten seconds to do it, users come to expect that and are okay with it. If something did it instantaneously, they'd get used to that instead. If there's a delay and the user is accustomed to it, it's not that big a deal.

Is the service directly tied to revenue, either yours or your customers'? It's still very important for free services to be available, but if a service isn't directly tied to revenue, you might be able to lower its availability target. Totally up to you. My blog is a free service; you can go read my stuff, and if it's down for a couple of days and I don't notice, it's really not a big deal to me. However, if I'm a startup with a free tier and a paid tier, and I know there's a direct correlation between the free tier and conversion to the paid tier, then it would really not be good for me to have my free tier down. And does it tie to my customers' revenue? If a customer makes money off my service, and something happens to my service, they're down, they're losing money, and I'm potentially going to lose their money too, because once they're done yelling at me they may decide to cancel, terminate, or not renew the sales contract. You have to figure out who's making money off this; if money is tied to it, people tend to be more upset when it doesn't work the way they expect. Is it a free or a paid service?

What does the competition look like? This is a big one. Thankfully there are a lot of tech companies and technical startups now, so almost everything has a competitor; maybe not always a large-scale one, but there's always something. If you're the only one providing something, even if you do it poorly, customers kind of have to deal with it. But if you're directly competing against someone, neck and neck, and people are deciding between you, any lesser availability could hurt you. If you have lower availability than your competitor, enterprises are likely to go with your competitor, because, unfortunately, and I've been in enterprise software purchasing meetings, they're so not fun.
The amount of availability matters there. There's a set list of things that matter, and how many nines you have is insanely important.

And who is the target audience, consumers or enterprises? Consumers are typically self-service and can leave you at the drop of a hat; they can cancel their service for the month. Enterprises are typically on sales contracts and are not as self-service. Consumers are also less likely to be concerned with your availability numbers; they're not going to sit there watching your SLAs and making sure you haven't gone into breach. Enterprises are, and if you breach an SLA hard enough, that can carry negative legal ramifications as well. So there's a lot to think about in deciding how available you need to be.

One of the ways we describe that is in what we call "the nines." Every DevOps engineer, every SRE, every sales engineer, every used car salesman, everybody on earth talks about the nines. It's kind of the thing, and as a new SRE a couple of years ago I had no idea what they were. In reality, it's your availability target measured by the number of nines in the percentage: 99.9 percent is three nines, there's four nines, there's five nines, and slides full of nines beyond that; 99.999999 percent would be eight nines, and I don't think anybody offers that many. Some of my co-workers joke that we should offer nine fives instead of five nines, which is pretty funny.

What this really means is: how much downtime can I have per year before I'm on the naughty list? If you have an SLA that is actually enforceable by law, you have to be very careful with this. Even if it's just a promise, you still have to be careful. Sometimes it may only be an internal metric for yourself, where missing it means some yelling and some meetings, but nothing that hurts too much. At 95 percent, you're allowed 18 and a quarter days of downtime per year; basically three minutes an hour. But when you get up to five nines, 99.999 percent, you're allowed a little less than five and a half minutes of downtime a year. That's not a lot; that's a really hard metric to get to. So that's what the nines mean: how much downtime you can afford. Definitely define this for yourself, and figure out how much downtime you're willing to allow.
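If you want to play with the nines yourself, the math is just the fraction of the year you're allowed to be down. A quick sketch:

    # Downtime budget per year for a given availability target.
    SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignoring leap years

    for nines, target in [(2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
        budget_seconds = SECONDS_PER_YEAR * (1 - target)
        print(f"{nines} nines ({target:.3%}): "
              f"{budget_seconds / 60:,.1f} minutes of downtime per year")
    # Five nines works out to roughly 5.3 minutes per year;
    # 95% would be about 18.25 days.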
So one of the questions you may be asking yourself is: do I need another nine? If we increase by a nine, what will our increase in revenue be? This is a really important question. If I'm going to add a nine, how much extra money am I going to make? Do I have accurate data, churn metrics, surveys, that shows customers are actually leaving because of my availability? Or do we have contracts sitting in the sales org essentially saying, "we'll give you millions of dollars from very large company X if you can provide these nines"? If you actually have that, then great, that's awesome. But if you're trying to add another nine out of pride, with no actual business reason, you may want to reevaluate, because you have to ask: does the extra revenue offset the cost of engineering? All this extra money we'd make by adding another nine: are we actually going to make it, or are we going to spend more engineering it than we'd ever earn back?

Let's do a really quick math problem to demonstrate. Say I want to go from three nines to four nines; I'm increasing my availability by nine hundredths of a percent. If my service revenue is a million dollars, the value of the improved availability is one million multiplied by 0.09 percent, which is roughly nine hundred dollars. So of that million dollars, we only really gain nine hundred; if we spend more than that, we've wasted money and resources. If you can build out this extra nine for less than or equal to nine hundred dollars, it was worth it. Otherwise you're spending more than you'll make. And if you take an average SRE, an IC2, an individual contributor two, making roughly $120,000 a year, that works out to roughly $57 an hour when you do the math. So a nine-hundred-dollar budget for the fourth nine buys you one engineer for about fifteen and a half hours before you've spent more on the project than it will ever earn. You really do need to figure out whether this makes sense for you. If you have huge multimillion-dollar contracts waiting on the added availability, then yes, it's totally worth it. But if it's a small payout for a large amount of engineering work, you'll spend more than you make, and it comes out to a net loss in the end.
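Here's that back-of-the-envelope calculation as code, so you can plug in your own numbers; every figure below is hypothetical:

    # Is a fourth nine worth it? Value of the extra availability versus
    # the engineering cost of building it. All numbers are hypothetical.
    service_revenue = 1_000_000           # dollars per year
    current, proposed = 0.999, 0.9999     # three nines -> four nines

    value_of_extra_nine = service_revenue * (proposed - current)
    # 1,000,000 * 0.0009 = ~$900

    engineer_salary = 120_000             # an IC2 SRE, dollars per year
    hourly_rate = engineer_salary / 2080  # 40 hrs * 52 weeks = ~$57.69/hr

    break_even_hours = value_of_extra_nine / hourly_rate
    print(f"Budget: ${value_of_extra_nine:,.0f}; "
          f"break-even after {break_even_hours:.1f} engineer-hours")
    # Budget: $900; break-even after 15.6 engineer-hours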
Now let's move into the things that help us with our availability, the things we use to measure and make sure we are available and ready for people: our SLIs, our SLOs, and our SLAs. You may be asking: what on earth are those? The "SL" part is pretty straightforward; it stands for "service level." So they are service level indicators, which you can think of as the metrics, the raw data that matters; service level objectives, the ranges these metrics should be in, and what we do to fix things when something breaches them; and service level agreements, how we react if we are in breach of those objectives. If we're down for more than a certain amount of time, what are we going to do about it?

Service level indicators are defined by the Google book as "a carefully defined quantitative measure of some aspect of the level of service that is provided." Essentially, this is the stuff we measure. Examples are request latency, error rate, and system throughput. There are tons of other things you could be measuring: the number of successful requests, database performance, lots of different things. But these are the raw metric data we use to measure.

Next we have service level objectives, defined as "a target value or range of values for a service level that is measured by an SLI." So if my SLI is the rate of successful requests, and I want it to stay in the 98 percent range, then once I dip below 98 percent I am in violation of the service level objective. Now we know something is going on, and an alert should be triggered: the metric is below the range we set for it, outside the range of acceptability, and we need to figure out what's happening with that service. There's typically a lower bound and an upper bound; sometimes one of the bounds runs to infinity, but usually there are bounds that tell you where the metric should live.

And SLAs, in Simpsons meme terms, are a pact: an explicit or implicit contract with your customers that includes the consequences of meeting or missing the SLOs they contain. In other words: what happens when you break your SLOs? An SLO can be breached without triggering an SLA event, because you weren't down long enough. But if you blow through the downtime allowed by your SLAs, your customers might notice, and they might hold you accountable for it.

So what goes into an SLO? It really comes down to what you and your users care about. User-facing services typically care about availability, latency, and throughput, and the good questions to ask are: could we respond to the request? How long did it take? How many requests could we handle before the system went down? If I'm a small blog site and I suddenly get a million requests per second, yes, I'm going to have throughput problems and I'm probably going to go down. But do I really care that much about my personal blog? No, I'm not making any money off it. A business, yes, would care about it a lot. Storage systems, on the other hand, usually care about latency, availability, and accuracy. How long does it take to perform the I/O operation? Can I access my data when I need it? And is the data correct? That last one is a really big one; they're actually all big, but if your database corrupts data, you are going to have problems, and a lot of them. Each system cares more about certain things, such as accuracy in the storage system versus throughput in the user-facing system. That's not to say storage systems don't care about throughput; if it matters to you, measure it and define it. You have to figure out what matters most to you, and define it.
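To show how an SLO turns one of those SLIs into something you can check and alert on, here's a minimal sketch using the 98 percent success-rate example from a moment ago; the print statement is just standing in for a real pager:

    # An SLO is a target range for an SLI. Here the SLI is request
    # success rate, and the SLO sets a lower bound of 98% (the upper
    # bound is effectively unbounded).
    SLO_LOWER_BOUND = 0.98

    def check_slo(successful: int, total: int) -> None:
        sli = successful / total
        if sli < SLO_LOWER_BOUND:
            # In a real system this would page on-call or open an incident.
            print(f"SLO violated: {sli:.2%} is below {SLO_LOWER_BOUND:.0%}")
        else:
            print(f"Within SLO: {sli:.2%}")

    check_slo(successful=9_750, total=10_000)  # 97.50% -> violated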
So how do we define our SLOs? Really and truly, we should keep it as simple as possible. The more data points we have, the harder it is to figure out what actually went wrong; extra complexity can obscure changes in performance, and it can make it more difficult to pinpoint exactly where issues are coming from when the system does go down.

We should also avoid absolutes. "100 percent uptime" is one of those absolutes nobody promises, because you can't achieve it; it's completely unrealistic. "We will always respond within X amount of time" is another tough one: you never know who's going to pick up the pager at three in the morning, or what extenuating circumstances will come up. Absolutes are definitely not a good thing to have in your SLOs. Keep things within ranges, and avoid being too nitty-gritty.

Be cautious when picking a target based on current performance. You want to avoid getting locked into supporting something that could require herculean efforts to support in the future. When you're just starting out, or it's a brand new service, you might have one microservice running behind it, great performance, great uptime. But as you add more and more things and systems become more complex, throughput and availability do start to wane, and they become harder and harder to maintain. So when you pick a target based on current performance, remember it may have been a good day or a bad day; be cautious. If you have no other metrics to go by, it's a good start, just not something to be locked into.

Limit the number of SLOs you have. Choose enough to cover your entire system and to let you win conversations about priorities by quoting an SLO. A good measure: if you can't win an argument with an SLO, it might be unnecessary. SLOs should be the thing you can use to push back when other people say "we need to do X, Y, and Z": yes, but that potentially violates our SLOs; we can't do that, it could degrade us. You become stewards of your platform at that point, the guardians who protect the system from all the pie-in-the-sky ideas. Because as SREs and DevOps engineers know, if the system goes down, nobody's making any money, and if nobody's making any money, it's a bad time for the company.

One thing to note: some attributes, such as user satisfaction, can't really be covered by SLOs, which makes things a little difficult. There is kind of a gray area there. To our point above about avoiding absolutes: there's no black and white here, no simple yes or no. It's not a binary system.
There is a little bit of gray area, and you have to be willing to adjust to it.

Perfection can wait, and it is easier to add nines than it is to take them away. Customers will never complain about you making something better, but they will definitely complain about you making something worse. We have a saying at DigitalOcean, and it's probably a saying everywhere: don't let perfection be the enemy of good. So yes, perfection can definitely wait. I was an SRE; I get it. I want to make everything perfect. I'd love for the systems to be Skynet, to sit back one day watching my dashboards as things go down and then self-heal. I've always said that if I ever automated myself out of a job, I would take it as a compliment, because it would mean I had completely automated my entire job. I think I would just ride off into the sunset at that point and go farm chickens or something, because I'd have accomplished what I wanted to accomplish in my career.

So how do we define SLAs? SLAs are usually defined at the company level, between business and legal. Engineers don't usually have much say in this; PMs and other managerial types might have access to it, but typically engineers don't define the SLAs. That's a company-level thing, and there will be lawyers involved. However, if your customers are internal developers, it may be within your team's ability to define SLAs. The team I originally worked on at my last job provided an internal PaaS to the entire company that hosted the entire website, so our customers were the internal developers. That being said, we did define our own SLAs, and no lawyers had to be involved, because it was all within the same company.

That was a lot of information, and I bet a lot of people are feeling like Homer right now, brains hurting a little. Don't worry, we're going to go over it. Out of all the stuff I've said, what should you actually care about? Here's an easy way I remember it. SLIs are the things that can be monitored and measured. SLOs are the ranges that are acceptable. SLOs set expectations for your service; other people will expect things from you based on them. Monitor and measure your system's SLIs and make sure they're within the range of the SLOs. Compare your SLIs to your SLOs, and if an action needs to be taken, take it. Preemptive action is always better than an outage, so when you're alerted and you know something's going to happen, take preemptive action. Determine what action needs to be taken to meet the specific target; runbooks and automation tasks are great for this. We had tons of runbooks, because you can't always automate everything away, and sometimes the automation is scary and might end up doing more harm than good, so having a human there really does help. Especially when there are billions of dollars on the line for the platform, human intervention makes people sleep a little better at night. So make runbooks, make tasks, and take the action when needed.
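As a sketch of what preemptive action can look like, here's a toy evaluator with a warning threshold sitting above the SLO, so a human or a runbook can act before the objective is actually breached; the thresholds are made up:

    # Alert a little above the SLO so humans (or automation) can act
    # before the objective is violated. Thresholds are hypothetical.
    SLO_TARGET = 0.98
    WARNING_THRESHOLD = 0.985  # preemptive alert, above the SLO itself

    def evaluate(sli: float) -> str:
        if sli < SLO_TARGET:
            return "BREACH: page on-call and run the incident runbook"
        if sli < WARNING_THRESHOLD:
            return "WARNING: take preemptive action before it's an outage"
        return "OK"

    for measured in (0.999, 0.983, 0.972):
        print(f"SLI {measured:.1%}: {evaluate(measured)}")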
A couple of last tips before we go. Have internal SLOs for your team that are higher than your external-facing SLOs. What this means is: if you tell the world you're at five nines, you don't have much wiggle room. But if you say externally that you're at three nines, while internally the team holds itself to five nines, then you have a higher standard for yourselves, and if you do for some reason dip from five nines to four nines, it doesn't trigger a breach of the SLA. Without that kind of padding, the same slip could cause real problems for you. So I highly recommend your external-facing SLOs be lower, not drastically lower, but lower, than your internal-facing SLOs, and that you actually try to meet the internal ones. If you create this padding and then don't respect it, because "it's only the internal number and nobody cares," and you let yourself slip from five nines to four, then when something does happen you've already lost part of your padding. Don't let complacency squash your padding.
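One way to think about that padding is in terms of downtime budgets: burn against the internal budget first, so the external promise is never the first line you cross. A rough sketch, with hypothetical numbers:

    # Internal target stricter than the external promise; the gap is padding.
    SECONDS_PER_YEAR = 365 * 24 * 60 * 60
    external_budget = SECONDS_PER_YEAR * (1 - 0.999)    # three nines: ~8.8 hours
    internal_budget = SECONDS_PER_YEAR * (1 - 0.99999)  # five nines: ~5.3 minutes

    downtime_so_far = 240  # seconds of downtime this year (hypothetical)
    print(f"Internal budget remaining: {internal_budget - downtime_so_far:.0f} s")
    print(f"External budget remaining: {external_budget - downtime_so_far:.0f} s")
    # Exhausting the internal budget still leaves padding before customers
    # (or the lawyers) ever get involved.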
Users build on the service you actually provide, rather than on what you say. If I said my blog was up 25 percent of the year, and it had some amazing API attached to it, and people started using it, and it turns out it's actually up 90 percent of the year, people will expect 90 percent. Even if I legally promise 25 percent, people expect 90. If you provide way more than you promise, that's what people get accustomed to. You won't see legal problems from this; you can't be taken to court for breaching an SLA you didn't breach. However, you can get blasted on Twitter and made fun of on Hacker News, so be careful with it. You can avoid it by rate-limiting requests, designing your system to produce similar performance under both light and heavy loads, or even having planned outages. Google has a service called Chubby, and they introduce planned outages into it in response to it being too overly available. Whenever you're up and nothing ever goes down, you may actually need to inject a little chaos, a little entropy, into your system, just so people don't get too accustomed to it being up, and so that when it does go down it isn't such a big issue. Even as I say this it sounds counterintuitive; I understand what users do, but as an engineer it still feels a little dirty to say.

Breaches of agreement happen. Just breathe and get the system back online. If you spend all your time worrying about the SLAs, "oh no, this is going to happen and we're going to lose our bonuses," you're going to make the situation worse, and the longer you're down, the worse it gets. Just breathe, get the system back online, and save all the data you can. There will probably, and should, be a root cause analysis later, so get the thing back online, make sure you're okay, and don't cause any more problems than you already have.

The last tip I have: there is no room for blame in an SRE org. All postmortems should be blameless. Google does a great job of this; there are a couple of great talks by Googlers about it. Don't make people feel bad about their mistakes; see everything as a teachable moment. There's that old saying, or parable: someone took down the site and cost the company four million dollars, and when asked "didn't you fire that guy?" the boss says, "no, I just spent four million dollars teaching him that lesson." So don't blame people, don't be a turd, I guess is what I want to say. Just help, see everything as a teaching moment, and everything will be okay.

And with that, we are at the end of our Simpsons episode. Thank you so much for joining. I hope you had fun, I hope the memes made you laugh, and I will see you next time.

Okay, so: hello everyone, I'm Mason. Thank you for joining in to see my presentation. I have one question in the chat, so I'll read it out and then answer it. The question was: "I am from the government sector. How can we apply this to the government sector, where there is no competition and things move slower?" This is actually kind of a tough one to answer. My opinion would be: if other services in the government sector depend on your service internally, you should definitely try to provide a certain level of uptime to the people who depend on it. I know that without competition it's hard to make things happen, especially since government technologies and systems do tend to move a lot slower. So I would say: hold yourself to a higher standard, and set internal SLAs and SLOs, so that you know what's acceptable and what you're striving for. And hopefully it will be fine.

Okay, I got another question: how has SRE evolved over time? I'll give a really brief explanation based on my experience. That is almost a talk in and of itself, the evolution of how we have maintained systems. But I'll give my experience of it, though I do have to say I'm relatively young, only 27 years old, so I didn't see a lot of what I'm about to describe firsthand. I just heard about it.
So, in the old days, or what I'm assuming were the old days, in the early 90s, when I was a child, when services were deployed to the internet there was a typical division of engineers. There were the software engineers, who wrote the code, set it up, and got it all going, and when it was ready they handed it to a systems administrator, who would deploy the code and then maintain it after the fact. Really and truly, this DevOps movement we have has been a merging of those two groups. We have expected more and more of the engineers doing the software engineering to understand the systems side, and we expect more of the systems administrators to understand the software side. That's kind of where the SRE discipline was born. The way I've always had it described, and I think the Google book says this as well, is that an SRE is a system administrator who was trained as a software engineer. That person would have a traditional software engineering background: it could be a collegiate degree, it could be a boot camp, it could just be experience. They would bring together the ideas of reusability and portability, all the things my professor at my university would call "the -ilities," and apply them to sysadmin work.

DevOps, like "full-stack engineer," was part of that evolution: front end, then back end, then DevOps. It takes sysadmins, who used to be their own thing, and embeds them in the engineering org. Some engineering orgs structure it so there's a DevOps engineer in each org, responsible for that application's deployment. Where I came from, I previously worked for Vrbo, we actually had an entire platform team, and there were no DevOps engineers in the individual engineering orgs; everyone deployed their apps to our platform, and we managed all of the company's applications on it. That's one of the things that actually makes getting an SRE or DevOps job kind of difficult: everybody does it differently. We haven't standardized on what the terminology means in the industry, so I could apply for three different jobs that all say "DevOps specialist," and the skill sets required could be completely different, very dependent on the job itself. I hope that's a good enough answer for now. How SRE and the deployment of software have evolved over the years would make for an amazing book, or a great conference keynote; we could probably talk about it for hours.

Next question: would you suggest using SLAs to measure the performance of your employees, with them being aware of it? Ooh. I don't know; that's a weird one.
I mean, in reality, that's kind of what we already do. We have a set of performance metrics we expect someone to meet: deliver X amount of code, deliver such-and-such performance, contribute to these conversations. Managers kind of already sort of do this. Now, I don't know if I would go so far as to formalize it, because you can't have SLAs without SLOs and SLIs. You'd need an indicator, which is really just how you measure the performance, and that can be very difficult: it's really hard to judge software engineering work, because you can't judge it on lines of code or number of commits; those can be artificially inflated. So you'd have to have an SLI for that, then the objective would tell you what ranges are acceptable, and the agreement would be what happens if those ranges are breached, and for how long. I don't know if I would do that personally, but it would be interesting to see a hypothetical case study written up about it. Other than that: treat people like people, not like machines. Machines are, I won't say more reliable, but they do the same things; they're not as finicky. People have emotions, people have good days and bad days, and part of being a manager is acknowledging that and helping them through it. So I don't know; that's a very interesting question, and that's all I have on that one.

That's all the questions I currently see in the section. If there are any more, I will happily take them. Hello everyone, how are you? How is everybody doing? Feel free; I'm here to chat until they kick me out. We'll give it one more minute.

Okay: how do we manage SLAs when multiple parties are involved, such as different vendors? That is a great question. When you have multiple vendors, each will have an SLA, or hopefully they will; I've seen a lot of business deals fall through because a vendor didn't have SLAs, or didn't have agreeable ones. At that point you have to monitor not only your own equipment but theirs as well. For example, I would get notified whenever AWS went down, or when there was a problem at AWS, so I could be prepared to implement failover techniques within the cluster I was working with. If you have external vendors, you have to monitor their SLAs and make sure they're not in breach. And if one of them goes down, and their breach in turn causes you to be in breach, that's when you escalate it up to the responsible organization, the upstream third party. In reality, that's a managerial, VP-of-engineering, CTO kind of discussion. I would not expect that to be the normal job of a regular site reliability engineer, an individual contributor on a non-managerial track.
If you are a manager or something like that, you're probably going to have to escalate it through that channel. And if there's going to be a root cause analysis of why you went down because this third-party vendor went down, you'll have to bring that to the table when you're discussing it. One of the common follow-ups is figuring out how you can be more redundant: is there another vendor we can have as a backup? Do we need to change vendors? Was that third-party vendor's outage large enough that we need to change? And if you're the one providing a service to other people, you have to be ready for this as well; it works on both sides. So yes, that's a great question.

I'm glad you enjoyed the Simpsons memes; I do enjoy my Simpsons memes. "Is there a framework or open source tool to orchestrate multiple SLAs? For example, there can be an IoT device managed by vendor A, owned by vendor B, with data connectivity from vendor C." To my knowledge, and I'll say I'm not that knowledgeable about the open source tooling in this area of observability platforms, I'm not sure; I would definitely look into it. At my previous job we actually built an internal platform for managing that exact thing: every app could have its SLAs defined, and you could also declare who you depended on, so whenever a dependency went down, or you were about to be in breach, it would notify everybody upstream: hey, this is about to be in breach. As for open source tooling, I don't know off the top of my head, but I would be shocked if there wasn't something the CNCF was incubating, or had already graduated, that does something like this. And if there isn't, it's a great idea for a new project.

Let's see, I think that's almost all of the questions. Thank you everyone for the great questions; I love answering them and chatting with everyone. I'm glad people enjoyed the Simpsons memes. I hope they made you laugh; they make me laugh, and at the end of the day, if my own talk makes me laugh, I feel like I did a good job. Okay, we'll give it one more minute, and if there are no more questions in the next minute, we'll end the stream. I want to thank everybody one last time for showing up and watching; I hope you enjoyed it. I think we're going to go ahead and end it here. So thank you everyone again. Have a great day.