Hello, welcome back. Let me see if we have Andy on the line. Andy? Hello, I can hear you. Okay, then I guess the sound is working. Yes, it is. Do you have your slides ready? My slides are shared already. There we are. Okay, I'll kick you off then. Tell us about the spec we never knew we needed.

Okay, thank you. Good morning, good afternoon, good evening, as appropriate to your time zone. Today I'm talking about reliability: the spec you never knew you needed. I am Andy Funinger, a senior engineer at Bloomberg on our Data Services Platform Scale and Reliability team.

A little bit about Bloomberg: we sell data and analysis, so my department's API gateway service allows applications at Bloomberg to get market data from each other. This is billions of requests per day, so of course it's important to us that this be reliable. And of course, all my examples today are made up. They are not related to any actual systems, except maybe the random number generator in Microsoft Excel.

So let's consider something: what would it look like if we were missing a specification for our system? If we didn't have a spec for the color of our UI, we'd keep getting prompted to change it. We'd get a ticket that says "please make the UI a little more red." Then we'd get an email that says "hey, I looked at the UI this morning, it needs a little more green." Next thing you know, you're in a meeting and someone says "let's put a little more pizzazz in the UI." So what would you do?
Well, you'd call a meeting, get all the stakeholders together, and come up with a spec. Then you would maintain that spec. Users, the product team, whoever owns it may change that spec over time, but if it changes without an intentional change, then in your QA process you're going to call it a regression, and you're going to fix it.

So when we look at complaints about reliability, they tend to look about the same. You get a ticket that says the system's down, and then explains that it's not really down, it's just too slow. Then someone calls and says the error rate at 5 p.m. is too high. Then you put on your yearly plan that you want to reduce downtime. Sooner or later, you're in a meeting where someone says let's build a whole new system to address stability concerns.

The thing is, our response is a little bit different. We build a new system, or we add some new features to an existing system, and then we release it. Hopefully everybody's happy, but users provide feedback somewhat sporadically; product passes some of that feedback to engineering, tags it as maintenance or bugs or whatever, and works on it until people stop complaining. And once the feedback stops, the work stops. What if we did this a little bit differently?
What if we wrote specifications for speed, error rates, and failures? We'd get a new system that's already meeting the specifications at release. Since we know what the specifications are, we'll have built-in monitoring for them, and we can even plan for the capacity. Users know what to expect, and they should be talking to us when either their expectations change or we're not meeting what we planned to meet.

What does daily operations look like now? Well, incidents that miss these specs are prioritized. When you go to do maintenance, you know just how careful you need to be. Engineering discusses changes with product, whether that's a new feature that's going to make things slower, or the need to put performance work in to enable future new features. Product prioritizes all work. If reliability suffers, you block releases, just like any other regression. Clients now have a very clear way to request improvements.

And in fact, the specs may change over time, and that could be upwards or downwards. Product may say, well, you know what, the features are important, so we're going to lower the specification; now you can monitor to that new spec. This replaces the old "don't fix that yet, we need to push forward with features" with an actual question.

How do we get here? Well, let's put some definitions around this. Service level indicators are special metrics: metrics that are important to your clients. They are quantitative measures reflecting the health of the service, aggregated over time. They're observable and of concern to your clients. Service level objectives are agreed-upon targets that the service level indicators must meet over a longer time period. Service level agreements are legal documents committing to a level of SLIs for clients. I will not be talking further about service level agreements, other than to say they usually have penalties and they usually involve lawyers. I'm not a lawyer.
I'm not your lawyer. They are a whole topic in and of themselves, and they will not be covered in the rest of this talk.

Now we have a definition for reliability. Reliability is simply: are we meeting our SLOs?

So let's go on to some simple sample SLIs. The ratio of homepage requests that loaded in less than 100 milliseconds could be a latency measure. Percent of accurate responses could be correctness. The 99th percentile of records processed per second would be throughput. Minutes available in a non-degraded state would be availability.

But remember, our service level objectives are never absolute and never instant. We typically measure the SLIs over a month, and we apply a percentile or rolling average to the SLI. So you would set your SLO at the 99th percentile of the ratio of homepage requests that loaded in less than 100 milliseconds over the last 30 days; the 30-day rolling percentage of accurate responses; the 30-day rolling 99th percentile of records processed per second; the 30-day rolling minutes available in a non-degraded state.

Let's look at some corollaries here. If we have an objective, then we have some amount of allowed errors, and that's our error budget. It may be quite small, but it is non-zero. We also have some rate at which our application is failing at the moment, hopefully quite low, and that is our error rate. And the rate at which that error budget is being consumed by the error rate is our burn rate.

So let's pull up an example. Say we have a web service that takes a million requests per month and has a hundred-millisecond average response time. We might say a request is successful if there are no errors, we returned the correct data, and we took less than 200 milliseconds to respond. How do we get these numbers? Well, as an example, in a new system we know that the other systems need this kind of response time to meet their SLOs. In an existing system, we look at the fact that the clients are happy today.
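To make the error budget concrete, here is a small sketch of the arithmetic, using the hypothetical traffic numbers from this example and the four-nines target discussed next:

```python
# Error budget sketch for the hypothetical service above:
# 1,000,000 requests per month with a 99.99% ("four nines") success SLO.
slo = 0.9999
requests_per_month = 1_000_000

# Allowed failed requests per month.
budget_requests = (1 - slo) * requests_per_month  # 100 failures

# The same budget expressed as downtime, using the minutes in an
# average calendar month (365.25 days / 12).
minutes_per_month = 365.25 * 24 * 60 / 12
budget_minutes = (1 - slo) * minutes_per_month

print(f"{budget_requests:.0f} failed requests "
      f"or {budget_minutes:.2f} minutes of downtime per month")
```

The downtime figure depends on how you define a "month"; an average calendar month is assumed here, which gives the roughly 4.38 minutes quoted in the talk.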
We leave a little room from our current performance. Of course, we also need a time component. So we're going to aim high and say 99.99 percent successful requests over the last 30 days. This is four nines.

How does this translate? Well, our error budget is 100 failures per month, or if we translate it into time, 4.38 minutes per month. Now, it's not exactly 4.38 minutes; it's unlikely that we have a perfectly even distribution of traffic. But this is telling us that any downtime we have has to be on the order of four minutes or less per month. If you're saying, okay, we're going to run on a single physical machine, take it down, and occasionally do maintenance: no, you're not. If you're saying we're going to run on a very hard-to-build virtual machine that takes 15 minutes to build: no, you are not. If you have something that takes 15 to 30 seconds to rebuild, you might consider getting away with that and doing your maintenance that way. Or you might be looking at the fact that, as in most systems, you need multiple parallel machines that can do a hot failover.

So let's look at some possibilities. Let's consider the possibility that a component causes an SLO miss. Say, as ground truth, our processing takes about 70 milliseconds, and authentication changes so that the cache-miss penalty on authentication increases to 150 milliseconds. So any time we have an authentication that does not hit the cache, it's going to take 220 milliseconds to process that request, and we are going to miss our target on every single such authentication.

So what are our options? Well, engineering goes to product and shows them the situation, and product says, well, we could just turn off authentication, we don't really need it. Fine, done; you've re-achieved your SLO. Or engineering can go to the authentication team or provider and say, we need this to be a little bit faster; can you speed this up a little bit?
And they can even say: we need you to shave 20 milliseconds off; 30 would be better; 50 would be better. Engineering can change the design: if we have 70 milliseconds of processing and 150 milliseconds of authentication, we could do all the processing in parallel with the authentication. Or product could change the SLO; product could say 220 milliseconds is fine.

Your environment might also cause an SLO miss. Let's say you're using some kind of virtual machines, Kubernetes or something, and occasionally a request hits an incompletely provisioned instance. This happens about 50 times per month. If we kept instances provisioned ahead of time, we would reach our SLO. But say we decide to clean up idle instances more frequently; we start missing our SLO, and product goes to engineering: we need you to fix this. Engineering could say, okay, we're going to keep more spare instances. Or engineering might say, you know what, we're going to take the hit; this is happening 70 times a month now, but we're going to go make the authentication faster so that we're not getting a failure on both. Of course, you might change the SLO. It might make sense, especially if each client is getting their own instance, to say, okay, we're not going to count hits that go to an incompletely provisioned instance. That's not generally ideal, but it's an option.

So how do you find your service level indicators? Well, ideally your product team would know; they know what's important to your clients. Whoever owns the product should know. When they don't, or when they ask you how to figure it out, there are some other options.
One: your customers do know. They know what's important to them, and they may in fact already be telling you when they give you feedback. They give feedback in certain terms; those terms show what indicators they are looking at, and those are, if not exactly your service level indicators, then candidates for them. You can also look at similar systems elsewhere in your company, or at your competitors.

In general, you're going to wind up with something that can be expressed as a number of good events over a number of valid events. So: good events are those under 200 milliseconds; valid events exclude authentication. Generally you're going to look for SLIs in terms of latency, throughput, correctness, availability, and error rate, but literally whatever matters to your clients is what your service level indicators are. This is client-focused.

There are also some techniques you can use, and of course each of these is its own talk. If you're request-focused, you can use the RED method and ask: what's my rate of requests, what's my error rate, and what's my duration on requests? If you're infrastructure-focused, then the USE method suggests looking at utilization, saturation, and errors.

Then we come to: how do you set your service level objectives? Well, again, the product team should know. But if sales and product are two different groups, sales is quite likely advertising something, and whatever they are advertising, those numbers are becoming your SLOs. You can also check your current performance for satisfied customers; that's another good indicator of what the expectations on your system are. Once you have them, what do you do?
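As an aside, the good-events-over-valid-events framing maps directly to code. Here is a minimal sketch with hypothetical request records, using the 200-millisecond threshold and the authentication exclusion from the running example:

```python
# Minimal SLI sketch: good events over valid events.
# Hypothetical request log entries: (latency_ms, is_authenticated).
# "Valid" excludes authenticated requests; "good" means under 200 ms.
requests = [
    (85, False), (120, False), (250, False),  # one of these is too slow
    (90, True), (300, True),                  # excluded: authenticated
]

valid = [r for r in requests if not r[1]]
good = [r for r in valid if r[0] < 200]

sli = len(good) / len(valid)
print(f"SLI = {sli:.2%}")  # 2 of 3 valid requests were good
```

In production this aggregation would run over a 30-day rolling window in your monitoring system rather than over an in-memory list, but the ratio is the same.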
Well, first of all, check them regularly, every week or every month. Look at where your SLIs are compared to your SLOs, and if you're seeing a trend there, you have time to take action. Remember, these are 30-day averages, so from week to week they shouldn't be trending on anything short-term; even a small outage should average out. So now you have time to fix issues and adjust plans. You might say, okay, next sprint we need to put a performance boost in. If you're consistently beating your service level objectives, you can consider offering a better objective to your clients, especially if that has a benefit to you.

You can also use them, once you have them, to manage capacity and resources. If you have plenty of room between your SLI and your SLO, then you probably don't need to add more resources. If you're close, then you need to add resources at least proportionately as load grows. This is a valuable input to your resource planning.

You'll also want to use them for alerting. But it's not terribly useful to get an alert that says "over the last 30 days, things have started going bad." Instead, you want to look at your error budget. Take one minus your SLO; essentially, that's your error budget. Then you can look at your system and ask: what is my instantaneous error rate, or my rate over the last five minutes? Then calculate how fast your error budget will be expended at the current error rate. Cancelling out the time terms, error rate over error budget gives you a burn rate.

If your burn rate is less than one, you're basically fine: you're not expecting to breach your SLOs. If your burn rate is exactly one, then at the end of the month you will exactly meet your SLO; that probably requires some attention, but is probably fine. If your burn rate goes up to two, you're going to breach your SLO in about 15 days for a 30-day SLO; in this case you definitely need to take a look at it, probably by the next business day. If your burn rate goes up to 30, you're going to breach your SLO in one day, and that requires action pretty much immediately; it's today's problem. If your burn rate goes up to 720, you're going to breach your SLO in an hour, and you really do need to call someone; that is a callout. This lets you have your alerting tied to what your clients actually care about.

To sum up: speed is usually a feature, but reliability is always a feature. If reliability is a feature, then we have a spec. If we have a spec, it comes from product. If we don't meet our spec, then we need to either fix it or change the spec. Reliability, however, is a complex feature. It involves both time and probability, and regressions are not intuitive; in fact, your regressions are very likely to turn up in prod. Continuous monitoring is required, and the issues in reliability may come from the environment or underlying systems. In fact, depending on how you have things organized, this may be the one feature not owned by the development team: you may have an operations team that actually owns reliability, and nothing says they have the ability to fix it.

So where do you go to learn more? Well, this is SRE, so you want to go to your friendly local site reliability engineer. If you have an SRE team, that's where they are. If you don't, you may find them sitting in the DevOps team, the operations team, or the release team, and they would love to talk to you. There is no reason why they can't be seated on a dev team, and there's no reason why you can't become the expert on this. The best starting source is the Google SRE books, at sre.google. And SREcon 2021 is coming up in October; of course, that will be fully online, so there's lots of opportunity to attend.
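The burn-rate thresholds described above can be sketched in a few lines; all the numbers here are the hypothetical ones from the talk:

```python
# Burn rate: current error rate divided by the error budget.
# The time terms cancel, so the result is a pure ratio.
slo = 0.9999
error_budget = 1 - slo  # fraction of requests allowed to fail

def burn_rate(error_rate: float) -> float:
    return error_rate / error_budget

def hours_until_breach(burn: float, window_days: float = 30) -> float:
    """At a constant burn rate, when is the 30-day budget fully spent?"""
    return window_days * 24 / burn

# A measured 0.02% error rate against a 0.01% budget is a burn rate of 2:
print(f"burn rate: {burn_rate(0.0002):.1f}")

for burn in (1, 2, 30, 720):
    print(f"burn rate {burn:>3}: budget exhausted in "
          f"{hours_until_breach(burn):.1f} hours")
```

Alerting on burn rate, rather than on the raw 30-day SLI, is what turns a slow-moving monthly objective into something you can page on.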
I think it's nominally placed in the western United States, but again, travel means nothing right now. And with that, I'll take questions. Thanks.

So, first question: how do we ensure that product will prioritize important fixes for things that caused incidents in the past?

There are two answers here. First of all, "important" is defined as what's important to product. So if product doesn't care about the incidents, they aren't important and you don't need to fix them. In all probability, they do care about the incidents, and then they need to prioritize them; that's almost a separate question, entirely aside from this. The other thing is, if you look at this, let me go back... yes, this slide: nowhere in here do I exclude incidents. So if you have an incident that takes the system down for 15 minutes, you have failed the SLO, and all the metrics that both engineering and product are looking at are blown. That should be enough to drive them to prioritize it.

Right, because the SLO is written by product and engineering together, right? So they are aligned on that.

Ideally, it would be great if you could get product to write the SLOs entirely. But in reality, product is not going to come up with an SLO in numbers that are measurable in your system. They would come up with something, but engineering is going to have to help them rephrase it into something that is measurable, understandable, monitorable, et cetera. That's the part where engineering has to help write it.

Makes sense. Thanks, Andy. And last question: how do we define the starting specs for SLOs without having accurate metrics in the first place?
There's a reason why this entire talk starts with SLIs. You cannot jump straight to objectives without indicators; you need to start there. And those are not easy, as much as the talk may make them sound like they are. I don't know who raised this question. Do you want to know how you start writing the metrics for SLIs, or is the answer really that you start with the SLIs?

I can tell you the original question, which may give more context; the asker is writing in the chat, so I'll be able to relay it to you. Initially the question was: how do we define the starting specs if we can't count on users or clients to give us accurate numbers?

Okay, so usually this means that you have a new system. If you have a new system, then what you need to do is talk to product and sales yourself and ask: what do I need to be able to sell this thing, or to make this thing usable? What does it need to go into the market? What does it need for competitive advantage? What does it need for users to use it? What does it need to demo well? Those become your starting approximations, and from there you can run the whole process forward. Of course, as you do get real users, there's a very good chance of some changes.
Like any other feature, there's a chance that you'll change and say, okay, all of a sudden: we didn't think we needed to worry about latency, because all of our users were supposed to be batch, but it turns out the maximum allowed latency is actually five minutes, and we do actually need to monitor that, even though five minutes sounds like such a tremendous span of time compared to, say, our throughput of a million data points per second. Well, some people are actually sending in, whatever, three hundred million data points, and we need to handle a five-minute latency. Maybe just so that we can say, for example, that we're going to reject very large requests: if we can't handle a large request because we're fully loaded, we reject it. That would be a way to meet that sort of SLO, if you put a five-minute latency in and you say: at times that we cannot handle very large requests, you will get an error.

Right, so you can start from existing SLOs to figure out what SLOs you need to set, right?

Yeah, I mean, the only way you can start from SLOs to get to your SLOs would be if you have a competitor, and then you just copy their SLOs. A competitor, or a direct requirement: if you have a service-oriented architecture and you have something like a full process flowing through your company, you can talk to the services that are going to call your service, and they'll say, well, we have a hundred milliseconds; that's it.

Right, makes sense. Thanks, Andy. Thanks again for coming; we have run out of time. I'll pop over to the breakout room. Cool, I'll see you there.