So, good afternoon. Okay, so I am from Boston. I live in Boston, I don't get to speak here very much, and I know from my own experience that that was weak. Let's try that again. Good afternoon! Much better.
So, hi there. My name is David Blank-Edelman. I am the technical evangelist, or as I like to say, evangelist with a hard G, for a company called Apcera. We make container management system things, and if you want to talk to me at some point about container management stuff, I would be delighted to talk to you. I also know that there's going to be some sort of thing put on by my company, a happy hour that you're invited to, where you can come see me afterwards, or come see the nice salesperson who's around here.
But what I really want to talk to you about is SRE. You know, we're going to do SRE for DevOps here, with the assumption that you kind of already know the DevOps piece of it and you want to know what this SRE thing is. Just for my own curiosity's sake, because my sense is that some people have a handle on this and some people don't: how many people here have SRE as your title, so you can keep me honest? Wow, there's a lovely group of you. So that's going to be great. You will let me know afterwards whether I have done a good job of representing you. If I haven't, I apologize in advance. But it's going to be great.
Okay, so the first question is: how many people can tell me what this is? I'll move that little cursor out of the way, I apologize for that. How many people can tell me what this is? Raise your hand if you know. Wait, you said bad documentation? I don't think that's what I'm looking for. What else? Wait, wait, it doesn't help, you've got to raise your hand so I can point to you. Oh, it is a white screen.
This is what happens when a PHP app errors. Okay, this is what you get. And in fact, you know, I was going to put this up, but this is causing too many people in the audience to twitch, so I'm going to go back to this.
The idea here is that you can have in your product, in your service, in whatever, as many features as you want. It can do everything from taking your dry cleaning in to, I don't know, removing vowels from new startup names, whatever it is, right? It can be any of these sorts of things. But the key thing that really is important at the end of the day is this question about reliability, because if it isn't up, it's useless.
And so what SRE is about: SRE is a field that is attempting to make sure that you have the right amount of reliability, and you'll see what I mean by that later. To do that, it attempts to engineer failure out of the system. That's its primary purpose in life, to do this so that you have the right amount.
Because the tricky thing is that you have this interesting tug-of-war that goes on. DevOps was another response to this, and we'll talk about that as well. There's a tug-of-war between the people whose job it is to write software, and therefore to iterate, to make new things and to make features, and those who have to operate stuff and really would like things to stay as stable as possible, because the more you perturb it, the less reliable it is, right? So we have this really tricky thing. There's a group of people out there who want things to stay the same, and a group of people who want to iterate as fast as possible. So what do you do about that?
Before I talk more about how you deal with that sort of stuff, I feel really compelled to put this picture up. This is the sort of picture that we used to put up in the beginning of DevOps. "Legacy of man" is what it's called, in which we would say: once upon a time, over here, is the sysadmin, and now we have evolved, over there, to DevOps, right?
This used to be the tale that was told, and I'm here to tell you that I'm not asserting that over here is DevOps and over there is SRE, okay? I am not asserting that. What is going on here is that there are two sort of parallel tracks in the operations world, attempting to deal with the same problems but dealing with them slightly differently. One is not more advanced than the other. They are just two different ways of doing it, and they have things to teach each other, and that's part of what I want to talk about.
Okay, so one of the things that comes up a lot when I talk about SRE: SRE in theory, at least the title, certainly came from the land of Google. They published this lovely book. It's a little short on plot and character development, but I really recommend it. It's their understanding of what SRE means to Google and what Google SRE is, and it's a really good book. But the thing that people often ask when I talk about SRE is: is SRE just a Google thing? And it's not, right? This is just a bunch of names that all have SRE teams, even if sometimes they don't call them SRE teams. For example, Facebook has what they call production engineering, which is their SRE group, but everybody thinks they're doing SRE.
The other thing I want to say about this, in terms of there being lots of different places and lots of different ways to do SRE: there is a book coming out. We haven't really announced it yet, so don't tell anybody, or maybe tell everybody, that there's another book coming out about SRE that I'm going to be editing. So, you know, you might find it interesting. It attempts to approach the wider bit of what SRE is, to map out different parts of what the SRE world is like, the SRE space, and there's some cool stuff in there coming. So, okay, are we ready to talk about SRE?
Yes? See, I love it. See, it's good when you prep the audience.
So here's the place that I like to start when people come to me and say, "What's SRE? How do I learn about it?" Once upon a time, at the very first SREcon, which is, as you heard, a conference specifically for SREs to talk about SRE sort of stuff, the very first talk was a keynote given by the gentleman who was responsible for coining the name at Google, Ben Treynor, now Ben Treynor Sloss. He gave a really great talk that was meant to indicate what he thought SRE was, at least to him, and it had in it this slide, which I did not actually get from him directly. So it's a dramatic recreation, and any typos and stuff are mine. This was his list of what SRE consists of. Now, we're not going to be able to talk about all these things, because I only have a short amount of time, so we're going to focus on some of the parts here.
Specifically, let's start with these three. This says that you should have an SLA, a service level agreement, for your service; you should measure and report performance against that SLA; and you should use error budgets and gate launches on them.
Now, error budgets, okay? That's a really interesting idea, one that I think is perhaps one of the more crucial things to hear about out of at least the Google strain of SRE. The idea goes a little bit like this. I have a service or an application, and I expect it to be a certain amount reliable. There are very few services in this world that have to be a hundred percent reliable. Maybe the thing that's ticking in your chest. Maybe the thing that's keeping the plane in the sky. But everything else probably can have some downtime. You can have fewer nines than all of them, right?
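To make the "nines" remark concrete, here is a minimal sketch of the arithmetic behind an availability target. The function name and the sample targets are illustrative assumptions, not something shown in the talk; the only idea taken from the talk is that fewer nines means more permitted downtime.

```python
# Illustrative only: convert an availability target into allowed downtime.
# The name `allowed_downtime_hours` and the sample targets are assumptions.

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down while still meeting its target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    # Each extra nine cuts the permitted downtime by roughly a factor of ten.
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% available: about {hours:.2f} hours of downtime per year")
```

Two nines leaves a few days of downtime per year, four nines leaves under an hour, which is why picking the right number of nines matters before talking about budgets.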
So if you could come up with an understanding of just how reliable you think your service needs to be, like, is it okay to have downtime once a year? Okay, it might be. Then what you can do is say, let's just say I have a service that I think should be up 80% of the time, and then 20% of the time it's okay if it's down, for maintenance or whatever. My error budget is that 20%, that 20% where I do not expect it to be up.
Okay, when I have an error budget, that means I can do the following. We say, hey, we're going to determine what "up" means, or what "working" means. We're going to come up with a service level objective, like, you know, does it respond, does it respond this quickly, all the other things that you can imagine as part of an objective for your service. Once you can come up with that, and you and everybody else in the organization can agree on that for the service, then you can stick it in your monitoring system, and you'll have a monitoring system that everybody can look at as the source of truth and agree with. So far so good.
Once you have that, then when it comes time to decide whether you want to launch the next version of your product or your service, or to rev or to deploy something, you can say, hey, over this period of time, have I exceeded my error budget? Well, I haven't. I've been up 90% of the time, so I'm well within that budget. But if my service has been up 70% of the time, then you could make the really reasonable decision that says, no, no, no, I'm not going to perturb this any further. We're going to spend a little more time figuring out why this thing isn't up as much as it should be. So you're basically gating launches based on whether or not you have stayed within your error budget. Does that make sense?
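The launch gate described here can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the measured uptime is assumed to come from the shared monitoring system, and the numbers (an 80% objective, 90% and 70% measured uptime) are the ones used in the talk.

```python
# A minimal sketch of error-budget gating, not any particular tool's API.
# All names here are hypothetical; uptime figures would come from monitoring.


def error_budget(slo_percent: float) -> float:
    """The share of time the service is allowed to be down."""
    return 100.0 - slo_percent


def budget_remaining(slo_percent: float, measured_uptime_percent: float) -> float:
    """Positive means budget is left; negative means it has been overspent."""
    downtime_percent = 100.0 - measured_uptime_percent
    return error_budget(slo_percent) - downtime_percent


def may_launch(slo_percent: float, measured_uptime_percent: float) -> bool:
    """Gate a launch: allow it only while the error budget is not exceeded."""
    return budget_remaining(slo_percent, measured_uptime_percent) >= 0.0


if __name__ == "__main__":
    slo = 80.0  # the example objective from the talk
    for uptime in (90.0, 70.0):
        decision = "launch" if may_launch(slo, uptime) else "hold and fix reliability"
        print(f"SLO {slo}%, measured {uptime}%: {decision}")
```

With an 80% objective, 90% measured uptime leaves budget to spend on a launch, while 70% has overspent it, so the launch waits until reliability work brings the service back within budget.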
Yes? Good.
So here's the thing to realize about SRE that I think is useful. Using certain techniques, you can create these virtuous loops, where if in fact you find that your service is down more than it should be, well, if you can't actually launch new features, that's going to put some pressure on the developers to make sure that it's more reliable, right? And that's a beautiful thing, this loop here, because things get better in the right direction.
Okay, so here are the middle ones. Common staffing pool for SRE and dev, meaning everybody you hire is somebody who can write software. Excess ops work overflows to the dev team. This is going to be really, really heretical in this room. I think you're going to love it, though. And the idea that you're capping SRE operational load at 50%. The notion is that your SREs should be able to work on things that make the system better: not firefighting, not doing tickets, not carrying an operational load.
So if you're running a service and its operational load is over 50%, then you get what they call, how would I say it nicely, handing back the pager, where you say, "Okay, folks, I'm not going to be able to take care of your service until you get your operational load under control. Here's the pager. Put your devs on the pager." Once the devs are on the pager, the first two times they get woken up at 2 a.m. is the last time those bugs show up in production, right? So there's a notion that you can hand back the pager, right?
And that's a kind of cool idea. And I want to say, in order to do that, you can guess that you have to have management support to do the things we're talking about here. If your management is not behind this, if someone at the top isn't willing to say, "No, we think reliability is as important as your latest, greatest feature, or when you actually launch," then this doesn't work. Do not try to do this at your company if you don't have that. Don't try, because you will only set yourself up for failure, and I don't want that.
Okay, so the last one that I said I wanted to talk about, which has shown up, lovely, I'm so happy, in the DevOps world, is the notion that you should have a post-mortem for every event. People like to call it different things that don't have the word "mort" in it, but whatever. And that post-mortems are blameless and focus on process and technology, not people. Somebody made a mistake. What was the context? What were the controls that were in place that allowed them to make this mistake, right?
We hear about S3 going down because they were able to push the button that made everything go blah. What should have been in place to prevent that? So you spend your time looking at the process, not the people, because it is truly the case that you can't fire your way to reliable. It's just true. One of the things that we have in this country is this notion of, "Oh, they made a mistake, fire them," right? And the thing is that after a while, all you're going to do is get down to one person sitting in the corner with a cigarette doing this, you know, not willing to do anything for you, because they're just not going to do stuff. You can't fire your way to reliable.
So part of the idea here is that there are ways to structure your operations stuff such that you can work together to continuously improve in one of these virtuous loops.
So with that, because there's a bunch of people who do this job for a living about to talk about this, all I wanted to do is basically put SRE on your radar. There are lots of ways to do SRE. There are lots of places to do it. I'd be happy to talk about this stuff, and be happy to talk about SREcon, et cetera, but I want you to know that, you know, you should just check this stuff out. Go read the Google book, read other books, come talk to me, et cetera.
And I guess if you want to get in touch with me, because I'd love to talk, here's how you find me. Here's my Twitter thing, otterbook, from that book that I wrote once upon a time. If you want to talk about Apcera and the stuff we make, I'd be glad to do that. If you want to drink, I'd be glad to give you a thing to drink with. So with that, I just want to thank you for your attention, and I hope that you'll give as much appreciation and attention as you've given me to the lovely panelists who are about to come out. Thank you.