Managed services and the path to the future. My name is Sasha and I work for Red Hat. By way of introduction: I've been in the industry for a long time. I started off as a developer, I have a computer science degree, and then I went and had all sorts of different jobs in technology, most of which didn't exist when I was a kid, so you couldn't even choose them as a career path. By and large I like solving problems with people and technology, and I like to believe that the world is getting better every day and that we are in a good industry to help it get better. And that's why I'm excited to be at this conference today and be talking to you all, and maybe we'll come up with new ideas. Okay. So, I'm going to be quoting this book a lot. This is the book on Site Reliability Engineering, the first one by Google, co-authored by a whole bunch of awesome people. I really like it; it's a really good book to get started with if you're just getting into SRE concepts and want to understand what they're all about. But I'm going to start with the sentence I like least in the whole book, and that sentence is: "SRE is what happens when you ask a software engineer to design an operations team." I really, really don't like this. This is usually my face when I see a definition like this, because I think it's very elitist: it assumes that developers are cooler than ops people, that ops people couldn't come up with the idea of automation, and that Google had to come in and solve all the world's problems. That really isn't what happened. The definition that I do like is that SRE is roughly Google's implementation of DevOps.
That definition is actually also in the SRE book, so I didn't make that one up. So we started off with DevOps more than a decade ago. This happens to be a picture of the organizers from the first two years of devopsdays Chicago, and I'm the only person in this picture who identified as a developer. Most of the other people identified as ops people, and all they really, really wanted to do was automate themselves out of a job. We've been running this conference for a while, discussing this kind of idea. There was an awesome person, and she's right here, wave at me, Bridget, who helped grow this conference into a global enterprise that thousands of people in hundreds of cities show up to. And again, we were all talking about automation: how do we get to a better place where we get to solve more interesting problems? It's not that easy to get to the future. If you were alive in the 90s and remember what they looked like: getting a new server up, if you were lucky, took you three months, because you had to actually file a procurement order, you had to wait for the actual physical server to show up, and then you had to build a server rack and wire it up and configure it and install stuff and whatever. Also in the 90s, this was very common, and unfortunately it still happens today: system downtime for two days because we are upgrading and deploying a new code version. How many nines is that? Does anybody know? If you had a couple of maintenance windows like that, that would be less than two nines, because two nines only gives you 3.65 days a year of downtime. And that's just plain maintenance, right? We took down servers for plain maintenance for whole weekends.
And we used to think that speed and reliability can't be friends, and dev and ops can't be friends, because devs are incentivized to push to production as quickly as possible, which breaks things, and ops are incentivized to keep the lights on: they carry pagers, they get paged in the middle of the night, and they hate change, right? It's all about incentives. But the metaphor that actually works better for software development is riding a bike. Our inherent assumption is that if we go slower, we break things less, but that's not always true, not in all domains. Software development is kind of like riding a bicycle: if you go too slow, it actually breaks more, because you can't keep your balance. So if that is the case, and going faster is actually better for everybody, then why was it such a problem to automate things? Well, the biggest thing was that effective automation requires consistent APIs, and that's something we didn't have. One of the words that pops up a lot in this talk is APIs; you need them to be able to automate anything. And I don't just mean the bigger control plane and being able to automate something at a cross-cloud level. We started at the basic level: you had to start with an operating-system-level API. Linux was lucky there, because it's a file-based system, so you can write a script and automate things. But Windows was an executable-based system, so you depended on people having shipped an actual API for the stuff you wanted to automate. And guess what? People didn't have an API for the stuff you wanted to automate. And Windows was actually 41% of the server market in the 2000s, so this was a real problem that people were trying to solve. Which brings me to one of my favorite transformation stories, the story of PowerShell, championed by Jeffrey Snover.
PowerShell is the CLI scripting language and configuration management framework that shipped with Windows in 2006. Before that, Jeffrey went through five years in his career where he was on the verge of getting fired every day, because angry executives were yelling at him: "What part of effing Windows do you not understand? Admins don't want an API." Turns out that admins do want CLIs and APIs, and do want to automate things, and fortunately automation won that battle. Every wave of automation enables the next wave of automation, right? Next, we got to infrastructure-level APIs. Here's another quote from the SRE book: "Central to Borg's success and its conception was the notion of turning cluster management into an entity for which API calls could be issued." So basically we arrived at the idea that we needed an API for the entire infrastructure, and it had to be consistent and manageable by automation. Automatable. And it wasn't just Google, obviously; it was Amazon, it was Azure. Everybody was arriving at the same idea, because there was pressure to deliver adaptable services at scale, and you needed an API to do that. There was another thing happening in a slightly different part of the industry: if you weren't Google or Microsoft, and you didn't run a gazillion servers and couldn't custom-order your server racks the way you wanted them, you still needed automation. So companies like Puppet and Chef and Ansible were starting to build that automation for your own data center. What was new was the service level objective, and that is business-approved availability. There's this concept that 100% reliability is actually unattainable, unnecessary, and also extremely expensive. Even if we talk not about 100% but about five nines, which was everybody's holy grail: that's 5.26 minutes a year of allowable downtime.
That's all you can have with five nines. And the major question is: will your users even know that you're that available? The resounding answer is no, they will not, because the internet service provider's base error rate is up to 1%, which means you can be available at four nines and the rest will just drown in the ISP's network errors. So you're essentially spending lots of money and effort trying to attain something that isn't actually useful to anybody. SLOs are about aligning incentives between business and engineering: getting people to talk to one another, getting the business to agree that 100% availability is not something we're actually going for. And with SLOs comes the concept of error budgets, which is the acceptable level of unreliability. The error budget is one minus the SLO. So if you had four nines, that would give you 0.01%, which would give you about 13 minutes a quarter. Thirteen minutes a quarter is not a lot of downtime, but it's a lot more than five minutes a year, and that gives you some breathing room for the stuff you can do with the time you're allowed to be down. And error budgets are actually about aligning incentives between dev and ops, because developers are measured on the same SLO that operations people are measured on. Imagine I have those 13 minutes a quarter, and I'm pushing code and writing new code and making changes, and we get to the point where we're at 10 minutes out of 13, so I have three minutes of downtime left, because we had some outages when we pushed new changes. And I want my next promotion, so I want to push that big, big change at the end of the quarter so I can get promoted at the end of the year. So it's in my best interest to test the hell out of it before I ask the ops people to push it, because I only have those three minutes left in my error budget for the quarter.
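The nines arithmetic above can be sketched in a few lines of Python. This is just an illustration of the numbers in the talk; the function name and the flat 365-day year are my assumptions:

```python
def downtime_budget_minutes(nines: int, period_days: float = 365.0) -> float:
    """Allowed downtime in minutes over a period, for an SLO of `nines` nines.

    The error budget is one minus the SLO, e.g. four nines -> 1 - 0.9999 = 0.01%.
    """
    error_budget = 10 ** -nines  # 2 nines -> 1%, 4 nines -> 0.01%, ...
    return period_days * 24 * 60 * error_budget

# Two nines: 3.65 days of downtime a year
print(downtime_budget_minutes(2) / (24 * 60))
# Five nines: about 5.26 minutes a year
print(downtime_budget_minutes(5))
# Four nines: about 13 minutes a quarter
print(downtime_budget_minutes(4, period_days=365 / 4))
```

Running this reproduces the figures from the talk: 3.65 days a year at two nines, roughly 5.26 minutes a year at five nines, and roughly 13 minutes a quarter at four nines.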
So SLOs and error budgets actually help us align speed and reliability in a way that makes everybody more successful. I'm not going to dive deep into any of this, but there are other things that are important to SREs. Monitoring: if you don't know whether your service is up or down, then none of this matters, because you can't actually measure how many nines you have, or anything else. Observability is a related concept, and again it's about how much you know about how your services are doing. This one is important to me, and to other people who have carried a pager in their life: you need a good signal-to-noise ratio, because if you page people about every single unimportant thing, they're going to stop responding to pages, and if you don't page people when their help is actually required, that's also a problem. We could also dive into who should carry a pager, but we won't. Anyway, when you talk about SRE and automation, it always sounds like automation is going to solve everybody's problems, so there is a little bit of caution here: automation can also be dangerous. It's a really good way to make errors rapidly and at scale; you can take down the entire AWS infrastructure with one failed line of script. The second part is that automation drift starts immediately. You write a service, you write automation for the service, and then you update the service, so you have to update the automation. You immediately start accumulating differences between the automation and the actual services it runs.
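One common way to get that signal-to-noise ratio right is to tie paging back to the error budget: page on how fast the budget is burning, not on every blip. This is a hedged, single-window sketch of that idea; the function names and the 10x threshold are illustrative assumptions, not something from the talk:

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How fast the error budget is being consumed; 1.0 means exactly on budget."""
    error_budget = 1 - slo
    return observed_error_rate / error_budget

def should_page(observed_error_rate: float, slo: float, threshold: float = 10.0) -> bool:
    """Wake a human only when the budget burns much faster than planned."""
    return burn_rate(observed_error_rate, slo) >= threshold

# With a 99.99% SLO, a 0.1% error rate burns the budget 10x too fast: page.
print(should_page(0.001, 0.9999))   # True
# A 0.01% error rate is exactly on budget: no page.
print(should_page(0.0001, 0.9999))  # False
```

Real alerting setups typically combine several lookback windows and thresholds, but the underlying decision is this ratio: observed error rate over budgeted error rate.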
Automating a one-off is also inefficient: I could spend six hours automating a task that actually takes me six minutes to do manually, and if it really was a one-off, it's never going to happen again and I just wasted that time. And then, very importantly, all systems are socio-technical, so the goal is never to automate humans completely out of a system. We make this error all the time: we say we're going to automate all the things because humans are the problem. Humans are the problem a lot of the time, but they are also the solution, because thermodynamics says the universe moves toward chaos. All systems left unsupervised tend toward chaos, and entropy always wins in the end, so you need a human to maintain order. So let's talk about what the future is. Clayton said that he doesn't know what the future is. I do. Well, no, but I think there's a certain set of goals we all have; we're all striving toward the same thing. The future is already here, it's just not evenly distributed. When we talk about five nines and all the fancy automation, there are companies running at close to five nines, and there are companies that go, well, I had to take my service down for two days just to upgrade, because we had merge hell and we don't actually test anything before it gets to production. And that's what people live with. People also have, like, 70 years of what we call legacy code, and that's the actual thing that makes the money; they have to run a business. I'm biased because of where I work: I'm at Red Hat on a managed services team. So I think the future is managed services, and in many ways it's all about platform as a service. We've been talking about platform as a service for a really long time; we've wanted it for ten years, probably twenty. All we want is to get to the point where we can just run our applications. There have been many attempts to implement a PaaS, some
of them more successful than others. The problem is that PaaS really only works as long as your environment is homogeneous, and no one's environment is homogeneous. If you have a big enough company, you don't have a homogeneous environment: you're probably running on three clouds, a data center, and, I don't know, some spreadsheet somewhere runs on Excel on someone's laptop. It just happens. And we know that effective automation requires consistent APIs, and we know that every wave of automation enables the next wave of automation, which is why I'm happy to be in the Kubernetes camp: I think Kubernetes is potentially the thing that will let us proceed to the future, with that consistency, with a consistent API across different systems and different deployments. 85% of global IT leaders agree that Kubernetes is the key to cloud-native application strategies. I don't know about all application strategies, but, you know, cloud-native ones. The point is, everybody wants a piece of Kubernetes, which is cool. The other thing is that we all have open source now, and it provides us with a way of setting a standard, sharing the knowledge we have in common, and working together to define what that consistent API looks like. But the problem with open source, and yes, probably everyone is going to have this slide in their presentation, the problem with open source is the proliferation of services and tools and all the things. That's a real-world picture of someone trying to run Kubernetes in production; that's what it usually ends up looking like. But you do have an advantage today compared to just a few years ago: if you want to get out of the data center management business you can go to the cloud, and if you want to get out of the Kubernetes management business you can go to OpenShift. Again, like I said, I'm biased. So I'm on this team that works on these services that are called
Red Hat cloud services. We run on top of different clouds; you can pick your favorite cloud and run your managed OpenShift on it. OpenShift is an opinionated, turnkey way to get all the bells and whistles you need inside your Kubernetes, so you don't have to browse that CNCF landscape slide and figure out which project is maintained by a single maintainer on weekends, betting all your security on something Joe maintains in his garage when he has free time. And we actually run SRE for the folks who rely on these Red Hat cloud services on top of different clouds, which is an interesting problem to solve, because we're running SRE but we don't own the infrastructure. That's exactly the problem every company in the world that is not Google, Microsoft, or Amazon is trying to solve, and we're trying to solve it for other people, which is cool. At Red Hat we also went through a journey: when we first started offering these services it was an SLA of two nines, and now it's four nines, and we're trying to get even better, because improvement is the thing. If you compare traditional organizations with cloud-native organizations: in a traditional organization, again, we have this proliferation of different infrastructure and this proliferation of different platform services, and as you standardize, you enable people to automate away this complexity and to share something across the board. Eventually what you want to get to is: the infrastructure services are run by the cloud provider or somebody else, the platform services are run by somebody else, and you only have to worry about the applications that you build. There's this picture I like; it comes originally from Hans Moravec, talking about AI taking over the world, which probably will eventually happen, I don't know. Basically it's a picture of water
rising in a landscape, and as the water rises, AI gradually takes over people's jobs. We're not talking about AI yet here, we're talking about automation, but it's still happening: the API line keeps getting higher and higher. If you are a driver, you probably want to start looking at other options, because self-driving cars will eventually arrive. The goal is to keep your skills above the API line and solve actually interesting problems, instead of doing something that's going to be table stakes in a few years. So, to the extent possible, you want to outsource your SRE to your platform provider. And last but not least, I wanted to mention something I'm working on. First of all, ideas are open source, which is why I learned a bunch of the ideas in these slides from other smart people, and hopefully other smart people sometimes learn ideas from me. We know that open source won, which is cool, but we're now facing a slightly different challenge than we did before. In open source, we always try to incorporate the knowledge we learn back into the code base, upstream first; we try to share. But now that we're moving to everything-as-a-service, we're having this problem again where the platform is proprietary, so we're no longer sharing knowledge, no longer contributing it back upstream. So at Red Hat, and this is in super-early stages, we're starting a new initiative called Operate First. It's the concept of incorporating operational experience back into software development. You can find some of these concepts on the website operatefirst.cloud, and there's an effort to get people started with a playbook for learning how to run SRE, and also a playbook for sharing operational knowledge across different clouds, so we can all learn from each other about how we run these services. That was all I wanted to share with you today. I'm Sasha, you can find me on Twitter, especially follow me if you like cat videos, and I'd be happy to continue this conversation,
because, like I said, I think we all learn from each other all the time.