Hello, everyone, and welcome to this presentation on SRE, in the DevOps track. First of all, who am I? My name is Ricardo Amaro. I work on the operations team at Acquia, tier two, since 2011, based in Lisbon, Portugal. I'm currently transitioning to SRE, Site Reliability Engineer. I've been using Linux since the 90s and have been in the Drupal community for over eight years, mainly contributing to the infrastructure and testing teams. You guys know the test bot, Drupal CI? Drupal CI, someone? Drupal CI, OK. And that is where I started applying SRE. It used to take a lot of manual work until I decided to start fixing and automating a lot of the stuff that we do back home. So now we never miss any of the fun, like you can see in the picture.

Here are some numbers from Acquia. We have a total of roughly 17,200 instances running 54,000 production sites, not counting dev and test environments. Around 3,000 API calls per second, a total of 20 or more availability zones, and eight regions. Am I correct? My boss is here, so, yeah. Having more than 50,000 Drupal sites in production would not be sustainable without a lot of automation, large-scale operations, and permanent reliability improvements.

The objective of today's presentation is to show you the site reliability engineering philosophy and the strategies that we are implementing internally. Here is what we will talk about today: what SRE is, the tenets of SRE, reliability and toil, error budgets, keeping the service level objectives, development and operations, monitoring and being on call, release engineering, and the postmortem culture and learning from failure. This will be an overview of the concepts behind it. Acquia is already on the right path implementing some of these, as you will see. But don't expect any special code or software: this is basically about managing systems and people at scale.

So, first question: what is SRE? Please raise your hand if you have ever heard about SRE. OK. And who is applying SRE? Oh, nice. Two. That's WonderCrawl, OK. So what exactly is site reliability engineering? The term was coined at Google in 2003, when Ben Treynor was hired to lead a team of seven software developers to run a production environment and ended up applying software engineering to an operations function. It was originally motivated by the question: as a software engineer, how would I want to invest my time to accomplish a set of repetitive tasks? Now it has become much more. It's a set of principles, practices, and incentives within the software engineering discipline. There are now many companies that have embraced site reliability engineering, and they take it very seriously, as you can see in their job listings: Microsoft, Apple, Amazon, Google.

First and foremost, SREs are engineers. They apply the principles of computer science and engineering to the design and development of computing systems. Sometimes their task is writing the software for those systems. Sometimes their task is building all the additional pieces those systems need, like backups or load balancing. And sometimes their task is figuring out how to apply existing solutions to new problems. Who here knows exactly what DevOps means? So, DevOps. OK. Some people may think that SRE and DevOps are two overlapping terms, but it's actually the other way around.
While DevOps is a practice coined in 2008, SRE appeared some years before, in 2003, and can be seen as a subset of DevOps that includes some extra skills, like the ones we are going to see in this presentation.

Let's now take a look at the tenets of SRE. These are taken from the actual Google SRE book, which this presentation is based on: ensuring a durable focus on engineering, pursuing maximum change velocity, monitoring, emergency response, change management, demand forecasting and capacity planning, provisioning, and efficiency and performance. Site reliability engineering represents a significant break from existing industry practices for managing large and complex systems. To explain it better, here are the ten action items that Treynor had, and we will analyze these action items today. This is probably the most important slide of the whole presentation, and we will be talking a little bit more about it. First of all, hire only coders. Have service level objectives for your service. Measure and report performance against those SLOs. Use error budgets — who here knows what error budgets are? This is really, really important. Have a common staffing pool between SREs and devs. Let excess ops work overflow to the dev team. Cap SRE operational load at 50% and share 5% of the ops work with the dev team. On-call teams should have at least eight people in rotation per product (or six at each of two sites), and they should receive a maximum of two events per on-call shift. Always write a postmortem. And every postmortem should be blameless and focus on process, not people.

Why do we need those, you might ask? Let's begin by looking at the reliability and toil problem. We all know that features for your product are very important, right? But there is one feature that you cannot live without. What is the most important feature of a product? Anyone? It has to work. Exactly. So how about the 503 feature? You probably see this one when it doesn't work. I'm pretty sure we all understand and agree that the most important thing is that the product works, and that it is reliable. Reliability is the most fundamental feature of any product: a system isn't very useful if nobody can use it. Because reliability is so critical, SREs are focused on finding ways to improve the design and operation of systems to make them more scalable, more reliable, and more efficient.

But that hasn't always been the case. To explain why we care so much about this today, let's go back to the 80s. The software methodology used back then was the waterfall, as you can see in the picture. It was great for software developers: you had the requirements, and then you would go to design, implementation, verification, and maintenance. And you can see a problem there: when you launch the product, it is disconnected from the customer. The developers didn't have to answer for the problems that happened after the software was implemented; they were not accountable for them. Then, after the web appeared, several kinds of services were born, like SaaS, PaaS, and cloud, and operations were no longer on the customer side but on the service provider side. That's clearly one of the reasons why these services became so popular: that overhead moved from the customer to the service provider. But there is at least one obvious problem, or conflict, in this model. While operations tries to keep the service up and reliable, and gets rewarded for that, the developers are rewarded for the opposite thing.
Let's say, more or less, the opposite thing: they are rewarded for the features that are launched. And that creates instability. Therefore, ops' only way to keep the product up while features keep being added is to increase toil, while having no time to automate that work.

The types of toil: it can be manual, which includes work such as manually running a script; repetitive, work that you do often and that never ends; work that you could easily automate but aren't automating; tactical, which is unplanned work — you're working on something and suddenly you're interrupted, because unplanned work comes in and you have to stop what you're doing to answer it; and work with no enduring value — what you do today isn't useful anymore tomorrow. And the worst thing is that it scales linearly with service growth. What does that mean? It means your only solution is to scale with bodies: as your business grows, you have to hire more people. In this model you just throw people at the reliability problem and keep pushing, sometimes for a year or more, until the problem either goes away or just blows up. So we need to reduce toil in a smarter way, because if we are successful in our business, the workload will grow exponentially, trending to infinity. And as we know, that curve is going to lead us to failure. So we need to cap ops workload at 50% on the SRE teams and leave most of their time to write code and reduce that toil. (Having problems with the mic?)

We can take Google's example on the operational side. Their goal is to always keep SRE ops work capped at 50%. The rest of the time should be spent on engineering project work that will either reduce future toil or add service features. That development typically focuses on improving reliability, performance, or utilization, which often also reduces toil. And this 50% cap must be enforced at all times, because if left unchecked, ops work will fill up 100% of everyone's time very quickly.

So how do we solve these reliability problems? This conflict is not inevitable. The solution is indeed error budgets. As we will see next, the organization needs to agree on those for this to work. SRE then only prevents releases if the error budget is exceeded — we'll see that. First, some terminology. You know what an SLA is, right? Who knows what an SLI is, the one at the bottom? Have you heard about it? It's a service level indicator. SLOs, anyone heard about those? These are completely different things, right? A service level objective, which is that one, is a key element of the service level agreement, the SLA, between the service provider and the customer. An SLA is the entire agreement, while the SLOs are specific measurable characteristics of the SLA, such as availability, throughput, frequency, response time, quality, et cetera. And the SLIs — I hope I'm saying this correctly; SLIs is correct, right? — are the measures the service provider exposes to the customer. SLIs form the basis of SLOs, which in turn form the basis of the SLA. So you first need indicators for your platform; those indicators then feed the objectives you want to meet; and in the end, you actually get an SLA.

So the business, or the product, must establish what the availability target is for your system. Once you have done that, one minus the availability target is what we call the error budget, the one down there. For instance, if the target is 99.9% availability, that means 0.1% unavailability. And now we are allowed to have 0.1% unavailability: that is our budget, which we can spend on launching things, implementing new features, testing stuff. Not in production, please.
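To make that chain concrete, here is a minimal sketch, in Python, of the SLI → SLO → error budget relationship just described. The SLO value and the request counts are made-up numbers for illustration, not Acquia figures:

```python
# Minimal sketch of the SLI -> SLO -> error budget chain.
# The objective and the request counts below are invented for the example.

SLO_AVAILABILITY = 0.999          # agreed objective: 99.9% of requests succeed

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that were served successfully."""
    return good_requests / total_requests

def error_budget(slo: float) -> float:
    """Error budget: 1 - SLO, the fraction of requests we are allowed to fail."""
    return 1.0 - slo

# Example month: 10 million requests, 7,000 of them failed.
sli = availability_sli(good_requests=9_993_000, total_requests=10_000_000)
budget = error_budget(SLO_AVAILABILITY)      # 0.001 -> 0.1% of requests
spent = 1.0 - sli                            # 0.0007 -> 0.07% actually failed

print(f"SLI: {sli:.4%}, budget: {budget:.2%}, spent: {spent:.2%}")
print("Within budget" if spent <= budget else "Budget exceeded: freeze risky launches")
```

With a 99.9% objective the budget is 0.1% of requests — or roughly 43 minutes per 30-day month, if you prefer to measure availability in time.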
So how do we set the error budget? Of course, the target cannot be 100%: 100% can never be a reliability target for this measure, and the SREs' objective is not zero outages at all. Instead, they align with the product developers to spend the error budget on maximum feature velocity. If we run out of budget, we slow down and do more testing between releases. Error budgets therefore act as a self-regulating mechanism. When the system is working well, developers have an incentive to write strong code and launch carefully to prevent issues, and SREs get the leverage to permit change, not just enforce stability. It makes decisions based on numbers, not politics or feelings — just data. We don't want to be that sysadmin there who is always stopping things.

Both the development and the SRE teams share a single staffing pool, so for every SRE that is hired, one less developer is available, and vice versa. This ends the never-ending headcount fight between devs and ops and creates a self-policing system where developers get rewarded with more teammates if they write better code. SRE teams are staffed with developer/sysadmin hybrids who not only know how to code but also know how to find and fix problems. They interface easily with the dev team. And as code quality improves, SREs are often moved to the dev team if fewer SREs are needed on the project. In the end this creates highly motivated and effective teamwork between devs and ops, and that's our objective.

So what do monitoring and being on call look like for an SRE? I know this presentation is very theoretical — you will get the slides at the end, and I'll make sure they're available online — but these really are the concepts we are implementing at Acquia, and we think they are the best way of keeping scalable, complex, large systems working correctly. SREs take three valid kinds of monitoring output: alerts, tickets, and logging. What do we do about alerts? They need action immediately. But we don't actually keep email alerts, because when you get to a large size, if you get a lot of alerts by email you're just not going to see them. So stick with the page. Tickets: a human will need to take action eventually — not immediately, but an action will be taken. And logging: we don't take any action at the moment; we just analyze it afterwards.

While focused on operations work, SREs should receive a maximum of two events per 8-to-12-hour on-call shift. This gives the on-call engineer enough time to handle the event accurately and quickly, clean up, restore the system, and then conduct the postmortem. Use the four golden signals of monitoring — latency, traffic, errors, and saturation — in your dashboards. Expose all the data very clearly and make it easy to act on. During on-call there is pager fatigue. Who here has been on call? OK, you know the pain. Pager fatigue is a real problem: you get alerts and suddenly you just don't hear them anymore. And that's the thing we try to overcome with SRE, actually. An engineer can only react with urgency a few times a day before getting exhausted. Therefore, ideally, every page should be actionable: it should require some kind of human intelligence, something that hasn't been seen before.
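As a rough illustration of those three kinds of monitoring output — and of keeping pages only for things that need a human right now — here is a hedged sketch. The signal names, thresholds, and routing rule are invented for the example, not Acquia's real monitoring:

```python
# Routing monitoring output into pages (act now), tickets (act eventually),
# and logs (no action, keep for later analysis). Illustrative values only.
from dataclasses import dataclass

@dataclass
class Event:
    signal: str            # one of the four golden signals: latency, traffic, errors, saturation
    value: float
    threshold: float
    user_impact_now: bool  # are users hurting right now?

def route(event: Event) -> str:
    if event.value < event.threshold:
        return "log"       # below threshold: just keep it for historical correlation
    if event.user_impact_now:
        return "page"      # users are affected right now: wake someone up
    return "ticket"        # needs a human eventually, not at 3 a.m.

print(route(Event("errors", value=0.05, threshold=0.01, user_impact_now=True)))      # page
print(route(Event("saturation", value=0.85, threshold=0.80, user_impact_now=False))) # ticket
print(route(Event("latency", value=120, threshold=500, user_impact_now=False)))      # log
```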
There is a nice book I'm reading right now, from Duke Hoax, about root cause analysis, and it goes really, really deep into this problem of how to find root causes. Playbooks and runbooks greatly reduce the mean time to repair, and SREs write and rely on them for on-call. For instance, the other day there was a presentation here where someone showed Ansible playbooks, which could actually be used for this purpose. We now have runbooks, which improve the response to these alerts a lot, and I can give you examples. Concluding: an SRE monitoring and alerting pipeline should be simple and easy to reason about. Always try to have a high-level overview of the stack. Still, a few services, like databases, need to be checked on the system itself. A dashboard might also be paired with logs in order to analyze historical correlations rapidly.

Then there is release engineering, which is also a part of SRE. It deals with all the activities between regular development and delivery of the software product to the end user, and it accelerates the path from development to operations. It's staffed by seasoned SRE team members who conduct this important internal service. There are some commandments that release engineering teams use and that we should apply too. Some of these you probably already use. For instance, point eight: who here does canary releases? Canary? Anyone know what it is? Canary. Yeah, canary, sorry, it's my accent. Canary, for instance, is a very important thing to use on your system: you keep a small piece of the service that you can test, you throw the new features there, and if they explode, well, those errors go to the error budget, but you didn't blow up the whole platform. We won't go very deep into this, but we can talk about it in the questions part later. So developers, SREs, and release engineers all work together in the same group.

Now to a part of SRE that really pleases me, and probably most of you: postmortems. A postmortem here is a process, usually performed at the conclusion of an outage, that determines which elements were successful or unsuccessful, and it should be written for all significant incidents, regardless of whether anyone was paged or not. The investigation should establish what happened in detail, find all root causes of the event, and assign actions to correct the problem or to improve things for the next time. One of the biggest parts of postmortems is that they must be blameless. As in any DevOps approach, we must remember that we can't fix people, but we can fix systems — and, in the end, processes. That's what we should be pointing at, not blaming people. This is really key: it is critical that postmortems be blameless, so we can understand honestly and truthfully what happened, why the people involved did what they did, and how to make the systems more reliable, even though they have unreliable components, like disks, people, power sources, et cetera. Another purpose of postmortems is that some of them are teachable: they can give good insight into how your systems work or don't work, and how incidents are handled, and they also serve as proof that your postmortem culture takes the blameless part seriously.
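To make that a bit more concrete, here is a minimal sketch of what a blameless postmortem record could capture — the fields and the example incident are assumptions for illustration, not our internal template:

```python
# A hypothetical postmortem record: what happened, the root causes found,
# and the follow-up actions, with no field for "who is to blame".
from dataclasses import dataclass, field

@dataclass
class ActionItem:
    description: str
    owner_team: str          # a team owns the fix; we point at process, not people

@dataclass
class Postmortem:
    incident: str
    impact: str
    timeline: list[str] = field(default_factory=list)
    root_causes: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

pm = Postmortem(
    incident="Example outage on a checkout service",
    impact="Roughly 20 minutes of elevated 503s",
    timeline=["14:02 alert fired", "14:05 on-call acknowledged", "14:22 service restored"],
    root_causes=["Config push removed a cache header", "Canary stage was skipped"],
    action_items=[ActionItem("Add a config lint check to the deploy pipeline", "release-eng"),
                  ActionItem("Make the canary stage mandatory for config pushes", "SRE")],
)
print(f"{pm.incident}: {len(pm.root_causes)} root cause(s), {len(pm.action_items)} action item(s)")
```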
There is another interesting reading on this subject — if you want to take note, it will be in the slides later — from Sidney Dekker, who is a professor of human factors and flight safety at Lund University in Sweden. It goes really deep into how this affects people psychologically, and companies, of course.

So, in the end, site reliability engineering enables agility and stability. SREs use software engineering to automate themselves out of the job, like we heard before in other sessions. And my advice, if you want to implement this change in your company, is to start with the action items from Treynor: alter your training and hiring, implement error budgets, do blameless postmortems, and especially reduce your toil. This presentation was based on the Google book, so I hope it was not too tedious — the book goes much, much deeper into the subject. OK, questions? You can go to the mic now.

What you presented seems OK for larger organizations. But what if your team is very small? How does it fit if, for example, you have a team of three people to run all your IT?

So let me go back to the first slide — let's go back here, to Treynor's items. There are a bunch of things here that you can take. You can hire only coders for sysadmin and ops work. You can have service level objectives for your own service, and you should report to your management how those SLOs are doing. Those are good measures. If you have a really, really small team, error budgets may or may not be important — because if it is all the same team, who are you going to discuss error budgets with? Although even within the same team you can do that: let's not blow our SLO this month. So there are a lot of things you can do, and surely, if you have a small team, you are already sharing a common staffing pool between devs and SREs. OK, did I answer your question? Among these items, there are a lot of things that small companies can do.

Yeah, would you divide the work between people, or would you just...?

It depends on your organization. What do you actually do, in terms of...?

We have an e-commerce business, and our IT team is really small. So I'm thinking about...

So you probably have payment systems? Yeah. And those need to be up, like, 99.99999%? OK, those are really important. Yes, I would divide, especially — I would put error budgets on those. Thank you. Yeah, more questions?

Thanks, buddy. You saved me. You totally invited the question about canary deploys. It seemed like not everybody knew what canary deploys were. Why don't you explain that? And then I'm curious how you actually do that, or if you're doing that.

Yeah, OK, we can go back to that slide if that's really interesting for you guys. As you've seen, this is a very broad topic to discuss; we would probably need a full camp just to discuss SRE. But for the canary stuff, let me see my notes. OK, so imagine you have a grid of computers, and in your grid you have, like, 100 computers. What can you do to do canary? You choose a small percentage of those computers and you deploy to those. And you let some traffic go there, but you don't let it go much further if you start to see errors. Of course, at some point you have to decide where to stop, right? But this is an internal discussion you must have with your own peers, right?
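Here is a rough sketch of that canary flow: deploy to a small slice of the fleet, watch the error rate, and roll back if it climbs. The function names (deploy_to, error_rate, rollback) are placeholders for whatever your deployment tooling actually provides, not a real Acquia API:

```python
# Rough canary sketch: new version goes to a small slice of the fleet first.
import random

def canary_release(fleet: list[str], canary_fraction: float = 0.05,
                   max_error_rate: float = 0.01) -> bool:
    canary_hosts = fleet[: max(1, int(len(fleet) * canary_fraction))]
    deploy_to(canary_hosts)              # new version only on the canary slice
    observed = error_rate(canary_hosts)  # watch errors on the canary traffic
    if observed > max_error_rate:
        rollback(canary_hosts)           # errors count against the error budget,
        return False                     # but the rest of the platform is untouched
    deploy_to(fleet)                     # looks healthy: roll out to everyone
    return True

# Placeholder implementations so the sketch runs on its own.
def deploy_to(hosts): print(f"deploying to {len(hosts)} host(s)")
def error_rate(hosts): return random.uniform(0, 0.02)
def rollback(hosts): print(f"rolling back {len(hosts)} host(s)")

print(canary_release([f"web-{i}" for i in range(100)]))
```

The canary fraction and the acceptable error rate are exactly the kind of numbers that internal discussion with your peers, and the error budget, should decide.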
So canary is not very difficult to do, but it should be implemented together with the version control system, and probably with the package management system, so that you can roll back or just apply a specific version to the canary systems. Did I answer the question? We are starting to — well, we do. Actually, we do canary, not in a way that is completely reversible, but we do it, yes. Because if we wanted it to be completely reversible, our platform would have to have a versioning system where you could just undeploy. So we choose a set of computers to do that.

OK, more questions about this or other subjects? Maybe it's too much to take in. Yeah. Can you come to the microphone? OK, or I can repeat the question. Yeah. Oh, interesting. Yeah. Can you repeat the question? So your question is about the ops work overflowing to the dev team. OK. Are you on operations, or a dev? Dev, OK — that's why you asked the question, of course. So it's tricky. It's a tricky place to be in, but you need to get everybody engaged with that objective, because in the end you devs will profit from the SRE scheme. In the sense that, right now, you probably get a lot of vetoes from ops: we want to launch a feature — no, no, you cannot launch, because it's going to break this; it broke this last time, it's going to break it this time. You have a lot of arguing, right? If you follow SRE, what happens is you have the error budget, and you know exactly — you devs can actually use that error budget, and ops cannot stop you from releasing the features. Oh, yes. But if you exceed the error budget, that month, or quarter, whatever, you don't launch any more features. You do need to define the SLOs before diving into the SRE world, let's say. And another thing: if you work together with ops, they will understand your work better and agree with it more, and you will feel the ops pain and try not to add to it with more burden. So I think it's better for both sides. And actually it's not both sides — they will become one side. Yeah. More questions?

About the developers and the issues: they deliver the solution, do some RCAs. But in this particular process, providing a solution or finding the root cause is one part. Do you think the developers should also spend time on documenting the root cause analyses?

That's a great question.

It could be the most time-consuming job for the developers, probably.

So, basically, it depends: what type of documentation are you talking about?

Basically a kind of RCA where you document what the root cause of that particular issue was, providing detailed logs and so on. How detailed one should go, that is what I want to know.

Let me put it this way: wouldn't it be better, instead of just documenting how it should be fixed the next time, to just fix it? Exactly — just fix it. And if you see that it's repetitive and it will come back to haunt you, just automate it. And documentation — of course developers should write documentation, but it's not the real solution for the problems. The perfect platform is one where you just push a button, you release, and it works. That's how it's supposed to be. Yeah, but it doesn't work like that. So that is the objective, right? We should code better, not just: oh, here is a new feature, take it. Exactly.

Hi! More alerts, downtime... Yep. You want to take this one?
I'm just going to follow up on that previous comment. OK. Can you talk? It's not going to be recorded otherwise. Yeah, that's the problem. All right — so, I work with Ricardo on the ops, soon to be SRE, team at Acquia. One of the big processes we started to implement over the last year internally in ops, and started to spread out across all of engineering, is a much more formal root cause analysis process. The idea is: OK, an incident happened, and we have a template that goes over a few things. What went well? There were docs, there were tools that fixed it, the impact was reduced to a few minutes — those types of things. What didn't go so well? Someone slept through their alert because the paging system was broken, we didn't monitor for this thing, the customer was down for two hours — those types of failures. And then, where did we get lucky? A person was on shift who knew exactly what that failure mode was, even though it wasn't in the documentation, and was able to fix it — that type of thing. The idea behind a root cause analysis is that for every single thing you find that went wrong, you create an action item. You put that in your ticketing system and you prioritize it: depending on the severity, you prioritize it to be done as unplanned work in your current sprint, or you put it in the next sprint or later, depending on the potential risk of that finding staying open in your ticketing system. That way you have a very nice and tight feedback loop, so that every time a system event happens in your stack, you have a set of items that, if achieved, improve the quality of the product. That's all I wanted to say.

And I would even go further: if you have a proper SRE team, they know how to code, and they know the code that the devs produce. If they know that, you can actually accelerate instead of just waiting for the next sprint — the SREs can fix it themselves, because they don't want to be hammered by that same problem anymore. And that's why SREs exist: they automate themselves out of the job. That's the idea.
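Sketching that feedback loop: each finding that went wrong becomes an action item, and severity decides whether it lands in the current sprint as unplanned work or further out. The severity scale and the sprint buckets here are assumptions for illustration, not our actual workflow:

```python
# Hypothetical severities: 1 = current sprint (unplanned work), 2 = next sprint, 3 = backlog.
FINDINGS = [
    ("Paging system was broken; engineer slept through the alert", 1),
    ("No monitoring existed for this failure mode", 2),
    ("Runbook was missing a rollback step", 3),
]

def schedule(findings):
    """Turn every finding into a ticket and route it by severity."""
    buckets = {1: "current sprint (unplanned work)", 2: "next sprint", 3: "backlog"}
    return [(description, buckets[severity]) for description, severity in findings]

for description, bucket in schedule(FINDINGS):
    print(f"{bucket}: {description}")
```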
OK, are there any more questions? Here comes a tough question.

I'm just thinking, from a certain perspective, that a lot of what you've presented, Ricardo, is from the operations team's point of view. You're setting budgets and tolerances, you're saying on-call staff shouldn't get more than two alerts, we won't tolerate failures beyond this level, and that sort of thing. But I presume there have to be arguments made to convince the rest of the business to buy into that. It's not just the ops team standing up for themselves and saying, we're not going to take this anymore. You need to actually convince the rest of the business that it's beneficial to implement those systems.

There is one thing that responds to your question — maybe I went through it too fast. This is what you don't want to have: work that gets into your queue, and more work, and more work, and it never gets done. So that curve there — imagine you're getting more traffic and more customers and all of that, which is good. But if you're getting that and you're not fixing the errors, you're not fixing the toil, and the toil here is really the problem. The whole company should see this problem coming before it really hits them hard. And that's why SRE was implemented at Google, for instance. Treynor is actually a software engineer — he comes from dev, right? — but he ended up running an operations team. SREs are operations, yes. But they are not just operations anymore, because they were converted to do as much development as possible, not just toil. I don't know if I answered your question, but this is the point of no return that you don't want to reach, and someone inside the company must make people aware: hey, this is happening, don't go there. Yeah.

Yeah, that's exactly what this means: you're trying to solve a problem with more bodies. Is the problem going to go away? Well, shortly the tickets are going to pile up — more repetitive work, and more work that you could automate but aren't automating. That's going to increase. It's probably good for employment in the country, though.

More questions? No? OK. I would ask you, please, to go to the session's node page and leave your evaluation of the session — don't be too harsh on me, I tried. But I think we all agree that if you have an ops team, we need to change something. And that was the purpose of this session: to get us all to a new level with less burden. OK, thanks.