Let's get started. It's 2:15. Thank you for attending this session. I know it's the last session of DrupalCon, and I think the DevOps track has saved the best for last, so I think this is going to be an excellent presentation around SRE. So let's get started. We're going to do a crash course in site reliability engineering. Please evaluate the session; I'll have this slide up at the end of the presentation.

So who am I? My name is Amin Astaneh. I am the senior manager of site reliability engineering at Acquia. I've worked there for a long time: I was on the operations team from December 2010 to November 2015, and over the last year I have built out and now lead the site reliability engineering team.

So we've got an agenda. There's a lot to cover; we'll get through it and hopefully we'll have some time for questions. We're going to do a quick reminder of what SRE is, why people do it, what Acquia was experiencing before we implemented the SRE team, how Acquia does SRE, how you can build your own SRE competency, and how to hire SREs onto your team. And then we'll go over a retrospective, because doing retrospectives and postmortems is central to what an SRE does.

So what's SRE again? In a sentence from Ben Treynor, who is the person who put together this concept at Google, it's what happens when a software engineer is tasked with what used to be called operations. What he really means is the following: SREs take the manual processes associated with operations, the toil, and they replace them with automation using software engineering. They also have a set of methodologies and best practices that help engineering teams create a mature and sustainable process for owning their services, building their software, and running it.

So what does this have to do with DevOps? Because this is the DevOps track. Well, DevOps is a set of tools, values, and processes that allows teams to best deliver to their customers. SRE is a specific implementation of that, a specific form.

According to Google, there is a set of 12 SRE practices. We can go over those briefly, and if you have more questions about them, we can do it after the presentation concludes. Number one is hire only coders. That is number one for a reason: site reliability engineers need to be able to write software in order to automate processes. Two, have service level objectives for your service. You want to be able to measure your service's success, and you want to define the minimum standard the service should meet to satisfy customer expectations. You want to measure against the SLOs, and you want to report the results, usually on a dashboard. Service level objectives allow for a certain amount of unavailability. For example, 99.95% allows you 0.05% of unavailability; if you think about that in terms of a month of time, that's about 21 minutes. So 21 minutes is your error budget, and if you've spent all of your error budget on outages for that month, you can't release anymore. That's to ensure that you continue to maintain the minimum standard of service that the customer expects.
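To make that error budget arithmetic concrete, here is a minimal sketch in Python. It is illustrative only, not Acquia's or Google's actual tooling; the function names and the release-gate check are assumptions for the example.

```python
# Minimal sketch: derive a monthly error budget from an SLO target and
# decide whether releases should be gated. Illustrative only.

def error_budget_minutes(slo_target: float, days_in_month: int = 30) -> float:
    """Minutes of allowed unavailability per month for a given SLO."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1.0 - slo_target)

def can_release(downtime_minutes_so_far: float, slo_target: float) -> bool:
    """Release gate: block releases once the error budget is spent."""
    return downtime_minutes_so_far < error_budget_minutes(slo_target)

if __name__ == "__main__":
    print(f"{error_budget_minutes(0.9995):.1f} minutes")               # ~21.6 minutes for 99.95%
    print(can_release(downtime_minutes_so_far=25, slo_target=0.9995))  # False: budget is spent
```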
Have a common staffing pool for SREs and developers. If you only have budget for 10 people and your service is not so reliable, you probably need more of those people to be SREs rather than software devs. Cap operational load at 50%; that means the amount of time you spend on ops work must be 50% at most. And what happens if you exceed that? You overflow the excess work to the rest of the development team. That's to make sure that services are able to grow and to have a sustainable operational posture. You also need to share 5% of the operations work with the development team, so that they stay mindful of their operational responsibilities and of the realities of what their service does.

Let's think about on-call. On-call teams should have at least eight people in one place, or six people in each of multiple locations if it's a global service. That way you can actually do on-call sustainably. Aim for a maximum of two events per on-call shift. To be very clear, it doesn't mean that you get paged twice and then you don't get paged anymore. It means that you target two pages per on-call shift, so if you're getting more than two pages, you probably should be evaluating root causes and fixing them so that on-call is a sustainable experience. Do a postmortem for every event: when something happens, you look at the root cause and you write it up. And finally, postmortems are blameless and they focus on process and technology, not people. We're not pointing fingers here; we are looking at how the process has failed us and how the technology has failed us. So those are the 12 SRE concepts that came from Google.

Now why do SRE? Why is this a practice that makes sense? Well, the first thing is scale. Acquia had rocket-ship growth. If you were at Ricardo's presentation, he described that Acquia now has over 19,000 EC2 instances that they track, so you need process and technology that enable you to operate at that scale sustainably. You also want to be able to improve your employees' quality of life. If you have a lot to be responsible for, you don't want your engineers burning out dealing with problem after problem after problem, so that's really important to keep your employees happy. But all of these things actually map to one thing, which is the reduction of cost. And for the executives in the room, that's really what it's all about: minimizing operational cost, because retraining employees because you lost some, because they burned out, is expensive.

So let's talk about Acquia before we had the SRE process. This was circa 2013; that was the ops team. I'm in the back left over there with the neon shirt and the hat on, and Ricardo is standing right beside me with the vuvuzela, because it was 2013, and that's when you had vuvuzelas. They were a small, tight-knit team of infrastructure specialists. They owned the entire fleet. We had our own on-call rotation and we were solely responsible for the ops of the software that was running our services. And one of the big aspects of our work was responding to the customer outages and the infrastructure outages that we encountered. How many of you remember back to April of 2011 when Amazon Web Services had a big blow-up in US East? Okay, so this guy knows. I know Ricardo knows. We were there, we dealt with that, it was not fun. And that's one of the examples. And I truly believe that the root cause of this was how the development team, the engineering team, and operations related to each other. The dev team simply wrote the software. They didn't run it, they weren't on call for it, they didn't operate it, they didn't understand any of that stuff.
And then they handed the software to ops, and then ops went and released it and patched it and ran it and did incident response for it and change management for it. In some cases that led to disaster, because if the engineers don't understand the operational realities of what they're building, then sure, we have automation, we have code, we have features, but it produces a big mess for operations to clean up.

So there are some things that we tried over the past few years, before SRE became a thing, in order to try to fix ops from within rather than just reaching out to other teams and saying, hey, there's a problem. The first thing we did is we implemented Kanban for the operations team. We made our work visible, we tracked our time, and we did that so that we could maximize throughput, get as much done as possible with our current processes and tooling, and then visualize what was going on. We also tried to prioritize some internal projects, some work. We were using Scrum internally on the tier-two team, which was the senior ops group that I led, and we were building automation for the team. That had some limited success, but because of the level of interruptions that the team took on, it wasn't always possible to make that happen. And then the big thing that happened was that we started generating metrics and sharing them with senior members of the organization to influence decision making. I actually did a talk about this very subject called People Metrics in Dublin last year. There's a QR code; you can scan it, you can watch the YouTube video, I highly recommend it.

So now we have SRE. How does Acquia do it? It's a bit different from what Google does. We adhere to the core principles, but the implementation is a little different because the company is different. So to start, we commissioned Acquia SRE as the tip of the spear, the driving force behind our DevOps initiative. Over a year ago, we said, okay, we need to make internal changes to make our quality of life better and our customers' lives better, and we had the following core values. Eliminate toil: toil is repetitive, non-valuable work that does nothing for the improvement of a service. It is a core concept behind SRE, and that's why we put front and center the idea that we should eliminate toil where possible. We also believe that heroics are not appropriate, because heroics burn people out; it doesn't respect people, so no capes. Deliver with empathy: to the customer, to ourselves, to other teams. Own our service, meaning that we are responsible for the software that we write. Own our business: we are responsible for the component of the business that we serve and how it affects the overall outcome of the business. And own customer success: ultimately, the customer is what matters, and we need to think about how our actions affect them.

So there are some differences between how Acquia and Google do it. We embed engineers on engineering teams. We put SREs on those teams rather than building entire teams that take services away from a software engineering team. The entire engineering team plus the embedded SRE is expected to own their service, and the SRE is there to work with them as well as provide leadership on how to best handle the responsibility of doing that. The SRE identifies risk as part of the day to day.
They find bugs, they find root causes of failures, process gaps, lack of docs, lack of automation, and they bring those improvement opportunities directly to product management, which prioritizes the stories to make that stuff happen. We also work with engineering and product leadership every quarter to figure out where the biggest need is, where the biggest problems are, where the things with the greatest business importance are, and we put the SRE engineers in those places. We reserve the right to remove SREs from a team if things become untenable, but it hasn't happened yet. We've had a lot of excellent cooperation from the engineering and product groups; they understand why we're there, they've been working with us, and it's been great. And finally, we have a very heavy focus on time tracking to aid in toil reduction. We classify our work as project work or toil, and if toil goes over 50% we signal it.

So there was something I brought up earlier from the Google practices, which was sharing 5% of ops work with the dev team. We took it a step further: ops work is the responsibility of the dev team. There's nothing for them to hand off. They own it. And we find that that is very, very fundamental, because it puts the right incentives on the engineering team, with the SRE helping them, to build services that are sustainable to operate in the first place. So now we have a little bit of a different model. As I described before, we have an SRE who's embedded on the development team to improve the state of the service as they're developing it, and what we see and hope will happen is that the need for operational work, the need for an external infrastructure team, begins to shrink.

So that's what Acquia is doing now, but now the real meat and potatoes of this discussion comes up, which is: how do you build an SRE competency on your own? The first step is to get management buy-in. You're probably going to want the VP level, SVP level, CEO level; you want to go as high as possible. And why? Well, they need the authority to give you the authority to do two things. SRE flat out will not work as a process without two things. The first is that SREs have the authority to stop releases when the error budget has been exhausted. We were talking about gating releases on SLO performance: if your error budget is exhausted for your service, the SRE on the team should be able to say we're not releasing anymore for the rest of the month, or the rest of the quarter, or what have you. The SRE also needs the authority to overflow the ops work to the dev team when the operational load is over 50%, meaning if the SRE on the team is spending more than half their time on toil or operational work, they need to be able to say, okay, I understand you have development work in your sprint; that's stopping right now, you're helping me with these tickets, and then we're automating them away. And this has to be given by an executive responsible for the engineering and product efforts, because if not, you're not really getting the authority. And if you don't do this, do not continue, because you're wasting your time.

So how do you get buy-in? Well, in my example I sat down with Christopher Stone, Chief Products Officer, at a bar in Boston. We had a beer and I said, hey, we have a couple of things as part of the SRE initiative that I need from you. He said, oh yeah, that's cool. But actually that was the last step in the process.
What you have to do first is establish a sense of urgency, and that means providing those decision makers with the information necessary for them to say, you know what, that's a good idea, I understand it, and I'm going to give you that because I want to help you. There was a talk I did in Baltimore called Viva la Révolution: how to start a DevOps transformation at your workplace. QR code there, presentation video there, highly recommend it. It has an eight-step process for getting the change you need in your organization if you see that a change is necessary, especially around DevOps and running your services.

So you've got the buy-in, you get started, you're building the necessary competency. What do you do first? Well, the first thing you need to do is automatically measure your toil. You need to have your team track their time and somehow have that information be easily reportable. So what I've done, and this is the actual SRE dashboard as of a few days ago, is I have AWS Lambda functions that talk to a time-tracking service called Toggl. They go and gather all the information from the last four weeks, or month, and represent it in a SignalFx dashboard. SignalFx is a SaaS service that we use as a time-series database and for dashboarding. With this information, as the manager I can go and look and say, hey, there's a particular engineer who's having problems, let's look into it and make sure that the process is being followed on their team.

The second thing we implemented is a concept known as the operational responsibility assessment, otherwise known as the ORA. What it does is allow us to identify risk in the services we operate. It's based on the capability maturity model, which is something that came from the Department of Defense back in the United States; they had a process for rating things, but for us, we evaluate certain responsibilities. Routine tasks: tickets, right? Emergency response: you're getting paged. Monitoring and metrics: how do you measure your service? Capacity planning: how do you know what infrastructure you should have in the future? Change management: how do you release? New product introduction and removal: how do you tell your customer and your support organization that you're releasing new products and services so that they can actually do something about it? Service deploy and decommissioning: how do you instantiate new instances of your service? Performance and efficiency: are you actually using all those instances in Amazon, or are some of them just lying around in the corner? And finally, information security: do you have a viable, mature security posture when running your stuff? We rate each responsibility from one to five using a set of questions, and there are five levels. One, initial: it's chaotic, there's no process. Two, repeatable: you have documentation. Three, defined: there's documentation and someone responsible for doing it. Four, managed: you're measuring it. And five, optimizing: you're taking the measurements and they're informing your process improvement. These things stack on top of each other. You have to have a documented process first in order to delegate the responsibility to somebody, you need someone responsible in order to generate metrics on performance against that process, and finally you have to have the metrics in order to be informed whether or not you should change the process. So everything is stacked on top of each other.
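Going back to the time-tracking dashboard mentioned a moment ago: here is a hedged sketch of that kind of Lambda handler, illustrative only. The endpoints, environment variables, tags, and payload shapes are placeholders and assumptions, not the actual Toggl or SignalFx APIs and not Acquia's real code.

```python
# Sketch: an AWS Lambda handler that pulls tracked time, classifies it as
# toil vs. everything else, and publishes an operational-load gauge for a
# dashboard. Endpoints and payloads are placeholders, not real APIs.

import os
import requests

def handler(event, context):
    # Pull the last four weeks of time entries from the time-tracking service.
    entries = requests.get(
        os.environ["TIME_TRACKING_REPORT_URL"],  # placeholder, not the real Toggl API
        headers={"Authorization": f"Bearer {os.environ['TIME_TRACKING_TOKEN']}"},
        timeout=10,
    ).json()

    # Bucket hours into toil vs. everything else, based on how entries are tagged.
    toil = sum(e["hours"] for e in entries if "toil" in e.get("tags", []))
    total = sum(e["hours"] for e in entries) or 1.0
    load_pct = round(100.0 * toil / total, 1)

    # Publish the operational-load percentage as a gauge for the dashboard.
    requests.post(
        os.environ["METRICS_INGEST_URL"],        # placeholder, not the real SignalFx API
        headers={"X-Auth-Token": os.environ["METRICS_TOKEN"]},
        json={"gauge": [{"metric": "sre.operational_load_pct", "value": load_pct}]},
        timeout=10,
    )
    return {"operational_load_pct": load_pct}
```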
So you do this assessment; well, what do you do with it? What you do is you do it often, once a quarter. Acquia does it once a quarter; I think it's a good cadence. Once you do the assessment, you take all of the gaps and all the risks that you identify and you create issues in your queue for improvement. You publish the results, you make them public. These aren't things that you want to hide; it's a blameless process, and you share them with your organization. And this is very important: you do not tie operational responsibility assessment scores to KPIs or incentives or anything, because what that does is get engineers to work on the already mature things. You want engineers to work on the really nasty stuff, because if they're in there, they're going to start working on fixing it.

So you are probably wondering, well, how can I get a copy of this ORA so I can do it on my own stuff? There's a book called The Practice of Cloud System Administration, Volume 2, co-authored by Tom Limoncelli; for the sysadmins in the room, you probably recognize his name. Appendix A in the back of the book has the ORA process, and there's also a chapter in the book that describes it. It's wonderful, do take a look.

So we didn't just stop there. We took the operational responsibility assessment, and based off it we created something called the launch readiness criteria, which is simply a set of guidelines that represents the minimum bar for what a new service needs to have in order to launch, from an ops standpoint. We use the ORA to create the language for the launch readiness criteria, so you're not looking at two different processes: this is the ORA, and the launch readiness criteria is the minimum level. The idea behind it is that it's supposed to be a very concise list. It's supposed to address all the common forms of risk that you would see when launching services, without introducing roadblocks and slowing things down. It's a living document where engineering leadership and product leadership can go and propose changes and ratify them, in order to make sure the entire list stays relevant. And it's actually inspired by the SRE book; there is a chapter called Reliable Product Launches, and we took their process and adapted it for our own.

So here is an example section from our launch readiness criteria at Acquia. This is around service deployment and decommissioning, which means provisioning and deprovisioning instances of our hardware or services. Service deployment and decommissioning needs to meet level three at minimum, which is defined: we have documentation and owners. And we have several things: full documentation; a QA step to verify that when you do something, it actually works properly; an SLO for lead time on completing tasks, so if a customer asks for a new instance, it comes up fast, under the SLO; and the tasks must be tool-assisted, without manual effort. Very important. So that's an example section.
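To illustrate how ORA scores and LRC minimums could be compared in practice, here is a small sketch. The responsibility names come from the talk, but apart from the level-three deploy/decommission example above, the minimum levels, the scores, and the helper function are hypothetical.

```python
# Illustrative sketch: compare ORA maturity scores (1=initial .. 5=optimizing)
# against launch readiness minimums and report the gaps. Except for the
# level-three deploy/decommission example from the talk, everything here is
# hypothetical.

LRC_MINIMUMS = {
    "service deploy and decommission": 3,  # "defined", per the example section above
    "emergency response": 3,               # hypothetical minimum
    "monitoring and metrics": 3,           # hypothetical minimum
}

def launch_gaps(ora_scores):
    """Return responsibilities whose ORA score is below the LRC minimum."""
    return {
        area: minimum - ora_scores.get(area, 1)
        for area, minimum in LRC_MINIMUMS.items()
        if ora_scores.get(area, 1) < minimum
    }

if __name__ == "__main__":
    scores = {"service deploy and decommission": 2, "emergency response": 4}
    print(launch_gaps(scores))
    # {'service deploy and decommission': 1, 'monitoring and metrics': 2}
```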
So sure, I wrote this LRC stuff a few months ago and it's now a corporate standard, but I can't just run around with the checklist; that would be kind of not nice. So the next step I took recently is called enablement, which means I gave tools and processes to people to make it really easy for them to do the LRC by themselves, on their own teams. I had a service called bin.acquia.com, which is an encrypted pastebin based on the free software project PrivateBin, and I decided to go and set it up, run it, and then do the SRE special on it. So I have example service pages that people can just copy and use for themselves. I even have example service dashboards that they can copy and use, so we can see availability metrics, the error budgets (on the far right there in orange, we're almost running out of error budget), time to acknowledge for incidents, time to resolve for incidents, and latency on the bottom. So that's a dashboard that someone can use to inspire their own service dashboards, so they don't have to think about how it's done. Example code: okay, this is the application that I'm running, and here are the Lambda functions that monitor it. The Acquia engineering teams can just take it and use it, which makes it really easy for them to monitor their stuff. Example operational runbooks for the service: this is how you deploy it, this is how you do capacity planning, this is how you update the monitoring, this is how you do the meta-monitoring in CloudWatch to make sure your monitoring works. All of these things help fulfill the launch readiness criteria; teams just have to implement them themselves, but they get a good guide. We even have a postmortem and RCA template, so every time there is an incident they just make a copy, fill it out, review it with the team, file the issues that they identify, and move on.

So we have all of these tools, great. Well, the next thing we needed to do was create an onboarding process, so that when an SRE is embedded on an engineering team, they know exactly what to do and how to get started. This is our onboarding process. The first thing we do when we're embedded on a team is implement an incident response process. If they don't have a PagerDuty rotation, we set it up. If there's no documentation for people to ask for help from that team, we set it up. And we also make sure that every engineer on that team has access to production and to the runbooks needed to support the product. So first things first, we make sure that they are on call, ready to go, and that the internal organization knows how to ask them for help. Then, right off the bat, we do the ORA and we publish it. We figure out what the service level objectives should be and we publish those. Then we set up monitoring so that you are able to calculate the error budget and know if the service is where it's supposed to be. And finally you create dashboards, so you can actually see the performance, look at it every day with the team, and decide whether or not you should release.

Something that also happened at the beginning of SRE is that we created something called the weekly office hours. Every Friday at 1:30 we have a sit-down meeting. Anyone can come in from any part of the company. We have some of our regulars; we have people from marketing that sometimes arrive, people from product that sometimes come in, and they ask questions about SRE and DevOps and we give them access to Q&A. It's allowed us to slowly expand our influence in the organization, teach people about what we're talking about and what value it provides, and help others that don't have access to an SRE; they at least get access to the knowledge so they can implement it themselves.
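As an aside on the example monitoring code mentioned above: a minimal sketch of an availability probe, of the kind that could feed availability and latency panels on a dashboard, might look like the following. The URL and metric names are placeholders, not Acquia's actual code.

```python
# Sketch of a simple availability probe of the kind described above: check
# the service, record whether it was up and how long the check took. The
# URL and metric names are placeholders, not Acquia's actual code.

import time
import requests

SERVICE_URL = "https://example.internal/healthz"  # placeholder

def probe(url=SERVICE_URL):
    start = time.monotonic()
    try:
        up = requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        up = False
    latency_ms = (time.monotonic() - start) * 1000.0
    # In a real setup these datapoints would be shipped to the dashboarding
    # system to drive the availability, latency, and error budget panels.
    return {"service.up": int(up), "service.latency_ms": round(latency_ms, 1)}

if __name__ == "__main__":
    print(probe())
```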
So how do you hire SREs? Well, they said hire only coders, right? So I guess we should just go and hire software developers. That's not quite it. Well, I mean, SREs are operations people; they need to know about ops, I guess, so they can run their services. That's not quite it either. It's complicated, is really the answer, because you want someone who has the capability to write software. Ideally, you want someone who has a CS degree; that would be the ideal. They're able to use all of the software development tools that we use today and can contribute to a team, but at the same time, what wakes them up in the morning is the ops. They understand Linux, they understand networks, they understand monitoring, the tools that set up monitoring, and the concepts behind it. They understand the need for performance, configuration management, and all the wonderful things that make ops a lot easier these days. Of course they need to be willing to be on call, because the entire engineering team is expected to be on call, including the SRE, which is really, really important. They need to have knowledge of agile practices. How many of you use Scrum? Some of you. How many of you use Kanban? One; I do too. The great thing about those processes is that they give the SRE the tools necessary to say, hey, this is our sprint review and retrospective, here are some things we encountered, some sources of toil, some outages. They can use those agile rituals to inform the team as to what improvements need to take place, rather than being disruptive or having separate meetings or a separate process. They can just use the existing process to signal that change is needed.

There's also this idea that I have, known as the SRE temperament. What kind of personality is needed for an SRE to be successful? The one thing that I really see is that an ideal SRE candidate is going to be able to communicate their opinions in a way that can persuade others, meaning they have to be friendly, build a rapport with their team, and not be a jerk. As well as being data-driven: you can't make changes based on sentiment; you can only motivate change based on the facts and the data.

So let's say you want an SRE, you want to hire one, you understand that this is really cool stuff. How do you do that? How do you sell the position? Well, remember, toil is capped at 50%, so that means 50% or more of their time is going to be project work. That's pretty awesome, and it's a wonderful selling point if you're trying to get an SRE on your team. They get the authority to stop the flow of releases, once again, when the service is unreliable. There is on-call, of course, but it's a sustainable on-call because the responsibility is shared with the whole team, meaning they would only be on call maybe once every six to eight weeks. That's not terrible. And then of course, if you do get paged, the root causes of that page are tracked, prioritized, and addressed, meaning they don't have to encounter that problem again after it happens. And I truly believe that those four things create a work environment that respects people, that respects the SRE, and therefore creates a good place to work.

So, of course, like I said, SRE talks about doing retrospectives for every event, so I figured it would be appropriate for us to do a retrospective ourselves on what we've done over the last year, what we've learned, and what mistakes we've made. We'll talk a bit about those. So let's start with what went well; that's always fun. Launch readiness criteria is now a corporate standard.
I wrote it in the second, no, between the first and second quarter of 2017. I met with all the execs and said, okay, this is the minimum standard that I think we need in order to launch new services at Acquia. We debated some topics and points, but then they all agreed, and now it's a corporate standard. Every new service from now on has to use the launch readiness criteria guidelines.

Secondly, teams are performing their own postmortems. Even teams without SREs on them are starting to understand the utility of, hey, I got paged, I should probably investigate why that happened so I don't have to get paged anymore. So they're performing their own postmortems. One of the things that actually happened was that for a little while we had a Lunch and Lessons Learned event organized by one of our engineers at Acquia, Sarah Jajora. What she did was, every Wednesday at lunch, she would get an engineer to volunteer a postmortem, do a reading of it, and then take questions. What that does is create a culture of blamelessness and continuous learning, so that we learn from each other's mistakes and don't have to make them ourselves, which is really cool.

Teams are independently performing their own ORAs. I'm going to be doing an event soon called the State of Operational Readiness, which goes over all of the services at Acquia, shows their ORAs, describes what levels of maturity they're at, and provides recommendations. When I went and said, hey, we're going to be doing this soon, some teams said, oh, well, how do we do it ourselves? And they started to submit their own assessments without, again, an SRE presence, which is wonderful.

A very interesting thing happened. There is an all-engineering event that we have every year called Build Week. All the engineers in Acquia came to Boston. There was a big scheduled meeting, kind of like a BoF session, and a bunch of the cloud engineering team went and organized it and described their current challenges with how they work on the team. And SRE jumped in and started making their own statements and recommendations, and they re-orged. They just said, you know what? We're going to group in this format, around the various components of the software, like Amazon Web Services, a service-oriented architecture. So they just self-changed. Of course, they got buy-in from the organization, but SRE helped guide that discussion so that the engineers could work better and be happier about what they're doing.

More and more teams are starting to understand the need for on-call, and they're doing on-call. And there really hasn't been a whole lot of resistance about it, which is very surprising. Being an operations person, I for a long time had this idea that software engineers just hate on-call and would rather leave it to another human being. But over the past few months I've realized, no, people do take pride in their work and they want to fix the things that are broken, and it's really, really gratifying to see. And again, the weekly office hours event has been an extremely effective tool for sharing ideas. People are bringing their own agenda items. We record all of the sessions and share them with the organization internally. They've been a wellspring of information and a way to communicate our DevOps mission to the organization. And sometimes we have discovered opportunities to help each other, which is great.
So let's talk about what didn't go well, because it's a retrospective, right? So what didn't go well? We embedded SREs on teams over the past year, but there was an issue with getting the SRE fundamental practices, which are SLOs and error budgets, in place for all the services. SREs got involved on their teams, there was a bunch of unplanned work and fires that they needed to help address, so they jumped right on those and were being helpful. But the problem was that while they were doing that, they weren't doing SRE, because those fundamentals weren't in place: they didn't have SLOs, they didn't have the appropriate monitoring, and because they didn't have the appropriate monitoring, we couldn't calculate error budgets. And because we couldn't calculate error budgets, we couldn't stop releases when the service was below that minimum standard, again, defined by SLOs and error budgets. Without those tools, SRE is not really able to do their job. So that was a gap. Launch readiness is a recent process; it's too new. It would have been nice to have it years ago, so that we could set a standard for new services. We just felt that for recent launches, it would have been nice to have this so that people could plan with it and account for it.

So in response to that, there are some things that we started to do to improve the state of SRE engagements. Now, SRE engagements require the onboarding process front-loaded when you get started. So welcome to the team; you're doing these things first. And what that means is they can't be assigned any other work until that list is done. So again: making sure a team is on call and there's documentation on how to reach them; performing the ORA, publishing the results, and getting the stories in for improvements; defining SLOs; creating monitoring and alerting against those SLOs; and creating dashboards against them. Another new thing that is quite recent is that if a site reliability engineer is embedded on your team, their proportion of the work has to be operational stories, period, at minimum. So again, for the Scrum people out there, if you're doing 20 story points a sprint and there's a team of five people including the SRE, that means that four points must be operational in nature every sprint, at minimum.
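As a quick illustration of that rule (a sketch only; the numbers just mirror the example above):

```python
# Sketch of the rule above: the embedded SRE's share of the sprint must be
# operational stories, so a 5-person team doing 20 points per sprint needs
# at least 4 operational points.

def min_operational_points(sprint_points, team_size):
    """One team member's (the embedded SRE's) share of the sprint, in points."""
    return sprint_points / team_size

if __name__ == "__main__":
    print(min_operational_points(sprint_points=20, team_size=5))  # 4.0
```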
So, I actually went through the slide deck pretty fast, which is good; that means we have a lot of time for questions. But I want to leave you with this quote. It came from a discussion between me and Ricardo when we were talking about SRE: when we were in ops, it was simple, because our purpose was simply to address the incident. Our purpose now is to address the problems of the business. We are the vehicle of change. That's hard work, but we can do it. So that's it. I really appreciate your time. I know it's probably tough to sit through the last session on Thursday, but I definitely encourage questions. We have quite a bit of time. The mic's right up there, so if anyone has any questions, please ask. I'm an open book. I love to talk about this subject, so let's get started. I hope this is the one, right? It is, yeah, we can hear you. Okay, me asking a question is quite something, but okay, I'll ask the question. So in that process of convincing higher management to actually adhere to these principles, what do you think was the toughest thing to do?

I think the launch readiness thing was probably the toughest, because it was the biggest shift for people. There was this concept of, oh, there are things that I really have to pay attention to and prioritize before I go to the sales organization and say, okay, we're ready. And I think they didn't recognize the level of work required. Now, granted, I did spend a lot of time last quarter on just the enablement piece, so that makes it easier because they don't have to think so much. We even have 20 templated stories that they can clone into their backlog with everything laid out for launch readiness. But even then, they have to think about, okay, how do we do this for our service, in our context? So that's the biggest challenge right now. I think it's going to get better as time goes on, because it's going to be around for longer and longer and we're going to have organizational learning around, okay, as part of building a product there is this launch readiness process, and we have to account for it in our timelines and things like that. Thanks for the question.

Cool, anyone have any other questions? Yeah, there were 80-something slides in here and somehow I managed to get through them in 40 minutes. Yeah, so it's another Acquian asking an Acquian a question, so I'll try to represent some of you who are a little shy to ask a question. I was at a talk by Ricardo on the same topic, and the question there was: okay, I get that Google does this, and I get that a relatively big company with over 800 people, like Acquia, does this. But how do I translate that to my little Drupal dev shop that may be even 10 or fewer people? How can I adopt at least some of this and apply it to a small company?

Let's go back to the beginning and go over the 12 ideas, and we can talk about whether or not they are applicable to a small group. Okay, so hire only coders. That means that when you're building your team of 10, you have programming in the requirements for the role. I think for small groups, especially startups, that is obviously something that you need, so I think that one's pretty understandable. Having service level objectives: I think that is applicable to teams of all sizes. When you're building a company, when you're building a product, first off you want to be able to tell customers, this is the level of service that we provide. That takes the form of SLAs. SLOs are internal and don't have a contractual implication to them, but it's still: I have a metric, here's the target for the metric. SLAs are just: this is what happens when I don't meet that target for the metric. So a small team can have SLOs, and I highly recommend it, because then you know whether your service is successful or not. If you do a release and you miss your SLO, then you know your release is probably the root cause. Measuring and reporting performance: I think that makes perfect sense; if you're already setting up monitoring, you should be reporting it internally to the team. Error budgets, again: that requires the team saying, yep, if we're not meeting the SLO and we've burned through our budget, we're not releasing for the rest of the month, or quarter, or however you want to do it. That is totally applicable to a small team; it doesn't require an enterprise-grade company. The common staffing pool is a little tougher, but with a small group it's kind of a moot point, because it's a small group and everyone's going to be doing the work.
The company or the team isn't large enough to have specialization into an engineering group and an operations group. Capping SRE operational load: this is about tracking your time. When I did the People Metrics talk in Dublin, there were many questions and concerns about tracking time, because it's such a pain in the butt. It is a pain in the butt, but there are wonderful tools out there that make it easy. For those that use JIRA, there are already built-in time tracking tools. There are some SaaS products like Toggl that make it even easier, and really all you need to decide when you track your time is: is this project work or is it toil? And toil is, again, the repetitive, automatable, non-redeeming work that has no long-term benefit for your service. So if you're tracking your time and putting your work in one of two buckets, you can calculate your operational load, and then if your team starts spending more time on ops than on building product for customers, that is a signal that something's wrong and that you need to evaluate what the root causes of the toil are and prioritize the fixes, so that you can continue to do what you're paid for, which is build a product. Same deal here with a small team: that's going to happen naturally anyway, unless you hired a single ops person to do all the work for you, and if that is what you've done and they're spending all their time on the painful stuff, shame on you. So I think that is applicable as well. I think this one is also applicable: a small team would be aware of how the service is built and operated, so they would be able to share in the responsibility. This one is a little bit tougher. With a small group, you're going to be constrained by size. If you're a team of 10, wonderful; that means one week out of 10 you're on call, no problem, and that's really, really sustainable. With really teeny, tiny teams it begins to get tougher, because you're on call more often and that has an impact on you as a human being. Two events per on-call shift: so when you get paged, you're tracking your events, right? Like you're at least filing a ticket for them. Anyone? Who doesn't get paged? Oh, you guys are lucky. So maybe you should; maybe you should jump on the on-call rotation. So we aim for, and measure, how often we're getting paged per shift. By on-call shift I mean how long you're holding onto the pager; the usual thing is you hand off the pager once a week, so that means you only want to get paged twice a week. If we're getting paged more than that, that's a signal to your team and to whoever the product manager is that we've got to slow down; we need to fix these things before anything else. Do a postmortem for every event: I don't think you need to be an enterprise-level company to do a write-up on what happened. I think more and more, as part of the DevOps culture that's been spreading and having more influence in the tech sector over the years, people understand that doing write-ups and learning from our mistakes is paramount and central to learning and improving the state of things. And again, you don't need an enterprise company to have a culture of blamelessness. And I think with smaller groups it's a bit easier, because you have a rapport and a relationship with people, and you can say, okay, yes, we didn't have a doc for this; I know this person, I know he's a hard worker, you know, he's not a jerk.
Let's just make sure it doesn't happen again and figure out what we need in terms of guardrails to prevent it from happening. So again, I think small teams totally can do SRE. You're not going to build your own SRE team, but these practices are applicable to any team size. I appreciate the question. Anyone have anything? I'm definitely an open book, love to share. Going once, going twice. All right. Once again, thank you for your time. And if you run into me, definitely say hi; I have business cards and stuff like that. We are hiring: if you know anyone out in Toronto, we're hiring an SRE out there, so definitely send them my way. And yeah, that's me. It's been a pleasure. Thanks.