Welcome to Cloud Foundry for Business. Our second talk is Hannah Foxwell from Pivotal, talking about reliability engineering for humans. This talk is also pretty cool because Denise is going to be live sketchnoting it. So if you go to Denise.cfapps.io, it will pop up a Zoom channel where you can watch her sketchnote. But also, make sure you're paying attention to Hannah. We should check Denise is ready. Yeah. Hannah's not going to answer questions afterwards in a Q&A format, so if you have any questions, feel free to come up and grab her after. Great. Take it away.

Thank you so much. Hi, everyone. My name's Hannah Foxwell and I'm here to talk about reliability engineering for humans. Or to put it another way: is site reliability engineering good for you?

I want to start today by talking a little bit about why I became interested in SRE and SRE practices. Because when Google published the SRE book in 2016, this is what I heard: you don't need SRE unless you're the size of Google. And so I pretty much ignored it for the next two years, until something changed my mind.

After the SRE book was published, I noticed some ripple effects happening in the DevOps community. Previously, where we'd called our teams Ops or our teammates operators, they then became the DevOps team, and the awful job title of DevOps engineer came into being. And then I noticed that all of these people were being rebranded again, and they were becoming site reliability engineers. And I thought, okay, fair enough, naming things is hard. I'm a bit of a traditionalist and I've always believed that DevOps is not a job title, so rebranding these teams to site reliability engineers kind of made sense to me.

And that was really my experience of SRE until the beginning of 2018, when I joined Pivotal. I was on a call with PagerDuty and our CloudOps team, and we were looking into the data that PagerDuty had collected about how the CloudOps team within Pivotal were responding to incidents and managing their on-call. PagerDuty have this dashboard called Operations Health Management, and it derives a health score from all of your alerting, all of the incidents that come through from your platform to PagerDuty. It takes into account things like: how many times were you paged out of hours? How many times was your sleep disrupted? How many times was your evening at home disrupted? How many of these incidents were resolved quickly without intervention? How many of these incidents were real incidents that impacted your users?

And what the data showed, in the line at the top of the chart, was that the health score derived from all of those aspects, the human impact of those on-call routines, improved dramatically through the course of 2017. It starts just above 20 as a health score, which is really, really bad. I hope there are no folks from the CloudOps team here. Sorry, it was really, really bad in 2017. But it improved dramatically, to a health score of over 90 by the end of 2017. To summarize it more simply, the things we saw happen over the course of those 12 months were: improved team health, mean time to acknowledge incidents down, and mean time to recover from incidents down.
And so the question came up. PagerDuty were on this call with us, and they asked: what the hell did you guys do over those 12 months that made such a dramatic improvement? And part of the answer was implementing SRE practices. So, SRE: you have my attention. Maybe there's something in this. Maybe this is something I need to start researching, something I need to be reading about.

Part of my interest was genuinely the human impact of good practices, because I'm involved in a community called HumanOps. HumanOps was founded in 2016 by Server Density, and it was aimed at being a community and a movement that promotes more sustainable working practices within IT operations, because there's a very real health impact when you implement poor practices. Sleep deprivation, stress and anxiety can have a very real human impact on the people in the team. I always like to say: when you think about the health impacts of sleep deprivation, you're talking about diabetes, you're talking about cardiovascular disease, you're talking about high blood pressure. These are unacceptable side effects of supporting your platform in production. HumanOps was created to share better practices, better ways of doing things, and to try to address this problem that we have in the industry today.

One of the tenets of HumanOps is that the well-being of human operators impacts the reliability of systems. And we all instinctively know that to be true. What I'd seen of SRE up to this point made me believe that SRE practices could improve the well-being of the human operators and, as a result, improve the reliability of the systems they were supporting.

There is a lot of content available about SRE. There are three whole books. I could not hope to cover all of the information that's available about SRE in this 30-minute talk. But what I am going to do is give you the basics, the fundamentals of SRE, and look at them through a human lens. I'm not going to talk about how to improve the health of your platform, but I am going to talk about how to improve the health and well-being of the team supporting that platform.

SRE, as Google defined it, is what happens when a software engineer is tasked with what used to be called operations. And we are going through an evolution in operations because we're going through an evolution in our businesses. We are more and more dependent on software, on software services, and on the platforms that run underneath them to actually operate our businesses, to serve our users day in, day out. So getting better at this is important. It's business critical. We can't continue to do things the way we've always done them.

One of the fundamental aspects of SRE is the acceptance that failure is normal. Any system of any reasonable complexity will experience some kind of failure at some point in time. We need to change our focus from being purely preventative, pretending failure will never happen to our team, and actually embrace it as part of our normal day-to-day life at work. Failure is going to happen. That said, reliability is fundamental to whatever service you're providing, whether that's to your customers or to your colleagues internally, whether you're supporting a platform that serves your application developers. The reliability of the service you provide is fundamental to its success.
You're not going to be able to delight your users with new features if your service is down. And the velocity of delivery is increasingly important. We want to be able to experiment, we want to be able to prove the value of new features, and we want to be able to do that quickly, in a production environment. But here I am saying that reliability is fundamental. So surely we should be slowing down. Surely we should be more careful, because surely having a service available is better than nothing at all, even if it doesn't have all of the new features we're so excited about shipping. So how do we reconcile these two opposing aims? SRE gives us some tools that help us do that. I'm going to talk very briefly about SLIs, SLOs and error budgets, because these three things, for me, are the core of SRE practices.

An SLO is a service level objective, and 100% is not your objective. Not even the internet that delivers your service, through your ISP and especially via your mobile device, is 100% available. If the internet isn't 100% available, then why should you be endeavoring to be? When you think of it that way, a small amount of failure is acceptable. It should be expected. And it's that amount of acceptable failure, even if it's very small, that we want to talk about in the context of SRE. It's important because we need to be able to have sensible and realistic conversations about reliability and how much we need to invest in improving it.

SLIs, or service level indicators, are how we measure whether we're serving our users correctly. You need your SLIs to be unambiguous. You don't want to be in the middle of an incident arguing over whether or not your users care about the metrics you're observing. They need to be as user-centric as possible. If you're measuring your service level indicators and they say you're not providing the service you need to, that should be a black-and-white call. So spending some time defining your service level indicators so that they're user-centric is really important.

And then finally, the error budget. Your error budget is the inverse of your service level objective. So if you have a service level objective of, say, four nines, 99.99% availability, that means you've got a 0.01% error budget. This is the amount of acceptable failure within your system. And that's really important, because now you have something you can measure, and you can embrace failure as a normal part of your day job. You can ask: am I still within my error budget? And if I am, then that's okay, because we've had a conversation about this and agreed that 100% is not our target. And if you exceed your error budget, if you're supporting your platform and you're not able to meet your service level objective, and everybody has agreed an unambiguous way of measuring whether or not you're meeting that target, and you've observed from the data that you're not able to do it, you can have a very real conversation, a data-based conversation, about investing in more reliability, about doing things differently. Instead of it being based on feelings and emotions, you're actually saying: we've agreed that this is the level our users need, so we need to do something differently to meet our users' needs.

I'm going to go through each of these concepts in a little more detail, with some of the practices you might want to implement behind the scenes.
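To make the relationship between those three concepts concrete, here's a minimal sketch in Python, assuming a simple request-based availability SLI and a hypothetical four-nines objective; the request counts are invented for illustration.

```python
# A minimal sketch of the SLI / SLO / error budget relationship.
# The SLO target and the request counts are hypothetical, for illustration only.

slo = 0.9999  # service level objective: 99.99% of requests served successfully

# A user-centric availability SLI: good events divided by total events,
# here measured from invented request counts.
total_requests = 1_000_000
good_requests = 999_950

sli = good_requests / total_requests   # 0.99995 -> 99.995%
error_budget = 1 - slo                 # 0.0001  -> 0.01% of requests may fail
budget_spent = (total_requests - good_requests) / total_requests

print(f"SLI: {sli:.3%} against an SLO of {slo:.2%}")
print(f"Error budget: {error_budget:.2%} of requests")
print(f"Budget consumed this window: {budget_spent / error_budget:.0%}")
```

In this made-up window the team has burnt through 50% of its budget, which is exactly the kind of unambiguous, data-based number the talk argues you want in the room when you discuss slowing down versus shipping.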
So we've established that 100% availability is not your target. What is your target? I like to reason about availability, and about the error budget, in terms of time, because I think when you're getting started with SRE, time is a concept that everybody can understand. You don't have to measure your error budget in time. But when you do, you can already see some of the areas you might want to consider investing in. If your SLO is 99.99%, four nines, then your error budget over a 30-day period is 4.32 minutes. That's not really enough time for a human being to intervene when something goes wrong. It's not enough time for your pager to go off while you're in bed, for you to get up, open your laptop and log in. You're not going to resolve that incident in 4.32 minutes, or maybe you will if you're very lucky. So as soon as you get to four nines or above as a service level objective, you need to be thinking about engineering high levels of automated recovery and automated failover. This is an investment of your time, and of extra resources for resiliency in your platform. And now you can have that conversation. You can ask: is that extra nine of availability worth the extra money I need to invest in the platform to get there?

You need to agree these metrics with everyone. You need to agree unambiguous SLIs with everyone. You need to agree your service level objectives with everyone. These are not concepts that can live solely in your operations team and go no further. If your business stakeholders are unhappy, then maybe your SLIs were not defined correctly and you need to revisit them. Everyone needs to understand the importance of reliability; it's not about shipping features as quickly as possible if you're degrading the experience for your users beyond the point where your service is usable. Everyone needs to understand the error budget and how it works, because when you exceed that error budget, you need to do things differently. You either need to slow down or you need to invest in higher levels of resiliency. Everyone needs to understand these new rules.

And then, of course, sometimes failure will happen. You broke something; what now? Because we knew that failure was inevitable, and because we prepared for it, we've got good practices, and SRE advocates for a lot of really good practices around preparing for failure and dealing with it when it does happen. Rehearse how you respond to incidents. Have playbooks so that you don't have to make it up on the spot. Practice; do fire drills. And afterwards, do a blameless incident review and take that incident as a learning opportunity, so that it doesn't happen again. Then review your error budget: how much of it have we burnt through, and how much is remaining? Do we need to do something different right now so that we don't disappoint our users?

If you take nothing else away from this talk today, I want you to go away with an understanding of those three concepts as fundamental to SRE practices. You need to set your service level objectives, and they need to be realistic. You need to measure your service level indicators unambiguously and with your users in mind. And you need to enforce your error budgets. If you've agreed that there is an acceptable amount of failure, then you need to enforce that.
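For the time-based view, the arithmetic behind that 4.32-minute figure is simple enough to sketch; the list of targets below is just illustrative.

```python
# How much downtime a given SLO allows over a 30-day window.
# The 4.32-minute figure for four nines falls straight out of this arithmetic.

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

def downtime_budget_minutes(slo: float) -> float:
    """Error budget expressed as minutes of downtime per 30 days."""
    return (1 - slo) * MINUTES_PER_30_DAYS

for slo in (0.999, 0.9995, 0.9999, 0.99999):
    print(f"{slo:.3%} available -> {downtime_budget_minutes(slo):6.2f} min of failure / 30 days")

# 99.900% available ->  43.20 min of failure / 30 days
# 99.950% available ->  21.60 min of failure / 30 days
# 99.990% available ->   4.32 min of failure / 30 days  (no time for a human to respond)
# 99.999% available ->   0.43 min of failure / 30 days  (automated recovery only)
```

Each extra nine divides the budget by ten, which is why the conversation about automated failover, and its cost, starts around four nines.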
But as with everything, the only way to start doing something is to start doing it badly and practice. The only normal way to begin speaking a new language is to begin speaking it badly. So if you want to incubate these practices and processes within your ops team initially, until you feel confident about explaining them to your business stakeholders, that's okay. But I would always encourage you: these are not things that need to remain within your platform team. These are not concerns that only affect operations. These are things that impact your business and your users.

So you're just starting out with SRE practices, you've never really measured your availability in an unambiguous way before, and you don't really know what your error budget should be. That's okay. Just make a start and iterate on it. Aspirational SLOs, aspirational objectives, are okay, because you know where you're heading, and you can have good conversations about the investment needed, in both time and money and extra resources, to get yourself to the level of reliability your users expect.

However, overachieving on your objectives is less okay. Say your target was three nines, 99.9% availability, and actually you're achieving five nines, because it's a mature platform, you've fixed most of the niggles, and it's just running, really healthy. What could you be doing differently with that error budget? What could you be doing differently to save your company money? Have you over-provisioned your environment? Have you got unnecessary levels of complexity? What happens if some people in your team leave? Can they still support it at five nines of availability, or is there a lot of organizational knowledge locked up in certain individuals? It's a risk if you're overachieving on your SLOs. So I would encourage you to always use that error budget to experiment, even if it's just to run fire drills and practice your incident response.

We need to make a commitment not to over-invest in and over-complicate our platforms beyond what our businesses require. It's very easy to feel afraid of failures and incidents and outages, because that's the environment we've grown up in, where every failure has been viewed as a catastrophe. But when we do over-invest, and every single business I've worked with has over-provisioned their hardware so dramatically that it's kind of hilarious and kind of troubling when you think about the amount of redundant resources sat around in our data centers today, we not only waste time and resources and money right now, but for the duration of that platform's existence. Every time you over-complicate or over-engineer a solution, you're going to pay for it in maintenance over its life cycle, until it's decommissioned.

Best tweet of the year so far, in my opinion: a DevOps engineer walks into a bar, puts the bartender in a Docker container, puts Kubernetes behind the bar, spins up a thousand bartenders, and orders one beer. There's a reason this resonates with so many people: we love to play with the shiny new thing. We love to try new tools and technologies. We have an awful tendency towards over-engineering, and every time we do that, we're impacting our business, because the time and the resources and the money you're using to over-engineer those solutions could be spent on innovating for your users.
It could be spent on building new businesses and experimenting. So, I've talked a lot about data and metrics, and I've talked a bit about money, but I haven't talked a lot about the humans yet, so apologies for that; I'm going to get to that now.

I want to start by talking about psychological safety. Psychological safety is defined as a shared belief that a team is safe for interpersonal risk-taking. Another definition is that psychological safety is a belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns or mistakes. And it's been shown that psychological safety is incredibly important for team effectiveness. Amy Edmondson's study, based in a medical context in a hospital, found that there was a correlation between psychological safety, error rates and team performance. And in the context of delivering medical care, team performance is patient outcomes; we're talking about practices that make the difference between life and death in some cases. It was found that higher psychological safety actually correlated with higher error rates: the safer people felt, the higher the error rates within their department. However, higher error rates correlated with higher team performance and better patient outcomes, which is the inverse of the relationship you'd expect between error rates and team performance. And the reason was that the teams with low psychological safety had the same number of errors as everyone else. They were just less able to talk about them, less able to raise their concerns, less able to share their experiences and learn from each other. So the ability to talk about failure is very powerful for improving team performance, because it means you can actually have those open and honest conversations, and you can improve the safety within the environment of the team.

And Google found exactly the same thing. Google's Project Aristotle investigated, across Google, all of the different factors that affect a team's overall performance. They wanted to find the things that really differentiated the high-performing teams from the average ones. And they found the same thing: it was team psychological safety that made the difference between the high performers and everyone else.

So I have a hypothesis to share with you today. Amy Edmondson's studies, and many others since, have shown that psychological safety improves innovation within teams, increases learning from mistakes, and boosts employee engagement. We also know that SRE practices advocate for the acceptance of failure, learning from failure, and the measuring of failure so that we can make informed decisions. So I want to add an arrow in there: I actually think that SRE practices can transform teams by improving their psychological safety.

I want to talk about toil now. Does anyone know what toil is? A few people, okay. So, defining toil: toil is work that doesn't add any long-term value. If you're in operations and you're supporting a platform, there will be an amount of toil in your day job, and that's to be expected. It's work that is manual, repetitive, automatable and tactical; it has no enduring value, and it grows in line with the size of the service you're supporting. I actually spent a week shadowing the CloudOps team at Pivotal in Dublin, and it was amazing to see how the word toil had become part of their day-to-day vocabulary.
When they were completing tasks that felt repetitive, that felt like they had no enduring value, you'd hear phrases like "oh, this feels a bit toily" or "I'm a bit too good for this toily bullshit." And they'd put a story on the backlog to look at how those tasks could be automated. I've worked in teams where, honestly, the toil, the manual work associated with keeping the lights on, was not just someone's day job, it was more than their day job. I didn't have time to automate myself out of a job. I didn't even have time to eat. That's how it feels for a lot of teams, and that's what happens when you leave toil unchecked, because it will grow and grow until it becomes your full-time job.

So SRE advocates for setting a limit on the amount of toil that's acceptable in a team, and it sets that limit at 50%. Keeping your toil below 50% will ensure that you have the capacity within your team to engineer better solutions for reliability, and to automate some of the repetitive, non-value-add work. Rather than throwing hundreds of people at a problem, let's solve it with software engineering. (There's a rough sketch of how you might track that below.)

But not all toil is equal, in my opinion. I really like this quote from Joseph Bironas: "If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings." If your system requires continuous human intervention to keep it up, then you're sacrificing the health and wellbeing of your team for the availability of your system. And as I said before, the impact on people's health and wellbeing can be significant. These are not acceptable consequences of supporting your platform in production. So I like to categorize toil. We want to keep all toil below 50%, but we really want to keep an eye on toxic toil. This is the toil that hurts humans. It's the stuff that wakes you up at night. It ruins your evenings and weekends. It interrupts your work during the day so you can't concentrate. It distracts you and it stresses you out. This is the stuff that needs to be prioritized, because these are the things that will cause your team members to burn out, to suffer from stress and anxiety. This is what will cause people in your team to leave, at the end of the day.

We ran a team meeting a couple of months ago where we wanted to look at whether or not the needs of the team were being met. It's been said that all models are lies but some are useful, and I find Maslow's hierarchy of needs to be a useful model. It's not perfect, but it's useful in this context. Maslow's hierarchy talks about the needs we have as humans. Our basic, physiological needs: we need food, we need water, we need sleep. Once those needs are met, we become more concerned with our safety and security. When those are met, we look to our social needs: do I belong here? Am I part of a group? And then the really individual needs: how do I feel about myself, and am I the best version of myself that I could be? And we started talking about what aspects of our jobs impacted some of our fundamental needs as humans. Things like: am I getting enough sleep? Is my schedule sustainable? Am I afraid to fail, or do I feel safe to speak up when I don't know something? Do I spend enough time with my friends and family? Do I enjoy working with my coworkers? Do I feel like I'm good at my job? Do I have the opportunity to be good at my job?
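Coming back to the toil cap for a moment, here's that rough sketch: one hypothetical way of tracking a week's work against the 50% limit, with toxic toil flagged separately. The task names, hours and categories are all invented for illustration.

```python
# A hypothetical sketch of tracking toil against the 50% cap discussed above.
# Task names and hours are invented; "toxic" marks work that hurts humans
# (out-of-hours pages, disrupted evenings and weekends).
from dataclasses import dataclass

@dataclass
class WorkItem:
    name: str
    hours: float
    is_toil: bool = False   # manual, repetitive, automatable, no enduring value
    is_toxic: bool = False  # disrupted sleep, evenings, weekends

week = [
    WorkItem("automate certificate rotation", 12.0),
    WorkItem("manually rotate expiring certs", 6.0, is_toil=True),
    WorkItem("3am disk-full page", 2.0, is_toil=True, is_toxic=True),
    WorkItem("blameless incident review", 3.0),
    WorkItem("hand-patching VMs", 9.0, is_toil=True),
]

total = sum(w.hours for w in week)
toil = sum(w.hours for w in week if w.is_toil)
toxic = sum(w.hours for w in week if w.is_toxic)

print(f"Toil: {toil / total:.0%} of the week (cap: 50%)")
print(f"Toxic toil: {toxic / total:.0%}, prioritize automating this first")
if toil / total > 0.5:
    print("Over the cap: put automation stories on the backlog.")
```

In this invented week the team is at 53% toil, just over the cap, which is the signal to stop feeding the machine and start writing automation stories instead.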
Think about things like toil: it's an incredibly frustrating situation for a talented engineer to be in, doing repetitive manual work when they could be doing so much more. And one of the most interesting things was that the team identified the ability to continue learning as one of the top-tier needs. That's what makes them feel they're being the best version of themselves: the ability to continue learning and to continue growing.

So this is my second hypothesis: SRE practices can transform teams by meeting employees' needs, starting with the physiological ones. Are we looking at the toxic toil that's waking people up at night? Are we addressing it with engineering solutions? Do we feel safe to fail? Do we know that we'll be supported when failure inevitably happens, or will we be blamed? Do we have enough time to spend with our friends and family, and do we feel like we can be open and honest with our colleagues? Do we belong here? Am I getting better at my job? Do I have the ability and the time to be good at my job? And can I continue learning? Do I have enough time in my day to try new things and experiment?

At the beginning of this talk, I asked the question: is SRE good for you? And obviously, I think so. But the SRE book was only published in 2016, and there are so many teams out there that are only just starting to experiment with these concepts. So I want to hear from you on your SRE journey. I want to know how you're implementing your SLOs, what difference your error budget made to your team, how you went about building a blameless culture from where you are today. Because I think over the next few years we're going to see a lot more teams become a lot healthier as a result of these practices, and I would love to share those stories.

Thank you, everybody, for coming. As Molly said, I am going to be around all afternoon. I will be hanging out outside this room, and at lunchtime, to take some questions. But I really appreciate you coming. Thank you very much. Thank you.