Today, we are very glad to have Russell Miles with us, and he's going to walk us through a very interesting session on developing a culture of resilience and reliability through chaos. I'm sure everyone is looking forward to hearing his perspectives on this one. Over to you, Russell.

Thank you so much, and thank you for coming to this session today. This is one of the last days of the conference, so I'm hoping everyone's mind is ready to be challenged a bit, ready for something fresh. I know what it's like at these long conferences: you get to day one or two or three and you start to go, wow, my brain is exhausted now. Hopefully I'm going to be gentle with you today, and introduce some interesting, I hope, things to think about. I'm really excited to deliver this particular talk, because the journey I'm going to encourage you to consider going on in your organizations, inside your teams, is one that I've been on with several organizations, some very large, quite recently, over the last five or six years or so. It's also a journey I'm about to start with a new group, a new team I'm working with at Segovia Technology, Crown Agents Bank. So I'm recommending you consider something that I'm doing as well, which at least feels fairly empirical and practical.

Okay, I'm not going to assume anyone knows who I am. I do a lot of speaking in America and throughout Europe, but I'll assume you're all fairly clueless as to what I'm about. So, to introduce myself quickly: I've written a number of books on technology, various coding practices, and chaos engineering. My approach to this industry is strongly focused on engineering practices, on how great engineering teams work together really well. And I have a specific lens I look through, which is that we should be constructing systems of engineering, delivery, and product that are humanistic in nature. What I mean by that is that I tend to put the human first. Whenever I'm looking at a system, I'm engineering it so that the people involved, whether they be the users, the customers, operations, the development and engineering teams, product, the business, all of the stakeholders within the associated system, come first, and I ask: how can we help those people thrive? That has guided me pretty well across all the products and teams I've worked with over the last 25 to 30 years; it's been a common thread through it all. So, just to give you my perspective on things: everything I talk about today is about helping people thrive within the engineering systems we construct.

That's what I do in the day job. This is what I don't do in the day job: I do lots of tours around the world on a motorcycle, at least I did before everything changed. This is me in Tibet. I mention it because it actually relates to some of the things I know about engineering, and to some of the challenges we face when engineering teams and groups begin to change. I think this is one of the best analogies I've ever seen for a software development roadmap, although the real thing features slightly fewer pitfalls and slightly fewer opportunities for utter failure. This is an actual road in the Himalayas in Tibet, winding up and down. And what you can't quite see in this picture, which is interesting to the topic of today
as we start to talk about resilience, is that some pieces of the road surface have been removed around some of those corners. There's a lot of flood water, and a lot of the surface gets washed away anyway, so what the authorities have done, for the sake of safety, is deliberately remove the surface around those corners, and that slows everyone down. The problem is, they've then taken corners that were already reasonably dangerous to a motorcycle and gone, well, we need to flag up that these corners are dangerous. So what they've done is put rocks around these pieces of missing tarmac, quite large boulders in fact, so that everyone knows they mustn't ride on those pieces of the road. In other words, for the purposes of safety, they've turned a rather dangerous corner into a lethal corner.

This is interesting to the topic of today because resilience is about being better at dealing with surprises, about embracing and thriving on unknowns. When you first hit these roads, these are unknowns that come along; unfortunately, they're unknowns with every potential to kill you. So what we're going to talk about today is how we can develop resilience without creating corners in the road that will kill people. We're going to look at how we practice resilience, how we develop it as an organizational muscle with a team-based approach, in the service of: let's not create roads that might kill people, let's create roads that help people get better at riding the roads.

As I said, I've written some books out there, mostly on chaos engineering. Chaos engineering is a key part of developing resilience, and I'll talk a lot about it in a bit. The latest book I've contributed to was curated by Casey Rosenthal and Nora Jones, who pretty much coined the phrase "chaos engineering" while they were working at Netflix. I was very pleased to be able to offer my input, which was mostly around the need for chaos engineering and resilience development to be open source, because it's a service to the world. I'll say a little more about why that's important later in the morning.

Okay, so this is the grand statement for this talk: we're going to talk about developing culture. Specifically, how we develop a culture, a thinking approach, a way of doing things around here, that helps us achieve better reliability and security. Those are the outcomes we get from the system, and the way we achieve those outcomes is by developing this muscle of resilience. If you think of resilience literally as a muscle, we can develop it with several different gym instruments, and chaos engineering is one of them.

So, why do we want to do this at all? I always think that's an important question when you're going to ask people to consider changing, amending, or evolving their culture ever so slightly. When you ask people to do that, "why should we even bother?" is an absolutely appropriate question, because you already have a culture. I assume you're working in organizations where engineering is happening and things are being shipped. So why would we even consider changing at all? The framing for this, I think, comes out of an excellent book that I'm sure people at this sort of conference will have encountered, which is Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim.
This is an incredible book, because for me the big takeaway was that it addresses a false dichotomy: the idea that when we're constructing engineering teams and engineering delivery systems, we have to make a trade-off between speed and reliability, or, as I sometimes say, speed and security. That trade-off turns out to be a false statement; it's just not what you see when you look at the numbers. These are numbers taken from that amazing book (if you haven't read it, you must go and catch up). When you look at great-performing teams, they deploy very, very often. That's the first important stat: high performers deploy very frequently; continuous delivery is real, possible, and happening. Then when you look at change lead time, how long a change sits in the funnel before it's worked on and delivered, for high performers, again, change doesn't sit still for long. Because they're delivering small increments, low risk, very frequently and very quickly, their change lead time is low. So, two stats so far for great-performing engineering teams: their deployment frequency is through the roof, and they have a very low change lead time.

Okay. Then, what does this mean for reliability? This is where you could be forgiven for assuming that if we're deploying frequently and our change lead time is very low, we must be paying for that in reliability, right? Our speed, our velocity, is high; our throughput of change through the system is high; surely we're trading that off against reliability. But what Accelerate points out, what the crew who wrote it found through the scientific, evidence-gathering, data-crunching approach they applied, is that the high performers also experience better reliability. The mean time to recovery for the high performers in this particular chart is extremely low. They recover far quicker than the organizations that are not deploying frequently and that have long change lead times, whose mean time to recovery is much longer and trending in the wrong direction.

That matters, particularly if your company's differentiator is delivering systems that are reliable alongside the need to innovate, which is most people: you need to be able to deliver things that people want, which means you're innovating, you're agile, you're doing things the right way as quickly as possible. Reliability gains from being agile and fast. You're able to recover much quicker than your counterparts who, thinking it's a trade-off between speed and reliability, slow down their delivery because they believe they're buying better reliability. That turns out just not to be the case. The faster and more frequently you deliver, the more frequently and quickly you get change through your funnel of engineering work, the more likely you are to build systems that can recover quickly from the issues you encounter. And that leads us to the final chart here, where we can see the stability of a high-performing system: the system is genuinely more stable, because we're able to recover faster, because we're delivering more frequently, and because our change lead time is low.
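To make those three measurements concrete, here is a minimal sketch of how you might compute them from your own delivery records. The data structures and numbers are purely illustrative (they're not from the book or this talk); the point is simply that deployment frequency, change lead time, and mean time to recovery are all straightforward to derive once you record commit, deploy, and incident timestamps.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical records: (commit time, time that commit reached production)
deploys = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 11, 30)),
    (datetime(2024, 5, 2, 14, 0), datetime(2024, 5, 2, 15, 5)),
    (datetime(2024, 5, 4, 8, 0), datetime(2024, 5, 4, 9, 15)),
]
# Hypothetical incident records: (detected, recovered)
incidents = [(datetime(2024, 5, 3, 3, 0), datetime(2024, 5, 3, 3, 40))]

window = timedelta(days=7)

# Deployment frequency: deploys per day over the window.
deploy_frequency = len(deploys) / window.days

# Change lead time: how long a change sits in the funnel before it is live.
lead_time = median(deployed - committed for committed, deployed in deploys)

# Mean time to recovery: average time from detection to recovery.
mttr = sum((end - start for start, end in incidents), timedelta()) / len(incidents)

print(f"{deploy_frequency:.2f} deploys/day, lead time {lead_time}, MTTR {mttr}")
```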
Okay, so this leads us to the conclusion that it's not speed versus reliability, it's speed and reliability. But at this point I often get a complaint back, an argument that says, for example: we have reliability problems in our system, we need to go slower to pay back that debt, to invest in our observability, to invest in this, invest in that. And there is some argument to that, but it's a type of slowness you need to be aware of. You might decide to invest in your observability, in your ability to debug a system in production, specifically because you want to improve your mean time to recovery. There's nothing wrong with that, but it doesn't have to slow down your ability to change the system frequently. The trade-off I'm always sensitive to is when I hear the phrase "we're going to deliver less frequently because we want to be more reliable." That phrase means someone believes something that has now been shown not to be the case.

So, in a world where speed and reliability are qualities that, perhaps surprisingly, go hand in hand, the question switches to: how do we get this? If we've not seen this in the past, and I've worked on many systems where the faster we went the more unreliable things seemed to be, what are these teams doing differently, such that speed and reliability is the trend they experience? They must be doing something differently. What do we need to do and think about differently to go on the same journey as those teams? And that leads us to culture.

Okay, and culture is a very blurry term, right, everyone? Simon Wardley talks about this; if you haven't followed Simon Wardley on Twitter or experienced his talks before, you need to, he's absolutely fantastic. One of the things he points out is that culture is one of those blurry concepts that's really very badly defined; there are lots of different nuances to what people think culture is. So I simplify it down, and I define it in two ways. The first is: how do we do things around here? What are the behaviors and motivations exhibited by how people work around a particular locus? When I go into an organization, "how they do things around here" is the most evident, evidence-driven measurement of what the culture is: we do things this way. But culture is also accessible by digging under the surface and asking: what do you believe about the system? Why are you doing things this way? As an example relevant to what we're talking about today: if you believe that by deploying less frequently you'll be more reliable, then you will change the way you behave. You'll start saying things like, maybe we release once a month, once every six months, once a year, believing that's buying you reliability, maybe even buying you security. But when you realize that belief is not founded on evidence, then we need to adjust the culture: adjust what you believe, adjust the practices you apply and the processes around those practices, and then, therefore, adjust the behaviors and habits of what's going on. That's why I start from "we need to adjust the culture": because I need to help you adjust the way you think about your systems in order to realize that there are better ways and different practices to apply.
So the reason it's really important to look at things from a cultural perspective is to encourage behavior change. One of the things you notice about these extremely high-performing teams is that they value learning from incidents, how they get better from crises. And interestingly, they actually characterize these things differently. I talk to a lot of organizations and I'll ask, do you do incident analysis? Some people use the phrase "post-mortems" for crises and problems: when you have a reliability issue, what does your output look like once everybody's been involved? Most organizations legitimately say they do post-mortems. But the interesting question is the next one I ask: okay, when you do a post-mortem, who reads it? What do you learn from it? How do you turn that into actionable learning items? Who gets to see it? To me, the measurement of learning actually happening from incidents is who's interested in what you might learn. In the great organizations I've worked with, people right at the top of the organization are interested in what is learned from incidents.

Now, that scares some teams I've worked with. They say, we don't really want to show all our dirty laundry to our managers, because our managers are going to ask, why didn't you get this right? So I help them with that; I help them think slightly differently about incidents. When somebody turns around and says, "why didn't we anticipate that incident?", it sounds like a rational question. It sounds legitimate. If you were personally involved in that incident, it makes you ask yourself, why didn't we avoid it? But the problem with that question is that it's a form of what we call hindsight bias: why didn't we know that was going to happen? It only sounds rational because "incident" is actually the wrong word for what we're talking about. An incident, a crisis, a reliability problem with a system, is a surprise. We didn't design the system to fail like that; we clearly didn't consider the cases that led to that failure, otherwise we wouldn't have had it. If you look at an incident through a different lens, if you play it forward, what happened, what did we know at different points in time, how did we react in real time, you realize these things are surprises. So when someone asks you "why didn't we anticipate that incident, why didn't we prevent it?", switch the word "incident" to "surprise" and you realize the question doesn't make sense. Why didn't we anticipate that surprise? Because it was a surprise. Why weren't we entirely prepared for that surprise? Because it was a surprise. Just that little switch in terminology can help people understand that systems are going to surprise us. We work in sufficiently complex systems that surprises become the norm, and yet we can still build systems that are reliable.

So now we're starting to get to the heart of what those teams actually do differently: they learn from incidents, they learn from surprises, they embrace the fact that surprises happen, and they realize they can deploy fast, with very low lead time on change, and also be more reliable. Okay, one hint in all this, because most of what we do in engineering is not new.
And that's the case with almost everything: every new phrase I've encountered, every rebranding, exists because we actually had the right ideas some time ago but forgot about them, or they were branded differently. So one thing I'd encourage everyone to look at is the Recovery-Oriented Computing project; you can research it online. A lot of what we talk about now comes, at least philosophically, from that project. And one of its key findings is that failures happen. You can do everything right: the right observability, the right high-availability strategies, the right circuit breakers, all the strategies you can apply to make something robust and reliable. You can do all those things, and failure will still happen. That's the Recovery-Oriented Computing project's baseline assumption, that's where it starts from. And it turns out to very much be the case these days that ultimate prevention, designing the problem out, is not possible; failures are utterly inevitable in what we build.

So we need to think differently, we need to adopt a different culture, and that's what we're seeing with those high-performing teams: they have embraced the fact that, yes, they want low lead time on change, they want continuous delivery, and they also really know that failures are going to happen. So how do they get better at that? The temptation you see across the industry is to try to model these things better: can we just think about things harder? But, as I said before, you can do everything right and the system will still encounter conditions you didn't anticipate, because these things are surprises. Modeling and analysis don't help us with surprises, because there will always be the surprise that the modeling and analysis didn't cover. This comes from a maxim, if you like, of this thinking: a priori prediction of failure modes is not possible. You cannot sit there and formally prove that the system will never fail.

Let's quickly look at why that is. One way of looking at it is that human action is a major source of system failure, which is a truism. But what's interesting is that it's not sloppy human thinking that's the major source of failure: humans in the system try to do the right thing, and at any given moment they do it with imperfect knowledge. That's what we're good at, reasoning with imperfect knowledge; it's a human trait, and computer systems are usually not so good at it yet. But it means we construct actions with imperfect knowledge, which means sometimes we get it wrong. So we're a big part of how system failure occurs, and that's something to factor in.

So resilience is the quality of our ability, as a team, as an organization, to say: failure will happen, surprises will happen, can we get better at surprises? That's what resilience is all about. Robustness is where we drive change into a system because of the things we know are going to happen, the things we can anticipate in their fullest sense. Resilience is about what we do to be better at the surprises. And it's a cultural change, a mindset and cultural change. Okay, that covers developing culture; let's move on to how we then make that change happen in an organization. What I tend to do is help people realize the "why" of resilience: the values are that failure is going to happen.
Issues are going to arise; we need to get better at surprises. That's why we do resilience. Once we know why we're doing it, the stages we go through to develop that resilience are pretty straightforward. First, we need to be able to observe our resilience, our reliability, our security, those outcomes. Can we observe them? Can we then change the rules of the game, the principles, to be better at those things? These are the rules we can apply that guide our behavior, our processes, our actions on a daily basis, so that we can get better at developing our resilience. This is a classic learning loop. And at the heart of it, something that's often missed, is the practice of resilience. It's like going to the gym. When you're developing your muscles in a gym, and I have to admit I'm a techie, so I don't do this too frequently, I think I go to the gym once a year, usually out of guilt, but if you're actually using a gym for the right reasons, you go there on a regular basis, because you know your muscles benefit from continual stress. Your heart benefits from being continually reminded of its ability to handle stress. You stress your system in the gym so that you get healthier; the system adapts to the stresses of the gym. That's practicing, and the same approach applies to resilience. Resilience is just another muscle, an organizational muscle; it's the muscle that helps us be better at dealing with surprises. So we need to practice, we need to stress that muscle by going around this learning loop. It's not a one-off, pay-it-down-once situation; we don't go, "Right, last week we handled resilience, now we're good." Resilience is something embedded in how we think and how we work, and it's something we practice on an ongoing basis.

Okay, so reliability is one of the outcomes people pursue resilience for: resilience is the investment we make, and reliability is one of the outcomes, just like security. So let me quickly define reliability, to make sure we all understand what I mean when I say "reliable". Reliable doesn't mean the system isn't experiencing failures. It doesn't mean the system is immune to parts of it being unhealthy, as we'd define that as engineers. Reliability, as my working definition goes, is a measurement of a consumer's experience with the system you provide. Quite literally: can they rely on your system, when they need it, to do the things they rely on it to do? Can they rely on it not to break their trust? Can they rely on it to provide the functionality it claims? And can they rely on it to be there when they need it? So you can see that my perspective on reliability is almost completely focused on the experience the consumer, the user, the customer has with your system. It doesn't matter if you're creating a system inside a company; that's entirely fine. You have consumers, you have users, and their experience is the key to whether you're reliable or not. So yes, reliability, and security for that matter, as far as I'm concerned, is measured from the consumer's, the user's, the customer's perspective. And that's completely different from "is the system healthy?" System health is what we, as the responsible parties who look after and evolve a system, care about: is it healthy?
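As a minimal sketch of what "measured from the consumer's perspective" can look like in practice, here is one way to express a service-level indicator over request records. The record fields, the "good event" rule, and the 300 ms threshold are all hypothetical, chosen just to illustrate that the measurement is about what consumers experienced, not about internal component health.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status returned to the consumer
    latency_ms: float  # latency the consumer actually experienced

def availability_sli(requests: list[Request], latency_slo_ms: float = 300.0) -> float:
    """Fraction of requests that were 'good' from the consumer's point of view:
    a successful response, delivered fast enough to be usable."""
    good = sum(1 for r in requests if r.status < 500 and r.latency_ms <= latency_slo_ms)
    return good / len(requests) if requests else 1.0

requests = [Request(200, 120.0), Request(200, 950.0), Request(503, 40.0)]
print(f"SLI: {availability_sli(requests):.3f}")  # 1 of 3 requests was 'good'
```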
Do we need to address it? Do we need to react to anything? But system health, for me, is not the point. It's related, perhaps, to how users experience the system, but aiming for the ultimately healthy system is not the goal, particularly if what we're really trying to do is make the right trade-offs so that users get an experience they value in terms of reliability.

Okay, resilience is the key to what we do differently: the cultural change we adopt in order to deliver better reliability and better security. And the reason we have to do it is captured nicely by this picture. This is what production looks like whenever I imagine production. Some people say production is a wonderful place and you should take your kids there; I like to say I don't hate my kids that much. Production, for me, is a hostile environment. It's the place where all of your assumptions about how the system is going to play out are least likely to come true. Production is under stresses you just don't have anywhere else, including the mental stresses its operators are under, which don't happen anywhere else. So I think this is a nice picture for production and for understanding why surprises happen. In the foreground is what we anticipated: a nice, organized system, two fields, nice fences, nice borders. We've done everything; we've done modularity; everything's right. And in the background, the tornado is your users. It's your cloud provider. It's the updates being made by a standards group to Kubernetes. It's all the unanticipated, turbulent behavior your system is about to experience. And it's because of that turbulence, which by definition is unpredictable, that our assumptions don't play out the way we think they will. We can model and analyze and define the perfect fields, the perfect production, but that turbulence is going to collide with whatever we've built.

One model I find helpful for exploring this, and I'm abusing it to some degree, I suppose, is Cynefin, a wonderful thinking framework that I've used a number of times. The way I like to navigate it is this: when you're building most modern engineered, software-driven systems, you're not creating obvious systems. There are usually enough moving parts, once you consider the people, the practices, the processes, everything that surrounds the system, all the different cloud providers, that you end up with something that isn't what we'd call a simple, obvious system. Which is a shame, because in that sort of system we can have best practices; I, as a consultant, could turn around to you and go, right, do this, do this, do this, and you'll be better. But that's not what we deal with. The bad news is, we don't even deal with merely complicated systems. The systems I work with all the time are usually over here: they're distributed, they have external dependencies, and we end up with complex. And when I add in the fact that most systems I work with have to evolve quickly, and do evolve quickly, because we deliver frequently, we end up with the potential for those systems to be chaotic, which means turbulence happens. And there's a lovely phrase coined in the STELLA Report for the kind of problem we then experience, which is this one here, let's find it: dark debt.
In any sufficiently complex system, which most of us work in and around (I always use the phrase: if you're not working on Hello World, you're probably working with a complex system), you have this thing called dark debt. Dark debt is the evil twin of technical debt. Technical debt is created consciously: we know we're doing it, we know we're cutting corners, we know we're not doing everything the way we normally would, and when we create it, we're constructing a debt we'll have to pay back, hopefully at a close point in the future. Technical debt is debt with a light shone on it. Dark debt is the opposite. It's hiding under the table; it's all the stuff that's ready to surprise us. And, as I mentioned earlier, trying to model or design dark debt out just cannot be done. By its nature, it's in the system, and it is going to surprise you. So we need to invest in something different; we need to do things differently in order to accommodate dark debt. And you know you've got it, right? If you've had incidents, if you've had system failures, then you know you have dark debt, because those things happened. Dark debt can hide not just in the technical parts of the system, but also in the behaviors, the policies, the runbooks. All of the processes and human engagements with the system are part of the socio-technical system, and so they can hide dark debt too. So we need a mechanism for working with dark debt, and that's where resilience comes in.

I like this phrase as well, a little tagline I use when people say, well, we've had lots of incidents, but I'm sure we're better now: hope is not an option, right? How do we know we're better than we used to be? How do we know we're more resilient, better at surprises? How do we know we've even overcome the surprises we've had in the past? Hope alone is not an option. Driving more design thought into it isn't necessarily going to make us any better either. And, perhaps a little counter-intuitively, pure reaction is not an option. Yes, we can learn from incidents, we can learn from surprises, but that's a really expensive way to learn. The analogy I have for that is learning to drive a car: if your very first day of learning to drive is on a freeway moving at 90 miles an hour, or whatever is legal in your country, that's a really dangerous way to learn. You'll learn, or you'll die. Back to my analogy of the roads in Tibet from the very beginning: we need to make systems safe so that reaction to massive problems isn't the only way we learn, because resilience requires us to learn from these things.

What that leads to is that we probably need a different way of working, so that reacting to crises isn't the only way we get better. That means we need to get proactive, and this is where we start to lean towards chaos engineering. Chaos engineering is a proactive way of exploring turbulence in a system before the crisis. There's a lovely phrase a friend of mine uses: if you're only learning from crises, that's good, you're learning from them at least, but it means you're choosing to learn at the most stressful, painful points. You only ever learn when it's stressful, and that seems inhumane. So it's good that we learn from incidents, but if that's your only strategy, that's pretty harsh. What we do with chaos engineering in particular, which is one tool in the resilience engineer's toolbox, is start to ask: okay, can we choose when we learn?
Can we choose to learn on a Tuesday at midday rather than on a Sunday morning at 5am? With chaos engineering, although it looks like we're causing crises, small-scale crises, to learn from them and develop our resilience, the key factor is that we're choosing when we learn, and we're choosing to have everybody around when we learn. Crises don't give us that particular help; they tend to happen when no one's around, so you learn in the harshest possible way. So this is the heart of resilience: we develop that resilience muscle by practicing what dark debt can look like, what turbulence can look like, in our system. It's an investment we make. It's something we culturally need to think differently about. It's a practice, something we do constantly alongside everything else. And the reason we do it is that we want to build more reliable systems in a world where dark debt and complexity are the norm.

Actually, I often get a complaint back at this point, where people say: wouldn't it be easier if we just simplified the systems? Well, this is where I recommend you look very carefully at the thinking in Cynefin, because Cynefin isn't saying we're accidentally creating complex systems. It says these are essentially complex systems: the systems need to be that complex to do their job. If we simplified the system further, we'd end up with an oversimplified system, and the definition of oversimplified is that it doesn't do its job anymore. So what we're talking about here is not the accidental complexity that sometimes does happen; we can work on that, we should put the effort into removing it. But when your system is down to its essential complexity, which most production systems are, you realize we need to work with it and within it. And that's what resilience helps us do.

Now, how do we get started? (Sorry to interrupt, Russell. We have five minutes left, just wanted to let you know.) Okay, I'm going to speed through these things. Thank you very much, by the way. So I'm going to leave these with you as things you can do to develop this. This is the journey I'm about to go on with the team I'm with now, and it's the journey I've been on with lots of teams up to this point. There are four capacities to develop. You need to ask yourself: how good are you at anticipating, synchronizing, responding to, and learning from what happens in your system when things go wrong? Okay, that's number one; those are the four capacities you want to develop. Then, to develop those capacities, and we'll go at speed through this bit, we get proactive by investing in seven proactive processes. The first: how are we defining reliability? Can we define it? Can we observe it? To give a quick example of defining: this is where SLOs come in, in SRE. We do SLOs, and some of these other aspects of SRE, because they give us the tools to describe what "good" should look like. Then, as we invest in resilience and those four capacities, these seven capabilities help us know that we're actually doing these things better. These metrics are important for defining reliability because they give you the ability to make decisions; they give you a handle on reliability, on what it really means for your system. They help you define what matters.
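To make "defining reliability" concrete, here is a minimal sketch of turning a measured SLI into an SLO decision via an error budget. The target, the measured value, and the policy are illustrative assumptions, not figures from the talk; the point is that a defined objective gives you a decision-making tool.

```python
# Illustrative numbers only: an SLO target and an SLI measured over a window.
SLO_TARGET = 0.999     # "99.9% of requests are good" over, say, 30 days
measured_sli = 0.9996  # e.g. from availability_sli() in the earlier sketch

error_budget = 1.0 - SLO_TARGET                    # fraction of bad events allowed
budget_spent = (1.0 - measured_sli) / error_budget # share of that allowance used

if budget_spent >= 1.0:
    print("Error budget exhausted: prioritise reliability work over features.")
else:
    print(f"{budget_spent:.0%} of the error budget spent; keep shipping.")
```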
And so if you haven't invested in SLOs and SLIs: I don't mind whether you're doing SRE or not, SLOs and SLIs are a general tool for getting a proactive handle on reliability. That's why the "define" step is so important when you're working on these capabilities that underpin the four capacities of resilience. You need to be able to define something first, and SLOs and SLIs are the key to defining your reliability and giving you the tools to make good decisions about whether you're getting better or worse.

You then need to be able to observe it. Can everyone observe your SLOs and SLIs? I've had long arguments with people who say SLOs are something we should hide inside the teams. They are not. Your system is not an island: you're always depending on others and always servicing others, so making these SLOs as openly available as possible is key. Don't hide them away in a dashboard somewhere, or bury them in your observability system; those are regrettable patterns. Make SLOs, and the trends on them, something you're proud of and can share with others. Your code and your systems are not islands.

Then we get to the chaos stuff, and this is where I'll bring things towards a close: explore, fix, and verify. Here, what we're doing is being proactive. This is where we start to create chaos engineering experiments and game days, because, since we can observe reliability, we can now explore, within that framework, how good we are at anticipating, synchronizing, responding, and then learning, and how we get better at those things. Proactively exploring weaknesses in your system: this is chaos engineering. This is where we say, okay, we're going to construct experiments proactively, because we're going to start to throw a light on our dark debt, to see how good we are at being resilient. Are we better at anticipating these problems? Are we better at synchronizing and responding? Can we take all of that and learn from it?

Okay, the learning loop you get with chaos engineering looks like this. The main thing I want to point out is that it looks very much like learning from incidents. It is very much like learning from incidents, but with us choosing the incidents themselves and choosing when we learn from them. That's why chaos engineering is so powerful as an exercise for the muscle of resilience. And, as I mentioned before, it needs to be done across everything you have, from the people, through the processes, right down to the infrastructure. Dark debt can hide anywhere, so taking a holistic approach to the experiments you create is important.

Okay, let's move quickly into the next piece. Once you've explored, you can begin to improve. Under a resilience approach, you don't begin by improving reliability first; you improve it in the light of the dark debt you've discovered. Then you can apply things like circuit breakers, you can apply better strategies, you can prove that your observability works because you're practicing with it. All of those tools, what we used to call the non-functional parts of your system, the observability, the management and monitoring, the retry patterns, get exercised when you do chaos engineering, in the service of being much more resilient. Are we much more able to anticipate, react, respond, and synchronize around these things?
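Since circuit breakers come up as one of those improvements, here is a minimal sketch of the pattern, assuming nothing beyond the standard library; the thresholds, timeout, and class shape are illustrative rather than any particular library's API.

```python
import time

class CircuitBreaker:
    """Fail fast once a dependency has failed repeatedly, then probe for recovery."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, *args, **kwargs):
        # While open, fail fast until the reset timeout has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

And, in the spirit of the talk, it's the chaos experiments that prove a breaker like this actually trips and recovers under real turbulence, rather than only in unit tests.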
Okay, verification is an adaptation, if you like, something you get almost for free once you start to explore turbulent conditions and dark debt. Because what's really happened is that the experiments you used to find these things can then be used to verify that your system is actually better than it was before. If you've already got a collection of chaos experiments, you can flip them into verification: we believe we now deal with this better, so we can verify it on a continuous basis. And continuous verification is something I'm seeing more and more people understand as a great umbrella term for how we should treat systems: we should continuously verify that they would behave in the real world the way we expect them to.

And finally, learning. The key measurement of learning in all of this is soliciting and gaining interest in what you've found. If you're going to get better and invest in anticipating, synchronizing, responding, and learning, the learning part is the one you mustn't lose, and the way to focus on it is to ask: who's interested in our learnings? If you produce learnings from a chaos experiment, or from an incident, a full-on crisis, and no one reads them, you're missing a trick. One thing I often encourage people to do is invest in almost storytelling narratives around incidents; make these things interesting. Because they are interesting, but some people can't engage with just the pure facts of what happened; that isn't very interesting to them. What they need to be told is how it felt, how it all fit together. That enables the learning. Now, it's a hard skill for us as engineers to develop; we don't naturally tell stories about incidents. So sometimes I'll actually encourage people to get third parties involved in this narrative storytelling, so that people become interested in what we're actually learning.

Okay, a quick summary for you now. Resilience is a crucial muscle that you, as a high-performing team and organization, can and will invest in, in order to be that high-performing organization. It's the key capacity, the muscle, that enables high-performing teams to move quickly with reliability, to move quickly with security. To do this, you need to invest in those four capacities, develop those: ask yourself, what is our system like at anticipating, synchronizing, responding, and learning? Can we develop those? And then we can use these seven capabilities to develop those capacities. These are the tools you use to develop the capacities you're aiming for, in order to improve the resilience, the behaviors, the habits of your team. And at the heart of it all is practice. This isn't a one-shot solution; you don't do resilience this week and then never need to do it again. Resilience requires practice, so you develop those capabilities continuously, in the service of improving those resilience capacities. Chaos engineering really is just one of the tools in the gym, one of the exercises to improve your resilience, and it's a powerful one because it relates to two significant things you need: you need to be able to explore, to shine a light on dark debt, and you also need to verify over time that the system has actually overcome that dark debt and that you're now better at anticipating, synchronizing, responding, and learning.
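As a minimal sketch of that explore-then-verify flip, here is one way a chaos experiment might be structured so the same code can run as a scheduled verification check. The probe and injection helpers are hypothetical placeholders, passed in as parameters; the shape, steady state, then turbulence, then steady state again, follows the loop described above.

```python
# Hypothetical callables are assumed: steady_state_ok() probes a consumer-facing
# SLI; inject() and rollback() introduce and remove one turbulent condition
# (e.g. kill an instance, add latency to a dependency).

def run_experiment(steady_state_ok, inject, rollback) -> str:
    if not steady_state_ok():
        # Nothing can be learned safely if the system is already unhealthy.
        return "aborted: steady state not met before injection"
    inject()
    try:
        if steady_state_ok():
            # Flipped into verification: re-running this on a schedule (or in
            # the deployment pipeline) continuously confirms the weakness
            # stays fixed.
            return "verified: system tolerated the turbulence"
        return "explored: dark debt found, consumers would have been affected"
    finally:
        rollback()  # always remove the injected condition
```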
And one very last thing for you to hang on to, something I'm going to be talking about a lot next year: there's a key behavior in all of this that gets missed when people encounter chaos engineering and think it's about hurting systems. It's not. We do all of this because we empathize with the user, the consumer, the system, and the experience the team has. If you, as a team, are responsible for running your systems, then empathizing with your own experience of doing that leads you very easily towards developing resilience. And the more resilient you are, the more humane it is to work within this highly turbulent environment of writing and running your own software. That's why we do this. Empathy is an undercurrent in our industry, and it's something I'm going to be talking a lot about next year; it's at the heart of humanistic engineering.

Okay, thank you for your time. It's been a whistle-stop tour. If you have questions, please contact me; I'm very keen to start dialogues around how people are doing this, or whether they're considering doing it. I've only told you how to get started. I've given you a glancing blow on the four capacities to develop to be more resilient, and the seven capabilities to work on in order to develop those capacities. But I've only given you the starting point, so your journey will be yours. I would love you to start a dialogue with me as you consider going on it, and maybe as you progress along it as well. Thank you very much.

Yeah, thank you, Russell, so much for such an interesting and insightful session; I'm sure the participants have a lot to take back from this one. Once again, thank you, Russell, for the session, and thank you, everyone. Thanks, everyone.