Good afternoon, everybody. I'm Jen Bergstrom. You can see the topic up there; I'm not going to say it all because it's quite a mouthful. Long story short, it's about injecting chaos into your systems, your personnel, and your organizations to make them more resilient, which I think is a really fun topic.

First thing to note: this topic isn't random, as much as it may seem a little random in this summit. It's chaotic. You can see the bifurcation diagram up there on the chart, which is meant to illustrate the idea that even in a chaotic system there are periods of order, and there's a lot you can ascertain about a system by running deliberate tests on it.

So today I'm going to talk a little bit about what chaos engineering is in general. And actually, let's do a little audience participation real quick. How many folks here are familiar with chaos engineering? Okay, good. Excellent. How many have applied chaos engineering to infrastructure, architecture, and applications? Okay, fewer hands. How many have applied it from a security standpoint, actually doing security tests and penetration testing? Excellent. And has anybody applied it to their organizations? Yeah, that one usually gets no hands at all. So we're going to talk about this; it's just an extension of the same idea. I'll go briefly over what chaos engineering is, I'll look at security chaos engineering quickly, and then we'll get into the meat of it: using chaos engineering on your personnel and organizations to help build a more resilient structure for your organization and your staff.

But first, a use case, and we'll come back to it throughout the presentation. Company X is opening a new capabilities demo lab. That lab is going to feature applications, software, hardware, and other interesting capabilities that Company X has been creating over the last while. In particular, there's one system they really want to demonstrate: Product X. Product X is a phenomenal new software package they've been developing under independent research and development (IR&D) for a while, and now they're ready to go public with it.

The lab manager is very concerned about making sure everything goes as smoothly as possible for this grand opening. It's a big deal. Key personnel are coming in, essential customers they really want to win business from, plus the person who's going to run the lab for the demo. And the lab manager has a few concerns about Product X itself, because when something has been developed as an IR&D effort, things like security, data protection, and resiliency often get ignored during that development work. So the lab manager wants to make sure everything is good to go: strong, resilient, and productizable. They assess the system, they assess the likelihood of the lab opening on time, and these are the top three risks they came up with. We're going to come back to these throughout the presentation.

All right, so quick summary: what is chaos engineering? There were a lot of hands up, so I'll go through this part pretty quick. A key quote from principlesofchaos.org: "Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production." That's a mouthful. I like to simplify it.
My version is: it's a disciplined strategy for driving down risk within a system. And that system can be a lot of different things, right? It can be hardware, it can be software, it can be security, it can be personnel.

All right, so what are the key steps? You have to validate your system; you have to understand the system well enough to be able to conduct experiments on it. You create a hypothesis about the system: what is it going to look like, how is it going to behave? You plan the experiment: what are you testing, what's your panic button (if things go really cattywampus, how are you going to stop it?), how long are you going to run the test, and what is the scope? Then you run the experiment, and then you monitor, rinse, and repeat. It's an iterative process. As everybody who's done it before knows, it's not a run-once-and-be-done-with-it thing; it's a continual, run-it-over-and-over-again kind of system, and it's a disciplined process.

So, quick history: where did chaos engineering come from? Netflix made a move into AWS back in about 2008, committing that all of their infrastructure would move into AWS. In about 2010, they started realizing that cloud and compute environments are built to be a little more fungible than a traditional environment. What do I mean by that? Things fail in cloud environments. They fail regularly, they fail unpredictably, and they fail more often than in a data center you've built up and invested a lot of money in. So you have to design systems to accommodate that.

They started simple and small with Chaos Monkey in 2010. Chaos Monkey basically went in and terminated a random instance in their system, and they used that as an effective test of how resilient the system was going to be. Did it behave itself, or did everything fall apart? But they pretty quickly realized that just killing a single instance was not going to be enough, so they began to build a Simian Army. Netflix built the Simian Army up as they saw failures in the real system, or as they came up with ideas they thought might cause failures, and they'd preemptively build a chaos experiment to exercise each one. The Simian Army had things like Chaos Kong, Chaos Gorilla, Conformity Monkey, Latency Monkey, a whole bunch of different ones, and they kept growing. They kept building that army. About the same time they really expanded the Simian Army, they also released Chaos Monkey as open source code, and it's still out there today.

Then in 2014, they decided there was a lot of value in officially recognizing all the work their engineers had been doing on this, and they created a title: Chaos Engineer. Which I kind of love, because that's a really fun title to carry. They started advertising it, and as other companies picked up and adopted the capability, the title of Chaos Engineer became a real thing. You can go look on LinkedIn now, and people actually have that title in their bios and are doing work on it. Chaos engineering has been pretty widely adopted across industry; there are a lot of great open source libraries out there to help run chaos experiments across all sorts of different infrastructures and frameworks, and it keeps growing. More and more companies are doing this. So it's no longer a purple-unicorn type of implementation; it's something that a lot of people use.
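To make the Chaos Monkey idea concrete, here's a minimal sketch of that first experiment. This is not Netflix's actual implementation (which is open source), just an illustration in Python against AWS with boto3; the "chaos-opt-in" tag convention and the region are assumptions for the example.

```python
# A minimal Chaos Monkey-style experiment: terminate one random, running,
# opted-in EC2 instance and watch what happens. Assumes boto3 credentials
# are configured; the "chaos-opt-in" tag is an illustrative convention.
import random
import boto3

def run_chaos_monkey(region="us-east-1", dry_run=True):
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instances = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    if not instances:
        print("No opted-in running instances; nothing to do.")
        return
    victim = random.choice(instances)
    print(f"Terminating {victim} (dry_run={dry_run})")
    if not dry_run:
        ec2.terminate_instances(InstanceIds=[victim])
    # Now watch your monitoring: did the system behave itself?

if __name__ == "__main__":
    run_chaos_monkey(dry_run=True)  # flip to False only with a panic button ready
```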
All right, so let's go back to Company X briefly. In order to address risk number one, which we'll look at again in a second, they implemented some chaos experiments on Product X, starting with the basic ones: Chaos Monkey, Chaos Gorilla, Latency Monkey. They stopped random elements within the system and made sure the system behaved itself. They stopped a group of elements within the system; it's a cloud deployment, so maybe they mimicked the loss of an entire data center or something like that. And they injected delay between the microservices deployed as part of the application. And they found some failures. That's the point of chaos engineering: find those failures. And they corrected them. The result is the green boxes up here: the probability that the system would fail to meet its SLAs dropped from 0.75 to 0.15, and the overall exposure for risk one was now 0.1125. It's never going to go to zero, but they improved it a lot. But there were still two other risks, and security chaos engineering is the strategy they used for risk number three. So we'll look at that one next.

All right. So chaos engineering isn't just for software and IT; it's good for a lot more than that, and security chaos engineering is one of the things it's really good for. Security spending and incidents, I mean, this has come up a lot at the conference this week: spending keeps going up, the number of tool sets available keeps going up, and the investment companies are making keeps going up. But the cost of security failures is also still going up. The means we're using right now to secure our systems, and to make them resilient in that direction, aren't working very effectively yet.

Security chaos engineering is a strategy that extends the concept of chaos engineering specifically into the security realm. It's run against the tool sets that are used. It's run against the system itself, doing essentially penetration testing as planned experiments. And it's run against the processes that are in place to secure systems, to validate them and make sure the system is actually secure, and not just secure in name, which is a common problem, right? We say we're secure, but we're not really secure. So you can use security chaos engineering to test for vulnerabilities, to identify failures, and to test the effectiveness of the tools implemented to secure the system.

Same key steps: validate your system (you have to understand it), hypothesize, build out your experiment, run the experiment, monitor it, rinse and repeat. You're going to see this slide a few more times during the session.

So Company X decided to use security chaos engineering specifically for risk number three on Product X. One of the tests they ran was Access Monkey: a user that has a valid login but not the authorization they should have in the system. For example, they go and attempt to access a database table, or fields within the database, that they have no right to access, and you validate: is that data protected? Is it secured? And then Anonymous Gibbon: no valid login at all. And there are a whole bunch of other security tests out there.
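As an illustration of what an Access Monkey-style experiment might look like in code, here's a hedged sketch. It assumes a PostgreSQL database reached through psycopg2; the DSN environment variable and the table names are hypothetical, and this is a sketch of the idea, not a definitive implementation.

```python
# A sketch of an Access Monkey-style experiment: connect with a valid but
# deliberately under-privileged login and verify the database rejects reads
# of restricted tables. The DSN env var and table names are hypothetical.
import os
import psycopg2
from psycopg2 import errors

RESTRICTED_TABLES = ["customer_records", "billing_details"]

def run_access_monkey():
    failures = []
    conn = psycopg2.connect(os.environ["ACCESS_MONKEY_DSN"])  # valid login
    for table in RESTRICTED_TABLES:
        try:
            with conn.cursor() as cur:
                cur.execute(f"SELECT * FROM {table} LIMIT 1")
            failures.append(table)  # read succeeded: that's a security failure
        except errors.InsufficientPrivilege:
            pass  # expected: access correctly denied
        finally:
            conn.rollback()  # clear the open or aborted transaction either way
    conn.close()
    return failures  # an empty list means the hypothesis held

if __name__ == "__main__":
    leaked = run_access_monkey()
    print("Unprotected tables:", leaked or "none -- all denied as expected")
```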
There's actually a really great book out there, written by Aaron Rinehart, one of the fathers of security chaos engineering, that talks specifically about it. But that's not the point of the talk today; I just wanted to touch on it because it's a great strategy.

In the case of Product X, Company X discovered that one of the databases was not encrypting information at rest: information was being stored in plain text. Now, thankfully, this particular database did not have any user-identifiable information in it, so the risk of a bad exposure was relatively low. However, it was a legitimate failure of the security of the system, and they had to correct it. They went back in and implemented encryption at rest on the database to make sure the data was fully protected (I'll show a quick sketch of that kind of check in a moment). As a result, they mitigated risk number three. Now, for this particular set of security tests they were very focused on data encryption at rest and in transit, so they didn't look at the security of the entire system, which is why the probability of failure in this case only went down to 0.35 instead of a lower number. But it still took the whole risk down significantly, to where the lab manager was more comfortable with the situation and willing to move on.

All right, so now we get to the fun stuff: using chaos engineering on your personnel and your organizations. This is a relatively new extension that hasn't been run all that often. But look at risk number two: personnel A, B, and C will not be able to access the physical lab facility. That's not a technical problem. That's not an application problem. That's not an architecture or infrastructure problem. It's not even really a security problem, because it doesn't deal with the tools and techniques we use to monitor our systems. This is a personnel problem. Are we able to get through the processes we have in place to grant access to a restricted facility, for people that aren't normally there, in a timely manner? In this case, there was a hard deadline of two weeks to get the accesses in place, which should be plenty of time. But the lab manager had worked at the company for a while and knew that was actually kind of a tight schedule, so they were a little worried about it.

All right. So what types of tools and tricks can we use for organizational chaos engineering? Well, an organization is really built of three key elements: your people, your tooling, and your processes. If you have all three of those pieces in place and secured, and you're correctly running well-designed chaos experiments on them, then you're going to have a resilient organization. But if any one of those three is brittle in one way or another, your organization is not going to be very resilient. So we're going to look at each of them independently, with some chaos experiment ideas you can run.

All right, same steps again. I told you you'd see this slide a few times; here it is. Verify your system, understand it, hypothesize, design your experiments, run your experiments, then monitor and repeat. So, personnel chaos engineering: these are experiments specifically focused on your team, on your people.
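Before we get into those experiments, here's the sketch promised a minute ago: a minimal example of the kind of encryption-at-rest audit that would have caught Company X's plain-text database. It assumes the data lives in AWS and is checked via boto3; adapt it to whatever data stores you actually run.

```python
# A minimal encryption-at-rest audit sketch: flag RDS instances and S3
# buckets without encryption configured. AWS/boto3 is an assumption here.
import boto3
from botocore.exceptions import ClientError

def unencrypted_rds_instances(region="us-east-1"):
    rds = boto3.client("rds", region_name=region)
    return [
        db["DBInstanceIdentifier"]
        for db in rds.describe_db_instances()["DBInstances"]
        if not db.get("StorageEncrypted", False)
    ]

def unencrypted_s3_buckets():
    s3 = boto3.client("s3")
    unencrypted = []
    for bucket in s3.list_buckets()["Buckets"]:
        try:
            s3.get_bucket_encryption(Bucket=bucket["Name"])
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code == "ServerSideEncryptionConfigurationNotFoundError":
                unencrypted.append(bucket["Name"])  # plain text at rest
            else:
                raise
    return unencrypted

if __name__ == "__main__":
    print("Unencrypted RDS instances:", unencrypted_rds_instances())
    print("Unencrypted S3 buckets:", unencrypted_s3_buckets())
```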
And they include some really fun experiments, like Lying Lemur, which is actually one of my team's favorites when we run it. They just love this one. Basically, you designate an individual, or a few individuals, on the team as liars for the day, and you tell them: you're going to lie 25 percent of the time. For every four answers you give, one of them needs to be a lie. Kind of fun, right? How often do you get told that you can tell blatant falsehoods to people at work, and you're not only not going to get in trouble for it, you're supposed to do it? It's kind of cool.

So you tell people to do that, and the liar can even advertise that they're the liar for the day. It doesn't matter if people know they're lying on occasion. What matters is that they don't know the frequency, and that your liar is convincing enough that the lie is plausible. You can't answer a question about deploying an application that usually has tens of microservices by saying, hey, it's going to be thousands of microservices, because the rest of your team will look at that and know instantly it's not true. The point of Lying Lemur is to see whether your team understands the work they're doing, and the system they're operating within, well enough to pick up on it when something just doesn't sound quite right. And the other point is: do they know where to go, aside from the person who lied to them, to find out what the ground truth really is? Do you have a key document repository? Are there other people they can talk to? Is there another resource they can seek out to identify that ground truth?

Now, one important point: the liar for the day has to document every lie they tell, so that the next day, when the experiment is over, they can go back and remind everybody: hey, this one was a lie, and this is the right answer. Especially the first couple of times you run this experiment, a lot of people won't know where to go to find those answers, so you have to make sure you correct all the incorrect statements after the experiment is over.

Absentee Aye-Aye is another one people tend to really like when I run it. A person starting their day at work gets retasked; you pull them off of whatever they're doing. Maybe they're a manager for a software team, but today they're not managing a software team; today they're going to come over here and work on this proposal for some new work you want to bring in for a new client. Completely separate work. Still present, still at work, still available, but not available to their team, not in the normal capacity. The point of Absentee Aye-Aye is to see how the team responds to somebody who's suddenly gone and out of reach. Again, do they know where to go? Do they know who else to reach out to? Is the knowledge distributed enough across your team that they can still do the job they need to do, even when that person is not there?
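These experiments are social rather than technical, so tooling is optional, but a tiny script can handle the two mechanical parts of Lying Lemur: picking the day's liar and keeping the lie log for next-day corrections. A sketch, with an illustrative roster and log location:

```python
# Optional helper for Lying Lemur: deterministically pick the day's liar
# and append every lie to a log so it gets corrected the next day.
import csv
import datetime
import hashlib
from pathlib import Path

ROSTER = ["alice", "bob", "carol", "dave"]
LIE_LOG = Path("lie_log.csv")

def liar_of_the_day(day=None):
    # Hash the date so the choice is random-looking but reproducible.
    day = day or datetime.date.today()
    digest = hashlib.sha256(day.isoformat().encode()).hexdigest()
    return ROSTER[int(digest, 16) % len(ROSTER)]

def log_lie(question, lie_told, ground_truth):
    is_new = not LIE_LOG.exists()
    with LIE_LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["date", "liar", "question", "lie", "truth"])
        writer.writerow([datetime.date.today(), liar_of_the_day(),
                         question, lie_told, ground_truth])

if __name__ == "__main__":
    print("Today's lying lemur:", liar_of_the_day())
    log_lie("How many microservices in the deploy?", "about forty", "twelve")
```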
And then Latency Loris. Honestly, everybody hates this one when I run it, but it's also a really good test, so I still do it. With this one, you tell somebody, or a couple of somebodies, that they're not allowed to respond to any inquiries they get, via any medium, for a certain period of time. We're very, very used to being able to respond immediately now, right? Reach out via Slack, reach out via Teams, reach out via whatever, and get a response within two minutes from anybody, any time of day. This one throws a wrench into those works and says: no, you're going to get that question, and you're going to sit on it for at least two hours before you respond. And the point, again, is how does your team handle that? Do you have enough resiliency built into the team that they can say, oh, so-and-so must be really busy, I'm going to go talk to my backup, or I'm going to go read the documentation, or I'm going to look for another avenue?

Now, this one can be really destructive, so I don't advise running it if you have a very tight timeline on something, because when you throw a wrench into the works and really back things up, especially if you pick a key person, it can mess people up for a week or two and really throw them off their game. But again, with chaos experiments, one of the key things is that you always define a panic button as part of the experiment. So if you see this one going really cattywampus, and you can tell things are really derailing, you can pull that person back in and say: okay, experiment's done, you can start responding normally now. You can always put a kill switch in there.

Company X ran some of these, and they did identify some things that just weren't well documented. In this case it wasn't Product X itself; it was the process for getting people into the facility. There were some steps in the process that were a little opaque, and only one person really knew how to do them. The communication wasn't as effective as it should have been: they didn't have a way to notify people that something was messed up or that things weren't progressing the way they needed to. And the processes themselves were not optimal; we'll just say there were some steps in there that probably needed to be resolved and cleaned up.

All right, so then we get to the tools, and the chaos experiments you can execute on the tools your organization uses. Things like Inaccessible Uakari, which is a fun one to say; uakari is just a cool word. That one is essentially revoking tool access. You have a system that everybody uses, that people in your organization are very dependent on, something like Slack or email or a time card system, whatever it is, and you go in and just turn off somebody's access to it.

Now, it's generally good to warn people that you're going to run this one, because depending on which tool you turn off, it can cause panic; people might think somebody's about to come escort them from their desk out the door when they lose that access. You don't want to do that to people. So warn folks that you're going to run it, but don't tell them which tool you're going to remove access to, because then they can take steps to prepare, and you want a true picture of what happens when that access is lost. The things you're watching for here are: do your people know how to restore that access? Do they know who to talk to, what email to send, what link to click to get that access back? Or are they stymied for the whole time you've turned it off? And as always, documentation is key on this one. Make sure it's very, very clear whose access has been revoked, why, and for how long, so that when the experiment is over they get it back without having to go through a whole lot of painful steps.
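Here's a sketch of what Inaccessible Uakari might look like with the restore guarantee built in. The access-management REST endpoints are hypothetical stand-ins for whatever system you actually use; the point is the kill switch in the finally block.

```python
# A sketch of Inaccessible Uakari with the kill switch built in: revoke one
# user's access to one tool, then guarantee it comes back. The endpoints
# below are hypothetical stand-ins for your real access-management API.
import time
import requests

ACCESS_API = "https://access.example.internal/api/v1"  # hypothetical

def set_access(user, tool, enabled):
    action = "grant" if enabled else "revoke"
    resp = requests.post(f"{ACCESS_API}/{action}",
                         json={"user": user, "tool": tool})
    resp.raise_for_status()

def run_uakari(user, tool, duration_s=4 * 3600):
    set_access(user, tool, enabled=False)
    print(f"{user} lost access to {tool}; watch how (and whether) they restore it.")
    try:
        time.sleep(duration_s)  # or poll an abort flag for an early panic button
    finally:
        # The finally block is the kill switch: access comes back no matter what.
        set_access(user, tool, enabled=True)
        print(f"Access restored for {user}.")
```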
All right, Silent Saki. This is one where you take the tools and shut off all of the automated notifications that we get very dependent on for some of our processes. No notifications at all. The point of this one is to detect whether people understand what the normal business rhythm is, what the normal process flow is, and whether they notice when they don't get those notifications. Do they go, hm, I'm not getting a notification but I should be, I'm going to go check in the tool and see what's going on, do I have something waiting for me? Or do they just completely forget about it, and nothing happens? The process is stymied and stopped.

And then the opposite of that one, which also tends to annoy people, is Spammer Spider Monkey: turning on every single notification you possibly can within a given tool. That one can be infuriating, so it's a good idea to warn people you're going to be doing it, too. The key with this one is: do people know how to go in and set those notifications back down to the level they want, so they can see through all the noise and pay attention to only what they have to within the tools?

When Company X ran these experiments on the tools used for the access process, they noticed there were inefficiencies in training. People lost access and had no clue where to go to get it restored. And that's a problem, because it obviously threw a pretty big wrench into their ability to move the process through.

All right. And then the process itself. Company X implemented chaos experiments on the process, and some of the experiments they implemented there: Missing Link. Cool name, really frustrating experiment. They took a required approver and just made them not respond for the day. It's similar to the unexpected-absence experiment, but specific to this one process: the person was just non-responsive. And what that's testing is whether there's a way for people to recognize that the approver is not there, not available to do the approving, and to find somebody else who can do those approvals. Or are they stuck until that person comes back?

Rejection Red Colobus. Everybody has a bad day sometimes. Sometimes they're just in the mood to say no, and usually when that happens, they're not really picky about what they're saying no to. They're just going: oh, another one? No. Another one? No. And that matters, because if the approver is in there actively saying no, but it's really something they need to be saying yes to, how do you get around that? How do you identify that that's what's happening? How does the team understand it? And are there steps available to escalate the problem, to go beyond that approver and say: okay, you're saying no, but we really need this to be a yes, so I'm going to go talk to this person or that person? Does the team know who to reach out to? Or do they get stymied again? Do they get stuck? Does the process stall?
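For the monitoring side of Missing Link, here's a small sketch of stalled-approval escalation, assuming a hypothetical request record and a stand-in notify function:

```python
# A sketch of detecting a Missing Link: find approval requests that have
# sat with a non-responsive approver too long and escalate them to a
# backup. The request records and notify() are illustrative stand-ins.
import datetime

STALL_THRESHOLD = datetime.timedelta(hours=24)
BACKUP_APPROVERS = {"lab_access": ["backup_approver_1", "backup_approver_2"]}

def notify(person, message):
    print(f"[to {person}] {message}")  # stand-in for Slack/email/etc.

def escalate_stalled(pending_requests, now=None):
    now = now or datetime.datetime.now()
    for req in pending_requests:
        if now - req["submitted"] > STALL_THRESHOLD:
            backups = BACKUP_APPROVERS.get(req["step"], [])
            if not backups:
                notify("process_owner",
                       f"Step {req['step']} has NO backup approver!")
                continue
            notify(backups[0],
                   f"Request {req['id']} stalled with {req['approver']}; please review.")

if __name__ == "__main__":
    escalate_stalled([{
        "id": 42, "step": "lab_access", "approver": "missing_link",
        "submitted": datetime.datetime.now() - datetime.timedelta(hours=30),
    }])
```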
And then the last one is one that, in my experience, is a huge problem at most companies, and that's Time Waste Tamarin: there are unneeded steps in the process. You can test this one several different ways. You can actually go in and add new steps to a process, which I don't recommend, because it's typically much easier to add steps than to remove them. Or you can enforce steps in the process that people tend to ignore, steps that aren't necessarily valuable. The key is that this one is really looking at the ability of your organization to adapt in an agile way and change its processes so that they make more sense. So you can add five or ten extra steps, or identify the optional steps, and see whether there's an approved way for people to circumvent or go around them. And of course, when Company X ran these, they identified some roadblocks in their processes, because everybody has roadblocks in their processes.

One of the big roadblocks Company X discovered was a single point of failure in the approval chain. Several key steps in the process for getting those visitors approved had one approver assigned and no backups. In one case, that approver originally did have a backup, but the backup moved to a different tasking with different responsibilities and was removed from the system. Because a backup had been put in once, nobody thought about replacing them, and the step ended up being a single point of failure. So the step Company X took was to add extra approvers: they identified the key steps that had just one person tied to them and added backup approvers, to make sure that if that person wasn't available, the process would still continue on and get its job done.

All right. So the capability lab was looking a lot better. Those three major risks were mitigated, they were dealt with, and because of that, and because of the chaos experiments that gave visibility into those risks, Company X was able to open the lab on time, and actually got a pretty good contract out of it from one of the guests that showed up. So all's well that ends well.

All right, so let's look a little bit at building a resilient organization. It's great to build understanding of your organization, your personnel, and the impacts of your processes and tools on your people, but if you don't know how to correct the problems you identify, then what good is doing the testing? For personnel resiliency, there are a few things you can build in to enhance the resiliency of your personnel. And one thing I want to point out, because this is a question I get all the time: none of these has to cost a lot of money. None of these has a huge impact on what you spend to run your organization, but they can have a tremendous impact on the bottom line and on the return on investment you get. Peer structure is one I'm really, really fond of, but it's often a very hard sell for organizations. Training and development is a similar kind of concept; companies tend to look at it as outflow, not inflow.
Documentation and mentoring are typically easier sells. But let's look at peer structure first. I have an organizational chart up here, and you can see there are two teams. They each have a team leader and three team members, roughly two-pizza teams, right? That's the idea. And each of the team members has a dotted line connecting them to somebody in the other organization: that's their peer. The key is that any time you get a new person into the organization, you identify a peer for them, and those peers are allocated to each other.

What I want to emphasize is that the peers have regular tag-ups to share critical tasking. Those tag-ups are typically once a week or less, and typically 15 minutes or less in duration. These are not long meetings; they truly are tag-ups. The point is to share just the critical tasking each individual in that peer relationship is dealing with at the time: know who the stakeholders are, know what your peer is working on. By doing that, if somebody goes absent, if team member 1.1 trips over their dog in the morning, twists an ankle, and doesn't come in to work that day, then team member 2.3, the dotted-line connection, their peer, can step in and say: I know team member 1.1 was working on these three critical things, and keep those on track. It's typically not a huge time commitment for the person stepping in, but it keeps things from derailing. It keeps things moving.

So the impact on the team's efficiency is minimal on a day-to-day basis; you're talking 15 minutes out of a 40-plus-hour week. And the benefit is tremendous, because now when a person disappears unexpectedly for a little while, their critical tasks are still covered.

One pushback I hear fairly often is: well, they're meeting once a week, they're talking to each other, so now I'm losing a lot of efficiency from my team because they're spending all this time focused on tasking that's not theirs. And it can be done that way, but I'd argue that if you're doing it that way, you're not keeping to the intent of the peer structure. Like I said: a 15-minute tag-up once a week or so. It's not meant to be a replacement for that person; it's not meant to pick up all of their duties. The idea is that everything else they're doing that's not critical can be dropped on the floor for a day or two, or however long they're out, but the critical tasks don't get dropped. And typically, when we've put this in place, we've actually seen an efficiency increase in the organization and in the teams, because those critical tasks are consistently getting done.
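A minimal sketch of the peer assignment itself, with illustrative names; round-robin pairing keeps it working when the teams aren't the same size:

```python
# A sketch of the peer-structure assignment: pair every member of one team
# with a peer on the other team. Names are illustrative; the output is the
# weekly 15-minute tag-up list.
from itertools import cycle

TEAM_1 = ["member_1.1", "member_1.2", "member_1.3"]
TEAM_2 = ["member_2.1", "member_2.2", "member_2.3"]

def assign_peers(team_a, team_b):
    # cycle() keeps this working when one team is larger than the other.
    partners = cycle(team_b)
    return {person: next(partners) for person in team_a}

if __name__ == "__main__":
    for person, peer in assign_peers(TEAM_1, TEAM_2).items():
        print(f"{person} <-> {peer}: 15-minute tag-up, once a week")
```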
So, documentation. I'm a software engineer by training, and I hate documentation on principle. However, I also know that it's really, really valuable, and that it makes a lot of sense to enforce it. Document everything. There are a lot of unwritten processes at most organizations that everybody just kind of knows: that's how you do it. This strategy for building resiliency points those out (you find them through your chaos experiments) and takes the time to actually document them, in a common repository that everybody knows the location of. You have to advertise where that repository is, where people need to go to look, and then you have to make sure you're investing enough time, as those processes adapt and change, to go update the documentation so that it stays current: as built, not as intended six months ago. So it's a continual process, but the benefits are real.

It increases your overall team knowledge. Your team has a better understanding of what the processes are, what the tools are, and what decisions have been made in the past and why, so that you don't go back and revisit things that don't need to be revisited, and so that if somebody needs to figure something out, they have a good place to start looking.

It reduces what I'll call soft-matter storage: there's a lot of organizational knowledge carried around in the brains of the people in the organization, and especially right now, with the great resignation that everybody's been hearing about, which looks like it's going to continue and probably pick up pace again, you can't let people carry a lot of knowledge around in their heads that isn't documented anywhere else. It didn't used to matter as much, before people were jumping jobs as frequently as they have been lately, but now it's really, really important.

Again, it provides that ground truth. And the other benefit is that when you're bringing new people in, it saves a lot of time on onboarding, because part of their incoming tasking can be going and reading that documentation, and as they read it, if they have questions, they can go ask the peer or mentor assigned to them and then update the documentation with the answers they get. So it's a continual revision and updating process, and it makes the whole team more efficient, because you don't have to have an experienced person standing side by side the whole time with every new person coming onto the team. It makes them self-sufficient much more quickly.

All right, so mentoring. Mentoring is another really important one. Most companies I know of anymore have a pretty good mentoring program in place, whether formal or informal, but this is a little different type of mentoring. Think back to the organizational chart where we had two teams, with three people plus a lead in each team. This would be a team-member-to-team-member mentorship: team member 1.1 mentors 1.2, 1.2 mentors 1.3, and 1.3 mentors 1.1. It's not a true career-progression type of mentorship; it's more peer-to-peer. And the purpose is to broaden that team knowledge again, spread that understanding, reinforce the culture of the team and how we do things, and build all of that understanding across the broader team.

One of the big benefits is that third bullet there. Most organizations have one or two superheroes: the go-to people that, if you need something done, you always go to, because they can do it better and faster than anybody else. The problem is they always end up drastically overburdened. They get burnt out. They might quit because they get tired of it, and they become ineffective because they have way too much on their plates.
So you really don't want those superheroes, if you want a resilient organization. You want to get rid of the superhero concept. You may still have your go-to folks, the people you go to when you need things done because they can do it faster than anybody else, and that's great, but you can't rely on it. Team mentorship like this helps drive out the need for superheroes. You still have your experts; I'm not denying the value of that. You have to have people that are really, really good at what they do. But that knowledge gets spread out a little more, so not everything has to go through that one person. Again, it increases your team's effectiveness and efficiency.

All right, and then training and development. This one tends to be a hard sell, again because a lot of organizations look at training investments as an outflow of money and don't look at the value that comes in. So for this one it's good to identify a strong return-on-investment calculation, and that's going to vary widely depending on the program you're running and on the culture of the company and the team. So I don't have any numbers up here, but: you're increasing capability by training and developing your teams. You're improving your retention, because if you invest in your people, they tend to want to invest more in the company; they feel like you actually value them. And it reinforces the culture again. It builds a culture of innovation, of curiosity, of wanting to build independence and a knowledge base, and that's a powerful, very resilient culture. It also builds your team's flexibility. It reduces the likelihood that one person brings a great new technology into your team and stays the only one who understands it, because if you provide training, other members of the team can go, hey, that's a great technology, I really want to dive into that, and then actually go do it.

All right, so tool set resiliency. There are really three big things here: automated recovery, notification management, and a self-help service desk.

Automated recovery means giving people the ability to restore their own access to things, reset their passwords, all of that kind of stuff. That's really important, and it's not always done very well. I don't know how many times I've had to call the service desk at various companies I've worked for and sit on the phone with somebody for an hour just to get a password reset. I mean, that's stupid. Don't do that to your people. Put automated recovery in place and train your people on how to use it.
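As a hedged illustration of the automated-recovery idea, here's a minimal self-service reset-token sketch; the in-memory store, stubbed delivery, and 15-minute expiry are all assumptions for the example:

```python
# A minimal sketch of self-service recovery: issue a short-lived reset
# token instead of an hour on the phone with the service desk. Storage and
# password update are stubbed; use a real store and auth system in practice.
import secrets
import hashlib
import time

TOKEN_TTL_S = 15 * 60
_pending = {}  # token_hash -> (user, expiry); in-memory for illustration only

def issue_reset_token(user):
    token = secrets.token_urlsafe(32)
    token_hash = hashlib.sha256(token.encode()).hexdigest()
    _pending[token_hash] = (user, time.time() + TOKEN_TTL_S)
    return token  # deliver via a verified channel (email, SMS); never log it

def redeem_reset_token(token, new_password):
    token_hash = hashlib.sha256(token.encode()).hexdigest()
    entry = _pending.pop(token_hash, None)
    if entry is None or time.time() > entry[1]:
        return False  # unknown or expired: the user must request a new token
    user = entry[0]
    print(f"Password updated for {user}.")  # stand-in for the real update
    return True

if __name__ == "__main__":
    t = issue_reset_token("member_1.1")
    print("Reset succeeded:", redeem_reset_token(t, "new-password"))
```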
Notification management we already talked about a little, with the Silent Saki experiment. Make sure the essential notifications, the ones that really have to be there, are controlled and managed at a higher level so people can't turn them off. But for every other notification, let people tune them to what they need. That makes it more likely that people actually pay attention when they get a notification, more likely that they see the ones they need to see, and it helps keep the process and the flow moving smoothly.

And then a self-help service desk. This one seems like common sense when you think about it, but not a lot of companies really do it effectively. Is there a common repository across the company for the tool sets you use? Can a person go look something up and ask questions in natural language? This is a great use case for AI and natural language processing: can they ask, how do I do this, and get an answer they can actually do something with? The more your company can provide that kind of capability to your employees, the more you free up their time and save them energy and frustration, and that builds resiliency in again. So I call this one tool set resiliency, but it's really still all people-focused. It's all about the personnel and how they can do their jobs.

And then process resiliency. There are four pieces here: approval management, an escalation process, tailored workflows, and a clearly defined, strong change process.

Approval management is streamlining your process as much as you can. Identify default backup approvers for every step; don't let there be a step in a frequently executed process that has a single approver on it. Make that an automated part of the tool system, so that if somebody moves to a different role and is no longer the appropriate backup approver, somebody else gets automatically assigned (I'll sketch what that automation might look like in a second), and monitor it to make sure it stays that way. Provide just-in-time notifications for approvers; again, if you only give them the notifications they really need to see, they're more likely to respond in a timely manner, and that keeps the process running more smoothly. Show the workflow for processes by default. This one seems like common sense to me, but a lot of companies don't do it, my own company included, honestly. To see the workflow for some of our processes, you have to get all the way through the process, and then you can look back at the history and see, oh, it went through all these steps. That's not the way you want to do it; you want to advertise what that process is. And then provide clear contact information for your approvers, so that if somebody isn't responding, it's easy for your people to reach out.

Escalation process: provide authorized workarounds, define the escalation process (how do people go around somebody who's being a naysayer?), empower the team to escalate, and again, train the team.

Tailored workflows: not all the steps are needed all the time. Clearly identify the optional steps, and train all your personnel on how to handle them. And that's all your personnel: the ones that manage the process and the ones that use it.

And then a change process. Sorry, I'm speeding up because we're almost out of time here. Static processes are broken processes; processes need to be able to change. So encourage your team members to own the process, empower them to change it, and define the steps they need to use to make those changes. It's still a controlled process, not a free-for-all, but they have the ability to go in and make those changes as they need.
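Here's the automation sketch I mentioned for approval management: when someone leaves a role, promote a new backup approver from an eligibility pool, and alert if a step is left with a single approver. The data model and names are illustrative:

```python
# A sketch of approval-management automation: on a role change, remove the
# departing approver, promote a replacement from an eligibility pool, and
# alert on any step left with a single approver. All names are illustrative.
APPROVAL_STEPS = {
    "lab_access":      {"approvers": ["alice", "bob"], "eligible": ["carol", "dave"]},
    "security_review": {"approvers": ["erin"],         "eligible": []},
}

def handle_role_change(person, steps=APPROVAL_STEPS):
    for name, step in steps.items():
        if person in step["approvers"]:
            step["approvers"].remove(person)
            if step["eligible"]:
                replacement = step["eligible"].pop(0)
                step["approvers"].append(replacement)
                print(f"{name}: {person} replaced by {replacement}")
        if len(step["approvers"]) < 2:
            print(f"ALERT: {name} is down to {len(step['approvers'])} approver(s)!")

if __name__ == "__main__":
    handle_role_change("bob")
```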
All right, so the final takeaway. Chaos engineering is a disciplined risk management and risk reduction strategy used for testing and operations, and those operations can reach across much more than just IT and software: across your people, your processes, and your tools. And it's an enabler for building more resilient organizations.

So we're about out of time here, but I'm happy to stick around and answer questions, and we can go out in the hall and continue the conversation. Or if you want to connect with me on LinkedIn, this is my LinkedIn address; please reach out, I accept every request. And that's it. Thanks for coming today. Thank you.