It's an honor to have Colton with us. He is a big name in the chaos engineering space and I was so pleased to finally get him to come to this conference. He runs his own chaos engineering conference, which has been very influential with lots of followers, so it's amazing to see him join us today. For those of you who don't know Colton, he is the co-founder and CEO of Gremlin, which is one of the leading companies in the chaos engineering space. He was one of the original people at Netflix helping them improve streaming reliability and operate the edge services. He also built FIT at Netflix, the failure injection testing service. So having him here with us to inaugurate our chaos engineering day as the closing keynote speaker is a huge honor. Thank you so much, Colton, for joining us, and I would request you to please share your screen now.

I'm excited to be here. I'm excited for the opportunity to speak. For those of you who don't know me, my name is Colton Andres. As was mentioned, I had the opportunity to work at Netflix, but my journey into reliability engineering, or chaos engineering, really started even before that. About 10 or 12 years ago I was part of the team at Amazon in charge of making sure the website didn't go down. We reviewed all the postmortems, I served as a call leader, and we built and tried this approach of chaos engineering before the term was even coined. After that opportunity I got to join Netflix. I got to build on some of the good work done by the folks who had built Chaos Monkey before I joined, built that application-level approach in FIT, and really helped continue to find the things that could cause us problems so we could sleep well at night.

So today I'm going to talk a bit about the path to reliability. The reason that we're here, the reason we're talking today, is the same reason it has been for some time. It's always been about reliable systems. We care about being able to ensure that the code we've written works correctly, that the customers who trust us and rely on us can do so with confidence, and that our systems are doing what they should. But obviously it's been an interesting year. There's been a lot that has occurred, and the question for this group, not in the broader sense of how COVID has impacted your life, because it's impacted all of our lives, is: how has it changed the way we approach reliability? How does it affect the way we operate our systems? This could be working from home as opposed to being in the office. This could be having to VPN into your systems when before you had direct access via your office. This could mean you're on a conference bridge in the middle of the night and nothing changes there, but it could also mean that you're working with a different environment, a different set of tools. Then you add on top of it the additional mental stress. Being on call is difficult. It's a stressful situation, and it's funny, getting up and giving a talk early in the morning is a little bit the same. You're half awake, you're wiping the sleep out of your eyes, you're trying to find the right words, you're trying to say the right things, and you've got to do it on the fly and you've got to do it right and you've got to figure it out. When we're dealing day in and day out with this extra stress, these extra conditions in our lives, it means that we're all going to be a little bit more at risk.
We're all going to be in a position where we may not be at our best, and that is something we can plan for and take into account, as opposed to just ignoring it and hoping it won't impact us. So a big part of what I want to talk about today is not the why of reliable systems; chaos engineering is really the path, the how. We all have this goal of ensuring we don't get woken up, we don't get paged, and our systems work, and chaos engineering is really the vehicle that drives that for us, that helps us to understand it. It is not the answer in and of itself; it is what allows us to arrive at the answer and to stay with the answer.

But given what's going on in the world, given that we would love to have a vaccine at this point so that we could return more to normal, I think the vaccine analogy is more apt than usual. When things go wrong, we have to be prepared, and this ability for us to inject chaos allows us, just like in our bodies, to train our immune systems to respond to the things that occur. And this analogy fits well in my opinion because there's both the people side and the tech side. On the people side, we need to train our body to respond to bad things, to fight them and to circumvent them. Similarly, our teams need to know how to respond, how to deal with different situations, and not everyone will be the same. So that opportunity to practice and to learn is a key aspect. And then of course we want our systems to gracefully degrade, to self-heal, to be able to handle failures when they occur. So we want both the tech and the people.

What's interesting to me is that really what we do is about preparation and risk assessment. And the preparation is, again, the people and the technology. It was funny for me to look up the definition of prepare, because it's really "make something ready," and to me that's make our systems ready, make the service I just wrote ready, and it's also "make someone ready." And so in this case, chaos really prepares us to be on call, to be ready to go. I was talking with a customer yesterday, and the story they told me was of how they brought in the concepts of chaos engineering and reliability, how they spoke to their teams to get everybody on call, but how really the blocker was that people were afraid. People were afraid that they wouldn't know what to do. People were afraid that they would make things worse. And that fear really prevents people from getting started. It can prevent them from jumping in. So by preparing ourselves we can build that confidence. We can have that opportunity to practice during the day, when the lights are on and when we can ask questions and when it's a safe space for us to learn.

Now the truth is what we're doing is hard. If you think back to the academic days, if you think back to your early days as an engineer, the distributed systems problems were always the most complicated. They were always the most difficult. Now every problem is a distributed systems problem. We've built our code to rely on other people's services, other people's databases. And when they fail, it can cause the classic engineering joke: someone else's failure can turn into my failure quickly. So we need to understand that we're trying to solve a very hard problem. And as I mentioned a couple of times, the tech is important, the people are important, and the process really helps us to understand what we do in these situations. A manager I once had said process is like programming for people. And I think it's important to think of it that way.
We learn what works for us, and we have that ability to adapt it to our organization or to the people on our team. But we need to think about that process, ask ourselves if it's working, and update it or change it if it's not.

As I thought a bit about chaos engineering, one of the things that's really interesting to me, and that separates it from the observability or APM space, is that we get to perform writes on the system. A lot of times we're observing how the system behaves in steady state. I'll go look at data: has anything odd happened in the past that could tell me about how the system will behave? And that's great and that's interesting. But by going and causing changes, by causing writes, we're actually uncovering misconceptions, things that are not working the way we expect. And we can dive into those. We can see the deviation from the happy case. And it may be in parts of our code that we haven't seen otherwise. I think that read-versus-write aspect can be more impactful, because it allows us to explore and to better understand these other pieces.

Now, one of the truths is we all have our own mental model of our system in our head. And as we're talking with our team members, we come to recognize that they have a different mental model. So as we're working together, we're all trying to take our different mental models and bring them together. And the hard reality is that there's a bunch of detail missing from our mental models. We're missing facts about configuration, about timeouts. We've written our code on top of libraries and frameworks, operating systems and network stacks. And so there are thousands and thousands of lines of code and configuration that affect that mental model. One of the things I truly enjoy about chaos engineering is it forces us to deal with reality. We can whiteboard and say things behave in a certain way all day. But when failure occurs, we have to work in the real system. We have to work in production. And we have to deal with all of the configuration and all of those details. And the truth is those are important pieces of the system to ensure are operating well. So we want to dive in and uncover those.

And really, we're building something that hasn't been done before. Elements of it have, but we're early in this journey. And there's a lot left for us to uncover, a lot for us to discover. So there's a question of what works well. What have we learned? What doesn't work well? What should we keep? What should we throw away? And this will be our opportunity as a community: as we learn what causes teams to have success, what causes customers to build that trust in companies, those lessons will help inform us about the best things we can do.

So I'm going to talk a little bit about what doesn't work. Treating reliability as a project, not a practice. When you want to bolt reliability on after the fact, or if it sounds like something cool you want to play with but not something critical to operating your system, it's a bit of an anti-pattern; you might be in a tough spot. Really, what we're here for is quality engineering. It's about ensuring that what we build works and works well. And so this ability for us to practice this, this ability for us to build it in from the ground up, is critical. And when we just try to wedge it on after the fact, we end up in a difficult spot.
A couple of symptoms of this can be that your intern is in charge of your reliability project, that your leadership really isn't bought in, that they're not investing in it, that they don't see that it's an important, critical part of your business. Another anti-pattern is buying into the concept but not really investing in it. And in this case, that is, this sounds cool, we think that it would be something that...

That was hilarious. I just got a memory-exceeded error in Chrome there that crashed. And I noticed that my video and slides appeared to be falling a little bit further and further behind as we went. So maybe some guidance on where I should pick up.

I think you were talking about what does not work.

Great. And again... we are literally chaos testing their systems. You know, it's a joke, but it's not. It's hilarious, or maybe hilarious is the wrong word. It's hilarious to me as a chaos engineer when failure hits; it's something I talk about all the time and I think that we should plan for it. But now everybody is on live streaming platforms to communicate, to run large conferences, and we're stress testing some stuff. We're hitting some edges. So bear with me. Thank you for your patience.

All right. Yeah, I talked a bit about buying into the concept but not investing in the concept. And so to me this is: when the rubber hits the road, is this something that is important to get done? Will somebody get promoted because they did great reliability work? Is this something that you account for on your roadmap? One of my personal pet peeves is "we don't have time for reliability work, it's technical debt, we're never going to allocate time for it in our sprint." But when we rush and push out bad code, we're absolutely going to spend the hours or days across the team fixing and responding to that incident, handling the postmortem, correcting what comes of that. So we're going to spend the time. Do we want to plan for the time, or do we want to be surprised and have it impact us? Again, I think it's really about preparation.

Another anti-pattern I see is fighting your company culture. Really, different organizations are in different places. You have to meet people where they're at and you have to know what works on your team. And so a couple of the things I hear: "we have enough chaos in our environment." This one's another one that's near and dear to my heart. If you have a lot of chaos in your environment, that probably means you're doing a lot of firefighting. You're playing a lot of whack-a-mole. And if that's the case, you need chaos engineering more than anyone, because it's there to help you get in front of that, to help you stop playing whack-a-mole, to help your systems behave in a more deterministic way. The other one, and I alluded to this earlier, is a culture of fear and blame. If people are afraid to go make improvements, if they're afraid to uncover things that aren't working well, then they're not going to be willing to do this good work. And the system is going to be left in a worse spot. We're going to be worse off for it. So we have to make sure that leadership and management understand the value and support it.

So I'll talk briefly about what works. I'll watch the slides get just a little bit further behind; I'll try to advance the slide just before I get to that point. One of the things that I've seen work is thinking a little bit more about the holistic process.
It's good for us to take some cues from the security space in this regard, in my opinion. Reliability, when it comes to that defender's dilemma, is that there are many, many things that can go wrong. And part of our job is to understand what could happen and to prioritize which ones we should prepare for based upon the risk. So like a vulnerability scan, understanding the things that could happen within our system and prioritizing that list gives us the starting point and the path to go harden our service or prepare our teams. Now, a convenient side effect of this is we can use it to baseline steady-state behavior, and that can help us if there are any observability gaps. If we're missing an alert or a monitor, or we see that something just isn't being captured by a dashboard, we can fill those in here. This information, these prerequisite steps, really gives us a set of scenarios, a set of failures that we should prepare for. And we can also leverage data from the industry. If there are any common outage patterns or symptoms or things that we know other companies have run into, we can look at those and say, should we prepare for these?

Now, of course, the balance in the risk assessment is that there are things that may be more common, more likely, but less impactful: those low-hanging fruits, the noisy things, the squeaky wheels that need a little bit of grease, versus the very large black swan events that may be very unlikely but, if they happened, could be the end of your company or a catastrophic event. And that comes down to a discussion with you, your product team, your business team, your business continuity team about what the risk is and what the appropriate steps are to take. So from this, we have a checklist, a set of things we can go do, and we can work through it over time. It doesn't have to all be done in a day; we can do it over weeks or months to improve our system. And as we find each thing, as we better understand it, as we tune it, then we can build it into our build and deploy pipelines to ensure that it's always tested and that we're not regressing.
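To make that risk-prioritization step concrete, here is a minimal sketch of a small risk register scored by likelihood and impact; the scenario names, scores, and the simple likelihood-times-impact weighting are assumptions for illustration only, not anything prescribed above.

```python
# A minimal, hypothetical risk register: each scenario gets a rough likelihood
# and impact score (1-5), and we prioritize experiments by their product.
# The scenarios and scores here are illustrative assumptions only.
risks = [
    {"scenario": "Dependency X latency spike",  "likelihood": 4, "impact": 3},
    {"scenario": "Disk fills up on log volume", "likelihood": 3, "impact": 4},
    {"scenario": "Availability zone failure",   "likelihood": 2, "impact": 5},
    {"scenario": "Region failure",              "likelihood": 1, "impact": 5},
]

for r in risks:
    r["priority"] = r["likelihood"] * r["impact"]

# Highest priority first: this is the order in which to schedule experiments,
# then record a result and a remediation ticket against each one over time.
for r in sorted(risks, key=lambda r: r["priority"], reverse=True):
    print(f'{r["priority"]:>2}  {r["scenario"]}')
```

However it is kept, the useful part is that the same list doubles as the record of what was baselined, tested, and fixed, which feeds directly into the next point about showing the value of the work.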
Now, one piece I added in here that I think is critical is that we have to do a good job explaining to the business the value of the work we've done. First of all, it is valuable to tell the business about the risks, even if we haven't fixed them yet. Making them aware of the things that can go wrong is a big part of a lot of the security and compliance process. So by telling them these are the risks that we understand, you're allowing them to have some ownership and influence over what could go wrong in the system and what's worth fixing. The other key part is that the baselining step allows us to track and measure the improvements we've made. And that's critical. Being able to go to leadership and show them the bugs that were found, the incidents that were avoided, the teams that were trained, and the risks that were mitigated is critical. If we don't do that, then it's hard to prove the value of the outage that didn't happen. Everybody knows that the million dollar outage, when we were down for four hours, was horrible and we never want that to happen again. But if you and your team work hard for three or six months and you prevent the next million dollar outage, how are you going to show your leadership that you were able to accomplish that? And this is a challenge we all have.

I think this is one of the opportunities we have as a community: to share what works in convincing our leadership, in capturing the value and quantifying it, so that we're able to justify and continue to invest in the great work that we're doing.

The other key aspect is having great champions, having great people who are able to come in, navigate the organization, and help fix and influence it. I've chosen a couple of quotes, but they really typify the behavior that I think helps bring this into an organization and helps win folks over culturally. The first quote is about being on call, and Adam says he's proud to step into any on-call rotation for any of their platform-critical infrastructure without hesitation. And to me that's really about the confidence that comes from having prepared. I felt this way at Netflix: because I was in the middle of running these chaos experiments, I was in the middle of all the outage calls, so if something went wrong I could jump in (as a call leader you're kind of forced to do that) and be able to help out and understand and add value. You know, it's not about someone else's problem. If you make it all a reliability team's problem, then maybe no one else will work on it. And Chatiana, really, one of the approaches he took was to go and win over folks, to ask them to volunteer across the organization just an hour or two of their time, and really to help people understand that it's everyone's responsibility to build quality software and ensure that our customers have a great experience. And that's also exactly what Jen had to say, and I think it is again the right behavior, the right attitude. We have to treat our customers' problems as if they are our problems, because they are. If our customers see failure and experience it, then they're going to be unhappy and they're not going to be our customers anymore.

So as I mentioned earlier, chaos engineering is really a journey, and what we need to do is find and learn along the way what works best, to be able to help each other on that path. But there are a few truisms I think apply. Reliability will always be important. It will always be something that we need. We need to be able to trust in our systems. COVID has shown us that we're only going to rely more and more on our technology, on being able to operate in a fully online or mostly online world. So we're doing important work, and that's not going to change. And just as the world changes and society changes, our systems and our technology are going to continue to evolve. So we're going to need to be able to adapt to that change. We're going to need to be able to understand the complex systems that are emerging, understand the frameworks that are being built. We're going to have to understand new approaches and new pieces, and be able to go out and train our teams to adapt to not just those technologies but those processes. The best example of this is microservices. We've had to relearn how to engineer. We've had to relearn operations. And it's been for the better, but it's been a painful process at times. And that'll happen again and again and again. And that's all right, because we're going to be able to grow and adapt and respond to that, to become better engineers, better technologists, and to build more reliable systems. So with that, thank you very much. I appreciate the time. I'm going to stop sharing.

All right. Thanks a lot, Colton. That was a really great presentation.
Very well summarized. Great tips there in terms of how to quantify the value and show it. I think there are some interesting questions; I'll quickly try and get to them so you can answer. So the first question here is: what can developers do to build applications that are more resilient? Is chaos testing the only way, or are there some good practices that developers can adhere to during development itself?

Yeah. I think the answer to this is, you know, there's QA, there's our unit tests, our integration tests, our smoke tests, and our chaos tests. And in many ways, chaos engineering is just distributed systems testing. So my answer would be kind of like the test-driven development world, where there's a lot of value in thinking through how you're going to write your tests and how that allows you to structure your code; that applies equally well to chaos engineering. By simply thinking about what could go wrong while you're building your system, you're going to make some better choices or trade-offs along the way, or at least more informed trade-offs. And so I think it's important to be thinking about it early. But when is the right time to bring it in? Well, again, like test-driven development, you could do it before you've written any code. That's probably a little early in my book, but certainly long before your rollout to production. And so as you've got your first builds, as you're building against dependencies, those are great places to start factoring it in. In fact, one bit of advice: at Netflix, we had a library that would catch every time we added a new network dependency into our service. And that allowed us to go make a decision. First of all, was it wrapped in a circuit breaker? Second of all, have we tested what happens if this dependency fails? And so that's a great way to ensure you're not surprised by someone else's failure.
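As a rough illustration of that idea, wrapping a dependency call in a circuit breaker with a fallback, here is a minimal hand-rolled sketch; the fetch_recommendations and default_recommendations functions are hypothetical stand-ins, and a real service would lean on a hardened circuit-breaker library rather than this toy.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While open, short-circuit straight to the fallback until the cooldown expires.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            return fallback()

# Hypothetical dependency call and static fallback, for illustration only.
def fetch_recommendations():
    raise TimeoutError("downstream slow or unavailable")

def default_recommendations():
    return ["popular-title-1", "popular-title-2"]

breaker = CircuitBreaker()
print(breaker.call(fetch_recommendations, default_recommendations))
```

The design choice worth noting is that the fallback returns a static, degraded response rather than an error, so the user experience degrades gracefully; a chaos experiment then verifies that this fallback path actually works, since it is rarely exercised in normal operation.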
That's great advice. The next question we have here is: can we quantify the reliability of a system based on the chaos testing results?

Yeah, that's a great one. That's one I've been thinking a lot about recently. And I would say, probably where we stand today, over the last few years, it's hard. We can show that we've run these experiments. We can show that we've prepared for certain scenarios. But it's hard to quantify that scenario not occurring. And so to me, that's why I was thinking about how we baseline, how we track the activities, how we do that in a way that's credible and meaningful to management. You've got my best guess today, which is that it's really about having a holistic overview of what could go wrong, being able to look at the risks, and then quantifying and capturing the work being done, capturing the improvements. You know, oh, we fixed a timeout. Oh, we fixed a configuration. We made sure we were resilient to this dependency failure. In some regards, just tracking and listing all the work and all of the learnings makes the value evident.

All right, great. I think you touched a little bit upon this, which is: how do you measure the value of chaos engineering? If you could expand a little bit more on this, please.

Yeah, it's again a topic I think a lot about, because again, how do you measure the value of the outage that didn't occur? How do you show the value of something that didn't happen? And to me, we have to look at the negative case, look at the costs, and then show the absence of those costs. An outage that occurs has an impact on revenue or customers or traffic. We know those external KPIs. But an outage also has a huge impact on the people who are on call and the work that they're doing. And so to me, say you have an hour-long outage; you might end up with 20 people on that call. They might have to go root cause, collect logs, look at metrics, sit in a postmortem, talk about what should be fixed, put it in Jira, not work on it for three weeks because they're behind, because they've been on call and dealing with this for the last week, and then eventually have the time to go back and fix it, test it and roll it out. That time and effort is exactly what I was referring to: we can amortize that. We can pay for that up front. If in our sprints and in our planning we say, look, one day is for reliability and chaos engineering, then what we're saving is 20 people's time for a week when there's a really bad outage that occurs. So I think quantifying that value of engineering time and of engineering velocity is a unique and different direction than just revenue lost or customers impacted.
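As a rough back-of-the-envelope illustration of that amortization argument, here is a tiny calculation; every number in it (team size, hours lost, hourly cost, sprint allocation) is an assumed placeholder, not a figure from the talk.

```python
# All numbers below are illustrative assumptions, not real figures.
engineers_disrupted = 20       # people pulled onto the incident and its follow-up
hours_lost_each = 40           # roughly a week of response, postmortem, and fixes
hourly_cost = 100              # assumed fully loaded cost per engineer-hour

outage_people_cost = engineers_disrupted * hours_lost_each * hourly_cost

team_size = 20
reliability_hours_per_sprint = 8   # "one day per sprint" of planned reliability work
sprint_cost = team_size * reliability_hours_per_sprint * hourly_cost

print(f"People cost of one bad outage:        ${outage_people_cost:,}")
print(f"Cost of one sprint's reliability day: ${sprint_cost:,}")
print(f"One outage ~= {outage_people_cost / sprint_cost:.0f} sprints of planned work")
```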
That's a great answer. At least personally, I've never thought about quantifying what people go through when there's an outage, and if there's a way to highlight that, even if you can't really quantify the numbers, I think it can help people understand the importance of something like this. So I think that's a great answer. Thank you. Moving on, Sakshi has a question: how much chaos engineering is enough for an organization?

Yeah, you know, it's another one where what we're doing is about trade-offs. It's about risk assessment and the right risk mitigation, which is not always 100%, and that's why we don't strive for 100% availability. Actually, I gave a talk about this earlier this year; I can find a link for folks if they want. It talks through what the world looks like at two nines, three nines, and four nines of availability, and what it takes to get to the next level. But part of that was, yeah, what is the right amount? If you're a small company, if you're a small team, if you're a new service, if your failure isn't going to cost your customers a lot of pain or money, maybe it's okay that you shoot for two nines or three nines. Maybe really you can get to three nines by spending a couple of hours a month, with some automation and some tight process. But if you operate something critical, something that involves real-time vehicle automation or emergency response or critical communications, then the bar is higher. And if you need to be in that four nines or five nines world, then the answer, in my opinion, is that you can and should invest more heavily. The other good news is, if this is all one team's problem, that team is going to spend so much time trying to do everything for everyone and fix every problem. But if it's everybody's problem and everyone spends just a little bit of time, you can really divide and conquer in a very efficient manner. And that's where, again, Chatiana's quote speaks to this: if you could get everyone in the company to spend an hour or two on reliability in a year, you could have a massive impact in uncovering and fixing a lot of the little issues or interconnected things that could cause trouble down the road.

Awesome. I think this morning Dave, who did the keynote, talked about the error budget and using that as a guiding light to see how much is enough, whether you really need, you know, four nines of reliability or not. And his point was that if you're well above your availability commitment, you're doing a disservice to your own company, and if you're below the commitment, you're doing a disservice to your customers. Staying within the right budget, I think, was his guidance in terms of how much you should invest in this.

And I love Dave's talks. I think he is an excellent speaker. But one of the balances there with the error budget is you have to recognize there's always business pressure that pushes you away from reliability. It's always about efficiency. It's always about shipping things faster. It's always about how much time you can spend. There's a great talk on these organizational pressures by Sidney Dekker about the different things that push on us; I believe it's from Drift into Failure. The business is always pushing on us to be more efficient, to get things done faster, to run closer to the edge. So the world where we are way above our error budget and it's impacting our team, I don't see that a lot. I guess I'll just say that.

All right, cool. Thanks. Moving on to the next question, which is kind of building on the previous one: how do we know that our organization needs chaos engineering, and to what extent? Are there pre-checklist points to guide us?

Yeah, again, I think this is where you should think of it less like chaos engineering. Think of it just as reliability. It's a pillar of good software. Our software should be the right amount of efficient, the right amount of available, the right amount of performant. And it's for us to know and understand where we are in that journey. But if you think of it just as good testing, good quality engineering, then the answer is yes, everybody needs chaos engineering, because everyone needs good testing. Everyone needs to make sure what they're building works. Where I'd say chaos engineering really shines is if you're building a distributed system, if you're making a network call and something could go wrong. If you're running in a cloud environment or a containerized environment where something else could restart your service, shut down a host, shift traffic or move things on your behalf, then it behooves you to prepare for those circumstances. It's in your best interest to understand what occurs, because those things will happen. And the question, and it goes back to the value question from the last one, is: what's the cost of being wrong? If you go down and you're down for an hour, are your customers going to be mad? Are they going to leave? Is somebody's life going to be on the line? Is somebody going to be unable to communicate? Is money going to be lost?
If the answer is no to everything, then maybe you don't need as much reliability or chaos engineering. But if the answer is yes to any of those, or if someone is going to get mad or get fired or be very unhappy, it's worth factoring in.

In terms of where to begin, that's the tricky part. There are a lot of places you can apply this approach and a lot of different ways to think about it. What could go wrong to me? How do I handle resource constraints? What happens when I'm CPU bound? What happens if there's a memory leak? What happens if a disk fills up? At every company I've worked at, including Amazon and Netflix, I've had a big outage because a disk filled up and a log didn't rotate, or because someone took a network dependency in a Spring configuration file. These things occur. The second set is: what could happen to something I depend upon? So I store data in S3. I rely on Google's cloud compute or this Azure service over here. I rely on my internal identity service. I rely on my government's XYZ service. What happens to me if they fail? Have I tested it? And then once you know the black-and-white failure, what happens in the gray failure? What happens when it slows down? Is that going to cause a huge bottleneck? A couple of the anti-patterns to look for: latency amplification. I have some code, and in that code I have a tight loop. In that tight loop, I make a network call. And when everything's fine, to my upstreams it just looks like I'm getting work done. But when my downstream gets slow, it gets amplified because it's in a loop. And so 100 milliseconds or a second becomes tens of seconds, things just start piling up, and then we stop responding to our upstream service, and it becomes this cascading failure. So this is where the concepts of circuit breakers, being able to isolate failure, being able to shed load, being able to back off of downstream dependencies, being able to understand these interactions, really become critical to ensuring the overall health and operation of the system. And then there are the really large hammers. Are you running multi-cloud? Can you fail over from one cloud to another? Can you handle a host failure? Can you handle an availability zone failure? Can you handle a region failure? Could you handle a cloud provider failure? I think very few people need the last one, but everyone should be able to handle a zone failure, in my opinion. So that's a quick overview of the ones I see most often and find most valuable.
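To illustrate that latency amplification pattern, here is a small simulation sketch; the call latencies, loop size, and timeout are made-up numbers chosen only to show how a modest per-call slowdown multiplies inside a tight loop, and how a per-call timeout bounds the damage.

```python
# Simulation of latency amplification in a tight loop; all numbers are illustrative.

def process_batch(items, downstream_latency_s, per_call_timeout_s=None):
    """Return the total time spent waiting on a downstream call made once per item."""
    total_wait = 0.0
    for _ in items:
        call_time = downstream_latency_s
        # With a per-call timeout we give up (and would fall back) instead of waiting.
        if per_call_timeout_s is not None:
            call_time = min(call_time, per_call_timeout_s)
        total_wait += call_time
    return total_wait

items = range(100)  # one network call per item in the loop

print(f"healthy downstream (100 ms): {process_batch(items, 0.1):6.1f} s of waiting")
print(f"slow downstream (2 s):       {process_batch(items, 2.0):6.1f} s of waiting")
print(f"slow + 250 ms timeout:       {process_batch(items, 2.0, per_call_timeout_s=0.25):6.1f} s of waiting")
```

The same hundred calls that cost ten seconds against a healthy dependency balloon to minutes when it slows down, which is exactly the point at which upstream callers start timing out on you in turn.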
Awesome, thanks a lot. Just again, going back to one of the things Dave said this morning: humans are distributed systems, and I thought that was an interesting perspective. People often say, hey, I don't have distributed systems, but the point he was trying to make is that you have people working in the team, and humans are distributed systems, so there may be issues with that. That's a perspective I thought was quite interesting.

And I love that. Dave talked about that last year in a talk he gave about chaos engineering your team: this person, you can't talk to them for four hours; let's see what they know, let's see what would happen. This happened to us. One of my leadership team had jury duty over the last month, and it was a serious case, and he could not really be in contact, and boy did it highlight how important he was and all of the good work he did. Sometimes you don't realize it until you're forced to acknowledge it.

Cool. One last question and then we should be done here. This is a question from Carol: whose responsibility is chaos engineering? Would this fall under specific roles or teams, or would it be a virtual team spanning all the teams involved in the application?

I mean, yes. The last one. No, the way I would answer this is that where your organization is, is really an important variable in this equation. If you're in a very siloed organization, if it's the old-school waterfall, if there are very separated responsibilities between development, QA, and operations, maybe it's at those boundary points where things are being handed off. Maybe it's that operations team on the front line feeling it. But if your organization is moving toward an agile, DevOps environment where things are more shared, where you have more of the "you build it, you own it, you operate it," that's kind of the other end of the spectrum. In that world, I'd say it's everybody's responsibility. One of my bosses at Amazon used to joke that the best way to get a problem fixed is to make an engineer feel the pain, because we as engineers are all inherently lazy, and if we feel the pain, we're going to go fix it and make it go away. And so if it's everyone's problem, then it gets fixed quickly. And if you get woken up because you made an assumption and didn't validate it, then in some regards there's some karmic justice there. It depends where you are on that spectrum, though, and there are a lot of big companies trying to do the right thing and change their culture. And it takes time; just like microservices and the cloud, this is a journey that's going to take 10 or 20 years. And we want to look for opportunities where we can leverage this to help that transformation, to help that journey. Because sometimes we as engineers feel like we're right. We come in, we say, you know, you're wrong, we need to do it this way, what the hell is going on? And we can actually hinder ourselves or hold ourselves back. So think about it in terms of what your leadership cares about. How can you save them money? How can you save them engineering time? How can you make them more money? How can you help speed things up? When you talk to the business in those terms, you're going to find more support than when you just go tell them they're wrong.

Awesome. Sorry, we're taking a little extra time. This is just one last question, and I think it's a really good one.

Hey, I'm up. It's early here, but I've got nowhere to go. I'm happy to do Q&A. This is the fun part of the talk.

Cool. All right. This one is from Sakshi. She's asking: what has been your most surprising or audacious chaos experiment, and what was your key learning from it?

Yeah, it's a good question. I was just saying to someone the other day that I've never caused a large outage with chaos engineering, and I've certainly prevented several. But we've had a couple that we needed to be very careful with. When I was at Netflix, we were doing a lot of testing around the identity service, ensuring the right fallback behavior between the cookies, the cache, the service and the database. But the key is the user experience. If someone gets logged out or changes their profile in the middle of the experience, it could be jarring. You log in, you select your profile, you're looking through a movie, and then you get logged out or it changes your profile.
Things like that would be very noticeable. And so we needed to be able to test this. Our goal was to fail the identity service for everybody, have our fallbacks engage, and ensure we could trust those fallbacks, the paths that no one ever hits otherwise. And we had to get all of the front-end engineers in a room. This was like early December; it was an end-of-day meeting, like three times a week. Some of those front-end engineers were kind of mad at us. You know, it was their code, they wanted to fix it, but they had other features and things to get done. And we really had to get the whole company to be able to handle that failure well. And then when we went to run it, we scheduled that game day, I'd say, three or four times. The first time we got to 1%, hit something that looked weird, pulled it back. We had to go dive into logs, had to debug, had to come back another day. I think the next time we got to 10% of all traffic before we had to stop. And then the time after that, I think we got above 50%, and eventually we got to, I don't know, 99% of all traffic where we were failing that service. And the nice thing was that that service had caused several outages over the course of the year, and that holiday peak, we didn't care if the identity service failed. It was actually a very quiet holiday peak. I remember my first Christmas at Netflix: I was on calls, I was on my laptop, I was getting paged. And my second one, after we'd really dived into this, I was ready, I had my stuff ready, but it was quiet, it was calm, it was peaceful. And it speaks to that confidence of preparation.

Awesome. I'm sure every engineer in the room would like that, not to be bothered. And that's, I think, where chaos engineering becomes extremely important. So thanks a lot. It's a real honor and a pleasure to have you; wonderful answers and a wonderful presentation. You can look at all the likes going up there. I believe people had a great time, a great learning experience, and we look forward to having you at future conferences.

Sounds great. This has been wonderful. So thank you for the opportunity, great questions, and thank you for all the likes. Thank you for listening.

All right. I think we've answered most of the questions, so I think we should let you go. Sorry to wake you up in the middle of the night to do this.

I'm on Mountain time; I just got up early this morning. It's not too bad, it's seven thirty in the morning now, so don't feel bad for me. You can find me in the chaos engineering Slack if you go to gremlin.com/slack. That's a public chaos engineering Slack for anyone. You can DM me or ask questions in there, or tweet at me or find me wherever; I'm more than happy to answer questions and share my experiences.

Yeah, thanks for sharing the Slack link. I've been part of it; it's a very active group of very nice folks out there. So I would encourage everyone who's interested in this topic to join that Slack. I'll put the link in the discussion for folks who are interested in joining. It's a great place, and it was started by you folks, so again, appreciate that, and thank you very much for that.