Hey everyone, good morning, good afternoon, and good evening, wherever you are today. I'm here from Harness to talk to you about building continuous resilience into the software delivery lifecycle with chaos engineering. Cloud native development has enabled teams to move quickly, but it has also introduced new ways for software to fail quickly. SREs, QA engineers, and developers need to work together to optimize reliability and resilience and improve developer productivity.

My name is Matt Schillerstrom, and I'm a product marketing manager at Harness. We are a modern software delivery platform for continuous integration, continuous delivery, security testing, feature flags, service reliability management with service level objectives, cloud cost management, and chaos engineering. For the past 20 years, I have been helping teams build reliable and resilient systems and teams across the nuclear power industry, retail and e-commerce, as well as nonprofit groups I've been part of locally in Minnesota. I've enjoyed being a software engineer, a product manager, and a product marketing manager, and I hope you enjoy this presentation today.

Why am I here? I'm part of the LitmusChaos open source community, which is an incubating CNCF project. Harness is also part of the CNCF as a silver sponsor. You may have seen me at KubeCon + CloudNativeCon Detroit, where we had our first ever Chaos Day in October 2022. Please feel free to contact me via email, Twitter, or LinkedIn.

Ultimately, we are here because we are building and making things better. As engineers and leaders, we are always seeking to understand and learn how the world works, and I'd argue that building for resilience is, in fact, chaos engineering. This discipline simply allows us to understand how the system works and operates. One of my favorite quotes, from Andy Stanley on Twitter, is: "If you don't know why it's working when it's working, you won't know how to fix it when it breaks." As a former IT administrator, that rings true for me. When I had to respond to a brand-new issue I didn't understand, it was hard: I had to dig around, and I was stressed and nervous. But when I had practiced failure and prepared for it, I was more confident, and ultimately less failure occurred because I had proactively planned for it.

So why does chaos engineering even exist? There's a nice illustration of this from bytebytego.com: resilience mechanisms are built into the code and the architecture to help the system recover, fail gracefully, or simply display an error message to the user. Not everything has to be perfect, but an error message can tell the user why something failed, or help the IT person solve the problem. Chaos engineering can be used to validate and tune these mechanisms to make sure they work. Another way to phrase the question is: what failure modes does my system have, what mechanisms am I using to prevent them, and how do I test those mechanisms? Do I wait for an incident to happen to prove they work, or can I test them proactively? Some common Kubernetes failure modes, taken from the Kubernetes website, are worth thinking through: system instability, resource contention, scaling issues, configuration errors, and resource exhaustion.
Now, Kubernetes is self-healing to an extent, but the application you put into the container isn't necessarily self-healing, and you have to know how it handles these failures when they happen. The chaos engineering experience today is a bit like a shopping cart on an e-commerce website: you can click around. Take the LitmusChaos open source project from the CNCF as an example. You can click and say, I need to run this failure against Kubernetes, or Cassandra, or Kafka, and there's a catalog of prebuilt experiments you can pull from. That works well, and the experiments are easy to understand and self-service. But developers, QA engineers, and SREs can't manually click buttons all the time. So today, a chaos experiment might just look like a declarative YAML file. All we're trying to do in it is delete a pod from a Kubernetes deployment to see how my application behaves when that pod gets disrupted and restarts. Does the user on the other end of the application get an error message? Does it gracefully degrade? Or is the transition to the restarted pod seamless?
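To make that concrete, here's a minimal sketch of such a pod-delete experiment as a LitmusChaos ChaosEngine manifest. The namespace, labels, and names here ("ecommerce", "app=cart", "cart-chaos") are hypothetical placeholders, and exact fields can vary across Litmus versions:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: cart-chaos            # hypothetical engine name
  namespace: ecommerce        # hypothetical application namespace
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  appinfo:
    appns: ecommerce
    applabel: app=cart        # target the cart deployment's pods
    appkind: deployment
  experiments:
    - name: pod-delete        # delete a pod and observe how the app behaves
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"     # run the fault for 30 seconds
            - name: CHAOS_INTERVAL
              value: "10"     # delete a pod every 10 seconds
            - name: FORCE
              value: "false"  # graceful deletion, honoring the grace period
```

Because it's just YAML, the same experiment can live in Git next to the application code and run from a pipeline instead of a button click.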
Now, if I dive into what we're trying to do with continuous resilience, let's break down how reliability and resilience can help development teams. Improving resilience across the software delivery lifecycle ultimately gives customers an improved experience. Generally speaking, SREs, QA engineers, and developers are a team, but they work siloed: they hand off work to each other, whether through a PR, a test, or an incident. What we're trying to do is this: SREs can leverage chaos engineering, perhaps after an incident, to recreate that incident and see how they can fix it, or to increase the blast radius of that incident in a simulated environment. They can validate that the fix they put in resolves the experiment, and then shift that learning, that test, left into the QA environment. Now the QA team can run the same experiment to see if there's any failure in that environment, whether the system or its configuration has drifted. And the QA team can shift it further left to the developer, so the developer can run that chaos experiment in the CI pipeline or the QA test environment. Now you have protection across the pipeline, and you're not waiting for a random incident to happen; you can actually avoid that incident.

If I think about this at the business level, innovation in software is a continuous process, and it has to be. It can help us improve multiple aspects of the business, but more importantly the developer and customer experience itself. So let's talk about innovation and achieving reliability and resilience. It's challenging to solve everything at the same time, but this year, in 2023, we need to not only move fast with high velocity, but also do it efficiently, at low cost, and with the reliability and resilience needed for the best customer experience. It's a mouthful, but how do we solve it? Automation in the pipeline is key.

Let's talk a little bit about the cost of software development. Right now there are approximately 27 million software developers globally, with an average salary of $100,000, for an annual payroll equivalent of $2.7 trillion. That's a lot of money. And if you look at how much time developers spend coding, in a recent survey poll on LinkedIn, 54% said they spend less than three hours a day coding. That's the equivalent of three hours of wrench time per day in which they're making something, creating something, innovating. The rest of the time, what are they doing? There are meetings, there's other toil, there's watching and babysitting deployments, there's security testing, all the toil that prevents development teams from being productive. Not that you have to code for eight hours a day, but if you're bogged down by all this toil in the deployment process, you can't be creative, you can't innovate, and you're not being as productive.

So look at the math behind this opportunity against that annual payroll of $2.7 trillion (I'll sketch the rough numbers below). If we can cut developer toil in half, which is doable, you can look at your developer budget effectively increasing, and you can redirect that to development, whether that means being more productive with the same team or hiring more people to build more capabilities. I also want to point out that code isn't always the best way to solve a problem. But if the toil around building a prototype or testing a small unit of code is complicated, a development team won't feel comfortable testing out new ideas. If you can quickly write code to solve a problem, prototype it, deliver it to non-production or production quickly, and test it with a few customers, that's a development team's ability to ask for feedback and iterate. Ultimately, you need to innovate to increase developer productivity and save costs.
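As a back-of-the-envelope illustration of that opportunity (treating the survey numbers loosely, so these figures are indicative, not exact):

$$27{,}000{,}000\ \text{developers} \times \$100{,}000 \approx \$2.7\ \text{trillion annual payroll}$$

$$\text{toil} \approx 8 - 3 = 5\ \text{h/day}; \quad \text{halving it reclaims } \tfrac{2.5}{8} \approx 31\%\ \text{of capacity} \approx \$0.84\ \text{trillion}$$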
So where can you increase developer productivity for reliability and resilience? You can reduce software build time, you can reduce software deployment time, and you can reduce software debug time. Let's dig into that last one a little more. Why do developers spend so much time debugging right now? One reason is oversight: there are a million things going on, and you test and automate as much as you can, but you overlook something; that's just human nature. Another is dependencies that have not been tested: it's very normal not to understand everything that goes into and out of your system, especially in these managed-service environments, so you can't test everything. Sometimes you wait for an incident to uncover that dependency retroactively, when you would rather be proactive. Then there's a lack of understanding of the product architecture: in today's world of thousands of microservices, can a human really hold the map of everything in their head? In the old days of monolithic applications, sure, but microservices today are challenging. And sometimes the developer's code is simply running in a new environment. Your code should be written in a way that lets the workload move around to different clouds, but sometimes there are intertwined dependencies you just don't know about. So software developers are spending a lot more time debugging.

And debugging in production is the worst possible experience. Responding to an incident is stressful, it's painful for the customers, and people are hunting and digging through the problem. The cost is expensive, because now you have broken production code you have to go back to, fix, and test, and that's time that could have been spent on new feature development. There's also the lost opportunity cost of the customer, because maybe you lost that customer along with the transaction they weren't able to complete. That's where the earlier slide comes in, shifting left to QA and further left to the developer's code in non-production. If you can find these infrastructure failures and application failures early on, it's cheaper. The graph shows that if you fix an issue in a QA environment, that's roughly a 10x reduction: it may cost $10 instead of $100. And if you fix it in the code before you even push it to QA, that can be up to 100 times cheaper than fixing it in production. These are real values you can use to show why it's important to test more up front.

Now, cloud native developers focus on the container itself and the consumable APIs it uses, and they experience failures at a rapidly increased pace precisely because we are making it easier to deliver software. Containers help developers focus on their application and API and worry less about the stack underneath, but that lack of understanding and lack of testing can cause issues across the whole infrastructure stack. Look back at the common Kubernetes failure modes: resource exhaustion occurs when a Kubernetes cluster runs out of resources such as CPU or memory; configuration errors occur when a cluster is not properly configured; resource contention occurs when multiple components compete for the same resources; and system instability occurs when the cluster is not stable and is repeatedly crashing or restarting. Chaos experiments that should be automated in the CD pipeline include tests for exactly these failure modes: resource exhaustion, configuration errors, and resource contention. A sketch of such a pipeline gate follows below.
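Here's a minimal sketch of what automating a chaos test in a CD pipeline could look like, written as a generic YAML pipeline step rather than any specific vendor's syntax. It applies the hypothetical ChaosEngine from earlier and fails the stage unless the resulting verdict is Pass (LitmusChaos names the ChaosResult "<engine>-<experiment>"):

```yaml
# Hypothetical post-deployment stage in a generic CI/CD pipeline.
- name: chaos-resilience-gate
  run: |
    # Launch the pod-delete experiment defined earlier (path is hypothetical)
    kubectl apply -f chaos/cart-pod-delete.yaml
    # Poll the ChaosResult until the verdict moves past "Awaited" (still running)
    for i in $(seq 1 30); do
      verdict=$(kubectl get chaosresult cart-chaos-pod-delete -n ecommerce \
        -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null || echo Awaited)
      [ "$verdict" != "Awaited" ] && break
      sleep 10
    done
    echo "Chaos verdict: $verdict"
    # Fail the deployment on anything other than a passing verdict
    [ "$verdict" = "Pass" ] || exit 1
```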
Additionally, you can automate testing of the ability to recover from unexpected events and errors, as well as the ability to scale up and down as needed. Ultimately, this can also help you automate testing of the ability to detect, diagnose, and mitigate security vulnerabilities. As developers dig into these problems and debug, they shouldn't have to dig too far to find the issue. You're testing code as fast as possible and shipping code as fast as you can, but not looking at the overall system. And since that container sits in an application that consumes APIs and resources on infrastructure, the impact of an outage can extend well beyond the container. So you have to ask yourself: are the containers tested for how they function when these faults occur, and does that testing reveal the deep dependencies in the infrastructure stack?

The problem with faults in these deep dependencies is that the customer-facing application is impacted, a developer jumps in to resolve the issue, and they find out that there are multiple dependencies causing it. This ends up increasing the cost of development: service resilience is impacted, developers are debugging, a dependency fault is discovered, and then new resilience issues are discovered as well. This is the case of the 10x to 100x cost of bug fixes. For dependency fault testing, what's required is to test your code for faults happening in the code itself, in the APIs consumed, in external resources, and in dependent infrastructure; this applies across the container code, the application, APIs, resources, and infrastructure. It means that cloud native developers need fault injection and chaos experimentation.

Revisiting the original use case of chaos engineering: we introduce controlled faults to reduce expensive outages. That's clearly important. But traditionally, chaos testing was recommended in production, it had a very high barrier to entry, it followed a game day model, and it was often a reactive approach driven by regulations or a compliance requirement. The new patterns driving chaos engineering are the need to increase developer productivity, removing toil so developers don't have to dig for answers; the need to increase quality in cloud native environments; and the need to guarantee reliability in the move to cloud native. This need leads to the emergence of continuous resilience, which is verifying resilience through automated, continuous chaos testing. All that means is: you have a known failure mode you need to protect against, you're using a resilience mechanism for it, and you have a chaos test that validates the mechanism still works as expected, whether that's alerting you, triggering a failover, or just showing an error message. If you can do this continuously, you know your system is protected across the pipeline. Again, continuous resilience is chaos engineering across development, QA, pre-production, and production. The probe sketch below shows what such a validation can look like.
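For instance, in LitmusChaos the steady-state check can be declared as a probe on the experiment itself, so the run only passes if the service keeps responding during the fault. This is a minimal sketch; the URL and names are hypothetical, and probe field formats vary between Litmus versions:

```yaml
# Attached under the pod-delete experiment's spec in the ChaosEngine:
probe:
  - name: cart-health-check                # hypothetical probe name
    type: httpProbe
    mode: Continuous                       # evaluated throughout the chaos window
    httpProbe/inputs:
      url: http://cart.ecommerce.svc.cluster.local:8080/health
      method:
        get:
          criteria: "=="                   # pass only if the response code matches
          responseCode: "200"
    runProperties:
      probeTimeout: 5s                     # value formats vary by Litmus version
      interval: 2s
      retry: 2
```

If the probe fails, the experiment verdict fails, which is exactly the signal the pipeline gate above keys on.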
One way we look at this is by measuring it with resilience metrics, because if you can't measure something, you don't know whether you're improving it. The resilience score is the average steady-state success percentage for a given experiment, component, or service. What that means is that your system is expected to behave a certain way during a disruption, so you can attach a score to whether that steady state held: did it change or not, is it good or bad, up or down? You can map that to resilience coverage, which is the number of chaos tests executed divided by the total number of possible chaos tests, times 100; I'll walk through a quick worked example of both metrics below. If you think about building resilience for a system, maybe you have 10 failure modes you're trying to protect against, equating to, say, five to 10 tests that could be run. As you onboard a new system to get it production-ready, maybe you start by running 10 out of 10 to get 100% coverage. Or maybe you only run five out of 10 on every deployment, because you only need the most common ones each time, and run the other five once a month or once a quarter. Continuous resilience in the developer pipeline is a way to achieve that resilience.

Compare the game day approach with the pipeline approach. With game days, chaos experiments are executed on demand with a lot of preparation; in pipelines, chaos experiments are executed continuously and without much preparation. Game days primarily target SREs as the persona; with chaos in pipelines, all personas are executing chaos experiments. And with game days the adoption barrier is very high because they're manual events, whereas with chaos in pipelines the barrier is much lower, because every time you run a deployment, or at a set frequency, you're automatically running the tests.

Traditionally, developing chaos experiments has been a challenge: the code is always changing, bandwidth isn't budgeted for creating experiments, and responsibility is typically not identified. SREs are usually pulled into the incident and the corresponding action tracking, and then they pull in QA or a developer. From identification to fix, it isn't tracked to completion, so there's no idea how many more experiments to develop or which failure modes to protect against. With continuous resilience, developing chaos experiments is a team sport across the delivery lifecycle, and it's typically treated as an extension of regular tests. ChaosHubs, or experiment repositories, are maintained as code in Git, so you have version control and historical information on how systems were configured. And you know exactly how many tests need to be completed, because you have the resilience coverage metric. It's never an unknown when you're talking to leadership about what tests you're running or how they're performing; you can simply say, here are the tests I'm running and here's the trend.

In summary: resilience is a real challenge in modern, cloud native systems because of the nature of the development; use fault injection and chaos experimentation to get ahead of the resilience challenge; and push chaos experimentation into the organization as a development culture rather than a game day culture.
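Before wrapping up, here's that small worked example of the two metrics, with hypothetical numbers:

$$\text{resilience coverage} = \frac{\text{chaos tests executed}}{\text{total possible chaos tests}} \times 100 = \frac{5}{10} \times 100 = 50\%$$

$$\text{resilience score} = \frac{1}{N}\sum_{i=1}^{N} s_i = \frac{90 + 100 + 80 + 100 + 95}{5} = 93\%$$

where $s_i$ is the percentage of steady-state checks that passed in experiment $i$. Coverage tells you how much of the failure-mode surface you're exercising; the score tells you how well the system held its steady state when you did.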
Thanks for listening today; I appreciate your time. I just wanted to let you know about a community event, Chaos Carnival, happening March 15th and 16th. It's a two-day virtual event that's entirely free, and the CNCF and the Linux Foundation are proud sponsors. If you have any questions, you can reach out to me at my Harness email, or on Twitter or LinkedIn. Thank you very much. Have a great day.