Welcome everyone to the Improve Resilience with Automated Chaos Engineering session with Gunnar Grosch. Over to you, Gunnar.

Hi everyone, and thanks for joining this session. Chaos engineering has proved to be a great option to have in the SRE and SDE toolbox, but the transition into more complex systems is really accelerating. My name is Gunnar Grosch and I'm a developer advocate at Amazon Web Services. In this session we'll look at how automated chaos experiments can help us cover a more extensive set of experiments than what we can cover manually, and how automation allows us to verify our assumptions over time as unknown parts of the system change.

So chaos engineering is the process of stressing an application in testing or production environments by creating disruptive events, such as server outages or API throttling, observing how the system responds, and implementing improvements. We do that to prove or disprove our assumptions about the system's capability to handle these disruptive events. But instead of letting those events happen in the middle of the night or during the weekend, we create them in a controlled environment and during working hours.

I think it's important to note that chaos engineering is not just about improving the resilience of your application, but also its performance. We do it to uncover hidden issues within our system, and we often use it to expose blind spots in monitoring, observability and alarms, and a lot more. We can use it to improve recovery time, the operational skills of our teams, the culture of the entire organization, and so on.

When we do chaos engineering, we follow a well-defined scientific method that takes us from understanding the steady state of the system we're dealing with, to articulating a hypothesis, then running an experiment, often using fault injection. After that, we verify the results of our experiment, and finally we learn from the experiment in order to improve the system: its resilience to failure, its performance, the monitoring, the alarms, the operations, the overall system.

Today we're seeing customers use chaos engineering quite a lot, and the usage is definitely growing. Two very clear use cases have emerged. Perhaps the most common way of doing chaos engineering today is creating what are called one-off experiments. That's when you create a chaos engineering experiment by, for instance, looking at previous outages and other events that have happened in your system. Or perhaps you identify the services within your system that have the biggest impact on your end users or customers if they go down or don't function correctly, and you create experiments for those services. Or perhaps you've built a new feature, added a new service, or just made changes to the code or the architecture, and you create an experiment to verify that the system works as intended. Companies do this in different ways. Some have dedicated chaos engineers who create and run the experiments. For others, it's part of the SREs' responsibilities. Or, as we partly do at AWS, chaos engineering is done by the engineering teams themselves on their own services.

The other very common use case is to use chaos engineering as part of your game days.
And a game day, if you haven't heard of one, is the process of rehearsing ahead of an event by creating the anticipated conditions and then observing how effectively the team and the system respond. An event in this case could be an unusually high traffic day, a new launch, a failure you've seen in the past, or something else. You can then use chaos engineering experiments to run the game day, creating the event conditions and monitoring the performance of your system and of your team.

So doing these one-off experiments and perhaps the occasional game day helps us get very far on the road to improving resilience. Isn't that enough, then? Well, it definitely can be, but let's look at an example, a very basic use case. Let's say that you and I are running an e-commerce web application together. It's a very successful application; we're selling stuff, and we have thousands, if not millions, of users on the web application and our mobile apps. It's a very straightforward architecture: an Auto Scaling group running EC2 instances spread across two Availability Zones, with Amazon Aurora as the data store for our databases. We get some level of resilience by building it this way, and we've started using chaos engineering to verify the resilience of our service. We're using AWS Fault Injection Simulator (FIS), our fully managed chaos engineering service. We do our experiments, we learn, and we gain confidence from doing them.

So let's do a quick demo of how we can run a chaos engineering experiment. I'm simply going to run an experiment where we try what happens if an instance in our application stops. This is the AWS Fault Injection Simulator console; if you're familiar with AWS, you can see that it looks similar to many of the other services. So let's start. I'm creating a new experiment template, as it's called, which is the template that defines our experiment, in this case to stop and start instances, and giving it a name and a description. Then I'm selecting the IAM role that's allowed to perform these experiments on my system.

Next, I want to add the experiment action. An action is what we're actually going to do, the fault we're injecting. In this case we're going to stop instances, so I'm selecting the EC2 stop-instances action, and I can specify that the instances should be started again after one minute. I save that, so we have an action. Next, I define a target for the action. In this case I'm going to select one specific instance, but we could also select a certain percentage, or use different filters and so on to select instances in a more random way. I save that target as well. Next, I have the ability to add what's called a stop condition, which means that if an alarm goes off, it will automatically stop my experiment. Then I create the experiment template, and we can start the experiment.

This is the EC2 console, where we can see the specific EC2 instance I targeted as part of that Auto Scaling group. So let's start the experiment. I need to confirm that I want to start it, as it might have an impact on my system. The experiment is now initiating, and then running, meaning that AWS FIS will now stop that specific instance. If we check the EC2 console again, we can see that it's stopping.
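As an aside, everything I just clicked through in the console can also be done with the SDK. Here's a minimal sketch in Python with boto3 of roughly the same template and start call; the role ARN, instance ARN, and alarm ARN are placeholders, not the actual values from the demo:

```python
import boto3

fis = boto3.client("fis")

# Sketch of the demo's experiment template: stop one EC2 instance,
# restart it after one minute, and abort if a CloudWatch alarm fires.
template = fis.create_experiment_template(
    description="Stop a single instance in the Auto Scaling group",
    roleArn="arn:aws:iam::123456789012:role/my-fis-role",  # placeholder
    actions={
        "StopInstance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT1M"},
            "targets": {"Instances": "oneInstance"},
        }
    },
    targets={
        "oneInstance": {
            "resourceType": "aws:ec2:instance",
            "resourceArns": [
                # placeholder instance ARN
                "arn:aws:ec2:eu-west-1:123456789012:instance/i-0123456789abcdef0"
            ],
            # could also be COUNT(1) or PERCENT(50) for more random selection
            "selectionMode": "ALL",
        }
    },
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            # placeholder alarm ARN; the experiment aborts if this alarm fires
            "value": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:my-alarm",
        }
    ],
)

# Starting the experiment is then a single call with the new template's ID.
fis.start_experiment(experimentTemplateId=template["experimentTemplate"]["id"])
```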
And as soon as the experiment is done, when that minute has passed, it will start the instance again. So it's completed; let's check the EC2 console again. My instance is starting up, and it has now moved back into the running state. That's a very quick and simple example of how we can use a chaos engineering experiment, in this case to verify that our system can handle single instances being stopped within our Auto Scaling group. And of course we then need to verify that it didn't have an impact on our users, by using monitoring and observability and comparing against that steady state I talked about early on.

So back to the e-commerce web application. We have AWS Fault Injection Simulator and we're running experiments against the application. But besides that, my team building this product service is of course also using good practices for deploying to our service: CI/CD, continuous integration and continuous delivery. As you probably know, CI/CD has made frequent deployments possible for us, and it even encourages them. And as we know, frequent deployments are less likely to break, as it's more likely that we'll catch any bugs or gaps. But frequent deployments are hard to cover manually with chaos engineering experiments. Think about the experiment I just did: having to run that manually every time we do a new code deploy, every time we make any type of change, well, it takes time.

So we have our product service, we're doing chaos engineering on it on a manual basis, and we're using CI/CD. But a typical application isn't built from one single service. Instead, we have different services within our application. For instance, another team is building the order service for our e-commerce web application, another team is building the user service, perhaps there's a team building the cart service, we have a recommendation service for our users, and of course we need a search service. These might be built by different teams inside our company or organization. And these services also have dependencies on each other: the cart service depends on the user service, the product service depends on the cart service, the order service needs both the cart service and the product service to work, the search service needs to work with the product service, and the recommendation engine needs to work with our product service and the users to give proper recommendations.

Now, when we're doing these kinds of experiments on our own service, we probably know exactly how that service works. But what happens when there are changes that are unknown to us in other parts of the system? Another team makes a change to a dependency of ours, for instance the order service. Well, we can't really know; these are unknowns to us. So we might need to run experiments quite often to be able to know that the dependencies are actually working as we expect them to.

So looking at this very simple example, I would say that frequent deployments make it hard to keep up with manual chaos engineering experiments, if we want to make sure that every deployment works as intended. And covering an extensive set of experiments is time consuming. We just did one single experiment, but then perhaps we want to test different parts. We want to make sure that we do experiments that cover our data store as well. What happens if there's latency?
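For latency in particular, FIS can run AWS-published SSM documents on instances, such as AWSFIS-Run-Network-Latency. A hedged sketch of what such an action might look like inside a template follows; the region, parameter values, and the target name are illustrative assumptions on my part:

```python
# Hedged sketch: a FIS action that injects network latency on target
# instances via the AWS-published SSM document AWSFIS-Run-Network-Latency.
# All values below are illustrative, not taken from the session's demos.
latency_action = {
    "InjectLatency": {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": "arn:aws:ssm:eu-west-1::document/AWSFIS-Run-Network-Latency",
            # documentParameters is passed as a JSON-encoded string
            "documentParameters": (
                '{"DurationSeconds": "120", "DelayMilliseconds": "200",'
                ' "Interface": "eth0", "InstallDependencies": "True"}'
            ),
            "duration": "PT3M",
        },
        "targets": {"Instances": "webServers"},  # hypothetical target name
    }
}
```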
So we need to create these different types of experiments. And as I said regarding dependencies, even though we might have full control over the service or microservice we're working on, unknown parts of the system might change: other teams make changes, other teams update their services. As systems become more complex, it's hard for anyone to build a mental model of how the whole system works, let alone keep documentation up to date.

And that brings us to automated experiments. Automation helps us cover a larger set of experiments than what we can cover manually, and it helps us verify our assumptions over time as unknown parts of the system change. Doing automated experiments really goes back to the scientific part of chaos engineering. Repeating experiments is standard scientific practice in most fields, and repeating an experiment more than once also helps us determine whether the data was a fluke or whether it represents the normal case. It helps us guard against jumping to conclusions without enough evidence.

So let's take a look at three different ways we can automate our chaos engineering experiments. First, let's think about how our system evolves. As I mentioned, we might have full control over the service or microservice we're working on, but other teams or third parties are making changes, delivering new code, releasing new versions of their services, and those might be services we depend on. So the verification we got from doing a one-off chaos experiment a month ago, or a week ago, might quickly become obsolete. By scheduling experiments to run on a recurring schedule, we can get that verification over and over again as unknown parts of the system change.

Let's have a look at an example of scheduling an experiment. In this case I'm building a simple scheduler for my experiments using a very simple serverless application. Jumping into VS Code, I've created an infrastructure as code template for my scheduler: a serverless function that starts based on a schedule, so I can schedule my experiment, with permissions set up so that the function is allowed to start my experiments. This is the code for the scheduler: it takes an experiment template ID, like the template I created before, and simply sends a command to AWS FIS using the SDK to start my experiment. It's a very basic scheduler based on an AWS Lambda function.

So I have an application running, as we know, with a bunch of instances in two different Availability Zones, and these are part of our service. I create an experiment template to begin with, based on a JSON file in this case instead of doing it through the console. Let me just copy the ID of my new experiment template. Now I deploy this simple serverless experiment scheduler: deploying it, running it in Dublin, pasting in the experiment template ID, and setting the schedule. This is basic cron syntax, so this experiment will run once a day. All right, this will now deploy in the background. It's fairly quick, so checking... all right, it's done and deployed. If we look in Amazon EventBridge, which is our event bus, we can see that we have a rule and it's set to run at midnight every day. But I can test it right away by creating a test event and running my scheduler right now, just to make sure that it works, pasting in my test event and testing it.
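The handler behind this scheduler is only a few lines. A minimal sketch in Python with boto3, assuming the template ID arrives through an environment variable I've named EXPERIMENT_TEMPLATE_ID (my own naming, not necessarily the demo's):

```python
import os

import boto3

fis = boto3.client("fis")


def handler(event, context):
    # The experiment template ID is injected at deploy time;
    # EXPERIMENT_TEMPLATE_ID is an assumed variable name.
    template_id = os.environ["EXPERIMENT_TEMPLATE_ID"]

    # One SDK call tells AWS FIS to start the experiment.
    response = fis.start_experiment(experimentTemplateId=template_id)
    return {"experimentId": response["experiment"]["id"]}
```

An EventBridge schedule expression such as cron(0 0 * * ? *) would then invoke it at midnight every day, matching the rule we just saw.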
So this will now send an event saying that it's time to run my Lambda function. It succeeded, and the Lambda function then starts my experiment. It is running, and if we check the EC2 instance we can see that the experiment is in progress. This specific experiment uses up CPU, we're stressing the CPU on our instances, and we're also stressing the memory. So this experiment will run, use up CPU and memory on our instances, and then move on to other things as well, multiple actions. And it will run on my schedule every day, meaning that we're able to verify that assumption, that hypothesis, every day, to make sure that the system still continues to work as expected under these adverse conditions. That helps us verify our assumptions over time in a very quick way. What's important when doing these experiments, of course, is to have stop conditions in place, so that if an alarm is triggered, the experiment is aborted.

Next up we have what's called event-triggered experiments, meaning that we run experiments based on events. And an event can be basically anything that happens within your system. It could be an event related to the tech stack, for instance latency being added when there is an auto scaling event, or maybe a business-related event, like an API being throttled whenever items are added to the cart. Building automation around these types of events helps us answer those hard-to-test questions: what if this happens while that is happening, even when "that" is an event in another part of your system?

So let me show you an example of an event-triggered experiment as well. Back to VS Code, setting up an event-triggered serverless application in the same way. Once again using a template with AWS SAM, this consists of an AWS Lambda function that triggers on what's called a CloudWatch event, an event within our system. The pattern in this case is auto scaling: whenever an EC2 instance is launched successfully, the event bus passes this event on and triggers my AWS Lambda function. The function works in the same way as before. Looking in the event bus, we can see that there are a bunch of different event patterns we could use, for instance based on the different AWS services; any service we're using that can emit an event is an option when creating these event patterns. I chose to go with EC2 Auto Scaling for this example. We then specify the Auto Scaling group this should apply to, and set up a new experiment template. In this case I have three different stop conditions in place, to make sure the experiment doesn't cause an issue whenever it runs.

So let's go through the process again: create a new experiment template based on my JSON file. This one will cause CPU stress on my instances. Now I'm deploying this event-triggered serverless application. There it is in our AWS Lambda console, and it consists of my Lambda function. These are the Auto Scaling groups that we have. So now, if I manually make a change to this Auto Scaling group by adding an additional EC2 instance, that will create the event that we're using. And now, fairly quickly... yes, it triggers my Lambda function, which starts an experiment based on that experiment template. So it is running. Let's check the EC2 instance.
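While we wait: the trigger behind this demo is an EventBridge rule matching successful instance launches. A rough boto3 sketch of creating that rule and its target follows; the rule name, Auto Scaling group name, and function ARN are placeholders:

```python
import json

import boto3

events = boto3.client("events")

# Match events emitted when EC2 Auto Scaling successfully launches
# an instance in one specific (placeholder) group.
pattern = {
    "source": ["aws.autoscaling"],
    "detail-type": ["EC2 Instance Launch Successful"],
    "detail": {"AutoScalingGroupName": ["my-demo-asg"]},
}

events.put_rule(
    Name="run-chaos-on-scale-out",  # placeholder rule name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

# Point the rule at the experiment-starting Lambda function (placeholder
# ARN). The function also needs a resource-based permission allowing
# events.amazonaws.com to invoke it.
events.put_targets(
    Rule="run-chaos-on-scale-out",
    Targets=[{
        "Id": "start-experiment-function",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:StartExperiment",
    }],
)
```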
Fairly quickly, it should start using CPU and stressing my instances. So we can run a specific experiment based on an event, in this case an EC2 Auto Scaling event.

All right. Next we have perhaps the most common one, and I would say this is the one I definitely get the most questions about: running experiments as part of your delivery pipelines. We talked before about how continuous integration and continuous delivery have made frequent deployments possible, and how those deployments are less likely to break. By adding chaos engineering experiments to our delivery pipelines, we're able to continuously verify the output or behavior of our system. Think about it: you make a code change, you deliver it through your pipeline, and then you run these experiments to verify that the system still works as intended, even after this new code change. And if it doesn't behave as intended, you can roll back and make the needed changes to the code.

So let's have a look at that as well, with a demo of a continuous delivery experiment. We're now in the AWS CodePipeline console, our pipeline service. It's a very simple pipeline setup: fetching source from GitHub, deploying to a staging environment, and then deploying to a production environment; the most basic of pipelines, I would say. It's set up using infrastructure as code, so here's that part: we have our pipeline with a source stage, deploy to staging, and deploy to production. But now I'm adding an experiment stage, because I want to run experiments on the staging environment after we've deployed there. Let me just kick off that deployment so it updates the pipeline. The way this works is that the experiment stage kicks off an AWS Step Functions workflow; AWS Step Functions is our state machine service. And this workflow has a definition that simply starts an experiment and then waits for that experiment to finish.

So let's see. All right, our pipeline now has an experiment stage, which means we can make a code change. I'm just editing straight in GitHub in this case, a small change to the application, and committing that change. All right, it kicks off the pipeline: fetching the source, deploying to staging (there are no build steps, so it's fairly quick), deploying to our EC2 instances, and now it reaches the experiment stage. This is now in progress, meaning that it has called the AWS Step Functions workflow, and the workflow starts my experiment. Checking, we can see that the experiment is running, and it will run for a short while. Since this is just a simple demo, it's a quick experiment; usually we run experiments for a longer period of time. It has completed already, and Step Functions sends a message back to the pipeline that the experiment succeeded, so the pipeline can continue and deploy into production.

But what happens if the experiment doesn't succeed? Let's test that assumption as well: release the same change once more and see what happens if the experiment fails. So, releasing it, fetching the source once again, deploying into our staging environment, which is fairly quick once again, and switching over to start our experiment in the staging environment. And if we check the AWS FIS console, the experiment is running, doing whatever actions we've added to the experiment.
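For reference, the start-then-wait workflow the experiment stage calls can be expressed in Amazon States Language. Here's a rough sketch using the Step Functions AWS SDK integration for FIS; the template ID, state names, and polling interval are my assumptions, and the demo's actual definition may differ:

```python
import json

# Hedged ASL sketch: start a FIS experiment, then poll its status until it
# leaves the in-progress states. A stopped or failed experiment fails the
# workflow, which in turn fails the pipeline's experiment stage.
definition = {
    "StartAt": "StartExperiment",
    "States": {
        "StartExperiment": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:fis:startExperiment",
            "Parameters": {"ExperimentTemplateId": "EXT123456789abcdef"},  # placeholder
            "Next": "Wait",
        },
        "Wait": {"Type": "Wait", "Seconds": 30, "Next": "GetExperiment"},
        "GetExperiment": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:fis:getExperiment",
            "Parameters": {"Id.$": "$.Experiment.Id"},
            "Next": "CheckStatus",
        },
        "CheckStatus": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.Experiment.State.Status",
                    "StringEquals": "completed",
                    "Next": "ExperimentSucceeded",
                },
                {
                    # Still in progress: keep polling.
                    "Or": [
                        {"Variable": "$.Experiment.State.Status", "StringEquals": "pending"},
                        {"Variable": "$.Experiment.State.Status", "StringEquals": "initiating"},
                        {"Variable": "$.Experiment.State.Status", "StringEquals": "running"},
                    ],
                    "Next": "Wait",
                },
            ],
            "Default": "ExperimentFailed",
        },
        "ExperimentSucceeded": {"Type": "Succeed"},
        "ExperimentFailed": {"Type": "Fail"},
    },
}

print(json.dumps(definition, indent=2))
```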
But now let's say that something goes wrong and an alarm goes off. We can use the AWS CLI to set my alarm into the ALARM state, so let's run that command; this is just to simulate that an error happens. We can see that the experiment now moves into the stopped state, because it was stopped by a stop condition when the alarm went off. Looking at the pipeline, we can see that it now shows the experiment stage as failed. And most importantly, the pipeline fails and will not move on to deploy this into production.

So now we've looked at three different ways to do automated experiments: recurring scheduled experiments, event-triggered experiments, and continuous delivery experiments. Three different ways to automate our chaos engineering experiments. The question then is, if you get to this point, should you stop doing one-off experiments and the periodic game day? Well, no is the simple answer, you shouldn't. Those should still be the core of your chaos engineering practice. I would say they are a super important source of learning, and they help your organization build confidence. But with automation, you now have yet another tool to help improve the resilience of your system: automated chaos experiments. One way to think about it is that the experiments you start off creating as one-offs, or as part of your game days, can then turn into experiments that you run automated. After doing an experiment manually to start with, it can be set to run every day, every hour, or perhaps on every code deploy.

So that brings us to the summary, with a recap of some of the takeaways. Automation helps us cover a larger set of experiments than what we can cover manually; running manual experiments is quite time consuming, so automation means we're able to run more experiments than in a manual fashion. Automated experiments help us verify our assumptions over time as unknown parts of the system change; think again about those downstream dependencies we might have, different parts of the system that other people handle, deploy, and make changes to. And to do chaos engineering in an automated way, you really need to have safeguards and stop conditions in place to do it safely. So make sure that your alarms and your observability into your system are in place before switching to an automated way of running experiments. And a final reminder: doing automated chaos engineering experiments doesn't mean that you should stop doing manual experiments. They should still be the core of your practice.

If you just can't get enough of chaos engineering for improving resilience, I've gathered the code samples used in the demos and some additional resources for you at the link shown on screen right now. You can either scan the QR code or simply enter grosh.link slash auto chaos. And with that, I want to thank you all for watching. We looked at how to improve resilience with automated chaos engineering. If you have any questions or comments beyond what might already be in the Q&A section, do reach out on Twitter at Gunnar Grosch as shown on screen, or connect on LinkedIn; I'm happy to connect with every one of you. Thank you very much for watching. So let's see if there are any questions. All right, we have one question so far.
So: how is this different from load or stress testing, and is that enough? Yeah, that's a common question, I would say. Load testing and stress testing test exactly those things, how your system behaves under load and stress. And it's not uncommon to use load or stress testing together with chaos engineering. For instance, when we do chaos engineering experiments, we want to run them as close to production as possible; running them in production is maybe the end goal, but that's not always possible. So it's quite common to run your experiments in a staging or test environment and use load testing, or at least generate load, at the same time, to make the system more production-like, so that you actually have traffic running towards the system. But with chaos engineering, instead of simply testing load or doing stress testing, you can create experiments like the one I showed that, for instance, terminate instances, and you can create quite complex experiments that perform different things on your system: removing a downstream API, throttling APIs, using commands to kill a process on a running instance, and things like that. So you might well use load or stress testing as part of your experiments, but chaos engineering is different in how you actually run the experiments.

All right, well, thank you so much, Gunnar, for this session. And yeah, thanks everyone for attending. Thank you all.