Welcome to Performing Chaos in a Serverless World. We are very happy to have Gunnar with us. Gunnar, thanks a lot for doing this for Edge of the India 2020. And without further ado, Gunnar, over to you.

Thank you. Yeah, thank you very much. There we go. All right, so welcome to Performing Chaos in a Serverless World. My name is Gunnar Grosch, and I am a senior developer advocate at Amazon Web Services. And today we're going to look at chaos engineering, particularly for serverless applications. So let's jump into it straight away. This is the abstract of the talk: the principles of chaos engineering have been battle-tested for years using traditional infrastructure and containerized microservices. But how do they work with serverless functions and managed services? That's what we're going to try to find out in this session. The agenda is quite brief. Since this isn't the first session on chaos engineering today, I'm just going to share my thoughts on what chaos engineering is in a few slides, before looking at a few of the motivations I see behind doing chaos engineering. And then we'll jump into the fun part, the serverless part of the presentation, and how to do serverless chaos experiments before we do a few demos. So we're going to look at some practical ways of doing these experiments. And just briefly about me: as I said, I'm a senior developer advocate at Amazon Web Services. My background is in development, operations, and management within IT, and I've been in the industry for about 20 years now. I work a lot with the communities in the Nordics, helping out with both organizing and speaking at different events, particularly around serverless applications. And I have three kids at home, which I think is the reason I got into chaos engineering in the first place.

So let's kick it off. What is chaos engineering? Well, chaos engineering isn't about breaking things. We often hear the phrase "breaking things on purpose," and I think it's great marketing lingo; it draws attention to the practice. We do break things now and then when doing our experiments, but we have to agree that the breaking part isn't the purpose. Learning is the purpose of doing chaos experiments. In short, chaos engineering is about finding the weaknesses in a system and fixing them before they cause the system to break. Because no matter how much we focus on making our systems resilient and reliable, there are always unknown factors that come into play. It can be traffic patterns, third-party dependencies, different network issues, our code deploys, configuration changes, and so on. By doing these chaos experiments and measuring the results, we're able to draw out the weak points in the system. Finding and fixing those weak points helps us avoid those big outages, which we know never happen at a convenient time. But perhaps even more important is that chaos engineering is about building confidence. It's about building confidence in the system, how it works and how it behaves, but also about building confidence in the organization. People are really key when we want to build resilient systems. So building confidence in the organization, in how they handle the system and handle failures, is a way to build trust. Perform your chaos experiments and you will learn new things about your applications. So, a few notes on the motivations behind chaos engineering as I see them.
Everyone who builds or runs a system has customers, be it internal or external. The question is: are your customers getting the experience they should, or are your users unhappy? When we build a system, we often say that nothing on the internet is free. But it doesn't matter whether you run a big e-commerce site where you sell things, an ad-driven blog, or a SaaS solution of sorts: downtime or issues usually cost you money. That can be decreased sales, or it might be users leaving your platform because they are unhappy. And what happens when an incident occurs? If there is a failure in your system, are you confident that monitoring and alerting will actually notice it? Will the on-call engineer get the page they need to start engaging with the incident? You probably have runbooks or playbooks that describe how to act when incidents occur, but is the organization actually ready to handle these outages? You probably all do your fire drills, so everyone knows how to act in case of fire in the building, but are you doing fire drills for outages? Also, every time there is an incident, we have a huge opportunity to learn about the conditions that allowed the incident to take place. Chaos engineering is a great way for us to learn from incidents, and to learn from them in a controlled manner. And that brings us to perhaps the greatest motivation behind doing chaos engineering, for me at least. Don't ask what happens if a system fails. Ask what happens when it fails. Because we have to remember that a resilient system isn't one that does not fail; it's one that maintains an acceptable level of service in the face of failure. So what happens when the system fails? Well, chaos engineering helps us reveal that. And if nothing else I've said so far has motivated you to do chaos engineering, perhaps this statement, which actually comes from the reliability pillar of the AWS Well-Architected Framework, will: chaos engineering should be done regularly. That's part of the reliability pillar, the one that tells us how to build reliable systems.

All right, so let's jump into the serverless part of this. Let's look at how we can do chaos experiments for our serverless applications and workloads. Because, to be honest, creating experiments for instances, containers, and so on can be fairly easy: we can shut down instances, destroy pods, cripple network traffic, and so on, because we have control over most of the underlying infrastructure. But when we design our experiments for serverless, it's a different beast, and we have to be a bit more creative when designing them. So let's look at how we can start building them. First off, of course, we need to design our experiment and decide what we should test. When creating experiments, we can start by looking at some of the common weaknesses we see in architectures. Errors, for instance: are we handling errors correctly within our application? It doesn't matter whether error handling is inside our code or a feature of the service we're using, we'd better make sure that we're handling errors correctly. And different releases from AWS, like dead-letter queues for SNS, for instance, are a great way of handling errors, because AWS takes care of the error handling. But by doing chaos experiments, we're able to test that it actually works the way we've implemented it.
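As one concrete example of that kind of built-in error handling, an SNS subscription can redrive failed deliveries to an SQS dead-letter queue, and a chaos experiment can then verify that failed messages really do end up there instead of being lost. A minimal CloudFormation-style sketch, with hypothetical resource names (the queue policy that allows SNS to send to the queue is omitted for brevity):

```yaml
# Sketch: SNS subscription with an SQS dead-letter queue (hypothetical names).
Resources:
  ImageTopicSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Ref ImageTopic
      Protocol: lambda
      Endpoint: !GetAtt ImageFunction.Arn
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt ImageDlq.Arn   # failed deliveries land here
```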
For Lambda functions and our dependencies, in the form of other AWS services or third parties, we need and want to get our timeout values right. And in most cases they probably are correct while we're in what we call the steady state, the normal operational state of our application. But what happens if there are issues, for instance latency, within our application? Do we have the right timeout values then? And with event-driven architectures becoming more and more common (I had a talk about that at the conference yesterday), how we handle the events in our application is really key. Are we queuing events correctly? What happens to events if our application fails? And when we use services or third-party dependencies, it usually means that we trust them to be there. So do we have ways of doing fallbacks or graceful degradation when they are not? And failovers. Having failovers doesn't mean that we expect a regional outage; it might mean that we're able to fail over to a location that's closer to our users, for instance if a major ISP has networking issues so that our users are unable, or find it harder, to reach a specific region. So testing failover is another way we can do our chaos experiments. These are just some potential weaknesses we can see in serverless applications, and there are, of course, a lot of others as well when we design and architect our applications.

So let's look more practically at how we can do serverless chaos experiments. This is a simple architecture: a web service where we have a client calling API Gateway, the managed API service, in front of one Lambda function that retrieves or stores data in DynamoDB, and another Lambda function that stores or retrieves data in Amazon S3, the object storage. Alongside this, we have another S3 bucket that is used directly by the client, for instance for static resources. We can start off by, for instance, injecting errors or creating exceptions within our code to see how our application handles those errors or exceptions. We can remove different downstream services to see how we handle that. We can alter the concurrency of our functions, for instance to simulate that we aren't scaling the way we intend. Or we can restrict the capacity of tables to see how the application handles those types of issues. Other examples: we can inject errors into security policies, for instance restricting access to certain resources. We can create CORS configuration errors, or, of course, any type of service configuration error, to simulate that we, for one reason or another, have configuration errors when we deploy or reconfigure our services. Or, if we're using the disk space within our Lambda functions, the /tmp space, we can create experiments where we fill the disk, so that we're unable to store the data we perhaps want to cache or manipulate with the help of our Lambda function, to once again test how our application behaves when those failures occur. And then, perhaps, on to the mother of all serverless chaos experiments: the latency experiment, where we add latency to our functions. We can do this to test several different things. For instance, we can simulate cold starts.
A cold start happens the first time our Lambda function runs, or when the application scales and more execution environments are spun up for our Lambda functions; each of those first invocations takes some extra time. So we can test the behavior of our application to see that the user is still getting a good experience. Or we can, of course, simulate different types of provider issues. We can simulate runtime or code issues. We can simulate integration issues, for instance latency to downstream or third-party services, by injecting latency, and use it as a way to test the timeouts of our Lambda functions, to see that the timeout values we're using are the correct ones even when we're not in that steady state, the normal operation of our application.

So these are examples of different chaos experiments we can do for serverless applications. To do them, well, since we don't have access to the underlying infrastructure, like we perhaps do when we're running EC2 instances or containers in one of the container services, we instead use libraries. The one we're going to use for our experiments in this session is called failure-lambda. It's a library for Node.js that I've created, and it is, of course, open source and available for you to use. It's an NPM package for your Node.js Lambdas. But if you're using Python, don't worry: there is another package called chaos_lambda, created by my good friend and colleague Adrian Hornsby, with more or less the same functionality but for Python functions instead. You configure the failure-lambda package using a parameter stored in Parameter Store, so we're able to change the configuration, enable, disable, and so on. And we have several different failure modes to choose from. We can inject latency, as in the example experiments I mentioned before. We can set status codes: instead of our Lambda function returning a 200 OK response, we can inject different status codes, for instance 404, 502, 301, and so on, to test how our application handles those errors. We can throw exceptions in the code. We can run disk space experiments to see what happens if the /tmp space is full and we're unable to store items there. Or we can use a denylist, which means that we intercept network calls to, for instance, downstream services and block them, so that our Lambda function can't call a third-party dependency that we're using. And it's fairly easy to get started with: you just install the package and then wrap your Lambda handler with it. We'll look at it in the code shortly. And then we have the configuration. As I said, it's stored as a parameter, and it's basic JSON, where we're able to enable, disable, and set the different parameters for our failure-lambda package.

So let's jump straight into the demo part to see how we can do this. To make it a bit easier and more visual, I've created this simple site called the Serverless Chaos Demo site, quite a descriptive name. It's a basic website that I use as a way of demonstrating how we can do these experiments. It's a site that loads images, and we're able to inject failure into it. This is the basic architecture of the application, and it's similar to the one we saw before.
It has an API Gateway, and behind that we have three different Lambda functions, so that we can inject failure into them one at a time. Behind those, as a downstream service, we have Amazon DynamoDB for storing items. What happens is that every five seconds the client calls these three Lambda functions through API Gateway. The Lambda functions in turn fetch a random item from the DynamoDB table and return it to the client. The client then fetches a new image based on the new URL returned through API Gateway. So looking at the architecture again, what type of experiments can we do here? Well, to create our experiments, we usually talk about using what-ifs. We ask these what-if questions. What if my function takes an extra 300 milliseconds for each invocation? What if my function returns an error code? And what if I can't get data from the downstream service, in this case Amazon DynamoDB? By asking these, we're able to design our three experiments: create a hypothesis, form the experiment, and actually run it. So let's do that. The hypothesis in this case might be: if I inject failure into my Lambda functions, my application will use graceful degradation. Let's see if we can prove that hypothesis.

All right, switching over. This is the Serverless Chaos Demo site in action. As you can see, we have function one, function two, and function three. Every five seconds they reload, which means they call API Gateway, which invokes the Lambda function, which fetches data from DynamoDB. And as you can see right now, all of them are getting a 200 response back and a new URL, so they can load a new image. The invocation time is about 200 to 300 milliseconds, somewhere around that. The three Lambda functions that we have, we can see them here in the Lambda console. Let's open function number one. Here we go. So here the failure-lambda package is installed, and we have the function handler here. We're basically just wrapping the handler with failure-lambda, which means that everything inside that wrapper will be subjected to the failure when we enable our experiments. And as I said, the configuration is a parameter that we're storing in AWS Systems Manager Parameter Store. So let's look at the parameter for function number one. Let's zoom in a bit. This is the first parameter; as I said, it's basic JSON. First off, we can see that it's disabled right now. I've set the failure mode to latency, so that we're able to inject latency into our application, with a minimum latency of 100 milliseconds and a maximum latency of 400 milliseconds. So each time failure is injected, the added latency will be somewhere between 100 and 400 milliseconds. Let's enable that chaos experiment and see what happens in the application. Going back to the application, let's see function number one. Let me just redo that edit; the value shouldn't be quoted. Let's try again. So, the parameter is updated. And let's now look at function number one. There we go. Now we've enabled the experiment. And for each invocation, well, it's still returning a 200 response, and we're still getting new images. But as we can see on the invocation time, each invocation takes between 100 and 400 milliseconds longer. In this case, my application didn't break. Users would still be happy, I guess, because they're getting new images as the application intends; it just takes a bit longer.
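For reference, this is roughly what that wiring looks like, based on the failure-lambda README; check the package documentation for the current field names:

```javascript
// Node.js Lambda handler wrapped with failure-lambda.
// Everything inside the wrapper is subject to the configured failure injection.
const failureLambda = require('failure-lambda')

exports.handler = failureLambda(async (event, context) => {
  // Normal function logic goes here, e.g. fetching an item from DynamoDB.
  return { statusCode: 200, body: JSON.stringify({ message: 'ok' }) }
})
```

And the Parameter Store value is plain JSON along these lines. Only the fields relevant to the chosen failureMode apply on a given invocation, and rate controls what share of invocations get the failure:

```json
{
  "isEnabled": true,
  "failureMode": "latency",
  "rate": 1,
  "minLatency": 100,
  "maxLatency": 400,
  "exceptionMsg": "Exception message!",
  "statusCode": 404,
  "diskSpace": 100,
  "denylist": ["s3.*.amazonaws.com", "dynamodb.*.amazonaws.com"]
}
```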
So that's, I guess, an example of an experiment that doesn't break the application. It still works, just a bit slower in this case. The next step would perhaps be to increase that latency, to add even more, and see how that is handled. So let's look at function number two instead. Function number two is also disabled for now. The failure mode is set to status code, which means that we're able to inject a specific status code. As we can see in the application right now, it's getting a 200 OK response back from API Gateway, but now we're able to inject some other status code. I've set the status code to 404, which means that it will return a 404 Not Found to the client instead. I've also set the rate to 0.5, meaning that it will inject failure on about half of the invocations; one means all invocations, 0.5 about half. And that is perhaps the way an application would behave during failure: it wouldn't give an error every time, but on some of the invocations. So let's enable it, setting it to true, saving, and moving back. All right, we can see that it's already invoked the function once more and got a 404, which means that it wasn't able to load a new image. The next one was a 200, happy users once again, but then we get a 404, no new image. So the application doesn't handle the error responses in a good way. The user is subjected to the error and perhaps doesn't get the intended function of the application. So there is room for improvement here, something our experiment reveals and that we're then able to solve in a better way.

So let's look at function number three. The parameter for that is disabled for now, and the failure mode is set to denylist, meaning that we're able to intercept and block network connections. In the denylist right now I've put S3 and DynamoDB. We know that we're using Amazon DynamoDB as a downstream service, so any call to it would then be subjected to this failure, a denied network connection. The rate is set to one, so it applies to all connections. Enabling it and moving back. So let's see. Function number three is now unable to get a new image. Instead, we get a 502 error back from API Gateway, because our Lambda function is unable to fetch data from DynamoDB. Once again, we can see that my application doesn't really handle these errors in a good way. The user is subjected to the error and doesn't get a new image. So once again, unhappy users, I'd say. Those are three basic examples of how we can quite easily get started doing our chaos experiments. And remember that we don't have to do this in production. This is something we can do in a test or development environment to see how the application behaves with these types of errors.

So let's move back and look at another example. In this case, we have another, let's call it simple, web service, but one that uses a downstream service that it calls over an API. The client calls API Gateway, which in turn invokes a Lambda function. The Lambda function then uses a downstream service. This could be any type of third-party API that we use to do something with the data within our Lambda function. In this case, our downstream service isn't really reliable; it has issues every now and then. So I want to improve my application to handle those errors.
So what we can do then is use a specific pattern, the circuit breaker pattern. A circuit breaker is basically a way of using graceful degradation within our application. It checks the calls to the downstream service, and as long as they're successful, that's great; we store that state in DynamoDB so that we know it's still working. But as soon as we're getting failures and are unable to reach the downstream service, we open the circuit breaker and stop any calls, so that we don't keep trying to reach that downstream service. That's what's called a circuit breaker. We also have the option of using a fallback. In this case, the fallback can be, for instance, a cached response, the last successful response, or a static response, one that is always used when the downstream service is failing. So when there is a failure, we return the fallback instead. And in this case, we of course want to test this behavior. We've implemented our circuit breaker, and now we want to test it. The downstream service we're using here is actually one of our own simple web services: API Gateway, AWS Lambda, and DynamoDB. So when we inject failure into that Lambda function, the way we did in our previous demos, we can test our circuit breaker to see how it actually works.

All right, so let's give that a try. Let's open another site. This one, once again, has a quite descriptive name: the Serverless Chaos Demo circuit breaker site. It's more or less the same functionality as before, but instead we're calling a downstream service to fetch those images, to fetch the URL for an image. You remember the architecture: a Lambda function using a downstream service through an API. And we have the circuit breaker in place. Let me open up the code so I can show you how it works. Zooming in a bit, all right. This is our Lambda function, the one that's calling the downstream service. We've added the circuit-breaker-lambda package; that's one way to implement our circuit breaker. And this is the functionality for calling that unreliable downstream service. We call it with an HTTPS GET, and if we get an OK response back, a 200, that's great; we just return it as intended. If we're not getting that, we instead use a reject to send an error back, and then we use the fallback function, which in this case returns a static image instead of the response we would normally get from the API. We've also set a few options for the circuit breaker: we have a failure threshold of three, so when we've counted three errors, the circuit opens and we use the fallback. And every now and then it will test whether the downstream service is working again and close the circuit breaker if possible. And then we have the downstream service. Well, that's basically a Lambda function with failure-lambda installed, so we're able to inject our errors. Okay, hopefully that made sense. Looking at the site right now, it's working. It's calling the downstream service, and the downstream service is working as intended. So let's enable failure on it. We have a specific parameter for that, of course. It is disabled right now.
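Before we flip that parameter, to make the pattern concrete, here is a minimal circuit breaker sketch in Node.js. This is illustrative only, not the exact code or API of the circuit-breaker-lambda package, and it keeps state in memory for brevity, whereas the demo persists state in DynamoDB since Lambda execution environments are stateless:

```javascript
// Minimal circuit breaker sketch (illustrative; not the exact
// circuit-breaker-lambda API). State is in memory here for brevity.
class CircuitBreaker {
  constructor (request, fallback, options = {}) {
    this.request = request                              // async call to the downstream service
    this.fallback = fallback                            // async fallback, e.g. a static response
    this.failureThreshold = options.failureThreshold || 3
    this.retryTimeout = options.retryTimeout || 10000   // ms to wait before probing again
    this.failureCount = 0
    this.state = 'CLOSED'
    this.nextAttempt = 0
  }

  async fire (...args) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        return this.fallback(...args)                   // circuit open: degrade gracefully
      }
      this.state = 'HALF_OPEN'                          // time to probe the service again
    }
    try {
      const response = await this.request(...args)
      this.failureCount = 0                             // success closes the circuit
      this.state = 'CLOSED'
      return response
    } catch (err) {
      this.failureCount += 1
      if (this.failureCount >= this.failureThreshold) {
        this.state = 'OPEN'                             // too many failures: open the circuit
        this.nextAttempt = Date.now() + this.retryTimeout
      }
      return this.fallback(...args)
    }
  }
}

// Hypothetical usage inside a handler:
// const breaker = new CircuitBreaker(callDownstreamApi, staticImageFallback)
// exports.handler = async (event) => breaker.fire(event)
```

With something like that wrapped around the downstream call, let's get back to the failure parameter for the downstream service.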
I'm going to use the failure mode status code and return a 502, so that the upstream Lambda function gets a 502 back in response. Let's set enabled to true and set the rate to 0.5, for instance, meaning we'll inject failure on about half of the invocations. So let's save it and quickly jump back. Now, when it calls the downstream service, we should, hopefully, there we go. Now it injected failure, which meant that the circuit breaker used the fallback, and it has counted one error. Hopefully we'll get more errors shortly. There we're able to see the second error using the fallback, because the downstream service is unreliable now that we're injecting failure into it. And there we got another fallback, and then another, which means we've reached the threshold of three, so the circuit breaker is now open and it will use the fallback image every time until it retries the downstream service. If we disable the experiment again now, then after a while, when it retries the downstream service, it will close the circuit again so that we're able to use the downstream service. All right, so hopefully that made sense. There we go, now it tried the downstream service again. So this was an example of how we can use chaos experiments to test that the things we've put in place, in this case the circuit breaker, actually work as we intend them to. And we can then add different types of failure modes to see how it behaves. For instance, adding latency to the downstream service: how do we handle that in our upstream service?

Cool. So back here. The question then is: what's next? We've now seen how we can do chaos experiments for serverless applications, and that it's quite easy to get started. It doesn't take much effort to do experiments where we learn how our application behaves, improve it, and build confidence that it works as intended. Well, remember that quote from the reliability pillar of the AWS Well-Architected Framework. I actually cut it off a bit, because it's longer. This is the full version: chaos engineering should be done regularly and be part of your CI/CD cycle. So not only should you do chaos experiments, you should do them as part of your deployment and delivery. And that is a really cool statement, because it means going from the one-off experiments we of course start with to experiments that are more automated and part of the way we do deliveries and deployments on a day-to-day basis. So let's look at how we can do that as well. Once again, our simple web service: API Gateway, AWS Lambda, and Amazon DynamoDB. But I've added my, let's call it fairly basic, deployment pipeline, where I have my code in CodeCommit, a pipeline for it, and CodeDeploy to deploy it into my application. I won't go into the details. So what can we do with this to run our chaos experiments as part of our CI/CD? Well, we have the same what-ifs as before. What if my function takes an extra 300 milliseconds? What if my function returns an error code? What if I can't get data from DynamoDB? We still ask the same questions. But now, as I've said, we want to do it as part of our deployment strategy and delivery. So what we can do is add a step where we actually enable our experiment as part of CI/CD.
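In its simplest form, that step can just flip the failure-lambda parameter from the pipeline. A sketch of what it could look like with the AWS SDK for JavaScript; the parameter name and region here are hypothetical:

```javascript
// Hypothetical pipeline step: enable the experiment, verify, then disable it.
const AWS = require('aws-sdk')
const ssm = new AWS.SSM({ region: 'eu-west-1' })

async function setExperiment (isEnabled) {
  await ssm.putParameter({
    Name: 'failureLambdaConfig',   // hypothetical parameter name
    Type: 'String',
    Overwrite: true,
    Value: JSON.stringify({
      isEnabled,
      failureMode: 'latency',
      rate: 1,
      minLatency: 300,             // the "extra 300 milliseconds" what-if
      maxLatency: 300
    })
  }).promise()
}

// await setExperiment(true)
// ...run smoke tests / compare CloudWatch metrics against thresholds...
// await setExperiment(false)     // then continue the deployment, or roll back
```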
So when we deploy something, we enable an experiment and then test or verify that the application behaves as intended. Of course, we don't want to break things here; we don't want users to be unhappy. But we want to enable the experiment and make sure the application still works as intended. For instance, we inject a certain amount of latency into the application and then monitor our metrics to see that it still works as intended. Or, if we have the circuit breaker or other reliability measures in place, we can of course measure those as well to make sure that our users remain happy. So we enable the experiment as part of the deployment, measure against the thresholds we've set, and then either disable the experiment and continue with the deployment, or disable the experiment and roll back the deployment because we found issues and the new code didn't work as intended. And of course, this doesn't mean that you have to do it at the same time as you're releasing new features. You can do it just by enabling and disabling chaos experiments, using the same version as you had previously. So that's a default deploy: everyone would be subjected to that chaos experiment.

Another way of doing it, instead of the same type of deploy as before, is a canary deploy. Instead of everyone getting that failure injected, we do a canary deploy where we enable the experiment for a certain share of users; for instance, 5% of users should get this failure injected into their invocations. That way we're able to observe the difference between the majority, who didn't have the chaos experiment enabled, and the ones who did. And if it works as intended, fine: disable the experiment and continue with the deployment. Otherwise, roll back to the previous version and fix the issues you found while doing your experiments. And then a third option is to use new versions with the help of feature flags. We create a deployment that deploys a new version, a new version of the API or of the Lambda function, for instance, and then, by using feature flags, for example with AWS AppConfig, we enable the experiment for certain users. In the same way, we observe and monitor the behavior of the application, and if we're below the thresholds we've set, fine: we remove the experiment and continue with the deployment, in this case continuing to roll out the new feature to the remaining users. So those are three different ways we can do chaos engineering as part of our deployment pipeline.

So let's quickly look at an example of that as well. Here we go. We're going to do this using the Serverless Framework, a framework for deploying serverless applications. In this case I'm running it locally on my computer, but this would, of course, be something you could put into your pipeline and have as part of the entire CI/CD flow. Here I'm doing it manually, just so that we can see exactly what happens. So this is, as I said, the Serverless Framework. We're using a YAML file that describes all of the infrastructure involved in the application. We're going to use a different site for this. Let's open that.
It's called the Serverless Chaos Demo canary site. Also a good name. Sure, this is just one web page with one user on it, but I want to simulate that we have 12 different users. So we have users 1 through 12, all of them calling that basic web service once again to fetch data from DynamoDB and load a new image. And it seems to work for these 12 users; they're all happy right now. So now we want to enable an experiment for them using a canary deploy. I'm using the Serverless Framework and a plugin that helps me with canary deployments; it makes things a bit easier, and in the background it uses CodeDeploy to perform the deployment to AWS Lambda. I've set my deployment settings to a canary with 10% to begin with, and then after five minutes it should continue with the rest of the deployment, unless we roll back and cancel the deployment. Instead of having one parameter that I change to enable the experiment, in this case I have two different parameters: one with the experiment enabled, the one that the new 10% of users will get, and one that is disabled, the one that all of the users have right now and that 90% of users will have as soon as we start deploying. For the enabled one, it is set to true, the failure mode is set to status code, and we're going to inject, let's see, a 502.

All right. Since this runs as an entire deployment, it might take a few seconds or a bit more, so hopefully it's quick. Before we kick it off, I have one thing left to do. I'm going to deploy this to the stage develop in the region eu-west-1, that's Ireland. And I of course need to make the change so that we're deploying the enabled version of the parameter to those 10%. And we kick off the deployment. As I said, this takes a bit longer than enabling and disabling in Parameter Store, so you'll just have to bear with me. Right now we still have 12 happy users. Let me open CodeDeploy as well, so we can see when the magic starts happening. This creates CloudFormation in the background; it's updating the CloudFormation stack with the new version and using a canary deploy with CodeDeploy. Let's see; the stack is still updating. Now it should, hopefully, there we go. Now we've started a new deployment, and it's in progress, as we can see. It's done the first step, the pre-deployment validation. That's fine. Then it started with the second step, the traffic shifting, and it's done the first part, which was 10%. That means that 90% of users, or invocations, get the original version and 10% get the replacement. So let's look at the application. All right, we can see user two getting an error, user eight getting an error, now user 12 getting an error, user three getting an error, and the rest are getting 200s. So it works; we're able to do it as part of our canary deploy. But it also shows us how canary deploys work with AWS Lambda. Since Lambda is stateless, we don't have control over which user gets the failure injected, so it's still possible that all 12 users see an error at some point. But we have a smaller blast radius, in that it's only 10% of invocations, rather than a fixed 10% of users, that have the error injected.
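For reference, the canary setup in the serverless.yml looks something like this. This is a sketch assuming the serverless-plugin-canary-deployments plugin, and the function and handler names are hypothetical:

```yaml
# Sketch: canary deployment settings in serverless.yml, assuming the
# serverless-plugin-canary-deployments plugin, which drives CodeDeploy
# traffic shifting between Lambda versions behind an alias.
plugins:
  - serverless-plugin-canary-deployments

functions:
  getImage:
    handler: handler.getImage
    deploymentSettings:
      type: Canary10Percent5Minutes   # shift 10% of traffic, wait 5 minutes, then shift the rest
      alias: Live
```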
So now, as I've said, we're able to measure whether the application works as intended. In this case, the user is subjected to the error and doesn't get a new image loaded, so probably not. But if we had implemented ways of making it more reliable and resilient, perhaps with the circuit breaker in place, with graceful degradation and so on, it might still work as intended, and we could then continue with the deployment to the rest, the other 90% of invocations. All right, so that was another example of what we can do, and perhaps of what's coming next: making it part of our CI/CD.

So, in summary: chaos engineering helps us find weaknesses and fix them before things actually break. And chaos engineering really is about building confidence, in the system and in our application. As the AWS Well-Architected Framework says, chaos engineering should be done regularly and be part of your CI/CD. Don't worry if you're not there yet; that's something to aim for. You can start off with smaller, one-off experiments. And just remember that doing chaos experiments isn't rocket science. It's definitely something that everyone can do, and having people do these experiments helps us build better applications. If you want more on the subject, check out the reliability pillar of the AWS Well-Architected Framework. It has some really useful information about chaos engineering, of course, but above all about reliability and how we build more reliable applications. Check out the serverless chaos demo app, the one I used in the first experiment. If you want to try the failure-lambda package, it's available on GitHub and as an NPM package. If you're more into Python, that's fine: you can use chaos_lambda instead, the package that Adrian Hornsby created. And if you want to see how you can use circuit breakers within your Lambda functions, the circuit breaker package is available as an NPM package as well. And if you want to try some labs around serverless chaos engineering, my good friend Jason Barto has created a serverless chaos lab that you can use. So do check that out. And with that, well, my name is Gunnar Grosch. I am a senior developer advocate at Amazon Web Services. I want to thank you for being with me today. If there is anything, just reach out on Twitter or contact me on LinkedIn. All right, thank you very much.