Hello. We're going to speak today about Beyond the Monkeys: Chaos Engineering on the Cloud. I'm Bella Wiseman. I lead chaos engineering at Goldman Sachs, where I've been for 10 years. I'm a mother of four, so I've picked up a thing or two about leading in the presence of chaos, and I'm also passionate about growing resilient systems and teams. Sinduja?

Hi, I'm Sinduja Durai. I'm part of the consumer CI/CD team and I lead chaos engineering adoption across the division. I've been with Goldman Sachs for close to a decade now, and I'm very passionate about building solutions that enhance developer experience. Thank you.

So I'll go through the agenda. We'll start with a brief introduction to what chaos engineering is. Then we'll discuss two modes of chaos engineering, centralized and decentralized. We'll talk a little bit about chaos readiness, then move on to actually automating chaos tests. We'll walk through a game day, some fault injections and findings, and then close with takeaways. Next slide.

Okay. Chaos engineering. The way I like to describe it is that it's about finding your next production incident before it finds you. You intentionally create turbulent conditions in production and then use that to reason about, and actually prove out, how the application will behave: figure out what the operational impact will be, how you can mitigate it, and how to improve the system under test.

We look at there being two modes of chaos: centralized and decentralized. Centralized chaos is where you have standard chaos tests that apply to all teams, similar to something like the original Chaos Monkey. Teams that are not ready may choose to opt out, but if they don't opt out, the chaos tests apply to them, and a central platform orchestrates the runs and tracks the execution of these tests. Decentralized chaos is when you allow teams to opt in to running chaos engineering. App teams are empowered to choose which tests make sense for their systems, based on things like prior production incidents or their knowledge of their system under test. App teams own their own tests and decide which tests they want to run, and when and how to run them. You may still want a lightweight central team that can drive adoption, identify standardized tooling that can help across the organization, encourage reuse of code, architecture and patterns, and provide centralized insights and tracking. Next slide.

Okay, so we'll talk about centralized chaos. Like everything in life, centralized chaos has pros and cons. The pros are that a central team can manage the infrastructure and deployments, which simplifies things for the teams whose systems are having chaos tests executed against them. You can also programmatically enforce that all systems in the organization undergo chaos testing, because there is software that will actually go and do it. And it's really good for organizations where there's a high, and fairly even, bar of operational excellence across all the teams in the organization.

The downsides are that there are operational and security complexities to having a central, single point in the middle that has access to multiple systems under test.
By operational complexities I mean that if somebody accidentally runs chaos tests when they're not supposed to, or if there's a bug in the chaos testing framework, the blast radius is very large and it might bring down all the systems in your organization. That's a pretty big risk of centralized chaos. And from a security perspective, if somebody breaches that central chaos account, they again have very elevated access to many of the systems under test.

The second downside is that writing and managing chaos tests for other teams' systems is really hard, because you don't know what makes their system tick. You don't know what's happening in their production environment. You simply lack the knowledge to know when it is or isn't safe to run chaos tests, so you're taking on a bit more risk in that sense. You also need to manage exemptions for teams that are not ready for chaos testing at all, or might not be ready right now. It can also be difficult to get buy-in across the organization to run chaos testing from a central point, with minimal permissions and minimal advance warning; some people might not be comfortable with that. And finally, you may become the team that everyone loves to hate, which might not be a con for some, but it's definitely something to keep in mind.

Now, I would compare this to penetration testing, which I think is a little bit different. In penetration testing you're trying to get access to the system under test without being granted it, whereas in chaos engineering we're deliberately opening up elevated access for another application to go and do things. So here there are risks that go beyond what penetration testing carries.

So now, moving on to decentralized chaos. The good things about decentralized chaos are that each chaos instance has permission to one and only one system under test, which reduces the blast radius of anything that goes wrong. Teams can also manage infrastructure, architecture and permissions according to their own standards. In a large organization there will often be different ways of doing things, all within approved patterns, but still with a lot of variety and diversity in how teams choose to run things. With decentralized chaos, each team can run the types of chaos tests, and set up the infrastructure, in a way that aligns with the rest of their application infrastructure; you're empowering teams to run chaos testing. You also don't have to worry about managing exclusions, because if a team is not ready, they simply won't opt in. And when it comes to the tradeoffs we'll discuss later about whether to run chaos testing in production, each team can decide on their own whether they're ready or not, based on their deep knowledge of their system and some of the other things we'll discuss. I think decentralized chaos is especially good for organizations where many teams are not actually chaos ready, where otherwise you might have to run around getting exemptions for more teams than would actually be willing to go ahead with it.
The downsides are that each team needs to maintain its own infrastructure and code, so the central team might be charged with building tooling or patterns that make this easier. The chaos team also needs to motivate teams to adopt chaos, generally using a more positive approach, because you can't say "thou shalt run chaos." You have to encourage teams to do it and make it exciting. I find there's a lot of grassroots engineering interest; a lot of engineers are really excited about chaos testing. You can also hand out a badge of honor of sorts, some type of mascot or badge that shows a team or individual was involved in chaos engineering efforts. Next slide.

Okay, so now I'm going to talk a little bit about a chaos-ready environment. The first thing I would say is that you should never, ever, ever start in production. Just like your features: you would never push them into production first. Unless you do; I wouldn't advise it, but if you first test out your code in prod, then I guess do the same thing with chaos testing. That's not the practice we follow at Goldman Sachs. So: never, ever start in production. You want to follow the same lifecycle that your features go through: test first in non-prod, maybe do some blue-green deployments, canarying, whatever it is that makes you confident to release your features to production. Those same things should be in place before you start doing chaos in production.

The environment should also be as similar to production as possible. Ideally you would want a prod-parallel environment, but don't get stuck on that. You can see a lot of value from running chaos engineering even in environments that you know aren't quite similar enough to production; often you'll get value from it anyway.

Even in a non-production environment, you still need to be very focused on safety first, because most non-production environments do actually have customers and SLOs. They might not be three-nines SLOs, but if your developers can't push or write or test any code for a week because you ran a chaos test in non-prod, brought down the environment and had no way to bring it back up, that's impact. That's business impact. So you want to make sure you have at least some basic observability and monitoring, and that you keep a small blast radius. An environment that's shared amongst many teams in the organization might not be the right place to start your chaos testing; you might want to start with an environment that's used by only one or two teams, so that if things do go sour, you're only impacting a few bought-in developers and teams. And obviously, before you start, you need to make sure you actually have the ability to recover from the outage without impact you wouldn't want on the environment.

Now, after you've tested in a non-prod environment, you have the data to decide whether moving to production is actually worthwhile. Some of the things to consider: how similar is your test environment to prod? What are the gaps between them? And then think about what the risk of testing in production is.
What is your level of operational excellence? Will you be able to notice and remediate issues before your customers do? Are you confident you can do it without impacting customers? But then you also have to think about the risk of not testing in prod. How are the differences between your non-prod and prod environments going to affect how reliably your chaos findings carry over to prod? And if you don't proactively test in production, what bad thing might happen in prod that you then won't be prepared for? There are risks on both sides and there's no one correct answer, but I think you look at it from a risk mitigation perspective and try to do what's right for your system.

Okay, now I'm going to move on to chaos-ready teams. A chaos-ready team should have a commitment to resiliency and operational excellence. If your team has known resiliency issues that are not being fixed, you would want to fix those first. If you just run chaos testing, you will find more resiliency and operational issues that will go onto that list and also not be fixed. That would suggest there's some underlying issue, perhaps that management doesn't actually value operational excellence in your organization, and if that's the case, you want to address that first.

What's interesting, though, is that chaos engineering might help you drive that change. If you run some chaos tests in non-prod and discover that, hey, if this machine goes down we will have a five-hour outage, you can bring that data to your management and explain: I know we weren't prioritizing getting a backup, or scaling horizontally, but look at this bad thing that can happen. You can then articulate risks to management and to the business, because a lot of the time it's they who do not fully understand the technical risks that underlie operational excellence. You can bring them data and say: we actually tried this out in a non-production environment, and here are the risks we're carrying if we don't prioritize this technical debt. So that can be an important way to leverage chaos even if your resiliency and operational excellence are not where you want them right now.

Some things that are pretty important: actually having a culture of actively acknowledging, embracing and mitigating risk. Not sweeping things under the rug, not pretending that all is well until things blow up; realizing, escalating and talking about risk is extremely important. And most important of all, psychological safety and blameless post-mortems. There will be times when testing in production makes sense from an ROI perspective, from a risk perspective. But if your organization doesn't have psychological safety, the team will simply be afraid to do it.
Because they don't feel safe. If something does go wrong, and nothing we do is ever 100% without risk, they need to know the organization will have their back: that if they mitigated risks appropriately, did what they were supposed to do, made a good decision and got buy-in, people will be kind to them if something bad happens. That's extremely important. Okay, next slide.

We advise people to start simple with chaos engineering: start simple, and start in non-prod. One thing we did recently is we worked with a team that had a system running partially on ECS Fargate, and we encouraged them to simply go and stop an ECS task. It seemed like such a simple thing, and honestly, going in, I figured it was just a way to get started; I didn't think we would learn anything from it. But actually we learned quite a lot. Going in, the team thought that stopping a single ECS task would have no impact on their customers. But, next slide.

What actually happened was they found, again in non-prod, that when they stopped one ECS task, their customers were actually getting errors. The good things, the wins here, are that we proactively discovered these issues early, during the beta phase, before the system was being widely used. And we did confirm that the system was self-healing, the impact was self-limiting, it resolved within three minutes, and the blast radius was small. Nothing terrible happened when that ECS task was stopped. Those were the good outcomes, and we were happy to see that some of the things we were confident in actually did hold. But the real value came from what we didn't expect to happen. If you look here, the red line represents the errors the customers were seeing, which the team did not expect. They're now analyzing why that is: is it due to the load balancer, due to timeouts, due to health checks? That investigation is ongoing. They also discovered that the dashboards they were using to monitor things in production were not complete and didn't actually reflect the experience of the user. That's another finding they're currently following up on and improving. So this was a really great experiment that we learned a lot from. Okay, next slide.

Now we'll talk about how you can transition from running manual chaos tests to actually automating chaos tests. Sinduja?

Thank you, Bella. In line with some of the things Bella spoke about, I want to take some time to walk through the considerations we had in mind when designing a chaos test execution environment. The first one is risk. As you can imagine, a chaos test is going to have significantly elevated privileges over your application, be it simulating infrastructure failures, tampering with your network connections, or anything of that sort. To counter this risk, one of the patterns we have adopted is making sure that the chaos test execution environment itself is ephemeral. What this means is that the environment does not exist once your test is complete; the environment is torn down.
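To make that concrete: the starter fault Bella described, stopping a single ECS task, can be as small as a few API calls. Below is a minimal Python sketch using boto3; the cluster and service names are placeholders rather than the team's actual setup, and in the model Sinduja is describing, an injection like this would only ever run from an ephemeral, narrowly scoped execution environment while you watch the customer-facing error rate.

```python
# Minimal illustrative sketch only, not the team's actual tooling.
# Cluster and service names are placeholders.
import random

import boto3

ecs = boto3.client("ecs")


def stop_one_task(cluster: str = "payments-nonprod", service: str = "payments-api") -> str:
    """Stop one randomly chosen running task in an ECS service and return its ARN."""
    task_arns = ecs.list_tasks(
        cluster=cluster, serviceName=service, desiredStatus="RUNNING"
    )["taskArns"]
    if not task_arns:
        raise RuntimeError("no running tasks found; nothing to stop")
    victim = random.choice(task_arns)
    ecs.stop_task(cluster=cluster, task=victim, reason="chaos experiment: single task loss")
    return victim
```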
We also keep the chaos tests themselves short-lived executions, and we make sure there is no central access to inject chaos; rather, the control to inject chaos sits with the application teams.

The next consideration is repeatability of these experiments. We want these experiments to be easy to repeat. If they remain manual efforts, they end up being one-off, infrequent exercises and therefore don't give very valuable readings. We also want these tests to act as gates on the deployment of our software into production: before your application software gets promoted to production, run the chaos tests in a lower environment, make sure no resilience regression has been introduced, and only then allow the software to be released. Chaos test execution can be scheduled or on-demand, depending on the team. If you have a regular release cycle, you might want to schedule your chaos tests for, say, every Friday, or you might opt for ad hoc, on-demand executions of these tests.

The next consideration is where the control of the chaos test sits. The application team needs to be able to govern when the chaos test is run, where it is run, against which environment it is run, and exactly what test is run. To address these considerations, we have integrated chaos test execution with our CI/CD pipelines, and I'll speak more about this on the next slide.

The last bit is around tooling. One of the main factors in our tooling choice is that we want the tooling to be extensible enough to support any custom faults that are specific to our application or our organization. This is where Chaos Toolkit comes into the picture: it's an open source tool, it comes with a default set of drivers, including drivers for AWS, and it's very easy to author and support custom faults with it.

So, coming to the pipeline for our chaos test execution. Right at the start of the pipeline is when we set up the chaos test execution environment. The block on your right below shows the system under test, which consists of an ECS Fargate cluster, Elasticsearch, Postgres database instances, load balancers, Lambdas and so on. The environment on the left is the resilience test execution environment, and this is the environment that is going to be ephemeral. Right at the start of the pipeline, this environment is spun up using infrastructure as code. Once that is done, the tests are executed, again as part of the pipeline. Because we spoke about having these test executions be short-lived, AWS Lambda is a natural choice for our tooling. Once the tests are executed, these AWS resources, the execution environment itself, are torn down, and the results of the tests are saved in the firm's test evidence repository. And that is the end of the pipeline.

A few things I want to call out here. One is how the privileges and access management work: the test execution environment uses short-lived access tokens to inject faults into the system under test. And in terms of auditing what happened during a test, Chaos Toolkit provides a very detailed journal at the end of its execution, which records exactly what fault was injected, the timestamp associated with it, the outcome of that particular experiment, and so on.
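To give a feel for how those pieces fit together, here is a rough sketch of what a Lambda-based execution step could look like. This is not our production code; the chaoslib entry point and its exact signature are assumptions and should be verified against the Chaos Toolkit version you bundle with the function.

```python
# Illustrative sketch of a short-lived Lambda that runs a Chaos Toolkit
# experiment and hands back its journal. Verify the chaoslib entry point
# against the version you deploy.
from chaoslib.experiment import run_experiment  # assumed public entry point


def handler(event, _context):
    # The pipeline passes in (or points to) the experiment definition,
    # which is the usual Chaos Toolkit JSON document.
    experiment = event["experiment"]

    # Runs the steady-state probes, the fault-injecting actions and any
    # rollbacks, using the short-lived credentials granted to this function.
    journal = run_experiment(experiment)

    # The journal records each activity, its timestamps and its outcome.
    return {"status": journal.get("status"), "deviated": journal.get("deviated")}
```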
This detailed journal is what we save in our repository. In terms of code maintenance, there are two parts to this. The test scenarios themselves are going to be very specific to your system under test, and are therefore maintained by the application teams, whereas the plugins that are authored reside in a shared repository and can be used across the organization.

So now we've spoken about what the test execution environment looks like. How do you decide which faults you're going to inject? One rule of thumb is that we want to run these tests to identify hidden, unknown vulnerabilities in your application; the goal is not to run tests that you already know are going to fail. That might be intuitive and obvious, but we want to call it out. In terms of which faults we run, one very valuable source of information is an analysis of the past production incidents your application has seen. That gives you a very insightful and candid view of the plausible faults your application is vulnerable to: what has happened, and what could potentially happen. And you can of course start small and then think about bigger blast radii.

Broadly speaking, there are five categories of faults you can start off with. Infrastructure faults, as the name suggests, involve failures at the infrastructure layer of your application: single or multiple compute node failures, database failures, database failovers, or wider blast radius failures like making an entire availability zone unavailable. Next are application faults, which sit much closer to your application layer; the kinds of faults here could be random latency or random failures from within your application. Network faults are faults at the network layer, which could be injected latency or packet loss, or tampering with your network connectivity such that connectivity to your dependencies is blocked, whether those are internal or external dependencies. User traffic patterns are about subjecting your application to traffic patterns outside the BAU patterns you would see in production: retry storms, varying payload sizes in your requests, increasing the number of concurrent users, spiky traffic patterns. And the last category is concurrent faults, which is more than one of the faults we've described happening at the same time.

So now that we've analyzed our faults, what exactly does a chaos experiment look like? What is the anatomy of the test? We start with the steady state of your application. This is to make sure that your application is in its normal operational steady state before any fault is injected. Are you within your SLOs, do you have error budget to spare right now? Are you seeing any unforeseen latency issues in your application right now? The idea is to validate this before the fault is injected, because if your application is already in a degraded state, injecting an additional fault is not going to help. The next part is the fault injection, which is the actual unfavorable scenario that we want to introduce into the application, the faults we discussed previously. And the last section is the rollback.
The rollback is there to make sure that you leave your application in the same state it was in at the start of the test. This is an optional section, because not all scenarios require an explicit rollback; there are situations where your application auto-recovers from the fault and no rollback is needed.

Here is a short sample Chaos Toolkit experiment that fails over a database (a rough sketch of what such a definition looks like appears after the findings discussion below). As part of the steady state, you can see that we make sure the database cluster is indeed healthy, apply some amount of synthetic load to the service, and make sure that error rates and latencies are within the acceptable thresholds. The method section of the experiment fails over a particular database cluster, and since the failover doesn't need any explicit rollback, the rollback section is empty.

Now we'll talk about what a game day is and what sort of findings we typically get out of running these chaos experiments. A game day is about getting your development team, your DevOps folks, the people who are going to support your application day to day in production, all together, either in person or virtually. Based on a predefined set of scenarios, you inject a particular fault into your application and have the team reason about the application's behavior: is it behaving as expected in this scenario? If it is not, does the team know how to recover the application? What is your mean time to detect, what is your mean time to recover from that fault, and so on. It is a very valuable exercise, both in terms of figuring out any process gaps in our incident handling, any training gaps among the people involved in incident handling, or simply resiliency gaps that already exist in the application that we were unaware of.

This game day was run against a system under test consisting of seven microservices across five AWS accounts and seven VPCs, in an environment that was very close to production, but not production. It was spread over two days and involved three regions, 13 participants, 14 scenarios and 15 findings, and we'll dive into the details of those findings. The findings spanned architecture improvements, runbook updates, observability gaps, monitoring gaps and so on. At the end of the day, what we want out of this is the right findings, so that we reduce our mean time to recover or remediate from a specific failure, and reduce our mean time to detect unfavorable scenarios.

Coming to the findings, we'll start with architecture improvements. As the name suggests, this is about introducing additional components into the architecture itself to counter certain unfavorable scenarios. One particular finding: when an external dependency of our application was blocked, we noticed that this left the application somewhat vulnerable to unfavorable traffic, which prompted a discussion on the right integration pattern to have with that particular dependency. Another type of finding is identifying single points of failure: if you have a data store as a single point of failure, you might want to think about the caching layers to introduce to improve the availability of the data.
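Before the next categories of findings, here is roughly what the database-failover experiment described a moment ago looks like as a Chaos Toolkit definition, written as a Python dict that would be serialized to JSON for the runner. It is a hedged sketch, not the exact game day experiment: the steady-state probe points at a hypothetical internal plugin, and the failover action's module and arguments should be checked against the chaostoolkit-aws driver documentation.

```python
# Illustrative Chaos Toolkit experiment definition (as a Python dict that
# would be serialized to JSON). Module, function and argument names are
# assumptions, not the exact experiment from the game day.
db_failover_experiment = {
    "title": "Service tolerates a database cluster failover",
    "description": "Fail over the primary and check error rate and latency stay within threshold.",
    "steady-state-hypothesis": {
        "title": "Cluster healthy, error rate and latency within threshold",
        "probes": [
            {
                "type": "probe",
                "name": "error-rate-within-threshold",
                "tolerance": True,
                "provider": {
                    "type": "python",
                    "module": "acme_chaos.probes",   # hypothetical internal plugin
                    "func": "error_rate_below",
                    "arguments": {"service": "orders-api", "threshold": 0.01},
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "failover-db-cluster",
            "provider": {
                "type": "python",
                "module": "chaosaws.rds.actions",    # verify against the chaostoolkit-aws docs
                "func": "failover_db_cluster",
                "arguments": {"db_cluster_identifier": "orders-aurora-nonprod"},
            },
        }
    ],
    # The failover auto-recovers (a new writer is promoted), so no explicit rollback.
    "rollbacks": [],
}
```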
The second category of findings is around observability, monitoring and capacity enhancements, and the gaps there. This is along the lines of: do you have the right alerts and alarms in your environment to detect when issues happen? Are the alarm thresholds right? Is the timing right? One observation we had was that when there was a compute node failure, it took close to seven minutes for an alarm to fire, which is a pretty large gap and only results in an increased mean time to detect. Another theme we observed was around health check frequency and the effectiveness of the health check: if your health check frequency is too low, you're going to see an increased error rate, and if the health check itself is not an effective one, that again results in a higher error rate. In terms of capacity enhancements, there was a very interesting finding when we simulated a spiky traffic pattern with a sudden burst of incoming requests. While our application is configured for auto-scaling, the auto-scaling configuration was not tuned right, and it took too long to scale up to the maximum number of nodes. This scenario allowed us to tune the scaling configuration and achieve an improvement of about 50% in the scale-up time.

The last finding category is around documentation and runbooks. One of the findings, when we brought down an external dependency, was that the escalation paths and contacts for that dependency need to be clearly documented. The alerts that go out from the application need to have the right verbiage, so the teams know where to look. And the runbooks need to be kept up to date: typically the application evolves much faster than the runbooks, which makes the runbooks redundant and outdated, and that doesn't really help at the time of a production incident. So that is a summary of the findings we had from our chaos test executions and game days.

Now, to round off what we have spoken about so far: a lot of what we've discussed ties into some of our Goldman Sachs engineering tenets, which are around looking around corners, making sure you're prepared for the unknown and for uncertainty in your environment, and designing resilient systems. Iterate incrementally: start by doing something, maybe manually, and think about how you automate and improve it. And of course, along this journey, keep learning and have fun. Thank you.