Hi everyone. Today we're going to be talking about "Cloudy with a Chance of Chaos: Verifying the Resiliency of Cloud-Native Applications." First, a little about myself. I'm Bella Wiseman. I work at Goldman Sachs, and I have 10 years of experience in financial services technology. My mother was also a software engineer, so that makes me a second-generation woman software engineer.

Let's go through our agenda. We'll start with what chaos engineering is. Then we'll go over chaos readiness. We'll go through a case study of a chaos experiment we actually ran on a real Goldman Sachs system, and then we'll go through some takeaways.

So, chaos engineering. You'll find official definitions online; this is my definition: find your next production incident before it finds you. That means deliberately doing the kinds of things that cause production incidents so that you can determine what the impact will be. You start in a lower environment, but eventually you do something like injecting a bad trigger that might cause a production incident. Hopefully the impact won't be a prod incident, but if it is, you're able to actually see what that impact is. The classical example, which is how this whole discipline started, is bringing down a machine: one of the machines that your application is running on.

I look at there as being five parts to running a chaos test: one, defining your success criteria, understanding what you're trying to achieve; two, defining and measuring the steady state of your system; three, injecting the failure, which is the actual chaos test itself; four, observing the outcome, seeing what happens; and five, if required, restoring your system back to steady state.

What does defining your success criteria mean in this context? It's: what does success mean for this test? One, are you expecting no impact to availability, where everything should just continue to run as expected? Two, minimal impact to availability, in which case the question is how much impact is acceptable. Three, a self-healing system: there will be impact, but you expect the system to recover on its own without outside intervention, in which case the question is, after how long? Four, perhaps there will be impact, maybe even manual intervention required, and you want an alert to be triggered — you want your support person or your on-call to be paged. And five, you might also want to check that your dashboards are actually reflecting the system state and showing that you have an issue; that might be another thing you want to verify with the chaos test.

Now let's go into defining and measuring your steady state. When we talk about steady state, it's how your system is supposed to behave when everything is going well. Typically, when we talk about observability and resilience, we talk about an SLO, a service level objective, and that's usually measured over a longer period of time. For example, you might say we want 99.9% of requests to return a successful response over a 30-day or 90-day period. That long window is extremely important, because you don't want to be impacted by little blips: for five minutes something happened, for three minutes something happened.
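To make that SLO arithmetic concrete, here's a small worked sketch (the 99.9% and 30-day figures are just the example numbers from above) of the error budget such an objective implies:

```python
# Minimal sketch: translate a 30-day, 99.9% availability SLO into an error budget.
SLO_TARGET = 0.999          # 99.9% of requests succeed
WINDOW_DAYS = 30

error_budget_fraction = 1 - SLO_TARGET                      # 0.1% of requests may fail
budget_minutes = WINDOW_DAYS * 24 * 60 * error_budget_fraction

print(f"Error budget: {error_budget_fraction:.2%} of requests "
      f"(~{budget_minutes:.0f} minutes of full downtime per {WINDOW_DAYS} days)")
# -> Error budget: 0.10% of requests (~43 minutes of full downtime per 30 days)
```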
You want to be able to maintain that SLO consistently over the long term, but for a chaos test that doesn't work very well, because you're not going to run your chaos test for a 30-day period — at least I hope not. You want to get results much quicker than that, maybe over a one-minute interval, or 30 seconds, or five. So what I find more helpful than an SLO when determining the success or failure of your chaos test is something like your alerting threshold: the point at which you would actually page somebody to take a look. If you breach your alerting threshold during the chaos test, that means something bad did happen; if you don't breach it, that means everything is working as expected. Of course, you might also discover that your alerting threshold is not what it's meant to be, which is another great outcome of actually doing the experiment. Then there's measuring availability: always try to do it from your customer's viewpoint — that's really important (a small probe sketch appears a bit further down). And while all this may be complex and you might not have all the answers, especially for a new system, it's okay to run a chaos experiment in non-prod before you have all the answers. It will help you learn about your system, discover some of these things, and figure out the right questions to ask, so it can be a really useful exercise.

Next, some best practices when injecting failure. First of all, start by identifying points of failure: look at your architecture diagram and your system and figure out what could fail. If you do a good job of that, you'll probably end up with quite a long list of things that could fail, so then you need to prioritize, because you're not going to start by testing everything. When you're prioritizing, you want to prioritize based on impact — what's the worst thing that could happen if something goes wrong here — as well as the probability of it happening. With probability, you want to avoid both extremes. If something happens every 30 minutes, there's not much use in doing a chaos test on it; you might not even call that chaos, because it's already happening all the time. On the other hand, if it's likely to happen maybe once in 2,000 years — I'm not saying it's not valuable to test that, but you'll probably find something with a higher ROI somewhere in the middle. Something that happens every six months, something you know will happen 30 days from now: that's usually your sweet spot in terms of priority. And then there's your current confidence level: how much do you already know about what will happen if this thing goes wrong? Third, start simple, even manual. There are great tools out there for chaos testing, and definitely feel free to leverage them, but when you're just starting out, do something really small, easy, and simple, because you'll learn a lot from that as well — and we'll get into that soon. Also, always start in non-prod. Yes, running chaos tests in production is cool and fun, and you should — it's actually a best practice — but when you're running the experiment for the first time, it's just like how you don't release your features directly to prod (at least I hope not). Whatever pre-production QA and testing you're doing for your features, you want to do the same thing for your chaos tests as well, because the impact could be potentially catastrophic if you start right away in prod.
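Tying back to the earlier point about measuring availability from the customer's viewpoint against an alerting threshold rather than a long-window SLO, here's a minimal sketch of a synthetic probe. The endpoint, window length, and threshold are made-up placeholders, not values from the talk:

```python
import time
import requests

# Hypothetical values -- substitute your own endpoint and alerting threshold.
ENDPOINT = "https://example.internal/api/documents?q=probe"
WINDOW_SECONDS = 60           # a chaos-test-sized window, not a 30-day SLO window
ALERT_ERROR_RATE = 0.01       # the rate at which you'd normally page someone

def measure_error_rate(window_seconds: float) -> float:
    """Hit the endpoint repeatedly, like a customer would, and compute the error rate."""
    total = errors = 0
    deadline = time.monotonic() + window_seconds
    while time.monotonic() < deadline:
        total += 1
        try:
            resp = requests.get(ENDPOINT, timeout=2)
            if resp.status_code >= 500:
                errors += 1
        except requests.RequestException:
            errors += 1
        time.sleep(1)
    return errors / total if total else 0.0

rate = measure_error_rate(WINDOW_SECONDS)
print(f"error rate over {WINDOW_SECONDS}s: {rate:.1%}")
print("PASS" if rate < ALERT_ERROR_RATE else "FAIL: alerting threshold breached")
```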
Then there's observing outcomes. You don't want to rely exclusively on whatever system dashboards you already have; you want to consider those dashboards part of the system under test, part of what you're trying to verify with your chaos test. So you want a different way of also verifying how your system is behaving. And just like in a real incident, where you would be double-checking everything — are you sure this is up? go check that the website is up, is this working? — do the same thing during your chaos test. Again, measure from your end user's perspective: if you have a web application, actually fire up a web browser and see how it's working. Take as much of a step back as you can.

Finally, you need to restore the steady state. This might not be required at all if everything went well and your system is self-healing, but if your chaos had the expected impact, you may need to do it. Since you started in non-prod, there's no need to panic, but you do need to get things back up and running. Or maybe you knew right away that the system wasn't self-healing and there would be some runbook you would need to follow — but you've documented that, and now is your chance to follow that runbook and ensure it's well written, well understood, and able to be executed.

Next we'll talk about chaos readiness. First of all, you need a chaos-ready environment. Because you're not starting in prod, you need a prod-like environment to start off your chaos testing, and you want it to be as close to prod as possible: the further away it is from prod, the more different it is, the more risk you're carrying that your chaos test is not actually doing what you think it's doing. And then, safety first, even in non-prod. Often, if your system is very complex and it's hard to set up a non-prod environment — say it's a very data-heavy system — you might actually be working on a shared non-prod environment. So even though it's non-prod, if all your developers are blocked for three days because you ran a chaos experiment, the chances of you being able to do more of this are not that high. So be careful in non-prod as well. After you test in non-prod, you'll have the information to decide whether you want to move to prod and do this in prod or not. You have to weigh the risks of doing it in prod against the risks of not doing it in prod, and consider that carefully.

Next I'll talk about chaos-ready teams. First of all, you need a commitment to resiliency and operational excellence. If, for whatever reason, the system you're talking about has known resiliency issues that no one is fixing, there is no point in running a chaos experiment to discover more operational resiliency issues that will just get added to the backlog and never fixed. The one exception might be if there are resiliency issues showing up in prod, but so far they've been small — no major incident. Sometimes running a chaos experiment can help you bring proof to management that yes, so far the issues have been small, but if one of them happened during a high-volume day, or if one of them happened and this other thing went wrong, you could have a major incident. That can help you get the attention to resiliency that you're trying to get.
As well, when it comes to culture, the team needs to be able to acknowledge, embrace, and mitigate risk, because when you're doing chaos engineering you're actually discovering risk. If people are not just risk-averse but afraid of risk — they don't want to know about it, they'd rather not know — you're not going to have much success with chaos engineering, because you're just bringing more bad news that people would rather not hear about. And then finally, psychological safety paired with blameless post-mortems. When you're doing chaos tests, especially if you're running them in production, there is always a possibility that something will go wrong. Hopefully you've thought about it and decided that the benefits of running the chaos test outweigh the risks, because the risk of not running the chaos test is also very high. But knowing that the finger won't get pointed at you — that people understand everyone is doing what they think is right for the system and is doing things appropriately — will help you be able to run chaos tests repeatedly and keep moving things forward.

Okay, great. So now you have the chaos-ready environment and the chaos-ready team; what you need to do now is convince your boss. Here's something you shouldn't say: "It's fun. I just read a really great blog post about chaos engineering. It's trending on my Twitter feed. I just saw Bella give a really great talk about it at CloudNativeCon and I want to try it out. I want to write a really cool blog post. I want to get promoted." These are probably not the things that are going to convince your boss to let you do chaos engineering. But there are, what I find, two categories of times when there is commercial value to doing chaos engineering, and those are great times to raise your hand and say, hey, let's try it. The first one is before a release — whether it's the first release of your system and you're about to go live soon, or the release of a major feature. At that point people are nervous; they're looking for confidence, for a way to get comfortable that when they go live, something bad won't happen. That's a great time to figure out what the points of failure are, trigger those points of failure, and ensure that your system is resilient to them. That's one category. The second one is after a major incident. You have all your post-mortem follow-ups, but if you don't actually recreate the trigger that happened in production, how do you know those are the right follow-ups? How do you know those follow-ups were actually done appropriately? So that's another time when suggesting a chaos test makes sense: let's re-trigger that, make that bad thing happen again, and then ensure that we don't have an incident, or at least that the incident is much smaller than it was previously. That's a recommendation I often give post-incident.

So then you might ask: is chaos engineering easy, or is it hard? I came to the conclusion that it's both. Here are some reasons why chaos engineering is easy. First, even if you do a manual chaos test — and I'll discuss what that means soon — you can get a lot of value with very, very minimal investment. Second, there's lots of open source and vendor software available for basic use cases. Similarly, there are tons of blog posts and talks out there, so it's easy to read up on what you should do. And four, if you're on the cloud, infrastructure is much more democratized: cloud providers give you APIs and easy tools, so it can actually be really easy to bring something down.
On the other hand, chaos engineering is also really hard. Some of the challenges are technical. First of all, chaos tests are by definition system tests, and system tests are notoriously flaky: you have all these false positives, all these moving parts — how do you make sure everything is up and running the way it's supposed to be? That's one challenge. Then there's running a production-parallel environment, which can be an ongoing investment. If you have one already, great, you'll use it, but then you might be competing with other developers for resources; or you might have to invest in creating a dedicated environment for your chaos tests, and that requires investment. Alternatively, you can use canaries, which are already in prod if you have a mature canarying setup, but there are some large-scale issues that can be difficult to simulate that way. The second technical challenge is managed services on the cloud — not just the cloud, but the higher level of abstraction of managed services. Very often they abstract away the details. With serverless, there obviously are machines, but you don't have access to them and you can't manipulate them, and that can make it really hard to inject a failure and cause something to happen. Another reason why chaos engineering is hard is a human one. There are the three virtues of programmers: laziness, impatience, and hubris. When you think about laziness, everyone says: I already have enough chaos without this; if it ain't broke, don't break it; leave me alone, I want to go home. Then there's impatience: I just want to code, I like this feature, and this is going to take some of my time. And then hubris: my code won't break; I already wrote code to handle this scenario two years ago. That might be true — but have you tested it recently? We all know it's a moving target; things change, and the environment changes.

Okay, so now we'll go into a case study. This is a simplified, slightly modified architecture of a real system in pre-prod at Goldman Sachs. Going through the architecture: you have an API gateway which is proxying, validating, and authenticating; you have a service running on ECS Fargate, with three tasks running in different availability zones; we're using OPA, the Open Policy Agent, to check entitlements; we have an Elasticsearch instance which searches across different documents; and there's also Redis, which is a dependency for some types of requests. When we were looking at doing chaos engineering, we went through this diagram, figured out points of failure across all the different components, and proceeded from there. Some points of failure: Redis, for example, was supposed to be a soft dependency for most requests — so is it really a soft dependency, or did it accidentally turn into a hard dependency? That would be one thing to look at. Elasticsearch: what if it goes down, or some of the nodes go down? OPA, the entitlement system: what if that gets disconnected? We know we can't operate without entitlements, but can we recover from that, and how quickly, once it comes back up? And finally, on the Fargate side: our tasks go down — how are we going to handle that? (A small sketch of what a soft dependency looks like in code follows below.)
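To illustrate what "soft dependency" means here — and how it can silently become a hard one — here's a minimal sketch, not the actual service code; the client configuration, function names, and fallback behavior are all hypothetical:

```python
import json
import redis  # assumes the redis-py client

# Hypothetical client; in a real system this would be configured elsewhere.
cache = redis.Redis(host="redis.internal", port=6379, socket_timeout=0.2)

def load_from_primary_store(doc_id: str) -> dict:
    # Stand-in for the authoritative (non-cache) lookup.
    return {"id": doc_id, "source": "primary"}

def get_document(doc_id: str) -> dict:
    """Redis as a *soft* dependency: if the cache is unreachable,
    fall back to the source of truth instead of failing the request."""
    try:
        cached = cache.get(doc_id)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        # A cache outage should degrade performance, not availability.
        pass
    return load_from_primary_store(doc_id)

# A hard dependency sneaks in when some code path calls cache.get() without
# the try/except -- exactly the kind of drift a chaos experiment that blocks
# Redis connectivity is designed to surface.
```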
Now, these are all real points of failure, things that can really go wrong, but again, here are the challenges of the modern cloud: how are you going to simulate these things? You don't have any machines to bring down. I mean, there are machines, but you don't have access to them; you don't know where they are or how you would manipulate them. So how can you inject chaos?

There are three things, three categories that I call chaos scenes, that you can use to inject chaos even when you're running managed services in the cloud. The first category is using cloud APIs and functionality. If we take Elasticsearch as an example, even if you're running on a fully managed cluster with no access to the underlying machines, there are usually APIs that let you trigger a cluster resize. That can be a really useful chaos scene: trigger the cluster resize with traffic coming in and see how your system reacts to those circumstances. The second is triggering known weaknesses. I like to say: look at the documentation, at all the things it says are best practices, and do the opposite. It says don't run this resource-intensive query, make sure never to run this query against your Elasticsearch instance — so you go, okay, write up that query and run it (we're in non-prod) and see what actually happens. You might find the entire cluster goes down — and how do you recover from that? That can be another really useful, real-life way to induce chaos without access to the machines. Third, disrupting network connectivity. You have multiple managed services talking to each other, and you can create network black holes so the services can't talk to each other. That can be a pretty generic and powerful way to inject chaos into your system.

Now we'll dig into the ECS Fargate use case; that was the thing we decided to start with because it's relatively simple. So what is AWS ECS? That's the Elastic Container Service: fully managed container orchestration on AWS. You can either run on EC2 instances, in which case it's not fully managed and you do have access to the underlying machines, or you can go serverless with ECS Fargate. ECS Fargate is serverless container orchestration: you just define your tasks (a task is roughly the equivalent of a pod in Kubernetes), and for resiliency you obviously want to configure your cluster to run across multiple availability zones in AWS. So then the question becomes: what happens if a task goes down? Well, the good thing should happen — ECS should automatically bring up a new task to replace the one that stopped — but that opens up a bunch of questions. How long is that going to take? What if multiple tasks stop? How is your application going to behave in the interim? And if there is impact, how long will it take to recover? You can read the documentation, you can guess, or you can try it — and that's what we did. When I talk about starting simple, this is what I mean. There might be amazing tools out there that you might want to use soon or eventually to do this in an automated fashion, but when you're starting out, even in non-prod, just start simple: you can go to the console, or run a simple API call or AWS CLI command, and just stop a task (a minimal sketch of that follows).
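As a rough illustration of that "just stop a task" step, here's a minimal boto3 sketch; the cluster name is a made-up placeholder, and in practice you'd wrap this in the go/no-go process described later rather than run it casually:

```python
import random
import boto3

ecs = boto3.client("ecs")
CLUSTER = "my-preprod-cluster"  # hypothetical cluster name

# Pick one running task at random from the cluster...
task_arns = ecs.list_tasks(cluster=CLUSTER, desiredStatus="RUNNING")["taskArns"]
victim = random.choice(task_arns)

# ...and stop it, simulating an unexpected task failure.
ecs.stop_task(
    cluster=CLUSTER,
    task=victim,
    reason="chaos experiment: simulate unexpected task loss",
)
print(f"Stopped {victim}; now watch how (and how fast) the service recovers.")
```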
What you learn from that is actually huge, and you'll see we learned a lot just from doing that once. A simple experiment, valuable findings: we found there was actually a 20% error rate for between three and ten minutes, and the failed requests were returning 502 Bad Gateway errors. On the good side, the system was able to return to steady state without any outside intervention. So what happened? Well, obviously we know what happened — we stopped the ECS task, so one of those three tasks was down — but the network load balancer was still sending traffic to the bad task for a few minutes, until the health check for that task finally failed and it was taken out of rotation. How would you solve something like that? First, you can increase the number of tasks. Three tasks is not very many; you can scale up to hundreds, and then the impact of a single task going down becomes negligible. But that's a cost-versus-resiliency tradeoff: depending on how much resiliency you need, you might choose to pay for more tasks or not. Second, you can tune the health check timeouts on your load balancer so that you remove tasks that are not responding more quickly. That's also a tradeoff, because you might end up with churn — you remove a task that really would have come back on its own — so there are always tradeoffs and tuning involved here.

The second really interesting thing we found was the dashboards. On the left-hand side you can see a big green smiley face, because the system dashboard said everything is good: all the requests seen by this dashboard are returning successful responses. But that's not what our customers were seeing. On the right, we were using Locust, the load testing tool, to simulate the production traffic, and there, again testing from the customer's perspective, we were seeing errors. What was happening was that the dashboards were pulling data from an agent running alongside the ECS task; when the task went down, the agent went down with it and stopped reporting those metrics, and therefore you didn't see those errors at all. Lessons learned: one, try to make sure your monitoring is decoupled from your service; two — and this is exactly what we did — test your dashboards and alerting regularly with real, production-like, incident-like scenarios; and three, monitor from your customer's perspective using things like synthetic probes. (A minimal Locust sketch, in the spirit of the customer-perspective traffic we generated, follows this paragraph.)
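Here's a minimal locustfile in that spirit; it's a generic sketch with made-up endpoints, not the actual test harness from the case study:

```python
# locustfile.py -- run with: locust -f locustfile.py --host https://my-preprod-endpoint
from locust import HttpUser, task, between

class SearchUser(HttpUser):
    """Simulates a customer hitting the service while the chaos experiment runs."""
    wait_time = between(0.5, 2)  # think time between requests, in seconds

    @task(3)
    def search_documents(self):
        # Hypothetical read-heavy endpoint; 5xx responses show up as failures
        # in Locust's stats, giving the customer's view of the error rate.
        self.client.get("/api/documents?q=example")

    @task(1)
    def fetch_document(self):
        self.client.get("/api/documents/123")
```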
And then finally, a fundamental question: is this an incident? The answer is, it really depends. What does your service do, what is your load, what are your customers' expectations? Is three minutes at 80% availability an incident? It really depends, and actually having this graph in front of you, having a real thing that happened — maybe even inviting your customers to come test with you in non-prod — can help you clarify some of these things before fingers start being pointed, before there's a real incident happening live in production.

That concludes the case study, and now we'll go to the takeaways. First, getting started with chaos engineering does not need to be difficult and does not require fancy tools. Two, if you want to be successful, ensure that your chaos tests are aligned with real business needs and requirements. And three, after major incidents, strongly consider including a chaos test as a post-mortem follow-up to prevent a recurrence. This last slide is about Goldman Sachs engineering: we have a bunch of engineering tenets, and I've highlighted the few I thought were particularly relevant to chaos engineering in this talk. And now I'll stop and take any questions.

Thank you, Bella — it's a great presentation; I especially liked the slide on how to convince your boss. My question is about the serverless functions that we write: can you talk a little more about how you do chaos engineering where you have very low, almost zero, visibility into the infrastructure, because you don't know where your functions are running? Did you go any deeper into that?

Sure — just to make sure I understood the question: it's serverless, you don't have access to the underlying infrastructure, and you'd like to hear more about how to do chaos engineering there? Yeah, because you have very low visibility into where things are running underneath. Sure. The high-level answer is that to do chaos engineering you don't necessarily need visibility into the infrastructure; you need visibility into how your customers are viewing your system. What you do need is access to do something bad, however you choose to achieve that. Previously that was done at the infrastructure level, because that's really generic — you bring down a machine and it impacts anything — but now that we're moving to higher abstractions, that becomes harder to do. With ECS tasks I think the example is clear; for Lambda, there are ways you can inject latency into your Lambda, and you can put things in front of it to simulate failures, but the truth is that it is more challenging with serverless to actually inject these failures and make things fail realistically. I wish I had all the answers, and I don't, but there are ways you can go about it: whether it's the network — disrupting the network so your customer can't reach the Lambda even though it's up and running — or whether it's putting something in the Lambda that lets certain requests fail, or injects latency for certain requests (a rough sketch of that kind of wrapper appears a bit further down). Those are some of the strategies that we use. I hope that helps.

Hello. There are different types of failures that you can simulate; have you ever simulated a failure that could affect data consistency? That's a good question. I personally haven't simulated one. I've definitely seen incidents where databases did funny things, and that's definitely a really valuable thing to do with chaos engineering — it's one of the most dangerous scenarios, because recovering the data is hard. So it's definitely a scenario worth testing, but I can't give you a personal fun anecdote here. Thank you.
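As a rough sketch of the "put something in the Lambda that fails or delays certain requests" idea mentioned above: this is a generic, hypothetical decorator controlled by environment variables, not a specific tool or the actual approach used on the system in the talk:

```python
import os
import random
import time
from functools import wraps

def chaos(handler):
    """Wrap a Lambda handler to inject latency or failures for a fraction of requests,
    controlled by environment variables so it stays dormant unless explicitly enabled."""
    @wraps(handler)
    def wrapper(event, context):
        rate = float(os.environ.get("CHAOS_RATE", "0"))        # e.g. 0.1 = 10% of invocations
        delay_ms = int(os.environ.get("CHAOS_DELAY_MS", "0"))  # added latency in milliseconds
        fail = os.environ.get("CHAOS_FAIL", "false") == "true" # raise instead of delaying
        if random.random() < rate:
            if fail:
                raise RuntimeError("chaos experiment: injected failure")
            time.sleep(delay_ms / 1000.0)
        return handler(event, context)
    return wrapper

@chaos
def handler(event, context):
    # The real business logic would live here.
    return {"statusCode": 200, "body": "ok"}
```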
Hey, so thank you for this. In terms of chaos testing in pre-prod, you need to simulate things, which means you need to do load testing, because not everything happens with one person in a browser at a time. So in terms of simulating load, how have you found it — just capturing network traffic? Especially on new features that don't exist yet. If something's already in production you can capture a sample, sanitize it, and so on, but how have you found ways to speed up the process where the chaos testing you have to do in non-prod relies on a load test, and that's not as easy as capturing traffic and replaying it? Have you found ways to speed up that loop to get to a re-simulatable state, which is not easy?

It's definitely a challenging issue. Once you're in prod you can use things like — well, fingerprinting is more about the shape of the data — but you can replay more easily: you can record what's actually there in prod and then replay it in non-prod. It's a little bit harder to know what your traffic will look like before you're in prod, and then it really is just our best guesses, which is why in general you want to move incrementally and release to prod in small increments, as quickly as you can, so you can get real-life feedback. But yeah, I don't have a magic answer there.

Okay, so how much time do you invest in researching, reviewing, and investigating the results after a chaos test? That's a great question. It's actually a very, very large part of chaos testing, and that's why I said before that if there isn't a business use case attached to it, it will often just fizzle out: you did the chaos test, you found some things, but nobody is really interested enough to dig into it if it's not material, if it's not something people actually care about. It's just like a real production incident: the incident itself is stressful, and then you can spend months afterwards cleaning up and doing all the follow-ups. It's similar for a chaos test — a lot of time goes into the follow-ups there too.

All right, just a quick question: how do you involve the ops team when you are testing, or causing chaos, in non-production environments — for example when you need to do the root cause analysis? I don't know if you understood my question. Maybe just repeat it? How do you involve the ops team when you are testing in a non-production environment? Sure. Depending on the structure of the team, often the ops team will be involved in the non-prod environment as well. If that's the case, great, but you definitely need to involve your ops team, because they're the ones who know what's going on from an ops perspective. And what I find is a very good outcome of chaos tests is that you bring everyone together in a much less stressful setting than a prod incident. Everyone comes together for a prod incident too, but that's not the most pleasant time for everyone to get together. When you get everybody together for a chaos test, everyone starts talking to each other, and you'll often uncover disconnects in understanding — ops thinks you're doing X and dev thinks you're doing Y, and then they realize they've been completely misunderstanding each other the whole time. So definitely involve ops; often they're involved in non-prod anyway, and even if they're not, get them involved.
I find that ops people are often even more excited about chaos engineering than the devs, because it's more up their alley, and they're very often really happy and excited to have somebody proactively looking at ops-type things.

So, adding to what he asked: one of the challenges we often face is that this requires access, or admin access, to carry out. I understand that in pre-prod or a lower environment it's easy, but at Goldman, how are you planning to do it in prod, where that probably requires beyond-VP-level approval to touch anything? Sure, it's definitely challenging. There are two parts to this. There's the access side — the technical part of whether somebody has access — and typically somebody will have that access, so it's about getting that person to actually participate. And then there's the flip side, which is not the technical part of getting access but the organizational part: if you're going to do a chaos test in production, you want to have a go/no-go type of meeting with everybody in the room. Don't chaos test alone — for sure not in prod. Just like before a release you have a go/no-go — should we do this, should we not? — do the same thing: write down your plans, get everybody in a room, have everybody sitting there, virtually or in person, and then have the person with permissions press the button. The other benefit of that is that if something goes wrong, anybody who is needed to resolve the incident is already in the room with eyes on it; you're not going to waste the 20 minutes it sometimes takes to page the right person, get them on the call, and start fixing things. I hope that answers your question.

Thank you. My question is: are there any new developments — tools, techniques — in this space that you're finding interesting? Any new tools and techniques in the chaos engineering space, any new developments you find particularly interesting that you're planning on using, or perhaps looking at implementing? So, I know AWS is coming out with FIS, their Fault Injection Simulator, so we were looking into that as well; I think they're starting with a lot of network disruptions. There's definitely a lot happening in the space. What I find interesting, honestly, is that a lot of tools start with the simple things and then kind of stop, because the other stuff is really hard. So I think what would be great is getting together and slowly building a library where we also solve the more complex problems, rather than everyone writing their own new chaos tool that brings down an EC2 instance, stops ECS tasks, and then fizzles out because the other stuff is too hard. If we can unite and get to the more complex and meaty problems, that would definitely be a great outcome.

So, we were talking about manual chaos tests and things like that — what about moving to automating the chaos tests? What are some of the things you have to keep in mind, and how do you ensure that you're always watching the automated chaos tests that are running? Sure, that's a great question. Automating is definitely very valuable, but remember that it is a system test.
A lot of what's suggested is running it as part of your CI/CD pipeline — you have a chaos testing step. Some of the challenges there are, again, the false positives. When you're running it once, you can set things up and make sure everything's running; what we found to be a challenge with a lot of automated chaos tests is that things are down not because of your chaos, not because of the test, not because your system changed, but just because of the environment — you're usually not running in prod. That's one challenge. The other challenge — and this is why running manually first helps — is that when you're automating, you need a pass/fail threshold, a binary yes or no: we either passed or failed. When you're running manually, you can look at the results, tune that threshold, and figure out where it should be. But if you set it too aggressively, you're going to have a lot of false positives: if you say we want 99.9% of requests to succeed every single time we run our chaos test in the CI/CD pipeline, you'll probably end up with lots of build failures, and everyone will just start ignoring them. So you want to make sure you have a really stable environment to run it in, and you want to set your bar low enough that you won't have false positives. You also might want to consider, if you're not doing true continuous delivery, running it just before a release rather than every day or every hour — and for sure not on every build; have mercy on your developers.

Thank you, Bella. Thank you, everyone, for coming. If you have more questions, you can come up afterwards. Sure — thank you, everyone.