Thanks for attending my talk. I will be talking about chaos engineering and how you can build confidence in your production systems using it. Gregor actually talked about chaos engineering a little bit in his stream this morning; I will expand on what chaos engineering is, why we need it, and the tools and practices for adopting it. So let's get started with the definition of chaos engineering. There is a wonderful website, principlesofchaos.org, which gives the textbook definition, and you will also see a tool called Chaos Monkey: Netflix originated chaos engineering and built this tool called Chaos Monkey, so you find that reference in a lot of books. What chaos engineering essentially is, is live experimentation on your distributed system, and the goal of the experiment is to learn and understand the failure points of the distributed system and to give you a guideline on the areas you can improve. That is what chaos engineering is all about. A common analogy, and I love this example, is vaccination: when you are a kid they vaccinate you, and what they are really doing is infecting you with a virus or bacteria or whatever it is. The goal is not so much to infect you as to let your body develop resistance, so that when you encounter the real thing you can recover from it. Chaos engineering is the same kind of thing: it injects failures into your system to build resiliency.

Before I expand on what chaos engineering is, I want to start with why we actually need it. We are taught about all sorts of testing, unit, integration, security, and so on, and typically what happens is that before you put your system in production you have tested your components individually and together, you have exercised them in multiple ways. The question then is, why do we need another form of testing? Well, in all of these tests the focus is on your actual system and how it behaves. But there is one thing missing, and that one thing is what chaos engineering actually tests: how the system behaves when it is running in an environment. Unit testing, and even integration testing, focuses on the code, not on the interaction between the code and the environment. There are some nice examples of this: you test a traffic light thoroughly, everything works, then it is deployed in its production environment and you see what happens; the same thing can be said about the cupboard drawer. What is important in real distributed systems is that you need to test the system itself and you need to test the environment in which that system operates. That is super critical, and that is what chaos engineering mostly focuses on.

The problem is that the world is chaotic. We are operating at large scale: you are deploying your applications on AWS, Google Cloud, Azure, whatever your cloud environment is, and things fail at scale. There are many moving parts in a distributed system, your disks, your storage, your network, and things can go wrong; a hard disk fails, the network goes down, and so on. The point is that the world is naturally chaotic, so there is no way you can prevent chaos from happening.
But what you can do is be prepared for it and know what to do when it actually happens, and that is what we are going to look at. People tend to ignore the environment: we deploy the application on AWS and say it is AWS's job to take care of the environment, and it is, but the problem is that there are things beyond their control. Things happen, and you can't just ignore the environment. Even when you test with a staging environment that is supposedly an exact replica of production, there are differences, and you have to take them into account. There is the common example of the Mars rover: you are putting a probe on Mars, so you make assumptions about what the environment is going to look like and you test against them, the way astronauts train for zero gravity on Earth; you simulate the environment in your local testing setup. The problem is that you have made certain assumptions about the environment, and how do you know those assumptions are correct? You are assuming the network is reliable; is it? What happens if there is a network failure? The other problem is that the environment is beyond your control. Once your application is on AWS or whatever cloud, that's it: you can control your application, but you can't control your environment, so the environment is something you should not ignore.

This is where the fallacies of distributed computing come in. We tend to assume we have a reliable network, unlimited bandwidth, and so on; if you go to that website it lists the eight fallacies of distributed computing, but the common point is: never assume that your environment works perfectly, because it doesn't. What you need to do is factor in what to do when things fail, understand how your system would behave in case of a failure, and build resiliency into the system itself.

One of the common failure conditions is cascading failure. A distributed system is a set of interconnected components: one component depends on another component, which depends on yet another. You test all these components individually, fine, but what if a component on the back end fails? You don't see the effect immediately, but there is another component depending on it, which depends on another, and eventually the failure propagates. The thing about cascading failures is that, because there are so many moving parts, it is not possible to anticipate every kind of failure in advance, and they are usually triggered by a very unusual set of circumstances, something you never thought of. Your system runs safely in production for a year, you think everything is good, then some very unique condition gets triggered, one thing leads to another, and ultimately you have a crash.

It turns out this actually happened, on February 28, 2017, with the AWS outage. This is the actual incident report, which was eventually prepared by the CTO of Amazon. A lot of websites in the US, on the East Coast, stopped working because of this outage. What happened was that the Simple Storage Service, S3, went down in the US East region, and a lot of other services depend on it, so slowly everything started shutting down; I think the outage lasted about four hours. The way it started, as per the report, was that one of the operators entered a wrong set of commands, and hundreds of thousands of websites were affected. This is a perfect example of cascading failure: you may not depend on S3 directly, but there are other services you depend on which in turn depend on S3, so even though S3 fails and you are not directly dependent on it, your service is still affected. Point being, no matter who your cloud vendor is and no matter what your environment is, don't assume it will work fine. It just won't.
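One concrete way to build that kind of resiliency into your own code is to never call a dependency without a timeout, bounded retries, and a fallback, so that a failing downstream service degrades you gracefully instead of taking you down with it. This is only a minimal sketch, not something from the talk; the endpoint name and the `requests` usage are illustrative:

```python
import time
import requests

RECS_URL = "https://recs.internal.example.com/v1/recommendations"  # hypothetical endpoint

def get_recommendations(user_id, retries=2, timeout=0.5):
    """Call a downstream service defensively: bounded timeout, bounded retries,
    and a fallback so its failure does not cascade into our own service."""
    for attempt in range(retries + 1):
        try:
            resp = requests.get(RECS_URL, params={"user": user_id}, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt < retries:
                time.sleep(0.1 * 2 ** attempt)  # simple exponential backoff
    return []  # graceful degradation: render the page without recommendations
```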
So finally, let's get back to chaos engineering. What chaos engineering is, is testing the interaction of your system with its environment. Typically you have stages: the developer's local machine where you do some unit testing, then one or more staging environments, but the focus of that testing is your actual build artifact. Once you deploy, there is a new set of tests that have to be run in your actual production environment. You could also run them in pre-production or staging, and many people do, but they have to be run in the actual production environment eventually, and the focus of those tests is not so much the build artifact as the interactions it has with the environment. That is what chaos engineering is: controlled experiments that you run on your distributed system. These controlled experiments basically inject a failure, some kind of condition that could potentially happen. An example: I am running a service, and for resiliency I have multiple VM instances. What happens if one VM goes down? In an ideal world you have a job scheduler that should detect that the instance is down, shut it down, divert the traffic to the other instances, bring up a new instance, and ultimately divert the traffic back. How do you know that this actually happens? What you do is inject a failure into the system: okay, let me actually shut down one of my instances and see what happens. That is an example of a controlled experiment, an experiment that replicates what could potentially go wrong in your environment, in this case your cloud environment or cloud vendor.
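That VM example is straightforward to script. Below is a minimal sketch, assuming AWS EC2 and the boto3 SDK; the `chaos-target` tag that marks which instances may be touched is a made-up convention, not something from the talk:

```python
import random
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
# Hypothetical opt-in tag so the experiment can only touch instances
# that were explicitly marked as fair game.
TARGET_FILTER = [{"Name": "tag:chaos-target", "Values": ["true"]},
                 {"Name": "instance-state-name", "Values": ["running"]}]

def running_instances():
    resp = ec2.describe_instances(Filters=TARGET_FILTER)
    return [i["InstanceId"]
            for r in resp["Reservations"] for i in r["Instances"]]

def kill_one_and_observe(expected_count, wait_seconds=300):
    """Terminate one random instance, then check whether the scheduler
    brings the group back to its expected size within the deadline."""
    victim = random.choice(running_instances())
    ec2.terminate_instances(InstanceIds=[victim])
    print(f"terminated {victim}, waiting for recovery...")
    deadline = time.time() + wait_seconds
    while time.time() < deadline:
        if len(running_instances()) >= expected_count:
            print("group recovered")
            return True
        time.sleep(10)
    print("group did NOT recover in time -- weakness identified")
    return False
```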
So how is chaos engineering different from other forms of testing? We already touched on it a little, but in all of those other tests you are testing one particular known set of conditions. You do a unit test on one particular component, you think this component can go wrong in manners A, B, and C, and you test for A, B, and C; your unit tests are essentially a set of assertions that you make: did this happen, did that happen. Chaos engineering is a little more than that, because in a distributed system you have many moving parts and millions and millions of combinations of things that can go wrong, and it is not possible for you to enumerate all of them. So chaos engineering is more of a learning exercise: what if this happens, what do I do, how would my system behave? It is a process of generating new information, not of testing for something that is already known. It is more exploratory: you test the effects of various conditions and you generate more information, and often this information is a little bit subjective. It is not an objective yes or no, pass or fail; it is more like, the hard disk went down or there was a network failure, my system did recover, but it took five minutes to recover. So it is not a yes or no: it happened, but it happened after five minutes, and that is unacceptable, I want the recovery to happen in 30 seconds. That is an example of subjective information; it is no longer just a yes or a no.

One of the common misconceptions is that chaos engineering is about causing chaos. That is not the end goal. Chaos engineering is not about causing problems, it is about revealing them. You already have problems in your system, everybody does, there is no such thing as a perfect system; what chaos engineering does is reveal them, and if availability matters to your system, then you should be testing for it. Admittedly the name chaos engineering sounds a little counterintuitive: if you go to an executive and say, well, we are going to test your system by blowing up a couple of instances in production, how does that sound? Not sure the CTO would really buy it.

So how did chaos engineering actually start? Netflix came up with it. Around 2008 Netflix started down this path, and in 2009 they decided to go all in on AWS, they wanted to move to the cloud, and I think the person in charge was Adrian Cockcroft. He is actually a very famous guy, he speaks at a lot of conferences, and he is now at Amazon as VP of cloud architecture. With that transition, the environment that up until that point was in your control is no longer in your control, because now you are relying on AWS for it. Then around Christmas there was an actual outage, and Netflix was down for something like 24 hours at the peak season, and that cost a lot of revenue. The whole chaos engineering concept was born from that point of view: these things are likely to happen and we can't control them, but we might be able to build in resiliency so that we can recover when such a thing happens. So they created a set of tools. It started with what is called Chaos Monkey, the thing you saw on slide number one. Chaos Monkey was the simplest form of tool: it would just shut down certain instances of a service. Gradually they came up with new sets of tools; Chaos Kong gained notoriety for shutting down an entire region, and finally they have something called the Chaos Automation Platform, which periodically runs live experiments on your system, 24/7. They have a whole team, one of my colleagues actually works there, that monitors it; they do it, everybody knows, and it forces the developer mindset that such a thing will happen, don't assume that it won't, and build resiliency into your system right at the development stage. It is an organizational shift, a change of mentality in how you develop your software: your development cycle itself has to incorporate the fact that failures are going to happen, and you have to take that into account and build in resiliency right from the start.
All of these tools, Chaos Monkey, Chaos Kong, and the later ones, eventually got open sourced, and together they now form what is called the Simian Army, which I will cover: it is basically the set of all of these monkeys and other animals that each cause some kind of disruption. The open source you can actually access, and there is a book on it too.

Before we actually start with chaos, let's talk about the prerequisites for chaos engineering. As I said, chaos engineering is about generating new information; it is not about fixing known issues. So the first thing is, if you have a known issue in your system, it doesn't make sense to run something like chaos engineering, because you already know there is a problem; the experiment is not going to generate any new information, the only information it gives you is something you already know. So the first prerequisite, before you start to adopt chaos engineering, is that you fix all the known issues in your system. The second thing is that you are going to inject a failure and then observe what happens. How do you observe it? You can only observe it if you have monitoring, so the second prerequisite is that you have adequate monitoring of your system. I am emphasizing the word adequate because monitoring is a bit of a grey area in terms of what kind you have: some monitoring might just track instance health of services and so on, but you need detailed monitoring, you need logs, you need to know what is actually happening inside the system. You need adequate monitoring for chaos engineering.

So, the principles of chaos engineering: how does it actually work? Your system is working in what is called a steady state, and you have a certain hypothesis: I have a service, it takes a request, processes it, and so on, and in the steady state, in an ideal world, you are making some assumptions; my resources are fine, my network, my hard disks, everything is good. Then you start injecting failures, or rather very real events, and these events are things that mirror a distributed system failure mode. What is a distributed system failure mode? It is a possible configuration, a possible way in which your system can actually fail. One example I pointed out was a hard disk failure: your service runs on ten VMs, you have ten disks, what if one of them fails? So you inject this failure, and you have to run these experiments in production. You can also run them in a pre-production or staging environment, and most people usually do, because if something is wrong it is better to catch it in pre-production before it goes to production, but that does not replace production testing, it only supplements it. So you have these experiments running in your environment, your actual environment, and then you automate them. That is what the Chaos Automation Platform does: it is automation, it runs 24/7, and you need a team monitoring it.
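Putting those pieces together, an experiment is really a written-down steady-state hypothesis, a real-world event to inject, and a way to roll back. A sketch of what that could look like as data, loosely modelled on how tools such as the Chaos Toolkit let you describe experiments; the service and the numbers here are made up:

```python
# Illustrative experiment definition; none of these names refer to a real system.
experiment = {
    "title": "Order service tolerates the loss of one disk volume",
    "steady_state_hypothesis": {
        # what "normal" looks like before, during, and after the injection
        "p99_latency_ms": {"max": 250},
        "error_rate": {"max": 0.01},
    },
    "method": [
        # the real-world event being injected, scoped to a single instance
        {"action": "detach_volume", "target": "one instance of order-service"},
    ],
    "rollbacks": [
        {"action": "reattach_volume"},  # how to put the system back afterwards
    ],
    "environment": "staging first, then production",
}
```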
There is also a concept in chaos engineering called minimizing the blast radius. What minimizing the blast radius means is: start small. Don't inject a big failure, like two of my three regions failing, right from the start, because if you inject a big failure and things go out of control, you may not be able to recover. You start with something very small: one instance of one service, probably a tier-3 service, it shouldn't even be tier one or tier two, failing. You have isolated a very small failure that probably does not impact the customer directly. If everything works, good; if it doesn't, you have identified a weak point, so fix it. It is an iteration: now we know that for tier-3 services we are fine when one of the resources fails, so let's move to tier 2 and do the same thing there, inject a failure. Essentially you start with a very small blast radius and gradually increase it, and ultimately your goal is that when a big failure happens, the service either recovers or at least fails gracefully. Netflix has something like that: when you log in you see recommendations for all the movies, and if that recommendation service fails, the rest still works, you just won't see the recommendations. That is a graceful failure: it doesn't really, or at least not terribly, affect your user experience; sure, you might not have recommendations at that point in time, but if you are looking to watch one particular movie, nothing stops you from doing that.

Okay, so we talked about experiments, and I briefly touched on this, but there is actually a sequence of steps you have to follow. You first start by picking a hypothesis; for example, my hypothesis is that I am immune to this particular failure. Then you define the scope, and scope follows the minimize-the-blast-radius philosophy: this particular failure affects this set of services, so that is my scope. Then, if I am going to inject this failure, what are the metrics I need to observe in order to see what is happening? Observed metrics could be my response times, my throughput, whether they drop below some threshold, some kind of metric that you are using. So you identify certain metrics, and then you notify the organization: hey, I am running an experiment. It shouldn't come to them as a surprise, "what the hell happened"; you obviously tell them ahead of time. Then you actually run the experiment, you collect the metrics, you analyze the information, and if things work as expected, good, you increase the scope; that is the automation cycle you have. If things don't work, you have identified a failure: fix it, repeat the experiment, and continue. So this is the experimental cycle: pick a metric, define what is normal, inject a failure, and observe the difference. Here you have two groups, a control group and an experimentation group, the way it is done in medicine: you are testing a new drug, so you have a control group and an experimental group, let's say you are testing on animals (I don't know if it is ethical to test on an animal, but whatever), and you observe the difference. Chaos engineering is the same kind of thing; the good thing is that you are not harming any animal or human in the process.
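That control-versus-experiment comparison can be written down very directly. A minimal sketch; the `error_rate` probe is hypothetical and would read whatever metric you picked from your monitoring system:

```python
def error_rate(group):
    """Hypothetical probe: read the chosen metric for one group of instances
    from your monitoring system (Prometheus, CloudWatch, ...)."""
    raise NotImplementedError

def run_experiment(inject_failure, tolerance=0.02):
    """Compare a control group against the group receiving the injected failure."""
    baseline = error_rate("control")
    inject_failure("experiment")            # e.g. kill an instance, add latency
    observed = error_rate("experiment")
    if observed - baseline <= tolerance:
        print("hypothesis held -- increase the scope next time")
        return True
    print(f"weakness found: error rate rose from {baseline:.3f} to {observed:.3f}")
    return False
```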
Then you observe: if everything works fine, good, you increase the scope and repeat; if things don't work and there is a discrepancy, you analyze the results, you have identified a weakness, you fix it and repeat the experiment. That is the experimentation cycle.

Chaos Monkey we have talked about repeatedly; the source code for Chaos Monkey, showing how they did it, is actually available, the link is included, so you can go and see it, and they have good documentation on it as well. Chaos Monkey is the most basic form of this kind of testing: pick an instance, switch it off, observe the effect, and then observe whether your desired capacity gets recreated or not. If not, you have identified a weakness; if yes, everything works fine, and at least you know what will happen when an actual server failure occurs. There are different kinds of experiments; this is obviously not exhaustive, but these are some of the things you could potentially be facing. One thing to keep in mind is that it is not always a binary yes or no, something failed or something passed; sometimes it is just about latency: my network works, but it is slow. Or I have Kafka, a whole set of brokers, producers, and consumers, and one topic accidentally got deleted. There is always the concept of partial failure, and again, it goes back to the statement I made that the information you generate from chaos engineering is a little bit subjective in nature: it is not always a yes or a no, there are numbers involved, there is a grey area in between. So you have all sorts of experiments: let's say your function threw an exception, were you able to recover from it, or at least fail gracefully; injecting latency; Kafka failures; time travel; whatever you can think of.

The Simian Army is a set of tools, Chaos Monkey, Chaos Kong, and a whole bunch of others, that all work together to try to disrupt your system, and if you can survive this army, then you are happy, because you know that when an actual outage happens you will be fine. There is again a GitHub link for Netflix, so you can check it out and see what the Simian Army is all about.

Now, adoption in an organization. This is actually a little bit of a challenging thing, and why it is challenging is because there is a mentality shift: you are actually doing something live in production that can impact revenue, so it is hard for executives to buy into it. When the idea first came up, nobody was buying it, until they actually faced the issue of what happened to Netflix during that Christmas break. Adoption is also a little complicated because in an organization you have, say, fifty teams; some feel they are ready, some feel they are not. So typically adoption is not a yes-or-no game, and there are strategies around it. Some teams will say, we are willing to participate, but we don't want you to do anything in production, we are fine with the pre-production environment and that's about it; some will say, we are not ready, sorry; and some might say, we are confident, everything is good. So typically what happens is that there is a kind of configuration you have to create. When my team was working on chaos engineering, they developed all these tools, Chaos Monkey, Chaos Kong, and so on, and then we had a set of configurations for each team so they could configure which services they opt in and which they opt out. We monitored it, in the sense that you shouldn't opt everything out, or rather, you can opt everything out, but if you do it for long, then something is wrong: you are not building in resiliency. That is the configuration aspect of it: teams should be able to opt in and opt out.
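That opt-in/opt-out mechanism is really just configuration that the tooling consults before touching anything. A sketch of what such a per-team configuration could look like; the team and service names are made up:

```python
from datetime import date

# Hypothetical per-team configuration: which services may be experimented on,
# in which environments, and since when a team has been fully opted out
# (so that long-standing opt-outs stand out).
CHAOS_CONFIG = {
    "payments":        {"opt_in": ["cart-api"],
                        "environments": ["staging"],
                        "opted_out_since": None},
    "recommendations": {"opt_in": ["recs-api", "recs-worker"],
                        "environments": ["staging", "production"],
                        "opted_out_since": None},
    "billing":         {"opt_in": [],
                        "environments": [],
                        "opted_out_since": date(2019, 1, 15)},
}

def may_experiment(team, service, environment):
    cfg = CHAOS_CONFIG.get(team, {})
    return service in cfg.get("opt_in", []) and environment in cfg.get("environments", [])

def long_opt_outs(today, max_days=90):
    """Flag teams that have opted out of everything for too long --
    they are probably not building in resiliency at all."""
    return [team for team, cfg in CHAOS_CONFIG.items()
            if cfg["opted_out_since"] and (today - cfg["opted_out_since"]).days > max_days]
```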
Then you need to automate these experiments: they shouldn't be one-offs. You can probably start with one-off experiments, but eventually, as your practice becomes more sophisticated, you want automation. And you have to be willing to experiment in production. Why? Because that is your real environment; as we discussed, you want to test your interactions with your actual environment. Even if you do it in pre-production and you think you are fine, you are making the assumption that pre-production is identical to production. In my opinion you can get close, and if something is wrong in pre-production you can definitely identify it and fix it there, so you don't have to go to production in that case, but you can't ultimately skip experimenting in the production environment.

Then there is the concept of game days. A game day is like a fire drill: your manager tells you there is going to be a fire drill, the kickoff is at 10 a.m., here are the emergency exits (there are actually two emergency exits in this room too, by the way, in case you didn't notice), everybody is supposed to assemble, and what you are doing is simulating an actual outbreak of fire. Game days are a practice that companies adopt to do that kind of simulation. In my company we usually have teams competing with each other over whose service is more resilient, and there is usually prize money for the winning team, or a dinner outing or something, but a game day is basically a collaborative exercise where everybody works together to simulate failures, to understand the system, and to build resiliency. There are obviously controls you have to build around it: you have to have success criteria, you have to have abort criteria in case things go out of control, you obviously need dashboards for monitoring, some metrics, and so on, and you need to really plan it: these are the failures we are going to inject, this is our rollback plan if things go wrong, these are the success criteria, and so on. It usually involves everybody, right from the C-level down to the individual engineers; it involves the whole organization, and it is a great team-building opportunity, you learn about the system and share knowledge.
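A game day really only works if all of those controls are written down and agreed before anything is injected. A sketch of what such a plan could look like as data; every value here is illustrative:

```python
# Illustrative game-day plan: agree on all of this before injecting anything.
game_day_plan = {
    "date": "2019-06-14 10:00",
    "scope": ["checkout-api", "inventory-service"],   # services in play
    "failures_to_inject": [
        "kill one checkout-api instance",
        "add 300 ms latency between checkout-api and inventory-service",
    ],
    "success_criteria": "p99 latency < 500 ms, no failed checkouts",
    "abort_criteria": "error rate > 5% for 2 minutes, or any data loss",
    "rollback_plan": "stop the injection, scale checkout-api back to 6 instances",
    "dashboards": ["latency", "error rate", "queue depth"],
    "participants": "whole team in one room, incident commander assigned",
}
```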
Next, the chaos maturity model. The chaos maturity model basically measures your adoption of chaos engineering, or how sophisticated a chaos engineering practice you have in your organization, and there are two dimensions to it: a dimension of sophistication and a dimension of adoption. I am going to examine each of them in detail.

Sophistication simply means how advanced your practice of chaos engineering is, and typically there are four stages: elementary, simple, sophisticated, and advanced. In the elementary stage you start very simple: you don't really run anything in production, you might run on your local machine, at best on staging, the process is very manual, there is no automation involved, and the experiments themselves are simple things like disk failures, restarts, network failures, and so on. Once you feel confident, you move to the simple stage, which is still relatively basic but better than elementary. There you have more sophistication: you have experiments that run against production-like traffic, so what you could potentially do is not so much divert production traffic as take an actual production traffic load and replay it against your system and see what happens. In the simple stage the results are still somewhat manually aggregated and analyzed, but you are running against production traffic, and you have a wider range of events; you start to bring cascading failures into your experiments as well. Going up from simple you have sophisticated, where experiments actually run in production, there may be frameworks supporting them, but you might still have to do some things manually, manual terminations and so on. And finally you have advanced, where experiments run in each stage of your system: you typically have a local dev environment feeding your build pipeline, the pipeline has one or more pre-staging environments, a unit testing environment, an integration environment, whatever, and you start incorporating chaos in pretty much all of them. Most of it is automated, and experiments start having a dynamic scope, that loop of starting with a small blast radius and increasing it gradually.

The other dimension of the chaos maturity model is adoption: how much is your organization adopting it? You start "in the shadows", as they like to call it: you have a few systems covered, and this is usually how chaos engineering starts, some teams say we should do something about it, there is no organization-wide mandate, they are early adopters and they might run some infrequent experiments. Then you move into something like investment, where at least higher management buys in, this is something we should be doing, and slowly you have some resources dedicated to it, you might have multiple teams involved, and so on. Gradually you move to adoption, where you now have a dedicated team that works on chaos engineering, and this is where most of the sophisticated organizations are; you have a more complete incident response process, so if something happens the person on call is notified, and critical services are now starting to practice it. Finally you have cultural expectation, where all critical services are expected to be tested, chaos is a regular part of the engineering development cycle, it is a regular part of onboarding, and participation is the default behavior: you should not opt out unless you have a very good reason to do so.

There are a lot of organizations that practice this. Netflix obviously started it, a colleague of mine does it at Microsoft, my previous employer Jet.com, which was acquired by Walmart, does it, and there is a whole bunch of companies doing it as well. Obviously not everybody is at the same level: some have adopted it less, some more, but it is actually becoming a mainstream discipline now.
There are plenty of resources. For the basics you can just start with the Wikipedia page, then principlesofchaos.org, which is like the official chaos engineering page. If you want to implement the tools, there is the Chaos Toolkit, which basically gives you Chaos Monkey-style libraries that you can use to build your own tooling. ChaosIQ is a consulting firm run by a guy named Russ Miles and associates; he is also a very frequent speaker at many international conferences. Then there is a company called Gremlin; the Gremlin folks were people at Netflix who worked on chaos engineering there, decided to leave, and commercialized these ideas, and that is what Gremlin is. There is also an O'Reilly book on chaos engineering.

So, heading towards the conclusion: the environment is something you cannot ignore, meaning your interactions with the environment need to be tested, and failure is something that is bound to happen. So rather than flying blind or saying failure can't happen, embrace it, know what happens, and know how to deal with it. For the best resiliency you need to continuously run these experiments in production, and what chaos engineering ultimately does is build confidence in your system: you know your system can withstand failures, and you know it can degrade gracefully where and when necessary. Thank you.