Hello all, today we are going to see what chaos engineering is, how to do chaos engineering with LitmusChaos, and how to use chaos engineering to test the resiliency and reliability of a system. Let me first quickly introduce myself: I am Ruturaj Kadhikar, and I am working as a senior SRE at InfraCloud Technologies.

As you can see, these are some famous companies, and you might be wondering what is common to all of them. Let me answer that: these companies have all faced major outages in the past few years. Now let us look at some statistics around outages. According to the latest Uptime Institute outage report, these are the ten major outages in 2022 and 2023. Going with the latest one, the Federal Aviation Administration faced a major outage due to a configuration issue which caused all US flights to be grounded. Most flights were cancelled or delayed, and it incurred a major business loss.

Resiliency is the ability of a system to withstand failures and recover from them, and testing it helps you find the weak failure domains in your system after executing the resiliency test. Reliability is getting consistently stable results over a period of time without facing any issues. With reliability, you can achieve higher SLAs and accordingly set higher SLOs for your business. Another metric for reliability is MTBF, the mean time between failures: the higher the MTBF, the higher the reliability of your system.

Now there might be a question: why test resilience at all? We have seen that there are outages and we have to resolve them, but one can claim, "I don't face any outages, so why test resilience?" The answer is that it helps avoid downtime. You may not have faced downtime so far, but the possibility is always there. Testing resiliency gives us a correct understanding of the overall behavior of the system when it is subjected to failures, and by mitigating these failures we eliminate the system's weaknesses, thereby minimizing unexpected issues in production.
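To make MTBF a little more concrete, it is usually paired with MTTR (mean time to repair); a common back-of-the-envelope relation from general reliability engineering (not specific to any tool in this talk) is:

```latex
\text{MTBF} = \frac{\text{total operational time}}{\text{number of failures}}, \qquad
\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
```

So driving MTBF up (fewer failures) or MTTR down (faster recovery) both push availability, and hence your achievable SLA, higher.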
This will increase the mean time between failures, which will increase the reliability of your system, thereby helping you achieve the agreed SLAs. It will also enhance the end-user experience: sometimes your system is working correctly, but some intermittent issues may affect the system's responses and degrade the user experience. Lastly, many industries and regulations require that systems have a certain level of resilience and uptime. You can use this testing to ensure your system meets these requirements and avoid potential legal or regulatory issues.

So next, why test resiliency in Kubernetes? We already know that Kubernetes operates in high-availability mode, so why specifically go for this test? With a microservices application hosted on Kubernetes, the underlying architecture may become complex and critical. With a lot of interconnected services, any minor issue can have a domino effect and turn into a disaster. It is a distributed system, with many people working on different pieces that all need to be put together; there may be human errors, or people sometimes not following best practices, which may lead to failures. And lastly, Kubernetes is evolving at a rapid speed. There might be incidents where some APIs get deprecated and that goes unnoticed, causing failures in your production.

Now, since we have talked about how to test resiliency and what failure domains are, let's deep dive into the failure domains. We have seen earlier that failure domains are the critical areas which can cause major issues in your system. So what are the failure domains in Kubernetes? First is the network: some latency, packet loss, or jitter in the network can cause major outages in your system. Next is pod crashing. We know that sometimes a pod crashes abruptly; it is ephemeral in nature.
So after its lifecycle a new pod comes up, or the pod gets stuck in CrashLoopBackOff. There might be init containers which get stuck in their respective processes, and that may affect the overall scalability of your system: the pod may not be able to scale properly. There might be issues with the image registries, where a particular image cannot be pulled into your Kubernetes cluster from the image repo. There might be issues with node processes like the kubelet or the container runtime; sometimes the kubelet or the container runtime may abruptly stop working on a particular node. There might be issues with the nodes themselves, like abrupt termination of nodes, or resource saturation on a particular node in terms of compute or storage; disk-full errors may cause issues in your system.

Then there is the issue of load patterns. Let's say you need to test your system with burst or spiky load patterns, so that you understand the behavior of your system when such loads are subjected to it. Lastly, we can categorize configuration or human errors, like randomly changing configurations or environment variables, and service dependencies: a particular service is not able to resolve a dependency, or one service depends on another service that may not be reachable. Like this we can categorize the failure domains inside Kubernetes, and accordingly we can create the chaos for the system in Kubernetes.

Now, we know that applications are not just hosted in Kubernetes; failure domains can be beyond Kubernetes, and the first category there is databases. Many companies host databases outside of Kubernetes, in the cloud or on-prem. So what are the failure domains for these? First is network partitioning: there might be an issue with your database cluster.
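The in-cluster failure domains above can be exercised declaratively. As a minimal sketch (the target namespace and label are placeholders for your own application, but the `ChaosEngine` kind and the generic `pod-delete` experiment are part of LitmusChaos), a pod-crash test might look like:

```yaml
# Sketch of a LitmusChaos ChaosEngine that kills pods of a target deployment.
# appns/applabel are placeholders for the application under test.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-crash-test
  namespace: litmus
spec:
  engineState: active
  appinfo:
    appns: my-app          # namespace of the application under test (assumed)
    applabel: app=my-app   # label selector for the target pods (assumed)
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # how long to keep injecting, in seconds
              value: "30"
            - name: CHAOS_INTERVAL         # gap between successive pod kills
              value: "10"
            - name: FORCE                  # graceful (false) vs forced deletion
              value: "false"
```

Other failure domains follow the same pattern: swap `pod-delete` for experiments such as network latency, CPU/memory hog, or container kill, tuning their env values the same way.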
Let's say there are three nodes in your cluster and one particular node goes out, creating a network partition. This may create data inconsistency, a split-brain scenario. It may have a cascading effect on the entire system: once such a node is taken out, recovery of the data gets very complex, and it may impact replication. So you need to address these issues whenever any network-partitioning incident occurs. Next is time travel, again sort of connected to the network, but caused by NTP synchronization issues. Again, it will create data inconsistency; it may create security vulnerabilities; it may create event-synchronization issues due to incorrect timestamps; and there might be issues with log analysis and debugging. This in turn will impact legal and compliance matters. You also need to check latency and packet loss on your databases and see what their impact is.

There might be issues with access, which can be categorized further: incorrect credentials (are you able to connect to the database with incorrect credentials?), authorization bypass, the effects of expired tokens or expired credentials, and many more access issues like permissions. So you need to categorize them as the access failure domain. There might again be node termination, as we have seen; it can create a network partition, or the issue may be smaller, but you still need to address it and observe the system behavior accordingly. You need to put different types of load patterns on your databases and ensure that the read and write cycles are working properly.

The next category of failure domains beyond Kubernetes is cloud services. The one main issue with cloud is instance termination and restarts. There might be random or abrupt instance terminations, and you need to evaluate their impact. Let's say a new instance is coming up and it takes around one or two minutes.
Is your system able to cope with this short time span? There might be huge traffic during that time, and those two minutes can cause you some business loss. Next is security group or NACL configuration. There might be accidental human errors while configuring security groups and NACLs, which may cause huge business losses because communication is directly hampered there. You can check load balancers: inject high load onto them and check whether they are able to cope with it. In the AWS scenario you can check whether the target is healthy, and if it is not healthy, what the impact on your system is, and what the impact of draining those targets is. Lastly, you can simulate an AZ-down scenario, considering that your application is hosted in a completely HA mode. Let's say your application spans multiple availability zones; you take down one particular availability zone and measure the impact on your system. Ideally there should not be any impact, because your application is hosted in HA mode, but still there might be some issues, and you need to evaluate them beforehand.

Now we have seen the basics and the context around resiliency and reliability, and how chaos engineering plays an important role in testing them. Let's take a step further into chaos engineering and how to do this work with it. Let's start with the principles of chaos engineering. The first thing is that you need to hypothesize about the steady state. How your system responds on a normal day, with normal traffic, can be considered its steady state. Once you know your steady state, you need to identify the failure domains.
Identify where things can go wrong, accordingly create the chaos scenarios, run those experiments in your system, and then verify whether your hypothesis and the practical result match. If there are any differences, try to mitigate them and improve your system. You can start with a minimum blast radius first and slowly increase it, so that whenever any unexpected issue comes into your production, you know what solutions to implement and you can minimize the overall blast radius around it.

What tools are available for chaos engineering? There is LitmusChaos, Gremlin, Chaos Monkey, Chaos Mesh, and you can also use AWS FIS. All these tools mainly do the work for you: you create the chaos and inject it using them. But for this talk, and personally, here is why I feel LitmusChaos is a good choice. First, it is open source, so anybody can use it easily. Second, you can use it in a centralized or distributed way. One use case which I found very helpful: if there are multiple accounts in your organization and you need to execute chaos in a centralized way, say one central account and multiple spoke accounts, you can do that easily with LitmusChaos. Litmus has agents which you can deploy in different environments and execute the chaos there. Next, it is flexible: whether it is scoring the chaos scenarios or designing them, it is very flexible and easy to use. And lastly, the other use case which I personally found good is that it has a good integration with AWS SSM. In AWS you can write scripts around whatever functionality you want to execute, create a Systems Manager document around that, and then run that document. Litmus can integrate easily with such a document, so you can induce chaos in your AWS accounts as well.
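To make that SSM integration concrete, here is a sketch of what such a Systems Manager command document can look like. The `schemaVersion`/`mainSteps`/`aws:runShellScript` structure is standard SSM; the security-group ID and the specific rule are placeholders:

```yaml
# Sketch of an AWS SSM command document that injects a config-change chaos:
# it revokes one ingress rule from a security group. The group ID is a placeholder.
schemaVersion: "2.2"
description: "Chaos: revoke an ICMP ingress rule from a security group"
mainSteps:
  - action: aws:runShellScript
    name: revokeIngress
    inputs:
      runCommand:
        - |
          aws ec2 revoke-security-group-ingress \
            --group-id sg-0123456789abcdef0 \
            --protocol icmp --port -1 \
            --cidr 0.0.0.0/0
```

Anything the AWS CLI or a shell script can do on your instances can be wrapped in a document like this and triggered as a chaos step.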
So for whatever failure domains beyond Kubernetes we have seen, if you have an AWS account, Litmus can be very useful in those scenarios too. So we have seen what chaos engineering is and how to use it to increase the resiliency and reliability of a system: you created the experiments around chaos and executed them. But chaos engineering and resiliency testing are a cycle, not a one-time activity. Resiliency testing should be periodic in your organization. You can have a resiliency framework along the points we have discussed earlier: define a steady state, form the hypothesis, execute chaos, verify the steady state, look at the difference between what you hypothesized and what you got in practice, create reports, mitigate those problems, then define the steady state again with a new vision, and again create the hypothesis and the experiments. In this way you can minimize the outages happening in your system by minimizing the unexpected failures in your production.
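The "verify the steady state" step can even be automated per experiment: LitmusChaos supports probes attached to an experiment, so the verdict itself fails if the steady state breaks during chaos. A sketch, assuming a front-end service answering on port 80 (the URL and timing values are placeholders, and exact probe field names can vary across Litmus versions):

```yaml
# Sketch: an httpProbe attached to a Litmus experiment to assert steady state
# (the service keeps answering 200 while chaos runs). URL and timings are assumptions.
experiments:
  - name: pod-delete
    spec:
      probe:
        - name: frontend-availability-check
          type: httpProbe
          mode: Continuous           # keep checking throughout the chaos window
          httpProbe/inputs:
            url: http://front-end.sock-shop.svc:80/
            method:
              get:
                criteria: ==         # compare the observed response code
                responseCode: "200"
          runProperties:
            probeTimeout: 5
            interval: 2
            retry: 1
```

With probes in place, the experiment's pass/fail result encodes the hypothesis check, which is what makes the periodic framework reportable.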
Finally, you can have resiliency scoring. Let's say there is a pod-crash chaos, and a new pod can be spawned if one particular pod is terminated; you can score this chaos with minimum points, and whichever chaos you introduce that has a greater blast radius, you can score accordingly. You can have game days, where on one particular day you execute chaos and let the other teams resolve the issues, so that the system knowledge in your team increases. You can have periodic resiliency checks and reporting. You can have resiliency checks in your CD pipelines: let's say you are shipping a new release; you can have a chaos experiment integrated with your CD pipeline which runs chaos against the new release, so you can see beforehand what the failures or the effects of the release are. And lastly, it will improve your observability posture: as we have seen earlier, you will be able to gauge the impact of the chaos, and if there is something wrong in your system, you can easily and quickly find out what is going on.

Now let's see how to run these chaos experiments with LitmusChaos in practice. Let's go through the setup. I have set up a small EKS cluster in AWS, and I have mainly three namespaces across which I have divided all the applications. The first is the litmus namespace. Let's get the pods in the litmus namespace: as you can see, Litmus is deployed there, and I have used the Litmus Helm chart to deploy the Litmus stack. With this Helm chart you deploy the Litmus control plane and the Litmus agent. This agent is deployed on the same cluster as the control plane, hence it is called a self agent; whenever you deploy the agent outside of that cluster, it is called an external agent. The next namespace is the
Prometheus stack: as we need to observe whatever chaos we are doing, in this namespace I have deployed the Prometheus stack, including Grafana. And for test purposes I have used a microservices demo application, Sock Shop, deployed in its own namespace. This is the bare-minimum setup I have done for this demo.

Now let's look at LitmusChaos itself. Whenever you log into LitmusChaos you see this UI, which is called the Chaos Center, and here you can see chaos scenarios. A chaos scenario is nothing but chaos experiments bound together: a scenario may contain one experiment or multiple experiments. So if you see, this is a chaos scenario, and this is one experiment which I have executed inside that scenario. Then you can see the delegates: as I mentioned earlier, since the agent is running in the same cluster as the Litmus control plane, it is called a self agent. Then you have ChaosHubs, where you have predefined templates of the chaos experiments you want to execute: templates for AWS SSM and Azure, for Cassandra, for deletion of CoreDNS, some experiments around GCP, and then the generic ones like pod delete, a template from which you can delete any particular pod, or kill any container, increase CPU or memory, or introduce network corruption. So all the failure domains which we have seen previously in this talk, we are able to exercise with LitmusChaos. This is the ChaosHub containing all the chaos templates. Then there is a section for analytics: Litmus provides its own analytics for whatever chaos scenarios you have executed, and you can also see statistics (I think there is some issue here): how many users there are, what the projects are, how many chaos
delegates there are, what the total chaos experiment runs are, what the chaos scenarios are, and when those scenarios are scheduled; accordingly you can get all the statistics. Let's just visit analytics once again. Here you can get all the analytics: the number of runs, schedule stats, how many experiments failed, and what the success ratio was. If you go to the statistics of any one particular scenario, you can see them there. For this experiment I had given 10 points and it passed; you can see the resiliency score over here. For each experiment, as we discussed earlier, you can give a particular score: for pod crashing you can give a smaller score, for CPU or memory chaos you can give a higher resiliency score, and accordingly you get the statistics around those chaos scenarios or experiments here.

For this demo we will target two things: creating one chaos scenario from the ChaosHub and executing it inside the cluster itself, and one chaos scenario we will induce beyond Kubernetes, in the AWS account. Whenever you want to schedule or execute a new chaos scenario, you click here on Schedule Chaos Scenario. Then you choose the agent: if it is a different cluster, there will be an external agent; you can select that particular agent and proceed further. We will select a particular experiment from the ChaosHub and name it "memory". If you click Next, you will see that you need to add the experiment to this scenario, so we will add the experiment for memory hogging and take this particular template, the generic pod-memory-hog. Now the good part is that from here itself you can tune your experiment: where you want to induce the chaos, and on which particular pod you want to increase the memory. Let's say I am going with the sock-shop namespace and I am taking the deployment, let's
say catalogue. Okay, then you can click Next. If you have any health checks or probes, you can mention them here. Then you can tune the memory consumption or the total chaos duration for which the memory should be increased; for demo purposes I am reducing that, down to maybe 30 seconds. If you want to run this experiment's pods on a particular node, you can use the node selector here and provide its value, but right now we will not go for that. We just click Finish, and we choose to revert the schedule: whatever pods get scheduled for this chaos scenario in your cluster, after the scenario is executed successfully it will clean up those pods. That's why we click on Revert Schedule here, and then Next. In this step you can assign the points; let's say we give eight points for this experiment, and then we click Next. We want to schedule it now, and we click Finish. Here you can see that the chaos scenario is running; we'll click Show the Chaos Scenario, and this scenario is in progress.

Meanwhile, let's log into Grafana. Let's see if the dashboard is in place; we will open the dashboard. You can see the basic memory statistics in this dashboard for the Sock Shop applications: for catalogue you can see the CPU usage and memory usage, and similarly for payment, user, and front-end, whatever we have plotted here. Let's go back to the status of our chaos scenario: you can see that first it has installed the chaos experiments. Whenever it installs a chaos experiment, it is nothing but deploying all the custom resources for that particular chaos scenario, and then in the next step it triggers those custom resources. Now here we can see the pod-memory-hog is in progress, and we will be able to monitor that on the catalogue dashboard, so you can see
the memory has increased a lot. Since it is a demo application, we have not increased the memory too much, but whatever chaos we induced, we are able to observe. So the first thing is that you should have your observability solutions in place. Practically, if you want to map it: if the memory increases, I should have an alert that for this particular pod the memory has increased. With chaos you also need to identify the gaps in your observability. You can see the chaos scenario succeeded and it has reverted all the chaos pods as well; you won't be able to see any new pods here, as all the custom resources which were created for that experiment have been cleaned up. You can see the memory spike over here.

Let's look at the next scenario. The scenario is this: I have created a test instance in AWS, and I am able to ping that instance. With our experiment we will change the security group, and this ping should no longer succeed. The idea behind this is that you can change configurations and check their impact, and check whether you have alerts or observability in place: if somebody changes the security group by mistake, is it quickly noticeable to you, so that you can minimize the downtime caused by that change? With this example, let's start our experiment. For this we will use a different approach: I have taken the AWS SSM template and modified it, so let's see how we can use that. Again we are scheduling a new chaos scenario with a new experiment (I think this is slow), but this time we will import a chaos YAML. I have created a YAML; the workflow is in that YAML, which we will apply here. Whenever you have to induce this kind of experiment in AWS, there are mainly two steps, as we
have seen: for whatever chaos you want to execute in AWS, you have to write SSM documents. What is an SSM document? Let's say you go into Systems Manager and you write a script for whatever you want to do in the cloud, in the form of a document. If you click on Documents over here, there are predefined documents given by AWS; you can refer to those and create your own documents, which is what we did. So this is my document, test-chaos-through-SSM, through which we will change our security group configuration. For this demo we are keeping the design basic and minimalistic, and you can see that I am just revoking one ingress rule in the security group. This is how an AWS SSM document looks.

What we will do is put this SSM document inside a ConfigMap and pass it to Litmus. How that is done: you can see I have created a revoke-security-group ConfigMap; it is just a ConfigMap, and in the data section I have pasted my SSM document, and then I have applied the ConfigMap with the kubectl apply command. Next, once the ConfigMap is in place, we have to design a workflow. I have utilized the template which was there and modified it a bit. As you can see, this is the workflow, and this scenario has three steps: as we have seen earlier, it will install the experiment, then our main experiment will be executed, and then the chaos will be reverted. What changes do you have to make? If you look at the workflow, it has all the custom resource creation; you can see the resources which are created here: the ChaosEngine, the ChaosExperiments, and the ChaosResult (from this resource Litmus showcases in the UI what the result was). And here is the ChaosEngine; I have passed my ConfigMap over here, you can see litmus revoke
security group, and I mounted this ConfigMap in the workflow. The second change I made is that I have given the document name, revoke-sg, and specified its path, and I have specified the instance. Let's revisit the instance once again so that we are sure the instance we have taken is correct; let's copy this instance ID and paste it over here, so that when running that particular SSM document it will target this particular instance. And then in the last step you can see it is deleting the ChaosEngine from the litmus namespace, so the revert-chaos step is the third step in this workflow.

We will take this chaos scenario and upload it into the Chaos Center. We just select our workflow, click Next, rename it as revoke-sg-new, and schedule it now. You can see the code is fine; if there were any issues with the YAML, it would show over here that there is a linting issue or something. Let's click Finish. If we go to the chaos scenario, we can click on Show the Chaos Scenario and see that it is in progress. Now it is installing the chaos experiments; we can see that the new pods are getting created for that particular chaos experiment. You can see there is a new pod, revoke-sg-new. If I get the chaos engines, you can see the ChaosEngine running over here; it shows aws-ssm-chaos-by-id because we have used that particular template. The experiment is in progress now; let's see whether it is in action or not. Behind the scenes it is executing this AWS SSM document, and whenever an AWS SSM document is executed, it uses Run Command. We click over here on Run Command, and you can see it either in the commands or in the command history. This is the revoke-sg command as of now, and we can see that it succeeded. You can check the output: it has written
true. And we can verify it by checking the security group: you can go to the security group and check; let me go to the instance first. You can see there is no security group rule that will allow ICMP packets, and now let's check whether we are able to ping or not. You can see that we are not able to ping: the rule that was there has been removed, and the impact is that there is no connectivity as of now. This is how you can execute any scenario in your AWS account, maybe instance deletion, or whatever we have discussed; if your databases are hosted in AWS, you can write an SSM document, put it in a ConfigMap, and using Litmus execute that particular chaos. Coming back to our analytics: these are the runs that have completed, and you can see the chaos result; this latest one is awaiting its result. In this way you can execute any chaos inside an AWS account using an AWS SSM document and Litmus.

That's all for this talk. We have seen what chaos is, what resiliency is, and how to increase the reliability of your system using chaos engineering and resiliency testing. We have seen LitmusChaos and why it was useful for our use case: firstly because it is open source, it is flexible, it has a centralized approach for executing chaos into multiple accounts, and mainly it has the integration with AWS SSM documents, through which we executed chaos both inside Kubernetes and outside of Kubernetes, in AWS. So that's it, thank you.