Hello everyone, welcome to this session on chaos workflows with Litmus and Argo at KubeCon North America 2020, virtual edition. This is Uma Mukkara, co-founder and chief operating officer at MayaData, and I'm also one of the maintainers of the LitmusChaos CNCF project. In this session, I'm going to start with an introduction to the Litmus project, and then Sumit is going to talk about Argo and do a quick demo of constructing a chaos workflow using Litmus and Argo. Let's start with Litmus. Litmus is a CNCF project; it was accepted as a sandbox project early this year. Our mission is to help Kubernetes SREs and developers practice chaos engineering in a cloud-native way. MayaData, Intuit, and Amazon are the current maintainers of this project. Chaos Hub is an asset of the project where all the chaos experiments are hosted together for the community. So why chaos engineering for Kubernetes? It's because your application's resiliency depends on a host of other cloud-native services. In fact, 90% of the resilience of your application depends on some other cloud-native service or on Kubernetes itself. How do you achieve resiliency? The answer is to practice chaos engineering. What is chaos engineering? It is the practice of introducing a random fault into a system that is running at steady state and then observing whether the steady state is regained or not. If yes, the system is resilient. If not, you have found a weakness. So how do you do this on Kubernetes? You follow Kubernetes-native principles, where the lifecycle of such an experiment is run using the chaos operator and the management is done using a set of custom resources. In this case, Litmus provides a ChaosEngine CR, a ChaosExperiment CR, and a ChaosResult CR. Using these custom resources, you can run a chaos experiment in a cloud-native, totally declarative way. So how do you run this chaos and practice chaos engineering at scale?
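As a sketch of the declarative model described above, here is how a minimal ChaosEngine manifest might be generated. The field names follow the `litmuschaos.io/v1alpha1` schema, but the application namespace, labels, and service account name are illustrative, not taken from the talk:

```python
import json

def chaos_engine(name, app_ns, app_label, experiment):
    """Build a minimal Litmus ChaosEngine manifest as a Python dict.

    The ChaosEngine CR ties a target application (appinfo) to one or
    more ChaosExperiment CRs; the chaos operator watches for it and
    launches the experiment against the matching pods.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": app_ns},
        "spec": {
            "appinfo": {
                "appns": app_ns,          # namespace of the target app
                "applabel": app_label,    # label selector for target pods
                "appkind": "deployment",
            },
            "chaosServiceAccount": "litmus-admin",  # illustrative name
            "experiments": [{"name": experiment}],
        },
    }

engine = chaos_engine("nginx-chaos", "default", "app=nginx", "pod-delete")
print(json.dumps(engine, indent=2))
```

In practice you would render this to YAML and `kubectl apply` it; the operator then records the verdict in a ChaosResult CR.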
For that, Litmus introduced chaos workflows. A chaos workflow is built on top of an Argo workflow: it uses an Argo workflow to run multiple experiments, which you can configure to run in sequence or in parallel. The Litmus chaos workflow then consolidates the results of those experiments to give a meaningful result to the user. Because the entire workflow is run and configured declaratively, you can practice chaos engineering using GitOps practices. Let's look at the Litmus chaos architecture. Litmus provides a Helm chart with which you can install the Litmus portal. Then you run Litmus agents on the different Kubernetes clusters where you need to run chaos experiments; this can also be the same cluster on which you installed the portal. Once you have that, you set up a chaos workflow, which is picked up by the chaos operator and results in the experiments running as per the flow defined in the workflow. That produces a set of chaos metrics and events, which are then uploaded to Prometheus for your analysis. From the Litmus portal, you can run chaos on different Kubernetes clusters, in a multi-cloud or hybrid-cloud environment as well. So with Litmus you can practice chaos engineering across your enterprise: it's not just a set of experiments put together, but a tool set that provides the entire infrastructure required for running chaos engineering at scale across the clusters in your enterprise. Let's also look at Chaos Hub. As you can see here, Chaos Hub contains a bunch of generic experiments and a set of application-specific experiments as well. So you can cover most of your chaos engineering with already-available experiments; 70 percent or more of the experiments you need are already there.
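The sequence-or-parallel behaviour described above maps directly onto Argo's steps model, where each inner list of steps runs in parallel and the outer lists run one after another. A minimal sketch, with illustrative experiment and template names:

```python
import json

def chaos_workflow(name, parallel_experiments, followup):
    """Sketch of an Argo Workflow spec that runs several chaos
    experiments in parallel, then a follow-up step in sequence.

    In Argo, `steps` is a list of lists: entries within one inner
    list run in parallel; the inner lists themselves run in order.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": name + "-"},
        "spec": {
            "entrypoint": "chaos",
            "templates": [{
                "name": "chaos",
                "steps": [
                    # first: all experiments side by side
                    [{"name": e, "template": e} for e in parallel_experiments],
                    # then: a consolidation / verification step
                    [{"name": followup, "template": followup}],
                ],
            }],
        },
    }

wf = chaos_workflow("litmus-demo",
                    ["pod-delete", "pod-cpu-hog"],
                    "verify-steady-state")
print(json.dumps(wf, indent=2))
```

Each referenced template would, in a real workflow, be a container step that applies a ChaosEngine and waits for its verdict.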
You construct the remaining experiments using the Litmus SDK, and if you think a new experiment is going to be useful for other users, you can also upstream such an experiment back to the Hub. Let's look at the list of chaos experiments that are available. Primarily, they are divided into generic and application-specific. You will see that there are currently about 22 generic chaos experiments, the most famous one being pod delete; that's the one everybody starts with when they try Litmus. Another important one is kubelet service kill, which is an important function to test when your Kubernetes is very much in production and serving at scale. These experiments cover a lot of variety: network, CPU, memory, disk, IO, services, and nodes in general. Using these generic experiments, you should be able to cover much of your chaos engineering needs. Another important aspect of Litmus is how easy it is to build new chaos experiments; we call it bring your own chaos. All you need to do is put your chaos logic into a container image, use the Litmus SDK to create an experiment skeleton, point the CR at your image, and then you are ready to use that experiment in a chaos workflow. How do you get started with Litmus? Litmus provides a Helm chart with which you get the Litmus portal, and you use the portal to select a predefined chaos workflow and run it on your choice of Kubernetes cluster. This is how the Litmus portal looks. Once you log in, you will have a set of predefined workflows. Here I am showing a set of workflows that are already scheduled and run; you can go and schedule a new workflow, choose from a set of predefined workflows, tune the weightage of the experiments that are part of the workflow, and then schedule it.
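The bring-your-own-chaos flow above boils down to a ChaosExperiment CR whose definition points at your own container image. A minimal skeleton, where the image name, command, and environment variable values are placeholders for whatever your chaos logic needs:

```python
import json

def chaos_experiment(name, image):
    """Skeleton of a Litmus ChaosExperiment CR that points at a custom
    chaos image. The CR's definition tells the chaos runner which image
    to launch and how to invoke the experiment logic inside it."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosExperiment",
        "metadata": {"name": name},
        "spec": {
            "definition": {
                "image": image,
                "command": ["/bin/bash"],
                # entrypoint of your chaos logic -- hypothetical path
                "args": ["-c", "python /experiment/run.py"],
                "env": [
                    {"name": "TOTAL_CHAOS_DURATION", "value": "30"},
                ],
            }
        },
    }

exp = chaos_experiment("my-custom-chaos", "myrepo/my-chaos:latest")
print(json.dumps(exp, indent=2))
```

Once this CR is applied, referencing `my-custom-chaos` from a ChaosEngine or a chaos workflow runs your image like any Hub experiment.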
When it has run, you can go and see how a given chaos workflow was executed and whether everything went right or not; it gives a good picture of the sequence of experiments that were run, and the Litmus portal also provides a good amount of analytics around these chaos experiments. As stated earlier, the mission is to help developers and SREs practice chaos engineering on Kubernetes in a Kubernetes-native way. With that introduction, let me hand this session over to Sumit.

Hello everyone. Uma has given an overview of Litmus, its architecture, and its plugin infrastructure, and we are one of the beneficiaries of that plugin infrastructure. I am Sumit Nagal, and today I am talking about chaos workflows with Litmus and how we are leveraging the workflow capability of Argo. A little bit about me: I work for Intuit, the proud maker of TurboTax, QuickBooks, and Mint. I have worked with Kubernetes and a lot of open source, mainly in Java and Python, for a long time, doing a lot of testing and performance work and leveraging cloud and observability platforms. I am leading a reliability engineering team. As part of that team, we are building a paved road on open source tools, providing infrastructure and reliability for any onboarding service. It has three pillars: chaos engineering, performance, and infrastructure. This team is helping the Intuit developer platform, which is building the next-gen Kubernetes platform for Intuit. We have thousands of developers and hundreds of clusters, with more than 2,000 services already onboarded, and it is growing by leaps and bounds. This is the overall architecture of the Intuit developer platform, where we are building a paved road for any onboarding service: we give teams a template so that they can onboard and build their specific functionality on this platform, which runs primarily on Kubernetes, leveraging AWS infrastructure. There is a lot of information available on it.
This is where we come in. We wanted to make this platform very robust, reliable, and stable, as well as able to scale. A little bit about my chaos journey: I've been working on chaos for the last couple of years. Initially, I worked mainly on the application side, where we were using the Hystrix circuit breaker as well as point solutions on AWS like Chaos Monkey and the Simian Army. We ran some game days, which brought a lot of awareness about chaos. Last year, we started working with Chaos Toolkit. We built a lot of use cases, specifically for the application, the cloud (our AWS platform), and Kubernetes. We added and enhanced many extensions, and later we wrapped it up as a service and created node-based use cases. During that time, several incidents happened, not on our platform, but on a few of our products in production, and that created awareness of how one would go about building resiliency for this. We then tried to see whether we could bring this work to Kubernetes, and we figured out that it was a little challenging: we could not, because of security compliance and because it was not Kubernetes-native. So we did an initial POC with Litmus. We have many use cases, more than 70, specifically on the application, cloud, and platform sides, and we did not want to rebuild everything from scratch. So, starting with a small POC and with help from the community, we built a plugin infrastructure where all our existing work is executed through a custom resource. Later, we figured out how this work could be added as part of CI/CD and workflows. We worked closely with the Argo Workflows team and built the solution. After that, we introduced performance and chaos side by side, executed via a Jenkins declarative pipeline.
This is the overall setup we have for the chaos design. We install the chaos operator, which comes as part of Litmus, along with a few CRDs, in one namespace, and we install Argo Workflows and the Argo controller in another namespace. We create many custom resources; these custom resources are what actually invoke our framework. With the right RBAC, we target specific application namespaces as well as the kube-system namespace. All the data is currently pushed to various monitoring and observability solutions, and the whole thing is executed by a Jenkins pipeline. Going one level down into how this magic happens: we have this framework, which is called through the custom resource. Once the custom resources are created, the chaos operator looks for them, and when we execute a chaos scenario, the chaos runner pod comes up, brings up the experiment, which is nothing but your container code containing the chaos logic, and finishes that execution. It pushes the data to Kubernetes events, and we have done an internal integration that pushes the data to our operational data lake. There is a session on the operational data lake by Amit and Vijay; please attend that. All the results are then pushed to the ChaosResult. By now you have an idea of the overall structure, so I will go one level down into how this framework works. We expose the framework through a custom resource. The custom resource uses a Python-based container image, and in that image we use the Chaos Toolkit framework, which has many extensions; we use existing ones as well as ones we built ourselves, and the experiment logic is written around a steady-state hypothesis.
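The namespace-scoped targeting described above depends on RBAC that limits what the chaos runner's service account can touch. A minimal sketch of such RBAC as Kubernetes manifests; the role name, service account name, and verb list are illustrative, not the exact ones used at Intuit:

```python
import json

def chaos_rbac(namespace, service_account="chaos-sa"):
    """Illustrative namespace-scoped RBAC for a chaos runner: a Role
    limited to pod operations in one namespace, plus a RoleBinding
    granting it to the runner's ServiceAccount."""
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "chaos-role", "namespace": namespace},
        "rules": [{
            "apiGroups": [""],
            "resources": ["pods"],
            "verbs": ["get", "list", "delete"],  # just enough for pod kill
        }],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "chaos-rb", "namespace": namespace},
        "subjects": [{"kind": "ServiceAccount",
                      "name": service_account,
                      "namespace": namespace}],
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                    "kind": "Role",
                    "name": "chaos-role"},
    }
    return [role, binding]

manifests = chaos_rbac("nginx-demo")
print(json.dumps(manifests, indent=2))
```

Targeting kube-system, as mentioned in the talk, would need a separate Role and binding scoped to that namespace (or a ClusterRole).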
The steady-state hypothesis runs before and after: you execute the chaotic operation in between, and afterwards the data is pushed to the operational data lake. In a nutshell, this is how it all fits together. We write these use cases, which are nothing but our chaos experiment tests. We put these tests into an Argo workflow. The Argo workflow is code that is checked into Git. Git is integrated with Jenkins; Jenkins picks up the code and, with the kube context, interacts with one of the namespaces and submits the Argo workflow. Argo executes the workflow, which checks that whatever experiment has been referenced in the workflow actually exists. The experiment, which has already been set up in that namespace, picks up the specific scenario and launches the chaos runner. The chaos runner uses our existing framework image and launches the experiment. The experiment picks up the code, uses the cluster-role RBAC, and targets a specific pod. Once the test is done, it delivers the result as well as the report to our operational data lake. Now, why Argo Workflows? You can of course execute everything with kubectl and YAMLs, but practically speaking, if you really want to execute many scenarios as part of a pipeline, that becomes very challenging; automation was one of the main drivers. With Argo Workflows, everything comes as one YAML, where we can just pass a parameter to argo submit to invoke a specific scenario, and everything is code. You don't need to maintain many different kinds of YAML, and since we have hundreds of clusters, rolling it out any other way would be very, very challenging.
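The before-and-after hypothesis check described above is the core shape of a Chaos Toolkit experiment file: the probes under the steady-state hypothesis run before the method (the fault) and again afterwards. A sketch of such a file as a Python dict, assuming the `chaostoolkit-kubernetes` extension; the endpoint URL, labels, and namespace are illustrative:

```python
import json

# Chaos Toolkit verifies the steady-state hypothesis, runs the method
# (the fault injection), then verifies the hypothesis again.
experiment = {
    "title": "nginx-demo survives pod deletion",
    "description": "Steady state is probed before and after the fault.",
    "steady-state-hypothesis": {
        "title": "application endpoint responds",
        "probes": [{
            "type": "probe",
            "name": "endpoint-is-up",
            "tolerance": 200,  # expect HTTP 200
            "provider": {"type": "http",
                         "url": "http://nginx-demo/health"},
        }],
    },
    "method": [{
        "type": "action",
        "name": "terminate-app-pod",
        "provider": {
            "type": "python",
            "module": "chaosk8s.pod.actions",
            "func": "terminate_pods",
            "arguments": {"label_selector": "app=nginx-demo",
                          "ns": "default"},
        },
    }],
    "rollbacks": [],
}
print(json.dumps(experiment, indent=2))
```

This JSON is what gets baked into (or mounted by) the Python-based container image that the custom resource references.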
Since we are using Argo, a lot of cost and resource utilization optimization is already done there, which actually helps with cost, and I think that is one of the more important things with the current pandemic going on. And because we have introduced chaos together with performance, we are getting more reliability: we are not only looking at the statefulness of the chaos, we are also getting the statelessness of the performance, and then we merge both of them. It helps us build a lot of complex scenarios which, practically speaking, are not possible by hand-writing kubectl commands and YAML. And because the whole execution happens in a very predictable manner, it brings a lot of trust and confidence in the overall setup. Last but not least, because it's code, self-service and onboarding are very, very easy, and because we are using a workflow, it becomes a complete lifecycle: whatever base state you started from, it finishes in that same base state again. Now, this is how the final chaos Argo workflow setup looks, where you can see that we are running the chaos as well as the performance execution. I will go to the demo. We have a repository where we keep the code base of all the YAMLs, and I will be executing two scenarios. In the first scenario, from this namespace, I am running chaotic operations against the application and against kube-system, and I will do it with the existing framework as well as through an Argo workflow. Right now, I have a namespace here with the nginx demo running. I will get these experiments added here, then I can validate that, and then I will apply the RBAC. Once I apply the RBAC, I can go to the execution, which is the chaos experiment using Chaos Toolkit with Python. By the time it has started, the runner pod has come up, and the runner pod initiates the container, which has the framework, which is here.
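The "one YAML, many scenarios" point above comes down to parameterizing a single workflow file at submit time. A small sketch of how such an `argo submit` invocation might be composed; the file name and parameter names are hypothetical, not taken from the demo repository:

```python
def argo_submit_cmd(workflow_file, params):
    """Compose an `argo submit` command that reuses one workflow YAML
    across scenarios by passing `-p key=value` parameters."""
    cmd = ["argo", "submit", workflow_file]
    for key, value in params.items():
        cmd += ["-p", f"{key}={value}"]
    return cmd

cmd = argo_submit_cmd(
    "chaos-workflow.yaml",
    {"appNamespace": "kube-system", "experiment": "pod-delete"},
)
print(" ".join(cmd))
```

In a Jenkins pipeline this command would be run with the right kube context, so the same checked-in YAML drives every cluster and scenario.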
This container takes certain actions based on how we have configured it. I have kept one experiment here. You can see that we are using one of the images from LitmusChaos, the Chaos Toolkit image. This is our custom resource, and here we have provided certain parameters. This is the experiment we actually use; we override those parameters and specify which specific experiment we are executing. So let's see. Here it has actually deleted the nginx demo pod, which is the other app; it brought it down, deleted it, and then the runner pod finished. With this, we are able to do the execution, and I have this demo script available at your fingertips. Now let's go to the other scenario, where we want to do the same thing but through Argo. In this scenario, we apply a YAML in which I have parameterized everything. The parameters say that I am attacking one of our kiam pods in the kube-system namespace, and this time I am passing a different Chaos Toolkit file. Then I just execute, and this launches the Argo workflow. Let me go and grab that. When the Argo workflow launches, it again does the same thing: the runner pod executes, and the runner pod's job is to initiate that container. Here you can see that the kiam pod has been terminated, then the chaos run finishes, and then we bring everything back to the same state. I also have the workflow available through the nice UI; this is how it is executing. Now everything is done. Next, I will execute the chaos test together with the performance test. That gives me the added value of seeing whether something happens to my endpoint during the chaotic operation. This time I am executing the test and it will impact the application pod, and here I am running the performance test side by side with the chaos test.
Here the chaos runner has started. You can see that the flow has started: the chaos execution has started and the test has also started; they are both going in parallel. I can go and look at the workflow. Here the pod delete happens, and the chaos execution as well as the performance test are both happening side by side. Now one chaos run has finished, and you can see that the test is still going and the application has been brought down. I can go and look at my logs here, which are from the performance test during this time. You can clearly see that at a certain point a pod was impacted. I will go and look at the other one; I am running two pods right now. So with this, you can see that we can go and identify all the problems in your application, and you can then build the resilience into that application where it is needed. Thank you very much.

Thank you, Sumit, for that wonderful demo. With that, we have come to the end of this session on creating chaos workflows using Litmus and Argo. Please do take a look at the GitHub projects of both Litmus and Argo. Thank you again for watching our wonderful conference. Till next time, see you. Bye-bye. Thank you.