Hello, and welcome to GitOpsCon North America 2021. I'm Uma Mukkara, a maintainer of the LitmusChaos CNCF project, and I'm also the CEO at ChaosNative. In this session, the goal is to talk about the role of GitOps in chaos engineering, and how we can use GitOps to scale chaos engineering itself in pre-production or production. In general, the idea of GitOps is to perform a set of operations whenever there is a change to an application. You can use the same principle to extend what you are doing to include validation testing of that change as well, and you can run chaos engineering whenever there is an application change on your existing deployments. That's the idea of extending GitOps to do validation testing, and more importantly chaos testing, so that you retain, or even improve, the reliability of your system whenever there is a change to your application or service. So before we go into how we apply GitOps to chaos engineering, let's understand a bit about what chaos engineering is. Uncertainty is closely related to reliability, and chaos engineering helps improve reliability by reducing uncertainty. It really comes down to this: if there is a fault, is there uncertainty about the service continuing to be available? If the service continues to be available, there is no uncertainty, and it is reliable. Faults do happen, for many reasons, in various dependent components. Your end service depends on a lot of other things, and a fault anywhere can cause uncertainty for your service. If uncertainty increases, your reliability reduces; if you need to increase reliability, you need to reduce uncertainty. That means your system has to stay up against a lot of potential failures.
So it's really about increasing your chances of uptime against these potential failures, and you use chaos engineering to willfully inject those failures and keep correcting what they expose. That's chaos engineering. In summary, you invest in chaos engineering to reduce failures, that is, to increase the mean time between failures (MTBF), and to decrease the recovery time, the mean time to recover (MTTR). It also helps reduce the mean time to identify a failure (MTTI). In practice, what you realize first is a reduction of MTTI, and the more capability you have to inject a fault quickly and at will, the more you can do about recovery: you build automated scripts or new mechanisms to recover the system quickly, because you are able to recreate the fault on demand. And as you keep fixing the bugs that these injected faults uncover, you increase the time between failures, that is, you increase MTBF and uptime. So that is the end goal: reduce failures and increase uptime. Chaos engineering has been around for quite some time. Why is it more important in the era of cloud native DevOps? Primarily because your code is shipping faster, maybe ten times faster, and there are many more components to deal with. Containerization is leading to a proliferation of microservices. Together, you are getting much more dynamism in your DevOps, sometimes a hundred times more, and that really increases the chance of a fault happening elsewhere and you becoming a victim of that fault. The burden is to make sure your service stays up and your uptime continues to be high, so there is more complexity in maintaining service uptime. Another way of saying this: even though your application is small, the uptime of your service may depend ninety percent on other components, so you have to worry about fault scenarios in the ninety percent of components that are not owned by you.
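To make the relationship between these metrics concrete, here is the standard reliability identity (this is textbook math consistent with the talk, not a formula the speaker states):

```latex
% Steady-state availability in terms of MTBF and MTTR.
% Raising MTBF (fewer failures) or lowering MTTR (faster recovery)
% both push availability A toward 1 -- which is exactly the two
% levers chaos engineering works on.
A \;=\; \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
```

For example, an MTBF of 30 days with an MTTR of 1 hour gives roughly 99.86% availability; halving MTTR through faster fault recreation and automated recovery moves that to about 99.93%.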
That really increases the dynamic complexity of keeping your uptime, and your reliability, high. Chaos engineering is being used as a way to deal with this complex situation, and it has definitely been on the rise for the last couple of years. It is going to get even more active, and in the near future you will see chaos engineering becoming a more common practice. On the adoption curve, we are somewhere between the innovators and the early adopters, so in about a year or two you should see the early majority becoming a reality. Chaos engineering is a solution for reliability not just for ops, but also for QA and developers. The whole idea is that you increase the chances of uptime by giving shocks wherever there is a possibility of a shock bringing down a dependent component or a resource within your application. One of the reasons for this adoption is that there are more tools available for doing chaos engineering in a cloud native, easier way, and more experiments are available. You can get started with chaos engineering very quickly: you simply pick up a tool and put the experiments together, and sometimes ready-made workflows are also available for validating your service against failures of complex systems such as Kubernetes. You continue to validate the hypothesis of your system against such failures, you find issues, and then you improve reliability. Let me introduce one such tool for doing easy chaos engineering in cloud native: Litmus. I'm a maintainer of Litmus. We've been developing this project for almost four years now, and it has a great community of users and contributors around it. There are a thousand-plus users running Litmus in some form, and it has achieved quite a bit of stability.
Recently the project reached its 2.0 GA, which tells you there is a lot of usage, validation, and feedback being put back into the project. It is a CNCF sandbox project on its way to incubation status sometime in the near future. Let's look at Litmus in a little more detail. You install Litmus using Helm, and you get Chaos Center: a central place where you coordinate and collaborate on the development of chaos workflows, which are simply sets of experiments put together to be executed in a certain fashion. Your team members need not be only the ops team; QA and developer teams, all of your DevOps personnel, can work together on developing chaos experiments and chaos workflows, put them in chaos hubs, and finally use them against various targets. These need not be just Kubernetes targets: they can be bare metal physical resources, other cloud platforms such as AWS, Azure, and GCP, or virtual machine resources such as VMware. So Litmus is capable of helping you practice chaos engineering end to end; it is built to be used by teams and to scale to real-world chaos engineering in production. Where do you typically use Litmus, and what are some of the use cases? As I said, experiments are already available for various resources such as Kubernetes and its platforms, VMware, and the major cloud platforms. You use these experiments to create Litmus workflows, and you add a steady-state hypothesis validation scenario using Litmus probes. This is one of the powerful features of Litmus: declaratively defining a hypothesis around steady state, which is really important in chaos engineering. Introducing a fault is one thing, but being able to say whether that fault resulted in a failure of the expected steady state is another.
That validation is very important in chaos engineering, so we developed Litmus probes to manage this hypothesis declaratively. As far as use cases are concerned, once these Litmus workflows are created you can use them for SLO management, for continuous chaos testing in QA, for quickly running game days to introduce a chaos culture into your organization, or even to validate the efficiency of your observability systems against failures: are you able to identify the required signals on your observability dashboards? That validation can be done with chaos engineering, and you can also use chaos engineering in your performance testing or scale testing setups. How do you introduce chaos engineering into your organization's DevOps? The first step is introducing chaos itself. You can expect a lot of inertia when you bring chaos engineering into your org, primarily because systems will initially go down more often than before, since you are doing willful fault injection. So you start by injecting faults in some minimal form, using game days or experimental sessions in QA or pre-production; that's the first step. The second step is to develop meaningful chaos experiments related to your own services and put them into QA or pre-production. Once you achieve that, you will realize that the returns are going to be much higher if you automate. So once you find a way to introduce a fault, you automate it and put it into your QA cycles, pipelines, and pre-production update scenarios or scripts. And finally, it is time to take it to scale, where you do chaos testing as an extension of your regular software updates.
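As a concrete illustration of the probe idea mentioned earlier, here is a minimal sketch of a Litmus `httpProbe` as it would sit under an experiment in a ChaosEngine spec. The probe name, URL, and timing values are hypothetical, and exact field names should be checked against the Litmus docs for your version:

```yaml
# Hypothetical steady-state check: during chaos, the service must keep
# answering HTTP 200 on its health endpoint. This fragment goes under
# spec.experiments[].spec in a ChaosEngine manifest (Litmus 2.x-style schema).
probe:
  - name: check-frontend-availability          # hypothetical probe name
    type: httpProbe
    mode: Continuous                           # evaluate throughout the chaos window
    httpProbe/inputs:
      url: http://frontend.demo.svc:8080/health   # hypothetical service URL
      method:
        get:
          criteria: ==                         # pass only if the response code matches
          responseCode: "200"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
```

If the probe fails, the experiment verdict fails and the workflow's resilience score drops, which is how the "did the fault break the expected steady state?" question gets answered mechanically.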
Whenever there is a software change, you also execute some chaos tests somewhere in the near timeline of the software update itself, in a random way, and you can do this in pre-production or in production. So the final stage of the chaos maturity model is a system where chaos testing runs whenever a change is introduced, validating whether there is a degradation in reliability, or at least confirming that reliability stays at the same level, if not improved. GitOps becomes very important as you start using chaos engineering at scale or in production. At that last level of maturity, it has to be guaranteed: everyone should know that whenever a change is introduced, there is going to be fault testing in pre-prod or prod for sure, and it's going to be random. It could come within a week, a day, or a few hours; we don't know when the fault is going to be injected, but it definitely will be injected. Whenever an app change is introduced, it gets deployed through GitOps, and you use the same GitOps concept to also pull in the chaos test on top of your application update. So you are guaranteeing there will be chaos testing after your change is pushed into production, and that somebody will look at the reliability benchmarks after the changes are pushed. How do you pull in a chaos experiment on top of an application change that has been rolled out? There are two parts to it. One is how your chaos experiments are managed by your DevOps teams: they are generally kept in a Git repository, your team makes updates to the chaos experiments, and whenever there is a change to the experiments, or an execution of them, you can see it in the UI tools. That is one side of the story.
The other side is that whenever there is a change to the app manifest, it gets deployed, and that deployment is registered as an event in the view of the chaos application, in this case Litmus. Once that event fires, you start another GitOps cycle: you pull a chaos experiment from your Git repository, execute it, and send the results back to your Chaos Center, or chaos portal. We call the first part front-end GitOps: you manage a single source of truth for your chaos experiments, with teams making changes through regular Git version control or through UI tools like the Litmus Chaos Center, so Git is used to manage your configuration changes. Back-end GitOps, or chaos GitOps, is triggering a chaos experiment or workflow execution upon deploying a change to an application on the system. This is how GitOps gets used in chaos engineering. Let's do a couple of quick demos. In the first demo, we will attach Git as the back-end for the Litmus Chaos Center, which means we are going to configure a GitHub repository as the configuration database; we will then execute a chaos workflow, and its details will be written back to the GitHub repository. The second demo shows that chaos workflows can be executed based on an event, such as a change to a deployment on the target system. This event can be introduced through GitOps; that's how it is generally done, with configuration updates rolled out through GitOps onto Kubernetes systems. In our demo, we are going to make a manual change to a deployment, which would otherwise be done by GitOps, and when we make that change you will see a Litmus chaos workflow start executing. That's the objective of these two quick demos. So let's get to it. This is the Litmus Chaos Center.
Our objective is to create a GitOps-based back-end for the configuration store. This is how Chaos Center is deployed; it's all running and healthy. The back-end store is by default MongoDB, and you can see an event tracker as well, which is used for back-end GitOps. We have one agent connected, so the Chaos Center looks healthy. Let's go and set up deploy keys on GitHub. This is done in settings, where we configure the Git repository connection: it asks for a Git URL and a deploy key. We need a demo repository where we can add this deploy key, so let's create one. This is a private repo, and we initialize it with a README file; at least one file needs to exist in the repository. The repository is created. We just need to copy the deploy key and add it to the deploy keys of the GitHub repo: copy, paste, give it write access, and give it a title. The deploy key is added, and hopefully we have the right access. Now copy the URL of that newly created GitHub repo where we added the deploy key, put in the details, and connect. We should now have write access to that back-end repository, and Chaos Center is configured to write the workflow details to it. By default, whenever you configure this deploy-key-based setup, some metadata is written to the repo. Now if I run a workflow, you should see the workflow details getting written to the GitHub repo; that's the front-end GitOps. So let's schedule a new workflow. We're going to create a workflow to delete an nginx pod, and this is how the workflow scheduler of the Litmus Chaos Center looks. We'll go with most of the default options here. Let's pick the pod-delete experiment from the Litmus ChaosHub; it's already configured, and we just need to tune it for the right namespace and the right duration, and set up resiliency scores if needed.
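For reference, tuning the pod-delete experiment for the "right namespace and right duration" mentioned above looks roughly like this in the underlying ChaosEngine manifest that the workflow carries; names, namespace, and values here are hypothetical placeholders:

```yaml
# Sketch of a ChaosEngine targeting an nginx deployment with pod-delete.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos                    # hypothetical engine name
  namespace: default                   # the "right namespace" from the demo
spec:
  appinfo:
    appns: default
    applabel: app=nginx                # label selector of the target deployment
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-delete-sa   # hypothetical service account
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # the "right duration", in seconds
              value: "30"
            - name: CHAOS_INTERVAL         # delay between successive pod kills
              value: "10"
            - name: FORCE                  # graceful (false) vs forced deletion
              value: "false"
```

The Chaos Center UI is essentially a front-end for generating and versioning manifests like this, which is what makes the Git-backed configuration store possible.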
We are just going to ignore that and schedule it now. What this does is schedule the Litmus workflow, and it should write the configuration to Git. It's running, it's all good, and the workflow has started; the nginx pod-delete experiment will run. And you can see the details we just added: anybody can now go and make a change to these workflow details in Git, and those changes will be reflected in the Chaos Center as well. Now let's look at the second demo, back-end GitOps, which is based on the event tracker deployment of Litmus. The event tracker listens for events according to event tracker policies, and when a change matches, it triggers a workflow. The event tracker looks good; it's listening. We just need to create some policies, so let's look at what a policy looks like. It's of kind EventTrackerPolicy, and it's based on conditions with key, value, and operator fields. The policy we have here relates to the nginx deployment itself: the condition checks that replicas is one. Any change to the replicas, or any change to the deployment itself, will trigger the Litmus workflow. In this case we also have to annotate the deployment with certain keys, so that the event tracker policies apply only when the annotations are present. We need two annotations; let's add them. First, gitops equal to true. Second, the particular workflow ID: whenever there is a change and an event is triggered, this is the workflow that gets run, so we copy and paste that workflow ID and annotate the deployment in the nginx namespace. Our application is now set up to run a chaos workflow against the given event tracker policy. Now we can just go ahead and change the nginx deployment; it can be anything, even as simple as adding a label to the deployment.
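Putting the pieces of this demo together, a sketch of the event tracker policy and the two opt-in annotations on the target deployment might look like the following. The exact CRD group/version, condition field names, and annotation keys are assumptions based on Litmus 2.x and should be verified against the Litmus docs:

```yaml
# Hypothetical EventTrackerPolicy: match when the nginx deployment's
# replica count equals 1, i.e. on any change event that satisfies the condition.
apiVersion: eventtracker.litmuschaos.io/v1
kind: EventTrackerPolicy
metadata:
  name: nginx-replica-policy          # hypothetical policy name
spec:
  condition_type: and
  conditions:
    - key: spec.replicas
      operator: EqualTo
      value: "1"
---
# The target deployment opts in via annotations: GitOps enabled, plus the
# ID of the Chaos Center workflow to run when the policy matches.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: nginx                    # namespace used in the demo
  annotations:
    litmuschaos.io/gitops: "true"                               # assumed key
    litmuschaos.io/workflow: "<workflow-id-from-chaos-center>"  # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
```

With both pieces in place, any edit to the annotated deployment, however it is rolled out, produces an event the tracker can match against the policy and turn into a workflow run.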
Let's add a litmus label here; this should change the state of the deployment, and that should trigger it. We save it, the deployment is changed, and the event tracker found it. But we also need to create the event tracker policy; we had not applied it yet. So now the policy is set up and the annotations are set up. Let's go and add one more label to change the deployment state: a new owner label with the value litmus. Now we have both the policy and the annotations, so the policy should kick in and the workflow should run. Let's go and take a look; there you go. A new workflow has now run because of a change to the application's deployment. Unlike in the first demo, this workflow ran because it was triggered by an event. That's our back-end GitOps demo. I'd like to summarize the session by saying: you can adopt GitOps for chaos engineering to automatically run workflows, and you can also use Git as a back-end configuration store for your chaos database. It's always good to have a guarantee of random fault injection whenever a change to your deployments or applications is rolled out, and LitmusChaos is one such tool with native GitOps integration. With that, I'd like to thank you. You can find more resources about Litmus and GitOps in our docs, and you can join us on the #litmus channel on Kubernetes Slack. You can also reach me at my Twitter handle. Thank you, and have a great GitOpsCon as well as KubeCon, folks. Bye.