All right, well, folks, welcome to our presentation on power leveling over 9,000: improving application performance with some chaos engineering. Hello, everyone, I'm Saiyam Pathak, Director of Technical Evangelism at Civo, building the next simplified Kubernetes service on Civo. I'm a CNCF ambassador and an InfluxAce. I'm the author of Learn CKS Scenarios, which is published on Gumroad, for practicing for the Kubernetes CKS certification. Yes, I am CKA, CKAD and CKS certified, and I organize a few of the meetups in Bangalore. I run a YouTube channel where I talk about all things cloud native, and you can reach me on Twitter anytime at @SaiyamPathak. And I'm Karthik Gaekwad. I'm the head of cloud native engineering over here at Verica. Lately, in the last probably eight months or so, I've been doing a lot of work in the chaos engineering space, building chaos engineering verifications. I've written the Learning Kubernetes and other cloud native courses on LinkedIn Learning. I used to manage the Oracle Kubernetes Engine over at Oracle Cloud, and I also do a bunch of things in the DevOps and cloud space here in Austin, Texas. You can find me on Twitter at @iteration1. So let's talk about the storyline today. We'll do a quick overview of the state of the cloud native universe, figure out where chaos engineering fits in, then talk about what we mean by chaos engineering and its intersection with cloud native, and then the most important part, the demos, of course. And finally we'll go into some speaker recommendations at the end. So, state of the universe. I'm a big fan, and kind of a nerd, when it comes to looking at reports and things like that. The state of our world is that we're moving to a multi-cloud world. We've known this for a while, but the research report from 451 Research, which is a little bit older, from the 2018-2019 timeframe, said that 67% of enterprises will have multi-cloud or hybrid IT environments by 2019. But with greater choice comes more complexity. We saw this again in the CNCF survey that came out last year, where 26% of usage is actually multi-cloud. That's grown a lot from years past; if you notice in the graph, there was nothing in 2019, but in 2020 it's a brand new thing. There's also increased container usage. This is no surprise to anybody: the use of containers in production has become the norm among survey respondents. So there's a steady increase in the use of containers in production today, and everybody and their parents and sisters and brothers and coworkers are using Kubernetes. It's kind of become a ubiquitous thing. So that's where we stand from a cloud native standpoint. But also, in the 2020 CNCF survey, they talked about cultural challenges, where cultural challenges and complexity were the biggest challenges with respect to using containers. Actually, let's talk about complexity for just one second. On the complexity scale, you have simple systems and then you have complex systems. Simple systems are ones where the process flow is very linear, you have predictable output and everything is comprehensible, versus more complex systems, where things don't really happen in a linear fashion. Things are non-linear, you have a lot of unpredictable behavior, so you don't really know what might happen, and it's also impossible to build one complete mental model.
One of the really interesting things I found from this was that in the past, 10 or 15 years ago, you'd have one architect who was able to understand the whole system. So when something went wrong, you could go to that one person and ask him or her, hey, what's going on with the system? And you might get a response like, oh, maybe the messaging service or something like that is offline. But in the current world, when everything's a little bit more distributed, it's hard for one single person to have a complete mental model of the entire system. So it's become a lot more complex. And testing is hard. This wouldn't be a conference talk without this slide; I'm guessing it's going to be in every presentation. You have your productionized, distributed application on the right, the large container ship, and then you're trying to test this, and we're all probably using JMeter to load test our applications, so you have the little forklift on the left-hand side of the image. Testing is just a hard problem in a distributed world today. So enter chaos engineering. How does this help us? Chaos engineering is defined as the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. This definition comes from principlesofchaos.org, so go check out that site. Let's talk about testing versus experimentation real quick. Testing is the idea of making an assertion on a property of a system based on existing knowledge and then validating that specific property. Experimentation, on the other hand, is proposing a hypothesis which can either be proven or disproven. As long as the hypothesis is not disproven, confidence grows in that hypothesis. If it is disproven, then we know something's wrong and we can figure out why the hypothesis was wrong in the first place. Okay, so you want to practice chaos engineering in the real world. How do we actually make this happen? Following the principles of chaos, you first define a steady state, so you know what normal behavior looks like under normal conditions. Then you hypothesize that this steady state will continue for both your control group and your experimental group. Now that you have two groups, you start introducing variables that reflect real world events in the system. What does this mean? In the Kubernetes space, you might start to delete pods or take a node away, et cetera, injecting real world chaos into the experimental group. And then you try to disprove the hypothesis by looking at the difference in steady state between the control group and the experimental group. Thinking about this from an advanced principles perspective, you really want to build your hypothesis around steady state behavior. You want to think of this from an outside-in approach, so you look at measurable outputs rather than internal attributes. What does this mean? If you have an application running, you want to look at things from the outside, like traffic coming in, response rates and things like that, versus looking at something inside your application, like CPU or memory, for example. You really want to vary real world events: turn things off, slow things down, send invalid responses, things like that.
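To make the steady state and hypothesis idea concrete, here is a small, purely illustrative sketch of an experiment written out as data. This is not the schema of any real tool; the service name and the numbers are made up for the example.

    # Purely illustrative: a chaos experiment spelled out as a hypothesis (not any real tool's schema)
    experiment:
      steady_state: "99.9% of requests to the checkout service return HTTP 200 in under 300 ms"
      hypothesis: "Killing one random checkout pod does not change the steady state"
      real_world_variable: "delete one checkout pod at a random time during business hours"
      abort_condition: "error rate of the experimental group diverges from the control group by more than 1%"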
Once again, that's from an outside-in perspective: you want to change the kinds of things that might happen in the real world. You want to run these experiments in production, because that really guarantees the authenticity of whether your experiments or your hypotheses are valid or invalid. Also, blast radius. I just said that you're running experiments in production, so you want to remember that and minimize your blast radius, in order not to take everything down. And last but not least, continuous. We'll talk about this as we go along, but what I mean is that you want to keep doing this continuously. So you don't run your experiments maybe once a release; you're really doing this on a day-to-day basis, hour to hour, even minute to minute potentially. So what's the intersection between the theory I just explained, chaos engineering, and cloud native? The coolest thing about cloud native is that we're in a rich ecosystem and there are a lot of great open source tools in the landscape. Some of them are Kraken, Litmus and Chaos Mesh, and there's a whole bunch of others as well, including enterprise tooling. For the purposes of this presentation we'll focus on the first three, because realistically we only have 25 minutes, probably 22 by the time I'm done talking about this stuff. So we'll cover the first three, but definitely take a look at all the others as well. First up, Kraken. So what is Kraken? Kraken helps us inject failure into either OpenShift or Kubernetes. It's built from two components. One of them is PowerfulSeal, which injects failures into Kubernetes clusters. So PowerfulSeal is the thing that actually injects failures or does something to your cluster. And then there's Cerberus, which watches and reports on cluster components. Cerberus is the watcher component: when PowerfulSeal goes and kills a pod or slows down a node, for example, Cerberus sits there and watches for changes in your infrastructure, so it'll tell you what's actually going on. Kraken is used for a bunch of scenarios. You can use it for pod chaos; for example, you can try to kill the etcd pods on your cluster, and the same thing with API server pods. There's node chaos, where you might simulate a node crash and see how your application or your cluster recovers from that. And then there's time chaos, for example, where you skew your clock time to see how your cluster or your applications respond to that. Here's a quick architecture diagram. We have Kraken on the left-hand side here; it injects failures into OpenShift, and these are the different components it can work with. It also works with different cloud APIs, so if you're on AWS or Azure, for example, it can call out to those as well. We also have Cerberus, which watches all of these different components. But Kraken itself, behind the scenes, uses PowerfulSeal, which actually goes and does the experiment part against your workers or your masters, for example.
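To give a flavor of what that looks like in practice, here is a rough sketch of the kind of policy file PowerfulSeal consumes to kill a random pod. The namespace is hypothetical and the exact keys can differ between PowerfulSeal versions, so treat this as illustrative and check the project docs.

    # Minimal PowerfulSeal-style policy sketch (illustrative; verify keys against your version)
    scenarios:
      - name: kill one random pod in the demo namespace
        steps:
          - podAction:
              matches:
                - namespace: demo        # hypothetical namespace
              filters:
                - randomSample:
                    size: 1              # pick a single matching pod
              actions:
                - kill:
                    force: true          # delete it without a grace period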
Thank you, Karthik, for the awesome introduction to chaos engineering and why it is needed. Next up on the tooling is Chaos Mesh. Chaos Mesh is a powerful chaos engineering platform for Kubernetes. It is extremely simple to use, can be installed on any Kubernetes cluster, and it is designed for Kubernetes. It has an interactive dashboard which you can use for creating the chaos experiments, and it can be used for analytics as well, like at which points your chaos actually ran. Some of the disruptions that Chaos Mesh can cause are killing pods, injecting latency, network chaos, and system and IO chaos, and there are others as well. All of these kinds of chaos can be introduced within your Kubernetes cluster. Now, the Chaos Mesh architecture. There is a set of experiments which you create as a CR and submit to the Kubernetes cluster. So you'll be writing a YAML file, which is a custom resource, and then you'll submit it to the Kubernetes cluster. After the controller manager picks up the CR, it hands it over to the chaos daemon; the chaos daemon runs on all the nodes and internally runs your chaos experiments. These experiments are based on selectors that select the particular workload against which the chaos experiment has to run. That's how, in a short summary, Chaos Mesh works.
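As an example of such a CR, a pod-kill experiment looks roughly like the following. The namespace and label here are hypothetical, and details like scheduling fields differ between Chaos Mesh versions, so take it as a sketch rather than a copy-paste manifest.

    # Sketch of a Chaos Mesh pod-kill custom resource (namespace and labels are hypothetical)
    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: kill-one-hello-pod
      namespace: chaos-testing
    spec:
      action: pod-kill          # the kind of pod chaos to inject
      mode: one                 # act on a single matching pod
      selector:
        namespaces:
          - default
        labelSelectors:
          app: hello-service    # the workload this selector targets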
Next up on the tooling is Litmus Chaos. Litmus is again an open source tool, it's also a CNCF project, and it basically gives you chaos workflows for Kubernetes. Litmus features include team collaboration: you can integrate it with your GitHub and add your team members. You can add public and private chaos hubs, which contain the chaos experiments. You can enable chaos GitOps as well, which can act as a single source of truth: whatever workflow you create in Litmus will be stored in whichever GitHub repository is connected. And then you have chaos observability as well: you can view the duration of the experiment and when it ran in the Grafana dashboard, and you have Prometheus as well. There are chaos APIs that can be used, and chaos workflows for creating the complete workflow of the chaos experiments that have to run on the cluster. Now, the Litmus architecture. This is the Litmus 2.0 architecture. It has a portal, and the portal comes with a web UI. The web UI is a really great component of Litmus 2.0: with it you can create the chaos workflows from the UI itself, without having to write any of the YAML files. Then you have the Litmus server, and you also have a Litmus DB which internally stores all the workflows. And if you have enabled GitOps, your workflows are also stored in your Git repository. You have chaos agents on your clusters, and the chaos agents are a set of components: the operator and the CRDs, the chaos probes, the exporters exporting the metrics, and the subscribers. And here you have the chaos experiments and the chaos workflow YAML. You can also edit the YAML when you are creating workflows, and you can enable or disable some of the fields. This is the complete runtime workflow for Litmus. You can see the CRDs here; there are three CRDs: the chaos engine, for which we write the chaos engine file; the chaos experiments, which are installed onto the cluster; and the chaos results, where the chaos result tells you the success or failure and the metrics for the run. You have the chaos operator installed on the cluster, and you write the chaosengine.yaml file; the user creates a ChaosEngine CR and maps it to the application against which the chaos experiment has to run. Then the chaos operator reads that file and creates a chaos runner pod, which reads all the configuration and maps it to the workload it has to run the chaos experiment on. After that it triggers a chaos job. The chaos job is the actual experiment that runs, performs the chaos checks, does the experiment and then hands over the result. Chaos Hub is the place where all your experiments are stored. You can use the public Chaos Hub or you can create your own private chaos hub and link it to the Litmus portal. You can also attach multiple target clusters to Litmus, so it's not limited to a single cluster: you can have multiple clusters and select targets for different workflows. So this is how, all in all, Litmus works. Demo time. This is what we are going to do in today's demo. I'll be launching a Civo Kubernetes cluster, which is K3s-based. We'll install Litmus 2.0 and we'll create a pod network loss experiment. We'll also enable GitOps with the repository, and we'll install Flux, which will be watching a specific path in the GitHub repository. And we have GitHub Actions, which will be used to build and push the Docker image and also commit the deployment file to the same Git repository, from which Flux will read it and deploy it onto the cluster. After that, the application will have been subscribed to a particular pod network loss workflow, and using the blackbox exporter and Prometheus we'll be able to visualize everything in Grafana. So for this demo, we'll be using Civo Kubernetes, and we launch a cluster. Let's name it kubecon. We'll select a large size and we'll just create the cluster. The cluster is ready in under two minutes, and I have already downloaded the kubeconfig file and exported it as the kubeconfig variable. Let's run kubectl get nodes, and our cluster is up and running. I have already cloned the repository, so let's now deploy the components required for this demo. The first one is the deployment of Litmus. We'll run kubectl create namespace litmus and we'll deploy Litmus 2.0 beta. It has created all the components, so let's check what is happening in the litmus namespace: kubectl get pods -n litmus. We have our database, frontend, server, all the pods running for Litmus. Now we deploy the application: kubectl apply our application manifest, which creates the deployment and the service. And we'll deploy some monitoring components: kubectl apply -f the monitoring Prometheus manifests, which creates all the Prometheus components. We also need the blackbox exporter for probing the HTTP endpoint, and we also need Grafana, so that creates a deployment and a service. There is also a JSON dashboard that we'll import later, which will not be applied to the cluster, obviously. Let's check whether the application is running or not. Our application is up and running, and let's access it; here it is. Now let's access the Litmus dashboard. This is running on the node port 30406. This is the Litmus dashboard, and the default username and password are admin and litmus; we'll change that. We have to give a project name, kubecon. This is the Litmus portal when we log in: it has workflows, it has the hubs and the target clusters. First, let's enable GitOps for our repository. We'll give the Git URL, we'll give the branch as main, and we have to give the access token, and we have successfully connected to GitOps, so all our workflows will now be stored in the Git repository.
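Before we build it in the portal, it helps to see what the workflow we're about to create boils down to: a ChaosEngine resource, roughly like the sketch below. The app label and namespace are the ones used in this demo, but treat the manifest as an illustration rather than the exact YAML the portal generates.

    # Rough sketch of the ChaosEngine behind a pod-network-loss experiment (values illustrative)
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: hello-network-loss
      namespace: litmus
    spec:
      appinfo:
        appns: default
        applabel: app=hello-service      # the app label we select in the portal
        appkind: deployment
      engineState: active
      chaosServiceAccount: litmus-admin
      experiments:
        - name: pod-network-loss
          spec:
            components:
              env:
                - name: TOTAL_CHAOS_DURATION
                  value: "120"           # run the network loss for 120 seconds
                - name: CONTAINER_RUNTIME
                  value: containerd
                - name: SOCKET_PATH
                  value: /run/k3s/containerd/containerd.sock   # K3s containerd socket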
So let's create a workflow. We schedule a workflow, we select the target cluster, which is the self cluster, and we create our own workflow. We give it a name, network-loss, and we select the experiment from the Chaos Hub, pod network loss, and we go next. Then we select the application based on the app label, which in this case is hello-service. We'll run this experiment for 120 seconds, and since we are running Civo Kubernetes, which is containerd-based, the container runtime socket path will be /run/k3s/containerd/containerd.sock. Let's add this experiment, finish editing and construct the workflow. Now, this is the YAML file, which you can edit as well. We edit the monitoring section: we want to enable monitoring so that it can tell us the exact chaos duration when the experiment ran. And let's schedule the workflow. The new workflow, network-loss, has been successfully created on our Kubernetes cluster. Let's go back to our Kubernetes cluster to see what is happening. Let's check the litmus namespace. We can see the network loss pod is initializing; after that it will create the runner pod and actually run the network chaos. Our workflow has run successfully, all the network loss pods have done their jobs, and we can see the workflow succeeded. Now, how do we visualize this? Let's go to our Grafana dashboard. I have already deployed Grafana, connected it to Prometheus and loaded the JSON dashboard. It shows the probe duration in seconds, and you can see the graph showing the network loss. The red area that we are seeing is there because we set monitoring to true in the YAML file; setting monitoring to true gives the metrics the proper chaos duration, the window where the chaos actually ran for the particular workload. So this is the duration for which the chaos ran, and it helps with visualization when you have a lot of chaos running in your cluster. That's how you can run a chaos workflow and visualize it. Now let's add CI/CD to this. In the repository, I already have a workflow, which is the build-and-push-image workflow. This is the GitHub Action which, on a push commit to the repository, builds the image, pushes the image with a tag, and, using a Jinja template, edits this particular deployment YAML file as well. So there's a template, which is the .tmpl file; the action updates it with the deployed image tag and commits it back to the same location. Now, for the CD part, we'll be deploying Flux. Flux is another open source tool which is very good for continuous delivery. So let's bootstrap Flux with this particular repository; it will create the flux-system namespace and deploy all the components. Meanwhile, we can see that in the kubecon repository we have the litmus folder created. When we created the workflow with GitOps enabled for Litmus, it created the workflow inside the repository and pushed it, and we have the workflow ID and the cluster ID. We need these two so that we can subscribe the deployment to this particular workflow. So we'll copy the workflow ID, edit our template, and add the workflow ID over here. Now, when there's a Git push from GitHub Actions, it will keep the workflow ID the same, so that the same workflow runs whenever a new version of the image gets deployed by Flux.
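While Flux finishes bootstrapping, here is a trimmed-down sketch of what a GitHub Actions workflow like the one just described can look like. The image name, secret names and the templating step are placeholders standing in for this demo's actual workflow, so adapt them to your own repository.

    # Sketch of a build-and-push workflow (image name, secrets and templating step are placeholders)
    name: build-and-push
    on:
      push:
        branches: [main]
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - uses: docker/login-action@v1
            with:
              username: ${{ secrets.DOCKERHUB_USERNAME }}
              password: ${{ secrets.DOCKERHUB_TOKEN }}
          - uses: docker/build-push-action@v2
            with:
              push: true
              tags: example/hello-service:${{ github.sha }}
          - name: Render deployment manifest and commit it for Flux
            run: |
              # render deploy/hello-service.yaml from the .tmpl template with the new image tag
              # (templating command omitted here), then commit it back so Flux picks it up
              git config user.name "ci-bot"
              git config user.email "ci-bot@example.com"
              git add deploy/
              git commit -m "deploy ${{ github.sha }}"
              git push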
So the bootstrapping of Flux is finished and we can see everything is there. Flux has been bootstrapped; kubectl get pods -n flux-system, and we can see all the pods are running. What we have done is bootstrap Flux with its components against the kubecon repository, branch main and the path deploy. So anything that gets added or edited in the deploy folder, Flux will apply, rolling the new version of the application onto the cluster. And in our application we have added a few annotations: in the annotations we have GitOps set to true and the workflow ID we just pushed to GitHub. If I do a git pull, it gives me all the latest details. So for the deployment we have added some annotations which enable GitOps and subscribe it to this particular workflow. Any time a new version of this application deploys, it will trigger this workflow, which is our network loss workflow. And we can see that the previous pod has been removed and a new pod, based on the commit that we made, has already been deployed onto the cluster. If we go back to the workflows, we can see a new workflow has just been triggered. And if we again go back to the Grafana dashboard, we can see it live in action: this red bar has already appeared and the chaos workflow is starting again. So that's how you can use Litmus chaos with your GitHub Actions for continuous integration and continuous deployment. All of these things will be helpful when you are running a production environment: on every commit, or on a specific commit or a specific action, you can run various different workflows, and your GitHub repository acts as a single source of truth that has all the workflows you created. So even if your database component goes down, you still have the workflows, and if you edit anything in a workflow in the repository, the corresponding workflow will automatically be updated. So that's it for this demo. Hope you liked it. Please try it and let us know how it works for you. Saiyam, that was awesome. It did look like we went through a lot of things in that demo, so let me see if I can break it down and get this right. We had an application that we deployed, and then we ran Litmus 2.0 on the cluster and decided to run the pod network loss experiment. We ran that a couple of times just to make sure it was running. We saw the output of that in Grafana, which was using Prometheus behind the scenes to pull the metrics from the cluster. And then, when we wanted to run that continuously, basically meaning whenever we did a push to the application, we used a GitHub Action to push the image to Docker Hub, et cetera. Flux was configured to go and pull the deployment changes from the file in Git, and Flux would then cause the Litmus experiment to run again. So any time you pushed your application, once it got pushed into Docker Hub, Flux would rerun the experiment again, correct? Yes, we have used the annotations so that the subscribed workflow runs. So, bang on. That's awesome. Yeah, and I like it a lot because it lets us tie in the continuous aspect of running a chaos experiment or chaos verification over and over again. Litmus is the actual runner that does the chaos in your cluster or against your application, but then we're also using Flux to do that in a continuous fashion. That's great. All right, well, let's wrap up.
We talked through a bunch of things in this talk, so let's give our recommendations for chaos engineering in general. I'll start by saying that the whole idea of chaos engineering is pretty nascent. It's still pretty new; it's kind of like where DevOps was in 2009 or 2010, that timeframe, where we're still trying to learn and come up with ideas for how a lot of this is going to be implemented. It's really important to get the basics and fundamentals of chaos engineering right, so learn the theory behind it rather than rushing to implement something specific upfront. This is the best time to talk to the maintainers and the practitioners in the space who have the ideas and are actually doing a lot of these things. So come talk to us, and talk to the folks; the Litmus folks are all here at KubeCon. It would be great to understand their thinking behind the scenes. Probably my most important recommendation right now would be to communicate your experiments. It's recommended to run these tests in production, but it's also important, from an organizational and team perspective, to be able to communicate that, hey, we are running chaos engineering tests in your cluster, on production, for example, so you don't get into weird corporate or organizational trouble. So that's my big recommendation. There's a bunch of community stuff, community groups and meetups you can join; I've put a couple of links in here. I also talked about theory: if you want the chaos engineering book from Casey Rosenthal, you can get that at verica.io/books or /book. Just to add on to Karthik's recommendations, I have a few of my own. Chaos engineering is expected to grow even more. It is new, but it is growing rapidly, with the number of tools and everything else growing in the cloud native ecosystem. And we should work towards continuous chaos: as in the demo, we have shown the power of continuous integration and continuous delivery with chaos engineering built into them, so work towards that so you can have the complete life cycle integrated with chaos engineering. Keep an eye on the new tools and get involved in the community. There are a lot of new tools coming up for chaos engineering, and the current tools are maturing enough to match the needs of production-grade clusters. Check out the integrations that will help you in your developer workflows and CI/CD workflows. Keptn is a very good project, and Keptn plus Litmus works very well; Okteto plus Litmus works very well too. So you can try these out; they have a lot of blogs and a lot of meetup talks, and there are others as well. Awesome. And with that, that is the end of our talk. You can reach me on Twitter at @iteration1, and you can reach Saiyam on Twitter at @SaiyamPathak as well. Any last words, Saiyam? Yeah, definitely try out the repository. It's public, and you can try the demo on any Kubernetes cluster as well. So try it out, and let us know what you think about chaos engineering. Any questions, we are happy to help, and we are there in the Slack channels, Kubernetes, CNCF, you can reach us anytime. All right. Thank you, folks. Thank you all. Bye.