I mean, you guys are really, really awesome, because the room is fully packed and it's the last session. So you are really here to learn, and I'm really happy about that. So yeah, let's start with the session. Today we'll be talking about chaos engineering and Chaos Mesh. That's what you're here for, and I'll try my best to explain to you what chaos engineering is. By the way, can you raise your hands? How many of you know what chaos engineering is and how it is being used? Awesome, almost everyone. My name is Saiyam Pathak, and I'm working as director of technical evangelism at Civo. I'm also a CNCF ambassador, and you can find me on Twitter at Saiyam Pathak; tweet anything about the session that you learned or found interesting. The storyline for this session: first a general introduction that applies to all the projects, what chaos engineering is, where it fits in, and what its principles are. Then we'll move to the project Chaos Mesh: what Chaos Mesh is, and some of the new features that the maintainers have been working on. I am not one of the maintainers of Chaos Mesh; I'm more of a community member and a user of Chaos Mesh, so I'll be talking from that perspective. And then we'll look at some demos, three demos I think, in which I'll showcase one interesting new one, the multi-cluster chaos that has just been introduced. So, systems have moved from linear systems to complex systems. Previously we used to have monolithic applications, simple applications, where a single person knew end to end how the system worked. You could go to them and ask, okay, how does this work? This is where it failed, and this is where you'd be able to find the bug in that particular system. Over time, the systems have matured. They have become non-linear.
You have moved to a distributed world where it is all microservices architecture: you have many smaller chunks of your API running as separate microservices, inside containers, on Kubernetes, and that brings unpredictable behavior. You don't know where exactly things go wrong. You have hundreds of microservices; one thing fails in your application, and it becomes very difficult to see where your application actually failed. That unpredictable behavior has made the systems complex, and we need more advanced tools to tackle this complexity in terms of chaos. In the previous systems you could go to a single person and ask them about the end-to-end flow, but now it's very difficult to build the complete mental model of which microservice is talking to which microservice, and you cannot ask a single person, okay, things are wrong, how do we fix it? You need some other mechanism to tackle this problem. In the same way, traditional systems have slowly moved into cloud native. With cloud native adoption, we have more Kubernetes adoption, because Kubernetes was the first project of the CNCF itself. And Kubernetes itself is a very heavy piece of software that you are running. It's not small: you have your control plane nodes with components on them, and you have your worker nodes with components on them. Kubernetes itself has a lot of components that have to work properly so that your application works properly. So the systems have matured and become complex; they provide a lot of powerful features, but the complexity has grown. Then you have more microservices and more third-party tooling. What does more third-party tooling mean? When you install Kubernetes, it's not just Kubernetes. On top of Kubernetes, you'll be adding your observability layer. You'll be adding your security layer.
You'll be adding your service mesh layer. You keep adding more and more software, so you keep adding more and more complexity. Each layer that you add has to work perfectly fine in order for your application to run perfectly fine. In the end, you are running your application on Kubernetes; these are just the helper utilities that make things able to run at massive production scale. Testing is hard. You have a productionized distributed application, and trying to test it with something like JMeter is hard. So that alone won't do it; we need some other mechanism. And that's where chaos engineering enters. I don't know why I'm not using the clicker. So chaos engineering is not something very new. You must have heard of Chaos Monkey, which Netflix originated in 2010. So you can imagine, it's been more than 13 years. The technology itself is not new; it has been there, and it has been maturing more and more with respect to cloud native. Cloud native chaos engineering did not exist because cloud native did not exist previously. As the applications and systems have matured over time, so has the model, and so have the tools. That's where cloud native chaos engineering came in. But the chaos engineering philosophies are something that people accept and agree on, and we'll be talking about the chaos engineering principles. By definition, chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. What it simply means is that we inject failures into the system to understand how the system will behave if an issue comes up, and catch that issue before it actually happens in production.
It's similar to injecting a vaccine into a person to make them immune to a particular disease. That is what we are trying to do with chaos engineering: we inject a failure into a particular system, and then we observe the behavior of that system. If everything works fine, great. If not, our system probably needs some modifications to tackle that particular bug. That's it in simple terms. Testing makes an assertion on a property of the system based on existing knowledge, and then validates that property. That is regular testing. But we are more interested in experimentation, where we define a hypothesis which is proven or disproven. As long as the hypothesis is not disproven, our confidence in the hypothesis grows. If it's disproven, we learn that something is wrong, and we can fix it and improve. Now, the principles of chaos engineering. The principles state that there is a series of steps. First, we define a steady state: this is how our application should be behaving. Next, we define the hypothesis: the steady state will continue even when we introduce this particular change. Then we keep adding real-world variables. What are those variables? They are real-world scenarios: your application is there, and you add latency to it. Out of three pods, one pod goes down; how does your application behave? One of your nodes goes down; how does your application behave in that particular scenario? So we keep injecting failures, the real-world scenarios, to see whether the hypothesis we have made can be disproven.
So we try to disprove the hypothesis by looking at the difference in the steady state between the control group and the experimentation group. If they're the same, we vary with real-world events: turn things off, slow things down, send invalid responses to the API requests that are there. And then we have to run these experiments in production. Now, the funny part is that on one hand we are saying we're bringing in chaos and turning things off, but turning things off isn't a good thing in production, right? That is where you have to minimize the blast radius, which is a very important, critical aspect when you are doing chaos engineering in production. Your customers are using your application, and you do not want them to be impacted by the piece of software that is doing chaos engineering and trying to find bugs, which might or might not be there. It should not hamper any of the existing things that your customers are using. So you have to minimize the blast radius, and you can do that in various ways. For example, if there is a pod-failure or pod-kill kind of chaos that you want to inject on your Kubernetes cluster, you will carefully select the node selector, or the particular node of the particular application, that is fine to run this particular chaos on. That is how you carefully minimize the blast radius. And you have to communicate more with the team: make sure there is enough communication about what will be happening in this particular scenario, on this particular application, and how many chaos experiments we'll be doing. And continuous: yes, we have to keep doing it in a continuous manner, for example whenever you bring in some change or a new feature in your application.
So you have to redo all the chaos engineering experiments with each new release of your application, because that is what cloud native is, right? You keep adding new features faster and faster. That is why you need to do it in a continuous way. Introducing Chaos Mesh. Chaos Mesh is a tool for doing chaos engineering on Kubernetes. It also has what you could call physical chaos experiments that you can run on physical nodes, and there are a lot of experiments you can do at the Kubernetes level. So, designed for Kubernetes: you can do a pod kill, you can increase network latency, there is system-level chaos and kernel-level chaos. It has deep integrations with some of the cloud providers already, so you can inject cloud-provider-specific types of chaos directly. And it has a dashboard for analytics, where you can view how an experiment went and how the chaos was done. It's not on this particular slide, but it also has workflows, so you can create a chaos workflow. If you want to run chaos experiments in serial, one after the other, as a series of experiments, you can do that. If you want to run a few of the experiments, or two types of experiments, in parallel, you can do that too. So how does Chaos Mesh work? This is the simple architecture. As a user, first you have to install Chaos Mesh. So first you'll have a Kubernetes cluster; let's say you have created a Civo Kubernetes cluster. It can be any Kubernetes cluster, that's just an example. On that cluster you'll be installing Chaos Mesh, which can be installed via Helm. It will install the Chaos Controller Manager, and it will install the Chaos Daemon as a DaemonSet on all the nodes.
Then what the user does is create a custom resource with a specific chaos experiment type. Now, when you install Chaos Mesh, a lot of CRDs get installed as part of it. There is not a single CRD; there are multiple CRDs, a separate one for each experiment type. For example, if you want to create a stress chaos, you'll create a StressChaos kind of Kubernetes object, and you'll specify what type of stress you want to introduce, say CPU stress, and on which particular application. The user does that, and then the Kubernetes API server informs the Chaos Controller Manager that this particular object exists. The Controller Manager recognizes that this is its object and delegates it to a Chaos Daemon. Now it's the responsibility of the Chaos Daemon to actually perform the chaos: to select, using whatever you have specified, like the node selector and so on, on which particular node and on which particular pod the chaos will be done. It decides that this particular pod on this particular node has to be killed. It also gets the cgroups, namespaces, and all that for the pod. That is the responsibility of the Chaos Daemon. Then the results go back to the dashboard, and you will be able to see everything running there. Moving on: what's new in Chaos Mesh? Last time, Chaos Mesh 2.0 was presented and the 2.4 features were introduced, so I'll be telling you what's new in 2.5. You already had pod, network, JVM, IO, stress, HTTP, DNS, and kernel chaos, plus GCP and AWS, which are cloud-provider-specific chaos. Then the ones in red, Azure, block, and physical machine chaos, were introduced in 2.4. Then in 2.5, you have multi-cluster chaos experiments, which is something new that we'll see in the demo as well. Basically, what that means is you will be able to install Chaos Mesh on one cluster.
You will be able to connect a remote cluster, another particular cluster, to it. So let's say you have two clusters, cluster one and cluster two. You will be able to create chaos experiments on cluster one, but those experiments will communicate with cluster two and actually run on cluster two. That's the multi-cluster kind of chaos experiment you will be able to do. Next, HTTP chaos TLS support: HTTP chaos with TLS support is a way to bypass TLS using self-signed certs. That's a new addition to the HTTP chaos. And the new workflow UI is enabled by default. I'll actually show you both the previous one and the new one, how they look. When you install the newer version of Chaos Mesh, 2.5.1, you will get the new workflow UI out of the box. OK, cool. We will go through the demo now. This should work. So let's pray to the demo gods, and we'll start. The nodes I have already prepared. This is a Civo Kubernetes cluster based on Talos, running 1.25. This is the first cluster, and this is the second cluster, chaos-mesh-2. So chaos-mesh-1 and chaos-mesh-2. The first Chaos Mesh installation was done on cluster one, so we can see that as well: you have your Controller Manager, you have your Chaos Daemon, and you have the Chaos Dashboard. One interesting thing, or rather something you should always be doing, is to add HTTPS/TLS for your applications. That is what I have done for the Chaos Dashboard as well. It's pretty simple: you can use cert-manager and Let's Encrypt to add certificates to your application. You just use a ClusterIssuer and a Certificate. I also have the NGINX ingress controller installed on this particular cluster, and I have pointed this particular domain, the load balancer domain of the Chaos Dashboard, at the secret which we created with the Certificate. So that's in general how it works.
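For reference, the dashboard TLS setup described above can be sketched roughly as follows. This is my own illustration of the standard cert-manager pattern, not the exact manifests from the talk; the issuer name, email, domain, and secret name are all hypothetical placeholders:

```yaml
# Sketch of TLS for the Chaos Dashboard via cert-manager + Let's Encrypt.
# Issuer name, email, host, and secret name are illustrative.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com            # hypothetical contact email
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx              # matches the NGINX ingress controller
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: chaos-dashboard
  namespace: chaos-mesh
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - chaos.example.com           # hypothetical dashboard domain
      secretName: chaos-dashboard-tls # cert-manager stores the cert here
  rules:
    - host: chaos.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: chaos-dashboard
                port:
                  number: 2333        # default dashboard service port
```

With the annotation in place, cert-manager watches the Ingress, solves the ACME HTTP-01 challenge through NGINX, and writes the certificate into the referenced secret automatically.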
Whenever you want to add TLS support to your application, this is how you do it. So this is already done, and we already have the Chaos Dashboard. Now, the tokens keep expiring, so what we'll do is generate the token again and log out. We specify the name of the token and the value that we got. I think the expiry time is one hour, and I was trying this an hour ago, so it might have expired already. So this is how the Chaos Mesh dashboard looks: you have your quick-start experiments, and you can create workflows, experiments, and so on. These are all the CRDs that are installed in the cluster. We'll not be doing much via the UI; we'll be doing most of it via the CLI. Now, the demos. In demo one, I'll be introducing pod network latency. You have Chaos Mesh already installed in cluster one, you have an application already installed, and there is a network chaos that I'll create to show how the network chaos gets done. You can see this particular custom resource, a NetworkChaos. It is just adding a delay to the label selector app: web-show, a 10-millisecond latency on this particular application. That's what this template is doing. kubectl get pods: you can see the application is running. What we'll do is apply the network chaos. It's created. We can actually do kubectl get networkchaos; that works because the CRD exists. Now we'll do a port-forward to the web-show application, and then we should be able to see it. You can see the 10 milliseconds of latency that has been added. That's a pretty normal, most commonly used experiment. Now, coming to the remote cluster stuff. That's an interesting one: multi-cluster chaos.
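The NetworkChaos resource from demo one just above looks roughly like this. This is a sketch reconstructed from what was described, not the exact manifest shown on screen:

```yaml
# Sketch of the demo-one NetworkChaos: 10 ms delay on the web-show app.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: web-show-network-delay
  namespace: default
spec:
  action: delay            # inject latency (rather than loss/corruption)
  mode: all                # target all pods matching the selector
  selector:
    labelSelectors:
      app: web-show        # the demo application from the talk
  delay:
    latency: "10ms"        # the 10-millisecond delay observed in the demo
```

You apply it with `kubectl apply -f`, and because the NetworkChaos CRD is installed by Chaos Mesh, `kubectl get networkchaos` lists it like any other Kubernetes object.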
So in cluster one, you have Chaos Mesh already installed; the user installs Chaos Mesh in cluster one. Now the user creates a RemoteCluster resource. You can see the RemoteCluster resource below, where I have defined the name of the cluster, the namespace where I want to install the Chaos Mesh components, and the kubeconfig file. Before applying this YAML file, you actually have to create a secret from your kubeconfig file in cluster one. So you do a kubectl create secret generic in cluster one, so that you can reference that particular name and value. Once you do that, the Chaos Mesh controller in cluster one will automatically install the Chaos Mesh components, the DaemonSet, et cetera, on cluster two. And then you create the actual chaos. If you look at the top right, in the spec we specify another field, remoteCluster. I want to run this particular chaos experiment, and I apply it in cluster one, not in cluster two, but the experiment will run in cluster two. That's how it is supposed to work. So, remote cluster chaos: the name of the chaos is burn-cpu, the kind of the chaos is StressChaos, and it uses the label selector app: nginx, which is already running on cluster two (created with kubectl create deployment nginx --image=nginx --replicas=3). That is why it will be able to select that and then add stress to one particular pod from it. So let's apply this remote chaos. And now we'll go to cluster two and do kubectl get stresschaos, and we can see that the burn-cpu chaos was created and has been running. We can actually describe it: yep, 17 seconds ago it successfully updated the records and so on.
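The two resources driving this demo can be sketched roughly as below. This is my approximation: the RemoteCluster field names should be checked against the Chaos Mesh 2.5 docs, and the secret name is illustrative:

```yaml
# Sketch of the multi-cluster setup: register cluster two, then run a
# StressChaos against it from cluster one. Field names approximate.
apiVersion: chaos-mesh.org/v1alpha1
kind: RemoteCluster
metadata:
  name: cluster-2
spec:
  namespace: chaos-mesh            # where to install the remote components
  kubeConfig:
    secretRef:
      name: cluster-2-kubeconfig   # created beforehand in cluster one with
      namespace: chaos-mesh        #   kubectl create secret generic ...
      key: kubeconfig
---
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: burn-cpu
spec:
  remoteCluster: cluster-2   # applied in cluster one, runs in cluster two
  mode: one                  # stress a single matching pod
  selector:
    labelSelectors:
      app: nginx             # the nginx deployment running on cluster two
  stressors:
    cpu:
      workers: 1
      load: 100              # percent CPU load per worker
```

The key design point is the extra `remoteCluster` field in the spec: the experiment object lives in cluster one, but the controller there forwards it to the registered remote cluster for execution.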
The only part where, as of now, you'd need something more is visualization: if you want to visualize things in the dashboard, you have to use the cluster two Chaos Mesh dashboard itself. Things are not currently visible in cluster one. That's where community input comes in: if you need observability from the connected clusters as well, what that should look like in cluster one is something we can definitely discuss with the maintainers and see how things progress in the future. But as of now, if you want to view it, you will be viewing it from cluster two's dashboard. Moving on to the third demo, which is the pod network latency again, the same network chaos we did in the first demo, but this time we'll create a workflow for this particular chaos. I'll have to access the dashboard. I don't know if this is also logged out. Yeah, it is, so we need to create the token again. Okay, so now we'll go to workflows and create a new workflow. This is the old workflow UI. You can select single, serial, parallel, and so on. We select parallel, and give it a name, kubecon. The deadline means how long this should run: two minutes. Then we have to create a child task; we can load from a previous experiment. We recently ran burn-cpu, so we'll just import that experiment and submit it. We'll create child task two, again import the same burn-cpu experiment, and submit that. On the right-hand side, you'll be able to see the Workflow custom resource that got generated. You can actually save this, and you can load the same workflow custom resource, or apply the workflow, without even coming to the UI. You'll be able to do that.
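Such a generated Workflow custom resource might look roughly like the following. This is a hand-written sketch of a parallel workflow like the one the UI produced, with illustrative template names and the stress spec reused from the earlier demo:

```yaml
# Sketch of a parallel workflow: two identical CPU-stress tasks run
# side by side under a two-minute deadline. Names are illustrative.
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: kubecon
  namespace: default
spec:
  entry: parallel-burn         # the template the workflow starts from
  templates:
    - name: parallel-burn
      templateType: Parallel   # run both children at the same time
      deadline: 2m             # the two-minute deadline set in the UI
      children:
        - burn-cpu-1
        - burn-cpu-2
    - name: burn-cpu-1
      templateType: StressChaos
      deadline: 2m
      stressChaos:
        mode: one
        selector:
          labelSelectors:
            app: nginx
        stressors:
          cpu:
            workers: 1
            load: 100
    - name: burn-cpu-2
      templateType: StressChaos
      deadline: 2m
      stressChaos:
        mode: one
        selector:
          labelSelectors:
            app: nginx
        stressors:
          cpu:
            workers: 1
            load: 100
```

Because it is just a custom resource, you can keep it in Git and `kubectl apply` it without touching the UI, which is also what makes the GitOps-style continuous chaos mentioned later in the talk possible.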
You can see the templates that you have selected, the name, the label selector, and the stress that you want to apply. You submit, give the workflow a name, the namespace is default, and the deadline is two minutes. So it is running. And in the new one, yeah, this is the new workflow UI, where you can actually visualize how things work. You can add and connect the dots. You can add, say, a kernel chaos, give it a name, provide all its fields, and submit, and it will add a kernel chaos. You can connect the dots one after the other, or run things in parallel, however you want. It gives you a much better visualization, and you can select all the chaos types from here, like a pod kill, and you can also directly upload a file, import a workflow here. If you have a workflow custom resource saved, you can import that particular workflow in this new UI, and it should run automatically. So that's the workflow, and you can see the previous kubecon workflow got completed: those are the green bars, and it completed. You can see the events that happened, and you can also see the events in general. You can also schedule experiments in a continuous manner: if you want a particular type of experiment to run weekly or daily at a particular time of day, you can do that as well. So yeah, good that all the demos worked. Now, recommendations. Chaos engineering is now part of well-defined, well-architected framework guidance: when you design the architecture of your application, you have to have chaos engineering implemented and designed in along with the architecture you're building for your application.
Yes, you have to learn some basic concepts. Learn the theory; there are a couple of books on chaos engineering as well, so make sure you check them out. Talk to the maintainers and practitioners, because that is very important: the maintainers will be able to tell you the exact use cases that they see people using a lot in production, and that can benefit your applications as well. Again, communication is the key, because chaos engineering is meant to run in production as well, with a minimized blast radius, and that can only happen if you do proper communication. You have to properly communicate what you are doing, and at which level of your application in production. That is really, really critical. Chaos engineering is expected to grow even more. There are more and more tools coming in; there are a few companies with booths here at KubeCon that are doing more and more on chaos engineering: managed chaos engineering, complete platforms, dashboards. Those are rising because the need and the challenges are here. Work towards continuous chaos: everybody wants Argo CD in front, with Git as the single source of truth, and everything happening from there. So have that sort of mechanism in place, where you have your CI/CD implemented: you create your custom resources, push them to Git, and as soon as that is done, your chaos experiments run. And then, yes, keep an eye on the new features of Chaos Mesh, because it is getting better day by day. Join some of the community groups that are there.
There's also a CNCF working group, the chaos engineering working group, that is publishing a paper on chaos engineering best practices, where not only Chaos Mesh but other projects like LitmusChaos are coming together to form the best practices. Getting involved, I think, is pretty simple. Every project needs to grow, and to grow it needs contributors. That's what we've been hearing throughout KubeCon: we need contributors, and the same goes for Chaos Mesh, just like any other open source project. You can contribute in all areas: docs, new experiments, and even feedback is pretty important. "This particular feature is missing, and if it were there, it would probably cater to these use cases." That level of feedback is required, and the new features you expect, what you are using in chaos engineering and what you feel is missing, is something the maintainers need to know. This is something I worked on with the maintainers, to bring that particular message to you folks, because Chaos Mesh in particular is open to having new maintainers as well. So basically, if you start your journey right now, maybe by trying it out, providing feedback, and then improving some of the things that you think will help others as well, you may well end up being a maintainer of this project going forward and taking it to the next level. And there's a Chaos Mesh community monthly call, which happens once every month, in the Chaos Mesh project. The Slack channel is in the CNCF Slack: if you are already in the CNCF Slack, you can join this particular channel, where all the development discussion and the queries about Chaos Mesh are answered. So yeah, that's pretty much it for this particular session.
Thank you so much for coming in and listening to this chaos engineering session. I hope you got a gist of what chaos engineering is conceptually: we inject failures to predict that something will go wrong in production before it actually goes wrong, and try to mitigate that sort of behavior. There are a lot of tools out there to do that, and today we discussed Chaos Mesh, because this session was the Chaos Mesh maintainer track. Some of the new features, the multi-cluster Chaos Mesh and the TLS support, are things you can try out in 2.5; the maintainers are working on 2.6, and they need more contributors to keep this project going. This is actually a CNCF incubating project, so it is already at the second maturity level; the only level left is graduated. I think once the community gives enough feedback, the maintainers can take it to the next level by having more and more production use cases, from people like you who will be using it. And yeah, you might end up being a maintainer; it has happened many times. Even in Linkerd it has happened: you start using a project and you end up being a maintainer of it. So I hope I made the message from the maintainers about the community and getting involved pretty clear, and I hope you liked the last session and will go home with some interesting stuff. I'll put the source code, the CRDs, and all of these things, and even, in the README, how I prepared this environment and this cluster, so you can replicate it exactly using Civo Kubernetes, into a GitHub repository. I'll be updating the slides as well: if you go to the session page, you'll be able to get the slides, and I'll put the GitHub link in the slides so that you can try it out on a Civo Kubernetes cluster and run your chaos experiments. Thank you so much.