Okay, so hello everyone, and welcome to "Chaos Mesh: Overview, Practices and Futures". My name is Saiyam Pathak, I'm working as a Field CTO at Civo, and I am a CNCF ambassador and the author of a book of CKA and CKS scenarios. I'm also the founder of Kubesimplify, which aims to simplify cloud native. So even after the session, feel free to reach out to me in different places. Unfortunately, my co-presenter was not able to make it to KubeCon, but we have a recorded demo from him. Today we'll be discussing, obviously, Chaos Engineering, Chaos Mesh, some demos, project updates, and where the project is heading. We'll also share some insights from the roadmap: what we are looking to add or improve in the Chaos Mesh ecosystem.

Starting off by introducing Chaos Engineering. Before we delve into that discussion, we need to understand the current context. In the current context, systems are complex: with the rise of microservices and similar architectures, the complexity of applications is steadily increasing, and even best-in-class engineers are finding it hard to battle that complexity. In these scenarios, conducting testing is extremely difficult, especially covering all the corner cases you can imagine. This is where Chaos Engineering comes into play: you inject or simulate various failure scenarios, very much like real-world scenarios, and Chaos Engineering can help us better understand and improve our systems. The entry point, again, is to simulate an extreme scenario that could occur in the real world; that's the main goal. But it should be done in a controlled environment, because we are trying to replicate something that happens to production systems, so we cannot just randomly break things. And thereby we verify and improve the systems.

Failures and incidents can happen anytime. If you look at the timeline, you'll see almost 44 outages in the past three months within GitHub alone. Regardless of the size of the organization, the number of engineers working there, or the best practices it follows, things can fail, and they do break. Human error, decision making, anything can lead to this.

So what is Chaos Engineering, and what is it not? Chaos Engineering is about breaking things in a controlled environment, through well-planned experiments, in order to build confidence in your application's ability to withstand turbulent conditions. You can see the five steps over there. The first one is the steady state: defining the steady state means identifying the measurable output of a system that indicates normal behavior, and we can get that by observing the systems carefully. The second part is the hypothesis: that the steady state will continue to remain the same in both the control group and the experiment group. Then we introduce the experiments, and these are the real-world chaos experiments: introducing variables like hardware failures, network failures, network slowdowns, all of which can be injected.
And then we try to validate or disprove the hypothesis by finding the difference in the steady state between the control group and the experiment group. And then we keep analyzing and improving based on the results we get. The last point is important: Chaos Engineering is not about breaking things randomly, without a purpose, which I just mentioned before as well, because this is being done on production.

Also, when we go back down memory lane, this is not a very new concept or terminology; it has been around for more than a decade now. The development of Chaos Engineering can be divided into different stages on a timeline, starting from the early days of Netflix killing virtual machines to understand the resiliency of their systems, back when cloud providers like AWS were not that resilient. This really worked well for them, and slowly they started writing more and more engineering blogs on how they implement these kinds of failure scenarios, driving a cultural change: just like we had the cultural shift of DevOps, there was a cultural shift towards this learning and its implementation in architectures as well. And as you can see on the timeline, different organizations and companies came in, and around 2017 you see more and more open source projects like Chaos Toolkit, chaoskube, Litmus, Chaos Mesh, and ChaosBlade, and then even the cloud providers with their own SaaS-style offerings, and Gremlin and the like. So this is not something that happened suddenly; it has been a combined effort over the years, with knowledge carried over from different production systems as cloud native technology was adopted. As the technology has evolved, complexity has increased, and so has the way of doing chaos engineering experiments.

And what do I mean by that? With the adoption of cloud native technology, chaos engineering is more and more integrated into engineering environments through the declarative API. Now you can declare what the expected steady state looks like, and the controller then verifies it, reads that desired state, and injects the chaos into the system in a way you can easily control and observe. The popularity of Linux containers allows applications to run in isolated environments; that is what containers have always been about, and running things in isolation makes it much easier to minimize the blast radius when we talk about doing things in production. The standardization of Kubernetes and container runtimes makes fault injection easier, smoother, and more universal. And you can also use a privileged container to perform node-level chaos in the Kubernetes ecosystem.

Coming to Chaos Mesh. Chaos Mesh is an open source chaos engineering platform that provides a comprehensive and user-friendly toolchain to do chaos experiments on Kubernetes, and even beyond, actually; I'll touch on that as well.
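To make the declarative point concrete, here is a minimal sketch of what a Chaos Mesh experiment definition can look like. It is just an illustration: the namespace, labels, and values are hypothetical, so adapt them to your own workloads.

```yaml
# Minimal sketch: stress the CPU of one randomly selected pod
# for 30 seconds. Namespace and labels are hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-demo
  namespace: demo
spec:
  mode: one                # pick a single pod from the matching set
  selector:
    namespaces:
      - demo
    labelSelectors:
      app: web-backend     # hypothetical label
  stressors:
    cpu:
      workers: 2           # number of stress workers
      load: 50             # per-worker CPU load percentage
  duration: "30s"
```

You apply it like any other Kubernetes object, with kubectl apply, and the controller reconciles it just like it would a Deployment or a Service.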
To understand the key features: Chaos Mesh offers several kinds of chaos experiments, including but not limited to the ones you can see: PodChaos, NetworkChaos, IOChaos, DNSChaos, TimeChaos, JVMChaos, and so on. You can also extend these capabilities via Chaosd to run node-level experiments like stress, disk failures, network failures, and things like that. Chaos Mesh also offers dedicated Schedule and Workflow features, so you can create your own workflows combining different sets of chaos experiments. And additionally, yes, it has a web UI to manage and inspect all the chaos experiments. The design goal has always been to make it easy to use: regardless of whether you are a beginner or an expert, you should be able to understand it, because every type of chaos experiment has its own custom resource definition. So you simply need to know that for this type of chaos, these are the parameters you need to set.

Coming to the Chaos Mesh architecture, there are three main components. One is the Chaos Dashboard, which again is a simple web UI to manage and inspect chaos experiments; it allows you to create, run, manage, and even monitor them. The next one is the brain, or heart, whatever you want to call it, of the Chaos Mesh architecture: the Chaos Controller Manager. It is responsible for managing and scheduling the chaos experiments. Basically, when you define what you want inside a chaos experiment, the request goes to the Chaos Controller Manager, and it detects and handles that state based on what you have defined. It also embeds the workflow engine, where you can define and execute complex chaos scenarios: users can use it to orchestrate multiple chaos experiments, creating more complex use cases and scenarios close to real-life production, and health checking and all that is in there too. The last one is the Chaos Daemon, which actually injects and runs the chaos experiments. You can drive all of this via the UI, the API, and the CLI, or customize templates, so you can do it in different ways; the UI is not the only way to create a chaos experiment.

The Chaos Controller Manager, one of the key components, monitors the creation, changes, and deletion of resources. Whenever there is a change, or a new experiment is created, the controller manager determines where to inject, and when to revert and recover from the faults; that decision is based on the configuration provided. Once the decision has been made, it goes to the Chaos Daemon. The Chaos Daemon operates in a Linux environment and is responsible for actually injecting the chaos fault. It uses common Linux namespace and cgroup concepts to inject faults inside Linux containers. Different types of chaos experiments have different dedicated executors. For example, as you can see, the Kubernetes API is used to implement PodChaos scenarios like pod-kill and various container-runtime-level faults, and Linux kernel networking and command line tools like iptables are used to implement NetworkChaos.
There is a built-in user-mode component for IOChaos, a transport-layer proxy for HTTPChaos, a built-in kernel module, the chaos-driver you can see there, for BlockChaos, and an integration with the existing tool Byteman to implement JVMChaos.

So instead of scripts and series of scripts, the design philosophy has been pretty straightforward with Chaos Mesh: you define what you want, and every chaos type has its own executor that runs it. For different types of chaos, you write a different type of CRD object; we'll go through a sample to see how it actually looks. Using deterministic executors for the different types of chaos experiments, along with custom resource definitions, is another aspect of the Chaos Mesh design. You get a standard YAML file, which everyone in the cloud native ecosystem is used to writing: with Kubernetes and containers, you have a cluster and you write a YAML file for whatever object you want to create. When you install Chaos Mesh, it installs the custom resource definitions, so you can use the different kinds of chaos experiments directly; if you use network chaos, the kind will be NetworkChaos. At the same time, we also have the Workflow and Schedule objects that allow users to combine all of these into complex experiment scenarios. You can also run health checks for a more detailed understanding of the health of the application, allowing the workflow to respond to changes in the application status, such as terminating the workflow when the application is unavailable.

This particular example is a network chaos. I don't know how visible it is, but you'll have to trust me: this is a NetworkChaos example. Think of it like this: there is a backend application and there are Redis instances. What we want to do is randomly select one backend pod and one Redis pod, and inject a delay on traffic from the backend application to the Redis application, periodically, for a certain duration. So we want to add a delay of a certain number of seconds, on a regular basis, from the backend application to the Redis application, to simulate this real-world network chaos: if something like this happens, then what happens? Let's see how it's done. The boxes that you can see are the selectors; we use selectors to mark our source and target, and the usage of selectors is consistent with Kubernetes itself. An additional point to note here is that the mode is one, so it will only select one pod: there can be multiple replicas when you specify things within Kubernetes, and it will only select one of them. That's how you do it in a controlled manner: you don't just kill everything, or delay traffic from all the pods to all the pods; you select one particular pod. Then you declare the experiment action as delay, as you can see over there, and configure the chaos experiment with a delay of 10 seconds on the outgoing network traffic. And then there is the scheduling object: you define the duration, so this particular experiment will run for 30 seconds every 60 seconds.
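The slide is hard to read, so here is a hedged reconstruction of what that manifest might look like: a Schedule object wrapping a NetworkChaos. The names, namespaces, and labels are hypothetical.

```yaml
# Sketch: every 60 seconds, inject a 10-second delay, lasting 30
# seconds, on traffic from one random backend pod to one random
# Redis pod. Names and labels are hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: backend-to-redis-delay
spec:
  schedule: "*/1 * * * *"    # run once every 60 seconds
  concurrencyPolicy: Forbid
  type: NetworkChaos
  networkChaos:
    action: delay
    mode: one                # one random pod from the source set
    selector:
      labelSelectors:
        app: backend         # hypothetical source label
    direction: to
    target:
      mode: one              # one random pod from the target set
      selector:
        labelSelectors:
          app: redis         # hypothetical target label
    delay:
      latency: "10s"
    duration: "30s"          # each run lasts 30 seconds
```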
So as a result, something like this should appear on the graph: for 30 seconds every 60 seconds, that's how it runs, with the delay going from the backend pod to the Redis pod. That's how easy it is, and it should really plant in your head that yes, this can happen in production. And when it happens, you can see whether your systems and applications are all still running fine. If not, you fine-tune: maybe you need more replicas, or there are other aspects you can fix before such an issue ever comes up in production.

Just to add a bit more about Chaosd, which I talked about briefly. In our community there were many requests for conducting chaos experiments not only on Kubernetes clusters, but also on bare metal servers and virtual machines. Chaosd can actually be used independently: it is, again, a collection of fault injection tools, and it reuses the injection logic of the same Chaos Daemon that executes the chaos in the end. It can be used as a CLI, as well as an agent running as a service on the node; when used as a service, it can be integrated with Chaos Mesh using the kind PhysicalMachineChaos. Where the earlier kind was NetworkChaos, here the kind will be PhysicalMachineChaos. You can then create a workflow combining physical machine chaos and network chaos, running one after the other, or chain them. In summary, outside of Kubernetes, Chaosd is also a powerful chaos injection tool.

Now coming to the demo; I hope it plays as expected. [Recorded demo] Hello everyone, I'm here to show you some awesome demos and some practices with Chaos Mesh. My name is Zhiqiang, I'm a maintainer of Chaos Mesh, and I'm also an individual freelancer. For the demo, we're going to use an application called podtato-head. It's a really funny application; it's the demo app of the CNCF TAG App Delivery, and it uses a microservice architecture: the application is split into several components, each responsible for a part of the potato man's body, like the hat, the arms, and the legs. It's really interesting for an application to use the composition of a potato man to represent a microservice architecture. Today we're going to show you three typical cases with Chaos Mesh: PodChaos, NetworkChaos, and HTTPChaos. I have already installed Chaos Mesh and the podtato-head application inside my minikube cluster.

First, I'm going to show you PodChaos. In this case, I will kill one of the pods, such as the left arm, and the rest of the microservices should still work, but as you can see in the preview, that one part will not. This is the application in its normal state, and now I will inject the PodChaos into the application, executing kubectl apply -f with the PodChaos manifest. The chaos is created; let's refresh the page, and we can see that this part is already gone. If we want to recover from this chaos, we just remove the object with kubectl delete. Let's take a look at the status of the pods: it should be the left arm, and its status should be Running again, as it has already restarted. Let's refresh the page, and the application is back to normal.
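The manifest applied in the demo is not shown in detail, so here is a sketch of what a pod-kill experiment against the left-arm component might look like; the namespace and labels are hypothetical, so check what your podtato-head deployment actually uses.

```yaml
# Sketch: kill one pod of the left-arm component.
# Namespace and labels are hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-left-arm
  namespace: podtato
spec:
  action: pod-kill
  mode: one                # only one pod from the matching set
  selector:
    namespaces:
      - podtato
    labelSelectors:
      component: left-arm  # hypothetical label
```

Deleting the object ends the experiment, and the Deployment controller brings the killed pod back.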
Actually, this situation reflects a design flaw called a single point of failure, and it can be overcome with multiple replicas. So let me show you another case: let's scale the left arm to three replicas. And let's take a look; the pods are being created. Now, if we execute this chaos experiment again, please notice we are using mode one, which means we will only make one of the pods fail. If one pod fails, the rest still work, and things should be fine for a highly available application. So let's create this chaos experiment again. And yes, it still works, because there are still other instances running pretty well.

Okay, that was PodChaos; the next chaos is NetworkChaos. I will inject network delay into the different components and change the sequence in which they appear on the web front page. This is the normal situation: when I refresh the page, it appears very quickly. Now let's apply the network delay; before that, let's take a look at the contents: I will inject different delays into the different components, 200, 400, and 800 milliseconds. Create it, and then refresh. You can see the different parts of the body appear one by one; this is because of the network delay. Let's recover it by just deleting it. Yes, it goes back to normal.

And at last, I want to show you the most interesting one: HTTPChaos. HTTPChaos can modify the content of a request or a response, and in this case I will use another picture to replace the original hat picture; I will use the Chaos Mesh logo to replace it. We can also take a look at the body of the file; there is a lot in there, because this is actually the content of an SVG file, but let's not worry about that. With HTTPChaos we can replace the response body with this kind of content. Created; then we refresh, and yes, the HTTPChaos is injected, and it has replaced the picture with the Chaos Mesh logo. Then we recover it by just deleting the object; it recovers very quickly, very efficiently. So that's it; I think you already get a sense of the awesome things Chaos Mesh can do.
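The manifest used in this last demo is also not shown in detail. A hedged sketch of a response-body replacement could look like the following; the labels, port, path, and payload are hypothetical, and the exact field shapes can vary between Chaos Mesh versions.

```yaml
# Sketch: intercept responses from the hat component and replace
# the body (an SVG image here) with different content.
# Labels, port, path, and payload are hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: replace-hat-image
  namespace: podtato
spec:
  mode: all
  selector:
    labelSelectors:
      component: hat     # hypothetical label
  target: Response       # act on the response, not the request
  port: 8080             # hypothetical container port
  path: "*"
  replace:
    body: PHN2ZyAuLi4+   # replacement content, base64-encoded
```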
Next, I will share some practices from our adopters, about integrating Chaos Mesh with their own platforms. The first one is called Stability Testing of TiDB Releases. TiDB is the product of PingCAP; it's a highly available, distributed database system, and they use Chaos Mesh in their CI pipeline: for each release version, lots of chaos test scenarios are run to verify that TiDB still meets the high availability requirements. That's one usage. Another usage is Chaos Mesh on TiDB Cloud. PingCAP also offers a globally available managed TiDB, called TiDB Cloud, with business across lots of regions and availability zones. The design is that there is one regional Chaos Mesh, and it can inject different chaos into the tenant instances within the same region. It's a very efficient design, as it avoids having to install a Chaos Mesh instance in every tenant's cluster.

They also have a product called Clinic, which integrates Chaos Mesh as well: TiDB users can use its drill-test functionality to simulate chaos experiments in their cluster, with some beautiful metrics charts inside the page. Another adopter is Tencent Cloud. They have their own platform called OSCAR, Tencent's private cloud chaos engineering platform. It's a really complex system, and they have lots of customers: internal customers such as gaming and payment platforms, and also external customers like banks, insurance companies, and finance companies. In their design, Chaos Mesh is used as the powerful tool to inject chaos, while other components manage and orchestrate the chaos experiments. For example, many chaos experiments can be reused, so they built a management system called OSCAR Hub, which holds lots of chaos experiment templates that can be reused by different users. They also use their CMDB, which works much like labels in a Kubernetes system, to dynamically choose the targets into which the chaos experiments get injected. And OSCAR also has a great dashboard for management and reporting; Chaos Mesh does not provide this kind of functionality for experiment reporting or chaos experiment history, so they built it themselves. So those are some practices on how to integrate Chaos Mesh with your own platform. I'm very happy to have shared the demo and the practices, and thank you very much for joining us. Back to Saiyam, thanks.

Awesome demo. Coming to the project updates and the roadmap. First, we have a new release, Chaos Mesh 2.6.2. The latest version comes with some enhanced features and bug fixes, providing a more robust solution for chaos engineering experiments. Chaos Mesh always follows upstream, so it supports the latest Kubernetes version, 1.28. We have also refined the dashboard view; the updates aim to provide a more user-friendly interface and a better overall user experience. In our commitment to security and safety, we have upgraded dependencies to address vulnerability issues, which ensures the platform continues to be secure and reliable. And lastly, the introduction of SBOMs, software bills of materials, and signed container images. These features, again, enhance the transparency and trustworthiness of the software products you're using; SBOMs being mandatory is not a new thing now.

Looking at the future of Chaos Mesh, and some of the most requested items: the team is consistently working on providing more observability in Chaos Mesh. We plan to provide metrics to help you better understand the effects of a chaos experiment, mainly graphs like the one I showed: more graphs where you can see, for the duration when the chaos experiment ran, what happened, say, when network loss was injected. For example, metrics for the injected port for NetworkChaos in that network latency case, or fine-grained CPU and memory usage metrics for StressChaos, and things like that. We also plan to increase the health check capability, with the goal of achieving something similar to a liveness probe on a pod.
Then multi-cluster chaos support, allowing users to conduct experiments across multiple clusters with centralized management and scheduling, so you'd be able to run different chaos experiments across different clusters from one place. The codebase is also continuously being optimized with respect to the latest Kubernetes support, end-to-end testing, integration testing, and all that.

And a very interesting thing that is happening, or rather the aim, is to collaborate with other open source projects to build a kind of chaos automation platform. ChAP, the Chaos Automation Platform as we call it, originates from internal practices at Netflix: it can automatically carry out small-scale chaos experiments and provide automated analysis and reports. The idea is to sacrifice the experience of a very small portion of users in exchange for the stability of the entire platform. Obviously, there are many technical pieces involved in automatic chaos experiments, and the development of such a platform needs a lot of effort and coordination. For example, we could use Keptn for canary releases and rollback after the experiment is completed; we could use a service mesh or the Gateway API to schedule traffic, redirecting a very small portion of it to the chaos experiment group; we could use Grafana k6 as an additional load testing tool for simulating extra load; and finally, we could use observability facilities to collect and analyze performance logs, generate reports, and create fancy graphs.

So yes, there are a lot of future possibilities for Chaos Mesh. This is the direction the project is going, but we are also looking for more and more contributors so these things can happen faster, as we move towards exploring all these challenges and opportunities: building ChAP-like solutions, the Chaos Automation Platform, and integrating different sets of tooling like Keptn; these are just examples, and there are other tools as well. That will need more and more contributors.

So that's pretty much all I had for this particular session with my co-presenter, who couldn't be here, but you saw the awesome podtato-head application; that was really funny, right? Thank you so much for coming, and I'll be hanging around if you have any questions. I think the time is already up, so thank you so much for coming. Thank you. Thank you.