Welcome, everyone. Today we are going to talk about what's new in Chaos Mesh, and I also want to share a little bit of the history of Chaos Mesh. This room is the maintainer track for Chaos Mesh, but technically I'm not the maintainer; I'm more like the creator of Chaos Mesh, and I would say Chaos Mesh is a project that was created by accident. First of all, I want to do a really quick survey: who among you is using Chaos Mesh in production? Can you raise your hand? Only one? Okay, cool. I can see the great potential of this open source project.

Before introducing the newest updates for Chaos Mesh, I want to give a little bit of its history. We come from a company called PingCAP, a distributed database company. Seven years ago we started to build an open source version of Google Spanner, which is TiDB. It is a massive-scale distributed system with full SQL features, which means it's a super complicated distributed system, and it's really hard to build. But I think the even harder problem is how to test a distributed system like this; I think that is even harder than building the distributed database itself. Internally, I once counted the lines of code of our tests: our integration tests and test cases are about ten times larger than the actual code base of our database. That's what I mean when I say testing a distributed system is really, really hard.

One way to test a distributed system is actually to delete files on your disk. I bet some of you know the TV show Silicon Valley, where they accidentally delete the files; maybe that's a good way to test a distributed system. Another very interesting talk was given by FoundationDB in 2014 at Strange Loop. That talk is really, really good, and it came long before Netflix introduced the idea of chaos engineering; it was about eight years ago, before we even created PingCAP. I watched that talk and learned a lot from it, because the speaker introduced a very interesting concept: deterministic simulation for distributed systems. He basically wanted to build a deterministic simulator to emulate all of the complicated states of the system and reproduce bugs with a 100% reproduction rate. But I think deterministic testing for a large-scale distributed system is not that realistic. On the other hand, increasing the probability of reproducing a bug by injecting failures into the system is doable. We are not going to achieve a 100% reproduction rate for your system, and unit tests do not catch a lot of the bugs in complicated distributed systems, but fault injection, which is what chaos engineering is, is a more reasonable way to build high-quality systems in the distributed world.
So that's the beginning. We did research on a lot of fault injection frameworks and open source projects, and we found that it's really hard to orchestrate all of these fault injection tools together to create an experiment. You can look at Namazu, you can look at Jepsen, but these tools are totally separate from each other and really hard to use together.

In the early days of PingCAP, the database company, at the very beginning we created an internal project we called Schrödinger. If you're familiar with quantum physics, you know what I mean by this name. It is basically a container-based fault injection system for our database, so it was designed only for TiDB. We then open sourced this project and changed the name to Chaos Mesh; that's the beginning of Chaos Mesh. This is the original screenshot of Schrödinger, but Chaos Mesh has a way fancier dashboard and UI compared to that project. So that's a little bit of history: we created this project in 2018, we open sourced it on the last day of 2019, and this year Chaos Mesh became an incubating project in the CNCF. That's a little bit of background about Chaos Mesh. Okay, the next part I will hand over for the following slides.

And the third one is multi-Kubernetes support, which allows users to launch chaos workflows on multiple clusters using one global controller.

Okay, so the first feature I would like to talk about is the Azure chaos. As the name implies, this feature allows us to run chaos workloads on top of the Azure cloud. Before 2.4 we already supported GCP chaos and AWS chaos, and we received a lot of positive feedback from the community, so very naturally we wanted to extend the cloud provider feature set by implementing Azure chaos. The underlying idea of Azure chaos is very similar to GCP chaos and AWS chaos: we authenticate the Chaos Mesh controller manager to access and control Azure cloud resources under the user's account.

Some simple chaos actions we provide: the first is virtual machine stop, which allows users to stop a specific virtual machine instance. Nowadays the public cloud providers offer very reliable servers, but unexpected incidents can still happen; a whole rack of servers may be temporarily out of use due to fire or a power outage, and VM stop can help us simulate this situation. The next one is VM restart. This one is slightly different, because sometimes a virtual machine may keep restarting over a period of time because of a hardware issue or a resource scheduling problem. This behavior can add extra pressure to distributed systems that require coordination between peer nodes. For example, if you are using TiKV as the backend storage and a server keeps restarting, you will have a peer that keeps leaving and rejoining the network, which adds extra synchronization overhead to the whole system. The third one is disk detaching. Most virtual machine instances in the cloud nowadays use unified remote storage, like EBS or Azure Block Storage, and this operation can help us simulate the case where the remote storage is temporarily unavailable.

Here is a brief example of how you could use Azure chaos. First, the user needs to put the authentication information into a Kubernetes secret, here called cloud-key-secret, and apply it to the Kubernetes cluster. Then we can define an AzureChaos workload, which points to the secret we just created, and define the action we want to take, here VM restart. Finally, we apply this workload to Kubernetes, and the Chaos Mesh controller will load it, authenticate, and proceed with the action.
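To make that concrete, here is a minimal sketch of the two manifests just described, based on my reading of the AzureChaos feature; the secret key names, subscription and resource names, and exact field layout are illustrative assumptions, so check the Chaos Mesh documentation for the authoritative schema:

```yaml
# Sketch: authentication info stored as a Kubernetes secret.
# The key names below are assumptions for a service principal.
apiVersion: v1
kind: Secret
metadata:
  name: cloud-key-secret
  namespace: chaos-mesh
type: Opaque
stringData:
  client_id: <your-client-id>          # placeholder
  client_secret: <your-client-secret>  # placeholder
  tenant_id: <your-tenant-id>          # placeholder
---
# Sketch: an AzureChaos workload pointing at the secret above
# and requesting a VM restart.
apiVersion: chaos-mesh.org/v1alpha1
kind: AzureChaos
metadata:
  name: azure-vm-restart-example
  namespace: chaos-mesh
spec:
  action: vm-restart                 # other actions: vm-stop, disk-detach
  secretName: cloud-key-secret       # the secret applied above
  subscriptionID: <subscription-id>  # placeholder
  resourceGroupName: <resource-group>
  vmName: <vm-name>
```

Applying both with kubectl is the "finally" step above: the controller loads the secret, authenticates against Azure, and restarts the named VM.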
The second feature I wanted to introduce is the block chaos. In the early years of Kubernetes, most users used Kubernetes to build stateless applications, but nowadays we are seeing increasing demand for building stateful applications on top of Kubernetes. For a stateful application, what's the top priority? It is saving the data correctly and efficiently. Therefore, stable and high-performance storage is very critical, and ensuring the application can still provide service under temporary IO performance degradation is even more critical. To this end, we implemented the block chaos.

You may be a little bit curious, because before 2.4 we already had an IO-related chaos workload called IO chaos, which can also be used to simulate unexpected IO latency on a block device, so why introduce a new IO-related feature? IO chaos works well in most cases, but it can come with a side effect when simulating access to a huge number of files. To implement IO chaos, we run a chaos sidecar container inside the target pod, using ptrace to replace the file descriptors of the target process and then take control of them by adding extra latency to the IO operations. This mechanism works well in most cases, but if the process opens a huge number of files, like thousands or tens of thousands, then the ptrace will pause the target process, and that is a side effect we don't want to introduce.

To better simulate the IO chaos scenario without introducing that side effect, we implemented the block chaos in version 2.4. Block chaos simulates extra IO latency on a block device by implementing a kernel IO scheduler, and it offers two key features. The first is IO delay, which allows users to specify the latency of the block device. The second, which is currently under intensive development, is the IOPS limit, which allows users to specify the IOPS of the block device.

So how does block chaos work internally? First, let's look at how a normal IO request is handled on a computing node. After the application sends a read or write request to the file system, the file system sends the corresponding block IO request to the kernel request queue elevator. This request queue elevator contains multiple IO schedulers from the kernel. A block IO request goes through all the kernel IO schedulers, is sent to the hardware request queue, passed to the driver, inserted into the NVMe (short for Non-Volatile Memory Express) queue, and finally written to the disk. As I mentioned, we implemented a kernel IO scheduler called ioem, which is added into the request queue elevator. When a block IO request reaches ioem, instead of passing it on immediately, ioem holds it for a specified time and then passes it to the next IO scheduler, which simulates the IO latency scenario.

Here is a brief example of how it works. We start with a write request, "Chaos is cool", to a specific block sector; the ioem scheduler captures the request, holds it for three milliseconds, and then passes it to the hardware queue. Compared to IO chaos, this feature simulates decreased IO throughput in a more natural way, without pausing the process.
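As an illustration of the IO delay feature just described, here is a minimal BlockChaos sketch; the selector labels, volume name, and 3ms latency are made-up values, and the field layout follows my understanding of the 2.4 schema rather than a verified reference:

```yaml
# Sketch: inject 3ms of latency into block IO requests for the volume
# used by a pod labeled app=demo-db (hypothetical target).
apiVersion: chaos-mesh.org/v1alpha1
kind: BlockChaos
metadata:
  name: block-delay-example
  namespace: chaos-mesh
spec:
  action: delay            # handled by the ioem kernel IO scheduler
  mode: one                # pick one pod matching the selector
  selector:
    labelSelectors:
      app: demo-db         # hypothetical application label
  volumeName: data-volume  # hypothetical volume to slow down
  delay:
    latency: 3ms           # matches the three-millisecond hold in the example
```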
The third feature I would like to talk about is physical machine chaos. If you are familiar with Chaos Mesh, then you probably know that Chaos Mesh is usually used as a Kubernetes plugin, which allows users to launch chaos experiments along with target Kubernetes workloads. But there are some legacy workloads that just cannot run in containers and cannot easily be migrated to a container or Kubernetes environment. So some users may need to manage a hybrid cluster consisting of nodes managed by Kubernetes running containers, nodes managed by a hypervisor running virtual machines, and nodes managed by themselves as physical nodes. To better support this case, we implemented physical machine chaos, which extends the Chaos Mesh framework from Kubernetes to physical nodes and virtual machines.

To use physical machine chaos, we first need to set up Chaosd on each physical machine. Chaosd is a daemon process that receives commands from a central chaos controller in Kubernetes and launches the corresponding chaos experiments. Chaosd can run as a command-line tool or as a system service on the target physical machine. Some common actions of physical machine chaos are simulating process faults, like process interruption; simulating network faults, like packet drop or limited network bandwidth; and simulating host faults, like shutting the host down. It also provides some application- or runtime-specific features, like JVM chaos or Redis chaos. So those are some new chaos workloads you can try out.

Next, I would like to talk about three generic features we have added to the Chaos Mesh framework since 2.4. The first generic feature I would like to discuss is the status check in workflows. To get the best results when running a chaos workflow, we sometimes launch it in a real production environment along with the real applications or workloads. However, sometimes these applications or workloads are so critical that we don't want to break them. To ensure this, we introduced this feature, which keeps checking the status of the application or workload through its status endpoints or an external monitoring system, and automatically stops the chaos workflow if it finds that the application or workload is unhealthy.

Here is a brief example of a status check template you can use to start a status check (a sketch follows below). The deadline field specifies that this node will be executed for a maximum of 20 seconds. The mode field specifies that this node will execute the status check continuously. It will poll the defined URL through the HTTP protocol and automatically abort the chaos workflow if the return code is not 200. Currently we only support the HTTP protocol, but in the future we plan to support other methods, like monitoring Prometheus metrics and using an alert manager to automatically stop and restart the chaos workflow.
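Here is roughly what such a status check looks like inside a Workflow, reconstructed from the description above; the health endpoint URL and the sibling chaos node are hypothetical, and the exact nesting of fields is my best guess, so verify it against the Chaos Mesh docs:

```yaml
# Sketch: a Workflow whose status check aborts everything when the
# application becomes unhealthy. Field nesting is an assumption.
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: workflow-with-status-check
  namespace: chaos-mesh
spec:
  entry: the-entry
  templates:
    - name: the-entry
      templateType: Parallel
      deadline: 240s
      children:
        - workflow-status-check
        - some-chaos-experiment      # hypothetical chaos node defined elsewhere
    - name: workflow-status-check
      templateType: StatusCheck
      deadline: 20s                  # run this node for at most 20 seconds
      abortWithStatusCheck: true     # abort the whole workflow on failure
      statusCheck:
        mode: Continuous             # keep polling, not a one-shot check
        type: HTTP
        http:
          url: http://demo-app.demo.svc:8080/health   # hypothetical endpoint
          method: GET
          criteria:
            statusCode: "200"        # any other code counts as unhealthy
```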
We also improved the workflow web UI to facilitate the process of defining a chaos workflow. I believe most of you are like me: you have years of experience using Kubernetes and are quite comfortable writing YAML. People say a Kubernetes engineer is a YAML engineer, but if we try to define a large chaos workflow with multiple stages, it can easily lead to hundreds of lines of YAML, and I believe you can easily get lost writing hundreds of lines of YAML. So we added a new drag-and-drop feature to help users define a workflow more easily. We recorded a short demo; let me play it. First we need to decide what kind of experiment we want to run; here we choose serial. We name it and set a deadline. Then we have a blank workspace, and we can drag and drop the actions we want to run into the workspace. We define the namespace and all the other related fields, then we click the submit button and submit the workflow. As we can see, the system generates the chaos experiment YAML in the back end. So feel free to try out this feature; I personally love it.

Last but not least, let's talk about multi-cluster support for Chaos Mesh. We keep receiving requests from community users wondering whether Chaos Mesh can support running chaos experiments across multiple Kubernetes clusters through one single global entry point. I think this is a very reasonable request, considering that many users are building large distributed systems or SaaS platforms that need to be deployed across multiple Kubernetes clusters located in different geographic regions, and sometimes they want to run a global chaos experiment, maybe simulating an apocalypse scenario. Launching Chaos Mesh experiments on each Kubernetes cluster separately is very inconvenient: you need to log in to different clusters, set up the Chaos Mesh framework on each of them, run the chaos workflows, and keep your eyes on multiple Chaos Mesh dashboards to monitor all the clusters.

To relieve the burden of managing multiple clusters, we developed multi-cluster support for Chaos Mesh, or you can call it chaos federation. The basic idea is that we connect multiple Kubernetes clusters at the Chaos Mesh level. We still need to set up a Chaos Mesh controller on each cluster, but we don't need to do it manually for each of them. More specifically, we assign a role to each Chaos Mesh controller: a central controller is assigned the coordinator role, and all other Chaos Mesh controllers are assigned the agent role. When setting up the environment, we only need to set up the coordinator once and then register each member cluster to the coordinator by applying a YAML file. We first encode the member cluster's kubeconfig into a secret (named something like cluster-xxx-kubeconfig), apply the YAML file, and the coordinator will load the kubeconfig from the secret and be able to access the remote member cluster. Then the coordinator will set up the agent Chaos Mesh controller on each member cluster. The agent controller is just a normal Chaos Mesh controller that identifies itself as an agent.

Once we have successfully set up the environment and built the connection between the coordinator and the agents, the rest is very straightforward. We define the Chaos Mesh experiment as usual, with only two differences: first, we need to specify the target remote cluster, and second, we submit the experiment to the coordinator instead of the agent. Here is a simple YAML. As we can see, it is just like a normal Chaos Mesh YAML file with one extra field, remoteCluster.
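For instance, an experiment submitted to the coordinator might look like this sketch; the cluster name, chaos kind, and selector are hypothetical, and remoteCluster is the one extra field mentioned above:

```yaml
# Sketch: a normal Chaos Mesh experiment plus the remoteCluster field.
# Submitted to the coordinator, which forwards it to the member cluster.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: remote-pod-kill
  namespace: chaos-mesh
spec:
  remoteCluster: member-cluster-1   # hypothetical name registered with the coordinator
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - demo                        # hypothetical target namespace
```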
Once the workflow is submitted to the coordinator, it will create the workflow on the target cluster and sync the workflow status back to itself. Using this, users can control and observe Chaos Mesh experiments across multiple Kubernetes clusters using one central controller and dashboard.

Okay, so those are all the major features I wanted to cover in today's talk. For the future, there are several things on our roadmap. First of all, we want to improve the usability of the system: we want to add more inspection and reports, and we also want to improve the debuggability of the system, like adding richer observability results. Second, we want to support more status check types, as I just mentioned, like Prometheus and Datadog. Third, we want to develop a new plugin framework, because we want to allow users to extend the Chaos Mesh framework themselves by defining their own chaos workflows. And the fourth thing, which is currently under development, is a global hub that allows users to upload their cool chaos workflows and share them with people around the world. Okay, so that's it for today's talk. Thank you so much. Any questions?

Yes. So I was curious, how would you handle a case where injecting a fault leads to the unavailability of the cluster itself?

Okay, so your question is whether the injected faults can affect the cluster itself, right? Do you have a specific case in mind? What kind of injection are you thinking about?

So, I'm working on stress testing. I was thinking of a case where I'm creating a lot of resources and seeing if that slows down the API server, or the API server goes down, or something like that.

Okay. I think in most cases, if we are just using Chaos Mesh, since we are running in containers, the faults are mostly injected into the containers, so they will not leak to the physical machine. But in the other case, as I just mentioned, we have physical machine chaos; with that one you need to be careful, because if you inject something into the physical machine, that may cause real damage. Run it only in a test environment; we would not use that tool on a production system, because it can easily mess up the whole system.

Okay, well, I haven't used Chaos Mesh that much, but I have a basic question. Kubernetes has all these horizontal pod autoscalers and other autoscalers, right? So how do you deal with the autoscaling, and keep the scheduler and the replica controllers from thinking that an injected fault is an actual failure and going into a loop while you're bringing down some of these pods?

Okay, your question is about how to decide whether this is a bug or...

I mean, how do you keep the scaling under control while you're doing the chaos testing?

Actually, we have a customized controller in Chaos Mesh. Chaos Mesh defines CRDs and also provides its own customized controller, which is injected into the Kubernetes runtime. This controller tracks the status of the running chaos experiments.

So my question is, what if the chaos controller itself crashes during the fault injection, and as a result it fails to clean up the chaos injection process?

Yes, so your concern is what happens if the chaos controller itself crashes at runtime, right? The chaos controller is normally deployed as a container, right?
So if it crashes, you definitely want to restart it, right? And we have a CRD that records the status of each chaos workflow, so after you restart, you can still read the latest status from the CRD and recover and resume your workflow.

And what if the chaos daemon that runs on each node fails to revert the chaos injections back to the previous state? We actually experienced that once, and the team that carried out the experiment panicked for a while before we manually recovered everything.

Yeah, so maybe that's a bug.

Is "that's a bug" an answer? Maybe. But to clarify, most of the time it worked perfectly.

Then we need to fix the bug.

Okay, I will try to collect the logs and submit them.

Great, great. Thank you.

In the case of Chaos Toolkit, we have a section to check the steady state, right? Is the status check the equivalent of that, or what is the status check in Chaos Mesh actually for?

So your question is what the status check is used for, right? Sometimes when you're doing chaos testing, you may want to add stress to a large cluster. As I just mentioned, usually we don't want to run it in a production environment, but sometimes we do want to run it in production, and then we don't want to break the actual applications or workloads, because they are very critical. So we have this status check to keep checking the applications and workloads. If we find the application or workload is unhealthy, we automatically abort the chaos workflow, dump all the data, and then the SRE team can come back and investigate, so we can learn the limitations of the whole system in the cluster. I will use an example to explain. If you are running a distributed database, the typical test is to keep a stable workload, watch the QPS and TPS, and then do the fault injection, because you expect the system to recover automatically. If the system does not recover after the fault injection, you know something is wrong with it. So the check makes sure the system behaves correctly, and it aborts the chaos testing just in case things go the wrong way.

Istio has fault injection already built into it. How does this work with Istio? Istio has its own fault injection mechanisms where you can inject latency and failures and so on, so would you use this instead of the Istio fault injection, or with it?

I think the two systems can work together; they are decoupled. But you need to make sure you understand whether a particular behavior is caused by the Istio fault injection or whether an unexpected situation is caused by Chaos Mesh. And maybe you could contribute a new chaos type to Chaos Mesh, an Istio chaos or something like that; that is doable.

Okay, that's cool. I didn't know Istio itself could do fault injection; that's really cool, I'll check it out later. Okay, I guess that's all. Thank you.