So, thank you for making it on the last day of KubeCon, Friday afternoon, post-lunch session. Really appreciate you all coming in. This is the maintainer track session for LitmusChaos, the chaos engineering project. It's an incubating project in CNCF. We'll get started with the introductions. I'm Karthik, one of the maintainers of the LitmusChaos project. I work as a principal software engineer at Harness. And I have Uma here.

Hi, everyone. I'm Uma Mukkara. I am the head of chaos engineering at Harness. Karthik and I started Litmus — we co-created it back in 2018 — and we've been maintaining this project ever since. It has been an incubating project at CNCF for the last one year. Nice to be here. Nice to meet you all.

Okay. So, this is what we have on the agenda. We'll do a quick introduction to the LitmusChaos project. I'm sure a lot of you have heard about chaos engineering already, so we'll not spend too much time talking about what it is; we'll get directly into the project specifics. We'll talk about one specific feature and stress upon it for a few minutes: how you can use a single chaos control plane for running chaos against varied infrastructure targets across your different cloud environments. And we'll talk about what we got done since the last KubeCon — project cycles nowadays are measured in KubeCons — so we'll cover what we did since the Europe KubeCon. Then Uma will speak about the 3.0 beta for the Litmus project and also talk about the community a little bit.

This is just a quick refresher to set the context before we start off. Chaos engineering is essentially injecting controlled failures into your environment, and the idea is to introduce these faults to find out weaknesses. If there is something that's unknown, something that you've probably not accounted for, you'll find those things out using chaos engineering. There's a lot of material on the Internet around what it is, especially as used for distributed systems.

As far as LitmusChaos goes, it's a project that started around the 2018-19 timeframe, and we are an incubating project. The Litmus tool, or framework, is Kubernetes native — it's just a Kubernetes application. You can install it via Helm: you do a helm install and put whatever configuration you need in the values. You can install the Litmus framework with a limited scope, just for yourself, or at a larger scale for your organization. Multiple modes are supported: namespaced mode and cluster mode. In the namespaced mode, Litmus will let you orchestrate chaos in a specific namespace that might have been allotted to you as a developer — that's where you're experimenting. Or you could be the SRE who'd like to set up one control plane for all the developers, to create some kind of self-service environment, in which case you install it in cluster mode.

Once you've got the control plane running — and I'll show a quick demo before we get into the release details — that's where all the chaos management happens. You're basically connecting different target environments into that control plane. These target clusters could be residing anywhere, on any cloud platform, or in an internal data center.
You could be connecting clusters, or you could be connecting namespaces as well. So, you could have the Litmus control plane in one cluster, and you're connecting either another cluster or one namespace in another cluster — it depends on how much autonomy you have over your infrastructure. When you connect a target, you're basically running an agent in the cluster or namespace that you've connected, and that agent is what we call the chaos delegate. It carries out the business logic of your chaos experiment. And when we talk about the business logic of a chaos experiment, we're talking about the fault injection, the pre- and post-fault checks we'd like to do, as well as any other hypothesis validation you might want done. You might be querying some metrics, or looking at the availability of some downstream services. Any hypothesis you have around how your application should behave can be built into your experiment spec. So, the experiment in Litmus is essentially the fault, plus the pre- and post-chaos checks, plus whatever monitoring you'd like to do. All of this results in a verdict for your experiment — a success or a failure — which you can then use for decision making.

In the Litmus control plane, you will find certain chaos artifact sources embedded. We call them chaos hubs. A chaos hub can be thought of as an open marketplace of chaos experiments; that's where the community goes and pushes their experiments. You can pull those experiments from there, construct complex scenarios using them, and go ahead. Once you've created something that you feel is worth sharing with others, you can push that scenario back into the chaos hub as well. What you see here are just a couple of chaos hubs in the demo environment that I have.

So, once you've set up the control plane, logged in, and configured your chaos hubs, you're ready to build your chaos scenario. As part of the scenario creation, you pick a fault and then you say: what is it that I want measured during the fault injection? We've built something called probes to do that. You can see two probe types here — there are many others; you can find them in the documentation. I've just cut a couple of screenshots here for illustration purposes. There's the HTTP probe, where you can verify the status of your services. You could be testing the service status for the application that you're subjecting to chaos, or the status of some service that depends on the one you're impacting. And then there's the Prometheus probe, where you can run PromQL queries and check whether the deviation in the metrics is as per expectation or not. These are probes; we use them for steady-state hypothesis validation.

Once you've constructed your scenario by picking the right faults and setting the right probes, you run it, and then you'd like to bring more users onto the control plane, into your workspace. In the Litmus control plane, each user is allocated a dedicated workspace, which we call a project. You go inside the project and invite other members on the platform into that project to collaborate with you, and you assign a certain role to them. Are they only going to be able to view the chaos scenarios that you've already run? Are they going to be able to run them? Are they able to construct new scenarios, and so on? That's the next step that you would generally do.
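Circling back to the probes for a moment: here is a minimal Python sketch of the kind of checks an HTTP probe and a Prometheus probe perform. This is only an illustration of the concept, not the actual Litmus probe implementation; the URLs, the PromQL query, and the threshold are made-up values.

```python
import requests

def http_probe(url, expected_status=200, timeout=2):
    """HTTP-probe-style check: is the (dependent) service responding as expected?"""
    try:
        return requests.get(url, timeout=timeout).status_code == expected_status
    except requests.RequestException:
        return False

def prometheus_probe(prom_url, promql, max_value):
    """Prometheus-probe-style check: does a metric stay within the expected bound?"""
    resp = requests.get(f"{prom_url}/api/v1/query", params={"query": promql}, timeout=5)
    result = resp.json()["data"]["result"]
    value = float(result[0]["value"][1]) if result else 0.0
    return value <= max_value

# Hypothetical steady-state hypothesis: the frontend stays reachable and p99 latency
# stays under 500 ms while the fault is running.
ok = http_probe("http://frontend.boutique.svc") and prometheus_probe(
    "http://prometheus:9090",
    'histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket[1m])) by (le))',
    0.5,
)
print("steady-state hypothesis holds:", ok)
```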
Then, when you've got that system working — your experiments run as desired and have proven to be effective — you'd like to run them again and again, right? You might want to run them as some kind of background service; you can schedule a scenario for recurring execution, and there are cron schedules you can set up. Or you could be invoking your chaos scenario as part of a CD pipeline: you deploy your application and you want to check its sanity, so you run the chaos scenario. Or you could be triggering the chaos scenario as part of a GitOps deployment. If you're not running things the pipeline way — if you're using GitOps — then you have Flux or Argo CD or one of these tools to upgrade your application, and that application change can trigger a chaos scenario run. So, that can be done. What you see here are some screenshots that explain how you can do it. There's a recurring schedule. There's the Litmus API, which you can invoke to run the chaos scenarios from your pipelines. Or you could be setting up event tracker policies. This is a CRD where, in the conditions, you're looking for certain checks on your deployments: what is the replica count here, what is the image here? Any change to this configuration is going to trigger a chaos scenario run.

And finally, once you've got all these scenarios running and you're comfortable, you'd actually go and check the resilience trend. You're running these scenarios, that's great, but are they succeeding or not? What is the state of your application's resilience? Litmus provides something called a resilience score. This is a metric that essentially maps your chaos scenario to your application target: this application is this resilient to this scenario. The scenario might model an outage that you saw in the past — that's why you created it — so you map it against your application and say, this is how resilient I am.

There's a specific way to calculate this resilience score; we'll talk about it alongside the demo. You can add a number of probes to your faults, and each fault is going to return a probe success percentage. Each scenario can have multiple such faults, and each fault is also associated with a weightage, or criticality. That criticality, along with the probe success percentage, together decides your resilience score on a scale of 0 to 100. Using this metric, you can compare scenario runs. Maybe you ran the scenario on build X and then on build Y, and you're comparing how the resilience varied. Or you might have run the scenario in your dev environment, QA, pre-prod, and prod, and you're looking at how the resilience trend changes across environments. Or maybe there's a change that you made within the experiment spec — you changed a tunable and said, instead of 2,000 milliseconds of latency I'm injecting 2,500 — and now you want to see what happened across runs because of the change in the amount of latency. So, you can go ahead and make a decision based on the resilience scores. That's about most of the features.
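As a rough illustration of how such a weighted score can be computed — the exact formula Litmus uses may differ, and the weightages and probe results below are made up — here's a small Python sketch:

```python
def resilience_score(faults):
    """Weighted aggregate of per-fault probe success, on a 0-100 scale.

    faults: list of (weightage, probe_success_percentage) tuples, where the
    weightage reflects how critical that fault is within the scenario.
    """
    total_weight = sum(weight for weight, _ in faults)
    if total_weight == 0:
        return 0.0
    return sum(weight * success for weight, success in faults) / total_weight

# Hypothetical comparison of the same scenario across two builds.
runs = {
    "build-X": [(10, 100.0), (5, 50.0)],    # (weightage, probe success %)
    "build-Y": [(10, 100.0), (5, 100.0)],
}
for build, faults in runs.items():
    print(build, "resilience score:", round(resilience_score(faults), 1))
```

Comparing those two numbers across runs is exactly the kind of build-to-build or environment-to-environment decision making described above.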
I'll just introduce you to this concept of a single control plane over hybrid infrastructure. I think you might have already got it, but I just want to stress it a little bit. With Litmus, you have one control plane — the chaos center — comprising a dashboard, a chaos server, and a MongoDB to hold the state. And you could be connecting other Kubernetes clusters or namespaces running anywhere: on-prem Kubernetes, AWS, Azure, VMware, what have you. As long as you have the network connectivity, you will be able to connect them and run chaos. So why is this powerful? One of the recent ways of deploying applications is to spread them across different cloud providers for redundancy purposes, so you'd like to have one homogeneous control plane from which you can target your infrastructure components residing on different platforms. That's what is enabled by this model. You can see this in the illustration: we have the portal here, which is the chaos center, and then you have different clusters — three clusters here — each of them connected to the same portal, and they could be residing in different environments. So you have one cluster that is co-resident with my control plane, and then there could be others. For example, you see something called virtual machines, which could be either VMware or OpenStack, or maybe bare-metal physical machines, and then you have your cloud environments. The Litmus agents sit there, and they carry out the execution of your chaos workflow — what we've been calling chaos scenarios all this while — and then they emit metrics and events which you can bring back to your control plane to make a decision on how your resilience looks.

The feature summary is here. This diagram has a lot of circles, and you can see the chaos center at the center of it all. You can use it for user management and teaming. You can connect agents to it. Once you've connected an agent, you do some amount of asset discovery to find out what services run there — the eventual subjects of your chaos. Then you pull the faults that you would like to run against the microservices that you've discovered, and you add the hypothesis validation using probes; that's the observability part. Let's say you're working in a disconnected environment: you might want to change the images — you might have pulled the Litmus images and pushed them to your own registry — so you can make that setting in the control plane and say that all the images pulled for the execution should come from this registry. Then the chaos workflow actually runs. You could schedule it once or repeatedly, and then this loop, the circle, just continues. You can see a few other circles here: the experiments support different runtimes — containerd, Docker, CRI-O. You have the GitOps way of doing things: you can enable a switch in the control plane, and all the chaos scenarios that you build in the platform get committed back into a Git repository so that you can share them with your teammates. That's pretty much the summary.
Before I go to the release section, I'll show you a very quick demonstration. Let me pull up the chaos center. You can see this is the chaos center; I'm going to log in with the default credentials — admin and litmus is the default that you get once you've installed. I've gone into the admin project, as you can see here. Each admin is allocated a dedicated project, and since I happened to log in as admin, I have the admin project here. You can see the chaos delegates section has one chaos delegate connected — one cluster, called self-agent — and I've run some scenarios before. I'm going to run a new one. You can see the chaos hubs here; this is the chaos artifact source that I talked about, and you can see experiments of different categories — a multitude of experiments which cover most resilience needs. You can write new ones as well; Litmus allows for creating new experiments using an SDK.

Let me schedule a chaos scenario, following the sequence that I showed you in the circle on the previous slide. I go ahead and select the chaos delegate — I've selected the self-agent here — and the hub that I'm going to use as my artifact source for this scenario creation. I'm just going to give it a name, and I can go add a new fault. Let's do something very simple: a pod kill, and I'm interested in subjecting a specific application to the pod kill. I have a microservices application — you can see there is an online boutique here, a simple microservices app. I've selected that namespace, I'm going to select Deployment as the workload kind, and I'm going to select one of the services, the cart service. This is the asset discovery I was speaking about: we are going to target a deployment identified by this label, residing in this particular namespace. I have the option of adding probes; I'll skip that in the interest of time. I can provide some tunables here: how long do I want the chaos to run, at what intervals do I need the pod kills to happen, what is the nature of the kill — that differs from experiment to experiment. There is some advanced configuration you can do to define topology and additional filtering of workloads. With this, let me provide some weightage — this is part of the resilience score calculation I was talking about. Let's say I give it all the points, because I've selected just one fault. I have the option of scheduling it repeatedly; I'll schedule it just once. This is the summary, and I'll just go ahead and finish it.

You can see the experiment execution can be tracked here. I've got just one fault — one experiment — within this scenario, but you can pull in multiple ones, and each experiment is associated with a couple of steps. One is just pulling the template from the hub and installing it on your cluster. The next step is the actual invocation of the fault and the hypothesis validation; the other business logic runs in that step. While we're at this, I'd like to show you the Grafana dashboard that I'm using to monitor the boutique application. It's a very simple dashboard with QPS and some latency being measured. I've annotated this dashboard with a metric that's coming from the Litmus framework, so you can actually see when the fault is live — you'll have some detail here that tells you the fault is active — and then you can see what has changed during that time.
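As an aside, conceptually the pod-kill fault being run in this demo boils down to something like the following sketch using the Kubernetes Python client. This is not how Litmus implements it — the real fault runs as an experiment pod on the target cluster, with its own pre/post checks — and the namespace, label selector, and timings are assumed values for the online boutique demo.

```python
import random
import time

from kubernetes import client, config

def pod_kill(namespace, label_selector, duration_s=60, interval_s=15):
    """Delete one random pod matching the selector every `interval_s` seconds,
    for `duration_s` seconds, to exercise the application's recovery behaviour."""
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    core = client.CoreV1Api()
    deadline = time.time() + duration_s
    while time.time() < deadline:
        pods = core.list_namespaced_pod(namespace, label_selector=label_selector).items
        if pods:
            victim = random.choice(pods)
            core.delete_namespaced_pod(victim.metadata.name, namespace)
            print("killed", victim.metadata.name)
        time.sleep(interval_s)

# Assumed values for the demo: the cart deployment of the online boutique.
pod_kill("boutique", "app=cartservice")
```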
Let's wait for that to happen. Meanwhile, note that you can also do all that I did on the chaos center using the Litmus API. You can use it to automate all these proceedings: you could invoke it from your CI/CD pipelines, or maybe you have your own test execution framework that you'd like to plug into. Once you have these runs executed, as I said, you can compare them. You can actually see some runs here. I'm just doing a random selection, but ideally you would be interested in selecting the workflows that matter to you — probably the ones running on different environments or different builds. I've just selected a few, and it shows some upward trend in resilience; maybe I've got better resilience across builds. That's improving here, and this comparison is made on the basis of the resilience score. You can also download reports on how your experiment execution has gone: what the probe results are, what the verdicts are, what the resilience score is. All that information is available. And you can go ahead and add members to your team, like I said. I've got a couple of members apart from the admin here: I've got Uma, who is an editor, and there's Karthik, who's also on this platform. You can create new users, and once they are available on the platform, you can pull them into the team.

Back to the demonstration — the pod deletion is happening, the pod kill is happening, and you can see what's happened here. As far as the frontend and the cart service go, the QPS has nose-dived and the error count has increased. This reflects an inefficient deployment, or something that's gone wrong with my application. In this case, I just have a single replica of the cart service, which is why the failure is seen, but the same could be caused by an application failure — an application bug in your code — as well. This was a quick demonstration of what I was talking about. I have not added any probes, so you can see that without additional constraints the experiment succeeds, but there are several runs that happened prior — this one, for instance, where you can see the experiment has failed because one of the probes failed. You can see that in the QPS result here. That was a very quick demonstration of what you can do with LitmusChaos.

Let me go back to the presentation and talk about what we got done between the previous KubeCon in Valencia and now. This is a project which has monthly releases — we release on the 15th of every month, and there's a community sync-up call that happens immediately after, on the third Wednesday. Across the last few months we've had about three or four releases, and this is what we managed to do in that time. We've added new fault types. There's an HTTP chaos experiment suite that allows you to simulate erroneous status codes, modify the response body, and so on, to see how your application behaves, and you can inject latency. We've also added support for newer versions of Kubernetes and OpenShift — we've upgraded our operators and so on to be able to do that. Many chaos experiments of the network category — the HTTP chaos experiments, the DNS ones, latency, packet drops, etc. — are now supported in environments which also have a service mesh running. And we've got support for randomization of fault inputs. What we mean by this is: let's say I provide a latency range of 0 to 2000 milliseconds; you could then be injecting faults with a latency anywhere between 0 and 2000, so in different iterations of the experiment a different value of the fault gets injected. You can do the same for intervals, or for any other tunable that is native to a fault — you can randomize the input.
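To make the randomized-input idea concrete, here's a tiny conceptual Python sketch — not Litmus internals; the range and iteration count are arbitrary:

```python
import random

LATENCY_RANGE_MS = (0, 2000)  # the range supplied as the fault tunable

# Each iteration of the experiment picks a different value from the range.
for iteration in range(5):
    latency_ms = random.randint(*LATENCY_RANGE_MS)
    print(f"iteration {iteration}: injecting {latency_ms} ms of latency")
```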
We've also got the ability to do all the operations that I just showed you — most of them, at least — on the chaos center using a CLI tool called litmusctl. You can list workflows, trigger new workflow runs, delete them — basically do the CRUD operations using the litmusctl tool. We've added support for the runtimes containerd and CRI-O. We started out with Docker, but obviously that's mostly been replaced — containerd is the de facto standard now — and you can use CRI-O, so we have first-class support for these runtimes. We've also introduced improvements to our SDK, which helps developers bootstrap their chaos experiments, and we've got some newer categories of experiments: as you can see, we've got chaos that runs faults on Spring Boot applications. We've improved the scope of our probes. You saw the Prometheus probe and the HTTP probe in the spec, but there's also a command probe, which we've made very flexible. Command probes help you run any validation of your choice: it need not be an HTTP request, it need not be a PromQL query — you could be doing anything, including running your own CLI. And we've made our APIs better and more user friendly, based on feedback we got from the community. So with that said, I'll now hand it off to Uma to talk about what's coming next.

I think I got the mic. Thank you, Karthik. That was a great run-through of what we got done in the last six months, but just to go back: we've been adding newer features for almost three years now, with a release coming out every month on the 15th — I think we have done more than 30 releases without missing the 15th. What that really did is make Litmus kind of feature complete for basic chaos experimentation. Litmus 1.0 was more about getting the chaos operator and all the CRDs done, making sure that you do your chaos experimentation in a cloud-native way. LitmusChaos 2.0 was about getting the multi-cluster support in — introducing the chaos center, where multiple clusters and namespaces can connect — and introducing teaming. That enabled enterprises to think about it in a practical way, so more teams can come and practice chaos engineering.
So what's the status today? There are a bunch of users, teams, and enterprises using the open source Litmus. And what do we mean by Litmus 3.0? We are kick-starting the 3.0 effort this week; it will go on for maybe six months, nine months, a year, and we really want to focus on making Litmus more robust. Whatever bug fixes the community reports, and whatever we find through our automated tests, we'll focus on — even if it means re-architecting a few things here and there, we'll focus primarily on making it solid.

We also want to make Litmus leaner. What that means is that today a chaos experiment runs under an Argo workflow. Many people want to run Litmus experiments in pipelines, and if you are running just one chaos experiment, that may be a little more footprint than what you need: the Argo workflow controller can do a lot of things, but for just one experiment it may be more than you want. So we are going to start work on a native orchestration engine where you can run the chaos experiment without the Argo workflow. That is going to make Litmus leaner, so on low-footprint Kubernetes, at the edge and in other places, you can start running Litmus in an easier way.

The next focus for us is how to take chaos engineering to developers. One thing we have observed is that we started originally for SREs running chaos experiments in pre-prod — making it totally declarative through YAML, and making sure it has all the API support, chaos controllers, operators, everything — and then we realized that chaos has found its place in pipelines, with more and more community users using it in their CD pipelines. That's when the chaos workflow engine started getting more and more adoption. Now I think it's time to shift even further left, where developers can think of introducing chaos experiments even before the code gets merged. Think of the scenario: I'm writing a lot of cloud-native code, and before it goes to a build — eventually it goes to a pipeline and then finally gets pushed — is it possible for me to run a chaos experiment on that build, on that code? What happens if you delay it is that it goes to a pipeline, gets deployed, and then a rollback has to be done; there's a cost involved in that, which developers really want to avoid. So it's actually great to see chaos adoption shifting a little to the left, and we are going to focus on how to make things easier for developers. What that really means is that making Litmus lean is one part, and maybe we introduce one more CRD to make a chaos experiment run, et cetera.

This is in the early phase, and we are calling for discussions on what feedback you'd like to give us in terms of these requirements. We have created a discussion thread: if you go to the litmus repo, there's a discussion thread on the focus of the roadmap for the next one year, and we are going to work through this in the monthly maintainer and contributor sessions, a little bit more openly. Whoever is interested can come and contribute not only code but also requirements and ideas — hey, this is what I'm thinking of for my chaos engineering use case — so come and discuss openly, and we'll see how to get that done.
So that's one announcement that we as maintainers want to make: we are opening up the board for the next phase of Litmus, which is 3.0.

Finally, I want to conclude this session by showing how to get in touch with the community. We've been fortunate to get contributions from various different people over the last three to four years — we have 200-plus different contributors from various companies — and we would like that to increase further. We always encourage contributing new ideas and new experiment types; you don't necessarily have to develop them yourself — say, hey, here is my use case, and somebody will pick it up. And if you can develop them, we certainly would like to add you to the core team. The other contribution that we think can be accelerated is the list of experiment types in the chaos hub. The chaos hub has grown from 30-plus experiments last year to about 40 or 45 experiments now. Now that we have a stable Litmus platform, it's time to focus contributions on the experiments. The community has actually created a lot of new experiments, because there's a good SDK, but we have not concentrated on asking: here is how you can upstream it, so that the project can maintain your new experiment. So we're going to focus a little more on asking our community members to upstream their experiments back, so that they can be managed better and more people can use them. There's a contribution guide available with the contribution guidelines, and there are still plenty of areas in the project where we are especially looking for help — easier areas as well, such as docs, or, if you want to do some testing, we have a nice e2e platform that tests Litmus itself for every release, which could see more contributions — and the Litmus agents can also be improved; there are already some contributions there. So there are many areas where the project can take contributions.

To do all this, we're always hanging out in the litmus channel on the Kubernetes Slack, so please do join us. We also run a monthly community meetup — a virtual meetup — where users come and talk about their use cases, ask questions, and maintainers announce any new features that were introduced, and so on. And because of this beta that we are announcing, we also run one more meetup every month for contributors and maintainers — it's a maintainer meet. So if you are looking at developing yourself into a more active cloud-native contributor, it's an opportunity to directly work with the maintainers.

That's all I have. And we obviously want to know if you like the project — we want you to like the project — so give us a star; it always helps us know who's liking the project. That's all I have; maybe it's time to take any questions. Beyond the session, we'll still be around here, and if you want a demo in person, we can do that. Thank you.