Hi, everyone. Good evening. Welcome to KubeCon again. I saw some of you in the morning session, so some of this is going to be repetitive in nature. Just to introduce ourselves: a few of the maintainers of the LitmusChaos project are here. I'm Uma Mukkara, and we have Karthik and Udit. All of us have been with the project from the beginning, for about the last five years, so we're happy to answer any questions that you may have. We also have an end user of the project, Saiyam Pathak, Director at Civo, who is also a CNCF ambassador. We have about 30 minutes, and then we will take a quick Q&A in the last five minutes. For those of you who are new to Litmus, I'll rush through the first five minutes on what Litmus is and why Litmus. Then Karthik is going to talk about how the journey has been for us, what you can expect from the project for the rest of the year, and how you can contribute, because we're obviously interested in working with the community and taking more contributions. And then Saiyam will do a quick demo.

So, outages are expensive, and service reliability is critical to business. You don't want to get into the irreparable damage that outages cause. Even though your systems are designed for reliability and you build redundancy into them, failures are inevitable. It's only a matter of how often they happen, not whether they happen. So how do you really avoid service downtime? That's the bigger question we need to ask as we move towards Kubernetes-based microservices deployments.
As part of this, we have all seen certain metrics being talked about more and more in business discussions: how fast you can recover from outages (mean time to recovery), and how you can avoid outages or increase the distance between failures (mean time between failures). The answer is to adopt chaos engineering from the beginning, and as much as possible, in all phases of your cloud native DevOps. Chaos engineering is not a magic bullet; it's a step-by-step method to build a culture into your DevOps practices. You always start with a few small chaos tests, then incrementally build more over a period of time, automate them, and so on.

Why is this more important in the cloud native world? Because there is more dynamism. There are more containers to deal with, and your view of your own code is much, much smaller; so many things are changing outside your code. So you want to be able to test how your code behaves when something else fails, and that's very important. The code around you also keeps changing faster and faster; all of us are trying to ship faster. Together, more microservices and faster shipping create a situation with so much dynamism that you don't know how to ensure reliability. Chaos engineering definitely helps there.

Think of chaos engineering as a DevOps culture: it is no one person's responsibility, and anybody can start, whether you shift left or shift right. The important thing is that someone has to start doing chaos experiments in the DevOps lifecycle and then start training others. That's what matters, and Litmus helps you do that.
Litmus is a project that provides a full platform to manage and orchestrate your chaos experiments across your team members in the entire DevOps lifecycle. It has a centralized portal, or control plane, where developers, SREs, QA members, and even your management can come and see what's going on with respect to reliability in your DevOps, and you can actually take some control of building resilience into your system. It is a Kubernetes app, but it works for non-Kubernetes targets as well, and it can execute chaos experiments across various cloud platforms.

How do you get started with Litmus? It's pretty simple: it's a Helm chart. You can deploy it on any Kubernetes cluster, it takes minimal resources being a Kubernetes app, and you can extend it as you grow with chaos. We also have a hosted control plane, litmuschaos.cloud, to just get started and keep going. So you basically start with a simple installation, either via Helm or by signing up, then you connect your actual targets to your control plane through Litmus agents and start creating your chaos tests. Litmus already gives you a set of about 50-plus experiments, along with a good SDK and API, and you can be very creative once you get the hang of it. Then you can start running chaos experiments and running analytics to see how reliability is improving across your CI or CD pipelines. We have a bunch of experiments; you can go to hub.litmuschaos.io and start looking at them, and this is one area where we've been seeking more contributions.
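To make "creating your chaos tests" concrete, here is a minimal sketch of a ChaosEngine manifest for the pod-delete experiment from the hub. The target application labels, namespace, and service-account name are assumptions for illustration:

```yaml
# Minimal ChaosEngine sketch: runs the pod-delete fault against pods
# matching the given label selector. Names and labels are illustrative.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: demo-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: "app=nginx"              # assumed target application label
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # assumed pre-created RBAC
  engineState: active
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"              # run chaos for 30 seconds
            - name: CHAOS_INTERVAL
              value: "10"              # delete a pod every 10 seconds
```

Applying a manifest like this (with kubectl, or via the portal, which generates it for you) is what kicks off a fault run and produces a ChaosResult you can inspect.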
As you learn chaos and experiment with it, you get better ideas, and you can upstream them for the community to use. Chaos experiments are stitched into chaos workflows, and a chaos workflow is essentially your chaos test. Once you build a chaos test, you can schedule it, or you can put it into your GitOps flow so that chaos is triggered automatically as you roll out deployments, or inject it into your CI/CD pipeline. Or you can just tell your developers, "Hey, I have some chaos tests, you can run them yourself," and do it the cloud native way, the kubectl way. It's a very stable and reliable project; we've been seeing people use it in enterprises for the last few years, and it has a full list of features that you can go and learn more about. Next, Karthik, another of the maintainers, is going to talk about what has evolved in the last few years and what you can expect in the year ahead. With that, I will invite Karthik to take us through the next few minutes.

Thanks, Uma. As a maintainer, I thought you might all be interested in learning how the Litmus journey started and the decisions we made along the way in building a cloud native chaos engineering platform, so I'll take you through that. We built Litmus basically in response to a need to test the resilience of a SaaS platform that we were building internally, which was Kubernetes-based. So we wanted a chaos engineering platform that helped us do chaos on a Kubernetes-based platform, and do it in a Kubernetes-native way. There's a lot of "Kubernetes" being thrown around here; what do we mean by that?
People are used to running operations on Kubernetes clusters in a certain way: everything is declarative, everything is a resource, and every resource is reconciled. Most of your application lifecycle management, security policies, and so much else is done that way, and you would also need your resilience tests to be done the same way, to sort of homogenize how you do chaos experimentation and resiliency testing. The platforms and chaos tools that were available at that point in time were not meeting that purpose. That's when we went ahead and wrote Litmus. The idea was to keep it open source and community-collaborated, because that's how people really use it, and it brings in a lot of rich libraries you can use to create faults of different types, because everyone has their own idea of what a fault should be. Then we made it a plugin-based chaos platform, meaning you can bring your own experiments; that's what you see on the right side. You take your own fault business logic, containerize it, run it as a Kubernetes job, instrument it a little bit, and then it gets orchestrated as part of the platform. That makes it easy and standardized for people to come and write their own faults.

So the initial deliverable was creating chaos experiments as custom resources, with an operator to carry out the business logic defined within those custom resources. We put all those experiments into a hub, and with BYOC — bring your own chaos — you just keep expanding the hub. That's what we had initially. Then, over a period of time, we got feedback on how we should improve. People needed a way to centrally manage chaos across their cluster fleet: not to go and individually track chaos experiments in each cluster, but to have it all visible from one platform. And they wanted to do complex scenarios.
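The "chaos experiments as custom resources" deliverable mentioned above looks roughly like this: a ChaosExperiment resource that points the operator at the container image holding the fault's business logic. This is a simplified sketch; the image tag, command, and defaults are illustrative, and the real hub manifests carry additional fields such as permissions:

```yaml
# Sketch of a ChaosExperiment custom resource: the operator reads this
# definition and launches the listed image as a Kubernetes job.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  labels:
    name: pod-delete
spec:
  definition:
    image: "litmuschaos/go-runner:latest"   # container with the fault logic
    command:
      - /bin/bash
    args:
      - -c
      - ./experiments -name pod-delete
    env:
      - name: TOTAL_CHAOS_DURATION          # default tunables, overridable
        value: "15"                         #   from the ChaosEngine
      - name: CHAOS_INTERVAL
        value: "5"
```

A ChaosEngine then references this experiment by name, which is how the bring-your-own-chaos model stays declarative: publishing a new fault is just publishing a new CR plus its container.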
Okay, one fault is great, but most of the time you have complex scenarios in your environment; things get built up over a period of time in production. So you're really interested in doing chaos workflows, not just single faults. Workflows are a sequence of faults strung together in whatever order you need. Then you have the ability to collaborate with your team members on the workflows you created: store them in a Git repository and have a single source of truth. We also added observability features. Here we are talking about the ability to set certain constraints around what should happen when your fault is injected, so you can validate that this is how your application behaved during the fault injection. We added those features, and then we also added some features around automation: when do you want your chaos to be injected, and at what point? Maybe after an application got upgraded as part of some deployment by a pipeline, or by a GitOps controller like Argo CD or Flux.

So that's the observability, the automation, the teaming and user-collaboration features, and then the workflows; these were the major additions we made, all of them from one centralized control plane into which you can connect the different clusters that are the target environments for chaos. That's the evolution, and we've sort of ended up with this: there's a chaos control plane, which is where users come in and say what they want to do as part of their experiments, create their workflows, manage them, and collaborate with their team members, and on the right-hand side you have the execution environment where the chaos actually happens. We've also got metrics that you can use to instrument your own observability dashboards, to show how the application is behaving while a particular experiment is running. At this point I'll just summarize what I said: the original principles you saw on the first slide are the ones on the right-hand side of this screen, and
on the left-hand side, we added certain observability capabilities with probes, exporters, and things like that, plus the ability to run workflows, schedule them, integrate them with your GitOps flow, etc. So this is, broadly, the framework for cloud native chaos engineering that Litmus has defined at this point, and it has found some resonance within the community. I'll let Uma talk through some of the use cases and the ways it's being used most commonly, and then we'll come back and speak about the roadmap after that.

In the interest of time, I'll just rush through; we talked about this slide in the morning as well. The main point to take from here is that chaos is also for QA and developers. The way Litmus is architected, you get a declarative file, a YAML, that's well tested by somebody in your DevOps team, and then you can go and use it elsewhere, wherever you see fit. So think of chaos testing spreading across DevOps, with incremental feedback being built in as you shift left or shift right; that's probably one of the main things I wanted to share. You can throw chaos into game days, CI pipelines, or CD pipelines, and one of the most common things we have seen is Litmus being used along with GitOps: people trigger chaos as a task of a deployment happening as part of their CI/CD pipeline. You can easily automate it using Argo CD or Spinnaker or Flux; it's been tested by the community and proven to work with most of the CD platforms as well.

Another thing we have seen Litmus being used for is to test observability itself. You have invested in your observability systems, but you don't know whether, when something goes wrong, your observability system works well or not.
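The validation constraints Karthik mentioned are expressed as probes attached to an experiment. As a sketch, here is an httpProbe that continuously checks a service endpoint while the fault runs; the URL, port, and run properties are assumptions, and the exact schema (including the units for timeouts and intervals) varies between Litmus versions, so check the probe documentation for yours:

```yaml
# Sketch: an httpProbe validates the steady-state hypothesis during
# fault injection. Goes under spec.experiments[].spec in a ChaosEngine.
probe:
  - name: check-frontend-availability
    type: httpProbe
    mode: Continuous            # evaluate throughout the chaos duration
    httpProbe/inputs:
      url: http://frontend.default.svc:8080/healthz   # assumed endpoint
      method:
        get:
          criteria: ==          # pass while the response code equals 200
          responseCode: "200"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
```

If the probe fails, the experiment verdict reflects it, which is how "this is how your application behaved during fault injection" becomes a pass/fail result rather than a manual check.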
So you can introduce chaos and see whether you are getting the right alerts, and whether your people are responding to such chaos according to the practices you've set up. We're also seeing people test Kubernetes itself. Kubernetes keeps changing all the time, so when you upgrade your Kubernetes stack, you can verify your entire application against possible failures in Kubernetes itself, for example when a kubelet service goes down: have you set up proper service redundancy or not? We are also seeing Kubernetes being used as a cross-cloud control plane, which brings a lot of new infrastructure into consideration; when such infrastructure fails, how does your service behave? So how can you inject failures into some of that cross-cloud control plane? Then, Kubernetes deployments are often hybrid in nature, with some stack underneath that is not Kubernetes. What happens to your service when failures happen in that stack? We're also seeing people take chaos all the way to well-defined managed services such as Amazon RDS and other data services: can you simulate chaos against such services and see whether your own services stay reliable? These are some of the advanced use cases people are pursuing, but you always start with simple use cases such as pod deletes, and then go further from there. With that, let me hand back to Karthik to talk about the roadmap and how to contribute to Litmus.

Yeah, some of the roadmap that you're going to see now is driven by all these new use cases that Uma spoke about. Obviously we are looking at a hybrid set of targets; the Kubernetes targets, of course, were the ones that were initially added.
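For the kubelet-upgrade use case Uma just described, the hub carries a node-level kubelet-service-kill fault. A sketch of an engine entry for it follows; the node name, durations, and service-account name are assumptions, and the environment variable names follow the general pattern of the hub's node-level faults, so verify them against the fault's documentation:

```yaml
# Sketch: node-level fault that stops the kubelet on one target node,
# to verify pods reschedule and the service survives. Values illustrative.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: kubelet-resilience-check
spec:
  chaosServiceAccount: kubelet-service-kill-sa   # assumed RBAC
  engineState: active
  experiments:
    - name: kubelet-service-kill
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"                        # kubelet kept down for 60s
            - name: TARGET_NODE
              value: "worker-node-1"             # assumed node name
```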
We are looking at more experiments that can also target infrastructure resources, and at better UX for chaos workflow management; by this we mean: how can you construct a workflow in a very seamless way, execute it, and visualize it? Also improved operations support around chaos engineering. Whenever you execute a chaos scenario, there are a few things around it. Where do you want your results to go — do you want an artifact sink configured? Maybe you're running a load test along with your faults as part of your chaos experiment, and you want to see how the results of that benchmark came out, so you want to push them somewhere. How do you ensure that the users on the platform have the right set of permissions and are restricted in what kinds of actions or faults they can perform on your system? That's the RBAC. And probes, the ability to do advanced hypothesis validation: there are a few probe types today, the schema of which is getting continuously refined, and new probe types are being added. And of course, integrations with other CNCF projects in the areas of observability and CI/CD, to achieve the overall chaos agenda.

On contributing to Litmus, I'll let Udit pitch in; we have various areas that need contribution, so I'll let him talk through that for a few minutes.

Thanks, Karthik. You can contribute in different areas in Litmus. You can contribute a whole experiment itself, and there are other components, which we'll discuss shortly. So what do you need before contributing an experiment to Litmus? You need to have an idea: what is your chaos injection logic, what pre- and post-chaos checks are you going to perform, and how will you regain the steady state
after the chaos is completed. Then the second step is to fill in the attributes file: there is an attributes file you need to fill in to prepare the experiment code. When you fill in the attributes file, you generate an experiment scaffolding that has all the functions required to inject the chaos and to perform the pre- and post-chaos checks. Once this scaffolding is done, you add your business logic. Suppose you are running a pod-delete chaos; then your business logic is deleting a pod, and you need to integrate that logic with the generated functions. After that, the next step is testing in a job: once you have the business logic created, you can dockerize it, run it as a job, and validate whether your hypothesis holds — whether the pod actually gets deleted and recovered or not. Once everything is done, you provide some metadata for the experiment that describes what it does in more detail, and then you can simply commit it to the ChaosHub, and it will be publicly available on the Litmus ChaosHub.

Now, the other areas where you can contribute: besides the new experiments we just discussed, you can improve existing faults. Litmus today has more than 50 chaos experiments, all with different features and different tunables, and you can always go and add features to them; you will find all of these in the issues, and good first issues are already created there. There are UI/UX bugs you can fix, or features you feel are missing that you can go ahead and add. There are a lot of e2e test cases you can write; as of today we have pipelines associated with each of the experiments, and there are unit test cases you can add for the code. There are the Helm charts,
and the charts are there for Litmus installation, so you can add test cases around them. You can add documentation: if you see something not working, you can go ahead and add documentation fixes for it. You can add your own use cases, observability integrations, and CI/CD integrations. All of these, beyond the experiments themselves, are areas you can contribute to.

To contact our developers and discuss different topics, you can log into the litmus-dev Slack channel; it's in the Kubernetes Slack workspace. You can also join us on the Litmus GitHub issues and see all the issues we want to address. You can share your feedback, which is very useful, and raise issues for discussion; for all these queries, we are also available on Slack. As I mentioned, we have community sync calls where you can come and share your ideas, and we hold office hours as well. You can join these meetings if you have demos, if you want to contribute, or if you want to talk more about the project; please join and share your experience there. With this, I would like to give the mic to Saiyam to present a use case.

Yeah, so basically now we know what Litmus does, what some of its benefits are, and how you can use it. But let's see how we at Civo actually use Litmus, and what our plans are to expand that even further. Civo is a managed Kubernetes service provider based on K3s; we provide managed Kubernetes which is based on K3s. How do we do that? We have our super cluster, and the super cluster has Kubernetes, K3s, as its base.
On top of that we have the KubeVirt layer, and on top of the KubeVirt layer we have the tenant clusters — the clusters you get when you log in to Civo and create a cluster. The tenant clusters have complete control, and they can run untrusted and unsecured workloads; we don't have control over that. But at the super-cluster level, it is important for us to check the resiliency and the performance of the microservices we run internally, and for that we have Litmus installed.

So, "why chaos engineering for Civo" is the first question I'm answering. It is important for us to test the resiliency of our infrastructure and of the microservices running on our super cluster, and for that we have Litmus. Now, how do we do it, and how do we do it continuously? Because it's very important that you keep running your chaos. We have chaos implemented in our staging environment, and we have created workflows that we keep running in a continuous manner. Those workflows contain different types of chaos experiments: node-level experiments, pod-level experiments, and what the impact will be if a core service goes down. We have to target those, because we are a cloud platform, and if anything goes wrong, we need to think from the perspective that bad actors could be attacking us; it is better for us to find the issues beforehand and fix them than to wait for something to happen and then investigate. So we can create multiple chaos scenarios — like Karthik, Uma, and Udit mentioned, you can bring your own chaos experiments — so that we can think from those perspectives: what is the maximum damage we could do to our super cluster, and how do we protect it from anything that could happen in production? The next point that I
want to cover is minimizing the blast radius, which is very important when you are doing chaos engineering, because as a cloud platform we cannot just kill anybody's pod or anybody's cluster; it has to stay resilient. So for now we run chaos continuously in our staging environment, which is a complete replica of our production environment. What we want to achieve next is to make this fully continuous: Litmus already has the CI/CD integrations and so on, so when a new feature is developed, it will run all the existing workflows, and only if all the workflows complete successfully and pass all the checks does that feature get promoted to production. I think that continuous manner, moving from staging to production using the chaos experiments, would complete the story. This slide shows the production super cluster, the tenant clusters, and where Litmus can be installed. So how does this happen? On staging we run network chaos, pod chaos, latency chaos, and core-services chaos. We also give something to the customer.
Like I told you, customers have full control over their clusters, but we have something called the marketplace, and in the marketplace we have Litmus from the maintainers. It is a maintainer-maintained application that a user can install on their cluster to perform similar experiments. So we use Litmus on our super cluster, and we also give our users Litmus to install on the Kubernetes clusters running on our platform, so that they can perform chaos experiments on their own clusters, follow the same pattern in a continuous manner, and make sure their services and workloads are running fine and staying resilient. Those are the areas I wanted to cover: why chaos engineering is important, how to use it, how to minimize the blast radius, and how we give it to our customers. That's what we do at Civo. With that, I'll hand back to Uma.

Yeah, I think we're good; that's all we wanted to share in this brief time of 30 minutes. We are definitely expecting more feedback to come in, as chaos is just becoming commonplace in DevOps. So feel free to take a look at Litmus, start slow, and move fast once you are convinced that chaos is actually going to help you. We are always available to take more feedback from the community, so if there are any questions we can answer here, we'll be happy to. As we say, we give you a license to chaos: you can use Litmus, play around, and see it for yourself. Don't just believe what somebody else says; break something and see whether you get a call from somebody or not. Well, thank you. We are available on the litmus-dev channel for contributors, and we are looking forward to seeing some of you there. Thank you.