Hello everyone. This is Saranya, and welcome to this webinar on getting started with LitmusChaos 3.0: Chaos Engineering Made Smarter. A bit of an intro: Amit and I are both senior software engineers at Harness, and we are also maintainers of LitmusChaos. We have been contributing to Litmus for more than three years, and it has been an incredible journey so far.

In this webinar we are mostly going to talk about LitmusChaos 3.0 and the features it provides, and we'll do a brief comparison of 1.x, 2.x and 3.0. Lastly, Amit is going to give a brief demonstration of all the features. So without much delay, let's get started.

Starting off with the introduction: a couple of months back we introduced LitmusChaos 3.0. The main idea behind it was to make the whole chaos experimentation process leaner, easier and more developer friendly, so that it can be used not only by SREs but also by developers, and chaos engineering can be adopted in all stages of the software development life cycle. With that, let me introduce you to the high-level features that we have introduced in 3.0.

Starting off with the revamped user experience. This is one of the major enhancements and one of the highlights of 3.0: we have moved on from the Material UI component library, and we are now using the Harness UI core library, which provides a very sleek and intuitive user experience. It comes with a plethora of components that are very easy to use and support a plug-and-play architecture. Some of the major components, such as the step wizard and the pipeline studio, have not only made the visualization of chaos experiments easier but have also made the whole user experience much better and simpler.
Next, coming to the other feature, that is environments for chaos infrastructure organization. Managing chaos infrastructures can be difficult, especially when different experiments are running in different clusters or environments simultaneously. To address this, we have introduced the concept of environments, which act as a wrapper around the infrastructures and streamline managing them. An environment can be created as a pre-production or production type, and infrastructures can be created within those environments, so the whole organization becomes much simpler.

Coming to the next feature, that is Chaos Studio. Chaos Studio is also one of the highlights and one of the major features that we have introduced in 3.0. It makes the whole process, from experiment creation to scheduling or executing the chaos experiment, simpler and more intuitive. Starting with the addition of faults, then tuning and configuring each of them, the whole process becomes very easy with the support of the visual experiment builder and the YAML editor that is built on top of Monaco. It enables users to precisely configure their experiments as per their requirements. One change that we have introduced in this process is that, with the resilience-driven approach, we have made the addition of a probe to each fault mandatory, which helps in calculating the resiliency of the application in a much better way. Similar to the older flow, experiments can be scheduled as cron or non-cron experiments and executed accordingly.
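The talk says a mandatory probe on every fault helps calculate the resiliency of the application. A minimal sketch of how such a score could be computed is shown here, assuming a weighted average of per-fault probe success percentages. The `FaultResult` type, the weight values and the function name are hypothetical and are not taken from the Litmus code base.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FaultResult:
    """Outcome of one fault in an experiment run (illustrative type, not a Litmus API)."""
    name: str
    weight: int                      # assumed relative importance of the fault
    probe_success_percentage: float  # share of probe checks that passed, 0 to 100


def resilience_score(results: List[FaultResult]) -> float:
    """Weighted average of probe success percentages across all faults.

    Returns 0.0 when there are no faults or all weights are zero.
    """
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(r.weight * r.probe_success_percentage for r in results)
    return weighted_sum / total_weight


# A single fault whose only probe failed gives a score of zero.
print(resilience_score([FaultResult("pod-delete", 10, 0.0)]))
```

Under this assumption, an experiment with one fault and one failed probe scores zero, which is why a fault without any probe would leave nothing to measure.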
And coming to the next feature, that is resilience probes as first-class citizens. Users might know that in the older versions, when probes needed to be added, their configurations were added in the experiment manifest. That was a bit difficult, because these probes were not reusable and you had to add them every time for each fault. With the introduction of probes as an entity, the whole procedure is simplified: you create them once in one place, and they can be used across multiple experiments and faults. If you edit them in the probe section, the changes are reflected in the experiments that are subsequently created. So it provides a plug-and-play architecture here. Currently LitmusChaos supports four kinds of probes: the HTTP probe, the command probe, the Prometheus probe and the Kubernetes probe. You can also get the execution history, the probe configuration and the faults that have referenced a probe; all these details are available via ChaosCenter, the UI itself.

Next, other important features. Some of these features were introduced as enhancements suggested by the community itself, and we have focused on incorporating as many of them as possible. Starting off with support for high availability of MongoDB. This was one of the requirements from the community. This feature enables deploying MongoDB as a replica set, which ensures high availability of the DB, and it can be installed via Helm using Bitnami MongoDB.
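The reusable probe idea described earlier in this part of the talk (create once, reference from many faults, edits picked up by experiments created afterwards) can be modelled conceptually as below. This is only a sketch of the behaviour as described in the talk. `ProbeRegistry`, its methods, the probe name and the URL are hypothetical and do not mirror the ChaosCenter API.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Tuple


@dataclass
class Probe:
    """A probe defined once and referenced by name (illustrative model only)."""
    name: str
    kind: str                      # e.g. "httpProbe", "cmdProbe", "promProbe", "k8sProbe"
    properties: dict = field(default_factory=dict)


class ProbeRegistry:
    """Hypothetical central store for probes and the faults that reference them."""

    def __init__(self) -> None:
        self._probes: Dict[str, Probe] = {}
        self._references: Dict[str, List[Tuple[str, str, str]]] = {}

    def create(self, probe: Probe) -> None:
        if probe.name in self._probes:
            raise ValueError(f"probe {probe.name!r} already exists")
        self._probes[probe.name] = probe
        self._references[probe.name] = []

    def attach(self, probe_name: str, experiment: str, fault: str, mode: str) -> Probe:
        """Record the reference and return a snapshot of the current configuration."""
        probe = self._probes[probe_name]
        self._references[probe_name].append((experiment, fault, mode))
        return replace(probe, properties=dict(probe.properties))

    def update(self, probe_name: str, **properties) -> None:
        """Edit the probe; only experiments created afterwards see the change."""
        self._probes[probe_name].properties.update(properties)

    def references(self, probe_name: str) -> List[Tuple[str, str, str]]:
        return list(self._references[probe_name])


registry = ProbeRegistry()
registry.create(Probe("boutique-ui-probe", "httpProbe",
                      {"url": "http://frontend.example", "expected_status": 200}))
first = registry.attach("boutique-ui-probe", "cart-service-delete", "pod-delete", "Continuous")
registry.update("boutique-ui-probe", timeout_seconds=5)
second = registry.attach("boutique-ui-probe", "checkout-latency", "network-latency", "Continuous")
print(first.properties, second.properties, registry.references("boutique-ui-probe"))
```

The snapshot on attach is a design assumption that follows the speaker's wording: edits show up in experiments created after the change, while the list of references lets you trace which faults use a given probe.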
Next is the terminology changes. We have changed a few terminologies to reflect their functionality better: chaos agents or delegates have now been renamed to chaos infrastructures, chaos scenarios or workflows are now referred to as chaos experiments, and chaos experiments are now referred to as chaos faults.

Next, coming to the API refactors and improved code architecture. In this release we have given a lot of focus to making the APIs stable and faster. We have done many such refactors and also added a few new APIs for the community to use. In addition to that, backend unit tests have been added in the authentication and GraphQL servers, and the code architecture has been enhanced so that developers and new contributors can easily contribute to the LitmusChaos ecosystem.

Next up is Helm charts for chaos infrastructures. Earlier, setting up the execution plane was a very daunting task, but the introduction of Helm charts for the infrastructures has simplified the whole flow.

Next is the application-level chaos experiments for Spring Boot. With the latest release we have also introduced Spring Boot chaos experiments covering areas such as latency, exceptions, CPU or memory stress, application termination, etc.
Application-level chaos experiments were one of the focuses in this release as well. Next is the support for sidecars in experiment pods. The community wanted to preserve the experiment logs, so one of the requests was to enable support for sidecars in the experiment pods, primarily to forward the logs to custom sinks. This support has also been added. These were some of the high-level features, but in the background we have also added a lot of enhancements, and we are hoping a lot of new features will be coming in the upcoming releases as well. We have a monthly release cadence, where we do a release on the 15th of every month.

Lastly, from my side, this is the feature comparison between 1.x, 2.x and 3.0. As you can see, I have already talked about most of the features. These were added on top of the existing ones: the revamped and simplified UX, environments, the introduction of Chaos Studio to simplify the whole chaos experimentation process, high availability of MongoDB, resilience probes as first-class citizens, improved APIs, and improved debuggability and experiment stability. Along with that, there are other enhancements, such as improving the Litmus SDK and litmusctl. litmusctl is now built on top of promptui, which makes the whole user experience much better. These are all enhancements that 3.0 and litmusctl provide. One thing I also want users to note here is that, with the large number of features coming in this release, backward compatibility is not supported. Users will have to upgrade to 3.0 with a completely fresh setup to access all of these. That's all from my side. Now Amit is going to give a demo
and show all the features that I have talked about. Over to you, Amit.

Thanks, Saranya. Thanks for presenting all the new features that we have with Litmus 3.0, and how we have changed the usage of chaos and the creation of experiments from Litmus 2.0 to Litmus 3.0. A lot of new features are there, so now we'll move forward with the demo. In this demo I'll be showing all the new features that we have with Litmus 3.0. We have a demo setup running, and we'll be creating some chaos experiments and playing with Chaos Studio. So let's get started.

As you can see on my screen, this is the new UI of ChaosCenter 3.0, and with Litmus 3.0 we have a lot of new features available. As mentioned by Saranya, we have the new environments feature, which helps us separate chaos across multiple types of environment, like pre-production or production. We also have resilience probes. Earlier, with Litmus 2.0, we had probes built into the experiment manifest, but this is a completely new feature in which you can configure your probes and use these pluggable checks in your experiments. We previously had the ChaosHub, and the major refactoring work that went in with Litmus 3.0 was the flow of experiment creation. We have introduced Chaos Studio, which allows users to create experiments with a very smooth flow.

Before creating some experiments, I wanted to show the demo setup that we are running. In this demo we'll be using the Online Boutique app, which is basically an e-commerce app. As you can see, just like a normal e-commerce app, it has a lot of functionality: this is the product catalog service, and with that we have a currency service and the checkout and cart services. Once we add a product to our cart, we can simply do a
checkout or we can continue shopping, and we have a payment service as well. So just like a normal e-commerce app, we have full-fledged functionality in this demo application, which is Online Boutique. To monitor the resource consumption and the networking of this app, we have set up a Grafana dashboard where we are checking the queries and the access duration of the different services that are running as part of this boutique app. We can see that a lot of services, like the ad service, cart service, checkout service, currency service, etc., are running in this application, and to track all the metrics we have this Grafana dashboard set up here.

Let's start by creating an environment. As I mentioned before the start of this demo, an environment is basically a place where you can differentiate between clusters, whether it is a pre-production cluster or a production cluster, and decide where you want to do the chaos. I'll give it the name test demo, keep it as pre-production, and create a new chaos infra. The chaos infra is basically the execution plane, which allows us to perform all these experiments and faults and target different applications. To start, I'll have to enable chaos on one of our namespaces, so let me give it the name demo. I'll provide cluster-wide access so that we are able to target a lot of applications; if you want to keep it restricted to a single namespace, you can select the namespace scope instead. For the demo I'll use cluster-wide access, and I'll install it in the litmus namespace with the litmus service account. I'll click next and then download. This downloads the infra manifest, and once we apply this manifest, our infra will be active and ready to target applications from different namespaces. I'll just copy this command for
apply and run it. With that done, we can see that the chaos infra is enabled and in the pending state. Let me go ahead and connect this infra. It is installing the execution plane components: different CRDs, components like the subscriber, and some RBACs as well. I think this should be done in a couple of minutes or seconds. Applying the manifest is done, so let me check if all the components are up and running. Yes, we can see that the subscriber is up and running. The subscriber is basically the execution plane component that allows us to perform the chaos experiments and the chaos faults on different applications. We can see that the chaos infra is enabled and connected, so we can get started with creating the experiments and target this Online Boutique application.

Let's go back to the chaos experiments page. This is probably the best part of Litmus 3.0: the flow with which you can create experiments is very seamless now. I'll start with the experiment name, cart service delete, select the demo webinar chaos infrastructure that I've just created, and click next. With Chaos Studio we get three different options: we can create an experiment from scratch with a blank canvas, we can use a template from the ChaosHub that is already provided as part of the installation, or we can upload a YAML, which is completely user specific, since users can provide custom specs and different envs or values in the faults. For the demo I'll create an experiment from scratch, using the blank canvas option. Here in Chaos Studio we have a wide variety of functionality. First of all is the addition of faults from the hub: we can see that the ChaosHub consists of a large variety of faults, and
for this demo we'll be using the pod delete fault. Once I add the pod delete fault, I can easily tune its environment variables. We also have a YAML editor: if you want to make additional changes in the YAML, you can come here, edit it, and tune the experiment manifest as well. Let's go with the basic flow. I'll click this add button and select pod delete, and once it is selected I'm shown this modal where we can select the application we want to target. I'll be targeting the deployment in the boutique namespace for the cart service. I've added this, and to tune the fault, let me increase the chaos duration to 60 and the chaos interval to 20 seconds. These are some additional, optional environment variables that you can tune according to your needs.

Now, coming to the probes. Probes are additional checks that we can add to our fault execution. For this we have a boutique UI probe, which checks whether the Online Boutique application is available when the cart service is down. This is basically an HTTP probe; we have four different kinds of probes, HTTP, cmd, k8s and Prometheus, and each has its own functionality depending on the use case. For now I'll add this probe, and I can decide when I want to run it. We have several options, like the start of the fault execution, the end of the fault execution, or edge; with edge it runs before the experiment execution and also at the end of it. We also have two other modes, which are continuous and on chaos. For now I'll use the continuous mode and apply the changes, and to finalize the changes to this fault I'll click apply changes. So we have successfully added a fault to our experiment which
will actually delete the cart service pod. Once the experiment starts, we should be able to see some increase in latency or some changes in the queries per second; it should be visible in the Grafana dashboards and also in the Online Boutique app. Let me save this experiment. Once the experiment is saved, you can either run it from Chaos Studio or come back to the chaos experiments page, where we have a run option. I'll click run here, and it should start the chaos experiment execution. To validate that, I can go back to my terminal and check if the experiment has started. Yes, we can see that the experiment to delete the cart service pod has started. I can open a new terminal to check the applications running in the boutique namespace, and we have the different services running there. Let's wait for the pod delete fault to start, and we should see some action in the dashboards as well.

Since the experiment has started, we can see that the Online Boutique application's availability has gone down. If I go back to the page, I can see that this screen has popped up saying that something has happened. One of the services is down, the application is not resilient, and the whole application is down. In the Grafana dashboard we can also see that the access duration of the frontend service has gone down, and similarly the cart service is also seeing a dip here. The queries are increasing because the bad queries are increasing in number for all the different services in this boutique namespace. Once the experiment is completed, we should see everything come back to its original state. I think the experiment has completed: we can see that the application availability is coming back, and that the access duration is coming back to
normal. If I go back and refresh, the application is back online. So for this time period during the chaos execution, the whole boutique application was down, which shows that this is not a resilient application. If we had a few more replicas of the cart service, maybe the frontend UI wouldn't have crashed, and it would have been resilient. Coming back to the screen, I can see that my experiment execution has completed, but it has completed with a failure of some sort, because the boutique UI was not available at all. The probe was failing: it was checking whether the boutique application is available, that is, whether we get a 200 response code, but we actually got a 500 response code, which is a server error. This was the pluggable check that we added to the pod delete fault, and you can see the properties that we configured in the resilience probes section. So the experiment execution has completed, but the result that we were expecting was not there, because the probe has failed and the application is not resilient. To make the application resilient we can follow some countermeasures: you can increase the replica count of the cart service and other services, so that if one of the pods gets impacted, the whole application will not go down.

To see the different stats, we have this new flow in the UI where we can see the full experiment list. We can also create a new experiment here. For example, let's create one named checkout service latency, select an infra, start from scratch again, and add a network latency fault. To target the fault, I can select the namespace and the checkout service, select this and click apply, and I'll add the probe. I
can use the same probe here as well; since these probes are a separate entity now, they can be used in any of our experiments. I'll click apply and save this. With the new flow we can see all our experiments listed here, and with each experiment we can also see its 10 most recent runs. I've executed this one, so it shows that the experiment has executed completely, but the score is zero since the probe has failed. So this is the new UI. Apart from that, as Saranya mentioned, we have the environments functionality, which differentiates between a pre-prod environment and a prod environment, and the resilience probes. This is a completely separate entity now: we can create new probes, like HTTP, command, Prometheus and Kubernetes probes, and then use them in our experiments. Since we have used this probe in one of our experiments, we can backtrack it and see that this particular fault has used this probe, with the continuous mode that I provided, and the probe configuration is also visible here. We also have the ChaosHub. ChaosHub has been an essential part since Litmus 2.0; it is basically a marketplace of all the different faults, and you can connect your own ChaosHub with your own customized faults. As an overview, this is what we have added as part of Litmus 3.0. I hope you will give it a shot, and please let us know how your experience with Litmus 3.0 is. Over to you, Saranya.
Thank you, Amit. The demo was really awesome. As I mentioned earlier, we have a monthly release cadence: on the 15th of every month we make a release and roll out new features and enhancements. With the valuable feedback from the community after 3.0, here are some of the features and enhancements that we added in 3.1.

One of those is stopping chaos experiments. In 3.1 we added a feature that enables users to stop ongoing experiments. It is useful in the case of long-running experiments, or ones that have been stuck, pending or running for a long time. Users can stop them, and this gives them more flexibility and control. Another one is enabling users to toggle experiments from non-cron to cron and vice versa. This was also suggested by the community, and it has been added along with the stop experiment feature.

The other focus was on the enhancement and stability of ChaosCenter. In addition to these features, the LFX mentees and other community members have added unit tests in both the backend and the frontend. Frontend coverage is now on track, and we will soon be adding more unit tests to make the code base stable. Coverage checks have also been added as part of the PR review process, so that when any new contributor contributes, the check in the PR pipeline ensures the stability and quality of the product. Other than
that, we have regular bug fixes and enhancements going in along with the other features. That was all, and in the upcoming releases we are planning to add more and more features.

Lastly, I just wanted to let the community know that anyone interested in joining the community, getting started with chaos experimentation or contributing to LitmusChaos can join the litmus channel in the Kubernetes Slack. Secondly, there is the Litmus 3.0 documentation. With the latest release we have also updated the documentation, and the 3.0 documentation can be found here. Here you can get all the details that you need to know about 3.0. We have the concepts section, in which we briefly describe all the new features and enhancements that have been added, and there is the user guide section, which will help the community get started, from the installation to setting up the execution plane, then how to schedule your first chaos experiment, inject faults and create resilience probes. All of these are covered in the documentation. We also have the troubleshooting section; if you face any problems, you can check it out, or you can post a message in the litmus channel on Slack so that we can help you get started. That's all. I hope you liked the session. Thank you, everybody.