Hey everyone, welcome to this webinar on the open source Litmus project updates and the new set of features we have been working on and added to the open source product. My name is Shayan; I'm a senior software engineer at Harness, and I've been closely working with the Litmus open source project for the past two years, mainly on Litmus 2.0 and the features associated with it. Before we move on, I just wanted to quickly add that we're really excited to do this, and thank you for having us. The agenda for today covers the new features we have added in Litmus 2.0, and not just the core features, since I'm pretty sure you have already heard about the product and looked at what Litmus 2.0 ships with, but rather the developer-centric features we have added to the open source platform. We are also going to look at two different use cases to help understand how Litmus is actually being used in the world right now and how it has been adopted by different clients. We'll finish with a small demo where we'll see how the new, enhanced features of Litmus can be put to use in a commercial-style application. Before we jump into the new developer-centric features, a quick refresher on what Litmus is for all the new users out there: Litmus is a cross-cloud, cloud native chaos engineering toolset, or framework if you like, that helps not only SREs but also developers and other personas try out chaos, with seamless integrations and automation to ease your chaos engineering journey. You can choose different experiments, create scenarios out of them, and then run your workflows and do chaos in a much simpler way with the help of Litmus. 
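To give a flavor of what "declarative chaos" looks like in practice, here is a minimal ChaosEngine manifest; this is a hand-written sketch, and the namespace, label, and service account names are illustrative assumptions, not taken from the talk:

```yaml
# Sketch: a minimal Litmus ChaosEngine that runs the pod-delete
# experiment against pods labeled app=nginx. The namespace, label,
# and service account below are illustrative assumptions.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: active            # set to "stop" to halt chaos
  appinfo:
    appns: default               # namespace of the application under test
    applabel: app=nginx          # label selector for the target pods
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"        # seconds of chaos injection
```

Applying a resource like this with kubectl is the low-level equivalent of scheduling the same experiment from the Chaos Center UI.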
Now, assuming some of the community users present here have already used Litmus, let's take a look at the developer-centric features we have added to help both the developers on the core team and the community contributors who have been working closely with us. First, we already had the ability to add probes, but we have improved that capability: once you've added a probe, you can now go ahead and edit the same probe. Previously you had to delete the whole probe and create a new one, which was quite a pain when you are constructing new probes, thinking through a hypothesis, and doing steady-state validation with probes. Now you get an edit option, and you can even change the probe type in the same edit flow, so you can completely swap out a probe with this new probe editing feature. We have also improved the tune-workflow section when you customize a scenario: you can now edit the sequence, so if you have two or more experiments in the same scenario, you can go to edit sequence and arrange them in serial, in parallel, or a combination of both, based on your requirements. You can also update the same steps in the YAML view, and that change will be visually reflected in the edit sequence as well. 
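Under the hood a Litmus scenario is an Argo Workflow, so the serial and parallel arrangements you make in the edit-sequence view correspond to Argo's nested steps structure. A hand-written sketch, not UI-generated output, with illustrative template names:

```yaml
# Sketch: in Argo Workflows, "steps" is a list of step groups.
# Groups run serially, one after another; steps inside the same
# group run in parallel. Template names here are assumptions.
templates:
  - name: custom-chaos
    steps:
      - - name: pod-delete            # group 1: runs first
          template: pod-delete
      - - name: pod-cpu-hog           # group 2: these two experiments
          template: pod-cpu-hog       #   run in parallel, after group 1
        - name: pod-memory-hog
          template: pod-memory-hog
```

Dragging an experiment between serial and parallel in the UI amounts to moving its entry between step groups in this YAML.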
There's also the option to open the configuration wizard, which is the pencil icon you see next to the different experiments in the table. There you can tune your environment variables, giving values that will be reflected in the ChaosEngine, and at the bottom of the table you also get advanced options: you can select the pod GC strategy if you want to clean up the various pods post-chaos, and you can add tolerations for tainted nodes to your particular chaos experiments if your setup needs that kind of advanced configuration. Third, we have the ability to upgrade a chaos delegate. Let's say you are running your Litmus workflows on the latest deployed version of Litmus, but you're a bit behind on the delegate side and the community has pushed a new version of the chaos delegate. With the new versions of Litmus, you'll notice an upgrade option on the chaos delegate page: if you're on the latest deployed version of Litmus and an upgrade is available for your chaos delegate, Litmus will suggest that you go ahead and update your chaos delegate to the latest version. If you're already on the latest version, the button is disabled, but if you are lagging behind on the update, Litmus will show you the option to upgrade your chaos delegate to the latest version. 
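The values set in the configuration wizard land in the ChaosEngine section of the scenario manifest. A sketch of what a tuned experiment might look like in the YAML view; the env names follow Litmus conventions, but the specific values and the toleration are illustrative assumptions, and the exact field placement can vary by Litmus version:

```yaml
# Sketch: experiment tuning as it appears under the ChaosEngine spec.
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION   # seconds of chaos injection
            value: "60"
          - name: PODS_AFFECTED_PERC
            value: "50"
        tolerations:                     # advanced option: schedule the
          - key: dedicated               #   experiment pods onto tainted nodes
            operator: Equal
            value: chaos
            effect: NoSchedule
# The pod GC strategy from the advanced options maps to the Argo
# workflow spec, e.g. podGC: { strategy: OnWorkflowCompletion }.
```
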
Next, we have added more secure RBAC, with updates to both the API and the UI. This was mostly a bug reported from the community side which we have now addressed. There are two RBAC roles involved, editor and viewer, and as a viewer you shouldn't be able to access certain pages or APIs or perform certain operations. There were a few leaky cases where a viewer could create a scenario or get into the editing section of a particular workflow, which a viewer should not be able to do. Those cases have been addressed, and because the RBAC checks are now enforced in the API as well, even if a viewer reaches a specific screen through direct API calls, the operation is denied. The APIs are hardened along with the UI, so a viewer can no longer view screens that are only accessible to the admin and the editor. There was also a request to support running or scheduling plain Argo workflows rather than just chaos workflows. Previously we supported different kinds of chaos workflows, which are themselves Argo workflows, but we couldn't run a stripped-down, simple, basic Argo workflow. We have now modified the back end to add support for scheduling a basic Argo workflow as well, so if you take a workflow directly from Argo and run it on our platform, that should work now. Similarly, once Kubernetes updated to 1.22.0 and above, there were a few manifests and deployments that community users were unable to execute because we didn't have support for 1.22.0 and above. We have addressed that too, and we now support Kubernetes versions 
1.22.0 and above. We have also added support for IPv6. To ensure better end-to-end coverage of all the different builds, we continuously exercise these deployments via nightly builds and regular CI checks, and we have added more e2e test suites to cover all aspects of this development. When we finish our work at the end of the day, there are always multiple nightly builds running, and they report back so that you know the status of your deployments. We have also added the ability to skip SSL verification: in cases like applying a manifest or connecting an agent, if your use case demands it, you can now provide a flag and you'll have the ability to skip SSL verification. Also, for both community developers and the developers on the core team, we have improved the logging of the different servers we use. Whenever you encounter a bug or are looking for a specific log, say for an agent connection or a subscriber connection, those log messages have been enhanced, so you'll get better events and better results and will know what exactly went wrong or what exactly is happening. We have enhanced the logs for both the development and the production setup. On the production side, when you open a workflow and check the ChaosEngine pod logs, they now have better highlighting: if a probe resulted in success or failure, or the experiment verdict did, you'll see each individual 
line highlighted in green, red, or yellow for a warning; so we have taken measures to enhance the logs for both production and development. On the internal side, we have also migrated the project collection, which mainly stores the metadata of your Litmus projects. Previously it lived in the Litmus database; now we have shifted it to the authentication database, and along with this shift we have done an internal code refactor of the authentication server to enhance and improve security. We have also added enhancements to the cmdProbe, specifically in the source section: previously you could only provide the source inline, but with the latest updates we have destructured it, so you can specify an image and host network settings separately as part of the source configuration. We have also hardened the Litmus Alpine images used in the various Litmus chaos tools, and we have created an e2e pipeline to monitor all the pipeline builds; there are nightly builds and a whole e2e dashboard you can explore to see the different builds for individual workflows, for each of the experiments, and so on. And lastly, of course, we are working on new experiments: experiments such as an AWS AZ down experiment, Azure disk loss, and others have been added to the public chaos hub. Now let's take a look at iFood: what the company does, the use case that made them switch to using Litmus, and how Litmus is actually helping them with the challenges they faced. iFood is one of the most in-demand Latin American online food ordering platforms; they deliver more than 60 million orders each month. It was founded in 2011 and aimed to provide a better and quicker solution for online food delivery with 
innovative systems, so that users can order deliveries on the internet with no hassle and with ease of use. With over 80 percent of the market share, iFood geographically covers most cities and regions in Brazil, especially Brazil's financial center of São Paulo. Now, what were the exact challenges iFood faced, and why didn't the solutions they had work out for their specific use cases? Due to its growing popularity, and because they were serving such a huge number of customers, they decided to break their existing monolithic architecture into several microservices so they could scale better. This decision to move from a monolith to a microservice-oriented architecture of course came with its quirks, but also with much more complexity and additional costs. That was one reason they were looking for solutions to handle this kind of complexity and to deal with resiliency as they scaled further. Faults like database services going down, message brokers crashing, entire cloud provider regions going out due to different kinds of outages, and network bandwidth dropping without notice were definitely some of the challenges; these are the kinds of outages that can happen to your systems and applications. These were some of the major problems iFood was facing in Latin America, and they wanted to mitigate these challenges because the user base was growing and they wanted to give users seamless access to their delivery platform. They decided to tighten up reliability by continuously doing load tests and bare-minimum chaos experiments, but the solutions they were using at the time lacked functionality driven by specific use cases: if they had certain scenarios in mind, they wouldn't be able to run them; they could only target and do basic chaos. Based on the problems mentioned above, they wanted 
a solution that could cater to their specific use cases, with the ability to customize their own scenarios based on their experience and run them in an automated fashion. They also had the requirement to know which users perform what kind of chaos, to enable better RBAC control in production. Chaos testing carries a real amount of responsibility: if they're doing it on production, say on a specific environment, and something goes horribly wrong because of a particular chaos experiment, they would like to know which user ran which experiment, at what scale, and against exactly which target. So they wanted that visibility into who performed what, along with better RBAC controls. The chaos engineering solution they were using was not really automated and also had a limited number of experiments, but given the number of scenario ideas iFood had, they wanted to customize the experiments and eliminate manual cost as much as possible: to turn those ideas into custom scenarios, automate them, and have them running by themselves rather than running them manually. These were some of the challenges they faced, and one of the main challenges, downtime, is what planted the idea of switching to an automated chaos tool. So what exactly goes wrong when you have downtime? Right off the bat there's a loss of customer confidence, which is the biggest letdown: if you have an application with a huge base of customers and there's even a slight amount of downtime, you lose customer confidence, not to mention the amount of cost you might incur in that time frame. The average downtime for an outage is reported to be about 79 
minutes, and the average cost of that downtime is about 84 thousand dollars, which is huge. Even if it's not 79 minutes, even if it's five minutes, in those five minutes millions of users could have clicked through, wanted to order food, wanted to have something delivered, or just wanted to check out your platform. So when there's downtime you lose a huge amount of customer confidence, and there's also the damage to the brand's integrity if a certain outage hits them; I would consider those two the main points to note about downtime. Now, how exactly is Litmus helping iFood? Litmus came in with a large set of chaos experiments that suited their requirements, because Litmus gives you the ability to add your own private hubs as well as the public hub, which ships with over 50 experiments covering a range of different fault types. They could usually pick up one of those experiments and then build on top of it. iFood really liked the idea of customizing something for their specific requirements, because there were multiple areas that Litmus was touching. They went with a declarative approach, which helped them customize these chaos experiments and then tune the ChaosEngine further to add their own environment values specific to their requirements. Litmus also gave them the ability to fine-grain their RBAC controls through its integration with Dex: they integrated with the Dex authentication service and authenticated users into Litmus, which restricted developers to targeting only the specific applications where they could inject chaos. That gave them the ability to restrict certain users to certain activities, the RBAC-level control they were initially looking for. We also gave 
them the ability to run a workflow on a cron schedule, since they wanted to automate and save manual labor, and we also have the option to save the different scenarios as templates for later use. That aided easier automation and automatic chaos at specific intervals, which is one feature they found handy and which is helping them automate the entire process and remove manual labor as much as possible. So that was one of the stories of how iFood is currently using Litmus. To continue with the next use case and the demo, I would like to hand it over to Nilanjan, who will guide you through the rest. Hello, let us take a look at another end-user use case: Halodoc. Halodoc is the most popular all-round healthcare application in Indonesia, a rapidly growing startup founded in 2016 whose mission is to simplify and bring quality healthcare across Indonesia. Halodoc has partnered with more than 4,000 pharmacies in over 100 cities to bring medicine to people's doorsteps, and recently they launched a premium appointment service that partners with more than 500 hospitals for booking doctor appointments through the application. The platform is composed of several microservices hosted across hybrid infrastructure elements, mainly on managed Kubernetes in the cloud, with an intricately designed communication framework. Halodoc has leveraged AWS cloud services such as RDS, Lambda, and S3, and consumes a significant suite of open source tooling, especially from the CNCF landscape, to support these core services. While operating a platform of such scale, where newer services are onboarded quite frequently, it's quite plausible to encounter service downtime due to unanticipated causes; it's not an isolated incident that newly added services go down and only get mitigated after much effort, which affects both the team and end users. In a system with the kind of dependencies Halodoc had, it was prudent to test 
and measure service availability across a host of failure scenarios. This needed to be done before going live, and occasionally after it in a more controlled manner; hence chaos engineering was found suitable to supplement the existing QA, with comprehensive automated test suites and periodic performance testing and analysis to make the platform more robust. The major requirements Halodoc's chaos engineering practice sought were these. First, being Kubernetes native: Halodoc uses Kubernetes as the underlying platform for a majority of its business services, including hosting the tools that operate and manage observability across its fleet of clusters. They also needed a chaos tool that could be deployed and managed on arm64, that is, AWS Graviton based Kubernetes, along with the ability to express a chaos test in Kubernetes's own language, resource YAMLs. Second, a wide range of experiments: considering the microservices span several frameworks and languages such as Java, Python, C++, and Golang, it was vital to subject them to varied service-level faults; add to that the hybrid nature of the infrastructure and the varied AWS services, and the need to target non-Kubernetes entities like cloud instances, disks, and so on becomes clear. Furthermore, the application developers needed to be able to build their own faults, integrate them into the suite, and have them orchestrated in the same fashion as the cloud native faults. Third, chaos scenario definition: there was a need for full-fledged chaos scenarios that combined faults with custom validation depending on the use case, as the chaos tests were expected to run in an automated fashion after the initial experimentation established a good fit. Halodoc also uses a variety of synthetic load tools, mapped to the families of microservices in its test environment, that they wanted to leverage as part of the chaos experiments to make them more effective and derive greater confidence. Fourth, security features: the tight staging environments 
at Halodoc are shared, multi-user environments accessed by dedicated service owners and SRE teams, with frequent upgrades to their applications. Halodoc needed a tool with the ability to isolate the chaos view for the respective teams, with admin controls in place to contain the possible blast radius; this, along with the standard security considerations around running third-party containers, was required. Fifth, observability hooks: Halodoc relies heavily on observability, both for monitoring application and infrastructure behavior, where the stack includes New Relic, Prometheus, Grafana, Elasticsearch, and so on, and for reporting and analysis, where they use Allure for test reports and Lighthouse for service analytics. It was only judicious to choose a chaos framework that can provide enough data to ingest in terms of logs, metrics, and events. Lastly, community support: Halodoc saw value in an open source project with a strong community around the tool and approachable maintainers who could see reason in the issues raised and the proposed enhancements, while keeping a welcoming environment for users who want to contribute back. Hence Halodoc chose LitmusChaos, which met the requirement criteria to a great extent while having a roadmap and release cadence that aligned well with their needs and pace. Another reason for choosing LitmusChaos was the GitOps support, which allowed for the automation of chaos experiments. Halodoc has also contributed back, toward a better user experience in the Chaos Center and improving the security of the platform. Halodoc's initial efforts with Litmus involved manually creating ChaosEngine custom resources targeting the application pods to verify their behavior. This in itself proved beneficial, with some interesting application behavior unearthed in the development process. Eventually the experiments were crafted with the right validations, using Litmus probes, to form chaos workflow resources that can be invoked programmatically and automate the process of hypothesis validation 
during chaos. Today these chaos workflows are stored in a dedicated git repository, mapped to the respective application services via a subscription mechanism, and triggered upon app upgrade via the Litmus event-tracker service residing within the staging cluster. While the chaos experiments on staging are used as a gating mechanism for deployment into production, the team at Halodoc believes firmly in the merits of testing in production: scheduled chaos experiments are used to conduct automated game days in the production environment, with the mapping between fault type and load conditions devised from usage and traffic patterns. The results of these experiments are then fed into a data lake for further analysis by the development teams, while the reports from the Chaos Center, the control plane component of LitmusChaos, especially those comparing the resiliency scores of scenarios, are also leveraged for high-level views. The personnel involved in creating, maintaining, and tracking the chaos tests on staging are largely developers and extended tech teams belonging to different verticals, while the game days are carried out exclusively by members of the SRE team. Upgrades to the chaos microservices on the clusters are carried out in much the same fashion as any other tooling, with the application undergoing standard scans and checks in the GitLab pipelines. With that, we are all set for a demonstration of Litmus, where we'll see how we can inject chaos into a Kubernetes application to assess its resiliency. See you in the demo. Hello there, and welcome to the demo on LitmusChaos. Before we actually jump into creating some chaos (as you can see, I'm here in my Chaos Center), I would like to bring your attention to the Boutique app. The Boutique app is an e-commerce application completely composed of Kubernetes microservices; you have the typical sections of an e-commerce application, such as 
the different products you can buy, the product descriptions for each of them, and also a cart, for example, where you essentially store all the items you mean to check out and pay for. What we are going to do today as part of our demo is inject some chaos into this Boutique application and see how it fares against the injected chaos. To be more specific, we are going to use the HTTP chaos experiments, which are among the newly added chaos experiments, and target this particular cart service, a microservice within the Kubernetes cluster, and see what effect that has on our application. Before we actually jump into the chaos, I would also like to show you this dashboard, a Grafana dashboard for our Boutique application. As you can see, the metrics we can observe in the dashboard right now are indicative of normal system behavior: we have a blackbox exporter which indicates the service endpoint is quite healthy, and the probe success percentage for it is 100. We can also see the queries per second of the cart, which lie somewhere in the range of 40 to 60 QPS, again indicative of normal system behavior, and the access duration, that is, the latency, is also quite low right now, in the vicinity of 2.0 to 2.8 seconds, which is quite normal. So with that, we can go ahead and target chaos at our application using an HTTP chaos experiment. To do so, in my Chaos Center I'll first schedule a chaos scenario. I'll choose the self-agent that I have and go next; then I'll choose the ChaosHub, because that's where my pod-http-latency experiment, the HTTP chaos experiment, is situated. Then I'll go next, and we can name this the cart chaos scenario, since we are targeting the cart, and over here we now need to add our experiment, which happens to be pod-http-latency. Now 
that we have added our chaos experiment, we simply need to fill it in so that we specify the exact details of our chaos, allowing the experiment to target the requisite pod and the resource we want to target. For that I'll first go next over here, and in the application namespace we need to choose the namespace in which our Boutique app lives: that happens to be litmus for now, since we have installed it in the same namespace as LitmusChaos. For the application kind, it is a deployment; the cart is present as a deployment here. If we check for the label, we can see that LitmusChaos performs Kubernetes resource discovery so that it can fetch the label for our cart service, so we'll choose app=cart-service. It's worth mentioning that we have only one pod under this deployment, which we are going to target right now, so let us see what the effect on our application is when we target the only pod present. We'll go next from here, and at this point we can add a Litmus probe. So what is a Litmus probe? For the uninitiated, Litmus probes are a way to automate the process of hypothesis validation in a simple and declarative manner. We are going to define a criterion for this probe, and the probe will validate that criterion while we are injecting chaos; this allows us to check whether the condition is fulfilled as part of the experiment and helps us determine its outcome. To do so, I'll first add a new probe; let's call it the cart probe, and it will be an HTTP probe running in Continuous mode, that is, throughout the experiment in a continuous fashion. Before we fill in any of the probe properties, I'll first bring your attention to what we are going to do as part of this HTTP probe: we 
are going to validate the endpoint of this cart here in the Boutique app. We are simply going to provide the URL of the cart over here, and the condition we are going to enforce is that we expect a response code of 200 whenever we perform a GET request. So what will happen is that we'll be performing an HTTP GET request at this particular endpoint, in a continuous fashion, throughout the duration of the experiment. Now we can go ahead and specify a few probe properties. First, the timeout after which the probe would fail: let us set this to three seconds. Then, how many times shall we retry in the event that our probe is failing: we can set this to one, retrying once just to be sure. Then, the interval we want between successive probe iterations: we can set this to one as well. With that, we are pretty much done expressing our probe in a declarative fashion, and that's all you need to initialize a probe and check your application's steady-state conditions during the chaos. So with that, I'll add the probe and go next. Lastly, in the last step, we just need to specify a few environment parameters for the experiment; these are the parameters with which the experiment will run. First is the total chaos duration: we'll be running this for a duration of 60 seconds, which seems plausible. For the latency, I'll add a very big latency, which will essentially block our HTTP requests for that latency value; this is in milliseconds, so I'm adding an HTTP latency of 80,000 milliseconds, which is very large, and we'll see what happens when we apply this large a latency in the experiment. We also need to provide our target service port, the port we are targeting on that deployment's service, so let us see what this target service port looks like in our Kubernetes terminal, 
that is, using kubectl. What I'll do is list all the services we have over here; you can see that we have a cart service and that it uses port 7070, so we'll use this as our target port. Lastly, I also need to specify our container runtime: I'm using the containerd runtime, so I'll go ahead and set the container runtime as well as its socket path. With that we are nearly ready, but not before we specify the pods affected percentage, the percentage of pods we mean to target. The minimum number of pods this experiment will target is one, and beyond that, whatever percentage we specify here is what it will go ahead and target. I'll enter 50 over here; 50% would essentially mean half of the pods in our deployment, but since we have only one pod right now, it will go ahead and target that one pod. With that we can finish up over here, and I'll set Revert Schedule to false. What this does is prevent the deletion of the experiment metadata that gets created during the experiment execution, including all the pods and the workflow resources we have as part of the experiment; this allows us to retain the logs so that we can view them. With that I'll go next, and we can specify a weight for the calculation of the resiliency score at the end of the test. We can keep it at 10; since we have only one chaos experiment, it doesn't really matter what weight we provide here. We can go next now, and I'd like to choose Schedule Now, then next. This is the summary of the entire chaos scenario we just created, so let us actually go ahead and create it. Our chaos scenario has been successfully created; as you can see, it's running, so let us wait for a while for the 
You can see that the chaos experiment is getting installed right now; the pod-http-latency experiment is being installed as part of this step. With that, the installation of the chaos experiment is over, and we can see that the pod-http-latency experiment has in fact started. We can verify the effects of this experiment in real time using our observability dashboard, that is, the Grafana dashboard. Thanks to the chaos annotation applied by the chaos exporter, we are able to see the impact of the chaos here in the dashboard. We can observe that, with the experiment running, the access duration is going up right now, and this is only explainable by the impact of the experiment: we are essentially applying a very large latency to our cart service application, and therefore we can see the access duration, that is the latency, increasing sharply while the QPS also takes a hit. You can see that the mean QPS, indicated in yellow, is going up, while the 99th-percentile, or the immediate QPS, is in fact going down, which indicates that our application is indeed affected. To confirm, if we refresh our cart page, you can see that we don't really get a response from the application; the response is still pending, and then we see that something has failed, with some logs and debugging information: a 500 Internal Server Error. That makes sense, because we have added a very large latency, so right now the front end is not getting any information from the cart, and hence we observe this error. If we go back to our application dashboard, we can see that the chaos duration has now passed and the experiment is effectively getting over; we are now in the post-chaos stage, where the chaos effect has hopefully been reverted, and what we would like to
understand is what this did to our application. What we saw in real time is that our application was unavailable, but during the chaos it's very important to validate what is happening in an automated fashion. For that, let's first check whether our experiment has ended or not; okay, it's still going on, so let us wait for it to complete. As we wait for the experiment to conclude, we can see that the service metrics are regaining normal system behavior: the access duration is going down, the cart QPS is returning to its normal state, more or less, and for the blackbox exporter, the probe success percentage is also getting back to a normal 100%. As part of our experiment run we can see that it failed, but before we analyze why it failed and what the logs indicate, let us actually refresh the application page. Yes, you can see that the chaos has in fact been reverted: there is no remnant of the 500 Internal Server Error we were getting, since the effects of the chaos have been removed. So let us go back to our ChaosCenter and analyze what went wrong in this experiment and what Litmus has to say about it. If we go inside the table view and open the logs and results, we can see all the experiment logs here. As part of the logs we first get the different experiment details, for example the running pod, which is the name of the pod we are targeting, the cart service, and then the run properties of the probe, that is the timeout, the interval, and the retries; everything is here. After that we get to the actual experiment, and over the course of time we can see that initially the probe was getting an actual value of 200, which makes sense, since before the chaos the service was working correctly and hence we were getting a
200 response code, as expected. But during the chaos we didn't quite get a 200 response code, which can be seen here in the log: it says that the actual value is 500, which does not match the expected value of 200. This is in sync with what we saw earlier in the browser, where we were getting a 500 Internal Server Error, and this has been the cause of the failure of this experiment: as you can see, the probe status is failed, and therefore the experiment has failed. This shows how Litmus probes can be leveraged to automate the process of hypothesis validation during the chaos, and how you can use the logs to pinpoint the exact cause of an experiment's failure, or its passage, with LitmusChaos. We can also get a quick summary of the entire experiment from the chaos result, where we can see that the experiment status is completed but the verdict is failed, and the probe success percentage is zero for the probe that we defined, the cart probe; it says "better luck next time" for the continuous mode, which means that it has failed. So we saw how we can use LitmusChaos to validate our chaos experiments' resiliency, and we got the validation that something is not quite right with our application: some component of it, at least, is weak. So how can we make it more resilient in this case? The most plausible fix is to simply bump up the number of pods in our application deployment. Let us try that: we have one pod right now, let's scale it up to two pods, rerun this experiment, and see how it goes. I'll go back to my terminal, and I'll scale up the cart service deployment, the deployment backing the cart, to two replicas.
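The scale-up we're about to do amounts to raising the replica count on the cart service Deployment, either imperatively with something like `kubectl scale deployment cartservice --replicas=2` or declaratively in the manifest. The Deployment name, labels, and image below are assumptions for illustration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cartservice              # assumed Deployment name
spec:
  replicas: 2                    # was 1; with PODS_AFFECTED_PERC=50, only one of the two pods is targeted
  selector:
    matchLabels:
      app: cartservice           # assumed label
  template:
    metadata:
      labels:
        app: cartservice
    spec:
      containers:
        - name: server                        # container spec unchanged; abbreviated here
          image: example/cartservice:latest   # placeholder image
```

Watching the pods with `kubectl get pods -w`, as in the demo, shows the second replica coming up before the schedule is rerun.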
Let us determine whether the scaling is done or not using a watch command. As you can see, the scaling is still going on and the pod is getting created, so let us wait for it to complete. With that, we have now successfully scaled our cart service deployment to two pods. Now the question remains: what will the behavior of the experiment be in this case? If we are targeting 50% of the pods, that is, only one out of the two pods, will our application be able to sustain the chaos? Let us find out. For that, I'll simply go to the chaos scenario schedules; here we have the cart chaos scenario schedule that we created and ran as part of the earlier experimentation, and I'll simply rerun the schedule. In effect, this reruns the same workflow we created last time, running the same pod-http-latency chaos experiment once more. This time around, let us observe the effects of the chaos again using the application dashboard, but before that, let us wait a moment for the experiment to start. With that, our pod-http-latency experiment has actually started, so let us head back to our application dashboard, the Boutique dashboard. We can see that slowly the chaos experiment is taking effect; the chaos annotation is quite prominent here. But what we are observing in this case is that so far our steady state seems to be maintained: the probe success percentage for the cart service endpoint seems stable, with no deviation. For the cart itself we can see a slight change in the QPS, which is varying in the range of 100, and the access duration for the service is also spiking a bit; it's in the vicinity of 2.5 minutes right now, which is not too bad. So it seems the chaos is doing something to our application: the QPS is steadily
increasing and the access duration, the latency, is kind of flattening out. But the most important question remains: is the application still available? For that, I'll simply refresh, and as you can see, this time around it's not going down; we are not stuck waiting for a response, and it's still accessible no matter what the application dashboard is showing. To compare with the earlier run, I'd actually like to look at the dashboards side by side, maybe over a 15-minute window. This time around, you can see that although we observed a spike in the access duration for the cart service, it's much smaller than in our earlier run, almost half, which makes sense: we have added one more pod, and that is mitigating the effects of the chaos and helping the application sustain it. This makes for a much more scalable and reliable setup for our chaos scenarios, where even if one pod goes down, there will be another pod sustaining the load at the same time. With that, we can see that our chaos duration has essentially passed; we can go back to our application now, and even after the removal of the chaos, everything is fine and working in order. Let us wait for our chaos experiment to complete to observe its effect, and as you can see, the chaos experiment has completed this time around. So let us look at the logs this time as well; although we can already see that it has passed, let us still validate using the logs and the chaos result. If we take a look at the logs, we can see that before the chaos we were of course getting an expected value of 200 as well as an actual value of 200, that is, the response code, which makes sense; and this time around, every time the check is performed, we always get a 200 response code, and there is no
response timeout this time; that is, we are right on track with what we observed in the browser as well, where the website stayed available throughout the experiment duration while we kept refreshing. As a result, our probe has in fact passed; as you can see, the cart probe has passed, and this in turn ensured that our experiment passed. We can observe the same from the chaos result: as you can see here, the experiment status is completed while the verdict is passed, with a probe success percentage of 100. Since we had only one probe and it passed, the probe success percentage is 100, and the continuous probe we had defined by the name of cart probe has passed. So with that we conclude the demonstration of LitmusChaos. We saw how we can use the pod-http-latency experiment to validate the behavior of a Kubernetes application when we apply a latency of a given value to a Kubernetes microservice, and we also saw how you can define Litmus probes within LitmusChaos experiments to automate the process of hypothesis validation during the chaos. With that, I'd like to wrap up this demo. Thank you so much.