Greetings, everyone. My name is Prithvi Raj and I welcome you all to the CNCF on-demand webinar where we will be talking about the LitmusChaos year in review 2022 — the chaos engineering updates. It's been an amazing year for the LitmusChaos project, from incubation to so much more that the project has achieved as a community, and what we've done over this year has been commendable. We are here to share the year review with you all. I have with me Vedant, and we'll introduce ourselves quickly. So, moving on to the introduction. As you know, I'm Prithvi Raj, and I lead the community for LitmusChaos. We started off at MayaData, and since then it's been a journey of more than two and a half years in which we have switched companies; today we are at Harness, which is a primary sponsor of LitmusChaos, and we contribute to the Litmus project here day in and day out. Alongside that, I have been involved in the Kubernetes and chaos communities by organizing Chaos Carnival, KCD Bengaluru and Chennai, and the chaos engineering meetups we run every last Thursday of the month. Perhaps, Vedant, you can introduce yourself. Hi everyone, this is Vedant. I'm also a core contributor at LitmusChaos and a senior software engineer at Harness. It's been the same journey as Prithvi's — we started at MayaData and we are now at Harness. I also work as part of the test automation team at LitmusChaos, and I'm looking forward to this demo. Awesome. So let's move on to the agenda. We have a packed agenda — it's a year review, so there are a few important things to cover. We'll obviously start with chaos engineering, since a lot of folks who tune in want to understand what chaos engineering is and don't yet have much of an idea about it, and then we'll introduce the Litmus project.
We'll talk about the Litmus journey, from the incubation that happened early this year to the metrics over the year. We'll look at the website, the adopters, and the community events and programs we took part in during the course of the year, then the student programs, and lastly the KubeCons and the community's participation in them. Vedant will take it ahead with the project updates and what lies in the future for LitmusChaos. And finally, you'll get an idea of how you can be a part of the community, how you can contribute, and how you can join this massive chaos engineering community. So let's get started without any further ado. Chaos engineering: a closer look. Before I start talking about chaos engineering itself, I think it's best to talk about how this practice came in and why it is essential. The term chaos engineering was popularized by Netflix back in 2011-12, when they were starting to inject production-level failures and wanted some simple tests to bring down pods
— in Kubernetes terms, you could say — at a production level. That's what brought in the term chaos engineering. Initially it was a practice for teams that saw scaling as an issue, or saw curating production-level failures as vital, and that is why the term chaos testing came into place. But slowly it was realized that chaos engineering is not just about production-level failures. You obviously want to test in production, but eventually it became more about testing to understand what sort of chaotic conditions might happen in real life — what might happen to your system once it goes into production, but also before that, in your staging, pre-staging, testing, DevOps, or CI pipeline environments. So let's take a closer look at chaos engineering. It's the process of deliberately inducing a fault into your system to understand the unexpected disruptions or vulnerabilities that can occur in a real-life scenario, and to understand the resiliency of your system. Basically, you identify the weak points in a system by testing in a controlled way, where random or unpredictable behavior can be analyzed, visualized, and understood — so that when the system goes into production and there's a requirement for scaling (for example, a spike in the number of users during your Black Friday sale), your system doesn't show unpredictable behavior. That's where chaos engineering helps; that's its goal. It was once seen as simply breaking things in production, but it's really breaking your systems to identify weaknesses and make your systems resilient — that's how you complete the sentence. And that's chaos engineering in a few words; there's so much more you could say about
it, but let's move on quickly to why chaos engineering is the solution, and to the four or five easy steps for starting your chaos engineering journey. Chaos engineering became important because, as I mentioned, your systems are vulnerable, and as we move from legacy systems into the microservices era this only grows. If you take the example of a Kubernetes application, you can say the stack itself is like a pyramid: there's your Kubernetes application on top, and then there are your other services, your platform layer, your other applications — MongoDB, Kafka — and even CNCF applications like CoreDNS, or OpenEBS (the first application we started using chaos for), running alongside your application. Each and every layer has a potential vulnerability, and the stack is dynamic in nature — we keep progressing to enhancements and new developments — so there's a potential outage that can happen at any layer. That is why chaos testing helps with continuous validation: it lets you continuously test whether your systems are resilient, even after every enhancement. And to move ahead to the solution, and to make sure you're running your chaos tests the right way, there are four easy steps.
First, you identify the steady state of your system: how it behaves in a normal condition. Then you hypothesize around it — you create a hypothesis for how your system behaves in its steady state versus how it behaves when a certain set of experiments is run, or when certain vulnerabilities are found in an experimental group. Then you introduce a fault, or introduce variables that could happen in real life — a server crash, a traffic spike, a network connection error, a malfunction — some sort of vulnerability injected into your system, which can be called a chaos experiment or a chaos scenario. Once you test your systems, you try to disprove the hypothesis you created, by looking for a difference between how your system behaves in its steady state and how it behaves when the experiment happens. And then you continue. If you are an SRE, you understand that you have a certain set of service level objectives (SLOs); if your SLOs continue to be met in spite of running the chaos scenario, then your systems are resilient. If your SLOs are not met, then you need to find a fix for that vulnerability, and then you continue to test — which can be called the chaos engineering loop, the overall process as defined in the Principles of Chaos Engineering.
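As a preview of how this loop looks in practice with the project we're about to introduce: in LitmusChaos, the steady-state hypothesis can be encoded as a probe attached to a chaos experiment, so the SLO check runs continuously while the fault is injected. This is only a sketch — the application name, label, and health URL are hypothetical, and the schema follows the public LitmusChaos docs (check your version for exact field names):

```yaml
# Sketch: the steady-state hypothesis as a continuous HTTP probe.
# The app label, URL, and names are hypothetical.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-resilience
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=checkout
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        probe:
          # Hypothesis: the service keeps returning 200 during chaos.
          - name: checkout-availability
            type: httpProbe
            mode: Continuous
            httpProbe/inputs:
              url: http://checkout.default.svc.cluster.local:8080/health
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 1
```

If the probe keeps passing, the hypothesis holds and the system is resilient to that fault; if it fails, you've found the weakness the loop is designed to surface.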
So moving ahead — we have quickly introduced chaos engineering, and now we move on to the LitmusChaos project: an open source project for everyone who wants to run chaos engineering, a toolset to identify weaknesses and potential outages in your infrastructure. It obviously started off as a tool to test only Kubernetes — it had only Kubernetes chaos experiments — and slowly, with the understanding of the community, it has moved on to more and more experiments which are cloud native but also go beyond Kubernetes, so you can test applications outside it: there are experiments for VMs, GCP, AWS, and more. It's a complete toolset, it's CNCF native, and it became a sandbox project back in 2020. In 2022 it successfully became a CNCF incubating project, and it's an amazing community where people come in, contribute, and make chaos engineering more helpful and more available to everyone out there.
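Getting started with the toolset takes only a couple of commands. A sketch following the public install docs — the release name and namespace are arbitrary choices, and the chart values can vary by version, so check the current docs:

```shell
# Add the LitmusChaos helm repo and install ChaosCenter.
# Release name ("chaos") and namespace ("litmus") are arbitrary choices.
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install chaos litmuschaos/litmus --namespace=litmus --create-namespace

# Verify the control-plane pods come up before logging into the UI.
kubectl get pods -n litmus
```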
So let's talk about CNCF incubation first, before we get into more details on the project. On January 11, 2022, Litmus became a CNCF incubating project after more than a year of hard work and toil with the community. We've had some amazing adopters come in and talk about how they have been using Litmus — end users like Orange, Kitopi, Lenskart, and many more — and the community keeps growing. Chris from CNCF himself believes that chaos engineering techniques have enabled organizations to bring more reliability practices and robustness into their growing production environments, and as community lead I believe Litmus has helped a lot of enterprises improve their resilience goals and SRE practices and bring in chaos engineering as a practice in their organizations. Obviously the next goal is graduation: we are hopeful that we can achieve more, grow more as a project, gain more adopters, and reach the graduation stage in the upcoming months, or in a few years' time.
So moving ahead, just a few stats. Litmus, as you can see, is adopted by some amazing enterprises. These are the formal adopters who have listed themselves, but there are many more out there using Litmus day in and day out, and there are so many stories — FIS, Accenture, and others have come out and spoken about how they are using Litmus, with more stories coming in all the time. It's been five years of active development; today more than a million experiments have been run, and recently we crossed 5.4 million total Docker pulls. The project has seen massive growth — chaos engineering adoption is growing exponentially and Litmus adoption is growing with it — and post the release of 2.0 back in 2021, Litmus has achieved a stable platform. With more and more enterprises coming in, Litmus as an open source platform keeps growing. Some metrics from this year: we grew by 1,100+ GitHub stars and 100+ forks; Slack membership grew by around 400-500 members; and Docker pulls were a massive surprise for all of us, with more than 2.5 million pulls this year. We've got an amazing community. Of course, the metrics alone don't capture the love Litmus has received or the number of people using it, but feel free to check out the LitmusChaos GitHub and drop a star, make sure you use it, and join the Slack community so you can be part of it too. So moving on, let's check out the latest LitmusChaos website — we just made some changes to it.
As you can see, this is the latest website. It tells you what Litmus is, takes you to the community events happening, covers the adopters as you saw, and gives you a look at what the chaos engineering / Litmus platform is and how you can get started in minutes by pulling a Helm chart and running a couple of commands. It also covers the platform's features — ChaosHub, multi-tenancy, observability — things we have spoken about a lot, so just go through the latest website: it gives you a lot of information on what Litmus is, how you can be part of the community, what the docs hold, what the enterprise version holds, and so on. Now let's move back to our slide deck. We were speaking about the website; now let's talk about the adopters — the formal end-user adopters. We saw eight amazing end-user adopters this year, and many more people came out sharing their Litmus stories. I'll share a few with you — iFood, FIS, and Adidas; the other stories are available on the LitmusChaos GitHub. The Adidas story: they started a few months ago, and it was about bringing in chaos as a culture and a practice. After evaluating various tools, they chose LitmusChaos for the reasons they shared. They are using LitmusChaos for their applications, workloads, and infra, with experiments like pod deletion, network latency, and packet loss targeting areas such as the payment, checkout, and login sections. They haven't moved to production yet, as shared by Victor, who is one of the community members, but hopefully this story will evolve, Adidas will move into production, and they will speak more about their Litmus usage.
And obviously we are glad to see why they chose Litmus — their priorities matched; these are the priorities they shared. As of now it's being used in staging and pre-production environments, and the future plan is to move into production through the pipeline. This was an amazing story shared by Adidas — I think one of the best stories that could come forward for Litmus usage. If you're using Litmus, you can also come forward and share your story: how exactly you're using Litmus, why you chose it, and how you plan to use it in the future. Similarly, Raj Vadheraju, another amazing community member, shared how FIS Global is using Litmus. They have been moving towards more SRE practices and transforming their platforms, and they chose LitmusChaos because it fulfilled a lot of things for them — it met their testing requirements and had a great community (thanks again for the kind words about the community). All these factors helped them adopt Litmus. They are using Litmus on their applications and workloads, simulating experiments to understand the utilization of JVMs and key resources, using Litmus for Kafka resilience, and eventually looking to integrate Litmus with CI/CD.
These have been some amazing stories, and one last story, which also became a case study for us and has come out as a blog, is the one by iFood. iFood is a food delivery platform operating out of Latin America — Brazil and Colombia — handling approximately 60 million orders a month, similar to platforms like Zomato, Uber Eats, Foodpanda, and Talabat. The problem statement occurred for them on the Brazilian Valentine's Day, when they faced a huge outage due to a spike in the number of orders. They saw an entire region of a cloud provider go down and hit network bandwidth issues, and that is where they decided to shift strategy. They introduced fallback and circuit breaker methods, and most of the engineering teams tried to provide support for the outage, but the eventual conclusion was that they needed a better approach — and that better approach was chaos engineering, as they shared with the community. That is when they evaluated various tools, defined an architecture for how they would use LitmusChaos, looked at the broad set of experiments, saw how Litmus could help them, and concluded that it has a well-defined network and authentication mechanism. That's what led to iFood using LitmusChaos. These stories have been amazing: they came out this year showing how people can use LitmusChaos, and they have become business cases and use cases for everyone to adopt. Feel free to check out the iFood blog on LitmusChaos and how they plan to use it.
Moving on, these are the community events and programs that happened this year. There were so many — I just took out some pictures and put them up for the community to see. Shout out to so many folks out there: Amit, Saranya Jena, Vedant who has joined us here, Sayan Mondal, Karthik, Kunal, Khushwa, Pavan Belagatti, and Uma, who has also been leading the Litmus community for some time now. Shout out to everyone who has been contributing to the community and has taken part in amazing community events like KCD Sri Lanka, KCD Bengaluru and Chennai, All Day DevOps, the Docker meetups, and AWS Community Day Kochi — so much has happened over the year, and we thank all of them for taking part, contributing, and joining these community events and programs to make the community a really successful one. Moving on, let's talk about a couple of things we regularly organize. First, the community sync-up calls: our monthly cadence is a release every 15th of the month, followed up with patch releases and fixes, and we hold the community sync-up calls every third Wednesday of the month — if you haven't joined one yet, feel free to join the Slack channel and the sync-ups. Then there are the chaos engineering meetups: we had an in-person one late last month, and we hold them online every last Thursday of the month, so if you're available, feel free to join in, submit a talk, or let us know if you're interested in speaking. And then Chaos Carnival, of course — the 2023 edition is coming up, and LitmusChaos has been a proud community sponsor, with amazing folks joining in.
A lot of global conference community members — Sangam, Akram, Hendricks, Michael, and many more — joined in and spoke about Litmus: someone spoke about LitmusChaos and Jenkins, someone spoke about how GitOps meets chaos engineering with Litmus, and so many more stories have come up. The CFPs are open, so if you want to speak about something related to Litmus and join Chaos Carnival as a community member, feel free to reach out to us, join the community, and submit your talk. Next, student programs: LitmusChaos has taken a crucial part in student programs — GSoC, GitHub Externship, and the LFX Mentorship. Early this year Prayag became one of our mentees, and we had a great time with him. He helped add new CLI commands for chaos scenario CRUD operations and enabled users to automate scenarios as part of CI/CD pipelines — basically, his work was around developing new features and adding integration tests for litmusctl. Kudos to Prayag for being an amazing part of the community and helping as an LFX mentee. Lastly, once everything wrapped up, let's talk about KubeCon. Both KubeCons this year — KubeCon EU and KubeCon NA — were amazing and saw massive participation from LitmusChaos and the community. We had two amazing project meetings and a case study: Uma and Ramiro from the community spoke about bringing chaos to cloud native developers. Both maintainer track sessions featured an end-user story, and we had Uma and Karthik sharing the project updates and how the project has grown over the last one to one and a half years.
And there were two amazing stories shared: Raj from FIS on how chaos engineering is applied to the fintech domain, and the Iter8 community on how chaos injection and SLO validation go hand in hand. There was also an amazing chaos-themed co-located event, which saw good participation from a lot of folks in the community. They spoke about Litmus — shout out to Bianca from HCL, and to Crystal Lam, another amazing community member, who spoke about how they are using Litmus and how essential it has been to them. We look forward to the KubeCons in 2023 — Amsterdam and Chicago. The plan is to have LitmusChaos there: a booth if possible, and obviously maintainer track sessions speaking more about Litmus. These KubeCons have been a pleasure, and we thank CNCF again for giving Litmus the platform; we look forward to participating in more KubeCons and extending Litmus's reach in the community. So these are some snapshots. Lastly, thanks a lot to Chris, Bianca, Dames, Kiran, and Sumita — another amazing community member — and to friends from everywhere. Thank you so much to all the community members who have participated with Litmus, given Litmus a platform, loved Litmus so much, and helped chaos engineering adoption. We hope you keep enjoying what Litmus is and how it's growing, and we hope to see you again at one of the KubeCons, at other conferences, or wherever you continue to support Litmus. With this, I'll let Vedant take over; he will talk about the updates to the project, the next version, and how you can get involved. So without any further ado, Vedant, over to you. Thanks, Prithvi, for sharing such great details.
It's always good to know how chaos engineering is doing and how the project is doing, right? And it's been a great year: we've had a good number of contributions, a very good number of feature requests, and contributions in the form of feature enhancements — the community helps us build the product as well. It's been a year of learning by building those enhancements. So I'll proceed with the enhancements we did this year; I'm just sharing my screen. Okay. The main highlight of this year is that we introduced the HTTP chaos experiments. We started with pod-http-latency, and now we have five different experiments. These experiments target Kubernetes-based workloads: you target a particular pod at a particular port, so by specifying the port you target the traffic going through it. You can inject latency with the HTTP latency experiment, modify the status code of the traffic, inject connection reset events with the reset-peer experiment, modify the headers of the traffic, or modify the body of the traffic. If you want to know more about these experiments, they are now available in our documentation catalog: go into the experiment documentation, and under the pod chaos category you will find the five port-based HTTP chaos experiments. We started with pod-http-latency, and now there are four or five different experiments there. So that's it on pod HTTP chaos. Next is the addition of the AWS AZ experiment.
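Before moving on, here's a rough sketch of what one of the HTTP experiments just described looks like in a ChaosEngine — pod-http-latency, with a hypothetical target label, port, and latency value; env names follow the public docs, so verify them against your Litmus version:

```yaml
# Sketch of a pod-http-latency engine. The app label, port, and
# latency value are hypothetical.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: http-latency-demo
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=frontend        # hypothetical target
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-http-latency
      spec:
        components:
          env:
            - name: TARGET_SERVICE_PORT
              value: "8080"       # port whose traffic is intercepted
            - name: LATENCY
              value: "2000"       # injected latency in ms
            - name: TOTAL_CHAOS_DURATION
              value: "60"
```

The other four HTTP experiments (status code, reset peer, modify header, modify body) follow the same shape with their own tunables.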
This is also a great feature enhancement: this experiment helps users detach availability zones from a particular load balancer. It was added in litmus-python, and it's actually a good example of how you can write your experiments in either litmus-go or litmus-python — in case you're not very familiar with Go, you can use litmus-python to write your own experiments from scratch. Next up: GCP experiments. We already had GCP experiments — GCP VM instance stop and GCP VM disk loss — but they worked with respect to names. With the existing gcp-vm-instance-stop, you had to give the instance name, or a list of instance names, to inject the VM instance stop chaos. But here's an issue you might face: your instances might not always have the same names. For example, if you're using managed instance groups, your instances keep going down and coming up, so the instance names might not stay the same. In those cases, instead of providing names, you can now use instance labels. Similarly, for disk loss the behavior was the same — we could detach a disk by giving the disk names — but with the newly introduced gcp-vm-disk-loss-by-label experiment, you can provide a label for the disk and it will detach the matching disk from the VM accordingly. These are all covered in our documentation, and you can also find ChaosEngine and experiment manifests for them in ChaosHub. These are the new experiments we added this year.
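A sketch of the by-label targeting for GCP — the project ID, zone, and label values are hypothetical, and the env names are taken from the docs at the time, so double-check them for your version:

```yaml
# Sketch: targeting GCP VMs by label instead of by name.
# Project ID, zone, and label values are hypothetical.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: gcp-label-chaos
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: gcp-vm-instance-stop-by-label
      spec:
        components:
          env:
            - name: GCP_PROJECT_ID
              value: "my-gcp-project"    # hypothetical
            - name: ZONES
              value: "us-central1-a"
            - name: INSTANCE_LABEL
              value: "env:staging"       # any VM carrying this label is a candidate
            - name: TOTAL_CHAOS_DURATION
              value: "60"
```

This is what makes the experiment stable under managed instance groups: the label survives even as individual instance names churn.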
There have been many good enhancements on the experiment side too — the Litmus core side, I would say. Let me make it full screen. Yeah. So, we already had our network latency experiment; what's new is a tunable ENV for jitter. For example, say you configure a delay of 10 seconds in your experiment: it injects chaos into the target pods with a constant latency of 10 seconds. But realistically, latency is rarely a constant 10 seconds — you want to be more realistic. In that case you can make use of the jitter ENV: with a 10-second latency and a 5-second jitter, the injected latency will range between 5 and 15 seconds. This helps you simulate more realistic traffic latency and observe how your applications behave under it. We have also made some enhancements to the stress chaos experiments. By default we had tunables for injecting chaos based on absolute values — for example, consuming memory in GBs or CPU in millicores.
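The jitter tunable looks like this in the experiment env block — a fragment only (the surrounding ChaosEngine is omitted), with illustrative values matching the 10s/5s example above; per the docs, both values are in milliseconds:

```yaml
# Fragment: base latency plus jitter (both in ms). With these values
# the injected latency varies between 5s and 15s.
experiments:
  - name: pod-network-latency
    spec:
      components:
        env:
          - name: NETWORK_LATENCY
            value: "10000"
          - name: JITTER
            value: "5000"
```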
For example, you don't always want to consume memory in terms of absolute values; you might want to consume it as a percentage, so you can see how your application behaves relative to its limits. In such cases you can use the corresponding tunables — the memory load or CPU load tunables — to specify the consumption as a percentage, and the experiment will consume memory or CPU in percentage terms. Also in the stress chaos experiments, we added support for cgroup version 2: we already had support for version 1, and now the stress chaos experiments support cgroup v2 as well. Next was a feature request from the community. There are many use cases where, when you're targeting applications in your cluster by giving the app labels and the namespace these applications reside in, you don't want to touch the replicas on certain nodes — or rather, you only want to target the replicas that are on one particular node and no others. Now you can provide a node label as well. For example, say an application has three replicas and one replica resides on node A, and you don't want to touch the replicas that are not on node A: you can provide the label of that node in the node-label tunable, and the experiment will only target the pods residing on node A. That way you reduce your blast radius and get more granular control over the chaos.
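A fragment showing the node-label scoping just described — the node label value is hypothetical, and the env name follows the docs at the time:

```yaml
# Fragment: limiting the blast radius of pod-delete to pods that are
# scheduled on one specific node, via its label. Values are hypothetical.
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: NODE_LABEL
            value: "kubernetes.io/hostname=node-a"
          - name: TOTAL_CHAOS_DURATION
            value: "30"
```

Only replicas matching both the app label and this node label become chaos candidates.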
Next are the cmdProbe enhancements — we made some good additions there this year. In cmdProbe we already had support for source mode, where you provide your own image and the command runs in a newly deployed probe pod. But we didn't have support for providing ENVs, or for the case where your image is private — you may need an imagePullSecret for pulling that private image — and you may also want to pass your own args to that image. So now in cmdProbe we have added support for imagePullSecrets, imagePullPolicy, command args, ENVs, and other tunables. If you want to know more, we have documentation for this too: go into Concepts, and under Probes you'll find cmdProbe. Inline mode means you don't provide an image of your own — the command runs as part of the experiment pod itself. What we're looking at here is source mode: you provide your own image, which can be private or public, and you may want your own customized commands or arguments. In the source section you can now provide the image, the imagePullPolicy, whether the probe pod is privileged or not, whether the container runs in the host network or not, and similarly ENVs, imagePullSecrets, and other settings. So this enhancement in cmdProbe lets you run the probe pods with your private images, in a more customized way, with more control over them.
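A sketch of cmdProbe in source mode using the new tunables — the image, registry secret, command, and env values are all hypothetical, and the field names follow the probe docs at the time of the webinar:

```yaml
# Sketch: cmdProbe in source mode with a private image.
# Image, secret, command, and env values are hypothetical.
probe:
  - name: check-db-reachable
    type: cmdProbe
    mode: Edge
    cmdProbe/inputs:
      command: "pg_isready -h $PGHOST"
      comparator:
        type: string
        criteria: contains
        value: "accepting connections"
      source:
        image: "registry.example.com/pg-client:1.0"  # private image
        imagePullPolicy: IfNotPresent
        privileged: false
        hostNetwork: false
        imagePullSecrets:
          - name: example-registry-secret
        env:
          - name: PGHOST
            value: "db.default.svc.cluster.local"
```

The imagePullSecrets and env entries are exactly the additions described above: they make private images and customized probe commands possible.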
Moving forward — this was a great enhancement too; let's go through it. Say you come to LitmusChaos and want to write your own experiments. We provide an SDK template: if you go to the litmus-go repository, there's a developer guide the community can follow to bootstrap their own chaos experiments from scratch. Previously we had templates only for creating experiments based on the exec model, but as part of this year's enhancements, you can now bootstrap experiments that use the helper pod model, and even the cloud experiment categories — AWS, GCP, VMware, or Azure. So we have templates for all the different categories of experiments: if you check, there are templates for AWS, Azure, the exec model, the helper model, and GCP. You don't have to write everything from scratch; you can use these templates to bootstrap your experiments for any of these categories, and we will continue to add more, so that generating experiments becomes easier for the community — it always helps to promote resilience testing with your own chaos experiments. Next: we added containerd CRI support for the DNS chaos experiments. Previously the DNS chaos experiments only supported the Docker container runtime, but now they also support the containerd runtime, so they work in those environments as well. And the next one was done for service-mesh-enabled environments — this was also a query from the community. It affects you if you are using the HTTP chaos experiments or the network chaos experiments.
In this case, if you're providing destination hosts, then because of how the traffic flows in service-mesh-enabled environments, the way we derive the target IPs for those destination hosts is different. In the network chaos experiments we have a tunable for providing the destination hosts — the particular hosts you want to target. In service-mesh-enabled environments the process is a bit different, so we were not able to resolve the destination IPs for those hosts. These enhancements now allow us to run our HTTP chaos and network chaos experiments in service-mesh-enabled environments as well; the experiment goes through a different process for deriving the target IPs of the target hosts. Yeah. So, next is this one — we had wanted to do this one. In the node- and infra-related experiments we had the AUT (application under test) status checks, the target application checks, but those have been removed now. Because these are node and infra experiments: if you're running node-level experiments, you might not want to check a target application — you want to monitor your target node. In those cases this check wasn't required, and this was also a query from the community, so it has been removed now. And also, for the pod-related experiments — and it is done for all the experiments, I would say — in pre-chaos and post-chaos, whenever you run experiments, we have pre-chaos and post-chaos checks, and in those checks we check the status or liveness of the target application or target node. Those checks can now be tuned: there is a new tunable, the app health check parameter in the ChaosEngine. So what you can do is set it to true or false.
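In the ChaosEngine, this tunable looks roughly like the following — the field name and string-valued boolean follow the 2.x ChaosEngine schema as described in the talk, and the app details are illustrative, so treat this as a sketch:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
spec:
  # skip the pre/post-chaos application health checks
  defaultAppHealthCheck: "false"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  experiments:
    - name: pod-delete
```

Leaving the field out (or setting it to "true") keeps the default behavior of running the health checks.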
So based on that, it will check the app health or node health accordingly. If you set the app health check to false — like you are confident that even if you inject chaos, the target application, target node or target infra will remain healthy — you can make it false, because you don't want these checks. But in case you want them, you can just keep it true; by default its value is true, but you can make it false, so it's optional. So yeah, these were some of the changes done on the Litmus core side, and now moving ahead to the Chaos Center side. Here, on the Chaos Center side, we have had a good number of enhancements as well. Starting with self-signed certificates: this was also a community request, to use self-signed certificates for the communication between the delegate and the GraphQL server. Previously, in case you were using, say, a virtual gateway or an ingress, you might be using your own self-signed certificates, and in that case the communication would break because we were not supporting self-signed certificates. So now we have added support for them. For enabling it — I will show you in helm — in the server ENVs you now have an ENV for the TLS certificate. You can provide the base64-encoded form of your self-signed TLS certificate there, and you can also provide a secret name. So there are two ways to provide the certificate; the first is the TLS secret name.
It depends on how you have deployed Chaos Center. In case you have deployed Chaos Center in cluster scope, the server can go ahead and read that secret, so it will be able to fetch the certificate from it. But in case you are using namespace scope, it might not be able to fetch the secret directly, so you will have to provide the base64-encoded form of the certificate here directly. In that case the server will decode it, and the same certificate will be used for the communication between delegates and server. If you are connecting a new delegate or agent to your Chaos Center, the manifest you generate via litmusctl for deploying your agent is going to carry that certificate in a secret or a ConfigMap, so the subscriber will use that certificate for communicating with the GraphQL server, which also holds the certificate because you provided it here. So this lets us support self-signed-certificate communication between the GraphQL server and the agent, and it makes things easier for users who are using their own certificates. Yeah, so next — this was also a feature request. I would say that this year we got some good and great feature requests from the community, and it really helped us; as I was saying, this year has been a year of learning for us, and this one actually was a learning for us. I will show you through the UI. If we go to the chaos scenarios UI, you might be running your experiments here — you are running your scenarios here. This "executed by" field was added here, but previously it wasn't. So what was the issue, why did we add it, and why was it a feature request from the community?
For example, you are in a project in the Chaos Center and you have multiple users. Any number of users can come and run scenarios. Now, how will you know who ran which scenario? That creates confusion — for example, someone ran an experiment and you had a downtime, or you just want to audit who did the pod delete, or who injected the network loss because of which your target application was not behaving correctly. In that case you need this "executed by" field. What happens now is, in case someone runs a scenario here, it will show the username of the particular user who ran that scenario. This makes the audit easier for users: they will be able to know who ran which experiments and scenarios — it's a good way to audit who executed the experiments. And similarly, in the same manner, we also added the "updated by" field on the chaos scenario — chaos scenario runs are what we execute, and the chaos scenario is what we create. In this case we also added a "last updated by" field, which is also needed: let's say you are the admin of the project and you created a scenario. The next day you come in, someone has updated the scenario, and you are seeing that, okay, it was working fine yesterday but now it's behaving differently. So you want to know who updated it. If you are the one who updated it, your name should be there; but if you created it and I went ahead and updated it, then my name should be there, so that you can reach out to me and ask me why.
Maybe because I might have my own hypothesis for that particular scenario. You just want to know — you might be curious — why that particular scenario was changed, and you might also get some good insights out of it. So that is how "last updated by" is going to help you; both "executed by" and "updated by" were feature requests from the community, and this was also a contribution from the community. So yeah, these are some of the good contributions we got from the community. Next is the ability to configure the self-agent components with node selectors and tolerations. Previously, when deploying the external delegates via litmusctl, we had flags for providing the node selector and toleration, but for the self-agent we didn't. Now, for the self-agent — I will show you through helm; let me zoom out a bit — we have the ENVs for the self-agent node selector and self-agent toleration, so we can use these ENVs to provide the tolerations and node selectors for the self-agent. And for the external delegates, you can always use litmusctl — litmusctl is what we use for deploying external delegates — where we already have the flags. In case you're not familiar with litmusctl, you can check out the litmuschaos/litmusctl repository; it is a CLI tool that we use for deploying chaos delegates and connecting them to the Chaos Center.
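For the self-agent side, the ENVs land on the GraphQL server container. A sketch of what that fragment might look like — the ENV names and JSON value format here are assumptions based on the talk, so check the 2.x helm chart or manifests for the exact spelling:

```yaml
# Fragment of the litmusportal GraphQL server container spec (assumed ENV names)
env:
  - name: SELF_AGENT_NODE_SELECTOR
    value: "kubernetes.io/hostname=worker-1"   # example node selector
  - name: SELF_AGENT_TOLERATIONS
    value: '[{"key":"dedicated","operator":"Equal","value":"chaos","effect":"NoSchedule"}]'
```

As noted in the talk, these need to be in place before the first login, since that is when the self-agent gets deployed.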
So that tool already provides the functionality to set the node selector and toleration, and for the self-agent, as we said, we can provide the same via the ENVs — these are available in the helm chart, and in case you're using the manifest, there too, under the GraphQL server, you can use the ENVs to provide the node selector and toleration. This way the self-agent will also come up with the provided tolerations and node selectors. One thing: just make sure to add these node selectors and tolerations before deploying. When we log into the Chaos Center, the self-agent is deployed at that time, so in case you didn't set them before logging in, you will have to edit the deployment; but if you add them before logging in, the self-agent will be deployed with the provided tolerations and node selectors. Okay, so next: we added support for scheduling the same experiment multiple times in a single scenario. Yeah, this was one issue. If you try to create a scenario from a ChaosHub — we select the ChaosHub, that is fine, and we move forward — when you're adding chaos experiments here, let's say I add a pod delete once, and then I add pod delete again, because I want to run the same pod delete but my hypothesis is to run it against different nodes or different target applications. Previously, what was happening is we were giving the same name to both experiments, and this was creating issues. But now, if you see, I added two different pod-delete experiments and they both have different names. And even when I move forward to the tune-weights step, I will have a different name for each experiment, so I will not be confused about which one I am assigning the weight to.
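Inside the generated workflow, those two copies now come out as distinct ChaosEngines — something like the following, where the engine names and app labels are hypothetical:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-1          # first instance
spec:
  appinfo:
    appns: "default"
    applabel: "app=frontend"  # first target application
    appkind: "deployment"
  experiments:
    - name: pod-delete
---
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-2          # same fault, distinct name
spec:
  appinfo:
    appns: "default"
    applabel: "app=backend"   # second target application
    appkind: "deployment"
  experiments:
    - name: pod-delete
```

The fault is the same in both; only the engine name and the `appinfo` target differ, which is what the duplicate-name bug previously prevented.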
So this way you are able to select multiple copies of the same experiment. Currently, by default, they all target applabel app=nginx, but you can go ahead and edit the target application here, and that way you will be able to specify a different application for each — the experiment is the same, the target application is different. So now you are able to do this; previously, since they had the same name, we were not able to. So, next is support for a custom image registry inside the experiment. For this, let's go to the Chaos Center and the Settings page. We had introduced an Image Registry tab here in Settings. What it does is let you provide your own image registry. If it is public, you can leave it as-is; if it is private, you can provide the image pull secret. So you can provide your own image registry here, and what it does is update the images. Let's go to manifest generation: say I go ahead and select a ChaosHub — if you see, there is a checkbox, "enable image registry changes". If you don't want to use your own images — say you were using private images for running experiments but you just want to use the Litmus images — you can disable it; or in case you want to use the private registry that you specified in the Image Registry tab, you can enable it and then move forward. Now, what will happen: let's say we select a pod-delete experiment, come here, and go into Edit YAML. Here, if you see this image — and these are all the workflow-level images — these images will be updated with your own registry; the private images you are providing will be substituted in if you enable the checkbox.
So this way, whatever pods get generated by this workflow, and whatever pods get generated as part of the chaos injection, will all use the private registry that you specified under the Image Registry tab. You will be able to use your own image registry, and you don't have to go into the YAML and update the images by yourself manually. You can just enable that checkbox and provide your image registry details in the Image Registry tab, and that way it's easier for you: with a single checkbox you can enable or disable the image registry override process. So yeah, this was the image registry enhancement we did this year, and it was also a feature request from the community — because, for example, if you are using a private registry, you don't want to be overriding the images in every manifest you are going to deploy; it's better to set it up in the UI once, and then you can just keep scheduling your workflows with your private images without always updating them. This actually helps in a great way. Next is Envoy proxy support. Previously, in the frontend NGINX we were using HTTP version 1.0, which was not compatible with Envoy proxy. So we upgraded the HTTP version in the NGINX config — this is the ConfigMap that we use for controlling the frontend NGINX configuration; it contains the default conf for the frontend NGINX. Previously we had 1.0 here, but now it contains 1.1, which helps us support Envoy proxy as well.
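The NGINX side of this is essentially a one-directive change; a sketch of the kind of stanza involved, where the location and upstream name are illustrative rather than the actual Litmus frontend conf:

```nginx
location / {
    proxy_http_version 1.1;            # was 1.0; Envoy requires HTTP/1.1 for upstream traffic
    proxy_set_header Connection "";    # allow keep-alive connections with HTTP/1.1
    proxy_pass http://litmusportal-frontend-service:9091;
}
```

With `proxy_http_version` left at its 1.0 default, Envoy-fronted setups (such as Istio virtual gateways) reject or mishandle the proxied requests, which is the incompatibility the upgrade fixed.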
So this was the issue raised: if you're using an Istio-enabled environment, you might be using a virtual gateway instead of an ingress, and the virtual gateway uses Envoy for redirecting the traffic. So in that case we wanted to support Envoy proxy too, and now with this we are able to support plain NGINX as well as Envoy proxy. This was a great enhancement, and a feature request from the community. So yeah, the next one is the advanced tuning feature for experiments. Let me show you in the UI — let's go back and come back here. So, previously these advanced options that we added here were not present; you were only able to update the steady-state details and the target application details. But now, if you see, there is one more section here. One part is that you can update the advanced configuration at the workflow level, and the other is that you can update the advanced configuration in the ChaosEngine. These settings — node selector, toleration — are going to apply to the chaos pods, the chaos-related pods. For whichever experiments you have, you can enable it and add your node selector here, or enable it and add your toleration here. And similarly you can enable this one, which is for the annotation check. This is already a core functionality from Litmus core; what it does is allow you to reduce the blast radius. For example, you have three or four — not replicas, I would say three or four different applications — which have the same label. Now, you only want to target one application, but they all have the same labels, so by default chaos is going to be injected on all the applications that have that label.
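In manifest terms, this scoping maps to an annotation on the workload plus the annotationCheck field on the ChaosEngine — roughly as follows, with the names and labels being illustrative:

```yaml
# On the one Deployment you actually want to target:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  annotations:
    litmuschaos.io/chaos: "true"   # opt this app in for chaos
---
# In the ChaosEngine:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: frontend-chaos
spec:
  annotationCheck: "true"          # only annotated apps matching applabel get targeted
  appinfo:
    applabel: "app=web"            # label shared across several applications
```

Several workloads may share `app=web`, but with the check enabled, only the one carrying the annotation is selected.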
So, to limit the blast radius to a single application, or to just those applications you want to target, you can add the annotation litmuschaos.io/chaos: "true" to them and then enable this annotation check; in that case the experiment will only target the applications carrying that annotation. This way you are able to reduce the blast radius at the experiment level too. And similarly, we also added advanced options at the workflow level. The ones we just showed were at the experiment level — and there can be multiple experiments, so you can configure each of them there — but at the workflow level you also get the node selector, so you can go ahead and add the node selector or, similarly, the toleration here as well. And there is one more tunable to enable: the cleanup policy for the chaos scenario pods — how do you want to do the cleanup? For example, you're running a workflow and you want to clean up all the pods after the workflow completes, or only after the workflow succeeds. Say you want to debug why your experiment failed: in that case you might want to set the pod GC to on-workflow-success, so that the pods are only deleted when the workflow was successful. If the workflow failed, the pods will not be deleted, so you will be able to debug why a particular experiment failed or why a particular workflow failed. So this was added as part of the advanced configuration at the workflow level and at the experiment level. Okay, so next: we also added support for connecting a remote ChaosHub.
So yeah, this was a great addition; I will tell you what the issue was and what this is solving. We already had the connect-a-ChaosHub feature, but let's say you are in an air-gapped environment and you only have access to, say, a GCS bucket or an S3 bucket — you don't have access to a GitHub repository or GitLab, or any Git source. In that case, you might want to keep your chaos charts — the ChaosHub — in your S3 bucket, GCS bucket or any other bucket. Then what you can do is provide the URL for it here, and provide the name here. And if you see, there is one warning: the zip name and the ChaosHub name should be the same. When you push your ChaosHub to the GCS or S3 bucket, you have to zip it so that it is a single file — a zip file — and the file name and the ChaosHub name that you provide here should match. What it will do is go to the URL that you provide for the GCS or S3 bucket, download the archive, unzip it, and you will see the same ChaosHub added here as a card. When you go inside, you will be able to explore the experiments and the different chaos scenarios that you added as part of your custom ChaosHub. This helps you become independent of a Git source: you can also connect your ChaosHub via an S3 or GCS bucket, or any other bucket. So, the next one — this enhancement was added to solve a problem. First, we added an API for fetching the server version, and we also added the litmusctl compatibility matrix. Previously — let's go to litmusctl and come back to the README.
We didn't have the compatibility matrix, and that was creating issues, because a particular version of litmusctl might not be compatible with a particular version of the Chaos Center. If you are using litmusctl in your CI/CD or automation pipelines, you would have to upgrade your litmusctl whenever you upgrade your Chaos Center, because the litmusctl you are running as part of your pipeline might not be compatible with the new Chaos Center. And from a debuggability perspective, litmusctl might fail and you might not be able to debug it; you would want to know why it is failing, and the issue could be the Chaos Center and litmusctl version compatibility. So now we have added the version compatibility matrix in litmusctl, and there is also a command, litmusctl version. If you run any command via litmusctl that is going to communicate with the Chaos Center, it will check the versions first — the version of the Chaos Center against the version of the current litmusctl. If they are compatible, the request will go through and your operation will be successful; but in case the versions are incompatible — let's say you are using litmusctl 0.7 against a Chaos Center version 2.9.0 — it will give you an error that these versions are not compatible and the request might not be successful, so it is better to upgrade litmusctl to the indicated version, say 0.10. So now it makes debuggability — and, I would say, upgrades — easier, and you will get to know about a mismatch much faster.
An API was also added, because there are some community members who are not using litmusctl but are still using the APIs for their automation. They want to know, in case they upgrade their Chaos Center, whether the API they are calling is compatible with the current Chaos Center or not. What they can do now is call this server API to get the version of the server; then, if the version matches, they can make the query, otherwise they need to upgrade. So these are the issues it solved, and this was also a query from the community, because many users were facing these issues with respect to upgrades. Next is the Chaos Center UI endpoint ENV. This is for cases where you are using Istio-enabled environments, or any other environment where you're not going to use an ingress: you might be using a virtual gateway, and in that case you might be providing the host and other details there. The GraphQL server is not aware of those custom resources, so it cannot fetch the host from the virtual gateway. It is aware of an ingress — it can go ahead and fetch the server host from the ingress, it can fetch the node IP from the nodes, and it can fetch the load balancer IP from the server service — but there's a limit. So, to solve that issue, there is a new ENV, the Chaos Center UI endpoint. Your host you already know, because you are going to access your Chaos Center on that host. So what you can do is go into the ENVs of the server — there is a Chaos Center UI endpoint ENV there — and set the host.
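On the server deployment this is just one more ENV; a sketch of the fragment, where the ENV name follows the naming used in the talk and the domain is an example — verify the exact spelling against your chart version:

```yaml
# Fragment of the GraphQL server container spec (ENV name assumed)
env:
  - name: CHAOS_CENTER_UI_ENDPOINT
    value: "https://chaos.example.com"   # the host you access Chaos Center on (example)
```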
What happens is, whenever you connect a new delegate, the delegate will be provided with this URL, so that it knows it has to connect to the server via this URL — not through the node IP or anything else, because that is not going to work in case you are in an air-gapped environment. So this is going to help you mostly in Istio-enabled setups, or in case you are in an air-gapped environment where you have your own domain on which you access the Chaos Center, which the Chaos Center itself might not be aware of. You can provide the domain or the host here so the server can be made aware of it, and the delegate, as a result, is made aware of it too. So yeah, those were the enhancements on the Chaos Center side. Next is litmusctl. This year there was a great contribution done by Pryak, as Prithvi shared: he helped us by contributing scenario CRUD operations to be done via the CLI. So we now have support for CRUD operations which we can do via litmusctl. As I was saying, users might be using litmusctl in their CI/CD pipelines; in that case you can now run scenarios via litmusctl, or describe a scenario, or get the scenario runs. This will help you automate your CI/CD pipelines, because you can create a scenario and also get the scenario runs, so you can check the status too, using some bash scripts and other things — you will get to know whether the experiment passed or failed. You can also do delete operations and other things. So this is a feature the community has been asking for, and it is a great contribution done by Pryak.
Next are some enhancements we did in litmusctl alongside the new features: we added a flag, --kubeconfig. In case you have multiple kubeconfigs and you want to target a particular cluster via a particular kubeconfig, you can just provide the --kubeconfig flag with the path to the kubeconfig and it will work. Next is what we were discussing on the previous slide: we added the version mapping in litmusctl with respect to the Chaos Center, so it will allow you to check the compatibility of litmusctl and the Chaos Center, and it will also make things easier in your automation pipelines, based on versioning. So yeah, those were all the updates on the Chaos Center side, the Litmus core side and the litmusctl side. And if you look at the whole picture — at how the new feature requests and enhancements came about — most of them arrived via community feature requests, and many of them were delivered via community contributions as well. So thanks for raising such great requests; it really helps us move further. So, next: Litmus 3.0 beta. Let's discuss what we have on the roadmap and the different things we are looking into. For 3.0 beta, there have already been two releases — beta0 and beta1 — which you can check out. They are still in beta, so we don't suggest them for production yet, but you can surely try them out, check what is new coming in there, and give feedback on what is there and what we can improve in those versions.
So, there are three aspects to it: how we are going to make it more robust, how we are going to make it leaner, and how we are going to make it more developer-focused. First, let's start with robustness: improved chaos orchestration. This is mostly focused on the residue that stays on your cluster after doing chaos. For example, you are running a particular experiment in your cluster, and a pod gets evicted or something like that happens; in that case your chaos pods might be left living in your cluster in an evicted or errored state. To make this better — and to surface it properly to the frontend, because we need to know what actually happened on the cluster — we are going to improve things so that no chaos resources are left staying on your cluster, and in case something does happen, we get to know about it in the UI. There will be some changes for this, mostly on the core experiment side, because we have to reconcile on those pods — as an example, we have to reconcile on pods that get evicted. So who is going to handle such situations? The chaos operator might be the one, or the chaos runner might be the one. They have to reconcile on those pods, check the status, take a decision based on that, and the same decision has to be reflected in the UI.
Those are the kinds of things we are going to improve, so that we can save you from having to go to your cluster — you should be able to stay in the UI. Next is the helm-based agent. This has been a good ask from the community, and we agree with most of it: if you're using litmusctl for connecting your delegates, you don't have much control over what that particular manifest contains. You may want to look at a CRD or the RBAC — not necessarily change it, but just see what is in the RBAC and what we are going to install when you connect a delegate via litmusctl. So now there is a new chaos agent coming which will be helm-based. With this helm-based agent, you can just run helm commands to connect your chaos delegates to the Chaos Center, and because it is a helm chart, you can have your own custom chart — you can fork it, or you can have your own custom values.yaml with your preferred settings already present, and just use it directly. And because it is a helm chart, the templates are going to be visible to the community: in case you want to know which RBAC rules are going in, which CRDs are going in, which deployments are going in, and how we are doing all this, you can surely go to the templates and check them out. So this is something on the roadmap which will be available very soon. Next is simplified UX. We already have a UI that shows you how to construct complex scenarios, and you are already able to run your scenarios; but now what you want is to know what happened on the cluster, as I was discussing.
Say something happened on your cluster and your pods were evicted or killed; you want to see that on the UI rather than going to the cluster, and you want to know the impact, how your application behaved, and so on. This will be an enhancement coming in the UI, and we would also like your feedback on what you expect to see there. So that was the robust side; now, how are we going to make it leaner? First is native workflows. Under the hood we currently use Argo for workflows, which is essentially a manifest containing stitched-together experiments, so it can contain multiple experiments. As an example, let me schedule a workflow here and add one experiment; let's take this one. This manifest is a workflow, and inside it you have the stitched experiment and the ChaosEngine. But you may not want to run the complete workflow; you may want to trigger the ChaosEngine directly, without running the whole workflow just to fire one chaos experiment. Currently the ChaosEngine is embedded in the workflow as an artifact, but it is on the roadmap to enable users to trigger ChaosEngines directly instead of going through a workflow. We will also be looking into Litmus-native workflows: since we currently use Argo, we might introduce our own Litmus-native workflow so that
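For context, this is roughly the shape of the ChaosEngine that sits inside the workflow as an artifact; the roadmap item is to let you apply or trigger such a manifest directly, without wrapping it in an Argo workflow. Names, labels, and values here are illustrative, so adapt them to your own application:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos          # illustrative name
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=nginx      # label of the target application
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
```

Today this block is embedded in the larger workflow manifest; being able to run it standalone is what "trigger the ChaosEngine directly" means.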
we have more control over it and can make scheduling scenarios easier and more straightforward. Next is one that came as a comment from the community. Currently, if you are using stress chaos or network chaos experiments, they create helper pods, and if you have 10 replicas of the target application, 10 helper pods get created. Since you are targeting the same application, 10 target pods means 10 helper pods coming in, and you might hit a situation where you don't have enough resources to accommodate those helper pods. To make this more scalable and more resource friendly, we are no longer going to launch a helper pod for every target pod. Instead, we are going to launch one helper pod per node, and each helper pod will target all the target pods residing on its node. That way we reduce the number of helper pods being launched, and at the same time we reduce the impact of resource contention and similar issues. The last aspect is how we are going to make it more developer focused. Currently we use the chaoslib library in our chaos experiments when we schedule workflows, but we are looking into enabling chaoslib to run experiments without a workflow as well. That integration might be coming in the future; it is definitely on the roadmap.
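The scaling win from the per-node helper change can be sketched with a small calculation: instead of one helper pod per target pod, you launch one per distinct node hosting a target. The pod-to-node mapping below is made up for illustration.

```python
def helper_pod_counts(pod_to_node):
    """Return (old_count, new_count) of helper pods.

    Old scheme: one helper pod per target pod.
    New scheme: one helper pod per node that hosts at least one
    target pod; that helper injects chaos into every target on
    its node.
    """
    old = len(pod_to_node)                 # one helper per target pod
    new = len(set(pod_to_node.values()))   # one helper per node
    return old, new

# 10 replicas spread across 3 nodes: 10 helpers before, 3 after.
targets = {f"nginx-{i}": f"node-{i % 3}" for i in range(10)}
print(helper_pod_counts(targets))  # → (10, 3)
```

The saving grows with replica count, which is exactly where the old scheme risked resource contention.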
Then there is code refactoring, which is a continuous process: reducing duplicate code and optimizing things will keep happening, so that stays on the roadmap as usual. And then improving the SDK: as I already showed, we added support for multiple templates, AWS and others, and we will keep working on it and growing it to make it more developer friendly, so it helps you generate your experiments from scratch more easily. It will be a great enhancement there. So yeah, that's everything we are looking forward to in the 3.0 roadmap. I would say this has been a great year for us; we got some good contributions, as we discussed in the previous slides. Many of them were feature requests from the community, and many were actually contributed by the community, which is great. Thanks to everyone for that. That's all from my side; I think we can take it from here. Thank you so much. Maybe we can move on to the last slide, if you can share it for us. Yeah, thank you so much for sharing all the enhancements and developments we have had over the year. Lastly, as we said, here is how you can get involved with the community. The GitHub repo is out there, so feel free to check it out; it has most of the information. There are the docs, which can help you get started with the various functionalities, and the ChaosHub, from which you can access your chaos experiments. Join us on the Kubernetes Slack; it's the #litmus channel.
And feel free to check out the YouTube channel as well as the Twitter, to make sure you stay connected with us on the socials. That's how you can get involved in the community: once you join Slack, feel free to ping us and post your questions there, and the maintainers, core contributors, and the wider community will help you get started. Thanks again, everyone, for tuning in, and I hope this webinar was really helpful to you. With this, we look forward to an amazing 2023 and hope to see you as part of the chaos engineering community. Thank you so much, everyone. Thanks, everyone.