Hola, namaste, and a very warm welcome to this talk, "Automating Chaos with LitmusChaos to Ensure Kubernetes Application Resiliency." First things first, I would like to thank the organizers for organizing this amazing KCD Africa edition, and I would also like to thank all the viewers, participants, and speakers who have made this conference a success. I look forward to many more editions of KCD Africa.

Moving on to introducing myself: my name is Prithviraj and I'm based out of Bhubaneshwar, Odisha, India. India is going through a very tough time due to the COVID-19 crisis; I hope everyone is praying for India, and we pray that we get out of this situation very soon. I work at ChaosNative as a community manager for LitmusChaos, and I have been working with LitmusChaos for a year now. I joined MayaData back in 2020 and then ChaosNative, which focuses on chaos engineering itself. I am also a co-organizer of Chaos Carnival, a global conference that hosted its inaugural edition this February. I organize the Kubernetes chaos engineering meetups, part of the CNCF community, every last Saturday of the month, so you can join in and talk about Litmus or chaos engineering on Kubernetes, and we can learn a lot together.

My co-speaker is Sayan Mondal. There's a lot to say about him, but he mainly works as a software development engineer as well as a chaos engineer, "chaos engineer" being a new term coined after chaos engineering came into play. He's going to talk more about the technical aspects, and I'll start with the introduction.

So, moving on to the agenda that we have in hand. First things first, we'll be talking about chaos: what is chaos engineering? A lot of people might already know the term and know what chaos tests are exactly, but this is also for the ones who do not know about it. Then we'll be moving on to chaos engineering the cloud native way, for Kubernetes systems, what exactly LitmusChaos is, and for the 
rest, Sayan will be taking it forward, talking about CRDs and how you can install Litmus, giving a technical demo, showing you dashboards, and showing how you can deploy your first chaos experiment, your first chaos test.

First things first, I would like to thank the CNCF for thinking about chaos engineering as one of the technologies to look up to, one of the most sought-after technologies in 2021 and beyond. I think you all might agree that chaos testing is coming up in a grand way. So what exactly is chaos testing? Before that, I'll start with an example. As you can see, my slide shows that downtimes are expensive, so I'll give you a short example of how you can think about it. In India the e-commerce market is developing in a huge way, and Amazon and Flipkart are the two major tycoons of this business. They usually host annual sales, branded as Big Billion Days or, let's say, the Amazon Great Indian Festival, and these sales see a lot of people jumping in to grab the offers. There's a spike in the number of users, and sometimes it happens that the spike causes an outage, a downtime, which leads to a loss of millions or billions of rupees or dollars for these companies.

Chaos engineering is a term coined by Netflix. Netflix started off with chaos way back in 2011 to test their systems and check how they function under chaos. Basically, what chaos does is nothing but induce a fault in the system to figure out how the system will actually react when there's an outage or a downtime. So it is basically predicting beforehand what will happen in the future, so that the outage doesn't occur when the system is in production.

So why is chaos testing important, exactly? Why should you think about chaos? Because you need to test; you 
shouldn't wait. We believe in the chaos-first principle: why test your system only when it is already going through an outage? Why not test it before? This also helps in activating the feedback loop in the DevOps system. DevOps engineers and SREs need to focus on chaos engineering so that they can go ahead with proactive testing in production, in staging, in CI/CD, anywhere, and so that they can actually predict what can happen and save their systems from these outages.

So what exactly is the state of chaos engineering till now? There are standard practices. A few big companies like Amazon, Netflix, Flipkart, Apple, and Google have already started practicing chaos in some way or the other. But this is limited to experts and enthusiasts: those who are already aware of what chaos is and what exactly chaos does are the ones more into adoption. Slowly people have started adopting it, although there's still time to go. And it's part of large deployments: various companies, let's say IBM, and various folks are applying chaos in large deployments. But we believe that small deployments and individuals should also get started practicing chaos. As of now, for those who have already burned their hands, those who have already seen how an outage occurs, chaos engineering is a resolution; but a chaos engineering practice can be a resolution for each and every engineer and each and every company.

Moving on, we can see Kubernetes. As we know, we are talking about Kubernetes here. It has already crossed the chasm and reached the mainstream market, and as you can see, the majority of people have adopted Kubernetes in some way or the other to build their infrastructure or architecture. But what about chaos engineering? How is chaos engineering faring in the chasm? We believe chaos engineering is yet to cross the chasm; it's still in the early market 
and we are seeing early adopters. There's early adoption, but it's yet to go to the mainstream market in the form of a huge business or huge adoption. There's still a lot to do and a lot to develop so that people can adopt chaos, and there's a lot of fear in people's minds as well: should I adopt chaos? Undoubtedly, you should go ahead and adopt chaos engineering and chaos testing for your infrastructure and systems.

Moving on, how is it typically done? How is chaos engineering practiced? Usually chaos engineering is practiced through game days, and some have already integrated it into their CI/CD. But as of now chaos engineering is practiced basically by QA engineers, SREs, or DevOps engineers; most developers have still not started engaging in chaos engineering. We believe that every developer should try and practice chaos in some way or the other; if not today, think about it tomorrow. Manual planning and execution is happening, and it's obviously necessary, but preparing a roadmap is something we believe can help enterprises come forward and adopt chaos. Observability is a very important aspect: people need to think about monitoring these chaos tests, which is obviously not a commodity; everyone needs to think about observing the chaos test and what is happening. So as of now there's a long road ahead. People need to start looking at the practices they need in order to adopt chaos or get started with chaos engineering, because eventually it's going to increase your resiliency and your reliability.

Moving on, we'll be talking about cloud native chaos engineering. What exactly is cloud native chaos, and how does chaos engineering function in the Kubernetes world? Kubernetes is a very dynamic space, and every now and then, with such a huge amount of adoption and with highly dynamic applications being built, there 
might be an outage or a fault here and there, and there are security concerns, or there might be concerns about testing among engineers and SREs. So how does chaos engineering come into play in the Kubernetes space? For example, there might be an outage in a pod or a node, so experiments like pod delete or node delete, respectively, come into play and help you understand and analyze how your system might behave. There are a lot of experiments that can come up; I have heard about black-hole experiments, and the experiments that are possible are unlimited.

But what exactly is cloud native chaos engineering? These are the principles we believe in for cloud native chaos engineering. It needs to be open source. Open source is the future, and every other technology is coming out to be open, so the cloud native world is obviously based on open source technologies. That further leads to community collaboration: it's very important for the community to come out and suggest changes, or work together hand in hand, suggesting what additions can be made, what issues can be created, and how a project can develop. With chaos engineering having custom resources and all the YAML files, I think an open API and life cycle management are very important for chaos engineering to be cloud native. And then scalability: we do not talk about scalability much, but it is very important, and that is why, as the principles of Kubernetes are being changed by GitOps, GitOps is very important for chaos engineering to be adopted and for users to get a whole new experience. And observability: as I mentioned, monitoring is very important. These experiments and how they function should be monitored, so open observability helps users monitor these experiments properly.

Moving on, Litmus, as you know, follows all these principles. It is a CNCF sandbox project with a pool of amazing experiments. We will go 
on to the next slide and talk about it. It follows all these principles, and it is built for modern chaos engineering. It is coming up in such a way that the dynamism of chaos engineering will change, and testing will become feasible and easier for each and every engineer out there testing on their Kubernetes architectures.

So what exactly is Litmus? Litmus is nothing but an open source toolset for practicing highly scalable chaos engineering, for SREs, developers, Kubernetes engineers, and software engineers who want to practice chaos tests in some way and bring resiliency in some form to their Kubernetes applications. As you can see from the stats (stats do not define a lot, obviously), it has a pool of contributors and 1,600-plus GitHub stars. We are talking about community here at the CNCF KCD Africa, and community is what matters. When I joined last year, this project had just 60 Slack members and 500-odd GitHub stars, but now you can see there's a pool of experiments, the GitHub stars are increasing, and the project is seeing an amazing amount of traction in the community. We believe that in the upcoming years chaos testing is going to be the go-to technology to look forward to. With this, I would like to hand over to my co-speaker Sayan Mondal, who will be getting in depth into LitmusChaos and Kubernetes application resiliency. Thank you, folks.

Thanks, Prithvi, for sharing about chaos engineering and Litmus in general. I'm Sayan, a chaos engineer at ChaosNative, and I'm going to talk more about how you can use Litmus to inject chaos for your particular use case, for your enterprise needs. To start off, I am going to show you how you can install Litmus. There are two ways you can do it: either using Helm, or by directly applying the manifest. If you visit litmusdocs-beta.netlify.app and move over to the installation section, under the control plane you should be 
able to see installation using Helm and installation using the kubectl apply command with the manifest. This is the 2.0 beta manifest, the current version we have. You can use that version, or you can go to the GitHub repository, github.com/litmuschaos/litmus, and go to the folder called litmus-portal. Once you are inside the folder, scroll down to the README section and you should be able to see installation using the k8s manifest. You can apply either the master (latest) cluster-scope manifest or the 2.0 Beta 4 manifest. What I'm going to do is use the master manifest for this demo. To apply it, I have minikube running right now; you can use kind or k3s, or even do the same on your cloud provider. Now that I already have minikube running, I'll just use this command to install Litmus in my local cluster. While that is happening, I'll watch the litmus namespace, where I should be able to see these three things.

While that is happening, I want to come back to my slides and explain what exactly happens under the hood. If you take a look at this diagram, these are the CRDs that Litmus leverages. The three primary chaos CRDs we have are ChaosExperiment, ChaosEngine, and ChaosResult. The ChaosExperiment comes first, and it is fetched from ChaosHub. Now, what is ChaosHub? We have a list of eight different experiments listed in our hub, which is hub.litmuschaos.io. If you visit this hub, you will see a lot of public chaos experiments already up and running for you. You can pick any one of these predefined experiments and either build on top of it or just use it as is. This is what ChaosHub is: you can pull an experiment and inject chaos directly with the predefined configurations. So the chaos experiment 
YAML is what pulls this particular chaos experiment. The ChaosExperiment CRD is basically the low-level definition of your chaos experiment itself, with the default tunables. The ChaosEngine is what binds the application instance to this particular experiment; a ChaosEngine is what triggers the chaos injection in your application. So a ChaosExperiment only installs the experiment, it does not inject it; the ChaosEngine binds your application instance to the experiment and injects it. The ChaosResult is what stores all the result details, like the status of your experiment. It will save the verdict and, if you have probes in your system, the status of your probes, the success percentage, and those things. So it basically stores the metrics.

What the chaos operator does is keep a watch on the ChaosEngine. Whenever a ChaosEngine is triggered, it spawns a chaos runner, and the chaos runner pod is responsible for spawning multiple chaos jobs. These chaos jobs are nothing but the particular chaos experiment you're running. Let's say you want to run a pod delete or a container kill: the chaos runner pod would spawn the pod-delete or container-kill jobs that you have scheduled as per your tunables. You can do all this tuning in the ChaosEngine itself. In whatever way you have tuned and overridden the basic tunables, the chaos runner will generate successive chaos jobs in that manner. You can change the chaos duration, add annotations to control the blast radius, and all those kinds of things. So that's it at a high level.

Let's go back to the previous watch statement and take a look. We can see that this has finished. If I get out of the watch command and take a look at the installation steps, you can see that it has created this namespace, litmus, and the ConfigMap, 
Deployments, Services, the role bindings, and all the dependencies that are needed for Litmus to run. If you install with Helm or apply the beta manifest, you need to create the litmus namespace first; if you go through the installation documentation, you can find all these details. But if you apply the latest master manifest, it will create the namespace for you.

Now that we have Litmus installed, let me go back to the watch statement. Since this is going to be handy, I'll open another tab and take a look at the services that are currently in the litmus namespace. We can see that there's the Litmus portal frontend service, the server service, and the mongo service. We need the node IP followed by the port in order to visit our frontend service. In this case I'm going to use my minikube IP with this particular frontend port, so it's going to be the IP, a colon, and this port. Let me just copy over the port, and with that I should be able to access the frontend service of Litmus.

Now that we have this, I'm going to log in with the admin and litmus credentials: by default the username is admin and the password is litmus. Once you log in, since you're a new user and this is a fresh cluster, you will be greeted with a project-creation onboarding step. You need to confirm your password, or create a new password, at the time of the first login. I'm just going to keep it the same, and my project name will be demo. This is only a one-time thing for new users, so if you create another user as an admin, they will have to go through this onboarding process just once.

This is the Litmus control plane, and there are multiple things you can do here. You can schedule a workflow. You have one agent right now. You can see your project at the top, and you can see your details if you have set up an email, which you won't have if you 
are logging in for the first time; you have to set it up manually. You can set up your email, log out, or edit your profile, and you'll see all the project-related details, like who you are sharing your project with. You can invite other people into your project as well and give them different types of access: viewer, editor, or owner.

Then there's something called an agent. What is an agent? When you install Litmus, by default we create a self agent, which is already running in your cluster, and we will be running all the chaos experiments through this agent. If you also want to include an external agent, you can do so: you can connect an external agent by clicking on this, and you have to go through this process to download the litmusctl binary and connect your own external agent. We are going to use the self agent for now. If you go to settings, you have the teaming option, where you can invite new team members who are already a part of Litmus and choose their role, like I mentioned.

There's another section called workflows, which is where your workflows will show up. Whenever you schedule a workflow, this is where you'll be able to view all its details. If you create a schedule, that is, if you do not just run it once but want it to be a scheduled cron job, you can do that, and all the schedules will be visible here. You also have some predefined templates that you can use to inject chaos directly without any configuration.

What we're going to do is schedule a workflow on the self agent that we already have. Moving forward, we have four different options. We can create from a set of predefined workflows, which is the same as the templates. And let's say you have a particular pod-delete or a 
certain template which you want to modify to extend its capabilities. Let's say you have pod-delete and you want some specific pod-delete configuration tailored to your enterprise requirements. You can modify it and save it as a template, and from the next time onwards you can click on this option, create a new workflow by cloning an existing workflow. This is where you'll find the template that you have saved, so you can use that template and run your workflows again and again with the same configuration.

Then you have this option of MyHub. Like I already showed you, there's a ChaosHub; MyHub is essentially your own personal chart. For example, I have this demo chart here. You have to make sure that it is exactly in this format: you have to have a charts folder, and inside it you can create your own chaos experiments. I have the generic experiments, which are all these: node-taint, node-drain, disk-fill, disk-loss. I also have coredns, which holds the coredns-pod-delete experiment, and these are the different pieces of metadata: the icons, the package YAMLs, and the experiment YAMLs.

Now if I go to the portal and select the hub, I'll only be able to see ChaosHub. If I view it, it is exactly the same as the public hub. We already have this built into the Litmus portal, so whenever you open the portal the hub should already be there. But if you want to connect your own hub, you can choose the connect new hub section. I'll name my hub demo hub and give my Git URL, which in this case is this one, and I want the branch to be master. It's a public repo, so I'm using it as public. If you have a private repo, you can choose either SSH or the access token, and just add the deploy key to your Git repository; that should 
work. If I submit it now, the demo hub should be successfully added to my list of ChaosHubs. If I view this demo hub, I should be able to see all the generic experiments as well as the coredns experiment, which is the only one I had in this Git repo. So this is a feature that allows you to create your own custom charts for your particular enterprise use case. You can create your own chaos experiments, push them to GitHub, and then use your own hubs to schedule chaos according to your needs.

Now I go to workflows and schedule a workflow. There's also an option to import a workflow from YAML: if you already have a custom YAML, you can click on this and drag and drop or upload it, and the experiment will be picked up from that YAML. I'm going to select the demo hub that I have just set up, and I will keep the experiment name as demo. Let me move forward. Right now I don't have any experiments. When I add new experiments, let's say a container-kill and a pod-delete experiment, I should be able to see a visualization on the right side, which gives me an example of how this workflow will look when I finalize it. This is just a predicted visualization of what should happen once we schedule it. You can of course click on edit YAML and see the YAML configuration. Like I mentioned, there are ChaosExperiments and ChaosEngines: at the top you have all the ChaosExperiments, and if I scroll down to the bottom I should be able to see the ChaosEngines. These are the ChaosEngines of the two experiments that I just added. All right, and there's 
an option called revert schedule. Revert schedule allows you to revert all the chaos that has happened. If you turn it to false, it will not revert it, and your chaos metadata, all the experiment details and job details, will be persisted and remain present there. Now if I click on next, I can adjust the weights of the particular experiments, and I have the option to create a recurring schedule or to schedule now. This is the summary of your entire workflow. If I schedule it now and click on finish, I should be able to view the workflow that I just scheduled. If I click on the workflow's name, or go back and click on the option to show the workflow or show the analytics, I should be able to see the workflow details.

This is how the workflow details look. This is the current step that is running: it will install all the chaos experiments, and then it will spawn the pod-delete and container-kill experiments. Once we click on a particular node, it will also give us the chaos results as well as the logs. This is the master node, which is why we are getting an entire overview. This is just the graph view; if I move on to the table view, I should be able to see the same information as a table, which gives us each individual node's duration, how long it took, and the logs and the result.

If I take a look at the watch statement that I already had running, I should be able to see all these different services: the chaos exporter, the chaos operator, the subscriber, the workflow controller, and the event tracker. These are the same things that I talked about when I was discussing the CRDs. And this is the chaos experiment that we just created: this one, demo, is running. Since it has installed the experiments, it will create two more, pod-delete and container-kill, and those will be working on this particular target 
that we have. The current target is an nginx application, which is already built in. If you want, you can change the target to the particular application that you want to inject chaos on; of course, that will change based on your YAML configuration. That's how we do it. So let's just wait for this experiment to finish.

Now that the install-experiment step has finished, we have container-kill in the pending state. If I visit the watch statement again, I should be able to see another instance pop up, a pod in the initializing state: this is the container-kill chaos experiment that is trying to install itself and inject chaos. After container-kill we'll also be able to see pod-delete spawn. Now you can see the container-kill runner is there. This is the chaos runner, and it is spawning the chaos jobs. If I visit the container-kill node, I should be able to see the logs as well as the chaos results. Right now it's running, which is why we don't have any chaos results; once it finishes, we'll be able to see them.

So that's it for the portal and how you can inject your own chaos. If I also go ahead and change my chart details in this section, say the generic pod-delete, if I modify the pod-delete experiment here and come back to scheduling a workflow and view that experiment, I should be able to see the changes there too. Whatever change you make in GitHub will also be reflected in your workflow. And that's all about Litmus and how you can leverage it, from a high-level view.

This is the exact architecture that I talked about. You have the portal, your chaos workflows, the metrics and the events, the operators, and the different experiments that are running. These are all exported into chaos results. You can also monitor the chaos results using 
Prometheus, with the chaos-interleaved analytics. You can also run this on a bare-metal environment as well as on public clouds such as AWS or Azure, or on VMware. So that's all from my side, and that's how you can leverage Litmus for your own enterprise requirements and inject chaos as you want. Thank you.
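
As a rough illustration of the ChaosEngine described in the talk (the resource that binds an application to an experiment and carries the tunables), here is a minimal Python sketch that builds such a manifest as a dictionary and prints it as JSON, which kubectl can apply in the same way as YAML. The field names follow the litmuschaos.io/v1alpha1 ChaosEngine schema as it is commonly documented, but treat them as assumptions to verify against the CRDs installed in your cluster. The helper function, the engine name, the nginx label, and the service account name are hypothetical and are not taken from the talk.

```python
import json


def build_chaos_engine(name, app_ns, app_label, experiment,
                       duration_s=30, interval_s=10):
    """Build a ChaosEngine manifest (as a dict) for one experiment.

    Assumption: field names follow the litmuschaos.io/v1alpha1 schema;
    check them against the CRD version installed in your cluster.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": app_ns},
        "spec": {
            # "active" asks the operator to start chaos for this engine.
            "engineState": "active",
            # With "true", only annotated applications are targeted,
            # which is one way to limit the blast radius.
            "annotationCheck": "true",
            # The application instance this engine binds to.
            "appinfo": {
                "appns": app_ns,
                "applabel": app_label,
                "appkind": "deployment",
            },
            # Hypothetical service account name for the experiment.
            "chaosServiceAccount": f"{experiment}-sa",
            "experiments": [
                {
                    "name": experiment,
                    "spec": {
                        "components": {
                            # Tunables that override the experiment defaults.
                            "env": [
                                {"name": "TOTAL_CHAOS_DURATION",
                                 "value": str(duration_s)},
                                {"name": "CHAOS_INTERVAL",
                                 "value": str(interval_s)},
                            ]
                        }
                    },
                }
            ],
        },
    }


# Hypothetical example: pod-delete against an nginx deployment.
engine = build_chaos_engine("nginx-chaos", "default", "app=nginx", "pod-delete")
manifest_json = json.dumps(engine, indent=2)
print(manifest_json)
```

The printed JSON could be saved to a file and applied with kubectl once the matching ChaosExperiment is installed in the same namespace. Keeping the duration and interval as function parameters mirrors the point made in the talk that tuning happens in the ChaosEngine while the ChaosExperiment only supplies the defaults.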