This is my talk at Open Source Summit Latin America 2022, in the Cloud Open track. In this talk we'll be talking about chaos experiments and how to create a chaos experiment using the Litmus SDK. First, a short intro about me: I'm a software engineer at JFrog and a Certified Kubernetes Administrator. I'm a core contributor and member of the CNCF incubating project LitmusChaos, and a maintainer of the Litmus SDK's Python part. I'm also an international chess player. So, let's get into today's talk. We'll mainly be discussing chaos engineering. Any idea what chaos engineering is, what exactly it does? Let's take one short example. In day-to-day life you face issues: your website is down, it's taking a long time to open, or it suddenly crashes. We see this in the news even for big organizations: there's an outage, some service went down, and so on. Why exactly does it happen? Something must have failed underneath; it affected the application, and then the application went down too. That's where chaos engineering comes in. Chaos engineering is about deliberately creating an environment where we inject chaos and actually observe how the application behaves. Take a short example: our application has some ten microservices running, and half of those microservices go down. Do we know how the others will behave in their absence? We might not. We mostly just check that our containers are in a healthy state and our pods are running, and that's all. But that's not enough: there are many possible ways an application can misbehave when one of its services misbehaves. So, let's talk more about the tool, which is LitmusChaos.
Litmus is an open-source CNCF incubating project that basically provides the infrastructure for creating outages. LitmusChaos has more than 50 experiments with which different kinds of disturbance can be injected: for example memory hog, CPU hog, network loss, disk fill, and many others. So, let's focus on the chaos experiment itself. Chaos experiments are particularly applicable to distributed computing environments: in large distributed systems the components often have complex and unpredictable dependencies, and it is difficult to troubleshoot errors or predict when an error will occur. That's why we create an experiment: to make a disturbance — let's break something and see how the application behaves. That's where the chaos experiment plays its role: we recreate actual error conditions, watch how the application responds, and regulate those conditions in the distributed system to test the system's resiliency. Now let's talk about the Litmus SDK. Litmus provides an SDK that is an abstracted way of writing experiments: it provides the flow of an experiment, with pre- and post-chaos checks, library packages, and much more generated by default, so we can write our own logic very seamlessly. We can write it, test it, inject chaos, see how the turbulence occurs and how the application behaves, and learn about chaos experiments through the SDK, overriding whatever logic we want to. The Litmus SDK comes in three languages: Go, Python, and Ansible. In today's talk we'll focus on the Litmus Python SDK.
Now we'll follow a number of steps to get our abstracted scaffolding generated. We'll discuss how to generate it, what the components are and where each is used; then we'll build it, test it locally, and also run the experiment from ChaosCenter, which is the Litmus chaos portal. Okay, let's move on and take a look at the GitHub repositories. This is the LitmusChaos project, the litmus repo, which contains ChaosCenter. ChaosCenter provides the user-facing interface: we can actually run our experiments, see our dashboards and metrics, have teams, and many more things — do explore it. I have already installed it in cluster scope; namespace scope is also supported. This is litmus-python, where the Python-based experiments are located. There are already two experiments inside the experiments folder: one for AWS and one generic pod delete. One can take reference from both while writing an experiment, and the SDK part is covered in the contribute/developer guide. The steps are all given there; one can follow them to get everything generated and tested locally. There is also one more way to test it: a small blog written by me walks through the steps to the point where all the scaffolding is generated, and where we can write our logic, build it, and test it — all of that is mentioned there. Suppose someone doesn't have Python installed locally and just wants a container with a Python environment to test in; they can follow that documentation.
If someone prefers installing Python locally, they can just create a Python virtual environment locally and test there; we will follow that flow in today's talk. Let's move to the SDK part, in the contribute/developer guide. This is where the SDK components come from and where the files get generated. Before generating, there is one main file we have to update first. Whatever we put here gets reflected throughout the generated experiment project: whether it's a chart, the Python experiment code, or the library and package files, it all gets generated from here. This attributes.yml file holds the metadata aspects of our chaos experiment, where we give the name of the experiment, the category, version, the repository where we keep it, community, description, keywords, supported platforms, the scope (both namespace scope and cluster scope are possible), auxiliary checks, whatever permissions are required for the experiment, maturity level, maintainer list, provider name, references, and more. Now let's move to the repo directory. The generator file, generate_experiment.py, takes parameters: one is the attributes file, and the others control what gets generated. The -g flag selects experiment or chart, where experiment creates the CR-based manifests and code, and chart creates the other manifests, such as the chartserviceversion .yml files. There is also a -t flag, which defaults to all, but we can restrict it so that only the category charts or only the experiment manifests get generated. We will run both, experiment and chart, one by one.
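As a sketch, an attributes file along these lines drives the generator. The field names below follow the metadata list just described, but they are illustrative; consult contribute/developer-guide in litmus-python for the exact schema, and note that an invocation would look roughly like `python3 generate_experiment.py <attributes file> -g experiment` followed by the same with `-g chart`, as described above.

```yaml
# Hypothetical attributes.yml -- field names are illustrative, based on the
# metadata list above; check the developer guide for the real schema.
name: sample_exec_chaos
category: sample_category
version: 0.1.0
repository: https://github.com/litmuschaos/litmus-python
description: "Sample chaos experiment generated with the Litmus Python SDK"
keywords:
  - kubernetes
  - pods
platforms:
  - GKE
  - EKS
  - Minikube
scope: Namespaced        # or Cluster
maturity: alpha
maintainers:
  - name: your-name
    email: you@example.com
provider:
  name: your-org
references:
  - https://litmuschaos.io
```

Everything generated afterwards — folder names, manifest names, image metadata — is stamped from these values, which is why this file is updated before running the generator.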
Since we need both, we run both commands, and we can see it has created everything. Now let's look at what files got created by these two commands. We can see three main things that got generated: the package part, the experiment part, and the charts. What are all these things? First, in the experiments folder we got the sample category; this name is reflected from the attributes file located in the contribute/developer guide — the names were all taken from there. Inside it, under the sample exec chaos experiment, we got three files generated: chart, experiment, and test. The test file is for someone using a container environment to test in; if you are just testing locally with Python, it's not required. The experiment file is the actual main experiment file: the actual flow of the experiment, which is common to all chaos experiments, is given here. We can use this flow directly: it has pre-flight checks and pre- and post-chaos checks, where before and after the experiment we verify that the targeted application is healthy, and there is prepare-chaos, the actual function where we inject the chaos.
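The flow just described — pre-chaos health check, chaos injection, post-chaos health check, verdict — can be sketched in plain Python. This is a simplified simulation of that flow, not the actual litmus-python API; all the names here are illustrative.

```python
# Sketch of the generated experiment flow: pre-check -> inject -> post-check.
# Names are illustrative stand-ins, not the litmus-python SDK functions.

def app_is_healthy(app: dict) -> bool:
    # Stand-in for the SDK's status checks: all pods Running and ready.
    return all(p["phase"] == "Running" and p["ready"] for p in app["pods"])

def inject_chaos(app: dict) -> None:
    # Your injection logic lives in the generated lib/ file; the SDK ships a
    # CPU-hog example by default. Here we simulate a kill-and-recover cycle.
    victim = app["pods"][0]
    victim["phase"] = "Terminating"   # chaos in effect during the interval
    victim["phase"] = "Running"       # reverted once the duration elapses

def run_experiment(app: dict) -> str:
    if not app_is_healthy(app):       # pre-chaos check on the target app
        return "Fail: app unhealthy before chaos"
    inject_chaos(app)                 # prepare and inject the chaos
    if not app_is_healthy(app):       # post-chaos check
        return "Fail: app unhealthy after chaos"
    return "Pass"                     # verdict recorded in the ChaosResult

demo = {"pods": [{"phase": "Running", "ready": True}]}
print(run_experiment(demo))           # -> Pass
```

The point of the scaffolding is that only the middle step — the injection logic — is yours to write; the health checks and verdict handling come generated.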
Inside the experiment file there is the before-chaos-injection stuff, the during-chaos stuff, and the after-chaos-injection stuff; all the events and the result get updated from here. This is the main file. The second file we get is the library file for the chaos, with the same name and flow, where we have to write our actual chaos experiment logic; by default a CPU-hog example comes with it, and we follow the same pattern to write our injection logic. For example, if one is doing network loss, the network-loss logic has to be written here; for pod delete, the pod-delete logic goes in this file; likewise disk fill, or any Kubernetes or non-Kubernetes experiment — all the required pieces are written here. The next thing that comes up while writing is: which environment variables does our experiment require? For example, for memory hog: which application has to be targeted, the application's namespace, label, and kind, the chaos interval, and many other variables — these depend on the experiment. If someone writes some specific experiment, they have to decide which ENVs to pass in from our CRs, which are ChaosEngine, ChaosExperiment, and ChaosResult. ChaosResult is the CR where we can see all the chaos experiment details after the injection, like the verdict; ChaosEngine and ChaosExperiment are the main user-facing CRs. Let's have a look at the samples: in the chart, a couple of manifests got generated. This is the ChaosEngine CR, which holds the experiment's environment variables in the component specs, along with other properties like the cleanup policy and, if it is namespaced, the namespace details as well.
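As an illustration of the shape of that CR — the litmuschaos.io/v1alpha1 structure is the standard one, but the experiment name, service account, and ENV values here are placeholders for the generated sample:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: active
  appinfo:                       # which application to target
    appns: default
    applabel: app=nginx
    appkind: deployment
  chaosServiceAccount: sample-exec-chaos-sa   # placeholder name
  experiments:
    - name: sample-exec-chaos    # placeholder experiment name
      spec:
        components:
          env:                   # the ENVs your experiment chooses to expose
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "30"
            - name: PODS_AFFECTED_PERC
              value: "50"
```

This is the user-facing knob panel: whatever ENVs you decide your experiment needs are surfaced here for the user to set.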
This is the ChaosExperiment CR, which holds the permissions and some other metadata, the specs, and ENVs as well, including the image and the command to run, plus the icon and so on for the experiment. So these are the things we get. The next step, once we are done with this, is to test it. At this stage we have generated our default SDK scaffolding, and we have the lib folder where we actually write our chaos experiment logic. Once we are ready with that, we have to test it, locally and using ChaosCenter as well; once we are done with that, we can raise a PR and actually see our experiment in ChaosCenter. So let's test it. Before we test, we have to update a couple of things. The main file is bin/experiment/experiment.py; this is the place where our experiments get called. To save time, I already added the pieces here: we need to import our sample_category sample_exec_chaos experiment, and an elif condition needs to be added so it gets called — everything is mentioned in the blog, so follow that: the import and the elif condition. Next, we have to add all the directories for our chaos experiment inside the setup, where we initialize things like the repo setup. Once these are done, let's create a Python environment — I believe I already have one — with the requirements and all packages. One thing to know: whatever changes we make, before we run the main file with Python we have to run one more command, python setup.py install, which installs everything required — the needed packages, the repo setup, the project setup, and all of that. We need to run it for our changes to take effect. Once we do that, we can now actually run our commands.
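The registration step in bin/experiment/experiment.py described above is essentially a name-based dispatch: import your experiment, then add an elif branch for its name. A rough sketch, with illustrative names rather than the actual litmus-python code:

```python
# Rough sketch of how bin/experiment/experiment.py dispatches by name.
# Experiment names and handler functions here are illustrative.

def run_pod_delete():
    return "running pod-delete"

def run_sample_exec_chaos():          # the newly generated experiment,
    return "running sample_exec_chaos"  # imported at the top of the file

def main(name: str) -> str:
    if name == "pod-delete":
        return run_pod_delete()
    elif name == "sample_exec_chaos":  # the elif branch we add by hand
        return run_sample_exec_chaos()
    else:
        raise SystemExit(f"unknown experiment: {name}")

print(main("sample_exec_chaos"))       # -> running sample_exec_chaos
```

Forgetting this branch is why a freshly generated experiment silently never runs: the entry point simply doesn't know its name yet.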
Before we do that, let's look at a deployment: one already exists — an nginx deployment is already running. Let's see the labels: there they are, and it's in the default namespace. Since we have to inject chaos against something, we'll inject it against this nginx application and actually observe the behavior. For behavior analysis we could set up Prometheus, and for some applications Grafana and more, to analyze the downtime and all of that; but this talk is focused on creating the chaos experiment and testing it, so let's continue. Now we have to update our environment variables: the app namespace is default, the app label is this one, and the app kind. Once that's done we can set our chaos interval and total chaos duration. One thing to remember: two modes are supported, parallel and sequential. Suppose the app label, the app namespace (default), and the kind (deployment) together match 10 pods, and we set the pods-affected-percentage environment variable: at 100% it will take all the pods, and at 50% it will target 5 of them. On top of that selection we can choose parallel or sequential. With parallel, it injects chaos into all the targeted application pods together. With sequential, it injects into one pod at a time: once one injection completes, it waits out the chaos interval, then injects into the next, one by one, and keeps injecting until the duration is over. Both are supported in Litmus experiments, and the right one should be chosen based on the requirement.

Now let's test it. We've updated everything, so let's add the command we have to run. One thing to observe here: if there is any command your experiment has to run inside some container, these litmus-python processes run as background processes, so we have to use the ampersand operator on the chaos injection command so execution can move on to the next step. Add that to the injection command if your experiment has one; if not, that's fine. The next thing is to test it locally. We know that in bin/ we have experiment/experiment.py, so we run it, passing our experiment's name, and wait — right now it is waiting for the target container name. Okay: as I said a moment ago, whatever changes we make, we have to run the install command first. Once that's done, the new changes are picked up, and now we can see it has started injecting the chaos. The total waiting time is 30 seconds; our chaos interval and total chaos duration are the same, so the chaos will last for 30 seconds and then revert. In the logs we can see that the application was verified healthy before injecting the chaos: it is running and healthy, the containers are healthy, and readiness is true. In 30 seconds we'll see whether the application is still healthy after the chaos injection as well. There: we have verified the pod, and the application is running and healthy, so we have successfully injected the chaos. To recap: we got the default scaffolding from the SDK, updated it, and tested it in this manner. Next, we have to build the Docker image and follow the next steps.
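The pods-affected-percentage plus serial/parallel behavior described above can be shown with a small simulation. This is plain Python for illustration, not Litmus code, and the round-up rule is an assumption:

```python
import math

def select_targets(pods, affected_perc):
    # e.g. 10 matching pods at 100% -> all 10, at 50% -> 5.
    # Rounding up so at least one pod is targeted is an assumption here.
    count = max(1, math.ceil(len(pods) * affected_perc / 100))
    return pods[:count]   # deterministic for the demo; real selection varies

def injection_batches(targets, sequence):
    # parallel: chaos hits every target together in each round.
    # serial: one target per round, waiting out the chaos interval between.
    if sequence == "parallel":
        return [targets]
    return [[t] for t in targets]

pods = [f"nginx-{i}" for i in range(1, 5)]
targets = select_targets(pods, 50)
print(targets)                                 # -> ['nginx-1', 'nginx-2']
print(injection_batches(targets, "parallel"))  # -> [['nginx-1', 'nginx-2']]
print(injection_batches(targets, "serial"))    # -> [['nginx-1'], ['nginx-2']]
```

Serial mode therefore stretches the total run time (one interval per target), while parallel mode concentrates the disturbance into a single round, which is exactly why the choice depends on what failure you want to rehearse.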
For the charts: chaos-charts is the repo that gets reflected in ChaosHub, so whatever experiments the hub shows come from chaos-charts. Now we have to update the chaos-charts repo in the expected format. We recently generated some charts from litmus-python, right? So let's look at those. A couple of things to notice: if you observed carefully, in generating the chaos experiment we got a chart folder with several manifests. We have to take these charts, create a folder, and arrange them in such a manner that they follow the chaos-charts standard. First, the icon: we just copy the icon into the icons folder. The top-level folder is the sample category — whatever category the experiment belongs to should be a folder inside the charts directory, named with the category name we gave. An important note: whatever litmus-python generated uses underscores, but here we have to use hyphens. So while copy-pasting we have to update the names everywhere, in the folder names as well as inside all the manifests. If you carefully observe the right side, I have already created this to save the time: in the engine and everywhere else, see, I have updated underscores to hyphens. In the Python folders you can't use hyphens — imports would throw an error, so there we have to follow Python naming standards — but for chaos-charts we have to follow its standards and update the names. So in all the files, wherever the names sample_category and sample_exec_chaos appear, we have to update them in this way; please note that, it's starred.

Next, the folders. On the left side is litmus-python; on the right side is the chaos-charts GitHub repo, which I have already cloned and opened, and these are the names we have to update. First, for the sample category we create a folder in which we put the icons, experiment.yaml, sample-category.chartserviceversion.yaml, and sample-category.package.yaml; those files should be there, and you can see a couple of package and chartserviceversion files are already here for other categories. Next, we create a folder named after the experiment, sample-exec-chaos, and move the engine file — this one — and the experiment file there, with a couple of modifications: the names update from underscore to hyphen, the RBAC needs to be added there, and one more file, sample-exec-chaos.chartserviceversion.yaml, is required there; this file holds the information about our experiment. Also remember that in experiment.yaml we have to update the Docker image to our newly built one to test it: I have updated it to my own image (oumkale/py-runner:ci) and updated the chaos experiment name as well. These things have to be taken care of, and we need to add all our required ENVs here — we are mostly on defaults for this chaos, but add your ENVs as per your experiment's requirements. Once we are done with these modifications — I will repeat once more: first, create the sample category folder, with three things in it: the icons, the sample-exec-chaos experiment folder, and the manifests. All of these we get from litmus-python while generating; we copy-paste them with two changes. One: the folder and file names change from underscore to hyphen.
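That underscore-to-hyphen rearrangement can be sketched as a shell loop. The paths and file names below are hypothetical stand-ins for the generated output, not the exact litmus-python layout:

```shell
# Hypothetical layout standing in for the litmus-python generated files.
mkdir -p generated/sample_category/sample_exec_chaos
touch generated/sample_category/sample_exec_chaos/engine.yaml

# Copy into a chaos-charts style tree, swapping underscores for hyphens
# in the folder names (the same rename must also be applied to the names
# *inside* each manifest, which this sketch does not do).
for dir in generated/*/*/; do
  category=$(basename "$(dirname "$dir")")
  experiment=$(basename "$dir")
  dest="charts/$(echo "$category" | tr _ -)/$(echo "$experiment" | tr _ -)"
  mkdir -p "$dest"
  cp "${dir}engine.yaml" "$dest/engine.yaml"
done

ls charts                  # -> sample-category
ls charts/sample-category  # -> sample-exec-chaos
```

Python forces underscores in module paths, chaos-charts expects hyphens in chart names, so this mechanical translation sits between the two conventions.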
Two: inside the files we also have to update the names — sample-exec-chaos and sample-category should be updated everywhere. If you look, on the left side this is the underscore name and on the right side it's the hyphenated one; that's the change we have to make. Once we are done with that, we push it to the repo; I already pushed it. Next: since we already installed ChaosCenter — I showed you the litmus repo earlier, installed with cluster scope; namespace scope is also available — here's what we do now: we have to connect our hub. Let me show you this again. To connect it, we need the Git URL; let's take mine, since I pushed the chaos-charts changes to my fork, and the branch is the one with my changes for this talk. We copy the fork's link and the branch name, and with those we connect our repo as a ChaosHub in ChaosCenter. How to connect: give any name, give the repo link where we pushed our chaos-charts changes, give the branch name, and wait for it to get connected. Now it is connected.

Now let's go to the chaos scenarios, where we can actually schedule our chaos. This is the agent, and there are a couple of ways to proceed: we can use the hub's default workflows; we can use any existing template (after a run we can save our own templates too); we can run directly from a ChaosHub; or we can directly upload our workflow YAML. We will use the hub we just connected to run this. We add our chaos experiment — we kept all the default names, sample and sample-category; we could have updated that earlier, in attributes.yml — so this is the experiment we created, under that name. Now we can edit the
YAML if there is something we want to edit or update. You can see I have used that Docker image and given the name of the chaos we recently created. Next we can update our environment variables — the chaos interval and so on, the app namespace (default), the app labels; whatever we updated in that manifest gets reflected here — and we can set the weightage. Then we can schedule: schedule now, recurring schedules, cron jobs, and so on are all supported. Let's finish: we have scheduled our experiment. This is the Argo workflow that is used; let's watch it. The workflow has started, and the first step is installing the chaos experiments; since it was already there, it shows as unchanged. Once the install step is done, the actual chaos injection starts. The first thing that comes up is the chaos runner, which takes the information from the CRs and launches the experiment job; let's wait for it. We can see the runner has launched, and after the runner comes the actual job of the chaos experiment. We can see our logs locally as well as from the UI. It has started injecting the chaos, and in the UI we can see the chaos experiment's details too. Let's wait some more time for it to update. After the experiment is done, the results get submitted; we can see the table view and the steps as well. Locally we can see that it has injected the chaos successfully and it's done; a few more seconds for that to get reflected. Now we can see the injection succeeded and everything is done, and we can see the ChaosResult as well, with the Pass verdict, the completed phase, our resiliency score, and many other things we can check here: the probe success percentage, the pass/fail runs, and all that. We can see it in the table view too, and if analytics is connected, we can see all the results there as well.

So that's our chaos experiment: we created it, tested it locally, tested it in ChaosCenter, and saw the results. These are the references for the repos: litmus, litmus-python, chaos-charts, and ChaosHub. Once everything is done, we raise a PR to chaos-charts; once it gets merged, the experiment will appear in ChaosHub. Feel free to contact me on Twitter, LinkedIn, GitHub, or dev.to — I am Oum Kale — and thank you so much.