Hi everyone, and welcome to our presentation on bringing chaos into continuous delivery to increase application resiliency. My name is Jürgen Etzlstorfer. I'm a maintainer of the Keptn project, I work for Dynatrace, and I'm here together with Karthik. Hello, Karthik.

Hi Jürgen, hi everyone. This is Karthik. I am a maintainer of the LitmusChaos project and I work for ChaosNative. Absolutely thrilled to speak at KubeCon.

So let's dive in, because we have a really exciting topic prepared for today. Let's start with the typical CD process in a multi-stage delivery pipeline. You have your dev stage, any kind of pre-production environments, and finally your production environment. What you already might use are some kind of quality gates that evaluate the quality of your applications based on performance and load testing, and only if those tests pass are you allowed to move them to the next stage. This is totally fine, but in production there is always something happening that you maybe cannot foresee or cannot really test for. So there is one trend that goes towards testing in production: basically trying to break production and then getting insights into what has to be improved. Today, however, we really want to advocate for taking this idea and moving it to pre-production environments: in addition to your performance and load tests, also adding chaos tests, and evaluating them not only on performance criteria but on resilience criteria. By shifting the chaos left into pre-production environments we keep production, let's say, a safe place and keep the green lights on in your production environments. That's the main idea of today's presentation, and I'm handing over to Karthik to explain a little bit how we can evaluate resiliency and to take it from there.

Thank you, Jürgen. Thanks for setting the context. Now that we've spoken about chaos and the need to test it before deploying to production, let's look at why chaos engineering is important and what practices we can follow to improve the resiliency of our applications and infrastructure. Load tests are great, performance tests are great, but they need to be augmented with failure scenarios, especially so in a cloud native world where everything is in the form of a microservice and everything is loosely coupled. There are so many points of failure, the surface area for failures is much larger, and it is important for us to test what happens when the components surrounding our applications, our business apps, fail.
If you look at this pyramid for a typical application that you have deployed on Kubernetes, you have the platform services, which could be cloud or on-premise, you have Kubernetes itself, then you have a host of microservices from the cloud native ecosystem, the CNCF landscape, for service discovery, storage, observability, and so on, and then you have your application stack. There are so many things that can fail, and it's important for us to recognize what happens when these surrounding components fail. So failure testing is important; in other words, it is really important for us to inject chaos and find out what happens.

Chaos engineering is a discipline, a very scientific one, and there are a lot of assumptions that we go ahead with when we write our applications or when we deploy them. We assume that networks are reliable, that latency is always very low or nearly zero, that we have infinite bandwidth, abundant storage, and abundant compute resources. But that's not always the case. We really want to simulate these conditions and find out what happens. There are some failures where you know what the result is going to be; we call them known knowns, and it's important to do some kind of regression around those, some kind of chaos experimentation run repeatedly to find out if things still work. But there are a lot of unknowns and a lot of assumptions that you want to validate, especially in an environment that mimics production, where things are really very dynamic and there's a lot of churn.

Coming to the word churn: why is it important for us to do chaos testing continuously? I have taken this snippet from principlesofchaos.org, where it is recommended to automate these experiments and run them continuously, because we are going to have several versions of our software, several builds, several releases, so it's important to run them continuously. There are a lot of infrastructure changes that can happen; the underlying operating system or your Kubernetes versions might keep getting upgraded. And the best way to run things continuously is to put them inside of a CD pipeline. An exploratory, freestyle, game-day-oriented model of executing chaos is still really important; that's not to be done away with, that's really the nirvana of a mature chaos engineering practice, especially when done in production. But it's important for us to automate this and keep running it continuously in pre-production environments to find out whether our services really behave the way we expect, so that we are confident.

Whenever we do chaos engineering, whenever we do chaos experimentation, it is important for us to carry some hypothesis around what the failure is going to result in: how are my services going to behave, what's the impact on downstream applications, how are my performance statistics going to change, is my user experience going to remain the same? All of these are important, but you need to define what your service level indicators are, and you need to define SLOs on top of that; these come very close to, and in fact form, the service level agreements that you might have with your end users. It's important for us to accompany, or marry, these checks with chaos experiments every time.

Now, we spoke about why; let's talk about how. In chaos we want to go the declarative way, because in the cloud native world everything is declarative, right from the way you define your infrastructure, the way you define your applications, the way you manage their life cycle, and the
way you define resources and your policies. Everything is done in a declarative way, as YAML manifests, and we want to do that with resilience checks as well, in other words with chaos experiments as well. So it is important that we can define chaos intent via custom resources and use the same paradigm, that of operators and controllers, to reconcile these chaos resources and execute your experiments. That way it lends itself to a model where you can store everything in Git and also use the traditional GitOps controllers in your chaos flow.

Let me introduce the LitmusChaos project. It has been growing for some time now; it's a sandbox project right now and it's getting contributions from multiple organizations. The chaos intent is defined as custom resources; there are multiple custom resources, each serving a different function, and there's a chaos operator which acts on these and actually executes your chaos. This platform has been built with certain principles in mind; you can see those principles on this screen, and there are a couple of good blogs you can read about them. This project has seen good adoption because of the way Litmus is built: everything is declarative, and every experiment ends with a result, a verdict. You actually find out whether you met your hypothesis around your steady state, whether you met your SLOs or not; all that information is captured for the experiment. Because of this model, it fits well into CI/CD pipelines. In the subsequent portion of this talk Jürgen will explain what the Keptn project is about and how it leverages Litmus to introduce chaos stages as part of continuous delivery pipelines. Over to you, Jürgen.

Thank you, Karthik. So we are using two CNCF projects for this presentation: one is LitmusChaos and the other one is Keptn, which is a cloud native application life-cycle orchestrator. That means it's not intended to replace all the tools that you already have in your tool stack, but to orchestrate them and bring them together. One strength of Keptn is to automate parts of your delivery and operations setup, such as observability, dashboarding, and alerting, for example automated dashboards in Grafana or alerting rules in Prometheus Alertmanager. Another is to automate SLO-driven multi-stage delivery, which is the main part of today's talk, and also to automate operations and remediation, for example orchestrating remediation actions in response to alerts from the Prometheus Alertmanager. Everything is based on declarative descriptions, which aligns very well with other projects such as Litmus, and everything is stored in Git, so we follow a GitOps approach here as well.

So how can we actually use this, and how can we set up the project where we want to integrate LitmusChaos into a CD workflow in Keptn? We follow the idea of a Keptn shipyard definition, and the shipyard definition really captures what you want to do, not so much how you want to do it. It's basically a process and environment definition in YAML; you put it in your repository and Keptn will act upon this definition. In another concept, which is called the Keptn uniform, you then add the tooling that is responsible for each task that has to be executed as part of the shipyard definition. This separation of concerns between the how and the what is done with CloudEvents, so it's an event-based approach, and Keptn will make sure to send CloudEvents with all the information that other tools need in order to act.
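To make this a bit more concrete, a shipyard with the three tasks we use in this talk could roughly look like the following sketch. This is an illustration rather than the exact file from the demo; the apiVersion string, the stage name, and the strategy properties are assumptions, so check the Keptn docs for the version you are running.

```yaml
# shipyard.yaml (sketch): what to do, not how to do it
apiVersion: spec.keptn.sh/0.2.2
kind: Shipyard
metadata:
  name: shipyard-chaos-delivery
spec:
  stages:
    - name: chaos                     # assumed stage name for the pre-production chaos stage
      sequences:
        - name: delivery
          tasks:
            - name: deployment        # handled by the Helm integration
              properties:
                deploymentstrategy: direct
            - name: test              # handled by the litmus-service and locust-service
              properties:
                teststrategy: performance
            - name: evaluation        # handled by the built-in lighthouse-service
```

Keptn reads this definition and emits the corresponding deployment, test, and evaluation events; whichever integrations are subscribed in the uniform pick them up and do the actual work.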
For example, the deployment itself is not done by Keptn; Keptn can use Helm, for instance, by sending all the deployment information to the Helm integration, which acts upon it and executes the deployment. The same is true for the testing. It's not only Litmus tests that we can support, but also JMeter tests, load tests, and other test integrations; they wait for Keptn to trigger them, and once triggered they execute the tests. The test instructions are stored in the Git repository managed by Keptn, are provided to the test integrations, and then they can execute the tests. In the case of LitmusChaos, the chaos experiment, the chaos test itself, is a YAML file; it is provided by Keptn to Litmus, and Litmus acts upon it, executes its chaos test, and then comes back with the result to Keptn. Keptn can then, in the evaluation phase, trigger the tool that is responsible for the evaluation. Here we're using a built-in functionality of Keptn, the lighthouse-service, to do the evaluation.

So what does an evaluation look like? Again, it's a declarative description, based on service level objectives. Let me start with the block on the left-hand side, where we have an objective defined based on an SLI, let's say the probe success percentage, meaning: what is the success percentage of all the probes we are doing within a given time frame? We need it to be higher than 95 percent if we want this objective to fully pass. If we cannot meet this criterion, Keptn will evaluate the warning criterion: the value still has to be higher than 90 percent for the objective to receive half the score. If we can meet neither the pass nor the warning criterion, Keptn will not give it a score and this objective will fail. How the data is actually retrieved is defined in the service level indicator file; this is basically a mapping between the name of the service level indicator and, you can think of it as, a PromQL query with placeholders, so that you can easily reuse it for different services, different time frames, and so on.

With these two files we can go ahead and do an evaluation, either within the execution of a Keptn shipyard definition, or triggered via the API or the CLI if you're just interested in an ad-hoc Keptn quality evaluation. Either way, once the evaluation is triggered, Keptn reaches out to the different data providers, such as Prometheus, and queries the data referenced in the service level objective file. It then scores the data and comes back with a total score, and based on the total score this microservice can, for example, be promoted to the next stage, even to production, if it meets the criteria, the resilience criteria for example, or it can be automatically rolled back or just held back in the stage, depending on what is defined in the shipyard definition. This is the idea of how Keptn quality gates work and how you can evaluate resilience; it comes down to how you define your service level objectives and which service level indicators you use.

We also have a demo prepared for this. In the demo we will see which application we're using and how all these different tools come together, but for the explanation of what we've done here and how we validated our approach, I will hand over to Karthik.
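As an illustration, the two files just described could roughly look like the sketches below. The thresholds match what was described (95 percent for pass, 90 percent for warning), but the SLI names, the Prometheus labels, the duration SLO, and the total-score settings are assumptions for this sketch, not the exact files from the demo.

```yaml
# slo.yaml (sketch): objectives the lighthouse-service evaluates
spec_version: "1.0"
comparison:
  compare_with: "single_result"
objectives:
  - sli: probe_success_percentage
    pass:                       # full score if >= 95 %
      - criteria:
          - ">=95"
    warning:                    # half score if >= 90 %
      - criteria:
          - ">=90"
  - sli: probe_duration_ms      # assumed second objective on probe latency
    pass:
      - criteria:
          - "<=200"
total_score:
  pass: "90%"
  warning: "75%"
```

```yaml
# sli.yaml (sketch): maps SLI names to PromQL queries with placeholders
spec_version: "1.0"
indicators:
  probe_success_percentage: avg_over_time(probe_success{job="blackbox"}[$DURATION_SECONDS]) * 100
  probe_duration_ms: avg_over_time(probe_duration_seconds{job="blackbox"}[$DURATION_SECONDS]) * 1000
```

The placeholders (such as the duration of the evaluation window) are filled in by the data provider integration at query time, which is what makes the same SLI file reusable across services and time frames.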
Thanks, Jürgen. Now that we appreciate the need for chaos as part of continuous delivery, and now that we've learned about the Litmus and Keptn projects, let's talk about a simple use case; it's a way to illustrate how you can do chaos in CD using these projects. The diagram on this slide is essentially a representation of the shipyard that Jürgen talked about a few minutes earlier. We have three stages: the deployment stage, the test stage, and the quality evaluation.

As part of the deploy stage we deploy a simple hello-world application, the podtato-head app. This is a popular hello service maintained by the CNCF App Delivery community, and it is deployed using Helm, the deployment tool of choice here. Once the deployment is completed, a deployment-finished event triggers the next set of tasks. First, a load generator, Locust, starts putting load on the hello service app; we want to do chaos while the application is actually serving requests, not under idle conditions, and that's why we have Locust. Then we use Litmus: the ChaosExperiment and ChaosEngine CRs are used to define a pod-delete chaos experiment on the hello service app. As part of this we're going to delete one of the replicas of this application and observe what happens. Before and after the experiment there are certain checks, before we actually complete the experiment and give out a verdict. Once this is done, a test-finished event is generated, and this triggers the quality gate evaluation. As part of this evaluation we use metrics provided by Prometheus; we have defined some service level indicators, which are essentially Prometheus functions on top of metrics exposed by the application, Litmus, and the tools we have, and there are SLOs defined as cutoffs on top of the values provided by these SLIs. We evaluate whether those cutoffs are met as part of the quality gate evaluation. If they are met, we promote the application to the next stage, possibly production, or we may go ahead and run the next chaos test or another test. If not, there's something we need to improve, either in the application or in our deployment practice. In this particular demonstration we're actually going to highlight an inefficient deployment approach, and Jürgen is going to take you through that demo.

Before we get into it, let's spend the next couple of minutes detailing the flow of this use case. First, we have the podtato-head hello service app deployed. It's deployed with a readiness probe that has an initial delay of around 30 seconds, so it actually takes some time for this application to be ready and expose an endpoint that can be queried. Then we have a blackbox exporter which is constantly trying to access this application and give us accessibility information; it gives us two metrics, probe_success and probe_duration_seconds. probe_success is an accessibility indicator, and probe_duration_seconds indicates how long it takes to access the application successfully. Then we have the Litmus operator and its dependencies, along with the litmus-service integration residing in the Keptn control plane; together they trigger this chaos experiment.
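For reference, a pod-delete ChaosEngine targeting an app like this could roughly look like the sketch below; the namespace, label selector, service account, and durations are assumptions for illustration, not the exact manifest used in the demo.

```yaml
# ChaosEngine (sketch): declarative chaos intent for a pod-delete experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: helloservice-pod-delete
  namespace: helloservice            # assumed target namespace
spec:
  appinfo:
    appns: helloservice
    applabel: "app=helloservice"     # assumed label of the podtato-head hello service
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-delete-sa # assumed service account with chaos permissions
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"         # graceful deletion, as described in the demo
```

The chaos operator reconciles this resource, runs the experiment, and records the verdict in a ChaosResult, which is the information the litmus-service reports back to Keptn.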
The experiment itself is a simple graceful deletion of a single replica of the hello service app, one iteration of this pod delete. Initially we run this against a single-replica deployment and see that the SLOs are not met; these SLOs are essentially built on top of SLIs that are averages over the probe success and probe duration. We will then see that they are met when we have a multi-replica deployment. That's what we're going to show as part of this demo, and with this I'll hand over to Jürgen to talk us through the actual steps and the commands involved.

Thank you, Karthik. For the sake of efficiency we took screenshots of the demo and put them on the slides, to skip the waiting times while the tests execute. The demo starts with triggering a new delivery based on the shipyard file we've seen earlier, and we deploy one image: the helloserver application, which is part of the podtato-head application. The deployment is done with Helm, with a replica count of one in the first round of the demo. Once the deployment is finished, Keptn makes sure to trigger the tests. The test definitions and the chaos definitions are all stored in the Git repository managed by Keptn and are provided to the tool integrations, so both the litmus-service and the locust-service start their work. They don't need to know that the other service is running as well, so you can just reuse all the performance tests you already have and add the chaos on top of them.

Keptn waits for both services to finish, and then the evaluation is started. We can see here that the litmus-service finished successfully, and the locust-service finished successfully as well; that means they both did their job and indicated that there was no problem during their execution. In the evaluation, Keptn now reaches out to Prometheus to collect the data, and we can see the evaluation is failing in this case. Taking a look at the detailed evaluation overview, we can see why: both of our service level objectives did not meet the criteria. The probe success percentage was not high enough; we expected it to be higher than 95 percent for a full pass, or at least higher than 90 percent for a warning, but we could not meet this because our application was unavailable for a certain time due to its readiness probe. It takes at least 30 seconds for the pod to be ready after it has been deleted by the pod-delete experiment, the one we are using in this demo.

So what can we do to improve the resilience of our application in this case? One good first approach is a more highly available setup, increasing the replica count of this application. We just rewrite the CloudEvent that we are sending to Keptn, which is then used as the instruction for the deployment with Helm. So we rewrite this file, add a replica count of three, and send it to the Keptn control plane, which forwards it to Helm; Helm does the deployment. Once it's finished, the tests are triggered automatically again; it's the same tests and the same integrations. Once the tests are finished, they again indicate back to Keptn: we're done, Keptn, please go ahead and do the evaluation.
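To make the effect of this change concrete, the relevant part of the resulting deployment looks roughly like the sketch below. The image reference, labels, and port are assumptions, but the replica count and the readiness probe delay correspond to what was just described.

```yaml
# Deployment excerpt (sketch) after the configuration change
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helloservice
spec:
  replicas: 3                        # raised from 1 to 3 for the second run of the demo
  selector:
    matchLabels:
      app: helloservice
  template:
    metadata:
      labels:
        app: helloservice
    spec:
      containers:
        - name: helloservice
          image: ghcr.io/podtato-head/podtatoserver:v0.1.1   # assumed image reference
          ports:
            - containerPort: 9000
          readinessProbe:
            httpGet:
              path: /
              port: 9000
            initialDelaySeconds: 30  # a restarted pod only becomes Ready ~30 s later
```

With three replicas, Kubernetes routes traffic only to pods that are Ready, so deleting a single pod no longer affects the probes.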
This time the evaluation is successful. We can even take a look at why: the probe success percentage is now 100 percent, and the probe duration was fast enough as well. Why could we achieve 100 percent? Because that's basically how Kubernetes works: if there is more than one replica, Kubernetes will always make sure to send traffic only to the instances of the application that are actually ready. If we delete one of those instances and it has not yet indicated that it is ready to serve traffic, no traffic will be directed to it, so the traffic was served only by the two other instances that were not affected by the pod-delete test we triggered with Litmus. With this we can already evaluate how to increase the resilience of applications. We have not changed anything in our application code, but we actually increased the resilience by tweaking the deployment instructions and using a higher replica count. This of course might be different for your applications, but it is a validation of the approach, and we also get a continuous evaluation of the chaos and of the impact of our chaos tests on the application.

With this we want to leave you with three key takeaways. First, we really want to encourage you to establish a process of continuously evaluating resiliency, not only doing it once or twice via a so-called game day, but really putting this idea of continuous resilience evaluation into your CD pipelines and having chaos tests in addition to performance tests. With this you can really increase the resilience of your applications, and doing it continuously gives the highest value. Second, the evaluation should be based on service level objectives; they have proven to be a very efficient way to evaluate performance criteria, but also resilience criteria. You can even have memory consumption or other aspects as part of your SLOs, and a combination of several service level objectives gives you a very strong quality indication for your applications. Third, what we are talking about today has already been adopted by a company called Kitopi. They are using exactly the stack that we also used for the demo: Locust for the performance tests, Keptn for orchestrating them and for doing the evaluation, and they have added LitmusChaos as part of their quality evaluations. They have already run those experiments and tests hundreds of times, and it has already proven to increase the resilience of their applications.

If you're interested in more, we have one resource slide here. Please visit litmuschaos.io, the website of the LitmusChaos project; on keptn.sh you will find everything about the Keptn project and how to use it. If you want to use the Litmus integration, you will find it on GitHub, and there is even a tutorial on how to use it. The LitmusChaos team and the Keptn team have also teamed up to write a two-part blog series on this whole topic, so there are a lot of resources around this, and we really encourage you to make use of them. You can also reach out to us; our respective Twitter handles are here, and we are happy to follow up with you on these topics. With this, I think we can already open up the question section. Thanks so much, Karthik, it was really a pleasure working with you on this, and thanks to the whole open source and CNCF community for taking part in this talk.
Thanks, Jürgen, I enjoyed working on this as well. I hope this is useful to the cloud native community. Thanks, everyone. Thank you.