Hello, welcome everyone to KubeCon and to this session. I am Uma Mukkara, CEO of ChaosNative, and I am also a maintainer on the CNCF project LitmusChaos. Along with me in this session we have a co-speaker, Samar Siddharth from Orange. Today we will be talking about a case study of the chaos engineering project LitmusChaos at Orange: how they are using it to improve the resilience of their overall Kubernetes-based platform.

We are going to talk first about the general reliability challenges in the Kubernetes ecosystem today. I will talk about how Litmus can help in incrementally tackling this reliability challenge in Kubernetes and cloud-native environments in general. Samar will then walk through the reliability challenges they saw in their Kubernetes environment, why they chose Litmus, and how they are using it. And of course, he will do a detailed demo of a couple of scenarios.

So, let's talk about resilience in general. We all know that cloud native is mainstream IT and is in a deeper adoption phase. This also brings a couple of challenges as far as resilience or reliability in production is concerned. The reason these challenges occur is the proliferation of microservices: there are too many of them and they ship fast, arriving in your environment faster than you would expect. Even though each microservice is individually reliable and well tested, when you put them all together to form your application service, the dependency matrix grows a lot, and a fault anywhere can mean an availability problem for your service. You need to be resilient to all these fault scenarios, and that is really the reliability challenge.

The solution to this bigger problem is to adopt chaos engineering. Chaos engineering has been on the rise for the last couple of years; we are seeing a lot of people using it, and project LitmusChaos is evidence of that. We expect chaos engineering to become a mainstream tool set in the very near future. It is being adopted across the whole of DevOps, not just for ops, which was typically the case over the last decade or so. Now we see chaos engineering being used in pipelines, QA environments, reliability testing, test beds, performance engineering, and so on. Chaos engineering is emerging as a broader tool set for developers and DevOps practitioners alike.

The whole idea of chaos engineering is to cover the unexpected: assume that anything can go wrong, make sure you test your system against all such possible failures, and reduce the chance of downtime of your service. There are many tools available in the industry, especially in the cloud-native space. They are easy to use; you can put them to work and reduce the possibility of downtime in production. The way you do this is to put chaos experiments together in various forms, chain them, build a steady-state hypothesis that is close to your reality, and go deeper, improving your steady-state hypothesis checks over time. Then you start fixing your configuration issues, software bugs, or infrastructure tuning so that your service is highly available.
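To make this concrete, here is a minimal sketch of what such a declarative chaos experiment can look like in LitmusChaos, which we introduce next. The target application (an nginx deployment labeled app=nginx), the namespace, and the service account name are hypothetical placeholders; the field names follow the litmuschaos.io/v1alpha1 ChaosEngine format.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos            # hypothetical name
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=nginx        # hypothetical target application
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # hypothetical service account
  experiments:
    - name: pod-delete         # an out-of-the-box Litmus experiment
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"      # run chaos for 60 seconds
            - name: CHAOS_INTERVAL
              value: "10"      # gap between consecutive pod deletions
            - name: PODS_AFFECTED_PERC
              value: "50"      # delete half of the matching pods
```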
Litmus was born out of exactly such a need. It is an open-source project that has been part of CNCF for more than a year now; we developed it about three years ago and donated it to CNCF as a sandbox project last year. We are going through the incubation process right now, and hopefully it will become an incubating project in the near future. It has seen quite good adoption, with 1,000-plus users, and it is a pretty stable platform with 2.0 as the latest release. We are also seeing a good installation rate of about 1,000 installations per day.

Litmus is a complete tool set for anyone to do chaos engineering. It is itself a Kubernetes application: it runs on Kubernetes and can run attacks against Kubernetes targets, but it is a tool set for chaos engineering in general. The way it works is that it has a control plane where a set of team members can get together and collaborate on developing and tuning experiments. Together you develop a chaos workflow, which is nothing but a chaos scenario, and you can target it at your Kubernetes resources, Kubernetes platforms, and Kubernetes applications, and also at a bunch of non-Kubernetes use cases, such as cloud platforms, VMware, or bare-metal physical infrastructure.

So where do you use Litmus? First of all, Litmus has readily available chaos experiments covering most of the common faults, both for Kubernetes and for non-Kubernetes targets such as VMware and other cloud platforms. You just need to create chaos scenarios; you do not need to start by writing chaos experiments for the basic cases. What you construct are chaos scenarios called Litmus workflows. Litmus also comes with a powerful feature called probes, which help you build your steady-state hypothesis logic. When you introduce a fault, you want to know whether your system is still working as expected, and "is my system working properly?" is a difficult thing to express. Litmus probes are there to help you define exactly that, declaratively: you can tune them and get them as close as possible to how you would normally describe your steady-state hypothesis.

You can then use this end-to-end chaos engineering approach for multiple use cases: continuous chaos testing, random game days to introduce chaos engineering into your organization, service-level-objective validation and management, checking whether your observability systems are working well or need tuning, and scaling and performance testing, where you introduce chaos while running performance tests and see whether your system can sustain it.

So how do you get started? As I said, we have a bunch of experiments already available; these are like Lego blocks that you just need to put together. You install Litmus through Helm and you get the Chaos Center, you start running a basic workflow, invite your team members, and attach it to your Prometheus-Grafana monitoring system, and everything is in place. So it is fairly easy to start; at the same time it is highly scalable, there is a powerful SDK, and you can go very deep in describing complex fault scenarios.
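Before moving to the case study, here is a small illustration of the probes mentioned above: a steady-state hypothesis expressed as a Litmus httpProbe. The probe name and service URL are hypothetical, and the exact runProperties values depend on your environment and Litmus version.

```yaml
# Probes are declared under an experiment's spec in the ChaosEngine.
probe:
  - name: check-frontend-access          # hypothetical probe name
    type: httpProbe
    mode: Edge                            # run before and after chaos injection
    httpProbe/inputs:
      url: http://frontend.default.svc:8080/healthz   # hypothetical endpoint
      method:
        get:
          criteria: ==                    # hypothesis: the endpoint keeps returning 200
          responseCode: "200"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
```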
So let's actually go through the case study of how Orange is using Litmus. The environment they have is a large OpenStack system which is now moving to being managed by Kubernetes. It is a very large and very critical system, and Kubernetes has to manage it well. The entire challenge and use case is: how can I ensure that my Kubernetes is really reliable while it is managing this OpenStack application and its services? The solution is to apply Litmus to that Kubernetes system, keep executing various experiments, and continuously verify that OpenStack continues to run and that Kubernetes behaves as expected. With that, let's welcome Samar, who will talk about why Litmus and how, and take us through a quick demo.

Hello everyone, I'm Samar Siddharth, working as a lead software engineer at Orange, which is one of the leading telecom companies. Today I'm here to present a use case on improving the resiliency of Kubernetes applications. But before we begin, let's have a look at the complexity of telco infrastructure in comparison with general IT infrastructure. Telco has complex workloads that are tightly coupled with hardware, and there are many proprietary vendor applications running on it. In the telco sector, migration to cloud native is happening at a rapid pace, and many operators and vendors are embracing cloud-native technologies. If you look at the complexity scale, telco infrastructure needs to be highly secured, as it hosts user data, and with the technology now moving towards 5G, it also requires ultra-low latency and high throughput. Telco apps also have different requirements in terms of network and bandwidth, which call for acceleration techniques like SR-IOV, DPDK, and CPU pinning. To add to these complexities, telco infrastructure hosts different types of applications from domains like OSS and BSS, and they can come in the form of a VNF or a CNF.

Now let's talk about why we need chaos and resiliency testing. As you all know, Kubernetes is a dynamic and complex system and there are a lot of activities happening under the hood, which means Kubernetes components can interact in a number of unpredictable ways, causing emergent behavior. As a deployment grows in size, so does the number of possible interactions between these components, and with traditional testing these scenarios are hard to uncover. In real-world scenarios we also have resources that are customized and require focused testing to cover those cases.

Coming to the architecture of the system under test, here we are using two of the most widely used technologies, OpenStack and Kubernetes. As you can see in the diagram, Kubernetes serves as an underlay for the OpenStack services: one Kubernetes node hosts the control plane services, and the other node is the OpenStack compute host, which has compute-related services running on it like Nova, Neutron, etc. So there is a clear segregation of services based on the type of Kubernetes node. Additionally, we have applications like Vault for secret management and another component for enabling TLS communication; these are used by the OpenStack services, and that requires some additional testing so that we don't end up with a single point of failure with respect to the integration of these applications with the OpenStack services. As we proceed, we will cover a few such scenarios in the demo.

Let's see where Litmus comes into the picture. Litmus is a chaos orchestration framework that focuses on Kubernetes workloads and offers out-of-the-box generic test cases that cover both Kubernetes workloads and infrastructure, for example pod-delete and node-cpu-hog. Additionally, Litmus has a probe feature that lets us run customized validations, and it is highly configurable: we can configure the time between two probe executions and the particular instant at which we want to execute the probe, such as at the start of the chaos injection, during the chaos injection, or towards the end of the chaos injection.
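As a rough sketch of those scheduling knobs, the snippet below follows the Litmus v1alpha1 probe schema; the probe name, command, and values are hypothetical and are only meant to show where each knob sits.

```yaml
probe:
  - name: my-validation                 # hypothetical probe
    type: cmdProbe
    mode: OnChaos                       # other options: SOT, EOT, Edge, Continuous
    cmdProbe/inputs:
      command: ./validate.sh            # hypothetical check run by the probe
      comparator:
        type: string
        criteria: contains
        value: "PASS"
    runProperties:
      probeTimeout: 10                  # how long a single probe attempt may take
      interval: 5                       # gap between retries of a failed attempt
      retry: 2
      probePollingInterval: 30          # gap between consecutive runs in Continuous/OnChaos mode
      initialDelaySeconds: 60           # wait before the first run
```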
This gives the required flexibility to run the validations. Also, it is easy to integrate with our existing automation framework, and Litmus has great community support, which is really good for an open-source project.

In this slide, I will cover another open-source tool, Xtesting, which we have used for writing custom validations for Litmus probes. It is a very good framework for writing containerized test cases which are highly reusable and easily integrated into a CI/CD chain. It also offers multiple drivers for writing test cases, like Python unittest, Bash, and Robot Framework. Towards the bottom of the screen you see a sample output from an Xtesting run, which includes the list of test cases along with the projects to which they belong and the tier of the test cases, like healthcheck. It also contains the duration of each test case along with its final verdict.

In this slide, I am going to cover the testing workflow. We start with the identification of the application, called the application under test; this is identified in Kubernetes based on labels and selectors. Then we move on to the pre-validation step, where we perform pre-validation checks using Litmus probes and Xtesting, which we are going to see in the demo. Next is the chaos injection phase, where we inject chaos into the identified application and also run the on-chaos probes: basically we check the functionality of the application, using Litmus probes, while the application is under stress. Finally comes the post-validation step. Apart from validating the final state of the application, which should be up and running as it was at the start of the experiment, we can also have additional post-validation probes that perform custom checks for the application, along with other actions like cleanup of the resources that were created during the pre-validation step.

Coming to the use cases: "Resilience Realized" is the motto of this KubeCon, and as you can see, resiliency is also at the center of our use cases. We can utilize these open-source tools to build chaos and resilience test cases around Kubernetes workloads and infrastructure. What you see here are different scenarios, such as validating the resiliency of the containerized control plane. We can also use this to simulate issues and faults that occur in production and fix them properly in the pre-prod or development stages, since they can be reproduced easily through automation. Next, we can improve the monitoring and alerting system based on the observations from the chaos experiments, so that we get timely and meaningful alerts. We can also use this for validating high availability of the different control plane services, since we are integrating additional applications like Vault and the TLS components with the existing control plane services. And it can be used for end-to-end automation and for testing interdependencies among different applications.

In the first use case, we will be targeting the Vault application pods. In this experiment, the scenario is that we will delete the Vault application pods, which are deployed in HA; that is, we have three Vault application pods running, as you can see here. I will quickly go through the ChaosEngine experiment file. We have the application identifier here, that is, the label of the Vault pods. Going down, I will cover the important parts. In the probe section we have the first probe, which is related to unsealing of Vault: once all the Vault pods go into a sealed state, Vault becomes unserviceable, and since Vault automatically goes into a sealed state when we delete or restart all of its pods, it has to be unsealed manually. So the first probe is the unseal-vault probe, which runs at EOT, that is, end of test. The second probe is check-frontend-access, which checks access to the Vault endpoint. It runs in Edge mode, which means it runs at the start of the experiment and towards the end of the experiment. As you can see, there are different parameters that cover the gap between two probe runs and the timeout for the probes. Similarly, towards the end of the manifest we have set the pods-affected percentage to 100 and the sequence to parallel, so it will delete all the pods in parallel.
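For reference, the relevant pieces of that ChaosEngine look roughly like this. The endpoint, image, and unseal command are hypothetical stand-ins for what is shown on screen, and the exact layout of the cmdProbe "source" field varies slightly between Litmus versions.

```yaml
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: PODS_AFFECTED_PERC
            value: "100"          # target all three Vault pods
          - name: SEQUENCE
            value: parallel       # delete them at the same time
      probe:
        - name: check-frontend-access     # Edge mode: before and after chaos
          type: httpProbe
          mode: Edge
          httpProbe/inputs:
            url: http://vault.vault.svc:8200/v1/sys/health   # hypothetical Vault endpoint
            method:
              get:
                criteria: ==
                responseCode: "200"
          runProperties:
            probeTimeout: 5
            interval: 2
            retry: 1
        - name: unseal-vault              # cmdProbe run only at end of test (EOT)
          type: cmdProbe
          mode: EOT
          cmdProbe/inputs:
            command: ./unseal-vault.sh    # hypothetical unseal script
            source:
              image: registry.example.com/vault-unsealer:latest   # hypothetical image
            comparator:
              type: string
              criteria: contains
              value: "Sealed: false"
          runProperties:
            probeTimeout: 60
            interval: 5
            retry: 1
```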
Now I apply this ChaosEngine manifest. As soon as I apply it, we can see the Litmus pod getting created; that is what drives the experiment. If we go to the logs, we can see that the first probe has passed: it was able to get a 200 response code from the Vault endpoint. Similarly, we can see that, as per the ChaosEngine, all the pods are getting deleted at the same time. By now our Vault is in a sealed state, and it has to be unsealed towards the end of the experiment, which is taken care of by the unseal-vault probe. Now the unseal-vault probe has started, and we are towards the end of the experiment, so it will take some time to execute. The EOT probe, towards the end of the experiment, has completed, and we have passed all the probes. In the ChaosResult we can see that our experiment has passed, the final verdict is pass, and we can also see the status of the different probes and the probe success percentage, which is 100%; the status of the individual probes is also good. Since the check-frontend-access URL probe is in Edge mode, it was checked first at the beginning of the experiment and again towards the end, and the unseal step ran only towards the end, so it is a post-chaos probe.

Moving to the next experiment, it is again related to Vault, but with a slight difference: we will be deleting the Vault pods serially instead of in parallel, which means a single pod will be deleted at a time, and we can see whether failover is happening properly or not. The Vault URL, or endpoint, should remain reachable until all the Vault pods go into a sealed state. So in this case we will see a failure scenario where all the pods end up sealed and the URL validation fails towards the end. That is expected, and for this reason we have configured the Vault probe to proceed on failure, so we still continue with the experiment and unseal the Vault at the end, rather than failing and exiting. This is also an option: we can fail the experiment and exit as soon as a probe fails. I will quickly go through the ChaosEngine manifest for this one as well. It uses the same label, and the only difference is that the check-frontend-access URL probe is in Continuous mode, which means it is tested from start to end: it checks whether the endpoint is accessible at the start of the experiment, during the chaos injection phase, and towards the end. Towards the end it will fail, since we will be deleting all the pods sequentially; it will pass for the first two pods, and towards the end, when the third pod is deleted, all the pods are in a sealed state and the endpoint is no longer serviceable. The sequence we have selected here is serial.
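The delta from the previous scenario is small; a rough sketch is shown below, again with hypothetical values. The stopOnFailure flag is my assumption of how "proceed on failure" is expressed in the probe's runProperties.

```yaml
components:
  env:
    - name: PODS_AFFECTED_PERC
      value: "100"
    - name: SEQUENCE
      value: serial             # delete one Vault pod at a time
probe:
  - name: check-frontend-access
    type: httpProbe
    mode: Continuous            # poll the endpoint for the whole experiment duration
    httpProbe/inputs:
      url: http://vault.vault.svc:8200/v1/sys/health   # hypothetical endpoint
      method:
        get:
          criteria: ==
          responseCode: "200"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
      probePollingInterval: 5   # gap between consecutive polls
      stopOnFailure: false      # assumption: keep going so the EOT unseal probe still runs
```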
So the probe has passed and we are deleting vault-1. That one has already been deleted and has recovered; similarly we will cover vault-0 and vault-2. Now vault-2 is deleted. The probe is continuously polling the Vault URL, the endpoint, and we will see a failure towards the end when the last Vault pod is deleted. I have intentionally added a 30-second delay between consecutive chaos injections so that each pod can recover. Now the connection is lost as soon as we delete the vault-0 pod, so the probe has started to fail; it will eventually mark this probe as failed and proceed with the final probe execution, which is unsealing the Vault. Since all the pods are up and running again, it proceeded: it marked the check-frontend-access URL probe as failed, which was expected since all the pods went into a sealed state during the continuous probe evaluation, and now it is running the unseal-vault probe. We can see that the final probe has completed successfully and has unsealed the Vault, but the overall result is a fail, because the probe that was validating the URL failed towards the end. We can check this in the ChaosResult as well: it shows that the experiment failed while running the probes, and the probe success percentage is 50, since the probe was able to pass at the beginning but not towards the end, so it was partially successful. It also lists which probe failed, and that is the check-frontend-access URL probe.

Moving on to the next use case, where we are going to target one of the OpenStack services, the Nova scheduler. Quickly going through the ChaosEngine manifest for the Nova scheduler: here we are identifying the application by the namespace and by the label of the Nova scheduler. We have three probes, out of which two run OnChaos and the last one runs towards the end of the test case. The first OnChaos probe is create-resources, which creates OpenStack resources at the time of chaos injection, and similarly we have check-ping, which checks the reachability of the VM created by create-resources. Here we have adjusted the timing, since it takes some time to create the resources, so this probe is triggered after a delay of 180 seconds. Similarly, towards the end we run another probe that cleans up all the resources. The pods-affected percentage here is 20, which means a single pod will be targeted at a time, since we have four pods running in HA mode. So these are the pods; I will quickly apply this manifest file to start the experiment.

Let's follow the logs for this pod. This is the first pod getting deleted. Since we have set the pod selection to random, it will pick any pod randomly, delete it, and continue to do that until the total chaos duration is completed, waiting 30 seconds between each chaos injection so that the previous pod has time to recover. Here we can see that the resources are getting created, and here a pod is getting deleted at random; the first one was this one, based on the uptime, and this is the one that was deleted last. We have kept multiple iterations of this chaos injection for the Nova scheduler pods so that it overlaps with the on-chaos probes.
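For reference, the probe and tuning section of this Nova scheduler scenario looks roughly as follows. The namespace, labels, image names, and commands are hypothetical stand-ins for the Xtesting-based containers shown in the demo, and the exact "source" layout can vary between Litmus versions.

```yaml
appinfo:
  appns: openstack                                  # hypothetical namespace
  applabel: application=nova,component=scheduler    # hypothetical Nova scheduler label
  appkind: deployment
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: "600"
          - name: CHAOS_INTERVAL
            value: "30"           # 30-second gap so the previous pod can recover
          - name: PODS_AFFECTED_PERC
            value: "20"           # one of the four nova-scheduler pods at a time
      probe:
        - name: create-resources           # Xtesting-based resource creation during chaos
          type: cmdProbe
          mode: OnChaos
          cmdProbe/inputs:
            command: run_tests -t resource-validator             # hypothetical invocation
            source:
              image: registry.example.com/xtesting-openstack:latest   # hypothetical image
            comparator:
              type: string
              criteria: contains
              value: "PASS"
          runProperties:
            probeTimeout: 300
            interval: 10
            retry: 1
        - name: check-ping                 # VM reachability check, delayed until resources exist
          type: cmdProbe
          mode: OnChaos
          cmdProbe/inputs:
            command: run_tests -t check-ping                      # hypothetical
            source:
              image: registry.example.com/xtesting-openstack:latest
            comparator:
              type: string
              criteria: contains
              value: "PASS"
          runProperties:
            probeTimeout: 120
            interval: 10
            retry: 1
            initialDelaySeconds: 180       # wait for resource creation before pinging
        - name: cleanup-resources          # cleanup at end of test
          type: cmdProbe
          mode: EOT
          cmdProbe/inputs:
            command: run_tests -t resource-deletion               # hypothetical
            source:
              image: registry.example.com/xtesting-openstack:latest
            comparator:
              type: string
              criteria: contains
              value: "PASS"
          runProperties:
            probeTimeout: 120
            interval: 10
            retry: 1
```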
We want to make sure that the resources being created overlap with the chaos injection window, so that we can actually test the availability of the services to the end user and see whether there is any degradation of the services or any impact on the creation of OpenStack resources with respect to the Nova scheduler pods. This is the log of the Xtesting container that created the resources, and it has passed. Now it has proceeded to the second OnChaos probe, which checks the connectivity of the VM that was created. One thing that needs to be taken care of here is aligning the timing of the chaos experiment: we do have options for aligning the probe runs with the chaos injection, and that has to be set carefully. Now the connectivity check has also passed; there was no packet loss observed, it was 100 percent successful, and the final verdict is also a pass. All the probes passed without any issues, so we can now say that the Nova scheduler is resilient to the pod-delete experiment.

This is a sample output of the Xtesting-based probe containers: resource-validator is the one that created the resources, check-ping was the connectivity check, and resource-deletion was the cleanup part. And this is another probe that we used to run Ansible roles and take corrective actions during the chaos injection phase or the post-chaos injection phase. Similarly, this is the Litmus result which I just showed you; it captures all the details and makes it easy to integrate with a CI/CD chain. That's all from the demo perspective. Thank you all, thank you for your time, and thank you for joining this demo.

Welcome back. I hope you liked the demo by Samar, where he showed a couple of scenarios by injecting chaos into his running OpenStack system, verified that the system continued to function, and checked the steady-state hypothesis at various points using Litmus probes. So the summary is: you will be able to do deep chaos, and you should be doing deep chaos in an operating environment to validate the functioning of your application and of your Kubernetes. Samar and his team at Orange are able to do that successfully, and I hope you will be able to do something similar if your needs are similar. Use Litmus for your reliability needs. You can get started with the Litmus docs at docs.litmuschaos.io, and if you need any help with Litmus, we will be available on the Litmus channel in the Kubernetes Slack community. With that, thank you very much folks, and have a great KubeCon. Cheers.