Hey everyone, hope you're all enjoying KubeCon. Thanks for joining this maintainer track session. We'll be talking about LitmusChaos; we are the maintainers of LitmusChaos. We'll get to this slide soon, but this is our talk, Bringing Fire to the Cloud, which is about managed services, because most of us use managed services to run our Kubernetes workloads. We'll look specifically at what causes problems in managed services and how we can mitigate them using Litmus, and we'll also talk about Litmus itself. Those of you who are Litmus users, or new to Litmus, should get a good feel for it, and then hopefully start contributing and become part of the community.

So this is us: I am Sayan, and he is Shubham. I'm a senior software engineer at Harness and also a maintainer of Litmus. And Shubham here is also a senior software engineer at Harness and a maintainer of Litmus.

Cool. So we'll both be sharing some Litmus wisdom, hopefully. Let me start by jumping into the dependency dilemma we face today. A lot of us use managed services; a lot of us rely on them, because not everybody has everything on-prem, their own data center and all of that. It's costly. Now, the problems we get. First, a lack of control, obviously, because we have limited visibility into what they're providing us; we build our services on top of something that's theirs. Second, vendor lock-in. You might argue that we don't really have vendor lock-in since we have our systems set up across multiple clouds, so we can pull everything out of one and shift it to another. But it's still lock-in, because the layer between your services and the actual vendor is opaque to you; there are proprietary technologies in use that may not be open to you at all. And then there's a trust issue. The availability, the reliability, the uptime, everything rests on what they promise you. It's a trust factor.

Cool. So before jumping into chaos engineering, here's a scary slide, because outages are expensive. These are some actual outages that have happened; I'm not calling anybody out, trust me. Outages like these create a kind of fear, because they're very costly: costly to the business, costly to the developers, to the company in general. Not only do they hurt the company's reputation, they also erode customer confidence and the employees' confidence in their own systems. To mitigate this we have chaos engineering, which is the generally practiced remedy today. It wasn't around a few years ago, but it's climbing well up the maturity model now. So what exactly is it? Chaos engineering is the practice of deliberately injecting disruptions into your system to test how it behaves. Is your system resilient or not? If something happens, does it have the ability to come back up? All of these things we can test via chaos engineering. Now, what are the principles we should typically adhere to when we practice chaos engineering?
First of all, you should have a system to test, obviously, and then you need to simulate certain experiments. This has to be specific to your use case: you select certain faults, chain them together, and build a hypothesis around them. Then you run the chaos experiments on the target system, observe using monitoring, APM tools, and observability, and then use those learnings to improve your system. That is the full chaos engineering cycle we advocate.

Now, the more the better, but it's a little scary, because the more chaos you run, the more it costs, and the more things you break. So it's not quite the perfect thing to advocate, but at the same time it is, because you are genuinely making your system reliable, and you can expect fewer outages; you can trust your own service's uptime more when an actual outage happens. As for quality, speed, and developer productivity: chaos engineering drives developer productivity in terms of the time developers save by not having to chase down what went wrong and bring things back up. Speed too: if you keep doing this again and again, in game days, at set intervals, you become much more resilient to things breaking, so your deployment and release cadence naturally increases. And quality: if you practice chaos, your quality is simply better than that of teams who aren't practicing it.

Now, jumping into the context. This is something I specifically wanted to talk about, to give you an example of how a vendor-locked system might look. A lot of people depend on managed services today, and managed services are very good because they give you everything out of the box; you just plug and play, easy to use, easy to manage. But there's this layer, right? You don't really know what's inside, what proprietary tooling or what infra they're using. It's generally an opaque space; we just rely on them. Maybe it's great, maybe it's 100% uptime, but that's hard to believe. Something may break somewhere, and you have to wait for them to fix it before your systems come back up. So you may be practicing chaos and doing everything 100% right, but only on your side of the line, not on theirs. What we're advocating is breaking stuff not just in what's visible to you, but across the entire infrastructure, whether it's opaque to you or not.

Now, some examples. This is not what we will be showing you today, but it's a reference for what you can do, what I meant on the previous slide. Some scenarios where chaos engineering applies to a managed service, AWS specifically: say your RDS fails over. You might want to verify your data fallback mechanism, your load balancing, or your high-availability and scalability behavior.
These kinds of things you can test via chaos engineering, specifically in the database context. Then there's EC2, a widely used service: you can check the impact of a failure, what exactly is affected when you induce or simulate one. Is it quickly recoverable? Does it have high availability? Things like that you can check with a service like EC2, and similarly for many other services. Lastly there's Lambda, which is serverless. Serverless depends a lot on triggers and on the configuration you provide, and there are multiple layers to it, so you might want to check each and every layer, which might be costly, but it's safer. You can check the triggers, or misconfigurations that may affect your Lambda's performance or its general behavior. If you push a wrong configuration, your Lambda can go down, and your triggers with it. So you might want to test the serverless architecture around your Lambda using chaos engineering.

Now, some of the hows of chaos engineering; up to this point these were the whys. How can you actually practice chaos engineering in a context like Lambda or AWS managed services? There are misconfigurations, like I mentioned: improper network configuration that may lead to disruptions, or inadequate resource allocation, where you end up provisioning more or less than your actual optimum load, which obviously causes spikes. And then there's mismanagement of Lambda through RBAC permissions that shouldn't have been granted. You can manage Lambda effectively by controlling your RBAC and your individual triggers. A lot more could be done; this is just a short list.

Now for the database. This also covers your data-related services; it doesn't have to be RDS specific. Simulated or synthetic load means deliberately pushing load into your databases, which inflates resource usage, so you can check whether resource usage actually hits what you provisioned for, or stays below it. You can test the scalability and high availability of the data. You can also check whether data loss actually impacts customers, and if it does, what the possible solution is: maybe you want a caching service, maybe you want to avoid data loss by storing it somewhere else. What happens when the database is down entirely, where does the data go? Things like that you can build a hypothesis around and target for services like this.

So, introducing Litmus. This is the project we both work on; we are maintainers along with a couple of other amazing maintainers. The project started back in 2017. It started as a 1.0 and is now at 3.4, which is open source and generally available: the APIs are open source, the command line is open source. The project originally started as a way for SREs to observe and explore chaos engineering, some of the simpler kinds of chaos, like breaking OpenEBS storage and figuring out how to mitigate that.
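To make the EC2 scenario concrete before we go further, here is a minimal sketch of what such a fault looks like when driven through Litmus: a ChaosEngine that runs the ec2-terminate-by-id fault from the public kube-aws chart. The structure follows the public ChaosEngine schema, but the instance ID, region, and duration are placeholders, so treat this as illustrative rather than a copy-paste recipe.

```yaml
# Illustrative sketch (not from the talk's slides): a ChaosEngine running
# the ec2-terminate-by-id fault from the public kube-aws ChaosHub chart.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: ec2-chaos
  namespace: litmus
spec:
  engineState: active                # set to 'stop' to halt the run
  chaosServiceAccount: litmus-admin  # service account with AWS fault RBAC
  experiments:
    - name: ec2-terminate-by-id
      spec:
        components:
          env:
            - name: EC2_INSTANCE_ID        # placeholder target instance
              value: "i-0123456789abcdef0"
            - name: REGION                 # placeholder AWS region
              value: "us-east-1"
            - name: TOTAL_CHAOS_DURATION   # seconds the fault stays active
              value: "120"
```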
But then it got a lot of traction from the community, and we grew it into a much larger project. It's currently an incubating project in the CNCF. The main objective is to install Litmus as an on-premise solution in your cluster, or your customer's cluster, then use it to find vulnerabilities, chaos vulnerabilities, by inducing chaos with multiple faults, figure out what's wrong in your system, and then of course improve its reliability and resiliency. So that's Litmus for you. It's a growing community, and we'd love for you to be part of it if you aren't already. Next, I'd like to hand over to Shubham, who will talk specifically about AWS chaos and take it from there.

Hi, I hope all of you can hear me. So in Litmus we have the Litmus SDK, with which you can generate a new experiment based on your hypothesis or use case. Once the experiment is created, you can upload that experiment, the charts, to a ChaosHub, which is your custom hub. The default hub has 50-plus experiments, so if your use case needs a new one, you generate it, upload the chart to your own hub, connect that hub to our front end, and run the experiment from there.

To generate a new experiment with the Litmus SDK, there are a few steps, which you can see on the left-hand side. The first step is the steady-state check. It depends on your target: if it's a Kubernetes pod, you check the pod status; if it's an AWS instance, you check the instance status. So you check the status of your targets, and if there's any other SLO you want to verify, you can run probes in SOT (start-of-test) mode. The next step is to inject chaos: you run the actual business logic of your chaos experiment as part of the chaos injection, which runs for a specific duration, the chaos duration. Once that duration is over, the chaos is reverted. The revert-chaos step is optional in the case of managed services: if you're stopping a service that the managed service will ideally restart by itself, you don't need to write a revert step; if it isn't managed, you do. The next step is the steady-state check again: the same checks we ran pre-chaos, run post-chaos once the chaos is over. The final step we don't need to write ourselves, it's generated automatically, and it gives you the reports, the results. It gives you the target details where the chaos was injected, the historical runs if you execute the same experiment multiple times, and, if you're running probes, the probe details as well: your resiliency score, and if a probe is failing, why it's failing, all the statuses.

On the right-hand side we take a simple example experiment. In the yellow box, you can see we are stopping an EC2 instance that is part of an auto-scaling group, and we want to create an experiment exactly for that.
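To sketch what that pre/post steady-state check could look like for the auto-scaling example, here's a hedged cmdProbe that counts InService instances in the group via the AWS CLI. The probe schema follows the Litmus probe docs, but exact field formats vary across Litmus versions, and the ASG name, the expected count, and the assumption that the aws CLI is available where the probe command runs are all ours.

```yaml
# Hedged sketch: a cmdProbe asserting the auto-scaling group keeps its
# expected InService instance count at start and end of test (Edge mode).
# "demo-asg" and the expected count of 2 are placeholders; the aws CLI is
# assumed to be available in the probe's execution context.
probe:
  - name: check-asg-instance-count
    type: cmdProbe
    mode: Edge                       # runs pre-chaos (SOT) and post-chaos (EOT)
    cmdProbe/inputs:
      command: >
        aws autoscaling describe-auto-scaling-groups
        --auto-scaling-group-names demo-asg
        --query "length(AutoScalingGroups[0].Instances[?LifecycleState=='InService'])"
        --output text
      comparator:
        type: int
        criteria: "=="
        value: "2"
    runProperties:
      probeTimeout: 30s
      interval: 10s
      attempt: 3
```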
So let's walk through generating the experiment for this specific scenario. The first step is the steady-state check we saw in the generic flow: for this experiment, we validate the status of the EC2 instances. Once validation is done, the second step is to inject the chaos: we stop the EC2 instances. Because these are part of a managed service, we don't need to write the revert logic here; they will be restarted automatically. But in the post-chaos step we still need the steady-state validation, and for that we verify the instance count: auto-scaling is enabled, so once the chaos duration is over the instances should come back, and we just check that the number of instances matches.

And this is the actual generation part. In the Litmus SDK we have attribute.yaml, where you provide the experiment details, mainly the experiment name and whatever RBAC permissions are required. There are a lot of parameters whose values you can tune. Once attribute.yaml is tuned, you generate the experiment with the Litmus SDK. The generate experiment command supports multiple experiment types; there are templates for the different types: Kubernetes, AWS, GCP, Azure, VMware. You pass your type with the -t flag (in my case AWS here) and the attribute.yaml file path with -f. From that input it generates the experiment file containing the orchestration logic, the chaoslib file containing the actual business logic, and the configuration files as well. The generated code gives you the main flow of the fault automatically, but the actual business logic and steady-state checks we need to add ourselves. On the right-hand side you can see the TODO items: it creates TODO items for pre-chaos, the business logic, revert chaos, and post-chaos, and based on your scenario or hypothesis you fill them in. For example, for EC2 stop we just saw the steady-state validation and injection steps, so we fill those into the TODO items, and then we can validate the experiment.

Once the experiment is tested locally and working fine, we can generate the charts using the generate chart command. We use the ChaosEngine and ChaosExperiment custom resources: the ChaosExperiment contains the experiment's tunables, and the ChaosEngine binds the experiment to the application, so it contains the application details, which you can configure from the UI. In the UI we have a visual representation as well as the YAML representation, and the two are always in sync: change anything in the visual editor and it's updated in the YAML, and vice versa. There is also the ChartServiceVersion, the CSV file, which the hub uses for the visual representation and all the mapping. So when you generate the chart, it creates all three files, and you can then upload them to the ChaosHub. For the ChaosHub we have the chaos-charts repo, where those 50-plus experiments live.
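As a rough sketch of that flow, pieced together only from the steps just described: an attribute.yaml with the experiment details, then the two generate commands. The field names in the YAML and the exact CLI spelling are indicative, not authoritative, so check the litmus-sdk documentation for the real schema.

```yaml
# Indicative attribute.yaml for the Litmus SDK; the field names here are
# illustrative placeholders, not the authoritative schema.
name: ec2-stop-by-id            # name of the new fault
category: aws                   # chart/category it belongs to
description: "Stops an EC2 instance that is part of an auto-scaling group"
# ...plus the RBAC permissions and other tunables the SDK asks for.

# Scaffold the experiment (orchestration logic, chaoslib, config files),
# passing the template type with -t and this file with -f as described above:
#   litmus-sdk generate experiment -t aws -f attribute.yaml
#
# After filling in the TODOs and testing locally, generate the chart files
# (ChaosExperiment, ChaosEngine, and the ChartServiceVersion CSV):
#   litmus-sdk generate chart -f attribute.yaml
```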
You can clone that repo and upload these files to your own forked copy of chaos-charts, or to another branch, and then connect that hub to the UI. In the UI you'll see your fault, and you can run it directly from there. So that's the Litmus SDK and how you can generate a new experiment for your use case.

The next thing is recent updates to Litmus. We recently released Litmus 3.0. As part of 3.0 we've improved the UI; the whole user experience has changed. We've also updated the chaos SDK: the experiment-creation logic has changed from 2.0 and is now simplified, so users can run their faults seamlessly. Next, resiliency probes. Earlier, the probes we use to validate SLOs were mapped one-to-one with a fault; now, with Litmus 3.0, you create a probe once and reuse it across all your experiments, and for each resiliency probe you can see all the experiments using it and track its historical data individually. Next, we've added high availability for Mongo: we use MongoDB to store all the data, and we've added HA support for it. Next, an enhancement to the network faults: earlier they blocked traffic only for a specific IP or host, and across all ports. Now you can block traffic by source port and destination port as well, so you can target specific ports, and you can also add a deny/block list: if you want to block traffic on all ports except specific ones, you give a comma-separated list in that ENV and those ports are excluded from the chaos injection. We've also recently been adding unit tests and fuzz tests to increase the code coverage. And we have litmusctl, the CLI for connecting hubs and doing all the CRUD operations in Litmus, running experiments and so on; probes are integrated there now too, so you can perform CRUD operations on probes (which do the SLO validation) from litmusctl as well.

These are our items on the future roadmap. Litmus core means the execution-plane side, the experiments side. In the UI there's an option to pass a security context for the experiment pod, but right now it is not propagated to the probe pod doing the SLO validation or to the helper pod; as part of the roadmap it will be propagated to both. Next, the pod-memory-hog experiment, which spikes the memory of your pod: right now we take an absolute value for how much memory you want to consume or spike. We will support a percentage as well, so out of the pod's total requested or limit memory, you can give a percentage and it will consume that much.
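As a hedged illustration of that network-fault enhancement, here's roughly how scoping a pod network fault to specific ports could look on a ChaosEngine. The DESTINATION_PORTS env name follows the public Litmus fault docs for the 3.x network faults; the fault name and values are placeholders.

```yaml
# Hedged sketch: limiting a pod-network-loss fault to specific destination
# ports, per the 3.x enhancement described above. Values are placeholders.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-chaos
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: DESTINATION_PORTS      # comma-separated: only these ports
              value: "8080,9090"           # are hit by the fault
            - name: TOTAL_CHAOS_DURATION   # seconds of packet loss
              value: "60"
# Per the talk, a deny/block list works the other way around: list the ports
# that must be excluded, and the fault applies to all remaining ports.
```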
In the current implementation of target selection, when you run a fault against an application, you filter the pods by app kind, which is Deployment, StatefulSet, DaemonSet, all those kinds, then you provide the namespace, and then you configure the label as well. The enhancement we want to make is to also support selecting by app kind alone: if you provide the Deployment kind without a label, it will randomly pick targets, which simply increases the blast radius. So if you want to randomly delete pods belonging to a specific app kind without any label, we will add that support, and likewise label without app kind. Also, right now we don't support set-based label selectors, so we will support those as well. Next: right now we support specific workload types, Deployment, StatefulSet, DaemonSet, four or five types in all. We will support custom controller types as well: if your pods are managed not by a ReplicaSet or a Deployment but by some custom controller, say one built with the Operator SDK, we'll be able to target those pods too.

For the probes: right now we run SOT and EOT probes serially, one by one, for those modes. We will implement running those probes in parallel as well as in serial; we'll provide both sequences.

And in the HTTP chaos, you can add latency or status-code failures, but right now the fault applies to all the APIs, all endpoints of the application. So if your application has readiness and liveness probes, those will fail too, because the fault affects all the APIs, and the pod will be restarted. In that scenario, because the pod gets restarted, the chaos sometimes isn't reverted. So we will change the logic: instead of using the container PID we can use the sandbox PID, which doesn't change when the container restarts, so the revert will always happen even if your pod got restarted.

We also have Continuous and OnChaos probes, which run repeatedly for the whole chaos duration, over multiple iterations. We will add iteration counts to the results: it will say the probe ran 20 times and, out of those 20, how many times it passed, 8 or 10 times; we'll expose those stats. Along with that, we'll improve the verbosity: for each iteration, in the logs, we'll show which iteration is running, what the expected code was, and what the actual code was, that type of verbosity.

On authentication, just one more point. Most of these are community-raised issues or community-raised enhancements, and this one is also from the community side: for login, we will do a Dex integration, so from the login screen you'll be able to log in via Dex as well.
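To ground the target-selection discussion, here's a hedged sketch of how an application target is declared on a ChaosEngine today, with namespace, label, and app kind all specified together; the roadmap items above would relax this to kind-only, label-only, or set-based selection. The appinfo fields follow the public ChaosEngine schema; the values are placeholders.

```yaml
# Hedged sketch: today's target selection on a ChaosEngine. All three appinfo
# fields are given together; the roadmap would allow kind-only or label-only
# (and set-based label) selection. Values are placeholders.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  appinfo:
    appns: demo              # target namespace
    applabel: app=nginx      # equality-based label; set-based not yet supported
    appkind: deployment      # one of the supported workload kinds
  experiments:
    - name: pod-delete
```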
The next items are from the UI and front-end side. In the front end right now, when you select the infra on which you want to inject chaos, it lists all the infras. We will add grouping based on environment: each environment can contain multiple infras, but right now the list shows infras from all environments together. With grouping, if you have the same infra in two different environments, you can uniquely identify which infra belongs to which environment. And in air-gapped environments, people connecting the default hub were seeing issues, because it points at a public GitHub URL, the chaos-charts repo. So we will provide an option to configure it: you'll be able to convert the default hub into your own hub by changing the URL to wherever you host the charts, or disable the default ChaosHub entirely and connect your own, based on where you host it.

The next set of items is on the orchestration side. Right now we don't support multiple projects or multiple owners in the Chaos Center; we will add multi-project and multi-owner support. This is an LFX issue, and an LFX mentee is working on it. The next item concerns the subscriber, our execution-plane component that communicates with the control plane: it polls the control plane for tasks and executes whatever task it receives. We will add client-certificate, or custom-certificate, support to the subscriber, so if somebody is using HTTPS they can bring their own certificate. Next is halting tests when the control plane is down. Litmus is not SaaS; people run it in their own clusters, and if the control plane goes down, the experiment currently gets stuck: it won't halt automatically, and once the control plane is back you have to click in the UI to halt it. We will add support so that if communication is broken beyond some critical threshold we define, the experiments are aborted automatically. And right now there are two ways of installing the infra: cluster-scoped or namespace-scoped. Namespace scope allows only that one namespace, while cluster scope lists all namespaces and lets you run chaos in any of them. We will add the ability to restrict namespaces: if you install Litmus in cluster-scoped mode but don't want to allow chaos in certain namespaces, you can add a list of namespaces where chaos must not be injected, and the UI won't show those namespaces when you configure the target application. Then we will add a force-disconnect option to clean up all the resources: right now when you delete an infra, it's removed from the UI and you're given some commands to run on the cluster yourself; the force-delete option will clean everything up automatically. And the next thing is ChaosHub support for Azure Git. Right now we can connect hubs directly from GitHub,
but we can't connect from Azure Git; with this change, you'll be able to connect a ChaosHub from Azure Git as well, if that's where it's hosted. Then, for the Kubernetes experiments we use the kube API, and sometimes, because of API throttling or some other reason, the kube API calls fail; we will add retries there to make sure the experiments always run reliably. And on unit and E2E coverage, we will keep adding unit tests and fuzz tests to increase the coverage.

Yeah, so you can join the Kubernetes Slack, and in the Kubernetes Slack you can join the Litmus channel. We have a ContribFest for Litmus at 2:30; you can scan this QR code for the ContribFest. And if you have any feedback, you can scan this QR code to share it. Thanks for listening; we're open to questions.