Hello, everybody. Welcome to another Cloud Native Live. My name is Mario Loria. You've seen me before, I'm sure; I've been hosting Cloud Native Live on and off since we started the Cloud Native TV network this year. Thank you so much for joining us for another wonderful episode, where we are going to dive into chaos engineering. Today I have Karthik with me from ChaosNative. They're a company working on resilience engineering in our world of cloud native, leveraging chaos engineering to achieve that with their product, LitmusChaos. I'm not going to get into Karthik's background myself; what I'm going to do is try to get all of your questions answered. Please leave your questions in the platform of your choice that you're using to watch this right now. We thank you for tuning in and spending time with us today. This is going to be a really, really fun session, and I have a lot to learn. I know chaos from the Chaos Monkey project on GitHub and from talks by Netflix engineers like Adrian Cockcroft and others who have been pushing for chaos to increase the resiliency and reliability of your platform. It's very difficult to hone in on and get just right, and it's also very scary for a lot of organizations. Knowing, from companies I've been at, what they've been able to do and what they've been comfortable doing, introducing chaos is actually really scary, and it's hard to take that first step. So I think Karthik is going to be teaching us a ton of great stuff, and I'll leave him to get into his background. Again, thank you everybody for joining; please leave your questions, comments, and thoughts in the chat. I am monitoring those, so I will be sure to get those questions asked to Karthik, and we'll be going through a wide gamut of different areas in chaos engineering, the ecosystem, and cloud native's role. Please also join the public Slack channel, #cloud-native-live, if you want to chat with more people; I think myself, Karthik, and others that have been on the show are definitely hanging out there. And check out ChaosNative while you're watching: @ChaosNative and @LitmusChaos on Twitter, and chaosnative.com. I'm scrolling through here and I even see that Viktor Farcic, from the DevOps Toolkit I think, has done an episode on the integration with Argo Workflows, which is super exciting. I didn't know it existed, and I use Argo, so I have a lot of work to do this week to share this with my team. Without further ado, Karthik, thank you so much for joining us. I'll let you take it away. Go ahead.

Thank you, Mario, really excited to be part of Cloud Native Live and discuss chaos engineering and LitmusChaos. Thank you for the introduction. Like Mario said, my name is Karthik. I work for a company called ChaosNative, one of the main maintainers of the open source CNCF Sandbox project called LitmusChaos. Today we want to discuss what cloud native chaos engineering is and talk about the project; there's a new version that was released just late last week, I should say. Litmus 2.0 is out, and it has some improvements over the earlier 1.x versions. In the process of talking about the project, I'll introduce you to what Litmus 1.x did, the feedback it received from the community, and how 2.0 was created from that. And we will go through a couple of demonstrations to show what this platform is about.
And I hope this encourages you to start on your chaos engineering journey in your organizations. So please feel free to ask us any questions; we'd be happy to answer. That's about it.

All right, with that said, I just want to introduce what chaos engineering is. I'm sure a lot of folks already know about chaos engineering: you might be practicing it, you might be practitioners, or you might have heard about it or read about it. Mario mentioned Adrian Cockcroft and Netflix; they were really the pioneers that started the movement of chaos engineering. Netflix, along with a few of the other organizations that were early adopters of chaos, came together and created the basic tenets of chaos engineering. You can look at the website principlesofchaos.org, which carries a lot of information about the idea, why it is important, what its principles are, and so on. As a brief summary, I have a couple of slides that I use to talk about chaos engineering and the state of chaos engineering today, with all the cloud native revolution that is going on. I hope my screen is visible. Okay, I'm hoping the screen is visible; let me just quickly run through this.

One of the basic reasons why chaos engineering is very important is that downtime is very expensive. We have had past incidents in organizations that are generally very resilient where downtime has cost a lot of money, and that is something we would want to avoid. We would like to test how the system behaves under different kinds of failures, how the deployment environment can be improved, and how the application's resilience can be improved in order to withstand this turbulence and still be available. A lot of the services that we consume on a day-to-day basis have publicly available SLAs. For example, Google Cloud and Azure will say a service is 99.95% available. Chaos engineering is a practice which actually helps you verify whether you're able to provide that kind of SLA and be available all the time.

By definition (there are a lot of definitions on the internet; here are a couple that I've picked), it is a method by which you test a distributed computing system so that it can withstand unexpected disruptions. You're testing some of the components and some of the assumptions you had when you built the code. For example, you assumed the network is always going to be alive, or that you have unlimited compute, storage, and bandwidth. That's not often the case; there are failures happening all the time. We need to check whether we can withstand the disruption, recover quickly enough, and continue to provide the service at an acceptable level to the users. And chaos engineering is not about reckless fault injection. It's a scientific process by which you identify a control group and an experimental group, you inject faults in a very controlled way, you ensure there is a minimal blast radius when you're injecting faults, and then you see what happens. You try to learn about the system. Sometimes you go in with a hypothesis that is proved; sometimes it is disproved. If it is disproved, that's even better, because it means there is something new that you've learned. You can go back and fix your application, or your deployment practices, or maybe improve the underlying infrastructure and make it more resilient. There are a lot of things you can do: repeat your experiments, gain confidence, and so on. That is generally the practice of chaos engineering.
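To put those published SLAs in perspective, 99.95% availability leaves very little room for downtime. A rough back-of-the-envelope calculation (approximate figures, not from the talk):

```latex
% Allowed downtime at a 99.95% availability target (approximate)
(1 - 0.9995) \times 8760\ \text{h/year} \approx 4.4\ \text{hours of downtime per year}
\qquad
(1 - 0.9995) \times 43200\ \text{min/month} \approx 21.6\ \text{minutes per month}
```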
Because of the times we are living in today, the pandemic, this is probably a better analogy: it is like a vaccine, where you inject a small amount of harm and try to build immunity from outages. That's what we try to do.

The standard chaos experiment flow is like this. You identify some steady-state conditions; that's part of your hypothesis. You decide how much deviation there can be from your steady state, if any, and how quickly you are expected to recover. Then you go ahead with the fault injection and verify the hypothesis. If it holds, you go on to the next fault, maybe a more complex one that you then test; you might call yourself, or rather your system, resilient to the fault that you injected, and then you move on. Or, if there is a weakness you found and your hypothesis was disproved, then you go back to the drawing board and check what went wrong, what needs to be better, make those fixes, and repeat once again.

Chaos engineering has traditionally been done in production. For a long time, that was the philosophy: chaos engineering is most effective and useful when you do it in production, and the principles of chaos say as much. But with the recent proliferation of Kubernetes and the evolution of the cloud native paradigm, a lot of organizations are re-architecting their applications. They're moving away from the monolith model and creating everything as microservices, containerizing it, and running it in new deployment environments, mostly Kubernetes. So there's a lot of apprehension about how things are going to work, and folks are probably not ready to do chaos engineering in production from the get-go. There's a lot of chaos experimentation that is done in pre-production environments to gain confidence before it is really done in production. That is the change we've seen happening in the chaos engineering world over the last couple or so years.

We mentioned Adrian Cockcroft, Netflix, and Amazon at the beginning of this discussion. There is a principle that Adrian is a big advocate of, called the chaos-first principle. It is about doing chaos engineering in a more ubiquitous and democratic way. You start doing it in development environments, you do it in staging and pre-production, maybe you add failure tests as part of your CI/CD pipelines, and you do SLO validations during chaos experimentation. Basically, you validate whether your system continues to stay alive and your objectives are met under chaos before, let's say, your application is moved into production. Then, as you mature, you start doing the actual game-day chaos experiments in your production environment and see whether your system holds up there. That is what we've seen happen in recent times.

I mentioned how Kubernetes and cloud native are a factor in getting people to do chaos engineering earlier and more often. That's because in a Kubernetes-based deployment environment there are so many variables, so many factors. Kubernetes itself is quite dense; there are a lot of components and services, and you're hosting it on top of some platform infrastructure.
Then you have a lot of tooling and frameworks that you've pulled from the CNCF landscape for service discovery, monitoring, and storage; then you have your direct application dependencies, your databases and message queues; and then your app with all its services, your middleware, your front-facing, user-facing services, and so on. A lot of things can go wrong, and it is important that all these components work well together to provide the user experience that you have guaranteed to the users of your service. So it has to be possible to test out various scenarios, and to test them often.

One of the pillars of the cloud native way of doing things is to release fast, to keep everything as microservices, to ensure that everything is declarative, that Git is the source of truth, and that you have controllers which ensure your infrastructure and code stay in sync with that source. Changes are happening at a fast pace. So you need chaos engineering to borrow some philosophies from this model: ensure that your chaos intent can be declarative, ensure that you automate steady-state hypothesis validation as part of the experiment run, ensure that it lends itself to GitOps, and ensure that you have the same homogeneous experience you have had with other things on Kubernetes. Whether you're talking about defining application lifecycles, security, or resource policies, everything is defined as resources, and you have controllers in Kubernetes to manage them. You would like to bring that into your chaos engineering as well.

So that is an introduction to what chaos engineering is generally, and to this category of cloud native chaos engineering. Let me introduce you to the Litmus project. This is an open source project which has been around for about three years or so now, and it provides you a platform for doing chaos engineering on Kubernetes. We've also started expanding it, providing capabilities to do chaos against non-Kubernetes infrastructure: EC2 instances on AWS, GCP VMs, VMware VMs, and so on. The Litmus platform runs as a set of microservices on Kubernetes; it uses Kubernetes as the substrate to run the chaos services, so to say. You can pull ready-made, off-the-shelf experiments from what we call the ChaosHub. The ChaosHub is an open marketplace with a lot of common scenarios that you might like to execute. You can pull the fault templates, install them on your cluster, and define a custom resource that maps the fault to any object on your cluster, whether a node resource or a pod resource, and you go from there. That's what Litmus is about, very simply speaking: it has a set of custom resources to define chaos and steady-state validation at its heart, a controller that reconciles these resources and carries out the chaos experiment (the fault injection business logic), and a way for you to look at the results of the experiment so performed and gain some information about your application's and infrastructure's behavior.

The project was started, in fact, to serve the resilience testing needs of another CNCF project called OpenEBS. Over a period of time it acquired a roadmap of its own and became more popular in the community, so requirements started coming in.
And we went from being a platform that can do chaos in a cloud-native way (you define your intent in a CR, and a controller carries out the experiment for you) to making it an end-to-end platform, because chaos engineering has a lot of other requirements: in terms of observability needs, in terms of defining blast radius in a very controlled way, in terms of ensuring that your chaos results are analyzed over a period of time to give you useful information about your system. There are a couple of KPIs associated with a chaos engineering practice, so we wanted to help you see how those KPIs are doing as far as your practice goes. Some of that information we wanted to bring in, and we also wanted to make it easy for folks to do complex experiments, not just simple faults. We started off with simple faults, of course, but sometimes you want to generate complex conditions. You want to provide the steady-state validation as part of the experiment run, and sometimes there are very diverse ways of verifying steady state, so you want to plug all of them in. So that is what we did to take Litmus from its initial stage, the 1.x releases, to 2.0 and what it is now. That's an introduction to the philosophy of chaos engineering and to what Litmus is. I'd just like to pause and see if there are any questions.

Yeah, for sure. Thank you so much. Oh my gosh, so much to chew on; this is great. Okay, I have a couple of lightweight questions that you'll be able to smash pretty quickly. I think when most people think about testing an environment, to test as close to a real-world example as possible, what they often consider is something like emulating a DDoS attack: something outside is causing harm coming in, right, at the ingress layer. They're hitting certain API endpoints, or they're sending malformed requests; they're doing something at a higher level. And it sounds like what Litmus does actually runs in-cluster, right? Going back to chaos engineering, you're actually inflicting chaos on yourself, internally, not externally. You mentioned Litmus has a few different types of, maybe, attack library. I'm also looking at Gremlin, which is another chaos engineering platform, and I'm interested in some of the differences there. So what is the go-to MO, the default patterns that you find people using Litmus for? And can you expand a little on what you're doing? It sounds like you're taking Litmus to the next level of building a platform. How do you intend to leverage that platform to help provide continued reliability insights, SLAs, things like that, not just for your Kubernetes cluster or an API for your application, but for the entirety of a platform? If I have 30 microservices, what does that look like?

It's a great question. You're right about the first part. When Litmus started, a lot of members of the community started using it, and still do, predominantly for inflicting chaos within the Kubernetes cluster and on the services that are inside it.
Litmus has this feature to do some asset discovery, which we will show very shortly during the demonstration, where you can identify the different services that are living inside your cluster, and you'll be able to access them and target them. As to the different faults: the generic category, for example, consists of most of the pod-level and Kubernetes node-level faults. You can kill pods, send termination signals to containers, do some chaos on the pod network by injecting latencies, or eat up resources and slow down the application (your PID 1, essentially, running inside your containers). Similarly, for nodes, you can take them into maintenance: you drain them, you taint them to cause evictions and push out all the pods. So, things that can be done within Kubernetes. And it has this model, especially in 2.0, where you run the control plane services within one cluster, register several other target environments or Kubernetes clusters into that control plane, and run agents there which actually carry out the faults.

There are ways to keep the Litmus pods immune from being impacted by the chaos they generate. You can target specific pods; there are label and annotation selectors and namespace filters that you can use; and you can also set up affinity policies, node selectors, and things like that to ensure the Litmus application services are not impacted by the faults they're causing. You can point at the specific resources against which you want to do chaos. But when you think about things happening outside of your cluster, for example application services that are receiving requests from an ingress or from outside, Litmus also has the capability to disrupt traffic only against certain destination addresses. Let's say there is a service talking to another service inside the cluster: you can ensure that the traffic between those services alone is disrupted. That's something we developed recently.

And a lot of organizations still have a hybrid model; not everything is on Kubernetes. So there are experiments we've started creating for AWS, for example, where the experiment business logic runs from inside Kubernetes, so the chaos pods are spun up there, but we make use of the APIs provided by the cloud provider (AWS or GCP or Azure) via their SDKs. You create some access secrets on the Kubernetes cluster where the experiment business logic runs, and then you target some non-Kubernetes resource that's completely elsewhere and still invoke the failures there: you can cause instance failures, disk detaches, or some sort of network degradation within those VMs, which may have an impact on Kubernetes. That way you still run Litmus from within the Kubernetes control plane, and you're still able to target resources that exist outside of Kubernetes. That is the model we are working towards. There is a limited set of experiments on AWS, GCP, VMware, and Azure today, and we are in the process of expanding them.
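To make that cloud-target model a bit more concrete, here is a rough sketch of what a ChaosEngine driving an AWS instance fault might look like. The experiment name, env var names, and secret wiring are assumptions based on the public ChaosHub entries rather than anything shown in this session, so treat it purely as an illustration:

```yaml
# Hypothetical sketch: an EC2 instance fault driven from inside Kubernetes.
# The experiment name and env vars are assumptions; check the ChaosHub entry for the real spec.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: ec2-chaos
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: ec2-terminate-by-id-sa   # service account allowed to run this experiment
  experiments:
    - name: ec2-terminate-by-id
      spec:
        components:
          env:
            - name: EC2_INSTANCE_ID        # target instance, living entirely outside the cluster
              value: "i-0123456789abcdef0"
            - name: REGION
              value: "us-east-1"
            - name: TOTAL_CHAOS_DURATION   # how long the fault is kept active (seconds)
              value: "60"
        # Cloud credentials are supplied to the experiment pod separately, e.g. as a
        # Kubernetes Secret mounted into the chaos pod, as described above.
```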
The idea is that with the faults you can do on non-Kubernetes targets, you get a more wholesome platform: you're able to use the same centralized platform for doing chaos on different kinds of targets. You brought up Gremlin, which is another great tool that has been around for quite some time; they have added capabilities to do Kubernetes-based chaos as well, but they primarily started out doing chaos against VMs. The chaos engineering community, and the tools in both the open source and the closed source space, are really growing today. Litmus is differentiated in terms of its architecture, in how it runs as a Kubernetes app, as well as in the way it treats an experiment. What we're trying to do is align with the principles of chaos and provide the notion of a complete, end-to-end experiment that has fault injection at its core, but also blast radius control, the ability to do steady-state validation, and the ability to simulate complex scenarios by stitching experiments together. You can actually run more than one fault. Let's say you have a node which is almost exhausted of resources, so nothing more can be scheduled there, and on another node there was, let's say, an eviction, or a pod got deleted for some reason, and it's not able to get scheduled anywhere else because the first node is already running at full capacity. This is a condition that is sometimes seen in production. To bring up this complex scenario you might have to run two faults and tie the right validation along with them. Litmus enables that through what we call chaos workflows.

To summarize, we're trying to build an end-to-end chaos platform for doing complex experiments and also visualizing the progress of those experiments. And you asked how you can get information, how organizations can take a look at how their chaos engineering practice is going, whether they have an overall resilience view. That's what we're trying to build with Litmus as well. There's an analytics section which goes through all the past workflows you have run against your services, and you can do comparisons: you might have run these workflows or experiments against different environments (maybe dev, staging, production) or across a few releases, and you're trying to compare how your experiments went and see whether you're improving or going down. That is something we're trying to add, along with other views and viewpoints based on community feedback about what people are most interested in when they run experiments. For example, people would like to see how their application behaves. We'll see in one of the demos: okay, there's something that I'm peering at in my application dashboard; now I want to see, when chaos is actually running, when it started, when it ended, and how the application behaved during that window. That's some amount of observability we're trying to add into the platform as well. There are some dashboards you can add within the Litmus ChaosCenter, as we call it, and there's something you can use directly to instrument your own dashboards too. So that's how we are trying to improve the platform.

Yeah, that was amazing.
Thank you so much for getting into that. And this platform, this UI, really helps seal the deal in terms of what am I actually getting from an end-to-end perspective. With the analytics, you need to be able to measure progress: where am I at now, what is my desired state, and what are the incremental changes or pieces to getting to that desired state, whether that is being able to support so many requests per second, or being able to sustain failures of database connections, or whatever it might be. So, principlesofchaos.org, which you've heard Karthik mention a couple of times, is a good starting point; it's actually a GitHub project as well. Or literally just search "chaos engineering" on Google; you can find tons of great resources that cover what Karthik and I have been talking about here and why this is so important.

The other thing I wanted to mention, Karthik, is that a lot of people don't really understand why they need chaos engineering. Why do I need that? No one on our SRE team is going to go in and just delete pods. No one's going to go mess with ExternalName objects in Kubernetes or screw with our CNI DaemonSet. No one's going to do that, right? But it's not about the humans as much as the natural elements of a cluster and the churn that's going on: there's maintenance, there are updates, there's autoscaling, there are devs constantly deploying things, there are people hopping into pods to look at things and test things, there are objects coming in and out, there are many different namespaces. I'm talking about some of the SRE core principles here: assuming failure, and using strong measurements, SLAs and SLIs, to track your services and your endpoints. Once you think about that, and you think about, let's say (because I have experience with an e-commerce platform), at any given time you might have a marketing event and have millions of people coming into your platform in the scope of five minutes. How are your applications going to work? I've seen so many different sorts of problems where you can throw compute at something, but if the things it depends on can't get out to the internet, if the NAT gateway is broken, or if other services it depends on in the chain of doing its operations to produce an output aren't scaling as you'd expect, you're not going to know about that until it actually happens. I fundamentally think there is no way to anticipate problems until you actually experience them.

So I think chaos engineering is basically saying: you have to commit to being okay with the fact that things are never going to be perfect, and again, that's going to be scary. You're never always going to have five instances of your application 100% working, perfectly hitting health checks, responding in under one second. You're never going to have things at perfect capacity. So what actually happens, and what does that mean for your end users, the people using your platform day in and day out who might be buying something from it or depend on it for whatever reason? This is all together making Kubernetes a better platform for you to continually ship applications on, and really getting that feedback loop of analytics, metrics, and other data so you know what's actually going on. Having that intelligence, right?
It's not the older world where you just throw it on a server, systemctl start, great, the service is running, and hopefully everything's good, right? I think this is the new model for thinking about how we do things, especially in a cloud native way. So with that I'm going to give it back to you, Karthik. You've already shown us a little bit about the platform; you probably want to dive into the differences between 2.0 and 1.0. I'd love to hear more about what the intent was with 1.0, and what are some of the key things and learnings that you and the team leveraged to figure out what 2.0 should be. Take us through that a little bit, and I'm sure we'll have some questions from there. I know I have a few other questions as well, but take us through that.

Sure. I think when we built 1.0, it was, like I said, the need for us to create something cloud native to do chaos. One of the things we felt was that in Kubernetes everything happens to be a resource, whether native or custom, and you have a controller that reconciles things. We wanted to bring that experience to Kubernetes chaos engineering as well, and that's when we created some custom resources, which we call the ChaosExperiment, the ChaosEngine, and the ChaosResult, along with a controller, of course, to carry out the chaos process. This is a very brief summary of what it is. On the hub that I showed you a few minutes back, there are a lot of pre-built templates you can pull, each defining a particular fault. Then there is the ChaosEngine, the user-defined one, the one users deal with on a day-to-day basis, where you provide the run characteristics of the experiment and map a fault to some component that is living on your cluster right now, either an application component or a service, or maybe, in the case of non-Kubernetes chaos, some instance that is living somewhere in the cloud. That's what you create, and the ChaosEngine is the one whose creation actually triggers the fault injection process. The results of the experiment are stored in a ChaosResult. These are all custom resources, because there is huge scope for expanding the schema and what they can hold. In the case of the ChaosResult, you can store the experiment status and the verdict of the experiment upon completion, based on certain steady-state validation constraints; you'd like to know how each of those constraints fared when you ran the experiment. We use something called probes to define those constraints, and the result tells you how they went. Then, of course, to repeat the experiments with different scheduling options, you might want to randomize your experiment runs over a prolonged period of time, either strictly scheduled or with some randomness thrown in. So these are all the chaos CRDs.

When we started the chaos engineering project, we had just this one deployment alone, called the chaos operator; the rest of what you're seeing here is part of 2.0. Initially you had just the chaos operator, and you could create a ChaosEngine manifest, something like this: you have an application that you're defining by namespace, label, and kind (these are the identifiers for a given application), and you can go and run this experiment with a specific service account.
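A rough sketch of the kind of ChaosEngine manifest being described; the names and values below are illustrative (loosely mirroring the hello-service demo that follows), not a manifest shown verbatim in the session:

```yaml
# Minimal ChaosEngine sketch: map the pod-delete fault to an app identified by
# namespace, label and kind, run under a specific service account. Values are illustrative.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-chaos
  namespace: demo
spec:
  engineState: active
  appinfo:
    appns: demo                  # namespace of the target application
    applabel: app=hello          # label identifying the target pods
    appkind: deployment          # kind of the target workload
  chaosServiceAccount: pod-delete-sa   # limits what the experiment is allowed to do
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # total time the chaos runs (seconds)
              value: "30"
            - name: CHAOS_INTERVAL         # gap between successive pod kills (seconds)
              value: "10"
```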
The service account allows you to run your experiments with a defined set of permissions, so you can choose who is doing the experiment and what permissions that persona has, and therefore control what can be done as part of the experiment: maybe provide permissions to just delete pods, and nothing more, so you cannot do a node-level experiment that way. That lends itself to a self-service model of experiments: maybe there's a shared cluster where different service owners and developers are running their own experiments; you can still do that by adopting different service accounts. Then you have a total duration for which the chaos runs, and some tunables to control the specifics of the experiment.

We're going to do this experiment against a very simple service called hello service; it's basically a hello world application that lives in a namespace called demo. Let me take you to that; it's just a pod. What I'm going to show is the hello world of chaos engineering: we're just going to kill a pod with one of the experiments. We provided the capability to create these resources and run them directly, so people found it very convenient to use this in their scripts and automation pipelines, CI/CD, things like that. You basically run it for a duration; we've set a 30-second duration with 10-second intervals, so it's going to do the kill two or three times, and you're going to get a ChaosResult, plus some events on the engine resource that might be of interest, so you can see what's happening as part of the experiment. Each experiment in Litmus does a pre-chaos check to see if the application we're doing chaos on is in a good state, because we don't want to degrade an already degraded system. So we make some checks, then we carry out the fault, then we do some post-checks, and then we finish the experiment. That is essentially what was available. You can see the chaos injection is in progress, like I showed you; this is going to complete the experiment and then allow you to take a look at the results and see what happened to your application, and you can draw your own inferences from that. You get a verdict based on the constraints; there are essentially no constraints in this experiment, so the experiment passed, because the only checks used to verify that it passed were whether the app was healthy before and whether it recovered afterwards within a specific period of time.

I'm just showing you the ChaosResult now; it is quite simplistic in this case. You can see that the experiment was run once, it passed, this was the target, and it just ran, right? This is something very simple, but real-world scenarios are different, and real-world requirements are different, and we got that feedback as we went along building the project in the open. We were asked: how can I visualize the impact of chaos? I have an application dashboard and I want to see exactly when chaos starts and when it ends, and what stage it is in. Of course the events are helpful, but events are not for everybody. Kubernetes has really democratized the way applications are operated, so there are different personas involved.
There are some folks who are deeply involved, who know the Kubernetes API and are very happy to navigate things like logs and events; there are others who want more graphical representations of a process, dashboards. So: how do I visualize the impact of chaos? And then, how do I validate application behavior? The verdict you're giving me is too simplistic: application health check and application recovery after chaos are great, but I want to see what's happening during the fault injection. I have some SLOs that I want to validate, maybe some I/O, throughput, or latency numbers that I want to validate even as the experiment runs. How can I do more faults, maybe as part of a larger scenario, like the case we were talking about some minutes back where your node has run to exhaustion, there's an eviction happening elsewhere, and you're stuck with a pod in pending state? How do I simulate those kinds of scenarios? How do I do benchmarking? Mario mentioned these cases where there are thousands of users on a platform at a single point in time; how do you simulate that load and then do the fault there? How does your system respond to faults under such loaded conditions? Doing chaos in utopian conditions is not great; I don't want to do it only in the ideal case. What would have happened if the traffic had been at full? How do you do that, basically doing multiple things at once as part of the experiment, a lot of parallel processes? And how do you give me a metric that connects the fault, the experiment, and my application or service, a metric that says my application is resilient to this fault by this much, right?

Then there are other operational challenges. How do I get different team members to come and collaborate on my chaos artifacts and visualize them? How do I ensure there's a single source of truth for chaos experiments? YAMLs are great, you can store them in Git, but GitOps is really the in thing: I want to ensure that when a change is made in my Git repo, the experiment definition gets changed on my cluster by the time it runs. How do you ensure that? And how do you use a single platform to target different environments? We built on this a little bit: doing chaos in Kubernetes, but also doing chaos against other components while still running as an app on Kubernetes. How do you do all this? These are the requirements we got. We spoke to the community, there were several meetups where we went and presented, we got talking to people, and this is what we brought back into the project and built with Litmus 2.0.

The result is an architecture which gives you a single, centralized, cross-cloud control plane; we like to call it a centralized management platform, where you can connect one or more target environments for chaos, depending on where you want your chaos to be done. I might have a fleet of clusters, but I can use a single management platform to manage chaos against all of them. You have the self agent: the cluster on which the ChaosCenter portal is installed automatically registers itself as a candidate target environment for chaos. Then you can add other clusters as well; essentially, each is an execution plane. You can run your chaos business logic on the same cluster where the control plane for chaos (the ChaosCenter) resides, or you can use a different cluster as an execution plane.
By doing that, you keep the chaos microservices off that cluster; you're just making use of it to run your chaos pods, possibly targeting something that's not Kubernetes at all.

We also have the ability to create workflows now. Instead of the single ChaosEngine that we saw getting created, you can create a workflow that stitches together more than one ChaosEngine, in parallel or in sequence, and you can also have load tests embedded within a workflow. You could use Locust or Vegeta or k6.io or a lot of the other tools the community is using today, something that runs maybe as a Job, things like that. If you can containerize it and run it as a pod, you can run it as one of the steps in the workflow along with the experiment, so you get a better scenario to test.

We also have the analytics, which compares workflows using what is called a resilience score. The resilience score is essentially a metric that connects your experiment and your service. It is calculated from the importance, or weightage, that you give an experiment (we'll see that very shortly) and the success factor returned in the ChaosResult. The success factor itself depends on the steady-state validation constraints that you defined within the experiment using probes. So you take the success factor and the weightage you gave the experiment, you get a product, you take the summation of that product over the experiments listed within a workflow, divide by the total points possible, and you get a resilience score. That resilience score is something you can compare over a period of time, so you can see whether your resilience is improving or reducing. So that is the analytics.

It also has the option for you to pick workflows, your chaos artifacts, from a Git repository. You can commit workflows that you construct on the ChaosCenter, with the wizard it provides, and they get committed into your Git repository; any changes made there get reflected back, and the next time you run the workflow, you run it with the modified spec. It also has a set of users that you can bring together as part of your team to collaborate on chaos: each individual gets a project into which they can invite users to form their team, and the users can be of several types, whether an admin, a viewer, or just an editor, so different levels of users you can use to collaborate on chaos. And it has the ChaosHub embedded within it, very much inspired by the OperatorHub that's embedded within OpenShift, for context. So when you construct your workflows, you can pick experiments from the hub, stitch them together, and run them. These are all the capabilities we created in response to the requirements that we gathered.
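To put the resilience score calculation Karthik just described into a formula, with w_i the weightage given to experiment i and s_i its success factor from the ChaosResult (this is the calculation as described in the talk, expressed in notation; the worked numbers are made up):

```latex
\text{resilience score} \;=\; \frac{\sum_i w_i \, s_i}{\sum_i w_i \, s_{\mathrm{max}}} \times 100\,\%
```

For example, two experiments with weights 10 and 5 that succeed at 100% and 50% respectively would give (10·100 + 5·50) / (15·100) ≈ 83%.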
I will do a very quick set of demonstrations to show you how you can leverage this. Mario was talking about e-commerce applications; we've taken the example of a sample e-banking app called Bank of Anthos, which is comprised of several microservices. You can see that I have Bank of Anthos in the default namespace on this cluster; I have links here just to make things easy in terms of visualization. There's a balance reader application, a service rather, which enables me to read the balance here, make payments, do all sorts of things. I've set up this application without much resilience, and what I'm going to do is inject a blackhole attack, something very simple, a full network-loss fault, against the balance reader service. That's going to give us a semi- or quasi-operational e-banking application, which is something you generally want to avoid.

So let's take a look at how we do it. We're scheduling the workflow and selecting an execution environment; I'm selecting the same cluster I'm on, the self agent where the ChaosCenter resides. I go ahead and select the ChaosHub from which I can pick my experiments; you can define your own hub here as well, so if you're in a private environment you can create your own chaos hub and pick experiments from there. Then I give it a name, I'm going to call it bank-of-anthos-blackhole, and pick an experiment; in this case, network loss is the experiment I'm going to use to create this attack. Once I've selected the experiment, I can tune it the way I need. I'm interested in the balance reader service residing in the default namespace, so I'm just going to select that. I have the option of validating some behavior as I do the fault, but I'll keep it simple for this first round and just say next. I'm going to keep the fault active for 60 seconds, and it's going to be 100% network loss that we inject, so I say finish. There's also an option to revert the chaos, basically a feature that cleans up the chaos custom resources that were created to do the fault afterwards; I'm going to keep them, just to show you the logs. And this is the step I mentioned where you can provide the weightage, or criticality, of your experiment; we're going to give it full points. You can run the fault once now or on a recurring schedule, and we're just going to run it once now.

This is going to give us an Argo workflow underneath. It constructs an Argo workflow with a couple of steps: pull the fault template for network loss, and then actually run the experiment. You can see the ChaosEngine is auto-configured, auto-created; you don't have to create it by hand. You can see the same thing we saw last time with the pod delete a few minutes back: we have the balance reader being targeted in a particular namespace, and we have the duration and other tunables for the fault. I'm just going to say finish, and this is going to run. The workflow view lets you visualize what is happening as the fault proceeds, and each step gets played out on the cluster through some transient pods that get created: you can see it first pulls the fault template and then actually triggers the ChaosEngine. At this point Bank of Anthos is healthy and good; once the fault starts running, you'll be able to see whether we can still read the balance or make payments. Like I said some time back, Litmus does some pre-chaos and post-chaos checks; here the pre-chaos check is to see whether the pods carrying the label we just described are healthy and alive before it actually starts doing the fault. You can see that the step to trigger the fault has started. The ChaosEngine schema is quite rich and you can do a lot of things with it; the documentation for that is available.
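For reference, the auto-generated ChaosEngine behind this blackhole fault would look roughly like the sketch below. The field names follow the Litmus 2.x pod-network-loss experiment; the target label is an assumption inferred from the demo, so check your own deployment's labels:

```yaml
# Approximate ChaosEngine for the Bank of Anthos blackhole demo:
# 100% packet loss on the balance reader pods for 60 seconds.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: bank-of-anthos-blackhole
  namespace: litmus
spec:
  engineState: active
  appinfo:
    appns: default                   # Bank of Anthos runs in the default namespace here
    applabel: app=balancereader      # assumed label for the balance reader service
    appkind: deployment
  chaosServiceAccount: pod-network-loss-sa
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION             # keep the fault active for 60 seconds
              value: "60"
            - name: NETWORK_PACKET_LOSS_PERCENTAGE   # 100% loss = blackhole
              value: "100"
```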
You can take a look at the concepts section of the docs and see all the specifications it contains; all of that can be tuned, so you can look at those sections to see what all is tunable. You can set resources, you can define affinity policies for where the chaos pod should go, you can inject annotations into it, you can define the amount of time spent trying to validate whether an application is alive. A lot of things can be tuned, but our setup here is something very simple. The fault is active at this point, so if I refresh the application you can see that you cannot read the balance, and if I try to make a payment, that shouldn't go through either, because I cannot read the balance to see how much I have in my wallet to actually make the payment. These are the kinds of situations we would really like to avoid in our applications: we need to have the right middleware behavior to direct us to a different replica that is working and ensure things keep working. It is always risky to have semi-operational applications, where for example I can reach the payment flow but cannot read how much I have available to pay with, and things like that. So this was a very quick demonstration of how you can inject a fault and how it runs, and you will see that this experiment...

Yeah, this was wonderful. Sorry, I didn't mean to cut you off. I think a lot of people think of down and up; they think binary about how a platform is operating, just black and white, right? And that's not actually how most outages work. Most outages are actually kind of like a brownout: some things are working, some things are not working. The problem is the totality of some of the critical flows for using an application, and this example you used is fantastic, is that some things might stop working. Things might seem okay, but when a user actually goes to do something, that's when dependent services, other API endpoints, and certain microservices that make up the overall platform are not doing their job as expected. I think that's one of the major use cases for this. I'm loving this workflow dashboard; I think this makes it easy for me, and obviously the schedule interface you have here and some of the agents and the hub stuff are fantastic, so it's all tied together in a kind of unified interface. I think this is the next evolution of what it looks like to be comfortable, to feel like you have the control to test your environment the way you need to, to really get the correct signal instead of just noise. Instead of just "oh, you've got this one pod using lots of resources," which is one of many little things going on, there's a lot more signal that you need around the flows and what's actually happening in these scenarios. So this is fantastic to see, Karthik.

While you click around a little, I'm going to ask a couple of lightweight questions, because we have just a few minutes left. One of the big ones I'm thinking about is: what are the next steps? You do a great job of talking about what 2.0 delivers versus 1.0, so what's next on the roadmap? What can we see here over the next few months, as we head to the end of the year? What is Litmus looking to implement that has been something huge a lot of users have been talking about, or things you've been asked about from people saying, hey, I'm using this, and
if it did x, y, and z, I would be so much more efficient and really able to nail down certain things. What do those things look like for you and the team?

It's a great question. What we're being asked to improve is exactly what you mentioned: you want to see what's happening to your applications (it's not binary, it's generally a brownout), and you want to validate a lot of things and get insights into a lot of things happening on the cluster even as we do the fault. Probes are one thing we introduced to help do that. For example, in this case I'm just repeating the fault we did as the 1.x demo, where we kill a pod, but along the way we are doing some checks: we're checking whether a downstream application is alive throughout, that there's a 200 OK I get against it with a polling interval of one second. If not, then I would probably like to abort the experiment, so I can say stopOnFailure: true. So these are the things: how do I abort based on conditions, how can I have a dead man's switch, so to say, as I do the fault, so it auto-stops? And what are the different kinds of probes I can use? There are probes today that use Prometheus metrics to check deviation in your steady state: is it within your SLOs or not? So support for different kinds of probes that work with the different observability tools in the CNCF landscape today is one of the things we've been asked for and are working towards.

More experiments for non-Kubernetes environments is something we're working towards as well: Azure, AWS, GCP, these being the environments where people are still running a lot of their services. We have an initial set of experiments there, but there are requests to add more. From folks trying to use this in enterprise environments, there's also a request for supporting different kinds of authentication and authorization mechanisms for how people access the ChaosCenter; that is one of the things we're looking at on the roadmap as well. And there are other fault types within Kubernetes that we can add. There are thousands of faults you can do: using a base set of faults you can create thousands of scenarios, use cases rather, so Litmus provides you a good set in the ChaosHub, but we are committed to increasing that set of faults for Kubernetes and non-Kubernetes. And also providing a richer resilience view that people can make use of: analytics and resilience scores are great, but there are other things people would like to know about when they run experiments. People also want to know how their recovery was. How much time did it really take? Maybe it was under the limits, but how far under the limits, and is that something they can improve upon? How do you get better, right? How do you provide better insights to people who are practicing chaos engineering? That's something that we're going towards.
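Circling back to the probe Karthik described a moment ago: wired into an experiment's spec, a continuous HTTP check with stopOnFailure might look roughly like this. The URL, timings, and names are illustrative, and the field layout follows the Litmus 2.x probe schema as best it can be reconstructed here:

```yaml
# Hypothetical probe stanza inside a ChaosEngine experiment spec: poll a downstream
# service every second during the chaos and abort the run if it stops returning 200 OK.
experiments:
  - name: pod-delete
    spec:
      probe:
        - name: check-downstream-alive
          type: httpProbe
          mode: Continuous                  # evaluated throughout the chaos duration
          httpProbe/inputs:
            url: http://downstream-svc.demo.svc.cluster.local:8080/healthz   # illustrative URL
            method:
              get:
                criteria: "=="
                responseCode: "200"
          runProperties:
            probeTimeout: 5
            interval: 1                     # poll every second
            retry: 1
            probePollingInterval: 1
            stopOnFailure: true             # abort the experiment when the check fails
```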
That's awesome. This is all fantastic. We just have one more minute, and I want to end on a strong note for people that are looking to dip their toes in and get started, and maybe even evangelize this a little in their organization, or play around in a lower-end environment like a dev environment or a playground. What would you say is the best way for them to get started understanding chaos engineering and leveraging it in their day-to-day, on their laptops, whatever they're trying to do? What are some resources?

Yeah, I think we have done a fair bit of refactoring on the docs as part of the 2.0 release. The docs are at docs.litmuschaos.io; there are a lot of resources there that you can use to learn. Some of it is still coming in, but there's a good set of concepts docs that you can use to learn about Litmus, and you also have the experiment documentation: there's a lot of information about how you can run each experiment, the different variables it provides, and how you can run it with different options. For example, we talked about pod delete; there are different ways in which you can run pod delete, a lot of which is explained there. So I would recommend a couple of good resources: the docs and the pages of the repository itself. That's where you can find information about Litmus. When it comes to general information about chaos engineering, you can take a look at the principles of chaos engineering, and the CNCF has just started a chaos engineering working group which is actively trying to put together information about chaos engineering, both for beginners and for practitioners who have been doing chaos engineering for a long time and are now jumping into the cloud native world and looking at doing chaos engineering in a cloud native way. We meet once every two weeks, and we're trying to put together a white paper that talks about chaos engineering. There is some information already; for example, if you're looking at common terminology associated with chaos engineering, there's a dictionary that we're creating. It's not really an alphabetically sorted glossary; it's more about chaos engineering as you learn it, so you start with the principles, then you talk about what an experiment is and how you can understand each part of it, terms like hypothesis, SLIs and SLOs, and how you can practice it as an SRE, how you can conduct game days. That's information we're trying to expand upon, so this is probably a good space to look out for.

For sure. Yeah, this is perfect. I did not know there was a working group; that is amazing to hear. Thank you very much, Karthik. So much great content today, and you really did some amazing demos as well; the demo gods are clearly with you. litmuschaos.io, and ChaosNative is the company. Thank you to Karthik today, and thank you to the people working behind the scenes. My name is Mario Loria; it's been a great pleasure to host today's session. Talk to everybody later, and have a great rest of your day. Thank you so much, bye-bye. Thank you so much, bye.