Good morning everyone. As I said, my name is Prithvi Raj, and I welcome you all. It's nice to see so many folks joining in for a chaos engineering session. We'll be talking about the LitmusChaos project. It's a very popular CNCF project, and chaos engineering, or chaos testing, has been coming up not just in the CNCF space; a lot of people are adopting it overall. I hope this talk is helpful to you all.

I'll start by introducing myself. I've been working in the chaos engineering space for the last two and a half years. We started off with MayaData, then ChaosNative became the primary sponsor of Litmus, and now Harness is the primary sponsor of LitmusChaos. I've been working in the communities for some time now, organizing Kubernetes Community Days in Bangalore, Chennai, and many more events, as well as Chaos Carnival. It's all been about chaos engineering, so feel free to reach out to me on these socials if you're interested in chaos or community.

Let's take a look at the agenda. We'll talk about outages, of course, and then take a closer look at chaos engineering: why chaos engineering matters and how you can get started with it. We'll also look at running a chaos engineering game day. Following that, we'll go through a short case study on iFood, one of the adopters of Litmus, and lastly, how you can be a part of the community and contribute to it.

So let's get started. Here is one outage that recently happened with Meta, where Facebook, Instagram, and WhatsApp were down, with thousands of outage reports. But this is not just one outage. If you see my screen, here are examples of outages happening day in and day out, where large companies and large enterprises are having outages every now and then. And scaling is not the only issue: there are hardware issues, and there are issues with whatever environment teams are moving to; say we are moving from monolith to microservices architectures, and issues are bound to happen.

If I take the example of Kubernetes itself, a Kubernetes application is in the form of a pyramid, where your Kubernetes architecture is on top, but there are other dependencies as well: your monitoring, say Prometheus, your platform layer, and your other services, your MongoDB or Kafka. Each and every layer can possibly have an outage; each and every layer can go down. It's just like Murphy's law: anything that can go wrong will go wrong. So why wait for an outage to happen? Why wait for something like this to come out as news, or for downtime to cause losses in various forms? We'll be talking about those. A lot of companies are actually not moving to this kind of testing because they are skeptical; people are skeptical about how to run such production failures.
Maybe they don't have the budgets. These are things we face day in and day out, being part of the community and running it for so long.

Moving on: as I said, these outages and brownouts are avoidable. Scaling is another issue. Netflix started this concept around 2011-12 with a single production-level failure, which was about deleting infrastructure in production, or deleting some pods in the Kubernetes context, and then it moved on to more and more production-level failures.

So what can outages, or not adopting chaos, cause? Loss of customer confidence, loss of employee confidence, and a reduced stock price. If you have seen the stock price of Facebook over the last year, with multiple outages, that can be one possible reason; there may be other reasons as well. And if I take examples from real life, there are so many news applications and other applications which have faced such outages in the recent past, causing various sorts of losses for them.

Obviously, outages will still happen. Companies facing frequent outages that are not looking at chaos have seen a 2x higher mean time to recovery, more team members required to handle these outages, and increasing operations costs. And obviously there are many more damages. As I spoke about, we are moving infrastructures, from legacy systems to microservices and cloud architectures, and these sorts of outages are going to happen eventually, right? And those who have decided not to make a choice are still making a choice: if you have decided not to adopt chaos, you are still deciding not to go ahead with this practice. By the end of this talk, I hope you rethink that choice.

Let's take a closer look at chaos engineering now. I think you might have got a clear idea already: chaos engineering is nothing but deliberately inducing a fault into a system. A lot of people think it's just production-level failures, but I'll tell you, whether you are in your developer environment, your pre-staging or staging environment, or your CI/CD pipelines, you can induce a fault anywhere. You can target your infrastructure in a certain way to identify what sort of outage might happen when the system goes into production or when a real-life scenario occurs.

A simple example: I'm coming from India, by the way, and in India we have these Great Indian Festival sales, or Black Friday sales as you'd call them globally. There's a requirement for scaling, because a lot of users access these e-commerce applications at a particular point in time, and the systems go down: there's a payment failure, a failure with the catalog, a failure with the overall cart. That's a very simple example of where chaos engineering can come into play, because such scenarios can be predicted beforehand by inducing such faults.
If we talk about the Kubernetes ecosystem, a simple fault can be a pod-delete experiment, a node kill, or a container kill. These are some very simple experiments used by the community in day-to-day life (I'll show what one of these looks like in a moment). But again, it's not breaking things on purpose, as a lot of people say; it's breaking things to mitigate risks and safeguard your systems.

So how do you start chaos engineering? What is the process? Here are four steps; they are not so easy, but I have listed them briefly. First, you understand the steady state of your system: how your system behaves under normal conditions. Then you induce a fault, such as the pod-delete experiment from my earlier example, and observe what happens. In SRE terms (chaos engineering was long seen as an SRE practice, though having been part of the community for some time, I can say it's also shifting to developers; we'll talk about that later), if your SLOs, your service level objectives, continue to be met, and the system behaves the way you expected after the fault was induced, then your systems are resilient. If not, you have found a weakness, a possible vulnerability, and a requirement to mitigate that sort of outage. Then you mitigate it and continue the process: you again identify the steady-state conditions and introduce a new fault.

This next slide is taken from the principles of chaos engineering, if you have already read them, and it lays out properly what I just explained about how a chaos experiment is run. You select a set of experiments and hypothesize around them: what faults will be run in your system, and how will the system behave according to your hypothesis? You create an experimental group, and then you experiment on those systems. Running chaos experiments is not just about running them on your target applications; it's also about setting particular intervals and creating the right chaos scenario. A chaos scenario is nothing but multiple chaos experiments running according to your test suite; we'll talk about that later as well. Then you use your learnings to identify how your system is behaving, and you continue the process in a loop, because it's not a single point in time: as you grow and move ahead, you have to understand where new failures are possible, and you continue to create new scenarios and new chaos tests.

So where can chaos be used? As I mentioned, it's not just about running it in production. Your eventual goal is to run chaos engineering in production, ensuring your production systems are resilient, but you can run it anywhere: your CI/CD pipelines, your staging environment, your pre-production.
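To make that concrete, here is a minimal sketch of what the pod-delete fault I mentioned looks like when expressed as a LitmusChaos ChaosEngine manifest. The namespace, application label, and service account name are placeholder assumptions for illustration; the tunables follow the pattern documented on the ChaosHub rather than anything specific to this talk:

```yaml
# Minimal LitmusChaos ChaosEngine sketch for a pod-delete experiment.
# The target app (app=nginx in the "demo" namespace) and the service
# account name are placeholder assumptions.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: demo
spec:
  appinfo:
    appns: demo
    applabel: "app=nginx"        # label selector for the target workload
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  engineState: active
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # how long the fault runs (seconds)
              value: "30"
            - name: CHAOS_INTERVAL         # gap between successive pod kills
              value: "10"
            - name: FORCE                  # graceful vs. forced pod deletion
              value: "false"
```

Applying a manifest like this kicks off the fault against the selected workload, after which you verify the steady-state hypothesis: did the deployment reschedule the pods and keep serving traffic within your SLO?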
Again, it depends on the chaos tools. There are a lot of tools. It all started off with Chaos Monkey, an open source tool a lot of people use for simple production failures, and then there was the Simian Army, and newer tools came up: Gremlin, Chaos Mesh, and many more. The right tool for you is one that helps you run chaos engineering in all these environments.

Moving on, I'll introduce the right tool according to my perception, but again, there are so many tools out there; being part of the open source community, I would urge you to take a look at all of them. LitmusChaos is one of them. It's a CNCF incubating project. It started off, by the way, as a project to test another CNCF project, OpenEBS, a cloud native storage project; it began as a way to curate some simple database chaos, and as we kept writing it, it eventually became a project of its own. Today it's one of the most popular open source chaos tools out there, with around 50 to 60 chaos experiments. I could have shown you the ChaosHub; maybe we'll move on to that.

It's driven by the principles of cloud native chaos engineering: it's open source, there's a GitOps functionality with which you can scale your experiments, and you can contribute and write your own experiments. There's so much you can do; we'll be talking about that. Just looking at stats, it's adopted around the world today. Even in Japan, a lot of folks have been using Litmus; we have seen a lot of usage. AWS FIS has a Litmus integration, and as you can see, everyone from end users to cloud native projects has been using it. Chaos engineering has grown exponentially: where people were once skeptical, today not just large enterprises but even small enterprises are moving towards adopting chaos. Here is a small map: I won't say the whole world is using chaos, but the deep purple countries are the ones with people or enterprises using chaos engineering.
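For a rough idea of how those 50-odd experiments are packaged, here is a heavily trimmed sketch of a ChaosExperiment custom resource, the format in which experiments are published on the ChaosHub. Treat the image tag, permissions, and values as illustrative assumptions rather than the authoritative definition:

```yaml
# Trimmed sketch of a ChaosExperiment CR, the packaging format used on
# the ChaosHub. Fields and values are illustrative, not authoritative.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  labels:
    name: pod-delete
spec:
  definition:
    scope: Namespaced            # this experiment only needs namespace-level access
    permissions:                 # RBAC rules the experiment runner requires
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["get", "list", "delete", "deletecollection"]
    image: "litmuschaos/go-runner:latest"   # placeholder tag; pin a real version
    args:
      - -c
      - ./experiments -name pod-delete
    command:
      - /bin/bash
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "15"
      - name: CHAOS_INTERVAL
        value: "5"
```

Writing your own experiment (with the Litmus SDK, which I'll mention shortly) essentially means producing a CR in this shape plus the runner logic behind it.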
Moving on, this is the dashboard. I don't have a lot of time to explain everything or run a demo, but this is the LitmusChaos dashboard. This is the home page, where you can see the projects that are running, and you can invite your team members: chaos engineering is not a process for a single SRE or team member but a team-wide process, where SREs, QA engineers, and developers all come together to ensure your systems are resilient.

And then there's the ChaosHub; I'll take you through it, just give me a second. If you see this, this is the open marketplace. Litmus is the only project with an open marketplace like this, where you can see the experiments contributed by the community. It started off with Kubernetes-centric experiments, but chaos engineering is required not just for Kubernetes but for non-Kubernetes environments as well, so there's application-level chaos in the form of CoreDNS, Cassandra, and Kafka, and then there are your cloud architectures, AWS, GCP, and Azure, which might also require some form of chaos and mitigation. This is what the community has developed so far, but we are moving ahead: there's the Litmus SDK, which helps you write your own experiments, so if there are chaos experiments specific to your application, you can write them yourself and use them with your application.

Moving back to my presentation, we'll talk about a chaos engineering game day, and eventually a small case study. A game day is nothing but gamifying the process: running chaos experiments according to your application, understanding who the stakeholders are, understanding how your workloads are going to behave, and creating a report. Chaos engineering should start with building a culture; you need to understand the practices and how chaos engineering works. But once the practices and the culture are adopted, you move on to running real experiments, running a game day.

Here are some simple steps. Initially, you introduce the participants; who that is depends on where you want to run chaos. If it's a developer environment, maybe your QA engineers come into play; if it's production or pre-production, maybe the SREs or DevOps engineers; and if you eventually want to run it in production, your customer success engineers and many more people come into play. Then you describe your environment, say your GKE, AKS, or EKS clusters, and you formalize your workloads, which is pretty important: for Kubernetes, you might run chaos on your pods, cluster-wide, or on a specific workload, but for non-Kubernetes there are your PVs, your StatefulSets, and much more. I don't have a lot of time to explain everything, but I hope everyone can take a picture of this slide so they understand the real steps of running a game day; not everyone really knows how to run one. As I have mentioned, observability aids this: you have to monitor where the chaos is happening and where the systems are going down, and folks use tools like Dynatrace, Prometheus, and Grafana for that.
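One way that observability piece ties back into the experiments themselves is through Litmus probes, which let you encode the steady-state hypothesis as an automated check against your monitoring stack. Below is a hedged sketch of a promProbe attached to a pod-delete experiment; this snippet would sit under the ChaosEngine spec shown earlier, the Prometheus endpoint, query, and threshold are invented for illustration, and the field names follow the Litmus 2.x probe schema as I recall it, so verify them against the docs for your version:

```yaml
# Sketch of a Litmus promProbe validating an SLO while chaos runs.
# The endpoint, PromQL query, and threshold are hypothetical examples.
experiments:
  - name: pod-delete
    spec:
      probe:
        - name: check-p99-latency-slo
          type: promProbe
          promProbe/inputs:
            endpoint: "http://prometheus.monitoring.svc.cluster.local:9090"
            query: "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))"
            comparator:
              criteria: "<="     # probe passes while p99 latency stays under the SLO
              value: "0.5"
          mode: Continuous       # evaluated repeatedly for the duration of the fault
          runProperties:
            probeTimeout: 5
            interval: 2
            retry: 1
```

If the probe fails, the experiment verdict fails, which is exactly the "hypothesis disproven, go mitigate" branch of the loop I described earlier.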
Eventually, the last step is to make the fixes: once you have run the game day and created a report, you make the fixes, and then you constantly run these game days according to the reports you are creating.

So quickly, on the typical way of doing chaos engineering: as I said, the eventual goal is to automate these game days and experiments. It used to be a manual process; today developers run chaos-gated code merges or integration tests (see the pipeline sketch below), while the SRE eventually runs an automated or manual game day.

Let's check out the iFood story, our case study. iFood is a food delivery app, just like Uber Eats here in Japan; it's a popular delivery app in Brazil, taking around 60 million orders per month, and it's similar to any food delivery app you'll see out there. The problem statement was that they were moving to a microservices architecture and saw servers going down; their messaging brokers were crashing, and the main issue arose on Brazilian Valentine's Day in 2020, when they saw a major outage.

What did they try? They started off with circuit breakers and fallbacks: when their main system went down, their fallback system started running, and the circuit breaker methodology contained those sorts of outages. But they had to understand that this was a larger problem, not just about mitigating a fault at a particular moment. That is where they moved on to chaos engineering. They started off with manual testing, with Chaos Monkey again, but they were expanding and needed a whole set of experiments; that was the need of the hour. So they evaluated various chaos tools, as I mentioned you should, to understand their requirements, and they were surprised by what they saw with LitmusChaos: they were able to run chaos engineering in a controlled and simple way and write their own experiments. Here is the architecture diagram, where the Chaos Center, the Litmus UI I just showed you, is connected with the clusters through agents, with a Dex authentication provider, and the agent injects the faults into the service or application they are running. This is how they chose Litmus; there's much more in a blog post you can check out on the CNCF blog. As next steps, they decided to create more experiments, create more permissions, work on security, and much more that can be done in the chaos space: serverless chaos and application-level chaos.

So what's next for Litmus? It's 3.0. We are developing a new UI that's more robust, clean, and developer-centric, and we are working on creating something special for the community; perhaps wait for the next KubeCon and you'll see what's coming up. You can contribute to it as well: there are new features, bug fixes, charts, and workflows you can contribute to, and you can get involved in the community through the code, docs, Slack, and Twitter. So let's connect after my talk; feel free to reach out to me.
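For the chaos-gated code merges I mentioned, here is a rough sketch of what that gating can look like in a CI pipeline. This is a hypothetical GitHub Actions job, not something from the talk or the Litmus docs: it assumes kubectl access to a test cluster with Litmus installed and reuses the engine and experiment names from the earlier pod-delete sketch. ChaosResult objects are named after the engine and experiment, and their verdict field reports Pass or Fail:

```yaml
# Hypothetical chaos-gated merge check. Cluster credential setup is
# omitted; names match the earlier nginx-chaos pod-delete sketch.
name: chaos-gate
on: [pull_request]
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Trigger the pod-delete experiment
        run: kubectl apply -f chaos/nginx-chaos-engine.yaml
      - name: Wait for the experiment to complete
        run: sleep 90   # crude; a real pipeline would poll the engine status instead
      - name: Gate the merge on the chaos verdict
        run: |
          # ChaosResult objects are named <engine>-<experiment>
          verdict=$(kubectl get chaosresult nginx-chaos-pod-delete -n demo \
            -o jsonpath='{.status.experimentStatus.verdict}')
          echo "Chaos verdict: $verdict"
          [ "$verdict" = "Pass" ] || exit 1   # fail the job unless the hypothesis held
```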
I might not have time for Q&A because I'll have to stop, but thank you so much once again. You can get involved in the community. I'm part of Harness, and we are building Harness Chaos Engineering over LitmusChaos itself, so feel free to reach out if you have any questions or want to get started with the chaos engineering practice. Thank you so much everyone, thank you.

Thank you very much. Are there any questions?

Awesome stuff. I'm curious, it didn't seem very clear: in terms of observability, as we're promoting chaos, I'm a little confused whether there are integrations from this product to other observability options, or whether it conducts the observability itself.

So, it's chaos engineering integrated with observability products. Observability products are not inducing any sort of chaos; they are just helping you observe how your systems are behaving: what the normal behavior of your system is, and when a fault is induced, what sort of behavioral change happens. You pull out the metrics. Basically, you're running chaos with these tools, LitmusChaos, Chaos Mesh, or any other tool, and then you're using your observability tool, say your Dynatrace or your Grafana dashboards, to observe what's happening: how the network is slowing down, how the system is going down, and when there's a spike in the system, what exactly happens and how much time it's taking. Those are the metrics that are important as part of your chaos.

So it's a separate entity; you have to check your observability metrics?

Yes, it can work as a separate entity. With chaos engineering, monitoring is obviously important, but it need not be a mandatory part of the chaos tests themselves.

So through the UI we wouldn't be able to see exactly what's going on?

With the UI you would be able to see some of it. As I mentioned, there is an analytics section; if you see here, there is an analytics section, so with the UI there is some observability already which you can monitor. But that's just an inbuilt UI; the community usually uses other tools there, say the Elastic stack for logging, and for monitoring they pull out the Prometheus metrics and visualize them on Grafana dashboards, Dynatrace, Datadog, or so many other platforms out there which you can use to analyze. It just expands what you are already using; chaos simply comes into play there.

Thank you. Any other questions? No? So I have a question. I understand chaos engineering is to break some machine on purpose. So in that case, what kind of faults can you introduce? For example, hardware faults, or only software faults?

See, chaos engineering is not just about software faults. As I spoke about, there's serverless chaos, there's application-level chaos, there are multiple levels of chaos: VMware chaos, bare-metal chaos. Chaos engineering is expanding today. We haven't worked on hardware chaos specifically as a community, because the cloud native world itself hasn't moved on to many hardware-level faults, but as chaos engineering expands, maybe five years from now you will see a lot of hardware giants adopting it too. It's essential for everyone, as I spoke about, from airlines to banking to e-commerce to food delivery.
There's chaos required everywhere; everyone is moving to chaos, not just Netflix today. This practice has spread across the world. But yes, hardware chaos, hopefully we'll see that soon as well.

Thank you. All right folks, thank you so much everyone. Thanks for tuning in. Okay. Oh, one more question.

Thank you for your presentation. I have a question about the process of chaos engineering. We are a development team just beginning, on a waterfall model, but you introduced chaos engineering and game days. Would a game day occur at the end of the process, or in an agile environment would the game day happen each sprint? My question is how often to run game days during the development timeframe; I want to know at what point to introduce chaos engineering.

So, as I had mentioned before, a game day is not usually run at the beginning of the process. As you said, you are beginning with chaos engineering, so the first step is bringing the culture in, understanding the practice, and differentiating it from the existing testing practices you already have. Then you move on to the practice of chaos, where you initially understand your systems: where chaos engineering is required, where the vulnerabilities are in scaling, and what other failures usually happen. Once this culture is brought in, you start with chaos testing itself, not with a game day, because a game day is a bigger process. Obviously the goal at the end is to test, but a game day is more or less about understanding your systems, and it starts from the middle of your process: once the culture is in, once you have started running some tests in your staging or developer environments and understood the impact of chaos and the journey ahead, then you start with a game day. Initially, a game day is run just to understand how game days work, and then slowly you run multiple game days to create multiple reports on application behavior: what the risks are, what the mitigation processes are, how the stakeholders are affected. All of this comes into place once you have moved considerably ahead in the process. I hope that answers it.

Yeah, let me try, yeah. But we need a definition of where to begin when we try to do chaos engineering.

See, as I mentioned, get started with the resources. All the resources out there created by the community are really helpful for getting started. Then you just pull up this platform, the Litmus platform, and run a basic chaos test on a normal Kubernetes cluster, not even in your own environment: you just pull up a Kubernetes cluster and run a simple pod delete or certain experiments. Then you run a chaos scenario, where multiple chaos experiments are run on an application. And then you move on to your own application, where you work out which chaos experiments are specific to your architecture. Let's say you are part of Vodafone: how does a telecommunications or mobile application architecture require chaos? Then you start running a game day according to your application.

We are currently on a Kubernetes environment, so we'll try to start with that, as you say: infrastructure-level chaos.

Absolutely. Yes. Thank you so much. All right, thanks everyone.