Hello, everyone. Good morning. Thank you for taking the time to be here. Great to see all of you. Today we are going to talk about a new topic of interest: Chaos Engineering for Developers. I am Uma Mukkara, head of Chaos Engineering at Harness, and I'm also a co-creator and maintainer of the Litmus project, which is a CNCF incubating project. And here we also have Ramiro.

Hi, everyone. Happy to be here. My name is Ramiro Berrelleza. I am the founder and a maintainer of Okteto, which is a CLI tool for remote dev environments on Kubernetes.

For the next 30 minutes we are going to talk about how you can apply chaos engineering in the cloud-native development ecosystem. Before we do that, let's quickly cover what chaos engineering is and give a brief introduction to the Litmus project and what you can use it for. Then Ramiro is going to talk about how developers on his development platform are using Litmus to do chaos testing while they develop.

First of all, we all know that outages are expensive. They cause reputational damage and financial damage, and they hurt the user experience and metrics like Net Promoter Score. So service reliability is key for all of us in this era of digital transformation. We also know that services are always built with redundancy, and still failures happen. That's unavoidable, and sometimes they cause irreparable damage. So some of the key metrics for every business are how fast you can recover from a fault and how you can delay a potential failure. You need to be doing whatever is possible, continuously, to get better reliability out of the code you are building and shipping.

So what's the solution? A good way to do this in an organic way is chaos engineering. Chaos engineering is the science of breaking things willfully and proactively, trying to find weaknesses and fixing them in an iterative, ongoing process.

And why is chaos engineering more important for cloud native? Chaos engineering has been around for quite some time; we've been using it. But it is more relevant for cloud native because of two things. One, you are really looking at a very small tip of the pyramid when you're writing code: there are a lot of microservices that you're now dealing with underneath your stack, and they are changing very dynamically. To summarize, the code in a container that makes up your service is much smaller, and at the same time there are many other microservices that your microservice interacts with, and they are all shipping fast. That's the whole idea of cloud native: Kubernetes enables things to get shipped very, very fast. So in essence, you have code that's changing all the time in your production systems. Things can go wrong and will go wrong. And in Kubernetes, pods keep getting swapped and deleted for perfectly regular reasons. A pod getting deleted isn't necessarily a sign that something bad happened; it can be deleted because of some resource pressure, and then Kubernetes reconciles.

So this makes chaos engineering very, very important. You have to have a science of building reliability in a structured way while doing cloud-native development. And that makes chaos engineering relevant all the way from development to testing to production.
So in other words, chaos engineering is becoming a kind of DevOps culture. It's not only a thing for ops, not only a thing for SREs; it's still very relevant there, but chaos engineering can just as easily be adopted as a DevOps culture. You do chaos engineering in all places: in CI pipelines, in CD, pre-CD, post-CD, and then pre-production and in production. This brings a better ROI for investing in chaos engineering, because you are finding your weaknesses all the way from build to test to production.

So how can SREs use chaos engineering? We all know that people start with manual game days and then try to automate them. That's one way of looking at it. And now you can extend chaos engineering practices to QA. There is a lot of advanced work happening in continuous deployment, so you can trigger chaos tests before a build is deployed and after it is deployed, pre-CD and post-CD, and use chaos testing for continuous verification of the deployments that are happening. And then, shifting left, you can go even further. In a good, mature chaos engineering model, your developers are testing their code using chaos tests. We all know that you commit your code after you do the functional testing. But now you have a chance: your code can be tested against all the potential failures underneath your stack, right now, before you actually merge it. That's the use case we are talking about, and you can do that using Okteto as an example here.

Let me quickly talk about Litmus. Litmus is a CNCF incubating project. We started writing it about five years ago, and it has a great community; thank you, all community members, for using it and giving great feedback. It is an end-to-end chaos engineering platform that lets you collaborate on chaos across all of DevOps. It's not just a tool for injecting chaos; chaos engineering is really about collaborating on chaos testing and feeding the results back into DevOps. So it's an end-to-end platform with a central API server, a central portal, where developers, QA, SREs, and observability teams can all come in and create chaos experiments, orchestrate them, observe them, and then take the feedback and put it back.

At the center of it you have something called a Chaos Workflow. A Chaos Workflow is a set of experiments that you put together, and you push that into Chaos Hubs. A Chaos Hub is a kind of Git repository where your team sees Chaos Workflows as a single source of truth. And once you have Chaos Workflows as the elements of execution, you can aim them at various targets. Most likely you'll start with Kubernetes resources, but you can also extend to other cloud platforms, bare metal, et cetera.

So that's a quick introduction to Litmus, and you should remember that you are basically collaborating around a chaos test, a chaos workflow. Once you build a chaos workflow, you can schedule it whenever you want; that's usually how you do game days. You can also trigger chaos workflows based on a deployment, pre-CD and post-CD; that's your continuous deployment use case. You can put them into CI pipelines, and you can inject them into CD pipelines. And most important for this talk, you can simply kubectl apply them.
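To make that concrete, here is a minimal sketch of what a pod-delete chaos manifest could look like in Litmus. The resource names, namespace, labels, and durations below are placeholders chosen for illustration (they are not from the talk), and exact fields vary between Litmus releases, so treat this as a shape rather than a recipe.

```yaml
# Hypothetical example; names, namespace, and labels are placeholders.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: todo-db-chaos
  namespace: dev
spec:
  engineState: active
  chaosServiceAccount: litmus-admin   # service account with chaos permissions
  appinfo:
    appns: dev
    applabel: app=postgres            # target the database pods
    appkind: statefulset
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION  # how long the experiment runs (seconds)
              value: "45"
            - name: CHAOS_INTERVAL        # delete a pod every N seconds
              value: "5"
```

Running it is then one `kubectl apply -f` of that file, which is what makes it easy to drop into a CI job or run from a developer's terminal as described next.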
So at the end of the day, a chaos test, a chaos workflow, is a manifest. That's really the power of Kubernetes here: you can start creating new use cases for chaos. And once the manifest is executed, it's going to set up the entire chaos environment, execute the chaos, connect to the Chaos Center, and report the results back. That's how easy it is for developers to inject chaos at development time. Just to summarize: you have a YAML file that you kubectl apply, it can be collaborated on through the Chaos Center, and a lot of people can use it. You have a lot of predefined chaos tests; you can modify them, upstream them, or version them, and manage these chaos workflows. Then you can use them in pipelines or kubectl apply them in many different ways.

This is another thing I wanted to mention: chaos is not necessarily only a shift-left process. Generally what we have seen is that you start chaos as a game day in the SRE or ops ecosystem and then try to automate it into CI pipelines, but it can also start the other way around. Kubernetes needs more chaos testing before the code gets shipped, so you can start chaos in CI and CD pipelines, and then you have an easier time convincing people: hey, we've been doing chaos in the pipelines and we're well tested, so why don't we do this in production? That makes it easier for people to do chaos in production. So shifting chaos right is very much a good use case as well.

To summarize, before we go and see how you can use Litmus to do chaos at development time on the Okteto platform, let me recap the return on investment of chaos at the development stage. You're writing code, and you can get that code tested against possible faults in hundreds of different deployments. You can get it tested on Azure, on AWS, on GCP, or on other distributions. Before you even merge your code, it is possible to get it deployed in all possible places, cause some chaos, and see whether your code still works. Very similarly, you can get that done in your QA testing as well. So with that, let's go and see how you can run Litmus chaos in the development environment. Welcome, Ramiro.

Hi everyone. I'm very excited to be here. This is the first live talk I've done in a while, so I might be rusty, but I hope you enjoy it. There we go. Just to recap, LitmusChaos is an open source chaos engineering platform that helps you validate that your code is resilient and your applications keep working. Okteto, on the other side, is an open source project, a CLI that allows you to deploy and develop in remote environments running on Kubernetes. When we started Okteto, one of our main drivers was that we saw that the divide between dev, staging, and prod, especially when you adopted Kubernetes, was getting bigger and bigger. So three or four years ago, two of my co-founders and I decided, hey, let's do something about this. We started with this open source CLI to launch dev environments into Kubernetes and give you code synchronization, remote execution, all those things. As we started to explore this concept of remote dev environments, the community started to bring up a lot of interesting use cases.
And one of them was: hey, now that we have our development in Kubernetes, there's a lot more we can do as we write code in terms of validation, verification, and testing than before. Typically you have to wait for a CI workflow to do things like chaos engineering, end-to-end testing, or stress testing. So a lot of the community started to poke around and say, hey, now that I have a cluster for development, and it's remote so I don't have the limits of CPU and memory, what else can we do? And then I met the people at Litmus, we started talking, our communities started talking, and we saw this really cool use case of doing chaos engineering as part of the dev flow instead of something you do at the end. So I want to show you a demo of what that looks like, talk about some of the key points of what we've seen, and explain why we think chaos engineering should be shifted as far left as possible, because it's a great tool for building cloud native applications.

I recorded a video this time because I didn't want the Wi-Fi to fail on me; that would be chaos. So this is an example of a dev environment. It's using Okteto and running on Kubernetes. It's a simple app, a to-do app: you put in your tasks. As I was preparing this talk, I added tasks to track preparing my talk, packing my bags, getting to KubeCon. A simple app: it's a Go API, it has a Postgres database, an Ingress, the whole thing, and everything is running in a namespace in Kubernetes.

So now I have my application, and the first thing we're going to do is run a baseline to see what's happening. One of my favorite things about Litmus and about Harness's chaos engineering is that it's all self-service. As Uma was saying, it's a YAML file. You can pre-create it and put it in your repo so everybody else on your team can use it. There are catalogs. And one of the things I really like is that it's completely self-service: you don't have to be an expert. If somebody else put the YAML in your repo, all you have to do is go to this portal, which can be a shared portal or your own instance, load the YAML into Litmus, and click. For this demo I am using Harness's hosted service, but it's the same thing as the open source version. So you create your workflow, import it, give it a name, and you always have a chance to customize it.

Just to show you what we're going to run for this experiment: we have an experiment that's going to delete a pod. We're going to delete the database of our application to see if the application can handle a database outage. With Litmus we can pick the target of the experiment and how long to run it. And something I really like is this concept of probes: you run a test while the chaos experiment is running to validate an assumption. In this case, we're going to make a GET request to this to-do list to fetch the to-do items continuously while the experiment runs. The experiment is going to run for 45 seconds, and every five seconds Litmus will kill the database while making all these requests in parallel. The objective is to see if our application is resilient, because databases fail. They go down. Pods get recycled. You've all been there. And it's something that we as developers need to be aware of. So you configure your experiment and click Start, and then Litmus will give you this dashboard. You'll see I've run other experiments before. And off we go.
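The probe described here is declared inside the same experiment manifest. Below is a rough sketch of what a continuous HTTP probe stanza might look like; the probe name and URL are hypothetical, and probe field names and timeout units have changed across Litmus releases, so check the docs for your version.

```yaml
# Hypothetical probe definition attached to the pod-delete experiment.
experiments:
  - name: pod-delete
    spec:
      probe:
        - name: check-todo-api          # placeholder name
          type: httpProbe
          mode: Continuous              # keep probing while chaos runs
          httpProbe/inputs:
            url: http://todo-api.dev.svc.cluster.local:8080/todos  # placeholder URL
            method:
              get:
                criteria: "=="
                responseCode: "200"     # expect the API to keep answering
          runProperties:
            probeTimeout: 5             # units/types vary by Litmus version
            interval: 2
            retry: 1
```

If the API stops answering while the database pod is being deleted, the probe, and therefore the experiment, fails, which is exactly what happens next in the demo.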
One thing I like about the portal, while this runs, is that it's also very easy to share your experiments and your results with your team. All of this collaboration, enabling others, is part of what makes it easy to shift these practices left. Because it's not just about you: you're building applications with a team. It might be async, or they might be sitting next to you. And that is something to keep in mind as we pick these tools: make sure the tools enable your team. Make sure you're not adding extra gates by having one expert run the tests. It's much easier and more effective when anybody can participate.

So in this case the experiment is running. The first thing that happens is that Litmus installs the experiment in the namespace and spins up a pod, and then the experiment runs. If you go to the Litmus hubs, there are a bunch of pre-created experiments; this is pod delete, the most basic of them. And you can see here, if you look at the state of the DB, it's processing, because now we have Litmus running the experiment and killing the database over and over again. That's a good simulation of: hey, I have a bad node in my cluster, I don't have the right tolerations in place, and the application keeps hitting it and failing. And in this case, these are the logs of the actual to-do application, and you can see how Litmus is already querying the application, hitting that endpoint, and getting some errors.

Once the experiment finishes, you're going to see the result of it at the end: it gives you a table of all the actions it ran. This is a simple experiment, it's only one thing, pod delete. Once you get more comfortable with chaos engineering, you can build experiments with multiple steps: delete pods, delete the PVC, stress your resources. In this case, right there it says the probe failed; you'd normally see the detailed log right under it, but I recorded the video with that part cut off. Anyway, the experiment failed.

So what we're doing now is we're going to look at the code. You can see the get-items code: super simple, super naive. Call the database, don't handle any errors, just get the items and return them. So now we're going to fix this. We ran this experiment and found that our application is not resilient. And here is where Okteto comes into the picture. It's a CLI, open source; you can get it on GitHub, just search for Okteto. We have our dev environment running, so the first thing we're going to do is connect our local machine with this remote dev environment. That is the flow Okteto helps you with: you have your local machine, your local IDE, all your favorite tools. You run okteto up, and Okteto will synchronize your local machine with the dev environment. In this case it's going to synchronize the code with the API, and it's going to give you a remote terminal where you can run anything you want. The idea is that we have a remote dev environment where we can write code and test it as we go. So, as you've seen in the demo, we can just run a chaos engineering test as we write the code. And this is what a lot of our community came to realize: with Okteto and LitmusChaos together, you don't have to wait until you write, commit, merge, push, and redeploy. Instead: okteto up.
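For reference, the dev environment behavior described here is normally driven by a small Okteto manifest that lives next to the code. This is a minimal sketch only: the image, paths, and ports are illustrative, and the exact schema depends on the Okteto CLI version you are using.

```yaml
# Hypothetical okteto.yml for the Go API in the demo; values are illustrative.
name: api                  # replace the running "api" deployment with a dev container
image: okteto/golang:1     # toolchain image for the remote dev container
command: bash              # drop into a shell instead of starting the server
sync:
  - .:/usr/src/app         # keep local code in sync with the remote container
forward:
  - 8080:8080              # reach the API from the local machine
```

With a file like this in the repo, `okteto up` swaps the deployed container for the synchronized dev container and opens the remote terminal shown in the next part of the demo.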
Then, in this case, you start your process, and then you can run all these tests. Here I'm showing how this terminal is actually on Kubernetes: you can see it has the environment variables that Kubernetes is injecting for Postgres, and it's running in a container. So now we're going to change the code. I pre-wrote this code before the demo, and I did something naive just for demo purposes; don't do this in production. It's just retrying a bunch of times until the database comes back online. I did that to show you how we go from an application that is not resilient, which we discovered by running this test, to one that is.

We're going to use Okteto to develop. You see we run okteto up, we have this remote terminal into our container, and we just run the Go process. One of the things Okteto gives you is that you can iterate on your code: you don't have to rebuild your containers every time, you don't have to redeploy your application every time. You can change your code, go run, go run. And you see there, the new application is running with my code. What I think is cool about this is that, because everything is running in Kubernetes, we can now integrate these tools and get this extra level of validation. So we're going to test whether this application works. We don't have to open a PR, we don't have to commit and go through that whole step; we can just informally write code and make some experiments. In this case, let's see if this retry mechanism we added actually makes sense.

Then we go back to Litmus and run the same experiment. We have that YAML, and that YAML is reusable; you don't have to rethink your experiments every time. It could be a complex experiment, ten or fifteen different experiments one after another really throwing everything at the application, or something as simple as a pod delete. So we're going to rerun the same test again. I shortened the loop so we don't have to go through the entire thing again, but I just want you to see how the flow works: launch the experiment, see it fail, okteto up, change the code, small changes, iterate on it, see what works and what doesn't, run your tests again. Now we're running this against the code we just modified. It's not the code in staging, it's not the code on a branch; this code is on my local machine and in this remote environment. You'll see how the experiment starts, the database again gets killed in a cycle, and as the experiment runs we'll see if our code is now resilient or not.

Back to the Litmus portal, back to the results. By the way, you can get all of this through kubectl as well. I just like to do it through the UI because I think that also makes it more accessible to other people, especially people who might not be as expert in Kubernetes as we are. But you see there at the bottom, now my test passed. We actually have proof here that the change we made made our application more resilient. And then we're ready to go through our formal QA and CI process. We know we have something that works: it passed in Kubernetes, it passed in our namespace. Now we can commit, push to a branch, ask for review from our peers, and eventually get that into staging. So these are the scenarios we're seeing now in our community. This is an example of how you can take all these cloud native tools together and really create this culture of iteration and shifting left.
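To make the "naive retry" fix concrete, here is a hypothetical reconstruction of what that get-items handler might look like in Go. It is a sketch under the talk's assumptions (a `todos` table, an `Item` type), not the demo's actual code, and, as Ramiro says, a blind retry loop is demo-grade only; a real service would use bounded backoff with jitter and a context deadline.

```go
package todo

import (
	"database/sql"
	"time"
)

// Item is a to-do entry; the fields are assumed for this sketch.
type Item struct {
	ID    int
	Title string
	Done  bool
}

// getItems queries the database, retrying for a while if it is unreachable,
// so a short outage (for example, the Postgres pod being deleted and
// rescheduled) does not immediately surface as an error to the caller.
func getItems(db *sql.DB) ([]Item, error) {
	const maxRetries = 10
	var lastErr error
	for attempt := 0; attempt < maxRetries; attempt++ {
		rows, err := db.Query("SELECT id, title, done FROM todos")
		if err != nil {
			lastErr = err
			time.Sleep(time.Second) // naive: wait and hope the database is back
			continue
		}
		defer rows.Close()

		var items []Item
		for rows.Next() {
			var it Item
			if err := rows.Scan(&it.ID, &it.Title, &it.Done); err != nil {
				return nil, err
			}
			items = append(items, it)
		}
		return items, rows.Err()
	}
	return nil, lastErr
}
```

Rerunning the same pod-delete experiment against a version like this is what turns the probe from failing to passing in the demo.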
One of my goals with this talk is for all of you to think about what else you can enable your teams to do before CI, before QA, before staging. So just to wrap it up: our point here is that when you have ephemeral dev environments running on Kubernetes, it's easier to run chaos experiments. You deploy the ephemeral environment, you run your tests, you kill the environment, you go to your next task. It makes it easier, and because it's running on Kubernetes, you can reuse the same experiments that you're using in staging and production while you write code.

Second, the more we run these tests and the more we see them, the less expensive they're going to feel. When I talk to the community, a lot of people feel like, hey, chaos engineering is really hard, it takes time, I have to do all this setup. And I get it: if every time you want to do it you have to deploy a cluster, get a branch, and build all these artifacts, yes, it looks expensive. It is. But with this kind of tooling, because it's more accessible and inclusive and you have self-service dashboards, you can do it more often. Then it's going to feel less expensive and more like a normal part of your workflow, the same way that unit tests and integration tests became part of our dev workflows.

And finally, and this is something I think is important to think about: the more we run these tests, the more we use chaos engineering tools in all the phases of development, and the more we use Kubernetes, the better we're going to get at it. Because now I found this bug in my code, I made my to-do app more resilient, and every time I write code against a database I'm going to remember: right, I have to account for it maybe being offline. And because I've been doing this, every other application I ever build is going to be slightly better. This knowledge compounds. I'll help my friends and my peers, and we're all going to get better at it. For me that's really important, especially as we move to this cloud native world where, as Uma was saying, applications are more complex, you have multiple components, and these things matter more.

So to summarize this talk, and I'm happy to talk about it afterwards: let's bring chaos to developers. We want to make chaos a natural part of the dev workflow. Let's not put gates around it. Let's not think of chaos engineering as something you only do in QA, or something that only your pre-release team does, if you even have one. And let's do it because quality will go up: the more we use chaos testing, the more we adopt these practices, the more we develop on Kubernetes, the better our applications will get. So thank you so much for attending this talk. Please join our communities; the links are there, and both projects are open source. There's also a Litmus channel and an Okteto channel on the Kubernetes Slack. Join us. If there's a topic you want to chat about, we'll be more than happy to collaborate and push the state of the art together. Thank you, folks.

Any questions? OK, there's a mic in the middle of the aisle; I'll pass it to the front.

Hi. I just wanted to ask if Litmus supports network disruption scenarios, or only deleting pods and things like that.

Yes. Litmus supports many types of network disruption: network delays, packet drops, packet corruption. And you can do it all very declaratively. Sorry, for those of you leaving the room, please be quiet.
We have a few more minutes. Most developers that I've come across primarily use unit testing from a terminal CLI. Is there a way to run these kinds of tests with absolutely zero requirement on a web UI?

Yes. I mean, that's the real power of what we are trying to bring. There are certain developers, say more expert ones, who will go and create a test; you can use the UI to do that. But at the end of it, you have a YAML file, and that gets pushed into your Git repo. Now another developer comes in. Normally you'd just run your unit tests; now you're also going to run a chaos test. You kubectl apply that particular test along with your unit testing, and it runs. From the developer's point of view, you just need one command, a kubectl apply of that chaos test. Internally it sets up all the environment required for chaos, connects to the chaos API server, and gives the results back. You can even go to the extent of tagging the results back onto your merge request.

And to add to that, a really cool thing about Litmus is that all these experiments are defined in YAML, so you can check them into your repo and make them portable. It supports all these tokens for the namespace name, the application target, so you can start to build, in a repo or even in your application, a catalog of tests for different levels of chaos engineering. That's something I like a lot, because in other tools I've used before, the actual definition of the test might live in an application, in a web UI, and it's harder to understand what's going on. The fact that you can put everything, your Helm chart, your chaos engineering, your development manifests, in the same repo makes it a lot easier. It's something I've been discussing with maintainers of open source projects, trying to get them to include these tests. I think we're getting there, and it's something I like a lot about Litmus.

Of course, we need more feedback. It's a community project, and we've been getting a lot of good feedback, so please do try it out and create more GitHub issues. We're happy to engage and fix things as we learn from the community. Anyone else, questions? If not, thank you very much and see you next time. Thank you, everyone. Thank you. Have a good KubeCon.
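For readers who want to follow the CLI-only flow from that last answer: once the experiment YAML is checked into the repo, a CI job (or a plain terminal) can apply it and read the verdict with kubectl alone. Below is a rough sketch of a single CI step in GitHub Actions syntax; the file path, namespace, and resource names are placeholders, and the exact fields on the ChaosResult resource vary across Litmus versions.

```yaml
# Hypothetical CI step; paths, names, and namespace are placeholders.
- name: Run pod-delete chaos test alongside unit tests
  run: |
    kubectl apply -f chaos/todo-db-pod-delete.yaml -n dev
    # Crude wait for the 45s experiment plus setup and teardown to finish.
    sleep 120
    # Litmus records the outcome in a ChaosResult resource; inspect the
    # verdict and wire it into the job's pass/fail however you prefer.
    kubectl get chaosresult todo-db-chaos-pod-delete -n dev \
      -o jsonpath='{.status.experimentStatus.verdict}'
```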