of experiments in Kubernetes. At development time, obviously we're going to be writing unit tests for our application; we want to make sure the application behaves the way we expect it to behave. No matter which language we use to develop the application, we have our favorite unit testing frameworks that let us write unit tests easily, execute them easily, and get results from them easily. Now, let's say we take this application we've developed and deploy it in Kubernetes. Don't you want to test the deployed version of the application? Of course we do. This is where experiments come in. We'll simply use the term "experiment" as shorthand for a test that you run against your deployed, running application. And we're going to look at how you can easily author these experiments, how you can easily execute them, and how you can consume results back from them.

I'm sure you're familiar with different types of experiments. The simplest type is load testing your application. Say it's a service you're deploying: you want to make sure it can handle realistic load, and that its latency and error-related properties are okay even in the midst of real-world load conditions. That's a load test experiment. Let's say you also have a new version of your application, and you want to dark launch it or canary it: you're either sending a copy of the end-user traffic to the new version, or sending a portion of the traffic to the new version, and measuring how well it performs. That's an example of a canary experiment. How about resiliency? Maybe a pod goes down in the cluster, or a node goes down, and you want to see how the application holds up in the midst of these instabilities. This is where chaos testing comes in: chaos engineering is a way to inject this type of instability into the infrastructure in a very controlled manner and see how your application performs. That's a resiliency experiment. And finally, there is of course A/B testing. Maybe you're deploying a machine learning model that recommends books or socks or news articles or whatever, and you want to make sure you're gaining new users and increasing your revenue. A/B testing is all about picking the best version of your application with respect to business metrics. That's an A/B testing experiment.

In this talk we're going to focus on a couple of very simple experiments (I can give you pointers to other experiments too): the load test experiment, and the resiliency experiment, the chaos testing experiment. I'm going to demonstrate a version of each of these using a couple of open source tools. The first tool is called Iter8; it's an open source tool for Kubernetes experimentation, release engineering, and release optimization. The second tool is called Litmus; it's a CNCF incubating project that enables all sorts of chaos engineering and chaos injection experiments. All right, all the demos I'm showing today are off of our public iter8.tools site, so feel free to try them out at your convenience. These demos should take only a couple of minutes end to end when you run them in your cluster.
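(For reference, here's a minimal setup sketch for following along. It assumes a local kind cluster and the Homebrew tap documented on iter8.tools; any cluster and any of the project's released binaries work just as well.)

```bash
# Create a local Kubernetes cluster for the demos (any cluster works).
kind create cluster

# Install the Iter8 CLI via Homebrew; released binaries from the
# project's GitHub releases page are an alternative.
brew tap iter8-tools/iter8
brew install iter8
```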
Okay, so the first demo is the load test experiment. We're simply going to load test an HTTP service inside the Kubernetes cluster, and hopefully it will introduce us to the nuances of experimentation. As part of the load test, of course, we want to make sure the application handles a given load and is able to meet its performance requirements and performance objectives. That's the idea of the experiment.

In detail, this is how the schematic of the experiment looks. We have a Kubernetes cluster, and we have our HTTP service running inside the cluster. We're going to launch an experiment to run inside the cluster, and the experiment has three different tasks. The first task simply checks whether your application is ready: if your application's resources are not available, or it's not ready, there's no point in starting an experiment, so it's a basic check. The second task generates load for the service and collects various types of metrics based on the responses from the service: latency metrics, error metrics, and so on. The final task validates the service level objectives; these are the performance objectives that we want the experiment to satisfy. It's really a simple experiment, and we're going to use the Iter8 CLI to launch it, get results back from it, view a report of it, and assert various conditions on its outcome.

All right, let's jump in. As I said, you need... oh, I did not start Docker; I knew something like this was going to happen. Let me get Docker started. The first thing you need is a Kubernetes cluster, so I'm going to get myself a local Kubernetes cluster for this experiment. While the cluster is booting up, let's take a look at what we're going to do in this experiment. First, we create the cluster; then we deploy the application, the HTTP service, inside the cluster. To run the experiment, you also need the Iter8 CLI. I already have the CLI installed on my local machine; it's a simple brew install, or, if you prefer, you can use one of the released binaries to install it on your machine and run the same experiment. Then we launch the experiment, and this is what the launch looks like; hopefully you see how simple it is. I said the experiment has three tasks, and those are the named tasks you see there. Iter8 provides a set of pre-built, pre-defined tasks, and I'm using three of them: a task for checking readiness, a task for generating load and collecting metrics, and a task for assessing the metrics. And finally, we also do the assertions and reporting on the experiment, like I mentioned.

All right, I think I've got my cluster, so let's create our service. It's httpbin, a sample application you're probably familiar with; it's just useful for testing and demos. We create a deployment, and we expose the deployment as a service inside the cluster. Okay, now let's go ahead and launch the experiment. As part of the launch process, Iter8 is, under the covers, fetching a Helm chart and instantiating it; that's how the experiment is packaged and delivered to the cluster. That's what Iter8 is doing under the covers.
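(For reference, a sketch of the deploy-and-launch step. The task and parameter names, the `ready`, `http`, and `assess` tasks, `http.url`, and the `assess.SLOs.upper.*` keys, follow my reading of the Iter8 load-test tutorial and may differ across Iter8 versions; the `kennethreitz/httpbin` image and `default` namespace are assumptions.)

```bash
# Deploy the sample HTTP service and expose it inside the cluster.
kubectl create deployment httpbin --image=kennethreitz/httpbin --port=80
kubectl expose deployment httpbin --port=80

# Launch the experiment with its three tasks:
# readiness check, HTTP load generation, and SLO assessment.
iter8 k launch \
  --set "tasks={ready,http,assess}" \
  --set ready.deploy=httpbin \
  --set ready.service=httpbin \
  --set ready.timeout=60s \
  --set http.url=http://httpbin.default/get \
  --set assess.SLOs.upper.http/latency-mean=15 \
  --set assess.SLOs.upper.http/error-count=0
```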
So we've launched the load testing experiment, and it's running right now. But let's take a look at the parameters we supplied. Not only did we say which tasks we want to run as part of the experiment, we also parameterized those tasks, and here's how. We said we want to check the readiness of this deployment and the readiness of this service. For the service, it's simply checking whether the service exists; for the deployment, it's checking that it exists and is available. These checks are all built into the task. For load testing, I simply gave it a URL, obviously a cluster-local URL. But there are a whole lot of other parameters you can use to customize this test: the task also accepts things like headers to use in the queries it sends, the number of requests you want to send, the duration of the test, the request rate, and the number of parallel connections, all the things you would normally use in a load testing experiment. I just used the default values. And I'm also saying I want to assess these SLOs: I want the mean latency to be within 15 milliseconds, and I do not want any errors. Those are the SLO conditions I would like the application to satisfy.

Okay, by now the experiment must have finished; let's go ahead and check it. Very good. I just asserted that the experiment completed, that it ran without any failures, and that all the SLOs are satisfied, and it looks like all those conditions are met. So we're good: the experiment ran, and the application is working as we would expect. Let's also take a look at a report of the experiment; we can get a text report or an HTML report. Let's get an HTML report. The report once again shows the status of the experiment, that it did its job, and that all the SLOs are satisfied by the application: the mean latency, which we specified should be within 15 milliseconds, is within 15 milliseconds. It also gives us a nice histogram of the latencies, and as you can see, all the latencies are well behaved; this is how we wanted the application to be. Okay, let's clean up this experiment to set ourselves up for the next one.

So that's really it for the HTTP load testing experiment. Hopefully you got a sense of how easy it was to author the experiment, launch it, and get results out of it; it's also very, very easily configurable, since you can just change the parameter settings to suit your application and your requirements. Now, I mentioned an HTTP load test; maybe you're running a gRPC service, and you can load test the gRPC service too, using a very similar experiment. The only variation is that instead of using the HTTP task, you use a gRPC task, which does the gRPC load generation and metrics collection. And once again, you can customize everything: the load profile, the call data, the call metadata, whatever you want.

Okay, so let's move on to our next experiment. This is really the main experiment I wanted to highlight: a resiliency experiment. And here is one practical use case for an experiment like this. We all know that the same application, deployed with a different configuration, can behave very differently: configuration can have a very big impact on the resiliency of the application.
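(For reference, a sketch of the assert, report, and cleanup commands, plus the gRPC variant. The assert conditions and the `-o html` flag follow the Iter8 CLI as I know it; the gRPC host, call, protoURL, and SLO metric keys are illustrative placeholders, so check the gRPC task's documentation for the exact names.)

```bash
# Assert that the experiment completed, had no failures, and met its SLOs;
# the command's exit code makes this easy to script in CI.
iter8 k assert -c completed -c nofailure -c slos --timeout 60s

# Generate a report: text by default, or HTML.
iter8 k report -o html > report.html

# Clean up the experiment before the next demo.
iter8 k delete

# The gRPC variant swaps the http task for the grpc task.
iter8 k launch \
  --set "tasks={ready,grpc,assess}" \
  --set grpc.host=hello.default:50051 \
  --set grpc.call=helloworld.Greeter.SayHello \
  --set grpc.protoURL=https://raw.githubusercontent.com/grpc/grpc-go/master/examples/helloworld/helloworld/helloworld.proto \
  --set assess.SLOs.upper.grpc/latency/mean=50 \
  --set assess.SLOs.upper.grpc/error-count=0
```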
So here is an experiment to check a couple of different deployment configurations and decide which one is more resilient. That's the idea of this experiment. And here is the setup. Again, we have a service, that's our application; behind the service there's a deployment, and the deployment has its pods. We are going to periodically kill pods of this application; that's the chaos we're going to inject, and we're going to do it using a Litmus chaos experiment. I said "an experiment", but this is really a joint experiment, two experiments in one. One of them is the Litmus chaos experiment, and that's what injects the pod-kill chaos: once in a while, periodically, it just goes and gets one of your application's pods and kills it. That's its job. Running concurrently with it is our load test experiment, very similar to the one that just ran. The only difference is that we are now doing the load test in the midst of this chaos injection: we're checking whether your application is resilient even in the midst of the pod chaos we are injecting through the Litmus experiment. That's the whole idea here.

Spoiler alert: we're going to use two different configurations. First, an unscaled version of the app, just one pod; then we'll scale the application up, to two or three pods, and retry the experiment. No surprises here: the unscaled app is not going to be resilient to the pod failures, and the scaled app is going to be resilient because it has more replicas. Of course, this is just a demo scenario, but really there is no limit to the kind of chaos you can inject and the kind of configuration testing you can do. The idea here is just to demonstrate concurrent experimentation using Litmus and Iter8 simultaneously.

All right, let's jump into this experiment. As setup, I need a Kubernetes cluster, and I also need Litmus in my cluster. I already have Litmus installed; Litmus has a cluster-side component which orchestrates these chaos experiments, and I already have that installed in my cluster. So let's go ahead and create our application. It's once again a very similar deployment that I'm creating, and once again I'm exposing that deployment. Okay, good, it's in the cluster.

Now, I said we will launch two experiments concurrently; that's what we're going to do right now. Let me launch the Litmus chaos experiment. Oh, that's not good. Let's try again. Good. What we've done here is package the chaos experiment up as a Helm chart so that it can be easily configured and launched into the cluster; that's what we just did, and we have launched the Litmus experiment into the cluster. Next, we launch the Iter8 experiment into the cluster. There is a little variation in the Iter8 experiment here: if you notice, we are checking the deployment, we're checking the service, and we are also checking the ChaosEngine resource that is running the chaos experiment. We want the Iter8 experiment to wait until the chaos experiment starts and then kick off its load testing, and this is how we accomplish the synchronization between those two different experiments.
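(For reference, the Litmus side boils down to a ChaosEngine resource along these lines. This assumes the pod-delete experiment from ChaosHub is installed in the cluster and that a `litmus-admin` service account exists; the resource name, namespace, and `app=httpbin` label are illustrative.)

```bash
# Run Litmus's pod-delete experiment against pods labeled app=httpbin:
# kill a matching pod periodically for the duration of the test.
kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: httpbin-chaos
spec:
  appinfo:
    appns: default
    applabel: app=httpbin
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # inject chaos for one hour
              value: "3600"
            - name: CHAOS_INTERVAL         # kill a pod every 5 seconds
              value: "5"
EOF
```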
So, going back to the Litmus experiment, what we just specified was: pick a pod that has this label on it, kill it every five seconds, and do this for the next hour. That's all this Litmus experiment is doing. I can use plain kubectl to see how far along the Litmus experiment is. It's still running, which is good; that's what we wanted, the chaos experiment is running. We can check the situation for the Iter8 experiment using the Iter8 assert command that we have used before. And this is interesting. The experiment completed, and the experiment itself does not have any failures; it ran from start to finish. But the SLOs are not satisfied, because obviously we are killing the pods while asking that latency remain low. That's not going to happen, as expected, so the SLOs are not satisfied.

So let's take a closer look at what happened. If we take a look at the report, we will get a better picture of the experiment; let's look at the text report. We can already tell that the 99th-percentile latency is off the charts, around 3,000 milliseconds, whereas the experiment specification asked for much less: in the Iter8 experiment specification, we said that latency needs to be within 100 milliseconds. That was the SLO requirement, and this is off the charts, so no wonder it's not satisfied.

So let's iterate, and this time we're going to scale up the application, a new deployment configuration for the application, and give the same experiment a try. All right, this experiment is getting launched, and while it's being launched into the cluster, we can also take a look at the logs from an Iter8 experiment. If you want to get a closer look, if you want to debug an experiment, for example because something went wrong in it, you can easily get logs from the experiment and see what's going on. Let's try that; it's simply a log subcommand. And this is interesting: it checked the readiness condition for the deployment, which was not yet true; it tried again, and it became true. All the readiness conditions passed, so it started the HTTP task for the actual load test. That's what you're seeing in the logs. Essentially, you see how far the experiment has progressed, which tasks have completed and which tasks are in flight, and if there's a problem with a task, you get to see that problem in the logs as well.

Going back to our experiment, let's assert the SLO conditions and see if we had better luck this time. Oh, perfect. The experiment completed, it has no failures, and all the SLOs are satisfied, as you would expect: we scaled up the application, so now it's more resilient. Even in the midst of this pod chaos, which is still going on, the application is able to satisfy the SLOs.

So, I hope you get a sense of how easy it is to run not just a single experiment, but also these joint, combined resiliency experiments, where you're doing chaos injection along with SLO validation or performance validation. It's easy to author them, it's easy to execute them, and it's easy to get results back from them and clean them up in the cluster. That's the moral of the story. And we've barely touched the capabilities of the tools I demonstrated today.
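(For reference, a sketch of the check, scale, and retry loop. The chaosengine and chaosresult resource kinds come from Litmus, and the `iter8 k` subcommands are the ones I'd expect from the Iter8 CLI; the resource names match the illustrative ChaosEngine above.)

```bash
# Watch the chaos from the Litmus side with plain kubectl.
kubectl get chaosengine httpbin-chaos
kubectl get chaosresult

# The first run fails its SLOs with a single replica; scale up and retry.
kubectl scale deployment httpbin --replicas=3

# After relaunching the Iter8 experiment, check its status and logs.
iter8 k assert -c completed -c nofailure -c slos --timeout 120s
iter8 k log
```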
So I'll quickly go through some of the other experimentation features that may be useful in your setup. In the case of Iter8, first of all, you can use metrics beyond those generated by Iter8's built-in tasks: metrics you may already be collecting in a database. You can ask Iter8 to fetch those metrics, for example from Prometheus or Sysdig or wherever you're collecting metrics for your application, and do the validation based on them; that's one of the built-in tasks in Iter8. You can also do multi-loop experiments. We just did single-loop experiments, end to end: all the tasks ran sequentially, one after the other, from start to finish. But you can repeat these tasks periodically in a multi-loop experiment, and Iter8 lets you do that. What's more, you can also say things like: after 10 loops of the experiment, send a notification to this Slack channel or this GitHub receiver with the status of the experiment. So there is support for notifications with conditions. You can also use the Iter8 GitHub Action, so there's no need to launch the experiment manually from the command line; you can bake it in as part of a GitHub Actions workflow if you want. The GitHub Action essentially downloads the CLI, so all the commands that were available to you using the CLI are available to you within the GitHub Action as well.

One other thing: Iter8 is built to be extensible. Out of the box it works with Kubernetes-native resources, but there are also examples of Iter8 working with Knative, which is a serverless resource, and with KServe and Seldon, which are machine learning serving resources, and you saw an example of the readiness task working with the ChaosEngine resource in the resiliency experiment.

Litmus is also a very mature project; it's a CNCF incubating project, and do check out the Litmus project page for all the awesome features you can use. The heart of Litmus is its Chaos Center. In particular, the ChaosHub in Litmus has thousands of community-curated chaos experiments: fundamentally 50 or 60 different types of experiments, with thousands of variations of those experiments curated as part of the hub, ready to use in your setup.

So, what next? First of all, please try the demos; they are really easy to try, and you can also point the demos at your own app, because it's really easy to configure these experiments and point them at your applications. If you have questions or comments, I'm around today and tomorrow throughout KubeCon. Please also feel free to chat with us on LinkedIn; my handle is 3MCP and my co-speaker's is iSpeakCode with a zero. And please visit our project pages and join our communities; engage with us via Slack, GitHub, everywhere. Thank you very much.