Marisa. Awesome. So we are recording again, as I just got that notification. But before I jump into the presentation, I wanted to learn a little bit about all of you. I know some folks are still joining, but I'm curious, what was your reason for joining today's session? I'm hoping that you'll interact with me. I want you to post into chat what you're hoping to learn from this session. There isn't a Kubernetes emoji in, well, in any of the standard keyboards, but there is this fun little wheel thing. So if you want to learn how to make Kubernetes more reliable, post that into chat. Or if you want to learn more about chaos engineering and are more interested in that part, there's the little dynamite stick, because sometimes chaos engineering is about blowing things up. And if it's both, that's great. Right. I think I'm seeing some folks post. Sal and Axel are interested in both. Lots of people interested in both. Looks like some folks are interested in disaster recovery setup, so that's fantastic as well. Awesome. Seeing lots of great responses. So it looks like we've got a lot of people that are interested in both, which is fantastic. Reliability and chaos engineering do go hand in hand. So it is sort of a weird question in some senses, because I think in order to be interested in chaos engineering, you're probably interested in reliability as well. Awesome. Well, thanks for indulging me, and hopefully that was fun. If you haven't posted, feel free to go ahead and do that.

But while we do that, you've told me a little bit about why you're here, so let me introduce myself. As Marisa mentioned, I am the director of advocacy at a company called Gremlin. Basically, that means that my job, and I love my job, is to inspire people and educate them to build reliable systems and just to use technology and become better engineers. Previously, I worked at Datadog. I worked at O'Reilly Media, the company that makes the tech books with animals on the covers. I was an engineer at MongoDB. So a lot of technical history working on systems that are about reliability. Aside from that fun stuff about me, Marisa and I were chatting just before the session started about Pokemon Go. So I do play a lot of Pokemon Go. I also learned to make chocolate during the pandemic while everybody else was making sourdough. I decided that I wanted to learn to make chocolate, buying raw cacao beans, roasting them, all of that, which is super fun, also super frustrating. If you want to chat about that, we could chat about that later. My email address is my first initial plus my last name at gremlin.com. If you ever have any questions, say you watch this webinar and you're thinking about chaos engineering later today and you're like, oh, I wish I had asked that question, feel free to email me. I'm always happy to chat and help you out. Or if you're one of the people watching the recording of this and you weren't able to join us live, definitely hit me up, ask me your questions. Always happy to answer those. I do have a YouTube channel, which I've neglected now for a good six to nine months, but I do have some videos there as well. I try to record videos about general reliability, chaos engineering, monitoring, all of this stuff in the SRE space. So if you're interested in other things, go ahead and check out my YouTube, you know, standard like and subscribe.
All that said, if you're here to learn about Kubernetes or about chaos engineering or both, that's great, because we're covering both in this webinar. But I think that how you approach Kubernetes, in particular Kubernetes reliability, is heavily influenced by how you view Kubernetes. So my friend Chris Nova, who's heavily involved in the Kubernetes community, tweeted this question, and to summarize it: do you think of Kubernetes as a tool to manage nodes or as a tool to manage containers? Right. And I think that depending on what role you have in your engineering team, you might choose one or the other. And so, similar to the question that I asked before about reliability, where a lot of people answered that it's both, the correct answer here is sort of that it's both, right? But understanding that it's both, and understanding which direction you tend to lean, heavily influences how you think of reliability. And just to explain this, the typical perspective from an ops point of view is that Kubernetes is a container or a service manager, right? It abstracts away all of that. So if you ask an ops engineer or a DevOps engineer or often an SRE, as an ops engineer you set up your nodes and kubelets, right? You install the software, you add them to the cluster, you set up the networking, all the things like load balancers, ingress, storage, whatever you need. And so when it comes to what Kubernetes is, you're doing all of this infrastructure work, and then Kubernetes really manages the services and the apps, right? The services and their particular requirements, whatever developers need, are sort of abstracted away. So from an ops perspective, Kubernetes then becomes really a container or a service manager, because all of the other stuff, all of your work, is setting up the infrastructure. But on the flip side, if you're a developer, then the typical perspective is that Kubernetes is an infrastructure manager, right? It abstracts all of that away, so you can think of all of that infrastructure as just a giant pool of resources. Most developers don't have to worry about what cloud provider they're on or what sort of EC2 instance they might be using, or whatever node instance type for the cloud provider that they're on. As a developer, you just build apps, you put them into a container, you might declare what resources you require in your Kubernetes manifest, but basically Kubernetes handles the rest of it. It just runs, right? You need three replicas of a service? Great, that's easy. You need that to scale up to 10? No problem. So in this case, Kubernetes is more of a resource or infrastructure manager.

Now, back to that idea of a lot of things being both. In reality, Kubernetes is both, and this is great, but it also means that there's a lot of complexity in what Kubernetes is doing. It's not magic. Sometimes we think that Kubernetes is magic, right? You just tell it to run something and it runs it, or you tell it to handle infrastructure and nodes get added or removed from the cluster and everything just keeps working. But underneath it all, it's just software, and this means that we need to test to ensure that it's doing what we expect it to do. And so this is where I think that chaos engineering is extremely useful. So what is chaos engineering? Let's lay some groundwork here, right?
I think a lot of people are interested in what this is, and there are a lot of misconceptions about what chaos engineering is, whether it's the idea that chaos engineering is randomly taking down or unplugging servers, or that it's production testing, or that it's a practice only done by SRE teams. Now, there are a lot of chaos engineering jokes, but in practice, this is how I like to define chaos engineering. The first thing is that it's intentional experimentation on a system. It's got to be intentional, right? It's not random. And it's experimentation done by injecting controlled amounts of failure. You always have to be in control. Now, the main goal of this is to observe how your system responds, right? We're doing this to make observations, and ultimately that's about learning. We make those observations in order to better understand how our systems work so that we can improve our knowledge, and in improving our knowledge and improving ourselves as engineers, improve the systems that we work on. Ultimately, chaos engineering is about this learning. It's not about surprising your colleagues or keeping them on their toes. It's not about testing them to see if they know how to respond to incidents or conditioning them to be on alert. It really is a methodical process of learning. As I keep hammering home, learning improves yourself, and you, in turn, improve the technical systems that you work on.

So a lot of folks hear that description and it sounds a lot like science, and that's because it is science. When we do chaos engineering, we follow the scientific method, and that involves five steps. If you remember from grade school, those five steps of the scientific method start with observation. Now, this is a step that we often overlook, because you do this all the time when you work with your systems, right? The goal here isn't just to watch your systems but to have a basic understanding of their baseline. Some people call this the steady state. I personally hate that term because I think it's a misnomer. You work with dynamic systems that are constantly changing. You're deploying new code, or there are changes to infrastructure or your cloud environment. Your users are doing different things. Your systems are not steady, but they do have a normal state or a nominal state. So before you start doing chaos engineering, you need to have a sense of what that normal or nominal state is, right? If I said that there were 300 milliseconds of latency on a particular service, you should have an understanding of whether that's fast or slow or normal. You should know if CPU usage on a node at 75% is high CPU or low CPU or just normal. Oftentimes we generally pick this up as we do our daily work. We know what normal is. So you don't have to spend a lot of time on observation, but you need to have a sense of what that is. Once you know what normal looks like, you can then make an educated guess. And this is the hypothesis. So the hypothesis is your answer to "what would happen if," right? These what-would-happen-if questions. What would happen if your service that normally has 300 milliseconds of latency suddenly has 1000 milliseconds of latency? Your hypothesis might be that everything becomes slow, or it might be that your service reaches a timeout limit and returns an error.
It could be that it returns the timeout error and several of these things happen: it triggers an alert in PagerDuty or otherwise lets engineers know that something's wrong. Or maybe it doesn't really matter, 1000 milliseconds is perfectly fine, and nobody should be alerted. But once you have your hypothesis, you can then act on it, right? You can test your hypothesis and inject that failure. Recreate that what-if condition. If your service normally has that 300 milliseconds of latency, then inject 700 milliseconds and see what happens to the service when it's at 1000 milliseconds of latency. Did it hit that timeout? Did it return the expected error? Did multiple timeouts trigger an alert? And if it did, was that alert useful? Did it contain accurate information that would have helped the on-call engineer solve the problem? Also, we have to pay attention: did anything else happen? Maybe this service was a dependency for another service. Did anything happen to that other service?

Now, as you do all of this testing, it's important to be safe. The last thing that you want to do is blow up production or even blow up development, right? And lose company data or hinder other engineers from doing their jobs. I don't want you to lose your jobs by doing chaos engineering. So it means that you need to be safe. Being safe first means that you need to communicate with your team and the broader engineering organization about what you're doing. If your hypothesis is that an alert will be triggered, you had better tell that on-call engineer about what you're doing before they get that page. Communicating what you're planning to do and ensuring that everyone is aware is critical to this process. You need to be sure that you're not negatively affecting others. I mean, imagine that you're in a startup and your founder is doing a demo to new investors, and you suddenly, because of your chaos engineering, make that demo fail and lose your company millions of dollars. Or imagine that your engineering team is trying to resolve an outage, and you make that outage worse and they can't fix it because of what you're doing. So communicating is the first part of safety. The next thing is that you want to start small and in a non-production environment. If you're not confident that a single node or a single service in a staging environment will reliably handle the failure that you're injecting, then you have absolutely no business attacking multiple nodes or services in a production environment. You need to work exactly how you do with anything else technical: we always start in development, and we work on it and build confidence before we deploy it to staging. In our staging environment, we're trying to mimic production and ensure that everything works as expected. And it's only then, when we're confident in the staging environment, that we're ready for production. Lastly, always be safe. I recommend people wear helmets, not when you're walking down the street and not for chaos engineering, but around here where I live in Portland, Oregon, there are lots of people on scooters, especially as the weather gets nice, so please wear a helmet. That's my other safety tip. But back to that process. As you test, you need to gather as much data as you can so that you can analyze it. If everything went according to your hypothesis, then that's fantastic. It means that you understand your system well, but oftentimes things won't behave exactly as we expect them to. And that's okay.
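Before we move on to analysis, here's a minimal sketch of what that injection step itself might look like, using the Linux tc/netem tooling (which comes up again later in this talk). The interface name and the 700 millisecond value are assumptions for illustration; dedicated tools like Gremlin handle this more safely, and you should only run something like this on a test host you're allowed to break.

```bash
# Minimal latency-injection sketch using tc/netem.
# "eth0" and 700ms are placeholder assumptions; adjust for your environment.

# Add 700ms of egress delay on top of the service's normal ~300ms latency.
sudo tc qdisc add dev eth0 root netem delay 700ms

# ...exercise the service, watch your dashboards, and see whether timeouts and alerts fire...

# Remove the rule to restore normal behavior.
sudo tc qdisc del dev eth0 root
```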
When things don't go as expected, you'll have to do some analysis and really dig into your systems to understand them better. Very few people completely understand how Kubernetes works down to the smallest details. I don't fully understand how Kubernetes works. I understand the general principles, but down at that level of detail, things are constantly getting changed. They're constantly getting updated. We're pushing three releases of Kubernetes every year. Things are constantly changing. So it's important for us to dig deep. All of this means, again, that it's okay when things break. It's okay when things don't go as expected, because they're learning opportunities. This means that we have the opportunity to learn how Kubernetes works, how to build better applications, and how to make things more reliable. The final step in the scientific process, then, is to take what we learned, iterate on it, and share that information. Scientists are constantly repeating experiments so that they can tweak things with minor changes to learn more and understand things better. And they're constantly sharing the results so that others can do that as well. Chaos engineering is similar. So what does this look like with Kubernetes? As I said, there's a lot of complexity in Kubernetes. And depending on your role, there can be a lot of mystery about what's going on on the other side of Kubernetes. So I'm going to talk about the ops side and give three experiments that Kubernetes operators and admins should run. And then I'm going to talk about the dev side and give three experiments that Kubernetes developers, or developers deploying onto Kubernetes, should run.

So let's talk about that back end, the infrastructure side. What are three experiments that you should start running or consider running? For the first one, for Kubernetes admins, this is the baseline. This is where I think everybody should start. And it's the turn-it-off-and-turn-it-on-again test, right? That's always the joke with IT. Any problem, well, have you tried turning it off and on again? If you're running a Kubernetes cluster, I want you to think about this: what would happen if a node in your cluster died? So we start with the observation step, right? This again should be easy. You should have a general sense of how your cluster operates right now. You should have a sense of how many nodes are in your cluster, maybe not the exact number, but you should have a sense of the size, the scale of it. But then what's your hypothesis? Maybe you've got your nodes set up in an AWS auto scaling group, so when a node dies, it should automatically be replaced by a new one. But as you think of your hypothesis, you have to ask yourself how long you expect that to take, right? If you're in a cloud provider, how long will it take for that cloud provider to notice that your node is down and replace it? And what about the pods or the services or the objects that are on that node? What will happen to them? What if they were, for example, worker processes that were handling some data, doing some sort of data transformation? What happens to those jobs? If they were user services, what happens to those user sessions? Do they get dropped? We've all had that happen, right? You're working in an application, you're logged in, and suddenly you go to do something and it tells you that you need to log in again. It's obvious that your session got dropped, and it's kind of frustrating.
Or what happens if a developer tries to make a new deployment or scale up an existing one during that time and you're down a node? Also, what sort of metrics and alerts should your monitoring tool surface to you? And if the new node is created automatically, is it automatically re-added to the cluster? Do pods automatically get rebalanced onto it? So even for a very basic first experiment, there are a lot of questions here that you need to be able to answer if you want to run a reliable Kubernetes cluster. Once you start asking all these questions and forming your hypothesis, getting down into those details, you need to run the test: kill a node and see what happens. Now, there are a lot of ways that you can do this. You can use your cloud provider's interface or CLI to delete the node. If you run your own on-prem Kubernetes and it's your own nodes, you can SSH into them and shut them down, or really anything. There are a lot of ways to kill a node. There are tools like Gremlin, obviously, that make it easy to do, but no matter the environment, this is the super basic test. And once you run that experiment, you need to analyze what happened and where you might be able to improve. For example, if the replacement node takes too long, you might need to account for this and add extra capacity to your cluster. Or maybe your monitoring needs updates or your alerts need more details. But whatever you find, make the improvements, rerun the experiment, and validate it. And of course, share what you learn.

So, experiment two. We've talked about killing a node, but sometimes, especially if you're running your own cluster, you're also having to manage the control plane. Sometimes these are called master nodes. This really only applies when you're running your own cluster. If you're running in a managed service like Amazon EKS, Microsoft AKS, Google GKE, any of those managed services, that usually means they're managing the control plane for you. And this means you're not going to be able to take that control plane down to see what happens. It also means that if the control plane goes down, there's really nothing you can do about it except go on Twitter and complain about it like the rest of us. But for those that are running their own clusters, this is a very important test. Again, the observation part should be easy. But your hypothesis: what happens if you lose a master node? What do you think will happen? Hopefully you're not running just a single master node, because that would be disastrous. But if you are, what do you think will happen to the services that are currently running? What happens if you try to make new deployments? What about your monitoring on your Kubernetes control plane? What sort of monitoring or alerts do you expect to get? To actually implement this test, you could simply kill a master node. It could be the same as that first experiment, but I do not recommend it. The reason is that if you kill it and something goes horribly wrong, it's really hard to bring that back. That's going to take a lot of time. So for tests like this, I recommend that people use a black hole attack. Black hole is the term that we use for cutting off a node or service from the network. It's a really easy way to simulate that something's down without actually taking it down. To the rest of the world, if it can't communicate with it, then it's down. The idea here, though, is that it's easy to restore.
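To give a rough idea, here's a sketch of what a manual black hole might look like on a self-managed control plane node, using iptables to drop kube-apiserver traffic. The specific rules are an assumption for illustration (6443 is the default API server port, but your setup may differ), and chaos engineering tools wrap this up with automatic rollback, which is the safer route.

```bash
# Rough sketch: black-hole the API server on a self-managed control plane node
# by dropping its traffic instead of shutting the node down.
sudo iptables -A INPUT  -p tcp --dport 6443 -j DROP
sudo iptables -A OUTPUT -p tcp --sport 6443 -j DROP

# Observe: does kubectl still respond? Do running workloads keep serving traffic?
# Do the alerts you expected actually fire?

# Restore by removing the same rules.
sudo iptables -D INPUT  -p tcp --dport 6443 -j DROP
sudo iptables -D OUTPUT -p tcp --sport 6443 -j DROP
```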
It's much easier to just re-enable the network than to bring back a node, re-add it to the cluster, et cetera. As before, once you run the experiment, analyze what happened and dig into that hypothesis. Anything that didn't go according to plan, use that to improve your systems. If you are running your own cluster and you don't have redundancy on your control plane, you need to fix this immediately. This experiment will expose a lot of the issues that you're going to run into. And then, as before, iterate and share what you've learned.

So for a third experiment, this is one that is a little bit more advanced. I actually grew up in California, and there was always this joke about the big one, the impending massive earthquake that would split off California, we'd all fall into the ocean, and suddenly Nevada would be oceanfront property. If you're running in AWS's US West 1 region, this is a huge disaster for you. Similarly for Azure's primary western region, or Google's. Although if this happens to Google, it also takes out Google's headquarters, so you won't be billed anymore. So it's probably okay. But seriously, this can be a problem. All of the major cloud providers have had zonal and regional outages in the past year. We often hear of US East 1 being the most common one because a lot of us use that. So as we think of this, again, what's your observation? This should be easy. You should know what running normally looks like. But what's your hypothesis? Well, for availability zone failures, you should be running multi-zone clusters. All of the cloud providers make this relatively easy, and it means that if a zone fails, you at least have some nodes still running. So this might look similar to our first experiment, just scaled up: you'll lose more than one node if you have a larger cluster. For regions, though, this tends to be a bit harder. I noticed that someone mentioned their interest in disaster recovery processes as one of the reasons they joined this webinar. Well, this is where you start to test that. None of the major cloud providers have multi-region Kubernetes clusters. It's a very difficult thing to do, so I'm not saying that any of them are at fault. But what it means is that you're not just using chaos engineering to test technical systems; you have to start testing your processes as well. You have to think about what alerts you will receive. And not just the alerts, the technical side, but how quickly can your on-call team respond? Is there documentation or a disaster response plan for failing over to another region? You also have to start thinking about your SLA or recovery time objective, and whether you can meet those, because there will be a human factor involved. Testing this becomes much more complex. It typically involves configuring a black hole attack against multiple services to block specific zones or regions. Oftentimes we'll use tags to do this. Generally, all of the cloud providers add tags so you know what services or nodes are running in what regions, and you can block that network activity to simulate the outage. But it's going to be very specific to your particular setup. And this is where I'll reiterate my warning about safety and starting small. Don't start here. If you're new to Kubernetes or to chaos engineering, do not do this. But I mention it because ultimately you'll want to build towards this. If you're starting to get here and need help, reach out to me. Again, my email is always open.
I'm happy to provide you with advice specific to your situation. Also, here at Gremlin, we've got a fantastic team of solutions engineers who help folks with this all the time. But when you're testing at this level, you want to collect even more data about your processes. It's important to run a post-mortem just like you would for a real incident, and that'll help you collect the information about your incident response processes and documentation. It's also important to communicate and share what you've learned. Large-scale chaos engineering like this will typically affect multiple teams, if not your entire engineering organization.

So those are three back end chaos engineering experiments that I think are important to run: starting with that node failure, testing your control plane, and eventually building up to zone and regional failure. Some other things that you might want to look into: there's a project within Kubernetes called the node problem detector. I think a lot of people have missed that this tool is there, but it's a tool that you can deploy to your cluster that will help you monitor your nodes for problems. I think you should take a look at that. It's in the GitHub for Kubernetes: go to github.com/kubernetes and look for node-problem-detector. Another thing that I think people should also test for is cluster autoscaling, especially if you're on a cloud provider. Figuring out what resources you're monitoring and how you're handling that scale-up is important. If your cluster is low on resources, it's going to have a hard time scheduling containers. So you should have a way to detect that and add new worker nodes to the cluster automatically. Chaos engineering tools can easily consume resources and help you trigger those scaling events or validate them.

So let's talk about the three front end chaos engineering experiments. What should we test if we're developers deploying to Kubernetes? The first one is similar to the back end experiment, right? On the back end, you should take down a node. On the front end, you should test your pods. At Datadog, this was always chaos engineering experiment number one. Every service had to pass this. What would happen if your pod crashed? So again, your observation should be easy. You know how your pods behave normally. The main part of your hypothesis should be apparent: if your pod crashes, Kubernetes should restart it. But similar to our other experiments, this is a good opportunity to get more in depth, right? How long will it take? With Kubernetes, we sometimes think it's magic and that it's instantaneous, but it can take some time depending on what service you're running. You might be surprised at how long it can take for larger containers to go from a created to a ready state. But you also need to think about what happens to user traffic or user sessions. Do they go to other replicas? Or if you're only running one instance of a pod, what happens? What about your monitoring? So testing here is super easy. I think everybody here knows kubectl delete pod. We've probably all run it a million times. That's a great way to start. It's a graceful deletion, so your pod will know that it's going to be deleted. But another option here would be to find the process ID for the service in that container and kill that process. This is less graceful. It'll give you a better sense of what would happen in an actual incident, right? When an incident happens, oftentimes there isn't a graceful shutdown. The service just stops. So try killing that process.
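As a quick sketch of both approaches, here's roughly what that can look like with kubectl. The pod name and label are hypothetical placeholders, the process-kill variant assumes the image includes ps, and if your service runs as PID 1 inside the container, signals sent from within may be ignored, so results vary by image.

```bash
# Graceful: delete a pod and watch how long the replacement takes to become Ready.
kubectl delete pod my-app-5d9c7b6f4-abcde
kubectl get pods -l app=my-app -w

# Less graceful: kill the service process inside the container so there's no
# clean shutdown. Find its PID first, then send it SIGKILL.
kubectl exec my-app-5d9c7b6f4-abcde -- ps aux
kubectl exec my-app-5d9c7b6f4-abcde -- kill -9 <pid>
```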
Once you've found and killed that process, analyze what happened. Some things to pay attention to: if your pod has affinities or taints applied, meaning it has to be scheduled on the same node as, or a different node from, other pods, how does this affect the replacement pod? When the pod restarts, does it reconnect to other services, to storage, to all the things it needs? If it maintains a persistent connection to other services, how do those services handle this? For example, one thing that I've seen is databases that hit a connection or connected-client limit with pods. When pods connect and die multiple times, they'll eat up that connection limit, and suddenly it looks like a denial of service on your database because it's refusing new connections. So there are a lot of things to check out here.

The second experiment that I think people should run is really an iteration on that first one. If you're starting to test pod restarts or pod crashes, you have to ask yourself what would cause your pod to crash. Now, there are obvious internal issues like errors in code, but pods can crash for a number of reasons, such as a lack of resources. And one common reason is network issues. Often services will throw errors when they encounter network issues, and so they may trigger timeouts. For most pods, hopefully you've set up readiness and liveness probes that specify the limits around response times and latency. So introducing network issues such as latency or dropped packets is a good thing to test for. Some common things to consider in your hypothesis: how much latency can your service handle? We commonly use this to test dependencies between services. How does your service respond when it can't connect to a dependency or has a poor connection? We also want to test if and when timeouts kick in, and whether these align with our expectations. Oftentimes this shows up as mismatches. If your service is set to time out after 30 seconds, but the service that's calling it only allows for 10 seconds, that starts to create a problem. So you want to work with other service teams to bring these timeouts into alignment. Now, there are a few ways to test for this. The common way that a lot of folks use is just to adjust their iptables rules. iptables has a number of modules, such as limit. The limit module will help you adjust that rate and slow down traffic. However, one of the problems with this is that other tools commonly use iptables, so this can cause issues, especially if you're using a service mesh like Istio with Envoy sidecars, which will often use iptables to route traffic through Envoy. So if you're using a service mesh, you'll have to use other ways. The other common way, and the way that we use at Gremlin, is to use tc, the Linux traffic control utility. It interacts with the Linux kernel's networking stack, which is a bit more robust and tends not to cause issues with services that use iptables.

The third experiment is around horizontal pod autoscaling. One of the back end experiments I suggested earlier was cluster autoscaling to add more infrastructure resources. HPA is pod autoscaling. In Kubernetes, you can tell it to check for a specific metric, whatever metric you want that to be, and then scale up the number of pod replicas as appropriate. That metric could be anything, like CPU, or the number of connections to a web server, or jobs in a queue. But your hypothesis should account for things similar to the infrastructure side, like how long scaling up should take.
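As a small illustration, here's one way to set up and watch an HPA with kubectl. The deployment name and thresholds are placeholders, and this assumes a metrics source (such as the metrics server) is available in your cluster.

```bash
# Scale a hypothetical "my-service" deployment between 3 and 10 replicas
# based on average CPU utilization.
kubectl autoscale deployment my-service --cpu-percent=70 --min=3 --max=10

# Drive load (with a load generator or a CPU-consuming chaos attack) and watch
# how long scale-up actually takes.
kubectl get hpa my-service -w
```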
If it's too slow, it might not scale in time and your system could be overwhelmed. So testing that is extremely important. And ultimately, you'll want to ensure that your service can scale to handle the limits you need. Case in point, Coinbase, right? If you watched the Super Bowl, if you're in the U.S., Super Bowl Sunday was this past Sunday. Coinbase had an ad that went out to an estimated 100 million people, and everyone went to their website to sign up, and it crashed their app because they couldn't scale. So just to recap the three front end experiments: pod failure; network issues and dependency testing; and HPA. Some other things that you should test for: simulate crash loops by repeatedly killing pods. You probably don't care if a pod dies and Kubernetes restarts it. That's what it should do. It should be kind of invisible. But Kubernetes has a habit where, if it restarts a pod and it dies again, it'll try to restart it again, and it'll die again, and it'll get into a loop. Then Kubernetes will notice that and back off. So you want to pay attention to that. The other test that you can do is to test for right-sizing your resource limits. If each replica of your pod can handle 100 concurrent requests, and each request consumes 10 megabytes of memory, that's about a gigabyte of memory at full load; if you've only set the pod resource limits to 512 megabytes, you suddenly see a mismatch. So using chaos engineering to consume resources and test things like this is a good way to start. Ultimately, though, these are all manual, and just like anything else in engineering, you want to automate. For most of these experiments, once you've iterated and your systems return consistent results, you want to find a tool that you can integrate into your CI/CD platform so that you can automate the test and ensure that your services or your infrastructure are consistently returning those results and not regressing.

So with that, we're coming up on time, so I'll end with this. There are a number of resources out there. I hope I've given you enough inspiration to add chaos engineering to your reliability work. Some additional resources to help you do that: we've written a blog post about four chaos engineering experiments that you can start with. This one is general, so if you're on this webinar and you're just starting to adopt Kubernetes and not quite there yet, it has experiments that you can start with and use as you're migrating to or adopting Kubernetes. My colleague Andre wrote a guide on what he thinks are the five experiments to start with for Kubernetes. Some of them are the same as what I've covered; some of them are different, from his engineering background and things that he's seen. So that link is there as well. I definitely recommend that you check that out. And then finally, we've launched a certification course. There are currently two levels, so you can learn more about chaos engineering and get certified if that's something that you're interested in.

So with that, I'm open to answering questions. Also, again, if you think of any questions after the webinar has ended, feel free to email me. I'm always happy to chat with you. It looks like Marisa has pasted those links into the chat so you can easily get those. And with that, if you've got any questions, feel free to post those in the Q&A panel in Zoom. Also, taking a look at the chat, there is a question: is the certification Gremlin-specific? The first level of certification isn't. It's just general chaos engineering.
It should be fairly easy to pass that one because it is general chaos engineering. You might want to take a little bit of time to brush up on chaos engineering fundamentals, so that's there. The second level of the certification starts to get into a little bit more of Gremlin's platform specifically, how to run some of those experiments, and how to do so in a safe way. A few of the other questions: is there a recording? Yes, we are recording this. I believe that my colleagues at Gremlin will be sending out an email later, and that'll have links back. But everything will get posted to the Linux Foundation website, so you can check out the recording there. All right, a few other questions coming in. Tools, there are a few questions about tools. One is just general tool recommendations, and another is about tools for automating the chaos engineering workflow. So tools: there are a lot of tools out there, right? Within the CNCF, there's Chaos Mesh and Litmus Chaos. There's obviously Gremlin, the company that I work for. There are a lot of tools out there. As much as I'd like to say go use Gremlin, because obviously they pay my salary, I started chaos engineering just manually. Again, it's pretty easy to kill a pod or to take down a node manually, so that's probably a good place to start without really worrying about adopting tools. But as you get more advanced, as you start to adopt this as a practice, getting a tool that helps you with safety is important. Tools like Gremlin monitor what's going on as you run the attack. We default to safety, so if anything goes wrong and deviates, our client will notice it and roll back and stop the attack. We also have a big red halt button in our platform. So as you're adopting tools, look for those safety features in addition to the functionality of launching experiments or running these attacks. In terms of adding it to your CI/CD platform, ensure that whatever tool you're looking at is API-driven, right? That there are programmatic ways of integrating with it. Similarly for Gremlin, we are API-driven. Actually, our front end for the platform is really just a nice overlay on the API; you can do everything with our API. So you'll want to look for that with whatever tool you're evaluating.

Let me see some other questions. One from Sergey: why is random crashing of resources a bad way of testing? Random is bad because it's not controlled. There are two parts to this, right? One is safety. If it's random, it makes it really hard, when something comes up and something goes wrong, to actually debug that. The second reason is that the goal of this is learning. One of the original things that Netflix did, Chaos Monkey, was sort of random, right? Chaos Monkey works with Spinnaker, and generally the idea is that you deploy a service, and when you deploy it, you mark it as Chaos Monkey ready. That means that Chaos Monkey, within a certain defined window, would be able to randomly take that service down. That only works if you know, or if you're confident, that the service will restart, keep running, and deliver the user experience that you expect, or, if you've adopted SLOs, that it will be able to maintain those SLOs. Most companies aren't there yet. When you're first starting, you want to be very, very intentional. You need to practice before you get to the point where things can just run on their own. So that's why we say to start manually.
And then, again, I like to think of it more as a test to roll into the CI/CD process rather than something that's constantly running and taking things down. Because, again, when you're running randomly, you don't have that much control. And when it's random, you're not doing it to learn; you're doing it just to ensure that things are working. That can be a problem, because if it randomly takes something down and causes a problem, you don't have that visibility. You don't have an intention to learn. All right. A few other questions. Are we sharing the presentation? Yes, again, this is recorded. It will be on the Linux Foundation website, and we'll send out an email from Gremlin with the links once that is up. All right. So I think that's the last question that I'm seeing. Oh, one more. A question that came to mind for Arnout: are there people looking at machine learning algorithms around chaos engineering? That's a great question. I haven't heard of anybody looking at machine learning. I think right now a lot of what people are doing is really just trying to set this up as a practice, and it's still fairly early for most organizations. Machine learning would be great, although I often have doubts about machine learning because it depends on how you train them and what models you use. But I think ultimately that's where we want to go with all of this reliability work. As we monitor systems, as we get data about how our systems should be operating, it should become easier to use that data, automatically inject failures, and compare the two. And hopefully in the future, thinking like five, ten years from now, it'd be great if SRE wasn't really a job. If we could just deploy things, make a definition, have a computer spin it up for us and automatically monitor things for when they go wrong and automatically fix them. But I think we're a long way from that. So my faith is less in machine learning and more in all of you as engineers, because you're all smart. You're all getting smarter and building better systems. One more question: Camilo asks whether it makes sense to run chaos engineering in smaller deployments. Yes; per what I said about safety, start small. Start on simple services if you're just getting into it. That's always the best way, so that you know you have a better handle on things. All right. I think that is the last of the questions that I'm seeing. Again, if you do have any other questions, feel free to email me. I'm always happy to answer those questions. So with that, I'm going to turn it back over to Marisa to close us out.

Awesome. Thank you so much, Jason, for your time today. Thank you, everyone, for joining us for this awesome webinar. As a quick reminder, I know Jason's mentioned it a few times, but this recording will be up on the Linux Foundation's YouTube page later today. That will close it out. We hope you'll join us for future webinars. Have a wonderful day.