Hello everyone, welcome to Cloud Native Live, where we dive into the code behind Cloud Native. I'm Annie, and I'm a CNCF ambassador, as well as a senior product marketing manager at Camunda. And I will be your host tonight. So every week, we bring a new set of presenters to showcase how to work with Cloud Native technologies. They will build things, they will break things, and they will answer all of your questions. So join us every Wednesday to watch live. And this week, we have two amazing speakers here with us to talk about using Litmus, ChaosEngine, and a microservices demo app to demonstrate automated RCA. And as always, this is an official live stream of CNCF. And as such, it is subject to the CNCF Code of Conduct. So please do not add anything to the chat or questions that would be in violation of that Code of Conduct. Basically, please be respectful of all of your fellow participants as well as presenters. So I'll hand it over to our speakers to kick off today's presentation. Hey, everybody. My name is Seamus, and I'm joined today by Brayden. And we're both DevOps engineers for Zebrium. So we're going to talk today a little bit about Litmus, which is a Cloud Native open source chaos engineering framework. And we want to talk a little bit about what that means and what we're doing with it. So what we're doing with it is actually a little bit unique. So at Zebrium, we've built a product that analyzes logs to find the root causes of issues. So being able to cause issues on demand is just absolutely invaluable for us. So when we're validating and demonstrating our product, we need to be able to create problems within our Kubernetes demonstration clusters. We generally don't have access to customer or prospective-customer environments. So it's important for us to be able to do this ourselves on demand.
And Litmus provides on-demand chaos by simulating issues that can occur in environments: bad configurations, heavy infrastructure loads, rainy-day scenarios, just any stability-threatening issue you can think of. Really, the only limit is imagination. So a quick 1,000-foot view: what exactly is chaos, and what is chaos engineering in this context? There are many different testing methodologies available in the world right now. The thing that they all have in common is that they all have blind spots. And the problem is that the blind spots can overlap, and then you can have really bad, unpredictable behavior sometimes. And if the first time you find out about a resiliency issue is when a customer reports it at 3 o'clock in the morning, that's an ops fail. That's bad, and we don't want that to happen. Chaos engineering allows you to create these doomsday scenarios in a more controlled environment, as a way to test resilience before bad things occur in the wild. Litmus is by far the best cloud-native framework for introducing chaos that we've found. We've done some extensive searching, and we've actually developed some stuff on our own. And Litmus is just by far our favorite tool for it. So what exactly is Litmus? It's a framework for conducting chaos experiments: individual little units of bad things that can happen. This is done in a declarative way via experiment templates. And experiments can be orchestrated into chaos scenarios, which can include things like chained experiments. You can run experiments in parallel or sequentially, you can set up and tear down experiment resources, and you can even deploy entire environments as part of a chaos scenario. It's an extremely versatile platform. So Litmus was originally accepted into the CNCF Sandbox in 2020 and actually just moved into incubation. So huge congratulations to them for that. That's a big step up, and we're really happy for them. So what kind of experiments are available right now?
There's a fantastic library available at hub.litmuschaos.io. There are currently 58 off-the-shelf experiments that require minimal configuration. They're pretty much just drag-and-drop and presto, you receive chaos. They're available for a wide variety of platforms, including things like Kubernetes, AWS, Azure, and VMware. If you use a major Kubernetes platform, there are compatible experiments waiting for you. So I'm gonna hand things over to Brayden now. And Brayden's gonna give us a live demo of setting up Litmus in one of our demonstration environments, configuring it, running some experiments, and seeing what happens. Yeah, so thanks, Seamus. Let me go ahead and share my screen. Which one is the one I want? This one. Cool. And hold on, let me find the right window. There it is. All right, so what we're gonna run through real quick is the Litmus install directions. We're gonna spin up the Litmus cluster. We're gonna access Chaos Center, their UI. We're gonna run their default test scenario, and then we're gonna connect it to one of our live apps and actually break some stuff. So the first thing to know is that Litmus offers an install through kubectl with an applied YAML, or through Helm. I'm just gonna use the Helm one. So the first thing you do is add the Litmus Helm repo and then make sure it added. Yep, it's in there somewhere. Yeah, I have a lot of repos. And then let's do an install command. So we're just gonna install the basic configs that come with Litmus. This command is slightly different from the one in the instructions: I'm grouping a couple of steps together to basically create the namespace and upgrade if it doesn't already exist. And apparently, I- Could we have a bit more zoom on the terminal? So that we can see a little better. Give me one second. How's that? It's a little better. Can we hit that one more time? Yeah, a little more. There we go.
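The repo-add and namespace-creating upgrade that Brayden runs can be sketched roughly like this. The repo URL and chart name follow the Litmus Helm docs; the release name and namespace are our assumptions, not taken from the demo:

```shell
# Add the Litmus Helm repo and verify it shows up
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo list

# One command that creates the namespace if missing and
# installs or upgrades Chaos Center into it
helm upgrade --install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace
```

The `--install` flag is what lets a single `upgrade` command cover both the first install and later upgrades, which is the step-grouping Brayden mentions.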
Hey, that's legible now. Okay. So going back and forth: the first one was just adding the repo, then listing the repos. And now we're running the upgrade command, except I forgot to actually add the repo. So let me do that real quick. Okay, it's litmuschaos/litmus. Cool. So I'm using Lens as a kind of web UI to front our cluster. It's a little easier than trying to remember and type 5,000 kubectl commands. So as we see, it's kind of going through and applying it right now. So I'll wait a little bit for that to finish. I don't think we have any questions at the moment. Apparently I'm getting a question. There's a question from someone: will this be recorded and be available? Yes, it will be. It will be available on YouTube, on the CNCF YouTube, pretty much immediately after this live ends. So you can tune in to watch it there. Why, no, don't stop it. Why are you upgrading? No, no, no, no. You have to update Outlook now. Update Outlook now. It doesn't matter about the presentation. Apparently you close Outlook once and it wants the update. All right, so it looks like it's fully installed. The next step: when you deploy this out of the box, as you can see if I go look at the services. And apparently now Alexa's going off. Yeah, so when you deploy it straight out of the box, if you look here, it just does NodePort and ClusterIP services. So there are instructions for exposing it, either by editing your NodePort, creating a load balancer, or putting an ingress object on it. I'm not gonna dive into that. I'm gonna cheat. Lens does a great thing where you can do an internal kube proxy and just proxy from the cluster to local. So that's what I'm gonna do, just to get straight to the login. Much faster. Yeah. Much faster, and I don't have to start messing with SSL certs or anything like that. Admin, litmus, I think, right? Yes, it is. If I can type this, let's see. Maybe you could. No, I don't wanna add you to the LastPass.
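If you're not using Lens, a plain kubectl port-forward gets you the same shortcut without touching NodePorts or load balancers. The service name and port here are assumptions based on the Litmus 2.x Helm chart; check the actual names in your cluster first:

```shell
# See what the chart actually created
kubectl get svc -n litmus

# Forward the Chaos Center frontend to localhost instead of
# exposing a NodePort, load balancer, or ingress
kubectl port-forward -n litmus svc/litmusportal-frontend-service 9091:9091
# then open http://localhost:9091 (default credentials: admin / litmus)
```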
I don't wanna add you to Google, and we're not gonna change the password for now. Cool. So just like that, we have it stood up. We have the intro UI. And now we're kind of ready to rock and roll. So a couple of things to walk through here. Chaos delegates: you can see this one's pending. What we installed was just what they call Chaos Center, which is the UI. It's kind of the command and control center. So in the UI you can specify scenarios, you can download from the hub, do analytical stuff. The actual real meat, the bread and butter of how this works, is installing the Self-Agent, which, okay, there we go. So what this is, think of it as a runner. The idea being that you can install different chaos delegates inside different clusters, so that the UI doesn't have to be inside the cluster where the chaos is actually gonna run. So that way we can do it. When you first open up the UI and sign in for the first time, it actually installs the Self-Agent for the cluster that it's running on itself, which is what you see here. If we were to hop back into this, which is here, we can see that, let me close this. We can see that there are three or four new pods, and those are the ones that just installed. This is all part of the chaos operator; some of them are the subscriber pods; this is all that Self-Agent that just installed in here. So let's actually break something. The first scenario we're gonna run through uses their demo app. Their demo app is called podtato-head. It's actually a funny spin on Mr. Potato Head. It's pretty funny. Yeah, so you have Mr. Potato here. Potato head, sorry, my bad. Mr. Potato Head. Yeah, so we're gonna run through. We're gonna leave this the same. Yeah, and so we have kind of the sequence of what's gonna happen. Since this is their predefined template, it's gonna actually install the podtato-head application. It's gonna install the chaos experiment that we're gonna run.
The chaos experiment for this one is pod-delete. So it's a pod kill. As you can see right here, it deletes the pod. And then once that completes successfully, we are going to revert and uninstall the chaos resources, as well as delete the application that it installed here. These workflows are customizable. It's all YAML based, so you can upload YAML. I believe they also have an API, so you can actually just apply it directly rather than having to go through the UI steps. We'll circle back on what the weights do in a second. And so yeah, within a couple of clicks, we have our first thing. We wanna go ahead and schedule it now. It's just gonna ask us to verify everything. It all looks good to me. So let's hit finish and let's go to the scenarios. Now we see it's running. You can see the experiment. The main experiment is a pod delete. And yeah, so now it's just gonna sit here. And if we go back into Lens, you can see that Mr. Potato Head actually went through and is spinning up a couple of containers. You got one for the head, hats, an arm, left leg, main body, right arm, right leg. I don't actually know what this does. It's basically hello world for Kubernetes, essentially. It just has a whole bunch of containers that talk to each other and pop up a picture of Mr. Potato Head when you go to the front end. Yeah, that's what it looks like it does. Oh really? Yeah. We can go look at that. I actually like it. Let's do that. But it's really fun because there was nothing you had to do. This was 100% set up by Chaos Center. It was just drag and drop. We didn't deploy this. It's just a self-contained Kubernetes application that spat itself into existence, and it's about to delete itself. Yeah. So yeah, let's see. So if we click onto here, we can see where we are. So we're in the middle of running the actual pod delete command.
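Applying the same thing directly, as Brayden mentions, looks roughly like this: a ChaosEngine manifest wiring the pod-delete experiment to a target app. The engine name, namespace, labels, and service account here are illustrative assumptions, not values from the demo:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: podtato-chaos          # hypothetical engine name
  namespace: litmus
spec:
  engineState: active
  appinfo:
    appns: podtato             # namespace of the target app (assumed)
    applabel: app=podtato-head # label selector for the pods to kill
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION  # how long chaos runs, in seconds
              value: "30"
            - name: CHAOS_INTERVAL        # delay between successive pod kills
              value: "10"
```

A `kubectl apply -f` of a manifest like this triggers the same experiment the UI wizard schedules.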
The one thing about chaos experiments, and I can see if I can find it in here. Oh, here it is. So every experiment gets spun up as a job. And that job does whatever the experiment does. So as you can see, this is a pod delete. It has a target argument. That target argument is probably one of these pods that just got deleted by this one. And so what this job will do, so you can see it terminated one pod and it's spinning up another now. So it's just kind of a little job that runs in there. They can be really complex, or they can be as simple as this one was, where it's, hey, we're gonna delete a container and we're gonna delete a pod and see if the pod comes up. As you can see, because of the way this app is designed, our hello server is still available. Oh, no, it's not. I lied. Oh, never mind. Scratch that. I was proxying from the container that died. So as soon as it died, my proxy connection died. If we had a load balancer around that, it would have been available. Yeah, so we're kind of just waiting for this to finish. But setting up a load balancer involves work. I don't like to do that stuff. Yeah, something, something, Bill Gates hires the lazy engineer, something like that. Yeah, I've heard that a few times. I'm not sure if that's actually true, but I have heard that. I am 100% at times a lazy engineer. So the less work I have to do that I'm gonna scrap later, the better. Oh yeah, especially volatile work that is only going to exist for a brief moment and then just gets vaporized. Yup. Yeah, so we're just gonna sit here and watch spinning circles. Anybody have any questions? I could try singing, but nobody wants to hear that. I do, I'll pay money for that, let's go. No, we're not. Cool, all right, so it ran and now we're doing cleanups. I know this is kind of the cheesy side, but does it help with declarative configuration? Well, so what we're seeing here is actually the execution of declarative configuration.
So, we didn't configure anything. This is all just completely off the shelf. This is just default behavior that comes with the Chaos Center installation. But yeah, every individual step, this was, yeah, there you go, there's the manifest. So it does actually provide manifest files. You can go through and write a declarative manifest file and apply it directly like that. That will work. We're just doing it through the UI because I didn't write manifest files. Does it support? Yes, sorry, I see your clarification there. And then Mark asked: yes, it does allow you to target and destroy specific namespaces. We're actually about to do that. Yeah, I thought that was a good one. I thought maybe it didn't support everything, and I was gonna be like, yeah, sure. Yeah, no, it does. All right, so we ran it. You can see the experiment. The experiment passed; everything passed. So let's do some stuff where it doesn't. The first thing I'm gonna hook up, the other thing this allows you to do is tie into a Prometheus data source. I'm hearing myself. Okay. So we have Prometheus installed in all of our clusters. If I can type correctly. Mainly because I'm also too lazy to spin up my own Prometheus. So I'm gonna use the one that's right there. Do you have the option to enable login authentication? Yes, you do. I just didn't do it. It would be good to not allow just any arbitrary person on the internet to run chaos experiments on your production stuff. Yeah. So they actually have, let me show something quick, hold on. They actually have the ability inside the settings to go into user management. You can add users for authentication: create new users, login details, usernames, passwords and all that. You can also, as part of the install directions, do OAuth authentication. Yeah, that was actually an extra part of the question. So something like AWS Cognito, Google, Apple and so on. I believe so; you can do that with OAuth.
I'm not 100% sure, we haven't set that up or gone in that far yet, but yes, I do believe it does support SSO OAuth. To also answer the other question about declarative: you can set it up for GitOps as well. I think, did we answer Mark's first question about targeting namespaces? You can, if we didn't answer that already, you can target specific namespaces. Yeah, I answered that with a "we're about to do that." Fair enough, fair enough. I think that's everything we're caught up on so far. Did I answer your question, Mark, about the OAuth and the authentication? I believe you can. Best answer I got is to check the docs. I know there's a section in there; we haven't done it yet. I haven't personally done it yet, but I think so. This is true, Jonathan. This is very true. This is true. We're not gonna play with production right now, though. Why? It's so much more entertaining. Because I don't wanna get paged. That's the main reason. I don't wanna get paged. All right, so we're actually gonna play with the pseudo-production. So, one of our demo apps that we have installed, and it's been on here for a while. We're gonna go to the Sock Shop. Yeah, everyone's familiar with this, I think. If not, we're gonna walk through the UI real quick after I find my UI. So, Weaveworks' Sock Shop. It's another open source project. It's one of the big microservices demo applications. I think the two big ones are Weaveworks' Sock Shop and Google's Online Boutique app. We like Sock Shop better. Why? I just like socks. Because, yeah, socks are cool. And it has more. It has more applications and several different database layers on the back side too. But yeah, so it's a fully functioning, basically, marketplace store. So you can go in, buy socks. We can buy Seamus some more socks, because, I appreciate it bud, I appreciate it. Welcome, yeah.
Yeah, so it has like a full catalog. You can go see colorful socks, non-colorful socks, super soft, super sporty socks. And I can't forget: I want some cat socks, please. Okay, I'll buy you two. Thanks, thanks. You know, it has a full working cart. If we, you know, save these. Oh, we're missing shipping and payment at the moment. Seamus, can I get a credit card so we can? Yeah, yeah, absolutely. It just starts with a 4. Yeah, I remember the rest of it off the top of my head. Oh, okay. Yeah, so it's a full app. So now that we've shown it off, let's break this thing. Because breaking is fun. That's the wrong tab. All right, oh, the other thing to note, we do have, if I switch to the right namespace. We do have a load generator on this site that's been running for like 15 days. We just kind of permanently keep it running. So there is load, and we're gonna see some fun, cool things in the graphs, hopefully. This is all the Grafana and Prometheus stack. So we'll see some fun stuff. What do we wanna name this thing? Sock Breaker. Yeah, it's called the Sock Breaker. Break those socks. The Breaker of Socks. Bonus points if anybody got that. So as you can see, there's a lot of different experiments. Actually, let me back this up. I'm doing things out of order. So ChaosHub, Seamus talked about it earlier, is where Litmus stores all of the experiments they've written. So there are some pretty fun predefined scenarios. We don't care about those right now; that's what we're writing ourselves. So, chaos experiments. They currently have 58. I believe you have the ability to add your own repository, and they have instructions for how you create your own scenarios. Yep, I like to call it 58, but it's also 58 times infinity, because they're all infinitely customizable. So yeah. So they have a little bit of AWS SSM, they have some Azure stuff, some CoreDNS stuff, some GCP stuff, some generic stuff. So we'll play with the generic and the generic pod stuff.
If anybody has anything they actually want us to run, feel free to drop it in the chat, and if I see it, we'll run it. I'm gonna stay here. I have network corruption. Yeah, we'll do that one in a second. I'm gonna stay away from the kube-aws stuff because, like I said, I don't want to get paged. I don't know if I get paged for this cluster. If you are, we should turn that off. Yeah, we should. Yeah. Back burner. There's OpenEBS stuff. So we're gonna do network corruption to start with, and then we'll do some fun stuff. If anybody has any suggestions or just wants to say anything. I keep hitting the wrong tab. All right, so let's go back and add that item. That's the one I want. We want to use the Self-Agent, because I haven't installed this instance on anything else. We'll choose experiments. I think we called this thing Sock Breaker. Breaker of Socks. However I spell it, it doesn't really matter. All right, so you want a network corruption. I love network corruption. That's network corruption. I know, so. Where'd my mouse go? I lost my mouse. There it is. Come back. It's trying to run away. Did it add it? It's thinking about it. Let's do it again. All right. Oh, I didn't click on it. I don't think it's done. There we go. All right. So we've added it, and as you can see, it kind of defaults to app=nginx. So let's edit it and change some stuff. That part's generic and we're not gonna mess with it; leave it default. So this is where you were asking, can you target stuff? The answer is yes. There's two different ways to install Litmus out of the instructions. There's a, I think there might not be a DNS poison. There's DNS spoof though. So that's kind of the same, but maybe not. It's close-ish. But there's two ways to install this: either cluster-wide or with namespace-wide scoping. I did this cluster-wide, so you can see all of our namespaces. But we're gonna target sock-shop. We're gonna target an app.
So it targets based on labels. And let's do, let's do the carts. That's fine. We'll crash a DB next. This actually will end up taking out the network on the node itself. This is only a one-node cluster, so I should take out everything, and you'll see that. So for this one it doesn't really matter which one you target. It makes the best crater in the graph though, for Prometheus. This is true. So you can add probes. What can probes do? I'm not gonna bother looking up the endpoints for this. But yeah, you can add a probe, probe names. This one does an HTTP endpoint. You can do HTTP, command, k8s, or Prometheus probes. Those are all the kinds you can do. You can give a timeout period, retry period. And this will actually probe the endpoints of your application. And so this is where the weighting comes in. So this basically says, hey, is this thing up and is this thing healthy? And if the thing is up the entire time, the experiment is considered successful. If it's not up and the probe fails, then the test is considered failed and you're not resilient. That's where, if we look at this next section, let's find. Do you have to hit next again? I know, I'm gonna go clean up the pod. Yeah, you're going for the number weight thing, right? Yeah, this thing. Yeah, so the weight thing. So basically, if I were to schedule like five or six different of these things, let's say I do a pod network corruption, I take out an AWS node and, oh, by the way, this is running on EKS. So that's why I keep referring to AWS. Say I take out a node and then I do a memory or CPU load test. I can weight the different tests accordingly, each on a one-to-10-point system. So let's say I don't really care about network corruption, so I can rate it a four. But if I were to go in and do a node kill, which is something that's much more likely to happen, I can rate that a 10. It will actually hit the endpoints. It will test everything.
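A probe definition of the kind being described might look like this inside the experiment spec. The probe name and URL are made up for illustration; the field names follow the Litmus 2.x probe schema as we understand it:

```yaml
probe:
  - name: front-end-availability      # hypothetical probe name
    type: httpProbe
    mode: Continuous                  # keep checking throughout the chaos
    httpProbe/inputs:
      url: http://front-end.sock-shop.svc.cluster.local
      method:
        get:
          criteria: ==
          responseCode: "200"         # healthy means HTTP 200 the whole time
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 2
```

If the probe criteria hold for the whole chaos duration, the experiment passes; a failed probe marks the experiment failed, which is what feeds the weighting below.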
And basically, if it succeeds and it doesn't go down at all, you get those percentage points calculated into the resilience score. And this is part of the really cool stuff you can do as far as CI/CD: you can actually integrate Chaos Center with your GitOps so that every time you're updating things, if you make changes, you can run resiliency tests automatically to see, okay, numerically, what is our score, what's our resiliency like? Like for instance, when we ran Mr. Potato Head, we came back with a perfect score. Okay, what if it wasn't so perfect? What if one of our chaos experiments did actually cause a service disruption? How bad was that service disruption? This lets us surface that at a higher level that management especially is really interested in seeing, because it's a simple integer value: the bigger the number, the happier we are. Yeah, basically. So the chaos workflow won't actually remove the app's own resources when you run it. That thing I did at the end where I said, let's clean up: in this case, it's talking about cleaning up the network-corruption pod that it is running. It's not talking about cleaning up the Sock Shop workload that actually exists. The only reason it cleaned up the application pod in the podtato-head case is because it actually deployed that internally. So if it deploys something, you can then have a step to clean that up. But since we're using a pre-existing workload, the only thing it's gonna clean up is the job that actually ran. And I mean, if you really wanna make life difficult on yourself, you totally can have it set up to delete something that already existed when you ran the chaos experiment on it. It has the flexibility to allow you to do that. I personally would prefer for it not to do that, but that is something you can configure if you want.
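The weighting math itself is simple. A sketch of how a weighted resilience score could be computed from weights and pass/fail results; the weights 4 and 10 mirror the network-corruption and node-kill examples above, and the exact formula Litmus uses may differ:

```shell
# Weighted resilience score: sum(weight * passed) / sum(weight) * 100.
# Experiment 1: network corruption, weight 4, passed.
# Experiment 2: node kill, weight 10, failed.
awk 'BEGIN {
  w[1] = 4;  p[1] = 1
  w[2] = 10; p[2] = 0
  for (i in w) { total += w[i]; earned += w[i] * p[i] }
  printf "%.1f\n", 100 * earned / total
}'
# prints 28.6
```

So passing only the low-weight experiment leaves the score low, which is the point: the tests management cares most about dominate the number.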
And I think I actually got a different aspect to your question too. For the cases where we're doing node kills and stuff, this is where it would probably be beneficial to run Chaos Center in a cluster you're not trying to test, on the off chance that you do nuke the node that it's actually running on. That's where the chaos delegates come into place: it's only a small, tiny subset of pods, and hopefully your environment is running on more than just one AWS node, so it can just get rescheduled. I haven't actually tried doing a node kill, mainly because I don't wanna get paged. But I believe in that instance, it would kill the agent, and the test would just fail, basically being like, hey, we've lost contact with the delegate. And cross-cloud compatibility, which is hard for me to say for some reason, is a really big aspect of what makes Chaos Center so good. It just has stuff baked in to say, okay, I know that I'm running on EKS, I wanna go do some bad stuff to Azure, no problem, out of the box. Yeah, that triple C really threw me for a loop. It's gonna think about life for a few minutes. Hopefully I didn't actually jinx myself and my node died. I don't think so. That would be really funny actually if that happened. Yeah, hold on. Is it doing stuff? Wait, what was that? Was it pending at the top? Pending what? No. Hold on. I don't know what it's doing. Seriously, feel the pain. Sad life. All right, let's try finish again. Hey, now it works. I jinxed myself talking about it. That's what happened. Yep, that's exactly what happened. All right, so now it's running that experiment. It's installing the chaos experiment. So we should be able to see this both in Lens and in Grafana here in a second. Yeah, let me go to Lens real quick. So the one interesting thing to point out is that when you install Litmus and the Chaos Delegate, the workflows will actually run there. So you can see right here, the workflow is running.
The workflow actually runs inside of the namespace that the Litmus Chaos Delegate is running in. So it's not even running inside my sock-shop namespace. The sock-breaker pod just popped up and disappeared. Yeah, it's popping up and running some stuff. So if we go into here and go here, we can watch this for some cool stuff. Hypothetically, we should see, what's going on in the network stuff? That was set to the last hour; let's do the last 30 minutes. Yeah, I was tidying up the... Yeah. Yeah, it does take a second for the CPU to start cratering too. Yeah, well, that and for me, this is on like a 30-, 40-second delay. That's kind of... While we wait for that, can I ask you a question? Sure, go ahead. So how frequently do you use logs when troubleshooting? Oh, constantly, that's an absolute constant thing for us, which is partially the reason why our software exists, why Zebrium exists in the first place. We're trying to alleviate some of the headbanging headache that goes into diagnosing root causes as issues occur. So we have a very powerful artificial intelligence engine that can actually help identify the root causes of your problems as they happen. Oh, there we go, there we go. So yeah, the normal log volume we would have to look at for Sock Shop, over a five-minute range, would probably be about two, two and a half million log lines. And we do not have the time or interest to actually try to look through that many log lines to figure out what exactly went wrong in our environment. So with Zebrium, we can actually pick out only the 30 to 50 log lines that are actually relevant. It's much more user friendly, much more human readable to present information like that. So if we, yeah, can we scroll up to the CPU and RAM? Yep, there we go. Yeah, so all of that drops, because all the network activity dropped, and since this thing's communicating with itself, it all plummets.
Yeah, so we'll just wait for this to finish. And then, we're stuck fetching data. Well, yeah, that actually makes sense, because I just took out the network interface of everything. Yeah, considering what just happened. Yeah, hindsight's 20/20, something like that. Yeah, I know. Yeah. I love it. So once the interface comes back up, which normally takes 60 seconds, it should be fine. It should be coming up. We should just be in the lag of scraping. Yep, yep. That's the fun thing about all the network ones: especially if you're running on one node, it kind of just nukes everything. But the nice thing is the blast radius is confined to the namespace. So we didn't knock out the entire cluster or anything like that. No, we knocked out the entire cluster. Oh, we did. Yeah. Well, I stand corrected then. Because what the network corruption does is actually corrupt, I believe it removes the network interface for Docker on that container, or on that node. Oh, wow. This is a single-node cluster. We knocked out networking for the entire cluster because there's only one node. If we had multiple nodes, it'd be a smaller blast radius. Yeah. Fair enough. Yeah. That's also partly what causes the dip in all the graphs: because everything stops. As you can see here, network blocking. Yep. I like seeing the sharp craters in the graphs. Yeah. Well, let me rephrase that. I like seeing that when it's not production. It's a bad day when somebody accidentally runs this on production. Which has never happened, thank God. Knock on wood. Oh, look, we got a detection too. Yeah. And so, to go full circle on this, full disclosure: this is our own widget that we have installed in Grafana. Kind of what we talked about in the beginning, we use this tool to induce live alerts and stuff. So as you can see here, it grabbed carts-db, says, hey, the master pod has restarted and the kubelet restarted.
I don't know if that's actually right. Seriously, you're gonna make me sign in now. So I've got the credentials for that if you need them. All right, fine. And then there's a new audience question as well. So yeah, I would say it's possible to bake that into a scenario. I would say that it's probably not, let me say, I don't know for sure, but I don't think that's a feature that comes natively. So, actually, yes and no. Let me go back and show you. Let me get the schedules up. Let me get back into the manifest of this. I actually want to go here in the manifest. Everything's time bound. So inside of this massive chunk of YAML somewhere, there is, that's the weight. Yeah, right here, total chaos duration. So everything's time bound in seconds. So for that test, we ran it specifically for 60 seconds. There's also a flag you can set inside. Let's say you apply this with a YAML file directly. You can reapply that YAML file, and there's actually a variable you can set to disable the chaos test, and it'll stop it instantly. Through the UI as well, we can also stop the execution of a test if it actually breaks something hard, as part of a break-glass procedure. From that aspect, everything's time bound. That would be the break-glass: there's a mixture between, I believe you can cancel a test inside of here as it runs, as a hard break-glass stop, or you can just do that from kubectl if it actually gets really out of hand. In the scenario, he said he locked himself out of the target K8s cluster. Right, right, right. If you locked yourself out of the cluster, there's probably bigger issues there too. Oh. Yeah. But what kind of log lines did we get out of that detection, by the way? Yeah, a bunch of errors creating workflows, MongoDB going down. Yeah. Yep.
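That stop flag corresponds, as far as we can tell, to the ChaosEngine's `engineState` field. A sketch of the break-glass flip in the applied manifest (the engine name and namespace below are assumptions):

```yaml
# In the applied ChaosEngine manifest, change engineState and re-apply:
spec:
  engineState: stop   # was "active"; halts the running chaos
```

The same flip can be done imperatively from kubectl, for example `kubectl patch chaosengine sock-breaker -n litmus --type merge -p '{"spec":{"engineState":"stop"}}'`.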
So yeah, the word cloud there on the left, we see that it was the carts service that went down, it's MongoDB. Yeah. Yeah. Not the best one we've gotten out of that, but it's still a pretty good one. It's a lot better than looking at millions of log lines. Yeah. So that's our end-to-end use. I think the last one Mark wanted to see was DNS. Yeah, let's try the DNS one, why not? Yeah, I mean, hey, it's new territory for us, let's find out what happens. We've got 20 minutes to kill, so. Yeah. Honestly, this is one of the things I like so much about Chaos Center: I can just sit in here and play. Like, I've never done something so catastrophically bad that I've not been able to just hit a button to reset everything, but hey, today might be the day, let's find out. Worst case, we just blow it all away. I mean, it's a demo cluster anyway, so. Yep. Okay, it says I don't have any delegates selected. What do you do? Take a deep breath. There we go. Hey. It's a little buggy right now. I think most of that is due to me using kube proxy and it being on a VPN, and the combination of the two is a little fun. If I were to do the work, I'd actually set up a load balancer if I wasn't lazy, go to that and set up an ingress object with the proper annotations for us to spin up an internal ALB for it. It would actually be a lot better, but like I said, I'm lazy. It does work a lot better with an actual network, who woulda thunk it. But now, let's see the DNS ones. Curious, pod DNS error. Spoof. Yeah, I think that's the closest to a poison. You know what? As fun as that sounds, Mark, I don't think I'm gonna do that. Oh, man. I mean, I might as well just hop onto the node and do an rm -rf /. I mean, I've got a kettlebell. We could just throw that at the server. Well, let's do some spoofing. I haven't done that one. Next, let's target the app.
I do not want to target the DB for this, I want to target the front end. You don't want to target Litmus? You know what? It's too early for that. Come on, come on. It's like that thing with IDA Pro: IDA Pro is the most reverse-engineered software of all time, because everyone who downloads IDA Pro immediately uses IDA Pro to reverse engineer IDA Pro. I didn't catch half of what you just said. IDA Pro, IDA Pro, IDA Pro. Pretty much. All right, let's do that, let's do that, I'll schedule it now. Yeah, finish, go to scenario. This'll be fun, because I have no idea what this is actually gonna do. I guess I could have been like, well look, we can see our app. Well, for now. Yeah, for now, we'll see what breaks. How's it going, everybody? So yeah, there are GitOps integrations that I have messed around with. I would assume there is something we can do to interact with Slack. I honestly don't know off the top of my head. Yeah, I don't know either. And full disclosure, us moving to V2 is definitely newer for us. We first grabbed onto Litmus in V1. V2 is when they came and put the UI and the Chaos Center and everything in front of it. V1 was entirely server- and API-based. So what we were really doing was crafting YAML manifest files and doing kubectl applies with a series of files that added the RBAC pieces and all that. This is definitely much easier. And after diving into this a lot, we both have tickets now to go look at implementing this at full scale, all the way through with the UI, because it just makes everyone's life so much easier. We can have sales engineers and such go in and do on-the-fly chaos scheduling without having to know, hey, how do I attach to the cluster, and without me having to get them access to apply stuff to a cluster.
Limit the blast radius and all that. Turns out people actually really like GUIs. That's interesting innovation, man. Fresh out of the 70s. Yeah, so I haven't honestly dove into that, so I can't really say. I believe they have a Slack integration, but again, I don't know. Yeah, using their GitOps flow, you could cobble something together to make Slack channel stuff happen. I just don't know what the logistics of that would look like. So it looks like... it didn't actually do anything. Because it should have effectively been DNS. So maybe, I don't know, it was actually doing something. Not entirely sure. I might need to check Prometheus. Yeah, it doesn't look like it's doing anything. I wonder... So part of it, since we're not doing this on an actual network, I wonder if that... No, I wonder if it's because it's looking for something that's not there for it to actually go through with. That would be my guess. It's looking for kube-dns or something, if I actually go and read the manifest. Like I said, we haven't ever done it, so it's a cool thing to see. We can see Prometheus, that's something interesting. Yeah, yeah. I mean, if I see network stuff in Prometheus, that would tell us. Well, there's a little bit of a shelf, but I don't know. Sort of, but there's no packet drop. There are no receive drops. Yeah, it doesn't look like that actually targeted anything correctly. I probably could have set it up wrong, too. I mean, we could do something like pod delete, something a little bit more innocuous. Yeah, that one didn't do anything. All right, let's go do something a lot more fun. Let's actually break something. Brave. Yeah, go out with a bang. Fair enough, fair enough. Let's see the fireworks. Maybe it actually did something. Ha ha. Watch it turn out that experiment affected Chaos Center more than Sock Shop. If anything, it affects my kube proxy more than anything.
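The pod-delete experiment floated here as the more innocuous option is also one of the simplest to declare. A minimal sketch with assumed names, following the same ChaosEngine shape:

```yaml
# Hypothetical pod-delete ChaosEngine; names and labels are assumptions.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: carts-pod-delete
  namespace: sock-shop
spec:
  engineState: active
  appinfo:
    appns: sock-shop
    applabel: app=carts        # assumed label on the target deployment
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"      # total chaos window, in seconds
            - name: CHAOS_INTERVAL
              value: "10"      # delete a pod roughly every 10s
            - name: FORCE
              value: "false"   # graceful deletion, not force-kill
```

It is innocuous in the sense that a healthy Deployment simply reschedules the deleted pods, which is exactly the resilience you want to verify.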
Note to self: don't be lazy next time you do this. Node CPU hog, node drain... kubelet service kill. That one? Oh, maybe. No restart. You know what, screw it. Fair enough, fair enough. Let's terminate some instances. You might want to select a node for that one. No. There's only one node. Oh yeah, okay, fair enough, fair enough. EC2 instance ID, give me one second. And Mark is also asking: suppose I have an isolated Kubernetes cluster only for chaos, what do you recommend for authentication and authorization for my chaos cluster to connect to my Kubernetes target? My chaos cluster to connect to my Kubernetes target... I don't know. Yeah, honestly, it depends on how you've set up your K8s stuff. Like I said, if you're using EKS or GCP stuff, obviously authentication and authorization is going to be some form of an RBAC role, defaulting back to their IAM management, using a role or something like that. Internally, I don't really know. Yeah, I don't have a good answer for that, really. Like I said, we haven't played around much with isolating them and actually sharding it out into different clusters yet. I'm sure I'm going to have that same question in about a week. Maybe in a couple of hours, depending on what happens after this call. Yeah, pretty much. So as bad an answer as that is, it's the best answer I've got. Yeah, we've pretty much only messed around with the default auth stuff. Easy there, I promise I'm actually doing something. Perfect, but there are about 10 minutes left in our time. So if any audience members have questions, now is the time to start typing them. So everyone, ask those questions now. Okay, time to do... there it is, I saw something pop up, there we go, here. This instance ID, oop, where did that go? Okay, where did that go? Okay, I lost my mouse now. Why is my mouse over... okay, seriously? Like, how do I get my mouse back?
I mean, I just saw it fly in front of Chaos Center, did you? Oh, it did? I don't know if this will work, but we'll try it. Next, next, finish, next, here should be the instance ID. This is what we call YOLO chaos testing. Pretty much. Let's see if this thing actually nukes the node. Oh, it's still installing. Well, I guess the other thing we could do: once you connect it to a data source, you can actually set up dashboards, and they have some built in. See, the problem is I just set the node to delete, so I'm in a rush to grab those metrics while they still exist. Actually, they're persisted by a PVC; that's why I still have access to them. Here we go. Yeah, so you can kind of see, here are the pod metrics. I mean, it's a limited use case, I think Grafana still blows this out of the water, but it's still kind of cool. You can see this actually needs a Prometheus scraper, which I didn't set up. There's a Prometheus exporter that'll dump all the chaos intervals into Prometheus, with a service monitor. It didn't play nice with our pre-existing Prometheus install; the directions they have kind of install Prometheus itself, but you can hack through it to get it running. Did this thing actually kill the node, or is the node still up, did it actually run? Oh, I've done that wrong. We'll see. Oh, it failed. I bet it didn't have permissions. It wasn't able to get the secret. Yep. Yeah, it didn't have permissions. So to answer your question, you can permission around it, so it can't do stupid stuff like that. Kill itself. Yeah. Honestly, I'm a little bit relieved. Yeah, me too. CPU is the fun one. Yeah, sure. Oh, why don't we do a CPU and a RAM hog in parallel? Let's just do, okay. Seriously. Because we haven't. It's doing its thing again.
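On the "permission around it" point: each experiment runs under the `chaosServiceAccount` named in the ChaosEngine, so scoping that service account's Role is how you keep an experiment from, say, reading secrets or terminating nodes. A rough sketch of a namespace-scoped setup, not Litmus's official RBAC manifests:

```yaml
# Hypothetical minimal RBAC for a pod-delete experiment runner.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: sock-shop
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role                       # namespaced Role => no cluster-wide powers
metadata:
  name: pod-delete-sa
  namespace: sock-shop
rules:
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["create", "list", "get", "delete", "deletecollection"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "list", "get", "patch", "update"]
  # Deliberately no "secrets" and no cluster-scoped "nodes" access,
  # so experiments that need those (like node termination) fail closed.
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: sock-shop
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: sock-shop
```

Failing with a permissions error, as happened here, is the desirable outcome: the blast radius is bounded by RBAC rather than by luck.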
Yeah, because we haven't demonstrated running parallel chaos experiments. Sure. We've got eight minutes to do it, so. That should be sufficient. Yeah, as long as my proxy holds up. There it goes. I'm impressed at how well it's done so far today. Honestly. Yeah. Part of me is also like, I should have taken the five minutes and gotten all the load balancer configs correct. New experiment. Memory hog. Yeah, we're doing a pod memory hog. Nice. Maybe. Yeah. Oh, come on. We'll configure this one first. Target application. Do you plan on just doing it? It's fine. Yeah, it's fine. Hey, finish. Yep, there we go. And then memory hog. Come on. Man, it just does not like memory hog. Try one of these other ones. Yeah, my proxy doesn't like it. I can fix that later. There we go. All right, so you see up at the top left where it says edit sequence. Yeah. Once I change those, we can, let's do a disk fill. Let's fill a disk. That's a good one. Yep, yep, that looks good. It should not look like this. I know, it's my proxy acting up. It's fine. Sad. I don't want to hit refresh. Yeah. Let me see if I can get this comment: someone in the chat is saying it only works if you do not have namespace quotas defined for memory and CPU hog. Yeah. Going back to the, remember I said I was lazy? Yeah, I didn't set those up on this cluster. It's my proxy, dude. And also, this is getting closer to final call for questions. So if anyone is typing, hit enter as soon as you can, you have a few minutes left. Let me see. Let me stop, let me try this one more time. Now let me sign in. There we go. All right. Let's see if we can speed run this. Speed run chaos. It's working now. That's amazing what happens when you reset your proxy. Let's do a disk fill. Hey. Sequence. Hey.
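For the parallel run being set up here: in the V2 Chaos Center, a chaos scenario is an Argo Workflow under the hood, and experiments placed in the same step group execute in parallel. This is a heavily simplified, hypothetical outline, since the real generated workflows carry the full ChaosEngine specs inline:

```yaml
# Hypothetical scenario outline: two hogs in one parallel step group.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallel-hog-
spec:
  entrypoint: chaos
  templates:
    - name: chaos
      steps:
        - - name: cpu-hog            # entries in the same "- -" group
            template: run-cpu-hog    #   run in parallel
          - name: memory-hog
            template: run-memory-hog
    - name: run-cpu-hog
      container:
        image: litmuschaos/litmus-checker:latest   # assumed runner image
        args: ["-file=/tmp/cpu-hog-engine.yaml"]   # assumed engine path
    - name: run-memory-hog
      container:
        image: litmuschaos/litmus-checker:latest
        args: ["-file=/tmp/memory-hog-engine.yaml"]
```

Putting the second experiment in a separate step group instead (its own `- -` entry) would run them sequentially, which is the "edit sequence" toggle mentioned above.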
That actually doesn't work. I want this guy and this guy to run at the same time. Yeah. Save changes. All right. Let's go through and speed run this. Put it in here. I don't care. Finish. Finish. Next. Next. Target user-db. Next. Next. Finish. And let's do user-db as the first one. Do we have some time left? No. Finish. Don't care. Finish. Next. Don't care. Finish. All right. Well, something's going to fail. So let's... Nothing's going to fail. That's... Oh. That's what you told me about earlier. Yeah, yeah, ignore me. I generally do. That's probably healthy. All right. Now we're fighting the clock. Let's see. Will it work? Will it blend? There's a hands-on lab. I don't know if this is gonna start in time. That's good to know. Yep. Yeah. I've also seen some really good demos. Chaos Carnival is an annual conference specifically for Litmus, and there are some really good demos that came out of that. Yeah. So Jonathan is recommending the hands-on labs for a Litmus introduction; it's in their learning space. You should open that up and check it out. Oh yeah, Prometheus lags by a few seconds. Yeah. That's not what I want. That's running. So let's see. No, I wanna switch to Sock Shop. Sock Shop. Get rid of that. All right. Nothing's bouncing. So that's good. Oh, memory's starting to go up a little bit. Yeah. I forget which one I did the RAM and CPU on. Just randomly clicked a couple. Acceptable. And we are at time. Great speed running at the end. Any final comments or notes for our audience? Not off the top of my head. Yeah, so it looks like that one did the memory. Yep, yep. Yeah, I really appreciate the opportunity to come show off, messing around a little bit with what Litmus can do. Yeah, definitely. Yeah, loved it, particularly the speed run. It was very, very nice. Great to see that. Perfect.
Thank you so much, everyone, for joining the latest episode of Cloud Native Live. It was great to have a really good session about using Litmus, Chaos Engine, and a microservices demo app to demonstrate automated RCA. We really love the interaction and questions from the audience. And as always, we bring you the latest Cloud Native code every Wednesday. So next week, we will have a session on in the Cloud with Cloud Muthers. So thanks for joining us today, and see you next week.