All righty, we've waited long enough for folks, so we might as well get started. Hope everyone can hear me loud and clear. Let me go share my deck really quick. Hello everyone. Hi, hi, hello, hello. I'm gonna share my screen in a second. Cool, everyone see it? Yeah, yeah. Cool, so yeah, agenda today. We'll do some introductions, then we'll have some community presentations, about 15 minutes each, from the Gremlin and Chaos Toolkit communities. We'll talk a little bit about the status of where we're at with the landscape, mostly me asking for help from everyone and trying to come up with some reasonable categories for how to categorize the different technologies in chaos engineering. We'll also talk a little bit about the white paper: based on community feedback, I uploaded it to GitHub so we can bang away at it via pull requests instead of Google Docs, which seems to be preferable to a lot of folks. And then we'll kind of close things out. But first off, it would be great if we could get some introductions from folks, especially if you're new on the call. So, any new faces from last time want to speak up and say hello? Cool, I'll go ahead. My name is Roman Kane, I'm up in Seattle. I'm working on an early stage startup here called Fuzzbox that runs failure simulations on your production artifacts, so it's chaos adjacent. That's why I'm here and interested, so thanks. Cool, very cool. Anyone else? Yes, hello, my name is Alexei Ledenev. First time joining this call. I work on the open source project Pumba, which does chaos testing for Docker containers. Okay. Awesome, glad to have you. Anyone else? Anyone else here? Yeah, I can go. Hi, my name is Julian. I'm currently in Stockholm in Sweden and I'm a software engineer. I'm doing a lot of Kubernetes and Docker work these days and I'm really interested in chaos testing, especially continuously testing production environments. Awesome. Anyone else? All right, that'll be it for this time.
We'll try to make sure everyone who's new has a chance to introduce themselves every time. All right, let's move on. So, presenting mode again. So, landscape. One of the things that we wanted to accomplish from the working group is to produce not only a list of the different chaos engineering projects, products, and tools out there, but also attempt to categorize them. And I've been bashing my head a little bit, working with Sylvain and some others, to come up with an initial idea, but I would love to hear from the group if anyone has an idea in terms of how to categorize things. You could categorize by simple things, like hosted solutions versus frameworks that run client side and are driven client side, versus maybe chaos engineering tools that are focused on security. So having an idea of how to categorize this is definitely something I've been trying to tackle. I'm asking folks: we could have probably a five minute discussion on the call today, but there's a GitHub issue open, and if you have ideas of how to categorize things, it would be great to hear, because that's essentially going to drive the landscape that will be produced by CNCF for this work. So I don't know if anyone wants to take a stab at this. I know Sylvain, you had some thoughts that we chatted over email, but I'd love to hear from the group if they've got a picture in their mind of how this should be categorized. I've actually got a question, Chris. How did that work for other working groups, like serverless? Did they go through the angle of the market, of the tooling? What was the rationale? Well, I mean, it was a lot of bikeshedding forever to come up with the categories. First we started with a list of serverless projects and things, right? And then from there, we tried to break it up in terms of categories, and that's kind of what made sense.
So that's the approach we took. It took a long time. And I was trying to use that framework to apply to chaos engineering, because obviously there seems to be an upstart of hosted offerings, with Fuzzbox, Gremlin, ChaosIQ, et cetera. And then there's a bunch of tools that have different focuses, whether they're trying to do chaos engineering in, what's it called, a security context or, you know, some other context. So I'm just asking how to categorize things. What are people's thoughts on this? Yeah, I have some sense that focusing on the context, like whether it's security or something like that, may not be the right way, even though it seems like it might be to begin with. Yeah. Just because I think, yeah, reading the chaos book and stuff, coming at it from the perspective of either running experiments or, I think Julian mentioned, continuously running tests and that kind of thing, is maybe more the way to dissect it. Because I think at some point you get into the concept of dark debt and things like this, where it maybe isn't necessarily applicable to a very clear domain. So, you know, there are more and more ways of applying chaos. Like, by its very nature, chaos isn't easily categorized, maybe. I don't know. We could just start with one big box and just call it, you know, chaos engineering and throw everything in there in one box, but yeah. I also think, you know, you can maybe even say whether it's actually in production, whether it's, you know, how Netflix does it with sort of canarying. Almost how it gets applied versus what it's being applied to. Like, this is what I was poorly trying to say.
Whether it's, you know, like what I might do, which is run on a snapshot of the artifacts, almost like in a staging environment; whether you actually have the Gremlin agent deployed and you're running that in production with a blast radius; or whether, you know, you're actually just setting something loose, like Chaos Monkey. So almost like, what's the scope of how you're applying it? That might be a way to break it up, because I think there are different stages in sophistication to where you apply each of those. And that might be a way to step through the tools. Yeah, I think that's very true. I think there's a case to be made of, here's the technology and here's the philosophy. And the philosophy might be a great place to start, and then the technology is, is it Docker, is it Kubernetes, is it bare metal, is it whatever? Yeah, yeah. I mean, you know, I'd also ask people to have some empathetic thoughts from an end user perspective. If you're trying to evaluate the tools out there, like, hey, I want this stuff to work against AWS or Kubernetes, you know, making it easy for them to find that, at least from that perspective, is super useful. So I'd like us to also make sure that's somehow possible from just an end user perspective. That's a great point. How do you get started? I've been told I have to do this thing at work, so where do I even begin? Yeah, yes, yeah. And I'm running on Azure or something; I want to make sure this stuff works there too. That type of perspective. So the application has several dimensions where you can filter things on diverse degrees of refinement, or something like that.
Yeah, there could be multiple axes of filtering. You know, I'm not asking for a complete solution, but for folks to put their thoughts on that GitHub issue and for us to keep iterating on this. Eventually I want to get to a state where we have rough agreement, and then I can work with our design team to start sketching out how this is going to look, and then we can continue to iterate on that. If I may, I've been maintaining a repository that's called Awesome Docker. Yes. I've been doing that for like four years, and I can tell you the pain of trying to categorize a landscape that is shifting under your feet all the time. So with the other contributor, the main idea we came up with was to split everything into use cases. So, who is it for? Not what does it do; who is it for? And from there, most people find the common terminology, can relate to keywords much more easily, find their way, and contribute to the list as well. So that's a way to involve the community around it. Yes. So this would definitely be in an open source fashion where anyone could contribute their thing to the landscape. If you go to l.cncf.io, you can see what we've done for the wider cloud native landscape. It works pretty well because community members tend to police themselves, which is beautiful. It's a beautiful thing to watch, making sure that things are categorized properly. But it's on us to come up with the categorization scheme. So yeah, that's basically all the time I want to spend on this one, just because we have demos and I'm really excited to see them. All I ask from the group is to throw some ideas on the GitHub issue that I've linked to, and we'll go from there and try to run that work stream in as async a fashion as possible. Cool. All right, so moving on.
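The "multiple axes of filtering" idea can be sketched in a few lines. This is purely illustrative: the tool entries and the axis names (`deployment`, `targets`) are my own assumptions for the sake of example, not the working group's actual categorization scheme.

```python
# Illustrative only: a tiny multi-axis filter over a hypothetical landscape
# list. The attribute values assigned to each tool are example data.
LANDSCAPE = [
    {"name": "Gremlin", "deployment": "hosted", "targets": ["linux", "docker", "kubernetes"]},
    {"name": "Chaos Toolkit", "deployment": "client-side", "targets": ["kubernetes", "aws", "azure"]},
    {"name": "Pumba", "deployment": "client-side", "targets": ["docker"]},
    {"name": "Chaos Monkey", "deployment": "client-side", "targets": ["aws"]},
]

def find_tools(landscape, **criteria):
    """Return names of tools matching every given axis, e.g. targets='docker'."""
    matches = []
    for tool in landscape:
        ok = True
        for axis, wanted in criteria.items():
            value = tool.get(axis)
            # List-valued axes match if the wanted value is contained in them.
            ok = wanted in value if isinstance(value, list) else wanted == value
            if not ok:
                break
        if ok:
            matches.append(tool["name"])
    return matches
```

An end user asking "what works against AWS and runs client side?" would then call `find_tools(LANDSCAPE, deployment="client-side", targets="aws")`.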
So next up we have two community presentations, to see how people are doing chaos engineering in the wild. First off, we'll have Eugene from Gremlin talk a little bit about what he's up to, and then we'll have Sylvain talk about Chaos Toolkit. But let me stop sharing my screen so Eugene can get going. Cool. Thanks, Chris. You there, Eugene? Yeah, sure thing. Actually, I'll share that screen and just move over to the slide. Yeah, sure. Give me a second. Yeah, no problem. I'll get going. Sorry. Oh, God. Here we go. OK, got it. That was good. Thanks, everyone. So hey, my name is Eugene. I've been at Gremlin since July last year. Just want to, whoa, is the slide OK? Seeing a little bit of artifacts around. Got it. Good now. Very good. Cool. So the company itself was founded in 2016. Kolton left at that point in time and recruited Forni to be CTO, to start working on this product and make chaos engineering available for basically the rest of the world, since we've all seen it work at places like Google with their DiRT exercises, at Dropbox similarly with Tammy and her DiRT exercises, and also with Kolton doing chaos engineering at Netflix, and then Kolton with Forni doing chaos engineering and building a tool internally at Amazon retail. So we know the value at the corporate level, and we're just trying to make it available to everybody else. Gremlin itself installs in a myriad of ways. One of them is a Linux package, so all you really have to do is pull down our repo and then install the package. Or you can install it as a Docker container: pull us from gremlin/gremlin on Docker Hub. Or as a Kubernetes daemon set; I've seen a lot of our customers do that as well. We have a CLI interface such that you could SSH into any host to run Gremlin experiments.
We have an API as well for automation, as well as a web app, just to make it really easy for people to dive into all the types of failure modes that we can introduce into your ecosystem, or otherwise to properly scope your attacks. One of the things that I find very valuable when helping our customers scope out chaos engineering experiments is that they go, where do I start? Well, if you don't have this nice rich UI giving them all the parameters, it's really hard to get started, because I could just blow up a whole auto-scaling group and see what happens. Well, start small, and we can help you scope that out properly. We also have a built-in scheduler. I think one of the things that we all hear a lot about is automation, and putting things into the scheduler really helps you maintain that floor of resilience in any of your applications. Finally, we also have this halt/rollback feature, or a dead man's switch. If you think that you're causing a lot of damage and you really want to stop it from happening, we have a halt button: just click on it, or otherwise use the API to stop the attack, and we'll be able to stop and get back to steady state. Alternatively, just to stop things from going wild, if our client loses contact with our control plane, a dead man's switch on our client will kick in and roll back to steady state automatically, such that you don't have any uncontrollable chaos within your ecosystem. So that's it for this slide. Let me go ahead and get into the demo really quickly. Let's see. Let me just go ahead and put up my screen. Here, I need to stop sharing. So again, just like. Chrome? Cool. Now go for it. You should have permissions now. Cool. Let me know when the screen shows up for everyone. Can you see it? Yeah, works for me. Fantastic. So right when you log into our service, this is the UI that you're going to get.
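To make the halt-via-API idea concrete, here is a hypothetical sketch of what halting an attack through a REST API such as Gremlin's could look like. The base URL, endpoint path, and auth header format below are assumptions for illustration only; check the vendor's actual API reference for the real contract. The function only builds the request so nothing is sent.

```python
# Hypothetical: endpoint path and "Key" auth scheme are illustrative
# assumptions, not the documented Gremlin API.
def build_halt_request(base_url, api_key, attack_id):
    """Build (method, url, headers) for a halt call without sending it."""
    return (
        "DELETE",
        f"{base_url}/attacks/{attack_id}",
        {"Authorization": f"Key {api_key}"},
    )

# Sending it is then a one-liner with any HTTP client, e.g.:
#   method, url, headers = build_halt_request("https://api.example.com/v1", key, aid)
#   urllib.request.urlopen(urllib.request.Request(url, method=method, headers=headers))
```

Keeping request construction separate from transport like this also makes the halt path trivially unit-testable, which matters for a control you only exercise during incidents.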
If you have clients already hooked up to our control plane, you can already begin to create your first attack, otherwise put things in the scheduler, and finally just manage your clients or the users connected to us. I already talked about the ways of installation, which I usually walk through with our prospects and customers, so I'll just skip that and go right into the types of attacks that we can run. Right now, the client itself is focused on infrastructure level attacks: things that happen on your host, things that happen on your operating system, things that happen on the network. For all those, we have a good suite of attacks that you can run with Gremlin. So resource attacks, first off, are things that happen on your host. We can consume host cores: of the cores that you have on your host or your instance, you can specify the exact number that you want consumed. We can specify the amount of disk space you want us to fill. What happens if your log rotation doesn't happen? We internally got bit by it; you should check out our blog post at gremlin.com. We can also introduce disk read/write activity. For those of you that have a lot of disk intensive tasks, this might be a good gremlin to run if you have a lot of heavy IO operations. And finally, we can also consume gigs of memory. Notice that every single one of these attacks you can also save as a template, for when you have an attack that you're going to recall fairly frequently. Some of these attacks might be a little bit more specific and highly targeted, so the configuration could take a little bit of time. Save it as a template; that way you can bring it back in the future, or at least throw it into the scheduler so that you can automate that particular attack. State gremlins here alter the state of your operating system.
So for example, we have a process killer here where we can string match the process that you send to us, and we'll just kill it perpetually. What happens if your web server, like your httpd, were to go away, or your Java or your Tomcat app were to go away? Does your health check pick that up, and then otherwise terminate the host or start recovering from it? Good thing to test with this. We also have a shutdown gremlin. So what happens in, say, public cloud, right? Like AWS: on a failed health check, your instance is going to get terminated. Or Meltdown: AWS would do a rolling reboot across your entire fleet. You don't know when it's going to hit, but it's going to hit. Use the shutdown gremlin for this. One of the key values that we have here is that if you use this in conjunction with our scheduler down here, and you say, run this during business hours, right, nine to five, running the shutdown or reboot gremlin five times a day, well, you basically would have that Chaos Monkey experience right out of the box for yourself right there. The final state gremlin that we have is time travel, where we'll break NTP and introduce clock skew. Many times I hear from our customers that when you introduce this in, say, your Cassandra cluster, terrible things happen. Maybe you want to see what happens in your own world for your data layer. Otherwise, things like daylight saving time are also something to consider, or if a certificate on your host were to expire, that is a good thing to simulate as well, or a leap year. Now, the final set of gremlins that we have are the network ones. And these tend to be the most valuable and most powerful ones, because in a distributed system in AWS or any kind of cloud provider, the network tends to be the most fragile point. As you're breaking your applications from monolithic to microservices, your network is now basically part of your application stack, right?
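The "Chaos Monkey out of the box" idea of firing a shutdown gremlin a few times during business hours can be sketched as a tiny scheduler. This is my own illustrative logic, not Gremlin's actual scheduler: pick N random moments inside the nine-to-five window at which an attack would fire.

```python
import random
from datetime import datetime, timedelta

def schedule_runs(day, runs=5, start_hour=9, end_hour=17, rng=random):
    """Return `runs` random datetimes between start_hour and end_hour on `day`.

    Illustrative only: a real scheduler would also handle time zones,
    weekends, and blackout windows.
    """
    window_start = day.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    window = (end_hour - start_hour) * 3600  # seconds in the business window
    offsets = sorted(rng.randrange(window) for _ in range(runs))
    return [window_start + timedelta(seconds=o) for o in offsets]
```

At each returned time, the scheduler would trigger a shutdown attack against a randomly chosen host in the configured blast radius.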
You expect it to be high performing, but really things happen because, well, cloud happens; network devices have entropy built into them. You definitely want to simulate what happens when things become degraded or otherwise unavailable. The black hole gremlin here drops packets going from one place to another, so you can simulate something like a full service unavailability. Notice that all these network gremlins have the most arguments that you can pass to them. And I really just want to highlight the concept right here that we can actually simulate full service outages. Now, some of you might remember this great outage that happened last year called the S3 outage. We can simulate that for you out of the box just by adding S3 as a service provider right here. The next gremlin that we have is the DNS gremlin, and we can break DNS for you. You want to test what happens if your primary DNS server were to go away: do your hosts actually fall back to your secondary server? Definitely something worthwhile to check. Otherwise, you can simulate bigger DNS outages like Dyn DNS, right, or UltraDNS that happened a few years ago, or maybe just Route 53 being unavailable as well. These last two gremlins, latency and packet loss, are the ones that I would call gray states of failure. Your systems are running, but due to things like noisy neighbors or having to traverse a lot of internet traffic, things become slow or otherwise degraded. So it's not operating at its most efficient point. The problems usually manifest themselves in the form of latency, right? Things just become a little bit slower than what you're expecting them to be. You've seen the menus here, so I'll just go ahead and talk a little bit about them. You want to dial in how long you want to run the attack for. Sometimes your observability or your monitoring tools take a little bit longer for the metrics to show themselves.
So you can definitely specify how long you want to run the attack for. You can specify things at the IP address, IP address range, or CIDR block level; at the device level, such as your eth0 or your eth1; or by hostname or endpoints, such that if you want to just inject some latency going to google.com, you can just type it in. Or if your service has a third party dependency on something for messaging, like Twilio, you can do that as well. If you want to whitelist particular traffic, it's just a caret right here. So maybe you want to whitelist your monitoring, just so you have some observability into this chaos experiment. We also support port ranges, and finally, you can also specify what protocol you want. Once you're done defining the attack, you now want to specify the targets that you want to attack, and by this we're talking about the blast radius, if you will. To help you with this, we pull down your AWS instance metadata here, where you can click on any of these bubbles to filter by, say, region or availability zone. And finally, we also support services that you can pass on to us, so if your hosts are serving up a particular application, you can specify that. One of the things that I like to highlight here is our concept of random. While most of the tools that we see say, do random, do it in prod, we kind of say you still want to be a little bit targeted. So our concept of random here is that we take all of the clients that you have installed with us, and then through filtering, say I only care about things like our API, for example, we'll pull and filter all the clients that are serving that particular service, and then you can specify, well, I only care about maybe two hosts here, so go ahead and use those as my target. Or otherwise, we also support saying a percentage of your environment is impacted, right?
So maybe I would say 50% of it, sorry, typo, 50% of it is going to get impacted. Now, if you want to do container attacks, like in your Kubernetes environment or something along those lines, you can send us the labels that you have put on your pods, and we'll go ahead and attack the matching pods within those hosts. We won't use that right now, so let me just go ahead and kick off this attack. Once you've specified your attack and kicked it off, it'll take us over to our attacks page, where you see all the current attacks and also all historically run attacks. We eat our own dog food, so you're gonna see a lot of attacks in the Gremlin plane, for example. Once you click in, you get to see the full attack definition. All client logging comes back up to us, so you don't have to remote into a host to see what's going on. If at any given point in time you feel like, again, you've done enough damage, or I need to roll this back because I made a mistake and fat fingered my attack, for example, we have a halt button right here. Our client will pick that up within seconds and go back to steady state within seconds. Everything I've shown you, again, has full feature parity with our API, right? We don't circumvent ourselves via the web app or anything of that sort. So you can go ahead and orchestrate your own tooling, or otherwise put it into your CI/CD pipeline, such that alongside your smoke tests and your regression tests, you'd spin up a canary cluster, install Gremlin, and run some resilience tests on it as well to get some confidence in the resilience of your systems. So that's the demo for Gremlin. Hope you all enjoyed it. Maybe there are questions around it I might be able to answer for everyone. Hi. Can you guys hear me? Yes. Yeah, yeah. This is Hemant from India. I work for VMP. I have one year of experience working as a chaos engineer. New to the group; I forgot to introduce myself in the beginning.
So my question is, you were mentioning that you can run some tests post the Gremlin attack. Is there any provision in Gremlin where I can hook in these tests? Where you can, I'm sorry, do what with the tests? Can I run these tests through Gremlin? Yes. Well, you do run everything through Gremlin, because we have the client installed and then you can use us to orchestrate the attack for you. Okay. Can you just give a small demo around it? Yeah, please visit gremlin.com. We'd love to speak with you further. Just go ahead and request a trial and we'll be on our way. Okay. Yeah, fine. Thank you. Okay. Well, if there's nothing else, thanks for the time. Hey, just one question. I was just curious, this is Deepak from Capital One. For the network disruptions that you talked about, are you using traffic control (tc) based shaping? Are you using any tc libraries? What are you using behind the scenes? Sure. For Linux, if you're familiar with tc, it's something similar to that. Okay. Yeah, most of the production boxes I've seen are Linux OSes, so I believe, yeah, as long as we have access to run the tc commands, we should be able to achieve that, right? Right. I'll defer to Forni to talk more about what's under the hood, since he's the developer here. Yeah, all of our gremlins use core Linux libraries to do the impact. So yeah, if you have the tc library, you're good to go. Okay. Most of your demo was around AWS endpoints; do you support any other remote machines? We support Linux and other container environments. So if you run those environments, you're good to go. We currently do not have a Windows environment, for example. Okay. Can you schedule attacks against, like, the Kubernetes API, for example? I am not sure about the Kubernetes API specifically. Forni, jump in here. Yeah, please do. No, not against the API right now. We're adding in a bit more container and Kubernetes support. Okay.
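For readers unfamiliar with tc: netem-style latency and loss injection on Linux is typically done with `tc qdisc` commands. Here is a sketch that only builds the argument vector; actually applying it requires root and a real interface, and the flags shown are the standard netem parameters rather than anything vendor-specific.

```python
def tc_netem_command(device, delay_ms=None, loss_pct=None):
    """Build a `tc qdisc add ... netem ...` argument vector.

    Building argv only; run it yourself (as root) with subprocess if desired.
    """
    cmd = ["tc", "qdisc", "add", "dev", device, "root", "netem"]
    if delay_ms is not None:
        cmd += ["delay", f"{delay_ms}ms"]
    if loss_pct is not None:
        cmd += ["loss", f"{loss_pct}%"]
    return cmd
```

Clearing the impairment afterwards is `tc qdisc del dev eth0 root netem`, which is the rollback step any tc-based injector has to guarantee.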
Originally it was kind of built for bare metal and, you know, EC2, that sort of thing, and now we're iterating on that right now. I think that's enough. Cool, thanks. So I want to get a little bit more detail: can you share more about what you're planning around Kubernetes API support with gremlins? I mean, I can't share exactly what we're planning right at the moment, but suffice it to say there'll be better Kubernetes support. How about that? Is there a timeline that can be shared? Not at the moment, no. Okay, thank you. Are there any plans to integrate with any of the service meshes out there, like Istio or Conduit, to essentially, as you're doing a slow rollout, also integrate gremlin testing as part of that, to ensure things are kosher before doing a full rollout? Yeah, it's definitely something we've thought about. We're sort of just getting past the POC stage right here, so there are a lot of things we want to do, you know? And it's sort of figuring out where the, the squeaky wheel gets the grease sort of thing. So containerization has been a real big part of our past roadmap. Okay, can you go into a little bit of the telemetry that you get after you've run an experiment? I see there were some logs on your dashboards. Do you get any sort of visualizations or anything like that? So right now we've kind of opted to stay away from integrating with monitoring, just because it gets a little hairier when you start to take in some of that information. Plus there are a lot of great monitoring solutions out there, so we tend to leave those sorts of things to LightStep and Datadog, those sorts of tools. I'm sure, you know, there are opportunities to expand into that space in the future. It's just that, since it's not sort of our core competency, we've sort of let them do their thing and let things line up in the monitoring dashboards. Does that make sense? Yeah, sure, cool.
My question was actually centered around monitoring as well. So how do you guys interface, or do you interface at all, with things that are tracking SLOs and SLAs, to, like, put them in a quiet mode? Or would you even want that? And, as a side note, I'm new to this, so maybe that is something where you want to see if your SLAs are affected by such a thing. It's a fantastic question. We've kind of stayed away, and it's kind of hard to do SLO and SLA definition generically, I guess; a lot of companies do them kind of differently. We've sort of not integrated into that space just yet. And that's fine. Like, I heard your statement on core competency, and I think that's great. But being able to tie into something like Prometheus, or maybe making it so it's not going to page an entire engineering team while you're doing tests like this, maybe at least in a controlled setting, but... Yeah, no, you're totally right. I guess what we usually tend to advocate is over-communication in that regard. So if somebody does get paged, they know that they're getting paged because we're running testing. Obviously, yeah, you can turn off your paging, right? You can turn off PagerDuty or silence those sorts of things. Oftentimes, though, we actually want to do this to see that a page actually goes off when something bad happens, right? So if you peg a bunch of cores, you expect pages to go off. So it's kind of like unit testing your paging and, I don't know, just the idea of making sure that your on-call knows what to do, testing engagement, spinning new people up. You can do it either way, right? That's a fair point. Yeah, no, that makes a lot of sense. Thank you. Right, I have to agree with Forni on this point. Many times when I run game days with our customers, it's not so much finding faults within the application as it is making sure that they have their observability, their monitoring, their paging dialed in and tuned.
And a lot of times they go, this happened, and I never got a page for it. Yeah, you want to follow up on that real quickly to get that dialed in. Otherwise it's going to be too late; this is already in production. Okay, cool. I have another question. Oh, sure, one more question. What's that, Chris? No, no, I'm just saying one more question for Eugene, and then, just to be sensitive to time, on to Sylvain's presentation. Go ahead, ask your question. Yeah, so I would like to find out if there is any integration with CloudWatch events. Let's say my auto-scaling group is going up and down too fast; is there any integration with Gremlin where you could trigger some tests to give me some more details about it? Go ahead, Forni. I'm not sure I understand the question entirely. I mean, if you want to trigger your ASG cycling pretty quickly, you can just use the shutdown gremlin on a loop, I suppose. I'm not sure that there's a direct integration in terms of CloudWatch events. I'm not sure what your hypothesis is here that you're trying to prove or disprove, I suppose. So let me ask differently. Has any customer asked for integration with CloudWatch events, particularly in AWS land, because that's where sort of the state of my cluster, all the events, et cetera, are thrown? Sure, people have asked. People, like, I don't know if you know engineers, they ask for everything, they ask for everything under the sun. Yeah, it's been asked. We're slowly rolling out more integrations and sort of prioritizing what our customers ask for. They definitely ask for that. They definitely ask for other sorts of things as well. Sorry, I can't be more specific. No, that's good. Thank you. Yeah, no worries. Okay, cool. I just want to make sure that we're sensitive to time, but thank you, Eugene and Matthew, for the presentation. That was super cool. All right. So we've got about 20 minutes left.
So that's about 15 minutes for Sylvain to present, with some five minutes for questions. So Sylvain, are you there? Yes, I am. Yeah. Awesome. You can put up the slides, as we did for Eugene. That'd be nice. Yeah, I'm trying to find that right now. Give me a sec. I'm afraid things won't look as polished... No, no worries. As they did for Gremlin. No, no, it's different tools. Different tools. Okay, so hopefully you see everything. Yeah, so hello everyone. Right, so I'm going to talk about the Chaos Toolkit, an open source effort that Russ and I started in September, I think, last year. The idea was roughly that we saw that tools like Gremlin, Pumba, or Chaos Monkey obviously were there. But as we were trying to figure out how to apply the experimentation pattern that we had read about in the chaos engineering book, we felt that those tools, while actually delivering the goods, weren't helping us forge the experiment, if you will. So we decided to create the Chaos Toolkit to do that, basically. Yeah, so the Chaos Toolkit by itself does nothing, unlike Gremlin or the others. What it does is help you declare your experiment, and then you decide what tool or what API you want to drive to actually inject chaos. So the Chaos Toolkit doesn't actually provide anything like Gremlin does, but it can drive the Gremlin API if you wish to use Gremlin, for that matter. So basically it's just an open API for your experiment, if you will. It's CLI driven; we felt like we wanted something we could automate that way. That's why we didn't care for a UI at first, or initially. We target simplicity, and I'm talking about the code itself. We wanted something that other people could actually contribute to, and we tried hard to make things as simple as we could. So basically, it's just a bunch of functions glued together in Python. Well, it's a bit more than that, but that's the rough idea.
If you want to contribute, you do need to know Python, but only at a very basic level. What it does is orchestrate existing tools or APIs. By existing tools, that means that if you have a binary that you want to drive from the Chaos Toolkit, you can call it. But equally, if you want to call an API, you can also just call it, passing all the parameters the API requires, and it will simply call that for you. We already actually implemented a set of drivers. We call them drivers, but they're just extensions to the Chaos Toolkit, really. We don't claim we support all the APIs of those providers, that would be foolish and just a lie. But we try to target APIs that people don't necessarily trigger very much, like stop a service, or just remove a Kubernetes service, or things like that. Basically anything that you probably don't call except if you're a developer and you do that on a daily basis, but without the idea of chaos engineering in mind, you're just stopping something to restart it after that. Well, those APIs are very powerful in production or pre-production or wherever you want to run them to actually impact your system, if you will. So for example, if you want to remove a service, or if you want to terminate a pod, just call delete pod, basically that's it. If you want to destroy a pod, that's different. But if you gracefully just want to stop something, you can do it with Kubernetes. And we started to implement things like that for all sorts of providers. Azure with Service Fabric is a bit different because they already actually have chaos services native to the platform. So you don't actually call anything, you just call start chaos or stop chaos, much like Istio, I think, which has fault injection. And then we realized, well, causing trouble is fine, but we really need some probes as well.
Like you guys said earlier, hooking into your monitoring tool is fantastic, but we wanted, during the experiment, to be able to collect the data that mattered to us in the context of the experiment, so that you can support your analysis after that. So we have probes. Basically, you can query Prometheus, or Humio if you use that as a central logging platform. And basically all of that is contained in the file that I'll show you in a minute. We've got a bunch of plugins for creating reports and sending Slack notifications. And finally, the future of the Chaos Toolkit is we're going to go more native in Kubernetes with cron jobs, so that you can schedule things and just let them run. We're looking at operators as well to actually control the experiments a bit more when you run natively in Kubernetes, but we're just starting to think about that. And drivers in other runtimes. We run in Python, but we're not exclusive, we love everything. So if you want to write your extensions in Go or Rust or anything, we're going to try to make it easy to actually call them from the Chaos Toolkit as best as we can. And we are trying to aim for milestone one this year at some point. So that's it for a very quick overview of the Chaos Toolkit. It's open source, it's Apache licensed. And now I'm going to try to demo something, if the demo gods are with me. One second, I will exit so you can share your screen. Yeah. Perfect. Right, indeed. Share my screen. Should be good to go. Yes, let's pick that one. Right, so that's the website. Like I said, it's CLI driven, so nothing very fancy to show here. The idea is just to work through the important bits. Like I said, we tried to create an open API. So the open API is just a definition of the various elements of a chaos experiment. That's what it looks like, briefly. You've got a set of metadata, blah, blah, blah. What's interesting is you've got the steady state here. The normal in your system.
So what we do is we use a bunch of probes, you can have as many as you want, to query for some things in your system. And what we're going to do is, if any of them fails its tolerance here, it's just a boolean but it could be something else, we bail the experiment. Because in our mind, if your system is not normal, at least in what you expect to be normal, then there's no point actually in going on to cause trouble, because you won't be able to read, analyze, and make sense of what you see. So we bail immediately. But we run that steady state hypothesis at the end again, once we've caused trouble, to see if we deviated. If that doesn't pass this time, that means one of two things. Either you asked the wrong questions, or you found a potential weakness. So at that stage, what you want to do is basically go into the report and see what happened and make sense of it as a team. Now you've got the method. So once you've run the steady state, the hypothesis, you run the method, and it's just a bunch of actions or probes. Usually you've got one or two actions, because you want to make sense of what's happening. So if you change too many things, I guess it's similar to what Gremlin is doing, when you create attacks, I suppose, you're trying not to create attacks that conflict with each other. Otherwise it's probably much harder to actually make sense of what's happening. And then you've got rollbacks. We tend to call them remediations these days, because to us rollbacks are a strong promise that you can come back to the normal state, which is not always the case if you've really broken your system. But the idea is sometimes you want to come back to the steady state. Now with Kubernetes, usually rollbacks are empty, because Kubernetes is meant to actually support and deal with failures automatically. So you don't actually do anything there. Right, so that's it. It's a JSON file that is declarative, and what happens is you define your probes and actions. They all have the same format.
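[Editor's note: a minimal sketch of the experiment file described above, assuming the chaostoolkit-kubernetes driver; the title, URL, and label selector are made up for illustration.]

```json
{
  "title": "Database master failover does not impact users",
  "description": "Steady state is verified before and after the method.",
  "steady-state-hypothesis": {
    "title": "Application responds normally",
    "probes": [
      {
        "type": "probe",
        "name": "application-must-respond",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://my-app.example.com/health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-db-master",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {"label_selector": "app=my-db,role=master", "rand": true}
      }
    }
  ],
  "rollbacks": []
}
```

You would run a file like this with `chaos run experiment.json`, which checks the hypothesis, applies the method, then checks the hypothesis again.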
So let's pick that one. It's a provider, it's in Python. It takes that module and that function. In this case, it doesn't actually have any parameters or arguments. In this one, you can actually specify the name or the label, things like that. They are just functions, basically, in a Python module somewhere, but you declare them and you can share them on GitHub. Basically, we wanted to have something in a file so that you can really use it inside your CI/CD pipeline as usual. It's just another resource. Although I'm trying not to sound like it's a test, it certainly overlaps with that tooling in some fashion. And you can reference existing probes so that you don't have to actually duplicate things. Sometimes we do have a pause. I personally dislike that, but I couldn't see any other fashion, except if I was doing synchronization with the system telling me how it was doing. So sometimes that thing is a bit flaky. So contributions are very welcome to actually improve the API, definitely. All right, let's try to show an experiment here. So we're going to use a very stupid demo. We've got that application, which does nothing. The data that you see is pulled from a Postgres database. And what we want to see is, under some medium load, what happens if the database master actually dies? And behind the scenes on Kubernetes, what we're using is Patroni from Zalando, which has a leader and a follower for Postgres and will switch from one to the other if the master dies. And that's what we want to prove, because we expect that if the master dies, we don't actually impact our users. So let's see, it's probably more interesting to look at it directly. That's not the one I want. Right, can you see that? I'm not sure I'm zoomed enough. So please do tell me if not. All right, sweet. So what we have is exactly the same as what I showed you before. We select the application that we're interested in.
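[Editor's note: those provider functions are plain Python. A hypothetical probe in that style, which an experiment could reference by module and function name via a "python" provider, might look like the following; the function name is invented for illustration.]

```python
from urllib.request import urlopen


def application_responds(url: str, timeout: int = 3) -> bool:
    """Probe: return True when the application answers with HTTP 200.

    An experiment file would point its "python" provider at this module
    and function, passing `url` through the `arguments` mapping.
    """
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, DNS failure, or timeout: not responding.
        return False
```

The steady-state hypothesis would then declare a tolerance of `true` against this probe, so a non-responding application bails the experiment.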
That's it, we check that the pod is alive and the application must respond. And if that doesn't happen, the experiment bails immediately. Otherwise it goes on to the method. And here what we do is we terminate the DB master, and we don't actually have a function doing exactly that. What we do is we terminate the pod, the pod that has that label. And luckily enough, we only have one, it's a demo, blah, blah, blah. Rand here means nothing because, again, we only have one pod that matches that label. But if you had many, it would actually pick one through that argument. And then we've got a bunch of probes. Now you might wonder why I'm going and fetching logs. During the experiment it doesn't actually do anything, but it's interesting when you do the analysis, because you can come back and look at whether you had the logs you were looking for in your application, or in the various parts of the system. So for your analysis, sometimes it's nice to actually go and fetch them as you run the experiment. And that's basically it. So let's pray that this works. So chaos run is the command that you're going to run. I don't know if you see that, but there is a notification on the top right, because we send that to Slack saying it started. And it basically doesn't do much. It runs things in the order that it reads them from the file. And that's why it does look like a test, because it does, doesn't it? In this case, it should fail if I'm correct, because the application will collapse. There you go. So here's the reason it fails. What's interesting here is we see that first it did succeed. So the system was normal. So we went on and killed the DB master, but it failed when we ran that hypothesis again. And the reason is because, in that specific case, my code was not good enough to actually reconnect, if you will, to the new database master. So when I called the application, it failed because the connection at that stage was stale.
So what you want to do, if you have CD, is you map that to a fix, and then you push it, you could use Weave or something like that, to say, well, okay, I fixed the thing now. So I'm going to release a fix, apply the fix. Let's wait for it to come back. Come on. There you go. It's already there. And now we're going to run that. Oops, I don't have that set up properly yet. It's not up yet. So in that case, actually, what you saw is the steady state bailed the experiment initially because the system was not normal. I went too fast and probably the system was not yet ready. Oh, it is now. There you go. But what you do is probably you hook that up with GitOps or whatever and say, well, okay, I've released a fix, rerun that, all those things that you would do with any sort of test, basically. And hopefully that fix does work, and you've proven you had a weakness, you found it, you fixed it, and you try it again. And that's the before and after that you would want to see from chaos engineering experiments. Now there you go. It shows that the steady state now is met. So that means our application is now able to actually sustain that kind of loss. If the master goes away and the connection is lost, we are able to actually sustain that error. That's the kind of failure that you don't want to see. That's a basic one. And if I have time, I don't know, perhaps not, there is another one I would want to show you. I'll stop now, but I don't think I have the time. It would be, let's say you're using GKE or something like that, and you realize one of your nodes, the virtual machine, has, I don't know, a security issue or something. What you want to do is roll out a new node pool, right? A new set of machines with a fix, but you want to see whether or not it's going to impact your users to switch from one node to another node.
Well, that's an experiment you can run with the Chaos Toolkit. You're going to bring up a new node pool with new machines, drain the old one in Kubernetes, watch the load spread from one side of the node pool to the other, and see if your application is actually impacted. That's the kind of thing you can do with the Chaos Toolkit, because all we do is drive existing APIs. We don't try to create new sorts of chaos tooling, you know, because they already exist in various fashions. Right, that was my demo. Cool, thank you. We have about five minutes for questions. Anyone have any questions? I actually had one. I saw when it got to the end of the run it said, let's roll back, no rollbacks declared. I'm curious what the rollback would end up looking like. Maybe an experiment like this doesn't have one, but I guess, what does the JSON body of that end up looking like? So yeah, it's a good question. While in the steady state we use probes only, in rollbacks you can use actions. Basically all you do is you try to revert something. So if I had the time to show you the node pool one, I actually create a node pool, and while that runs, I kill the old node pool in the rollbacks. But in the example I showed you, because I'm using Kubernetes, Kubernetes takes care of the rollback, if you will. Yeah, it just brings those pods back. Yeah, I killed the DB master, but Patroni with the operator and Kubernetes make sure that it comes back to life. So yeah, in Kubernetes, rollbacks are, I wouldn't say meaningless, but certainly, yes, less useful. Awesome, thank you. Thanks. Any other questions? Otherwise I will ask for volunteers to present at the next meeting. I know Paul has volunteered to talk about his work on Spring, and maybe Miko, PowerfulSeal, sometime. Any other volunteers to present next? Yeah, Chris, I can demo the LinkedIn stuff.
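[Editor's note: for the node pool case just discussed, a rollbacks section is simply a list of actions in the same shape as the method. A hedged sketch, where the module, function, and argument names are invented for illustration rather than taken from a real driver:]

```json
"rollbacks": [
  {
    "type": "action",
    "name": "delete-temporary-node-pool",
    "provider": {
      "type": "python",
      "module": "mychaos.gke",
      "func": "delete_node_pool",
      "arguments": {"pool_name": "temporary-pool"}
    }
  }
]
```

These actions run after the final steady-state check, which is why an experiment that relies on Kubernetes self-healing can leave the list empty.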
Obviously it's a little private, but it still shows some of the techniques, which is pretty cool, and it actually probably builds on some of that Spring work as well. Okay, I can also demo Pumba, but again, maybe not next meeting, since we already have enough presentations, in the meeting after that. Yeah, I just want to collect a backlog of things and I'll schedule you. I'll try to do at least one presentation a meeting, two max. Okay, so write me down. I'll put you on the backlog. Thanks. Any other volunteers? All right, well, that's good for now. So we have a couple minutes left, and I just want to be sensitive to people's time. So thanks everyone for showing up. I want to continue to do all the white paper and landscape work as async as possible. Hopefully people can give their input on that while we kind of build it out. I think other than that, that's it. Anyone else have anything to say? Otherwise we can thank our presenters and meet again in a couple of weeks. Do you have a timeline in mind for the white paper and landscape? I know, really bold question. I'm just curious. Yeah, so I love forcing functions. Picking a date and hammering towards it generally works well. I think it's still going to take some consensus building, but I would love to get something out probably in a one to two month timeframe. We're also going to have to give the designers on the CNCF side about two weeks to kind of make things pretty and professional and line up PR and all that stuff. So whatever we choose, we'll have to at least have a two week buffer for them to do that work. Yeah, sounds great. I was just trying to get a general feel. Yeah, so, cool. All righty, let's do it. Do it. Great, thanks. See you in a couple of weeks. Thank you. Bye-bye all. Thank you, Chris. Thank you.