Right, hopefully everyone can see my screen. Okay. So, the agenda: as Chris is not here, I'll do my best, so please raise your hand if I miss something or am not clear enough. Today we'll have one demo around PCF from Karun and Ramesh, and then a quick landscape and white paper update. I'm not sure there's much there, but let's keep the discussion going. Chris was meant to do a quick recap of ChaosConf; hopefully someone who attended can tell us about the conference. Yeah, I was there. Great, then you can fill that role. Right, let's get started. I think we're ready for the demo, so I'll let you share your screen; that's probably best. Sure, I'll stop sharing. Is there a button here to share? On Zoom there's a green "share screen" button at the bottom. There you go.

Okay, do you want to start? Yeah, sure, thanks. Good morning, folks. My name is Ramesh; I'm the senior engineering manager for the platform engineering team at T-Mobile. First off, thank you for giving us the opportunity to present here. We're very excited to learn there's a community around this that we can tap into, an extensive network we can also get help from for what we're trying to do at T-Mobile. Quick background on me: I've been with T-Mobile for ten months, and my team owns the container strategy for T-Mobile. Karun is one of the brilliant engineers we have on the team. Karun, do you want to introduce yourself?

Yes. My name is Karun Chanuri, and I'm a senior software engineer at T-Mobile. I joined this team in March 2018, so I'm fairly new to T-Mobile; before that I was a cloud security engineer with Huawei. I have about 13 years of experience in information security and enterprise security. Here at T-Mobile we take care of Pivotal Cloud Foundry operations in a DevOps kind of role, and we also run an in-house Kubernetes cluster; both come under Ramesh.

Okay, let's get started; I know some folks are excited to see the demo and what we're going to talk about. I'll gradually take you through the journey. Karun, next slide, please. One of the things I'd like to start with is the Joker's take on chaos in the movie The Dark Knight, which we also spoke about at our SpringOne talk. The Joker calls chaos fair: every time you disturb the harmony of a system, something will come out of it. That analogy kick-started my own thinking. My team builds all of these capabilities on top of a massive infrastructure; behind the scenes there's compute, network, and storage, and things will always go wrong. I'll get to the actual problem statement, but I always like to start a presentation with who we are, what we do, and what services we provide, drive toward the problem statement, and then hand over to Karun. Our vision today at T-Mobile is to deliver simple, secure, scalable platform services that are infrastructure- and platform-agnostic.
And we do this with a relentless focus on customer obsession, because we're connected to the needs of an internal engineering community of over 800 active users on our platform. On the right side you're seeing the evolution of the as-a-service model; you might have seen different variations of this. We're one of the biggest partners of our infrastructure team, which is also moving toward an automation model with one-click access to compute, network, storage, and other resources. We're their biggest customer, and we extend that portfolio to offer even better capabilities. One of those is a container-as-a-service offering on Kubernetes, so people can bring their own containers; another is a platform-as-a-service abstraction where people just give us their code, we run it within our abstraction layer, and they don't have to worry about anything else. It's like a self-driving car: we give them the bells and whistles of operating their code at scale, and at the same time they get the best experience when dealing with live customer issues. Depending on the abstraction you choose, different flavors come in. That's what my team does in a nutshell. Next slide.

One thing folks have asked me is: what's the big deal? Every company is doing this; what in your portfolio is really driving the need for chaos engineering? I'll focus on the fact that we're building our services on top of on-prem infrastructure today: compute, network, and storage. We've gotten the business used to agility over roughly the last two years: about 4,000 applications and 500 active users per day, across 31,000 containers spanning development and production. They're all production to me, even the development environments. The business has gotten used to agility: faster applications, faster mean time to respond and resolve. The last iPhone launch event saw a peak of about 16,000 requests per second the minute the iPhone launched, and since then it has been trending around an average of 14,000. Hang on there; I'll tell you when to move to the next slide, please.

Looking toward the future, we want to extend our capabilities around simplicity, security, and scalability. We're trying to deliver many new capabilities around functions as a service, and we're exploring new capabilities at the platform-as-a-service layer. A lot of work is already being planned in terms of agility, resilience, and enhancements, all of which involves infrastructure in the background. Next slide, please.

Okay, now to explain the problem statement a little more. Have folks on the call seen this before, the blue and black dots? Anybody? Okay, let's go through the animation, Karun. What you see here is what's called the Death Star diagram: a representation of the ecosystem your microservices live in and the interactions they have with dependent services. The snapshot on the left is from Amazon in 2010, the blue version is Netflix's Death Star diagram, and ours looks a little less chaotic, but we're getting there in terms of what the chaos is going to look like in the future.
The key message here is that we as engineers write services, those services have backend systems they interact with, and when systems fail, your customer takes the impact in the form of customer-impacting events. So as part of our digital transformation initiative, we're trying to think about failure in a different way and embrace it, because failure is inevitable. Next slide, please.

I actually started on this problem statement with Karun a couple of months ago, and we wanted to look at two kinds of failures. First, there are the platform-level failures that I care about because I run the platform: what kinds of things go wrong with my platform, and what assumptions do engineers make when they build services on top of an infrastructure? Think of assumptions like the network being homogeneous, or compute resources being infinite. All of these assumptions need to be validated, because it's a matter of when, not if, problems start happening, and you could end up in a disaster scenario like the two guys on the left side of the slide; that's not our data center, it's somebody else's. The fact is, though, that our data center is in an earthquake-prone zone, with active volcanoes in the region, so anything could happen. We're trying to be cognizant of the question: if things fail, how will our systems react, and how can T-Mobile continue its business, given that a lot of business-critical applications run on this platform? Second, there are the containers running within the platform. Containers host applications, and it's never just one target application; there are several. So you want to launch specific, targeted attacks on containers and affect only the one application under test, or the dependent service it interacts with. That's a fairly significant problem on its own, because of the way PCF is built and the way containers work on PCF. Okay, next slide please. This is where I'll hand off to Karun to talk through the journey.

Sure. So, looking at the problem statement, we have two problems there: one is platform-level attacks, and the other is attacking the applications that run on the platform. We started by exploring existing tools, because we didn't want to reinvent the wheel. We began with an open-source solution called Chaos Lemur, but all we could make work with Chaos Lemur was killing virtual machines. At the platform level, we wanted to kill virtual machines, kill processes, introduce latency into the system, and introduce memory and CPU hogs; all of these fall under infrastructure-level attacks. Chaos Lemur, for us, was more like a Chaos Monkey: it can only go and turn off a random virtual machine, which is definitely not what we were looking for; we were looking for a much bigger solution. So we also looked at Gremlin as a commercial offering, and what you see here is the version we evaluated. It's a very powerful tool, I must say, and it comes with a very neat UI.
There was some initial doubt about whether Gremlin would work in the PCF environment, but we made it work, and it performed the infrastructure-level operations fairly well: killing virtual machines, killing processes, introducing latency and memory hogs. However, the version we evaluated doesn't have application knowledge. Gremlin seemed to be working on that, and from a recent intro by their CEO, Kolton, it sounds like they've added this capability in the latest release, which we haven't looked at yet. But Gremlin also comes with a cost, and we're very conscious about the cost involved in running on our infrastructure. So we looked at Turbulence as an open-source alternative. It performs fairly well and is very native to Cloud Foundry, which is pretty much what we were looking for: failure-injection attacks can be performed on any BOSH-managed virtual machines, killing VMs, killing processes, and introducing latency and CPU and memory hogs. But again, it lacks application knowledge. As Ramesh explained, we have those two problem statements: infrastructure-level, or platform-level, chaos engineering attacks, and application-level attacks.

This is where the Chaos Toolkit comes in as a framework: it orchestrates other solutions, like Gremlin and Turbulence, through drivers. We built two drivers. One is a driver for Turbulence itself, and the other is a custom, homegrown driver built from scratch that has application knowledge: it can discover where your application is running within the cluster. If I have a cluster of 2,000 nodes, this driver can figure out which of those nodes your application is running on. It can also figure out which service dependencies the application has: if it depends on a MySQL database, the driver can go and randomly kill that MySQL database instance and let you see how your application behaves. These are the two drivers we'll demo today.

Before I jump into the demo, let me explain more clearly what we mean by simulating failures at the platform level. You can see the component diagram for PCF, Pivotal Cloud Foundry. There are various components here; each box could be a virtual machine, or several boxes could be processes running inside one virtual machine. There's a lot of interaction happening, so it's entirely plausible that a failure happens at some random point and eventually leads to a disaster. Let's take a simple example: the Rep process going down. The Rep process runs inside a Diego cell and is responsible for managing the lifecycle of the containers on it; a Diego cell is like a worker node in Kubernetes. If the Rep process goes down on that node, there's no way to manage the container lifecycle. So that's one good failure to simulate via Turbulence and the driver we spoke about.
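To make the shape of such an experiment concrete, here is a minimal sketch of what a Chaos Toolkit experiment for this Rep-process scenario might look like. The overall layout (title, description, a method whose action calls into a driver through a Python provider) is the standard Chaos Toolkit experiment format; the driver module, function, and argument names below are illustrative assumptions, since the talk doesn't show the Turbulence driver's actual API.

```json
{
  "title": "Kill the Rep process on one Diego cell",
  "description": "Illustrative sketch only: the 'turbulence.actions' module, its function, and its argument names are assumed, not the real driver API.",
  "method": [
    {
      "type": "action",
      "name": "kill-rep-on-one-diego-cell",
      "provider": {
        "type": "python",
        "module": "turbulence.actions",
        "func": "kill_process",
        "arguments": {
          "deployment": "cf",
          "group": "diego_cell",
          "process": "rep",
          "limit": 1
        }
      }
    }
  ]
}
```

Limiting the attack to a single instance mirrors the "limit to one" targeting described in the demos that follow.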
Another attack: think of the applications running in your cluster. Let's say a set of apps has auto-scaling enabled, meaning that based on load, CPU stress, or HTTP latency, the apps scale their instance counts up and down. For that, the autoscaler service depends on the Cloud Controller. So what happens when there's huge traffic to the app and the Cloud Controller goes down? Does the app still scale up or down? Those kinds of failure-injection tests can be performed seamlessly via the driver we're talking about here.

How do we perform this? Turbulence comes with an API server and an agent. The agent sits in each of the virtual machines in your cluster and listens to the API server, which is the control plane. We use the Chaos Toolkit (CTK) to initiate attacks, which go through the API server, and the agents fulfill those requests. Some of the attacks we can perform: killing a VM, killing a process, pausing a process (which will be one of the demo scenarios here), introducing stress into the system via CPU and memory hogs, corrupting a disk attached to a virtual machine, and network attacks such as delays, limiting bandwidth, and reordering packets (what happens if packets are reordered, and how does your platform react?). There are also firewall attacks at the platform level, with targeted blocking down to individual iptables rules, plus shutdown, blocking DNS, and duplicating packets. These are features that come with Turbulence; the highlighted ones are the ones we added and contributed back to the open-source project.

Let me show you the first demo. I'd like to run the video from my desktop, so give me a second. Are you able to see my screen now, or are you still seeing the presentation? Still the presentation. Oh, okay, I have to share this particular window then. It's okay now. So this is the demo I was talking about. It shows how the Chaos Toolkit is used with the Turbulence driver we added, in two scenarios. The first is pausing a process: what happens if the SSH process on a Diego cell is paused? The second is killing a VM itself, a Diego cell: what happens to the containers running on it? It's a short video, and it's on YouTube as well; I'm running it here because it has better quality and the zoom-in effect. First I open pause process.json. This is my Chaos Toolkit experiment file, with a title and description, a steady-state hypothesis, and configuration information that I pass as part of the experiment. This can be enhanced: instead of putting a username and password here, they could come from Vault. The environment is a one-box setup called BOSH Lite, which gives you a Cloud Foundry running on one laptop, and among the Turbulence deployment's virtual machines you can see the Turbulence API server running.
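As an aside before the walkthrough continues, here is a hedged reconstruction of the kind of pause-process experiment file being described: pausing the SSH process for one minute on one Diego cell of the cf deployment, with the Turbulence endpoint and credentials passed through the experiment's configuration block. Environment-variable lookups in configuration are a standard Chaos Toolkit mechanism (and, as Karun notes, a Vault could be used instead); the driver's module, function, and argument names, and the endpoint URL, are again assumptions for illustration.

```json
{
  "title": "Pause the SSH process on a Diego cell for one minute",
  "description": "Hedged reconstruction of the demo's pause-process experiment; driver module, function, argument names, and the endpoint are assumed.",
  "configuration": {
    "turbulence_api_url": "https://turbulence-api.example.local:8080",
    "turbulence_username": { "type": "env", "key": "TURBULENCE_USER" },
    "turbulence_password": { "type": "env", "key": "TURBULENCE_PASSWORD" }
  },
  "method": [
    {
      "type": "action",
      "name": "pause-ssh-process",
      "provider": {
        "type": "python",
        "module": "turbulence.actions",
        "func": "pause_process",
        "arguments": {
          "deployment": "cf",
          "group": "diego_cell",
          "process": "ssh",
          "duration": "1m",
          "limit": 1
        }
      }
    }
  ]
}
```

An experiment file like this is executed with the Chaos Toolkit CLI's "chaos run" command, which is what the demo does next.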
As I said, there's a Turbulence API server, plus a Turbulence agent running in each virtual machine; that's the configuration we provide in the experiment. The method here is an attack that pauses a process: the SSH process, for one minute, on deployment cf, group diego_cell, limited to one instance, which can be any Diego cell. Right now in this environment I have only one Diego cell, since it's a one-box environment, but we tested this successfully on a staging environment with hundreds of VMs.

So there's one Diego cell here. Let's SSH into it: as you can see, it's very responsive, very quick. Now I perform the failure-injection test with the Chaos Toolkit by running the pause process.json experiment. You see some verbose output as the experiment runs, the steady-state hypothesis is met, and the experiment ends with status completed. Now let's try to SSH into the same Diego cell again: it hangs, because the process is paused for about 60 seconds. We can check the UI as well; there's a UI aspect of Turbulence, and it shows the pause-process attack in progress, and it will keep spinning for about one minute. After about a minute you'll see the lock is released; try SSH again and it's very responsive once more, because there's no longer a lock on the SSH process. And since we could do this for the SSH process, we can do it for anything, for example the Rep process.

The second scenario is killing a Diego cell, any Diego cell. Again there's a separate experiment file for this, with the standard experiment definition for now. The steady-state hypothesis is actually empty at the moment; we're not doing much there, but we could add some probes. The method is an attack that kills a Diego cell running in the deployment cf, limited to one, so it kills any one Diego cell. I run the kill-Diego-cell experiment: it validates the hypothesis, and the experiment ends with status completed. As you could see, there was a Diego cell running; now let's print the VMs, and clearly there's no Diego cell anymore. It's killed. So what happens to the applications and containers that were running on that Diego cell? That's another way of looking at things. Again, the UI shows the attack completed successfully, which is why it's green. And since this is a BOSH-managed environment, BOSH will bring the Diego cell back: after a certain interval you can see the Diego cell is back because BOSH recreated it, and in that scenario BOSH also made sure the apps were scheduled back again.
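One aside on the steady-state hypothesis that was left empty in the kill-Diego-cell experiment: the standard Chaos Toolkit HTTP probe is a natural way to add one. For illustration, a fragment like the following could drop into the experiment file; the app URL is a placeholder, not taken from the demo.

```json
{
  "steady-state-hypothesis": {
    "title": "An app on the platform still responds",
    "probes": [
      {
        "type": "probe",
        "name": "app-still-returns-200",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "https://spring-music.apps.example.com/"
        }
      }
    ]
  }
}
```

The Chaos Toolkit evaluates the hypothesis before the method, to confirm the system looked normal to begin with, and again afterwards, so a probe like this would surface whether the platform had recovered by the time the experiment finished.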
So that's my first demo; let me go back to my presentation. That was the first half of the problem statement: platform-level chaos engineering attacks, performed via Turbulence and the Chaos Toolkit driver for Turbulence. The next half of the problem statement, which is crucial for us, is application-level chaos engineering, because we have about 4,000 applications running on our platform. We're not a single-application company, where you just attack the one application and all of its components and see what happens. We have different teams building different things every day, and those roughly 4,000 applications are not independent of one another; they're interdependent. So we don't want to randomly go and kill a Diego cell: it would impact multiple teams within T-Mobile, and that's a big problem for us. We want very conscious, targeted attacks, where one specific application is targeted for chaos engineering without disturbing the other applications running on the same Diego cell. Do you want to talk about this, Ramesh, the ops world? Or should I go ahead?

Karun, sorry, I was speaking on mute. Can you hear me now? Yeah. So in our ops world we deal with an open support model, where we get a number of different questions, and I want to share a few of our favorites, because they show how we can be not just enablers but also guardians: not simply relying on best intentions, but providing tools and capabilities that back up those intentions when working with such a large, intelligent customer base. The first one is "my app isn't picking up the latest configurations," and our first reaction is that this is bad karma and your app is misbehaving. The second is "my app isn't connecting to Cassandra," and our first reaction is that your Cassandra cluster was probably decommissioned, or you misbehaved with the Cassandra team. The next one we see is "my app works locally but not on PCF"; this is likely because you misbehaved with the PCF team, which is us. And the last one is "it was working till yesterday and then it stopped," which is because we believe you haven't paid our bills. All right, jokes aside, these are serious concerns that we classify as debugging-as-a-service, and customers often like to start with us because we're very nice to them. We try to enable them and make them self-sufficient, but that's not enough: you need to provide tools and capabilities that give them guardrails to operate within. That's where a tool like this is going to be very effective, because it shows them what they can do to actually validate these operational assumptions, and helps them be more self-sufficient. Perfect, sounds good.

Having said that, and extending the same problem statement, I'd like to touch on the cascading effect, popularly known as the butterfly effect. We all know what a cascading effect is, so I won't dig into the details. Say we have concert and weather microservices running on our platform, and in this case weather depends on a third party. What happens if that third-party application goes down? It fully impacts weather, and that times out concert. And it may also happen that the database concert depends on goes down as well.
All of this put together creates a cascading effect and produces an unfavorable experience for the web application fronting the client; the client experiences something very unpleasant. Zooming in a bit: what happens if these applications run in a Spring Cloud kind of ecosystem? These are Diego cells; imagine that the weather and concert microservices are both scheduled and running on the same Diego cell. A Diego cell is a virtual machine, so both containers are running on one particular node. What I mean by a targeted attack is this: how about blocking the traffic coming from the Gorouter, the load balancer, to the concert service only, and also blocking the traffic from the weather service to the MySQL database? As you can see, the weather and concert services both depend on the MySQL database. So I'm doing a very targeted attack here: both apps run on the same node, but I can attack them in a fine-grained way, blocking traffic to concert, and blocking traffic from weather to the database. The failure can also happen at different levels, since there is a lot of interaction here: weather might lose its connection to service discovery or the circuit breaker, and concert might lose its connection to the config server. All of these attacks can be simulated.

How we do that today is the CTK CF Blocker, a new driver we wrote that targets specific CF applications and their hosts. It discovers where your application is running, and it discovers which services your application is bound to. In this case, my weather and concert microservices are bound to the MySQL database, the Config Server, Eureka, and Hystrix. It can also go after a service instance: for example, the CTK CF Blocker can go to the config server the app is bound to and bring that config server down. So primarily it blocks all traffic to app instances, and it blocks traffic to bound services.

On to the next demo; let me share this screen. We're going to see two scenarios here. The first is blocking a database connection. I have two apps running in the Cloud Foundry environment, spring music and spring music 2, and as you can see there are two URL routes pointing to them. (Karun, sorry, didn't mean to interrupt you guys. I'm going to stay on the call, but I have to dial out of the web meeting; I have to drop my daughter at school.) Okay. So, spring music: what happened here is that the app loaded, and these are the breadcrumbs, the album information loaded from a database. You can see which database it is: a MySQL database called music db. Both apps, spring music and spring music 2, are bound to this music db. Now what we're going to do is cut the connectivity to music db from spring music only. Again, we have an experiment file here with all the attack definitions, and it has probes; we use probes in this example to check for HTTP 200, okay or not, against this URL. And the method here is blocking the service and then, on rollback, unblocking it.
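Roughly, such an experiment might be laid out as in the sketch below, based on the fields called out in the demo (app, service, org, and space, with the unblock placed in the rollbacks section and a probe checking for HTTP 200). As before, the CF Blocker driver's module and function names are assumptions, and the org, space, and URL values are placeholders.

```json
{
  "title": "Block spring-music's access to its bound music-db service",
  "description": "Illustrative sketch; the CF Blocker driver's module and function names are assumed, and org/space/URL values are placeholders.",
  "steady-state-hypothesis": {
    "title": "spring-music responds normally",
    "probes": [
      {
        "type": "probe",
        "name": "app-returns-200",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "https://spring-music.apps.example.com/"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "block-music-db",
      "provider": {
        "type": "python",
        "module": "cfblocker.actions",
        "func": "block_service",
        "arguments": {
          "org": "my-org",
          "space": "my-space",
          "app": "spring-music",
          "service": "music-db",
          "duration": 60
        }
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "unblock-music-db",
      "provider": {
        "type": "python",
        "module": "cfblocker.actions",
        "func": "unblock_service",
        "arguments": {
          "org": "my-org",
          "space": "my-space",
          "app": "spring-music",
          "service": "music-db"
        }
      }
    }
  ]
}
```

The second scenario described later, blocking all traffic to the app itself, would presumably take the same shape minus the service argument, with a corresponding block-traffic and unblock-traffic pair of actions.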
So first we block the service for 60 seconds: the service we block is music db, for the app spring music, and we have to provide some information like the org and space names, which is specific to PCF. In the on-rollback section, unblocking the service takes the org name, space name, app name, and the service to unblock. Let's verify first: both apps are working fine, no problem there. Now let's perform the attack. Again, the Chaos Toolkit is a Python tool, and I run the block-service experiment. There's a verbose option you can enable to get deep stack-trace information, but I won't use it for this demo. What the driver is doing is blocking traffic to music db bound to the spring music app only. It found where the application instances are running; the app has three instances, it figured out all three instances running in the cluster, and then it started the attack. Now just refresh: as you can see, the attack succeeded, and there's no data anymore. Spring music 2, which runs on the same VM and points to the same database, can still fetch the data, no issue there; but spring music has no data. That proves the attack succeeded. It takes about 60 seconds for the system to come back, because we have a rollback policy after 60 seconds. Let's go back and see what's happening: the rollback, unblocking music db, has kicked in, and there you go, spring music is back in action. So within that 60-second window you can see what happens to your application; that's the actual goal here. And it can be any service, not just music db: MySQL, the Eureka service, Hystrix, all the services the app is bound to.

The second scenario is blocking traffic to an app. We use the same apps here, spring music and spring music 2, and we're going to block traffic to the spring music app only; again, both apps run on the same virtual machine. Let's look at the block traffic JSON experiment file. You can see it follows a similar pattern of definitions: the configuration goes here, the steady-state hypothesis, and so on. Let me come to the method: what we're doing is blocking traffic to the app spring music. There's no service here, and that's fine; we don't need to define one. We block the traffic for 60 seconds, and the rollback function is unblock traffic. So, blocking all traffic to spring music: it's still working, but give it a moment... and there you go, the traffic has been blocked and you see 502 Bad Gateway. The Gorouter is aware of the route, but it can't make a TCP connection to the app, so that's a successful attack for us. And the unblocking has been initiated; you can see the unblock happened without waiting out the full 60-second timeout here, so it came back very quickly. That brings us to the end of the second demo. Let me go back to my slides; I have two or three slides left, and that's all I have. I think Ramesh is not on the call. Okay, anyway: these are the upstream contributions we've made. All the demos you've seen here are part of our upstream contribution from the T-Mobile side. There are two. I don't know if you want to take it?
No, no, you can speak; I'm actually driving, so you speak to this one. Yeah, please. So we have the Chaos Toolkit CF Blocker driver and the Chaos Toolkit Turbulence driver, and we want to bring these two under the Chaos Toolkit umbrella; that's an effort we're putting in as well. These demo videos are available for you to go through again. And, to be clear, Turbulence was built by someone else; we are not the ones who created Turbulence. So we made a pull request with all the new feature add-ons, and the PR is still pending approval. We'll wait and see; if nothing happens, we'll create a new repo on top of it and keep adding features. Do you want to talk about this, Ramesh? This is our team: what do we want to do next?

So next, we actually want to run some game days. This is still a proof of concept and not fully matured yet. We want to make it work, productize it, call out teams on a team-by-team basis, and run game days in a war room, randomly attacking our own infrastructure to see how their applications behave. We want to build this capability and hand it to the application teams so they can perform chaos engineering attacks on their own apps. That's all I have.

Well, thank you very much, Karun and Ramesh. That was a really sweet demo, and really interesting to see. I'll carry on with the slides now, if you don't mind. (Should I stop sharing?) Yes, please. Thank you; it was a very interesting demo, very fun to watch.

All right, so back to the usual discussion around the landscape and crafting its categories. I personally haven't had a chance to look at it yet, so if anyone has, fantastic, and I'll be paying attention. The idea, and we'll be talking about KubeCon in a minute, is probably to have the landscape wrapped up in some fashion soon-ish, so it can be shown at KubeCon and basically settled. I don't know if anyone has comments on that; I don't want to put anyone on the spot, but please raise your hand and I'm happy to talk about it. Otherwise I'll carry on.

So yes, KubeCon is a while away, but certainly not far; it's almost a Christmas treat. Chris has set up two talks for it: one is an intro to chaos engineering, and the other is a deep dive. The intro will certainly cover the working group, but not only that: work like you presented, Karun and Ramesh, and everything that has happened in the field of chaos engineering, including Gremlin's recent announcement; basically, "this is where we are now as a community with chaos engineering." For the deep dive, I know that Julian, who presented two weeks ago, and I will be doing a joint demo, picking up where he started and showing how we can automate it afterwards with the Chaos Toolkit. If you're interested in talking, or being there in some capacity, please reach out to Chris.
I'll reach out to him. Yes, please; don't reach out to me, because I don't have any control over that, I'm just passing the message. Do talk to Chris; you'll be happy you did. All right. Honestly, it would be fantastic to see as many people as we can, not just on screen this time but face to face, and just have a coffee or something; that would be perfect. (We at T-Mobile are in Seattle, and we're joining KubeCon, so we hope to see you guys there.) We'll see you there; that's fantastic.

All right, this is where I'll lean on the people who were at ChaosConf. I've heard very good things, so I'm keen to hear a bit more about it. Who would be willing to talk about it? Michael? Yeah; I don't know if there was anyone else from the group there. So ChaosConf was about a week and a half ago in San Francisco. We kicked off the day with Kolton from Gremlin, talking about the evolution of actual chaos engineering attacks, and he concluded by presenting Gremlin's new product, ALFI, application-level fault injection, where you can write various attacks into your application, which was pretty cool. Adrian Cockcroft from Amazon then spoke about the history of chaos, which I really enjoyed; it was a really good talk. We had a number of other presentations, from Bloomberg I think, and a great one from Twitter. One of the presenters was a technical diver, and she talked about chaos engineering in team building. Later in the afternoon, Tammy and Ana from Gremlin did a really cool demo of breaking AKS and EKS, and we finished the day with Jessie Frazelle, who talked about containers and breaking stuff in containers. So it was a really good day, good to meet more people in the community, and I look forward to next year; I hear they may have a new venue, which will be cool.

Very cool, exciting. I'll try to watch all the videos online; it always takes time to get through all the talks, but it's usually worth it. (All the videos are online, by the way.) That's a really good point; I was just digging the link out of my inbox and have posted it. (I think last time I clicked the link it was dead. Have you posted it in the chat window right now? The YouTube link to the ChaosConf talks?) It's in the Slack; I'll post it in the chat now. Okay, awesome. Thank you.

Carrying on: back to what I was saying about the working group's work. I need to pay attention personally to the state of the PRs, not just the one I mentioned earlier but also the various ones on the white paper, and try to merge everything into one document and see where it stands; it probably needs a lot of polishing. Any help is obviously welcome. And that's about it for this week's meetup. Before anyone asks: Chris usually ends by saying that we welcome any demo, whether someone wants to do a new demo or a better one; it's always cool to have that. So pass that along to people who aren't aware of this working group.
I think that's important, because I'm sure there are plenty of companies doing chaos engineering, or resiliency engineering, whatever you want to call it, and they should come and show it to us. All right, any questions? Not that I can necessarily answer them, but let's have them.

(That's fine. So, about this working group: Ramesh and I are late entrants. What's the expectation? What are the low-hanging fruits we could grab to start contributing?)

Good question. If anyone wants to respond, go ahead; I've actually stopped sharing the screen so we can see each other, and I'm happy to let anyone answer. Okay, I'll go for it. First, the working group is not yet a working group; that's important to note. From a CNCF point of view, it still needs to be proposed and accepted, and so on. The point of what we're trying to do as a community is to bring everyone together and basically work as one. For that, we're trying to at least produce a white paper; it doesn't have to be the complete, comprehensive mother of all white papers on chaos engineering, but enough to showcase what it is and where it's going. I know Chris is also looking at expanding the CNCF landscape with tools, and something like Turbulence could be one of them, to show that this is a real field with technology that exists. So right now it's mostly about awareness: getting step one in place, the trampoline that gets us elsewhere afterwards.

(I have a question; sorry, I joined a while in. Did you have this discussion with Chris about what the next step is to transform this into a real working group in the CNCF? I think it's important to fix a date or a milestone, some target, maybe not the next KubeCon but Barcelona or something like that, to achieve something, deliver something, and go further.)

Yeah, and Michael mentioned that last time as well; it's a good point to raise. I don't work for the CNCF and I don't know the process, so all I can do is echo what we discussed before. As far as I understand, the idea is to put that GitHub repo in good shape, agree on the white paper, at least a phase one, version one, which doesn't have to be stable or anything, and grow the landscape so that we have a good view of what exists, as a milestone. I think that would be before the end of the year, and ideally before KubeCon, because during KubeCon Chris is trying to raise more awareness and do more promotion.

(That would be better. Okay, so that means no official deadline; Chris may have better visibility, but the best case is to set a target of finishing the landscape before the end of the year, and to grow into a real working group with the goal of writing, I wouldn't say specifications exactly, but best practices: a chaos engineering guideline for all engineers working with microservices.)
I think that's interesting, because as far as I understand, the CNCF tries to avoid specifications initially; they don't want to be a governing body in that sense. So my assumption is that we do the tidy-up within about six weeks, so it's ready during November, and then at KubeCon we can meet up and discuss next year. If we look at what the serverless group has done, it's certainly interesting; speaking of specifications, they did come up with one on the side. I don't know whether we need a specification, or what exactly we need as a community, but KubeCon, and meetups like this one for those who can't come to KubeCon, are the right places to say "this is where I would like to go," and as a community we can decide that. So it's up to us to say not just where we are, but where we want to go.

(Then Seattle is a good place to discuss that.) Every time we meet in person is a good place; it's faster, it's more concrete, and it seems to work better. Even though we're all distributed across many places, somehow when we meet in person things move faster. (Thank you for your answer.)

All right, any other questions? Probably none that I can answer properly. Well, it's five minutes to the end, so let's wrap it up. Thank you very much for everything today, and let's keep the discussion going on Slack. (Thank you for managing the session.) Thank you very much. Okay, see you next time. Bye-bye.