Good evening, folks. How's everybody doing? Good. Well, first off, I want to thank you for joining us for a session this late in the day. I know everybody is ready for their evening beers right after this, but we promise you there's good content here, and nothing boring will come out of this session. True to our name, we'll start off with a story around chaos. I'm here with some crew members from T-Mobile, and before I get started with my story and the actual content: how many of you have never been on an airplane? Okay, that's good, I was expecting that. Who's flown with Alaska Airlines? Okay, a couple. Who knows about chaos engineering? Okay, that's great. So you may be wondering what the hell the connection is between all of this, right? But there's a story to it.

This crew I'm talking about started with Alaska Airlines on Monday from Seattle (we're based in Seattle), and we checked in. I had my baggage checked: hand luggage, which I should have just carried on with me, but I figured I'd check it in. We showed up at the gate, and they were looking for some volunteers. Alaska Airlines, by the way, is huge in Seattle, if you didn't know that. I was on the East Coast before, in Fort Lauderdale, and I had never heard of Alaska Airlines until I moved to Seattle and realized how big they are there. At the gate, they were looking for volunteers who could take a later flight, and in exchange (this was April 1st, mind you) the deal was you'd get $750 per person. We're like, there's nothing we're going to do on day one anyway; we can afford this stopover. It was too good a deal; the hourly rate was just mind-blowing. So three of us decided to take it. Karun and I got the coupon, a handwritten voucher, and our flights had a stopover in Detroit, Michigan. We ended up coming into Philadelphia by 8 o'clock, and another member came through Raleigh.

Little did I know that was the start of some chaotic events. After I landed in Philadelphia, my luggage made it there too, but it was locked up in the Alaska office, and I could see it from outside. There were phone numbers posted, and we tried going to different booths, but nobody was there. Little did I know that Alaska Airlines has only three flights out of Philadelphia every day: two in the morning and one at 6 o'clock in the evening. The last flight was done for the day, so those folks were checked out and at home. And when you reach their customer service line: terrible customer service, by the way, so I'll have to take that to Twitter. Anyway, I was done for the day and ready for drinks and food, so we came in. I had just my laptop, so you can imagine what I went through that night, and I had to wake up early in the morning. But this time, on Tuesday morning, I decided to validate some of my assumptions. I called their customer service number again to check whether these folks were actually working or just checked out. The central number happens to be a Philadelphia number, and when I called, they warned me: hey, we have two shifts. The first shift is done at 8 o'clock, so if you don't make it by 8 o'clock, you can't pick up your luggage until 2 o'clock.
So I made it by 8 o'clock, got my luggage, and thought everything was cool. But naturally the $750 was the big catch for me, right? I couldn't get the voucher validated online, so I called customer service again, and they tell me, oh yeah, the person that handed you the voucher forgot to do an extra validation step. And I'm like, WTF, right? So as of now the coupon is still not validated, and I'm still dealing with the pains of the chaos.

So you may be wondering, what's the story and the connection here? The connection is the ripple effects that one event can have, leading to poor customer experience. That's exactly what chaos engineering deals with: expensive customer-facing outages, and how you avoid them before your customers notice. In this case, the customer was me. Yeah.

So without further delay, we'll get started. That was our first story for the day. I'm Ramesh, a senior manager for the platform engineering team at T-Mobile. Karun is one of our brilliant engineers on the team, and he's here with us. He's got some live demos, by the way. We don't do recorded demos; we like to do live demos. We'll see how that goes. I've already mentioned my team's name: we're fondly called the platform engineering team, and behind the scenes, my team runs the container strategy for T-Mobile. If you look at the evolution of infrastructure, everything is an as-a-service model today, starting with infrastructure, then containers, platforms, and functions. You name it, we have services and capabilities in each of these stacks, and we're growing fast into the top-level stack, which is function as a service. My group's prime objective is to deliver modern, simple, secure, scalable services that are platform and infrastructure agnostic. We want to plug in the platform capabilities we've built agnostic to any infrastructure, so behind the scenes, application teams don't know where their workloads are. That's the vision we have in mind when we deliver these platform capabilities. If you didn't know: the higher up you go on the stack, the more you conform to standards; stay lower on the stack and you get more flexibility. Every application workload is unique, so at T-Mobile we're trying to expose these different capabilities from a platform perspective so that our application teams can rightfully choose what and where they want to run their workloads.

Speaking of Cloud Foundry, I see a couple of faces that may have been in the previous session around "how big is too big." If you didn't know, we are actually one of the world's largest Cloud Foundry installations, by all means. Some of that we're very proud of, some of it we're not, and I'll get to those. We have 13-plus foundations. The 36,000 containers on the slide is outdated; as of last night, it's 39,000 containers. 700 million daily transactions, 3,000-plus business-critical applications, 100-plus project teams. That's the scale at which we're running. And behind the scenes, my team is around 25 people, including my leadership group and my product manager. This group is not all doing PCF, either; we also have another platform to manage alongside it, which is PKS.
Behind the scenes, we have four different domains trying to deliver this customer experience, doing bigger things around platform intelligence, and focusing on some of the core stuff we need to deliver for PCF and PKS. PCF is all about agility redefined. I won't do too much marketing for Cloud Foundry here; you know what it's about. For us, it's all about developer agility: faster apps, more frequent changes. In fact, one of the numbers I have from our peer teams is 1,000 daytime changes that went through in fiscal year 2018, with fewer incidents and zero-downtime deployments. It's all of these radical, culture-shifting changes at T-Mobile, really embracing the DevOps buzzword. Except it's no longer a buzzword: we truly believe that if you write a piece of code, you own it when it's deployed to production.

Who knows what this is, minus my team of folks? That star. Who is that? Awesome, thank you. And do you know whose star that is? Yours? Well, I wish, but... okay. Who do you think this one is? Good. And who do you think that one is? That is T-Mobile. So yeah, this is what I call the explosion of microservices. Containers are everywhere. This trend towards the service model comes with a cost, which is that people write these microservices not knowing what kind of ecosystem those microservices live in. And that's the kind of ecosystem. Keep in mind that I used to be with Amazon at some point; this is, I think, a snapshot from 2009, so it's pretty old. With the growth they've had, you can only imagine what it looks like today. The guy that wrote this tool, I don't think he ever refreshed it. He's like, this is too much work for me to refresh, right? But we're fascinated to see all this growth with containers, the whole evolution towards microservices, this explosion.

For us, it's all about what could go wrong when your services get released into this wild atmosphere. It's a shared ecosystem, and you need to know how to cope with failures. So for us, the only constant is not change anymore; it's failure. We want to embrace it, because failure is inevitable. And no, that's not a line from Final Destination, by the way. If you've seen Final Destination, they talk about how death is inevitable; this is not that. This is our own version for chaos engineering and how we think about failures.

Coming to the problem statement: today's talk is about chaos engineering for Cloud Foundry. We spoke about developer productivity, and we spoke about customer obsession. You want your developers to build, deploy, and operate these services in the cloud, but at the same time, you want to focus on a delightful customer experience. All it takes is one chaotic event, just like that Alaska event that happened to me, to break my trust with them; today that trust is gone, and they'll have to work to win it back. The same thing happens in a service ecosystem. We are a customer-driven team. Our job is to deliver capabilities to our customers, and if we break that trust, it's an expensive break to fix. So: what if there is real chaos? That's the problem statement.

That said, before I give you the actual solution design, let me begin with another story. Once there lived a king. He received a gift of two magnificent falcons. They were so adorable that the king gave them to his head falconer to train.
Months passed. The falconer trained them for two or three months. One of the falcons was flying, but the other remained on its branch. The king got so upset, so depressed, that he called up all the wise men in his kingdom, but no one could make the bird fly. Then, after a few days, when the king came out, he saw the second falcon flying too. The king immediately calls his minister and asks him, who is the doer of this miracle? The minister says, it's a local farmer who solved the problem. The king asks the farmer, how could you do it when all the wise men couldn't? The farmer says, it was very easy, your highness. I simply cut the branch the bird was sitting on.

The moral of the story: a simple change made the bird fly, and a simple change can disrupt our systems. Not all problems need a complex solution. So with all the complex systems, all the distributed environments we have, are we prepared for chaos? What do you think about app resiliency? Yeah, rightly said. Like the slide says, we conform to the familiar and the comfortable; that's our usual comfort zone. But to get out of your comfort zone, you need to learn to destroy the branch of these network connections and free yourself into the glory of app resiliency. If you want to build better applications, you've got to break your applications.

Great. So let me reiterate the problem statement. Before that, we need the definition of chaos engineering: it is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production. That's the definition from Netflix's Principles of Chaos Engineering. So, the problem statement. At T-Mobile, we started looking at the chaos engineering problem in two different ways: one, infrastructure-level chaos attacks; the other, application-level chaos attacks. T-Mobile is not a single-application company. We have 3,000-plus applications belonging to multiple internal customers running on shared foundations, so performing a chaos attack at the infrastructure level is definitely going to create customer impact at different levels. As Ramesh said about how customer-obsessed we are, we came up with another concept, the application-level chaos attack, where we can perform a single targeted attack on a particular application or its dependency without affecting any other application running on the same Diego cell or on the foundation.

We use two open source solutions here. Turbulence++ is a wrapper around another open source solution called Turbulence, whereas Monarch is our own internal toolkit, which we have started open sourcing, and it is responsible for performing application-level chaos attacks. In our journey of evaluating existing solutions, we didn't want to reinvent the wheel, so it went like this. We started with Chaos Lemur. The features you see here are only a finite few, just to give you a brief understanding of how the comparison goes between the different solutions that exist in the market today. Chaos Lemur can only kill VMs, but we also wanted to kill a random process, introduce latency, introduce CPU and memory hogs, and have application knowledge as well. So Chaos Lemur was falling short.
We also evaluated a commercial offering called Gremlin. At a later point in time they added application knowledge to their toolkit, which was good, but we were still hunting for a solution in the open source space. That's when we came across Turbulence. Turbulence, as I said, is a chaos engineering toolkit at the infrastructure level. It can kill virtual machines, kill a process, introduce latencies, and introduce CPU and memory hogs, but it doesn't know where your application is running in your cluster. That's where T-Mobile's CTK comes in, which has all the check marks; the CTK is a combination of Monarch and Turbulence++.

So, introducing Monarch and Turbulence++. Both of these solutions enable initiating sophisticated failure-injection tests on any BOSH-deployed infrastructure, so that could be PCF or Kubernetes managed by BOSH, and apps deployed on such infrastructure can also go through sophisticated chaos engineering attacks. Now let's look at the high-level features we offer as part of Turbulence++ today. Turbulence as such is both an API server and an agent, and the agent gets deployed on each of the Cloud Foundry virtual machines. The features highlighted here are what we've added on top of the open source Turbulence: killing a VM, killing a process, pausing a process, introducing stress, corrupting the disk, limiting bandwidth, reordering packets, targeted blocking, blocking DNS, and duplicating packets. We'll look at some demos to get a feel for the functionality as well. Monarch, by contrast, does application-level chaos attacks. It can discover where your application is running inside a cluster of multiple virtual machines; it can block traffic, ingress or egress, to the application; and it can introduce latency at the service level. Say there is service-to-service communication happening and you have programmed a timeout mechanism: what happens if your dependent service exhibits latency? How is your service going to act? Those kinds of scenarios can be simulated through Monarch, along with bandwidth restriction and crashing random app instances. We'll look at demos of some of these functionalities too.

On to infrastructure-level chaos engineering. Cloud Foundry is a collection of multiple virtual machines. Imagine a scenario where the process called rep, which is responsible for managing the entire lifecycle of the containers running on a Diego cell, is killed. What happens if a virtual machine goes down? What happens to the application instances, the containers, running in that Diego cell? Cloud Foundry is built on timeout mechanisms: what happens if latency is introduced between the Gorouter and a Diego cell? All these simulations can be performed with Turbulence++ today. Given a Cloud Foundry cluster, you can choose which particular virtual machine, or which random virtual machine, you want to kill. You can also kill a particular process running inside a Diego cell and watch how the cluster behaves while that process is down, or introduce a fake latency between the Gorouter and a couple of other virtual machines. Turbulence++ is our solution for all of this.
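As a quick aside on the architecture: since Turbulence is an API server plus a per-VM agent, an attack boils down to an HTTP request telling the API server which deployment and job to target and which task to run. Here is a minimal sketch in Python; the endpoint path, payload shape, field names, and credentials are approximations for illustration, not the exact Turbulence API schema.

```python
# Minimal sketch of driving a Turbulence API server directly.
# The endpoint path and payload shape below are approximations for
# illustration; consult the Turbulence release docs for the real schema.
import requests

TURBULENCE_API = "https://10.244.0.34:8080"  # hypothetical API server address
AUTH = ("turbulence", "api-password")        # hypothetical credentials

incident = {
    # Which VMs to target: one random diego-cell in the cf deployment.
    "selector": {
        "deployment": {"name": "cf"},
        "group": {"name": "diego-cell"},
        "id": {"limit": "1"},
    },
    # What to do to them: pause the sshd process for 60 seconds.
    "tasks": [
        {"type": "pause", "process_name": "sshd", "timeout": "1m"},
    ],
}

resp = requests.post(
    f"{TURBULENCE_API}/api/v1/incidents",
    json=incident,
    auth=AUTH,
    verify=False,  # BOSH Lite lab setup; use proper certs in real environments
)
resp.raise_for_status()
print("incident accepted:", resp.json())
```

The same request shape would cover the other task types the team lists (stress, disk corruption, bandwidth limits, packet reordering, and so on), just with a different entry under tasks.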
It's all written in Go, and we have a demo showing three of these capabilities. Shall we look at the demo? Yeah, let's go. For this demo, using Turbulence++, we're going to kill a virtual machine, block SSH traffic to a Diego cell, and manipulate network traffic. I've used a BOSH Lite environment, but this can very well be replicated on any Cloud Foundry or Pivotal Cloud Foundry cluster. Is the font visible? Check. Okay, cool, thanks.

This is BOSH Lite running on my laptop. I've already logged into BOSH, and these are the deployments I've got: a CF deployment, and Turbulence. Let's look at the Turbulence BOSH add-on. Listing the VMs in the turbulence deployment, there is an API server running, listening on this IP address. It's very simple, and we haven't changed much for this demo; this part is already in the open source Turbulence solution. Now let's look at the virtual machines of CF. You can see we have a bunch of virtual machines in the BOSH Lite Cloud Foundry deployment: the adapter, the API server, a worker, the console, the database, the Diego API, and the Diego cell. We have only one Diego cell here. As you might know, the Diego cell is the one that hosts the application instances, the containers; and in this BOSH Lite environment, I have only one.

I'm going to use the Chaos Toolkit, which has a driver for Turbulence, to initiate a pause-process attack. What does that mean? Let's SSH into the Diego cell first: you just set the target to the actual deployment and do a bosh ssh into the Diego cell. As you can see, we're inside the shell of the Diego cell, and it's pretty responsive right now. Let me exit from here; I'm going to repeat the same step during and after the chaos attack. The Chaos Toolkit Turbulence pieces are all open sourced as well, and we run a bunch of Python scripts for these experiments. Before the first attack, let me give you a quick look at the Chaos Toolkit JSON itself. It has a standard title and description of the kind of attack you're trying to perform; here, we're pausing SSHD access to a random Diego cell. There is a steady-state hypothesis, which is empty for now. The most important part is the method: the action is an attack that pauses a process with process name sshd, deployment cf, on a Diego cell, selecting any ID, with a limit of one. So I'm going to run this pause-process experiment. Before I do, let's check that the SSH connection is fine; and it's pretty responsive again. Let's go back and initiate the attack. It's running the experiment... now let's try to SSH. That's live demos for you. Cool. It's not working. So let's move on to the second demo and see if we can get the network latency attack going.
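For reference while the demo recovers: the pause-process experiment described above maps onto the standard Chaos Toolkit JSON layout roughly as follows. This is a sketch reconstructed from the talk; the driver's module and function names are hypothetical stand-ins for whatever the Turbulence driver actually exports.

```python
# Sketch of the pause-process experiment described above, written as a
# Python dict and saved as the JSON file the Chaos Toolkit CLI consumes.
# The provider module/func names are hypothetical placeholders for the
# real Turbulence driver's exports.
import json

experiment = {
    "title": "Pause sshd on a random Diego cell",
    "description": "Impact SSH access to one Diego cell for 60 seconds.",
    "steady-state-hypothesis": {
        "title": "empty for now",
        "probes": [],
    },
    "method": [
        {
            "type": "action",
            "name": "pause-sshd-process",
            "provider": {
                "type": "python",
                "module": "turbulence.actions",  # hypothetical module name
                "func": "attack_pause_process",  # hypothetical function name
                "arguments": {
                    "process_name": "sshd",
                    "deployment": "cf",
                    "job": "diego-cell",
                    "id": "any",
                    "limit": 1,
                    "timeout": "1m",
                },
            },
        }
    ],
}

with open("pause-process.json", "w") as f:
    json.dump(experiment, f, indent=2)

# Then run it the same way as in the demo:
#   chaos run pause-process.json
```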
That's the challenge with live demos, but hang tight, guys. Somebody is running a chaos attack on us; that's the challenge here. I don't know who it is. Please, if you're doing something, stop.

So we have a ping operation running against the Diego cell, and you can see the round-trip time is well under a millisecond, around 0.06 milliseconds. Let's try to inject some... ah, I think it's because I'm connected to the VPN. Should we disconnect from the VPN? Yeah. I was on the VPN, so it was interfering with the internal network of the BOSH Lite. So let's start the experiment again, going back to the pause-process attack, the original first part of the demo, because we want our demos to be successful. Okay, let me exit. Let's start the pause-process experiment. The steady-state hypothesis... the attack has been initiated. Now let's try the SSH connectivity. As you can see, there's a long pause, because we have paused the SSH process for about 60 seconds; you can see it in the Chaos Toolkit file here, a timeout of one minute. After 60 seconds, this lock on SSH connectivity to the Diego cell is released automatically, so for about 60 seconds, SSH connectivity to this particular VM is lost. You can run this experiment on any other process as well, like the rep process: what happens if you pause or kill the rep process for 60 seconds? What happens to the application instances running on that Diego cell? ...And yeah, you see, it got released after 60 seconds. Now I'll exit and try again; since no experiment is running, I should be able to SSH without any issue. And it's pretty responsive again. So that's our first failure attack on BOSH, pausing a process.

Now let's move on to the other attack, manipulating traffic to the Diego cell. For that I have another experiment, defined here as a network-attack JSON file. It has the title, the description, the standard Chaos Toolkit structure. The method is a network attack on the Diego cell: the action is a control-network attack with 5 percent packet loss and a delay of 100 milliseconds, timing out after one minute. So we're injecting a 100-millisecond delay into ping operations against the Diego cell. Let's run that; again we follow the same routine, running the experiment with the network-attack JSON through the Chaos Toolkit. The attack has started. Now let's list the virtual machines, get the IP address of the Diego cell, and ping it. You can see the time factor here: approximately 100 milliseconds, always with a standard deviation of 10 percent either way, so it's taking about 100 milliseconds to come back. We also spoke about the 5 percent packet loss: if we keep this ping, the ICMP traffic, running, you'll find that at some point within that one minute some of the packets are deliberately lost. The Turbulence agent deployed inside the Diego cell plays the role of dropping those packets. And as you can see, there are two warnings here; that's the packet loss. We've kept this attack running for about 60 seconds again, which is why you see that, with 5 percent of the packets lost.
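Eyeballing the terminal works on stage, but the before/during/after ping check can also be turned into an automated steady-state probe. A rough sketch follows; the Diego cell IP and the 50 ms detection threshold are assumptions for a BOSH Lite setup.

```python
# Rough sketch: measure ping RTT to the Diego cell so the latency attack
# can be verified programmatically instead of by eyeballing the terminal.
# The target IP and the 50 ms "this is clearly slower" threshold are
# assumptions for a BOSH Lite environment.
import re
import subprocess

DIEGO_CELL_IP = "10.244.0.34"  # hypothetical; read it off `bosh vms`

def average_rtt_ms(host: str, count: int = 5) -> float:
    """Ping the host and return the average round-trip time in ms."""
    out = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True,
    ).stdout
    # Linux ping summary line looks like:
    #   rtt min/avg/max/mdev = 0.042/0.061/0.089/0.017 ms
    match = re.search(r"= [\d.]+/([\d.]+)/", out)
    if not match:
        raise RuntimeError("could not parse ping output")
    return float(match.group(1))

baseline = average_rtt_ms(DIEGO_CELL_IP)
print(f"baseline avg RTT: {baseline:.2f} ms")
# ... start the network-attack experiment here ...
during = average_rtt_ms(DIEGO_CELL_IP)
print(f"avg RTT under attack: {during:.2f} ms")
assert during > baseline + 50, "expected the injected ~100 ms delay to show up"
```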
If you come back and look now, the time is back to normal; it was around 100 milliseconds, and then there's a quick transition back. The attack ran for about 60 seconds, and there are any number of combinations you can keep trying: inject some latency deliberately into one of the Diego cells, 100 milliseconds or even more, and see how your cluster behaves. That's one way of running the attacks.

Now let's perform the third attack: killing the virtual machine itself. What happens if the Diego cell goes down? What happens to the application instances running in that particular VM? I'm going to kill this Diego cell running in BOSH Lite. The kill-Diego-cell JSON object again has the standard Chaos Toolkit structure. The action is to terminate the Diego cell, and the attack function kills any random Diego cell, limited to one for now. If you have 5 Diego cells running in your cluster, or 100, you can set the limit to any number; since I've got only one, I'm going to kill that one. The experiment ran, and it got killed. When I list the virtual machines now, you can see the Diego cell is missing: there's the Diego API and the Doppler, but no Diego cell. In the previous listing, you could see the Diego cell sitting right there alongside the API server; now it has been killed by our Turbulence++ experiment. BOSH, by nature, has self-healing and self-resilience built in, so it's going to bring up a new Diego cell after a couple of minutes. So that's, in brief, how Turbulence++ works, and there are more functionalities we've added and more experiments you can perform.

Let's move on to application-level chaos engineering. Like I said earlier, application-level chaos engineering lets you perform a very targeted attack on an application running inside a cluster. Say I have an application instance, a droplet container, deployed inside this Diego cell, and the app depends on a MySQL database; it could be a shared service as well. What happens if traffic between the application instance and the MySQL database is blocked? How does your application behave? What happens if there is a UI front-ending this microservice and you crash the microservice; how does the UI behave? And what if latency is introduced between your container and the MySQL database? All three of these simulations can be performed with Monarch, which again is open source. There are a lot of combinations you can think of. What if there are multiple applications using the same shared MySQL database? You can run a specific, targeted block-traffic attack on one particular application without impacting any other applications running on the same Diego cell or in the cluster. That's what application-level chaos engineering really means.

Let me jump in here for a second. Turbulence is focused on infrastructure; I just want to make sure folks understand that Monarch is our newest capability, focused on the application level. So there are failures at two levels that we're targeting.
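Going back to the kill-VM attack for a moment: instead of listing VMs by hand until the cell reappears, you can poll for BOSH's self-healing. A sketch follows; the JSON layout assumed here (Tables, Rows, instance, process_state) matches recent BOSH CLI v2 output but may vary by version.

```python
# Sketch: poll `bosh vms --json` until the Diego cell reappears in a
# running state, to observe BOSH's self-healing after the kill attack.
import json
import subprocess
import time

def diego_cells():
    out = subprocess.run(
        ["bosh", "-d", "cf", "vms", "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Assumed CLI output shape: {"Tables": [{"Rows": [{"instance": ...,
    # "process_state": ...}, ...]}]}; may differ by BOSH CLI version.
    rows = json.loads(out)["Tables"][0]["Rows"]
    return [r for r in rows if r["instance"].startswith("diego-cell")]

deadline = time.time() + 10 * 60  # give BOSH up to ten minutes
while time.time() < deadline:
    cells = diego_cells()
    if cells and all(c["process_state"] == "running" for c in cells):
        print("Diego cell is back:", [c["instance"] for c in cells])
        break
    print("waiting for BOSH to resurrect the Diego cell...")
    time.sleep(30)
else:
    raise SystemExit("Diego cell did not come back within 10 minutes")
```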
As I said, T-Mobile is not a single-application company. We have multiple customers using our foundations, so if I go and perform a chaos attack on the infrastructure, I have to get approval from many other customers. That's not something I want to do. So instead, how about simulating those failures at the application level, for an application belonging to one particular customer? That's app-level chaos engineering.

You might be familiar with the concept called the butterfly effect. Do you know what the butterfly effect is? Exactly: the effect I went through with Alaska is the butterfly effect, one event leading to another. In industry it's also called the thundering-herd effect, where you have a herd stampeding at you because something triggered an event. Well said. The textbook definition of the butterfly effect is that the flutter of a butterfly's wings somewhere in Brazil can lead to catastrophic effects somewhere in Texas. That's the theory. Applying it here: say there is an application deployed on Cloud Foundry, built from microservices. There's a front-ending application, the web app; it has a UI and depends on service one. Service one in turn depends on a database, and service one also calls service two, which calls a third party. Now, what happens if the third-party application starts misbehaving? Service two doesn't get its response from the third party. Then what happens to service one, which depends on service two, which depends on the third party? There's a cascade; you can clearly see a timeout in service one. And what happens if the database connectivity from service one goes down? All three of these together are going to create unfavorable behavior at the UI level, which in turn creates customer impact, a UI issue on the client side.

For exactly these scenarios, don't we have something like a circuit breaker that can cover us here? Great question. You're going to configure a circuit breaker; if it's a Spring application, you can leverage the Spring Cloud Services circuit breaker, and the web application, to minimize the UI impact, can invoke the circuit breaker and fall back to whatever alternative it's configured with. But what we're saying is that Monarch creates a job for the circuit breaker: it gives you a chance to verify that the circuit breaker actually works the way you programmed it, the way you configured it. That's one way Monarch can give you wings, testing all these kinds of latency and other issues within your app.
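On that circuit-breaker point: the demo that follows relies on Hystrix via Spring Cloud Services, but the pattern Monarch exercises is small enough to sketch. Here it is in Python, with a hypothetical service URL and threshold values; this illustrates the idea, not the Hystrix implementation.

```python
# Minimal sketch of the timeout-plus-fallback pattern a circuit breaker
# provides, so the attacks below have something to trip. The real demo
# uses Hystrix via Spring Cloud Services; this is just the concept.
import time
import requests

FORTUNE_SERVICE = "http://fortune-service.local"  # hypothetical URL
FALLBACK = "Your future is unclear."              # static fallback string

_failures = 0
_open_until = 0.0
THRESHOLD, COOLDOWN = 3, 30.0  # open after 3 failures, retry after 30 s

def get_fortune() -> str:
    global _failures, _open_until
    if time.time() < _open_until:
        return FALLBACK                 # circuit open: short-circuit the call
    try:
        resp = requests.get(f"{FORTUNE_SERVICE}/random", timeout=0.25)
        resp.raise_for_status()
        _failures = 0                   # success closes the circuit again
        return resp.text
    except requests.RequestException:
        _failures += 1
        if _failures >= THRESHOLD:      # too many failures: trip the breaker
            _open_until = time.time() + COOLDOWN
        return FALLBACK                 # degrade gracefully toward the UI
```

Blocking the database, injecting latency, or crashing the dependency should all land in the except branch here; that is exactly the behavior Monarch lets you verify.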
For the next demo, I'm going to talk about a fortune UI application. It's an open source application backed by a fortune service, a microservice, and the fortune service is backed by a MySQL database. The fortune service pulls a random fortune from the MySQL database, and the fortune UI talks to the fortune service to display whatever random fortune string is passed to it. Now: what happens if you introduce latency between the fortune service and the MySQL database? What happens if you crash the fortune service, killing the service itself; how does the fortune UI behave? And what happens if, beyond latency, you block the service entirely, so the fortune service can no longer talk to the MySQL database at all? In all such scenarios, the fortune UI is configured with a Hystrix circuit breaker, and the circuit breaker has to fall back to an alternative so that the UI isn't impacted when these kinds of attacks hit the dependent, downstream services. So let's get into a live demo again. I'm going to show you a web application, the fortune UI. Can you see the screen?

While he reconnects to the VPN, because this part of the demo requires it, I'll be the storyteller. Has everybody here been in a fire drill? Not the fire drill where your boss calls you and says get this done, right? Not that; an actual fire drill. Who's been in an actual fire drill at work? Do you know why we do them? Practice? Why do you think we need the practice? Bingo, yes. If you think about a fire drill, you have a system with a steady state and a known recovery path, and what the fire drill does is validate that known recovery path, because when a real incident happens, you don't want to fail. The practice drill validates the system's recovery path so you can get everyone out of the building safely and securely. That's the whole point of a fire drill, and it's very much the analogy for chaos engineering: every system has a known recovery path, you make assumptions, and if you don't validate the known recovery paths, you're bound to fail. So that's the next story I had. Not really a story; it just came to mind. Are we done? You have another story? I don't have another story, but we'll take a question. Okay. Yeah.

No, so there's a graduation path to everything, just like you go through your grades to get to high school, and then to college. You don't want to start off by targeting your business-critical apps in production and taking them down, because your CEO or somebody is going to call you, and it's WTF again, right? If you think about Netflix's take on chaos engineering, they were nicely positioned to be the company that started in a lower environment, graduated, and today runs attacks live in production on Amazon's infrastructure. They take an actual AZ down, if you didn't know, and you know the impact, because it happens during live hours. So there's a graduation path, and that's what the principles of chaos engineering advocate. We're nowhere close to that; in fact, we're careful even with our non-prod environments, because non-prod is production to us. Any team trying to go down this path has to do it in cycles and not just aspire to go big. Right now our focus is a specific set of apps in non-production, in a controlled environment, and then we'll take it further, because you want to understand the blast radius, and you want to make sure the teams you're interacting with have specific runbooks for the recovery paths.

Can we get back to the demo here? So we have two applications: the fortune service, a microservice, and the fortune UI, which talks to the fortune service. Let's look at the services. I'm using the cf CLI, connected to my org and space. You can see a circuit breaker service configured and bound to the fortune UI app here.
And a fortune database, of type MySQL, is bound to the fortune service. There's also a service registry for service discovery; all the apps in this space are bound to the service registry so they can talk to each other. Now let me open the UI. I copy this URL, go here, and you can see random fortunes, one-line strings, keep showing up. It's pretty responsive; you don't see any issue. Now let's perform a chaos attack on this application. I liked that last one, I don't know if you saw it, but it's random.

Monarch has been open sourced. It's based on Python, so I need to import the required library, and I need a config YAML file, because I have to tell Monarch which environment, which Cloud Foundry target, I'm connecting to. We have a foundation in our own environment, and I have to connect to it via VPN. Is that a valid org? Yeah. There's an extra config; let me go fix that. Okay, we have it now. The first thing we do is discover where the fortune service lives within the cluster. All I need to provide is an org name, a space name, and the application name, and app.discover goes and finds where my application is running inside a cluster of hundreds of virtual machines. It takes a couple of seconds to identify where the app is running and load the app object. The first attack we'll do once this finishes: blocking the traffic from the fortune service to the MySQL database. It's taking some time to discover... okay, the app has been discovered. Now I'm going to block services on the fortune service; the MySQL database is what I'm blocking. Anybody have an idea what's going to happen when we block it? Fallback, right. The circuit breaker has kicked in, and you can see "your future is unclear." That's coming from the circuit breaker. We didn't want to impact the UI, so a static string, "your future is unclear," has been programmed into the circuit breaker to show up here. That's what blocking the service means: we've blocked the microservice's path to the MySQL database. To roll back, all we do is unblock services. Done. When you unblock, you see everything come back into action.

One thing I want you to note here is that the response from the fortune service is normally pretty quick. Now I'm going to inject latency into the fortune service, using manipulate-traffic. What do you think will happen with latency injection? Anybody? A timeout? So I've given 100 milliseconds of latency for now, with a standard deviation of 10 percent, so it can be plus or minus 10 milliseconds. When I start this network manipulation, you can see it takes time; there's a pause. I think we could increase the timeout further, but it's pretty obvious it's taking time. Note that you have to un-manipulate before you can target a new attack; that's one of the improvements we're working on.
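Pulling the Monarch steps from this demo together in one place: here is a sketch of the session as a Python script, covering the block, the latency injection, and the crash attack shown next. Monarch is open source, but the method and config names here (load_config, App.discover, block_services, manipulate_network, crash_random_instance) are approximated from the talk; check the Monarch repository for the exact API surface.

```python
# Sketch of the Monarch session above as one script. Method and config
# names are approximated from the talk and are hypothetical; consult the
# Monarch repo for the real API surface.
import time
import monarch  # hypothetical import; Monarch is Python-based

# Point Monarch at the target foundation (normally via a YAML config).
monarch.load_config("monarch-config.yml")

# Discover where the app's instances live across the cluster's VMs.
app = monarch.App.discover(org="demo-org", space="demo-space",
                           name="fortune-service")

# 1. Block traffic from the app to its bound MySQL service, then restore.
app.block_services(services=["fortune-db"])
time.sleep(60)                # watch the circuit breaker fall back
app.unblock_services(services=["fortune-db"])

# 2. Inject ~400 ms of latency (plus or minus 10%) on the app's traffic.
app.manipulate_network(latency=400, deviation=10)
time.sleep(60)
app.unmanipulate_network()    # must undo before targeting a new attack

# 3. Crash a random app instance and let Cloud Foundry self-heal.
app.crash_random_instance()
```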
Now let's inject 400 milliseconds of latency into the same app. It's running; you can see there's a noticeable pause of about 400 milliseconds before the message comes up. Sometimes there's a timeout and the service registry picks it up, and sometimes there's no timeout. Now I'm going to un-manipulate, and then finally perform one more test: crashing the app instance, the fortune service here. Any ideas what's going to happen when you crash the instance? Karun hinted at it earlier when he did a similar attack. Correct: that's the beauty of Cloud Foundry, self-healing. When you crash this instance, Cloud Foundry will bring a new instance back up in its place. You can also check cf events on the fortune service; on the left side, you can see the crash events, pretty clear here. And you can see the circuit breaker coming into action again. After a few seconds, once the app is back, we can check the instance count of the fortune service, because Cloud Foundry will have brought up a new app instance in its place: you've got one out of one, to maintain the desired state, and it started working again. Those are some of the attacks you can perform on any target application. We just saw how to block a service, how to introduce latency, and how the circuit breaker plays out.

Let's move on to some of the limitations, improvements, and future enhancements we're planning. Monarch right now can attack only one cluster at a time. Imagine you have applications deployed across multiple foundations, spread across multiple clusters; Monarch at any point in time can only attack one application in one cluster. That's a limitation we're trying to overcome, and we want to make more enhancements and improvements there. Turbulence++ and Monarch are two different toolkits right now, one for infrastructure and one for applications; there is a possibility of merging them into one solution, most likely in the next version of Monarch. And there are evolving design patterns like Istio and Envoy. Envoy, which runs as a sidecar container next to every application instance, can also do this kind of programmable fault injection, performing simulations like what Monarch does today. So we want to see how we can converge the effort we're putting into Monarch with what Envoy offers, or whether there's a possibility of the two working together going forward. Those are the three things we want to work on.

Do you want to talk about game days? Oh sure, let me take the mic. So we're towards the end here. Anybody heard about game days? Game days are essentially an industry standard for anybody doing chaos engineering: you get your teams together, run targeted attacks in a confined environment, and validate your runbooks. We're working to run game days for our app teams. In fact, we completed our first game day with the team that does coverage maps; if you're a T-Mobile customer, there's a personal coverage checker where you can see T-Mobile coverage, and we ran an attack on their applications, in an NPE, a non-prod environment, not in prod. Like I mentioned, you've got to start off in a smaller environment. So, forging ahead: six months ago, when we did our first talk on chaos engineering, we were exploring the capabilities and tools. Here we are today actually running game days, and we want to continue this momentum of running more game days for our app teams.
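Back on the crash demo for a second: to verify the self-healing programmatically instead of refreshing the UI, you can poll the cf CLI for the instance count, in the same spirit as the BOSH check earlier. A sketch follows; parsing the text output of cf app is crude but fine for a demo, and the CF API would be the robust route.

```python
# Sketch: after crashing the instance, poll `cf app` until Cloud Foundry
# reports the desired instance count again.
import re
import subprocess
import time

def running_instances(app_name: str) -> tuple[int, int]:
    out = subprocess.run(
        ["cf", "app", app_name],
        capture_output=True, text=True, check=True,
    ).stdout
    # `cf app` prints a line like: "instances:   1/1"
    match = re.search(r"instances:\s+(\d+)/(\d+)", out)
    if not match:
        raise RuntimeError("could not find instance count in cf app output")
    return int(match.group(1)), int(match.group(2))

for _ in range(20):
    running, desired = running_instances("fortune-service")
    print(f"{running}/{desired} instances running")
    if running == desired:
        break   # Cloud Foundry has restored the desired state
    time.sleep(15)
```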
If everything goes well, you'll see us back at SpringOne to talk about our advancements with chaos engineering. This is the last slide. Just to sum up: we're clearly moving from the Joker of 2008, who says chaos is fair, to Game of Thrones today, where Littlefinger says chaos isn't a pit, chaos is a ladder. Every person wants to climb the ladder, but still they falter, or some people don't want to climb at all. That's because they're so rooted in their own beliefs, or some kind of faith that everything will just keep working well. What we're saying is: you have to decide whether you want to stay in the pit or climb the ladder. That's where this comes in. Yeah, nicely said.

Before we wrap up: at 9 o'clock tomorrow, in the keynote, we'll be back on stage. There's another exciting demo, which won't be a live demo, because it's the keynote and we want to play it safe; no validated recovery path there. We'll show similar content, but in a more condensed, nicer format than this. I won't say this wasn't nice, but there'll be something new at 9 o'clock. It's only a 10-minute talk, so I encourage everybody to come and cheer for us. Thank you.

Questions? We can take questions. Yes. [Audience question, partly inaudible, about fault injection and what's coming next.] Sure. For Monarch, I think we've already scoped the next release, where we want to inject security configuration at the app level: for a given runtime application, inject a specific configuration and see how your application behaves. That's pretty much there. Anybody else? Questions? Okay. Thank you, guys. Thank you.