Good morning, good afternoon, good evening, everyone, wherever you are in the world. The session we will be having now is called "Mangle: Systematically Introducing Chaos into Your Applications", and I'm very happy to welcome Ashwati and Avinash on stage now, the speakers for the session. Ashwati is going to begin. Are you ready, Ashwati?

Yes, thank you so much.

Cool, the stage is yours.

All right. Good afternoon, good evening, good morning. Friends, non-Romans, countrymen, lend me your ears and put down your phones. I come not to talk about chaos engineering or Mangle or how to market them well, but with a much higher purpose, and that is to keep you all awake after a heavy lunch. My greatest intent is to make sure that the session doesn't turn into the much-needed lullaby that everyone is looking forward to this afternoon. I'll get to the reason why we were given a post-lunch time slot in just a minute.

All right, let's start with a fun activity to jog everyone's memory a little bit. What you see on screen is a clip. Could you tell me the name of the process and why it is important? Avinash, are you able to see any responses on the discussion panel? No responses? Yes, that is right. It is forging. For your information, my eight-year-old son was able to answer this. So if you think that I'm putting you under a little bit of pressure with that tidbit, then you're right.
It is intentional. Now, in this process of forging there are quite a few elements involved, so let's try listing them one after the other. What you can see here is, obviously, the tools, which enable the blacksmith to subject the metal to unfavorable and harsh conditions. We obviously have the blacksmith, who puts the effort into the process. There are the raw materials themselves: the iron ore, the steel, whatever metal you're using. And obviously there is the sword, or the metal object that you're trying to forge. This is the before-forging image, and this is after forging. Now, any action movie enthusiasts who can try and name that sword for me? No? Going once, going twice. All right, that is the Japanese samurai sword. It is also called the katana. It's known for its resiliency, its durability, its power and its speed.

Now if we were to draw parallels between the forging process and chaos engineering as it applies to software today, what would each of these elements stand for? We have the raw material. What would that stand for? It obviously stands for design and code, and that's basically the crux of any application that you're building. There's the blacksmith, who puts in the effort. What is the equivalent of that in software? Who are the people who do it? That's developers, testers, the site reliability engineers, and the process enablers in general: the managers who manage those teams, or the organization as a whole, who enable the whole process of chaos engineering.

What you can also see is that, as a bonus, the blacksmith develops a very chiseled body, powerful biceps and great abs. So what is the bonus that we get out of this process if we were to follow it? That's right: rhino skin.
So let me break it to you: chaos engineering is not a very easy process to adopt and sustain. Any team that newly starts off on this process has to go through a lot of learning and, literally, name-calling within the teams, because they come under a lot of pressure when they initially get into chaos engineering. That's why I call it the rhino skin, which I'm sure my team and I have developed over the years. We are quite thick-skinned, and nothing actually bothers us much now. And obviously there is the knowledge base that we gain. You get to understand how your applications behave and how your platform performs: does it fail completely when a fault is introduced into the system, or is it able to resiliently come back to its normal working state? So there is a vast amount of knowledge that you gain in the process as well.

Now obviously there are the tools. What is the parallel for this in software? In chaos engineering, these are all the tools that are available in the market and that you would like to try out. It could be Mangle, it could be Gremlin, Chaos Monkey, whatever your organization adopts. And finally the katana. What is the katana sword for us? What is it that we are aspiring to develop? That's basically a very resilient product, and that's what we aspire to create through this process: your baby. In the technical sense, every time we talk about a product, it's always "my baby", "your baby". Now, people, if you're trying to determine the gender of the baby, let me remind you that it is your unconscious bias about gender diversity that is making you do it, and you're not supposed to do that, right?
Okay, jokes apart, that is chaos engineering for you in the simplest terms: the process of forging software to make it tough and resilient by subjecting it to controlled, unfavorable environments. That's why we draw a parallel between the forging process and chaos engineering as it applies to software today.

Now, why do you have to listen to us? Who is Ashwati, and why should you listen to us anyway? Organizations and teams are at different levels of maturity when it comes to adopting chaos engineering as a step in the software development process. It's usually not an easy process. If you are driving chaos engineering initiatives and you walk into a room and are greeted by applause, then you're seriously doing something wrong. No one is going to greet you with applause. It's primarily a thankless job.

Anyone familiar with this particular picture? All right, if you guessed it, you're correct: it's Prahlad and Holika. Obviously, I'm not referred to as Prahlad in VMware. I am definitely the Holika there. For those uninitiated in Indian mythology, what that means is I am the witch who threw your baby into the fire, or in short, the person who is involved in enabling the process of chaos engineering in VMware. And who is Avinash?
Avinash is seen as the hand who stokes the fire. In short, he is one of the key developers of Mangle and contributes heavily to the development of its features and faults. So, as you can see, we are people who evoke a lot of powerful negative feelings every time we walk into a room full of engineers and leaders who are trying out chaos engineering for the very first time. We must be doing something right, because, as I mentioned, it is not an easy process to adopt and sustain. So this is an opportunity for you to learn from our experience in dealing with various teams within VMware who have adopted this process and who today, we can say for sure, are progressing successfully and gaining a lot from it. I'll not get into the agenda much; we will get to it as we progress.

So let's get to the topic. Why Mangle? There are a lot of tools in the market today for chaos engineering. Why should we go with Mangle? If you were a software blacksmith, keen on building resilience and strength into the applications you develop by subjecting them to unfavorable conditions, what kind of workbench would you prefer? Would you prefer the first one, a crowded mess where you have just too many tools to choose from, you don't know what is lying on there, and you have to keep changing tools depending on whether you're building a sword or a rod? Or would you prefer a neat, organized workbench with a specific set of tools that can be used irrespective of what you are building? The chaos engineering tools that are available in the market today are like the first workbench.
You need a different set of tools depending on whether you're building for, say, on-prem versus SaaS, or deploying on K8s versus Docker. Mangle aims to give you a clean, single platform of tools that can be reused irrespective of where you are developing and hosting the application. It makes chaos engineering a lot simpler and easier to manage, and the icing on the cake is that it is open source and very easy to extend as needed. We will come to the framework within Mangle that enables you to extend it easily.

All right, what you see in this slide is the value proposition of Mangle. I will not get into details about what the list is. For all the keen observers in the audience, I'm sure you will have noticed that the list for Gremlin is longer than the one for Mangle. So let me give you one more data point to think about. That's one. This is another. Now it becomes a lot clearer: since the difference is in millions, the length of the list seems very insignificant now. Probably that is also why we have an early afternoon session today and Kolton Andrus has a keynote in the evening. And with that comment I have made sure that the conference committee will not extend an invitation to me to present in the future.

But again, for the folks who are still not satisfied,
I have only one thing to say: it's free.

Okay, jokes aside, what this also talks about is the commitment of these 10 engineers who worked hard and put out this tool for the larger community to use. As in the case of any successful open-source project, it is always the larger developer community that contributes and makes it successful. Mangle still has a long way to go to garner that kind of support, and I hope we get to that point very soon.

In brief, Mangle is a tool that allows you to orchestrate chaos. It enables you to run a lot of chaos engineering experiments, with a wide variety of faults, on whatever platform you have developed and hosted your applications on. It's tried and tested against VMware products as well as all the common cloud platforms out there. It enables you to run faults on K8s, Docker, vCenter, or any remote machine that you might have. It has a very efficient custom fault plug-in model, based on PF4J, which enables you to plug in new faults on the fly without rebuilding everything from scratch.

At a high level, it's very simple. There is a data layer for Mangle, the persistence layer, and that's built on Cassandra. Usually we build these out as containers, so in a typical single-node deployment of Mangle there is a Cassandra container running as well as a Mangle app container. It also allows you to have multi-node clusters so that high availability of the tool is taken care of. The cluster management part is handled by Hazelcast, and there is a module, completely Spring REST based, which handles all the fault services, right?
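Because the fault services sit behind a REST layer, an experiment can in principle be described as a plain JSON document and posted from a script. The sketch below only builds such a request and does not send it. The base URL, the path, and every field name are hypothetical placeholders, since the talk does not show Mangle's actual routes or request schema.

```python
import json
import urllib.request

# Hypothetical: the talk does not give Mangle's real base URL, routes, or schema.
BASE_URL = "https://mangle.example.com/api"


def build_fault_request(fault_path, payload):
    """Build (but do not send) a POST request describing one fault injection."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{fault_path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative field names only; consult the real API documentation before use.
request = build_fault_request(
    "faults/cpu",
    {"endpointName": "demo-remote-machine", "timeoutInMilliseconds": 15000},
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to review or log an experiment definition before anything touches a live system.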
The beauty of the tool is that it has a very robust REST API layer, which allows you to invoke all the faults, or anything else that you can do through the UI, through a normal API call, so automating this process becomes a breeze. I'll not get into too much about the deployment models; if there are questions, we'll come back to this.

At a high level, there are three different types of faults that we support: infrastructure-level faults, application-level faults, and database-level faults, which are coming up with version three of Mangle. As you can see, we cover a broad variety of faults in the infrastructure layer. There are different types of resource faults, quite a few of them, and there are certain things which are very specific to the infrastructure. For example, in Kubernetes you can make resources unavailable, you can make resources not ready, or you can delete them. Those kinds of faults are very specific to K8s platforms, and we have support for them as well.

Then we also have application-level faults. These do not require you to have any kind of instrumentation at the application layer. We use an extended Byteman agent to introduce these faults. So without doing anything to your application, as long as it is a Java application, you will be able to run all these faults with no additional instrumentation. That's the beauty of the Byteman agent, and we have basically enhanced the JBoss Byteman agent to make that happen, right?
With that, let me hand it off to my team member and colleague, Avinash. He will take you through the demo, and as part of the demo we will be covering the faults which are highlighted in green: one sample network fault, a service unavailability fault on Kubernetes, and a Java exception fault. So, thank you. Over to you, Avinash.

Yeah, so as Ashwati already described, the demo is broadly divided into two parts: one where we will see fault injection on the infrastructure side, and the other where we will see it on the application side. On the infra side, we will inject faults on remote machines and a Kubernetes cluster, and for the application fault simulation we'll see the demo on a Docker endpoint.

We'll start with the remote machine endpoint. This is the home page of the Mangle application. Once you log in to Mangle, this is the first page where you land. For this demo I have used the admin credentials, and this is the default user that comes with the Mangle application. You also have the option of having your own auth source configuration, with your own users and custom roles for those users as well. Apart from that, the home page describes the key items that Mangle supports, and from here you can navigate to the various pages.

Without wasting much time, I'll start with the primary point, which is to inject a fault. The first item you need to configure is your target endpoint. Here we refer to targets as endpoints, and the endpoint types we support are remote machine endpoints, a Kubernetes cluster as an endpoint, and a Docker server as an endpoint. Apart from that, we have specific endpoints for VMware vCenter and AWS as well. And there is a section for credentials: to support an endpoint, you need the credentials to be configured
for the respective endpoint. We'll see that now. I have already added an endpoint, so I'll just show the required fields. Here we can see a remote machine, which is a Linux box with SSH enabled, and you need to specify the SSH port while adding the endpoint in Mangle. The mandatory fields here are your machine's host name or IP address and the port number. Right now Linux is the default operating system we support for remote machine endpoints. To add the endpoint, you need the SSH credentials. From this option you can navigate directly to the credentials page, or you can use the separate credentials section. To add the credential, you give your username along with the authentication, either a password or a private key, whichever is supported.

That's how you add an endpoint to start with. Once you fill in all the required fields, you run the test connection, and once the test connection passes, you can submit the endpoint as the final step.

Now we are done with the first step, which is adding the endpoint. The second step is to use the fault repository that Mangle provides and target that endpoint for fault injection. Here you can see that we broadly categorize fault execution into infrastructure faults and application faults.
I'll just highlight the infrastructure-level faults, where you can see the various types of fault we provide, namely CPU, memory, disk I/O, and kill process. I won't get into detail on those because of the limited time. Apart from these generic faults, we support infrastructure-level faults specific to networking, Docker, Kubernetes, vCenter, and AWS. For this demo, I am going to present one of the network-specific faults. Network faults are categorized into four types: packet delay, packet duplication, packet loss, and packet corruption. I think the names are self-explanatory.

So let's see one fault injection for packet delay. This is the endpoint I added to Mangle, and I'm running a ping command to see how reachable it is from my machine. Here you can see that the natural latency reported by ping right now is around 200 milliseconds. Now we'll inject the fault. First you select the endpoint from the drop-down, then the NIC configured for it, and then the latency in milliseconds that you want to inject. For this demo I am injecting around 3000 milliseconds of latency for any packet transfer. This section is the timeout in milliseconds. What the timeout means is that once the timeout you provided in the request body has elapsed, Mangle remediates the fault automatically.
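The inject, wait, and auto-remediate cycle described here can be sketched as a small plan. The talk does not show how Mangle applies the delay on the remote machine; this sketch assumes the standard Linux `tc` netem queueing discipline, which is a common way to add packet latency, and the NIC name `eth0` is a placeholder. The function and key names are illustrative, and the commands are only built as strings, not executed.

```python
def plan_packet_delay(nic, latency_ms, timeout_ms):
    """Return the commands and wait time for one timed packet-delay fault.

    Assumption: the delay is applied with Linux `tc` netem. The talk does not
    confirm that this is what Mangle runs under the hood.
    """
    return {
        # Add a netem qdisc that delays every outgoing packet on the NIC.
        "inject": f"tc qdisc add dev {nic} root netem delay {latency_ms}ms",
        # The timeout is given in milliseconds; convert to seconds for waiting.
        "wait_seconds": timeout_ms / 1000,
        # Removing the root qdisc restores the original behavior.
        "remediate": f"tc qdisc del dev {nic} root",
    }


# Values from the demo: 3000 ms of latency, remediated after 15000 ms.
plan = plan_packet_delay("eth0", latency_ms=3000, timeout_ms=15000)
for step, value in plan.items():
    print(step, "->", value)
```

Treating remediation as part of the same plan as injection mirrors the point of the timeout field: a fault should never be left in place just because someone forgot to undo it.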
In this case, I am providing 15,000 milliseconds, that is 15 seconds, as the timeout for this particular fault. Apart from that, we have two more options for scheduling the fault. You can schedule it once, at any particular time you choose in the future, or you can schedule the fault to recur at a particular time. For example, if you want to run a particular fault daily at a specific time, you can achieve that with the scheduler.

All right, so this 15,000 millisecond timeout gives us a 15 second timeout. Let's see how it works. I have injected the fault, and we'll observe the behavior on the console. Here we can see that the latency is injected, and the ping time now shows the extra 3000 milliseconds on top of the original latency. And now the 15 seconds are over, so we can see that it went back to the original value. This is the page where, once the fault execution is done, the reports are generated, and you can see the details of the fault injection, such as the endpoint and the other parameters you used.

That was the first part of the demo, the remote machine. For the second category of infrastructure fault, we are going to inject a fault onto the Kubernetes cluster. Before getting into that, let's see what type of application we are injecting the fault into. Ashwati was mentioning the various categories of fault that Mangle supports, and we know that microservices-based architectures are key nowadays, where various services combine to create one application. For this example, I just created a sample robot shopping application and deployed it in my cluster. What this sample app does is give you the option to buy robots, and there are various categories of robot to purchase.
This categories section is the catalogue part, where the catalogue service is responsible for providing the details about the different types of robot, and then you can select a robot for purchase and add it to your cart. That's the quick information about the application where I am going to inject the fault.

As with the remote machine, the first point from where we get into the fault injection process is the endpoint, so here also we need to add an endpoint for this application. Like the remote machine, we need to add credentials, but the credentials are different. Here the credential is your kubeconfig file, which Mangle uses to communicate with the cluster. So we'll add the credential by submitting the kubeconfig file. Once the credentials are added for the Kubernetes endpoint, we need to mention the namespace. In the cluster you might have various namespaces, so the namespace is required to inject only into a particular one. And as with the remote machine, once the test connection succeeds, you can submit the endpoint.

All right. Once the endpoint is added, we'll go to the fault injection category.
We have a category for Kubernetes-related faults, and there are three types of fault that we mainly support for Kubernetes. As you can see, the names are delete resource, resource not ready, and service unavailable. The names explain themselves. Delete resource is for when you want to randomly delete some resources in the Kubernetes cluster you added, for example a couple of pods. If you want to simulate a condition where the pods get into the CrashLoopBackOff state, for that type of simulation we have the resource not ready fault. The third category is the service unavailable fault. I was talking about the robot shop application, and any such application is a combination of various microservices. If you want to see the effect of the failure of one service, how the dependent services react to it, and whether your application is resilient enough, we'll see that fault injection via this category.

How you inject the fault: first you select the endpoint, and then there are two ways to choose the target, either by the name of the service or via the labels defined for the service in its description. For this application the label selected is service name catalogue, and the fault injection will happen based on this label. There is one more field, random injection, set to true. What that means is, for example, you want to delete a pod with a particular label, and there can be multiple pods with that same label.
If you want to choose just one of those pods for deletion, you set random injection to true; if you set it to false, the fault is injected on all the entities that Mangle discovers with that label.

All right, let's execute the fault. The fault is injected now, and if we click again on the various catalogue categories in the application, they are not available. Let's see the processed requests. Yes, the fault injection is completed, and we can see the details of where we injected the fault. Let's go to the application. Here we can see that the API responses are failing with a server error, because the service is not available and those APIs are not working right now. You can see that the images are not showing up and the page is blank.

So let's remediate the fault to see how it reacts. The fault is remediated. By the way, there are a couple more options apart from remediation. Once the fault injection is completed, if in future you want to trigger it again, we provide the option to either re-trigger the same fault or tweak some fields in the fault injection request body first. Now, since the fault is remediated, we can see that the catalogue items are present, the APIs have started passing, and there is no failure.

That was the Kubernetes-level fault. The third category is the Docker endpoint, and for the Docker endpoint we are going to look at one application-level fault. As Ashwati was explaining earlier, for application-level faults we use the Byteman libraries to do bytecode manipulation at runtime. We don't need any separate configuration to do that at runtime.
In this demo we'll see how the code is manipulated using Mangle. This is a sample application running in a Docker container. In the Swagger UI, the plugin controller shows the APIs that are available, and I will target one particular API, the GET call for plugins. If we execute it, we see a 200 OK response with empty data; there is no data in it right now. We'll go to the Docker machine, where you can see via the docker ps command the container that is up for this application, and we'll watch the logs of that container so that we can follow the fault injection there.

All right. Again the entry point is the endpoint, and this time we'll go for the Docker endpoint. The Docker endpoint is already added, so we'll look at the required fields. It's a little bit different from the remote machine and the Kubernetes cluster: here we connect directly through the Docker APIs. If TLS enabled is false, that is, TLS is not enabled, you don't need to upload any certificates; otherwise there is a separate section where you upload the certificates. Beyond that, you need to provide the port on which the Docker service is running on the Docker server, and the Docker host IP or host name. As before, once the test connection is successful, you are good to add the Docker endpoint.

Here we'll go into a different category, the application-level faults. Similar to the infrastructure faults, we have various common faults that are injected into your JVM, such as CPU, memory, file handler leak, and thread leak. Apart from that, we have Java method latency, Spring service latency, Spring service exception, and simulate Java
exception faults. All of these you can achieve via Mangle. For this demo, I am going to present the simulate Java exception fault. First you select the Docker endpoint, then the injection home directory, which is optional, and then the class name. I am providing the plugin controller class, which serves the GET API for plugins, and the method name for the GET API is getPlugins. Here I'm setting the rule event to AT ENTRY. There are various rule events available from Byteman. AT ENTRY means that as soon as the code flow reaches the method, at the entry itself, the fault is injected and you see its effect. If you want the method to execute and do what it is intended to do, and inject the fault at the end of the method execution, you can say AT EXIT. Similarly, there are various types of rule defined. Then you define what type of exception you want to simulate at that point of execution, and the custom message you would like to throw with it.

For the Docker host, since two containers were running, it lists all the containers that are up at that moment, and from those you select the container in which you want to inject the fault. Then you need to define the Java home path, and then the JVM process: you provide either the process ID or the process descriptor. The free port is 9091 by default; you can use any port that is free.

Okay, let's see how it behaves. The fault injection is completed; here we can see that it is done. Now we'll go to the Swagger UI again. This is the log: as of now there is no exception or error message. We'll trigger the API again. Here we'll call the GET API and execute it. And there you go.
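The fields entered for this fault (class, method, rule event, exception type, and message) correspond closely to the parts of a Byteman rule. As a rough sketch, the helper below renders those fields into Byteman's rule script format. How Mangle actually generates its rules is not shown in the talk, so this is an assumption; the exact class and method spellings, the exception type, and the message are also illustrative.

```python
def byteman_exception_rule(rule_name, class_name, method_name,
                           location, exception_class, message):
    """Render a Byteman rule that throws an exception inside a method.

    `location` is a Byteman location specifier such as "AT ENTRY" or "AT EXIT".
    Byteman only allows throwing unchecked exceptions or exceptions declared
    by the target method, so a RuntimeException is used in the example below.
    """
    return "\n".join([
        f"RULE {rule_name}",
        f"CLASS {class_name}",
        f"METHOD {method_name}",
        location,
        "IF true",  # fire unconditionally whenever the location is reached
        f'DO throw new {exception_class}("{message}")',
        "ENDRULE",
    ])


# Illustrative values; the talk names the controller and method only verbally.
rule = byteman_exception_rule(
    rule_name="simulate exception in getPlugins",
    class_name="PluginController",
    method_name="getPlugins",
    location="AT ENTRY",
    exception_class="java.lang.RuntimeException",
    message="Injected by chaos experiment",
)
print(rule)
```

Switching `location` to `"AT EXIT"` gives the other behavior described in the talk, where the method body runs first and the exception is raised only as the method returns.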
Here you can see that the API has failed and is showing a 500 internal server error, and in the logs you can see that the exception you wanted is invoked, along with the exception message you wanted to throw. The main intention is to see, if the dependent classes or the consumers of that API do not handle the exception, whether your application is resilient or not. That was the application fault injection. Now we'll remediate the fault and quickly verify, by running the API again, whether the fault is still injected or not. We'll go to Swagger and execute again, and yes, we can see that it's behaving properly again, as it was earlier. All right, that's about it.

We have time for a couple of questions, I guess. Is that correct?

Yeah. There were a few in the discussion panel, which I've answered.

So, anybody in the audience with questions, please feel free to ask now. You can raise your hand and come on stage if you want. Okay, all right. Thanks, Avinash and Ashwati, for this wonderful session.

Thank you for the opportunity, and have fun building your own katana swords.

Thank you for the opportunity and have a great day. Thank you. Bye.