We have Ashish with us today. Ashish works for Siemens; he's an agile and lean coach over there and specializes in testing. So he's going to be sharing his experience on chaos testing and how they have leveraged some of these skills in their testing at Siemens. So thanks, Ashish. It is great to have you here and to hear from your first-hand experience. So that's awesome. Just a quick reminder: if people have questions, please leave them in the Discuss tab under the Q&A section and we will get to those questions. It's a short 20-minute session, but we do have some time towards the end. I've requested Ashish to take five extra minutes if he wants, and then we should be able to go through them. So all right, over to you, Ashish. Thank you so much.

Thanks a lot, Naresh. And thanks to you and Agile India for providing this opportunity, even with everything going on all around, when we never knew how the session was going to go, and for making this such a seamless experience. Thanks a lot, thank you all for joining the session, a very good morning from my side, and welcome.

So let's get started with a question: how much testing should we do to determine whether the system or the software we have built is right or not? What amount of testing will determine that the system we have developed is the right system? Just think it over for, say, 30 seconds. I would like to see some interesting responses in the discussion panel, so let's see what's coming from the audience.

Okay, if not, let me help you think in a totally different direction, one that may challenge the belief that the system was totally robust because you tested it. Let me start with this statement: no amount of testing can determine that the system is built right.
Okay, but a single test can prove that the system is not perfect, or that it is wrong. Now, when we are in this state, and that too when we are moving to large-scale distributed systems and towards the cloud, how do we develop a perfect system or software for the consumer or end user? That's where this topic comes into the picture. Let me share my screen with all of you.

So what do you see here? The topic is "The Confused Tester in Chaotic World", and that's why I asked how sure a tester should be that the system we have developed is right, and what amount of testing can determine whether the system is right or not. With all this confusion around, and that too in a chaotic world where the business need is changing and the way technology works is changing, this becomes a really important topic. And when the entire game is changing, literally, in this aspect of chaos, the question remains: how much confidence can we have in this complex system when we are putting it into production? That's where the thought of chaos testing comes into the picture.

Now let's take a pause there and look into this word "chaos". When you look it up in the dictionary, the literal meaning of chaos is a state of complete disorder. I don't know whether it is irony, or how Naresh and the Agile India team thought about it, that they introduced this chaos engineering track in the year 2020. This year 2020 will be remembered for many things, and one is definitely the chaos created by the COVID-19 pandemic. Now, one thing I learned in 2020 is that the key to preventing and mitigating pandemics is early detection and early response.
Now mark these two phrases: early detection and early response. This is the key point we are going to talk about in this topic as well, and it will help you when you are thinking of getting started with chaos testing, and chaos engineering for that matter. It is a short talk, so I may not be able to go into very technical details, but my thought is that if I can generate that aha moment and enough interest for you to go and research and learn more about this topic, we can definitely get connected after the session if you want to go into the technical details.

One thing we should know about failures is that you cannot legislate against them. Failure will happen; it is bound to happen. So what can you do? You can focus on fast detection and response. That's what you can really do.

And then comes this thought: this looks like a fairy tale. Once upon a time, in theory, everything works perfectly. You thought of your plan to survive the disaster in advance, but how will it really work out? It is all sound theory. While it is possible to sit down and anticipate some of the issues you can expect when a system is failing, knowing what is actually happening is another thing. Like the picture you can see there: I really don't know whether it will work out or not. The key, again, is that this really depends on your tolerance of failure, and that is based on the likelihood of it happening. The result is that you are forced to design and build a highly fault-tolerant system that will withstand a massive outage with minimal downtime. And that too in today's world, when the world's internet is going through stress testing. You don't have to go far: just look at Mumbai a couple of days back, when because of the heavy rain and the outage there was a lot of
chaos created over there, and many renowned internet companies were not able to recover from the outage. So in all these situations, if you as an organization are going out with a large distributed system, you have to be ready for all of this. The likelihood of this happening may be very minimal, but the impact of the failure will be very large. To be ready to tackle that impact, you have to build a highly fault-tolerant system that can withstand any outage with minimal downtime.

Now, that's where we should look at what to do in success conditions and in failure conditions. The prevailing wisdom is that you will see failures in production. The only question is whether you will be surprised by them, or whether, as an organization, you inflict them intentionally to test the system's resilience, learn from that experience, and become better. The choice is yours. The latter approach, inflicting failures intentionally to test the system's resilience, is called chaos engineering, and a very important aspect of chaos engineering is chaos testing, which is what we are talking about today.

Now, a very interesting image on the screen. What do you see here? Historically, the emphasis has always been on the mean time to failure: working out what the correct behavior is, how my system is failing, and what my mean time to failure is. But in today's world the emphasis needs to shift to the mean time to recover, that is, minimizing the time it takes to recover from a failure. That's what you have to do if you are really thinking from the customer's or end user's perspective, and that too in the situations happening all around us. So at a high level, if you look in
to this term called chaos testing, it is simply creating the capability to continuously but randomly cause failures in your production system. That's chaos testing at a high level. But the goal of chaos engineering is not simply to find vulnerabilities through tooling. I'm very sure many of you here who know this concept are aware of the Simian Army, Chaos Monkey, Latency Monkey, and the different types of tools. The point I'm making is that the tools are not the important part. What is important is to find the vulnerabilities, and the goal is to push forward on a journey of resilience through the vulnerabilities you find. That's what chaos engineering is all about.

So what I'm going to do in this very short time is talk about a case study from my Cisco days, where I worked on a very interesting solution called Unified Contact Center Enterprise. It was a legacy software solution for contact centers. Whenever you dial any contact center for support and you hear "dial one for English, dial two for Hindi, dial three for Kannada" or something like that, and then you get routed to an agent based on your needs, that seamless solution behind the scenes is what we used to develop. It was a very interesting solution; we used to cater to the needs of 40,000 agents at one point in time.

If you look deeply into the system, the router is the brain of the entire contact center solution. Almost all enterprise contact center applications come with duplex sides: there will be a side A and a side B, and they can be geographically distributed, so one contact center side can be in India and the other in the Philippines. So it comes as a duplex system. During a high-volume call load to
the contact center, there can be unanticipated issues: network communication issues can happen, a burst of calls beyond capacity can happen, and latency issues can happen. Even more interestingly, your call center can get hacked: you can start getting a huge inflow of calls from some unknown source, just to attack your system so that the entire contact center goes down and the customers and the business suffer.

In all these scenarios, if you are building that solution, you have to ensure that contact center functionality is not impacted for any existing or active call. Suppose you are on a call with an agent. Many times we call the contact center when we are already at the peak of frustration, and if your call gets dropped in that state, your frustration will go to a different level, right? So to ensure that the customer gets a seamless experience, you have to test that, while a call is in progress, whatever is happening, whether network communication issues, latency, or a burst of calls, active calls are not disrupted. The same goes for new call processing: say you have already been told your wait time is 30 seconds or so, and suddenly your call gets dropped. None of this should happen.

Along with that, there should be no impact on the agents and the contact center. If I am Cisco as an organization and I am developing that solution, who is important to me? Definitely the agents and the contact center enterprise are important, and I have to serve them as well, so I have to make sure there is no impact on them either. Looking at all these conditions, this is what we call the expected state. In chaos testing, the first step is to identify the expected state, and then
you define the hypothesis. What you see here, the network communication issues and everything, is all part of defining the hypothesis: under this condition, what will happen? We'll talk about the hypothesis in a bit. First of all, what I explained here is called defining the expected state: in all these scenarios, this should not happen.

Now, what we used to do was simulate the real call load in our labs and in production and perform the failure scenarios to validate the impact on the contact center. The important aspects we used to look at were the mean time to recover and getting the other side to go active when the impact hits one side of the contact center. As you can see here for PG1, if one side goes down, PG B comes up, takes over the existing calls, and handles the new calls. That's what we used to do.

What you see on the screen here are a few of the chaos experiments I have listed. I may not be able to go into detail on each one of them, but let's see how much we can cover.

In the first chaos experiment, using the VMware APIs, we invoked commands to cut the communication between the active node and the standby node on a random basis. The expectation here is that if this happens, there should not be any call drops, and that's what we used to validate with this experiment in production.

In the second, using the VMware APIs, we invoked commands to cut the communication at one complete site. If it is geographically distributed and India goes completely down, the Philippines has to take over seamlessly. That's the second experiment we used to look into.

Then, using internally developed tools and PowerShell scripts, we used to invoke
the command to kill the process completely. When you kill this process, again on a very random basis, the idea is that if there is an agent already on a call, there should be a whisper in his ear saying that this is what is happening and that he has to transfer the call to another agent who will take it over. So we used to check whether the other system responds in that way or not.

Next, we pushed excess load traffic using internal tools. For every component in the system you can set how many calls per second it takes; it can be 18 calls per second or 20 calls per second. With the tool and API we had developed, we used to push load beyond capacity, like 200 calls per second. What we tested was that the excess calls get dropped, through the IVR or by direct disconnection, and the aim was that the agents and the contact centers do not go down, because the moment a burst of calls happens the entire system can go down.

Then we introduced latency between the components using simulators while the systems were running at peak load, to identify whether the system can run at the right load or not.

So these were a few of the experiments we used to do. The idea I want to give you is that we used to do random testing to determine how our system would be affected if this happened in production.

Now, one thing you should be very aware of: adopting chaos testing will help you improve your MTTR, the mean time to recover. It will help you determine the resiliency of the systems and of the entire environment on which you have built them. It will build your organizational confidence in the resiliency of your production environment, because you really don't know, when you deploy the system out there, how your system is going to behave in that
production environment. And it will also keep you out of tomorrow's headlines. Again, I don't want to name any internet provider, but there are a lot of memes going around about them after what happened in Mumbai. They are already in the headlines, perhaps because they had not done the right testing. So if, as an organization, you really want to stay out of tomorrow's headlines, this is the thing you should be looking at.

Now, what are the challenges, and what are the basic steps to perform chaos engineering? As I already said, first define a steady state that represents the normal behavior of your system. If you don't do that, you really don't know where you are going or in which direction you should test. Testing in production can be really challenging, and you can go in all directions.

The second thing you should do is define a hypothesis: your engineers should hypothesize the expected outcome when something goes wrong. And when you are doing it, don't do it alone; do it with the team. Bring the whole-team testing approach in, because if one person does it, the challenge is that his thought process may not match the way a real user thinks. If possible, include your real user behavior as well.

The third step is to design the experiments with variables that reflect real-world events, like dependency failures, server failures, memory malfunctions, and so on. In the Netflix world they call it Chaos Monkey because, yes, it is literally like monkeys: you don't know which plug they will pull or what kind of scenario they will create. So think along those lines: if a natural calamity comes, how will it affect my system?

And the important thing is measuring the impact of the test and observing the steady state in both groups, the as-is
state and the desired state.

Now, while doing all this, one thing you have to be very careful about is making sure your blast radius is not very large. If you are testing in a very robust way, that's good, but if you are not able to recover the system, you are doing greater harm to your production environment, and the cost of recovery will be high. So determine your blast radius, so that you can minimize the loss in the right sense. That's what you have to be very sure of.

So it looks like we are perfectly on time. That's how I have done my chaos testing, and that's how I'm going to do it in a real-time environment. I would like to bring the discussion to a close with this thought: failure is a success if you learn from it. The entire idea of chaos testing is that you do it, you determine where you are going, you validate your hypothesis, and you learn from that hypothesis to become better in the production environment. I hope I was able to share my thoughts and that they resonated with all of you, and with that I will bring my discussion to a close.

Right, thanks, Ashish, that was a great presentation. Do you want to handle the questions by going here?

So, one interesting question from Saqshi, and thanks for asking it, Saqshi: what do you think would be the best fit for doing chaos testing? See, if you are just developing desktop software and the impact is not reaching many people, chaos testing may not be the right thing to do. But if you have a large distributed software system, and that too if you are going to the cloud and dealing with networks, then yes. If I take just the example of Jio as an internet provider at such a large scale: should they do chaos testing? Yes, they should. That is the example I gave here of the contact center: when I am catering to the needs of 40,000 agents at one
point in time in a distributed environment, should I do chaos testing? Yes, I should. So the very simple answer: if you are building just a desktop kind of software, or a financial software where you know that by the end of the day you will be able to recover, maybe chaos testing is not very recommended. But for anything that is a large distributed software system, you must do it.

So thanks, everyone, and thanks, Ashish, a great session. Thank you.

Thanks a lot, Naresh, and thank you, everyone, for joining the session.
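
The experiment loop described in the talk (define the steady state, state a hypothesis, inject a failure at random within a small blast radius, then measure recovery time and dropped calls) can be sketched in a few lines of Python. This is only a simulated sketch under stated assumptions, not the tooling used in the case study: `DuplexSystem`, `run_experiment`, the 2-second failover delay, and the 5-second recovery limit are all hypothetical names and values chosen for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class DuplexSystem:
    """Toy model of a duplex (side A / side B) deployment. Hypothetical."""
    failover_seconds: float = 2.0   # assumed time for the other side to take over
    active: str = "A"
    healthy: dict = field(default_factory=lambda: {"A": True, "B": True})
    active_calls: int = 100

    def is_serving(self) -> bool:
        # The system serves calls as long as the active side is healthy.
        return self.healthy[self.active]

    def kill(self, side: str) -> None:
        # Fault injection: mark one side as down.
        self.healthy[side] = False

    def fail_over(self) -> float:
        """Switch to the other side; return the simulated seconds it took."""
        other = "B" if self.active == "A" else "A"
        if not self.healthy[other]:
            self.active_calls = 0   # both sides down: active calls are lost
            return float("inf")
        self.active = other
        return self.failover_seconds


def run_experiment(system, rng, max_recovery_seconds=5.0):
    # 1. Steady state: the system must be serving before any failure is injected.
    calls_before = system.active_calls
    assert system.is_serving(), "steady state not met; do not start the experiment"

    # 2. Hypothesis: losing one side drops no active calls, and recovery
    #    finishes within max_recovery_seconds.
    # 3. Inject the failure. Blast radius is limited to exactly one side,
    #    chosen at random.
    victim = rng.choice(["A", "B"])
    system.kill(victim)

    # 4. Measure the impact and compare it with the hypothesis.
    recovery = 0.0 if system.is_serving() else system.fail_over()
    dropped = calls_before - system.active_calls
    return {
        "victim": victim,
        "recovery_seconds": recovery,
        "dropped_calls": dropped,
        "hypothesis_held": dropped == 0 and recovery <= max_recovery_seconds,
    }


if __name__ == "__main__":
    print(run_experiment(DuplexSystem(), random.Random(42)))
```

Passing the random generator in as a parameter keeps each run repeatable, which makes it easier to replay an experiment whose hypothesis did not hold. In a real environment the `kill` step would be replaced by whatever fault-injection mechanism the team actually uses.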