So, we are building social, real-time game shows, multiple game shows. When we talk about stress testing, we first need to understand that every system has a capacity limit. Any production system will, at some point, come under stress because of the traffic coming in, and then it starts responding slowly, showing more errors, and reacting in unpredictable ways that we don't want. Stress-driven development helps us keep a check on that all the time. If you are not doing stress-driven development, you are doing prayer-driven development, where you pray that your system will never fail.

When we talk about load or stress on a system, we first need to understand how our traffic arrives; we need to characterize it. In our case, because we are running a game show, all the traffic comes at exactly a single point in time. That generates a huge amount of stress: everyone wants to enter the game show, so they fetch all the information required to enter and hit our servers a lot. In the same way, you can characterize your own traffic: what is your peak load, at what time does it hit, and is your system ready to handle that kind of load?

Then you need to understand the limits of the system, because without understanding the limits you won't be able to optimize your code. You also need to understand how the system performs under different loads, so that you can identify the bottlenecks. Our production systems today are so complex that they involve multiple pipelines: applications, databases, messaging queues. Understanding the bottleneck and then fixing that bottleneck is the most critical part. When we talk about scaling a system, it is not just about increasing the number of servers; it is about identifying the bottleneck and actually going and fixing that particular bottleneck, so that your system is ready end to end.

Whenever we talk about stress testing, we need to understand where to start: what is our baseline? Let's say I have a server and I can easily handle a thousand requests per second. Is that the right number? Is it really the maximum capacity? We first need to identify the endpoints on which users can generate load and actually create stress on our system. Once we have identified those endpoints, the second thing is benchmarking, and we need to benchmark against some baseline. How do you establish the baseline? Set up an empty server. Say you are running a Node.js server with your own set of code: the simplest thing you can do is stand up an empty Node.js server with a single endpoint that does nothing; it takes a request and returns a simple OK response. With this you can start generating load on that system, for a given machine configuration, and identify your baseline: an empty Node.js server with a simple REST endpoint that returns OK can maybe serve twenty thousand requests per second. That becomes the baseline against which you can start thinking about how your real server, the one doing actual work, is performing.
So, then the next step, obviously, is to start generating load on your own server and measure it against that baseline. That gives you an idea of how far from the baseline you are: can you even reach the baseline? It gives us a mental model to work with. With this we can find bottlenecks, optimize our code, test again, and optimize again; it is a whole cycle.

To give you the bigger picture, this is the final version we want to reach every time. We never want to generate load on the production system itself, because at the same time real users are making requests on it. So we build an exact replica of production; that way we know how our production workload performs while we are running our test workload, and the configuration of the systems needs to be the same. Second, we need to simulate clients: lots of clients that can go and generate a similar kind of load. Let's say you have an app, maybe an iOS or Android app. You need to simulate similar behavior, with the same number of clients generating requests, or more, as a simulation of that.

The third thing, which is very critical, is to plan your test. It becomes critical because you need to model real user behavior. One of the things we do is put a proxy in between and play through the app, and then identify from the proxy how the app actually behaves. We use Charles Proxy: we play through the app, mimicking user behavior, and it shows us exactly what our APIs are doing, which API is called first and which second, so that we can replicate it and make our own test plan. The fourth step is to start the test. We run the test at the same time from multiple machines, which simulates the production environment: multiple clients sending requests at the same time and generating stress on our server, so we know how our server performs at different load levels. And finally, watch the results. In our case we use New Relic, and we also have internal dashboards that help us identify how many requests we are serving, what the latency is, and what users are actually experiencing. Before this methodology, some requests were taking 60 seconds to return a response to the user, which is not an acceptable experience.

So why do we do it? Once upon a time we were just a startup. We kept developing features, so many features, and we were very happy: our app is ready, now we want to see scale, we want to experience what happens at scale. The unique part in our case is that we send a notification to all of our users and everyone arrives at the same time. Even though in the past we had developed a lot of games, and those games worked with millions of users, in this case, even with a smaller number of users, our systems, the app and the backend, responded slowly. The latency for each user entering the game show was very high; users were simply not able to enter. And we realized we didn't even know how many users we could handle. So the first thing we decided was to remove the bottlenecks.
First, optimize the most critical endpoints, so that we can make the user's initial experience the best we can achieve. The next question is how we did it. It's not that we started on day one and reached a final stage where we generate load on the system and everything performs in a super awesome way. We went through stages, so let me tell it as a story of how we started.

The first thing we wanted to optimize was the first request, so that we could handle a huge number of them, because everyone makes that initial request to our server; it is a bootstrap request where the user fetches the data with the initial information needed to enter a game show. Then we thought: let's try to create stress on our server. We used JMeter, a tool; it's open source, anyone can go and use it. We used it to hit our API, say, a thousand times at the same time. And when we started, we started from our local systems. A developer had a laptop, he used JMeter and started generating load, then he went through the logs and identified what was actually happening: with a huge number of users coming at the same time, our system was not performing, and even a single response could take up to 60 seconds when lots of people arrived together. So we went into the code, added logs, added timestamps, identified the critical sections that were taking the most time, patched things, optimized the code, identified everything we needed, and then re-ran the same tests.

For a start, this is good enough: we are able to generate some amount of load and get results. But we cannot achieve everything with this setup. The limitations: first, there is no homogeneous test environment. Everyone is running on their own laptop, and every laptop is different. Second, a single instance, a single machine, cannot generate unlimited load. Every system has its own capabilities, and when you start pushing unbounded requests from one machine, it degrades over time because of the limitations of that instance. Another thing: sharing test code is itself hard. I have my own stress test code, someone else has their own, and we would send it over Gmail or maybe share it using Dropbox or somewhere. So eventually there was no standardization. Coordination is also hard, because at such a critical time all your major workforce is trying to identify bottlenecks and fix them, and if multiple people start generating load at the same time, the final result is not accurate. So standardization is very important and coordination is very important. Plus it is a very manual process: I need to actually go run the test and tell everyone, OK, I am running it now.

So, automation. Obviously we thought: let's move all of this to a single place. We moved it into a version control system; you can keep it in GitLab, and we are using GitLab. Then, instead of everyone having their own local instance, we moved to a single stress test server. We said: this is our server, and this is where we generate load from.
And everybody in the company can go to that same server and start generating load. That is good; we are in a much better place. Everyone knows where the load-generation scripts are, and everyone can go there and generate load. It was really helpful, because developers who are new to this can also go and start. We achieved a lot: we came to a place where all our stress test code is standardized. But then the problem again: one day I ran it and started generating load, and at the same time another colleague went to the same machine and also started generating load. So coordination is still a problem, because multiple people can come and start generating load from the single system. You need to communicate; you need to post on Slack that, OK, I am starting the test, no one else should start one. There is still no automation either, because I need to go into the stress test server myself and generate load from there, and again we are limited by the capability of that one stress test instance.

By now people are easily able to identify problems. If we automate this, announce in a single place that a stress test has started so no one else starts another one, and give people the capability to start a test themselves once they have identified where the bottleneck is, that would be great. So we moved to a version where we use Jenkins for this. Now we have a stress test instance, we configured Jenkins to talk to it, and Jenkins clones the stress test code onto that instance and runs the test from there. Now we are in a good state where everyone can go and start testing from Jenkins; even testers can do it, because ultimately our aim as developers is to empower other people to go and do all these things.

The only problem that stays with us is that we can still only generate a limited amount of load against our simulated production environment, and that limited amount of load is not good enough. We want to push our system to its limit, so that at any point in time we know how much stress we can handle. For that, we knew we needed to handle the problem of spinning up many instances while keeping the cost in check, because the other risk is that a developer comes along and says: today I am going to generate load from 100 instances, I am going to generate infinite load, and everyone will be proud that I generated the most load. For this we use Terraform. It is very easy to use. It is usually summed up as infrastructure as code, but it is not limited to infrastructure as code; you can even have PaaS as code or SaaS as code. It is supported by many providers. It uses the HashiCorp Configuration Language (HCL); you can also use JSON, but I find HCL easier to use. Either way, it is written as code. For all our stress client instances we use AWS EC2 instances, and I am assuming everyone is aware of EC2. So let's go deeper into Terraform and understand how it works, because without understanding Terraform it is really difficult to go straight to creating instances and all those things.
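To make that concrete, here is a minimal sketch of what Terraform code looks like. This is an illustration rather than our exact setup; the region and AMI ID are placeholders.

```hcl
# Minimal Terraform sketch: the AWS provider plus one EC2 instance
# acting as a stress test client. All values are placeholders.

provider "aws" {
  region = "ap-south-1" # assumed region, for illustration only
}

resource "aws_instance" "stress_client" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.medium"

  tags = {
    Name = "stress-test-client"
  }
}
```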
So, first thing: Terraform supports more than 70 providers that are officially documented on their website, the community supports more than 100, and you can write your own provider if you want. If you have a custom setup where you want to use Terraform, you can write your own provider, because they expose a plugin interface, and beyond that there are already a lot of providers.

Before talking about the commands, let me show it directly in the code. In this case I have used modules, so I will go into one of them. Say I go into a VPC module: what this module does is generate a complete network for me. With this module I generate a VPC, a public subnet, and a security group with whatever access I need, plus the routing table, and it wires everything up. A module is itself just Terraform code. Whenever we want to write our own Terraform code, we first need to specify which provider we want to use, so that when the code actually runs it knows which APIs to use and can download the dependencies. In the case of the AWS provider, it downloads the AWS provider plugin, which internally makes all the AWS API calls. Then we define which resources we want, with all the required information for each resource. When we write a complete Terraform module, Terraform internally creates a dependency graph; with this dependency graph it works out which tasks it can do in parallel and which have to be sequential, and it starts provisioning the infrastructure. In our case that means provisioning the network and, inside it, instantiating resources like instances.

So I go and simply run `terraform init`. What it does in this case is say it will download the providers. Another small command is `terraform validate`: it validates your code and tells you at this point whether there is any syntax error. Now we have the code, but if we ran it directly against our production setup, the problem is that we don't know what is going to happen; we don't have any kind of visualization. So Terraform gives us the `plan` command, with which we can see exactly what is planned; it is like a real-world mapping of what is going to happen to the system. In this output it has filled in some IP addresses, because I already told it the CIDR block, 10.1.0.0/16, but many values show as computed, because at this point Terraform doesn't know what their value is going to be; they might come from some other resource or be computed from the cloud configuration. All of that information is resolved only when we actually apply. Before that, just to show you, there is also the `terraform graph` command. It prints the complete dependency graph. In this raw format it is not very human-friendly, but you can copy it and paste it into any graph visualization tool and it will actually draw the graph for you. In our case it is a very complex thing; most of the pieces are in parallel, because they have no dependencies on each other.
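Here is a sketch of what such a network module might look like; the names and CIDRs are illustrative, not our production module. Note how the subnet and security group reference the VPC's ID: that reference is exactly how Terraform infers the dependency graph mentioned above.

```hcl
# Illustrative network module: a VPC, a public subnet, and a security
# group for the load clients. The VPC CIDR matches the 10.1.0.0/16
# mentioned above; everything else is a placeholder.

resource "aws_vpc" "main" {
  cidr_block = "10.1.0.0/16"
}

resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.main.id # implicit dependency: created after the VPC
  cidr_block = "10.1.1.0/24"
}

resource "aws_security_group" "stress_clients" {
  vpc_id = aws_vpc.main.id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1" # allow all outbound traffic from the load clients
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Typical workflow against this code:
#   terraform init      # download provider plugins
#   terraform validate  # catch syntax errors early
#   terraform plan      # dry run: see exactly what would change
#   terraform graph     # dump the dependency graph in DOT format
```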
So, Terraform identifies the dependencies by itself, based on whether you use one resource's attribute as an argument in some other resource, and you can also define your own dependencies explicitly with `depends_on`. What I will do now is apply this against my own AWS account, and it will generate an instance. Meanwhile, since it will take some time, let's go back to the slides.

I already talked about `init`: it initializes the workspace where you are actually going to work. `validate` helps validate your code. `plan` gives you the exact execution plan of what is going to be done to your system. `apply` actually applies it. Then, in almost every piece of Terraform code, we define variables, dynamic values which we can inject either from environment variables or from our own code, so that the main code and the variables are kept separate. We can also define outputs, because in some cases we take an output from one part of the code and use it in another part. In my case I have multiple modules: one module does only networking, and it outputs all the networking details, which can then be used by the instances module. And `destroy` is there to destroy the complete Terraform-managed infrastructure.

Modules are about reuse of code. In our case I have a network module and an instances module. Now, if anyone wants to create a complete network layer, they can take my module and use it directly; it helps with reuse. There are also many modules that already exist: there is the Terraform Registry with pre-existing modules you can use directly, and a lot more on GitHub that you can pull from. You specify the input parameters of a module, and you can pass values from one module to another.

How is Terraform different from other tools? You can also provision infrastructure using Ansible, but Terraform keeps state. Its state is a real-world mapping of how your infrastructure is: if you have the state file, you can tell exactly how your infrastructure is provisioned in the actual cloud. By default it stores the state locally. In our case we store the state on Jenkins, keyed by the job number, so that each run is unique. But there is a problem: the state also contains sensitive information. Let's say a resource creates an IAM user; the state file will then contain the IAM access key and secret, which you don't want lying around. So you can use a remote backend, which stores this information securely. One setup you can use for free is S3 plus DynamoDB, where DynamoDB is used for locking, so that when multiple people run it at the same time it prevents conflicts. When you run with a remote backend, Terraform takes the complete state from the backend and holds it in memory; it doesn't write it to local disk, so everyone works against one common copy.
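A minimal sketch of that kind of remote backend configuration, assuming an S3 bucket and a DynamoDB lock table that already exist; the names here are made up.

```hcl
# Remote state with locking: the state file lives in S3, and a DynamoDB
# table acts as a lock so two people cannot apply at the same time.
# Bucket and table names are hypothetical placeholders.

terraform {
  backend "s3" {
    bucket         = "example-stress-test-state"
    key            = "stress-test/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true # keep secrets such as IAM keys encrypted at rest
  }
}
```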
So, let's come to our final version. We automated everything with Jenkins, so now there is a single place from which we run our tests, and Terraform generates the load-client machines. In our case not everything is tested using JMeter, because not everything is a plain API; some of it is covered by custom scripts. So we use our custom code, and on the instances we use a command called `at`: it runs a given script at a particular time. Our test code is in a simple, single repository, as I told you, so if something is failing on the code side, we know who did it, and now it is part of our release process. It runs before every release, so we know if our critical paths are failing before we actually expose the release to all our end users, and we generate load with the exact same scenario that happens in production. The tools we use are Jenkins, Terraform, JMeter, custom scripts, and custom client code, including our own mocks of the third-party APIs we depend on. We have an internal dashboard to see how much time things take, and we use New Relic to see latency, load, and all those things.

I think this part is important, because these are the learnings we collected during this whole process. With Terraform, always run `plan` before applying, because if you just apply, you won't even know what you did to the system; with plan, code failures and surprises can be identified at an early stage. And once you are done, destroy everything, because otherwise it keeps costing you. In our case we initially started with on-demand instances, and one instance was left running; we were spending a lot of money even though nothing was using it. How is Terraform different from CloudFormation, if you know about it? They do the same task, CloudFormation for AWS, Terraform across lots of providers, but the approach Terraform takes is that it keeps state. Let's say our Terraform run fails. This happened with us: we wanted to spawn 400 instances at once, we hit an account limit, we don't even know exactly why it refused, and we suddenly saw it fail. At that point you can go fix the problem, or destroy whatever infrastructure has been created so far, because Terraform keeps track of it; it doesn't go and throw everything away. Test on staging before production, or at least test locally or somewhere, so you are sure your code works as expected. Make Jenkins the only one who applies, because you don't want just anyone creating infrastructure for you. One easy way is to have an access key that has permission to run this code, put it on Jenkins via environment variables, and don't let anyone else use those credentials. Do small, incremental changes, because you don't want to write a lot of code and then realize it doesn't work. And, like I told you, on-demand instances are costly, and spot instances are very cheap: something like one fifth of the cost, and at times much cheaper than that.

So, just to give you an idea: we use Jenkins so we can provide custom scripts which get downloaded onto the instances. You can provide a time after which you want your test to run, because you want the developer or tester to be ready, so they can watch what happens. We can reuse old instances created by earlier Jenkins jobs, and you can destroy them, because once you are done you want them gone; for now we sometimes keep them running so that a tester has the ability to go and fetch some information from those clients. And there are custom parameters, whatever extra data we want to pass to our server.
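As a sketch of how those pieces can fit together: here is a hypothetical Terraform snippet for a fleet of client instances that download a custom stress script and schedule it with `at`. The AMI, the S3 path, and the variable names are made up for illustration, and it assumes the image has the AWS CLI and the at daemon available.

```hcl
# Hypothetical sketch: N load clients that fetch a stress script and
# schedule it with `at`, so every client starts at the same moment.

variable "client_count" {
  default = 15 # how many load-generating instances to spawn
}

variable "delay_minutes" {
  default = 10 # how long after boot the test should start
}

resource "aws_instance" "load_client" {
  count         = var.client_count
  ami           = "ami-0123456789abcdef0" # placeholder AMI
  instance_type = "c5.large"

  user_data = <<-EOF
    #!/bin/bash
    # Bucket path is made up for illustration
    aws s3 cp s3://example-bucket/stress.sh /home/ec2-user/stress.sh
    chmod +x /home/ec2-user/stress.sh
    # Schedule the run so all clients fire together
    echo "/home/ec2-user/stress.sh" | at now + ${var.delay_minutes} minutes
  EOF
}
```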
So, we send all this information to our stress test instances, set them up according to our plan, and then after the specified time they start running. In the case of JMeter we use its own client-server model, but for our own custom scripts we have our own custom coordination code. This is just a sample: in this case it is only 15 instances, but at times we went up to 100 or even 400, depending on how much load we wanted to generate. And here is an example of the New Relic dashboard, where we actually watch what is going on.

So, here is our Terraform code. Every time it runs, it asks for confirmation before applying, but you can force it. When we are running it on Jenkins we do not want anyone to have to go and type in the confirmation, so we force it: every time the code is run, it is auto-approved. One of the reasons we can do that is that we know this code behaves as expected. Now what Terraform does is create all these things, and we can actually go and see what our inputs generated. I created a VPC and an internet gateway, because I want the routing table and all of that to come up with it. In this case one machine is an on-demand instance and one is a spot instance. If you go and look at the spot requests, we have already generated one: we requested a spot instance, which we got, and we provided a maximum price that is the same as the on-demand price. So in the worst case, if spot pricing spikes, we pay the on-demand rate, and that caps the cost.
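That spot request might look roughly like this in Terraform; a sketch with placeholder values, with the bid capped at an assumed on-demand price.

```hcl
# Hypothetical spot request: the bid is capped at the on-demand price,
# so in the worst case we pay exactly the on-demand rate.

variable "on_demand_price" {
  default = "0.085" # placeholder: on-demand USD/hour for this type
}

resource "aws_spot_instance_request" "load_client" {
  ami                  = "ami-0123456789abcdef0" # placeholder AMI
  instance_type        = "c5.large"
  spot_price           = var.on_demand_price
  wait_for_fulfillment = true # block until AWS grants the instance
}
```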
So, once it is done (it is still waiting for everything to come up), I can show you the state file: it is like a replica of the real world, of whatever we are actually seeing in the console. And once I am done, I can destroy it. Everything is automated: you can go to the repo, take the code, and try it yourself; it already has instructions you can use. The slides are available for this talk as well. Does anyone have any questions?

[Audience question about how the JMeter clients are run.] So, in JMeter we have a server-client model. We run it from Jenkins, from that machine itself, but you can have a separate machine too. Before moving to the server-client model, we were directly copying the code onto all the instances and running independent clients there, because we were not looking at JMeter's own responses; initially we were only watching New Relic, which is our monitoring dashboard. But to get a better understanding, we started driving it from the Jenkins machine, which connects to all the instances, runs the clients there, and collects the data. We are currently transitioning to that better system, where if something fails we can see it in a single place. What actually happens is that the Jenkins machine has SSH access to all the client machines: it SSHes in and copies the files, or, since we have kept it generic, you can use your own box or some S3 bucket and it downloads the files from there. Then it does the complete setup, which we have already scripted; you could use Ansible for that, or even do the provisioning with Terraform itself. After that, with the help of `at`, everything runs after the given time, and then you go watch whatever you wanted to measure.

[Audience question about CloudFormation.] I haven't really used CloudFormation, because, you know, I started with Terraform and it's super easy to use. But the major difference I have read about is state: when something fails on CloudFormation, it rolls back, it goes to zero state, whereas Terraform keeps the state where it is, so you need to actually go and fix that code as a priority. And it keeps a real-world copy: even if you go to the console and, say, delete half of the instances by hand, the next time you run it, Terraform tries to reconcile the two; likewise if you remove something from the code, and the Terraform code itself is versioned as well. But I can't speak about CloudFormation in that much detail beyond provisioning infrastructure. If anyone has any questions, my colleague Praveen is here, so you can ask him or me. Thanks.

[Host] Thank you, Mithish. We have this track here, but if you are interested in lightning talks or any other tracks, you can go to the main auditorium and propose your lightning talk there; it is an opportunity you can grab. Thanks all for coming.