Hello, everyone. Hope you are having a great time at SeleniumConf Day 2. A brief intro about myself: my name is Bharti Madan, and I'm working with Sumo Logic as a software engineer. Prior to Sumo Logic, I was with Adobe. Today I'll be talking about on-demand test infrastructure on ECS, as code. The agenda for today is as follows. First, we are going to discuss why we need scalable infrastructure and the actual problem statement we faced at Sumo, followed by the solution we used and how we built our infrastructure. Next, we'll discuss how, after building the infrastructure, we managed it using infrastructure as code. Before proceeding further, I want to ask a couple of questions. What is the main goal of the QE team in any organization? Anybody? Finding bugs? Exactly, that's the obvious one: the quality engineering team's job is to ensure quality. Next question: what is the main goal of automation for any web app, or any app? Exactly: to prevent regressions in existing functionality as changes and improvements are made to the app. If you're introducing new functionality or improving existing functionality, users relying on the old functionality should not be affected. As a company grows, the number of features grows, and the number of test cases grows with it, often exponentially. And in the era of continuous integration and continuous delivery, making sure that each change does not introduce a regression is extremely important. So what is the solution? How do we catch regressions before they go into production? Run automation, right? Run the appropriate test cases around each change. Which tests to run varies from change to change, from pull request to pull request: we look at the areas a change impacts so that we can run the appropriate tests around those areas only.
That way we do not need to run our whole large suite of test cases for every change. Currently we have a suite of 5,000 UI tests to run against the Sumo web app. At Sumo we have three development deployments, and across them we average around 200 PRs a day, all of which have to be certified. The problem gets worse when multiple PRs are raised at the same time and you have to certify them all. Say four PRs come in together; normally the number of tests to run per PR is around 250 to 300, but the number can go higher depending on which components the PR touches. Running that many tests is a big challenge. On top of that, we have five production deployments on which we run our whole suite of 5,000 UI tests three times a day. So in total we have approximately 100,000 tests to run in a day. Running that many tests with a small, predefined set of sessions would take more than a day, and certifying even four PRs back to back would take hours, more than a couple of hours, I must say. So we had to have an infrastructure that is fast, predictable, and reliable, and that can run all the tests in parallel, so that one complete cycle finishes within the time taken by the longest test. If your longest test takes seven to eight minutes, then certifying the whole PR should take seven to eight minutes; the infrastructure should be capable of running all of these tests within those few minutes. First, I'm going to discuss the tools and technologies we use to build our Selenium Grid infrastructure for UI automation. First is AWS Elastic Container Service. ECS is a managed, highly available, and scalable container management service; it manages containers on an EC2 cluster.
Next, we are using AWS Fargate, which lets us run Selenium Grid containers without managing the underlying infrastructure: we just ask Fargate to run a given number of containers, and Fargate takes care of the rest of the details. Next is AWS Elastic Container Registry, a repository where we store and manage Docker container images. Next, we are using the combination of Go Grid Router (GGR) and Selenoid. What are these two? GGR is simply a Selenium load balancer, stateless and very lightweight. Selenoid is a powerful Go implementation of the original Selenium Hub code; it uses Docker to launch browsers. We are using Terraform for our infrastructure as code; I'll discuss Terraform a little later in the talk. And for real-time reporting of our Selenium tests, to triage test failures and save time, we use our own Sumo dashboards. This is how our infrastructure looks. We have multiple GGR instances behind a load balancer. When a test comes in, the request is routed to an AWS Elastic Load Balancer, which routes the new-session request to any of the GGRs; let's say it chooses GGR X. GGR, again, is a Selenium load balancer, so it tries to create a new session with one of the available hubs. Say it routes the request to Hub 1; if Hub 1 is at capacity, GGR retries with another hub until it creates a new browser session for the test. Once the session is created, the hub replies back to GGR with the session ID. At this point, GGR has the original session ID, and it knows which hub the reply came from, that is, which hub the test will be running on. Now here is the trick: GGR calculates the MD5 checksum of the hub's host name.
Let's call that checksum the hub identifier, or hub ID. GGR joins this hub ID with the original session ID, producing a longer session ID, and forwards that to the test. The test now holds the longer ID, which includes both the original session ID and the hub ID. When a subsequent request from the test arrives, the AWS load balancer may route it to a different GGR, say GGR 2. GGR 2 does not know, on its own, which hub it has to proxy the request to. So GGR 2 extracts the hub ID from the longer session ID, finds that hub in its map, strips the hub ID off to recover the original session ID, and simply proxies the request to that particular hub; the hub then handles whatever the test wants to do. In this way, consecutive calls work no matter which GGR instance receives them, and by running multiple GGR instances we get high availability: if GGR X, which handled the original create-session request, dies, our test does not need to fail; some other GGR will handle it, and the test continues. Now you have this beautiful infrastructure; your tests are running, and running fine. But then the question comes: how will you manage this infrastructure? By managing it, I mean: how will you keep track of the changes being made to your infrastructure over time? Any critical change that causes unexpected behavior in the infrastructure can create chaos in the company, and we did face that chaos ourselves. Such chaos might block your production deployment pipeline, which is not acceptable.
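The session-ID trick described above can be sketched in a few lines of Python. This is a simplified illustration, not GGR's actual Go source: the host names and session ID here are made up, and the real router has its own encoding details, but the idea is the same — a fixed-length MD5 prefix lets any router instance recover which hub owns a session.

```python
import hashlib

MD5_HEX_LEN = 32  # an MD5 hex digest is always 32 characters

def encode_session(hub_host: str, session_id: str) -> str:
    """On session creation: join the MD5 checksum of the hub's
    host name with the session ID before returning it to the test."""
    hub_id = hashlib.md5(hub_host.encode()).hexdigest()
    return hub_id + session_id

def decode_session(long_id: str, hub_map: dict) -> tuple:
    """On any follow-up request: split off the fixed-length hub ID,
    look the hub up in the map, and recover the original session ID."""
    hub_id, session_id = long_id[:MD5_HEX_LEN], long_id[MD5_HEX_LEN:]
    return hub_map[hub_id], session_id

# The hub map every router instance holds: hub ID -> hub host name.
hubs = ["hub1.example.internal", "hub2.example.internal"]
hub_map = {hashlib.md5(h.encode()).hexdigest(): h for h in hubs}

long_id = encode_session("hub1.example.internal", "abc-123")
hub, original = decode_session(long_id, hub_map)
print(hub, original)  # hub1.example.internal abc-123
```

Because the routers share nothing but this stateless encoding and a common hub map, any instance can proxy any follow-up request, which is what makes the load balancer plus multiple GGRs highly available.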
In the world of CI/CD, you have to keep building and shipping to production. Next: if you update configuration manually, adding or removing configuration of the resources used in the infrastructure, that might affect other resources as well; resources can depend on each other, so changing the configuration of one resource may break others. The next big challenge: what will you do in case of a complete infrastructure breakdown? Recreate the whole infrastructure manually? No, that is neither scalable nor acceptable. So the solution was to have one source of truth for everything, for your whole infrastructure. Our end goal was to have a unified view, in one place, of all the resources used in the infrastructure and of how the infrastructure works, and we wanted to expose a way to change the infrastructure safely and predictably: if you want to modify it, it should be safe, and it should be predictable what is going to happen to your infrastructure. So we decided to adopt infrastructure as code. What is infrastructure as code? It means writing code, in any high-level or descriptive language, to manage configuration and to automate provisioning of your infrastructure. And infrastructure as code comes with some other benefits as well. You get one codified workflow to create or modify your infrastructure. You can integrate it with your application code workflows, and since it is code, you can treat it like an application: you can put it in Git or any source control management system, and you can have code reviews on it, deciding whether a suggested infrastructure change is acceptable, and so on.
With infrastructure as code, managing infrastructure is no longer something a person does manually by following a set of techniques; you can read the code and observe how it has evolved. The next biggest benefit, I feel, is distribution of knowledge. Having wiki pages listing all the resources in your infrastructure and describing how it works is not scalable. We tried that model in our team, but as the team grows, maintaining that doc is not efficient, believe me. One day your doc is dead, and you have no source of truth for how your infrastructure works; if something happens to the infrastructure, you are in trouble. After a bunch of comparisons, investigation, and POCs, we concluded that Terraform was the right tool for us; I'm not going to go deep into that comparison. So we decided to use Terraform for our infrastructure as code, since it met all our requirements. Let me go briefly over what Terraform is and how it works. Terraform provides a platform to write infrastructure as code across clouds. With infrastructure as code, we wanted a way to create our infrastructure from scratch, or modify it at any given point in time: some sort of reproducible infrastructure. At Sumo we call it "one-click infra", and that was the end goal we wanted to achieve. Here are some of the key features that attracted us to Terraform for our IaC. Before going deep into Terraform: how many of you know what Terraform is, or have heard of it? Quite a few. The first key feature of Terraform is the execution plan. Terraform has a planning step where it creates an execution plan: it compares your existing infrastructure with what your infrastructure is going to be, and shows you the difference.
It is a sort of dry run. It lets you avoid surprises and unpredictability when you actually modify your infrastructure, in the same way that, before giving this presentation, I did a dry run to avoid surprises. Next is the resource graph. Terraform builds a resource dependency graph: which resources depend on each other and which are independent. Terraform applies parallelism to the independent resources, so if a resource has no dependencies, its creation or modification can be done in parallel with other resources. With the resource dependency graph, Terraform builds your infrastructure as efficiently as possible, I must say. Next is change automation. Complex or small change sets can be applied to your infrastructure with minimal human intervention, because with the execution plan and the resource graph, you know exactly what is going to happen to your infrastructure and in what order Terraform is going to make the changes. You have no surprises at the end of it. Now, how Terraform works, and how we evolve our infrastructure with it. In Terraform, everything is driven by configuration: you define what your infrastructure should look like in config, and Terraform takes care of it. You write the config and run "terraform plan"; if everything looks good and that's what you wanted, you go ahead and run "terraform apply", and the cycle continues. This resembles software development: we write some code, run some unit tests, and if they pass, we commit the change; if something unexpected happens, we revert the change, run the unit tests again, and commit the revert. Terraform's workflow is much the same. Now I'll walk through a Terraform workflow.
Let's say you want to create a new AWS instance in your infrastructure. You add that instance and its configuration to the config file and run "terraform plan". The plan shows the new instance in green, green being the color for creation. If that looks good, you run "terraform apply", and within a couple of seconds it shows you the status: how many resources were added, changed, or destroyed. The workflow for an update is similar. Say you want to update the configuration of a resource. You update the config file and run "terraform plan"; the plan says this resource is going to be changed, that it is an update in place, and it shows the difference between the resource in the existing infrastructure and in the infrastructure you want, for example that the instance type will be changed from t2.micro to m4.large. If other resources depend on this resource and will be affected, it lists all of them too; then you go ahead and run "terraform apply". Before Terraform actually applies, it asks again whether you really want to apply the change; if you confirm, it makes the change. Terraform destroy is similar: you remove the resource from the config and run "terraform plan", and it shows the resource in red, indicating it is going to be removed from your infrastructure. So in this way we have an efficient, safe, and, I must say, predictable way to modify our infrastructure. Briefly, on how we use Terraform: we use it in a modularized way. By module, I mean a Terraform template: any small, reusable template of Terraform code is called a module.
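As a concrete illustration of the update flow just described, here is roughly what such a change looks like in Terraform config. The resource and all names here are hypothetical, a sketch rather than our actual setup:

```hcl
# main.tf -- a hypothetical grid node definition
resource "aws_instance" "grid_node" {
  ami           = "ami-0abcd1234"   # placeholder AMI ID
  instance_type = "t2.micro"        # edit this to "m4.large" and re-plan
}
```

After changing `instance_type` and running `terraform plan`, the plan marks the resource as an update in place, showing a diff along the lines of `instance_type: "t2.micro" => "m4.large"`; `terraform apply` then prompts for confirmation before touching anything.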
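The module pattern can be sketched like this. The module path and its input variables below are hypothetical; a real module defines its own inputs, and each team fills them in rather than copying the underlying resources:

```hcl
# Each team reuses the shared ELB "blueprint" instead of duplicating resources.
module "grid_elb" {
  source  = "./modules/elb"                     # where the shared template lives
  name    = "selenium-grid-elb"
  subnets = ["subnet-aaa111", "subnet-bbb222"]  # placeholder subnet IDs
}
```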
This avoids duplication of code. If a resource is required by different teams across multiple parts of your infrastructure, you define a module, and that module can be reused by all of them, so the same Terraform code is not duplicated again and again. Say different teams want to add an ELB to their infrastructure. You encapsulate all the logic and resources required to create an ELB into a template, hand that template out, and multiple teams can use it to create an ELB as efficiently as possible. A module can be treated as a blueprint that defines a specific part of the infrastructure, and the infrastructure itself can be thought of as houses built from those blueprints: from multiple blueprints, your big house, your big infrastructure, is ready in the most efficient way. Terraform does all of this amazingly well. I'll conclude my talk here, as I'm almost out of time, with a great saying we liked, a quote I picked from the ThoughtWorks blog, so thank you, ThoughtWorks: the enabling idea of infrastructure as code is that the systems and devices used to run your software can themselves be treated as software. You are using software to run software. Thank you, that's all from my side. I welcome any questions. Are there any questions? [Audience] Hi, you just said you run about one lakh tests per day. How are you going to track failures? There would be many. [Bharti] That's a good one. That was a very big challenge for us, because if you are running thousands and thousands of UI tests, some will fail, right? I wanted to explain this but ran out of time, so I'll explain it in very short brief; I have something prepared.
I'll go through this briefly. This is a slide from one of my office presentations, so I'm going to use it here. When you are running one lakh tests in a day, running one test multiple times a day and many tests at any given moment, there can be any number of failure modes: intermittent failures, many jobs failing for the same reason because you run multiple tests against a single change, a single PR, job failure patterns, job dependencies, and so on. What we do is use our own Sumo Logic product to save time here. We used Sumo dashboards, log search, and metric search, and built some pretty good dashboards around them. This one shows the overall stats: the number of tests being run and the success rate for each day, across all eight deployments we have in total, as I mentioned. And this one is pretty useful: when you run many tests against a single change, there is a good chance that one or more tests fail for the same reason. This screenshot was taken while we were building our infrastructure and manually tweaking it, and at one point we goofed up and messed things up: the dashboard shows that this many tests are failing for one single reason. You do not need to go into each and every job in Jenkins, which is what we use, to figure out the reason for each individual failure; the reason is right there in one place and your work is done. Here it was a 504 Gateway Timeout: something was messed up in the infrastructure, so go ahead and fix it; you need not waste your time debugging every failure separately. Also, as most of you probably use Jenkins or a similar system: Jenkins shows you only the current state.
It doesn't show you how many times a test has failed; it does not show intermittent failures. So what devs will do is retrigger the job and say, "Oh look, this is working, why were you saying it is not working?" But there can be any number of reasons for intermittent behavior. For that, we have these dashboards, which tell us, for example, that the free sign-up test failed 10 times in a day. No matter what the current state of the free sign-up test is, it might be green on Jenkins right now, but it has failed 10 times. If it ran 100 times and failed 10 times, there has to be some reason: API latency, network latency, an infra bug, anything, but there has to be some explanation. When we started running all these tests on every PR, we had to have two to three "Jenkins sheriffs" to handle this stuff; when you are running that many tests, debugging the failures becomes very important. These dashboards help a lot: we are now down to one Jenkins sheriff, who handles Jenkins in half a day, and the time saved is awesome. [Host] All right, let's give another round of applause for Bharti. [Bharti] Thank you. Thank you.