Hi everyone. This is going to be a challenging session for me, because the next break is lunch, so hopefully all of you can withstand hunger until 1:15, the scheduled end of the session. Today we are going to talk about how we boosted our engineering productivity by building reliable test infrastructure. Unlike the previous session, where there was a commercial break in the middle, we'll do the commercial break at the beginning, just to introduce ourselves. My name is Arun; I lead quality and release engineering at Myntra, and my colleague Sreery is a lead engineer in release engineering. What do we do in quality and release engineering? We build quality and productivity platforms and tools that help us deliver faster with better quality; that's what this team focuses on.

How many of you shop at Myntra? Quick raise of hands. For those who don't, have a look at what Myntra is all about: we carry about 2,000-odd Indian and international brands with about 700K catalog items, we have 25 million monthly active users, and we do about 250K shipments a day. We are helping to make the world stylish, colorful, and happier through technology.

What will we cover in the next 45 minutes? We'll talk about what triggered the need to create this reliable test infrastructure; we'll touch on Dockins, our internal test infrastructure platform, and how it evolved; we'll do a quick demo of the platform; and we'll show how we integrated it with continuous integration and deployment. Before I get started, another quick raise of hands: do you use test infrastructure? Do you have multiple test infrastructures, or a single centralized one?

How it all started: as Myntra evolved over the last five to seven years, the complexity of the stack increased quite a bit. This is a live view of our production microservices; it doesn't even fit on the screen, but that's the mesh we have created. We have broken a lot of monoliths down into microservices, though a number of components still run as monoliths. As the stack grew, it became harder to maintain reliable test infrastructure for integration, performance, security, and end-to-end system tests. All of these became a real challenge: the more developers you have, the more code gets pushed in, and since everyone shared the same platform, the test infrastructure kept deteriorating. It led to a lot of unstable test environments, primarily because of bad code quality coming in (the last talk covered a few techniques for improving code quality). Debugging environment issues across these 500 services became a nightmare, and environment issues were derailing a lot of our projects from reaching our consumers.
As a result of unstable test infrastructure, our ability to certify releases with confidence also deteriorated: automated tests fail when they run on a brittle, volatile environment, and constant data changes and config changes meant we could no longer certify with confidence. That amounted to a loss of engineering productivity, because engineers spent most of their time debugging environment issues; maybe something changed in a dependent team's service, and now you're stuck, unable to proceed with your integration tests. The other challenge was that we were unable to develop or push features in parallel, because there were a few centralized environments everyone pushed code onto, and changes got blocked in the pipeline. Add frequent downtime, and the longer it takes to release to production, the more it eats into your engineering team's productivity; frustration kept building.

So we started off thinking: why don't we put all these 500 services in a box and give "Myntra in a box" to anyone who wants an infrastructure of their own? It sounded good, but it comes with enormous infrastructure cost. To give some perspective, putting all 500 services in VMs as a single Myntra-in-a-box environment takes a minimum of about 20 high-end Azure machines; imagine the entire engineering population needing one each, and it blows everything out of proportion. We thought about this quite a bit and finally concluded that running on VMs doesn't scale for us anymore, and we needed to start shipping all these services as containers.

The thought process was this: we wanted to containerize all services; we wanted a reference infrastructure equivalent to production, which we internally call stage; and we wanted anyone in the engineering org to be able to create a subset environment that falls back to stage for any dependencies it needs. We wanted ephemeral environments that get destroyed as soon as your dev or quality tests are done, and nobody should need to know how to deploy these services: everything should come up on its own. Last but not least, there are big bang projects: in the e-commerce world, compliance norms and other things keep changing, and a lot of services get impacted when certain big rollouts happen. For example, when India rolled out GST, many of these services had to change because of the way data propagates.
We also needed the ability to bring up a full-fledged environment, both from a test perspective and for production, so that if we have to handle BCP in the event of a disaster, we can quickly get the whole infrastructure stack back up. Those were the thought processes behind building this platform, Dockins. I'll have my colleague Sreery take you through what Dockins does, its architecture, and a demo of the platform. Thanks, Sreery.

First of all, can anybody guess what the name Dockins stands for? Exactly: Docker plus Jenkins. We've containerized on top of Jenkins and abstracted that layer away, because nobody does builds better than Jenkins; and if you know of a tool that does it even better, with more agility, do let us know and we'll try that too.

This is the architecture of the Dockins portal. It's a little complex at first glance, but as we go through the demo we can come back to this slide and look at what these components do. In a nutshell, there is a UI layer we built in-house and a controller that talks to the other components. We have RethinkDB as our data store, Jenkins (as you've already guessed), and a Docker registry where all our artifacts sit. There are also different kinds of environments: fixed environments, continuously evolving environments, and short-term environments. The fixed environments are the stage and integration environments; stage is a production look-alike, while integration is a bleeding-edge, forward-looking environment. Your feature tests run in a QA environment, which can be recreated as a smaller subset and does not need the complete set of services.

We'll jump into the demo now, and I'll show how a QA environment is created: how a developer can quickly create an environment for their own tests without interference from anyone else. This is the Dockins UI we built, and this view is where you create your different types of clusters. I'm going to create a QA cluster with the Myntra website; suppose I'm a developer with changes to the API backend of Myntra. First I give it a name, say agile2. As you can see, my email ID and my team come up automatically, and I can set a TTL for this environment, say 20 hours: I just want to run a few tests, automated or manual, within those 20 hours. There are a few other configuration choices, such as which fixed environment to fall back to; for now we have only stage. Since I'm changing the Myntra API, there's a service called the API gateway; I select it, and the different configurations for this application come up, starting with the build type.
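To make the form concrete, here is a rough sketch of what such a cluster-creation request might capture. This is an illustrative assumption about the shape of the data, not the actual Dockins API; every field name here is hypothetical.

```python
# Hypothetical shape of a Dockins cluster-creation request; the field
# names and values are illustrative, not the real interface.
cluster_request = {
    "name": "agile2",                 # cluster name typed into the form
    "owner": "dev@example.com",       # auto-filled from the logged-in user
    "team": "platforms",              # auto-filled team
    "ttl_hours": 20,                  # environment is destroyed after this
    "fallback": "stage",              # fixed environment for unresolved deps
    "services": [
        {
            "name": "api-gateway",
            "app_type": "nodejs",
            "build_type": "gradle",
            "branch": "master",
            "ports": [8080],
            "depends": ["browse-haproxy"],  # pulled in automatically
            "image": "stage",         # reuse the stage image, skip the build
        },
        {"name": "reincarnation", "image": "stage"},  # Jabong website
    ],
}
```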
It's a Node.js app type with a Gradle build type, and I'll stick with the master branch. The other configurations cover things like the build command and which ports to expose, and you'll also notice something called depends. The API gateway directly depends on one of our HAProxy configurations, so if I add the API gateway to my cluster, this browse-HAProxy application tags along. This way developers don't need to know their base dependencies: they just add their own service, and the base dependencies come through with it. I'll start with the API gateway, and I'll also add the Jabong website, a service called reincarnation, and hit create. When I hit create, I'm given two options: build a new image, or just use the stage image. For time's sake I'll use the existing stage image. As soon as I create this, it starts streaming the application logs, beginning with network creation. This will take a while, so let's go back to the slides, see what's happening, and then come back to the demo.

These are the QA environment creation stages. The first stage is build and package: you build your application, turn it into an image, and push the image into the Docker registry (right now I'm using the stage image, which skips the build because building would take much more time). Next, since we use Docker Swarm in the backend, there's a concept of network subnets: a small subnet is carved out for this QA environment, and all the services you selected, here the API gateway and reincarnation, reside in that subnet. We chose subnets because it's easy to talk within a subnet, and it contains that traffic. After network creation comes container creation, using Docker Swarm services: this creates the container from the image and runs a basic health check to ensure the container is live. The last stage of any environment creation is adding the route to HAProxy. HAProxy is our ingress controller here, because you can't expose every port individually; we have around 500 services, and you can't expose 500-plus ports on the network. HAProxy acts as the ingress router, so every service is accessed via the standard ports 80 and 443, and based on the host mapping we redirect traffic to the appropriate container. The HAProxy update is the last step. QA environments are non-fixed environments; I gave mine a TTL of 20 hours. The stage and integration environments are fixed, and the only difference in how they're created is that you don't need to create a network; that's the one difference between a QA environment and the stage environment.
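A minimal sketch of these creation stages, using the Docker SDK for Python against a Swarm manager. The names, subnet, and registry are illustrative assumptions, not the actual Dockins internals.

```python
import time

import docker

client = docker.from_env()

# Stage 2: carve out a small overlay subnet for this cluster so its
# services can talk to each other directly and the traffic stays contained.
network = client.networks.create(
    "qa-agile2",
    driver="overlay",
    ipam=docker.types.IPAMConfig(
        pool_configs=[docker.types.IPAMPool(subnet="10.10.42.0/24")]
    ),
)

# Stage 3: create each selected service inside that subnet as a Swarm
# service, from the image pushed during build-and-package (here, the
# existing stage image, so no rebuild).
service = client.services.create(
    image="registry.example.com/api-gateway:stage",
    name="qa-agile2-api-gateway",
    networks=["qa-agile2"],
)

# Basic health check: wait until the service's tasks report "running".
while True:
    tasks = service.tasks()
    if tasks and all(t["Status"]["State"] == "running" for t in tasks):
        break
    time.sleep(2)

# Stage 4 would register a host-based route with the ingress HAProxy
# (e.g. qa-agile2-api-gateway.dockins.myntra.com -> this service) and
# trigger the periodic HAProxy reload.
```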
Here is how the actual routing happens. In this example there are three clusters, C1, C2, and C3. C1 has two services, A and B; C2 has another two, C and D; and C3 has B, C, and D at different versions. All of them reside in the same swarm, each in its own subnet, without any contact with the others. Through the QA HAProxy people can access these services, while the pre-prod HAProxy provides the fallback onto the stage environment.

Let's quickly go back and check on our cluster. If you look at the logs, you'll see the service health check for the API gateway was done; for reincarnation the health checks were done as well; and you'll notice that the browse-HAProxy service was added automatically and has gone through its fair share of health checks too. After that we're waiting on the HAProxy to be reloaded, which can take up to five minutes: the reload happens every five minutes. Once it's done, each of these services, API gateway, reincarnation, and browse-HAProxy, gets its own URL. The URL is auto-generated and identifies which service you want to reach. Using the browse-HAProxy and reincarnation URLs we should shortly see the Myntra homepage (I'm working on the API gateway side of the Myntra website) and the Jabong site.

How does it actually work? Here's another example: a cluster named C1 with two services, A and B. Service A gets the URL c1-a.dockins.myntra.com, and service B gets c1-b.dockins.myntra.com. Internally, one of the biggest USPs we got out of this is that no config changes are needed within a cluster. One thing we noticed in our older stack was that whenever you created a new environment, you had to go in and change configs and DNS entries so the services could talk to each other. What we've done here, using Docker Swarm's aliasing technique, is give c1-a.dockins.myntra.com an alias, d7a.myntra.com. Services A and B talk to each other using the internal d7 URLs; for external access you continue to use c1-a.dockins.myntra.com. The advantage is that since we're in the same subnet (and this is the main reason we put each cluster's network in its own subnet), the d7a URL maps to the A service within that subnet, and the d7b URL maps to the B service within it. The resolution order works like this: if A or B talks to another service D that is not in your cluster, the first lookup happens within the subnet, and if it's not found there, it falls back to stage. This way a developer can just bring their services into a cluster and forget about the rest.
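A sketch of this aliasing trick, under the assumption that Dockins uses Docker's network-scoped aliases; the image, network, and alias names are illustrative.

```python
import docker
from docker.types import NetworkAttachmentConfig

client = docker.from_env()

client.services.create(
    image="registry.example.com/service-a:stage",
    name="c1-a",                        # external: c1-a.dockins.myntra.com
    networks=[
        NetworkAttachmentConfig(
            target="qa-c1",             # the cluster's own overlay subnet
            aliases=["d7a.myntra.com"], # internal name existing configs use
        )
    ],
)
# Inside the qa-c1 subnet, a lookup of d7a.myntra.com resolves to this
# service, so service B in the same cluster needs no config change to find
# it. A dependency that is *not* deployed in the cluster fails the local
# lookup and falls through to the stage environment's instance instead.
```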
So a developer can fall back to a bleeding-edge environment like integration when, for example, a change is going across the whole board: a few services have already landed in the integration environment and are ready for production, and you want to test your set of services against that bleeding edge. But if your change has no relation to other in-flight work and you just want to see how your service would behave if dropped into production right now, you fall back to stage. When we started we had around 80 to 100 services, and that grew to 500-plus; having all of that in a single environment for every team, or every developer, would have been a very costly affair. With this clustering logic, keeping only the services you need in your cluster and falling back to another environment for the default dependencies, we saved a lot of money.

Let's go back: yes, the cluster has completed successfully. I'll open the browse-HAProxy URL, just as a developer would in their browser, and this is the Myntra website that has just been created. If I wanted to, I could go change the branch for my API gateway, deploy my own feature branch, and redeploy the application. Similarly, there's the reincarnation URL, the Jabong website; let's check that it's up too. There it is. Both these URLs can also take direct API hits. This agile2 environment is my own: nobody else has access to it, and nobody else can modify anything, so my tests remain mine, unless I decide to share my cluster. If I click this button here, it shares my cluster, and my whole team can view it, even edit and rebuild it. Clusters are used as automation test beds this way too: teams keep a team cluster with all their services, maybe 10, 15, 20 (we've had clusters with 80 or 90 services), all running together, with automation tests driven from Jenkins.

Back to the slides. This is our stack; we can talk offline about what's used in each of the layers. As for why Docker Swarm: this project started around three, three and a half years ago, which is why you don't see Kubernetes on these slides yet and still see Swarm. This was obviously not an easy thing to do, because you're asking 500-odd engineers to completely change how they've been working, with VMs, shared environments, and the pain of getting things to production fast; giving everybody mutually exclusive environments was going to be a huge shift. Swarm did help us, but it also caused its fair share of problems early on: Swarm was not that mature a couple of years ago when we started, and some of those problems in Swarm's network layer came back to haunt us.
One of the things we did about that was to not have one swarm with 50 or 60 nodes; instead we ended up with six swarms of 10 to 15 nodes each. That way Swarm was able to handle all the load. We did this fairly recently, to a lot of success.

Scaling and access to those environments was the other piece, so I'll quickly show accessing a particular cluster. Suppose I'm the API gateway developer and I want to SSH into the API gateway application and look at the application logs. For this we wrote a small web-shell tool, which is basically an SSH interceptor. Every cluster comes with an SSH command and a password for each of its services, so I'll copy the SSH command for the API gateway. Can you see this, is it visible? Much better. I SSH into the service, and I'm inside the container. This isn't a normal SSH into a container: it's an SSH service that intercepts your SSH command and then does a docker exec into the container. You can see it's a Node.js application running, and you can look at the logs or whatever you need. Every container that gets created can be SSH'd into this way: modify it, restart it, because it's all yours. Obviously you wouldn't restart containers in a shared environment, but if you're testing, or even developing, you can quickly create a cluster with your service, SSH in, make a few modifications, see how it pans out, and restart your service. So access was solved too.

Another problem was onboarding, since there was no standard across the stack: we had Node.js, Golang, Java, and Python services, quite a few. We had to standardize all of them so our Docker containers understood a certain format: where the logs are, how the service starts up, those kinds of things. That was another challenge, but nowadays there's a standard format in which people provide us the information and get onboarded quickly, and it's becoming a self-service portal as well. Another problem was abuse of capacity: since anybody could create clusters, people created them left, right, and center, and we would easily run short of capacity with our cloud provider. So we added limits, per team and per cluster, limits on the number of services you can add to a cluster, and questions on why you couldn't have done it with a smaller set of services; those limits, enforced in the web UI itself, solved a lot of that. And in the end, the network challenges did cause reliability and robustness problems, but once we solved the network, reliability largely followed. As of now we have around 2,000-plus services running on roughly 2,500 cores, with about 30 TB of data flowing across this whole stack, and anything that goes to production has to come through this stack.
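The web-shell trick described a moment ago, intercepting SSH and turning it into a docker exec, might look roughly like this. The lookup table and shell entrypoint are illustrative assumptions; the real tool sits behind an SSH service.

```python
import subprocess

# Hypothetical mapping from a cluster service name to the id of the
# container backing it; the real tool resolves this dynamically.
SERVICE_CONTAINERS = {
    "qa-agile2-api-gateway": "a1b2c3d4e5f6",
}

def open_shell(service_name: str) -> None:
    """Drop the caller into the container backing the given service."""
    container_id = SERVICE_CONTAINERS[service_name]
    # -it allocates an interactive TTY, so it feels like SSHing into a VM,
    # but the session actually lives inside the container.
    subprocess.run(["docker", "exec", "-it", container_id, "/bin/sh"],
                   check=False)

if __name__ == "__main__":
    open_shell("qa-agile2-api-gateway")
```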
So what was the impact; how did this actually improve productivity? Parallel development became possible, because people could create their own environments: each team could create its own sub-environments, and a single team could run multiple sub-environments for different branches and features of the same service. 32K automated tests were being run in parallel, which helped a lot with production issues. Average production deployments also went up, because before this, deploying code to production was a big headache and people really didn't like doing it. Developers could now focus on what they do best: developing. When your test environments are stable, your production deployments just flow through, and once a service is onboarded onto our framework there is absolutely nothing else the developer needs to care about. Overall this gave much more focus to development work, and the wait for somebody to free up a test infra so your team could use it, all of those problems just vanished.

Going forward, we're also planning to put this in the middle of our CI/CD architecture; this is in the works right now. This diagram shows the multiple flows: at the bottom right there's a commit review flow, very similar to what Naresh was talking about earlier, with feedback on the PR. That produces integration candidates, the ICs. The ICs go through the integration flow, where QA environments are churned out and all these tests run; those become your delivery candidates. Delivery candidates go to the bleeding-edge environment, the integration environment, where the delivery flow runs a 24-hour automation suite. Once that passes, it becomes your release candidate, the RC, and the RC goes through a deployment flow, with approvals, sign-offs, and the reporting mechanism, and into production. This is something we're continuing to work on.
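An illustrative model of that promotion flow. The stage names follow the talk (IC, DC, RC); the gate function is a placeholder, not the real pipeline implementation.

```python
# Each flow promotes an artifact to its next status if the gate passes.
PIPELINE = [
    ("commit-review", "IC"),    # PR review and feedback gate a commit
    ("integration",   "DC"),    # tests in a churned-out QA environment
    ("delivery",      "RC"),    # the 24-hour automation run on integration
    ("deployment",    "PROD"),  # approvals, sign-offs, reporting, production
]

def promote(artifact: str, gate_passed) -> str:
    """Walk an artifact through the flows; stop at the first failing gate."""
    status = "COMMIT"
    for flow, next_status in PIPELINE:
        if not gate_passed(flow, artifact):
            return status          # keeps its last earned status
        status = next_status
    return status

# promote("api-gateway@abc123", lambda flow, _: flow != "delivery") -> "DC":
# it passed review and integration but failed the 24-hour delivery run.
```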
These are the other things that are coming. Like I said, you haven't seen Kubernetes so far, and network problems are a thing in Docker Swarm. There have been solutions and fixes, quite a few, and we've tried multiple versions, but we think moving to Kubernetes is the better option rather than waiting for fixes from the Docker Swarm side. Simple IP routing is also preferable to Docker Swarm's embedded DNS technique, which is quite a closed area: we don't really know how they implement it. So instead of investing more time in Docker Swarm, the plan is to switch completely to the cluster-IP model on Kubernetes, and to stop pinning services to fixed machines by using PVCs; volume claims are a standard feature in Kubernetes nowadays, so you don't need the fixed-machine model we use in Docker Swarm. That gives a much more stable sandbox, because you're not dealing with DNS-level service issues. Even on Kubernetes, one of the things we plan to do is not use the Service workload at all and instead use controllers and cluster IPs for direct access, and CoreDNS is something we're looking at for this as well. Right now there's a slight difference between fixed environments and QA environments, but that will also become one standard flow: everything will just be an environment.

To wrap it up: investing three, three and a half years in making our test infra robust and self-service really boosted our engineering productivity and also improved quality. I think you should do the same. These are a few tools we've built; some are up on the links and open sourced, parts of the project, so you can take a look at them. Some of them are orchestrators; hap-reload is just an auto-reloader for HAProxy, like the last step in my environment creation demo. Right, I think we have time for questions.

Q: When your test environment falls back for some services onto the stable environment, don't you have cases where your test environment creates data in, or destabilizes, the stable environment? In your stable environment you'll start seeing requests in the logs that come from outside it; doesn't that create confusion?

A: I understood your problem, and it's something that exists right now. As of now, most of the data being created is forward-looking and doesn't really affect the stable flows; that has saved us, in a way. But yes, it is a problem, and we haven't completely tackled data as such. Even now we use the same databases across all environments, because our view is that every schema change on the database should be backward compatible; you cannot make completely breaking changes unless you're switching over. Suppose there's a change to a column's type: we usually ask developers to instead create a new column, migrate the data, and keep the old one around. Those kinds of methodologies ensure backward compatibility, and they've solved a few of these problems. But yes, quite a few other problems do arise from erroneous data, and we've pushed that forward. Beyond that, the way we've structured this, if data becomes critical it can also be brought into the cluster. When we started off, our schemas had sprawled so far that no single person understood the database in its entirety, but a few teams have gone ahead and created database services so the database can be brought into your localized cluster; then whatever you impact is only the local databases.
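The pattern just described, add a new column, backfill it, keep the old one, is the classic expand-and-contract migration. A minimal runnable sketch, using sqlite3 and an invented orders table purely for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount TEXT)")
db.execute("INSERT INTO orders (amount) VALUES ('499'), ('1299')")

# Expand: add the new, correctly typed column alongside the old one,
# instead of changing the old column's type in place.
db.execute("ALTER TABLE orders ADD COLUMN amount_paise INTEGER")

# Backfill: migrate existing data into the new column.
db.execute("UPDATE orders SET amount_paise = CAST(amount AS INTEGER) * 100")

# Old readers keep using `amount`; new code reads `amount_paise`. Only once
# every cluster, stage, and production consumer has switched do you contract
# (drop the old column) in a later release.
print(db.execute("SELECT id, amount, amount_paise FROM orders").fetchall())
```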
Q: First of all, that's a great presentation. Once you bring up the infrastructure for someone to test, is that same infrastructure used to run your suite of automated test cases, or do you have a different infrastructure for that?

A: It can be the same. As a team, I can decide that one sub-environment of mine is for manual testing and another is for my end-to-end automation suite, with my Jenkins configured to talk to a particular environment. By convention they have configurations on the Jenkins side too: you send in a parameter with the QA environment's name, and it points at a different environment. So you can have separate environments for each of your activities.

Q: And you drive that through the same workflows?

A: Exactly, it's the same thing; in the end it's just another environment. The other possibility is that all of this is API-fied too, so technically, before you run the tests, you could spin up an environment, run the tests, and destroy the environment; you could do that even on CI. Sometimes, though, you want to save the time of bringing the infrastructure up again and again, so a few teams use our concept of persistent clusters, which don't have a TTL. Those are usually used for automation suites and ever-running tests, while all the others are dev environments or QA manual test environments.

Q: Again, thank you for the presentation, it was very nice. I'm interested in understanding how you test this infrastructure itself, whether it works, and how frequently you upgrade it. Just one example: there's Jenkins core; they keep releasing stuff, security fixes for instance, and you have to keep upgrading for the vulnerabilities. Every upgrade is a new change, so how do you ensure your infrastructure as a whole keeps working?

A: As of now we've done quite a few upgrades with respect to Docker, and we consistently use Ansible for upgrades across, I think, around 200-odd machines running the whole stack in total; upgrading Docker across the whole stack is a big job. On swarm-level upgrades: another solution we tried earlier for the Swarm network issue was to just recreate the whole swarm. Overnight, at 3 or 4 in the morning, an automated cron would tear down the whole swarm, recreate all the services and all the environments, and get a fresh network layer, rebooting the underlying machines along the way, because accumulated persistent state from long-running swarms was one of the major problems.

Q: And do you have automated tests to verify that this framework itself is working and not giving false positives?

A: Right. Ansible is how we run those checks, to ensure the swarms are up and ready and the services are in the state they were in before; it's just a Jenkins job right now that runs the Ansible suite on a regular basis.
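The spin-up, run, tear-down pattern mentioned a couple of answers back, sketched against a hypothetical Dockins-style REST API; the endpoints, host names, and payload shape are illustrative assumptions, not the real interface.

```python
import requests

DOCKINS = "https://dockins.example.com/api"  # hypothetical base URL

def run_tests(base_url: str) -> bool:
    """Placeholder for a team's actual automation suite."""
    return requests.get(f"{base_url}/health").ok

def run_suite_on_fresh_env(cluster_spec: dict) -> bool:
    resp = requests.post(f"{DOCKINS}/clusters", json=cluster_spec)
    resp.raise_for_status()
    cluster_id = resp.json()["id"]
    try:
        # Point the suite at the per-cluster URL and run it.
        return run_tests(f"https://{cluster_id}.dockins.example.com")
    finally:
        # Ephemeral by design: destroy the environment even if tests blew up.
        requests.delete(f"{DOCKINS}/clusters/{cluster_id}")
```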
Q: On the other side, you showed the form you fill in to get the services. If you'd like to leverage this for, say, CI, where you don't have the form-filling, do you templatize this standard configuration somewhere? How does that work?

A: Overall, we've got four or five types; based on the build types there's Gradle, Maven, and some are just shell builds and the like. So we've got a standard format, a set of questions developers fill in while onboarding their services, and after that it's just a matter of entering the values. Just to add: we encourage people to onboard at repo creation. The way we're looking at it, anyone who wants to create a new microservice in the company just comes to this platform and gives the name, and right from the repo, onboarding into Dockins, onboarding into CI, and production provisioning should all happen quite seamlessly. That's the forward-looking direction we're working towards.

Q: Last question: how do you get the sample data set, the seed data, for these small databases?

A: When we started this, there were a couple of techniques. We built the stage and the initial environments by taking production data, the entire set, and pruning it to strip sensitive information and anything one shouldn't be seeing; it runs through a number of pruning steps, and that becomes the seed data. There are teams where, say, a new category of products gets released on production but isn't on stage yet; those teams, once in a while, bring the data from production into stage in an automated fashion using the same pruning techniques. We're also looking at going the other way on the integration environment: seeding some of the data from that side before it goes to production. So it works both ways, and you end up with the right test data baked in for your tests. Thank you. Thanks a lot.