Am I audible? Yeah. Thanks, Pradipto. All right, let's get started. Thank you, everyone, for having me here. My name is Ruchi Singh. I work at Snapdeal as a software engineer on the cloud and infrastructure team. In this presentation I'm going to talk about the journey Snapdeal took in moving from a public cloud to its own private cloud, which is based on OpenStack.

So what are we going to cover in the next fifteen minutes? I'll tell you who we are and what we do, to give you an idea of why we migrated our complete infrastructure from a public cloud to our own private cloud: why we migrated our 300-plus microservices, our more than 200 data stores, and their dependent applications. Then I'll tell you how we planned that migration, what runtime gotchas we found and how we fixed them, and the rollback strategies we figured out in case of failure. After that, I'll share some key learnings.

So, Snapdeal. We are an e-commerce company, one of the largest e-commerce marketplaces in India. We have more than 30 million products in our catalog, and more than 300,000 sellers come to our platform to offer their products to our customer base. In the last few years we have seen phenomenal growth in our business and in the volume of transactions on our platform.

So what did we do with regard to OpenStack? We built our own private cloud; we call it Cirrus. Cirrus is a private cloud based on OpenStack. We have built more than 16 petabytes of storage for our infrastructure, and we are running in two different regions with more than 100,000 cores of capacity. Our networking is 40 Gb from the servers up to the spine nodes and 100 Gb above that, so we have very fast networking. And we are doing all of this with 100% automation, deploying everything through Ansible and some automation scripts we have written ourselves.
When we launched our private cloud, we got to know that we are in the top 4% of global OpenStack deployments in the world, which we were very happy to hear. We are running very high core density racks, more than 3,500 cores per rack, in a completely Clos-based architecture. So this is our cloud; we call it Cirrus.

So here the story begins. I'll make sure you guys don't sleep, because I'm going to tell you the story of how we migrated our complete infrastructure from AWS, which we used to use earlier, to our private cloud.

Before starting anything, big or small, we always make a checklist, and we did that here too. The first item on our checklist was: understand why you are migrating. We figured out why we were migrating and what made us start a project on such a big scale. We looked at where we were lagging behind, and from a business point of view, where we were spending a lot in terms of our infrastructure needs, whether we were having any performance issues, any security issues, and so on. After that, having strong reasons and factors, we were sure we had to migrate our infrastructure.

Strong planning was needed for this project, because it involves a lot of risk; this is a project where you can fail. When you are migrating all your services to a new data center, you have to ensure compatibility between the applications and the new data center. After that, gaining knowledge about the infrastructure was the main and biggest task for us; some applications may be a decade old.
So we decided to gain complete knowledge of our infrastructure first. We figured out how many services were currently running in our existing infrastructure, how many data stores they were using, and how they were communicating with each other. Essentially, we learned everything we could about the existing environment before starting the migration from one cloud to another.

After that, risk reduction and ensuring compatibility were the main concerns for us before starting. Extended or unplanned downtimes were the main risk we had to recognize and try to mitigate. To what extent the business can tolerate these risks depends on the importance of the application you are migrating at a particular point in time. A similar calculation was needed for the risk of data loss: the more important or sensitive the data, the more safeguards need to be in place to prevent its loss during migration. So we thought of disaster recovery processes and extensive backups that would work during the migration. Then, to ensure compatibility for all applications, we decided to try our migration on the staging environment before doing it directly in production, so that we could verify compatibility between all our applications and the new data center, including things like networking.

The fourth checklist item was network particularities and limiting latencies. In terms of networking, we had to ensure that each service had a predetermined place in the new network. For this it is important to consider all the aspects of firewall settings, domains, and trust certificates, to ensure full compatibility with the new network.
Talking about limiting latencies: controlling latency was an important and main concern for us before starting, especially for the services that are business critical. Knowing exactly which services work together, and the frequency of their communication, helps a lot in controlling latency.

And last but not least, getting everyone on board was also a huge task, because we are a big company: we have more than 5,000 employees and more than 70 components running in the company. We had to tell everyone that we were starting such a big project and that they all had to be with us during it. So these were the main items on the checklist we listed down before starting such a big project.

After that: why did we build our private cloud? What were the main reasons that made us start such a big project? Cost was one of the major factors. Like I said before, we were running in hyper-growth mode, and as our business grew, our infrastructure requirements also continued to grow, and our bill on the public cloud was phenomenal. We had to find ways to control or reduce that bill. It was very clear to us that at a certain scale, public clouds stop being cost effective. They are okay when you are just starting up, but at an inflection point you need to look at alternatives, and for us that was building our own cloud. But how did we make it cost effective? Our cloud runs completely on open-source technologies. We did some analysis of enterprise technologies like VMware, and we also calculated the operational cost of managing our data center. We came up with the plan that we had to do this, and we have a small in-house team to build, automate, control, and manage our data center platform.
The next big reason for us, besides cost, was performance and security. We wanted to get more performance from our infrastructure, and in a public cloud you are restricted, because you are in a kind of shared-tenant architecture and there is only so much performance a public cloud can offer. They can offer you more, but at a very high cost, and it is restrictive at the large scale we were looking for. By building our own cloud, we are now able to optimize it for our own use. We were able to put advanced security appliances, DDoS prevention, and intrusion detection into our data center, so it is definitely a step up from the security that a public cloud offers.

And lastly, data sovereignty. As I said, we are an e-commerce ecosystem, and we also have a digital wallet, which required us to make sure that all the money-related data we store in that application remains within the boundaries of India. At that time, the public cloud we were using did not have any region in India. So at least for that particular application, we had to look for an alternative region, to host it and keep its data within the boundaries of India. These were the four main reasons we built our own private cloud: lower cost, better performance, security and compliance, and data sovereignty.

Before starting our migration, we began by gaining knowledge about the existing environment. We found that we had more than 300 microservices running, and more than 200 data stores, including MySQL, MongoDB, Aerospike, Redis, Elasticsearch, RabbitMQ, ActiveMQ, and many more.
We figured out how these services were being deployed: manually, or through some automation. For deployment we use Chef, so first, for the services that were being deployed manually, we onboarded all of them to Chef and made them deploy through automation, so that in the new infrastructure there would be no manual deployment steps. So we first made all our applications deploy automatically.

After that came the planning steps. Earlier, in our public cloud, there was no central place where an ops person or a dev person could find out which application was running on which server, and like I said before, we have 300 microservices. So we listed down all our services in one central place, which in our case is a YAML file: all the critical information we needed for a particular application, like which port that service runs on, which applications it communicates with, and which databases it uses. That YAML file also acts as infrastructure as code in our case; we use it to apply infrastructure-as-code practices to our infrastructure.

After that, we divided all our services into small groups so that we could migrate them together: services that were tightly coupled and communicated frequently, so that we could reduce latency issues. So we divided all the services into smaller groups.
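To give an idea of how such an inventory can drive the grouping, here is a minimal sketch in Python. The service names and the field names (`port`, `talks_to`, `datastores`) are my own assumptions for illustration; the talk does not give the actual schema of the YAML file. The idea is the same: record each service's port, communication partners, and databases in one place, then derive the tightly coupled groups from the communication edges.

```python
# Hypothetical service inventory, mirroring the kind of YAML file
# described in the talk (schema and names are invented).
inventory = {
    "cart-service":    {"port": 8080, "talks_to": ["pricing-service"], "datastores": ["redis"]},
    "pricing-service": {"port": 8081, "talks_to": ["cart-service"],    "datastores": ["mysql"]},
    "search-service":  {"port": 8082, "talks_to": [],                  "datastores": ["elasticsearch"]},
}

def migration_groups(inv):
    """Group services that communicate with each other (connected
    components of the undirected communication graph), so tightly
    coupled services can be migrated together."""
    # Build an undirected adjacency map from the talks_to edges.
    adj = {name: set() for name in inv}
    for name, spec in inv.items():
        for peer in spec["talks_to"]:
            if peer in adj:
                adj[name].add(peer)
                adj[peer].add(name)
    groups, seen = [], set()
    for name in inv:
        if name in seen:
            continue
        # Walk one connected component.
        group, queue = set(), [name]
        while queue:
            svc = queue.pop()
            if svc in group:
                continue
            group.add(svc)
            queue.extend(adj[svc] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

print(migration_groups(inventory))
# → [['cart-service', 'pricing-service'], ['search-service']]
```

Services that talk to each other end up in one migration batch, so they never sit on opposite sides of the cross-cloud link for long.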
We also made a dependency graph to facilitate the migration, so that we had a visual way to see which services we had already migrated and which services we still had to migrate. We kept all our data stores in sync between the clouds, including MySQL, Aerospike, and MongoDB, so that we could avoid data corruption and data loss issues. After that, we migrated tightly coupled services together, according to our business flows. We have three main flows: the seller flow, the buyer flow, and the supply chain flow, and we migrated services according to these flows.

Planning is the most important key for every project, but there are always some runtime gotchas that you only discover at runtime, like the runtime exceptions we all get. We made some mistakes too. One scenario: in the middle of our migration, we used to put security groups on the ELB in our old cloud to allow traffic through to the actual servers. There were some services that communicated frequently, and one of them was still running in the public cloud while we migrated the first one. According to the code we had written, traffic was supposed to go to the public cloud, because the old service was still running there. But we forgot to put the security groups in place, and no traffic reached the public-cloud service where the actual service was running. So we learned to verify all these checks before serving actual traffic in the production environment.

The second one: we used to keep our databases in sync so that no data corruption would happen, but there were some scenarios where data did get corrupted and we were getting serious exceptions in our application.
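The dependency graph described above can do more than visualize progress; it can also tell you which services are safe to move next, namely those whose dependencies have already been migrated. Here is a small sketch of that idea; the service names and edges are invented, since the talk only says a dependency graph was used to track migrated versus pending services.

```python
# Hypothetical dependency graph: service -> services it depends on.
# (Names and edges are invented for illustration.)
depends_on = {
    "orders":   ["payments", "catalog"],
    "payments": [],
    "catalog":  [],
    "shipping": ["orders"],
}

def ready_to_migrate(deps, migrated):
    """Services not yet migrated whose dependencies have all been
    migrated already, so moving them next should not leave them
    calling across the cross-cloud link."""
    return sorted(
        svc for svc, reqs in deps.items()
        if svc not in migrated and all(r in migrated for r in reqs)
    )

print(ready_to_migrate(depends_on, migrated=set()))           # → ['catalog', 'payments']
print(ready_to_migrate(depends_on, {"payments", "catalog"}))  # → ['orders']
```

Repeating this after each batch yields a wave-by-wave migration order that respects the dependencies, which is essentially what migrating flow by flow achieves.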
In that case we took a dump of the data and loaded it onto the new servers, but that was quite painful to do. So we had to make sure all our database servers were syncing correctly, so that no data loss or data corruption happened. Another scenario: when we were migrating one of our services, the new machines were not able to handle the load, and we had to extend our infrastructure at runtime. To avoid this type of risk, we should have estimated, before starting the migration, the number of machines needed for that application to handle its load.

To avoid all these issues, monitoring your infrastructure is one of the most important things you have to do. We were monitoring every individual system as well as our applications. In our case, the order count was the critical metric: we had to ensure that our order count did not drop during our migration activities. We used Icinga and the EFK stack for monitoring, and we set up views to watch our infrastructure every time we did a migration activity.

There are also some rollback strategies we figured out. Traffic redirection: in case of failure, we had to ensure that within milliseconds we could redirect traffic back to the old servers in the old cloud, so that we would not keep hitting exceptions in the new cloud. Then database re-sync, which is also very important when migrating a service: before handing it over to QA, we used to keep the database in read-only mode, so that we could avoid some of the exceptions we had been facing. We also set a cooling period: we did not shut down the old service for around a month, until we found the service running flawlessly in the new data center.
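The order-count check described above can be sketched as a simple threshold rule: compare the current order count against a baseline taken before the migration activity, and trigger the traffic rollback if the drop exceeds a tolerance. The function name and the thresholds here are my own assumptions; the actual monitoring was done with Icinga and the EFK stack.

```python
def order_count_healthy(current, baseline, max_drop_pct=10.0):
    """Return True if the current order count has not dropped more
    than max_drop_pct percent below the baseline measured before
    the migration activity. (Threshold is illustrative.)"""
    if baseline <= 0:
        return True  # nothing meaningful to compare against
    drop_pct = (baseline - current) / baseline * 100.0
    return drop_pct <= max_drop_pct

# During a cutover: if the check fails, redirect traffic back to the
# old cloud within milliseconds, as described in the talk.
print(order_count_healthy(current=980, baseline=1000))  # → True  (2% drop)
print(order_count_healthy(current=850, baseline=1000))  # → False (15% drop)
```

In practice a check like this would run continuously against the metrics pipeline, and a failure would flip the traffic switch back to the old servers rather than page a human first.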
These are the tools we were using for our data center migration: our YAML file, which also acts as infrastructure as code in our case; Dendrite for service discovery, for which we wrote Nerve and Synapse code; SaltStack and Chef as orchestration tools; Git and Jenkins for our CI/CD pipelines; and some automation scripts we wrote ourselves.

Ruchi, sorry to be rude, but we have to stop here; we are over your time.

Sure.

Can you just recap quickly? And you can talk to the people in the booth, maybe.

Yeah, so actually I only had fifteen minutes to talk about our migration, but there is a lot more to know about it. Those who want to know more, I'm around; you can talk to me anytime. I just want to share some key learnings; it will take only a couple of minutes. Okay. Plan your migration well, and understand your services and your infrastructure before you start. We also created a live dependency graph. Don't migrate things as-is: fix the problems you have, and then migrate the application to the new cloud. Ensure that you set strict naming conventions, make sure that all launched services are registered with all your orchestration tools, and automate and monitor everything before starting such a big project on such a big scale.

Thanks, Ruchi. We'd like to stop here. Ruchi and a colleague are also going to be around tomorrow at the Icinga camp, if you're planning to attend; they will be talking about how they've used Icinga for monitoring. And of course Ruchi is also going to be around at the conference, so we can take more questions. Thank you.