Hello everyone, I am Anush Arvind and I take care of DevOps, SRE and infrastructure at Freshworks. The product I primarily work on is Freshservice, an IT service management tool with the typical ITIL features: incident management, asset management, release management, service catalog and so on. My team is responsible for DevOps for Freshservice, and when I am not working I am a real travel freak, I love to climb mountains, and I am also a big FIFA person.

So why did we need to merge two live data centers into one? That is what we are going to see. By the end of 2016 we had two data centers, one in US East and another in Ireland, and at the beginning of 2017 we launched a new data center in Frankfurt. There had been a lot of requests from people in the Europe region asking for a Frankfurt data center, and Frankfurt had a lot more benefits, which we will get to shortly. So because of the demand we launched the Frankfurt data center, and we were now present in two places in Europe. What we started noticing was that everyone in Europe preferred the Frankfurt data center over the Ireland one. Signups in Ireland gradually reduced, everyone started preferring the Germany data center, and one of the most frequent asks was from customers in Ireland approaching us and saying, "Hey, can I be moved to the Frankfurt data center? I would like to be hosted there instead."

Why did people start preferring Frankfurt? GDPR came along around that time, and Germany had very stringent rules with regard to data protection, so customers felt it was better if their data resided only in Germany. We also had a lot of government deals going on in Germany.

For us, running two data centers, one in Ireland and one in Germany, both essentially in the same region, meant a lot of additional cost: we would end up maintaining two separate data centers in Europe and taking care of deployments and monitoring for both. So we came up with this idea: you have an Ireland data center and a live Frankfurt data center; migrate the Ireland data completely into Frankfurt and merge it into the live data center. In the end you have one data center in Europe, Frankfurt, into which all the Ireland customers have been migrated.

There were a lot of things we had to consider. Tickets being the main module of Freshservice, there are high volumes of tickets for both our Ireland and Frankfurt customers: we see tens of thousands of requests per minute, and the same order of rows written into our databases. We needed to consider all the components present in the product, identify all the layers, and understand the impact of migrating each one, whether a given piece needs a downtime or not, and so on. When it comes to the data stores, there are a lot of cases where data collision is possible: you have similar kinds of data in your source and destination, so how are they going to collide and how are you going to handle that? And we have a lot of high-paying customers across both data centers.
So you need to give them a seamless migration, and you need to make sure your customers' data is secure and you do not play around with it. Consistency was another important thing.

One more problem: for the same data store we had different services across Ireland and Frankfurt. Originally we started off in Ireland with the set of services we knew well; later, as we learned more and became more mature, we opted for better solutions in Frankfurt. One example: in Ireland we were using ElastiCache as the service for Redis, which we use as a config store, while in Frankfurt we were using Redis Labs. The compatibility between all of these needed to be taken care of during the migration.

So there was a set of steps we did before the migration; then the actual migration, which was switching the traffic from source to destination; and then a few rounds of testing. These are the steps we did before the actual switchover.

We initially gave a migration notice to all our customers, saying they were going to be migrated and that there would be a small downtime for the Ireland customers alone while the actual switch happened. After this we are going to look in detail at every single component of our infrastructure, how each was migrated, and what solution was in place. One more point: a lot of customers in Ireland had pointed at Ireland resources directly. One of the features in Freshservice is that customers can use their own domains for our portal; that is established by the customer creating a DNS record for their domain that points at our resources. The problem is that when you migrate from Ireland to Frankfurt, that resource changes, so we also had to ask all those customers to update their DNS records before the migration, otherwise requests to the customer's domain would fail. That was part of the migration notice given to customers.

Next: in order to make our data flow from Ireland to Frankfurt, which are separate VPCs, we obviously cannot transfer data over the internet, and we cannot make any of the resources publicly discoverable. All data needs to go through a private network, so we leveraged VPC peering, which is nothing but a private tunnel built between the two VPCs according to the rules you specify in your route tables. Your Ireland resources and Frankfurt resources now become discoverable to each other over this private network, and through that you can let data flow. This was the first step of the solution: with this, both VPCs become discoverable to each other so data can move from one VPC to the other.
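For anyone who has not set this up before, here is a minimal sketch of that peering setup with boto3. All the VPC IDs, route-table IDs and CIDRs are placeholders, not our actual values, and the route has to be added on both sides.

```python
import boto3

ec2_ireland = boto3.client("ec2", region_name="eu-west-1")
ec2_frankfurt = boto3.client("ec2", region_name="eu-central-1")

# Request a peering connection from the Ireland VPC to the Frankfurt VPC.
peering = ec2_ireland.create_vpc_peering_connection(
    VpcId="vpc-ireland111",          # placeholder
    PeerVpcId="vpc-frankfurt222",    # placeholder
    PeerRegion="eu-central-1",
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# Accept the request on the Frankfurt side.
ec2_frankfurt.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

# Route the other VPC's CIDR through the peering connection
# (shown for the Ireland side only; Frankfurt needs the mirror route).
ec2_ireland.create_route(
    RouteTableId="rtb-ireland333",        # placeholder
    DestinationCidrBlock="10.20.0.0/16",  # assumed Frankfurt VPC CIDR
    VpcPeeringConnectionId=pcx_id,
)
```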
And attachments is one of the modules in Freshservice: as part of all the modules you can upload various kinds of files, and for all of them we use S3 for storage. So Ireland has a set of attachment buckets in Ireland, and Frankfurt has its own set in Frankfurt. There were two steps to this: first, the existing objects in the Ireland buckets needed to be moved to Frankfurt; second, any new attachment stored in Ireland for the customers who were to be moved needed to be replicated into the Frankfurt bucket.

So this is what we did: we identified a particular point in time, copied all the objects that existed up to then into Frankfurt with a script, and enabled cross-region replication between the Ireland and Frankfurt buckets. Cross-region replication is an option available in S3 itself, where you can make one bucket replicate the contents of another. Through this, the existing objects were copied manually and all new objects started being replicated, so we achieved live replication in S3 for all the attachments. When this migration was done we had to do it in two steps, but today S3 itself offers a feature that takes care of both replicating your existing objects and maintaining ongoing replication.
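A rough sketch of those two steps with boto3, with bucket names and the replication role ARN as placeholders. Note that cross-region replication also requires versioning on both buckets, which is omitted here.

```python
import boto3

s3 = boto3.client("s3")

# Step 1: one-time copy of every existing object from the Ireland bucket
# to the Frankfurt bucket (bucket names are illustrative).
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="attachments-ireland"):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket="attachments-frankfurt",
            Key=obj["Key"],
            CopySource={"Bucket": "attachments-ireland", "Key": obj["Key"]},
        )

# Step 2: enable cross-region replication so every new object written to
# the Ireland bucket is replicated to Frankfurt from here on.
s3.put_bucket_replication(
    Bucket="attachments-ireland",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",  # placeholder
        "Rules": [{
            "Status": "Enabled",
            "Prefix": "",  # replicate everything
            "Destination": {"Bucket": "arn:aws:s3:::attachments-frankfurt"},
        }],
    },
)
```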
Next we can look at databases. In Freshservice we use sharding for horizontal scaling of our databases. A lot of databases are sharded, and since those are just horizontal partitions it is not a big deal: you simply make a copy of those databases as separate shards in Frankfurt. So all the shards in Ireland were duplicated as extra shards in Frankfurt, and there was no problem.

But when it comes to sharding, you also have a global database, your configuration database, which holds the configuration data about what data resides in which shard. That is one database where you have a similar set of tables across all your data centers, with pretty much the same data, so there is going to be a data clash: you obviously cannot migrate the data as-is into what is effectively the same database, and the tables are also going to have an ID clash. Say the configuration rows stored in Ireland run from 1 to 10, and the same kind of entries exist in Frankfurt, also 1 to 10. How are we going to replicate this master database? This was a very interesting problem to solve.

For the sake of argument, assume Ireland has entries 1 to 10 and Frankfurt has entries 1 to 10. In your Frankfurt database you can make a dummy entry with null values and give it a large ID, say 100. By the rule of auto-increment, all new entries coming into the Frankfurt database now start from 101, 102 and so on. Then take the actual 1-to-10 rows already present in Frankfurt and update their IDs into another range, say 51 to 60. Now you have 1 to 10 in Ireland, 51 to 60 in Frankfurt, and new Frankfurt entries arriving from 100 onwards. New entries in Ireland continue from 11 onwards, so that whole Ireland set can simply be migrated into the same table in the Frankfurt database. This is how the clashing data was merged between the two data centers.

One important thing to remember here: if you are writing directly into your database like this, you need to make sure that if the data is cached by your application, the application updates the value or clears the cache, otherwise you are going to get inconsistencies in your application.
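In SQL terms, the trick is only a couple of statements. Here is an illustrative sketch; the table, columns and connection details are made up, and it assumes the other columns are nullable so the dummy row can be inserted.

```python
import pymysql

# Connection details and the table name are illustrative.
conn = pymysql.connect(host="frankfurt-global-db", user="admin",
                       password="...", database="global_config")

with conn.cursor() as cur:
    # 1. Dummy row with a deliberately high ID. MySQL's auto-increment
    #    counter follows the highest ID, so every *new* Frankfurt row now
    #    lands at 101, 102, ...  (ALTER TABLE ... AUTO_INCREMENT = 100
    #    would achieve the same thing without the placeholder row.)
    cur.execute("INSERT INTO shard_mappings (id) VALUES (100)")

    # 2. Shift the existing Frankfurt rows (1-10) into a range that no
    #    Ireland row will ever occupy.
    cur.execute("UPDATE shard_mappings SET id = id + 50 "
                "WHERE id BETWEEN 1 AND 10")

    # 3. The Ireland rows (1-10, plus new ones from 11 onward) can now be
    #    copied into this same table with their IDs unchanged.

conn.commit()
```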
And we spoke earlier about customers accessing the portal through their own domains. The solution there is that we take in the customer's private key and certificate body and import them as certificates into our load balancer, so they can access their portal over HTTPS with their own domain. We had one set of certificates on our Ireland load balancer for the Ireland customers and another set in Frankfurt for the Frankfurt customers, so we just needed to transfer the Ireland set to Frankfurt. AWS has a service called ACM which takes care of secure certificate management. We just needed to boot a new virtual machine, copy all the private keys from Ireland, import them into the Frankfurt ACM, and add them to the Frankfurt load balancers. Now any customer who wants to access the same domain over HTTPS in Frankfurt will also work. But for this to work, the customer must have updated their DNS to point at the Frankfurt resource by the time the migration completed.
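The import itself is a small boto3 call per certificate. A hedged sketch, with file paths and the listener ARN as placeholders; in our case a short-lived machine read the key material from the Ireland side over the peered network.

```python
import boto3

acm_frankfurt = boto3.client("acm", region_name="eu-central-1")
elbv2 = boto3.client("elbv2", region_name="eu-central-1")

# Import one customer's certificate material into the Frankfurt ACM.
with open("customer1.crt", "rb") as cert, \
     open("customer1.key", "rb") as key, \
     open("customer1-chain.crt", "rb") as chain:
    arn = acm_frankfurt.import_certificate(
        Certificate=cert.read(),
        PrivateKey=key.read(),
        CertificateChain=chain.read(),
    )["CertificateArn"]

# Attach the imported certificate to the Frankfurt load balancer's
# HTTPS listener (listener ARN is a placeholder).
elbv2.add_listener_certificates(
    ListenerArn="arn:aws:elasticloadbalancing:eu-central-1:...:listener/...",
    Certificates=[{"CertificateArn": arn}],
)
```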
And then Redis. We use Redis basically as a data store for some configuration data. The problem here is that in both ElastiCache and Redis Labs, replication between arbitrary servers is not possible; I don't know why they have not given that option. Plain Redis has a command called SLAVEOF, with which you can make one Redis server the slave of another, but that command is disabled by both ElastiCache and Redis Labs. Because of this we could not make one Redis server the slave of another and thereby achieve cross-region replication.

How do you resolve that? We brainstormed a lot and actually came up with some very bad solutions. One of them was: from your application, start writing data to both the Ireland and Frankfurt servers, then start doing your reads from the Frankfurt servers, then stop doing your writes to the Ireland server. That alone would take three separate deployments, and even as we discussed it, I decided that was not the solution.

Then we took a step back and looked at how MySQL actually does cross-region replication. MySQL writes something called a binlog, a log file of every query executed on the server, and what a replica instance does is read all those queries and execute them onto itself. That is how MySQL replicates using the binlog. So the same should be possible for Redis, and that's what we did. Here you have an ElastiCache server in Ireland and a Redis Labs server in Frankfurt. Redis also has a command called MONITOR, which prints out every command executed on your Redis server. So you can take all of those and start pushing them into a queue, then note a particular time and make a point-in-time copy of all the keys from your source server into your destination server. Meanwhile, all the commands being executed on the source server keep getting pushed into the queue. Once the point-in-time copy is done, with another script, pull the messages from the queue that came after the time the copy was taken, and start evaluating those commands on your destination server.

What happens through this is: all the keys present in Ireland before the cutoff time are copied as-is into Frankfurt. Any key created after that time has a create command for it sitting in the queue; any key updated in that window has an update command for it in the queue. So once the copy is done, just keep polling the messages and evaluating them on the destination server. By doing this you have done what MySQL calls replication, and because you have peering enabled between the two VPCs, this is a simple cross-region replication solution for Redis servers that do not allow SLAVEOF connections. I think in the backend this is more or less what the SLAVEOF implementation does as well.
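Here is a stripped-down sketch of the two halves of that pipeline in Python, using redis-py's MONITOR support and SQS as the queue, with placeholder endpoints. A real version needs proper command tokenizing (a naive split breaks on values containing spaces) and filtering of read-only commands, which this toy version glosses over.

```python
import json
import boto3
import redis

QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/redis-repl"

def produce():
    """Tail MONITOR on the source (Ireland) server and push commands to SQS."""
    src = redis.Redis(host="redis.ireland.internal", port=6379)
    sqs = boto3.client("sqs", region_name="eu-central-1")
    with src.monitor() as mon:          # streams every executed command
        for event in mon.listen():      # e.g. {'time': ..., 'command': 'SET k v'}
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps(
                    {"time": event["time"], "command": event["command"]}
                ),
            )

def consume(copy_time):
    """After the point-in-time key copy, replay queued commands in Frankfurt.

    This is the heavier side -- each message has to be parsed and actually
    executed -- which is why many of these pollers ran in parallel.
    """
    dst = redis.Redis(host="redis.frankfurt.internal", port=6379)
    sqs = boto3.client("sqs", region_name="eu-central-1")
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        for msg in resp.get("Messages", []):
            event = json.loads(msg["Body"])
            if event["time"] >= copy_time:   # only commands after the copy
                dst.execute_command(*event["command"].split())
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```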
So Frankfurt was already serving its own set of customers, and now Ireland was also going to be migrated there, so we obviously had to scale up the Frankfurt data center. That was just a simple set of math: look at how many requests come into Ireland, take the request and response sizes and counts into account, and scale up the Frankfurt data center accordingly.

All the steps we have seen so far were done before the migration: the application servers had been beefed up in Frankfurt, all the other supporting resources had been replicated, and all the data stores were going through cross-region replication. Now all that remained was the switch. This is the actual migration that was done. In the Ireland data center, put up a maintenance banner: customers now only see the banner, and all writes into your database stop. Then verify that all the traffic has actually stopped and all writes to the database have stopped. At this point you can still encounter some ghost process that is writing into the database; the solution for that, once you have made sure no traffic is coming in, is to cut off the database's security-group access from the application, so obviously nothing can write into the database any more. Once you have stopped all writes to your data stores, which include S3, Redis and the MySQL master databases, you can stop the replication for all of them, promote the data stores in Frankfurt as the new masters, and update the configurations in the Frankfurt data center to use these servers as the new set of masters. Then, of all the Ireland accounts migrated into Frankfurt, enable traffic for just a few accounts, test how things behave, and check that everything works. Once you have verified sanity and made sure everything is fine, you point all the traffic at the Frankfurt data center.
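The per-account cutover was driven by DNS. Here is a minimal sketch of what each of those Route 53 API calls might look like with boto3; the hosted-zone ID and hostnames are placeholders.

```python
import boto3

route53 = boto3.client("route53")

def point_account_to_frankfurt(account_subdomain):
    """Repoint a single account's DNS record at the Frankfurt load balancer.

    Running this for a few test accounts first, sanity-checking, and then
    looping over the rest gives the gradual per-account cutover.
    """
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",  # placeholder
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": f"{account_subdomain}.freshservice.com",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [
                    {"Value": "frankfurt-lb.example.amazonaws.com"}
                ],
            },
        }]},
    )
```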
A migration like this in production is always going to have a surprise. Remember the vanity URL support, where customers needed to update their DNS settings to point at the new resource in Frankfurt? Not everyone had done that on time, and what we noticed post-migration was that a lot of customer requests for vanity URLs were still going to Ireland. The structure we have there is a load balancer sitting behind the DNS, a proxy server that takes care of a little more routing intelligence, and then the set of application servers. Now Ireland could not accept any more database writes; only the application servers in Frankfurt should process those requests. So what do you do? A request comes into Ireland and reaches the proxy server; instead of forwarding it to the Ireland application servers, route the traffic from Ireland into Frankfurt. We anyway had peering enabled and all the resources could talk to each other, so the proxy servers in Ireland forwarded the requests to the Frankfurt application servers and not the Ireland ones. This was the solution put in place live, immediately after the migration completed, to support the vanity URL traffic.

One more thing to keep in mind here: with every deployment, your set of application servers changes and there are new servers. So the application servers update your proxy servers, and the Frankfurt proxy servers in turn asynchronously make an API call to the Ireland proxy servers and tell them, "these are the new Frankfurt application servers." So that traffic would come to Ireland, go to Frankfurt, and be resolved there; that one extra hop was there, and meanwhile we kept communicating with customers and asking them to update their DNS settings.

When this process was done, we noticed one big mistake we had made: why should any customer who points their own domain at us need to point it directly at our resources? In DNS you have something called an alias record, where you point one domain at a particular resource, and there is also the concept of CNAMEs, where one domain simply translates into another. This is what we figured out then, and we came up with this solution. We create a new subdomain under our own domain, say custom.freshservice.com, and any customer who wants to use this feature just points their domain as a CNAME to custom.freshservice.com. We own freshservice.com, so we point custom.freshservice.com at our own resources. Any customer whose domain is mapped this way never needs to make a change at their end: if any resource changes on our side, we just change the alias of custom.freshservice.com and never ask the customers to do any kind of update on their side. This was a very important learning, and a bad practice we had been employing that was identified and corrected in the course of this migration.
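With that indirection in place, a platform-side move becomes a single record update on our side, rather than an ask to every customer. Something like this hedged sketch, again with placeholder zone ID and target:

```python
import boto3

# Every customer domain CNAMEs to custom.freshservice.com; only the
# target of that one record ever changes when infrastructure moves.
boto3.client("route53").change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",  # placeholder
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "custom.freshservice.com",
            "Type": "CNAME",
            "TTL": 300,
            "ResourceRecords": [
                {"Value": "frankfurt-lb.example.amazonaws.com"}
            ],
        },
    }]},
)
```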
And all this was handled live when the actual production migration happened. After the migration, once you have verified that everything is fine and there are no more issues, you can go and bring down all the resources in your Ireland data center, and then document all of these steps. Through that you get a proper, complete understanding of your infrastructure and of how the migration was actually done, so if anything goes wrong anywhere in your product, you have a proper idea of what is actually happening. And through this migration I also had the privilege of typing a DROP DATABASE for the master database on my own master database server, which was something exciting to do.

Some of the learnings we got out of this. From the Redis migration, what we learned was that production traffic is never to be taken lightly. Back in the migration we had poller scripts that take the commands from the queue and evaluate them on the destination master server; we had to run around 25 of those poller scripts in parallel to cope with our production traffic. Whatever rate you are pumping data into the queue, you need to poll at the same rate, and one more important thing here is that at the source end of the queue you are just taking the command that was executed as one string and writing it into the queue, whereas at the destination you need to take the command out and actually execute it on the destination server. That is a bigger job, so you cannot run at the same rate with the same number of workers at the destination; we had to run many parallel poller scripts to keep the replication live, where the equivalent of seconds-behind-master should always be 0. One thing you do not need to worry about here is synchronization or race-condition problems on the Redis side itself: Redis is atomic and single-threaded, so Redis internally takes care of all of that.

One more thing that was handled only at the last minute: there are going to be a lot of integrations with other applications, and a lot of internal microservices themselves, and there will be sets of username-password combinations or API keys used for authentication. These are not necessarily the same between Ireland and Frankfurt; Frankfurt has its own set of API keys and Ireland has its own. Once the migration is done, the migrated customers' integrations, configured with their existing API keys, should be recognized in the new data center too. So what used to be stored as, say, a single string now needed to become an array, because where there used to be only one API key in Frankfurt it now had to accommodate two. This was a last-minute fix we had to make.

And then, production databases are definitely no joke. When you boot up a new database in production, dump a huge load of data into it, and immediately start making big queries against it, it is going to perform very poorly. There is something called warming up that needs to be done. It is nothing but running a lot of the most commonly used queries, so that the MySQL query optimizer for that engine becomes more aware of what kind of queries will be run, what indexes and tables will be used, and so that MySQL's own caching kicks in, all before you point production-level traffic at it; otherwise you are just going to break your database. Any new resource you boot into production where traffic does not ramp up gradually, where you bombard it from time zero with a lot of traffic, you need to make sure it is warmed up. Amazon already offers pre-warming options for databases and load balancers.
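A warm-up script can be as simple as replaying the most frequent production queries before cutover. A hedged sketch, with the connection details, table names and query list all illustrative; in practice the query list would come from something like your slow or general query logs.

```python
import pymysql

# Queries and connection details are illustrative.
COMMON_QUERIES = [
    "SELECT * FROM tickets WHERE account_id = %s "
    "ORDER BY created_at DESC LIMIT 50",
    "SELECT * FROM assets WHERE account_id = %s LIMIT 100",
]

conn = pymysql.connect(host="frankfurt-master", user="warmup",
                       password="...", database="freshservice")

with conn.cursor() as cur:
    # Replay the hot queries across a sample of real account IDs so the
    # buffer pool, caches and optimizer statistics are warm before any
    # production traffic is pointed at this server.
    for account_id in range(1, 1000):
        for query in COMMON_QUERIES:
            cur.execute(query, (account_id,))
            cur.fetchall()  # force the rows to actually be read
```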
Overall, some of the takeaways we had from this migration were the following. First, when a huge production migration is going to happen, always be ready for surprises: nothing should come as a shock, and you need to have some kind of solution ready to implement. That only comes with a proper understanding of every single layer of your application, of what flows where, and of what you should do with every component in your infrastructure. The ideal thing is to always have contingency in place: you have predicted that your solution will go one particular way, but it can always fail, so always remember that and have a plan B, and if something goes wrong, make sure you have other ways to try it out, because again, production is no joke. And alongside a proper rollout plan, make sure you also have a rollback plan from every single step: if everything goes wrong, you need to be able to stop right there, cancel everything you did, and re-plan from step one, because you can't leave your customers hanging in an inconsistent state. So for your complete rollout plan, have a proper rollback plan as well.

So this was the entire migration, involving two live data centers and a merge of all the data stores. Any questions, I can answer them.

Hi, so you talked about the multiple poller scripts you had to write. Even if Redis is atomic and everything, we still need to maintain a sequence of operations, right? How do you manage that?

Yeah, so we used SQS, basically. SQS has two kinds of queues, a standard queue and a FIFO queue. A FIFO queue always takes care of ordering, but it has a throughput limit on the number of messages you can process. The other option is the standard queue, where the vast majority of the ordering is taken care of and out-of-order messages are very rare. Obviously, for our traffic we were not able to use the FIFO queue and had to fall back to the standard queue. The call we took there was that only a small fraction of messages would go out of order, which means they belong to very heavily used keys. So we took a chance: if a key really is updated that frequently, then even if one out-of-order update lands, it will immediately be updated again and reach the consistent state over time.

That's the part about the queue that I understand, but what about the poller scripts that are actually writing into the fresh master Redis you are trying to replicate? If there are multiple poller scripts, suppose one worker picks up the first statement in the queue and another one also picks one up, and the second one actually executes the later statement before the first. How do you maintain the sequence of operations among the poller scripts? That's what I'm asking.

Yeah, so that's exactly where we were aware this out-of-order thing could happen, but we were confident that wherever it happened, it would be for frequently updated keys. Say I have a key that is not updated frequently: there is no real chance of the same key's updates being picked up by multiple poller scripts at the same time and executed out of order. If it is possible for two parallel pollers to pick up the same key's updates and apply them out of order, that means the key is being frequently used by our application, so I may apply one wrong update, but because it is frequently used, it will immediately be written again correctly. This was actually one of the assumptions we had, and when we did test runs of this Redis migration, we made sure that case behaved correctly. And for one more key in Redis we had an auto-increment use case; for that, we had a fallback in our database: if Redis gives out an auto-increment value that is already present in the database, the database kicks in with a trigger to update the value and reset it in Redis.

During the migration, when you talk about the traffic split, how do you manage it, so that just a small amount of traffic comes here? What do you do to manage that?

So basically, we have the full list of customers present in Ireland, and the Ireland application servers are now only up with a maintenance banner. The split involves a set of test accounts from us and the list of customer accounts; you split them accordingly and make API calls to your DNS server. Our DNS server is Route 53, provided by AWS, so you make sets of API calls updating records to point at the Frankfurt resource. Frankfurt is already scaled up for the traffic and has the replicated Ireland data, so once you point an account's traffic at the Frankfurt load balancer, you have completed the migration for that account.

How does this ensure a partial traffic migration, and not the whole traffic that you are getting on that route?

Yeah, so by partial I meant a partial set of accounts; for any one account it is always full traffic. So by partial I mean: I have 10 accounts, I point them over two at a time, one by one, like that.

How did you verify your Redis data migration? ... So, like, you have replicated your data with these poller scripts; how did you verify that it has replicated correctly?

The first-level check is the number of messages in SQS: if it returns zero, there are no pending commands sitting in SQS yet to be applied, so you know the replication is running live. The next thing is to run another set of scripts that take random samples of data from Ireland and Frankfurt and check them for consistency. That was the only way we could make sure everything was happening fine.

But you might have a case where something is being updated right as you have selected those keys?

I didn't understand that.

So basically, you said you selected a random key and then compared the value. By the time you have selected it from the Ireland server, it may have been updated in the Frankfurt server. How did you handle those?

For those cases: basically, the only keys that were updated very frequently were a set of auto-increment keys. One level of validation is that if the values on the two sides are close enough to each other, we know it is just the time difference between pulling the values from the two servers; that is the first level. And for that particular key we also had the fallback in our database: if Redis returns an ID that is already present in the database, the database falls back to the trigger to generate its own ID.

After the migration, did you get any requests from customers saying they were seeing data problems or anything like that?

No, actually, it was a completely seamless migration and we didn't run into any such issues.