Hello everyone. My name is Juanito Fadas, and I work at Cookpad, as a ramen specialist. I have a Spanish name, but I'm actually from Taiwan, and I live in Tokyo. So recently I became a salaryman. If you can find me in this picture, I'll give you a special present after the presentation.

I work on Cookpad's Global Team. We have another team, the Japan team, which runs the biggest Rails monolith in the world; I work for the Global Team. And this slide was made in Japan, so it's trustworthy, you can rely on the content. The other day, while preparing the slides, I left this slide in as a reminder to think about local jokes. Then the girl behind me asked me, "What is your joke?" I told her I didn't know, I still needed to think about my jokes. I never did figure them out, so I just left the slide here.

Today I'm here to talk about data migration. So what is data migration? In Rails we do schema migrations, and then there is data migration. The normal migration we talk about is the schema migration, where you change the database schema over time, while a data migration translates data from system A to system B. And there are two types of data migration, too. One is migrating existing data: for example, Stripe needed to migrate all the data in their system to a new model. That is existing-data migration, and you can check out this talk about it. Today I'm going to talk about migrating external data into your existing system, for example when you buy a company or you need to rebuild a site for a client. So this is my talk today.

First, why do we need to do data migration? For example, you convince your client to switch from their great PHP 6 website to Rails, or a new partner joins your company: a company that shares the same values joins you, so you need to migrate their data to your website.

There is a simple goal: get all the data into our system. But it's very hard to achieve, as Matz said this morning. To get all the data, we only need to do these four things, and it's very simple.

The first thing is to get the actual data you need to migrate, and there are two ways to get it. The first way, if your provider can offer you an API, is to access all the data through that. Or you can just get a data dump from the database. At Cookpad we have already done many migrations, so if the provider has the developer resources to build an API, we already have generic migration code that makes everything happen automatically. Today I'm talking about migrating from a data dump, which is impossible to fully automate. So let's see how to do it.

We can start with a simple task. It's as easy as running one Rails command, and hopefully everything gets migrated into your system. In the task method you put whatever you need to migrate, and you start writing the code to make it happen. But first we need to import the data into your system, so your models can connect to the database you want to migrate. And it's as simple as one command: you just import the dump into MySQL.

But at Cookpad we currently support 62 countries with 30 million users, so our production site has users 24/7, across every time zone. We cannot just import a SQL dump, because the inserts come in too fast and the database would be saturated. I needed to add some delay to the SQL dump, so I thought I could just add a SLEEP statement before the inserts. That would work, but the SQL dump file is huge. I tried all the editors below, and none of them worked. I had to use this one called Hex Fiend, which can actually edit files of a few gigabytes. But it's still not perfect, because I made a mistake editing by hand. So instead I wrote a simple Ruby program using this amazing feature called lazy, where at any time you only hold about 2,000 lines in memory. It's fast and very easy to use. One of my friends says that if you want to get better at Ruby programming, you just read the API documentation, and then read it again. So this is how I add a delay to the SQL dump.
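The slide code isn't in the transcript, but here is a minimal sketch of that lazy rewrite. The file names, chunk size, and sleep interval are illustrative, and it assumes each INSERT sits on a single line, as in a default mysqldump:

```ruby
# Stream the multi-gigabyte dump and write a throttled copy,
# holding only one 2,000-line chunk in memory at a time.
File.open("dump.sql") do |input|
  File.open("dump_with_delay.sql", "w") do |output|
    input.each_line.lazy.each_slice(2000) do |lines|
      # Pause between chunks so the INSERTs don't saturate
      # the production database.
      output.puts "SELECT SLEEP(1);"
      lines.each { |line| output.write(line) }
    end
  end
end
```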
Then you can connect to that database through an environment variable, and you set the environment variable accordingly for development, staging, and production. Now you can start modeling the database.

The data you get from the provider comes in all kinds of formats, and you need to map it onto your existing system: in the case of Cookpad, users, recipes, and other things. And with these five methods from Rails, you can basically model anything. Just these five methods; it's amazing.

Let's see an example of mapping the data to our current system. Suppose a recipe has many steps. Sometimes the data is as easy as specifying the foreign key: you write this one line of code and it automatically works. But sometimes it's as hard as the association being stored as plain HTML in the recipes table, so you need to write a Nokogiri parser to parse all that HTML back into the association. That is also not so hard; you just write a parser that brings the data in line with your current associations. So that's modeling the database.

We can also set up a test suite for the migration. You just tell your tests to run against a different database, and you require all the files that are not autoloaded. In the tests you require this special helper, and you skip these tests on CI, because you may not want to import your confidential, valuable data into a CI service. You may ask why we need tests at all, since the migration code will only be used once. It's because we can write better code through tests, and we can use TDD to get things done.

So now you have the tests, and you can model, write tests, and repeat the process until you have modeled all the data from the provider into your system. The PHP site will now work like a Rails application. Then you can start doing the real migration.

For the migration, you need to create records or update records, and you use all the methods that raise exceptions, like save!, update!, and create!, because we want to fail fast and find all the errors that could happen before the real migration happens.

For example, let's look at how to migrate recipes. I have a structure where I made a lot of migrators, and each migrator is in charge of migrating one entity. So to migrate a recipe, you just write a migrator for recipes, and in the recipe model you declare which attributes should be migrated. In the migrator class, you find the recipe in the provider data, create a recipe in your current system, and update all the attributes, and it is migrated into your system. It's very simple. You just keep implementing these migrators, and they all respond to the same interface. I always keep my code simple and stupid, because I don't know how to do metaprogramming. You keep adding more migrators, and then your migration is done. But next I want to talk about how to ensure data integrity.
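The migrator slides aren't reproduced here, but a minimal sketch of the shape described might look like this, assuming a hypothetical ProviderRecipe model that reads from the imported dump database and exposes the attributes to migrate:

```ruby
# Each migrator is in charge of one entity, and they all respond
# to the same interface: #migrate!
class RecipeMigrator
  def initialize(provider_recipe)
    @provider_recipe = provider_recipe
  end

  def migrate!
    # Keyed on the provider's id, so re-running the migration
    # updates the same record instead of duplicating it.
    recipe = Recipe.find_or_initialize_by(provider_id: @provider_recipe.id)
    # Bang method: raise on bad data, so every possible error
    # surfaces before the real migration.
    recipe.update!(@provider_recipe.migratable_attributes)
    recipe
  end
end

# Usage:
#   ProviderRecipe.find_each { |pr| RecipeMigrator.new(pr).migrate! }
```

The find-or-initialize-then-update! pair is also the simple Rails upsert that comes up in the next section on data integrity.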
The first thing you can do is wrap all your migration code in a transaction. That also makes your code a little faster, because inside a transaction there are fewer commits to your database. You also need to make your code idempotent. It's very hard to pronounce. Someone put it well in a tweet: you do the same thing many times but produce the same result, because we want to run the migration many times and still get the same result. It's like f(f(x)) = f(x).

Basically, you need something called upsert: when the record exists, you update it; when it doesn't exist, you insert it. In MySQL and PostgreSQL there are SQL statements that do an upsert for you, INSERT ... ON DUPLICATE KEY UPDATE and INSERT ... ON CONFLICT, or you can use a gem. But I keep it simple and just use the plain Rails API to implement it, because we first want to make it right, and make it fast later. That is the upsert implemented in simple Rails, like the find-or-initialize in the migrator above.

And data accuracy: how can we ensure data accuracy? At first I did some manual checks on a few pieces of migrated data, but that doesn't scale, so I thought about how to automate all the checks, so I can have more time to enjoy myself and don't need to work.

For example, let's see how to check the users with the most recipes, in some simple Ruby code. First you loop over the users with a lot of recipes and pass each one to a user check object. In this user check object, you check the recipes and the draft recipes. The recipe checker looks like this: I just check whether the count of recipes from the provider that I'm going to migrate lines up with the recipes actually migrated to Cookpad. For the draft part, you just change the method to use drafts; you can see these two classes barely differ, because I keep things simple. The checker just gets a log method to record the information, so you can implement each check as a simple Ruby class.

And you keep adding more checkers, because you also need to check the migrated follows, comments, and other entities. You use many small objects to compose what you want to do, and I believe every code base should have a lot of small objects everywhere. But it can become messy very quickly, so I recommend checking out dry-rb, rom-rb, or Trailblazer for better object design.
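As a rough sketch of those checker objects (ProviderUser, the provider_id column, and the scopes are assumptions about the schema, not the actual Cookpad code):

```ruby
class RecipeChecker
  def initialize(provider_user, user)
    @provider_user = provider_user
    @user = user
  end

  # Compare the provider's recipe count against what actually
  # landed in our system, and log any mismatch.
  def check
    expected = @provider_user.recipes.count
    migrated = @user.recipes.where.not(provider_id: nil).count
    return if expected == migrated

    log("user #{@user.id}: expected #{expected} recipes, got #{migrated}")
  end

  private

  def log(message)
    Rails.logger.warn("[migration check] #{message}")
  end
end
```

A draft checker would be the same class with the scope swapped to drafts, and a UserCheck object can simply hold a list of such checkers and run them all.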
Now we can turn our code into background jobs, because we've made sure it already works. For however many CPU cores you have, you can run that many workers, and you design your jobs to go into a different queue, so you can distinguish them from the regular jobs. The job itself just calls another class that migrates the record; the actual migration happens in that class.

In the background job you need to log every unexpected error, so in the base job you add a logger and rescue every possible error; then you can fix them. I'm not sure this is good practice, though; you can check out the next talk for better ways to handle errors. And you need to run your background jobs against all the data to be migrated, so you can find all the errors and fix them before the real migration.

You will also need some tools during the migration, for example a retry mechanism. In MySQL, if you have a foreign key or a constraint, an insert or update takes locks, and because you have many workers migrating data, they may try to access the same records, so those locks can result in MySQL deadlocks. In Rails you can just rescue this exception and automatically retry after, say, two minutes; after the retry it works, or it simply retries again. But sometimes you need to be deliberate about what you retry. At first you don't want to rescue everything: you look at each error and decide whether it should be retried automatically or not. Rails provides two APIs for this, retry_on and discard_on, and you can do anything you can imagine: retry later, discard an exception, or retry with exponentially longer waits. It's only two APIs, and you have everything.

But sometimes you cannot retry automatically, and you need to look at the error in the failed queue. In Resque you can implement a simple retry object that retries errors from a designated queue: it finds all the failures, re-enqueues them, and removes them from the failed queue, so you can retry exactly the errors you want.

During the migration you also want some status reporting, because you want to know how much data still needs to be migrated. You can implement a simple Ruby class called Progress that loops through all the models and calls a progress method on each model; each progress method checks how much data has been migrated, divided by the total data to migrate, so you know the progress. Then you wrap it in a simple Ruby loop to make it report every minute, so you know the migration progress minute by minute. You also need to monitor the CPU usage; at our company we use Grafana, in case you haven't heard of it, a good tool to monitor your CPU usage, performance, and requests.
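Going back to the base job and the retry mechanism, a minimal sketch with Active Job follows; the queue names, wait times, and choice of exceptions are illustrative, and it reuses the hypothetical RecipeMigrator from earlier. The generic logging handler is declared first because rescue_from handlers are matched starting from the most recently declared, so the more specific retry_on and discard_on handlers below it take precedence:

```ruby
class MigrationJob < ApplicationJob
  queue_as :migration_low  # separate queue, distinct from regular jobs

  # Log every unexpected error, then re-raise so the job lands in
  # the failed queue where it can be inspected and fixed.
  rescue_from(StandardError) do |error|
    Rails.logger.error("[migration] #{error.class}: #{error.message}")
    raise error
  end

  # Deadlocks are expected when many workers touch the same rows:
  # retry them automatically, waiting longer each time.
  retry_on ActiveRecord::Deadlocked, wait: :exponentially_longer, attempts: 5

  # Records deleted on the provider side are safe to skip.
  discard_on ActiveRecord::RecordNotFound
end

class MigrateRecipeJob < MigrationJob
  def perform(provider_recipe_id)
    RecipeMigrator.new(ProviderRecipe.find(provider_recipe_id)).migrate!
  end
end
```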
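And the progress reporter might look something like this, assuming each provider model gets a `migrated` scope (again an assumption about the schema):

```ruby
class Progress
  MODELS = [ProviderUser, ProviderRecipe].freeze  # illustrative list

  def report
    MODELS.each do |model|
      total = model.count
      next if total.zero?

      done = model.migrated.count
      puts "#{model.name}: #{done}/#{total} (#{(done * 100.0 / total).round(1)}%)"
    end
  end
end

# Wrap it in a simple loop to report every minute:
#   loop do
#     Progress.new.report
#     sleep 60
#   end
```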
Next I want to talk about performance: how to make your migration code run faster. But first, performance is a rabbit hole. I spent so much time on it, and for every change you make to the migration code, you need to run through every record again, because you have to make sure it still works before you can claim it's faster.

Some performance tips I found: you can preload the associations, you can minimize the scope of your transactions, and you can tune the transaction isolation level. You can also avoid unnecessary callbacks. For example, when you create a user or a recipe, you can skip moderation, because this is a migration and you know the data doesn't need to be moderated again. Or you can use no_touching, so the code will not touch the associations; you'll be faster, because you can touch them all once after the migration.

Another tip is to process multiple records in a single job. It's very easy: you just use each_slice and process, say, 100 users in one job, and you will be faster. And if you process 100 records at a time, some of the records may already have been created, so you can cache them in memory; later on I tried caching them in Redis, which is even faster. To speed things up further, you can also migrate the important things first, for example only the 10,000 users with the most recipes. Migrating the important things first makes your migration effectively faster.

But later on I ran into a problem: the migration became IO bound. Ruby is actually so fast that my database reached the limit of the IO it can perform. So either I scale up the database, or, as I found out, I just decrease the number of workers, which actually improves the IO situation: if you have too many workers hammering the same database, your IO fills up quickly, so with fewer workers you can actually get more done than with more workers. Or you can do something like bulk insert or bulk upsert, inserting or upserting many records in one go. And with every change you make to speed things up, you need to run the whole migration again. While the migration runs, watch your CPU usage and keep it at 75% at most, so your site won't go down.

After the migration, you update all the things that need updating, like counter caches and statistics, and you touch the associations you skipped. You also set up redirects, because you want to redirect the old data on the old site to your new site. At first we did this by generating redirect tables and handing them to the provider, who would build a redirect program so their server starts redirecting. But this comes with a cost, because the provider being migrated doesn't always have developers. So recently we have been working on an open-source redirection service that does these simple things without requiring developers on the provider's side.

Okay, now I'm going to share some stories about data migrations I've done. For emails: you'd better remove the duplicate emails before the migration, and remove the invalid emails before the migration, because it's far more expensive to handle them after the data is in your site. And make sure you downcase all the emails, because mixed-case emails will cause you a lot of trouble.

Another story is how to get the site dump. One provider's site was 100 gigabytes on EC2, and EC2 has a bandwidth limit, so scp or anything similar would take days, assuming nothing failed during all those days; or you could buy a more expensive EC2 machine. The solution was to just use DHL to deliver an encrypted disk to Tokyo. It's actually faster than scp.

Another story is migrating millions of records. When I needed to migrate millions of records, I looked into all these solutions: first, Active Record with transactions; then bulk insert or upsert; then the faster activerecord-import; and even the fastest, MySQL's LOAD DATA INFILE command, which is incredibly fast. But it would still take weeks, or a month, to migrate all those millions of records. So instead, we run this migration in a very low-priority queue, and when a user signs in, we migrate their data in a high-priority queue. This way I don't need to migrate all the millions of records into our system up front. We migrate slowly, and if users do sign in, we migrate their bookmarks and other important data right away. It's as simple as checking whether the user still needs migration and, if so, running the bookmark migration in the high-priority queue. You don't need to migrate millions of records beforehand; you migrate slowly, and when a user really needs their data, you migrate it with high priority.
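A minimal sketch of that sign-in hook, assuming has_secure_password on User, plus a hypothetical migration_pending? flag and MigrateBookmarksJob:

```ruby
class SessionsController < ApplicationController
  def create
    user = User.find_by(email: params[:email].to_s.downcase)

    if user&.authenticate(params[:password])
      # The low-priority queue is still churning through everyone,
      # but a user who actually signs in gets their bookmarks now.
      if user.migration_pending?
        MigrateBookmarksJob.set(queue: :migration_high).perform_later(user.id)
      end
      session[:user_id] = user.id
      redirect_to root_path
    else
      render :new
    end
  end
end
```

Because the migrators are idempotent, it doesn't matter if the low-priority queue later processes the same user again.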
Another story is migrating 100K photos. First, let's see how our images work. Every time we upload an image, we generate a unique image ID, and then use this ID on the CDN to access the photo. But this process takes around 5 seconds per image, because for every image from the external site you need to open an HTTP connection, upload it to your server, and generate a unique hash, and this slows your migration down enormously. So instead I designed a scheme that always produces the same hash for the same image, yet the hash is still unique. During the migration I only need to set this precomputed hash, so I don't have to upload the images during the migration itself; I can upload them beforehand. I benchmarked how long the image upload alone would take: about 10 days. So I just migrated all the images in a low-priority queue, starting 10 days ahead, because in reality 99% of the photos never change at all. Do that work beforehand and it won't slow down your migration.

Another story is migrating user passwords to secure authentication, because the provider's authentication may use MD5 or something equally simple. First you need to figure out which algorithm they used to encrypt the passwords. When a migrated user signs in and enters their password, authentication will fail in your system, because their password was encrypted with MD5 or something else that is not the same as your current password encryption. When it fails, you fall back to the legacy authentication; in the code it looks like this. On failure, you use the legacy authentication to check whether they entered the correct password, and when they did, you set their password through your existing secure password scheme. So you make their passwords secure again with this simple change.
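The slide isn't in the transcript, but here is a minimal sketch of that fallback, assuming the provider hashed passwords with unsalted MD5 and that migrated users keep a legacy_password_hash column until their first successful sign-in:

```ruby
require "digest/md5"

class User < ApplicationRecord
  has_secure_password validations: false

  # Try the current secure scheme first, then fall back to the
  # provider's legacy MD5 hash.
  def authenticate_with_legacy_fallback(plain_password)
    if password_digest.present?
      authenticate(plain_password)
    elsif legacy_password_hash.present? &&
          Digest::MD5.hexdigest(plain_password) == legacy_password_hash
      # Correct password: store it under the secure scheme and drop
      # the legacy hash, so this branch never runs for them again.
      update!(password: plain_password, legacy_password_hash: nil)
      self
    else
      false
    end
  end
end
```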
For the future of data migration: if I ever need to do this again, I will look into migrating the data into a ghost table, where I write into another table and swap the tables when it's ready. Or I will look into making my migration code more generic, so I only need to model the database, then just run the migration and everything works, and I can have more holidays.

Some takeaways. Rails provides sharp tools, thanks to the Rails core team. Use small objects to make your code more reliable and maintainable; as my friend says, abstraction is the god of programming. And always remember: schedule is more important than speed, and accuracy is more important than schedule. Data migration sounds hard, but if you keep it simple, you can make it easy. Do the simplest thing, as my previous boss Winston said. Great, thank you, and enjoy your tea break after my talk. Thank you.

Do you have any questions about the migration for Juanito? Yes.

Hi. Have you ever run into a situation where you import data that is not valid under the new system?

Ah, yes. When you run the migration code, you will find a lot of records that cannot be migrated into your system, because you have, for example, validations, or a field is too long, these kinds of things. You can either soften your validations, or update the records without running the validations, with something like update_column, and keep the invalid data in the database.

Yes, and deal with it later somehow. Do you have any strategy to deal with it afterwards?

Basically, before the migration I run through all the data and make sure everything works, so I didn't have much to do after the migration.

Okay, thanks.

Hi. Okay, we've got two. I noticed a part where you were migrating data and then conditionally migrating more data based on the user, like the bookmarks. Was that all done beforehand, or was it based on live input from the user, so that a user finally logs in and only then do you lazily migrate a different part? And if so, do you see yourself moving more towards that style, lazily migrating things along?

In our system we have some important data, like recipes, so we migrate the important things first. Things that are not so important, like bookmarks, of which there are many, we migrate in a lazy fashion. Yeah, that's how I tried to do it.

Thank you. We have one question here.

Hey, so how do you manage your foreign keys? If you've got users and recipes and you're uploading them in parallel, as you were showing, what if you're inserting a recipe that doesn't yet have a user that's been created?

When I migrate a recipe, I actually migrate the associated user first, then migrate the recipe.

If it's going through parallel threads, how does the recipe migrator know about the user if it comes from different workers?

Even if the jobs run on different workers, every worker first checks whether the recipe's user has been migrated into our system or not, and we always migrate the user first. So it's somewhat slower, but integrity is secured.

Thank you.

Do I understand correctly that when you migrate from the old site to the new site there is no downtime? You enable your site immediately while the data is still not there and you're still loading it? Or is there some scheduled maintenance, where the owner of the old site says we'll be back in three days? How does it look from the user's perspective?

Thank you, I didn't cover this. When the migration happens, we make sure the provider's website is switched to read-only, so nobody can sign in, they can only read, and we start the migration on Cookpad, which takes a few hours. When we're ready, we shut down the old site and redirect everything to our new site. It depends on the size: a short one is around 4 hours, and it can also be 8 hours, depending on how much data there is.

Do we have more questions? Alright, right at the back.

Sorry, I just want to ask: don't databases have tools that already allow us to migrate the data?

Excuse me, could you say that again?

I know Ruby is a general-purpose language and we can do anything we want, but isn't there a database-native way to dump the data and import it into another database?

So your question is how Ruby talks to other databases?

No; for example, MSSQL or Oracle also provide data dumps and data transformation tools, maybe for importing into another database. Did you consider using those tools?

Ah, you mean using database tools to do the migration?
Yeah, I have considered doing that, but I'm not very good at database tools, so I use Ruby, and it still works. Yeah, thank you.

Just to add on to that: sometimes you need to do some transformations or validations in your Ruby code, or maybe the schema you're migrating from is not the same, so you need to do some transformation bits that the database tooling doesn't quite support. It depends, yeah.

I think we can take one more question before the tea break. One up there, okay.

I just want to ask about the lazy migration. When the user logs in, only then do you queue them in the high-priority queue to migrate the data. So what happens if the user never logs in? Do you just not care about their data anymore?

Ah, I actually have a low-priority queue running through everything, which takes a few weeks; but if they sign in during those few weeks, their things get migrated with high priority.

So you remove them from that low-priority queue?

No, it's duplicated.

It's duplicated, okay.

Yeah, but the migration code is idempotent: you can run it multiple times and get the same result, so no problem.

Okay, cool. Thank you very much, Juanito.

Thank you.