Hello everyone, and first of all, thank you very much for being here. Today I will talk about implementing a zero-downtime migration strategy in a software-as-a-service project.

Imagine you are a big player on the e-commerce stage, like Amazon, and you are forced to take a maintenance break. The last time they experienced downtime, they were losing $66,000 per minute. We faced a similar problem in Saleor, which is an open-source, GraphQL-first e-commerce platform. Obviously not on that scale, because we are not yet as popular as Amazon, but the problem is the same. I will show you how we handled the situation and the obstacles that we faced along the way.

Let me quickly introduce myself. I'm Iga Karbowiak. I'm from Wroclaw, in Poland. I've been using Python in my work for over five years, and I've been working on the Saleor platform for almost four years. I'm happy to see how this project is changing from just an open-source project into a proper product.

First, I would like to briefly tell you the story behind Saleor. Saleor started as an open-source monolith project built on Django with views. In 2018 a big decision was made and a GraphQL API was added. A year later the Django views were removed entirely, making Saleor a headless app, which basically means that we started to offer just an API, without the front end. Over time Saleor grew, we gained a community around it, and the idea to build a product around Saleor came up. This led to the creation of a full cloud environment that offers Saleor as a SaaS product.

As you can see, the story behind Saleor is one of constant growth, and when you are getting bigger, you get more clients and you face bigger problems. One of our clients' requirements was the ability to update versions without any downtime. So, to sum up the background of the problem: the clients of an e-commerce platform like Saleor don't want to stop their stores at any time, as of course it means losing money. We need to ensure that the system is working all the time, even during the update, and we need to minimize the migration time. I think it's safe to say that sooner or later every SaaS application will face the downtime problem.

So let's move on to the examples. I will start by showing you the problematic operations, and then I will move on to their solutions. For this talk I created a sample project with a simple GraphQL API and a PostgreSQL database, which is the same configuration as we have in Saleor. I will give you the link to this project at the end of the presentation.

Let's assume that we are currently running version V1 of this system. We have multiple app clusters that use a shared database, and we plan to introduce some changes to the product type that we will release in the next version, V2.

Here is the product model in the current version, V1. The model is just a simple Python class that refers to a single database table, and the class fields refer to table columns. Let's assume that our product model has name, description, and created fields. We plan to add a new unique slug field that will be a human-readable identifier of our product. Additionally, we want to rename the field created to created_at, to be consistent with other parts of the project.

As we previously stated, we are currently on version V1 of the system. Assume that we released the mentioned changes in version V2 and we want to upgrade to this version. Upgrading app instances is easy.
We just need to gradually replace the old app instances with new ones. The problem is with the database, because it's a shared resource and we cannot just clone it and replace it. Instead, we will upgrade the database in place.

First, we begin by migrating the database to version V2, so we will be at a stage where we have an upgraded database but are still running app instances of the previous version. Once the migrations are completed, we can start the app workers of version V2 and, finally, stop the app workers of the previous version. So the two stages that we should worry about are two and three, where we have an upgraded database but are still running app workers from the previous version. In other words, we have old code that is using the upgraded database. We will analyze these stages in the following examples.

So let's move on to the problematic operations. We will start with adding a unique slug field that will be a human-readable identifier of our product. What you see on your right is a Django migration. For those who don't know, migrations apply object-relational mapping changes to the database schema. In other words, the database needs to know that something has changed in the models. Migrations define the operations that will be performed on the database, but they also allow running a Python function to update existing instances.

To add a unique field we need to follow three steps. The first is adding a nullable field, so we create a new slug column on the product table. Next, we need to update the existing instances to set the proper values for the slug column, so we call a Python function to do that. We do that synchronously, and after the function has finished, the new slug column will be filled in with proper values and we will be ready to perform the last step, which is changing the slug field to be unique. To apply this change, we alter the slug column on the product table.

Now let's assume that we released those changes in version V2 and updated the database to this version, while the app is still on version V1 of the system, as shown in this diagram. When the API request for product creation is called for the second or a subsequent time, an integrity error is raised. The error message says that a product with an empty slug value already exists. This is because version V1 is not aware that the new unique field was added, and it's trying to save the row with null as the slug. Below you can see the part of the product table that shows the described situation: we have instances with proper slug values set and one row with an empty value. So adding the next row with null as the slug will raise an error, because the values won't be unique anymore. We'll discuss the solution later.

Now let's move on to the second operation, which is renaming the field from created to created_at, to be consistent with other parts of the project. It seems to be a pretty simple operation, we are just renaming a column, but it might be problematic as well. As before, let's assume that we released the mentioned changes in version V2, we upgraded the database to this version, and the app is still on version V1 of the system. The API request for product creation, as we can expect, raises an error, and this time the error message indicates that the column created on the product table does not exist. That is accurate, because we already renamed this column and we only have the created_at column in the database. However, version V1 is not aware that something has changed, and it's trying to save the value in the old created column. We'll get a similar error when trying to retrieve an existing product from the database. This time the API is trying to fetch the data from the created column, which does not exist anymore.

Now let's move on to the last operation that we'll discuss today.
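The rename failure described above can be reproduced in a few lines. This is only a minimal sketch, not code from the sample project: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL and the Django ORM, and the table layout, the V1_INSERT and V1_SELECT statements, and the helper name run_v1_query are assumptions made for illustration.

```python
import sqlite3

# Schema as it looks AFTER the V2 migration: "created" was renamed to "created_at".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, created_at TEXT)"
)

# Statements that the old V1 code would still send (hypothetical, for illustration).
V1_INSERT = "INSERT INTO product (name, created) VALUES (?, ?)"
V1_SELECT = "SELECT name, created FROM product"


def run_v1_query(connection, sql, params=()):
    """Run a V1-era statement and report whether the upgraded schema accepts it."""
    try:
        connection.execute(sql, params)
        return "ok"
    except sqlite3.OperationalError as exc:
        # The database no longer knows the old column name.
        return f"error: {exc}"


create_result = run_v1_query(conn, V1_INSERT, ("Blue Mug", "2023-01-01T00:00:00"))
fetch_result = run_v1_query(conn, V1_SELECT)
print(create_result)
print(fetch_result)
```

Both the write and the read fail, which matches the two errors from the talk: the old code keeps addressing a column that the upgraded database no longer has.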
So let's suppose we have a large data set of products, perhaps a million or more. Updating all existing instances will take a significant amount of time, no matter how hard we try. The problem is that the update blocks the migration process, as we need to wait for the update to finish before continuing with the database migration. It also locks the database tables and keeps the database in an unstable state, which may slow down the application or even make it unresponsive. As a result, it significantly extends the time of the database migration.

So these were some of the problematic operations that we should be aware of. Let's sum them up.

First, adding a unique or non-nullable field, like adding the slug field in our example. In Saleor we had such a situation, for example, when adding an expiration date to our order model.

Next, updating a large amount of data. I can say that in Saleor we face this problem most often, as our clients have quite big collections of orders and products, and any field denormalization, like recently adding a discounted price, means that we need to update each instance separately.

And then renaming a field, like renaming created in our example, but also renaming a table, removing a field or table, or moving data from one field to another. All of these operations will cause an error similar to the one for renaming a field. In Saleor we faced this problem when changing IDs to universally unique identifiers (UUIDs).

So the key is to not remove any fields that the previous ORM will use, and to minimize the time of each migration.

Now we know all the problems.
I owe you the solutions. The biggest difficulty in upgrading is changing the database, as it's a shared resource and you cannot just clone it and replace it. To ensure zero downtime, we need to ensure that the updated database will work with both the old and the new version of the system. There are two possible options to do that. The first is to make the old code compatible with the new database schema. The second is to make the new database schema compatible with the old code. Fixing old code is hard and requires crafting two releases, so we decided to choose the second path, as it's easier to achieve, and I will describe the solutions that fit this approach.

One more important assumption is that we ensure zero downtime only when changing one version at a time. So in our example, an upgrade from V1 to V2 will be possible without any downtime, but switching from V1 to V3 won't be.

Let's start with the solution for our first problem, which is adding a unique slug field. We need to ensure the compatibility of the previous version, V1, with the database upgraded to version V2, so we need to apply some of the database changes on version V1 as well. The solution here is to apply the first two steps of the migration on version V1 of the system.

The first is adding a nullable field, so we add a nullable slug column on the product table. The difference is in the second step, because we want to minimize the time of each migration. Instead of updating the existing instances synchronously in the migration code, we delegate it to an asynchronous task that will do that in the background, after the migration process. I will tell you more about that later. For now, the most important thing is that we are doing this asynchronously. We also need to ensure that any new instances created on version V1 will have the proper value set, and this leads us to the last step, which is updating the API.
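The steps just described (a nullable column plus an API that fills it in for new rows) can be sketched as follows. This is an illustration under stated assumptions rather than the sample project's code: sqlite3 stands in for PostgreSQL and Django migrations, and slugify and create_product are hypothetical helpers. A real implementation would also have to handle two products that produce the same slug.

```python
import re
import sqlite3


def slugify(name: str) -> str:
    """Hypothetical helper: lower-case the name and join words with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO product (name) VALUES ('Old Shirt')")  # row created before the upgrade

# Step 1: add the column as nullable, so inserts that omit it keep working.
conn.execute("ALTER TABLE product ADD COLUMN slug TEXT")

# Code that does not know about the column can still insert rows.
conn.execute("INSERT INTO product (name) VALUES ('Blue Mug')")


# Step 3: the patched API sets the slug for every new row.
def create_product(connection, name: str) -> int:
    cursor = connection.execute(
        "INSERT INTO product (name, slug) VALUES (?, ?)", (name, slugify(name))
    )
    return cursor.lastrowid


create_product(conn, "Red Hat")
rows = conn.execute("SELECT name, slug FROM product ORDER BY id").fetchall()
print(rows)
```

Rows written before the API change still have an empty slug at this point. Filling them in is the job of the background task (step 2), which is why the unique constraint has to wait for the next version.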
So when any new row is added, it will have the proper value set in the slug column. Just after performing the migration from V1, the database will be at a stage where we have the new slug column with empty values. When the asynchronous tasks have finished, the slug column will be filled in with proper values, and at that point we will be ready to safely change the field to unique in our target version. So the operation that must be performed on version V2 is to alter the slug column to make it unique.

The second operation was renaming the created field. To ensure compatibility, we have to perform the changes in three main steps. First, we need to add a new field next to the existing one. In the second step, we need to copy the data from the old field to the new one. Finally, we can remove the old field. And what is very important: we need to ensure that at each of these steps the database will be compatible with the previous version of the system.

Let's start with the changes on version V1. Here the steps are almost the same as in the case of adding a unique slug field. First, we need to add a nullable created_at column on the product table. In the next step, we need to copy the data from the old column to the new one to update the existing instances, and as before we delegate this to an asynchronous task that will do that in the background. We also need to update the code, so that when any new instances are created on version V1, they will have the proper value set for both the old and the new columns. After finishing the migrations and the asynchronous tasks, we will be at the stage that is shown below, and we will be ready to apply the changes on version V2.

On version V2 we need to remove the old field from the ORM and from the code, as we don't want to use it anymore. But we cannot remove it from the database, because it's still used by the previous version, V1. Instead, we need to ensure that the old created column is nullable or has a default value set. In our example, we make it nullable. On the right, in the migration, we are separating the database and ORM changes to perform those actions. We are also ready to change the new field to non-nullable. So in version V2 we will be at a stage where the old field is not used anywhere in the code, but it's still in the database.

Finally, in the next version, in our example version V3, when we are sure that only the new field is in use, we are able to safely remove the old created column from the database. As you can see, there are quite a lot of steps that must be performed just to rename a field.

Now let's take care of updating large data sets. First of all, the update should be done in the version before the target version, so we can be sure that all existing instances will have the proper value set and we can safely apply the changes on the target version. To minimize the migration time, the data should be updated asynchronously, after the migration process.

Here's an example of a migration that calls the task which copies the data from the old created column to the new created_at column. The task is delayed in the post_migrate signal, which means that it will be called after the migrations are completed, so as not to disturb the migration process in any way. So the data will be copied in the background.

Now take a look at the task code. I have some tips for you that we worked out. First, the update should be done in batches. Second, only instances that haven't been updated yet should be taken for update, and the instances should be ordered by some unique field, like the primary key. In our example, we are taking products that have an empty created_at column. After the batch update, the task should queue itself again if there is still data to be processed, so as not to block the asynchronous task queue for too long.
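The tips listed so far (batches, only rows not yet updated, ordering by primary key, and re-queueing) might look roughly like the sketch below. It is not the task from the sample project: sqlite3 replaces PostgreSQL, the function name backfill_created_at_batch and the tiny BATCH_SIZE are made up for the demo, and a plain while loop stands in for a Celery task that queues itself again. A real PostgreSQL task would also lock the selected rows, which SQLite cannot express.

```python
import sqlite3

BATCH_SIZE = 2  # deliberately tiny for the demo; real batches would be much larger


def backfill_created_at_batch(connection, batch_size=BATCH_SIZE):
    """Copy created -> created_at for one batch; return True if rows remain."""
    with connection:  # one transaction per batch (commit on success, rollback on error)
        ids = [
            row[0]
            for row in connection.execute(
                # Only rows that have not been updated yet, in a stable order.
                "SELECT id FROM product "
                "WHERE created_at IS NULL AND created IS NOT NULL "
                "ORDER BY id LIMIT ?",
                (batch_size,),
            )
        ]
        if not ids:
            return False
        placeholders = ",".join("?" for _ in ids)
        connection.execute(
            f"UPDATE product SET created_at = created WHERE id IN ({placeholders})",
            ids,
        )
    remaining = connection.execute(
        "SELECT EXISTS(SELECT 1 FROM product "
        "WHERE created_at IS NULL AND created IS NOT NULL)"
    ).fetchone()[0]
    return bool(remaining)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product "
    "(id INTEGER PRIMARY KEY, name TEXT, created TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO product (name, created) VALUES (?, ?)",
    [(f"product-{i}", f"2023-01-0{i}") for i in range(1, 6)],
)
conn.commit()

# Stand-in for "the task queues itself again while data remains".
runs = 1
while backfill_created_at_batch(conn):
    runs += 1
print(runs)
```

Keeping each batch in its own short transaction is the point of the design: no single run holds the table for long, and a failed batch can simply be retried because already-copied rows are filtered out.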
The update of instances should also be done in a transaction with locked rows, to avoid potential deadlocks that might happen when multiple asynchronous task workers are in use.

Right now we know the problematic operations, and we know how to write migrations so as not to crash the system. The last missing piece is how to proceed with the update.

First, we need to release the changes applied on version V1 as the next minor or patch release. I will use a minor release, V1.1, for simplicity. So in our example, in version V1.1 we will have two nullable fields, slug and created_at. Next, we need to upgrade to this minor version and then to the target version. So first we switch from V1 to V1.1, and then from V1.1 to V2. Upgrading through this V1.1 version is crucial.

In both cases the process looks the same, so I will describe it in general, using the example of switching from V1 to V1.1. We start from a configuration where we have one app instance that is using the database, and we have asynchronous task workers, for example Celery workers. The first step is stopping the asynchronous task workers, to make sure that they don't process any tasks during the database upgrade. Then we need to run the database migration to update the database to version V1.1. After finishing the migration, we can start the app workers of version V1.1.

Notice that at this point we have two app instances of different versions that are using the same database, but we have ensured database compatibility. So when any new product is created on version V1, it will have the new value set for both the slug and created_at columns, and when a new product is created on version V1.1, it will have the proper value set for both columns, as we updated the API to do that. In the next step, we can start the Celery workers to run the tasks delayed in the migrations. And finally, we can stop the app workers of the previous version.

So I have described the zero-downtime upgrade to you. Now let me digress a little about what zero in zero downtime really means. During the whole upgrade process there is no moment when all app instances are stopped. Instead, the database is upgraded in place while the app instances are still in use. This may result in minimal downtime, but it's so brief that it is essentially zero and it's not visible to the user.

Moving back: after you upgrade to V1.1, any new instances will have the proper value set. All old instances will have null values at the start and will be updated by the asynchronous tasks in the background. When the tasks have finished, the product table will be filled in with proper values, and we will be ready to upgrade to our target version, V2. To do that, we need to perform all the steps that we did for switching from version V1 to V1.1.

So if you want the upgrade to go smoothly, without any maintenance breaks, remember that the most important thing is to not remove any fields that the previous ORM will use. And that's it. If you're interested, here's the link to the example project, which contains each of the steps that I explained today and some additional ones, for example adding a database index.

Two more slides, it's okay.

So first of all, thank you so much. This was very insightful. My question, and perhaps it just wasn't clear to me: one of the potential issues that I see with this is that while the two versions are running, you might be writing data with the old version that potentially doesn't get migrated when you run the script. For instance, when you rename, moving the data from created to created_at, if the old instance is still writing new stuff, you might get into a situation where you're writing to the created column and that data doesn't get migrated into created_at. Did I get it wrong, or is that something that...

Maybe I will move back a little. Yes, here is this example. Is this the moment you mean, when you have these two instances?

I believe so.
Yeah, so the issue that I see, and just correct me if I'm absolutely wrong, which might be the case, is after you stop version one. If you do that at the very last step, you might still be creating stuff in the old version that doesn't get created in the new version.

Yes, if you create it at this moment...

It's exactly that, just before the fifth step. That's what I mean.

Before the last step?

Yes, exactly. Between four and five you have the old instance that is still writing to the created column.

Yes, but it's not a problem, because the data is copied in the background from the old column to the new one. And if any new instance is created from the old app instance, from version one, then the new value will be set for created_at, and created will have the proper value and it will be copied.

So if I understand right, the migration is asynchronous in the sense that it doesn't stop. It's not something that you wait to end before stopping the old instance, so it keeps running until there is nothing left in the created column. Is that what you mean?

The created column is still there after upgrading to version V1.1.

Yeah, that's exactly the source of my confusion. That's precisely it.

Yeah, I don't want to keep this.
So shall we just talk about this later?

Okay, we can take it later.

Thank you from me as well for the nice talk. I'm also in the e-commerce sector, and I had quite a bit of nice recognition here today, of things that we do very similarly. One thing I noticed, though, is that doing this migration in two steps, as you explained, puts quite a bit of onus on the developers to notice that they're now doing something that requires it. That's something we've had some trouble with. We've tried writing tests for that and so on, but the cases get quite complicated, especially when you have indexes and so on. Do you have any experience there, with how to handle this and help developers notice that they're really doing something that needs that two-step process?

I can say that we are currently learning that, to be honest. If somebody is putting something up for review, all of our teams need to take care of that and say, "Hello, you need to add zero-downtime support here." We also have some pages in our docs that say that we need to do that. But it's hard to keep up with: you need to remember, you need to keep the pull request for the next version, and you need to create the pull request for the previous versions. So it's a lot of additional work for the developers.

I appreciate that I'm not the only one who's having that problem.

Thank you very much for the presentation. I have one small question. Why do we stop the Celery workers before the migration? What's the fundamental difference between app V1 and Celery workers V1?

Yes, because in version V1.1 we are defining the Celery tasks, sorry, the asynchronous tasks that will be called in the migrations, and they exist only in version V1.1. So we need to upgrade the Celery workers to version V1.1, which contains the tasks that will be delayed in the migrations.
Yes, but why can't we stop the Celery workers V1 after the database migration? Why can't we switch step one?

Because the Celery workers won't have the information that this task exists, and I'm not sure whether it would be delayed correctly if the migration calls this task, because the tasks are called directly in the migrations. So if the migration calls a Celery task that does not exist in Celery, it might crash.

Got it. Thank you very much.

Thank you.

Yes, I was just wondering what you're using to actually manage this process. For example, when the database is done making all of its changes, you need to know to then stop version 1, right? So how do you do that? What software or methodology are you using to track that?

Are you asking from the developer's perspective? Sorry, I don't understand the question correctly.

I mean, when you make the changes to the database, then when it's done you need to tell app V1 to stop, right?

Mm-hmm.

How does it know?

To be honest, it's the cloud developers' work.