My name is Michiel Rook, or as most people abroad call me, Michael or Mike. I'm from the Netherlands, which explains the name. Today I'll be talking to you about database schema migrations, and not just any schema migration, but schema migrations that don't cause downtime.

Before I begin, I'd like to introduce myself. At this point I'm a freelance developer and coach; I do a lot of consulting and coaching around continuous delivery and DevOps. If at any point you want to reach out to me, that's my Twitter handle right there. I'll also make sure to post the slides on there, and of course to join in after the talk. If you have any further questions later on, please reach out to me.

I mentioned that I do a lot of consulting around continuous delivery. Some of you may have been here last year at one of my other talks, The Road to Continuous Deployment. In that talk I presented a case study, a project in the Netherlands that I was part of, which moved to continuous deployment, which basically means every change that is automatically tested and verified is deployed to production. For that particular company, this resulted in 50-plus deployments per day to production. That company ran a 24/7 operation, so high availability was definitely a concern. And when you add database migrations into the mix, changes to database schemas, then for a 24/7 company that has to be highly available, downtime is unacceptable, if you can avoid it, that is.

I've given that talk a few times, not just here at PHP UK but also at other conferences, and one of the questions I keep getting asked is: going to production 50-plus times per day, that's very cool, but we have this database, and we do schema migrations along with our code changes, and we need to take the database or the app down while we're performing the migration. So how can we get from the situation we're in now to the situation you're describing, where you go to production 50 times per day? That question triggered me to write this talk, and so here we are. We're going to talk about database migrations, specifically database schema migrations.

Let's go back to what a database schema is. It could look something like this: a lot of tables, your primary keys, your foreign keys, relations between tables, table structure, all those things. At some point your database schema is no longer sufficient for the direction your application is going, and then we end up with a migration, a change to our database. Of course, this change can be applied manually, but in this day and age, especially if you're moving towards continuous deployment, automation is the name of the game.

In this talk I'll typically refer to database schema migrations as being SQL. And when we consider SQL database migrations, most tools, and most talks and blog posts about this topic, consider a migration to have two directions, up and down, where up means I'm going from a previous version of my database schema to a new version, and down means I'm going to revert that change and go from version X plus one back to version X. Typically, these migrations are run in transactions within your MySQL or your Oracle or what have you. Let's take a typical example of a tool that you would see in the PHP world: Doctrine.
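Before I describe it, here's a minimal sketch of what such a Doctrine migration typically looks like. The version number, table and column are made up for illustration, and the exact base class and method signatures depend on which version of Doctrine Migrations you're running.

```php
<?php

declare(strict_types=1);

namespace App\Migrations;

use Doctrine\DBAL\Schema\Schema;
use Doctrine\Migrations\AbstractMigration;

// Hypothetical example: the class name encodes the version,
// the table and column names are invented for illustration.
final class Version20180216093000 extends AbstractMigration
{
    public function up(Schema $schema): void
    {
        // Forward direction: take the schema from version X to X+1.
        $this->addSql('ALTER TABLE customers ADD email VARCHAR(255) DEFAULT NULL');
    }

    public function down(Schema $schema): void
    {
        // Backward direction: revert from X+1 back to X.
        $this->addSql('ALTER TABLE customers DROP COLUMN email');
    }
}
```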
Now, when you write a migration in Doctrine, it's basically PHP code, as in that sketch: a simple class which extends an abstract migration base class, with two methods, up and down, just as I explained earlier. The up method has some SQL to modify things in the forward direction, and the down method, if applicable, has SQL statements to modify things in the backwards direction. When you apply a migration, Doctrine makes sure that the version of the database is actually tracked. To do that, it adds another table, by default called migration_versions, and any migration that is successfully applied is added to that table, to indicate that the migration has been applied and that your database is now in state X plus one.

Personally, I'm very much a fan of a tool called Flyway, which records a little bit more context. Flyway, in its default mode, uses straight SQL scripts, and it actually does not consider down to be a valid direction; I'll go into that in a little bit. So it's just straight SQL, and when a migration is applied, Flyway also records it in a versioning table, with a little extra metadata. Not only do we see which changes were applied, but also when, by whom, what the script was, how long it took, all those things. So there's a little more information for you to have.

Now, I mentioned the down direction in a migration: down means we undo a migration that we have done. Another word for that would be rollback, rolling back your database. I want to get into whether rolling back, or actually rolling forward, is the way to go. Let's look into why a rollback can be a problem, and let's consider destructive changes. A destructive change could be a long-lived lock, which means that transactions that come in are blocked or rejected, which is a problem: you need to store those transactions somewhere or retry them, so that's very destructive to your business at that point. Renaming things is of course destructive, and deletion is, if anything, the most destructive thing of all.

When it comes to, for example, MySQL, especially in earlier versions, a lot of the ALTER TABLE statements you could come up with trigger an implicit commit, meaning that whatever transaction you're in, if you do an ALTER TABLE, that transaction is automatically committed and a new transaction is started. That can be detected, of course, but then what would rolling back mean? Would it mean rolling back the new transaction and the transaction before it as well, if you want to view them as a single migration? In the presence of an implicit commit there is no longer a single transaction to speak of.

Another thing is sharding. If you shard your database over multiple machines, and you shard on a specific key, for example, then rolling back the schema migration means going back to all those machines and rolling the schema change back on each of them, and potentially resharding if you changed your shard key during that migration. And one of the more important things: if you have multiple applications, potentially multiple services attached to multiple databases that are part of a distributed transaction, rolling that back is going to be difficult, if not impossible.
So another way of doing rollbacks, rather than rolling back your transaction, would of course be restoring from a backup. That implies the backup is recent enough, and that you don't have a lot of updates going on between the time of the last backup and where you are now, because you would either lose all that information or you'd need some mechanism to capture all those changes and replay them onto that latest backup. So I personally consider rollbacks not to be a valid setup in the modern world with databases. If you can take your database offline for X amount of time, sure, then you can roll back from a backup, but we're here to learn about database changes without downtime, and in that context I don't consider rollbacks to be viable.

Now, a little bit about downtime. Let's take a simple app with a database migration. Typically, in the pipeline, you create a new package, a new tar.gz, or whatever your deployment artifact is. You then shut down the application, run your database migration, deploy your package, and start the application again. This is very simple: you don't have to do anything to your database other than run the migration, nothing fancy. The problem is that running the migration can take quite a while, depending on how big your database is, and that automatically determines your downtime.

Now, what if you don't want downtime, which is why we're here? I don't think downtime is acceptable, at least not if we can avoid it. Especially if you start moving towards things like continuous delivery, which states that your code should always be in a releasable state. At any point in time, I should be able to walk into the team room and say release now, deploy now, because continuous delivery stipulates that we should always be able to do that. If the latest migration you add triggers a two-hour downtime, then the code is obviously not always in a releasable state, because then you need to find a deployment window. Continuous delivery is like that: even though the deployment to production is still a manual trigger, it needs to always be ready to happen. And continuous deployment, which is what I've been talking about, means that code that passes all the tests and all the automatic validations is automatically deployed to production. So there we really cannot get away with downtime, especially if we deploy 50-plus times per day.

So let's take a look at zero-downtime deployments. This is something you would do for highly available applications or services. Imagine a load balancer between the application and the internet, and we have app version one behind that load balancer. Now we have app version two ready to go. We start it, at some point it boots, and then we add it to the load balancer and it starts receiving traffic. At that point we take app v1 out of the load balancer and stop the instance, or the container, or whatever it is. If you do this right, your end user will never know that app v1 has just been replaced with app v2. Maybe they'll see some new functionality, but they should not have seen any downtime or any problems. So: no downtime.

Now, in this case we didn't have any migrations, so it was easy to have no downtime. But what if we have a certain database state? Let's go back to app v1, and let's assume we have a database at version one.
App v1 expects the database to be in state v1 as well. We then start app v2, and app v2, surprise surprise, expects the database to be in the version two state. So at this point we have an inconsistent system: app v1 expects the database to be in state v1, app v2 expects it to be in state v2, and the database cannot be in both states concurrently. So one of the two is probably going to crash, or at least give us wrong results.

In general, to avoid that, the old app should always work with the new database state. The old app that is running now cannot predict in what direction the database is going to go, so we need to make sure that the database change is always backwards compatible with what we have running in production. With highly available setups, you always need to assume that at least two versions of your app are running concurrently in production. If you adhere to that rule, you should be fine. It means that we need to decouple a database migration from a deployment.

Let's take a look at a few patterns that help us do that. One of them is called the expand/contract pattern, which applies to code and API development as well as to database development. It stipulates very strongly that we should maintain backwards compatibility towards whatever other system is relying on us. Let's look at it from an API perspective first. We have a bunch of clients talking to an existing API, and now we want to introduce a new version of that API. Rather than changing the existing API, which would probably lead to backwards compatibility issues with the three clients we have, we add the new API alongside it, and then we start deprecating the old API: okay, you have X amount of time to switch over to that new API. One by one we switch the clients over to the new API, at which point the existing API is no longer used and we can take it out of commission.

In database terms, it means we can only do non-destructive changes. Let's go deeper into what that means. Adding things is, in general, non-destructive, especially because your old app is not relying on the new thing being there; it doesn't even know it's going to be there. Adding tables: non-destructive. Adding columns to existing tables: same thing. Creating indices: usually non-destructive, depending on the database technology. So now we have a set of things we can do to our database that ensure backwards compatibility. If we stick to them, it means we can always go a version back in code. Say we migrated the database to v2, our app is now at v2 in production, and we see a problem with app v2. We can always go back to app v1 without having to touch the database, because database v2 is backwards compatible with app v1. Whatever we did to the database in that particular change, it doesn't really matter; we can always go one version back in terms of the code.

I want to walk you through a little example: renaming a column. Let's assume we have a table of persons that has a first name and a surname, and we're not very happy with the column name surname, so we want to rename it, but again without introducing downtime. We're going to do this in a few steps rather than in one step. The first step is that we create a column in the table with the new name. We don't actually touch surname yet; we just add last_name to that table.
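As a minimal sketch, assuming the persons table from the example and MySQL-style syntax, that first expand step could be a Doctrine migration whose up method does nothing more than this (I'm deliberately leaving out a down direction, in the rolling-forward spirit discussed earlier):

```php
public function up(Schema $schema): void
{
    // Expand step: add the new column next to the old surname column.
    // It has to be nullable (or get a default), because the currently
    // deployed app and all existing rows know nothing about it yet.
    $this->addSql('ALTER TABLE persons ADD last_name VARCHAR(255) DEFAULT NULL');
}
```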
We then start writing to both the old and the new columns. Whenever we do an update, for example through your Doctrine entity, when we set the last name on the entity we make sure both the old surname column and the new last_name column are populated, so the new column gets filled for new and updated objects. And in the getter we may want to add an if: if the new column isn't set yet, fall back to the old one.

Then we start migrating the old records: the individual records we haven't touched, we update them, just to make sure last_name is set in every record. Then we can start reading from the new column, at which point we can remove the old column and the code that dealt with the backwards compatibility, and we've actually completed the switch. So rather than doing it in one single step, where we deploy the change to the database and the change to the code in a single package, in a single deployment, we do it in a couple of steps; we split it up, essentially.

So, cool, that's simple. Well, not really; of course there are a few challenges. It's database technology, after all. One of them is dropping a column on a large table. Depending on the database engine, that means a table lock for as long as the operation takes, and that is a pain in the ass, I can't be any clearer about that. Another one is non-nullable types. Remember that when I added the column, it was nullable; I just added a VARCHAR(255) and nothing else, which means it can be null. If you say the column cannot be null, then you need to set a usable default value, otherwise you'll get integrity errors when you add new records. You can, however, after you've done all your migrations, alter the column again to make it non-nullable, once you've verified that every single record in that table has an actual value for it. Trickier still is migrating primary keys, where you run into foreign key constraints and things like that, but those cases are a bit more rare; I've never needed to do that, at least not on an existing table.

I also want to discuss the concept of a feature toggle with you in the context of a database migration. A feature toggle enables us to switch individual features in code on and off at runtime. It means we have alternative code paths in the code; essentially it's a glorified if statement. If the feature toggle is on, or evaluates to true, we take one code path; if it evaluates to false, we take the other. And again, we can switch that at runtime. Translating that to an actual application, we might have two versions of a front-end page, an old one on the left and a new one on the right, and with a feature toggle we can switch based on IP address, for example, or a cookie setting, or whatever, to let certain people see the new version. It's a sort of A/B testing, but at the feature level, the functional level.

If we bring this back to database migrations, we could put feature toggles around the stage where we write to both columns and the stage where we read from the new column. This allows us to do the whole code change at once, in a single deployment, and the code for writing to both the old and the new columns simply remains dormant in production until we enable that feature toggle.
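As an illustration, a dual-write setter and a fallback getter guarded by feature toggles might look roughly like this. This is a hand-rolled sketch: the Person entity, the FeatureToggles interface and the toggle names are hypothetical, not part of any particular library.

```php
<?php

// Hypothetical toggle abstraction, for illustration only.
interface FeatureToggles
{
    public function isEnabled(string $name): bool;
}

final class Person
{
    private $surname;   // old column
    private $lastName;  // new column
    private $toggles;

    public function __construct(FeatureToggles $toggles)
    {
        $this->toggles = $toggles;
    }

    public function setLastName(string $name): void
    {
        // Always keep the old column populated for the app version
        // that still reads surname.
        $this->surname = $name;

        if ($this->toggles->isEnabled('write_last_name')) {
            // Dormant until the toggle is flipped: also fill the new column.
            $this->lastName = $name;
        }
    }

    public function getLastName(): ?string
    {
        if ($this->toggles->isEnabled('read_last_name') && $this->lastName !== null) {
            return $this->lastName;
        }

        // Fall back to the old column for records that haven't been migrated yet.
        return $this->surname;
    }
}
```

In a real Doctrine entity you would probably keep the toggle check outside the entity, in a service or repository, but the idea is the same: the new code path stays dormant in production until the toggle is flipped.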
The same goes for reading from the new column, which means we can do the code change all at once. We can test it as a whole, we can even improve and iterate on it, and test it in staging, perhaps. At some point, when we're confident about it, we flip the toggle in production, and hey presto, our new column starts filling with data, and then we can start reading from the new column.

Now, if you add migrations, you typically want to test them as well. I would strongly advise keeping them in source control, probably with the application that actually uses that particular database, or at least that set of tables, and including a testing stage in your pipeline. For example, spin up an empty MySQL or SQLite instance before you run your tests, and apply all the migrations to it, just to test whether your migrations actually make sense, whether they work. You can even add fake data to see how the migrations respond, so you're not just testing the schema changes against an empty database, but also how they affect the data you have. And you can, and probably should, run them on staging as well, with a sensible set of data on your staging or acceptance environment that mirrors the data on production as closely as possible. As close as possible is becoming more and more of a thing, especially with the GDPR coming into effect in May, which, depending on how you read it, stipulates that staging data should be anonymized; at the very least, you cannot copy production data verbatim to acceptance, because typically the access controls on acceptance are a bit more open than on production. But you could run all that data through a filter and anonymize all the personally identifiable information. And again, referring back to the feature toggle: you can flip a toggle specifically for staging or acceptance and test the new behaviour over there.

Okay, a small digression into NoSQL land, which is also known as no-schema land, and which is, for our purposes, very cool, because we don't actually have a schema to change. There is no schema, so we don't have to do anything to change it. But if you do want to do data migrations, one of the options you have is to migrate during reads. As you do a read, you get the document, for example from MongoDB or CouchDB, into memory, and some fields will not exist, because the document was written before those fields existed. You can then add default values for those fields, or rename fields, in memory, and of course you write the document back to the database as soon as you're done with that migration. So inline, as you do a single read, you have some code that changes the document to the new state and then immediately writes it back to the database.

That also means that, depending on what you read, you will never touch all the records; there will be a bunch of records that you don't touch, or don't touch for a long time. So while this is going on, you also want to run a background migration over all the records you haven't touched yet: in chunks, take a batch of documents, modify them, and write them back to the database.
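To make the migrate-during-reads part of that a bit more concrete, here's a minimal sketch assuming the official mongodb/mongodb PHP library; the database, collection and field names are hypothetical, reusing the surname to last_name example.

```php
<?php

require 'vendor/autoload.php';

// Hypothetical database and collection names.
$collection = (new MongoDB\Client('mongodb://localhost:27017'))
    ->selectDatabase('app')
    ->selectCollection('people');

function findPerson(MongoDB\Collection $collection, $id)
{
    $doc = $collection->findOne(['_id' => $id]);

    // Migrate during the read: older documents were written before the
    // last_name field existed, so derive it from surname in memory...
    if ($doc !== null && !isset($doc['last_name']) && isset($doc['surname'])) {
        $doc['last_name'] = $doc['surname'];

        // ...and immediately persist the migrated shape back to the database.
        $collection->updateOne(
            ['_id' => $id],
            ['$set' => ['last_name' => $doc['surname']]]
        );
    }

    return $doc;
}
```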
The nice thing about that is that there's no downtime, but it does require some housekeeping, in that you need to maintain a counter or a log of where you are in the background migration, so that you can resume with the next chunk, and the next.

Okay, something else I want to talk about, another challenge in migrating databases, is large data sets. We already touched on large data sets and dropping columns, but in general, large data sets mean long-running migrations. In terms of errors: in my experience, migrations that run for hours and hours tend to fail right at the very end, and rarely at the beginning, which is annoying to say the least, because it makes debugging that much harder. Then there are long-running queries. Say you have a street name and a house number stored in one field and you want to split them. That's computationally complicated, depending on the country you're in, but even if you can write a query to do it, that query could run for a very long time. You need to be aware of that, because memory usage then potentially becomes a problem, in your application or in the database's own memory. And again, if such a migration or query fails at some point: where did you stop, can you restart or resume from the point where you failed, and do you record that? What is very important with these things is to log what you do and log your progress, because it helps you debug and, eventually, restart or resume the process.

One of the other strategies is running the migration on a failover. Say we have a replicated database, a master and slave setup, and both have the schema at version 1.0, or maybe it's not even the schema but the database engine version, and we want to upgrade either the engine or the schema. One of the things you can do is terminate the slave replication, so you only have the master, and then you modify the slave to 2.0, whatever 2.0 is, whether that's the schema or the engine version. Then you bring it up and promote it to master, and the old master reverts to being a slave; you take that offline, migrate it, bring it up again, and replication restarts, only in the other direction: what was the slave is now the master, and vice versa. Of course, your master, your single remaining machine, needs to be beefy enough to handle the traffic, especially if you normally serve reads from your slaves, but this is a completely no-downtime scenario. And if the migration on the slave fails, it's relatively easy to bring it back: we simply pull it into replication again, either from an empty slave or from the last known backup, it will catch up to whatever is reality on the master, we have a full cluster again, and we can try again at that point.
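Coming back for a second to the chunking, logging and resuming I mentioned: a rough PDO-based sketch of such a background backfill, reusing the surname to last_name example, might look like this. The table layout, batch size and progress file are my own assumptions, not anything prescribed by a particular tool.

```php
<?php

// Sketch only: assumes a MySQL `persons` table with an auto-increment id,
// an old `surname` column and a new, still-nullable `last_name` column.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$progressFile = __DIR__ . '/backfill.progress';
$lastId = is_file($progressFile) ? (int) file_get_contents($progressFile) : 0;
$batchSize = 1000;

while (true) {
    // Copy the old column into the new one, one chunk at a time,
    // so we never hold a long-running lock on the whole table.
    $stmt = $pdo->prepare(
        'UPDATE persons
            SET last_name = surname
          WHERE id > :lastId AND id <= :nextId AND last_name IS NULL'
    );
    $nextId = $lastId + $batchSize;
    $stmt->execute(['lastId' => $lastId, 'nextId' => $nextId]);

    // Log progress so a failed or interrupted run can resume from here.
    file_put_contents($progressFile, (string) $nextId);
    error_log(sprintf('Backfilled persons up to id %d (%d rows changed)', $nextId, $stmt->rowCount()));

    $maxId = (int) $pdo->query('SELECT MAX(id) FROM persons')->fetchColumn();
    if ($nextId >= $maxId) {
        break; // every existing record has been visited
    }

    $lastId = $nextId;
    usleep(100000); // brief pause to keep the extra load on the database low
}
```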
Running the migration on a failover like that is something that Etsy, for example, has done quite a few times for database migrations, and they have seriously large databases.

Another strategy you could use is to introduce a message bus or a queue, something like Kafka, but there are other technologies as well; RabbitMQ serves just as well. Let's take a look at what our world would look like then. We have a UI layer which sends messages, not directly to the back end in this case, but through a queue or a message bus. Those messages are picked up from the queue by the back end and processed, and typically lead to state changes in the database, and that database is then queried through a data layer by the UI, so we have a complete loop.

Now, if we want to replace the database technology, one of the things we can do is simply take the back end out for a while. The queue will simply start filling up with messages; updates will not take place, but we won't be down. We can inform our users that we won't process any updates for the next however long we think it's going to take, we add a new database or a new table, and then we re-enable the back end. It catches up with all the messages that are in the queue and processes them against the database.

This leads to a more general pattern called CQRS, which stands for command query responsibility segregation, which is a mouthful for basically saying: we separate reading from writing. Rather than reading from and writing to the exact same database and database model, we split the two. We write to something that is uniquely geared towards writing that data very fast, or at least in a very structured way, and then we have one or more database models that are geared towards reading that data. It also means we can store every update we receive in a log somewhere, and CQRS marries very well with a thing called event sourcing; there's a talk later today, I think, about that particular topic. When you look at that world, it looks a little similar, but rather than messages going to a queue, we have the UI sending commands to the back end.
Commands basically tell the back end: this is what I want to achieve. When a command has been validated by the back end, it turns into events: something has been done. Those events are then handled by event handlers, and they in turn can update multiple different databases with different schemas, different models. If I want a completely new database model, I can take all the updates in my log and apply them to the new query model; once I've seen the last update in my log, I switch to using that new query model, and until then I keep looping until I've replayed the entire event log into my database. I can even use that to rebuild staging from scratch: take the event log, or the update log, from production and rebuild staging just by running all the updates again onto a fresh, empty staging database. And of course we can run it through a filter to anonymize the staging data.

Now, a couple of tools to help you. We have the typical migration tools: there's Flyway, there's Doctrine which I talked about, Phinx has something for this, Laravel has its own migrations, and there's Liquibase, which is also a very useful toolset for these particular things. But one very cool tool, which I found out about last year, is Percona Toolkit's online schema change. I'm not sure whether it works on a bog-standard MySQL; it does work on their Percona offering. What's cool about online schema change is that it uses triggers to do the whole thing we've just talked about, without you having to do all those steps yourself. What it does is create a shadow copy, which is basically your target table, with the new schema version. Going back to surname and last_name: we would have a shadow copy which has last_name rather than surname, and a source table which still has surname. Online schema change then adds a bunch of triggers so that any update to the source table, which is still very much in use at that point, is forwarded to the shadow copy. Then the tool starts copying the data in small chunks from the source table to the shadow copy, the target table. At some point that process is considered complete, the target table is renamed to the source table, and without the application knowing about it, your source table has been bumped up a schema version. Of course, the principle of backwards compatibility in code still very much applies, because you now have an app in production that is not aware that, under the hood, a schema change tool and a bunch of triggers are taking care of all this. At some point the process completes and your source table is no longer in the state it was in before, so the app needs to be able to handle that, otherwise it would still break.

I want to recap this talk. If you do database schema migrations, do them in multiple steps, rather than in a single step, a single deployment with the code, and don't do any breaking changes: always maintain backwards compatibility. If you want to read more about this topic, I can highly recommend Refactoring Databases; there's the accompanying website, databaserefactoring.com, and there's another GitHub link. I'll make sure to post these slides online and on joind.in later on, so you can read all of that back. All right, if you want to reach out to me, that's my Twitter handle, my email address and my personal blog, and I would love for you to leave me some feedback on this particular talk at that joind.in link over there.