Okay, so hello, thanks for coming. My name is Alitia, and together with Greg and Roman we'll walk you through the subject of lockless upgrades.

First of all, we will explain the idea of upgrades with minimal downtime and specify some assumptions that have to be fulfilled while migrating a database schema online. Then we'll present some approaches that were adopted in different OpenStack projects. Next we'll move to the implemented proof of concept that was designed to test table lock problems, and finally we'll present some results comparing different database engines.

We all know the reasons for doing upgrades, and we all want to do them without violating an SLA, while keeping our costs minimal in order to keep our users satisfied. So it would be best if an upgrade were possible with minimal or even zero downtime. The goal is to perform a so-called rolling upgrade. A rolling upgrade allows you to keep your existing distributed system working while upgrading its components gradually. In comparison to the standard procedure, in which you have to stop all services, upgrade or even reinstall them, and then start them back up again, a rolling upgrade allows you to take only one service down for upgrade and then join it back to the others. As a result, however, you have some services in the older version and some services in the newer version running and interacting in the same cloud, so you have to take care of compatibility between them.

There are a couple of problem areas when upgrading a distributed system. Possible areas of conflict include changes in the API, in the content sent over RPC, and in the RPC interface itself. Another important thing is migrating the data and the database schema, and in this presentation we would like to focus on online schema migration.

So, to start with, what does online mean? Online means that the database schema is migrated without any downtime, so all services have constant access to the database, and both the older and the newer version run against the same schema at the same time. In order to alter a table structure without blocking reads or writes for the services that use the database, it is important to make some assumptions. You can't do any destructive changes: you can't delete or rename any column or table while any service still uses the older version. You should rather add a new table or a new column, and you can delete the previous ones once all services stop using them.

Now we would like to present the current status of some chosen OpenStack projects. We'll focus on the projects listed on this slide. We chose them for two reasons: first of all, each of them is a core project and upgrading it with minimal downtime is crucial; secondly, a lot has already been done or discussed about upgrades in these projects. We will provide some basic information.
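[Editor's note: to make the "additive only" rule above concrete, here is a minimal sketch of what an expand-style migration could look like in an Alembic-based project like the ones discussed in the talk. The table and column names (instances, display_name_new) and the revision identifiers are made up for illustration; the point is that the new column is added as nullable, so existing rows and older services are unaffected, and the old column is only dropped in a later, separate contract migration once no service reads it anymore.]

```python
# Illustrative Alembic "expand" migration; all names are hypothetical.
from alembic import op
import sqlalchemy as sa

revision = 'expand01'      # placeholder revision identifiers
down_revision = None

def upgrade():
    # Expand step: purely additive, safe while old services are still running.
    # The column is nullable, so existing rows need no backfill yet and old
    # code that never references the column keeps working.
    op.add_column('instances',
                  sa.Column('display_name_new', sa.String(255), nullable=True))

def downgrade():
    op.drop_column('instances', 'display_name_new')
```

The matching contract migration, which drops the old column, would live in a later release and only be applied once every service has been upgraded.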
If you want more details, please visit the Etherpad at the link provided below; there is information from today's ops upgrade session.

Generally, Nova is at the most advanced stage of upgrades. Cinder has not introduced online migrations yet, but a technical preview has already been discussed and it is a work in progress. In Keystone there were some discussions about possible table locks, which led to the implementation of the proof of concept we'll describe later. In Neutron there is an approach known as the expand/contract approach, which means that we expand the database schema online before introducing new code and then do the contract offline, and the server is backward compatible with the agents. In Heat there is code for live DB migrations, but it hasn't been tested yet. To give you a better overview of how it works in Nova, Cinder and Keystone, we will briefly describe them in the next slides.

Okay, so in Nova it is possible to upgrade the schema and the data over two development cycles. This diagram represents how it is done in three simple steps. First of all, we have to upgrade the schema to the new version. Then we have to upgrade the service called conductor, and it's important to do that atomically, so all conductors at once. Then we can just run our cloud again. Inside Nova, in the object model, the data is converted online, so we can have compute services running two different versions at the same time. After some time, or after another development cycle, we can get rid of the old schema and move fully to the new one. It's also important to mention that if not all of the records have been converted, we can force the conversion using the nova-manage online data migrations command.

The Cinder approach is a little bit different, because all Cinder services need to connect to the same database. In this example we need to change a column type. We would introduce the new column but still keep the old one, to stay compatible with the older release. After this change we would run the data migration script, because we are not sure that all the records would be touched by normal operations, so all of the data from the old column is migrated to the new column. In the X+2 version we would start reading from the new column; we still need to keep the old column to stay backward compatible. In the last release we can stop using the old column and drop it from the schema. So, in summary: the base release writes to the old place and reads from there; the new release starts writing data to both places; then we start reading from the new place; and in the last release we can stop writing to the old place and drop the old column from the schema.
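[Editor's note: to make the four Cinder phases easier to follow, here is a small, purely illustrative Python sketch of the per-release read/write logic. The Volume class, column names and RELEASE switch are assumptions, not real Cinder code.]

```python
# Hedged sketch of the Cinder-style phased column migration described above.
RELEASE = "X+1"  # which release this service is running

class Volume:
    def __init__(self):
        self.size = None       # old column (old type)
        self.size_new = None   # new column (new type), added in release X+1

def convert(value):
    return value               # placeholder for the old-type to new-type conversion

def write_size(volume, value):
    if RELEASE == "X":
        volume.size = value                # base release: old column only
    elif RELEASE in ("X+1", "X+2"):
        volume.size = value                # keep filling the old column for older services
        volume.size_new = convert(value)   # and also fill the new column
    else:                                  # X+3: old column no longer written, can be dropped
        volume.size_new = convert(value)

def read_size(volume):
    if RELEASE in ("X", "X+1"):
        return volume.size                 # still read from the old place
    return volume.size_new                 # X+2 onwards: read from the new place
```

The out-of-band migration script that backfills rows never touched by normal writes is what makes the switch to reading from the new column safe.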
In Keystone we thought about another approach, because we wanted to keep the change within one release. If you follow from the top, this is an SQLAlchemy model. We could introduce a backwards-compatible flag, which means that in the new version, when we want to keep compatibility, we have the old column (role_id), we introduce a new column, and there is a copy column implemented which also converts the data to the new type. When we are sure that all data is migrated, we can switch the flag to stop being backwards compatible, and then we can drop the old column.

Okay. Although we separate the data and schema migrations, there could still be issues with downtime even if we are only doing the schema migration. This simple diagram shows how MySQL internally handles altering a table. To change the schema, it first creates an empty table with the new schema and starts copying the data from the source table; after that process is finished, the source table is deleted and the new table is renamed in its place. Unfortunately, while the data is being copied from the source table to the temporary one, we cannot change, delete or otherwise alter the data inside. This is how MySQL does it.

This is just a sample of a somewhat larger table which shows which operations rebuild the schema. Starting from MySQL 5.6, ALTER TABLE was reorganized a bit, so there is a possibility of upgrading the schema without doing the copy I showed on the previous slide. However, there is still a way to run into some kind of lock: for example, if we change the character set of a certain column, then the copying is still performed and we cannot update or delete the data inside.

In PostgreSQL it's a bit different, because internally it performs these operations in a different way. We experienced some problems there too, for example with adding a foreign key, and there are some workarounds, but they are not perfect.

Okay, so we did some experiments. The idea of this POC actually came up at the last Keystone midcycle, and we needed a way to quickly test the problem and test any solutions which might fix it. These were the 14 examples that were proposed for testing. The DDL load tool, as we call it, consists of an initial schema and an initial workload which fills the database; then we run the workload test and change the schema, and the result is written out as an XML file. You can run these tests by typing make, generate the graphs, and then compare them with tsplot, to for example compare an empty test with one that changes the schema.

So here are some results. With MySQL, adding a column, for example, causes a spike in this graph. First we fill the database with initial data, then we wait some time and start the test, and at around the 400 mark we change the schema. Other operations also cause spikes, but you may see on this graph that dropping a table which is not actually used during the drop doesn't cause any problems. In Postgres we also tested some operations, and most of them didn't cause any problems. For example, adding a column causes only a minimal spike, while changing or adding a foreign key made the request durations much longer, and we actually also saw some Postgres errors: deadlocks on that table.
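[Editor's note: to make the shape of these experiments concrete, the POC is essentially "run a steady workload, issue one DDL statement in the middle, and plot request durations". Below is a minimal sketch of that idea, assuming a SQLAlchemy engine and a made-up items table; this is not the actual DDL load tool, and the connection URL, timings and statements are illustrative.]

```python
# Minimal sketch of a DDL-under-load test in the spirit of the POC described above.
import threading
import time
import sqlalchemy as sa

engine = sa.create_engine("mysql+pymysql://user:pass@localhost/ddl_test")  # assumed URL

with engine.begin() as conn:
    conn.execute(sa.text(
        "CREATE TABLE IF NOT EXISTS items "
        "(id INT AUTO_INCREMENT PRIMARY KEY, payload VARCHAR(64))"))

def workload(latencies, stop):
    # Simple write workload; record how long each request takes and when it ran.
    while not stop.is_set():
        start = time.monotonic()
        with engine.begin() as conn:
            conn.execute(sa.text("INSERT INTO items (payload) VALUES ('x')"))
        latencies.append((time.monotonic() - start, time.monotonic()))
        time.sleep(0.01)

latencies, stop = [], threading.Event()
worker = threading.Thread(target=workload, args=(latencies, stop))
worker.start()

time.sleep(30)  # let the baseline settle (the talk waited roughly 400 seconds)
with engine.begin() as conn:
    conn.execute(sa.text("ALTER TABLE items ADD COLUMN extra VARCHAR(64)"))  # DDL under load

time.sleep(30)
stop.set()
worker.join()
# A spike in the recorded latencies around the ALTER indicates the table was locked.
```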
So, to sum up these first results: MySQL 5.5 has problems with pretty much everything besides deleting an unused table. In Postgres the results were mostly positive, but adding a foreign key would be a problem, and there were also locking errors returned. From the last but not least graph you may have seen that it was not actually the load: the connect time stayed even, but the request time started to spike, so the database was not overloaded; the lock on the table caused the request durations to get longer.

So what are some approaches? First of all, there's a simple solution: upgrading the database itself, because there are improvements within the engine, so we can just try that first. Secondly, there is something called trigger-based online schema change. The idea is the very same as what MySQL does internally; however, introducing triggers connected with the column we have changed doesn't block our writes or other alterations, and the changes are sent directly to the copy table. We don't have to do this by hand; we can just use some of the existing tools. The first one is Percona's pt-online-schema-change, the second is Online Schema Change from Facebook, and there is also a suite of tools called openark kit, which is written in Python. Unfortunately, the last two aren't developed anymore (nothing for about four years), while the Percona tool is still under heavy development.

In the case of Postgres we didn't find similar tools, but there is a different approach. You can use the CONCURRENTLY keyword to add and drop indexes. You can then use those indexes to create constraints with the USING INDEX keyword, so you can divide these operations into two, which should reduce the time the table is locked. You can also use the NOT VALID keyword to skip validating the constraint during the ALTER operation and validate the constraint at a later time; from version 9.4 of Postgres this validation takes a weaker SHARE UPDATE EXCLUSIVE lock. You can also use the pg_repack tool instead of VACUUM FULL.

The third solution is similar to what I showed you previously, the one Cinder uses, but using a separate table. To avoid the table lock, you would first start writing the data to the new table as well, then migrate all of the remaining data to the new table, and after you stop using the old table you can remove it.
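[Editor's note: for the Postgres workarounds just mentioned, the pattern is to split one blocking ALTER into cheaper steps. Here is a minimal sketch assuming a SQLAlchemy engine and made-up table, column and constraint names; the keywords themselves (CONCURRENTLY, USING INDEX, NOT VALID, VALIDATE CONSTRAINT) are standard PostgreSQL.]

```python
# Sketch of splitting blocking Postgres DDL into weaker-lock steps, as described above.
# Table, column, and constraint names are illustrative.
import sqlalchemy as sa

engine = sa.create_engine("postgresql+psycopg2://user:pass@localhost/ddl_test")  # assumed URL

# CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
# so use an autocommit connection for it.
with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
    conn.execute(sa.text(
        "CREATE UNIQUE INDEX CONCURRENTLY items_ref_id_idx ON items (ref_id)"))

with engine.begin() as conn:
    # Reuse the already-built index instead of letting ADD CONSTRAINT build one under a lock.
    conn.execute(sa.text(
        "ALTER TABLE items ADD CONSTRAINT items_ref_id_uniq "
        "UNIQUE USING INDEX items_ref_id_idx"))
    # Add the foreign key without validating existing rows (only a short lock)...
    conn.execute(sa.text(
        "ALTER TABLE items ADD CONSTRAINT items_ref_fk FOREIGN KEY (ref_id) "
        "REFERENCES refs (id) NOT VALID"))

# ...and validate it later, at a quiet time; since PostgreSQL 9.4 this takes
# only a SHARE UPDATE EXCLUSIVE lock, so normal reads and writes continue.
with engine.begin() as conn:
    conn.execute(sa.text("ALTER TABLE items VALIDATE CONSTRAINT items_ref_fk"))
```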
So when you we upgraded the mysql database to 5.7 the Schema altering operations which caused spike of mean Request duration We saw an improvement in the case of other operations in this example we We the change primary key still caused differentiation in request duration We also tried a Blocking operation of changing the chart set Upgrading the database caused an improvement But and using the PT only schema change actually elevated the problem in our test in case of postgres Upgrading the database caused an improvement in performance, but you can see there's still a spike when adding a foreign key and Other operations still behave like previously We also tried splitting this operation into two with not valid and validate which also caused an improvement but upgrading the database When upgrading the database we didn't see much change So what are the key take house you should definitely upgrade your database if you're running mysql 5.5 you can also try PT online schema change and In case of postgres you should divide your queries and use their concurrently alter with an index not the value divided constraint keywords In case of postgres you could also use transaction transactional DDL so we can revert long-run queries so When you introduce a change that starts to block your service You can revert it and then try it at a later time when your service is not loaded You can also try the DDL load test to Test your own approach can drop in your schema from your database You can then record some of your session with some proxy to Have the workload be as much similar to your real workload and Run these tests and you can graph and analyze the reports and see if there are any errors So thank you. This is it. Do you have any questions? Yeah, I have one if you go back to the the for phased migration you had under cinder After the third phase where you stop reads to the old column The old table whatever how How do I go about getting all the data in that old table in to the new table? Do I have some sort of migration event out of band of these four schema deployments? Yeah, so in the version X plus one you Would start writing to your new table when normally accessing the data So if you write the data you would write to both places at the same time And then you would also run a migration script to make sure that all records are touched and migrated before Switching to the new version and the new version would check When if all of these data are migrated seven method to read both tables But after I've deployed that I can be moving data out of the old table into the new table Yeah, that's what that migration script entails. Yeah, cool Okay, so we'll finish up by maybe Inviting you to the event that's tomorrow on which also there was there will be a panel of experts to Discuss upgrading OpenStack. Thank you