Since we are on the same topic, I would like to share a recent blog post I read about how GitLab planned their migration from Postgres 9.6 to Postgres 11. They have written a detailed post, and I thought it would be really useful for everyone to go through it. I will explain a little bit of the planning they did, and after that we can discuss how people normally plan their migrations and upgrades to newer versions. You can access the blog post on GitLab.com; I'm going to paste the link in the chat so it becomes easier for others to follow. It is a fairly lengthy post.

What GitLab wanted to do was upgrade from Postgres 9.6 to Postgres 11. This migration was not only a technical decision the team took; they also had business reasons to move from 9.6 to 11. One main reason is the approaching end of life for 9.6. The other is that the newer version has a lot of features that will help their business scale, along with a lot of other operations connected with that. The post covers how they planned the entire migration, what tools they used, and what their migration day looked like. They also recorded the entire process on that day and released it as a YouTube video, so it becomes easier for people to see how they did it within a window of, say, two to three hours.

They give some details about their cluster size and the number of transactions they handle. From the scale of it, it looks really big: there is no particular day where GitLab's usage drops to just 10%; over time it ranges from 200,000 to 300,000 transactions. You cannot say that on Sundays traffic will be low and on Mondays it will be high; the traffic distribution looks fairly even. As a result, there is no easy way for them to migrate unless they announce some kind of two or three hour window to customers as early as possible. They also lay out the reasons they want to move away from PostgreSQL 9.6 to PostgreSQL 11, and you can see the technical details for that.

The next part describes how their infrastructure runs: it is a cluster-based setup with more than six machines, and you can see core counts going up to 96 CPU cores. It is a really large setup. They are using everything required for load balancing, leader election and so on. They also describe their load on an average day, and how many transactions happen during the busiest hours versus the quieter hours.

The main thing for GitLab was that while migrating from 9.6 to 11, they were very clear that the migration had to go through, and at the same time they had to plan for a rollback: if something goes wrong in between, they should be able to roll back to 9.6 without any issues. They also announced a downtime of two to three hours, so the migration had to happen within that time frame, and the nodes in the cluster could not lag behind by much, because otherwise that lag becomes visible in the application itself.
So those were some of the main constraints GitLab needed to take into account. How did they plan it? While reading the blog post, it becomes clear that this is not the kind of migration you can plan in a week or two. It required more like a month of planning, though they have not disclosed how long the entire effort took from the planning phase to the execution phase. If you read through the post, it is very clear that it is more like developing a quarterly plan: this quarter, this is what we are going to do. It looks like the team developed a proper quarterly plan, figuring out what the data backup would be, how they would test it in staging, then running performance measurements on all the queries hitting the current system to check there is no regression or performance penalty, and then actually migrating the cluster.

They broke this into multiple phases. In the first phase, they came up with the Ansible playbook for how they would perform the migration. The second phase was to set up a staging environment with the same production data and the same environment setup. Once they had upgraded that from 9.6 to 11, they ran an end-to-end test on staging. The fourth phase was the upgrade in production: on a particular day they decided this is what we are going to do, along with the measurements, the maintenance announcement and all those details. Once that was done, they were able to complete it in roughly two and a half to three and a half hours.

The important thing to notice is that they used a tool called pg_upgrade, which helps you move from one major version to another, and they give a detailed explanation of how pg_upgrade works. The main thing to note is that from one Postgres version to another, the way the data itself is stored on disk may not change, but the metadata about the tables, how they are laid out in the catalogs, may be different. That means you have to be very careful in checking version compatibility, and they did that. They then ran a regression benchmark to confirm there were no slow queries and no performance penalty after the migration.

They automated everything using the Ansible playbook. They had to stop all the services accessing the database, like the API servers, the workers and all the other components, and finally they upgraded Postgres during that window. Here you can see the traffic before and after the migration; it was a pretty smooth migration, with about two hours of downtime. If you look closely, the entire process took around four hours, but the actual migration was only two hours.

An interesting part is that one person was driving the initiative: when you run certain scripts you are asked to enter information, yes or no, and there was someone else reading the runbook and giving instructions to the person actually doing the migration. That makes it easier for the person doing the migration to make decisions, because they only need to enter the commands, without having to think too much about what will happen or remember the previous answers they have given.
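To make the regression-check step above a bit more concrete: one common way to compare query performance before and after an upgrade, which is not necessarily what GitLab themselves used, is to snapshot the pg_stat_statements view on the old cluster and again on the new one and diff the slowest statements. A minimal sketch, assuming the extension is available:

```sql
-- Hedged sketch: requires pg_stat_statements in shared_preload_libraries.
-- The timing columns are total_time/mean_time up to Postgres 12 and
-- total_exec_time/mean_exec_time from Postgres 13 onward.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Capture the heaviest statements; run this before the upgrade, save the
-- output, and run it again afterwards to spot regressions.
SELECT query,
       calls,
       round(mean_time::numeric, 2)  AS mean_ms,
       round(total_time::numeric, 2) AS total_ms
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;
```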
So this is something interesting to me, because they also planned for what happens if things do not go well, that is, if they have to revert to the older version, and they rehearsed the entire migration on production data in a staging environment. Most of the time I have found that very few organizations have the resources to plan out an entire migration in advance like this, and to do a dry run on production data to see what problems they can catch ahead of time. Those are the details of the Postgres migration, and now we can have a bit of discussion about how people do migrations and upgrades in their organizations. Once we are done with that, at the end I will present a small tool called sqlcheck, which helps you find anti-patterns in your SQL queries; we will get to that in a moment. Before that, I would like to pause for a few seconds, and then let's talk about migrations and upgrades: how you do them and what good practices you can share.

Hi everyone, this is Navarone here. Since Craze brought up the pg_upgrade tool: I work at Claricides, and we have a five and a half-ish terabyte single-machine Postgres database, that being the size of the primary. We do have a secondary with streaming replication, which basically replays everything that we have. We had been lagging behind a bit: we were on Postgres 10.2 for some time and wanted to go to 12.3 because of the improvements there. So we did all that playbook work, not written as an Ansible playbook, but as literate programming: in a markdown file we wrote out each of the steps we would perform, for the staging process first.

One thing that is really beautiful about pg_upgrade's hard-link mode is that the actual upgrade took just three minutes; five and a half terabytes of database upgraded in three minutes. We just had to take some precautions even before that. We took a hot, online backup of the database while it was active. We are on GCP and the DB runs on a VM with persistent disks, so we took the hot backup and, in between, a snapshot of the disks, in case anything went wrong. Once the new cluster, the PG 12 cluster, is started, the pg_upgrade documentation says you cannot revert, because the pages are rewritten; the new catalog is PG 12's and PG 10 will not understand it anymore. So we had this idea that even if we cannot revert, we will just stop the machine and roll back to the previous snapshot. In between we stopped all writes to the system: PgBouncer was stopped and the platform was taken down.

We are still trying to find a way to do this in production, because in production we cannot take that long a downtime; we have to be really swift. One way we thought of is that, since our customer base is mostly in one region, we can probably take a downtime when our traffic metrics are really low. There would still be connection severances, but yes, this was our story with the pg_upgrade tool. It is a great tool. We were a bit apprehensive about jumping across major versions, but it worked out really well, although the tool's guidance is to hop one version at a time.
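Since the speaker mentions stopping PgBouncer, halting all writes and relying on a streaming standby, here is a minimal sketch of the kind of pre-cutover sanity checks that fit that description. These are generic catalog queries for Postgres 10 or later, not taken from their actual runbook:

```sql
-- 1. After PgBouncer and the platform are stopped, confirm no client
--    sessions are still connected and doing work on the primary.
SELECT pid, usename, state, query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND state <> 'idle';

-- 2. Confirm the streaming standby has replayed everything it has received
--    before the cutover (zero or near-zero lag in bytes).
SELECT application_name, state,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```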
So yeah, on the war-story side, we have an interesting one. Once, one of our engineers ran an UPDATE query without a WHERE clause, basically without any filter, and that overwrote one of the columns for the whole table. The silver lining is that it was not a business-data table; it was more like application metadata, so we were a bit safe in that sense. It had just, I think, fifty rows; it was a very small table. The problem was that we did not have that data anywhere else. Even the replica had the data updated; the query had already been replicated, so the records were gone there too.

So we thought, hey, the dead tuples should still be there on disk. We basically read the raw pages from the disk, and fortunately we had someone on the team with a very good understanding of Postgres internals and how Postgres reads pages. That person wrote a program which read the page and, to a reasonably well-defined extent of human interpretation, got the data back into a usable form, and then we re-entered the data into the table by hand. When reading the binary page, we could comprehend the data.

That was one war story. The second one, I don't know if it is a war story or not, is probably more about not knowing what to do and what not to do with Postgres: someone sent a SIGKILL to the postmaster and there was havoc across the whole system, because it leaves things in a very inconsistent state. The engineer eventually learned that they should not SIGKILL the postmaster. But yeah, thanks for listening to me for the last few minutes.

Thanks for sharing that, because it is very rare to find people who understand Postgres from top to bottom, where you can read the pages, the B+ trees, everything. Getting data back like that is not something you can do under pressure. I don't know how long it took to come back to a good state; if you can share that, it would be much more interesting.

To give you context, it was around seven o'clock in the evening when this happened; everyone was just wrapping up their stuff and heading home. Suddenly an engineer read out aloud: hey, I think I made a mistake, I think I ran an update query without setting the filter. And we got the data back by nine o'clock, so about two hours. The ideas struck really fast. The first thing we tried was: let's see if the replication lag, the replay lag, is high, so that the secondary would still have the old data; then we could sever the replication then and there and get the data back from the secondary. But the problem was that replication was running really fine; there were very few update or insert queries going through, so replication was pristine. The second approach was: let's go back to basics. Since Postgres does not modify the existing tuples, it adds a new tuple and then changes the ctid, let's go that way. That is where the idea came from: if the pages are still there and we can somehow find the page, we can get the data back. Finding the page took a bit of time; understanding it was easier because we had the engineer who knew Postgres internals.
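The team above wrote their own page reader, but the stock contrib extension pageinspect gives similar visibility into raw heap pages, as long as autovacuum has not yet removed the dead row versions. A small sketch with a hypothetical table name and block number:

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- List every line pointer on block 0 of the table, including superseded row
-- versions; a non-zero t_xmax typically marks a version that was deleted or
-- updated over.
SELECT lp, lp_len, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('app_metadata', 0));

-- Decode the raw attribute values of each tuple on that page, which is the
-- part the team did by hand when recovering the overwritten column.
SELECT lp, t_xmin, t_xmax, t_attrs
FROM heap_page_item_attrs(get_raw_page('app_metadata', 0), 'app_metadata'::regclass);
```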
So, I have a couple of things to add here. Yeah, go ahead. Sorry, I joined late; one thing that came to my mind is, why couldn't you try point-in-time recovery?

The problem with PITR there is that we have lots of data. We do a different form of PITR: we just take snapshots. So up to a certain time delta we can get the data back, but not to the level of, hey, can I get the data back at a one-second granularity. We don't do that, as a conscious decision.

If anyone wants to read a little more about what Navarone was talking about, the tuples and how Postgres does writes: Uber has a very famous story on why they migrated away from Postgres toward MySQL a few years back. (Would you mind muting yourself? Sorry.) It also comes down to the fact that Postgres does not actually do in-place writes; it writes a whole new row version and then changes the ctid, which is the same behaviour Navarone and team were able to use (there is a minimal sketch of this a little further below). That particular story, the Postgres rebuttal and the original blog post, are quite illustrative reads; I strongly recommend people go read them.

Secondly, we have been talking about pg_upgrade quite a bit. The reason pg_upgrade is so good is that it is part of the Postgres build process. A point release or an RC release does not go out before pg_upgrade has been run against it; every incremental upgrade goes through a pg_upgrade test, so every single commit that goes into Postgres is upgrade compatible. That sort of speaks to the Postgres core team's commitment to stay true to the relational model and also be very forward looking. One of the reasons Postgres is so good, and also a bit weird, is that the core team will not make easy compromises on the core model for the sake of performance or operational benefits, and that reflects in all the decisions. That's it.

I have one more war story to share. I'll give others a chance first; if not, I'll go ahead and share. Please go ahead. Cool. This one is slightly different. It is not operational; it is about the data model and understanding data types. It is from one of my consulting projects, and again, I unfortunately can't share a lot of details. I'm assuming people here have worked with serial or auto-incrementing IDs; we could do a quick show of hands, or I'll just assume that anybody who has worked with Active Record or Django, or any modern Postgres setup in general, has worked with auto-incrementing IDs. The project I was working on also had all their integer primary keys auto-incrementing. However, instead of treating the integer IDs as just a unique value, this team made the mistake of relying on the fact that the IDs are auto-incrementing and therefore assuming they would be serial in nature. Those are two very different things.

Let me talk about the lesson from this story first, and then we'll talk about what went wrong. The lesson is: if you are using a particular data type, what are its invariants? Whenever we design a schema or give something a type, we need to look at the invariants of that data type rather than all our assumptions about it. What is always true about an auto-incrementing integer unique key? That it is unique, and that it is an integer. That's it; those are the only two things that are true about it. The fact that it is auto-incrementing does not play into the core invariant of that model.
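Before returning to the war story, here is the minimal illustration promised above of the point about in-place writes, using a throwaway table. Postgres writes a new row version on UPDATE rather than modifying the row where it sits, so the row's physical address, its ctid, changes and the old version lingers until vacuum:

```sql
CREATE TABLE mvcc_demo (id int PRIMARY KEY, val text);
INSERT INTO mvcc_demo VALUES (1, 'before');

SELECT ctid, id, val FROM mvcc_demo;             -- typically (0,1)

UPDATE mvcc_demo SET val = 'after' WHERE id = 1;

SELECT ctid, id, val FROM mvcc_demo;             -- now (0,2): a new tuple version
```

The dead (0,1) version stays on the page until vacuum reclaims it, which is exactly what made the recovery described earlier possible.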
Now, coming back to the war story: they had certain assumptions built into their model and their business logic that the incrementing ID means something. They treated it as a series, and there were things like aggregations and ORDER BY clauses built on the ID. This was all fine while they were operating at a reasonably low scale, with no race conditions; things were okay. Then they started growing, and the infrastructure was running at pretty high capacity.

Now, a serial ID in particular does not survive a rolled-back transaction. Let's say there is a primary ID column and you insert a record into a table within a transaction, and after that you roll the transaction back: you are not going to get that ID back. The ID will have moved on, because it comes from a sequence, and a sequence, once consumed, does not go back; it is not rolled back. So if your current maximum ID is 10,000 and you start a transaction, create a record with 10,001 and then roll it back, the next ID will be 10,002, not 10,001 again. This particular fact the team did not understand, because it is not apparent, it is not easy, the documentation does not spell it out prominently, very much like the transaction wraparound problem. These assumptions ended up causing data inaccuracies, because the moment there are gaps, the assumption that serial IDs come one right after another is no longer true, and so all the reports and everything they had built on that assumption were no longer correct. The cost they paid was pretty dear: they lost a couple of customers and suffered as a consequence. The lesson learned was this: if you are designing something, you need to look at the invariants of that particular data type, or of anything for that matter. That's it; that's the story.

Thanks, Swannan. That is also interesting because there are configurations where you can get the IDs back, but as you said, that is also not a good practice, because if, say, a customer ID gets changed it can cause, as you said, a lot of havoc. Exactly. This particular pattern is actually a favorite topic of mine that I keep coming back to, even in my workshops: invariant-based design. When you are designing something, you write down the invariants first, what does not change no matter what happens, and then you build up your design based on that. Speaking specifically about primary keys, UUIDs help you overcome this particular problem in a way, but they bring their own problems. Anyway, that was the story I had to share.
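To make the sequence behaviour from this story concrete, here is a small self-contained demo; the table and values are made up for illustration:

```sql
CREATE TABLE orders (id serial PRIMARY KEY, note text);

INSERT INTO orders (note) VALUES ('first');    -- gets id = 1

BEGIN;
INSERT INTO orders (note) VALUES ('oops');     -- consumes id = 2
ROLLBACK;                                      -- the row is gone, the id is not reused

INSERT INTO orders (note) VALUES ('second');   -- gets id = 3, leaving a gap

SELECT id, note FROM orders ORDER BY id;       -- returns 1 'first' and 3 'second'
```

The id column is still unique and still an integer, but it is not gapless and not a reliable record of insertion history, which is the invariant distinction the story hinges on.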