So as you said, this is migrating a live database into RDS with no downtime, but what the talk should really be called is how we thought we were going to migrate into RDS with no downtime, and what actually happened when we tried it. Now, the plan should have worked. It's just that there were a few roadblocks along the way that we could not get past. I'll explain all of that, and how we hope to resolve it in the future.

Actually, before I even get to our installation: I tend to talk really fast. If I'm talking really fast, slow me down. If you have any questions, ask them during the presentation, because I timed this out and I'll probably be way too fast.

So, a bit about our installation and what we do. We have a completely isolated production environment in AWS that's not connected to any other part of our company. That's where everything is. We have multiple databases that host everything from live transactions, payment details, foreign exchange quotes and the trades associated with those, fraud tracking, and audit records for PCI and anti-money laundering. Right now we have dedicated EC2 hosts for our Postgres installations, and we do backups to S3. Pretty much the standard recommendation from five years ago on how to host a database in AWS. But things have evolved, so we're trying to evolve with them.

What do I mean by no downtime? I mean no disruption of the service. We have an SLA that guarantees five nines outside of maintenance windows. Not quite the SLA that GE presented yesterday, which is absolutely amazing, but it's still pretty strict, and we have very limited opportunity for whole-service outages for upgrades and migrations. So we have to plan the evolution of the service accordingly, and make sure we can swap nodes in and out as we upgrade them.

Customer service doesn't count towards uptime. I don't really care if the CSRs can't get to the database for five minutes. Yes, it is a problem, but it matters more if we lose a thousand transactions for a customer because we're down. Administrative functions also don't count; this includes things like fraud screening, where the SLAs are on the order of hours to review transactions. And merchant access: if a merchant comes in and wants to see details about their transactions, again, I don't really care if that's down for five minutes. They care, and they will yell at us, but we'll send them email and smooth it over. The only thing that really matters is payments.

So the outline of how we were going to do this: determine a reasonable approach to migrate to a multi-AZ RDS installation, so that we have true redundancy across multiple sites in AWS, and change everything else after. We were not planning a wholesale refactor of our system as we migrated it. Move it, get it working, and then change everything else, because we have grand plans for lots of read replicas and for changing over lots of components to take advantage of all the things Grant talked about yesterday. But that's a second step. And there was a deadline, because this actually does need to happen: the administrative overhead we currently carry inside the company is too high. We're not DBAs, we're not experts, we just sort of pretend to be sometimes.
And then, of course, submit a talk, because what could possibly go wrong if you have a deadline and it has to be done? So the plan was: create a replica of the current installation, work out all the details, keep good records (for the talk, of course, but actually for the audits), and rerun all the tests to make sure it's all 1 AM-proof. Because even though we deal with payments everywhere on the globe, it turns out that 1 AM is the best time to do this, even though that's the middle of the day in Australia. I have no idea why people don't buy things in the middle of the day in Australia; they buy things at exactly five o'clock. Then measure the side effects and go through it again, documenting everything to make sure it really is 1 AM-proof, because it has to be truly idiot-proof.

So, start with the WORM. This is a write once, read many database that we have, dedicated to storing events. We keep audit records of everything: every request and response that goes in and out of the system, we need to keep a record of, with detailed timing. It's interesting, actually; this system started as a finger-pointing exercise, because we were having an issue with one of our fraud partners and they didn't believe it was their fault. Anyway, there are no updates to this system; it's for auditing and diagnostics only, so we can afford delayed reads. The audit diagnostics can usually wait, but we can't afford to lose the writes. Those were the constraints on moving this database.

The plan for this one: use DMS to migrate all of the data, switch all the reads in the system to use the replica (the soon-to-be new master), and verify the data integrity. Actually, those two steps should be swapped. Then switch the writes. Because the system is still actively writing to the database when we switch, the plan here, and this does actually work, is to bump the sequence numbers by a reasonable amount. Basically, we measured the average number of writes that happen in a certain window, doubled it, and bumped the sequence numbers by that amount. Any delayed writes that land on the old master while things are switching over just get propagated into that window of sequence numbers, which will never be used by the new master. Once we do that, we switch the DNS records, wait for the old databases to drain, and let DMS finish. So that's the plan.

The first lesson is max_replication_slots. You actually need to increase this. If you're already doing replication, it's likely that all your slots are being used by your current replication process; if you're not doing replication, you need to read about this setting and see what it does. It needs to be increased by the number of tasks you're going to run. DMS does tasks by schema, so if you only have one schema you technically only need one task, but we have a lot of schemas, so we had to increase it by a lot. The next one is max_wal_senders. Again, if you're already doing replication, you'll just have to increase this to accommodate sending to another sink; if it's set to zero because you're not doing any replication, you need to increase it. And wal_sender_timeout: we actually set that to zero on ours, because we wanted to make sure it was consistent. Then there are some changes in pg_hba.conf that you need to make, which are sketched along with the other settings below.
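Putting those pieces together, here's a minimal sketch of what those settings look like. The slot and sender counts, the role name, and the IP address are illustrative, not our production values:

```
# postgresql.conf on the source (EC2) instance
max_replication_slots = 10   # increase by the number of DMS tasks you'll run
max_wal_senders = 10         # one more WAL sender per additional consumer
wal_sender_timeout = 0       # we set this to zero to keep it consistent

# pg_hba.conf: allow the DMS replication instance to connect;
# the role name and address here are made up for this sketch
host  replication  migrator  10.0.1.25/32  md5
host  all          migrator  10.0.1.25/32  md5
```

The connecting role also needs the replication attribute (e.g. ALTER ROLE migrator WITH REPLICATION).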
If your system is locked down, you're likely not going to have access open to just any instance you create on EC2, which includes a DMS instance, so you need to allow for that. What we put into ours is essentially what the sketch above shows: host, replication, the user, and the actual IP address. And, for whatever reason, the user (the superuser, in our case) obviously needs the replication permission among its privileges on the database.

So we did that, and then we ran into an issue, which was a nice log message basically telling you that your data is being truncated. That, I thought, was a little bit more than a warning. And then the next one happened, which was a foreign key error. On the first test we ran, we didn't follow the recommendation, which is to have DMS migrate the schema for you. We decided to bring up a new RDS instance, install the schema, and then tell DMS to migrate into that. The problem is that your schema likely has lots of foreign key constraints, likely has indexes, likely has a lot of things in it that the initial bulk load of DMS is not going to play nicely with. And the first test also failed because text is considered a CLOB type by DMS. So don't load your entire schema: either let DMS create it, or load just the basic tables.

The second test we did was full LOB mode, but it was slow. Like, really slow. We let it finish, and it took nine days, almost ten. This was 100 million rows, about 50 gigabytes; you'll get the comparative times in a minute. The third test seemed to work: we ran it with LOB truncation, but set the truncation limit at double the maximum octet length of anything in the database, and that finished in two hours and eight minutes. Much, much different.

But did it work? Well, we thought it worked, and we actually got to the point of switching everything. We checked, and everything seemed to indicate it did: all our metrics, all our data consistency measurements. We switched our production readers over to use it, and everything looked fine. Writes were still going to the old master, so DMS was responsible for migrating those changes over to the RDS instance, but that was really it; all of the reads were happening off the RDS instance. So, continue to migrate.

Unfortunately, we had corrupted records. This was weird, and we still don't really have an explanation for it. The DMS guys have no idea how it happened either, so we're all still wondering about this one. It only happened after the initial load, so it's something about the ongoing-replication part of the service that is messing this up. That's a bit of a hint, but I don't really have the answer, so it's not really fair.

Other things we learned: functions are not migrated. That might be problematic for you; it is for us. Our entire service is written as if the database is basically an RPC server: the only access any of our apps have into the database is by function. So if the functions aren't there, it kind of doesn't work. The data's there, it's great, but you can't use it. Indexes are not migrated. That's a good thing, but it's also a really good thing to know ahead of time, because if you try to use your database without its indexes, things are not going to go so well; you need to know you have to recreate them. Constraints are not migrated either.
Again, that's for the initial bulk load of data, because DMS wants to be able to load one table at a time, and if the ordering matters and there are interdependencies, resolving that would be stupidly complicated. So they're not migrated, likely to facilitate the bulk loading, but it could be done afterwards; that could be something DMS does for you. So basically what you get is just the basic table layout. That's it. The tables, the columns, the types are all right, but none of the interrelational parts (the relational part of the relational database) is really there. Your data will be, though.

So what we did is use pg_dump to get all those pieces that were missing. Specifically, we ran pg_dump -s out to a file, which dumps just the schema, none of the data. We edited it quite a bit to remove all the table creation and any of the sequence updates, because of course we didn't want the new instance doing any sequence stuff until later. Then we ran the DMS full load with ongoing changes, which imports all of the data and then starts migrating changes as they happen, and when the full load is complete, we load that edited file into the database. Just as an aside: when two people do this on different hosts, and one person edits the file and the other doesn't, and the host that didn't get the edits loads the file, it doesn't work so great. I said earlier this was the third test, but it's actually the fourth. What that gets you: your data is loaded, ongoing changes are migrated, your functions are now in place, your indexes are recreated, your constraints are back, and you have a mostly functional database. Except for the sequences, everything is there.

So how do you do the sequences? Just before I get to that: indexes take a long time to build, so adjust your console timeout accordingly. Some kind of ASCII progress meter for index creation would have saved our first run of it. And PCI is fun: if you don't know, console timeouts for PCI need to be set, so we actually had to write down that we needed to disable them during this process.

The sequence update itself is something like the sketch below. This is what we did to generate the SQL command we would need to run on the RDS host: we ran it on the EC2 installation of Postgres to produce the command to run on the other host. Select the max, add 10,000, and build the text of the call from it. I don't know why you can't do this inline, but you can't; you have to select it into a variable. It's very strange.

One last test we did, this time against the right instance type, took 29 minutes for the initial load, not the two hours and eight minutes it took before. We suspect the IOPS on the destination made the difference, which makes sense when you consider that the first four tests were going to a db.m3.xlarge, and the actual target was a db.r3.2xlarge, which has a significant bump in IOPS. But it still started to corrupt the data after the initial load. We thought maybe it was an anomaly, but it obviously wasn't; it kept happening every time. And we didn't want to run long-term without full LOB mode.
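Going back to the sequence bump for a second, here's a minimal sketch of that generate-then-run step. The table name, id column, sequence name, and the 10,000 headroom figure are illustrative:

```sql
-- Run on the old master (EC2) to generate the command text; the value has to
-- be computed here but executed on the RDS host, so we emit it as a string.
SELECT format('SELECT setval(%L, %s, true);',
              'milestones_id_seq',
              max(id) + 10000) AS run_me_on_rds
FROM milestones;

-- Example output, which you then run on the new master:
--   SELECT setval('milestones_id_seq', 100010000, true);
```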
So again, we were truncating the LOBs at a certain point, double the longest value we had in the database, but of course we don't know what's coming, right? We're logging everything that is sent to us, as a request or as a response to one of our requests out to a vendor, and we don't know if some idiot is going to respond with an inline image. Which happens, unfortunately. So we know the length of the longest existing record, but we don't know what's coming.

So what do I mean by corruption? This kind of thing. By the way, this column was created before the json type, before the jsonb type, so it's just a string representing JSON. And yes, the only difference there is that there is a single quote at the beginning of the second one. There are other forms of corruption where it truncates in a weird spot: just for example, we set the truncation at 200 kilobytes, and we would have records show up truncated at 20. The data was 25K, but it was truncated at 20K, and we have no idea why. That's just something interesting. This is actually what triggered us to find out this was happening: the people doing our fraud screening were looking at this data, the admin UI was picking up on it, and basically an intermediate node was saying, that's not valid JSON, I can't send that back to the UI. Yes, it was only the text; only the LOB data types were being corrupted. And yes, it's only the new data. What's interesting is that we did an initial load with this data and that all worked, but something got corrupted after that point. One of these records was corrupted in migration, but then we tried again, which included that set, and all of it worked fine, and then the corruption happened after that. Then we did another test, and it obviously had nothing to do with the source data itself. It's very, very strange.

So next, we said, okay, we'll put that on pause for a minute and try the important database. This is not the write-only database; this is the database that has basically the entire company in it: about 75 interrelated tables of live transactions, order details, payment details, fraud, foreign exchange quotes, remittance data, all that kind of stuff. The normal evolved mess that you have in any company after five years: you start out with a really nice schema, and then life happens.

Just a minor detail we ran into here: C. We had this function forever as part of our normal EC2 instance. What it is, really, is just a function rename: it refers to one function inside a library that happens to be written in C. But because RDS considers C an unsafe language, which I understand, because you can get it to do anything you want, this is not allowed. Now, I've debated with them whether it should be allowed to reference part of a library that's actually allowed as an extension in RDS, but it's hard. So we ended up replacing it, which is unfortunate, because even though it is the exact same function (it was uuid_generate before, out of the uuid-ossp extension), the replacement is 50% slower, because it's running as PL/pgSQL instead of just referring to the C function.
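The replacement looked something like this minimal sketch. The exact definitions aren't on the slide, so the wrapper name and the library reference in the comment are illustrative:

```sql
-- The old EC2 definition was effectively just a rename of a C function,
-- something like this, which RDS rejects because LANGUAGE c is not allowed:
--   CREATE FUNCTION uuid_generate() RETURNS uuid
--   AS 'uuid-ossp', 'uuid_generate_v4' LANGUAGE c;

-- The RDS-safe replacement: a PL/pgSQL wrapper over a supported extension.
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE OR REPLACE FUNCTION uuid_generate() RETURNS uuid
LANGUAGE plpgsql AS $$
BEGIN
  RETURN uuid_generate_v4();  -- same underlying function, now called via PL/pgSQL
END
$$;
```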
But that's okay, because you should remove the uuid-ossp extension from your life anyway and replace it with pgcrypto's implementation, gen_random_uuid(), which is more than twice as fast as the other one. So it makes up for it, though it's still half as fast as it could be if it just referred directly to pgcrypto's C implementation of the UUID generator. All of that just so we could type uuid_generate instead of pgcrypto's gen_random_uuid. But that's a tangent; just don't use that other library.

Then we came to another problem, and this time it was kind of worse. hstore is not on the list of supported DMS data types, and in case you're interested, jsonb isn't either. Which I kind of understand, but hstore has been around for a long time. It has to do with the fact that DMS is trying to take the representation of that data and ship it, in a transparent way, into another database which may not have that mapping. Remember, DMS is not strictly for migrating from database type X to database type X; it's a multi-database mapping technology, which is really cool. So I understand why it doesn't have this, but it would be really awesome if it did, because then we could use it.

Just another tangent. This is a message I sent while we were figuring out what to do next: we had used up quite a bit of our disk space. Because while you're figuring all this stuff out, Postgres is still working. It's still there in the background making sure that when the replication endpoint comes back up, it can ship all the WAL over and keep things consistent. That set off alarms in our system about disk space filling up. So don't do that: if you're going to turn off replication while you figure out what to do next, turn it off completely and drop the slot, as in the sketch below.

So we didn't get much further, because we use some Postgres-specific types like hstore, and our next release of the database actually uses jsonb, so we'd be even worse off there, even though both of those types are supported by RDS itself. There's no problem with putting our installation in RDS; it's purely a problem with migrating the data. We use hstore in about five different tables, DMS doesn't like it yet, and we don't really have any other clever schemes to deal with that. It's possible we could do something there, but we stopped.

So, some recommendations for the RDS team. Support all the Postgres data types; harder than it probably sounds. But if you can't do them all, some kind of warning or error at the very beginning of the process would be really awesome. There's the schema conversion tool (I can't remember exactly what it's called), and there's the connection test you can run when setting up your DMS tasks; those check basic connectivity and load up which schemas are there. If they could also scan the tables and check that all the data types are supported, so you know you're actually going to be able to ship that data over, that would be great. Fix the full LOB migration. I know it's in progress, but it should not take 120 times longer than truncating, especially when 99.9% of your data is below the chunk size. I realize the implementation is probably wacky, but it's definitely something that might have contributed to the data corruption. We're not really sure; we didn't have time to test that, but it's something we'd be very interested in testing if it didn't take ten days to get the initial load in.
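On the "turn it off completely" lesson: the disk usage comes from the logical replication slot pinning WAL on the source. Here's a sketch of checking for and dropping a leftover slot, using the pre-9.6 function names that match the versions we were running; the slot name is made up:

```sql
-- See which slots exist and how much WAL each one is forcing the source to keep.
SELECT slot_name, active,
       pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)
         AS retained_bytes
FROM pg_replication_slots;

-- Once you're sure the task is abandoned, drop its slot so WAL can be recycled.
SELECT pg_drop_replication_slot('dms_task_slot');
```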
Another one is DMS instance types. We really have no idea what the difference is. There's no guidance on this, no indication of what's impacted by the selection. For example, the instance storage: we couldn't really figure out what that was for. DMS is just shipping data from one side to the other, so I have an option to choose storage under the DMS instance, but I don't know what it impacts when I'm selecting it. Figure out the data corruption; I can send you any data you want if you need help. No idea why it would happen. There's nothing special we're doing, it's just text. Design for novice users: a key motivator for us going to RDS (not DMS, RDS) is that we don't want to take care of the minutiae of the low-level details of administering Postgres. We want to offload that, so we are by definition not experts. Also, I have no idea why it couldn't resolve an internal name, but it couldn't; we had to use IP addresses for the source and the destination. Fixing that would help. Just one more: the status bar. It currently indicates the percentage of tables migrated. If you have 80 gigabytes of data in three tables, what happens is you sit at zero for a while, then at 33% for a while, then at 66% for a long while, because one of the tables is much bigger than the others. Some better method there would be good.

So, are we done? No. Can we use DMS? Not yet. Can we use RDS? Yes. But we want LISTEN/NOTIFY to work soon, because that's something we can't move over right now: we have another database that we know we can't move because LISTEN/NOTIFY isn't supported. It would be good if that worked.

So what now? We go back to old-school methods, which is very close to what we were trying to do with DMS, but shouldn't have any of the corruption problems. For the WORM data we're going to do a manual replication: bump the sequences, update the DNS, backfill the updates. Basically exactly what DMS was going to do, except the backfilling that DMS would have handled with its ongoing change migration, we do ourselves with a select and an apply. The payment database is a little harder. We're going to backfill as much as possible, because there's a lot of historical data in there that doesn't change, but then we're going to have to stop everything, dump and restore, and eat into our uptime budget. So that kind of sucks. But that's it. Unless anybody has any better ideas, that is the end of my talk.

Yeah, 9.5.2, sorry. The source and destination were both 9.4; I can't remember the minor version. The hstore thing, I don't know, it's just not there; the documentation just says these are the types we support, and it doesn't give any guidance on why the others aren't. Yeah, that's right. DMS doesn't actually care where your on-premise installation is, as long as it's a normal Postgres installation. You can't currently go from an RDS installation to RDS with ongoing replication changes using DMS. No, not yet, but it will be possible, yeah.

Yeah, sure. So basically it's a select. The backfill of the updates on the WORM data: records go in and get a sequence number, so that's just an ever-increasing number. We select everything and move it over to the other side, and then we can select the max ID of those three tables on the migrated side, select everything greater than that on the still-master, dump that, and restore it. It's a pain; that's why we wanted to use DMS, because it would do it for us.
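A minimal sketch of that select-and-apply backfill; the events table, its id column, and the file name are all illustrative:

```sql
-- On the new master: find the high-water mark that made it across.
SELECT max(id) FROM events;          -- say this returns 41999000

-- On the old master, from psql: dump only the rows written after that point.
\copy (SELECT * FROM events WHERE id > 41999000) TO 'events_tail.csv' CSV

-- Back on the new master: apply the tail.
\copy events FROM 'events_tail.csv' CSV
```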
Three weeks ago. Three weeks ago, yeah. No, and I will qualify this: this does work, awesomely. Even on the WORM data it would have worked perfectly if it wasn't for that one data column we had. I was absolutely amazed at how good a job it was doing. It just worked, right? And then we found out it had corrupted something. So if that small piece can be fixed, I think it would be amazing. That's the column it was corrupting, yes. It's on the fix list, right? Well, again, I just want to go back to the slide where I show the corruption. That data column is of type text; it's JSON inside. That's basically because we wrote this on PG 9.1, which didn't have the json type and certainly didn't have jsonb. If we were doing it today we would use jsonb there, but that didn't exist. But I don't know; all I know is that the specific text we had was not copied exactly as it should have been, and I don't know what property of the text made that happen. What types looked good? Well, this database didn't have any other text types, so I can't answer that. We don't use varchar anywhere, because text is basically the same thing in Postgres. We have some fixed-length char values that we use, but no varchar; anything variable-length, we use text.

So on the WORM database, the timeline database, this one here: we're going to do basically a dump and restore, that's the first step. Once that's finished, we bump the sequence numbers, so that past this point there's a window where anything still writing to the old database writes into that window of sequence numbers. At the same time, we update the DNS, so all the applications start moving over and writing above the bumped sequence value. Once we see that all the writes are gone from the old master, we take a dump of the new records, anything inside that window, and move it over. But you're right that anything reading from the new side won't yet have what's in that window; that's acceptable, yeah. It's not acceptable in the second database, though, which is why we need to stop everything, do the dump and restore, and do it that way. Which, again, sucks. It was not the plan. The plan was to use DMS for both of them; the payment database was going to need a somewhat different strategy for sequence numbers, but it was still going to work.

Database Migration Service. If I had to do this more than once, maybe, but I only have to do it once, and honestly I'd rather help make the DMS service better. Sorry? No, no, it's part of an extension called uuid-ossp; that's the original one, but the one we're using now is part of pgcrypto. Both of those extensions are supported in AWS RDS; the only piece missing is the ability to reference functions inside a supported extension. No, that was for something else. We use sequence numbers for some things and UUIDs for others: basically, anything an external client is going to see gets a UUID, because we don't give away the internal numbers. Anything else? Any better ideas on what I should do here? What's it called? The D-State?
Oh, right. Yeah, okay. Okay, I'll look into that. Good. Well, thanks.