I'm now going to talk about something that's really special to me, which is the new multi-region support that was recently added in 6.0. It's special not only because I spent the last year and a half painfully implementing it, but also because it removes a really big caveat that FoundationDB has had up until this point. Generally, if you're choosing FoundationDB as a user, one of the reasons is data safety and availability: you want your database to always be alive and to never lose your data. The fact that FoundationDB has basically been designed to work within one data center has been that caveat. It's great if all the machines are co-located, but you want to be safe against losing a region. The 6.0 release removes that caveat, and so I'm really happy to be here and share this new feature with you.

I've already introduced myself, and I've already done a lot of motivation, but the obvious reason you want multi-region support is that you want to remain available when a region is out.
The other reason, which is also important for some people and some applications, is that you may want to serve reads locally from a lot of different regions, and that can make a big difference for some applications.

So, coming to this problem, we had FoundationDB and we were looking at what we could do with multiple regions. The first thought was to take the same approach that a lot of databases, like SQL databases, use: set up asynchronous replication between your primary region and a completely separate FoundationDB database running in the secondary region. These are completely independent databases, and we're shipping the change log from the first one; agents external to the system, the DR agents, take that data and apply it to the other region in version order.

This approach has a huge flaw, which is that, because the replication is asynchronous, if you instantaneously lose your primary region you lose whatever data hasn't yet been synced to the secondary. That leaves you with a really, really hard choice as an operator, because your database is down, and a region failure is generally not permanent unless something truly catastrophic happened. So you're stuck: do I wait until the primary comes back, or do I choose to lose some data? No one wants to make that choice. If you are losing data, though, here's the sales pitch: it's the best kind of data loss you could have, because you only lose the tail of the mutation log, so you roll back to a consistent point in time. But still, no one wants to make that choice.

The other kind of bad thing about this design comes from the fact that these two databases are completely separate.
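Before leaving the asynchronous approach, the data-loss window it creates can be made concrete with a small Python sketch. This is purely an illustration of async log shipping in version order, not FoundationDB or fdbdr code; all the names here are made up.

```python
# Illustration only: asynchronous DR log shipping, not real FoundationDB code.

class AsyncSecondary:
    """Applies the primary's mutation log in version order, lagging behind it."""

    def __init__(self):
        self.applied = {}          # key -> value
        self.applied_version = 0   # highest version applied so far

    def apply(self, version, key, value):
        assert version > self.applied_version  # DR agents apply in version order
        self.applied[key] = value
        self.applied_version = version

# The primary commits versions 1..5, but only 1..3 ship before it dies.
primary_log = [(1, "a", "1"), (2, "b", "2"), (3, "a", "3"),
               (4, "c", "4"), (5, "b", "5")]
secondary = AsyncSecondary()
for version, key, value in primary_log[:3]:  # versions 4 and 5 never arrive
    secondary.apply(version, key, value)

# Failing over to the secondary rolls back to a consistent point in time:
# only the tail of the mutation log (versions 4 and 5) is lost.
print(secondary.applied_version)  # 3
print(secondary.applied)          # {'a': '3', 'b': '2'}
```

The point of the sketch is the shape of the loss: the secondary is a consistent prefix of the primary's history, so failing over discards only the unsynced tail.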
They don't have the opportunity to cooperate. For example, say we're running double replication on both sides, and in the primary there's no region failure, but we lose both replicas of some data because two machines fail simultaneously. We'd really like to heal from that by grabbing the data from the secondary, but because they're different clusters, there's no ability to do that healing across regions. In practice this means you need to run with more replicas on both sides, which obviously costs you something.

Our first attempt at doing something better was what was called three data center mode in the 5.x releases (it's still there in 6.0). The basic concept was: let's just take FoundationDB and spread its processes across regions. This approach works, but it comes with some pretty big caveats. The way the setup works is that you have three different regions, you put your transaction logs in two of the three, and you have storage replicas across all of them, two in each region for six total. Because the transaction logs are replicating synchronously to multiple regions, we can now survive a region failure with no data loss, so that checks that box. However, the cost is that the system wasn't really designed to handle this configuration.
We sort of jammed it in there by spreading our processes everywhere. Because you're replicating across regions, cross-region network latency is now part of your commit path. That obviously increases your latencies, and for some applications an extra 40 milliseconds on your commits is a big deal.

The other thing is that our data distribution algorithm is really targeted at machine failures, not region failures, and what you want to do in those two cases is dramatically different. If a regular machine fails, you just want to re-replicate, heal from that loss, and put the data somewhere else. If an entire region fails, you're losing one third of your total capacity, and it would be a disaster to try to copy all of the data from all of those machines to other locations. So what we had to do with this design is effectively disable data distribution altogether when a region fails. This means that during a region failure you're in an inherently degraded, fragile state, and an operator needs to drop the failed region immediately, discarding the replicas that are there, so the system can get back to a healthy state. That's pretty painful.

The final thing that's not great about this mode is that it has no awareness of the locality of the machines. I talked in the previous presentation about how the transaction logs and the storage processes have a buddy system, where for a given storage node there's one transaction log shipping
it all its data. Well, in this mode the transaction logs are spread across two of the regions and the storage servers across three, and there's no attempt to match up a storage server with a transaction log in the same region. This explodes the amount of WAN traffic, or cross-region traffic, you have, because any of those six storage replicas could be grabbing its data from a transaction log across the network, and even when the data is originally written to the transaction logs, those writes travel across regions. A given mutation might cross the network maybe eight times here.

So that's where we were at 5.2. We had the asynchronous replication option, with manual failover, the potential for data loss, and two sides that couldn't cooperate. And we had three data center mode, with policy-based replication, high cross-region latencies, high overhead (lots more storage replicas), and inefficient WAN communication.

The goal with the 6.0 release was really to try to combine these approaches into something kind of unique, and I think really special. We're going to do synchronous replication within one database, and use some of the policy-based designs from three data center mode on the transaction logs. And we're going to take advantage of the geographic features that cloud providers give you today: AWS, or whichever provider, already has two different notions of locality.
You have regions, which are generally distinct geographic locations that are far apart from each other, and then within a region you have availability zones, which are usually also distinct geographic locations, but pretty close together. There's a difference here: being farther apart isolates you better from correlated failures, but being close together gives you much lower latency.

The mode we've come up with takes advantage of this difference. We do synchronous replication of the transaction logs to multiple availability zones in one region, and then asynchronous replication of the actual storage data across the regions. What this means is that if we lose the availability zone where we're currently serving writes, we can copy the data from the other availability zone, and it's just the most recent history, the last little bit of the transaction logs' history. We get that shipped to the other region, and then we can just seamlessly and automatically fail over with no data loss. And as long as the two availability zones don't die within, say, 10 to 30 seconds of each other, you're fine. They don't both have to survive for very long, just long enough to copy that last little bit of data.
So we're getting, for the most part, the failure resilience of being in multiple regions, but we're only paying the latency cost of talking to multiple availability zones, and that's really powerful. It's kind of a best-of-both-worlds scenario.

I think I've said most of this already, but to go through it: commits only have to talk to the availability zones within one region, so commit latencies are very quick. And you only need two storage replicas in each region, four total, which is much better than the other scenarios I described. Because it's all one cluster, we can use copies in one region to heal another region, so you can lose both replicas in one region, plus another copy in the other region, and your database is still running just fine. We've also optimized the design to send every mutation across the network exactly one time, and I'll get into how we do that; it's significantly more efficient than the previous implementation.

So it's time to bring back the boxes. If you look inside region one, you'll see a diagram that's very similar to what I described this morning. It's got all the components there, and we're generally accepting commits in that primary region. The main difference in a 6.0 configuration (I'll go through the differences) is that when the proxies write to the transaction logs, they make sure the data is durable in both availability zones inside that region. The second availability zone just has those extra transaction logs, so you're paying a little bit of cost, but not too much, to replicate your data there. Then, once everything has been committed, the second region picks it up.
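For reference, this two-region topology is expressed in 6.0 as a JSON region configuration loaded with `fileconfigure` in fdbcli; the log-only availability zone is called a "satellite" in that configuration. The sketch below builds such a document in Python, but treat it as an illustration: the datacenter IDs are invented, and the exact field names and redundancy modes should be checked against the FoundationDB configuration documentation for your version.

```python
import json

# Sketch of a two-region 6.0 configuration (hypothetical IDs; field names
# per my reading of the configuration docs; verify before use).
region_config = {
    "regions": [
        {   # primary region: one serving AZ plus a log-only "satellite" AZ
            "datacenters": [
                {"id": "dc1", "priority": 1},
                {"id": "dc1b", "priority": 0, "satellite": 1},
            ],
            # commits are made durable on the satellite's transaction logs too
            "satellite_redundancy_mode": "one_satellite_double",
            "satellite_logs": 2,
        },
        {   # remote region: receives data asynchronously via log routers
            "datacenters": [{"id": "dc2", "priority": 0}],
        },
    ]
}

with open("regions.json", "w") as f:
    json.dump(region_config, f, indent=2)

# Then, roughly, in fdbcli:
#   configure usable_regions=2
#   fileconfigure regions.json
```

The `priority` fields determine which region prefers to serve commits; the higher-priority datacenter is the primary, and failover flips which side is active.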
We have this new role called a log router, which is responsible for pulling the mutations across the network between the regions, and it pulls every mutation across exactly one time. The way we accomplish this is that every mutation, when it's committed, is assigned to one of these log routers at random, and that log router is responsible for pulling it across the network. Because the assignment is purely random, the log routers combined have exactly one copy of everything. The transaction logs on the other side then re-index the data for the local storage servers: they pull and combine results from all the log routers and redistribute it.

A lot of changes also went on under the covers that are not in my diagram, related to these transaction logs, because we had to be a lot smarter about the pairing between transaction logs and storage servers, as I mentioned a bit earlier. We only want storage servers to grab their data from the local transaction logs, to prevent cross-WAN traffic.
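The random-assignment idea can be sketched in a few lines of Python. This is a toy model, not the real implementation: it just shows that assigning each mutation to exactly one router means exactly one copy of each mutation crosses the WAN, and that the remote side can still rebuild per-key version order from the combined streams.

```python
import random

# Toy model of log routers: each committed mutation is assigned to exactly
# one of N routers, so the routers jointly pull one copy of everything.
NUM_LOG_ROUTERS = 4
routers = [[] for _ in range(NUM_LOG_ROUTERS)]

mutations = [(version, "key%d" % (version % 3)) for version in range(1, 101)]
for mutation in mutations:
    # purely random assignment, as described above
    routers[random.randrange(NUM_LOG_ROUTERS)].append(mutation)

# Exactly one copy of each mutation crosses the network:
total_pulled = sum(len(router) for router in routers)
print(total_pulled)  # 100, i.e. len(mutations)

# The remote transaction logs combine the routers' streams and "re-index"
# the data per key for the local storage servers, restoring version order.
remote_index = {}
for router in routers:
    for version, key in router:
        remote_index.setdefault(key, []).append(version)
for versions in remote_index.values():
    versions.sort()
```

Compare this with the three data center mode described earlier, where the same mutation could cross the network many times: here the WAN cost is one copy, period, and the fan-out to local storage servers happens entirely inside the remote region.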
So what happens when region one goes down? Well, the first thing to notice is that the other availability zone has survived, and maybe that's temporary. The coordinators are going to detect that the previous cluster controller died, and they're going to pick a new one over in the other region. That cluster controller then spins up the entire system, and the new transaction logs use the log routers to stream the last little bit of data from those last remaining transaction logs across the network. This is the part that hopefully is generally quite quick; I put 30 seconds as an upper bound, but generally it's going to be a lot quicker than that. Once we've streamed all the data, we're completely safe even if we then lose those remaining transaction logs; they're just short-term storage, and the database just continues running seamlessly in region two. You also have the option of having a second availability zone on that side, which might give you better failure properties when failing back to the first side, although configuring one is optional.

On the coordinators: you'll notice there's a third region here that has some coordinators in it. As you'll recall, the coordinators rely on quorum-based logic to provide their failure properties, so we need a majority of them alive. If we want the property of surviving one region failure plus one additional machine failure, then three coordinators in each of three different regions is a nice way to accomplish that: losing a region takes down three of your nine coordinators, an additional machine failure takes you to four, and with nine total you still have a majority.
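The coordinator arithmetic is worth writing out; this little sketch just checks the numbers for the three-by-three layout described above.

```python
# Checking the coordinator math: 3 coordinators in each of 3 regions.
coordinators_per_region = 3
regions = 3
total = coordinators_per_region * regions  # 9 coordinators
majority = total // 2 + 1                  # need 5 alive for quorum

# Worst tolerated failure: a whole region (3) plus one extra machine (1).
failed = coordinators_per_region + 1
alive = total - failed
print(total, majority, alive)  # 9 5 5
print(alive >= majority)       # True: quorum survives
```

Note how tight it is: losing a region plus one machine leaves exactly five of nine alive, the bare majority, which is why the third region exists at all even though it runs no transaction logs or storage.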
So, FoundationDB's 6.0 release has all of this in it and working today: it can fail over to one region and fail back to the other. However, there's still some work to go, and some things we'd really like to add.

The current implementation only supports two regions. I mentioned at the very start that one of the reasons you might want this feature is to replicate data so you can do local reads in lots of different places, and because we only support two regions, it's not quite there yet for that use case. We're hoping to get there.

Also, even though we think the availability-zone-versus-region split is a good trade-off for log replication, you might be super paranoid and not care about a 40 millisecond commit latency, in which case you might want to synchronously replicate the logs across regions instead of the async plan I showed. That will probably come in the next release.

The other big caveat is that there's always the potential that you lose an entire region, in which case, if you lose all of your transaction logs simultaneously, you may want to switch to the other side even if that means data loss. This is equivalent to the fdbdr trade-off from before: a decision no one wants to make, but one we want to let you make in the worst-case scenario where, say, a meteor takes out a region (and there are probably easier ways to lose both availability zones at the same time). That feature already exists in the CLI in 6.0. However, it's not tested to the same standard as the rest of the code base. There are some really rare correctness problems related to the fact that if a machine comes back alive in the region you thought was dead, at the same time you're running the command, it could possibly end up with a corrupt
view of the world. In any case, I wanted to throw that out there, because we're super paranoid about these things. It's probably safe, but be careful.

The last thing to talk about is that write throughput is currently going to be reduced while a region is failed. While a region is down, we enter a degraded performance mode where the transaction logs have to queue up all of the data that's bound for the other region, and currently the transaction logs just haven't been optimized for this use case. What this means is that, just as with three data center mode, if you have a high write bandwidth workload, you're going to have to configure the system to drop that remote region pretty quickly, so that you can flush out the log data being queued up for it.

And replicating to multiple regions adds a new thing to monitor: you really need to pay attention to how far behind one region is from the other, because the amount of data queued up in the log system bound for the other side determines what you have to copy on a region failure. If this lag gets out of hand and one side is an hour behind the other, you're no longer safe against a region failure, because you're going to spend a long time copying that hour of data before you can recover. So it's really important operationally to monitor this lag, and if it gets to more than a few seconds, to figure out what's going on.

That's all I have. Thank you!
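As a footnote on that last operational point, a lag check might look like the sketch below. The status field path (`cluster.datacenter_lag.seconds`) is my assumption about the `status json` output and should be verified against your FoundationDB version; the sample status document is made up, and with a live cluster you'd read the special status key instead.

```python
# Hypothetical lag check. With a live cluster you would fetch status via the
# special key, e.g. status = json.loads(tr[b'\xff\xff/status/json'].value);
# here a made-up status document stands in for that.
import json

LAG_ALERT_SECONDS = 5.0  # "more than a few seconds", per the talk

def region_lag_seconds(status):
    # Field path is an assumption; verify it against your version's status json.
    return status.get("cluster", {}).get("datacenter_lag", {}).get("seconds", 0.0)

def lag_behind(status, threshold=LAG_ALERT_SECONDS):
    """True when the remote region is far enough behind to need attention."""
    return region_lag_seconds(status) > threshold

sample_status = {"cluster": {"datacenter_lag": {"seconds": 12.5}}}
print(lag_behind(sample_status))  # True: over the alert threshold, investigate
```

Whatever the exact field, the thing being measured is the same quantity the talk describes: the backlog queued on the transaction logs for the other side, which is exactly what must be copied before a failover completes.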