Hello, everyone. I'm Marcus from Snowflake Computing, part of the FDB engineering team, and this is David. We are here to talk a bit about high availability within FDB.

This is the outline of the talk. First, I want to motivate our work and explain why we chose the architecture that we implemented. Then I will take a small detour and talk about building distributed systems in general. After that, David will take over and talk about the Snow Cannon architecture, how it is implemented, and how it works.

Snowflake uses FoundationDB as an integral part of the system; it sits at the core of it. If FoundationDB goes down in a region, our service goes down. So we came up with a list of requirements for a high availability / disaster recovery solution. The first one, and by far the most important, is that we never want to lose data: losing FoundationDB data means losing customer data. We want the possibility to have something like a standby, to be able to fail over to that standby, and then also to be able to fail back. The reason is that when an FDB cluster fails, we usually don't lose it completely; more often its performance degrades because of networking issues, lost machines, these kinds of things. If we can fail over to a secondary without having to throw away the old primary, we can fix that old primary and get a runnable secondary back for free, basically. We also want to have multiple secondaries (we are that paranoid), including the possibility to shut down secondaries for software upgrades, migrations to other machines, these kinds of things. We want potentially terabytes of mutations, of changes to our FDB storage, safely stored on disk; we will see a bit later why that is useful. And the whole thing should be as highly available as possible. Correctness is more important than availability, but we still want to optimize for availability as much as we can get away with.

The first solution FDB offered in that space was backups, and that is what we started running with. However, backups are not free. What you see here is one of our production clusters: the sum of all disk operations executed over time, where the red areas mark when a backup process is running. Sadly, the scale doesn't start at zero, but you can see that the number of disk operations roughly doubles as soon as a backup is running. This got better with FDB 6, but the cost is still there. This cost also applies to the disaster recovery solution within FDB, because that one builds on top of the backup mechanism.

So this is a comparison of all the solutions that are available. Snow Cannon, the thing that we built, isn't open sourced yet, but we are in the process of doing that, so hopefully sooner rather than later you will be able to deploy it as well if you choose to. Keep in mind that this comparison is highly skewed toward our requirements; you could come up with other criteria where you would see more crosses for Snow Cannon and more ticks for the other solutions. The main drawback of Snow Cannon is that commit latency goes up slightly, depending on how you deploy the whole thing. But at the same time it doesn't increase load on the primary, we can recover a backup and replay the Snow Cannon logs to get back a new cluster without any data loss, and we can switch over and back, these kinds of things.
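To make that restore-and-replay path concrete, here is a minimal runnable sketch under invented assumptions: restore a backup that is complete up to some version, then replay every buffered mutation committed after that version. The function names, version numbers, and log format are all made up for this example; they are not Snow Cannon's actual interfaces.

```python
# Toy model: recover a cluster with zero data loss by restoring a backup
# and replaying every buffered mutation committed after the backup version.
# Purely illustrative; not the actual Snow Cannon interfaces.

def restore_backup(store: dict) -> int:
    """Pretend to restore a backup into `store`; return its end version."""
    store.update({"k1": "old-value"})
    return 100  # the backup is complete up to version 100

def replay(store: dict, log: list, from_version: int) -> int:
    """Apply all buffered mutations newer than `from_version`, in order."""
    latest = from_version
    for version, key, value in log:
        if version > from_version:
            store[key] = value
            latest = version
    return latest

# The durable queue holds every mutation since before the backup started,
# so the restored cluster can catch up to the exact commit point.
buffered_log = [(90, "k1", "old-value"), (150, "k1", "new"), (200, "k2", "v")]

store: dict = {}
backup_version = restore_backup(store)
latest = replay(store, buffered_log, backup_version)
print(store, "recovered up to version", latest)  # no committed write is lost
```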
The way we implemented this (this is very high level; David will go into more detail) is that we built a second system called Snow Cannon. It is a full cluster of its own, and it basically implements a distributed queuing system; think of it as something like Kafka. The main FDB cluster synchronously streams every transaction it executes, or rather the mutations of those transactions, to Snow Cannon. Snow Cannon persists them to disk and then asynchronously pushes them to a secondary cluster, or even a third or fourth one, however many you want to have. Because the second leg is asynchronous, you can bring the other cluster down for maintenance, upgrades, these kinds of things. (There is a toy model of this pipeline at the end of this section.)

Now to the detour. When we started this project, the idea from the very first piece of code was to build a new distributed system. We had built distributed systems before, and if you build a distributed system, you have to think about failure scenarios and how you are going to handle them at runtime, right? Machines fail, processes fail, disks fail, you get network partitions, you cannot differentiate between network partitions and machine failures, you get message reordering on the wire, and to make your life even more miserable, these things happen at the same time. Because of that, you need a good testing story, and that is a difficult thing. If I had to name the best thing about FoundationDB, I would say it is the testing story.

If you think about FoundationDB, you can either say it is a database, a distributed key-value store, or you can say it is a distributed system that implements several kinds of services (TLogs, masters, resolvers, and so on) and glues everything together. So instead of building a layer on top of it, you can also take this thing as a framework and implement your own service inside it. The amount of code you need to change to do that is surprisingly small; it is actually so small that I managed to fit it on a few slides, and that is exactly what I did.

The first step was to add a new machine class. We basically need to teach FoundationDB that there is now this new service, in this example called Snow Cannon, which is our queuing system. In the next step, we need to tell the cluster controller, the one responsible for recruiting new roles, that this thing exists, so we add a new API call to it. Actually serving that API call is then pretty much copy and paste, something like three lines of code. Then the worker, which is the main role that every process executes, needs to be able to start the new role. What you see here is the code that then actually runs a Snow Cannon, one of these special processes. Finally, we need something that orchestrates everything. A healthy FDB cluster always has exactly one master server, and orchestrating Snow Cannon is pretty cheap, so why not do it there? You could make a different decision; this was ours. Whenever the master finishes a recovery, it simply starts up this actor and is done. So using FDB as the base for a distributed system makes your life much, much easier, and I want to advocate strongly for that.
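The actual changes Marcus shows live in FDB's C++/Flow code, but their overall shape can be modeled in a few lines. Below is a loose, hypothetical Python sketch of that flow (a cluster controller that recruits roles, workers that start whatever role they are handed, and a dispatch table with one new entry for the new service); none of the names correspond to real FDB identifiers.

```python
# Loose model of adding a new role to an FDB-like system: the cluster
# controller decides which workers run which roles, and each worker
# simply starts the actor body for whatever role it is recruited as.
import asyncio

async def snowcannon_core(worker_id: int) -> None:
    # Stand-in for the Snow Cannon actor body: receive mutations, append
    # them to an on-disk log, serve them to the replicator.
    print(f"worker {worker_id}: running snowcannon role")

async def storage_core(worker_id: int) -> None:
    print(f"worker {worker_id}: running storage role")

# The dispatch table is the moral equivalent of the code on the slides:
# one extra entry teaches every process about the new role.
ROLE_MAIN = {"snowcannon": snowcannon_core, "storage": storage_core}

async def worker(worker_id: int, inbox: asyncio.Queue) -> None:
    role = await inbox.get()           # recruitment request arrives
    await ROLE_MAIN[role](worker_id)   # start the matching role actor

async def cluster_controller(inboxes: list) -> None:
    # Recruit one Snow Cannon and make everyone else a storage process.
    await inboxes[0].put("snowcannon")
    for inbox in inboxes[1:]:
        await inbox.put("storage")

async def main() -> None:
    inboxes = [asyncio.Queue() for _ in range(3)]
    await asyncio.gather(
        cluster_controller(inboxes),
        *(worker(i, q) for i, q in enumerate(inboxes)),
    )

asyncio.run(main())
```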
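Stepping back to the overall data flow from the start of this section, here is a runnable toy model of the synchronous-then-asynchronous pipeline: a commit returns only once its mutation sits in the durable queue, while a background replicator drains the queue toward a consumer that is allowed to lag or even be offline for a while. All of it is illustrative, not Snow Cannon code.

```python
# Toy model of the Snow Cannon data flow: the producer appends each
# mutation synchronously to a durable queue (the commit blocks on this),
# while a background replicator batches buffered mutations and pushes
# them onward asynchronously, so the consumer may lag behind.
import queue
import threading
import time

durable_queue: "queue.Queue[tuple[int, str]]" = queue.Queue()
consumer_state: list = []

def commit(version: int, mutation: str) -> None:
    # Synchronous leg: the transaction is only acknowledged once the
    # mutation sits in the queue (standing in for an fsync'd log write).
    durable_queue.put((version, mutation))

def replicator() -> None:
    # Asynchronous leg: batch whatever is buffered and push it onward.
    while True:
        batch = [durable_queue.get()]
        while not durable_queue.empty():
            batch.append(durable_queue.get_nowait())
        time.sleep(0.01)  # the consumer may be slow; the producer doesn't care
        consumer_state.extend(batch)

threading.Thread(target=replicator, daemon=True).start()
for v in range(5):
    commit(v, f"set k{v}")   # returns as soon as the mutation is buffered
time.sleep(0.2)
print(consumer_state)         # eventually contains all five mutations
```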
And I want to say again that the simulator and the testing story are awesome and will make your life so much easier. Being able to run serializable transactions within your services is just icing on the cake.

Thanks, Marcus. So now that we've talked a little bit about Snowflake's requirements for its metadata and our motivation behind building this thing, I want to go a little bit into the architecture behind Snow Cannon. What is Snow Cannon? It is a multi-cluster replication solution built on top of FDB. The producer cluster pushes data synchronously to the Snow Cannon cluster, which buffers it, batches it, and pushes it asynchronously to the consumer cluster. The consumer can act as a standby. The Snow Cannon cluster is responsible for maintaining your replication factor, for making sure that failovers are managed correctly, and for making sure that clients know which cluster to currently talk to.

Let's go into a little more detail. The first thing that has to happen is that the client queries the Snow Cannon for the current producer; the Snow Cannon acts as a client proxy and hands the current producer's interface to the clients. The clients can now begin pushing data to FDB. The proxy on the producer simultaneously pushes transactions to both its own TLog system and to the Snow Cannon log system. A transaction is not acknowledged as committed unless it is on all of the TLog replicas and on a majority of our Snow Cannon logs (there is a sketch of this commit rule at the end of this section). The Snow Cannons buffer up this data, and a replicator actor reads from one of the Snow Cannons, batches the transactions together, and pushes them asynchronously to the consumer. The consumer cluster is in a read-only state, a state we added to the metadata store to ensure that the consumer doesn't accept any transactions except from the Snow Cannon.

We wanted a DR solution, plus full backup and restore, that could work across availability zones or data centers with minimal impact to the customer and with zero data loss. This is a tall order, and to meet it we implemented Snow Cannon to give us the best of both worlds. The synchronous push to the Snow Cannons ensures that we have zero data loss, even when we're restoring a backup. The Snow Cannons were also implemented to be very simple: they are append-only logs that write directly to disk (also sketched below). This means that our writes are very cheap, and they usually outperform the TLogs, which in turn means that we get our replication factor without any additional impact or latency on the commit time.

The Snow Cannons can also give you data center fault tolerance without having to deploy the TLogs or the other nodes in your cluster across different data centers, which can cause a lot of extra latency and some performance degradation. You only have to deploy the Snow Cannons across different data centers. This essentially gives you the same guarantees while adding only a single cross-data-center hop to your commit time. Now your producer cluster could go down due to some terrible disaster, and you can bring up your consumer in a different data center, replay the Snow Cannon log in that same data center, and begin again with zero data loss right where you were before. The Snow Cannons pushing asynchronously to the consumer also grant us a couple of interesting benefits.
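As a rough, hypothetical illustration of that commit rule (all TLog replicas, but only a majority of Snow Cannons), here is a self-contained sketch; the acknowledgement lists stand in for replies from the individual log processes.

```python
# Toy commit rule: a transaction commits only when every TLog replica has
# it AND a majority quorum of Snow Cannon logs has it. Losing a minority
# of Snow Cannons therefore never blocks commits.
def can_commit(tlog_acks: list, snowcannon_acks: list) -> bool:
    all_tlogs = all(tlog_acks)
    quorum = sum(snowcannon_acks) > len(snowcannon_acks) // 2
    return all_tlogs and quorum

# Three TLog replicas, three Snow Cannons:
print(can_commit([True, True, True], [True, True, False]))   # True: quorum holds
print(can_commit([True, True, False], [True, True, True]))   # False: TLogs need all
print(can_commit([True, True, True], [True, False, False]))  # False: no quorum
```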
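And here is what "very simple, append-only, straight to disk" might look like in miniature. The file format and class below are made up for illustration; the point is that each push is a sequential write plus an fsync, which is about the cheapest durable write a disk can offer.

```python
# Minimal append-only mutation log, modeled on the description above:
# sequential appends plus fsync, no B-tree, no random I/O.
import os
import struct
import tempfile

class AppendOnlyLog:
    def __init__(self, path: str) -> None:
        self.f = open(path, "ab", buffering=0)

    def append(self, version: int, mutation: bytes) -> None:
        # Fixed-size header (version, payload length) followed by the payload.
        self.f.write(struct.pack(">QI", version, len(mutation)) + mutation)
        os.fsync(self.f.fileno())  # durable before we acknowledge

log = AppendOnlyLog(os.path.join(tempfile.gettempdir(), "snowcannon_demo.log"))
log.append(101, b"set foo=bar")
log.append(102, b"clear baz")
```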
We designed the Snow Cannon to be able to buffer data indefinitely, which means that the consumer can go down or become unresponsive for hours at a time without any impact on our customer workload. It also means that taking backups, which as Marcus showed has a huge impact, won't affect the customer workload at all, because we can take them on our standby cluster. And it means we can have multiple consumers: one running as a standby, several taking backups at the same time, and you can even have a consumer running test code against actual production-level traffic.

We also implemented Snow Cannon to work on quorum-based logic. Usually the proxy has to commit to all of the TLog replicas before committing its transaction, but it need only commit to a majority of Snow Cannons. This means that we're fault tolerant in the face of node failure: we can lose a Snow Cannon, recruit a new one, and repair holes, all while the data continues unhindered to be replicated to the consumer.

One of the biggest benefits of our architecture is the ability to switch over to a hot standby cluster. Let's say your cluster's performance has degraded because the B-trees have become fragmented, something that we at Snowflake see a lot because of our churn. Or perhaps there's some network partitioning or some other network degradation. Or maybe you just want to do a major version upgrade without bringing your whole cluster down. The Snow Cannon switch can handle all of this for you.

Here's the overview again. For the standby to come up as your primary cluster, it must first have all of the data that is currently on your producer. For this to happen, we must block new transactions on the current producer. We do this by setting it to read-only, which means that new transactions from the clients will simply error out and retry. Now the Snow Cannons are free to finish pushing their data to the consumer. A standby consumer is only about five seconds behind the producer, and because of batching, pushing the remaining data to the standby is an extremely fast operation, not more than a few seconds. We call this the flush. Once the flush is complete and all of the data on the Snow Cannons is on the consumer, we can bring that cluster up as our primary: we set it to read-write, and it can now handle new transactions. But before this can happen, the clients must become aware of the switch as well. So the Snow Cannons inform all of the clients about the switch, about the new cluster interface, and tell them to invalidate their key-location caches, watches, et cetera, so that they can begin again against the new cluster. And voila, the switch is complete: we're now serving data on our new primary, and we're free to do whatever maintenance is required on the original cluster. The switch only takes on the order of seconds or faster, because in-flight write transactions need only wait for the last five seconds of data to be batched and pushed to the standby cluster. (A simplified sketch of this sequence follows below.)
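Here is a heavily simplified, hypothetical rendering of that switchover sequence, with each cluster reduced to a mode flag and a data version. The function names mirror the steps just described, not any real Snow Cannon API.

```python
# Toy switchover: block writes on the old primary, flush the queue tail
# to the standby, promote the standby, then tell clients to re-resolve.
from dataclasses import dataclass, field

@dataclass
class Cluster:
    name: str
    mode: str = "read_write"          # or "read_only"
    version: int = 0

@dataclass
class SnowCannon:
    buffered: list = field(default_factory=list)   # (version, mutation)

    def flush(self, consumer: Cluster) -> None:
        # Push the (at most a few seconds of) remaining tail, in order.
        for version, _mutation in self.buffered:
            consumer.version = max(consumer.version, version)
        self.buffered.clear()

def switch_over(producer: Cluster, consumer: Cluster,
                cannon: SnowCannon, clients: list) -> None:
    producer.mode = "read_only"       # new client transactions error and retry
    cannon.flush(consumer)            # "the flush": consumer catches up fully
    assert consumer.version >= producer.version, "flush must close the gap"
    consumer.mode = "read_write"      # the standby is promoted
    for client in clients:
        client["cluster"] = consumer.name   # caches invalidated, new interface

primary = Cluster("us-east", version=500)
standby = Cluster("us-west", mode="read_only", version=495)
cannon = SnowCannon(buffered=[(498, "m1"), (500, "m2")])
clients = [{"cluster": "us-east"}]
switch_over(primary, standby, cannon, clients)
print(standby, clients)   # standby is read_write at version 500
```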
I also want to talk briefly about the challenges we faced in getting producer cluster recovery to work. When a node goes down and your producer cluster needs to recover, we face some interesting challenges, because the proxy is now pushing simultaneously to two completely separate log systems. It could be that your TLogs are ahead of your Snow Cannons, that the Snow Cannons are ahead of the TLogs, or some combination of both. It is up to the producer to coordinate the recovery and make sure that both log systems begin again at the same point in the data.

So the first thing that happens is that the new master in the cluster must choose a point in the data at which to begin the recovery. This is the maximum version found across all of the old TLog replicas. It takes the max because, as Evan said, a transaction is only committed if it is on all of the replicas. In this case, that is version 400. The master now recruits the new TLogs and tells them to recover from version 400. It must also tell any Snow Cannons that are behind to recover to this version. So if there is a Snow Cannon that is behind, in this case one at version 300, it must stream from the old log system until it contains this version, which we call the last epoch end version. Once all the Snow Cannons and the new TLog system contain the last epoch end version, the producer can continue its recovery. It does this by sending a recovery transaction to its log system and to the Snow Cannon logs. This works like any other transaction, except that it also pushes the version up by 100 million. It does this to make sure that the old log system and the Snow Cannons don't accept any data from old proxies; it could be that there is a proxy from the old cluster that, due to network partitioning or some other reason, is not aware that it belongs to an old generation of the cluster. This recovery transaction has to be a blocking call for the Snow Cannon, because it could also be that the Snow Cannon is ahead of this version. So when a Snow Cannon receives the recovery transaction, it must first roll back to the last epoch end version, and only then can it apply the recovery transaction. Now both log systems are synchronized and can begin the new epoch at the same version. (A small model of this coordination follows below.)

There were plenty of other challenges we had to face in order to meet all of the requirements Snowflake had for backing up and having disaster recovery for its metadata store, and a lot of interesting problems we had to solve, but this is all we had time for. We'd like to take any questions, either offline or now if there are any. I don't think we have time for questions now, but feel free to ask us afterwards.
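And here is an illustrative model of that recovery coordination, following the same steps the example walks through: pick the recovery version as the max across TLog replicas, catch up the behind logs, roll back the ahead logs, then jump the version forward by 100 million. The version numbers come from the example above; the code itself is hypothetical.

```python
# Toy recovery coordination between the TLog system and the Snow Cannons.
VERSION_JUMP = 100_000_000   # keeps old-generation proxies' commits invalid

def recover(tlog_versions: list, snowcannon_versions: list) -> int:
    # 1. The new master picks the max version across the old TLog replicas:
    #    per the talk's reasoning, anything committed is on ALL replicas.
    last_epoch_end = max(tlog_versions)                     # e.g. 400

    for i, v in enumerate(snowcannon_versions):
        if v < last_epoch_end:
            # 2. A behind Snow Cannon streams from the old log system
            #    until it holds the last epoch end version.
            snowcannon_versions[i] = last_epoch_end
        elif v > last_epoch_end:
            # 3. An ahead Snow Cannon rolls back: that extra data was
            #    never on all TLog replicas, so it was never committed.
            snowcannon_versions[i] = last_epoch_end

    # 4. The recovery transaction bumps everyone far past any version an
    #    old-generation proxy could still try to commit at.
    return last_epoch_end + VERSION_JUMP

new_version = recover(tlog_versions=[400, 380, 400],
                      snowcannon_versions=[300, 450, 400])
print(new_version)   # 100000400: both log systems restart the new epoch here
```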