All right, good afternoon. I'm John Brownlee. I'm on the FoundationDB SRE team at Apple, and I'm here to talk about how we manage FoundationDB at scale at Apple.

To give you a little context, we've been running FDB in production for a little over three years now. We have hundreds of clusters with over 100,000 total FDB processes, and that footprint has been growing steadily over the years. We know it's going to keep growing in the future. So one of our core concerns is how we can manage a deployment of that size without scaling our operational staffing accordingly. To do that, we have extensive automation and operational tooling designed to make FDB as easy as possible to run. Today, I'm going to go over how we run FDB on Kubernetes as an example of the kinds of challenges people will face running FoundationDB in real-world circumstances.

First, some overall design principles. We run all of our FDB processes through fdbmonitor, a process launcher that ships with FDB, and we run one fdbserver process for each fdbmonitor process. We use ConfigMaps in Kubernetes to provide the monitor conf and the cluster file. ConfigMaps are a really great tool for this because they allow us to update these files dynamically without having to bounce the pod. We also use anti-affinity to get Kubernetes to spread processes across hosts. We have pretty simple anti-affinity constraints, and we trust Kubernetes to do the scheduling properly and give us the diversity we need. We also have a custom sidecar process, which allows us to provide binaries and configuration dynamically. Again, this gives us the properties we need to update FoundationDB's configuration, and even its code, without bouncing the entire pod, which is very costly for us.

I also wanted to go over some general design considerations for running FoundationDB, based on common questions we get. One thing you'll want to do is use different process classes to help isolate roles. This is important not just to distinguish between stateful and stateless processes, which on a platform like Kubernetes are going to have pretty different requirements; it also allows you to customize memory allocations and knobs for these different roles. For the log and stateless processes, you'll want to provision based on how many of those roles you want to recruit: how many logs you need, how many resolvers and proxies. You'll also want to throw in a few spares, because there are failure modes you won't hit until it's way too late if you try to run these things too tight. For instance, if one of your proxies dies, that will initiate a recovery, which causes a few seconds of unavailability. And if you're running exactly the number of proxies the database is configured to accept, then when that proxy comes back, it's going to recover again. If it's flapping, going up and down constantly, you're just going to be unavailable until you can get it back to a stable state. So to help FDB recover onto a stable configuration in these circumstances, we just throw in a few spares, which are pretty cheap for these process types.
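To make the process-class idea concrete, here's a minimal sketch of the kind of monitor conf a single-fdbserver pod might run. The paths, port, and values are illustrative assumptions rather than our actual configuration; the point is that the class line is what pins a process to a role family, and for log and stateless classes you'd run a few more pods than the roles you configure.

```bash
# Illustrative sketch only: paths, port, and values are assumptions, not
# our production layout. Each pod runs fdbmonitor over a conf like this,
# with a single fdbserver section and a class that isolates its role.
cat > /var/dynamic-conf/fdbmonitor.conf <<'EOF'
[general]
restart_delay = 60
cluster_file = /var/dynamic-conf/fdb.cluster

# fdbmonitor expands $ID to the section's port (4500 here).
[fdbserver.4500]
command = /usr/bin/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /var/fdb/data
logdir = /var/fdb/logs
# stateless here; other pods would use storage, log, etc.
class = stateless
EOF
```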
One overall property of FDB that will be surprising to people, and that you'll need to prepare for, is how to handle recoveries. These happen in lots of different circumstances: process failures, exclusions, reconfigurations, bounces. They're a necessary part of running FoundationDB and something that clients need to be able to accept. But if you have multiple recoveries in close succession, it can effectively extend that unavailability window beyond what clients can handle. So you'll need to be careful to avoid this and control how you take deliberate actions that you know are going to cause recoveries.

Next up, I wanted to run through our core operational loop. These are the 10 basic steps we run to manage the FDB cluster lifecycle, and they encompass almost everything you need to do to run an FDB cluster. We design our configurations based on a cluster spec, which specifies the desired state of the cluster, and we have an operational loop capable of reconciling it, taking a variety of different actions as necessary to make sure the end state of the cluster is what was configured. Depending on what's changing, we'll only run a subset of these steps, but this is the overall framework of what we're doing.

First off, I wanted to dig a little deeper into managing binaries and configuration files. In general, you're going to need some kind of dynamic config update process to install the new monitor conf when you're changing the configuration for the processes. Similarly, even though changes to the cluster file generally propagate live to clients without any explicit action, you'll need to update it through an out-of-band process as well to keep things in a consistent state, so that if the processes get restarted in a way that loses their container context, you get the same configuration back.

The most complex part is upgrades, where we have to install the new binaries inside the live container. This is what the sidecar process is for. There's a neat feature in Kubernetes where, if you have a pod that contains multiple containers, you can upgrade one of those containers by changing its image without bouncing or affecting the others at all. So we use this as a beautiful little hack: we upgrade the sidecar, restart just that process, copy the binary in live, then update the monitor conf and bounce the processes onto new binaries that were never in the original image to begin with. This is, we think, the safest way to dynamically inject code into a running process.
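To illustrate the shape of that trick, here's a hedged sketch of a two-container pod. The names, images, and paths are made up for illustration, not our exact pod spec; the key is the shared volume, plus the Kubernetes behavior that changing only the sidecar's image re-creates only that container.

```bash
# Hypothetical pod shape; names, images, and paths are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: fdb-storage-1
spec:
  containers:
  - name: foundationdb
    image: example/foundationdb:6.1.8
    # fdbmonitor reads its conf from the shared volume, so the conf and
    # binaries can change without re-creating this container.
    command: ["/usr/bin/fdbmonitor", "--conffile", "/var/dynamic-conf/fdbmonitor.conf"]
    volumeMounts:
    - name: dynamic-conf
      mountPath: /var/dynamic-conf
  - name: foundationdb-sidecar
    # Changing only this image re-creates only this container; the main
    # container, and the fdbserver processes under it, keep running. The
    # new sidecar copies fresh binaries and conf into the shared volume.
    image: example/foundationdb-sidecar:6.2.10
    volumeMounts:
    - name: dynamic-conf
      mountPath: /var/dynamic-conf
  volumes:
  - name: dynamic-conf
    emptyDir: {}
EOF
```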
Next up, I want to talk a little about adding new processes. One of the great things about expanding FDB is that, unlike a lot of databases, new processes can be bootstrapped and join the cluster just by being given the cluster file. They'll need basic start parameters like their public addresses, but you don't need to configure anything in the database to accept a new process. Just by connecting, the process will join the cluster and start taking on work.

However, when you're creating a cluster from scratch, you still need to seed it with an initial cluster file so the seed processes can talk to each other. The challenge is that the cluster file needs to contain IP addresses, and on a platform like Kubernetes you don't know what the IP addresses are going to be until you've launched the processes, so you're in something of a catch-22. One approach is to launch with a dummy cluster file that won't let the processes form a cluster, and then update it once you have the real IP addresses. Another option is to bootstrap with an empty monitor conf file, in which case fdbmonitor just won't run any processes at all.

Similarly, if you're replicating across, say, physical hosts, you're going to need some way to get that machine information in dynamically. The monitor conf file is just a flat text file that gives the start arguments, so if you have something like the machine name, which needs to come from the environment dynamically, you'll need some additional cleverness to supply it. This is another feature we built into the sidecar: since it's already copying the monitor conf file from the ConfigMap into a volume where fdbmonitor can access it, we also have it do some basic template substitution. That's not a hard thing to write for the limited scope of what you need to substitute into a monitor conf file, but given that this is a common need, it may be something we want to move into FDB as a core feature.

Another tricky bit people often run into: even though the cluster file gets updated automatically when you change coordinators, and even though you can update it out of band as well, if a client process can't write its cluster file, it will report itself as being in a degraded state, which will leave the whole cluster marked as unhealthy until you remediate the problem. So if you're using something like a ConfigMap, you may run into volume permissions that prevent this from working, and you may need to add an indirection layer to make sure you don't get hung up there.

Next up, I wanted to talk a little about how we reconfigure the database. This is really one of the simpler actions: we just run the CLI with the configure command and the new desired configuration, then wait for the cluster to be healthy so we know it has the appropriate fault tolerance and has rolled out any new parts of the replication. In general, we think it's best to do this between adding the new processes and removing the old ones, because that gives you an assurance that whether you're going to a higher replication mode or a lower one, you always have the safest number of processes at the time you reconfigure.

One aspect that's really complicated, unfortunately way too complicated to get into in this format, is changing between data centers. We manage our multi-data-center configs through FDB's region configuration, but the region configuration is one of the areas where you can't change everything at once. If you're mixing changes like adding regions, removing regions, and changing the usable regions, each of those may need to be its own atomic step. This is something we're going to need to add more documentation about, along with guidelines for building a safe process to do it iteratively, but it's only really a problem if you need to run in that kind of multi-region config. For most other changes, you can just run a single configure command.
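As an illustration, a reconfiguration step might look like the following; the replication mode and role counts are made-up examples, not a recommendation.

```bash
# Illustrative only: the replication mode and role counts are assumptions.
# Run this after new processes have joined and before removing old ones,
# then wait for status to report the cluster healthy again.
fdbcli --exec 'configure triple ssd logs=8 proxies=5 resolvers=2'
fdbcli --exec 'status'
```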
Next up, I wanted to go through a few different stages that are all in the area of removing processes. One nice property of this overall architecture is that replacing a host is effectively identical to doing a grow followed by a shrink, which allows us to reuse a lot of that logic and create a much smoother flow. You will need some logic for determining which processes you want to remove, either by having whoever is operating the cluster provide a list, or by doing a shrink operation with logic for making a diverse selection of which processes to keep. Once you've done that, you just need to do an exclusion, which tells FDB to evacuate the data from those processes and also tells them not to serve any other roles, making sure, for instance, that they're not serving as the master. That operation blocks until the database can confirm it's safe to remove the processes. Once you get that confirmation, you can shut them down through the control plane, for instance by deleting the pods in Kubernetes, and then check the cluster status to make sure those processes really are gone. One thing that can happen is that your control plane lies to you and says it shut things down even though it couldn't confirm whether it did or didn't. So it's always best to double-check both the control plane's belief about what's configured to run and the database's belief about what is running. After you've confirmed that, you can use the include command to re-include the processes, which just cleans up the exclusion state and makes sure you're not keeping dead configuration around about processes that don't exist.

In the middle of that flow is where you'll probably need to change coordinators. For instance, if you're replacing a bad host and that host has a coordinator on it, you'll need to remove it from the coordinator list by selecting a new set, and you'll probably want to do that during the exclusion phase, because that way the database also knows this is an undesired coordinator. When you're starting a new cluster, as I mentioned before, you need to generate an initial cluster file from scratch. After that's done, you just recruit new coordinators by running a coordinator change command whenever a coordinator is unreachable or being removed.

Another thing to note that's very challenging on Kubernetes: if a coordinator changes its IP, it becomes unreachable as a coordinator. No other part of FDB has this constraint. If an ordinary storage server comes up with a new IP, it's fine; it keeps serving its own roles and doesn't even need to redistribute data. But if a coordinator changes its IP, it's a totally different process from the database's standpoint. You can more or less mitigate this in practice, as long as you can control the rate at which it happens, but that's not necessarily tenable in the long term, which is why we have a project to remove this constraint so that you can address things in the cluster file by something other than an IP, such as a DNS name. I believe that's slated for 7.0, and we think it's going to be a key piece of making FDB work reliably on Kubernetes. After you change the coordinators, you can update the cluster file lazily, because any process that was connected to the database will get the new coordinator list automatically and write it to its own cluster file.
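Putting those stages together, a hedged sketch of the removal flow might look like this; the addresses and pod names are illustrative placeholders.

```bash
# Illustrative addresses and pod names. 'exclude' blocks until the data
# has been re-replicated and it is safe to take the processes down.
fdbcli --exec 'exclude 10.1.2.3:4500 10.1.2.4:4500'

# If an excluded process was a coordinator, pick a new set while the
# exclusion is in effect; 'coordinators auto' lets FDB choose one.
fdbcli --exec 'coordinators auto'

# Shut the processes down through the control plane, then verify that
# both Kubernetes and the database agree they are gone.
kubectl delete pod fdb-storage-3 fdb-storage-4
fdbcli --exec 'status'

# Finally, clear the exclusion state so dead configuration doesn't linger.
fdbcli --exec 'include 10.1.2.3:4500 10.1.2.4:4500'
```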
The last step I wanted to go into is bouncing instances. Once we've installed the new monitor conf and validated that it's been properly installed on the machines, we bounce all of the processes through the CLI. After the bounce, you can check the cluster status to make sure all of the processes have come back up; for instance, that there wasn't some kind of problem on a host that prevented them from relaunching. One thing to note: when you're bouncing as part of an upgrade, you probably want to confirm that the clients all have compatible client libraries, and you can do that through the cluster status by checking the connected clients key. Or, if you're willing to be a little patient, this requirement should go away entirely once we have the new RPC layer that Evan talked about earlier today. I think that's really going to be a key piece of making the upgrade story sustainable for people running FoundationDB outside of very narrow, integrated use cases.

I wanted to spend some more time now on our bounce strategy in more detail, because this is one of the most controversial parts of the way we do our operations. When we bounce a cluster, whether it's for an upgrade, a knob change, or any reason whatsoever, we bounce everything at once. Because we're running through fdbmonitor, which is designed for fast restarts, the processes come back up in a matter of seconds, and to the clients of the cluster this is totally transparent: they just get a latency spike that should be within their tolerances, and the transaction retry logic makes sure any in-flight operations eventually complete.

There are a few reasons we do this. One is that when you bounce a log or a stateless process, it initiates a recovery, but if you bounce everything at once, it still initiates just a single recovery. This means your clients get a more compact window in which things are a little weird, rather than having it stretched out over a longer period of time. It also means you don't have different processes running different configurations. That isn't necessarily bad, but it is less well tested, and one of the principles that animates the FDB team is that the safest thing is the thing we can test the most. Even for upgrades, we have simulation tests that can simulate this upgrade process: running a cluster, killing it, and bringing it up on a new version. Any kind of rolling change is much harder to cover with our simulation technology. So this approach gives us additional confidence that changes will behave as we expect.

There is one case where we do a rolling bounce, and that's when we need to upgrade fdbmonitor. For the most part, new FDB releases don't require upgrading fdbmonitor, because it isn't doing much. There are rare occasions where it's necessary, but going forward on Kubernetes we're going to try to make this more consistent and, just for safety's sake, always upgrade fdbmonitor along with the main FDB container, to make sure things converge on the configuration you would get if you were creating a cluster from scratch.
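For reference, the bounce itself is a one-liner. The jq field path here is from the status json document, and the check shown is a simplified stand-in for a real compatibility check, not our actual tooling.

```bash
# Bounce everything at once: the first 'kill' populates fdbcli's list of
# processes, and 'kill all' then restarts every fdbserver. fdbmonitor
# brings them back within seconds.
fdbcli --exec 'kill; kill all'

# Afterwards, confirm all processes came back, and before an upgrade,
# inspect the connected clients to check client library compatibility.
# (Simplified check; assumes jq is available.)
fdbcli --exec 'status'
fdbcli --exec 'status json' | jq '.cluster.clients'
```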
Now, you may still be skeptical of this. You may say it goes against everything you know about distributed systems, and you're not wrong. But what I will say is that this has proven to be an extremely effective strategy at Apple. We have challenges running FDB at Apple; we're not perfect, but this part of the process works very, very well, and it has no negative impact on our uptime or our SLOs.

Furthermore, we're in a relatively good position here, because we can build confidence not by doing a rolling bounce within a cluster, but by moving new configurations and new versions of FDB through a QA-to-prod pipeline. The benefit is that if you're trying to do something with canary processes within a cluster, it's hard to be confident you've found all of the potential problems. FDB is a very heterogeneous system: there are lots of different processes serving lots of different roles, and if you do something screwy with one process, FDB is pretty good at making sure it doesn't harm you too much. So you may accidentally end up in a position where your testing didn't discover anything. Instead, we think it's a lot better to do this through a full QA pipeline where you can run a rigorous set of tests on the new configuration.

On that note, there are also changes which, due to FDB's architecture, are fundamentally impossible to canary. For instance, changing to a new replication mode affects the whole database, and there's no way around that right now. You need an effective QA pipeline to test that, and once you have that pipeline, you're better off just using it for everything, because then you have one test strategy covering a wide variety of potential problems. However, I will also say that part of what influences this is that we do a huge amount of development on FDB and have been running it for years, so we have a lot of confidence in our testing infrastructure. I can totally understand if people outside of Apple don't; it's a pretty extraordinary claim.

So I did want to go into some potential alternatives to this bounce strategy and the pitfalls you'd run into. First, the problems and advantages of rolling bounces. The biggest problem is that if you're doing a minor or major version upgrade, the new processes are going to be protocol-incompatible with the old ones, which means there is absolutely no way to do a rolling bounce. And I'll also say that the work being done on a stable wire protocol only affects the protocol between clients and servers; the server processes will still be protocol-incompatible with each other, so you still won't be able to use a rolling bounce strategy for upgrades. You also need to handle the number of recoveries for the log and stateless processes, which you can somewhat mitigate by waiting between batches of bounces and making sure things are stable before you do the next set; there's a sketch of that pacing below. Another way to approach it, especially relevant on something like Kubernetes, is to bring up a brand-new set of log and stateless processes and then exclude the old ones to force all of those roles onto the new processes. That can also give you a fairly widespread canary that's relatively easy to roll back from. The other thing to be wary of is that if you're rolling out new knobs, setting them heterogeneously across the cluster isn't necessarily safe. There are some knobs we know are totally safe, some we know are totally unsafe, and a large set we just haven't tested. So that's something you'd want to be careful about.
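Here's that pacing sketch. It's a rough illustration under stated assumptions (jq available, batch addresses supplied by you), not tooling we ship; the idea is simply to gate each batch on the cluster reporting a completed recovery.

```bash
# Rough illustration; addresses are placeholders and jq is assumed.
# Wait until the cluster reports a completed recovery before proceeding.
wait_for_stable() {
  until fdbcli --exec 'status json' \
      | jq -e '.cluster.recovery_state.name == "fully_recovered"' >/dev/null; do
    sleep 5
  done
}

# Bounce one batch of processes by address, then wait for stability
# before touching the next batch.
fdbcli --exec 'kill; kill 10.1.2.3:4500 10.1.2.4:4500'
wait_for_stable
fdbcli --exec 'kill; kill 10.1.2.5:4500 10.1.2.6:4500'
wait_for_stable
```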
Another strategy is doing a DR cutover. We have a multi-cluster DR solution that keeps a secondary cluster up to date within a few seconds of the primary. So one thing you could do is bring up a new cluster on the new configuration, set up a DR relationship, do a DR switchover, and make that the new cluster your clients connect to. This also doesn't work across versions right now, but it's a smaller problem space that's likely to be tackled as part of the RPC layer work. It's also very expensive, because you have to copy all of your data to a new cluster, which means it's probably not practical for simple changes. But it could be a good way to do version upgrades in the future, and in concert with a rolling bounce strategy it could provide a full alternative to the kind of all-at-once bounces we do.

In conclusion, what I want to focus on is that, in our minds, a strong story around operations is key to FDB's adoption and success. It's an area where we know there's some weakness in our documentation and in our overall engagement with the community, and that's something we want to make a huge effort on in the coming years. We're also always looking for more opportunities to share internal tooling and things of that nature with the community, and to build a stronger operational ecosystem. In light of that, I'm very excited to announce that for the last year or so, we've been working on a Kubernetes operator for FDB; in fact, that core operational loop is the operation loop of our operator. Later today, we're going to be open-sourcing that operator. I'm so excited to get to share this with the community, because it's the first time we've really been able to get our day-to-day operational tooling into a form where we can share it and get community feedback. I can't wait to see how you all can help shape our Kubernetes solution. Thank you.