I'm Evan Tschannen. I managed the development of the FoundationDB key-value store at Apple, and I'm going to spend the next 40 minutes taking you through the architecture of FoundationDB. I was first taken through this architecture eight years ago, although back then it existed as some hand-drawn diagrams on a piece of graph paper. What's pretty remarkable is that it took around three years of development before we made any significant changes to the database that weren't on that original sketch, and a lot of what I'm going to talk about today would be there if you found that graph paper now.

To focus this presentation a little bit, before I get into it I want to highlight some of FoundationDB's strengths, so that as I'm going through the architecture I can explain how FoundationDB provides these properties. The first is that FoundationDB is operationally very easy to use. Specifically, it does a lot for you in terms of load balancing traffic across the different machines in a cluster, and it also does a lot of self-healing and responding to failures. I'll come back to both of those points as I go through the architecture. The other point comes back to something Ben mentioned earlier, which is that we provide scalability without sacrificing performance or consistency. It's easy to be skeptical about a claim like that because it sounds too good to be true, so I want to be really specific about what we mean by it, and hopefully I can convince you all that we actually do it.

It's a little easier to talk about not giving up consistency for scalability. FoundationDB gives you ACID transactions. There's a single global key space, and inside a transaction you can read or write any key across the whole database. FoundationDB gives you strict serializability and external consistency, which I'll get into a little later.

It's harder to quantify what it means not to sacrifice performance for scalability, because there are a lot of dimensions to performance. My approach is to think about FoundationDB as organizing a lot of individual, non-scalable key-value stores into a single cohesive unit. The theoretical perfect performance of the system is then: if you dedicated all of your hardware to these individual key-value stores, what would their aggregate throughput be? When we compare FoundationDB against that model, we achieve roughly 90% of that theoretical perfect scalability, and hopefully I'll be able to convince you of that as we go through it. In terms of latencies, our read latencies are basically as good as you could possibly get: a single hop from a client directly to a server that will serve the response for that read. Writes are a little bit worse. Getting a write successfully committed to the database takes about four hops through the system, which in practice means about four or five milliseconds of latency on commits. It's important to note that this whole argument about performance is about the architecture, which is what I'm going through; it says nothing about our actual measured performance. In fact, a lot of the projects we're working on today, like the Redwood storage engine, are going to significantly improve the performance of the system as a whole.
But it does show that with a perfect implementation of FoundationDB, the sky's the limit.

So now I'm finally ready to get into it, and I'm going to start with a little bit of distributed databases 101. In this diagram we have three servers, a single writer trying to push data into the system, and a single reader trying to get data out of the system, and we have the requirement that the system must be resilient to the failure of one machine. The traditional way to do this is quorum logic: the writer, when it's writing data, makes sure the data gets to two of the three servers, and the reader reads from all of the locations and only succeeds when it can talk to two of the three servers. That guarantees a read sees every piece of data that made it into the system. And if one of the servers fails, the system just keeps going naturally, because we only needed two out of three responses on both sides. Basically, this system has implicit failure handling; it will handle failures naturally and gracefully.

The core principle behind FoundationDB is that if we don't actually need the implicit failure handling I just described, there's a lot of benefit to be gained by doing things slightly differently. In this case, since the writer doesn't have to handle failures, it can write to all three of the servers, and that gives us two huge performance wins, one obvious and one a little more subtle. The obvious one is that because the data is now at all of the servers, the reader no longer has to talk to all of them to do a read. It can go to just one server to get a response, because they all have all of the data and there's no quorum to combine. Our reader is now three times more efficient than it was in the previous diagram. The more subtle point is that when we do handle failures in this system, we'll do a much better job of it. Quorum logic with three servers gave us resilience to the loss of one single machine; here, because the data is making it to all three servers, we can actually lose two of our three servers and not lose any data. So we're much more resilient to failures.

Then the question obviously becomes: how do you handle failures? The FoundationDB answer, and one of the key innovations behind the design, is that we have an entirely separate database that stores metadata about our primary database. This metadata database holds the membership of the servers in the primary database. In this case, servers A, B, and C were responsible for committing data at version zero when we started up the database. At some point in time server A died, we replaced it with server D, we wrote that to the metadata database, and now we're doing our commits to B, C, and D. This other database can use the quorum logic and implicit failure handling we talked about before.
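To make that concrete, here's a tiny toy sketch in Python of the two ideas on this slide: writes go to every replica in the current team, reads can be served by any single replica, and a separate, rarely written metadata store records which servers make up the team. This is not FoundationDB code, and all the names are made up for illustration.

```python
# Toy sketch of "write to all, read from one" with membership held in a
# separate metadata store. Not FoundationDB code.

class MetadataStore:
    """Stands in for a small quorum-replicated store that only records
    which servers own the primary database, starting at each version."""
    def __init__(self, initial_team):
        self.epochs = [(0, list(initial_team))]   # [(start_version, [servers])]

    def current_team(self):
        return self.epochs[-1][1]

    def replace_server(self, version, dead, replacement):
        # Only written when something fails; in the real system the new server
        # would also be backfilled with the data it is now responsible for.
        team = [replacement if s is dead else s for s in self.current_team()]
        self.epochs.append((version, team))

class Replica:
    def __init__(self, name):
        self.name, self.data = name, {}

def write(metadata, key, value):
    # No quorum: the write must reach *every* server in the current team.
    for server in metadata.current_team():
        server.data[key] = value

def read(metadata, key):
    # Any single replica has all the data, so one hop is enough.
    return metadata.current_team()[0].data.get(key)

a, b, c, d = Replica("A"), Replica("B"), Replica("C"), Replica("D")
meta = MetadataStore([a, b, c])
write(meta, "k", "v1")
meta.replace_server(version=100, dead=a, replacement=d)   # A died; D takes over
write(meta, "k", "v2")
print(read(meta, "k"))                                     # -> "v2"
```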
And because it's a very low-traffic database, you only ever write to it when there are failures in the primary database.

FoundationDB took this basic idea and turned it into a system, so now I'll start adding boxes. This diagram shows all of the stateful components of FoundationDB. The coordinators are the other database I talked about, the metadata database, which uses Paxos and quorum logic to do its writes. A fun trivia fact: in the first version of FoundationDB this literally was a separate database, we used Apache ZooKeeper to hold the metadata, and it was replaced with our own implementation about a year later.

For the other stateful systems, we have the transaction logs. This is a distributed write-ahead log that's responsible for accepting commits and writes into the system. Its job is basically to get stuff durable on disk as fast as possible so that we can report a successful commit back to the client. It's a write-once, read-never data structure: mutations are only held there transiently. They come in, we append them to the end of the file, and once the storage servers have the data we quickly get rid of it. So the logs generally use very little storage, and they're also super efficient because they have such a simple job.

The storage servers come back to the point I was making about performance and scalability. They are 90% of our system, and they're basically individual key-value stores that we allow to cooperate and act as one single big key-value store. Each one holds data for long-term storage, which it gets from the transaction logs, and serves read requests coming in from the users. The entire system is designed around making these guys' job as easy as possible. Because they're the bulk of the system, the transaction logs are set up so that the storage servers have to do as little work as possible when they're ingesting writes, and the way clients do their reads is likewise designed to put as little extra work on the storage servers as possible. To that end, when the transaction logs send data to the storage servers, every storage server has a buddy transaction log, and that storage server gets all the data it's responsible for from that one location. It's getting an exact stream of writes specifically designated for that one location. We'll talk a little about how this happens, but we do it for efficiency's sake: because each storage server is only talking to one other server, it's super efficient.

Now we can start building on top of this and add the stateless components to the system. The first one I'll talk about is the cluster controller. The cluster controller is a leader elected by the coordinators, and its job is to organize all of the processes in the cluster into the full system. When every process starts up, it talks to the coordinators to figure out who the cluster controller is, and then it registers itself, and whatever information it knows about itself, with that cluster controller.
So it's going to say: I have a disk, I prefer to be a storage server, and here's my information. The cluster controller then takes all of the workers that have registered with the system and starts assigning them roles: you become a transaction log, you become a storage server. In addition to handing out roles, the cluster controller is also doing failure monitoring, so it's continually talking to these processes to determine whether they've failed. We'll get into failures a little later.

Before I can add more boxes, and trust me, there are more boxes, we have to take a little break to talk about how FoundationDB provides consistency. I mentioned strict serializability before. To provide that, we first need serializability, and in FoundationDB serializability is explicit. Every single commit you do is given a version number, and you can observe the serializability just by comparing version numbers: you know one commit happened after another if its commit version is higher. It's right there for you to see.

It gets a little more complicated when you want strict serializability. What we need is for it to be as if your whole transaction happened at the instantaneous moment in time when it was committed. The FoundationDB approach is that when you start a transaction, you get a read version, which is basically the latest commit version the system has previously committed. Once you have that version, all of your reads for the rest of the transaction are stuck at that moment in time in the past. The data you're reading from the database is going to be a little bit old; it won't see the newer writes that come in, it's just stuck at that consistent point in time. Then when you finally go to write, the idea is that if none of the keys you read changed in the interval between when you started your transaction and when you eventually committed, then it's as if you did all your reads at the final commit version, as if everything happened at a single instantaneous point in time. That's the optimistic concurrency model of FoundationDB. As a consequence, if someone changes one of the keys you read in that short window between your read version and your commit version, we fail your transaction, and as a client you retry. The layers team has to deal with this a lot, and you'll hear a talk later by Alec: it's something you have to be very conscious of in FoundationDB and design around.
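Here's what that optimistic model looks like from the application's side, a minimal sketch using the Python bindings. The key names and the API version here are just placeholders; the point is the retry loop the `@fdb.transactional` decorator gives you.

```python
import fdb

fdb.api_version(630)   # placeholder; use the API version your cluster supports
db = fdb.open()        # finds the cluster through the usual cluster file

@fdb.transactional
def reserve_seat(tr, seat_key, attendee):
    # Every read in this function is served at a single read version.
    if tr[seat_key].present():      # the seat was already taken at our read version
        return False
    tr[seat_key] = attendee         # writes are buffered on the client until commit
    return True

# The decorator commits for us. If another client wrote seat_key between our
# read version and our commit version, the commit fails with a conflict and
# the function is simply run again with a fresh read version.
taken = reserve_seat(db, b"seat/12A", b"alice")
```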
Okay, back to the boxes. We now have an API, and you can see that a master, proxies, and resolvers have been added to the diagram. These are the stateless components that are there to implement the consistency model I just described. So let's go through the different operations you can do on FoundationDB and see what happens.

We'll start with reads. Reads go directly to the storage servers responsible for the range of keys being read. If you read key A and you're triple replicated, there's some set of three servers in the system that have that data, you talk directly to one of those servers, and the storage server gives you an answer. If you remember from the previous slide, your reads are versioned: you pass a read version in with your request, and the storage server has to give you the value of that key at that moment in time. So if we're reading A at version 200, the storage server keeps some history in memory of recent commits and gives us A as of that version. This comes back to one of the constraints of FoundationDB that you probably all already know about, the five-second transaction limit. There are two places where I'll mention the five-second limit, and one of them is right here: currently the storage server keeps that recent history in memory, so to limit the amount of memory the storage servers use we restrict how much history we keep, and we only keep five seconds.

The other thing to mention here is the metadata about which storage servers have which keys, because as a client you need to know whom to talk to to answer any individual query. This mapping between keys and the storage servers responsible for them is state that's held in a lot of components in the system. The client keeps a cache of these locations, and if it doesn't know at any given moment who's responsible for a key, it asks a proxy for that information. The proxy keeps the entire map, and the client just sends the proxy a request asking who is responsible for key A. If we shift responsibility for a shard, the client's cache can become stale, because the servers it previously thought were responsible for those keys have changed. In that case the storage server itself tells the client that the data has moved, the client invalidates its cache, and it re-fetches the correct location from the proxy.

This mapping is actually stored in the database itself, which is a really cool and interesting part of the design. In FoundationDB, the byte 0xFF is the system key prefix, and system metadata is stored in the database under it. To change shard responsibility, handing ownership of a key range from one set of servers to another, we actually just commit transactions to the database itself. It's a two-phase protocol: you say, I intend to move this range to some other location, and then once that location has copied all the data, you do another transaction saying these servers are taking over responsibility. This data distribution algorithm is responsible for a lot of the load balancing I mentioned earlier that you get out of FoundationDB. We're constantly monitoring how much work each of the different storage servers is doing and shifting responsibility around by executing transactions on the database, handing out ownership of keys. So if you added a completely empty new process to a FoundationDB cluster, the data distribution algorithm would notice there's a new process in the system and just keep giving it key ranges until it's taking its share of the traffic. Okay, that was a lot on reads.
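As a rough illustration of the client-side read path just described, here's a toy sketch, not FoundationDB code, of a location cache that falls back to the proxy for the key-to-server mapping and invalidates itself when a storage server reports that the data has moved. All the names are invented.

```python
# Toy sketch of the client read path: cached shard locations, a fallback to
# the proxy, and cache invalidation when a shard has moved.

class WrongShard(Exception):
    """Raised by a storage server that no longer owns the requested key."""

class Client:
    def __init__(self, proxy):
        self.proxy = proxy
        self.location_cache = {}                 # key range -> storage server team

    def read(self, key, read_version):
        for _ in range(2):                       # at most one retry after a stale hit
            rng = self._range_for(key)
            team = self.location_cache.get(rng)
            if team is None:
                team = self.proxy.locate(key)    # the proxy holds the full map
                self.location_cache[rng] = team
            try:
                return team[0].get(key, read_version)   # one hop, any replica
            except WrongShard:
                del self.location_cache[rng]     # the data moved; refresh and retry
        raise RuntimeError("shard map kept changing")

    def _range_for(self, key):
        return key[:1]                           # stand-in for real range boundaries
```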
On to commits. Writes in FoundationDB are just cached on the client, so there's nothing really to talk about there. A commit takes everything you did in your transaction, bundles it up, and sends it as a single unit to one of the proxies. You include the version at which you did your reads, every key that you read, and every mutation you're attempting to write, all packaged up together. The very first thing the proxy does with that information is assign the transaction a commit version. In this case we did our reads at 200, and the master tells us we're committing at 400.

The master is the only singleton in the pipeline that's not scaled out; there's only a single box here, and its entire job is to hand out larger and larger commit versions. Even though it's a singleton, it will never be a scalability limit for a cluster, because it has such a simple job. As you scale up the system, we combine different transactions together into batches, and the master only gives out a single version number for an entire batch of transactions, so even if you hammer the database really hard, the master is not going to break a sweat handing out these version numbers.

Once you have a commit version, you're ready to do what I described on the previous slide: detect whether anything you read in this transaction changed between your read version and your commit version. That's done on the resolvers. This is the other place where we come back to the five-second transaction limit: the resolvers are a stateless role that stores the previous five seconds of history of reads and writes, and for any given read it can tell you whether that key changed in a given time range. One important thing to note about resolvers is that when you add multiple of them, they're sharded by key range, so you split up your reads and writes according to the key ranges of those resolvers and send each resolver just the data it's responsible for. What this means, though, is that as you add more and more resolvers to a system, you get a little bit of conflict amplification, because one resolver doesn't know about the decisions the other resolvers are making. If one resolver fails a transaction and another thinks it succeeded, the transaction as a whole is failed, but the resolver that thought it was successful will go on to fail future transactions based on the writes from that transaction, even though they were never actually applied to the database. In general this isn't a huge issue; as I've mentioned already, you have to design your clients to avoid high-contention workloads anyway. But it does mean you don't really want to scale out resolvers aggressively: don't configure a hundred of them right off the bat, start slow and scale up to just as many as you need.
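To make the resolver's job concrete, here's a toy sketch, not FoundationDB code, of the conflict check: remember the commit version of recent writes and fail any transaction whose reads were overwritten after its read version. The real resolvers work on key ranges, prune anything older than about five seconds, and are sharded as described above.

```python
# Toy sketch of optimistic conflict detection on a resolver.

class Resolver:
    def __init__(self):
        self.last_write_version = {}    # key -> most recent commit version seen

    def resolve(self, read_version, commit_version, read_keys, write_keys):
        # Conflict: something this transaction read changed after its read version.
        for key in read_keys:
            if self.last_write_version.get(key, 0) > read_version:
                return False
        # No conflict: remember these writes so they can conflict later commits.
        for key in write_keys:
            self.last_write_version[key] = commit_version
        return True

r = Resolver()
ok1 = r.resolve(read_version=200, commit_version=400, read_keys=[b"a"], write_keys=[b"b"])
ok2 = r.resolve(read_version=300, commit_version=410, read_keys=[b"b"], write_keys=[b"c"])
print(ok1, ok2)   # True False -- the second transaction read b after it changed
```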
Assuming the resolvers haven't found any problems, we can now finally make the transaction durable, so the proxy ships the mutations, the writes from this transaction, to the transaction logs. As mentioned previously, every storage server has exactly one transaction log that it gets all of its data from. So if we're writing key B here, there's some team of storage servers at the end of this pipeline that needs the mutation that changes key B, and those servers are mapped to some set of transaction logs. The proxy knows all of these mappings, and it sends the change to key B to that set of transaction logs, which will eventually get the data over to the storage servers responsible for it. But because there are far fewer transaction logs than storage servers, you could happen to get unlucky and have all the storage servers that want key B mapped to just one transaction log, and in that case we have to additionally replicate key B onto some other transaction logs so that we're safe against failures in the transaction logging subsystem. So the proxy does a little bit of complicated logic to figure out where to store the mutations in the transaction log subsystem. Once the data has been fsynced onto those transaction logs and made durable there, we can finally return success back to the user, and behind the scenes the data is replicated over to the storage servers.

Finally, I can wrap all the way back around to get-read-version requests, and how we serve them is how we provide external consistency. The idea is that when you start a transaction, it's a very nice property that you'll see every commit that has ever previously happened on the system. In this specific example, we previously committed through one of the proxies, and if we immediately start another transaction that talks to a different proxy, we want to make sure this new transaction sees the result of the previous one we just did. The way we do this is that when we ask one of the proxies for a read version, that proxy sends a message to the other proxies asking whether they've seen any versions higher than the one it has locally, takes the max of all those responses, and sends it back to the client. At first glance this may seem like a scalability problem, because every proxy is talking to every other proxy exchanging these versions. However, we're saved here by batching. We don't have to do these exchanges for every single transaction; we can group a lot of different read version requests together and do a single round of messages to the other proxies for the whole batch. So really we're only sending these messages between proxies every millisecond or so.
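Here's a toy sketch, not FoundationDB code, of that get-read-version exchange: the proxy you ask returns the maximum committed version known across all the proxies, which is what guarantees a new transaction sees every previously acknowledged commit.

```python
# Toy sketch of external consistency via get-read-version.

class Proxy:
    def __init__(self):
        self.last_committed = 0
        self.peers = []

    def commit(self, version):
        self.last_committed = version

    def get_read_version(self):
        # In the real system these exchanges are batched, so they only
        # happen every millisecond or so, not once per transaction.
        return max([self.last_committed] + [p.last_committed for p in self.peers])

p1, p2 = Proxy(), Proxy()
p1.peers, p2.peers = [p2], [p1]
p1.commit(400)                    # a commit lands on proxy 1
print(p2.get_read_version())      # 400 -- a transaction via proxy 2 still sees it
```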
Okay, so that's the first part: we've gotten through how transactions flow in FoundationDB. However, if we wrap all the way back to what I said originally, a huge part of this design, the trade-off we've made, is in relation to failures. We have to explicitly handle failures in FoundationDB because we don't have the implicit failure handling that a quorum gives you. So now I'm going to take you through what happens when a process fails in FoundationDB.

I'm going to start with the hardest case, in which a transaction log dies, and that kicks off a recovery. The cluster controller detects that this server has gone down, and the first thing it does is attempt to recruit an entire replacement for the whole transaction subsystem that will take over from the previous generation, storing that handoff in the coordinators. This write uses Paxos, and we do it that way because it would be an absolute disaster if two different masters could take over simultaneously; any consistent system is going to need a two-phase commit somewhere. The first step is the first phase of our two-phase commit, a read from the coordinators to find out exactly who the transaction logs of the previous generation were. The new master then talks to those old transaction logs and finds out the final version that was committed to each of them. In this case, the one that died had version 400, and the ones that were still alive actually got more commits after that, 410 and 420, so they have even more recent data. Because commits had to go to every single log, the only version here that could have been acknowledged as a successful commit to a client is 400, but we're allowed to take any version after that, so 410 or 420 is also fine; no client ever learned the result of those later transactions. While it's asking for these versions, the master is also locking these transaction logs to prevent them from accepting any additional writes in the future, which keeps us from recruiting new logs while the old logs continue accepting writes. The master then picks the lowest of the responses it gets back and tells the next generation that they're taking ownership of the data starting at that version.

The master then starts up all of the other systems at that moment in time: the proxies recover the key-to-server mapping as of that version, and the new transaction logs actually have to do a little bit of copying as well, for a really subtle reason. Back on the previous step, we saw that the transaction log with version 400 was dead, but that might just be a temporary failure; it could be a little network hiccup that triggered this recovery, and if we have three servers and we're triple replicated, we actually want to be safe against two failures. What can happen is that after we've committed to our recovery version of 410, the two remaining logs that were alive can permanently fail, and the one that was temporarily dead at the start can come back, and it only has version 400. So we need to copy just a little bit of data from the previous generation to make sure that even in this crazy scenario of things dying and coming back alive, we still have all of the data up to the version we committed to as our recovery version.
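A toy sketch, not FoundationDB code, of how that recovery version gets chosen from the old generation of transaction logs; the log objects and their methods here are invented for illustration.

```python
def choose_recovery_version(old_generation_logs):
    """Lock every old transaction log we can still reach, collect the highest
    version each one has, and recover at the smallest of those answers."""
    responses = []
    for log in old_generation_logs:
        try:
            log.lock()                           # refuse any further writes
            responses.append(log.highest_version())
        except ConnectionError:
            continue                             # e.g. the log that just died
    if not responses:
        raise RuntimeError("no old transaction logs reachable; cannot recover")
    # Every acknowledged commit reached *all* of the old logs, so the smallest
    # response is at least the last version a client was told had committed.
    # Versions above it may be missing from some reachable log, so we stop
    # there: with logs at 400 (dead), 410, and 420, we recover at 410.
    return min(responses)
```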
In any case, once the new transaction logs have this little bit of data, we're finally ready to go with the next generation. The master then does the second phase of the two-phase commit to the coordinators, writing the new set of transaction logs into that database and recording that they've taken ownership from the previous generation. After all of that has happened, we can finally tell the cluster controller that we've taken over, and it can open the floodgates of traffic onto the system again.

So that was the hard one. That same recovery process I just went through happens even if a master, resolver, or proxy dies. The philosophy here is that we know we have to make this one recovery algorithm really fast, and if we handle the hard case really well, we can just use it for any failure, even ones that are potentially easier. Eventually we may get around to writing specialized recovery logic for replacing some of these stateless roles, but for now there's just the one process that happens if any of them dies. If the cluster controller dies, that's really simple: it's a leader elected by the coordinators, it's constantly heartbeating to those coordinators, so when it dies it stops heartbeating, the coordinators detect that and just elect a replacement. Even simpler than that, if a coordinator dies, well, it's using quorum logic, it has implicit failure handling, so nothing happens when it dies; the user will probably want to replace it just to restore fault tolerance, but there's no emergency. And finally we wrap back around to the storage servers, which, as you'll recall, are 90% of the processes in the system. When one of them dies, there's basically no effect on the user: clients just automatically start sending requests to the other remaining servers responsible for the keys that server had, and the data distribution algorithm I mentioned earlier starts shifting responsibility for all the key ranges on the failed node to other sets of live servers, so the system just naturally heals with no problem.

So that's the gist of it. I just have a few other points to talk about. One of them is a performance pathology that's pretty specific to FoundationDB. As I've mentioned a number of times, there's the five-second transaction limit. Normally, databases handle saturation by relying on back pressure: as more and more people overload a server, it naturally slows down its responses to those requests. In FoundationDB, however, because of our five-second limit, that can lead to a death spiral. The idea is that if you're doing five or ten serial reads in a row in your transaction, and every read starts taking a second, then by the time you get to your last read you've passed the five-second limit. So you've done all of that work, all of those reads, you get to the end, you try to commit, and you just fail. If every client starts doing this, basically no work gets done in the cluster; it's a disaster, you're in a death spiral, everyone is hammering the servers and nothing is getting done. We saw this problem, and our solution to it is a concept called Ratekeeper.
The idea is that when we're in saturation, we build up all of the latency before we hand out the read version for a request. Back to the diagram: when you start a transaction, your five-second limit really only starts from the moment you get a read version; you have to commit within five seconds of getting that version. So if we slow down the rate at which clients are getting read versions, then even though it might take a few seconds to get a read version, once a client has one it can do all the rest of its operations with very low latency. That Ratekeeper component is another singleton that lives on the master.

The final thing to mention is sort of a theme of FoundationDB, and Ben highlighted it a little bit: obviously this design is really complicated. There are a lot of little moving pieces and a lot of edge cases I didn't even get into in this architecture discussion. So the question of the day comes back to: how can you trust that we don't have bugs that just throw out this entire design? Do you actually get ACID guarantees if there's some bug that corrupts your data in some weird case? The founders recognized this basically from the get-go, and that's why they started with simulation. The idea is that this thing needs to be tested really severely so that we can actually trust that our implementation of this design meets the guarantees we say it does.

The way this works is that when I was going through this whole architecture, I was talking about all of these different processes and boxes, and in my mind, and maybe in your mind, you were thinking about them as happening on different machines, or different processes on those machines. But in actuality there's an indirection between the roles, the work that's happening, and where it's happening in the cluster. You can actually start up a FoundationDB cluster on your laptop in a single process, and that one process will do all of the work of all of these roles right there, locally, in that one process. What this allows us to do is have that single process pretend it's an entire network. Because it's all one process, it could just instantly send messages from the proxies to the resolvers, from the resolvers back to the proxies, from the proxies to the transaction logs. But instead of doing this instantaneously, we pretend there's latency between these components, we drop some packets randomly, we reorder things, we pretend whole machines die, we pretend there's corruption on disk. Basically every bad thing you could possibly think of, we do in this system. And because it's a single process doing all of this, we have determinism. We run hundreds of thousands of these randomized tests every night, pairing random failures with some workload that checks some property of the database, and when one comes back with an error, the next morning we can replay the exact, really rare series of events that caused it and figure out what happened. This is really powerful: of my eight years working on this database, probably six of them have been spent tracking down problems found by this thing. It's really one of the big pieces of secret sauce that makes FoundationDB possible.
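The heart of that idea fits in a few lines. This is only a toy sketch of deterministic, seed-driven fault injection, not FoundationDB's actual simulator, and the workload and fault here are stand-ins.

```python
import random

class Workload:
    """Stand-in for a randomized test workload plus its correctness check."""
    def __init__(self):
        self.log = []
    def step(self, rng):
        self.log.append(("op", rng.randint(0, 9)))
    def check(self):
        return True      # a real workload would verify e.g. transactional invariants

def kill_a_process(rng):
    return ("kill", rng.choice(["proxy", "tlog", "storage", "resolver"]))

def simulate(seed):
    rng = random.Random(seed)            # every source of randomness uses this RNG
    workload, events = Workload(), []
    for _ in range(10_000):
        if rng.random() < 0.01:          # occasionally inject a fault
            events.append(kill_a_process(rng))
        workload.step(rng)               # run the test workload a little further
    return workload.check(), events

# Same seed, same run: a failure found overnight can be replayed exactly.
assert simulate(42) == simulate(42)
```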
So that's all I have for you. Before I step down, I just want to say thank you to Dave Rosenthal and Dave Scherer. They made a lot of hard decisions early on, developing the simulator, developing Flow, developing this whole architecture. At every moment they were making technical decisions early on, they were focused on the long-term sustainability and long-term success of this database, and now I'm reaping the benefits of the groundwork they laid. So thank you.