This is about the future of FDB storage engines. My name is Steve Atherton. I've been working on FoundationDB for about four years. I've done a bunch of different things, but what I'm here to talk about is storage engines in general and the Redwood storage engine, which I'm very excited to talk about today. This presentation is hopefully going to have about four parts and minimal pauses and mistakes by the speaker. First I'm going to talk about the storage architecture of FoundationDB, then review the current storage engine options, talk about what we'd like to see in future storage engines, and then introduce the Redwood storage engine and some of its technical highlights.

So what's an FDB storage engine? Its main purpose is to persist keys and values to disk. It's not distributed; it lives in and is used by a single process, from a single thread, which makes it the most exciting part of this distributed database. Maybe not, but it's really important, because about 90% of an FDB cluster's processes have storage engines, and the other 10% are designed to funnel data to those storage engines as fast as possible.

Just a little block diagram here; we have to have one. There has to be one system diagram in every presentation, I think, is the rule. This is showing basically where the storage engine fits into FDB's architecture. We have a distributed log system and we have a storage server role. The storage engine lives inside the storage server; every storage server has exactly one storage engine instance. The storage server receives mutations in version order from the distributed log system, and it initially writes them to two places in memory, two different data structures. One is a tree-like thing that lets you efficiently point-read and range-scan keys at a specific consistent version and get their values. The other place the mutations go is a structure that stores blocks of mutations ordered by version.

Then, on a delay, those mutations are applied in version order to the storage engine. I say on a delay because our storage engines are currently single-version: once you've written something to the storage engine, you can't read the value that was there before. So in order to support reads during the five-second window that our transactions live for, we need to keep that data in memory, which is that little tree structure on the right, and not push mutations to the storage engine until they've left that buffer. Then periodically, in practice once every one to two seconds, we commit on the storage engine and send a message to the distributed log system telling it that it can forget versions up to that committed point. Commits always happen on a clean version boundary. (There's a small sketch of this flow at the end of this section.)

So, our current storage engines. There are two of them: SSD and memory. On the left we have our SSD engine, which is based on SQLite. It's a B-tree on disk, which has the nice property of giving you instant recovery on a cold start. And as its name implies, you're supposed to use this storage engine only on SSDs. On the right we have the memory storage engine, which, despite its name, does persist data to disk and exists as essentially a binary tree in memory. Its on-disk structure is a rolling log of mutations and snapshots of keys and values from the in-memory structure. As a result, it has a slow recovery time from disk. We recommend using SSDs for the memory storage engine, but you could probably get away with using spinning disks.
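Before moving on, here is a minimal sketch of the mutation flow just described, to make the buffering and the delayed apply concrete. This is not actual FDB code; the class, the version arithmetic, and the plain dict standing in for the storage engine are all simplifications made up for illustration.

```python
from collections import namedtuple

# op is "set" or "clear"; a real storage server also handles range clears, ignored here.
Mutation = namedtuple("Mutation", ["op", "key", "value"])

MVCC_WINDOW = 5_000_000  # roughly 5 seconds of versions at about 1M versions per second

class StorageServer:
    def __init__(self, engine):
        self.engine = engine        # the single-version on-disk engine (a dict in this sketch)
        self.versioned_data = []    # in-memory (version, mutation) records serving recent reads
        self.mutation_log = []      # the same mutations, kept in version order for applying

    def receive(self, version, mutation):
        # Mutations arrive from the distributed log in version order and are
        # written to both in-memory structures.
        self.versioned_data.append((version, mutation))
        self.mutation_log.append((version, mutation))

    def advance(self, latest_version):
        # Once a mutation's version falls out of the read window, apply it to the
        # storage engine. Periodically the engine is committed and the logs are told
        # they can discard everything up to the returned version.
        boundary = latest_version - MVCC_WINDOW
        while self.mutation_log and self.mutation_log[0][0] <= boundary:
            _, m = self.mutation_log.pop(0)
            if m.op == "set":
                self.engine[m.key] = m.value
            else:
                self.engine.pop(m.key, None)
        return boundary
```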
So we have these two storage engines, and they're pretty great. What else could we possibly want? Well, three main things. First is read-only transaction lifetimes longer than five seconds. Note that this is not going to increase the write transaction length; that's determined by the version interval held in the resolvers, and fundamentally you probably don't want to keep increasing that number in an optimistic concurrency system, because the longer you let your transactions run, the more likely you are to have conflicts. The second thing we want is prefix compression, because, as you've seen earlier today with the document model and the graph model, data models on top of FDB tend to use a lot of common prefix bytes in their keys. We'd also like better performance, particularly fewer disk reads per key and more write parallelism, which I'll talk a little bit more about later.

So, regarding those five-second transaction lifetimes. One thing we could do, which could be done on top of FoundationDB, or on top of a storage engine by a proxying layer that basically turns a single-version storage engine like SQLite into a multi-version storage engine, would be a multi-version layer. Basically, you'd store keys as tuples of key and version. Here's a table showing some examples: we have three keys that were set at different versions, and two of them were cleared. In this model, reading a key becomes a reverse range read for that key, from the version you want to read plus one (the way our ranges work) down to version zero, with a limit of one. So you're doing a range read, but you really just want the first result back, because if you say you want to read A at version 50, you don't actually know where A was last set or cleared. (There's a rough sketch of this read pattern at the end of this section.)

But range read performance suffers as you accumulate old versions. In this example, if I were to range read from A to D at version 30, I would read over six key-value pairs and return only one. You also have to scan your entire key space to remove expired versions at some point. So we don't want to do that; we'd like to push multi-version support into the storage engine and do something more efficient.

So, FDB storage engine requirements in general; I'd like to review these now. It has to be an ordered key-value store. You, of course, have to be able to read and write keys. You need to support range reads in forward and reverse order; certain encodings could make reverse reads awkward, which is the only reason I bring that up. Here's an important one: you need to have fast range clears. That is to say, your range clear operation has to take immediate effect and not significantly harm subsequent read or write latency. It can have background work that happens later, and for a long time, as is the case in our current storage engine, but the clear range can't stop or stall the speed at which you can apply mutations to the storage engine. You also need to be able to read data at a committed version. With what we have today, you can only read the latest version committed on the storage engine, but what we want for future storage engines is to be able to read any committed version within some configured interval.

Notably missing from this list of requirements is low commit latency, because our distributed log system is what determines our commit latency and provides durability for FDB transactions when you commit them. The storage engine isn't involved until later, so we can buffer up writes and commit them periodically, every couple of seconds for example. We also don't need concurrent writers, because the storage server is going to apply mutations serially, so there's no need to worry about different threads or different processes accessing the storage engine.
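Going back to the multi-version layer for a moment, here is the rough sketch mentioned above: an ordinary in-memory sorted index of (key, version) pairs, where a point read at a version is effectively a reverse scan with a limit of one. The names and the encoding are made up for illustration; a real layer would pack the key and version into ordered byte strings.

```python
import bisect

class MultiVersionStore:
    def __init__(self):
        self.index = []   # sorted list of (key, version) tuples
        self.values = {}  # (key, version) -> value, or None for a clear

    def set(self, key, version, value):
        bisect.insort(self.index, (key, version))
        self.values[(key, version)] = value

    def clear(self, key, version):
        self.set(key, version, None)  # a clear is just a versioned tombstone

    def read(self, key, read_version):
        # A point read at read_version is effectively a reverse range read from
        # (key, read_version) down to (key, 0) with a limit of one: take the newest
        # record at or below the read version, unless it is a tombstone (None).
        i = bisect.bisect_right(self.index, (key, read_version))
        if i > 0 and self.index[i - 1][0] == key:
            return self.values[self.index[i - 1]]
        return None

# Example (versions and values made up): 'a' is set at version 10 and cleared at 40.
s = MultiVersionStore()
s.set("a", 10, "x")
s.clear("a", 40)
print(s.read("a", 30))  # "x"
print(s.read("a", 50))  # None, because the clear at version 40 is the newest record <= 50
```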
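And pulling the requirements just listed together, here is a hypothetical interface sketch. This is not FDB's actual storage engine interface; it's just a summary of the operations a future engine would need to support, with made-up method names.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

class VersionedStorageEngine(ABC):
    @abstractmethod
    def set(self, key: bytes, value: bytes) -> None:
        """Write a single key."""

    @abstractmethod
    def clear_range(self, begin: bytes, end: bytes) -> None:
        """Must take effect immediately; any cleanup can continue in the background."""

    @abstractmethod
    def commit(self) -> None:
        """Make everything applied so far readable at a new committed version."""

    @abstractmethod
    def read(self, key: bytes, version: int) -> Optional[bytes]:
        """Point read at any committed version within the configured retention window."""

    @abstractmethod
    def read_range(self, begin: bytes, end: bytes, version: int,
                   limit: int, reverse: bool = False) -> List[Tuple[bytes, bytes]]:
        """Ordered range read, in forward or reverse order, at a committed version."""
```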
So I'd like to talk a little bit more about our current SSD storage engine, which is based on the SQLite B-tree, which notably is not a B+ tree. Quick review: B-trees have values inside their internal nodes, and B+ trees do not. As a result, B-trees tend to have worse branching factors than B+ trees. With a high branching factor, you can hopefully reach a point where you only have one out-of-cache read per point lookup, which is a great property to have.

SQLite is not optimized for single-writer throughput. Every set and clear operation must traverse the tree serially to its target page and then modify it. As a result, our writer thread in FDB (it's really an actor, but I'm calling it a thread) only has about one outstanding IOP at any given time. SQLite is also not optimized for large key-value pairs. And it's not designed or written for an async framework, so we've adapted it using libcoro, which is a library for stackful coroutines, and that adds a lot of complexity. We'd prefer to have a storage engine that was written in Flow. That's not to say that we wouldn't do the same gymnastics again to adapt some other great storage engine; it's just that we're not doing that right now. We have some ideas in mind for what we want our storage engine to do, and nothing else does it exactly, so we're writing it from scratch.

So the first decision to make is: do we want a B-tree (a B+ tree, of course) or an LSM? A B+ tree optimizes for read performance, which is in line with the rest of FDB's architecture. LSMs do usually have fast point lookups using probabilistic hints like bloom filters or cuckoo hashes, but range reads are very common in FDB applications, and probabilistic hints are less useful there, though I understand there is research being done in that area. A good example of this is non-unique indexes: you'll have some index identifier, then some value, and then your primary key as the last part of the key, so you need to range scan over the index name and value to do a lookup in that index and get all the primary keys of the relevant records. (There's a small illustration of this pattern after this section.) Also, without native versioning support (people often ask about RocksDB, for example), we would need to use something like the multi-version layer on top of it, which has the pitfalls discussed earlier. So this isn't to say that an LSM storage engine is a bad idea; it's a great idea for some workloads. It's just not our focus right now, based on what our needs are.
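To make that non-unique index pattern concrete, here is a hedged illustration: each index key is (index name, indexed value, primary key), and a lookup is a range scan over the (index name, indexed value) prefix. The tuple encoding is simplified (a real layer would pack these into ordered byte strings, for example with FDB's tuple layer), and all of the names and values are made up.

```python
import bisect

def index_key(index_name, value, primary_key):
    # In a real layer these would be packed so that the lexicographic order of the
    # packed bytes matches the tuple order; plain Python tuples stand in for that here.
    return (index_name, value, primary_key)

def lookup(sorted_index, index_name, value):
    # The index lookup is a range read over everything that starts with
    # (index_name, value); the primary key is the last element of each index key.
    lo = bisect.bisect_left(sorted_index, (index_name, value))
    results = []
    while lo < len(sorted_index) and sorted_index[lo][:2] == (index_name, value):
        results.append(sorted_index[lo][2])
        lo += 1
    return results

# Usage: two records share city = "tokyo", so the lookup returns both primary keys.
idx = sorted([
    index_key("city_idx", "tokyo", "user_17"),
    index_key("city_idx", "oslo",  "user_03"),
    index_key("city_idx", "tokyo", "user_42"),
])
print(lookup(idx, "city_idx", "tokyo"))  # ['user_17', 'user_42']
```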
So this brings me to the Redwood storage engine. It's a versioned B+ tree on top of a versioned pager. It persists version history, but it mitigates the inefficiencies of the multi-version layer design. I'll talk more about this later, so at this point I'm just kind of reading bullets to you. And of course it has key prefix compression, which I mentioned earlier, and I'm going to talk about that in more detail later.

A quick review of what a copy-on-write B+ tree is. Here we have a B-tree root node, and we're going to follow the child link for 'h', which points to page 7, so there's that child page. When you want to modify this structure, the sequence is that you first copy the page, then you modify it. So here we've added hi = z at the leaf level, and we've copied page 11 to page 25 before we made that modification. Now we need to make page 7 point to page 25, so we have to update the parent's pointer, which means we have to copy page 7 and make the appropriate change, and then we have to do that again all the way to the root, and then we have a new root. The nice thing about this is we don't have to have a write-ahead log, because we've never left our data structure in an inconsistent state on disk at any point; the atomic point in time at which all of the new data becomes visible from the tree is that last step where we update the root.

This is expensive for random writes, though. If you have a branching factor of 200 and you have four levels in your tree, and you touch 200 random leaf pages, you're likely going to have to touch close to 200 random parent-of-leaf pages to update those pointers, because your third level of the tree is also larger than 200 pages. And then probably most of your second level will change too, so you get a lot of write amplification, basically.

We can limit the copy-on-write cost using indirection. Here I have the same setup as before, and on the right I'm going to show the same sequence using indirection with a page table. This page table maps logical pages to physical pages, so now the same page numbers 7 and 11 that are in my B+ tree nodes are logical pages instead of physical pages. Whereas on the left we copied page 11 to page 25 and modified it, on the right we keep page 11 as page 11: we write it to a new place and we update the page table to record the new slot where page 11 is physically located. On the left we still had to do the copying upwards to the root; on the right we don't, because the parent's pointer to logical page 11 hasn't changed.

So Redwood has a versioned pager, which is like that page table we just saw, but it also has a version dimension. All entries of the page table are kept in memory at all times, and the on-disk format is very much like (in the prototype, literally exactly the same as) the FDB memory storage engine. As a result, recovery from disk is slow, and to avoid that, the in-memory state will be stored in a shared memory buffer that can survive graceful process exits and restarts, and the new process can attach to it and use that structure. But of course if you power off the machine, reboot it, et cetera, you're going to have to do a slow recovery from disk.

The versioned B+ tree on top of this pager basically consists of logical pages that are all read at the same version. In a sense, there are many versions of the B+ tree, because if you start at the root and read it at some version and then read every page below it at that same version, you essentially have an unchanging version of the B+ tree, like a snapshot of the entire B+ tree.
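Here is a simplified sketch of that versioned page table idea: each logical page ID maps to a list of (version, physical page) entries, a read at a version picks the newest physical location at or below that version, and a write allocates a fresh physical slot instead of rewriting ancestors. This is an illustration of the concept only, not Redwood's actual in-memory or on-disk format.

```python
import bisect

class VersionedPager:
    def __init__(self):
        self.table = {}          # logical page id -> sorted list of (version, physical page)
        self.next_physical = 0   # next free physical page slot

    def write_page(self, logical_id, version, data, disk):
        # Copy-on-write via indirection: write the new contents to a fresh physical
        # page and record it in the table. The parent B+ tree node still points at
        # the same logical id, so nothing above it needs to be rewritten.
        phys = self.next_physical
        self.next_physical += 1
        disk[phys] = data
        self.table.setdefault(logical_id, []).append((version, phys))
        return phys

    def read_page(self, logical_id, version, disk):
        # Find the newest physical page for this logical id at or below `version`.
        # Assumes the logical page already existed at or before that version.
        entries = self.table[logical_id]
        i = bisect.bisect_right(entries, (version, float("inf")))
        return disk[entries[i - 1][1]]

# Usage: logical page 11 is rewritten at version 20; a reader at version 10 still
# sees the old physical copy, and a reader at version 30 sees the new one.
disk = {}
pager = VersionedPager()
pager.write_page(11, 10, b"old contents", disk)
pager.write_page(11, 20, b"new contents", disk)
print(pager.read_page(11, 10, disk))  # b'old contents'
print(pager.read_page(11, 30, disk))  # b'new contents'
```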
So within a page you can have multiple versions of the same key, but we control the amount of history to avoid that slow range read effect we talked about earlier, where, as you accumulate older data, your range reads return fewer useful results for the records they visit. It's also notable that prefix compression makes this cheaper, because if you have A at version 5, A at 10, and A at 15, the A part, which could be some long key, is shared as a prefix, and even the first several bytes of the version can be shared too. As Alec mentioned earlier, our versions are actually pretty long; for the purposes of this interface, they're going to be eight-byte versions. And basically, if you write A three times in rapid succession, the first few bytes of that version are probably unchanged and can be borrowed from the earlier record. I'll get into prefix compression more later.

Getting close to the end. Okay, so another property of Redwood is a high branching factor, which comes as a result of using minimal boundary keys and having key prefix compression.

So the last thing I'd like to talk about is prefix compression, and a little bit about how Redwood does it. Data models on FDB tend to use repeated prefixes; I think I said that in the last few minutes. String interning can reduce repeats, but at the cost of additional reads and additional complexity in your application. Ideally, it would be better to not have to do any string replacements or enumerations and to just construct your keys in a way that is natural for your data. So here we have one option with the JSON model: you can have some key whose value is an entire document, or you can have what the Document Layer actually does, which is a bunch of separate keys and values, one key-value pair per value in the document. You can see there are a lot of repeated sequences in those keys.

So one way to do prefix compression is linear. You basically serialize the sorted set of keys as a prefix length and then a suffix string, and that gives you the minimal possible footprint. Here on the right, I'm going to add some links that show where each record borrows its prefix bytes from. As you can see, every record borrows from the previous record, and therefore searching this structure is a linear search, because you have to start at the first key and read all of them to get to the last one and actually be able to decode it.
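Here is a small sketch of that linear encoding: each key in sorted order is stored as the length of the prefix it shares with the previous key, plus the remaining suffix. The example keys are made up. Note that decoding any record requires walking from the first one, which is why searching this layout is linear.

```python
def common_prefix_len(a, b):
    # Length of the shared prefix of two strings.
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def encode(sorted_keys):
    # Each record is (bytes borrowed from the previous key, remaining suffix).
    records, prev = [], ""
    for k in sorted_keys:
        p = common_prefix_len(prev, k)
        records.append((p, k[p:]))
        prev = k
    return records

def decode(records):
    # Decoding must replay every record from the start, hence linear search.
    keys, prev = [], ""
    for p, suffix in records:
        k = prev[:p] + suffix
        keys.append(k)
        prev = k
    return keys

keys = ["app/user/1/name", "app/user/1/phone", "app/user/2/name"]
print(encode(keys))
# [(0, 'app/user/1/name'), (11, 'phone'), (9, '2/name')]
print(decode(encode(keys)) == keys)  # True
```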
So is there a better set of prefix source links? Could we change these prefix borrowing relationships to get a better search time? It turns out there is: we start at the middle and work outward. If we have some nodes borrow from the middle and then some other nodes borrow from those nodes, we get a set of links that look like this. And if we redraw those same links in a different way, with the dotted lines showing the prefix borrowing source and the solid lines showing the child pointers, we get a binary tree. Notably, everything I've added to the binary tree so far borrows from its immediate parent; this last one actually borrows from the root, skipping over its immediate parent.

It turns out, and I don't have time to go into deriving this, that for any binary tree, whether it's balanced or not, there is one ideal prefix source in your ancestry, and you can describe it with just one bit. That one bit tells you whether you're borrowing from your previous ancestor or your next ancestor: previous means the ancestor on your left that is greatest, and next means the ancestor on your right that is least. This results in perfect prefix compression and log n search time, with the only additional overhead, besides the binary tree itself, being your prefix length and that extra bit.

So why does this matter? It's not because of space; non-perfect prefix compression already gives you a pretty good space saving. It's about predictability, because now we can very quickly answer the question: will this set of keys that I want to add to this page, which has a fixed size limit of say 4K, fit once they are compressed? That ends up being really useful for adding to a page. It's also really useful when you have a bunch of data and you want to bulk-build pages: you can scan through it linearly and, based on just comparing adjacent prefixes, you know exactly the point where the compressed form would overfill your page, so you know exactly where you can stop your scan and build a binary tree. (There's a small sketch of this fit check at the very end, after the conclusion.)

It turns out that this also works at the B+ tree level. So here I have two B+ tree pages, each with a small binary tree inside it. If you'll notice, in page 7 at the bottom, the left side and the right side and the root of the binary tree don't actually have a previous ancestor or a next ancestor in the binary tree. But it turns out that if you substitute the previous key boundary or the next key boundary from the parent page, you get exactly the same effect. So if you take a bunch of data that's sorted and you build a whole bunch of B-tree pages out of it using that same single-bit prefix source specifier, then you get perfect prefix compression for that tree. Now, note that the tree as a whole is not always going to keep perfect prefix compression, because deletions can cause parent pages that originally had ideal boundaries, which the child nodes were borrowing from, to hold bytes that the child pages are no longer borrowing, because the items that were borrowing the parent's bytes have been deleted. So it will drift over time, but again, that's not the main purpose, and it's still pretty good compression.

So that's all I had. Sorry, I did run a little bit over. In conclusion, Redwood is going to bring longer read-only transactions, it'll be faster for reading and writing, and it'll be smaller on disk. It is on the master branch right now. It's a work in progress, far from finished. A lot of the work done so far has been very experimental, just exploring design choices, and there are a lot of check-ins that completely replace the content of other check-ins. But work is in progress, and it's coming along pretty well. That's all I had. Thank you.
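Referring back to the bulk-build predictability point above, here is a hedged sketch of the "will these keys fit once compressed" check, assuming a greedy page cut based only on adjacent keys' shared prefixes. The page size, the per-record overhead, and the rule that the first key of a page is stored whole are all made-up simplifications (the talk notes that the first keys of a page can also borrow from the parent page's boundary keys); this illustrates the idea, not Redwood's actual page builder.

```python
PAGE_SIZE = 4096
RECORD_OVERHEAD = 4  # hypothetical bytes per record: prefix length, suffix length, source bit, flags

def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def cut_pages(sorted_items):
    """Greedily pack sorted (key, value) pairs into pages of at most PAGE_SIZE compressed bytes.

    With perfect prefix compression, the compressed size can be computed exactly from
    adjacent keys alone, so the scan knows precisely where a page would overflow.
    """
    pages, current, used, prev_key = [], [], 0, ""
    for key, value in sorted_items:
        suffix_bytes = len(key) - common_prefix_len(prev_key if current else "", key)
        cost = RECORD_OVERHEAD + suffix_bytes + len(value)
        if current and used + cost > PAGE_SIZE:
            pages.append(current)  # this page is full; start a new one
            current, used = [], 0
            # Conservatively store the first key of a new page whole in this sketch.
            cost = RECORD_OVERHEAD + len(key) + len(value)
        current.append((key, value))
        used += cost
        prev_key = key
    if current:
        pages.append(current)
    return pages
```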