Okay, so I'm Steve Atherton, and I'm here to give an update on the Redwood storage engine, which was introduced at last year's summit. These are the topics I'll be covering: first an overview, then status on what's done and not done, some performance numbers, then an important design update that was made in the past year. After that I'm going to delve into detail on the IO patterns for some operations the storage engine has to do, and lastly I'll talk about future improvements that are coming in the short term.

A quick overview: Redwood is a B+tree. It's multi-version, so it can contain different values for the same key at different versions, and it employs prefix compression both across nodes and within nodes themselves. The B-tree lives on a pager, which is a stateful component that is also multi-version. The pager provides read, write, and commit access to fixed-size pages by version.

A little bit on status. What's done? First and foremost, it works, and it's actually been shipping, so to speak, in recent patch releases of 6.2. It passes correctness testing, and most of the necessary functionality is done. Notably, prefix compression is done and is in final form for the foreseeable future, and deferred range clears are done and also in final form for the foreseeable future. Pretty much all of the IO patterns are where they need to be at this point.

That all sounds great, so what's not done? Well, performance, mainly because there are a lot of CPU optimizations that haven't been done yet; earlier, Diego talked about one of them, where the mutation buffer definitely has some performance issues. Also notably, B-tree node merging is not yet implemented. You need node merging in a B-tree to reclaim the slack space that accumulates within nodes over time as keys are cleared. If you clear an entire page or set of pages, that's great, because those pages will be freed, but if you only clear part of a page, you need to eventually reclaim that space. Also, the on-disk format may change, and no backward compatibility on the file format is being provided yet.

Next I'd like to talk about performance, specifically performance with compressible prefixes. This avoids most of the unoptimized parts that still exist in Redwood. I have a data set of 40 million records. Each record exists under one of 100,000 unique prefixes, and each key has a random 4-byte suffix. The values are 32 bytes of random data. The inserts are mostly localized under one of the prefixes; what I mean by that is that groups of inserts are done together under one of the 100,000 unique prefixes.

Here's a big chart, and I'm going to focus on one column at a time. What this column is saying is that we've set our prefix length to three, so we have 100,000 unique prefixes that are all three bytes in length. This produces a key-value set size, counting just the key lengths and value lengths, of 920 megabytes, and we're using a two gig page cache. This is very similar to what you would get if you used the FDB directory layer and created 100,000 directories; the average shortened prefix the directory layer would give you would be about 3 bytes. On this test you can see that both SQLite and Redwood use around 1.5 gigs of space. Redwood loaded it a little faster, at 160 seconds, and the deferred clear time, which is the time it takes to walk the tree and clear all keys from the data structure, is about the same for both.
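Before going through the rest of the chart, here is a minimal sketch of what intra-node prefix compression means, with each entry storing only the bytes it doesn't share with its predecessor. This is purely illustrative and is not Redwood's actual on-disk encoding; the keys are made up.

```python
# Minimal sketch of intra-node prefix compression (illustrative only;
# not Redwood's actual on-disk format). Each key in a sorted node is
# stored as (shared, suffix): the count of leading bytes shared with
# the previous key, plus the remaining unique bytes.

def compress(sorted_keys):
    out, prev = [], b""
    for key in sorted_keys:
        shared = 0
        for a, b in zip(prev, key):
            if a != b:
                break
            shared += 1
        out.append((shared, key[shared:]))
        prev = key
    return out

def decompress(entries):
    keys, prev = [], b""
    for shared, suffix in entries:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys

keys = [b"app/users/alice", b"app/users/bob", b"app/videos/intro"]
encoded = compress(keys)
# -> [(0, b"app/users/alice"), (10, b"bob"), (4, b"videos/intro")]
assert decompress(encoded) == keys
```

With 100,000 keys grouped under shared prefixes, nearly every copy of each prefix compresses away, which is why the prefix length barely affects Redwood's footprint in the chart.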
In the next column, we bump the prefix size up to 16. Pretend that instead of using the directory layer, you used long descriptive names for every prefix. Redwood's characteristics haven't really changed; everything actually got a little lower, which is basically just noise in the test, but essentially the footprint, the load time, and the clear time are the same. Meanwhile, SQLite has seen an increase in all three of those things. We're still using a two gig page cache, so just about everything comfortably fits into memory for both storage engines at this point.

Then we move up to 32-byte prefixes, which are just more flowery, descriptive names for those same 100,000 unique prefixes. Again, Redwood's metrics are basically the same. SQLite has seen a large bump in space used, and notably, that 2855 megabytes is larger than the two gig cache, so now there are a bunch of cache misses happening. The load time has gone up substantially, and the clear time is 20 times higher.

Then I did one more test, using the exact same prefix size and workload shape as the previous one, but with the page cache lowered to only 100 megabytes. Redwood, again, is basically unchanged; the clear time did become four times longer, but it's still very small, and load time was largely unaffected. SQLite has the same footprint as before, but the load time has gone up and the clear time has almost doubled. I'll talk more later about some of the reasons why that is. The takeaway from this slide is that common key prefixes in Redwood are nearly free.

Moving on, I'd like to talk in detail about a design update I made in the past year, based on the experience of going down the path laid out in last year's presentation. A quick review of the pager design: the idea is that you can read or update a page at a version, and the pager keeps a page table that maps a logical page at a version to the physical page that contains the actual data. This table is kept entirely in memory so that lookups are fast, and it also must be periodically snapshotted or persisted to disk in some way, like a rolling snapshot, and recovered on startup. The goal of this design was that one page update equals one write: when you update a page at some version, you put that data somewhere and then use the page table to route reads to that location, so there's never a reason to copy the data anywhere else.

Unfortunately, this design has some drawbacks. One of them is that once a page is updated, one or more remap entries will exist for that page basically forever. Here we have on the left a table showing nine different versions of logical page 9 at different physical pages. Once time has gone on and we are only maintaining version 44 and newer, we only need one of these entries, which sounds good, except we still need that one entry. And if you think about it, over time almost every page in your storage space is going to be updated, so nearly every page is going to have its actual data located at a different page from where it started. With something like two terabytes of space, you have half a billion page IDs, and if nearly all of those are remapped somewhere, then no matter how you design it, you have many gigabytes of memory being used for your lookup table, which is then also being snapshotted, persisted to disk, and reloaded on cold start. So this is very memory intensive, and it's slow to start.
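Here is a hypothetical sketch of that kind of versioned page table; the versions and page IDs are invented for illustration, not taken from the slide. Even after pruning every entry that no retained read can see, one entry per remapped page has to stay behind.

```python
import bisect

# Hypothetical sketch of the original pager's in-memory page table:
# each logical page maps to a version-sorted list of (version,
# physical page ID) entries, and a read at version v is served by the
# newest entry whose version is <= v. (Reads below the first entry's
# version aren't modeled in this sketch.)
remap = {9: [(5, 12), (10, 15), (23, 30), (40, 41), (44, 50)]}

def lookup(logical, version):
    entries = remap[logical]
    i = bisect.bisect_right(entries, (version, float("inf"))) - 1
    return entries[i][1]

def prune(logical, oldest_retained):
    """Drop entries no retained read can see. Note that the newest
    entry at or below the oldest retained version must stay forever,
    which is the memory problem described above."""
    entries = remap[logical]
    i = bisect.bisect_right(entries, (oldest_retained, float("inf"))) - 1
    remap[logical] = entries[max(i, 0):]

prune(9, oldest_retained=44)
assert remap[9] == [(44, 50)]      # one entry per remapped page, forever
assert lookup(9, version=100) == 50
```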
The new design is what I call a delayed write-ahead log. It's for shorter retention periods, and the pager is pluggable, so other pagers can be written. The idea here is that one page update is up to two writes: one write immediately, to get your data to be durable somewhere, and then up to one additional write later. I want to go through the sequence of what that actually refers to. In the updated design, the page table now has an implied identity record for each logical page, which says that if you look up a page at a version less than the oldest version in the table, the answer is the same page ID that went in; that is, the physical page ID is the same as the logical page ID. So as older page table entries expire, they are removed.

How can we do that? Here on the second chart, we've advanced the oldest retained version to 22, so we want to remove the entry for logical page 9 at version 5, which points to physical page 12. We can do that by copying the contents of page 12 onto page 9, and then we no longer need that entry. If you read at version 22 or 23, the mapping will be served by the identity record, and it will give you physical page 9, which is where that data does, in fact, exist. So now we have only three entries left for page 9, which is nice.

Then as time goes on, we bump our retained version forward to version 44. Now we do the same trick again, but with an additional optimization: since we bumped the retained version forward a large amount at once, we can skip over two of these copies, because we really only need the latest version to be copied back to physical page 9. So we skipped over versions 23 and 40, which saves writes, and is why the previous slide said up to one additional write. And now we have no entries left in our page table for page ID 9.
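As a rough sketch of that cleanup under the assumptions above (the entry versions and page IDs are hypothetical): only the newest expired entry gets copied back, and the intermediate expired versions are dropped without any write at all.

```python
# Hypothetical sketch of remap cleanup in the delayed-write-ahead-log
# design. `entries` is the version-sorted [(version, physical page ID),
# ...] list for one logical page. Entries at or below the oldest
# retained version expire; only the newest expired one is copied back
# to the logical page ID, so intermediate copies cost no writes.

def cleanup(logical, entries, oldest_retained, copy_page):
    expired = [e for e in entries if e[0] <= oldest_retained]
    kept = [e for e in entries if e[0] > oldest_retained]
    if expired:
        _, physical = expired[-1]
        copy_page(physical, logical)   # the "up to one" additional write
    return kept

writes = []
remaining = cleanup(9, [(23, 30), (40, 41), (44, 50)], oldest_retained=44,
                    copy_page=lambda src, dst: writes.append((src, dst)))
assert remaining == []       # page 9 is now served by its identity record
assert writes == [(50, 9)]   # versions 23 and 40 were skipped entirely
```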
So next, I'm going to talk about some IO patterns. First, writes, and specifically the reads required to service writes. SQLite does one tree traversal for every mutation, in order, and so it incurs serial cache misses. I'm going to go through a series of six insertions here. We start at the root, go down to a page, and have to wait for it to load; that's one read latency incurred. Then we go back to the root, wait for an internal node to load, and then wait for the next node, and so on. I'll go through this quickly; I think you get the idea. Each of these seeks had some stops along the way to wait for pages to load, so we suffered six serial read latencies to make these changes.

But in Redwood, we buffer and sort mutations for an entire commit, and then do one parallel traversal per commit cycle to all of the nodes affected by those changes. The traversal is not purely breadth-first or depth-first; it actually depends on what's in the cache. We start at the root and start traversing, and we start these pages loading without having to wait for them, continuing on as far as we can. Now that all of those are loading, we wait for them to finish. Once one of them finishes, there's some follow-on work: we have to wait for its children to load because we're editing both of those pages, and once they're loaded, we've written our changes and we're done. I'd also like to point out that if a bunch of different writes came in between the keys D and E during our one-second commit cycle, all of the insertions in that range would be handled together at once, which makes sequential insertion on Redwood very efficient.
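Here is a rough, hypothetical sketch of that commit-time write pattern; the node structure and the page-load function are stand-ins, not Redwood's actual code. The point is that the sorted mutation batch is partitioned by child, and all needed child loads are started at once rather than one at a time.

```python
import asyncio
from bisect import bisect_right

# Hypothetical in-memory stand-ins for B-tree nodes and page loads.
class Node:
    def __init__(self, boundaries=None, children=None):
        self.boundaries = boundaries   # internal node: sorted boundary keys
        self.children = children       # internal node: child page IDs
        self.data = {}                 # leaf node: key -> value

async def load_page(pages, page_id):
    await asyncio.sleep(0.01)          # stand-in for one disk read latency
    return pages[page_id]

async def apply_commit(pages, node, mutations):
    if node.children is None:
        node.data.update(mutations)    # leaf: apply the whole batch at once
        return
    # Partition the batch by boundary key, then start every needed
    # child load at once and recurse into the subtrees concurrently.
    groups = {}
    for key, value in mutations.items():
        child = node.children[bisect_right(node.boundaries, key)]
        groups.setdefault(child, {})[key] = value
    async def descend(child_id, muts):
        await apply_commit(pages, await load_page(pages, child_id), muts)
    await asyncio.gather(*(descend(c, m) for c, m in groups.items()))

pages = {1: Node(boundaries=[b"m"], children=[2, 3]), 2: Node(), 3: Node()}
asyncio.run(apply_commit(pages, pages[1], {b"d": b"1", b"e": b"2", b"q": b"3"}))
assert pages[2].data == {b"d": b"1", b"e": b"2"}  # d and e handled together
```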
So now I'd like to talk about read latency. With B-trees, read latency is affected by branching factor. Ideally, cache size is approximately the amount of space you're using divided by your branching factor. The idea there is that every page in your cache hopefully points to branching-factor other pages; if your branching factor is 20, hopefully most of the pages in your cache are internal and point to 20 other pages, and in that case, with a 5% cache-to-disk ratio, you would have most of the internal structure of your tree in memory. But if you're dealing with, say, a 2 terabyte or a 4 terabyte disk, having to keep 5% of that as a disk cache is pretty high. So the higher your branching factor, the smaller your cache needs to be to ensure that most reads suffer only a single cache miss. In SQLite, repeated key prefixes make the records inside pages bigger, so they harm branching factor, and non-trivial values, larger than about 4 bytes, also harm branching factor. In Redwood, values aren't stored in the internal nodes, which means the keys in internal nodes don't need to be complete, because they're not user keys; they can be minimal boundary keys, just long enough to decide whether a lookup should descend on one side or the other of that boundary. Also, prefix compression means there's no penalty for repeated key prefixes in nodes. So we get more entries per node.

Next, I'd like to talk about large getRanges. In SQLite, if we want to read the range g through k, we start at the root, traverse down to g, wait for that page to load, read its contents, and then go to the next page; that's our second cache miss, then a third. So we have four read latencies to get this data. Redwood supports a prefetch size. In this case, we have 4k pages and are using a prefetch of 15k, so during the traversal it will recognize that one page is not enough to satisfy 15k, and it will start four adjacent siblings loading simultaneously.

Now I'd like to talk about oversized nodes. In SQLite, it's actually records rather than pages that are oversized, and they are individually stored in extra pages called overflow pages. Here, if we want to scan c through e, we start at the root and jump down to a leaf node, and we find as we're scanning that some of these records have overflow page chains. This one has one overflow page, so we read that. Then we go to the next record; it has an overflow page too, and that overflow page actually points to yet another overflow page. So we have three read latencies to read essentially four pages' worth of data. In Redwood, what we do instead is make these child pointers bigger: node links are variable-sized, and a child pointer has a node link which contains all of the pages that the next lower node is made of. In this case there are four of them, so we start at the root and load, in parallel, all four of the pages that constitute what we call a super page. Now we can read c through e suffering only one cache miss latency.

So next, deferred range clears, which relate very closely to the previous slide. SQLite must read every page in a cleared range in order to determine whether any of the records in those pages have an overflow page chain, so it has to not only read the pages into memory but also iterate over all of the records in them. Redwood, because of its multi-page pointer design, doesn't have to read or scan through leaf nodes when doing range clears. This results in branching-factor-times less IO, which in Redwood is usually going to be well over 100x less IO. That is the reason the third number on the performance slide was so good for Redwood: to clear all of the keys in the data set, it had to do, in that case, probably 150x less IO. Which is also why having a 1.5 gig data set and a 100 megabyte cache wasn't really a problem, because most of the nodes needed for the deferred clear were still in the cache.
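Here is an illustrative sketch of that range-clear property, using the same kind of hypothetical node structure as before: because each child pointer carries the full list of pages its child occupies, entire subtrees of leaves can be freed by page ID without ever being read.

```python
# Illustrative sketch (hypothetical structures, not Redwood's code):
# each child pointer is a "node link" listing every page the child
# node is made of, so leaves are freed by ID alone during a clear.

class Node:
    def __init__(self, children=None):
        self.children = children   # list of child node links, or None for a leaf

def clear_subtree(pages, free, link, height):
    """Free the subtree whose root node occupies the pages in `link`;
    leaves are at height 0."""
    if height > 0:
        node = pages[link[0]]      # only internal nodes are ever read
        for child_link in node.children:
            clear_subtree(pages, free, child_link, height - 1)
    for page_id in link:           # leaf pages are never loaded or scanned
        free(page_id)

pages = {
    1: Node(children=[[2], [3, 4]]),   # the second leaf is a 2-page super page
    2: Node(), 3: Node(), 4: Node(),
}
freed = []
clear_subtree(pages, freed.append, [1], height=1)
assert sorted(freed) == [1, 2, 3, 4]   # pages 2, 3, 4 freed without any reads
```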
So next, I want to talk about data movement. Data movement in FDB is basically when you read a shard from one storage server, write it to another storage server, and then clear it on the original storage server. This is not a great pattern for SQLite, because reading a couple-hundred-megabyte shard pushes a bunch of data out of cache for a one-time read that likely won't be repeated. Then the write happens over time, and then we come back, likely after a lot of those pages have been evicted from the cache, do the same reads all over again, and likely pollute the cache yet again just to clear that data. Redwood supports non-caching reads, which are used in this case and in some other situations too, and as we talked about before, sequential insertion is faster and deferred range clear is much faster. So data movement on a Redwood-backed FDB cluster is going to have a lot less impact on the cluster's performance.

So lastly, I want to talk about future improvements. First on my list is to reduce duplicative write-path mutation buffer lookups. There are a lot of extra lookups into the mutation buffer being done, around 50x too many for some workloads, basically because that code was written for correctness first and not for performance, and I haven't gotten to fixing that yet. Additionally, as Diego presented earlier, his team is working on a replacement for the mutation buffer, which is currently just an std::map for convenience. So the lookups themselves are also going to get a lot faster, and very notably, the insertions are going to get a lot faster, as those need to happen either way. Also, the prefix-compressed structure of the nodes and the overall tree is not being fully leveraged; there's a lot more byte skipping that could be done during comparisons. Nodes themselves are currently always rebuilt, by taking all the old data plus the new data and building a new node; for small edits, nodes could instead be surgically updated, which would be a lot faster for highly random small writes. And there are also some single-version optimizations that could be done for when you know you don't need any version history.

The first production release of Redwood is planned for FDB 7.0. It will definitely be single-version mode only; no history will be exposed yet, even though a lot of the internals are there. It should be faster for many workloads, but the worst case is going to be small key-value pairs with highly random writes and very little sequentiality. And that is all I had.