Databases, a database seminar series at Carnegie Mellon University, is recorded in front of a live studio audience. Funding for this program is made possible by OtterTune and Google. We're excited today to have a talk from Alex Conway, who's a researcher at VMware Research, and he'll be talking about SplinterDB. Alex did his PhD at Rutgers in computer science. He did his undergraduate at Rutgers in math, and another math degree at Princeton. Then he saw the light, saw that databases were the way to go in life, and he switched over. So we're happy to have him talk. As always, while Alex is giving his talk, if you have any questions, please unmute yourself and fire away at any time, so it's a conversation with Alex and not him talking to his screen in Zoom for an hour by himself. So Alex, we appreciate you being here. The floor is yours, go for it. Thank you. Thanks, I really appreciate it. I want to reiterate: feel absolutely free to interrupt me. I have a flexible-length talk, so there's definitely stuff that's totally optional. If we eat up a lot of time talking about other things, that would be awesome, because I think it's a better way for the talk to go anyway. All right. So today I'm going to talk about SplinterDB, which is a key-value store that we've been building at VMware. This slide just has my name on it, but a ton of people have worked on SplinterDB, and I'm not going to call them all out because it's just a big list: people from research and engineering, kind of all over the place, as well as individual contributors. So I'm going to start with the story of SplinterDB, and this is one way of telling that story. SplinterDB starts with a product group at VMware called vSAN, and they needed a new way of storing metadata. I'll get into that in just a second. Then the story moves away from systems into theory, because the way we decided to solve this problem was by modeling it as an external memory dictionary, which is a theory problem. We came up with a data structure inspired by some techniques used in external memory dictionaries, which we call the mapped Bε-tree. We implemented that data structure and turned it into SplinterDB, which is the end product. All right, so let's start with vSAN. What is vSAN? vSAN is a virtualized storage solution. The idea is that you have a machine that's running some virtual machines and it has some local physical devices, and you want to slice and dice those physical devices into storage containers that the VMs can access, maybe with some SLAs or something like that. That's the simplest version of what vSAN does. Okay, so what is this metadata that's the problem? In virtual storage, kind of like in virtual memory, you have to maintain a collection of virtual-to-physical translations. If I want to go access some data that's sitting in one of these virtual storage containers, I'm going to have some virtual disk address, and I need to translate that into the physical address where the data actually lives. So you can think of this as being sort of like a page table. But unlike a page table, it's a little less specialized: there's a bunch of other metadata that's stored as well. So you can't just rely on some narrow solution to this problem; you have to use something general purpose. Another difficulty is that, because this is storage, all of this has to be durable.
It's not something you can just keep in memory and forget about. Okay, so why is this particular problem hard? Well, there are several reasons. One is that this is on the critical path of all IO, so it has to be really fast; you can't be doing a bunch of extra work. vSAN is supposed to be a nearly invisible part of the storage layer, so you can't consume a lot of resources doing this. You can't have a large in-memory cache or a large in-memory index or anything like that; you can't use a lot of memory. And you don't want to use a lot of CPU resources either, right? It's supposed to act like it's not even there. This problem really came up as vSAN was trying to support fast new devices like Optane-based SSDs and other ultra-low-latency devices. On slower devices, you can do a lot of work and amortize it against the slow device, but on faster devices, those kinds of solutions didn't work anymore. And the last problem, which I'll get to more in a second, is that it's hard because metadata is fine-grained. Each little piece of metadata is only a small fraction of an IO, so you have to be careful with how you read and write that metadata. Really, at the end of the day, the problem was that the state-of-the-art key-value stores couldn't keep up with this workload and would always become the bottleneck. So we decided to create something new. All right, so at this point I'm going to cross over into theory. I'm not going to talk about theory very much in this talk; this is really just a little bit of flavor of what we were thinking about and where we're coming from. This is really going to be a systems talk, or really a data structures talk, which is kind of bridging the gap between theory and systems. But we'll go into theory for just a second. Okay, so how do we think about this problem? How do we model it? I was just saying that metadata is fine-grained: each individual piece of metadata is really small. Here on the slide I have them as 48 bytes. That's not actually what we use in vSAN, but it's in the ballpark. And each IO to a device, even a really fast device like an Optane SSD or an ultra-low-latency SSD, has to happen at four-kilobyte granularity. So you have this mismatch. If you just go and write each individual item directly to the device, it's going to be extremely expensive; you're going to have huge write amplification. And the same thing can happen on the read side. This problem is really well modeled by what's called the external memory model. In the external memory model, you have an internal memory of size M and you have an external storage, which can be unlimited in size, and you can read and write to and from the external storage in blocks of size B. When you go to analyze your algorithm or your data structure, each one of these blocks that you read or write counts as one IO, and the total number of IOs that your algorithm or data structure uses is its cost. Here B is going to be the number of items in an IO. So with 48-byte items, B is roughly 4 kilobytes over 48 bytes, so something like 85, call it 100-ish. And something to note here is that this model only works in this sort of setting. If you make the items larger, then it won't work as well, right? If you had one-kilobyte items or four-kilobyte items or larger, this wouldn't really be an appropriate model.
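To make that mismatch concrete, here's a tiny back-of-the-envelope sketch. The 48-byte item size and 4 KiB block size are the illustrative numbers from the slide, not the exact vSAN values:

```python
# Back-of-the-envelope external-memory accounting for fine-grained metadata.
# Illustrative numbers from the slide: 48-byte items, 4 KiB device blocks.

ITEM_SIZE = 48       # bytes per metadata item (ballpark, not the real vSAN value)
BLOCK_SIZE = 4096    # bytes per IO on a fast SSD

# B = number of items that fit in one block (the "block size" in the external memory model)
B = BLOCK_SIZE // ITEM_SIZE
print(B)                         # 85, i.e. roughly 100

# Naive approach: write each item with its own 4 KiB IO.
naive_write_amp = BLOCK_SIZE / ITEM_SIZE
print(round(naive_write_amp))    # ~85x more bytes written than logically inserted

# Ideal batched approach: pack B items per block, so the amortized IO per item is 1/B.
ios_per_item_batched = 1 / B
print(ios_per_item_batched)      # ~0.012 IOs per item, sub-constant, which is the goal
```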
And something that's sort of interesting here is that the external memory model originally came about as a way to model hard drives. On a hard drive, any item that you read or write is going to be much smaller than the granularity at which you get good performance from the drive; you want to be reading and writing a hard drive in megabytes, ideally, to get the best performance. So if you have something that's several kilobytes, that's only a small fraction of one of those efficient IO blocks. But here, because our data is so small, the external memory model works well too. So we're going to stay in theory for just a little bit longer. That's the external memory part of this. This is also a dictionary problem. A dictionary is the theoretical version of a key-value store: an abstract data structure into which you can insert new items and do point reads of those items. Sometimes dictionaries also include deletions, and sometimes they also include scans, but the simplest version is just insertions and point reads. What's interesting about external memory dictionaries is that they come in two flavors, and those two flavors have different performance limits. In theory-speak, they have different lower bounds. And we're going to see that in practice, these lower bounds make a big difference. Okay, so the first flavor is comparison-based dictionaries. In this model, you have keys stored in your dictionary, and you're not allowed to do anything with the keys other than compare them to each other. If I have two keys, I can only ask: is one less than, equal to, or greater than the other? This means you're not allowed to do things like hashing, you can't use filters, and if you can think of something else clever to do with your keys, you're probably not allowed to do it either. You just imagine them as these opaque things that you can compare against each other. In this model, where all you're allowed to do is comparisons, there's a known lower bound called the Brodal-Fagerberg lower bound. What it says is that if you can perform insertions sufficiently fast, so around λ/B times log base λ of N IOs per insertion, then your lookups are going to be somewhat slow. So there's a trade-off here between your insertion speed and your lookup speed. The actual formulas aren't really that important; the point is that there's some trade-off where, if you want to make your insertions faster, sure, you can go and do that, but your lookups are going to get slower as a result. Okay, so what if we drop the comparison part of the model, and you're allowed to do whatever you want? I have a key: I can hash it, I can use filters on it. The reason I'm presenting these two is that they're the most natural things you'd want to do with a key other than compare it, but you can do anything else too. Then there's a different lower bound called the Iacono-Pătraşcu lower bound. This lower bound looks almost exactly the same as the Brodal-Fagerberg lower bound, but you get to drop this log base λ of N factor from the insertions, and your lookups stay the same.
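Written out, the two trade-offs from the slides look roughly like this. This is my loose restatement, so treat the exact forms as a sketch rather than the precise theorem statements; λ is the tuning parameter, B is the block size, and N is the number of items:

```latex
% Comparison-based dictionaries (Brodal--Fagerberg): fast inserts force slow lookups.
\[
\text{insert cost} = O\!\left(\tfrac{\lambda}{B}\,\log_{\lambda} N\right)\ \text{IOs}
\quad\Longrightarrow\quad
\text{lookup cost} = \Omega\!\left(\log_{\lambda} N\right)\ \text{IOs}
\]

% With hashing allowed (Iacono--Patrascu): the log factor drops from insertions,
% while lookups stay the same.
\[
\text{insert cost} = O\!\left(\tfrac{\lambda}{B}\right)\ \text{IOs},
\qquad
\text{lookup cost} = \Omega\!\left(\log_{\lambda} N\right)\ \text{IOs}
\]
```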
So what this says is that if you can use hashing or filters in some clever way, or something else that isn't just comparisons, you get this basically free insertion performance boost over the comparison-based dictionary. And in practice, this log N term is pretty big. It's like a factor of 10, so it's something you actually care about. This isn't nothing. Okay, so those are the two lower bounds. And we have a question from Ben. Yeah, when you can hash your keys, why isn't lookup an O(1) operation? So that's an interesting question. You can do that. For example, suppose I have a hash table, just like a regular in-memory hash table, but I put it on disk. You can get O(1) reads, but then you're going to have really slow writes. The problem is that each one of those writes is going to have to access a block on disk and perform a write. And you might say, well, that's only one IO, that's not a big deal. But the idea here is that in this model, this λ/B term is sub-constant in general, and usually quite sub-constant. So yes, this is sort of accounted for by the model. I'm not sure if it sits exactly on this trade-off curve or if it's a slightly separate case, but you can get your lookups to be constant, and then your insertions are also going to cost a constant number of IOs, which is what you don't want. That's a great question though. Okay, so in the literature, what's known about these types of problems is that there are optimal data structures for both of these lower bounds. For the comparison-based model, there are B-trees and Bε-trees. B-trees are optimal in a certain specific parameter setting, where reads are really the only thing you care about: they have slow writes, and they have reads basically as fast as you can get. Bε-trees are another data structure that allows you to get much, much faster writes at the cost of slightly slower reads. And you might say, well, I don't know about these data structures, these are some weird theory things. Well, the data structure that comes up more in systems is the log-structured merge tree, which is very closely related to a Bε-tree. While it's not actually optimal under this lower bound, if you make some small tweaks to it, tweaks that matter maybe more in theory than they do in practice, then you get something that looks a lot like a Bε-tree, and it would be optimal under this lower bound. Okay, on the right-hand side, on the other hand, where you get these faster insertions basically for free, the things that exist so far are the Iacono-Pătraşcu hash table and something called the bundle-of-arrays or bundle-of-trees hash table. I just want to call this out because this is some work that I did in theory several years back; it appeared at ICALP in 2018. But these things don't really work super well in practice, in part because they don't support scans, and also because they're actually just kind of complicated data structures. Don't let the name hash table fool you here; these are really theory-only data structures. So the data structure that we use for Splinter is actually going to live on this right-hand side, where we get faster insertions for free, and it's also going to support scans.
And it does this by borrowing some of the ideas from that earlier work on bundle-of-arrays and bundle-of-trees hash tables, and integrating them into normal B-trees, which are nice data structures and things you can actually implement. Okay, so the rest of the talk is going to be basically: what does this mapped Bε-tree look like, why does it have good performance, and how does it actually get implemented in SplinterDB, as well as some other cool things we get to do with it? Okay, before I dive right into that, I'm going to give you some top-line performance numbers for Splinter, because just because something is optimal in theory doesn't mean that it's actually good in practice, right? It's not unusual for theoretical data structures to not implement so well. So first I'm going to introduce RocksDB, because that's the state of the art that we compare against. RocksDB is a high-performance key-value store from Facebook. It's designed for solid-state devices, it's been around for a long time, and they have a full-time engineering team that works on it. I'm sure you all know RocksDB. Okay, so here's the basic operational performance of SplinterDB against RocksDB. In this plot, we have a machine with 28 cores and an Intel Optane SSD. The dataset here is relatively small key-value pairs, 24-byte keys and 100-byte values, and it's an 80-gigabyte dataset where the system gives 25 gigabytes of RAM to the key-value stores. On the left are insertions. In this plot, the y-axis is the throughput in thousands of operations per second, and this is a uniform insertion workload from YCSB. As you can see, SplinterDB is nine times faster than RocksDB. And it doesn't come at the cost of reads: on point lookups, this is YCSB run C with uniformly distributed reads, SplinterDB is two and a half times faster than RocksDB. If we look at the entire YCSB benchmark suite, you can see that SplinterDB beats RocksDB on every single one of these workloads. All right, so where does all this performance come from? It comes from the data structures that we use, and like I said, this is going to be what the rest of the talk is about. The way I'm going to get to the mapped Bε-tree is to start slow at the beginning and work up: I'll start with B-trees and work through different variations of Bε-trees to get to what the mapped Bε-tree actually does. So I start with a B-tree. A B-tree is a B-ary search tree. It has internal nodes that are packed full of pivots, and it has leaf nodes that are packed full of key-value pairs. If I want to insert an item into a B-tree, I start at the root, I compare my item's key to the pivots in the root, and this tells me which child to go to. I repeat this process in each internal node, and eventually I find the leaf and insert my item into it. It's well known that the insertion and lookup costs are logarithmic in the number of items in the B-tree. Okay, so that's the B-tree. Let's go to Bε-trees. A Bε-tree is a search tree just like a B-tree, but the internal nodes of a Bε-tree, rather than being packed full of pivots, have only a small fraction of the node devoted to pivots, and the rest of the node is used as a buffer. The leaves of a Bε-tree are just like a B-tree's: they're packed full of key-value pairs.
And something I should mention, that I don't want to gloss over, is that the nodes in a Bε-tree tend to be much larger than the nodes of a B-tree. Where B-tree nodes in practice are usually four, eight, sixteen, maybe 64 kilobytes at the largest, in a Bε-tree they're often larger: a megabyte is not uncommon, or even more. You can kind of think of each one of these nodes as being like an SSTable in a log-structured merge tree, if you're more familiar with those. Do you play any tricks with having the upper nodes be a larger or smaller size than the lower nodes? That's an interesting question. In theory, you definitely can; in practice it would be interesting to see what you can do. Why can't you do that in theory? Well, you can do it in theory, but you don't get anything, because this is already optimal in theory. So any additional benefit you'd get would show up in practice. Yeah, you've got to pay the bills, right? You've got to build the thing. I don't feel like theory's that easy. Constants matter, right? Yeah, absolutely, it does matter. That's interesting. I'm not sure exactly what would happen there, whether you would get some sort of benefit or not. I'm just curious. Good question. It's kind of like a Monkey-flavored thing: maybe you make your filters larger or smaller at different levels and you get something. I'd be interested to see if there's something there. Okay, so that's the Bε-tree. Let's see how insertions work in a Bε-tree. If I have an item I want to insert, what I do is I just put it into the buffer of the root. And that's it: the insertion is done, you don't have to do anything else. Once it gets put in the root buffer, you're done. Suppose I want to do some more insertions. I just keep putting them in the root buffer and everything's going great, until eventually the root buffer fills up. What you do when a buffer fills up is you pick the child that's receiving the most messages, which in this case is this middle child, and you move them into that child's buffer. This is called a flush. The thing to note here is that when you perform one of these flushes, you're always moving multiple items. And if you imagine these nodes hold lots and lots of items, not just four so they fit on the slide, then you're moving lots of items at once from node to node. So you amortize the cost of flushing across all of those items. Let's keep inserting some items into the root; we've made some room so we can do that. When it fills up, we do the same thing: we flush to a child. Now in this case, that child is also full, so we repeat the process. We pick its child that's receiving the most items and move them to that child's buffer. And these flushes can continue down the tree like that. So Bε-trees have much faster write performance than B-trees because we're always moving multiple items down the tree at once, and this batching gives us much better performance. If we imagine that the root node is always in memory, then each time we actually access the disk, we're doing multiple writes at once, and we can even get sub-constant write cost per insertion.
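Here's a minimal sketch of that insert-and-flush logic in Python. It's a toy in-memory model of the idea, not SplinterDB code (SplinterDB itself is written in C), and the tiny node capacity and simplified pivot handling are assumptions made just to show the flush-to-fullest-child behavior:

```python
# Toy model of Bε-tree style buffered inserts: each internal node has pivots
# plus a buffer, and a full buffer is flushed to the child receiving the most items.

BUFFER_CAPACITY = 4  # tiny, so the example is easy to trace; real nodes hold far more

class Node:
    def __init__(self, pivots=None, children=None):
        self.pivots = pivots or []        # pivots[i] separates children[i] and children[i+1]
        self.children = children or []    # empty list means this is a leaf
        self.buffer = {}                  # key -> value messages waiting to move down

    def child_for(self, key):
        """Index of the child a key belongs to, by comparing against pivots."""
        i = 0
        while i < len(self.pivots) and key >= self.pivots[i]:
            i += 1
        return i

def insert(root, key, value):
    root.buffer[key] = value              # an insert just lands in the root buffer
    if len(root.buffer) > BUFFER_CAPACITY:
        flush(root)

def flush(node):
    """Move the buffered items headed to the fullest child down one level."""
    if not node.children:                 # leaf: in a real tree this is where data rests
        return
    by_child = {}
    for key in node.buffer:
        by_child.setdefault(node.child_for(key), []).append(key)
    target = max(by_child, key=lambda c: len(by_child[c]))   # child receiving the most items
    child = node.children[target]
    for key in by_child[target]:          # one batched move amortizes the IO cost
        child.buffer[key] = node.buffer.pop(key)
    if len(child.buffer) > BUFFER_CAPACITY:
        flush(child)                      # flushes can cascade down the tree

# Example: a two-level tree with pivot 50; the fifth insert triggers a flush.
root = Node(pivots=[50], children=[Node(), Node()])
for k in [3, 61, 17, 72, 88]:
    insert(root, k, f"val{k}")
print(root.buffer, root.children[1].buffer)
```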
Okay, so that's how insertions work. What about lookups? Well, they work just like they do in B-trees, really, except now we have these buffers along the way, so we have to check those buffers as well. If I'm doing a point query for, say, 71, I start at the root. I look at its buffer. If I don't find the 71, I look at the pivots, which tell me which child to go to. I look in that child's buffer, and I work my way down the tree. In this case, we ended up finding 71 in a leaf, but we could have found it in a buffer along the way. Okay, so that's how lookups work. They're slightly slower than in B-trees, because we have these buffers along the way, so our fanout isn't as high as it would be in a B-tree, but other than that it's very similar. Okay, so I just told you that Bε-trees have fast insertions, and now I'm going to tell you that they're not as fast as they could be. They're more expensive than they look, and what I'm going to say here is also true of most log-structured merge trees. So why is that? Suppose we're doing insertions into a Bε-tree, we have a node here, and we're flushing some data from its parent into the node. What do we have to do to make that flush happen? We have to read that node from storage, then we have to merge the new data into the data that's already in that node, which requires using the CPU, and then we have to write the node back out to storage. What I want you to notice here is that the amount of work we do isn't proportional to just the new data that we're adding; we also have to do work on the old data. When we read the node, we're reading the old data. When we write the node, we're writing the old and the new data out together. And when we merge the data together, that also involves the CPU running through all the data in the node, both the old data and the new data. If we keep flushing data into this node, the data that's already there is going to be written over and over again. So here in this animation, the key-value pairs that are turning darker red are getting written over and over again. And it turns out that this can happen up to B^ε times per node on average. So this is a B^ε blowup in the amount of write amplification in the system, and also in the amount of processor time we're using up doing all of these compactions in these nodes. Okay, so we're going to use a new data structure that tries to address this problem, called the size-tiered Bε-tree. This is work that appeared at ATC in 2020, and it's the original data structure that SplinterDB was built on. A size-tiered Bε-tree is just like a Bε-tree, except we're going to store the buffer discontiguously. So here in this picture I have a Bε-tree node: it's got its pivots, and the rest of the node is buffer. Now we're going to break this buffer off from the pivots, and we're going to break it into several discontiguous pieces. I'm going to introduce some terminology that we use in Splinter to describe this data structure, which I'll use in the rest of the talk. We refer to the pivot part of the node as the trunk node, and the collection of all these trunk nodes together in the superstructure as the trunk tree. The trunk node has the pivots, it also has pointers to these buffers, and it has some additional metadata about the node as well. The buffers I'm going to refer to as branches. So you have a tree and it's got some branches coming off of it; that's the implicit metaphor here. So how is this going to help us on insertions? I'm only going to show you one little piece of the tree, rather than the whole thing, because it gets very complicated.
So here, what I have in this slide is a trunk node with no data in it at all. It just has some pivots and some metadata, just that part of the node. And I'm going to flush some data into it from its parent. What you do in a size-tiered Bε-tree is you just add that data as a new branch, using essentially a pointer swing: you add a pointer in the trunk node that points to this new branch that you've added. When it's time to flush more data into the node, we do the same thing: we add another pointer to the new data. And the point here is that we're not touching anything in that old branch. We don't have to read it from disk, and we don't have to look at anything that's in there at all. Now, this can complicate things a little bit, because these branches may have overlapping key ranges. If I color in these pivots and the children that they point to, and I color in the key-value pairs in these branches, you can see each branch can have keys that are going to any of the children. They can all be mixed up in any combination you like. Okay, so let's add one more branch to this node. Again, this is just a flush from the parent adding a branch. And now let's say the node is full, which really just means that some threshold for the amount of data in the node has been passed. What are we going to do? Well, we do the same thing that we do in a Bε-tree: we pick the child that's receiving the most items in this node, which in this case is this orange child, and we flush them to that child. That means we read through each of these little buffers, pull out all the keys that are going to this child, merge them together, and write them out as a new buffer that gets added to that child with a pointer swing. And the key observation here is that each key-value pair is read and written only once per trunk node as it descends down the tree. So we don't have this problem of reading and rereading and writing and rewriting the data at each node. This gives us a speedup over Bε-trees. You have a question? Yeah. So lookups would take more time with this, right? Because these pieces that are not the trunk are spread across the disk, with pointers to them, and they're not sorted with respect to each other. Yep, you're exactly right. This is right where I'm going. As you pointed out correctly, when we do lookups we're going to have to check every branch along the path now, instead of just the single buffer that was in the node. I'll walk through this slowly, but you're entirely right: we're going to be checking in more places. So we do a query. We look in a trunk node; here I've squished down these branches so that they fit on the slide, and each one of these blueish-purple rectangles represents a branch of key-value pairs. We have to go look in every single branch down the root-to-leaf path. So we can end up looking in a lot of branches as we go, and eventually we find our key-value pair. Overall it's the same process, but now we might have to look in more places. If we actually look at the lookup cost, as you're saying, we get a larger lookup cost; it turns out to be about a factor of B^ε more. So this makes us very sad.
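Here's a rough sketch of that trunk node with branches, again as a toy Python model rather than real SplinterDB code. The branch representation (a small dict standing in for an on-disk sorted run) and the fullness threshold are simplifying assumptions; the point it illustrates is that an incoming flush is just a pointer append, the merge only happens when this node flushes to one child, and without filters a lookup has to check every branch:

```python
# Toy model of a size-tiered Bε-tree trunk node: an incoming flush appends a
# branch pointer, and merging only happens when this node flushes to one child.

NODE_THRESHOLD = 8   # total buffered items before this node flushes downward

class TrunkNode:
    def __init__(self, pivots=None, children=None):
        self.pivots = pivots or []
        self.children = children or []
        self.branches = []        # each branch: key -> value dict, standing in for a sorted run

    def child_for(self, key):
        i = 0
        while i < len(self.pivots) and key >= self.pivots[i]:
            i += 1
        return i

def receive_flush(node, branch):
    """A flush from the parent is just a pointer swing: append, touch nothing old."""
    node.branches.append(branch)
    if node.children and sum(len(b) for b in node.branches) > NODE_THRESHOLD:
        flush_to_fullest_child(node)

def flush_to_fullest_child(node):
    # Count, per child, how many buffered items are headed its way across all branches.
    counts = {}
    for b in node.branches:
        for k in b:
            c = node.child_for(k)
            counts[c] = counts.get(c, 0) + 1
    target = max(counts, key=counts.get)
    # Merge the matching keys out of every branch (a newer branch's value wins)...
    merged = {}
    for b in node.branches:
        for k in list(b):
            if node.child_for(k) == target:
                merged[k] = b.pop(k)
    # ...and hand the result to the child as a single new branch.
    receive_flush(node.children[target], merged)

def lookup(node, key):
    """Without filters, we must check every branch on the root-to-leaf path (newest first)."""
    for b in reversed(node.branches):
        if key in b:
            return b[key]
    if node.children:
        return lookup(node.children[node.child_for(key)], key)
    return None

root = TrunkNode(pivots=[50], children=[TrunkNode(), TrunkNode()])
receive_flush(root, {3: "a", 61: "b", 17: "c"})
receive_flush(root, {72: "d", 61: "e", 88: "f"})
receive_flush(root, {5: "g", 91: "h", 44: "i"})    # crosses the threshold, triggers a flush
print(lookup(root, 61), lookup(root, 91))          # 'e' (the newer update wins), 'h'
```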
Another one of those. Yeah, go ahead. So what I don't understand is what's happening now on an insert: without any sorting, without putting it into the final sorted array, the data is just placed in some empty area on the disk, which is similar to what an LSM tree does. I think my question is, shouldn't insertions be the same between the size-tiered Bε-tree and an LSM tree? I can understand why lookups take more time. So there are two flavors of LSM trees, right? There are leveled LSM trees and size-tiered LSM trees. And this is going to be roughly the same as a size-tiered LSM on insertions. Where size-tiered LSMs give you an improvement over leveled LSMs on insertion performance, this gives you the same kind of performance improvement. So you can make that connection; they're not unrelated data structures. That makes sense. So in the numbers that you showed in one of the earlier slides, the RocksDB insertion performance was much less than SplinterDB's. So RocksDB, at least by default, uses a leveled LSM, and that's where a lot of the insertion performance difference comes from. So where this is going is: this is how you make insertions fast, and now the question is, how do you fix the slow reads that are going to come from it? That's it. Yeah, no problem. I see some more questions. Yeah, go for it. He's also formerly of Tokutek, by the way. Yeah. Oh, cool. I recognized some people on the earlier slide with everyone on it. That's pretty cool to see. Yeah, I bet you've met Doni and all those people, right? Very cool. Yeah. So I was wondering, on your previous slide for the size-tiered Bε-trees: after a flush, what do you do with the elements that you flushed out of the higher node? You see these empty spaces kind of disappearing, but are you rewriting those buffers in place? Are you coalescing things in the internal node that they got flushed out of? What happens there as the data keeps going down? That's a great question. I was kind of intentionally leaving this as a mystery because I don't want to get too deep into the details. At the end of my talk I'm going to talk about a policy called flush-then-compact where I kind of explain this, but we might not get to it, which is fine. The idea is that we just mask them out and we kind of leave them there. If the flushed regions are large enough that we can free up the space in those buffers for reuse, then we go ahead and do that. If not, we just leave holes in the buffers. These buffers, I know I drew them small here, but they're actually pretty big: they're measured in megabytes. So we're happy to cut holes in them and just use metadata to mask out the stuff that's been removed. Okay, cool, thanks. Thanks a lot. I think I'm out of questions as well. Cool. Yeah. Sorry, I might have missed something here. We're still in the external memory model, right? So you're reading and writing everything in blocks. Sure. But you just mentioned that these buffers are megabytes in size, so I got confused again. Yeah, so the way I would say this is that we're not sticking completely rigorously to the external memory model. On the kind of devices we're really targeting, for the external memory model you'd expect the block size to be like four kilobytes, or not much larger than that, and here I'm making these things larger. It turns out that works better in practice for various reasons; a lot of times there are other parts of the storage system that become bottlenecks.
So for example, going through the kernel interfaces in order to do writes and such, especially since we use asynchronous support from the kernel, and those paths are slow. So it makes sense to have larger structures to read and write. We actually perform IO at something like 128 kilobytes on the write path as much as possible, and at about four kilobytes on the read path. But this is getting into implementation details that I don't think are super important. Yeah, we're bending our relationship to the external memory model here a little bit to make things work out. And also because we want to be able to cut into these buffers like this, right? We want those cuts to be at the external-memory block size, not whole buffers, because we want to be able to efficiently read out chunks of them. So I think that's a great question. Thanks. No problem. Anyone else? I think that's all I saw in terms of raised hands, but I can't see everyone on my screen. Yeah, go for it. There's another question about bloom filter optimizations, like in LSM trees, whether there's anything like that, and I believe you have something about that later in the talk. Yeah, we're going to be talking about bloom filters a bunch soon, so let's go ahead and get to that. Okay. My machine is slowing down for some reason; I can't run Zoom and Keynote at the same time. Okay. All right, so let's start with how you fix lookups. We're going to fix lookups in two ways. One is a simple way that does a lot but doesn't work in all cases, and then we'll do a slightly more involved way that really handles everything you can throw at it. Okay. So the problem here is that each node has multiple branches, and when we do point reads, we have to go look at each branch. The first idea you might come up with to fix this is to use filters to avoid looking at every branch. A filter is just a probabilistic data structure which answers membership queries with no false negatives. Somebody already mentioned bloom filters; there are also other variants like cuckoo filters and quotient filters, and they all do basically the same thing. In SplinterDB we originally used quotient filters. So here in a trunk node, we could put in a filter for each branch, and I use these pink sieves to represent the filters. Now when we do a lookup in the node, we'll only search those branches which contain the key, plus some rare false positives that might come up. So if I do a query, I look in the node, I look in the first filter, and it says no, you don't have to look in this branch; I look in the second filter, so now you don't have to look in this one either; I look in the third one, and it says yes, you do; I go and find the key, and this mostly works out. It turns out that if you can tune your false positive rate small enough, then you can get lookups down to a very small number of IOs across the entire system. You can even do lookups in barely more than a single IO, if you have enough memory to keep all these filters in memory.
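A rough sketch of that per-branch-filter lookup, in the same toy style as before. I'm standing in for a real bloom or quotient filter with a Python set of key hashes, which has essentially no false positives here, so treat it purely as an illustration of the control flow: consult a filter per branch, and only search the branches whose filter says maybe:

```python
# Toy illustration of per-branch filters in one trunk node. A real bloom/cuckoo/
# quotient filter would be compact and allow rare false positives; a Python set
# of key hashes is a stand-in with the same "no false negatives" contract.

branches = [                      # oldest to newest sorted runs in one trunk node
    {3: "a", 17: "c"},
    {61: "e", 72: "d", 88: "f"},
    {5: "g", 91: "h", 44: "i"},
]
filters = [{hash(k) for k in b} for b in branches]   # one filter per branch, built at flush time

def lookup_with_filters(key):
    """Check newest branch first; only search a branch whose filter says 'maybe'."""
    for b, f in zip(reversed(branches), reversed(filters)):
        if hash(key) in f:        # a filter miss means we skip the branch (and its IO) entirely
            if key in b:
                return b[key]
    return None                   # not in this node; a real lookup would descend to a child

print(lookup_with_filters(61))    # "e", after skipping the branches that can't contain it
print(lookup_with_filters(99))    # None, typically without touching any branch at all
```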
Okay, so that's the simple thing. You said it mostly works out; when doesn't it work out? Right, thank you, that's exactly where we're going: really fixing lookups in size-tiered Bε-trees. So the problem with this solution is that you create all these filters that you have to look at, and that can get really expensive. We have multiple filters per node, and in practice, on not even particularly large datasets, we can see 15 to 40 filter lookups per point query. And you might say, well, that's not a big deal, you're probably going to do an IO anyway, right? So why do you care about all these filter lookups? But there are a bunch of situations where that math doesn't quite work out. The first is sort of simple: if your dataset fits in memory, or in the similar situation where you have a few hot queries whose data fits in cache, you're not really going to be doing much or any IO. So this CPU limitation, the amount of CPU work you do on all these filter queries, is going to limit your performance by a lot. Even if you're in the normal scenario where you're doing an IO for every query, the CPU cost of all these filter lookups still matters, because on fast devices, doing 40 filter lookups is a substantial chunk of the cost of doing a single IO. So you're going to need more threads to be able to drive the same query throughput to the device. And finally, in a case that's really important to us with our motivating example from vSAN: if you don't have enough memory to keep all these filters in memory, your performance degrades because they get paged out to disk. Then you have this bad choice. You can go ahead and look at the filter that's on disk, but then you're doing an IO just to determine whether or not you should do an IO to go read the branch itself. Or you can just skip the filter, but then you're stuck reading all these branches again. So there are a bunch of scenarios where this isn't quite good enough. Okay, so this leads us to the solution we came up with, which is called maplets. So what's a maplet? A maplet is a data structure that's like a filter, but it also stores small values. A regular filter answers the basic question: I have a key x, and I can ask, is x in the set? The filter will say yes or no, and that's all it can tell you. Now, a maplet does a little bit more. You ask, is x in the set? It'll say either that it's not in the set, or that it maybe is in the set, along with some small value, say four. It can even return a set of values, so it might say: yes, it's in the set with values three, four, and seven, and that's totally okay. Just like a filter, it has no false negatives, and it has the same false positive guarantee that a filter has. It also turns out that if you consider the cumulative false positive rate of a collection of filters, and you replace them with a single maplet, then even though the maplet has to store these additional values, it will have the same memory footprint as those multiple filters for the same cumulative false positive rate. As for the performance of maplets: when you implement them using quotient filters, which is what we do, they have the same performance cost as a single quotient filter. Every lookup into the maplet, if it's in memory, is two cache-line misses, and you can perform lookups into a maplet on storage at the cost of a single IO.
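Here's a toy sketch of the maplet interface, just to pin down the contract: no false negatives, rare false positives allowed, and small values (possibly several per key) instead of a bare yes/no. The real thing is built on quotient filters and stores key fingerprints compactly rather than in a Python dict, so this is an illustration of the semantics, not of the implementation:

```python
# Toy maplet: like a filter, but a positive answer carries small values.
# Contract: no false negatives; false positives (spurious extra values) are allowed.

class ToyMaplet:
    def __init__(self, fingerprint_bits=16):
        self.bits = fingerprint_bits
        self.table = {}                       # fingerprint -> list of small values

    def _fp(self, key):
        # Short fingerprint of the key; collisions are what produce false positives.
        return hash(key) & ((1 << self.bits) - 1)

    def insert(self, key, small_value):
        self.table.setdefault(self._fp(key), []).append(small_value)

    def query(self, key):
        """Return [] ("definitely not present") or a list of candidate small values."""
        return self.table.get(self._fp(key), [])

m = ToyMaplet()
m.insert("user:42", 0)      # e.g. "this key appears in branch 0"
m.insert("user:42", 2)      # ...and again in branch 2
m.insert("user:77", 1)
print(m.query("user:42"))   # [0, 2]
print(m.query("user:99"))   # usually [], and it never misses a key that was inserted
```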
Okay, so that's a maplet. How do we use them in SplinterDB? This is where we get to mapped Bε-trees. What we do is replace the individual filters in a size-tiered Bε-tree with a single maplet. So here I have these three pink sieves, and I'm going to change them into a pink signpost to represent that we've replaced them with a maplet. Then we use the values in the maplet to store which branches contain matching keys. We take our branches, order them, and say this is branch zero, one, two, three, and we store those numbers in the maplet. So now when we do a query in the node, we just look in this single maplet, it outputs an index, and we immediately know which branches we need to look at. We only do a single query per node. If this trunk node, or rather this maplet, is in memory, that's fine, and if it gets written out to disk, we can also access the maplet directly from disk in a single IO. That one IO will tell us which of these branches to go to. So we never do more than one IO in a node, unless we're actually going to a branch, in which case we might do two. All right, so how does this actually play out in practice? I'll show you some read benchmarks we did with SplinterDB. Like I said, there are these three scenarios where regular filters can fall short: low memory, medium memory, and high memory or hot queries. Let's start with the low-memory case. What we predict is that in the low-memory case, filters will get paged out to storage, and so we'll have lower lookup performance overall. Here's a plot of what happens. In this plot, I have SplinterDB with maplets in pink, SplinterDB with quotient filters in purple, and RocksDB here in blue as a baseline. The y-axis is the throughput in thousands of operations per second; this is the YCSB C workload. On the x-axis we have the number of threads with which we're performing these queries. These are all synchronous queries; we're not doing anything asynchronous here. Okay, so if we compare SplinterDB with maplets to SplinterDB with quotient filters, they both relatively quickly hit a wall; it doesn't require a lot of threads to hit the limit of how many queries you can do, and that's because they're both maxing out the device IOPS at that point. But SplinterDB with maplets, because it can do IOs to these maplets rather than having to do IOs to all the individual filters in the nodes, can get a bunch more throughput: it gets about 25% more throughput than SplinterDB with quotient filters. If we move to the medium-memory setting, here we predict that we're going to do roughly one IO per query, because the filters fit in memory but the data doesn't. So what we predict is that the cost of filter lookups means we're going to need more threads to drive the same throughput, because the filter lookups are chewing up a bunch of CPU time on each of these threads. If I show a similar plot here, notice the y-axis is at a different scale, so everything is way faster; that's not surprising because a lot more stuff fits in RAM. You see SplinterDB with maplets in pink and with quotient filters in purple. They get to near the same endpoint once you have 28 threads, but with maplets we get there much faster: we only need 16 threads to match the performance of SplinterDB with quotient filters at 28 threads. All right, so finally, we can look at the case where everything fits in RAM, or, like I said, you can imagine the case where you have a few hot keys that you're querying.
So in this case, we predict that you're basically not doing any IO, and the performance should be limited by CPU. Unsurprisingly, this is where maplets really shine, because all these filter queries are really expensive if you don't have any IO to amortize them against. So here, with maplets, we get about 50 to 60% higher throughput, no matter how many threads you use. And it's also unsurprising in this plot that SplinterDB with quotient filters is slower than RocksDB, because RocksDB is leveled, so it doesn't have this additional fragmentation that you have when you're size-tiered. With maplets, we can get around that. Okay, any questions at this point? Should I keep going? I'm going to a slightly new topic, so I'll take this moment to drink some tea anyway. I guess my question is, the maplet and the tiering part, that's novel, and I guess since they're here we can ask: what's the key difference between this and the fractal trees? So the size tiering is part of it. And it looks sort of straightforward here when we do it with cute little diagrams, but it's a little less straightforward to actually do size tiering in a Bε-tree than it looks. So size tiering and maplets are the main differences. Yeah, so it's a data structural difference. I guess what would be interesting to see is the comparison between SplinterDB with its mapped Bε-tree, sorry, that's the dog, and the fractal tree from Tokutek back in the day. I mean, in terms of abstract data structures, it is size tiering, maplets, and some additional optimizations that I'm not going to have time to talk about in this talk. And the whole implementation is obviously completely different as well, and is designed with different targets and so on. But in purely data-structure land, those are the core differences. I don't know, do you want to say that's incremental or something like that? You're entitled to your opinion, but... No, no, no, very clearly I'm not claiming that at all. I guess I need to go read the fractal tree papers again; I don't remember what their key differentiator was. What they do is they use basically the Bε-tree, right? I mean, I think they call it a fractal tree, which is somewhat like branding, but it's basically a regular Bε-tree. So the innovation here is the size tiering and using maplets and things like that. Okay, okay. Chris has a question in the chat, if you want to unmute and go ahead. Yeah, yeah, I was just, you had mentioned async APIs, and I was curious if you have any estimates of how well you benchmark with the async APIs versus the sync APIs in SplinterDB. For reads? Yes, for reads, sorry. So we have an asynchronous read operation. I don't think we've run those particular benchmarks. Basically what you get is that you can max out the device IOPS with very few threads generally, but I don't have specific numbers to show you on that. Sure. But yeah, we do have that. Cool, thanks.
No worries. All right, so Andy, when should I stop, at 50 minutes or an hour? An hour, so basically 10 more minutes. Okay, great, I might be able to get through everything then. We'll see. Okay, so once you have maplets, you can do some additional cool stuff with them, and one of those things I'm going to show you is using maplets to manage space. So what do I mean? One of the problems that people have with size-tiered data structures is that they can lead to redundant data, which can waste space. What do I mean by redundant data? Here, if I look at this node, because it has these different branches, I can have the same key show up in multiple branches. For example, here key 41 shows up three times. The first one at the bottom, with value one, might have been an insertion; then its value got updated in the second branch; and then it gets updated again in the third branch. The same thing can happen with other keys, so 79 appears twice, and some keys might only appear once. This can be because of updates; it can also be because of deletions and reinsertions. It's a thing that can happen. And so a size-tiered data structure can waste more space than a leveled data structure. Now, if you have a bunch of redundant data and you want to recover your space, you can do that any time you want by compacting these branches together into a replacement branch. Compaction is going to recover disk space when there are many updates or many deletions, by merging all these branches together. So we started with 14 items in this node on the left, and we ended up with eight items on the right, because we've merged all these items together. So why don't we just do that all the time, just compact everything? Well, if we care about space, that might not help, because suppose I have a node like this one here that doesn't have any redundant data. If I compact it, I'm not going to save any space; I'm just going to waste a bunch of time and IO doing that compaction. So where do maplets come in? Maplets can tell us how much redundant data there is in a node. How does that work? Well, if I look at this node on the left and I peek at its maplet, open up the box a little bit, what does the maplet look like under the hood? It's basically a map from keys to the branch indices where those keys live. Now, this is a high-level oversimplification: the maplet doesn't actually contain the whole keys, it contains fingerprints of the keys or something like that. That's not really important; the point is you have some representation of each key that maps to one or more indices. What we notice here on the left is that there are lots of multiple entries for these keys, because these keys appear in multiple branches. So we can just look at a sample of these keys and see how many entries they have on average. If we do the same thing on the right, this maplet looks very different: all these keys just have single entries. So if we see few multiple entries when we sample the maplet, then we know that there isn't a lot of redundant data in the node. We can put this all together into an adaptive space reclamation policy. What SplinterDB does is maintain a heap of all the trunk nodes, sorted by the estimated amount of redundant data, which we get from looking at their maplets.
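Here's a rough sketch of that estimate-and-prioritize idea. The redundancy estimate, the heap, and the space threshold are all simplified (and the "sampling" here just scans the whole toy maplet), so take it as an illustration of the policy's shape rather than SplinterDB's actual bookkeeping:

```python
# Toy adaptive space reclamation: estimate redundancy per node from its maplet,
# keep nodes in a max-heap by that estimate, and compact the most redundant node
# whenever space usage crosses a threshold.
import heapq

def estimate_redundancy(maplet_table):
    """Average number of branch entries per key fingerprint; ~1.0 means little redundant data."""
    if not maplet_table:
        return 1.0
    return sum(len(v) for v in maplet_table.values()) / len(maplet_table)

def rebuild_heap(nodes):
    # heapq is a min-heap, so negate the estimate to pop the most redundant node first.
    return [(-estimate_redundancy(n["maplet"]), name) for name, n in nodes.items()]

def maybe_reclaim(nodes, space_used, space_threshold, compact):
    heap = rebuild_heap(nodes)
    heapq.heapify(heap)
    while space_used > space_threshold and heap:
        _, name = heapq.heappop(heap)     # most redundant node: biggest bang for the buck
        space_used -= compact(nodes[name])
    return space_used

# Example: node "a" has keys appearing in ~2 branches each, node "b" has no redundancy.
nodes = {
    "a": {"maplet": {0x1a: [0, 2], 0x2b: [1, 2], 0x3c: [0]}},
    "b": {"maplet": {0x4d: [0], 0x5e: [1]}},
}
def fake_compact(node):
    return 10   # pretend compacting this node frees 10 units of space

print(maybe_reclaim(nodes, space_used=105, space_threshold=100, compact=fake_compact))  # 95
```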
So what we can do is, every time we update a maplet, every time we rebuild a maplet because we flushed new data into the node, we update the estimate of the amount of redundant data, and so we keep this heap up to date. Now we can have a policy that uses it. For example, whenever space usage gets too high, we can initiate a compaction on the top node of the heap. The top node of the heap has the most redundant data, so that compaction will return the most space, the greatest bang for the buck. The idea is that we're always doing the most efficient compactions in order to recover space, and we're not doing a lot of compactions that don't actually help us recover any space. So let's look at what happens if we run a benchmark on this. This benchmark has the same setup as before, and here we're doing 100% uniformly distributed updates to an existing 80-gigabyte dataset. In this plot, the y-axis is the throughput in updates per second, and the x-axis is the space efficiency of the system. Space efficiency means the logical amount of data in the system divided by the actual amount of space that the system is using. So 100% means you're not using any extra space at all, and 50% means you're using twice the space of the data itself. Here in purple we have regular SplinterDB, in blue we have RocksDB, and in pink we have SplinterDB using maplets. The different labels on the SplinterDB points are the thresholds at which we start performing these additional compactions. So for example, at the 128-gigabyte threshold, once the system is using 128 gigabytes of space, we start triggering these additional space-reclamation compactions. I should say that in this plot, space efficiency gets better as you go to the right, so the system is smaller to the right, and it's faster as you go from bottom to top. So I just want to point out a few things. Splinter with maplets is always faster than RocksDB and it always uses less space, which is great. But there's also this neat thing: as we set our space thresholds lower and lower, we do in fact use space more efficiently, and we smoothly trade off update throughput as we go. So that's kind of nice. You can just say, this is my space budget, and I would like to go as fast as I can within that space budget, and that's what adaptive reclamation gives you. All right, so I'm going to try to do my last topic here. This is the optional one, and it looks like I have about five minutes, so we'll see if we can get through it. This is just a cool optimization that we did in SplinterDB called flush-then-compact, and it has a pretty good performance impact. So what is flush-then-compact? Flush-then-compact is motivated by a thing that you see in B-trees and Bε-trees, which is that when you do sequential insertions in a B-tree, they're more efficient. Let's see why that is. If I start doing some sequential insertions, I start with a key; as we saw before, it goes down the root-to-leaf path and gets put in a leaf. But the thing to notice is that once we've done this first insertion, this root-to-leaf path is in cache. So now, when I go to do my next insertion in this sequential workload, it's going to zip right through without doing an IO.
And so all subsequent sequential insertions are going to be cheaper, because they only incur an IO at node boundaries. This also works in Bε-trees. If I'm doing sequential insertions into a Bε-tree, what's going to happen is they'll trigger a flush from the root node, and that flush is automatically going to cascade all the way down the tree, because all the data is going to the same place. So the data goes right down the tree in a big batch, just like it would in a B-tree, except in a batch rather than individually. That brings this whole root-to-leaf path into cache as well. So when I go to do my next set of insertions, they're also going to be cheaper, because they only incur an IO at node boundaries, just like in a B-tree. So what happens in a size-tiered or mapped Bε-tree? Just like before, we want these cheap sequential insertions; let's see what happens. If we have this data in a node, say this is the root node and we've just done a bunch of sequential insertions into it, we're going to flush all this data down to a child as a nice big branch. And that sounds great. But suppose there's already some data present in that child, a little bit left over from some earlier work. Any data that's there, we're going to have to merge again before we can flush it down the tree. So we can still end up performing these compactions, these merges, at each level. Maybe some of this will be in cache, but even if it's in cache, we're tying up the CPU just running through all this data, when what we want is a nice big chunk of data that can go right down the tree, just picking up little bits along the way. So flush-then-compact tries to address this. So here in this node, we have mostly sequential data that's come in, maybe with a few additional bits and pieces, and we're going to flush it down the tree. This diagram gets really complicated really quickly, so I'm going to replace all these branch pointers with a single fat pointer; it's just supposed to represent all of the pointers, and you'll see why that's helpful in a second. So the idea behind flush-then-compact is that we first flush references to the branches, but we don't immediately compact them. When we do this, we just add pointers to these branches in the child, and now we're looking at the same data through two different windows. We use metadata to make sure that the parent and the child each see only the data they're supposed to see: the metadata masks mean that the parent only sees the unflushed data, and the child only sees the flushed data. Now, if we want to flush from the child to a grandchild, we just go ahead and do that before we do a compaction. Again, we flush these references down with a pointer swing, and we use metadata to mask things out so that everyone sees what they're supposed to see. Finally, once we've done this, we kick off and enqueue compactions in all the nodes that have received flushed data. So we only initiate a compaction once data reaches its final location. This has a bunch of benefits. If we immediately flush data without doing a compaction, then we save all the work we were going to do on that compaction: we don't have to do any of that IO, and even if everything is in cache, we don't have to do the CPU work. We only do it once, when the data reaches where it's going.
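Here's a rough sketch of the flush-then-compact idea, continuing the toy trunk-node style from earlier. The "mask" is just a per-node set of visible branch ids and the compaction queue is a plain list, which are my simplifications; the point is that a flush is only a reference hand-off plus a mask update, and the compactions are enqueued afterwards and can run in parallel:

```python
# Toy flush-then-compact: flushing hands branch *references* down and updates
# visibility masks; the actual compaction work is queued for later, and is only
# done once data reaches the node where it will rest.

branch_store = {}          # branch_id -> key/value data (stands in for on-disk branches)
compaction_queue = []      # nodes whose visible branches should be merged, eventually

class TrunkNode:
    def __init__(self, name, child=None):
        self.name = name
        self.child = child
        self.visible = set()          # ids of branches this node currently "sees"

def flush_references(parent, branch_ids):
    """Move branches from parent to child by reference only: a pointer swing plus mask updates."""
    parent.visible -= set(branch_ids)        # parent stops seeing the flushed data
    parent.child.visible |= set(branch_ids)  # child starts seeing it

def enqueue_compaction(node):
    compaction_queue.append(node)            # kicked off only after the flushes are done

def run_compaction(node):
    """The deferred merge: combine all visible branches into one new branch."""
    merged = {}
    for bid in sorted(node.visible):         # later branches overwrite earlier ones
        merged.update(branch_store[bid])
    new_id = max(branch_store) + 1
    branch_store[new_id] = merged
    node.visible = {new_id}

# Example: a root/child/leaf chain, with sequential data cascading down by reference.
leaf = TrunkNode("leaf")
child = TrunkNode("child", leaf)
root = TrunkNode("root", child)
branch_store[0] = {1: "a", 2: "b"}; root.visible = {0}
branch_store[1] = {3: "c", 4: "d"}; root.visible |= {1}

flush_references(root, [0, 1])    # root -> child, no merging yet
flush_references(child, [0, 1])   # child -> leaf, still no merging
enqueue_compaction(leaf)          # compaction only where the data finally lands
for node in compaction_queue:     # in SplinterDB these can run in parallel
    run_compaction(node)
print(leaf.visible, branch_store[max(branch_store)])   # {2} {1:'a', 2:'b', 3:'c', 4:'d'}
```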
And in the extreme case of sequential insertions, this helps us get to a write amp of one, and it also means that we're not using a bunch of CPU resources to do it, because that data just gets flushed all the way down the tree and is only compacted in the leaf. There's a side benefit here, which is that we break a serial chain of compactions into a parallel one. If we flush and then compact and then flush again and then compact again, we have to do that all the way down the tree. But if we do all the flushes before we do the compactions, then we can just run the compactions in parallel; we don't have that serialization. And we designed SplinterDB to even allow concurrent compactions in each trunk node. So while one compaction is going on in a trunk node, you can flush more data into it and start another compaction in that node, and you can have as many as you want in any trunk node. Both of these things should help improve insertion concurrency. All right, so I'm just going to quickly go through some benchmarks of this, because I'm running out of time. To test the sequential insertion performance, we ran a single-threaded workload where some percentage of the items are sequentially generated and the rest are uniformly random. This plot shows SplinterDB and RocksDB; the y-axis is the throughput, and the x-axis is the percentage of the workload that's sequential. Note that this x-axis is very much not to scale, so don't be fooled into reading some linearity into it. As you can see, SplinterDB smoothly increases its throughput as the workload gets more sequential, and that's because of flush-then-compact. You also see an improvement for RocksDB, but it's not nearly as big as the one for Splinter. I also said that flush-then-compact improves insertion scaling. As you can see in this plot, where the x-axis is the number of threads and the y-axis is the throughput, SplinterDB's throughput scales well up until about 12 threads, where it's about seven times the throughput of a single thread. At that point it starts to level out, but that's because SplinterDB is using most of the device bandwidth at that point. All right, so let me come back to this initial slide. We kind of walked through all of this; we spent a lot of time on Bε-trees and so on, and now we can make it a full circle. SplinterDB has been integrated into vSAN 8.0, which just shipped a few weeks ago, and we're really proud of that. Also, because SplinterDB is a general-purpose key-value store and not just a specialized solution, we've open-sourced it, and you can go download the code. We'd be really thrilled if you wanted to either build on SplinterDB or just use it as a comparison point in any work that you're doing, and we'd love to help if you get stuck or anything like that. And going forward, what are some of the lessons we've learned here? We used theory, basically, to make SplinterDB go faster. We used inspiration from theory, and some of my other work also uses theory to build fast hash tables and filters and so on. And I think it's always interesting to ask what other systems could benefit as well, perhaps things with transactions or replication.
And I know this is not a theory audience, but to put it to the theoreticians out there, I think the question is: how can we get theory to focus more on problems that directly impact systems? What kinds of changes to their approach can theory people make that will make those ideas more applicable? All right, that's the end of my talk. Thank you very much. That's awesome. We have about half an hour for follow-up. Everyone, show your applause. All right, thank you. Thank you so much. It's Zoom, so the applause is quiet, but I'm happy. I'm happy too, don't worry. I'm just kidding, I've done a Zoom talk before. Okay, I have maybe one last question from me. Sure. Actually, I don't think I was the first to raise my hand, so if someone else wants to go first... Everybody else was just applauding. Okay. Yeah, so I was wondering, to what degree is reader-writer concurrency a thing that vSAN cares about? And were there any interesting challenges or solutions that you had to come up with, with the addition of the maplets and the, I forget what you called them, the additional kinds of buffers? I'm basically wondering whether there are interesting concurrency problems here: does it scale well to many readers and many writers and things like that? Yeah, that's a great question. We really cared about that a lot. vSAN cares a huge amount about concurrent readers and writers, and we tried to design the system to handle that as well as possible. It scales really well. One of the nice things about this data structure, and I didn't really focus on this, is that most of these structures are immutable. The branches are basically immutable until you slice the data out of them; so they're not immune to deletion, but otherwise they just stay the same on disk. And the maplets are also immutable: when you update a maplet, you add a bunch of data to it and then you throw the old one away. So you don't really have a lot of reader-writer conflicts on the core data structure. There's basically just the memtable, where you're ingesting data before you write it out; that's where you have to handle reader-writer conflicts, and we just have a highly concurrent B-tree there that, at least so far, is working well for our purposes. We thought about this a lot: we use distributed reader-writer locks, and we tried to make sure that our cache is highly concurrent, and things like that. We spent a lot of engineering time on it. So it's a problem that we cared about, and we addressed it with those sorts of things.