All right guys, thanks for coming. Super excited today to have Bradley Kuszmaul from MIT and Tokutek. He's going to talk about fractal trees and the stuff they built in the Tokutek product. A lot of you already know Bradley, but for those of you who don't, his career story is amazing. He has a bachelor's, a master's, and a PhD from MIT. Two bachelor's, actually. Two bachelor's — so that's four degrees from MIT. Quadruple beaver. So, five of us, that's five degrees from MIT, and we've got you at four, right? Just throwing that out there. Are you ever going to go back and get another? No, I can't. After he graduated, he went to work for Thinking Machines in the early days, in the 1980s and 90s. After that, he was a professor at Yale. And after that, he was a senior scientist at Akamai, right? And then, where he's been for the last decade or so, he's been a research scientist at MIT. For the last six years or so, he's been involved with Michael Bender on Tokutek, which is a database start-up out of Lexington and New York City. So we're really excited to have Bradley here today. And thanks, everyone, for coming. So, go for it.

Does this mic also amplify? Do I need to be amplified for you to hear me? I think I can hear you. Yeah. What about the people in the back? Can you hear me? Okay. So, I also hold the terabyte sorting record in perpetuity, because after I won it, they retired it — the time got down to about one minute, and there's a separate sorting benchmark for how much you can sort in one minute. So now, if you want to enter the big sorting contest, it's called GraySort, after Jim Gray; it's 100 terabytes, and that takes an hour. So they can never take that away from me.

I'm going to bracket this talk with two marketing slides, one at the beginning and one at the very end. The reason I'm doing this is to explain what the setup is. In this talk, I'm going to mostly talk about data structures and algorithms, but I want you to understand where they fit into the situation. Tokutek builds a library called the Fractal Tree Library that implements the data structures I'm going to talk about. Logically, it can be thought of as a replacement for a B-tree. Below it is a file system, so we just use ordinary file system calls, and there might be disks under there, or flash, or something. Above it is the application, and from the perspective of the Fractal Tree Library, the application looks like the MySQL database, or MongoDB, or there might be something else — we have a file system prototype. And then above, say, MySQL, there's an application that's written in the MySQL language or in the MongoDB language. So basically, what I'm going to talk about is the stuff in here. I'm not going to talk about how you do SQL processing or query optimization or drivers or any of that stuff.

So for database indexes, the abstraction we have to implement is fairly simple. There are three operations, maybe four. A database index, in C++ lingo, is an ordered map: there's a collection of keys and values. So we want to be able to do a lookup: given a key, what's the value associated with it?
There's an insert operation where you say, here's a new key and a value; if there's already a key matching that key, I'd like you to overwrite it, otherwise insert it. And there's a next operation which, given a key, says, well, what's the next key in lexicographic order? And this operation is the one that makes everything interesting, because if we didn't have the next operation, we could do lookup and insert with a hash table, and things would be a lot easier. So we have to maintain this order, and that's why it's an ordered map.
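To make that abstraction concrete, here's a minimal sketch of the ordered-map interface in Python, using a plain sorted list with binary search. This is purely illustrative, not Tokutek's implementation; the class and method names are made up.

```python
import bisect

class OrderedMap:
    """Toy ordered map: lookup, insert-with-overwrite, and next."""

    def __init__(self):
        self._keys = []   # sorted list of keys
        self._vals = []   # values, parallel to _keys

    def lookup(self, key):
        """Given a key, return its associated value (or None)."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._vals[i]
        return None

    def insert(self, key, value):
        """Overwrite if the key exists, otherwise insert, keeping order."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = value
        else:
            self._keys.insert(i, key)
            self._vals.insert(i, value)

    def next(self, key):
        """Smallest key strictly greater than `key`, with its value."""
        i = bisect.bisect_right(self._keys, key)
        if i < len(self._keys):
            return self._keys[i], self._vals[i]
        return None
```

The next operation is the one a hash table can't give you; everything that follows is about supporting these three operations when the data doesn't fit in RAM.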
I'm going to delay talking about things like ACID properties, but I will touch on them briefly. So, if we assume that the database doesn't fit in main memory, how fast can we run these operations? Because if it fits in main memory, things are going to be very fast; Michael Stonebraker goes around talking about doing millions of transactions per second in main memory, and that's great. But a lot of data doesn't fit in main memory. And B-trees, which is what almost everybody uses — it's long been known that B-trees are essentially optimal for doing lookup, but they're not optimal for insert.

Well, OK. As soon as I talk about optimality for this and that, I need a machine model, because otherwise I can't do analysis. For this, I'm going to use the disk access model, which is due to Aggarwal and Vitter. The idea is that there's a big memory, which we'll call disk, and a little memory, which we'll call RAM. Sometimes the big memory is RAM and the little memory is cache, but for this talk, it's disk and RAM. Data is organized in blocks of size B, and these blocks are of known size for this talk — I'm not going to talk about, say, cache-oblivious data structures, where you don't know the block size. The model is that there's some fixed-size block, and whenever you move data between RAM and disk, it has to move in blocks of size B. So if you're just writing one bit and you want to write it out to disk, well, you have to write B bytes; you don't get to write just one bit. And whenever you bring a block in — the RAM is small, its total size is M — you might have to evict somebody back out to disk, because the RAM is too small to hold the entire database. So that's the disk access model. It's a simple model, and it's very powerful. There are models that are better in the sense that they give more accurate analysis, but some of them are harder to do analysis on. This model, because it's so simple, doesn't reflect all sorts of strange anomalies of disks: it treats every disk block access as the same cost, so it ignores the fact that on disk, some blocks are closer to each other than others. We can ignore a lot of that and still make a lot of headway. So this is the model I'm going to use for this talk.

Here's an example of doing some analysis on a very simple data structure: scanning an array. I have this big array of size N, the block size is B, and I want to scan it. How many I/Os does it take to scan the array? N divided by B. N divided by B, right? Wouldn't it be 2N over B, since you still have to send it back? Well, for a scan where I'm not modifying anything, I may not have to send it back, because I can just evict it without writing it back to disk. Even if there is a write-back and you're worried about the factor of 2, this talk is mostly going to be about big O; ignore the factor of 2 behind the curtain, it's not important. So it's order N over B I/Os. It might actually be less than N over B, because it might all already be in cache. The big O covers that: if it turned out to be zero I/Os, it's still accurate. That's the way big O works, right? And if you want to get very detailed about it, there may be a partial block at the beginning and another at the end, so it may be N over B plus 1 or plus 2 or something. As soon as you start getting into that level of detail, it's easy to make a mistake, and it gets boring.

Here's another example: searching in a B-tree. A B-tree — a quick review for those who may not remember — is a search tree. B-trees are of uniform depth, typically, and the kind of trees I'm going to talk about have all the values down at the leaves, so they're B+ trees. There are lots of variants — B* trees, B+ trees — and they're all the same from the point of view of this kind of analysis. Everything to the left is less than, in lexicographic order, everything to the right. The way you do a search is, you go to the root of the tree, which has some pivot keys that say what the boundary is: everything over here is less than this key, and everything over there is bigger. You find the right boundary key, go down the right path, and find the thing. So how many I/Os does it take? The fan-out is proportional to B. If we've got a block that's four bytes, maybe we don't get very much fan-out, but if you have blocks that are even ordinary disk block size, four kilobytes, you end up with a fan-out of maybe a hundred for a lot of databases. And a lot of databases are now using larger block sizes — you might see blocks that are 64K or even a megabyte — so the fan-out can be very large. So what's the number of I/Os it takes in the worst case, in big O notation, to do a query? Log N? Log base B of N. Log base B of N, right — it's written small, so for those of you with poor vision, the slide says log base B of N. And this is Michael's drawing of a bee tree. He drew that and showed it to me, and I said, that's not a bee tree. That's a wasp tree. Anyway. So the depth of the tree is log base B of N, and that's the same as writing log N divided by log B. Sometimes it's helpful to think of log base B of N as giving you a log B advantage; that's what putting the B in the base of the log does. And this is a concrete example of a search tree that holds a bunch of prime numbers — not all of them.

You can also — Goetz Graefe likes to point out that B-trees can also be arrays. If you have a tree that you're not trying to do inserts into, then instead of keeping all these separate blocks, you can write everything into one big sorted array, and then create an index array that has every Bth item in it, so that when I'm doing a search, I can do a smaller search in the index and then only have to do a restricted search within a single range. You keep doing that recursively, going up, until you get down to one block's worth. You can view that as a B-tree too, and it works very well for static B-trees, which will show up, for example, in log-structured merge trees, which we'll see in a minute.
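Here's a sketch of that static, array-based B-tree idea in Python: a sorted array plays the run on disk, each index level keeps every Bth key of the level below, and a search does one block-sized binary search per level. The constants and names are made up for illustration.

```python
import bisect

B = 4  # block size in items; real B-trees use fan-outs in the hundreds

def build_levels(keys):
    """keys: a sorted array (the 'run' on disk). Each index level keeps
    every B-th key of the level below, until one level fits in a block."""
    levels = [keys]
    while len(levels[-1]) > B:
        levels.append(levels[-1][::B])
    return levels

def lookup(levels, key):
    """Walk down the static index. Each level examines only one
    block-sized slice -- one simulated I/O per level, O(log_B N) total."""
    pos = 0  # left edge of the current block within this level
    for depth in range(len(levels) - 1, 0, -1):
        block = levels[depth][pos:pos + B]
        i = max(bisect.bisect_right(block, key) - 1, 0)
        pos = (pos + i) * B   # the matching block one level down
    block = levels[0][pos:pos + B]
    i = max(bisect.bisect_right(block, key) - 1, 0)
    return block[i] == key

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53]
print(lookup(build_levels(primes), 19))  # True, touching one block per level
print(lookup(build_levels(primes), 20))  # False
```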
So the depth of the tree is log base B of N, the search cost is the same, the write cost is the same, and the next cost is, on average, 1 over B of an I/O, because once you've fetched a block, you usually get to fetch B items before you have to go find another block. And with caching, which is an important case, typically all the internal nodes of the tree end up in cache once the cache is warmed up, so the operation costs for B-trees in practice turn out to be like one I/O per operation. You have to have a really big disk before you can't fit all of the internal nodes into main memory, and then you probably just make the fan-out a little bigger so that you can. A fan-out of a million — if you have a million times more disk than main memory, you probably want to buy more main memory anyway; it doesn't make sense to spend a million times as much on your disk as on your main memory. OK, so that was the example of searching in a B-tree.

Here's an example of searching in a plain array: I just have a sorted array, without the B-tree index. How many I/Os does it take to search an array of size N? It's not log base B of N anymore, right? It's log base 2: you do a probe into the middle, then another probe, and eventually you get down to a single block and you don't have to do any more probes. So it's log base 2 of the number of blocks, N over B, which is basically log base 2 of N. I'm not going to be very careful during the rest of this talk about N over B versus N inside the log, because it doesn't matter very much.

So here's where things start to get interesting: B-trees are sub-optimal for insertion. What do you mean, sub-optimal? Well, the fastest insertion you can do is just to write to a log. When you write to a log, there's only 1 over B of an I/O per insert: you get to write B items before you have to do an I/O. But lookup is really bad for this data structure, right? You have to scan the entire array, which we already know is order N over B.
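To put the two extremes side by side, here are the costs derived so far in the disk access model — just a recap in symbols, nothing new:

```latex
\begin{aligned}
\textbf{B-tree:} \quad & \text{lookup } O(\log_B N), & \text{insert } O(\log_B N) \\
\textbf{append-only log:} \quad & \text{insert } O(1/B), & \text{lookup } O(N/B)
\end{aligned}
```

The rest of the talk is about getting inserts much closer to the log's 1 over B while keeping lookups at the B-tree's log base B of N.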
Now, one of the questions here is, this is a lot less than one I/O, right? If B is a thousand, I'm talking about a thousandth of an I/O — and that's for relatively small blocks. If I have megabyte-sized blocks and, say, records that are a hundred bytes, this could be one ten-thousandth, a really small number. So how can I do a write and only do a thousandth of an I/O? Don't I have to commit the transaction or something? Something like that, but group commit can reduce it. That write is not because of this data structure; the write at the commit is because of the commit. And a typical thing that somebody putting a database into production does is put a RAID controller with a battery backup in front of the disk to absorb the writes into the log. Writes into a log look like: I append a byte to the log, I append another byte to the log. The battery-backed RAID, with a very small amount of RAM, can absorb all those writes at very high speed, because they're not all over the disk. Whereas these writes that I'm talking about here, where you could be logically inserting anywhere, become difficult to absorb.

Your model doesn't have locality; your RAM model doesn't have locality issues. So why would you care where they are on the disk? Because the battery-backed RAID controller really isn't the disk, right? And the model doesn't really talk about transactions. So, for the practical concern of transactions: you don't have to do a whole I/O for every transaction, because the log has enough physical locality in it. The keys have an order, yes — so you're assuming that physical locality matches the locality in the key ordering, that something that's far away in the key space is actually far away on disk? I'm not assuming that. In the worst case, if I insert random keys, it doesn't matter what data structure you have, it doesn't matter what locality scheme you choose, because randomness defeats you. Let me try it the other way: you're assuming that things that are close together, in the same block, are one event, and anything else might be two separate block accesses, and those could be arbitrarily far apart. If I do a single insert, I'm assuming I can update one place on disk. I'm trying to figure out — you said there are distance models of disks with those properties, and I'm wondering whether ignoring that costs you anything. I think what you're saying is, if B is a track, then that's local: one operation, one seek. But if you do two of those, they could be arbitrarily far apart. It doesn't matter to first order. Well — it turns out it does matter in practice, so you want to choose the block size appropriately; you want to simultaneously optimize the number of tracks you touch, and the number of blocks you touch, and the number of ten-percent spans of the disk you touch. There's a lot you want to do, but that's not most of what I'm going to do. This is a simple model where we just assume the worst: we don't understand the locality between blocks; within a block we have locality, and we assume a block access is one I/O. Do you believe these results apply as well to flash as they do to disk?
I'll talk about that. So, the surprise here is that there are data structures that do as well as a B-tree for lookup — we said that was log base B of N. What happens with B-trees on insert? It's basically the same as searching. In fact, I forgot to mention that, or rather, I elided it: the write cost is the same cost, log base B of N, because I have to go down the tree. So it's the same as doing a search first, and then I dirty the page, which is maybe one more I/O, because I eventually have to write that dirty block to disk. That makes sense even in the cached situation, where all the internal nodes of the tree are in memory: I go down the tree, and I write the dirty block. Occasionally there's tree rebalancing. I talk to people and they say, oh, I know why B-trees are bad: B-trees are slow because the rebalancing is expensive. But rebalancing hardly ever happens, right? You get to insert B items into a block before you have to split the block, so it doesn't matter, at least on average — maybe in the worst case you have to do several I/Os, but on average, you get to insert B items into a block before you split it and have to write its parent, and you get to insert B squared items before you have to touch its parent and its grandparent. It takes a lot of work to incur those I/Os. So lookup and insert are the same cost asymptotically for a B-tree.
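As a sanity check on that amortization argument, the rebalancing overhead telescopes — a recap in symbols of the splits-are-rare point:

```latex
\text{amortized rebalancing I/O per insert}
\;=\; O\!\Bigl(\underbrace{\tfrac{1}{B}}_{\text{leaf splits}}
   \;+\; \underbrace{\tfrac{1}{B^{2}}}_{\text{parent splits}}
   \;+\;\cdots\Bigr)
\;=\; O\!\bigl(\tfrac{1}{B}\bigr)
```

So rebalancing adds a vanishing amount to the log base B of N cost of walking down the tree; the expensive part of a B-tree insert is the walk itself.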
I thought the issue was locking — that insert created much more of a bottleneck for concurrency? We're not talking about concurrency in this model. Concurrency is basically an in-memory issue, because you always let go of all the locks while you're doing I/O; you don't block anybody else while I/O is happening, if you're building the database right. The locks are only held while you're doing an in-memory operation. And in this model, the disk access model, CPU cost is free: we're not counting CPU costs, we're counting I/Os. If you want to talk about concurrency and such, the data structures do get more complicated when you actually try to make these things go fast on a multi-threaded system. Any other questions? We'll get back to this.

So the surprise is that there are data structures that do as well as a B-tree for lookup, which is log base B of N, and can do almost as well as the log for insertions. The log alone looks like a really bad trade-off — you've made lookups really slow to make insertions fast — but you can insert almost that fast and still look up as well as a B-tree. I'll talk about several data structures here: log-structured merge trees, which are due to O'Neil; B-to-the-epsilon trees, due to — I always think of Buchsbaum — and so forth; and the cache-oblivious lookahead array, which is a cache-oblivious version. I'm not going to get to it today, but there are even cache-oblivious versions of this stuff. I'm going to sketch these out, and then I'm going to talk about some systems issues. For log-structured merge trees: how many of you know what a log-structured merge tree is? OK, so some of you, but not all of you. It's good to know what this is; it's a simple data structure, and it really works well.

The idea is that you maintain a collection of sorted runs of data. In the database world, whenever you have a sorted array of rows, it's often called a run. A run contains key-value pairs; it's just an array, sorted in ascending order by key. In some systems, if you have a 100 megabyte run, there's one file out in the file system with 100 megabytes in it. Others say, no, a run is comprised of smaller files — so there's an implicit blocking going on; they might have 10 megabyte pieces, but if you concatenate them together, that's a run. And some systems have options to do both; Cassandra has way too many tuning parameters. These runs can overlap in their range. A single run is just a sorted array, but two distinct runs are incomparable: they could overlap, or one could be entirely less than the other — there are no rules about how the runs are related. If you happen to insert the data in sorted order, maybe you could just concatenate runs together into a bigger run; if you're inserting data in random order, you'll end up with runs that are interleaved.

So here's a little LSM tree. It contains some numbers: I started out with those prime numbers, and I've got this other run where I've inserted some other numbers, and another run, and this is the data structure for my set of integers. At some point, I might merge two of these runs to make a bigger run. The picture here is that I've merged 8, 9, 12, 15 — I don't know if you can see the green — to get a larger run. Merging runs is one of the operations. And the way you insert something is, well, you just create a little run containing one element. In practice, they optimize the first megabyte: they collect one megabyte in main memory and then write it out. If you want to be a purist, you create a run of size one, and create another run of size one, and maybe merge them into a run of size two. It's a little bit like a merge sort that's going on all the time.

The way you search in a log-structured merge tree is — the first idea is — let's just do a search in each of the runs. Is 13 in this database? I do a binary search on this run: no, it's not there. I do a binary search here: yep, there's 13, it's in the database. If I want to do a next operation, I do a binary search in each of the runs to find each run's successor: the next one here is 28, the next one after 13 in this run is 15, the next one here is 19. Then I have to select the smallest, so 15 is the next key after 13. So that's how you do a search or a next operation.
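Here's a toy log-structured merge tree along those lines in Python. The memtable limit, the merge trigger, and the names are all made up; it's a sketch of the idea, not any particular system's policy.

```python
import bisect

MEMTABLE_LIMIT = 4   # stand-in for "collect a megabyte in main memory"

class ToyLSM:
    def __init__(self):
        self.memtable = {}   # newest data, held in RAM
        self.runs = []       # sorted runs of (key, value), newest first

    def insert(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}
            self._maybe_merge()

    def _maybe_merge(self):
        # Arbitrary policy: merge the two newest runs while they are
        # similar sizes -- a merge sort going on all the time.
        while len(self.runs) >= 2 and len(self.runs[0]) >= len(self.runs[1]):
            newer, older = self.runs.pop(0), self.runs.pop(0)
            merged = dict(older)
            merged.update(dict(newer))          # newer data shadows older
            self.runs.insert(0, sorted(merged.items()))

    def lookup(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:                   # search every run, newest first
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def next(self, key):
        """Binary-search each run for its successor, then take the smallest."""
        candidates = [k for k in self.memtable if k > key]
        for run in self.runs:
            i = bisect.bisect_left(run, (key,))
            while i < len(run) and run[i][0] <= key:
                i += 1
            if i < len(run):
                candidates.append(run[i][0])
        return min(candidates, default=None)
```

Note how lookup and next have to visit every run; that's exactly the cost the analysis below gets at.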
To analyze this, you have to know how long it takes to search within a run, and how many runs there are and what their sizes are. If there's only one run, or if they're exponentially distributed in size, or if there's a whole bunch of runs of the same size, you'll get different answers. And I haven't told you exactly what the policy is for how you manage the sizes of these runs; that policy turns out to create a lot of confusion, so I'm going to try to introduce some terminology. So, we said binary search on a run is log of N. Can we reduce that cost?

Well, one thing we can do is, when we create a run, we can create a B-tree that indexes the run. That saves us a factor of log B, because it takes us from log base 2 of N I/Os down to log base B of N. The B-tree is really small compared with the actual run — only one out of every B items of the data has to appear up in the index; as I'm constructing the run, I just select out every Bth item and insert it into that tree — so it's small potatoes compared to the run. And the B-tree is generally going to fit in memory, so I don't have to worry about all the worst cases: this run doesn't fit in memory, but the B-tree over it does. Right. So we've got the cost of searching a run of size N down to log base B of N. How many runs there are, and of what sizes, is the other piece that affects the analysis.

There are two kinds of log-structured merge trees. This terminology I basically got from Mark Callaghan at Facebook; he's basically the architect for their user database, and he worries a lot about performance. He gave me this terminology; it's good terminology, so I'm going to go with it. His brother Tim works at Tokutek. Tim was the first sales engineer at VoltDB, and before that, he was at an application company called Crunchtime or something. Everybody's got good relatives. This is Mark, not Tim. So Mark told me there are two kinds: leveled log-structured merge trees, and size-tiered log-structured merge trees.

First, leveled trees. These are used by Cassandra, LevelDB, and RocksDB, and it was the approach described in the original paper about log-structured merge trees. Data is organized into levels; each level contains one run, and the levels get bigger. Level 0 contains up to a megabyte. Level 1 either has nothing in it, which is one of the possibilities, or it has a run between 1 megabyte and 10 megabytes. Level 2 contains between 10 megabytes and 100 megabytes, and so forth. That's a typical design choice: you see a factor of 10. Data starts out in level 0 and eventually gets merged into level 1, and then into level 2, over its lifetime, and finally ends up in the highest level. And when you delete things — we'll talk about deletes in a minute. LevelDB diverges from this in one way, by level 0 having multiple runs, so level 0 of LevelDB is size-tiered; let me get to that in a minute. It's hard to figure some of this stuff out — you read the code, you talk to people — but according to my notes, LevelDB is a leveled tree. The first level is often a special case, because it all fits in main memory: you don't really need disk data structures for the part of the data that fits in memory, so they keep it in just a binary tree in memory until they're ready to write it out. And a typical thing is that they just treat the write-ahead log as level 0: if you need the level 0 data, there it is in the write-ahead log. There are lots of hacks.

OK, so, an analysis of the insertion cost. Assume the growth factor is K — K was 10 in that example — and the smallest level is a single block of size B. The number of levels in the tree is log base K of N over B, which is log base K of N if I ignore the over-B. Data moves out of each level once, but it gets re-merged into the same level repeatedly.
When I'm writing 1 megabyte into the 10 megabyte level, I might have started out with 2 megabytes there, and I merge a megabyte in; now I have 3 megabytes, and I merge a megabyte in, and now I have a 4 megabyte run. Merging the 3 and the 1 to get a 4 requires reading the 3, reading the 1, and writing the 4. On average, each item gets written into the same level K over 2 times: if the blow-up is 10, each data item in the 10 megabyte level gets rewritten 5 times on average — the last item only gets written once, but the first item gets rewritten 10 times. So there's a blow-up of something like K. The merge itself only costs 1 over B I/Os per object, because it's a scan, and a scan is 1 over B per object that you read or write. So the average write cost is K times the number of levels, log base K of N over B, divided by B. If K is 10 and the block size is 1,000, you'd have 10 times log base 10 of something, divided by 1,000 — so a hundredth of a log of something. That's a really small number, right? It's not as good as a log: a pure log would have done it in 1 over B per insert. The blow-up compared to logging the data is K log base K of N. So it's a little worse than a log, but it's not a lot worse. And again, this is a big improvement over a B-tree, because B-trees cost one I/O in the worst case for every insertion, and here we're talking about some log divided by 100 — or, if you assume the blocks are a megabyte, which is the way these systems organize things, you're dividing by a million, right? So it doesn't matter much what the log term is; it's a really small number.

OK, the lookup cost. The biggest run is order N; that requires log base B of N. The next run requires log base B of N over K: the first one was a terabyte and the next one was 100 gigs, so you have to do log of a terabyte, then log of 100 gigs, then log of 10 gigs, and log of 1 gig, and so forth. If you sum those up, it's an arithmetic sequence with log base B of N as the biggest term, because when I divide by K inside the log, it's like subtracting off a constant — it subtracts log of K. The number of terms is basically the number of levels, log base K of N. So the total number of I/Os you have to do in the worst case — this is the uncached case — is log base K of N times log base B of N. K may be 10, and B may be a thousand, or a hundred thousand, or a million, but it's still a log squared, and a log squared is actually pretty painful. Even in the cached case, it's not that great, because you end up having to do one read for every level that doesn't fit in main memory. In a 10 terabyte database on a 64 gig machine, the last three levels might not fit in main memory, so you might have to do three reads to do a search instead of the one read a B-tree has to do. The LSM has really won on insertions, and it's lost a little bit on lookups. Asymptotically, it's lost a log factor, and some people say, a log? Big O — we ignore logs. And some people say, no, no, I really care, in which case maybe you care about the constant: it's three times worse, and people might not like that. So, OK, that's the lookup cost, in the cold-cache and the warm-cache cases.
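Collecting the leveled-LSM bounds just derived, with K the growth factor:

```latex
\begin{aligned}
\text{leveled LSM, insert (amortized):} \quad & O\!\left(\frac{K \log_K N}{B}\right) \\
\text{leveled LSM, lookup (cold cache):} \quad & O\!\left(\log_K N \cdot \log_B N\right)
\end{aligned}
```

With K = 10 and B in the thousands to millions, the insert cost is a tiny fraction of an I/O — that's the win — and the log-squared lookup is the price.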
One of the things people do about that is stick a bloom filter in front of each run. You all know what a bloom filter is? Some of you do. OK, I didn't prepare anything formal on it, but a bloom filter is another data structure that you should find out about, if you don't know it. Basically, for about one byte per object, it gives you a high-probability answer to whether an object is in the data structure. It never gives you the wrong answer if the object is in the data structure — it always says yes — but if the object's not there, sometimes it gives you the wrong answer: it says probably yes, or definitely not. When you're doing this kind of lookup, where I'm looking in a bunch of levels, it's OK if once in a while I look in a level I didn't need to; as long as that only happens one time in a hundred, it doesn't impact the overall system efficiency. So a bloom filter lets you do that.

Unfortunately, there's no way to use bloom filters to reduce the cost of the next operation. Bloom filters basically take a hash of the object you're looking for and say, is it in there? And the answer is probably yes, or definitely not. There's no way to use that to fix the next operation, that I know of — it would be an interesting, great result if you could get a bloom filter to reduce the cost of next. And if you don't care about next operations at all, then you're using the wrong data structure; you should be using a hash table. So it's kind of a funny situation: maybe there are workloads that occasionally need a next operation and mostly need point queries, where they just look something up, and maybe it actually does work in that case. But it's a little bit funny if you're trying to say this is the right data structure to solve your database problems.
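Here's a minimal bloom filter sketch in Python, since it came up; the bit-array size, the number of hashes, and the hashing scheme are arbitrary choices, just enough to show the one-sided error.

```python
import hashlib

class BloomFilter:
    def __init__(self, nbits=8192, nhashes=4):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        # Derive nhashes bit positions from hashes of the key.
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means definitely absent; True means probably present.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

An LSM lookup would ask might_contain before touching a run: a False skips that run's I/O entirely, and the occasional false True just costs one wasted probe. Note there's nothing here that helps with next, which is exactly the limitation just described.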
OK, the second kind: size-tiered. This is used by Cassandra — which can also use the leveled version — WiredTiger, HBase, RocksDB, and maybe Bigtable. The idea here is that we're not going to organize things into levels in any particularly disciplined way. We just have a background process that finds K runs that are close to the same size and merges them together to make a larger run. And usually K is smaller: K equals 4 is typical. You find 4 runs and merge them together, whereas recall that for leveled LSMs, we used a growth factor of 10. So you might see runs which are 1 megabyte, 4 megabytes, and 16 megabytes, roughly. Here, when merging, data is read from each level only once and written only once; we don't have that problem of merging things repeatedly into the 10 megabyte bin until it gets up to 10 megabytes. We just take four 1 megabyte files and make a 4 megabyte file, and once data gets into a level, we don't write it again until we write it into the next level. And you end up with K runs in each level: in the 16 megabyte level, you'll have a couple of 16 megabyte runs, because you haven't got enough yet to make a 64 megabyte run. Briefly, you have those four 16 megabyte runs, and then you start merging them to make a 64 megabyte run.

Yeah? Are you worried at all about the latency that occurs when you do these enormous merges? Every now and then, you're going to do an operation that merges terabyte runs together; it'll take five minutes, right? Well, merging a terabyte takes more than five minutes on most hardware. But no — the merge, although it takes a long time, doesn't actually block anybody. The way to think of it is that there's a background process that does the merging. While you're merging four 4 megabyte runs into a 16 megabyte run, you still have the 4 megabyte ones lying around, so if somebody wants to query them, you can. You just don't register the incomplete 16 megabyte piece — you don't tell anybody about it yet, so nobody looks inside while it's incomplete. You just have to make sure you tune things so that you can keep ahead. If I do way too many insertions and end up with a thousand 1 megabyte chunks, I've inserted too fast, because my background process needed to have merged those together. And it's a common failure mode in these practical implementations that they end up with too many chunks at a given level, because they have trouble tuning that. It's possible to tune it. In fact, it's possible to completely de-amortize it with one process: you can have a single thread that does an insertion and moves one object from this level to that level, and one object from that level to the next — you can de-amortize it. But it's complicated, and I don't think it adds very much insight into what's going on.

Yeah, Phil? If the same key is in the data structure more than once, you always take the one in the smallest run? In the leveled version, the one in the smallest level is the newest one, so that gives you memory semantics, right: when you ask, you get the most recent thing written. For the size-tiered version, you have to be careful — you have to timestamp them or something to make sure you get the right answer, because the merges aren't being disciplined about what they merge together. Not necessarily, anyway; you could imagine being disciplined, but the basic data structure doesn't require that kind of discipline, so you have to do something to be careful about it. For the leveled one, it's very easy, because the small runs are the new data, and new data shadows old data. And there are some databases where you actually want duplicates — you don't want inserting a duplicate to be an overwrite. That's not a data structure problem; that's the application's, the higher level's, problem: put some sort of unique counter at the end of each key to make sure you don't overwrite. That's not a B-tree problem per se. So the size-tiered systems you cited — do they do some sort of timestamping to figure it out? They do some sort of timestamping to resolve conflicts. Or they might look in all the bloom filters; they might do something clever. You could say, well, if this is being shadowed, I can delete it. In TokuDB, we use multi-version concurrency control, so we actually have several versions of the same data lying around. Just because some key shadows another doesn't mean we can discard it — there's also transaction visibility information in there — so we want to keep all the versions around until the system can prove that some version of the row is no longer visible to any possible transaction. There's some complexity to making that work; it's a little outside the scope of this talk.

The leveled one didn't seem to match your definition, though. The way I understood it, you keep the run broken up into many files; the files themselves are a lot smaller, with no overlapping ranges, so you tend not to re-merge them if you can manage it well. They sometimes try that. What I talked about was merging whole runs: if the runs were single files, you'd have to rewrite them. Each run actually comprises a bunch of different blocks.
Sometimes you can just reuse a block, because you get lucky: there's nothing in between from any of the other stuff being merged. In the worst case, that doesn't happen, and for this kind of analysis, we argue about the worst case, so we assume it doesn't. But in a practical system, it really can happen, and you can gain something. It kind of bothers me, though, when people go off and do a hack like that without really having evidence that it's actually going to help them. Because in practice, they do the compaction slowly over time, so if they can work with one file, one sub-run, at a time, it might overlap a smaller number of pages. Yeah, that might also help. And there's compression in here — there's lots of stuff going on. OK.

So, the insertion cost in the size-tiered scheme: we only have to write things into each level one time, so we don't have the K factor like we had before. We had K times log base K of N, over B; now it's just log base K of N over B. Now, K is smaller, which makes that log base K of N bigger, but it doesn't make it anywhere near as much bigger as multiplying by K did. Comparing K log base K of N against log base K of N, you go for dropping the K, right? But lookups in the size-tiered scheme turn out to be a lot worse. Previously, we had lookups costing log base K of N times log base B of N; now we also have to look in K times as many runs. So we've got a straight trade-off between the cost of insertions and the cost of lookups. And the lookup cost with bloom filters is basically the same as for the leveled one. OK, so here's a graphical picture of the two cases: the leveled one has one run of each size, and a snapshot looks like this; the size-tiered one has several of a given size, and a snapshot looks like that, and later we'll merge these to make a bigger one.

OK. So it's not been entirely happy news yet. I've got eight minutes left — I'm going to get to the good data structure now. I've got 20 minutes? 20 minutes. In 20 minutes, I can do the good data structure. So here's a simple write-optimized data structure. It's still not going to be quite right, but at least it doesn't have a log-squared term in the lookup cost. This is a binary tree. Now, binary trees are not very well suited to disks, because the disk has blocks of size B and a binary tree node only needs a few bytes, so I've got all this extra space in the block. I'm going to use that block for something: I'm going to use it as a buffer for recently inserted data. So I have a picture here where this data item belongs down in the leaves of the tree, but instead of putting it down there, I just put it in a buffer up here. So it's a balanced binary tree with buffers of size B. To insert data, I just put the data into the buffer of the root. And when a buffer fills up, I flush it down. So here are a couple of insertions; now the buffer doesn't fit, so I push things down. It's a very simple data structure.
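Here's a sketch of that buffered binary tree in Python. The buffer size stands in for a block of size B, and there's no rebalancing here (the talk assumes a balancing algorithm as a subroutine); it just shows insert-into-the-root and flush-on-full.

```python
BUFFER_SIZE = 4  # stand-in for "a block of size B"

class Node:
    def __init__(self):
        self.pivot = None      # keys < pivot go left, >= pivot go right
        self.buffer = {}       # recently inserted key -> value pairs
        self.left = None
        self.right = None

    def insert(self, key, value):
        self.buffer[key] = value        # one cheap write at this node
        if len(self.buffer) > BUFFER_SIZE:
            self.flush()

    def flush(self):
        """Push the buffered items one level down. One flush is O(1)
        block I/Os and moves a whole buffer, so it costs O(1/B) per
        item per level."""
        if self.pivot is None:          # leaf: split around a median key
            keys = sorted(self.buffer)
            self.pivot = keys[len(keys) // 2]
            self.left, self.right = Node(), Node()
        for key in list(self.buffer):
            child = self.left if key < self.pivot else self.right
            child.insert(key, self.buffer.pop(key))

    def lookup(self, key):
        # A query must check every buffer on the root-to-leaf path.
        if key in self.buffer:
            return self.buffer[key]
        if self.pivot is None:
            return None
        child = self.left if key < self.pivot else self.right
        return child.lookup(key)
```

Inserts touch only the root until a cascade of flushes carries items down; lookups pay log base 2 of N, which is the weakness the refinement below fixes.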
So the insertion and deletion cost here is: a buffer flush costs order one I/Os, because these are blocks of size B — I have to read and write a couple of blocks, maybe it's six I/Os or something, but it's order one. And it only costs 1 over B of an I/O to send one item down one level. Yeah? How do you determine the key that decides which elements go left and which go right? Well, there's a pivot key inside the binary tree node that says everything to the left is less than this, and everything to the right is greater. And when I did the insertion, I inserted the whole row, the whole key-value pair, so I have the key for each of those objects as well as the value. But how did you determine which key would be a good choice to remain in the top node? For this, I'm going to assume the tree is a reasonably balanced tree — maybe it's a red-black tree or something. Whatever the red-black tree algorithm chooses, there's some key that roughly divides the data into two equally sized pieces. That might not divide the buffer evenly: the buffer may be all inserting to the left, so everything may in fact go down here, and then everything may go down there. That's actually a good case. If you're doing insertions that have a lot of locality, that problem's easy to solve: if I'm always inserting at the left edge of the array, I'm inserting timestamp data, and timestamps only go up. B-trees work great for that, and this data structure works too. Wouldn't it be very unbalanced to the right if everything you insert is greater than that first key you chose? Well, eventually the tree becomes unbalanced, and you're running a tree-balancing algorithm — a red-black tree, or a B-tree, or a 2-3 tree, or something. As a subroutine for this talk, I'm assuming we can maintain trees so they don't get way out of balance. It could be a weight-balanced tree; there are lots of choices. I don't want to brush you off too fast, but I don't have time to do that level of data structure here. If you want to know more about it, I'd be happy to spend time on it later.

So the analysis of the writes: there are log base 2 of N levels, and each level costs 1 over B per item to move things down, because I get to move B items for the cost of one write. So an insert costs log N over B. That's really good — a single log on top, with a B on the bottom. That's better than anything we've seen so far for insertions. I guess you could also get this performance by setting K to 2 on a leveled LSM.

So, key access. This is another of Michael's drawings; it shows what happens when you sometimes find it difficult to get at your keys. Goetz points out that this key does not appear to fit that lock, but OK. So, point queries. When we do a search, we normally would just walk down the tree to find the thing. What we have to do now is also look in every buffer on the way down for anything that might match the search. If we're doing a next operation, we have to look in each buffer, find all the candidates for the next, and choose the minimum. So if a buffer is really big, like a megabyte, that means I need a data structure inside the buffer. It's an in-memory data structure, not a disk data structure — I need a B-tree or a binary tree or something in there, so that I can insert things and do queries quickly. One place this shows up: if you look at Berkeley DB — Berkeley DB is one of the oldest embedded databases, and it's a great product; I'm talking about the classic C version, from the late 1980s — it behaves really badly if you set the block size to a megabyte, because inside a node, the list of the children and the pivot keys is basically a linked list. It's not this buffered data structure. And so it really goes very badly.
Because that's not their design space: in 1985, nobody thought that blocks should be a megabyte. We can make this better in the following way. We've made lookups log base 2 of N instead of log base B of N, and we want to get that back; that's the game here. So here's the trick — some people say this is obvious, and to some people, it feels like magic. I had this block of size B. Instead of having a fan-out of 2, I'm going to have a fan-out of root B, so there are going to be root B children. And instead of having one buffer for the entire node, I'm going to think of segregating the buffer so there's one buffer for each child — although in TokuDB, we actually just have one big buffer per node. In this case, does B measure elements or bytes? For this whole talk, I'm assuming that elements are size order 1, so the difference between elements and bytes is only a constant factor. That's not realistic — strings wouldn't work with that model well at all. Right. Strings work very poorly in B-trees too. Everything I've talked about works badly if you try to insert a string that's larger than main memory into your data structure. There are data structures that can do that — there are data structures you can insert genomes into, and it works the way you would hope, as fast as you could possibly have hoped for — but this data structure is not one of those, and nothing I've done here works for big strings. So to keep things simple, I'm just going to assume order-1 elements; a real system has to deal with strings that are not order 1, if not the extreme case of a 10 gigabyte string. But if we were to build a file system with the data merged into the metadata, every block would be a string, and yes, there could be terabyte-sized strings. Yeah. But that's not how we build our file system on top of this.

So we have root B buffers, each of size root B — and here I'm just counting objects; B is sized in terms of objects. Well, OK. The height of this tree is now log base root B of N, because the fan-out is root B. But that's asymptotically the same as log base B of N; it's only twice as deep. If you square-root the fan-out, you only double the depth of the tree. I put on my theory hat: ah, a factor of 2 between log base B of N and log base root B of N. It turns out that factor of 2 ends up completely swamped by other engineering concerns. The thing is, you get the asymptotics right, and then you put your engineering hat on and go after the constants. And this factor of 2 doesn't, for example, change the number of reads on a query from 1 to 2: when things are cached, it's still 1. You just tune it so that root B is still big enough that you can fit all the internal nodes of the data structure into main memory. So the tree height is order log base B of N. And the insertion cost: now there's a log base B of N instead of a log base 2 of N, so that's better. But instead of dividing by B, we divide by root B, because when we move things down, we're only guaranteed to be moving root B items at a time instead of B items, which is what we got to do in the binary case. So we gave up a huge factor.
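In symbols, the root-B fan-out works out to the following (this is the epsilon-equals-one-half point of the trade-off named a little further on):

```latex
\begin{aligned}
\text{height:} \quad & \log_{\sqrt{B}} N \;=\; 2\log_B N \;=\; O(\log_B N) \\
\text{insert (amortized):} \quad & O\!\left(\frac{\log_B N}{\sqrt{B}}\right)
\qquad \text{lookup:} \quad O(\log_B N)
\end{aligned}
```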
If B was a million, we gave up a factor of 1,000 on our insert performance in order to get a relatively small factor — log of B, maybe a factor of 20 or something — on our query performance. But that's generally a good trade-off, because the insert cost is still a lot less than the lookup cost. If you have one thing that's slow and one thing that's fast, and you can make the fast thing a little slower to make the slow thing faster, that's usually a good trade-off.

Now, this flushing transfers items from one level to another? From one level to another, yes; for the analysis, it's disk to disk. When I move stuff from this level to that level, I have to read this block in, read that block in, move the items, and eventually write those blocks out. So it's four I/Os in the uncached case. In the cached case, it might be less, because you don't actually have to write dirty blocks until they get evicted or there's a checkpoint or something. It strikes me, from what we were talking about earlier, that each of these root B sub-ranges is now like a file of the run at that level, because they're sub-runs that don't overlap each other. But it's all tuned to fit within one block — remember, we're in this disk access model; that's the point. All of this is one block, so for one I/O, I can bring in all root B buffers, each of size root B. If I think of that file as the limit on the size of the level, ten times bigger each level... I get you. No, it's not quite the same. But it turns out there are connections between these data structures; if you think about them for a while, they all start looking alike. And the things we did to the COLA to make it work well — the COLA is like a log-structured merge tree with some other stuff stuck in, and it ends up looking similar. That's your prior work, right? Well, fractal B-trees is actually Phil's paper, right? We have something called that, but it's not this stuff. Fractal trees is a marketing term. Notice I didn't really talk about fractal trees except on that first slide, where I said libft. We have B-to-the-epsilon trees, we have log-structured merge trees, we have COLAs — those are all technical names. Fractal trees is whatever Tokutek marketing says it is. So what do you call this one? This is a B-to-the-epsilon tree, and the epsilon here is one half. It turns out there's a trade-off: you can make the buffers smaller and the fan-out bigger, which gives you a very smooth trade-off between the insertion cost and the lookup cost. At one extreme, epsilon equals 0, is the binary version I showed you before. At the other end, epsilon equals 1, is a B-tree: you have B children, and the buffer space is only 1 per child, which means essentially every time you insert anything, you have to do a push.
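For reference, the general form of that trade-off — these are the standard B-to-the-epsilon bounds from the literature; the talk itself only states the endpoints and the one-half point:

```latex
\text{fan-out } B^{\varepsilon},\; 0 < \varepsilon \le 1: \qquad
\text{insert } O\!\left(\frac{\log_B N}{\varepsilon\, B^{1-\varepsilon}}\right),
\qquad
\text{lookup } O\!\left(\frac{\log_B N}{\varepsilon}\right)
```

Setting epsilon to 1 recovers the B-tree; epsilon one-half gives the root-B structure above; and as epsilon shrinks, inserts approach the buffered binary tree's log N over B while lookups pick up a growing constant.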
OK. So I've talked about write optimization and the trade-off between writes and reads. In some sense, that's the easy stuff. It's the stuff that I think is intellectually interesting, but it's really, at some level, easy. Building a full-featured database is harder: you have to deal with variable-sized rows, you need concurrency control, you have to cope with multi-threading and transactions and logging and crashes, and with special cases for when the data arrives already sorted, which turn out to be important. Real systems do compression; real systems need to be backed up. Getting all of that to work is a small matter of programming — but it's where all the work is.

One of the things that I think is interesting is that there's a bunch of systems work where you try to apply this kind of write-optimized technology and you run into problems, because the system somewhere assumed that the search cost was the same as the insert cost. What this work shows is that inserts are a lot cheaper than searches. Inserts are almost as fast as writing to a log, whereas for searches, by golly, you have to look on disk, and it doesn't matter what crazy data structure you use: if I give you a random key, you're going to have to do a disk-head movement. So you don't really get to do very many random searches; that's life, at least with this technology. An example: the Berkeley DB API has a mode where it returns an error when you insert a duplicate key — you insert something, and it says, no, that's already there. Well, that requires doing a lookup. Maybe bloom filters can mitigate the cost of that lookup, but at some level, you have to do it. Another version: I do a delete, and I get told how many items I deleted — was it 0 or 1? Well, I could do a delete without doing a query: I could just insert a tombstone. Instead of an insert message, I insert a tombstone message and have it go down the tree, and as it goes down, it annihilates matching keys. That wasn't important when they built the first implementations, and now it is. Yeah.

And so Martín Farach-Colton — the other founder; I guess Andy didn't mention Martin. There are three technical founders: Michael Bender, who's at Stony Brook, Martín Farach-Colton, who's at Rutgers, and me. Martin calls these crypto-searches: not because of cryptography, but because they're hidden, in the more traditional sense of crypto as hidden. So they're hidden searches, and they show up. We found them in the file system: we tried to use FUSE to build a file system, and every time you do anything in FUSE, it does a stat on everything it can think of. And stat is a query. So you end up throttled to the speed at which queries can run, instead of the speed at which insertions can run — so things didn't get faster. And uniqueness checking, and all sorts of stuff. So basically, one of the interesting research problems we've been trying to figure out for the file system is how you get rid of all those crypto-searches, so the file system can actually do insertions fast. I'd go one step further than that: there is lost information when you make those changes in the API. How do you provide a way to retain that information? Yeah, it's a systems problem; this is not an easy problem. Because if you take it out of the API and the user turns around and does a lookup followed by an insert, then you haven't changed anything in the performance, and you've made their life harder. Yeah. But from a performance point of view, they could package that in a library and just call lookup with the duplicate check. So there are a bunch of interesting cases for how you get rid of these, and some things we found.

This data structure likes fire-and-forget operations. Fire-and-forget is an insert where you don't care whether there's a duplicate. But there are other fire-and-forget messages you can think of, like delete, or update, where you say: find key number five — it's got some fields — and increment the X field inside that row. You can have broadcast versions of these, where you say: delete everything, delete every key in the database; you just send a message in, and the message works its way down, deleting stuff. Or you could have narrowcast versions, where you say: delete every row from here to here. And all of those are very fast from the perspective of the application, because the application just drops a message into the root of the tree, and they have very good amortized cost, because the messages ride down with the normal flushing.
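Here's a sketch of those fire-and-forget messages in Python, extending the buffered-tree idea from earlier; the message kinds and names are illustrative, not TokuDB's actual message set.

```python
# Messages are (kind, key, payload) tuples dropped into the root buffer,
# which is O(1) for the application. As buffers flush, a message filters
# down and annihilates or rewrites matching entries along the way.

def apply_message(entries, msg):
    """entries: key -> value pairs at some node; msg filters through them.
    (In the real tree, a message that finds no match here keeps descending.)"""
    kind, key, payload = msg
    if kind == "insert":
        entries[key] = payload            # blind overwrite: no lookup first
    elif kind == "tombstone":
        entries.pop(key, None)            # annihilate a matching key
    elif kind == "update":                # payload mutates the existing row
        if key in entries:
            entries[key] = payload(entries[key])
    return entries

node = {5: {"x": 1}, 7: {"x": 9}}
apply_message(node, ("insert", 8, {"x": 0}))
apply_message(node, ("tombstone", 7, None))
apply_message(node, ("update", 5, lambda row: {**row, "x": row["x"] + 1}))
print(node)   # {5: {'x': 2}, 8: {'x': 0}}
```

A broadcast or narrowcast delete would be the same shape with a key range instead of a single key; the point is that none of these require a read before the write.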
Flash — so, Garth asked about flash. One alternative is to use B-trees on flash. But the problem on flash is generally not reads; reads are essentially free on flash. The best model of flash is that reads just become cheap — you don't care about reads — but writes are expensive. That's both because lots of flash devices don't actually have that much write bandwidth, especially consumer-grade flash (anything you plug in on a PCI card might be different, but if you're talking about a SATA interface, they can't actually sustain that much write bandwidth), and also because most flash can't do very many writes before it wears out. I think the typical number today is maybe between 5,000 and 10,000 overwrites — does that sound plausible? That's on the low side, but anywhere between 500 and a million, depending on how much you pay. So if you get the cheap stuff, you might only have 500 overwrites. And if you buy a 1 terabyte drive — maybe I'm showing my age — a 2 terabyte drive, you can only write a petabyte into that drive before it's used up. I've seen cases where people took MySQL that was having performance problems, pulled out a disk drive, plugged in an SSD, and six weeks later the drive was burnt out, because you only get so many writes. So the advantage of these data structures for flash is not so much that they're faster, but that they write less. They're also very friendly to compression, which is more writing less. Writing less is the key win for flash. And I don't know what's going to happen when we have other memory technologies, like PCM or something. Everything's great? Everything's bad with PCM: because it writes so fast, you can wear it out in seconds. I had understood that PCM doesn't wear out. PCM wears out, dangerously fast. OK, that's great — that's great for my research. The one that you want is FTT. I'll have to find out what that is.

What's up? Your talk — you have 52 out of 51 slides. That's interesting. I'm going to leave gracefully; it's 52 out of 51. He's done. This is the start. Do you have a question? No, I just want you to finish. OK. Tim, any questions? Just finish your talk. He told me to shut up. Shall we argue about whether I should finish? So, this is a benchmark that Tim ran — not Mark; Tim Callaghan — where he measured against MongoDB. MongoDB is a great straw man, because MongoDB has a very slow implementation of the B-tree — compared to any of the other B-trees, it's like something from the 80s — but it's a good stalking horse. This is an axis where this direction is how many seconds have passed in the application, and this direction is how many I/Os have happened. It's an insertion benchmark, maintaining three random indexes, so as time goes on, the database gets bigger; both systems are doing the same application. TokuDB is done here, at less than 10,000 seconds.
And MongoDB is done here, at about 115,000 seconds. So it's kind of a funny measurement, because that direction isn't really progress, and up isn't really good either — more I/Os is cost. Lower and to the left is good: the sooner you finish, and the less area under the curve, the better, because the area under the curve is the total number of I/Os you've done. So this is the number of I/Os that happened with basically a B-to-the-epsilon tree, which is what Tokutek mostly uses, and this is the number of I/Os for Mongo. Now, the individual I/Os are bigger for Tokutek than they are for Mongo, so it's not a completely fair comparison, because I'm counting I/Os here. If I were to draw this graph scaled not in I/Os per second but in megabytes per second on this axis, there would be a 70-fold difference instead of the 500-fold difference or whatever this graph shows. So that's my end slide, the other marketing slide bracketing the talk. I can't explain the slide numbering; LaTeX has always done math properly before.