So this is the last lecture we're going to do on indexes. Today's focus is trie data structures, which are getting more attention in the context of in-memory databases because they have certain properties, which we'll cover today, that are unique, more interesting, or potentially better than the B+tree material we've covered so far. I want to start off with some more general implementation issues we have to deal with when we actually want to build an in-memory index. This first part isn't specific to tries; most of it can be described in the context of B+trees. But some of it will help motivate the tries we talk about later. For the trie data structures, we'll cover primarily two variants: Judy arrays and the ART index, which is the paper you read. Then we'll finish by quickly mentioning the Masstree, because I think it ties together everything from last class and this class into a single data structure.

When you take the introduction class, we have you build a B+tree, but we don't really get into the nitty-gritty details of how to make it usable throughout an entire system. Now we actually want to build a real database system, so we have to figure out how the hell our index is going to function with the other parts of the system. For this, I want to start with different ways to manage memory: garbage collection and memory pools. The garbage collection part is partly a repeat of what we covered last time with epoch-based garbage collection. Then we'll talk about how to reclaim memory, then ways to store non-unique keys and variable-length keys. We need non-unique keys for MVCC; we need variable-length keys if we want to store strings. Then we'll see techniques to reduce the size of the keys we actually store in our B+tree nodes, and from there we'll see why we want something like a trie or a radix tree.

Last class, we talked about latch-free data structures. The thing we stressed was that latch-free is nice because you don't have to hold heavyweight latches while you make changes to the data structure. But since you don't have any protection mechanism on the nodes, you don't know who could be reading the particular node or the part of the index you're modifying. Take this example: a really simple skip list where we want to delete a key, but there's another thread scanning along the leaf nodes. If we blow the node away before we update the pointer to it, the other thread follows that pointer to an invalid memory location. If we're lucky, we read garbage data, which is still bad. Worst case scenario, we get a segfault and our program crashes. The reason is that we're not keeping any global tracking information about what the threads are reading. So we have to make sure we do things in the right order, and that we only reclaim pieces of our index, the nodes, once we know it's safe to do so. This is what garbage collection is going to do for us.
There are a bunch of different ways to do this. The most common ones are reference counting, epoch-based reclamation (which we talked about last class), and hazard pointers. In the context of in-memory databases, we're going to focus on the first two; these are the most common for really any data structure in an in-memory database. Hazard pointers, less so; those mostly come up in operating systems and other programs. And there are a whole bunch of other techniques people have developed over the years to do this kind of thing.

The easiest way to keep track of which threads are reading which pieces of data is to maintain a counter on every single node in the data structure that we increment any time a thread reads it. So if I'm scanning along a leaf node and want to follow a pointer to the next node, before I actually do the jump, I increment the counter to say "I'm reading this," and when I'm done, I decrement it. Now when our garbage collector comes along and wants to reclaim this memory: if that counter is not zero, it knows some thread is accessing the node; if the counter is zero, nobody is accessing it, and therefore it's safe to reclaim.

Again, thinking in the context of the skip list (part of the reason I teach the skip list is that it makes these issues really easy to conceptualize): we would mark a node as logically deleted by flipping a flag, so any thread that comes along afterward ignores it, and then update the pointers at each level of the linked list to route around it. Any other thread that comes along won't even see that node. So the counter isn't going to stay above zero for long; there's just a small window where something could be looking at the node while we blow it away and cause memory problems. Once the counter goes to zero, we know it's safe to reclaim.

For as simple as this is, it turns out to get terrible performance. If you scale out the number of threads in a multi-core, multi-socket system, you now have all these threads reading and writing these counters, which live somewhere in memory that the CPU has to keep coherent. Every counter update sends a bunch of cache invalidation messages between the different cores, and that becomes a bottleneck that prevents us from scaling. If it's a single thread, who cares; but if we're dealing with maybe 50 threads, this becomes a big issue.

So how can we do better? We know the answer is going to be epoch-based garbage collection, but let's understand why. The first thing to point out is that we don't actually need to know the exact value of this counter. I don't care whether it's one, two, four, or eight. All I care about is whether it's zero or not zero.
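To make this concrete, here's a minimal sketch of per-node reference counting. The names are my own illustration, not from any particular system; the point is just that every reader writes to the counter, which is exactly what causes the coherence traffic.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical node with an embedded reference count.
struct Node {
    std::atomic<uint32_t> ref_count{0};  // how many readers are pinning this node
    Node* next{nullptr};
    // ... keys and values ...
};

// Reader: pin the node before touching it, unpin when done.
void read_node(Node* n) {
    n->ref_count.fetch_add(1);   // this write is what ping-pongs the cache line
    // ... examine keys, maybe follow n->next ...
    n->ref_count.fetch_sub(1);
}

// Collector: a logically deleted node is only safe to free once unpinned.
bool safe_to_reclaim(const Node* n) {
    return n->ref_count.load() == 0;   // note: all we care about is zero vs. non-zero
}
```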
Incrementing the counter causes all these cache invalidation messages because the CPU wants to make sure that every thread on every core sees the exact same value. But we don't actually care about that, so we're paying for coherence across our threads that we don't really want. The other issue is that we don't need to perform garbage collection immediately when the counter reaches zero. Again, in the context of the skip list: if I'm doing reference counting, I logically mark something as deleted, route around it, and once the reference count goes to zero, I don't need to reclaim it immediately. I can do it at a later stage; I'm not talking minutes later, more like milliseconds. The other aspect is that the data we want to reclaim isn't that big, maybe one or two kilobytes per node. We're not talking about hundreds of megabytes that we need to reclaim right away; it's a small number of objects we want to clean up. So we don't need to stop the world the moment a counter hits zero and immediately do garbage collection.

This is what epoch-based garbage collection is designed for. It's a more coarse-grained mechanism where we keep track of which objects have been modified, invalidated, or need to be cleaned up within a given time window, and once we know no threads could still be in that time window, it's safe to reclaim them. This looks a lot like the garbage collection we talked about for MVCC: once we know there are old versions that can't be read by any active transaction's snapshot, it's safe to reclaim them. It's basically the same idea, but now done inside the index itself.

The way it works is you have a global epoch counter that you periodically update, say every 10 milliseconds; Silo does it every 40 milliseconds. It doesn't really matter. It can be updated by a dedicated thread, or threads can do it cooperatively; again, it doesn't matter. Any time a thread wants to access the index, it registers itself with the garbage collector and says, "I'm entering the index at this epoch." It does whatever it needs to do, and if it creates garbage, it registers that with the garbage collector as well, under its epoch. When it leaves the index, it unregisters itself, and the garbage collector just notes there's now one less thread in the index at this epoch. It doesn't know what the thread was actually reading; we don't need that fine-grained access information. The thread just exists, and it could be reading anything. Once we know that all the threads within an epoch and all preceding epochs have gone away, we go through and clean up any garbage from those epochs, whether that's done cooperatively or with a background thread.

For in-memory indexes, this is the most common approach; everyone uses it. It's used outside of databases as well: in the operating systems world they refer to the exact same idea as read-copy-update (RCU), and the Linux kernel uses it for some of its internal data structures.
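Here's a rough sketch of that protocol, with a central mutex for simplicity. This is a hypothetical `EpochManager`, not any system's actual implementation; real implementations keep per-thread state precisely to avoid a shared lock like this, but the register/retire/sweep logic is the same. It assumes retired nodes were allocated with `::operator new`.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

std::atomic<uint64_t> global_epoch{0};   // bumped every ~10-40ms by some thread

class EpochManager {
    std::mutex mtx;
    std::vector<uint64_t> active;                     // epoch of each thread in the index
    std::vector<std::pair<uint64_t, void*>> garbage;  // (epoch it was retired in, node)

public:
    uint64_t enter() {                   // "I'm entering the index at this epoch"
        std::lock_guard<std::mutex> g(mtx);
        uint64_t e = global_epoch.load();
        active.push_back(e);
        return e;
    }
    void retire(void* node) {            // register a logically deleted node
        std::lock_guard<std::mutex> g(mtx);
        garbage.emplace_back(global_epoch.load(), node);
    }
    void exit(uint64_t e) {              // unregister, then free what's now safe
        std::lock_guard<std::mutex> g(mtx);
        active.erase(std::find(active.begin(), active.end(), e));
        uint64_t min_active = active.empty()
            ? global_epoch.load() + 1
            : *std::min_element(active.begin(), active.end());
        // Nothing retired before the oldest active epoch can still be reachable.
        auto dead = std::remove_if(garbage.begin(), garbage.end(),
            [&](const std::pair<uint64_t, void*>& entry) {
                if (entry.first < min_active) {
                    ::operator delete(entry.second);
                    return true;
                }
                return false;
            });
        garbage.erase(dead, garbage.end());
    }
};
```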
So again, if we're going to build an in-memory index, epoch-based garbage collection is the way to go. But after we identify which nodes we can garbage collect, we need to decide what to actually do with that memory. One thing we can do is just call free and give it back to our memory allocator, which may or may not give it back to the operating system. That would work, but it would suck, because now we have to go into malloc, into our memory allocator, and say, "hey, here's some memory you can have back." It also means that every time we need to create a new node, we call malloc to get a new chunk of memory for our index. That's bad too. We don't want to call malloc non-stop, because malloc maintains its own internal data structures for its arenas, which means it has to protect them with its own latches, and that can become a bottleneck. The libc malloc is terrible; you always want to use jemalloc or tcmalloc. We use jemalloc in our system. But as good as jemalloc is, it's still not free; it still has to protect its own state.

So instead, we can maintain our own memory pools. The basic idea is that in user space, in our database system, we keep a set of pre-allocated nodes, and any time a thread asks to create a new node, it comes to us and we hand over one we've already allocated. Obviously, if our pool runs out of space, then we go down to malloc and get more memory. When you want to delete a node, you just hand it back to the pool. There's nothing magical here; it's just a way to avoid calling malloc. Of course, if we insert a billion things and then delete a billion things, we want the index to actually give memory back. So in the same way we saw with compaction in MVCC, we want some mechanism in our memory pool to return memory to malloc and the OS as needed. The main takeaway here is to avoid calling malloc as much as possible, because it may end up being a syscall down to the kernel, and that would slow down our threads.
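A minimal sketch of such a pool, assuming fixed-size nodes; this is illustrative only (a real pool would allocate in larger slabs and construct nodes in place), but it shows the basic hand-out/recycle/shrink cycle.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

// Fixed-size node pool: hand out pre-allocated nodes and recycle freed ones
// instead of calling malloc/free on every node create and delete.
template <typename NodeT>
class NodePool {
    std::mutex mtx;
    std::vector<NodeT*> free_list;

public:
    NodeT* allocate() {
        std::lock_guard<std::mutex> g(mtx);
        if (free_list.empty())    // pool is dry: only now do we go to malloc
            return static_cast<NodeT*>(std::malloc(sizeof(NodeT)));
        NodeT* n = free_list.back();
        free_list.pop_back();
        return n;
    }
    void release(NodeT* n) {      // a deleted node goes back into the pool
        std::lock_guard<std::mutex> g(mtx);
        free_list.push_back(n);
    }
    void shrink(size_t keep) {    // periodically hand memory back to malloc/OS
        std::lock_guard<std::mutex> g(mtx);
        while (free_list.size() > keep) {
            std::free(free_list.back());
            free_list.pop_back();
        }
    }
};
```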
All right, so now let's talk about how we actually want to store data in our indexes. In the intro class, we were a bit hand-wavy about all this, but now let's go into more detail. This matters because, as I said before, in MVCC every index, even a primary key or unique index, needs to be able to store non-unique keys, because the same key might be inserted and deleted within the span of different snapshots. So how are we actually going to do this? I'll describe it in the context of a B+tree because it's easier to understand. This all comes from a great book that's available online about modern B+tree techniques, written by Goetz Graefe. His name will come up multiple times throughout the semester; he did the Volcano stuff and the Cascades query optimizer. I think the book is free (it was free for me), and it's the go-to guide. If you're going to build a B+tree, it's about 200 pages and it describes everything. It's awesome.

So there are two ways to store non-unique keys. The first is to duplicate the key multiple times in our key array, with a separate mapping from each copy to the value slot holding the pointer for that value. The alternative is value lists, where we store each key only once and maintain an internal linked list or array of all the values that correspond to that key. Visually, in the duplicate-keys case, my key array stores key K1 multiple times, and the value array holds offsets: each copy of the key maps to an offset, which leads to the pointer to its tuple. In the value-lists case, every key exists only once, and each entry points to some arbitrary variable-length list of all the value pointers for that key. We only have to do this in the leaf nodes. We don't have to do it in the inner nodes, because there the keys are just guideposts telling us whether to go left or right, so they can stay unique. We only need this in the leaves.
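Here's what those two layouts look like as bare-bones types. These are illustrative only; a real leaf node is a fixed-size byte array with offsets, not vectors, and `TupleId` stands in for whatever record pointer the system uses.

```cpp
#include <cstdint>
#include <vector>

using TupleId = uint64_t;   // stand-in for a record pointer

// Option 1: duplicate keys -- the key is repeated once per value.
struct DuplicateKeyLeaf {
    std::vector<uint32_t> keys;      // e.g. {k1, k1, k1, k2, ...}
    std::vector<TupleId>  values;    // values[i] belongs to keys[i]
};

// Option 2: value lists -- each key stored once, with all of its values.
struct ValueListLeaf {
    std::vector<uint32_t>             keys;    // each key appears exactly once
    std::vector<std::vector<TupleId>> values;  // values[i] = all tuples for keys[i]
};
```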
The next issue is storing variable-length keys. In the intro class, I think we only did fixed-length keys, but now we actually want variable-length keys, and that gets slightly more complicated inside a node. There are four different ways to do this.

The easiest is to store pointers rather than actual key values: just a fixed-length 64-bit pointer to the tuple, the same thing as the T-Tree we saw last class. If I'm traversing a node and want to check what a key is, I follow the pointer out to the tuple to see the actual key. And as we said before, this sucks: you get a cache line fill, a cache miss, every single time you follow a pointer, and that indirection is bad for your instruction pipeline.

The next approach is variable-length nodes. The idea is you allocate different-sized nodes based on how much data you think that particular node needs to store, and if it grows too big, you increase its size. The reason this sucks is that if you're doing a memory pool like we just talked about, you now need different-sized pools for all the different node sizes, memory gets more fragmented, and it's harder to manage. For this reason, nobody actually does this.

The third approach is padding: you reserve space for the maximum possible size of every key. If I declare a VARCHAR(32) column and insert a key that's only 16 characters, I pad out the remaining 16 characters so it exactly fills the VARCHAR(32) slot. Then I know exactly how to jump to offsets in my node to find every key, and exactly how many keys fit in each node. As far as I know, nobody does this one either, because it's pretty wasteful. A lot of times you see people write crappy applications where, when they create their schema, they declare a VARCHAR(1024) even though they only ever store one character. If you're padding that out, it's all wasted space and you'll get terrible performance.

What's more common is the key-map indirection approach. The basic idea is that you embed an array of pointers in the node itself that maps to the key/value pairs within the node. If you remember the intro class, this looks a lot like a slotted page for storing tuples. So we have a key map at the front, holding pointers, really offsets, down to the key/value pairs. Think of the data as growing from the end of the node back toward the beginning as I add new entries, and I keep going until the data butts up against the key map and the node is out of space. The key map always has to be in sorted order based on the keys, but the key/value data itself can be stored in any arbitrary order. To find the first entry, I just follow its offset down to the data.

So what's one potential problem with this? A student says values can be very large. I'm not worried about that: if values are super large, you typically have an overflow chain; you allocate new space and store a pointer saying, "the entire key you want, or the rest of the key, is over here." That sucks, but it's unavoidable. Another student says deletes can cause fragmentation. That's not a big deal either: since I hold the write latch on the node, if I delete an entry I can just re-compact everything. The real problem is searching, the same problem as the T-Tree. When I come down here and want to find "Obama," I'm going to do binary search (or linear search) on the key map. To check the actual value of a key, I have to follow the offset into the node body. It's not a full 64-bit pointer, but the slot and the key data are very likely not in the same cache line, so as I search within the node I incur multiple cache line fills. How big is a cache line on x86? 64 bytes, yes. So if I store the offsets as 16-bit integers, two bytes each, a node's worth of slots plus all the key data is not going to fit in a single cache line.

So what's one easy trick I could do? In this example, the first character of every key stored in the node is different: L, A, P, O. So I can embed that first character in the key map itself, next to each offset. Now when I search for "Obama," whether binary or linear, I scan along the embedded characters, and only when I see an O might I have a match, so only then do I follow the offset. That's one cache line fill for the key map and then potentially one more to jump down and verify, instead of bouncing back and forth nonstop.
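A sketch of that layout, with hypothetical sizes (64 slots, a 4 KB body) and the simplifying assumption that each data entry begins with its key bytes; a real node would length-prefix each entry.

```cpp
#include <cstdint>
#include <cstring>

// Key-map indirection with the first key byte embedded in each slot:
// the sorted slot array stays small and cache-resident, and we only chase
// the offset into the node body when the embedded byte already matches.
struct Slot {
    uint8_t  first_byte;   // first character of the key, kept inline
    uint16_t offset;       // where the full key/value entry lives in `data`
};

struct LeafNode {
    Slot    slots[64];     // kept sorted by key
    int     count;
    uint8_t data[4096];    // entries grow from the end toward the slots
};

const uint8_t* find(const LeafNode* n, const uint8_t* key, size_t len) {
    for (int i = 0; i < n->count; i++) {
        if (n->slots[i].first_byte != key[0])
            continue;                                   // skip without a cache miss
        const uint8_t* stored = n->data + n->slots[i].offset;
        if (std::memcmp(stored, key, len) == 0)         // one jump down to verify
            return stored;
    }
    return nullptr;
}
```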
I'm only showing four keys here, but imagine there are 64 keys inside this node. You still always have to verify: this entry might be "Oprah" instead of "Obama," so if I'm looking for Obama I still have to check the full key, but I can skip all the slots that don't start with O.

So this seems nice, but in most real indexes it may not always work, because in this example the first character of every single key was different. In real data sets, the first couple of characters are probably going to be the same. This is a trivial example and you wouldn't actually do it this way, but think of storing a list of URLs you crawled from the web: a bunch of them are going to start with "www.", so the first four characters are exactly the same, and the little trick from the last slide isn't going to help. But we can do something else, called prefix compression. Down in a leaf node, because it's a sorted array, it's very likely that the first several characters or digits within the node are the same. So in this case, instead of storing the complete key with the duplicated prefix for all three keys, we store the prefix separately, and for each key we only store the part that actually differs. We go from "robbed," "robbing," "robot" to storing the prefix "rob" once, and then "bed," "bing," and "ot." I suppose you could do this for the inner nodes too, but it's mostly done in the leaf nodes.

We can also go the other direction. Prefix compression finds where the keys overlap; but we can also recognize that sometimes a prefix is all we really need, and we can throw away everything else. This is called suffix truncation, which I guess is sort of a variant of prefix compression. The idea is that in our inner nodes, we may only need, say, the first three characters of two keys to differentiate one from the other. So we keep just those characters, throw away the rest, and that's enough to figure out whether we want to go left or right as we traverse the index. We can do this in a B+tree because the inner node keys are just copies acting as guideposts; we don't need a complete copy of the key there to check for existence. The only place we can check existence is the leaf nodes, where we do have the complete key to confirm an exact match.
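A quick sketch of the prefix compression idea, using std::string for clarity rather than a real node layout. It relies on the fact that in a sorted run, the common prefix of all keys equals the common prefix of the first and last key.

```cpp
#include <string>
#include <vector>

// Split a sorted leaf's keys into one shared prefix plus per-key suffixes,
// e.g. {"robbed","robbing","robot"} -> "rob" + {"bed","bing","ot"}.
struct CompressedLeaf {
    std::string prefix;
    std::vector<std::string> suffixes;
};

CompressedLeaf compress(const std::vector<std::string>& sorted_keys) {
    // Assumes sorted_keys is non-empty and sorted.
    const std::string& lo = sorted_keys.front();
    const std::string& hi = sorted_keys.back();
    size_t n = 0;
    while (n < lo.size() && n < hi.size() && lo[n] == hi[n]) n++;
    CompressedLeaf leaf{lo.substr(0, n), {}};
    for (const auto& k : sorted_keys)
        leaf.suffixes.push_back(k.substr(n));   // store only the differing tail
    return leaf;
}
```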
So this now leads into talking about tries. Any time we want to check whether a key exists in a B+tree, we always have to get to a leaf node, because the inner nodes may contain keys that no longer exist in our corpus, but the leaf nodes always have everything. So no matter what key we look up in a B+tree, it's always O(log n): we always have to traverse to the bottom, even if the key doesn't exist. Now think about performance, again in the context of cache lines. There's going to be at least one cache line fill for every single level in the tree, assuming the tree is completely cold: every level means a lookup out to memory to bring the node into L1 before we can search it, and going out to memory is super slow. It's faster than disk, but way slower than reading L1, L2, or L3. So even for keys that don't exist in our index, we pay the full cost of getting to a leaf.

This is where tries come in. Tries are a different alternative for storing keys in an index. The naming for these things is quite confusing. "Trie" is sort of the accepted term for what I'm describing here; sometimes they're called digital search trees or prefix trees. Then there are variants of tries, the radix trees and Patricia trees, and sometimes people say Patricia trees or radix trees are the same thing as tries. My understanding is that they're not. I'll describe plain tries first, and then we'll switch over to radix trees in a second.

The basic idea of a trie is that rather than storing the entire key at different levels of the tree, we break the key up into digits. If it's a string, one character could be a digit; if it's an integer, a digit could be a single bit or a byte, depending on how you implement it. We store these digits at different levels of the trie, and when we want to check whether a key exists, we do a digit-by-digit comparison as we traverse. What can happen is we reach a point where the digit of the key we're looking up doesn't exist at the level we're at in the trie, and at that point we know the key cannot exist. It's not like a B+tree, where some inner nodes may have keys that have long been deleted; everything that exists in the trie has to exist at its correct level.

Say we have a simple trie with three keys: "hello," "hat," and "have." To look up "hello," we break the string into characters, one per digit. We look up H at the first level, which tells us to traverse down to the next node; there we find E in the second position, then L-L-O, and then the pointer to our tuple.

Tries were first described in 1959 by a French researcher, and then Edward Fredkin, in the US, coined the term "trie," which is supposed to mean retrieval tree, a few years later. Fredkin is apparently faculty at CMU; if you look on the website, he's there. I've never met him. He's old, but I think he's still alive. So the guy who coined "trie" is at CMU somewhere.
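Going back to the hello/hat/have example, here's a naive pointer-chasing trie just to pin down the lookup semantics; real implementations use the array layouts we'll get to shortly, not a std::map per node.

```cpp
#include <map>
#include <string>

struct TrieNode {
    std::map<char, TrieNode*> children;  // one entry per digit present
    const void* tuple = nullptr;         // non-null if a key ends here
};

const void* lookup(const TrieNode* root, const std::string& key) {
    const TrieNode* n = root;
    for (char c : key) {                        // one digit per level
        auto it = n->children.find(c);
        if (it == n->children.end())
            return nullptr;   // digit missing => the key cannot exist
                              // (no need to reach a leaf, unlike a B+tree)
        n = it->second;
    }
    return n->tuple;
}
```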
So tries have some unique properties that are different from B+trees. The first is that the shape of the trie, the physical data structure itself, what levels and nodes you have, depends only on the keys it contains, not on the order you insert them. The way I think about this: if I take a B+tree and shuffle the keys I insert into it, I may get a different physical data structure from one shuffle order to another, depending on when it does splits and merges. A trie is deterministic: it always has the same physical layout no matter what order I insert in. The other interesting aspect is that it doesn't require rebalancing operations like a B+tree does. You still potentially grow and shrink nodes based on their size, as we'll see with the ART index in a second, but you're not making major structural changes to rebalance things the way a B+tree does; the changes you make are localized.

The complexity is also different: it's not based on the number of keys, but on the length of the key you're looking up. If the length of the key is k, the complexity of any operation is O(k). Going back to my example: to look up "hello," it takes five steps to get to the bottom. But to look up "Andy," at the very first node I see there's no A in the first position at the first level of my trie, and my search is done. Worst case scenario, I'd have to check A, N, D, Y; best case, I stop immediately. That's different from a B+tree. Another important point: the keys themselves are not stored in their entire form, since they're broken up into digits. If we want to put a key back together, we have to reconstruct the path. And we have to do more work for scans: we can't just scan along the leaf nodes like in a B+tree; we have to backtrack and reconstruct keys.

The key design decision you have to make when you build a trie is called the span. The span determines the number of bits each digit represents at each level. The way it works is: for each digit stored at a level, if the digit exists in my corpus, meaning some inserted key has this digit at this position, then I store a pointer, either to the tuple that corresponds to it (if I'm at the end of the key) or to the node below; otherwise I store a null pointer. The span determines what's called the fan-out: the maximum number of branches you can have at a particular node in the trie. And that in turn determines the height of the tree. If your span is one bit, then for a 32-bit integer you need 32 levels; but if you're doing one byte, which is eight bits, you only need four levels. In the literature you'll also see things referred to as an n-way trie, like a 256-way trie; that's based on the fan-out. A 256-way trie means that at a particular node there are at most 256 branches to children.
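A small sketch of the span/height relationship for 32-bit keys; the helper name is my own. The trie consumes the most significant digit first, which matters for ordering, as we'll see later with endianness.

```cpp
#include <cstdint>

// Extract the digit for a given level from a 32-bit key, for a configurable
// span (in bits). With span = 8, a 32-bit key has 4 levels; with span = 1,
// it has 32. Fan-out per node is 2^span.
uint32_t digit_at(uint32_t key, unsigned level, unsigned span) {
    unsigned levels = 32 / span;                    // assumes span divides 32
    unsigned shift  = (levels - 1 - level) * span;  // most significant digit first
    return (key >> shift) & ((1u << span) - 1);
}

// e.g. digit_at(0x0A0B0C0D, 0, 8) == 0x0A  (first one-byte digit)
//      digit_at(0x0A0B0C0D, 3, 8) == 0x0D  (last one-byte digit)
```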
So let's look at a simple example of how to store a one-bit span trie. We're going to store three keys: 10, 25, and 31. A one-bit span means that every single level of the trie represents a one-bit digit. We convert our keys into binary form; for simplicity we'll represent the integers as 16 bits, but in a real system it could be 32 or 64.

The trie looks like this. The very first level corresponds to the first bit position of our keys, and all three keys have zero there. So at the root node, the slot for bit = 0 has a pointer to the child node, and the slot for bit = 1 has a null pointer. The next position is zero again, and for simplicity, imagine this repeated ten times, since the leading bits are all zeros: the zero slot has a pointer to the child, and the one slot is null. Then we get down to where the branching actually happens: for key 10 the bit at this position is zero, but for keys 25 and 31 it's one. So one branch goes down the zero side for key 10, and another goes down the one side for keys 25 and 31. From there, for the remaining four bits of key 10 there's no other branching, so it's just a single chain of pointers, and at the bottom we have the pointer to the actual tuple. Keys 25 and 31 share the next bit too, then split off at zero and one, with 25 going down one side and 31 down the other. Pretty straightforward.

So what's one easy optimization to reduce the amount of memory we store per node? A student says: compress nodes that only have one child into a variable-length prefix. That's vertical compression, which shortens the path; we'll get to it in a second. What's one way to compress within the node itself? That's called horizontal compression. So what am I doing that's stupid here? I'm storing the value zero and the value one next to each pointer. I don't need to do that. Instead, I just store a two-element array: I know the first offset is zero and the second offset is one, and I'm done. So for each node, instead of storing one bit followed by a pointer and then another bit followed by a pointer, I'm just storing two pointers, and implicitly the offset of the pointer corresponds to the value of the digit at this level.

Yes, question? He asks: if you have a sparsely populated alphabet, does that mean you allocate a slot for every single character even though most of those characters will be null? Yes. We'll come to that in a second, and handling that is sort of exactly what the ART index and Judy arrays do.
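Here's the horizontal compression idea as two illustrative node structs for the one-bit span case; the digit value becomes the array index, so no comparison is needed at all.

```cpp
// One-bit-span trie node, before and after horizontal compression.

struct NodeExplicit {           // wasteful: stores the digit next to each pointer
    struct Entry { bool bit; void* child; } entries[2];
};

struct NodeCompressed {         // the digit is implicit in the array offset:
    void* child[2];             // child[0] = subtree for bit 0, child[1] = bit 1
};

// Lookup just indexes by the bit value.
inline void* step(const NodeCompressed* n, unsigned bit) {
    return n->child[bit];       // nullptr means the digit doesn't exist here
}
```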
Now the second optimization is exactly what the earlier comment suggested. Look at these three branches for 10, 25, and 31: after the branch at that bit position, there's no other branching. It's a straight-line shot down to the tuple pointer. So I don't actually care what all the bits stored along that chain are; I can shortcut it and go straight to the tuple pointer. Same thing for the other two chains: I can pull those pointers up.

This is what a radix tree is: a radix tree omits any node that has only a single child. This is vertical compression. So this is the same trie we had before, but now compressed. At the point where one node pointed down a chain that nobody else pointed into, I just store the pointer to the tuple directly. It's basically saying the path ends here: any key I know about that matches this far has to be this one. Of course, someone could do a lookup for a key that matches bit zero, bit zero, bit zero, but then differs somewhere down the part of the path my radix tree no longer stores; the trie can't know about that. So I still have to follow the pointer to the actual tuple and check whether I actually have a match. That sounds like the T-Tree, but we're only doing it at the end, and we'd have to fetch the tuple anyway, so this is fine. It does prevent this from being a covering index, because we still have to go look at the tuple, but we can cover that later.

Yes? He asks: couldn't you store the bit pattern you compressed away on the edge, and then do an exact match against it? Yes, you could do that, and we'll see that Judy arrays do exactly that in a second.

Okay, so those are the basics of what a radix tree and a trie are. For the remaining part of the class, I want to talk about how you actually implement this. For this, I'm going to first talk about Judy arrays, which I don't think are covered or mentioned in the German paper you read. But they often come up when people talk about these indexes outside of databases; they say "this sounds like Judy arrays," and it is very similar. So I want to describe what Judy arrays are, which will help motivate what HyPer actually does with the ART index, and then we'll finish with the Masstree in Silo. Judy arrays and ART indexes are 256-way radix trees; the Masstree is a trie of trees, a different way of representing the nodes themselves.

Judy arrays were invented around 2000 at HP Labs, and "Judy" is the name of the inventor's sister. (There are also Patricia trees, which I don't think are named after a woman at all; Patricia is actually an acronym.) A Judy array is a 256-way radix tree, and the reason it's interesting is that it's the first known radix tree implementation that supports adaptive node representations or layouts, which is what the A in ART stands for: the adaptive radix tree. Judy arrays come in three types: a one-bit array (Judy1), an integer map (JudyL), and a string map (JudySL).
We're going to focus on JudyL, the integer map, because it maps directly onto what we'll talk about with the ART index; the basic idea is the same if you want a string map. This was invented in the late 1990s, early 2000s at HP. HP filed a patent for it in 2000, which expires in 2022, and there is an open-source LGPL implementation. If you read Hacker News, people freak out about the patent: you don't want to use this, HP is going to come sue you. If you follow this link, there's a posting on the mailing list where somebody asked one of the Judy array authors, "hey, there's this patent, should I freak out about it?" And the answer was basically no. But my understanding is that nobody actually implements Judy arrays, because it's quite difficult to do. I went and read the manual this weekend; it's 80-something pages and pretty complex. In my opinion, the ART index is easier to implement, but I might be biased because I like the Germans. Anyway, the basic high-level idea is the same thing ART does; the Judy array guys did it first.

Now, one thing that is different from ART, and as far as I know unique (this is the only data structure I know of that does something like this), is that instead of storing the metadata about each node in the header of the node, they store it in the pointer to the node. In most implementations you put it in the header: in a B+tree node, the header says, "here's my number of keys, here are my lower and upper bounds, here's my slot array, here's some information about what's in the node." Judy arrays instead store this in what they call Judy pointers. There's a great paper from Jens Dittrich's group out of Germany, from ICDE 2015, that evaluates Judy arrays against the ART index; they call these things fat pointers. Every pointer to another node inside the index is now 128 bits, double the regular pointer size. In addition to the 64-bit pointer to the actual memory location of the node, the remaining bits store all the metadata you would normally keep in the node's header: what the node type is, the memory layout, how many entries or digits it has. And if the pointer actually leads to a tuple rather than a node, you can embed the value, or the remaining digits, in the pointer itself, plus a pointer to the tuple if necessary.

Again, this part is unique. The ART index doesn't do this, and I don't know of any other index that does. We can store the node's metadata in the pointer to the node because in a trie or radix tree, every child node has exactly one parent, so there's only one location to keep in sync. It's not like a B+tree, where we have sibling pointers for scanning across the leaf nodes; there this would be hard to do. In a radix tree, we can do something like this.
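A sketch of what a 128-bit fat pointer might look like. The field layout here is my guess for illustration, not the actual Judy specification; the point is only that node metadata rides along in the pointer instead of in a node header.

```cpp
#include <cstdint>

// Hypothetical 128-bit "fat" Judy-style pointer.
struct JudyPointer {
    uint64_t child;        // actual 64-bit address of the child node
    uint8_t  node_type;    // linear / bitmap / uncompressed
    uint8_t  count;        // number of digits stored in the child
    uint8_t  immediate[6]; // or: remaining key digits / a value embedded inline
};
static_assert(sizeof(JudyPointer) == 16, "two machine words = 128 bits");
```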
Every node can logically store 256 digits, so it's a 256-way trie. The issue, as the earlier question pointed out, is that our nodes are not all going to be 100% full. In many cases, say you're storing URLs: the top of the tree could have "www." for a prefix that everyone shares, but below that everything diverges, and as you go down the tree the fan-out gets larger and you see more distinct digits per node. So Judy arrays have three different node types that they switch between based on the distribution of the digit population at that particular node. There's a linear node for sparse populations, when you have a small number of digits at a node. The bitmap node is for when you have a few more. And the uncompressed node is for a dense population; it may not be all 256 digits, but it's more than you can store in the other two. We'll see the parallel in a second when we talk about the ART index: the linear node maps to the two smallest node types in ART, and both have the uncompressed node as the largest type. The bitmap node is where what Judy arrays do differs from what ART does.

So again, the linear node is for when a node has a small population of digits. You store two arrays in the node, holding up to six digits. The first array is the sorted digits: whatever digits are actually present, you store a copy of each in sorted order. Then you have the child pointers, where the offset in the pointer array corresponds to the offset of the digit in the digit array. It's called a linear node because a lookup is just a linear scan along the digit array for the digit you want. If you find it, you know how far you scanned, so you know where to jump in the pointer array. If you don't find it, you're done, because the digit you're looking for isn't there.

In the original Judy array specification from 15, 20 years ago, they talk about how a linear node can be accessed with a single cache line: they had 32-bit pointers, so they could store at most seven digits and still fit. But these are Judy pointers, double the regular pointer size, so with today's 64-bit addresses they're 128 bits each. So I can store at most six digits on one side, one byte each, six bytes total, and then six fat pointers at 16 bytes each, which is 96 bytes. In total that's 102 bytes, which isn't cache-line aligned, so we pad it out to exactly two cache lines, 128 bytes. (I think the alignment size for x86 allocations is 16 bytes, so you could align it that way, but for simplicity we'll say it's two cache lines.) In the original Judy specification this would have been 64 bytes, because the pointers were half the size.
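A sketch of the linear node lookup; plain pointers are used here for brevity where Judy would use the 128-bit fat pointers just described.

```cpp
#include <cstdint>

// Judy-style linear node: up to six sorted digits, each paired by position
// with a child pointer.
struct LinearNode {
    uint8_t count;          // how many digits are present (<= 6)
    uint8_t digits[6];      // sorted copies of the digits stored here
    void*   children[6];    // children[i] is the subtree for digits[i]
};

void* find_child(const LinearNode* n, uint8_t digit) {
    for (int i = 0; i < n->count; i++) {        // short scan, cache-resident
        if (n->digits[i] == digit) return n->children[i];
        if (n->digits[i] > digit) break;        // sorted: we can stop early
    }
    return nullptr;     // digit absent => the key cannot exist
}
```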
So again, at a particular node, at a particular level in our radix tree, if that node has six or fewer digits, we can use a linear node. If we have more, we have to convert the node into one of the larger node types. The next larger type is called the bitmap node. The way this works is that we maintain a bitmap recording, with a one or zero, whether each digit exists at this node. We break that 256-bit bitmap into one-byte chunks, eight bits each, and at the end of every chunk there's a pointer down to the corresponding section of the child pointers.

So here we have the prefix bitmaps: the leading bits of my digit select which eight-bit chunk I land in, and since the chunks are fixed length, I know how to jump to the right one. Each chunk then has a pointer down to the subarray holding the child node pointers for that segment, and those are the 128-bit fat pointers down to the nodes below us. Say I want to look up the digit whose value is seven zeros followed by a one. That lands in the first chunk, at the second bit position. Now I count the number of ones that come before me in the chunk, and that tells me my starting offset in the subarray. For this one, I'm in the second bit position but there are no ones to the left of me, so my offset is zero: I follow the chunk's pointer down and jump to offset zero, the beginning of the subarray. That pointer corresponds to the digit stored at this bit position. Same for the next one: it has one set bit to its left, so it sits at offset one. And then the next chunk starts its own subarray.

We use the bitmap node when the population is more than a linear node can hold but below some maximum; I forget the exact cutoff, since the bitmap side is always the same size, covering up to 256 digits, and the subarrays have to fit nicely in cache lines. If our population doesn't fit in the bitmap node either, we use the uncompressed node, which is just a full array of child pointers where the offset is the digit itself. Okay, is this clear?
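Here's the rank computation at the heart of the bitmap node, in a simplified flat form: one 256-bit presence bitmap plus a packed pointer array. Real Judy arrays chop the bitmap into per-chunk segments with subarray pointers as just described, but the count-the-ones-before-me idea is the same.

```cpp
#include <cstdint>

struct BitmapNode {
    uint64_t bitmap[4];   // bit d set => digit d exists at this node
    void**   children;    // packed: one pointer per set bit, in bit order
};

void* find_child(const BitmapNode* n, uint8_t digit) {
    int word = digit >> 6, bit = digit & 63;
    if (!(n->bitmap[word] & (1ull << bit)))
        return nullptr;                            // digit absent
    int rank = 0;                                  // # of set bits before us
    for (int i = 0; i < word; i++)
        rank += __builtin_popcountll(n->bitmap[i]);
    rank += __builtin_popcountll(n->bitmap[word] & ((1ull << bit) - 1));
    return n->children[rank];                      // jump straight to our slot
}
// (__builtin_popcountll is a GCC/Clang intrinsic; MSVC has __popcnt64.)
```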
All right, so now we can talk about the ART index. The ART index looks a lot like a Judy array: it's a 256-way radix tree, and depending on the population at a particular node, it chooses between different node sizes and layouts. It was developed by our good friends in Germany for the HyPer system in 2013, and I think they're reusing it in the new system they're building, Umbra, which isn't out yet. The original 2013 paper was single-threaded and didn't support concurrent operations; the newer 2015 paper I had you read describes how to make it thread-safe.

So let's compare how Judy arrays differ from the ART index. The node types are slightly different: both have the uncompressed node and the linear node style, but in that middle region, instead of a bitmap node, ART has a different node type, and where Judy arrays have three node types, ART has four. The other thing to keep in mind is that a Judy array is meant to be a general-purpose associative array. That means it has to keep a complete copy of every key inside the structure, so it can't do the vertical compression where you lop off large segments of branches when you know there are no other children. ART, on the other hand, is a table index: we don't worry about losing keys when we truncate branches, because we can always go back to the actual table, figure out what the original key was, and rebuild the index as needed.

All right, the node types. The first two are the smallest, for when only a small number of digits exist at a given node. Again, every node in the data structure logically has up to 256 branches, because we're doing eight-bit, one-byte spans: at most 256 unique digit values per node. But since we know some distributions won't have all digits present at a particular node, we can use these more compressed node types. With Node4, we can have at most four values: an array of the sorted digits, then the child pointers. ART doesn't use the fat pointers that Judy arrays do; these are just 64-bit pointers to the children, and the offset of a digit in the sorted digit array corresponds to the offset of its child pointer. Another aspect the original ART paper describes is that you can use SIMD, vectorized instructions, to do these searches very efficiently: I can load the digits into a SIMD register and do the comparison in one instruction rather than a scalar linear scan, looking at them one by one. Node16 is the same thing with at most 16 values: 16 slots in the sorted digit array and 16 child pointers.
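Here's a sketch of that SIMD search for a Node16, mirroring the trick described in the ART paper; it uses SSE2 intrinsics and the GCC/Clang `__builtin_ctz`, so it assumes an x86 target and those compilers.

```cpp
#include <cstdint>
#include <emmintrin.h>   // SSE2

struct Node16 {
    uint8_t keys[16];      // sorted one-byte digits
    void*   children[16];  // child pointers, same order as keys
    int     count;         // how many slots are in use
};

// One 16-way comparison replaces a scalar scan over the digit array.
void* find_child(const Node16* n, uint8_t byte) {
    __m128i key  = _mm_set1_epi8(static_cast<char>(byte));  // replicate byte 16x
    __m128i data = _mm_loadu_si128(
        reinterpret_cast<const __m128i*>(n->keys));
    __m128i cmp  = _mm_cmpeq_epi8(key, data);               // lane = 0xFF on match
    int mask = _mm_movemask_epi8(cmp) & ((1 << n->count) - 1);  // mask unused slots
    if (mask == 0) return nullptr;                          // digit not present
    return n->children[__builtin_ctz(mask)];                // index of the match
}
```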
When you get to something that doesn't fit in those two node types, this is where ART differs from the Judy array. The Judy array has the bitmap node: the bitmap chunks, each with a pointer down to a subarray of child pointers. In ART's Node48, instead of storing the values of the digits as the smaller nodes do, you have a 256-entry array of one-byte offsets that point into the array of child pointers. The idea is that since each entry only needs one byte to say which slot of the child array to jump to, rather than a full 32-bit or 64-bit pointer, this index array is actually quite small. The maximum number of entries you can store in this node type is 48. The child pointers are 64 bits each, so that's 256 bytes for the index array plus 384 bytes for the 48 pointers, 640 bytes in total, which is ten cache lines. The argument they make is that instead of doing the linear or vectorized scan we did with Node4 and Node16, scanning the array to see whether my digit is there, I know exactly how to index directly into the 256-entry array using the digit's bit sequence, and that tells me where to jump in the child pointer array. So there's a balance between the computational overhead of scanning and the memory overhead of storing the larger index array. You could instead store up to 48 digits with pointers next to them and scan, but this layout gives the best performance for the amount of storage used.

The last node type, Node256, corresponds to the uncompressed node in the Judy array, which I didn't cover in detail. It's just a giant array of up to 256 slots, each holding either a null or a child pointer depending on whether the digit exists at this node. Lookups are super fast: I find my position; if it's null, I'm done; if it's a pointer, I follow it. So those are the node types.
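A sketch of the two lookups; the 0xFF empty marker in the index array is an illustrative choice, not necessarily what the reference implementation uses.

```cpp
#include <cstdint>

// ART's two largest node types, following the paper's description.
struct Node48 {
    uint8_t child_index[256];  // one byte per possible digit; 0xFF = empty
    void*   children[48];      // the actual child pointers, densely packed
};
struct Node256 {
    void*   children[256];     // direct indexing; nullptr = digit absent
};

void* find_child48(const Node48* n, uint8_t digit) {
    uint8_t idx = n->child_index[digit];   // index straight in, no scanning
    return idx == 0xFF ? nullptr : n->children[idx];
}
void* find_child256(const Node256* n, uint8_t digit) {
    return n->children[digit];             // a single array access
}
```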
Another cool thing about the ART paper, something I like about it, is that it tells you how to store any possible key type in a radix tree or trie. When I did my first example, I just used three strings, and for us as humans it's easy to see that one character corresponds to one level of the trie. But when you actually implement this on real hardware, you find that the way the CPU represents different data types is not always amenable to storing in a trie.

For unsigned integers it's fairly easy, except that x86 is little endian and we actually want to store things in big endian, which I'll show on the next slide. For signed integers, we have the two's-complement sign bit at the front, and if we stored those bits directly at a level of a node in our trie, that would cause problems, because the comparisons wouldn't mean what we want. And with little endian on top of that, we're coming at the key from the wrong direction: we wouldn't know whether something is negative or positive until the very end. So you flip the sign bit around to make sure that negative numbers always sort below positive numbers. For floats, they break them up into normalized or denormalized, positive or negative, and always convert to unsigned integers so comparisons work. If you have a compound key, which is very common, like an index on multiple attributes, you just do the transformation to comparable form for each attribute separately and concatenate them into one long key.

Let me show you an example of what I mean. Say we have this key, which for simplicity we'll write as hexadecimal bytes: 0A 0B 0C 0D. If I store it in little endian, which Intel does, then going from the top of the trie to the bottom, I'm reading the bytes in the wrong direction, and I won't be able to tell whether the value I'm looking for is less than or greater than this key until I get to the very bottom. But if I store it in big endian, the most significant byte comes first, and I know right away, at the first digit, whether the value I'm looking for is less than, greater than, or equal. To see why this matters, say I want to look up a key whose big-endian byte sequence starts 0A 0B 1D. I traverse and see 0A equals 0A, 0B equals 0B, and then 1D tells me exactly which branch I want. If I were going in the other direction, I'd start from the low byte 0D and could conclude one key is greater than the other when it's actually less. So the first ART paper basically gives you a recipe book for converting all the different key types you might store into a form that compares correctly in the index. All right, is this clear? Okay.
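A minimal sketch of that transformation for signed 32-bit integers, following the recipe in the ART paper: flip the sign bit, then byte-swap to big endian. `__builtin_bswap32` is a GCC/Clang intrinsic, and the memcmp check assumes a little-endian host like x86.

```cpp
#include <cstdint>
#include <cstring>

// Build a binary-comparable key: flip the sign bit so negatives order below
// positives, then byte-swap so the most significant byte is the first digit.
uint32_t make_key(int32_t v) {
    uint32_t u = static_cast<uint32_t>(v) ^ 0x80000000u;  // INT32_MIN -> 0x00000000
    return __builtin_bswap32(u);                          // little -> big endian
}

// After the transform, plain byte-wise comparison agrees with signed order:
bool key_less(int32_t a, int32_t b) {
    uint32_t ka = make_key(a), kb = make_key(b);
    return std::memcmp(&ka, &kb, sizeof ka) < 0;          // same result as a < b
}
```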
Now when a reader thread comes along, it doesn't take any latches. It just checks whether the write latch is being held on the node; if not, it reads the version number, does whatever it needs to do, and moves on to the next node. But before it can do anything in that next node, it goes back and checks whether the version counter of the node it just came from is the same as what it read before. If it is, then it knows no other thread came along and modified that node between the time it checked it, read it, and moved on. So you're optimistically assuming nobody is going to modify the node, so you don't acquire any latches, but you still do that coupling as you go down: you check behind you to make sure things are okay. To make all this work, we have to use the epoch-based garbage collection we talked about before, because again we don't want writer threads to modify nodes and then blow them away, as they're doing deletes or splits and merges, leaving us following a pointer to nowhere. The epoch-based garbage collection makes sure that for any node we're reading, all the pointers and everything will still be there as we follow a pointer to the next node. It's just that the node may logically have been modified by another writer thread, in which case we abort ourselves and restart. So let's look at an example. Again, I'm switching to the B+tree because I think it's easier to understand, but you can do basically the same thing in an ART index. Every node now has this version number, and a thread comes along that wants to do a lookup on key 44. We start at the root. We don't acquire any latches; we just check whether a write latch is being held. If not, we read the version number, examine the keys to figure out which direction to go for our traversal, and move to the next node. Now we read this snapshot here, version five, but before we can examine this node, we've got to make sure this is actually where we should be. So we go back and check the version number of the node we came from. We maintain a stack, a list of the nodes we looked at on the way down, and if the version number is the same as what we saw at the very beginning, then we know no other thread has modified that node since we first visited it, so it's safe to proceed further down. All right, again: I check v3, it validates, I examine this node, I realize I need to go down here, I record that version number, go back and recheck version five, and that passes because it hasn't changed. So now I can examine my node and find the key I'm looking for. Right, so you're optimistically assuming no other thread is going to modify the node you just came from, so you don't acquire any latches on it as you go. Now let's look at another example. Roll back to before we got down here and read this version nine. Before we do our check on node C, a writer thread comes along and flips the write latch, which it can do because I'm not there anymore, right? I never acquired a latch anyway. And then it modifies this node and updates its version number.
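Here's a minimal sketch of the reader side of optimistic latch coupling, assuming each node carries an atomic version word whose low bit doubles as the "write latch held" flag. The function and field names are my assumptions for illustration, not the paper's exact API.

```cpp
#include <atomic>
#include <cstdint>

struct OLCNode {
    std::atomic<uint64_t> version{0};  // low bit = latched, rest = counter
};

// Snapshot the node's version, spinning while a writer holds the latch.
uint64_t readLockOrRestart(const OLCNode& n) {
    uint64_t v;
    do {
        v = n.version.load(std::memory_order_acquire);
    } while (v & 1);                   // write latch currently held
    return v;
}

// After reading the node's contents, verify nothing changed underneath us.
bool validate(const OLCNode& n, uint64_t snapshot) {
    return n.version.load(std::memory_order_acquire) == snapshot;
}

// Traversal pattern: snapshot the parent, pick a child, snapshot the
// child, then re-validate the parent before trusting the child pointer:
//   if (!validate(parent, parentVersion)) { /* abort and restart */ }
```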
So now when I do my recheck to see whether the node I came from is still the same, I see that it's no longer version five, it's version six, and I say, all right, someone has modified this node since I came through, so I abort my operation, back out, and do the whole thing over again. This is more lightweight than having to take read latches, but writers still have to take write latches to protect things; readers just check whether the write latch is being held, and if not, they go ahead and read. The downside of this approach, though, is that it's coarse-grained. In this example, all I did was see that the version number changed from five to six; I didn't have a match, so I killed my operation. But say all this writer thread really did was update a pointer down to this node over here. It didn't touch anything in the direction I went. So in actuality it's not really a conflict, because the side of the tree I went down is still fine, still sound, still the same. But I don't know that; all I know is that the writer modified the node and incremented the counter, and therefore I have to abort. So this is an unnecessary abort, a false conflict, because these version numbers are too coarse-grained. The alternative is what's called read-optimized write exclusion, and we're short on time, so I'm not going to go into detail, but the basic idea is that every node has an exclusive latch, and writers block other writers, but they never block readers. Sorry, question? Maybe it's a simple question, but I don't quite understand why you have to examine the version of your parent instead of the node you're currently reading. So the question is: if I'm here and I jump down here, why do I have to examine this version up here and not the version of the node I'm in? You do record this version, but the idea is that I followed some pointer down to this node because that's where I should have gone. If someone comes along and modifies that pointer after I've already checked it, then this is not the node I should be reading; I should be reading some other node, and now I'm not seeing a consistent view of the index. Now, the follow-up: say someone changed my parent, but I'm nevertheless in the correct node that I was going to read; what's the situation where not checking my parent goes wrong? So the question is: what situation could there be where not checking my parent causes a problem, since I'm already down here? Say we're back up here. There's a race condition: I could follow this pointer that's going to take me to this node, but in between, before I actually jump there, somebody else redirects the pointer so there's now another level below me, and the thing I actually want to get to is over here, not through this node, and this node doesn't point to it anymore. Okay, so you mean, for example, I'm in node Z as shown in the graph, but after the writer's change I should actually have gone to some other node instead? Yes, correct. Okay. So real quickly, read-optimized write exclusion: the way to think about it is basically like shadow paging or MVCC, where the reader threads just read whatever node they find.
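And here's the matching writer side under the same assumed version-word scheme: setting the low bit acts as the write latch, and releasing it bumps the counter so any in-flight reader fails its validation and restarts. Again, this is a sketch with illustrative names, not the paper's code.

```cpp
#include <atomic>
#include <cstdint>

struct OLCNode {
    std::atomic<uint64_t> version{0};  // same layout as the reader sketch
};

// Acquire the write latch by atomically setting the low bit.
void writeLock(OLCNode& n) {
    for (;;) {
        uint64_t v = n.version.load(std::memory_order_acquire);
        if (v & 1) continue;           // another writer holds the latch; spin
        if (n.version.compare_exchange_weak(v, v | 1)) return;
    }
}

// Release the latch. Adding 1 to the odd (latched) value clears the
// lock bit and increments the counter in one step, so a version a
// reader snapshotted before our change can never validate.
void writeUnlock(OLCNode& n) {
    n.version.fetch_add(1, std::memory_order_release);
}
```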
The writer threads always create new snapshots, new nodes, and that way they block out other writers, but readers never get blocked and never get aborted; they can always read a consistent snapshot, which essentially solves the problem he was asking about. Doing this is actually quite difficult, though, because it requires major changes to the data structure to allow threads to swap in replacement nodes and update all the pointers to those new nodes atomically. You can do this in a radix tree because, for any node in a trie, there's only one parent pointing to you, so I can blow away my node and atomically update that single pointer without worrying about updating multiple locations at once. In a B+tree, you could have sibling pointers pointing to your node, so doing a major modification, creating a whole new node and updating all the pointers atomically, requires you to hold latches or use a mapping table like the BW-Tree's to make it happen all at once. (The BW-Tree wouldn't have this problem, precisely because of the mapping table.) As far as I know, nobody actually implements this; there's something very similar in Masstree and a couple of other indexes, but I think the OLC approach from the previous slide is the way to go. We have ten minutes, so real quickly: Masstree. I'm not going to go into details; I just want to make you aware that it exists. What we saw in the ART index and the Judy arrays is how they manage what's actually stored in each inner node, right? Instead of storing the giant uncompressed version every time, they have different node sizes based on the population. So instead of having different fixed node sizes, what if we just had dynamic nodes that can grow and shrink based on the population as needed? That's the motivation for Masstree. Masstree, instead of having adaptive node sizes, just stores a B+tree at each trie node, and that B+tree can grow and shrink as needed based on what keys or digits exist at that node. Masstree targets really large keys, things like URLs or long email addresses, and the span of every level in the trie is eight bytes, so 64 bits. In the case of the radix tree and the Judy arrays, they were doing one-byte spans; this is eight-byte spans. Again, the idea is that in the leaf nodes of the B+tree at each level, you can either point to another trie level, which has its own B+tree, or point to the actual tuple itself, right? So it's not like only the leaf nodes at the very bottom level have pointers to your tuples; you can have tuple pointers in the upper levels, just like a regular trie would. This came out of the Silo project, which was built by Eddie Kohler at Harvard. We'll cover Silo's logging scheme in a few weeks, and I think we mentioned it a bit when we did concurrency control. Silo is basically an in-memory storage manager that's super optimized for a large number of cores, and one of the first things they implemented was this trie of trees, which I think is pretty interesting.
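To illustrate the eight-byte spans, here's a hedged sketch of how a key could be cut into fixed 64-bit slices, one per trie layer, with each slice packed big-endian so integer comparison inside a layer's B+tree matches bytewise key order. The helper is my own illustration, not Masstree's actual API.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

// Pack up to 8 bytes of the key (starting at offset) into a uint64_t,
// zero-padded, most significant byte first.
uint64_t sliceAt(const std::string& key, size_t offset) {
    uint8_t buf[8] = {0};
    size_t n = std::min<size_t>(8, key.size() - offset);
    std::memcpy(buf, key.data() + offset, n);
    uint64_t slice = 0;
    for (int i = 0; i < 8; i++)
        slice = (slice << 8) | buf[i];     // big-endian packing
    return slice;
}

// Conceptual lookup: walk one B+tree layer per 8-byte slice until the
// key is exhausted or a layer points directly at a tuple, e.g.:
//   for (size_t off = 0; off < key.size(); off += 8)
//       layer = layer->btree.lookup(sliceAt(key, off));
```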
So this shows you that you don't have to store fixed-size nodes the way the radix tree and Judy arrays do; you can use a dynamic data structure. All right, let's finish up the discussion of indexes by looking at this graph I showed you last time. This is a comparison between the OpenBW-Tree that we built at CMU, the best-known skip list implementation in the world, which is from Australia, a B+tree written by the HyPer guy who did the ART index, using his optimistic latch coupling, the Masstree I showed on the last slide, and the full ART index, again from the HyPer guys. This is 50 million keys on a single-socket machine with 20 threads: 10 physical cores, doubled with hyperthreading. We have insert-only, read-only, read/update, and then scan/insert. Last class I only showed you the first three, and again, across the board the ART index crushes everyone. The skip list is always terrible; the BW-Tree may do better than the B+tree sometimes, but not always. The new number I'm including now is scan/insert. In this workload, some threads are scanning ranges in the key space while other threads are inserting. Now you see the B+tree crushes everyone here, because the scans are super fast: you just jump to the leaf nodes and scan along the bottom. The ART index does terribly; well, it does better than the skip list, but the BW-Tree actually beats it, because, again, there are no sibling pointers along the nodes at the bottom levels of the trie. I have to backtrack when I do scans, going back up and down the tree to walk my range, and that's why it gets the worst performance here. Masstree does badly here too, but that's more of an engineering issue with the way Eddie implemented scans; it's not necessarily indicative of the actual data structure, at least we don't think so. So now the question is about space. I made the claim at the beginning that the ART index can do vertical compression, where you only store the branches that actually have children, and horizontal compression, where you pack things in by using smaller node sizes based on the population at a particular level of the trie. So how does it compare in memory usage against the other data structures? For this one we have three different key types: monotonically increasing integers, so think of a serial key adding one over and over again; random integers from, I think, zero to 2^64; and the email list from the adult website that got hacked a few years ago, so these are real email addresses. The main takeaway is that the ART index crushes everything, with the exception of the random integers; I forget why the B+tree does pretty well there. In the case of monotonically increasing integers, it's significantly smaller than everyone else, because as you insert keys in increasing order, those keys are very similar to each other, just plus one, plus one, plus one, and it can pack a lot of data into a small number of nodes.
Emails are another one, too, because there's a lot of overlap once you reverse them. So if your email address is andy.pavlo@gmail.com, you store it as com.gmail@andy.pavlo, and now all these keys have the same "com." at the beginning, which is why you get way better compression. That's the standard trick people do when they store URLs and email addresses: you always want to reverse them. All right, so again, the ART index stores less data. The BW-Tree, because of the mapping table and the delta version chains, is always storing more. The skip list is always just really bad; I forget why in this one here. Oh, and Masstree: I think the reason Masstree is larger is that, because it's a trie of trees, all those B+trees have a bunch of extra pointers, whereas in the ART index, a regular radix tree, you don't have those additional pointers to keep track of what digits you have in each node. All right, to finish up real quickly. Basically, the last two lectures were me telling you how wrong I was about the BW-Tree and latch-free data structures. I'm of the opinion now that radix trees are interesting. Whether the ART index or the newer one called HOT is the right way to go, I'm not 100% convinced yet; for scans it obviously does worse, and there are some other issues. But if you have a solid B+tree that's well implemented, with prefix compression, suffix truncation, and the bunch of other optimizations we talked about at the beginning, I actually think that's the right data structure going forward for an in-memory database system if you want to do fast transactions, because it's sort of like the Toyota pickup truck of indexes: it does well in a bunch of different scenarios. All right. So next class, we're starting to talk about system catalogs. The way I think about it: now we know how to build indexes, so let's start building the rest of the system. We'll talk about catalogs, then data layout and actually storing tuples, and from there we'll start adding layers above the storage layer to execute queries and run transactions on top of that. We know how to run transactions correctly and efficiently; let's actually look at how to run queries now. Okay? All right guys, any questions? Take care.