So this is the second lecture we're going to do on indexes. Last class was all about the traditional latching and locking techniques you need for indexes. Today we're going to spend time talking about what's been in vogue for the last decade: building lock-free or latch-free data structures. I'm going to talk about just three data structures today. Even though the lecture is meant to be about latch-free indexes, only the second two, the skip list and the Bw-Tree, are latch-free. The first one, the T-Tree, is not latch-free; I'm presenting it for historical context, to understand what people have done for in-memory databases in the past. Most of our time will go to the Bw-Tree, but to understand certain design decisions it makes, we first have to understand what's going on in the skip list. All right. As we said last class, the original B+Tree from the 1970s was designed to deal with slow disks. With spinning-disk hard drives, it's all about sequential access, so you design the nodes in such a way that if you need to do a range scan along the leaf nodes, that turns into a bunch of sequential reads. But now we're saying we're going to store things entirely in main memory. So the question is: is the B+Tree still going to be the right data structure for us?
Did people try other things? If I know my data set fits entirely in main memory, would I choose a different data structure to maybe get better performance than the B+Tree? People did try that. Going back to the 1980s, there was some early work done at the University of Wisconsin on building the first in-memory databases. Of course, back then memory was super limited and super expensive, so they were really talking about databases in the size of megabytes. But at least they recognized that at some point DRAM would get large enough that we could have databases fit entirely in main memory. So Dave DeWitt and his students did a bunch of great early work building prototype databases based on the premise that everything is in memory. The data structure they ended up building was this thing called a T-Tree. Quick show of hands: who has ever heard of a T-Tree before? Nobody, that's fine, perfect. T-Trees are going to look like AVL trees, but the key distinction versus the B+Tree and the other data structures we'll talk about is that, since memory was so limited in their world back in the 1980s, they didn't want to store copies of keys in the index. Instead, you store pointers to the tuples. In the B+Tree, the Bw-Tree, or the skip list, whatever keys your index is built on, you make copies of them in the data structure itself. In the T-Tree, as we'll see on the next slide, you instead keep pointers to the original tuples, because making copies of the keys would waste too much space. So again, this was proposed in the 1980s at the University of Wisconsin–Madison, and in the 1990s, when people started building the first commercial in-memory databases, those first systems used T-Trees.
TimesTen is the most famous one. It was originally a project called Smallbase out of HP Labs; then it was forked off as a startup, the startup got renamed to TimesTen, and Oracle bought TimesTen around 2005. I think even today, by default you get a B+Tree if you use TimesTen, but there's a flag you can set to go back and use a T-Tree. There are some other databases out there designed for really extreme memory environments, like embedded devices, and those will use T-Trees as well. But nowadays there's no major commercial in-memory database that uses a T-Tree, and we'll see why in a second. So let's see what it looks like. The name T-Tree comes from the fact that its nodes are drawn to look like T's. Within a single node you have a bunch of pointers. First, you have the data pointers: these are just pointers out to the corresponding tuples they're mapped to. Say I'm building this node here with key five, key two, and key eight: I'm not going to store copies of those keys in the node itself; I just keep a pointer there. Think back to the 1980s: these keys could be 16 bits, and the pointers are probably 16 bits. So rather than storing the key plus the pointer, which is 32 bits, I just store the 16-bit pointer to get to the tuple I want. Then I also have pointers that let me go down to my children, as well as back up to the parent. Again, the T-Tree looks like an AVL tree, so it's not the case that a traversal always goes to the bottom and scans along leaf nodes. We may have to go back up, because keys live throughout the whole tree, not just at the leaves, and that's why we need a parent pointer: to know how to get back where we came from.
The only copies of the keys we maintain in the index are the node boundaries. Within a node, for the data pointers to the tuples I'm pointing at, I keep the min key, the lowest key they represent, which is a copy of the key behind this first pointer here, and the max key, this one here. In this toy example I'm only showing three pointers, so it's not that big of a win, but imagine storing 128 or 256 keys per node; now the min and max are a trivial amount of space. So let's see how you'd actually do a traversal. Here's a really simple three-node T-Tree, and then there's the data table with keys one through nine. Starting at this node here, say I want to do a lookup on K2. Again, these are all data pointers to the actual keys themselves, so when I do a comparison, I have to follow the pointers to the original tuple and compare that way. And I have my pointers down to my children, left and right. So the first thing I do is a comparison to see whether K2 is less than K4, because that tells me whether the key I'm looking for could be in this node's range or down the right side of the tree. In this case, K2 is less than K4, so I know I need to follow the child pointer and go down here. Now I check whether K2 is less than K1, the min boundary; it's greater than that. Then I check my max boundary to see whether K2 is greater than it; it's not, so I know the key falls within this node's range. Now I just do a scan: follow each pointer and check whether the thing I'm looking for equals the key over in the actual data table. K2 doesn't equal K1; then here K2 equals K2, and we've found our match.
There's obviously a bunch of optimizations we can do here. These entries will be sorted, just like in a B+Tree node, so I can do binary search to jump around. And if I was looking for, say, K3, and I found that the boundary key was equal to the thing I was looking for, then I know I don't need to search at all; I just jump to it. Those are obvious optimizations, but the main takeaway is that you always have to follow the pointer to the original key to do your comparison. Is this clear? Now if I want to do a range scan, say from K3 to K6, I don't have a slide for this, but I'd do the same search, come down here, and scan along to find the keys I want. When I reach the end, I recognize that the upper bound of my range scan is greater than this node's max, so now I have to go back up, continue scanning, and potentially go down here as well. You can't make this latch-free, because splits and merges require you to update pointers in two memory locations at the same time, so you have to use traditional latching techniques. The advantage of T-Trees, and why people thought they were a good idea back in the day, is obviously that they use less memory, because you're not copying the keys; you leave everything in the data table itself. The downside is that it's difficult to rebalance, because you can have splits and merges coming from the top down and from the bottom up, and you need to reconcile them; we saw the same problem when we talked about B-Trees last class. And of course, every lookup has to chase a pointer into the table itself, and that indirection hurts your instruction pipeline: you're following a pointer, jumping to some other location, and doing the comparison there, and that's going to be slow.
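To make that traversal concrete, here's a minimal single-threaded sketch in Python. The names (`TTreeNode`, `ttree_lookup`) and the layout are my own for illustration, not from the original Wisconsin papers. The point it shows is that a node stores only slot pointers into the data table plus copies of the min/max boundary keys, so every comparison inside a node chases a pointer back to the table:

```python
class TTreeNode:
    def __init__(self, slots, table, left=None, right=None, parent=None):
        self.slots = slots                 # sorted "data pointers": indexes into the table
        self.table = table
        self.min_key = table[slots[0]]     # the only key copies kept in the node
        self.max_key = table[slots[-1]]
        self.left, self.right, self.parent = left, right, parent

def ttree_lookup(node, key):
    while node is not None:
        if key < node.min_key:
            node = node.left               # below this node's range
        elif key > node.max_key:
            node = node.right              # above this node's range
        else:
            # In range: follow each data pointer back to the table to compare.
            for slot in node.slots:
                if node.table[slot] == key:
                    return slot
            return None                    # bounded by min/max but not present
    return None
```

A sorted node could use binary search instead of the linear scan, as mentioned above; the indirection back to `table` is the part that doesn't go away.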
Now, you have to understand, back in the 1980s memory was quite limited, so this sort of made sense. The other aspect, which will matter when we talk about the Bw-Tree in a second and a little bit with the skip list, is that this is also not very cache-friendly, because of all the indirection to look up another location in memory. Back in the 1980s, the speed difference between CPU caches and regular memory was not as significant as it is today. In their world, taking a cache miss wasn't a huge problem like it is in ours; it wasn't orders of magnitude slower. So chasing all these pointers didn't hurt them as much. You can also argue that CPUs were much simpler then and weren't doing branch prediction as aggressively as they do now, so some aspects of this approach weren't as problematic on their CPU architectures either. So again, T-Trees only appear in really small embedded devices and in some of the older in-memory databases. Nobody actually uses them today, but I think it's at least interesting to know, because at some point in your life someone might ask: didn't they try to build in-memory indexes back in the 1980s and 1990s? Shouldn't we be using that? I remember at a conference one time a guy asked, why are you using B+Trees, shouldn't you be using T-Trees? No; the answer is no, okay? All right, any questions about T-Trees? All right, so let's talk about latch-free data structures. For the ordered indexes we're trying to build here, the easiest way to implement one, the dumbest way, is just a sorted linked list. You have different locations in memory, and each element in the linked list has a key followed by a pointer to the next element.
This is the easiest thing for us to maintain, and it's the easiest to make latch-free, because any time we want to add a new entry into our index, we just do a compare-and-swap on one memory location and splice in our new node. Of course, the problem is that scanning this sucks, because it's always a linear search; you have to start from the beginning and scan along. So what's one really simple way to speed this up? Instead of scanning every single entry, what if I had some extra pointers that let me skip every second one? Say I'm looking for key three: rather than visiting key one, key two, key three, I look at this upper pointer and, oh, I can jump straight to key three and I'm done. And I can keep going: I can add another layer above this, and now instead of skipping every second entry, I skip every fourth one. So if I want to find key five, instead of scanning along the bottom, I jump over entries using the top-level list. This is basically what a skip list is; that's all it is. A skip list is just a bunch of linked lists at different levels, and at each level as you go up, you have fewer entries, so you can jump over more things. If you squint, or maybe rotate it, it's basically a B+Tree: in a B+Tree you have those guide posts in the inner nodes at the top, and they just let you skip over a bunch of stuff instead of doing a linear scan along the bottom. It's just organized in a slightly different way. Skip lists are considered a probabilistic data structure because, to decide whether to add a pointer at the upper levels, you flip a coin: you pick a random number, and if it evaluates to true, then as you insert a new key you add an extra pointer at the next level up.
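Here's a rough sketch of that compare-and-swap insert on a sorted singly linked list. Python has no real atomic CAS, so the `cas` helper below simulates one on a single pointer; in a real system this would be an atomic instruction. The retry loop around the failed CAS is the pattern that makes the operation latch-free:

```python
class Node:
    def __init__(self, key, next=None):
        self.key, self.next = key, next

def cas(node, expected_next, new_next):
    # Stand-in for an atomic compare-and-swap on node.next.
    if node.next is expected_next:
        node.next = new_next
        return True
    return False

def insert(head, key):
    while True:                            # on CAS failure, retry from scratch
        prev, curr = head, head.next
        while curr is not None and curr.key < key:
            prev, curr = curr, curr.next
        if curr is not None and curr.key == key:
            return False                   # duplicate key: a logical failure
        new_node = Node(key, next=curr)
        if cas(prev, curr, new_node):
            return True
        # CAS failed: another thread changed prev.next; loop and retry.
```

Note the distinction the lecture draws later: the duplicate-key case is a logical failure that bubbles up, while a failed CAS is invisible to the caller and just retried.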
Whereas with the B+Tree, you just insert at the bottom, and if you have to do a split, that modifies which guide posts are at the top. What the skip list gives you is that all changes from an insertion are localized. Meaning, going back to the earlier picture: if I need to insert key 2.5, the only thing I have to do is update this one pointer to point to me; it doesn't affect anybody else. In a B+Tree, if those keys are packed into the same node and it can't take a new key, I have to do a split, which may reorganize a large part of the tree. So in a skip list, all the changes end up being localized. And although it's a probabilistic data structure, on average we get approximately O(log n), the same asymptotic guarantees you'd get from a B+Tree, even though it's non-deterministic. So let's look at a full example. The first thing to point out is that we have a begin and an end marker. On the begin side, those are the entry points for the linked list at each level, and at the end you have some special marker, infinity or some bit pattern, that says: if you reach this point, this is the end of this level. Then for the different levels, you have the probability that a key exists at each level. At the very bottom the probability is one, because every key must exist at the bottom level; that's the ground truth of what keys are in our skip list. As you go up, the probability is divided by two at each level. So along the bottom, these are all our keys, and above that we have the pointers that span multiple keys from the levels below. At this point in our first example, there's nothing in the upper levels yet.
In skip-list parlance, this little vertical stripe here, where you have entries for the same key at multiple levels, is called a tower. So at this level here I have K2: it has a pointer that goes horizontally to the next key at this level, and it also goes vertically down to its own key. You can't have K2 point down to K3 from one level to the next; a tower always points to the same key. All right, so let's look at an example of how to actually manipulate this. Say we want to insert K5 right here. The very first thing we do is allocate the nodes for this new key. We flip a coin, and every time we get heads, we add the key to another level; once we get tails, we stop going up the tower. So say we flip a coin once and get heads, so we add it here; we flip again, get heads, and add a node here; we flip a third time and get tails, so we stop. Our tower will have three entries. At this point we've allocated our nodes, but nobody else can see us yet, because this pointer at K4 still points to K6, this one still points to the end marker, and at the top level that pointer still points to the end as well. So now what we have to do, and we'll see in a second how to do this without latches, is flip those pointers to point to us. And we're going to do it from the bottom to the top, because that way anybody scanning along the bottom is guaranteed to see us; we don't want the case where someone comes along, sees K5 at an upper level, and the key isn't actually fully in the structure. All right, so now say you want to do a search, a lookup on K3.
My cursor starts at the begin marker at the very top level. I look at whatever this pointer points to and do a comparison. I'm looking for K3: the next key at this level is K5, and if it were less than the key I'm looking for, I'd move my cursor across; but K3 is less than K5, so everything past this pointer is too big, and I drop down to the next level instead. I do the same thing: look along this pointer and see what key it points to. Here K3 is greater than K2, so I know that the key I'm looking for, if it exists, has to be at or after this point in the skip list; I don't care about anything that comes before it, because this guide post told me where to start my search. So I move over and look ahead again: K3 is less than K4, so I know I don't want to go there, and I go down. Now I just scan along the bottom level until I find what I want. Once you reach the bottom, you're doing linear search, but the towers let you jump over elements without having to look at everything. So what are the advantages of skip lists? Is this clear so far? Who here has heard of skip lists before? All right, about half, okay. With skip lists, the first advantage is that they typically use less memory than an off-the-shelf B+Tree without any compression or prefix truncation, because you're not storing a bunch of extra pointers the way you would in a B+Tree. Now, in the example I'm showing here, and in the latch-free version we'll talk about, the linked lists only go in one direction. So if you have to do a reverse scan, like an ORDER BY in descending order, you have to do something special, because you don't have pointers going in the reverse direction. It's always one direction, and that's part of how they get the reduced size. And as we already said, insertions and deletions don't require rebalancing.
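The descent I just described can be sketched like this (a plain single-threaded illustration; `SkipNode` and the per-level `next` array layout are made up for the example). At each level you move right while the next key is still smaller than the target, then drop down a level; only the bottom level is scanned linearly:

```python
class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height        # next[i] = successor at level i

def search(head, key):
    node = head
    for level in reversed(range(len(head.next))):  # start at the top level
        # Move right while the next key at this level is still too small.
        while node.next[level] is not None and node.next[level].key < key:
            node = node.next[level]
        # Then fall through and drop down a level.
    node = node.next[0]                    # candidate at the bottom level
    return node if node is not None and node.key == key else None
```

The begin marker from the slide corresponds to `head` here, a sentinel with a full-height tower; the end marker is just `None`.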
When I did that insert, all I did was flip the pointers of the keys that appeared before me at each level. I didn't have to restructure or reorganize the entire data structure, so that's nice. And, as we'll see in a second, we can implement a thread-safe concurrent skip list using only compare-and-swap instructions, without any latching. The way we do that is: when we flip those pointers, we use compare-and-swap on them, we do it in the right order, and that guarantees our operations are done atomically. If a compare-and-swap fails, we know somebody else came along and inserted something in the spot we were about to use, so we stop what we're doing, go back, and try again. Remember, we said this before: when we do modifications to an index and the operation fails, we want to go back and try it again. The exception is something like trying to insert a key that already exists; that's a primary-key violation, a logical failure, and it has to fail. But when a compare-and-swap fails, I don't want to abort the transaction, because the transaction doesn't know or care that a compare-and-swap failed inside the index; it just wants its keys in the index. So if the compare-and-swap fails, the thread goes back and retries the whole operation, over and over if necessary. That's what makes this latch-free: you're not taking a latch on anything as you make the changes, and it's all hidden from the upper parts of the system. So what I mean here is that a transaction can only abort due to higher-level conflicts, like two transactions trying to insert the same tuple with the same primary key.
I don't care about conflicts at the level of the data structure itself: if two threads try to update the same pointer, that's something we just retry, and it's all hidden from the rest of the system. All right, so let's go back and see how we actually do the insert for real. We want to insert K5 here. As I said, we can allocate the memory for K5's nodes, but the existing nodes still point around it; nobody knows about it yet. At this point, although I'm showing K5 in the visualization, no thread can actually see it, because a scan would follow the pointer from K4 straight to K6. So before I start doing compare-and-swap on the existing pointers, I first make sure my tower is all linked together internally, and that my own next pointers are set up. This doesn't need to be done atomically; you just write the memory as you set things up. Same thing going in the other direction. And I know that if I'm inserting K5 between K4 and K6, then if anybody modifies K4's pointer before I do, it means some node has appeared in between, or K6 has gone away, and therefore I shouldn't be pointing at K6 anymore; I have to retry my operation, come back, and figure out who got there before I did. So I set up my pointers where I think they should go, without any atomics or latches, and I know that if the bottom compare-and-swap fails, something else happened and I have to go figure out what's going on. All right, so I start at the bottom and I want to do a compare-and-swap on this pointer here. The way I found it was that, in order to do my insert, I had to scan along and do a search to figure out where I should be.
So now I know the node before me and the node after me, so I know which pointer to compare-and-swap. If this compare-and-swap succeeds, this key is now fully visible in the system. What could still happen, though? Say someone is looking for K5 and I've only done this bottom link. They could scan along up here, and since this upper pointer still points to infinity, they'd come down and then they'd see me. So that's okay. But say that upper pointer pointed to a K6 tower instead. Someone comes along, scans, sees K6, and thinks, oh, there's no K5. Is that okay? Is that allowed? [Student: They'd still have to drop down and verify, because a key only exists at that level with some probability.] Right, your point is that the searcher has to step down and verify, because there's only a 50% chance a key appears at that level. And yes, K5 is a bad example for what I was after: at K4, even if it pointed to K6, K5 is less than K6, so the search would still drop down and see it. That's fine; I'll come up with the actual corner case in a second. So at this point, once I've done the compare-and-swap at the bottom, everything is fully installed as far as visibility goes. Then we just go up: compare-and-swap this pointer to point to us, compare-and-swap this one, and now our tower is fully installed and everyone can see us. Okay. Deletes are a little trickier, because we don't want to physically remove a node from memory while a thread could still be pointing at it.
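Before we get into deletes, the insert protocol we just walked through can be sketched as follows. This is a hedged single-threaded simulation: `cas` stands in for the atomic compare-and-swap, and I pass the tower height in explicitly where a real insert would flip coins. The key property is the order: the bottom level is spliced first, so the moment that CAS succeeds, any scan of the ground-truth level can find the key, and a failure at any level is handled by re-finding the predecessor and retrying:

```python
class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height        # next[i] = successor at level i

def cas(node, level, expected, new):
    # Stand-in for an atomic compare-and-swap on node.next[level].
    if node.next[level] is expected:
        node.next[level] = new
        return True
    return False

def find_pred(head, key, level):
    # Descend from the top down to `level`, returning the predecessor there.
    node = head
    for l in reversed(range(level, len(head.next))):
        while node.next[l] is not None and node.next[l].key < key:
            node = node.next[l]
    return node

def insert(head, key, height):
    new_node = SkipNode(key, height)
    for level in range(height):            # splice bottom-to-top
        while True:
            pred = find_pred(head, key, level)
            new_node.next[level] = pred.next[level]  # pre-link; not atomic
            if cas(pred, level, new_node.next[level], new_node):
                break                      # installed at this level
            # CAS failed: a concurrent change happened; re-find and retry.
```

In this single-threaded simulation the CAS never actually fails; the retry loop shows where a concurrent implementation would recover, as in the race the questions below walk through.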
This is going to sound a lot like the garbage-collection stuff we'll cover for MVCC. We don't want the garbage collector to prune out expired old versions while some thread could possibly still see them. In the MVCC case we were worried both about threads logically not seeing a version they should see in their snapshot, and about dangling pointers, where a thread jumps to a freed memory location and gets screwed up. Here we're only worried about the physical problem. So what we're going to do is: if you want to delete a key, we logically mark it as deleted. Technically it's still in the index for a little while, and people can scan along and see our key even though we deleted it. Then we go and flip the pointers to route around it. And at some later point, the garbage collector comes along and, once we know no thread could still be looking at it, cleans it up. We'll spend more time next class on garbage collection in these kinds of systems and indexes, and we'll talk a little about how garbage collection works in the Bw-Tree later in this lecture; the concept is the same. So what we do now is introduce this little delete flag on every node along the bottom level. We don't need it at the upper levels, because those are just guide posts; the bottom level is always the ground truth of what's in our index. Say we now want to delete K5. Our thread does a search, finds K5 at the bottom level, and just flips this flag to true. Then it tells the garbage collector: hey, by the way, I deleted this node; once you know nobody else is looking at it, you can clean it up.
Actually, before the delete finishes, it also goes through and flips the pointers to route around the node. For this we can start from the top, go down, and do a compare-and-swap at each level. Just like before, if a compare-and-swap fails, we know somebody else did something, either routed around K6 or added a new entry like K4.5. So we repeat the process: figure out what's actually there now, where our K5's predecessor is, and do the compare-and-swap on that predecessor to modify it correctly. So it's a two-stage process: first I mark the node as deleted, so any thread scanning along the bottom that sees it looks at the flag and knows to ignore it; it's not really there. Then, to make sure I can physically remove it, I do the compare-and-swaps so that it's physically decoupled from the index. And at some later point the garbage collector says: all right, I know no thread can possibly be looking at this, let me go ahead and clean it up. [Student: Going back to the insert: say the compare-and-swap on the bottom level succeeds, but the compare-and-swap at the second level fails. Does that mean you completely retry, or just retry from that level?] All right, so the question is: my first compare-and-swap succeeds, and then the one above it fails. Here's the issue: it could be a race condition. Say somebody is trying to insert 5.5 while I'm trying to insert 5. If my node gets in first, then his compare-and-swap at the bottom would fail. So say I succeed on the bottom, and my node is now linked in.
And then, for whatever reason, my thread gets stalled, and he comes along and installs 5.5. But at that point I already have a handle to the pointer I think I should be updating; I know the memory location from when I read it the first time. So when I try to do my upper-level compare-and-swap, it fails, and I say: what's up with that? Let me go do a search again, find the thing I inserted, because I know that succeeded, and work my way back up to figure out what actually comes before me now. In that case I'd see that K4's pointer, which I thought pointed to the end, now points to 5.5, and I just compare-and-swap to insert myself after that. Once my node is in at the bottom, it's fully in there and everyone can see it; I just need to make sure I link myself in correctly going up. [Student: In the case of deletions: say somebody is inserting 5.5 while I'm deleting 5. They update my pointer while I'm looking at K4's pointer. So both compare-and-swaps succeed, but 5.5 ends up outside the chain.] All right, so the statement is: someone inserts 5.5 and compare-and-swaps my pointer, so I'm now pointing to it. Meanwhile I do my delete compare-and-swap on K4's pointer, swinging it to K6, because I think that's what should come after me, and that succeeds, because nobody touched K4's pointer. What should happen here? [Student: Should you then look at yourself again and see if your pointer changed?] Yes, exactly. This is the example I was trying to come up with earlier. There's a difference between serializability and linearizability. For serializability, we don't want transactions to see things they shouldn't see.
In terms of linearizability, we can insert things and have the structure be temporarily inconsistent, with people briefly unable to see things we just inserted; that's okay. So there's a brief window here: somebody else flips my pointer, so I now point to 5.5; then the delete compare-and-swap on K4 goes through, routing to K6, and 5.5 is missing from the chain. But I can then go look at myself, see that my own pointer isn't pointing where it should be anymore, and go ahead and fix things up with more compare-and-swaps; then everything is correct. [Student: But what about a transaction running in that window? If the 5.5 guy flips my pointer, then someone scans along, hits K4, gets routed to K6, and misses 5.5. They should have seen it, but they didn't.] Unless you're running with serializability, that's okay. If you are running with serializability, then you should have seen it, and there needs to be some extra mechanism to make sure that doesn't happen. So again, if you do what Hekaton does, you just run the scan again at validation time; now the scan sees 5.5, you recognize you have a phantom, and the transaction aborts because it fails validation. [Student: But how would you eventually find out that 5.5 existed, if the rescan still takes the 4-to-6 route?] Right, in that case I don't see it again. [Student: What about reference counting, keeping track of who points to you?] You can't do that, because then you'd have to do atomic updates on two memory locations at once, and you can't do that.
The question of, should I be able to see that? Well again, so now think of like HyPer. So HyPer was doing the precision locking, so there would be a delta record with 5.5, and now when I do my validation, I would see that I should have seen that and I didn't, and therefore that would violate serializability. But that's okay from, again, the data structure point of view, it's okay. From the higher-level concept of transactions, that is not okay. We have to do extra stuff to make sure that doesn't happen. All right, so I wanna jump ahead to the BW-Tree because, to me, this is a more interesting data structure. Skip lists are terrible, don't use them. Okay. All right, so this is exactly the point I was trying to make here: because the compare-and-swap can only update a single address at a time, this sort of limits what you actually can do in your data structure. In particular, we couldn't do that reverse search, because we can only have pointers going one direction. We can't update two memory locations at the same time. All right, also because of this, we can't have a latch-free B-plus tree, because the B-plus tree has pointers all over the place. So now if I have to change the memory location of a node because of a split or merge, and I gotta update a bunch of pointers to it, I can't do that atomically without taking latches. So for this reason, the canonical B-plus tree, as we teach it in the introduction class — you can't make that latch-free, even though everything's in memory. So the way to get around this limitation, if you want a B-plus tree, is you can introduce an indirection layer that's gonna allow you to update multiple addresses, logical addresses, all with a single compare-and-swap on a single memory location. And that's the main idea of what the BW-Tree is. So the BW-Tree came out of the Hekaton project. I think I mentioned this before.
They first started building Hekaton using skip lists, did some benchmarks and realized skip lists weren't that great, and then they ended up building this new data structure called the BW-Tree. The BW stands for buzzword. The idea is it's a latch-free, log-structured, and-so-forth index. So the original paper came out in 2013, and I'll talk about this in a second, but when I read it, I was like, this is awesome. I had just signed up to come to Carnegie Mellon, we were gonna build a new database, and I said, let's build a BW-Tree, right? So the first time I taught this class, in like 2016, one of the projects we had the students do was actually build a BW-Tree. That was a nightmare because it was way too hard, and I learned that it was way too hard. And then in previous years, we did skip lists, but I don't think that's interesting, so we ditched that. Anyway, this original paper — I think it does a good job of describing at a high level what the BW-Tree is, but the nitty-gritty details, when you actually need to implement it, they're not in this paper. So the paper that you guys read was sort of our attempt at writing the missing guide on what you would need to actually build a real BW-Tree. But then what happened was, as we implemented it and then benchmarked it, it got crushed by everything. And so the paper you guys read sort of has this split brain, because the first half of the paper is like, hey, here's how to build it. The second half is, oh, by the way, it sucks, right? Because as we were writing it, we were doing experiments like, oh man, we can't write this — we gotta be upfront, this thing's not that good, and here's why, right?
But again, you can get bits and pieces of how to build the rest of the BW-Tree, the other things you need, from a bunch of other papers that came out of Microsoft for this other system called Deuteronomy, but it's sort of, again, scattered across these different papers. You have to know what you're looking for to find it. And our paper is meant to be the single source that describes everything. So there's two main ideas, or main takeaways, you have to have for how the BW-Tree is implemented. So the first is that they're gonna use deltas to record changes made to single nodes. So you're not allowed to do any in-place updates to pages or nodes once they're created, right? Contrast that with the B-plus tree. In a B-plus tree, I allocate a node, a bunch of space, and I can add new entries into it and take things away, you know, as needed. So their argument was that this will help reduce cache invalidation. I don't buy this argument, because when you do deltas, those deltas get propagated between cores. So this is slightly — this is not really true. Then, to make it latch-free, they're gonna introduce a mapping table, this indirection layer, that's gonna map logical node IDs, or logical page IDs, to physical locations in memory. So now if I wanna change the location of a page, all I have to do is compare-and-swap on a single memory address in my mapping table, and that updates all my pointers. So quick show of hands, who feels like they understand the BW-Tree after reading that paper? Sort of one, nobody, okay, that's fine. No, it's a hard data structure, right? People think latch-free data structures are sort of magically gonna run faster. No, and they're also much harder to implement. Skip lists are super easy, right? It's just a bunch of linked lists and you flip a coin. The BW-Tree is a way more complicated beast, because it's the corner cases that fuck you up.
Okay, so this is a really simple example of our BW-Tree. We have three nodes: one root, and then two leaf nodes. So the first thing is that every single page, every single node, is gonna be assigned a logical node ID, right? Page 101, 102, 103, yeah. 103, no, it is 104, sorry. And then in our mapping table, we're gonna have a map from the logical page ID to the physical location in memory where that page exists. So the way I'm demarcating this is that the dark line is the physical address, a physical pointer, and the red dotted lines are logical pointers. So the way that we're gonna organize the tree, to keep track of my children and my siblings, is that I'm just gonna store the logical page ID instead of the actual physical pointer. So now if I wanna do a scan along this thing, say I start at page 101 and I wanna jump down to my child 102, I do the lookup in the mapping table and say I want 102, and then I get the physical address and now I know where to jump to, right? Pretty simple, but it gets harder. So now also, because we don't wanna have in-place updates to our pages, we're gonna introduce these delta chains, where we append modifications that occur to the elements within a page to this linked list above it, right? So every single time I wanna do an update to something in my page, I create a new delta record. So in this case here, I wanna insert key zero into this page, so I create a new delta record that just records that I'm inserting this key. This delta record's gonna have a physical pointer to the base page, right? To the head of the base page here, whatever this thing's pointing to. And then, to install this update, I go to the mapping table and I do a compare-and-swap to change the physical pointer to now point to my delta record instead of the base page. If that compare-and-swap succeeds, then my modification is now fully installed in the page.
And any other thread that comes along and says, I wanna get to page 102 — they do the lookup into the page table, they would get a pointer to this delta record, there'll be a little bit in the header that says you're looking at a delta record on this page, and then it knows how to interpret whatever information is being recorded in that delta record to find whatever it needs to find or to do whatever it is that it needs to do. So now let's say I wanna do a delete on key eight in my base page. Same thing, I follow the mapping table, it would tell me that I'm pointing here, this is the current head of the delta chain. So the physical pointer for this new delta record I wanna install points to this guy here, not the base page. Then same thing, I do a compare-and-swap, and if it succeeds, now my delta record has been installed. All right, is this clear? All right. To do searches, again, as I said, first I'm doing a traversal like a regular B-plus tree, but now when I do my lookup, if I land on a delta chain, then I have to start looking to see what those delta records are and see whether they correspond to the key that I'm looking for, right? So in this case here, say I was looking for key zero: I go through the mapping table, I get to the head of the delta chain, the first delta record says delete K8, that's not the one I'm looking for, so I skip that, I go down here, it says insert K0, that's the one I want, so I know I'm done and I can return just as if I had scanned the bottom or something like that. If you get through the whole delta chain and you get down to the base page without finding the key you're looking for, then you just do binary search on the keys, because internally this looks just like a B-plus tree node — you're gonna have separator keys and the arrays for the values — so I just do binary search until I find the key that I want, right?
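The install-and-lookup flow just described can be sketched in a few lines. This is a single-threaded Python illustration under assumed names (`Delta`, `BasePage`, `mapping`, `install`, `lookup` are all made up); the comment marks where the real compare-and-swap on the mapping table would go.

```python
import bisect

class Delta:
    def __init__(self, op, key, nxt):
        self.op, self.key, self.next = op, key, nxt   # op is "insert" or "delete"

class BasePage:
    def __init__(self, keys):
        self.keys = sorted(keys)                      # like a B+tree leaf

mapping = {}   # logical page id -> physical pointer (head of delta chain)

def install(pid, op, key):
    head = mapping[pid]                # read the current physical pointer
    # In real code this assignment is a CAS on mapping[pid] that can fail.
    mapping[pid] = Delta(op, key, head)

def lookup(pid, key):
    rec = mapping[pid]
    # Walk the delta chain first; the newest record for a key wins.
    while isinstance(rec, Delta):
        if rec.key == key:
            return rec.op == "insert"  # a delete delta hides the key
        rec = rec.next
    # Fell through to the base page: binary search, just like a B+tree node.
    i = bisect.bisect_left(rec.keys, key)
    return i < len(rec.keys) and rec.keys[i] == key
```

Note that a lookup never touches the base page for a key that has a delta: the chain is scanned newest-first, which is why a delete delta near the head can hide a key that still physically exists below it.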
So again, what's really nice about this mapping table is that since this is sort of the ground truth of what the physical pointer is for a single page, all I need to do is compare-and-swap on that, and that determines who actually gets to succeed and install updates, right? So let's say I go back here, remove that delete, and I've only inserted key zero in here. So now I have two threads: one guy wants to install key eight and another guy wants to install key six. Again, they both do the compare-and-swap on this location, and only one of them will succeed, right? Because they both did a lookup and said, well, what is the current address in this? And then do the compare-and-swap on that. If, say, the first guy succeeds and the second guy fails, then I blow away my delta record — or, you don't have to blow it away, but you figure out what got installed in front of you and then retry, and maybe do the compare-and-swap on the new address, right? Whether you have to start from the beginning again, or whether you can just recognize, oh, well, this is where the head is now, let me just go back and do that again — that depends on the implementation. So what's the problem with these delta chains? Yes? They get long. They get long, exactly, yes. What's that? There we go. So a delta page only has one delta in it, or? So his question is, does a delta record — you're saying delta page — only have one in it? In our implementation, no, we'll see this in a bit. The easiest way to think about this is that these are just mallocs, some chunk of memory, and this is just a linked list pointer, but you don't actually want to implement it that way because it's way too slow, right? So as he said, these delta chains can grow infinitely, so at some point we want to consolidate them, right? So the way it basically works is that, in our implementation, the number of delta records you have is fixed.
In the original BW-Tree paper from Microsoft, they say that these things can grow arbitrarily, and therefore you just have a threshold to say, at some point, this is when I want to consolidate. So there's no background garbage collector thread doing this. It's just, as threads scan along, if they recognize, oh, this delta chain has gotten pretty long, then at that point in time, as they're doing the scan or doing the lookup, they try to do the consolidation. So in this case here, this guy gets too long, so a thread's gonna do consolidation. So the very first thing you do is just copy the original base page into a new page, right? And then you apply the delta records in reverse order. So I'll first do the insert down here, and then the insert up there. Can anyone guess why you do this in reverse order? Consistency, correct, right? So let's say I insert key zero here, and then up above this, there's a delete key zero. If I go from the top to the bottom, then I would have key zero when it shouldn't be there. So going in reverse order replays them correctly and puts it back in the correct state. All right, so now at this point, after I've applied all of my changes, I just do the same thing I did before. I go back to the mapping table, do a compare-and-swap, and if that succeeds, then I know that my consolidated node has been installed. Yes? So is it the thread that, for example, is looking up key zero, that recognizes that it's time to consolidate, and so it's that thread that does it? The question is, what is the mechanism, what is the trigger to recognize that I need to do consolidation? And which thread does it? So I think in the original paper, it's whoever reaches the bottom and recognizes the thing's long enough, but you still could end up missing it. I forget what the exact mechanism is. I think you could just keep a counter in this thing to say I'm at offset zero, I'm at offset one.
So if you reach the top one when you start, you say, this is the 12th delta record in my chain, I should go ahead and consolidate. And the person that does it is whoever just finds it. So this is, again, another advantage of the compare-and-swap, because I can have two threads come along and say, oh, this thing's super long, we're going to consolidate it. So they both end up doing it, only one compare-and-swap will succeed, and that's fine. The other one's work is just wasted. But again, this will be a recurring theme in latch-free data structures. It's wasted work, but you can't avoid that, right? The idea is that the amount of work you're wasting is less than the overhead of having to stall on latches, right? Whether that's true or not depends. Yes. So what if, right before you do the compare-and-swap on the consolidated node, someone has just inserted into the old node? Okay. So then that insertion is missed when you consolidate the node. Okay, so his question is, say I'm here, I've already done my consolidation, I've applied all my delta record changes, right? And now, before I do the compare-and-swap here, somebody else inserts key six, right? What would happen? I would try to do the compare-and-swap on this, but this is no longer going to point to that. I know where I started, because I had that pointer, and I went in reverse order and added all those guys. So now when I do the compare-and-swap on this, and the head is no longer the insert-key-five record I saw but a new insert-key-six record, I know I didn't see that and I missed it. So therefore I know that this thing is not in sync with what should be there, so I can be smart and try to pick up where I left off, or what I'm missing, or I just do the whole thing all over again.
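The consolidation steps above — copy the base page, replay the deltas oldest-first (i.e. in reverse chain order), then compare-and-swap the new page into the mapping table, retrying if a new delta snuck in — look roughly like this. A single-threaded Python sketch with made-up names; `cas_mapping` stands in for the real atomic CAS.

```python
class Delta:
    def __init__(self, op, key, nxt):
        self.op, self.key, self.next = op, key, nxt

class BasePage:
    def __init__(self, keys):
        self.keys = sorted(keys)

mapping = {}   # logical page id -> head of delta chain / base page

def cas_mapping(pid, expected, new):
    """Stand-in for an atomic CAS on the mapping table slot."""
    if mapping[pid] is expected:
        mapping[pid] = new
        return True
    return False

def consolidate(pid):
    while True:
        head = mapping[pid]
        # Walk the chain, collecting deltas until we hit the base page.
        deltas, rec = [], head
        while isinstance(rec, Delta):
            deltas.append(rec)
            rec = rec.next
        keys = set(rec.keys)
        # Replay oldest-first, so e.g. an old insert followed by a newer
        # delete of the same key ends in the correct (deleted) state.
        for d in reversed(deltas):
            if d.op == "insert":
                keys.add(d.key)
            else:
                keys.discard(d.key)
        new_page = BasePage(keys)
        # If someone prepended a delta while we worked, the head moved,
        # this CAS fails, and we redo the consolidation.
        if cas_mapping(pid, head, new_page):
            return new_page
```

The failed-CAS branch is exactly the "somebody inserted key six while I was consolidating" case from the question: the retry makes the missed delta visible on the next pass.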
Again, the compare-and-swap guarantees we catch this: if someone else inserts something, this thing will no longer point to the thing I thought it should point to, right? Hopefully my excitement is coming off here — this is why we ended up building it. It's like, damn, this thing's so awesome, it solves everything... but then it doesn't. All right, so at this point here, my compare-and-swap succeeds, and I then tell the garbage collector that this thing is old, right? And at some point it should reclaim the memory. Of course, I don't wanna do it just willy-nilly, because I don't know what threads could be looking inside of it. So the way we're gonna do garbage collection in the BW-Tree is through epochs. If you're coming from an OS background, Linux uses something very similar, they call it RCU, right? And I would say also, this epoch-based garbage collection we're talking about here is not specific to the BW-Tree; you can do the same thing, and people usually do the same thing, in a skip list, and we'll see other data structures that use something very similar. All right, so the basic idea is that all the operations we're gonna do in our BW-Tree are gonna be tagged with whatever the current epoch is. And think of the epoch as just a counter, a logical counter where you add one to it every so often — think every 50 milliseconds, it doesn't matter. So when a thread enters the system, it registers itself with the garbage collector, and the garbage collector keeps track of all the threads that are currently inside my index at the current epoch. And then as they make changes to the index and have garbage that needs to be reclaimed, they register that garbage with the garbage collector and say, within this epoch, I created these old pages, go ahead and clean them up.
And then the garbage collector knows that once all the threads have left the current epoch and any epoch that came before it, no thread could be looking at the physical memory that it's gonna reclaim, and therefore it's safe to remove it. Same idea we saw with garbage collection for MVCC. If I know no other thread could be looking at some old versions, at these old snapshots, it's safe for me to reclaim them. All right, so let's look at an example here. So this is the same one we had before, right? We have one thread running on CPU one, and it's gonna do the consolidation of this page down here, right? So again, when it enters the system, it tells the garbage collector, I'm a thread, I'm gonna do some stuff, here I am. And for this one, we're just gonna have one epoch, it's fine. Another thread comes along and it's scanning this delta chain, right? Same thing, it gets registered. So after I do my compare-and-swap on this page here and this thing is now fully installed, any other thread that comes in after this epoch will see this, it won't see that. But I don't know where thread two is, I don't know what it's looking at — whether it's looking at this delta chain or at another part of the tree. I don't know and I don't care, because I don't wanna track its fine-grained accesses. I'm just saying it's in this epoch and that's good enough. All right, so then I register all this memory to be reclaimed with the epoch table in the garbage collector. I go away, I deregister myself; this other thread continues to scan, and at some point it's done, it tells the garbage collector, I'm done, I'm leaving the index, and now it knows it's safe to reclaim this. So there's some mechanism, again, where a thread will just say, all right, it's been 50 milliseconds, let me increment the epoch counter, right?
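The register/retire/reclaim protocol just walked through can be condensed into a toy epoch-based reclaimer. This is an illustrative single-threaded Python sketch — the class and method names (`EpochGC`, `enter`, `retire`, `tick`, `reclaim`) are made up, and a real implementation would use atomics and per-thread state rather than plain dicts.

```python
class EpochGC:
    def __init__(self):
        self.epoch = 0
        self.active = {}      # thread id -> epoch it registered in
        self.garbage = []     # (epoch the item was retired in, item)

    def enter(self, tid):
        """A thread registers itself when it enters the index."""
        self.active[tid] = self.epoch

    def leave(self, tid):
        """A thread deregisters itself when it leaves the index."""
        del self.active[tid]

    def retire(self, item):
        """Register garbage (e.g. an old base page) under the current epoch."""
        self.garbage.append((self.epoch, item))

    def tick(self):
        """The 'heartbeat' that advances the epoch counter every so often."""
        self.epoch += 1

    def reclaim(self):
        """Free everything retired before the oldest epoch any thread is in."""
        horizon = min(self.active.values(), default=self.epoch + 1)
        freed = [item for e, item in self.garbage if e < horizon]
        self.garbage = [(e, i) for e, i in self.garbage if e >= horizon]
        return freed
```

The `horizon` is the low watermark from the lecture: a retired page is only freed once no registered thread remains in its epoch or any earlier one.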
So there's something almost like a heartbeat always moving forward in time, right? And then I know what threads could be around in each epoch. And once I know that, for a given epoch, there's no threads that exist in that epoch or any epoch before it, I can reclaim its garbage. Again, in the case of the HANA garbage collection stuff we talked about, they were worried about these really long queries, you know, running for hours, and therefore there's a bunch of garbage in the middle you wanna reclaim but you can't, so that's why they wanna do that interval garbage collection, right? Think of the epoch like the timestamp version — the high watermark or the low watermark for when we can remove stuff. In our index, we're not worried about things taking hours; we're talking microseconds here. So it's okay that there might be like five epochs that I wanna reclaim garbage from and there's some thread that's taking a little bit longer that's causing all of them to get stalled. Again, we're talking microseconds here, so it's fine. So it's okay to have a coarse-grained epoch-based policy, and we don't need to worry about intervals. Okay, so let's get to the hard part: how we actually do structure modifications. So, again, the BW-Tree is basically a latch-free B-plus tree; it's self-balancing. So that means we have to do splits and merges just like in a B-plus tree. And that gets a little tricky now that you have these delta chains and this mapping table. So the way we're gonna handle this is that we're gonna introduce two new types of delta records that are gonna correspond to physical changes in our index and keep track of where the logical contents of different pages can be found.
So we're gonna have a split delta record that keeps track of where certain ranges of keys from a page can be found, whether to the left or the right, and then we'll have a separator delta record, which is a shortcut mechanism for higher parts of the tree that allows us to not have to scan through an entire delta chain just to find out that the data we're looking for is on another node. So we're gonna do splits here in the next slide; for merges it's the same process, just done in reverse. There's nothing extra, nothing special you do. All right, so this is like the simplest example I could come up with. Even so, it's still kind of nasty, but that's okay. All right, so we're gonna have four nodes: three leaf nodes, one root node here. Again, same thing, everybody has their own unique page ID, and then they're gonna have logical pointers in between them. So in the BW-Tree you're actually gonna have pointers in both directions, like a B-link tree, but for ours, to keep it simple, we're only gonna go one way. All right, so say these are the keys that we have stored: one and two in the first guy, three to six in the second one, and then seven and eight in this one. So we wanna do a split on this one here, right? So the first thing we're gonna do is just create a new page and copy the keys we wanna split on — key five and key six, we'll copy them into our new page here. All right, if there's a delta chain, whatever, you apply it, it's fine, all right? And you update the mapping table to point to us. So at this point nobody can see us, because nobody's pointing to us, right? Nobody knows about us. This thing still points here and this guy still points there. So we're gonna introduce this new split delta record here, and this is gonna go in the version chain for this page here. And what the split delta record basically says is — it's gonna have two pointers, a physical one and a logical one.
So the physical one would say, here's all the keys on the left side, right? Key three, key four — so up to key five, exclusive. And then the logical one says, here's all the keys over here on the right side, right? And again, the physical pointer points to this page here, because this is in this delta chain, not in that delta chain there. So now, in order to fully install it, again, I do a compare-and-swap on this guy so that page 103, since this is the head of the delta chain, now points to our split record. So at this point here, our split is fully installed and now everyone can see our page, right? Because when we updated this physical pointer, it also updated these logical pointers to now point to us. So now any thread that comes along — say they're looking for key five — they would jump down into here, right? And see that key five is in the range of key five to key seven, so they would follow that logical pointer over here. Even though we still have copies of key five and key six here as well as here, this is the one that they're gonna read if they follow those logical pointers, right? So again, up above, we don't know about the split. All we have up here is, everyone thinks key three to key seven, you follow this logical pointer, and then they find the split delta and then you go left or right, right? So to avoid this extra work of having to follow pointers down just to find out that the thing I really want is over here, I can introduce a separator delta record up above, and this is basically a way to say, if you're looking for something in that key range, jump to this location here. So as far as I know, for correctness, you don't need this separator delta record; it's just done for convenience, to avoid having to scan all the way down and do extra work just to find out you actually belong down here, right?
And then same thing, we flip that pointer to now point to that, and now everyone can see us. And at some later point, we'll do consolidation, right? So this thing will get compacted — we remove key five, key six, right — when we create a new page. So this is different than a B-plus tree. A B-plus tree wouldn't have multiple copies of one key existing in its leaf nodes, right? A key can only exist once. In our case here, because we're doing consolidation at later points in time, it's okay that this thing can still exist, right? Everyone has to keep track of what they're allowed to see when they land on a node. So if I'm scanning along the leaf nodes here and I'm looking for, say, key six, if I come along this pointer here, I would recognize that the thing I really want is actually down here. So even though I could jump here and see it, I know I still go to that one. All right, is this all clear? Yes. So there's a sibling pointer in the base node — is that just irrelevant at that point? So his question is, is this sibling pointer here in the base node just irrelevant once the node is split? Yes, because you would keep track up here that the sibling of this guy is now here, right? And that's being maintained here. So as I scan down to this thing, I gotta keep track of the split I saw up above. So if I need to jump over here, I have to jump down here. Blank faces. Okay, so I think he brought this up earlier: what are these delta records? Are they just malloc'd in the heap, right? Well again, if you read the original BW-Tree paper, they don't tell you, right? And you don't want to do that, right? Because if you malloc those little delta records just randomly in the heap as needed, then you're doing a bunch of small memory allocations, and that sucks, right? You're gonna have fragmentation, you're hitting malloc all the time, and that's bad.
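The split-delta routing described above — keys below the separator stay on this node, keys at or above it follow the *logical* side pointer through the mapping table to the new sibling — can be sketched like this. Single-threaded Python illustration; `SplitDelta`, `BasePage`, `mapping`, and `lookup` are invented names, and the page numbers mirror the lecture's example (splitting page 103's keys 5 and 6 into a new page 104).

```python
class SplitDelta:
    def __init__(self, separator, right_pid, nxt):
        self.separator = separator   # e.g. key five in the lecture's example
        self.right_pid = right_pid   # logical pointer to the new right sibling
        self.next = nxt              # physical pointer into the old chain

class BasePage:
    def __init__(self, keys):
        self.keys = sorted(keys)

mapping = {}   # logical page id -> physical pointer

def lookup(pid, key):
    rec = mapping[pid]
    while not isinstance(rec, BasePage):
        if isinstance(rec, SplitDelta) and key >= rec.separator:
            # Route to the right sibling through the mapping table.
            return lookup(rec.right_pid, key)
        rec = rec.next
    return key in rec.keys

# Split page 103 (keys 3..6) at key 5: new page 104 holds [5, 6].
# The old base page still physically contains 5 and 6 — that's fine,
# because the split delta routes those lookups to 104 instead.
mapping[104] = BasePage([5, 6])
mapping[103] = SplitDelta(5, 104, BasePage([3, 4, 5, 6]))
```

Notice that the stale copies of 5 and 6 on page 103 are unreachable through a correct traversal, which is exactly why consolidation can remove them lazily later.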
So again, the paper I had you guys read — the first third of it describes how to actually build the BW-Tree, the missing parts from Microsoft, the next third is these additional optimizations, and then we get to the actual performance evaluation. So I want to talk about two optimizations that are pretty straightforward to understand that make things run faster. So the first is that the delta chain isn't actually just a linked list; you're gonna have a little extra space in each page, in each node, where you can store these delta records. So you have your delta slots, and then you just have a counter or an offset to say at what position you can come along and insert a delta record. So this means you have a fixed number of delta records per page, and at some point, when you try to insert a new delta record, you have to do the consolidation. Whereas in the Microsoft paper, it was just when the thing got longer than some threshold, then you did it. In our case here, if we run out of space, we have to do it. So a thread comes along and says they want to install a new delta record, right? They do a compare-and-swap on this guy, or an atomic add, and they try to add one to it, right? And if they succeed, they know whatever slot this thing had before is now claimed by our thread to install a new delta record. And then, just like before, we do the compare-and-swap on the mapping table to now point to us, right? Now there's extra metadata we have to maintain, because claiming the slot is one compare-and-swap and changing the mapping table is another compare-and-swap: I may have one thread get the first slot here and install something, and then a second thread comes along and installs something, but he gets to the mapping table before I do. So this thing may not be in physical order. Sorry, may not be in logical order.
So I have to keep track of what the actual order of these delta records is that I want to apply. So it's not just starting here and scanning across, right? Because again, I can't update two memory addresses at the same time. So I can do a compare-and-swap on this, get a slot, then do a compare-and-swap on that, and someone might beat me to it. But that's still fine. The second issue we had to deal with was the size of the mapping table. So the way the mapping table is actually implemented, it's not a real table, it's just an array. Even though I sort of draw it as a hash table, it's just a flat array of 64-bit pointers that point to different memory locations for pages. So if you end up allocating the entire array up front, then you're basically wasting a lot of space that may never actually be used. The reason why you have to allocate the entire memory ahead of time is because you can't make expanding that memory latch-free. I can't call realloc, because somebody else might be doing it at the same time, and that'll cause problems. So then you say, well, maybe I partition the mapping table so I can incrementally allocate new parts, and then have some pointers above that to say, if you want this portion of the mapping table go here, if you want that portion go there. But now you're basically building a B-plus tree, right? So I'm screwed there. So the way to get around this is you do preallocate the entire array, but you use virtual memory, and only the portion of that virtual memory that you actually need gets used. So again, the page IDs are just offsets into this array.
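The first optimization — pre-allocated delta slots in the page plus an atomic counter — can be sketched as below. This is an illustrative single-threaded Python stand-in: `Page`, `claim_slot`, and the slot count are made up, and the comment marks where a real implementation would use an atomic fetch-and-add rather than a plain increment.

```python
class Page:
    SLOTS = 4                       # fixed number of delta records per page

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.slots = [None] * self.SLOTS   # pre-allocated delta slots
        self.used = 0                      # would be an atomic counter

    def claim_slot(self):
        """Stand-in for fetch_add(used, 1): returns a slot index, or None
        when the page is full and the caller must consolidate instead."""
        if self.used >= self.SLOTS:
            return None
        idx = self.used
        self.used += 1
        return idx

page = Page([2, 4, 8])
for key in (0, 6):
    idx = page.claim_slot()
    page.slots[idx] = ("insert", key)
```

Because claiming a slot and publishing in the mapping table are two separate atomic steps, the slot order may not match the logical install order — which is the extra ordering metadata the lecture mentions having to maintain.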
So if I only have maybe a thousand nodes, or a thousand pages, I'm only using the first 1,000 entries in my array, and although I have allocated the entire thing, all of that virtual memory is not actually backed by physical memory. Like, if I do a virtual memory allocation on a big space and I don't touch any of it, the OS doesn't actually allocate anything. It's only when I try to access a memory location that there's a page fault, and then the OS actually creates the backing physical memory for it. So with virtual memory, I can allocate the entire thing, and only when I actually touch it is the memory actually created. Yes? Doesn't the OS handle this — why do we have to handle the physical memory? Yeah, so the question is, doesn't the OS handle this? Yes — we're not doing any special work, we're just using the OS's virtual memory to do this. The statement is, if you just allocate a big chunk of memory and you don't touch any of it, isn't the OS still going to do this? Correct — I think in the original paper they used a hash table, right? And if you do that you're jumping to random locations. This technique actually came from another paper called the KISS-Tree, which is another latch-free data structure; we got this idea from them. Just to understand why we did this: I actually looked in the code today, and the default is that we allow for a mapping table of one million nodes, right? So if it's one million nodes and each node can have 128 keys — in the current version of the BW-Tree, there's no reason it couldn't be bigger — that's 128 million keys. So for every single index, if you want to store one million nodes, we would end up malloc-ing eight megabytes.
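The virtual-memory trick can be demonstrated with an anonymous memory mapping: reserve the whole flat array of 64-bit slots up front, and the OS only faults in physical pages for the parts you actually touch. A minimal Python sketch, assuming one million page IDs at 8 bytes each (the helper names `set_ptr`/`get_ptr` are made up):

```python
import mmap
import struct

N = 1_000_000                    # maximum number of logical page ids
# Reserve 8 MB of virtual address space (anonymous mapping). Untouched
# regions are not backed by physical memory until they are first accessed.
table = mmap.mmap(-1, N * 8)

def set_ptr(page_id, addr):
    # Page ids are just offsets into the flat array of 64-bit pointers.
    struct.pack_into("<Q", table, page_id * 8, addr)

def get_ptr(page_id):
    return struct.unpack_from("<Q", table, page_id * 8)[0]

# Only the OS pages we write to get faulted in and physically allocated.
set_ptr(101, 0xDEADBEEF)
```

One caveat the lecture touches on: a Python `mmap` is only a rough analogy — the lazy-backing behavior is an OS-level property of anonymous mappings, and jumping to random offsets (as a hash-table layout would) defeats the point by touching pages all over the reservation.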
Eight megabytes doesn't seem like a lot, but if you have two or three indexes per table and you have a hundred tables, then you're allocating all that memory before you even put any data in it. We saw this when we were using Peloton: we'd turn the thing on, install the TPC-C database schema, and we'd be at 200 megabytes. I'm like, how is this 200 megabytes? We didn't do anything yet, right? Go try this on SQLite — create tables in SQLite and the memory footprint is basically nothing. So using virtual memory for this, again, allowed us to avoid that problem. All right, so I want to show two sets of experiments and then we'll finish up. The first one is from the original BW-Tree paper from 2013. These are internal benchmarks that Microsoft did using some workloads they had on hand: a dataset from Xbox Live, a synthetic workload, and then a dedup workload. So this is the graph I saw, and I was like, man, this is awesome, I want to build a BW-Tree too, right? They're comparing the BW-Tree versus the skip list they originally built for Hekaton versus the B+tree from Berkeley DB. And they modified Berkeley DB to make the B+tree actually be in memory. If you've never heard of Berkeley DB, think of RocksDB or LevelDB or SQLite: it's an embedded database engine that provides you low-level data structures, like a hash table and a B+tree. It's an awesome system, but it's from 1992, right? So they were using an old B+tree in their comparison here, and it's no surprise that it gets crushed. So when I saw this, I was like, yes, this is the way to go, we should build this. So then we did build it. And these are our benchmark numbers, where we compared against a modern B+tree — it wasn't super advanced, but it was provided to us by the HyPer guys in Germany. 
One of them was a visitor with us for two months a few years ago, and he came along with the B+tree they used for HyPer, and it crushes the BW-Tree here. As for the skip list — again, the skip list always loses, the skip list always sucks. And this is actually the state-of-the-art skip list from Alan Fekete's group down in Australia. It doesn't use towers, it uses wheels, whatever, right? The basic idea is the same. We're calling our version of the BW-Tree the OpenBwTree because, again, Microsoft never open-sourced theirs and we open-sourced ours. It has all the enhancements and optimizations that you read about in the paper. So again, for a variety of workloads, with the exception of inserts, the B+tree always wins. And this is running on a machine with one socket, 10 cores with hyper-threading, so a total of 20 hardware threads, whereas Microsoft was running on older hardware with fewer cores, because that was back in 2013. Now, if you want to get even more embarrassed, that's when we compare against a bunch of other data structures, which we'll discuss next class. So again, this is the BW-Tree, the skip list — the modern skip list — and the B+tree, but now we're also throwing in the Masstree and the ART index, right? The assigned reading for next class is the ART index, which is a radix tree, a trie. Masstree is a trie of B+trees. Again, we'll discuss what these are next class. So now you see the radix tree crushes everything, right? And the main takeaway here is that the radix tree, the Masstree, and the B+tree are all using latches. So just because you're latch-free doesn't mean you magically go faster — it actually ended up doing worse, right? So at this point, we're stuck with the BW-Tree. We spent a lot of time getting it to work, and as we build out our new system, we're keeping it for now. At some point, I would consider maybe looking at either a B+tree or maybe something that looks like a Masstree. 
The radix tree has some limitations that we'll talk about next class. For scans, it's not as good as a B+tree, right? Because you can't scan along the leaf nodes; you have to jump up and down, okay? So again, I don't want to make it sound like I'm taking a sh** on the BW-Tree. I really liked it — I spent two years of my life helping build it and writing this paper. It's just, you have to face the cold, hard scientific facts: compared to classical data structures using latches, it gets crushed. It is what it is. All right, so what are the main takeaways here? Hopefully in this lecture you saw that some of the things we talked about, like garbage collection, overlap a lot with the MVCC stuff we talked about before: worrying about what threads are in the system, what things we need to clean up, what's pointing to what. And so it's this really interesting idea where the index is almost like its own little database in itself, but it still has to interact with the outside world, which is the database around it and the table it's connected to. So I think there are some efforts to maybe unify the epochs within the index and the epochs within multi-versioning. We don't do that, and I don't think anybody else has attempted this yet, but I know people have been thinking about it. So one of the main takeaways about the latch-free data structures we talked about is that the single-threaded skip list is super easy to implement, the concurrent skip list with compare-and-swap is slightly harder but not too hard, and the BW-Tree is a nightmare. We spent probably a year and a half getting it into the shape it's in today. The kid that wrote it is a freak, too. He's one of the best programmers we've ever had. The rumor going around is that he wrote a portion of it in Notepad. He's that crazy. And it works, right? So, anyway, Ziqi's a funny guy. 
All right, and he wrote it in Notepad on his laptop that runs Windows 10, but he configures Windows 10 to make it look like Windows 95, and then he sets the default font to Comic Sans. Like, it's hardcore, right? All right, so anyway, it's hard to do. We think we have the best open-source implementation of the BW-Tree. There have been a couple of other systems that showed up on Hacker News where they built their own BW-Tree, but I don't think they go to the full extent that we have. They don't do all the optimizations that we talked about, unless they've taken our paper and run with it. All right, next class: we'll talk more about how they do garbage collection, we'll talk more about how you actually represent keys in indexes, and then we'll spend most of our time talking about the trie data structures, so Masstree and then the ART index, which are both radix trees, okay? Any questions?