So for today's class, I've gone back and forth three or four times on what I want to call this lecture. It was originally, I think, just OLTP indexes, latch-free data structures; then I switched it to B+tree data structures; then as of last night, I said let's call it whole-key data structures, and that'll make more sense in a few more slides. But the main idea here is that we're just talking about what kind of indexes or data structures we could build or use for OLTP workloads. The other thing that I forgot to mention on Piazza, because they were late on giving me the information: there actually is a visitor today from Snowflake coming to give a talk at 4:30 on the query optimizer in Snowflake. Snowflake is one of the biggest OLAP cloud database vendors, database-as-a-service, a competitor to something like Google's BigQuery or Amazon's Redshift. And Bowie is actually a former student of mine here at CMU. He took 721, like you guys, and started working on the query optimizer. People could not hire him fast enough. So he's been at Snowflake for two years now, and he loves it. They'll give a talk today about the kind of stuff they're doing in the query optimizer at 4:30. So by all means, please come to this, and I'll send a reminder on Piazza. Okay? Everyone's invited. Bowie, I think there are some recruiting events on campus this week; I forget why he's here. So he's here for that, but like I said, if you take this class, in many ways the things that you'll learn from it will help you get past the interviews at these various data companies. And Bowie can tell you whether that's true or not, because he's been at Snowflake for a while. Okay? All right.
So as I said at the beginning, I debated back and forth what I wanted to call this lecture, and "whole key" is not exactly accurate, but the main thing I'm trying to do is distinguish the things we'll talk about today from what we're going to talk about in Wednesday's class. This is also in quotes because this is my term; I don't think this is standard vernacular, and I don't know what they say in information theory, but this is what I'm using just to contrast the two. By whole key, I mean that we're going to have a data structure that's order-preserving, so we can keep track of whether one key is less than another key, and we're going to store all the digits of a key together in its representation inside the data structure. What I mean by that is: if I have a key ABC, then in my leaf node I could have the key ABC stored together. Now if I want to do a comparison and see whether my key is less than, greater than, or equal to that given key, I have all the contents of the key right there. That's not entirely true once you start doing some compression techniques, but for our purposes today, we can assume that's the case. Then in Wednesday's class, we'll talk about what I call partial-key data structures, or tries, where you actually break up the digits of a key and store them separately across the nodes of the data structure. So one way to think about this is that with the partial-key approach, you may have to do less work to find whether something matches.
You may also have to store less data to represent all the keys in your data structure, but you may still have to go out and look at the original tuple, the original record, to see what the original key was that you're doing a lookup on, because it may not all be contained in the tree. Whereas with the whole-key approach, if key ABC exists in the tuple, key ABC will exist in the data structure, okay? Again, this will make more sense as we go along, I think, contrasting it with Wednesday's lecture, but for our purposes today, we just assume that we're dealing with B+trees. Today, I'm going to spend most of the time talking about the Bw-tree, because that was the assigned reading, and that's the current data structure we use in our database system. Part of the reason I wanted you guys to read this is because it's exposure to how you would actually build a latch-free data structure, a latch-free B+tree. You may read on Hacker News or elsewhere on the internet people saying that latch-free data structures or latch-free algorithms are superior to ones that do use latches, and that you always want to be using a latch-free data structure. I certainly thought that was the case when we first started building our own Bw-tree, but the paper you read basically shows that you don't. Before we get there, though, I want to provide some historical background on what kind of data structures people originally built for in-memory databases, in particular the T-tree. Then we'll finish up with how to take a regular B+tree, as we described in the introduction class, and do something a bit smarter in how we do latching. When you read that paper and you see the B+tree beat the Bw-tree, they're using this latching technique, okay? All right, so back in the day, in the 1970s, before we were all born, they invented the B+tree.
At the time, since they were dealing with disk-oriented databases and disk was super slow, the B+tree turned out to be a well-designed data structure for efficient access to long runs of sequential data. I traverse the tree in log(n) time, land on my leaf node, and scan along the leaf nodes until I find the key I'm looking for. That's fantastic, right? Since disk is slow and sequential reads are faster than random reads, this approach was perfect for it. Then in the 1980s, there was some early work done on designing the first in-memory databases. In that world, you don't have a slow disk; you have fast random access in memory. So what they were looking into was: is there an alternative data structure we could use instead of a B+tree that would be preferable and more efficient for in-memory databases? The most famous one that came out of this work was the T-tree. A T-tree is going to look like an AVL tree. Instead of a B+tree, where the keys always exist in the leaf nodes and the inner nodes are just guideposts that tell you whether to go left or right, in the T-tree the keys are going to be scattered throughout the different nodes, both the leaf nodes and the inner nodes. But the big difference between a B+tree and a T-tree is that instead of storing copies of the keys in all the nodes, they're just going to store pointers to the original records, the tuples themselves. The idea was that back in the 1980s, when memory was quite limited, instead of storing redundant copies of the keys like in a whole-key B+tree, if we just store pointers, that's way more efficient in terms of memory. So yes, we pay the penalty of doing a lookup to say, for this pointer, what's the actual key that corresponds to that tuple?
But that reduces the total size of the index. The T-tree was originally proposed in 1986 at the University of Wisconsin, where Mike Carey and his group were doing a lot of early work on in-memory databases in the 1980s. In the 1990s, when people started building the first commercial in-memory databases, like Smallbase out of HP, which became TimesTen, which Oracle bought and which is still around today, these early in-memory databases actually used the T-tree design. One key aspect of why the T-tree actually worked back then was that the difference in speed between CPU caches and DRAM was not as significant as it is now. Back then, if I had a cache miss and had to go read something in memory, in the T-tree world of the 1980s, that wasn't that big of a deal. So it was okay to follow that pointer, because it wasn't a big performance penalty, and you saved a lot of space in the data structure itself. I keep teaching T-trees because I think they're fascinating and there's not a lot of material about them, though there is a Wikipedia article. As I said, TimesTen still supports them today, but by default, if you create a table or index in TimesTen, you get a B+tree. You can pass a flag to force a T-tree, but there are very few databases that still use T-trees today, mostly embedded devices running in extreme low-memory environments. So there really isn't much information about them, and I always like to talk about them because I think they're kind of fascinating. It turns out the guy that actually invented it sent me an email last month and said, hey, look, I see you're talking about T-trees. The mistake I always made was saying, oh, it's called a T-tree because the node looks like a T, but he tells me I'm wrong. The guy's name is Toby Lehman. He got his PhD at the University of Wisconsin. He named it after himself. The T in T-tree is Toby, which I think is awesome.
Now, for a B+tree, we always say the B means balanced. He says it's actually named after Rudolf Bayer, the guy that did the original work; he just called it B for himself. I don't know whether that's true, but the T in T-tree means Toby, which is awesome. And he pointed out some other optimizations that I'll talk about in a second. Again, this is why I love the internet: I've never met this guy, I didn't know who he was, he just found the YouTube video and sent me an email saying, hey, you're wrong about some things. Which is fantastic. So here's what the data structure looks like. Again, I always thought it was because the nodes look like T's, but that's not the case. So what does a node actually look like? A node is going to be a combination of pointers and then just two keys. The first things we have here are the data pointers. These point out to the actual table, and they correspond to the tuples they represent. So this is our data table, these are the keys, and these are just pointers to the different keys, or rather to the original tuples. These pointers will be sorted in the order of the keys that are stored in the data table. So now I can do the binary search that I would normally do in a B+tree, jumping around to find the entry I'm looking for. But any time I need to do a comparison, like, is my key equal to this key, or less than, or greater than, I've got to follow the pointer to get to the original tuple to figure out what the original key was. In a modern system, our pointers are 64 bits; in reality they're 48 bits, but you have to allocate a 64-bit pointer's worth of space. Back then, these things were probably 16 bits, somewhere in that neighborhood.
So by not having to store the key plus the pointer, just the pointer back to the key, I can reduce the amount of state I have to store in each node roughly by half. The other thing we're going to have are the node pointers. Unlike in a B+tree, where you normally only have pointers to your children, or to a sibling if you're a leaf node, in a T-tree you have to have a pointer back to your parent, because the leaf nodes aren't the final location of keys; we may have to traverse back up, so we need a pointer to go back there. Then we have pointers to the left and right child. Then we have our node boundary keys, which are just the min and max values of the keys represented by this node. Anything that is less than the min key will be found on one side of the tree; anything greater than the max key will be found on the other side. This is not like a B+tree, where the root node's left and right boundaries encompass most of the key space below you; this is just a slice of the key space. All right, so now let's look at how we actually do a lookup. We have a three-node T-tree and we're trying to find key K2. Again, I start at the root. These are just pointers to the original keys, so the key space for this node encompasses key four to key six, inclusive. Here I have pointers, sorted in key order, over to what's being stored in the data table. Then I have my child pointers down here, and that allows me to do my traversal. At the very beginning I start at the root, looking for key two. I only potentially need to do one comparison per node to figure out where I need to go. I have a copy of the boundary key here, so I can do this efficiently without having to go out to the original table; I just do a quick comparison.
Is K2 less than K4? If yes, then I know I want to follow that pointer down here. So even though I said it's kind of inefficient to have to do these pointer lookups, most of the time you don't have to: it's only when you land on a node where you think the key should be that you have to follow them. So we land down here, and now we check whether K2 is greater than K1. It is. We also check whether K2 is less than K3. It is. So we know our key could potentially exist here, because it's within our boundaries. Now, to keep it simple, since we only have three keys per node, we just do a linear scan: look at every single entry, follow the pointer, and do our comparison of the key over there. Yes? [Question: when would you go back up?] I don't have a slide for this, but when you do a range scan, like find all keys greater than K2, I would come down here, scan along, find everything, and then jump back up and keep going. Basically, in the initial search you're trying to find the leftmost starting point; then you scan along, and when you realize there's nothing below you and there's something up above, you follow the parent pointer back up. [Question: are there no sibling pointers, and that's why we have to go up?] Correct. Yes, that's how AVL trees work. [Question: why three keys per node in this example?] Because it fit on the slide. There's nothing in the original specification of the T-tree that says you have three; I put three to make it fit. You could have five, twenty, whatever; it doesn't matter. [Question: what differentiates the range in the root, or the parent, from the children?]
So again, in a B+tree, all the keys are at the bottom. In an AVL tree, or a regular B-tree, you can have keys, and value pointers to the actual tuples, anywhere in the tree. In a B+tree, if I only have keys and values in the leaf nodes, then up above I'm wasting space, because I'm storing just guidepost keys. The T-tree is trying to get the maximum usage out of every single node, so they store the key-value pointers everywhere, including the root node and the parent nodes. These act as guideposts too: if I'm looking for key five, for example, and I'm here, then I say key five is greater than key four and key five is less than key six, so I know the thing I'm looking for is in this node, and I don't need to look at the leaf nodes. Again, in a B+tree they pushed everything to the bottom, because then if I want to do a scan, I don't have to backtrack; I just scan along the leaf nodes sequentially and find what I want. Yes? [Question: what's the advantage of having all the keys at the bottom? How does that cater to disk, versus this catering to memory?] The question is: why is the B+tree design of pushing all the keys and values to the leaf nodes better for a disk-oriented system, versus them being scattered anywhere for an in-memory system? Again, say I'm trying to find a range of values. It's an order-preserving tree, so in a B+tree all the leaf nodes are stored in that order; I just find the leftmost node to start and then sequentially scan, which is faster on a disk-oriented system, at least on spinning disks, and actually even today on SSDs sequential is still faster.
I just do a sequential scan along the bottom, find what I want, and never have to go back up; we're trying to avoid random I/O. In the in-memory world, random I/O is not a big deal, so who cares? I can jump around and traverse back up and I don't pay a big penalty for that. And again, to her point, by using all the inner nodes of the data structure to actually store keys and values that I care about, I waste less space. Because in a B+tree, I could delete a key from a leaf node and it could still exist in an inner node, since that's my guidepost. Okay, I think I've said a lot of this already. Part of the reason I teach T-trees is that it's a different way of thinking about in-memory databases and in-memory indexes, and you may come across somebody who says, oh, why are we using a B+tree for an in-memory database, shouldn't we be using a more memory-optimized data structure like a T-tree? And the answer is no; I'll get to the disadvantages in the next slide. This is mostly for your background information. As we already said, we use less memory for the index because we don't have to store copies of keys in every single node, and every single node is always being used for storing key-value pairs and not just guideposts. The other interesting thing that the inventor of the T-tree pointed out, which I hadn't thought about, is that in a T-tree, when you evaluate whether your search key equals the tuple's key, you're following the pointer and looking at the whole record. So once you do that, you've already paid the penalty of following the pointer and looking at the record.
So instead of just checking whether the search key matches the index key for that tuple, you might as well evaluate all the other predicates that you have in your WHERE clause. If I have a WHERE clause A = 1 AND B = 2 and my index is only on A, in a B+tree I would do the traversal and only look at A, because that's the only thing visible inside my index; then I have to go do the lookup on the tuple and evaluate B. But in a T-tree, you can do it all at once: while I'm looking at the tuple for A, I might as well look at B. So you could potentially end up throwing out tuples more quickly than you would otherwise. Yes? [Question: but then the key table and the data table would have to be in the same place, right?] What do you mean by the key table? [The table.] Okay, this is the whole table. These columns are a bunch of other attributes; this is the attribute I built the index on. There is no separate key table; that would be a waste of space, because the index itself is the key table, if you want to think of it that way. Okay. Now, there are techniques in modern systems to get this same benefit. You can do partial indexes, where you define a WHERE clause for what keys can be in the index. Like, build an index that only has students enrolled in 15-721, so all the students that aren't in the class aren't inside the index. That's sort of like pre-filtering the WHERE clause ahead of time without having to store extra information. Or, in other systems, you can have include columns.
So you can say: I want the index on A, but also, by the way, store B in the leaf nodes, so that I don't have to go look up the tuple to evaluate a predicate on B. Postgres can do this, SQL Server can do this; it's actually fairly common today. And it's not as bad as storing B everywhere throughout the index, since you're only storing it in the leaf nodes. So the benefit the T-tree gets here is, I think, not as significant as it maybe was back in the day when they evaluated this. All right, so why doesn't anybody use them? Well, I didn't talk about how to actually maintain this thing and keep it balanced. AVL-style trees are kind of tricky because you don't really do splits and merges; you have to do rotations. So now I have to take more heavyweight latches on my data structure in order to make significant changes. And related to this, it's hard to guarantee that all my operations are thread-safe. Then, as I said, once CPU caches got much faster relative to memory, the cost of chasing those pointers and looking at the tuple became quite significant. So you're better off making a redundant copy of the keys in your data structure to avoid that penalty: you pay a little extra storage overhead to get quite a significant performance gain. There was a paper in 1999 by Ken Ross's group at Columbia that basically showed that T-trees are a bad idea for in-memory indexes, and that a B+tree, or a variant that looks like a B+tree, is a better way to go. And that's why I say nobody today actually uses this, other than embedded devices. There's a database called eXtremeDB that's meant to run on little IoT devices, and in that world memory is short, so I think it makes sense.
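To make the T-tree layout concrete, here is a minimal sketch of the node structure and lookup described above. This is my own illustration, not code from any real system: the names, string keys, and linear scan within a node are all simplifying assumptions, and balancing/rotations are omitted entirely.

```cpp
#include <string>
#include <vector>

// Hypothetical tuple: the indexed key lives inside the record itself,
// so the index only needs a pointer, not a copy of the key.
struct Tuple {
    std::string key;    // the indexed attribute
    std::string other;  // stand-in for the remaining attributes
};

// A T-tree node stores sorted pointers into the table, plus min/max
// boundary keys so most traversals need only one comparison per node.
struct TTreeNode {
    TTreeNode* parent;            // needed: range scans may backtrack up
    TTreeNode* left;
    TTreeNode* right;
    std::string min_key;          // boundary keys for this node's slice
    std::string max_key;          // of the key space
    std::vector<Tuple*> data;     // pointers to tuples, sorted by key
};

// Lookup: compare against the boundary keys first; only when the search
// key falls inside this node's range do we chase the tuple pointers.
Tuple* ttree_search(TTreeNode* node, const std::string& key) {
    while (node != nullptr) {
        if (key < node->min_key) {
            node = node->left;
        } else if (key > node->max_key) {
            node = node->right;
        } else {
            // Within bounds: scan (could be binary search), dereferencing
            // each pointer to recover the actual key from the tuple.
            for (Tuple* t : node->data) {
                if (t->key == key) return t;
            }
            return nullptr;  // in range, but not present
        }
    }
    return nullptr;
}

// Demo helper (illustration only): a two-node tree over keys K1..K6.
bool demo_found(const std::string& key) {
    static Tuple t1{"K1", ""}, t2{"K2", ""}, t4{"K4", ""}, t6{"K6", ""};
    static TTreeNode leaf{nullptr, nullptr, nullptr, "K1", "K2", {&t1, &t2}};
    static TTreeNode root{nullptr, &leaf, nullptr, "K4", "K6", {&t4, &t6}};
    leaf.parent = &root;
    return ttree_search(&root, key) != nullptr;
}
```

Note that the only key comparisons that avoid a pointer dereference are the ones against `min_key`/`max_key`; every comparison inside `data` goes out to the tuple, which is exactly the cache-miss penalty discussed above.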
But for a large Xeon server, T-trees are probably not the right choice. Yes? [Question: how can binary search actually work here?] Well, in this example it's a linear search, but say I did a binary search and landed here. What do I do? I go look up the key in the data table via the pointer, and then I compare it with the key I'm looking for. If my key is greater than that key, I know I want to go this way; if it's less, I go the other way. The pointers are sorted on the values of the keys; the data table itself can be sorted any way it wants. It's a relational database, bag algebra; the tuples are unsorted. The point is that all the standard tricks we would do in a B+tree, like linear search, binary search, or interpolation search, we can still do. It's just that we pay an extra penalty: we have to jump over there and see what the actual key is. One thing that actually would be interesting to do, now that I think about it (it's too late for project two), is this: as I said before, when you allocate a pointer on x86, you get 64 bits, but the hardware only actually uses 48 bits of the address. So you have 16 bits there where you can store whatever you want, as long as you mask them back off before you dereference the memory address. You could do something where this is 64 bits, I still have the 48-bit pointer that takes me back to the data table, but I store part of the key in the other 16 bits. Then sometimes I have to go do the lookup, and sometimes I don't. That would potentially work. But the problem is that Intel could take this away at any time and start using the full 64 bits.
I was at a technical seminar with an Intel guy a few years ago, and they said they had a customer with an in-memory database that was maxing out the 2^48 address space, and that eventually Intel would go to the full 64 bits, so don't store anything in those extra 16 bits. But that was three years ago and it hasn't happened yet. So I don't know, but I don't think it's a good idea if you want to future-proof the system. We will see this technique used, though, for hash joins in HyPer: they use a chaining hash table and store a Bloom filter in those 16 bits, which is kind of cool. Okay, I won't dwell on T-trees too much; the Bw-tree and the B+tree are more modern, and we should focus on those. Any questions on T-trees before we switch over? Okay. So part of the reason why it's difficult to make the T-tree perform efficiently, or to make it latch-free, is that we have pointers all over the place. Every parent has pointers to its children, and each child has a pointer back to its parent. So if I need to move one of them and change one memory address, I've got to change a bunch of memory addresses in all the nodes pointing at it. And I can't do that with atomic compare-and-swap instructions, because you can't update multiple arbitrary memory addresses atomically; CAS works on a single 64-bit (or 128-bit) location. I can't say: atomically update these two separate things. And this is also the reason why we can't build a latch-free B+tree directly: we'd have pointers to things in multiple locations and no way to update them all atomically. It's just not possible.
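Here is a sketch of that spare-16-bits trick from a moment ago. The key assumption, as stated in the lecture, is that current x86-64 hardware only uses the low 48 bits of a virtual address; the tag has to be stripped before the pointer is dereferenced, since the hardware expects canonical addresses, and none of this is portable or future-proof.

```cpp
#include <cstdint>

// Low 48 bits hold the usable address; the top 16 bits are free to
// carry a small tag (a key fragment, a tiny Bloom filter, etc.).
constexpr uint64_t kAddrMask = (1ULL << 48) - 1;

// Pack a 16-bit tag into the upper bits of a 48-bit address.
inline uint64_t tag_pointer(uint64_t addr, uint16_t tag) {
    return (addr & kAddrMask) | (static_cast<uint64_t>(tag) << 48);
}

// Strip the tag to recover the raw 48-bit address. A real system must
// do this (and restore the canonical form) before dereferencing.
inline uint64_t strip_tag(uint64_t tagged) {
    return tagged & kAddrMask;
}

// Read the tag back out of the upper 16 bits.
inline uint16_t get_tag(uint64_t tagged) {
    return static_cast<uint16_t>(tagged >> 48);
}
```

If the tag holds, say, the first bytes of the key, a lookup can reject a non-matching entry without ever touching the tuple's cache line; only on a tag match does it pay the dereference.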
So, to motivate the design of how we might build a latch-free B+tree: the way to solve this multiple-address problem is to have an indirection layer, some centralized data structure where we record these addresses, and then all the nodes in our data structure know how to do lookups in that mapping table. Now I just need to do a compare-and-swap on one entry in the mapping table, change one address, and that automatically propagates the change throughout the entire data structure. And then I can make it latch-free. That's essentially what the Bw-tree is. The Bw-tree is a latch-free B+tree that came out of the Hekaton project at Microsoft that I mentioned, I think, two classes ago. When they first started building Hekaton, they were originally building it with skip lists, because skip lists are latch-free. Then they realized skip lists are a bad idea; the graphs will show that they perform poorly. What they came up with instead was the Bw-tree. I should also comment that although the original Bw-tree paper from Microsoft describes it in the context of an in-memory system, there's another project they built at Microsoft called Deuteronomy that actually stored things on flash. The delta-record approach in the Bw-tree, appending changes to nodes, actually works really well for flash environments, because it's appending to a log. But for an in-memory database, the Bw-tree is going to be a bad idea. I'm jumping ahead, but before we get into the details: who here read the paper and felt like they understood the Bw-tree?
All right, I ask this every year, and very few people raise their hands. It's a hard data structure, right? It's hard to wrap your head around; there's a lot going on. And this is not so much a commentary on the complexity of the Bw-tree specifically; any latch-free data structure, any latch-free algorithm, is actually pretty gnarly. A lot of times, even though using latches could potentially be slower, the engineering complexity of the data structure will be significantly less, and therefore you're less likely to make mistakes and it's easier for other people to work on it. Right now, our Bw-tree implementation in our system is something like 5,000 to 8,000 lines of code, and very few people on our team can actually touch it. The one guy that wrote it is not crazy, but he's kind of eccentric. The rumor is he wrote the Bw-tree for our team over about a year and a half, and he wrote a lot of it in Notepad on a Windows laptop. It was Windows 10, but he modified it to look like Windows 95, and he set the default font to Comic Sans. He wrote one of the hardest data structures there is in that environment. So, there's a lot going on. All right, let me go through the key ideas, and then we'll increase the complexity of what else the data structure needs to do to actually support real things. Yes? [Question: why is it called the Bw-tree?] I think it was in the title of the paper: buzzwords. It took all the buzzwords at the time the paper was originally written, like latch-free, in-memory, log-structured merge trees, threw them all into a single index, and called it the Buzz Word tree. Your face is really disappointed. Yes, it's really bizarre.
All right. So, two key ideas: the deltas and the mapping table. They're going to argue that you want to avoid cache invalidation. Think of a multi-socket system where you have a bunch of NUMA nodes and the CPUs are trying to update the same data structure. To reduce the invalidation that comes from making in-place changes to the nodes, they're going to use delta records: you prepend delta records to a node as you modify it, and at some later point you consolidate them. Now, this is not entirely true of the way we actually implemented it, because we actually store the delta records in the nodes themselves, so you still get cache invalidation; this is what they claim, but we didn't see this benefit. The other idea is the mapping table. This is a central location where you store the addresses of all the physical nodes, and if I need to change the physical address of a logical node, I just go to my mapping table and update it. Okay, so let's look at a really simple example: a three-node Bw-tree. The first thing to point out is that we have our mapping table, and every node is assigned a page ID or node ID: pages 101, 102, and 104. In our mapping table, we have physical pointers that tell us the starting address for each of these nodes. In all these diagrams, the solid black lines represent physical addresses and the dotted red lines represent logical addresses. So in this case, we have the root node, and it has two pointers to its children; the only thing we need to store in that node is the page IDs of the children, 102 and 104.
So now, any time I'm traversing my tree and I'm at page 101 and need to get to page 102, this is not a pointer I can actually follow. I have to do a lookup in my mapping table and say: I want page 102, tell me its physical address. And then I can land there. It's this indirection layer that allows me to take any logical page ID and map it to a physical address. All right, so let's see what happens when we do an update. Say I have a single page here, page 102. Every time I do an update to a page — inserting a key or deleting a key; we're not worried about updating keys, because that's just a delete followed by an insert — so let's do an insert. Now, this is just a node like in a B+tree: an array of keys and an array of values, the same physical layout. But instead of making an update to those arrays, I'm going to create a delta record that describes the change I want to make to what's represented in 102. Say I want to insert key K0. This delta record will have a physical pointer to the base page. How do I get that? Well, I do my lookup in my mapping table, which gives me the physical address of page 102. At this point, nobody can see my change, because anybody looking for page 102 would look at the mapping table, see the old pointer, and bypass my delta record entirely. So to install it, I do a compare-and-swap in the mapping table to replace the physical address it used to point to with the address of my delta record. Now, anybody that looks up page 102 is not going to land at the base page.
They're going to land at my delta record, recognize that they're looking at a delta record, and evaluate it accordingly. So if I'm looking for key K0, I land at this delta record, I see "insert K0" — voilà, I'm done, I found exactly what I was looking for. If I'm not looking for K0, I just follow the pointer down, and then I can look in the base page, the base node. So is this clear? This is the core idea of what they're doing. Yes? The question was about storing a reverse delta — an undo record — so you could keep the newest record in page 102 and point back to the reverse delta to reconstruct the prior version. So you're saying this would always be the latest version, and to get the version before it, I'd apply the reverse of the change. But what would be the reverse of "insert K0"? It's not really "delete K0," because K0 didn't exist before. And in that case, at least in your example, I'd have modified the page, which causes cache invalidation on the other CPU sockets, and then another invalidation because I updated a second region of memory. With the delta approach, I just create the delta record, the base page stays unmodified, and the only cache reference I need to update is this one pointer. Yeah, they're packed in together; we do that for efficiency reasons. Your cache line is 64 bytes, so as long as what you're updating stays within that, it'd be okay. I'd have to think about that, but I think what you're describing is sort of weird.
Because you're creating a reversal of something that doesn't exist, and in your example you'd need to know you're looking for K0 — I see it here, I'm done, but there's no reversal for that. For a delete, maybe: you could say, I see something here, I don't see the key, but did it used to exist? Yeah, I'd have to think about it; what you're describing is weird. Sorry. So the next question is, when would I actually follow this pointer? At this point, I've created this delta record, and it has a physical pointer to the base page. Nobody else can see it, though, because everyone else follows the mapping table, which takes them to the base page. I do the compare-and-swap, and now anybody looking for page 102 lands at my delta record. It doesn't matter whether you're looking for K0 or not — if you're looking for page 102, you land here. Then you have to evaluate the chain, essentially replaying a log in memory, to figure out what's actually stored in 102. Next question: the more delta records you have, the longer it takes to find a key, if you have to go all the way to the base page? Yes — we'll fix that in a second. And this question: what if there's a concurrent delta update? Next slide; we'll handle that. So now, let's do another one. If I do a delete of K8, same thing: compare-and-swap on the entry, and now it points to the new record. Anybody coming along for page 102 has to evaluate "delete K8" — that's not what I want — then "insert K0" — that's not what I want either — and then do the search down in the base node. Okay, so we've already covered that search works just like in a B+tree: if the thing you're looking for is found in the delta chain, you're done; otherwise, you do a search of the base node. All right.
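A point lookup over a delta chain can be sketched like this — again a toy model (the `Delta` and `Base` classes are mine, not the paper's): walk the chain newest-to-oldest, stop at the first delta that mentions the key, and only fall through to the base page if no delta does.

```python
class Base:
    def __init__(self, keys):
        self.keys = keys                 # sorted keys in the base page

class Delta:
    def __init__(self, op, key, nxt):
        self.op, self.key, self.next = op, key, nxt  # op: "insert" | "delete"

def lookup(head, key):
    """Replay the chain newest-to-oldest; the newest verdict wins."""
    node = head
    while isinstance(node, Delta):
        if node.key == key:
            return node.op == "insert"   # found in a delta: done early
        node = node.next
    return key in node.keys              # binary search in a real system

base = Base([1, 6, 8])
chain = Delta("delete", 8, Delta("insert", 0, base))  # delete K8 is newest
assert lookup(chain, 0) is True    # answered by a delta record
assert lookup(chain, 8) is False   # masked by the delete delta
assert lookup(chain, 6) is True    # falls through to the base page
```

Note that the newest verdict wins, which is why the delete of K8 masks the K8 still physically sitting in the base page.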
So let's handle his problem. We're back here: we've installed a delta record for inserting K0, and now I have two threads that are going to try to install two delta records at the exact same time. Two threads are inside the index, and each says: I need to perform an operation, and this is the base page I want to perform it on. The first guy wants to delete K8; the second guy wants to delete K6. So what happens? How do we install these updates? Right — the updates only take effect, and are only visible to everyone else, when you update the address pointer. These two threads are now going to compare-and-swap on the same memory location in the mapping table, but only one can succeed. Essentially, when each thread started, it read the physical address of the head of the delta chain for this node — that's how these guys got these physical addresses. When I do a compare-and-swap, I'm saying: if the current value at this address is what I think it should be, go ahead and swap it and install my new update. Say the first guy succeeds. That's fine — he is now the head of the delta chain, and his update got installed. The second guy's compare-and-swap would fail, because the evaluation against the mapping table sees that the address is not what he thought it would be: it's no longer pointing at the old record, it's pointing at the first guy's record. So he knows somebody else got in before he did, and his update fails.
And then, depending on the implementation, I can either try another compare-and-swap to install on top of the new head, or I can just repeat the whole operation and traverse down again. Question: the paper says you have to traverse again — if you just retried the compare-and-swap, wouldn't the other guy's delta get removed, since you'd only be installing your own chain? No — "delete K8" doesn't disappear. What I could do is update my delta record's physical pointer to point at the new head. You don't have to copy the whole chain, even if there are two or three things somebody else has inserted in the meantime, because the mapping table entry always points to whatever the current head of the delta chain is; you grab that head and point your "insert K6" at it. So why not just do that — go look and read it again? You could do it that way. I don't think we implemented it that way, for safety reasons: you don't know whether this node did a split in the meantime, and the thing you're inserting may no longer belong to this node. This node might have split, and where K6 should be is no longer page 102 — it's now another page. The paper says you copy the whole version chain every time, into your private space, when you do the evaluation? Yes. Yes.
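The install-and-retry protocol can be sketched as follows. This is a single-threaded toy: the `cas` function mimics a hardware compare-and-swap with a plain identity check, and it is only "atomic" here because nothing runs concurrently.

```python
# Deltas are (description, next) tuples; the chain bottoms out at "base".
mapping_table = {102: "base"}

def cas(table, pid, expected, new):
    """Stand-in for a hardware compare-and-swap on one table slot."""
    if table[pid] is expected:
        table[pid] = new
        return True
    return False

old = mapping_table[102]
delta_a = ("delete K8", old)
delta_b = ("delete K6", old)            # both captured the same old head

assert cas(mapping_table, 102, old, delta_a) is True
assert cas(mapping_table, 102, old, delta_b) is False  # lost the race

# the loser retries: re-read the current head, chain on top of it
head = mapping_table[102]
delta_b = ("delete K6", head)
assert cas(mapping_table, 102, head, delta_b) is True
assert mapping_table[102][0] == "delete K6"
```

The key property: the loser's retry chains its delta on top of the winner's, so "delete K8" is never lost. (As discussed above, a real implementation may instead restart the traversal from the root, since the node may have split in the meantime.)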
Question: couldn't you run into an ABA issue? Say you try to compare-and-swap, you fail, and in between, the record you're chaining onto gets deleted — so you end up pointing at a deleted record in your chain. So, again, we only have deletes and inserts; we don't worry about updates. The other thing is, this is all inside the data structure — we don't have to worry about the higher-level consistency issues of transactions. Maybe what you're saying is: what if one transaction deletes K8 while I'm trying to insert it, or I'm trying to read it and it's been deleted? All of that is handled up above, either by re-running scans or by doing validation. We just care about the low-level linearizability and correctness of the data structure, and this scheme handles that for us. Yes? Next question: why does restarting the traversal from the root fix things — why worry about a split on page 102 at all, and how does restarting from the root actually ensure correctness? It has to do with trying to insert a logical key into a location where it should not be stored. If, after a split, key K6 should no longer be in page 102 but in page 103, and I naively come back and compare-and-swap here, I'm inserting K6 into the wrong page. But anybody else looking for K6 is not going to land here.
They're going to follow the guideposts, land at another node, and get a false negative. When I redo the traversal from the root, that problem is solved. How do I make sure I'm landing at the page where the key should be? Because we're enforcing the ordering of the guideposts — which node you go to next, from one node to the next, based on the keys — and we guarantee that changes propagate from the bottom to the top. So you're never in a weird state where something points somewhere it shouldn't, or something is stored in a place it shouldn't be. Always restarting at the top is inefficient, and that's a downside of a latch-free data structure, but it guarantees the correctness and consistency of the data structure at the physical level. Okay. So now, to his earlier question: can't this delta chain get kind of long? Yes, it will, and so we want to do consolidation. Basically, one thread will recognize, as it's walking along, that this delta chain has gotten too long. This can be a threshold: if the delta chain has more than some number of records, I'm going to do a consolidation. What you do is first make a copy of the base page, and then apply the changes from the delta chain in reverse order. Take a guess why we do it in reverse order. Right — that is the physical time order: this is the oldest change and this is the newest change. If the oldest delta inserts K8 and a newer one deletes it, then replaying newest-first I'd be trying to delete something before it was ever inserted. So we always replay in reverse order.
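The consolidation step — copy the base page, replay the deltas oldest-first, then try to swap the new page in — can be sketched like this (the tuple-encoded deltas and the dict-as-mapping-table are my simplifications, not the paper's):

```python
def consolidate(chain, base_keys):
    """Copy the base page, then replay the delta chain oldest-first."""
    deltas = []
    node = chain
    while isinstance(node, tuple):       # walk (op, key, next) deltas, newest first
        deltas.append(node[:2])
        node = node[2]
    keys = set(base_keys)                # private copy: the "virtual node"
    for op, key in reversed(deltas):     # replay oldest -> newest
        keys.add(key) if op == "insert" else keys.discard(key)
    return sorted(keys)

mapping_table = {}
base = [1, 6, 8]
chain = ("delete", 8, ("insert", 0, base))   # newest delta at the head
mapping_table[102] = chain

snapshot = mapping_table[102]            # the head we consolidated from
new_page = consolidate(snapshot, base)
assert new_page == [0, 1, 6]

# install only if nobody appended a delta since we took our snapshot;
# on failure we would throw the copy away and start over
if mapping_table[102] is snapshot:
    mapping_table[102] = new_page
assert mapping_table[102] == [0, 1, 6]
```

Replaying in the wrong direction would, in this example, try to delete K8 before re-adding it, which is exactly the ordering bug the reverse replay avoids.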
So basically, as I scan through and recognize the delta chain has gotten too long, I have a copy of the base page, and now I can replay the deltas in reverse order, one by one. After I replay all my changes, this new copy of the node represents the same contents as the base page plus its delta records. So how do I install it? Compare-and-swap — easy, right? All I need to do is compare-and-swap the mapping table entry for page 102, and now anybody else that comes along sees my consolidated node. If I fail, then I've wasted the work of my consolidation, because somebody else changed something. And if someone appends a new delta record before I do my compare-and-swap, that's exactly what saves me from missing their update: the entry now points at a delta record I didn't see, so I throw away my work and start over. Yes, the paper calls this a virtual node — it's just in the heap, a node nobody else can see yet; only my thread doing the consolidation can see it. Next question: why not just take all these deltas and apply them to the base page directly? Again, they're arguing that to make it latch-free, you can't: if I allow anybody to make in-place changes down in the base page, you still need latches, because the entries have to be kept sorted and you'd have to enforce the ordering of concurrent writers. They're trying to be entirely latch-free. Yes. Next question: is compare-and-swap technically a latch? We will cover this next class — could you implement latches with compare-and-swap?
No, right? A latch would be: I hold a latch on some critical section and do a bunch of stuff; here, there's a single update that I apply atomically. Well, with a spin latch, what do you do? You spin until you get the latch — so couldn't I keep spinning until I get the thing I want? We don't, because we always want to restart. Yes — what he was saying is that if we applied the changes to the page itself and compare-and-swapped the whole thing, we'd have to copy the entire page, make one change, and then swap in the entire page. By doing this delta thing, we're only adding one record; otherwise you copy the whole page. That's what it takes to make it latch-free. If you don't want latch-free, then you take a latch, do the update, and you have a B+tree. Next question: when you recompute this consolidated node and your compare-and-swap fails, do you really have to throw out all that work? Couldn't you re-follow the new chain, see where your changes diverged, compare-and-swap that one pointer, and swap yourself in — if it's just an insert, say, all right, I missed it, put it in and try again? I don't think we do that; I think we play it safe, because again, if there's a split or a merge in there, that's when things get bad. All right. So the new page 102 is installed, and the old chain is still sitting around. What do we need to do with it? We obviously want to clean it up at some point, so these things can be marked as garbage, and at some point we need to reclaim them.
What does this look like? This is starting to look like MVCC, right? Once we recognize that something is no longer visible to any thread — like a version that's no longer visible to any transaction in the MVCC world — we want to go ahead and clean it up and reuse the memory. The details are slightly different from what we've talked about before, but the high-level idea is the same. For garbage collection in these in-memory data structures, what's the issue? We don't want to throw away something that somebody could still be reading or about to jump into, because then they'd get a segfault from reading deallocated memory. Say I want to delete K2 in a simple singly linked list. My thread is at key one; it sees the physical pointer to the next key. Then the garbage collector comes in and cleans that node up, but now I follow my pointer to some invalid memory address that doesn't mean anything anymore. Worst-case scenario, segfault — actually, worse than a segfault, I could read garbage and think it's something real. So we want to avoid this. The two approaches we'll talk about are reference counting and epoch-based reclamation. There are a bunch of other techniques — hazard pointers, and others whose names I forget; it doesn't matter — but these two are the most prominent and most common in in-memory databases. Reference counting is essentially what you get with std::shared_ptr in the C++ standard template library. Inside every node in the data structure — or in the shared pointer's control block — we maintain a counter that tracks the number of threads that could be accessing that memory location. Before I access it, I increment the counter — that's an atomic add — and when I'm done accessing it, I decrement the counter by one. The garbage collector knows it's safe to deallocate a region of memory when the counter is zero, because then no thread can be looking at it — as long as everyone updates the counter before jumping to the next location, we won't have an issue. It turns out, though, that this is really bad for performance, because now every single time I jump to a new location, I'm incrementing a counter that everybody needs to be able to read and write. If I have a lot of cores and a lot of sockets, that's a cache invalidation mess for everyone, just to go read something. Yes, questions — does this invalidation also apply when you're doing lookups in general? Yes, but I can read the mapping table without updating anything; reference counting turns every read into a write, and that's bad. So again, this is what you get with shared pointers, and it's obviously going to be slow. One thing to point out, though: we don't actually care what value the counter takes; all we really care about is whether it's non-zero. Whether it's one, two, or four, who cares — we know somebody's reading it, so we can't reclaim it. It's only when it's zero that we actually care. So maybe, instead of storing this fine-grained counter per node in our data structure, we can keep track of a higher-level, more coarse-grained construct: if we know that nothing can be visible to anybody within some time range — just like in MVCC — then it's safe to go ahead and remove things. This is what epoch-based garbage collection is, and we briefly mentioned it
last class; I said I was going to spend more time on it today. Again, the high-level idea is the same as what we did for MVCC. There's a global epoch counter that's periodically updated — a dedicated thread can do this, or it can be done cooperatively, say every ten milliseconds. The only thing we need to keep track of in our index is which threads are active in a given epoch: what epoch did they show up in, and have they left yet. I don't care what epoch they left in; all I care about is that they did leave. So I can show up in epoch one and leave in epoch two — that's fine, but I'm still only considered to be in epoch one. Now, when we do our consolidation, we take the current epoch of the BW-Tree and tag the garbage with that epoch. Once I know there's no thread that could possibly still see that node — because nobody is in that epoch anymore — it's safe for me to go ahead and delete it or reuse the memory. In Linux this is called RCU, read-copy-update; it's used in various data structures internally inside the kernel. Systems papers will refer to this as RCU; database papers refer to it as epoch-based garbage collection. So to do this — and this repeats what I just said — we tag everything: every search, insert, or delete is tagged with the current epoch. We register with the garbage collector when a thread shows up — I'm here, in this epoch, doing something — and when the thread leaves, it deregisters. Then the garbage collector can say: I know nobody is still in this epoch, here's the garbage tagged with it, let me go ahead and remove it. So let's look at the example. This is the same setup as before. We're going to do a consolidation on page 102; CPU one, thread one, is doing the consolidation. When it showed up, at the very beginning, it registered with the epoch-based garbage collector. There's some other thread, thread two, that's going to be scanning at the same time; it also registered with the epoch table. Now we do the compare-and-swap to update page 102, and nobody that comes in after this point will ever see the original chain. But thread two is still hanging out — we actually don't know where it is; it could be anywhere inside the data structure, looking at any node, and it could potentially be in the old chain. So instead of tracking exactly what node or delta record each thread is looking at, we just say: hey, there's somebody around from within this time window. So we register this garbage within this first epoch. Thread one finishes and goes away; it deregisters. Thread two scans down, finds what it's looking for in page 102, and when it's done, it's safe for us to go ahead and delete the old chain. So instead of giving every node and every delta record a timestamp like in MVCC, we just have these coarse-grained epochs. Yes — do we have an epoch table for every node? No, the epoch table is for the entire instance of the data structure. And all a garbage entry is, is a pointer to the physical address of the node: when I register this garbage, I'm not making a copy of it; I'm storing the pointer to the head of the delta record chain, so I know that every delta record, and the base page itself, can be garbage collected. Again, nobody else can ever jump to it, because we did the compare-and-swap and the mapping table now points to the new page. Your question: what is the data structure for registering the threads? Just a queue — or an array of pointers to queues, since you cycle through the epochs. It could be a bottleneck, but traversing the index itself is more expensive than that. Next question: when CPU one was compacting, what if somebody else creates a new delta record while we're doing the compaction? When I do the compare-and-swap, I would fail, because the entry is now pointing to some delta record above us that I didn't see. And as I was saying to him, you could be smart and say: oh, this is just another insert, let me reapply it
and then do the compare and swap for that one you basically have to do a diff you're trying to figure out well what did i have and what did i miss so if it's just one maybe it's not a big big deal but if it's a bunch of them it might just be better off just restarting but again the compare and swap because this mapping table guarantees that this thing is always going to be like the ground truth of what the correct what the correct pointer should be so if it's not what we think it is somebody else got him before before we did so far so good right let's make it hard let's do splits and merges right so we'll just focus on splits merges are essentially the same thing in reverse order so now we're going to introduce two new delta record types the split delta record and the separator so the split is going to be a delta record that says that uh... the base page below us in our in our delta chain has been split and here's where to go find uh... the two new boundaries of keys so we'll have a physical pointer down to the next delta record and then a logical pointer to the page they got split off from and then a separator delta record it's not required for correctness but it's just a it's a shortcut up above in the higher part of the tree to say oh by the way below you there was a split here's where to go find the things you're looking for so let's look at example here so now you have four pages and we're going to want to use and then the keys are sort of organized like this and we want to do a split on on one or three let's say we want to insert uh... 
actually let's let's just do a split we don't insert anything the first thing i'm going to do is do a split in you know from my thread you know that with a virtual note here that's that nobody can actually see yet and then he's just now pointing to the next sibling one or four so now i'm going to do a compare and swap here nobody else could could get it before i did so that's not a big deal but i want to update now the the delta record the delta chain for one or three with this new split record here right and the way what the split record store is that there's the physical pointer to the base page and that just says key three key three keys key three to five are here and then five this key seven are over here and this is just a logical pointer so now i do my compare and swap to now update one or three to be now pointing at my uh... my new split record anybody that comes along is looking for key five for example will come down follow this uh... the version chain here these these all get updated to automatically and because i have the mapping table right these guys have logical pointers to the update that everyone automatically now points to the split record so anybody coming along either from from the box you know as a sibling pointer or from the top would see the split record recognize oh well if i'm looking for key five i want to follow a logical pointer here otherwise follow the physical pointer down here now this point here we actually ever done in copies of key five key six right they're still stored in in one or three because we can't do in place updates so when you're now traversing like down here and saying i'm trying to find keys greater than four for example i would have to read remember that i saw split record up above that said right this node here if you're looking for anything greater key five or greater should not be stored in one or three even though you may see it in this one of this one or three you know you need to go find it down down here in one 
of five so then now i need to uh... uh... sort of propagate this this key space up above a key or is split up above so at this point here the root table has the still the original uh... splits or demarcations for what's below me and so if i from a correctness standpoint if i follow it in here and i'm looking for it for some key five uh... i would say key threes keep i was empty k three and k seven and still follow that logical pointer down to the the split one and then i would recognize i really need to go down this other side not the side but to avoid having to uh... to you know do that as unnecessary look up i could insert a separator uh... record that just says i will keep on k seven are now at this new uh... it is a logical point to this new note here same thing compare and swap on this now that gets installed and anybody else comes along with to see this can you don't need a separator for correctness it's it's just for efficiency reasons and then when i do consolidation of the compacted and update this correctly any questions uh... the logical point is what what what what is a logical pointer what would what we actually story he had a page at it so if i'm scanning along like find all keys greater than greater than k k two i would land here it's a r i c k two but i want to keep going to look for a greater than k two oh i need to follow my sibling pointer my sibling pointer is page one or three right that's what you know that's sort of this thing is but i'm just only storing you know the id one or three here in order to get to there i go look my mapping table and what a lot now my physical dress points to here and then and i can either stand down finally need to split to one side of the other as needed yes like your delta chain uh... if you want to do like nothing page one of five or one of three uh... 
with like one of three and one of five share like same delta change no so this question is when i did the split here would would one of five one three set share the same doesn't change now would like to split going like the split is the delta chain of one of three because that's what we're back here right so again i i i just copied out the keys i need for one of five now i have a split record here and i want to have anybody that goes to one of three should know that i split so my compare and swap needs to be on this guy's dot the chain this guy has his own dot the chain and again if i now start making changes to page one of five well that is going to have to do the dot the chain but the logical pointer to it will still get me to the head of that that doesn't change okay so i can anything goes like one of three that's going on one one of five it just it's like the head of like one of five delta chain if anything goes to one of three that should be going to one of four separate the separate is likely to avoid having to do like an extra actual look-ups going down your statement is something that should be one of five can never end up a one of three because we can't because like i this split record say well if i want to insert a key five point five that should be in between five five five six i would get here five point five is greater than or equal to therefore it has to go here i can i can never get there but you wouldn't like immediately reverse down to like one of five why wouldn't you if you're here yes like when you're out when you're up here no again so like there's a lot of lines here so it's confusing sorry the root note still thinks that if i'm looking for key range k3 to k7 i should be looking at a logical pointer one of three so this arrow should really be here physically when i do the look up on the land of the split so i'm looking for key five i do my i think it should be in page one of three i landed a delta record that that's a split and i would 
recognize the split, look at the boundaries in the split record, and that takes me left or right. [Question: At what point can you go directly from the root to 105 without passing through the split?] That's what the separator key does: it updates this routing information. Once it's installed, a lookup for key K5 can go straight down here. Think of it this way: suppose this node can hold four guideposts. Instead of doing an in-place update to add a fourth entry, the separator delta record adds it for me. [Question: Going back to the earlier question, if the compare-and-swap fails, you'll see the split, right? In the implementation you have us build, you restart from the root, because the key might no longer belong on page 103 but on 105. But the thread whose compare-and-swap failed will see the split delta, so if it were smart enough, it could just follow the logical pointer to 105 without restarting from the beginning.] Right. Say I'm trying to insert K5.5 into 103 and my compare-and-swap fails because someone else just did the split: this chain has been updated, so when I come back and re-read it, I'll see the split node, and I could use that to jump to what I'm looking for instead of going back to the root. There are some optimizations like that; I don't know whether we do them all. I think we're fairly conservative and don't always do that in the implementation.
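The routing behavior just described — a lookup lands on page 103's chain, discovers the split delta, and continues to 105 for keys at or above the split key — could be sketched like this (again, my own simplified types, not the real implementation):

```cpp
#include <cstdint>

// Hypothetical delta record for this sketch.
struct Delta {
    enum class Kind { Base, Insert, Split };
    Kind kind;
    int64_t key;             // for Split: the separator between left and right
    uint64_t right_page_id;  // for Split: logical ID of the new right sibling
    const Delta* next;       // next record in the delta chain
};

// Scan the delta chain during a lookup; if we hit a split delta whose
// split key is <= the search key, follow the logical pointer to the new
// sibling instead of restarting from the root. Returns the logical page
// ID the search should continue on.
uint64_t RouteLookup(const Delta* head, int64_t search_key, uint64_t this_page_id) {
    for (const Delta* d = head; d != nullptr; d = d->next) {
        if (d->kind == Delta::Kind::Split && search_key >= d->key) {
            return d->right_page_id;  // key moved to the new sibling
        }
    }
    return this_page_id;  // key still belongs on this page
}
```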
[Question: When you're creating 105, do you traverse 103's delta chain to see which changes, say to K5 and K6, still apply?] Yes, and you know what key you're splitting on. So if there's a delete of K4, I ignore it, because that's not in my range; if there's an insert of K5.5, which should be in my range, I apply it. [Question: Are the split delta and the separator delta installed atomically?] No, we do the split first and install the separator later. I can only do one thing atomically, meaning I can only update one address in the mapping table at a time, because this is latch-free. [Question: So say we append a delta record to page 103, but then we want to append an insert of K5.5, which belongs on page 105. Do we have to read 103's entire delta chain to find the split?] Yes, there could be stuff above it. But for an insert I think you always have to traverse down to the base page anyway, so you would always see the split; actually, I think for everything, delete and insert alike, you always go down to the base page. So it's not like I would ever blindly append the record. All right, we're short on time; I'm rushing this a bit, but I can answer questions afterwards. I want to quickly talk about our optimizations. The paper you guys read was our attempt to write the missing guide on how to build a Bw-Tree: the original Bw-Tree paper from Microsoft doesn't explain a lot of the core things you actually need in order to build a real one.
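Going back to the question about building page 105: the filtering rule — apply only the delta records whose keys fall at or above the split key, ignore the rest — might look like this hypothetical sketch (types and names are mine):

```cpp
#include <cstdint>
#include <vector>

// A flattened view of one pending change from the old page's delta chain.
struct DeltaOp {
    bool is_insert;  // false = delete
    double key;
};

// When the new right sibling is created by a split, only operations whose
// key is at or above the split key belong to it; everything below the
// split key stays with the left page.
std::vector<DeltaOp> DeltasForRightSibling(const std::vector<DeltaOp>& chain,
                                           double split_key) {
    std::vector<DeltaOp> mine;
    for (const DeltaOp& op : chain) {
        if (op.key >= split_key) {
            mine.push_back(op);  // e.g. insert K5.5 applies to the new page
        }
        // e.g. delete K4 is skipped: it belongs to the left page
    }
    return mine;
}
```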
There are a bunch of papers they wrote after the original that sprinkle in some of the things that are in our paper, but our paper is meant to be a consolidated guide: here's how to build a real Bw-Tree. When I first started at CMU, I said, all right, we're going to build a new database system, and I was super enamored with the Bw-Tree: since we're building a new system, we're going to use a Bw-Tree. The first time I taught this class, the second project was to implement a Bw-Tree, which was a nightmare. But like I said, one PhD student was super awesome; we took his Bw-Tree, he kept working on it for two years, and that became this paper. The paper was not meant to say, look how crappy the Bw-Tree is; we actually wanted to benchmark it to see how it would perform, and it turned out to be terrible. That's why the paper is sort of split-brained: half of it says here's how to build it, and the other half says why it's bad, because it wasn't supposed to turn out like that. So I'm going to quickly go over some optimizations we did in our version, which we call the OpenBw-Tree. To the best of my knowledge, ours is the best open-source implementation of the Bw-Tree. There are a couple of others out there, but they don't do everything we do. There's one system out of Germany called sled, an embedded database written in Rust, that is supposedly using a Bw-Tree; I don't know whether they go to the same extent that we do. The Bw-Tree also shows up in other systems at Microsoft: Cosmos DB, which used to be DocumentDB, their version of MongoDB for the cloud, uses the Bw-Tree in certain cases. But ours is the best open-source one. All right, first: where are we actually storing these delta records? The original discussion from Microsoft doesn't really say where they live. You could just allocate a bunch of little delta records on the heap, but that's a bad idea, because you end up with fragmentation in memory.
So what we do is, when we allocate a base page, we reserve a little extra space in the header of the base page for storing delta records. Now, when I want to add a new delta record, I do a compare-and-swap on this offset to claim space, write my delta record there, and then go back and do the compare-and-swap on the mapping table to make it the new head of the delta chain. So again, I don't need to take latches to acquire the space, and when this area fills up, I do the consolidation. The other thing that was super important: I haven't really described what the mapping table is. In the original Bw-Tree paper it seems to be a hash table; our implementation is just an array. You want to allocate an array that can store any possible node ID up to some limit. The problem is that if you allocate the full array at the very beginning, for every node ID you could ever have, you end up wasting a lot of space, because you're allocating memory you probably don't need. In the current version of the Bw-Tree in our system, the mapping table's maximum size is one million, so I can have one million node IDs. If we store 128 keys per node, that's potentially 128 million keys total; it's less in practice, because we only store entries in the leaf nodes, but roughly that's what it is. And an array of 64-bit pointers with one million entries is eight megabytes.
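A minimal sketch of that latch-free space reservation — assuming a fixed-size slab in each base page, and using an atomic fetch-add as a simplification of the compare-and-swap on the offset described above:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical base page with a pre-allocated slab for delta records.
struct BasePage {
    static constexpr size_t kDeltaArea = 4096;  // assumed slab size
    std::atomic<size_t> delta_offset{0};
    alignas(8) unsigned char delta_area[kDeltaArea];

    // Claim `size` bytes of the slab without taking a latch: the atomic
    // fetch-add gives each thread a disjoint region. A nullptr return
    // means the slab is exhausted and the page should be consolidated.
    void* ClaimDeltaSpace(size_t size) {
        size_t off = delta_offset.fetch_add(size);
        if (off + size > kDeltaArea) return nullptr;
        return &delta_area[off];
    }
};
```

Co-locating the delta records with the base page avoids the heap fragmentation mentioned above, and consolidation naturally reclaims the whole slab at once.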
That doesn't seem like a lot, but if you load, say, the TPC-C database, every table has two or three indexes, so I'm creating three indexes per table, and those eight-megabyte allocations start to add up without actually storing anything. In the old version of Peloton, when you first loaded TPC-C, the database would grow to something like 256 megabytes with no data in it. The way we got around this was with virtual memory, and this isn't anything special, just the constructs you get from the OS. We allocate a big chunk of virtual address space, and as long as we only use the front portion of the mapping table while the index is small, the rest never gets backed by physical memory. So although the virtual size can get large, the resident set size stays quite small, because the OS only backs a page with physical memory when you actually touch it. All right, quickly, to finish up. These are the results from the original Bw-Tree paper that Microsoft published, with Justin Levandoski and the others, and when I saw them I thought, this is awesome, we totally want to build a latch-free index here at CMU. What are they comparing? Their Bw-Tree against a skip list and a B+tree, but the B+tree is actually from BerkeleyDB. BerkeleyDB came out of UC Berkeley; it's an embedded database, basically like LevelDB or RocksDB, but one of the first to do this, and Oracle bought it around 2006. They extracted the B+tree source code from BerkeleyDB and modified it so it didn't actually store anything on disk. Their graph shows the Bw-Tree crushing everything, so when I saw it I thought, this is amazing, we should totally do this. But then you actually implement it. Here are our results. This is our best version of the OpenBw-Tree. This is the state-of-the-art implementation of the skip list, from Alan Fekete's group in Australia; instead of using towers it uses "wheels," which is some minor variation.
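Going back to the virtual-memory trick for the mapping table: a POSIX-only sketch (assuming Linux-style `mmap`; the sizes are the ones from the lecture) of reserving the whole table up front while letting the OS back pages lazily:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// One million node IDs, 8 bytes per pointer: ~8 MB of virtual address space.
constexpr size_t kMaxNodeIds = 1'000'000;
constexpr size_t kTableBytes = kMaxNodeIds * sizeof(uint64_t);

// Reserve the full table as anonymous virtual memory. The OS attaches a
// physical page only when a slot is first touched, so the resident set
// stays small while the index is small, even though the virtual size is 8 MB.
uint64_t* AllocateMappingTable() {
    void* mem = mmap(nullptr, kTableBytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (mem == MAP_FAILED) ? nullptr : static_cast<uint64_t*>(mem);
}
```

Nothing here is specific to the Bw-Tree; it's just the standard lazy-backing behavior of anonymous memory mappings.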
And this B+tree was written by one of the authors from HyPer, who came and visited us for a couple of months; it's one of the data structures they use in HyPer for their comparisons. As you can see, the B+tree pretty much crushes everything, except for this one case here, which might be wrong; let me look into that. And then when you add the data structures we'll talk about next class, the OpenBw-Tree gets wiped away entirely. Here I've brought in Masstree and the ART index. ART is what you're reading for Wednesday's class: a radix tree, or trie, from HyPer. Masstree is out of Harvard; it's a trie of B+trees, which we'll also cover next class. So again, in this environment the Bw-Tree just loses to everything, and that's why we need to get rid of it; we just haven't started yet. Okay, that was super rushed at the end, I apologize. Any questions? All right, next class I'll spend more time talking about latches for B+trees, then we'll cover the radix tree stuff, and then we'll talk a little about what you can do for Project 1. Okay, see you, guys.