So this is the first of the two-part lecture on OLTP indexes. Last class was a discussion of latching and locking inside of indexes. The locking is for high-level logical things, like can I lock this individual key? The latching stuff is for the internal data structure — how do you protect the critical sections. So in today's class, the lecture is going to be all about how we build indexes that don't use any latches at all. We're still going to need the high-level locking stuff that we talked about before, or being able to check for phantoms in other ways, like HyPer or Cicada or Hekaton doing additional scans. But today's class is really about how we can build data structures without any latches at all. The first thing I want to bring up, though, is that for project number two, we merged some code yesterday into the master branch that will make it easier for you to debug the contents of your index. We added some helper methods to print out what keys you have and what their values are. There's no single utility method where you can throw in a pointer to your index and it'll print out a nice table with everything you have in it; that's because some indexes, as we'll see next class — like ART, for example — don't actually expose what keys they have. For your skip list, you'll want to add some debug methods to your own implementation that let you print out exactly what's in there, and the stuff we added should make that easier. In general, your implementation should match the behavior of the Bw-Tree as much as possible, although we may have found a bug in the Bw-Tree yesterday. But for getting started, this should be okay. The other thing is that I'll send out an email to everyone either today or tomorrow with information on how to get access to the machines that we have available for you in the course.
So when I first taught the course in 2016, MemSQL graciously donated three machines that you all can log into and run whatever experiments or testing you want on. I think each machine has two CPU sockets, and with hyperthreading you have 24 threads in total, and each machine also has something like 128 gigabytes of RAM. So way more than you have in your laptop — if you want to do hardcore scalability experiments, you should use these machines. Again, you have root access: you can log into them, trash them, do whatever you want with them, and they get wiped every 24 hours, so you can't actually do any real damage. I'll post on Piazza how to get access to these, and then you'll get an email from the internal PDL people that says, here's your password, here's how you log in, okay? All right, so for today's lecture, as I said, the goal is to talk about latch-free indexes. I'm listing three types of indexes here, but only the last two are actually latch-free. The first one, the T-Tree, we'll talk about for historical reasons — this was the first in-memory index, built in the 1980s. Nobody actually uses them anymore, except maybe some rare exception, so we'll see what people did back then, but you don't actually want to do this now; the other two are the more modern implementations, okay? So back in the 1980s, machines were way more memory constrained than they are now, and so when people started thinking about whether they could build a database system entirely in memory, they were focused on this idea of reducing the amount of memory you actually have to use as much as possible. T-Trees are one solution to this problem.
So the basic idea of the T-Tree is that instead of storing the actual keys of the attributes that the index is based on, we're going to instead store pointers to the tuples, and any time you need to figure out what the original key was, you have to follow the pointer and look up the original value. You did this because back then, on these machines, the pointers were probably 16 bits, and it's way less to store a 16-bit pointer to the original value rather than a duplicate copy of the value. The other aspect of this that actually matters is that the speed difference between CPU caches and DRAM was not as significant as it is now. Back in the 1980s, you didn't have an L3 cache, and your L1 and L2 were not super fast compared to what DRAM could do. So paying the penalty for a cache miss in that environment wasn't as bad as it is now. For them, having to go do this lookup and say, what's the actual key that I'm checking — that was considered okay, a trade-off they were willing to make because they had limited amounts of memory. So T-Trees came out of the University of Wisconsin–Madison, from Dave DeWitt and others, during the 1980s. Wisconsin did some early awesome work on in-memory databases, but it was mostly all in simulators, because you couldn't actually have enough DRAM on a single machine to do this. But in the 1990s, commercial in-memory databases came out, like TimesTen, and DataBlitz (originally Dalí) from AT&T, and they all ended up using T-Trees, because this was considered the way to build an in-memory index. Nowadays, though, none of the modern in-memory systems still use T-Trees. So here's what it looks like. It's going to be a tree structure, just like a B+Tree or a B-Tree, and the name T-Tree comes from the fact that the nodes are designed to look like T's at a high level. Within a single node, we're going to have the pointers to the actual tuples themselves.
So again, I'm not storing actual copies of the keys, I just have pointers to the tuples. This will be a sorted array where the pointers are sorted on the value of the key that the index is based on. And then I have pointers to my left child and right child to go down below in the tree, but another big distinction from the B-Tree or B+Tree is that I also have a pointer to my parent, because with the way I'm going to traverse, it's not like a B+Tree where you always go to the leaf node and find exactly what you want — you may have to go up and down between different levels. Question? [Student:] Do they store even integers as pointers? Yes. So if I have a table with three attributes, A, B, C, and my index is on A, I don't store the value of A here like I would in a B+Tree. [Student:] Even if it's an integer? Yeah, correct. If the value is a 16-bit integer and my pointer is a 16-bit integer, they still store the pointer. Because if I stored the key directly in here, I'd also have to store the pointer to get to the tuple that I want. So in that case, for your example, the attribute I'm indexed on is 16 bits, but I'd still need another 16 bits for the pointer — so you just store the pointer. It's the minimum information you need to be able to jump and find the thing you want. The only copies of values they're going to maintain are these min and max key values. These won't be pointers, they'll be actual values, and you need them to represent the boundaries of the range that's managed or stored within a single node. So again, if your keys are 16-bit or 32-bit integers, you'd have a copy of those two keys in there. So let's see how you'd actually search on this. Say this is my key space: I have keys 1 through 7.
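To make the node layout concrete, here's a minimal sketch of a T-Tree node in Python. This is a hypothetical layout for illustration, not code from any real system: object references stand in for the raw 16-bit pointers, and I've made both the min and max bounds inclusive for simplicity (the lecture's drawn example uses a half-open range). The key point it shows is that only the min/max keys are materialized in the node — every other key comparison requires chasing a pointer back to the tuple.

```python
# Hypothetical sketch of a T-Tree node: the node stores *pointers* to
# tuples (sorted on the indexed attribute), not copies of the keys.
# Only the min/max key values are materialized, so a traversal can tell
# whether a key falls inside this node's range without chasing pointers.

class Tuple:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c   # index is built on attribute 'a'

class TTreeNode:
    def __init__(self, tuples):
        # "pointer" array, sorted on the key attribute; keys are not copied
        self.slots = sorted(tuples, key=lambda t: t.a)
        self.min_key = self.slots[0].a     # the only key copies in the node
        self.max_key = self.slots[-1].a
        self.left = self.right = self.parent = None

    def covers(self, key):
        # Is the key within this node's managed range? (inclusive bounds here)
        return self.min_key <= key <= self.max_key

    def lookup(self, key):
        # Binary search over the slot array. Each probe must follow the
        # "pointer" to recover the key -- the indirection T-Trees pay
        # for their memory savings.
        lo, hi = 0, len(self.slots) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            k = self.slots[mid].a          # pointer chase to get the key
            if k == key:
                return self.slots[mid]
            elif k < key:
                lo = mid + 1
            else:
                hi = mid - 1
        return None
```

On a modern CPU, every `self.slots[mid].a` probe is a potential cache miss, which is exactly why this design stopped making sense once DRAM latency grew relative to cache latency.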
Sorted from low to high. The way I'm going to represent this in a T-Tree is not like a B+Tree, where everything is in sorted order along the leaf nodes. Instead, I'm going to insert entries in breadth-first order, like this. So now what happens is, say I want to do a lookup for the range 2 to 5. I start at the root node and check whether the value I'm looking for is within its range. It's not, right? Say the root's key space starts at 4; 2 is less than 4, so I know the root's min-key is greater than the thing I'm looking for. So I follow the left child pointer and land here, and now I check: is it in my range? I have a match. So I know I need to go find whatever it is — I have to follow these pointers now to get the actual tuples and do whatever additional computation I need while processing the query. Then I jump down here to 3 and go across. And when I want to get to 4, I've got to go back up — and that's why they need pointers going in the reverse direction, because the entries are laid out in breadth-first order. So for the space savings you get from not having to store additional pointers or additional copies of keys, you pay a penalty of having to do more traversals. Yes? [Student:] At the root node, wouldn't the min-key be 1 and the max-key be 7, to cover the whole range? No. So his question is, wouldn't the key range for this node here at the root be 1 through 7, because it has to cover everything beneath it? No, because if it were 1 through 7, that would mean this node has exactly what I want. Maybe I should draw this better. This node would have 4 to 5 — 4 inclusive, 5 exclusive. So I do my lookup here, and to get the actual value I've got to go down this side. So don't think of this like a B+Tree where the inner nodes are just guide posts that say go left or right — the inner nodes hold actual data themselves. So if the data you need is not within this node's range...
...the min-key tells you to go this direction to get it. [Student:] Is it one record per node? No. Again, I should draw this better — a node can hold a range of values. So the advantages of the T-Tree are that it uses less memory, because we don't have to store the keys, and we don't have additional copies of keys like you would in the inner nodes of a B+Tree, because in a T-Tree the inner nodes themselves are also where data can be stored. The obvious downside is that a T-Tree is difficult to rebalance: because of the breadth-first ordering, we may have to go muck around the tree in a bunch of different ways. And now that the rebalancing is more complex, it's going to be difficult to implement this concurrently and allow multiple threads to modify the index without blocking or locking the entire tree when you make a change. And of course, the other big issue is that we have to chase pointers whenever we scan ranges, because we have to perform that binary search within the node, and we have to go look up the pointers to see what the actual value is that I'm comparing against. So again, this is more of a historical artifact that I like bringing up, because it often comes up where people say, we're building an in-memory index or an in-memory database, and we're going to use a B+Tree — and someone will always say, wait a minute, didn't people design indexes exactly for in-memory storage, like the T-Tree? Aren't they better? And the answer is no, because the way it's organized is really slow, due to that indirection of having to follow the pointers in your node to look up the actual value in the tuple itself. So as I said, in the 1990s TimesTen and DataBlitz were using this. As far as I can tell from the TimesTen documentation, it's actually not clear to me whether they're now using B+Trees or T-Trees.
The manual from 2006, from Oracle, says they use T-Trees, but a recent blog article from June 2017 says they're using B+Trees. I sent an email this morning to one of the developers at Oracle and he hasn't gotten back to me yet. The only other notable system I know of that still uses T-Trees is eXtremeDB, which is an embedded database system. So think of an embedded device — not your cell phone, but an IoT or sensor device, where you're really memory constrained. In their system, they use T-Trees. Your cell phone has a lot of memory — like two or four gigs — and in that case you'd use SQLite, which is an awesome database system, and SQLite uses a B+Tree. So T-Trees are interesting: if you're really memory constrained, this is what you want to use. But if you're building a large-memory system, as we'll see throughout today's lecture and the next one, there are better indexes that are willing to pay the penalty of storing keys in the inner nodes, or additional copies of keys, because on modern CPU architectures that's better — you don't have the indirection problem. OK? All right, so with that said, now we start talking about how to build a modern index. As I said last class, the way to think of an index is that it's essentially like a glossary in a textbook that allows you to do a lookup on some key and jump to the exact page that has the thing you're looking for. And the easiest way to implement a dynamic, order-preserving index is as a sorted linked list. The key word in my description here is the word dynamic. That essentially means we don't know a priori — we don't know when the system boots up — exactly how many keys we're going to have and what their actual values are. Dynamic means that someone could be inserting and deleting things over time, and we need to be able to accommodate that.
So the easiest way to implement this — to be able to handle changes to an index — is as a sorted linked list. And the reason I'm saying it's easy is that when I want to insert or delete an entry, I only have to find the thing I'm trying to modify, and then I just flip one pointer to point to my new thing. If I have key 4 and key 5 and I want to insert something in between there, I just have to change this pointer. It's super easy. It's not like a B+Tree, where I could have a split that causes changes all the way up to the root of the tree and the whole thing gets reorganized. The change I make in a linked list is localized. But now the problem is that if I want to look up a key, it's going to be a linear scan across, potentially, every single element. Because these are pointers to some other location in memory, it's not like an array where I can just jump to an offset where I think my thing is going to be and maybe do binary search. I have to scan across and look at everything. So what's one obvious solution to this problem? How can I speed this up? The answer is a skip list — but does everyone already know what a skip list is, and can we just skip this? OK. So the answer is, you basically have extra pointers that can jump over one element, or one key, in my list and get to the next one. I can do this for every single one. So now if I want to look up, say, key 4: without these, I'd have to do a linear scan to land here. But with this, I can look and say, well, I'm looking for key 4; key 4 is greater than key 3, so I can skip whatever's in between and just jump there. Now I do my next lookup: key 4 is less than key 5, so I know I need to scan across below this. And I can do the same thing again — I can have one level that skips every four elements, every four keys. This is essentially what a skip list is.
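Here's a tiny sketch of that sorted linked list, just to show why an insert is "localized": splicing key 5 in between 4 and 6 redirects exactly one `next` pointer, with no rebalancing anywhere else. This is an illustrative sketch, not code from the course project.

```python
# Minimal sorted singly linked list: inserting a key only flips one
# pointer on the predecessor, unlike a B+Tree split that can ripple
# changes all the way up to the root.

class Node:
    def __init__(self, key, nxt=None):
        self.key = key
        self.next = nxt

def insert(head, key):
    # Walk to the last node whose key is smaller, then splice in the
    # new node by redirecting a single pointer.
    new = Node(key)
    if head is None or key < head.key:
        new.next = head
        return new                 # new node becomes the head
    cur = head
    while cur.next is not None and cur.next.key < key:
        cur = cur.next
    new.next = cur.next
    cur.next = new                 # the single pointer flip
    return head

def to_list(head):
    out = []
    while head is not None:
        out.append(head.key)
        head = head.next
    return out
```

The flip side, as described above, is that `insert` (and any lookup) is a linear scan from the head, since there's no way to jump into the middle of a chain of pointers.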
So a skip list is going to have multiple levels of linked lists, with these extra pointers that allow you to skip over intermediate nodes that you know don't have the data you're looking for. And the big advantage you get in a skip list is the same as in a linked list — because it's essentially just a bunch of linked lists — which is that when I make modifications, those are localized to the elements I'm modifying, like the entries that come right before or after me. I don't have to do major rebalancing across the entire data structure, and that's one of the big advantages you get out of a skip list. Now, what makes a skip list actually special, beyond my simple example where I just jumped over every other key, is that it is a probabilistic data structure. That means that instead of saying I'm always going to build an extra pointer to skip over every second or third element, you're going to flip a coin and decide randomly how many extra levels you're going to have. So again, the way to think about this is that you have different levels. The lowest level is always a single-direction linked list in sorted order. You need that because you can't have false negatives: if anybody does a scan across the lowest level, your key has to actually be there if it exists — otherwise the scan would conclude it's not there. So the lowest level has to have everything. But then at the second level, you basically want links to every other key, at the third level every fourth key, and so forth. The idea is that as you go up from level to level, you want each level to have half as many links as the level below it. And the way you can do this is that when you insert a new key, you always put it at the lowest level, but then you flip a coin: if it's heads, I'll add an extra link; if it's tails, I'll just stop and won't add an extra link.
And if you get heads, you add the extra link and flip again, because now you're going up to the next level: should I add it there, yes or no? And you keep doing this until you get tails. So what happens is you essentially end up with a data structure that is technically random, but on average you're going to approximate O(log n) searches, like you would in a B+Tree. It's like having a B+Tree without the rigid data structure where every node has to be at least half full and all that kind of stuff. And again, you get the benefit of only having to make localized changes. So let's look at an example here. The first thing to point out is that along my linked list, I have my starting point and my endpoint. At the endpoint, I'll just have a marker that says infinity, or null, or nil. That basically says that if you're traversing this linked list and your pointer lands on one of these spots, you know you're at the end of the list. And at the beginning of the list, I'm going to have my starting-point links for each level. Each of these levels corresponds to the probability that a given key will be at that level. The bottom level's probability is always 1 — every element — because you have to have every element. Then above this, you're going to have half as many keys, and above that, half as many keys as the level below it. So again, this is the linked list that I showed at the very beginning: in my toy example, the bottom level has to have everything, and above it we're going to have pointers that allow us to jump to the next key that's at our level. But we'll also have to have pointers to go down the tower — think of these vertical strips here as towers. So if I'm going along and I hit key 2's tower, I know how to go down here and actually get it. And then above here, we don't have any keys yet, so this is our top level.
And it just points immediately to the endpoint. So let's do an example where we want to insert key 5. Key 5 would go here. You'd first do a traversal and find: I have key 4 and I have key 6 — this is where I want to put it. So the first thing I'm going to do is copy it in. But at this point, I haven't updated any pointers, so nobody actually knows about it: key 4 here still points to key 6, and this one points to the end. Then what I'm going to do is go in and install my pointers, and now everything becomes visible. I'm being very hand-wavy about how to do this — I'll go into more detail in a second, because the order in which you install these pointers is important; otherwise you could end up having something point to nothing. But the basic idea is the same: I flipped the coin, I installed key 5 here; I flipped the coin, got heads, added it to the first level; flipped the coin again and added one to the second level of this tower; flipped it a third time and got tails, so I don't add anything above me. Yes, question? And note that how many levels I go up per key is localized to that key — just because this one got up to the third level, I don't need to go back and change anybody else, right? Questions, yes? [Student:] Do you just keep flipping coins until it comes up tails? So his question is, let's take the extreme case: say I'm super lucky — or unlucky, depending on your view — and I flip a coin 100 times and get heads 100 times in a row. Would my tower go all the way up? It depends on the implementation. From a correctness standpoint, it doesn't matter — anybody would still be able to find the thing they need. In practice, though, in terms of probability, that's very unlikely to happen, so the structure sort of balances itself out automatically. Question, yes? [Student:] Do you store any additional information at the starting level? His question is, do I store any additional information at the starting level — such as what?
[Student:] If you want to search for K6, would you just take the top level? How do you know to take that path? OK — on the next slide, we'll see how to do a search. You had a question? [Student:] Does it store parent pointers at every level? So his question is, do you store parent pointers at every level? No. And the reason is that when we make this concurrent, you need to do compare-and-swap, and you can only compare-and-swap a single location. So you only have pointers to your neighbor or, in this case here, to the entry below you in the tower. [Student:] When you're flipping the coin, do you go through every single level and do a search? Yeah — from the beginning to the end, and then you find out where you actually want to insert. We'll see how to do a search and come back to that. Any other questions? OK. Let's do a find on K3. So Prithina's question from before was, do I store any additional metadata here in the starting point to allow me to jump to a location? And the answer is no. If I'm doing a lookup on K3, I just go down each level and try to find where to enter the data structure. In this case, I start at the top level. I follow this pointer, and it takes me to key 5. I say, well, key 3 is less than key 5, so whatever I'm looking for cannot be at this point or beyond — it's going to be on this side of the index. So I don't want to go across and follow that pointer; I want to go down to the next level. Same thing: now I do a comparison, and K3 is greater than K2, so I know that anything going the other direction I don't need to look at — it's somewhere in this range between K2 and K5. So I follow along this pointer and do my comparison. Again, K3 is less than K4, so I don't want to follow this; I jump down here and follow along.
And then I find the thing I'm looking for. So his suggestion earlier was, can we provide hints about things — like, if you're looking for K6, go along this path — and you don't need to do that, because you get it for free: you have to look at these pointers and compare against the actual key anyway to decide whether to follow along this path or go down to another level. So the advantages of a skip list: in practice, you typically end up using less memory than a B+Tree, because you don't have way more pointers and a lot more redundant copies of your data. That's only true, actually, if the B+Tree is storing reverse pointers — in a B+Tree you have to have left and right pointers as you go down, whereas here all the pointers go in one direction, so we use less memory. Insertions and deletions don't require any rebalancing, because the change is always localized to wherever you are in the list. And, as we'll see in the next two slides, you can actually make this thing concurrent and thread-safe just by using compare-and-swap. Yes? [Student:] Why does it use less memory than a B+Tree? A node in a B+Tree can contain a lot of items, while a skip list node only contains one. Right — so his statement is: in my example here, I'm showing a single node, an element in my skip list, that has one key-value pair and then a pointer, but in a B+Tree you can pack multiple keys and values into a single node and get better locality and fewer pointers that way. You can do the same thing in a skip list; we'll see that later. So let's talk about how to build a concurrent one.
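The search walkthrough above — walk right while the next key is still smaller, otherwise drop down a level — can be sketched directly. This builds, by hand, a small skip list shaped like the lecture's figure (my own choice of tower heights for illustration) and runs the descent over it.

```python
# Skip-list search sketch: at each level, walk right while the next key
# is still smaller than the target; when the next key is >= the target
# (or the level ends), drop down one level. Finish at the bottom.

class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height      # next[i] = successor at level i

def search(head, key):
    node = head                          # head sentinel has the max height
    for level in range(len(head.next) - 1, -1, -1):
        while node.next[level] is not None and node.next[level].key < key:
            node = node.next[level]      # skip ahead at this level
        # next key >= target (or end of level): drop down one level
    node = node.next[0]                  # candidate at the bottom level
    return node if node is not None and node.key == key else None

# Hand-built example: towers of heights 2,1,1,3,1 for keys 2..6.
head = SkipNode(None, 3)
k2, k3, k4 = SkipNode(2, 2), SkipNode(3, 1), SkipNode(4, 1)
k5, k6 = SkipNode(5, 3), SkipNode(6, 1)
head.next = [k2, k2, k5]
k2.next = [k3, k5]
k3.next = [k4]
k4.next = [k5]
k5.next = [k6, None, None]
k6.next = [None]
```

Tracing `search(head, 3)` reproduces the lecture's walk: the top level points at key 5 (too big, drop down), the middle level reaches key 2 (keep going right, then drop), and the bottom level lands on key 3.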
So as I said, we can do insertion and deletion in our skip list without any locks or latches, using only compare-and-swap. But the key thing to understand is that the reason we have to have our single-direction linked lists, only going from left to right, is that the compare-and-swap instruction can only update a single memory location atomically. If I had reverse pointers and wanted to insert a new entry, I'd have to go update both the predecessor and the successor to point to my new entry, and I can't do that atomically without latches. Yes? [Student:] Can you just cram two pointers into the same word? So his question is, can I just cram two pointers into the same word? Potentially, yes — do they have a 128-bit compare-and-swap? Pointers are really only 48 bits, right? But even then, with two 48-bit pointers you'd need a 96-bit compare-and-swap, and I guarantee they don't support that. They might have 128-bit. So you can think about it, but in practice nobody does this. So let's go back and do our insert, and see how to do this atomically. As I said, we do a traversal to figure out where we actually need to be — and say we're assuming this is a unique index, so we have to do this traversal anyway to see whether our key already exists. So we figure out that this is where we want to go. The first thing we're going to do is create our entries for every level in our tower, after flipping the coin. But at this point, nobody knows about our key, because we haven't updated any pointers. These original pointers from key 4 are still pointing at whatever they were pointing at before, and the same for the starting point here.
So we can update all our entries here as much as we want without worrying about interfering with anybody else, because no one can land on us and see what we contain. That means we can install our pointers going down the tower, and we can also set our pointers to what we think is the next entry in the list, like that. But now we need to install this and have it become visible. We're going to start from the bottom and go to the top, and do a compare-and-swap on our predecessor's pointer. This is, again, a single instruction. I know what I think should be there, because I did my traversal the first time and saw that it's pointing to K6. So when I do my compare-and-swap, if that pointer is no longer pointing to K6, I know somebody else has come along and inserted or deleted something before I could make my change, and therefore I need to abort my operation and try again. But if it is still pointing to K6 and my compare-and-swap succeeds, then K4 now points to K5. At this point, the key is technically visible and valid in our index: even though we haven't updated any of the pointers at the other levels, anybody doing a traversal along the bottom will be able to find us. Then I can do the same thing — compare-and-swap this one to point to me, and compare-and-swap that one — and now our key is fully installed in the index. Yes? [Student:] To do your insert, you have to remember the predecessor node? Yes — think about it: to do an insert, you have to know where you're going to insert, so you have to do a lookup, do a traversal, and find out where you think you should be. And yes, you have to save that, because you want to be able to install yourself. [Student:] Unless you had parent pointers or something.
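Here's a sketch of that bottom-level install step. Python has no hardware compare-and-swap, so the `cas()` helper below is a hypothetical stand-in for the atomic instruction (in C/C++ this would be `std::atomic<T*>::compare_exchange_strong` or `__sync_bool_compare_and_swap`): it succeeds only if the predecessor's pointer still holds the value we observed during the traversal.

```python
# Bottom-level install via (simulated) compare-and-swap. The new node's
# own pointers are set up first -- nobody can see them -- and the single
# CAS on the predecessor is what publishes the key.

class SkipNode:
    def __init__(self, key):
        self.key = key
        self.next = None

def cas(node, attr, expected, new):
    # Stand-in for hardware CAS: NOT actually atomic in Python.
    if getattr(node, attr) is expected:
        setattr(node, attr, new)
        return True
    return False

def insert_after(pred, new_node):
    # Retry loop: re-read the successor, point the new node at it, then
    # try to swing the predecessor's pointer. If another thread slipped
    # something in first, the CAS fails and we retry (a real
    # implementation would re-traverse to find the new predecessor).
    while True:
        succ = pred.next            # what we think comes after pred
        new_node.next = succ        # private update, visible to no one
        if cas(pred, "next", succ, new_node):
            return                  # CAS succeeded: key is now visible
```

The same pattern is repeated once per tower level, bottom to top; a failure at a higher level just means retrying that one CAS, not rolling back the levels already installed.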
So his statement is, if you had parent pointers going the other direction, you wouldn't have to save that, right? But then you can't do the updates atomically. And his follow-up is: say I'm here, and someone deletes something — say they delete K4. So be careful here — we'll talk about this in a second — there's physical deletion and logical deletion. I can do logical deletion: the chunk of memory for K4 is still here, so I can do my compare-and-swap and I'll still install myself correctly in the linked list. That memory for K4 won't actually be reclaimed until later; we'll see this with garbage collection. The index will figure out that no thread could still be looking at this chunk of memory, and therefore it's safe to delete. So even if someone deletes this, it's a logical delete, so it's still there; if someone deletes that, again, logically it's still there, and my pointers are OK. Yes? [Student:] If a compare-and-swap fails, we don't have to go back and change the successor pointers at the higher levels — we just retry at that level? Yeah. So his question is, say I'm able to do this compare-and-swap successfully, but then I get to this one and my compare-and-swap fails. I don't need to roll anything back; I just retry. It could be a race condition — maybe somebody else got in here before me, because they don't know about K5 yet. You just retry. Yes? [Student:] If K4 is logically deleted, the compare-and-swap will still succeed. But if K6 is deleted, the pointer from K4 to K6 won't be changed, so the compare-and-swap installing K5 before K6 still succeeds — and now both K4 and K6 are deleted. OK, so let's back this up. K4 is logically deleted, so it's still here — it's still pointing to K5, and we can do a compare-and-swap. Then you say K6 is deleted.
[Student:] Before the compare-and-swap happens, K6 is deleted. — Logically or physically? It has to be logically. OK, so it's logically deleted. Then K4 is still pointing to K6, because we don't update the pointer. — But we still think K6 is valid. — It'll be marked — I'll show deletion in a second — it'll be marked as logically deleted, but people are still going to find it, because there's still a pointer back here pointing to it, right? So even though K4 is logically deleted and K6 is logically deleted, someone will still be able to find our K5. [Student:] My point is, wouldn't K5 think that the next entry is K6, even though it's been deleted? — At this point here, say this one is logically deleted, its pointer still points to K6, and K6 is logically deleted. That's fine. My insert comes along and says, well, even though he's deleted, I still have to add myself in here — so I do my compare-and-swap. That's fine. [Student:] Doesn't K5 think K6 is the next entry? — It is. Physically, it's still there; logically, it's not, right? OK, let's just jump ahead — hopefully this makes it clear. So again, we're going to have this distinction between logically being removed and physically being removed. Logically, we basically add a little flag in the node that says this entry has been deleted; if you scan along and see that, you know you should ignore it. Physically deleted, as we'll talk about when we do garbage collection, is when we can remove the actual contents of the entry from memory — when we know that nobody else could be looking at it and there's no pointer that could possibly reach it. The logical part is easy to do; the physical part is hard. So we'll go through both of these. All right, so let's say I want to delete K5.
So now in every single node I have my single Boolean flag that says whether this entry is deleted or not. And I only need it at the leaf level; I don't care about the towers at the higher levels. So I want to delete K5. I do my traversal, I find K5, I get down here, and then I flip this flag and mark it as deleted. I don't think this has to be a compare-and-swap: if you get there and somebody else already deleted it, you know you can bounce out, you're done, because someone else already did it for you. So now I've set this thing as deleted, and if anybody comes along looking for K5 and finds me, they know they can ignore me. Or if you're traversing along K3, K4, K5, K6, again, you know you can just skip over this. But all the physical pointers are still valid. So now I can start cleaning this thing up by removing all my pointers. I'll start from the top and go down, and at each level I'll do a compare-and-swap to replace whatever is pointing to my entry so that it now points to whatever comes after me. So in this case, the node pointing to me at the top level gets compare-and-swapped so that instead of pointing to me, it points to what I point to, which here is the end. Same thing for the level below this, and then the one below that. At this point it's still actually not safe to physically delete this, because we don't know whether there's a thread sitting at this location that's about to jump down the tower to the bottom. We've disconnected it from the index, and no new threads will be able to find us, but there may still be an existing thread that could find us and get tripped up. At some point, once I know it's safe, I can go ahead and completely remove it. Let me answer your question. OK, yes? How do you detect that there are no threads? So the question is how to detect whether there are no threads looking at it?
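The unlink step just described, swinging each level's predecessor pointer past the victim, top level first, can be sketched as follows. This is an illustrative fragment with hypothetical names; a real implementation would re-find the predecessor at a level when a racing thread changes it, where this sketch just stops retrying at that level.

```cpp
#include <atomic>
#include <cassert>

constexpr int kMaxLevel = 4;

struct Node {
    int key;
    std::atomic<bool> deleted{false};   // logical-delete flag (leaf level)
    int height{0};                      // number of levels in this tower
    std::atomic<Node*> next[kMaxLevel]; // one forward pointer per level
};

// Unlink a logically deleted node's tower, top level first (deletes go
// top-down, as in the lecture). preds[i] is the node whose level-i pointer
// currently points at `victim`.
void unlink_tower(Node* preds[], Node* victim) {
    for (int lvl = victim->height - 1; lvl >= 0; --lvl) {
        Node* succ = victim->next[lvl].load();
        Node* expected = victim;
        // swing pred's level pointer past the victim
        while (!preds[lvl]->next[lvl].compare_exchange_weak(expected, succ)) {
            if (expected != victim) break;  // a racing thread already swung it
            expected = victim;              // spurious failure: just retry
        }
    }
}
```

After this runs, no new traversal can reach the node, but a thread that already holds a pointer to it still can, which is why the physical free is deferred to garbage collection.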
That's garbage collection; we'll get to that in a second. Anything else? Yes? Before we actually get to that, could someone re-insert the key while it's in that state? So the question is, if I'm here and someone comes along and tries to insert K5 back in, do you create a new one, and could that be an inconsistent state? Let's say a thread is searching for K5 at the top. Yes. They try to traverse. Yes, so say it's like this, right? I've marked it as deleted, but my pointer still points to it. So my thread is here, and then I do a compare-and-swap, and now that pointer gets removed. And then you compare-and-swap and remove the bottom one. So you're here? Yeah. Next step: you're in it, and then? Now you just encounter null, right? Sorry, sorry. The thread which is searching would think there is a K5 when it started, but suddenly it cannot even reach the value of K5? No, it always can. It always can get to it. When we do garbage collection, it's only then that I know there's nobody pointing to it, because I've already removed those pointers, and I know there's no thread that can be hanging out here following them. Do we set the logical delete flag only at the lowest level, or at the other levels too? So his question is, do you only set this delete flag at the lowest level? Correct, it's only at the lowest level. You don't need it at the top, all right? How does the search at the top know to skip over a tower? So your question is, how does the search at the top know to skip over a tower, like if it's been deleted? It doesn't. It has to go down. So if I'm here, and the skip list is in this state, it's been marked as logically deleted, and I'm here: I don't know whether this is deleted yet, so I have to go down. And functionally, that's correct, right? I'm not going to get a false negative.
I won't get a false positive either. It's slightly less efficient: had I known that, I could have skipped this tower and not gone all the way to the bottom. But you don't know that, and the additional metadata you would need to provide that hint costs too much. Does the order of the compare-and-swaps matter? So his question is, does the order of the compare-and-swaps matter? For insertion, yes. For insertion, you have to install the bottom level first, because otherwise someone could come along and follow a tower pointer into a node that isn't linked at the bottom yet. I won't get into linearizability, but in the end it's OK, because there are higher-level constructs, as you saw with maintaining serializability from a transaction point of view, that handle this. What we're trying to avoid here is having things point to invalid locations in memory and getting segfaults. So in general: insertion goes bottom to top, deletion goes top to bottom. For deletion, do you remove the whole tower when you do it, or do you just set the flag and leave that for the garbage collection part? So his question is, when you're doing a deletion, does the thread doing the deletion remove all of these pointers now, or does that happen later in garbage collection? The thread tries to do it now. What if somebody is sitting on K5's top tower when you delete the pointers on top? So you're here? Yeah. OK. Now if you insert something in front of K5, and the searching thread has to go through that pointer, you're seeing all the pointers before K5; can you insert something there, after K5, if K5 doesn't exist anymore? Well, physically it doesn't exist anymore for the skip list.
So we'd be skipping that node instead of going to it, and essentially you won't miss it. OK, so say we're at this point here. His question is, what if I insert something between K4 and K5, say K4.5, and I'm in this state? So I start at the top level. I follow my pointer, I'm at the end, I know I don't want to go there. Then I jump down to the next level. 4.5 is greater than 2, so I go here. 4.5 is greater than 4, so I go here. Now I'm here. I don't know whether K5 has been deleted yet, and I don't care. This points to K5, and 4.5 is less than 5, so I want to go down. Now I'm at K4, and my next key is K5, so I know I want to insert here. And depending on how many heads I flip, I build my tower up. And that's still correct; there are no false negatives. OK, what if you insert after K5? So I insert K5.5. OK, so I'm at this point here. I'd say, all right, I'm pointing at nothing, I can't go there. I get here: 5.5 is greater than 5, so I go down. 5.5 is less than K6, so I know I want to insert here. That's fine, it's installed, everyone can see me. And because I'm going from the top down, this chain is still correct; you can still find my entry. At some point I'll do all my swaps. But again, at this point, when I want to compare-and-swap K4 so it no longer points to K5, it's not going to end up pointing to K6; it's going to end up pointing to K5.5. And I don't lose anything. Question over here? OK. So again, the thing to be careful about when you build a skip list is how you order the operations, especially for inserts. And when a thread performs an operation, if the compare-and-swap fails because somebody else jumped ahead of you, you can't abort the operation and say, I'm not doing it. You have to retry it.
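The bottom-level insert with its retry loop can be sketched like this, a minimal single-level illustration with hypothetical names: link the new node to its successor first, then publish it with one compare-and-swap; on failure, re-find the predecessor and try again rather than aborting.

```cpp
#include <atomic>
#include <cassert>

struct LNode {
    int key;
    std::atomic<LNode*> next{nullptr};
};

// Bottom-level CAS insert: find pred/succ, link ourselves to succ, then CAS
// pred's pointer from succ to us. A failed CAS means somebody raced us, so
// we re-traverse and retry -- the operation is never abandoned.
void insert_bottom(LNode* head, LNode* node) {
    while (true) {
        LNode* pred = head;                       // head is a sentinel
        LNode* succ = pred->next.load();
        while (succ != nullptr && succ->key < node->key) {
            pred = succ;
            succ = succ->next.load();
        }
        node->next.store(succ);                   // link ourselves first...
        if (pred->next.compare_exchange_strong(succ, node)) {
            return;                               // ...then publish atomically
        }
        // CAS failed: pred->next changed under us; start over
    }
}
```

In a full skip list this runs at the bottom level first, and only then are the tower levels installed, matching the bottom-to-top insert order above.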
So for the index implementation you're going to build for Project 2, you have to support the ability to retry operations. It's not like the higher-level transaction concept we talked about before, where if I try to update a record and somebody else updated it before I did, and that would violate serializable ordering because there's a write-write or read-write conflict, I can abort that transaction and roll back its changes. You can't do that in your index. The index should always be able to do whatever you ask it to do. Now, if it's a unique index and you try to insert a key that's already there, then yes, you report back up that you can't do it. But if I ask you to look up, delete, or update something, as long as you're not violating those uniqueness invariants, you should always be able to do it. So I'll go through these real quickly. I really wanted to get to the BW tree, but since this is Project 2, I'll go into a bit more detail. The skip list I've shown here is what you would implement if you were learning about skip lists for the first time. It's the most basic vanilla one, but it's actually very inefficient. There's this great blog article from 2016, written by some dude I don't know; I emailed him, asked for his real name, he never responded, whatever. It's more engineering-focused than an academic or research article, but basically it says: here's how to build a real skip list and make it actually high performance. He highlights four things that are really slow in a basic skip list implementation. The first issue is that if you're flipping a coin with a random number generator, calling rand() is actually really slow, because it has to maintain internal state to figure out the next random value to give you.
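One well-known way to get cheap pseudo-random bits without rand()'s hidden state machine is a xorshift generator (used here as an illustration; not necessarily the blog's exact generator), with the tower height taken from consecutive "heads" bits:

```cpp
#include <cstdint>
#include <cassert>

// Xorshift: three shifts and xors on a thread-local word. State must be
// seeded nonzero, or it stays stuck at zero forever.
inline uint32_t xorshift32(uint32_t& state) {
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

// Tower height = 1 + number of consecutive low one-bits ("heads"), capped at
// the skip list's max level: the usual p = 1/2 geometric distribution,
// computed from a single random word instead of per-level coin flips.
inline int random_level(uint32_t& state, int max_level) {
    uint32_t r = xorshift32(state);
    int level = 1;
    while ((r & 1u) && level < max_level) {
        r >>= 1;
        ++level;
    }
    return level;
}
```

Each thread keeps its own state word, so there is no shared generator state to contend on.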
And so there's actually a way to do pure bit-shifting operations to get pseudo-random numbers that are good enough. The other one is obviously reusing memory: if a node or an entry in your index gets deleted, you can reuse it rather than freeing it and allocating again. But the two things I want to talk about are how to handle multiple keys in a single node, which is what he asked before about the B+tree, and how to do reverse iteration. So the way I drew the skip list, every single entry has a pointer to its next neighbor. This is obviously very inefficient, because I'm storing a bunch of extra pointers that maybe I don't actually need. It also keeps my memory fragmented, because these entries may not be contiguous in memory, so I may take a cache miss every single time. Furthermore, it's bad for modern CPUs, because this indirection causes the branch predictor to do a bad job, and we may end up having to flush the CPU's execution pipeline because we're jumping to random locations in memory. The way we solve this is by packing a bunch of keys together into a single node. Typically, you want your single node to fit exactly in a single cache line. Can anyone tell me how big a cache line is? 64 bytes. Yes. So in this case, say my keys are 32-bit integers and I have four of them: that's 4 times 4, so 16 bytes. Then I have 64-bit value pointers, 8 bytes per pointer with four slots, so that's 32 bytes. Together that's 48 bytes. And then I have another 64-bit pointer, another 8 bytes, to point to the next node. So that's 56 bytes, and I can put four entries into a single cache line.
And that leaves an extra 8 bytes to store additional metadata like deletion flags and things like that. So now this is super efficient: it's a single cache-line read to retrieve this into my CPU cache, and I can process it as needed. In this example, what I'm also showing is that we don't store the entries in sorted order; we store them in the order they were inserted. That way, when I do my lookup, I can't do binary search as I normally would if it were sorted; I just do a linear scan. But that's OK, because I only have four elements and they're in my CPU cache, so it'll be efficient. So if I want to insert K4, I just find a free slot and add it. And if I want to do a lookup on K6, again, I just land on the node and do a linear scan to find what I want. The obvious downside is, say I need to insert key 5, and it should go between key 4 and key 6, but I don't have any free slots. Then I have to copy some of this data out into a new node. Maybe what I want to do is borrow slots from my neighbor, but that's trickier. So you'll get better performance when you do this; the downsides are that you may have to do extra copying to split a node, and you may have wasted space from a bunch of nodes that are half full or less. You guys should definitely try implementing this in your skip list for the class; it'd be interesting to see what performance benefit you get from it. The last thing you have to deal with is reverse search. As we said, because we're a singly linked list, you can't just start at the end and walk your way back; there's no way to do that. So there are a couple of different ways to handle this, and what I'm going to show you here is how to use a stack.
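The cache-line packing arithmetic above can be pinned down with a struct and a static_assert. This is a hypothetical layout, not the blog's or the project's actual one: four 32-bit keys (16 B), four value pointers (32 B), a next pointer (8 B), and 8 B of metadata, exactly one 64-byte cache line.

```cpp
#include <cstdint>
#include <cassert>

// Four unsorted entries packed into one 64-byte cache line:
//   16 B keys + 32 B value pointers + 8 B next + 8 B metadata = 64 B.
struct alignas(64) PackedNode {
    uint32_t keys[4];        // stored in insertion order, not sorted
    void*    values[4];
    PackedNode* next;
    uint8_t  num_slots;      // how many of the 4 slots are in use
    uint8_t  deleted_mask;   // per-slot logical-delete bits
    uint8_t  pad[6];         // remaining metadata space
};

static_assert(sizeof(PackedNode) == 64, "node should fill one cache line");

// Lookup is a linear scan over at most four keys -- cheap once the line is
// in the CPU cache, so no binary search (or sorting) is needed.
inline int find_slot(const PackedNode& n, uint32_t key) {
    for (int i = 0; i < n.num_slots; ++i) {
        if (n.keys[i] == key && !(n.deleted_mask & (1u << i))) return i;
    }
    return -1;
}
```

The `deleted_mask` shows one use for the spare metadata byte: per-slot logical deletes, mirroring the flag discussed earlier.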
And this is from an open-source reference implementation written by some guy on GitHub, where he describes this algorithm with a stack. If you go read the MemSQL blog about how they do skip lists, there's one short paragraph that says, oh yeah, here's how to do it in the reverse direction, but it's not actually clear how. When I asked the VP of engineering at MemSQL, a CMU alum who's no longer there, what they do, they claimed they don't use a stack; they maintain some extra pointers at the end to help them jump in the reverse direction. But I never got a clear story exactly how they do it. I should probably email them again, but whatever. In my opinion, the stack way is the easiest way to think about this. OK, so the way it works: say I want to do a scan from K4 down to K2. I know the lower boundary of my range is K2, so I do a lookup on K2 to find my starting point. K2 is less than K5, so don't go there; come down here. K2 equals K2, so I jump across and come down here. Now I'm at the starting point of my range. So I maintain a stack, and for every entry I hit until I reach my upper bound K4, I push that key onto the stack. When I reach the upper limit of my range, all I need to do in my index wrapper is emit these keys in reverse order. So you maintain, again, a per-thread buffer, a little memory space where you can add these entries and then spit them out in reverse order. OK? All right, so we have what, 25 minutes left? When does this class end? 4:20. OK. All right, so let's go to BW trees. So for the BW tree paper you read, I think some of you were confused about what's actually going on, so I'll do the best I can to describe it.
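Before moving on to the BW tree: the stack-based reverse scan just described can be sketched as a forward walk over the bottom level that collects keys in the range and then emits them backwards. A minimal single-level illustration with hypothetical names:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct SNode {
    int key;
    SNode* next;
};

// Reverse range scan [low, high] over a singly linked bottom level: walk
// forward pushing every in-range key onto a per-thread "stack" buffer, then
// reverse it to emit the keys in descending order -- no backward pointers
// needed.
std::vector<int> reverse_scan(SNode* head, int low, int high) {
    std::vector<int> stack;
    for (SNode* cur = head; cur != nullptr && cur->key <= high; cur = cur->next) {
        if (cur->key >= low) stack.push_back(cur->key);
    }
    std::reverse(stack.begin(), stack.end());
    return stack;
}
```

In the full skip list, the walk would start at the node found by looking up the low key rather than at the head, but the stack-and-reverse idea is the same.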
I would say you're actually lucky, because the first time I taught this class, Project 2 was not the skip list; it was the BW tree. And kids were crying, their eyes were bleeding. You only had to read the paper about it; we actually went through the trouble of implementing one later. But I think it's worth seeing the skip list juxtaposed with the BW tree, because the BW tree is another latch-free data structure, and it handles the use of compare-and-swap in a different way, which I think is interesting. So the big problem we had with the skip list is that we can't update multiple pointers atomically at the same time. That's why our linked lists can only go in one direction. It also means we can't build a latch-free B+tree directly, because a B+tree has pointers in many different locations, and if we have to do a split or merge, that's a major change: we'd have to update a lot of pointers, and we can't do that atomically. The solution they propose in the BW tree is to introduce an indirection layer that allows them to change what multiple logical pointers refer to atomically, even though they're still confined to compare-and-swap, which can technically only update one address at a time. The BW tree is a latch-free B+tree index that came out of the Hekaton project at Microsoft. The goal was, again, to build a cache-friendly data structure that is order-preserving for their in-memory execution engine. As I said last class, they were originally looking at skip lists, found a bunch of problems with them and why they weren't scalable, and then went off and built their own data structure. So there are two key ideas about the BW tree you need to understand; this is the whole enchilada, the high-level concept of what this thing actually does. The first is that they don't want any in-place updates to any node in the data structure.
So instead, they introduce these delta chains, which let you make changes to a node by just prepending the change to a list, rather than going in and reorganizing memory. They do this to stay cache-friendly: if I have a copy of an index node cached at every single CPU and I make a change from one thread, I have to invalidate that copy everywhere. The other idea is that they use a mapping table, which allows them to do a compare-and-swap on a single location to change the single physical location of a page, but have that change automatically propagate throughout the entire data structure. So let's look at an example. Here's a really simple BW tree with three pages. The first thing to point out is that in these diagrams, I'm going to distinguish between logical pointers and physical pointers. The logical pointers are all based on page IDs: every page, when it gets instantiated and put into the index, is assigned some unique page number. And in our mapping table, we map the page ID to the actual physical address in memory where that page is located. Each page can only exist in one place at a time; there's only one address for it. So any time I want to know, where is page 102, I look in my mapping table and get its physical address. Internally, instead of storing physical addresses, the nodes just store page IDs. So this top page, page 101, has a child 102 and a child 104. If I ever need to get down here, I just look at the mapping table: I want page 102, where is it? Voila, here it is. Let's see how they handle updates. So we start with a single page. Any time I make a change, like inserting or deleting a key, I'm not going to modify the actual page itself.
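The mapping table just described can be sketched as an array of atomic physical pointers indexed by page ID. This is an illustrative fragment, not the paper's implementation (real tables grow dynamically and recycle IDs): because inner nodes store only page IDs, one compare-and-swap on a slot atomically redirects every logical pointer to that page.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::size_t kTableSize = 1024;  // fixed size for the sketch

// Page ID -> physical address. All logical pointers in the tree go through
// this single level of indirection.
struct MappingTable {
    std::atomic<void*> slots[kTableSize];

    void* get(std::uint64_t pid) { return slots[pid].load(); }

    // Swing pid's slot to new_addr only if it still holds `expected`;
    // a failure means the page moved (e.g. a delta was installed) under us.
    bool cas(std::uint64_t pid, void* expected, void* new_addr) {
        return slots[pid].compare_exchange_strong(expected, new_addr);
    }
};
```

Every traversal pays one extra lookup through this table per hop, which is exactly the bottleneck concern raised in the question that follows.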
The base page is immutable: once it's created, it's never changed. Instead, to apply a change, we add a new delta record. Every base page has this thing called a delta chain, sort of a prefix to it, where you can install new delta changes. So say I want to insert key 50, and key 50 should be contained in page 102. I won't change the actual page. I make a new delta record that says: I'm inserting key 50. The delta record gets a physical pointer to the base page, and then I do a compare-and-swap to install the pointer to my delta record as the new location of this page. So for page 102, I do a compare-and-swap so the mapping table now points to the head of the delta chain, starting with this record. Now if any other thread does a lookup and wants to get to page 102, when they follow the mapping table, they land on this delta record. They check a flag in the header that says: the thing you're looking at is not really a base page, it's a delta record. And then they have to apply that change in an internal in-memory representation, as if they're replaying a log, to put page 102 back into the state it should be in. I can keep doing this. If I now want to delete a key, same thing: I make a new delta record, have it point to the current head of the delta chain, which is the delta record I created before, do my compare-and-swap, and now this new record is the head of the delta chain. Yes? Is having a centralized mapping table a big bottleneck? The question is, is the centralized mapping table a big bottleneck, in terms of what? In terms of threads all trying to get at the same thing? When you have a lot of cores all trying to touch the same centralized data structure. Yeah.
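The install-and-lookup pattern just described can be sketched as follows. These are hypothetical structs, not the paper's layout: a new delta physically points at the current chain head, one CAS on the page's mapping-table slot publishes it, and a lookup walks the chain newest-first, where the first delta mentioning the key wins before falling through to the base page.

```cpp
#include <atomic>
#include <cassert>

enum class Kind { Insert, Delete, Base };

struct Delta {
    Kind kind;
    int key;       // meaningful for Insert/Delete records
    Delta* next;   // physical pointer to the rest of the chain / base page
};

// Prepend a delta: link it to the current head first, then publish with one
// CAS on the page's slot. Fails (and must be retried by the caller) if
// another thread prepended a record in between.
bool install_delta(std::atomic<Delta*>& slot, Delta* rec) {
    Delta* head = slot.load();
    rec->next = head;
    return slot.compare_exchange_strong(head, rec);
}

// Walk the chain newest-first; `in_base_page` stands in for the binary
// search over the immutable base page when no delta decides the answer.
bool contains(Delta* head, int key, bool in_base_page) {
    for (Delta* cur = head; cur != nullptr && cur->kind != Kind::Base;
         cur = cur->next) {
        if (cur->key == key) return cur->kind == Kind::Insert;
    }
    return in_base_page;
}
```

This matches the search example coming up: a lookup for key 50 stops at the insert-50 delta without ever reaching the base page.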
So his question is: if you have a lot of cores all trying to access and modify the same data structure, does that become a bottleneck? Yes, and we'll see that next class. Think about it: in the skip list, as you're going along, your pointer says, here's where I need to go; you know exactly where you need to go. Every single time you want to traverse the BW tree, you have to do a lookup in this mapping table and then jump to that location. Yeah, and that becomes a bottleneck. Yes. All right, let's see how to do a search in this example. I'm only showing one page here, but again, it's a tree structure, so we traverse the tree just like a regular B+tree, and the leaf nodes contain all the keys and values. But when we do a lookup in the mapping table to get to page 102, if we land on a delta chain, we have to start looking at what the actual delta records are doing, to figure out whether they correspond to the key we're looking up. So in this case, if I'm trying to get key 50: I do my mapping-table lookup and land here. This record deletes 48; I don't care about that. I follow the pointer. This one inserts key 50. Aha, that's exactly what I'm looking for, so I'm done. I don't need to go all the way to the bottom; I have exactly what I need. But if no delta matches, then I have to go down to the base page and do a binary search, and I have to apply all the changes I saw as I traversed the delta chain to a thread-local copy of the base page in order to find the thing I want. With the compare-and-swap being done in the mapping table, this is, again, a single location we need to maintain to figure out whether I'm allowed to install my change. So let's say I have two threads trying to install a new delta record on this chain at the same time. This guy wants to delete key 48.
This guy wants to insert key 16. When they do their compare-and-swaps on the mapping table, only one of these threads will succeed. So say the first guy does his compare-and-swap and gets it; now his record is part of the delta chain. The other compare-and-swap will fail, so that thread has to abort its operation, retry, and try to apply the change again, and maybe this time it'll succeed. So it's like the skip list: if your compare-and-swap fails, you have to go back, traverse everything, and retry. Yes? It doesn't retry from the top? Yeah, so this statement is: if this insert-16 fails, you could do the lookup again, figure out what the slot points to now, and then try the compare-and-swap directly on it again. That's one shortcut way to do this. Yes? That's not the way the paper does it. Yeah, so as an aside, I'll say the paper leaves a lot of implementation details undescribed. When you read the paper, you get a high-level idea of what they're doing; when you actually try to build a real one, the paper is insufficient. There's a bunch of stuff they leave out. They have a bunch of papers that came after this, part of the Hekaton project or Deuteronomy, which is another system out of Microsoft, that sprinkle in little tidbits like, oh yeah, you've got to do this in your BW tree, you've got to do that in your BW tree. But some things are still not clearly specified. The paper I sent the email about on Sunday, which we just had accepted at SIGMOD, is: here's how to build it for real. We fill in all the missing gaps. There are way more details than you'd ever care about, but there's a bunch of stuff you need to know to do this all correctly. And there's actually a big issue when you do structural modifications.
OK, so to finish up: there are different categories of delta records. The ones I've shown so far make changes to the actual key space: insert, update, or delete. But you can also have delta records that handle structural modifications. This is still a B+tree, which means we have to do splits and merges, so we've got to handle that. The first thing we have to handle is consolidation. If we just let this delta chain grow forever, it essentially becomes even worse than the linked list I showed before, because I have to look at every single record in my delta chain to figure out whether the thing I'm looking for is actually there. So what happens is you set a threshold that says my delta chain can only get so long, and if a thread comes along and realizes, I'm trying to apply a new delta record and I can't because the delta chain is too long, it tries to consolidate it. That essentially means taking all the changes that were made and applying them to the base page in a new memory location. So in this case, I copy the base page into a new page 102, and then for every delta record I have, going in reverse order, I apply them one by one. After I've done that, this new base page is logically equivalent to the old base page with all its delta records applied. Then, same thing: I do a compare-and-swap to flip the mapping-table pointer to point to my new page. And the nice thing about compare-and-swap is that it also makes sure nobody else inserted another delta record that you didn't apply to your new copy of the base page: if the slot still points to the head of the delta chain you saw when you created the new node, then you know there's no newer delta record, therefore you've got everything, and therefore you're safe to do the swap.
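The replay step of consolidation can be sketched as follows: collect the chain (newest first), then apply it oldest-first to a copy of the base page's keys so that later changes win. An illustrative fragment with hypothetical names; in the real structure you would finish with the CAS on the mapping-table slot described above, and a failed CAS means a new delta snuck in that must also be picked up.

```cpp
#include <cassert>
#include <set>
#include <vector>

enum class Op { Insert, Delete };

struct DeltaRec {
    Op op;
    int key;
    DeltaRec* next;   // nullptr terminates the chain (base page boundary)
};

// Build the consolidated key set: copy the base keys, then replay the chain
// in reverse order (oldest delta first) so the newest change to a key wins.
std::vector<int> consolidate(const std::vector<int>& base_keys,
                             DeltaRec* chain) {
    std::vector<DeltaRec*> deltas;
    for (DeltaRec* d = chain; d != nullptr; d = d->next) deltas.push_back(d);
    std::set<int> keys(base_keys.begin(), base_keys.end());
    for (auto it = deltas.rbegin(); it != deltas.rend(); ++it) {
        if ((*it)->op == Op::Insert) keys.insert((*it)->key);
        else                         keys.erase((*it)->key);
    }
    return {keys.begin(), keys.end()};
}
```

The result is logically equivalent to the old base page with its whole chain applied, which is what the CAS publishes in one shot.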
And you can be clever about certain things: if my compare-and-swap fails because somebody else installed something on top, I can figure out what that was, apply it, and try the compare-and-swap again. Yes? What if a duplicate value appears in both a delta record and the base page? So the question is: I have insert key 50 in the first delta, and there's also a 50 in the base page; what happens? Am I going to miss something? No. You have to do the normal check. If I go back to my insert example: at this point, I want to do insert 50. I have to check whether that key is there anyway, to see whether I'm allowed to do the insert. If it's there and I'm a unique index, I can't do it; I have to abort. And again, the beauty of compare-and-swap: if my compare-and-swap succeeds, meaning the mapping table did still point to the base page I examined, I know no other delta record came in before me, because the pointer was still pointing at the base page, right? So you won't have any incorrect duplicate entries. What if it's at the leaf node and you actually insert into it? It's all still the same. It doesn't matter whether it's a leaf node or an inner node; you always do the same checks. So say this is the leaf node and I'm trying to insert key 50, which has to actually point to a data record: in the leaf node, I would see whether it succeeds or not.
And if I then have to do a split because my node got too big, the same thing applies up above when I get to structural modifications. The same concept applies everywhere. Yes? One of the issues you mentioned with T-trees is that you have to do two memory references every time you do a lookup. Wouldn't the same problem show up here as well? Because you first look in the mapping table, and then you look at the leaf. So his comment is: I said in the beginning that with T-trees, in order to do any key lookup, you always have to follow the pointer to the actual tuple itself and then find the thing you want. And in this case, you're doing a bunch of extra lookups; it's actually more than just two, because depending on how your hash table is implemented, you may have to jump to a bunch of different locations. Yes, but this can be easily made concurrent, "easily" in quotes. It's possible. We did it. It's not easy. With the T-tree, it's not easy to do fine-grained latching the same way you can handle concurrency in a BW tree. Actually, BW trees don't have any latches at all. Yes? The reason they use delta records is that it's cache-friendly, but it doesn't save any cost, and it can actually add cost: you have to traverse all the delta records. So his comment is: because they have these delta records, I'm not making changes to the base pages, and that saves on cache invalidations. But in exchange for that property, I have to execute more instructions and potentially do more memory reads to resolve the mapping. Correct. That's a classic computer science trade-off. OK. So we've got consolidation. Now let's talk about how to do garbage collection here. This is actually related to how you do garbage collection in a skip list, and I'll cover it in more detail in Wednesday's class.
So the high-level idea is that they need to figure out, for a location in memory, some node or something in a delta chain, when do we know that no other thread could be accessing it at that moment in time. And the way you handle this is with epoch-based garbage collection. This idea of epochs is not specific to the BW tree; it's also called RCU in Linux, and it's how Linux does garbage collection for its data structures. What I'm going to describe here is equally applicable to how you would do this in a skip list. The basic idea is that any time a thread wants to do an operation on the index, it has to figure out what epoch it's in, and an epoch can just be a logical counter that's always increasing. When you recognize that some data or nodes are now garbage, meaning, in my previous example, I did my consolidation and the slot now points to the new page, and I want to reclaim the old memory, the idea is this: you know which threads are inside your index doing something in a given epoch. When they create garbage, that garbage is tagged with the current epoch. And at some later point, when you know that no thread is still in that epoch or any prior epoch, you know no thread could be pointing at or looking at that data, so it's safe to reclaim it. Let's look at an example. Again, this is the consolidation example I just did. Say I have one thread on one CPU doing all the consolidation work. When it first enters the index, we add it to this epoch table. For now, assume we have only one epoch, but you can obviously have multiple; in Silo, for example, the epoch ticks over every 40 milliseconds. The exact interval doesn't matter.
So then say I have another thread, and he's also in this epoch, and he's scanning down this delta chain, applying all the changes, trying to figure out, do I have the data that I need? Meanwhile the consolidation finishes on the first thread. We do a compare-and-swap, so now the new node is installed here. And this thread knows that the old chain is garbage, because nobody can reach it anymore since we changed the mapping table. So it gets registered as garbage for this epoch, but the epoch manager knows, all right, for this epoch, I still have two threads inside of it. The first CPU is done, it goes away, and it gets removed from the epoch table, but the manager knows that thread two is still hanging out here. So it has to wait until this thing finishes whatever it's doing; even if it follows another pointer and looks at another node, it's still in the epoch, so we can't free this memory yet. Because again, if we go free that memory, then this thread could be pointing at whatever, and that's incorrect. So when this thread drops out, then we know it's safe to actually delete this entirely. The same idea applies in your skip list. You know what threads are in it for a given epoch; you mark things as logically deleted with that little flag, you remove the pointers, but some threads may still be hanging out inside, and then once all the threads exit that epoch and all prior epochs, then it's safe to go ahead and delete it. Yes? [Student:] Can't it be done by setting the epoch when you set the deleted flag to true, and tagging each transaction with the epoch at which it started? Once you know that all the transactions have an epoch greater than the deleted epoch, you know it's safe to delete.
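The two-step skip-list delete described above (set the flag, unlink, then defer the actual free) can be sketched like this. This is a hedged illustration under assumed names (`Node`, `delete`), simplified to a single-level linked list; a real skip list would unlink the node at every level of its tower.

```python
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.next = None
        self.deleted = False   # the logical-delete flag

def delete(head, key, garbage, current_epoch):
    """Step 1: set the deleted flag so readers skip the entry.
    Step 2: unlink the node, but hand it to an epoch-tagged garbage
    list instead of freeing it, since a reader that already passed the
    predecessor may still be sitting on this node."""
    prev, node = head, head.next
    while node is not None and node.key != key:
        prev, node = node, node.next
    if node is None:
        return False
    node.deleted = True                     # readers now ignore it
    prev.next = node.next                   # unlink; NOT freed yet
    garbage.append((current_epoch, node))   # reclaim once the epoch
    return True                             # (and all prior) drain
```

The memory itself is only released later, when the epoch manager sees that no thread remains in `current_epoch` or any earlier epoch.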
So your statement is, in the skip-list case, when I mark an entry as deleted, I tag it with an epoch number, and any thread that shows up gets an epoch number, and once I know that all active threads have an epoch greater than that tag, it's safe for me to delete it. Yeah, that's the same thing, just with an epoch size of one, right? You can batch things up in more coarse-grained or broader epochs. It's basically the same idea. All right, in the sake of time, I'm gonna skip structural modifications, because this is the worst, people hate this, it's hard to do this, right? Let's skip this. Okay, let's skip to the numbers. Trust me, it works, it's hard. All right, so these are the numbers they reported in their paper from 2013. They have a bunch of workloads: one from the Xbox Live application, a synthetic workload, and a deduplication workload, and they're comparing their BW-tree with a B+tree from Berkeley DB, which at this time, and even now, is not a state-of-the-art implementation; Berkeley DB came out in 1992, right? And they basically flipped some settings to make it store everything in memory, right? But it's still not considered a high-performance modern implementation. And then they have their own baseline implementation, I guess the skip list that they were trying out when they were building Hekaton. And as you can see, across all these workloads, the BW-tree crushes all of it, right? So I saw this in 2013 or so and I'm like, oh yes, I'm going to Carnegie Mellon, let's build a BW-tree, this is what I wanna do, right? So we did that. And then instead of using Berkeley DB, we actually implemented a modern in-memory B+tree using optimistic lock coupling. So this is running on a machine here at CMU.
For this experiment it's only running on a single socket with 10 cores plus hyperthreading, so it's 20 actual threads at a time trying to do something in the index. And we're using a dataset of 50 million keys, 64-bit integers, with a Zipfian distribution. And so what you see is that the, sorry, the BW-tree does really well for insert-only operations. But for the read-only and the read/update workloads, it loses to the B+tree. And then for this skip list, sorry, actually, that might be wrong for read/update on the skip list, let me double-check that; I might have flipped the numbers. So the skip list we're using is actually a very modern implementation from Alan Fekete's group in Australia. It's called a rotating skip list. It uses wheels instead of towers, right, same idea. So this is the best skip list that's available now, and our BW-tree beats it, but the B+tree beats the BW-tree. And then what I'll talk about next class is when we start throwing these other data structures in, like Masstree and ART, then the BW-tree gets crushed, right? And the thing to point out is that the B+tree, Masstree, and the ART index are not latch-free indexes. They're using latches, right? And so the main takeaway here, and we'll do a breakdown of the BW-tree, like if you remove the mapping table, how much faster do you get? We'll do things like that. But in general, a latch-free data structure, at least according to the current research, is not the best way to implement an in-memory index, right? If you do latching in a careful way, as in the case of Masstree, ART, and the B+tree, you actually can get better performance on highly concurrent workloads, right? So latch-free is in vogue, it seems like the hot thing that everybody wants, like lock-free algorithms and all this, right? But for this particular scenario, it's actually not the best.
So again, we'll cover these indexes in more detail next class, but this is a sort of preview to say: everything I talked about today is hard. You have to build a skip list for the project because I have to give you a grade, but if you go out in the real world and try to build an in-memory database, you may want to use these other ones, okay? Okay, so the other overarching thing about the skip list, which I think is kind of interesting, and actually about all these indexes, is that each one is essentially a mini database inside of a database. We have to worry about garbage collection, we have to worry about versions, we have to worry about visibility, right? And the nice thing I like about Cicada, again, is that the indexes are just tables, so you don't have to re-implement all these things yourself; you get them for free because you already had to implement them for your data tables. The non-concurrent skip list is easy to implement, the BW-tree is hard, and the performance benefit you get from the BW-tree or the skip list is not worth the extra pain. I'll say too, in the paper you guys read, they talk about logging the BW-tree deltas to an SSD, to a flash drive. And that's actually one of the key advantages you do get from a BW-tree in that scenario, because it's just logging out the delta chain records, and you can do that sequentially. So in an environment where you do want the index backed by a disk, a BW-tree actually might be a good choice. But for pure in-memory workloads, our research shows that it's not, okay? All right, we're over time. Wednesday we're gonna add back latches, we'll also talk about other things you need to have in your implementation, and if we have time we'll talk about how to do performance testing, okay? Any questions?