All right, so real quick, for project two, I checked the spreadsheet this morning, and as far as I can tell everyone's in a group of three. We're going to talk about skip lists today, but I encourage you to get started as soon as possible, because even though there are three people in a group, it is something you have to think through and reason about to get it correct. Skip lists aren't a lot of code. It's not a lot of code to actually write one. It's a little bit harder to make it concurrent, as we'll see today, but again, it's really about making sure you get the ordering correct. So real quick, some administrative stuff. I sent an announcement yesterday that I had updated the master branch to now include a separation between the test cases for the skip list and the BW tree. So now there's a single test file called skip list index test that invokes the same testing harness infrastructure that we have for the BW tree. So you don't have to go muck around with the BW tree file at all; you just have a single test case that's for the skip list. And your implementation of the skip list should have the exact same behavior as the BW tree. Meaning if we give some input to the BW tree and it produces some output, your skip list should do the exact same thing. This is nice because it lets you test whether your implementation is actually working correctly, since you have a reference implementation to compare against that we know is correct. The other thing is we'll be sending out information either today or tomorrow about how to get access to the development machines we have dedicated for the class. These are machines that MemSQL donated to us last year. These are what you're going to want to use to test the scalability of your implementation, because I believe these machines are dual socket.
Each socket has six cores with hyperthreading, so 12 hardware threads per socket, which means there are 24 available on these machines, along with about 120 gigs of RAM. So way more than you have on your laptop. You can do all your development locally, but when you actually want to test your implementation to see whether you have any bottlenecks, you'll want to use these machines. I think these use Emulab, so the way it works is you reserve access to them and you can install whatever you want on them. Then after 24 hours, the machine gets wiped and restarted. There'll be a directory where you can store all your files that persists across each refresh, and it'll be available to you no matter which of the three machines you get. That way, even if you install whatever you want and try to break the machine, it just gets wiped away and restarted, so you can't cause any real damage. All right, any questions about project two? Any questions about the skip list, at a high level, not necessarily how you actually implement it, because we'll talk about that today, but how to get started on the project? Does anyone have a question about that? Yes. The question is how much memory is on there? Again, about 120 gigs. The next question is can you debug your code on these machines? Yeah, why wouldn't you, right? It has GDB, it has whatever you need; you'll have sudo access to the machine, and you can install whatever you want. Any questions? Right, it's not a desktop; you just SSH into it and you can do whatever. Okay, cool. All right, so for today's class: we spent the last lecture talking about locking and latching inside of indexes, and now we're going to be focusing on how to actually build a latch-free index. I'm going to start off talking about T-trees, which are not a latch-free index, but I'd like to include them for historical reasons.
So you'll see what the early in-memory database indexes looked like, and you'll see why they're a bad idea and why we have better things today. Then we're going to spend time talking about skip lists, the basic implementation of them, and how to make them concurrent. And then we'll finish off talking about some implementation issues you're going to have when you want to build a latch-free index in a database system. These aren't specific to the skip list or the BW tree or any particular index; they're high-level things that are applicable no matter what data structure you're actually using, and they're the things you're going to have to care about in your own implementation for project number two. So again, T-trees are an interesting topic because I like the history of databases. I spend all my time thinking about databases, and I like to go back and see what people did in the past. It's crazy to think 1986 was about 30 years ago; most of you weren't even born. Question? You have a question or no? Okay. So T-trees were the first data structure that people developed for in-memory databases in the 1980s. The first ideas about how you build an in-memory database, and some of the design decisions you face with them, came out of early research done by Dave DeWitt and some of his students at the University of Wisconsin in the early 1980s. Of course, back then the capacity of DRAM was quite limited, so people didn't think you could actually build these systems and run them in production; you simply did not have enough DRAM. But then in the 1990s things got a bit better, and now there were a bunch of implementations of the ideas the Wisconsin guys came up with in real products that people were actually using in production. The most famous one is Oracle TimesTen.
TimesTen was originally called Smallbase. It was an internal project at HP Labs, and they spun it off into a startup called TimesTen, and then Oracle bought them in 2005. They were actually doing quite well; they were a profitable company. Nowadays you can still get TimesTen, you can download it and use it, but it's sort of been relegated to maintenance mode. Oracle tries to sell it to you as a cache, a memory cache to put in front of a big back-end database. But most people wouldn't say, I'm going to build a new application and use TimesTen. We benchmarked it against our system and we can beat it, which is kind of sad. But it's old code that people aren't really making better over the years. The other famous in-memory database that came out around this time was a thing called Dali at AT&T Bell Labs. They eventually changed the name to DataBlitz. I don't know if it's still around; the product page takes you to a 404 page at AT&T, though I'm sure people are still using it. It was used a lot in early telcos. But nowadays you wouldn't actually try to download it, buy it, and use it. So I don't know if DataBlitz is still around, but TimesTen is definitely still around. So, T-trees were designed to be the data structure you'd want to use if you have an in-memory database. The key thing that distinguishes a T-tree from all the other indexes we've seen so far, and that we'll see in this class and on Thursday, is that instead of storing the actual values of the keys in the nodes inside the index, we're going to store pointers to the tuples that have those values. That means that when we do a lookup on a key and want to determine whether our key matches some entry in the index, we have to dereference the pointer, go to the tuple, find the attributes, and do the comparison that way. Whereas in every other index, like a B-plus tree or a skip list, you'll see that the key is actually copied into the index itself.
So this is roughly what a T-tree looks like. Again, it looks like a B-plus tree in that we have this tree hierarchy. One key difference is that the pointers between the different nodes are two-way: a parent has a pointer to its child, and the child has a pointer back to the parent. The actual composition of a T-tree node looks like this, and now you'll also see where the name T-tree comes from, because the node is meant to look like a T. The first thing we're going to have is a pointer to our parent, and then pointers to the left and right children. Then internally, these are the data pointers to the actual tuples that have the values being indexed here. And this is done in sorted order based on the values in the tuples. So let's say I have three tuples, with ID equals one, ID equals two, and ID equals three. I would have pointers to those tuples, and I would have to dereference them to figure out the actual value of the key I'm indexing here. Then, to avoid doing that dereferencing over and over again, the only copies of values they store are these node boundaries that say, here's the min and the max for the left and right child pointers. So again, the key difference from a B-plus tree is that we don't actually store the values; we store pointers to the tuples, and we can derive the values from them. They did this back in the day because storing the values could be expensive: you're copying stuff in memory, it's expensive to store, and your DRAM is limited. In the old days these were 32-bit pointers, maybe even 16-bit pointers, and that's going to be a lot smaller than a composite key made of a varchar and some other stuff. So they're reducing the memory overhead by just storing the pointers.
The next key difference is that the key space of the T-tree is not stored directly along the leaves as it is in a B-plus tree. Instead, it's scattered across the tree like in a B-tree. So if I want to do a lookup on the range two to five, I have to start at the root, traverse down to two, then go to three, and then, using the two-way pointers, go back up from three to four and come back down here to five. In a B-plus tree, you would just do one traversal down and then scan across the leaves to find all the data you needed. In a T-tree, you have to go back and forth to find the things you need. Their argument was that this was okay because the memory savings were worth the extra cost of dereferencing pointers. So again, the advantage is that we're storing less data, because we're just storing pointers to things, which makes our index more compact. And the inner nodes contain key information just like a B-tree, which means that in some cases, if we do a lookup on an exact key, we don't have to traverse all the way to the bottom; it may be that we only have to go to the root or some higher-level node. But obviously this has big problems, right? Because otherwise people would still be using it. The most obvious one is that you're chasing pointers: any time you do a binary search inside a node, any time you need to ask, is my key in this T-tree node, you've got to do a lookup on the tuple in memory and then look at its attributes. To understand why they made these design decisions, you have to remember that the hardware back then was different from how things are today.
With the hardware of the 1980s, when you read the papers about these early in-memory databases, they talk about how CPU caches weren't that much faster than DRAM. So doing a lookup that chases a pointer out to DRAM to find the tuple and evaluate the key was not that much slower than doing a lookup in your SRAM, your L1 or L2 cache. And I don't think they had L3 back then. Then in the 90s CPU caches got much, much faster, so now the performance difference is far more significant, and chasing pointers is actually a terrible idea. The other problems are that it's difficult to rebalance and it's difficult to make this thing concurrent, because we have multiple pointers, and because the inner nodes of the index are also storing the pointers to the tuples themselves. Remember I said B-trees are hard to rebalance in a latch-free environment: if you decide to split or merge an upper-level node, you may also have to split or merge at a lower level in some descendant node. So now you have two threads trying to change the structure of the index at exactly the same time, and unless you take locks, it's probably impossible to do. Same thing here. We'll see this when we talk about skip lists, but because we have pointers in both directions and all over the place, it's hard to make this concurrent without bringing in locks. You can't use the compare-and-swap technique that you can use with a skip list or a B-tree. So as far as I know, very few database systems still use T-trees. The only one I know of that actually does is called eXtremeDB, and it's a database system for embedded devices with really constrained resources. And I don't mean your cell phone; your cell phone has like two gigs of RAM.
You can use SQLite on a phone, and SQLite uses a B-plus tree. I'm talking about really low-level IoT-sensor kinds of devices where you don't have a lot of memory, so they use T-trees. I think in TimesTen, if you ask for a T-tree you can still get it; I don't know whether that's still true or not. But the default now is that you definitely get a B-plus tree when you create an index in TimesTen. So as I said, essentially nobody uses T-trees anymore. I'm only bringing them up so that if you go out into the real world and say, I want to use a skip list or a B-plus tree or a BW tree for an in-memory database, and someone says, well, aren't there in-memory indexes like the T-tree, isn't that better? Now you know the answer is no. You don't want to do this. Okay? So that's T-trees. I just think of them as an interesting historical phenomenon, what people did 30 years ago. Okay. So now we want to focus on the skip list stuff. The first thing to point out is an obvious observation: we want an order-preserving index in our in-memory database, and we want it to be dynamic. By dynamic I mean that you don't know the exact number of keys you're going to have ahead of time, so you need to be able to handle inserts and deletes of arbitrary amounts of data. The easiest way to have a dynamic order-preserving index in a database is to simply use a sorted linked list. This is the simplest thing you could actually do. And basically it looks like this: say we have key one, key two, key three, all the way to key seven; we just maintain them in a linked list in sorted order.
That way, any time we need to do a lookup, a delete, an insert, or an update, we just do a linear scan across the linked list until we find the location where we know our entry should be, or where we want to put something in. For each node in our linked list, the first part will be the key and the second part will be a 64-bit pointer to the next node. You have to traverse all the pointers because we're not storing this contiguously in memory. Remember, it's a dynamic index: we don't know exactly how many elements we're going to have, so we can't pre-allocate an array and jump to some offset. We always have to walk across. So the average and worst-case cost is O(n): in the worst case you go across the entire list to find the last thing you're looking for. But this does provide the property we want: a dynamic order-preserving index. So there's a simple way we can make this better. What's a really simple thing we could do to make traversing this thing not so bad? You're doing a hand gesture, yes. Right, so you're jumping way ahead, but instead of having to go across one by one in our list, we could add extra pointers that point to every other element. So let's say I need to do a lookup and find key six. Instead of going one, two, three, four, five, six, I can start at one, recognize that the thing I'm looking for is greater than key two, so I jump over to key three and do the same thing. Then I can jump to key five, and then I can find key six. In that case, going from key one to key three, I've reduced the number of comparisons and pointer chases by one, and overall I'm reducing it by two. So why not just keep going up and doing this more and more?
I can now have pointers that, instead of skipping every other node, skip every four. So this one can go from key one to key five, and then from key five we can find our element, key six. This is a really simple thing we can do to reduce the search time for our linked list, and it's essentially what a skip list is. A skip list is just going to have multiple levels of linked lists, with extra pointers that allow us to skip over intermediate nodes when we know the values those intermediate nodes have are less than the thing we're looking for. The nice thing about skip lists is that it's a probabilistic data structure: we're basically going to roll the dice to figure out when we should have these extra pointers, and that's going to allow us to maintain our keys in sorted order without having to worry about any global rebalancing. Because with a linked list you can always insert a new entry; you just update the pointers so the previous node now points to you rather than to your successor. This doesn't require the splitting and merging you'd have to do in a B-plus tree. That's why I'm saying this is way easier to implement than other indexes: it's a linked list, and everyone should know how to write a linked list. So again, the way to think about this is that a skip list is linked lists at different levels. At the very bottom you have a single-direction linked list with all the keys in your index in sorted order. Above that you have these extra pointers, but with links to only every other key. Likewise, the third level has half the links the previous level has. You keep going up and up, and at some point you don't have any more levels; that's considered the root of the index.
So to insert a new key, you're basically going to flip a coin. You always insert at the lowest level, but then you decide whether you should go into the second level as well: you flip a coin; if it's tails you don't, if it's heads you add the extra pointer. Then if you got heads, you flip the coin again, and if you get heads again you add another one. You keep going until you finally get tails. Again, this is why it's a probabilistic data structure: we don't have to decide ahead of time exactly where these extra pointers should be; it just works itself out randomly over time. The skip list is going to provide lookups in roughly O(log n). I say approximately O(log n) because it's probabilistic, so we can't guarantee we'll always get O(log n), but in practice that turns out to be the case on average. So this is essentially the same thing as a B-plus tree, but we don't have those tricky merge and split operations; we use the roll of the dice, and that lays out the pointers for us. I'm going to go through a bunch of different examples, but first here's a visual diagram of what a skip list looks like; when you see the literature, this is essentially how everyone draws them. The first thing to point out is that we're going to have our level pointers, which are the entry points to the first node in a particular level. Next to them is the probability that a node will be in that level. The probability at the lowest level is always one, because every single key has to be at the lowest level; this is like the leaf nodes of a B-plus tree, and if a key isn't there, it's not in the index. Above that the probability is one half, and above that one quarter. So the bottom level is our sorted linked list, and here I'm showing that for each node we're going to have the key-value pair.
So key one corresponds to value one, and in the context of an in-memory database this value is typically a 64-bit pointer, and then we have another 64-bit pointer to the next key. As we go up, the key entries at level two aren't going to have the values; instead they have a pointer to the node below them. The keys always have to match, so the key two here has to be the same as the key two below it. And at the end of each level we'll have these special marker nodes that say this is null, or nil, or infinity: this is the stopping point. So if you scan across the bottom level and you get to this point, you know you're at the end of the linked list. In this diagram, the top level doesn't have any nodes, so when we start a search we know we can skip it and go down. All right, so let's look at an example where we want to insert key five. We know the spot where it's going to go, so the first question we have to figure out is how many levels we should add it to. We flip a coin, and if we get heads we add it to level two. We flip the coin again, and if we get heads again we add it to level three. Say in this example we flip a third time but get tails, so we don't go any higher than that. Now we're going to put our entry in here, and at this point key four still points to key six. But within our tower, the key five at level three points to the key five at level two, and that points down to level one. At this point we have allocated all the memory for our new entry, but we haven't updated any of the existing pointers to tell everyone that this thing actually exists. So what we'll do is update all those pointers, and then, if anybody scans the list, they'll find our new entry and we're done.
I'll show how we do this in the context of a multi-threaded environment. The order in which you add the pointers does matter: you have to go from the bottom up, because you don't want someone to see key five at a higher level, scan down, and have it go missing. Yes? His question is: I'm adding key five, I keep flipping the coin, and every time I get heads I go up and up. In this case I added an entry for key five at level three, but there was no other entry at level three. Do I also increase everyone else's height to go up to level three, or even further? No, right? The height of your tower, and these are called towers in skip lists, is determined at the moment you insert it. You don't go back and update everyone else, because that would also violate the probabilistic structure. If you did what you're suggesting, you wouldn't get this nice reduction in the number of pointers as you go up the levels: everyone gets moved up, and you'd essentially have the same bottom linked list replicated at every single level, which is useless. But then, do you mean the maximum level is fixed? His question is: is the maximum level fixed? No, it's probabilistic. You could put a hard threshold and say, I only want to go ten levels, but the probability that you're going to flip a coin and get ten heads in a row is very low. Now, in the reading you had, they have this nice diagram where they show the skip list as I'm showing it here, but then they rotate it and make it look like a tree.
That's another way to think about this. It's essentially the same thing as a B-plus tree, except that instead of having everything along the leaf nodes, it slopes down to the right, and the higher levels are just guideposts that tell you where you are in the index. And that's why we want this probabilistic manner of deciding what level to add something at: we're going to have fewer entries at the top levels, and that allows us to make more coarse-grained jumps to find the data we need. In this case, instead of also having key one up here, I just have key two, and I know that if I need to go beyond key two, I don't have to look at key one or anything that comes before it. So the probabilistic insertion of nodes at the higher levels helps us: there are fewer pointers, so we're using less memory, and it still lets us make the jumps we need to get farther along the list toward the thing we're looking for. Okay, so let's look at how we do a lookup. Say we want to find key three. We would start off here; this is sort of the program counter, the pointer that says where we are in our search. The first thing we do is follow this pointer to the node and look at the key inside of it. In this case, key three is less than key five, so we know we don't want to jump over there, because we can only go in one direction: anything that comes after key five is not going to help us when we're looking for key three. That means we go down and look at the next level. Here, key three is greater than key two, so we know that's the next place we want to jump to. Then we keep going along this level and ask, is key three greater than key four? It's not, it's less, so we don't want to keep going across at this level; we want to come down to the lower level.
And now we just do our linear search to find the key we're looking for. Again, it looks a lot like traversing a B-plus tree: you're doing comparisons as you go, telling you whether to continue going right or to go down, and at some point you reach the lowest level, where you know there's nothing below you, and you go across and do a linear search. Yes? The question is, is this a singly linked list or a doubly linked list? What does it look like? His next question is: when you find that key three is smaller than key five, why not go down from key five and scan to the left? Okay, so let's say I did that. I go here, key five. Then I come down here, key five. That's infinity over there. I can't do that; how can I go left? If you make it a doubly linked list, sure, but then you can't make it concurrent. We'll see that in a second, but it has to be single-direction, because you can't do a compare-and-swap on two memory addresses to update two pointers at the same time. His follow-up is: if it's a singly linked list, doesn't that make it inefficient? Yes, and we'll see how to solve that in a second. Actually, I should show you some posts about this. As far as I know, the only database system that uses a skip list as the primary index is MemSQL. WiredTiger uses skip lists when it brings pages into MongoDB, but it's not the main data structure. So when MemSQL came out, they had this sort of inflammatory blog post about how MemSQL was 30 times faster than MySQL.
They were comparing the primary index of an in-memory system against a disk-oriented one, and they got a lot of flak for that. When you read the Hacker News posts, people were saying, look, you have these skip lists, but they're single-direction; how can you go in the reverse direction? And their initial response was, oh, you just build another skip list that goes the other direction, which defeats the whole purpose of having this super fast, memory-efficient data structure. We'll see in a second how you actually do a reverse search when you only have singly linked lists. It's less efficient than having doubly linked pointers, but if you have a doubly linked list, you can't have a latch-free index. Yes? The question is, can you use a skip list to store non-unique indexes? Yes, we'll come to that later in the lecture. Any other questions? Okay, so the advantages of skip lists: they're going to use less memory than a typical B-plus tree, and part of the reason is that you're not storing pointers in the reverse direction. We'll see this when we talk about the BW tree: the BW tree has to maintain a mapping table for indirection, and that takes more memory. And as you saw in the reading, there are ways to make skip lists even more compact by combining nodes together. Another advantage: insertions and deletions don't require any rebalancing, because you're just doing a compare-and-swap to splice the new entry into your list. And because we can do compare-and-swap, we can make this latch-free and support concurrent access quite easily. So now let's segue into his earlier question: why does this skip list have to be singly linked, why only a single direction?
Again, we're going to use compare-and-swap to let us do this in a latch-free manner. Let's go back to this insert-key-five example. We know we're going to put it in here. When I first allocate the nodes for key five, I flip the coin, I know I'm going to go up to level three, I allocate the memory, insert my key-value pair, and set up the other nodes in my tower. But at this point, key four down here is still pointing to key six, this key four is still pointing to infinity, and our head pointer for the beginning of level three is still pointing to infinity. So we've allocated our memory, we have our key-value entry ready, but nobody knows about this yet. Now we want to do compare-and-swap operations to splice this into the index and make it visible. We already have all our pointers down the tower. So starting at the bottom and going up, we do a compare-and-swap instruction to change the node that comes before us, switching its pointer from key six to key five. What happens with the compare-and-swap instruction is we say: we thought the old value of this pointer was key six; if it's still key six, then we know we're the thread that won the race and we get to change it, updating it to now point to key five. If the compare-and-swap fails, then you know some other thread came in and inserted something between key four and key six before you could, so you have to go back and try again. But you can do this without ever acquiring latches or locks. And the compare-and-swap instruction is super cheap, because it's a single instruction. It's not like we had to lock the node with a mutex or a spin lock and then update all our pointers.
Now, this is why it has to be single direction: because you can't do a compare-and-swap to point this guy in this direction and this guy in this direction at the same time, because those are two different locations in memory and you can't do that in a single instruction. Otherwise you gotta bring in a spin lock to protect both these guys, then flip the pointers and then unlock them, and that's gonna be really slow. Yes? Isn't this two compare-and-swap operations? The statement is, isn't this two compare-and-swap operations: one to connect key four to key five and the other one from key five to key six. Okay, see, I should have been more careful about this. So here, this one doesn't need to be a compare-and-swap, because nobody knows about us yet. There's nothing pointing to this key five. The thread that's inserting this entry is the only thing that knows about it, so it can go ahead and set this pointer here with a plain write, right? What if some other thread comes in and tries to delete key six? We'll see how deletes work in a second. Her question is, what if someone deletes key six? There's logical delete and physical delete. We won't allow physical deletes at this point, but we'll allow logical deletes. Next slide, you'll see how this works. So now at this point here, key five points to key six, and I can update all these others as well, but at this point here, I can now do my compare-and-swap. I only have to update this one location, so key four now knows about key five. So if someone's scanning along the leaves, they're gonna find our key five. Now if someone had scanned past key four to key six before we did our compare-and-swap, they're not gonna see us, and that's okay, right? Because there's some higher level construct for checking for phantoms that would know how to rectify this. For the purposes of the index, that's fine; it's still considered correct.
So then we go up the tower and we keep doing the same thing. We update this pointer here, then do a compare-and-swap; update here, and do a compare-and-swap. And for these higher level ones, if the compare-and-swap fails, you just keep retrying without having to go back and redo the lower one, right? Because that thing's still always gonna point to this, and that's okay. Because if someone else came in and inserted something here and they beat us to update key four at this level, the physical structure of the index is still correct. We would be in front of them here, or they would insert in front of us here, but we're still maintaining the correctness of our linked list here, and then we rectify that as we go up through the levels. So if you do the compare-and-swap at the higher levels and it fails, you just retry it, right? You don't undo the thing you did below. Okay? Is that sort of clear? Again, single direction: you do a single compare-and-swap at each level, and that's enough to make sure that the integrity of the index is correct. Yes? So does this mean you can't do what the blog post suggested, with a single array of pointers, one for each level? His question is, does this mean you cannot do what they suggest in the blog article, where you have a single array per node with multiple pointers? That's still okay. Because you update the pointer at each level one at a time, starting at the lower level, and the same logic applies: if you fail at the higher levels, you just retry. Okay. Yes? If I fail to link from key four to key five, should I redo the search to find where the key is? Yeah, so his question is, if I am not able to make key four point to key five, do I have to redo the search? I think the answer is yes.
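The publish step described above can be sketched in a few lines. This is a minimal, hypothetical sketch (the names `Node` and `publish_after` are mine, not the project's API): the new node's own next pointer is set with a plain store because nobody can reach it yet, and then a single compare-and-swap makes it visible.

```cpp
#include <atomic>

// Hypothetical bottom-level node with an atomic next pointer.
struct Node {
    int key;
    std::atomic<Node*> next{nullptr};
    explicit Node(int k) : key(k) {}
};

// Insert `node` between `pred` and `succ`, expecting pred->next to still be
// `succ`. Returns false if another thread won the race, in which case the
// caller must re-search and retry.
bool publish_after(Node* pred, Node* succ, Node* node) {
    // Private to this thread, so a plain store is enough -- no CAS needed.
    node->next.store(succ, std::memory_order_relaxed);
    Node* expected = succ;
    // One CAS publishes the node: succeeds only if pred->next is still succ.
    return pred->next.compare_exchange_strong(expected, node);
}
```

A failed `publish_after` corresponds exactly to the "someone inserted between key four and key six before you could" case: the caller redoes the search and retries.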
Right, because you don't know how to get back to where you were. You know that something else came in. So you do another search, you figure out where you wanna be, then you retry it. No question? Same question? Okay, good point though. So now you kinda see what I was saying before. It's super easy to make a skip list if it's not concurrent. Then we start adding in these atomic insertions and deletions, and things get a little more tricky. It's still not super hard in terms of the amount of code you have to write, because again, a compare-and-swap is one instruction, right? One command. Making sure you get the ordering correct is the tricky part. All right, so let's do the example that she asked about: doing deletions. So the way we're gonna do deletions in our skip list is in two phases. The first phase is we're gonna logically delete the key from the index by setting a flag inside the node that tells the other threads that are running in the index at the same time to ignore this entry. It's still there. It's still being pointed to by other nodes. It's just that as you scan and come across it, you check the flag, and if it's set to be deleted, you know to just ignore it. Then what'll happen is, at some later point, we'll go and physically remove the key once we know that no other thread could possibly be holding a reference to our deleted node. And we'll talk about how we do this in the garbage collection part later in the class. But again, to physically delete it, it's just another compare-and-swap to take it out of the list and then throw it away. But again, you want to be careful that you only do the physical delete once you know that no thread running at the same time could be touching it. All right, so in this example, we want to delete the key five that we just inserted.
So again, in all the leaf nodes, now we're gonna have a little flag, a boolean flag that says whether this node has been deleted or not. And this is just a one-byte bool. So let's say now we want to delete key five. So we would do our search going across, we would find our entry here, and then flip this thing to be true. And it can only go in one direction. It can only be set to deleted; you can't undelete it. If someone then comes back and tries to insert the same key after it's been marked for deletion, you create a new entry for it, right? So you don't have to do this with a compare-and-swap, because if someone else flips it at the same moment you do, it always goes to true, so it doesn't matter. So then at some later point, once we know that no other thread is looking at this guy, and we'll do this through the garbage collection mechanism later on, we can go ahead and do compare-and-swaps to remove these pointers. We can change these pointers with compare-and-swaps going from the top down to now point to the entry that comes after it. So now at this point here, a thread could still be sitting here looking at us, so we need to make sure that they're gone when we actually finally do the free and remove it. But it's safe for us to go ahead and do the compare-and-swap and update these pointers. And the same thing: if you do the compare-and-swap and somebody else has come in and updated the pointer before we did, like if they inserted a new entry in here, we need to maybe do the search again and find where we should actually be pointing, or find the thing that's actually pointing to us and make sure we update it. Yes? I have a question about how you find any other nodes that are pointing to the deleted node. The question is, how do you find any other nodes that are pointing to it? Well, you're always gonna go in one direction... But other nodes could still be pointing to it.
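The one-way delete flag just described can be sketched like this. It's a minimal illustration with hypothetical names (`LeafEntry`, `mark_deleted`, `visible` are mine): marking is a plain atomic store, because the flag only ever goes false to true, so two threads marking concurrently can't conflict.

```cpp
#include <atomic>

// Hypothetical leaf entry carrying the logical-delete flag.
struct LeafEntry {
    int key{0};
    long value{0};
    std::atomic<bool> deleted{false};  // one-byte flag, one-way transition
};

// Logical delete: no CAS needed, since concurrent markers both write true.
void mark_deleted(LeafEntry* e) {
    e->deleted.store(true, std::memory_order_release);
}

// Readers skip logically deleted entries during scans and lookups.
bool visible(const LeafEntry* e) {
    return !e->deleted.load(std::memory_order_acquire);
}
```

Note that nothing here frees memory; physical removal is a separate, later step gated by garbage collection.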
No, because you start at the top again and you traverse down to where you need to be. Yeah. And you only care about the leaf node that's pointing to it. You know that the only thing pointing to you in your tower is yourself. And you know this thing's being deleted, so you don't have to worry about making sure those go away. And so again, at this point here, when I'm showing this diagram, at this point in time we still have key five in our tower, but all the other pointers now go around it. So we know that no one could newly traverse and find us. But again, because we're a lock-free or latch-free data structure, there could be a thread sitting here for whatever reason, and we need to make sure that it's gone before we can go ahead and reclaim the memory. In the back, yes? Does it matter whether you physically delete the node from the top down or the bottom up? The question is, does it matter whether we physically delete the node from the top down or the bottom up? So why would that matter? Because if you insert a node, you have to insert from the bottom to the top as you update the pointers. Yeah, semantically, yes, you insert from the bottom to the top. But are you talking about physical deletion or logical deletion? Physical deletion. What did I just say? And we'll talk about this more when we do garbage collection. For physical deletion, I said that we physically free this memory when we know no other thread could be referencing it. So no other thread could be referencing either of these blocks, so it doesn't matter. Let me think about whether that's true or not. Actually, no, you are correct. Yes, she's actually correct. So you could have a thread here, and it traverses here. And then at the point we want to delete this... actually, I'll explain this in a second. Yeah, the issue is that you could have a thread here, and we decide that no thread could be touching this thing, so I want to go ahead and physically delete it.
But then this guy is pointing to it. So I traverse the pointer, and it would segfault because you're pointing to nothing. When we do epoch-based garbage collection, we'll see why that cannot happen. So I would say it does not matter, as long as you do garbage collection correctly. I have a question. So when you mark this node, another thread is reading its value. Physical delete or logical delete? Logical delete. Okay, so another thread is reading that value. So her question is: I marked this as deleted. So let's go back here. We're back here. We haven't physically deleted it. We still have pointers to it. Our thread got to it and marked it as deleted. Your question is, if I have another thread that's at this node at the same time, and they check to see whether the delete flag is true, it's not, so then they do the read. But actually, right before they do the read, the flag gets flipped to true. They still read the value. Is that okay? Is that your question? Let's talk it through. Why would that be bad? Why would that be good or bad? So they'd still come back with that value. They can, yes. My question to you is, is that a bad thing? So the answer is, the index doesn't care. Because, again, the index is part of this larger organism, the database system. The database system has all this concurrency control stuff that we talked about before. I don't want to get into linearizability, but that's okay because it's still correct from the index's point of view. It may be incorrect from the transaction's point of view, but that's outside the scope of the index. The index only knows how to make sure that you don't screw up the data structure as you make modifications with concurrent threads. Whether the answers you're getting from the index are correct or not, with respect to transactions, is left to the concurrency control scheme up above in the database system.
So yes, in your example, we have two threads for two different transactions. They both get to key five. The first thread checks whether deleted is true; it's false. So then it goes ahead. And before it reads value five, the other thread marks it as true. So technically, yes, it should not be able to read that, but it checked the flag before the flip happened. That's okay from the index's point of view. Because otherwise you'd have to set a lock on the node when you flip this bit and make sure nobody comes along, and the index doesn't care. And at the upper levels of the system, we can do the phantom checking to make sure whether we read the thing we were actually supposed to read. So that's outside the scope of the index. Next question, yes. So when you say it is latch-free, wouldn't the compare-and-swap count as a latch? His question is, is the compare-and-swap operation on the pointers equivalent to a latch? No, right? Think of a latch as like a traditional OS mutex protecting some critical region. It's not the same thing at all. We're flipping a bunch of bytes to store our new pointer. It's not like we have to flip a flag to acquire the latch and then flip the pointer. It's one atomic operation to do it. So it's considered lock-free. In the back, yes. So is this process considered a logical delete? His question is, when I'm showing the example here, is this considered a logical delete? Yes. Physically it's still there. There's a little flag that says logically you're not allowed to read this.
Okay, so then I have another question, which is, when I delete this node, another thread, assuming it is doing a scan, goes to this node, and if I just delete the pointer to the next node, then how could the other thread get past it? So your question is, when you say delete, do you mean logical delete or physical delete? Logical. All right, logical delete: the first thread comes along and flips that bit, right? Now logically it's deleted, but I still have all my pointers. Another thread comes along with a scan; it'll see key five, see the flag that says it's deleted, and know it should ignore it. So if it's doing a lookup on key five, it would say, oh, it's not actually here, it's deleted; or if it's a scan, it just keeps going. So the pointer part is the physical delete. Yeah, so correct, yeah, I should not show this; this slide is actually incorrect, you're right. You have to still maintain the pointer until later on when you reclaim it; you always maintain this pointer. But now, say I insert key 5.5 that should go here; you have to have some logic to know that, well, I should really be updating key four to point to my new one instead of key five, okay. Actually, it does matter. Yes. So when you do the physical delete. Yes. Do the updates to all the other nodes that point to this deleted node at the different levels have to be done at the same time? So, in my example here, I'm updating pointers. Yeah, I mean the top level pointer and the second level pointer. Your question is what, sorry? So you're updating the pointers, right, to the next node. Yep. So do they have to be done at the same time at the different levels? Your question is, when I do the update here, for this key four here and key four here that are pointing to key five —
I wanna now go around them and point to the next thing. Do all three of them have to be done at the same time? When you say at the same time, what do you mean by that? Like atomically? Yeah, atomically. How are you gonna do that? How can you? You can't, so how do you make sure that, if there are other searches going on at the same time, it's still correct? All right, so his question is, how can I update three pointers without a latch, given that I can't do it atomically, and still have it be correct? Well, let's say I just do the first one, right? The first one, I can do a compare-and-swap on this guy to go around it. At this point here, again, these things have not been removed yet, right? At this time, there's another thread searching at the second level. Yeah. And it came across key five, right? Yeah. That's fine. Okay. It can't go that direction. It can only go this direction. And at this point here, we've only updated this one pointer, so it's still structurally correct. So this brings up, I mean, the way you guys are reasoning through this, and I like this, but this is what I was sort of saying before: you need to write test cases to make sure the structural integrity of your skip list is correct, right? That you don't have things pointing to garbage, and you don't have things pointing to each other in a cycle, right? And these are not test cases we can provide you, because this depends on how you actually implement your index. So these are the sort of things you're gonna wanna do in your project implementation: maybe there's a function you call that freezes the skip list and makes sure that the pointers are in the correct order and pointing to the correct things, okay? All right, so I wanna keep going because we're low on time. There's a bunch of other stuff I wanna talk about.
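The kind of integrity check suggested above can be sketched as follows. This is only an illustration under my own assumptions (representing each frozen level as a sorted vector of keys): it verifies that every level is in order and that every key on an upper level also exists on the level below it, which are two of the invariants a skip list must maintain.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical integrity check over a frozen skip list, where levels[0] is
// the bottom level and each levels[i] holds that level's keys in link order.
bool check_integrity(const std::vector<std::vector<int>>& levels) {
    for (std::size_t i = 0; i < levels.size(); ++i) {
        // Pointers must run in strictly increasing key order at every level.
        if (!std::is_sorted(levels[i].begin(), levels[i].end()))
            return false;
        // Every key in an upper level's tower must exist on the level below.
        if (i > 0)
            for (int k : levels[i])
                if (!std::binary_search(levels[i - 1].begin(),
                                        levels[i - 1].end(), k))
                    return false;
    }
    return true;
}
```

A real implementation would walk the actual node pointers (and could also detect cycles), but the invariants checked are the same.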
So again, this is just reiterating everything I said before. You gotta be careful how you order your operations. And then a very important concept also is that when the database management system, the upper level of the database system, invokes your index, the operation can never fail. What I mean by that is, it's different than a transaction, where if the application executes a query, the database system can come back and say, your transaction failed because there was a conflict. In your index, that can never happen. You can't come back to the database system and say, yeah, no, I can't do this for you, because a compare-and-swap failed. So you're gonna have to retry your operations until you actually succeed, right? If you do a compare-and-swap because you're trying to update things and it fails, then you come back and try it again, and eventually it should succeed. If it fails forever just because there's so much contention, well, from the index's point of view, there's nothing you can do to prevent that, right? There's just too much work trying to be done. So this is an important concept. Remember, we provide you the skip list index, and that's the higher level API, and then there's the skip list data structure implementation, and that's where you actually put the pointers and all the things we're talking about here. Whether you put that retry logic directly in the wrapper or in the data structure itself is up to you. So this last point is important, right? Again, if I do an insert and the compare-and-swap fails because someone else changed the pointer before I could, I can't come back to the database system and say I failed. I have to try it again, over and over, and eventually it'll succeed, okay? So what are the bad things about skip lists?
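The "never fail upward" rule above boils down to a retry loop around the CAS-based attempt. A minimal sketch (the name `insert_with_retry` and the callback shape are my own, not the project's API): the attempt returns false only when it lost a race, and the wrapper simply keeps retrying.

```cpp
#include <functional>

// try_insert performs one search-and-CAS attempt, returning false if the
// CAS lost a race. The wrapper retries until it succeeds, so the index
// operation itself never reports failure to the database system above.
bool insert_with_retry(const std::function<bool()>& try_insert) {
    while (!try_insert()) {
        // CAS failed: someone changed a pointer first. Re-search and retry.
    }
    return true;
}
```

Whether this loop lives in the index wrapper or inside the data structure is, as noted above, an implementation choice.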
So as you saw in the blog post, he talks about how invoking the random number generator can be expensive, especially if you're trying to build a high performance index. This is because of the way pseudo-random number generators are implemented: there's a state machine inside, so every time you call rand(), you update the state machine, and that can be expensive. They're also not cache friendly, because you're chasing all these pointers, and for every single fetch of a cache line to go grab a node, you're only getting one key value pair, right? Whereas compare this to the B-tree or B+ tree: I can just scan across the leaf nodes and get all my key value pairs that way. Eventually I have to follow a pointer to go to the next node, but again, I'm bringing in a lot of data with a single load operation. And then, as we'll see in a second, reverse search is non-trivial. So again, the blog post you guys read was, in my opinion, a nice introduction to how to actually make a skip list work well in practice. We'll see this when we read the BW-Tree paper for Thursday: they compare against a skip list. I don't know how optimal their skip list actually was, but you'll see the BW-Tree is gonna crush the skip list. There are newer versions of skip lists that came out last year. There's a multi-core one from Alan Fekete's group in Australia where, instead of using towers, they use wheels. You don't have to implement that for your project, but there are better implementations of skip lists coming out in recent years. In general, though, skip lists are considered to be slower. All right, so the optimizations that we talked about: we can reduce the number of random invocations by doing that bit shifting operation he showed. I tried to test that out in Python last night and it appears to work. We can pack multiple keys in a node.
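One way to sketch that bit-shifting trick: instead of calling the generator once per coin flip, draw one random word and consume its bits, so the run of low 1-bits gives a geometric(1/2) tower height. This is my own minimal version of the idea, not the blog post's exact code.

```cpp
#include <cstdint>
#include <random>

// One RNG call per insert: each low-order 1-bit counts as a won coin flip,
// so the returned level follows a geometric(1/2) distribution, capped.
int random_level(std::mt19937& gen, int max_level) {
    std::uint32_t r = gen();
    int level = 1;
    while ((r & 1u) && level < max_level) {
        ++level;
        r >>= 1;  // consume the bit we just used
    }
    return level;
}
```

A 32-bit word supplies up to 32 flips, which is far more than any realistic tower height, so one generator invocation per insert is enough.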
We can do reverse iteration with a stack, which is not in the blog post; we'll describe how to do it. And then we can reuse nodes with a memory pool. So I'll talk quickly about how to combine multiple keys into a node. So again, instead of having these nodes just have a single key value pair with pointers, we can combine them to have a single node with a bunch of slots and then just have one pointer to the next entry here. So ideally what you want is for your node size to fit into a cache line. A cache line is, what, 64 bytes. So in this example here, assuming I have 32-bit keys and 64-bit value pointers, I could have four slots. So that would be four times four, that's 16 bytes for my keys, plus four times eight, which is 32 bytes for the values, so 48 bytes for the key value pairs. And then a 64-bit pointer, which is eight bytes, for the pointer to the next node. So that would be 56 bytes to store this node with four entries in it, and that can fit into a cache line. So now when we do a single load, it's one fetch operation to go get the thing we need from DRAM and put it into our CPU cache. And this is why packing them together is much, much faster. So now to do an insert, rather than keeping this in sorted order, we'll just find whatever free slot we have and put it there. So in this case here, the last entry is our free slot, so we'll go ahead and just insert it. So now what happens when we want to do a search? We have to do a linear search inside of our node. But again, in this example here, it's only four elements. As I said, this will fit in a cache line. So this is hanging around in L1, and this will be super fast to do. And this is way better than having to jump from node to node to node.
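The layout just worked through can be written down directly; a `static_assert` is a cheap way to catch it if a change ever pushes the node past a cache line. The struct name is mine, but the sizes match the example: four 32-bit keys (16 bytes) plus four 64-bit values (32 bytes) plus one next pointer (8 bytes) = 56 bytes.

```cpp
#include <cstdint>

// Packed bottom-level node: four key/value slots plus one next pointer,
// sized to fit within a single 64-byte cache line.
struct PackedNode {
    std::uint32_t keys[4];    // 4 x 4 = 16 bytes
    std::uint64_t values[4];  // 4 x 8 = 32 bytes
    PackedNode*   next;       //         8 bytes
};                            // total:  56 bytes

static_assert(sizeof(PackedNode) <= 64,
              "node must fit in one 64-byte cache line");
```

With this layout, one cache-line fetch brings in four key/value pairs instead of one, which is the whole point of the packing.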
The downside obviously is that if now I want to insert something like key five, that should go in between key four and key six. So now I need to do a split and copy some bytes out. But the trade-off is that it may not occur that often, or the cost of doing that split may be less than having to traverse all the pointers all the time. So you get a performance benefit from this. The other downside is that you could have wasted space, because you have to pre-allocate this memory. So you could have less-than-half-full nodes, because you have to maintain all these free slots. So again, it would be interesting if you guys implement this to try to measure what performance benefit you can get from it. Yes? To put this in a cache line, you basically have to properly align it. So yeah, his statement is, if you want to put this in a cache line, you have to make sure that it's properly aligned. Yes. We will talk about how to do word alignment next week. Okay, to do reverse search: again, because we have a singly linked list, we can't just jump to the end and try to go back. If you read the MemSQL blog post about skip lists, they have a really small paragraph talking about how they do reverse search. It's not really clear exactly what they're doing. So the link here is actually to another reference implementation on GitHub from somebody else who implements the reverse search I'm gonna show you, using a stack. Now, when I asked the guy that used to be the VP of engineering at MemSQL, because he was a CMU alum and he was a guest speaker with us last year, I asked him, look, this is how I think it should be done, with a stack. He's like, oh no, we actually don't use a stack; they use extra pointers at the end. So I don't know exactly how the MemSQL guys do this.
The blog article is not clear, but for your implementation, you should probably look at the algorithm that the guy shows here. All right, so for this, we wanna find the range of key four to key two, inclusive, but in descending order. So we'll start off here at the very top. And what we wanna do is, we know that our lower bound for our range is K2, so we wanna find where K2 starts. So the same thing as before: we do a lookup, K2 is less than K5, so we wanna go down to the next level. Here K2 equals K2, so we can jump here. And then we know immediately we can go down because we have an exact match; we actually don't even need to do this comparison here. So we jump down here, and now we're at the lowest level in the skip list. So now what we wanna do is maintain a stack of the entries we see as we scan across the bottom. So in this case here, we would go from K2 to K3 and then finally to K4. And now, when we wanna return the range to the database system in the proper sort order, we just pop these off the stack, and that gives us the reverse search we wanted. Is that clear? So the stack is something you would maintain, possibly inside the index implementation itself, the data structure, but then in the logic inside the wrapper, you would say, I know I wanna take the keys that I saw and just reverse them, and that's how I generate my output. So the wrapper is where you would implement the logic to reverse the stack. Again, this allows us to do reverse search without having back pointers, okay? All right, cool. So now I wanna talk about some additional implementation issues that you're gonna have when you build an index. And again, these are not specific to a skip list; you're gonna have these same issues no matter what data structure you're using. So for these first two, the memory pools and garbage collection, I'm gonna focus on how you actually do these techniques in the context of a lock-free or latch-free data structure.
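The stack-based reverse scan walked through above can be sketched like this. To keep it self-contained, a plain sorted vector stands in for the skip list's bottom level; the function names and bounds handling are my own assumptions, not the project's API.

```cpp
#include <stack>
#include <vector>

// Forward-scan from the lower bound, pushing matching keys onto a stack,
// then pop to emit them in descending order -- no back pointers needed.
std::vector<int> reverse_range(const std::vector<int>& bottom_level,
                               int lo, int hi) {
    std::stack<int> seen;
    for (int k : bottom_level)
        if (k >= lo && k <= hi)
            seen.push(k);  // forward scan, inclusive bounds
    std::vector<int> out;
    while (!seen.empty()) {
        out.push_back(seen.top());  // pop reverses the scan order
        seen.pop();
    }
    return out;
}
```

In the real index, the forward scan would follow bottom-level node pointers from where the tower descent landed, but the push-then-pop reversal is the same.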
For these other ones, the non-unique keys and the variable length keys (I'll talk about composite keys on Thursday), these have nothing to do with being lock-free. These are just how you organize things in memory. But for these first two, again, this is the way you're gonna do them in a lock-free index. All right, so obviously in our index, we don't wanna be calling malloc and free all the time, because that's gonna be slow. That means every single time we wanna physically delete or physically add a new node in our index, we don't wanna call malloc, and every time we free one, we don't wanna call free. So what we're gonna do is use a memory pool. The idea is that if we know the nodes are the same size, or within the same size category, then we maintain a pool of available nodes that we've pre-allocated in our index. When a thread wants to insert a new node, we look in our pool, find a free node, and use that for the thing that we're inserting. And if we don't have any more free nodes, we just call malloc and allocate one. And the idea is that when we do a delete, rather than freeing it and returning the memory back to the operating system, we'll just put it back into our pool so that some other thread can come along and use it. Now you see why it's important to make sure that you do the physical delete only when you know threads are not looking at your node. Because if I delete it and it gets added back to the free pool while some other thread is still looking at it, and then another thread comes along right away, grabs it, and starts putting in new values, now you're gonna have something clearly incorrect, right? The integrity of the data structure may still hold, in the sense that you're not reading invalid addresses in memory, but now some other thread has come along and is inserting things that shouldn't be there.
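The pool idea just described can be sketched in a few lines. This is a deliberately simplified, single-threaded illustration with hypothetical names (a real latch-free pool would need a lock-free free list): deleted nodes go back on a free list instead of being freed, and allocation falls back to malloc only when the pool is empty.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical fixed-size-node memory pool (not thread-safe as written).
struct NodePool {
    std::vector<void*> free_nodes;  // recycled nodes waiting for reuse
    std::size_t node_size;

    void* acquire() {
        if (!free_nodes.empty()) {
            void* n = free_nodes.back();  // reuse a recycled node
            free_nodes.pop_back();
            return n;
        }
        return std::malloc(node_size);    // pool empty: actually allocate
    }

    // Physical delete hands the node back to the pool, not to the OS.
    void release(void* n) { free_nodes.push_back(n); }
};
```

This also makes concrete why premature physical delete is dangerous: `release` makes the node immediately reusable, so a thread still reading it would see freshly inserted values.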
The other thing to be mindful of in your implementation for project two is that you're gonna need some policy to decide when you wanna shrink the pool, right? So let's say I load a billion key value pairs and then I delete all billion. If I don't shrink my pool, then I'm always gonna maintain this giant allocated space for one billion keys, but I'm never gonna use them again, right? So you're gonna need some kind of threshold to decide: well, if I know, for example, that the size of my index is less than half the number of free slots I have in my memory pool, let me go ahead and shrink the pool. Yes? So when you do the reclamation, you also have to move things around in memory, and won't that be ambiguous? Yeah, so his question is, if I shrink the size of the memory pool, am I gonna have to go around and reduce fragmentation and reorganize things in memory? Yes. So, how you avoid this could be the policy you use when you hand out things from the memory pool, right? Rather than randomly picking anything, try to always pick one that was in a block of memory that's being used a lot. The simplest thing you could do is, for every single node you allocate, just call malloc for that. So then when you free it, you know that it's not part of some contiguous space; you just give it up, right? That's the most naive thing to do. Doing proper compaction is more complicated, and we're not gonna talk about that here. You have the same problem with MVCC as well, right? If you delete tuples and delete old versions, you could have a block that has just one version that's actually still live, and the rest is free space. You wanna reclaim it, but now you need to move things around, and you have to update pointers to the locations in memory because you moved things around. So compaction is more tricky.
I would say just do something simple for your implementation, okay? That's a good point though. Again, I hope you're kind of seeing that, as we now talk about garbage collection, the index is kind of like a mini database system in itself, right? We care about latches, we care about concurrency for different threads, we care about garbage collection and memory pools, and all of this is going on inside the index, while the same kind of stuff is occurring outside the index, inside the database system itself. So the index is an awesome project, in my opinion, because you're sort of building a mini database: you care about a lot of the same things inside the index that the database system cares about. So garbage collection is another example. As I said, we can logically delete safely by setting that flag, so that if anybody comes across and reads our entry, they know they shouldn't use it. But at some point, we need to be able to free our memory. And again, this could be calling free and giving the memory back to the operating system, or putting it back into our memory pool. There's a lot of different techniques you can use; this is a well-studied problem in latch-free or lock-free indexes. I'm gonna focus on the first two here, reference counting and epoch-based reclamation, but if you take 418 or 213 or other classes, there's hazard pointers, there's thread scans, there's a whole bunch of other techniques you can use. And again, the problem we're trying to solve is this: my thread is doing a scan across the leaf nodes, or the lowest level, and some other thread comes along and deletes this guy. If we go ahead and free the memory right away, then this guy would still think this thing points to it, and it would jump to that location in memory and get a segfault because it's reading invalid memory.
So we need a way to ensure that no other thread could come across and try to find our node if we actually free the memory up. The most naive way to do this is reference counting. The idea is that we maintain a counter for every single node to keep track of the number of threads accessing it at any point in time. So anytime before you access a node, you increment the counter by one to tell everyone that you're reading it, and then when you're done and move along to the next node, you decrement the counter. And the garbage collection mechanism of your index knows that it's safe to delete that node when the counter finally reaches zero. This is really simple to do, and you can use atomic addition, essentially a compare-and-swap, to update these counters atomically. This actually gives you really bad performance, though, on multi-core CPUs, and for the same reason we saw that the naive spin lock latch was giving you bad performance: now you have all these different threads trying to update these shared counters for these nodes, and that's gonna cause a lot of cache coherence traffic in the system. So if you're scanning along the leaf nodes (I keep saying leaf nodes, but you know what I mean, the lowest level), as you scan along the lowest level, you keep incrementing and decrementing these counters as you go from one node to the next. That's gonna be a lot of cache coherence traffic to invalidate the cache lines for that counter on all the different cores as they scan across. So reference counting gives you bad performance. It's still technically lock-free, but it's sort of the same thing as a spin latch, because you're flipping this counter to let everyone know that you're inside this critical region. So there are two observations we can make about reference counting.
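Here's a rough sketch of what this naive reference counting looks like (names like `Pin` and `Unpin` are illustrative, not from the project): every reader atomically bumps the count before touching a node, and the node is only reclaimable once it has been logically deleted and the count is back to zero.

```cpp
#include <atomic>
#include <cassert>

// Sketch of naive per-node reference counting (illustrative names).
struct RcNode {
  int key;
  std::atomic<int> ref_count{0};
  std::atomic<bool> deleted{false};
};

// Bump the counter before reading the node; this is the write that causes
// the cache coherence traffic described above, since every scanning thread
// dirties the counter's cache line on every hop.
void Pin(RcNode* n)   { n->ref_count.fetch_add(1, std::memory_order_acq_rel); }
void Unpin(RcNode* n) { n->ref_count.fetch_sub(1, std::memory_order_acq_rel); }

// Safe to reclaim only when logically deleted and no reader holds a pin.
bool SafeToReclaim(const RcNode* n) {
  return n->deleted.load() && n->ref_count.load() == 0;
}
```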
The first is that we actually don't care about the exact value of the counter at any point in time. I don't need to know whether it's one versus two versus three. All I need to know is that it's not zero. And with that, we also don't have to delete the node immediately when the counter reaches zero. We can defer it to some later point, once we know that no other thread could be touching it. So again, when I showed the example of packing multiple key-value pairs into a single node, I said that was a single cache line, so that was 64 bytes. If I delay the deletion of 64 bytes, that's not really that big of a deal. So it's not like when the counter reaches zero I immediately have to go free it up to save space, right? I can delay it and do it at a later point when I know it's safe. So this is what epoch garbage collection gives you. The idea is that there's gonna be a global epoch counter that is periodically updated by some thread; which thread it is doesn't matter. And you do this like every 10 milliseconds: you just update this logical counter by one. We saw this in the Silo paper too, right? They had this epoch thread that's updating things every 40 milliseconds. But now we're doing this inside of our index, and we're doing it every 10 milliseconds. And the idea is that we're gonna keep track of what threads enter the index at the current epoch, and then when they leave. So it's kind of like you enter the index when you invoke the skip list index wrapper, and then you exit the index when you leave that. And so you would know that when I entered, the current epoch was 10, and I keep track of that; here are all the threads that I know are in my epoch. This is another reason why you can't have goto statements, right?
It would be really bad to have someone enter the index, and you mark it in the current epoch, and then they jump to some other location in memory and come back and try to enter it again, right? The semantics of that would be quite weird. So then what'll happen is, when one of these threads says, all right, this node has been deleted logically, you'll mark on that node the current epoch of when it was logically deleted. And then the garbage collector will decide it's safe to reclaim that node once all the threads have left the epoch that the node was deleted at, and all preceding epochs have no threads in them as well. This guarantees that no other thread could be looking at the node that you want to reclaim, and there are no pointers to it that are gonna cause any problems. It avoids having this shared counter on every single node; all you need to do is update this global counter every 10 milliseconds, which is not that big of a deal, and then keep track of setting the epoch on the node when it gets deleted. And there's another example of this from the operating system world; it's basically the same concept as what we use in databases, but they call it something different. In operating systems, this epoch-based garbage collection technique is also called read-copy-update, or RCU, and it's used heavily inside of the Linux kernel. So this is sort of clear, right? We just have a global counter. When a thread enters, we mark that it entered at that epoch, and then we know when it leaves. And anytime we delete a node, we just have to set its little epoch flag to say, this is the current epoch when this one was deleted.
And when no other thread could be inside our current epoch, or any epoch that came before us (because this thing's updating every 10 milliseconds regardless of whether threads have exited the epoch or not, so we know there are no other threads coming in on the prior ones), then it's safe to delete, and we know that nobody else could be looking at it and cause problems. In the back, yes? But this is logical deletion, right? This is physical deletion. When you mark the current epoch? Yeah, so his statement is: this part here, when you mark the current epoch, that's the logical deletion. So instead of setting the Boolean flag to true or false, you set this flag to say, this is the epoch when this thing was deleted. Then as you scan across, you check that flag, and if it's non-zero, it's the same thing as the flag being set to true. So that's the logical piece. The physical piece is when you come back and scan and find all the nodes where the epoch is set, and you know there are no other threads in that epoch or any epoch preceding it; then you can physically delete it. At a later point, the thread can also access the logically deleted node. So again, at a later point, the thread can what? Sorry? It can also access this logically deleted node. I'm missing the verb there, sorry, say it again. Like in the skip list, you mark it as logically deleted. Yes, yes. And at a later point, the thread can also access this logically deleted node. Did you say reset? Access. Access, I mean read it. Yeah, read it. So that's fine. But now the thread reads this node. Yeah. Its epoch is smaller than the current thread's, so... Yeah, I understand. This has nothing to do with versioning. The epoch is just: have I been deleted? Then stamp it, done, deleted. You don't care about what epoch you're in when you read it. It has nothing to do with versioning information. It's not multi-versioning. It's just: am I deleted or not?
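Putting the pieces together, here's a deliberately single-threaded sketch of the epoch bookkeeping (all names are illustrative, and a real implementation would track per-thread entry epochs with proper synchronization): a node stamps the global epoch when it is logically deleted, and it's safe to physically free once every active thread entered at a strictly later epoch.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// A global logical clock, ticked periodically (e.g. every 10 ms) by some
// thread; which thread does it doesn't matter.
std::atomic<uint64_t> global_epoch{1};

struct EpochNode {
  int key;
  uint64_t delete_epoch = 0;  // 0 means "not logically deleted"
};

// Epochs at which currently-active threads entered the index. In a real
// index this would be per-thread state, updated on entry and exit.
std::vector<uint64_t> active_thread_epochs;

// Logical deletion: stamp the node with the epoch it was deleted in.
// Readers treat a non-zero stamp the same as a deleted flag set to true.
void LogicalDelete(EpochNode* n) {
  n->delete_epoch = global_epoch.load();
}

// Physical deletion is safe once no active thread entered at or before the
// node's deletion epoch, so nobody can still hold a pointer to it.
bool SafeToFree(const EpochNode* n) {
  if (n->delete_epoch == 0) return false;  // not even logically deleted
  for (uint64_t e : active_thread_epochs)
    if (e <= n->delete_epoch) return false;  // someone might still see it
  return true;
}
```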
Again, this is what I was saying: it's a mini database inside of the index, but we're not doing multi-versioning. It's single instance, a single version. Okay, so real quick, things to finish up. There are two things to deal with here: non-unique indexes and variable-length keys. For project two, variable-length keys we already handled for you in the implementation. So there are two ways to handle non-unique indexes. The first is duplicate keys: you basically have multiple key-value pairs that all have the same key. Or you have a value list, where you store the key only once and then have a pointer to a linked list with all the additional values for it. For this, I'll just use the B+tree leaf node as an example. There's a bunch of extra metadata at the top that we can ignore; the thing we care about is here. In this example, I have two arrays: one for the keys in sorted order and one for all the values. The values, again, are pointers to the actual tuples. So in this case, I have key one, replicated three times. And each of these is gonna point to some offset: I know the offset for my key in the sorted array, and that corresponds to the offset of my value for this entry. So I'm wasting more memory because I'm storing the key multiple times, but I don't have to have extra pointers to where the actual values are, because if I'm at position three for this key, I can jump to position three in the value list, and that finds the value that I want. The other approach is to use value lists. Now, in my sorted key array, I'm only storing each duplicate key once, and instead it has a pointer to a linked list holding all the values that correspond to that particular key. So you'll see this when you implement the delete operation in your index.
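To make the duplicate-key layout concrete, here's a toy leaf with parallel key and value arrays (a sketch with made-up names, not the BW tree's actual node format). Note that delete takes both the key and the value, so only the intended entry among the duplicates is removed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Duplicate-key approach: the leaf stores parallel key/value arrays, so the
// same key can appear several times, once per value.
struct Leaf {
  std::vector<int> keys;    // kept in sorted order
  std::vector<int> values;  // values[i] pairs with keys[i]

  void Insert(int k, int v) {
    // Find the first slot whose key is >= k, keeping the keys sorted.
    size_t i = 0;
    while (i < keys.size() && keys[i] < k) i++;
    keys.insert(keys.begin() + i, k);
    values.insert(values.begin() + i, v);
  }

  // Remove exactly one (key, value) pair; returns false if not present.
  // Matching on the value too is what keeps a delete of (1, 20) from
  // wiping out the other entries that also have key 1.
  bool Delete(int k, int v) {
    for (size_t i = 0; i < keys.size(); i++) {
      if (keys[i] == k && values[i] == v) {
        keys.erase(keys.begin() + i);
        values.erase(values.begin() + i);
        return true;
      }
    }
    return false;
  }
};
```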
We provide you the key-value pair because you can have duplicate keys: when you wanna delete, say, key one, which has been duplicated multiple times, you also check all the values and make sure you're deleting the right one. Because if you just delete the entire key, you'll delete things that shouldn't have been deleted. In practice, I think everyone implements the duplicate-key approach. This is how we do it in the BW tree. And the reason is because, yes, with value lists you use less memory, since you store the key fewer times, but it makes everything way more complicated, because now you have these variable-length lists, and that makes the nodes possibly variable length. Because now, if you wanna reuse your nodes in your memory pool, you basically have to use a slab allocator technique, where you have: here are all the one-kilobyte nodes, here are the two-kilobyte nodes, and you organize things that way. So in practice, I think everybody does it the duplicate-key way. The last thing we gotta discuss is how to deal with variable-length keys. The first approach is that you don't actually store the keys at all in the index: you store pointers to the tuples, and you have to go look up the tuple to find the attributes you need. We saw this in the T tree, and we said this was probably a bad idea. The next approach is variable-length nodes. You store the sorted key array as a linked list, and the nodes of the linked list can have different sizes. And again, you have the same problem I just mentioned with the value lists: now your nodes can have different sizes, and that makes managing memory more difficult. Another naive approach is to just do padding.
So if you know that your key is a VARCHAR(32), even though most of your keys only have 16 characters, you just pad them out to always be 32 bytes, so that everything is nicely byte-aligned and the node sizes are always the same. But obviously the downside is that you waste space. And the last one is a key map, where you store pointers to the key-value pairs inside of the node itself. So here, for the key map, I'll have an array, and these pointers correspond to a sorted list like this; whatever data structure you want to use, it doesn't matter, but usually the first element will be the actual key and the subsequent elements will be the values. I think in our system, we sort of use the first approach, but we don't actually point to the tuple itself. We point to a byte array allocated on the heap that contains the actual value. So we can actually support std::string natively, because it provides move semantics to avoid having to copy the string multiple times when we want to reshuffle things. But the idea here is that we're pointing to a heap that's owned by the index and not owned by anybody else. So it's okay that we have a pointer, because we don't have to worry about sharing it with any other thread or any other data structure. But in practice, I think most indexes are going to be on fixed-length attributes, right? So it's not that big of a problem. Okay, so what are the parting thoughts? Again, I've said this multiple times during the lecture today, but basically building a lock-free concurrent index is a lot like building a database system, right? A database itself. We have concurrency control, we have garbage collection, we have memory management, but it's sort of done as a microcosm inside of the larger database system. And as I said, the skip list is really easy to implement. If you read various blog posts about skip lists, they'll say you can implement one in 300 lines or less.
I've never fully implemented one all the way, so I don't know whether that's true or not. But as we saw, in order to allow multiple threads to access the index, having a concurrent skip list is more tricky, because we have to make sure we get things in the right order. And then I finished up showing the epoch garbage collection scheme, which is going to be more cache-friendly than reference counting. The other thing I'll say real quickly about how garbage collection will be done in project two: you don't have to do cooperative garbage collection, meaning as the threads scan the index and do their normal operations, you don't have to worry about them doing the garbage collection. There's a separate function called performGC that a completely different thread will invoke periodically, and you can use that to do whatever garbage collection you want to do. So when we run the scale-up test cases, we'll invoke that every, I don't know the exact number, maybe every 50 milliseconds or something like that. But there's also another function called needsGarbageCollection where you can return false. So if you know that there's nothing to garbage collect at this time, you can do a quick check and return false when we invoke that method, to tell us not to invoke performGC. Think about it in a real system: if I have a table where in the last minute I haven't done any updates, and therefore I haven't updated the indexes, I don't want to keep invoking the garbage collection component over and over again, okay? Any questions?
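As a sketch of how those two hooks might fit together (the signatures and members here are my guesses for illustration, not the project's exact API), the background thread can call the cheap check first and only pay for a full GC pass when there is actually something to reclaim:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the non-cooperative GC hooks described above.
class SkipListIndex {
 public:
  // Cheap check so the background thread can skip useless GC passes,
  // e.g. when the table hasn't been updated recently.
  bool needsGarbageCollection() const { return !pending_deletes_.empty(); }

  // Invoked periodically by a separate thread (not by the worker threads
  // scanning the index); reclaims whatever is known to be safe to free.
  void performGC() {
    for (int* p : pending_deletes_) delete p;
    pending_deletes_.clear();
  }

  // Stand-in for logical deletion: queue a node for later reclamation.
  void markDeleted(int* node) { pending_deletes_.push_back(node); }

 private:
  std::vector<int*> pending_deletes_;  // illustrative; not epoch-aware
};
```

A driver loop would look like: every ~50 ms, `if (index.needsGarbageCollection()) index.performGC();`.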
All right, so next class we'll talk more about other indexes. We'll spend most of our time talking about the lock-free BW tree from Microsoft, and we'll spend some time talking about the ART index from the HyPer guys. Then, in the remaining time, I'll do a sort of quick crash course on how to do performance testing with your skip list implementation. So I'll show you how to use Callgrind, and I'll show you how to use perf, okay? And I'll assume you guys know how to use GDB, because it was in the prereqs, okay?