Let's get started. So again, a reminder: project one will be due next Monday at midnight, and we'll talk about project two on Monday as well. And then we will assign people to groups. I'll do this later today, so it's your last chance to sign up on your own; otherwise I'll just randomly pick a group for you to be in. So for today's class, as I emphasized on Monday, we're going to be talking about OLTP indexes that are relevant for what you're going to need to do for project number two. We're going to spend time in the beginning talking about how you actually implement a latch. I'm assuming most of you here have taken an OS course or a systems course, and when we talk about latches you may not know the implementation details; you may mostly be familiar with just using mutexes. So there are more sophisticated approaches to implementing latches that we'll talk about here. And then we'll switch to talking about modern OLTP indexes. The paper you read was from Microsoft Research on the BW tree. So even though we'll start by talking about latches, we're then going to talk about the BW tree, which is an index that doesn't need to use latches at all. From there, we'll talk about how to do concurrent skip lists, and I'll spend a little bit of time at the end talking about the radix tree from HyPer, which is not a latch-free index as far as I know. One of the core things you have to understand about how we're going to do latches, and a lot of the techniques we're going to use in the BW tree to be latch-free, is the atomic compare-and-swap operation. This is an instruction that modern CPUs have that allows you to read a value in memory, compare it to some expected value, and if it matches, overwrite it with a new value. These are sometimes called CPU intrinsics, because it's intrinsic functionality of the processor that the compiler exposes to you. So in the case of GCC on Linux, there's a CPU intrinsic called __sync_bool_compare_and_swap, and there are other variations of it. Although it looks like a function, what actually happens is that the compiler converts it into the single CPU instruction that performs this operation. You could write your own inline assembly block in C or C++ to invoke the exact instruction that does the compare-and-swap, but compilers provide you with what looks like a function that actually gets rewritten into the assembly you need. So this is much faster than a real function call, because it emits that one instruction rather than arbitrary C code. In our intrinsic here, the first argument is an address: some location in memory holding a typed value that we're going to look at. The second argument is the value we expect to find at that memory location. And if it is there, the third argument is the new value that we want to write into it. So in this example, when we make the call to the intrinsic, we say: at this memory location we expect the value to be 20, and if it is, we want to overwrite it with 30. And we're allowed to do that, and it gets overwritten. And again, we're doing this as a single instruction.
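As a rough sketch of what this looks like in C or C++ (assuming GCC or Clang on Linux; the values mirror the example on the slide, including the failing case we'll look at next):

    #include <cstdio>

    int main() {
        int value = 20;

        // Expect 20 at &value; if it matches, atomically overwrite it with 30.
        // The compiler lowers this builtin to a single instruction
        // (e.g., LOCK CMPXCHG on x86), not a normal function call.
        bool ok = __sync_bool_compare_and_swap(&value, 20, 30);
        std::printf("first CAS:  %s, value = %d\n", ok ? "true" : "false", value);

        // Expect 25; the value is actually 30, so this CAS fails and
        // the memory location is left untouched.
        ok = __sync_bool_compare_and_swap(&value, 25, 35);
        std::printf("second CAS: %s, value = %d\n", ok ? "true" : "false", value);
        return 0;
    }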
We don't have to take a lock like a mutex, then have an if clause to check the value, and then do our swap; we're doing this with one invocation in hardware, which is really, really fast. Likewise, say we now try a compare with value 25 and new value 35. That obviously doesn't work, because the current value is different, so in this case the sync bool intrinsic returns false. There are variations where you can get back the current value and things like that. So we're going to use these compare-and-swap instructions as a building block for doing more sophisticated latching inside the database system. In general, there are basically four approaches to how we can do latching. There may be a few more out there that I don't know about, but these are the major ones actually implemented in database systems today. We'll go through them one by one and see how we can use the compare-and-swap instruction to perform latching. The first is probably the one everyone's familiar with: if you've taken an OS course, you know about mutexes. It's an operating-system-provided mutex to protect these synchronized regions. They're really easy to use, and this is what everyone is taught when you start doing any kind of systems programming. pthread mutex is the standard one on Linux. Essentially, you declare that you have a mutex, and then down below you can lock it, do something in your critical section, and unlock it when you're done. So again, this is really easy to use; it's what everyone does the first time they write a concurrent program. But mutexes are terribly slow. If you have no contention, then doing the lock and unlock is really fast, because essentially it's just doing a compare-and-swap underneath the covers. But if someone holds the lock when you try to acquire it, it's going to cost at least 25 to 30 nanoseconds, because when you call mutex lock and someone else holds the lock, it ends up being a futex syscall down into the kernel. So now your kernel is taking locks too. Underneath the covers, the operating system maintains internal queues of all the threads that are waiting to acquire locks, and it provides hints to its scheduler to make decisions about which thread to run next. Right, so if your thread tries to acquire this mutex and it's blocked, it's denied, then the operating system wants to know that you're not going to do any useful work, so the scheduler won't schedule you. That's why you have to go down into the kernel when you take a mutex lock. If no one holds the lock, then it's just a compare-and-swap, and that's fast. Futex stands for, I think, fast userspace mutex. But as soon as someone holds it, as soon as there's contention, it's going to take 25 to 30 nanoseconds to do the lock and unlock calls. So is 25 nanoseconds a lot? He's shaking his head no. How much is a read from main memory? What's that? 90 to 100 nanoseconds, right? So this is kind of expensive if you're doing it over and over and over again. So this is bad. This is why people in the 90s basically wrote their own mutexes, and now every modern database system that's doing any kind of latching doesn't use OS mutexes at all.
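For reference, the pattern being described looks something like this (a minimal sketch; the variable names are made up):

    #include <pthread.h>

    pthread_mutex_t latch = PTHREAD_MUTEX_INITIALIZER;
    int shared_counter = 0;

    void increment() {
        pthread_mutex_lock(&latch);    // uncontended: just a CAS in user space;
                                       // contended: a futex syscall into the kernel
        shared_counter++;              // critical section
        pthread_mutex_unlock(&latch);
    }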
I think in our code we have some nasty mutexes sitting around, and we did that because it was easy to do. Not proud of it; we'll eventually get rid of them. But in general, you don't want to use mutexes. What people normally use instead are what we call spin locks, or test-and-set spin locks. The basic idea is that we're still going to do a compare-and-swap, but instead of using the operating system primitive to do our locking, we're going to maintain our locks in user space, in our application. The database system is going to manage this. So the basic idea is that we do a compare-and-swap on a memory location, and if we don't get it, we just spin in a while loop and come back and try again. It essentially looks like this: the C++ standard library has an atomic flag type, which is essentially just a boolean, a single bit, and it provides a nice abstraction on top of the CPU intrinsics for compare-and-swap, which makes it a little more portable. So in this case, when our thread wants to acquire the lock, it does a test-and-set, which is the same idea as compare-and-swap. If we get it, we're done and pop out of the while loop. If not, we loop back around and try to get it again. And the reason this is a little more complicated than a mutex the operating system provides is that it's up to us to decide what we want to do in this inner loop, right? We could yield the thread back to the OS scheduler. We could abort because we tried too many times. We could sleep a little and back off, or we could just come right back around and try to get it again. This is going to be much more efficient because we're not going into the kernel and making any syscalls, but now we have to add some extra machinery in our database system to make sure we don't have starvation and priority issues and other things like that. So it's more work on the database side, but that's the conventional wisdom in a database system: we always want to do as much as we can ourselves and not rely on the operating system for anything. And this is, again, why we don't use mmap or memory-mapped files: we manage our own memory. Another key problem, other than the complexity of managing this inner loop, is that it's not a cache-friendly approach. Say we have a single lock and three threads running on separate CPU cores trying to acquire it, so they keep spinning, calling compare-and-swap over and over again. What happens? Why is this not cache friendly? The lock is somewhere else, right? It will be at another core, and every single time these threads make an invocation to try to acquire it, we have to send an invalidation message or a fetch message to go grab that memory location, bring it into the local L1 or L2 cache for our core, and then try the compare-and-swap. And chances are we're going to fail: if there's a lot of contention in the system, the threads just keep spinning over and over, and there's all this coherence traffic on the chip itself as this memory location moves around. So this is bad.
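Here's a minimal sketch of the test-and-set spin latch just described, using std::atomic_flag; the backoff policy in the loop body is the part that's up to you:

    #include <atomic>

    class SpinLatch {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // test_and_set atomically sets the flag and returns the OLD value;
            // if it was already set, somebody else holds the latch, so spin.
            while (flag.test_and_set(std::memory_order_acquire)) {
                // Policy decision: retry immediately, back off, yield to the
                // OS scheduler, or give up after too many attempts.
            }
        }
        void unlock() {
            flag.clear(std::memory_order_release);
        }
    };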
So the alternative, a better approach than test-and-set spin locks, is what are called queue-based spin locks, and sometimes you'll see these called MCS spin locks. I always forget what that stands for; it's actually the names of the people that invented it: Mellor-Crummey and Scott. That's two guys. The basic idea is that we're going to use the same spinning approach we had in the test-and-set example, but instead of having a single lock that everyone tries to acquire, we're going to have a queue of latches, like a convoy. Only one CPU thread tries to acquire any given location, and we build out the queue behind it; all subsequent threads that come along acquire positions further back in the queue. Again, you can still implement this with atomics, because it's the same mechanism as the spin lock we had before. The idea is that we have a base latch, and instead of a single bit it has a next pointer (you'd want an atomic pointer rather than an atomic flag, but whatever) that points to another latch node in a chain, a queue of these things. When the first thread comes along and goes to acquire the latch, nobody has it, so it's allowed to acquire it, and it does a compare-and-swap on this next pointer to install another struct of the same kind, a queue node, whose own next pointer is null. When the next CPU comes along and tries to acquire the base latch, it sees that it can't, because next is not null. So it follows the chain, does a compare-and-swap on the first thread's node, and since it's the first one there, it's allowed to install itself. Now it's waiting for the first thread to finish before it's granted access to the latch, but it's not spinning on the base location; it's spinning on its own node, and subsequent threads will spin on theirs. So the next thread to come along follows the chain and spins on its own node too. So we no longer have that problem where there's a single memory location everybody is hammering with compare-and-swap. The location each thread spins on is local to it: it can sit in L1 or L2, very close to the core that's spinning, and eventually when the current holder releases, it releases the next latch in the chain and that thread is granted access. And you don't have to worry about cache coherence traffic or cache invalidation messages, because everybody is spinning on their own location. Right, so this is actually what they use in Linux, and a lot of database systems, at least the more modern ones, use this approach if they have to take a latch. The problem, though, is that this gets more complicated. Say a thread decides it doesn't want to wait anymore, like it times out; you have to remove it from the queue, which means doing a compare-and-swap to fill in the gap and delete it from the linked list. So it's more complicated than the single bit you spin on, but you get the advantage that it helps a lot when you have many cores, because you're not ping-ponging messages back and forth.
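A rough sketch of an MCS-style queue latch, where each thread spins on a flag in its own queue node rather than on one shared location (this is just the textbook algorithm, not the exact Linux or database code):

    #include <atomic>

    struct QNode {
        std::atomic<QNode*> next{nullptr};
        std::atomic<bool>   waiting{true};
    };

    struct MCSLatch {
        std::atomic<QNode*> tail{nullptr};

        void lock(QNode* me) {
            me->next.store(nullptr);
            me->waiting.store(true);
            QNode* prev = tail.exchange(me);   // atomically append to the queue
            if (prev != nullptr) {
                prev->next.store(me);          // link in behind the previous waiter
                while (me->waiting.load()) { } // spin on our OWN node's flag
            }
        }

        void unlock(QNode* me) {
            QNode* succ = me->next.load();
            if (succ == nullptr) {
                QNode* expected = me;
                // Nobody queued behind us: try to swing the tail back to empty.
                if (tail.compare_exchange_strong(expected, nullptr)) return;
                // A thread is mid-enqueue; wait for it to finish linking in.
                while ((succ = me->next.load()) == nullptr) { }
            }
            succ->waiting.store(false);        // hand the latch to the next waiter
        }
    };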
So the last approach is called reader-writer locks, and this is basically the same idea as the shared and exclusive locks you think of in a database system. Rather than our latch having a single bit that everyone tries to flip, we're going to have separate locks that correspond to readers and writers. So within a single latch we'll have a read lock and a write lock, but we're also going to maintain counters for the number of threads that currently hold each of these locks and the number of threads that are currently waiting for them, with separate counters for both types. So the first thread comes along and wants to read, so it tries to acquire the read lock. Nobody else is holding it, so it can do the compare-and-swap and get it right away, and then we update the counter to record that at least one thread is holding the read lock. The next thread comes along, same thing: it wants the read lock, and since that's compatible with somebody else also holding the read lock, it's granted the lock as well and we update the counter. Now a writer comes along and wants to acquire the write lock. We see that the read lock is being held, so it has to stall and wait, and we update the counter that records that a writer is waiting. So now say this next thread wants to acquire the read lock. What should happen with it? What's that? Sorry. Right, he said: why should it have to wait? Exactly right. We see that somebody is waiting for the write lock, so we should stall and wait ourselves, and eventually the readers release the latch, the writer can go, and this avoids starvation for anybody who wants to do reads and writes. So again, this is built on top of the spin locks we talked about before, but there's a little bit of extra management we're doing in our database system to make sure that threads are treated fairly.
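As a hedged sketch, a reader-writer spin latch with the counters described above might look like this; the writers_waiting counter is what keeps readers from starving writers:

    #include <atomic>

    class RWLatch {
        std::atomic<int>  readers{0};          // threads holding the read latch
        std::atomic<int>  writers_waiting{0};  // threads queued for the write latch
        std::atomic<bool> writer{false};       // is the write latch held?
    public:
        void lock_read() {
            for (;;) {
                // Stall while a writer holds the latch or is waiting for it.
                while (writer.load() || writers_waiting.load() > 0) { }
                readers.fetch_add(1);
                // Re-check: if a writer snuck in, back out and retry.
                if (!writer.load() && writers_waiting.load() == 0) return;
                readers.fetch_sub(1);
            }
        }
        void unlock_read() { readers.fetch_sub(1); }

        void lock_write() {
            writers_waiting.fetch_add(1);
            bool expected = false;
            // Wait until there are no readers, then grab the writer flag.
            while (readers.load() != 0 ||
                   !writer.compare_exchange_weak(expected, true)) {
                expected = false;
            }
            writers_waiting.fetch_sub(1);
        }
        void unlock_write() { writer.store(false); }
    };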
And you'll see this more commonly in database systems that support transaction priorities. A lot of times in enterprise systems you can specify that certain transactions are minor background processes and shouldn't be given high priority; I don't care if they get preempted by somebody else. But something more latency-sensitive, like a transaction interacting with an actual customer or another machine, you want to get higher priority so it finishes more quickly. To implement those policies at a high level, you may actually need to consider them in your latching policy as well. So that's basically how you do latching. I don't remember exactly what latching primitives we provide you in Peloton, but we will specify everything in Monday's project announcement. Okay, so given that I just talked about how to do latching, and last class we talked about how to do index locking at the leaf levels, now we're going to ignore all of that and talk about how you can do indexing in OLTP scenarios without having to take any latches at all. The first two, the BW tree and the concurrent skip list, are latch-free indexes that are actually used in production today, and the last one is the ART index, which is a radix tree. As far as I know, it's not latch-free, but it doesn't take that much thinking to figure out how to make it latch-free; we're going to ignore that for now. Okay, so the BW tree came out of the Hekaton project at Microsoft Research. We read the Hekaton paper on MVCC; this is the index they developed for it. And remember I said that in the early days, when the project got started at MSR, they were considering skip lists, and they later decided they could come up with a better index structure that could outperform skip lists, and that's what the BW tree was. So I don't have any prizes to give, but does anybody want to take a guess what BW stands for? Nobody. You might be disappointed when you find out. Big wheels? That's a new one, okay. What's that? Windows, no. In the back. Buzzword, there it is, right? So it's a latch-free index for main-memory databases for many cores and yada yada, right? It's the buzzword tree. I don't know why they couldn't think of a better name, but that's what it is. So the paper you read was from Justin Levandoski; he's at MSR and I think they're doing really awesome work there. So the BW tree is essentially a latch-free B+ tree. They talk about it being a B-link tree; that just means it has some extra sibling pointers. The idea is the same thing we talked about last class. The key thing is that because it's latch-free, no thread is ever going to stall trying to acquire any locks or latches, right? In a typical B+ tree, when you have to do a split or a merge, you have to latch the node you're splitting or merging, plus its parent. In their case, they're not going to do any of that. And the way they achieve this is through two key ideas. The first is that updates to the index never make changes in place to the nodes; instead, they apply deltas on top of those nodes that you then read at runtime to figure out what the correct state of the node should be. This helps them reduce cache invalidation, because if someone has a copy of a page in their local cache and you have the same copy, when you modify it you just put new deltas on top of it; you don't have to invalidate the copy of the page they're holding. The second key idea is that they use a mapping table as a way to have indirect pointers, indirect addresses, for all the pages in the tree. We'll see this in a second: this lets them do a compare-and-swap on a single memory location and not worry about having dangling pointers to pages as they're added and removed. So those are the two key ideas. If you understand how they work, and we'll go through them in more detail, you understand how they're able to achieve the latch-free guarantee. So our mapping table is essentially going to be a hash table that maps a page ID to a physical address. Within the logical structure itself, we maintain logical pointers between the nodes. Every page is given an ID, and from that you can use the mapping table to know where in memory the page is located. So for the inner nodes, the pointers to children and siblings are just logical addresses embedded in the page. Any time I'm at this page and I want to get to page 104, I have to go look up in my mapping table to find where in memory it is.
Again, this indirection is going to allow them to move a page around or resize it, do whatever they want, and not worry about having to update pointers elsewhere in the tree, because there's only a single location that needs to be modified, and they can do a compare-and-swap on it. Remember, we talked last class about why MemSQL doesn't do a bidirectional skip list: it's because that would mean atomically updating two pointers at once, from the previous node and to the next node, and you can't do that atomically without taking a mutex. But you can do a compare-and-swap if there's only one location you have to modify. So let's look at how they do updates. For this, we're going to ignore the rest of the tree and just look at a single leaf page at a time. What happens is that every time you want to modify a key in a page, do an insert, update, or delete, instead of making the change directly in the data structure in the page itself, you add a delta above it. So in this case we have a delta record in memory that says we want to insert key 50. This delta record has a physical pointer to the base page that it's modifying: not a logical pointer, an actual physical pointer. Then we do a compare-and-swap so that the page's address in the mapping table, instead of pointing to the base page, now points to our delta record. So in this case, for page 102, you do a compare-and-swap on its physical address entry, which again is that single instruction, and it's really fast. And now, any time someone is doing a search or traversing the tree and wants to look at page 102, they land on the delta record instead of the base page. Inside the delta record there's a little bit in the header that says "I'm a delta record, not a regular page," so the thread knows how to interpret it and figure out what it should be doing. The reason we can store the direct memory address here is that we're going to treat a page plus all its deltas as a single logical unit, a single atomic unit. It's never going to be the case that the base page gets swapped out and the delta points to nothing, because we treat the delta chain and its base page all at once, as one unit. Then the next operation comes along, say we want to do a delete on key 48. Again, we create a delta, do a compare-and-swap, and now this delta is the beginning of the chain. So now let's do a search; again, we're ignoring the traversal of the upper levels of the tree. We just land at a leaf node through the logical pointers and end up at the head of the first delta chain. What happens is we go through the delta records one by one, and if we see that the key we're looking to access or modify is in our delta chain, we know we've found the thing we're looking for, and we can ignore anything that comes further down the chain. For example, if I'm looking for key 50: I start at the head, I don't see it, I move to the next record, and now I see "insert 50". I don't care what comes below this, because at this point in time this is the newest version of the page, and my record 50 is there. It doesn't matter if there's a "delete 50" further down the chain; as soon as I find it, I know I'm done.
Then, if I follow my chain and get to the base page and see that there wasn't a delta record for the key I want, I just do a regular binary search within the sorted key array inside the page, just as you would in a regular B+ tree, right? So the compare-and-swap is very important, because it's going to allow us to have concurrent operations without taking any heavy latches, anything like a mutex. Say we have two threads trying to install updates to our page at the same time. Since "insert 50" is the current head of the delta chain, both of them create delta records with physical pointers to it. Now they both do a compare-and-swap on the physical address in the mapping table, and only one of them can succeed. At the exact same time they try to overwrite the physical address for page 102 with their own delta's address, and because it's an atomic instruction in the CPU, only one of them succeeds: one gets true, one gets false. So when a thread recognizes that it did not succeed in the overwrite, it has to undo and try again. And what policy you use when you fail to update the physical address depends on what kind of operation you're doing. We'll see in a second that for consolidation, if we failed because somebody else appended to the delta chain, then maybe we just throw away what we've done and somebody else will try it. But in the case of an operation on keys, like an insert, we would just change our delta's pointer to point at the new head, do the compare-and-swap again, and see whether we get to be the front of the delta chain, okay? So there are basically three types of delta records in a BW tree. We've already talked about the record update ones: insert, update, and delete. Then there's another type for structural modifications to the tree itself: we can do splits and merges, again by appending deltas to the chain. There's a third type in the paper that has to do with writing the index pages themselves out to a flash drive or SSD; they use a log-structured store on disk for these changes. We're going to ignore all of that, because everything's in main memory for us, but everything we're talking about still works even though we're not durable to disk. Remember I said that in an in-memory database system, you typically don't write the indexes out to disk, because it's a waste of time. You assume your node is always going to be running, and you use replication and other high-availability techniques to have a backup, so that if your master node goes down, another node can pick up right where it left off. So you don't have to spend the time rebuilding the index, and you don't have to spend time at runtime writing things out to disk to make them durable. So we can ignore all the log-structured stuff in the paper and everything still works.
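A minimal sketch of that install-and-retry loop, assuming a mapping table implemented as an array of atomic pointers (all the names here are made up):

    #include <atomic>
    #include <cstdint>

    constexpr size_t kMaxPages = 1 << 20;

    struct DeltaRecord {
        int   op;      // e.g., kInsert / kDelete
        int   key;
        void* next;    // PHYSICAL pointer to the rest of the delta chain
    };

    // Mapping table: page id -> physical address of the chain head.
    std::atomic<void*> mapping_table[kMaxPages];

    void InstallDelta(uint64_t pid, DeltaRecord* delta) {
        for (;;) {
            void* head = mapping_table[pid].load();
            delta->next = head;   // point at whatever the current head is
            // Only one thread can swing the pointer; a loser re-reads the
            // new head and tries again.
            if (mapping_table[pid].compare_exchange_strong(head, delta)) {
                return;
            }
        }
    }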
So now let's look at how we do consolidation. The problem with these delta chains is that we keep appending to them as we modify the page, and eventually they get really, really long. That makes our search times increase, because we have to look through the delta records one by one and maybe eventually get to the base node. So what we want to do is create a new page that consolidates all of the deltas, applying them to the base page, and have that be the new representation of the page. So in this case, say we make a copy of page 102, and then we take all the deltas in our chain, in order from bottom to top, and apply them to our copy. Once that's done, we do another compare-and-swap in our mapping table to point to the new page. If we're successful, then we know that in the time it took to apply all the deltas to the new page, nobody else added a new one, so we know everything's consistent; we don't lose any updates when we install it. And at this point, if the compare-and-swap is successful, we can mark the old page to be deleted later on. They have a garbage collection mechanism, which we'll talk about now, that makes sure nobody else is reading this memory as you reclaim it. All right, so the way they do garbage collection is that they organize operations into quanta called epochs. The basic idea is that any time you want to do an operation in the index, you join an epoch, saying "I'm a thread that's operating in the index at this time." And any time you determine that an object in the tree can be deleted, you add it to that epoch's list of objects that can be deleted. The idea is that when all the threads in an epoch are no longer running in the index, you know that anything marked for deletion in that epoch can be safely deleted, because nobody is still around who could possibly access it, right? So it looks like this. This is our example from before, where we created the new consolidated page. At the same time as we're applying our updates, we have a thread on CPU 1 traversing the index along this chain. It adds itself to the epoch table: "I'm a thread running, here's what I'm doing." And then we have another thread that's actually doing the consolidation, and it adds itself to the epoch table as well. So now we do our compare-and-swap, and the consolidated page is the new page. Therefore, we know that everything in the old node, including all its deltas, can be reclaimed as soon as all the threads that might be accessing it are done. Because what you don't want is for the consolidating thread to delete it immediately: then the traversing thread would follow that physical pointer and end up in garbage, end up nowhere. So there's a barrier to make sure nobody is left hanging around with dangling pointers. So the consolidating thread finishes and gets removed from the epoch table. Eventually, CPU 1 finishes and is removed from the epoch table, and now we know no other thread could be anywhere in that part of the index. Now it's safe for us to reclaim the space: we delete the old delta chain, maybe put it back in a memory pool, and reuse it later on. Pretty simple, pretty easy to do. And you can implement all of this with latch-free data structures inside the epoch mechanism as well.
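Here's a very small sketch of that epoch mechanism, assuming a fixed pool of worker threads; a real implementation (including the BW tree's) would use per-thread garbage lists and a latch-free epoch table, but this shows the idea:

    #include <algorithm>
    #include <atomic>
    #include <cstdint>
    #include <vector>

    constexpr int      kMaxThreads = 8;
    constexpr uint64_t kIdle       = UINT64_MAX;

    std::atomic<uint64_t> global_epoch{1};
    std::atomic<uint64_t> thread_epoch[kMaxThreads];  // kIdle when not in the index

    struct Retired { void* ptr; uint64_t epoch; };
    std::vector<Retired> retired;  // NOT thread-safe; assume one GC thread owns it

    void Init()       { for (auto& e : thread_epoch) e.store(kIdle); }
    void Enter(int t) { thread_epoch[t].store(global_epoch.load()); }
    void Leave(int t) { thread_epoch[t].store(kIdle); }

    // Instead of freeing a replaced page or chain, tag it with the current epoch.
    void Retire(void* ptr) { retired.push_back({ptr, global_epoch.load()}); }

    void Reclaim() {
        uint64_t oldest = global_epoch.fetch_add(1);  // advance the epoch
        for (auto& e : thread_epoch)
            oldest = std::min(oldest, e.load());
        // Safe to free every Retired entry with epoch < oldest, because no
        // registered thread could still hold a pointer into it.
    }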
You could use a concurrent skip list to keep track of the list of threads that are running. Okay, so the thing that I find, sorry, yes. Yes. Correct. So his question is: say all the CPUs have page 102 in their cache, we're not making modifications to it, we're appending deltas over here, and then eventually the old page gets blown away, and now every CPU has to fetch a copy of the new page 102. Yes, that's true. But you can't avoid that, right? The difference is that for all these other updates in the delta chain, I didn't have to invalidate anybody; maybe I just go read a small delta record and bring that into my local CPU cache. That's much smaller than taking the whole 4K page and having it invalidated everywhere. Question: say it again, I missed it. Correct, so you do have to traverse all those new entries in the chain. Right, so obviously there's a trade-off in how long you let the delta chain grow before you consolidate it, because if the delta chain gets too long, you spend all your time traversing it for every single operation. But think about what a traditional B+ tree is going to do: any time you modify something, you have to keep the array inside the node sorted, and that's definitely going to be moving more bits around than appending a delta. It's a good point, though. Yeah, I don't want to claim that cache invalidation goes away entirely; it's just significantly reduced compared to a B+ tree. You don't understand that point? So the point is that, yes, when this page gets consolidated, everyone who was looking at it has the old version thrown out of their CPU cache and has to go grab the new one. But what I'm saying is that that's going to be fewer cache invalidations, fewer memory reads, than if every single update had to modify the sorted array inside the base page, where every one of those would be a cache invalidation, over and over again. Here, an update is just swinging a single pointer in a linked list, which is a minor thing, and the one delta record everyone then reads is probably going to fit in their CPU cache as well. I don't have graphs on this, but they measured the number of cache invalidations and it's significantly lower. Yes? What if you're doing tons of operations on those pages, inserts and deletes, and the consolidation phase is long-lived? Then there's quite a large probability that the compare-and-swap will fail. Yes. So his question is: say this is a hot page, this is Justin Bieber on Twitter or whoever the hot person is now, and everyone's hitting like, like, like, like, so there are all these delta chains building up behind it. And then I try to do a consolidation, and if it takes a long time, the likelihood is that someone will come along and invalidate me: I won't be able to do the compare-and-swap because the mapping table now points to a new delta record, and therefore I did all this work and wasted it, right? Yeah, it's just a trade-off. There's no free lunch. I don't know exactly what they do, but you can imagine recognizing what happened and pointing the new deltas at your consolidated page. I think what they do in the paper is that they just have a threshold,
so all the threads will do the same thing before they try to insert something. Yeah, so basically, I'll get to that. So his comment was: let's say I'm doing the consolidation, and when I started, the head of the chain was "delete 48". So I do the consolidation, and when I try the compare-and-swap, I fail because "insert 55" got put in. Couldn't I just recognize that insert 55 came in after I started my consolidation, apply it, and try the compare-and-swap again, this time against the pointer to insert 55? Yeah, correct. The pointer was at 48 when we started the consolidation, so we have 48, 50, and page 102 applied in here. Now we do the compare-and-swap and see that the mapping table points to 55, a new delta record we missed. So you could start there, go back up and apply those additional changes, and then try the compare-and-swap again. You could do that; I think that might work. What they actually do in the paper is that each thread notices how long the chain is, and if it's above some threshold, any thread that comes along and starts reading the chain will try to do the consolidation. So you could have all 20 cores, 20 threads, all trying to do the consolidation at the same time; one of them will succeed, and everyone else backs off and throws their work away. But again, the key thing is that you don't need a mutex to make this work. And I think your approach might actually work as well. But correct, yes: you could have a little blip at some point in time where everyone tries to do this. Yes. Okay, any other questions? Compare-and-swap is a really powerful construct that allows us to make all of this work. So let's spend more time now talking about structure modifications, because this is one of the key advantages the BW tree gives you over a B+ tree. The first thing to point out is that all the page sizes are elastic. If something gets too big, like if your keys won't fit in the page, then when we do the consolidation we can just allocate more memory, right? So that's not a big deal, and it gives us more flexibility in deciding when to actually split. So the BW tree supports the standard half-split: I have four keys, I want to split them two per page, and you can do this without any latching. It's really easy. The basic idea is that we add a new delta record above the page we're splitting, to say that a split has occurred, and then we add new separator delta records at the higher levels of the tree to provide guidance to anybody searching the tree about where the split occurred, so they can shortcut past the split record and jump right to the page they're looking for. So I spent a lot of time yesterday trying to make this diagram make sense, so I apologize: there are going to be a lot of lines and it's sort of complicated with these arrows, but I'll go through it step by step and point out what's going on, to show an example of how to do a split. For simplicity, I'm only showing one direction of the sibling pointers in the BW tree leaves. There are obviously pointers in the other direction, but nothing fundamentally changes, because we're using logical pointers and that hides all the complexity. All right, so we have a two-level BW tree.
We have a root node and three children, three leaf nodes. The first thing we do is lay out our keys across these nodes: page 102 has keys one and two, page 103 has three, four, five, and six, and page 104 has seven and eight. What we want to do is split the middle page, putting half its keys on a new page and keeping the other half where they are. So we'll keep three and four here, and we'll put five and six on a new page. The first step is just to allocate a new page, and since we know it's going to go in between 103 and 104, we can set its next sibling to be a logical pointer to 104. Then we update our page table, our mapping table, to point to the new page. The next step is to add a split delta record above page 103 to tell anybody who comes along later that the split has occurred. The split delta has two pointers, and we're going to end up logically invalidating keys five and six in the old page. First, it has a physical pointer to the base page we're splitting; I'm ignoring that there could be other delta records already there, but the idea is the same: you just append it to the front of the chain. And then it also has a logical pointer that says: for keys five and six, you want to go to this new page, okay? So now we need to update the physical address for page 103 in the mapping table to point to the split delta instead of the base page, and that's just done with a compare-and-swap like before. What this does automatically is fix up what all the logical pointers in our index resolve to. Before, our root node had separators saying that keys three through six go to this page, and page 102 knew that its sibling was located here. Now, by doing the compare-and-swap on the physical address, we end up automatically updating the logical pointers, and we don't have to do a separate compare-and-swap for any of them: that happens by the nature of modifying the mapping table. So now anybody that traverses the tree comes along, sees the split message, and says: if I'm looking for key five, I know I want to go down here and not here, right? Five and six are still physically stored in the old page, because we're not doing in-place updates; it's just that nobody will find them. If you do a traversal along the leaf nodes, going one, two, three, four, five, you follow the logical pointer and end up hitting the split delta as well. All right, so no one is able to read those old keys at this point. The next thing we do is install a separator delta record at the top, and this updates the ranges specified by the root node, right? So here we have a pointer saying negative infinity up to three goes this way, three to seven should now point at our split delta, and seven to infinity points over here. What the separator delta record basically gives you is a shortcut that says: if you want keys five to seven exclusive, go directly to this page down here. We don't need this for correctness, because if anybody follows the old path looking for key five, they'll still hit the split delta and follow it down to our new page. It's just to make search go faster.
And of course, now we have to do the compare-and-swap on the physical pointer for page 101 to point to our new separator delta. So, is it clear how this works? Again, all of this indirection through the mapping table makes everything end up consistent and correct, and we don't have any dangling pointers. And it's really fast to do, using compare-and-swap. All right, so these are some performance numbers from the paper. They didn't actually include the skip list numbers for two of the workloads, so I emailed Justin last night and he sent me the full results; I don't know why they excluded the dedup and Xbox workloads in the paper. Here we're comparing the BW tree against a B+ tree and a concurrent skip list. The B+ tree was based on BerkeleyDB; BerkeleyDB was one of the first embedded storage managers, from the 1990s. So it's doing concurrency control, but I don't think it's writing anything out to disk here; everything's in memory. And they implemented their own version of a skip list for this. What you see for the first workload, which is based on a trace of Xbox Live gets and sets, and I think it's primarily gets over sets, something like seven to one: the BW tree is doing about 10 million operations a second, whereas the B+ tree is about 560,000 and the skip list around 423,000. For the synthetic workload and the deduplication workload, the magnitude of the performance is much lower, and I think that's because there are more writes going on, but you still see the BW tree outperforming the two other indexes, right? So this is part of the reason the Hekaton guys gave up on skip lists: they came up with something that was much better. As far as I know, no other database system actually implements this; I don't know whether it's for patent reasons. The BW tree is not open source, but they've been talking about how they want to make it open source, which I think would be kind of cool. So, any questions about BW trees? This is one of the data structures you can choose to implement for project number two. Okay, so now let's talk about skip lists again. Last class we were kind of hand-wavy about the concurrency side of things, but now we want to look at specific examples of how you do latch-free skip lists. The key idea is that we're going to do everything lazily. When we do deletions, it's safe to keep things around and not worry about corrupting the structure of the index, because we can mark items as deleted, and any keys still in the data structure just end up being guide posts, like the keys you keep in the upper levels of a B+ tree to tell you which direction to go. So the first operation we'll look at is an insert. This is the same example we had last class: we want to insert key five, and we know we want to put it in this slot here. The first thing we do is create the entries for this new key at all the levels it should appear in. Remember, we flip a coin, and the number of heads we flip before we get tails corresponds to how many levels we include it in. Let's assume that in this case we got three heads, so we'll put it in three levels.
Notice that we create the new key's tower in our skip list while the surrounding entries still point around it: key four doesn't know about key five yet, so it's not pointing to it and still points to key six. Now we go from the bottom up and add our entries one by one. For this, we can do the same kind of compare-and-swap we were doing on the BW tree's mapping table, to update this pointer to point to our new element. And because we're going from the bottom up, it doesn't matter that the upper levels don't know about key five yet. If someone comes along at an upper level and asks, "do I have key five?", they might end up at the end marker, and that's okay, because we're not doing this in the context of a transaction; phantom avoidance and things like that are what the index locking we talked about last time would give you. Here, we just care that the structure of the index is correct. Then we go up to the next level and add the entry with a compare-and-swap, and then the next one with a compare-and-swap, as before. So at this point, our key is fully integrated into the index. We didn't have to take any mutexes or heavyweight locks; we did this with simple compare-and-swaps, and we don't need the indirection table that we had in the BW tree. So now let's look at how you do deletes. Say we want to delete key five. To make this work, we add a flag to every node in the bottom layer of the skip list that says whether the element has been deleted or not, and we can use the atomic boolean we saw before, right? A single bit we can compare-and-swap. So we want to delete key five: we do a traversal to find it, and then we flip the bit to say that this element is deleted. Now if anybody comes along and does a scan along the bottom layer, they check whether each element is deleted, and if so they just ignore it and keep going. So it doesn't matter that other nodes still point to it; we know nobody will return it. Next, we need to go and update all the pointers to route around it, so that nobody can even find it. I don't think there's an issue going bottom-up versus top-down; all the literature I've seen says you want to go top-down, though I haven't fully thought through whether everything is safe the other way. We basically go level by level, doing a compare-and-swap to reroute each pointer, all the way down. And then, once you know nothing else points to it, you still have to do some bookkeeping to make sure no thread is sitting on this key before you start throwing things away; you can use the same epoch mechanism the BW tree does. But once we know we're safe, we can just delete everything, and now we're back to where we were before we inserted it.
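A sketch of the two primitives just described, for one level of the skip list (the full insert just repeats the link step bottom-up for each level of the tower; the names are made up):

    #include <atomic>

    struct Node {
        int key;
        std::atomic<Node*> next{nullptr};
        std::atomic<bool>  deleted{false};  // logical-delete flag, bottom level only
    };

    // Link node in between pred and succ with a single compare-and-swap.
    // If some other thread changed pred->next first, the CAS fails and the
    // caller re-seeks and retries.
    bool LinkAfter(Node* pred, Node* succ, Node* node) {
        node->next.store(succ);
        return pred->next.compare_exchange_strong(succ, node);
    }

    // Flip the deleted bit; scans that see it set just skip over the node.
    bool LogicallyDelete(Node* node) {
        bool expected = false;
        return node->deleted.compare_exchange_strong(expected, true);
    }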
Yes? [How can you guarantee there's no key inserted between five and six while you're doing all these operations?] So say you're here, right? Is this before you mark the node as deleted, or after? Yeah. So the question is: say I have key 4.5 and I'm at this point here. I would do a seek, find four, and see that the thing that comes after it is five, so this is where I want to go. I would do a compare-and-swap on four's pointer, and either I succeed, because I was able to change it before the deleting thread did, or I fail, because the deleter finished its compare-and-swap first and pointed four at six. If I failed, I would just try again, right? You probably have to do some extra checking, in case four now points to some key 4.25 that snuck in and I need to come after that. It's a little extra bookkeeping, but it's all compare-and-swap; you don't have to take a mutex to do anything. And again, this works because we're not maintaining pointers in the other direction: if you had pointers in the other direction, then to do this update you would also need to atomically update the back-pointer at the same time. Is this clear? Okay. So now, last class I tried to describe how you do a reverse search if you don't have pointers in the other direction, and I thought I was describing what MemSQL does. They had this blog post, and someone posted on Piazza saying the blog post is not clear, and then my explanation was not clear either. I actually emailed the VP of engineering at MemSQL to get clarification on what the hell the blog post is actually trying to say, because I thought I understood it, but I guess I didn't. And he said he was in India and couldn't send me details right now because the internet was bad, I don't know, whatever. But what I did find was another algorithm for doing a reverse search on a single-direction skip list, and it's basically what whoever posted on Piazza suggested: what if you use a stack? That's one way to do it. Now, the MemSQL guy says they don't use a stack at all, but what I'm describing to you now is more or less what they're doing; they maybe spend some more work doing additional searches. So the way it works: say we want the key range K2 to K4 inclusive, but in reverse of the order the skip list is sorted by. We start at the beginning and do a seek just as if we were looking for a key, but instead of seeking to the first key we'd return, the upper bound, we find the lower bound, the lowest value. So we're looking for K2 here. We start at the first tower: K2 is less than the key there, K5, so we don't follow it. Then we see K2 equals K2, so we know we want to drop down to it. Now, since we know we're at the starting point of the range we're looking for, we maintain a stack. We look here and see K2 is less than K4, so we don't go across, we go down. Now we're at the bottom level, and we keep track in the stack of all the nodes we visit, up until we hit the upper bound of our range; then we can pop the stack and examine those leaf nodes one by one, which gives us the reverse order we're looking for. So in this case, we see K2 and push it, come to K3 and push it, then we see K4, and we know we're done because that's the upper bound. Then we take our stack and pop it, looking at the entries one by one, and that gets us what we wanted. All right, so it's a little extra memory to maintain the stack, instead of the global back-pointers you'd need in the MemSQL route, but I think this is good enough. They claim they can do better by paying a little extra CPU cost, but I don't know exactly what they're doing.
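A sketch of that stack-based reverse scan, reusing the Node struct from the insert example (low_node is assumed to be the bottom-level node you landed on after seeking the lower bound):

    #include <stack>
    #include <vector>

    std::vector<int> ReverseScan(Node* low_node, int high_key) {
        std::stack<Node*> path;
        // Walk the bottom level forward, pushing every live node in range.
        for (Node* n = low_node; n != nullptr && n->key <= high_key;
             n = n->next.load()) {
            if (!n->deleted.load()) path.push(n);
        }
        // Popping the stack visits the keys in descending order.
        std::vector<int> out;
        while (!path.empty()) {
            out.push_back(path.top()->key);
            path.pop();
        }
        return out;
    }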
So is this clear? [The stack makes it much easier.] I agree. Okay, so a few minutes left. I want to spend a little bit of time now talking about the adaptive radix tree from the HyPer guys. So the radix tree is not latch-free, as far as I know. What it's designed to be is an index that uses less memory, but without paying the huge pointer-chasing penalty of the T tree that we saw last time. A radix tree is a special case of a trie, and we'll look at a trie in a second. The basic idea is that instead of storing whole keys throughout the index, we store prefixes of them in the tree structure itself, and we use those prefixes to figure out where the record we're looking for is. We can do quick comparisons by going through the key we're looking for one digit at a time. And I say digit not meaning a number, but one position in the key, whether it's a string, a date, or whatever; we look at it one position at a time. So radix trees are kind of interesting because they have properties that are different from what we'd expect in a B+ tree. The first is that the height of the tree depends not on the number of elements, as it would in a B+ tree, but on the length of the keys themselves. If I have a single key a million digits long, the height of my tree is going to be a million; that's assuming I'm using a plain trie, not a compressed radix tree. It also doesn't require significant rebalancing. You can do things like coalesce nodes and collapse them to reduce the height of the tree in the radix case, but you don't need to do splits and merges, and take latches for them, the way you would in a B+ tree. And the last thing that's interesting is that the key itself can be reconstructed from the edges of the tree. A single key is not stored in its entirety anywhere in the tree; it lives along the edges, and if we ever want it back, we just concatenate the edges together. So let me show you an example. This is an example of a trie, and this is an example of a radix tree. Say we have three keys: hello, hat, and have. In a trie, the way it works is that you don't store the digits of the key in the actual nodes (in practice you do, but conceptually you represent them as the edges). So we have the root of our tree, and the first edge carries the character H. That's because all the keys in our tree start with the letter H, so this letter can be shared across all of them. Going down, the path to each leaf node, which has a pointer to the element we want, spells out the letters of each key: H, E, L, L, O for hello. And for have and hat, they both share A in the second position, so it's a single edge, and they split off below it, one branch for each. A radix tree is a compressed version of a trie, because you don't need to store individual letters on separate edges. If you know that a portion of the key is not shared with any other key, you just store it in its entirety on one edge. In the case of hello, E-L-L-O is not shared with any other key, so it's a single edge holding all of it. That makes traversal much, much quicker, and you use less memory to store the same information.
So the way you actually implement this: conceptually you represent the key prefixes as the edges themselves, but in practice you implement nodes, which is where you actually store the data. So in this case we have the same keys in our radix tree, and the leaves just point to the object we're looking for. To do an insertion, say we want to insert the word "hair": all we need to do is traverse the prefix, find the lowest point where we can attach the remaining portion of our key, and insert a new record there. What makes radix trees kind of tricky is that you don't want to be allocating new nodes all the time. It could be that the thing we're trying to insert, hair, had a hundred characters afterwards and therefore wouldn't fit in this node, and we don't want to do a malloc or realloc every single time this happens. So in the HyPer system they actually use different predefined node sizes, and leave a little extra space when they allocate a node, so that if a key comes along that's bigger than we expected, there's room to put it in. Now say we want to delete hat and have. Again, we traverse, find these two elements, and delete them, and now this node just has "ir" in it by itself, so we can coalesce it and collapse everything back up here. I don't know; it's not clear whether you can do this atomically without latches. They're very vague in the paper about how to do this and say it's future work, but I believe it's possible. You could possibly also use an indirection map, an indirection table like in the BW tree, to make this all work, right? So in the paper that I listed on the website, it's not latch-free, and they don't actually report how much less space the prefix tree uses compared to a B+ tree, which is kind of unfortunate, but it doesn't take much imagination to see that if you have a lot of string keys with a lot of overlapping values, this would be much smaller than a B+ tree. Again, this is mostly just to keep you aware that these other indexes are out there, and they have some interesting properties. Another tricky thing about the radix tree is that it's easy to see how prefixes work when you have strings, but how do we do this for other data types? There are four types we need to handle, and we have to do a little transformation before we can put keys in the tree, so that they're comparable in a fixed byte order. For unsigned integers, the thing we have to be mindful of is that on a little-endian machine we have to flip the byte order, so that the most significant byte is the one we compare first; otherwise the comparison is reversed and it wouldn't work. For signed integers, what you have to do is flip the sign bit of the two's-complement representation, because otherwise negative numbers would appear to be larger than positive numbers. If you just flip that bit, everything works just as if they were unsigned integers. For floats, it's a little bit more complicated.
It depends on the machine, I guess on the IEEE 754 format that determines how you represent a float, but you basically classify a float into different groups, like negative versus positive, normalized versus denormalized, and each of these groups has a defined ordering: anything in the negative group is less than anything in a positive group, and so on. That allows you to come up with a well-defined comparison protocol for floats, and you don't have to do much real transformation of the number itself. For compound keys, it's pretty simple: you just apply the same techniques to each attribute separately, and you can do comparisons as if the byte strings were concatenated together. If I have two unsigned integers, I can just compare the first one followed by the second one, and everything just works. All right, so in a radix tree there's some extra work you have to do to make sure everything is byte-comparable, so that you can then share the prefixes and get the better compression. This is all very hand-wavy here, because I don't think it's something you'd want to implement for project number two, but I think it's a cool idea that you can get the lower memory usage without paying that big penalty of chasing every single pointer as you did in the T tree.
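For the integer cases, a sketch of the transformation looks like this (assuming a little-endian machine and GCC/Clang's byte-swap builtin); the resulting bytes can be compared most-significant-byte first, which is what the radix tree needs:

    #include <cstdint>

    // Signed 64-bit key -> memcmp-comparable big-endian bytes.
    uint64_t ToRadixKey(int64_t key) {
        // Flipping the sign bit maps the signed range onto the unsigned range
        // in order: INT64_MIN -> 0x00...0, -1 -> 0x7F...F, 0 -> 0x80...0, etc.
        uint64_t u = static_cast<uint64_t>(key) ^ (1ULL << 63);
        // Byte-swap so the most significant byte comes first on little-endian.
        return __builtin_bswap64(u);
    }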
Okay, so what are my parting thoughts? I think the BW tree is probably one of the most interesting data structures in databases to have come out in the last five, six, seven years. I've read the paper, I've seen the experiments, I've talked to people, and I have trouble finding what the problem with it is. Yes, you have to do some extra work to deal with phantoms, but in Hekaton, that's why they just re-execute the scans and reads again to check for phantoms; you don't embed anything heavyweight inside the index itself to check for them, whereas in the B+ tree stuff we talked about last time, you maintain those hierarchical locks inside the index to figure things out. It is a bit more complicated to implement, and you have to make sure you get the ordering of the compare-and-swap operations on the mapping table correct, because otherwise you can end up pointing to nothing. Skip lists, on the other hand, I think are one of the easiest data structures to implement. It's only a couple hundred lines of code, it's really simple to get up and running, and it's not hard to make it concurrent. The standard skip list that I showed you here can be slow in practice, and there are other, more modern variants that are deterministic or use paging mechanisms, sort of like the mapping table in the BW tree, to overcome these challenges, but I still don't think they're going to perform as well as the BW tree. All right, so for next class we're going to read another paper from Microsoft Research. You may be wondering why we keep reading Microsoft Research papers. Again, I think their database group is doing some really awesome stuff. When you look at the people they have there, these are people that have been in databases for a long time and built a lot of the stuff that we're learning about, right? Microsoft Research eventually picked up a lot of the people who fled the other research labs as they went under. Paul Larson was a professor at Waterloo, and Dave Lomet was at DEC when it folded, when Compaq bought them. They had Jim Gray, the Turing Award winner. Now they have Donald Kossmann, although he's from Europe, from ETH. They have some really awesome people, and I find their papers are actually really easy to read. If you've read the BW tree paper, try reading the radix tree paper from the HyPer guys; it's not the same. So that's why: I think their papers are easy to read and they explain things very clearly. All right, so next class we'll spend the last 15 or 20 minutes talking about the last project and what's expected, and then you have a month to do it, okay? Any questions? All right guys, see you next time.