So, real quick: everyone who's enrolled in the class should have gotten an email saying you now have an account on something called Narwhal. Did everyone get that email from the PDL, or no? No one got it, okay. I put your names in to get accounts on the cluster we're going to make available to you to test, debug, and do performance testing for project number two. I sent in the roster with everyone's Andrew ID, so I'll check to see what's going on. You should get an email saying you have an account, and then I'll put a post on Piazza saying here's where you actually go to log in and use those machines, okay.

All right, so for today's agenda: we're going to talk about another latch-free index. Remember, on Tuesday we talked about the latch-free concurrent skip list. Today we're going to talk about the Bw-Tree from Microsoft, and then the ART index from the HyPer guys, which is not actually latch-free, but it has other interesting properties that I think are worth looking at. And then we'll finish up with a quick crash course on how to do profiling in Peloton. Again, you'll need this for project two to figure out why your index could possibly be running slow, okay.

So, an observation from last class: when we talked about the concurrent skip list, we said we couldn't have reverse or backwards pointers along the lowest level of the index, because we're not able to do atomic updates to insert or delete nodes using compare-and-swap. We'd have to touch two memory addresses, right? Remember, compare-and-swap is: you give it one memory location, and you can flip the bits there if the contents match what you expect to be there. But if you add reverse pointers, now you have to update two locations, the pointers in two different nodes, and you can't do that atomically with compare-and-swap. The only way to do that and guarantee atomicity would be to take a latch, which would defeat the purpose of a latch-free index.

So the Bw-Tree we're going to look at now solves this problem by storing pointers in a different way. The Bw-Tree is a latch-free B+tree-like index where threads never have to set latches or block on each other in order to do lookups, inserts, updates, or deletes. The history is that this was developed by Microsoft in the early 2010s, built for the Hekaton project, the Hekaton engine, which we talked about before: a specialized in-memory engine for OLTP workloads that Microsoft built inside of SQL Server. So instead of throwing away SQL Server entirely, you had this separate OLTP-optimized engine called Hekaton that could live nicely inside the SQL Server ecosystem. The outside still looked like regular SQL Server, but internally it was using the Hekaton engine.
And how the project actually got started is relevant to the discussion we had about skip lists last week. When they started the Hekaton project in 2008, they originally decided they were going to go with a skip list as the primary index for Hekaton, right? Because again, it's a latch-free index with nice properties that were relevant to what Microsoft was trying to do with Hekaton. But about halfway through the project, they realized the skip list didn't get the kind of scalability they wanted. So they abandoned the skip list and built the Bw-Tree instead.

Now, what's interesting about this, as I said last class, is that the only database system I know of that uses a skip list as its primary index is MemSQL. And the reason the MemSQL people use skip lists is that one of the co-founders was at Microsoft at the time they were building Hekaton, and he saw all these internal Hekaton talks saying, look how great skip lists are. So when he went off to San Francisco and formed his company, they went all in on skip lists, right? Because again, they have nice properties. But he missed the second half of the talks from the Microsoft guys, where they came back and said, well, the skip list is actually not what you want; you want the Bw-Tree. So that's probably why MemSQL is so big on skip lists.

So there are two key ideas in the Bw-Tree that allow it to be latch-free and avoid that reverse-pointer problem we saw in the skip list. The first idea is that we only record changes to the index as delta records. That means you're never going to update a node or page in the index in place. You're never going to take an existing page in memory and overwrite or shuffle it to add or delete a key. Instead, you append delta records to it that say, here's the change that occurred. Think of this like a log-structured organization. And the advantage we get from this is that it reduces cache invalidation, because every time there's a change to a node or page in the index, we're just appending one delta record; we're not reshuffling or moving everything around inside the leaf node.

The other key idea is a mapping table that provides indirection between logical pointers inside the index and physical pointers in memory. And because this indirection information is stored in the mapping table, we can do a compare-and-swap on a mapping table entry that effectively updates all the pointers throughout the entire index, without having to update them one by one or latch for them. We update one location in the mapping table, and that propagates to every node in the index that may be using that pointer.

So I'll go through each of these one by one, but the mapping table is the key idea. For this, let's do a really simple example of a three-node, two-level Bw-Tree.
So the first thing I'll point out is this auxiliary data structure on the side, the mapping table. It's a mapping from page IDs to physical addresses. Every index page (and I'll use "page" and "node" interchangeably; they're the same thing) is assigned a unique identifier, and in the mapping table we have the physical pointer to where that page exists in memory. In this diagram, the solid lines are physical pointers to physical memory locations, and the red dotted lines are logical pointers. So in addition to each page having a page ID and a physical pointer that tells you where to go, internally we also have logical pointers, which are just the page IDs of a node's children and its siblings. In this case, the parent page 101 says its two children are 102 and 104. That means that to traverse this index and find my child, I do a lookup in the mapping table, and that gives me the physical address of where I need to go. Right? This is different from the skip list we saw last time, because the skip list embedded physical pointers directly inside each node or page of the index, whereas here we just store the page ID and we know how to go look it up to find the thing we're looking for. Is this clear? This is the key idea of the mapping table.

So let's see now how the delta records work. To keep it super simple, say we have a single page in our index, page 102, and of course the mapping table has an entry pointing to the physical address where it's located. When a thread comes along and wants to modify a key or key range maintained by this page, whether it's an insert, update, or delete, it doesn't matter, instead of actually modifying the page itself, it appends a delta record. The delta record specifies the operation the thread performed that modified this page. In this case, I want to insert a key-value pair with key 50, so I'd have a delta record saying this is an insert and the change is to key 50. The delta record has a physical pointer to the base page, not a logical pointer. And the reason it can be a physical pointer is that we treat the delta chain plus its base page as a single atomic unit: at no point is the base page going to get moved around independently of its delta records. The page owns these delta records, so it knows that pointer is always going to be valid.

So at this point we've created the delta record at the front of the chain for this base page, but no other thread can see it, right? Because any time someone looks up page 102, they go through the mapping table and land at the base page; they don't know about our delta. So now we do a compare-and-swap on the mapping table to install the delta record's address as the new physical location of page 102. Again, we just do the compare-and-swap; if we succeed, the mapping table now points to the delta record, and the change becomes visible to every other thread.
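To make the mapping table idea concrete, here's a minimal C++ sketch. This is illustrative only, not the actual Hekaton or Peloton code; all the type and function names here are made up, and the node layouts are elided.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

struct Node {};  // a base page or a delta record (layouts elided)

// A delta record prepended to a page's chain, e.g. for "insert key 50".
struct DeltaInsert : Node {
  uint64_t key;
  uint64_t value;
  Node *next;  // physical pointer to the previous head of the chain
};

class MappingTable {
 public:
  // Logical-to-physical translation: pid -> current head of the chain.
  Node *Get(uint64_t pid) { return table_[pid].load(); }

  // Install a new head for page `pid`. `expected` is the head we saw
  // when we built our delta record; if another thread won the race,
  // the CAS fails and the caller retries (or aborts, for splits).
  bool InstallDelta(uint64_t pid, Node *expected, Node *new_head) {
    return table_[pid].compare_exchange_strong(expected, new_head);
  }

 private:
  static constexpr size_t kMaxPages = 1 << 20;
  std::atomic<Node *> table_[kMaxPages];
};
```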
Now if anybody comes along and wants to look at page 102, they land on our delta record first, and they know how to apply that change as they go. We'll see how to do a search on the next slide. Same thing if I have another modification made to this page: say I want to delete key 48. I first set up the new delta record so it points to the physical address of the previous delta record, and then I do a compare-and-swap so that it becomes the head of the delta chain. Is it clear what's going on here? It's pretty straightforward.

So now let's talk about how to do a search. Say I want to do a lookup on some key that's maintained by this page here. We traverse the tree just as we would in a B+tree: at each inner node we look at the separator markers to decide whether to go left or right, and when we hit the leaf nodes we know we're no longer pointing to another node; we're pointing to the actual tuples that we want. So what happens is, as I traverse the tree and figure out that I need to look at page 102, I do the lookup in the mapping table and land at the head of the delta chain. As the thread scans down the chain, it maintains an internal picture of all the delta records it has seen. And if it's looking for a particular key and it sees a delta record that corresponds to that key, it can stop right there; it doesn't have to keep going. So say I'm looking for key 50: I land at the top, the first delta deals with key 48, which has nothing to do with me, so I skip it; then I get to the delta for key 50, which is the thing I'm looking for, so I know my entry exists and I don't need to keep going down the chain, right? If the key I'm looking for is not in the delta chain, I land at the base node and just do the binary search you'd normally do in a regular B+tree.

Right, and the key thing is that this delta chain goes backwards in time, so the latest delta record is always at the head. That means if there were, say, a "delete key 50" further down, I wouldn't see it, because I find my insert first, and that occurred logically after the delete; therefore the key should be there. And again, when you reach the bottom, you just do the binary search as you normally would.

So let's see what happens if two threads try to update the same page at the same time. Say one thread wants to delete key 48 and the other wants to insert key 16. It's the same thing we saw in the skip list, right? They'll both try to compare-and-swap this mapping table entry to point to their own delta record. They both hold the same expected physical pointer, because that's the way this works, but when they try the compare-and-swap, say the first guy wins. It now knows its delta record has been successfully installed in the chain. The other thread recognizes that its compare-and-swap failed, so it has to come back and retry the change, in which case it puts its insert delta on top of the winner's record, because that's now the head of the chain.
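Here's that search logic in sketch form. Again this is illustrative, with hypothetical types; the inner-node traversal and the base-page binary search are elided.

```cpp
#include <cstdint>

enum class NodeType { InsertDelta, DeleteDelta, BasePage };

struct ChainNode {
  NodeType type;
  uint64_t key;     // key affected, for insert/delete deltas
  uint64_t value;   // value, for insert deltas
  ChainNode *next;  // physical pointer down the chain (null on base page)
};

// Hypothetical: the usual sorted-array binary search on the base page.
bool BinarySearchBasePage(const ChainNode *base, uint64_t key, uint64_t *out);

// Walk the chain from the head the mapping table gave us. Newer deltas
// sit closer to the head, so the first delta mentioning our key wins.
bool Lookup(ChainNode *head, uint64_t search_key, uint64_t *out) {
  for (ChainNode *n = head; n != nullptr; n = n->next) {
    switch (n->type) {
      case NodeType::InsertDelta:
        if (n->key == search_key) { *out = n->value; return true; }
        break;  // delta for some other key: keep walking
      case NodeType::DeleteDelta:
        if (n->key == search_key) return false;  // newest word: deleted
        break;
      case NodeType::BasePage:
        return BinarySearchBasePage(n, search_key, out);
    }
  }
  return false;  // a well-formed chain always ends in a base page
}
```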
We'll see this in a second when we talk about structure modifications, but if you're doing an insert, update, or delete, it's just like the skip list: if the compare-and-swap fails, it's up to the index internally to go back and try again. When we talk about structure modifications, you actually want to abort the operation you're trying to do and not keep retrying it. This is slightly different from the skip list, and we'll see why in a second. But again, for insert, update, or delete, if the compare-and-swap fails, the index wrapper comes back and tries the same operation again, at which point the delta just gets appended on top of the new head.

So there are basically two types of delta records. The ones I've shown so far are the standard insert/update/delete deltas, which are pretty straightforward to understand. But then we'll see in a second how special delta records deal with structure modifications. Again, this is a B+tree-like index, so we have to be able to do the splits and merges you'd do in a regular B+tree.

In the paper, they say the default maximum length of the delta chain, how long you try to let it get, is about eight records. In our own version of the Bw-Tree, we've run similar experiments, and we see that eight records is about the right number. It's the right trade-off between how expensive it is to traverse the chain versus how often you have to do consolidation and garbage collection.

So again, consolidation is for when the delta chain gets too long and we want to collapse all these delta records and apply them to our base page. Remember what I said before, though: we don't want to do any updates in place. So to do consolidation, we first make a copy of the base page, and then we go in reverse order up the chain and apply each delta record to this new page one by one. Once we've applied all the changes, we know the state of this page is equivalent to what you'd see by traversing the full chain from the original page. Then, just like before, we need the other threads to learn about our new consolidated page, so we do a compare-and-swap on the mapping table to have it point to this new page. And again, the logic is the same: if the compare-and-swap fails, you know somebody else either inserted a delta record above the head we saw and we missed it, or they also did a consolidation. The way it's described in the paper, once the delta chain reaches a certain size, that triggers a consolidation in every thread that traverses it, so one of them is always going to succeed and all the others will fail. And this is what I was saying before: if your consolidation fails, rather than keep trying to redo it, you just abort that operation and go back to what you were trying to do in the first place. So say you're doing a lookup on key 50: you come along, recognize the delta chain has gotten too long, and that triggers the consolidation. If it succeeds, you're done, and you can sort of piggyback off of it and get the key you were looking for. If it fails, you just come back and do the search again, right? And the search has to be a complete traversal of the tree to get back to this page.
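In sketch form, consolidation looks something like this, continuing the hypothetical MappingTable from the earlier sketch; every helper function here is a stand-in, not real Bw-Tree code.

```cpp
struct BasePage : Node {};  // reusing Node from the earlier sketch

BasePage *FindBasePage(Node *head);            // hypothetical: walk to chain tail
BasePage *CopyBasePage(const BasePage *page);  // hypothetical: allocate + copy
void ApplyDeltasOldestFirst(Node *head, BasePage *page);  // hypothetical replay
void RetireForGC(Node *head);                  // hypothetical: epoch-based retire

// Collapse a delta chain into a fresh base page, then CAS it in. If the
// CAS fails, another thread appended a delta or consolidated first, so
// we abandon our copy rather than retrying.
BasePage *Consolidate(MappingTable &map, uint64_t pid) {
  Node *old_head = map.Get(pid);
  BasePage *fresh = CopyBasePage(FindBasePage(old_head));  // never in place
  ApplyDeltasOldestFirst(old_head, fresh);  // replay chain in reverse order
  if (!map.InstallDelta(pid, old_head, fresh)) {
    delete fresh;  // lost the race: abort this consolidation
    return nullptr;
  }
  RetireForGC(old_head);  // old chain is freed once its epoch drains
  return fresh;
}
```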
Yes? What are the contents of the page? So her question is: what are the contents of the page? It's the B+tree layout I showed last time, right? You have an array of keys and an array of values. And her follow-up is: if you're doing a lot of inserts into this page, at some point there's no space left for a new key, and it's going to spill over. Yes, you have to do a split, just like a regular B+tree, and I'll show that in a second. Any other questions?

Again, the difference from the skip list is that the way we avoid latches is this mapping table, because of its indirection. We can change one memory address and don't have to worry about fussing with the internal pointers inside the data structure itself.

All right, so at this point we've done the compare-and-swap, and our new page is the current page for page ID 102. So now we want to mark the old page and all its deltas as available for garbage collection. But we have the same problem we had before in the skip list: some thread might be hanging out inside our old delta chain, and we don't want to free the memory right away, because if that thread then follows a pointer to get to the old page 102, it would hit an invalid address, read garbage, and fail. So just like before in the skip list, we need to be careful to only collect garbage when we know no thread could still be touching the data we're messing with.

So in the Bw-Tree they do the same epoch-based garbage collection I talked about last time. Basically, there's a global counter for the epoch that gets incremented periodically by a separate thread. It could be every 10 milliseconds, it could be 50; it doesn't matter too much. Any time a thread enters the index to perform an operation, we tag it with the current epoch for that operation. So threads join the epoch when they enter the index, and we have to know when they leave after they're done doing what they want to do. And we say that the garbage for an epoch can only be reclaimed once we know all the threads that were inside the index for that epoch have successfully exited. And it's not just the current epoch you're dealing with, but also previous epochs: if there's a thread inside epoch 1, and I then switch to epoch 2 while that first thread is still in epoch 1, I can't free anything in epoch 2 until I know epoch 1 is safe.

So let's look at an example. This is the same setup as before: we have our delta chain off of page 102, we do consolidation, and we make a new copy of the page. When we start off, there's a thread running on CPU 1 that's at this point in the delta chain, so in our epoch table, the metadata for what's going on in the current epoch, we keep track that this thread is active inside the index. Then we have another thread running on CPU 2, and it's going to do the consolidation. So again, we collapse all the delta records into our new page 102, and we do the compare-and-swap so that this is now the new version of page 102.
And then the thread on CPU 2 recognizes, well, this base node and all its delta records, I've just replaced them by folding everything into this new page, so this entire chain can be garbage collected. So it records in the epoch table's metadata that this base page and its delta chain can be deleted. Then CPU 2 finishes whatever it wanted to do, and it leaves the epoch. At this point we still don't want to free that memory, because the thread running on CPU 1 is still hanging around somewhere in our delta chain. Only when it finishes and leaves the epoch is it safe to actually go ahead and garbage collect.

I didn't really talk about this last time, but it was implied: to do epoch-based garbage collection, there have to be some auxiliary data structures, some actual bookkeeping, to know what threads are hanging around and what garbage is pending. And whether a separate thread comes along and does all this garbage collection for you, or it's cooperative, doesn't matter; the protocol stays the same. For example, you could have every thread that enters the index look at the epoch table, the bookkeeping information for previous epochs, and decide whether it's now safe to delete older stuff. Or, in the case of what we provide for you in project two, there's a separate thread that's periodically invoked, performs the GC function, and cleans these things up for you. It doesn't matter, as long as you know there's no other thread still running in those previous epochs and therefore it's safe to delete.

In the back. So in this case, when you do the garbage collection, you need to know the epoch number? Yes, yes, so I'm not showing that. The statement is that to do this type of garbage collection, you need to know what the epoch number is. Correct. Again, it could be a global counter that one thread increments with an atomic addition every so often. Then there are functions so that when you enter the index, you know what the epoch is and can tag the thread with it, and any time you put something into the garbage collection epoch table, you tag it and say: when I entered the index I was at epoch 2, therefore this data can be deleted once epoch 2 is considered safe.

So this is not the responsibility of the index? His statement is that this is not the responsibility of the index. Now we're arguing semantics, so what do you mean by the responsibility of the index? That this should be managed at a higher level, by the concurrency control. So his statement is that this epoch information should be managed by the high-level concurrency control protocol of the database system. Why? Because only the concurrency scheme knows the epoch. But this is independent of the concurrency protocol of the system. I can do two-phase locking, which doesn't have an epoch, for my higher-level transactions, and still use epoch-based garbage collection for the internal indexes. They're completely orthogonal to each other.
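To make the epoch bookkeeping we've been describing concrete, here's a minimal sketch, under the assumption of one slot per worker thread and a timer thread bumping the global counter. All the names are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>

constexpr int kMaxThreads = 64;

std::atomic<uint64_t> g_epoch{1};  // bumped every ~10-50ms by a timer thread

struct Slot {
  std::atomic<uint64_t> active_epoch{UINT64_MAX};  // UINT64_MAX = not in index
};
Slot g_slots[kMaxThreads];

// Called on entry to the index: tag the thread with the current epoch.
uint64_t EnterEpoch(int tid) {
  uint64_t e = g_epoch.load();
  g_slots[tid].active_epoch.store(e);
  return e;  // garbage this operation retires gets tagged with e
}

// Called when the operation finishes.
void LeaveEpoch(int tid) {
  g_slots[tid].active_epoch.store(UINT64_MAX);
}

// Anything retired in an epoch strictly below this is safe to free:
// no thread from that epoch (or an earlier one) is still inside.
uint64_t MinActiveEpoch() {
  uint64_t min_epoch = g_epoch.load();
  for (Slot &s : g_slots)
    min_epoch = std::min(min_epoch, s.active_epoch.load());
  return min_epoch;
}
```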
Yes, if you're doing something like MVCC with timestamp ordering with epochs, you could maybe share that information down with the indexes, but in practice it's usually implemented independently. Another way to think about this: say I have 10 indexes, but only one of them is ever getting updated. I don't need to advance the epoch if no one's ever actually going into those other nine indexes. Whereas if you have some higher-level component come in and say, update your epoch, update your epoch, you may be doing it unnecessarily. This is also why we provide you with that needs-GC function in the index wrapper: we'll poke you and ask, do you want us to garbage collect, and if you come back and say false, we'll leave the index alone. It's the same argument as "the database system always knows better": the index sort of knows better, because it knows what happened inside of it. So when we poke the GC, if you know your epoch table is empty, you don't need garbage collection and you don't waste cycles. Any other questions? That's a good point. Okay. We actually have considered how to combine the high-level epoch management of the concurrency control system with the indexes; we haven't really thought it through or done anything with it yet, but it's an interesting research topic.

All right. So now we've got to talk about structure modifications. As I said before, this is a B+tree at heart, so we need to be able to do splits and merges. I'm just going to show an example of how to do a split; the merge is essentially the same thing, just in reverse. So now we're going to have two special delta records. They're not going to be used to store operations that modify the contents of the index; rather, these are delta records that correspond to physical changes to the organization of the index. We're going to have a split delta record that lets us specify that a subset of the key range for a particular base page is now being managed at, is now located at, another base page in the index. That delta record has a physical pointer to the previous delta record in the chain, as well as a logical pointer to the new base page. And then we're going to have a separate separator delta record for the higher levels of the tree that basically acts as a shortcut: it says, if you're looking for this key range that I know got split down below, here's a fast-track way to get to it without having to go all the way down into the delta chains below.

So again, I'll show an example to make this clearer. This is not an easy thing to walk through, and when I start showing the different steps, you'll see a lot of different arrows. I did the best I could to clean this up and make it easy to follow, but there's a lot going on here because it's kind of complicated. I'll walk through it slowly, one step at a time. This is the current state of our index when we start off. We have two levels, with three base pages at the bottom as the leaf nodes. And for simplicity, I'm only going to show the logical sibling pointers going in one direction; I'm not going to show the reverse pointers. The protocol for those is exactly the same.
All right. So the keys stored along the leaf nodes of our index are like this: page 102 has keys 1 and 2, page 103 has 3, 4, 5, and 6, and page 104 has 7 and 8. We want to split page 103, taking the upper half of its key range and storing it in a new page. Right. So to do this, the first thing we do is copy the page that has the data we want, but only include the keys that belong in our new split node. And then we have a logical pointer from 105 to 104, because that's where it should appear if you scan along the leaves. Now at this point, nobody knows about page 105. Right? Because if you do any lookup up above, you still land at 103, and if you scan along the leaf nodes, you still go from 102 to 103 to 104. So now we install an entry in our mapping table pointing to 105. Again, still nobody knows about us, but now we add the new split record that's going to tell any thread that comes along that we have split the key range for page 103. This split record belongs to the delta chain for this page here. It's basically going to say: keys 5 and 6 are no longer available for you at 103; they're available for you at 105. And the split record has a physical pointer to the base page, because that's what we always have in the delta chain for a base page, but it has a logical pointer to 105.

Now, this is where the ordering matters. Before we can put the split record into our delta chain, we have to make sure there's already a physical address for page ID 105 down in the mapping table, because when we add the logical pointer, if that entry doesn't exist yet, someone could do a lookup and land nowhere. Once we know that entry has been installed, it's okay for us to add the logical pointer here. So now, just as before, since we're appending this to the head of the delta chain, we go back and update the mapping for page 103 to point to the split record. Sorry, go ahead, yes.

So if the compare-and-swap for the split operation fails, will the intermediate result be kept or abandoned? For which one? The compare-and-swap that installs the split. So this one here, when I update 103 to point to the split record? And your question is what happens to the new node, page 105, whether it gets kept or abandoned? Okay, so his question is: say we're at this point here. We have a physical pointer from the split record to base page 103, and we have a logical pointer from the split record to page 105. If I now do the compare-and-swap on the mapping table entry for page 103 to make it point to the split record, and that fails, what happens? Well, what happens is: it fails. So what about the new node, will it be deleted? Yeah, so you could just try the compare-and-swap again, or you could just say, well, my split failed, let me abandon it. And at this point, although I'm drawing the X's here, the data's still there, because we don't ever modify the base page.
And if anybody does a lookup for page 103, they're still going to land at the old head. So say someone deletes key 5: they're not going to see the split record, because when they look up base page 103, they get this thing here. So if that delete happens after my split fails, and I come back and try to do the split again, I have to reason about whether that's still the right thing to do. Right? So typically, I think the way we implement this is: if your split fails, you just completely start over from scratch. And the question is whether that's considered a waste of time? Yes. But again, the whole idea of these latch-free data structures is that you're optimistically assuming you're not going to have conflicts, and therefore it's better to do these compare-and-swaps, which are much faster than taking heavyweight locks and pessimistically assuming nothing's going to work out. Yes?

Is there a cut-off threshold at which it's just not worth doing this optimistic, latch-free structure anymore, because conflict is so common? So his question is: is there a threshold where there are so many conflicts that this just isn't worth it? Yes. This is actually an active area of research right now. If you have super, super terrible contention, like everyone's trying to update this one thing, all the different protocols basically degenerate to exactly the same behavior. Right? There's no magic that's going to make it better. And as for the question of at what point there's so much contention that switching to a heavyweight locking scheme is better: there is a threshold, but it depends on a lot of things, the data structure, the transactions, the workload. At some point they all essentially end up being the same thing. In practice, though, this works out okay. Again, think about it: if all thousand threads want to update, you know, key 8, there's nothing that makes that work great; you're basically running in serial order.

All right. So we're at this point here: we do the compare-and-swap, and for our purposes we'll assume it succeeds, so now page 103's entry points to the split record. And this is a great example of why logical pointers are useful. All we had to do was update this one physical address. These sibling pointers here and here say "I point to logical page ID 103," so when I change that one physical address, those pointers atomically change as well, without us having to do anything. So now if I scan across the leaf nodes, instead of jumping from 102 directly into 103's base page, I get to the split record, and it tells me: for key range 3 to 4, go down here; to get keys 5 and 6, go down to this guy here; and then I know how to come back up and get to 104.

All right. So at this point we could stop, right? The index is structurally sound and correct at this point, and we're not going to have any false positives or false negatives.
We're not going to lose any information or miss things we should be seeing, but it gets kind of expensive, because now you always have to traverse to this split record and then know how to jump up and down between these different leaf nodes to get to the key range you're actually looking for. I'll also say that at this point, once the split record is installed, if anybody wants to update key 5 or 6, they'll come down here and do their operation on 105. The data is still technically stored in 103's base page because we haven't consolidated yet; at some point we'll consolidate, those stale entries will get blown away, and this will all get nicely smoothed out.

Now, up above in our index, we still have the markers for which key ranges are handled by different parts of the tree, and they still correspond to how the key range was laid out before we did our split. So we have this pointer here covering negative infinity to 3, then 3 to 7, and then 7 to infinity. But we also now know there's a separate base page with the key range 5 to 6, encapsulated inside the 3-to-7 range, which would take you down through the split record. So what you can do is add a new separator delta record above the inner node here, saying there's now a new key range, 5 to 7, handled by another base page. We do the same as before: a compare-and-swap so that the head for this node is now the separator record. So now if I want to do a lookup on key 6, I land at the separator and say, oh, the thing I'm looking for is between 5 and 7, so let me jump down this logical pointer and get straight to the data I'm looking for. If not, I jump down and proceed just as I was before. Right? And in this case here, if I'm looking for key 4, I fall between 3 and 7, I get to the split record, I see that the key I'm looking for is not handled by the new split node down here, and therefore I go back to the original page 103. Again, you don't need the separator to ensure correctness; it just avoids some extra jumps and traversals into the delta chain. It's an optimization. Yes?

Your question is about the sibling pointer on page 103? Yeah, that's correct... actually, no, hold up. The split record is going to store the fact that when you get to the end of 103's range, you should really be going to 105, not 104. Right? That's stored in the split record, because again, you're not allowed to modify a base page once it's created. That's what this logical pointer does for you here; it's extra bookkeeping inside the split record that lets you still do scans correctly. In the back, yes.

Are the X marks on node 103 actually part of node 103, or part of the split record? Your question is about these X marks. This is just for illustration, right? The split record says: if you want the key range 5 to 7, exclusive, then you go here. That's essentially equivalent to the X's; I'm just drawing X's to show that the data is still there, but no one will see it. Matt, yes.

So at this point, if you wanted to install an update on page 105, what compare-and-swaps would you have to do to get that to work?
So the question is: if I want to do an update to page 105, say I want to insert key 5.5, what do I do? His suggestion is that you'd update the split record to point to the new delta and update the mapping table, or make a new copy of the split record where the pointer goes to the new delta record and then down to 105, and compare-and-swap that in. But hold on: why can't you just put the delta record here, on 105's chain, and update this one mapping table pointer to point to the new delta record? If you then traverse through the split record, you're following a logical pointer, so you do a lookup on page 105 in the mapping table, you see the delta record, and you're fine. Yeah. That's the beauty of this, right? It solves all these problems. It's a good question, though. Okay. Yes?

With multiple threads trying to split, is it possible that one of them has the split record installed and another thread tries to install the separator? How do you ensure all of these operations happen in order? All right, so her question is: if I have two threads that want to do the split at exactly the same time, is it possible for one of them to do the split and the other to install the separator? No: you have to do the split before you do the separator. But how do you ensure these operations happen in order, since they're different compare-and-swap operations? It doesn't matter, right? Say I have two threads and they both try to do the split. One of them is going to succeed; the other one's compare-and-swap fails, so its operation gets aborted. Now, in the case of the separator: you don't have to install the separator at all, right? It just provides a shortcut to the bottom. So what you could do, and I forget exactly how we do this, is have the thread that successfully installed the split come back and do another traversal, because you can't go back up, and say, well, I know I did a split before, so let me try to install my separator. Or you could have any thread come along and say, I see there's a split here but no separator, let me add a separator. It doesn't matter who does it. But you always have to do the split first, before the separator, because otherwise the separator is pointing to nothing, or to something that you cannot reach if you come down normally. All right, cool.
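To recap the ordering, here's the split walkthrough condensed into a sketch, reusing the hypothetical MappingTable from before. Every helper here is a stand-in, not real Bw-Tree code.

```cpp
// Split page `pid` (103 in the example). The ordering matters: the new
// sibling must be reachable through the mapping table before any logical
// pointer to its pid is published in a split record.
bool SplitPage(MappingTable &map, uint64_t pid) {
  Node *head = map.Get(pid);

  // 1. Copy the upper half of the key range into a new base page, with
  //    a logical sibling pointer to pid's old right sibling (104).
  BasePage *right = CopyUpperHalf(head);          // hypothetical

  // 2. Install it under a fresh page ID (105). Nobody can see it yet.
  uint64_t right_pid = AllocatePid();             // hypothetical
  InstallNewPage(map, right_pid, right);          // hypothetical

  // 3. CAS a split delta onto pid's chain: physical pointer to the old
  //    head, logical pointer (the pid) to the new sibling. On failure,
  //    abort the whole split and start over; don't retry blindly.
  Node *split = MakeSplitDelta(head, right_pid);  // hypothetical
  if (!map.InstallDelta(pid, head, split)) return false;

  // 4. Optionally CAS a separator delta into the parent as a shortcut.
  //    Correctness holds without it; lookups just traverse the split
  //    record until someone installs the separator.
  PostSeparatorDelta(pid, right_pid);             // hypothetical
  return true;
}
```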
All right, so let's look at some performance numbers. This is from the paper you guys read: the Bw-Tree running on an older machine with, I think, one socket and four cores, so eight with hyperthreading. They had three workloads: one derived from an Xbox online game, one a synthetic micro-benchmark, and another doing deduplication. I think in the paper they only show the numbers for the Bw-Tree versus the B+tree for some of these, so I actually emailed Justin, and what I'm showing here are the full numbers that weren't fully reported in the paper; these are the experiments they did.

So across the board, you see the Bw-Tree outperforming the concurrent skip list and the B+tree. This B+tree implementation is actually from Berkeley DB. Berkeley DB was one of the first embedded databases, from the 1990s, out of UC Berkeley, and then Oracle bought them in 2006. So it's originally a disk-based database system with a disk-based index, but I think for this case here they turned off all the disk stuff. I'll also mention that after Berkeley DB got bought by Oracle, its authors went and formed another company called WiredTiger, which then got bought by MongoDB about two years ago, and WiredTiger is now the default storage engine you get from MongoDB. WiredTiger is really good. So again, across the board, the Bw-Tree outperforms all the other indexes, and especially the skip list, right? Even though it has to be latch-free, while the B+tree here is doing the latch crabbing we talked about last time. All right, any questions? Okay, cool.

So in the remaining time, I want to talk about the Adaptive Radix Tree, the ART index, from HyPer. As I said at the beginning, the ART index is not latch-free. The Bw-Tree is, and the skip list is, but this guy is not. And it's worth talking about because, again, it has some interesting characteristics that are different from what you'd normally see in an index. So as a real quick show of hands, who here knows what a trie is? The trie data structure. Okay, a small number, okay. So a radix tree is a variant of a trie, and I'll show what a trie looks like on the next slide. But the key difference we're going to have in a radix tree, versus the Bw-Tree and the B+tree and the skip list, is that we store our keys in a digital representation that lets us examine the prefixes of those keys rather than having to look at the entire key. And when I say digital, I don't mean an Arabic numeral or a number; I mean taking the individual elements of the key and storing them one by one. So if the key were a varchar, a string, a digit would be one character of the entire string.

What's interesting about the radix tree is that, unlike a B+tree, where the height of the tree depends on the number of keys you have, in a radix tree it depends on the length of the longest key. So if I have a million keys in a B+tree, the height of the tree is O(log n). In a radix tree, it's the length of the longest key: if I have one key with a million digits that's completely unique, then the height of my tree is a million. In practice this almost never happens. Another nice thing about the radix tree is that it doesn't require any rebalancing, and this is because the organization of the data structure is deterministic. What I mean by that is: no matter what order you insert your keys, the layout of the data structure is always the same. Contrast this with the B+tree or skip list: if I insert the keys in sorted order versus random order, the layout of the tree will be different each time. In a radix tree it's always exactly the same. We'll see why in a second.
The other cool thing about it is that since we're not storing the entire key explicitly anywhere in the index, the digits are stored along the paths in the tree, so we can actually reconstruct the keys, put them back into their original form, by traversing the index.

So this is what a trie looks like. Say the keys I have are "hello", "hat", and "have". What you do is break up the digits, in this case the characters of the string, and store them going down the data structure. And for cases where keys have overlapping characters that are the same, you only have to store them once. In this case, all three keys start with the letter H, so I just have the H at the top, and that's enough to represent all three keys. Then for the key "hello", I have a path down the tree for H-E-L-L-O, and at the bottom I have a pointer to the actual tuple. So yes, I am storing the key "hello" in my index, but it's not all concatenated together as it would be in a B+tree or a skip list. And then we'll also have a special bit to say whether the pointer we're looking at is another node in the trie or a 64-bit pointer to a tuple. So this is the trie, the standard version of this. And the same thing over here: "hat" and "have" both have A as the second character, so you only store it once and it splits below that.

In a radix tree, the big difference is that we can do compression when we know we have a unique suffix. So in this case, for the key "hello", we don't store E, L, L, O as a chain of separate nodes as we did here; we store "ello" together as a single suffix with just one pointer. So we don't have that long chain going down, but we have the same layout on the other side for "hat" and "have", and again, the "ve" suffix of "have" is not shared with "hat", so those digits are stored together. A radix tree is basically a compressed version of a trie.

So now, what makes the ART index adaptive is how it supports modifications. You don't actually store an ART index the way I showed here, with these nodes and these edges where the key digits live. Instead, you have nodes that span a level, and you pack multiple digits into that level. These are the same keys as before, "hello", "have", and "hat": at level one we just have the H, which is shared by multiple keys; at the second level we have a separate portion of the node for the "ello" suffix, and a pointer down to the third level for "ve" and "t". So now say I want to insert the key "hair": I traverse down H, then A, get to this node here, recognize that this is where I want to put the "ir", and store the pointer there. Again, this is different from what we saw in, say, a Bw-Tree or a B+tree index, where you may have separate leaf nodes storing all of this at the lowest level; here you try to pack in as much as you possibly can in a single node as you go down to the bottom. And say I want to delete "hat" and "have": again, you traverse down, you find those keys, you go ahead and blow them away, and then you can recognize that this node is nearly empty, so you can go ahead and consolidate it.
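For a picture of what one of those adaptive nodes looks like, here's a sketch of the smallest node type from the ART design. This is simplified; the paper's Node4/16/48/256 variants carry more machinery, like path-compression headers.

```cpp
#include <cstdint>

// Smallest adaptive node: up to four children, one key byte (digit) each.
struct ArtNode4 {
  uint8_t  num_children = 0;
  uint8_t  keys[4];      // the digit that leads to each child
  void    *children[4];  // tagged pointers: inner node or tuple pointer

  void *FindChild(uint8_t digit) const {
    for (int i = 0; i < num_children; i++)
      if (keys[i] == digit) return children[i];
    return nullptr;      // digit not present: key does not exist
  }
};
// When a Node4 fills up, it's grown into a Node16, then a Node48, then a
// Node256; that resizing is the "adaptive" in Adaptive Radix Tree.
```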
And the paper talks about how to do this in an efficient manner, memory-aligned and word-aligned, so that you reduce the number of cache misses as you traverse it. So, we'll get to how it's not latch-free in a second, but one of the other cool things I like about the ART index is that they deal with how to store key types that are not amenable to this digit decomposition and binary comparison. They have a sort of recipe book for taking all the different scalar types you can have in your database and storing them in a form that can be stored easily in a radix tree, in an ART index, and compared efficiently. So in the case of unsigned integers: normally on x86 they're stored in a little-endian layout, but you can't do a memcmp on those individual bytes, because they'd be compared in reverse order of significance, which is not what you want. Same thing for signed integers: if you just stored the raw bits or bytes in the index, the upper bits encode whether the value is negative, and that would not give a correct comparison. Floats are the same: they have to be converted into these order-preserving fixed-format integers. And for compound keys, you just transform each component separately and then concatenate them.

So let me show what I mean by this for the unsigned integers, because that's pretty straightforward to understand. Say you have an integer key; this one's around 168 million. If you convert it to hex, you'll see why you have to store it in big-endian format in the radix tree instead of little-endian. If you store it big-endian, then reading across from the lowest memory location you get the bytes in the order you'd expect, most significant first. So now if I want to compare whether one number is less than another, I can just look at these high-order bytes and do a straight memcmp, and see that indeed one is less than the other. Whereas if you store it little-endian and memcmp from the first byte, that's actually incorrect, because you're looking at the low-order portion of the digits, right? That's not a valid comparison; you'd have to convert it back, or put it into registers and invoke a comparison instruction, whereas a straight memcmp is much faster. So the way we'd store this key in the radix tree is: put everything in big-endian format and store the bytes as we go down, and then as we check whether our entry is in the tree, we start with the high-order bytes and go down one by one, doing straight memcmps on them. Right? So we had to do a little extra work to transform the keys into a format that makes them easier to store in the radix tree, the ART index, but that makes our comparisons go much faster. Is it clear what's going on? I think this is one of the key ideas the HyPer guys had in their index.
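Here's what that transformation looks like for a 32-bit unsigned integer. This is a sketch, not Peloton's actual key code.

```cpp
#include <cstdint>
#include <cstring>

// Flip a native (little-endian on x86) integer into big-endian bytes so
// that a plain byte-wise memcmp orders keys the same way integer
// comparison would.
void EncodeBigEndian(uint32_t key, uint8_t out[4]) {
  out[0] = static_cast<uint8_t>(key >> 24);  // most significant byte first
  out[1] = static_cast<uint8_t>(key >> 16);
  out[2] = static_cast<uint8_t>(key >> 8);
  out[3] = static_cast<uint8_t>(key);
}

// Now "less than" is one memcmp; no decoding back into an integer.
bool Less(const uint8_t a[4], const uint8_t b[4]) {
  return std::memcmp(a, b, 4) < 0;
}
// Signed integers additionally flip the sign bit so negatives order
// first; floats need a similar order-preserving bit transformation.
```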
And we actually do this in our own system, too. Remember I said that for your skip list implementation, we already provide how to store keys and how to compare keys? It's those template arguments that you have to specify. So we have two implementations of keys, and two implementations of comparators for those keys. The compact ints key is for when you have one or more integers that we can store in big-endian format, like in the ART index, which allows fast memcmps to see whether one key is less than another or whether they're equivalent. And the generic key is what you'd expect to write the first time you ever built an index: you take the key columns and concatenate their memory one after another, and that's what you store as the composite key. The two types of comparators we have are: the generic comparator, which dereferences each offset in the key into a value object and does the comparison based on that; and the fast comparator, which I wrote a month or two ago, which dereferences the pointer directly with a reinterpret cast and does the comparison on the memory that way.

This chart is just showing you that storing the data in big-endian format and comparing it that way is much faster than the other two approaches. For this workload, we basically take 10 million keys and, in a single thread, insert them into the Bw-Tree and see how fast that goes; then we do a lookup on every single key and see how fast that goes; then we go ahead and delete every key. What you see is that the compact integer representation is much faster than everything else, and my fast comparator is faster than the old default, slower one. The big reason it's faster is that you're instantiating fewer objects and allocating less memory to do a comparison; you try to do everything directly by comparing two registers. So when you look at the code, you'll see CompactIntsKey, GenericKey, FastGenericComparator, and GenericComparator; this is what's going on there. And this is not something you have to do manually: when you invoke the index factory to make your index, it knows what your key types look like and automatically picks which of these setups to use. So if you have a varchar, it has to default to the generic key; if you have all integers, it can use the compact one; I think floats and everything else will use the middle one. Is that clear? Again, this is something you don't have to worry about; I just want you to know what's going on when you see this in the code and start using the index. Okay.

All right, so real quickly: as I said, the ART index is not latch-free, but they have a paper that came out last year where they describe an optimistic coupling scheme where readers don't block writers. They're doing latch crabbing just as you would in a regular B+tree, but the latch inside each index node has a version counter. Any time a writer thread wants to acquire an exclusive lock, a write lock, for a node, it does a compare-and-swap, because it's a spin lock. If it acquires the lock, it then increments the counter atomically. And the readers come along and check whether the lock is already being held. If it is, they have to wait, because you can't have a reader try to read something while a writer is trying to write to it.
But if the lock is not being held, the reader goes ahead and, in effect, virtually acquires the lock, keeping track of what the version counter was when it first checked. Then when it comes back and would unlock the latch, it just checks whether the counter is still the same as it was before. If the counter changed, it knows a writer came along and modified the node it was looking at, and it has to retry; if it's still the same, it's fine. So you're optimistically assuming that the readers won't be interfered with by writers, and you don't pay the cost of blocking writers when you have threads in there doing reads. And in their case, they also do the same epoch-based garbage collection we talked about before: you have to keep track of what threads are running around inside the system. So this is showing you another alternative to doing the latch-free stuff, in ART. I forget, I think there is also a latch-free trie from another group of Germans that came out a few years ago, and I don't know why they didn't do the same thing, but that's fine.

All right, so the last set of benchmarks I want to show you... I was debating whether to actually show you this or not, because it's sort of incomplete and I don't want to give you a false impression, but it highlights a key aspect of ART, which is why I'm spending the time to show you this. This is a single-threaded benchmark, no multi-threading at all, on 30 million random 64-bit keys. The workloads are: read-only; insert-only; a read/write mix, I think 50% reads and 50% writes; and a sequential scan along the leaf nodes mixed with inserts, I think 10% scans and 90% inserts.

The five indexes we compare are: the STX B+tree, which is a good open-source, state-of-the-art implementation of a single-threaded B+tree (as far as I know, we had trouble finding a good concurrent B+tree, and this one is single-threaded, which is why this is a single-threaded experiment); Masstree, which is the index used in the Silo system, essentially a trie of B+trees, a hybrid combination of B+trees and tries; a concurrent skip list; the Bw-Tree that we implemented; and the ART index from the Germans, the HyPer guys.

The two things I want to point out real quickly: first, look at the way higher numbers the ART index gets over everything else. This is single-threaded, so we're not measuring contention; this is the best-case scenario for all of these indexes, and ART is about 4x faster than everyone else. That's because the index is more compact, you can quickly find the things you're looking for, and you don't have to do expensive key comparisons over and over again. And again, it's faster in all these cases, even for the scan-plus-insert workload. The other thing I'll point out is where I got these numbers: all the numbers except the Bw-Tree's are from a paper we published last year. The Bw-Tree experiments were run by a student who took 721 last year and implemented the Bw-Tree that's in the system now, and I don't think he ran on the same hardware.
All right, so the last set of benchmarks I wanna show you, I was debating whether to actually show you guys this or not, because it's sort of incomplete and I don't wanna give you a false impression, but it'll highlight some key aspects of ART, which is why I'm spending the time on it. So this is a single-threaded benchmark, not doing any multi-threading at all, run on 30 million random 64-bit keys. And we'll have four types of workloads: read-only, insert-only, a read/write mix (I think it's 50% reads, 50% writes), and then a sequential scan along the leaf nodes with inserts (I think it's 10% scans, 90% inserts). The five indexes we're gonna compare are the B+tree, which is the STX B+tree, a good open-source, state-of-the-art implementation of a single-threaded B+tree (as far as I know, we had trouble finding a good concurrent B+tree, and this one's single-threaded, so that's why we're doing a single-threaded experiment); then Masstree, which is the index used in the Silo system (Masstree is essentially a trie of trees, sort of a hybrid combination of B+trees and tries); then the skip list, a concurrent skip list; then the BW tree that we implemented; and then the ART index from the Germans, the HyPer guys.

So the two things I wanna point out real quickly: first, you see the way higher numbers you get with the ART index over everything else. This is single-threaded, so we're not measuring contention here; this is the best-case scenario for all of these indexes, and ART is about 4x faster than everyone else. And that's because the index is more compact: you can quickly find the things you're looking for, and you don't have to do expensive key comparisons over and over again. And again, it's faster in all these cases here, even for the scan/insert workload. The other thing I'll point out is where I got these numbers: all the numbers except for the BW tree are from a paper we published last year. The BW tree experiments were run by a student who took 721 last year and implemented the BW tree that's in the system now. I don't think he ran on the same hardware; I have to double-check this. He sent me the numbers late last night, so I'm not sure, and that's why I think it's running a little bit slower. I don't think it's an exact apples-to-apples comparison, but it's close enough, and they're definitely running the same workload. But it's showing you here that the BW tree is gonna be slightly slower than everyone else. And again, that's because of that mapping table: you have to do that lookup all the time to go from a logical pointer to a physical pointer to figure out where the data you're looking for actually is. But maybe next week I can show concurrent numbers with a multi-threaded experiment. In that case, as you saw in the paper you guys read, the BW tree definitely does perform better than these other guys; we still have to measure it against the ART index to see how much faster that is. All right, so any questions about this? Again, this is just for your own edification about why I'm spending time talking about ART even though it's not latch-free: the performance numbers are hard to ignore. Okay?

All right, so my parting thoughts. I think the BW tree is one of the more interesting latch-free index data structures that have come out in recent years. And this is actually a really hot area in research, right? There's papers coming out all the time for better skip lists, better latch-free indexes, and things like that. So at this point, the BW tree is four or five years old, and we have done some work to improve it over what the original Microsoft paper had done, as I talked about. But like I said, it'd be interesting to see how this compares against ART, or maybe some of the better skip lists that have come out in the last year from the guys in Australia. In my opinion, when I read the BW tree paper, it really just clicked in my mind: oh, the mapping table is the key thing that allows you to avoid all these other issues that you can have in a skip list and other latch-free data structures, right? It's a single location where you can update pointers.

All right, so to finish up in the last 10 minutes or so, I want to talk about some tips for doing profiling in a database system. Now these are generic tips, right? They're applicable when you go out into the real world, whatever application you're working on, but I'll show examples of how to do this in the context of Peloton and your project. So say we have some kind of program that has two functions, foo and bar. The naive question is: how can we check the speed of our program and figure out how to make it run faster? The naive way to do this is you open up GDB, or whatever your favorite debugger is, and every so often you just randomly hit pause, look at the stack trace, and see what function the program is in, right? The dumbest thing you can do, but you can do it. So let's say we did this and collected 10 call-stack samples, and we saw that six out of the 10 times we were in our function foo. From this we know, based on this estimate from the samples we have, that 60% of the time is spent in foo. And obviously we could improve the estimate by maybe getting that little servo hand you've seen on YouTube to hit the pause button over and over again on your laptop, so we increase the number of samples and get a better number. But we'll see that we don't have to do this, coming up.

But before we talk about that, we can talk about Amdahl's law to understand what kind of speedup we can expect if we go optimize this foo function, right? So we know foo was running 60% of the time, and say there's something really stupid in our code, we go ahead and fix it, and now foo runs two times faster. What's the expected overall speedup? Well, again, the 60% of the time the program spent in the function foo gets cut in half, but the 40% of the time spent in the other part of the program, executing the function bar, stays unaffected, right? That's not changed. So Amdahl's law is this nice little function that says, based on the percentage of time spent in the task we optimize and the speedup we get from that optimization, what the expected theoretical improvement of our change should be. So in this case, cutting the time in half for 60% of the program, we'd expect the overall speedup to be about 1.4.
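For reference, here's that function written out with the example's numbers plugged in, where p is the fraction of time spent in the part we optimize and s is the speedup of that part:

```latex
\[
S_{\text{overall}}
  = \frac{1}{(1 - p) + \dfrac{p}{s}}
  = \frac{1}{(1 - 0.6) + \dfrac{0.6}{2}}
  = \frac{1}{0.7}
  \approx 1.43
\]
```

Note that even if the fix made foo infinitely fast (s going to infinity), the overall speedup would cap out at 1 / (1 - 0.6) = 2.5x, which is why it matters how much of the total runtime the function you're optimizing actually accounts for.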
So this is something to be mindful of when you start debugging and profiling your skip list, right? You'll see what functions you're spending all your time in, and just because you make them run 2x, 3x, 4x faster, you're not gonna see that same change cumulatively throughout the entire system, right? It's gonna be a subset of the overall runtime of the system, okay?

All right, so again, since we don't wanna have a little robot hitting our laptop over and over again, we have two tools we can actually use to do profiling. A quick show of hands: who here has used Valgrind, other than for project one or project two? Not many, okay. Who here has used Perf? One, okay, cool. So these are the two main tools that people use, at least in the Linux world; I'm sure there are equivalent things in the Windows environment. Valgrind is essentially like a virtual machine that runs your program with a bunch of instrumentation to record everything it's possibly doing, and the nice thing about it is it has nice visualization tools to show you where you're spending all your time. Perf is much more lightweight, and the way it works is it enables these hardware performance counters that Linux, the kernel, provides, and keeps track of them while running your application, right? And then, I'll show you what it looks like, it has a console-oriented tool to show you where you're spending all your time. In general, in my opinion, I prefer running Valgrind, or Callgrind; Perf is sometimes good for getting quick numbers.

All right, so Valgrind is a suite of debugging tools for Linux programs. We've seen Memcheck before, because you ran Valgrind on the tests for project number one, right? That would show you whether you had memory leaks or invalid memory accesses and things like that. The other tool you can use is Callgrind, and that's gonna generate the call graph of, you know, what function calls what function and how much time you're spending in these different parts. To use Callgrind, you basically invoke Valgrind but pass in a flag saying what tool you want, and then whatever executable you wanna invoke. So in this case here, you would run something like valgrind --tool=callgrind followed by your skip list index test binary. You can also run this on the full binary of the database system and get a complete trace of the entire program.
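One aside the lecture doesn't cover, but which can save you time: Callgrind slows the program down a lot, and Valgrind ships a client-request header that lets you restrict instrumentation to just the region you care about. A small sketch, where InsertAndLookupKeys is a made-up placeholder for whatever workload you want to measure:

```cpp
#include <valgrind/callgrind.h>

void InsertAndLookupKeys();  // hypothetical workload under measurement

void RunIndexBenchmark() {
  // Run as: valgrind --tool=callgrind --instr-atstart=no ./your_test
  // so that nothing is instrumented until we reach the region of interest.
  CALLGRIND_START_INSTRUMENTATION;
  CALLGRIND_TOGGLE_COLLECT;        // start collecting costs

  InsertAndLookupKeys();

  CALLGRIND_TOGGLE_COLLECT;        // stop collecting costs
  CALLGRIND_STOP_INSTRUMENTATION;
  CALLGRIND_DUMP_STATS;            // flush counters to callgrind.out.<pid>
}
```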
For this, I don't think you have to run as root; for Perf you do, because you have to get at the performance counters from the kernel, but Valgrind I think can run in user space. So what's gonna happen is, when your program finishes, it's gonna spit out a file named callgrind.out.<pid>, where the suffix is whatever process ID your program ran as. And then you can use the visualization tool KCachegrind, and it'll generate a nice chart that looks like this. You'll have a distribution of the time spent in your program, and then it'll make this nice call graph to show you what calls what and how much time is spent in each part. So I can show you the live version of it. This is a trace I collected last month. That's less readable, right? Well, anyway, you'll see here you get all the function calls in order, like this calls this calls this, and then you have this nice call graph that shows how many times each function was invoked and the percentage of time that was spent in the different parts, right? So this function, execute plan, was invoked 140,000 times, and this is where we spent 66% of our time in the system, right? And then it shows you what it invoked. And you can double-click this, right? If I double-click one of these other ones, it'll rebuild the call graph and show what was happening inside of it, because for the high-level stuff it shows just the heavy hitters, and then you have to zoom in to see more information about what's going on inside a function. And then, what I really like, up here at the top, let me try this one here: if you have the source code installed and you ran with debugging symbols, you can see the actual lines of code that were executed, how many times they were invoked, and the cumulative time spent in each of them, right? So this is more than just saying, you know, what function was called and how often; you can go line by line and see where you're spending your time. This is probably the reason why I like Callgrind better.

I'll also say, for this you wanna make sure that you compile your program in release mode, not debug mode. Because what'll happen otherwise is there'll be all these debug log statements, and those are creating strings and writing to standard out, so that's where you're gonna see all your time being spent, not the real system, not the code that you wrote. So you wanna make sure you compile in release mode, although it's actually trickier than that, because if you just compile in release mode you lose the debug symbols. Yeah, all right, so I'll write up instructions on how to do this correctly. Yeah, I forgot about that, sorry.

So, right, so that's Callgrind. The other one is Perf, and as I said, the way this works is that it uses performance counters that Linux provides; it's actually using hardware counters. So you can measure things like how many cycles you're spending, how many instructions you're executing, how many cache misses you have, all the low-level hardware things that are difficult to measure or that you can't measure with Callgrind. The way this works is you say perf record, and you specify what event you want to count, how often it should take samples, and then what program you wanna invoke. And then once the process finishes, it'll dump out a file called perf.data.
And then you just invoke the perf command again, but with the report option, and that'll give you sort of the same thing: an ordered list of where you're spending all your time, right? And I'll show you what it looks like in the live version. You can actually go down and see where you're spending all your time. So in this case here, this is the BW tree index test, and it looks like we're spending all our time allocating strings for something; I forget what the test was. There's more strings, more strings; this is a SIMD memcpy, malloc and free, and then here's actually some code that we wrote, right? Doing some comparison, right? So Perf is okay because you can get a high-level view of what's going on, and it'll show you, how do I say this, more fine-grained, hardware-level things than what Callgrind can show, but in my opinion I like Callgrind better. Okay, right? You can see the distribution and everything. And then there's a bunch of other events you can measure, like cache misses and branch mispredictions, that may be relevant for what you're doing in project two, or possibly later on for project three, okay?

So, we'll post the slides on the course website so you'll be able to get all these links; there's a bunch of tutorials and walkthroughs about how to do this. There's a wiki article on how to run Perf and Callgrind on the full Peloton system, but I think the instructions I'm showing you here should be enough to run the skip list index test. And then I'll write up instructions on how to compile the code with debugging symbols turned on but with log messages turned off, which is not trivial, but I'll try to make it easy for you guys, okay? So any questions about profiling? Okay, so next class will be the last lecture on indexing. We're gonna do another paper from Microsoft Research, and again, I think Microsoft Research is probably the best database research group outside of academia, so they do a lot of great stuff. We're gonna look at a paper on how you actually build indexes for OLAP database systems, and this will be looking at the special engine for SQL Server called Apollo. So just in the same way that Hekaton was the OLTP engine