So this is the second lecture that we're doing on OLTP indexes. Somebody asked why we're not doing a lecture on OLAP indexes. The answer is that, as we'll see later in the semester, it's oftentimes faster to run analytical queries on large segments of the database by just reading the data with a sequential scan. If it's compressed and stored as a column store, that's going to be really, really fast, especially if you have to read the entire table. OLTP indexes are where you do need something more fine-grained, where you can come in and do small updates and operations. All right, so for today's agenda, I'm going to start off talking about high-level implementation issues that we have to worry about in a concurrent OLTP index. Then we'll spend time talking about the ART index, the radix tree from the HyPer guys, which was in the paper that you read. You actually ended up reading the second ART paper. The first one describes the data structure by itself, but it was single-threaded; it didn't support concurrent operations. So the paper I had you read is about how you make the ART index concurrent. And then if we have extra time, we'll finish off by talking about how to go about doing profiling in Peloton, to check for performance and other things. This will be in the context of our system, because that's what I care about in this course, but the high-level ideas of what you can do with Valgrind and other tools are applicable to other systems as well, okay? All right, so we need to talk about some design decisions we're going to have to make about how we actually implement our index. It's one thing to show you a bunch of pictures and say, yeah, this thing has a pointer to this thing. But when it comes time to actually build the index, there are some difficult things we have to worry about. So we'll start off talking about memory pools or object pools, then garbage collection, and then how to handle non-unique keys, variable-length keys, and prefix compression. What I'll say is that all of these are applicable to general-purpose trees like a B+ tree, a BW-Tree, or a skip list. Some of them are not applicable to the radix tree, the ART index, and it's fairly obvious why: the way it stores its keys is completely different from regular tree indexes. But again, these are high-level constructs or ideas that are applicable to most indexes. So the first thing to deal with is how we're actually going to allocate memory. In the diagrams I was showing you, I had these boxes in the skip list: one node in the skip list has a key and a value and a pointer, and then the next node has the same data members. The stupidest way to implement that would be to call malloc for every single node, any time I create a new node. And this is bad because now we have to go inside the memory allocator, which has its own internal latch and maintains its own internal data structures to hand out new memory. And if the allocator doesn't have any memory, it has to go down to the kernel and allocate more memory from it, to add to its arenas. As I've said often in this class, the operating system is our frenemy.
We need it to actually run our database process, but in general we don't want to talk to it. So we don't want to be calling malloc and free every single time we add a single node. I said before that you could combine multiple key-value pairs in a single skip list node, and that would reduce the number of times you call malloc, but in general we still don't want to call malloc every time we create a new node. The easiest way to solve this is with memory pools or object pools. The idea is that if all the nodes in our index are the same size, then any time we delete a node, instead of freeing the memory and giving it back to the allocator, we just put it into this memory pool. And when someone says, I need to insert a new node, give me some memory, we just reuse the old one. This is really easy to do if your nodes are a fixed size, or a small number of fixed sizes. If they're completely variable-length, then you might as well call malloc and let the allocator figure it out, because there's not much else you can do. So on inserts you grab a free node from the pool if one exists, otherwise you create one; on deletes you just add the node back to the pool. What I've described here is super simple; anybody can implement it. Where things get hard is when you want to do compaction, or shrink the size of your memory pool. Think about it this way: say you organize your memory pool using a hash table. If you're using a cuckoo hash table or whatever, you can grow the hash table pretty easily, but if you now need to shrink it significantly, you basically have to rehash everything, and that's a very expensive operation. The other problem is that you end up with a bunch of holes in your contiguous space of memory, and you need to start reshuffling things to pack everything in so that you can free up a lot of it. You basically have to reorganize the contents of memory, and that's not cheap. So memory pools are a good idea, and they're super easy to do if you only ever get bigger. But if I insert a billion keys and then delete a billion keys, I don't want my database system to keep around all those nodes for keys that don't exist anymore. And you may say, well, how often does this happen? It's actually pretty common. A lot of times in OLTP applications, you'll see them insert a bunch of data throughout the day, then offload it all to an analytical system, a separate database system on a separate machine that does the machine learning or complex decision-support queries, and then delete or prune everything from the front-end system. So when the next morning starts, they essentially have an empty or much smaller database, and they insert a bunch of new stuff into it. So it's very common for things to grow and then shrink, and shrinking is not trivial, okay? The pooling part is something you should definitely do in your skip list; it's easy. The shrinking part is harder, so let's see how far you get. The next thing is garbage collection. This is separate from the memory pool: this is how we decide whether it's actually okay for us to delete a node or an object in our index and give it back to the memory pool.
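To make that concrete, here's a minimal sketch of a fixed-size node pool, assuming a simple skip-list-style node type. The node layout and names are just illustrative, not Peloton's actual code, and a real index would want per-thread pools or a latch-free free list instead of a global mutex:

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical fixed-size skip-list node; the exact layout doesn't matter,
// only that every node is the same size so freed nodes can be reused.
struct Node {
  uint64_t key;
  uint64_t value;
  Node *next;
};

class NodePool {
 public:
  // Reuse a previously freed node if one exists; otherwise fall back to
  // the allocator.
  Node *Allocate() {
    std::lock_guard<std::mutex> guard(latch_);
    if (!free_list_.empty()) {
      Node *node = free_list_.back();
      free_list_.pop_back();
      return node;
    }
    return new Node();
  }

  // "Deleting" a node just returns it to the pool. The hard part discussed
  // above -- shrinking the pool and giving memory back -- is not handled here.
  void Free(Node *node) {
    std::lock_guard<std::mutex> guard(latch_);
    free_list_.push_back(node);
  }

 private:
  std::mutex latch_;
  std::vector<Node *> free_list_;
};
```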
The issue we're trying to deal with is that, because we have all these pointers pointing to different nodes, and it's not just contiguous space in memory, we could have a thread that's hanging out on some node. It has a pointer to K3 and wants to scan across. But then some other thread decides, all right, I've deleted K3, I can remove it entirely from memory; it does a physical deletion after the logical deletion. Now the first thread scans along that pointer and lands in no man's land, and gets a seg fault because it's accessing garbage memory. So our garbage collector needs to be mindful of what threads are active in our index and make sure that nobody could be accessing memory that we're about to free up. It could also be the case, if we're using a memory pool, that I delete this node and then put K12, some other key, into that same memory because I reused it. Now another thread scans across, finds it, and thinks everything's okay, because it landed on memory that looks like a node in my index, but it's not logically correct. So it's not just physical deletes and seg faults; it's also reading data that you shouldn't be able to read. If you've taken 418, 618, or any concurrent programming course, they'll talk about a bunch of different methods for this. The two that we care about are reference counting and epoch-based reclamation. There are also hazard pointers and other things you can use, but in general these two are what most people use in a database system. With a reference counter, the basic idea is that every single node or object in our index maintains a counter of the number of threads that are currently accessing it. So if I'm a thread and I want to read a node, I do an atomic addition, a compare-and-swap on its counter, and add one to it. That tells everyone else that there is a thread accessing this node; they don't need to know which one, they just know there is one, and therefore it's not safe to delete it. And then when we're done doing whatever we need to do, we decrement that counter. When a thread says, I want to delete this node because the key's been removed, we know it's only safe to physically delete it when that counter is zero. Again, this is super simple to implement, but you're going to have the same problem we saw when we talked about spin locks: you've got some object sitting in memory, and all your cores on different sockets are doing compare-and-swap or atomic adds to flip its counter up and down. That generates a lot of cache coherence traffic to invalidate that location in memory, just to say, hey, I'm accessing this, hey, I'm done. So this is technically sound and correct: we will not have dangling pointers, and we will not physically delete objects before we should. But it's going to get bad performance, because it's essentially constant writes to shared memory, and as you saw in the Silo paper, we don't want to do that. So there are two observations we can make here that lead to a better approach.
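A minimal sketch of what that per-node reference counting looks like, purely for illustration; as discussed, you generally don't want to do it this way because of the cache-coherence traffic:

```cpp
#include <atomic>

struct RefCountedNode {
  std::atomic<int> ref_count{0};
  // ... key/value/pointer members would go here ...
};

// A reader "pins" the node before touching it and "unpins" it when done.
void Pin(RefCountedNode *node)   { node->ref_count.fetch_add(1); }
void Unpin(RefCountedNode *node) { node->ref_count.fetch_sub(1); }

// A node that has been logically deleted can only be physically freed
// once no reader is pinning it.
bool SafeToReclaim(const RefCountedNode *node) {
  return node->ref_count.load() == 0;
}
```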
The first observation is that we don't actually care what the exact value of the counter is. As long as I know it's not zero, I don't care. So it's not super important that every thread adds exactly one to the counter; we just need to know that somebody is there. The other observation, and this is more of an implementation point, is that just because the counter goes to zero doesn't mean we need to free that memory and reuse it at that exact moment. The reason I say this is an implementation detail is that it depends on how you implement garbage collection in your index. In the BW-Tree, you can do it with this epoch scheme and have a separate background thread clean things up. Or, as we described for MVCC, you could have a dedicated background thread do vacuuming and clean up the objects. Or you could do cooperative cleaning, where some thread recognizes that nobody else could be accessing this object, so it goes ahead and deletes it. Either one is okay; the point is that when you design your garbage collection, you don't have to think, I must delete this the moment I know the counter is zero or that nobody could access it. In our implementation of the BW-Tree in Peloton, we do cooperative garbage collection inside the tree. But we don't do that for performance reasons or any sound reasons; we do it for stupid software engineering reasons, which I can talk about offline. So let's see how to do this better. Epoch-based garbage collection is what you read about in the BW-Tree paper, and the basic idea is that there's some global epoch counter that's periodically updated; you can do it every 10 milliseconds, and in the Silo paper they did it every 40 milliseconds. Any time a thread enters the index to do some operation, an insert, a delete, or a lookup, it registers with some global epoch manager to say, I'm a thread entering the index at this epoch, whatever the current epoch is. The thread does whatever operation it needs to do, and when it's done, it removes itself from the epoch it entered in. If your thread makes a change where it says, all right, there's a node that I've logically deleted and it should be garbage collected at some point, you mark that node in the epoch garbage collector to say, this node was deleted at this epoch. Then the garbage collection mechanism can reclaim that node once it knows that no other thread exists in that epoch or any epoch prior to it. So say I'm doing epochs every 10 milliseconds. For whatever reason, one thread enters at epoch one, but it takes 20 milliseconds to do whatever it wants to do. It comes in at epoch one; then at epoch two, we switch over, and another thread comes in and deletes something. Then at epoch three, there's no thread in epoch two, but there is still a thread in epoch one. That thread could still see the node that got deleted, because you don't know what pointer it's holding that could jump to that location in memory, so we can't delete it. Even though there isn't anybody in epoch two anymore, there's still somebody in epoch one, and we have to wait for him to finish.
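Here's a rough sketch of that epoch bookkeeping, assuming a background thread bumps the global epoch every 10 milliseconds or so. This is just to show the shape of the idea, not how the BW-Tree or Peloton actually implements it; a real version would recycle epoch slots and protect the deferred-delete list properly:

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Global epoch, bumped by a background thread (say, every 10 ms).
std::atomic<uint64_t> global_epoch{0};

// How many threads are currently "inside" each epoch. A real implementation
// would recycle these slots instead of using one big flat array.
constexpr int kMaxEpochs = 4096;
std::atomic<int> active_count[kMaxEpochs];

struct DeferredDelete {
  void *node;       // logically deleted node awaiting physical deletion
  uint64_t epoch;   // epoch in which the logical delete happened
};
std::vector<DeferredDelete> garbage;  // protect with a latch in real code

// Every index operation is wrapped in Enter/Exit.
uint64_t EnterEpoch() {
  uint64_t e = global_epoch.load();
  active_count[e].fetch_add(1);
  return e;
}
void ExitEpoch(uint64_t e) { active_count[e].fetch_sub(1); }

// Called when a thread logically deletes a node.
void DeferDelete(void *node) { garbage.push_back({node, global_epoch.load()}); }

// A node deleted in epoch E can be physically freed only once no thread is
// still active in E or in any earlier epoch.
bool SafeToFree(const DeferredDelete &d) {
  for (uint64_t e = 0; e <= d.epoch; e++) {
    if (active_count[e].load() > 0) return false;
  }
  return true;
}
```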
So that's what it means by waiting for all preceding epochs. This is what you're going to want to implement in project two. There's a bunch of literature online about how to do this; outside the context of databases, it's called read-copy-update, or RCU. This is what the Linux kernel uses for internal garbage collection of its data structures. In general, the conventional wisdom right now is that this is the best approach for an in-memory index. All right, so let's talk about some other things, starting with how to handle non-unique indexes. In any intro database class, we teach you about B+ trees and say, here's a key and a value, you just insert it; there are arrays where you store the keys and the values, and things are super easy. That assumes unique indexes or primary key indexes. But if you have to support non-unique indexes, where you can have duplicate keys, things get a little tricky. There's a great book, available online, from Goetz Graefe, the guy who wrote the paper you read about locking and latching in indexes. He wrote a book in 2010 called Modern B-Tree Techniques, and a lot of what I'm talking about here comes from it. I find it really fascinating; it's definitely a good read. The two ways we can handle non-unique indexes are either to store duplicate keys, along with the additional values we need as pointers to the different tuples, or to store each key only once and keep a list of all the values that correspond to that key. Let's look at an example. We won't do skip lists; let's do B+ trees, because that's the easiest to understand for this. There's a bunch of metadata at the top of a B+ tree page that for our purposes we don't care about. Here we have an array for the keys and an array for their values. Key one is duplicated three times, and whatever offset you're at in the key array corresponds to the value, the pointer to the tuple for that key instance. So if you're at the third instance of the key, you jump to the third offset in the value list, and that gives you your pointer. This is pretty simple to implement. The tricky thing, of course, is that when you want to delete a key, you always have to pass in the pointer or the value with it, because you need to know which instance you're actually deleting; otherwise you'd delete all the instances of key one. The alternative approach is to have the keys in a separate array where every key appears only once, with an indirection layer that says, if you want key one, it's in the first position, and that points to some linked list or array that gives you all the values corresponding to that single key. From what I can tell, the duplicate-keys approach is the most common one, because it doesn't require any change to your actual data structure; whether it's unique keys or non-unique keys, you don't have to maintain a separate value list. Most systems implement it this way, and that's what we do as well; a rough sketch of the alternative, key-once layout is below.
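This is a rough sketch of that second layout, where each key is stored once and points at a list of values. The names and the flat in-node layout are illustrative only; as noted above, most systems (including ours) just store duplicate keys instead:

```cpp
#include <cstdint>
#include <vector>

// One entry per distinct key; the value list holds every record id
// (tuple pointer) that has this key.
struct NonUniqueEntry {
  uint64_t key;
  std::vector<uint64_t> record_ids;
};

// Deleting a single (key, value) pair has to pass the value too, since the
// key alone is ambiguous once duplicates are allowed.
bool DeleteEntry(std::vector<NonUniqueEntry> &node, uint64_t key, uint64_t rid) {
  for (auto &entry : node) {
    if (entry.key != key) continue;
    for (size_t i = 0; i < entry.record_ids.size(); i++) {
      if (entry.record_ids[i] == rid) {
        entry.record_ids.erase(entry.record_ids.begin() + i);
        return true;
      }
    }
  }
  return false;
}
```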
The next thing we have to deal with is variable-length keys. In the intro class we make it super easy and say everything is a 64-bit integer, so you never have to care about variable lengths. But people index on strings all the time, so we have to deal with it. There are essentially four approaches, and really only the last one is what people use; the others are just to show you there are other ways to do this, even though they're not good. The first approach is essentially what we did in T-Trees: we don't store the key in the index at all, just a pointer to the tuple, and we find the key there. This is nice because our pointers are always 64 bits, fixed length, and we don't have to do anything special. But as we said before, you pay a big cache locality hit, because every single time I need to compare a key against the thing I'm probing with, I have to go look up the tuple, bring it into my CPU cache, and then do the comparison. So nobody does this. The second approach is to support variable-length nodes, where the size of each node can be different. But that runs into the problem we talked about with object pools: now you can't reuse objects easily. Nobody does this either. The third approach is padding. If I'm indexing a VARCHAR(32) and I put a 16-character string into it, I just pad it out to the full 32 characters, and I record the actual length so that when I do a comparison I can ignore the padding. Nobody does this because it's wasteful: people often declare VARCHAR(1024) when they only ever store two characters, so you'd waste a ton of space. What people do instead is have a key map, an indirection layer, which is basically a list of fixed-length pointers that map to some offset within our node or page where the actual key is stored. So it looks like this: I have my key map, and these are all fixed-length offsets into my page, and down below I have my keys and values. If I'm looking for the first key, I follow this pointer to the offset and then do whatever comparison I want. The key map is kept sorted in the lexicographical order defined by the keys, but the actual keys can be laid out in memory in any order you want. That's because as you delete keys, you want to be able to reuse those slots without always appending to the bottom. So Obama might get deleted, and, for better or worse, we have to insert Trump here. We'd want to keep the key map sorted so that Prashant appears before Trump, but in actual physical memory, Trump would appear first. It's sort of a bin-packing problem, figuring out how to fill in the holes as you delete things. Question: how is this different from using pointers to the keys? The difference is that this is all within a single page, so it's not a far jump for me to get to the key. And you actually bring up a very good point: it's very likely that I can bring this all in with the hardware prefetcher, because I know I'm going to scan everything, and this is essentially just a byte array.
I've been describing it as if it were a Java object with data members, but it's just a byte array. So I fetch it, and the hardware prefetcher can bring in everything else because it's all contiguous. Whereas in the T-Tree model, I'm jumping to some random location in memory, and the hardware prefetcher doesn't know anything about that. And these are just offsets? Yeah, these could be 8-bit or 16-bit offsets, because you're only jumping within the same page. Next question: if you want to search for Andy, do you have to go through the indirection layer and then follow it to the offset? Right: if I want to do a lookup on Andy, I basically look in this key map, and I can do binary search because it's sorted. So I take the middle entry and land on Obama; Andy is less than Obama, so I know which way to go. You don't have to do a linear scan. But yes, as he points out, every single time you have an indirection jump. So a very common technique is to also store a prefix of the key up in the key map itself. Now when I look up Andy, I do the binary search, see Obama, A is less than O, so I don't go there, and poof, there's the entry I want. I did one less indirection to make that happen, and I can do this very quickly because it's all in my CPU caches. Next question: isn't this still wasteful, because the page needs to be as big as the number of key-value entries the node can store times the maximum size of a value? So the question is, do I need to allocate space to accommodate the largest key I could possibly insert? Typically, no. If it's a CHAR, then yes, you allocate the full size because every value is always exactly that size. If it's a VARCHAR, it's variable-length and you just allocate as you need. Now if it gets too big, if I insert something between Andy and Obama that's 1024 bytes and I can't fit it all in a single node, then I have to split the node, and how you do that depends on the index. For a B+ tree, you could instead keep a pointer to an overflow chain rather than logically splitting the node; both approaches work.
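To make this node layout concrete, here's a rough sketch of the key-map idea: fixed-size slots, kept in sorted key order, pointing into a byte array where the variable-length keys actually live. The sizes and names are illustrative, not anyone's real page format:

```cpp
#include <cstdint>
#include <string>

struct Slot {
  uint16_t offset;   // where the key starts inside the node's byte array
  uint16_t length;   // length of the key
  // a small fixed-size key prefix could also be cached here to skip the jump
};

struct SlottedNode {
  Slot key_map[64];        // sorted lexicographically by the keys they point to
  uint16_t num_keys = 0;
  uint8_t data[4096];      // keys (and values) packed in arbitrary physical order
  uint16_t data_end = 0;   // bump-allocated in this sketch; holes are not reused
};

// Binary search over the key map; only the probed slots' keys are ever read.
int FindSlot(const SlottedNode &node, const std::string &key) {
  int lo = 0, hi = node.num_keys - 1;
  while (lo <= hi) {
    int mid = (lo + hi) / 2;
    const Slot &s = node.key_map[mid];
    int cmp = key.compare(0, key.size(),
                          reinterpret_cast<const char *>(&node.data[s.offset]),
                          s.length);
    if (cmp == 0) return mid;
    if (cmp < 0) hi = mid - 1; else lo = mid + 1;
  }
  return -1;  // not found
}
```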
Another question: do the key map and the key values have to be put together in a single node? No, but from a software engineering standpoint, if you had one chunk of memory over here for your key-value data and your key map over there, coordinating all of that would be a big pain. So you just have a single node: it's a byte array, you know the boundary, you know the size, and you know what you can put in it. But then every time you insert something into the key map, does the offset for the value change? Why would it change? It's the same node, and the key stays in the same place. Say I want to insert Trump down here; actually, let's pick somebody better than Trump. Tupac, thank you, excellent. Tupac's dead, though. But fine, we insert Tupac down here; that's a better example, because we just append it. Or somebody between A and O: Gandhi. Gandhi logically should go between Andy and Obama, but we can insert him down at the bottom, assuming we have free memory, and then we just reshuffle the key map. The key map has to be sorted based on the keys, but the physical key storage does not. Does that answer your question? Okay. So the last thing is how to deal with compression. We'll talk about database compression, compressing the actual tuples themselves, in a few more classes; there are also ways to compress indexes, and this is one of them. There are two observations we can make about how to reduce the amount of data we have to store. The first is in the inner nodes, where we can reduce the size of the keys we maintain to figure out whether to go left or right, to route probes and searches in the right direction. In this case, I have A through K as one key and L through V as another. These two keys are pretty distinct, so I don't need to store the entire key: instead of storing A, B, C, D, E, F, G, all the way to K, I can just store the first few characters, and that's enough to figure out whether we want to go left or right. We can get away with this because, remember, in the B+ tree the inner-node keys are just copies; the real keys are stored down in the leaf nodes, where we do have to keep the entire thing to do the correct comparison. But up in the inner nodes we don't, so we can cut down on the space. The second optimization, down in the leaf nodes, is to recognize that since our keys are sorted in lexicographical order, there's generally going to be a lot of overlap in the first few characters or digits of neighboring keys. Say in this leaf node I have Andre, Andy, and Annie. We can recognize that "An" is the first two characters of all three keys, so I can have a little prefix field that says "An" and store only the unique portions of the keys in the actual key slots. So when I come down and want to check whether Andy exists, I take the prefix, put it in front of the "dy", and that reconstructs my original key.
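A tiny sketch of what that leaf-node prefix compression looks like, again just for illustration and with a linear scan standing in for whatever search the real node would do:

```cpp
#include <string>
#include <vector>

// The leaf stores the shared prefix once, and each key slot only keeps the
// part that differs: "dre", "dy", "nie" for Andre/Andy/Annie with prefix "An".
struct CompressedLeaf {
  std::string prefix;
  std::vector<std::string> suffixes;
};

// To probe, strip the prefix off the search key and compare the remainder.
bool Contains(const CompressedLeaf &leaf, const std::string &key) {
  if (key.compare(0, leaf.prefix.size(), leaf.prefix) != 0) return false;
  std::string rest = key.substr(leaf.prefix.size());
  for (const auto &suffix : leaf.suffixes) {
    if (suffix == rest) return true;
  }
  return false;
}
```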
You can do the same thing for the record IDs as well: if the first few digits of your record IDs are the same because they point into the same page, for example, you can do the same kind of deduplication. So this is also very common. Compression in general isn't used widely in in-memory database systems for OLTP applications, because the overhead is often deemed not worth it, but I would say prefix compression is pretty common. Okay? So the reason I wanted to start off with these techniques is that now we can see how, in the context of the ART index, a lot of this doesn't matter anymore. The garbage collection does, the memory pool stuff does, but the prefix stuff not so much, because the way it organizes the keys themselves is drastically different. The adaptive radix tree, the ART index, is the default index you get in HyPer now. I think they originally used a B+ tree or a red-black tree, and then they switched to the radix tree. The basic idea is that instead of storing complete copies of the keys at different nodes, it only stores single digits, or subsequences of digits, of the key. And by digit I don't mean a single Arabic numeral; I mean one character of a string or one byte of an integer. This lets it do digit-by-digit comparisons as you traverse the tree, rather than a complete comparison of the entire key. The radix tree has some really interesting properties that are quite different from everything we've seen in the skip list, the B+ tree, and the BW-Tree. The first is that the height of the tree depends not on the number of keys but on the length of the keys. In a B+ tree, a tree with 10 million keys is going to be much taller than a tree with 10 keys. In the radix tree, I could have 10 keys each a million digits long, and that would determine the height of the tree. The other interesting thing is that it doesn't require any rebalancing: we don't have to do splits and merges as we change the index. And because we're not storing the full key in the index in its complete form, the way you reconstruct a key is by traversing the index itself: the paths from the root to the leaves are how you regenerate the keys. This is why we don't provide you a dump-keys function or a nice pretty-printer for your skip list or your index, because we use the radix tree, the ART index, in Peloton, and you'd have to do a full traversal of the tree to reproduce all the keys. We could do that, we just haven't; there wasn't a quick way for us to do it, so we don't provide that method. Related to this, the keys are stored implicitly inside the paths of the tree rather than as complete copies. Quick show of hands: who here knows what a trie is, or has heard of a trie before? All right, good, a lot of people. Who here has heard of a radix tree before? Well, you read the paper, so maybe a better way to ask is: who had heard of a radix tree before reading the paper? Two in the back, three, four. Okay, good. Well, you hang out with me, so that's fine. All right.
A radix tree is sometimes called a digital tree, and it's a variant of what's called a trie. The trie was introduced, I think, in 1959 by some French guy, and then a couple of years later this other guy, Edward Fredkin, coined the term trie as a way to distinguish it from a tree; the term comes from "retrieval". Fredkin, apparently, is still faculty at CMU, but he lives in Boston. I've never met him, but supposedly he's here. Anyway, so this is a trie. We're going to store three keys: hello, hat, and have. What we do is break up the keys into digits, the single characters, and those are represented in the tree structure. Any time there's an overlap, the same character at the same position for different keys, they can reuse that character. All three keys here start with h, so h is at the root. Hello then has e-l-l-o, which is distinct from hat and have, so it gets its own path in the tree, H-E-L-L-O, and at the bottom you'd have a pointer to the tuple it belongs to. Hat and have both share the character a in the second position, so as you come down the path they reuse it, and then they split off into t and v-e for the two different keys. A radix tree is just a compressed form of this. It's the same idea, but now, any time you have a path that isn't shared with anybody else, you don't need an individual node for every single digit; you just have a single node for the unique part. So again, all three keys start with h, and that's at the root, but for hello, "ello" is stored in one node in its complete form; I don't need a separate node for each character with pointers all the way down. This is a radix tree. Sometimes it's called a Patricia tree, but in practice it's always called a radix tree. Now, what makes ART different is that, just like with the skip list where I showed you a high-level diagram with edges and nodes and so on, that's not how you actually want to store it in memory in a real system. What ART contributes is showing how you can represent a radix tree in an efficient form that's designed for modern CPUs and modern architectures. The adaptive part of the ART index is that it supports different node types based on how the keys get inserted and how things get restructured. The other cool thing about radix trees is that the layout of the tree is deterministic: if I insert the same set of keys, no matter what order I insert them in and with how many threads, I always end up with the same structure. Whereas in a B+ tree or BW-Tree or skip list, different threads inserting things at different times may cause splits and merges to happen in different ways than if I ran it a second time with a different number of threads. In a radix tree it's always exactly the same, and that lets them define rules for when you grow and collapse nodes in the tree.
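Here's a deliberately naive sketch of the uncompressed trie version, just to show the digit-at-a-time traversal where the key is never stored explicitly. Note that a full 256-entry child array in every node is exactly the space waste that ART's adaptive node types (Node4/16/48/256 in the paper) are designed to avoid:

```cpp
#include <cstdint>
#include <cstddef>
#include <memory>

// One node per digit (here, one byte of the key). The key itself is never
// stored; it is implied by the path from the root.
struct TrieNode {
  std::unique_ptr<TrieNode> children[256];
  void *value = nullptr;   // non-null only if a key ends at this node
};

void *Lookup(const TrieNode *root, const uint8_t *key, size_t len) {
  const TrieNode *node = root;
  for (size_t i = 0; i < len; i++) {
    node = node->children[key[i]].get();   // one digit comparison per level
    if (node == nullptr) return nullptr;   // no key with this prefix exists
  }
  return node->value;
}
```

Notice the property from above: the depth of the traversal is the length of the key, not a function of how many keys are stored.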
The other idea is that you want to pack as many digits together as possible to maximize cache locality. Just like before, when someone asked whether I could store the key map and the value list in separate locations in memory: in practice, yes, but things that are used together often should be kept close together; otherwise you're doing all these separate fetches. So this is how you'd actually represent the same three keys from before inside an ART index. The first thing you see is that at the second level, this is a single contiguous block of memory, one node: it has the digits for "ello" and the digit "a" down here. And the same for hat and have: "v", "e", and "t" are stored in the same contiguous node. Now say I want to insert the key "hair". We'd put it here, because it shares h and a with the others, but the "ir" part is separate. I have space in this node, so I can go ahead and insert it. I don't allocate anything, I don't have to reorganize anything; I have space, I put it right in, and that's done very efficiently. Now say I want to delete hat and have. We blow those away, and the ART index has rules that say, all right, this thing is now a single unshared path, so I want to collapse it and pull it up into here. The only modifications I need to make are to the node I was modifying and its parent; I don't need to go all the way up the tree and condense everything. They also talk about having different node sizes, so you don't have to allocate the most memory you could possibly think you'd need, and everything is packed in a way that minimizes the amount of indirection in your index. The other really cool thing I like about ART is that it talks about how to represent any possible data type, any attribute type you could have in your database, in a radix tree. In the examples I was showing before, everything was a varchar, so it's obvious how you take every character and treat it as part of your path down the tree. But for other binary values, integers, floats, and so on, you can't just do that, or at least you can't do it easily on Intel CPUs. Intel CPUs are little endian, which means the least significant byte comes first. If we took the little-endian representation of an integer and stored it in our index, then going down we'd end up comparing the least significant byte first, which is not what you want: you'd get false negatives and false positives. I'll show an example of what I mean in a second, but basically they say that to make the radix tree work on Intel CPUs, you have to do some extra work to flip binary values around so they're representable in a trie. The easiest case to think about is unsigned integers: Intel CPUs are little endian, we need big endian, so we just swap the bytes and then store the digits going down. For signed integers you have to be a little careful: you have to flip the sign bit of the two's complement representation, which ensures that negative numbers always sort before positive numbers. Otherwise, going in little-endian order, you wouldn't know the number was negative until you hit the sign bit at the very end.
For floats it's a little more complicated: you basically classify the floating-point number by whether it's negative, positive, or denormalized, and then you store it as an unsigned fixed-point integer in big-endian order. For compound keys, you do the transformation for every element of the key and then concatenate them. So let's look at an example, because me making this hand gesture does not help explain it. Here's an integer key; assume it's unsigned, say 168 million. We represent it as hex, these byte sequences here. If we store it little endian, as Intel does, the first byte corresponds to the least significant byte. Big endian is flipped: the most significant byte comes first, which is what we want. So we take the value, flip it into big-endian form, and then we can store it in our tree. Now when we do a lookup, we start at the top and do a memcmp of this byte against whatever we're probing with, and that lets us determine right away, if we're checking less-than or greater-than, whether we want to go left or right, or whether we actually have a match. Otherwise we'd be starting with the low-order part of the value, and we wouldn't know until later whether we're actually greater than or less than. Say I want to do a lookup on 65,000, or 650,000, and the hex looks like this. If I'm in little-endian form, I'm comparing the byte 1D against 0D; 1D is greater than 0D, so I'd think this number is greater than the other one when it's not. That's why they have to put everything into big endian. We actually do this in Peloton as well. You'll see it when you start building your skip list: you call the index factory class and say, make me a skip list, and you'll see that it instantiates different template parameters for the key type, like the compact integers key, the generic key, or the tuple key. These are how we represent keys in our index, and our compact integers key essentially does the same kind of binary comparison that the radix tree, the ART index, is doing. Here's a simple microbenchmark I wrote last year, where I did inserts, lookups, and deletes on 10 million integer keys stored in the different key representations. The generic key basically takes the raw bytes, wraps them in a value object, and does a comparison through that object every time you traverse the index. I also wrote a faster comparator that, instead of copying the bytes and creating the value object, goes directly to the pointer and does the comparison; but it's still using the regular Intel instructions to compare field by field, rather than packing the bits together and doing one fast comparison. What you see is that the compact integer key representation and comparison is four or five times faster than the generic compare. Same data structure, same data, just different ways to represent the keys and do the comparison, and the binary comparison that ART does is much faster.
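Going back to those key transformations for a moment, here's a minimal sketch of the unsigned and signed integer cases, byte-swapping to big endian and flipping the sign bit so that memcmp order matches numeric order. The float and compound-key cases follow the same pattern but are messier; __builtin_bswap64 is a GCC/Clang builtin used here for brevity:

```cpp
#include <cstdint>

// Unsigned: just store the bytes most-significant first (big endian).
uint64_t ToBinaryComparable(uint64_t key) {
  return __builtin_bswap64(key);
}

// Signed: flip the sign bit of the two's complement value first, so that
// negative numbers sort before positive ones, then byte-swap as above.
uint64_t ToBinaryComparable(int64_t key) {
  uint64_t u = static_cast<uint64_t>(key) ^ (1ULL << 63);
  return __builtin_bswap64(u);
}
```

With this transformation, comparing the resulting byte strings one digit at a time from the front gives the same ordering as comparing the original integers, which is exactly what the digit-by-digit radix tree traversal needs.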
As I said before, the last class was about latch-free indexes. The way to think about these last three lectures is: first we talked about index locking and latching in the traditional sense; then we said, oh, people think you want to build latch-free indexes, so let's get rid of all that and do a skip list or a BW-Tree; and now we're bringing back the latches and saying, if you use a different data structure, you can get better performance while still using latches. So ART is not latch-free. In the paper you read, they talk about how it would be a significant amount of engineering work to retrofit the index and make it latch-free: you'd essentially have to bring in the mapping table, and possibly the delta records, that we saw in the BW-Tree. The bottom line is, they say it's hard to do and they didn't do it. Instead, they propose two different ways to do a concurrent index that aren't as bad as the traditional latch crabbing we saw with the B+ tree. They have optimistic lock coupling, which is the one we'll spend most of our time on, and they talk a little about this read-optimized write exclusion. Optimistic lock coupling is easy to understand and can be applied to basically any data structure. Read-optimized write exclusion requires a major change to your data structure to support the techniques they want; they describe how to do it in an ART index, although you could potentially do the same thing in a B+ tree. Okay, so the latch crabbing scheme we talked about before was a pessimistic approach: you assume you're going to have problems as you go down, so you acquire latches on the way, whether you're a reader or a writer. With optimistic lock coupling, the writers never get blocked by the readers, and it's up to the readers to figure out whether they read something that's no longer valid. You have to do this check before you're allowed to proceed to the next node or element in the tree. For this discussion, we won't do it on ART; we'll do it on a B+ tree, because that's easier to understand. Basically, every single node in the index now has a version number. If you're a writer modifying a node, you acquire a latch on it, make your change, and then increment that counter. Other threads can then figure out whether what they read in the past was the correct version. So let's look at an example. This is the same simple B+ tree we had before, but now we introduce version numbers for every single node. If I want to search for key 44 as a reader thread, I start at the root, and I don't acquire any latches to go down. All I have to do is record the version number that I read, do whatever lookup I need in the node to figure out whether I go left or right, and in this case, assume we go down here. I read this node's version, but before I'm allowed to proceed and say, yes, I followed the correct path, I go back and check whether the node I just came from is still at version three. If it is, then I know nobody changed anything.
If it's not, then I know somebody changed something, and the node I'm at now may no longer be valid; maybe there's another node I should have followed to get here. And again, this is why we have the epoch-based garbage collector: if I'm down in a node that's been marked as deleted, I technically shouldn't be able to get to it, but I don't want anybody to physically delete it while I'm still looking at it. As long as we do this in the context of epoch-based garbage collection, we won't have the problem of reading invalid memory. So assume v3 was still the correct version we read. Then we do whatever lookup we need in this node, figure out we need to go down here, read this node's version number, go back and check that the parent is still the same version, and then do our traversal here. And the same thing once we're done reading 44: we check whether we read the same version, and if so, we know we had a consistent view of the index on the way down and nobody modified anything. Super easy to understand; I really like this. Now let's look at an example with a writer in the mix. Go back and say that before we were able to finish checking that we're at v5, some other thread comes along, comes down here, and modifies something that causes it to modify this node. It has to acquire a latch on the node, because that prevents other writers from modifying it at the same time; the readers don't check for the latch, so that's fine. After it's done making its change, it increments the version number to v6. Now when the reader thread goes back and checks, did I read the right version, it sees that v5 is not the same as v6: somebody modified something, so I'm invalid, I have to abort my operation and retry. One downside of this is that it can lead to false or unnecessary aborts, because it may be that when the writer modified this node, it only changed what one pointer points to, not the pointer the reader actually followed. Technically the thread that got killed did the right thing, but because these version numbers are so coarse-grained, one per node, any modification to the node invalidates the whole thing, and any thread that read it has to fail. The alternative is read-optimized write exclusion, where the idea is that writers never interfere with readers: readers can always proceed, and only the writers have to acquire locks, to prevent other writers from making changes. Essentially this is like shadow paging or multi-versioning, where the writers create new versions in an atomic way, such that the changes become visible to readers all at once, without blocking anyone and without anyone reading inconsistent data. Again, this requires a real change to how the data structure actually works. In the paper, they talk about how you reorder your atomic operations when you update pointers and offsets in each node. In a B+ tree, you'd essentially do a copy-on-write scheme and atomically install new nodes.
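A rough sketch of the optimistic lock coupling read protocol described above: a per-node version counter, read it before and after using the node, and restart if it changed. The node layout and the helper that picks a child are placeholders, and the details of how writers set a lock bit inside the version word are omitted:

```cpp
#include <atomic>
#include <cstdint>

struct OlcNode {
  std::atomic<uint64_t> version{0};   // writers increment this after every change
  uint64_t separators[16] = {};       // placeholder fan-out; real nodes differ
  OlcNode *children[17] = {};
  int num_keys = 0;
};

// Placeholder for the normal "which child do I descend into" search.
OlcNode *PickChild(const OlcNode *node, uint64_t key) {
  for (int i = 0; i < node->num_keys; i++) {
    if (key < node->separators[i]) return node->children[i];
  }
  return node->children[node->num_keys];
}

// One optimistic descent step: remember the version, do the read, re-check the
// version. If it changed, a writer touched this node and the traversal restarts.
bool DescendOnce(const OlcNode *parent, OlcNode *&child_out, uint64_t key) {
  uint64_t before = parent->version.load();
  OlcNode *child = PickChild(parent, key);
  uint64_t after = parent->version.load();
  if (before != after) return false;   // validation failed: caller retries from the root
  child_out = child;
  return true;
}
```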
Again, I don't want to get into the details of this because we're short on time, but the basic idea is that you have to make fundamental changes to how threads modify the data structure in order to make that scheme work. Okay, so this is the same graph I showed you last class, and it puts everything we've talked about in perspective. The B+ tree and the ART index here are using optimistic lock coupling. Masstree does something similar, but not exactly the same; Masstree is basically a trie of trees, or a tree of tries, one of the two. It's a hybrid data structure, a combination of B+ trees and tries, so it's similar to the radix tree but not a pure trie. Again, this is just showing you that with optimistic lock coupling, the B+ tree can often outperform the BW-Tree, and it crushes the skip list. Oh, I forgot to check that; I think that's wrong, I think those two numbers should be flipped; the B+ tree should be faster. But the ART index, even though it's using locking, is beating all these latch-free data structures. The main takeaway is that if you organize your keys for fast key comparison, and your data structure supports efficient operations, then even though you're doing locking, the overhead of that is not detrimental to performance, especially in a multi-core environment. So the main thing to point out is that I was basically wrong about the BW-Tree. This has been a two-year process for me; I'm still in recovery trying to deal with this loss. I was super sold on the BW-Tree: the paper really clicked for me and said, this is exactly how you want to build a latch-free index. But the science has borne out that that's actually not the right way to go. I'm not saying the ART index is the end-all be-all, but I think there are ideas from the radix tree, and certainly the ones applied in Masstree, that are better for OLTP workloads. Question: why is the BW-Tree slow? I have a graph, but it's in the paper and I don't have time to show it. It's everything: the mapping table, the delta chains, replaying the delta records. You're just doing more work to deal with the fact that you're latch-free, and being latch-free doesn't buy you that much over an index that just uses latches carefully. Correct, yes: his point is, is the benefit of being latch-free for an in-memory index superseded by an index that just uses latches more carefully? Yes. All right, so in the last few minutes of class, I want to do a quick crash course on how to do profiling. Again, this is in the context of Peloton, but these ideas are applicable to really any system. Let's say we have a program with two functions, Foo and Bar, and our program is running slow, so we've got to figure out how to speed it up. The most naive thing to do, the dumbest thing, is to run the program in our favorite debugger, like GDB, and every so often just pause the program, print the backtrace, the stack trace of where it's executing, and record what functions we're in. We do this every so often, build up a table, and ask: where are we spending most of our time?
That's a dumb way to do it, and we'll see how to do it better, but say we do this, and we find that of the 10 call-stack samples we collected, six were in Foo. That means that, based on our sample, 60% of our time is spent in the function Foo. Obviously, if we had a little robot hitting pause over and over again, we could get more samples and a more accurate measurement, but for our purposes we find that 60% is in Foo, so that's what we want to look at. Now we figure out how we want to optimize our system and what kind of benefit we should expect. The idea is to avoid premature optimization: we don't want to say, I want to optimize this piece of code because it's written in a really stupid way, if that piece of code is barely ever called, because then we're not going to help anything. Anybody know what it's called when you try to figure out the performance benefit you'll get from optimizing a piece of code based on how often it's used? Amdahl's Law, right. Amdahl's Law is a nice little formula that says: if we know what fraction of our program's time is spent in a certain part, what kind of speedup should we expect from improving it? So say Foo does something really, really stupid and we can rewrite it to be two times faster. What should the expected overall improvement be? Since 60% of our time is spent in Foo, that 60% of the time gets cut in half; the 40% of the time spent in Bar is clearly unaffected because we didn't modify that code. You plug the percentage of time and the speedup into the formula, and it tells you that when we make Foo two times faster, the overall program becomes roughly 1.4 times faster, because Foo only accounts for 60% of the execution time. So the idea of profiling is that we want to figure out where we're spending all our time, what kind of optimizations we think we can do to speed it up, and that tells us what kind of performance benefit we'll get and where we should actually spend our effort.
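For reference, the arithmetic behind that expected speedup, with p the fraction of time spent in the optimized code and s the speedup of that code:

```latex
\text{speedup}_{\text{total}} \;=\; \frac{1}{(1-p) + \frac{p}{s}}
  \;=\; \frac{1}{(1-0.6) + \frac{0.6}{2}} \;=\; \frac{1}{0.7} \;\approx\; 1.43
```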
As I said, having your hand, or a little robot, pause GDB over and over again is really stupid; nobody does that, and we have tools to automate it. The two tools you can use are Valgrind and Perf. Valgrind is a heavyweight instrumentation framework: think of it as running your program inside something like a VM that records all the instructions your program executes, and then gives you nice call graphs showing where you're actually spending your time. Valgrind has a bunch of other tools I'll mention in a moment, and the nice thing about it is that it has visualization tools that make traversing these reports easier. Perf is more lightweight. It works by relying on the low-level hardware performance counters that your CPU and the kernel provide to track the different events that occur in your program. As far as I know, and maybe this has gotten better over the years, it basically has a console-oriented visualization tool for looking at the results. So with Valgrind, think of it as a framework that you can build analysis tools on top of to look at different aspects of the system. The one we'll talk about here is Callgrind; you can use Memcheck to look for memory leaks, and Massif is a way to look at the overall usage of memory in your heap over time and get visualizations of which data structures are using the most memory. To use Callgrind, you basically invoke the command shown here, and you can run it on any arbitrary binary: you can run your skip list test to see where you're spending time in your skip list, or you can run it on the actual full system. What happens is it spits out a callgrind output file tagged with the PID of the process that ran, and then you can use a tool like KCachegrind to generate a visualization of your program. It looks something like this: on one side you have the distribution of where the system spent its time, in terms of instructions executed, and then you have the call graph structure showing what function called what, how many times things were invoked, and what percentage of the instructions were executed in each function. What I also like about Callgrind is that you can go inside a function, look at the actual lines of code, and see how many times different parts were executed. It takes some getting used to, because the compiler can reorganize and rename things a bit, but over time you learn to decipher it more easily. For this you want to compile with debugging symbols turned on, because otherwise you won't be able to map the functions back to the source code, and the same goes for the call graph view; you can drill down into each of these and see more information. With Perf, as I said, it relies on hardware counters. The way you use it is that you pass in how often you want to collect samples and what program you want to run, and it spits out a data file that you can load back in with the perf command to get a visualization. So after you run perf record and collect the data file, you run perf report, and that gives you a console view showing all the different functions where you spent your time, and you can drill down into each one for more detail. Callgrind is much more heavyweight than Perf, but Callgrind gives you more accurate numbers. Perf you can use for quick-and-dirty checks, but if you really want to know where you're spending your time, Callgrind is probably better. The other thing about Perf that's kind of cool is that it can also record things like the number of cache misses you have and the number of times you mispredict a branch, so you can get a sense of why your program might be running slow, rather than just where you're spending your instructions. So you really want both of these tools. There are a lot of tutorials online on how to use them.
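As a concrete starting point, invocations along these lines match what's described above; the binary name here is just a stand-in for whatever test you want to profile, and the callgrind output file name includes the actual PID of your run:

```sh
# Heavyweight, instruction-level: run under Callgrind, then view the report.
valgrind --tool=callgrind ./skip_list_test
kcachegrind callgrind.out.<pid>

# Lightweight, sampling with hardware counters: record, then browse the report.
perf record -g ./skip_list_test
perf report
```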
I think we have a wiki page that describes how to set this up in Peloton and run it, but I'll post these slides as well, and you'll be able to run the same instructions on your skip list and see how it works, okay? Any questions about Perf or Callgrind? Who here has used either of these tools before this class? All right, good, a good number. Okay. So that's it for indexes. Next class, we're going to start talking about the core storage of the database system. We'll talk about the different layouts you have to support in your tables, like row store versus column store; we'll talk about how you organize things in memory; and then we'll spend a little time on how you actually maintain the metadata about your database in your catalogs. That's the one I'm very interested in right now, because we're building our own catalog in our own system, and we went to see how other systems like MySQL do it and thought, oh my gosh, that's terrible, we can do it better. So I'll talk a little bit about what they do, okay?