So, two quick things about project number one. As a reminder, it's due a week from now. AutoLab should be working, so you should be able to submit it and get back a grade and a log file that says what happened to your submission. Some of you have already submitted and gotten a score, so that's good. The other thing to announce is that on Thursday last week I sent out a notice on Piazza that everyone in the course needs to sign up for a project group for project number two, which comes out a week from today. Some of you have listed yourself in a group, some of you listed yourself as free agents, and some of you have not put your name down at all. So, I'll give you a few more days. If you're not in a group, then I will just randomly assign you to a group, and it's up to you to figure out how to make that work. It means that if the person that you're working with has body odor issues or other problems, you're stuck with them for all of project two. So, it's up to you now to figure out what group you want to be in so you don't have this problem later on. So, any questions about project one? Any questions about signing up to be in a group for project two? And probably what will happen is whatever group you're in for project two will just roll over into project three, and that'll be the group you're in for the final project. So, choose your partners wisely, okay? All right, so for today's class, we're gonna start talking about indexes. The next component we need in our database system is a way to access data very quickly in order to execute queries and do the other operations we wanna do. Today's assigned reading was a survey of the index locking and latching techniques that can occur inside of a database system. 
But then going forward on Wednesday and Monday next week, we'll look at specialized indexes for the two different types of workloads we could have, OLAP and OLTP. So, for today's class, we're gonna start off talking about order preserving indexes, and I'll discuss at a high level some of the major ones that are out there. Then we'll go into details about locking and latching inside of an index, and for that part, we'll just assume that we're dealing with B plus tree indexes, because they're the most common type of index you can have in a database system. And then we'll finish off with a discussion and overview of different prison gang tattoos that you're gonna need to be aware of in case you end up in prison and need to be able to navigate those waters. Because a lot of people that get involved in databases end up in prison, so you need to know how to handle yourself, okay? So, I don't think we really need to define an index, right? Everyone should pretty much know. But the basic idea of an index is that we're gonna trade off paying additional write overhead and storage overhead for this index in order to make retrieval of the particular data items we need at runtime, as we execute queries, much faster, right? We could just do a sequential scan across the entire database and look for everything every single time. But if we need one particular item or a range of items, an index is gonna allow us to do that very efficiently. And in general, it's worth the overhead of having to maintain the index in order to get those faster queries. Now, there are some systems that we're not really gonna talk about that essentially do continuous full table scans all the time. So they just sort of have one cursor that's scanning through the entire table over and over again. And then if you're a thread that wants to access the table, you sort of jump on the train as it goes around. 
And you scan for whatever amount of data you need, and then you jump off when you're done. Nobody really does that in practice. Everyone pretty much uses indexes to find things very quickly. So the most common index that we have in database systems today is the venerable B plus tree. Before we go into the details of the B plus tree, I wanna spend some time differentiating between the B tree and the B plus tree, because sometimes these terms are used interchangeably. But nobody in practice actually implements a B tree. Everyone implements the B plus tree. So it's good to know what the difference is. And again, in a lot of the literature from the 1970s and early 1980s, there were B star trees, B link trees, B trees; there were all these different types of trees. But in the end, it's the B plus tree that everyone uses. So the two major classes of index data structures we can have are the order preserving indexes, which is where the B tree falls, and the hashing indexes. The order preserving indexes are essentially always gonna be some kind of tree structure where you maintain the keys of the index, or the attributes that you're indexing on, in some sorted order, right? And this is gonna allow us to do point queries, range scans, and both forward and reverse ordered scans, all within a single index. And it's gonna allow us to do all these searches in O(log n). An alternative is to use what's called a hashing index. And this is essentially building a hash table or an associative array that's gonna map single keys to individual elements, right? So you basically take the value you're trying to look up, you hash it, look in the hash table, and that tells you where that element is. These are not used that often as user-exposed indexes. And what I mean by that is if you call create index, you're most likely gonna get an order preserving index. You're not gonna get a hash index. And I'm gonna take a guess why. 
Let's say I buy a new database system, and I paid a lot of money. I call create index, and underneath the covers, unbeknownst to me, it creates a hash index. And the first thing I do is a select count star between some range, right? Can that use the hash index? No, right, because you can't do range scans on a hash index. So by default, most database systems are gonna give you an order preserving index. It's only if you explicitly ask for a hash index that you'll get one. Now internally, as you saw in the hash join project, the database system is gonna be using hash tables all the time. So it's definitely good to have a good hash table implementation. And we'll discuss next week some of the different, more modern variants of it. But in practice, unless you've really asked for it, you don't get a hash index. And some database systems actually don't even offer support for it, for that very reason, because they don't want people to trip themselves up. The nice thing about the hash index is that although you can only do equality predicates, you can do searches in O(1), right? Because you can immediately go get the one thing that you need without having to traverse anything else. So for this class, we're mostly gonna focus on the order-preserving indexes. For Wednesday's class, that'll also be order-preserving indexes. And then for Monday next week, we'll spend some time talking about hash indexes, because this is what you would wanna use in OLAP queries when you run hash joins and other things, right? You can build temporary hash tables or hash indexes on the fly to make your join go faster. That's essentially what you're doing in the hash join operator. Okay, so the original B tree was developed in 1972 by a bunch of Germans. And the basic idea of a B tree is that it's a balanced tree where we're gonna store the keys and values throughout all the nodes of the tree. 
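Before going further into the B tree, the hash-versus-order-preserving tradeoff above can be sketched like this. This is a toy Python illustration, not anything from the lecture: `hash_index` and `sorted_keys` are stand-in names, with a plain sorted list playing the role of the tree index. A dict gives O(1) expected point lookups but can't answer the range predicate; the sorted structure answers both.

```python
import bisect

hash_index = {}   # key -> record id, O(1) expected point lookups
sorted_keys = []  # keys kept sorted, standing in for an order-preserving index

def insert(key, rid):
    hash_index[key] = rid
    bisect.insort(sorted_keys, key)   # keep keys in sorted order

def point_lookup(key):
    return hash_index.get(key)        # hash index: one probe

def range_scan(lo, hi):
    # Only possible because keys are order-preserving: binary search to the
    # start of the range, then scan forward. Roughly O(log n + k).
    i = bisect.bisect_left(sorted_keys, lo)
    j = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[i:j]
```

The hash table has no way to enumerate "all keys between lo and hi" without scanning every bucket, which is exactly why `create index` defaults to a tree.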
Remember, in our case, because we're dealing with in-memory databases, the values are always gonna be 64-bit pointers to the actual tuple in memory. In our case, they're 48-bit pointers, but you still have to represent that as a 64-bit integer. The difference between a B tree and a B plus tree is that the B plus tree is only gonna store the values, again, the memory pointers to the tuples, in the leaf nodes. There are not gonna be any pointers to tuples in the inner parts. So what this allows us to do is essentially use the inner nodes as guideposts as we traverse the index to find the data that we want in the leaf nodes, right? Whereas in the B tree, we could find things in any node. And so one of the advantages of the B plus tree over the B tree is that it's gonna be a lot easier for us to allow concurrent access, because we only have to worry about locking at the leaves of the tree. We don't have to worry about locking the upper parts. We still have to take latches as we go down so that nobody modifies the data structure as we're moving. But all the logical locks that we'll talk about, and that you read about in the survey paper, only need to be taken on the leaf nodes, right? So this is the key difference you need to understand: in a B tree, the memory pointers are everywhere, and in the B plus tree, they're only in the leaves. So typically when you take an introductory database course, we always draw some sample B plus tree that looks like this. And then we always draw the contents of the leaf node as an example like this. So at one end of the array and at the other end, you have pointers to the previous and next nodes along the leaves. And then in this inner part here, every other pair is a key value pair of the element we're indexing. So the first element will be the key and the second element will be the value, and so on going forward like that. 
In practice, this is not how indexes are actually implemented. Can everybody take a guess why? Yes? That's going to be a big part of it, but yes, variable length keys. So in our case, the value is always going to be 64 bits, but the keys can be arbitrary length. So creating an array like this is difficult to do, because now you're going to have to store what the size or the offset is for every single element. So in practice, and again, the values are just pointers, the nodes usually look something like this. There's a bunch of metadata we're going to maintain at the top. So, what level is the node in? In this case here, the leaf is always at the bottom. We'll keep track of the number of free slots that we have, and that way we can decide whether we need to split or merge the node on an insert or delete. And then we still have our pointers to the previous and next nodes. But now we're going to have two separate arrays, one for the keys and one for the values. So essentially, each offset in the key array corresponds to the same offset in the value array. So now it's really easy to say, all right, I'm at offset three for this key, and I know how to find the pointer I want for the tuple that I'm looking for. So in practice, this is usually how people implement it. But we're still assuming that the keys are fixed length. We'll deal with variable length keys in a second. This implementation is much more efficient and much better, because everything will be word aligned. So it's easier for the processor to run instructions to traverse and modify these things without worrying about misaligned byte accesses. So there are two design choices we have to deal with in our B plus tree. The first is that we need to deal with non-unique keys. Not every index is going to be a primary key index, and not every index is going to be a unique index. 
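The leaf layout just described, with header metadata, sibling pointers, and two parallel arrays, can be sketched in Python like this. It's my own illustration, not real B plus tree code: `LeafNode` and the capacity of 16 slots are assumptions, and the "pointers" are just integers standing in for 64-bit tuple addresses.

```python
import bisect

class LeafNode:
    """Toy leaf: keys[i] pairs with values[i] at the same offset."""
    def __init__(self):
        self.level = 0      # metadata: leaves are level 0
        self.prev = None    # pointer to previous leaf
        self.next = None    # pointer to next leaf
        self.keys = []      # sorted, fixed-length keys
        self.values = []    # values[i] = tuple pointer for keys[i]

    def free_slots(self, capacity=16):
        # Used to decide whether an insert/delete should split or merge.
        return capacity - len(self.keys)

    def insert(self, key, tuple_ptr):
        i = bisect.bisect_left(self.keys, key)  # keep keys sorted
        self.keys.insert(i, key)
        self.values.insert(i, tuple_ptr)

    def find(self, key):
        i = bisect.bisect_left(self.keys, key)  # binary search on key array
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]               # same offset in value array
        return None
```

The point of the two parallel arrays is that the binary search only ever touches the densely packed key array; the value array is read once, at the matching offset.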
So it's possible that we could have duplicate keys, where the same key maps to multiple values, so we need to be able to handle that. And then the example that Matt brought up is that you need to be able to handle variable length keys, right? Again, in the introductory database class, we always talk about keys being integers and things like that. So those are always going to be 32 bits or 64 bits. But certainly if you build an index on a string, like someone's last name, then that can be variable length. And so now our fixed length array may not always work. We're not going to talk about inverted indexes and how you do more complicated full text searches. Maybe that's something we'll discuss later in the class, but that's outside the scope of what we're dealing with here. For variable length keys, assume it's something like a small varchar field. So let's go through each of these and discuss what we have to do. For non-unique indexes, there are two approaches to dealing with this. The first is that you essentially just duplicate every key in your key array. You store it multiple times, and each copy has its own offset. The second approach is to have a secondary value list that you maintain separately for each key inside of the B plus tree node. In the duplicate key case, it's pretty straightforward. Again, this key one is duplicated three times, so we'll have three occurrences of it in our array, and each points to its own offset here. This is nice because it allows us to still do a binary search within the B plus tree node when we need to find the element that we're looking for. With the value list, we basically have a separate array for every single key, and somehow we're maintaining an offset from the key to the start of that array. So the advantage is that the key is only stored once, and then we have the separate array for all the values under it. 
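The two non-unique approaches just described can be sketched like this. This is a toy Python illustration of my own, not real node code: the hex values stand in for 64-bit tuple pointers, and a dict plays the role of the key-to-value-list offset mapping.

```python
# Approach 1: duplicate keys. The key appears once per value, so the key
# array stays a plain sorted array and binary search still works, at the
# cost of storing "k1" three times.
dup_keys   = ["k1", "k1", "k1", "k2"]
dup_values = [0x100, 0x104, 0x108, 0x200]

def lookup_dup(key):
    # Collect the value at every offset whose key matches.
    return [v for k, v in zip(dup_keys, dup_values) if k == key]

# Approach 2: value list. Each key is stored once and maps to a separate,
# variable-length array of values; the node has to manage that memory.
value_lists = {"k1": [0x100, 0x104, 0x108], "k2": [0x200]}

def lookup_list(key):
    return value_lists.get(key, [])
```

Both return the same answer; the difference is purely where the space and memory-management cost lands, which is the tradeoff discussed next.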
So in general, as with many things in databases, there's no one better way to do this, whether you use the duplicate keys or the value list. The duplicate key way is nice because it's really easy to find the thing you're looking for. But the downside is that you're essentially wasting space, because you're storing the key multiple times. In the case of the value list, it's nice because you only store the key once. But now you need to maintain these separate arrays and know how to link them, so that this offset for this key goes to this value list. And these value lists can obviously be variable length, so you have to deal with memory management within the B plus tree node. And the various database systems all do different things. So now we've got to deal with variable length keys. And there are essentially three approaches to doing this. The first is the easiest thing to do: instead of storing the actual key inside the node in that sorted array, we just store a pointer to the actual tuple. And then if we want to know the key's value, we have to follow the pointer to the tuple and look at the value. You could make the link point directly to the actual varchar attribute in the tuple. But in practice, that's usually a bad idea, because the indirection for those variable length fields makes it easier to move things around in your memory allocator. Whereas if you point directly to it, then you need to make sure that no index is pointing to your attribute when you shuffle things around. We'll see in a second how you actually do this in another index. But in practice, nobody does this first one. The other two approaches are more common. The first of those is to use variable length nodes. So instead of allocating every node to be a fixed size and filling it up with attributes as you go along, you allow the size of every B plus tree node to be whatever length it needs to be. 
This makes it a bit more tricky to do memory management, because now if you have to delete a node, you may not be able to put it back into a general pool and reuse the memory; you may have to free it and malloc it again, because it may not be the size that you need. And you may have to grow the node as well, and that gets expensive if you have to update pointers. The third approach is to use a key map. Essentially, we're going to embed an array of pointers inside of our B plus tree node that will then point to separate key value lists. So it's sort of the same thing as a value list, but now we have another level of indirection. So it looks like this. Now in our key map, instead of having the actual keys here, these are pointers to these lists, where the first element will always be the key we want, followed by the list of values. This is actually what's used in a lot of database systems. It's what DB2 does, and I think it's what MySQL does as well. And when you kind of squint at this key value map, what does this look like? What's another data organization scheme we talked about earlier that looks something like this? We have these sort of slots and you're pointing to something. I just gave it away. What's that? Slotted pages, right? Because we have our pointers pointing to some location in the node, in the byte stream for the node, that corresponds to where this element is. So again, in practice, this is what usually everybody does. So as I said, the B plus tree is probably the most common data structure that you use in databases. Pretty much every single database system that's out there that supports relations will provide you with a B plus tree. But it's not the only order preserving data structure that's out there. There are a lot of different alternatives, but no other one is used as often as the B plus tree. So I'm listing here some alternatives to the B plus tree that are used in some systems. 
T-trees, skip lists, radix trees, and something that's called Patricia trees. I actually looked this up. There's nobody named Patricia; it's actually an acronym that stands for something, but I forget what. There's also Masstree, which is used in Silo, and fractal trees, also known as streaming B plus trees, which are used in the Tokutek database system. So for this class, I'm only going to discuss T-trees and skip lists. We'll discuss radix trees and possibly Masstree next class. And fractal trees are primarily used in disk-based systems, so we're not really going to cover those. So we're going to focus on these two today. So T-trees came out in the 1980s from some pioneering work that was done at the University of Wisconsin. The University of Wisconsin people were doing some awesome stuff on early in-memory database systems. They had a little prototype, and they explored a lot of different things like query processing, indexing, join algorithms, and things of that nature in the system. But it was never really commercialized or made into a real system that other people could use. Because certainly in the 1980s, we still weren't quite ready for having a database fit entirely inside of DRAM. So later on in the 1990s, when the first, what I'll call, full-featured in-memory database systems came out, TimesTen was the first one, and AT&T's DataBlitz, also known as Dali, was the other one, they all ended up using T-trees. Because T-trees were designed specifically for main memory databases. They were designed to reduce the amount of storage overhead you needed to maintain the index. They're based on AVL trees, which are another kind of order-preserving balanced tree. We'll go through what it looks like in a second. But again, the main thing a T-tree is going to do over a B plus tree is have less storage overhead, because it's not going to store the keys in the actual nodes themselves. 
It's going to store pointers back to the original tuples in the table heap. Remember we talked about that for the variable length keys? Instead of storing the keys, you store a pointer to the keys. The T-tree essentially does that. So let's look at what it looks like. Here's a high-level diagram of what the T-tree looks like, and we'll focus in on what these nodes look like. The first thing you see is that it gets the name T-tree because the data structure's nodes look somewhat like the letter T. So inside of a T-tree node, it's going to have a bunch of pointers. At the top, we're going to have a pointer to the parent node in our tree, and obviously, for the root, that can be null. And then at the bottom, we're going to have pointers to the left child in our tree and the right child in the tree. So in this case here, we'll point down to these guys. And if you're at a leaf node, then those are obviously null. But now this middle part here is going to have pointers to the actual tuples themselves. So this is going to be an array of pointers that are sorted based on the value of the key in the tuple that each pointer points to, in the sort order for that attribute. So say this T-tree is an index on an attribute ID. All these pointers will be sorted on their values of ID. So when you build it, you have to follow the pointer, figure out what the value is, and figure out what order it should be in. The other things we have at the beginning and the end of this array are called the node boundaries. These are actual key values, not pointers to the tuples, but actual values that correspond to the range that bounds the tuples that are within this node. So the min says this is a value less than any other key that appears in the array, and the max is something greater than any other value that appears here. 
And these are sort of guideposts that tell you whether you need to look down the left or right child when you're traversing the tree to find something. So if we take our key space, say here are all the possible keys we can have, and we order them from one to seven. Now when we apply this to the T-tree, we see our ordering looks like this. So it's not like the B plus tree anymore, where we'd have all our keys along the leaf nodes; our keys are going to be embedded throughout the entire index. So now, say I want to do a scan from two to five. Is that going to be good or bad in a T-tree? Compared to the B plus tree, is the T-tree going to be able to execute the scan more quickly? Shaking head no, why? Absolutely, so basically what's going to happen is I start off at the root and say I need to find two. Well, I start here and look at the min and the max. I would see the min for this node is four. I'm looking for two, so therefore I'm less than four, so I would go down here. And now I find my two, I access this guy. Now I can scan all of the values, but it's even worse, because now I've got to follow these pointers to actually get to the actual tuples themselves. And then I go down here to three and do the same thing, but then I have to traverse back up to get to four. So you're chasing pointers all around this tree just to do these scan operations. Whereas in the B plus tree, you could just go along the leaf nodes and get one after another after another. So this sucks for cache locality, because I'm recursively going up and down this tree and I'm basically throwing away all the pages that I may have in my lower CPU caches. It also makes it really difficult to do concurrency control, because, as I said, in the B tree it's hard because the data items could actually be in the inner nodes themselves, whereas in the B plus tree, everything's in the leaf nodes, so that's the only stuff we need to lock. 
Whereas in this case, in the T-tree, all the data items again could be in the inner nodes. So again, that makes it harder to have concurrent access without violating serializability guarantees. The main advantage of the T-tree is that it's going to use less memory, because you don't have to store the keys in the nodes. Again, think back to 1986: memory was still very limited, so if you could squeeze out some extra memory by reducing the size of your index, that made a big difference. You also have to recognize that back in the 1980s and 1990s, the speed difference between CPU-level caches and main memory was not quite as pronounced as it is now. So accessing stuff in L1 or L2 (L3 probably didn't exist then) was basically the same as going and reading something in DRAM. But now, in our modern chips, L1 and L2 are orders of magnitude faster than going to the memory controller and out to DRAM. So before, when we were chasing pointers, it didn't really matter that we were blowing away our L1 cache, because we were going to DRAM anyway, and it basically cost the same. Whereas now, if we can maximize the amount of data we reuse in our low-level caches, we get much better performance. And B plus trees essentially provide this to us for free, because if we're doing scans, we're going in order, and we're going to look at every single item in the page, one by one, before we go to the next one. So again, the disadvantage is that we're going to have to chase pointers, and that hurts our cache performance. It's also difficult to rebalance, which I'm not really going to talk about. But if you have to split and merge nodes, you have to rotate things, and this is a byproduct of being based on an AVL tree, and that's hard to do when you have concurrent access. And in general, again, because you have to lock the inner nodes of the T-tree, that makes it hard to do transactions in this thing. So the early main memory database systems in the 1990s all used T-trees. 
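The T-tree traversal walked through above can be sketched in Python like this. This is my own rough illustration, not actual T-tree code: `TTreeNode` and the dict-based "heap" (tuple id → key value) are assumptions, and the node stores only tuple pointers plus min/max boundary keys, so comparisons inside a node must chase pointers back into the heap.

```python
class TTreeNode:
    def __init__(self, heap, tuple_ids):
        self.heap = heap
        # Pointer array, sorted by the key each pointer refers to.
        self.ptrs = sorted(tuple_ids, key=lambda t: heap[t])
        # Node boundaries: actual key values bounding this node's tuples.
        self.min = heap[self.ptrs[0]]
        self.max = heap[self.ptrs[-1]]
        self.left = self.right = None   # child pointers (parent omitted)

def ttree_find(node, key):
    while node is not None:
        if key < node.min:
            node = node.left            # boundary guidepost: go left
        elif key > node.max:
            node = node.right           # boundary guidepost: go right
        else:
            # Within bounds: chase every pointer back to the heap to
            # compare the actual key -- this is the cache-unfriendly part.
            for t in node.ptrs:
                if node.heap[t] == key:
                    return t
            return None
    return None
```

Notice that even a point lookup dereferences a heap pointer per comparison; a range scan additionally has to walk back up through parent nodes, which is the pointer chasing the lecture calls out.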
Nobody basically uses them today that I know of. The only one that I know of is eXtremeDB, but I think they're targeting really memory constrained devices. TimesTen is probably the most famous in-memory database; they used T-trees in the 90s, then they threw them away, and now they're using B plus trees. Everyone pretty much uses B plus trees. Nobody still uses T-trees, because again, the pointer chasing is what's going to kill you. OK, so another index that has actually become quite popular as an alternative to the B plus tree is the skip list. Most famously, this is the index data structure that's used in MemSQL. When MemSQL came out, they announced that they're not going to use B plus trees; they're going to use skip lists entirely. So the way to think of a skip list is that it's a probabilistic data structure, where the organization of the index is going to be based on some random coin toss. You're going to essentially build these layers of linked lists, and the number of elements, and which elements you have in the upper levels of the index, will be determined by a random number generator. And so at a high level, it's going to end up looking like a B plus tree. But instead of having these different pointers all the way down, you're only picking some of the keys that you want to index in the top parts. I'll show what it looks like in a second. So at the lowest level of the skip list, you're going to have every key in your key space, organized as a sorted singly linked list. And then above that, at the next level, you're going to have another linked list, but it's going to have half the keys of the level below it. And likewise, when you get to the third level, it's going to have half the keys from the one below it. So you're decreasing the amount of data you're storing as you go up in the index. 
So the way you're going to determine, when you insert a new key, how many levels to add it to, is basically you flip a coin, and you keep going until you get tails. The number of times you get heads before you hit tails determines how high in the index the key will go. So you see why it's a probabilistic structure: every time you build it, it could be different. The structure of the index could be different. It provides approximately O(log n) search times, which is equivalent to a B plus tree, but in practice, it can use less memory than a B plus tree. So let's look at an example. This is what a skip list looks like. At the bottom, you'll see this is the first level, and we're going to have a linked list that has every single key in our key space. So we have keys one, two, three, four, and six. And each key will have a pointer to the next guy, and they'll also have the value here, which in our case, again, is just a pointer to the address of the tuple in memory. Now in the next layer, we're going to have half the keys that are in this first one. And they'll have, again, the same linked list going from one element to the next, but they'll also have a pointer to the same key directly below them in the lower level. So you can sort of think of these as towers, where for key two, at every level where it appears, it'll always point down to the same key two instance in the level below it. And at the end, you'll also see here that we have markers that correspond to infinity. This is a way to signal to the database system, as it's traversing the skip list, that this is the end of the linked list. And the same thing at the top: we can actually have a linked list that basically goes from the beginning to the end without any elements in it. So let's look at an example and see how we actually insert things and how we actually search in a skip list. 
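Before the insert example, the coin-flip level choice just described can be sketched in a few lines. This is my own illustration; `MAX_LEVEL` is an assumed cap on tower height (real implementations also impose one). Each heads promotes the key one more level, so about half the keys reach level two, a quarter reach level three, and so on, which is exactly the halving-per-level structure described above.

```python
import random

MAX_LEVEL = 4

def random_level(rng=random):
    """Flip a fair coin; keep promoting the new key while we get heads."""
    level = 1
    while level < MAX_LEVEL and rng.random() < 0.5:  # heads
        level += 1
    return level
```

Because the height comes from a random number generator, building the same key set twice can yield two differently shaped indexes; that's what makes the structure probabilistic.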
So say we want to insert key five. Key five is going to end up here. The first thing we're going to do is create the elements of key five in all three layers. So say we flip the coin, and we got heads enough times that we're going to put it in levels three, two, and one. And they're going to link down to each other. And then once that's in place, we can do an atomic swap and change the pointers from the guys in front of us and the guys behind us to integrate this new key into the linked list. So you can do this without locking. You can actually do compare-and-swap operations and have these changes be applied atomically without affecting the correctness of the index at a logical level. So for example, for key five, you could install it on the first level and update these pointers, and you don't necessarily have to update the pointers for the upper levels yet. The key will still be in there. So if anybody does a range scan across the bottom, they'll find our key; we just haven't added these upper level guidepost markers to help you find it faster. Then we can do the same compare and swap as we go up and add these one by one. So hopefully that's clear. Now let's talk about how to actually search in a skip list. So let's say that we want to find key three. We're going to start off at the top level here. We always start at the top level, and we're going to look to see what element this level's marker points to, and we want to check whether that key value is less than the key that we're looking for. So in this case here, this guy points to key five, and we see that key three is less than key five. So we know we don't want to follow along this path and traverse the index going this way; we want to go down to the next level. Same thing: now we check here and see, is our key greater than key two? In this case, it is, yes. 
So we know that key three is going to have to appear somewhere to the right of this element here. Do the next check: is our key three less than key four? It is. So we know we can't go that way. So we go down here, and now we're at key two. And then we just keep following along the lowest level until we find our key three. So again, this looks like traversing a B plus tree, except now, instead of having a rigid hierarchy or rigid protocol for how we create the upper levels of the B plus tree, we have this sort of probabilistic data structure where we flip a coin and it randomly decides which of these elements are going to be up here. But in practice, when we search them, it ends up being roughly the same speed as a B plus tree. And it makes it much easier to do a bunch of operations like insert and delete, because we don't have to worry about merging or splitting like we do in a B plus tree index. So in practice, the skip list is going to use less memory than a regular B plus tree, but only if you don't store the reverse pointers. This will be important on the next slide. So in my example that I showed you here, we were always going from the beginning of the index and traversing in one direction towards the end. If you have to go in the other direction, you have to do something extra, or you have to store pointers that go in the reverse direction. But if you store pointers in the reverse direction, that's extra memory and wasted space, and now your skip list is going to be larger than your typical B plus tree. So you lose all the advantages of it. Insertions and deletions are really easy, as I showed. You can do compare-and-swap operations and add these things atomically, and you don't have to do any rebalancing of the tree. And again, using compare and swap is much faster because you don't have to maintain locks as you do these things. 
You can do one quick instruction to add in the keys that you want. The downside of the skip list is that it's not as cache-friendly as a B+tree. Remember, in the B+tree, we can scan across the leaf nodes, and we can look at all the keys we need within a single node before we move on to the next one. In the skip list, there's no guarantee that consecutive elements or consecutive keys are going to be located in the same page or even the same location. So you may end up doing a fetch for one cache line to get one element, and then you have to follow the pointer to the next guy, and that's another cache line you have to bring in. So in practice, you don't get the same cache locality in a skip list as you do in a B+tree. And again, as I said, doing a reverse search is non-trivial unless you store the reverse pointers. So one of the ways to do a reverse search is actually something the MemSQL guys came up with, which I think is pretty clever. Again, we don't want to store reverse pointers for every single key in our skip list. Instead, we'll store markers in this end array here that correspond to the last element that we looked at when we scanned forward through our key space in a previous query. So at this point here, someone reached this end marker, and the last element they looked at was key five. For this guy, the last element they looked at was key four, and for this guy, key two. So now, if we want to do a scan in reverse order, we would start off here, look at this last entry, and see whether it's greater than our value. If it's not, then we can jump to it, and then we know that the value we're looking for will be in front of it. But in this case, it's not, so we have to go to the next one. Same thing, the value is not in front of the one we're looking for, so we go to the next one. And here, now we see key two is in front of key three.
So we would jump to this location, and now we can scan forward, track the elements we looked at, and that gives us the reverse direction. Now, I'm showing you an example where it doesn't make any sense to do this reverse search, because I just showed you how to do a forward scan to find key three. But in some queries, you may want to do an order-by that's in the opposite order of what the index is sorted on, so that you can find all the elements in the sort order you want and not have to do an additional sort. That's one of the advantages you get in a B+tree. If you have, say, an order-by descending, but your index is sorted in ascending order, you can do a descending-order scan, and now you don't have to do an additional quicksort to put the results in the correct order. So this allows you to approximate the same thing in a skip list. You pay some extra overhead of having to jump around to find the thing you want and then scan backwards to get it, but the cost of doing this will be less than having to do a quicksort, especially if you have a large intermediate data set. So the interesting thing about skip lists is that when MemSQL came out, they announced that they were using skip lists, and everyone was asking them, how do you do reverse scans? And at the time, when they first came out, they said, well, we can't do that; you'd have to have two indexes. And they got a lot of flak for that. And so this reverse-iteration approach that they're proposing here is a way to get around that problem. Skip lists were actually also first explored by the Microsoft guys when they were figuring out what Hekaton would be. In the early days, when they knew they wanted to do MVCC and they were trying to figure out what data structure to use, they were exploring skip lists, and they would give these internal talks at Microsoft about the advantages of using skip lists over a B+tree.
So what happened was the MemSQL founder was at Microsoft at the time, and he saw them giving these talks about skip lists. So that's how he got the idea to use that in MemSQL. Now, the paper you're going to read next week is about the Bw-Tree, where Microsoft basically said skip lists are a bad idea, or at least not as good as they thought they were. And they went ahead with a B+tree variant instead, whereas the MemSQL guys are still using skip lists. I think roughly they're about the same. Maybe next Wednesday I'll show you some benchmark numbers. They're roughly equivalent for most workloads. I don't think one is specifically better than the other. But this is an interesting data structure that's only been around since the 1990s, and only now is it actually getting picked up and used in some systems. So MemSQL uses this, WiredTiger uses this, and there might be a couple of others that use it as well. Yes? So one of the problems you mentioned in the previous slide was there's not very good cache locality. Yes. So are you saying that does not have a very large performance impact? If you're doing point queries, you get this, right? You just jump to the thing you want, and you're done. If you have to scan the entire thing, then yes. But if you have to scan the entire table, you might just be better off jumping to some location in the table if it's clustered, and then just scanning the blocks of tuples directly. I actually don't know whether MemSQL uses clustered indexes or not. If you had clustered indexes, then this cache-locality problem wouldn't be that significant. But they're using MVCC, and I don't know whether they're doing in-place updates or not. So does everyone know what a clustered index is? No? OK, so a clustered index is where the sort order of the rows in the table heap matches the sort order in the index.
So if I have keys 1, 2, 3, 4, 5, and 6, in my table heap they'll be ordered 1, 2, 3, 4, 5, 6. So if you have a clustered index and I want to do a scan from 2 to 5, I can just follow the index to get to tuple 2 and then scan in that order through the table heap until I hit tuple 5. And I know I've gotten everything that I wanted. So MySQL has clustered indexes on primary keys, and a bunch of different databases do this as well. OK, so now let's talk about doing concurrency control inside of an index. So the key thing to remember is that the database system is going to have to treat indexes differently than the regular tuples in the table heap. The reason is that with tuples, we only really care about whether the logical contents have changed from one transaction to another. In the index, we now also need to care about whether the physical structure changes, because we don't want to be traversing an index and all of a sudden someone moves a pointer out from under us, and we jump to a memory location that's invalid. So we need to be careful about not only the logical contents, meaning the values or the keys in the index, but also what the data structure looks like. And that's essentially what locking and latching does. So let's look at two different problem scenarios. Let's say that transaction 1 wants to check whether key 25 exists. So it'll start at the root node, take a shared lock on that node, take a shared lock on this guy here, and now we're doing latch crabbing, or lock coupling as it's called in the paper. So once we get the shared lock on C, we know we can release the shared lock on A, because we've already been able to get down to the point we want to be in the tree. Then we get down here, shared lock on F. Then, voila, we see our slot for 25 is empty. So we'll be able to insert something in there. But now we release our shared lock, and transaction 2 comes along.
It crabs all the way down, taking exclusive locks, because it wants to insert 25, and puts it in there. Then we come back, and now we try to insert 25 for ourselves. We take exclusive locks, and now we're going to see that the key that we thought wasn't there is now there. So latch crabbing doesn't help us here. We've got to do something else. Another scenario is that we want to do a range scan. Maybe this transaction wants to do a range scan from 12 to 23, with shared locks all the way down. And for simplicity, it acquires the shared lock on its neighbor. In practice, you can't just take shared locks going across horizontally, because someone else might be coming in the other direction and you would have a deadlock condition. But for simplicity, assume the system can do that. So we do our scan. Now another transaction comes along and inserts 21. And now when we come back and scan again, we're going to get a phantom, because we'll do a scan on this node here and we'll see that there's now a value 21 that didn't exist before. This is why we need to be able to handle locking and latching inside the index, and not just in the table heap. So the key difference to understand between locks and latches is that locks are the things that protect the logical contents of the database. So tuples, tuple values, phantoms, ranges, and things like that. And these need to be held for the duration of a transaction. And we also need to be able to roll back changes in case the transaction fails. So we have to maintain an undo log so that if a transaction aborts, we can reverse our changes. Latches, as we typically refer to them in database systems, are sort of the micro locks, the low-level locks for the internals of the database system. Now, the problem is that other areas of computer science, like operating systems, use the word lock where in databases we really mean latch.
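Here's a small sketch of that crabbing descent in Python (illustrative only; I'm using a plain mutex per node where a real system would use reader-writer latches, and shared vs exclusive modes are collapsed into one):

```python
import threading

class BNode:
    def __init__(self, keys, children=None):
        self.latch = threading.RLock()   # stand-in for a read/write latch
        self.keys = keys                 # separator keys (inner) or entries (leaf)
        self.children = children or []   # empty list means this is a leaf

def crab_to_leaf(root, key):
    """Descend to the leaf covering `key`, holding at most two latches at once."""
    node = root
    node.latch.acquire()
    while node.children:
        # Pick the child whose key range covers `key`.
        idx = sum(1 for k in node.keys if key >= k)
        child = node.children[idx]
        child.latch.acquire()   # latch the child first...
        node.latch.release()    # ...then release the parent ("crabbing")
        node = child
    return node  # caller must release node.latch when done
```

As the lecture's scenario shows, this keeps the traversal physically safe, but once the leaf latch is released, another transaction can slip in and change the logical contents; that's why latches alone aren't enough.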
A latch is like a mutex, a low-level thing with which we're going to protect a critical section of the internal data structures. And so for a latch, the thread is going to hold it only for the duration that it needs to do that operation. So if you think back to the Hekaton paper, they were taking write locks on the timestamps inside the tuple, and they had to hold those write locks until the transaction finished. But when it did other modifications, like installing updates, it would take latches on the tuple itself, do whatever operation it needed to do, and then release that latch. So this is the same thing. And in the case of a latch, the transaction never needs to be able to roll back its changes, because we're doing things at a physical level. So we acquire the latch, make our change, and it gets applied. And once we're done, we don't need to maintain any undo information or roll that back. So there's this great table in the survey paper you guys read that lays out how locks and latches are distinguished from each other. And I like it because it talks about every possible aspect of how you're going to use these two different constructs. So in the case of locks, we're going to use them to separate user transactions from each other when we're modifying the database contents, we're going to hold them for the entire transaction, and we have a bunch of different modes we can use. And then we're going to use deadlock detection and resolution to deal with deadlock scenarios. So the transactions will acquire locks as they need them, but then there'll be something else, like a lock manager running in the background, that's going to make sure that we don't have any deadlocks, we don't have any problems, and it can abort things and free things up as needed. In the case of a latch, we want to protect threads from each other because they're both accessing the same data structures.
And we're pretty much only going to have simple read and write latches. In general, sometimes you only have a single global lock for the data structure or for the element you're trying to modify. And the key thing is that we're not going to have a background thread to protect us and make sure we don't do something stupid and have deadlocks. It's only through careful coding and discipline by the actual software engineer that's building the database system, building these indexes, and using these latches that we're going to be able to have concurrent operations that don't deadlock with each other. So that's the main thing you have to recognize. And when you do this for project number two, sometimes you're going to have to take locks and sometimes you're going to have to take latches. And it's up to you to make sure that you don't end up with deadlocks on these. So the index locks are essentially managed with a lock table. You can think of this as a hash table that has some logical naming scheme for locks. And then for each lock, there will be a queue of transactions that are waiting to acquire the lock or that hold the lock. So typically what happens is that whichever transaction is first in the queue is the one that holds the lock. It doesn't necessarily have to be one transaction. So in this case here, we have three transactions that all want a shared lock on a single logical lock. They can all acquire it, because shared locks can be shared with each other. Where latches fit into this is that any time we need to update this queue, we need to take a latch on it, because that's a data structure we need to protect to avoid dangling pointers and things like that. So we take latches in order to acquire locks. That's one way to think about it. So now another key thing that he talks about in the paper is that there are some people out there that talk about how they have lock-free indexes or lock-free data structures.
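As a sketch, a toy lock table might look like this in Python (the names and the simplified grant policy are mine, not any real system's; note the latch guarding every touch of a queue):

```python
import threading
from collections import deque

class LockTable:
    """Toy lock table: one queue of holders per logical lock name."""
    def __init__(self):
        self.latch = threading.Lock()   # protects the table and its queues
        self.queues = {}                # lock name -> deque of (txn, mode)

    def try_acquire(self, name, txn, mode):
        # mode is "S" (shared) or "X" (exclusive)
        with self.latch:                # take the latch before touching the queue
            q = self.queues.setdefault(name, deque())
            # Grant if no one holds the lock, or everyone involved wants shared.
            if not q or (mode == "S" and all(m == "S" for (_, m) in q)):
                q.append((txn, mode))
                return True
            return False                # a real system would enqueue and wait

    def release(self, name, txn):
        with self.latch:
            q = self.queues[name]
            self.queues[name] = deque((t, m) for (t, m) in q if t != txn)
```

In a real system, a denied request would be appended to the queue and the transaction would block until the holders ahead of it release; this sketch just reports the grant decision.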
And it's not really quite clear what that means in the context of database systems. So there are two possibilities. A lock-free index could be something that has no logical locks. That means that transactions do not need to acquire any lock on an element any time they need to modify anything. And we saw this in the OCC case, because you did all your changes in your private workspace, and any time you modified a tuple in your private workspace, you didn't have to lock it from anybody else, right? But the key is that you still have to use latches at the end in order to install those updates. So it's not lock-free in the sense of some magical data structure that doesn't use any construct for critical sections. It's just not using logical locks, but you still need physical latches. The other possibility is a data structure that has no latches at all. And essentially, to do this, you have to do multi-versioning inside of the index itself. So you can use shadow paging or other techniques like that to make this work. And you can use compare-and-swap operations to apply your pointer updates atomically. This is essentially what we could do in the MVCC case, but we still have to acquire locks at the end in order to validate the transactions. So again, there's no magic data structure that's going to be entirely lock-free and latch-free, and transactional memory doesn't help you, right? It sort of obfuscates the problem in some ways. There's always going to be some way to coordinate and synchronize across threads. And whether you're doing this through physical latches or logical locks depends on what index you're using. That's something to be mindful of when you read the literature and people talk about their lock-free data structure. It probably still means that they're using latches. OK, so now I want to go through the locking schemes that we talked about in the survey paper.
Now, you may be kind of confused about why I had you read papers on multi-version concurrency control and OCC, all these optimistic schemes that weren't doing locking, and then I throw you back to how to do two-phase locking inside of an index. The reason is that on Wednesday you'll read papers about doing optimistic operations inside of an index, sort of like MVCC and OCC, that don't use locks but still have to use latches. So I want you to first understand what high-level locking does, why it sucks, and why it's hard, and then that will motivate why the indexes we'll talk about next time are able to get better performance than these other approaches. So the very first locking scheme that was proposed actually came out of the System R project at IBM in the 1970s. The System R project, I think, was fascinating. They basically had eight people with PhDs, some in computer science, some in mathematics. Back then, you didn't get a PhD in databases, because databases didn't really exist as a field. They took eight really smart people, put them in a room, and said, build a database system. And each person in that group of eight went off and designed and built one piece of the system. So one person worked on cost-based query optimization. Somebody else worked on storage models and storage management. Other people worked on developing the SQL language. And one of them was Jim Gray, who later won the Turing Award; he worked on this idea of transactions and consistency models. And so they had this paper in the 1970s that described a way to check whether transactions conflict using high-level logical predicate locks. So the basic idea of a predicate lock is that you're going to take the where clause of a select statement and map that to the key space of the table. And that'll be the granularity at which you apply the lock. So if you have a select query, it's just the where clause.
If you have an insert, update, or delete, you acquire an exclusive lock on the where clause. Well, an insert doesn't have a where clause, so it's just the equality predicate based on the values you're trying to insert. This sounds awesome. This sounds like exactly what we want, but it's actually really hard to implement in practice. So no system ever actually implemented this. They never did it in System R, and no one's even come close to something like this in a real database system, and we'll see why in a second. It provides, I think, the best granularity for locking that you can have. It's just that figuring out whether one predicate intersects with another predicate is really difficult, and it's very inefficient to actually implement in practice. So let's see what it looks like here. Say we have two queries. The first one does a sum of balance on the account table where the account name is Tupac, and then we do an insert into the account table for Tupac with a balance of 100. So in the first case, our predicate lock is going to be based on where name equals Tupac. You can think of this as the entire key space of the table, and the predicate name equals Tupac covers this range here. And then for this guy here, because it's an insert query, we just convert it to a conjunctive predicate with the equalities, and we see that name equals Tupac and balance equals 100 overlaps with our region over here. Therefore these two predicates intersect, and therefore there would have to be a conflict. So this seems like a simple example in a 2D projection. But now you can imagine that if you have more complicated queries and more complicated predicates, this gets really hard to actually implement.
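For this simple 2D case, the intersection test can be sketched as interval overlap on each attribute (a toy illustration of my own; real predicates with arbitrary expressions are far harder to compare):

```python
def overlaps(p1, p2):
    """Could two conjunctive predicates select a common row?
    Each predicate maps attribute -> (lo, hi) interval; equality is (v, v)."""
    for attr in set(p1) & set(p2):
        lo1, hi1 = p1[attr]
        lo2, hi2 = p2[attr]
        if hi1 < lo2 or hi2 < lo1:
            return False  # disjoint on this attribute -> no common row possible
    return True  # no shared attribute rules out an intersection

# Q1: SELECT SUM(balance) FROM account WHERE name = 'Tupac'
q1 = {"name": ("Tupac", "Tupac")}
# Q2: INSERT (name='Tupac', balance=100) -> equality predicate on its values
q2 = {"name": ("Tupac", "Tupac"), "balance": (100, 100)}
```

Here `overlaps(q1, q2)` reports a conflict, matching the lecture's example; the point is that even this check, repeated across hundreds of concurrent queries with richer predicates, gets expensive fast.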
So you can sort of think of it as a multi-dimensional space of all possible predicates and all possible values you could have for your attributes in your table. And now you're trying to decide whether the space that is encapsulated by this predicate intersects with the predicate from the other query. And you need to do this for hundreds of queries running at the same time. So it's really hard to do this in practice. One way to think about what they're doing versus what we'll talk about in a second is that they're trying to figure out what locks to take for a transaction or a query before you actually run the query. What I mean by that is that we can look at the predicates and decide logically whether two predicates overlap, whereas the physical locking done in the other types of index locks only applies the locks when you actually try to go access the data. So with predicate locks, you don't even need to execute the query; you just need to analyze the predicates. But for all the other locks we'll talk about, as you're running the query, you acquire the locks right before you actually access the object you want. That's essentially what key-value locks are going to do. So now we're going to build up to the hierarchical locking that most database systems implement. But we're going to use a few little building blocks to get us there. So key-value locking is pretty straightforward. Basically, we want to get a single lock for a single key that's in our index. This is done in the leaf nodes, because we said that we only need to worry about locking the leaf nodes. We don't care about the upper-level stuff; we acquire latches for those in order to rebalance them. The logical locking only needs to be done in the leaves. So in this case here, if we want to check whether 14 exists, we would acquire a lock for it, and then nobody else can modify it as we go. The next thing we need to handle are the gaps.
So let's say I want to check whether 15 exists. But in this case here, I have increments of two, so there's no element 15 in my key space. This is another reason why you can't have the locks embedded directly in the index; you have to use an external lock table, because there is no key 15, so there's no way for me to set a lock on something that's not there. So instead, we'll have gap locks, where basically we say that between these two key elements there's a gap, and I can acquire a lock on that gap. So to acquire a lock on item 15, which doesn't exist, I would acquire the gap lock on the gap between 14 and 16. So now we can combine the key-value locks and the gap locks to give us key-range locks. The idea of a key-range lock is that we're going to take chunks of ranges in our key space and apply the same locking mechanism that we have for gap locks and key-value locks. And it's going to allow us to cover a larger range of the key space with a smaller number of locks. So instead of having to acquire a key-value lock and a gap lock separately, I can get a key-range lock and get both of them with one invocation of the lock manager. And then I'm going to define different modes for how I acquire these locks in my lock manager to make it easy to decide whether two key ranges overlap and whether their operations are commutative. So for example, if I have a key-range lock from 14 to 16, and you have one from 16 to 18, and we both have a shared lock for the part of our ranges that overlap, it's safe for us to access the same data. So this is what key-range locks are going to allow us to do very efficiently. So essentially it looks like this. We still have our gaps. And so a next-key lock would be 14 plus the gap after it, while a prior-key lock would be 14 plus the gap below it, down to 12. And again, we can acquire these in one invocation of the lock manager to cover this range. This also handles the virtual keys.
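A little Python sketch of the next-key bookkeeping (a hypothetical helper of my own, not modeled on any specific engine): given the keys in a leaf, it decides which lock a lookup must take, including the gap case for keys that don't exist and the virtual key at the end of the range.

```python
import bisect

def next_key_lock_target(leaf_keys, key):
    """Under next-key locking, a lookup of `key` locks the key itself if it
    exists; otherwise it locks the next existing key, which also covers the
    gap before that key. `leaf_keys` must be sorted."""
    i = bisect.bisect_left(leaf_keys, key)
    if i < len(leaf_keys) and leaf_keys[i] == key:
        return ("key", key)               # lock the existing key
    if i < len(leaf_keys):
        return ("gap+key", leaf_keys[i])  # lock the next key plus the gap below it
    return ("gap+key", "+inf")            # virtual key at the end of the key space

keys = [10, 12, 14, 16]
```

So looking up 14 locks key 14 itself, looking up the nonexistent 15 locks 16 plus the gap below it, and looking up 17 falls through to the virtual +infinity key.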
So let's say that we care about everything from 16 to infinity. Again, we can acquire a key-range lock for that, and the same in the other direction as well. In practice, most database systems actually implement next-key locking. It doesn't matter which one you do, as long as you're consistent. You can't mix next-key locks and prior-key locks; you can only have one type. And no one way is better than the other. In practice, InnoDB and other systems all use next-key locking. So now we have the same sort of problem we had when motivating key-range locks: if we need to acquire locks for a large range of keys, we have to go acquire individual next-key locks for every single one of them. So this is where hierarchical locking comes in. It's going to allow us to acquire a lock for a larger key space, using different modes to provide hints to other transactions running at the same time about what we intend to do with the elements that we have locked. And this is going to allow us to reduce the number of trips we have to make into the lock manager to cover a wide range of keys. So for example, let's say I acquired an intention-exclusive lock on the range 10 to 16, inclusive. What this is basically saying is that the transaction that acquired the intention lock is telling other transactions, I intend to modify something within this range. I don't know what it's going to be yet, but I plan to take an exclusive lock inside of it. So nobody can come along and try to take a shared lock on the range at the same time. Then, within my range, I can acquire the exclusive lock on exactly the data that I want to lock and modify. And somebody else could acquire an intention-exclusive lock on the same range and acquire an exclusive lock on another part of the key space, and can modify it at the same time without having a conflict.
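The standard compatibility matrix behind this can be sketched as a lookup table (the modes and matrix follow the textbook hierarchical-locking scheme, trimmed to four modes for illustration):

```python
# Compatibility matrix: row = a mode already held on the range,
# column = the mode being requested. S = shared, X = exclusive,
# IS/IX = intention shared/exclusive.
COMPAT = {
    ("S", "S"): True,   ("S", "X"): False,  ("S", "IX"): False, ("S", "IS"): True,
    ("X", "S"): False,  ("X", "X"): False,  ("X", "IX"): False, ("X", "IS"): False,
    ("IX", "S"): False, ("IX", "X"): False, ("IX", "IX"): True, ("IX", "IS"): True,
    ("IS", "S"): True,  ("IS", "X"): False, ("IS", "IX"): True, ("IS", "IS"): True,
}

def can_grant(held_modes, requested):
    """Grant the requested mode only if it's compatible with every held mode."""
    return all(COMPAT[(h, requested)] for h in held_modes)
```

This captures the example above: two transactions can both hold IX on the same range (`can_grant(["IX"], "IX")` is true) and take exclusive locks on different keys inside it, but an IX on the range blocks anyone asking for a shared lock on the whole range.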
That's the way to provide additional concurrency in the system while reducing the number of trips to the lock manager. And then you follow the same sort of strict two-phase locking, or regular two-phase locking, protocol, where you have to hold locks until the shrinking phase, and then you can release all of them. So now not only are we locking the tuples in the regular table heap, we're also locking these key ranges in our index space. I hope that's sort of clear. It's different from the hierarchical locking you may learn in an intro class, where you can lock a tuple, the page the tuple is in, the table, the database. Here we're doing hierarchical locking within the ranges along the leaf nodes, and not in the upper levels of the tree, because we don't care about the upper levels of the tree in terms of logical contents. We only care about their physical layout, and for that, we protect them with latches, OK? So again, essentially everything I just said: hierarchical locking gets us predicate locks without the expensive and difficult implementation of having to evaluate whether two predicates are overlapping. And we're only going to do index locking within the leaf nodes, and then we'll use latching to preserve a consistent data structure for the entire tree. So in the same way that we've been reading about in-memory concurrency control schemes, these more recent implementations that are really fast, indexing is now a hot area as well. You'll read this on Wednesday, but the papers that have come out in the last five years are all about how to do fast indexing for in-memory databases by using techniques other than the locking we're talking about here, to get better concurrency and better parallelism. So, any questions about indexing and all of this? OK. So now I want to spend a little bit of time talking about prison tattoos.
So again, as I said, if you're going to be involved in databases, chances are you may end up in prison. It happens to a lot of us, not me in particular, but there are other famous database people who have spent time in prison. So part of what I want you to get out of this course, and this lecture here, is to understand how to deal with the various factions of gang members you may encounter while in prison or on the streets, and to be able to assess the situation and understand what this person's motivation is and where they're coming from. And you can do this by identifying key aspects of their tattoos. So I want to go through five of the most popular tattoos, and then you can find more information online and learn more about these gangs and other things. So the most popular tattoo that you've probably seen before is the teardrop. And this is typically done next to the eye. It doesn't matter whether it's the left eye or the right eye. And this is used to signal that the person has been involved in a murder. If the teardrop is filled in, it means they were successful; they actually killed the other person. If it's not filled in, just an outline of the teardrop, then it usually means that it was an attempted murder and they didn't succeed, but they might try again. And the number of teardrops corresponds to the number of murders that they've been involved in. So you see this on a lot of celebrities today, a lot of musicians. This is one popular thing that may not necessarily mean that they were in a gang, but in practice, if someone in prison has this, they probably were involved in a murder. Another very common tattoo is the three dots. And unlike the teardrop, the three dots don't always have to be next to the eye. You can have them on the wrist; you can have them on other body parts. The three dots are meant to represent la vida loca, basically "my crazy life."
And it's just a means to signal to other people that the person with this tattoo is involved in a wild lifestyle, thuggery, and other things like that. A lot of celebrities have this. Ice Cube has this, although he was a founding member of N.W.A. And the reason why they usually do dots is because when you're in prison, you obviously don't have a tattoo parlor. So you can do these tattoos by taking a lighter to a Bic pen with a sewing needle, and you sort of jab it into the skin, and that'll produce the same effect. And the dots don't necessarily have to line up. They don't have to be in a triangle. As I showed in the example here, they sometimes have three dots that are just in a straight line. Essentially, as long as there are three dots, that means "my crazy life." It's very important, though, for you to not confuse three dots with five dots. The five dots are always going to be on the hand. You'll never see them by the eye. You'll never really see them on any other body part. And this will mean that this person has been in prison. So if you see someone outside of prison and they have five dots, it means that they were incarcerated in the past. The five dots are in the shape of a quincunx, and it's supposed to represent their time in prison. The four dots on the outside are the prison walls, and the one dot in the middle represents the person that's been incarcerated in prison. And again, I don't want to judge, but usually these are people that are more hardened than the three-dots folks. The three dots could just mean you like to party. Five dots means you definitely were in the pokey. You definitely spent time. So these are sort of the three standard tattoos that follow the same formula or the same iconography. I now want to discuss two different gang tattoos.
They don't follow the same formula, but I want to show you how you can identify what gang someone is in based on the symbols that they use. So the first gang is the Mara Salvatrucha gang. This is a prison gang that started in Los Angeles, and they're basically a bunch of El Salvadorans, although they have ties with the Mexican cartels. And typically for these tattoos, you'll see the letters MS followed by the number 13. Again, they're not going to be in a specific pattern; they're not going to be arranged in an exact way. Usually the letters and the numbers are arranged as part of other artwork that's on the body. The MS obviously just means Mara Salvatrucha. The 13 doesn't correspond to a date or anything special; it just means that M is the 13th letter of the alphabet. And you see in this case here, they have MS on the back. It doesn't have to be on the back; a lot of times you see it on the chest or along the neck here. And this is a pretty ruthless gang. I think the number is maybe 50,000 members across prisons in the US. So they're a pretty tough group. And the last one I'll talk about is the Aryan Brotherhood. These are usually skinheads or neo-Nazis. And you'll see the same thing: you'll see the initials AB as an abbreviation for Aryan Brotherhood. But then you're also often going to see the numbers 14 and 88. The 14 corresponds to some stupid white-supremacist slogan that has 14 words in it. I don't remember what it is, but that's what it means. The 88 is actually not the number 88, but two eights together. And this corresponds to the slogan "Heil Hitler," where both of those words start with H, and H is the 8th letter of the alphabet. So you see 14 and 88, and that's usually what it means there. A lot of times the Mara Salvatrucha tattoos will be very colorful with a lot of different artwork, whereas the Aryan Brotherhood tattoos are usually black and monochrome.
OK, so like I said, knowing these tattoos is important going forward when you get involved in databases. For next class, we're going to talk mostly about the Bw-Tree, but we'll also talk about how to do concurrent skip lists and the ART index, which is from the HyPer guys; it's a radix tree. It's very important for you to understand what we're going to talk about next class, because that's essentially what Project 2 is going to be about. You're going to have to build a concurrent index, and you'll have to follow one of the techniques that we're going to talk about. OK, any questions? All right, we're done. See you guys. See you on Wednesday.