Today is the second part of the lecture we were doing on tree indexes. Just real quick, to go through what's on the schedule for you guys coming up: everyone should know that project one is due on Wednesday. Who here has done project one already, it's done 100%? Raise your hand. All right, small smattering. Who here has not started at all? In the back, really? You know what, kudos for admitting this, right? Because I ask every year. Nice. Okay. So homework two: again, it was originally due this Wednesday; it's been bumped out to be due on Friday at midnight. And then project two will be announced on Wednesday in class this week. And for project two, because everyone last year assumed project two was going to be as easy as project one — and maybe you heard me say it's easy and that's why she hasn't started yet — we're adding checkpoints. After the first week and a half, you have to submit something on Gradescope to show the first half of the project is done, and then the final submission will be due a week and a half after that. So again, it's a checkpoint halfway through to make sure you at least have started and looked at the project, so that you're not trying to build a B+tree in 24 hours, which is going to be a bad time for everyone, okay? So any questions about project one at a high level? Okay. Again, post any technical questions about Gradescope or other aspects of the coding assignment on Piazza and the TAs will be able to answer them for you.

All right, so today's agenda: we're going to split things into three parts. In the beginning I want to talk about other ways you can use table indexes in a database system that go beyond what we've mostly assumed or talked about so far. Then I want to talk about two alternative tree-based data structures for building a table index, as examples of other options beyond a B+tree.
The spoiler is that the B+tree is still superior in some cases to these other two data structures, but I still think it's useful for you guys to be aware of them and understand how they work. And then we'll finish by talking at a high level about another example — not exactly a tree-based data structure, but another kind of index you can have in your database system — called inverted indexes, and we'll see how we do full text search on these. Okay.

All right. So somebody had a question at the end of last class about how the database system was actually building these indexes. And that wasn't so much about the process of actually taking the keys and populating the index; it was more about whether there are cases where the database will automatically build indexes in order to do certain things. And the answer is yes. Most database systems will automatically create indexes to enforce different kinds of constraints: integrity constraints and possibly also referential constraints. An integrity constraint would be like a primary key or a unique constraint. So if I call CREATE TABLE like this and I define a primary key and a unique attribute, then implicitly what this means is that the database system is essentially going to invoke these SQL commands to create these two indexes for you. For the primary key it automatically creates a unique index on the id, and for the second attribute, the varchar, it will create a unique index on that because it has the UNIQUE keyword. And when you think about it, this makes sense, right? Because how else would the database system actually enforce these constraints? How is it going to know when somebody tries to insert a tuple with the same primary key? As always in database systems, the fallback option is just to do a sequential scan and look at every single tuple to see whether you have the same key already.
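This automatic behavior is easy to observe for yourself. Below is a minimal sketch in Python using SQLite (a different system than the Postgres and MySQL examples in this lecture, but it does the analogous thing); the table and column names are just illustrative. Declaring a UNIQUE constraint implicitly builds an index — visible in the catalog even though we never ran CREATE INDEX — and that index is what rejects duplicate keys.

```python
import sqlite3

# In-memory database; a UNIQUE constraint causes SQLite to build an
# index behind the scenes (named sqlite_autoindex_<table>_<n>).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE foo (
        id   INTEGER PRIMARY KEY,
        val1 INTEGER,
        val2 VARCHAR(32) UNIQUE
    )
""")

# Ask the catalog which indexes exist on foo -- we never ran CREATE INDEX.
auto_indexes = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'foo'"
)]

# The index enforces the constraint: a duplicate insert fails without
# the system having to scan the whole table.
conn.execute("INSERT INTO foo VALUES (1, 10, 'wu-tang')")
try:
    conn.execute("INSERT INTO foo VALUES (2, 20, 'wu-tang')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

(Note that in SQLite an INTEGER PRIMARY KEY is the row id itself, so only the UNIQUE column gets a separate automatic index here.)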
But of course, if we have one billion tuples in our table, every single time we insert something we'd have to scan through one billion tuples. So we can build an index to do this check very quickly for us. And it doesn't matter whether it's a B+tree or a hash table — at a high level it's the same idea: check whether there's a conflict, and if yes, then we know we should not insert it; if no, then it's okay for us to insert it. So that covers integrity constraints.

I also thought systems would do the same thing for referential constraints, or foreign key constraints. So let's say I have a second table here, bar, and its id field has a foreign key reference on the first value attribute in the foo table. But at this point that attribute in the foo table is not unique, so there isn't an index already built for it. And I thought the database system would automatically recognize: I have a foreign key reference from this table to that table, there's no index over there yet, so I'll go ahead and create one automatically for you. Turns out — I tried this in MySQL and Postgres — they don't actually do this at all. They won't even let you create the foreign key unless you're pointing to something in the parent table that already has a unique index. And the reason why you need the index is the same reason we needed it for the primary key and the unique constraint: every single time you insert something into bar, the foreign key says that whatever you're storing for the id field must match a key that exists in the foo table. So again, you would use an index to make sure you don't have to scan everything to find a match. So it turns out Postgres and MySQL will throw an error — MySQL actually throws a confusingly silent error for this one.
Postgres at least does it correctly, unlike MySQL 5.7, and says you can't create this foreign key reference because you need to have a unique index on the foo table for the thing you're pointing to. But there's nothing in the relational model that says this has to be the case. I think this is strictly an implementation issue — maybe the SQL standard says something about it — but at a high level there's no reason they have to require this, which is why I thought they were going to create one automatically. So the only way to really make this work is to change the foo table to make that first value attribute unique. That automatically builds an index, and then this thing can now point to it. So this is sort of clear? Any time you define a constraint when you create the table, it will automatically create the index for you. You can also go back and create indexes manually later, but you can do it directly when you call CREATE TABLE.

All right. So another interesting way to expand how we can use indexes is called partial indexes. The idea here is that so far we've assumed that any time I call CREATE INDEX, it's going to scan through every single tuple in the table and build an index covering the entire table. But there may be cases where you don't actually want an index on the entire table — you want a subset of the table. This is what's called a partial index. The way you make this work is that you add a WHERE clause to the CREATE INDEX statement with some additional predicate that says: only index the tuples in the base table — in this case foo — where c equals 'wu-tang', or where some expression evaluates to true. So inside of this index we're only going to index a and b, but implicitly there's some metadata we're maintaining so we know that this particular index only has entries from the foo table where c equals 'wu-tang'.
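A quick sketch of the partial-index behavior, again in Python with SQLite rather than Postgres (SQLite also supports partial indexes; the index and table names here are made up). The planner's EXPLAIN QUERY PLAN output shows the index being used only when the query's predicate matches the index's WHERE clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INT, b INT, c VARCHAR(32))")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?)",
    [(i, i * 2, "wu-tang" if i % 2 else "gza") for i in range(1000)],
)

# Partial index: only tuples where c = 'wu-tang' get index entries.
conn.execute("CREATE INDEX idx_foo_partial ON foo (a, b) WHERE c = 'wu-tang'")

def plan(sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# The query predicate implies the index predicate -> the index is usable.
matching = plan("SELECT b FROM foo WHERE a = 123 AND c = 'wu-tang'")
# The predicate contradicts the index's WHERE clause -> sequential scan.
non_matching = plan("SELECT b FROM foo WHERE a = 123 AND c = 'gza'")
```

The same matching logic is what the Postgres demo later in this lecture shows from the other direction: the system compares the query's WHERE clause against the predicate stored in the index's metadata before deciding it can use the index.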
The advantage of this is that if you now come along with a query like this — where a equals 123 and c equals 'wu-tang' — then for the a attribute we can get that from the index, but before that we check whether the c predicate matches what we declared as the partial index's WHERE clause. If it doesn't match, then we can't use the index. If it does match, then we know it's safe to go inside it and apply the a predicate to get the keys that we want. This is actually a very common setup in a lot of applications where you want to partition the index based on something like a date. Say you split up the list of orders for your business: you can have one index for each individual month. This reduces the size of each index and makes maintaining them much easier, because they're more compact. If you have a lot of queries that only focus on a single month, then you don't have to swap a giant index into your buffer pool that's full of keys that don't actually matter for you. So this is a very common approach people use to manually partition the data in a way that reduces the overhead of large scans.

One additional thing to notice about this query is that everything I need to answer it is actually contained in the index itself, right? I'm doing a lookup on a, my WHERE clause checks c, and the only thing I want to produce as my output is b. So do I actually need to look at the tuple for this? How would we normally execute this? We'd traverse the index, and the value at the leaf node — assuming we have a B+tree — would be a record pointer, and then we would follow that record pointer to go get the tuple and evaluate the query.
But for this particular query, there's nothing from the tuple itself that we don't already get from the index. These are called covering indexes. The idea is that a covering index covers all of the data you need to execute a particular query. This is not something you declare — you don't say "make this index be a covering index." It happens on a query-by-query basis: the database system can recognize that for this particular query, everything needed to answer the question you're asking can be found in a particular index, and if so, it's considered a covering index. So to simplify our example, for our query SELECT b FROM foo WHERE a = 123, we can get the b that we need for the output directly from the index. (Actually, the arrows on the slide are wrong — a should map to this a, and b to that b there.)

So I've already sort of spoiled it: what's the advantage? I don't have to go look at the tuple, because everything I need is directly in the index. And we know this from you guys implementing your buffer pool manager: memory is limited. I already have to bring in the index pages in order to traverse the tree to get to the leaf node I want. But now, with a covering index, I don't have to go look at the tuple. That's one less page read, one less entry into the buffer pool I have to make to get the data I need for this query. So this can significantly speed things up in a lot of cases — especially if you're doing a large range scan and your tuples are really big, so you'd otherwise have to read a lot of pages. This is something that's widely used in the major commercial systems. MongoDB supports this. I think Postgres supports it now; they're getting better at it in Postgres 11 when it comes out later this year. I don't think MySQL — at least MySQL 5.7 — supports this. I don't know what SQLite does.
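As it happens, SQLite's planner makes this behavior easy to see: EXPLAIN QUERY PLAN explicitly reports when an index covers the query. A small sketch (table and index names are mine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INT, b INT, c VARCHAR(32))")
conn.executemany("INSERT INTO foo VALUES (?, ?, ?)",
                 [(i, i * 2, "zzz") for i in range(1000)])
conn.execute("CREATE INDEX idx_foo_ab ON foo (a, b)")

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# b lives in the index, so the tuple never has to be fetched:
# one less page read per matching key.
covered = plan("SELECT b FROM foo WHERE a = 123")
# c is not in the index: the index still finds the row, but the system
# must follow the record pointer back to the table to get c.
not_covered = plan("SELECT c FROM foo WHERE a = 123")
```

The same index is used in both plans; only the first avoids touching the table, which is exactly the covering-index win described above.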
Again, the idea here is we can reduce contention in the buffer pool: not only do we fetch fewer pages from disk for the query, we also have to go into the buffer pool manager fewer times and take fewer latches. So this is a big win in a lot of situations.

Now, there may be cases where the extra data I need to make an index covering is not actually part of the search key. In theory, if I have attributes a, b, c on foo, I could make an index on a, b, and c and just never do lookups on c. That would make it a covering index more often than not, but you're sort of wasting space now, because in the upper levels of the tree — the inner nodes — you're indexing c when you don't actually need to. So a way to get around this is what's called index include columns. The idea is that in our CREATE INDEX statement we add a little INCLUDE clause listing the attributes that we want packed into the leaf nodes alongside any key that we're indexing. So in this case, I build my index on a and b, but now in the leaf nodes, for every entry — every unique pair of a and b — I also include the value of c. So now for my query from before — SELECT b FROM foo WHERE a = 123 AND c = 'wu-tang' — I can get a and b from the index keys and I can evaluate the c predicate against the included column in the leaf node. As far as I know, SQL Server from Microsoft supports this — that's where it was first implemented. Again, Postgres 11 will support it later this year. MySQL doesn't do this, at least not in version 5.7. SQLite doesn't do this. And I don't think Oracle and DB2 do this, but I might be wrong. So in theory you could just pack everything into the INCLUDE clause, but then you essentially end up with an index-organized table, and you're really just duplicating things unnecessarily.
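Since SQLite has no INCLUDE clause, the best we can do there is the workaround described above: append the extra column as a trailing key column. A short sketch (names are mine) showing that this does make the index covering, at the cost of carrying c in the inner nodes too:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INT, b INT, c VARCHAR(32))")
conn.executemany("INSERT INTO foo VALUES (?, ?, ?)",
                 [(i, i * 2, "wu-tang") for i in range(1000)])

# No INCLUDE support, so emulate it by making c a trailing key column.
# Unlike a true INCLUDE, c now also lives in the inner nodes of the
# tree -- exactly the wasted space the lecture warns about.
conn.execute("CREATE INDEX idx_foo_abc ON foo (a, b, c)")

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

detail = plan("SELECT b FROM foo WHERE a = 123 AND c = 'wu-tang'")
```

In a system with real INCLUDE support (SQL Server, or Postgres 11's `INCLUDE` syntax), you would get the same covering behavior without paying for c in the inner nodes.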
So there's a trade-off between how much I want to include in my index versus how much extra space I'm willing to pay to speed up queries, right?

The last thing I want to talk about are what are called functional or expression indexes. Postgres originally called these functional indexes; the literature sometimes calls them expression indexes. The way to think about this: with everything I've shown so far, any time you create an index you just name an attribute a and an attribute b, and it copies the exact values of those attributes from the table and builds your keys on those. But there may be cases where I don't want to store the keys exactly as they appear in the underlying table. Let's say I have a query like this. I have a new table called users — think of it as people's user accounts — and you keep track of the timestamp of when someone logged in. And I have some query that says: give me all the users that logged in on a Tuesday. The EXTRACT function says, for a given login timestamp, extract the day of the week from it. And I'm checking here whether it equals two, which is Tuesday — Sunday is zero, Monday is one, Tuesday is two. So let's say I build the index on login like this. Would this help me for this query? If you're thinking ahead — no. Why? Because the data we need for the WHERE clause is derived from a function, not stored directly. Correct — he says the data we want in order to satisfy our WHERE clause is not in the data itself; it's derived from the function we invoke. So think about what this index is actually doing: it's going to give you a sorted list of all the users based on their login timestamp. But I want the people that logged in on a Tuesday. So there's not going to be a single range in the index that gives me all the people that logged in on Tuesday.
I could maybe be a little tricky and compute, between a lower bound and an upper bound, which ranges correspond to people logging in on a Tuesday, but as far as I know, no database system actually does that. So this index doesn't help us. Instead, what we can do is take the expression that we want to compute and embed it in the index itself. Now, when I define what keys I want to put in my index, instead of naming the single attribute, I put the same function I would normally have in my WHERE clause. In this case, we're going to index everyone based on the day of the week — we still keep everybody, Monday, Tuesday, Wednesday, and so forth. And now when I execute this query, the system can see that the function in my WHERE clause exactly matches what's in my index, and therefore it can use it. So, is there another way to do the same thing with the index types we've already talked about? Say we want an index with exactly the people who logged in on a Tuesday — how would I do that? Partial index, right? Anything you can put in the WHERE clause of a SELECT statement, you can put in the WHERE clause of your CREATE INDEX.

So let's do a demo, and we can also see a little bit of how Postgres picks plans. Excuse me. Somebody asked why I'm always using Postgres in my demos — is it because I'm a fanboy obsessed with Postgres? Postgres is really good. Postgres has all the things we're talking about in this class available to us, and it's almost a textbook implementation of a database system. Another thing I like about Postgres is that many of the features the commercial guys have, Postgres eventually gets as well. I said that index include columns were first implemented in SQL Server; Postgres is adding them now in version 11.
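The expression-index idea can also be sketched outside Postgres. Here's a minimal version in Python with SQLite, which supports indexes on expressions: `strftime('%w', ...)` is SQLite's day-of-week function ('0' is Sunday, so '2' is Tuesday), playing the role of EXTRACT in the lecture's example. The data and index name are made up; note the WHERE clause must repeat the index expression verbatim for the planner to match it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, login TIMESTAMP)")

# Two weeks of hourly logins, starting on a Monday (2018-01-01).
conn.executemany(
    "INSERT INTO users (login) VALUES (datetime('2018-01-01', ?))",
    [(f"+{h} hours",) for h in range(24 * 14)],
)

# An index on login itself would not help a day-of-week query;
# an index on the day-of-week expression does.
conn.execute("CREATE INDEX idx_users_dow ON users (strftime('%w', login))")

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# WHERE clause matches the index expression exactly -> index is usable.
detail = plan("SELECT id FROM users WHERE strftime('%w', login) = '2'")
tuesdays = conn.execute(
    "SELECT count(*) FROM users WHERE strftime('%w', login) = '2'"
).fetchone()[0]   # two Tuesdays x 24 hourly logins
```

As in the Postgres demo that follows, the index stores the computed day-of-week value as the key, not the raw timestamp.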
And it's open source and free. It's amazing. So let me do this. Let me drop the users table and create a new one. It only has two fields: the id field for the primary key — just an auto-increment key, always increasing — and this login timestamp field. Then let's generate some synthetic data. I'll do an INSERT into the users table using this function in Postgres called generate_series, which generates a monotonically increasing list of values. It's going to generate a bunch of timestamps from January 1st, 2017 until today, at one-minute intervals — about 900,000 records.

So let's take the query that I wanted, like this. If I run it, it comes back pretty fast. But if we check the query plan, you see it's doing a sequential scan on the users table in order to apply the predicate. If I now create an index on just the login field and run EXPLAIN again, it doesn't pick the index, because the index doesn't help us — it still does a sequential scan. So let's drop that index and make the expression index. Now when I check my query plan, it recognizes that it can use the index we created and does an index scan. I'll explain what a bitmap scan is later on; basically, just think of it as how Postgres keeps track of which tuples match. Now let's also create the partial index that I had before. So this creates the partial index where I'm still indexing the login timestamps, but my WHERE clause says only index the people who logged in on a Tuesday. So now if we go back to our query plan — let's take a bet. Who says Postgres is going to pick the partial index? Who says it's going to pick the expression index? Raise your hand for partial.
A few of you raise your hand for the expression index. So why do you think it would pick the expression index? It's more specific? Although the partial index is pretty specific too. Which one was the partial index — this one? This one, where day of week from login equals two. That's exactly what I want to execute in my query, right? Sorry, wrong one — my index is the one where day of week from login equals two. So it picks the partial index, because it knows how to match the WHERE clause that's defined for the index against the WHERE clause in your query, and picks that. If I change the query to three instead, it picks the expression index, because the partial index doesn't have any of that information. Right? So this is sort of clear, what's going on underneath the covers? And it uses all the same storage machinery we talked about last class. There's nothing special because it's an expression: it just knows that when I insert a record, I run that expression and insert the result as the key, rather than the original value of the thing I'm indexing.

Yes? The question is: when should you create or drop an index? Is that sort of an open, broad question — do you mean something specific in the implementation, or, if I'm administering a database, when should I do this? From the application level. So the question is: from the application level, when is the right time to create or drop an index? This is essentially what DBAs do for you. You basically look through the application, see which queries are running slow, and ask whether there are indexes that could help improve the performance. There are tools that do this for you. In a lot of cases, if you use an ORM like Django or Ruby on Rails, you get implicit indexes for free: you say you want this attribute to be unique because most of the time you're doing lookups on it.
So then it'll create the index for you so things get sped up. For more complicated cases, you have to look at what queries are running slow, and there are ways to do this — it's called the slow query log in MySQL, Postgres, and most systems — that say: show me all the queries that run longer than, say, 20 milliseconds. You look at a bunch of those and ask, well, what indexes could I add to speed things up? So you just look and see what's slowing you down. There's no conventional wisdom or hard and fast rule that says if your query looks like this, add an index, and if your query looks like that, drop an index. It depends on what you care about in your application. Maybe I have a query that takes 100 milliseconds but runs once a day — so who cares? It depends on what you want in your application.

So, any questions about these other ways to use indexes? Pretty straightforward. Again, it all builds on the same foundation: once you have an awesome underlying data structure that is thread-safe and reliable, you can start implementing all these other cool things on top of it. And this is essentially why older database systems have way more features than the newer guys: Postgres has been around for a while, and they're just adding all this awesome stuff on top of their existing infrastructure.

All right, so I want to take a step back now and start talking about different data structures. The way to think about what we talked about last class with the B+tree is that we were essentially building a sorted list of keys. The leaf nodes were essentially just a linked list of keys in sorted order, and then there was some infrastructure on top of them — the tree portion of the data structure — that we would use to route ourselves to find the data we're looking for. But the simplest way to actually implement an index would be just a single linked list, right?
The problem, though, is that all the operations on that linked list would always take O(n) time, because you have to do a linear search. So again, think of this as just the leaf nodes of our B+tree: we have keys, and we have values, where the value portion is just a pointer to the next key. Any time I want to find a particular key, I can't jump to the middle; I have to scan from the beginning and check as I go along. Worst case, the key I want is at the end; best case, it's the first one I look at. It's just a simple linked list.

So one way to speed things up is that rather than searching linearly, looking at one key after another, we could build a way to jump ahead in our linked list to find the key we're looking for more quickly. A really simple way to do this would be to add, for every other element, another pointer that jumps over its neighbor. So from key one, I can either go to key two or jump ahead to key three. Now if I'm looking for a particular key, say key five: I look at key one — not the one I want — jump ahead to key three — not the one I want either — jump ahead using the top level over key four to key five, and I've found what I wanted. Yes? The question is: are the leaf nodes in contiguous regions of memory? For this, assume no. His follow-up is: if we assume yes, can we just do a binary search? That would make searching faster, but it makes insertion slower, because now you've got to slide everything over. So keep this as a linked list; it's the easiest data structure to implement. And if skipping ahead like this helps us, we can cut things down further by adding another level of pointers that skips every fourth element, and so forth.
So now, to find the key we're looking for, we follow along the top pointers until we realize we've gone too far, then drop down to the next level and follow those pointers, until we reach the very bottom and either find the thing we want, or find that we've gone past where our key should be, so we know it's not there. This, at a high level, is what a skip list is. A skip list is essentially multiple levels of linked lists, with extra pointers that skip over intermediate nodes. Quick show of hands: who here has heard of a skip list before? Okay, about a third, maybe less than half. So again, the way to think about it is that we have multiple levels of linked lists, and extra pointers that allow us to jump ahead. Now, at a high level, when I show the diagram, it's going to look a lot like a B+tree, right? But we'll see some differences later on, because we never have to do any rebalancing when we insert or delete. With a linked list we never have to do splits and merges: when we want to add an entry, we just update the node that came before it to point to us, and we point to the node it used to point to. This is one of the advantages you get with a skip list.

It's not a new data structure — it dates back to 1990; it was invented by a professor at the University of Maryland. But in the last ten years, skip lists have become sort of fashionable, because you can implement one as a lock-free data structure. We can argue whether that's a good idea or not — and I'll actually show numbers suggesting it is not. But there have been a couple of systems in recent years that use skip lists. RocksDB is the log-structured system that we talked about earlier.
Instead of storing the tuples directly, they store log entries in the heap files and then build an in-memory skip list to map you from keys to offsets in the log. WiredTiger builds an in-memory skip list as an ephemeral data structure when it fetches in a page — again, the same idea: you build some way to jump to different offsets in a page very quickly. The most famous skip list database system is MemSQL. We can talk offline about how they came around to deciding to use a skip list, but their entire system is built on skip lists: they do not have a B+tree, they do not have any other index, they use nothing but skip lists.

Yes? Doesn't this make insertion more complicated? Let's go through an example of an insertion and you'll see why not. Compare this with the B+tree: in a B+tree, I do an insertion and I may have to split my node, and that recursively goes up the tree — I may have to split my parent, which may split the root, may split everything else. So with a single insertion, worst-case scenario, I have to rebuild the entire tree. In a skip list, you never have to do that. Every modification is always localized to the portion of the list you're modifying.

So the way to think about skip lists, as I said, is as a collection of linked lists. The very bottom is like the leaf nodes in the B+tree: it has to contain every single key that exists, because otherwise you'd get false negatives for keys that are actually there. So every key is always at the bottom, with pointers from one key to the next. The next level above that has pointers to every other key; the level above that, every fourth key; and so forth. So in general, at every level you'll have half as many pointers as the level below you.
And the way you decide whether to add a pointer at a level when you do an insertion is you flip a coin — you use a random number to decide whether you're going to add a pointer at that level. This makes it a probabilistic data structure, which is different from the B+tree or the radix tree (which you'll see in a second); those are deterministic data structures. That means if I take the same set of keys and insert them into my skip list, then with a different seed for the random number generator I may end up with a different physical structure for the index every time I rebuild it. In a B+tree you wouldn't have that, because it's deterministic: when I do an insert, I know where it's going to go, and — as long as you don't flip a coin when deciding whether to steal from the right sibling or the left sibling — you always end up with the exact same physical layout of the data no matter how many times you rebuild the index. In the case of the skip list, it can change every time, because of this randomness. But even though it's random, the math works out so that we get approximately O(log n) search and insert operations, the same thing we saw in the B+tree. Worst-case scenario, we get completely unlucky and never have pointers at the upper levels — but with a billion keys, that's almost impossible. So in the end we get almost the same asymptotic properties as a B+tree.

So let's look at an example. As I said, in some ways a skip list is just another B+tree. Everyone draws these diagrams as a flat list like this, but if you rotate it, in some ways it looks a lot like a binary tree.
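The coin-flipping just described is usually implemented as a tiny loop. Here's a sketch — the names `random_level`, `P`, and `MAX_LEVEL` are mine, not from any particular system; real implementations cap the tower height so the entry-point tower can be fixed-size:

```python
import random

P = 0.5          # probability of promoting a key one level higher
MAX_LEVEL = 16   # cap tower height so the head tower can be fixed-size

def random_level(rng=random):
    """Every key gets level >= 1; each extra level with probability P."""
    level = 1
    while level < MAX_LEVEL and rng.random() < P:
        level += 1
    return level

# Roughly half the keys reach level 2, a quarter reach level 3, and so
# on -- which is what gives search its expected O(log n) behavior.
levels = [random_level() for _ in range(100_000)]
```

With P = 0.5 the expected tower height is about 2, and the chance of every key in a large index staying at level 1 — the degenerate linear-scan case — is vanishingly small.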
The first thing to point out is that at the beginning we have the starting entry point for our index, with these different levels, and at the end we have markers to say we've reached the end of the linked list at a given level. At each level there's a probability that a given key has a pointer, based on how high we are: at the very bottom — the first level — the probability that we insert a pointer for a particular key is one; above that it's one half; above that, one quarter; and so forth. The very bottom, again, is the linked list where you have to have every key. Each key is represented as a triple: the key that you inserted, the value (which would be the record pointer to the actual tuple), and a pointer to your neighbor in the linked list. In the levels above that, the record pointer is replaced with a pointer down to the same key in the level below. So in order to have a key at an upper level, you obviously have to have that key at the level below. This vertical strip of the same key across different levels is called a tower in a skip list. That means I can't have a key up here in level three without having that key in level two. In this diagram, no keys have ended up at the top level yet, so it just points to the end marker. In reality, you keep a counter of the highest level in use, so that when you do an insertion or a lookup, you know where to start: if the highest level I have is the second level, I always start at level two; I never start at level three.

All right, let's see an example of how to do an insertion. Say I want to insert key five, and I've conveniently left a space for key five in my diagram.
So to do the insertion, we flip a coin to decide how many levels we want to add our new key into. We always have a key at the first level; we flip a coin for the second level and it comes up heads, so we insert it there; we flip again for the third level and it comes up heads, so we insert it there too; we flip a third time and it comes up tails, so we don't add another level. We keep going until we hit tails. So that means we want to add key five at these three levels, like this. At this point I've allocated the memory and created my triplets to store the key, but it's not fully integrated into the index, because the keys that come before it in the list are still pointing to key six. Key four points to key six on the first level, and key four points to the end marker on the second level. So the first thing I do in my tower is add the pointers going down, so that anybody who lands at a higher level can always reach the bottom. Then, going from the bottom to the top, I modify the key that comes before me to now point to me. At this point the key is integrated into the index, meaning anybody that scans along the bottom will find us. But what would happen if we scanned across the top, would we find it? No, right? Is that still okay? You're nodding yes, and you seem real emphatic about it. Why? Because we're going to eventually add it? But think of this scenario: at this point the index looks like this, key five is there, and anybody that just comes along the bottom and scans along here would find it.
But anybody that starts at the second level and scans along here looking for key five, this thing still points to the end, so they'd say, well, there's nothing beyond key four at this level, and just skip past it. They would miss it; it would be a false negative. I'm not going to explain this all in detail now, but it sets us up for next class, and later on for when we talk about transactions. It is okay to have this. This is a higher-level, logical thing, where transactions may not see all the changes immediately, and that's okay. What's not okay is a corrupted data structure. If this thing pointed to garbage, or key six pointed to garbage, and some thread came along, followed that pointer, and started reading garbage, that would be bad, because we would crash; we'd read something we shouldn't be reading. But in this case the data structure is sound: all the pointers go to the correct things. It's just that a thread that happened to be here at the moment this thing got flipped in might miss the key and keep going. Logically that's okay; we can rectify it later on. But physically we never want a pointer to nothing. So this is a good distinction between a consistent data structure and a corrupted data structure. The higher-level semantics of what our queries are doing are allowed to miss this change, but we obviously don't want to crash. We'll cover more of this on Wednesday, where we'll mostly focus on how to make our data structures thread-safe. When we get to transactions, we'll talk about the problem of one transaction not seeing changes from another transaction that it probably should. The main thing I'm trying to point out here is that swapping the pointers like this may not make the data visible to every thread at this exact moment, but that's okay.
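On real hardware, the pointer flip being described here is a single atomic compare-and-swap instruction. Python has no such primitive, but its check-then-replace logic can be emulated to show why a writer with a stale view of the list loses the race; everything below is a hypothetical illustration with names of my own:

```python
class Node:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next

def compare_and_swap(node, expected, new):
    """Emulated CAS on node.next: succeed only if the pointer still
    holds the value this caller observed earlier."""
    if node.next is expected:
        node.next = new
        return True
    return False

# Bottom level looks like: 4 -> 6. Two inserters both read that link.
key6 = Node(6)
key4 = Node(4, key6)
key5 = Node(5, key6)      # inserter A prepared 4 -> 5 -> 6
key45 = Node(4.5, key6)   # inserter B prepared 4 -> 4.5 -> 6

won_a = compare_and_swap(key4, key6, key5)   # A's expectation still holds
won_b = compare_and_swap(key4, key6, key45)  # B's view is now stale, must retry
```

Only one of the two swaps can observe the expected pointer, so the list is never left pointing at garbage; the loser simply re-reads and retries.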
I could avoid all this by having a single latch on the entire data structure so that no thread could ever read it while I modify it, but that would be slow, because everyone gets bottlenecked on that latch. So instead we just do a single compare-and-swap here, allow other threads to read slightly stale data, and that's okay. So again, going from the bottom to the top: I go to the next level and swap my pointer, so now anybody that comes along sees us; then the level above gets swapped in, and now anybody can find us. Our key is fully integrated. Does that answer your question about inserts? Okay. Yes? The question is, how do you know which node is before you? To do an insert you basically do a search to figure out where you should go, and you keep track of the predecessors along the way. Yes? The question is, does that mean you do a search on every level? Yes; let's go through that example in a moment. Sorry? Your point is that the gap between two nodes at a given level is not always the same: previously, at the second level, the gap between key two and key four was two, but now the gap between key four and key five is one. Right, and at this level key four's pointer actually jumps over zero keys. Again, it's a probabilistic data structure. I flip the coin to decide how high my tower will be when I insert, and by random chance it said it wanted level two and level three here, so the gap happens to be zero.
In practice, with a really large data set, the randomness works out so that on average you're skipping at least one key; in this particular example it doesn't. Think about the alternative: if you enforced that every time you add a key at a level, its pointer must jump over at least one key, then when I insert, say, key 4.5 here, I might have to go modify other parts of the structure to restore that property. Because it's random, I don't care whether a pointer skips exactly one key, or zero, or something large; that's what keeps every change localized. Yes? The question is, if you insert 4.5, do you flip a coin for its levels too? Yes: just like when I added key five, initially nobody points to the new key, and then at every level, going from the bottom to the top, I swap the predecessor's pointer to point to me. Yes? The question is whether the cost of doing the swap is small. Absolutely. If it's an in-memory index, it's a compare-and-swap; on a modern CPU that's a single instruction, it's super cheap, and it's done atomically. This relates to the comment from before: if everything is packed together in pages, or contiguous in memory, then maybe you don't have to swap pointers as much. But the way people typically implement these as a first pass is that every node is its own chunk of memory, and then you do have to swap this pointer.
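Putting the whole insert together: flip for a tower height, remember the predecessor at each level on the way down, then splice the new node in from the bottom level upward. A single-threaded sketch, with class and method names of my own; a real lock-free version would replace the plain pointer assignments with compare-and-swap:

```python
import random

class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height   # next[i] = successor at level i

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        # Sentinel head node tall enough for any tower.
        self.head = SkipNode(None, self.MAX_LEVEL)
        self.height = 1               # highest level currently in use

    def _random_height(self):
        h = 1
        while h < self.MAX_LEVEL and random.random() < 0.5:
            h += 1
        return h

    def insert(self, key):
        # Walk down from the top, remembering the predecessor per level.
        preds = [self.head] * self.MAX_LEVEL
        node = self.head
        for lvl in range(self.height - 1, -1, -1):
            while node.next[lvl] is not None and node.next[lvl].key < key:
                node = node.next[lvl]
            preds[lvl] = node
        h = self._random_height()
        self.height = max(self.height, h)
        new = SkipNode(key, h)
        # Splice in from the bottom level to the top, so a reader that
        # finds the node at an upper level can always reach the bottom.
        for lvl in range(h):
            new.next[lvl] = preds[lvl].next[lvl]
            preds[lvl].next[lvl] = new

    def keys(self):
        out, node = [], self.head.next[0]
        while node is not None:
            out.append(node.key)
            node = node.next[0]
        return out
```

Splicing bottom-up preserves the invariant the lecture keeps stressing: anyone who reaches the key at an upper level can always follow the tower down.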
If you cluster or group nodes together you may not have to actually swap a pointer, but in our example here you do, and it's cheap to do. And again, if another thread comes along right then and misses my key, that's okay; eventually I swap the rest of the pointers and I'm fully integrated. The reason we go from the bottom to the top is that if someone is trying to insert key 4.5 at the same time I'm trying to insert key 5, and we both do the compare-and-swap, only one of us is going to win. The compare-and-swap says: check whether the value in memory equals what I think it should equal; if yes, replace it with my new value; if no, fail. So if both of us do the compare-and-swap thinking that key four points to key six, only one of us succeeds; the other has to back off and retry. So doing this is really cheap. Now let's see how we do a search. I want to find key 3. At this point I know my data structure has three levels, so I start at the highest level.
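The search about to be walked through, which moves right while the next key is still smaller than the target and drops down a level when it overshoots, looks roughly like this sketch over a hand-built three-level list (the layout and names are mine, not the lecture's diagram):

```python
class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height

def search(head, height, target):
    """Top-down skip list search: go right while the next key is
    smaller than the target, otherwise drop down a level."""
    node = head
    for lvl in range(height - 1, -1, -1):
        while node.next[lvl] is not None and node.next[lvl].key < target:
            node = node.next[lvl]
    # node is now the predecessor of target at the bottom level.
    candidate = node.next[0]
    return candidate is not None and candidate.key == target

# Hand-built list: bottom level holds 1..6; only some keys have towers.
head = SkipNode(None, 3)
nodes = {k: SkipNode(k, h)
         for k, h in [(1, 1), (2, 2), (3, 1), (4, 2), (5, 3), (6, 1)]}
prev = {lvl: head for lvl in range(3)}
for k in sorted(nodes):
    n = nodes[k]
    for lvl in range(len(n.next)):
        prev[lvl].next[lvl] = n
        prev[lvl] = n
```

The upper levels only rule out regions quickly; the final answer always comes from the bottom-level linked list, which holds every key.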
So I look ahead at the pointer and it points to key five. Key five is greater than key three, so I can't continue along this level of the skip list; I go down to the next level. Same thing: I follow along the path, see that key three is greater than key two, so I do want to jump ahead; then I look ahead to the next key, which is key four, but key three is less than key four, so I go down again. Now I just scan across the bottom level until I find the thing I'm looking for. So you're stepping down levels, trying to go as far as you can horizontally, because that means you're skipping more keys at the bottom; but at some point you hit a boundary where the key you're looking for is less than the next key, so you go down a level, and eventually you reach the very bottom and it's just a linear scan to find what you want. So that's pretty straightforward. Now let's do deletion. The thing I want to focus on here is the distinction between logical deletes and physical deletes. With a logical delete, any thread that comes along will not see the key I just deleted, but physically it's still there in memory, still inside my pages. That's different from what we saw in the B-plus tree: when we did a delete there, we actually took the key out, and we might have to do a merge to reshuffle things accordingly. So what I'm going to do now is, along the bottom level, add to each key a simple Boolean flag that says whether it's been deleted, true or false. Now say I want to delete key five. I do my search just like before, walk down until I find the bottom-level entry for the key I want to delete, and then I just flip its flag to true. At this point, any thread that comes along looking for key five would say, oh, the delete flag is
true, so I should just ignore it; but physically it's still there. Then I start removing the pointers in the reverse direction, from the top to the bottom: I unlink it from the top level, then the second level, and so forth, and once I reach the very bottom, anybody scanning along is rerouted around me. The reason we do it this way is that we want everyone to know the key has been deleted as soon as possible; we want to minimize the window where someone might get a false positive. Since we have to unlink from the top to the bottom, we first flip the flag to true so that, as we go down the tower unlinking things, anybody that comes along can just ignore the key it sees; otherwise the key wouldn't look deleted until we reached the very bottom. We're not going to talk about garbage collection or thread safety here, but once you know that no thread could possibly be looking at your node, it's okay to free the memory. Again, this avoids having people read corrupted data. All right, so to finish up skip lists. The main advantage is that insertions and deletions don't require any major rebalancing: all the changes are localized to just that part of the skip list. It's not like the B-plus tree, where you recursively go up doing splits and merges that reshuffle everything. In practice, a skip list can potentially use less memory than a typical B-plus tree, if the B-plus tree doesn't use the optimizations or the compression we talked about last class. You only get that benefit, though, if you don't include reverse pointers. As I showed, the lowest-level linked list only goes in one direction; if I want to do a reverse scan, I can't, because I don't have pointers in the other direction. And if I add pointers in the other direction, then I can't do that compare-and-swap
atomically the way I talked about, and I have to store extra space for those pointers. In the case of MemSQL, if you find the early blog post where they announced, hey, we're a new in-memory database system and we use skip lists because they're lock-free, the first thing people pointed out was that the skip list can't be scanned in reverse. The MemSQL guys came back and said, well, the way you solve that is to build a separate second skip list in reverse order, which defeats the purpose of the memory savings you get from the skip list. So the downsides of skip lists are that they're not disk- or cache-friendly, as he was asking about before, because you're chasing pointers and jumping to different memory locations every time you traverse; and, as I said, reverse search is not trivial to implement. You have to do something extra, which we cover in the advanced class, 15-721. In the advanced class, for the last two years in the spring, the second programming project was to implement your own thread-safe, lock-free skip list. I'm debating whether to do that again this year; I don't know how interesting it is, and skip lists are not that common. Anyway, any questions about skip lists? Yes. The question is, do they still work with the buffer pool? Yes: if you want the index to be backed by disk, you have to organize it in pages. Anything you want to be able to spill to disk has to go through the buffer pool and be organized in pages. So you'd pack multiple keys into a single page, but that does require some rebalancing if inserts expand a page and you need to move keys left and right. Yes. The question is, will the level of the skip list increase as we insert new things, since the probability that some tower reaches a really high level increases the more keys you have. So when will we decide to
increase the level? It's random: every time I insert, I flip that coin and keep going until it comes up tails. It's a random, or probabilistic, data structure. Yes. The question is, why would you want to use randomness to build this data structure at all, versus a B-plus tree? For this particular data structure, what's the alternative? The alternative would be some heuristic that says, for a key that looks like this, add so many levels. And remember the issue I mentioned before: I don't know what all the keys are going to be ahead of time, so I may pick a bad heuristic. Say keys where key mod 2 equals 0 go to level 2, and so on; that may be bad, because I may never see keys with those particular values, and I'd just end up with a plain linked list at the bottom. By making it random, you make a good best effort for all possible domains of values, and for really large key spaces it works out fine: with a billion keys it behaves almost exactly like a B-plus tree. All right, radix trees. Radix trees are less common, but what's interesting about them is that they have different properties than a B-plus tree or a skip list, so I think they're worth discussing; and the current research trend is that people find these things very interesting, and a lot of the newer systems are exploring them. I'll also say ahead of time: there's an in-memory database system out of Germany called HyPer that is very influential, state of the art, and got bought by Tableau last year. They use a radix tree as their main data structure; they don't have B-plus trees. I also know the DataStax guys working on Cassandra are super keen on radix trees and tries; the last time I talked to them they were looking at integrating tries all throughout Cassandra, which I think is really interesting. So a radix tree: the way to
think about it is that we represent the keys in our index by their individual digits rather than by the entire key. By digit I don't necessarily mean a number like one, two, or three: if I have a string of characters, each character is a digit, and if I have a number, each digit position is a digit. The reason we do this is that now we can compare the digits of keys in prefix order, one by one, rather than having to examine the entire key. You already do this in some ways with certain functions, like string comparison: it walks through the characters from the beginning of the string, compares them, and stops as soon as you don't have a match. But now the data structure itself is designed to do this very efficiently. What's interesting is that the height of the tree is no longer dependent on the number of keys you have. In a B-plus tree, if I insert a billion keys, then depending on my fan-out, meaning how many entries I pack into a node, that determines the number of levels I have. In a radix tree, the height of the tree depends on the length of the keys. So I could have a billion keys that are short strings and get a shallow tree, and another billion keys that are really long strings and get a much deeper tree. A radix tree also doesn't require any major rebalancing or reshuffling the way we do splits and merges in a B-plus tree. And instead of storing the entire key at every single level, we just store these digits, and we can reconstruct the key from the path we take down the tree. Quick show of hands: who here has heard of a radix tree before? Even fewer. A Patricia tree? Anybody? Who here has heard of a trie? All right,
there we go, good, excellent. So a radix tree is just a variant of a trie. Okay, so here's a trie. I have three keys: hello, hat, and have. In this visual representation of the trie, every edge represents a single character of a key. At the root I have H, because all three keys start with H. In the case of hello, I have a path down H-E-L-L-O, and I can reconstruct the key by following that path. So if I want to check whether the key hello is in my index, I start at the root, follow H, then E, L, L, O, and reach my record pointer at the bottom. In the case of hat and have, the second character of both is A, so I only need one entry for A, and then I have my split point: V for have and T for hat. Tries were first described back in 1959 by some French dude, and a couple of years later a professor, Edward Fredkin, coined the term "trie" from "retrieval tree." He's apparently faculty at CMU; I think he's retired, he's a famous dude, he doesn't show up at any faculty meetings, but apparently he's here. I don't think he's in the directory. So a radix tree is basically a compressed version of the trie. In the trie, the characters E-L-L-O after the H were specific to hello, no other key shared them, but the trie still had that entire path, one node per character. In a radix tree you recognize that nobody else is storing E-L-L-O, so you collapse it into a single edge with that whole substring packed together. Same thing for hat and have: they both share the A, so I have my one entry for A, and then V-E for have sits there by itself since nobody else shares it. So a radix tree is a compressed version of a trie. In the literature, or on Wikipedia, they're sometimes called Patricia trees; that's not named after a person, it's an acronym from the fifties or sixties, I don't remember exactly. Sometimes in the literature you'll also see the related suffix arrays or suffix trees; the basic idea is the same, and
in databases we call them radix trees. So let's see how you'd actually store this. Nobody really stores it in that edge-oriented diagram I showed, with the digits on the edges; you store the digits in nodes, and your nodes have to fit in pages. The way you'd actually implement this is to pack as much data as you can into a single page at each level of the tree. In this case, for the second level, I have E-L-L-O for hello, and the A shared by hat and have, and I pack them into a single page. Now say I want to insert a new key, "hair." I recognize that the H and A are shared, so I follow along to my third level, see that I already have space, and pack the I-R in there. We may have to do merges, and we may have to do splits as well, but we never have to rebalance the way you do in a B-plus tree. Now say I want to delete hat and have: I find my two entries and delete them, and then I recognize that the I-R is there by itself. I can leave it where it is, and everything is still correct, or, if I want to be aggressive about reusing memory or saving space, I can compress it and move it back up. So this sounds awesome, right? Because now we're using way less space to store our keys. If this were a B-plus tree storing these same keys, going back to the earlier example, the root might store hello or hat in its entire form, and then down in the leaf nodes I'd have the keys in their entire form again. Here I don't have to do that: I essentially store the key only once, as a path down to a record pointer. But can I point out the obvious downside of
this, though? What's the one operation I said was really efficient in a B-plus tree? You made the hand gesture; sorry? Yes, sequential scans along the leaf nodes. How would I do that here? I can't, right? If I scan along the bottom entries, I don't know what the key is, because the key is not stored down there with me. I essentially have to keep a stack of how I traversed the tree, and go back up and come back down in order to scan along the leaf nodes until I reach my end point. So this is going to be way more efficient for storing keys, point queries are much more efficient (in a second we'll see why), and insertions are much more efficient as well, but scans are going to be much slower, because we have to keep backtracking. The major thing to talk about, though, is how we're actually going to store keys in our radix tree. Not all the attributes we want to index can just be split up by their digits or characters and stored directly; you have to do some transformations to put them in a form amenable to the digit-by-digit comparison we want to do, and this has a lot to do with how the underlying CPU represents data. Say you have an unsigned integer. We'd like to split it up by bytes and store those bytes as the digits in our tree, because we can do really efficient single-instruction comparisons on bytes. The problem is that on Intel CPUs, which are little-endian, if we go from left to right in memory we're actually looking at the least significant byte first, and that's going to give incorrect comparisons. So we need to flip it and store it in big-endian form, and store the prefixes that way; I'll show an example of what that looks like in a second. Same problem for signed integers: we have the two's-complement sign bit in front, so we have to flip that to make sure that negative numbers sort before positive
numbers, and then we can store everything else in the big-endian form we just talked about. Floats are more complicated: you have to classify whether the value is positive or negative, normalized or denormalized, and then essentially store it as an unsigned integer. The hardware's floating-point representation keeps track of where the decimal point is, and 33.0 is greater than 3.30, so we need to preserve that position in our encoding so that comparisons come out correctly; we have to normalize everything. For compound keys, say an index on two integers, I just do my transformation on each one and concatenate them, and everything works out fine. Now let's look at the unsigned-integer problem. Say I want to store this key, 168 million, as an unsigned 32-bit integer; unsigned, so we don't have to worry about two's complement here. I can represent it as four one-byte hex codes, which would look like this. Now if I store those bytes in big-endian versus little-endian order, we see that we get two totally different sort orders. In big-endian, the most significant byte comes first; in little-endian it comes last. So say I want to compare whether 3 is less than 168 million. If the keys were stored little-endian, I wouldn't know until I got to the very last byte; but in big-endian I'd know at the very first byte that yes, 3 is less than 168 million, and I can terminate the search immediately. So it would look like this in our tree: this is the path down to 168 million, one byte per level. But if I stored the keys in little-endian form, the first digit of the integer 3 up here would be the hex code 0x03, while the first digit of 168 million would be 0x00, so comparing digit by digit would say 3 is greater than 168 million, and therefore it would be incorrect, so
we flip it around when we store keys on little-endian CPUs, to make sure we can do fast, correct comparisons. So if you do this for all the basic data types we want to index, all right, hopefully this is clear: it's something about the architecture we have to be mindful of, and we transform our data to make it amenable to storing in a radix tree. Now, for this entire class I haven't shown you any benchmark numbers, because I want to focus on the fundamentals of the databases we're talking about in this course. But in case you get the wrong idea about how amazing skip lists are, I want to show some benchmark results from our own research that show it's not actually the case. This is a comparison we published at SIGMOD this year, written by a PhD student here at CMU. We compared a data structure we implemented called the Bw-Tree, something originally invented by Microsoft for the Hekaton engine in SQL Server, from five or six years ago. When I showed up at CMU I thought it was an amazing thing, and since there was no open-source version, we set out to build our own open-source Bw-Tree. It's a lock-free data structure, and we wanted to see how well it compares against other existing data structures. The spoiler is that it sucks and gets crushed. We compared the Bw-Tree against a B-plus tree, a skip list, and a radix tree. This is on a single-socket CPU, running with 10 cores and 20 threads, doing insert-only, read-only, and read-and-update workloads. What you see is that the radix tree crushes everyone; it's just so fast to break these values up into digits and do insertions, lookups, and updates. The skip list always loses, and the Bw-Tree turns out to almost always lose to a regular B-plus tree. Actually, the skip list is doing okay on this slide; that might be wrong, these bars might be swapped. There's no way. Sorry, my fault, I should have proofread this; remind me to go check the paper and make sure the slide matches it. So: the radix tree blows everyone away, because the amount of work you have to do for comparisons is so much less than in a B-plus tree, and for insertions you don't need the splits and merges. The skip list, and this is an in-memory database so we're not worried about reading from the buffer pool, always loses out because of the cache misses and indirection from chasing pointers. And the Bw-Tree just has so much overhead to make it lock-free. If you take the advanced class, we'll discuss this: lock-free sounds like an amazing thing we'd want for all data structures, but at the end of the day a solid B-plus tree with latching, which we'll talk about on Wednesday, outperforms it. The radix tree is using latching as well; it's not lock-free. Okay, any questions? Yes. The question is, for the radix tree, when I showed the collapsing of one node into another, is that done only at the leaf nodes, or anywhere in the tree? The textbook definition of a radix tree doesn't actually do any of this collapsing; in a practical implementation you'd want to do it. In the HyPer system, the German system I mentioned that was bought by Tableau, they have a paper showing how to adaptively do this collapsing and splitting at all different levels. This is one of those cases where the radix tree was an old idea that had been around for a while, and it wasn't until people actually said, this seems like a good idea, let's try to make one work in practice, that everyone recognized it was worth revisiting. Yes. The question is, is the radix tree as fast for sequential scans? I don't have the graph here, but the answer is no: the B-plus tree is always going to win. The skip list does okay too, because you're just following the bottom level, but the pointer chasing always loses out. We have five minutes left, so we'll do a crash course on inverted indexes. So all the tree
indexes we've talked about so far are really good for what are called point queries and range queries. A point query is finding the customer record with zip code 15217; a range query is finding all the orders within a particular time range, say from June 2018 to September 2018. But these data structures are not good for keyword searches. If I want to find all the Wikipedia articles that contain the word "Pablo," I can't use my index for that. Think about how you'd actually store a Wikipedia article. We saw this before: this is the sample schema from the real Wikipedia software. We have user accounts and pages, and there are revisions of pages, and the actual content of the article is stored here, in this content field, as a text field. I could build an index on a text field, but that's not what we want, because that index only supports exact matches against everything in the text field; and actually, I think the database won't even let you do it. It's a terrible idea anyway, because these text fields are going to be kilobytes. It's always some Star Wars article that's the longest Wikipedia article; it might be Pokémon now. If I build an index on that, I'm storing the entire Pokémon article as a key, and then I can only use it for exact matches. So this is not what we want: even if I build my index on content, it won't help for this particular query, because I want a partial match. In actuality, this isn't even the right SQL either: I want to find the word "Pablo," but a LIKE with wildcards will find any substring that matches. A lot of times, when people use these database systems, they want: show me the exact word "Obama," not "Obama" buried inside some other word. So this is what an inverted index does. An inverted index maps words within a record, or within the attribute we're indexing, to the actual record itself. The reason
why it's called inverted is that instead of storing the whole document, these Wikipedia articles, you invert it: you store pointers to the document based on what words it uses. In the literature these are sometimes called, in these databases, full-text search indexes. If you go back to the 1800s, these were called concordances. A concordance would be — somebody in the 1800s, some woman, spent 16 years making an inverted index to track every single word that Shakespeare used in his entire bibliography. But nobody calls them that anymore; now we usually call them inverted indexes. We're not going to have time to go into the actual implementation of this, but essentially you can think about it as: we can use some tree indexes to find the words that we want, and then some additional metadata to say here's where the word can be found, or here's what words are close to it. All the major databases support full-text search indexes in some form or another. The commercial guys obviously have much better packages. Postgres has this thing called GIN, the Generalized Inverted Index I think it's called, and it has a bunch of different ways you can use B+trees or hash indexes to find the data structure itself, and then they have extensions to SQL to allow you to do more complicated things than the LIKE query I showed before. So all the major systems have these, but there are also specialized database systems that are designed explicitly for these full-text searches. Elasticsearch is probably the most famous one; it's built on top of Lucene. There's another project called Sphinx, and then Apache Solr, I think, might also be built on top of Lucene. But again, they're designed to be doing searches on inverted indexes instead of doing searches on B+trees. The different types of queries we want to execute go beyond what we saw before. I want to do phrase searches: I want to find exact phrases in my
text. I want to do proximity searches, so I want to say: find me all documents, or all records, where one word is within three words of another word — I don't care what words come in between, but I know it's exactly three words. You want to do wildcard searches, or regular expressions, to find more complicated patterns you want to match in your text. The various systems all sort of have their own proprietary extensions to expose this functionality. I think there's a SQL standard that says how you can write these kinds of queries, but I don't think anyone actually follows it very well. When you actually implement this, what distinguishes one implementation from the next is what information they're storing. At the very least, the most minimal thing you need to store is just an inverted index that maps the words to the records. And how you define a word can depend on how you actually split the text up: do you split on punctuation or on spaces, and sometimes there are stop words, like the word "the", that you always strip out — things like that. But you can also store things like how often the word appears, and where it actually appears, about the data you're indexing. And then the big thing we care about in a database system is when we actually update it. You could do this only in a batch mode, where you say: take all my documents, or my records, and build my index once every day. But ideally you want to keep it up to date, so we could maintain a sort of auxiliary data structure to keep track of all the changes and eventually merge them in as a batch. Again, the different systems do different things. So the main takeaway from this is that there are additional data structures we could build for our table indexes that allow you to do more complicated things than the B+tree or the hash indexes or the skip lists and the radix trees that we've seen before, because all of those are doing key matches; this allows us to do way more complicated things. Again, I realize I'm going fast on this, but if you're really interested in
inverted indexes, then we have a whole course at CMU, 11-442 or 11-642 I think, which is called Search Engines. The inverted index is essentially the underlying data structure, or the implementation of the data store, of how search engines are implemented. And as I said, they store way more complicated things than just the actual key, in order to do rank matching and other things that you may want to do in your system. Okay, so it's my opinion that B+trees are still the best way to go for doing indexes. Radix trees are promising, and I think they're going to catch on in the future, but with the B+tree, there's a reason why it was called, in 1979, the "ubiquitous B-tree": it's so good that even now, even with lock-free data structures, it's always the way to go. We didn't have time to also discuss geospatial indexes. So think of indexes where you want to do matching on geospatial data: I have a map of the United States and it's broken up by state, and I want to know whether a given point is in a state or not. You can't use a B+tree for that, because again, it's not doing key matching the way we've talked about so far. There are specialized data structures — called geospatial data structures, or multimedia data structures, or multi-dimensional data structures — that can handle these kinds of things. I just want you to be aware that they exist: R-trees, quad-trees, k-d trees. If you want to learn more about that, there's a class, 15-826, taught by Christos Faloutsos, the guy who's on tour with his Iggy Pop cover band. I don't think he's back in the spring, but it might be offered in fall 2019. Okay, any questions? We're done. Next class we'll talk about how to make all these indexes thread-safe. We'll be focused on crabbing, because you'll need to know this for project two, which I'll be presenting as well on Wednesday. Okay, alright guys, have a good week.
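To make the node-collapsing question from earlier concrete, here is a toy sketch of path compression in a radix tree. This is hypothetical, simplified code — real implementations like ART in HyPer use adaptive node layouts and operate on raw bytes, not Python strings — but it shows the core idea: each edge stores a whole collapsed substring, and inserting a key that only partially matches an edge splits that edge.

```python
# Toy sketch of path compression ("node collapsing") in a radix tree.
# Hypothetical, simplified code -- real systems (e.g. ART in HyPer) use
# adaptive node layouts and work on bytes, not Python strings.

class RadixNode:
    def __init__(self):
        self.children = {}   # first char -> (collapsed edge label, child)
        self.value = None    # payload if a key terminates at this node

class RadixTree:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, key, value):
        node = self.root
        while True:
            if not key:                      # key fully consumed
                node.value = value
                return
            entry = node.children.get(key[0])
            if entry is None:
                # No matching edge: one collapsed edge holds the entire
                # remaining key instead of a chain of one-child nodes.
                leaf = RadixNode()
                leaf.value = value
                node.children[key[0]] = (key, leaf)
                return
            label, child = entry
            # Longest common prefix of the key and the edge label.
            i = 0
            while i < len(label) and i < len(key) and label[i] == key[i]:
                i += 1
            if i == len(label):              # edge fully matched: descend
                node, key = child, key[i:]
                continue
            # Partial match: split the collapsed edge at position i.
            mid = RadixNode()
            mid.children[label[i]] = (label[i:], child)
            node.children[key[0]] = (label[:i], mid)
            node, key = mid, key[i:]

    def get(self, key):
        node = self.root
        while key:
            entry = node.children.get(key[0])
            if entry is None:
                return None
            label, child = entry
            if not key.startswith(label):
                return None
            node, key = child, key[len(label):]
        return node.value
```

With keys "hello" and "help", the tree stores one shared edge "hel" plus two short edges "lo" and "p", rather than one node per character — that shared-prefix collapsing is what the adaptive splitting in the HyPer paper generalizes to all levels of the tree.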
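The inverted-index discussion above can also be made concrete with a small sketch. This is hypothetical toy code — not how Postgres GIN or Lucene actually lay out their data — but it shows the minimal structure the lecture describes: a map from words to the records containing them, extended with word positions so that keyword and phrase queries work without scanning the documents themselves.

```python
# Toy sketch of an inverted index with word positions (hypothetical
# code, not how Postgres GIN or Lucene actually store their data).
from collections import defaultdict

docs = {
    1: "the quick brown fox jumps",
    2: "the lazy brown dog",
    3: "quick brown thinking wins",
}

# word -> {doc_id -> [positions]}: enough metadata for keyword and
# phrase queries without touching the original documents.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, word in enumerate(text.lower().split()):
        index[word].setdefault(doc_id, []).append(pos)

def keyword_search(*words):
    # Records containing every query word: intersect posting lists.
    postings = [set(index.get(w, {})) for w in words]
    return sorted(set.intersection(*postings)) if postings else []

def phrase_search(*words):
    # Records where the words appear at consecutive positions.
    hits = []
    for doc_id in keyword_search(*words):
        for start in index[words[0]][doc_id]:
            if all(start + k in index[w][doc_id]
                   for k, w in enumerate(words)):
                hits.append(doc_id)
                break
    return hits
```

Here `keyword_search("quick", "brown")` finds documents 1 and 3, and `phrase_search("brown", "fox")` finds only document 1, since positions are checked for adjacency; a proximity search like the "within three words" example from the lecture would compare the same position lists with a distance bound instead of requiring consecutive positions.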