So, a reminder again: project number one is due today at midnight. We've been working on the scoreboard to get the sort ordering correct, because obviously you want the person with the fastest implementation to be at the top. But if you have the fastest implementation, you have the lowest run time, so we have to normalize it by flipping things. And I think currently Joy is the fastest. That's not good if the TA is beating you, right? As we said, we will give the top five implementations some bonus points: everyone above some threshold gets the full points, and then anybody who does a bit better than that gets some bonus points. And if Joy is within the top five, we'll look at the top six, okay? I know there were some issues where you would submit things and the scoreboard wouldn't refresh correctly, and as far as I know that's all been fixed, right, Joy? Yeah. So he wrote this in his underwear back in January, in his parents' basement in India, and he's beating you. That's not good, right? Although it is Joy. Okay, so any questions about project number one? Again, due at midnight, and for every 24 hours that you're late you lose 10%. But I think most people have submitted so far.

So for today's class we're going to switch from what we talked about last week. Last week was all about indexing for OLTP operations. Now we're going to focus on specialized indexes designed to make analytical queries run faster, and we'll see what the trade-offs are and what kinds of things we care about in this environment versus the OLTP stuff we were looking at before. The paper you read was for Microsoft SQL Server; I think they talked about it being in the Denali release, but the code name of the project is Apollo. It's basically adding columnar indexes to SQL Server to make it run faster for OLAP queries. I'll flip back and forth between calling them projection indexes or columnar indexes, but they're basically the same thing. Then I'll spend some time and segue into talking about bitmap indexes, which are another way to succinctly represent data in a column store system. Part of the reason I didn't have you read a paper about bitmap indexes is that I couldn't find a good one that really lays out all the nitty-gritty details of the approach, so we'll go over what you need to know about them at a high level. Then we'll finish up in the remaining time and talk about project number two, which is now out on the website and which everyone will start after today.

Okay, so in the paper they talk about doing analytical operations on decision support systems, or DSS applications. The basic idea of a decision support system is that it's an application designed to help humans make decisions about their organization. You look at the historical data of everything you've collected from the front-end OLTP application: all your transactions are updating the database and adding new information, and then you want to look at all that history, ask questions, and make decisions based on it. The example Stonebraker always likes to use is: say you're Walmart, you have a database of everything anybody has ever bought at a Walmart store, and you know there's a hurricane coming up in Florida.
So you want to look back in history and say: the last time there was a major hurricane in this area, what were the most-bought items one week before the hurricane and one week afterwards? Then you look at the listing of those items and use that to make decisions about how to stock your store. So essentially these DSS systems provide you with big tables, and it's still up to you to infer meaning from them. This is different from the machine learning / data science stuff that's the hot thing now, because in a machine learning system, instead of giving you a table of all the items that have been bought, you compute a model that can predict things: you can ask what you should be stocking and make better decisions that way. Instead of a table, you have a more compact representation of the data. In these DSS systems we're just going to be doing joins and aggregations; we're not doing the linear algebra you would do in a machine learning system. That's a whole end of the ball game we're going to ignore for now.

Another thing in the paper is that they refer to things being either a star schema or a snowflake schema, or a star schema query, so I want to give a quick high-level overview of what that actually means. In a star schema, the basic idea is that you have some center table, called the fact table, that contains all of the events you've been recording for your application. So again, if you're Walmart, your fact table gets a new event every single time somebody scans something at a checkout line, recording the thing that the person bought. Around the fact table you have what are called dimension tables, and these dimension tables hold the metadata or additional data about the things in the fact table. For example, you would not want to embed the name of every single product in the fact table; you would normalize it out into one of these dimension tables. So in this part here, the fact table has foreign key pointers to the dimension tables, plus some additional information that's unique to the event that occurred. In this case, we have a foreign key reference to the product table, and then a product dimension table on the outside that contains all the information about the item the person bought. We do this because someone is going to buy a product a million times, and we don't want a denormalized table with that product information stored a million times, once for every event. The fact table could have billions of rows, whereas the dimension tables are usually smaller.

This is different from a snowflake schema. In a snowflake schema, you can expand the dimension tables out to have their own dimensions as well. So where before the product dimension had some information about what category the product is in, who the manufacturer is, and so on, we can expand that out into lookup tables hanging off the dimension table with the additional information. It's the same idea: in the star schema you only have one level of dimensions on the outside; in the snowflake schema you can have an arbitrary number of levels. So what are some trade-offs here?
Why would you want to use a snowflake schema versus a star schema? When would a star schema be faster than a snowflake schema, and why would it be faster? Again, think about the analytical queries you're doing in a decision support system. You want to say, assuming you're Walmart: what are all the items somebody bought before a hurricane? Well, here are all the sales of things; then I want to go beyond that, join this with the product table to see what those products were, and then maybe infer some additional information, like who the manufacturer was and can I get the items to the store in time, or how many do I have in stock, and things like that. So typically, in most enterprise applications, you would use a star schema because you reduce the number of joins you have to do. Yes, you're denormalizing the product category table into the product dimension table, and yes, you have some repeated values, but the number of products Walmart sells pales in comparison to the number of items it has ever sold in total. So we denormalize these two things and pay a little storage overhead, but now we don't have to pay the runtime cost of an n-way join when we execute queries; we can do two-way joins between the fact table and the dimension tables.

So this is what the paper you read is trying to solve: how do you take a star schema or a snowflake schema and run really fast on it? Without the columnar indexes they propose, you could pretty much only use B+trees, and B+trees have problems when you have really large tables. Let's say we have a table of all the customers who have been to our store, and we're storing some basic metadata about them. Here are all the items they bought: we have a sales fact table, we combine it with the dimension table for customers, and let's say we keep track of what zip code each customer lives in. Now if you want to run a query that says: count the number of items bought by people who live in zip code 15217, the two choices are to either do a sequential scan across the entire customer dimension to find all the customers in 15217 and then build some kind of hash table to do a hash join with the fact table, or to have a B+tree on the zip code to quickly find, for a single zip code, all the customers in it. But now say you have a million people living in 15217: you have this B+tree that is relatively shallow and not that wide, but then it has these linked lists, these value lists, with a million entries. So there's all this wasted space for this particular query, and that's an example of why B+trees aren't going to solve the problems we need to face in a decision support system, okay? This is where the SQL Server guys are going: they would have these B+trees, but for analytics you don't want to do this.

Okay, so SQL Server is traditionally a row storage system. Remember its provenance; we talked about this. SQL Server is actually based on the source code of Sybase. In the early '90s, Microsoft licensed the Sybase source code to build a port of Sybase for Windows NT, and eventually there was some kind of arrangement where Microsoft became the steward of the Sybase port on Windows; they rebranded it and that became SQL Server.
What's sort of ironic now is that Sybase has kind of stagnated, right? SAP bought them, and yes, they're maintaining existing applications, but it's not really at the forefront of newer database technology, whereas SQL Server is now one of the top three database systems. You see this Apollo stuff you're reading here, there's the Hekaton stuff; they're doing a lot of really cool things to make it a modern database system. So even though SQL Server came from Sybase, it has surpassed it in performance and capabilities.

So the basic idea with these columnar projection indexes is that you're going to decompose the row store tables and split them out into column segments. You take all the values for a particular column in a table and store them contiguously in memory or on disk. And what happens in this first approach, the initial implementation you read about, is that when you say you want a columnar index for a table, SQL Server makes that table read only, meaning you're not allowed to update it. They have to do this because in this first implementation they have no way to map changes in the row store back into the column store. We'll see later how they get around this in subsequent versions. When they build these columns, they take a static database and do one pass over it to build out these columnar indexes, and there's no extra information lying around that says, if you have to modify it, here's how to go find the single tuple you're looking for. They're giving that up because they want to be much faster for the analytical queries.

You may think this is kind of cheap. Why would anybody want to do this? Why would you release a half-baked product, or a half-baked feature? It's actually very common for database companies to put out an initial implementation of an idea that maybe doesn't work all the way you want it to, but at least it starts getting into customers' hands and you get feedback on how they actually end up using it. MemSQL did the same thing: they have their own auxiliary column store, and when you use it you have to declare the table read only in the same way, but eventually they'll make it so you can modify it. We'll talk more about this next class, but there's a standard approach for how you make a column store modifiable when you use these kinds of indexes.

Another cool thing I like about the paper is that they make a big effort to reuse as many existing components already in the database system as possible, without doing a major rewrite. Rather than doing what Vertica did, which was start a completely new system and write it from scratch to be a column store, they look at how they can take their existing system and bolt on this column store stuff without throwing away everything else. The reason you want to do this is that there's a whole ecosystem designed around SQL Server, and if you come along with a new database system you can't reuse any of that. For example, Microsoft owns Crystal Reports, a standard data visualization package people use. If you had to write a whole new database system, things like Crystal Reports or Tableau or other tools like that would no longer work.
So they want this thing to work seamlessly inside the system: from the outside it looks and sounds and smells like SQL Server, but the inside is being rewritten. Okay, and we'll talk later about what they do to get around this problem of everything being read only.

So here's basically the outline of how the thing works. We have our existing row store table, and the first thing we do to build a columnar index is split the table up into row groups. Each row group has roughly a million tuples; that's what they use. Then, for each row group, we take the values of a particular column within that row group and create a column segment. You can specify which columns you want to index; here we have four, A through D, but you can say you only want to build the index on three of them. Then each of these column segments within a row group gets passed through a component that encodes and compresses it; we'll talk about what those encoding and compression schemes look like in a second. Then we store the result in the database system's regular blob storage subsystem and keep some catalog information that lets us identify, for this particular row group and this column segment, the blob that contains everything we need. The reason for storing it in the built-in blob mechanism is that it allows them to reuse all the components in the system that do replication, fault tolerance, logging, and recovery. You don't have to write your own recovery manager for these columnar indexes; everything just gets reused. You can use all the existing infrastructure that's already there, which is really nice because it limits the amount of work you have to do to add something like this. We'll see next class how Oracle does something similar but takes a different approach, something called fractured mirrors, and we'll talk about how that works next class.

To keep track of all these column segments for the columnar indexes, there's a segment directory that maintains statistics about every column segment: the size, the number of tuples, the min and max values, and any metadata you need about how it was compressed or encoded. Yes? Physical size, yes, on disk. I should also qualify this by saying this is a disk-based implementation. There's nothing about what they're doing that would preclude doing it in an in-memory system, so I'm ignoring all that for now. But yes, they keep track of these statistics because they want to use them for cost estimation when they do query planning and things like that, and I think also because if you're doing on-disk allocations you need to be able to pre-plan things like that.

Okay. The min and max stuff is kind of interesting, because say you're doing a query on a range: when you look over your column segments you can say, oh, I can ignore this whole segment because the range I want doesn't fall within its min and max. So it's sort of like a materialized view that keeps track of what you can expect to see within a column segment. Vertica does something similar; I think MemSQL might do something similar.
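To make the segment directory idea concrete, here is a minimal sketch of the kind of per-segment metadata it might track. The names and layout are hypothetical, not SQL Server's actual catalog format; the point is just that a few numbers per segment are enough to do cost estimation and min/max segment elimination.

```cpp
#include <cstdint>

// Hypothetical sketch of per-column-segment metadata kept in a segment
// directory. The real system stores segments as blobs and keeps catalog
// entries like these so the optimizer can cost queries and skip segments.
enum class EncodingType { Dictionary, ValueBased, RunLength, None };

struct ColumnSegmentInfo {
    uint32_t     row_group_id;  // which ~1M-tuple row group this segment belongs to
    uint32_t     column_id;     // which column of the table
    EncodingType encoding;      // how the values were encoded
    uint64_t     row_count;     // number of tuples in the segment
    uint64_t     size_bytes;    // physical size of the compressed blob on disk
    int64_t      min_value;     // minimum value, used for segment elimination
    int64_t      max_value;     // maximum value, used for segment elimination
    uint64_t     blob_id;       // handle into the existing blob storage subsystem
};

// Segment elimination: skip a segment entirely if a range predicate
// (e.g., WHERE col BETWEEN lo AND hi) cannot possibly match anything in it.
bool CanSkipSegment(const ColumnSegmentInfo& seg, int64_t lo, int64_t hi) {
    return seg.max_value < lo || seg.min_value > hi;
}
```

The min/max fields are exactly the "materialized view"-style trick mentioned above: a range scan consults them first and never touches segments that cannot contain matches.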
This min/max pruning is another standard approach. Then we're also going to have a data dictionary for the dictionary encoding scheme, and I'll show what that looks like in a second. The basic idea is that we pack the dictionary in with the compressed column segment: we include the dictionary we need to reverse the dictionary encoding and get back the original values, so that it's all stored together as a single column segment.

We'll talk more about dictionary encoding later in the semester when we cover database compression; there are a bunch of different ways to do it, a bunch of different encoding schemes. For now I just want to give a high-level overview of what dictionary encoding is. It's probably the most common compression scheme used in analytical systems: DB2 BLU does this, HANA stores everything dictionary-encoded. It's a pretty standard technique and pretty easy to understand. The basic idea is that as we pass through the data and build out our column segments, we keep track of all the unique values we see for this column. Then we maintain a separate table, like a hash table or a lookup table, that says: for a particular offset in my dictionary, here's the value it corresponds to. Now when you build out the column segment for the index, you don't write the original value; you put in the offset of that value in the dictionary. This allows us to take an arbitrarily long string and, instead of storing all of its bytes, just store a 32-bit integer that corresponds to where it sits in the dictionary.

So let's look at an example. We have two columns: a user ID and the city that the person lives in, and as you can see we're storing things like New York, Chicago, and Pittsburgh over and over again, multiple times. This is essentially wasted space, because we're storing the same byte sequence for all these attributes. Under dictionary encoding, we first build a dictionary of all the values we've seen, each with an offset that corresponds to that particular value. Now, in the column index itself, instead of storing the varchar we store the ID of where it sits in the dictionary. We can also keep track of the number of times each value occurs; in this case we have New York five times, Chicago twice, and Pittsburgh once, and we can use that to make decisions about what order to put the dictionary in. We'll talk about dictionary encoding in further detail later on, but you could sort the entries based on how often they occur, or you could sort them based on the lexicographical order of the actual values; that would let you support range queries by doing a binary search on the dictionary itself to find the tuples you want. There are all these different ways to do it; no one way is better than another, but the basic idea is the same: instead of storing the varchar, we store an integer and use the dictionary to find the original value. So now, instead of storing something like a 32-byte varchar, we only have to store a 64-bit or even a 32-bit integer.
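Here is a minimal sketch of the dictionary encoding idea just described. The structure is generic, assumed for illustration, not SQL Server's actual segment format:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal sketch of dictionary encoding for one column segment.
// Each distinct string gets a small integer code; the segment stores the
// codes, and the dictionary travels with the segment so the original
// values can be reconstructed later.
struct DictEncodedSegment {
    std::vector<std::string> dictionary;  // code -> original value
    std::vector<uint32_t>    codes;       // one code per tuple in the segment
};

DictEncodedSegment DictEncode(const std::vector<std::string>& column) {
    DictEncodedSegment seg;
    std::unordered_map<std::string, uint32_t> lookup;  // value -> code
    for (const std::string& value : column) {
        auto it = lookup.find(value);
        if (it == lookup.end()) {
            it = lookup.emplace(value, static_cast<uint32_t>(seg.dictionary.size())).first;
            seg.dictionary.push_back(value);
        }
        seg.codes.push_back(it->second);
    }
    return seg;
}

// Decoding a single position is just an array lookup:
//   original = seg.dictionary[seg.codes[pos]];
```

For the city example above, the column collapses to a three-entry dictionary (New York, Chicago, Pittsburgh) plus one small integer code per tuple instead of a full varchar per tuple.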
Okay. The other type of encoding that SQL Server supports is called value-based encoding; this is sometimes called delta encoding. What they do is slightly different from traditional delta encoding, but the basic idea is the same. We take a numeric column, an integer or a decimal, and convert it into smaller values that let us store fewer bits than we would if it were uncompressed. The idea is that we shrink the domain of the values we see in a column segment down to a smaller range, and then we maintain some metadata that says: if you need to get back the original value, here's how to do it. With delta encoding, one way to think about it is: say I have the numbers 0, 1, 2, 3, 4, 5, 6; instead of storing those whole numbers, I just store the original value at the beginning of the chain and then plus 1, plus 1, plus 1, and so on. The downside in that case is that in order to get to, say, the last value, I have to decompress the beginning of the chain, whereas in value-based encoding I just need the segment-wide metadata to do the transformation, without having to uncompress anything else.

So let's look at the two examples they have in the paper. The first is for decimals: instead of storing 64-bit floats, we can store 32-bit or even smaller integers. Say we have the decimals 0.5, 10.77, and 1.33. The first thing we need to do is come up with an exponent that we'll use to turn the decimals into whole numbers; in this case we take 3, so we use 10 to the 3. In our initial encoding pass we multiply all the decimals by 10 to the 3, so now we no longer have floats, we just have whole numbers. Then we choose a base to be the starting point, the lowest value in our collection of numbers; in this case the lowest is 500. Now in our final encoding we take the initial encoded value, the one multiplied by the exponent, and subtract the base, so the first entry here is 0 and the subsequent entries are some number greater than that. Shouldn't that be true?
Yeah, right, 10 to the 3, that's fine. It should be 1.333 in the values? The last figure should be 1.333, since that's what's being used in the encoding. Ah, I'm missing another 3 there. Yeah, sorry, you're right.

Okay, the main thing about this is that we don't want to store 64-bit floats. We could store these encoded values in just a handful of bits, but we wouldn't do that because then we wouldn't be word-aligned, so we'd round up to a word-aligned integer width. And again, think of these long segments: most of the numbers may be hovering around the same thing. If you're storing temperature data, whether it's Celsius or Fahrenheit doesn't matter, most of the numbers are going to be roughly close to each other, so this can reduce the amount of data you have to store. Dollar amounts are another good example of where you can use this.

For integers, it's the same basic idea, except here, rather than making whole numbers, we come up with a way to reduce the magnitude of the values. So our exponent is going to be negative 2, and that puts most of these numbers into single digits: 500 goes to 5, 1700 goes to 17, and then that big one goes to 13,000. Then, same thing, we pick the lowest number and that becomes our base, we subtract it from everything, and we get the compressed values. Again, we can do this because we're building out the column segment from static data: the DBA says I want a columnar index, we do our pass, and we compute all these things. It becomes tricky once we have to do updates, and this is part of the reason they're not going to support them. If I now insert 400, that should be my base, not 500, because it's a lower value, and if I got a lower value and had to recompute the encoding, I would have to do another whole pass over the column index and update everything. That gets really wasteful for a single insert.

Yes? So the question is: is this kind of encoding only done in column stores? Yes, because usually in a row store you don't compress the data, since a transaction is only going to insert a small number of things at a time, right? And so, exactly as I just said, if you had to recompute the value-encoding bases every single time you update things, that would slow down all your transactions. Typically what happens is you have a row store that gets all the new modifications, where you do all your inserts and deletes and updates, and then a background thread applies those changes to the column store side of things. So what you could do is, say, I do this encoding on my columnar indexes, and then if I get any inserts or deletes, that doesn't change my encoding: any insert goes into its own row group, so I can do the encoding for that separately, and for any delete I can just mark a bit and say ignore this tuple, which doesn't affect any of the values. For updates we have to be a little more careful; you could do what some multi-version concurrency control systems do, where a delete followed by an insert is the same thing as an update. The main answer to the question is: you don't do this kind of compression or encoding on a row store because it's slow, but we do it in a column store because, in an analytical system, we don't care about single entities, we care about large swaths of data. Facebook doesn't care about what you as an individual do or what your attributes look like; they look at entire segments of users.
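Going back to value-based encoding for a second, here is a minimal sketch of the decimal example (0.5, 10.77, 1.333 with exponent 3 and base 500). It shows the transformation only; how SQL Server actually chooses the exponent and the storage width is not covered here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Minimal sketch of value-based encoding for decimals.
// exponent: multiply by 10^exponent so every value becomes a whole number.
// base:     the smallest scaled value; it is subtracted from every entry,
//           so the stored values are small non-negative integers.
struct ValueEncodedSegment {
    int     exponent = 0;
    int64_t base     = 0;
    std::vector<int64_t> encoded;
};

ValueEncodedSegment ValueEncode(const std::vector<double>& column, int exponent) {
    ValueEncodedSegment seg;
    seg.exponent = exponent;
    for (double v : column)
        seg.encoded.push_back(static_cast<int64_t>(std::llround(v * std::pow(10.0, exponent))));
    seg.base = *std::min_element(seg.encoded.begin(), seg.encoded.end());
    for (int64_t& v : seg.encoded) v -= seg.base;  // shift everything down to the base
    return seg;
}

// Decoding position i only needs the segment-wide metadata:
//   original = (encoded[i] + base) / 10^exponent
// e.g. {0.5, 10.77, 1.333} with exponent 3 -> scaled {500, 10770, 1333},
// base 500 -> encoded {0, 10270, 833}.
```

Note that getting back any single value needs only the per-segment exponent and base, which is the property that distinguishes this from chain-style delta encoding.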
So in that case you're not doing point queries to go grab single things, and you can do this kind of encoding. A few years ago I went to visit an internet advertising company, and they basically told me they group all their entities into something like 35 categories; every person fits into one of 35 groups, and that's accurate enough for them to figure out what's the right ad to show you. My group might be middle-aged white men, over-educated, whatever, and they'll show me some ad; Joy's might be Indian computer science student that's DTF, and they'll show him a different ad. So again, the main point is that in this kind of system we're not doing individual lookups, we're doing aggregates, and therefore this compression stuff works. It doesn't work in a row store, because there you're changing things all the time and you're trying to fetch single things. Okay.

All right, so the other compression scheme they support is called run-length encoding, and this is probably one of the most common approaches used in these types of systems, if you're smart about it. The basic idea is that instead of storing each individual attribute one by one, we see whether we can store a summarization when we have duplicate values in our column. So instead of storing individual values, we store a triplet: here's the value that starts the run, here's the offset of where I am in the column, and here's the number of times this value is repeated. And the key thing to understand is that in order to get the maximum amount of compression out of run-length encoding, your columns need to be sorted in a smart way.

So let's look at an example. Say we have our customer data again, and now instead of zip code we're going to track what sex they are; for simplicity we'll just say it's male or female. If we run-length encode this, we end up with these triplets that say: here's the value, here's the starting position in the column, and here's the number of times it's repeated. What we see in this example, never mind that male and female could be stored as a single bit, is that there's this middle part where we have one female record, one male record, and then another female record. So we're storing a whole triplet to represent each of those single tuples, and that's a lot of wasted space: if each field of the triplet is, say, eight bits, eight bits, and another bit, you're using 17 bits to store one value that could have been stored as one bit. The way these systems get around this is to pre-sort the column so that all the males come in a single run followed by all the females. Now when you run-length encode it you only have two entries: one male record starting at position zero, with six of them, and one female record starting at position seven, with two of them. This is a toy example, but now think about having a billion users: you could represent the sex of a billion people with two triplets. It's amazing, right? What makes run-length encoding tricky is that the optimal sort order for a particular column, the one that maximizes its run-length encoding opportunities, is different from what another column would want, because when we sort on this column we also have to change the position of all the values in the other columns.
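Here is a minimal sketch of run-length encoding with (value, start, length) triplets, matching the male/female example; the container types are generic, not what any particular system uses.

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of run-length encoding over a (dictionary-encoded) column.
// Each triplet records the value, the position where its run starts, and
// how many consecutive tuples repeat that value.
struct RleTriplet {
    uint32_t value;   // encoded column value (e.g. 0 = M, 1 = F)
    uint32_t start;   // starting offset of the run in the column
    uint32_t length;  // number of consecutive tuples with this value
};

std::vector<RleTriplet> RleEncode(const std::vector<uint32_t>& column) {
    std::vector<RleTriplet> runs;
    for (uint32_t pos = 0; pos < column.size(); ++pos) {
        if (runs.empty() || runs.back().value != column[pos])
            runs.push_back({column[pos], pos, 1});   // start a new run
        else
            ++runs.back().length;                    // extend the current run
    }
    return runs;
}

// Unsorted:  M F M F F M M M  -> many short runs.
// Sorted:    M M M M M M F F  -> 2 runs, no matter how many tuples there are,
// which is why the sort order chosen for the table matters so much.
```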
Right? Because we want to be able to say: at position four, across all the column indexes, here are the values for that tuple. So the sort from male to female that's great for this column might be the absolute worst thing to do for another column. In the paper they talk about how they have this trademarked algorithm, VertiPaq or something like that, and that's basically what it's doing: it's essentially an NP-complete problem to figure out the sort order across all your columns that maximizes the overall run-length encoding opportunity for the entire table. Vertica is another system that does stuff like this, and I think possibly DB2 BLU as well. We're going to come back to this run-length encoding stuff later on, but when you think about it, it's really nice: if I want to know whether the person at position four is male or female, it's trivial to look that up, I don't have to do any scan, and of course that structure can sit in a single cache line, so it's a quick comparison.

So again, to make these columnar indexes, these projection indexes, work in SQL Server, they had to do more than just store them: they also had to go modify the query planner and the optimizer to become aware of what these indexes are. The other thing they had to do, because SQL Server was traditionally a row storage system whose operators used the tuple-at-a-time iterator model of processing queries, was add operators that work batch-at-a-time or vector-at-a-time. We talked earlier about how, when an operator calls getNext on its child operator, it's not going to get back a single tuple; it gets back a large chunk of tuples, because with OLAP queries we're scanning large segments and we don't care about getting single things, we want to grab as much data as possible and do our operation directly on it. The other cool thing they added, and this is standard in OLAP systems, is the ability to compute bitmap vectors of matching tuples on the fly in order to do joins more efficiently. The basic idea is that when you're doing a scan on the columnar index, if the tuple at a given position matches the join predicate, then instead of storing the record ID you just set a bit: at position 4 I set it to 1 because I know tuple 4 in my columnar index should be used in the join operation. Again, this is the standard approach used in most OLAP columnar databases.

Okay, so since 2012, since the paper you read, they came back and added support for all the things that were wrong or missing in the initial implementation of the columnar indexes. I think in SQL Server 2014 and 2016 they have all the things I'm talking about here. The first is support for clustered columnar indexes: the idea is that the columnar index can now be the primary storage location of the table itself, and you don't need a separate copy of the rows. They added support for more data types, because I think the initial version only handled varchars and a few int types, so the more complex types they support can now go into the columnar indexes. And they also added support for modifying these tables.
So they no longer need to be classified as read only: you can do all the insert, update, and delete operations you would normally do in the system directly on a table with columnar indexes, and you do all this in the context of a transaction, just as you would with a regular table in SQL Server. But, as I said, you don't want to do modifications directly on the column store because it's really expensive, so instead they use this delta store approach, where they buffer all the modifications, and a background thread every so often applies those changes, maybe recomputes the encoding scheme, and updates the compression for each of the columns. We'll talk more about this next class, but basically all the things you read about and thought, oh, this seems terrible, why would anyone want to do this, they fixed in the newer versions. Okay, any questions about columnar indexes?

Yes. So the question, or the statement, is that you would not want to use columnar indexes for regular OLTP workloads; but if you don't do any kind of compression, could you still do transactions and updates on uncompressed columnar data? Yes and no. You could, but it could potentially be slow, because think about what happens: before, in a row store, if I insert a new tuple, I have a byte array that contains all the values of my tuple, and I just append it to the end of the table, and it's in there. In a column store, I have to split every attribute up, one per column, and do a memory copy or memory write to all of those locations. It's not as bad as writing to disk, but it's a little more work. Typically what everyone does is have a delta store; that's the right way to do this. It's not that much more work, you still get the fast writes you want on the row store side, and then over time, as things get colder, they can move to the column store. HyPer does something similar to this, and we'll see next class, when we talk about HANA and HyRise and those other systems, that they always have the delta store. And if you think about how this would work in a bifurcated environment, where you have one database system on the OLTP side and one on the OLAP side, it's basically doing the same thing, because a row store is always faster for transactions.

Okay, so now I want to switch over and talk about bitmap indexes. A bitmap index is essentially the same kind of thing we talked about with a columnar projection index, but now, instead of storing individual values, we store bits that say whether the tuple at a given offset has a particular value or not. For every single unique value we have a separate bitmap vector, and the bit is set to one if the tuple at that position has that value. Typically you break your bitmap up into segments instead of having one giant array; that way, if your bitmap is one gigabyte, you're not calling malloc for an entire gigabyte chunk, you can break it into smaller pieces, which is better for memory management. So as before we have the sex column, but now, instead of storing a single column that has male, female, male, female, we split this into two separate vectors: one vector corresponds to whether the attribute is male and one to whether it's female. And again, I realize this is a bad example, because I'm storing two bit vectors for this one column when I could have just used one bit in the original form, but it's just to show you the idea.
So now, if we want to know what value the tuple at position four has, I look across each of these vectors, see which one has the bit set to one, and that tells me what the original value was. Normally there's a hash table that maps a particular value to the bit vector for that value.

Let's look at another example. Say I go back to my customer dimension table, and now I want to build a bitmap index on zip code. Would that be a good idea or a bad idea? Let's say I have 10 million customers. How many zip codes are there in the US? What's that? He says at most 100,000. Anybody else want to take a guess? It's about 43,000. Okay, so if I build a bitmap for every single value I could possibly have in a zip code, I need 43,000 different bitmaps, and each bitmap needs 10 million positions, 10 million bits, because I don't know in advance which bit a given customer is going to set. So in this case, to store a bitmap index on zip code for a table with 10 million tuples, I would need about 53 gigabytes if I store it uncompressed. Bitmaps are awesome and work really well when the number of distinct values is small, because you always have to store one bitmap covering every tuple in your table for every single unique value, so you need to be careful about where you use them. Another problem is that any time someone inserts a new tuple, we have to extend the length of all of our bitmap vectors. So, like the example I gave earlier about inserting a tuple into a column store and having to touch every column: if I have 43,000 bitmaps for this table, any time I insert a new tuple I have to update all 43,000 of them to reflect the new length. So again, it's not the same as a columnar index; it's more general purpose, and in special cases it can do much better than the columnar indexes the Microsoft guys are talking about. In particular, I'm going to spend some time talking about the different design choices you have to make when you build a bitmap index.

Yes? Yes, it doesn't matter what the size of the value is; what matters is how many distinct values there are. So an int is a 64-bit field; correct, any int can be represented in 64 bits; but if you want to build bit vectors, you need one per distinct value, so 43,000 possible values means 43,000 bit vectors. Right, so this is where we're applying knowledge of the application domain: the column might be a 32-bit int in my application, but I would know that there are only 43,000 zip codes. So I could build an index on this thing, but I probably wouldn't. If instead of zip code this were state, there are only 50 states, so that would be 10,000,000 times 50, which is much smaller, and the bitmap vectors would be much, much better than storing the two-character varchar for the state code. This is the sort of thing DBAs do: a DBA looks at what the application looks like, what the data looks like, and makes decisions about what kind of index to include. Okay. All right, so we're going to talk about the different ways to build our bitmaps: first the encoding schemes, and then how to do compression on them.
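Before getting into the encoding choices, here is a minimal sketch of the basic equality-encoded bitmap index just described: one bit vector per distinct value. It's a generic illustration, not any particular system's layout, and the sizing arithmetic from the zip-code discussion is redone in the comments.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal sketch of an equality-encoded bitmap index: one bit vector per
// distinct value, with bit i set in a value's vector iff tuple i has it.
class BitmapIndex {
public:
    void Insert(uint32_t tuple_pos, const std::string& value) {
        num_tuples_ = std::max<uint64_t>(num_tuples_, tuple_pos + 1);
        auto& bits = bitmaps_[value];
        if (bits.size() < num_tuples_) bits.resize(num_tuples_, false);
        bits[tuple_pos] = true;
        // Every other bit vector logically grows too, which is part of why
        // inserts hurt when there are many distinct values.
    }

    // Equality lookup: the bit vector for one value (all zeros if unseen).
    std::vector<bool> Lookup(const std::string& value) const {
        auto it = bitmaps_.find(value);
        return it == bitmaps_.end() ? std::vector<bool>(num_tuples_, false)
                                    : it->second;
    }

private:
    uint64_t num_tuples_ = 0;
    std::unordered_map<std::string, std::vector<bool>> bitmaps_;
};

// Sizing: ~43,000 zip codes x 10,000,000 tuples = 4.3e11 bits ~= 53 GB
// uncompressed, versus 50 states x 10,000,000 tuples ~= 62.5 MB.
```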
So the type of encoding we've talked about so far is called equality encoding: the bit at a tuple's offset is set to 1 in a value's bitmap vector if that tuple has that particular value. It basically only supports equality predicates: you can say, find me all the customers in zip code 15217, do a lookup to get that value's bitmap, and whether a bit is set tells you whether that tuple has the value. But again, we saw that this wastes a lot of space: in the zip code case, if you assume a uniform distribution of population across zip codes, the probability that any given bit is set to 1 is 1 divided by 43,000, so chances are the bit is almost always 0.

So instead of storing exactly whether each tuple has each value, you can do range encoding, where a bitmap represents an interval of values, and the bit is 1 or 0 based on whether the value of that particular tuple falls within the range. For example, instead of having a separate bitmap for each of the zip codes 15217, 15213, and 15215, I have one bitmap that says whether the zip code is within this range, and I set the bit accordingly. That's actually an example of a lossy encoding, because I have no way to go back and say exactly which zip code that particular tuple had; I just know it's within the interval.

Another approach is called bit-slice encoding, and this is kind of cool, because instead of storing 1s and 0s based on whether a tuple has a given value, we're going to encode the actual values of the tuples themselves in these bitmaps, which also lets us do other, more interesting things. So let's look at an example. Say we have a table of zip codes; these are the zip codes of the places I've lived in my life, I guess there are eight of them. We take the first one, 21042, and instead of storing a single bitmap for 21042, we break it up into slices, one for each bit position of the actual value. If we called the bin function, like in Python, we would get this bit sequence for this value. You can think of it like this: going down vertically, each row in the bit slices corresponds to a particular tuple, and the positions going across correspond to the positions in the bit sequence of that tuple's value. In our index we always keep one bitmap that says whether the value is null or not; in this case it's not; and then we slide across all the bits of the value and fill them out into the separate slices. We repeat this going down all the other tuples and flesh out all of these bitmaps. So now, looking horizontally at, say, bit slice 16, you have that bit of the value for every one of the tuples. What this representation allows us to do is take queries like finding all the customers where the zip code is less than 15217, walk through the slices, and construct a result bitmap of all the tuples where the predicate is true; and we can do some optimizations along the way.
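Here is a minimal sketch of bit-slicing, showing how the slices are built and the popcount-based SUM trick that comes up in a moment. It's a hedged illustration of the general technique, not any specific system's implementation.

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of bit-slice encoding: slice j holds, for every tuple,
// the j-th bit of that tuple's value. Slices are stored as packed 64-bit
// words so predicates and aggregates can use bitwise operations.
struct BitSlices {
    static constexpr int kBits = 17;      // 2^17 > 99999, enough for a zip code
    std::vector<uint64_t> slices[kBits];  // slices[j] = packed bit j of every tuple
    uint64_t num_tuples = 0;

    void Append(uint32_t value) {
        uint64_t word = num_tuples / 64, bit = num_tuples % 64;
        for (int j = 0; j < kBits; ++j) {
            if (slices[j].size() <= word) slices[j].push_back(0);
            if ((value >> j) & 1) slices[j][word] |= (1ull << bit);
        }
        ++num_tuples;
    }

    // SUM(column) = sum over slices of 2^j * popcount(slice j), because each
    // set bit in slice j contributes 2^j to exactly one tuple's value.
    uint64_t Sum() const {
        uint64_t total = 0;
        for (int j = 0; j < kBits; ++j) {
            uint64_t ones = 0;
            for (uint64_t w : slices[j])
                ones += __builtin_popcountll(w);  // GCC/Clang builtin; C++20 has std::popcount
            total += ones << j;
        }
        return total;
    }
};
```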
We know that 15217 has its first three bits set to zero, so as we examine the slices we can throw out any tuple that has a one in any of those bit slices. That lets us skip looking at those tuples entirely, and we can use the vectorized execution instructions in modern chips to do these operations very, very quickly. Beyond that, we can actually use these bit slices to do aggregations very efficiently as well, much faster than doing sequential scans. For example, to compute a SUM, all we have to do is count the number of ones in each slice, multiply that by two to the power of the slice offset, do the same for the next slice and so on, and add it all together; that's the sum of this attribute over the entire table. And what's really awesome about these bit slices is that counting the number of ones in a slice is a single instruction in the CPU that gives you back the number of ones within a 64-bit word, so it's really, really fast, versus a for loop that goes through everything one by one and adds it up. We'll talk more about how you use bit slices and vectorized execution instructions when we get to vectorization later in the class, but this is one advantage you can get from using a different representation of bitmaps for your data. I think Sybase supports this; a lot of different database systems support these types of indexes. So that's a different encoding scheme we can have for bitmaps.

Now let's talk about how to do compression, because again we saw with that zip code example the massive amount of data we'd have to store for all those ones and zeros. You could take chunks of the bitmap and run an off-the-shelf compression scheme like LZ4 or Snappy and compress those segments to reduce the amount of space they use. The problem with this approach is that any time you need to access or compute something on these bitmap vectors, you have to decompress them. It helps in a disk-based system, because you compress before you write the block out to disk, but in an in-memory system you're basically just wasting CPU cycles compressing and decompressing things, so it doesn't help you.

So two other approaches. The first is an encoding scheme called Byte-aligned Bitmap Codes, from Oracle. The basic idea, and I'll show an example, is that we use structured run-length encoding of the data itself: we still use run-length encoding, but rather than having an arbitrary sequence of runs, we're more deliberate about how we lay out the compressed version. Then I'll talk about roaring bitmaps. Roaring bitmaps were invented by some French Canadians a few years ago, so the algorithm is very, very new; the paper was published just last year, in 2015, but it's already being used in a lot of open source systems like Spark, Lucene, and Hive. This is one of those cases where the open source guys are the first to adopt a technique and the commercial guys don't quite know about it yet, which I think is kind of cool.

All right, so let's talk about what Oracle does. The basic idea of this approach is that we split the bitmap into chunks based on the sequences of bytes we see, and we categorize bytes into two groups.
The first group is gap bytes, which are bytes where every bit is zero, and the second is tail bytes, where at least one bit in the byte is set to one. Again, think about the zip code example from before: most of the bits are going to be zero, because most people don't live in a given zip code, and you have a one in 43,000 chance of a bit being set to one. We exploit that by doing run-length encoding on the long sequences of zeros. Once we find a chunk, some number of gap bytes followed by some tail bytes, we encode it in a way that takes much less storage than storing everything uncompressed.

So here's an example. Say we have a sequence of, I think, eighteen bytes. The first thing you'll see is that most of them are zero; only in these few bytes, here, here, and here, do we actually have something set to one. Those are going to be our tail bytes, and everything else is gap bytes. Any time we see tail bytes, we chunk things up: this is the first chunk, with its gap bytes and tail byte, and then here's our second chunk. Now, for each chunk we go through and figure out how to encode it into a smaller space. For the first one, we always write out a header, and the header tells us what's in our sequence. The first three bits give the number of gap bytes, which in this case is two. Then there's a special bit that says whether the tail is special or not; it's special if it has exactly one bit set to one, because then we can encode the position of that bit rather than storing the entire byte. So in this case we have two gap bytes, we store that in the first three bits, we set the special flag to one, and then we record where the set bit's position is; that's what we have here. I'm drawing the parentheses to show you where the pieces of the header are. That means these three bytes of our bitmap only need one byte to represent them.

If we go to the next chunk, we have 13 gap bytes, 13 byte sequences that are all zeros, and we can't store that count in the three header bits because it's greater than seven, so we set those bits all to one to say: we'll tell you later how many gap bytes there are. Then, because we have two verbatim tail bytes here, and they're not special because one of them has more than one bit set to one, we need a flag that says we're going to store these byte sequences as exact copies, along with how many of them there are. So the second chunk can be represented like this: we have our header, then our gap length, which is 13 because we have 13 gap bytes, and then our verbatim tail bytes, which are exact copies of the data. So our original byte sequence for this bitmap was 18 bytes, and in the compressed version we can do it in five bytes by taking advantage of run-length encoding. I'm being very hand-wavy and going very fast about how exactly you should do this; unless you go work at Oracle you're never going to have to implement it, because it's obviously patented by them, but it's just to show that there are other mechanisms, other ways to exploit run-length encoding on repeated sequences to get better compression.
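Just to make the gap/tail idea concrete, here is a deliberately simplified sketch of that style of byte-aligned run-length encoding. It is not Oracle's actual BBC wire format: the bit-packed headers, the "special" single-bit tail case, and multi-byte verbatim runs are all glossed over; this only shows the classification of bytes and the run structure.

```cpp
#include <cstdint>
#include <vector>

// Deliberately simplified gap/tail run-length encoding over the bytes of a
// bitmap. A "gap" byte is all zeros; a "tail" byte has at least one bit set.
// Each emitted chunk is (number of preceding gap bytes, the literal tail byte).
struct GapTailChunk {
    uint32_t gap_bytes;  // how many all-zero bytes precede the tail byte
    uint8_t  tail_byte;  // the literal non-zero byte, stored verbatim
};

std::vector<GapTailChunk> GapTailEncode(const std::vector<uint8_t>& bitmap_bytes) {
    std::vector<GapTailChunk> chunks;
    uint32_t gap = 0;
    for (uint8_t b : bitmap_bytes) {
        if (b == 0) {
            ++gap;                       // extend the current run of zero bytes
        } else {
            chunks.push_back({gap, b});  // flush: gap run followed by one tail byte
            gap = 0;
        }
    }
    // A trailing run of zero bytes carries no set bits, so it can be dropped
    // as long as the total bitmap length is recorded elsewhere.
    return chunks;
}
```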
So the BBC approach is still what Oracle uses, but in some ways it's kind of an obsolete format, because it was developed in the 1990s and it doesn't take advantage of the vectorized execution instructions you have in modern CPUs. There's a better approach called Word-Aligned Hybrid encoding, but the problem with that is it's also patented; I don't know who owns the patent, but most people don't implement it either, because they want to avoid any patent issues. The other key thing to understand about both BBC and the general-purpose compression approach is that neither allows random access, meaning any time you want to figure out whether a given position is set to one or not, you have to decompress the whole thing, and that can be really, really slow.

A more modern approach is called roaring bitmaps. The basic idea is that we store some of the data as uncompressed bitmaps and some of the data as plain value lists, and that way you get the best of both worlds: with the value lists you can very quickly check whether something is in a set or not, and with the bitmaps you can do all the vectorized bitmap operations we saw when we talked about bit slicing. So, a quick overview of what it looks like. At the upper level you have a partitioning table that says, for each chunk of my key space, here's a container that stores all the data for it; some of the containers store things as regular 16-bit integers, and some store them as bitmaps. The idea is that for each value, we first figure out which chunk it belongs to, and therefore which container to store it in: we divide the value by 2 to the 16, and then we only have to store the value mod 2 to the 16 in the container itself, and that's enough to reverse it and get back the original value. If the number of values in a container is less than 4096, which is the default, and that's what Lucene uses, then you store it as a value-list array; otherwise you store it as a bitmap, and then you can do all the vectorized stuff on it. This works because when you have a small number of values, storing them as a bitmap is really wasteful, but once it gets large, the bitmap is better.

All right, let's look at an example. Say you have n equal to 1000. The first thing you do is divide it by 2 to the 16; that gives you 0, because you take the floor. Then you take the number mod 2 to the 16, and that gives you 1000, so that's the only thing we need to store, and we can easily reverse it to get back the original value. Now say we have this other, arbitrarily large number: we divide it by 2 to the 16, that gives us chunk 3, and the number we have to store in our bitmap is 50. So we just find the 50th position in that container's bitmap and set it to 1, and now we can easily tell whether the roaring bitmap has that offset set to true. Does this make sense? There are open source implementations of this, and it's being used in a lot more systems now even though it's a new algorithm. Think about skip lists: skip lists came out in the early '90s, and it wasn't until 20 years later that databases actually started using them, whereas this is something people are picking up right away.
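Here is a minimal sketch of the roaring idea with the same arithmetic as the example (1000 lands in chunk 0 with offset 1000; a hypothetical larger value like 196658 = 3·65536 + 50 lands in chunk 3 with offset 50). Real roaring implementations have more container types and conversion rules than shown here.

```cpp
#include <cstdint>
#include <set>
#include <unordered_map>
#include <vector>

// Minimal sketch of a roaring-style bitmap: 32-bit values are partitioned by
// their high 16 bits into containers; a container stores the low 16 bits
// either as a sorted array (few values) or as a 2^16-bit bitmap (many values).
class RoaringSketch {
    static constexpr size_t kArrayCutoff = 4096;  // same default Lucene uses

    struct Container {
        std::set<uint16_t> array;      // sparse representation
        std::vector<uint64_t> bitmap;  // dense representation (1024 x 64 bits)
        bool dense = false;
    };

public:
    void Add(uint32_t value) {
        uint16_t chunk  = static_cast<uint16_t>(value >> 16);    // value / 2^16
        uint16_t offset = static_cast<uint16_t>(value & 0xFFFF); // value % 2^16
        Container& c = containers_[chunk];
        if (!c.dense) {
            c.array.insert(offset);
            if (c.array.size() > kArrayCutoff) {                  // convert to bitmap
                c.bitmap.assign(1024, 0);
                for (uint16_t v : c.array) c.bitmap[v / 64] |= 1ull << (v % 64);
                c.array.clear();
                c.dense = true;
            }
        } else {
            c.bitmap[offset / 64] |= 1ull << (offset % 64);
        }
    }

    bool Contains(uint32_t value) const {
        auto it = containers_.find(static_cast<uint16_t>(value >> 16));
        if (it == containers_.end()) return false;
        uint16_t offset = static_cast<uint16_t>(value & 0xFFFF);
        const Container& c = it->second;
        return c.dense ? (c.bitmap[offset / 64] >> (offset % 64)) & 1
                       : c.array.count(offset) > 0;
    }

private:
    std::unordered_map<uint16_t, Container> containers_;
};
```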
Okay, so to finish up on OLAP indexes, the one thing you should be mindful of, and we'll talk more about this next class, is that all of these schemes assume the position of a tuple in the table is fixed throughout the life of the tuple, meaning if I get inserted at position 5, no matter how many times I get updated, I'm always going to be at position 5. But this doesn't work in a system that uses MVCC where an update inserts a new version of the tuple, because now its position is somewhere different. MySQL doesn't have this problem and Oracle doesn't have this problem, because they use rollback segments, so a tuple always has a fixed position; but in something like Postgres you can't do this, because the position keeps changing. This is why we're going to have to look at delta stores to get around the problem of updating things when we have a bitmap or a columnar index. We also talked about how maintaining a bitmap index is wasteful when there's a large number of unique values, so you have to be very careful when you choose to use one. Another key thing is that we've completely ignored multi-dimensional predicates and multi-dimensional range queries. We've been assuming things like: is the zip code equal to this, or is the zip code less than that. If you have more complex predicates combining these things, or you're doing things in a multi-dimensional database, it becomes very complicated, and bitmap indexes may not be the right thing; you'd want to use k-d trees or R-trees, and we're ignoring all of that for now. Any questions about columnar indexes? Any questions about bitmap indexes?

Project two. So the goal for project two, what you're assigned to do, is that everyone has to implement a latch-free Bw-Tree in our database system. You don't get to choose which index to do, because we decided that since skip lists are too easy, everyone would pick that and it wouldn't be that interesting, so everyone has to implement a Bw-Tree. You need to support all the things I talked about last class in your Bw-Tree: the compare-and-swap mapping table for indirection, delta chaining, split, merge, and consolidation, and cooperative garbage collection, meaning your threads need to identify when garbage collection is needed using the epoch timestamp approach and clean up any pages that can be deleted. We're going to provide you with a standard API of all the functions your index needs to support, and your index needs to handle both unique and non-unique keys. If it's a unique index and I insert the same value twice, the second time I try to insert it you should return false; with non-unique keys you should be able to accept the same key over and over again, but if I call delete, I'm going to give you a particular key and a pointer to the tuple, and you need to find the right one to actually delete. We're going to provide you with a header file that has the API you have to implement, and all the gritty details of data serialization, deserialization, and predicate evaluation are completely taken care of for you, so you don't have to examine what's actually in the key; we give you a comparator that you can use to ask whether this thing is less than that thing. But as you go along, you'll see that there are all these things the paper was hand-wavy about, and it's up to you to figure out how to actually implement them correctly.
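To give a feel for the mapping-table indirection and delta chains you'll be building, here is a hypothetical minimal sketch; the names and API are made up for illustration and are not the header file we provide, and splits, merges, consolidation, and epoch-based garbage collection are all omitted.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the Bw-Tree's core trick: nodes are referenced by
// logical IDs through a mapping table, and updates are prepended as delta
// records with a single compare-and-swap on the table slot.
struct DeltaNode {
    enum class Type { kBase, kInsert, kDelete } type;
    int64_t key;      // placeholder key type; the real index is generic
    int64_t value;    // placeholder value (e.g., a pointer to a tuple)
    DeltaNode* next;  // next record in the delta chain (toward the base node)
};

class MappingTable {
public:
    explicit MappingTable(size_t capacity) : slots_(capacity) {
        for (auto& s : slots_) s.store(nullptr, std::memory_order_relaxed);
    }

    DeltaNode* Get(uint64_t node_id) const {
        return slots_[node_id].load(std::memory_order_acquire);
    }

    // Prepend a delta record to node_id's chain without taking any latches.
    // Returns false if another thread changed the chain first; the caller
    // retries (and may decide a long chain is due for consolidation).
    bool PrependDelta(uint64_t node_id, DeltaNode* delta) {
        DeltaNode* head = slots_[node_id].load(std::memory_order_acquire);
        delta->next = head;
        return slots_[node_id].compare_exchange_strong(
            head, delta, std::memory_order_release, std::memory_order_acquire);
    }

private:
    std::vector<std::atomic<DeltaNode*>> slots_;
};
```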
But as you go along, you'll see that there are all these things the paper was sort of hand-wavy about, and it's up to you to figure out how to actually implement them correctly. So for example, what you set for the threshold that decides when to garbage collect or consolidate a delta chain is up to you, and you can play with the different test cases we give you to figure out what a good value is. I can't stress this enough: as you go along, there are all these questions you're going to have about how to actually implement things, and I'll say there's no right answer. It's better for you as a group to sit down and figure out what you should actually be doing, rather than come to Joy and me and ask, what's the right answer, how should we do this? You should be more independent about how you design this thing. As long as you maintain the API and do all the stuff we talked about, no one implementation is necessarily better than another, although I guess we can measure which one is actually better using speed tests.

Just like before, we're going to provide you with C++ test cases to check your implementation, and we strongly encourage you to write additional test cases to play with things, because what we provide is not going to include everything we grade you on. So you need to be careful about all the different corner cases you have to worry about with splits, consolidations, and merges, and we may not provide test cases that cover everything needed to check whether your implementation is correct. But we do provide you with an existing index implementation based on the STX B+Tree, which is an open-source B+Tree implementation. The difference between the implementation we're giving you and the one you're building is that ours does not do concurrent operations: the STX B+Tree is not thread-safe, so we wrap a coarse-grained lock around any single operation you do on the index. Any time you do an insert, we lock the whole B+Tree; any time you do an update, we lock the whole B+Tree. This gives you an oracle to check whether your index is correct, because you can load up yours and load up this other one, and you should come back with the same values and roughly the same structure. But your implementation should be able to outperform ours, because this one is doing the simplest possible locking, whereas yours needs to be more fine-grained and use compare-and-swap. I don't think we're providing you with any SQL tests, so everything can be done from the command line, just as before.

The other thing we'll require you to do is actually write documentation and comment your code to make sure you explain what you're doing. Since we're splitting you up into groups, we're only going to have ten submissions, so Joy and I are actually going to go read your code and see whether it's structured correctly and explains all the different parts of your implementation. If we see code that looks like you just wrote a bunch of stuff and it's not clear what the hell is actually going on, you will lose points for that. The idea is that when you go out into the real world, out into a company, if you wrote code without documenting what you were doing, you'd probably get fired, or at least reprimanded. So we're going to get you in the habit of writing production-quality code now, and that way we can correct any mistakes you make along the way.
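Just to show what "lock the whole tree on every operation" means, here is a hypothetical sketch of a coarse-grained locking oracle in C++. It stands in for the wrapper around the STX B+Tree (using a std::multimap as a placeholder, not the real thing), and it also mirrors the unique/non-unique insert and delete-by-key-and-pointer semantics described above; the class and method names are made up.

```cpp
#include <map>
#include <mutex>

// Hypothetical coarse-grained locking oracle: a single mutex guards every
// operation on a non-thread-safe ordered container. Correct, but it
// serializes all threads, which is why a latch-free Bw-Tree should beat it.
template <typename Key, typename Value>
class CoarseLockedIndex {
  std::mutex latch_;
  std::multimap<Key, Value> tree_;  // placeholder for a single-threaded tree

 public:
  bool Insert(const Key &k, const Value &v, bool unique) {
    std::lock_guard<std::mutex> guard(latch_);  // lock the whole tree
    if (unique && tree_.count(k) > 0) return false;
    tree_.emplace(k, v);
    return true;
  }

  bool Delete(const Key &k, const Value &v) {
    std::lock_guard<std::mutex> guard(latch_);  // lock the whole tree
    auto range = tree_.equal_range(k);
    for (auto it = range.first; it != range.second; ++it) {
      if (it->second == v) {  // match on the tuple pointer, not just the key
        tree_.erase(it);
        return true;
      }
    }
    return false;
  }
};
```

Every thread serializes on the single mutex, which is exactly why a correct latch-free implementation should be able to beat it under concurrent workloads.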
For grading, again, we're going to have additional tests beyond what we provide you, and again there will be bonus points for whoever writes the fastest implementation. Hopefully all of you will be able to beat Joy: you definitely should be able to beat his coarse-grained B+Tree locking scheme, but all of you should also be able to beat his Bw-Tree implementation, because he'll have spent less time on it than you will. And of course, we're going to do the same thing where you have to pass the ClangFormat syntax checking, so that everybody's code looks nice and clean alongside the rest of the code in the system. You saw my email on Friday, so everyone should be assigned to a group now. I think we actually had one person drop out this morning, so we have 29 people, but I think we're going to go to the waiting list and bring somebody else in, so every group should have three people. The expectation is that there's not going to be one person carrying the torch for the entire group. I'm not your babysitter, so I'm not going to go check GitHub to see whether everyone has an equal number of commits; it's up to you to manage yourselves and make sure everyone is contributing equally. I don't want to get emails saying, this person is a dick, they're not doing enough work. Everyone should be doing the same amount of work. Okay, so the due date will be March 2nd, again due at midnight. That's a Wednesday, the Wednesday before spring break. The full description and links to the source code are all available on the course website now. So, any questions about project 2? None? Everyone can go write a Bw-Tree in their spare time this weekend, it's that easy. People are shaking their heads yes. All right, so for next class we'll start talking about storage models, and again, we've been dancing around this idea of delta stores and other things like that, so we'll talk more about that on Wednesday. And then Joy will spend the last 15 minutes of class talking about how to do performance profiling in the Peloton system: how you hook up GDB, run perf, gprof, and all these other tools to figure out what your index is actually doing, figure out where the hotspots are, why things are going slow, and how to optimize your code further. So we'll spend some time giving a quick tutorial on how to do that. Okay, all right, see you guys next time.