Before we get started: there are many questions about the skip list assignment. I know there was one question posted on Piazza that we had to look into. We think there was a bug in the BW tree, but that shouldn't affect the correctness of the tests. The BW tree is incorrect, so we'll fix that, but it shouldn't affect how your skip list goes. Any high-level questions about the skip list? Everyone has started. Who hasn't started? Who would admit they haven't started? Dana, have you started? She hasn't started. Okay.

All right. The last lecture was all about how we store the values of our tuples and organize them in memory. What we're going to talk about today is a layer below that, sort of independent of whether we're using the NSM or DSM storage format: how do we actually compress the data? We'll start off with some background about compression. Everyone should pretty much understand the basics of how data compression works from your undergrad CS intro courses, so what I'm going to focus on is what we have to be mindful of when we want to do compression inside of a database system. Why is this different from just gzipping or zipping files on your desktop? Then we'll talk about naive compression, where we'll see all the problems you can run into. Then we'll go into more detail on columnar compression, because this is where you're going to get most of the benefit; this is where you'll see compression the most in OLAP systems. And we'll finish with research we've done here at CMU on how to compress OLTP indexes. We'll see that we compress the base data in our column stores, but we don't actually want to do that in a row store if we're running transactions, because compression is going to be too slow; instead we can compress the indexes.

We talked about this at the very beginning when we motivated why we want to study in-memory database systems. In legacy or traditional systems that assume the primary storage location of the database is on disk, whether an SSD or an HDD, I/O is always going to be the main bottleneck. In that world compression totally makes sense, because whenever you read a page from disk into your buffer pool, you're essentially bringing in more data than you would otherwise if it were uncompressed. In a disk-based system the disk is so slow that the database system is absolutely willing to pay the trade-off of spending more CPU cycles to compress and decompress the data. But it's not as clear that this is always what we want in an in-memory database system, because the cost of fetching a tuple is significantly lower than it was in a disk-based system. So we need to be careful about what kind of compression algorithm and techniques we use for in-memory database compression. We don't want to end up in a situation where, yes, we can store more data in DRAM, but now our system is super slow because we're spending all our time zipping and unzipping our data.
It goes without saying that compressing a database in DRAM is important: it allows us to store more data. But again, we have to be mindful of the trade-off and make sure that in exchange for a better compression ratio we don't reduce the speed of our system. In general, and I can't prove this, but whenever I look at the literature and at what systems actually do, in-memory database systems always choose speed over compression ratio, because if you have a slow compression algorithm, you lose all the benefits of an in-memory database system. The reason we want to compress an in-memory database is obvious: DRAM is expensive and its capacity is limited, so by compressing the data we can store a larger database with possibly fewer machines and less DRAM. DRAM is really expensive not only to buy but also to maintain inside your server. There have been studies estimating that of all the energy used to power a machine, 40% is spent just refreshing DRAM. So again, this is why we want to use compression.

Now I want to talk a little bit about what real databases look like, and we'll see how we can exploit their characteristics to get better compression; we want to understand why compression works on real data. There are two key characteristics to be mindful of. The first is that real data is very, very skewed, meaning the distribution of values for individual attributes is far from uniform. The example people always use is the Zipfian distribution of the Brown corpus. If you don't know what a Zipfian distribution is, it's essentially a power law. The Brown corpus is a collection of American English texts that researchers assembled at Brown University, and somebody came along and counted how often each word was used. They found that the most frequently used word is, obviously, "the". The next most frequent word occurred about half as often as "the", the third about half as often as the second, and so forth, so you get this exponential-looking curve. You're going to see Zipfian, power-law distributions all the time in real-world databases. Think of Reddit: there is a very small number of people that post most of the articles that get upvoted, and a small number of people post most of the comments. You have this power-law curve where a small number of people post all the material, and then a long tail of people who post almost nothing. That's a Zipfian distribution, and we can try to exploit that fact when we do compression.

The other aspect we can exploit is that real databases have a lot of correlated columns, correlated attributes. What I mean by that is, say you have a table that keeps track of people's addresses: if you have somebody's zip code, you can easily derive their state or city. Those values are always correlated. Or likewise along a temporal dimension: on Amazon, the date somebody ordered something occurs just a few days before the date it shipped.
Amazon usually ships something within seven days, so those two dates are going to be really close to each other, and we can try to exploit that when we do compression. You're going to see these two properties time and time again across all these different applications.

Now, to do database compression there are two goals we're going to try to achieve. The first is to ensure that all of the compressed values our compression scheme produces are always fixed length. We talked about this before with the DSM: we want to use offsets to jump into a column to find a tuple. So if we're compressing our data, we want to make sure we can always compute that offset easily because our values are always a fixed length. You don't always get this: run gzip on an arbitrary file and it generates output of some size in bytes; run the same algorithm on a different file and you get a different number of bytes. We won't be able to do offset jumps if we have that problem, so we want all the encoding schemes we talk about to always generate fixed-length values. We'll see how to get around this when we use naive compression.

The other goal is that whatever compression scheme we use, we want to allow the database system to postpone actually decompressing the data for as long as possible in the query plan. What I mean by that is, say I have a table with a million tuples and I just compress every single tuple with gzip or some other off-the-shelf general-purpose compression algorithm. Now if I want to do a sequential scan on that table and find tuples where the name equals some value, I have to go through every single tuple and decompress it one by one just to find the thing I'm looking for. Ideally, I want to do my sequential scan using the compressed version of the value in my WHERE clause so that I don't have to decompress any of the data. Say that scan over a million tuples only produces 10 tuples in the end: then I only have to decompress when I finally hand the result back to the outside world, whether that's the client, the terminal, or whatever machine is executing the query. If we can delay decompression as long as possible, we get two benefits. One, we're doing less work, executing fewer instructions, because we're not decompressing things we're never going to give back to the client. Two, we get better cache utilization, because we can operate directly on tuples in their compressed form and cram more of them into our CPU caches, so we have fewer cache misses and go out to DRAM less often to get data. This is going to be really important. With naive compression you're not always going to be able to do this, but with the other schemes we'll talk about, we can.
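To make that second goal concrete, here's a minimal sketch in Python, assuming a toy fixed-width encoding; the names here (encode, scan_eq, and so on) are made up for illustration and don't come from any real system:

```python
# Toy encoding: every value in the column is replaced by a fixed-width
# integer code, so tuple i always lives at offset i in the array.
encode = {"Andy": 0, "Dana": 1, "Matt": 2}
decode = {v: k for k, v in encode.items()}

# A compressed column of "tuples" (here just a small list of codes).
column = [encode["Andy"], encode["Dana"], encode["Andy"], encode["Matt"]]

def scan_eq(column, predicate_value):
    """SELECT positions WHERE name = predicate_value, without decompressing
    the column: we compress the constant once and compare codes."""
    code = encode[predicate_value]          # one encoding, not N decodings
    return [i for i, c in enumerate(column) if c == code]

matches = scan_eq(column, "Andy")
# Decompression is postponed until we materialize the final result.
print([decode[column[i]] for i in matches])   # ['Andy', 'Andy']
```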
It also goes without saying, and this should not be anything new to anybody here, that there's a difference between lossless and lossy compression. A lossless compression scheme guarantees that whatever data you compress going in, you get the exact same data coming out. Lossy compression is something like MP3 or MPEG encoding, where you exploit some aspect of how humans perceive sound or video to throw away some bits, such that when you decompress it a human can't tell the difference. In our database systems we're always going to do lossless compression, because the bottom line is that people don't like losing data: if I put my data in this week, I want to come back next week and get back the exact same result. It would be really bad if some stuff just went missing. So we're always going to be talking about lossless compression algorithms.

There are some cases where you do want lossy compression, but those are always performed at the application level. If you want lossy compression in your database, it's up to you as the application programmer to implement it; the database system is not going to do it for you, because it doesn't know the right threshold, the right amount of data you're willing to lose in your application. An example would be something like this: say I'm collecting sensor data and my sensors record temperatures every minute, and a year later maybe I don't need that one-minute granularity, so I compress it into five-minute segments by aggregating the readings together and taking the average. You can't go back and get the original data, because you've coalesced the one-minute readings into five-minute segments, but for your application that might be okay. Again, the database system has no way of knowing whether that was the right thing to do, which is why you're only going to see lossless compression schemes, never lossy ones.

There are some newer database systems that support what are called approximate queries. These are sort of like lossy compression in that they give you inexact results, but the difference is that these systems still store everything with a lossless scheme. An example: say I'm keeping track of the visitors that access my website, and I want to count the number of unique visitors from last week. An approximate database system with, say, a million users might do some sampling and return 999,000. It's not an exact number, but for what you actually need to know, that's good enough. There was an academic system called BlinkDB that came out of Berkeley a few years ago, which has since been commercialized as SnappyData. SnappyData is one of these HTAP systems where they have GemFire on the front end to do all the OLTP stuff, use Spark to execute the OLAP queries, and added a little extra engine to do these approximate queries. So it's sort of like a lossy compression scheme, but the data is still stored in its full form, and if you really want an exact result you can always tell it to ignore the approximation and go directly to the data. It's an interesting idea and it's related to this, but we're not going to talk about it this semester.
Now we can talk about the granularity at which the database system can do compression; think of this as the scope of the chunk of data we're going to look at and try to compress. The easiest option is block-level compression, where you take a block of tuples in your database, whether that's a page or something equivalent, and run whatever compression you want directly on that block. We'll see what that looks like in MySQL in a second. You can also do tuple-level compression, where you take the entire contents of a single tuple in a row store and compress that. You may be thinking, I'll just run gzip or Snappy on it, but again, it doesn't have to be one of the general-purpose algorithms; it could be one of the other encoding schemes we'll talk about later in the lecture. You can do attribute-level compression, where you take a single attribute value within a tuple and compress just that; you could combine multiple attributes if you wanted, but typically people do one at a time. This is actually what Postgres does: if you have a really large blob or text field, Postgres stores it in an auxiliary variable-length storage space called TOAST (The Oversized-Attribute Storage Technique), where things can be stored as compressed blocks, with a pointer to where that variable-length field lives. And the last one, which we'll spend a lot of time on, is column-level compression, where you take the values of a single attribute across multiple tuples and come up with a compression scheme for all of them; you typically only do this in a column store.

There are different trade-offs here that we'll discuss as we go along. We talked about how Postgres does attribute-level compression; we'll talk about how HANA, HyPer, and DB2 BLU do columnar compression; and we'll see how MySQL does block-level compression. Oracle also does block-level compression, but not for performance reasons; they do it for recovery reasons. Oracle has a design philosophy that every block, every page, inside the system has to be self-contained, so they only do compression within a single block. What they don't want is this: with dictionary compression you could have one page store the compressed values and another page store the dictionary, and if the dictionary page gets corrupted, you lose your compressed data too. With everything self-contained, if you ever lose a block, the damage is always contained to that single block, which I think is kind of interesting. We'll keep this in mind as we go along.
Another thing that doesn't really fit anywhere in the lecture, but I think is an interesting idea to keep in the back of your mind: just because we have these compressed blocks doesn't mean we always have to access them when we execute queries. Beyond being smart about late materialization, delaying the actual decompression to the very end, there's a technique that lets us avoid touching these blocks entirely, called zone maps. The zone map term comes from Oracle; IBM calls them synopsis tables; Vertica has them too. It's a pretty standard technique, but in the literature it's called zone maps. The basic idea is that we take our block of data, and it doesn't matter whether it's DSM or NSM or how it's compressed, and we pre-compute a zone map that holds a bunch of aggregate calculations for the data in that block. In this example here we'd know the min and max ahead of time, 100 and 400, plus the average, sum, and count. Think of it as a materialized view on a per-block basis. Now if a query comes along, say SELECT * FROM table WHERE value > 600, I can just look in the zone map, see that the max value is 400, and know there's no tuple in this block with a value greater than 600, so I don't even have to look at the data; I can skip the block entirely. Zone maps are orthogonal to the compression schemes we're talking about here, but they show up a lot in the literature: when people talk about compressing on a per-block basis, they talk about keeping a synopsis or zone map in an auxiliary data structure to avoid having to do any decompression.
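Here's a minimal sketch of the zone map idea, with made-up structure names; a real system would keep something like this per block alongside the compressed data:

```python
from dataclasses import dataclass

@dataclass
class ZoneMap:
    min_val: int
    max_val: int
    count: int
    total: int           # sum, so avg = total / count

def build_zone_map(block):
    return ZoneMap(min(block), max(block), len(block), sum(block))

blocks = [[100, 200, 300, 400], [150, 650, 700, 250]]
zone_maps = [build_zone_map(b) for b in blocks]

# SELECT * FROM table WHERE val > 600:
# consult the zone map first and skip any block whose max rules it out,
# so we never touch (or decompress) that block's data at all.
for block, zm in zip(blocks, zone_maps):
    if zm.max_val <= 600:
        continue                       # pruned without reading the block
    for val in block:
        if val > 600:
            print(val)                 # 650, 700
```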
Now we can talk about naive compression schemes. One of the big themes of this lecture is that, yes, there are off-the-shelf compression algorithms we can use to compress our data, but if we use them, the database system has to treat the compressed data as a black box: it doesn't know anything about the contents, and therefore, to run any queries on it (unless you have a zone map), it has to decompress it entirely. It doesn't matter what granularity you use; you can't infer anything about the data because you don't know how it was compressed. These general-purpose naive compression algorithms are all your standard favorites: LZO, which was sort of the first in this class of compression algorithms; LZ4, a newer one; Snappy, which Google came out with in 2011; and the new hot one that everyone's using, Zstandard from Facebook. The way to think about naive compression is that it's a black box: you put bytes in, you get compressed bytes out, but any time you want to look at them you have to decompress first. So we always have to be mindful of the computational overhead of a general-purpose scheme, and of the trade-off between compression ratio and how fast it compresses and decompresses. Notice I didn't list gzip or bzip2 here: people don't typically use those in database systems because they're really slow compared to these other ones. LZO, LZ4, and friends are designed to give up compression ratio in exchange for faster compression and decompression. If you go to the Zstandard website (I should have made this a chart, but that's okay), they have a benchmark comparing Zstandard against a bunch of alternatives. What you see is that the compression ratio Zstandard gets is much better than LZO's and LZ4's, but its speed is much slower than theirs. LZ4 is actually really fast, but you're giving up compression ratio in exchange for faster compression and decompression. gzip isn't on that benchmark, but it would have a much better compression ratio and much worse performance. These are the trade-offs people weigh when deciding which compression algorithm to use if they're doing naive compression. I know in the case of MariaDB, and I'm not sure about MySQL, you can specify when you start the database server which of these compression algorithms you want to use, depending on whether you care more about good compression or faster performance.

Another thing to be mindful of is what these general-purpose compression algorithms look like inside, and this should not be anything new for anyone who took basic CS courses. There are basically two types: entropy encoding, where you find the most common sequences in your data set and encode them using the fewest bits, which is what Huffman coding does; and, what's more common and what we'll see as we go along, dictionary encoding. All the LZ algorithms I showed before use some variant of dictionary encoding, and it's the same dictionary encoding we talked about with the SQL Server columnar indexes: you build a data structure that maps repeated segments in your data set to smaller identifiers, smaller codes, then go back over the original data and replace those repeated segments with the codes to reduce the space.

So let's see how naive compression is used in MySQL. In MySQL, I think 5.5 or maybe 5.7, they added support for compression in InnoDB. MySQL is a disk-based system, so we have our in-memory buffer pool and our on-disk pages. When the data is at rest on disk, it's always stored in compressed form, using one of those general-purpose compression algorithms. The size of each compressed page has to be 1, 2, 4, or 8 KB; pages always have to be one of a fixed set of sizes, and the reason is that it makes it easier to manage the layout on disk if you know every page fits into one of these size buckets. That means if you compress a page and it comes out at, say, 1.5 kilobytes, they pad it out to 2 kilobytes: you always take the ceiling to get to one of these bucket sizes.
Each compressed page also has a modification log, which holds changes made by transactions to the data in that page; it lets us make some changes without having to decompress the page first. Think of it as a delta store for an individual page. For example, if my query is UPDATE table SET value = 5, I don't need to know what the old value was; I'm always going to overwrite it with the new value 5, so I can just append that to the log. If it's UPDATE ... SET value = value + 5, I may have to go back and decompress the page first. Now, when the database is running and a transaction tries to access a tuple that's not in the buffer pool, we fetch one of these compressed pages into the buffer pool, and any updates that don't require reading can still go straight into the mod log (it's basically the same thing as an op log). But any time we need to read the data, we have to uncompress the page, and in MySQL a page in its uncompressed form is always 16 kilobytes. At that point the system keeps both the compressed version and the uncompressed version around. There's nothing about the protocol that requires keeping both; I think it's just for internal bookkeeping, because it makes it easier to know that this uncompressed page corresponds to that compressed one, rather than having to update internal metadata if you drop one of them. When you uncompress the page, you reapply the mod log to get the current contents, and when you want to free it up, you recompress it, put it back, and write it out later.

The downside is that when the page is in its compressed form, the database system only roughly knows what should be in there, because it keeps track of the record IDs inside it, and depending on what indexes you have, you can maybe infer some of the values. But if you ever actually need to look at a tuple, you always have to decompress the entire block: even if there are 100 tuples in the page and you only need one of them, you decompress all 100. And that's because, with a naive compression scheme, we don't know anything about what's actually stored inside, right?
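Here's a conceptual sketch of the compressed-page-plus-mod-log idea described above; this is not InnoDB's actual format, just the behavior, using zlib as a stand-in for whatever general-purpose algorithm the system is configured with:

```python
import json, zlib

class CompressedPage:
    def __init__(self, tuples):
        self.blob = zlib.compress(json.dumps(tuples).encode())
        self.mod_log = []              # blind writes land here

    def blind_update(self, key, value):
        # "SET value = 5" needs no old value, so no decompression needed.
        self.mod_log.append((key, value))

    def read(self, key):
        # Any read forces us to decompress the whole page and replay the
        # mod log, even if we only want one tuple out of the hundred in it.
        tuples = json.loads(zlib.decompress(self.blob))
        for k, v in self.mod_log:
            tuples[k] = v
        return tuples[key]

page = CompressedPage({"a": 1, "b": 2})
page.blind_update("a", 5)
print(page.read("a"))                  # 5
```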
This basically repeats everything I just said: the data has to be decompressed any time we access or modify it, because the database system doesn't understand the meaning of the compressed bytes. If instead we actually understand what is stored in compressed form, we can execute certain queries directly on the compressed data. Say you want to join two tables on columns that are compressed the same way: I can do an exact comparison of the two compressed values in each column, and if they're equal, as long as they use the same encoding, I know they'd be equal decompressed, so I don't have to decompress them. That's easy for some compression schemes. Range predicates, less-than or greater-than, are trickier. With a naive scheme you could compress on a per-value basis, compressing each value in this column with gzip or Snappy and doing the same on the other column, and you'd get exact matches because the compression is deterministic, so a straight comparison works. But you can't do a range query, because gzip and Snappy don't preserve ordering. We'll see how to overcome that as we go along. The slide shows the same setup: I want to run a query with name = 'Andy'. I have my database in its compressed format, and for an exact lookup I can just compress the term 'Andy' and find exactly what I'm looking for. We can do this for most compression schemes, even the naive ones, but we can't do range queries with naive compression, and we'll see how to get around that.

We're going to spend most of the rest of the lecture talking about columnar compression. Some of these schemes we already covered last class, with the SQL Server column store, so I'm not going to spend too much time on them; I'm bringing them up again because we can combine them with other encoding schemes as we go and get even better compression. For some of these schemes you can apply multiple of them together and do better than any one by itself. We'll talk about null suppression, run-length encoding, bitmap encoding, delta encoding, incremental encoding, mostly encoding from Amazon Redshift, and we'll spend most of our time on dictionary encoding. To ground the discussion in the context of the SQL Server indexes we talked about before: yes, those were essentially some of the same compression encoding schemes we're covering here, but that was on a copy of the data for the indexes. Now we're talking about compressing the raw underlying base table, which means in some cases we need to handle updates. In the SQL Server paper you guys read, they just froze the table entirely and made the indexes read-only; we'll talk about how to handle updates, because we're doing this on the base tables.
Yes, we could get into the philosophical discussion of what it means for something to be a tuple: if we create those columnar indexes or fractured mirrors, is that really fundamentally different from the underlying table stored in a column store? For our purposes we don't care; we're just talking about how to do this on the base tables.

All right, null suppression is more or less the thing we covered when we did Oracle's BBC encoding for bitmaps. The basic idea is that when you have a data set with consecutive nulls, zeros, or blanks, rather than storing every single value exactly, you store something like a run-length encoding that says this blank occurred this many times. This is really useful for wide tables, tables with a lot of attributes, over sparse data sets. Think of a table collecting all the log messages from the different services in my distributed application. Rather than storing the raw log messages, I want to parse them into a structured format, because that makes queries easier. But my web server logs, database server logs, and memcached server logs don't all have the same attributes, and if I combine them into a single table, a bunch of the attributes are going to be null most of the time, depending on which service generated the message. Null suppression lets you store sparse data like this log example in a small amount of space, instead of storing the null over and over again.

We talked about run-length encoding before. The basic idea is that when we have runs of the same value over some long stretch in our column, instead of storing exact copies over and over, we store a triplet: the value, the starting offset in the column, and the number of times it repeats. As we discussed, the way to get the best compression ratio with run-length encoding is to sort your columns intelligently so that you maximize the lengths of the runs.

Bitmap encoding we also talked about before. The idea is that for every unique value in an attribute, we keep a separate bitmap recording which tuple offsets in the column have that value. We can only do this when the cardinality is low, and this is something you as the database administrator have to specify: you declare that this column can be stored as a bitmap. The database system isn't going to come back and tell you, hey, I think this column has way too many distinct values; you actually don't want a bitmap index.
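A minimal sketch of bitmap encoding, using plain Python integers as the bit vectors (illustrative only):

```python
column = ["M", "F", "M", "M", "F"]

# One bitmap per distinct value; bit i is set if tuple i has that value.
bitmaps = {}
for i, val in enumerate(column):
    bitmaps[val] = bitmaps.get(val, 0) | (1 << i)

# WHERE sex = 'F': the matching offsets fall directly out of the bitmap.
bm = bitmaps["F"]
print([i for i in range(len(column)) if bm & (1 << i)])   # [1, 4]

# Predicates combine with cheap bitwise ops, e.g. the OR of two values:
either = bitmaps["M"] | bitmaps["F"]
```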
Now some new schemes. We alluded to delta encoding before, but I want to give a specific example because it will show up again later. The basic idea with delta encoding is that instead of storing the exact value of an attribute for every tuple over and over again, we store the difference of that value from the previous tuple's value. Say I have a table keeping track of temperature readings from a sensor, where every minute I take a new reading of the current temperature of the room: 12:00, 12:01, 12:02, and so forth. The way delta encoding works is that we pick the starting value in the segment as the base value, and all subsequent values are expressed relative to what came before. So here we start at 12:00 and then always store +1, +1, +1, because that's the delta, the difference from the predecessor: you get 12:01 by adding one to 12:00, and you get 12:02 by adding one to that. There are really two ways to do delta encoding: the way I'm showing here, where each delta is relative to the previous value, or relative to the first value in the segment, in which case instead of +1 you'd store +2 to represent 12:02. People typically do it the way I'm showing, because remember, in an OLAP system, where you actually want a column store, you don't care about finding exact tuples; you care about computing some aggregate across a large segment of tuples. So it doesn't matter that, to reconstruct an exact value in the middle, I have to scan from the beginning and keep a running total as I go; that's okay, because I'm going to be doing that scan anyway. Even though this compression scheme is slow for point queries, it's good for the OLAP queries we want to run. And we do the same thing for the temperature column: store each reading as a delta from the previous one.

Now this is a great example of combining compression schemes to get even better results. Notice we're storing +1, +1, +1 over and over, because we're taking a new measurement every single minute. So first we do the delta encoding, then we recognize these repeated values and apply run-length encoding on top, so we can say: +1, four times. This is an example of combining different encoding schemes together to get a better compression ratio than either alone, and we'll see it again when we talk about dictionary encoding.
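Here's a minimal sketch of that stacking, mirroring the temperature-sensor example above; everything here is illustrative:

```python
times = [1200, 1201, 1202, 1203, 1204]       # 12:00, 12:01, ... as integers

def delta_encode(values):
    # Keep the first value as the base; store each later value as the
    # difference from its immediate predecessor.
    return values[0], [b - a for a, b in zip(values, values[1:])]

def rle_encode(deltas):
    # Collapse runs of identical deltas into (delta, run_length) pairs.
    runs = []
    for d in deltas:
        if runs and runs[-1][0] == d:
            runs[-1][1] += 1
        else:
            runs.append([d, 1])
    return runs

base, deltas = delta_encode(times)           # base=1200, deltas=[1, 1, 1, 1]
print(base, rle_encode(deltas))              # 1200 [[1, 4]]

def decode(base, runs):
    # Point lookups require replaying the runs from the start of the
    # segment, which is fine for the large scans OLAP queries do anyway.
    values, cur = [base], base
    for d, n in runs:
        for _ in range(n):
            cur += d
            values.append(cur)
    return values

assert decode(base, rle_encode(deltas)) == times
```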
Another technique, similar in spirit to delta encoding, is incremental encoding, or prefix encoding. The basic idea is that we store only the difference of an attribute value from its predecessor, keeping track of how much of the predecessor's prefix we can reuse. Say I have four tuples: rob, robbed, robbing, and robot. First we figure out the common prefix each value shares with the previous one. The first value has no predecessor, so it has no prefix. For robbed, we see that "rob" is shared with the previous value; for robbing, the shared prefix with robbed is "robb"; and for robot it's "rob" again. The way we actually store this is as a pair per tuple: the prefix length, meaning how much of the previous value we can reuse, and then the suffix, the remaining characters that are unique relative to the predecessor. So the first value, rob, has no predecessor, so its prefix length is zero. For the second value, robbed, the prefix length is three, because I can reuse "rob" from the previous value, and the suffix is "bed", the part the previous value didn't share with it. And so on down the line. As before, point queries aren't easy: to figure out the exact value of this attribute for a given tuple, I have to decompress the whole run up to it. But that's okay for OLAP queries, where we're doing large sequential scans and reading all the data anyway.

Yes? The question is whether this would be more compact if the data were arranged in alphabetical, sorted order, and right, this really only works well on sorted data. We'll see when we do dictionary encoding that the leaves where they do their mapping are going to be sorted, because that's how we build order-preserving dictionaries. So again, like run-length encoding, if you pre-sort the data intelligently, you get the best compression ratio.

The next scheme is called mostly encoding. It's not really a compression scheme; it's more of an "oh, duh, that's actually a good idea" that's pretty easy to do, and it comes from Amazon Redshift. The basic idea is that the database system can recognize when the declared size of an attribute is larger than what most of the stored values actually need. Say you declare a column as a 64-bit integer, but most of the values you store in it don't need the full 64 bits. In this example I have 2, 4, 999,000,000, 6, and 8: only one value actually needs the full 64 bits; all the others can be stored in a smaller data type. So I declare my column as mostly8, meaning most of the integers I'm going to store can be represented in 8 bits, and then I have a special marker that says the value for this tuple is not stored in this column; instead there's a separate lookup table that maps this offset to the real value. Most of the time you're going to be fine with 8 bits, and for the few cases where you're not, you store the value in this auxiliary table. Obviously, if you get this wrong, if most of your values cannot be stored in 8 bits, this is going to be really slow, because every single read requires another lookup in that other table. But for a lot of cases it's fine, and it's actually a pretty big win: going from 64 bits to 8 bits for those attributes is an 8x reduction, and that's a lot. The way it works in Redshift, I think, is that you have to declare this ahead of time as the administrator; it doesn't figure it out for you automatically, for the reason that it actually couldn't. I just like this one because it's so simple that you wonder why everybody doesn't do it.
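Here's a minimal sketch of the mostly-encoding idea; Redshift's actual implementation isn't public, so the sentinel-plus-exception-table mechanics below are my assumption of how such a scheme could work:

```python
SENTINEL = 255            # marker: real value lives in the exception table

def mostly8_encode(values):
    small, exceptions = [], {}
    for i, v in enumerate(values):
        if 0 <= v < SENTINEL:         # fits in 8 bits (minus the sentinel)
            small.append(v)
        else:
            small.append(SENTINEL)    # too big: park it in the side table
            exceptions[i] = v
    return bytes(small), exceptions

def mostly8_get(small, exceptions, i):
    v = small[i]
    return exceptions[i] if v == SENTINEL else v

column = [2, 4, 999_000_000, 6, 8]
small, exc = mostly8_encode(column)
print(len(small), exc)                 # 5 bytes instead of 5 * 8 bytes
assert [mostly8_get(small, exc, i) for i in range(5)] == column
```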
Now we can get to the main thing we want to talk about today, which you read about: dictionary compression. As we talked about before, the basic idea is that when we have frequently repeated values in our columns, we represent them with smaller codes. For our purposes we'll focus on variable-length, character fields; you could also do this with integers, storing them as, say, 8-bit dictionary codes instead of 64-bit integers if the domain were small enough, but here we'll talk about it in the context of strings. This is the most common, most pervasive compression scheme used in database systems. With the exception of something like the MySQL block-level compression I showed before, if an OLAP database is doing compression, it's probably doing dictionary compression; it's in pretty much everything. It has the nice properties that you get fast encoding and decoding, you can support range queries, you can delay materialization until the very end, and you can operate directly on the compressed values. The questions we're going to answer as we go along: when do we construct the dictionary; what's the scope of the dictionary, meaning which elements of the database can share the same one; how do we allow for range queries; and how do we enable fast encoding and decoding.

For the first one, when to construct the dictionary, there are basically two approaches. One is to compute the dictionary over all the tuples you have when the data is first loaded into the system, and any time new tuples come along, use a separate dictionary for them, so you never invalidate the dictionary you've already generated. Say my database starts as CSV files on disk: when I bulk load it, I compute the dictionary for the data I just loaded, and then for any updates or inserts that come later, I maintain a separate dictionary. This is what DB2 BLU does, this is what HANA does, and I think this is what MemSQL does as well, because you don't want to have to modify a dictionary you've already generated. The other approach is to do it incrementally, allowing new tuples and updates to modify an existing dictionary. To make this work, you have to be careful about how you assign codes so that new values can be inserted without re-encoding everyone else. If I densely pack my code space, say I have four values with codes 1, 2, 3, 4, and then insert a new value that sorts between 3 and 4, I have to go back and modify all the tuples that had code 4, because now they need code 5. The typical way around this is to leave a little slack space between codes so you can insert a value without disrupting everything else; of course, at some point you run out of slack and have to re-encode everything. I don't know who actually does this in production; in the paper you read, they do something like it, and the way they're careful about it is to lock the entire dictionary any time they want to update it, because they may have to go around and re-encode everything.
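Here's a minimal sketch of leaving slack in the code space; the gap size of 10 is an arbitrary choice for illustration:

```python
GAP = 10

def build_dict(sorted_values):
    # Assign codes 10, 20, 30, ... instead of 1, 2, 3, ...
    return {v: (i + 1) * GAP for i, v in enumerate(sorted_values)}

d = build_dict(["aaa", "bbb", "ddd"])      # {'aaa': 10, 'bbb': 20, 'ddd': 30}

def insert(d, value):
    # Place the new code between its sorted neighbors' codes so that no
    # existing code (and no already-encoded tuple) has to change.
    keys = sorted(d)
    lo = max((d[k] for k in keys if k < value), default=0)
    hi = min((d[k] for k in keys if k > value), default=lo + 2 * GAP)
    if hi - lo < 2:
        raise RuntimeError("out of slack: must re-encode everything")
    d[value] = (lo + hi) // 2              # order-preserving midpoint code

insert(d, "ccc")                           # lands between 'bbb' and 'ddd'
print(d["ccc"])                            # 25: no existing codes change
```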
Now let's talk about the scope of the dictionary. This is similar to the granularity question from the beginning: for our compression scheme, what exactly are we compressing? Our scope could be a single block, where we build one dictionary for just the tuples of the table contained in that block. You get worse compression, because with a new dictionary for every single block you're going to have values repeated across all these different dictionaries, but the advantage is that incremental updates, or even complete re-encodings, only touch the tuples in that block; you don't have to scan the entire table. And we talked about how Oracle does it this way so that if you corrupt a page or a block, the damage is contained. Then there's table-level compression, where we construct one dictionary for the entire table. This gives a better compression ratio, because we capture more repeated values, but updates become more expensive. The last approach is multi-table dictionaries. Think of two tables where one has a foreign key relationship to the other: I could reuse the same dictionary for both, because the values in those two columns have to be in the same domain. No value can appear on the child side of the foreign key that doesn't appear in the parent, because that would violate referential integrity. So I can use a single dictionary for both and not have to maintain two separate data structures. This also makes joins and set operations like intersections and unions really fast, because the domains, and the encodings, are exactly the same. If you don't have this, and the encodings end up different, then to do a foreign key lookup where the value maps to one code in the parent table's dictionary but a different code in the child's, I have to decompress the child attribute first, recompress it using the parent table's dictionary, and only then do my lookup on compressed data. Sharing the dictionary across multiple tables avoids that problem.

Different database systems do different things, and usually it's somewhere in between. SAP HANA and MemSQL, I think, do something between block-level and table-level: the dictionary isn't for a single block but spans multiple blocks. In SAP HANA you have the delta store where all your new inserts and updates go in a row-store format, and when you want to convert that to the column-store format, you take a large chunk of the delta store, convert it, and compress it. That delta store chunk can be multiple blocks, so your dictionary spans all those blocks: not a full table, but not an individual block either. I actually think that's sort of the best of both worlds.

Another cool thing you can do with dictionary encoding is multi-attribute encoding. The idea is that instead of storing a single value per dictionary entry per attribute, a dictionary entry can encompass values from multiple attributes. This is related to what we talked about in the beginning, where we know that attributes in real databases are often correlated.
Correlated attributes are going to have the same combinations of values repeated many times. In this example, in my original table, attribute 1 has the domain a, b, c, and in attribute 2, any time I have a, the second value is always 202; any time I have b, it's always 101. Think again of the state and the zip code: if you know the zip code, you always know the state. So instead of storing a separate dictionary for each attribute, the dictionary can generate a single code for the pair of value 1 and value 2, and in our combined column we store just that one code. Lookups are super fast because we only have to do one probe into one dictionary to find the thing we're looking for. You see this all the time in the literature on column stores and compression, but as far as I know, no database system actually does it, because it's hard for the system to automatically figure out that two attributes will always be correlated. Just because the data it currently has looks correlated doesn't mean you won't insert something later that breaks the assumption. That's why no system does this, even though in some cases it would be a big win.

Now we need to talk about how to encode and decode the data. Encoding takes an uncompressed value and converts it to its dictionary code; decoding is the reverse. The thing to be mindful of is that there's no magic hash function that does this for us. Something like MD5, SHA-1, or SHA-256 is a one-way hash: I can do the encoding but not the reverse. You could get invertibility with a cryptographic scheme, but one, that would be super slow, and two, it wouldn't guarantee fixed-length values. So instead we maintain two data structures, one for encoding and one for decoding. And the key requirement for supporting range queries is that the dictionary be order-preserving: the sort order of the codes we assign must match the sort order of the original values.

Say I have a table with four values: Andrea, Joy, Andy, and Dana. When I generate my dictionary, Andrea is the smallest value, so it gets the smallest code; Andy comes next and gets the next one; and so forth. Now if I have a query like SELECT * FROM users WHERE name LIKE 'and%', I can actually convert it into a BETWEEN query: the optimizer recognizes that it has this dictionary with all the values in it, does a scan or binary search over the dictionary to find the values that match my WHERE clause, figures out what their codes are, and rewrites the query into a range over those codes. Then my sequential scan evaluates the predicate directly on the compressed values. This is another key example of why letting the database manage compression itself, instead of using a naive compression scheme, is always going to be better: internally, LZO, LZ4, Snappy, and all these different algorithms are basically doing dictionary encoding themselves, but their dictionary is not exposed to the database system; it doesn't know anything about it. Because we maintain the dictionary ourselves, we can process some queries by looking only at the dictionary, never touching the raw data. You can think of the dictionary as basically another index.
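Here's a minimal sketch of an order-preserving dictionary and the LIKE-to-range rewrite just described; this is illustrative, not any system's actual optimizer logic:

```python
import bisect

values = sorted(["andrea", "andy", "dana", "joy"])   # sorted => codes sorted
code_of = {v: i for i, v in enumerate(values)}       # andrea=0 ... joy=3

def prefix_to_code_range(prefix):
    # All values with this prefix form a contiguous run in sorted order,
    # so LIKE 'and%' becomes BETWEEN lo AND hi on the integer codes.
    lo = bisect.bisect_left(values, prefix)
    hi = bisect.bisect_left(values, prefix + "\uffff")
    return lo, hi - 1                                # inclusive code range

lo, hi = prefix_to_code_range("and")                 # (0, 1)

# The scan now compares integers, never strings:
column = [code_of["joy"], code_of["andy"], code_of["dana"], code_of["andrea"]]
print([i for i, c in enumerate(column) if lo <= c <= hi])   # [1, 3]
```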
Let's look at another example: SELECT name FROM users WHERE name LIKE 'and%'. We know we always have to do the lookup into the dictionary to figure out which codes to look for. But because I'm doing SELECT name rather than SELECT *, it may be that I don't even have to look at the other columns in my table. For this particular query I do still have to scan, because the dictionary carries no information about the number of tuples that have each value; it just has the encoding, not the frequencies. So I still do my lookup in the dictionary table and then my sequential scan over the regular table. But if the query had a DISTINCT clause, all I'd need to do is find my matches in the dictionary; I never need to look at the original data. You can't do that with naive compression, because you don't have access to the internal dictionary table. Same kind of thing with a SELECT COUNT(DISTINCT ...): you can count the number of unique names directly in the dictionary without having to look at the original data. That's a big win, and that's why dictionary encoding, if you preserve the order, helps you out.

Yes? So the question is: could we get around the sequential scan by looking at the dictionary, finding that Andrea and Andy are the only two possible hits, converting the query into two equality predicates, name = 'Andrea' OR name = 'Andy', and then using another index to look up those tuples? Yes: it can actually still be one query with an OR of equality predicates, and if you had such an index, you could use it. For this example I'm assuming there are no additional indexes and we're always doing a sequential scan; as I was trying to say before, this technique is independent of whatever indexes you have, and if you had those indexes, you could use them to speed this up the same way. But even with a straight sequential scan, another big benefit is that instead of comparing string to string, when I do my BETWEEN I'm just doing comparisons between integers, and that's super fast. And if I'm doing a sequential scan over integers, I can use SIMD to vectorize the execution, which is even faster. So not only is the data smaller, because we're storing integers instead of strings; comparing the values is way faster too.
The one downside is that if the wildcard is not at the end but at the beginning or in the middle, you have to do more work: a sequential scan inside the dictionary itself. When the wildcard is at the end, you can just do a binary search to find exactly what you're looking for; when it's somewhere else, you may have to scan the entire dictionary, and then instead of converting the predicate to a BETWEEN, you convert it to an IN clause.

Okay, so how do we actually implement the dictionary? One obvious option is a hash table, which is nice because hash tables are fast and compact and give us O(1) lookups. But it's the same problem we talked about at the beginning: you can't easily do range queries on them, and the prefix LIKE queries we just discussed become hard. So the obvious choice is an order-preserving tree: a B+tree, a skip list, a BW tree, it doesn't matter. The downside, of course, is that this is slower than a hash table and takes more memory, but we get the range and prefix queries we care about. The paper you read uses an interesting idea: two B+trees for encoding and decoding, but instead of each maintaining its own leaves with the value-to-code mapping, they share the leaves, which I think is actually really novel. Think of it as two indexes going in two different directions that meet in the middle at the shared leaves. The leaves hold the mappings from value to code, so regardless of whether you come from the top down on one side or the other, you land on the same leaf data and can search quickly for what you're looking for. If you're encoding you can do a binary search within the leaf; if you're decoding you do a linear search, but the page isn't really that big, so the lookup isn't super expensive even with a linear search. The way this works: the encoding index takes the original value, you traverse down to the leaf nodes, and you spit out the encoded value; likewise, given an encoded value you go in the other direction and come back out with the original value.

One thing to point out, which I mentioned before: notice the encoded values here are 10, 20, 30, 40. That way, if I insert a new value that sorts between 10 and 20, I can map it to, say, 15, store the new encoded value, and not have to update anything else. If I run out of space between 10 and 20, I have to readjust everything, which basically means a sequential scan across the entire data set, re-encoding everything. This is why I said before that if the dictionary has a smaller scope, a single block or a subset of the table, you may not have to do this update-and-re-encode over and over, because once that segment fills up, you stop inserting new things into it. Think of the delta store example: once data gets cold in the row store, it migrates to the column store, and once it's in the column store, it's probably never going to be updated again, so I'll build my dictionary once and that's pretty much the end of it.
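Going back to the shared-leaves structure, here's a minimal sketch of the bidirectional mapping; a real system would use two B+trees sharing leaf pages, and plain sorted lists stand in for them here:

```python
import bisect

# Leaf entries sorted by value; because the dictionary is order-preserving,
# they are simultaneously sorted by code.
leaves = [("andrea", 10), ("andy", 20), ("dana", 30), ("joy", 40)]
by_value = [v for v, _ in leaves]
by_code = [c for _, c in leaves]

def encode(value):
    # Encoding direction: binary search on the value side of the leaves.
    i = bisect.bisect_left(by_value, value)
    assert i < len(leaves) and by_value[i] == value
    return leaves[i][1]

def decode(code):
    # Decoding direction: since the codes are sorted too, a binary search
    # works here as well (the paper searches within the shared leaf page).
    i = bisect.bisect_left(by_code, code)
    assert i < len(leaves) and by_code[i] == code
    return leaves[i][0]

print(encode("andy"), decode(30))    # 20 dana
```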
The other thing, which she brought up before, is that these leaf entries are sorted by value, and if we're storing strings, we're probably going to have a lot of similar prefixes. So we can do additional compression inside the shared leaves using the incremental prefix encoding that we talked about before, and get an even better compression ratio for these guys. Again, we can daisy-chain these compression schemes together and do better than we would with any of them by themselves.
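Here's a minimal sketch of what that incremental prefix encoding could look like on a sorted run of strings, assuming a simple (shared-prefix-length, suffix) representation; the function names are illustrative:

```python
# Each entry stores how many leading characters it shares with its
# predecessor, plus the differing suffix.
def prefix_encode(sorted_strings):
    encoded, prev = [], ""
    for s in sorted_strings:
        # Count the prefix shared with the previous string.
        n = 0
        while n < min(len(s), len(prev)) and s[n] == prev[n]:
            n += 1
        encoded.append((n, s[n:]))   # (shared length, remaining suffix)
        prev = s
    return encoded

def prefix_decode(encoded):
    out, prev = [], ""
    for n, suffix in encoded:
        s = prev[:n] + suffix        # rebuild from the predecessor's prefix
        out.append(s)
        prev = s
    return out

words = ["database", "databases", "datastore", "daytime"]
enc = prefix_encode(words)
# enc == [(0, 'database'), (8, 's'), (4, 'store'), (2, 'ytime')]
assert prefix_decode(enc) == words
```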
Okay, so that's about it for OLAP compression. As I said, dictionary encoding is the most widely used approach, and I won't really get into how different systems do different things, like how DB2 does it differently from HANA. I don't think the scheme from the paper you guys read is actually implemented in any system, but I had you read it because it lays out all the problems of dictionary encoding, and they talk about how to do updates in both directions, which I think is kind of cool. But everything I've talked about so far is not what you want to do for fast transaction processing workloads. The naive block-based compression I talked about at the beginning, which is what MySQL does and what WiredTiger does for MongoDB, is good enough for general-purpose workloads, but if you want fast in-memory transactions, you don't want to do something like this. Typically you don't want to compress the tuple data at all, because then every time you access a tuple you have to decompress it, and that takes time; you're burning cycles doing that. And as we talked about before, when you run a transaction you want it to finish as quickly as possible, because you're holding locks and latches while you're doing operations; if you delay the time it takes to complete the transaction, you're slowing everybody down. So in our research we looked at what else we can compress in an OLTP database, other than the data, to reduce the memory footprint of the database in memory, and it turns out that for a lot of OLTP databases it's actually the indexes that are the biggest consumer. We did a study using H-Store, which is the academic version of the OLTP database I helped build when I was in grad school: we loaded three standard OLTP benchmarks and measured what percentage of the total memory usage went to the actual data versus the indexes. What you see is that for these workloads, in the case of TPC-C, 57% of the total memory usage is spent just on the indexes, right, and 54% for Voter. The Articles benchmark is supposed to model something like Reddit or Hacker News, where people post articles and comments, so you have larger text attributes, and that's why the indexes aren't as big a fraction there. But this is really common in OLTP databases, because people want their transactions to run really fast, so they build a lot of indexes to speed everything up; you never want to do a sequential scan inside an OLTP database. So if we know we don't want to compress the tuples, because that would be slow, we want to figure out a way to compress the indexes instead. The approach we came up with is a technique called hybrid indexes. The basic idea is similar to the delta store approach we talked about last class: you have a fast, dynamic part for all the new data coming in, and a compressed, compact version as the backend for cold data that's not being updated. We're still talking in the context of an OLTP database, so when I say cold data, I mean data that's maybe still read, but never actually updated, in the context of transactions. So we're going to have a dual-stage index. In the front we'll have a dynamic stage that's fast to update, and that's where all new information goes; then over time we migrate cold data to a backend static stage, which uses a compact index that's much smaller in size than the dynamic stage but still just as fast. All the updates go to the dynamic side, and any time we want to do a read, depending on what the query is and what kind of index it is, we may have to check both stages. So the high-level diagram looks like this. We have our dynamic stage, and this could be any index you want: a skip list, a B+tree, a BW-Tree, an ART index, it doesn't matter. Same thing on the static side; that will be a compressed version of the tree index, as we talked about before. All inserts, updates, and deletes always go to the dynamic side, and over time, when it gets too big, we merge its contents into the static index; we have a way to rebuild that over and over again. Then when I want to do a read, I have a Bloom filter in front of the dynamic index that tells me whether it thinks the key I'm looking for is in there. Remember, Bloom filters can give you false positives. So if the filter says the key might be there, we do our read in the dynamic stage, and if it turns out not to be there, a false positive, we know we still need to go check the static side; and if the filter says no, it's definitely not there, we skip the dynamic stage and go straight to the static side and do our read there. Think of this as being done at the data structure level, like the skip list you guys are building: there's a single logical index interface that the database system uses; it doesn't know or care that there's a dynamic side and a static side, it just has a single API to make calls into the index. The one tricky thing about hybrid indexes is range scans, when you want to do iteration, because now you need two concurrent iterators, one in each stage, and you have to walk through them in lock step to avoid false positives or false negatives in the result.
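Here's a minimal sketch of that read path, assuming plain Python dicts stand in for the two stages and a small bit-array Bloom filter guards the dynamic one. All class and method names are made up for illustration; this is not the actual H-Store code.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _hashes(self, key):
        # k hash positions derived from SHA-256 (simple, not tuned).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.m

    def add(self, key):
        for pos in self._hashes(key):
            self.bits[pos] = 1

    def maybe_contains(self, key):
        # No false negatives; occasional false positives.
        return all(self.bits[pos] for pos in self._hashes(key))

class HybridIndex:
    def __init__(self):
        self.dynamic = {}          # stand-in for the fast, mutable stage
        self.static = {}           # stand-in for the compact, read-only stage
        self.bloom = BloomFilter()

    def insert(self, key, value):
        # All writes go to the dynamic stage.
        self.dynamic[key] = value
        self.bloom.add(key)

    def lookup(self, key):
        # Probe the dynamic stage only if the filter says the key might be
        # there; on a miss (including a false positive), fall through to the
        # static stage.
        if self.bloom.maybe_contains(key) and key in self.dynamic:
            return self.dynamic[key]
        return self.static.get(key)

    def merge(self):
        # Periodically migrate everything into the static stage and reset.
        self.static.update(self.dynamic)
        self.dynamic.clear()
        self.bloom = BloomFilter()

idx = HybridIndex()
idx.insert("a", 1)
idx.merge()
idx.insert("b", 2)
assert idx.lookup("a") == 1 and idx.lookup("b") == 2
```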
So let me show you how you actually compress an index. Here's a standard B+tree: we have pointers to everything, and the protocol says we only merge when one of these nodes becomes less than half full, so in this case we have a bunch of empty space because we're expecting new entries to come in. But because this is on the static side and it's a read-only index, we know we're not going to insert anything new, so we can get rid of all that empty space and collapse the tree into a more compressed form. I think the studies show that B+tree nodes are on average something like 60 to 69% full, which means about 40% of the memory the B+tree is using is completely empty, so we get rid of all that entirely, because we know our index is going to be static. The next thing we can get rid of is all these pointers. Again, we have those pointers because a B+tree is meant to be a dynamic index: we can do splits and merges and move nodes around, and we have to update pointers to go find them. But because we know we're read-only, we don't actually need them; instead we use fixed offsets to know how to jump from some position in one giant array to the values we're looking for. And we don't need to keep the sibling pointers anymore either, because if I want to do a leaf-node scan across the bottom, I just scan the array, right? So this is how we get better compression in our static indexes.
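As a minimal sketch of that idea, assuming the merged entries fit in memory and using illustrative names: once the index is read-only, the leaf entries can be flattened into one dense sorted array, so the empty slots and the child and sibling pointers disappear; point lookups become a binary search over fixed offsets, and a leaf scan is just a walk over the array.

```python
import bisect

class CompactStaticIndex:
    def __init__(self, entries):
        # entries: iterable of (key, value) pairs merged from the dynamic
        # stage. Sorting once replaces the tree structure entirely.
        pairs = sorted(entries)
        self.keys = [k for k, _ in pairs]    # dense, no empty slots
        self.vals = [v for _, v in pairs]

    def lookup(self, key):
        # Binary search over fixed offsets instead of chasing pointers.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.vals[i]
        return None

    def scan(self, lo, hi):
        # A leaf scan needs no sibling pointers; it's a slice of the array.
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return list(zip(self.keys[i:j], self.vals[i:j]))

idx = CompactStaticIndex([(30, "c"), (10, "a"), (20, "b")])
assert idx.lookup(20) == "b"
assert idx.scan(10, 25) == [(10, "a"), (20, "b")]
```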
Now, some benchmarks we ran to show the performance benefit and the memory reduction you get from this. This uses the STX B+tree, which is probably the best state-of-the-art open-source B+tree implementation, though it's single-threaded, not a concurrent index. The workload is 50% reads and 50% writes over different key types: random 64-bit integers; monotonically increasing 64-bit integers, think of those as an auto-increment key; and email keys, which come from a real data set from the Ashley Madison dump that got released about two years ago. Across the board, you see the hybrid index with its dual stages actually gets better performance, because the dynamic index stays really small, since we're always merging things into the read-only index, and therefore more of it fits in the CPU caches. And even though we're getting better performance, we're using less memory too, because we're storing most of the data in the static, read-only, compact index in the second stage. So this is awesome: not only are we faster, we're using less memory, which is absolutely what you want in an in-memory OLTP database. In the paper, which we're not going to read, we talk about how to apply these compaction techniques to basically any tree data structure. We did this on a skip list, on an ART index, on a Masstree; we didn't apply it to the BW-Tree, but I think it would work there as well. It just shows you can apply this to any index and use it inside an OLTP database. We did this in H-Store, and the goal, maybe this semester, is to port this technique over to Peloton. All right, so what are my parting thoughts on this? As I said, dictionary encoding is probably the most widely used compression scheme, and it's super handy because you don't have to do any pre-sorting or pre-organizing of the data as you would with, say, run-length encoding and those other schemes; you just stick the data in and you get great compression. We also showed how you can apply different approaches one after another to get better compression than you would using any of them by themselves. And we showed how it's important to wait as long as possible while executing a query before you actually decompress the data, because if you don't, you're essentially just wasting cycles. Currently, Peloton doesn't do any compression. I think it would be an awesome project 3 in this class if someone did dictionary compression, whether you do the scheme presented in the paper or the other schemes from HANA and DB2 BLU that we can talk about; I think it would fit nicely with everything we've covered in this course. All right, so for next class, on Tuesday... next week... actually, no, today is Tuesday, so it's Thursday, sorry. We'll talk about how to do logging: how do we make sure that if we have an in-memory database and we apply changes to it, and someone comes along and pulls the plug, we don't lose everything. We'll cover physical logging and physiological logging, which is the most common technique, and then we'll show the benefits of logical logging, which is what VoltDB does. Okay, any questions? Okay, awesome, guys. See you on Thursday.