I've got chalk everywhere, that's why I can't have anything nice. So today we're going to talk about database compression. We'll jump right in and get into it. We'll talk at the beginning a little bit about why database compression is important. It's sort of obvious, but we can talk a little bit about the trade-offs and issues you can face with it. Then we'll talk about using what I'll call naive compression or general-purpose compression. This is related to what one of the groups presented on Monday, where I sort of stopped and said no, you don't want to do this, it might be a bad idea. So we'll talk about the implications of doing naive compression. And then we'll spend most of our time talking about columnar compression in OLAP systems, because this is where you get the most bang for the buck, right? This is where we can actually have the database system be written in such a way that it can operate and execute queries directly on compressed data without having to decompress it at all, which is the limitation of the naive approach. And then we'll spend a little bit of time talking about research going on here at CMU on compressing OLTP indexes. That work is actually not published yet, but I'll give you a preview of how it works and look at some early numbers. Okay? And as always, stop me if you have questions as we go along. So if you think about it, in the old days — and I say old days even though disk-based systems are still around — in a disk-oriented system, the disk is always going to be the major bottleneck, right? Even if you have a flash drive, it's still going to be orders of magnitude slower than reading and writing from DRAM. So in a traditional disk-oriented system, they make the big trade-off that we're willing to pay CPU cycles in order to reduce the amount of I/O we have to do to get data from disk. Think about those graphs of CPU speeds and Moore's law and the number of transistors: there's always that exponential growth in the performance of CPUs. And if you pair that with the read and write speeds — the latencies of accessing a spinning-disk hard drive or an SSD — that curve is always much flatter, right? CPUs have gotten so much faster over the years and disks haven't really kept up. So now we have all these spare cycles and what are we going to do with them? Well, we'll compress things, right? That way we reduce the amount of I/O we have to do. In an in-memory database system, it's a little more complicated because we don't have a disk anymore. Reading things from DRAM is really, really fast. Despite what we talked about with making sure we have things in the L3 cache, going to DRAM is still orders of magnitude faster than going to disk. So maybe we don't want to pay that big penalty of having to do computationally expensive compression on our data. Well, it turns out in the end you actually do want to do this, and we'll talk about the reasons why as we go along. But in general, the reason why it's important to do compression in an in-memory database is obviously because you can store more things in memory that way, right? DRAM is very expensive, not only to buy and have in your system but also to actually maintain in your server: of the energy cost of a general-purpose rack machine, about 40% of the energy usage goes to just refreshing DRAM.
So if you have fewer machines, or machines that use less DRAM to store your database because you compress things, then you end up paying less for energy and you have to buy fewer machines, so that's important. But the key trade-off that we're always going to make when we talk about compression algorithms is speed versus compression ratio, right? In a database system, as far as I know, every single database that does compression — and especially the in-memory database systems — is always going to choose speed over compression ratio. Queries need to run fast, transactions need to run fast, so we're willing to give up some extra compression that we could get with a more computationally expensive approach in order to get faster execution times for queries. So we need to talk a little bit about what real data looks like to understand why compression actually makes sense and is going to work for us. In general, real-world databases have two key characteristics that we're going to want to exploit in our compression scheme. The first is that the values in these databases tend to be really, really skewed. That means the distribution of values for a particular attribute is going to be skewed such that most tuples have the same value, right? The classic example of this is the Zipfian distribution, like the word frequencies in the Brown corpus. Does anybody know what the Brown corpus is? What is that? You shake your head, Jess? Right, well, what is it? Right, so in the 1960s at Brown, they hand-picked a bunch of documents that they considered to be emblematic, representative of American English. And then someone went along and counted all the different words in the corpus, and they found out that the word that occurs the most is the word "the", and the word that occurs second most occurs half as many times as the first word. And likewise going on down, you get an exponential curve. So that means that if you were building an index or storing the Brown corpus, you would want to compress the word "the" in an efficient way, because that's going to occur more than anything else, right, and you get the most savings that way. And this occurs all the time when you think of internet applications: on eBay a small number of sellers post most of the auctions, on Reddit a small number of users post most of the comments and articles, and things like that. So there's a lot of skew in our data sets and we want to exploit that. The second is that we're going to have a lot of correlation between different attributes within the same tuple, and we can try to exploit that in different ways. What I mean by that is that the value of one particular attribute is going to be in some way connected to the value of another attribute. So for example, if you have an address table and you have the person's street address, their zip code, and their city — well, if you know their zip code, then you probably also know, actually you do know, their city, right? So given one of those values, you'll know the other one. Same thing if you have the order date of when someone purchased something on Amazon: the date that it gets shipped is usually within the next week, right?
It's not going to be some arbitrary timestamp way off in the future. So these are the kinds of things we want to be able to exploit when we do compression in our database; these are the things we want to be mindful of. So the two main goals we need to have in a good database compression scheme are, first, that we always want to produce fixed-length values, right? If we have variable-length values, then we can't do that word-alignment stuff that we talked about when we were packing things in memory, and we have to worry about fragmentation in our memory allocator. So it's always easier for us to make sure that our compression scheme produces fixed-length values. The second major goal is that we want to postpone decompressing the data for as long as possible during query execution. So again, not to harp on the guys from Monday, but they talked about compressing a block and then, when you want to access it, decompressing it every time, right? You have to decompress it as soon as you access it, because you don't know what's actually inside of it. But with the other compression schemes that we'll talk about here, we can actually go pretty far up the query plan tree operating directly on compressed values. It's only when we have to return a value back to the application that we invoke the decompression to put it back into its original form. And the advantage of that is that now we're passing around fewer bits and bytes from one operator to the next, because the data can still be in compressed form. So these are the two major goals that we want to support in all our compression schemes. When we're talking about compression, we have to talk about lossy versus lossless. Everyone here should pretty much understand the difference, but the main idea is that with a lossless compression scheme, when you compress the data and decompress it, you get back the original value, right? Something like gzip on Linux is an example of a lossless scheme. A lossy scheme would be one where you compress the data and then decompress it, and some of the original data will be gone, because it got thrown away during the compression process. The easiest ones to think about are MP3s, video compression schemes, or JPEG. In a database system, we're always going to do lossless compression. We never want to do lossy, because if we compress somebody's data lossily and they don't know we're doing it, then they go to access it and now their data is missing, and they're going to get pissed and think the database is doing something wrong, right? So every single database system that does compression does lossless, because deciding what it means to lose data and still be okay requires higher-level, application-level meaning that the database system can't figure out on its own. So any time you do lossy compression, it's done by the application programmer. You see this a lot in time-series databases, right? Say you're collecting a new measurement from a sensor every minute. Well, maybe you don't need that one-minute granularity for last year's data, so you'll combine it together and take an average for every five minutes or every ten minutes or something like that. That's an example of lossy compression, but that's something that you as the application programmer have to implement.
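To make that concrete, here's a minimal sketch of what that application-level lossy compression might look like, assuming the readings are just (unix timestamp, value) pairs at one-minute granularity. The function name and data layout are made up for illustration; the point is that the application, not the DBMS, decides the old granularity isn't needed.

```python
from collections import defaultdict

# Hypothetical application-level "lossy compression" for old sensor data:
# collapse one-minute readings into five-minute averages.
def downsample(readings, bucket_seconds=300):
    """readings: list of (unix_timestamp, value) at 1-minute granularity."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts - (ts % bucket_seconds)].append(value)
    # One averaged row per 5-minute window replaces five original rows.
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]

raw = [(0, 70.0), (60, 70.5), (120, 71.0), (180, 70.0), (240, 69.5), (300, 68.0)]
print(downsample(raw))   # [(0, 70.2), (300, 68.0)]
```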
There are some newer database systems that can support approximate queries, which are sort of like lossy compression, but the data itself is still stored in a lossless scheme. What happens is they do sampling to generate query results without having to look at all the data, right? Most of the time this is not a big deal: if you want to know the average sale price of 10 million items you sold, and you only look at 9 million of them, it's still going to be pretty close to the true value. BlinkDB is probably the key system that does this kind of thing, and actually they just forked off a commercial version of it a few weeks ago called SnappyData. But again, they're still doing lossless compression; it's just sort of a lossy scheme when they actually execute the queries, which I think is kind of cool. Okay, so we also need to talk about what the granularity of our compression scheme is going to be. Different database systems do different things, and some of these granularities are going to be better than others. Block level would be: you take a single block, you compress it in some way, and that's the scope of it — every block is compressed independently of all the other blocks. Tuple level: we take the contents of the entire tuple and compress it. Attribute level would be taking a single attribute, a single field, within one tuple; you can combine this to compress multiple attributes together. And then column level, which is what we're mostly going to talk about, is taking all the values in a column of a single table — you can still break that up even more, maybe into individual blocks, but you're looking at attribute values stored contiguously one after another. This one is going to require you to use a column store; doing it in a row store doesn't really make sense and you couldn't quite do it. There's also, in a disk-based database system, file-level or table-level granularity. That's usually not a good idea, because if you ever have to update the table you may have to decompress the entire thing, and that can get really expensive. In the case of file-level compression, it's usually done at the storage layer: the file system or the NAS or SAN that you're using to store the database can do the compression for you underneath the covers in hardware. Okay, so now let's talk about — yes? So, attribute level would take one tuple, take one actual attribute, and compress it. Postgres does this: if you have a varchar that doesn't fit inline, they can store it in this thing called TOAST storage, and that can be compressed using naive compression, right? Column level is taking all the values within a single column, right? Yes? That's true — like, for example, an address has numbers in it and then a street name, and you could compress those pieces separately. I'd sort of lump that in with attribute-level compression, right? You can go with a finer granularity, but I think you sacrifice performance because you need to be more careful about how you unpack it as you need it. Most of the time in OLTP, actually most of the time in most queries, you need the whole attribute value, right? You can use something like an inverted index to do text search to get at one particular element of an attribute.
But that's separate from compression. Yes, I would lump it in with this. Do you have another question or no? No. Okay, so Postgres is an example of this, right? Again, you just take the varchar or blob and compress it somewhere else. Okay. All right. So, naive compression: the basic idea is that we're going to treat whatever it is we're compressing as a black box. We have a compression scheme that just takes some input, crunches it, and spits out the output, and that's what we store in place of the original data. The examples of compression algorithms you could use for doing naive compression are LZO, LZ4, and Snappy — those are the main ones. These are different from the gzip or bzip algorithms, because these algorithms are designed to be very fast in exchange for giving up some compression ratio. LZO was sort of the first one that came out in the 1990s, LZ4 is a newer algorithm in the same family, and Snappy is Google's open-source compression algorithm. They're all in the same family of LZ-style algorithms, and they're designed so you can use them in a database as you update things or as you access things: they make the sacrifice of giving up compression ratio in exchange for not having to spend a lot of CPU cycles. All right. So, again, what's going to happen here is we take the raw bits or bytes of our data, we run them through this algorithm, and then we store whatever the compressed version is back in our table space or heap. And any time we need to access it, potentially we have to unzip the whole thing, right? We have no way to know what's actually inside of it. So in general, there are basically two types of algorithms you can use to do naive compression. The first is called entropy encoding; Huffman codes are the example of this, if you remember them from intro CS. The basic idea is that for the more common sequences of data — the segments of the thing you're compressing — you use fewer bits to encode them, and for the rarer, less common segments you use more bits, right? Huffman coding is the classic example of this. Dictionary coding is actually very similar to what we've already talked about with SQL Server's column store indexes, because those were doing dictionary encoding; we'll talk about dictionary compression later on in this lecture. The basic idea is the same: you find repeated segments of data, store them in some kind of dictionary or lookup table, and assign them a code. Then you use that code in place of the original value, so you're using less data to store things than if you stored the whole original value, right? So all the LZ-style algorithms I just mentioned are examples of dictionary coding. But, again, the difference is that the compression scheme is a black box to the database system: I put some bytes in, I got some bytes out. Even though it's doing dictionary encoding underneath the covers, the database system doesn't know what the original values are and doesn't have access to the dictionary, right? So you have to decompress the whole thing.
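To make the black-box point concrete, here's a tiny sketch of naive page compression from the DBMS's point of view. I'm using zlib from the Python standard library as a stand-in for LZO/LZ4/Snappy (the specific algorithm doesn't matter for the point): bytes go in, bytes come out, and the only way to evaluate a predicate is to inflate the whole page first.

```python
import zlib

# A "page" is just a serialized run of tuples; the compressor has no idea.
tuples = ["Andy|99999", "Matt|88888", "Lin|11111"]
page = "\n".join(tuples).encode()

compressed_page = zlib.compress(page)          # stand-in for LZO/LZ4/Snappy

# The DBMS cannot ask "which tuples have salary 88888?" against
# compressed_page. It has to decompress the whole thing and then scan.
for line in zlib.decompress(compressed_page).decode().split("\n"):
    name, salary = line.split("|")
    if salary == "88888":
        print(name)
```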
Whereas if the database system implements dictionary encoding itself, it can maintain the dictionary as an index to look things up, and it can do some query processing directly on the dictionary without having to decompress the original values, right? So that's the difference between dictionary coding inside a naive, black-box compression algorithm and dictionary encoding done natively by the database system — that's the limitation. So I'm going to show you an example of how block-based naive compression works in MySQL. I think it was MySQL 5.5 where they announced that they can do block-level or page-level compression in MySQL's InnoDB storage engine. The way it works is that on disk we have our pages and they're always stored compressed, and a compressed page has to be either 1, 2, 4, or 8 kilobytes, right? That means that if you compress your page and it comes out just a little bit over 1 kilobyte, it gets padded out to 2 kilobytes. This ensures you have these fixed-length sizes and makes it easier to manage their allocation. And then each compressed page has what we'll call a modification log appended to it. What this gives us is a place to store updates to the data that's in the page without having to decompress it first, right? I'll show you what this looks like in a second. So when we have a query that, say, wants to access page 0, we copy it from disk into memory, and it will still be in its compressed form. Now, any time a transaction wants to modify this page, we're going to append the change to this modification log. And we can do this without having to know what the actual value on the page is, because there'll be some other index that we look up and it says we want to find, say, tuple #5, and tuple #5 is in page 0 at offset 6, and we want to update its fifth attribute. So all we need to do is put that information into the modification log. Now, if any other transaction wants to read that particular tuple we modified, it goes to this page, checks the mod log to see whether that tuple's information is in there, and if so, it can just use it without having to decompress the page. But now say a query actually wants to read the tuple's contents — then we have to uncompress the page, and an uncompressed page is always going to be 16 kilobytes. So we've gone from 1, 2, 4, or 8 kilobytes, and then it expands to 16 kilobytes, and then we apply the mod log changes to the page. What happens is that MySQL doesn't throw away the compressed page: it either keeps both the compressed version and the uncompressed version, or only the compressed version. And then if it decides, all right, I need to evict something, I need to get rid of this page, it can recompress it and store it back here. Eventually, if it realizes it doesn't need the page at all anymore, it just throws it away. So the key thing again: if I have to read the tuple, I have to decompress the entire page; if I just have to update it, I can keep applying updates to the mod log until it gets full, and then I have to open it back up and apply the changes. So this is an example of doing block-level naive compression. Yes. So your question is, what happens when the mod log overflows? You've got to uncompress the page, apply the mod log changes, and pack it again.
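Here's a toy sketch of that mod-log idea — not InnoDB's actual format, just the mechanism: keep the page compressed, append changes to a small log, and only decompress when someone actually has to read the page contents. Again using zlib as a stand-in compressor.

```python
import zlib

class CompressedPage:
    """Toy model of a compressed page with an appended modification log."""
    def __init__(self, tuples):
        self.blob = zlib.compress("\n".join(tuples).encode())
        self.mod_log = []                      # (slot, new_value) entries

    def update(self, slot, new_value):
        # Writes don't touch the compressed blob at all; they just append.
        # (A real system would recompress once the log area fills up.)
        self.mod_log.append((slot, new_value))

    def read_all(self):
        # Reads are the expensive path: inflate, then replay the mod log.
        tuples = zlib.decompress(self.blob).decode().split("\n")
        for slot, new_value in self.mod_log:
            tuples[slot] = new_value
        return tuples

page = CompressedPage(["Andy|100", "Matt|200", "Lin|300"])
page.update(1, "Matt|250")        # cheap: no decompression needed
print(page.read_all())            # ['Andy|100', 'Matt|250', 'Lin|300']
```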
So the question is, could you do that? I don't know what they actually do; I don't see any reason why you couldn't, but I don't think it matters much. OK. But again, the main thing to remember here is that if we have to read the data, we always have to decompress it, and potentially if we want to modify it we have to decompress it as well — we try to use the mod log, but if it gets full, then we have to do something else. So the problem with this is that it limits the scope of the compression scheme. What I mean by that is we can only compress a single page or a single block of data, and that means we may not be getting the best compression ratio, because the number of repeated values within a single page might be kind of low. Whereas if we could compress the entire table, then surely we'd have more repeated values and we'd get an even better compression ratio. Another key thing is that there's nothing in the compressed data that the database system can exploit to find higher-level meaning about what's actually in that data — the semantics of it — that it could then use for query processing without having to decompress it. To give an example: unless you maybe pre-compute an average or a min and a max or something for the page, you have to decompress the entire thing if you want to compute that aggregate. Whereas in some of the compression schemes we'll talk about later on, you can compute some of these aggregate queries without having to decompress anything. If you want to do a count, for example: if I have 10 compressed items, I don't need to decompress them to count them, I can just count the compressed items. You can't quite do that here unless you pre-compute those aggregations. Any questions about naive compression? So in general, MySQL does this, I think MongoDB's WiredTiger engine does this — a lot of different databases do this if they're trying to support general-purpose workloads. For the OLAP stuff, the columnar compression that we're talking about next is the better way to go, in my opinion: it's less computational overhead, and you can still operate directly on compressed data. So another key thing to point out, related to that — can you operate directly on compressed data? If the constants in your query's predicates are compressed the same way that your data is compressed, then you don't have to do any decompression. So let's look at an example here. Let's say we have a table with a couple of tuples in it, with a name and a salary, and then we have some query here: we want to find all the users where the name equals Trump. If we compress our table here — and I'm not going to say what this compression scheme is, let's just say we're doing attribute-level compression — and we rewrite our query so that instead of name equals Trump it says name equals the compressed or encoded value of what Trump used to be, then we can do our lookup directly on this data without doing any decompression. So say we don't have an index and we have to do a sequential scan: we don't need to go down through every single attribute and decompress it one by one to see whether we match, right? We just look for the thing that equals XX. We find that, then we do whatever we need to do to reverse the encoding and decompress that value to get back the original, and poof, there we have our answer.
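Here's a small sketch of that predicate-rewriting idea, assuming a simple attribute-level dictionary (the codes and table contents are made up). The point is that the scan compares encoded values against an encoded constant, and only the matching rows ever get decoded.

```python
# Hypothetical attribute-level dictionary for the name column.
encode = {"Andy": 10, "Matt": 20, "Trump": 30}
decode = {code: value for value, code in encode.items()}

# Table stored with encoded names; salaries left uncompressed for clarity.
rows = [(30, 99999), (10, 22000), (20, 58000), (30, 155000)]

# SELECT * FROM users WHERE name = 'Trump'
# gets rewritten to ... WHERE name_code = encode['Trump'], so the
# sequential scan never decodes the non-matching rows.
target = encode["Trump"]
for name_code, salary in rows:
    if name_code == target:                    # compare compressed values
        print(decode[name_code], salary)       # decode only for the result
```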
This is why you can't do this if you're doing the block-level naive compression scheme: you can't compress the predicate constant the same way, and you can't do this kind of mapping from the query's predicate to the encoded value. So this is what I mean by exploiting the semantics — being able to leverage the fact that we can compress the constants in our predicates in the same way that the data is compressed and not have to do a bunch of extra work. All right, so now we're going to talk about all these different columnar compression schemes. I'll do my best to say where something is called different things in different systems, right? Sometimes it's called bitmap encoding or bit-vector encoding; delta encoding, value encoding. I'll try to tell you when they're called different things but mean the same thing. For the most part these are all examples of compression schemes that are used in the major data warehouse systems that are available today. We'll go through them one by one and look at some key examples. We'll spend most of our time talking about dictionary encoding because, in my opinion, that's the most important one and the one that's used most often. Okay, another thing I want to point out is that we've already talked about some of these ideas earlier when we talked about the OLAP indexes in SQL Server, right? Remember we talked about doing dictionary encoding, value encoding, and run-length encoding in those special column store indexes that SQL Server added. The key thing to understand about what we're going to talk about today versus what we talked about before is that the earlier stuff was an extra copy of the data, stored in a columnar format, which could then be compressed using the techniques we'll talk about here. This is also called the fractured mirrors approach: you have your original row data — the primary storage of the database is always in this row format — and then you make an extra copy of it as columns and apply whatever additional compression you want. In the paper you guys read, originally that was just a read-only copy of the database, but in the newer version they can switch to an entirely column store database. So a lot of the ideas are the same, but now we're talking about compressing the primary storage location of the data rather than a secondary copy, and therefore we need to be a little bit mindful about what we do when we actually update things, whereas in the SQL Server case it was completely read-only. But many of the same things apply, as we'll talk about later. So the first compression technique is called null suppression, and the basic idea is that within a single attribute — mostly in columns — you have values that are often null or missing or zero. Rather than storing each individual zero, you store something like a run-length encoding that says: I have 20 zeros starting at this point. That's the only information you need to store, rather than every single element. So this is really useful for tables that are really wide, where we have a lot of attributes and those attributes are usually empty or very sparse. Can anybody think of an example of an application that would have a database that looks like this? There's one canonical example that's always used for something like this. Log processing is a key one that does this.
The go-to use case for HDFS, the Hadoop file system, has always been: store your logs in there and then have your MapReduce job come through and process them. But you can actually store them in a relational database as well, and what you end up doing is, for every different event field that can occur in a log message, you have another attribute or column in your table. So if you have all these different sub-services running in your application and you're generating different types of logs, some log messages will have some attributes, some log messages will have other attributes, and so across the entire tuple most of the elements are going to be empty or null. That's when you would use null suppression, because then within a single column you wouldn't be storing all this junk data that you don't need at all. Yes. So the question is: if you use null suppression, does this mean your tuples are no longer fixed length? The way you would normally implement this — I should have drawn a diagram — is sort of like run-length encoding, where within a single column you would say: I have 20 nulls, followed by maybe a 1, followed by another 100 nulls. We're doing this as a DSM, in columns, so we don't need to worry about a single tuple being fixed length; it's the values within the column that matter. So in this case, yes, you wouldn't be able to jump straight to an offset and get exactly the value. You'd have to do a little extra work to figure out what the value at offset 1000 should be, and you can have bookmarks or ways to jump to a location in your null-suppressed data, to say: all right, I'm at offset 999, I don't have to scan everything before that, I know how to get to the 5,000th offset real quickly. That's a good question, though. All right, so related to null suppression is run-length encoding. We've already talked about this, so there's not much else to say, but basically the idea is that instead of storing every single value over and over again, you just store a triplet that says: at this offset I have this value, and it's repeated X number of times. And the way to get a really good compression ratio using run-length encoding is to sort the columns in such a way that you maximize the compression opportunities. So if you have a table that has a column for the sex of a student, and you sort the table based on whether it's male or female, then instead of having to store 1 million males and 1 million females individually, you just store 2 triplets, all right, and that gives amazing compression. The reason why I bring up run-length encoding again, even though we've already talked about it, is that in some of the later schemes we're going to be able to apply run-length encoding again to already compressed data. So we'll compress the data once using one scheme, and then we can use run-length encoding or another compression scheme to compress it even more, which is kind of cool. All right, we also talked about bitmap encoding, and again the basic idea is that for every unique value within an attribute we have a bitmap vector, and a particular position in the bitmap corresponds to an offset in our table; if the bit is 1 in a particular bitmap, then we know that the tuple has that value.
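Here's a tiny sketch of bitmap encoding for a low-cardinality column, using plain Python ints as the bit vectors (a real system would use packed, possibly compressed bitmaps). Bit position i in each bitmap says whether tuple i has that value.

```python
values = ["M", "F", "M", "M", "F", "M"]        # low-cardinality column

# One bitmap per distinct value; bit i set => tuple i has that value.
bitmaps = {}
for i, v in enumerate(values):
    bitmaps[v] = bitmaps.get(v, 0) | (1 << i)

# WHERE sex = 'F' becomes a bitmap lookup plus a bit scan,
# with no per-tuple string comparisons.
matches = bitmaps["F"]
print([i for i in range(len(values)) if matches & (1 << i)])   # [1, 4]
```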
And again, we talked about how this only really works if the cardinality of the different values for the attribute is low. If I have a million different values for a million tuples, then I'd have a million-by-a-million matrix for my bitmaps, and that would take more space than actually storing the raw data. So you obviously have to be careful, when you choose to use bitmap compression, that you're not wasting space. Okay, so now we can talk about some new schemes. The first one is called delta encoding, and the basic idea is that within a particular column or attribute we're going to store, for each value, the difference from the value that preceded it. Then we only need to store that delta and not the original value. So in this example I have a table where, say, we're taking measurements from a sensor: we have a time and we have a temperature. The time goes up by one minute every time we take a new measurement, and we store the temperature alongside it. So we're storing 12:00, 12:01, 12:02, 12:03, and so on in the time column. And likewise, for the most part our temperature is not going to vary greatly from one measurement to the next, right? Because otherwise a bomb went off or something significant happened. Most of the time it's not going to be like that, so the numbers will be pretty close to each other. So what we'll do is take the first value in our attribute and store it in its original form, but then all the subsequent values are stored as the difference from the one that came before. So in this case, 12:01 is just one minute later, so we store plus one, and then 12:02 is one minute later than 12:01, so we store plus one there. Same thing for the temperature: we store the original value and then we go minus one, plus one, plus one, minus two, right? So now, instead of having to store maybe a 64-bit timestamp and a 64-bit float or whatever this is, we can store each delta as something like an 8-bit integer, right? Now obviously, if we're at some particular offset and want to know what the value is, we have to start from the base value and decompress by adding up the deltas all the way down, right? Alternatively, instead of plus one, plus one, plus one, you could store the offset from the base value — here's how to get from the base directly to this value — rather than doing the daisy-chain thing of adding up every delta that came before. The reason why you usually don't do that is because most of the time in an OLAP database you don't care about the exact value for one entry, right? You're going to do the scan anyway, so you might as well just go across and add one as you go along, and save the space — because what happens when the offset is now plus 10,000, because you have 10,000 entries? That can't be stored in an 8-bit integer anymore, right? So they typically always store the delta from the preceding value rather than a growing offset from the base. The next thing to point out is that for this delta column here, we have nothing but plus ones over and over again, right? Our time is always increasing, so it's always going to be nothing but plus ones. So this is where we can apply run-length encoding again, to store just the pair that says: the value is plus one, and it's repeated four times. We still keep the temperature in its original delta-encoded form.
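Here's a minimal sketch of that delta-plus-RLE combination, using the numbers from the example (times as integers like 1200 for 12:00 just for illustration): store the first value raw, store deltas for the rest, then run-length encode the deltas — which works great for the time column, where every delta is +1.

```python
def delta_encode(values):
    # First value stays raw; everything after is the difference from its neighbor.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def rle(values):
    # Collapse repeated values into [value, run_length] pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

times = [1200, 1201, 1202, 1203, 1204]         # 12:00, 12:01, ...
temps = [99, 98, 99, 100, 98]

deltas = delta_encode(times)                   # [1200, 1, 1, 1, 1]
print(rle(deltas[1:]))                         # [[1, 4]]  -> "+1, repeated 4 times"
print(delta_encode(temps))                     # [99, -1, 1, 1, -2]
```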
So this is an example of combining different compression schemes to get even better compression than you would get with either one by itself. Yes? Because OLAP queries scan the entire table anyway — what about OLTP, do you not do any encoding there? Yeah, we'll talk about it at the end. You typically do not do this kind of compression on tuples in an OLTP database, because if I'm running a transaction and I've got to jump to the 1,000th entry, now I've got to scan all this stuff and reconstruct the value, and I'm doing that while I'm holding locks and latches. That's terrible, right? So you don't do this kind of encoding, or you don't do any encoding? You pretty much don't do any encoding. Though — the one they mentioned in the paper, what was it, HyPer? They do dictionary encoding? Yeah. I mean, did you read the paper closely? I think so, yeah. So when they talk about updating the dictionary, that's a big problem. Yeah. The way — and we'll talk about this later in the semester — the way HANA does this is that all new data always first goes into an uncompressed row store, and then over time as it gets colder you can flip it to a column store, and then you can flip it even further and compress it. Vertica is not a hybrid system, it's purely an OLAP system, but they have an uncompressed, write-optimized store where all new updates go first, and then there's a background process that merges and compresses that with the regular data. When you put new data in, you always put it in uncompressed, right? Because updating the dictionary, like they did in that paper, is a big pain in the ass, right? Yes? So the run-length encoding deduplicates the values themselves. Sorry — I didn't hear what she said. So the run-length encoding on the original data deduplicates the values themselves, and the run-length encoding on the compressed data deduplicates, like, the derivative? Correct, yes. So the question is: in this example here, I'm doing run-length encoding on the already delta-encoded data, right? You could do run-length encoding on the original data as well. So let's say instead of 12:00, 12:01, 12:02, I had five 12:00s from different sensors or whatever — I could do run-length encoding on that. Well, they would all be zeros, right, because the deltas would be zero. Yes — but still, you could do run-length encoding on that. But is that actually a pattern that occurs — values that are really similar to each other over and over again? Oh, absolutely, yeah. Sensor data is the classic example: every second, everything takes a new measurement, and if you have a thousand sensors, they're all going to be about the same. So the question is, does this pattern occur often? And the answer is yes, right? Logs are another one, right? Yeah, that's a good point. Any questions? Okay, an extension, or a different variation, of delta encoding is called incremental encoding. The basic idea here — you'd use this for things like varchars and strings — is that you identify when you have common prefixes across your different attribute values. Then you store how many characters of the previous entry you can reuse for the current entry, and you only store the suffix part that's actually different, right? So let's say here I have four entries, four values: nab, nabbed, nabbing, and nabbit.
So the first thing I'm going to do is compute the common prefix from one entry to the next. In the very beginning, for this first value, nab, there's nothing that came before it, so I just keep a marker that says this is my base value. Then I go to the next one, nabbed, and I ask: what part of nab can be found at the beginning of nabbed? It's obviously the first three characters, so I store that as my common prefix length. And you do this down the line for each value against the one above it. So then when I compress the data, I store two pieces of information. The first is the prefix length, which says how many characters from the entry that came before me I can reuse. The second part is the suffix — the unique portion of the value that wasn't found in my predecessor. So in this case, the first element is the base value: we have zero and then the full value. And for nabbed, we reuse the first three characters, and then we have "bed" as the suffix, right? And this only works if the values are sorted. We'll see when we talk about dictionary encoding that inside the dictionary we can actually apply this incremental encoding to get an even better compression ratio as well. And again, this has the same challenge that delta encoding does: if I want to get the exact value for one element, I've got to know how to reverse all the entries that came before it. So this is primarily used for OLAP stuff. All right, so another technique is called mostly encoding, and this comes from Amazon Redshift. This is not really anything special — maybe instead of encoding it should be called mostly compression. The basic idea is that if you have an attribute in your table where most of the values are going to be much smaller than the maximum value for the declared type, then you can store them with less space, and anything that deviates — anything that's actually bigger than what the compressed size can hold — you store in a special table off to the side. So let's say my original table has a column that's declared as a 32-bit integer, and I have, say, five values. Most of them don't need anywhere near the full 32 bits, but I have this one value here at 10,000 that's much larger. So I can define this column to be mostly8, meaning most of the values for this attribute will fit in eight bits, and I can store them all as tiny ints. Then for the one value that exceeds the maximum for an 8-bit integer, I store it in a side table where I keep track of the offset and what the original value was. So when I do a sequential scan and come across this marker, or whatever it is, I know I need to go look it up in the side table. So this doesn't do anything special in terms of encoding compared to the schemes we've seen; it's basically saying: I recognize that most of the time I don't need the full 32 bits to store these values, so I'll store them as a smaller type. Obviously this sucks if most of your values exceed the smaller type and you end up always having to look in the side table — then your performance is going to get worse. I don't know how many other database systems do this, but it's something that's used in Amazon's Redshift, and I think it's kind of cool.
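Here's a rough sketch of that "mostly" idea — not Redshift's actual implementation, just the mechanism: store the column as one-byte values and keep the handful of oversized values in a side table keyed by offset. The sentinel value and layout are made up for illustration.

```python
column = [2, 7, 99, 10000, 4]                  # declared as a 32-bit int column

SENTINEL = 255                                 # marker meaning "look in the side table"
packed, exceptions = [], {}
for offset, v in enumerate(column):
    if 0 <= v < SENTINEL:
        packed.append(v)                       # fits in 8 bits ("mostly8")
    else:
        packed.append(SENTINEL)
        exceptions[offset] = v                 # oversized value kept off to the side

def read(offset):
    v = packed[offset]
    return exceptions[offset] if v == SENTINEL else v

print(packed)           # [2, 7, 99, 255, 4]
print(read(3))          # 10000
```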
All right, so now let's talk about dictionary compression. So again, we've already talked about dictionary encoding and dictionary compression when we talked about SQL Server's column store indexes, but now we need to go a little deeper and actually understand how you implement it. The main thing we avoided last time is how you actually do the encoding and decoding and how you represent everything. So as I said before, I think dictionary encoding is the most important, and it's definitely the most pervasive, compression scheme used in OLAP database systems. As far as I know, every major commercial column store database system supports dictionary compression. So the key thing we need to be able to do is support fast encoding and decoding: taking the original value and compressing it to the encoded value, and decompressing it back. And we also need to support range queries. This is something we didn't really talk about before, but if we implement our dictionary as a hash table, we're not going to be able to do range queries and other similar operations like prefix searches, and then we're no better off than having to do a sequential scan and decompress everything all the time. So it's very important that we're able to do this. Some of the questions we want to consider as we go along are: when do we construct the dictionary — when do we actually look at the values and compute our dictionary table? What is the scope of the dictionary going to be — meaning, sort of the same thing we talked about with granularity at the beginning, how much data should be covered by a single dictionary? Then, how are we going to allow for range queries, and how can we enable the fast encoding and decoding that we need? All right, so the first issue is when we actually construct the dictionary. The easiest way to do this is to just compute the dictionary for the entire segment of data all at the same time. You can think of it like this: if you bulk load a bunch of data into your database, you stop the world, scan everything, and compute your dictionary. Then any time new data comes in, you don't want to update the existing dictionary — you either put the new tuples in a separate dictionary that's distinct but still has codes that don't collide with the first one, or you have to re-scan the entire data set and re-compute the entire dictionary all over again. The second approach is to support what I'll call incremental updates to the dictionary. This is where, as new updates come in, we add them to our dictionary on the fly, which may cause the encoded values — the compressed values — of our data to change, and therefore we have to go back and update the tuples to change their old codes to the new ones based on our new compression scheme. So no one way is strictly better than the other; they all have trade-offs. I actually don't know what the different database systems use — I think Vertica does the first one, and I don't know about Redshift and the others. The issue, again, is: do you pay the penalty of spending a lot of computation to re-compute the entire dictionary, or do you do it on the fly and update codes you've already assigned before? Related to the granularity stuff at the beginning, we can also have different scopes for our dictionary.
Right, we can have a dictionary for an individual block, we can have it for all the tuples within a table, or across a single attribute. But what's actually kind of cool with dictionary encoding is that you can also have a dictionary that corresponds to multiple tables. Let's say you have a foreign key reference from one table to another. Those two columns are always going to have the same set of values, right, because the foreign key child cannot have a value that doesn't map to the parent. So rather than having two separate dictionaries for these two attributes in two separate tables, they can share a single dictionary, with reduced overhead and reduced space. That's also going to help us speed up joins and set operations, like unions and things like that, in a way you would not get with other dictionary schemes. Dictionary encoding is also kind of cool because it allows you to do multi-attribute encoding. So let's say I have a table that has two columns, value one and value two. Rather than doing dictionary encoding on each column separately, one by one, I can compute a single dictionary that combines the values of both columns and then store a single column that has the compressed value for both of them. So I have one dictionary that covers two different columns. The reason this makes sense — this is kind of cool — is that if you have values that are correlated with each other, where they always occur in roughly the same combinations, then you can use way less space to represent both of them. The issue is that I don't actually know of any database system that implements this approach: it shows up in the literature, it shows up in the papers, but as far as I know nobody actually does it, because it's hard to identify automatically when you have these correlations, when you actually can do this. Your DBA could spend a lot of time trying to figure this out, but it's kind of a pain in the ass. You can approximate it by concatenating columns together, but that's changing the application schema. So again, this is kind of cool, but nobody actually does it. All right. So our dictionary needs to support encode and decode. It's pretty straightforward: encode takes the original value and produces the compressed value, decode does the reverse. The key thing to understand is that we're going to have to use some kind of data structure to do these operations. There's no magic hash function available to us that can do this two-way encode and decode. You might think, oh, you could use an encryption function to do this kind of encoding, but that would be computationally expensive, and it's likely that the encrypted version of the thing you pass through the function would be larger than the original value. So we need a succinct way of representing this, and we're going to have to use two data structures, one for each direction. The key thing, though, is that we want our dictionary of encoded values to preserve the sort order of the original values. What I mean by that is: if we sort the original values in a certain way, like lexicographical ordering, we want the encoded values to end up with the same sort order. This is going to be important for the range and prefix queries that we want to support. So let's look at an example here.
Let's say I have a table that has one column, name, with four values: Trump, Joy, Andy, and Truman. What I want is that when I build my dictionary, I sort it based on the original values, and the values get codes that correspond to their sort order. So even though the data over here is not stored in that order — this is the dictionary, this is the real data — the encoding still preserves the order that I specified in the dictionary. So now if I have a query like this — select star from users where name like 'TRU' and then the wildcard — if I'm not preserving the order of the original data, I'd have to do a sequential scan across every single element here and decompress them one by one to execute this query. But because I'm preserving the order, what I can do instead is first go through the dictionary and, using something like a binary search, find the upper bound and lower bound of the value I'm looking for in my query and find their codes, and then I can rewrite the query so that, instead of the LIKE, it says name between 30 and 40 on the encoded values. Then I can do my scan over here and find just the tuples I need without having to decompress anything. Is this clear? Now, this only works for this particular example because the wildcard is at the end; if it's at the beginning or in the middle, then you've got to do more work in the dictionary to find the thing you're looking for. So let's look at some other queries that are similar to this, using this compressed data scheme, and see whether we can operate directly on the dictionary or whether we still need to access the compressed data. Our first query is: select name from users where name like 'TRU' and the wildcard. The difference here is that instead of select star, now I'm only getting the name attribute. So my question is: can I answer this directly from the dictionary, or do I need to go look at the compressed column as well? So what happens for this query? What do we need to do? What did I just say? We have TRU followed by a wildcard. So we go find our upper bound and lower bound in here — and this is the dictionary, this is not the real data yet. So now I have 30 and 40. Do I need to go scan the data as well? Why? Exactly — I do. I still have to do the sequential scan, because I don't know how many Trumps there are. Hopefully only one, but whatever. So related to this, now I have select distinct name. Can I just do this in the dictionary? Right, yes — we only have to access the dictionary, because we only care about whether a value exists, not how many times it occurs. Yes? You assume there are never deletes, or that if you have a delete, it's also removed from the dictionary? So the question is: are we assuming that there are no deletes, or that deletes are reflected in the dictionary? Right, the answer is yes. I'm not really going to talk about keeping the dictionary consistent here, but you can imagine that the same concurrency control stuff we talked about before applies to it as well, right? The same stuff you guys are doing in the Bw-Tree to make sure you have a consistent index — we'd have to do that here. In the paper you guys read, the way they get around this is they lock the entire table any time you do any modifications on it.
But you can imagine there are more sophisticated things you could do to make this work. Again, we're coming at this from the OLAP world, so we're doing bulk loads every five minutes, ten minutes, or once an hour or a day, and we can lock the entire thing while we do that, and then we can run all our queries in a lock-free manner. So this is different from the HTAP stuff, where we need to be able to update the database at the same time. This is why, again, they don't do any compression on the hot data that just got inserted — because then you'd have to lock all these auxiliary data structures and it's going to be slow. Whereas we know how to do a latch-free B+tree, we know how to do all that real fast now; but doing updates on these compressed OLAP data structures is difficult. There was a question in the back? No, okay. All right, so again, what I'm pointing out here is that you have all this extra information sitting in the dictionary, and the database system can exploit it and avoid having to look at the entire column for certain types of queries. And in the same way, if we rewrite our query to operate directly on the encoded values, we don't have to decompress everything one by one, right? We find only the tuples we need based on the encoded values, and then, when we produce the final answer to the terminal or to the application, to the client, then we actually do the decompression. We delay the decompression for as long as possible. All right, so how are we going to actually implement our hash table — or sorry, our dictionary? We could use a hash table. The hash table is kind of nice because it's fast and it's compact, right, we do O(1) lookups. The problem with a hash table, though, is that you lose the order-preserving guarantee that we want: if we hash the values, then unless we maintain additional metadata about the sort order of the keys, we have no way to do the range queries that we want. With a B+tree or any kind of tree index, it's going to be slower than a hash table and it's definitely going to use more memory, but it's going to satisfy the range and prefix queries that we need. And essentially that's the decision we make: we choose the tree over the hash table. This is what they used in the paper. But the thing the guys in the paper did that I think is very novel is that, rather than having two completely independent data structures for encoding and decoding, they let the two trees share their leaves, where most of the data is stored anyway, and that way you can do lookups in either direction. So this is an example of what they were talking about: at the top we have our encode index, and at the bottom we have the decode index. Again, this could be any tree structure you want — a Bw-Tree, a B+tree, it doesn't matter. And then in the middle here we have our sorted shared leaves. So in this case the values are sorted in, like, lexicographical order, and they have codes assigned to them, and these are the codes that are stored in the actual column in the table heap. So now an original value comes in: we traverse the encode index, we come to this leaf, we do binary search directly inside of it to find the code that we want, and then we spit that out to whatever part of the query engine needs it. Same thing going the other way with an encoded value, right?
When we come down here, since the codes are also sorted — because they correspond to the sort order of the values — we just do binary search again, find the value that we want, and produce it out to the query engine, all right? So the key thing to point out here is that organizing the data into leaves potentially makes it easy to do incremental updates without having to modify the codes for everything else in the index. Let's say I set up my code values in ranges: this leaf handles codes from 10 to 100, this one handles 900 to 999. Then I can insert a new value here, which changes the sort order for the values in this leaf and may cause them to get new codes; but when I go to update the actual columns with the new compressed values, I don't have to redo everything over here, I only have to redo the ones in this leaf, right? So that's what the paper talks about: how you handle all this in an efficient manner. And they don't really say that one way is better than another; they just say, if you actually need to implement a shared-leaf dual index for encoding and decoding, here are the different things you have to consider. So another thing to point out: we have these leaves with these value-to-code mappings, and they're already sorted for us — obviously they have to be, because that's how we preserve the ordering. So what compression scheme that we've already talked about could we use for these shared leaves? What's that? What was the prefix encoding scheme? What's that? No — close. Incremental encoding, right? Because these are all strings, we can just store the prefix shared with the previous entry and then the suffix that's different. So not only do we have a compressed column with our dictionary-encoded values, but because this dictionary has to sit in memory too, we want to compress it as much as possible as well, right? And we'll talk in a second about how you'd actually compress this part. So, any questions about dictionary encoding? It's the most common one; handling updates on the fly is difficult, and there are ways to handle it, but you're better off doing it in bulk, and you may have to go back and update your existing codes. Yes. The question is, is this only beneficial for variable-length fields? No — why do you say that? Or maybe not variable-length fields, but varchars or things like that. The statement is: this is only useful for varchars or blobs. No, right? Imagine you have an attribute that stores 32-bit numbers, and I have a billion tuples but they only contain three distinct values. Dictionary encoding would compress that really well too. You can kind of think of dictionary encoding like an enum in some ways, but it's handled dynamically for you by the database system rather than at the application level. But the order-preserving property you get out of it is less beneficial for those sorts of things. The statement is: the order preservation you get from dictionary encoding is less beneficial for integers than for the example I did. Sure — but I'm sure I could sit and think of an example where you do want dictionary encoding on integers and you do need the order-preserving property. And I'll tell you, in IBM DB2 BLU and SAP HANA, they dictionary-encode everything. It doesn't matter what the attribute is, they use dictionary encoding for it, right? They're very aggressive about it, and they still get good space savings. So, all right.
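Here's a compact sketch pulling the dictionary ideas together: codes are assigned in sorted order of the values, so a prefix predicate like name LIKE 'TRU%' can be rewritten as a range over codes and evaluated without decoding the column. A real system would use the tree structures (and the shared-leaf trick) instead of these Python lists, and the "\uffff" trick is just a crude upper bound for the prefix — this is only the logic.

```python
import bisect

values = ["Andy", "Joy", "Truman", "Trump"]            # sorted original values
code_of = {v: i * 10 for i, v in enumerate(values)}    # order-preserving codes
value_of = {c: v for v, c in code_of.items()}

column = [code_of["Trump"], code_of["Joy"], code_of["Andy"], code_of["Truman"]]

# WHERE name LIKE 'Tru%'  ->  WHERE code BETWEEN lo_code AND hi_code
prefix = "Tru"
lo = bisect.bisect_left(values, prefix)
hi = bisect.bisect_right(values, prefix + "\uffff")    # crude end of the prefix range
lo_code, hi_code = lo * 10, (hi - 1) * 10

matches = [c for c in column if lo_code <= c <= hi_code]   # scan on codes only
print([value_of[c] for c in matches])                      # ['Trump', 'Truman']
```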
So now let's switch gears for the remaining time we have. Everything I just talked about, as I said, sort of relates to his question: it's awesome, but you can't really use it for OLTP, because the cost of updating these things and maintaining them as the database changes is going to be really bad. And we want to support fast access and fast modifications in the middle of a transaction. If we're holding locks in our transaction and we have to go recompute the dictionary, that's terrible, because that's basically stopping the world. So we need to find another source of data that we can compress to reduce the storage overhead of keeping our database in memory.

So it turns out that in OLTP databases, indexes consume a sizable portion of the data that's stored in memory. These are actually measurements we did in our in-memory database, H-Store, for three different OLTP benchmarks. The first column is the percentage of memory used for storing tuples, and then the percentage of memory used for primary indexes and secondary indexes. If you add these numbers up, you can see that anywhere from about 35% to almost 60% of the memory is used just for the indexes. The Articles workload is sort of modeled after Reddit, so it's people posting comments on posts. Those tuples have large text fields, which is why the index percentage is much lower. In the other data sets you're storing order-processing information, or online polling, and things like that, and so that's why those index percentages are higher.

So it makes sense to figure out how we can compress indexes, and this is something we've been looking into. What we've been working on is something called hybrid indexes. The basic idea of a hybrid index is that we're going to take a single logical index. What I mean by logical is that when you call CREATE INDEX, or you declare a primary key, that's a logical index. And then we're going to maintain that logical index using two different physical data structures underneath the covers. So from your point of view as the application programmer writing SQL queries, you see one index, but underneath the covers we really have two. And these two data structures will have different characteristics that exploit how the data is actually being used, and that allows us to compress one and not the other.

So we'll split these data structures up into what we'll call stages. We'll have the dynamic stage, where all new data first comes in when we store it, and it allows us to do the fast updates that we've talked about before. This could be a Bw-Tree, a B+tree, a skip list, it doesn't matter. And then we'll have a static stage, where we migrate data as it gets cold from the dynamic stage, and we use a compressed, read-only data structure that takes significantly less space to store than the dynamic side. So to show a diagram of what this looks like: let's say we have our dynamic index, and that's going to be small, and the static index is going to be something that's very big but compressed. All inserts, updates, and deletes always have to hit that dynamic index. Even if we're doing an update on something that's stored in the static index, we'll always mark it in the dynamic index.
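Here's a rough C++ sketch of the two-stage idea, under simplifying assumptions: the dynamic stage is just a std::map, the static stage is a sorted array rebuilt at merge time, and a small Bloom filter over the dynamic keys lets point reads skip the dynamic side when it can't possibly contain the key. The class name, the 64-bit keys, and the merge policy (newest value wins) are all illustrative, not the actual design from the work described here.

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <iterator>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

class HybridIndex {
 public:
  // All writes (inserts, updates) go to the dynamic stage.
  void Put(uint64_t key, std::string value) {
    dynamic_[key] = std::move(value);
    AddToBloom(key);
  }

  std::optional<std::string> Get(uint64_t key) const {
    // A Bloom filter gives false positives but never false negatives, so a
    // negative answer means we can go straight to the static stage.
    if (MaybeInDynamic(key)) {
      auto it = dynamic_.find(key);
      if (it != dynamic_.end()) return it->second;
    }
    auto it = std::lower_bound(
        static_.begin(), static_.end(), key,
        [](const auto &entry, uint64_t k) { return entry.first < k; });
    if (it != static_.end() && it->first == key) return it->second;
    return std::nullopt;
  }

  // Merge cold data into a freshly rebuilt, read-only static stage.
  void Merge() {
    std::map<uint64_t, std::string> merged;
    for (auto &kv : static_) merged[kv.first] = std::move(kv.second);
    for (auto &kv : dynamic_) merged[kv.first] = std::move(kv.second);  // newer wins
    static_.assign(std::make_move_iterator(merged.begin()),
                   std::make_move_iterator(merged.end()));
    dynamic_.clear();
    bloom_.reset();
  }

 private:
  static constexpr size_t kBits = 1 << 16;

  void AddToBloom(uint64_t key) {
    bloom_.set(std::hash<uint64_t>{}(key) % kBits);
    bloom_.set(std::hash<uint64_t>{}(key ^ 0x9e3779b97f4a7c15ULL) % kBits);
  }
  bool MaybeInDynamic(uint64_t key) const {
    return bloom_.test(std::hash<uint64_t>{}(key) % kBits) &&
           bloom_.test(std::hash<uint64_t>{}(key ^ 0x9e3779b97f4a7c15ULL) % kBits);
  }

  std::map<uint64_t, std::string> dynamic_;               // small, write-optimized
  std::vector<std::pair<uint64_t, std::string>> static_;  // big, compact, read-only
  std::bitset<kBits> bloom_;                              // over dynamic keys only
};

int main() {
  HybridIndex idx;
  idx.Put(42, "tuple-42");
  idx.Merge();            // key 42 now lives in the read-only static stage
  idx.Put(7, "tuple-7");  // new hot data stays in the dynamic stage
  std::cout << idx.Get(42).value_or("miss") << "\n";  // served from static
  std::cout << idx.Get(7).value_or("miss") << "\n";   // served from dynamic
}
```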
Likewise, if we do a delete, we just record in the dynamic index that we deleted the thing, and we don't update the static index until later on when we do the merge. Because over time, what will happen is that when the dynamic side gets full and the data gets cold, we'll move things over and rebuild the static side. So now when we do a read, we always first hit a Bloom filter that tells us whether the element we're looking for is in the dynamic side or not. The Bloom filter can give you false positives, but not false negatives. So it might tell us that, yes, it's in here, and therefore we should go look; if that was a false positive, then we go check the static side. If it comes back and says it's not in there, then we just push our read over to the static side directly. It's a little more tricky than what I'm describing here, because it matters whether this is a unique index or a non-unique index, and it matters whether the read is a scan versus a point query. If you do a scan, you actually have to scan both of them at the same time. So you're paying a little computational overhead to maintain these two separate indexes, but it allows us to get significant space savings.

So let me show you how we actually generate a compact index. We basically have a formula of different techniques that you can apply to any kind of data structure you want: a skip list, the radix tree from HyPer, a B+tree, possibly the Bw-Tree, though we haven't tried that one yet. There's a bunch of rules that we're going to apply to these data structures to allow us to compress them and reduce the amount of space they're using. So let's say we have our B+tree, and we just load a bunch of data in and it gets laid out like this. The first thing to point out is that there are all these empty slots within the nodes of the B+tree, because a B+tree node has to be at least half full, but on average it's only about 69% full. So we want to get rid of that, because it's just wasted space. We normally have to allocate that space because we never know when we're going to have to store something in it. But because we're static, because we're read-only, we know that nothing is ever going to be stored in there, so we can go ahead and get rid of it and compact everything down. That's the big space savings.

The next piece is that we have all these pointers. In a B+tree we have pointers to get back and forth along the leaf nodes, and then we have pointers from the root node to get down to the leaves. But again, we're static. We're never going to update this thing, and we're never going to change the location of these nodes in memory. So instead of storing pointers, we'll just combine all the leaves into one giant array, and then we'll keep track of where the different levels are in the tree. And we know, for a particular offset in one of the arrays, how to compute the location to jump to in the lower levels. So here we've gotten rid of the pointers, and we've gotten rid of the empty slots. And now you see why this is read-only: if I have to insert, say, element 20 here, that's going to mess everything up, because I'd have to reallocate things and shuffle things around. So we're going to do bulk loading of data when we create the static side. So to give you an idea of what kind of space savings we can get, here we compare the original B+tree, the STX B+tree, against our hybrid B+tree.
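Here's a tiny C++ sketch of the flattening idea, under made-up names: keys are bulk-loaded into one packed array with no empty slots, and a one-level "inner" array of separator keys stands in for the pointer-based internal nodes, so finding a leaf is just offset arithmetic plus binary search. A real compact index would have multiple levels and store values, but the principle is the same.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

class CompactStaticIndex {
 public:
  // Bulk-load from sorted keys; the structure is read-only afterwards.
  explicit CompactStaticIndex(std::vector<uint64_t> sorted_keys)
      : leaves_(std::move(sorted_keys)) {
    for (size_t i = 0; i < leaves_.size(); i += kLeafSize) {
      inner_.push_back(leaves_[i]);  // first key of each packed "leaf" chunk
    }
  }

  bool Contains(uint64_t key) const {
    if (inner_.empty()) return false;
    // Find the leaf chunk whose separator key is <= key.
    auto it = std::upper_bound(inner_.begin(), inner_.end(), key);
    size_t chunk = (it - inner_.begin()) - (it == inner_.begin() ? 0 : 1);
    // No child pointers: the chunk's position in the packed array is enough.
    size_t begin = chunk * kLeafSize;
    size_t end = std::min(begin + kLeafSize, leaves_.size());
    return std::binary_search(leaves_.begin() + begin, leaves_.begin() + end, key);
  }

 private:
  static constexpr size_t kLeafSize = 4;  // tiny fanout just for the example
  std::vector<uint64_t> inner_;   // one separator key per leaf chunk
  std::vector<uint64_t> leaves_;  // all keys packed contiguously, 100% full
};

int main() {
  CompactStaticIndex idx({2, 3, 5, 7, 11, 13, 17, 19, 23, 29});
  std::cout << idx.Contains(13) << " " << idx.Contains(20) << "\n";  // prints: 1 0
}
```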
For the hybrid, we have the STX B+tree on the dynamic side and then our compact B+tree on the static side. And we're going to run a simple workload of 50% reads and 50% writes on a data set with 50 million entries. The three data sets we're using are random integers, monotonically increasing integers, and then the email list from the Ashley Madison website, which is available online. So what you basically see here is that this is not a comparison of the static side versus the dynamic side; the hybrid has both. And this is why the performance is better for the hybrid index: our dynamic index is significantly smaller, and traversing the compact B+tree is much faster than traversing the dynamic version. We can also see that the size of the hybrid index is significantly smaller than the original. In this case we get back almost a gigabyte, and it's half the size for the monotonically increasing integers. So we're not paying any performance penalty for this particular workload, other workloads are different, but we're getting our space savings. Now think of a large OLTP application that has a lot of indexes. If you're saving a gigabyte per index, that's a lot of tuples, that's a lot of data. And you have no performance impact, because again, it works just like any other OLTP index: you always hit the dynamic side, and then as things get colder, you migrate them to the static side.

So any questions about this? Again, this is not published yet; this is stuff we're putting out later this year. It's currently only implemented in H-Store, and eventually we hope to get this into Peloton later in the year. OK. So what are my parting thoughts? Again, I can't stress this enough: dictionary encoding is probably the most important compression scheme, because it doesn't require any pre-sorting like you have to do for run-length encoding and other schemes, and you can still do random access directly on single tuples if you want to. How you actually handle updates, though, is tricky, and this is why most database systems do this in bulk, or do it offline or on the side, and don't try to update the dictionary on the fly. We also saw how the database system can combine different approaches, applying one compression scheme after another, and get a better compression ratio with minor overhead. And then we also talked about how it's really important to wait as long as possible in the query plan to actually decompress the data. Ideally, you want to wait until the very end, when you actually have to produce the answer to give back to the application. If you can wait that long, then you're in pretty good shape.

So, any questions about compression? OK. So next class on Monday, I'll spend some time talking about query planning and query optimization, and then we'll spend a little bit of time at the end just giving some general tips for you guys as you get started on Project 3 about how to work on a large code base. I can speak from experience because I worked on Condor when I was at Wisconsin, I've worked on H-Store, I've worked on Postgres and other things like that. So I'm not saying I'm an expert at this, but I can give you my own tips of things I've learned over the years to help guide you along. OK? All right, guys, we're done. See you next class.