All right, so today we're again talking about the storage layer of the database system. Today's class is going to focus on how we can actually compress the database itself, the data we're dealing with. Last class was all about different methods to avoid reading data when we do our scans: the filters, like zone maps, or using bitmap indexes. So today's class is really about, okay, if I do have to read the data, what can I do to keep it in a smaller, more compact form as we work with it, to accelerate things? The reason this is going to work is that we can take advantage of the nature of what real data looks like and see how it's amenable to compression. There are two basic properties we can exploit. The first is that data sets are going to be highly skewed, meaning we're going to see the same values over and over and over again in our columns, and that's a ripe target for compression. The classic example is the Brown corpus. In the 1960s at Brown University, some researchers assembled a collection of what they thought was emblematic of English-language text, and they literally just counted the occurrence of each word across all these different documents. What was the most common word? "The." What's the second most common word? "A." But what they found was that the second most common word appeared half as many times as the first, and the third appeared half as often as the second. So there's this Zipfian distribution, and real data looks like that. We're going to be able to exploit that when we do compression. We're also going to see highly correlated values, where if we know the value in one column, it's very likely that the value in another column falls within some small range. The classic example would be zip code and city: if I know your zip code, then I know with high probability what city you're in. So real data is always going to have this kind of skew, and this is why we're going to get good results with compression. It follows that if we can compress the data, we can reduce its size and increase the amount of useful information we're able to extract from the data as we read it. The reason we want to do this is that traditionally, disk or the network is going to be the main bottleneck when we execute queries. So for every megabyte or gigabyte of data we read from disk, if we can extract more tuples out of it than we would if it were uncompressed, that's a huge win for us. Now, I say traditionally because disks and networks are getting very fast, and CPUs less so. So in some cases, with modern compression algorithms and modern hardware, you actually don't want to just do blind, naive compression over blocks of data. We still want the native database compression that we're talking about in this class, but the advantage you'd always get just by running gzip or Snappy on a block of data sitting on disk isn't always there anymore.
So we're going to see techniques where we can still get the benefits of compression, but then actually speed up query execution by operating directly on compressed data. That's where we're going today. That's why I say traditionally; it isn't always the case anymore. But again, we just want to get more bang for the buck for everything we read. And there's going to be a key tradeoff between speed and compression ratio: there may be cases where we can get amazing compression and really reduce the size of the data, but the computational overhead of compressing and decompressing it makes it a bad choice. We'll see this when we talk about naive compression: there are algorithms like Facebook's Zstandard that maybe don't get as good compression as something like gzip, but they're computationally way faster, and therefore that's the right tradeoff in a database system. And again, if we can operate directly on compressed data, then we can reduce the amount of DRAM and potentially the CPU overhead; we get a lot of benefits from this. All right, so today we're going to cover some basic background on database compression. We'll then talk about a naive page compression scheme, which I think we covered in the introduction course; last semester in the intro course was the first time I talked about database compression, so some of this will be redundant, but we'll get to more advanced things as we go along. So we'll talk about basic naive compression, then we'll talk about native columnar compression; this is how we're going to be able to operate directly on compressed data, because the system does the encoding natively and knows what's there. And then we'll finish up talking about intermediate data: from one operator to the next, what can you compress? The answer is basically nothing, but we'll see why. All right, so I've already said this: the goal of database compression is to reduce the size of the data we're actually storing, the tuples themselves, so that for some amount of data we read, either from disk or the network, the number of values we extract increases per unit of computation or I/O. There are three goals we need to hit in any compression scheme we use. The first is that we must produce fixed-length values. Because we're assuming we're storing things in a column-oriented format, everything has to be fixed-length so that we can do simple arithmetic to jump to a given offset: I want the hundredth tuple in this column and the next column, and I know how to do the simple math to get exactly where I need to be. Second, we want a lossless scheme; we'll explain what that means on the next slide, but basically, whatever data goes in, we need to get the same data out, because otherwise people get pissed, or the database gets corrupted. And then, ideally, though it's not always going to be the case, we want to postpone having to decompress the data for as long as possible during query execution. That means we want to be able, if possible, to operate directly on compressed data when we do our predicate evaluation: that way we can check whether something matches or not.
And if it does match, then maybe we go ahead and decompress it. Ideally, it would be nice to be able to carry compressed data all the way up through the query plan, but as I said, nobody supports that; nobody can do that, and we'll see why as we go along. So the first two are hard requirements. The last one is ideal but not always possible; beyond the scan operator, there's not much you can do. So lossy versus lossless is sort of obvious, and I think we touched on this a little last class, but all the compression schemes we're going to talk about have to be lossless, because, like I said, people don't like to lose data. What's an example of a lossy compression scheme? The famous ones: JPEG, MP3, MP4. The reason those work is that we humans are the ones consuming the audio or looking at the image, and we can't detect some bits being removed: we still see what appears to be the original image, or we still hear what we think is the original audio. A database is consumed by computer software, so it can't do that. If we start throwing away bits, people will potentially notice, or the system will return incorrect data. So any kind of lossy compression has to be done at the application level, because the database system doesn't know what's OK to throw away. A simple example of a lossy scheme would be: say I'm keeping track of the temperature in this room, and I don't care about knowing exactly what the temperature was at some given timestamp from five or six months ago. So I just compute the average temperature of the room in one-hour chunks. That's a lossy scheme, but it still gives me a sense of what the data should be. The database system can't decide that on its own; those kinds of operations have to be done at the application level. Now, last class we also talked about approximate query processing, where we sample data to produce approximate results. That sort of achieves the same thing, but in that case we still have the original data; we're just sampling it. So there are a bunch of design decisions we have to deal with when we want to add compression to our database system: what are we going to compress, how are we going to compress it, what is the scope of what we're compressing? We need to look at all of these, and the decisions we make will inform what the right compression scheme is when we get to columnar compression. So the first thing we have to deal with is the granularity of what we're trying to compress. The first choice is to do it at the block level; this would be like a row group, when we talked about PAX before: some chunk of tuples, not the entire table, but some subset of it. We could also compress within a single tuple itself; this would be for a row store: I take the entire tuple and run some compression scheme on just that tuple, and every single tuple can potentially be compressed in a different way. I can do compression within a single attribute: if I have a column with a really large text field or binary field, maybe I want to run gzip or Snappy on that individual value for every single tuple; Postgres does this in its TOAST tables. And then I can do it at the column level.
And this is what we're mostly going to be focused on: column level plus block level, where we take a chunk of a column and run our compression algorithm on that. Different systems do different things. All the column-store systems we're focusing on this semester do it at the block level, but within a column. MySQL can do tuple level... sorry, that's incorrect: MySQL does block level on a page. Postgres can do attribute level for really large varchars. OK. So the first scheme is what I'll call naive compression. The idea here is that it's opaque to the database system: it's some external algorithm, and the DBMS doesn't know how to interpret the bytes after they've been compressed. Think of it just like running gzip locally: it's some block of data, you run gzip on it, and the DBMS doesn't know what's inside of it; it just has the bytes that represent some amount of data. And as I said before, we don't want to use gzip, because even though gzip gets a great compression ratio (and LZMA/7z is probably even better), it's going to be super slow for a database. So instead, we're going to use this category of algorithms based on the LZO work that came out in the 90s. There are a bunch of variations of these, and they all make the tradeoff of a somewhat worse compression ratio in exchange for much faster encoding and decoding. Snappy came out of Google; it's pretty widely used in modern OLAP systems. Brotli, from Google, is meant for compressing web content like HTML documents, so it has some magic in there because it knows about the structure of an HTML document when compressing things. Oracle has their own version of compression called OZIP. What's interesting about it is that for a while, after they bought Sun Microsystems and were selling SPARC chips, they actually put OZIP instructions into the SPARC instruction set, so there was hardware-accelerated compression. But that doesn't exist anymore. Right now, the gold standard is Zstandard. This is considered the state of the art, from Facebook; there are newer versions of it, and it's gotten better since 2015, but it has the best performance-versus-compression-ratio tradeoff, and it's what everyone should use. So let's go see how you would do naive compression in MySQL, and we'll see the disadvantages of why we're not going to want to do this for our column stores. I'll also say that you can combine these different compression schemes together: I could do the dictionary compression we'll talk about in a second on a column, which generates a big PAX file, and then just run Zstandard on top of that. I don't have graphs to show here (I can post something on Piazza), but as disks get faster, you actually don't want to do that. You still want the native compression that the DBMS supports, but you don't want to run additional compression on top of it. You can't really compress an already-compressed file: try running gzip on an MP3 or MP4 and it's not going to make a big difference, but you still pay the CPU cost. So the way MySQL does this is that you declare that a table should be compressed via a table option in the engine options.
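Before getting into MySQL's page layout, here's a minimal sketch of what naive, opaque compression looks like from the engine's point of view. This is hypothetical illustration code, using Python's zlib purely as a stand-in for Snappy or Zstandard (those need third-party libraries); the point is that the engine hands over a serialized page and gets back bytes it cannot interpret without decompressing the whole thing:

```python
# Minimal sketch of naive block compression. zlib stands in for Snappy/Zstd.
# The DBMS hands an entire serialized page to the library and gets back
# opaque bytes; it cannot evaluate predicates on them without decompressing.
import zlib

# A fake "page" of tuples, serialized however the engine likes (made-up data).
page = b"".join(f"{i:08d},user{i % 100},yes\n".encode() for i in range(500))

compressed = zlib.compress(page, 6)      # this is what actually goes to disk
print(len(page), "->", len(compressed), "bytes")

# To evaluate any predicate, the engine must first undo the compression.
assert zlib.decompress(compressed) == page
```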
And then when everything's on disk, it's always going to be in a compressed format. The pages are going to be 1, 2, 4, or 8 kilobytes, and the idea is that they want the pages to fit neatly into fixed-size slots on disk, so you don't end up with a bunch of odd-sized holes. So they always round up to the next largest page size: if my compressed page is 1.5 kilobytes, it gets rounded up and padded out to 2 kilobytes. At the header of each page there's a fixed-length mod log, a place where you can record updates or insertions into the page without having to decompress it first; then when you go to consolidate and write the page out to disk, you apply the changes. Actually, I don't know whether they store the mod log on disk or only when the page is in memory, but either way it's a buffer space where you can apply changes without decompressing the page first. So when I need to access the page in memory, I always copy it in in its compressed format, and I try to do as much as I can with it while it's compressed, because there might be some header information I can use. But if I actually have to read the data, then I have to decompress it. The mod log can absorb any changes or updates, so all writes go into the mod log. At some point, if I need to know what's inside the page, I have to decompress it, and I still maintain the compressed version for bookkeeping reasons. But once I have it uncompressed, I can do whatever I want; if I don't modify it and need to save space, I can throw away the uncompressed copy and still keep the compressed one. The uncompressed page in MySQL defaults to 16 kilobytes, so in memory, uncompressed is always 16 kilobytes, but it can be some smaller size on disk. Yes? The question is: can you perform any operations on the compressed page as is? No, not really; if I need to read the value of a tuple, I have to decompress it. Is there any useful metadata? I think you can get the number of tuples that are inside of it. I'll also say: this is MySQL InnoDB, so these pages, assuming they're data pages, are actually leaf nodes of the B+tree, so you know something about the B+tree structure when you get here and can derive some things from that. But the main point is that I can do updates cheaply: updates go into the mod log, and reads happen on the uncompressed form. In theory, any time I do a read, like "does this value exist," I could look at the mod log without decompressing the page; if I find the thing I'm looking for there because someone updated it recently, I can return the result. Yes. So again, this is just one way to do this; it's what MySQL does. Postgres doesn't do this, except, again, TOAST is a separate thing. So there are some obvious disadvantages. We have to decompress everything any time we want to actually read it and potentially modify it. And even if the naive compression scheme we're using under the covers is doing dictionary encoding, like Zstandard does, the dictionary itself is not exposed to the database system. So we can't use the dictionary to do predicate evaluation or answer queries, because the format is opaque: we don't know what's actually in the binary compressed data.
But if we maintain the dictionary ourselves, then there are performance benefits we can get when we do query evaluation. The other thing is that this compression scheme is, as the name says, naive: it doesn't consider anything about the meaning of the data or the queries that need to operate on it. It can't reason about the data types or the query I'm trying to execute; I have to blindly decompress everything and then run the query. So the only advantage I'm really getting, going back to the MySQL case, is that I can read one kilobyte instead of 16 kilobytes, potentially. Whereas if we can operate directly on compressed data, we still have to read the compressed data from disk, but maybe we don't have to materialize the uncompressed results: we can do our predicate evaluation directly on the compressed data and go way faster. That's the key idea in columnar compression: is there a way to encode the data such that we can operate on it directly without having to decompress everything first? We sort of saw this last class when we talked about the BitWeaving stuff, and bitmap indexes are essentially another way to compress things as well: I can represent data in a smaller form and get more bang for the buck for any amount of CPU I'm spending processing the data. So we're going to go through a bunch of these different schemes, and we'll see that in some cases they can be combined together. Run-length encoding is super common, and it's often used in combination with dictionary encoding and other things. Dictionary encoding overall is the most common approach, so we'll spend time on what that looks like. Bitmap encoding will be an extension of what we talked about last class for bitmap indexes, but we'll look at whether there are more compact representations we can use for the bitmaps themselves. Delta encoding is sort of obvious, and bit packing will be the last one. So run-length encoding is exactly what it sounds like: if we have repeated values in long strides in our data, then instead of storing the exact same value over and over and over again, we store a summary of it. We say: for this given value, at this starting position in our column, here's how many occurrences there are of it, and ideally that's a more compact form of the data. To get the most benefit from this, you need to sort the data ahead of time, but this gets tricky, because sorting on one key versus another affects the efficacy of run-length encoding on the other columns. In general, people usually sort on the primary key because it's good enough. This is sometimes called null suppression in the literature; that's basically the same idea, except instead of tracking arbitrary values, you just track how many nulls you have. So here's a simple example, the same data we had before: a list of people and whether they're lit or not. If I want to do run-length encoding on this, instead of storing yes, no, yes, no, yes, no, I'm going to have triplets that say: here's the value, here's the offset where we are in the column (sometimes that's not needed, but for simplicity assume it is), and here's the length of the run. So we have three yeses in the first three positions.
So our triplet is: the value yes, offset zero, and a run length of three. Now if I have a query like SELECT COUNT(*) FROM users GROUP BY isLit, I can rip through this pretty quickly, because all I need to do is scan the triplets and add the run lengths up for the different groups in my hash table as I'm doing the aggregation. In this case here, though, we're not actually getting better compression: the original data was just yes/no, which we could store in a single byte each, and now we're storing more data. The reason is that we have these regions where it's a no followed by a yes followed by a no; that's the worst-case scenario when there are only two values, so run-length encoding doesn't help us here. But if we sort the data on the lit column, we get it down to two triplets: here are all the yeses, followed by all the nos, and I only have to store two entries. I don't have the exact math on the slide, and there are only eight or ten tuples here, but think of really large tables: if I can sort things ahead of time, this is a huge win. And a lot of real data actually looks like this; it just depends on whether you can sort it or not. So we're going to see RLE in a bunch of different places, combined with the other schemes. As far as I know, very few systems run RLE on a column by itself; Vertica is probably the most aggressive about this, because they pre-sort columns ahead of time. But we'll see it again inside dictionary encoding and other things; that's why, even though dictionary encoding is the most common, I'm showing RLE first, so we can apply it as needed later. So the basic idea of dictionary compression is that we take frequent values that occur in a column and replace them with a fixed-length code that acts as a placeholder for the value, and then we maintain a separate dictionary data structure that maps a given dictionary code back to its original value. This is how all the column-store systems convert variable-length strings into fixed-length values, because the dictionary codes are always fixed-length. So we're getting the benefit of compression, obviously, but we're also getting the benefit of making everything nicely aligned in our columns when it otherwise wouldn't be. Typically you have one dictionary code for each unique value. There are schemes that do n-gram compression, where substrings within a value (say, every few characters) get their own codes and you concatenate them back together to reconstruct the original value, but most systems don't do that; it's usually one code equals one full value. Yes? Is that a question or a statement, sorry? So the question is: is dictionary encoding really only useful for strings? It's especially useful for strings, because everything becomes fixed-length, but people still use it for numeric columns as well. Say it's a float column and the floats repeat a lot: now you can represent each float as an integer, and assuming the encoding is order-preserving, which we'll get to in a second, I can do comparisons on integers instead of comparisons on floats. Which is faster? Way faster, yes. I think Parquet and ORC both compress floats using dictionary encoding.
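Going back to run-length encoding for a second before we go deeper into dictionaries, here's a minimal sketch of those (value, offset, length) triplets on a made-up version of the isLit column, including the sorted case and the trick of summing run lengths to answer the GROUP BY without decoding:

```python
# Minimal sketch of run-length encoding a column into (value, start, length)
# triplets, plus the GROUP BY COUNT(*) that can run on the triplets directly.
from itertools import groupby
from collections import Counter

def rle_encode(column):
    """Return a list of (value, start_offset, run_length) triplets."""
    triplets, pos = [], 0
    for value, run in groupby(column):
        length = sum(1 for _ in run)
        triplets.append((value, pos, length))
        pos += length
    return triplets

col = ["Y", "Y", "Y", "N", "Y", "N", "N", "Y", "N", "N"]   # hypothetical isLit column
print(rle_encode(col))            # unsorted: worst case, lots of short runs
print(rle_encode(sorted(col)))    # sorted on this column: just two triplets

# SELECT isLit, COUNT(*) ... GROUP BY isLit: just sum the run lengths.
counts = Counter()
for value, _, length in rle_encode(sorted(col)):
    counts[value] += length
print(dict(counts))
```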
Timestamps would be another case where dictionary encoding can help, and again, you can combine these things: if I'm doing delta encoding, maybe instead of storing the full value I store a code plus a small delta together. But the largest portion of the data in real databases is string data, so that's where you get the huge win; you can still use it for other types, though. Okay. So in deciding how we want to do dictionary encoding, there are four questions we have to answer. First: when are we actually going to construct the dictionary? This is only really an issue if you want to support incremental updates in your database, which for our purposes in this class we assume we're not, so that makes it easier: when we create the file, the PAX file, the Parquet file, whatever, once the data is in the file, that's when we create the dictionary. Next question: what is the scope of the dictionary? How much data are we looking at: multiple columns or a single column, within a block or within a whole file? This is similar to the granularity question from before. Next: what data structure are we going to use for the dictionary? If it's immutable, the answer is going to be a sorted array, because that's the easiest; but if you want to support updates, we'll look at how to use B+trees for this. And then: what's the encoding scheme? How are we going to assign the dictionary codes, and how can we do fast encode and decode? All right, let's go through these one by one. First: when are we going to construct the dictionary? As I said, it's either all at once (if the file is immutable, I write it out to disk and populate the dictionary then), or, if I need to support updates and inserts into an existing file, I have to support incremental updates. That's a lot harder, because you can imagine that if I'm trying to preserve the ordering of the dictionary codes relative to their values, and I have dictionary codes one, two, three, four, and now I need to insert something between three and four, I have to readjust and rebuild my dictionary and re-encode everything. The way to handle that is sometimes to leave gaps, extra unused codes between the values, so that if someone does come along and insert, you can still preserve the ordering of the codes without having to re-encode everything. This is much harder; DB2 BLU is pretty aggressive about this, but we're not going to cover it. Our focus this semester is on the all-at-once case: I have all the data, and I'm trying to pick the correct scheme for it. The scope: again, this is similar to the granularity question from before. What range of tuples in the table does the dictionary represent? Block level means just within a single column: here are all the values for the tuples in that block. Now, the problem with this is that if I have different blocks that are part of the same table, each with its own dictionary, then if I want to do comparisons across them, or put them together into a single data structure, I always have to decompress them first, because the dictionary codes in one block are not going to mean the same thing as the dictionary codes in another block, and I don't want false positives when matching.
So I have to decompress everything when I combine them together. Say I'm building a hash table for a join and I have multiple blocks that are each dictionary compressed: I can't store the dictionary codes in my hash table, because the same dictionary code might mean different values from one block to another. If we do table-level compression, we'll get a great compression ratio, because we're going to see the same values over and over again and we don't have to store the same mapping in multiple dictionaries. But of course, as I said, if we have to update it, we have to leave extra room to absorb and amortize inserts. Nobody does this, and we're not going to talk about it, but you could also do dictionary compression across multiple tables. Think of foreign key columns with the same value domain repeating from one table to the next: if I have a single dictionary for both of those, I can get a better compression ratio, but as far as I know, nobody does this. So for this semester, we'll assume block-level dictionaries within a column. So what do we need to support in our dictionary? Basically two operations: encode and decode. For an uncompressed value, we want to convert it to its compressed form, and we want to do this as fast as possible. Think of the scenario: I'm bringing a block of data in, I want to do a scan on it, and I want to convert my predicate into dictionary codes so I can operate directly on compressed data. So I need a way to encode. And then, as I'm getting tuples that match my predicate in my scan operator, I need to be able to decode them very quickly. There isn't a hash function that can do this for us, that can go in both directions, be fast, and be order preserving, which is the other thing we care about. So we're basically going to have to maintain some data structure to do this. Here's why order preserving matters: if the ordering of the dictionary codes corresponds to the ordering of the values they compress, then we can convert the predicates in our queries into dictionary codes and operate directly on compressed data, and in some cases we can do some query rewriting and operate directly on the dictionary itself without ever looking at the real data. Say I have four strings: andrea, wynn, andy, and matt. If I sort the original data in the column and assign dictionary codes that correspond to the sort order of the values, then when a query comes along like SELECT * FROM users WHERE name LIKE 'and%', I can convert the LIKE clause into a BETWEEN 10 AND 20 on the codes, because I can go look at the dictionary, find the values starting with 'and', and take the lowest and highest code in that range. Then I can rip through the compressed data very quickly. What would normally be a string match, which is slow, turns into a BETWEEN on integers, which is fast. That's why we want the order-preserving property. It doesn't always work, though.
So if I have a query like this one that projects the name, then since I need to get back the original value, I still have to do the sequential scan, but then I decode anything that matches. But if I have a query like SELECT DISTINCT name with that same predicate, all I have to do is operate directly on the dictionary: apply my predicate as a BETWEEN on the codes and pull out the unique names. So this is almost like a zone map, in that the dictionary is a compressed representation of what's actually in the column, and in some cases we only have to operate on it; we never look at the original data. Say I have four entries in my dictionary but a billion tuples in my column being represented by those four entries: I can run this query directly on just four things instead of all one billion. Yes. So the observation is that to do this you'd seemingly have to store two mappings, value-to-code and code-to-value. Typically you only store one direction, and the dictionary is so small that you can just scan it quickly; I think Parquet does this, Arrow does this. There's another approach we'll see in a second that uses a B+tree, and then you can do lookups in both directions, as you're describing. Again, we assume the data is highly skewed, so there will be repeated values in my column: I could have a billion tuples but some much smaller number of distinct values in my dictionary, and if I'm organizing things per block, which is maybe 10 megs or whatever it is (128 megs in Parquet), then this thing isn't going to be that big and it'll sit in the L1 or L2 cache. Right. So I've already alluded to the three different ways we can represent the dictionary. The most common is an array: you take a bunch of variable-length strings, sort them, and store them one after another in the array, and the dictionary code is a byte offset into the array. This is the worst thing to do if you need to support updates, but if our files are immutable, it's fine. A hash table would be another approach; the problem is that you can't do range or prefix queries with it, so I don't know if anybody actually does this; it would just support faster point lookups. And then the B+tree is going to be much slower than the other approaches and take way more memory, but it can support incremental updates and complex range or prefix lookups. You'll see the trick: since it has to support both encode and decode, it's a more involved data structure where the leaves of the B+trees are shared between the two separate indexes. We'll go through the first one and the last one. All right, so for the array: the first thing I do is sort the data, and then store the values in a dictionary that's just contiguous memory. Each entry has a length prefix saying how big the value is, then the actual value, then the next entry in the array. Then in our compressed column data, we just replace the actual value with an offset into this array. It's pretty simple, and as I said, it's the most common one; this is what Arrow uses.
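Here's a minimal sketch of that sorted-array dictionary with the order-preserving property used to rewrite the LIKE 'and%' predicate into a range of codes. It's a simplification: the codes here are array positions rather than byte offsets, the names and column data are made up, and the prefix-range trick shown only works cleanly for plain ASCII strings:

```python
# Minimal sketch of an order-preserving dictionary built as a sorted array.
# Codes are positions in the sorted array (a real engine stores byte offsets
# into one contiguous buffer), so code order == value order, which is what
# lets us rewrite a string predicate into an integer range.
from bisect import bisect_left, bisect_right

values = ["andrea", "andy", "matt", "wynn"]          # sorted distinct values
encode = {v: i for i, v in enumerate(values)}        # value -> code
column = [encode[v] for v in
          ["andy", "matt", "andrea", "andy", "wynn", "andy"]]

# WHERE name LIKE 'and%'  ==>  WHERE code >= lo AND code < hi
prefix = "and"
lo = bisect_left(values, prefix)
hi = bisect_right(values, prefix + "\xff")           # crude upper bound for the prefix range
matches = [code for code in column if lo <= code < hi]
print([values[c] for c in matches])                  # decode only the survivors

# SELECT DISTINCT name WHERE name LIKE 'and%' never touches the column at all:
print(values[lo:hi])
```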
For the B+tree version, we want to share leaves: the idea is that there's a B+tree on top that does encoding and a B+tree on the bottom that does decoding, and in between they share the leaf nodes. So I can take an original value and get back its code, or take a code and get the original value back; I can go in both directions. You can ignore the incremental encoding part; we're not going to cover it, but basically you can do additional compression on the contents of the leaf nodes as well. It's not Zstandard, but you can use other techniques there, because you know the strings in the leaf are sorted and you can take advantage of that. So I said before that the problem with Zstandard and the other naive compression schemes is that even though under the covers they're doing some variation of dictionary encoding, they don't expose the dictionary to the database system. Well, Parquet and ORC actually have the same problem. When you go look at their implementations, there's no way to pull out the dictionary and use it for, say, encoding your predicate to do predicate pushdown. You can do this with Arrow, but it basically makes a copy of the dictionary out of Parquet and converts it into the Arrow format; there's no zero-copy direct access into Parquet's dictionary, at least from my understanding. So even though I made a big deal about how important it is to do native compression and maintain the dictionary ourselves, the two most common columnar formats, Parquet and ORC, don't actually let you do this. Facebook has a newer file format called Alpha that they mention in the Velox paper we'll read later in the semester; they fix this problem. And at YouTube there's a system called Procella that has its own file format called Artus, and they solve this problem as well. So it's well known that you want the database engine to be able to see the dictionary to do the tricks we talked about, but the most common formats don't actually allow it. Something to be mindful of. Yes? For this one or for the array one? This one. So for the B+tree, if I want to encode a value, I do a lookup in the top index, it points me to some offset inside a leaf node, and I do the lookup there; decoding goes through the bottom index the same way. Where I'm going with this is the array version: there, we don't actually store the codes anywhere. So if I want to encode a value, I have to do a sequential scan over the array. You could try a binary search, but the entries are variable-length, so you'd have to figure out where to jump; so in practice you do a sequential scan, or you maintain a separate structure that records the offset of every entry, but that takes up space, and I don't know if anybody does that. You could also put a Bloom filter in front of it to check whether the string you're looking for is even in there, and if not, skip the scan; you can combine all the things we've talked about. But this array is the most compact form; you can't get smaller than this: length and value, stored contiguously.
Whereas with a hash table you have to maintain empty space, and with a B+tree, same thing; those techniques are useful if you want to support incremental updates. If you don't need incremental updates and you have all the data ahead of time, the array is the best approach, and again, you can put a Bloom filter in front of it. So we've gotten through run-length encoding and dictionary encoding; now we'll talk about bitmap encoding, compressing the bitmaps we talked about last time, and then we'll finish up with the last two schemes. So last class was all about using bitmaps as auxiliary data structures for indexes. But as the paper you guys read points out, you can actually store the column itself natively as bitmaps, and that compresses it, assuming the cardinality is low enough. So again, this is the example of the same column we had before: if I store it as the two bitmaps, the original data was 72 bits, and we get it down to 16 bits for the two values plus 18 bits for the two bitmaps, so 34 bits. That's a pretty good compression ratio. Question, sorry? Right, so again, this only works if the cardinality is low. We saw the zip code example: if I have a bunch of different possible values, I have to maintain a bitmap for every one of them. We'll compress them in a second, but in general you only want to do this when cardinality is low. Now, we're not going to cover this in this class, but there's an additional optimization where, depending on the query, you can choose the order in which you look at the bitmaps. If the query is trying to find something that matches a predicate and you only need the first match, maybe you want to start with the bitmap that has the most bits set, because you're more likely to find a match there, and then you don't have to scan the other ones. Database systems try to do the same thing across predicates, like WHERE a = 4 AND b = 5: they try to evaluate the most selective predicate first. This is the same optimization, but within a single attribute, choosing which bitmap to look at, which is kind of neat. All right, so how can we compress our bitmaps? As I said, there are going to be a lot of zeros in them, because if there's one bitmap per distinct value, a tuple can only have its bit set in one of those bitmaps, which means in all the other bitmaps it's a zero. So we want to compress that down. We could just use a naive compression scheme like Zstandard or Snappy and compress the blocks representing the bitmaps, but then we have the same problem: the compressed data is opaque, and I don't know what's in the bitmap until I decompress it. That's less than ideal. So I want to talk about two other approaches. First, byte-aligned bitmap codes from Oracle: this is a form of what I'd call structured run-length encoding for bitmaps, where the algorithm is designed around assumptions about what the bitmaps are going to look like. And then we'll talk about roaring bitmaps, from a French-Canadian guy; this is more common in a bunch of modern systems, and it's a simple trick to compress things down.
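Before those two, here's a minimal sketch of the plain bitmap encoding itself, one bitmap per distinct value, on a made-up low-cardinality Y/N column like the example above:

```python
# Minimal sketch of bitmap encoding: one bitmap per distinct value, with the
# i-th bit set when tuple i has that value. Only sensible when the cardinality
# is low; otherwise you end up with a huge pile of mostly-zero bitmaps.
def bitmap_encode(column):
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps.setdefault(value, 0)
        bitmaps[value] |= 1 << i          # set bit i in that value's bitmap
    return bitmaps

col = ["Y", "Y", "N", "Y", "N", "N", "Y", "N", "Y"]   # hypothetical 9-tuple column
bitmaps = bitmap_encode(col)
for value, bm in bitmaps.items():
    # printed MSB-first, so tuple 0 is the rightmost bit in each line
    print(value, format(bm, f"0{len(col)}b"))

# WHERE val = 'Y' reduces to a population count over a single bitmap.
print("count Y =", bin(bitmaps["Y"]).count("1"))
```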
I'll say up front, spoiler: I like teaching the Oracle stuff, but it's from the 90s, it's slow, and nobody actually does it anymore; Oracle doesn't even do it anymore. I just think it's a fascinating way to represent things. The roaring bitmaps, though, are what's common in modern systems. Right. So with Oracle's byte-aligned bitmap codes, BBC, we divide the bitmap into chunks based on different categories of bytes, and again, we assume there will be a bunch of zeros. A run of all-zero bytes they call gap bytes, and then there are tail bytes, where at least one bit is set. The idea is that we find these long sequences of zero bytes and represent them with run-length encoding, saying how long the run of zero bytes is; and for the tail bytes, we have different ways of representing them depending on what's in them, and we can be clever about reducing their size based on the bit patterns we see. So we chunk up our bitmap: some number of gap bytes at the beginning, followed by some number of tail bytes at the end. The gap bytes we always compress with RLE. The tail bytes we store uncompressed, unless the tail is a single byte with exactly one non-zero bit, in which case we just record the position of that bit. The example will make this make more sense. All right, so here's our bitmap. We have some non-zero bytes: one near the top and two at the bottom. The first chunk is these first three bytes, because we have all zeros followed by that tail byte; the leading zero bytes are the gap bytes. The next chunk is the rest of the sequence, because we have a bunch of zero bytes followed by the two tail bytes at the end. The way we store the first chunk: in the header byte, we keep track of the number of gap bytes, then we have a one-bit flag to say whether the tail is special, meaning there's only one bit set and we just need its position, and then four bits that hold either that position or the number of tail bytes stored verbatim. So it looks like this: the first three bits of the header are the number of gap bytes, in this case two; then a bit set to one saying the tail is special, because there's only a single one in the tail byte; which means the last four bits just store where that one is. So we went from three bytes down to one byte; that's pretty good. Now take the second chunk. We have this long run of gap bytes and then two tail bytes. In the first three bits we'd store the number of gap bytes, in this case 13; then a zero to say the tail isn't special, it isn't just one byte with a single bit set; then the next four bits tell us how many tail bytes we're copying verbatim; and then the verbatim tail bytes follow. So going through it: there's the gap length, there's the zero saying we're not special, there's the count of verbatim tail bytes, and then the verbatim tail bytes themselves at the end. Yes? No, sorry, let me correct that.
The 111 in the gap-length field is a flag meaning the gap length doesn't fit in three bits: go look after the header byte and that will tell you the actual gap length, which is how we encode the 13. Sorry, I messed that up. And then, again, the zero means the tail isn't special, this is the number of verbatim bytes stored at the bottom, and then there are the actual verbatim tail bytes. So the original chunk was 18 bytes and we got it down to five bytes. I think it's neat. It's like listening to old vinyl records: nobody does it anymore, but it's a curiosity. What's that? Some people still listen to vinyl records, yeah. It's vintage; this is a vintage encoding scheme. I like it, and not just because I made the slides; I think it's interesting. But it's obviously bad. Can anyone guess why? He said it's highly data dependent. Yeah, his point is that if your data doesn't have a lot of zeros, this doesn't work, but we said we assume it will; real data sets look like that. And to her point: yeah, if there are a lot of 1s spread across these bytes, then this is going to be terrible. But as I said, most real bitmaps are going to be long runs of zeros. From a computational standpoint, why is this bad on modern CPUs? It's bad for SIMD, but what else? Lots of branching. That's the answer. This is great for compression, but it has terrible branching behavior, because as I'm looking at these headers, they tell me: do I jump ahead to the next thing, or do I keep looking at my current bytes? Even though everything is going to be in L1, the CPU instruction behavior is going to be terrible, because as I'm ripping through, in my example here I had two chunks: this one you just rip through, and this one you have to jump around to find where things are. This is terrible for branch prediction, because it can be completely different from one chunk to the next. So this is why Oracle abandoned this in the 2000s and replaced it with something called Word-Aligned Hybrid bitmap encoding. That's all patented, and Andy and patents don't mix, so we're not going to cover it. But again, they identified in the 2000s that for modern superscalar CPUs, and we'll see more about this throughout the semester, all this branching is bad, and it's difficult to use SIMD because you can't just take arbitrary if-clauses and run them with SIMD; you have to use an encoding scheme similar to what we saw with the BitWeaving stuff. Also, none of this supports random access. That's less of an issue for us in OLAP, but if I need to know whether some particular bit is set, I have to scan and decompress the whole thing. So a more common, more modern approach is the roaring bitmap. This is from a French-Canadian guy named Daniel Lemire; his blog is actually very interesting, he's a professor who writes a lot of low-level systems code and puts it out on GitHub, so I highly encourage you to go check it out. The basic idea with a roaring bitmap is that we store 32-bit integers in a compact two-level indexing structure, where a sparse chunk stores its values exactly as an array, but a dense chunk stores them as a bitmap.
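Here's a toy sketch of that two-level container idea. This is not the real roaring-bitmap library, and the exact conversion policy is simplified; 4096 is just the commonly cited array/bitmap cutover:

```python
# Toy sketch of the roaring-bitmap container idea (not the real library):
# the top 16 bits of a 32-bit key pick the chunk, the low 16 bits go into
# that chunk's container. Sparse containers are sorted arrays of values;
# once a container passes 4096 entries it is converted into a 2^16-bit bitmap.
from bisect import insort

ARRAY_LIMIT = 4096

class ToyRoaring:
    def __init__(self):
        self.containers = {}                      # chunk id -> list[int] or int bitmap

    def add(self, key):
        chunk, low = key >> 16, key & 0xFFFF      # divide / mod by 2^16
        c = self.containers.get(chunk)
        if isinstance(c, int):                    # already a bitmap container
            self.containers[chunk] = c | (1 << low)
            return
        if c is None:
            c = []
            self.containers[chunk] = c
        if low not in c:
            insort(c, low)                        # keep the array sorted
        if len(c) > ARRAY_LIMIT:                  # densify: array -> bitmap
            bitmap = 0
            for v in c:
                bitmap |= 1 << v
            self.containers[chunk] = bitmap

    def __contains__(self, key):
        chunk, low = key >> 16, key & 0xFFFF
        c = self.containers.get(chunk)
        if c is None:
            return False
        if isinstance(c, int):
            return (c >> low) & 1 == 1
        return low in c

r = ToyRoaring()
for k in (1000, 200_000, 200_001):
    r.add(k)
print(1000 in r, 200_000 in r, 999 in r)          # True True False
```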
And as I said, this is used in a bunch of different systems today. So here's the basic idea. Say we have four chunks, and each chunk points to some container that stores its values. For the first three chunks here, we store the values exactly: if we insert key four, we store the value four in its container. But this last chunk has a lot of values within its range, so we store it as a bitmap. The idea is that you can be adaptive: if a chunk is very sparse, you store the values exactly, but if it's very dense, instead of storing all those exact values, you just store a bitmap to represent them. So for each key we want to insert, we first figure out which chunk it goes to by dividing by 2^16, and the key lives in exactly one chunk, one container. Then, if the number of values in that chunk is less than 4096, we store them as an array of values appended one after another, and this all tries to be word-aligned or cache-aligned; whether they keep the array sorted, I actually don't know what the implementation does. Otherwise, if there are a lot of values, we store the chunk as a bitmap. So say I want to store key 1000: first I divide by 2^16, which tells me to go to this chunk here, and I put the low bits into its array. Now I have another key: divide by 2^16 and it tells me to go to this other chunk, but that one's a bitmap, so I take the key mod 2^16 and that tells me which bit to set, say the 50th bit, to 1. I don't know, we could read the paper or the blog article, exactly when they decide to convert: if a container goes above 4096, do you immediately convert it to a bitmap, and if it drops back down, do you convert it back? It depends on the implementation. So that's bitmap encoding. The last two schemes are delta encoding and bit packing. Delta encoding is exactly what it sounds like: assume that a lot of the values we're storing sequentially in our column are small additions or subtractions from the previous value. So instead of storing the whole value, I just store the delta, and then I can combine that with RLE if the same delta shows up again and again, to get even better compression. Say I have a data set where every one minute I'm recording the temperature in the room; 99 is kind of hot, that's Fahrenheit, sorry. We'll take the first entry, the first tuple in our column, as the base value and store it in its original form, and then the subsequent values are just stored as deltas. For the time column, the delta is plus one minute over and over again; for the temperature column, it's just plus or minus some small decimal. And now what do we have? Repeated values. So we apply RLE on the deltas and compress it down even further. The reason this works is that the thing we're recording in the real world isn't going to have wild fluctuations: it's not like the temperature outside is going to be 99 degrees right now and then immediately drop to negative 100.
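Here's a minimal sketch of that time/temperature example with made-up readings: store the base value, then only the deltas, and run RLE over the deltas where they repeat:

```python
# Minimal sketch of delta encoding a column (store the first value, then only
# the difference from the previous value), plus RLE over the deltas, which is
# where the real win comes from when the delta repeats.
from itertools import groupby

def delta_encode(column):
    base = column[0]
    deltas = [column[i] - column[i - 1] for i in range(1, len(column))]
    return base, deltas

def delta_decode(base, deltas):
    out = [base]
    for d in deltas:
        out.append(out[-1] + d)
    return out

times = [12 * 60 + m for m in range(5)]            # 12:00, 12:01, ... in minutes
temps = [99.5, 99.4, 99.5, 99.6, 99.6]             # hypothetical readings

base, deltas = delta_encode(times)
print(base, deltas)                                             # 720 [1, 1, 1, 1]
print([(d, sum(1 for _ in run)) for d, run in groupby(deltas)]) # RLE: [(1, 4)]
assert delta_decode(base, deltas) == times         # replay deltas to reconstruct

# Floats work the same way, just with small +/- deltas instead of a constant.
tb, td = delta_encode(temps)
print(tb, [round(d, 1) for d in td])
```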
The delta for the time column is always going to be the same; otherwise this wouldn't work as well. So in its original form we had 160 bits; with just delta encoding it's 96 bits; and if we then apply RLE on top of that, we get down to 64 bits. Of course, the tricky part is that if I want to know the exact value at a given timestamp, since these are just increments on each other, I have to replay the deltas from the base value to reconstruct it. So there is some computational cost to getting back the original value. But if I'm just scanning, say computing the average temperature for some time range, assuming I have a zone map that tells me what the time range is (otherwise I have to look at the time column), I can just scan through the deltas and apply the changes into my hash table as I'm doing my aggregation. Okay. So the last scheme is bit packing, and then the Redshift variation called mostly encoding. The idea here is that the application may have told the system that a given column should be stored as a certain data type, like a 32-bit or 64-bit integer, but if the system recognizes that the values actually stored in that column are way smaller than the full range of the declared type, it can store them in a smaller type instead of using all the bits the declared type would require. It's sort of obvious. Say I have a column with a bunch of integers like we saw last class, and the user defined it as a 32-bit integer column; that's super common, people define a serial ID or an integer data type and may not even know about SMALLINT or TINYINT in SQL. So I have 32-bit integers, but if I look at the actual bits for the values I'm showing here, I can store them in 8 bits, because the values themselves aren't that large; all the high-order bits are completely wasted. So the system recognizes this and says: I'll tell you it's 32 bits, but under the covers I'm going to represent it as 8 bits. Now, 8 bits is nicely memory-aligned; you could go even smaller, like 7 or 6 bits, using techniques similar to the BitWeaving stuff we saw, where we packed multiple values into a single word and then did tricks to operate on them; the same applies if we're packing things as 4 or 5 bits. For simplicity, I'm showing 8 bits. So the original column of eight 32-bit integers is 256 bits, but with bit packing we get it down to 64 bits. Yes. So her question is: okay, this is great, but what if the user comes along and inserts something that doesn't fit in 8 bits? If the file is immutable, they can't, right? So that solves the problem. If it's not immutable, then you have to do something else. That's mostly encoding. This is what Redshift calls it; I don't know who else does this. The idea is that most of my data fits in the smaller data type, but there are a few bad eggs, a few values that don't actually fit. So say this is a 32-bit integer column, but there's one value that's super large; whether it was inserted later or was in the original data doesn't matter.
So what we're going to do is store the column as mostly 8 bits, but we'll have a special marker, some bit sequence, that says: by the way, the value you want at this offset isn't actually stored here; there's a separate table over there where you can go get it. It's like dictionary encoding in the sense of: don't look here for what you want, go look in this other structure and get the original value. So in this case, originally we were at 256 bits; with mostly encoding we still have to store 8 bits for every value, because even the marker takes space, and then there's the exception table itself on top of that. In my example I'm showing the marker as, say, all 1s. You could imagine storing one bit that says where to go look and using the remaining bits to give the offset into the exception table, but I don't think they do that; I think it's just all 1s plus "go look over here."
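To wrap up the column encodings, here's a minimal sketch of bit packing a declared-32-bit column down to one byte per value, with a mostly-encoding-style exception list for the rare value that doesn't fit. This is loosely in the spirit of what Redshift describes; the all-1s marker and the position-keyed patch table here are made up for illustration:

```python
# Minimal sketch of bit packing with a "mostly" exception list. The column is
# declared as 32-bit integers, but nearly every value fits in one byte, so we
# store one byte per value and patch the rare oversized value through a side
# table keyed by position. The 0xFF marker layout is a made-up convention.
MARKER = 0xFF                                      # sentinel: "look in the patch table"

def pack_mostly8(column):
    packed = bytearray()
    patches = {}                                   # position -> real value
    for pos, v in enumerate(column):
        if 0 <= v < MARKER:                        # fits in the small type
            packed.append(v)
        else:                                      # exception: store marker + patch
            packed.append(MARKER)
            patches[pos] = v
    return bytes(packed), patches

def unpack_mostly8(packed, patches):
    return [patches[pos] if b == MARKER else b for pos, b in enumerate(packed)]

col = [13, 191, 56, 92, 81, 120, 1_000_000, 231]   # declared INT32, mostly tiny
packed, patches = pack_mostly8(col)
print(len(col) * 4, "bytes ->", len(packed), "bytes + patches", patches)
assert unpack_mostly8(packed, patches) == col
```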
The last thing I'm going to talk about is intermediate results. I don't have diagrams to show for this; next class, when we start talking about joins and other things, this might make more sense. But the whole point of this exercise, this class and the last class, was trying to reduce the cost of doing sequential scans. The indexes and the filters allow you to say: I don't need to read this data, because I know something about what's inside of it based on my predicates, so I don't need to look at it. The things we're talking about today are for the data I do have to go read: how can I keep it compressed and compact so that for the amount of data I'm fetching, I'm getting more bang for the buck, more useful tuples out of every I/O that I'm doing to do my evaluation? But also, in some cases I can speed up execution if I can still operate directly on compressed data and maybe not have to look at the real data, just the dictionary or something else.

Once we get past that basic sequential scan operation, though, as far as I know every database system is going to pass data from one operator to the next in decompressed form. So even though I can apply my predicates directly on the compressed data without having to decompress it, when I pass the results to the next operator they have to be decompressed. Ideally, of course, my predicate is selective enough that I'm throwing away a lot of data, so the amount I have to decompress is small, but it still has to be decompressed before going to the next operator.

This is because we don't want to have repeated logic throughout the execution engine for all these different operators, the group-bys, the sorting, the joins. We don't want to repeat that logic or make calls to our decompression libraries everywhere. From a software engineering standpoint, ideally only the lower levels of the execution engine, the lower parts of the query plan, have the decompression logic, and everything above operates on decompressed data. And you have to do this not only because you don't want to repeat the logic, but also because you can't guarantee that the data coming from one source versus another in the leaf nodes, or even within the same table, is compressed in the same way.

My example here would be a hash join between two tables. If I'm storing compressed values inside the hash table I build on the build side, I then have to make sure the probe side is re-encoded into the same dictionary scheme the build side used. But how do I do that when there are multiple files that each have their own dictionary? It just becomes intractable, right? And again, I could try some incremental scheme where I decompress everything and then recompress it as it goes into the hash table, but that starts to get expensive and becomes logistically difficult, especially if the data is coming from different nodes in a distributed system. So for this reason, once you get past the scan and any basic predicate evaluation, the data is uncompressed going up, because from a software engineering standpoint and a computation standpoint it's just easier.
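To see why the codes themselves aren't comparable across blocks, here is a tiny Python sketch with two made-up dictionaries over the same string column; the values and dictionaries are hypothetical, not taken from any real system.

# Each file dictionary-encodes the same strings, but assigns different codes.
build_dict = {"pittsburgh": 0, "boston": 1}      # dictionary for file A (build side)
probe_dict = {"boston": 0, "pittsburgh": 1}      # dictionary for file B (probe side)

build_codes = [0, 1, 0]      # file A's column: pittsburgh, boston, pittsburgh
probe_codes = [1, 0]         # file B's column: pittsburgh, boston

def decode(codes, dictionary):
    """Map codes back to real values using that block's own dictionary."""
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[c] for c in codes]

# Comparing the raw codes would match the wrong rows (code 0 means different
# strings in each file), so the values are decoded before they reach the join.
build_vals = decode(build_codes, build_dict)
probe_vals = decode(probe_codes, probe_dict)

matches = [(i, j) for i, b in enumerate(build_vals)
                  for j, p in enumerate(probe_vals) if b == p]
assert matches == [(0, 0), (1, 1), (2, 0)]       # correct only on decoded values

Re-encoding file B into file A's dictionary would also work for two files, but that is exactly the recompression step that stops scaling once every file or node carries its own dictionary.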
Okay, so that's what I have for today. Again, this is the last class we'll have on the storage layer. Everything going forward in the rest of the semester is going to be: okay, how do we actually take the data we just decompressed or found through indexes and run queries on it?

All right, so dictionary encoding, as I said, is the most common one. It may not always be the most effective one; RLE is probably going to give the best compression ratio, but again, it only really works if things are sorted. So in general, dictionary encoding is good enough in most cases, and as I said, we can combine the different techniques we talked about today within the dictionary itself to get better results.

Yes? Keep it compressed. So there are two things: there's late materialization, as in when do I stitch the tuple back together from the different columns, and then, at what point do I decompress it? The papers written in 2006 on C-Store make a big deal that you want to keep things compressed as long as possible. Vertica is the commercial implementation of that, and they got rid of that, and as far as I know no system does it now. I actually don't know Vertica's compression schemes; I know they have used RLE, but I don't know what the scope of their dictionary is. If the dictionary is over the entire column, then yeah, you can do that, but if you have blocks of things, then the dictionary is different for each block, the codes aren't going to match up, so you've got to convert them. And then are you going to spend time rebuilding a new dictionary and re-encoding things? Nobody does that, because you might have to spill out to disk, and once you're writing intermediate results out during the query, that's the death of the query. I mean, it's still running, it's just going to become slow. You want to keep things in memory as long as possible, so not having to stage intermediate results, even with pipeline breakers, is the best case scenario. I can put something on Piazza; there's a follow-up paper from the Vertica people, basically: okay, since C-Store, ten years later, here's all the things we've learned. That's in there.

It's like the question somebody asked me earlier today: okay, we've read the last two papers on C-Store, they're from 2006-2007, and you just told us they say things people don't actually do anymore, so why do you have us read those papers? Because those are the fundamental papers on column stores and compression, and the basic ideas they describe are what everyone else is based upon. There are nuances to them, like I mention as we go along: okay, they said this, but here's how you actually want to do it. Things evolve over time, of course; hardware changes, workloads change, and the cloud absolutely changes a lot of things, as we talked about at the very beginning. But the fundamental ideas are still sound. The relational model is from the '70s, but it's still the best thing to do in almost every single case.

All right, so next class we'll talk about query execution. I haven't told you how we create the query plan, that will come later, but it will be about how do I take a query plan, execute it, and then move data up the query plan tree to produce a result? Okay, all right guys, see you on Wednesday.