So today, this is actually one of my favorite lectures. We're talking about database compression. The high-level idea is that we're going to squeeze down the size of the data in our database and save memory, save storage. But in some cases we're also going to be able to execute queries faster because of the way we compress things: we can be aware of how things are laid out and do things more efficiently.

Quickly, before we jump into that, a reminder that there's a tech talk today at 4:30 over in Newell-Simon from Oracle, from one of the guys that works on their in-memory database team. So if you're interested in that, come check it out.

The other thing I want to talk about is that last class we were looking at MemSQL and we had this bit type, and we were like, oh, how can a bit type be 8 bytes? I read the documentation in the SQL standard and in other systems, and it's actually a bit string, not a single bit, so you can specify how many bits you want. But I emailed the MemSQL guy last night and said this documentation is kind of confusing, because you can set an arbitrary size for this, but it's unclear why it's 8 bytes. If I declare it as bit(1), it should store one bit, so why is it really stored as 8 bytes? He said it's some kind of MySQL compatibility that they're trying to follow. And in their new column store system that they call SingleStore, they can actually compress this bitmap. I don't know exactly how they do that, but we'll see how to do bitmap compression later in this lecture.

Just to show you what this looks like, this is actually Postgres. I can declare a table that has a bit type, and you can specify the number of bits you want. For this other one, you can specify that it's a variable number of bits. I don't like this little "varying" thing; this is from the Postgres documentation. You can just replace it with varbit, like that. Now, if I insert into the table, I can specify that I want a bit string like this and go ahead and store it. And when I do a select, I get back bit strings. In the case of Postgres, the difference between varbit and regular bit is that for the first one, because I'm saying it has to be three bits, if I try to insert something that's just two bits, it'll throw an error saying the bit string has to be three.

Yes, question? Why do we need this? Why do we need what? The bit type? Why would you ever want a bit string stored natively in your database? No, not a single bit, this is like a string of bits. The comment is that you could also store it as a varchar. That would be wasteful. What they really do, at least in the case of Postgres, I don't know exactly what MemSQL is doing, is Postgres will say, all right, you're saying this thing can be two bits, I can store that as a single byte, because a byte can hold eight bits. Yeah. Yes? Wait, if varbit is variable length, why do you need to specify five? It's saying the max size that I can store is five, so if I try to store something bigger than that, it won't store it. Actually, it's like the first one: if I make this value one bit too long, it's too big for the varbit column. But the point I'm making is that for the varbit one, I can store this as one bit, and it will let me do that.
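As a back-of-the-envelope sketch of how a bit string could be packed, here is a tiny illustration in Python. This is not Postgres's or SingleStore's actual storage format; it just shows why a short bit string doesn't need a whole 8 bytes of payload.

```python
# Illustrative only: pack a bit string into a length prefix plus ceil(n/8) payload bytes.
# This is NOT the real Postgres/SingleStore on-disk format.
def pack_bits(bits: str) -> bytes:
    n = len(bits)
    payload = int(bits, 2).to_bytes((n + 7) // 8, "big") if n else b""
    return bytes([n]) + payload          # 1 byte of length + packed bits

print(pack_bits("101"))   # 2 bytes total: length 3 plus one payload byte
print(pack_bits("1"))     # still 2 bytes, even though the value is a single bit
```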
Does it store the one-bit value and the five-bit value the same way, or does it store the regular bit and the varbit differently? If you specify varbit(5), it has a max of five, yes. But does it assume everything is five bits, or is it compressed? In this case, it'll store the size plus the bit string itself. OK. We don't need to keep looking at bit strings in Postgres; that's not the point of this lecture. I just wanted to clarify, because I didn't know the answer last time. OK?

All right. So what we're going to talk about today is compression schemes. We'll first provide some background: it's sort of obvious why you want to do compression, and it's maybe more obvious in a disk-based system, but we'll see why we still want to do it in an in-memory system. Then we'll talk about basic naive compression, where the database system doesn't know anything about how the data is actually being compressed. Then we'll look at the more sophisticated techniques of columnar compression, where the database system itself is responsible for running the compression algorithm, so it knows how to natively store the compressed data and can operate on it directly in queries. And if we have time at the end, we'll finish up talking about some research we've done here on compressing OLTP indexes.

OK. So we didn't talk about compression in the intro class when we talked about disk-oriented databases. But it's sort of obvious that if your database is on disk and you're fetching a lot of pages from disk to process queries, then compression is a no-brainer in that environment, because disk I/O is so expensive that paying some additional CPU overhead to decompress the data when it comes into memory is a good trade-off. The disk isn't getting that much faster, but we have more cores; there's more computational power available to us than there is bandwidth to get data in off the disk.

But for an in-memory database this trade-off isn't so obvious. Now everything's in memory, so I'm not paying that big I/O penalty, and I'm also able to do byte-level access to jump to different offsets in my columns and access single attributes very efficiently, whereas in a disk-based system I'm bringing in four-kilobyte pages at a time. So the scope of what I can compress is slightly different.

So for an in-memory database, we're going to make this trade-off between speed and compression ratio, and to the best of my knowledge, in most cases, in-memory databases that support compressed data choose speed over compression ratio. That means there are compression algorithms that get an amazing compression ratio, where you can take your original data and compress it down to a much, much smaller size, but they're very computationally expensive, so these systems won't use them. They'll instead choose a compression scheme or protocol that maybe doesn't produce as good a compression ratio but runs fast. And then there's also the trade-off of whether we want to compress quickly or decompress quickly. In most cases we care about decompressing quickly, because that's what we'll be doing most of the time. So why do we want to do this?
Well, DRAM is not cheap. I have to pay to buy the hardware, and then I have to keep refreshing it, keep the charge in, and that costs energy. So if I can store my database on a smaller machine with less memory, that's going to be a good win for us. And in some cases, when we compress the data, we're actually going to get better performance as well, because now things will fit in CPU caches, or because we're looking at a smaller portion of data but can still compute the same result for a query. So this is going to be a win-win almost all around.

To understand why we can compress the data, we want to understand a little bit about what real databases actually look like. The first thing is that real data sets are going to be very skewed. That means the distribution of values for a given column is not going to be uniform. Using zip codes as an example, there are more people that live in New York City than live in Montana. So if I have a list of all people and I'm storing their zip codes, there are going to be more entries for New York than for Montana, and I can take advantage of that in my compression scheme. Another good example is the Brown corpus, which exhibits this property called a Zipfian distribution. The Brown corpus was a project in the 1960s at Brown University where they took what they considered to be the most emblematic books in the English language and counted the number of times each word appeared. Anyone want to take a guess what the most common word was? "The", yeah, T-H-E. The word "the" appears twice as many times as the next most common word, "a", and "a" appears twice as many times as the third word, and so forth. So you have this sort of power-law curve. A lot of real workloads are going to look like that, and these are the kinds of things we can take advantage of. Or think of Reddit or Hacker News: a small number of articles get a lot of comments, and most people submit stuff that no one ever sees, right? So if we know this, then there's a bunch of long-tail stuff we can maybe compress in a different way, or get a good compression ratio on.

The other thing about our data sets is that they're going to be highly correlated. That means the values of different attributes in the same tuple are oftentimes going to be correlated or related to each other. Like my zip code: that can be correlated with my city. If I know my city name and state, I know what zip code I'm in. Or if I buy something on Amazon, the date somebody placed the order is usually only a few days before the date it actually gets shipped. Amazon's shipping stuff out the same day: I buy something today, it's going to be shipped tomorrow. So maybe instead of storing a full 64-bit timestamp for both the order date and the ship date, I store the order date and then just store the delta, which is a couple of hours or days off of it, and I save space there.

So these aspects of real data sets are why we're going to be able to get a good compression ratio and employ these techniques without slowing down the system entirely. Now, what do we want in our compression scheme? Well, foremost, we need these compression schemes to produce fixed-length values.
Because again, remember, we have this fixed-length data pool, and that's the primary location of a tuple. In order for me to jump to different offsets within that fixed-length data pool, all my values need to be fixed-length. So if I start compressing strings down to arbitrary sizes, that's not going to work for us; everything in the fixed-length data pool has to stay fixed-length. The only exception is when we shove things out to our variable-length data pool: since we're recording the size of the data in the variable-length data pool, it's OK for that to be variable-sized. But most of the data needs to be fixed-length.

The other thing we want is to have the database system wait as long as possible before it actually has to decompress the data during the lifecycle of a query. What I mean by that is that if I have a bunch of compressed data, I want to do as much of my query as I can on that compressed data, and only when I have to produce a result to a human or to the outside world of the database system do I go ahead and decompress it. That's going to be a big deal. This is called late materialization, and it's why we want to do native compression inside of our database system instead of using a general-purpose algorithm where the database system doesn't know anything about what's actually being stored.

The last one is sort of obvious as well, but we should bring it up because there are different types of compression algorithms and we want to make sure we only pick ones we can actually use: any compression scheme we use has to be lossless. What does that mean? Right, you don't lose data. If I take some original data, compress it, and then decompress it, I should get back the original data. What's an example of a lossy compression algorithm? MP3, exactly, yes. In a database system we almost always have to be lossless, because people don't like it when you lose data. If it's your bank account, you put $100 in and they come back and say, no, you have $80, you're going to be pissed. So any compression scheme we use can't do that. If we do want to do lossy compression, that must be done at the application level, meaning some human has to make a value judgment that it's okay to lose data. The database system is not going to figure that out on its own, because the best compression scheme would be to write everything to /dev/null and lose everything, right? It has an amazing compression ratio and it's really fast, but of course you don't have any data. That's an extreme example, but that's the basic idea: the database system doesn't know what you actually care about, so it's not going to do anything that's lossy.

We're not going to talk about this much, but there are techniques called approximate queries that sort of look like lossy compression. If I have a column of people visiting my website and I want to count the number of people that visited in the last month, I can look at every single tuple and add it all up, or I can sample it, skip some portion of it, and approximate that count. The idea there is, who cares: I had a million users to my website, and my approximate count comes back with 980,000 or something like that. It's close enough.
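To make the approximate-query idea concrete, here is a tiny sketch of sampling-based counting. The data, sample rate, and names are made up for illustration; commercial systems implement this much more carefully.

```python
# A toy sketch of an approximate count: sample a fraction of the column,
# count matches in the sample, and scale the count back up.
import random

def approx_count(column, predicate, sample_rate=0.01):
    sample = [v for v in column if random.random() < sample_rate]
    return int(sum(predicate(v) for v in sample) / sample_rate)

visits = [random.randint(1, 30) for _ in range(1_000_000)]   # fake day-of-month per visit
print(approx_count(visits, lambda day: day <= 7))            # roughly the true count, much cheaper
```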
And so, again, we're not going to cover these in depth, but they're called approximate queries. This is actually showing up in commercial systems now: they'll have an explicit approximate count aggregation function in Oracle and SQL Server, and I think DB2 might have this. Sorry, this is actually the next slide. So this technique does exist, and SQL Server, I think, just added it within the last year. Again, the idea is that you're sampling the data to produce counts, produce aggregations.

Another type of technique is zone maps, and I would categorize both of these as data skipping, meaning we're trying to avoid having to look at a bunch of data. With the compression schemes we're going to talk about today, we still have to look at everything, but the things we're looking at are compressed. These are techniques that say, I'm just going to ignore large chunks of things. Zone maps, I think we covered in the intro class. It's not really a compression scheme, it's just a technique for skipping data. The idea is that you pre-compute some aggregations on columns, and then when you execute queries, you can look at those pre-computed aggregations and decide whether there's anything you even want to look at in a given block of data. They're called zone maps by Oracle, and I'm sure they have patents on this. Sometimes they're called pre-computed aggregations, but in general in databases, if you say zone maps, even if you're not talking about an Oracle system, people know you're talking about this thing.

So let's say I have a column here with five values, and I compute a zone map where I have all the aggregation functions I might want to run on this column: the min, max, average, sum, and count. Now if a query shows up, like select star from table where value is greater than 600, then instead of scanning through the column, assuming this is a block of data, instead of having to scan through and find all the records with values greater than 600, I can look at the max value in the zone map and see that the max value for this block is 400. Therefore I know there cannot be any tuple in here that is 600 or greater, so I don't even bother looking at this block. I just look at the zone map, see that it can't satisfy my predicate, and I can skip it. Yes?

The question is, if you're running MVCC, when would you update the zone map? For this and for the compression stuff we'll talk about today, you typically only do this on cold data that's not being modified all the time. So you leave the hot data uncompressed, don't compute a zone map, and only when you recognize that no one has come and updated this block in a while do you go ahead and compute these things, right? Like I said, this only appears in the more high-end commercial systems: Oracle has this, Vertica, MemSQL; Netezza was an FPGA-accelerated database that IBM bought and killed off last year. I think zone maps originated with those guys and then everyone else adopted the same idea. But only Oracle calls them zone maps, because that's the Oracle term, and everyone else calls them pre-computed aggregations or something, okay? So let's focus on the compression stuff. That's the main thing we want to talk about today.
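Before leaving data skipping behind, here is a minimal sketch of the zone-map idea just described. Everything here (block contents, field names) is made up for illustration; real systems maintain these pre-computed aggregates per storage block.

```python
# A minimal sketch of zone maps: pre-compute per-block aggregates, then use them
# to skip whole blocks that cannot satisfy a predicate.
def build_zone_map(block):
    return {
        "min": min(block), "max": max(block),
        "sum": sum(block), "count": len(block),
        "avg": sum(block) / len(block),
    }

def scan_greater_than(blocks, zone_maps, threshold):
    """SELECT * FROM table WHERE val > threshold, skipping blocks via zone maps."""
    results = []
    for block, zm in zip(blocks, zone_maps):
        if zm["max"] <= threshold:     # no tuple in this block can qualify
            continue                   # skip it without reading the data
        results.extend(v for v in block if v > threshold)
    return results

blocks = [[100, 200, 300, 400, 400], [700, 800, 650, 900, 950]]
zone_maps = [build_zone_map(b) for b in blocks]
print(scan_greater_than(blocks, zone_maps, 600))   # only the second block is read
```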
So if we want to add compression to our database system, the first question we have to think about, since we're the ones actually building the system, is what are we trying to compress? That decision tells us what kind of compression scheme we can possibly use. The first approach is to do it on a block: it could be a block of columns or it could be a row store, it doesn't matter, it's a block of tuples within the same table. You could be more fine-grained than that and do it at the tuple level, where within a single tuple you compress all the values. Even more fine-grained is a single attribute within one tuple that you want to compress. That sounds crazy, but if you have a large text field it might make sense: Wikipedia stores every revision of an article as its own attribute value in a tuple, and you want to compress that. Or we can compress an entire column: we take all the values for a single attribute across multiple tuples within the table and compress those together. We primarily only want to do this if it's a column store, so most of the OLAP systems do this.

For block-level compression, any system that does naive compression, like MySQL, which we'll see in a few more slides, will do this. There are some systems that support attribute-level compression: SQL Server can do this, and there's a variant of RocksDB called TerarkDB out of China that can do this. That really only applies to variable-length fields: if I have a large blob or a large text field, then I can compress it. But the compression scheme you want to use for each of these granularities is quite different. For a block or a single large attribute, I could do naive compression like Gzip or Snappy, but for the column-level one, we'll use dictionary encoding or the other techniques, okay?

All right, so let's first focus on naive compression, because if you already know roughly something about compression, when I say we're going to compress data, this might be the first thing you think of. The idea here is that we take an off-the-shelf, general-purpose compression algorithm that people are already using for other things that aren't databases, and we take our sequence of bytes from whatever data we're trying to compress, whether it's a tuple, a single attribute, a column, or a block, and run it through the algorithm. It'll hopefully produce a block of data that's smaller than what I gave it.

As for the kinds of algorithms people use: I said Gzip, but that's usually not what you want, because Gzip is an example of a heavyweight compression algorithm: it gets a better compression ratio than these other ones, but it does so much more slowly. LZO and LZ4 are some of the first schemes people developed that don't get as good a compression ratio but are much faster. Snappy is from Google; that's another one that's very common in database systems. Brotli is from Google as well; it's not typically used in databases because it's designed for compressing webpages, and there's some magic in there where if you have a bunch of HTML tags that show up over and over again, it can compress them better.
Oracle has this thing called OZIP, their own patented, proprietary compression algorithm. It's a variant of these LZ-style dictionary-based compression schemes, but what's fascinating about it is that Oracle bought Sun, and Sun was making the SPARC chips. So for a while, I think they've since discontinued SPARC, I forget, but for a while they were actually putting the OZIP compression and decompression algorithm in hardware, so you could get hardware-accelerated decompression if you bought SPARC chips and ran Oracle on them. Zstandard is considered the state of the art today; it gives you about the same speed as Snappy but a better compression ratio. So a lot of times you'll see Snappy, but newer systems are starting to adopt Zstandard.

Again, we've already talked about this: there's computational overhead to compressing and decompressing the data, so we need to be mindful of it, and we're always going to try to pick something that's faster to decompress, even if it's a little slower to compress, because most of the time we're just doing decompression.

So let's see how MySQL does this. This is MySQL InnoDB. What you can do in MySQL when you call create table is pass in a flag that says, I want to store this as compressed blocks. Here's what happens on disk. Again, MySQL is not an in-memory system, this is a disk-based system, but it shows you the implications, and it's the clearest explanation I've found of how to do naive compression. On disk they have compressed blocks that come in four different sizes, all powers of two: one, two, four, eight kilobytes. The idea is that when you take the uncompressed block and compress it, you pad it out to the next size boundary. So if I have a block that compresses to 1.5 kilobytes, I pad it out to two kilobytes, and they do this so they can pack things in and align them nicely on disk.

So you have the compressed page, but then you also have this uncompressed mod log that sits in front of it. Think of this as a delta store that represents the changes made to tuples in this block. Now if a query comes along and wants to access some data, I bring the page into my buffer pool. If the query is an update, then all it needs to do is modify the mod log: I can just append an entry saying I made a change to something down in the compressed part. There's an index up above that says, here's the page that has the tuple you're looking for. So say I do an update: update tuple set value equal to ABC. I don't need to look at what the original value was; I just put an entry in my mod log that says the new value for this attribute is now ABC. If anybody comes along and tries to read that, they check the mod log first, just like you did in the Bw-Tree when you walked the version chain, and they say, oh, this is the latest value for this particular tuple, that's what I'm looking for. But if the thing you want is not in the mod log, then you've got to decompress the page. And what they actually do when they decompress it is keep both the compressed version and the decompressed version in memory. The decompressed version is always going to be 16 kilobytes, because they do this to align things in the buffer pool.
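Here is a rough sketch of the compressed-page-plus-mod-log behavior just described. This is not InnoDB's actual format; the structure and names are invented purely to illustrate the read and write paths, with zlib standing in for whatever general-purpose compressor a system might use.

```python
# Illustrative sketch of a compressed page with an uncompressed mod log in front of it.
import json, zlib

class CompressedPage:
    def __init__(self, tuples):
        # the page body is run through a general-purpose compressor (zlib here)
        self.compressed = zlib.compress(json.dumps(tuples).encode())
        self.mod_log = {}                      # small uncompressed delta area

    def update(self, key, new_value):
        # updates just append to the mod log; no need to read the old value
        self.mod_log[key] = new_value

    def read(self, key):
        # check the mod log first, like walking a version chain
        if key in self.mod_log:
            return self.mod_log[key]
        # otherwise the whole page must be decompressed just to read one value
        tuples = json.loads(zlib.decompress(self.compressed).decode())
        return tuples.get(key)

page = CompressedPage({"tuple1": "abc", "tuple2": "xyz"})
page.update("tuple2", "ABC")
print(page.read("tuple2"))    # "ABC", served from the mod log
print(page.read("tuple1"))    # "abc", requires decompressing the page
```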
So in the buffer pool, they'll have a region for the 16-kilobyte uncompressed pages, and then for each of the compressed page sizes they have a separate region. Now, if I make changes and want to write them back: if I modified this page, I have to replace the compressed version before I write it out. Otherwise, if I didn't modify it, I can throw the uncompressed copy away, and if the compressed one is also unmodified I can throw that away too to free space in the buffer pool. Yes?

The question is, can we add stuff to the mod log, or is it a fixed size? The mod log is a fixed size. If you run out of space in it, then you have to decompress the page and compact it, apply all the changes. It's like the delta records in the Bw-Tree: you have a fixed number of them, yeah, okay.

The main thing I want to point out is that, as I said, if I need to access something that's in the compressed data, I have to decompress it, because again, assume I ran this through Snappy or Gzip or whatever algorithm you want to use: the database system doesn't know anything about what's inside those bytes of the page. Underneath the covers, these algorithms are oftentimes using dictionary compression themselves, but you can't see that from the database system unless you know how to interpret the compressed format of those different algorithms, and I don't think any system actually does that. This also limits the scope of what we can compress, because if I compress an entire column and I need to get one tuple in that column, now I've got to decompress the entire thing. So the main takeaway is that because we're using a naive compression scheme where the database doesn't know what's going on, we lose all high-level understanding of what's in the thing we're compressing and what the semantics of that data are. We can't really do anything other than decompress it every single time.

So this is where we get to the better compression schemes, where the database system does the compression itself, because then it knows how things got compressed and what it compressed. In some cases, when it sees queries, it can just encode the things the queries are trying to look up, the constants in the predicates, do the matching on those, and never have to decompress the original data. So say this is my table of users, and I have their salaries: I have these two tuples, Andy and Matt, with our fake salaries. If I compress it, and I'm not saying how I'm compressing this, assume I can, then every single value in my table gets replaced with something smaller than the original. Now if I want to find all the users where name equals Andy, don't assume I have an index, assume I'm doing a sequential scan. In order to do that match, I'd have to decompress the data. But instead, what if I encode the thing I'm looking for with the same compression scheme, and then do my match directly on compressed data? That's why we want the database system to do all the compression itself; we don't want to rely on an external algorithm. Now, for range predicates, this is trickier: find all users where name is greater than Andy. Depending on how we compressed the data, that may not work, but we'll see how to handle it with dictionary encoding.
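Here is a toy illustration of that last point: with an encoding the system controls, an equality predicate can be evaluated entirely on the codes, while a range predicate fails unless the codes preserve order. The codes here are made up; they are not any particular system's encoding.

```python
# Toy example: evaluate a predicate directly on encoded values.
encoding = {"andy": 7, "matt": 3}            # value -> compressed code (arbitrary)
column   = [7, 3, 7]                         # the stored (encoded) name column

# Equality: encode the constant once, then compare codes; no decompression needed.
target = encoding["andy"]
print([i for i, code in enumerate(column) if code == target])   # offsets where name = 'andy'

# Range: "name > 'andy'" does NOT work on these codes, because 7 and 3 are not
# assigned in the same order as the original strings. An order-preserving
# dictionary encoding, covered later, fixes exactly this.
```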
Okay? All right, so we're gonna talk about seven different compression schemes today, and again, this is all for columnar compression. In OLTP systems, you typically do not compress the data when it's in memory, because as I said, it's going to be too expensive to decompress and recompress over and over again as you make changes. So this is only for cold data that's not going to be modified, and we already know we can store that as a column store. We're gonna talk about null suppression, run-length encoding, bitmap encoding, delta encoding, and the rest. These are all leading up to dictionary encoding, which is what the paper you guys read covered, but the others are things I think are worth knowing about. Dictionary encoding is going to be the most common one, but it's not always going to be the best one; there are special cases where the others do much better. Okay?

All right, so the first one, the easiest one, is called null suppression, and it's a variant of run-length encoding. The basic idea is that if we have a table with attributes or columns that are sparsely populated, meaning most of the time the values are null, then instead of storing null repeatedly for every single tuple, we can just record somewhere that says, hey, I have this many nulls at this location, and then only record the actual non-null values in the column. How you actually do that can vary; one example is the Oracle BBC stuff we'll see in a few more slides. But this is pretty common when you know you're going to have a lot of sparse data. A lot of times the JSON databases can do this kind of thing if they organize the data as a column store, because most documents don't have all the attributes that other documents have.

Now look at run-length encoding. Run-length encoding is the more general idea; null suppression is a version of it optimized for nulls. The idea here is that, as I said, the values in our columns are oftentimes going to be skewed, so we're going to have repeated values over and over again. If we know we have runs of values in our columns that are all the same, then instead of storing that value over and over again for every single tuple, I can store a triplet that says: here's the value, here's the offset in the column where the run starts, and here's the number of times it's repeated. Once you hit a value that's different from what's in that run, you start another triplet that records that information. We'll see this on the next slide, but we talked about sorting column stores last time with Vertica: the compression ratio you get with run-length encoding is going to be amazing if you're able to pre-sort your columns in such a way that you have long runs of repeated values.

So let's see a really simple example. We have a table here of people with their IDs and their sex, and we want to do run-length encoding on the sex column, because assuming there are only two values, male or female, we know we're going to have a bunch of these repeats over and over again.
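Here is a minimal sketch of the triplet representation we're about to walk through. The column data is made up; the point is just the (value, start offset, run length) encoding and the lookup path.

```python
# Run-length encoding as (value, start_offset, run_length) triplets.
def rle_encode(column):
    triplets = []
    for offset, value in enumerate(column):
        if triplets and triplets[-1][0] == value:
            v, start, length = triplets[-1]
            triplets[-1] = (v, start, length + 1)   # extend the current run
        else:
            triplets.append((value, offset, 1))     # start a new run
    return triplets

def rle_lookup(triplets, offset):
    # scan the triplets to find which run covers this offset
    for value, start, length in triplets:
        if start <= offset < start + length:
            return value

print(rle_encode(list("MMMFMMFM")))                  # unsorted: lots of short runs
print(rle_encode(list("MMMMMMFF")))                  # sorted:   just two long runs
print(rle_lookup(rle_encode(list("MMMMMMFF")), 6))   # 'F'
```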
So if we want to do run-length encoding, we can store this as a bunch of triplets, where each triplet has the original value, the offset in the column where the run starts, and how many times the value is repeated, right? In this case here, for the first run, I have three M's: one, two, three. So the triplet is M, starting at position zero, three times. Now if I want to do a lookup and ask, what is the sex of the tuple at offset three, I scan through the triplets to find which one covers the offset I'm looking for, and that tells me the value. There are other optimizations too: if you're doing a group-by count, instead of having to look at every single value in the entire column, I can just rip through the triplets, look at the count and the value, and populate the hash table for my aggregate very efficiently that way.

So what's one problem with this example? Well, one problem is that we're actually not getting a good compression ratio. For this small example it doesn't matter, but imagine the table has millions or billions of people; then this could be problematic, because we have alternating patterns of male, female over and over again. To store one instance of this value here, female, I have to store a full triplet, because the run length is one. So what I could have stored as maybe just eight bits over here, I'm now storing as, say, 64 bits for the triplet: it takes me more space to store the triplet than it did the original value. But if I sort it, so that I have all the males first followed by all the females, then when I do run-length encoding I get an amazing compression ratio, because I have just one triplet saying here are all the males with a run length of six, and one saying here are all the females with a run length of two. Again, imagine a large column with a million people: if I do it this way, it's going to be amazing. Yes?

The question is: normally, when it's uncompressed, you have a one-to-one mapping, you know the tuple at offset six is male, but in the compressed data you have to linearly search through the triplets. Right: when it's uncompressed, if I'm at this offset here at six and I want to know the sex of this person, I know how to jump to that offset and just read it. But in this world, I've got to scan from the beginning to figure out which triplet my offset falls in. Correct. One answer is, think of this as doing run-length encoding per block, so it's not like I'm scanning the entire column. The other thing is that for OLAP queries you usually don't care about the value of one single tuple: I don't care what any one person's sex is, I care about the distribution of sexes across the entire university or a class or something like that. So you're doing full scans, and in that world this is fast. So overall it makes up for it: for the workload we're trying to target, the analytical operations, this is a big win. For OLTP, where I'm saying go get this one record, this would be bad. Yes. Are compression schemes used in HTAP systems? The question is, are compression schemes used in HTAP systems?
Yeah, so if you do the fractured mirrors approach we talked about last time, like Oracle does, they have the column store compressed, because you're almost never updating that, and they keep the row store uncompressed. So in HTAP systems you can do this. In our system what we do is we have a slightly different layout of the blocks when the data is cold: we actually do dictionary compression on the blocks when they're not being modified. But then if somebody updates it and then you scan it, that won't get propagated into the columnar data immediately, because to change one value in the compressed block you'd have to redo the entire thing. Did they talk about fractured mirrors? Yes. Not everyone was in the intro class, but they covered this last class, so they might have talked about it today. They talked about how they put a little marker on the fractured-mirrors side, in the column store, that says this thing got modified, don't look here for this exact value, go look in the row store. So they handle it that way. In Peloton, everything was always a single storage space, and it's just that the layout of the data might change depending on whether it was a row store or a column store, whether it was hot or cold, and we were doing append-only MVCC. So if you modified something in the column store, the version chain would get updated to point to the row store. In our current system we essentially do the same thing: we have a block of data that's compressed, and it can't be modified, it's immutable. So if you modify a tuple, instead of appending to the version chain exactly where that tuple is, it gets moved somewhere else; we treat it like a delete followed by an insert. Okay.

So the cool thing about run-length encoding, RLE: I should also say that Vertica, of all the column store systems out there, is probably one of the biggest heavy users or proponents of this approach; most other systems favor dictionary encoding. But the cool thing about RLE is that we can apply it in conjunction with other compression schemes and get this nice multiplicative effect, where we can compress the compressed data, which is kind of cool. We'll show how to do RLE on bitmaps in a second, and you can also apply RLE to delta encoding.

So with bitmap encoding, the idea is that it's a different way of organizing the data. Instead of storing the individual attribute value for every single tuple in a column, you instead have a separate bitmap for every single unique value that the column could have. Then, if I want to see whether the attribute at some offset has a given value, I look in the corresponding bitmap and check whether it's zero or one at that position. This is only going to work if the value cardinality is low for the thing I'm trying to compress, because I need a bitmap for every single possible value. If there's a small number of possible values, this can be a big win, but if there are a lot of different possible values, where every value is unique, then this is a terrible idea, because I'm going to get negative compression: my compressed data is going to be larger than the uncompressed data, which is obviously bad.
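Here is a minimal sketch of the idea: one bitmap per distinct value, with lookups and predicates turning into bit checks. The column data is made up for illustration.

```python
# Bitmap encoding: one bitmap per distinct value in the column.
def bitmap_encode(column):
    bitmaps = {value: [0] * len(column) for value in set(column)}
    for offset, value in enumerate(column):
        bitmaps[value][offset] = 1
    return bitmaps

column = ["M", "M", "M", "F", "M", "M", "F", "M", "M"]
bitmaps = bitmap_encode(column)
print(bitmaps["F"])      # [0, 0, 0, 1, 0, 0, 1, 0, 0]
print(bitmaps["M"][3])   # the tuple at offset 3 is not 'M' -> 0

# A predicate like "sex = 'F'" becomes a scan of one bitmap, and conjunctions
# of predicates become bitwise ANDs of the matching bitmaps.
```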
As I said before, we're going to break the database up into blocks or chunks so that we don't have to allocate these super-large bitmaps; we'll have the bitmaps per block. So going back to our example here, we want to bitmap-compress the sex field. Again, we only have two values, male or female, so in the compressed version we have one bitmap for the male value and one bitmap for the female value. If I want to know the value at a given offset, I just look in the bitmaps at that offset and check whether it's zero or one, right?

So this starts to be a big win. In my little toy example here, I have nine values; assuming I can store male/female as a single byte, or eight bits, that's 72 bits. For the compressed version, I have two bitmaps at one bit per value, nine times two, so 18 bits. Then I need to store the original values, but I only store them once, so that's two times eight bits, or 16 bits. 16 bits plus 18 bits is 34 bits. So data that took 72 bits uncompressed comes down to 34 bits, and if you extrapolate that across millions of users, this starts to be a big deal. Yes?

The question is about SIMD: why not just store it as one column of small codes, and then can't you just compare ones and zeros with SIMD? So the statement is, with vectorized instructions, instead of storing a separate bitmap for each value, couldn't I take a potentially larger value, compress it down to a smaller encoding, store those contiguously, and then run SIMD on that? That's what dictionary encoding does; this is an alternative, yeah. The other cool thing you can do here, which I don't think you can do with dictionary encoding, is this: we already know the query optimizer is going to evaluate the predicates in our where clause so that the more selective ones go first, so we start throwing out data as quickly as possible. With bitmaps we can go a step further and order the way in which we check the bitmaps within a single attribute: if we know one bitmap is super selective, we can check that one first. That brings a whole other level of optimization, which is pretty different from what we talked about before. I don't think we'll cover other bitmap encoding schemes, but there are other techniques that are really kind of cool if you encode your values as bitmaps in slightly different ways than what we're describing here: if I'm looking for values greater than some given value, and things are organized such that I can look at the most selective bitmap first, I can throw away more data just by looking at that. I'll try to cover that when we talk about vectorization later on.

All right, so that was an obvious win: if it's male/female, we're going to be able to compress it down by a lot. Let's take an example where bitmap encoding could be a bad idea.
So let's say now I have a table of customers, and I have a zip code field, and I want to do bitmap encoding on it. Assume my table has 10 million people, and in the United States there are roughly 43,000 zip codes; it's really more like 41,000, I think, but assume that's the case. In my uncompressed data, if I have 10 million people and I can store the zip code as just 32 bits, then the total amount of space for this column is 40 megabytes. But if I do bitmap encoding, I need a bitmap for every single unique zip code in the United States, and each bitmap needs to be 10 million bits long, because it needs an entry for every single tuple in my table. So now the total size would be roughly 53 gigabytes. This would obviously be super stupid to do. So bitmap encoding can get a great compression ratio, but a human has to decide whether it's the right way to encode a given column. For this reason, I don't think bitmap encoding is all that commonly used; you really only see it in the high-end commercial OLAP systems. Since we organize things into blocks, this is somewhat less of an issue, but you can imagine that if I start inserting tuples at different offsets, I've got to extend every bitmap and move everything down, the same problem we saw last time, and that can get really expensive if I have 43,000 bitmaps just for one attribute.

One thing we can do, though, in addition to plain bitmap encoding: these bitmaps are just byte sequences, ones and zeros like all data, so we can compress the bitmaps themselves to get even better compression. The simplest way is to run that same Snappy or LZ4 or Gzip algorithm we talked about before over the bitmap we're storing. We could end up with pretty good results, because most of the time it's going to be a bunch of zeros, and that's the ideal case for compressing bit sequences. But again, we don't want to do this in an in-memory database, because it's going to be expensive: every single time I have to do a lookup on one of these bitmaps, I have to decompress it, and that's going to be slow. You could say, well, let me combine together a bunch of bitmaps for zip codes that nobody lives in, like in Montana and Wyoming, and most of the time I don't need to look at those, but when I do, I have to decompress them. Yes, you could do that, but as far as I know, that's a lot of engineering overhead, and no system actually does it.

What you do see instead is, again, native database compression on the bitmaps themselves, where you know you're going to have a bunch of zeros and you take advantage of that, using a variant of run-length encoding. So I'm going to describe a technique that Oracle used to use called byte-aligned bitmap codes, or BBC. The spoiler is that they don't actually use this anymore, and I'll explain why at the end. But I think it's a good example of the sort of low-level, bit-level compression you can do on bitmaps if you know you're going to have a lot of zeros. I like it because, in my opinion, it's easy to understand, and then you can build upon it and do more sophisticated things that are better for today's hardware.
So again, since we're doing bitmap encoding, we have these long bitmaps for our column, and we want to compress them down so they take less space. Most of the bits are going to be zeros, so we organize the bitmap into chunks, where we have a bunch of bytes that are all zeros followed by some bytes that have ones in them. Any byte that is all zeros is designated a gap byte, and any byte that has a one in any position is called a tail byte. A chunk is then some number of gap bytes followed by some number of tail bytes, and we break the bitmap up and compress it that way. The gap bytes we compress with run-length encoding; for the tail bytes, we either keep track of where the one is located, in the special case where there's just one tail byte with a single one in it, or we store them uncompressed, verbatim as they call it.

So let's look at a visual example. Say this is my bitmap here, and we can see it's all zeros except for this byte up here, which has a single one, and then these two bytes down here, where one has a single one and the other has two ones. Again, we break the bitmap up into chunks, each with a bunch of gap bytes that are all zeros followed by bytes with at least one one in them. For the first chunk, here are our gap bytes that are all zeros, and then here's our tail byte with a single one in it; and here's the other chunk down here.

The way we encode the first chunk is that we always have a header byte, and the header byte tells us what's in the chunk. The first three bits tell us how many gap bytes we have, up to seven. So in this portion of the header we can record at most seven gap bytes for our chunk; if we have more than that, we have to do something else. In this case, we have two gap bytes, so we record the number two here. Then the next bit is a special flag: if it's set to true, we know we have only one tail byte in our chunk and it has only a single one in it. Here it's set to one, so we know it's the special case. Then the next four bits tell us at what position in that tail byte the one is located. In this one, we're encoding four: in our tail byte, it's one, two, three, four, and that's where the one is located, right? So we're able to take three bytes of our bitmap and encode them in a single byte. That's pretty good.

Let's look at the next chunk. Here we have 13 gap bytes followed by two tail bytes. When we write out our header, we can't record that we have 13 gap bytes, because we can only store up to seven in those three bits. So we set them all to ones, and that tells us this is the special case where we need to look at the first byte after the header, which gives us the length of the gap. So we encode 13 there, and that represents our 13 gap bytes; that's the run-length encoding part. Then we look at the special flag, which again tells us whether we have just one tail byte with a single one in it. This is set to zero.
So that means that in the next four bits, we record how many verbatim bytes we have, and the verbatim bytes encode exactly how the bits appear up here; we just make a straight copy of them. Yes? So you only need this extra gap-length byte when the gap doesn't fit in the header? Yes, yes. And there's a limit to what you can record in it, yes, but the high-level idea is still the same.

Okay, so in this example I was able to take a bitmap that was 18 bytes and compress it down to five bytes. That's pretty impressive, right? So again, if I'm doing bitmap encoding, instead of storing the bitmap for every single value in its uncompressed form, I can take each bitmap and compress it separately. Yes?

Why do you need the trailing four bits for the special case where the tail byte has only a single one, rather than extending the first three bits? You could allow a larger gap size, and saving four bits doesn't seem that significant compared to having a larger chunk size. So you're saying, rather than storing these four bits for the position of that one bit, use them to allow a larger gap size? Yeah, I think that would potentially work. They didn't do it that way. I suspect they did it for a reason: this is Oracle, right, they have real customers, they have real data, and they decided that for most cases this is the right way to do it. I would assume that, right?

All right, so as I said, Oracle doesn't do this anymore; it's considered an obsolete format. The reason is that although you get a great compression ratio, it's actually terrible for modern CPUs, because there's all this branching. As I'm operating on these compressed bytes, I've got to jump around: is this thing all ones? If so, jump over here. Is this flag set to one? Jump here. Is it zero? Do something else. That's terrible for modern CPUs, because all those conditionals are hard for the branch predictor to predict, since the answer is different for every single chunk, and the mispredictions cause us to flush the instruction pipeline and stall the CPU. For this reason, they came up with an improved version called word-aligned hybrid (WAH) bitmap encoding that's designed for modern CPUs. The BBC stuff was patented, but that was in the '90s, so the patent might be expired by now; the WAH thing, I think, is still patented. I've never read it; database patents and Andy don't mix, so we don't read anything like that, but they do have something out there that does this. The other thing to point out is that with this kind of bitmap compression, you don't get random access.
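Before moving on, here is a simplified sketch of a byte-aligned scheme in the spirit of BBC. It is deliberately not Oracle's exact format (the header layout here uses nibbles instead of the 3-bit/1-bit/4-bit split described above), just an illustration of run-length-encoding the gap bytes and copying the tail bytes verbatim.

```python
# Simplified byte-aligned bitmap compression (NOT Oracle's exact BBC format):
# each chunk is a header byte whose high nibble counts all-zero "gap" bytes
# (15 = overflow, real count in the next byte) and whose low nibble counts the
# verbatim "tail" bytes that follow.
def compress(bitmap: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(bitmap):
        gap = 0
        while i < len(bitmap) and bitmap[i] == 0 and gap < 255:
            gap += 1; i += 1
        tail = bytearray()
        while i < len(bitmap) and bitmap[i] != 0 and len(tail) < 15:
            tail.append(bitmap[i]); i += 1
        if gap < 15:
            out.append((gap << 4) | len(tail))
        else:
            out.append((15 << 4) | len(tail))
            out.append(gap)                  # overflow byte holds the real gap length
        out.extend(tail)
    return bytes(out)

def decompress(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        header = data[i]; i += 1
        gap, ntail = header >> 4, header & 0x0F
        if gap == 15:
            gap = data[i]; i += 1
        out.extend(b"\x00" * gap)
        out.extend(data[i:i + ntail]); i += ntail
    return bytes(out)

bitmap = bytes([0, 0, 0b00010000] + [0] * 13 + [0b01000000, 0b00100010])
packed = compress(bitmap)
assert decompress(packed) == bitmap
print(len(bitmap), "->", len(packed))        # 18 bytes -> 5 bytes with this input
```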
Not having random access is okay if we're doing OLAP queries, because we're going to scan from beginning to end. But if we want to ask whether the bit at one particular position is set to one or zero, we have to start from the beginning and decompress everything. So this is not going to work well for OLTP; for OLAP, this is fine.

All right, so let's look at some other compression schemes. Delta encoding: the idea here is that if we know values within the same column are going to have small differences between them, then instead of storing a complete copy of the value for every single tuple, we just store the delta between consecutive values. Say this is my table and I'm recording the temperature of this room: it's 99 degrees, it's super hot, and I'm taking a measurement from my sensor every minute. The difference between these temperatures is not going to be that much, because, think of the physical world, this room is not going to go from 99 degrees to zero degrees in a single minute. So most of the time these deltas are going to be really small increments. What I'll do is take the first value in each of my two columns and record it in its entirety, but everything after that is just a delta from the previous value. So here I'm storing plus one, plus one, to represent that I'm going one minute ahead, and so forth.

What's another technique that gets us further compression on this? RLE, right, exactly. Since I'm recording this measurement every minute, I'm going to have a bunch of plus ones over and over again, so I can take the deltas and run RLE on them, and store that as a single pair here: plus one, followed by the length of the run. Just looking at my toy example, take the time column: we have five values, each one 32 bits, so that's 160 bits for the original data. With delta encoding, assuming I can store the deltas as 16 bits each, I get down to 96 bits. And if I do RLE on top of that, storing the original 32 bits followed by just two 16-bit fields, the delta and the length of the run, I can get it down to 64 bits. That's pretty good. But again, same as the other schemes: if I want to know the exact value at some position, I basically have to walk through and reapply the deltas to reconstruct it. This is where the RLE version can actually give better computational performance, because I don't have to reapply each delta one at a time; I can look at the run and say, oh, I know it's plus one applied four times, and it's simple math to get there, okay?

A variant of delta encoding is called incremental encoding. The idea here is that for strings, if we recognize that we have repeated prefixes or suffixes, we just store the difference between consecutive values. Say this is my original data: I have four strings, rob, robbed, robbing, and robot. The first thing I need to do is figure out the common prefix between consecutive values. For the first one, rob, there's nothing before it, so we store it in its entirety; there is no common prefix. But when we look at the next one, robbed, it shares its first three characters with the one before it, so we know that's the common prefix from this value to the next.
We do the same thing for all the other values. Once we have the common prefix for each value, the compressed version just stores the length of the prefix shared with the previous value, plus the remaining suffix that's actually different. This obviously only works well if things are sorted; it also works inside of B+trees, as we'll see later on. Doing some quick math: say I'm storing each character as eight bits. The total size of this data set is 168 bits, and for the compressed one it's 88 bits to store the suffixes plus four times eight bits to store the lengths of the prefixes, so 120 bits in total. I'm also not counting the length of each string; I'd have to store that for the original as well, and the same thing for the length of each suffix. Yes? The question is whether it has to be sorted lexicographically. Lexicographic order is defined by whatever the collation of the sort order is; in my example it's ASCII, so alphabetical sorting, and it works out nicely for that. Okay?

The next scheme we'll talk about is not really a compression scheme; it's just a way to encode data so that it uses less space. As far as I know, the only system that actually does this is Amazon's Redshift, and they call it mostly encoding. The idea is that if you have an attribute where most of the time the values don't need the full size you defined for the attribute, then instead of storing all the bits for every value, you store them as a smaller data type. You see this a lot with application developers: when they define their schema, they say, oh, I want everything to be an int64, but most of the time they're only storing something that fits in eight bits. If you know that, you can tell the database system to use this mostly encoding: what used to be a 64-bit column is now, most of the time, eight bits. But any time I have a value that doesn't fit within eight bits, there's a special flag that says the full value for this offset lives in this separate lookup table over here. So if I'm scanning along trying to find all values greater than 10 or something like that, when I come across that flag I have to know to jump over to the lookup table. In this example, the original size would be five times 64 bits, so 320 bits; assuming I can store the lookup entry as a 16-bit offset plus a 64-bit value, I can get this down to 120 bits, right?

So, I always run out of time with compression, but the thing I want to spend the rest of the time on is dictionary compression. The idea here is that we find frequently repeated values in the attributes, and instead of storing them in their entirety over and over again, we replace them with some kind of smaller code, and any time we want to look up the original value, we look in a dictionary. It could be a hash table, it could be a B+tree, it could be an array; there are different ways to organize it. We can then convert that encoded value back to the original value. We obviously want to support fast encoding and decoding: if it's really, really expensive to encode a block of data, then this is going to be a bad trade-off.
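Before going further with dictionary encoding, here are minimal sketches of the delta encoding (with RLE on the deltas) and incremental/prefix encoding schemes just described. The data and function names are made up for illustration.

```python
# Delta encoding: store the first value in full, then only differences.
def delta_encode(values):
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

# If the same delta repeats (e.g. +1 every minute), collapse it to (delta, run_length).
def rle_deltas(deltas):
    runs = []
    for d in deltas[1:]:
        if runs and runs[-1][0] == d:
            runs[-1] = (d, runs[-1][1] + 1)
        else:
            runs.append((d, 1))
    return deltas[0], runs

times = [1200, 1201, 1202, 1203, 1204]          # a reading every minute
print(delta_encode(times))                       # [1200, 1, 1, 1, 1]
print(rle_deltas(delta_encode(times)))           # (1200, [(1, 4)])

# Incremental (prefix) encoding: store (shared-prefix length, remaining suffix).
def prefix_encode(sorted_strings):
    out, prev = [], ""
    for s in sorted_strings:
        n = 0
        while n < min(len(prev), len(s)) and prev[n] == s[n]:
            n += 1
        out.append((n, s[n:]))
        prev = s
    return out

print(prefix_encode(["rob", "robbed", "robbing", "robot"]))
# [(0, 'rob'), (3, 'bed'), (4, 'ing'), (3, 'ot')]
```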
So I always run out of time with compression, but the thing I want to spend time to finish up with is dictionary compression. And the idea here is that we're going to find frequently repeated values in the attributes, and instead of storing them in their entirety over and over again, we're just going to replace them with some kind of smaller code. Any time we want to look up what the original value was, we go to this dictionary; it could be a hash table, it could be a B+tree, it could be an array, you have different ways to organize it. We can then convert that encoded value back to the original value. So we obviously want to support fast encoding and decoding; if it's really, really expensive to encode a block of data, then this is going to be a bad trade-off. Decoding is actually going to matter more, but we obviously want both to be fast. And then we're also going to want to support range queries, so that we can do things like find all tuples where the value is greater than some given value. I don't want to have to decompress the data to do that evaluation. So if my dictionary encoding follows the same order as the original values, then I can operate directly on the compressed data. So the decisions we're going to make for dictionary compression are: when to construct the dictionary, what the scope of the dictionary should be, like what portion of the data we're looking at, what data structure we want to use, and then what the encoding scheme for the dictionary should be. For the paper I had you guys read, they talked about how you could use Huffman encoding or Hu-Tucker codes. Hu-Tucker codes are what you would actually want to use because they're order preserving. We're not going to talk about how those algorithms actually work because we don't really care; we're database people, we'll use whatever works. So we're going to focus on the other decisions up here. So the first one is, when do we actually want to construct the dictionary? We can do this all at once, meaning we have some data and we know that nothing else is going to be added to it or deleted from it, so we just take that data and compute the dictionary right then and there. In this case we're possibly going to get the best compression ratio, because we know exactly how to pack the dictionary codes into the smallest number of bits we're ever going to need. Whereas now, if anybody comes along and wants to modify this, then depending on whether we're adding things or deleting things, we may have to recompute the dictionary. But if the data is cold, that's not going to happen very often. Incremental construction is where we have a dictionary but we add some additional placeholders where we know values could occur. We don't know what they are, but we know that in our order-preserving code space something could go there, so we just leave some extra space so that when those values show up, we can accommodate them. At some point it'll still get too full, and then we have to recompute the dictionary all over again. For the dictionary scope, again, we talked about this a little bit before, but we can have it at the block level: we just take all the tuples in the same block and compress them together. This potentially gives a lower compression ratio than the other scopes, because the more repeated values we see, the better the compression ratio we get, and the fewer copies of the same dictionary we have to store over and over again. Table level would be one dictionary for the entire table. This is going to give an amazing compression ratio, but of course it's going to be expensive to update, because if I'm inserting a lot of things I may have to recompute the dictionary all over again. And then multi-table would be where we maintain a dictionary that spans multiple tables. And you would do this if you have foreign keys where you know you're going to join the parent relation and the child relation through that foreign key over and over again. If you compress them with the same dictionary, they'll have the same codes, and you don't have to decompress them in order to do the comparison.
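Here is a minimal sketch of the all-at-once, block-level case: build the dictionary over one immutable block, assign codes in sorted order so they are order preserving, and store the block as codes plus a single copy of the dictionary. The names and values are invented for the example.

```python
def build_dictionary(block):
    """Build an order-preserving dictionary over one immutable block of values."""
    decode = sorted(set(block))                            # code -> value (just an array)
    encode = {v: code for code, v in enumerate(decode)}    # value -> code
    return encode, decode

def compress_block(block):
    encode, decode = build_dictionary(block)
    return [encode[v] for v in block], decode

names = ["andrea", "matt", "andy", "matt", "andrea", "andy"]
codes, decode = compress_block(names)      # repeated strings become small integers
assert [decode[c] for c in codes] == names
# If new values show up later, this dense code space has no gaps left,
# so the dictionary (and every code in the block) would have to be rebuilt.
```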
So let me skip multi-attribute encoding in the interest of time, but let's focus on the order preservation stuff. We already talked about this: we need a dictionary that can support encode and decode. Encode basically says, for a given uncompressed key, convert it to its compressed form, and decode is just the reverse of that. The main thing to point out is that there's no magic hash function we can use to do this for us. We're going to have to build an additional data structure in our database system to do this. If you had a hash function that could do this and still be order preserving, then underneath the covers it's probably going to be implemented as a hash table anyway, which is just a dictionary. So we avoid all that and just build the dictionary ourselves. So the reason we want it to be order preserving, as I already said: say a query shows up like, find all the users where the name is LIKE 'and%'. To execute this query, instead of having to scan through and decompress the data and apply this predicate, I can just scan the dictionary. If I can rewrite the predicate to say, now it's between 10 and 20, then I can scan the compressed data directly and figure out what I actually want to look at. So the way you actually compute this: you run the LIKE 'and%' on the dictionary first, figure out what the BETWEEN range of codes is, and then you run that scan on the compressed column. This dictionary is exposed to, or maintained by, the database system itself, so it can run these preliminary probes on it to figure out what is actually available in the compressed data. You can do other things too. Like, if you want to count the number of distinct names in this column, I don't even touch the name column; I just go to the dictionary, rip through it and count the entries, or actually just look at how many dictionary entries I have, and that's good enough, because these are guaranteed to be unique. Right? All right, so now if we have a query like this first one, the question is, do we have to perform a sequential scan? In this case, yes, because I'll match on the dictionary, but I still need to find what the actual tuples are. But in this other example here, where I just want the distinct names, I don't need to decompress anything, I only have to access the dictionary. Yes? Some tuples might get deleted later on. Good, you keep bringing this up. Some tuples might get deleted, yes. So if my dictionary is on a block of data and I say that block is immutable, then that can't happen. Right? Sometimes, also, in addition to the code, you can store a count along with it, and that's cheap to maintain every time I insert, so then you can do even more things like that. It's sort of like the pre-computed aggregation stuff we talked about, the zone map stuff at the beginning: for this value, here's how many occurrences I have. And for some queries, I just operate on the dictionary.
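Here is a minimal sketch of running that LIKE 'and%' predicate directly on the codes, assuming the dictionary is a sorted array whose offsets are the codes, so code order matches value order. The names and column contents are made up for the example.

```python
import bisect

dictionary = ["andrea", "andy", "brian", "matt", "prashanth"]   # sorted values; code = offset
column_codes = [3, 0, 1, 3, 4, 1]                               # the compressed name column

def prefix_to_code_range(prefix):
    """Rewrite LIKE '<prefix>%' as a range over codes by probing the dictionary."""
    lo = bisect.bisect_left(dictionary, prefix)
    hi = bisect.bisect_left(dictionary, prefix + "\uffff")   # just past the last match
    return lo, hi

lo, hi = prefix_to_code_range("and")                  # -> codes [0, 2)
matches = [c for c in column_codes if lo <= c < hi]   # scan compressed data, no decoding
assert {dictionary[c] for c in matches} == {"andrea", "andy"}

# COUNT(DISTINCT name) never touches the column at all if the dictionary only
# holds values that actually appear in the block:
assert len(dictionary) == 5
```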
All right, so there are different data structures you can use for this. The simplest one is just an array. So you have all these variable-length strings and you sort them in an array, and the dictionary code is just the offset into that array. Again, this works if the data is not being modified over and over again, right, as was just asked about, because if I modify things and add a new string, then all the offsets change and I've gotta go re-encode everything. So this is actually what we do in our system today, because we follow the Apache Arrow format, the Apache Arrow specification. Apache Arrow is an open source, universal format for in-memory column stores, and our database system essentially follows it, and they do dictionary encoding this way. Even though other encoding schemes might be better, Arrow only does dictionary encoding this way. Another approach is to use a hash table. This will be fast and compact for lookups; the problem, though, is that you're not going to be able to support range queries or prefix queries, because you can't easily scan through the values, since everything's been hashed. Another approach is to use a B+tree. This will be slower than the hash table and the array and potentially take more memory, but it's going to support the range and prefix queries you may want to do. As far as I know, a lot of systems do the first one here. For the B+tree approach, there's a patent on this particular idea, so I don't know who else actually does it. And most of the time you're compressing cold data, so the first one is good enough. Let's see how you'd actually want to do this. The idea is basically that you have two B+trees: you have one for encoding, and that's right side up, and then you have one for decoding, and that's upside down. So you can go in both directions, and in the middle they share leaves that map values to codes, or, used in reverse, map codes to values. All right, so these are the shared leaves. If I have an original value, like an original string, and I wanna encode it, I go through the top index, land in my leaf node, find the entry that I want, and produce the encoded value. But if I have an encoded value, I use the bottom one, do the same thing in reverse, and spit out what the original value was. And internally, within a node, they're gonna use incremental encoding like we talked about before, because again, within the node the values are sorted and they're oftentimes repeating the same prefix over and over again, so I can compress those down even further. The other thing they're gonna do, in my example here, is the codes are 10, 20, 30, 40: they leave gaps in the code range so that if a new entry shows up that's in between AAB and AAE, I can insert the new guy here and give it code 11 without having to re-code everything. As soon as I run out of space in between these, the game's over and I have to rebuild. Okay, so any questions about dictionary encoding? That was a bit fast, yes? The question is, in this example I have AAB and AAE, so I know how many keys could possibly exist in between them, so couldn't I size the gaps that way? You could do that; I don't think they do that, though. Like, if you know the keys can only be three characters, then that's okay, right? If they can be longer than three, then I don't think that works, because the number of possible keys in between is effectively unbounded.
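Here is a minimal sketch of the gap idea, with plain dicts standing in for the shared-leaves B+trees described above: codes are handed out with spacing (10, 20, 30, ...) so a new value that sorts between two existing ones can take a code in the gap without re-coding anything. The gap size and key values are illustrative only.

```python
import bisect

GAP = 10

def build_gapped(values):
    """Assign spaced, order-preserving codes: 10, 20, 30, ..."""
    encode, decode = {}, {}
    for i, v in enumerate(sorted(values)):
        code = (i + 1) * GAP
        encode[v], decode[code] = code, v
    return encode, decode

def insert_value(encode, decode, value):
    """Give a new value a code midway between its neighbors, if a gap remains."""
    codes = sorted(decode)
    keys = [decode[c] for c in codes]
    pos = bisect.bisect_left(keys, value)
    lo = codes[pos - 1] if pos > 0 else 0
    hi = codes[pos] if pos < len(codes) else codes[-1] + GAP
    if hi - lo < 2:
        raise RuntimeError("no gap left: rebuild the whole dictionary")
    code = (lo + hi) // 2
    encode[value], decode[code] = code, value
    return code

enc, dec = build_gapped(["aab", "aae", "aaf", "aaz"])   # codes 10, 20, 30, 40
assert insert_value(enc, dec, "aac") == 15              # lands in the gap, nothing re-coded
```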
Okay, so a quick bit on how to do index compression. As I said, everything we've talked about here today has been about OLAP, columnar compression. That's what you're gonna see mostly in the wild, that's the most common thing. But for OLTP workloads, the indexes themselves can actually take up a lot of space as well, right? So this is a study we did for a paper we wrote a few years ago. We took some open source benchmarks: TPC-C; Articles, think of this as like Hacker News or Reddit, people posting articles and then people writing comments; and then Voter, which came from VoltDB. Voter mimics the Japanese version of American Idol, where people call in on the phone to vote while someone's performing. You snicker, but it sold a lot of VoltDB: it runs the Japanese version of American Idol, it runs the Canadian version, and I think there's another one in Malaysia it runs. It was used a lot. I got hired based on this, right? So relative to the tuple sizes for these benchmarks, for all of these, the indexes themselves comprise a large portion of the database. And again, we can't use the database compression stuff we've talked about so far, we can't compress the tuples, because it's OLTP and we can modify anything at any time. So we can't compress that part, but here's all this other data that we can potentially compress as well. So the two most common techniques for index compression are prefix compression and suffix truncation. And this looks a lot like the stuff we already talked about before. We have a bunch of sorted data in our nodes, and it repeats values over and over again. So instead of storing the entire value repeatedly, we just store the prefix once and then just the suffix part that's actually different. Right? This one is actually the most common, and you'll see this all the time: B+trees that support prefix compression are gonna do something like this. Now, whether the leaf nodes get compressed or just the inner nodes depends on what kind of workload churn you think you're gonna have. Most of the updates you're doing to a B+tree are gonna be in the leaf nodes, so maybe you don't compress that part, because it's gonna change all the time. But for the inner nodes and above, you could compress those. Yes? The question is, is this like a B+tree approximation of a radix tree? Yes, that's one way to think about it. Except the difference is that if this is an inner node in the B+tree, these keys may not actually exist in the data. Suffix truncation: the idea here is that for the inner nodes, we maybe don't need to store the entire key. We throw away the suffix that doesn't help us and only store the prefix. So in this case here, these two keys are unique immediately after the first character. If I'm looking for something that's greater than the separator, I know I wanna go down one side; if I'm looking for something that's less, I go down the other side. So I could just store ABC and LMN, and that's enough for me to distinguish which path I wanna go down, and that cuts down the space. Going back to this one real quickly: I'm showing compressing keys, but you can also compress the values, the pointers to the tuples. If you know the tuples are in the same block, then instead of storing the whole block ID, maybe you just store the offset. That one's more common in a disk-based system; I don't know of any in-memory database that does it. Prefix compression is the more common one.
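Here is a small sketch of the suffix truncation idea from a moment ago: the inner node only needs the shortest prefix of a separator key that still distinguishes it from the key on its left. The keys are illustrative and the function is a simplification of what a real B+tree split routine would do.

```python
def shortest_separator(left_key, right_key):
    """Shortest prefix of right_key that is still strictly greater than left_key."""
    for n in range(1, len(right_key) + 1):
        candidate = right_key[:n]
        if candidate > left_key:
            return candidate
    return right_key

# The two keys differ at the first character, so one character is enough
# to decide which child pointer to follow.
assert shortest_separator("abcdefghijk", "lmnopqrstuv") == "l"
# When the keys share a prefix, the separator keeps just enough of it.
assert shortest_separator("abcde", "abcf") == "abcf"
```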
All right, the last thing I wanna talk about is a paper we wrote on hybrid indexes, and again, the idea here is just to expose you to other things we could compress other than just the data. So we can compress the keys in the index, we can compress the values in the index, but what if we could also compress the data structure itself, like the internal metadata we're maintaining to make it a B+tree? So what we're gonna do is called a hybrid index, because we're gonna have an uncompressed index and a compressed index. This looks a lot like the fractured mirrors stuff that we talked about for the column store last class. So the dynamic index will be fast to update, it's uncompressed, and then as data gets old, we'll convert it over to this compressed index that'll be read-only. And every so often, we're gonna refresh the static stage by merging in the changes from the front-end index to the back end. So it looks like this. All our inserts, updates, and deletes go to the dynamic index, and then over time we merge it into the static index that's compressed. But we'll also maintain this bloom filter, so that when we do a lookup, we check the bloom filter to see whether the thing we're looking for could be in our dynamic index. If yes, then we go check it. If the bloom filter tells us no, then we just go check the static index, and this avoids having any false negatives. So, does everyone know what a bloom filter is? Who here doesn't know what a bloom filter is? Yes? You sort of raised your hand like you don't know what a bloom filter is. Okay, I'll give you a crash course in 30 seconds, or a minute. Okay, so a bloom filter is an approximate data structure that answers set membership queries. It'll tell you whether a key exists in my set, but it may give you false positives. It'll never tell you something doesn't exist when it does exist, but it may tell you something does exist when it doesn't actually exist, okay? Think of it as just a bitmap: we're gonna take a key, hash it multiple times, and set different bits, and that's how we record whether something exists, right? So the only things we can do are insert and lookup. We can't actually delete things, and we can't do any range queries. So this is my simple bloom filter. Say I wanna insert RZA from the Wu-Tang Clan. I have two hash functions. I take the key, hash it, mod it by the number of bits that I have, and flip those bits to one. Now I do this for GZA, same thing: hash it, mod it by the number of bits, and update the bits. So now if I do a lookup, does this bloom filter contain the key RZA, I do the same thing: take the key, hash it, then do my lookup. If all the bits at those positions are set to one, then I know the thing could exist, and it returns true. In the case of Raekwon, I hash him, and now one of the bits is set to zero, so I know this thing definitely does not exist. But again, you can get false positives. If I do a lookup for ODB, the hashes might take me to bits that were set, even though I never actually inserted him. So it's telling me this key exists when it actually doesn't. You get it now?
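Here is a minimal bloom filter sketch matching that description: an m-bit bitmap, k hash functions, insert and lookup only. The hashing scheme (salted SHA-256) is just one easy way to get k independent hash functions for the example; it is not what any particular system uses.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=2):
        self.m = num_bits
        self.k = num_hashes
        self.bits = 0                      # the bitmap, kept as one big int

    def _positions(self, key):
        """k bit positions for a key: hash it k times, mod by the number of bits."""
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def insert(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.insert("RZA")
bf.insert("GZA")
assert bf.may_contain("RZA")       # an inserted key always comes back True
print(bf.may_contain("Raekwon"))   # never inserted: usually False...
print(bf.may_contain("ODB"))       # ...but a collision can produce a false positive
```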
Okay, so going back to my example from before: with this bloom filter, it could tell me something exists in the dynamic index when it doesn't actually exist there, because I've already merged it back into the static stage. But that's okay, because the worst case is that I do a lookup here and it's wasted. It'll never tell me something doesn't exist, so that I skip looking in here, when it actually did exist there. So how can you compact a B+tree other than just compressing the keys? Well, the first thing is that because B+trees have this rule that nodes have to be at least half full, we know we're gonna have a bunch of empty slots. If our index is never gonna be updated, we don't actually need those empty slots, so we can compress every node down to store exactly the keys that exist. The next thing is the pointers. Because the nodes are just random locations in the heap, these are gonna be 64-bit pointers to wherever in memory the next node is. Instead of storing those pointers in their entirety, if we just store an offset from some starting location in an array, then we can jump to exactly where we would have been if these were separate nodes, even though it's one contiguous piece of memory. So that's how you can compact a B+tree. You can do the same thing for a skip list, and you can do the same thing for an ART index. Again, recognizing that we're never gonna update the index, all this extra stuff we keep in the index so that we can accommodate updates is unnecessary, so we can throw it all out and compress it down. So in terms of size, throughput, and performance, with the hybrid index it actually turns out to be faster, because the data structure is smaller and therefore you have fewer cache misses. And the overall size is reduced quite a bit as well. So again, this is the paper we published at SIGMOD a few years ago. No system actually does this, but I think these techniques could also be applied in a column store system when you know tuples are not gonna be modified. Yes? The question is, do you have a separate bloom filter for each stage? If you have multiple stages, like, here's the hot index, here's a cooler index, and here's the cold index, you'd have a bloom filter for each one. Well, you don't need one for the last static one, right? Because the bloom filter is saying, when a query shows up, do I need to read this index, yes or no, and it tells me whether I need to do that. And the idea is that, because I do a bunch of inserts into my dynamic index and over time I merge them, all those keys now exist on the static side, they no longer exist here, and this thing becomes empty. Why don't we use this in our own system? The question is why we don't use this in our own system. Because we have other problems to deal with. I would also say we don't have a multi-threaded, thread-safe merge algorithm. We did this in H-Store, and H-Store has single-threaded execution engines, so we'd just stop all transactions while we merged. Otherwise you'd have to merge this while worrying about people reading at the same time, or make two copies, and that gets expensive because the index can get kind of large. Okay. So that was rushed at the end, I apologize.
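Going back to how the static stage gets compacted, here is a minimal sketch of the idea: pack exactly the keys that exist into one contiguous sorted array, with no half-empty node slots, and address entries by offset instead of 64-bit heap pointers. This is a simplification of the paper's structures, with made-up names.

```python
import bisect

class CompactedRun:
    """A read-only, fully packed run of sorted keys: no empty slots, no per-node pointers."""
    def __init__(self, sorted_keys, values):
        self.keys = list(sorted_keys)    # exactly the keys that exist
        self.values = list(values)       # value i lives at offset i in this array

    def lookup(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None                      # not here; fall back to checking another stage

run = CompactedRun(["aab", "aae", "aaf"], [101, 102, 103])
assert run.lookup("aae") == 102
assert run.lookup("zzz") is None
```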
But hopefully the main takeaway you got from this is that, hey, the database system knows what the data looks like. It knows what the types are. It knows how the queries are gonna access them. So it can make better decisions about how to compress the data than a naive compression scheme can. And we showed how we can apply compression on top of already compressed data, like doing RLE on top of delta encoding, and get even better compression ratios. And then we talked about how we can wait as long as possible during query execution before we have to decompress the data; only when we give the answer back to the outside world do we ideally have to decompress it. For joins, we'll see later in the semester: if you're joining two things that are compressed with different dictionaries, you have to decompress one and potentially re-encode the other one, and that gets expensive, but it's unavoidable. Okay. All right, so next week we'll talk about, or sorry, on Wednesday, Wednesday, today's what? Thank you. Yeah, it's been rough with the kid, the baby. All right, so Monday, thank you. Monday we'll talk about logging and checkpoints. So even though we're an in-memory database, we still need to be able to recover after a crash. So we'll talk about how to write things out to disk and take checkpoints as quickly as possible. Okay? All right. Have a good weekend. See ya.