This is one of those lectures where there are so many things I want to talk about that I probably could have broken it up into two lectures. But since we're short on time and the semester is already packed as it is, we're going to do one lecture on this. The high-level idea with database compression, as you'll see, is that we want the system to do more with less: by using less memory, and potentially less CPU, we can handle larger data sets and more complex queries. That's the high-level goal today.

Before we jump in, one quick announcement — I'll also post this on Piazza tonight. We have a database tech talk this Thursday at noon over in CIC on the fourth floor. The speaker is the CEO and co-founder of Splice Machine. He's a CMU alum — he graduated here around 1985, before there was a School of Computer Science; I think his degree was in engineering or something. He's also on the board of advisors for the School of Computer Science, on the dean's council, so he's coming to town to meet with the dean's people and make sure the school is running okay, but he has time to talk about the system they're building. Splice Machine is an HTAP database startup. If I remember correctly, it combines HBase plus Spark: you run your transactions on HBase, which they've extended to actually be transactional, and when you have analytical queries they route those operations over to the Spark side. As we discussed before with HTAP, it's presented to the application as a single logical database instance, but under the covers there are actually two different engines. Splice Machine runs on top of HDFS, because that's where HBase and Spark live, so it's a distributed database system, not an in-memory system like we're talking about this semester. But they use some of the same techniques we've covered so far and will cover later — in particular query compilation, vectorization, and obviously they built their own query optimizer. So that's the announcement; you should come to that. Monte's a good guy, and he'll go into technical details — this is not some marketing pitch.

All right, as I said, we have a lot to discuss, so let's jump right in. First I'll give some background on what database compression actually means, then we'll talk about naive database compression, and then we'll spend most of our time on columnar compression, which is what the paper you read covers — but we'll also talk about techniques other than dictionary encoding. And then, hopefully — I always run out of time; every year I try this, so let's try one more time — we'll talk about how to do compression on indexes, because this is actually research
that we've done here at Carnegie Mellon.

All right, so compression. We didn't discuss compression in the intro class last semester, but compression is used all the time in disk-based systems, for the obvious reason: the disk is the slowest thing in the world, so we try to avoid going to it as much as possible. In a disk-oriented database system the trade-off is: I'll pay an extra CPU cost to compress and decompress in exchange for increasing the amount of useful data I get for every disk I/O I have to do. The disk is so slow that I'll use the most aggressive compression algorithm I can to pack as much useful data as possible into the bits I'm reading from and writing to disk, so that when I fetch a page back in, I'm getting more tuples than I would have otherwise.

Now, in an in-memory database the disk is gone. Memory is fast — not as fast as our CPU caches, but the gap isn't the huge difference in performance you have between disk and memory. So in our world it's not as obvious that we want to pay a big computational overhead in exchange for a better compression ratio, because everything's already in memory. Of course we still want to compress things, because it lets us use less memory to store the database, and memory is expensive — both in terms of buying it and putting it in your machine, and in terms of maintaining it, because you have to keep supplying energy to hold the charge. So we want to use compression, but we have to be mindful of the trade-off between query performance and compression ratio. To the best of my knowledge, every in-memory database system out there that supports compression always favors performance over compression ratio. That means there are compression algorithms they won't use, because the cost of compressing isn't worth the benefit you actually get from it.

Before we talk about how we're going to compress our data, we should talk a little about why we can compress our data at all. This should be obvious, but I want to state it at the beginning because it colors our discussion about what kinds of compression schemes we'll use. In the worst-case scenario, if every single value of every single attribute of every single tuple is completely random bits, then no compression scheme is going to be able to help at all — it's just random bits. Think of the raw bytes of a JPEG or of a movie file: from our perspective in a database, those are effectively random bits. This is why if you take an MP3 and run it through something like gzip, you don't get much benefit. But in reality, the data we're storing — because we're dealing with structured databases —
is actually going to be highly skewed and amenable to compression. The way to think about this is across two dimensions.

One: the values within a single column or attribute, across all your tuples, are going to be highly skewed. The way to think about this is, say you took every single book that ever existed and counted which words appear the most — it ends up following what's called a Zipfian distribution, or a power law. That's what the Brown corpus showed: in the 1960s, some researchers at Brown University took what they thought were the most representative books of the English language and just counted the number of times each word appeared. The most common word, I think, is "the", and the next most common word turned out to appear in the text half as often as the previous one — I think the next most common was "a". So if you have a million occurrences of "the", you'd have about half a million of the next word, and the third word appears half as often as that one. That's a Zipfian distribution or power law — it's exponential decay. If we know this, then we know the values we're storing in our database will follow some similar pattern — and this is very common — and we can take advantage of it by using fewer bits to encode the most common byte sequences or values in our data set.

The other dimension is that our data will have high correlation between attributes of the same tuple. What I mean is that within one tuple, we may not need to store the entire raw value for every single column or attribute; one value can piggyback off another, and we can store a more compact encoding that says how one value is correlated to another value, so I don't need to store the whole raw value. An example: say you're ordering things on Amazon and you have Amazon Prime, so Amazon is going to ship whatever you buy within 24 hours. I have to store the order date and the ship date for my purchase. Rather than storing a 64-bit timestamp for the order date and another 64-bit timestamp for the ship date, if I know the ship date is within 24 hours of the order date, maybe I just store a 16-bit offset. You'll see these kinds of patterns in a lot of real-world data sets, and they're exactly the kinds of things we want to exploit in our compression schemes.
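To make the correlation idea concrete, here's a minimal sketch in Python (not from the slides — the schema and field widths are assumptions) of storing the ship date as a small offset from the order date instead of as a second full timestamp:

```python
import struct

# Hypothetical layout: instead of two 64-bit timestamps per order, store the
# order date in full and the ship date as a 16-bit offset in hours from it.
def pack_order(order_ts: int, ship_ts: int) -> bytes:
    offset_hours = (ship_ts - order_ts) // 3600
    assert 0 <= offset_hours < 2**16        # fits because ship date is close to order date
    return struct.pack("<QH", order_ts, offset_hours)   # 10 bytes instead of 16

def unpack_order(buf: bytes) -> tuple[int, int]:
    order_ts, offset_hours = struct.unpack("<QH", buf)
    return order_ts, order_ts + offset_hours * 3600
```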
So there are three high-level goals we care about when we do compression in a database, and this is going to be different from what you see with what I'll call naive compression schemes, or general-purpose compression algorithms like gzip, bzip2, or LZ4. Those are all about taking raw bytes from files and compressing them into a single file or tarball. In our world, we want to encode or compress our data in a way that still allows us to operate on it efficiently in queries.

The first goal is that we need to be able to produce fixed-length values, for all the same reasons I talked about when we covered storage layout: word alignment and jumping to fixed-length offsets. We still need to be able to do that. So if we have one piece of data that is, say, 100 bytes and another that is 101 bytes, we want to compress them and have them come out at the same fixed length. Now, for the variable-length stuff — we'll see this in a second when we talk about TOAST in Postgres — since we store it in a variable-length pool anyway, we can use whatever compression scheme we want on it and produce variable-length output. But for the fixed-length data pool, the data always needs to stay fixed-length, no matter what compression scheme we use.

The second goal is to postpone decompression for as long as possible as we execute queries, up until the point where we have to expose the data to the outside world. In the last class you read the paper comparing row stores and column stores, and it mentioned the term late materialization — same idea. I want to execute my queries on compressed data as much as possible, and only when I have to produce the result for the client do I go ahead and decompress it. This makes sense: if I read a billion tuples and run a filter on them, and my filter produces only one tuple as output, I don't want to decompress a billion tuples just to figure out which one I actually want. I want to operate directly on the compressed data and then decompress only the one that matches my filter.

The last goal is that whatever compression scheme we use has to be a lossless compression scheme. The difference between lossless and lossy should be obvious to everyone in the room. Every compression scheme inside the database has to be lossless, because people get upset if they put data in, try to read it back out, and get something different.
Now, with transactions and things updating concurrently, you can argue about what you're supposed to see, but in general: if my bank account has a hundred dollars today, and I come back without taking any money out and read back some other value, people notice and people complain. So any time we want to use a lossy compression scheme — where we lose data in exchange for reducing the compressed size — that has to be done at the application level, because it requires a human to make a value judgment about whether the data you're giving up or throwing away is okay to lose.

The example I always like to give: say we have a simple database application that's recording the temperature in this room, and it's recording every second — so every second we take another 64-bit reading. One year from now, I probably don't care what the exact temperature was at 3:15 p.m. at five seconds versus six seconds versus seven seconds. So maybe instead I aggregate those 60 seconds' worth of data into a single value: the average temperature for each minute. This is a lossy scheme, because I can't reverse that average after I've combined everything — I can't get back the individual temperature at every single second. I might not care about that in my application, but the database system has no way of knowing that I'm okay with it, so it can't make that decision for me.

Another way to think about this — it's orthogonal to the compression schemes we're talking about today, but it's related to the lossy-versus-lossless distinction — is: can we produce the result we want for our query without actually having to read all the data we normally would? This is a technique called data skipping, and you can still layer it on top of the compression schemes we'll talk about; it's just another way to think about the problem. The idea of data skipping is that we want to produce the answer for our query without reading all the data we normally would. Say we want to compute some aggregation — use my temperature example. If I wanted the exact answer, I'd have to scan the entire table, the entire column, and then produce my aggregation. Instead, in the first case, I could approximate what the answer really is by doing some kind of sampling on the base data and then computing my query over just that sample. Think about it like this: if I'm tracking the visits to my website, do I really care about the exact number of visitors yesterday? If there were really nine hundred thousand visitors and my sample-based estimate comes back with eight hundred ninety-nine thousand, that's probably good enough, right? These are approximate queries.
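As a toy sketch of that idea — nothing like the statistical machinery real systems use, since there are no confidence bounds tracked here, and the function name is made up — an approximate count via naive Bernoulli sampling might look like this:

```python
import random

def approx_count(table, predicate, sample_rate=0.01):
    """Estimate how many rows satisfy the predicate by scanning only a random sample."""
    sample = [row for row in table if random.random() < sample_rate]
    hits = sum(1 for row in sample if predicate(row))
    # Scale the sampled hit count back up to an estimate over the full table.
    return int(hits / sample_rate) if sample else 0
```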
It's more than just random sampling, though — there's actual math behind this. These systems keep track of what data you're sampling and what the confidence intervals are; it's not just "pick a bunch of random tuples and compute over them." They do real work under the covers to provide some confidence about what the accuracy is for these queries. The idea is not new, but it's only in the last couple of years that people have been building systems that support this as a native, first-class component of the query language. BlinkDB was a project out of Berkeley, from the AMPLab; that then got spun off as a startup called SnappyData, which is, I think, Spark plus GemFire / Apache Geode. XDB was an approximate query extension for Postgres. And then in 2017 Oracle added approximate aggregation functions, so you can specify "I want to do an approximate count" and give the confidence bound you want it to achieve.

Another approach to reading less data, or skipping data, is called zone maps. The way to think about this is that for every single block, and every column in each block, I precompute a bunch of aggregates. Then as I execute my query, I check whether there could possibly be any data in a block that would satisfy the filter I'm applying, so I know whether I even need to read that block — and if not, I skip it. Zone maps, I think, are really important; a lot of systems use them. I don't know where else to expose you to this — I think I taught it in the intro class — so I'm dropping it on you now. It has nothing to do with compression per se, and it's orthogonal: you can do zone maps plus compression. I just think it's important for you to see, because a bunch of systems do this and it's a really good, obvious idea that's not hard to implement.

All right, so here's our original data: a single column with five values. In the zone map for this block — there would be multiple blocks for the table, but just for this block — we precompute the min, the max, the average, the sum, the count, whatever aggregations we want. Some systems store this in the header of the block; other systems store it separately.
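Here's a rough sketch of the per-block zone map idea in Python (the class and data are made up for illustration), including the skip check that the next example walks through:

```python
class ZoneMap:
    """Pre-computed aggregates for one block of a column."""
    def __init__(self, values):
        self.min, self.max = min(values), max(values)
        self.sum, self.count = sum(values), len(values)

    def may_contain_gt(self, threshold):
        # "WHERE val > threshold" can only match if the block's max exceeds it.
        return self.max > threshold

blocks = [[100, 200, 300, 400], [650, 700, 50, 900]]
zone_maps = [ZoneMap(b) for b in blocks]
for zm, block in zip(zone_maps, blocks):
    if zm.may_contain_gt(600):                    # skip blocks that cannot possibly match
        matches = [v for v in block if v > 600]   # only scan the blocks that survive
```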
We actually implemented zone maps here a year or two ago and stored them in the catalog itself. In a disk-based system, I think they usually store them separately, because you can pack a bunch of zone maps into one page, and then for one disk seek you get a bunch of different zone maps. All right, so now my query comes along: SELECT * FROM table WHERE val > 600. I have a zone map for this block, so I check my predicate — val greater than 600 — against it to see whether there could possibly be a value greater than 600 in the block. In this case, the max value for this entire column in this block is 400, so I know no tuple in it can satisfy the predicate, and I just skip the block entirely. This is what I was getting at earlier about data skipping being lossy versus lossless: this is the lossless kind, because I'm only skipping when there provably cannot be matching data, whereas a lossy scheme like approximate query processing is estimating the actual value.

Okay, so let's talk about compression now. We know we want compression — it's a no-brainer; every major OLAP system that's been built in the last 15 years does compression. It's like the column-store story: if you build a system today and say you want to do analytics, and you're not a column store and you're not doing compression, it's a non-starter for people. The first question we have to figure out is what we actually want to compress, because that ends up determining what compression schemes are available to us. I'll call this first question the granularity of the compression: what chunk of data in our database are we focusing on and compressing?

The first approach is block level: I have a block of bytes, and I just compress whatever is inside of it. The next is tuple level: I take a single row, where all the attributes are contiguous in memory, and compress that piece of data. Attribute-level compression means taking one single attribute within a tuple and compressing just that. An example is the variable-length stuff I mentioned at the beginning: if you have an object in your variable-length pool, you can compress the bytes that correspond to that single attribute. Postgres does this — it's called TOAST. If you declare a VARBINARY or VARCHAR or TEXT field that's really big and it exceeds some size threshold, Postgres stores it in a special TOAST variable-length data pool and runs a general-purpose compressor on it, so that one attribute ends up compressed. The last granularity, and the one we're going to focus on the most, is columnar compression, where we compress multiple values across multiple tuples that are stored contiguously in a column. This opens up way more opportunities to do sophisticated things than the naive schemes you'd have to use for block-level or tuple-level compression. Okay?
All right, so let's start with the simplest type of compression you can do in a database: naive compression. This is probably what most people think of when they think about database compression, as opposed to the dictionary encoding from the paper you read or the other techniques we'll cover. The basic idea is that we just take a block of data — whether it's a single tuple, a column chunk, or a block itself — run it through your favorite general-purpose compression algorithm, and whatever bytes come out are what we store in the database.

There are a bunch of different algorithms people can use. gzip and bzip2 are actually not good choices for databases, because they're computationally expensive to compress and decompress; the conventional wisdom is that the performance penalty of running them isn't worth the compression benefit you get. Instead there's a family of other algorithms that won't compress as well but are much faster. In general, for databases we're actually okay paying a larger penalty for slower compression in exchange for faster decompression. Think of it like rendering a movie and encoding it with MP4: you're only going to compress the data once, but you'll potentially read it back many times, so you want reads to be very fast. Some of these algorithms make exactly that trade-off.

The current conventional wisdom is that Zstandard, from Facebook, is the state of the art; if you're building a new system today, this is what you'd use, and you see a lot of systems adopting it — just this morning I saw a pull request for Greenplum adding Zstandard support. Snappy is one from Google that people are also interested in; again, you don't get as good a compression ratio, but you get better performance. Brotli is an interesting one, from Google as well; it's aimed at things like HTML documents and the web, and it ships with a predefined dictionary of commonly used words, so you don't need to store a separate dictionary — it's baked into the algorithm itself. I don't know how well that works for databases; I guess it depends on what you're storing. Oracle actually has their own proprietary compression algorithm called OZIP. It's basically a dictionary-encoding implementation designed for OLTP. I don't know the details — it's patented, and I don't want to go read the patent. What's also crazy is that since they bought Sun a few years ago, they control the SPARC chip manufacturing, and on the newer SPARC CPUs they actually put the OZIP algorithm in hardware, so you can do native compression and decompression of OZIP data directly on the CPU.
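To show what naive block-level compression boils down to — bytes in, bytes out, with the codec as a black box — here's a minimal sketch. I'm using Python's built-in zlib purely as a stand-in for whatever codec a real system would pick (Zstandard, LZ4, Snappy, and so on); the function names are made up:

```python
import zlib

def compress_page(raw_page: bytes) -> bytes:
    # The codec knows nothing about tuples or attributes; it just sees bytes.
    return zlib.compress(raw_page, level=1)      # favor speed over compression ratio

def read_value(compressed_page: bytes, offset: int, length: int) -> bytes:
    # To read *anything*, the whole page has to be decompressed first.
    return zlib.decompress(compressed_page)[offset:offset + length]
```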
All right, so let's see how we'd actually use a naive compression scheme: bytes go in, it compresses them, and bytes come out. For this I want to describe what InnoDB does in MySQL. Again, this is a disk-based system, not an in-memory system, but at a high level the idea is the same. Out on disk we have our compressed pages. The in-memory representation of a page is always going to be 16 kilobytes, but when we compress it, the result has to be one of a small set of sizes — 1, 2, 4, or 8 kilobytes, powers of two. We do this because it makes it easy to find free slots in memory and on disk; it's basically a bin-packing problem of fitting these pages into the smallest amount of space possible. So if I compress my in-memory page and it comes out at 2.1 kilobytes, I always round up and store it as a padded 4-kilobyte page.

When I want to read data from a compressed page in MySQL, I copy it into my buffer pool and leave it compressed. In the header of the disk page they have this thing called the mod log, which lets them record changes to data on the page without having to decompress it. Think of it like the version chain, or the delta chain in the Bw-Tree: I leave the compressed part immutable and append new delta changes into this log. Any time I need to read data, though, I have to uncompress the page. With a naive compression scheme the database system doesn't know anything about the data that's been compressed — the compression algorithm is a black box, bytes in, bytes out — so if I need to read anything, I always have to uncompress it first, and the uncompressed size is always 16 kilobytes. They also leave the compressed version around, so if I read something, uncompress it, and never modify it, then when I need to free up a frame in my buffer pool I just drop the uncompressed copy and keep the compressed one. And obviously at some point the delta chain gets too long and you have to uncompress, merge the changes, and compact it again — but we don't care about that for our purposes here.

I've already spoiled the main issue with this: the database system doesn't know anything about the bytes produced by a naive compression scheme, so if I want to access anything,
I always have to decompress it first. This ends up limiting the scope of our compression — I'll explain what I mean by scope in a second — because if my scope is "compress the entire table into one giant block," I'm in trouble: any time I need to read a single byte, I have to decompress the entire thing. That's why MySQL breaks things up into smaller chunks. The other problem — and why this isn't what we want for OLAP queries (I shouldn't say it sucks, because you still want it in certain circumstances) — is that the algorithms don't know anything about what the data looks like or how queries will access it. It's just a pickup truck: bytes in, bytes out. But if we're smart about what we're storing and what queries we want to run on it, we can achieve that late-materialization goal from the beginning: delaying decompression for as long as possible.

This is where we get into the columnar compression we want to talk about. The first thing to observe is that for certain queries — exact matches, natural joins, equality predicates, anything where we ask "does this equal that" — if we know how the data is compressed, and we know what the query is (because it's SQL, we know exactly what you're trying to do), then we can take advantage of that and compress the query itself rather than decompress the data. Say we have a simple table, name and salary, with just two records: Andy and Lynn. I'm not specifying which compression scheme I'm using; just say I replace the values for every tuple with symbols, so Andy becomes XX and Lynn becomes YY. Now for the query SELECT * FROM users WHERE name = 'Andy': the database system knows how it compressed the data, because it's the one that did it — it recorded in the catalog that this table, or these columns in these blocks, are compressed with this scheme. So to do the lookup, all I need to do is replace 'Andy' in the query with the symbol I compressed Andy into. Now, ignoring whether I have indexes, if I'm doing a sequential scan I can just do the comparison name = XX against the compressed form of the data, and never decompress it. I can't do that in the naive scheme — certainly not at the block level. You could conceivably do it at the attribute level: I could run each of these strings through gzip individually, but you wouldn't want to, because you're not going to get a good compression ratio. So again: if we know how the data is compressed, we know how to rewrite our queries to operate directly on the compressed data, and only when I find my match and need to send the result back to the outside world do I decompress it. Now, for range queries — we'll see this in a second — it's a bit trickier: you have to use an order-preserving scheme.
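Going back to the equality case for a second, here's a minimal sketch of that query rewrite, with made-up codes: translate the constant in the predicate once, compare codes during the scan, and only decode the rows that actually match.

```python
# Hypothetical dictionary for the 'name' column: value <-> fixed-width code.
encode = {"Andy": 0, "Lynn": 1}
decode = {code: value for value, code in encode.items()}

column = [0, 1]                           # compressed column of codes

def select_names_equal(target: str):
    code = encode[target]                 # compress the query constant, not the data
    # The scan compares integers; nothing is decompressed until we emit results.
    return [decode[c] for c in column if c == code]

print(select_names_equal("Andy"))         # ['Andy']
```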
[Question: does this assume that every attribute is compressed to a fixed length?] Yes — again, for the variable-length stuff, since we're already storing pointers in the fixed-length data that jump into the variable-length pool, we can use Snappy or gzip or whatever you want on those values and let them be variable-length. For the fixed-length data, because we want word alignment and the ability to jump to offsets, everything needs to stay fixed-length. Okay.

So we want to do columnar compression because we're focusing on OLAP queries here: they're read-mostly, and we're assuming the database isn't updated that often — we'll talk about how to handle updates in a bit. We want to see what kind of compression we can get if we assume everything is stored in columns and we're not making major updates to them. I want to talk about seven different encoding schemes. These are all used, in various forms, in a bunch of different systems, and they're all considered state of the art. The spoiler is that dictionary encoding is the most important one — which is why I had you read that paper — and we'll talk a little about how to actually implement it at the end. But I want to cover the others too, because these are real techniques from the real world.

The first one is called null suppression. The basic idea is that if we know our data is sparse, then rather than storing a bunch of zeros or nulls all over the place, we store a small descriptor that says this attribute is null in this column for this many consecutive tuples. Now, as I'm scanning along, I see the descriptor, know there's a run of nulls ahead of me, and skip past it. It gets a little tricky if you want fixed-length offsets — there's some bookkeeping you have to do to jump around correctly. This is really a variant of what's called run-length encoding, which I'll show in a second; and we'll also talk later about byte-aligned bitmap codes, which compress things even further than just storing counts for the entire column.

So run-length encoding is the general-purpose version of null suppression. The idea is that when we recognize the same value repeated a number of times within a column, instead of storing that repeated value over and over again, we store a triple that says: at this position, this value is repeated this many times. The entire column gets encoded using these triples, and then I use a bit of arithmetic to figure out how to jump to the correct offset to find the value I'm looking for. We'll see on the next slide that the way to get the most bang for your buck with run-length encoding is to pre-sort the column, so you maximize the length of the repeated runs. Vertica, for example — at least as of last year; this may have changed —
doesn't support any B+tree indexes. All columns are pre-sorted, which basically lets them do binary search — the same thing a B+tree is doing — and because all the columns are pre-sorted, they get huge speedups, huge benefits, from run-length encoding.

All right, let's look at an example. Say we have a simple table keeping track of students, and we have a column that records their sex — for simplicity, assume it's just male or female. We want to compress that column. Instead of storing a single character — eight bits, one byte — for each of these tuples, we store the compressed form as triples, where the first element of the triple is the original value (male or female), the second is the offset in the column where the run starts, and the third is the number of times that value is repeated. For this particular example we're actually not getting a great compression ratio, because there's a middle portion that alternates male, female, male, female, so we're storing a whole triple — one byte for the value plus maybe 16 or 32 bits for the offset and 32 bits for the length — for each single occurrence. We're getting negative compression: we're storing more data in the compressed form than in the original.

This is where the sorting I mentioned helps. If I pre-sort on the sex column — and of course adjust the ordering of the other columns so the offsets still line up — then in the compressed form all I need is two triples: male at offset zero with a run length of six, and female at offset seven with a run length of two. [A student points out what looks like an off-by-one typo in the offsets on the slide — yes, minor thing, thank you. Do you understand the high-level idea, though? I don't care about the typo; do you understand the concept? Yes.] For this toy example I only have nine tuples, but now imagine a database of a billion people, everyone categorized as male or female: I can keep track of the sex of a billion people with just two of these triples — call it a couple dozen bytes. That's insane.

[Question: does this mean you can only have one column that's run-length encoded?] No — again, it's another optimization problem where I figure out what order I want to sort my columns in. Say I have three columns: id, sex, and zip code. If I sort on sex first, then within the males I'd have all the different zip codes and I'd sort by zip code within that; or I could sort by zip code first and then, within each zip code, by sex.
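Here's a minimal sketch of run-length encoding with the (value, start offset, run length) triples just described, assuming the column has already been sorted to maximize the run lengths (the data is made up):

```python
def rle_encode(column):
    """Return a list of (value, start_offset, run_length) triples."""
    runs, i = [], 0
    while i < len(column):
        j = i
        while j < len(column) and column[j] == column[i]:
            j += 1                       # extend the current run
        runs.append((column[i], i, j - i))
        i = j
    return runs

def rle_decode(runs):
    out = []
    for value, _, length in runs:
        out.extend([value] * length)
    return out

sorted_col = ["M"] * 7 + ["F"] * 2
print(rle_encode(sorted_col))            # [('M', 0, 7), ('F', 7, 2)]
```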
So again, it's an optimization problem: deciding what order you want to sort your columns in. In Vertica they call these projections. [Question: if I now want to look up a particular id, doesn't that take much longer?] So you're looking up on the id field — you just want to find id 6. I guess before this you were assuming you could binary search on the id; for this, assume there's an index above it. That's orthogonal to the compression scheme we're talking about — with the index I could still jump to that entry. But again, you care about point lookups for OLTP; for OLAP I don't care about id 6, I care about scanning a segment or the entire column. Okay.

So run-length encoding is super useful, and we're going to see it again in a bunch of the slides coming up, because we can use run-length encoding in conjunction with other compression schemes — we get a multiplier effect by compressing already-compressed data even further.

All right, the next compression scheme is called bitmap encoding. The basic idea is that for every single unique value in our column, we store a bitmap that keeps track of whether the tuple at a particular offset has that value or not. If I have nine tuples, then the i-th position in the column corresponds to the i-th position in each bitmap, and it's either zero or one depending on whether that tuple has that value. Now, you don't want to allocate one enormous bitmap for the whole column — if you have a billion tuples, you'd be asking the memory allocator for a billion-bit chunk of space — so typically you break it up, say on a per-block basis, and keep bitmaps for the data within a single block so they don't get too big. Bitmap encoding is only useful if the cardinality is low.

Going back to the example: the column is either male or female. To encode it with bitmaps, I create two bitmaps, one for male and one for female, and there's a one in the male bitmap or a one in the female bitmap at each position. In this case there are only two possible values, so if I had pre-declared ahead of time what the possible values were, I could get away with a single either/or bitmap; but in general, with more distinct values, I need a bitmap for each one. And again, just like for point lookups on the uncompressed column, I know how to jump to a given offset within the bitmaps. The original data for just this column was nine one-byte values — 72 bits. In the compressed form I store two nine-bit bitmaps, which is 18 bits, plus the original values, male and female, each encoded in one byte, so 16 more bits. So the original was 72 bits, and the compressed form is 34 bits. That's not bad.
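A minimal sketch of bitmap encoding for a low-cardinality column, one bitmap per distinct value (using Python integers as the bit vectors, with made-up data):

```python
def bitmap_encode(column):
    """Return {value: bitmap}, where bit i is set iff column[i] == value."""
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps

col = ["M", "F", "M", "M", "F", "M", "M", "M", "F"]
bitmaps = bitmap_encode(col)
# Membership test for tuple offset 4: is it 'F'?
print(bool((bitmaps["F"] >> 4) & 1))     # True
```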
So again, bitmap encoding works if your cardinality is low: male/female, two possible values, so I only need two bitmaps. Here's an example where things go wrong. Say I have a customer table and I want to do bitmap compression on the zip code column. Assume there are 10 million people in our table, and there are roughly 43,000 zip codes in the United States. In the uncompressed form, storing the zip code as a 32-bit integer, I can store all 10 million values for that single column in 40 megabytes. If I use bitmap encoding, I need a bitmap that's 10 million bits long for each of 43,000 possible zip codes, and now the bitmap encoding takes 53 gigabytes. That's stupid — a terrible trade-off — and it's a good example of how you get negative compression if you're not careful about which compression scheme you use. The other issue is that, depending on how we organize our bitmaps, we may have to extend all 43,000 bitmaps every single time we add a new entry. If we're doing this at the block level, it's not a big deal, because we know how many tuples fit in a block and can pre-allocate the bitmaps for them.

With bitmap encoding, we can also compress the bitmaps themselves, and there are two ways to do it. One: go back to the naive compression schemes we talked about before — the bitmaps are just bytes, so run them through Snappy, run them through LZ4, and store whatever comes out. The downside, of course, is that we now have to decompress a bitmap every single time we want to do a lookup, because the compressed bytes are a black box and we can't jump to an arbitrary offset to find the data we want. The other approach is to use an encoding scheme like byte-aligned bitmap codes, a variant of run-length encoding designed specifically for sparse bitmaps. Using the zip code example: there are 43,000 zip codes, and some of them have almost nobody living in them — Montana — so you'll have huge regions of these bitmaps that are entirely zeros, and we can use run-length encoding to compress that even further. I want to show one example of this. It's from the 1990s,
so it's a bit old — it's from Oracle, and Oracle doesn't use it anymore — but I think it's a good way to show the kinds of techniques you can use for bitmap compression. These are called Oracle Byte-Aligned Bitmap Codes (BBC). The basic idea is that we divide our bitmap into chunks and classify the bytes in each chunk. Each chunk has gap bytes, which are sequences of eight bits where all the bits are zero, and then tail bytes, which come after the gap bytes and have at least one bit set to one. So a chunk is a sequence of gap bytes followed by some tail bytes. The gap bytes can then be compressed with run-length encoding, and for the tail bytes it gets a little more complicated: you either store them verbatim — the exact bits of each byte — or, in the special case where a tail byte has only one bit set, you just store the position of that bit.

Let me give an example. Here's our bitmap — think of it as the bitmap for one value of the column we're encoding with bitmap compression. We have a bunch of zeros and three locations where there are bytes containing ones: one byte up top with a single one, and further down one byte with a single one and another byte with two ones. We break this up into chunks, where a chunk is one or more gap bytes (all zeros) followed by a tail byte with at least one bit set. The first chunk has two one-byte gap segments that are all zeros, followed by a tail byte with one bit set to one. The second chunk, further down, has 13 gap bytes of all zeros followed by two tail bytes containing ones.

To encode the first chunk, we always store a one-byte header, and that header byte tells us what's in the chunk; it's broken into three parts. The first three bits of the header give the number of gap bytes in the chunk — here, two. The fourth bit is the special flag: if it's set to one, it means the tail byte we store at the end of the chunk is "special," meaning there's exactly one bit set to one in the entire byte. That's the case here, so the remaining four bits just store the position of that one bit within the tail byte — here, the number four, meaning the bit you want is at position one-two-three-four. Now look at the second chunk, which is a bit more complicated.
This chunk has 13 gap bytes followed by two tail bytes. In the header, we can't record more than seven gap bytes in the first three bits, so if those bits are all set to one, that's a flag saying we have more gap bytes than we can encode here, and the byte right after the header tells you the actual number of gap bytes — 13 in this case. Then the special flag is set to zero, because we have two tail bytes, and one of them has more than one bit set. So the last piece of the header instead tells me how many verbatim tail bytes follow — two — and then the actual tail bytes are stored uncompressed at the end.

Super complicated, right? But the main idea I want to show you is that there are ways to compress the bitmaps themselves: for this example, the original bitmap was 18 bytes and the compressed form is five bytes. That's not bad. As I said, nobody actually does this anymore — it's Oracle's proprietary format, and I'm sure there's a patent on it — but they dropped it because it's not good for modern CPUs. I'm jumping around all over the place with indirection: check this thing, and it tells me I need to look over here; if that flag isn't set, jump over there. All this branching is bad for modern CPUs, because if we get a branch misprediction we pay a big penalty flushing the instruction pipeline and fetching the instructions we should have executed. So Oracle abandoned this maybe 10 or 15 years ago, and now they use another approach called word-aligned hybrid encoding — also patented — which is designed to run well on modern hardware. The one thing I'll say, though, is that it still suffers from the same problem as using a naive compression scheme on a bitmap: I can't do random access. I can't just jump to some offset and check whether the value I'm looking for is a one or a zero; I always have to scan forward from the beginning of the encoded bitmap until I find the position I want.

All right, any questions about bitmap encoding?
You can get really great performance out of bitmaps; we'll see other techniques that use bitmaps and other ways to encode data later in the semester. You have to be careful about whether to use them, but what we also saw here with the Oracle byte-aligned codes is that run-length encoding can be used to compress data inside another compression scheme.

All right, the next technique is called delta encoding. The idea is that instead of storing the exact value at every single position or offset in the column, we exploit the fact that consecutive values usually aren't that different from each other, and store the difference from the previous value instead. Say this is our example where we're recording the temperature of this room from a sensor — and apparently it's really hot, 99 degrees Fahrenheit — and every single minute we take a new reading. The first thing we see is that the interval between readings is uniform: it's always one minute. So maybe instead of storing the exact timestamp for every entry, I just store the difference between this entry and the previous one. Same thing for the temperature: it's not going to swing wildly from 90 degrees to zero; it's 99.5, 99.4 — small deltas from each other — so maybe I just store those instead of the full uncompressed values.

So what I do is start with the first value in my column — 12:00, 99.5 — treat it as the base value, and compute deltas for everything that comes after it. In this version I'm computing each delta from the previous value: I start with the base of 12:00, then add one minute, then one minute to that, and so on. To figure out the exact time of a later entry, I scan from the beginning and keep adding the deltas until I get to it. There are other ways to do this — you could always store the delta between each entry and the base value, which lets you reconstruct any entry directly — but then you won't be able to run run-length encoding to compress it further. Same thing for the temperature: I just store the deltas. And I've already spoiled the punchline: we're storing "+1 minute" four times in a row, so I can run-length encode the deltas and compress this even further. Now, to store five different timestamps, I store the original timestamp plus a single triple saying what the delta is and how many times it repeats. The base value can be stored uncompressed in the header of the column, or in a separate lookup table — it's usually done in the header of the column.
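Here's a minimal sketch of delta encoding against the previous value (made-up data), which you could then feed into run-length encoding, since a repeated delta like +60 seconds per reading collapses into a single run:

```python
def delta_encode(values):
    """First value stored as-is; the rest as differences from their predecessor."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)          # reconstruct by accumulating from the base
    return out

timestamps = [43200, 43260, 43320, 43380, 43440]   # 12:00 onward, one reading per minute
print(delta_encode(timestamps))                     # [43200, 60, 60, 60, 60]
```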
All right, a variant of delta encoding is called incremental encoding. The basic idea is that strings often share common prefixes — we saw this when we did prefix compression for B+trees — so instead of storing every single string in its entirety, I just store the difference from my previous entry. Here I have rob, robbed, robbing, robot. The first entry has no predecessor, so I store "rob" uncompressed — I always need the original. The next one looks at the entry above it and asks: what prefix do I share with it? It's "rob," so I record that, and I do the same thing all the way down, always comparing against the entry above me. What I actually store in the compressed data is the prefix length and the suffix. For the first entry there's no previous value, so the prefix length is zero and I store the full value. For the second, the prefix length is three — I want the first three characters of the previous entry, "rob" — and the remaining suffix is "bed." For the next one the prefix length is four, because it's "robbing" and I reuse "robb" from the previous entry plus the suffix "ing" that's different. Again, you wouldn't want to use this for OLTP, because to reconstruct any given entry I have to scan from the beginning and rebuild it. The alternative is to always store the difference from the base value, which lets me jump directly to an entry, but then I don't get as good compression.
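A minimal sketch of incremental encoding, storing for each string the length of the prefix it shares with its predecessor plus the remaining suffix:

```python
import os

def incremental_encode(strings):
    """Return (shared_prefix_len, suffix) pairs; the first entry has prefix length 0."""
    out, prev = [], ""
    for s in strings:
        common = len(os.path.commonprefix([prev, s]))
        out.append((common, s[common:]))
        prev = s
    return out

def incremental_decode(pairs):
    out, prev = [], ""
    for plen, suffix in pairs:
        prev = prev[:plen] + suffix      # rebuild each string from its predecessor
        out.append(prev)
    return out

print(incremental_encode(["rob", "robbed", "robbing", "robot"]))
# [(0, 'rob'), (3, 'bed'), (4, 'ing'), (3, 'ot')]
```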
All right, the next approach is called mostly encoding. This isn't really a compression scheme so much as a recognition that you can often store data in a smaller type. As far as I know, only Amazon Redshift supports this. The idea is: if I have a column declared as a 64-bit integer or some other large data type, but most of the values could actually be stored in a smaller type, I declare it as a "mostly" type and store all my data in that smaller type. For any value that exceeds the bounds of the smaller type, I store a special marker that says "I can't store the whole value here — go look it up in a separate table."

People do this all the time when they create their schema without knowing what their data will actually look like: they declare everything as the largest type it could ever be. So say I have these 64-bit integers and only one of them actually needs the full 64 bits; everything else could be stored in just 8 bits. I encode the column as MOSTLY8, meaning I store 8 bits per value, and for the one value that couldn't fit in 8 bits, I store a marker telling me to go find the original value in a lookup table, which is organized by offset. As I scan along and hit that flag, I see that I don't have the original value here, I'm at offset 3, so I go to the lookup table, find offset 3, and get the original value. Obviously this would be terrible if you declared something as MOSTLY8 or MOSTLY16 and then 99% of your tuples ended up in the separate lookup table — you'd be doing that extra lookup constantly, performing terribly, and getting negative compression again. As far as I know, only Redshift does this, and it's not automatic: you have to declare that you want the special "mostly" type.

All right, so now we can jump into dictionary compression. As I said before, this is probably the most common database compression scheme. The basic idea is that we recognize repeated values in our data, and instead of storing them in their raw form, we represent them with smaller codes, plus this thing called the dictionary that lets us map in both directions: for a given code, what is the original value, and for a given value, what code should I use? We also want to support range queries, as we'll see in a second, because that lets us run as many queries as possible on the compressed data without decompressing everything. There are a bunch of design decisions we have to deal with for dictionary compression: when do we construct the dictionary; what is the scope of the dictionary, meaning how much of the column's data we look at when we build it; what data structure do we use for the dictionary; and what encoding scheme do we use for the codes. I'm going to focus on the first three. The paper talks a little about the last one — there are Huffman codes, and you'd want Hu-Tucker codes because those are order-preserving. There's a whole separate course here at CMU on coding theory — Rashmi teaches it — that will tell you how to do that part; we don't care here, we'll use whatever works.
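Before getting into those design decisions, here's a minimal sketch of a dictionary for one column: assign codes in sorted order so they preserve the value ordering, and keep both directions so you can encode query constants and decode results. (The names other than Andy and Lynn are made up.)

```python
def build_dictionary(values):
    """Order-preserving codes: each distinct value's position in sorted order."""
    distinct = sorted(set(values))
    encode = {v: i for i, v in enumerate(distinct)}    # value -> code
    return encode, distinct                            # 'distinct' doubles as code -> value

names = ["Lynn", "Andy", "Andrea"]
encode, decode = build_dictionary(names)
codes = [encode[v] for v in names]                     # compressed column: [2, 1, 0]
assert decode[codes[0]] == "Lynn"                      # decode for query output
assert encode["Andrea"] < encode["Andy"] < encode["Lynn"]   # value order preserved by codes
```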
All right, so now we can jump into dictionary compression. As I said before, this is probably the most common compression scheme. The basic idea is that we recognize there are repeated values in our data, and instead of storing them in their raw form we represent them with some smaller code. Then we have this thing called the dictionary that lets us look up, for a given code, what the original value is — or, for a given original value, what code we should use. We want to be able to go in both directions. We also want to support range queries, as we'll see in a second, because that lets us execute as many queries as possible directly on the compressed data without having to decompress everything. There are a bunch of design decisions to deal with when we do dictionary compression: when do we construct the dictionary, meaning when do we actually compress the data; what is the scope of the dictionary, meaning how much of the column's data we look at to build it; what data structure do we use for the dictionary; and what encoding scheme do we use for the codes. I'm going to focus on the first three. The paper talks about the last one a little: there are Huffman codes, but you want to use Hu-Tucker coding because that's order-preserving. There's a whole separate course here at CMU on coding theory — I think Rashmi teaches it — go take that and it'll tell you how to do this; for our purposes we'll use whatever order-preserving scheme we can find.

So the first question is when we construct the dictionary. The easiest way is to recompute the dictionary any time new tuples show up. Note that this is different from the granularity question — whether we compress per block or per column — this is about when we actually generate the dictionary. We could rebuild the dictionary and re-encode everything every single time a new tuple arrives. The issue is that this is exactly where the scope or granularity matters: if I build the dictionary over the entire table, then any time I get a new tuple I have to rebuild the dictionary from the entire table. Alternatively, I could do it on a per-block basis, so the older blocks I've already compressed are left alone when a new tuple shows up and I don't have to recompress everything. The other alternative is incremental encoding, where I take new tuples and incorporate them into my existing dictionary without re-encoding everything that's already been encoded. The tricky thing there is that if I want to be order-preserving, I have to make sure any new code I hand out fits into the right location in my ordered encoding — we'll see some examples of how to handle that.

The next design decision is the scope of the dictionary: how much data are we looking at when we build it? We can do this at the block level, at the table level, or across multiple tables. Block-level scope is the most common approach: whether I'm encoding a column, a tuple, or the entire block, I only build the dictionary for just that block. The advantage is that any time a new tuple shows up, if it goes into a block that hasn't been compressed yet, I can leave my other compressed blocks alone and never re-encode them. The downside is that you won't get as good a compression ratio, because the amount of redundant data within one block is going to be much less than the amount of redundant data within the entire table. But this is the most common approach — Oracle famously uses it, and they actually store the dictionary in the block itself. They're paranoid about losing data: if you store the dictionary in a separate block and that block gets trashed, you're screwed, because you can no longer decode any of the other blocks once the dictionary is lost. Whereas if you embed the dictionary in the block itself and it gets trashed, you only lose that one block and it doesn't affect any others. For in-memory databases — actually, we implemented dictionary encoding in our system just last week, and we do the same thing: we store the dictionary in the block itself. But since we're in memory, we're not worried about writing it out to disk.
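A minimal sketch of what block-level dictionary encoding looks like — again just an illustration with made-up names — where each block carries its own order-preserving dictionary, so compressing a new block never forces older blocks to be re-encoded:

```python
def compress_block(values):
    """Build a sorted (order-preserving) dictionary for just this block and encode it."""
    dictionary = sorted(set(values))            # one dictionary per block, stored with the block
    code_of = {v: code for code, v in enumerate(dictionary)}
    return {"dict": dictionary, "codes": [code_of[v] for v in values]}

# Each block is compressed independently; appending a new block of tuples
# never requires touching or re-encoding the blocks we already froze.
blocks = [compress_block(["andy", "lin", "andy"]),
          compress_block(["prashanth", "lin", "lin"])]
```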
All right, table-level scope. Here we have one dictionary for the entire table — for a given column, a single dictionary that covers all of its data. You get a better compression ratio because there's more redundancy to exploit and you only store each distinct value in the dictionary once: in the block-level case I'd have to store a dictionary for every single block, whereas here I have one dictionary for the entire column. But it's expensive to update, especially if you want to be order-preserving and you insert entries that now screw up your ordering. The last approach — I don't think anybody actually does this — is that if you recognize you have dependencies between columns in different tables, maybe you can reuse the dictionary across those tables. If I have a foreign key reference from one table to another, then rather than having two separate dictionaries for those two columns, they could share the same dictionary, because I know the values are going to be in the same domain. We won't go into it here, but this is super useful when you want to do joins: if two columns are compressed with different schemes or different dictionaries, then in order to join them I have to decompress one, re-encode it into the other's scheme, and then do the join. Whereas if they share a dictionary, everything is already compressed the same way and the codes can be compared against each other directly. In the interest of time I'm going to skip the next idea: instead of having a separate dictionary code for every single value, you can take two values together and generate a single code for the pair, which gets you an even better compression ratio. Again, I don't think anybody actually does this.
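Since we're not covering it in detail, here's just a tiny sketch of why a shared dictionary helps joins — purely illustrative, with made-up column contents: with one dictionary covering both foreign-key columns, the join can compare integer codes directly instead of decompressing and re-encoding one side.

```python
shared_dict = ["andrea", "andy", "lin", "prashanth"]        # one dictionary, both columns
encode = {v: c for c, v in enumerate(shared_dict)}

orders_customer = [encode[v] for v in ["andy", "lin", "andy"]]
customers_name  = [encode[v] for v in ["andrea", "andy", "lin", "prashanth"]]

# Join directly on the integer codes -- no decode/re-encode step needed.
matches = [(o, c) for o in orders_customer for c in customers_name if o == c]
```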
All right, so let's talk about the encoding scheme of the dictionary. We need the dictionary to support both compression and decompression — encode and decode. When we encode, we take original values and generate dictionary codes; when we produce output for a query, we take a compressed code and ask what the original value is. So we need to go in both directions, and the dictionary is essentially the data structure that lets us do that. You might think there's some nice hash function that could go both ways and provide this for us. That's not going to work, because there's no magic hash function that generates smaller codes for larger values and is also guaranteed to be order-preserving. The only way to implement that is with a lookup table mapping the values to codes, but then that's the same thing as the dictionary. So right now there's no magic function — I won't say there never will be, but it's unlikely.

So let's see why we care about order preserving. Say we have a simple table with one column of values. I want whatever codes we generate to follow the same ordering — the lexicographical ordering — as the original strings they're encoding. So in this example, if I sort the column alphabetically — Andrea, Andy, Lin, Prashanth — my dictionary codes follow that same order. The reason I want this is that for a query like name LIKE 'and%', I can do the lookup in the dictionary itself, figure out which values satisfy my predicate, and then rewrite the lookup to be based directly on the compressed values. Instead of running that string match, which is expensive, on every single attribute value, I can run it just on the values in the dictionary, which is a much smaller number of records to look at. Then I can convert my query into exactly this range lookup over codes, and I know how to execute that efficiently with SIMD. Think about other operations too: we can do sorting — look at the ORDER BY clauses — directly on the codes without decompressing to the original form, because the sort order of the codes matches the sort order of the original values. That's why we care so much about order preserving.

Let's look at some other examples. In the first query I only care about the name field — before I was doing a SELECT *, now I'm just projecting the name — so I still have to do the lookup against the compressed column data, because I have to decompress the matching values to produce the output. But for the second query I can compute the entire thing just on the dictionary: since I only care about the DISTINCT names, I go directly to the dictionary, find the entries that match my predicate, and never have to look at the column data at all. So the dictionary is this cool thing where I can run some queries on the dictionary alone and never touch the original data. And underneath the covers, the database system guarantees that the dictionary is always in sync with the data — we're never going to see an entry in the dictionary that doesn't appear in the actual data.
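Here's a sketch of that predicate rewrite — my own illustration, assuming a sorted (and therefore order-preserving) dictionary; the bisect trick for turning a prefix into a code range is just one simple way to do it:

```python
from bisect import bisect_left

dictionary = ["andrea", "andy", "lin", "prashanth"]        # sorted, so code order == string order

def prefix_code_range(prefix):
    """Rewrite `name LIKE 'prefix%'` as a range predicate over dictionary codes."""
    lo = bisect_left(dictionary, prefix)
    hi = bisect_left(dictionary, prefix + "\uffff")         # simplified upper bound for the prefix
    return lo, hi                                           # codes in [lo, hi) match

column_codes = [1, 2, 0, 3, 1]                              # the compressed column
lo, hi = prefix_code_range("and")                           # -> (0, 2)
matching = [c for c in column_codes if lo <= c < hi]        # plain integer comparisons, SIMD-friendly
# SELECT DISTINCT name WHERE name LIKE 'and%' never touches column_codes at all:
distinct_names = dictionary[lo:hi]                          # ['andrea', 'andy']
```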
So how can we build the dictionary? There are three approaches. The simplest is to just use arrays: one array holding all the variable-length values in the dictionary, and another array of offsets pointing into it — think of the key map we had inside a B+ tree node. This is actually what we're implementing in our new system this last week, because it follows the spec from Apache Arrow, the in-memory columnar data format we're trying to follow. The array approach is expensive to update: if I insert a new value into the sorted list of values I'm storing in the dictionary, I may have to shift a bunch of stuff around to insert it where it belongs and stay order-preserving, and that changes the offsets, which changes my dictionary codes, which means I have to go re-encode everything. But in our system we only do this once, when we compact a block and freeze it — make it immutable — so we don't have to worry about updating it.

Hash tables are another approach you could use. The problem is that you can't do range or prefix queries, because you don't have easy ordered access to the values you're storing in the dictionary.

The last approach I'll talk about is from a paper I normally have the class read, but not this year. They use a right-side-up B+ tree and an upside-down B+ tree — two B+ trees going in different directions — but in the middle they share the leaf nodes, and that allows you to encode and decode in both directions. It looks something like this: in the middle I have these shared leaves. If I want to encode, I start with the original value — the string key — traverse down into the leaf nodes, find the entry for the value I'm storing, and look up its code. If I want to go in the other direction, I have an encoded value, and the other B+ tree is keyed on the codes, so I traverse it, land in the leaf node that has the data I want, and do a binary search inside it. This example is actually doing incremental encoding: you can do all the additional compression stuff we talked about before inside the leaf nodes themselves, and they assign the codes in increments of 10. That gives a little extra room: if I have to insert something and keep the codes in sorted order, I can slot it in between two existing codes without re-encoding everything. Between 10 and 20 I can insert nine entries without re-coding anything; if I ever need more than nine entries between those two values, then I have to re-encode my entire data set, which is going to suck. I actually think this is a really good idea. I think the paper you read is better because it lays out the basics of dictionary encoding more straightforwardly — and there's also a patent on this one, so you can't use it anyway. But it's a good approach.

All right, any questions about dictionary encoding? I realize that was a bit rushed at the end, but the basic idea is pretty straightforward.
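Going back to the simple array approach for a second, here's a minimal sketch of that style of dictionary — loosely in the spirit of Arrow's layout, but the details are my own simplification: decode is an offset lookup, encode is a binary search over the sorted values, and the cost of inserting is exactly why we only build it once, when a block gets frozen:

```python
from bisect import bisect_left

class ArrayDictionary:
    """Immutable, order-preserving dictionary: code i -> i-th value in sorted order."""
    def __init__(self, values):
        self.sorted_values = sorted(set(values))
        # Variable-length values packed into one buffer, plus an offsets array into it.
        self.buffer = "".join(self.sorted_values)
        self.offsets = []
        pos = 0
        for v in self.sorted_values:
            self.offsets.append(pos)
            pos += len(v)
        self.offsets.append(pos)                      # end sentinel

    def encode(self, value):                          # value -> code (binary search)
        code = bisect_left(self.sorted_values, value)
        assert self.sorted_values[code] == value
        return code

    def decode(self, code):                           # code -> value (offset lookup)
        return self.buffer[self.offsets[code]:self.offsets[code + 1]]

d = ArrayDictionary(["lin", "andy", "andrea", "prashanth"])
assert d.decode(d.encode("andy")) == "andy"
```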
The thing you have to be mindful of is that order preserving matters a lot, and Hu-Tucker is the coding scheme you'd want for that — the details are outside the scope of this class. Then you need some kind of data structure that lets you go efficiently in both directions. If you care about making incremental additions to your dictionary-encoded data, then I think the shared-leaves B+ tree approach is the way to go; if everything is static, then arrays are the right way, and our system now uses arrays.

Okay, five minutes — let's do this. I just want to show you that compression can be used in other parts of the database as well. OLTP databases are going to have a lot of indexes, because you want every transaction to always be an index lookup; you never want to do sequential scans inside a transaction, because that's going to be really, really slow. The problem is that if you have a lot of indexes, the indexes can actually take up a large portion of the overall database, and none of the compression schemes we've talked about so far solve that problem for you.

This is a study we did a few years ago where we took some common OLTP benchmarks: TPC-C, which you know about; Articles, which is basically a Reddit or Hacker News kind of website; and Voter, a benchmark from VoltDB modeled on the Japanese version of American Idol, where people call in on the phone and vote. The main thing I want to point out is that for these three workloads, the total amount of memory used for indexes can be up to almost 60% of the total amount of data stored in the database. So even if we do all that columnar compression stuff we talked about before, there's still a huge chunk of data sitting in the indexes that we're not compressing.

So we came up with a technique called hybrid indexes, and we published this in SIGMOD a few years ago with Dave Andersen and Huanchen Zhang, a PhD student working with Dave and myself. The basic idea is that we have two-stage indexes: one index where we store all the hot data in uncompressed form, and a compressed static index that's much more compact. Every so often we take all the data in the hot index and merge it into the cold, static, compressed index. The design is pretty straightforward: we have our dynamic index — a regular B+ tree — and all inserts, updates, and deletes go into it. Any time I do a read, I first check a bloom filter. Does everyone know what a bloom filter is? Raise your hand if you know what a bloom filter is... all right, good, I'll cover this next class — I assumed everyone did. Briefly, it's an approximate set-membership data structure: a really compact structure that tells you whether something exists or not. It can give you false positives — it can tell you something exists that doesn't actually exist — but it will never give you false negatives, so if something's not in the bloom filter, it definitely doesn't exist. So you check this, and either it tells you the thing you want is in the dynamic index and you go read it there, or the thing you want isn't in there and you go check the static index.
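Here's a sketch of that two-stage read path — just an illustration of the control flow, not the paper's implementation; I'm standing in a plain Python set for the bloom filter and dicts for the two stages, which keeps the logic but obviously none of the space savings of the real data structures:

```python
class HybridIndex:
    """Two-stage index: dynamic (hot, uncompressed) + static (cold, compressed)."""
    def __init__(self):
        self.dynamic = {}        # stand-in for a regular B+ tree / skip list
        self.static = {}         # stand-in for the compacted, read-only structure
        self.bloom = set()       # stand-in for a bloom filter over the dynamic stage

    def insert(self, key, value):
        self.dynamic[key] = value
        self.bloom.add(key)      # a real bloom filter would set hashed bits, not store the key

    def lookup(self, key):
        # Check the filter first: "not present" is definitive; "present" may be a false positive,
        # which is why we still verify against the dynamic stage before trusting it.
        if key in self.bloom and key in self.dynamic:
            return self.dynamic[key]
        return self.static.get(key)

    def merge(self):
        """Every so often, migrate the hot entries into the frozen static stage."""
        self.static.update(self.dynamic)
        self.dynamic.clear()
        self.bloom.clear()
```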
Again, the dynamic side is the regular B+ tree or skip list or Bw-tree — whatever you want to use from the structures we talked about before — and the static side is going to be some compressed data structure that doesn't look like a B+ tree but is much smaller.

So where can we save some bits in a B+ tree? Well, assuming the thing is static, we're never going to update it. The B+ tree is really designed for efficient dynamic updates — there's that rule that says every node always has to be at least half full — so it keeps a lot of extra space around on the assumption that someone is going to come along and insert more data. The first thing we recognize is that all those empty slots will never be used, because we already said we're freezing this thing and making it immutable. That's wasted space we can reclaim, and now we end up with a much more compact B+ tree. The next thing we recognize is that there are all these pointers that let us traverse from one node to the next, or from one level down to the next, and they're 64 bits each because they're just regular pointers to other nodes in the heap. We can compress that even further by storing the leaf nodes as one giant array — it almost looks like a skip list now — with some structure at the top that's just offsets that let us jump to the right place in that array. And since we know where everything is in memory, we can store those offsets in fewer than 64 bits. That's essentially what we're doing on the static side: we remove all the extra space that was pre-allocated for data that's never going to come, and we remove as many pointers as possible. We can do this for the B+ tree, a skip list, Masstree, and the ART index as well — all of those can be compressed or compacted using these techniques.
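To give a flavor of the static stage, here's a minimal sketch — illustrative only, not the paper's actual structure: instead of half-empty nodes connected by 64-bit pointers, the frozen index keeps its keys in one dense sorted array and finds things with offsets and binary search:

```python
from bisect import bisect_left

class CompactStaticIndex:
    """Frozen index: dense sorted arrays, no per-node slack, no 64-bit child pointers."""
    def __init__(self, items):                     # items: iterable of (key, value) pairs
        pairs = sorted(items)
        self.keys = [k for k, _ in pairs]          # one dense array instead of leaf nodes
        self.values = [v for _, v in pairs]

    def lookup(self, key):
        i = bisect_left(self.keys, key)            # offsets + binary search replace pointer chasing
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

static = CompactStaticIndex([(5, "e"), (1, "a"), (3, "c")])
assert static.lookup(3) == "c" and static.lookup(2) is None
```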
Just to give you an idea of the performance and storage benefit you can get from this: in the graph, the dark line is the original B+ tree and the red line is the hybrid index — again, the hybrid index has the dynamic stage, a regular B+ tree, in front, and the compressed structure in the second stage. What's crazy is that you actually still get a performance benefit, because the static side is now much smaller: you have fewer cache misses, fewer cache-line fills, and you can keep more data in your caches than you could otherwise, on top of using less memory. These things are correlated: because you have less data and use less space to store it, you get better performance. It's not always a trade-off where I accept worse performance in exchange for a better compression ratio — when you get down to cache-line granularity, storing less data makes a huge difference.

All right, 4:20 on the dot. Any questions about this? Again, I just want to show you that it's not just about compressing columns — you can compress other things too. Oracle lets you do all sorts of crazy stuff: you can declare, here's my OLTP data, compress it this way; here's my analytical data, compress it that way; and here's my archive data that I'm storing out on a really slow disk, do super heavy compression on that.

All right, main takeaways. Dictionary encoding is the most useful compression scheme because it doesn't require us to pre-sort things like we had to do for RLE or some of the delta encoding schemes — we can take any data and compress it. I think block-level, order-preserving dictionary compression is the right approach if you're building a new system today. There are specialized systems, like time-series databases, where delta encoding and RLE make a huge difference, but for general-purpose databases I think dictionary encoding is the way to go. We also saw techniques that combine multiple compression schemes to get even better compression ratios. And we want to wait as long as possible before we actually have to decompress data, because that means we copy less data from one stage to the next — from one operator to the next — in our query plan.

All right, next class is something I normally only cover at the end of the semester, but I'm moving it up to be immediately after this one because it's still related to all the storage stuff we've been talking about. It's the one class where we actually deviate from our original charter from the beginning of the semester, where we said we'd only worry about in-memory databases: next class we're going to bring back the disk, and we'll see what has to change in an in-memory database architecture if we now allow some data to be spilled out to disk. In-memory databases are super fast, they're super awesome, but everything has to fit in memory, and in a lot of environments people want to be able to shove things out to cheaper storage. So we'll see how to do that. Okay, any questions?