All right, let's get started. DJ Jupio, I know you've got a live show coming up; we'll talk about that in a second. So quickly, for you guys in the class: again, homework one is due this Friday, the 15th. Project one is out and will be due on October 1st. Even though we haven't discussed what a buffer pool is yet, or what the buffer pool manager does, you can get started if you want. Again, it's where you're going to allocate the memory that gets written out to disk.

For upcoming events: next Monday, Dana Van Aken, a database group alumna, will be giving a talk at our seminar series over Zoom. And then, actually before that, DJ Jupio is having a concert this Saturday at 9 PM on campus, in Rangos in the CUC. Are they letting you sign autographs or no? CMU is kind of weird about this. It's a big show, and they won't let him sign autographs afterwards. It's some stupid rule, yeah.

So we have a lot to discuss today; let's jump right into it. Last class, we were talking about alternative approaches to the tuple-oriented or slotted-page storage scheme that we presented last week. In particular, we spent a lot of time talking about the log-structured storage method, where instead of storing the actual tuples, you store log entries of the changes you've made to tuples. And I said that was popular in modern systems that are more write-intensive.

So the three approaches we've talked about, the tuple-oriented slotted pages, log-structured storage, and index-organized storage, are ideal for workloads that are write-heavy, meaning you're doing a lot of inserts, updates, or deletes. The log-structured one is obviously good for this because, again, you're just appending to the log. And for a lot of applications, or most applications when you start off, you're going to have a more write-heavy workload. But there are going to be some applications, some environments, some workloads, where maybe you don't care about getting the best performance for writes; what you really want is the best performance for reads. For those, these approaches may not be the best way to go.

So I'm going to spend a little time talking about what the broad categories of database applications look like, and that'll be the motivation for why we want to look at an alternative storage scheme where maybe we don't want to store everything as rows, with all of a tuple's attributes together. This is a rough categorization, but in industry, if you use these three terms, people roughly know what you mean.

The first category of applications is called OLTP, or online transaction processing. These are applications where you're ingesting new data from the outside world and you're serving a lot of users at the same time. The example application I always like to use is Amazon. When you go to the Amazon website, you look at products, you click things, you add them to your cart, and then you purchase them. Maybe you go to your account and update your mailing address or payment information. Those are all considered OLTP-style workloads because you're making changes to a small subset of the database.
You're updating your cart, updating your payment information. Or think of posting things on Reddit or Hacker News. Those are small changes to what could be a very large database, but the amount of data each query or operation changes is small. The amount of data it reads is small too; it's reading for a single entity.

Contrast this with online analytical processing, or OLAP, workloads, where I'm going to run queries that extract or extrapolate new information across the entire data set. This would be like Amazon running a query that says: find me the number one selling product in the state of Pennsylvania on a Saturday when the temperature is above 80 degrees. It's not looking at a single person or a single entity; it's looking across the entire table. It could potentially be doing a lot of joins with additional information, similar to the things you guys have done in homework one. These OLAP workloads are going to be primarily read-heavy or read-only. I'm not doing single-tuple updates; I'm doing large scans and joins over big tables.

This last one is a buzzword from industry analysts like Gartner called HTAP. Basically, there are some applications where you want to do both the OLTP-style workloads and the OLAP workloads, potentially in the same system. So instead of having to take all my transactional data, put it into a separate data warehouse, and then do my analytics there, maybe I can do some analytics directly as the data comes in. We'll discuss this throughout the semester, but the main two we want to focus on are OLTP and OLAP.

Another way to think about the distinction between them is a simple grid like this, where along the x-axis I'm showing whether the workload is write-heavy versus read-heavy, and on the y-axis, how complex the queries are. So you can divide it up like this. OLTP would be down in this corner because we're doing potentially a lot of updates, but the queries we execute are really simple, like SELECT * FROM the account table WHERE id equals Andy. It's getting single things. OLAP would be on the opposite end of the spectrum, where we're doing mostly reads, and the SELECT queries we're executing are going to be much more complex than in the OLTP world. Think of Q9 or Q10 in homework one.

The OLTP term goes back to the 80s; OLAP comes from the 90s. Jim Gray, a famous database researcher who invented a lot of the stuff we'll talk about this semester, wrote an article in the early 90s saying, hey, there's this new category of workloads called OLAP, and we should pay attention to it. It turns out he was actually getting paid by a company that was trying to sell an OLAP database product at the time, and the paper got retracted, but the name stuck around. And then he won the Turing Award for databases in, I think, '98. He's a very famous database researcher. Has anybody heard of this story? No? Has anybody heard of Jim Gray before? He's an interesting guy. He famously got lost at sea in the San Francisco Bay in 2007. He was out sailing by himself, this is not a joke, he was out sailing by himself and his boat disappeared.
And this is actually one of the early examples of crowdsourcing, because they moved satellites to take pictures of the San Francisco Bay, and people looked through the images to try to find the boat. And they never found him. All right, that's a weird tangent. I never met him, but a lot of the things we talk about, like going to Pluto versus reading the book in front of you, that was a Jim Gray metaphor. He had a lot of interesting things like that.

All right, so HTAP would be sort of in the middle. So today I want to spend time talking about why the things we've covered in the previous two lectures are going to be good for OLTP and not OLAP, and then we'll design a storage scheme that is better for OLAP.

To do this, we're going to do a real simple example using a real database. This is roughly what the Wikipedia database looks like. It runs software called MediaWiki, on MySQL and PHP. It's open source, you can go look at it, and the schema roughly looks like this. There are user accounts, the people actually making changes. There are pages, the articles in Wikipedia. And then there are revisions for those articles. The revision table has foreign key references to the user that made the change and to the actual page itself, and all the text goes in the revision table. There's also a foreign key going from page back to revision so you can find the latest revision.

So I said this before and I'll say it again: the relational model does not define or specify anything about how we should store the data in a table. In all the examples I've shown so far, we're storing every tuple with all its attributes one after another. Yes, we said there are overflow pages for large attributes, but in general, all the smaller attributes are stored together. There's nothing in the relational model that says you have to do that; it's just what we as humans came up with first, and it's easy to conceptually think about. But again, for OLAP workloads, this may not be the best thing.

So let's see how it works for OLTP. For OLTP, it's going to be a lot of queries that are real simple, and they read or write a small amount of data relative to all the data in the database. The first query here just says: for a given page, given its page ID, go get me the latest revision for it. It's a join against the revision table, but it's one page and one revision being retrieved. The next one, the update query, is somebody logging in. You have a user ID, assuming they've been authenticated, and you update the user account with the timestamp of the last time they logged in and the hostname they logged in from. Or if I do an insert into the revision table, it's inserting a single row. And this is what people usually end up with when they build a brand new application. If you create a startup and start building some online service, you usually end up with something that looks like this, because you don't have any data in the beginning; you need to collect it, so you make a website, and your website runs these kinds of queries.
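To make the shape of these OLTP queries concrete, here's a minimal sketch using a simplified, hypothetical version of a MediaWiki-style schema (the table and column names are made up for illustration; the real schema has many more columns). The point is just how small each operation's footprint is: one row read, one row updated, one row inserted.

```python
import sqlite3

# Hypothetical, simplified MediaWiki-style schema -- illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE useracct (user_id INTEGER PRIMARY KEY, user_name TEXT,
                           last_login TEXT, host_name TEXT);
    CREATE TABLE page     (page_id INTEGER PRIMARY KEY, title TEXT,
                           latest_rev_id INTEGER);
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, page_id INTEGER,
                           user_id INTEGER, content TEXT,
                           FOREIGN KEY (page_id) REFERENCES page(page_id),
                           FOREIGN KEY (user_id) REFERENCES useracct(user_id));
""")
conn.execute("INSERT INTO useracct VALUES (1, 'andy', NULL, NULL)")
conn.execute("INSERT INTO page VALUES (10, 'Database', 100)")
conn.execute("INSERT INTO revision VALUES (100, 10, 1, 'v1 text')")

# OLTP #1: point lookup + join -- fetch the latest revision of one page.
latest = conn.execute("""
    SELECT r.content FROM page p JOIN revision r ON p.latest_rev_id = r.rev_id
    WHERE p.page_id = ?""", (10,)).fetchone()

# OLTP #2: single-row update -- record a user's login time and hostname.
conn.execute("UPDATE useracct SET last_login = ?, host_name = ? WHERE user_id = ?",
             ("2023-09-13 12:00:00", "cmu.edu", 1))

# OLTP #3: single-row insert -- add a new revision.
conn.execute("INSERT INTO revision VALUES (101, 10, 1, 'v2 text')")
print(latest)
```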
Again, for OLAP, we're going to do more complicated things that require us to look at larger portions of the table. So this is a rough approximation of a real query people were running: you look at the user accounts in Wikipedia and find all the login attempts from users that had an IP address or hostname ending with .gov. There was a scandal in the late 2000s, early 2010s, where people in Congress were having their staff go update Wikipedia to say more flattering things about the congressman or congresswoman. So this query could find all the people doing that: we're basically paying government employees to go update Wikipedia, and they shouldn't have been. Again, these are queries we execute on data after we've already collected it from the OLTP portion of the application.

So the thing we need to talk about now is what I'll call the storage model. This is how the database system physically organizes tuples on disk and in memory, relative to other tuples and to their own attributes. Up until now I've been assuming that all of the attributes of a tuple are contiguous, and that's roughly what's called a row store. But again, for OLAP, that may not be the best thing, and we'll see why in a second. The reason we have to discuss this part of the system is that there is a clear distinction in the marketplace now, in the database world, between a row-store system and a column-store system. A row-store system is what you'd want to use for OLTP, and a column-store system is what you'd want to use for OLAP. And if anybody tries to say, hey, I have a fast row store that you can use for analytics, you should be very skeptical.

All right, so the three choices are: the n-ary storage model, or NSM, that's the row store; the decomposition storage model, DSM, that's the column store; and then a hybrid approach, which is actually the most common one for column stores, we'll see why in a second, called PAX, or partition attributes across. Most of the time when people say they have a column store, they really have the PAX one, but it's not a major, major difference, okay?

All right, let's start with the first one, NSM, the row store. Again, this is what we've already seen so far this semester. We assume that almost all the attributes for a given tuple are stored contiguously in a single page, one after another. The idea is that you're going across the page and laying out all the data for a given tuple, and you don't lay down any bits for the next tuple until you finish the current tuple. The reason this is better for an OLTP application, as I said before, is that most of the queries are accessing a single entity, a single tuple. So I can go to a single page and get all the data I need for that single tuple, and that's really all I need to satisfy the query. We already talked about page sizes, but again, it's always going to be some multiple of hardware pages. So this is basically the same layout we saw before, right?
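Here's a minimal sketch of the row-store (NSM) layout just described, and walked through in more detail below: a slotted page where each tuple's attributes are serialized back-to-back, and a slot array in the header records where each tuple starts. It's illustrative only; real pages track far more metadata (free-space pointers, null bitmaps, and so on), and the slot array usually grows from one end while tuple data grows from the other.

```python
import struct

class SlottedPage:
    """Toy NSM page: tuples appended back-to-back, slot array holds byte offsets."""
    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.slots = []        # slot array: starting offset of each tuple
        self.free_ptr = 0      # next free byte in the page

    def insert(self, user_id: int, user_name: str, host_name: str) -> int:
        name_b, host_b = user_name.encode(), host_name.encode()
        # fixed-width int followed by two length-prefixed varchars, all contiguous
        tup = struct.pack("<iH", user_id, len(name_b)) + name_b + \
              struct.pack("<H", len(host_b)) + host_b
        off = self.free_ptr
        self.data[off:off + len(tup)] = tup
        self.free_ptr += len(tup)
        self.slots.append(off)
        return len(self.slots) - 1          # slot number

    def get(self, slot: int):
        off = self.slots[slot]
        user_id, name_len = struct.unpack_from("<iH", self.data, off)
        off += 6
        user_name = bytes(self.data[off:off + name_len]).decode(); off += name_len
        (host_len,) = struct.unpack_from("<H", self.data, off); off += 2
        host_name = bytes(self.data[off:off + host_len]).decode()
        return user_id, user_name, host_name

page = SlottedPage()
s = page.insert(1, "andy", "cmu.edu")
print(page.get(s))   # -> (1, 'andy', 'cmu.edu')
```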
We have some database page, we have a header at the front with the slot array, and then as the application starts inserting data, it just appends entries to the end and keeps adding more and more until the page fills up. And now if a query comes along that says SELECT * FROM this table WHERE id equals something, we go get this one page, jump to the offset defined in the slot array, and we get all the data that we need.

So let's see how this works in our Wikipedia example. Say you have your query here where someone wants to log in. They're passing in a username and a password, and you're just checking whether that matches. This is roughly how you log into a database-backed application; if you do authentication against the database, it looks roughly like this. We can ignore how we actually find the data for one given user; assume there's some kind of index, a hash table, a B+ tree, it doesn't matter, we'll cover that in lecture eight, but there's a way to say: for this user account, here's the record ID, the page and offset. So we go to our page directory and find the page that has the data we're looking for, we look in the slot array, we jump to some offset, and now we have all the data we need for this query and we can produce the result. So this is ideal for OLTP because all the data is contiguous. Same thing if we want to do an insert: all we need to do is look in our page directory, find a page that has a free slot, bring it into memory, assume it's this one, and append the tuple to it. That's fine.

But now if I try to run that query from before, where I'm trying to count the number of times people logged in per month if their hostname ends with .gov, you see that I've got to scan all the pages in the table, because I need to look at all the user accounts. When I bring a page in, the way we're roughly going to execute the query, and we haven't gone through query execution yet, is that we've got this WHERE clause predicate on hostname, so we need to find the tuples in the page where the predicate on the hostname is satisfied. So the only data we really need to look at for that is just the hostname. Then we've got to compute the aggregate on the last login for the GROUP BY, and so the only other data we need for that part of the query is just these attributes here.

So what's the obvious problem? You have to go through all the rows, and you've brought in a bunch of data you actually don't need. In order to get just the attributes I needed, I had to bring in the entire page, but the entire page drags along a bunch of attributes, user ID, username, user password, that I don't need for this query. So I'm basically doing useless IO. I'm fetching data from disk that I don't even need. Not only is that slow, but in some systems you pay per disk IOP. In Aurora on Amazon, if you read something from disk, you pay for the number of IO operations you do for a query. So in this case, I'd be paying for data that I don't actually need. That's the obvious problem with NSM.
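To put a rough number on that wasted IO, here's a back-of-the-envelope sketch; the attribute widths are made up for illustration. For the .gov query we only need two of the columns, but a row-store scan drags in every attribute of every tuple.

```python
# Hypothetical fixed widths (bytes) for a useracct tuple -- illustration only.
attrs = {"user_id": 8, "user_name": 64, "user_pass": 64,
         "user_email": 64, "host_name": 32, "last_login": 8}

tuple_size = sum(attrs.values())                       # bytes fetched per tuple
needed     = attrs["host_name"] + attrs["last_login"]  # bytes the query uses

num_tuples = 10_000_000
fetched_mb = num_tuples * tuple_size / 1e6
useful_mb  = num_tuples * needed / 1e6

print(f"scanned ~{fetched_mb:.0f} MB, actually needed ~{useful_mb:.0f} MB "
      f"({needed / tuple_size:.0%} useful)")
```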
Again, NSM is great for inserts, updates, and deletes, and it's great for queries that need to get all the data for a single entity in the database. But if you want to scan large portions of a table and you only need a subset of the attributes, which is all most OLAP queries need, it's not great. It's very rare to run SELECT * on a really wide, huge table, because you're basically dumping the whole thing out, and there are utilities that do that faster than SELECT *. So this is going to be bad for OLAP because we're bringing in data we don't need.

This is a low-level detail, but going back to this portion here: if you think about how you would actually execute the query, to run this predicate I'm jumping around to different locations in memory as I scan. I've got to read the header for the first tuple, figure out how far I need to jump over to get the hostname, maybe look at the last login to compute the aggregate as I go along, but then I jump down to the next tuple and jump over again to get to its hostname attribute. On a modern superscalar CPU, this is terrible, because there are a bunch of these non-sequential accesses. The memory locations I'm accessing aren't random, because I'm always going in increasing order, but it's not like I'm just reading strides of memory and crunching through it very quickly. That's a low-level detail we don't really cover this semester, but it's at least worth mentioning.

And then, we'll cover compression later in this lecture, but with this layout we're not going to get a good compression ratio if we want to pack more data into a single page, because all the attributes for a given tuple are thrown together in that page, and there's less chance of repeated values, less chance of identifying, hey, these values are the same, I can compress them really well. Going back to our example: we have a user ID, that's an integer; usernames could be random strings; user passwords are random strings. It's all different value domains mixed together, and that's not ideal for compression.

So the alternative approach is DSM, the decomposition storage model, columnar storage. The idea here is that instead of storing all the attributes for a single tuple together in a page, we store a single attribute, for all tuples, in a single page. So instead of that last_login field being intermixed with the other attributes within a page, a page would contain only the last_login attribute. This is going to be ideal for OLAP queries, because they scan the entire table but only a subset of the attributes, and now when I fetch a page from disk, I know I'm only getting data for the attribute I actually need, not for other things that just get carried along for the ride.
So, again, the benefit of a declarative language like SQL is that you don't have to know whether you're running on a row-store system or a column-store system; the same SQL query works either way. But now it's the database system's responsibility, meaning us, the people actually building it, to take the data, split it up into separate columns, separate attributes, and then stitch it back together when we need to produce results.

All right, so this is the same diagram I showed you before, just for columns. The way to think about this is that for the first attribute, column A, we'll have a separate file with a bunch of pages. Each page has a header, just to tell us what's inside, then a null bitmap for all the values within this column, followed by the contiguous values of that column for all the tuples in the table. And we do the same for the next column, and the next one. These files will still be broken up into database pages like we talked about before, four kilobytes, eight kilobytes, whatever the system supports, but the file itself will contain just the data for a single attribute. And the metadata overhead for these files is actually much less than in a row store, because I don't have to keep track, for every single tuple, of things like which attributes can be null or the offsets of where to find each attribute; within a page it's all the same value domain, the same attribute type, so the metadata is much smaller than in a row store.

So the idea again: we go back to our Wikipedia example, we take every column of our table and store it in its own pages. If you go back to the hostname example, within a single page we're only storing values for the hostname column, and we'll have separate pages for all the other attributes. Now if you go back to the OLAP query, the counting of logins per month for government addresses, the first part of executing the query is to get the hostname; assuming it's one page per attribute, we fetch that one page and then do the scan, just ripping through the column and identifying all the matches on hostname. And I have complete utilization of all the data I brought in, because I'm only bringing in the data I need for this query; I'm not bringing in attributes I don't care about. We'll talk later, when we cover query execution, about how we match things up, but assuming I keep track of a list of the offsets of the tuples within this column that match my predicate, I then go to the last_login pages, fetch those, and again they only contain data we need. I know how to jump to the offsets of the matching hostnames to find the corresponding login timestamps, and then I compute whatever I need for the query.

So is this clear? Who here had heard of column stores before today? Yeah, less than 10%. Okay. It sort of seems obvious now, this is clearly the way you want to do this, but up until 15, maybe 20 years ago, this is not how any database system was actually built. It was very, very rare. Yes?
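Here's a minimal sketch of that two-step column scan, with in-memory Python lists standing in for the pages of each column: run the predicate over the host_name column to collect matching offsets, then use those same fixed offsets to pull values out of the last_login column and compute the per-month aggregate.

```python
from collections import Counter

# Each list stands in for the pages of one column; position i in every
# column belongs to tuple i (fixed-length offsets).
host_name  = ["cmu.edu", "senate.gov", "mit.edu", "house.gov", "senate.gov"]
last_login = ["2023-08-01", "2023-08-03", "2023-08-03", "2023-09-10", "2023-09-11"]

# Step 1: scan only the host_name column, remember which offsets match.
matches = [i for i, h in enumerate(host_name) if h.endswith(".gov")]

# Step 2: fetch only the last_login column at those offsets and aggregate
# logins per month (the GROUP BY from the query).
per_month = Counter(last_login[i][:7] for i in matches)
print(per_month)   # -> Counter({'2023-09': 2, '2023-08': 1})
```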
Sorry, your question is, if I go back to the row-store example, this one here, your question is what? Oh, this one, literally this, okay, yes. So the question is: I said there's some index, I didn't say what it was, some magic way to take the WHERE clause that says username equals something, because you'd build an index on username, and magically get to the single record, the record ID, the page and offset. How did I do that? That's what the index does; that's next week. Think of it as a key-value map or associative array: for a given key, the username, give me the record ID, or record IDs if it's non-unique, that match. So my index gives me the record ID, I look in my page directory and say, okay, I need page 123, where's that? It's either in memory or on disk, I go get it, and now I have the slot number from the record ID, I look in that page, jump to that slot, and get what I need. So that allows me to jump exactly to the page I need and then, within that page, to exactly the record I need. But again, assuming usernames are unique, I only need one record, and yet I had to fetch all these other rows I don't actually need. Yes?

Yeah, so, just to understand: the only disadvantage a column store would have is slightly more implementation work on the DBMS side? What do you mean by slightly more implementation? I mean, your DBMS has to be more intelligent, because now it has to keep a mapping of the different columns and stitch them together, whereas in a row store you don't really care; once you have access to the whole row, you can compare the hostname and the login in a single place, whereas if you're doing it column-based, you need some more intelligence to put the output together, because you're matching two different columns.

I would say they're about equally hard. If you don't care about other things, like we're not talking about transactions yet, then yeah, I would agree that a row store would be easier to implement. Everything's in one place, and assuming every record fits into a single page, ignoring overflow pages, then a row store is potentially easier. Yes?

The question is: whether the database system stores things as a row store versus a column store, is that something that's configurable per table, or is it all or nothing? Most systems are going to be row-only or column-only. The HTAP stuff, the hybrid stuff, tries to do both. Typically you would say: all right, I know I'm using this system, it's a column store, so I'll store everything in there; all the tables will be column-only. Now, even though it's a column store and we're optimizing for read-mostly queries, people obviously still want to update data. The way you typically get around that is that these systems have a row-store buffer area, typically log-structured: if I have any updates, I apply them to that row portion, and then periodically I merge them into the column store. That's one approach.
Oracle does a different approach, where the row store is considered the primary storage location of the database, but they'll make a copy of your tables in a column-store format and they're responsible for keeping that copy updated for you. So there are different approaches, but typically, if a system supports choosing row store versus column store, you could define it on a per-table basis; most systems don't do that. Yes?

The question is: if I have as many disks as I have columns, assuming I break up a table in a column store so every attribute goes to a separate disk, would writes be as fast as a row store? Well, no, because you still have to do the splitting apart and write it all out. And you'd also have more memory pressure, because, say I have 1,000 attributes: now if I have to update a tuple, I have 1,000 pages in my buffer pool, I have to update each of them with a single new attribute value and then write them all out. Typically, doing updates in a column-store system without the sort of buffer I just mentioned is always going to be slower. Yes?

The question is, what is the null bitmap, and is there an equivalent in row stores? Yeah, we discussed this last class. It's a way to represent which attributes are null. I'm not drawing it here, but we had the diagram last class: in the header of every row, there's a bitmap that says which attributes are null or not. That's one approach, the common one. We also talked about using a special value, or you could have a little flag in front of each attribute. The null bitmap basically says, for this tuple, attribute one is null, attribute two is not null, and so on. In the column store, instead of having that bitmap per tuple in the tuple header, there's just one per column: here's the null bitmap for the entire column. Yes?

The question is, in a row store the index tells you where to find one row; in a column store, what does the index actually point to? So some OLAP systems that are column stores don't give you any indexes at all. I don't think Snowflake gives you an index; you can't create one, though that might have changed. Vertica, you couldn't have indexes. Because, again, they're not trying to do point queries or single-record lookups; they're doing complete scans. And you're right, you could have indexes that are more like range indexes: if your ID is between zero and a hundred, go to this page, things like that. There are also inverted indexes, like find me all the records where the keyword Andy exists, and that doesn't look like a tree structure, it's usually a hash table. So there are different types of indexes, but you might not have indexes for point-query lookups.

All right, cool. So, let's jump back. I was kind of hand-wavy about this part here, where I said, okay, let me go fetch the page that has the hostname, run my WHERE clause, I get a bunch of matches, and then let me go fetch the last_login pages. And then I had a magic way of finding the matches there. How did I do that? Well, there are two approaches. The most common one is to use fixed-length offsets, and that means you identify tuples not by a slot number; you identify unique tuples by their position.
This is why I don't want to say row; I prefer tuple over row or record, because what does a row even look like in a column store? Tuple is the better term. The unique identifier for a tuple is going to be its offset within the table. So if I'm at offset three in one column, I know to jump to offset three in every other column, and then I can stitch that tuple back together. But again, this only works if the values are fixed-length. Of course, what breaks that assumption? Variable-length data: varchars, strings, blobs, text. We'll talk about how to handle that in a second.

So this approach, fixed-length offsets, is the most common one. A legacy approach is to use embedded IDs, where with every single value you store some unique tuple identifier, sort of like the log-structured stuff, where there's a counter being incremented by one. And then there's some index structure, which I'm not showing here, where for a given record ID and a given column, it tells you where to jump to. This is rare; I probably shouldn't even mention it, but it is one way to do it. There was some system in the old days, I forget which, that did this, because they were kind of contorting a row store into a column store. But everyone uses fixed-length offsets.

Of course, the problem you've got to deal with now is how to convert variable-length values into fixed-length values. And I bet you can guess how you do that. Yes? He says pointers; pointers to what? For every column, make sure the data is also contiguous. Yeah, that could potentially work. The problem with that is if you do in-place updates: if everything's just packed in and immutable, you don't have this problem, but if it is mutable, then you can get fragmentation. He says slot arrays; but what do they point to? The starting address of every column entry. Yeah, that's similar to what he was saying; that could potentially work. Compression, right? So the pointer approach would work, but I don't think anybody does that. You could just pad things out, but that's going to be wasteful, as we said before. This is basically how dictionary compression works: dictionary compression replaces a variable-length value with an integer code, which is fixed-length, usually 32 bits. We can evaluate predicates directly on the dictionary codes, and only if a code matches something we're looking for do we do a lookup to find what the actual value is. And that's a typo on the slide: it's not more on this next week, it's more on this in the next hour; we'll discuss it now. So that's how we solve this problem, and it's pretty much the way everyone does it in a modern system.
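Here's a minimal sketch of the fixed-length-offset idea, with dictionary encoding standing in for the variable-length column (the names and values are made up for illustration): the varchar values live once in a dictionary, each column stores only fixed-width codes or values, and a tuple is stitched back together by reading the same offset in every column.

```python
# Variable-length user names get replaced by fixed-width integer codes.
dictionary = {"andy": 0, "dj_jupio": 1, "dana": 2}       # value -> code
reverse    = {code: val for val, code in dictionary.items()}  # code -> value

# Column files: every entry is fixed-length, so offset i identifies tuple i.
user_name_codes = [0, 1, 2, 0]          # codes instead of strings
login_count     = [7, 3, 12, 1]         # plain fixed-width integers

def fetch_tuple(offset: int):
    """Stitch tuple `offset` back together by reading the same offset in each column."""
    return reverse[user_name_codes[offset]], login_count[offset]

print(fetch_tuple(2))   # -> ('dana', 12)
```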
So this column-store idea is not new. According to the literature, the very first version of it goes back to the 1970s: there was a project out of the Swedish defense ministry called Cantor. It was more of a file system than a database system, but it's considered the first documented proposal for a column-store-like system; I don't know whether it exists today. In the 1980s, there was a paper that mapped out the theoretical properties of the decomposition storage model, but again, it was still mostly only in academia.

Roughly, what's considered the first commercial implementation of a column-store system was this thing called Sybase IQ. But it wasn't really a full-fledged database system; it was more like a query accelerator. Similar to what I was saying before about Oracle, it made a copy of your row store into an in-memory column store; this is basically what Sybase was doing back in the 1990s. So your query would show up, and Sybase IQ would figure out: should I go to the row store and do something there, or should I run the query only on the in-memory column store?

The 2000s is when the column-store stuff really took off. The three key systems in this space were Vertica, which was founded by my PhD advisors, Stan Zdonik and Mike Stonebraker; VectorWise, which spun out of the MonetDB work at CWI; and MonetDB itself, which was a major academic project at CWI as well. DuckDB is from CWI too; the first version of DuckDB was actually called MonetDBLite, and they threw all that code away and started DuckDB from scratch after what they learned from MonetDBLite. VectorWise was started by some of the people who worked on MonetDB. And of the two main people at VectorWise, one of them left and was a co-founder of Snowflake, so a lot of the early ideas that VectorWise developed are in Snowflake. The other one, Peter Boncz, went back to CWI and helped advise the DuckDB project. There were a bunch of other systems at the time, but I'd consider these the three major ones, the pioneers in this space.

And how this all came about, the way Mike tells it, or Stan tells it, is that he was consulting for Walmart Labs in the early 2000s, and they were struggling to scale their Teradata database, which at the time was a row store. I think Walmart's was multiple petabytes; it was the database of every single transaction: any time somebody bought something at a store and it got scanned at the cash register, it went into that database. They were struggling to get Teradata to run fast, and Mike was like, oh, we should just make this a column store. That became the C-Store project, which became Vertica, and it was a pretty famous project.

Anyway, now pretty much everybody does this. This is just a sample of the database systems out there that are considered column stores. The two other key things that are interesting about the 2010s are the open-source file formats, Parquet and ORC. Parquet came out of Twitter and Cloudera, I think, and ORC came out of Facebook. These are open-source, column-oriented file formats, and now you can build database systems that can read and write Parquet and ORC files.

All right, so the advantages of the columnar or decomposition storage model: we greatly reduce the amount of wasted IO we have to do for analytical queries, because we're only reading exactly the data we need. We get better cache reuse and better locality in our access patterns, because we're literally just ripping through columns one after another and not jumping around in memory, which is better for CPUs. And we get better compression, which we're coming to.
But of course, the downside is that it's going to be slow for point queries, and slow for inserts, updates, and deletes, because we have to split things up, write the data out to multiple locations, and then bring it back in if we want to put a tuple back together. Yes? The question is about something on the previous slide, whether that's a real way of building a database? No, consider that an Easter egg. But PancakeDB is real; that's a real system.

One thing to point out, though, is that in my earlier example, the way I ran that query, I did the scan on the hostname column, then I ran the next portion of the query on the login column. You sort of think of it as: I did one, then I moved on to the other. In a lot of cases, though, queries actually want to look at multiple columns at the same time. My WHERE clause only referenced one attribute, but as you've seen in the queries you've written for homework one, oftentimes you have multiple columns, multiple attributes, referenced in your WHERE clause. And it would be kind of expensive or cumbersome, as I'm scanning along one column, to be fetching another column at the same time and trying to patch things together. So we still want a way to have the attributes that are going to be used together be somewhat close to each other on disk, in our files, but still get all the benefits of a column-store layout. And this is what the PAX model is. As I said, when most systems say they're a column store, they're really doing this; Parquet and ORC are really doing this.

The idea is not mind-blowing. It basically says: instead of having a separate file for every single column or attribute by itself, I'll break the table up into chunks, into row groups, and keep data from the same tuples close to each other in the same file, while still splitting the columns out within that chunk. So if we go back to our mock example: all we're going to do is horizontally partition the table into row groups, and then within each row group we partition it by column. Think of it like this: the first three rows here get some portion of my giant file; I define the row group, I have a header for that row group, and then I have all the values for column A together, then all the values for column B, then all the values for column C. So now, if I have a WHERE clause that needs to access both column A and column C, when I fetch the pages for this row group I have all the data I need close together, but I'm also getting the benefit of sequential IO, because this row group is going to be in the tens of megabytes instead of four-kilobyte or eight-kilobyte pages. Same thing for the next row group, and so forth.

And this is roughly how Parquet and ORC work. There are a lot of diagrams and presentations of what Parquet and ORC look like, and they're basically using the same language we're using here. Here they say the default size of a page is one megabyte, because they want to group things together and get as much sequential IO as they can, and then a row group is going to be 128 megabytes, okay?
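Here's a minimal sketch of the PAX / row-group idea: partition the table horizontally into row groups, and within each row group lay the columns out separately, so a query that touches columns A and C reads them from one contiguous chunk of the file without ever touching column B. (The group size of three is just for illustration; as noted above, real row groups are tens or hundreds of megabytes.)

```python
def to_row_groups(tuples, group_size):
    """Horizontally partition into row groups; store columns separately inside each."""
    groups = []
    for start in range(0, len(tuples), group_size):
        chunk = tuples[start:start + group_size]
        # column-wise layout within the row group: all A's, then all B's, then all C's
        groups.append({
            "a": [t[0] for t in chunk],
            "b": [t[1] for t in chunk],
            "c": [t[2] for t in chunk],
        })
    return groups

table = [(i, f"name{i}", i * 10) for i in range(9)]
row_groups = to_row_groups(table, group_size=3)

# A query over columns a and c touches only those two column chunks per group.
for rg in row_groups:
    for a, c in zip(rg["a"], rg["c"]):
        pass  # evaluate the predicate / aggregate here without ever reading column b

print(row_groups[0]["a"], row_groups[0]["c"])   # -> [0, 1, 2] [0, 10, 20]
```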
Yes? The comment was: couldn't this still have the problem where you do a bunch of useless IO if you're doing a full table scan? Well, the header tells you where things are. Because the file is so huge, I just bring in this first header; actually I'm showing it as a header, but in real systems, in Parquet and ORC, it's actually a footer, because the file is immutable, so you don't know where everything is going to be until you finish writing it. It's in the footer, but that's an aside. So: do we have the same problem as a row store here with this PAX thing, where if I bring an entire row group in, I read a bunch of stuff I don't need? You don't bring the whole row group in. You bring the header in, and it says, okay, here are the offsets of where my attributes are, and then I go fetch just those. You can actually see it here: the footer, instead of the header, holds the metadata of what's in the file, the row group and column metadata, where the offsets are. It's in the footer for Parquet, and ORC works the same way.

All right, so as I've said multiple times, IO is always going to be the main bottleneck, especially for analytical queries. If we assume the data is on disk, then whatever the exact size of a tuple in a table is, every page brings in exactly that much data per tuple. So the most obvious ways to speed up queries are: you can skip data, or you can make every page you fetch bring more useful data into memory. Skipping data is what the column-store stuff helps with, because you avoid reading attributes you don't need. Compression is the other way: for every page I fetch, I get more tuples than I would if it were uncompressed.

Now, there's going to be a trade-off between speed and compression ratio. Disk is potentially much slower than the CPU, especially in the cloud setting, so I'm willing to pay the extra cost of compressing and decompressing data, because it reduces the number of IOPS, or the amount of time I'm wasting on IO, to fetch things in. The difference between disk speed and CPU speed is getting smaller, though; in some cases disk is getting so fast lately that maybe you don't want things to be compressed. There are also some other benefits we'll see in a second: if you keep things compressed, the database system can actually run faster when it processes things in memory, and we'll cover that in a few weeks. But in general, for most systems, compressing things on disk is always going to be a win.

So any compression scheme we use has to produce fixed-length values, as I said before, because if we want to store this in a column store, we want to make sure we always have fixed-length offsets. In some cases, we also want to delay when we actually decompress things for as long as possible while we execute queries. We'll talk about this more when we get to query execution, but the idea is that if I have a bunch of these one-megabyte strings in my table, but I can convert them to 32-bit integers, I want to process 32-bit integers for as long as possible, because I have to copy data from one operator to the next as I execute the query, or, if it's a distributed system, copy it over the network. I want to keep things compressed as long as possible and only decompress when something actually needs the real value, when the DBMS needs it decompressed or the user needs the output.
Joins make that harder, but we'll cover that later. And then the most important thing we need from any compression scheme in a database system is that it has to be lossless. You know what that means, lossy versus lossless, yes? There's no information loss when you compress and decompress things. A lossy scheme would be something like MP3, MP4, or JPEG, which exploit how humans perceive audio or visual data to compress things down to a much smaller size. That means if you take the raw image or the raw sound file and compress it, you're not going to get back the same values when you decompress it. We don't want to do that in a database system because, as I said before, people don't like losing data. If you have $100 in your bank account and the data gets compressed, and when it gets decompressed you now have $90, you're going to notice, you're going to complain. So most systems will not use a lossy scheme.

You can do lossy compression yourself, at the application level. Say I'm keeping track of the temperature of this room every second, and I do this for 10 years. Do I really need to know the exact temperature at a one-second interval a year from now? No, I can compress it down to the average temperature per minute. I can't get back the original data because it's been aggregated away, and that might be okay. But that's something you, the user, the application, a human, has to decide is an okay thing to do. The database system can't make that call for you, therefore the database system is always going to use a lossless scheme.

So now the question is, at what granularity do we actually want to compress? There are a couple of different choices. One is that we can compress a whole page or block of data, that is, all the tuples stored together in the same page. We can compress a single tuple by itself, if it's a row-store system. We can go even more fine-grained than that and say, within one tuple, I'll compress one single attribute. Think of the overflow pages we talked about before: if you're storing huge text attributes, like the revisions in Wikipedia, it could be a lot of text. I forget what the largest Wikipedia article is, it's something like Star Wars, so that could be kilobytes of text data, and I could compress just that one value. Postgres does this, and a bunch of other systems do it too. Or, alternatively, I can compress a single column, if it's a column-store system.

So let's talk about how you do this at the block level, and then we'll spend most of our time talking about how to do it at the column level, because that matters the most in a columnar system. To do it at the block level, we essentially have to use a naive compression scheme. By naive, I mean that the database system is making a call to a third-party library, like GZIP. You actually wouldn't want to use GZIP because it's slow, but it's a third-party library that takes the page and compresses it down to some binary form that the database system has no way to interpret; it can't do any introspection into the compressed version of the block. If I call GZIP on a file, the database system doesn't know how to read inside that compressed file; it has to decompress the whole thing to get back the original version.
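Here's a minimal sketch of what "naive" block-level compression means, using Python's zlib as the stand-in third-party library and JSON as a stand-in serialization format: the DBMS hands the whole serialized page to the library, gets back opaque bytes it cannot interpret, and has to decompress the entire page before it can read even one tuple.

```python
import json
import zlib

# A page of tuples, serialized however the DBMS likes (JSON here, just for the sketch).
page = [{"user_id": i, "host_name": "senate.gov", "last_login": "2023-09-01"}
        for i in range(100)]
raw = json.dumps(page).encode()

compressed = zlib.compress(raw, level=6)   # this is what gets written to disk
print(len(raw), "->", len(compressed), "bytes on disk")

# The compressed bytes are opaque: to read *any* tuple, the whole block must be
# decompressed first -- the DBMS can't evaluate predicates inside it.
tuples = json.loads(zlib.decompress(compressed))
print(tuples[0]["host_name"])
```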
So again, you wouldn't want to use GZIP itself; there are a bunch of faster alternatives. LZO was a big breakthrough in the 1990s, and Zstandard, from Facebook, is considered the state-of-the-art compression scheme now. They're actually working on a new version that's even faster and better, but it's not public yet. Zstandard is what you should be using.

So let's see how MySQL does this. MySQL supports table compression; you declare it on a per-table basis, and I don't think it's on by default. The way it works is that all your pages, when they're written to disk, get compressed down to a page size that's a power of two, from one up to eight kilobytes. And each page has a portion called the mod log, which is sort of like the row-store buffer thing I mentioned before, where I can apply a bunch of writes and changes to the page without having to decompress it first. It's like a little extra space at the beginning. And say your page after compression is six kilobytes: they'll pad it up to the next size within one, two, four, eight, and this ensures you don't have fragmentation in your layout on disk or when you bring things into memory.

All right, so say a query runs and wants to touch page zero. If I'm doing a blind write, like an insert or a delete, or even an update, assuming I already have the values, I don't need to decompress the page; I just write that change to the mod log. It's just log-structured, like we talked about before; I said we were going to see this idea throughout the rest of the semester, and you can think of the mod log as the log-structured storage we talked about. In some cases I can even serve reads from the mod log: if the data I need was just inserted and it's sitting in the mod log, I don't have to decompress the rest of the page. But if I do need to read the page, they'll decompress it and store it as a regular 16-kilobyte page in memory in the buffer pool, because that's MySQL's default page size, and then I can do whatever reads I want on that, but they still keep the compressed version around. And I think when it gets decompressed, they apply the changes from the mod log to the page at that point.

Is this a good idea or a bad idea? Postgres doesn't do this. Yes? He says for reading, it's not super great. Why? Not necessarily; going back here: if I do an insert and it lands in the mod log, I don't have to decompress anything. There's some bookkeeping in the index saying, okay, the record ID points to this page, and then you look in the mod log, and for that slot number or record ID, it's really in the mod log, not the full page. So you don't have to decompress it.

I actually think it's a decent idea. And I say Postgres doesn't do it, but Postgres is not the gospel; if Postgres doesn't do something, it doesn't mean you shouldn't be doing it. Postgres is actually an amazing front end; the back end is actually pretty terrible, because a lot of the design is remnants from the 1980s and it's not how you would build a modern system today. And they don't support compression for regular data pages like this, only for TOAST tables, the overflow pages.
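Going back to the mod log for a second, here's a minimal sketch of the idea, not MySQL's actual InnoDB implementation: the page body stays compressed, small writes are appended to an uncompressed mod log next to it, and reads only pay the decompression cost when they need something that isn't in the log (at which point the log is merged back in).

```python
import json
import zlib

class CompressedPage:
    """Toy compressed page with a mod log -- an illustration of the idea only."""
    def __init__(self, tuples):
        self.blob = zlib.compress(json.dumps(tuples).encode())  # compressed body
        self.mod_log = []                                        # uncompressed recent writes

    def insert(self, tup):
        # Blind write: append to the mod log, no decompression needed.
        self.mod_log.append(tup)

    def read(self, user_id):
        # Serve from the mod log if possible...
        for tup in reversed(self.mod_log):
            if tup["user_id"] == user_id:
                return tup
        # ...otherwise decompress the body and merge the mod log into it.
        tuples = json.loads(zlib.decompress(self.blob)) + self.mod_log
        self.blob, self.mod_log = zlib.compress(json.dumps(tuples).encode()), []
        return next(t for t in tuples if t["user_id"] == user_id)

page = CompressedPage([{"user_id": 1, "name": "andy"}])
page.insert({"user_id": 2, "name": "dana"})
print(page.read(2))   # served from the mod log, body never decompressed
print(page.read(1))   # forces a decompress + merge
```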
So this is actually a decent idea. It does have some challenges, though. Because MySQL is a row store, that's why it has to use a naive compression scheme; you can't do anything fancy, because the values you're storing in the pages come from all the different attributes mixed together, and you're not going to be able to use the native, column-level compression schemes we'll see in a second. And again, because they're using, I think, Snappy or Zstandard, a general-purpose compression algorithm, the database system doesn't know how to interpret what those compressed bytes actually mean. The spoiler is that all those compression algorithms I mentioned are basically doing some variant of dictionary compression: each builds its own dictionary of repeated byte sequences. But MySQL doesn't know how to read that dictionary, so it has to decompress the whole thing. For some workloads, I think this is actually a good idea, and I kind of wish Postgres did do some page compression.

All right. So if we're doing OLAP, ideally we want to be able to run our query directly on the compressed data without having to decompress it first. Say I have something like this: a table with my salary and DJ Jupio's salary, and I have some compression algorithm, I'm not saying what it is, and a compressed form of the table. If my query shows up and wants to get my salary, I do some kind of magic that converts the constant string 'Andy' in the query into its compressed form, and now I can do a direct lookup on my compressed table using my compressed constant, without having to decompress every single page as I go along. Now I've reduced the amount of IO I have to do, because I'm fetching compressed pages, and I don't have to decompress them in order to do lookups into them. So this is ideally what we want, and the easiest way to get this is going to be in a columnar system.

So this is a quick overview of a bunch of different compression algorithms you could use. The spoiler is that dictionary encoding is the default choice for most systems. You may not want to compress a single column using the other schemes alone, though we'll see some examples where it does make sense, but after you do dictionary encoding, you can apply these other compression schemes on the dictionary itself, or on the dictionary-encoded values, and get even further compression. You get a sort of multiplicative effect when you compress one way and then run another compression algorithm on the compressed data, and it's still done in a way where the database system can natively interpret what those bytes mean in compressed form, without having to decompress them first. And this is why you want the database system to do everything, and you don't want the OS or anybody else doing it, because the database system can do this native compression.
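Here's a minimal sketch of that "get Andy's salary" example (the values are made up): the name column is dictionary-encoded, so the query constant gets translated into its code once, and the scan then compares fixed-width integers without ever decompressing the column.

```python
# Dictionary-encoded name column: variable-length strings become integer codes.
dictionary = {"andy": 0, "dj_jupio": 1, "dana": 2}
name_codes = [1, 0, 2, 0]            # the compressed column, scanned as-is
salary     = [99999, 10, 88888, 10]  # toy values; Andy appears twice here

# Rewrite the predicate: translate the constant into its compressed form once...
target = dictionary["andy"]

# ...then evaluate it directly on the compressed codes.
andys_salaries = [salary[i] for i, code in enumerate(name_codes) if code == target]
print(andys_salaries)   # -> [10, 10]
```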
All right, so let's do some quick examples. One approach is called run-length encoding, RLE. The basic idea is that if we have contiguous runs of the same value, instead of storing that value over and over again for every single tuple, I store a compressed summary that says: this value, starting at this offset, occurs this many times. This works great if your data is sorted on the column you're trying to compress. You can't always do that, but if you can sort, you maximize the length of the repeated runs.

Say I have a table with an ID field and a column that says whether somebody is dead or not, yes or no. It's yes or no; there's no null, there's no maybe. So we can compress this. The compressed form essentially just scans through the column, finds the contiguous tuples that have the same value, and converts them into a triplet that says: here's the value, it starts at this offset, and here's the length of the run. And now, if a query comes along like count the number of people that are dead versus not dead, I can just rip through that is_dead column and compute my aggregate by summing up the lengths of the runs, along with the value.

I can actually do even better. I have this little part here where somebody's not dead, then dead, then not dead, so I end up with these triplets where the run length is one. In that case I'm actually doing worse, because I'm storing a whole triplet where I could have just stored a single value. But if I sort the data on whether somebody is dead or not, now my RLE-compressed column only has two entries: here are all the dead people and here are all the not-dead people. That greatly reduces the amount of data I have to store. Always think of it at the extremes; my examples are small because I have to fit them on the slides, eight or nine tuples here, but if I have a billion tuples, a billion people, keeping track of whether each of them is dead or not compresses down to a small number of bytes, and that fits on one page.

You need the offset and the length in the triplet because, again, we're assuming fixed-length offsets everywhere: they let you do the math to reverse it back and say, okay, for a single tuple, a single entry, are they dead or not, this value would be at this offset if the column were uncompressed. And that's just simple arithmetic.
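Here's a minimal sketch of run-length encoding with the (value, start offset, run length) triplets, plus the simple lookup arithmetic that recovers a single tuple's value from the compressed form, and the effect sorting has on the number of runs.

```python
def rle_encode(column):
    """Compress a column into (value, start_offset, run_length) triplets."""
    runs = []
    for i, v in enumerate(column):
        if runs and runs[-1][0] == v:
            val, start, length = runs[-1]
            runs[-1] = (val, start, length + 1)
        else:
            runs.append((v, i, 1))
    return runs

def rle_lookup(runs, offset):
    """Find the value at a given tuple offset without decompressing everything."""
    for val, start, length in runs:
        if start <= offset < start + length:
            return val

is_dead = ["N", "N", "N", "Y", "N", "Y", "Y", "Y", "Y"]
print(rle_encode(is_dead))          # unsorted: 4 runs, two of them length 1
print(rle_encode(sorted(is_dead)))  # sorted: just 2 runs -> [('N', 0, 4), ('Y', 4, 5)]
print(rle_lookup(rle_encode(is_dead), 3))   # -> 'Y'
```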
The next compression scheme is called bit-packing. The idea here is that people often declare attributes or columns with a type that is larger than they actually need. For example, if I have a column keeping track of some number and I declare it as an integer type, in SQL that's a 32-bit integer, so even if the values are small I still allocate 32 bits to store each one. For these numbers here, none of them are very big, but I'm always storing 32 bits, so to store these eight numbers I need 256 bits.

But the only thing that actually matters is the lower portion of the bits, because that's the real data; the other 24 bits per value are just wasted space. So instead, even though you declared it as a 32-bit integer, I'm going to store it as an 8-bit integer, and that reduces the size by a factor of four: I go from 256 bits down to 64 bits. And you can do a bunch of tricks with bit-shifting operators and SIMD, which we can talk about later this semester: because these are now 8-bit integers, I can put four of them into a single 32-bit word, keep track inside my system of which value sits at which offset, and then with a single CPU instruction operate on four values at once.

What's the problem with this? Yes? [Student: what if a value actually needs the full 32 bits?] Boom, okay, excellent, thank you. So the point is: what happens if I have a number that can't be stored in the eight bits I'm trying to pack it into? The way you get around this is a technique from Amazon Redshift called mostly encoding. The idea is to say: most of the data in my column is going to be small enough, but in the cases where it's not, I'll keep track of that value and store it separately in a little lookup table. So again, I have these 32-bit numbers, but there's this one value, 999, that's too big. I still store everything as eight bits, but for that one I store a special marker value, think of all the bits set to one, and then I have a separate table that says, for a given offset, here's what the original value should be. Now as I'm scanning along the column, if I see the special marker value, I know to look in the offset table and find out what the real value should have been.

Yes? [Student: could you also vary the width, say the next few values fit in four bits, then the next few in eight, then the next few in 32?] Yeah, so the question is, couldn't you do something like the triplets where, instead of saying everything is always eight bits, you say I have 1,000 contiguous values that fit in four bits, and then the next batch in 12 bits, or whatever. Going back to Parquet and ORC, because they break the data up into row groups, each row group could have its own compression scheme, so you could do something like that. I think Parquet has more aggressive compression than ORC, or maybe it's the other way around; one of them is very simple and the other has a bunch of the tricks you're describing.

All right, so in this example the original size is 256 bits, but if I do the mostly encoding I only have to store 8 x 8 bits for the mostly-8 column, and then, assuming I only need 16 bits for the offset and 32 bits for the value in the lookup table, which is not quite true because you'd obviously allocate more for additional metadata, it comes out to 112 bits. So that's pretty good.
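Here is a rough sketch of that mostly-8 idea in Python, under the assumption that a sentinel byte of 0xFF marks "look in the exception table"; Redshift's actual MOSTLY encodings are more involved, so treat the names and the sentinel as illustrative only.

    # Sketch of bit-packing with a "mostly" escape hatch: values that fit
    # in 8 bits go into a packed byte array; anything bigger is replaced
    # by a sentinel and stored in a side table keyed by offset.

    SENTINEL = 0xFF  # all bits set marks "look in the exception table"

    def mostly8_encode(values):
        packed = bytearray()
        exceptions = {}           # offset -> original 32-bit value
        for offset, v in enumerate(values):
            if 0 <= v < SENTINEL:
                packed.append(v)
            else:
                packed.append(SENTINEL)
                exceptions[offset] = v
        return packed, exceptions

    def mostly8_get(packed, exceptions, offset):
        v = packed[offset]
        return exceptions[offset] if v == SENTINEL else v

    if __name__ == "__main__":
        column = [13, 191, 56, 92, 81, 120, 999, 58]
        packed, exceptions = mostly8_encode(column)
        print(len(packed), "bytes instead of", 4 * len(column))
        print(mostly8_get(packed, exceptions, 6))  # -> 999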
Another trick is called bitmap encoding. The idea here is that if you have an attribute with low cardinality, meaning a small number of unique values, then instead of storing the actual value for every single tuple in the column, I maintain bitmaps: one bitmap for every possible value the column can have, with the bit at a given position set to one if the tuple at that offset has that particular value. There are some database systems that provide bitmap indexes that give you essentially the same thing: you still have the original column, but they maintain bitmap indexes using the same technique we're seeing here. There's a company that's going to come talk about their system later this semester, I think it's FeatureBase; there's also Featureform, two different companies with 'feature' in the name. One of them only stores bitmap indexes; you can't actually store the base data.

All right, so here's the idea. Say we go back to our is_dead column. There are only two possible values, dead or not dead, so instead of storing the individual values themselves I have two bitmaps, one for yes and one for no, and for each tuple a bit is set in whichever bitmap corresponds to its original value. Now I only need 16 bits to store the two values 'yes' and 'no' themselves, and the bitmaps are 18 bits, because I have nine tuples and each one needs a bit in each of the two bitmaps. So I can get this down to 34 bits.

What's an obvious problem with this approach? Yes. [Student: if your data is high cardinality, this is really bad.] If your data is high cardinality, this is a terrible idea, and indeed, yes, it is. Let's look at an example. Say we have a customer table with a zip code column. How many zip codes are there in the United States? Anybody have a guess? I hear 10,000; it's more. 100,000? Less. Now we're doing binary search. It's about 43,000. Assume we have a table with 10 million rows and I build a bitmap for every single possible zip code. Just storing the raw data, assuming the zip code fits in 32 bits, is 40 megabytes, but if I need a 10-million-bit bitmap for every single zip code, now we're at 53 gigabytes. So clearly this is a bad idea. Furthermore, every time somebody adds a new tuple I have to extend the bitmaps, because all the offsets have to line up, and I have to do that for every possible value. So bitmap indexes can make a huge difference, but really only when you have a very small cardinality, like fewer than maybe ten distinct values, and I think most systems don't do this by default.
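Here is a minimal bitmap-encoding sketch, using Python integers as the bitmaps; real bitmap indexes typically use compressed bitmap representations, but the principle of one bitmap per distinct value is the same.

    # Sketch of bitmap encoding: one bitmap per distinct value, with bit i
    # set if tuple i has that value. Counting becomes a popcount, with no
    # scan of the original column needed.

    def bitmap_encode(column):
        bitmaps = {}              # value -> integer used as a bitmap
        for offset, value in enumerate(column):
            bitmaps[value] = bitmaps.get(value, 0) | (1 << offset)
        return bitmaps

    def count(bitmaps, value):
        return bin(bitmaps.get(value, 0)).count("1")

    if __name__ == "__main__":
        is_dead = ["N", "N", "Y", "N", "Y", "Y", "N", "Y", "N"]
        bitmaps = bitmap_encode(is_dead)
        print(count(bitmaps, "Y"), "dead,", count(bitmaps, "N"), "alive")
        # With a high-cardinality column (say 43,000 zip codes over 10M rows)
        # you would need one 10M-bit bitmap per zip code: hence the ~53 GB.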
All right, delta encoding. The idea here is that if the values from one tuple to the next are close enough to each other, maybe I don't need to store the entire value for each tuple; I just need to store the difference, the delta, from the previous value. Say I have some kind of sensor reading where I'm keeping track of the temperature in the room and storing a reading every minute. With this timestamp column, assuming we store it as 64 bits, we know the time is always incrementing by one minute, and assuming I'm keeping track of the temperature in a room or outside, from one minute to the next there aren't going to be dramatic swings; we're not going to go from 99 degrees to zero degrees within a minute. So what I can do is store, for each tuple, the difference from the previous one: for the timestamp it's just plus one minute, and for the temperature it's some small decimal difference from the previous reading.

I can compress this even further. What do I have in that timestamp column now? A bunch of plus ones. How can we compress that? Run-length encoding, right? So we can take the combination of delta encoding and run-length encoding and just store how many plus-ones follow. This is a good example of the multiplicative effect again: we can compress the compressed data even further because we've put it into a form that can take advantage of it. Going back to the original data size, just for the timestamp column we were at 320 bits, but if we do delta encoding followed by RLE we can get it down to 96 bits. Again, I'm only showing six or seven tuples here, so it's not that big, but think of the extreme: with a billion records this would be a massive savings.
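Here is a small sketch of delta encoding followed by RLE on the deltas, using epoch-second timestamps so the repeated delta is +60 rather than the slide's +1 minute; the helper names are mine, not from any particular system.

    # Sketch of delta encoding followed by RLE on the deltas: store the
    # first value, then the difference from the previous tuple, then
    # collapse runs of identical deltas (the +60s in a timestamp column).

    def delta_encode(column):
        deltas = [column[0]]
        for prev, cur in zip(column, column[1:]):
            deltas.append(cur - prev)
        return deltas

    def rle(values):
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    if __name__ == "__main__":
        timestamps = [1664145600, 1664145660, 1664145720, 1664145780, 1664145840]
        temps = [99.5, 99.4, 99.5, 99.6, 99.8]
        print(rle(delta_encode(timestamps)))  # [[1664145600, 1], [60, 4]]
        print(delta_encode(temps))            # first value, then small deltas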
All right, the last one to discuss is dictionary compression. As I said, this is the most common one; this is how most systems are going to compress data, even for things that aren't strings. Some columnar systems will compress integer data and floating-point data into dictionary codes as well. The idea is that if we have values we see over and over again, instead of storing that value repeatedly within a column, we convert it into some 32-bit integer, and we maintain a mapping data structure, the dictionary, that knows how to take that dictionary code, the 32-bit integer, and convert it back into the original value. Typically there's a one-to-one correspondence: one value gets one dictionary code. There are some techniques, though I don't think any commercial system does this, where if you see multiple attributes whose values pattern together, you convert the combination of two or three of them into a single dictionary code to get even further compression, but I've only seen that in the academic literature.

We also need a way to do fast encoding and decoding on the fly, and we want to support both point and range queries. Point queries are obvious: I want to be able to say the string 'Andy' maps to code 101 and do exact lookups on that. But ideally I also want to be able to do range queries on the compressed data, so I want my dictionary codes to have the same ordering as the original values, and we'll see how to do that in a second.

So say this is my original data, a bunch of names of my former students. The compressed version takes my original column and converts those names into 32-bit integers, and then I have a mapping table, the dictionary, that lets me look up, for a given code, what the original value is, or for a given value, what its code is. So now, going back to my example from the very beginning with me and DJ2PL, select star from users where name equals 'Andy': I convert the string 'Andy' into its dictionary code by doing a lookup in the dictionary first, and then I scan through my column doing comparisons on the integers. If I don't compress my constant, then as I scan along I have to decompress each entry one by one before doing my lookups, and I've basically lost all the benefit of the compression. That's what MySQL has to do, because it can't interpret what's actually in the dictionary, it can't interpret what the compressed bytes mean. But here, because we're the database system, we built the dictionary, we control it, and we know how to read and interpret it. And because it's SQL, we know what the query wants to do, so we can take that constant, convert it to its dictionary code, and run the scan directly on the compressed data.

So how do we actually do the encoding and decoding? For a given uncompressed value we need to get its compressed form and then be able to reverse it. The key thing to point out is that there's no magic hash function that can do this for us; anything reversible is likely to produce something larger than the original value, not shrink it down to a 32-bit integer. So we have to build and maintain a data structure that does the mapping. And as I said, we want something that preserves the ordering of the original values, so that the dictionary codes have the same ordering, lexicographically, as the original data. Going back to this example: I have a bunch of names, and I want the dictionary to generate codes such that if one original value sorts before another, its dictionary code is also smaller. So my dictionary is basically sorted.

That lets me do queries like this: select star from users where name like 'And%', A-N-D followed by the wildcard. If we operate directly on the compressed data, we can convert this like clause into a between clause: we look in the dictionary, run the like predicate just on the dictionary values, find the ones that match, take the min and max codes of the matching values, and rewrite the like into a between over the codes. Then we rip through the column while it's still compressed. Again, we can do this because it's SQL: the predicate is in the where clause, it's not arbitrary Python or C code, so we know exactly what it wants to do and can be smart and do the rewriting for you. And you as the application programmer, or not you guys, but some JavaScript programmer, they don't have to know what the hell is going on underneath the covers; they just write the like clause and the database system rewrites it and gets better performance.
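To tie those pieces together, here is a minimal sketch of an order-preserving dictionary and the like-to-range rewrite in Python; the names in the column are placeholders, and real systems do this rewrite inside the optimizer rather than in application code.

    # Sketch of an order-preserving dictionary: codes are assigned in
    # sorted order, so a prefix predicate (LIKE 'And%') can be evaluated
    # as an integer range check on the encoded column.

    def build_dictionary(column):
        values = sorted(set(column))
        encode = {v: code for code, v in enumerate(values)}
        decode = {code: v for v, code in encode.items()}
        return encode, decode

    def prefix_to_code_range(encode, prefix):
        # Run the LIKE only against the (small) dictionary, then return
        # the min and max matching codes so the scan becomes a BETWEEN.
        codes = [c for v, c in encode.items() if v.startswith(prefix)]
        return (min(codes), max(codes)) if codes else None

    if __name__ == "__main__":
        names = ["Andrea", "Andy", "Matt", "Prashanth", "Andy", "Matt"]
        encode, decode = build_dictionary(names)
        encoded_column = [encode[n] for n in names]

        lo, hi = prefix_to_code_range(encode, "And")
        # Scan the compressed column directly; no per-tuple decompression.
        matches = [decode[c] for c in encoded_column if lo <= c <= hi]
        print(matches)   # ['Andrea', 'Andy', 'Andy']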
Now, in some cases you still have to scan the whole column. In this query here, since I need to output the name attribute, I still have to rip through the column and actually look at the entries. But in some cases the database system can be even smarter and answer queries without looking at the compressed column at all, operating directly on the dictionary instead. So instead of select name from users, if it were select distinct name from users, where I don't need the actual tuples themselves, only the unique values, then after I convert the wildcard predicate into its dictionary range, I only need to know which values exist in the dictionary; I never have to look at the actual column. For that distinct query, assuming I only have four names in my table but a billion rows, I only need to look at four entries in the dictionary to answer it. And again, I've said this multiple times, we can do this because the database system is the one responsible for the compression.

Now, Parquet and ORC: one of their big limitations is that they don't actually expose the dictionary to you when you use their libraries and utilities. So you can't do this trick I'm talking about if you're using Parquet or ORC; they decompress the data when they hand it back to you, so you can't operate directly on compressed data. In my opinion that's one of the biggest limitations of those two formats. Other systems that do native compression without Parquet and ORC can do this trick.

All right, so what data structure do we use for the dictionary? The most common approach is a really simple array. This works great if the files are immutable, because I build the array once and never have to resize it, insert things in place, or move things around. If you need something dynamic that supports updates, you need either a hash table or a B+Tree, but those are less common; most people use the array, assume the compressed blocks are immutable, and just rebuild the array if they ever need to. I realize I'm over time, so let me show roughly what it looks like. You have your original data in your column. The first thing you do is build the dictionary, and all that's going to be is a sorted list of the values you have, where each entry stores the length of the string followed by the string itself, and the dictionary code is just a byte offset into that array. So my compressed column would look like a list of these byte offsets. Now when I'm doing a scan, say the second entry's code is 17: I jump to byte offset 17 in the array, look at the header there, and it tells me how big the string that follows is. So the dictionary itself is literally just a packed array of bytes like that. Okay?
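And here is a sketch of that packed-array layout in Python, assuming a 4-byte length header in front of each string; the exact header width and the resulting byte offsets are illustrative, not taken from any specific system.

    # Sketch of the array-style dictionary: sorted, length-prefixed
    # strings packed into one byte buffer. The dictionary "code" for a
    # value is its byte offset into the buffer, so decoding is a read of
    # the length header followed by that many bytes.

    import struct

    def pack_dictionary(values):
        buf = bytearray()
        code_of = {}
        for v in sorted(set(values)):
            code_of[v] = len(buf)                # code = byte offset
            data = v.encode("utf-8")
            buf += struct.pack("<I", len(data))  # 4-byte length header
            buf += data
        return bytes(buf), code_of

    def decode(buf, offset):
        (length,) = struct.unpack_from("<I", buf, offset)
        return buf[offset + 4 : offset + 4 + length].decode("utf-8")

    if __name__ == "__main__":
        buf, code_of = pack_dictionary(["Andy", "Matt", "Andrea", "Andy"])
        column = [code_of[n] for n in ["Andy", "Matt", "Andrea", "Andy"]]
        print(column)                        # byte offsets: [10, 18, 0, 10]
        print([decode(buf, c) for c in column])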
All right, so to finish up: this row store versus column store distinction is going to be really important, and we'll see it show up when we talk about query execution and other things, mostly before the midterm, because the difference between a row store and a column store has ramifications throughout all the other parts of the database system: how you do recovery, how you do query execution, how you run transactions, how you optimize your queries. So it's really important to understand this now, and we'll see the trade-offs between the two approaches again and again throughout the entire semester. And for most database systems, to get the best compression ratio you want to do it natively, you want to do it yourself, and dictionary encoding is the most common way to do it. I showed three lectures ago that there are two big problems in database storage. First, how are we going to represent data on disk? We've covered that so far. So starting next week, we'll talk about what happens when we bring things into memory: what do we do with it, how do we store it, and how do we write changes back out safely? Okay? All right, hit it.