Yo, hey, yo, hey, yo, yo. Pack the chrome styles, fly like Mrs. Jones. Lyrical mathematics will have the devil's smoke and stone. I put heads to bed, lick shots, and rap was fled. Sorry about being out this last week. I had to go to Seattle for some stuff, and then my b---- caught COVID, so I had to do a bunch of extra stuff. I wasn't around her when she was sick, which ate up all my time. So I apologize for not posting this lecture while I was gone and also for not updating the schedule. As I posted on Piazza, this is basically the class we should have had last Wednesday, but we'll have it today. Everything has been shifted up by one, and we dropped the materialized view lecture. All right, so a lot to cover today. So let's jump into this. Today's lecture is about the lowest level of the database system, meaning how we're actually going to represent data in the database on disk and in memory. What are the actual bits or bytes for representing tuples and their attributes and their values and so forth? And what the data is going to look like is going to have a very important impact or influence on how we're going to design the rest of the database system. In particular, we're going to talk about row stores versus column stores, because that was the paper you guys were just assigned to read. And whether it's a row store or a column store, whether the data is laid out in a row-oriented or a column-oriented format, that determines so many different aspects of the system and the different design choices we can make: how we're going to process tuples, how we're going to move data from one relational operator to the next, how we're going to materialize the intermediate results as we go from one operator to the next, what algorithms we're going to use. Like, some joins will be faster because we're a column store and our hash table can be much smaller. Whether or not we're going to support ingestion and updates, that one we'll cover a little bit today, but we won't go too much into it. Actually, not until we talk about Databricks at the end of the semester. For concurrency control, how to support transactions, we're actually going to ignore all of this as well. But just think about it: if I need to update a tuple that's split up into columns, well, now I've got to go acquire latches or locks for a bunch of different physical locations on disk or in memory to do that update. Again, we'll ignore that for now. And query optimization, of course, is influenced by all of this. The query optimizer has to know what the database system is actually doing to determine what's the best query plan to use. So again, just keep in the back of your mind as we go along, even though today we're only talking about the lowest levels of the storage layer, you will see throughout the entire semester that a lot of the decisions we make now will determine how we make other decisions later on. Another challenge is that there will be times when we start talking about certain techniques in the storage models where the solution to a hard problem is something we will cover either next class or next week. In particular, fixed-length data versus variable-length data: the solution is dictionary compression, which we will cover next week. So the paper talks about these things, but we won't go into too much detail about them just yet. Know that these things are also coming up later on. So today we're going to focus mostly on the storage model.
Again, row store versus column store versus the hybrid store. Then we'll talk about how we represent individual values. And then we'll spend a little bit of time talking about how to handle partitioning at a high level. OK? So first thing, storage models. A storage model defines how the database system is going to physically organize tuples on disk and in memory. Now, I say the database system is going to organize this as if it's responsible for actually laying down the bits in a file. But as we said before, in a modern OLAP system, in this data lake model, it could just be a bunch of existing S3 buckets that already have CSV or Parquet files in them, and the database system just knows how to read them. So for our purposes, just assume that the database system is the one generating this data, but as we know, that isn't always actually the case. OK? So there are three choices: the n-ary storage model, that's the row store; the decomposition storage model, that's the column store; and then the last one is the hybrid store, the PAX model. Again, the paper you guys read came out, I think, in 2006. It is the definitive paper, in my opinion. Even though it's old, it still is the best paper that describes exactly the pros and cons of a row store versus a column store. Oh, whatever, it's old, right? But again, you were in middle school or elementary school then. Yeah, so it's old, right? Yeah, I should have said, Dan was already at Yale by then; he wasn't at MIT anymore. But there is a textbook that they wrote about how to build column store systems. I obviously didn't want to assign that to you guys, because it's like 100 pages. But this is a good summary of the pros and cons. The thing I'll point out is they mention PAX. They cite the PAX paper, but they don't actually go into detail. But this is what Parquet is, this is what ORC is, this is what all the modern systems are using. We'll cover these, and that'll lead into why we want to do the last one. Most systems today, when they say they're a column store, chances are they're this: they're PAX. So the n-ary storage model, the row store model, is where the database system stores almost all of the attributes for a tuple contiguously with each other in a single page. And the reason I say almost is that most systems that support very large variable-length fields, like a text field or a binary field, will, for anything above a certain size, offload it into auxiliary storage pages. In Postgres they call this TOAST; I think in Oracle or MySQL they call them overflow pages. You don't inline them. So that's why I'm saying almost. But we can ignore blobs for now. So the row store is going to be ideal for OLTP workloads, where most of the transactions are only going to touch single entities or entries in the database, and they'll be very insert- or update-heavy workloads. By single entities, this means: think of a table of user accounts, and you're using some application, you log in, and you only touch your user account. There may be millions of accounts, but each individual transaction is only accessing a small number at a time. So with this storage model, the iterator processing model or the volcano processing model, which we'll cover in a few weeks, is the ideal processing model for these workloads and for this storage model, because most of the query is accessing a single tuple at a time. So the operator can just go grab that one tuple and then feed it up the query plan, to the next operator in the query plan.
Most of the database systems that are going to be row stores will typically use database page sizes that are some constant multiple of what the hardware provides you. The largest page size that the hardware provides where you can do atomic writes is 4 kilobytes. And so most database systems, when they have their database pages, it'll be like 4 or 8 or 16 kilobytes. They won't go really, really big, as we will see in a column store system. The reason is that if you're doing updates, you don't want to try to do a 1 megabyte update across a bunch of 4 kilobyte pages and then crash halfway through and have to go back and fix things up. So they try to keep things in smaller page sizes. So it essentially looks like this. Say I have a simple database that has six rows and three columns. In a disk-oriented system, all the data that's going to be fixed length and variable length, we want to pack together into a single page. So we have our database page here. Again, say it's 4 kilobytes. There's always going to be a header, and then you have the slot array. Again, this is just a slotted page design. These slots are going to point to locations in the page where you can find the given row. So now, if another part of the system, like an index, wants to reference a tuple in a row-oriented architecture, you need to use the page number and then the offset within the slot array. That way you can move things around in the page without worrying about having to update your indexes. It's an additional layer of indirection. So say I want to insert this first tuple up here. Well, I'm going to put it at the end of the page and have the slot array now point to its location. The way it works is that the slot array will grow from the beginning toward the end, and all the data will grow from the end toward the beginning. Now I want to insert the next tuple, and I just, again, append it to the end here and update my slot array. I can do this for all the other ones and end up with something like this. One thing to point out here is that for every single tuple, we're going to have a little header space for it in the page. This will keep track of which values are null, plus any other additional metadata. Typically you don't put the schema there, because you store that in a centralized location in the catalog; otherwise it's redundant, since all the tuples within a single page are from the same table, so you don't need to store that over and over again. The page header up here may contain maybe a checksum for the page, the version of the database system that generated it, and so forth. But for every single tuple, there's a header you're maintaining for it. So now let's say I come along with an OLAP query like this. Really simple: I want to scan the entire table and compute the sum of column A and the average of column C for all tuples where column A is greater than 1,000. So the way you would execute this query, without even discussing what the processing model is, on this single page: you'd have to jump into it and first go look at the header for every single entry as you scan along the slot array, find the header, check to see whether the value for column A could be null, and then do the comparison. So I go fetch this page from disk.
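Before moving on, here's a minimal sketch in C of the slotted-page layout just walked through. The struct and field names are made up for illustration; they don't match any particular system's actual on-disk format.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096  /* one 4 KB database page */

/* Page header lives at the start of the page; the slot array grows
 * forward after it, and tuple data grows backward from the end. */
typedef struct {
    uint32_t checksum;       /* page-level metadata (checksum, version, ...) */
    uint16_t num_slots;      /* how many slots are in use */
    uint16_t tuple_data_off; /* offset where tuple data currently begins */
} PageHeader;

typedef struct {
    uint16_t offset;  /* where the tuple starts within the page */
    uint16_t length;  /* how many bytes the tuple occupies */
} Slot;

static void page_init(uint8_t page[PAGE_SIZE]) {
    PageHeader *hdr = (PageHeader *)page;
    hdr->checksum = 0;
    hdr->num_slots = 0;
    hdr->tuple_data_off = PAGE_SIZE;  /* tuple data grows down from the end */
}

/* Append a tuple: data grows from the end toward the beginning, the slot
 * array grows from the beginning toward the end. Returns the slot number,
 * or -1 if the page is full. */
static int page_insert(uint8_t page[PAGE_SIZE], const void *tuple, uint16_t len) {
    PageHeader *hdr = (PageHeader *)page;
    Slot *slots = (Slot *)(page + sizeof(PageHeader));

    uint16_t slots_end = sizeof(PageHeader) + (hdr->num_slots + 1) * sizeof(Slot);
    if (len > hdr->tuple_data_off || hdr->tuple_data_off - len < slots_end)
        return -1;  /* slot array and tuple data would collide */

    hdr->tuple_data_off -= len;
    memcpy(page + hdr->tuple_data_off, tuple, len);
    slots[hdr->num_slots].offset = hdr->tuple_data_off;
    slots[hdr->num_slots].length = len;
    return hdr->num_slots++;
}
```

The record ID that an index would store in this design is just the pair (page number, slot number); the slot array is the indirection that lets the page compact or move tuple bytes around without invalidating those references.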
Now it's in memory, in my buffer pool, and I'm jumping around from one entry, from one tuple to the next, following the slot array to start processing it. And if I want to start computing the aggregate, well, then I've got to go back and scan through again and maybe jump through all the C values here. So on modern CPUs, this is actually terrible. Modern CPUs with superscalar architectures are designed for, or optimized for, access patterns that are very sequential and predictable, meaning I'm going to scan through the same thing over and over again and not make arbitrary jumps into memory. I get better cache locality as well. And we'll see how to design scan operators and predicates to take advantage of superscalar architectures by removing indirection. But in general here, I'm jumping to different locations, and that's going to suck. For this example, it's a single page, so who cares? Again, always think in extremes. If I have a billion pages or a trillion pages, this is not very efficient, because I'm also bringing in data that I don't need. I don't need column B, but I had to bring that data into memory and skip over it. Every tuple has a header, and I've got to deal with that too. So the row storage model, again, is going to be great for fast inserts, updates, and deletes. It's going to be really good for queries that need the entire tuple, because it's all located together in a single page: I go fetch the one page, and it has everything I need, again, ignoring overflows. We're not going to discuss this this semester, but you can also use it for index-oriented storage, like if you have a B+tree, as in MySQL's InnoDB, where the leaf nodes are actually the tuples themselves. You get a clustering effect, so that when you do binary search on the leaf nodes you get faster performance when you try to find certain keys. But in this class, we care about OLAP systems. So it's going to be terrible for scanning large portions of the table when you only need a subset of the attributes, because the locations of the column values you're accessing are spread out across the single page, and you're bringing in other attributes that you don't need for the query. That's going to pollute your buffer pool memory. As we said before, it's bad for memory locality and access patterns because, again, I'm doing these jumps instead of doing sequential reads. And then we'll discuss this next week, but this will be terrible for compression, because now within a single page I have a bunch of different data that comes from different attributes that all have their own different value domains, and I'm not going to be able to exploit repetition or redundant information and compress things. If you ever take a zip file or an MP4 file and try to run zip on it to compress it again, it's going to do terribly, because it's binary data that doesn't have any patterns. Essentially, if you intermix all the attributes within a single page, you're having the same issue. So the solution to this, which we all sort of know about, but let's cover in more detail, is to do a column store approach, or the decomposition storage model. Decomposition storage model is how you would describe it in the academic literature. Nobody really uses the term DSM; people just say column store.
But again, when they say column store, they usually mean PAX, not the formal definition of what a DSM is. The idea here with the DSM is that the database system is going to store all the values for a single attribute of a table contiguously, one after another. Just think of a giant array: here's all the values of column A, and there's another giant array for all the values in column B. And this is going to be ideal for OLAP workloads where we do read-only queries and we only need access to a subset of the attributes, because now we just go get the arrays of the data that we need, the columns that we need, and ignore everything else. And we're going to want to use a batch, vectorized processing model where, as the query executes the operators, instead of passing up one tuple at a time, I pass a vector of values at a time that all correspond to a single column or a subset of columns. And now I'm not paying that penalty of calling get-next, get-next, get-next on my operators; I'm just moving large chunks of data from one operator to the next. So in this world, the terms file versus page versus chunk versus row group can all be slightly nebulous. But again, assume these are like Parquet files sitting on S3. The file sizes are going to be much larger than what you would typically have in a row store. So think of files of like 100 megabytes, where the page size is like a couple of megabytes or tens of megabytes or some larger size. And even though we're going to maybe store these columns as separate files, we still want to organize the tuples in such a way that we can keep track of where they're located by offsets within these files. I'll get to that in a second. So say here's our data, the same table we had before, three columns, six rows. What we're going to do is take all the values from column A and just store those contiguously in a single file. Again, these are only six values, but think of it in extremes: you would have this giant file with maybe a billion entries. There's going to be some header at the top that says here's the metadata about what's in this file. Again, checksums or what version of the system created it. You're also going to store a null bitmap separately in the header. So instead of storing it in a header per tuple, now you just have this giant bitmap at the top that says, for all my 10,000 tuples in this file, here are the bits that are set to one if the value of that attribute for the given tuple is null. And then you do the same thing for the next column, and the same thing for the next column as well. So in these examples, and I'll say this multiple times, I'm showing that the header, the metadata about what's in the file, is at the top. In an immutable file format like Parquet or ORC, this is actually at the bottom, in the footer. The idea is that you construct the file once; you don't know what's going to be in it until you actually process the data and create the file, and then you append the metadata in the footer. Whereas, again, in a row store or a system where you're supporting incremental updates, you typically put this in the header. So for illustrative purposes I'm showing a header, but real systems, Parquet or ORC, will put it in the footer.
Right, so again, the basic idea here is that for every single attribute, we have a separate file, and we have dedicated metadata at the top to tell you what's in it. So one key difference between a row store and a column store is the way we're going to identify tuples. Remember, I said in a row store it's going to be the page ID and then the slot number offset. I can't think of any system that does it differently, except for in-memory systems; this is how every sort of disk-based row store is going to work. There may be additional things, like there might be a file ID and an object number; Oracle and SQL Server have a bit more complicated addressing schemes, but in general it comes down to a page number and an offset in the slot array. In a column store, what you instead use is the offset within the column itself. So say I have a thousand tuples. I know how to jump past the header to the location where a column starts. I know the size of every single attribute; they all have to be fixed length, and I'll explain why in a second, well, this is why, but all the values will be fixed length. So I know if I want the 500th tuple, I take the size of the attribute times 500, and then I jump into that giant array of the column, and that's how I find the data that I'm looking for. I can do that for any column. So if I jump to the 500th tuple in column A, I can do the same thing in column B and column C. This is why all the attributes have to be fixed length. But of course we know that not every attribute is fixed length, right? I have a text field, a varchar, a binary field, and so forth. And so we're going to need a way to convert them back into fixed-length values, and this is where dictionary compression will come in. Yes? The question is, what if there's a null value in the middle? You would still reserve the space for that null attribute, and then you mark it in the null bitmap. So when you do a lookup now, say I'm jumping to an exact tuple and I wonder what the value is, I can look in the null bitmap to tell me whether it's null or not, and if it's not, then I go look at it, right? Okay, so you can just think of this as vertical partitioning. The reason why it's called the Decomposition Storage Model is because the idea is that you take a regular row store table and you do vertical partitioning such that each column is stored in its own one-column table. That's the metaphor people use for this. So I said that, as far as I know, every system I can think of is going to use this fixed-length offset to identify individual tuples, and we need this for stitching things back together later on. As I'm scanning along and trying to find a match, maybe I evaluate the first predicate and then the second predicate on the next column; I need to know which tuples matched in the first column and then go look up their corresponding offsets in the second column. Again, we'll cover this in a few weeks, but that's the basic idea of why we're using these fixed-length offsets. We always know how to jump within any column to go get the value we need for any one logical tuple. So as I said, most systems are going to do fixed-length offsets.
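To make that concrete, here's a rough sketch in C of what fixed-length offset addressing into a column looks like, with made-up struct names. The assumption, as described above, is that every value in the column is fixed length (variable-length types would already have been dictionary-encoded down to fixed-size integer codes) and that nulls are tracked in a separate bitmap.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* In-memory view of one column's data, e.g. after reading the file.
 * All values are fixed length; varchars would already have been
 * dictionary-encoded into 32-bit codes before landing here. */
typedef struct {
    const uint8_t *null_bitmap;  /* 1 bit per tuple, set if the value is NULL */
    const uint8_t *values;       /* the giant array of fixed-length values */
    uint32_t value_size;         /* bytes per value (e.g. 4 for int32 codes) */
    uint64_t num_tuples;
} ColumnChunk;

static bool column_is_null(const ColumnChunk *col, uint64_t tuple_id) {
    return (col->null_bitmap[tuple_id / 8] >> (tuple_id % 8)) & 1;
}

/* Fetch the value for logical tuple `tuple_id`: the same offset works
 * for column A, column B, and column C, which is how tuples get stitched
 * back together later. Returns false if the value is NULL. */
static bool column_get(const ColumnChunk *col, uint64_t tuple_id, void *out) {
    if (tuple_id >= col->num_tuples || column_is_null(col, tuple_id))
        return false;
    const uint8_t *src = col->values + tuple_id * col->value_size;
    memcpy(out, src, col->value_size);
    return true;
}
```

So fetching the 500th value from a column of 4-byte values is just values + 500 * 4, and the exact same tuple offset works against every other column of the table, which is what makes stitching tuples back together possible.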
In the research literature, there is a discussion about how you could store variable-length values, where instead of keeping everything fixed length, you actually embed the tuple ID with each value, I think as some kind of 32-bit integer. That way, if I'm at this offset here and I'm looking at tuple two, you have some index or some way to jump into these other columns to get tuple two. This is a special case. I don't remember a system that actually does this. It shows up in the literature in some places, but as far as I know, nobody actually implements it this way. There's obviously a huge storage overhead of maintaining this, plus the indexes to jump into the other columns to get the value you need. Instead, everyone just does these fixed-length offsets. And as I've already said, we need to convert any variable-length data into fixed length. An obvious trick would be to just pad out any strings with spaces at the end to make them fit. Like if it's a char(16), but most of the time I'm storing eight characters, I just pad out the remaining eight, and that way everything's fixed length. But this is wasteful and expensive, because now you're storing and examining data that's just padding, and when you do string operations you've got to go look for the offsets, look for where the padding starts. So instead, everyone's going to do the dictionary compression stuff. And the idea is that we're going to take anything that's variable length and convert it down into a fixed-length integer, usually 32 bits. Again, we'll discuss how to do this. But basically, think of the 50 states in the United States. Instead of storing the string for all 50 states over and over again, which is variable length, I just convert the state into a number and store the number. That's a very high-level explanation, but that's basically what dictionary compression is going to do for us. All right, so I said that column stores are not new. The original idea was actually proposed way back in the 1970s by the Swedish military, in a system called Cantor. It's in the research literature. I don't think it went beyond the prototype they were building. And it wasn't a database system; it was actually a file system storage approach, almost like an object store. But the paper clearly says, hey, we store things in columns and things are faster. The 1980s is when the academic side formalized the decomposition storage model. It was out there; people really didn't pay too much attention. In the 1990s, Sybase came out with this thing called Sybase IQ, which is an in-memory sort of query accelerator, basically the fractured mirrors approach we'll see in a second. You had your regular Sybase row store, and they would suck data out and convert it into a column store. Then when your query showed up, they would try to figure out how much of it could run on the column store. And then the 2000s is when this really took off: the PAX paper came out, I think in 2001, there was the C-Store project at MIT, and there were a bunch of startups in this space, Vertica, VectorWise, and MonetDB. And VectorWise we're going to see a lot later on this semester, because they're the ones that sort of invented the vectorized execution model, the vectorized processing model, and they did a lot of the early work in using SIMD to accelerate operators.
And again, the guy that created VectorWise then became a co-founder of Snowflake. I think a lot of the ideas that were developed in VectorWise made it into Snowflake. And then by the 2010s or 2020s, pretty much everyone is a column store now, because it's clearly the better way to go. I can't think of any analytical, huh? There are some systems that are claiming to do real-time analytics, and they are sort of doing a row store approach, but they'll build column-oriented indexes. What's that? Yeah, we should remove that, anyway. And this is the old DuckDB logo; we've gotta update that. Anything else on here out of date? I mean, Impala is still there, but yeah, anyway. Yes, but I mean, it knows how to operate on Parquet data. So is it actually a column store database, in that it generates columnar files? I think it does ingestion on the critical path, but it's certainly designed around assuming that it's operating on columnar data. So it'd be fair to say it is a columnar database system. Okay, yeah. All right, so what are the pros and cons of this? Well, the obvious advantage is that we're basically getting projection pushdown for free, because we only bring in the data we need for the query; we only bring in the attributes, the columns, that we need. We're going to get faster query processing because we'll be able to take advantage of the increased locality and better cache usage of the data we're accessing, because we're ripping through columns and only accessing the columns that we actually need and not having to jump around over stuff we don't. And then, again, we'll get better compression, because all the values within a column for the most part can be very similar to each other. Think of timestamps, like some event stream where the timestamps are within one second of each other. That means there'd be a lot of redundant data from one tick to the next, and we can exploit that and get better compression. The downside is that it's going to be slow for point queries, things that have to go grab a single tuple, and anytime you do inserts, updates, and deletes, assuming your system supports that, and all of them do, because now we've got to take tuples that come in from the application as rows and then split them up and store them in separate columns. Or we've got to take the columns, stitch the data back together, and return that to the client. Anytime we have to reorganize things, that's always expensive. We won't go into this too much, but there are some systems that sort all the data in your columns. So now anytime I insert something that maybe lands in the middle of my sort order, I've got to move things around to make space for it. That gets expensive. All right, so this sounds great. The DSM seems like what we want to use, except that, as I already said on the last slide, there's a penalty we're paying for having the extreme case of everything separated from everything else, where you're storing the columns in completely separate files. Because queries in an OLAP setting aren't going to be accessing only a single column. Very rarely are you going to say select ID from table foo where ID equals something, or where ID is greater than this. It's very rare that an OLAP query would only look at a single column in isolation.
So that means at some point as we execute the query, we're going to have to go back and get the other attributes for all the tuples that are matching our predicates as we're going from one operator to the next, and then stitch the tuple back together. The paper you guys read discusses this: you want to do this as late as possible. This is the late materialization technique, but at some point you still need to do it. So the idea is that we want to get the benefits of the columnar storage for compression and execution reasons, but we also want to be able to take advantage of having data that's related to each other close to each other. Maybe not exactly on the same page, but at least nearby, so that we don't have to do a bunch of extra random I/O to put things back together. So we want a columnar scheme that gets, again, the separation, but has things at least be relatively close to each other. So this is what PAX is. PAX stands for Partition Attributes Across. This came out in 2002. It was invented here at CMU by Natassa Ailamaki. She used to be the database professor at CMU, and then she left to go to EPFL in Switzerland. I am here because she left. Well, not entirely all of it, yeah. The last time this class was taught before I revived 721 was like 2006, because of her. She's great. So again, as I said, this will still be called a columnar storage format. This is what Parquet is, this is what ORC is, this is what basically CarbonData is. This is what most people mean when they say they have a columnar database; it's actually PAX, but outside of academia maybe that's less well known. So the idea here, again, the goal, is that we want to have the faster processing benefits of columnar storage, but maintain the spatial locality of having data that's related to each other, that's part of the same tuple, physically close to each other. So the idea looks like this. Now we're not going to break things up into separate files or separate pages; assume there's a single file. First, we're going to horizontally partition the rows. It could be on some key, some attribute, or it could just be the insertion order as data arrives in the system; it doesn't matter for now. So we're going to take all the data for maybe the first two or three rows and put them together, where all the values for the first column are contiguous, all the values for the second column are contiguous, and then the last one as well. I'm going to call this a row group. And every row group is going to have its own header that says things like what compression scheme they're using and where to find the offsets for these different columns. And again, don't think of a row group as a single page. In this world, these files are usually quite large. A row group could be like 100 megs, and then within a row group, the data for a single column could be broken up into multiple pages, or in Parquet I think they're called column chunks. And you do the same thing for the next row group, all right? Then again, as I said before, in immutable file formats like Parquet, the header would actually be in the footer, and this is going to contain information about the locations of the row groups, maybe a checksum for the file, and any other global metadata. But the metadata about what's actually in the tuples themselves, the attributes themselves, that'll be tied to the row group.
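To summarize the layout just described, here's a rough sketch of the metadata structures you'd find in a PAX-style file. The names are illustrative, loosely in the spirit of Parquet/ORC, and not the actual on-disk format of either.

```c
#include <stdint.h>

/* One column's values within one row group: all values for that column,
 * for that slice of rows, stored contiguously (possibly split into pages). */
typedef struct {
    uint64_t file_offset;    /* where this column chunk starts in the file */
    uint64_t compressed_len; /* size on disk after compression */
    uint64_t num_values;
    uint32_t encoding;       /* e.g. dictionary, run-length, plain */
} ColumnChunkMeta;

/* A row group: a horizontal slice of the table (say ~100 MB of rows),
 * stored column by column. */
typedef struct {
    uint64_t num_rows;
    uint32_t num_columns;
    ColumnChunkMeta *columns;  /* one entry per column of the table */
} RowGroupMeta;

/* File-level metadata. In an immutable format this lives in the footer,
 * written last, because you only know what's in the file once you're done. */
typedef struct {
    uint32_t num_row_groups;
    RowGroupMeta *row_groups;
    uint32_t writer_version;
    uint32_t checksum;
} FileFooter;
```

A scan that only needs columns A and C reads the footer, then for each row group seeks directly to the two column chunks it cares about and skips everything else in the file.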
Yes? The question is, how do I still get the benefit of locality if, within a single row group, the columns themselves are still separated? Because they're still close enough, right? It's not like they're in two completely separate files. Yes? Her question is, has someone determined the optimal size for a row group? Funny you mention this. My former student Huanchen, who I co-advised with Dave Andersen, he's now at Tsinghua, we are actually working on a survey now looking at Parquet and ORC and understanding the pros and cons. And we have found that on a modern system with modern network hardware, modern speeds, the defaults are actually terrible. The defaults are actually too small. Because, again, these file formats were designed around 2011, 2012, 2013. Networks have gotten way faster. CPUs haven't gotten that much faster, but the network and disk have gotten way faster. So, we won't talk about this in class, but you can also take this PAX file and run something like Snappy or Zstandard compression on it. But it turns out, again, if your disk is really fast, you don't want to do that, because the CPU cost of decompressing it is not worth the benefit; you can suck the data in way faster than you used to. So whether this is the best way to do things: there are a lot of studies on this. You get the benefit of the cache locality and all the things I've said before, but exactly what these parameters are, how long these strides should be, the chunk sizes, all of that depends on a lot of things. Parquet is also more heavyweight with compression, I think. Their compression scheme is way more complicated than ORC's. And sometimes simplicity is just better. Yeah. Okay, I think I've said everything here, right? Yeah, so again, every row group has its own metadata, and we're not going to discuss exactly how Parquet does it, how it works. But there are a bunch of talks where they basically describe the same thing I'm describing here. And so this is from Databricks, showing what Parquet looks like. Right here, here we go. The row group, they say, defaults to 128 megabytes, and the page size is one megabyte. Is that the best? It depends. But for faster hardware, no. All right, so we're not going to discuss the buffer pool side of this in class this semester, because we're just going to use all the same stuff we talked about in the intro class. At the end of the day, there's stuff on disk, and you've got to bring it into memory. And you don't want to use mmap, as I said multiple times, right? We want to manage this stuff ourselves. But underneath the covers, if we malloc something, what actually is that? I mean, it is almost mmap, but we're not letting the OS decide what gets evicted. So all the things about LRU-K or ARC or preventing issues like sequential flooding, all of that still applies here. The thing I do want to talk about is, because we are in an OLAP setting and we are trying to bring in larger page sizes than we would in a row store, what does that actually look like underneath the covers at the hardware level? So what's the default memory page size in Linux? Four kilobytes. Why? Because that's what Intel decided for x86 in 1985, right? There have been attempts to try to have larger page sizes.
I think ARM tried to do this with 64 kilobyte page sizes, but then it broke a bunch of stuff that was assuming four kilobytes, so they had to roll it back in Linux. So the reason why this is bad is that if we're now reading really large files, really large chunks of data, and we want to rip through them as quickly as possible, then having the operating system and the hardware keep track of these little four-kilobyte pages, even though the block of data we're reading might be 100 megabytes, is going to be expensive, because the TLB, the translation lookaside buffer in the hardware, only has so many entries. I think the smallest, first-level TLB only has something like 72 entries. So it's really hard to bring all that data in and update your TLB if you're going at four-kilobyte page sizes, right? Yes, next slide, give me a sec. Yes, he said, can't you just configure the machine to use huge pages? Yes, this is where we're going. All right, so, well, he's ruined it. All right, yeah, so huge pages. Instead of allocating four-kilobyte pages, you can turn on huge pages in Linux and get larger page sizes. I think it starts off at two megs and you can go up to one gig. When this first came out, it was touted as this huge improvement, because like I said, we have larger-memory machines, we're reading large datasets, this is going to be way more efficient. The way it works is that you allocate memory just like you normally would, and with transparent huge pages, underneath the covers Linux will try to identify: oh, these pages can be combined together into larger page sizes, and then you have fewer entries in your TLB. So for pages that are contiguous, it identifies: I've malloc'd these contiguous pages that are physically close to each other, so it can combine them together and use a single TLB entry for them. The OS is going to try to do this in the background. It wants to keep things compact; it wants to reduce fragmentation. So it's going to look for pages that could maybe be split up into smaller page sizes or combined together, and reorganize them. The downside is that when this is happening, there's now a kernel thread that's going to block the database system when it tries to access one of these pages if the kernel has started to move things around. Because, again, there's a mapping from virtual memory to physical memory, so it can prevent you from accessing virtual memory if it's changing what region of physical memory that virtual memory is actually backed by. So when transparent huge pages first came out, all the database systems basically would tell you to turn this off. This is a terrible idea. It's all in the documentation, and there are all these links here to go look at. Everybody tells you to turn this off, because the performance overhead of random stalls while the OS compacts your memory is terrible. The only system that I'm aware of, at least that I could find in a more recent search, is Vertica. They tell you you can turn this on, but only for some newer version of Red Hat or CentOS where somehow it's been fixed.
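For reference, the knobs being discussed here are small. Below is a minimal sketch of the two Linux mechanisms in play: hinting that an anonymous region should be backed by transparent huge pages via madvise, or explicitly reserving huge pages with mmap. Whether either actually helps a database is exactly the debate that follows; this is just what the calls look like, not a recommendation.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

/* Option 1: allocate anonymous memory normally, then hint that this region
 * should be backed by transparent huge pages (the kernel may or may not comply,
 * and background compaction can still cause stalls). */
static void *alloc_with_thp_hint(size_t bytes) {
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, bytes, MADV_HUGEPAGE);
    return p;
}

/* Option 2: explicitly ask for hugetlb pages (requires the administrator to
 * have reserved huge pages, e.g. via /proc/sys/vm/nr_hugepages). */
static void *alloc_explicit_hugepages(size_t bytes) {
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```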
So historically, database systems, especially older systems, even though they're reading larger blocks of data, don't want you to use the huge pages mechanism, because of all these penalties from the OS doing stuff that we don't want it to do or that the system's not aware of. More recently, though, there is research that seems to suggest that having a version of malloc that is aware of these huge pages and tries to allocate for them can actually make a huge difference. Google has a paper from, I think, 2021 where they turned on huge pages in TCMalloc, their version of malloc, and across the entire data center, for all the workloads, they saw a 7% improvement. And then specifically for Spanner, their transactional system, not an OLAP system, they saw almost a 6.5% improvement. So it seems like this would be a big win. The current literature, sorry, it's not research, it's basically all the blog articles and a bunch of benchmarking people have done for existing database systems, seems to suggest huge pages aren't the way to go, but the Google paper suggests it's worth reconsidering. And there's a blog article written actually last week by a friend of mine where he basically says this is something we need to reconsider in the context of databases. Because clearly this is what we should be using, right? This seems like a no-brainer. But as far as I know, no database system does this natively and reaps the benefits of it. I think you can turn this on in the JVM, and I think you can turn this on in Go. But again, I haven't seen anything that suggests that the OLAP systems we talked about before get the benefit of it. Yes? The question is, what has changed in the last couple of years such that you could turn this on now when you couldn't before? Because the background compaction stuff is now less aggressive and it stalls less. Yes? The question is, how difficult is it to implement this? I mean, you call madvise and ask for huge pages and that's it. Is that enough? No. But that's how easy it is to turn on. Like I said, the database system needs to be aware that it's using huge pages and maybe configure its memory allocations around that, so that I'm allocating two-megabyte blocks that are going to be contiguous, or something like that, instead of four-kilobyte chunks over and over again. Yes? The question is, what is the point of implementing huge pages from an OS perspective? I mean, the benefit is obvious, right? It's the TLB. The TLB is super small, like L1, L2, L3; actually it is L1, L2, L3. What's that? No, no, it's not a dumb question. The question is, why do we want this? Because, again, in an OLAP system we're reading potentially terabytes of data for a single query. For the general unwashed masses, for random programs or JavaScript programs? That's an honest question. Again, if you have to allocate memory and the memory you need to allocate is going to be contiguous, then you want this. If I'm doing a bunch of small little object allocations at random locations, this doesn't help you; you don't want this. But what are we doing in databases? We're allocating a large chunk of memory to read large, contiguous data. It's not a stupid question. I mean, it's good to figure it out.
Again, this is why you have to think about what the dataset looks like, what the queries look like, what the workload looks like. If you design a system for OLAP workload patterns, you can take advantage of these things. This doesn't make sense for OLTP workloads where you're doing random updates at random locations on small amounts of data. Okay, right, there was a question about what has changed. It used to be that if the kernel had to figure out where to go do compaction, in the old days that would take like seconds. In the newer versions, I think around version four of the kernel, if it can't find anything right away, then it backs off, whereas before it would lock everything and stall. Okay. All right, so let's talk about how we actually want to represent data. This is basically the same thing we talk about in the intro class; there's nothing really dramatically different here. For a bunch of primitive types like integers, bigints, floats, and reals, we'll just use what's in the IEEE 754 standard. This is the standard that defines how hardware or CPU manufacturers represent these low-level data types. You can think of it like: I allocate a variable in C++ for an integer, it's going to be 32 bits. The standard defines how the hardware should actually represent it, and there will be instructions and registers that store this data accordingly. For timestamps, depending on whether you want the time zone, typically it's going to be a 32-bit or 64-bit integer that is some number of time units since the Unix epoch; that's the simplest way to do this. For variable-length fields like varchars, varbinaries, texts, and blobs, if the value is less than 64 bits, you can just inline it. Otherwise, again, you'll have a reference to some other overflow storage, whether on disk or in memory, for where to go find it. And again, most systems are going to use dictionary compression for varchars to make things fixed length, or the address itself to the overflow storage will be fixed length. The one interesting thing I do want to spend time on is decimals. There are basically two approaches: you can have variable precision or fixed-point precision. And the question is: do we want to let the hardware handle the decimals for us, or do we want the database system to manage them for us? The trade-off is that if we let the hardware manage it, as defined by the 754 standard for either a 32-bit or 64-bit floating-point number, it's going to be faster, because the hardware supports it natively. There'll be an instruction to take two floats and add them together. It didn't always used to be that case; that came around in the '90s with floating-point units in CPUs. But now every modern CPU has it. The problem, of course, is that they don't guarantee exact values. So, I wanted to write this in Rust, but I didn't have time; it'd be the same amount of code anyway. Here we have a simple C program. We take two floating-point numbers, 0.1 and 0.2, we add them together and print it out, right? And what do you get? Well, you get what you expect: 0.1 plus 0.2 equals 0.3. But is this actually what's going on? No. If I print out all the digits and don't round it off, you see you get something like this, right? Why is this happening?
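The program itself isn't reproduced in the transcript, but it's roughly the following (a reconstruction of the idea on the slide, using doubles; the exact code and types on the slide may differ):

```c
#include <stdio.h>

int main(void) {
    double x = 0.1;
    double y = 0.2;

    /* Rounded for display: looks like what you expect. */
    printf("%.1f + %.1f = %.1f\n", x, y, x + y);   /* 0.1 + 0.2 = 0.3 */

    /* Print the full precision: the stored sum is not exactly 0.3. */
    printf("%.1f + %.1f = %.17g\n", x, y, x + y);  /* 0.1 + 0.2 = 0.30000000000000004 */
    return 0;
}
```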
Again, because the 754 standard doesn't give you exact representations of these decimal values in floating-point numbers. So if you actually look at the bits that are being stored, you get something like this. Is this okay? It depends, exactly, right? Again, that's the answer for most everything in databases. Is this going to be fast? It depends. Is this the right way to do it? It depends. If it's the temperature in my office, and I have a sensor reading every minute, then yeah, who cares, right? If it's a scientific instrument, like trying to land something on Mars, I probably don't want these rounding errors, because at large scale it's going to be problematic. So the way to handle this is called fixed-point precision numbers. And the idea here is that this is a data type implemented in the database system itself, not in the hardware, where we can manage the exact precision and scale of a decimal without any rounding errors. In the SQL standard, you would declare this as NUMERIC or DECIMAL; in some systems they're just aliases for each other. Again, this data type is implemented differently per database system. The SQL standard specifies the behavior you should get from these data types, but how you actually implement them differs from one system to the next. And in the case of Oracle, you can actually get the variable-precision floating-point numbers, but if you say you want a FLOAT or REAL, you actually get these, the fixed-point numbers. Their idea is: just avoid any problems with rounding errors, force everyone to use the fixed-point one, and they made it fast enough that you can't really tell. There is a way, I think, to specify that you want exactly the variable-precision one. So the basic idea is that we're going to store some kind of varchar or byte string of the value, and then we're going to have some extra metadata to say where the decimal point is, what the scale is, and so forth. If you want to support arbitrary precision, meaning the decimal point can appear in different locations from one value to the next within a single column, then there's some extra metadata you've got to store within the tuple itself to keep track of that. If you don't need to support that, where every single value has the decimal point in the exact same location, you can go way, way faster. So we had a project on this a few years ago, a library called libfixeypointy. When we were building NoisePage, we built our decimal type inspired by the Germans, which we'll cover later on. Well, they're people, we don't have to cover them. But in Umbra, or in HyPer, the head German there sort of told us how to do this. Thomas Neumann, yes, told us how to do this, and he's not the head German, but he's the best German. He told us how to do this, but they can't open source their code, so I had a student start implementing it. So we have an implementation of this. I need someone to work with me to fix this library up; I mean, we could try it out in Postgres. So this could be a potential project. The math works, but it's super hard. It's a lot of bit shifting to make it work, but the performance is quite good. All right, so let's quickly look at what Postgres does.
So this is their numeric type; this is the actual code from Postgres. We're not going to go through all of these fields in detail, but here's a bunch of metadata: we have to store four integers per value, plus this numeric digits array, which is just an alias for unsigned char. So you take whatever the size of that is, which is now variable length, plus the 16 bytes here, just to store one value. This is pretty heavyweight. And then if you go look at the source code here, this is how they add two numerics together. You've got to check whether one is null, check whether one is positive or negative, this giant switch statement in here. There's a lot of work you have to do to make sure that you end up with exact values. And it's not just a Postgres thing; here's MySQL. Basically the same thing, right? A bunch of metadata, and then you have this decimal digit type, which is an alias for a 32-bit integer, they store an array of them, and then they have their own add function, right? So think about this: to add two numerics together, I've got to invoke all this code, versus a single instruction that takes two floating-point numbers and adds them together. So again, you pay a penalty to have exact precision, but if it's your bank account or a scientific instrument, you care. So now let's talk about nulls. The way basically everyone does it is this one here, the second one: you have a header somewhere that has a bitmap, and a bit is set to one if the attribute at that offset in the column is null. It isn't the only way to do it. Another approach, if you really care about storage space, is to use a special value in the domain of the attribute's type to represent null. You typically see this in in-memory systems, because you don't want to pay the memory overhead of maintaining this bitmap, so you just say, all right, INT32_MIN, the minimum 32-bit integer value as defined by libc, that's going to be my null. You make sure nobody can insert that value; you just make the range of possible values you can store smaller by one. VoltDB does this, and there are a couple of other in-memory systems that do this. The dumbest idea is choice three, where, similar to the tuple ID that you would embed in each value when we talked about column stores, you actually embed a flag with every single value that could be null to determine whether it is null. Even though the flag only needs to be a single bit, you can't store it as a single bit, because you have to worry about word alignment and cache alignment. Well, you can, but it's a bad idea. You don't want to define a data type that's 33 bits, because the hardware isn't set up for that. So when you go try to read 32 bits, you're actually going to read 64 bits, or even larger because it's a cache line, but we can ignore that. So, I'm just going to spoil it: you think this is so terrible, who would ever actually do this? MemSQL used to, right? And this is their old documentation, they've since fixed this, but when you looked at their integer types, they would tell you the size of each type: the size if it can be null and the size if it can't be null. So for a Boolean, which is just one or zero, if it's not nullable, then it's one byte, but if it could be null, then they have to store it as two bytes. And you can go from eight bytes up to 12 bytes.
They're doing this because they worry about word alignment of the data that they're accessing. And that goes back to, in addition to why we want to have fixed-length values so that we can jump to offsets more easily, it's also going to ensure that all our data is nicely memory-aligned, so that we don't try to access a value that may be spread across two cache lines, because now that's two cache reads instead of one. So don't do this, but it does exist. Yeah, I debated whether to show you this, like, hey, here's a stupid idea, don't do this, but now you know about it, so is that a good idea? Okay. All right, so we've covered column stores, we've covered the PAX stuff, we've covered what the actual bits are going to look like for individual values. Let's talk a little bit about how to handle updates, if we want to do that. So, data is considered hot when it first enters the database system. This is just in general, not just for OLAP systems. Think about it in your own life: when you go to, I don't know, Twitter or TikTok, whatever the hottest thing is, BeReal, whatever, yes, OnlyFans, whatever, you go look at the latest things. You don't go back eight months, 12 months, 24 months and look at those things. So when data first enters the database, it's more likely to be accessed. It's more likely to be read and also potentially updated. Then over time, as it grows colder, it's less likely to be accessed; you may only use it for read-only queries. So we may want to organize our system in such a way that new data is stored in a way that's very efficient for quick access and updates, and then over time we migrate it to a colder storage format. So this is what the hybrid storage model is. The basic idea is that we want to have two execution engines, essentially two database systems that are linked together, that store things in a row store for updates, for hot data, and then a column store for colder data. And over time, as data gets cold, either through an expiration process or by observing the workload patterns on it, we migrate it to the column store. We can do this in a batched way, and this is why we can combine a bunch of data together, store it in a single file like a Parquet or ORC file, and then do all the compression stuff we want to do all at once, rather than trying to do things incrementally. So the two approaches you'll see in your life are fractured mirrors and the delta store. Fractured mirrors I think I've already mentioned, but the delta store approach is what, for example, Databricks is doing with their lakehouse term, or this thing called Delta Lake; this is essentially what they're doing. The idea is that you have separate storage for new updates, and then over time it migrates or percolates into the column store. Again, we'll just go over these at a high level, just so you're aware of what they are. So fractured mirrors comes from an idea from 2002, and the idea is that we're basically going to store a second copy of the database in a column store that we automatically keep updated, as I said. You have your row store, that's the primary copy of the database, and then you have this mirror copy in the DSM. And if the mirror dies, if it crashes, it goes away, who cares? The row store is considered the source of truth, the database of record. We can't lose that; if we lose the mirror, we'll just rebuild it from the row store.
So all your transactions come along and they always operate directly on the row store data, because that'll be optimized for fast updates, and then we propagate things to the column store, and we assume any analytical query gets applied over there. Of course, if someone then updates something that's already been copied over, you need a way to keep track of that, so that any analytical query knows it has to go fetch the newer data from the row store side. The delta store is probably the more common approach these days. This fractured mirrors approach is used in Oracle, which has an in-memory columnar accelerator. SQL Server is hard to keep track of, they have a bunch of versions of SQL Server, but I think they have a column store index or the Apollo engine that uses this. IBM DB2 BLU is their column store accelerator. This approach is more common in older legacy systems that already have a huge infrastructure and ecosystem around the row store: rather than building a whole separate column store system, they graft this thing on so they keep the advantage of the existing tooling, and people still get the benefit of the column store. So it's an engineering slash business decision to go with this approach. With the delta store, the idea is that, just like before, we have our front-end row store, all the updates go in there, and then over time, as things get colder, they migrate to the colder storage, the column store. And a tuple, the source of truth, the most recent version of a tuple, can only exist in one of these two. So as I propagate it into the column store, I essentially remove it from the row store, because why waste the space if I already have a copy over there? So now, again, transactions come along and you update the row store, and then any analytical query comes along, and you have to go figure out: is the data you want in either one of these, and then merge the results together. Or again, if I update a tuple that's already been migrated, I need a way to keep track of that, invalidate it in the column store, and make sure I know I'm reading the latest version. The question is, is there an underlying assumption here that the analytical queries are okay with old, stale data? Well, most systems will make a trade-off between the freshness or timeliness of the data and performance. If I care about having the latest data in my analytical queries, then I've got to go look in the row store as well, same thing with the other approach, but then I pay the penalty to go parse it out of the row store. If I don't care, then maybe I just run only on the historical data here. The Napa system from Google, they gave a talk with us last year and there's a paper which I'm not going to cover this semester, but they make a trade-off between performance, timeliness or freshness, and cost, for internal use cases at Google. But it's the same idea: I can get better performance if I'm going to pay for it, but it's a trade-off against having the latest data, because not having to go read the row store is much better. Yes? The question is, does this mean the application has to keep track of where the data is, which side it's on? No, the idea is that you consider this whole thing as the database system, whether or not it's two separate products or whatever.
Think of it as the database system being responsible for keeping track of what is where, and also for keeping track of when to move things. Now, the administrator could define rules like, okay, if the data is older than five days, move it over, or if it hasn't been touched in a week, move it over. Some systems support those kinds of rules, but the data system is responsible for facilitating the movement. The next question is, is this faster than Fractured Mirrors? Well, with Fractured Mirrors you're basically storing two copies of the entire database. Now, you can compress the hell out of the mirror, but think of that approach as being like an index: it's an auxiliary copy of the database that I have to maintain and keep in sync with the primary copy. It's another copy, whereas with the Delta Store a tuple only exists in one location. There may be a period where I update something in the row store, if the system allows it, and I still have the old version in the column store, and only later, once I do compaction or whatever, does it get purged. But yeah, in terms of space, Fractured Mirrors takes more space because you're maintaining two copies, versus one here. The next question is, is there any reason why both of these methods have one row store and one column store, as opposed to multiple tiers? Or could you have a single system that supports both a row store and a column store? Basically you're asking, why couldn't you have a column store do fast transactions? SingleStore kind of does that, but their delta store is like an append log, so it's sort of the same thing; it's within a single system, and so it's transparent to you. HyPer is the only system I can think of that did transactions directly on a column store, but because it's doing multi-versioning, it's storing delta records, so if I have a bunch of deltas I'm updating for MVCC, technically it's a row store. Another question is, in both of these cases are we not storing PAX anywhere? No, assume the column store is PAX. I say DSM, but it could be a PAX column store. Again, for the purposes of the rest of the semester, when I say column store, I'm gonna mean PAX. So yeah, I should have said column store here to be more generic. Okay, so we have two minutes left, although... oh, never mind, we still have time, let's keep going. Last semester we were ending at ten after.

All right, so I wanna briefly talk about partitioning. We're not gonna go into the details of how to actually partition, meaning how do I pick the right columns to split my data on, because that's an NP-complete problem and we won't spend too much time on it. But at the lowest level, the physical level, how are we actually gonna do partitioning? The idea here is that we wanna split the database up across multiple resources so that we can take advantage of parallelism. And even though, when we talked about PAX, there's a single file with these row groups, so you can do horizontal partitioning within the row groups, there's nothing that says a single machine has to operate on the entirety of that file. You could split things up further. Most systems don't do that, but there's nothing stopping you.
You could split things across files, and then within a file split them up again. But we can ignore that. In the NoSQL world, they call this sharding. In the academic world, we'd say partitioning, but the basic idea is the same; they're both talking about horizontal partitioning. So when we have partitioned data and we actually run a query, we're gonna break the query plan up into fragments and run those in parallel. At some point, we need to coalesce the results that the different query fragments produce into a final answer for the user. Someone sends a SQL query through the Snowflake browser tool, or from the command line, and we need to come back with a single result, so even though things are partitioned, we've gotta put it back together. The data system can partition things physically in a shared-nothing system, where we're actually moving data onto nodes that are separate from each other. Again, it's not specific to shared-nothing; you could do partitioning across different files in a shared-disk system, but we can ignore that for now. Or it can partition logically in a shared-disk system, where there's a single shared location, or shared files, but we assign different nodes to operate on either separate files or different offsets within a file. You saw this in the Snowflake paper from last week, where they talked about using consistent hashing to assign a worker node to a file that's out in storage. And then if they add a new node, the consistent hashing reorganizes things so that another node is now responsible for that file, without having to reshuffle everything. That's the same thing: they're doing horizontal partitioning. So what we wanna do here is take a table's tuples and split them up into disjoint subsets based on some partitioning key or partitioning column, according to some objective function we're trying to optimize for. Maybe we wanna improve joins, or improve data locality for doing scans or something; for our purposes here, it doesn't matter. The partitioning scheme then says how we're gonna divide things up. Hash partitioning is the most common one: there's some hash function, you take the attribute, you hash it, you mod it by the number of partitions you have, and you assign the tuple to a node that way. Range partitioning: if you know the ranges of the values ahead of time, you can split things up that way. And then predicate partitioning, which we're not really gonna talk about, would be where I define a WHERE clause to determine whether something should be in a partition or not. So the basic idea looks like this: we have a table with four columns, and say we pick one of them as our partitioning key. If we're doing hash partitioning, we take the value in that column for each tuple, hash it, and mod it by the number of partitions we have. I'm showing these as databases, but they could be files, they could be different nodes, it doesn't matter. The value of that hash function after modding determines where the data is gonna be located. So now if I come along with, the most obvious example, a query that wants to do a single lookup on one tuple based on this partitioning key, I know exactly where to go find it, right? I don't have to do any data movement, I don't have to do any reshuffling.
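Here's a minimal sketch of that hash-partitioning idea: hash the partitioning-key value, mod by the number of partitions, and the result tells you exactly which partition holds the tuple. The crc32 hash and the tiny table are just stand-ins for illustration, not what any real system uses.

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key) -> int:
    # Hash the partitioning-key value, then mod by the partition count.
    # (Real systems use their own hash functions; crc32 is just for illustration.)
    return zlib.crc32(str(key).encode()) % NUM_PARTITIONS

table = [
    (101, "andy", "pgh"),
    (102, "matt", "nyc"),
    (103, "lin", "sf"),
]

# Partition on the first column: each tuple lands in exactly one disjoint subset.
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for row in table:
    partitions[partition_for(row[0])].append(row)

# A point lookup on the partitioning key knows exactly which partition to read,
# with no data movement or reshuffling:
target = partition_for(102)
print([r for r in partitions[target] if r[0] == 102])
```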
When we talk about joins and other things, that's obviously not always gonna be the case, because I'm not always guaranteed to join on the thing I've partitioned on. Snowflake talks about, I forget whether the paper covers it, micro-partitions, but we'll see this later in the semester. Redshift does this too: it tries to do some automatic reorganization in the background based on how you're joining tables together, to partition things so that the data from two tables is partitioned on the same join key and the data is local to each other. But again, we'll cover this later. With logical partitioning, this is the Snowflake approach where I don't actually move data to a physical node, I just say which node is responsible for operating on which data in the shared-disk storage. So again, a really simple example: I'm doing range partitioning, so the top node here gets IDs one and two, the bottom node gets three and four. Best case scenario, a query shows up that says I wanna get where ID equals one; it goes to the shared-disk system, I know where to route it, and I know that this one node can produce the result. Same thing for get three, route it to the other node. But if I have a query that wants to get both three and two, then I can make a decision whether to push the query fragment up to the other node and get the result back, or copy the value down. It depends on the implementation. And that's the push-the-query-to-the-data versus pull-the-data-to-the-query design decision that we talked about last class. The question, which I'm not showing here, is how does the query know where to go? You would have some kind of middleware system in front of this. I think Snowflake calls it the, they had some name, but there'll be something up front that you submit the query to, and it looks in the catalog and says, I know where three and two are located, so I can decide where to go. And the follow-up point is that you could break the query into two parts and then coalesce the results on that front-end node. It depends on whether that front-end node is just a router or is actually a worker itself that can do some computation. Again, for our purposes here, it doesn't matter yet. For physical partitioning, because it's a shared-nothing system, each node has its own local disk, and that's where its shard or partition of the data is stored. So when a query shows up, the system knows how to route it accordingly. Again, this is the idea of horizontal partitioning. It doesn't just apply to OLTP workloads; we need it for OLAP, because at the end of the day it matters for joins and making sure things are partitioned on the same join key.

All right, so to finish up. As I've said multiple times, every modern OLAP system today is a column store; if they're not a column store, they're out of business, I can't imagine anyone who isn't. And every OLAP system that says they're a column store is gonna be using PAX. The key idea in a column store, and in PAX as well, is that all the data needs to be fixed length. If it's not fixed length, we need some method to convert it into a fixed-length value, so that we can jump to offsets using simple arithmetic to identify tuples.
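One such method, which we'll cover properly next week, is dictionary compression. As a minimal sketch of the idea (my own toy version, not any system's actual encoding): variable-length strings get replaced by fixed-width integer codes, so the column is back to simple offset arithmetic.

```python
def dictionary_encode(strings):
    """Replace variable-length strings with small fixed-width integer codes."""
    dictionary = {}
    codes = []
    for s in strings:
        if s not in dictionary:
            dictionary[s] = len(dictionary)   # assign the next unused code
        codes.append(dictionary[s])
    return dictionary, codes

names = ["wutang", "gza", "rza", "gza", "wutang"]
dictionary, codes = dictionary_encode(names)
print(codes)        # [0, 1, 2, 1, 0] -- every entry is now the same fixed size

# To decode, invert the dictionary and look the codes back up:
decode = {code: s for s, code in dictionary.items()}
print([decode[c] for c in codes])
```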
Most databases in the real world, in terms of the percentage of attributes they have, not the size of the data, are mostly gonna be integers or numeric values. Again, we did a survey that shows this is the case. But most of the data by size is actually gonna be varchars and strings. That just makes sense, right? Your zip code is a number, that's pretty small, but your name or email address is a string that's way larger. So this is why we're gonna have to rely heavily on compression to get the size of the database down, and also to convert everything to fixed length. That's super important. Another thing we didn't really get to talk about, and we'll touch on a bit next class, is that you guys are sort of spoiled in this modern era of column store systems, because they're so fast now that you don't need a database administrator to spend a lot of time, or really any time, working on the actual schema and denormalizing tables down to a snowflake schema or a star schema. You just take a bunch of Parquet files, you point whatever system can process them at it, and it's gonna rip through the columns very quickly and produce results. In the old days, like the 90s, you'd have a DBA sit down and say, okay, let me try to convert things to fact tables, reduce the number of joins, and get things down to these giant, super-wide tables so that the system was fast enough to handle complex queries on them. People still do it, DBAs can still do these things, but a lot of companies don't have DBAs. They just have a bunch of files and they point an engine at them. That's a pretty big change over the last 10 years in how people organize data warehouses. The question is, if you have normalized tables, aren't you gonna have to do joins? The answer is yes, and joins are gonna be the most expensive thing you're doing in the system most of the time. But these systems have gotten so fast that denormalizing is often not worth the effort. There would still be an advantage to denormalizing, but there are other soft costs, engineering costs, and the time to set it up, and now you also have this denormalized data that someone has to figure out what the hell it actually is, versus if you have the raw data files in columnar storage, you still have to do joins on them, but it's easier for someone to come along and say, I know it's in this table. Okay, so now we're out of time. Next class, next Wednesday, we'll be here for real, I'm back in town. We're gonna spend time talking about how to accelerate OLAP queries beyond just doing sequential scans. I've been saying multiple times that sequential scans are super fast, and there are maybe ways to build auxiliary data structures to improve that performance. The spoiler is that the paper you're gonna read is about sketches, and I'll also talk about bitmap indexes. Those can get you a big win, but as far as I know nobody does it; no major system does it. The most indexes you'll see will be inverted indexes for fast text search, and then zone maps will be the other thing that everyone does. Those are gonna be a huge win as well. Again, a zone map is an auxiliary data structure that's a summary of the table's contents that you can use to do early pruning.
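As a tiny preview of that last point, here's a minimal sketch of the zone map idea (a toy layout I made up, just to show the mechanism): keep a min/max summary per block so a scan with a range predicate can skip blocks that can't possibly contain matches.

```python
def build_zone_map(values, block_size):
    """Store (min, max) per block so scans can prune blocks early."""
    zones = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        zones.append((min(block), max(block)))
    return zones

def blocks_to_scan(zones, lo, hi):
    # Only blocks whose [min, max] range overlaps the predicate need to be read.
    return [i for i, (zmin, zmax) in enumerate(zones) if zmax >= lo and zmin <= hi]

values = [3, 7, 2, 9, 55, 60, 58, 52, 101, 99, 97, 105]
zones = build_zone_map(values, block_size=4)
print(zones)                          # [(2, 9), (52, 60), (97, 105)]
print(blocks_to_scan(zones, 50, 70))  # [1] -- two of the three blocks are pruned
```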
That's really the best you're gonna be able to do, because the overhead of maintaining a full index is probably just not worth it. Okay, and we'll also spend time talking about project one and why you can't do it in Rust. Okay, according to Xi. All right guys, see ya.