Yo, hey, yo, hey, yo, yo. Pack the chrome styles, fly like Mrs. Jones. Lyrical mathematics will have the devil's smoke and stone. I put heads to bed, lick shots, and wrap this flat. Today we're talking about OLAP indexes, and the spoiler is that a lot of the techniques I'll describe aren't actually supported by modern columnar databases. Bitmap indexes are mostly used in row-store database systems that want some amount of accelerated columnar access. But let's go through a lot of different things. Where we left off last class, we were discussing the pros and cons of columnar storage versus row-based storage. I talked about how most modern OLAP systems implement some variant of PAX, because that's better for doing sequential scans on large amounts of data, and there are a bunch of optimizations we can apply to make those run faster. The key thing to remember is that all the attributes in a columnar database must be fixed length, because the way we do addressing is through offset arithmetic. When scanning one column, if we know where the 100th offset is within that column, we can do the same offset math to jump to the 100th offset of the other columns to stitch tuples back together. The other big thing is that, not always, but if the system assumes the data files are immutable, meaning they're write once, read many, so I write a file once and never go back and make in-place updates to it, that opens up a bunch of optimizations that we'll see today. If we had to support incremental updates, deletes, or insertions, then some of these things just wouldn't work, or they'd be too expensive to maintain. So this is another big assumption that modern systems are making. The first thing, to motivate what we're talking about today: when we talk about OLAP indexes, we have to forget the things we learned in the intro class about indexes, in particular B+trees, or radix trees and tries, those tree-based data structures, or even hash tables. In the OLTP world, which is what the intro class focuses on without explicitly saying it, those indexes are designed for finding individual tuples, or a small number of tuples, with very selective predicates: go get Andy's account, or go get Andy's orders. And in the OLTP world we're also not going to assume that the files are immutable, so those index data structures, the B+trees, have to leave extra room in their nodes to accommodate later updates. In a real-world database system, a B+tree node is, on average, only about 70% full. That extra 30% is wasted space, but they want to amortize the cost of inserting new data and not have to do a split on every single operation. In the OLAP world, we don't need to support incremental updates, again assuming our files are read-only. And we typically don't need to find individual tuples, at least in the lower parts of the query plan, where we access the tables themselves. When we do joins, yes, we'll have to find individual tuples or values that match.
But if you're just doing scans, you're not going to look at Andy's record or Andy's orders. You're going to look at all the orders from some demographic, like everybody who lives in Pittsburgh. So the data structures you would use to make OLTP workloads run really fast are not what we want for analytical queries. The question, then, is what can we actually do to speed up sequential scans? Let's get into it. Here's a big list of optimizations. Some of these we're going to cover this semester, some we won't. The first one is obviously data prefetching. Instead of stalling every single time I have to go get new data, if I know I'm going to sequentially scan a bunch of files or pages that are contiguous, then while I'm scanning one I can start bringing the next one off disk into memory. Then, as I'm running my query, instead of having just one thread be responsible for scanning a terabyte of data, I can have multiple threads do it and run the task in parallel. I also get some advantages if I can pre-sort or cluster the data ahead of time, because if I'm looking for things in a certain range, once I get past a certain value I know that nothing after it can match, so I don't need to look at anything else. Late materialization we talked about last class, and we'll talk about it more later in the semester: if I have a column store, instead of having to pass the entire tuple from one operator to the next, I only have to pass around the offsets, or just the columns that an operator needs at that point, so I'm copying and moving less data around. Materialized views, as I said, we're not going to discuss in this class, but they're a variation of result caching. Result caching is really simple: if I run the exact same query, like select star from table foo where id equals four, I can take the result and cache it, and if the exact same query shows up again, I reuse it. Materialized views are more sophisticated: take a more complex query, break it down into smaller parts, and then, as I make changes to the database, incrementally update the materialized view without having to rerun the entire query, ideally. But essentially it's a variation of result caching; the idea is the same. Data skipping is figuring out ahead of time what data we don't even need to read, and avoiding scans on it. Data parallelization is where we use vectorized instructions, like SIMD, to apply operations or steps in our sequential scan to multiple pieces of data at the same time. And then code specialization, or query compilation, just-in-time compilation, is generating machine code that does exactly what our scan or query is going to do, instead of using an interpreted system. So there's a lot of things here, and this semester we're going to cover these. We'll skip materialized views, we already discussed prefetching in the intro class, and clustering and sorting will come up throughout the semester, but we're not going to teach you how to sort data; we just use whatever the fastest algorithm is at the time. So again, a bunch of these things we'll cover throughout the semester.
But today's class is going to focus on data skipping. Again, all of these are things you could do to accelerate query execution. We're still focusing on the lower levels of the system, the storage layer, so that's why we're concentrating on this for now. At a high level, there are two ways to skip data: a lossy approach or a lossless approach. The lossy approach is approximate queries. The way to think about this is, say I'm doing an analytical query like a select count star on the number of website visitors. Most systems that don't do approximate queries would give you an exact answer: look at every single visit and count all the entries. But in many application domains you don't actually need the exact answer. Do I really care about knowing the number of visitors to my website down to the exact person, like 999,000 or whatever? Or can I get something close enough for my application? This technique is called approximate queries. There are systems designed explicitly for doing nothing but approximate queries on top of a non-approximate-query database. There's a project out of the University of Michigan called VerdictDB by Barzan Mozafari. He spun that off as a startup called Keebo, which basically sits in front of Snowflake and does approximate queries, even though Snowflake in its basic form is not an approximate query system. BlinkDB was a research project out of Berkeley that did much the same thing. But a bunch of these other systems, like Snowflake and BigQuery, have approximate aggregates built in. So alongside the count function that gives you the exact answer, there's also an approximate count, which basically uses sampling to give you some statistically bounded guarantee on the accuracy of the answer. With approximate queries, you're essentially saying: I'm only going to look at a fraction of the data I would normally have to read. The other approach to data skipping is lossless: data pruning, where we rely on some auxiliary data structure that's been pre-built on our data, and is maintained by the database system, that lets it identify portions of the data files for a table that a scan does not need to read. A B+tree is basically this: if I'm looking up some value in a predicate, and I have a B+tree on that attribute, I can get down to the leaf node and find what I'm looking for without scanning the entire table. It's the same idea, but we're going to see a bunch of different ways to do it in this class. One of the trade-offs we have to consider as we go along, and you'll see this multiple times, is the scope of the auxiliary data structure: how much data is actually covered by the metadata generated for it, versus how selective a filter it can be? I always think in extremes with database systems. On one extreme, the scope of one of these pruning data structures would be the entire table.
So I have a billion rows, and maybe I keep the min and max value for a column across a billion rows. That's not going to help me skip any data, because the min and max is basically the entire domain of the column. Or I could have an auxiliary data structure for every single tuple, but that's useless too; that's the other extreme. So we want something in the middle: a scope large enough that checking the structure lets us throw a lot of data away, but not so small that we spend all our time looking in the auxiliary data structure itself. The other issue is whether it's manual or automatic. For most of the approaches we'll see, zone maps in particular, these are automatic; the DBMS just does it. For the bitmap indexes, these are things you have to define, and in some cases you have to define what the ranges are. So this class, we're only focusing on this; we won't discuss approximate queries for the rest of the semester. Approximate queries require outside knowledge of the data, external information about whether the application is okay with approximate answers, that the database system can't figure out for you. It doesn't know when you're okay with a rounding error in your bank account. You're probably not, but it doesn't know that; you have to tell it you want to run approximate queries. So we'll set that aside. The considerations we have to weigh any time we're doing data pruning with auxiliary data structures: first, how selective the predicates are going to be. If there's a boolean column that's true for essentially every one of a billion tuples, we don't want to build an index on that, because it's useless; you might as well just scan through the data. How selective the queries are determines how effective these auxiliary data structures will be. We also have to consider how skewed the data is. Again, my extreme example where everything is set to true across a billion rows: if I build a B+tree on that, or a bitmap index on that, it's useless, because I get to the leaf of the B+tree and then there's a giant list of all the matching tuples. So there's a trade-off here we have to consider. And then, again, we won't discuss this too much, but if the data is already pre-sorted, that allows us to do some optimizations we wouldn't get otherwise. So there's a lot we want to cover; this is just a quick smorgasbord of everything you could possibly do, and there are other things too. We're going to ignore multi-dimensional indexes and inverted indexes, because I want to focus on the OLAP stuff. We'll go through zone maps, again the most common ones, and I'll spend a lot of time talking about different bitmap representations. Then that will lead us into the imprints and the column sketches from the paper you guys read. As I said before, many times, zone maps are the most common data skipping technique, the one most systems use, or at least a lot of systems. The basic idea is that these are just pre-computed aggregates over the values in a block of an individual column of a table.
And the idea is that for certain predicates, we can check the zone map first and see whether there could be any matching data in the block or chunk we're about to read. If, according to the zone map, there isn't, we just skip it entirely. Say I have a single table with a single column that has five tuples. For the zone map on this, we could compute the min, the max, the average, the sum, and the count. Pretty straightforward. Now my query comes along: select star from table where value is greater than 600. The first thing I do is look in the zone map and say, well, I'm looking for a value greater than 600, but the max is 400. Therefore I know there isn't going to be any value in this block of data that could match my predicate, so I can skip it entirely. Pretty straightforward. And again, I showed this diagram last class from a Databricks talk about Parquet. Lo and behold, without calling it a zone map, right here they call it page metadata: the min, the max, and the count stored in the header of every chunk. This is a zone map. I think Oracle invented the term zone map; maybe it's copyrighted, I don't know, trademarked. The original formulation was called small materialized aggregates; that's the original paper, from this German guy. But everyone pretty much calls them zone maps, even though that's Oracle's name for them. So again, pretty straightforward, not hard to compute. And because, in the case of Parquet, these files are immutable, we don't have to maintain this or change it all the time. We just store it in the header and we're done. As I said before, there's a trade-off between the scope of the zone map and its efficacy. If I have a zone map on an individual tuple, that's useless. If I have a zone map on the entire table, that's probably not going to help me either: if the thing I'm looking for is outside the bounds of the zone map for the entire table, then yes, I avoid reading the entire table, but in the likely case that it is in there, the zone map is basically useless and I spent time reading and processing it. The other point with zone maps is that they're only useful when the attribute's values are correlated with position. If the column's values are completely random, then the zone map isn't really going to help me, because the value domain encompassed by any block of data could be quite large, and every time I check the zone map it comes back as true for my predicate, and I've got to go scan the block and find the thing anyway. But if I can pre-sort the data on the attributes in my predicate, then my zone map lets me throw things away. So, is a zone map an index? Sort of. It's a filter, not an index: a filter tells you whether something might be here, yes or no, and I'd still have to go look for it; an index tells you where it is. But again, this is the most commonly used technique. It's not free, though: you have to have additional code in your query execution engine that consults these zone maps as part of deciding what data you actually need to read.
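To make the check concrete, here's a minimal sketch in C++ of a per-block zone map and the skip test for the query above. The struct and function names are my own, not from any particular system.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Minimal zone map sketch: per-block aggregates computed once when the
// immutable block is written, then consulted before reading the block.
struct ZoneMap {
    int64_t min, max;
    int64_t sum;
    int64_t count;
};

ZoneMap Build(const std::vector<int64_t> &block) {
    ZoneMap zm{block[0], block[0], 0, 0};
    for (int64_t v : block) {
        zm.min = std::min(zm.min, v);
        zm.max = std::max(zm.max, v);
        zm.sum += v;
        zm.count++;
    }
    return zm;
}

int main() {
    ZoneMap zm = Build({100, 400, 280, 150, 330});
    // Predicate "value > 600": the block can only contain a match if max > 600.
    bool can_skip = !(zm.max > 600);
    std::cout << (can_skip ? "skip block" : "scan block") << "\n";  // "skip block"
}
```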
Off the record numbers, I'll have to bleep this out. They said, I think, 80% to 90% of their queries take advantage of zone maps. And at their scale, that's massive. So the win for these is quite obvious; this is why everyone does it. Although if 90% of queries take advantage of zone maps, it doesn't mean they can throw away substantial portions of the data; it just means there was some amount of data they were able to throw away. And that's all that matters. Yes? The question is, if you have very large data, what's the overhead of pre-computing these zone maps? Well, what's your idea of very large? Petabytes? So, most systems don't magically conjure up petabytes of data instantaneously. It's usually a fire hose: something downstream in your application stack is generating data that you're ingesting and storing, but it doesn't arrive all at once. The only application I know of where it does happen all at once was the Large Hadron Collider at CERN, because when they ran the experiment, that thing generated petabytes instantaneously, and they had to deal with that. Most systems are not smashing atoms. Most systems are collecting sensor data, or have people using a website, and things come in as a stream. In that case, I can batch things up and compute these zone maps. It's not expensive: I have to pack the data into the file anyway, and I'm already reading it because I have to split things up into columns, so it's not expensive to maintain. So another approach, the one I'm going to spend most of this class on: zone maps help you filter out blocks, but if you want to find the things that actually match, like an index does, then for OLAP workloads bitmap indexes are going to be the primary choice. Again, ignoring multi-dimensional indexes, because those serve different types of queries. Think of queries like select star from table where attribute is greater than something, or attribute equals something. Full-text indexes, on the other hand, try to find words in strings or text documents; we can ignore those for now. With bitmap indexes, the idea is that for every unique value in a column, in an attribute, we maintain a separate bit vector with a bit set to one if the tuple at that offset has that particular value. Because everything is fixed length, we can use the offset in the bitmap to map to the offset in the actual data columns themselves: the i-th position in the bitmap corresponds to the i-th tuple. And obviously we don't want one giant bitmap for the entire table; we'll do this as we chunk things up into our PAX files, or row groups, as we showed before. Bitmap indexes are an old idea; they go back to the 1970s, and there's a paper from '87 that describes the first real implementation inside a database system. And this is just a sample of the systems that support bitmap indexes. Postgres has these things called BRIN, block range indexes; we'll see range encoding in a second. Pilosa, now commercialized as FeatureBase, is nothing but a giant distributed bitmap index database. That's all it is, just bitmaps. And then Pinot has range indexes. A bunch of other systems support this too. So here's a really simple example.
We have a table with two columns: the ID field, and then whether somebody's lit or not, yes or no. Say we want to build a bitmap index on this column. Since it's a binary value, yes or no, we only have two bitmaps: one for yes and one for no. And again, there's a one in the bitmap if the attribute in the original data has that particular value. So if I jump to the fourth offset here, I can check and see that the bit is zero in the yes bitmap and one in the no bitmap, so I know the lit status is no, and that corresponds to the original value here. This is a bitmap index in its most basic form; it's pretty simple. And we'll see some ways we can use SIMD to make this actually go faster, so you're not just running a for loop and examining bits one by one. We'll cover that in a few more slides, and then in a lot more detail later in the semester. The next question, which leads to this next slide, is what kind of data is suitable for a bitmap index? Thinking in extremes again: the worst case scenario is an auto-increment or serial key where every tuple has a unique value, because then I have a bitmap for every single value, and most of them are going to be all zeros; every entry is zero except one. The other extreme is a binary decision, yes or no; that's the best case scenario. Most things are not that extreme; they're somewhere in the middle. Yes? Her question, or statement, is: isn't this just an enum value? Yes, yes/no is an enum. And strictly, for a binary decision it's either one or zero, so you really only need one bitmap. How about yes, no, maybe? Then you need three bitmaps, one per value; you can be maybe lit. And to her point, yes, an enum is basically a categorical column. We'll see why you want to do this as we go along, and there are more compact forms. This is a simplified example; I'm not even getting into the compression techniques you could layer on top. But as we go along, you'll see that because we store things as bits, we can play games to make things run really fast. His statement is that this only works for categorical variables or values, which is the same thing she's saying with enums. I will say, from our analysis of real schemas, enums are super rare. Even when the application treats a column as an enum, people define it as an integer or whatever without thinking about it. To his comment that it's a categorical variable: yes, but it'll work on anything. In Oracle, I think, you can say you want a bitmap index, and it doesn't come back and say, hey, that's a bad idea, I'm not doing that, even if it's the worst case scenario. We're all saying the same thing.
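Here's a minimal sketch of that basic form in C++, with made-up column data: one bit-vector per distinct value, where bit i tells you whether tuple i has that value.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Toy equality-encoded bitmap index: one bit-vector per distinct value,
// where bit i says whether tuple i has that value. Data is made up.
int main() {
    std::vector<std::string> lit = {"yes", "no", "no", "yes", "no"};

    std::unordered_map<std::string, std::vector<bool>> index;
    for (const auto &v : lit)
        for (const char *key : {"yes", "no"})
            index[key].push_back(v == key);

    // The i-th position in a bitmap corresponds to the i-th tuple, so probing
    // offset 3 recovers the original value without touching the base column.
    std::cout << "tuple 3 lit? " << (index["yes"][3] ? "yes" : "no") << "\n";
}
```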
My example here shows why this doesn't always work. Say I have a dimension table with customer information, and it has a zip code attribute. My query selects the email addresses from the customer dimension table where the zip code is one of these three possible values in Pittsburgh. So I want to build an index on this column. If I build it as a bitmap index, then a scan to find all the matches is fast: I just scan the three bitmaps that correspond to 15216, 15217, and 15218, take the union of those, and now I know the tuples that matched. I can do that evaluation very efficiently, much faster than comparing integers, because I can pack a bunch of bits into a single word and do the bitwise comparison on those. But of course this is going to be problematic, as you guys keep bringing up, because the value domain for zip codes in the United States is actually quite large. Can anyone guess how many zip codes there are in the US? 10,000? No. 100,000? It's less than 100,000, more than 10,000. It's roughly 43,000. So if I have 10 million tuples and build a bitmap index on this, I have to have 43,000 different bitmaps, each 10 million bits long, and that's about 53 gigs. And the original data, 10 million 32-bit integers, is what, 40 megs? So we're paying a lot of space for all these bitmaps, and most of the time they're going to be sparse; most of the bits will be zero. This naive scheme, creating giant bitmaps spanning the entire table for every unique value that could appear, obviously won't work. We need to be smarter about how we do this. The two big design choices we have to consider are, first, the encoding scheme: how to represent and organize the data within the bitmap. In my simple examples I'm just saying you malloc a giant array of bits, but we can be more clever. And second, how do we compress or reduce the size of bitmaps that we know are going to be sparse? We'll cover the second one next class; this class focuses on the first. And the idea, again: can we get the benefits that we'd get from a bitmap for accelerated scans without paying the huge storage overhead of these giant arrays of mostly empty data? So there are four encoding schemes we can consider. The one I've already shown you is equality encoding: a giant bitmap for every unique value that could possibly appear in your column. Range encoding is a way to store fewer bitmaps than there are unique values. Instead of one bitmap for exactly 15217 and another for exactly 15216, what if I take a range of zip codes and have one bitmap representing those values? This is basically what the column sketch paper you read is doing as well. It means I can now get false positives: I may think I have a match, or could possibly have a match within some range, so I'm matching more tuples than I actually am, and then I've got to go back to the original base data and do the final pruning.
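As a sketch of why the bitmap evaluation is cheap: the IN predicate becomes a union (bitwise OR) of the three per-value bitmaps, processed 64 tuples per instruction. The layout here is a toy representation, not any particular system's.

```cpp
#include <cstdint>
#include <vector>

// Sketch: evaluate "zipcode IN (15216, 15217, 15218)" as the union (bitwise
// OR) of the three per-value bitmaps, packed 64 tuples per machine word.
std::vector<uint64_t> EvalZipIn(const std::vector<uint64_t> &bm15216,
                                const std::vector<uint64_t> &bm15217,
                                const std::vector<uint64_t> &bm15218) {
    std::vector<uint64_t> match(bm15216.size());
    for (size_t w = 0; w < match.size(); w++)
        match[w] = bm15216[w] | bm15217[w] | bm15218[w];  // 64 tuples per OR
    return match;  // bit i set => tuple i lives in one of the three zip codes
}
```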
So Postgres has this with, again, these BRIN indexes. Apache Pinot, which I think came out of LinkedIn, is an open source system that makes heavy use of range indexes. And again, a human has to specify these ranges; as far as I know, no database system can automatically figure them out for you yet. Hierarchical encoding is a way to use a tree structure to identify the empty ranges of a bitmap and then not actually have to store them; we'll see that in the next slide. And then bit slicing is a way to maintain a bitmap for a single bit position across the values of a column, and then do a bunch of tricks to speed up the evaluation. Bit slicing is a really neat idea from the 1990s. I don't think any system, well, maybe Pinot or Pilosa, uses something like this, but we'll see the modern variation, bit weaving from Wisconsin, which does this very efficiently, and it's pretty cool. Okay, so let's go through the last two, because there's not much more to say about range encoding; it's the same idea, just with one bitmap per range instead of one per value. All right, hierarchical encoding. The idea is that we organize the bitmap as a tree structure, where each node contains a small bitmap of its own, and that bitmap tells you whether the child at each position has a one somewhere below it in its subtree. In this case here, every node has four bits in its internal bitmap, and every node has four children. If there's a one in a position, that tells me that somewhere in the subtree under that pointer there's a one. In this case, the second position is zero, so I know that in the second child's subtree there isn't going to be anything. The way to think about this: along the leaf nodes are the offsets of the bitmap, just as we had before. I'm showing the offsets at the top here, with the bits at the corresponding locations. I'm not actually storing the long runs of zeros; what I'm really storing is just the locations where there's a one. So say I want to figure out whether the bit at position one is set. I start at the top, and the first bit is one, so I know that subtree isn't empty and I need to go down here. This bit is set to one, so I go down again, and lo and behold, I can look at the leaf bitmap and see that there's something there. Right? So what are some problems with this? He says you need to decide the size of the tree, the size of the nodes, up front. That's not hard, because it depends on the number of values I have. His next point: again, this is a bitmap, the thing that says whether someone is lit or not, yes or no; instead of storing it in a giant array, this tree is the array. Skewness in terms of what? A bit set to one at a position here corresponds to one tuple. He says suppose the distribution is uniform. Well, if all the bits are set to one, then yeah, you don't save any space. And yes, you spend more time. Yep, there you go. And what else? What's reading this memory? The CPU, right?
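Here's a rough sketch of the probe just described, with a fanout of four to match the slide. The node layout and names are my guesses at the scheme, not the paper's actual structures.

```cpp
#include <cstdint>

// Rough sketch of a hierarchical bitmap probe, fanout 4 as on the slide.
// Each internal node keeps a 4-bit mask saying which children contain at
// least one set bit; empty subtrees are never allocated or visited.
struct Node {
    uint8_t child_has_ones = 0;   // bit k set => children[k] is non-empty
    Node *children[4] = {};
    uint8_t leaf_bits = 0;        // leaf level: the actual 4 bitmap bits
};

// Test whether bit `pos` of the encoded bitmap is set. `levels` is the number
// of internal levels above the leaves; each level consumes 2 bits of `pos`.
bool TestBit(const Node *n, uint32_t pos, int levels) {
    for (int l = levels; l > 0; l--) {
        uint32_t child = (pos >> (2 * l)) & 0x3;
        if (!(n->child_has_ones & (1u << child))) return false;  // empty subtree
        n = n->children[child];   // pointer chase: likely a cache miss
    }
    return (n->leaf_bits >> (pos & 0x3)) & 1;
}
```

Note that every step down the tree is a pointer dereference, which is exactly the indirection problem that comes up next.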
Yeah, so this seems like a great idea. And it's one of those things that came out of the early 2000s and seems like a good idea, but nobody actually implements it. Because on a modern superscalar CPU this is a bad idea: you have this indirection where, as I'm probing, I have to look at a node's bitmap, and I may go down one path versus another, and that might be another cache line miss to fetch the next node. The storage savings you get are not worth the performance penalty you pay for using a pointer-chasing data structure like this. And think about trying to scan large ranges: it's not like a B+tree where you can just scan along the leaf nodes. I'd either have to store metadata about what position I'm in and maintain sibling pointers to go from one node to the next, or I'd have to do a depth-first search, go down here and come back up, to scan along all the leaf nodes. All that back and forth is very costly on a superscalar CPU, where you don't want stalls in your instruction pipeline because you're jumping to another location. Yes? His point is the enum, categorical value issue again: if at least one value in a node's range is one, then I have to maintain that part of the tree structure. And these nodes can be really big. I'm showing four bits per node because I made it in PowerPoint; if it's something larger, as it should be, like a cache line of 64 bytes, then the likelihood that at least one bit in it is set is high. Yes, correct. So he's basically saying I could have one bit set to true and I have to store the entire node, and the nodes are going to be fairly large, probably a cache line. So again, this seems cool, seems clever, and the savings is real; you can go from eight bytes down to four bytes. But on a modern CPU architecture this is going to be bad; you're better off just scanning the data and ripping through it. Okay, so let me show you another way to encode things: bit slicing. For this one, again, our original table has an ID field and a zip code column. And these are actually all the zip codes of places I've lived in my life: where I grew up in Maryland, Pittsburgh now, Compton, Wisconsin, and so forth. What we're going to do now is flip around how we store the bits within the values themselves. Instead of having a bitmap that says, this bitmap corresponds to 15217 and there's a one if the tuple has that value, we're now going to store the values themselves in bitmaps, one bit position at a time. That opens up opportunities, when we do predicate evaluation, for early pruning, identifying things we don't actually need to read. Think of this as an auxiliary data structure, like an index; you wouldn't want to store the table itself like this. We use it to identify which offsets in the column of the original table match, and then we go get the full data in the column to process the rest of the query.
So again, the way to think about bit slicing: it's like when we took a row store and converted it to a column store, but now we're taking the bits we store for each value and flipping those into a column store. And this is an old idea; I think it comes from the 90s. All right, so let's take the first value here, 21042, where I was born. If we convert that to its binary form, we end up with this bit string. Now, I realize the table schema said it's a 32-bit integer; we're going to show 17 bits just to make it fit. So we have 17 bits for the actual data, and then one bitmap to signify whether a value is null or not. In this case the value is not null, so that's set to zero. And then for every bit position in the binary representation of the value, we store it in a bitmap like this, with the bitmap going down, not across. So for any single bit position of the attribute's value, here are all the bits at that position across all the tuples. We do the same thing for all the others. Now say our query comes along: find all the customers in a zip code less than 15217. What we can do is look at the binary representation of that constant and find the first position of a one. Anything before that position is zero in the constant, so any tuple with a one in those earlier slices must be bigger than the constant, and we can throw it out entirely. So I'm basically going to walk through each slice and construct a result bitmap saying whether the tuple at each position still matches. 15217 is this value here: the first three bits are zero, and then we have a one. So we need to find all the entries where any of the first three bits are not zero; those we can throw out immediately. Using vectorized instructions, we can do this evaluation very quickly across these bits, and then keep marching along, going down toward the end, and at some point we either reach the end and we're done, or there are no matches left and we can stop early. This is a very different way to think about evaluating tuples than a for loop applying a predicate to every single tuple; now we're looking at a batch of tuples all at once, bit by bit across all of them. Bit slices can also be used for efficient aggregate computations. A really simple trick: if you want to do a summation on a column, you can use the Hamming weight, which is just the number of ones, the number of non-zero bits, in a string of bits. You walk along the first slice and count all the ones, multiply by 2^16 because that's the weight of that bit position in integer form, then count all the ones in the next slice and multiply by 2^15, and you do this across all of them, one by one, and add the results together to get the summation. And Intel added this popcount instruction, which gives you the Hamming weight directly, so you can do this very, very efficiently.
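Here's a minimal sketch of that summation trick, using GCC/Clang's `__builtin_popcountll` in place of the raw POPCNT instruction. The slice layout and data are made up for illustration.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Bit-sliced SUM: slice[j] holds bit j of every value (one bit per tuple,
// packed into 64-bit words). The column sum is sum over j of
// 2^j * popcount(slice[j]).
int main() {
    const int kBits = 17;  // 17-bit values, as in the lecture example
    std::vector<std::vector<uint64_t>> slice(kBits, std::vector<uint64_t>(1, 0));

    // Toy data: three tuples with values 5, 3, 6, scattered into the slices.
    uint64_t values[3] = {5, 3, 6};
    for (int t = 0; t < 3; t++)
        for (int j = 0; j < kBits; j++)
            if ((values[t] >> j) & 1) slice[j][0] |= (1ull << t);

    uint64_t sum = 0;
    for (int j = 0; j < kBits; j++)
        for (uint64_t w : slice[j])  // popcount = Hamming weight of the word
            sum += (1ull << j) * static_cast<uint64_t>(__builtin_popcountll(w));
    std::cout << sum << "\n";  // prints 14 (= 5 + 3 + 6)
}
```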
Again, instead of maintaining an integer sum and looking at every single tuple one by one, you go across these bit slices, do the simple math trick, and you end up with the aggregate very quickly. For aggregation, sure, this always works. The challenge, of course, going back here: for simplicity I said, oh yeah, skip any entry where the first three slices have a one in it, and I sort of hand-waved how quickly you find that out and how you move down the line. It's not that efficient if you have to look at all the bit slices. If you can throw everything out in the first chunk of the first bit slice, fantastic, that's super fast. But for larger values, more than 17 bits, this can get expensive, and then you fall back to: can I just scan the data, and would that be quicker? Any questions about bit slicing? Yes. His question is, how much does this actually save us, because at the end of the day, if you have to go to disk, isn't disk always going to be the main bottleneck? Absolutely, yes. This is assuming that, once the data is in memory, you can quickly rip through it without going tuple by tuple or attribute by attribute. Another way to think about it: we're using this to do data pruning. Going back to that Parquet diagram, the header has all this information; if I just bring that in, I can identify, oh, I don't need to read anything else, because I have these bit slice indexes or filters, and I end up reading less data in total. And that's great. And then you can ask: doesn't that mean that if I check this index and do have to go find something, I stall while I fetch it? Well, if you have multi-threading, you can bring one piece in and have one thread chomp through it while another thread is waiting to bring the next piece in and stage it. There's always another index to look at, and if a thread does need to go read the actual data, you can block that thread and run something else. So, let's talk about bit weaving. Bit weaving is the modern incarnation of bit slicing, designed for fast predicate evaluation on compressed columnar data using SIMD. Has everyone here taken 15-618 or 15-418? Does everyone know what SIMD is? Who doesn't know what SIMD is? Awesome, fantastic. The first time. All right, we'll ignore how it preserves dictionary encoding until next class. Basically what we're doing, again, is trying to get bit-level parallelization: scanning lots of tuples' attributes at a bit-by-bit level and quickly identifying whether something is going to match our predicate. This idea came from the University of Wisconsin, and it was implemented in an embedded OLAP engine called Quickstep. Think of it like DuckDB before DuckDB, except Quickstep didn't have SQL: no parser, no query planner. It was like RocksDB but for analytics, without supporting SQL. They spun it out of Wisconsin, and it became an Apache incubator project in 2016, but then it got killed off in 2018. I forget exactly what happened; I think he went off and did a startup. This was invented by Jignesh Patel.
Along with Thomas Neumann in Germany, he's probably one of the best data systems researchers in the world. He's really cutthroat, too. Every time I show this picture, I think of the stories he would tell me about growing up in India, getting on the bus, going to school every day. He's a super mild-mannered guy, but the stories were insane. Anyway. All right, so there are going to be two ways to do this. This was in the column sketch paper you guys read; I think they allude to bit weaving, but only the vertical approach. We'll talk about both, and then you'll see why the vertical one is superior. The other thing I forgot to point out: this paper came out in 2013, and at the time SIMD, at least on x86, didn't have scatter and gather operations. So they designed it to get all the parallelism of vectorization, in some cases without SIMD entirely, and even where they do use SIMD, it doesn't require scatter/gather. Now we have those instructions, and there are modern versions that take advantage of them, but the technique doesn't require them. We'll cover scatter/gather and the SIMD stuff in more detail later in the semester. Right? So, we'll talk about the horizontal approach first. It's row-based, but think of it as being for a single column: even though we're storing the bits within a value in a row-oriented manner, it's for a single attribute, and that attribute is stored in a columnar fashion. Again, think of it as a clever way to get better parallelism through vectorization because of how we organize the bits. So, say we have our tuples here, and this is just the binary form of whatever we're storing; here's the actual numeric value being represented. We're going to break it up into segments, just like the row groups we had before. For each segment, we know how many values we need to store, and we represent them in this interleaved manner, where contiguous memory has gaps that we come back around and fill with the next batch of tuples. Let me show you what I mean. Think of this as the start of memory, going from here to here. We're going to represent these three-bit values in four bits, so you have the actual value; the first one is 001, here. But there's always a delimiter in front of it, a padding bit that's always set to zero. When we do the arithmetic to figure out whether something matches the predicate, we're going to use that bit to store whether the tuple matches in the output; you'll see what I mean in a second. The other thing to point out is that we go from top to bottom: this is T0, T1, T2, T3, and then we loop back around and append T4 into the word where T0 is, the same with T5, and so forth.
We do it this way because, at the end, we need to extract the matching tuples into a selection vector to know which offsets matched. If we organize it this way, you'll see what I mean in a second. Same thing for the second segment: we know there are only two values, so we only have to store two tuples, appended one after another. Again, I'm showing three-bit values in four-bit fields, and we'll say a processor word is eight bits. On x86 a processor word is, I think, 16 bits, because that's what it was back in the 80s; I think ARM's is 32 bits. It doesn't matter; for simplicity we're using eight. Okay. And here's the delimiter, always set to zero. So let's see how we use these to match a predicate: select star from table where value is less than five. We'll deal with just two tuples, the T0 and T4 I showed before, in this first word here. X is going to be the actual tuples organized in the bit-weaving horizontal format. This Y vector is just the comparison constant, five, repeated in every field. And then there's the mask, which in every field is a zero for the delimiter followed by ones for the data bits; we use it when we do the arithmetic to figure out whether the value is less than the constant we're looking for. The way it works is that we have this formula here, and there's a different formula for each kind of predicate. This one is for less-than: you take the XOR of X and the mask, add Y to it, and then AND it with the negation of the mask (there's a small code sketch of this below). And, lo and behold, what you get as the output is a selection vector that tells you whether each tuple matches. Now you see what the padding bit is for: that's where the result of the predicate goes. You ignore the zeros. The padding bit here says one, because T0 is one, and one is less than five, so that matches. T4 is six, and six is not less than five, so that's false and its bit is set to zero. Right? What's really cool is that this only requires three instructions to evaluate a whole word, and it works with any word size, as I said; these are 8-bit words, but it can go larger. All these different formulas are defined in the paper. And Leslie Lamport actually invented these kinds of bitwise comparison tricks back in the 1970s. Nothing I'm showing here is actually SIMD; you can do this mask arithmetic with regular SISD instructions, but you still get the parallelism because you packed multiple values into a single word. Right? In the 1970s, SIMD existed theoretically, with Flynn's taxonomy, but I don't think any CPU back then supported it, certainly not to the extent they do now. So, again, I'm showing 8-bit words. With 64-bit words you can now pack in 16 three-bit values, four bits each with the delimiter, and do this simple arithmetic to compute the comparison very quickly. Yes? The question is why you would arrange things vertically instead of horizontally. Next slide. Because at this point we just have a one and a zero saying yes, this matched; we've got to convert that back into the offset of the tuple. And his question: does this mean you have one less bit to represent all your data? Yes.
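Here's a compilable sketch of that less-than formula on the slide's numbers, with two four-bit fields packed into one 8-bit word. This is my reading of the formula, so treat the details as an assumption rather than the paper's exact code.

```cpp
#include <cstdint>
#include <cstdio>

// Sketch of the BitWeaving/H-style "less than" evaluation, as I read the
// formula: Z = ((X XOR M) + Y) AND NOT(M). Each 4-bit field is one tuple:
// a delimiter bit (0) followed by three data bits. No SIMD needed; packing
// two tuples into one byte already gives word-level parallelism.
int main() {
    const uint8_t M = 0b01110111;  // ones in the data bits, zero in delimiters
    const uint8_t X = 0b00010110;  // T0 = 1 (high field), T4 = 6 (low field)
    const uint8_t Y = 0b01010101;  // the constant 5, broadcast into both fields

    // Within a field: (X^M) = (7 - X), so (X^M) + Y carries into the delimiter
    // bit exactly when X < Y. Delimiters are 0, so carries never cross fields.
    uint8_t Z = static_cast<uint8_t>((X ^ M) + Y) & static_cast<uint8_t>(~M);

    printf("0x%02x\n", Z);  // 0x80: T0's delimiter set (1 < 5), T4's clear (6 >= 5)
}
```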
But if you had random 32-bit integers, wouldn't this not work? You'd have to know something about the value domain to know the minimum number of bits you can use. Right, and this is why they assume the values are dictionary-encoded with an order-preserving scheme. So if these are, say, the state codes of the United States, you could represent them in these few bits. All right, let me plow ahead and answer her question. So, these are all our bit-weaved, I shouldn't say packed, that's a compression term, but these are our bit-weaved words. We can do all the same comparisons I showed in the last slide across a bunch of words like this. Again, in each delimiter we've set a zero or one based on whether the tuple matched. But now we need to convert this back into which offsets matched. So you just do simple bit shifting to slide things over based on where you are in the list, and when you combine these together, you end up with a bitmap that says which tuples actually matched. That's why you have to shift: T4 was over here, and when I shift it over by one and push it up, it lands nicely where T4 should be. Yes? The question is whether the one we saw before, the hierarchical encoding, or bit slicing, is more efficient. The storage of the bit slicing approach is potentially more efficient, yes. I'd have to look at the original paper; I think it did compare against bit slicing, and this is way faster. And the vertical one we'll see next is even faster. So that answers your question, and that's why they arrange it that way. So this is called the selection vector. It's basically a bitmap that says whether the tuple at each offset matches our predicate or not. But it's just a bitmap; we've got to convert it back into numerical offsets. There are two ways to do this. The really simple way is to walk the selection vector and say, if the bit is set to one, then the tuple at this offset is one I need to consider. That works, but it's slow. The better approach is a pre-computed positions table. You recognize that the unit of the bitmap isn't that big; in this case it's eight bits, which means there are only 2^8 possible bit patterns that could appear in a byte of this bitmap. So I can just pre-compute that: an array where the integer representation of the bits is an index into a positions table, and at each entry there's a pre-computed list of the offsets that match. So if this byte is 150, I jump to position 150, and there are my offsets. And this positions table is small, so I can keep it in L2 or even L1. This technique was invented by VectorWise in the late 2000s, and we ended up using it in our own system for our research on vectorized execution; we'll cover those papers later in the semester. All right, so this is the horizontal... yes, sorry. Yes. But otherwise I've got to go examine the bits one at a time; there isn't an instruction that says convert these bits into integer offsets. So you pre-compute this small table, and it's way faster. Okay, so this is the horizontal representation. I don't know whether they actually invented this in Quickstep; no other system I know of actually does it.
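A minimal sketch of that positions table for 8-bit chunks of the selection vector. The table shape is my own, but the idea matches what's described: 256 pre-computed offset lists indexed by the byte's integer value.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Pre-computed positions table: for each of the 2^8 possible byte patterns,
// the list of bit offsets (0..7) that are set in that pattern.
std::vector<std::vector<uint8_t>> BuildPositionsTable() {
    std::vector<std::vector<uint8_t>> table(256);
    for (int pattern = 0; pattern < 256; pattern++)
        for (uint8_t bit = 0; bit < 8; bit++)
            if ((pattern >> bit) & 1) table[pattern].push_back(bit);
    return table;
}

int main() {
    auto table = BuildPositionsTable();
    // Convert a selection vector (one byte per 8 tuples) into tuple offsets.
    std::vector<uint8_t> sel = {150, 0, 3};  // toy selection vector
    std::vector<uint32_t> offsets;
    for (size_t i = 0; i < sel.size(); i++)
        for (uint8_t off : table[sel[i]])
            offsets.push_back(static_cast<uint32_t>(i) * 8 + off);
    for (uint32_t o : offsets) std::cout << o << " ";  // 1 2 4 7 16 17
    std::cout << "\n";
}
```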
I just find this super cool and super clever. The vertical approach is the one that showed up in the sketch paper you guys read. Same as before, we split our tuples up into segments, or row groups, but now we generate bit slices for all the bit positions and store those contiguously in memory: all the first bits, then all the second bits, then the third, with each segment storing those slices. And in this case, even though the second segment only has two tuples, we have to store the entire bitmap, with a bunch of useless data, because we have to make sure everything fits in our processor words. Okay, so what can you do with this? When we want to evaluate tuples, we can now use SIMD to compare bits, bit position by bit position within the value we're looking for, staying in SIMD registers without having to go back to regular CPU registers, and compute our predicates very efficiently. Say we have select star from table where value equals two. I'm showing an equality predicate just for simplicity, but you can use inequalities as well. We represent the constant in binary form, 0, 1, 0, and we want to evaluate the first bit: find all the matches within the first bitmap slice. We broadcast an all-zeros mask, do our SIMD compare to see where it's equal, and we get back a selection vector like this. Then we evaluate the second bit, but we only want to compare the positions that matched the first one, so we bring that mask into the comparison, and in this case it produces all zeros. We use that popcount thing from before: look at the vector and ask how many ones there are, and as soon as the vector comes out all zeros, we know we're done and we can stop early. So this is basically bit slicing, but done efficiently with SIMD. With the horizontal encoding we can't do early pruning, because we always have to look at all the bits within the value; here, because we slice it up, we can stop early. As you said, we're doing the exact same operation as bit slicing, yes, but the original bit slicing paper doesn't say how to do this; that's the difference, because it was from the 90s. In this case, we're able to examine eight tuples with two instructions. That's insane. It's amazing. If you do a sequential scan with a for loop and evaluate tuples one by one against their attributes, you're not going to get down to two instructions. It's not entirely two instructions; it's two SIMD instructions, and there are instructions to get things in and out of the registers, and you pay that cost, but we're ignoring it for now.
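Here's a sketch of the vertical evaluation with early pruning, using plain 64-bit words where a real implementation would use wider SIMD registers. The slice layout (most significant bit first) is my assumption.

```cpp
#include <cstdint>
#include <vector>

// Sketch of vertical bit-sliced equality with early pruning. slice[j] holds
// bit j (most significant first) of 64 tuples packed into one 64-bit word,
// standing in for a SIMD register.
uint64_t EqualsConstant(const std::vector<uint64_t> &slice, uint64_t c, int bits) {
    uint64_t match = ~0ull;              // every tuple starts as a candidate
    for (int j = 0; j < bits; j++) {
        // Broadcast the constant's bit at this position across all 64 lanes.
        uint64_t cbit = ((c >> (bits - 1 - j)) & 1) ? ~0ull : 0ull;
        match &= ~(slice[j] ^ cbit);     // keep lanes whose bit agrees
        if (match == 0) break;           // early pruning: nothing can match anymore
    }
    return match;                        // bit t set => tuple t equals c
}
```

With the constant 2 (binary 010) and three slices, the first AND already strips every tuple whose top bit is one; if the running mask ever hits all zeros, the remaining slices are never touched.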
So where have we gotten so far? We've done zone maps, bitmap indexes, bit slicing, and bit weaving, and now we'll finish up quickly with column imprints and column sketches. All the bitmap schemes I've shown you so far store exact representations of the data within the columns. In the simplest case, the basic bitmap index with equality encoding, there's a one in a bitmap if the value is equal. Now, with range encoding you're grouping things together, and as I said, that's basically what sketching is trying to do, in a different way. But in general, ignoring range encoding, if I check a bitmap and it says something matches, I know it matches; I don't have to go double-check the actual tuple to rule out false positives. But in some cases, if we give up some of that accuracy to support faster evaluation, and maybe more compact bitmaps, that can be a big win for the common case. There are extreme examples, as we discussed, where this won't work well, but most data doesn't look like that; most data is highly skewed. And of course, we may have to check the original data in some cases to make sure there are no false positives. With range encoding, I always had to go check. With a sketch, there will be some cases where I don't have to. So, column imprints are a precursor to column sketches. The basic idea is that we'll have a bitmap that says whether something could exist. It's basically collapsing the bit slices down into a single bitmap instead of multiple slices. So, same original data as before. If I had bitmap indexes like this, you'd see that for the different possible values there are a bunch of bitmaps that are all zeros. Instead, if I store an imprint, collapsing this all down by OR-ing everything together, I end up with a single bitmap called an imprint, and I can use that to quickly determine whether something could exist. And again, if I get a one here or a one over here, I've got to go check the original tuples to see whether it's actually a match. But if I land where there's a zero, I don't have to do anything. What's that? It's like a bloom filter. Yes, it's like a bloom filter, without the false positive rate guarantees and things like that. We'll cover bloom filters in much more detail when we talk about hash joins, but it's basically the same idea. Okay, and this was done, I want to say, in the early 2000s. So the column sketch paper you guys read is a variation, as I said, on range-encoded bitmaps, where the idea is to maintain a sketch, which is like a probabilistic data structure, kind of like a bloom filter; there are also counting sketches and things like that. It's an approximation of what data exists: a bit set to true says there could be something in this range. The idea is to convert our base values into smaller codes, and then use those codes to figure out how much data we actually have to look at. Of course, there's a trade-off between the distribution of the values and the compactness. If our codes are the same size as the original data, they'll be very precise, but that means we can't throw anything away, and we waste space maintaining basically the values twice. They also have a special case for the most frequent values to avoid false positives, because otherwise, if something appears all the time and gets grouped together with other values, you'd constantly have to go check whether the thing actually exists.
So the column sketch paper you read was a variation, as I said, on range-encoded bitmaps. The idea is to maintain a sketch, a probabilistic data structure kind of like a Bloom filter or a count-min sketch, that is an approximation of what data exists: a bit set to true means there could be something in this range. We want to convert our base values into smaller codes and then use those codes to figure out how much data we actually have to look at. Of course, there's now a trade-off between the distribution of the values and the compactness. If our codes are the same size as the original data, they'll be perfectly precise, but then we can't throw anything away and we're wasting space maintaining the values twice. They also have a special case for the most frequent values to avoid false positives, because otherwise, if something appears all the time and gets grouped in with other values, you'd constantly have to go back and check whether the thing actually exists. But if you special-case it, I know this frequent value appears, say, 99% of the time, and here's its exact code, so I can do that lookup directly and avoid reading things I don't need.

All right, here's a high-level overview of how it works. Say our original data is stored as 8-bit values. They take a pass over the data; again, we can do this because we're creating an immutable file, so we have to read everything in, encode it, and write it out anyway. They build a histogram where the height of each bucket, meaning the amount of data represented within each range, is roughly equal, an equi-depth histogram. You don't want one range to absorb a lot of skew and then do a bunch of lookups into it for useless data. Based on this histogram, and the paper describes how you'd actually build it, you maintain a compression map: any value less than or equal to a given boundary gets mapped to a smaller dictionary code. Then, when you store the column in sketch form, you store these codes instead of the original values. So if the original values were 8 bits, I can now store them in 2 bits.

Now when a query comes along, say SELECT * FROM table WHERE value < 90, I do a lookup in my map to find the smallest boundary that covers the value I'm looking for. Here, 90 is greater than 60, so I move to the next entry; 132 is greater than 90, so I can stop. So I use the mapping function to convert 90 into the code 01, and now I know which offsets in my sketch column could match. But because 132 is greater than 90, a position with code 01 might hold a value like 91 or 92 that doesn't satisfy the predicate even though it falls within the range covered by that code. So for the 01s, we have to go back to the original data and check whether we have a false positive. In this example, the slide says code 01 covers 81 to 140: 81 would satisfy our predicate, and the upper value would not. The 140 is a typo, my fault; it should say 120. Either way, code 01 contains both matches and non-matches, so we have to verify.

So the question is: doesn't this have the same problem as before, where there's an indirection in traversing the data structure and then doing a second lookup? When I talked about the hierarchical encoding, I said you have to go up and down the tree just to find the bits that are set to one, and then go out and fetch the actual tuples from the table. Here, I can scan through this sketch column quickly and find all the positions that are 00 or 01. Then, for the ones that are 01, depending on what the query actually needs, I do a batched lookup to examine the actual data. So you scan this thing first and get all the offsets; you could use a precomputed table if you wanted, it doesn't matter. You get all the offsets where the code is less than or equal to 01.
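Here's a minimal sketch of that encode-then-verify flow. The boundary values mirror the lecture example, but the map construction, the equi-depth histogram part, is elided, and all the names are made up for illustration.

```cpp
#include <cstdint>
#include <vector>

// A toy column sketch with small codes derived from a compression map.
struct ColumnSketch {
    std::vector<uint8_t> bounds;  // bounds[c] = largest value mapped to code c;
                                  // last bound must cover the max value
    std::vector<uint8_t> codes;   // one small code per tuple

    uint8_t encode(uint8_t v) const {
        uint8_t c = 0;
        while (v > bounds[c]) c++;   // first bucket whose bound covers v
        return c;
    }

    void build(const std::vector<uint8_t>& base) {
        for (uint8_t v : base) codes.push_back(encode(v));
    }

    // Evaluate "value < x": codes below x's code definitely match, the
    // boundary code might match (verify against the base data), and
    // everything above it definitely fails.
    void scan_lt(const std::vector<uint8_t>& base, uint8_t x,
                 std::vector<size_t>& out) const {
        uint8_t cx = encode(x);
        for (size_t i = 0; i < codes.size(); i++) {
            if (codes[i] < cx) out.push_back(i);        // certain match
            else if (codes[i] == cx && base[i] < x)     // boundary: verify
                out.push_back(i);
        }
    }
};
```

With bounds = {60, 132, 200, 255} and x = 90, encode(90) returns 01: everything coded 00 (values at most 60) passes outright, the 01 positions get verified against the base column, and everything else is skipped, which is exactly the walk-through above.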
I may have to fetch the data anyway; in this case it's SELECT *, so I've got to go get all the values regardless. Then I just do the extra check before handing tuples up to the next operator in the query plan. Other than the typo that got picked out, thank you. Any questions about this? Yes? Right, so the observation is that one of the optimizations you get is that I only have to check the 01 positions for false positives; for the 00 positions, I don't have to check at all. Again, it depends on the query. If it's a COUNT, I just count them all together. If I need the column for a join, I have to fetch it anyway, but for the 00 positions I don't have to do the comparison, because I know they satisfy my predicate. All right, cool, we'll end on time.

So, zone maps, as I said, are the most widely used method for accelerating sequential scans. It's pretty much what every system gives you; I think it's the only thing Parquet has, and I actually don't know about ORC. Bitmap indexes are more common in the enterprise row stores. I don't think Postgres counts itself as enterprise here; I mean Oracle and SQL Server, the things that cost a lot of money. They're trying to get the benefits of columnar processing without building a whole column-store engine, which they all eventually built anyway. And as I said, we're completely ignoring multi-dimensional indexes and inverted indexes. I don't think they teach 15-826 anymore, but that whole class is about this kind of stuff: KD-trees, R-trees, and so forth. Geospatial things, we ignore all of that. An inverted index is just a mapping from a word or an n-gram within a text field to the actual tuples that match, okay? Next class, we'll talk about data compression. We'll spend most of the time on tuples, but we'll also talk about how to compress these bitmaps, beyond the things we discussed before.

All right, so quickly, I want to talk about Project 1, which is out. Thanks to WAN for updating everyone on Piazza. For the first project, you'll be writing a foreign data wrapper for Postgres to access columnar data stored in a file format that we invented. The reason we didn't choose Parquet or ORC is that there are existing libraries for them that take care of all the complexity for you, and bringing in those libraries is much more complicated. So we gave you something really simple instead. A foreign data wrapper, sometimes called a connector, is like a virtual table in SQLite, if you're familiar with that. It's basically a way to link a shared object into Postgres that lets you override the scan operators on a table, so scans go through whatever that shared object implements rather than Postgres's internal base tables. There are foreign data wrappers to read CSV files, and foreign data wrappers that connect to other database systems. Postgres is an amazing system, and it's also very extensible, which is pretty cool; there are commercial products built as foreign data wrappers for Postgres. Timescale makes heavy use of this. From the SQL perspective, it looks like a regular table, even though the scan goes through your code.
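To give a feel for the API, here is a skeletal sketch of the FDW entry point, assuming a hypothetical extension called myfile_fdw. The FdwRoutine callback fields are the real Postgres ones; the extension name and the callback bodies (omitted here) are placeholders you fill in.

```cpp
// Compiled as C++ against the Postgres headers, hence the extern "C".
extern "C" {
#include "postgres.h"
#include "fmgr.h"
#include "foreign/fdwapi.h"

// Callback declarations; the bodies (not shown) do the real work.
static void myfile_GetForeignRelSize(PlannerInfo *root, RelOptInfo *baserel,
                                     Oid ftid);
static void myfile_GetForeignPaths(PlannerInfo *root, RelOptInfo *baserel,
                                   Oid ftid);
static ForeignScan *myfile_GetForeignPlan(PlannerInfo *root,
                                          RelOptInfo *baserel, Oid ftid,
                                          ForeignPath *best_path, List *tlist,
                                          List *scan_clauses, Plan *outer_plan);
static void myfile_BeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *myfile_IterateForeignScan(ForeignScanState *node);
static void myfile_ReScanForeignScan(ForeignScanState *node);
static void myfile_EndForeignScan(ForeignScanState *node);

PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(myfile_fdw_handler);

// Postgres calls this once to get the table of scan callbacks; everything
// else flows through the function pointers we hand back.
Datum myfile_fdw_handler(PG_FUNCTION_ARGS) {
    FdwRoutine *routine = makeNode(FdwRoutine);
    routine->GetForeignRelSize  = myfile_GetForeignRelSize;   // estimate rows
    routine->GetForeignPaths    = myfile_GetForeignPaths;     // one scan path
    routine->GetForeignPlan     = myfile_GetForeignPlan;      // keep the quals
    routine->BeginForeignScan   = myfile_BeginForeignScan;    // open the file
    routine->IterateForeignScan = myfile_IterateForeignScan;  // one tuple/call
    routine->ReScanForeignScan  = myfile_ReScanForeignScan;   // restart scan
    routine->EndForeignScan     = myfile_EndForeignScan;      // clean up
    PG_RETURN_POINTER(routine);
}
}  // extern "C"
```

IterateForeignScan is where your scan code plugs in: fill the TupleTableSlot from your column readers, one tuple per call, and return a cleared slot when the scan is exhausted.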
And the goal of this, again, is to get you familiar with extending Postgres, which you can use for Project 3, but also to actually implement the things we've talked about in class today and over the last couple of classes: how to scan through columnar data. The file format that WAN developed is pretty basic. We're not doing any compression; it just stores things in columnar format, and the footer has JSON metadata telling you where to find the offsets for the different types of data.

I pushed the deadline back; I think it was due the 16th, and now it's due the 26th, which is a Sunday. All the information is on the project website. I'll post a spreadsheet on Piazza. I already know everyone's Andrew ID, and I'm going to fill out a form to get AWS credits, so everyone enrolled in the class will get an email with about a hundred bucks from Amazon. You can use that for development if you want. You can use it for Bitcoin mining if you want; go ahead, I don't care. You can make it work on your laptop or on EC2, but at the end of the day, it has to compile and work on Gradescope, because that's how you submit.

So again, we've provided a basic file format, and there are instructions on what it actually looks like. It's not sophisticated; it's literally just storing data in binary form in a columnar arrangement. The first step is to write a parser over this file: scan in your columns, read the metadata to understand what the offsets are, and stitch tuples back together. Then you integrate this into Postgres using the foreign data wrapper API. I think WAN has provided some basic skeleton code already, right? So there's skeleton code with functions you've got to fill out, and instructions on the website for how to compile it and link it into Postgres. The last step is to support predicate evaluation, so Postgres hands you the WHERE clause. We're not trying to support everything you could possibly put in a WHERE clause, just some basic comparison operators; they get passed to you in a Postgres struct, and you apply the predicate as you scan the data, okay? What's that? No, you don't have to do joins. It's surprisingly difficult enough just to do the scan in the foreign data wrapper. Postgres is extensible, and that's great, but I never said it was easy to extend.
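Here is a rough sketch of that first parsing step under invented assumptions, since the real layout is whatever the project write-up specifies: a 4-byte little-endian footer length trailing the file, JSON metadata just before it, and fixed-length int32 columns. Error checking on the I/O calls is omitted for brevity.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Toy reader for a columnar file: find the metadata, load the columns,
// and stitch tuples back together with offset arithmetic.
struct ColumnFile {
    std::string json_footer;                 // raw JSON metadata
    std::vector<std::vector<int32_t>> cols;  // decoded fixed-length columns

    // col_offsets and num_rows would normally come from parsing the JSON
    // footer; they're passed in here to keep the sketch dependency-free.
    void load(FILE *f, const std::vector<long>& col_offsets, long num_rows) {
        // 1. Footer length lives in the last 4 bytes (assumed layout).
        uint32_t flen = 0;
        fseek(f, -4, SEEK_END);
        fread(&flen, sizeof(flen), 1, f);
        // 2. The JSON footer sits immediately before it.
        fseek(f, -4 - static_cast<long>(flen), SEEK_END);
        json_footer.resize(flen);
        fread(&json_footer[0], 1, flen, f);
        // 3. Columns are fixed length, so row i of column c lives at
        //    col_offsets[c] + i * sizeof(int32_t): pure offset arithmetic.
        for (long off : col_offsets) {
            std::vector<int32_t> col(num_rows);
            fseek(f, off, SEEK_SET);
            fread(col.data(), sizeof(int32_t), num_rows, f);
            cols.push_back(std::move(col));
        }
    }

    // Stitch tuple i back together by reading offset i of every column.
    std::vector<int32_t> tuple(size_t i) const {
        std::vector<int32_t> t;
        for (const auto& c : cols) t.push_back(c[i]);
        return t;
    }
};
```

This is exactly the fixed-length property from the top of the lecture doing the work: knowing offset i in one column tells you where tuple i lives in every other column.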
The way to get started, as I said, is writing the parser code without any Postgres; you can do that locally. You should do this in C++, and we're giving you skeleton code that uses PGXS. If you want to do Rust, talk to Chi; he's going to tell you to go away. We don't think you can do it. And just like the intro class, make sure you do all your debugging locally and write your own local tests. Don't use Gradescope for debugging, because it's going to be slow. And don't change any files other than the ones you submit on Gradescope, because everything else gets wiped when your submission is compiled and linked there.

Make sure you fork our version of Postgres, because it already has the scaffolding code set up for you to compile from the command line, VS Code, CLion, whatever you're using. That's all in there, so you'll be able to link things in without having to figure it out yourself, because it's not trivial. And please post questions on Piazza or go to the TAs' office hours, okay? There are example foreign data wrappers linked in the write-up: the Parquet one and the Citus one. Some of the scaffolding code will look very similar. They won't be entirely useful to you, but you can look at them as a reference; just avoid copying wholesale without attribution from existing code you may find, okay? And also, because there are always random people on the internet who see the videos and try to implement this themselves: avoid their stuff. It's usually crap, I'm not going to lie. Okay, all right, any questions?