Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. Let's get started. We'll finish up talking about the proposal presentations coming up this week on Wednesday; let's hold off any questions until then. Today we're going to pick up where we left off last class, talking about what data files actually look like, what the actual data looks like. Recall from last class we talked about the different storage models you could have: NSM or row stores, DSM or pure column stores, and then, as I said, everything is pretty much PAX these days. You divide your table up into horizontal partitions or row groups, and then within each row group you lay out all the bits or bytes for each column contiguously before jumping to the next column. So you get the contiguous values of a column store, but also the spatial locality of a row store. We then talked about the file formats, what's actually inside of them, what additional things we record beyond just the data itself. So we talked about the metadata, keeping track of what's in the footer so you can jump to offsets in the row groups; the format layout specifying, again, is it a row store or a column store, and what the nesting structure is; the type system, which we glossed over quickly, saying there are primitive types and logical types; and the encoding schemes, which we spent a lot of time on and will talk about mostly today, different ways to encode the data. Naive block compression, or general-purpose compression, is taking whatever you produce in your file from the lightweight encoding schemes and then throwing Snappy or Zstandard or gzip at it (actually, never use gzip; use Snappy or Zstandard). Then zone maps and Bloom filters, and then we rushed through the shredding stuff. So I'm going to spend a little bit of time in the beginning today going over that in more detail, because again, I think this is a neat idea from the BigQuery/Dremel work, and we'll see it again later in the semester. And then we'll kick off the conversation about today's paper. So again, in real-world data sets there's a lot of JSON out there, and if you're at Google, a lot of protocol buffers. If we just store a JSON document as a varchar, as a text or blob column, yes, we can run JSON functions on it to extract structure, but we lose all the advantages of having a column store with a PAX layout and being able to vectorize execution. So instead, what we want to do is split up, blow up, the JSON document for every single tuple and store the paths within that document as separate columns. The NoSQL guys talk about these being schema-less databases where you can define your schema later. That just means you don't have to call CREATE TABLE and specify exactly which columns you have and their types; you just throw JSON at it. But inherently there's always a schema, because it doesn't make sense to have random applications writing random documents into a table; then nobody can make sense of it. The documents may not all have exactly the same fields, but there will be enough overlap in the structure that we can break things up and store them as columns. So again, I'm going to walk through the record shredding from Dremel, and then we can briefly talk about length and presence encoding.
All right, so the basic idea of shredding is that we're going to store paths as columns: each path in the document gets stored as a separate column, and then we record how many steps deep we are into a given document for that hierarchy. And unlike with length and presence encoding, which we'll see in the next slide, if something doesn't exist we don't always need to record that fact, whereas with length and presence you do. So there are two additional columns we're going to define. Think of these as just integer columns that we can then run all the encoding and compression stuff we talked about before on. Adding additional columns, yes, that's more data we're storing per attribute within our JSON document, but again, we can compress these things pretty well and avoid a lot of bloat in the storage space. The first one is the definition level, and that keeps track of how many optional elements existed to get us to the current path in our hierarchy. The repetition level says, if it's a repeated structure, like in our schema up here where we have two repeated structures, the repeated group Name and the repeated group Language, how many times we've seen those repeated groups repeat at that given level. So let's walk through the example. There's a simple document here; this is roughly what a protocol buffer looks like, think of it like defining a schema on a JSON or XML document. We're going to walk through this document and scan through it as if we were loading it into our database system, and show how it generates the entries across the different columns in our shredded model. At the very beginning it's easy: we have a document ID, so we have a column that corresponds to the DocId path, which is at the root or top of the document, so we just insert a new record there. At this point there are no repeats before us, so the repetition value is zero, and there are no other optional elements in the path before us, so the definition is zero. Then as we scan down we hit the first nested structure: we have Name, and inside that we have a repeated group called Language. So now we see our first entry for Code. We create a new column for our shredded document where we have the value being stored, the repetition value is set to zero because this is the first Language group we've seen at this level in the hierarchy, and we set the definition to two because we're two levels deep. Then we go down to Country, and now we create a new column; the value is "us", that's easy, repetition is zero because there's nothing before us, and now we're three elements into our path. Then we go to the next group for Language, and again we see a Code. We insert that here; again, starting the count at zero, it's the second repeated group within this hierarchy, so we set the repetition to one, and our path to get here is two, just like before, because we had to go from Name to Language to Code. Now we don't have a Country, so in this case, because we at least have something at the group level within Language, we have to put an entry here. Again, same as before: we're one into the repetition, and there are two elements in the path to get us here, but because there's no value here we set the definition to two.
Then we go down to the Url here, and again repetition is zero because we're the first Name in our group, just like we were for the document, but our depth is two because we went from Name to Url, just as before. Then we go down to this group, which is just a Name with a Url and no other attributes for the Language, so we add our entry here: repetition one because we're the second element in the repeated group, and the path is two to get us there. So now we have to put placeholders here to say there wasn't anything within the repeated group for Language: repetition one, because this is the second group within that level of the Name, and then the definition is one, to say the path is really only one to get there. Why is that two? Because it's one down from Name, so if you want to know how deep you actually are as you follow along from this, you could compute that and see that you're actually three levels deep. Yes? The question is why Code has a definition level of two but Country has a different one. Go back here, the arrow doesn't line up, PowerPoint was being stupid. So in this case here I add "en-us", the definition for the path is two, but then when I add the Country the definition of the path is three because, as he said, the Code is required and the Country is optional. Other question, sorry. Yes? The question is, for the second Language group, we add a null, so why is that definition different? Because the Country doesn't exist; you're not moving down even farther. Yes? The question is whether the definition is determined by how far down you have to go. Yeah, the definition is how many steps you are down within the path that gets to where you're at, and if something isn't there, it doesn't count. That's my understanding of it. Yes? The question is, is there any advantage to storing the structure as columns versus storing pointers? Think of it with a query. The reason we're storing this structure here is that we want to be able to go through the column itself without having to go back to figure out how we actually got there when we do a SELECT query. I don't have a SELECT query on the slide, but think of something like SELECT * FROM table WHERE Name.Language.Code = 'en-us'. I can just rip through this column here; I don't need to refer to any of the other ones, and when I find matches, I can use that to figure out where I need to jump back in the offsets if I want to stitch things back together. Say it again: is there a large overhead if I then have to reverse this shredding, tape it back together to go back to the original form? Absolutely, yes. But the advantage is that when we want to do lookups, it's already broken up into a form where we can find things very quickly. And this is what they're giving us: some asshole developer says, I'm going to give you JSON, and we've got to handle that. We haven't even talked about how we're going to handle the types either. We'll see this in Snowflake: Snowflake will actually try to figure out, oh, I see these are all strings, so I'll keep the original JSON, but I'll also synthesize a typed string column for you.
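To make the shredded representation a bit more concrete, here is a minimal C++ sketch of the idea; this is not Dremel's or Parquet's actual code, and the repetition/definition values are just illustrative placeholders, not the exact numbers from the slides:

```cpp
// A minimal sketch (not Dremel's or Parquet's actual code) of a shredded
// column: one (value, repetition level, definition level) triple per entry.
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct ShreddedEntry {
    std::optional<std::string> value;  // nullopt = placeholder for a missing value
    uint8_t rep;                       // repetition level
    uint8_t def;                       // definition level
};

int main() {
    // Hypothetical shredded column for the path Name.Language.Code.
    std::vector<ShreddedEntry> code = {
        {"en-us",      0, 2},  // first Language group in the first Name
        {"en",         1, 2},  // second Language group, same Name
        {std::nullopt, 1, 1},  // a Name with no Language group at all
    };

    // SELECT * FROM t WHERE Name.Language.Code = 'en-us'
    // We can rip through just this one column; no other columns are needed
    // to evaluate the predicate.
    for (size_t i = 0; i < code.size(); i++) {
        if (code[i].value && *code[i].value == "en-us") {
            std::cout << "match at shredded position " << i << "\n";
        }
    }
    return 0;
}
```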
This question is, does SQL know about JSON? The SQL standard has JSON constructs and data types, but that's SQL at the logical level, the SQL that the programmer sees. Underneath the covers, the database system is free to store the data any way it wants, and Dremel made the decision to do it this way because they want to optimize for the common case of just doing lookups down paths. The extreme alternative is everything is a blob and then I have to parse it every single time on a query; this avoids all that. You're basically materializing it as if you parsed it ahead of time. Okay. In the interest of time, I don't want to spend too much more on this, but you get the general idea that we're breaking this up, we're generating columns, and we can use that to figure out the path we just came through. This question is about the Great Britain one. We're now here, right? We have a new Name, the Name is a repeated group, and within that repeated group we have the Country; now we have Great Britain here, and our repetition is one. That might actually be wrong; this is from the Dremel paper, and I had to fix some of these things. Yeah, so the values according to the Dremel paper were different: the second row, the 'en' with one-two, would be two-two, because the repetition level isn't defined as how many times the group has repeated, but the level in the path at which the repetition happened, whether it was Name or Language that repeated, so it's two. Got it, okay. Yep. All right, we'll fix that later. So yeah, the typo here is in treating it as the number of times the group has been repeated at that level. Again, the low-level details maybe don't entirely matter. The idea is taking JSON, taking something and breaking it up, because we can do this at the physical level, and that'll make queries run faster later on, and the application programmer doesn't know and doesn't care. All right, in the interest of time I'm going to skip past length and presence encoding. Basically, the idea there is that for each level you always record whether something exists or not. The Dremel paper we read earlier has specific experiments showing that the shredding approach is better. Okay, so let's get back now to where we left off last class, talking about the big picture of Parquet and ORC, showing how they have different levels of sophistication and complexity in how they encode things. But these file formats were really designed for a different hardware era, like 10 or 12 years ago now; Parquet and ORC are from around 2011, 2012. Back then the network was always the slowest thing, then disk was slow, then memory, and everything in the CPU was fast. They were making a trade-off to use heavyweight compression schemes like Zstandard and Snappy, because that would reduce the size of the blocks they were fetching from data files. But now the hardware has shifted such that the network is actually really, really fast; on Amazon you can get, I think, 100 gigabit network connections, for instance, and it's not that expensive. So the trade-offs and design decisions that the Parquet and ORC people made, they're not wrong, it's not that they were doing something stupid; it's just that the hardware landscape has shifted so much that we need to revisit what they were doing.
So there are a couple of other problems in these formats that are going to bite us when we want to start vectorizing the operations inside of our engine. The first is that Parquet and ORC can generate variable-sized runs. They make decisions on how to encode different things at different parts within a column chunk, and that means as you're scanning along, trying to decode it to find the data you're looking for, you have to have conditionals to figure out whether my data is this way or that way, and sometimes it might be one size versus another size. Well, that's bad for SIMD. Does everyone know what SIMD is? Has anyone taken 418 or 618? Who here doesn't know SIMD? Okay, it wasn't in the paper? Okay, well, let me give you a quick crash course. All right: single instruction, multiple data. These are classes of CPU instructions you get on modern processors that allow you to do the same operation on multiple pieces of data at the same time. Contrast this with SISD in Flynn's taxonomy: single instruction, single piece of data. Like X = 1: a single instruction takes the value 1 and puts it into the register for X. So say you want to do something like vector addition, X plus Y equals Z. The way you would typically write this with SISD instructions is a little for loop: you iterate over all the elements of X, assuming X and Y are the same length, add them, and store the result in Z. With SISD you're literally running through the for loop adding one element at a time. Yes, you can unroll it, and that'll speed things up, whatever, but at the end of the day it's still a single instruction per element of our vectors. The idea with SIMD is that I can break the data I'm trying to operate on into chunks, lanes, that I store in these SIMD registers. In this case here it's really simple: say I'm storing 32-bit integers and putting four values in a single register, so I have 128-bit registers. The largest register you can get now is 512 bits; the FastLanes paper talks about a hypothetical 1024-bit SIMD register, but that doesn't exist, 512 is the state of the art, and we'll cover that more later on. So now what I can do is, with a single instruction, the CPU will take this register and that register, add them up, and write the result out to another register; that's one instruction. For the next block, the next portion of the data, we issue another SIMD instruction to add it up. So what was, just counting the addition instructions, eight instructions to add X and Y together, I can now do in two. I'm ignoring the cost of getting things into and out of the registers, and we'll see problems with AVX-512 where the CPU will slow itself down when you start using AVX-512 in some cases. So it's not magically free, there is some work to actually do this, but that's the general idea of SIMD. And we'll see more about this next week, and then a few lectures after that as well. Okay, but as I said, these registers are always going to be 128-bit, 256-bit, or 512-bit, and every element I'm storing in a lane has to be the same size. So if I have variable-length encoding, I've got to pad everything to the same size before I can load it into the register, and that's expensive.
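Here's a minimal sketch of that X + Y example in C++ with SSE intrinsics, assuming an x86-64 machine; it's just to make the lane idea concrete, not how any of these file formats actually implement anything:

```cpp
// Adding two arrays of 32-bit integers: first with a scalar (SISD) loop,
// then four lanes at a time with 128-bit SSE registers.
// Compile on x86-64 with: g++ -O2 simd_add.cpp
#include <cstdint>
#include <cstdio>
#include <immintrin.h>

int main() {
    alignas(16) int32_t x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(16) int32_t y[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    alignas(16) int32_t z[8];

    // SISD: one addition instruction per element -> 8 adds.
    for (int i = 0; i < 8; i++) {
        z[i] = x[i] + y[i];
    }

    // SIMD: four 32-bit lanes per 128-bit register -> 2 add instructions.
    for (int i = 0; i < 8; i += 4) {
        __m128i vx = _mm_load_si128(reinterpret_cast<const __m128i*>(&x[i]));
        __m128i vy = _mm_load_si128(reinterpret_cast<const __m128i*>(&y[i]));
        __m128i vz = _mm_add_epi32(vx, vy);  // one instruction, four adds
        _mm_store_si128(reinterpret_cast<__m128i*>(&z[i]), vz);
    }

    for (int i = 0; i < 8; i++) printf("%d ", z[i]);
    printf("\n");
    return 0;
}
```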
The other problem with these formats is that, as I said before, they want to eagerly decompress everything. They don't expose the dictionary to the execution engine of your database system to let you do lookups on the dictionary itself; when you iterate over a column chunk in Parquet or ORC, they give you back the decompressed values. Likewise, if you're using block compression, a naive scheme like Zstandard or Snappy, the database system can't see anything inside of that compressed data, because it's opaque to it: Zstandard and Snappy have their own encoding scheme, and we can't actually interpret any values within the compressed data. The other problem is that in some encoding schemes, specifically delta encoding and run-length encoding, there are dependencies between adjacent values in our column chunk, and that also means we can't use SIMD, because there's no easy way to pass data from one element to another element if they're in the same register. Delta encoding is taking the difference from your neighbor, the preceding value in the column; so if you load that up in a SIMD register, you can't do that delta addition very easily. The last one, which is not really that big of an issue for us at this point in the semester, is the portability of the implementation, because there's a lot more hardware out there now, a lot more vendors, between ARM and RISC-V and GPUs and TPUs and Xeons. Even within just ARM and x86 themselves there are all these different versions of the ISA with different SIMD capabilities, and there's no guarantee that if you write low-level intrinsics, the low-level instructions to do SIMD, that what works on one system will work on another. So ideally you want to rely on the compiler to figure out how to vectorize this stuff for you, but Parquet and ORC, because of certain design decisions they made, can't do that. Yes? The question is, are there libraries, like libsimd, that abstract away the low-level details of certain operations, so that if you write your code against that library, then whatever ISA or hardware you land on, they handle it for you? Yes, but I'm not aware of anybody actually using those, at least in databases. We have friends on the inside, and when we ask them whether they're using intrinsics or some kind of abstraction layer, everybody's writing the intrinsics. I don't know what DuckDB does, though; they are trying to be very portable, so we can look at what they do. Yes? The question is, why are dependencies between adjacent values bad? Because, if I go back to my SIMD example here, say these aren't raw uncompressed values but delta encoded: starting at the top it's 8, and then the next one is 15 because it's 8 plus 7, or something like that. I can't load that into my register and have it do arithmetic against the value right next to it; the thing next to it has to be copied into another register and you start shifting things around. Yes? We're getting there, yes. You're a bad movie writer, because you're basically giving away the ending, right? You're telling the ending of the story at the beginning; you have to set the mood. Okay. Yes, the answer is FastLanes did solve this particular problem.
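To make that delta-encoding dependency concrete, here's a tiny sketch: decoding is a prefix sum, so each output depends on the one before it, which is exactly what a straight SIMD add across adjacent lanes can't give you. The numbers are made up.

```cpp
// Naive delta decoding: out[i] = out[i-1] + delta[i].
// The loop-carried dependency on out[i-1] is what keeps a plain SIMD add
// from helping here; FastLanes reorders the data to break this dependency.
#include <cstdio>
#include <vector>

int main() {
    int base = 8;
    std::vector<int> deltas = {7, 2, 5, 1, 3};  // made-up values

    std::vector<int> out;
    out.push_back(base);
    for (int d : deltas) {
        out.push_back(out.back() + d);  // depends on the previous output
    }

    for (int v : out) printf("%d ", v);  // 8 15 17 22 23 26
    printf("\n");
    return 0;
}
```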
So I'm going to talk about three different schemes today. BtrBlocks and FastLanes are brand new; those papers just came out in the last year. BitWeaving is an older idea from Jignesh Patel, the other database faculty member here, that came out almost 10 years ago. But I still think it's worth looking at, because it's a completely different way of thinking about how to store data, so I want to cover that a bit. The way to think about BtrBlocks is basically Parquet++: it's still in the same overall flavor as Parquet, just with better lightweight encoding schemes, and it handles nesting with a recursive algorithm that applies encoding schemes automatically. FastLanes is the paper you read; how did you guys find it? Again, it's just a different way of thinking about how to store data that people really haven't considered that much. People have been sorting data forever; that's not a new trick. But to purposely go out of your way to store it in an almost arbitrary-looking order, because that's the best way to decode it at runtime, that's not a common way to think about building database systems. Again, the main takeaway from all of this is that at the SQL layer, the application programmer doesn't know what we're doing underneath the covers. They'll get all the benefit; things will be faster and cheaper and more efficient, but they don't have to rewrite anything in their application code, because SQL is just going to run fine and the database handles all of that for them. All right, so we'll go through each of these one by one. BtrBlocks is a PAX-based file format out of TU Munich, from the guys in Munich who have a system called Hyper and a new one called Umbra. They have very good database professors who are probably some of the best people in the world, and this paper came out of their group last year. The idea with BtrBlocks is that it does more aggressive nested encoding schemes than Parquet and ORC. Parquet only did dictionary encoding for strings, and didn't try to do any additional optimizations on the codes that came out of it. ORC tried to be more sophisticated and had heuristics for integer columns to figure out whether to compress things this way versus that way, but it was basically static heuristics. With BtrBlocks, what they do is a more or less greedy algorithm that figures out, for each column chunk, the best encoding scheme by looking at a sample of the data; we'll talk about how they generate that sample. Then they apply that encoding scheme, which may produce more columns, and as long as those are of a fundamental type, you go back and run the same algorithm to figure out the best encoding scheme for those derivative columns. So you basically get almost all of the benefit of something like Snappy or Zstandard, but you can still natively operate on and decode the columns without having to decompress everything. And so, on purpose, they do not use Snappy or Zstandard on top of this. Same thing with FastLanes: they're not even going to touch that stuff, because it's too slow and it hides everything from the database system.
Now, interestingly, BtrBlocks makes the argument that they don't want to store the metadata, the schema and information about what's in the file, in the file itself; they say that's better left to some management service. But that breaks the portability property we talked about before: with Parquet, I can just hand someone a file, and everything you need to decipher what's inside of it is in the file itself. I would chalk this up to a philosophical design decision rather than, oh my gosh, they're wrong or they're right. In some cases it makes sense, in some cases it doesn't. Yes? The statement is, if the metadata is embedded, are you safer or not against corruption? Right, so his argument is, okay, if the file gets corrupted in some way and it trashes the metadata, then because it's stored in the file you can't do anything. But the flip side is, if you're now storing your metadata as separate files in some other service, there are more moving parts that could cause problems, right? Furthermore, if we're storing this in an object store, Amazon is replicating all that stuff, I think, three or four or six times. So in all honesty, the likelihood that the file gets truly corrupted and I can't recover it is low, and if it's mission-critical, I have offsite backups. If my business fails because this one file gets corrupted, that's my fault for not making sure it's written in stone, you know what I mean? So I think that would be the argument there. Yes? So she's correct: the thing we talked about last class was that in ORC they had a bunch of different encoding schemes, and as you're trying to decode things it's trying to figure out on the fly which decoding scheme to use; doesn't this have the same problem? My understanding is that within a column chunk, BtrBlocks picks one encoding scheme, whereas ORC is trying to be clever on smaller runs. That said, they only compare against Parquet in this paper; they don't compare against ORC. And for the experiments in our paper, we didn't compare against this either. So that's an open question. Yes? The comment is that the argument they make in this paper for storing the metadata as a separate file is that it allows them to retrieve that file, which is much smaller than the actual data, look at the zone maps and other information, and figure out whether they need to look at the data file at all. But as we said last time, with S3 I can do a range request, so whether that metadata is in the footer of the file or in a separate file, from our perspective it's the same. Yeah. Okay. So let's look at all the encoding schemes they have. A bunch of these we've already seen; two of them we'll cover in more detail. We already know about run-length encoding. One value is the extreme example: literally your column of 64,000 tuples has one value, so you just store that once; it's the extreme case of RLE. Frequency encoding we didn't talk about, but it comes from IBM's DB2 BLU system from a few years ago. Basically, you look at your column and figure out what the most common value is, the one value that appears most often.
You store that value separately, and then you have a bitmap to say where it occurs in the column, and then all the other values that are not that top value you store in an uncompressed way; but then you can feed those back into the encoding scheme and compress them further. What's a good example of this? Say everybody loves h*** except for one person: you store "everyone loves h***: yes" once as the common value, and then for the few people that don't like h***, you store those separately. Stupid example, but that's the basic idea. Frame of reference with bit packing we talked about last time; it's related to delta encoding, but they're not doing delta encoding here. The variation is that you store the minimum value within a column chunk, and then each value's delta from that base value. Dictionary encoding we've already covered. Pseudo-decimals: we didn't really talk about fixed-point decimals, but the basic idea is that they convert floating-point numbers into integers by figuring out where the decimal point is, storing the integer version of the number, and recording what power of 10 you need to convert it back to a decimal. I'll briefly talk about FSST in a second; it comes from the Germans and the DuckDB people, and it allows you to do compression on strings. Instead of dictionary encoding, where one code represents the entire string value, you have separate codes for individual byte sequences. Think of a column of URLs that all start with "https": I can store a separate code just for "https" and then additional codes for the other parts of the URLs. We'll talk about that in a second. And then roaring bitmaps are a way to do compressed bitmaps; they use these for nulls and exceptions, like in frequency encoding: if I'm keeping track of the locations where the most frequent value occurs, I store that as a roaring bitmap. Again, we'll cover that in a second. So again, no delta encoding, because it's not SIMD-friendly; the FastLanes people will fix that. So the selection algorithm works like this. Basically, you're going to collect some sample data from your column. Recall that in the case of ORC, ORC was using a look-ahead buffer to look at the next 512 values and figure out which encoding scheme to use next. What BtrBlocks does is assume you have the entire column chunk ahead of time, sample from it, and use that to determine the best encoding scheme for that column chunk. But you can't just do random sampling by jumping to individual locations, because that'll make run-length encoding look terrible: you'll miss the continuity of repeated values in a sequence. Likewise, if you just grab the first 100 values, that can make other schemes look bad, because you might hit a bunch of repeated values at the beginning, run-length encoding looks great, but it's actually not the optimal scheme for the whole chunk. So what they basically do is jump to 10 different locations in a column chunk, which is 64,000 values, so it's basically 1%, and at each location they grab 64 contiguous values.
So that gives you the spatial randomness within the column chunk, but also the contiguity you need to figure out whether RLE makes sense. Then you run the algorithm, figure out the best encoding scheme, and, as I said, sometimes the encoding schemes produce more outputs, and you can feed those outputs back into the next round of encoding selection. Yes? The question is, is there something special about the greedy algorithm, as opposed to what? They're trying all of them. Oh, they're trying all of them, yeah. So say my original data is this column of integers, or this column of strings; this is the algorithm. It's an integer, so they're literally going to try all of the integer schemes on the sample, and they say it's roughly a 2% overhead on the compression cost. The comment is, presumably you only do this once as you load the data into the database, and it feels expensive to encode and construct it this way. Right, so his comment is basically, when are we doing this? It's when we're loading the data into the database. And yes, the encoding is an expensive cost, but we're willing to pay that cost up front, once, because that makes the common case of running queries faster. Absolutely, yes. They're saying running this algorithm is a 2% overhead, and I think that's a fair trade-off. So in this example here, the candidates can be raw uncompressed with no encoding, RLE, a vectorized partial frame of reference with bit-packing, one value, which we talked about, and dictionary. Let's say in my stupid example here it picks run-length encoding as the best. Well, that can produce two columns now, one for the actual values and one for the run lengths. They'll recursively try up to three times: for every output, as long as it's something that can be compressed again, they feed it back into the algorithm and try again, up to three rounds. If you land on bit-packing, there's nothing more you can do after that, so the algorithm terminates. My example here is for integers, but they have similar decision trees for strings and doubles, and those are the core data types we care about in databases. Yes? No, him. Yes? The question is, does it ever go back and correct itself, like, after recursively applying the schemes, it turns out the optimal choice was actually another path down the tree? No. But how do you know the sample is representative? Yeah, so the statement is basically, if I start encoding and realize this kind of sucks, it's not working out as well as I thought it would, do they ever go back and try again? I don't think they do, but I don't know. Okay. The question is, what is the cost savings of only looking at, what, 10% of the data? One percent; they're looking at one percent. And compared to what, looking at the whole data set and trying all of the schemes, an exhaustive search? Wait, scanning the entire data set? Yeah. Say I want to bulk load one terabyte of data, no, go to the extreme, one petabyte, right? So yeah, that's not feasible, right? And also, someone's got to pay for that compute.
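Here's a rough sketch of that sampling-plus-greedy idea in C++; this is my own simplification, not BtrBlocks' actual code, and the "cost" of each candidate scheme is just a crude estimated compressed size on the sample:

```cpp
// Sketch of BtrBlocks-style scheme selection (simplified, assumptions mine):
// sample ~1% of a 64K-value column chunk as 10 runs of 64 contiguous values,
// then estimate each candidate scheme's compressed size on the sample and
// pick the cheapest.
#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

using Col = std::vector<int32_t>;

// 10 runs of 64 contiguous values, spread across the chunk.
Col sample_chunk(const Col& chunk, size_t runs = 10, size_t run_len = 64) {
    Col s;
    size_t stride = chunk.size() / runs;
    for (size_t r = 0; r < runs; r++) {
        size_t start = r * stride;
        for (size_t i = start; i < start + run_len && i < chunk.size(); i++) {
            s.push_back(chunk[i]);
        }
    }
    return s;
}

// Very rough size estimates (bytes) for a few candidate schemes.
size_t est_rle(const Col& s) {
    size_t runs = s.empty() ? 0 : 1;
    for (size_t i = 1; i < s.size(); i++) if (s[i] != s[i - 1]) runs++;
    return runs * 8;  // (value, run length) pair per run
}
size_t est_dict(const Col& s) {
    std::set<int32_t> distinct(s.begin(), s.end());
    return distinct.size() * 4 + s.size();  // dictionary entries + 1-byte codes
}
size_t est_raw(const Col& s) { return s.size() * 4; }

int main() {
    Col chunk(64000, 7);  // toy chunk: mostly one repeated value
    for (size_t i = 0; i < chunk.size(); i += 100) chunk[i] = (int32_t)i;

    Col s = sample_chunk(chunk);
    size_t rle = est_rle(s), dict = est_dict(s), raw = est_raw(s);
    printf("rle=%zu dict=%zu raw=%zu -> pick the smallest\n", rle, dict, raw);
    return 0;
}
```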
Again, it's a trade-off. I'm willing to pay this computation overhead, and 2% seems reasonable to me, in order to make queries run faster. And if you did his example, or your example, of trying everything or backtracking, what would the percentage improvement actually be? Probably not worth it. Yes? The comment is, when he was running the code, there was some mechanism where an encoding, for example dictionary, can start and then have to fall back. Yeah. So the statement is, when he ran the code locally, they try an encoding scheme and start encoding with it; how far into it do they go before giving up? Basically, the recursion depth default is 3, but the decision is based on the sample. What he's asking is, if you scan the real data and realize you got it wrong, the sample said one thing but the real data looks different, do you roll it back? They don't do that on the full data; they can roll back on the sample itself, which is fine, because the sample is of the real data. Okay, so they'll roll back on the sample. And again, their column chunk size is 64,000 values; it's not that big, you can do everything in RAM. Okay. All right, so going back, these are all the encoding schemes they have. I want to briefly talk about FSST, because we'll see it when we talk about DuckDB, and it'll come up again when we talk about how to pass intermediate results from one operator to the next. And then roaring bitmaps, which are just a better way to do bitmaps. All right, so FSST comes from this 2020 paper. It's the FastLanes guy, Peter Boncz, it's Viktor Leis from BtrBlocks, and Thomas Neumann, who's probably the best database researcher in the world; we'll read a lot of his papers. Those three got together and decided to build a compression scheme for strings that allows fast random access. Think of dictionary encoding: you're taking the entire string and representing it with a single code, but then you can't do partial lookups on that code to find prefixes and other things, because you have to go look at the entire string itself. So the idea here is that they replace frequently occurring substrings of up to 8 bytes with 1-byte codes. And all the values, once they're encoded with FSST symbols, still occupy fixed-size slots, so you have to do some tricks to record "this is the end of the string; don't look at any more symbols." The way they generate the symbol table is actually kind of interesting, because it's essentially an NP-complete problem to figure out the optimal set of symbols that produces the smallest number of codes and the most compression benefit. So rather than doing standard dynamic programming or something fancier, they use what they call an evolutionary algorithm: any time you think you have a good symbol as you're constructing the symbol table, you put it in a hash table, and if an entry is already there, they kick it out. So it's not like the linear probing hash tables we talked about before, or the chained hash table, where if someone's in your slot you keep going until you find a free position.
They immediately kick out whatever is there. The idea is that as you're constructing the symbol table, the symbols that actually matter a lot, the ones that provide a lot of benefit, may keep getting kicked out, but they keep getting used again and get added back, and over time you end up with a reasonably good set of symbols. Yes? The question is, is the output fixed length? Yes, the codes are one byte, but the substring a code maps to can be variable length, up to eight bytes, which is fine, because again the codes themselves are fixed length, so we know exactly where each one is. And so the question is, what if you have a symbol like "https" and a bunch of URLs use that same code, but then someone has a weird URL that's just "https"? Then you need a way to denote that the string is actually terminated, so you don't interpret any of the remaining bytes in the fixed-length portion of the value; otherwise you'd go look up and effectively add symbols that aren't actually in the original string. So again, this is basically what Zstandard or LZ4 or Snappy do inside their compression schemes, but there it's opaque to the database system. This is an explicit scheme where we can expose the symbol table to the database system, and we know exactly how to match codes to strings. So you can do all the same tricks: you can do dictionary-style lookups, find prefixes and other things, by just looking at the symbol table, and only look at the real values for some types of queries. Yes? The question is, is it possible to have a code refer to another code, basically a substring that contains another code? So you'd interpret some of the bytes in the encoded string not as characters but as another code that you do a recursive lookup on. But then how do you record that that portion of the string should be another lookup? A lot of the design decisions they made here, like not doing linear probing to find a free slot and instead immediately kicking out whatever's there, were made so that you can do all of this in SIMD, because you can't have conditionals or loops in SIMD. By making everything follow the same structure, if someone's there, kick them out, just overwrite them, you can vectorize all of it. In your scheme, you'd have to have some bits somewhere that say, oh, by the way, at this offset in this string, don't interpret it as a varchar or an ASCII character, it's actually a code you want to feed back in, and you wouldn't be able to do that with SIMD. All right, so again, I don't have a demonstration of what this looks like, but I can post something on Slack that Peter gave me a few years ago. One thing that is cool and does show up a lot in data systems these days are Roaring Bitmaps. Out of curiosity, who here has heard of Roaring Bitmaps before? Yeah, very, very few, right?
Basically it's a way to store a bitmap index with different data structures based on how often bits are set to true within some portion of the range you're trying to record. A bitmap index just tells you whether the bit at some position is set, yes or no, so you can use it for set membership and so forth. The dense chunks get stored as uncompressed bitmaps, because there really isn't any way to make that better, it's literally just the bits; you could then go back and recompress those with nested encoding like RLE, but we'll ignore that for now. The sparse chunks get stored as packed arrays of 16-bit integers. There are implementations of this in pick-your-favorite programming language, and a lot of different data systems out there use it. Pilosa is the open-source version of a system called FeatureBase, and FeatureBase basically stores a lot of its data as Roaring Bitmaps, using the bit-slicing techniques we'll see in a second. But here's the basic idea. Say we split up the range of values we're going to support; in this case here I have four chunks. For every single key I want to set to true, or look up to see whether it's set to true, I just divide it by 2^16, and that tells me which container I go down to. Then within the container for that range, I set something to true based on how that container is actually being stored. In the default setting, if the number of values set to true within that range is less than 4096, you store it as an uncompressed packed array; otherwise you store it as a bitmap. So say I want to set key 1000 to true. I divide it by 2^16, I land in this first partition, and I keep track of the number of bits set to true in this container; at this point it's zero, so I just store it in the packed array as a 16-bit integer, the original offset value. Now say I want to store this other key. Same thing, divide by 2^16, I land in partition three, but now I see this container is stored as a bitmap, so I just do the math to figure out what offset within that range I should set to true. In this case here, you do the math like this, you get position 50, so you jump in and set bit 50. That's it. And as you delete and insert things, it adjusts back and forth between which data structure it uses. Yes? The question is, what's the overhead of storing everything as a bitmap? In this example here, I have 2^16 different values per container, so I'd need to store 2^16 bits that I can set to true; that's expensive. The follow-up is, if I have 2^16 and I go beyond that, can't I just remalloc? You could do that, sure, but then you're treating every range the same, and it may be the case that the domain range is wide but within it the data is sparse. So now I'm jumping to different cache lines or different chunks of memory just to go see whether this one thing is set to true, whereas if it's a packed array, which I can then compress again, that's better. Yes? No, this is from a French-Canadian guy, Daniel Lemire.
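Here's a minimal sketch of that container logic in C++; it's a toy version of the idea (key >> 16 picks the container, then array versus bitmap based on the 4096 threshold), not the real CRoaring library, which also has run containers and many optimizations:

```cpp
// Toy Roaring-style bitmap: key >> 16 selects a container; a container is a
// sorted array of 16-bit offsets while it holds fewer than 4096 values, and
// converts to a 2^16-bit bitmap once it gets dense. (Simplified sketch only.)
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

struct Container {
    bool is_bitmap = false;
    std::vector<uint16_t> array;   // sparse representation
    std::vector<uint64_t> bits;    // dense: 2^16 bits = 1024 u64 words

    void set(uint16_t off) {
        if (!is_bitmap) {
            auto it = std::lower_bound(array.begin(), array.end(), off);
            if (it != array.end() && *it == off) return;
            array.insert(it, off);
            if (array.size() >= 4096) {  // convert to bitmap when dense
                bits.assign(1024, 0);
                for (uint16_t v : array) bits[v >> 6] |= 1ull << (v & 63);
                array.clear();
                is_bitmap = true;
            }
        } else {
            bits[off >> 6] |= 1ull << (off & 63);
        }
    }
    bool contains(uint16_t off) const {
        if (!is_bitmap)
            return std::binary_search(array.begin(), array.end(), off);
        return bits[off >> 6] & (1ull << (off & 63));
    }
};

struct RoaringSketch {
    std::map<uint16_t, Container> containers;  // high 16 bits -> container
    void set(uint32_t key) { containers[key >> 16].set(key & 0xFFFF); }
    bool contains(uint32_t key) const {
        auto it = containers.find(key >> 16);
        return it != containers.end() && it->second.contains(key & 0xFFFF);
    }
};

int main() {
    RoaringSketch r;
    r.set(1000);            // lands in container 0, stored in the sorted array
    r.set(3 * 65536 + 50);  // lands in container 3 at offset 50
    return r.contains(1000) ? 0 : 1;
}
```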
And it's used in a lot of systems for bitmaps. It actually takes inspiration from a paper from the Germans called ART, the Adaptive Radix Tree, which I'm not going to cover; there they keep track of the population as you go down the trie and change the node size accordingly, so it's sort of the same idea. Yes? The question is, has anyone tried to do this exact thing a level down as well, so you have chunk partitioning and then partition it again, hierarchical chunking? Yes, there are hierarchical bitmaps, but you get screwed on superscalar CPUs; it's just too much indirection. I have a slide for that, but we're not covering it this semester; it's an idea from the mid-90s and no one does that anymore. And in the case you're proposing, why bother doing the extra level? Just make the top level larger. Yes? Say that again? The question is, how do I interpret the bit? It's a bitmap index. You want to ask, is this key set to true? After doing the division to figure out which container I go down to, I know that whatever the bit position is within the container, the original key is the starting point of that container's range plus the offset, so you can reverse it back. So within this container, if I want to know whether position 50 is set to true, I know that corresponds to my key here, and I can check whether that bit is set. Which one is this? I mean, it's a PowerPoint slide, I don't know. Yeah, it's a bitmap index; I think we covered it in the intro class. Basically, again, if I want to know whether the value at a given position is set, ignoring how that position actually gets mapped to something, I can look to see whether the bit corresponding to that position is set to true, and then some higher-level part of the system interprets what that means. The question is, is this a rudimentary version of a Bloom filter? A Bloom filter is a probabilistic data structure where you can get false positives; with this, you don't get false positives. If you want to know whether something's in there, this tells you, yes or no. The question is, can you have more than one value mapped to the same index? No, because we're dividing by 2^16 to figure out which partition we go to, and then we take the mod, which is basically the remainder, to figure out which bit within it, so you won't have any overlap. The question is, what if you set the same key again? What are you trying to do with it? Think of it as storing the null bitmap: I can store it as this, and I'm not going to set the tuple at offset 50 to null multiple times, because that doesn't make sense. It's not a counting data structure, it's just set membership. Okay. All right. So, Parquet and ORC generate variable-length runs of values; BtrBlocks is less susceptible to this, but you could still have it across column chunks. And BtrBlocks explicitly avoided delta encoding, because with delta encoding you have this problem where the value for one given tuple depends on the preceding value.
And again, you can't use SIMD for that. In the case of BtrBlocks, they always run the encoding over full vectors, even if the data you end up encoding is smaller than the number of lanes you have in the SIMD register. So think of it like this: if my SIMD register can hold 16 values but I only have 12, they'll still use all 16 positions in the register; the last four are just garbage, and they clean that up afterwards. In the case of FastLanes, which we'll see in a second, they align things in such a way that you're always guaranteed to be doing useful work in your SIMD registers. So FastLanes is not a complete file format in the same way that BtrBlocks is; it's a low-level encoding scheme that achieves better data parallelism by reordering the tuples in such a way that you're always maximizing the amount of useful work you do with your SIMD instructions. They have all the same encoding schemes as BtrBlocks, but with the addition of delta encoding. And what's really wild about this paper is that, rather than designing it for one configuration of SIMD from one CPU vendor, they basically say, hey, we're going to make our own virtual ISA, and that's going to have 1024-bit SIMD registers, even though that hardware does not exist. They allude to things like the M1 having 1024-bit cache lines and so forth; it's a way of foreshadowing the arrival of 1024-bit SIMD registers. I remember seeing a talk from somebody at Intel saying this is not happening anytime soon, but that was a few years ago, maybe things have changed. The idea is that they define all their operations against this virtual ISA, and then they show how to map that either to scalar SISD code, which apparently still runs really well, or to existing SIMD instruction sets. So the key thing they're doing is this unified transposed layout. And again, the idea is that you reorder the values in a column, the tuples in a column, in such a way that you can do as much work as possible entirely in SIMD. The reason we can get away with this, as I said before, is because we have this independence between the physical layer and the logical layer. The relational model is based on unordered sets. As the application programmer, when you put data into a database, you should not expect that the data will come back to you in your queries in the same order that you inserted it. Most of the time, in some systems, you'll actually get that, but in the case of Postgres, as soon as the auto-vacuum runs, it starts moving tuples around, and there's no guarantee you'll end up with the same ordering. If you cared about ordering, you would explicitly have an ORDER BY clause. And if you think about it, the optimal ordering for one given column versus another could depend on what the query actually wants to do. So instead, they make the choice: we'll store this in the best way for us to process the data, and let the query engine above figure out how to stitch things back together.
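Here's a simplified illustration of the reordering idea in C++; this is not the actual FastLanes unified transposed layout (which is defined against their 1024-bit virtual registers and is more elaborate), just the core trick that if deltas are taken with a stride equal to the SIMD width instead of against the immediate neighbor, the interleaved streams become independent and every SIMD add does useful work:

```cpp
// Simplified illustration (assumptions mine): deltas are computed against
// element i-K rather than i-1, so the K interleaved streams are independent
// and there is no loop-carried dependency inside a K-lane register.
#include <cstdio>
#include <vector>

constexpr size_t K = 4;  // pretend our SIMD register holds 4 lanes of int

int main() {
    // deltas[i] was computed against element i-K; the first K entries are bases.
    std::vector<int> deltas = {8, 3, 5, 1,  2, 4, 1, 6,  7, 0, 2, 3};
    std::vector<int> out(deltas.size());

    // The inner loop over K lanes has no dependency between lanes, so the
    // compiler can turn each group of K into a single vector add.
    for (size_t i = 0; i < K; i++) out[i] = deltas[i];
    for (size_t i = K; i < deltas.size(); i += K) {
        for (size_t lane = 0; lane < K; lane++) {
            out[i + lane] = out[i + lane - K] + deltas[i + lane];
        }
    }

    for (int v : out) printf("%d ", v);
    printf("\n");
    return 0;
}
```

The catch, as the lecture notes, is that the decoded values now come out in this interleaved order, so the layout has to record enough information to map positions back to the logical order when you need to stitch columns together.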
If you wanted to record the order that things were inserted in, you could add a selection vector that keeps track of the position of tuples as they arrive, but the overhead of that pretty much negates the benefits you're getting. So again, all the algorithms are defined against this virtual ISA, and then they just emulate it with AVX-512 or scalar instructions. In the interest of time, I'm just going to show one example of how this works, using a column that we'll convert into run-length encoding and then into dictionary encoding with deltas, and we'll see how to do everything in a vectorized way with their reordering. All right, so here's our original column; we have a bunch of character values here. First we convert it to run-length encoding: we have the run values here, and then for each of those you store the run length as a separate integer. The numbers on the bottom are just telling you the positions within the vector they correspond to. Then on this we can do delta encoding: the starting base value here is zero, and you can read this going across as adding the delta to whatever the preceding value was. The index vector then tells you, once you've materialized things and reversed the delta encoding, which actual symbols you want to get back. So they set things up like this, but then they take this delta-encoded vector, because the index vector you don't decode directly, you just materialize it and then run the delta decoding on it, and they order things in such a way that values that were contiguous in the original data aren't one after another; in this example here, they're four elements apart. So now when you want to decode this vector, because it's delta encoded, the base vector is now four elements instead of just one as before. As I start off, I take these four elements, I do the SIMD addition to apply them to this vector here, and I produce the output here, and they do some extra steps to make sure things are written out to the output in memory at these different locations, because these correspond to the positions they exist at in the original index vector. Because if I just wrote them out so that this is right next to this, right next to that, then that would screw up all the ordering I need for the offsets to jump to other columns. So even though things are coming out incrementally out of order, we want to space things out so that they land back in the right order, and they talk about the bit shifting and other SIMD operations they do to make this work. Then we slide the window over to look at the next group, and in this case we take the output that was generated from these values here, and now we do a SIMD add to apply it to the next group to produce the next set of outputs, and likewise we keep going down the line like that. Yes? The question is, do we store the top-right one? Do we store only the yellow one or the whole thing? You store the whole thing, and then the yellow. And the yellow, yes. Isn't that much worse than what we'd store with plain run-length encoding?
So the statement is, isn't this much worse than plain run-length encoding, because the compressed size is larger? Yes, but the decoding is faster. Again, it's classic computer science: compute versus storage. I can store less data, but then it's more work for me to decompress it. Again, nobody does this as far as I know; there's no open-source system that stores data like this. This is wild. And again, the paper talks about other ways to handle this for the different encoding schemes, but the basic idea is that we're spraying bits out into these vectors so that when we go to decode them, they line up nicely in our SIMD registers and we don't have to do scatter/gather stuff to move things around into the form we actually need. You can't really decode run-length encoding directly with SIMD, because you'd basically need conditional loops to say, okay, the run length here is seven, let me loop in SIMD seven times. You could codegen it and do it, but we'll see in a week or so that codegen brings a whole bunch of other problems that make life harder. Yes? The question is, can you also compress this further, the way BtrBlocks would? These guys don't, because if you do run-length encoding on top of this, you're back to the same problem. All right, I want to finish up by talking about bit slicing and bit weaving real quickly. So all the schemes we've talked about so far, Parquet, ORC, BtrBlocks, FastLanes, are all about looking at the entire value of each tuple, in its entirety, every single time you scan a column. And that means you can't short-circuit the scan or the filter if you recognize early on that this data is never going to match. You can do this for strings if the string is decoded: if you ever look at the string compare operation in libc, it's just a for loop that looks at every single character, and if it doesn't match the thing you're looking for, it breaks out of the loop; that's what short-circuiting is. But if we're comparing two integers, ignoring SIMD, it's a single instruction: is this equal to this? At the lowest level of the hardware, you're looking at these primitive data types; you can't do any tricks to say, oh, I recognize that the first bit of these two values doesn't match, so why compare the other 31 bits? Because that's the interface, the API, that the hardware provides you. But for our own data layout, what if we could do this? Is there a way to recognize that we can look at just a subset of a value, do comparisons based on that, and only look at the rest of the data for that value if we think it's still going to match, if we need to? This basic idea is called bit slicing, and it's an old idea from the 1990s. There was a system called Sybase, or I guess it's still around, Sybase IQ, that does this, and the Pilosa / FeatureBase system I mentioned does this now. The basic idea is that instead of storing the actual integers with all their bits contiguously, it's like an extreme case of the column store. The column store took a row and broke it up, storing each column's values contiguously; now with bit slicing, we take the bits within a column and store those contiguously.
So the first bit of every single value, for all tuples, store that contiguously, and then so forth for the other bits. So let's see an example here. These are all the places I've lived in my life. I grew up in Maryland, 21042. There's Compton, there's Pittsburgh, I lived in Wisconsin, a bunch of places. So we're going to take, say, 21042, convert it to its binary form, and now we're going to store a separate column of bits for every single one of these positions. Now these are 32-bit integers; I'm showing 17 bits because it has to fit on PowerPoint, right? We always have a null bitmap as well, but then we just scan along, look at all the bits, and store them across separate vectors. And we'll do the same thing for all the other ones, like this. Again, think of these as just contiguous bitmaps. And I can use roaring bitmaps now to represent this, right? In some cases the most significant bits are sparse while the least significant bits are dense, right? So now I want to do lookups, queries. Select star from the customer table where zip code is less than 15217. I can now walk across each slice, construct a result bitmap to see which tuples at different offsets are matching my predicate at the bit level, and if I determine I'm not going to see any more matches as I go along, I stop, right? So this is the bit representation for 15217. Say, for simplicity, I just look at the first three bits, because these are all zeros. That means that if any tuple has a bit set in these first three vectors, I know it can't match my predicate, because it's going to be greater than 15217. So I know I don't need to look at that position anymore, right? That's the basic idea of bit slicing. The original algorithm was all scalar instructions; we'll see BitWeaving in a second, which can do this in SIMD. But with bit slicing you can do some other interesting things. For some queries, like aggregate queries, there are actually really simple operations to compute these things quickly. So if you want to compute the sum of integers, I can use the Hamming weight, which is just counting the number of bits that are set to one in a column. And Intel provides instructions to do this very quickly using POPCNT, right? There's one instruction to compute the number of bits that are set within some vector. So now I just count all the bits in the most significant slice, multiply that by its power of two, here two to the sixteen, go to the next slice, count all the bits, multiply that by two to the fifteen, and go all the way down, and I end up with the aggregation for my column. Again, that's way faster than doing integer additions to compute the sum. So bit slicing, those were the original ideas from the 1990s. Jignesh Patel was exploring this topic in the previous decade, and I think he's looking at it again now, with this technique called BitWeaving. The idea here is that it's an alternative encoding scheme for column databases predicated on this idea of bit slicing, but done in such a way that you maximize the SIMD opportunities you actually have. What's wild is that he did this work in 2013, when AVX2 didn't have all the scatter/gather features, or the AVX-512 stuff we'll see in two weeks, right?
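Backing up to the aggregate trick for a second, here is a hedged sketch of the popcount sum, assuming the 17-bit slices from the slide. 21042 and 15217 are the zip codes mentioned above; the other two values are stand-ins, and bin(s).count("1") stands in for the hardware POPCNT instruction.

    WIDTH = 17

    def bit_slice(values):
        slices = [0] * WIDTH                   # slices[i] = bitmap of bit i, MSB first
        for pos, v in enumerate(values):
            for i in range(WIDTH):
                slices[i] |= ((v >> (WIDTH - 1 - i)) & 1) << pos
        return slices

    def sliced_sum(slices):
        total = 0
        for i, s in enumerate(slices):
            weight = 1 << (WIDTH - 1 - i)        # this slice's power of two
            total += bin(s).count("1") * weight  # popcount the slice, then weight it
        return total

    zips = [21042, 15217, 90220, 53703]          # last two are stand-in values
    print(sliced_sum(bit_slice(zips)), sum(zips))  # both print 180182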
And so the horizontal BitWeaving approach we'll see is entirely scalar, but the vertical one is basically the same thing as bit slicing, except it shows you how you can use SIMD for this, even though back in the day they didn't have all the SIMD capabilities that we have now. So Jignesh was building this in a project called Quickstep. Think of it as like DuckDB before DuckDB: it was an embedded OLAP engine, but it didn't have a SQL front end. It was just a storage manager that could run OLAP queries and store things as columnar data. So think of it almost like RocksDB, but for OLAP queries. He spun it off as an Apache project, but then I think it died in 2018. The code is still there, and I think somebody's still working on it, but the BitWeaving stuff isn't actually in the open-source version; the academic version had it, and as far as I know, no other system implements this. All right, so there are two different encoding schemes that they propose. The horizontal one is basically row storage at the bit level, and the vertical one is going to be like bit slicing, but done in such a way that you can be clever about getting better parallelism through vectorization. I'm going to kind of rush through this, but I just want to give you the flavor of what's going on. So with horizontal storage, the idea is that here are all the tuples we want to store, and here's the bit representation of their values, shown in red. We're going to break this up into segments. Again, think of a segment like a row group within our data file, and within a segment we're going to store the data in order going from top to bottom. So the first vector will have t0, the second vector starts at t1, then t2, t3, and then t4 wraps around into the first one. And we have the same thing for the other segment, but because we don't have additional entries, we don't need to store the other vectors, right? In my demonstration here I'm showing these as eight-bit vectors, but this would really be a processor word, which I think for x86 is 16 bits, because it's from the 80s, but ARM is 32 bits. Basically, this is the largest representation that the processor can operate on. All right, so in addition to storing the values as three bits, there's going to be this padding bit here that we're going to use as a place to record, for a given operation, was it true or false, right? So when you store things in this bit-weaving approach, you always have to have this extra space. You're paying a one-bit penalty per tuple to store it in this manner. But that's going to allow us to do operations where we just record whether the filter, or whatever we're trying to do, evaluated to true, and store the answer here rather than at some other location in memory. So let's see an example here. We have a query: find all the values in our table less than five. Say we start with the first vector, which has t0 and t4. And then we have our encoding for the value five, which is 101, and we have repeated copies of that, with the bits lined up against all the lanes up above for the tuples we're trying to compare against.
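As a concrete sketch of the packed layout up to this point: a hypothetical pack_word helper in Python, assuming the three-bit codes, one padding bit per code, and two codes per eight-bit word from the slide; a real implementation would pack into 64-bit processor words.

    FIELD_BITS = 4                       # 1 padding/result bit + 3 code bits

    def pack_word(codes):
        """Pack codes into one word, first code in the highest field;
        the padding bit (high bit of each field) is left as zero."""
        word = 0
        for c in codes:
            assert c < (1 << 3)          # codes must fit in 3 bits
            word = (word << FIELD_BITS) | c
        return word

    # The first word of the segment holds t0 = 1 and t4 = 6, as in the example.
    w = pack_word([1, 6])
    print(f"{w:08b}")                    # 00010110, i.e. [0|001][0|110]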
And then there's a mask vector, and they define formulas that specify how to do this arithmetic using only bit-level operations. So for this comparison, whatever the formula is here, it comes out to all ones. Now I can just do the operation, and it produces the selection vector that tells you whether the predicate evaluated to true: the padding bit gets set to one. So in this case here, t0 is one, which is less than five, so that's set to true. And t4 is six, which is greater than five, so that's set to zero. What's nice about this is that it only requires three instructions to evaluate a single word, so I'm comparing multiple values with just a few instructions, whereas if I were running this on plain columnar data, ignoring compression and encoding schemes, I'd basically have to ask: is one less than five, true or false? Is six less than five, true or false? One at a time. I can SIMD and vectorize that to make it run fast, but even without SIMD, if I store the data in this bit-weaved pattern, I can use regular SISD instructions and get the same kind of data parallelism I'd otherwise get. Now, though, if we apply this to all our vectors, we end up with a bunch of these per-word selection vectors, and we have to put them back together to get the offsets of the tuples in our column that actually satisfied the predicate. To do this, all you need is bit shifting to slide everything over so many steps, and then you can collapse it with an OR operation to generate the selection vector that says whether the tuple at a given offset matched. And if I need to go back to the original value, I can use that to go get the original tuple. The problem with the selection vector is that it's just bits, right? We need a way to reverse that and say, at what position is the bit set to true, to know what offset in our original vector matched. The easiest thing to do is iterate with a simple for loop: if the bit in the selection vector is set to true, add its position to an output buffer. But that sucks, right? That's a for loop, that's slow, just to convert bit offsets into values. And as far as I know, there isn't a SIMD instruction, there isn't a CPU instruction, to do this for us. So the alternative is a trick that came from the VectorWise work from Peter Boncz and the paper you're reading for next Monday. You pre-compute all the positions ahead of time. Then you just have a simple array that says, okay, if I take this binary encoding and convert it to the actual number, in this case it's 150, I jump into my array at offset 150, and there I'm storing the positions that tell me which bits are set to true. And in my simple example here, the selection vector is eight bits, so there are two to the eight possible values; this thing easily sits in L2 cache. It's not some big chunk of memory I have to maintain just to convert bitmaps into positions. All right, last one: BitWeaving vertical. For this one, we're going to store all the bits at some position contiguously, right? So we'll have one vector for all the bits at position zero, the next vector for position one, then two, and so forth, right?
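Before the vertical scheme, here's a hedged sketch of the two horizontal tricks just described: a three-instruction less-than test (XOR, ADD, AND; if a field's constant is bigger than its code, the addition carries into the padding bit), and the precomputed table that turns a selection byte like 150 into match positions. The masks assume the four-bit fields from the previous sketch, and the names are illustrative.

    ONES = 0b0111_0111                  # low 3 bits of each field
    PAD  = 0b1000_1000                  # padding bit of each field

    def less_than(word, const_word):
        # Per field: (2^3 - 1 - x) + c carries into the padding bit
        # exactly when x < c.
        return ((word ^ ONES) + const_word) & PAD

    X = 0b0001_0110                     # t0 = 1, t4 = 6 (from the pack sketch)
    C = 0b0101_0101                     # the constant 5 in every lane
    print(f"{less_than(X, C):08b}")     # 10000000: t0 < 5 true, t4 < 5 false

    # Precomputed positions table: 2^8 entries, each listing the set bits
    # of one possible selection byte; small enough to sit in L2 cache.
    POSITIONS = [[i for i in range(8) if (b >> i) & 1] for b in range(256)]
    print(POSITIONS[150])               # 150 = 0b10010110 -> [1, 2, 4, 7]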
And then we get down to this other segment here, which only has two values. We still have to record the entire vector; we just have a bunch of zeros in there. So it's wasted space, and it wastes instructions in the way the FastLanes guys don't like, but it makes our life easier when we want to do the calculations, right? So this is one vector here, and so forth, and again, each one is the same size as a processor word. But now we don't have this padding bit anymore to record things in, right? We can do everything here in SIMD. So I want to do a lookup and say, find me all the tuples where the value equals two. Again, two is just a bitmap like that. I take the first vector, I generate a mask vector that I'm going to use for my comparison, and because these bits of the constant are zero, I want to see where that matches across the vector, and I get my selection vector like this. Then I run popcount and ask, is there at least one bit set to one? If yes, I keep going. If no bits are set, then I short-circuit and terminate, without having to look at the other vectors. So I only need to look at a subset of the data within a given value, at the bit level. In this case here, some bits are set to one, so I go down to the next vector, do the comparison with the mask that finds all the values equal to one at the different positions, and now my selection vector is all zeros, so I know nothing else could ever match my predicate, and I stop. In this stupid example here, I have three-bit values. If I had 64-bit or 32-bit values, I could stop much earlier, and within the SIMD instructions I'm executing, I'm looking at way more values than I would otherwise if I were comparing entire integers. So we do all the early pruning, like we did with bit slicing, again skipping the remaining vectors if all the bits produced are zero, and the paper has a bunch of algorithms to handle all the other operations you'd want to do. I kind of rushed this, and we're not going to see this technique used in other papers, but it's just a different way to think about how to actually store data in a database, which I like. All right, so I've said this multiple times in today's lecture: this is really showing you that logical and physical data independence is super important, right? Someone was starting to say things like, oh, why don't you have pointers to the data? And as someone who's all about the relational model, that gives me nightmares. Pointers to what? Why? That's a bad idea. We want to be able to use these fixed-length offsets and do anything we want underneath the covers, and not worry about explicit pointers to different things, or about how programmers would actually see those pointers, right? Everything's done underneath the covers. That way they can just write SQL, write whatever Python code they want that operates on our system, and nothing changes. And then data parallelism through SIMD is a really important tool we'll see throughout the entire semester, right? The paper for Monday next week is the precursor to using SIMD for this stuff. It was written in 2006 or 2007, so SIMD wasn't as useful for databases as it is now, but it's about designing the query processing model of the database in such a way that you can vectorize a bunch of stuff, okay?
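Returning to the vertical scan for a moment, here is a hedged sketch of the equality lookup with early termination, assuming the three-bit values from the slide: one bitmap per bit position, most significant first, and a running mask of tuples that could still match. The still == 0 check plays the role of the popcount test.

    WIDTH = 3                  # 3-bit values, as in the slide example

    def slice_vertical(values, n):
        """slices[i] = bits of position i (MSB first) across n tuples."""
        slices = [0] * WIDTH
        for pos, v in enumerate(values):
            for i in range(WIDTH):
                slices[i] |= ((v >> (WIDTH - 1 - i)) & 1) << pos
        return slices

    def equals_scan(slices, n, const):
        mask = (1 << n) - 1                # all tuples start as candidates
        still = mask
        for i, s in enumerate(slices):
            bit = (const >> (WIDTH - 1 - i)) & 1
            still &= s if bit else ~s & mask   # keep lanes matching this bit
            if still == 0:                     # popcount-style early exit
                return 0
        return still                           # selection vector of matches

    vals = [1, 5, 2, 7, 2]
    sel = equals_scan(slice_vertical(vals, len(vals)), len(vals), 2)
    print(f"{sel:05b}")                        # 10100: tuples 2 and 4 equal 2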
But we'll see algorithms for how to do joins, filters, and other things using SIMD going forward.