Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience.

We're going to start at the bottom of the system stack of this conceptual database system we've been building and work our way up, until we're actually able to run queries and produce results. So we're going to start at the very bottom of the system and describe what the data is actually going to look like.

The first thing we need to do is remind ourselves what workload we're targeting. We've been talking about OLAP systems, and their workload is going to look different than an OLTP system's. That's going to inform how we want to design and lay out data on disk or in memory, and then all the auxiliary things we need to do to support that. The primary access method, or access pattern, we're going to have in an OLAP workload is sequential scans: we're going to take large chunks of data and scan some subset of the columns in the table, large segments at a time, and therefore we're going to do a bunch of things to make sure that runs as fast as possible. Typically there isn't going to be a data structure like a B+tree or a skip list or the other things you would use to help find individual tuples, because we don't care about individual tuples in an OLAP workload. We care about aggregate information. We care about sequential scans. The only time we'll ever care about finding individual tuples is when we have to stitch results back together, because we're going to use a decomposition storage model, a column store, where we break up the attributes of a single tuple. That's fine, because a lot of the processing in our sequential scans will again be on a subset of those columns, but at the end, when we need to produce a final result, we may need to go find the other attributes for a given tuple and put things back together. We're not going to want to use a B+tree or another data structure like a hash table to do that for us.

Again, as a contrast to remind ourselves from last semester: in an OLTP environment, OLTP workloads care about finding individual things, like go find Andy's orders, go find Andy's bank account. In that world we want to use something like a B+tree to go find those things efficiently. And because OLTP systems also need to support inserts, updates, and deletes, those data structures have to be dynamic; they automatically resize themselves as we insert new data. We don't care about any of those things in our world.

Yes? The question is: with stitching, why would you do it at the very end? That's late materialization, which we'll cover in two weeks. But the basic idea is that I want to avoid having to stitch the tuple together for as long as possible, because I don't know whether I'm even going to need that tuple as I'm going up the query plan. So if I can hold off actually putting it back together until the very end, then I avoid unnecessary I/O. And to make sure we understand what stitching means: again, this is going to be a column store — that's what today's talk is about. We're going to break up the tuple into its different attributes, and we've got to put it back together at the end. Stitch. Okay.
Question? The question is: if we're not using any of these index structures at all, and we go do a sequential scan to get the five tuples we want, are we just passing each tuple up as we see it? How do we actually stitch things back together — what additional metadata do we need that says, hey, you're part of this tuple, put that together? We'll cover that later. The basic idea is that you just record offsets. That's query processing; that's later. All right.

Okay, so if all we're going to do for OLAP workloads is mostly run sequential scans — and that's not entirely true: there are smaller range scans where we don't scan the entire thing, and sometimes you do need to go find individual tuples, and for all of that we could add additional things to our database system, but for now we're going to ignore it — if all you're going to do is sequential scans, how can you actually optimize for them? This list here is basically the menu of what's available to us. We talked a little bit about some of these in the intro class, and a lot of the others we're going to talk about in this class.

Today's class is about data encoding and compression: how to minimize the amount of storage space it takes to represent data, to represent tuples. Prefetching is identifying what data I'm going to need as I'm scanning along the table, and going ahead and bringing those things into memory before the execution actually needs them. So when it says, hey, I need this block — voila, it's already in memory, or already in a local cache for us. Parallelization is going to allow us to run multiple queries at the same time, and within a single query, to run multiple tasks, query plan fragments, or different portions of a query plan at the same time — across different threads, different processes, different nodes; it kind of doesn't matter. Clustering / sorting is recognizing that the data can be sorted in such a way that when queries ask for data within a certain range, you can minimize the amount of data you have to look at, because you know that data within some range is located spatially close together. Late materialization is what he was asking about: can I delay having to stitch the tuple back together until the very end? Because I don't want to pay the cost of reading things from disk or from memory if I know I'm actually not going to need them, and I don't want to take up memory materializing things either. Materialized views and result caching are basically recognizing that I'm going to execute essentially the same query over and over again, so I keep the result of that query around, and when someone asks for it again, it's already there. A materialized view is a specialized case of this, where you recognize that there's a bunch of queries that need to process the same subset of data — like: give me all the orders for this month. So maybe I can pre-compute that portion of the query, and even though the different queries may do different things with that month's worth of data, I've already done the basic work of getting the month. Yes? The question is: how is this different from the data cube stuff?
So it's basically the same thing. In the data cube world, back in the day — and probably even now — they're not very dynamic; you basically had to manually refresh them. With a materialized view, at least the idea is that if I make incremental changes, the system can then refresh the view. Now, in the ideal case you don't want to do this by re-computing the entire query, because that's the dumbest thing you can do, right? Like, I insert one tuple and I re-compute this one-second or ten-second query. There are some systems that can do incremental updates. Admittedly, this is the part of database systems I know the least about; it's super hard, and we're actually not going to cover it this semester. Result caching is obvious: you just do pattern matching on the query string, say it's the same thing, and reuse the result. This is not that common either.

Data skipping is being able to identify, before actually looking at the data, that I don't need it, and so not have to process it. Data parallelization, or vectorization, is sort of a specialization of regular parallelization or task parallelization. The idea here is that within a single chunk of data, maybe multiple tuples, can I use things like SIMD to process multiple pieces of data at the same time? And then code specialization / compilation — we'll cover this later this semester. The idea is that since I know the type of data I'm processing and I know what the query is going to be, rather than have the execution engine interpret what the query wants to do, I can literally generate C code that does exactly what the query wants, compile that, and run that. It doesn't have to be C; you could use intermediate languages and things like that. But this will be a big thing we cover later on. So again, for this semester, we're not going to talk about materialized views, nor the previous item on this list. For this class, though, we're going to talk about today's lecture topic, data encoding and compression, and the data encoding part will then help us next class when we talk about how to encode data in such a way that we get better parallelism through vectorization.

Yes? The question is: can I get the summary of materialized views again? The basic idea is to first think of a regular view. A view is almost like a macro for a query: you take a SELECT query and create a view for it, and you treat it like a virtual table. Any time you do a SELECT on it, the database system is basically replacing the name of that table with the nested query that's defined in your view. But in that world, with regular views, every single time you run the query, you're re-computing whatever SELECT query the view points to. A materialized view basically says: generate the result when you declare the view, and then any time anybody queries that view, you don't have to re-compute it — you have the result already. And the further thing you can do is define materialized views that maybe don't exactly match the query you want to run. Like: give me all the orders for this month. You define that as a materialized view, and then any time anybody inserts a new order for this month, you have to refresh that materialized view.
And the database system will do this automatically — if it supports automatic refresh. In Postgres, you have to manually refresh it. In SQL Server and other systems, it'll do it for you automatically, right? It sounds like an index? It sounds like an index, yes — it's like a supplemental data structure holding the subset of the data you're looking for. But the way the database system maintains it, it's just like a temp table, except if you restart, it comes back. The hard part is doing that incremental update. Say my view is the running total of the number of sales for this month. If I insert a new order, I want to update that total. If I'm dumb, I'll just run the whole aggregation query from beginning to end. But if I'm clever about it, I can recognize: oh, this is just a sum of the order amounts, and I know what this new order's amount is. Again, though, not every system can do that.

Okay, so again, this class we're talking about these two things: encoding and compression. And how we do data encoding will influence how we can do data parallelization and vectorization later. We'll cover vectorization in more detail when we talk about query processing further along, but it's hard to avoid talking about SIMD and vectorization in this lecture. Next lecture you'll read the FastLanes paper, and that's all about it; we'll see hints of vectorization in this lecture anyway.

Okay, so we're going to talk about storage models and persistent data formats. I'm not going to talk about intermediate data formats; we'll talk a little bit about Arrow today, but we'll cover that in more detail when we get to query processing. These are the files that are on disk. Whether it's an object store or a local file system, it doesn't matter — these are the actual bits getting put out on persistent storage.

All right, so the first thing we've got to discuss is what storage model we're going to use for these data files. This will be somewhat of a repeat of what we talked about in the intro class, but it's important to go over it again in more detail to understand what PAX is going to do for us later on when we start constructing the file formats. The storage model defines how we're physically going to store the tuples, both on disk and in memory. And this is not about the actual bytes we're storing for each value; it just says: here's some tuple, here are its attributes — how should I organize them in relation to the other attributes within the same tuple, and then across other tuples?

The default storage model in most systems, what most people think about when they think of a database system, is the n-ary storage model, the row store. This is what BusTub is; this is what Postgres, MySQL, SQLite, and Oracle are; this is what most people get. The DSM, the decomposition storage model, is the pure column store, and then the PAX model is a hybrid of the two. And you'll see that we want to use PAX because we'll have better locality for attributes within the same tuple — they'll be in the same block. So again, the default storage model, what most people think of when they think of a database system, is the n-ary model, the row store.
And again, the idea here is that almost all the attributes for a single tuple are stored contiguously in our pages, in our file, or in memory, one after another. I say almost because in some systems, if you have oversized attributes — something more than, say, four kilobytes that doesn't fit in a single page — they'll have auxiliary storage. Postgres calls this TOAST storage; I think some systems call it secondary storage. Basically you just have a pointer to some other blob-store thing that holds the large attribute. But we don't really care about that.

And again, this is going to be ideal for OLTP workloads, because in that environment the transactions and queries are mostly concerned with getting single tuples — go get Andy's order record — and because we're ingesting new information from the outside world through inserts, updates, and deletes. It's really easy to take any new tuple that someone inserted, find a free slot in some page, and lay it all out there contiguously, right? And then when we want to commit the transaction, or later need to flush the dirty page, assuming one tuple fits in a single page, it's one disk write to put it out there.

So the page size in this world is typically going to be some constant factor of four kilobytes. Postgres by default is eight kilobytes. Oracle is four kilobytes, although I think you can tune that; DB2 you can tune as well. MySQL is the biggest one at 16 kilobytes. There might be one that's 32 kilobytes, but it's always going to be measured in kilobytes, single-digit-ish kilobytes, in this world. Because, again, think about the work they're trying to support: go get Andy's order record. My record isn't going to be that big — it isn't going to be megabytes. So it doesn't make sense to have really big pages to store this row data, because I've got to fetch that entire page and bring it in. In the NSM world I can't bring in partial pages; I've got to bring in the whole thing, because I need the header and whatever else is in there for the context of transactions. So the page sizes are going to be much, much smaller than what we'll see for OLAP workloads. And because we're doing single-tuple processing, for the most part we'll want to use the iterator model, the Volcano processing model, in this world. I don't want to talk too much about query processing; that'll come later.

And obviously this sucks for OLAP, because in OLAP we're doing large sequential scans on a subset of the columns, for the most part. If I have 100 columns in my table but my query only needs four of them, I've got to bring in the other 96 columns that I don't need, because it's all packed into the same pages. So this is where the column store — the DSM, the decomposition storage model stuff — came along: people recognized that for a different class of workloads, for OLAP workloads, it doesn't make sense to store everything contiguously per tuple. Instead, you want to break up the tuple based on its attributes, and then store all the data for one attribute across multiple tuples contiguously. That's going to open up a bunch of advantages for compression and other things we can take advantage of. And we can do this because we're mostly running read-only queries.
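To make the NSM-versus-DSM contrast concrete, here's a minimal sketch in C++. The struct and field names are illustrative assumptions, not any real system's layout:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// A hypothetical 3-attribute tuple, like the example table coming up.
struct Tuple {
    int32_t     id;
    int64_t     total;
    std::string name;  // variable-length; we'll fix this with dictionaries later
};

// NSM / row store: all attributes of a tuple sit contiguously, so a scan
// that only needs `total` still drags `id` and `name` through memory.
using RowStore = std::vector<Tuple>;

// DSM / column store: each attribute is contiguous across tuples,
// conceptually one file per column. A scan touches only what it needs,
// but rebuilding tuple i means visiting position i in every column.
struct ColumnStore {
    std::vector<int32_t>     ids;
    std::vector<int64_t>     totals;
    std::vector<std::string> names;
};
```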
So we're not worried about how to do incremental inserts into our database. If someone needs to insert new data, some systems will handle that, but oftentimes it'll be a bulk load of a bunch of data we've received, so you don't have to worry about incremental updates. Because, again, if you had to run a transaction that updates one record with 100 attributes under the decomposition storage model, you'd have to update at least 100 pages and write them all out transactionally. That's going to be slow; that's going to be terrible. So we don't have to worry about that. These file sizes are typically going to be larger — hundreds of megabytes. Within a file we'll break things up; this was the row group stuff you read about in the paper. You break the file into smaller chunks and identify which pieces of it you actually need, but the overall file itself can be quite large.

So, to give you an example of what this looks like, say we have a really simple table: three attributes and six rows, six tuples. What we're going to do is store all of the values for a given column within a single file. There'll be some metadata header at the front that tells us additional information, like what version of the database system wrote this data, plus any statistics, like the zone map stuff, that we want to store. There'll be a null bitmap to keep track of which attributes are actually null, because we need a way to represent that. But again, we'll just store a separate file for each of the three attributes.

Now, we're going to make sure that all the data we're storing is fixed-length. Somebody take a guess why, or who remembers why from last semester? So you can compute the offset. Going back to the stitching thing: if I'm processing at some position in one column and I need the rest of that row's data, I know how to do simple arithmetic to jump to the same row in the other column files. I know the length of each value I'm storing and I know what offset I'm at, so I just do the simple math and jump to the right offset. Of course, how we handle variable-length data like strings, which are super common, we'll see in a second.

Now, fixed-length offsets aren't the only way to do this, but it's the most common one. It's what pretty much everyone does, especially if you have a pure OLAP system, one designed from the ground up as a column store to run these analytical workloads. The alternative is to embed tuple IDs: for every single value in a column you have a little prefix that says, I'm tuple zero, tuple one, tuple three, tuple four. And now if you ever want to find the corresponding attributes for the same tuple across all the columns, you need an auxiliary data structure, like a hash table, that says: for this column and this tuple ID, here's the offset you want to jump to — or at the very least, here's the range where you should start looking.

All right, yes? The question is: if I'm saying everything has to be fixed-length, but you have strings, does that mean you can't do this? No, we'll see how to handle that in a second. The answer is going to be dictionary encoding.
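That positional arithmetic is the whole reason fixed-length values matter. A minimal sketch, with an assumed layout of a header followed by densely packed fixed-length values (not any particular format's spec):

```cpp
#include <cstddef>
#include <cstdint>

// With fixed-length values, the byte position of logical row `row_id`
// in a column file is pure arithmetic -- no B+tree, no hash table.
size_t ValueOffset(size_t header_size, size_t value_size, size_t row_id) {
    return header_size + value_size * row_id;
}

// Stitching row 7 back together across two column files: same row_id,
// each column just plugs in its own value size.
// offset_a = ValueOffset(hdr_a, sizeof(int32_t), 7);  // int32 column
// offset_b = ValueOffset(hdr_b, sizeof(int64_t), 7);  // int64 column
```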
Because we want to convert anything that's variable-length into something fixed-length. The next one is what I'm highlighting here. Like I said, I'm only bringing up the bottom one, the tuple IDs, to say: don't do this. I'm only bringing it up because some systems do it, and as far as I know it's only done in cases where there was a row store system that then bolted on a little specialized storage manager or storage engine — oh yeah, we also have a column store piece — and they needed to reuse some of the existing components they had in order to process column store data, so they add these tuple IDs. But this is what you want to do: you want to use fixed-length offsets. And in Parquet and ORC, everything is going to be fixed-length.

All right, so basically what he asked me was: how do I handle variable-length data? The answer is going to be dictionary encoding. Yes, question? His question is: if I'm doing run-length encoding, do I still have the advantage of fixed-length offsets? We won't cover how to process run-length-encoded data in queries today, but the basic idea is that you do have to scan through to know where to jump (see the sketch below): I'm looking for the tuple at this offset, so I keep scanning until I find the run-length encoding entry that covers my offset. So yes, in pure DSM you would have that problem: in the worst-case scenario, you have to scan to the very bottom, because the thing you're looking for is at the very bottom. This is another reason why we're going to use PAX — it breaks things up into row groups, so that in the worst case, yes, we have to scan the entire column chunk if it's RLE-compressed, but it's only going to be, what, 10 megs or something like that. It's not that big.

Right, so again, why do we use dictionary encoding? It allows us to convert variable-length data into fixed-length values. And if you read the paper, you picked this up: Parquet is actually very aggressive with dictionary encoding — they dictionary-encode everything, even if it's already fixed-length. ORC only does this for strings, for variable-length things. But it turns out Parquet still does pretty well, because dictionary encoding, as we'll see in a second, allows you to convert data that lives in a really large domain down to a much smaller domain, and then you can apply additional compression techniques on top of that and still get a win. It seems counterintuitive to dictionary-encode floats, but we find that it actually works.

This doesn't solve the problem of how to handle semi-structured data in this world. The dumbest thing to do would be to treat semi-structured data as a text field and have the query engine parse the JSON, or whatever it is, as it goes along. That's going to be super slow. Again, we'll see how to handle that in a second: we can convert everything to columns, and if there are string fields in the values of the nested data, the JSON values, we'll just dictionary-encode those too — treat them like a regular column.
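Going back to the run-length encoding question for a second: here's a minimal sketch of why RLE gives up the direct arithmetic jump. The run layout is an assumption for illustration, not how Parquet or ORC physically encode runs:

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// A run-length encoded column chunk: (value, how many times it repeats).
struct Run {
    int32_t  value;
    uint32_t length;
};

// No fixed-length arithmetic is possible: to find logical row `row_id`
// we walk runs until the cumulative length covers it. Worst case we scan
// the whole chunk -- which PAX bounds to a single row group's column chunk.
int32_t RleLookup(const std::vector<Run>& runs, uint64_t row_id) {
    uint64_t covered = 0;
    for (const Run& r : runs) {
        covered += r.length;
        if (row_id < covered) {
            return r.value;
        }
    }
    throw std::out_of_range("row_id past end of column chunk");
}
```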
So, as I said, most OLAP queries are never going to access just a single column of a table by itself. It's very rare to select the sum or average of a single column without any WHERE clause, without any GROUP BY, without any sorting or anything like that. I'm not saying single-column queries don't exist; I'm saying they're not that common — not as common as multi-column queries. So if we use the decomposition storage model, storing a single file per column, then I'm going to have to jump across different files when I need to start putting things back together to process my query. My WHERE clause might reference four columns; I've got to jump to those four files at different offsets and go get the data that I need. So we want a way to keep all the benefits we get from columnar data — better compression, vectorized execution — without the downside of having separate files. This is the problem PAX is going to solve for us.

So this was invented, actually, I think here: the paper is from Natassa Ailamaki. She was the database professor here before me; she was teaching 15-721 until around 2006, then she left to go to EPFL, and I revived 15-721 when I showed up. There's a paper she wrote in 2002 that basically recognized this problem: even when you have a lot of memory, you don't want to pay the penalty of jumping between these different columnar files to put things back together. If you have enough memory to bring in a big chunk of data, you can still keep things in a columnar format, but now the data that's related to a single tuple will be close together. So we get all the benefits of columnar storage but still maintain the spatial locality of the row store.

The way this is going to work: take our example table from before. We're going to horizontally partition the table into row groups. The size of a row group varies per implementation; for now, let's just assume some fixed number of tuples, like three. Then, within that row group, we'll have a header that tells us information about what's in the row group, and then we lay out the data for a single column sequentially, contiguously. Once we're done with all the tuples in that column, we jump to the next column. We're going to call this piece within the row group a column chunk — I think the ORC system calls them stripes; row groups and column chunks are the Parquet terms, and we'll just use those. And once we lay everything out for the first row group, we do the same thing for the next one. Then, in the footer, we're going to have metadata describing what's in this file.
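A toy sketch of that PAX layout in C++; the struct names mirror the Parquet vocabulary, but the fields are illustrative assumptions, not the actual spec:

```cpp
#include <cstdint>
#include <vector>

// One column's values for one row group, laid out contiguously.
struct ColumnChunk {
    std::vector<uint8_t> encoded_values;
};

// A horizontal partition of the table: all columns for a slice of tuples.
// Columnar within the group, but a tuple's attributes stay in one block.
struct RowGroup {
    uint64_t                 num_tuples;  // e.g., the fixed count of 3 above
    std::vector<ColumnChunk> chunks;      // one chunk per column
};

struct PaxFile {
    std::vector<RowGroup> row_groups;
    // Footer goes at the end: schema, row group offsets, zone maps...
};
```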
Yes? The question is: what's the reason we have multiple row groups in one file, instead of having a row group just be its own single file? So, you could do that — you could define your Parquet file to have only one row group. But then you have a bunch of metadata that's very narrowly scoped to that one row group. The idea is that if I put a bunch of row groups in a file, I can pick the right amount — and again, finding the right amount is probably an NP-complete problem; it's difficult to know. The right number of row groups, the right granularity for the scope of the zone map and the other metadata you're maintaining — it's hard to know.

I guess my question is: what's the benefit of having these multiple row groups within a single file, versus just having multiple files? So, I can have a zone map in the footer, in the metadata, that tells me, for my row groups, here's the data they contain — say for some column, like the date, here's the min and the max value across the values in that column. So now if I'm looking for all the orders from this month, and my files have been laid out roughly sequentially by time, I just read this footer and I can figure out that this file contains only orders from 2023. I'm looking for January 2024, so I don't need anything in here — I can skip it entirely. If I do what you're proposing, one row group per file, then I have to go read this header from every single one of multiple files. It's just another level of granularity. But again, what's the right size? It depends — which is a cop-out answer, but in databases it answers a lot of things. It depends on the query. It depends on the data. It depends on what you're trying to do.
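Here's a hedged sketch of that zone-map pruning idea: per-row-group (or per-file) min/max statistics let you skip whole chunks of data without fetching them. The field names are illustrative:

```cpp
#include <cstdint>

// Min/max statistics kept in the footer for one column of one row group.
struct ZoneMap {
    int64_t min;
    int64_t max;
};

// Predicate: WHERE order_date BETWEEN lo AND hi.
// If [min, max] cannot overlap [lo, hi], the whole row group is skipped
// before any of its bytes are read.
bool CanSkipRowGroup(const ZoneMap& zm, int64_t lo, int64_t hi) {
    return zm.max < lo || zm.min > hi;
}
```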
One question: would a single row group be one four-kilobyte page? No — it's going to be tens of megabytes. Again, it's the same granularity issue he brought up: if I'm storing just four kilobytes of data but I have to carry this metadata along — say the metadata is a kilobyte — then I can only fit three kilobytes of data in a row group, and now I basically have a bunch of overhead, metadata I don't actually need. Yes? The question is: shouldn't each stripe be at least one page, so you get the benefit of not pulling in data you don't need when you're only querying a subset of columns? At least a page, yes. Yes? Will the zone map be in the footer of the entire file? We'll cover this in a second, but there will be a zone map per row group in this metadata, and then, optionally, you can have a zone map in the footer that covers the columns for the whole file. This is actually one of the problems with Parquet and ORC: they added this later, and a bunch of implementations don't actually support it. The spec says it's there, but it may not actually be there. Yes? The question is: are the row group boundaries random? We'll cover that in a second — there's a sizing policy. For these different formats, they'll say: I want my row group to be at least this size, or this number of tuples, and there are trade-offs in those choices. Yes? The question is: are the attributes the same length across different chunks? Let me think about that. Within a row group, I think you can actually vary the encoding scheme you're using, so I think they could be different.

But since I have in my metadata what encoding scheme I'm using, if I want the tuple at logical offset six and I jump to this row group, I know it's in here; I use the metadata to figure out, okay, the size of the values for this column is such-and-such, and therefore I can do my math to find it. So the encoding could be different per row group, and certainly different per file, but I just do the same arithmetic I did before. Within a column chunk, though, the values will all be the same size. And to his question: there's global metadata that tells me what's in my file and how things are encoded.

Now, you would think this is essentially the header — remembering the slotted pages stuff from last semester, this was literally in the header, the beginning bytes of a page. Here they're putting it at the bottom. Somebody take a guess why. He says the size is not fixed — the size of what? The size of the metadata is not fixed. Right: these files are big, and I don't know what the metadata is going to be — like, what's the min/max value for all the data I have — until I've processed all the data. And the other reason they put it at the end rather than the beginning is that this comes from the Hadoop world, the HDFS world of 10 years ago, which is an append-only file system. I can't make in-place updates to a file without rewriting it. So if I'm bulk loading a bunch of data to generate a Parquet or ORC file, I scan through all the data, writing out these row groups incrementally, and when I'm done and ready to close the file, I can't go back and put the metadata at the top, because I've already written out the beginning of the file. S3 is the same thing: you can't do in-place updates; you have to overwrite an object entirely. And that would be expensive — if I wrote out a one-gigabyte Parquet file, I don't want to write the whole thing out again just to update the metadata. So they always put it at the bottom. The zone map is actually in the footer for Parquet here. I think the paper mentions that it was added around 2018, 2019 — that was actually my student at Cloudera who added it. She took this class; now she's a PhD student at the University of Maryland. There's some connection there with 15-721.

Okay, so — sorry, yes, the question is: if the metadata is at the end, how do I start processing the file? The entry point for the file is the footer. I think the way it works is that the last 32 bits of the file give the length of the footer, so you read that, and it tells you how far back to read to get the complete view of the metadata. Yes? The question: am I saying this is like a log-structured merge tree? No. Why would I say that? Append-only means that in Hadoop, the file system only lets you open a file and append bytes to it. Hadoop doesn't know what you're storing; it just sees bytes, right? This is not LSM. There's one version of a piece of data, a tuple, and I'm just writing it out sequentially as I go — but organized in this PAX format.
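A sketch of that footer-first entry point. For Parquet specifically, the file ends with a 4-byte little-endian footer length followed by the 4-byte magic `PAR1`; error handling is trimmed here, and the returned bytes would still need Thrift deserialization afterwards:

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Read a Parquet footer by starting at the END of the file:
// [ ... row groups ... | footer | footer_len (4B LE) | "PAR1" ]
std::vector<char> ReadParquetFooter(const std::string& path) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    const std::streamoff file_size = f.tellg();

    // Last 8 bytes: 4-byte footer length, then the magic number.
    char tail[8];
    f.seekg(file_size - 8);
    f.read(tail, 8);
    if (std::memcmp(tail + 4, "PAR1", 4) != 0) {
        throw std::runtime_error("not a Parquet file");
    }
    uint32_t footer_len;
    std::memcpy(&footer_len, tail, 4);  // little-endian on x86/ARM

    // Seek back past the footer and pull the whole (Thrift-encoded) blob.
    std::vector<char> footer(footer_len);
    f.seekg(file_size - 8 - static_cast<std::streamoff>(footer_len));
    f.read(footer.data(), footer_len);
    return footer;
}
```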
Okay. So, prior to 10 years ago, even the early column store systems — the Verticas, the Greenplums, and so forth — had their own proprietary data formats. And I mentioned this last class: for most database systems you can think of — SQLite, Postgres, MySQL, Oracle, and so forth — when they put bits down on disk, the format is proprietary to that database system; only that system is designed to read and write that data. Now, there are some systems — DuckDB is really good about this — that can read SQLite database files and other systems' files, but most systems don't do that. And the problem is that since all these different database systems have their own proprietary formats, you can't share data across disparate systems. I can't take a Postgres data directory, plop it down, and let MySQL read it. Again, DuckDB is trying to be a Swiss Army knife kind of thing; we can ignore that, because that's not how most systems work. So the only way to get data out of one system and into another was to write a SQL query to dump it out, convert it to a CSV, TSV, or JSON/XML file — some other format — and then do a bulk insert into the other database system, which converts it into its own proprietary format.

But with the Hadoop and cloud stuff taking off in the early 2010s, you now had, as I showed last class, all these operational databases that wanted to write data out so you could read it into your data warehouse — and you don't want to do this conversion. You just want to put things out in your object store and not worry about additional conversions. So this is where the Parquet and ORC stuff comes into play, because that's the problem they were trying to solve. They're trying to be a universal file format that one application, or something else upstream in your application stack, could generate, and then you wouldn't have to do a conversion to be able to read it. And you'd get all the benefits of a binary encoding scheme that you'd want in a column store or a PAX layout, without having to convert to a text format. CSV is the worst thing you can do, because basically all your binary data gets converted into ASCII characters, and you've got to parse that and deal with it when you load it back in. So the idea is that you define a spec — here's what the file format is — and then whatever system wants to read it can either use an off-the-shelf implementation (of questionable quality) or write its own (again, of questionable quality).

This is not an entirely new idea; it actually goes back to the 1990s in the high-performance computing world, the scientific community. There was this thing called HDF5 — the 5 means there were versions 4, 3, 2, 1 before it, but 5 is the current one everyone still uses. This was a binary format for storing compressed multi-dimensional arrays, so that whatever random Fortran program you had for processing data from scientific instruments or satellites could use a universal file format across different experiments. But it was almost entirely ignored by the database community.
So around 2009, people recognized: oh, Hadoop wants to generate a bunch of these files to be used for different purposes. The original data format in Hadoop, a thing called SequenceFiles, was literally key/value pairs — serialized byte strings that only your Hadoop application code knew how to process — so there was no embedded schema information. They replaced this with a thing called Avro — I think this came out of Cloudera or the Hadoop project — but this was still row-oriented. So even though column store systems existed at the time, Hadoop was still writing things out and processing data as rows, which sucked. Then there was another format before Parquet, called RCFile — actually before ORC — from Meta. And then, at the same time in 2013, Cloudera and Twitter, who were working on Impala, put out Parquet, and Meta put out ORC for Apache Hive. Hive is basically a SQL query engine on top of Hadoop, on top of MapReduce — it converts your SQL queries into MapReduce jobs. It's still around today, but you don't want to use it.

So again, Parquet and ORC are the dominant ones. At this point they're over 10 years old, and they're still widely used today. CarbonData is an extension of Parquet from Huawei; it adds additional metadata, like a schema version and so forth. I don't know anybody who uses it other than Huawei — the open source version that we tried doesn't work, it doesn't compile — but somebody is still working on it. And then Arrow — I mentioned this before — think of it as an in-memory version of Parquet. It allows me to exchange data between different processes, over the network or in memory on the same box. This came out of Dremio and from the Pandas guy.

So again, for this class we're only going to focus on these two. There are a bunch of other ones I mentioned in the paper: there's an extension, a newer version of ORC called DWRF from Facebook, and then there's this thing called Alpha that we've been talking about, sort of the next-generation one. We can ignore all that; we want to focus on Parquet and ORC — that's what pretty much everyone's using today. And it's not that these formats are the best and what every system should be using; it's just what's commonly used. Like, is SQL the best query language? No, it has problems. But is it widely used, and therefore not going away? Yes. It's the same thing, right?
Parquet and ORC are widely used, so we have to be able to handle them — like on HuggingFace, you can get data natively in Parquet format, directly.

All right, so the things we're going to talk about when designing a file format are: what metadata we want to maintain, the layout of the data, what type system we're going to support, and the encoding schemes and compression schemes we'll use for the actual data itself. Then block compression — after I've done the encoding, can I run something like gzip over it to compress it further? We'll talk about filters, and then we'll finish off talking about nested data.

So, the paper I had you guys read: this is something I wrote with my former PhD student here, who's now at Tsinghua, along with his PhD student, and the guy who invented Python Pandas, who was at Voltron Data. He and I had been talking about how Parquet has some problems — he worked on Arrow, and he worked on some Parquet stuff too — and we wanted to investigate what the problems are, in the modern environment, with these two most popular storage formats. At the same time, it turns out, Microsoft was doing the same evaluation, the same experiments we were. Their paper came out, I think, in October, and ours came out in November, so we dodged a bullet: they approached the problem differently and did a different kind of evaluation, so the papers are complementary. I'll talk a little bit about what they found, but the main takeaway I got is that their findings corroborate what we found in ours. So I felt good as a scientist that it worked out.

So the first thing is what metadata we want to put in the file. I already mentioned the zone map stuff, but the basic idea is that Parquet and ORC are meant to be self-describing file formats, meaning everything I need to know to interpret what the bytes in my file mean is contained in the file itself. Contrast this with a system like BusTub or Postgres or MySQL: you have a bunch of files, and blocks in the catalog keep track of the schema of the tables, what the types are, and so forth, and then you have separate pages for the actual data. So in order to understand the data pages, I've got to go read the pages for the catalog. Oracle is the only system that actually packs everything within the page itself, for disaster recovery — that way, if your disk gets destroyed but you can recover some pages, you can still interpret those bytes, because everything you need to know what those bytes mean is in the page itself. And there are other limitations in Oracle, like you can only have a thousand columns — not just because they want pages to be self-contained; there are other reasons for that. But again, the bottom line here is: everything we need to know about the file is in the file itself. We don't need to go read some external catalog or external data.

So we'll keep track of the table schema, and the way they do this is basically to serialize it: here are my columns, they're of this type and this size, with these additional annotations, and here's how the row groups are encoded — so that when I read the file, I know how to process those bytes and put them back into their original form. Does everybody here use Thrift or Protobuf?
One, a few. Protobuf is from Google; Thrift is from Meta — well, Facebook. Basically, you define a schema, almost like a CREATE TABLE statement: here are my fields, here are their types and their names. And then they have a binary encoding for what that schema's data actually looks like. So the file formats basically piggyback off of this: serialize the bytes and embed them in the file, in the metadata. The big problem, though, is that if I have a really wide table, and I only want to learn about two or three columns out of a thousand, I have to deserialize the entire Protobuf or Thrift message to get at them. So that's going to be a big problem for these file formats. There are better versions of these things now, like FlatBuffers, but at the time this is what existed, and the file formats carry that legacy.

The row group offsets and lengths — we talked about this. This gives us the exact details of how to jump within the file to find the beginning of each row group, and then there'll be some basic metadata, like the number of tuples in each row group, and I can use this information to determine whether I even need to read the rest of the file.

For the layout stuff, we've mostly already talked about it: we're going to use PAX, splitting the file into row groups, each containing one or more column chunks. And the question has come up: what size should a row group be? In the case of Parquet, they base it just on the number of tuples. You can change this — you can specify, I want my row group to be a million tuples, or 10 million tuples, and so forth — but it's always based on the number of tuples. ORC takes a different approach and specifies it based on the size of the data; the default, I think, is 250 megabytes. So what are some pros and cons of each? We already talked a little bit about this. Yes — if the tuples are massive, if I have a lot of attributes — ignoring large-object storage — say I have 10,000 columns, then a million tuples times 10,000 columns is going to be a pretty big row group. So what happens in that case? Why is that bad? His statement is: will it even fit in your storage? At no point this semester should we ever say we're going to run out of disk. S3, for our purposes, is infinite — as long as Amazon keeps charging your credit card, you're never going to run out of storage. If you're at the point where they start running out of storage, they will call you, like: hey, who are you? What are you doing? Storage is infinite for us. Yes — could the data be split across multiple pages? It could be split across multiple pages, assuming a hardware page size of four kilobytes. That's one. Yes — the granularity, the scope of the zone map, is so large that for anything I look for in there, the value range is zero to infinity. Great, that's useless. You're all missing a key thing — one more: I have to bring in the entire row group. So if I have a massive row group, then I need a huge amount of memory just to bring it in, to start processing it and understand what's going on. The benefit, though, is that I can size this in such a way that I can guarantee — or at least maximize my chances — of doing vectorized processing: I'll always have enough data to fill my SIMD
lanes, or to scan across multiple threads and parallelize it.

Yes — the statement is: if you're pulling from S3, would you be pulling the entire Parquet file? No. In S3 you can request byte offsets and lengths, so I jump to the footer, and then, if my zone maps are selective enough, I can say: oh, I don't need this row group, or that row group — and then I fetch only the byte ranges I need. This question: when would you not have enough data to fill your SIMD lanes? Again, say I have a really wide tuple with a lot of attributes; as an extreme example, I might only have four tuples in my row group because each one is just so huge. Now, SIMD lanes may be too fine a granularity here, but don't think in BusTub terms, where one thread runs one operator, gets one page from memory, and does whatever it wants with it. Think of it as: some other piece of the system — the I/O scheduler, whatever you want to call it — says, I need this file, go get the blocks. And then maybe some coordinator figures out, because it's half a gig, that this thread processes the first half and this other thread processes the second half. I want to have enough work for everyone to do.

Yes — this question gets, again, the cop-out answer: it depends. Clearly it depends; sometimes this one is good and sometimes that one is good. Why not a hybrid solution where you support both? That's going to be another theme we'll see as we go along, although we're short on time: increasing the complexity of the file format means that when you're actually processing tuples, you get branch mispredictions, because now there are different code paths you need to consider — plus there's engineering complexity to supporting both, so that Parquet does this or does that. There are other examples of this too. For a transactional system: do you flush the log buffer when you've written so many transactions, or when the log file reaches a certain size, or when you've run for a certain amount of time? There are pros and cons to all these choices, but from an engineering perspective it's just easier to pick one. Or: should I use two-phase locking or OCC? Yeah, there are times you want one versus the other, but everyone just does one. It's already so hard to build the whole system — why add complexity? Yes — this question is: why is not being able to fit a row group entirely in memory a bad idea? I've got to fetch it from somewhere and bring it to the node that's going to process it; if I don't have enough memory for it, where does it go? It's got to go to disk. The statement is: could you break up the row group and process it incrementally? Yes, but then it's another request to get more data as I need it.

So this is just a diagram from Databricks showing what Parquet looks like, and you can see all the things we've talked about: there's the footer; there are these row groups, which are numbered; within them you have column chunks; and within a column chunk you have additional page metadata for the encoded pages and so forth. The exact details of how Parquet versus ORC lay out those internal parts, I don't care that much about. The idea is that there's this hierarchical nature to the file, we can store additional metadata as we go along, and the right size of each part depends on a bunch of different things. All right, so
we're at CMU, and the PL group here loves type systems, so we have to worry about that a little bit. The type system is going to define how our file format actually stores the types themselves — what the bytes are going to look like.

So there's the physical type: the representation we have for a given value. For integers and floating-point numbers we're not going to do anything special, for the most part; floats use the IEEE 754 standard, which specifies how the hardware should represent the data. Think of it like: if I declare a float in C++, IEEE 754 is what I get, because that's what the hardware gives me. We'll talk about strings in a few more lectures; there are some tricks we can do to speed those up. Then the logical types are built on top of the physical types: they basically define how we map some logical type onto a physical type. For example, say I want to store timestamps. What is a timestamp? It's just the number of seconds or milliseconds or nanoseconds from some starting point. Well, that's just a number, so I can store it physically as an int64, and then have a notion of a logical type that says how to interpret the bits within that physical type.

Parquet and ORC have different levels of complexity in their type systems, and that determines how much work the thing generating data for these file formats, or the thing consuming it, has to do upstream to interpret the contents. Parquet has the bare minimum of types: they only have int32, int64, int96, the 754 floating-point types, and byte arrays — they don't even have strings, because you'd interpret a string as a byte array. Interestingly, they don't store 8-bit or 16-bit ints: if you declare an 8-bit or 16-bit int, they're still going to store it as a 32-bit int. The reasoning is: yes, there are a bunch of zeros in the bits you're not using, but a lot of that compresses out, and it reduces the complexity of what they have to support. ORC's type system is much larger. They have all these different types — some are logical, some are physical; they don't really make the distinction — and you can define way more things than you can in Parquet. I'm not saying one is better than the other; it's just how they chose to implement things. So now the encoding schemes are going to specify, for a given physical or logical type — yes? The question is: why isn't ORC's approach better, since if you have more types, you can do more when encoding them? And maybe you'd say you could just represent those as logical types and extend the file format that way. I've never taken a type systems class; this is just my understanding.
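Before moving on to encodings: to make the physical/logical split concrete, here's a minimal sketch. The epoch and microsecond unit are assumptions for illustration; the point is only that the file stores a bare int64, and the schema annotation tells the reader how to interpret it:

```cpp
#include <cstdint>

// Physical type: what the bytes on disk actually are.
using PhysicalInt64 = int64_t;

// Logical type: an annotation layered on top of the physical type.
struct TimestampMicros {
    PhysicalInt64 micros_since_epoch;  // stored on disk as a plain int64

    // The reader, not the storage layer, gives the bits their meaning.
    int64_t ToUnixSeconds() const { return micros_since_epoch / 1'000'000; }
};
```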
So again, the encoding schemes are going to specify, for the physical and logical types, the actual bits themselves — how we store them for contiguous or related tuples within our column chunk. The paper talks about a bunch of different schemes that we covered in the intro class. There's frame-of-reference encoding: it's like delta encoding, but instead of storing the difference from the value immediately before you, you pick some starting point — maybe the min value of a column chunk — and you store the delta from that global value. It's a variant of delta encoding. And then there's patched frame-of-reference encoding, which I think the paper mentions as PFOR: basically, if you have any outliers that would wreck your encoding scheme, it has a way to handle those separately. But we can ignore that.

The one I'll spend time talking about is dictionary encoding, because, one, it's the most common encoding scheme that most database systems support, and it's where you get most of the win in compression. And the different formats vary not in how they implement these schemes, but more in when they're triggered. ORC is very aggressive with RLE: if it sees three or more repeated values, RLE kicks in. In Parquet it has to be eight or more, and you can't change that. And you probably want dictionary encoding first, because you can take the result of the dictionary encoding — the compressed column — and apply all these other schemes on top of it to compress it even more.

Dictionary encoding — again, this is the same as the intro class. The basic idea is that we replace values that occur often in our column with a smaller, fixed-length dictionary code from a smaller domain, and at runtime, if I see a dictionary code in the column, I reference the dictionary to figure out what the actual value should be. This is how we convert variable-length strings and variable-length data into fixed-length values that we can store in our columns. It means the dictionary itself holds arbitrary-length data: the values that were variable-length in our column are now stored in the dictionary, which lives in the header of the row group. The dictionary codes can either be positions within the dictionary — in which case you have to maintain a hash table to figure out how to find that position — or offsets into the dictionary, treating everything as bytes. We can also often sort the values in the dictionary, which helps us get additional compression benefits in some cases. And then we can further compress the dictionary-encoded columns to reduce them even more, using RLE and other techniques.

Now, the different formats have to handle the case where the dictionary becomes too large because there are too many unique values, at which point I lose all the benefit of dictionary encoding. If my column is just monotonically increasing values from one to a billion, it's kind of stupid to store dictionary codes for a billion tuples that are all unique: I end up with the original column, dictionary-encoded with a billion unique codes, plus the billion unique values stored in the dictionary — I've doubled the size. So they have various techniques to figure out: okay, this isn't working out; I don't want to do dictionary encoding. In the case of Parquet, if the dictionary gets larger than one megabyte, they just stop, and everything that comes after that point is stored as the regular, original values. ORC estimates the number of distinct values ahead of time: they have a look-ahead buffer, so when they start writing out a chunk of data, they look ahead at something like 512 values, figure out how many distinct values there are, and assume the rest of the data is going to look like that — and from that decide whether it makes sense to do dictionary encoding or not. If they get it wrong, they do the same thing as Parquet: they basically stop encoding and just
Here's a really simple example: I have some column with a bunch of strings in it. If I'm using an unsorted dictionary, the values I'm compressing appear in the dictionary in the order they appear as I scan through the column, and I can either store the position in the dictionary — the first value here is 'William', so at position 1 in the dictionary I'd find the actual string I wanted — or store it as an offset: if I treat the dictionary as bytes, I know I need to jump to the 7th byte, and that tells me the starting point of my entry, and then I don't need to maintain a hash table. Yes — the question is about the advantages and disadvantages of the hash table. We'll see Arrow later on; Arrow does it this way because then you don't have to serialize the hash table — they pre-sort everything so you can jump into the dictionary more easily. With a sorted dictionary, you collect all the dictionary values ahead of time, sort them, and then, just like before, you store either a position into the dictionary or an offset. But you can see in this example that I have 'Andy' repeated a bunch of times. One win is that I've converted everything to integers in my dictionary codes, but now I also have repeated values, so I can say, oh, I have four 2's in a row, let me compress that with RLE — or I can do delta encoding or frame of reference, because these are all integers now and I only have four unique values in my dictionary, so I can compress these even further. Yes — the question is what happens if the number of distinct values is too high for dictionary encoding but you still have a bunch of variable-length data: then you keep auxiliary data and you just store offsets into that blob.

Okay, so there are a couple of design decisions you have to make. The first is what data you actually want to compress with dictionary encoding. As I said before, Parquet compresses everything — floats, integers, strings, dates — while ORC only compresses strings. The ORC choice seems like common sense, because you'd expect the repeated values to be in strings, not in integers and floats. But when we did our analysis, it turned out the Parquet way is better.

The next question is whether you can compress the encoded data. Parquet is pretty simplistic: it just does RLE and bitpacking. Bitpacking is just saying: I recognize my dictionary codes are 32-bit integers, but my values are within 0 to 20, so I can use 8-bit integers — or an even smaller width — and now my column gets compressed even further. And if I have repeated values I can do RLE on top of that, but again, in Parquet RLE only kicks in with eight or more repeated values. ORC has a bunch of different things it can do — RLE, delta encoding, bitpacking, frame of reference — and it basically runs a greedy algorithm: it looks ahead in the buffer to figure out what the data looks like and runs heuristics to pick the best approach. It always tries RLE first; if it can't, it tries delta; and if it can't do that, it uses either bitpacking or frame of reference.
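Here's a rough sketch of the bitpacking idea — my own illustration; the real Parquet page layout is a specific hybrid RLE/bit-packed format — packing each dictionary code into `width` bits, back to back:

```cpp
#include <cstdint>
#include <vector>

// Pack each code into `width` bits, back to back (illustrative sketch).
// Assumes every code fits in `width` bits (width <= 32). Real formats use
// carefully specified layouts with SIMD-friendly decode kernels.
std::vector<uint8_t> bitpack(const std::vector<uint32_t>& codes, unsigned width) {
    std::vector<uint8_t> out((codes.size() * width + 7) / 8, 0);
    size_t bitpos = 0;
    for (uint32_t c : codes) {
        for (unsigned b = 0; b < width; ++b, ++bitpos)
            if (c & (1u << b))
                out[bitpos / 8] |= static_cast<uint8_t>(1u << (bitpos % 8));
    }
    return out;
}

uint32_t unpack_at(const std::vector<uint8_t>& buf, unsigned width, size_t i) {
    uint32_t v = 0;
    size_t bitpos = i * width;
    for (unsigned b = 0; b < width; ++b, ++bitpos)
        if (buf[bitpos / 8] & (1u << (bitpos % 8)))
            v |= (1u << b);
    return v;
}
```

For the 0-to-20 example from the lecture, each code needs only 5 bits, so a column of a billion codes shrinks from about 4 GB of int32s to roughly 625 MB.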
Another design decision is whether the dictionary generated for the encoded values gets exposed outside the file format, to whatever sits above the library that's processing it. You can probably guess why you'd want that. Going back to the example: say my query is SELECT COUNT(*) FROM the table WHERE name = 'Andy'. If I can look at the dictionary directly, I can immediately see whether 'Andy' is even in there or not. Or — what was that? A range query? Right: with a sorted dictionary, if you want to count the unique values within a given range, you can do that by looking at just the dictionary. That's another example. Parquet and ORC do not do this: you say "scan some column", you basically get an iterator, and it gives you back the columns you asked for in their original form. Underneath the covers the library does all the decoding and decompression for you, which means you can't really push predicates all the way down to the lowest level, to the file itself — you have to work with whatever the library spits back at you. That's a problem.

There's a paper we're not going to cover, from Google, about a system called Procella. It was developed in-house for YouTube, to do analytics but also to serve data online, which we don't care about right now. But there's this little blurb: they developed their own file format instead of using ORC and Parquet, and one of the advantages they call out is that they actually expose the dictionary to the query engine so you can do the predicate pushdown you want. As I mentioned, this is a problem with Parquet and ORC, and the newer stuff people are working on wants to solve it by exposing the dictionary to you — because then you can evaluate directly on compressed data, by encoding your predicate and comparing the compressed predicate against the compressed data, rather than decompressing everything first.
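To illustrate what exposing the dictionary buys you, here's a toy sketch — my own, not Procella's actual interface — of evaluating an equality predicate against the compressed codes: translate the predicate through the dictionary once, then compare fixed-width integers instead of strings:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Evaluate WHERE name = <target> directly on a dictionary-encoded column
// (illustrative sketch). We look the predicate value up in the dictionary
// once; if it's absent we can prune the whole column chunk without touching
// a single row, and if present we only compare fixed-width integer codes.
std::vector<size_t> eval_equals(const std::vector<uint32_t>& codes,
                                const std::vector<std::string>& dict,
                                const std::string& target) {
    uint32_t target_code = 0;
    bool found = false;
    for (uint32_t i = 0; i < dict.size(); ++i)
        if (dict[i] == target) { target_code = i; found = true; break; }

    std::vector<size_t> matches;
    if (!found) return matches;  // value not in dictionary: skip the chunk

    for (size_t row = 0; row < codes.size(); ++row)
        if (codes[row] == target_code) matches.push_back(row);
    return matches;
}
```

An iterator-only API like Parquet's or ORC's can't do this: the library decompresses every value back to a string before you ever get to compare anything.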
All right, the next thing is block compression. This is basically taking an off-the-shelf, naive, general-purpose compression algorithm, pointing it at the blocks — the row groups — and just compressing them. The paper talks about Parquet and ORC; I think the default for both is Snappy. The best modern compression algorithm you'd want right now is actually Zstandard, from Facebook — there's a newer version that isn't out yet, called something different, that's supposedly better — but Parquet and ORC ship with Snappy because that was the thing back when these file formats were invented. The thing you have to consider in deciding whether to do this is whether you're willing to pay the computational overhead of decompressing the blocks when they come back. Even though the data has already been encoded with dictionary encoding or another scheme, you can still get some additional compression, but processing the data becomes much slower because of this extra decompression step. And these are opaque compression schemes: if I run something through Snappy or Zstandard, the data system has no idea what the output bytes mean, and I can't jump to arbitrary offsets within them to find the data I'm looking for — I have to decompress the whole block. This made sense back in 2012–2013 when these file formats were designed, because disks were slow and networks were slow: if I could reduce the amount of data I had to read from storage and bring into memory, I was willing to pay that CPU cost. But things have changed a lot — now the CPU is actually one of the slower parts — so this doesn't really make sense anymore.

The additional metadata we can keep track of are the filters. The only two kinds these formats have are zone maps and Bloom filters — even though the paper calls the first one a page index. What's the difference between an index and a filter? An index tells you where something is and whether it exists; a filter only tells you something could exist, not where it is. So a zone map says: here are my min and max values. If I'm looking for something within a given range and it falls inside that min/max range, it might exist, but I don't know where, so I have to do a sequential scan to find it — whereas a B+tree would tell me exactly where it is, which we don't care about here. So we write out zone maps, and by default Parquet and ORC store them in the header of each row group. You can also store them at the file level, but I don't think that's on by default. Then for Bloom filters: within each row group, they can track whether a value could exist for a given column. A Bloom filter is a probabilistic data structure: it can tell you definitively that something does not exist, but it can only tell you that something may exist — you can get false positives, but never false negatives.

Yes — the question is why it matters whether values are clustered for a Bloom filter. Good question; I don't know why I wrote that, because once a value is hashed it's scattered, so order doesn't matter. That note should be on the zone map slide: clustering matters there because the ranges get smaller and you can throw things out — if the range is zero to infinity, that's a useless zone map. Real quick, the split-block Bloom filter: if your Bloom filter has a lot of bits, then instead of letting your hash functions touch any possible bit, you restrict each lookup to a block — a small subset of the bits — so the whole probe fits in a single cache line. We'll cover cache line stuff later; it's a way to keep everything in L1 and run as fast as possible.
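Here's a tiny sketch of how a scan uses zone maps to skip row groups — my own illustration of the idea, not either format's API:

```cpp
#include <cstdint>
#include <vector>

// Per-row-group zone map: min/max of one column (illustrative sketch).
struct ZoneMap {
    int64_t min;
    int64_t max;
};

// Return the row groups a scan for `lo <= value <= hi` actually has to read.
// A group whose [min, max] doesn't overlap [lo, hi] is skipped entirely;
// groups that do overlap still have to be scanned, because a zone map only
// says a match *might* exist, not where it is.
std::vector<size_t> groups_to_scan(const std::vector<ZoneMap>& zones,
                                   int64_t lo, int64_t hi) {
    std::vector<size_t> keep;
    for (size_t g = 0; g < zones.size(); ++g)
        if (zones[g].max >= lo && zones[g].min <= hi)
            keep.push_back(g);
    return keep;
}
```

This also makes the clustering point concrete: if each group's [min, max] is narrow, most groups fail the overlap test and get skipped; if every group spans nearly the whole domain, the zone maps prune nothing.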
This part is a bit tricky; I'll do what I can with the time. Nested data is also really important. We'll cover the Dremel system later on — there's this paper from, I think, 2010 about a system called Dremel, which is the precursor to BigQuery; Dremel is the internal name for BigQuery. We'll read those papers, but they talk about how, back then at Google, when they were building Dremel, they had all these applications generating protocol buffer data that was nested, semi-structured, and had inconsistent schemas, and they needed a way to process it efficiently. The Dremel paper describes a technique called record shredding, which is mentioned in the paper you guys read; it's an alternative — not necessarily a better — approach to what ORC and other systems do, which is called length-plus-presence encoding. We can cover this again later if necessary, but the basic idea with shredding is that instead of storing the semi-structured data as a single blob in a column — which I'd have to parse every single time I process it — I split it up so that every level in a path is treated as a separate column. Then I can rip through those columns: if I need to check, for a given field in my JSON data, whether some attribute is set to a certain value, I can rip through that one column without parsing everything every time.

So with shredding, instead of keeping the explicit hierarchy of a document for each tuple, I store a repetition column and a definition column that tell me whether a given thing exists for some tuple at some offset as I scan along. Say I have some protobuf definition like this one: there's always a DocId at the top level, so I have a separate column for that, and it's always present — its levels are always zero, because there's exactly one DocId per document, no repetition, and no definition levels pointing me at other nested columns. But then you see that within the Name field I can have a Language, which has a Code, plus this optional string Url, and I can define a separate column for each of these. The repetition level tells me, as I scan along a run of values, which repeated field a value continues — whether it belongs to the tuple started at the top of the hierarchy or begins a new one. I'm going through this fast — we can cover it more next class — but the basic idea is that these additional metadata columns let me traverse back and figure out where I am in the hierarchy.

The easier one is length and presence. This basically says that as I scan through a document and generate the data, if a tuple doesn't have an attribute at a given level, I set its presence to false and leave a blank space. I'm always putting in blank spaces for optional data, so as I reconstruct things I know whether the attribute exists for a given tuple, and the offsets all line up — I can use the presence bits to get back to what I'm looking for. Again, I'm butchering this by going fast, but the main takeaway is that I can split things up by the paths in my JSON documents and then run all the encoding and compression stuff we covered before on those columns.
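As a concrete — and much simplified — sketch of the presence side of that idea (my own illustration, not ORC's actual layout), one optional field shredded into its own column might be a presence bitmap plus densely stored values for only the rows that have the field:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// One optional nested field shredded into its own column (illustrative
// sketch of presence encoding). present[i] says whether row i has the
// field at all; values are stored densely for only the present rows, and
// the presence bits let offsets line back up during reconstruction.
struct OptionalColumn {
    std::vector<bool> present;
    std::vector<std::string> values;  // size == number of true presence bits

    void append(const std::optional<std::string>& v) {
        present.push_back(v.has_value());
        if (v) values.push_back(*v);
    }

    // Reconstruct the field for row i by counting present rows before it.
    // (A real format would keep a rank structure instead of rescanning.)
    std::optional<std::string> get(size_t i) const {
        if (!present[i]) return std::nullopt;
        size_t rank = 0;
        for (size_t j = 0; j < i; ++j) rank += present[j];
        return values[rank];
    }
};
```

The "length" half of length-plus-presence handles repeated fields the same way: a length column records how many child entries each parent has, so you can walk back up the nesting without storing the hierarchy explicitly.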
I'm going to show one experiment from the paper you guys read. What we basically did was this: instead of only looking at synthetic data like TPC-DS and the other benchmarks you'll see in a bunch of papers — which the Microsoft guys used too — we said, okay, let's go find a bunch of real data. There are a bunch of Parquet files you can find on GitHub, so we downloaded a bunch of these datasets, loaded them up in Parquet and ORC, and tried to understand whether the design decisions these file formats make are good for real data. For our evaluation we mostly used the Arrow C++ implementations of Parquet and ORC. The original support libraries for creating and processing these files are all written in Java, because the Hadoop world wrote everything in Java ten years ago, but we wanted the best-performing implementations we could get, so we used C++. The problem is that the file specs for Parquet and ORC have all these things that were added over the years and are defined as optional — like the page index for the zone maps, and parts of the footer — so among the various implementations of these formats, some implemented those things and some didn't. And even for the required parts, there's the question of how high-performance each implementation is: when we looked at the Rust and other implementations, they didn't have SIMD support, or the Parquet one was really good but the ORC one was really crappy. It was really hard to find a true apples-to-apples comparison between these formats, because everyone writes their own thing over the years.

All right, so the first question is how well these things actually compress. The x-axis shows a bunch of different workloads we generated based on real datasets — lower is better — and you can see that Parquet is better than ORC for the logging and ML workloads. Those datasets are mostly floating point numbers — all the weights from some ML model are floats — and because Parquet does dictionary encoding for floating point numbers, you get a big win. This was surprising to me: floating point data actually compresses really well with dictionary encoding. ORC does better on the classic workload and the geospatial workload, because those mostly contain strings and ORC is more aggressive about compressing the dictionary codes — it has those four different schemes it can apply after the dictionary encoding is done. But on file size these aren't huge wins; they're not mind-blowingly different — in my opinion they're within the margin of error. It's not like one is 10x larger than the other, and storage is effectively infinite.

Now, when you actually run queries on these files — simulating what a real query engine would do: scans, range scans, partial selects, point queries — you see Parquet is faster, because it mostly uses bitpacking, and for some of these workloads there isn't a lot of repetition across contiguous values. There's repetition within the column, but you don't see 1-1-1-1 over and over, so RLE doesn't win much — and again, RLE only kicks in in Parquet with eight or more repeated values, while ORC is more aggressive at three or more. The problem is that run-length-encoded data can't be vectorized easily with SIMD, and I don't think the Arrow implementation of the ORC processing library we had was vectorized at all, so we basically lost the ability to run SIMD instructions over the columnar data.

The other thing we saw with ORC — and this will be a recurring theme throughout the entire semester — is that the additional complexity of supporting four different encoding schemes for the dictionary-compressed columns means that at runtime, as you rip through a column, you have to keep checking: for this column chunk, how is it actually encoded? Now you have branching in your code — if the header says it's encoded this way, use this decode function; if it's that way, use this other one — and all that indirection, all those conditional clauses, causes branch mispredictions in the CPU, which on a modern superscalar architecture is terrible, because you have to flush the pipeline and stall and all that garbage. So for that reason, Parquet, being much simpler, works way better. And we'll see, when we talk about sequential scans and applying predicates, that always copying the tuple into your output buffer and then applying the predicate is actually faster in some cases than checking the predicate first and only copying on a match. Always copying seems like the wrong thing to do, but on a superscalar CPU, when everything's in memory and you're trying to rip through it as fast as possible, it turns out to be the better choice.
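Here's a small sketch of that branch-free trick — the standard predicated-selection pattern, not code from either library: instead of an `if` around the copy, always write the value and advance the output cursor only when the predicate holds:

```cpp
#include <cstdint>
#include <cstddef>

// Branching selection: the CPU has to predict the `if` on every row, and
// for selectivities near 50% it mispredicts constantly.
size_t select_branchy(const int32_t* in, size_t n, int32_t key, int32_t* out) {
    size_t pos = 0;
    for (size_t i = 0; i < n; ++i)
        if (in[i] < key) out[pos++] = in[i];
    return pos;
}

// Branch-free (predicated) selection: always copy, then bump the output
// cursor by 0 or 1. `out` must have room for n values. More stores, but no
// data-dependent branch to mispredict, so on a superscalar CPU this is
// often faster despite the "wasted" copies.
size_t select_branchfree(const int32_t* in, size_t n, int32_t key, int32_t* out) {
    size_t pos = 0;
    for (size_t i = 0; i < n; ++i) {
        out[pos] = in[i];            // unconditional copy
        pos += (in[i] < key);        // typically compiles to compare + add
    }
    return pos;
}
```

Non-matching values just get overwritten by the next iteration's store, so the output is still exactly the qualifying rows — you've traded a hard-to-predict branch for a little extra memory traffic.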
And this is why the compilation and specialization stuff we'll see later on matters — the giant switch statements that say if it's an int32 do this, if it's a float do that. If you go look at the actual source code, that's basically what it looks like when these libraries process types, so if you can specialize all of that away, you avoid the indirection. That's another reason Parquet's simplicity helps here.

All right, I'm well over time, so let me finish quickly. The main takeaways: dictionary encoding is effective for all types, not just strings — and again, to me this was surprising. And because the hardware landscape has changed so much — networks have gotten so much faster, and disks have too — we'd want to avoid using Snappy as the default entirely; the native encoding schemes — dictionary encoding, RLE, and all that — are almost always better. I've already said this: hardware has changed. And even though these formats have been hugely successful and are used everywhere, there are a lot of things missing from Parquet and ORC that they didn't consider when they designed them, things that would help us process OLAP queries. There are barely any statistics — just zone maps, tuple counts, and Bloom filters; no histograms, no sketches, nothing about what's inside those columns that I could use for cardinality estimation in my query optimizer. I can't incrementally deserialize the schema: if I have 10,000 columns, I have to deserialize that protobuf thing at the very beginning, all at once. And as I said, there are a bunch of implementation limitations: pick whatever programming language you want, there's a Parquet library for it, but — I'm not saying they're all garbage — none of them supports exactly what's in the spec. Even the Java one doesn't.

Okay, so next class we're going to read FastLanes and we'll cover BtrBlocks. These are modern encoding schemes that go beyond what we talked about today, really designed for modern hardware. The FastLanes one basically says: don't lay things out in memory in the order they arrived, the way somebody inserted them — and not even sorted — but store them in a particular layout, because if you process with SIMD you get much better performance. It's pretty wild. Okay, any last questions? Thank you.