Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. So today we're going to pick up where we left off last class, again talking about how we're actually going to execute queries. Last class we mostly discussed query processing models, and part of that is deciding the shape and amount of data we send from one operator to the next in the query plan, and what control mechanism we use to tell an operator to execute. The distinction there was push versus pull: BusTub is a pull-based system, but we're going to primarily focus on push-based systems. Then at the end we talked a little about how to represent the output of predicates and filters: either a selection vector of offsets (a position list) or a bitmap saying which tuples were satisfied. Going forward this semester, we're going to assume that the conceptual system we're building as we go along is a vectorized, push-based system. Pretty much every system now is vectorized. Push-based, I don't actually know the number, maybe less than half; a lot of systems are still pull-based. But one of the key advantages we touched on for a push-based system is that you essentially have a centralized scheduler with complete control over what tasks execute and when, which gives you fine-grained control over execution. So there may be situations where, if you're running out of memory because too many queries are running, you can just pause in a push-based system: you stop executing or scheduling tasks. Or if an operator needs to fetch something over the network, which is a blocking call, you can de-schedule that task and fire it up again when the data is available. So the push-based model allows more control over exactly what's going on than a pull-based approach. Again, not every system is push-based, but almost every system today is vectorized. So in today's class, we're going to continue where we left off, talking about how to run queries in parallel, the different types of parallelism we have, and essentially how to architect the system so that multiple operator instances can run at the same time, either within the same query or across multiple queries concurrently. Then, starting next week, we'll see how to parallelize the individual operators within the query plan. We'll also talk today about what the operator output looks like; how we reference other parts of a tuple, other columns, whether we put everything together early or do late materialization; what the data looks like going from one operator to the next (spoiler: we're going to use Arrow for this); how we evaluate expressions; and then a quick preview of what adaptive query execution looks like. In the context of the Velox paper you read, we'll focus on how predicate evaluation can be made adaptive. Later in the semester, when we talk about query optimization, we'll see how to do full query adaptivity. And that's the hardest one.
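Before diving in, here's a quick concrete illustration of the two filter-output representations recapped above. This is my own minimal sketch, not code from any particular system:

```cpp
#include <cstdint>
#include <vector>

// Two ways to represent which tuples in a column satisfied a predicate (col[i] < k).

// 1) Selection vector / position list: the offsets of the matching tuples.
std::vector<uint32_t> FilterToSelVec(const std::vector<int32_t>& col, int32_t k) {
    std::vector<uint32_t> sel;
    for (uint32_t i = 0; i < col.size(); i++) {
        if (col[i] < k) sel.push_back(i);  // keep only the offsets that matched
    }
    return sel;
}

// 2) Bitmap: one bit per input tuple, set if the tuple matched.
std::vector<uint64_t> FilterToBitmap(const std::vector<int32_t>& col, int32_t k) {
    std::vector<uint64_t> bits((col.size() + 63) / 64, 0);
    for (uint32_t i = 0; i < col.size(); i++) {
        if (col[i] < k) bits[i / 64] |= (1ULL << (i % 64));
    }
    return bits;
}
```

The selection vector lets downstream operators iterate only the survivors; the bitmap stays a fixed size regardless of selectivity and is cheap to AND/OR when combining predicates.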
Query adaptivity is one of the hardest things you can do in a database system, and you kind of need to design the system from the very beginning to support it rather than retrofit it later. That's mostly on the query plan side; the expression-level stuff you can add later on.

All right, so parallel execution: the idea is pretty straightforward, and some of this will be a review of what we discussed in the intro class. We want our database system to run multiple things at the same time to take advantage of the hardware available to us. Database systems no longer run on monolithic machines that have one CPU with one core and no way to talk to other machines. At the very least, even a single-socket machine is going to have dozens of cores; a multi-socket machine is basically a distributed system already, let alone scaling out to multiple nodes. So in that environment, we want the system to be able to take one query, fan it out across multiple workers, and parallelize the various operations. Sorry, question: does this apply to multiple machines or multiple cores? For everything we're going to talk about today, it doesn't matter whether it's a multi-threaded, multi-process, or multi-node system. Think about what a multi-socket CPU is: you have to communicate something, and it may or may not even be in the same address space, whether it's multi-process or multi-threaded. So if you have the communication channel set up to distribute things, assuming you're not calling low-level pipes to do IPC between processes, you can extend that easily to multi-node. We're not going to talk about what frameworks you could use to hide that abstraction. Follow-up: doesn't the network add latency, and isn't that a big issue? Does it, though? Nowadays? Networks are really fast, especially within the same rack. The CPU is often going to be the bottleneck; in the last two or three years, things have flipped. So again, the high-level approaches we're talking about today don't care whether it's multi-threaded, multi-process, or multi-node.

At a high level, there are two types of parallelism: inter-query parallelism and intra-query parallelism. We covered some of this in the intro class, but we'll go into more detail here. Inter-query parallelism basically asks: can I have multiple queries running at the same time in my system, with something up above, like a coordinator or a scheduler, responsible for figuring out who runs where and at what time? I did some previous work in this space, and when we looked at existing systems, most of them use a really basic first-come, first-served policy: when a query shows up, it's assigned some priority based on arrival order and allowed to run on the resources it needs, and queries that arrive later get resources as they become available when earlier queries finish. Enterprise systems can do more sophisticated things, like priorities for individual users or connection strings.
So if you log in with certain user credentials, your queries get higher priority than others, but at a high level it's still more or less first-come, first-served. Now, OLAP queries are going to have both parallelizable and non-parallelizable phases, and we'll see what that looks like in a second, but the main idea is that we want to keep all our cores, all our resources, actually running something. If there's a point where a query has to coalesce data at a single node, like an exchange operator, that exchange operator may be running on a single core, so we have other cores available to do other things. Again, this scheduler component is responsible for figuring out how to take advantage of all those available resources. We won't go into much detail in this class about how to do this for inter-query parallelism; there'll be a separate lecture on scheduling queries, and then for join algorithms and other things we'll see how to handle distributing tasks across different workers, whether they're in the same box or not, when we need to send data around. So we'll cover this in more detail later in the semester.

The thing most people actually mean when they say query parallelism is intra-query parallelism: taking a single query and distributing it across multiple resources, multiple workers running at the same time. This is the beauty of a declarative language like SQL: ideally you don't have to know or care what the system architecture looks like, whether it has one node, four nodes, ten nodes, or a hundred nodes. The same SQL query can be chopped up into different plan fragments or tasks, and the system distributes it for you. You might debug things locally on a subset of the data and then submit the same query to the larger machine to run on all the data; you don't have to rewrite anything. SQL, in theory, should just get translated into a distributed or parallel plan without any changes. Within intra-query parallelism, there are two types. The most common is intra-operator parallelism, also called horizontal parallelism. I think one of the execution-engine teams was asking about asynchronous I/O and a technique called vertical parallelism, so we'll quickly talk about what that looks like too. I'll also say that these things are not mutually exclusive; you can combine them, and various systems do this in different ways. But intra-operator parallelism is way more common. For each of the relational operators out there, there are parallel versions of all of them, and oftentimes whether the operator itself is aware that it's being parallelized depends on the implementation. If it's not aware, then, as you'll see in a second, you put an exchange operator above it to combine results, with the operator itself not knowing that it got fanned out across multiple tasks. So horizontal parallelism is the common one. The idea is that we break up an operator in our query plan into independent operator instances that each perform the exact same function, the exact same computation, as every other instance they're a copy of.
But each instance operates on different pieces of data, different chunks of data. Think of a sequential scan: if my table is ten Parquet files, I could have each operator instance be responsible for scanning just one file. They run in parallel, and I don't need to coordinate intermediate results between them while they're running; they're independent computations. But at some point we need to put the results back together for other parts of the query plan, or for the final output we give to the application. So we're going to use an exchange operator that coalesces the results from the different operator instances, combines them, and sends them off to whoever needs them next. We'll see this technique used in the morsel-driven parallelism paper when we talk about scheduling in a few weeks. The idea of an exchange operator dates back to the 1990s, so it's not new, but it's basically how a lot of distributed or parallel systems operate, and even if they don't call it an exchange operator, it's still some variation of it; I'll show what I mean in a second.

So here's the most basic version of an exchange. We have our tables; we want to join A and B, and we have this really simple predicate. The first thing we do is convert this logical plan into a physical plan. Say there are three chunks of data, three partitions, in table A that we want to scan. We'd have an operator instance for each of them, assigned to an individual core or worker, and again, whether it's a worker thread, a worker process, or a worker node, we don't care. So now we start building up the pipeline. We know we want to scan A, and immediately after scanning some vector or row group out of A, we apply the filter on it. Again, we don't need to coordinate that filter operation across different workers or operator instances, so it runs in parallel. Then we do projection pushdown (sorry, not predicate pushdown), so we push the projection into the pipeline here. And then we feed that into the build side of our hash join. So the output of these three operator instances feeds into this exchange operator. Think of it as the pipeline breaker for multiple operator instances: I can't do the probe side of the join until the exchange operator gets all the results from the individual operator instances. What I'm not showing is how we actually build the hash table. Is it one hash table, three hash tables? We'll cover that when we talk about joins later on. Sometimes it's faster to build three separate ones and then do another pass to combine them, which sounds crazy, but sometimes it is faster. Most systems, if they're on the same box, will probably build a single one, so you need latching inside it to protect it. So each of these boundaries here is an operator instance, and this is the pipeline. Once we have that, we go to the other side of the query plan tree. Say we have three files for B, so we have three different instances that scan B, do the filter, do the projection, and then probe the hash table and produce the output.
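To make the gather pattern concrete, here's a minimal, hedged sketch of my own (not from any real engine): N instances of the same scan/filter/project pipeline run over different partitions, and the exchange blocks until all of them finish before combining their outputs into one stream.

```cpp
#include <future>
#include <vector>

using Batch = std::vector<int64_t>;

// Stand-in for one operator instance: scan a partition, filter, project.
Batch RunPipeline(const Batch& partition) {
    Batch out;
    for (int64_t v : partition) {
        if (v % 2 == 0) out.push_back(v * 10);  // filter + project stand-in
    }
    return out;
}

// "Gather" exchange: one instance per partition, then a barrier that waits
// for all partial results and concatenates them into a single output.
Batch GatherExchange(const std::vector<Batch>& partitions) {
    std::vector<std::future<Batch>> futs;
    for (const auto& p : partitions) {
        futs.push_back(std::async(std::launch::async, RunPipeline, p));
    }
    Batch result;  // pipeline breaker: nothing downstream runs until all arrive
    for (auto& f : futs) {
        Batch b = f.get();
        result.insert(result.end(), b.begin(), b.end());
    }
    return result;
}
```

A real engine would hand these tasks to its scheduler rather than spawning ad hoc threads, but the barrier-then-combine shape is the same.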
But now in this case, assuming we have a single hash table for our query plan, each of these individual operator instances, each of the individual pipelines, can probe the hash table separately. So you're going to have three threads or three workers producing outputs from this join, and I need another exchange operator to take their individual results, combine them, and produce the final output to the user. So again, the pipeline boundary for each operator instance is going to look something like this. Yes? The question is: will the exchange be running on one of these cores? Yes; think of it as an operator instance that's running and keeping track of the inputs it's getting. And it depends on whether it actually does any work. In this case here, the exchange operator could just note that it got an indication that the hash table was populated by three operator instances; it doesn't do any computation, it's just a barrier that says, OK, now we can start running the other side. In the case up here, this exchange operator actually wants to coalesce the results. And whether that's a physical operator that takes buffers and puts them into a final output, or we've staged the output buffers in such a way that three threads write into them at the same time, it doesn't matter.

So the exchange operator I just showed you is what's known as a gather operator. These are the variations of an exchange that you can have, and as I said, gather is the most common one you'll see in these systems. The terminology I'm showing here comes from SQL Server, from a blog article from, I think, 2006. It's their terminology, but I think it clearly defines the different ways exchanges take data in and send data out. Not every system follows this taxonomy, but I like it, because it's easier to reason about than just saying they're all exchanges when sometimes things come in one way versus another. So again, the gather operator combines results from multiple workers and produces a single output stream that it sends somewhere else in the query plan. The distribute exchange operator takes a single input and fans it out to multiple workers, to whoever needs it. Think of this like the intermediate result reuse we've talked about before: what if I want the output of one operator to be used by multiple operators, like a subquery that appears multiple times in the query plan, and I want to reuse the result in different parts of the plan DAG? The distribute operator lets you do that. And the last one is repartition: you have multiple operators producing input to the exchange operator, and it produces multiple outputs to different channels. Yes? The comment is that this is just reducing and then mapping again, like MapReduce. Sure, but what came first, parallel databases or MapReduce? Parallel databases. Yes? The question is: gather seems like the only obviously useful one; why would you need repartition and distribute? So for repartition: BigQuery is going to call this a shuffle, and after every pipeline they do this shuffle step.
They actually have a separate service that takes the inputs and stages them, and then the operators upstream can pull from them. And it may be that the number of workers on the input side is different from the number on the output side, so having this natural staging point, this barrier point, lets you scale the number of execution engines or operators up and down. Question: is that more along the lines of the over-the-network case, the scaling up and down? It doesn't have to be over a network, but yeah. Yes? But it won't really happen on a single node? No, it may happen on a single node too. We'll talk about this more in a second, but say I have a physical plan and some optimizer estimates of the amount of data I expect to get from the operators below me, and from that I know how many operators I'll need above the exchange to process it. What if those estimates are wrong and you woefully underestimated? Even on the same box, you may want to scale the operators above in and out, and the repartition exchange operator lets you do that. OK, so the other question I had was: other than table scans (maybe you have a slide about this), which other operators can take advantage of this? It doesn't seem trivial to do this for every operator. Anything, really. In my last slide, the example comes up from the scan, you do a bunch of stuff, and I'm showing the output of the join going into an exchange operator. But what if there was something above it, like a nested subquery, and something else above that wants to use it? The relational model lets you compose these things any way you want. Question: is there any more semantics to repartition; can it sometimes split the input into disjoint subsets going out, or copy it? So the question, put another way, is whether it's always one-to-one, inputs to outputs, or whether you could duplicate the output. You could duplicate the output. As in the example before: what if I want to reuse some nested-query computation? Logically I have 1,000 tuples coming in and 1,000 tuples coming out, but physically I may have double that, because the output feeds two different parts of the query. Does that change the distinction, or is it all just considered repartition? All considered repartition if you do that, because you could do the duplication with this one too. Another question: how much computation actually goes on inside the exchange? Could it do something like sorting? No. Could it map specific inputs to specific downstream operators? Doing that would generally be a bad idea; you're not randomly deciding where things go, but you do want the distribution to be balanced. It needs to be equal, right, the amount of work that goes to each one? Ideally, but you can't guarantee that.
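Here's a minimal sketch of what a repartition exchange does mechanically (my own illustration): rows from any number of producers are hashed on a key and routed to one of N output channels, so each downstream worker sees a disjoint key range. It also shows why skew on the key matters, which is exactly what comes up next.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct Row { int64_t key; int64_t payload; };

// Repartition ("shuffle"): route each row to an output channel by hash of its
// key. If the key distribution is skewed, some channels get far more rows.
std::vector<std::vector<Row>> Repartition(const std::vector<Row>& input,
                                          size_t num_out) {
    std::vector<std::vector<Row>> channels(num_out);
    for (const Row& r : input) {
        size_t ch = std::hash<int64_t>{}(r.key) % num_out;
        channels[ch].push_back(r);
    }
    return channels;
}
```

The number of output channels need not match the number of producers, which is what lets the system scale the operators above the exchange independently of those below it.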
Right, so again, think of this case: I take some input in, hash it on some key, and redistribute it based on that hash key. If I recognize that everything is skewed and it's all going to one operator, then maybe I want to do another round of hashing and redistribute. But then the repartition itself is running on one single node? No, not necessarily. Again, BigQuery, which we'll see later in the semester, runs a whole separate service that just does shuffle. So really it's the specific implementation of the specific repartition node that determines this; there's no general rule. Right. And this is PowerPoint: boxes with arrows going in and out. I'm just giving you the idea that you could have three inputs coming in and two outputs coming out. You can build on this logical idea, and the sophistication of the implementation depends on the system. BigQuery fabricates hardware to make this go as fast as possible; Microsoft doesn't do that, other people don't do that, and not everyone always shuffles after every pipeline the way BigQuery does. It seems like it would be expensive, but it opens up a lot of opportunities for adaptivity later on. OK, so you can obviously see how repartition is just a generalization of the other two. And again, repartition is Microsoft's terminology: Google and BigQuery call it shuffle, Databricks might call it shuffle too, and in the MapReduce and Hadoop world they called it shuffle. The idea is more or less the same. And to the earlier comment: this isn't databases copying MapReduce; it's the other way around, MapReduce copied databases, and I have a paper that talks about that. We can take that one offline.

All right, so the last type of parallelism is inter-operator parallelism. The idea here is that within a single query plan, we can have multiple pipelines running at the same time. Think of it as an asynchronous model: you have one pipeline always running, producing tuples, and those tuples get fed into some other pipeline that's also always running. You may still need exchange operators to combine results from different plan segments, but if your query plan allows this (and not all of them do), you can get better speedup, because you don't have to run one pipeline to completion and then switch over to another one. You can start running higher parts of the query plan as soon as you have tuples, when it's allowed. This is sometimes called pipeline parallelism. You'll see it mostly in stream processing systems, which we're not going to talk about this semester; examples would be Spark Streaming, Samza, Pulsar, RisingWave (there's that one guy who came here for a year), a bunch of systems designed to do this. Kafka is probably the biggest example; yeah, I blanked on Kafka, Kafka is the big one. So say we have this unusual query: basically a Cartesian product across four tables with no WHERE clause.
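Before walking through that plan, here's a minimal producer-consumer sketch (my own, hedged) of what pipeline parallelism looks like mechanically: one pipeline pushes tuples into a queue while a second pipeline consumes them concurrently. A real engine would move vectors/batches through a bounded queue, but the shape is the same.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

class TupleQueue {
    std::queue<int64_t> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void Push(int64_t t) { { std::lock_guard<std::mutex> g(m_); q_.push(t); } cv_.notify_one(); }
    void Finish()        { { std::lock_guard<std::mutex> g(m_); done_ = true; } cv_.notify_all(); }
    // Blocks until a tuple is available or the producer is finished.
    std::optional<int64_t> Pop() {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q_.empty() || done_; });
        if (q_.empty()) return std::nullopt;  // producer done, queue drained
        int64_t t = q_.front(); q_.pop();
        return t;
    }
};

void Demo() {
    TupleQueue q;
    std::thread producer([&] { for (int64_t i = 0; i < 1000; i++) q.Push(i); q.Finish(); });
    std::thread consumer([&] {
        int64_t sum = 0;
        while (auto t = q.Pop()) sum += *t;  // downstream pipeline consumes as tuples arrive
    });
    producer.join(); consumer.join();
}
```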
And so here I can run the pipelines at the same time. This one up here, and the two up here, are blocked waiting for whatever comes out of the two down here. So as this one runs, it sends tuples up, and the join above can immediately start working on them as they become available. With a Cartesian product you're matching everybody with everybody, so you won't have any false negatives; you're just waiting for things to come up, and as soon as they arrive you can start matching and producing output. It's a stupid example, but it shows the idea that you can run these things at the same time in a producer-consumer model: this pipeline produces tuples and sends them up, and that one is waiting for those results. But again, this is not that common.

OK, so now let's talk about the paper I had you read. The reason I assigned the Velox paper is that, as I mentioned in the first or second lecture, there's a trend now where people have recognized that the OLAP execution engine has sort of become a commodity, and there really haven't been groundbreaking changes in hardware that require rethinking how we build everything. Certainly there have been optimizations and improvements over the years, but at a high level it's a vectorized, push-based system; well, that's what Snowflake built in 2012, 2013, and vectorization is what the MonetDB/X100 system was doing back in the day, though they weren't push-based. So there are a bunch of products out there all more or less doing the same thing. The thinking is: if you're going to build a new system from scratch (although we are doing that here as part of the class), instead of having to build all the low-level details and functionality of an execution engine, it'd be nice if there were something off the shelf that provides this basic functionality for us. That's the idea of these composable libraries for building database systems: in theory, instead of building everything from scratch for every new environment, every new workload, you could take these libraries and modify and extend them for whatever the hardware or the application needs to support. Velox is probably the most famous one; it's out of Meta, which we'll cover in a second. There's also DataFusion, which is part of the Arrow project, although I feel like it should be a separate thing, but that's another story. There's the OAP thing from Intel, and Polars is another one. They're all trying to be these libraries you can plop down as execution engines. Yes, we'll get to that in a second. The comment is that DataFusion includes a front end and Velox doesn't. Yes, but you don't have to use the DataFusion front end; you can just use the execution engine. OK, so Velox came out of Meta, and what I liked about the paper is that they talk about the original motivation: they looked across the entire organization at Meta and saw people reinventing the wheel over and over again, building custom execution engines that were, at a high level, all doing the same thing.
It was a bunch of wasted effort, with everyone trying to apply the same Snowflake-style optimizations to all these execution engines, and they would end up with different semantics, different understandings of data types, different capabilities, making it really hard to move a workload from one system to another. The example they mention in the paper is a survey that found 12 different implementations of the substring function across Meta: some started indexing at zero, some at one, some accepted nulls, some threw exceptions; everything was completely different. So they set out to unify this by making Velox a low-level execution-engine library that can be extended for the different categories of workloads and environments they need to support across Meta, and then have one small team make that one library really, really fast instead of random people reinventing the wheel. As was said, Velox is not something you just plop down and immediately use. There's no front end, no SQL parser, no query optimizer, no metadata catalog, nothing to keep track of what tables you have. It literally is just the execution engine that implements the operators. They have a little more than that: some memory management, thread management, and connectors to read data from different sources. But at its core, the thing they're really trying to optimize is running these operators in parallel in a vectorized manner. The input is a physical plan, a DAG of operators, and it produces output wherever you tell it to go. It's not an executable you run; you have to build it into the context of a larger system.

For all the systems we cover this semester, we'll try to do this quick summary of their core features, so you have a mental checklist and a high-level understanding of what they're doing; when we talk about Databricks Photon, Snowflake, and the rest, we'll do the same thing. So: Velox is a push-based, vectorized query processing engine. It uses pre-compiled primitives and codegen for expressions in C++. We haven't discussed what that means yet; that's the paper you'll read for Monday next week, but think of it like this: for scanning a table where you know a column is int32, they have a compiled version of that operation, and compiled versions for floats and all the other data types. It supports compatibility with Arrow, which we'll get to in a second. It has some support for adaptive query optimization, not the full query-plan re-optimization we'll see later in the semester, but little tricks it can do while scanning data to modify how it evaluates expressions and so forth. And as far as I know, at least when I last looked, it only supports sort-merge joins and hash joins. Again: there's no query optimizer, there's no networking layer. All of that has to be provided by whatever encompassing system includes Velox.
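To give a feel for that division of labor, here's a purely hypothetical sketch of what embedding an execution-engine library looks like. None of these names are the real Velox API; the point is only which pieces the host system must supply (parser, optimizer, catalog, networking, scheduling) versus what the library runs.

```cpp
#include <memory>

// Host-provided artifacts (hypothetical types, not Velox's):
struct PhysicalPlan {};  // DAG of physical operators, built by the host's optimizer
struct ResultBatch {};   // columnar output vector handed back to the host

// The embedded engine only executes plan fragments it is given.
class ExecutionEngine {
public:
    // Host registers connectors so scan operators can pull from its storage
    // (e.g., Parquet files on S3); the engine never owns ingestion.
    virtual void RegisterConnector(/* host storage reader */) = 0;
    // Called per task by the host's scheduler; runs one plan fragment.
    virtual std::unique_ptr<ResultBatch> Execute(const PhysicalPlan& plan) = 0;
    virtual ~ExecutionEngine() = default;
};
```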
Again, as was said, DataFusion has more of this stuff, but even that isn't meant to be a full, complete system. And because it's not a complete system, Velox doesn't have its own proprietary data format. It doesn't own any of the data, similar to the data lake architecture we've been talking about, where we assume there are a bunch of files on S3 or whatever object store you want. Velox has connectors that let the scan operators retrieve that data and process it, but the library by itself can't control the ingestion of data and so forth; something else has to manage all that. So the APIs that Velox provides let you define connectors that retrieve data from other systems. Connector is a very common term across systems: type your favorite database plus "connector" into a search and you'll see what kinds of things it can talk to. Probably the most extensible examples are Trino and Presto (Presto came out of Meta as well), which have connectors to talk to just about any database you'd want. The idea is an API within the system that lets you talk to some other system, whether a database or a file system, and pull data in. It's basically a federated database: one logical database talking to multiple physical databases, and connectors are how it does that. If you're familiar with Postgres, the foreign data wrapper API is a similar thing; DuckDB calls them extensions, but the usual term is connector. And then they have adapters that decode and encode the different storage formats we talked about before: Parquet; ORC; DWRF, which is Meta's internal extended version of ORC; and then Alpha, which I think I mentioned before, the thing the Velox folks are building as the next version of ORC.

So there are a bunch of components discussed in the paper. Some of them you've already covered; some I want to spend time on today. There's the type system, with scalar types and complex nested types. The vectors they operate on are Arrow-compatible with some extensions, although I think some of the extensions they added are now in the mainline Arrow code; I don't know how much overlap or distinction there is between the two today. There's the ability to define functions: if you want your own version of substring or whatever that's not in the built-in Velox library, you can add it. Then there's the engine that actually runs the operators, which we've covered a lot; connectors, which we just talked about; and resource management, meaning memory pools, basically a buffer pool to keep things in memory and pull things out as needed. Today I want to talk about three of these, because we already covered type systems in the context of Parquet; the function API is almost like a user-defined function mechanism for adding built-in capabilities; connecting to different storage isn't really new; and the resource management stuff we'll cover throughout the semester. Yes? What's your question?
The question is about the memory pools: is it always a good idea to separate your memory into something you can hand out and take back, the way they do with arena pools? Absolutely, yes, because what's the alternative? Doesn't Rust give you something like that? Rust, the programming language? Rust will prevent you from handing off memory without keeping track of ownership, but for any given chunk of memory, you don't want to call malloc every single time; you want a memory pool. This is a layer above the programming language; it's a system, and it has to manage its own memory. Even in a memory-managed language like Java, you want to avoid the JVM's garbage collector and allocate your own memory. The database knows more about its data and access patterns than the OS or the JVM does, so we want to manage everything ourselves.

OK, so the first thing I want to talk about is: what do tuples look like as we send them from one operator to the next? Not in terms of the physical encoding yet, just in terms of whether we're including all the data we need or only some of it. The operator output defines what actually gets passed up from one operator to the next, whether it's in the same pipeline or not. The system has to decide at what point to materialize everything and how to put things together, and that depends on the processing model. In the vectorized case, because we want to operate on columns, we may not want to materialize all the columns at once, because we may not need them higher up in the query plan. It also depends on whether it's a row store or a column store; in our case it's a column store, so we know we want to do late materialization. And it depends on the data requirements: at what point do I actually need certain pieces of data in the query plan, and how expensive is it to go get them again later, versus materializing them at the point where I'm scanning the data file? So say we want to do this join on R and S, and the join clause is R.id = S.id.
While scanning R and S, depending on how many columns they have and how wide the tables actually are, we know we only need certain attributes from those tables at this point, so the question is at what point we materialize the full tuple so we can do the join, or the projection above it. In early materialization, the idea is that at the moment you scan the data, at the leaf nodes and the scan operators, you put the entire tuple together. This is easy in a row store, because when you scan the tuple, everything is right there; even in the PAX model it's all within the same row group, close by. But depending on how wide the table is, you may not actually want to put it all together, because that's not free. So with early materialization, when I do this join, I stitch together the entire tuple in the join operator, and that's my output, sent from this operator to the next. The point is that all subsequent operators in the query plan never need to go back to get more data, because we pass the entire materialized tuple up. In late materialization, which is what the Velox people and most OLAP systems do, the idea is that you only pass up the bare minimum number of attributes you actually need, and at any point where an operator needs more data, it has the ability to go down and get it. So in this case, for the join, I only materialize the id columns for R and S, plus a tuple ID or offset that lets me find the other attributes in the columns below, and that's what gets passed up the plan. For R, I'm feeding up just the tuple ID and R.id, because that's all I need for this side of the join; same on the other side, where once I get past the filter on value, I only need to pass up S.id to do the join. But above the join, I have R.id because it was passed up to me, and I don't have the creation-date column from the S table. So here the operator says: I need this other attribute, and it has the ability to ask the system to fetch it so it can stitch it in. You may not always want to do this: the cost of fetching things again from disk or from the object store can be expensive (if it's cached, you avoid that), in which case you'd rather materialize early. But if I have a billion tuples and the join produces one tuple, then with early materialization I'm dragging a billion tuples' worth of columns I don't actually need through the plan, and it's better to do late materialization. This is an old idea going back about 15 years; I don't know whether the first column stores from the 90s, like Sybase IQ, did it, but I know MonetDB did it, Vertica did it. Pretty much everyone does it today. It's easy to do in the PAX model or a column store, because we can fetch individual columns as needed and put things together. Yes? The question is: if you have a really tall query plan, lots of steps going up, and there's an attribute used at the top and the bottom but nowhere in between, would you drop it and add it back? You could, but I don't think anybody does that.
Just too much bookkeeping, I think. Also, you'd have to estimate things, and it would be hard to get right. Yes? The question, rephrasing it, is: given the pros and cons of early versus late materialization, do most systems just pick one, or do they try to be clever and do a hybrid, figuring out per query plan whether to use early or late materialization? I think some systems do; the enterprise ones, maybe. I don't know about the open-source ones, and I don't know what DuckDB does. I think the most common approach is to just pick late materialization and do that, because it's less engineering.

OK, so now, assuming we're doing late materialization, we know what we're sending up is columns of values: the bare minimum of what we need to process each operator as we go up. Now I need to talk about what the encoding of those columns actually looks like. The challenge for these lakehouse systems, these OLAP systems that want to read data from a bunch of files on S3, is that those files can be encoded in different ways: Parquet, CSV, whatever. That means you need a way to transform whatever the on-disk representation of the data is into some common format the execution engine can actually understand and support. You would not want a sequential scan operator and join operator for Parquet data, and a separate sequential scan and join operator for ORC data; that would be far too much code to maintain. So all these systems translate the disk format into some internal representation, and that's what gets propagated from one operator to the next as we go up the query plan. The internal representation, or internal encoding, is how the system represents data on the inside; it may not be what's exposed to you as the user as the result of the query, which might come out as Arrow, or a Parquet file, whatever. So just as on disk, we want the values within each column to be fixed-length, because then we can find the corresponding values across tuples very easily. Ideally we want data structures we can move from one operator to the next without serializing, without having to run BtrBlocks or some algorithm doing three passes to encode the data; that's super slow to do while actually running a query. We want something that can natively operate on the encoded data and move it along without deserializing it first. Similarly, it's also nice to have zero-copy memory access, because that potentially lets us take our internal representation of data and share it with external processes. This is actually what Arrow is really predicated on.
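Here's what that zero-copy property looks like in a fixed-width column. This is a hedged, Arrow-like sketch of my own, not the actual Arrow C++ classes: the column is a shared immutable buffer plus an (offset, length) view, so handing a slice to the next operator, another thread, or another process in shared memory is just pointer arithmetic.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

struct Int64Column {
    std::shared_ptr<std::vector<int64_t>> values;  // shared, immutable buffer
    size_t offset = 0;
    size_t length = 0;

    int64_t Get(size_t i) const { return (*values)[offset + i]; }

    // Zero-copy slice: the child column shares the parent's buffer.
    Int64Column Slice(size_t start, size_t len) const {
        return Int64Column{values, offset + start, len};
    }
};
```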
Think of something like, actually, DuckDB does this really well: you can use DuckDB inside Python code, almost like a data-frames API, and access data that's sitting in memory in DuckDB, and all it's doing is passing a pointer, because it's the same process. You don't need to serialize and deserialize anything to transfer the data. So if you have a common representation that everyone can potentially support, you get that zero-copy memory access. That's an extreme example, but also think of sending data from one node to another over the network: if the transfer encoding is the same as the internal representation of data in my operators, I don't need to serialize and deserialize with something like protobuf; I just send the data over, and as soon as the bytes arrive on the other node, it starts crunching on them. Even crazier, you can push things down to an FPGA on the NIC, and that can start crunching on the data without decoding it first; there are systems that do that.

So this is obviously leading up to Apache Arrow. In the same way Parquet and ORC are open-source formats for storing data in files on disk, Arrow is an open-source format describing how to store data in memory in a columnar fashion. The idea is that it encodes data in such a way that an implementation operating on Arrow data can do all the cache-efficient, vectorized stuff we're talking about throughout the semester. One thing that's going to be much different from Parquet and ORC is that Arrow also supports random access efficiently: I can jump to an offset and get the exact value for that tuple without having to decode RLE or delta encoding or all the other stuff we talked about before. And sequential access is just ripping through the columns and processing things. Now, Arrow is a lot of stuff. It's meant to be a framework that comes with memory management, thread management, and RPC mechanisms; they have a SQL parser and a SQL transport layer as well. What I'm talking about today is just the encoding scheme for the columns, for the data we pass from one operator to the next, whether on the same box or a different box. Arrow also has a basic expression engine for evaluating expression trees: simple filters and projections. You can't do all the complicated functions you could in Velox, just really basic things like checking whether this column equals this value; they actually compile expressions down to LLVM using an engine they have called Gandiva. So Arrow is much different from Parquet and ORC, and because they want to be lightweight, they only support two encoding schemes: dictionary encoding and a variation of RLE. They're not doing delta encoding, they're not doing all the bitmap stuff that BtrBlocks does; it's just these two. And the way they did dictionary encoding, at least in the original version, was to take all the strings in a column, sort them, and then the offsets are the codes you embed in the column itself. Yes? Why do they have expression evaluation?
Because there's the specification of what the encoding looks like, and then there's an implementation you can use to operate on it. As part of that implementation, they include the ability to find the matching tuples where something equals something. And the way they execute it, as we'll see in a second, is that instead of traversing an expression tree, they compile it down to LLVM. I was debating whether to mention this, because we haven't talked about query compilation yet, which is, what, next week? We'll see how you can do this for expressions in a second, but again, the point is it's more than just a spec; there's an implementation library as well.

All right, I should have talked about this earlier when we covered data encoding: how do you actually represent strings? Strings are variable-length, and you want fixed-length values in your columns so everything lines up. So what do you actually store for that value in the fixed-length column, and where do you store the variable-length string data? What Arrow originally did was make the fixed-length entry a size and a pointer, so 12 bytes, where the pointer is an offset into some blob region in memory, pointing at the byte offset where you find the string you're looking for. So the column is this fixed 12-byte data, plus a separate variable-length region holding the actual strings. It didn't matter what the string was; you always used that scheme. In Velox, the paper talks about how they extended Arrow to support an additional storage scheme based on something the Germans did, which we'll just call German-style string storage. The idea is that an entry is always 16 bytes instead of 12. The header has the size, and if the string is small, you store a four-byte prefix in the first four bytes and whatever data remains in the remaining eight bytes, padded out with zeros. If the string is larger than that, you still have the size and the four-byte prefix, but then you have a pointer into the blob area holding the actual full string. And even though you have the prefix inline, you still store the full string, including those first bytes, down in the blob, because you don't want to stitch pieces together when scanning along; you just jump there and get everything you need. This is a really simple idea, but it makes a huge difference, because now for some queries I don't need to follow the pointers at all. If I'm just trying to find all the strings starting with the letters "AN", I can rip through my column of fixed 16-byte entries and do the pattern matching on the prefix without ever following a pointer, whereas with the scheme above there's nothing useful in the fixed-length data and I always have to follow the pointer. Another way to think about the pointer: it's like a dictionary code. And there are times when you don't even need the full string to do some computations, whether it's filtering, aggregations, or other stuff; you can just look at the prefix.
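Here's a minimal sketch of that 16-byte layout as the lecture describes it (exact details vary by system; this is my illustration, not Velox's or Umbra's actual code). Strings of up to 12 bytes are stored entirely inline; longer strings keep the 4-byte prefix inline plus a pointer to the full string in the blob region.

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

struct GermanString {
    uint32_t len;     // string length
    char prefix[4];   // first 4 bytes of the string, zero-padded if shorter
    union {
        char rest[8];        // len <= 12: remaining bytes inline, zero-padded
        const char* ptr;     // len >  12: full string lives in a blob region
    };

    std::string_view View() const {
        // Short strings: prefix[4] and rest[8] are contiguous in this layout
        // (offsets 4 and 8), a common trick so the inline bytes read as one run.
        return (len <= 12) ? std::string_view(prefix, len)
                           : std::string_view(ptr, len);
    }

    // Prefix check (n <= 4) never chases the pointer: prefix is always inline.
    bool PrefixEquals(const char* p, uint32_t n) const {
        return len >= n && std::memcmp(prefix, p, n) == 0;
    }
};
static_assert(sizeof(GermanString) == 16, "fixed 16-byte column entry");
```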
So I taught this last year, and there was no name for the technique; I'd just been calling it German-style string storage, and then other people started using that term too. This slide is from the Arrow folks, from about six months ago, last October, saying Arrow now supports German-style strings, and showing the PR where it landed. And then someone wrote in afterwards asking, hey, what is German-style string storage? This is the paper from the Germans, from Umbra, where they introduced the technique. The Germans do amazing stuff. As I've said before, they love to pack things like Bloom filters into leftover bits in different parts of the system. For example, in hash tables: even though pointers are 64 bits, x86 only actually uses 48 bits of the address, so they take the remaining 16 bits and put a little Bloom filter in the pointer that tells you whether the key might exist in the chain it points to, so you can check the Bloom filter and not even follow the pointer. Similar idea here; they do amazing stuff. And then this actually came out two days ago: the Polars folks have a new blog article, they're using this format now, and of course somebody asks again, what does German-style storage actually mean? So this is my fault. It's Umbra-style string storage, or Neumann-style string storage, OK? DuckDB does this, Velox does this, Polars does this, Umbra obviously does this; I don't know what Hyper did. All right, so it's a really simple technique and a huge, huge win. Yes? The question is: if you need to match more than the prefix, say the first five bytes instead of four, would you do a two-pass approach, first filtering on the prefix and then following the pointers only for the candidates? Probably yes. I actually don't know what Polars and the other systems do, but it would make sense, because you could rip through the column, figure out what you actually need to look at, and in that pass you could also avoid duplicate work: if multiple tuples share the same string, the pointer is the same, so you could coalesce them, with some bookkeeping to figure out which ones mapped to the same thing when you pass up the results. That would probably be faster, assuming enough overlap; I don't know what people actually implement, but it's a good optimization. Yes? Is this pointer an actual pointer or an offset? It could be an offset within some blob region in memory. In Arrow, looking at the spec, I believe it's a bucket number plus an offset within the bucket, because they have different buckets for the variable-length data; in Velox, I think it's just a pointer. And again, think of sending this stuff over the network: you don't want to have to deserialize it, so if it's a logical pointer, the physical memory layout doesn't matter on the other side.
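To make that Q&A concrete, here's a sketch of the two-pass idea, assuming the GermanString struct sketched above: pass one rips through the fixed 16-byte entries comparing only lengths and inline prefixes (no pointer chasing), and pass two verifies the surviving candidates against the full string.

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

std::vector<uint32_t> FilterEquals(const std::vector<GermanString>& col,
                                   std::string_view needle) {
    uint32_t n = needle.size() < 4 ? needle.size() : 4;  // at most 4 inline bytes
    std::vector<uint32_t> candidates;
    for (uint32_t i = 0; i < col.size(); i++) {  // pass 1: length + prefix only
        if (col[i].len == needle.size() && col[i].PrefixEquals(needle.data(), n))
            candidates.push_back(i);
    }
    std::vector<uint32_t> matches;
    for (uint32_t i : candidates) {              // pass 2: full comparison
        if (col[i].View() == needle) matches.push_back(i);
    }
    return matches;
}
```

For short strings and mismatched lengths, pass one resolves everything; the pointer is only chased for long strings whose first four bytes happen to match.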
OK, so German-style string storage is really Umbra-style string storage. Now, I don't want to talk too much about this; I just want to mention it quickly. We mentioned Substrait before. It's not directly part of Arrow, but think of it this way: in the same way that Arrow is meant to be the universal encoding scheme for transferring data across different systems, Substrait is meant to be an open-source specification for representing query plans, relational algebra query plans. I think DataFusion and DuckDB have the ability to take Substrait query plans and actually execute them. The idea seems great: I could take something like Calcite, or some query-optimizer-as-a-service, have it produce Substrait, and if my system can operate on that query plan, I can just take it and run it; things become more decoupled. Substrait was created by the guy who did Apache Drill, an open-source clone of Google's Dremel, the system behind BigQuery; he also co-founded Dremio, but now he's focused entirely on Substrait. When I talked to some of the Arrow people at Voltron Data, the problem with Substrait is that for really simple things it's OK, but as soon as you start extending it beyond what they expect, it causes problems. And I think it's basically one dude doing it: whereas Arrow has a rather large consortium behind it, Substrait, at least at the time, was mostly one guy's hobby project.

All right, DataFusion we've mentioned multiple times. The branding seems kind of weird, because it's part of the Apache Arrow project but it's almost its own entire separate thing, and I wouldn't be surprised if it gets forked out from under the Arrow umbrella and becomes its own top-level Apache project. It's basically like Velox: a vectorized execution library operating directly on Apache Arrow data. It does include a query optimizer and a SQL front end, so it has a few more features than Velox, but at its core, if you don't want any of that, you can just take the query engine. There are already a bunch of systems based on it. InfluxDB is probably the biggest; this is InfluxDB's third rewrite, and that's a long story: they had version 1.0; then for version two they got rid of SQL (I told them that was a bad idea) and added mmap (also a bad idea); and for version three they got rid of mmap, added back SQL, and are now based entirely on DataFusion. CeresDB and CnosDB, I think, are out of China. And Seafowl, I think these folks just got acquired by EnterpriseDB two or three weeks ago; it's a DataFusion execution engine designed to run with Postgres. Anyway, Arrow has sort of become the de facto standard; Snowflake supports it, a lot of systems are based on it, and DataFusion is one implementation of an engine that can operate on it.

All right, so I want to finish up by talking about expression evaluation, sprinkle in a little adaptive execution, and that will set us up for talking about vectorized execution next week.
So expression evaluation is basically: how do you take what's in the WHERE clause, or the join clause — the predicates — and evaluate it on the data as we're scanning, as we're executing our operator? At a high level, the SQL parser is going to take whatever is in the SQL query and convert it into an expression tree that looks like this, where every node represents some operator or operand within the expression: your conjunctions and disjunctions; less-than, greater-than, not-equal and so forth; all your arithmetic operators; constant values; references to columns within a tuple; and then additional functions. You basically translate what's in the SQL query into a tree like this. Now, whether you break the join clause and the WHERE clause into separate trees or put them together, or how you break things up when you have nested queries — all of that is left up to the implementation — but it's always going to end up looking like a tree structure like this, right? So how would you actually execute this? Well, the naive way is that for every single tuple, as I'm scanning my data, I traverse the tree: I start at the AND, go down one side, evaluate those nodes, pull the result up, check the equality, then go down the other side and do that comparison. But obviously that's going to be super, super slow, because it's a bunch of indirection — jumps, pointer chasing, and virtual function tables if it's C++. Running it this naive way, which is what BusTub does, would be super slow if we're trying to scan through a billion tuples. So what we want to do instead is represent the expression as if it were a function in a programming language. Because now we can hand that off to a compiler — whether that's inside the query optimizer or a traditional compiler like GCC — and do all the standard tricks we've known how to do in compilers for the last 50 years, right? So say we have a really simple query: SELECT * WHERE sval = 1 — I'm missing the FROM clause there, but you get the idea. Again, we have an expression tree with the equals node, the sval column reference, and the constant. Ideally we want a function that just takes a value as input and checks whether it equals one, and we run that function on every single tuple. We can compile this, using our favorite compiler, into actual machine code, and then as we scan along, we invoke that function for each tuple instead of traversing the tree. That's what that Gandiva thing I mentioned in Arrow does. It's what Postgres does when you turn on the JIT compiler. This is what a lot of systems actually do. Yes? The question is: it's clear why the compiled version is faster, but why exactly is traversing the tree so slow? So think of it like this: in my scan operator, all I have is a pointer to the root of an expression tree, and to make it composable and generic, every node is an abstract class. So I invoke, say, an evaluate function on the root operand — and, assuming C++, that's a virtual function table lookup to find the actual function I'm calling.
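As a minimal sketch of the contrast being described — toy types of my own, not BusTub's or any real system's classes — here's the interpreted tree with its virtual calls, next to the flat function we'd rather be invoking per tuple:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Naive interpreted evaluation: every node is an abstract class, so each
// tuple pays virtual calls and pointer chasing all the way down the tree.
struct Expr {
  virtual ~Expr() = default;
  virtual int64_t Eval(const std::vector<int64_t>& tuple) const = 0;
};

struct ColumnRef : Expr {
  size_t idx;
  explicit ColumnRef(size_t i) : idx(i) {}
  int64_t Eval(const std::vector<int64_t>& tuple) const override { return tuple[idx]; }
};

struct Const : Expr {
  int64_t value;
  explicit Const(int64_t v) : value(v) {}
  int64_t Eval(const std::vector<int64_t>&) const override { return value; }
};

struct EqExpr : Expr {
  std::unique_ptr<Expr> lhs, rhs;  // two more pointers to chase per tuple
  EqExpr(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
      : lhs(std::move(l)), rhs(std::move(r)) {}
  int64_t Eval(const std::vector<int64_t>& tuple) const override {
    return lhs->Eval(tuple) == rhs->Eval(tuple);  // virtual call per child
  }
};

// What we want instead: the whole predicate collapsed into one flat,
// directly callable function, as if the query were written in C++.
inline bool SvalEqualsOne(int64_t sval) { return sval == 1; }
```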
Then inside that operand I have two pointers, left and right, and I have to follow those pointers and do the same thing: virtual function table lookup, execute, descend, and copy data back up. So, much interaction between nodes? Absolutely, yes. Okay, next question — crazy idea: what if you stored the whole tree in one big chunk of memory, the way a B+tree packs each level into contiguous nodes — could you get over this cost even without compilation? You mean flatten it? Yeah — that's what Velox does. They flatten it, and — well, we'll see in a second. Velox also has an experimental compiler that will convert the expression, like the Gandiva thing in Arrow, but the Gandiva thing and the Velox one are separate. So Velox flattens the tree, as you say, and then you run along it sequentially, right? Postgres, if you're not using the JIT compiler, traverses the tree. MySQL traverses the tree. A lot of systems do that, and obviously that's terrible for OLAP. In their world they're operating on a single tuple most of the time; in our world we want to operate on columns — tons of data. So not only do we not want to traverse the tree, we want to convert the expression into an actual function call, and instead of passing a single value, we want to pass a vector of values. Why? Because, as someone just said, you're going to do this check a billion, a trillion times, right? We're going to invoke this function over and over, so we can vectorize it — and this goes back to what we talked about a little on Monday and will cover more next week. Because I only have so many data types for my columns and my constants, instead of hard-coding a path that passes the constant one, I can have a version of this equality-check function for int32, int64, floats, and so forth, and I just pick at runtime which pre-compiled function I want to use: given a vector and some constant, return whether each value matches. Question: is this different from query compilation? Yes — we'll differentiate between holistic query compilation, where I compile the entire query plan, and this, which is just compiling the expressions. You can still have an interpreted engine, which is what Velox and most systems are, but within it, when you evaluate expressions, you're calling pre-compiled primitives. As I said, we want to vectorize everything. So in the case of Velox, it's like the example he gave — I wouldn't call it a B+tree, though I'm sure there's a compilers term for it — you're basically flattening the tree so that you execute things sequentially, right? And what you're actually executing at each step is a call through a function pointer to one of these pre-compiled primitives. And because we're operating on vectors of data rather than a single attribute of a single tuple, that amortizes the cost of the jump into that function.
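Here's a minimal sketch of what one such pre-compiled, type-specialized primitive could look like. The name and signature are made up for illustration, but the shape — one tight loop per (operation, type) pair, selected at runtime — is the idea just described:

```cpp
#include <cstddef>
#include <cstdint>

// One pre-compiled vectorized primitive: equality of an int32 column
// against a constant, producing a selection bitmap. A real engine has one
// of these per (operator, type) combination and picks it at runtime.
// Assumes out_bitmap is zero-initialized by the caller.
void EqI32Const(const int32_t* col, int32_t constant,
                uint8_t* out_bitmap, size_t n) {
  for (size_t i = 0; i < n; i++) {
    // Branch-free: set bit i iff col[i] == constant. The cost of jumping
    // through the function pointer is amortized over the whole vector.
    out_bitmap[i / 8] |=
        static_cast<uint8_t>(col[i] == constant) << (i % 8);
  }
}
```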
And there is an experimental branch in Velox — actively maintained, but I don't think it's turned on by default — that can convert the flattened expression from the query plan into an IR that they then turn into C++ code and fork-exec GCC to compile, right? Which is slow, because I'm synthesizing C++ code into a file, then I fork-exec GCC, which fires up a whole process, and GCC does a bunch of its own initialization — checking config files and other stuff — before it even parses your code. That's what the first version of SingleStore (MemSQL back then) did ten years ago: the same thing. And of course, in some cases the compilation cost for a query could be seconds. So you only want to do this kind of compilation if you know the query is going to run long enough to negate the compilation cost — if your query runs for 10 milliseconds and your compilation time is 500 milliseconds, that's not a fair trade-off. So you've got to be careful about when you actually do this. Same thing with Postgres: they have an estimate for how long the query plan is going to take to run and how long it would take to JIT compile, and they don't always JIT compile everything. All right — so even without this compilation step into machine code, Velox does some additional optimizations that I think are kind of cool and worth discussing. The first one is constant folding. The idea here is that if you know some operation is being done over and over again on a constant value, just do it once, memoize the result, and don't repeat it. Question, yes? The question is: barring the use of UDFs, these are common expressions with similar operations between them — has anyone looked into building a lightweight compiler into the DBMS itself that can compile these expressions? It doesn't seem like it would be too complicated; it's a lot of what compilers already do. So: has anybody considered embedding a compiler in the database system itself and using it to compile everything? It would be limited to just the simple expressions — no UDFs, because UDFs make it really awkward — but yes, you could do that, and this is where I'm going with this. The things I'm about to describe are things you learn in a compilers class, but in the case of Velox, they're re-implementing them. And then the lines get blurry about who should actually be doing this optimization. Should it be Velox, which doesn't have a query optimizer? Should it be the query optimizer above it? Or, as you suggest, could it just be the compiler itself? There is a system out of Germany — same school as Umbra; she's not German herself, but she's at the German school — called LingoDB, where they convert the query plan into MLIR, which is similar in spirit to Substrait but meant for more general-purpose programs, and then they run the Clang/LLVM compiler on it, which does all of this optimization as well. So yeah, people have done this, but I don't think any commercial system is doing exactly what LingoDB does. Yes?
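Before moving on, here's a toy illustration of the compile-or-interpret decision mentioned above. The struct, names, and threshold are invented for this sketch; real systems like Postgres compare optimizer cost units against configurable JIT thresholds rather than raw milliseconds:

```cpp
// A sketch of the compile-or-interpret decision, in the spirit of
// Postgres's JIT cost thresholds. Numbers and names are made up.
struct PlanEstimate {
  double est_run_ms;      // optimizer's estimated execution time
  double est_compile_ms;  // expected cost to generate + compile code
};

bool ShouldJitCompile(const PlanEstimate& p) {
  // Only pay the compilation cost when the query runs long enough to
  // amortize it: a 10 ms query should never eat a 500 ms compile.
  return p.est_run_ms > 10.0 * p.est_compile_ms;
}
```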
The next question: is there a difference between generating, say, C++ code versus LLVM IR? Let's punt on that until next week. We'll look at an early system called HIQUE that generated C code, and I said MemSQL did that too. There are pros and cons: it's obviously going to be slower than generating LLVM IR directly, but on the other hand, if it crashes, you have actual C code you can look at to figure out what went wrong. So we'll come back to that. But to his point, here's how you do constant folding on an expression tree — this is compilers 101, right? I'm calling the upper() function on the constant 'wu-tang' to do my comparison against the column, but I'm doing that over and over again for every single tuple. Obviously I can just do it once, replace the subtree with the result, and then I never have to run it again. Another common thing they do — at least in Velox — is common subtree elimination. If you recognize that you have the same subtree in your expression tree that you're evaluating over and over — say, running a string-position function on a column with a given constant — then you can compute it once and link the other operator up to whatever comes out of that result, right? Again, as we said, the lines are blurry about who should be doing this. Should it be the execution engine, something like Velox? Should it be the query optimizer? If you have a standalone query optimizer, it might be hard to do the constant folding for this function, because now you need the implementation of upper(). Either you call into the system you know you're running on and say, hey, here's upper(), give me the result of this — but that could be slow — or you re-implement it yourself. Calcite re-implements some of this stuff itself. DataFusion's optimizer just calls into DataFusion. I don't know what Orca, from Greenplum, does. In the case of MySQL, I know that for these kinds of things — when the optimizer wants to do this constant folding, or wants to know the value of some operation on a constant — they'll convert some subset of the WHERE clause into a little SELECT statement, run that query from within the optimizer, get back the result, and inject the value into the query plan, right? MySQL can do that because everything is tightly coupled; but if you're trying to build these pieces as separate standalone services, that would be hard to do — you'd have to have an API to support it. All right, so the last thing I want to cover is this notion of adaptivity. We haven't talked about query optimizers yet, but from the intro class you should be aware of what they're doing: they take a SQL query and convert it into a physical plan we can actually run on the system. And the way a good query optimizer does this is through cost model estimates — trying to predict the selectivity of every single operator to determine how much data it pushes up from one operator to the next — and it can use this to determine join ordering, right?
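Going back to the upper('wu-tang') example for a second, here's a toy sketch of the bottom-up constant-folding pass just described. The Node type is my own minimal stand-in, not Velox's or Calcite's actual expression IR:

```cpp
#include <cctype>
#include <memory>
#include <string>
#include <vector>

// Toy constant folding: upper('wu-tang') = col becomes 'WU-TANG' = col,
// so upper() runs once at plan time instead of once per tuple.
struct Node {
  enum Kind { kConst, kColumn, kCall } kind;
  std::string value;                        // constant value, column, or function name
  std::vector<std::unique_ptr<Node>> args;  // children, for kCall nodes
};

std::string Upper(std::string s) {
  for (char& c : s) c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
  return s;
}

// Fold bottom-up: if every argument of a call is a constant, evaluate the
// call now and replace the whole subtree with a constant node.
void Fold(std::unique_ptr<Node>& n) {
  if (n->kind != Node::kCall) return;
  bool all_const = true;
  for (auto& a : n->args) {
    Fold(a);
    all_const &= (a->kind == Node::kConst);
  }
  if (all_const && n->value == "upper" && !n->args.empty()) {
    auto folded = std::make_unique<Node>();
    folded->kind = Node::kConst;
    folded->value = Upper(n->args[0]->value);
    n = std::move(folded);  // subtree is gone; per-tuple work disappears
  }
}
```

Common subtree elimination is the same bookkeeping in reverse: instead of replacing a constant subtree with its value, you detect two identical subtrees and point both consumers at a single shared result.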
And so all the optimizations we've talked about this semester are a complete waste of time if we're given crappy query plans, right? If we choose the worst possible join order, it doesn't matter whether we're vectorized, doing compilation, or running on fancy GPUs — all of it gets thrown out the window because we have a crappy query plan. So we need the execution engine to be resilient — robust enough that even if we're handed a crappy plan, we can tweak things and make changes as we go along to improve our situation. You might say, okay, I'll just make sure I never have crappy query plans. The problem is that in a lakehouse or data lake environment, you may not have any statistics about your data. Someone uploaded a bunch of Parquet files into S3 and you're told to go run SELECT queries on top of data you've never seen before — how can you produce estimates, right? Or if you use one of those connectors that has your lakehouse system talk to Postgres, Postgres may not expose its internal statistics to you, so you have no idea what you're actually reading. So we need a way to adapt the query plan while it's running, based on the data we're seeing as we go, and that's what adaptive query processing allows us to do: the execution engine makes decisions on its own — sometimes consulting the query optimizer, sometimes not — to either modify the query plan structure itself, potentially changing join orderings or moving projections up and down, or to change the expression tree while the query is running based on the data it's seeing. The idea is similar to BtrBlocks, where they did a little sampling to figure out, here's the data I'm about to encode, and used that to decide on the best compression scheme. While the query is running, we're seeing the data, so we can start estimating the real selectivity of our predicates on that data and decide how to modify our execution plan. In the extreme case we just say: this is all garbage, the estimates are way off — give up, throw away all the results, go back to the query optimizer and say, try again, you got it wrong, here are some new statistics. And that sounds crazy — why would I stop a query, throw everything away, and run it again with a new plan? Well, the worst query plan can be orders of magnitude worse than the best query plan, so it actually can be worth it. Very, very few systems do this; most just take whatever they're given and run it, and I think only academic systems will completely give up like that. We'll discuss how to modify query plans later in the semester. Today I just want to focus on the tricks we can do to make expression evaluation go faster. And again, all the major OLAP systems out there — the Snowflakes, the Databricks, Dremel, BigQuery, I think Redshift as well — do bits and pieces of this. I don't think anybody really does the one where you throw everything away and go back and try again, but they'll do some of the things we'll talk about here. I'm going to focus on the Velox ones.
All right, so the first trick Velox uses is called predicate reordering. This is an old idea that goes back to the 1990s — there's a paper that did this in Postgres a long time ago, from Joe Hellerstein, who's now a database professor at Berkeley. The basic idea is that if I have, say, two functions in my WHERE clause that I need to run, I can decide in what order I actually run them, and there's a trade-off between how long it takes to compute each function and the selectivity of its predicate. So you might decide to run the slower function first because it filters out more tuples than the faster function, and then as the query runs, you can change the order based on the observed computation time and selectivity, all right? Column prefetching: the idea here is that if my operator needs, say, two columns, I can start scanning the first column — maybe doing the first half of my expression tree and ripping through that — while making an asynchronous I/O call to the storage service to fetch the second column in the background. By the time I finish the first column, the second one is there and I can start scanning it, right? We talked about prefetching in the intro class, but that was prefetching individual pages; this is prefetching individual columns within a single operator. The not-null fast path: this technique says that if I look at the null bitmap being passed up as the input to my operator, and a quick popcount tells me there are zero nulls in the column, then I don't need to do any null checks — I skip that processing entirely and run a faster version of whatever operator or expression I want to run. When I showed that Postgres numeric code, there were a bunch of null checks in there — if null do this, if not null do that — and that's all indirection and branch misprediction. So if I know there are no nulls, throw all of that away and run the faster version. Similarly, you can avoid additional checks for string data if you know everything is ASCII. ASCII is the original encoding scheme for representing strings, which has been extended to UTF-8 to support international languages and larger character sets — poop emojis and things like that. But if you know all the data in a column is ASCII, you don't have to do the more expensive UTF-8 handling, right? In the Velox paper, they show that if you compare the cost of running different string functions on the UTF-8 path versus the ASCII path, it's an order of magnitude difference for some of these functions — and it's a really simple check. So the idea is that you run the UTF-8 version first, and if you recognize, hey, this is all ASCII, you switch to the ASCII version — and if you get it wrong, abort and roll back. Yes? The question: other than predicate reordering, aren't these all things you'd decide not while running the query, but from what you know about the query itself — like that there are no nulls in this column, or that you can run the ASCII version?
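Here's a rough sketch of the bookkeeping behind adaptive predicate reordering, in the spirit of the Hellerstein-style rank ordering described above. The stats struct and rank formula are my own simplification, not Velox's actual implementation:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Track each filter's observed cost and selectivity as vectors flow
// through, and keep the filters sorted so the best "tuples filtered out
// per unit of work" runs first.
struct FilterStats {
  double total_ns = 0;  // time spent in this filter so far
  double rows_in = 0;   // rows it has seen
  double rows_out = 0;  // rows that survived it

  double Selectivity() const { return rows_in > 0 ? rows_out / rows_in : 1.0; }
  double CostPerRow() const { return rows_in > 0 ? total_ns / rows_in : 0.0; }
  // Lower rank = run earlier: cheap, highly selective filters win.
  // (Classic ordering: ascending cost / fraction-filtered-out.)
  double Rank() const { return CostPerRow() / (1.0 - Selectivity() + 1e-9); }
};

// Re-sort the execution order periodically based on the observed stats.
void Reorder(std::vector<size_t>& order, const std::vector<FilterStats>& stats) {
  std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
    return stats[a].Rank() < stats[b].Rank();
  });
}
```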
No, no, no — you don't know anything ahead of time. So while I'm running and I'm only seeing ASCII characters, I assume it's all ASCII — is that how it works? Right: here's your string column, and I don't think Parquet and ORC tell you whether it's UTF-8 or ASCII — or maybe it's not even a Parquet file; maybe you're parsing a log file or a CSV, right? So if you recognize, hey, this is all ASCII, you can run the faster ASCII version, but you need to see a bit of the data first to make that decision. And I think they have a fallback mechanism if you're wrong — for ASCII that works, since UTF-8 is backwards compatible with ASCII. But with nulls you can't do that, right? If you hit a null while running the not-null version? Well, you have the null vector ahead of time — that's the difference. You just do the popcount: is any bit set? If not, run the not-null version. All right, we're over time again, sorry. One last trick I want to show, which I like: if you recognize that whatever operation you're doing in the expression tree on string data can just overwrite the data you're given, you don't have to call malloc or allocate memory for new output. Think about it: I have my original data, I've decoded it from Parquet into Arrow, and now I'm just passing memory chunks from one operator to the next. If I realize I can just overwrite the memory of this column with the result of the operation — like upper(), which only uppercases the characters it's given — I can write the result back over the original value and reuse the same buffer that I then send up to the next operator, right? And they show that if you do that trick, you shave off another 25% to 40% of the runtime by reusing buffers: this is the baseline using the UTF-8 version, there's the ASCII version, and then you get even more by reusing buffers. That's a nice little trick; I like that. All right, so to finish up: today's lecture was basically a quick overview of how to design your execution engine, showing you the broad categories of things you have to think about. What kind of parallelism are you going to support? How are you going to move data from one operator to the next, and what is that data actually going to look like? How are you going to evaluate expressions, and what tricks can you do to optimize them? I'm not saying everything I showed you today is the complete menu of what's available, but now you understand the things you've got to think about when you build one of these engines. And again, in my opinion, Arrow is going to be the best choice for the internal representation. Is it perfect? No. Are there improvements to be made? Yes. But it is actually evolving — they added the Umbra-style string storage just last year, right? So it's not stagnant. Okay, next class on Monday we'll talk about how to run the operators themselves in a vectorized manner using SIMD. The paper I assigned is from the Germans; it's basically a deep dive into how to use AVX-512's features to vectorize relational operators, okay? All right guys, have a good weekend.
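A minimal sketch of the two string fast paths just described — the ASCII check plus the in-place upper(). One assumption of mine, implicit in the lecture: the overwrite is only legal when the engine owns the buffer and no other operator still needs the original input.

```cpp
#include <cstddef>
#include <cstdint>

// ASCII detection: in a real engine this would be learned adaptively while
// scanning, with a fallback to the UTF-8 path if a non-ASCII byte appears.
bool AllAscii(const uint8_t* data, size_t n) {
  for (size_t i = 0; i < n; i++)
    if (data[i] >= 0x80) return false;  // high bit set -> multi-byte UTF-8
  return true;
}

// ASCII upper() can rewrite the bytes where they sit: no malloc, no copy,
// and the very same buffer is handed up to the next operator.
void UpperAsciiInPlace(uint8_t* data, size_t n) {
  for (size_t i = 0; i < n; i++)
    if (data[i] >= 'a' && data[i] <= 'z') data[i] -= 'a' - 'A';
}
```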
Enjoy the unusually warm weather. Take care.