Today is the last part of the discussion we've been having for the last two or three lectures about optimizing scans and optimizing data systems using vectorization and compilation. So today is really a comparison of the two approaches, summarizing what we've learned, and then evaluating under what conditions one approach is better than another. Again, this is just the agenda for you guys in the class. Project one is due this Sunday coming up. Chi has the fastest submission; good luck beating that. I think we can discuss what makes his so good afterwards, among other things. Project two I'll talk about at the end of this class; that'll be due April 30th or May 1st, with a check-in on April 1st. And then project three we'll also discuss today. The proposals will be in class Wednesday; that'll be just a five-minute presentation from your team, and then we'll do an update later in the semester. The final presentation will be whenever we have the final exam date. So instead of having a real final exam, there's a take-home written final, but in class we'll do final team presentations. We'll get pizza and do whatever, and if you guys want to do something fun afterwards, we can go do that. Like last year we shot bows and arrows; we can go go-karting, whatever you guys want. Okay? Any questions about project one, two, or three? Mostly project one? Okay. So, again, just to summarize what we've learned in the last two weeks: vectorization can speed up your database, and compilation can speed up your database, right? We've shown and we've read papers that discuss where all the better performance from these things comes from: in the vectorization case, you're processing batches of tuples to amortize costs and taking advantage of new instructions in the hardware, and in compilation, you're basically hard-coding a program that executes exactly what your query wants. But we've looked at them mostly in isolation. I've alluded a little bit that the literature seems to suggest, and the paper you guys read sort of talks in these terms, that these two are mutually exclusive. Like if you have a vectorized system, it can't do compilation; if you do compilation, you can't do vectorization. That's not true, but it's still worth looking at the two approaches in isolation of each other to figure out when one is better than another and what the pros and cons of each are, right? And so the paper is a micro-benchmark sort of study, an evaluation within a single system, where we're going to avoid all the additional overhead and other complications of a real full-fledged system that may affect performance. Strip all that out and run on the same system that supports both compilation and vectorization; then you can have a true apples-to-apples comparison of the two approaches, all right? And the reason we want to do this is because building a database system is hard, and going back after it's already written and trying to add compilation or vectorization is basically almost impossible, right? It would be a major engineering effort to retrofit these things back in. Now, Postgres added predicate compilation with LLVM, but they only do it for the WHERE clauses. They don't do the holistic compilation that HyPer does.
Doing holistic compilation the way HyPer does, adding that back into a whole system, is basically almost like rewriting it from scratch. Vectorization is probably a little bit easier to add, but it's still not trivial. So ideally, if we're going to build a new system, we would understand the trade-offs of these two approaches, and based on what our use case or target application domain is, we can make a decision at the beginning, as we build the system, about which one we should use, right? And that's sort of what the goal of this paper was. So before we get into the microbenchmarks, let's just refresh our memory of what these two different approaches are. The paper breaks it up into the VectorWise camp and the HyPer camp. So recall from the last class that I said the way VectorWise worked, it wasn't doing compilation on the fly the way HyPer does; instead they had these pre-compiled primitives, essentially functions that each implement some small operation in the database system. Like take a column of integers and compare whether a value is greater than, or less than or equal to, and so forth. And for each of those operators, less than, greater than, and so forth, there's a separate primitive function that gets compiled. And then when the query shows up, you generate the physical plan, and that physical plan would then be interpreted by the database system. But instead of having these switch statements that say, if my column type is this and I want to do this operator, traverse the expression tree and evaluate it, instead I just maintain a pointer to the function that does whatever primitive I need, and then I invoke that primitive to operate on batches of tuples. And normally that would be super expensive if you were doing it on a per-tuple basis; in a system like Postgres, for every single tuple you've got to make a function call to see whether one value is less than another value, which is what Postgres does, and that's going to be super slow. So the way VectorWise avoids this overhead, or amortizes the cost, is by operating on batches of tuples, or vectors of tuples, hence the name VectorWise, right? And then the output of these primitives is going to be the offsets of the tuples that satisfy a predicate, similar to that bitmask stuff that we talked about before, and then they can feed that into the next primitive if it's a conjunction or there are multiple predicates, and use it to say, only evaluate this primitive on these matching tuples, right? Also, to be clear, these primitives are written by the developers, I forget whether they're auto-generated or not, and they're compiled when you compile the database system and shipped in the binary of the database system. It's not like a shared object that's getting linked in dynamically on the fly; the system comes with all these things pre-compiled. So let's look at a very simple query: select star from foo, where a string column equals some 'ABC' string constant and an integer column equals something, right?
So it's a simple query plan: a scan with a predicate, a filter. The filter we've broken up on the conjunction, so you'd have one primitive that takes a column of strings from the table and the constant you want to compare against, 'ABC', and then you scan through the batch, evaluate the predicate, and if it matches, append the tuple's offset into some vector. Then you take the output of this and use it to invoke the next primitive, where you now scan the second column, the integer column, apply that predicate, and return that back. Now, this is not the actual code you would want to use; it's an approximation. You obviously wouldn't want to allocate memory for the offsets within the function and return that; all of these things would be fixed-length buffers that you pre-allocate and reuse over and over again, right, to make things go fast. And we'll talk about it later on, but the paper will also study how easy it is for the compiler to auto-vectorize this, or whether we have to write things using intrinsics. Of these two examples, the integer one is super easy, right? The compiler had no problem with it. Strings are a little more complicated depending on how you encode them and so forth. All right, the other approach is HyPer, doing the holistic query compilation. Again, they call it the push-based model, the data-centric approach, operator fusion; it's the bottom-up approach where you have these for loops, and the idea is that each iteration of the for loop corresponds to one tuple in the pipeline. You do as much work as you can on that tuple until you reach the pipeline breaker, then go back and get the next tuple, right? They're trying to compile the whole query: not just have these function pointers, but bake everything into a single function, and then you invoke that. And the goal for them is to keep a tuple in CPU registers or the L1 cache for as long as possible up the pipeline, because that's going to reduce the number of cache misses and potentially increase the instructions per cycle. Again, this approach is not going to be vectorizable. The relaxed operator fusion paper that I talked about before, from my student, does sort of combine this with the vectorized approach, so you can combine both, but for this paper in particular they're not doing any vectorization. I think the paper you guys had on compilation was doing something like a vectorized HyPer. All right, so going back to the same query here: now instead of having multiple primitives, think of a single function that's going to scan the table to produce this query, where you're passing in maybe the pointers to the columns, and it evaluates the two predicates together, and if a tuple matches, puts it in the output buffer. And I'm showing you the branched version of this; I think they use branchless as well and compare that, but we don't need that for now. All right, so for today's class, we now examine these two approaches, see what works, and see when one is better than another.
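To make the contrast concrete, here's a minimal sketch in C of the two styles for that query. This is not the paper's actual code; the function names, the selection-vector handling, and the buffer sizes are simplified assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Tectorwise style: one pre-compiled primitive per (type, operator) pair.
   It writes the offsets of qualifying rows into a caller-provided
   selection vector and returns how many qualified. */
int sel_eq_int32(int n, const int32_t *col, int32_t val,
                 const int *sel_in, int *sel_out) {
    int k = 0;
    if (sel_in == NULL) {                  /* first predicate: whole batch */
        for (int i = 0; i < n; i++)
            if (col[i] == val) sel_out[k++] = i;
    } else {                               /* later predicate: survivors only */
        for (int i = 0; i < n; i++)
            if (col[sel_in[i]] == val) sel_out[k++] = sel_in[i];
    }
    return k;
}

/* Typer style: the whole pipeline fused into one loop, one tuple at a
   time; no selection vector is materialized between the predicates. */
int scan_foo(int n, const char **s_col, const int32_t *i_col,
             int32_t i_val, int *out) {
    int k = 0;
    for (int i = 0; i < n; i++) {
        if (strcmp(s_col[i], "ABC") != 0) continue;  /* predicate 1 */
        if (i_col[i] != i_val) continue;             /* predicate 2 */
        out[k++] = i;                       /* tuple survives the pipeline */
    }
    return k;
}
```

Note how the vectorized style materializes a selection vector between steps, while the fused loop keeps the surviving tuple implicit in the loop iteration.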
And then, because I welched on this last week, or last class, we'll do project two and project three and give you guys an overview of what's expected, okay? All right, so as I said, for this paper here, the goal was to have a single system that implemented both of these approaches so that we can evaluate them in a single architecture and understand when one is going to be better than another. The reason you have to have a single system is because when you look at some of the papers that compare, you know, HyPer versus VectorWise, or even our old system Peloton, the analysis of the results would always say things like: yes, HyPer's faster, but that's because they do numerics very efficiently. We'll talk about this for project three; they do numerics in a way that's way more efficient than what Postgres and most systems do. Or in other cases, VectorWise was faster because of the way they do multi-threading. So we want to avoid all of those additional features and functionality, strip that all away, and get it down to a single system so we have a straight comparison of these two approaches. Another example would be that HyPer uses the morsel-driven approach with threading to decide how to schedule tasks or queries, whereas VectorWise spawns threads and lets the OS schedule things, right? So we want to avoid all of that. So all the high-level algorithms will be the same in both implementations, but the low-level details will be slightly different, because there are certain design decisions that are going to be more efficient for one versus the other, that can't be the same. The example would be how they do hash joins. The hash table implementation is going to be exactly the same for both systems, linear probing, and the hash join algorithm will be essentially the same. The difference, though, is going to be the hash function. In the case of the vectorized implementation, they want to use Murmur2, because that can be vectorized, even though it executes, I think, double the number of instructions of the other approach. So it gets vectorized even though it uses more instructions than CRC32, right? What the HyPer-style implementation does is run CRC32, which I think there's an instruction for in x86; they run it twice, get two 32-bit hashes, and combine them together into a 64-bit hash, right? And that's going to be way cheaper, but because Murmur2 can be vectorized, it'll have better throughput for the way the vectorized engine wants to execute things, right? So the nitty-gritty details don't exactly matter; I'm just trying to say that there are things that are going to have to be slightly different, but these aren't incidental: because of what each approach implies for SIMD, they're inherent to the actual implementation of the system; you'd want to do it this way. So you can't have everything be exactly the same; there are things you're going to want to do differently. The two implementations are called Tectorwise and Typer. I'm going to slip up and probably say VectorWise and HyPer. Anybody know why they're called Tectorwise and Typer? The first author of the paper is named Timo; he named them after himself, right? The student that did all the work, right? I'm a co-author of this paper, I was happy to help, but Peter Boncz, another co-author, is the guy that built VectorWise with Marcin, who went on to do Snowflake, and then Thomas Neumann built HyPer.
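For the hash function difference, here's a hedged sketch of the Typer-style hash as described above: two hardware CRC32 rounds combined into one 64-bit value. The seed constants are made up for illustration; compile with -msse4.2.

```c
#include <stdint.h>
#include <nmmintrin.h>   /* SSE4.2: _mm_crc32_u64 */

/* Typer-style hash: two hardware CRC32 rounds, each producing 32 bits,
   combined into one 64-bit hash. The seed constants are arbitrary. */
static inline uint64_t hash_crc(uint64_t key) {
    uint64_t lo = _mm_crc32_u64(0x8ebc6af09c88c6e3ULL, key);
    uint64_t hi = _mm_crc32_u64(0x2545f4914f6cdd1dULL, key);
    return lo | (hi << 32);
}
```

Murmur2, by contrast, is just multiplies, shifts, and xors, so a vectorized engine can evaluate it across a whole vector of keys in SIMD lanes, even though it costs roughly twice the instructions per key.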
So it wasn't like they read the papers and tried to figure out how to do certain things; they asked the authors how they should implement both of these things. So, in my opinion, this is the best comparison of these two approaches, and I'm not just saying that because I'm an author of the paper; it really is like they asked the source, how do you do this? And Timo was a PhD student at Munich, so he could see the HyPer code anyway. All right, so with Tectorwise, they break operations into the pre-compiled primitives that we talked about before. The one key difference is that each of the primitives needs to materialize the output of whatever evaluation it's doing, these offsets, into a buffer, and then hand that to the next primitive. In the case of Typer, because they're doing these tight loops operating on a single tuple at a time, they don't have to materialize any offsets, because the tuple's just going to be there, right? When a tuple doesn't satisfy whatever predicate you're comparing against, you don't break out of the for loop; you just continue, go back, and start the next iteration, and the tuple gets discarded. You don't need to store it anywhere in a buffer, right? This will come up when we talk about the results: there will be some cases where this materialization cost is more expensive, but in other cases the benefit you get through vectorization will outweigh the benefit you get from these tight loops. So the workload they're going to use for this is a subset of TPC-H. TPC-H is showing up in a lot of the papers you guys are reading, right? Because this is the standard benchmark that people use to evaluate OLAP systems. It's an older benchmark, from the 1990s; there's a newer one called TPC-DS. TPC-H is like 22 queries and I forget how many tables; TPC-DS has more tables, and it's like 100 queries, with CTEs, more complicated. But TPC-H is sort of the baseline: if you say you have an OLAP system, TPC-H is the bare minimum you need to support to say you can do analytics, right? So instead of running all 22, they're going to choose five, and it's not chosen at random; it's based on a paper that Thomas and Peter Boncz wrote a few years ago, where they did a low-level analysis of TPC-H running in a real system to identify the different pain points and bottlenecks, like what are the key aspects of TPC-H queries that matter when you build a modern OLAP system? So the idea is that these five queries are representative enough of the real workloads you would encounter in the real world, as identified by the categorization done in that previous paper, right? And the idea here is, rather than trying to do a sweep of everything, we hand-pick which queries we want to run so we can focus on and understand more deeply where the problems and bottlenecks these systems hit actually are, right? So the first two queries, Q1 and Q6, are mostly scans without joins on the lineitem table, which is the largest table in TPC-H; you can think of it like the fact table in a star schema. Q1 computes four aggregates, and it also does fixed-point arithmetic with the fixed-point numerics, and then Q6 is very selective filters that are cheap to compute.
I think there might be one numeric, but it's basically integer evaluation; you're comparing two integers. Q3 and Q9 are both going to be the joins. The difference is the size of the hash table and the size of the data set you're probing with on the two sides of the join. So Q3 has a smaller hash table on the build side but probes more tuples. Q9 has a larger hash table but fewer tuples to probe. And then the last one is Q18. It's a group-by with a super high cardinality, like 1.5 million groups, which is a lot; that'd be a pretty huge hash table to build out. So again, I don't know if you guys have ever looked at what TPC-H queries look like. We have our own benchmark framework called BenchBase here at CMU, and this is linked in the slides; it's Java code, and you can see the templates for the TPC-H queries if you want to see what they look like. And for your final project, project three, if you need to run TPC-H, use this. And thanks to Li-Chen for making it so that BenchBase now supports generating the tables for you in Java. It used to be that you had to run the dbgen code, which is C, to generate the CSV files and then load those in through BenchBase; Li-Chen made it so everything loads all at once. So thank you. Okay. So in this first graph here, we're going to look at single-threaded performance of the two systems across these different queries. They're measuring run times in milliseconds, because scale factor one in TPC-H means it's a one-gigabyte database. So it's not massive. But again, we're just trying to get down to understanding the core differences of these approaches. So the first thing we see is that for Q1 and Q18 Typer does better, but for Q6, Q3, and Q9 Tectorwise does better, right? For Q1 and Q18, Typer is doing better on the aggregations, but for Q6, Q3, and Q9, Tectorwise is doing better on the filters and the joins. All right, does the graph tell us anything? Not really, right? Performance numbers are nice, you can see a difference, but they're not saying why, right? And that's the thing we actually want to care about. So to take this evaluation further, you now need to dig into it at the CPU level with the hardware counters: what's actually going on here, why are we seeing a difference in performance? [Question about Q6.] Yeah, there it's 15 milliseconds versus 16 milliseconds, so it's almost the same. But you look at Q18, that's a pretty big difference; Q9's a pretty big difference, right? Your point is that for the analysis we want to look at the ones where there is a big gap. Yeah, we'll get there on the next slide, yes. All right, does everyone know what a hardware counter is? Who here doesn't? Yeah, so in modern CPUs there are these registers that you can get access to. Sometimes you need privileged access, sometimes you can make a syscall to get it, but if you run perf or VTune, you can access these things. Basically the CPU maintains these counters for your program as it's running, and then you can run something like perf, which has almost zero overhead for the program while it's running, and it dumps out this perf file that you can then analyze and look at all these counters.
So you can run your database system (you could run other applications, but we care about databases), run these workloads, collect these performance counters, and then go back and understand at a low level what the CPU actually did. It's quite impressive. VTune is the GUI, the graphical interface that Intel gives you, and perf is the open-source command-line tool on Linux. I think they collect the same things; VTune just makes prettier pictures. Anyway, so in our table here, all the things on this side are the performance counters, but normalized by the number of tuples the query has processed; otherwise they're just arbitrarily large numbers and they're meaningless, right? And this column here is just a repeat of the thing I showed on the last slide: the runtime performance when you actually run the queries. The question is: does the runtime include the compilation time for Typer? No. Because that would just dominate things, right? At scale factor one the query runs in 15 milliseconds; it's not worth counting. Again, assume you could do the Redshift thing, where things are already pre-compiled for you and you just reuse them. Okay, so here's the runtime in milliseconds. And as we go along, to show which one's actually better, I'll just use the shading here: if the row is shaded, it means that one was the faster one for those queries, okay? All right, so let's start with the first case, where Typer was doing better, Q1 and Q18 here. If we look at this column, the number of instructions, you can see that Tectorwise is executing almost double the number of instructions of Typer, right? In this case lower is better, because fewer instructions is good, right? So that's what we attribute the better performance to. What's interesting, though, is that while Tectorwise is executing more instructions, its instructions per cycle are actually much higher than Typer's, for both the bottom one here and the top one. So even though Tectorwise is executing double the number of instructions, it's getting maybe 50% more instructions per cycle. So it's more instructions, but it's more efficient on the instructions it is executing, because it's avoiding cache stalls, right? And here, the kernels are tighter, and there are fewer, in this case, branch misses; that's what's saving them. The loops are tighter: Tectorwise has to do more work, because it materializes the output, but it's more efficient in the instructions it's executing; Typer, though, is just doing way fewer things overall, and that gets the better result, right? In the top case here, for Q1, the aggregation is the main thing causing it to be more expensive, putting things in the hash table and so forth. All right, so again, in this case, fewer instructions was better, and that outweighed the benefit you get from a better instructions-per-cycle count.
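If you want to collect counters like these yourself, perf does it from the outside, but you can also read them in-process. Here's a minimal sketch of the Linux perf_event_open API, normalizing per tuple the way the paper's table does; run_query is a hypothetical stand-in for the code under test.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

extern void run_query(void);   /* hypothetical: the code under test */

/* Open one hardware counter for this thread, counting user space only. */
static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    double num_tuples = 6001215.0;   /* lineitem row count at scale factor 1 */

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    run_query();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t instructions = 0;
    read(fd, &instructions, sizeof(instructions));
    printf("instructions/tuple: %.2f\n", (double)instructions / num_tuples);
    return 0;
}
```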
But in this case here, for Q3 with the join, Typer is executing fewer instructions, and you still see that benefit of the better instructions-per-cycle count in Tectorwise, but Tectorwise is actually still doing better here, right? It's close, but it's still better. So this is saying that we can't just look at the total instruction count per tuple or the instructions per cycle; we have to look at other metrics, other performance counters, and understand what's actually going on. In this case, it's the branch misses that are killing Typer and making Tectorwise run faster, right? Because these are all joins. In the case of Typer, on a single-tuple basis, whether there's a predicate match or whether you find a match in the hash table is essentially random, so you keep bouncing out of the for loop unpredictably. In the case of Tectorwise, for each batch of tuples it's looking at, some tuples are going to qualify, so you produce some useful results even if not all the tuples in your lane qualify. You keep flying forward; there's less branch misprediction because there's less branching, right? Now in this case here, Q9, the other join: same thing, Typer gets fewer instructions but worse instructions per cycle, but the branch misses are basically the same, and Tectorwise is actually even worse there, right? But even then it's still faster, and actually by quite a lot. So the thing we care about now is actually cycles, right? Because even though Tectorwise has higher branch misprediction, the number of stalls it has is just going to be much, much lower, because in this query the hash table is larger and the probe side is smaller than in Q3. So with the hash probes, there are fewer of them, and because Tectorwise can batch these things up, it can just do more work at once. In the case of Typer, doing the probe a single tuple at a time becomes expensive, because the hash probes are looking at random locations in memory, so the CPU can't speculatively execute ahead or look ahead and figure out what's going to execute next, right? So the way to think about it is that here the probe loop is more expensive, and you end up doing more stuff with more stalls: fewer instructions, but more stalls, because you're doing random lookups in memory, whereas in Tectorwise you can amortize that out. I don't really have a better way to convey this information other than going through these raw numbers bit by bit, because I don't think a graph would actually help. But it's good to see all these numbers in context, to understand what the queries are doing and why one performs better than another. It isn't just, okay, I'm executing fewer instructions, or I have better instructions per cycle; it's a combination of these things. Okay, so the main takeaway from that table is that both implementations, both system architectures, are efficient and achieve roughly the same performance.
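Here's a hedged sketch of why the batched probe hides memory stalls: splitting the probe into phases makes the random loads independent across iterations, so the out-of-order engine can overlap several outstanding cache misses. The table layout, the names, and the no-collision assumption are all my simplifications, not the paper's code.

```c
#include <stdint.h>

uint64_t hash_crc(uint64_t key);   /* e.g., the CRC hash sketched earlier */

/* Hypothetical open-addressing hash table with 2^p slots. */
typedef struct { uint64_t key; uint32_t payload; } slot_t;
extern slot_t   table[];
extern uint64_t table_mask;        /* 2^p - 1 */

/* Tectorwise-style probe: one phase at a time over the whole vector.
   The loads in the second loop are independent across iterations, so
   several outstanding cache misses can overlap in the out-of-order
   window instead of each one stalling behind a per-tuple branch. */
int probe_batch(int n, const uint64_t *keys, uint32_t *payload_out, int *match) {
    uint64_t pos[1024];                    /* assumes n <= 1024 (vector size) */
    for (int i = 0; i < n; i++)            /* phase 1: hash, no memory misses */
        pos[i] = hash_crc(keys[i]) & table_mask;
    int k = 0;
    for (int i = 0; i < n; i++) {          /* phase 2: independent random loads */
        if (table[pos[i]].key == keys[i]) {   /* collision chains omitted */
            payload_out[k] = table[pos[i]].payload;
            match[k++] = i;
        }
    }
    return k;
}
```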
One will be better in certain scenarios, and in other cases the other one will be better. And in the places where one clearly outperforms the other, the relative numbers can seem quite significant, but if you look at the absolute numbers, it's actually not that much, right? Going back to the graph view: as was pointed out, Q6 is a one-millisecond difference, and even for the one with the largest gap, Q18, we're talking 50 milliseconds. In relative terms that seems like a lot, but when you compare it against Postgres or a traditional row-based system, these two approaches are still going to be 100x faster, two orders of magnitude faster, than what a row-based system that does neither vectorization nor compilation is going to do. So I don't want to get too much into the weeds about, okay, yes, this one's faster by a third or whatever. When you compare against how existing systems were implemented before these techniques came along, either one is clearly better, and it may not matter that much which one you choose. The HyPer approach is going to be better for computation-heavy queries with few cache misses. The for loop iterates on a single tuple, but when the work you do per iteration is computationally expensive, keeping everything in CPU registers and operating on that one tuple at a time is going to be better than using batches, right? And the vectorized approach will be better for hiding cache-miss latencies. If I have to go get a bunch of data in different locations and I'm going to pay a cache-miss penalty for that, then at least when I pay that penalty, I'm going to process a lot of tuples all at once. So for every cache miss I can process, say, a hundred tuples, instead of paying one cache miss per tuple. And that's going to matter when you're doing hash joins, the builds and the probes. So again, this study is interesting, and if you're building a new system, I think this is a roadmap for what design choices to make. When you start measuring cycles, though, or low-level operations, it becomes almost frustrating; you're dealing with minutiae you maybe don't care about. My PhD student Prashanth, after he did the relaxed operator fusion paper, started looking at how to optimize in-memory hash joins. I think the state-of-the-art approach at the time was doing 12 cycles per tuple, and he got it down to 11. And that was after six months. It was very impressive what he did. But what he realized is that that's not a paper you can write, that's not really groundbreaking research, because you shaved one cycle, right? We're getting down to the bone here of how fast these systems and these implementations can be; there's not much left to optimize. It's the higher-level stuff, like query optimizers and so forth, that I think is more relevant. That's just my opinion. Okay. All right, so now what we want to do, again, there were a bunch of other studies in the paper, comparing against Knights Landing, comparing different hardware architectures, and so forth.
The other part of the paper I want to focus on is where they evaluate how much SIMD actually helps. So recall when we talked about vectorization two classes ago, there were some comparisons between the vectorized and the scalar versions. But now, if you want to compare against the compilation stuff, you want to know how much SIMD is actually helping in the cases where the vectorized engine is better, within a single test bed. I think when I showed results for the compilation stuff before, I was comparing the HIQUE stuff, which doesn't do vectorization, against different systems from different experiments, which had different things that could cause one to be better than another. Now, within a single test bed, we can understand how much SIMD is actually buying us. So for this one, we're going to use the vectorized branchless selection that we talked about before, and the hash probing in Tectorwise. And we're going to try to use AVX-512 as much as possible, because, as we said previously, in that newer iteration of SIMD from Intel there are new instructions and features that allow you to pass bitmasks into SIMD instructions to do selective evaluation of tuples. The thing that mattered before was that with AVX2 you could do the same thing, but the bitmask would have to be stored in a full SIMD register; now there are specialized registers that are just for bitmasks, all right. So for this one we only use Tectorwise with its pre-compiled primitives, and we'll see, if we toggle vectorization on for those primitives, how much it actually helps, which helps explain why in some cases Tectorwise does better than Typer. So we'll take Q3 and Q9, the ones doing the joins, and break them down into the different primitives: the hashing, the gather, and then the actual join itself. The gray bar is the scalar implementation and the red bar is with SIMD. So in this case, if you just look at the micro-benchmark, just the primitive running on a single thread, again at scale factor one, everything fits in the CPU cache, and you get pretty significant improvements, right? You don't get the 100X stuff or the 4X stuff we talked about before, but just for hashing, 2.3X is pretty significant, and this primitive gets executed a lot. Gather is a little less, and the join itself is 1.4X. The problem, as I said, is that if you look at the primitives by themselves, the benchmark is discarding the output: I hash my tuple and then immediately throw it away. But in a real system, because I have to materialize the output and chain it up the query plan, I'm not always going to get this benefit. So if you now look at the end-to-end query execution for Q3 and Q9 with SIMD, you only get a 1.1X improvement, because of that materialization cost. You're not just hashing things over and over again; you're passing things along up the query plan, and those sort of incidental costs end up being the bulk of the time spent in the query, not just running gather or hashing, right?
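As an illustration of what those mask registers buy you, here's a minimal sketch of a branch-free selection primitive using AVX-512 intrinsics: compare 16 integers at once, get a 16-bit mask in a dedicated mask register, and compress-store the offsets of the qualifying lanes. This is my sketch, not the paper's code, and it omits the scalar tail loop.

```c
#include <stdint.h>
#include <immintrin.h>   /* AVX-512F; compile with -mavx512f */

/* Branch-free selection with AVX-512: compare 16 int32 values at once,
   receive the result as a 16-bit mask in a dedicated mask register,
   then compress-store the offsets of the qualifying lanes. */
int sel_eq_epi32_avx512(int n, const int32_t *col, int32_t val, int *sel_out) {
    int k = 0;
    const __m512i needle = _mm512_set1_epi32(val);
    const __m512i step   = _mm512_set1_epi32(16);
    __m512i idx = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                    8, 9, 10, 11, 12, 13, 14, 15);
    for (int i = 0; i + 16 <= n; i += 16) {
        __m512i v   = _mm512_loadu_si512(col + i);
        __mmask16 m = _mm512_cmpeq_epi32_mask(v, needle);
        _mm512_mask_compressstoreu_epi32(sel_out + k, m, idx);
        k += _mm_popcnt_u32(m);
        idx = _mm512_add_epi32(idx, step);
    }
    /* scalar tail for the last n % 16 values omitted */
    return k;
}
```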
So in this case, SIMD is helping in the Tectorwise numbers I showed before, but it's not the main reason it wins, right? It's the way it moves tuples in these loops, in batches, through these pre-compiled primitives, how that data moves along while avoiding long stalls from cache misses. All right, so the next thing they did is try to understand how well the compiler can automatically vectorize these Tectorwise primitives. In the original VectorWise implementation, I don't know what's in there now, but this is what Peter told me they built originally: they didn't write anything with intrinsics. They just wrote code and hoped the compiler could figure it out. So what we did is go back and try a bunch of compilers: take these different primitives, in some cases write explicit code with intrinsics, but then recompile without those intrinsics and see how well the compiler can figure it out for us, focusing on AVX-512. So we look at three implementations: one with only automatic vectorization, one with manual intrinsics, and then one where we let the compiler auto-vectorize everything, and where it can't figure it out, we go back and write explicit intrinsics. So in this graph, we have the same five queries we looked at before, and what we're measuring now is the reduction in the number of instructions the CPU executes relative to a completely scalar implementation. The idea, again, is that we don't want to look at absolute values, because they'd be really large and not mean anything. We want to know, compared against the non-vectorized implementation of the system, how many instructions the vectorized approaches actually execute. And you see that auto-vectorization actually does a pretty good job, and in the cases where it can't figure it out, we had to go back and manually add things. The manual implementation, I forget why it's so much lower in this one case, but it doesn't really matter. The red bars are the combination of the two, and that's how you'd maybe actually want to build it: let the compiler try to auto-vectorize everything, and in the places it can't, manually add intrinsics based on the techniques we talked about before. I should also say, I'm only showing the results here for Intel's compiler, ICC. It did a much better job than Clang and GCC. This was 2017, so maybe they've gotten better. But Intel does a very good job making their compiler do well here; it's not free, though, it costs money, and that's why not everybody uses it, right? So we executed fewer instructions. Is that always going to help us? Not really, right? So now we compare the relative performance of the vectorized implementations versus, again, the scalar implementation, as a percentage improvement; less than zero means you're actually running slower. You'll see that in some cases it's fantastic: when we vectorize, we do a great job, we get faster. And in some cases it actually gets worse, right?
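To give a feel for what "the compiler can figure it out" means, here's a hedged pair of examples (my own, not from the paper): the first loop is the kind compilers auto-vectorize easily; the second, which reads through a selection vector, turns the loads into gathers and, at least with the compilers of that era, would often stay scalar unless you wrote intrinsics.

```c
#include <stdint.h>

/* Contiguous loads, no data-dependent control flow: compilers
   auto-vectorize this without help. */
void map_add(int n, const int32_t *in, int32_t c, int32_t *out) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] + c;
}

/* Indirection through a selection vector turns the loads into gathers;
   compilers of that era usually left this scalar, which is where the
   hand-written intrinsics came in. */
void map_add_sel(int n, const int *sel, const int32_t *in, int32_t c,
                 int32_t *out) {
    for (int i = 0; i < n; i++)
        out[i] = in[sel[i]] + c;
}
```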
And so the main takeaway from this is that you probably don't want to auto-vectorize your primitives, because the binary ships with those compiled primitives, and you might land on some hardware where that causes a problem. I don't remember whether at the time we actually measured whether the CPU downclocked on us because we were executing AVX-512, reducing the clock speed so that we paid a penalty for it. But it could also just be the cache misses and the stalls: gathering things into batches and putting them into registers doesn't come for free, and that's why things get slower, right? The cache misses are going to overwhelm any benefit you get from the parallelization through SIMD, all right? I think I might have mentioned this before: there's a research paper from Peter Boncz and the VectorWise guys where they compile the primitives in vectorized and non-vectorized form with different compilers, and then at run time they just measure: for the same primitive, say comparing an integer column with an integer constant, which of the two versions is faster? That's the one you keep using. So they can dynamically, at least in research, figure out whether they want to use the vectorized version or not, without having to recompile everything. All right, so this last graph here, it's not real numbers, it's just showing you the design space that's possible for building a modern OLAP system. The x-axis is operating a tuple at a time versus a vectorized approach, and the other axis is interpretation versus compilation. Interpretation here means you interpret a plan that calls pre-compiled primitives, which is basically what System R was doing; VectorWise is doing the same thing System R did back in the day, with pre-compiled primitives. And then you have the compilation approach, and it's just showing you that the sweet spot is somewhere in this range here. Again, this is from 2017; Umbra would potentially be up here, in the best place, and NoisePage, when we had that, was up here as well, because we did compilation and vectorization, okay. So the main takeaway is, again, that in the end they're basically the same in terms of performance. I think the vectorized approach is actually easier to implement and easier to debug and maintain over time from a software engineering point of view, and most of the major OLAP systems are choosing to do vectorization, not compilation. BigQuery does do compilation. As I said, Databricks had it in Tungsten, but they dropped it in Photon. Redshift does compilation, and they're compiling the whole query. ClickHouse, I believe, does compilation for WHERE clauses. It's a mixed bag, but most of the systems are doing vectorization, okay. So I would say, in my opinion, if you're building a system from scratch, I would start with vectorization, because you can get up and running more quickly than if you try to build a compiler. Right, so before we jump into the projects: starting next class, we'll spend three lectures on joins. We'll do hash joins first, sort-merge joins next, and then a new lecture on multi-way joins.
Very few systems implement this, but this is where I think this decade is going to go. A bunch of systems are going to support multi-way joins, because the performance benefits in many scenarios are quite significant. And the DuckDB guys have a branch that has support for it. Okay, all right, so let's go through all of these. Project two. For project two, the goal is that you write a report about a real-world database system. It's not a formal academic paper; rather, it's you synthesizing things we've covered in the class, looking at these real systems, and understanding, from the documentation, any public information, or potentially even talking to the developers, because we've done that in the past, how these systems are actually implemented. And then we have a taxonomy to describe the different feature components, what approaches they're using, right? So I'm going to post on Piazza today or tomorrow a link to a sign-up sheet where I'll list a bunch of these systems. It'll be first-come, first-served; you pick what system you want, because there's a lot of them. But I'm going to curate the list to cover systems that are significant enough that there's enough information to describe what they're doing. There are a lot of hobby projects where everyone rewrites Redis; we're not counting those. Instead, we're going to look at the ones that are actually full-fledged OLAP systems. And if there's a system you want to choose that I missed, tell me, because I'd be surprised if I missed one. There are also systems that people in previous years have already written articles for. The Postgres one, for example, is pretty well written; you wouldn't want to pick that. So what we're going to do is have two submission dates. The first one is a feedback submission, where you say, here's the first draft, on April 1st, and I'll give you feedback and say what's working or not working. And then the final submission will be on May 1st. Again, there are a lot of systems, and this is not even all of them; I'll help guide you to which one you want. So you'll be writing this on dbdb.io, my Database of Databases. It's this encyclopedia we built partly because of this class and the intro class, where students would ask, oh, what systems do multi-versioning this way? And I knew some of them, but I didn't know all of them. The idea is that we want a taxonomy that we can evaluate and browse and look at to understand how these systems actually compare, right? Because if you look at DB-Engines, they have the ranking thing that everyone always cites, but when you actually look at the descriptions of the systems, it's free-form text; it's not normalized. And I think they let people pay money to update it, which I don't want to do. Anyway, so here's the website. We can categorize things; like, here are all the systems derived from Postgres. Mariposa was the system that Mike built after Postgres: a distributed version of Postgres from when peer-to-peer systems were the hot thing in the '90s. So he built that. Anyway, and then there's an article like this where, for each of these different sections, there's a little bit of text that you write. Okay?
So for both the feedback submission and the final submission, it has to be entirely your own writing. I actually don't care if you use ChatGPT, as long as you then go fix it up and make sure it's actually correct, right? This is the modern era, right? Who cares? But you obviously need to make sure what ChatGPT is saying is correct. And I suspect it's probably trained on us, so for newer systems I don't think it's going to be able to figure things out, but we'll see. Anyway, both the feedback submission and the final one have to be your own writing. Don't copy and paste from random internet sources. Again, I don't care if it's ChatGPT; tell me and let's look at it together. Okay? No, why not? All right, any questions about project two? You want to write an article for BusTub? Let's take that one offline. But you keep asking me for a password to update other things; we can do that. All right, cool. I think I met you before you came to CMU, because I emailed you and asked, do you have a logo for RisingLight? Yeah, so anyway, all right. So that's project two. Project three. That one was hard; I couldn't get that one to work, right? All right, so this is going to be a group project with two or three teammates from the class. And the idea is that you build some significant amount of code, some system implementation, that ideally covers the things we talk about in this class. But it could also be something from your research, or something you're interested in trying out; I'm up for that as well. So this is not live yet, but I'll post information on Piazza. There'll be another spreadsheet, it's the same spreadsheet as project two, but there'll be another sheet within that file where you can put your name in for the group you want to be in. And if you can't find a teammate, there's a place to put your name and we'll figure out how to put people together, okay? So the first deliverable will be next Wednesday in class: a five-minute presentation about what your project is going to be about. This is just forcing you to think about what it actually is you want to do. There'll be a status update presentation in the first week of April. Again, this is so you don't say in March, I'm going to do this, then disappear, not do anything for two months, and then try to cram something in one week before it's due. It's a forcing function for you to show the class what you've done. And then the final design document and the final presentation will be whenever we have our final exam. And one thing that sucks, since we're not all building one system at CMU anymore: another aspect of this project used to be code reviews, because that's actually a really important part of going out into the real world and doing system development. You have to look at other people's PRs. What's that? Yeah, sure, yes. But I'm saying it's actually a good soft skill; there isn't a class on how to do PR reviews, and we used to do that in this class. Unfortunately we can't. Sure, yes. I mean, you've worked at real database companies, I'm not crazy, right? People aren't committing straight to main. You're not crazy. Yeah. Anyway, we can't do that because everyone's working on different code bases, but it might be something we can try to figure out how to do as we go along. Okay.
All right, so the presentation next week will be just five minutes about, at a high level, what you want to do. So obviously you have to have a topic. Again, I'm happy to chat either this weekend or early next week; if you're not sure, come to office hours. Then you describe what you're actually going to build and how you're actually going to build it. If you say, I'm going to modify Oracle, well, that's not going to work, because Oracle is closed source, and I don't want to get sued. But also, it takes 24 hours to compile Oracle. Yeah. There's a Hacker News post about it. The guy says the way you test Oracle is you write your code, submit it to the build system, and it takes 24 hours to build it and run the tests. So you basically have to do other stuff, come back the next day, and figure out what crashed. It's a nightmare. Yeah, I'll make sure of this afterwards. What's that? Because of all the tests? No, no, this is build plus test at 24 hours, right? Oracle's like a 40-, 50-year-old system, so for every single corner case where some customer had a bug, there's a test case. Yes, yeah. I don't know how long just the compile takes; it's not going to be instantaneous, but it's the tests that take long. Okay, anyway. So, on that note, how are you going to test whether your implementation is correct? I'm noticing the sophistication of the soft skills of the students in this class has gotten better over the years. You guys are the same age as previous students, but you come in like, oh, I know what a PR is, or, oh, I know how to write a test case that's more than just printing to standard out, right? You'd be surprised; in previous years, that's how we wrote tests. All right, and then, what workloads will you use for your project? Again, we have BenchBase; it has TPC-H that you can use, okay? The status update is just saying, okay, what, if anything, has changed in your plan, and where are you in the project? An early demo of something would be super cool as well, but it's not required. The design document will be at the end of the semester: here's the thing that I actually built. And the reason I ask you guys to do this is because it's one thing to write a bunch of code, but it's another thing to put it in writing. I'm not looking for a whole thesis; just put it down in writing and explain what it is you actually implemented. And then what we also do on the course website is a showcase: here are all the projects, with your slides and the link to the source code. And it's useful for two reasons. One is that companies see what you did and can hire you based on that. I've had companies reach out to me afterwards, like, hey, this person did this, and we need this in our system; can you connect us? We want to hire them. And then for the international students: you guys are going to email me in two or three years because you need a letter for an H-1B or whatever, and I can look at that showcase and see, okay, this person worked on these things, and I can write that in the letter.
So it's win-win for everyone. All right, the final presentation, that'll be, again, whenever the final exam is. And the idea is, again, we want some benchmark numbers and an understanding of the implementation, what you actually did. And for this one, best-case scenario, yes, you should have demos or something. All right. All right, so I'm going to go through four quick projects, things we can potentially work on. I mean, Abby and Yuchen and Chi already have a project, Li-Chen already has a project; there are some projects that already exist, but I'm going to talk about ones where we don't really have people working on stuff right now. So the first one is, what's that? Yeah, the Germans. Anyway, over spring break I have to go see the Germans; I've been summoned. What's that? No, they want to know why we're not building a system. I've got to go see them. Anyway, when I say the Germans, in this case, for this project, I mean Thomas Neumann. So he kept making this claim that the way HyPer did numerics was super efficient, way better than anything else, that nobody else did it the way he did it, and that it was easy to do. He said a bunch of things, and I know that's not entirely the case. What he claimed was that in this book called Hacker's Delight, it's an interesting book, not something you just read casually, there's a bunch of bit-shifting operations and tricks you can use to do fast fixed-point arithmetic. Was that right? Yeah, more or less. And so with somebody who took 15-721 before, we built this thing called libfixeypointy. It was originally written inside NoisePage, but we extracted it out into a standalone library, and it's based on some inscrutable techniques out of this book. Basically, Thomas said, oh yeah, it's this book, this chapter tells you how to do everything. Not true; it took us a while, but anyway, we have an implementation of it. I don't know how far along it is, I don't know how correct it is, but ideally I want to finish this project and then implement it as a Postgres UDT, a user-defined type, and then we can benchmark how fast our library is against what Postgres can do. And then we can potentially try to vectorize the standalone library, not inside Postgres, and maybe speed things up that way. So you'd basically have to learn a lot of bit-shifting stuff to make this actually work. And the student that implemented this is at Oracle now, and he's still on the CMU Database Group Slack channel, so you can ask him questions. It's Rohan, yeah. All right, so the next project. What's that? No, I know how to get people's attention, right? All right, so since last year we've been looking at how to accelerate Postgres proxies. Postgres networking is actually terrible; it's just really inefficient, and we'll read a paper from the DuckDB guys after spring break that shows it's not just the protocol itself, the implementation is also not ideal. And so the way a lot of people run Postgres databases in production is to actually put a proxy in front of them. PgBouncer is probably the most popular one; there's RDS Proxy from Amazon.
Basically, think of it as a little system where query requests go in and are immediately sent off to Postgres, but you can pool the connections. So instead of spawning a whole process for every connection, multiple connections in the proxy can point to a single connection in Postgres, and you reduce the number of connections Postgres has to maintain. So we have an implementation of PgBouncer where we use BPF to do user-space bypass. Does everyone know what BPF is? Who doesn't know? So BPF stands for Berkeley Packet Filter; the e in eBPF stands for extended. Basically, it's a way to write kernel code in what I think is a C dialect that then gets verified and runs in basically a VM inside the kernel. So you can't do everything; normally, if you write a C kernel module, it's arbitrary C code and you can break things. The BPF idea is a restricted API of what you can do, and they verify that you're not mallocing down in the kernel and breaking things. So we have a version of PgBouncer that implements the forwarding methods or procedures in the actual kernel itself. A query request shows up as a packet, and instead of going into the kernel, getting copied up to user space, processed, and sent back down, it stays in the kernel: we look at what it is and immediately send it back out without having to go to user space. What? Well, okay, so we've implemented this with BPF in PgBouncer, which is written in C and is also single-process, single-threaded. PgCat is a newer proxy written in Rust that's gaining popularity, and we want to apply our same BPF technique to it. No, you don't write the kernel part in Rust; you write it in BPF. The BPF stuff then calls up to the Rust side, to user space, if needed. The idea is you do as much of the processing as you can down below. Correct, yes. But, like, we can talk about it, fine: if there's state you have to maintain between both of them, you've got to reach up to it and get it back. So, PgBouncer is the most common one, but the best implementation of a Postgres proxy right now is Odyssey, out of Yandex in Russia. What they do is insane. They wrote their own coroutines, I think in C or C++, and they write assembly to move things around in memory when you context-switch from one thread to another. It's very impressive what they do, but modifying that to use BPF sounds like a nightmare, because they do a bunch of stuff that isn't necessary. PgCat would be the best one to target, and that would be multi-threaded. So user-space bypass would be the first approach. If we have more time, we could do kernel bypass. That's where, when a network packet shows up, instead of processing it in kernel space and then copying it into user space, you have the network card write into an io_uring ring buffer that user space can read from. I don't know how much benefit we're actually going to get; we've looked at it, and it may not be a huge win in this case, because you still have to do some packet processing and shove the packet back out, but at least we'd have a comparison. Now, we did try kernel bypass in PgBouncer using DPDK, the Data Plane Development Kit.
At some point I'll talk about DPDK. These things are a nightmare; you don't want to do this. We tried, it didn't work. io_uring is the modern incarnation of this, right? All right, so the 100% goal would be PgCat connection pooling with BPF, and my PhD student Matt has benchmark scripts against Odyssey and PgBouncer that you can run against. Again, the Rust thing is just tantalizing for the crowd. All right, the next project is that we want to be able to do some variation of adaptive query processing in Postgres. We'll talk a little bit about adaptive query optimization at the end of the semester, but the basic idea is that while the query is running, if you recognize that the query plan you have is not the best query plan, like if the join order is wrong, then what you want to be able to do is swap the order of things. Obviously you have to deal with intermediate state, intermediate results, but you want to dynamically change the query plan without having to stop the query, go back to the optimizer, and run it again; that's how most systems actually do adaptive query optimization. We're interested in whether we can do it fine-grained, swapping out plan nodes with something else without stopping the query, all right? The first version of this would just be: can we take a pipeline or a portion of the query plan in Postgres, which is just pointers to structs, to other chunks of memory, and swap a pointer out so that when the plan calls next, it goes to a different location that maybe returns fake data? So instead of this giant query plan tree where, if I call next on some top-level node, I go down and do a join and all the scans and stuff like that, it immediately comes back and says, oh yeah, here's your data, right? I can talk offline about the use case for this, but it goes beyond the adaptive query optimization my PhD student is working on. This would be working in conjunction with him: can I modify and patch this query plan on the fly without having to stop it? All right, the last project is doing UDF inlining; we'll cover this in lecture 14. The basic idea is that you have a UDF written in a procedural language like PL/pgSQL, and the Microsoft technique can convert the UDF code into basically relational algebra and inject it into the query plan of the calling query, so that when it does query optimization, it doesn't know or care that it used to be a function; it's now inlined as query plan nodes, relational algebra, and it can do a bunch of optimizations like dead-code elimination and other things to speed things up. So we'll cover the Froid paper and other techniques in lecture 14, but Sam is interested in applying this technique to DuckDB. And the reason we're choosing DuckDB over Postgres for this is that the Postgres query optimizer is way more primitive; it's more primitive than what DuckDB can do. The DuckDB one is based on what the Germans came up with; DuckDB was written by a German, but a different German. And so the DuckDB optimizer has the things we need in order to support Microsoft's Froid, so it'd be working with a different set of Germans to potentially add this support to DuckDB. Okay? Again, I think some of you have already talked to Sam about this. All right, so how do you get started? I will post the spreadsheet to sign up for groups on Piazza, and you meet with your team over the next couple of days if you've got a topic.
If you're still stumped, or you want more clarification on the things I talked about here, send me an email and we'll try to schedule time to meet. And we obviously want to focus on open-source things, so at least look at the source code. Don't try to get into all the details and get hampered, like actually getting it to compile; worry about that later. But at least have an understanding of what it would take to actually build the thing you're proposing to do. For the Postgres stuff we can provide some guidance; for DuckDB we'd have to go talk to them. And as I said, this could be during spring break, or this weekend: I'll be around this weekend if you want to chat about something. Okay? All right, any questions?