All right, guys, let's jump into this. Disappointingly cold this weekend, but that's OK. So some quick administrative stuff about Project 3. Everyone should have gotten two emails in the last week or so. The first email, from this morning, gives you instructions about how to access the MemSQL machines. Remember, we have three of them, so you basically request one, and if it's available, you'll get it, and then you can log into it and do whatever you want with it. You have root access to the machine, but be mindful that every 24 hours it blows away the kernel and wipes the machine entirely. So you have an NFS mount that you can read and write files to that are persisted beyond the machine getting wiped, but if you go in and change the kernel or install packages, when you come back, you may have to do it again. And the second email should have been from last week from this thing called Narwhal, which is basically the Emulab infrastructure we're using from the PDL, and everyone should have gotten an account that allows them to follow the instructions, log in, and request the machines. So has anyone here tried to do this yet, other than Lynn? Tried to use the machines? OK, that's OK. We also are getting a donation from a hardware manufacturer, so we'll be getting more DRAM for all those machines. Currently they have 16 gigs, but we're getting more later on. And then the other reminder for Project 3 is that in this class on Wednesday, every group will have to give a five-minute presentation about the current status of their project. It's basically the same thing you guys told me last week, plus a demo or whatever else we talked about. It's really just to give an update to the other people in the class about what you're doing and how far you've gotten in your project. Yes? This coming Wednesday, the 13th. I was actually very impressed and pleased with everyone's project status for the most part last week. Have you ever watched that video of Randy Pausch when he gave his last lecture? He talks about how he had this animation class and he was blown away by the kids' first projects, or their status update. He didn't know what to tell them when he went back to class, so he told them he was disappointed, so that people would work even harder and go even beyond what they were already doing. I thought about doing that, but that would be kind of a dick move. And honestly, I am impressed with some of the things you guys are doing. I'm quite happy. I'm also really impressed that groups are coordinating with each other. Some people need a thread pool, some people need this and that, and you guys are working together, making sure you're not duplicating effort, and pulling changes from each other. So overall I'm very pleased. We'll see how things go on Wednesday. All right, so for today's class, this is sort of the last lecture we're doing on runtime execution optimizations for queries. We talked about query compilation, we talked about scan sharing, we talked about compression. So today's topic, vectorization, is sort of the last additional thing you could do in a modern database system to speed up your operations. And then starting next week, we'll talk about how you deal with databases that exceed the size of the memory you have, the amount of DRAM available to the machine.
So this is sort of the last lecture on speeding things up, and then we'll finish up by talking about storage stuff. So we'll talk a little bit about the background of vectorization: why it matters, why it's important, what it means. Then I want to spend most of the time talking about the paper you guys were assigned to read, these vectorized algorithms that the guys at Columbia came up with, and go through some high-level, basic examples of how you actually use SIMD instructions, how you use vectorization on modern CPUs to increase the amount of instruction-level parallelism you can get during query execution. And then I'll finish up talking a little bit about the BitWeaving technique from Wisconsin. This is a storage format for columnar databases that can sort of natively be used with SIMD instructions, which I think is kind of cool. I don't think any real system actually implements this, although the Wisconsin guys were bought by Pivotal about a year or so ago, so it might end up in their systems. All right, so normally I like to start off the lectures with an observation about here's the thing that's hard, here's something that's challenging, here's something we have to consider in how we design our database system. But the observation we want to make about why we want to do vectorization is pretty obvious, and everyone should have already heard it before. So I'm just going to come out and claim that this is obvious, this is easy, this is a no-brainer. I want to include some other observations that I think are pretty obvious as well. So first, obviously, building a database system is hard. Everyone's been doing a lot of hard work on this: thinking about concurrency, thinking about memory, thinking about correctness is difficult. If you go to Taco Bell, you're probably going to get stomach problems. That's sort of obvious too. I hadn't been there in a while, but the last time I went there it was a rough ride. And then obviously the last one is that new CPUs are not getting any faster. You've all heard people proclaiming the end of Moore's law: on a single core, CPUs are not ratcheting up the clock frequency like they did in the 90s. Instead, the way we're going to get better performance and keep scaling with Moore's law is by adding more cores. We've already talked about the multi-core stuff when we talked about scheduling and the join algorithms. But now we're going to look at another type of parallelism we can have within a single core itself. So before we get into that, I want to spend some time talking about what the state of the art is for modern CPUs. There are basically two variants out there today. The first is what I call multi-core or many-core CPUs. This is essentially what everyone is familiar with: when you buy a machine from Dell or HP or whatever, you get one of these CPUs. These are exemplified by having a small number of really high-powered cores. So think of something like Intel Haswell, Skylake, Nehalem; all these architectures are based on this design. When I say high-powered, I mean not only is it drawing a lot of power, but the CPU has a lot of additional instructions and a lot of different optimizations baked in, like pipeline scheduling and all the stuff it can do to make it run faster. And these types of CPUs have two key characteristics.
The first is that they're superscalar. And the second is that they support out-of-order execution. Superscalar basically means that the CPU can execute multiple instructions at a given clock cycle, in parallel across different computational units on the CPU, rather than just executing them sequentially one after another. So you could have a bunch of instructions, and at a clock tick, you're doing a floating-point operation on this unit and an arithmetic operation on this other unit, or whatever else the CPU supports. And out-of-order means that within a pipeline, you can have all these instructions being executed not in the order they were actually invoked by the program, because the CPU can make decisions like: oh, I have this data locally in my cache, so I'll execute this instruction right away; but for this other one that actually came before it, I need to go do something else to get the data I need first, so I'll hold off on that. So say I have two instructions, A equals A plus 1 and B equals B plus 1. Even though the program issues A followed by B, the CPU is allowed to reorder them because those operations are independent of each other. Internally, what happens in the hardware is they have a dynamic dependency checker that looks at all the instructions and what data they're accessing, and makes decisions about how to order them such that they appear to have been executed in serial order, even though they were done in parallel. The other type of CPU, which you may be less familiar with from the last 10 years or so, are these things called many integrated cores, or MICs. These are new co-processor CPUs designed to supplement something like a Haswell CPU from Intel. The most common one is the Intel Xeon Phi; Nvidia has their Tesla stuff. The basic idea is that rather than having a small number of really high-powered cores, and when I say small, I mean like four, six, eight, 12, I think Intel just announced 18 coming out later this year, these have a larger number of cores. When I say a larger number, I mean in the several dozens, or even hundreds in the near future. But the key thing is that each individual core is going to be less powerful than the core you have in the Haswell chip, right? They're actually based on the Intel Pentium architecture that was developed in the 1990s: originally called the P5, and now they use the P54C variant with some additional stuff added to do SIMD and other things. So each individual core is less powerful, but you're going to have way more of them, right? So if you have something that can be easily parallelized in this kind of environment, which a database system can be, you may actually end up seeing better performance than what you would see on a commodity CPU like the Haswell. The other key difference is that these guys are not superscalar, and they only do in-order execution. So you have a sequential stream of instructions, and each one of these individual Pentium cores is going to execute them in exactly that order. It's not going to try to do any branch prediction or jump ahead in the pipeline.
And so the way they're going to get around each core being less powerful is to supplement it with additional SIMD instructions and larger SIMD registers that allow you to do more of the kind of parallel computations we're going to talk about in this class. So it's obviously not going to make your operating system run faster, but if you have some kind of high-performance computing program, or a parallel database system, then you should see a significant improvement using these guys. If you've never seen what one of these co-processors looks like: they're basically just like GPUs, right? They sit on the PCI Express bus and they can read memory off of that. So it basically looks like a GPU. Intel announced that later this year or early next year, I forget the codename for it, they're actually going to have Xeon Phis that are socketed CPUs. You could have one socket be your Haswell chip, and then another socket be your Xeon Phi. It might even end up on the same silicon, right? And that will get rid of the problem of having to move everything back and forth over this bus. They're not that expensive either; they're only like $2,000 to $3,000. We're not talking about whatever that new thing NVIDIA announced last week that's like $100,000, the DGX or whatever. These things aren't that expensive. But they're obviously not something that you can go rent on Amazon EC2; you'd have to have one in your local machine. Question? [What stops them from doing out-of-order execution? Is it very expensive?] So the question is what prevents the Xeon Phi cores from doing out-of-order execution. Yeah, so you need to add additional circuitry on the silicon to handle all that, right? And now you're taking up more space and it needs more power, so now you can't have as many cores. They're trying to keep it very simple. I think the Pentium Pro was the first Intel chip that actually supported out-of-order execution in the 90s, but for whatever reason, these guys don't do this. They traded that away and added more computational stuff for the SIMD instructions instead. All right, so I'm not a hardware person. I just find this stuff kind of interesting, because it definitely is relevant; it's definitely something you have to consider when you're building a modern database system. Okay, so just at a high level, this is sort of what the architecture looks like. This is what everyone's seen before: you have your Haswell chips, a single-socket CPU, say it has four cores. Each core is going to have its own private L1 and L2 cache, and then they'll share an L3 cache, and the hardware runs some cache coherence protocol to make sure that if one guy does a write in here, everybody sees it. The Intel Xeon Phis follow a ring bus format, but I don't think they have an L3 cache. So everybody has their local L1 and L2, and when they need to send messages to each other, it goes over this ring bus. This is different than that thousand-core paper you guys read earlier, where that was sort of a giant mesh network; that's based on the architecture that Tilera uses for their chips. The Xeon Phi uses a ring bus as opposed to a 2D mesh. All right, so given this is what the world looks like, and given that there are these new co-processors, the Xeon Phis, available to us, how do we actually take advantage of them?
So what we don't want to do is just take our existing parallel queries that we've talked about before and throw them on these CPUs, because they're not going to run as fast as the Haswells, even though we may have more of them. What we want to do instead is look at how to rely on this additional vectorization, the SIMD functionality they're adding to these co-processors, to speed up the operation on a single core. So when we talked about query scheduling and execution, that was using multiple cores. When we talked about query compilation, that was about making sure each individual core does something in a smart way. So now, again, we're dealing with a single core: how do we use the hardware to its fullest potential, and how do we take advantage of these new instructions? The basic idea of vectorization is that we want to take an existing program, i.e. our database system, and instead of having it process a single piece of data with a single instruction, we want to take a single instruction and have it operate on multiple pieces of data at the same time. This is essentially the SIMD stuff that we talked about before, which I'll go over again. So there are basically two ways to get vectorization. The first way is to do this automatically in your compiler. You can set some flags in the compiler and say, go ahead and try to find instructions that go together inside of a loop, and replace them with the corresponding SIMD operations. This is different than unrolling a loop. Unrolling a loop basically takes the one instruction you would run in the loop and adds it multiple times, corresponding to multiple iterations. This is taking those multiple iterations and, instead of having multiple instructions, having a single SIMD instruction, if the hardware supports it, do whatever the operation is that you want to do. So this is very hard to do, obviously, and the support you get for this in compilers is not that great, even in ICC, the commercial one from Intel. And it's definitely the case that this is probably not going to work for the most part in our database system. Now, I'm ignoring the query compilation stuff that we talked about two weeks ago, right? That's a whole other ball game. But in general, if you take the interpreted query plan execution model that we talked about, the control flow of that program depends not only on what the database looks like, but also on what the query plan looks like. So the compiler has no way to predict: oh yeah, these are the operations you want to do, let me vectorize them. If we turn this flag on, chances are we're not going to see a really big improvement if we're not doing query compilation, which we're ignoring for now. And obviously it requires you to actually have hardware that supports these SIMD instructions. It doesn't help you if you're invoking these things and the hardware says, I can't do this anyway. So what we're going to focus on instead is doing manual vectorization. We, as the database internals developers, we're highly skilled, we're highly paid, we're highly attractive, right? It's our job to look at our code and come up with the ways to do this, because we're not going to rely on the compiler, and the operating system can't do this for us either. So there are basically three categories of things that we want to look at. The first one is the most obvious, and this is dealing with linear access on our database, right?
So this is doing a sequential scan, where we're doing the same set of operations over and over again as we look for matches, whatever it is we're trying to do. I should also say that we're focusing on OLAP here, right? OLTP can't really benefit from vectorization, because you're doing point queries to find one thing. Maybe if you had a batch of queries arriving all at the same time that are trying to look in the same index, you might want to vectorize that, but I feel like the overhead of managing that would be very difficult. So we're really talking about OLAP here, where you know you're going to do long scans on the data. So predicate evaluation can be vectorized, and compression and decompression can be vectorized, right? Because we know we're going to be doing the same thing over and over again. Building upon that, we can also do ad-hoc vectorization of the sorting and merging algorithms. We've talked about this a little bit before when we talked about the sort-merge join, right? We showed how you can parallelize it across different cores, but now within a single core we can parallelize it even further by using SIMD. And then at the bottom, we can do composable operations. We can build up between all of these and actually build data structures that can be used in a vectorized manner. With all of these together, we can build up more complicated algorithms, or more complicated functionality in our database system, that are completely vectorized. And we'll see that the performance improvement you get from that is quite good under the right conditions. So we talked about this earlier with the sort-merge join, but I'm going to spend a little time going over it again. The term "SIMD instructions" is sort of redundant, because it's "single instruction, multiple data". It's this class of CPU instructions that all modern CPUs have that allow you to take one operation and apply it to multiple data points that are stored in these special registers. So think of this as almost like assembly coding, right? You're no longer dealing with arbitrary locations in memory and manipulating them in some way. Now you're actually writing individual values into individual registers and applying operations on them, and the CPU can do this very efficiently. All the major instruction sets that are out there today have their versions of SIMD operations. We've already talked about x86: MMX came first, SSE is what most people use today, and AVX is the more advanced one. The PowerPC guys have this thing called AltiVec, and some of the ARM implementations have this thing called Neon. And again, they can all basically do the same thing. The differences between them are how big the registers are, what types of values they support, whether they're floating-point or fixed-point registers, and the names of the functions you invoke on them, the CPU intrinsics, are going to be different from one architecture to the next. And I actually don't know whether anybody has a good abstraction library that can handle everything, right? For the most part, you'll usually write direct intrinsics for the instruction set you're working on. But maybe there's something out there I just don't know about. So to give an example of SIMD again, this is the same one we saw before: we want to add two vectors X and Y and produce a new output vector Z.
And so the way you would write this in scalar code, you would have a for loop, and for every single offset or index in each vector, you add the elements together and produce an output. So under single instruction, single data, you would spin through this loop and do the additions one by one. Now maybe your compiler can be smart about this and do unrolling so that you do a bunch of them at the same time, but we'll ignore that. Even if you do loop unrolling, you're still invoking all those instructions; you just may not have that branching loop, which is expensive in a superscalar architecture. But with SIMD, we can take the first four values from the first vector and pack them into an SSE register, and take the first four values from the other vector into another register. And then with a single invocation, we get a single output, right? We do this again for the other offsets in the vectors, and we produce our final answer. So what took eight instructions to do the addition across all these guys when we were doing single instruction, single data, takes basically two instructions here, right? So this is the kind of speedup we can get. Now, you obviously have to do a bunch of stuff to get the data packed into these registers, which we'll talk about later on. But even then, that's going to be less than what you would have to do in the scalar case. So the main thing we'll focus on are the SSE instructions, the latest version being SSE 4. There's also AVX, which I think is basically the same thing but supports larger data types. So basically what happens is you have these 128-bit registers, and you can pack four 32-bit values into them, and then there are all these function calls you can make on them to produce the answer that you want. This was first introduced by Intel in 1999, and as I said, all the modern variants of the Intel architecture, not only the commodity CPUs like Haswell but also the Xeon Phi co-processors, support this stuff. The kinds of things you can do are pretty straightforward. Obviously you can move data in and out of the registers; you can do all the basic arithmetic operations you'd expect, as well as the logical operations; and then you can do comparisons. But the ones that are more interesting are the ability to do shuffles, conversions, and cache control. The shuffle allows you to move data directly between different registers without having to go back onto the memory bus or back into the CPU's caches. The conversion allows you to convert data from the x86 or CPU cache format into the SIMD format. The one that's actually really cool, and kind of relevant to us, is the ability to have data move out of the SIMD registers directly into memory without having to write into, or pollute, your CPU caches. This actually matters in the context of database systems, because imagine you're joining two really large tables, and the output of the join can't be used until you finish the join operation. When you compute the join and find a matching tuple, you don't really care about putting it into the CPU caches, because you're not going to access it right away again. So instead, you want to put it into a temporary buffer in memory, so that when your join operation finishes, you go to the next operator in your query plan and you can pick it up and start using it there.
So this allows you to, it's sometimes called streaming stores, but it basically allows you to move data directly from these SIMD registers into memory. This is why I think of it as being like assembly programming: you're actually dealing at a fine-grained level with where your data actually is. Is it in the registers, is it in the caches, or is it in DRAM? Okay, so this brings us to the paper you guys were assigned to read. I actually really like this paper. It's really new, it's from 2015, so it's not even a year old yet; it was in SIGMOD last year. What I like about it is that they go through basically all the different operations you can have in your database system and show you how to vectorize them. All the papers that came before this have always been like, here's how you do sorting in a vectorized format, or here's how you do hashing, right? These guys actually treat and examine everything. Now, I realize in the appendix there's all this gnarly C code, and hopefully you guys didn't read that, but it's a good reference to know about if you ever actually had to implement this yourself; I think this is the go-to paper. There are some typos in it which I need to email them about, but this is the go-to paper that tells you how to do vectorized operations in a database system, which I really like. So the basic premise of this paper, the main contribution of it, is that they show you how to take fundamental vectorized operations and then build upon them to do more complicated things and add more complex functionality to the database system, right? So you start with basic things like selective loads, selective stores, scatters and gathers, and then build up from that to actually do partitioning, histograms, Bloom filters, and other things. We'll just focus on the high-level stuff here going forward, and we won't get into the nitty-gritty details, because it's kind of hard to show code examples in a lecture and say, here's the instruction you want to use to make this work. So I'll just show you the basic ones that I think are relevant, and if you want to learn more, you can read the paper further and see exactly how they did different things. Another key observation they make in this paper is that they show how to do vertical vectorization to get better performance over horizontal vectorization, and I'll show what that means in a second. But the basic idea here is that, yes, there are two different ways to use these vectorized instructions, and by trying to consume as much as you can from your input and vectorizing those operations, that's better than just vectorizing the target thing that you're looking at. We'll show this example when we talk about hash tables. Question? No? Okay. Another cool thing they do in this paper is that they try to fully utilize the lanes in the SIMD register as much as possible as they go along, and this means they can peek ahead in the input and add it to the computation, rather than waiting until the current batch is done before going to the next batch. I'll show what that means in a second. So by lane, I just mean a slot in the SIMD register: if I have a 128-bit SIMD register, I can put four 32-bit values in it, and each 32-bit value is called a lane in SIMD parlance.
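To make that X + Y example and the streaming-store idea concrete, here's a minimal sketch in C using the SSE intrinsics. The function name and the assumption that n is a multiple of 4 are illustrative simplifications, not from the paper:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Add two int32 vectors four lanes at a time, writing the result with a
 * streaming store that bypasses the CPU caches. Assumes n is a multiple
 * of 4 and that z is 16-byte aligned (required by _mm_stream_si128). */
void simd_add(const int32_t *x, const int32_t *y, int32_t *z, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128i xv = _mm_loadu_si128((const __m128i *)(x + i)); /* pack 4 values of X */
        __m128i yv = _mm_loadu_si128((const __m128i *)(y + i)); /* pack 4 values of Y */
        __m128i zv = _mm_add_epi32(xv, yv);       /* one instruction, four additions */
        _mm_stream_si128((__m128i *)(z + i), zv); /* write to memory, skip the caches */
    }
    _mm_sfence(); /* make the streaming stores globally visible */
}
```

The streaming store at the end is the "don't pollute the caches" trick from the join example above: the output goes straight to the in-memory buffer.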
Okay, so we're going to go through the fundamental operations, and then we'll build from that and show how to do scans, hash table probes, and building histograms. So the first one we have to deal with is the selective load. The top data item here is the target vector; this is the thing we want to write into. What we want to do is take the contents of some location in memory and selectively write it into our vector. So we have this mask that corresponds to the different lanes. If it has a zero, sorry, a zero says we don't want to overwrite that lane; if there's a one, we do want to overwrite it. So in this case, we look at the first position in the mask, we skip it because it's a zero, but then we come to the one, and now we know we need to overwrite B. So we go to the first location at our starting point in memory, and that's the value that gets written up here. It's not an exact straight-line shot up: from this starting point, I start adding in values going from left to right. Keep going further, we skip the zero, then we come to the one, and same thing, we overwrite there. The selective store is the reverse of this. We have some location in memory we want to start writing to, and then we have a mask that tells us whether we should write the values that exist in our input target vector. So in this case, we start with the first zero, we ignore that, we ignore the A, and then we get to the one here, and we know we need to write into the first slot the value that corresponds to this slot, or lane, in the vector. Keep going: zero, we skip that, and we get to the last one here. Right, is it sort of clear how we're doing this? This is different than just taking the entire chunk of memory and writing it directly into our vector, or writing it out of the SIMD register into memory. And the reason we want to be able to do this is that when we do comparison operators, say we're doing a scan or a join, and we're checking whether something should be emitted in the output of our operator, we want to be able to generate these masks and selectively say, here's the tuple we want to include, and ignore all these other ones. That way we can do all the comparisons directly on the SIMD registers, because the SIMD instructions support those comparison operators, and we don't have to keep copying things back and forth between the CPU caches and the SIMD registers. So the idea is that we're going to use these to do as much computation as we can out on the SIMD registers, and selectively pull back the data that we need. Now, the scatter and gather stuff is different than what you may be familiar with from the parlance of distributed query processing. Scatter-gather in distributed databases means you take a query, scatter it across a bunch of nodes, and then gather back the results. This is actually scattering and gathering the data items themselves. Yes? [Question about how the loads actually happen.] So the question is, for the selective load and store, is this done in parallel or in serial order? The answer is that it's actually done in almost serial order. The L1 cache only supports one or two accesses per cycle.
So you can't take this whole thing and suck it right up, because the hardware won't support that. So you write the code for the selective store using an intrinsic, and it looks to you as a programmer like it's doing it in parallel, but in reality, since you're pulling stuff through the L1 cache, it's going to take a couple of cycles. It's a good question. Okay, so for the selective gather, what we're going to do is take a vector that we want to update, and an index vector that tells us what value from memory should go into each slot. So in this case, the first index, the two, corresponds to offset number two, and that value gets written there. One corresponds there, five corresponds here, and everything gets overwritten. Right, you're sort of gathering together different locations in memory and putting them into our vector. Scatter does the reverse of this: now our index vector corresponds to locations in memory, and we apply this mapping that tells us where to write each value from our value vector into our chunk of memory. So the selective load and store take a mask of ones and zeros that says, for this tuple or this value, should it go into my value vector or into memory? The gather and scatter are you saying, I know which offsets I want to read or write values at. So, related to the earlier question, the gather and scatter stuff is not truly executed in parallel, because you only get so many reads to the L1 cache per cycle. And on the Haswell chips, gathers are only supported in the newer versions, so on the older versions they sort of emulate it: again, you have a CPU intrinsic, and the compiler will generate multiple instructions to actually do the data movement. And the selective loads and stores are not truly parallel either; they're done with these sorts of permutations. Not as bad as doing it in scalar sequential order, but not as good as if they were truly parallel. For the Xeon Phi, they support all of this in parallel.
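Here's a small sketch in C of what these fundamentals look like. The AVX2 gather intrinsic is the real one; SSE and AVX2 have no selective (compress) store for 32-bit lanes, so the second function is just a scalar model of its semantics (AVX-512 later exposed this directly as a compress store). The function names are illustrative:

```c
#include <immintrin.h>  /* AVX2 */
#include <stdint.h>
#include <stddef.h>

/* Gather: pull table[idx[0..7]] from eight arbitrary memory locations
 * into a single 256-bit register. Scale 4 = sizeof(int32_t). */
static __m256i gather8(const int32_t *table, const int32_t *idx) {
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    return _mm256_i32gather_epi32(table, vidx, 4);
}

/* Scalar model of a selective store: write only the lanes whose mask bit
 * is set, packed contiguously starting at out. Returns how many were written. */
static size_t selective_store(int32_t *out, const int32_t *lanes,
                              unsigned mask, size_t nlanes) {
    size_t j = 0;
    for (size_t i = 0; i < nlanes; i++)
        if (mask & (1u << i))
            out[j++] = lanes[i];
    return j;
}
```

Under the hood, as just discussed, the hardware (or the compiler, when emulating) turns these into a short sequence of permutations and cache accesses rather than one truly parallel memory operation.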
Okay, so now we know the basic building blocks. Let's start building up the stack and doing more complicated things. We're going to focus on sequential scans and hash table probing. We won't do the shuffle phase of partitioning, because that's a bit more complicated, but we'll show how to build, in parallel, the histogram that you would need to do the radix partitioning we talked about when we did the sort-merge join. And again, the paper provides additional instructions on how to do the joins, the sorting, Bloom filters, and other things, building upon what we've talked about here. All right, so now the sequential scan, which is probably the most important thing we want to vectorize. For this example, I'm going to use this simple query: SELECT * FROM table WHERE key is greater than or equal to some low value AND key is less than or equal to some high value. I'm going to show you first how to do this the scalar way, the non-vectorized version, and I'll show two variants of it: a branching way and a non-branching way. Then from there, we'll show how to do it in the vectorized way, building upon what these more primitive versions are doing. So the branching way to do a sequential scan is essentially what everyone would write the first time they build a database system and implement the scan operator. You have a for loop that iterates over every single tuple in the table. You extract the key that you're using for your predicate, and you compare to see whether it's greater than or equal to the low value and less than or equal to the high value. If that evaluates to true, then you copy the tuple into some output buffer you're maintaining to produce the intermediate result for this operator, and you increment a counter to move to the next position in the output buffer, so that when you come back the next time and have a match, it goes into the next slot. So the key problem with this approach is this if clause here, right? Because in a superscalar architecture like the Haswell, if the branch prediction gets this wrong, it basically has to flush the pipeline and pull in new instructions. Remember, each clock cycle it's executing a bunch of extra stuff; the CPU is trying to mask its latencies by doing as much as it can in parallel and then checking later on whether that was the right thing to do. And this if clause is hard to get right, because it depends not only on what the contents of the database are, but on what the query is. So if you have a predicate where all tuples evaluate to true, the branch predictor can handle that; if all tuples evaluate to false, it can handle that too. But as soon as you find one that doesn't match what it expected from before, you pay a big penalty in a superscalar architecture like the Haswell for that branch misprediction. An alternative way of doing a sequential scan without an if clause is to do something like this: we still iterate over every single tuple, but we immediately copy the tuple into our output buffer, right? We haven't checked whether it matches yet; we just say it might match, so we go ahead and copy it now. Then, when you do the predicate evaluation, rather than having an if branch, you replace it with two ternary operations that the compiler will replace with low-level addition or subtraction instructions, because you can basically change this into an addition and check a flag, without any branching. So now we have a branchless invocation to evaluate the predicate, and we get back a value saying whether it evaluated to true or not. If it's true, that value is one, so we move our output buffer position up by one. If it's zero, we come back around and just overwrite the last thing we copied, because it didn't match. There's some extra stuff you have to do to make sure that if the last tuple evaluates to false you don't actually include it in your output, but that's a minor thing. So this seems kind of crazy, right? It seems like you're doing all this extra copying over and over again for tuples that may not even produce output. But as we'll see in a second, on a superscalar architecture, this is actually the right thing to do if your selectivity is very low.
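Here's roughly what those two scalar variants look like in C, a sketch with illustrative names rather than the paper's exact code. The branchless version copies unconditionally and then advances the output cursor by the 0-or-1 result of the predicate:

```c
#include <stdint.h>
#include <stddef.h>

/* Branching scan: one hard-to-predict branch per tuple. */
size_t scan_branching(const int32_t *keys, size_t n,
                      int32_t low, int32_t high, uint32_t *out) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (keys[i] >= low && keys[i] <= high)
            out[j++] = (uint32_t)i;   /* copy only on a match */
    }
    return j;
}

/* Branchless scan: always copy, then bump the cursor by the predicate
 * result (0 or 1), which compiles to compares and an add, no branch. */
size_t scan_branchless(const int32_t *keys, size_t n,
                       int32_t low, int32_t high, uint32_t *out) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        out[j] = (uint32_t)i;         /* speculative copy, may be overwritten */
        j += (keys[i] >= low) & (keys[i] <= high);
    }
    return j;
}
```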
All right, so now we can talk about how to do this in a vectorized way. For this, instead of iterating over every tuple in the table, I'm going to say v subscript t, and I realize reading code in class sucks, but this will be it. So here we're saying we're going to get a batch of tuples, a vector of tuples, from our table, and operate on them in every iteration of our for loop. Then we do the SIMD load to grab all the key values and put them into this vector of keys. Then we apply our predicate, and this can be done with a SIMD comparison that takes a bunch of key values, applies the predicate to see whether they're true or not, and produces a vector mask. If at least one of the offsets in our vector mask is true, then we go ahead and do our SIMD store, and this is that selective store we talked about before: we have this mask of ones and zeros, and we write out to a memory location where the values are true. So let's take an example of how this actually works using those fundamental operations from before. For this, we'll say our query is SELECT * FROM table WHERE key is greater than or equal to o AND key is less than or equal to u. And let's say this is our sample data set here; it's kind of hard to see, but hey. We have two columns: a list of IDs, and then a list of keys, which are just one-character strings. So the first thing we do, we'll use the selective store to, sorry, I guess the selective gather, to take the memory locations and put them into our vector here. This should also show you the advantage of using the DSM, or column, format, right? Because if this was a row store, we'd have to jump around a bunch of different memory locations to take the values we want from this column and put them into our vector. If we're looking at a lot of different columns, or our tuples are really wide, then when we go fetch a block of this data, it may not all fit in our L1 cache. But in this case, because everything's in a column store, we assume it's in a column store, we just jump to this one offset and copy in the vector of tuples we're looking at. All the data is contiguous, and we can just write it into our register, and that's really, really fast. So now we have this vector, and we use our SIMD compare to produce the mask, right? It's one if it evaluated to true, zero if it evaluated to false. And then we have another vector with the offsets of the tuples we're evaluating; for now, we're just assuming it's zero through five here, but you can imagine that as you walk through the table, these get adjusted as you go along. And then with this mask, we use the SIMD selective store to write out the offsets that matched the predicate. So now we know that one, three, and four matched our predicate. Then, depending on whether we're doing early materialization or late materialization, we could do additional SIMD instructions to get the rest of the data we need to produce the output, because it's a SELECT * query, or we just pass these offsets along up to the other operators in the query plan and let them worry about doing the materialization later on. Is it sort of clear how we're doing this, right? We took the data we had in memory, we put it into a vector, and we can do all our operations directly on the SIMD registers without ever having to go back to memory. The algorithm they show in the paper is actually more complicated, because they have to deal with overflows and other things like that, but we'll ignore that for now.
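And here's a hedged SSE sketch of that vectorized scan, four keys per iteration. SSE has no greater-or-equal compare for 32-bit integers, so the predicate is expressed as NOT(key < low OR key > high), and the per-bit output loop stands in for the selective store (the paper instead uses a precomputed permutation table to pack the matches). It assumes n is a multiple of 4:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>

size_t scan_simd(const int32_t *keys, size_t n,
                 int32_t low, int32_t high, uint32_t *out) {
    __m128i vlo = _mm_set1_epi32(low);   /* broadcast constants to all lanes */
    __m128i vhi = _mm_set1_epi32(high);
    size_t j = 0;
    for (size_t i = 0; i < n; i += 4) {
        __m128i k = _mm_loadu_si128((const __m128i *)(keys + i));
        /* lanes that FAIL the predicate: key < low OR key > high */
        __m128i fail = _mm_or_si128(_mm_cmplt_epi32(k, vlo),
                                    _mm_cmpgt_epi32(k, vhi));
        /* 4-bit mask of the matching lanes */
        int m = ~_mm_movemask_ps(_mm_castsi128_ps(fail)) & 0xF;
        /* stand-in for a selective store of the matching offsets */
        for (int l = 0; l < 4; l++)
            if (m & (1 << l))
                out[j++] = (uint32_t)(i + l);
    }
    return j;
}
```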
So now we can look at the performance of this, right? For this, we're going to compare the Xeon Phi versus the Haswell multicore. The Xeon Phi has 61 cores with 4x hyperthreading, so for each core you have four program counters, four hardware threads. And the Haswell chip has four cores with 2x hyperthreading, so in total you have eight hardware threads. So the first thing we want to compare is how the scalar, branching version of the selection scan does. What you see right off the bat is that the Xeon Phi, because it has way more cores, outperforms the Haswell machine. But as you get to higher selectivities, you run out of room in your cache and you're always writing out to memory, so they converge in performance here: you're bottlenecked on the memory bandwidth. Now, if we add the branchless version of the selection scan, we see something very interesting. The first thing is that the branchless version actually performs worse on the Xeon Phi than the branching version. But on the Haswell machine, for the low selectivities, so like this one percent means one percent of the tuples get selected by the scan, the branchless version works better. So why is this the case? Right: on the Xeon Phi there's no out-of-order execution, so there's no big branch misprediction penalty to avoid. On the Haswell, yes, we're copying more data, but that's okay, because the penalty of a branch misprediction in a superscalar architecture with a really long instruction pipeline is massive. That's why the branchless version performs better here. And of course, as you get over here, you're just bottlenecked on the memory controller. Yes? [If you have zero percent selectivity, shouldn't there be no penalty?] So the statement is, if you have zero percent selectivity, you pay no penalty, assuming the CPU always predicts that nothing will match. Yeah, but you have no control over that. And you can't provide hints to the compiler, like anything to say, hey look, this is never going to match. What's that? In kernel code? Oh, kernel. Yeah, but do you have to be in a privileged mode to do that? In the back, yes? Are you talking about giving hints to the branch predictor? Can you do that? Yeah, you can do that. Okay, so I don't know if they did that. All right. So now we add back the vectorized case. Again, you see the Xeon Phi clearly outperforming the Haswell. This is because there are just so many extra cores for it to use, even though they're less powerful. Because basically everything it's doing is almost purely SIMD instructions, you're taking advantage of everything they've added to this architecture. Of course, everything again converges to the same point here when you start to run out of cache, or you saturate the memory controller. So this is sort of clear: what seems like a terrible idea, doing this copying every single time even though you may not need the data, actually turns out to perform well when you have low selectivity. All right. We don't have that much time, but I'll go through as much as I can. So the next thing that's kind of interesting is how to do parallel hash tables. In the paper, they talk about how to build the hash table in parallel, but I want to spend time talking about how to do probing of the hash table. For this, we're going to assume we have an open-addressing hash table. That means we have some hash table that has these slots and their corresponding keys.
We hash some value, and that tells us what slot we should start in. Then, if that slot is full, you do a linear scan, going down until you find a free slot. If there are no more free slots, then you have to rebuild the hash table. This is different than a chained hash table, where you have a linked list of the additional items that collide. So to do this the scalar way, you start with a single input key, you hash it, that gives us our hash index, and then we jump to that location, take whatever key is in there, and do a comparison to see whether they match. In this case, key one does not equal key nine, so we scan down in the hash table, evaluating the keys one by one, until either we find an empty slot, meaning we know our key is not there, or we find our key and have a match. So the first-pass way to do this using vectorization is a bucketized hash table, where every single slot in the hash table has multiple key entries. What happens now is, we take our single key, we hash it, we get a hash index, we jump to some location, and instead of getting back a single key to compare against, we get four. We put those into our SIMD vector, do our SIMD compare, and see whether we get a match. The problem with this is that we're still iterating over the input that we're using to probe the hash table one key at a time. So we're not speeding up much, because if we have a million probe keys, we have to do this a million times. Even though we can evaluate multiple keys that exist in the hash table in parallel, we're still looking at the input to our probe in sequential order. This is what they refer to as horizontal vectorization. The way to do even better is vertical vectorization, where we look at a vector of input keys at a time, a batch of the input keys that we want to probe with. We apply our hash function to produce a hash vector that we put into a SIMD register, then we jump to these different locations in our hash table, pull the candidate keys back using the SIMD gather operation, and now we can do our SIMD comparison between these two guys. This produces a mask for us that tells us which lanes evaluated to true. Now, a naive implementation would recognize that the first key and the last key have matches, but the middle two guys don't. So what we would do is jump to where we left off with the middle two guys, move to the next slot, and see whether we have a match. But that means these lanes for the first and the last key are essentially unused, because we've already found matches for them; we don't need to keep looking. So what they propose is that when your mask produces ones in these slots, you go back to your input vector and pull forward the next keys in your sequence. So key one and key four get replaced with key five and key six, and then with some additional bookkeeping, you recognize that when I do my hashing, I want to hash the first and the last lanes to new locations, but for the middle two guys, I want to increment by one, because I want to pick up where I left off and jump down to the next slot to see whether I have a match. And then you can do your comparison this way.
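The lane bookkeeping is fiddly to show with real intrinsics, so here's a scalar C model of the vertical algorithm's per-lane state machine, under some illustrative assumptions: open addressing with linear probing, a power-of-two table size, a sentinel EMPTY value that never appears as a real key, and a made-up multiplicative hash.

```c
#include <stdint.h>
#include <stddef.h>

#define LANES 4        /* models one SIMD register's worth of probe keys */
#define EMPTY (-1)     /* assumed sentinel: free table slot / retired lane */

static size_t hash32(int32_t k, size_t tsize) {   /* tsize is a power of two */
    return ((uint32_t)k * 2654435761u) & (tsize - 1);
}

/* Probe all keys against an open-addressing table with linear probing,
 * refilling each lane as soon as it resolves so no lane sits idle. */
size_t probe_vertical(const int32_t *table, size_t tsize,
                      const int32_t *probe, size_t nkeys, int32_t *out) {
    int32_t key[LANES];
    size_t  pos[LANES] = {0};
    size_t  next = 0, nout = 0;
    int     active = 0;
    for (int l = 0; l < LANES; l++) key[l] = EMPTY;
    for (int l = 0; l < LANES && next < nkeys; l++, active++, next++) {
        key[l] = probe[next];                /* initial batch of probe keys */
        pos[l] = hash32(key[l], tsize);
    }
    while (active > 0) {
        for (int l = 0; l < LANES; l++) {    /* one "SIMD step": gather + compare */
            if (key[l] == EMPTY) continue;   /* lane already retired */
            int32_t cand = table[pos[l]];
            if (cand != key[l] && cand != EMPTY) {
                pos[l] = (pos[l] + 1) & (tsize - 1);   /* keep linear probing */
            } else {
                if (cand == key[l]) out[nout++] = key[l];  /* hit */
                if (next < nkeys) {          /* refill the lane immediately */
                    key[l] = probe[next++];
                    pos[l] = hash32(key[l], tsize);
                } else { key[l] = EMPTY; active--; }
            }
        }
    }
    return nout;
}
```

The refill step is exactly what makes the output order unstable, which is the tradeoff discussed next.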
So this allows all the lanes in your SIMD registers to always be fully utilized, because you're not looking for matches for things where you already found one. The downside is that this makes the algorithm no longer stable. Meaning, if I run the same query on the same database at different times of the day, I may actually produce the answer in a different order, because I'm shuffling the order in which I evaluate my input keys. So key one and key four would come out in my output, followed by key five and key six, instead of key two and key three. This is okay in a relational database system, because we're based on bag algebra: we have no notion of ordering, we just care whether a value exists in our bag or not. If you really care about ordering, you can just do an ORDER BY, a sort, higher up in the query plan, and that makes your algorithm stable again. So I don't really see that being a big issue, and then you get all the gains of full lane utilization. So we can look at the performance of these two again, comparing the Xeon Phi with the Haswell machine. Starting with the scalar versions, we see again that the Xeon Phi, when we have really small hash tables where everything fits in your L1 and L2 caches, does quite well, and as you get to larger sizes, everybody gets worse. Again, this is basically taking advantage of all the extra cores. Then we add the vectorized versions, and again we see that on the Xeon Phi, the vertical version does really well for the most part. But there's a little flip here between the horizontal version and the vertical version as you get to larger hash table sizes, and I don't fully understand what was going on there. The main thing to point out here is that everybody converges to the same point when you run out of CPU cache. On the Xeon Phi, each core does not have that big a cache, so they fall off around one megabyte. On the Haswell, each core has L1 and L2, plus the shared L3, so they're much larger and can deal with larger inputs for longer. So now, going further, we can do partitioning in a vectorized manner. For this, we're going to ignore the shuffling; we're just going to deal with how you build the histogram. Remember, when we talked about radix partitioning for the sort-merge join algorithms, you would do the radix partitioning to divvy up the tuples to different cores and allow them to process them in parallel. So this is the same thing, but now we're dealing with, on a single core, how do we generate that histogram for the tuples we're supposed to evaluate before we do the shuffling? We can use the scatter and gather fundamental operations to generate this histogram. The issue we have to deal with, though, is collisions: when two values in our SIMD register hash to the same thing. So say there's an input vector: key one, key two, key three, key four. We use the SIMD radix implementation to come up with the hash vector, and then they all write into this histogram vector using SIMD adds. But the problem here is that hash value two and hash value four both hash to the same location, and the SIMD operations have no way to handle these conflicts. Both of these values, when you do the addition, would try to write to this location, and one would just get overwritten by the other. So you would lose this count.
So the way they handle this is that they actually replicate the histogram, so that each lane in our SIMD register has its own copy of the histogram. Now you can use that scatter operation to have them write to different parts, and then at the end you just do a SIMD add to collapse it and produce the final answer you're looking for. So for this value here, where before we had plus one, plus one, that can be collapsed into plus two, and we produce the right answer. Again, I'm not really explaining what's in memory versus what's in registers; for the most part, you can think of it as trying to do as much as you can in your SIMD registers. This also assumes you're dealing with single keys, so it wouldn't quite work if you're trying to partition on three attributes of your table. That's one of the major limitations of all the SIMD stuff.
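Here's a scalar C model of that replicated-histogram trick, assuming an 8-bit radix on single integer keys (an illustrative choice): each lane keeps a private copy so the scattered adds can never collide, and a final pass collapses the copies, which is what the SIMD add does.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define LANES   4
#define BUCKETS 256  /* 8-bit radix */

void build_histogram(const int32_t *keys, size_t n, uint32_t *hist) {
    /* one private histogram per lane; static scratch space for brevity,
     * so this sketch is not thread-safe */
    static uint32_t lane_hist[LANES][BUCKETS];
    memset(lane_hist, 0, sizeof(lane_hist));
    for (size_t i = 0; i < n; i += LANES)      /* one iteration = one key vector */
        for (size_t l = 0; l < LANES && i + l < n; l++)
            lane_hist[l][(uint32_t)keys[i + l] & (BUCKETS - 1)]++;
    for (size_t b = 0; b < BUCKETS; b++) {     /* the SIMD add collapses the copies */
        hist[b] = 0;
        for (size_t l = 0; l < LANES; l++)
            hist[b] += lane_hist[l][b];
    }
}
```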
To finish up with joins: I'm not going to go through the algorithms for how to do this, but basically they describe how to do the non-partitioned way, without any radix partitioning; then having a partitioned table per thread, but only for the outer table; and then having one for both the inner table and the outer table. They call partitioning only the outer table "min partitioning", and doing the grace hash join, where you partition both the outer and the inner, "max partitioning". Both of these last two variants are fully vectorized, whereas the top one is only partially vectorized. And what you see is that, not surprisingly, the grace hash join, partitioning both things and fully vectorized, outperforms everyone else. This is because you're doing as much computation as you can on the data as it exists in the registers, without having to go back and forth between the CPU caches and the SIMD registers. There's not really much else to say here: without any partitioning, vectorization doesn't really help you. This is also on the Xeon Phi, so it's not the Haswell chip. With min partitioning, you get a little bit of gain, but you see a significant improvement when you do the full partitioning. Okay. So in the last 10 minutes or so, I want to talk about BitWeaving. BitWeaving is sort of like the byte-slicing stuff we talked about before: it's an alternative way to lay out your columnar database in memory in such a way that it can be operated on very efficiently using SIMD instructions. They're going to use dictionary encoding for all the data items, but then they're going to lay out the actual values in a fragmented way that makes it really fast to do computations. This was implemented in a prototype storage engine called Quickstep out of Wisconsin. As I said, they were bought by Pivotal probably last year, and Jignesh Patel, the database professor who was leading this project, is now the chief scientist, for at least another year, at Pivotal. Pivotal does Greenplum, and they do GemFire XD, now called Apache Geode. Pivotal is kind of a weird thing: they took half of the database company out of EMC and the database stuff out of VMware, and they merged them together and made a new database company. EMC had bought Greenplum, VMware did their own stuff in-house, they mashed the two together, and then they made a new startup. All right. So there are basically two different layouts for BitWeaving: the vertical way and the horizontal way. I'm not going to claim one is better than the other, and I think discussing how to do the horizontal one would be difficult in the time we have. So I'm just going to talk about the vertical one, because when I saw Jignesh talk about this a few years ago, the light bulb went off in my head; I was like, oh yeah, this is pretty cool, and this is definitely why you want to use SIMD. So suppose we have 10 tuples, and we're going to break them up into segments. We'll say that each segment holds 8 tuples, so the first segment has tuples 0 to 7, and this one has 8 and 9. We're going to assume we've already done dictionary encoding for all the values in the column here, and now we're going to lay out each segment such that the first bit of each value of this particular column is laid out contiguously in memory, followed by the second bit, and then the third bit. So this is different than a pure column store, where you would have the whole value for the first tuple laid out for this attribute, followed by the second tuple. Now we're looking at the individual bits of the values. So it looks something like this: value 0 would be all the white bits, value 1 would be all the gray bits, and value 2 would be all the darker bits. In the case of this segment here, segment 2, there are only two tuples. So you would have the bits for those two tuples together at the beginning of the value, and the rest would be all 0s. The database system knows, as it processes queries, that it should ignore anything that comes after that, because it doesn't correspond to any tuple. Yes? [Question about the layout.] This is one attribute, each value has three bits, and we're laying out all the bits for this attribute contiguously in memory. You can imagine that if we had millions of tuples, you would have lots of these values, these really long bit runs. So what this allows us to do is, as we process queries and do sequential scans, we don't have to look at all the data to check whether our predicate evaluates to true or not. You lay out each segment's bits into values the size of a processor word, and then you have as many values as the code width: since we have three bits, we'll have three values. So let's say our query is SELECT * FROM table WHERE key equals 'CMU', a simple equality predicate against a constant. We use our dictionary encoding to come up with the bit pattern for this constant. Now, when we want to do our evaluation, we start with the first value, which corresponds to the first bit position, and we do a SIMD compare, and that produces a mask for us, with ones and zeros again, for the tuples that can still match. So now we can use that selective store again to say: when we do the comparison on the next bit position of our constant, we only have to look at a subset of the values, because we know the other ones didn't match on the first bit, so they can never match no matter what the other bits are. So we only need to selectively look at these three here. Now, when we do the next SIMD compare, we produce a bit mask that is all zeros. We have no tuples matching, and that means we don't need to bother with the third value at all. So we just cut the amount of data we have to look at by a third to produce this answer, just by laying out the individual bits in sequential order rather than the actual encoded values.
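Here's a minimal sketch of that vertical scan in C, assuming 3-bit dictionary codes stored as bit-planes, where word plane[b] holds bit b of 64 consecutive tuples, one tuple per bit position. The loop condition is the early exit: once nothing can match, the remaining bit-planes are skipped.

```c
#include <stdint.h>

/* Equality scan over one 64-tuple segment: returns a bitmap with bit t set
 * iff the 3-bit code of tuple t equals the dictionary-encoded constant. */
uint64_t bitweaving_eq(const uint64_t plane[3], uint32_t constant) {
    uint64_t match = ~0ULL;                  /* start: every tuple still matches */
    for (int b = 2; b >= 0 && match; b--) {  /* early exit once match == 0 */
        /* broadcast bit b of the constant across all 64 tuple positions */
        uint64_t cbit = ((constant >> b) & 1u) ? ~0ULL : 0ULL;
        match &= ~(plane[b] ^ cbit);         /* keep tuples whose bit b agrees */
    }
    return match;
}
```

With 128-bit or 256-bit SIMD registers you'd process 128 or 256 tuples per compare instead of 64; the logic is the same.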
This is not quite a compression scheme, right? The compression scheme would be the dictionary encoding; this is really a storage layout for the data. And then they have the horizontal layout, which, again, I'm not going to talk about. I think this idea is pretty novel, and I don't think any database system actually does this, unless Pivotal is doing it now with the Quickstep stuff they bought. And I think the performance improvement you would get is quite significant, because you're cutting down so much on the amount of data you have to look at, and all the comparisons you're doing are vectorized anyway, so that's going to be really fast. So I just wanted to bring this up so you know there's other cool stuff you can do with vectorization.

Okay. So at this point, in 2016, I would say vectorization is absolutely critical for fast evaluation of OLAP queries, and every single major database vendor that claims to support OLAP, whether disk-based or in-memory, will be doing some kind of vectorization. The MySQLs, the Postgreses, the open-source guys, simply can't do this. Another key thing to remember is that the vectorization stuff we talked about today can be combined with all the stuff we talked about before to get even better improvements. So you can have multiple threads executing your query in parallel, splitting the data up so they operate on different chunks. Each of those threads can then execute a compiled query plan, using the code generation stuff we talked about before. And within that compiled plan, it can invoke the vectorized operations we talked about today. So you can combine everything we've covered, none of it is mutually exclusive, and get amazing performance. This is what some of the specialized systems like HyPer can do, and the Impala guys do some of these things too. So I think this is really exciting stuff, and it's why there are still a lot of interesting problems in database research.

Okay. So any questions about vectorization? Yes? Say it again? Does this rely on vectorization? So the question is: is the bit weaving stuff completely dependent on vectorization? And the answer is no, right? You could do this comparison with ordinary scalar code. You'd still cut down the amount of data you have to look at, but now you'd be doing it one value at a time, using bit-shifting operations to compute the answer. But we have SIMD, so why wouldn't we just use it? So his next point is that he's trying to figure out what the contribution here is: is it just the layout of the data, or is it the SIMD stuff? I think it's a combination of the two. Because this is sort of like the bit slicing, byte slicing stuff we talked about before, right? And again, there's the horizontal layout, which I didn't talk about because it's more complicated, but that goes even further, and you can do even more with it than you can with this. This is the basic example where, hopefully, the light bulb goes off in your head and you say, yeah, this makes sense. That's a good point, though. Any other questions?

All right. So two announcements, and I'll post these on Piazza later today. There are actually two database talks this week.
So one of the main developers on the Samza project out of LinkedIn will be here on Thursday, over in the CIC building at noon, and he'll talk about what LinkedIn is doing with Samza and some of their stream processing systems. And then on the 15th, we'll have the CEO of Splice Machine and CMU alum, Monte Zweben, come give a talk. The Splice Machine talk is probably more directly related to everything we've covered in this course, because they're doing HTAP: transactions on HBase plus analytics on Spark, all within one ecosystem in a single database system. Plus he's a CMU alum. Plus there will be pizza at this one, but not the first one. So keep that in mind. Again, I'll send a reminder out on Piazza about these two talks. So on Wednesday, two days from now, on the 13th, we'll do the status updates. Everyone gets five minutes. Everyone should make PowerPoint slides or some kind of presentation; even if your name's Prashant, yes, you should make slides. Afterwards, you can email me your PDF, and we'll keep it, because at the end of the course we'll put a showcase on the website of what everyone has done. I'd like people to share their slides so that people who come along later can look at them and see what you actually built, okay? And then I promised I would get your grades out today. It's grad school and nobody should be worried, right? But I'll take care of that today. All right, any questions? Okay, we're done. Thank you.