All right, so today is the second lecture we're going to have on vectorized execution. In our last class, we spent time talking about the Columbia paper that showed how to vectorize a bunch of different relational operators in a database system. And we said there was this big assumption they make, where they assume everything fits into your CPU caches and they're always dealing with 32-bit integer values and 32-bit integer keys. So now we're going to look at other alternatives for doing vectorization in a database system. Today my part of the lecture is going to be short because I have a flight. I'm going to do the first 30 minutes and talk about bit slicing and bitweaving, and then Prashanth is going to step in and present the ROF paper, since that's his work on applying vectorization inside of Peloton in our compilation engine. He'll finish up with discussion on that. Unfortunately, I also have to take my Surface, so we won't have a recording for that part. That's all right.

Real quick, also: I haven't looked at the pull requests yet. That's my goal for this week, and I'll send everyone feedback and comments about the progress on them. And I think what we're going to do is, once I go through and grade the code reviews, we'll go ahead and close all the pull requests, so that CI isn't building every single time as you continue to make changes. Then at the end of the semester, we'll do this all over again. We'll just close the PR; all the comments and everything will still be there, so there will be a history of things, but you'll start over with a new PR at the end of the semester. Any questions about project three?

So I want to jump back for a second and refresh your memory about bitmap encoding and bitmap indexes, because one of the things we're going to talk about today is alternative ways to represent the database's tuples, different ways to store them, as another way to get better parallelization at the bit level. So before we can talk about bitweaving or bit slicing, we want to go back to the beginning with bitmap encoding. Bitmap indexes are really straightforward. We said that if we have an attribute with a small domain of values, in this case a sex column with only two possible values, male or female, then we can represent this attribute as two separate bitmaps: one for the male value and one for the female value. And then we just have a 1 or a 0 to say whether the tuple at a given position has that particular value. We also talked about before how, if you have a large domain, then a bitmap index or bitmap encoding like this is not going to be a good idea, because, again, you have to have a bitmap for every possible value the attribute could ever take. This basic scheme is called equality encoding: the idea, again, is that you have a single bitmap for every unique value, with a 1 or 0 at each tuple position.
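To make the equality-encoding idea concrete, here is a minimal sketch of a bitmap index over a two-value column. The packed-word layout and the names here are illustrative assumptions, not taken from any particular system:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Equality encoding: one bitmap per distinct value; bit i of a bitmap is
// set iff tuple i has that value.
struct Bitmap {
    std::vector<uint64_t> words;  // one bit per tuple, packed into 64-bit words
    explicit Bitmap(size_t num_tuples) : words((num_tuples + 63) / 64, 0) {}
    void Set(size_t tid) { words[tid / 64] |= 1ULL << (tid % 64); }
    bool Test(size_t tid) const { return words[tid / 64] >> (tid % 64) & 1; }
};

int main() {
    const char column[] = {'M', 'F', 'F', 'M', 'F'};  // the sex column
    Bitmap male(5), female(5);
    for (size_t i = 0; i < 5; i++)
        (column[i] == 'M' ? male : female).Set(i);
    // WHERE sex = 'F' is now just a scan of the 'female' bitmap:
    // female.Test(1) == true, female.Test(3) == false, and so on.
}
```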
But equality encoding is not the only way you can represent this data in a bitmap. Another approach is what's called range encoding, where a bitmap represents a range of values rather than a single discrete value. So you could have one bitmap for all the zip codes in your database that start with 152. Of course, this means that if you want to find the tuples with an exact value, you have to check the bitmap first and then go check the actual tuples to see whether they match your equality predicate. But in exchange, you have to store far fewer bitmaps than one per discrete value.

The other two approaches are less common, but I think they're interesting to discuss, and they'll segue into bitweaving: hierarchical encoding and bit-slice encoding. The basic idea of hierarchical encoding, which I'll show on the next slide, is that you represent a sparse bitmap efficiently by having a tree that tells you whether there could actually be a set bit within a region of the domain. And with bit slicing, which we'll also see, we take the bits of the values' binary representations and store each bit position together contiguously. We'll go through each of these one by one.

So, hierarchical encoding. The basic idea is that the bitmap is represented as a tree, and each node in the tree has a one or zero for each child that says whether anything is set underneath that child. In this case, at the very top, the root has one, zero, one, zero, and each position in that bitmap represents a child. If it's a one, then we know we have something under that child; if it's a zero, then we know we don't have anything there. Here, the second position is zero, so the second child is all zeros, and everything below it is all zeros as well. So if you take the keys listed here, we represent them at different positions, and to say a key exists, we have a one at its position. In this example, I'm showing the zeros for the lower two subtrees and the leaf nodes, but in actuality you don't need to store those at all; I'm just adding them for illustration. Because once you know you have a zero at a node, there's never going to be anything below it, so you can skip the entire subtree.

So this seems like a nifty way to represent a bitmap more efficiently. The downside, and as far as I know the reason nobody actually does this in practice, is that having this tree structure means branching when you're doing lookups, and that's going to hurt your performance in terms of cache misses and things like that. So it's just one way you could store a bitmap differently from the equality encoding I showed before. The advantage, again, is that where we originally would have had to store this as an 8-byte bitmap, we can now encode it as a 4-byte bitmap.
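Here is one way the pruning idea behind hierarchical encoding could look in code: a simplified two-level sketch of the concept, not the exact layout from the slide, where each root bit says whether the corresponding leaf word contains any ones, so all-zero subtrees are neither stored nor scanned (assumes a GCC/Clang popcount builtin):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Two-level hierarchical bitmap over up to 64 * 64 = 4096 positions.
// Root bit c is set iff leaf chunk c contains at least one 1; all-zero
// chunks are simply not stored.
struct HierarchicalBitmap {
    uint64_t root = 0;
    std::vector<uint64_t> leaves;  // only the non-empty chunks, in order

    bool Test(size_t pos) const {
        size_t chunk = pos / 64;
        if (!(root >> chunk & 1)) return false;  // prune: empty subtree
        // Rank query: the number of non-empty chunks before this one tells
        // us which stored leaf word to look at.
        size_t slot = __builtin_popcountll(root & ((1ULL << chunk) - 1));
        return leaves[slot] >> (pos % 64) & 1;
    }
};
```

The `Test` path also shows the downside mentioned above: every lookup branches on the root bit, which is exactly the kind of control flow that hurts on modern CPUs.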
The other encoding scheme, which I think is more interesting and will be relevant when we start talking about vectorization, is bit-slice encoding. The way to think about bit slices is that, in the same way we took a row store, flipped it on its axis, and stored it as columns, we're going to do the same thing here. But instead of taking the entire tuple, or the attributes within a single column, and flipping that, we're going to take the bits that represent the values within a column and flip them to be stored vertically. So I'll go through an example. In our simple table, we have the seven zip codes for all the locations I've lived in throughout my life. And this is real. So I grew up in Maryland, 21042. I spent some time in Compton. And now we're here in Pittsburgh, 15217.

So let's take 21042. The way we represent this with bit slices is that we first generate the binary encoding of the value; if you take 21042 and run it through the bin function in Python, you get a bit string like this. Then we're going to have a bit slice for every single bit position in our values. Zip codes here need 17 bits, so we're going to have 17 bit slices, plus one additional slice to keep track of whether the value is null or not. Now, when we start populating this, say we take this cursor here: as we scan across the binary string, we set an entry in the corresponding bit slice for every position. We then do this for all the other values, and now you should be able to see what I meant about storing the bits in a columnar manner. Each slice is a contiguous block of memory, so, for example, the first bit of every single value is stored contiguously in the first slice's bitmap.

The reason we want to do this, and what makes it really interesting, is that there are some nifty tricks we can use to speed up predicate evaluation when we execute queries over data stored this way. Let's say we have a simple select statement: SELECT * FROM customer WHERE zip_code < 15217, my home zip code here in Pittsburgh. The way we execute this query is that we walk across each slice and construct a result bitmap that says whether the tuple at each position in the bit slice still satisfies our predicate, zero or one. Then, when we look at the next bit slice, we only have to check the positions that were still viable after the last slice we checked. So take just the first three bit slices. We know the first three bits in the binary representation of 15217 are all zeros. Since we're looking for values less than 15217, any tuple that has at least one 1 in its first three bits has to be greater than what we're looking for, and we can ignore it entirely. So we can skip those entries: we have a one here, skip it; a one here and a one here, skip those. Now, when we look at the rest of the slices, we only need to look at the three tuples that had all zeros in their first three bits. This is what's called early pruning: by identifying what the possible matches could be as early as possible as we go across the bit slices, we recognize that the tuples represented here, here, and here could never be less than the constant, and we skip them entirely for the rest of the scan.
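Here is a sketch of that early-pruning scan over bit slices, for the predicate v < C. The layout is simplified to one 64-bit word per slice (so at most 64 tuples), slices ordered from the most significant bit down; the function name and layout are my assumptions, not code from the lecture:

```cpp
#include <cstdint>
#include <vector>

// slices[0] holds the most significant bit of every value; bit t of each
// word corresponds to tuple t. Returns a bitmap of tuples with value < C.
uint64_t LessThanScan(const std::vector<uint64_t>& slices, int num_bits, uint32_t C) {
    uint64_t lt = 0;      // tuples already known to be less than C
    uint64_t eq = ~0ULL;  // tuples whose bits so far equal C's prefix
    for (int i = num_bits - 1; i >= 0; i--) {
        uint64_t s = slices[num_bits - 1 - i];  // slice for bit position i
        if (C >> i & 1) {
            lt |= eq & ~s;  // constant bit 1, value bit 0: value is smaller
            eq &= s;        // only values with a 1 here can still be equal
        } else {
            eq &= ~s;       // constant bit 0, value bit 1: value is bigger, pruned
        }
        if (eq == 0) break; // early pruning: no tuple's outcome is undecided
    }
    return lt;
}
```

For C = 15217, whose top three bits are all zero, the first three iterations take the `else` branch and knock out every tuple with a 1 in those slices, which is exactly the pruning described above.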
Yes? The question is why the zip code gets broken up into 17 separate bit slices rather than being stored as its digits. So let's go back here. We're taking this first number, 21042, and converting it from a decimal representation to a binary representation, which gives us this binary string. Then every single bit in that string corresponds to a different bit slice; it's just those values going across. And what's the first column? The first column is whether the value is null or not. So again, at this point, when I'm looking at just the first three bit slices, I'm looking at all of the values, but only at their uppermost bits. And I know that if a value doesn't satisfy the predicate based on its uppermost bits, it doesn't matter what comes afterward, so I can skip any evaluation of it later on. Yes? Does that mean you need to maintain a bitmap that says, at this offset, the tuple didn't evaluate to false on the last bit slice, and therefore whether to keep evaluating it going forward? Yes, that's the result bitmap we carry along. And what about negative numbers? Oh, negative, sorry. Yes, for negative numbers you would have to handle the representation differently; this is assuming the values are not two's complement. These are unsigned.

All right, so now the other cool thing you can do with bit slicing is that it also makes it really efficient to compute some aggregations. To compute a SUM, all you need to do is count the number of ones in each slice and then multiply that count by the power of two corresponding to that slice's position. So you take the most significant slice, count all the ones, and multiply by 2^16, which converts it to the correct magnitude; then you do the same for the next slice, multiplying by 2^15, and so on. You just add these values together, and that produces the correct sum. Now, the reason this is interesting is that it's actually really cheap to count the number of ones in a bitmap or a bit vector. Does anybody know what that's called? Popcount? Well, that's the instruction, but what is it called in algorithms? It's the Hamming weight. There's a single instruction that Intel added to do this, so you just hand it a word and it produces that count for you. So this is another good example: if you can represent the data in a different way, you can leverage these hardware instructions to do fairly efficient computations. And if you read the Wikipedia page for Hamming weight, pretty much every single architecture has a population-count instruction that's very efficient.
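And here is what the popcount-based SUM looks like under the same simplified one-word-per-slice layout as the scan sketch above (again an illustrative sketch, not code from the paper):

```cpp
#include <cstdint>
#include <vector>

// slices[0] is the most significant bit slice. Each slice for bit position p
// contributes (number of ones in the slice) * 2^p to the sum.
uint64_t BitSlicedSum(const std::vector<uint64_t>& slices) {
    int num_bits = static_cast<int>(slices.size());
    uint64_t sum = 0;
    for (int i = 0; i < num_bits; i++) {
        // __builtin_popcountll is the Hamming weight; on x86 it compiles
        // down to the POPCNT instruction mentioned above.
        uint64_t ones = __builtin_popcountll(slices[i]);
        sum += ones << (num_bits - 1 - i);
    }
    return sum;
}
```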
All right, so the next thing: now that we understand bit slicing, there are some observations we can make about the inefficiencies we saw in the Columbia paper, and there are ways to overcome them by using something that looks a lot like bit slicing to store our data in a manner that's more amenable to vectorization.

The first issue we had from the last class was, again, the big assumption that everything is 32-bit integers with 32-bit pointers to tuples. If you go beyond that, or if you're compressing data so that the encoded values are not exactly 32 bits, this becomes problematic, because the values don't fit cleanly into our SIMD register lanes for the computations we want to do. It means that before we take data out of memory or the CPU caches and put it into our registers, we have to do extra work to transform it and make sure everything is aligned nicely.

The other issue, and what we're going to try to solve next, is that last class we tried to choose vectorized algorithms that kept all the lanes utilized at every step, so we weren't computing things we already knew the answer to; we weren't wasting work. But just because all the lanes are fully utilized doesn't mean all the bits within a single lane are fully utilized. So now we're going down to an even lower level. Here's what I mean: say we have two vectors and we want to do an equality comparison using SIMD, producing a bitmap of zeros and ones to say which pairs match. Within a single lane, each value is represented in binary form like this. And just as we saw with bit slicing, the first few bit comparisons are doing useful work while the two values still agree, but once we hit the point where the top value differs from the bottom one, it doesn't matter what comes after: we know they can never be equal, so everything past that point is essentially wasted work. And this happens because of the way we normally write code. We write an if clause asking whether two values are equal, but under the covers, some of those bits could be used for other things, because once we see the values don't match, everything else is wasted work. That's the problem we're going to try to solve with bitweaving.

Bitweaving is an alternative storage layout for DSM, or columnar, databases, and the idea here is that we're going to organize the data in a way that makes it really efficient to do predicate evaluation on compressed in-memory data. We don't have to use SIMD for any of this, but SIMD is going to let us get really good performance. At a high level, we're always going to use order-preserving dictionary encoding for all our data (there's a small sketch of this below), and then we're going to get what's called bit-level parallelism within a single SIMD lane, to make sure every single bit in a SIMD register is always doing something useful. And then the other cool thing about this is that it relies only on common instructions. In last class's paper, scatter and gather were the key primitives used to make everything work, but as we said, the Xeon CPUs available to them in 2015 didn't actually have those SIMD instructions, so they had to emulate them to overcome that limitation.
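As a quick illustration of the order-preserving dictionary encoding that bitweaving assumes, here is a minimal sketch: codes are assigned in sorted order of the distinct values, so comparisons on the small fixed-width codes give the same answers as comparisons on the original values. The function names are hypothetical:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Build a sorted dictionary and encode the column against it. Because the
// dictionary is sorted, code(a) < code(b) exactly when a < b, so range
// predicates can be evaluated directly on the compressed codes.
std::vector<uint32_t> Encode(const std::vector<std::string>& column,
                             std::vector<std::string>* dict) {
    *dict = column;
    std::sort(dict->begin(), dict->end());
    dict->erase(std::unique(dict->begin(), dict->end()), dict->end());
    std::vector<uint32_t> codes;
    codes.reserve(column.size());
    for (const auto& v : column)
        codes.push_back(static_cast<uint32_t>(
            std::lower_bound(dict->begin(), dict->end(), v) - dict->begin()));
    return codes;  // e.g. WHERE v < "15217" becomes code < rank("15217")
}
```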
So this paper was written back in 2013, so it's even older than the Columbia one, and they definitely didn't have scatter and gather back then, but these algorithms don't require those instructions at all. This work was done in the context of a storage engine, a database engine, at the University of Wisconsin called Quickstep. Quickstep became an Apache incubator project in 2016, so it is an open source thing that people are using. The thing I'll say, though, is that according to the authors of the paper, the Quickstep version you get from Apache does not contain any of the bitweaving stuff, even though bitweaving was originally written in that context. The work was spearheaded by Jignesh Patel, pictured here, and he's the hardcore database systems expert at the University of Wisconsin. In my opinion, he's up there with the HyPer guys as one of the best database researchers in the world, so I really think this stuff is great. I'll say also, Jignesh is a badass. He looks very unassuming, but he has told me crazy stories about beating people up on buses when he was in elementary school, going to class in India. I'll stop there, okay.

So there are two layouts we're going to have in bitweaving: the first is horizontal encoding and the second is vertical encoding. Again, all of this is still in the context of a column store. We're talking about the encoding of a single column, but within that column we can have a row-oriented layout or a column-oriented layout, and this is at the bit level. We'll go through each of these. Storing the data in these different manners is going to allow us to get a bunch of optimizations that we would not normally be able to get if we just stored the values in their whole form, all the bits one after another in a single column.

For horizontal storage, we'll use an example with 10 tuples, and we're going to break them up into two segments. In this example I'm encoding all the values as three bits and using eight-bit processor words, because that's what the paper uses to simplify the explanation; there's nothing here that has to be exactly three bits, and you could use more bits and larger processor words. So these are just the encoded values I've listed here: with three bits there are 2^3, or eight, distinct values our tuples could have. And as I said, we're going to break the data up into segments, which is just done to organize the data internally. For the first segment, we store each tuple's bits horizontally, contiguously within a single processor word. Each of these vectors corresponds to a processor word, in this example an eight-bit word. And the way it's organized is that the first tuple appears at the first position of the first word, the second tuple appears at the first position of the second word, and so on: zero, one, two, three, and then you jump back up to the first word and start over at the second position with four, five, six, seven. This will matter later on when we see how they actually combine the results of these evaluations together.
It actually makes the combining step work out if you organize it this way, starting at the top, going to the bottom, and then jumping to the next position within each word, rather than filling words left to right as zero, one, two, three. Every tuple in this encoding is also going to have a one-bit delimiter, represented by a zero here. Think of this as padding that marks where one tuple ends and the next begins. What we'll see later on is that when they do predicate evaluation, this delimiter bit is where they store the result of the predicate. So rather than having another bitmap stored somewhere else that says whether something evaluated to true, they store it right in here. There is a version in the paper that doesn't use this extra bit, so you can store the data without the padding, but then you end up having to do more computation to get the result into a form you can actually use to continue query processing in other parts of the system.

All right, so let's see how we actually query this. We'll use a simple query, SELECT * FROM table WHERE value < 5, and we'll deal with just one word from the previous slide, the one holding tuple zero and tuple four. The second operand is just the binary encoding of our predicate constant broadcast into each field. The way we evaluate this is that we compute a mask that corresponds to what our predicate is trying to do, and then we have this formula: take X XOR the mask, add Y, and then AND the result with the negation of the mask. This formula is specific to this particular predicate; there's a different formula for equality, or greater than, or greater than or equal to. The formula produces a selection vector that tells us, at each delimiter bit, whether the tuple in that field evaluated to true. In this case, the first delimiter is one because the first value, binary 001, decimal one, is less than five. And the second is set to zero because six is not less than five, so that one doesn't match.

The crazy thing about all of this is that nothing I'm describing here even needs to use SIMD at all. You could write this with scalar SISD instructions and it would still be really efficient, because these are low-level bit operations that the hardware can do very cheaply. So note that I haven't even gotten to SIMD yet; you can do all of this in SISD. And here we're dealing with eight-bit processor words and three-bit values, but if you had a 64-bit processor word, you could store sixteen three-bit values in there, and the same formula would process all of them in parallel. So this is another way to get vectorization without actually relying on SIMD. What I'll also say is that these formulas were not invented by the Wisconsin people; this comes from Leslie Lamport's work from the 1970s, before CPUs even had SIMD. So again, it's an old idea applied in the context of new hardware.
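Here is the word-at-a-time less-than evaluation as a sketch, scaled up to the 64-bit case mentioned above (sixteen 3-bit codes, each with a 1-bit delimiter). The constants and the helper are mine; the formula is the one from the slide:

```cpp
#include <cstdint>

// Each 4-bit field is one delimiter bit (the high bit, initially 0)
// followed by a 3-bit code. kValueMask has 1s in the code bits of every
// field; kDelimMask has 1s at the delimiter bits.
constexpr uint64_t kValueMask = 0x7777777777777777ULL;  // 0111 repeated
constexpr uint64_t kDelimMask = 0x8888888888888888ULL;  // 1000 repeated

// Broadcast a 3-bit predicate constant (code <= 7) into every field,
// e.g. Broadcast(5) for the predicate "value < 5".
uint64_t Broadcast(uint64_t code) {
    uint64_t y = 0;
    for (int i = 0; i < 64; i += 4) y |= code << i;
    return y;
}

// Z = ((X XOR mask) + Y) AND ~mask. Per field, X XOR 0111 gives 7 - x, and
// (7 - x) + y carries into the delimiter bit exactly when x < y, so each
// delimiter bit of the result holds that field's comparison outcome. The
// delimiter bits of both operands are 0, so carries never cross fields.
uint64_t LessThanWord(uint64_t X, uint64_t Y) {
    return ((X ^ kValueMask) + Y) & kDelimMask;
}
```

With the values 1 and 6 packed into X and compared against Broadcast(5), the field holding 1 comes out with its delimiter bit set and the field holding 6 does not, matching the example on the slide.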
So now let's see how to do this with a bunch of words at the same time using the horizontal scheme. For this, we do all the same steps from the last slide on each word, and that produces four different result vectors, where the delimiter positions tell us whether each tuple matched, zero or one. But now we need to combine these selection vectors into a form that's actually usable by us, so we can figure out which tuples pass on to the next part of our query plan. We could do the most naive thing and have a for loop that scans over every single bit and checks whether it's set or not. But instead we can do very simple bit-shift operations to slide each delimiter bit into the position that corresponds to the tuple it represents. This is why they go from top to bottom when they lay out the tuples, rather than left to right. Once all of these bits are in their correct positions, it's another cheap operation to OR everything together and produce a selection vector that tells us whether the tuple at each position matched or not. Again, it's using low-level bit operations to do more complicated things, which is why I really like this paper.

Now, the selection vector that came out of this is, as I said, basically a bitmap that says whether the tuple at each position evaluated to true or not. In the example here, I showed that you can combine everything together and that tells you whether something is true. But you still need to scan through the selection vector, find all the bits that are set to true, and figure out what offsets they're at, so you can go look up the actual tuples you want. That can actually be expensive to do. An alternative approach is to pre-compute a position table that is indexed by the selection vector. This is a technique that I think came from VectorWise, and Peter Boncz uses it in his work as well. The basic idea is that we know our selection vector here corresponds to eight bits, so there are only 2^8, or 256, different possible values it could take. All we need is a lookup table that, for every single possible value, gives us the vector of offsets corresponding to its set bits. So this is the selection vector our operation generated before; in decimal form it maps to 150, and at entry 150 the position table has a vector that says there's a one at positions zero, three, five, and six. This is a more efficient way to do it than iterating with a for loop.
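Here is a sketch of that precomputed position table, using the MSB-first position numbering from the slide. The structure is illustrative; a real engine would likely store the offsets in a flat, padded array:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// For every possible 8-bit selection vector, precompute the offsets of its
// set bits (position 0 = most significant bit, matching the slide).
struct PositionTable {
    std::array<std::vector<uint8_t>, 256> offsets;
    PositionTable() {
        for (int sv = 0; sv < 256; sv++)
            for (int pos = 0; pos < 8; pos++)
                if (sv >> (7 - pos) & 1) offsets[sv].push_back(pos);
    }
};

// Usage: the selection vector 10010110 from the example is decimal 150,
// and offsets[150] = {0, 3, 5, 6}, the matching tuple positions.
```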
Yes, question? Going back to the previous slide: wouldn't you end up generating a really large position table? No: in this case we have an eight-bit selection vector, so there are only 2^8, or 256, possible values. It's not that big; it'll easily fit in something like the L2 cache.

All right, so for vertical storage, instead of storing all the bits for a single tuple contiguously, we're instead going to store all the bits at a given slice contiguously. So in this case, in our first segment, a single processor word holds every tuple's bit for the first position, the next processor word holds every tuple's bit for the second position, and the same thing for the other segment. And here you see it actually doesn't matter how many tuples are in a single segment; we always use the same number of words. It's just known that there are no tuples at the unused positions, so those bits are set to zero.

So let's see now how to execute a query on this. Say we want to do a lookup on key = 2, which we can represent in binary form as 010. What we'll do is take the first word and do a SIMD compare against a vector of all zeros, because the constant's first bit is zero, and that produces a selection vector that says which positions hold tuples that still match. Then when we move on to the second position, we need to know which tuples already failed on the first position. We have to carry that information over, so that a tuple that was false in the previous slice can't evaluate to true in the next one, because we need to AND all of these results together. So we end up doing a compare across the vectors and combining as we go, and that produces the correct answer. And at this point, although we have one more position to look at, all the positions in our selection vector are zero, which we can detect very cheaply with the Hamming weight. So we know there's no point in even looking at the next position: if everything is false here, then everything will be false in the next one too. So we can do early pruning and cut the whole thing off. Yes? The question is: since I know some of the tuples already evaluated to false, do I still have to compare their bits the next time around? In this case, yes; the whole word still gets compared.
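A sketch of that vertical evaluation, again simplified to one 64-bit word per bit position (the paper segments the column and uses full SIMD registers; the function name here is mine):

```cpp
#include <cstdint>
#include <vector>

// BitWeaving/V equality: slices[i] holds bit i (MSB first) of every code
// in the segment. Returns a bitmap of tuples whose code equals C.
uint64_t EqualScan(const std::vector<uint64_t>& slices, int num_bits, uint64_t C) {
    uint64_t match = ~0ULL;  // every tuple is a candidate to start
    for (int i = 0; i < num_bits; i++) {
        uint64_t c_bit = (C >> (num_bits - 1 - i)) & 1;
        // A tuple survives only if its bit agrees with the constant's bit;
        // the AND carries forward the misses from the earlier slices.
        match &= c_bit ? slices[i] : ~slices[i];
        if (match == 0) break;  // early pruning: nothing can still match
    }
    return match;
}

// For key = 2 (binary 010, num_bits = 3): slice 0 is compared against
// zeros, slice 1 against ones, slice 2 against zeros, as in the example.
```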
All right, so we'll finish up real quickly. There's one graph I want to show you, because it clearly explains everything that's going on. They have a simple micro-benchmark with a query that does a COUNT(*) on a table with a less-than predicate. The database is derived from TPC-H, so it has a billion tuples with a uniform distribution, and the selectivity of the predicate is 10%. The reason they do a COUNT(*) is that it doesn't have to materialize any output tuples; the result is just a single number that says what the count was. So this is going after the bare-metal performance you can get from evaluating predicates in the engine. This one graph shows a comparison of BitWeaving/V versus BitWeaving/H, along with a naive, non-vectorized sequential scan of the table, and a basic vectorized sequential scan where you gather tuples together and use SIMD instructions to do the evaluation.

Right off the bat, you see that you get about a 2x performance improvement from the SIMD scan over the naive scan, just based on the fact that you can parallelize things with SIMD instructions. Then you get another roughly 2-3x performance difference between the bitweaving approaches and the regular SIMD scan, and this is because you have fewer cache misses, since you end up processing less data. Then, as you get to larger and larger numbers of bits used to encode the data, you see a slight difference in performance between the vertical and the horizontal schemes. This is attributed to the fact that the vertical scheme can do early pruning, so it ends up looking at less data and therefore takes fewer cache misses. And as far as I know, this little blip over here is where we go from 31 bits to 32 bits: with 32 bits plus the padding bit, the horizontal approach has to represent each value as 33 bits, and that doesn't align nicely. Yes? The question is what the y-axis measures. Oh, sorry, the y-axis is the number of cycles per tuple. If you have more cache misses, it takes more cycles; if you measured in terms of instructions instead, you would miss the cache-miss penalty.

Okay, so to finish up. What we'll see next with Prashanth is that, so far, we've been talking about these different topics for improving query execution performance in isolation from each other. We spent time talking about query compilation, and then we spent time talking about vectorization. But how you actually combine these two concepts together to get the overall best performance is not something that's been covered in what we've talked about so far. What's coming next, in Prashanth's work, is that in order to apply these things together, there is some actual work you have to do. It's not just a matter of taking query compilation and saying, oh, I'm going to vectorize things; you have to be much more careful about how you stage your operations and instructions when you process things. And I would say the bitweaving and bit-slicing work follows the same principle as compilation: there, we have our engine generate code in a way that is not the simplest for humans to understand; here, if we encode our data in a way that is not so easy for us to understand, but is the best form for the CPU to actually crunch on, then we can end up getting better performance. Okay, any questions?