As I said, query compilation from last class and vectorization are the two main methods we can apply in a modern database system to improve query performance. So I'm going to first talk about what vectorization is and how we're actually going to use SIMD. And then the paper you guys were assigned to read was this sort of recipe book or guide from the guys at Columbia on how to take classic database algorithms — the various things we need to do inside our data system — and implement them using SIMD. And I like it because it covers all the various things we actually need in a database system to run analytical queries. The spoiler, I'll just say up front, is that none of it actually works, with some exceptions, because they're going to make a big assumption about their operating environment, which I'll cover when we get to it. And then we'll finish up talking about project three topics, okay. So vectorization is the process or method we're going to apply in our database system to take an algorithm that was originally implemented assuming scalar operations — where you take one piece of data and apply one change or modification or operation to it — and convert it to take a vector of data items instead, whether they're tuple values or tuple pointers; it depends on the algorithm. And then we can invoke a SIMD instruction that allows us to apply the same modification or the same operation on that vector of items within a single instruction, right. We'll see, though, that it's not free: there is some prep work where we have to put things into these special registers and then apply the vectorized instruction. So it's not like we can magically make anything in our code become vectorized. So again, the paper you guys were reading was about how to take very specific algorithms or components of the data system and run them in parallel. It's sort of obvious why we'd want to do this, right, because now we can do more work with fewer instructions. But one of the big advantages we're also going to get is that this is independent of all the things we talked about before in parallel query processing, because that was all about how to take a query — or whatever you want to do in our database system — and divide the work up across multiple threads. Now we're saying in this lecture: how do we take a single thread and make what it's doing actually run in parallel, right. And the speedup you can potentially get is massive, because it's multiplicative. Say I have an algorithm that I can run on 32 cores or 32 threads, so I get a 32x speedup if I parallelize it. But now if I'm also doing vectorization on the work each thread is doing, and say I have a four-lane SIMD register — meaning I can operate on four data items in parallel with one SIMD instruction — now I also get a 4x speedup there, and 32 times 4 is 128. So potentially I could go from a single-threaded scalar implementation of an algorithm to something that runs with a 128x speedup.
Now the problem is going to be, as I said to him earlier before class started, that we are never actually going to achieve this maximum speedup — this is our upper bound — and we'll see why as we go along: again, we have to prepare things and get data in and out of registers, and that's why we're never actually going to achieve it. But this is a good target, something we definitely want to try to achieve. So who here has taken 16, 14, 18? Alright, a little over half. Alright, who here has never seen SIMD before? Okay, that's fine. So SIMD is a class of CPU instructions that the processor provides for us that allows us to do these vectorized operations. We contrast this with SISD, which is single instruction, single data item; SIMD is single instruction, multiple data items together. And this notion of SIMD versus SISD comes from Flynn's Taxonomy of parallel systems, from the 1960s. Every instruction set for every CPU nowadays is going to have support for SIMD operations; just the name of what they call this class of extensions or instructions is going to be different. The paper you guys read was all about Intel stuff, and in their world it's SSE or AVX — the paper is from 2015, I think, so they were dealing with AVX2, which is 256-bit registers. The state of the art now is 512 bits. POWER from IBM has this thing called AltiVec, ARM has this thing called Neon, and then a few years ago ARM actually proposed something called SVE, the Scalable Vector Extension. What's actually really cool about SVE — and I don't know if anybody actually implements it yet — is that the way they set up the instructions, they're not specific to any register size. Meaning they have a generic class of vectorized instructions, and then depending on how it's implemented in the actual CPU, you can operate on, say, 1024 bits or 512 bits at a time. Whereas in all this Intel stuff, AVX2 instructions are for 256-bit SIMD registers and AVX-512 instructions are for 512-bit registers, and you can't just take that code, run it on a newer CPU, and think you're going to use the wider registers — you have to rewrite your code to actually be aware that you're running on larger registers. So let's look at a really simple example of a SIMD operation. Say we have two vectors x and y, and we just want to add the elements of x to the elements of y at each offset. The way we would implement this in an intro CS class or intro database class is just a for loop: we know the size of the vectors (assuming they're the same size), and for every element of x and the corresponding element of y, we add them together and write the result into the output vector z. The way this gets executed with SISD instructions is just ripping through the for loop, taking one element from x and one element from y, invoking one instruction to do the addition, and then writing it to our output buffer.
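To make that concrete, here's a minimal sketch of the scalar loop being described — the function and variable names are mine, not the slide's:

```c
// One element of x and one element of y per iteration:
// one load, one add, one store -- all scalar (SISD).
void add(const int *x, const int *y, int *z, int n) {
    for (int i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}
```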
Now with SIMD, what we can do is take a vector of elements from each of these two arrays, combine them into registers, and then invoke one SIMD instruction to add them together and produce our output. So here we take four elements at a time — we'd say this is a four-lane register — and assuming we have 32-bit integers, we'll say this is a 128-bit SIMD register. Once we populate the two registers, we invoke the SIMD instruction on them, and it writes the result out to another SIMD register of the same size. Do the same thing for the next four elements, invoke the SIMD instruction again, and now we've produced our output. Yes? The question is: is loading into a SIMD register just as fast as loading into a regular single-data-item register? It depends on where the data is located. If you're doing that selective-store thing we'll talk about, that's definitely slower, because it spans multiple cache lines — it's more work. I think the way x86 works is that you can only do one or two loads or stores per cycle, so if the register is really wide you may have to do multiple loads across multiple cycles, whereas writing one thing into a single-data register is super fast. And that's what I mean when I say we're never going to achieve the theoretical maximum speedup — there's overhead to putting things into these registers. But certainly in this case it would be a really good trade-off: we went from eight instructions to do the additions down to two, and that's probably going to be a win for us. Yes? The question is: if we go to 512-bit registers with AVX-512, that hits the cache-line size — does that mean the loads into the register get more expensive, and would the stores to get the data back out also get more expensive, since the data can span multiple cache lines if it's not cache-aligned? It depends on where the data is. Also, think of it like this: the data is sitting in L1, and L1 on x86 is 32 kilobytes, so we can take all the data we want to process and keep it in L1 — it's not like we're going out to memory to get it. Alright, so what can we do with SIMD? Well, I showed some basic arithmetic operators. And again, at least for x86 — and I think SVE has the same property — the size of the register is fixed, but that doesn't necessarily mean the size of the lane has to be fixed. In my example here I had 128 bits, and therefore a SIMD instruction that could take four 32-bit integers and add them together; but if I had 16-bit integers, there might be another instruction they provide that knows how to take eight 16-bit integers and add them together. So typically when you write SIMD code, you specify what data type size you're operating on. We talked about how to move data in and out, and then there are all these logical instructions to do comparisons and bit manipulation.
We can do comparison instructions — we'll need those to evaluate predicates, to say whether something equals something in our WHERE clause. Shuffle instructions are the ability to take the contents of one SIMD register and write them into another SIMD register, and this is going to be a big win for us, because now we don't have to take the output, put it in L1, and then write it back into another register — we can go directly from one register to the next. And then there are the other things we'll see in a second: conversion instructions that take data sitting in L1 and convert it into the form the SIMD register wants; and, if we have something in a SIMD register and we want to write it out to memory but don't want to pollute our CPU cache, these streaming instructions allow us to take the output directly from the SIMD register and put it right into memory without going through the normal caching hierarchy. So the main idea of what we're going to try to achieve here is: with all these instructions, we pack as much useful data as we can into our SIMD registers and do as much processing as we can on that data while it's in the SIMD registers, before we have to shove it out to memory. And these conversion and shuffling operators especially are going to allow us to do that — we can chain some operations together, never touch the CPU cache, and then produce the results that we want. This last one, the streaming stores, is also super important, especially for joins: if I know a tuple matches in my join, but there's a pipeline breaker above me and I can't do anything with the output of the join yet, I can shove it out to memory, because I know I'm not going to need it again until I come back for the next pipeline. So these streaming instructions allow me to use the data that's in the SIMD register and then put it out into memory before going back to it. We'll see after spring break, and after the project proposals next class, how we can use some of these techniques in our join operations as well. Yes? The question is: these streaming stores sound very nice — are they available for non-SIMD operations? Yes, I think so — streaming writes, I think that's what they're called on x86. Okay, so again we're going to focus on x86, because that's the dominant CPU architecture, and this is just showing you the history over time of how Intel has expanded its support for SIMD. At the very beginning it was called MMX, and this was super primitive — this is like Pentium II or Pentium III days — and there was this big marketing pitch around it. I think MMX actually didn't stand for anything: Intel was afraid of getting sued, so they just took three letters and put them together. Now people think of it as "multimedia extensions," but at the time I think there was a lawsuit where some other company claimed Intel took the MMX name from them, and then Intel produced all this internal documentation showing, oh, it's three random letters, it means nothing — so that company lost the lawsuit. But at the very beginning it was super primitive: it could only do some basic operations on 32- or 16-bit integers.
And in this early version, the CPU couldn't execute SISD and SIMD work at the same time: when you executed a SIMD instruction, you had to pause the SISD stuff, do the SIMD, and then start the SISD back up afterward. After MMX, when SSE came out, you could actually do these in parallel, and this is what I was saying before about how, with an out-of-order CPU architecture, we can have different parts of the CPU doing different things at the same time — we can be executing SIMD instructions on the SIMD registers while we do stuff on the regular CPU registers. With MMX you couldn't do that. And then the modern era started when AVX came out — I think that was just a naming change as they went from 128 bits to 256 bits — and then instead of calling the next one AVX-3, which would make sense after AVX and AVX2, they called it AVX-512. And I've looked online, and as far as I can tell they have no plans to put out 1024 bits, so this is sort of where we're at right now. And this link here will take you to an awesome video from James Reinders, who was a SIMD designer or evangelist at Intel, and he basically shows all the awesome things you can do with SIMD. He's not a database person; he's just showing you how to actually flex the hardware and get the best bang for the buck. I highly recommend this video — I think it's about an hour long. Alright, so there are some trade-offs. Obviously SIMD seems like a magical thing we'd always want to use, and yes, in some cases we'll see we get significant performance gains if we use it. But the tricky thing is that actually implementing an algorithm to use vectorized instructions is not trivial, and as I said, in some cases it's actually going to hurt you: the Columbia paper makes certain assumptions about the environment, and in a real database system their assumptions don't hold, and then SIMD will hurt you. We'll cover what that is, okay. And again, there's the issue of getting things in and out of the registers — the reason we may not always get the speedup we'd want to achieve. Now, part of the reason this is going to be tricky is that there's no magic flag in the compiler that will take all our database system source code and be able to vectorize everything. For simple things, maybe, but for the more complex things we do to process queries it's just not going to happen. So if we want to have vectorization in our data system — and people pay you money to go do this, because it's hard — let's see how we can actually achieve it. The three ways are: automatic vectorization by the compiler; passing hints to the compiler to tell it what we actually want; and writing our source code with explicit vectorization using CPU intrinsics. The way to think about this is that the easiest one to use is the one at the top, because that's just hoping the compiler figures it out; the one at the bottom is the hardest to use, but we get complete control over what's going into our registers, what's coming out, and what instructions get executed. So: easier to use versus better control, but harder to write, okay.
Alright, so automatic vectorization is when the compiler can identify chunks of source code where the kernel of the main operation inside a loop can be converted into a vectorized instruction. For really simple loops this is easy to do, but the problem is that simple loops are not very common in the main thing we want to speed up, which is query execution. And obviously this requires your hardware to have support for SIMD instructions, but pretty much every modern Intel chip has that today — whether it's 256-bit or 512-bit; I don't know what the laptops have, probably AVX2, but any modern Xeon you buy today will have AVX-512. Alright, so let's look at an example. Here we have a for loop, and this is like the vectorized primitives we talked about before, where you have predefined source code for the basic operations you need to process data when you execute a query. So what is this doing? It's taking three pointers — x, y, and z, the same thing we showed at the very beginning — and iterating over every element of x and every element of y and writing their sum into the output buffer z. My question to you guys is: is this something the compiler can automatically vectorize? You're shaking your head yes — why? What's that? He's saying unroll the loop — but unrolling by itself is not using SIMD instructions. Alright, so he's saying that after you unroll the loop — say I recognize I have four-lane SIMD registers, so I unroll it four times — then I can vectorize that. Who agrees or disagrees? Bingo, that's it. It is not legal to automatically vectorize this, because — and he was exactly right — you don't know what these pointers are at compile time, and it may be the case that they actually point to the same chunks of memory, so now you have unpredictable side effects when you start doing your computation. Yes? You said you could do global analysis to prove that they're different — but that's not something you can do statically at compile time. Afterwards, sure: you can run your database system with this function and check it, and you may never see the case where these actually point to the same thing, but you don't know that, because you don't know whether you've seen all possible inputs to the database system. Yes? You're basically asking: can you do optimistic vectorization, where the compiler inserts some kind of magic here — do the vectorized version, and also do some analysis on what these pointers are to see whether anything would have gotten stomped on inside the vectorized version; if no, go ahead and keep the result, and if yes, go back and do the scalar version. Or, for that matter, just check whether they're different pointers — make sure they're not overlapping in memory or something like that — and if that check passes, do the vectorized version. The problem is — think about what you're saying: I'm showing this size as a global variable, but you'd have to pass it into this function so that you'd know, alright, I'm going to loop from zero to MAX for each of these things, and then do all that checking. That sounds like a lot of work.
Yes — if you knew all possible inputs ahead of time? Let's talk about that first: I don't actually know that you could claim that. But the really simple issue is this: if z is just the memory location of x plus one, what's going to happen here? I take x and y, add them, and write the result into z — but now when I come back around for the second iteration, I'm clobbering the second element of x, and that's going to produce completely incorrect results. We want to vectorize this such that the output of the vectorized version is exactly the same as the scalar version. The reason this is difficult for a compiler to handle is just the nature of how we write C and C++: we write this code in a way that describes the computation we want in sequential terms — iterate over one element at a time, take two numbers, add them together, write the result into this buffer. C just isn't set up to provide the right hints for the compiler to recognize that it could vectorize this; you'd have to do what he's proposing, or the next thing I'm describing: compiler hints. So again, the main takeaway is that GCC and Clang are not going to be able to vectorize that much. ICC, Intel's proprietary compiler, is much better at this, but even then, in many cases it's not going to be able to do it, because it doesn't know what's going on here. Alright, so what can we do? We can provide compiler hints to tell the compiler that we know a piece of code is safe for it to execute in a vectorized manner. And the two basic ways we can do this are: either tell the compiler something we know about the memory locations that could ever be passed into this function, or just tell the compiler, hey, unbuckle your seatbelt, take the safety off your gun, go buck wild and do whatever you want, I don't care. The first one is the restrict keyword, and this is a qualifier we can add in our C or C++ code that basically says we know these are distinct memory locations, and therefore it's safe to vectorize anything that comes below. This is in the C standard — I don't think it's in the C++ standard, but as far as I've tested, GCC and Clang will handle it. So it's basically the programmer declaring that these pointers will never share the same memory locations, and therefore nothing we do under here will have unexpected side effects. But of course it's up to us as the programmers to make sure we know what we're doing when we tell the compiler not to check these things, because once it's compiled, there's no way to enforce it. The other approach is pragma hints, which basically say: within this loop, don't do any of those memory checks. This is a more coarse-grained declaration that it's okay to vectorize here, whereas the restrict keyword is fine-grained, on individual pointers. The one shown here, ivdep, means "ignore vector dependencies"; there are other ones — I think other compilers have things like simd on / simd off — and I don't know how portable this one is. I think it works in GCC; I don't know whether Clang or ICC do something different.
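Here's a hedged sketch of what these two hints look like; the ivdep spelling varies by compiler, and this is the GCC form:

```c
#define MAX 1024

// restrict promises the compiler that x, y, and z never alias, so the
// loop below is safe to auto-vectorize. Calling add(a, b, a + 1) would
// violate that promise -- and the compiler can't catch it for you.
void add(int *restrict x, int *restrict y, int *restrict z) {
    for (int i = 0; i < MAX; i++) {
        z[i] = x[i] + y[i];
    }
}

// The coarser-grained pragma version: tell the compiler to ignore
// potential loop-carried dependencies for the loop that follows.
void add_hinted(int *x, int *y, int *z) {
#pragma GCC ivdep
    for (int i = 0; i < MAX; i++) {
        z[i] = x[i] + y[i];
    }
}
```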
Or if you're using other libraries like OpenMP, they have their own flags — they're all essentially doing the same thing. Okay, so again, the main takeaway is that we can tell the compiler what it's allowed to do, but it's still up to us as the database developers to protect ourselves. The last approach is explicit vectorization, where we write the exact SIMD instructions that we want to execute. We have to know what the register size is and what data type we're operating on, but now there's no question about what will actually happen, because these intrinsics are essentially syntactic sugar that the compiler replaces with the exact instruction to do whatever we're asking. The downside of intrinsics is that they're not portable: if I write all x86 intrinsics and then compile my data system for ARM, it may not support them. Or if my code operates on AVX-512 registers and I try to compile it on a machine that doesn't have those registers, it's going to fail — or the compiler might replace them with the scalar version of the operation, if it's being nice, and then I won't get the vectorization I'm expecting. So here's the SSE implementation — actually I think this is MMX — the SIMD implementation of the same formula, adding together the two vectors. And you see what I have to do: I take my C/C++ arrays of numbers, convert them into the expected SIMD register type, and then this operation here does the addition on 32-bit integers and loads the result into this SIMD vector. It's ugly. The double underscore is how GCC defines intrinsic types — I think Clang does the same thing — and you can hide this behind a library with a bunch of macro tricks, but there's no single library that everyone uses, so if you look at real code you'll often see explicit intrinsics like this. Okay, so for our purposes in our own system, and in the paper you guys read, they do this explicit vectorization, because they want fine-grained control of exactly what the CPU is doing.
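A rough reconstruction of what that intrinsics code looks like — I'm using SSE2 here rather than the slide's exact MMX/SSE code, so the details may differ:

```c
#include <emmintrin.h>  // SSE2 intrinsics

// Explicit vectorization of the same x + y = z loop: __m128i is the
// 128-bit register type, and _mm_add_epi32 does four 32-bit adds in
// a single instruction. Assumes n is a multiple of 4.
void add_simd(const int *x, const int *y, int *z, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i xv = _mm_loadu_si128((const __m128i *)&x[i]);
        __m128i yv = _mm_loadu_si128((const __m128i *)&y[i]);
        _mm_storeu_si128((__m128i *)&z[i], _mm_add_epi32(xv, yv));
    }
}
```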
Right — so now that we know how to write SIMD instructions, what kinds of SIMD operations can we do? The first thing we need to talk about is what direction we're applying our vectorization in: the difference is horizontal versus vertical. With horizontal, the idea is that we apply some operation on all the elements together within a single vector and produce a single scalar output. Say I have a register holding the numbers zero, one, two, three: I can invoke a horizontal SIMD instruction that takes all the elements in my vector and produces a single scalar output that's the sum of all of them. This one is only found in the newer instruction sets that support SSE4 and AVX2 — so roughly 2015, 2016; anything newer than that should have it. The next one is vertical, and the idea here is that we take two vectors, apply some SIMD instruction to them, and match up the elements based on their offsets: this is offset zero, this is offset zero, we add them together, and we write the result out to offset zero in the output vector. The Columbia paper does everything based on vertical vectorization — I forget why they said they didn't use the horizontal stuff; I think the CPU they were looking at at the time didn't have it. So again, vertical is the more common approach. Alright, so going forward we're going to assume we have intrinsics for vertical vectorization, and now we want to do a bunch of different things in our database system: construct primitive operations that are vectorized, and then build up from those primitives to the more complex things — the joins, the scans, and the other stuff you need when you process queries. Again, I like this paper because it has everything: all the different techniques, and actually the source-code implementations, of how you design these algorithms and data structures to take advantage of vectorization. We're not going to need all of them, but I want to cover the main primitives. Okay, so this is all about how we can build primitive operations that are vectorized and then do more advanced algorithms and functionality in our database system. The system they run on is actually not a full-fledged database system; it's a little testbed prototype that's hand-coded to do the one operation they're trying to measure. That means there's no SQL parser, there are no transactions, there's no query processing, and it also means they don't materialize the output in those cases, which is often a big overhead in real implementations. The other big thing — and this is the big assumption I mentioned before — is that in their operating environment they assume the database fits entirely in the CPU cache. That is not realistic: L1 is like 32 kilobytes; L3 is maybe 64 megabytes if you have a lot of money. There's no database that's going to fit entirely in your CPU caches. And what you'll see in the next paper you read, after spring break, is that if everything does not fit in the CPU cache, this SIMD stuff doesn't matter at all — unless you start doing the relaxed operator fusion stuff you'll read about next: how to stage your operators so that you operate on vectors within your CPU caches, and prefetch the next vector you're going to operate on to hide the latency of the memory stall. In their world they don't do any of that; everything fits in the CPU cache and they rip through it very quickly. Okay, the other big design decision in their algorithms is that they want to maximize lane utilization, so that every single time you invoke an instruction, you're doing useful work for all the data items in your vector. This will make more sense when we talk about how they do the hash table probing, but the idea is: I don't want to say I can put four elements into my vector when only one or two of them are ones I actually need and the rest are garbage I'll throw away. They want to pack unique, useful data into every single register for
every single instruction, so they're maximizing the utilization. Alright, so let's start with the fundamental operations they define — selective load, selective store, and then gather and scatter — and then we'll see how we can use these to do the scans, the hash tables, and the histograms. Okay, so with selective load, the idea is that we want to take some chunk of memory we have in our L1 cache and write it into a register, but we provide a mask to say which lanes we actually want to fill. Without this, we'd have to take everything and load it contiguously, which means that if we only wanted certain items, we'd have to copy the data, align things the way we want inside our CPU cache, and then write it into the register. So the idea is: we take a chunk of memory that has some things we want and some we don't, provide the mask, and that tells the instruction how to populate the register. Each mask offset corresponds to an offset in the vector. So the first thing we do is look at the first element of the mask: the bit is set to zero, so we skip that lane — there's nothing we want to write into it. Then we get to the one, and that tells us to start writing into this lane. Think of it like this: every time we hit a one, there's a cursor that copies out what's at the current memory location and then moves over by one. So even though this is offset one in our vector, we're reading from offset zero in memory, because we didn't consume anything for lane zero. So that value gets mapped here and written into this lane; then the same for the rest: there's a zero, so we skip it; then a one, the cursor moves over to this value, and we write it into offset three up in the vector. Selective store is the opposite: we have our vector, and we want to take its elements and write them out to memory. Again, the lanes match up just as before: we start with a zero, which says don't copy anything from this lane; then we get to a one, which takes that offset of the vector, and we write it to the first location in memory; and the same for the rest — skip the zero, hit the one, write it to the next contiguous location. Now, here's another spoiler about this paper: there is no single SIMD instruction for this — the Xeon does not currently support doing this with one instruction; it has to be emulated using other SIMD operations. And every year I check on Google to see whether selective store or selective load has been implemented in x86, and the only things that ever show up are: the Columbia paper that describes the technique, my slides from this class, or people in Korea or Wisconsin who stole my slides and talk about the same thing. We'll see what they look like in a second. And this also means you can't do this atomically, since it takes multiple instructions to make it work.
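To pin down the semantics, here's a plain scalar rendering of what the two primitives compute, assuming four-lane vectors of ints — remember, on x86 the real thing has to be emulated with multiple SIMD instructions:

```c
// Selective load: fill the lanes of vec whose mask bit is set with
// consecutive elements from memory, left to right.
void selective_load(int vec[4], const int mask[4], const int *mem) {
    int cursor = 0;
    for (int lane = 0; lane < 4; lane++) {
        if (mask[lane]) {
            vec[lane] = mem[cursor++];
        }
    }
}

// Selective store: write the lanes whose mask bit is set out to
// consecutive memory locations, left to right. Returns the count.
int selective_store(const int vec[4], const int mask[4], int *mem) {
    int cursor = 0;
    for (int lane = 0; lane < 4; lane++) {
        if (mask[lane]) {
            mem[cursor++] = vec[lane];
        }
    }
    return cursor;
}
```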
The reason you actually want these is, again, that I don't want to have to take a chunk of memory and copy things around to align them the way I want before loading them in or taking them out; ideally I could do this in a single instruction, but that doesn't exist. Alright, next are gather and scatter. With gather, the idea is that we have an index vector that lines up with our value vector, and we want to take elements from arbitrary locations in memory and write them into the lanes of our register. Think of these memory slots as offsets zero through five. So when I look at the first entry here, the index vector says offset two — which is this one here — so write that value into the first lane of my register. Then I get to the next entry, offset one: take offset one in memory and write it into the second lane. And so on for all the other lanes. And actually, I think the Xeon now supports gather: you can do it with a single instruction, but it won't be done in a single cycle, because again, L1 can only do one or two loads or stores per cycle, so if I'm populating a large register it may take a bunch of cycles to fill it up using this technique. Yes? I know — I don't think — no, I think it actually supports this, yeah. Could you use gather to emulate the selective load and store from before? Yes — I don't know the exact details, the paper describes it, but the main takeaway is that it's not one instruction. Okay, scatter is the opposite: we take elements in a register and write them out to different locations in memory. So for the first element here, with index vector entry two, I write it to memory location two, and I do the same for everything else. Okay, and I've said this a lot already: the gathers and scatters are not really executed in parallel, because we can only do so many loads per cycle. Gathers are only supported on the newer CPUs — well, "newer": since 2014, when AVX2 came out with Haswell; prior to that they weren't supported — and scatters you have to implement or emulate using multiple CPU instructions.
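For reference, the AVX2 gather intrinsic that does exist looks like this — a small sketch, and the helper name is mine:

```c
#include <immintrin.h>  // AVX2

// One intrinsic gathers eight 32-bit values from arbitrary offsets:
// lane i of the result gets base[lane i of idx]. The scale argument
// (4) is the byte width of each element.
__m256i gather8(const int *base, __m256i idx) {
    return _mm256_i32gather_epi32(base, idx, 4);
}

// Example: pull elements 2, 1, 5, 0, 3, 4, 1, 2 out of an array:
//   __m256i idx  = _mm256_setr_epi32(2, 1, 5, 0, 3, 4, 1, 2);
//   __m256i vals = gather8(table, idx);
```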
Alright, so let's get to the good stuff. We have these primitives now, so let's talk about how we do scans, hash tables, and partitioning for histograms. The paper also talks about how to do joins, sorting, and Bloom filters — I don't remember whether they talk about how to do SIMD hash functions, but we'll cover that later as well. We'll cover joins and sorting after spring break, when we talk about the radix partitioned hash join and the bitonic sorting technique in the context of sort-merge join. And I think we'll also talk about Bloom filters, but that one's pretty easy to figure out as well. Okay, so again, nothing's going to work, because they assume everything fits in the CPU cache. They also make this other big assumption — I should have mentioned this earlier — that all their keys are 32 bits and all their pointers to tuples are 32 bits, because again, they're operating on 256-bit registers, and so a key/value pair has to be 64 bits. In a real in-memory database system, the values are going to be the tuple pointers, and those are going to be 64 bits; and keys are often not just going to be 32 or 64 bits — sometimes you have composite keys, and if you have those, many of these techniques don't work, because now you can't align things nicely into lanes in the SIMD register. Alright, so: vectorized selection scans. We saw these two examples before, when we talked about query processing — the branching version and the branchless version of the scan. In the branching version you have an if clause where you check the predicate first, and only if it matches do you copy the tuple into the output buffer. In the branchless version you always copy it, and then you evaluate the predicate using a bitwise comparison operation — so it's not really a branch — and based on the output of that comparison, you increment the offset into the output vector by one or by zero, which determines whether, when you come back around, you overwrite the last thing you copied in, because you didn't actually want it. And we saw this graph from the VectorWise people: the branchless version of the algorithm has an almost fixed cost, because you're doing the same amount of work no matter what the selectivity of the predicate is; whereas the branching version does better when you have very low or very high selectivity, but in this middle part, the branch misprediction penalty you're paying in the CPU gets higher and higher, and those pipeline flushes from having to undo the mispredicted branches become a big bottleneck. Alright, so in a vectorized selection scan we can't do the branching version, because there is no notion of if clauses in our SIMD instructions; so they do a vectorized branchless version. This is a gross approximation of the algorithm, but: now we scan through our table, and whereas before I'd get back one tuple at a time, now I get back a vector of tuples — matching whatever the register size is; whether it's four elements or eight or sixteen doesn't matter. Then we load the key we want to compare on — again assuming we only need to look at one key — into a key vector; and this SIMD load is just syntactic sugar, this is not really how you write the code — there's no generic SIMD_LOAD function. Then we do the same comparison we did in the branchless scan, the bit manipulation to see whether we match or not, but now we write the resulting mask out to a register: if the tuple at that offset in the vector we got from the table satisfies our predicate, we set a one; if it doesn't, it's set to zero. Then we take that mask and do the selective store to copy the tuples that matched our predicate, per our mask, into the output buffer. And then we take the cardinality — the number of ones we have in our mask vector — and that tells us how far to advance our output offset. And again, there's more work to do outside the for loop to make sure we don't keep around things that didn't match in the last iteration, but I'm ignoring that for now.
So again, this would be a SIMD load; this would be a SIMD comparison — actually two instructions, because you have to do the vectorized greater-than-or-equal and the vectorized less-than-or-equal, but that's all happening in our SIMD registers, so it's not a big deal; and this would be a SIMD store that takes the selective-store mask we produced from the comparison, takes our tuples, and writes the matches through to the output buffer, which would be sitting in memory. Yes? The question is about the offset into the output buffer. Right — so we have this output buffer here, and we keep track of i, where i is the starting location where we should write tuples that match. It's actually easier to understand if you go back to the scalar version. Right: i equals zero, and I always copy into the current offset i. If the tuple matches, this m flag gets set to one, so I want to keep whatever I copied in, so I add m to i — which is plus one — so that when I come back around I'm writing at the next offset and I'm not clobbering the last thing I copied in. If it doesn't match, m is zero, and I overwrite it the next time I come around, because I didn't want to keep the last thing I wrote. So in the vectorized case, this operation here counts the number of ones, and that tells us how far to move the offset forward. And you can do this in the CPU very efficiently — it's the popcount instruction: you take a vector of bits and count the number of ones in it. That's an example of horizontal vectorization. Alright, so let's look at a real example. Say we replace that query with real values: we want to find all the matching tuples where the key is greater than or equal to O and the key is less than or equal to U. Say our table looks like this — we have the keys J, O, Y, S, U, X. In order to do the comparison, we first copy these into our key vector — that's this step here. Now we do our SIMD compare — that's this part here — and we get back our bitmask, which tells us which of these keys actually matched. And now we have this pre-computed offset table: this is just something we can pre-define in our source code ahead of time, and it says the implicit offset here corresponds to offset zero in memory, and so on up through five. So now we use this to do our SIMD selective store, and that's what this is saying here: if I have a one here, I know the offset that matched should go into the first location here. So these are just offsets: I'm recording that the key that matched my SIMD compare can be found at this offset. Then I can go back and say, well, I have offsets one, three, four; if I need to materialize the keys above me in my query plan tree, I go back in here and jump to each offset to copy out the actual key. Is this clear? In my opinion, this is the most useful thing you'll probably get out of this lecture, in terms of something you can actually go use — we actually do this now in our own system. And the magic is this pre-computed offset table.
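As a sketch of how the pieces fit together on AVX2 — here packing the matching keys themselves rather than their offsets, with a permutation table built at startup playing the role of the pre-computed offset table; all the names are mine:

```c
#include <immintrin.h>
#include <stdint.h>

// For each 8-bit match mask, the lane order that packs the selected
// lanes to the front. Call init_perm_table() once at startup.
static uint32_t perm_table[256][8];

static void init_perm_table(void) {
    for (int mask = 0; mask < 256; mask++) {
        int out = 0;
        for (int lane = 0; lane < 8; lane++)
            if (mask & (1 << lane))
                perm_table[mask][out++] = lane;
        while (out < 8) perm_table[mask][out++] = 0;  // don't-care lanes
    }
}

// Branchless vectorized scan: copy keys in [lo, hi] from in[] to out[],
// returning the match count. Assumes n is a multiple of 8, lo/hi don't
// touch INT32_MIN/MAX, and out[] has one extra vector of slack at the end.
int scan_between(const int32_t *in, int32_t *out, int n,
                 int32_t lo, int32_t hi) {
    __m256i vlo = _mm256_set1_epi32(lo - 1);  // key > lo-1  <=>  key >= lo
    __m256i vhi = _mm256_set1_epi32(hi + 1);  // key < hi+1  <=>  key <= hi
    int i_out = 0;
    for (int i = 0; i < n; i += 8) {
        __m256i keys = _mm256_loadu_si256((const __m256i *)&in[i]);
        __m256i m = _mm256_and_si256(_mm256_cmpgt_epi32(keys, vlo),
                                     _mm256_cmpgt_epi32(vhi, keys));
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(m));
        // Emulated selective store: permute matches to the front, write
        // a full vector, then advance the cursor by only the match count.
        __m256i perm = _mm256_loadu_si256((const __m256i *)perm_table[mask]);
        _mm256_storeu_si256((__m256i *)&out[i_out],
                            _mm256_permutevar8x32_epi32(keys, perm));
        i_out += _mm_popcnt_u32((unsigned)mask);
    }
    return i_out;
}
```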
So again, as I said: if you Google "selective store," you'll find the Columbia paper, my slides, and the people who copied my slides. And the way I know they copied my slides is that they don't know what this example is — they always include this J-O-Y-S-U-X. Joy was my first PhD student; that's why his name's in here. So you go look at a bunch of other slides and sure enough: JOYSUX. They copied the entire thing. Yeah, I don't care, it's fine. Alright, so let's see what performance benefit you can actually get from this. For this experiment they use four-lane SIMD registers, and they compare two CPU architectures. This one here is the Xeon Phi — who here has ever heard of the Xeon Phi? Well, your dad worked at Intel — but very few people, right. The Xeon Phi was a co-processor that Intel used to sell that came in a bunch of different form factors. The easiest way to think about it is that it was their version of a GPU, meant for highly parallel computation. You wouldn't get thousands of cores as you would from Nvidia on their GPUs; you'd get maybe 60 or 70 cores, but these cores were more complex than a GPU core — basically the Intel Pentium 4 architecture, or the Atom architecture later on. So low-powered and very simple, but you get more cores than you get on a Xeon. You could have it sit on the PCI Express bus like a GPU, but they did have ones that could sit on the motherboard in a socket, like this one here — you could actually run the operating system on the Xeon Phi itself; it didn't need a Xeon to drive everything. And this one has a little Omni-Path connector so you can do remote memory access to another machine. Intel killed this off a year or two ago; they ran roughly $5,000 or so. It was an interesting experiment, but for machine learning the GPUs are way better. The main thing I want to point out, though, is that the Xeon Phi they run here is an older version, so it does in-order execution — it can't do the out-of-order stuff the Xeon can, and it can't do speculative execution. It takes your pipeline and executes the instructions one after another; it doesn't do branch prediction, it doesn't do speculative execution, so if you have a jump and you have to flush your pipeline, that gets expensive. So we have four variants of the selection scan: the scalar version with SISD instructions, branching versus branchless; and then the vectorized version, which is always branchless, but they do one with early materialization and one with late materialization. That just means: after I match, do I need to copy the tuple — materialize it into a buffer — to pass it up to the next operator in the query plan? But again, it's not a full-fledged database system; there is nothing else after they do the scan. Alright, so the first thing you see here is that the Xeon Phi outperforms the Xeon for both the branching case and the branchless case.
That's because the scan is a pretty simple set of instructions in our for loop, and this thing just has way more cores. I always think this is a typo — this thing has 61 cores; I don't know why it's that weird number, not 60 or 62 or 64; it's 61, whatever. And the Xeon here I think has four cores with hyper-threading. So the Phi has way more cores, the loop is pretty straightforward, and it can rip through things more quickly. But this now shows you an important difference in the benefit you get from branchless versus branching when you have in-order versus out-of-order execution. The Xeon CPU has out-of-order execution, speculative execution, and branch prediction, and in that case the branchless one does much better as you scale up the selectivity. As the selectivity gets lower — I don't know why this doesn't arc like the other one... oh yeah, because in this case you run out of CPU cache. So think of this as zero, one, two, five, ten percent — that's the first part of the graph I showed from VectorWise — and when you go beyond 20 or 50 percent, that's when it crosses over; that's why they converge there. In the case of the Xeon Phi, there's no branch misprediction or speculative execution to exploit, so copying everything every single time sucks in this world, because it's a lot of wasted work when your selectivity is really low. And here, again, when the selectivity is at a hundred percent, the memory bandwidth on both machines is the penalty you're paying — just getting the data into and out of your CPU caches — so it doesn't make a difference which algorithm you use. For the vectorized versions, what you see is that late materialization performs best on both — no surprise, because I'm not copying the tuples that match, I'm just doing less work. But there's a more pronounced difference between early and late materialization on the Phi, because the CPU is quite simple, so the copy instructions become expensive and the cost of doing wasted work becomes higher. And then everything converges to the same performance when you start maxing out the memory bandwidth. So this is a good example of what I was saying: it doesn't matter how fancy you are with SIMD — if the query and the data are not amenable to vectorized operations, it doesn't help you. It's also another example of what I said at the beginning, about how you never achieve the theoretical performance improvement you could possibly get with vectorized instructions. In this case, on the Xeon, the scalar version is doing roughly 2.5 or 2.6 billion tuples per second, and the SIMD version is doing — rounding up — six. So it's getting less than a 3x improvement, but we had four-lane SIMD registers, so if we were actually achieving the full parallelization this should have been 4x faster. And we're not, because again, it's that cost of moving things into and out of the registers that we pay a penalty for: just because we vectorized one piece doesn't mean the rest is vectorized, and that becomes a bottleneck for us. Okay, alright, let's look at something else:
hash tables. I think it's really interesting that they talk about this, and they come up with some interesting techniques — though again, this is one we definitely evaluated, and it definitely does not work once you're out of CPU cache. So say we want to do a probe in our hash table, and we're doing linear probing. In the scalar version we have a single input key: we hash it, producing some offset for a slot in our hash table; we jump to that offset, take whatever key is inside the slot, compare it with our key, and see whether we have a match. If we find a match, we're done; if not, we keep scanning down until we find a slot that's either empty — meaning our key's not in there — or the one we're looking for. Now, the only thing you can really vectorize in this particular formulation is the hash function — the HyPer guys have a vectorized hash function we'll talk about when we get to joins — but that's not the expensive part. The expensive part is doing this comparison and jumping through memory in the hash table. So let's see how we could vectorize this using horizontal vectorization. What we're going to do now in our hash table is expand the number of elements we store per slot: within each slot we'll have four keys and four values. So now when we take a single key, hash it, get our index, and jump to that location, we get back four keys; then we do our SIMD compare and get back our match mask, and we check whether any one of the bits is one — if so, we have a match, and we know the offset of the matching key. If they're all zero, we know there's nothing in this slot for us, so we jump down to the next location and do the same vectorized comparison. To me this seems like a really good idea; it turns out not to work, because there's just so much extra overhead in going and getting these slots and ripping through them — copying them into the SIMD registers is the penalty you're paying. Yes? The question is: did we try software prefetching on this — where I'm here, I assume I'm not going to match, and so I prefetch the next slot? But what if I do match? Then I just prefetched some junk and polluted my CPU cache. And hashing is also random, so it's hard to say what you actually should prefetch. In our experiments it doesn't work — prefetching doesn't help here. Yeah.
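Here's a hedged sketch of that horizontal layout with SSE; the slot struct, the empty-key sentinel, and the -1 miss return are all my assumptions, not the paper's:

```c
#include <immintrin.h>
#include <stdint.h>

#define EMPTY INT32_MIN  // sentinel for an unused key entry (assumption)

// Horizontal layout: each slot holds four keys and their four values.
typedef struct {
    int32_t keys[4];
    int32_t vals[4];
} slot_t;

// Linear-probing lookup comparing four keys per instruction. Assumes
// num_slots is a power of two and the table is never completely full.
int32_t probe(const slot_t *table, int num_slots,
              int32_t key, uint32_t hash) {
    __m128i target = _mm_set1_epi32(key);
    __m128i empty  = _mm_set1_epi32(EMPTY);
    for (uint32_t s = hash & (num_slots - 1); ;
         s = (s + 1) & (num_slots - 1)) {
        __m128i keys = _mm_loadu_si128((const __m128i *)table[s].keys);
        int match = _mm_movemask_ps(
            _mm_castsi128_ps(_mm_cmpeq_epi32(keys, target)));
        if (match)  // lowest set bit = index of the matching lane
            return table[s].vals[__builtin_ctz(match)];
        int free = _mm_movemask_ps(
            _mm_castsi128_ps(_mm_cmpeq_epi32(keys, empty)));
        if (free)   // hit an empty entry: the key is not in the table
            return -1;
    }
}
```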
Alright, so let's see how to do this with vertical vectorization. Before, we were taking one key and looking for that single needle in our haystack; with vertical vectorization we take multiple keys at the same time and do our search for them in parallel. So we take four keys here and run four hash functions — and this part you have to do as scalar code; there's no SIMD instruction that does a four-lane hash. Again, we can vectorize the operations within the hash function itself, but there's no MurmurHash or xxHash that can be run entirely in SIMD for multiple elements. So we run this as a for loop and rip through it — we can unroll the loop if we want — and we get four locations in our hash table. We jump to those locations, do the SIMD gather to put the four table keys into a single vector, and now we can do our SIMD compare and check whether we have a match; that produces a bitmask that's one where the keys match and zero where they don't. But now, what's the problem here? Well, in the last case, when I'm looking at one key at a time: if that one key matches, I'm done; if it doesn't match, I jump down to the next slot. But now say two of them match and two of them don't. Yes — right, and this is that notion of always wanting full utilization of all the lanes. These two matched, so I don't need to look anywhere else in the hash table for them; these two didn't, so I have to keep looking for them. One thing I could do is just keep all four keys in my register, have every lane jump down one offset to its corresponding next location in the hash table, bring those new table keys in, and do the comparison — and no matter what the first key and the last key produce in the comparison, I ignore it, because I know I already found them. But now I'm getting 50% utilization, because I'm doing useless computation on keys I've already matched — I'm doing unnecessary work. Or, say three out of four match, and I keep scanning: maybe I loop through the entire table only to find that the last key never matches, and I've wasted all that work. So what they do instead is maintain some internal bookkeeping to keep track of: alright, these lanes matched, and therefore I need to go get two new keys to replace them. They go back over here, replace those lanes of the input key vector with the next two elements from the column, and rerun the hash function — and for the second and third keys, the "hash function" is really just "take the last location and add one," because they're moving down to the next slot in the array — but I get new starting locations for the first and last keys. Then I do it all over again: jump to those locations, fill up my vector, and do my SIMD compare. Yes? The question is: can you do multiple hashing with SIMD, across multiple keys at the same time? I don't know the answer; we'd have to check. Because of the trade-off between collisions and speed, I think the collision rate might be kind of high on that — but we could try. Alright — and this is also not going to work once you don't fit in your CPU caches, because all this bookkeeping overhead of going back to get new keys and fill them in is expensive.
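A scalar rendering of that refill bookkeeping — this is just a sketch of the control flow under my own assumptions (the EMPTY sentinel, the hash function, and the emit callback are all mine; the real version does each lane pass as one gather plus one SIMD compare):

```c
#include <stdint.h>

#define LANES 4
#define EMPTY INT32_MIN  // sentinel for "no key here" (assumption)

static inline uint32_t hash32(int32_t k) {
    return (uint32_t)k * 2654435761u;  // multiplicative hash (assumption)
}

// Vertical probing with full lane utilization: every lane always holds
// a live probe key; as soon as a lane resolves (match or empty slot),
// it is refilled with the next input key instead of sitting idle.
// Assumes num_slots is a power of two and no input key equals EMPTY.
void probe_all(const int32_t *table_keys, int num_slots,
               const int32_t *input, int n,
               void (*emit)(int32_t key, int slot)) {
    int32_t  key[LANES];
    uint32_t pos[LANES];
    int next = 0, active = 0;
    for (int l = 0; l < LANES && next < n; l++, active++) {  // prime lanes
        key[l] = input[next];
        pos[l] = hash32(input[next++]) & (num_slots - 1);
    }
    while (active > 0) {
        for (int l = 0; l < LANES; l++) {
            if (key[l] == EMPTY) continue;              // drained lane
            int32_t probed = table_keys[pos[l]];
            if (probed == key[l] || probed == EMPTY) {  // lane resolved
                if (probed == key[l]) emit(key[l], (int)pos[l]);
                if (next < n) {                         // the bookkeeping:
                    key[l] = input[next];               // refill this lane
                    pos[l] = hash32(input[next++]) & (num_slots - 1);
                } else { key[l] = EMPTY; active--; }
            } else {
                pos[l] = (pos[l] + 1) & (num_slots - 1);  // keep probing
            }
        }
    }
}
```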
OK, let's look at some performance results. Again, they run on the Xeon Phi and the Xeon, with the scalar hash probe and the vectorized vertical and horizontal versions, and along the x-axis they increase the hash table size. With the scalar version the performance looks like this, and notice that the y-axis scales are different, because the Xeon Phi has more cores than the other one. The difference is that vertical vectorization is much better on the Xeon Phi, because it doesn't suffer the branch misprediction penalty: it always programmatically goes through and does the checks and the writes in the same order, whereas in the horizontal case you have indirection and non-determinism in how you evaluate the keys. In the case of the Xeon, I forget why there's this little crossover point where the vertical version gets better than the horizontal one. But the main thing to point out is the same as before: once everything no longer fits in your CPU cache, none of this matters. The Xeon Phi has a smaller cache than the regular Xeon, so you hit this convergence point much sooner. And again, I doubt software prefetching would help in this case; the older Xeon Phis definitely did not have software prefetching, that's only in modern Xeons, roughly the last ten years.

OK, the last thing I want to talk about is how to do partitioning with histograms. The basic idea is that we can use scatter and gather to do our histogram computation in parallel. Say this is our input key vector. We're going to use a SIMD radix construction; think of this like a radix tree, where I just grab one digit, one byte, of each key and use that essentially as my hash function to tell me what my offset should be. It's a cheap, poor man's hash function. Then I take these locations, map them into a histogram, and do a SIMD add, where I take whatever value was in the position I'm writing into and add one to it, for every matching element at that location. The problem is going to be when two keys map to the same location, because with SIMD the whole vector update happens at once: the instruction says add one to each location based on its previous value. There's no way to say add one, and then add one again on top of that; both lanes add one based on the same old value, so we lose one of the updates and our counts are wrong. The way they handle that is to do a SIMD scatter so that each lane writes to a different vector: this is the vector for the first lane, this is the vector for the next lane, and so on. They're doing the same thing we had before, writing into an offset of a vector, but now the lanes aren't clobbering each other, because each lane only ever writes into its own vector. Then the only thing I need to do to compute the final histogram is a SIMD add across them, horizontally, to produce the final result. I think this is kind of cool.
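Here is a minimal sketch of that replicated-histogram trick, assuming eight lanes, 32-bit keys, and the low byte of the key as the partition number; all the names are mine. AVX2 has no scatter instruction, so the per-lane update below is a scalar loop, but the point of the layout is that even with a real scatter (as on AVX-512 or the Xeon Phi), the eight lanes always write to eight distinct addresses, so no increment is ever lost.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <vector>

// Radix histogram with one private copy of the counts per SIMD lane.
std::vector<uint32_t> BuildHistogram(const int32_t *keys, uint32_t n) {
  constexpr int kLanes = 8, kBuckets = 256;
  // counts[b * kLanes + lane] is the lane's private count for bucket b, so
  // two lanes hitting the same bucket still write different addresses.
  std::vector<uint32_t> counts(kBuckets * kLanes, 0);
  uint32_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    __m256i vkeys = _mm256_loadu_si256(
        reinterpret_cast<const __m256i *>(keys + i));
    // The "poor man's hash": one radix byte of each key is its bucket.
    __m256i vbkt = _mm256_and_si256(vkeys, _mm256_set1_epi32(0xFF));
    alignas(32) int32_t bkt[kLanes];
    _mm256_store_si256(reinterpret_cast<__m256i *>(bkt), vbkt);
    for (int lane = 0; lane < kLanes; lane++)   // scatter stand-in
      counts[bkt[lane] * kLanes + lane]++;
  }
  for (; i < n; i++)                            // scalar tail
    counts[(keys[i] & 0xFF) * kLanes]++;
  // Final pass: sum the per-lane replicas horizontally into one histogram.
  std::vector<uint32_t> hist(kBuckets, 0);
  for (int b = 0; b < kBuckets; b++)
    for (int lane = 0; lane < kLanes; lane++)
      hist[b] += counts[b * kLanes + lane];
  return hist;
}
```

The cost is eight copies of the counts, which is why this layout makes the most sense for small histograms like a single radix byte.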
All right, to finish up quickly. Vectorization is super important for OLAP queries; we've already covered that nothing works once you're out of the CPU cache. And on top of all the query parallelism stuff we've talked about so far, plus the stuff we'll cover when we get to joins, vectorization is just another tool we can add to speed things up. In the ideal case, a system could do query compilation with vectorization and also support parallel queries, and in our system we're almost there. HyPer actually could not do vectorization; it could only do compilation and parallel queries. VectorWise could only do sort of pre-compiled primitives; in some cases it helps, sometimes it doesn't, but it could do vectorization. Yes? That's for writes: for writes they have only a single thread, but OLAP or read-only queries they can run in parallel. That's the morsel stuff we talked about with the schedulers. OK, any questions about vectorization?

All right, let's finish up with project 3. The goal of project 3 is for your group to implement some substantial piece of software, a large component or feature, in the database system we've been working on for the first two projects. The idea is that these projects should incorporate the various topics, techniques, methods, and optimizations we've talked about so far in the course, as well as whatever you're interested in from your own line of work, or research, or your hobbies, if you want to bring that in. I'm totally fine with that, because I certainly don't know everything; anything outside databases, I have very limited knowledge. So if you come along with something I don't know about, it would be kind of cool to play with, and I'm totally down with that. The important thing, though, is that whatever you pick for your project has to be unique from every other group; I can't have two groups both implementing constraints. The goal is for your software to be able to be merged back into the full system, so that you can go off into the real world, into the job market, and say: look, here's this piece of our database system that I helped implement.

So what do you have to do? There will be a proposal due after spring break. Later on, before the end of the semester, there will be a status update with design documents. We'll also do code reviews with each other; I'll talk about this in a second, but you'll have to look over other people's code and see whether they're doing stupid things, and they'll look at your code and see whether you're doing stupid things. Then there will be a final presentation, and then the code drop, which is submitting a PR to us on GitHub that has to cleanly merge into our master branch. So let's go through each of these.
For the first Monday after spring break, every group will come up here and spend five minutes talking about what you're proposing to build. This is not just a high-level "here's what I want to do": you should actually spend time looking at the source code and come up with a good approximation of what it will take for you to implement that particular feature or component. You should know what files you need to modify or add, and how you're going to test whether your implementation is working correctly. This is why I have you guys list, for every single paper you read, what workloads they used to evaluate their research: you should know what workload you want to use for your project. Right now we support some basic OLTP workloads (TATP, SmallBank, and YCSB), and we can support some queries in TPC-H; if there's something else you want to evaluate with, let me know and we can figure something out.

Then we'll have a checkpoint in April, where again you come back up, spend five minutes, and tell everyone what you've done, what you're working on, and the current status of your project. If there's any change in your plans, because something in the system was broken or not implemented, or because you found something else that was super interesting, you talk about what those differences are. This is always fun, too: people get to discuss the surprises they found when they started looking into the bowels of the system, like "oh, I thought it was going to work this way, but it did it that way." ("There was a memory leak" is not a surprise at this point, but things like that.) And it's also super useful, and this has definitely happened in previous semesters: sometimes one group will need some feature, like a settings manager or a way to lock tables in a certain way, and another group might need the same thing. Rather than both groups implementing the same redundant piece of software, you could potentially work together, or maybe one group finishes first and the other group takes their code. This is meant to be a collaborative process; it's not like project one or project two, where you're in competition with each other. We should all be working together, trying to make the thing better.

For the design document, I'll provide you with a template; it's just written in Markdown. It's basically a description of what your feature or component actually is, why you designed it a certain way, what the different trade-offs were that you had to consider for this implementation, and then future work: if anybody wanted to come along and continue your project, what could they do, and what would you do if you had more time. For those of you considering doing a capstone or independent study in the fall semester, this is also useful for writing down what your state of mind was at the end of the semester, because you're going to go away for the summer, come back in the fall, and wonder what the hell you were actually thinking back in April or May. This is a good reminder for yourself.

For the code reviews, there will be two rounds, and again I'll pair up different groups to look at each other's code using the pull-request review process through GitHub.
I'll spend time in class talking about what it means to actually do a code review. It's not things like "you misspelled this word"; it's actually spending time to look at the code and try to figure out whether they're making reasonable assumptions in their implementation. And I'll say this too: everyone is meant to contribute. I don't want one person doing the first code review and a different person doing the second; everyone should be contributing equally. How you divide up the source code for review will vary based on the project: if you do it on a file basis, some files will have been modified more than others, and that may not be a good way to divide things up, but we'll discuss ways to handle this.

Final presentations will be at whatever time we're scheduled for the final exam, I think some Monday at 5:30 p.m.; we're in the evening, not the morning, so we'll get pizza, we'll get food. Basically you show up and spend ten minutes saying, here's what we've actually done. In previous years people have given demos; if you can do a demo for the status update, that's awesome too, to show that this thing actually does what we say it does. And now that we actually support SQL in our system, doing demos should be much easier than it was for project 2.

The way this is going to work, though, is that it's not like other classes, where you write some code that sits in your private repo or on your laptop and no one ever sees it or cares. You have to make sure your code can actually merge into the master branch to get a final grade. The reason we do this is, one, it gives you visibility for what you're doing: if you want to go get a job at a database company, sometimes they email me and say, hey, do you know this student, what did they do, and I can point to your project and the actual source code. In some cases we'll want to merge the code and in some cases we won't, but at least there's a PR that cleanly merges and passes all the tests, and we can point to it as the final body of work. We're not a company, we're not trying to sell this software, but we're still trying to do high-quality software engineering to the extent that we can in academia. I actually had a student come back, he's doing an internship at a database company now, and he was surprised at how messy the commercial source code he was looking at was compared to our source code. So I'd like to think that going through this process, you at least get an appreciation for what it takes to write clean, reasonable source code. So it has to merge into the master branch, and we run all the clang-tidy, clang-format, and Doxygen checks as well. Now, the tricky thing, of course, is that there will be conflicts between different groups because they modified the same file. How do we determine whether we're going to merge things, and who gets to merge first? We'll either do that at random or take it on a case-by-case basis. Our success rate has been about 50%: half of the student projects in previous years have merged into the master branch.
OK, in terms of resources: again, I will get you more credits for Amazon. If you submit a PR to our main repo, that will fire off builds in Travis and Jenkins (we're probably going to drop Travis, because the builds are taking too long, but Jenkins will cover OS X and Ubuntu). If you think you need special hardware, which I think this year shouldn't be the case for anything, let me know and we'll see what we can do. And you guys already know this story: it's a work in progress, a bunch of things aren't going to work, some things are maybe broken, but we'll fix them as we go along.

All right, let's talk about potential topics. There are a bunch of these, so let's go through them one by one. The first one is the query optimizer. We have a full Cascades query optimizer (you'll learn what Cascades actually means in a few more lectures), and this project would be working on the internals of the optimizer to add support for more complex queries and more complex transformations: things like outer joins and nested queries (that part might be fixed already), and potentially improving our cost model. Actually, we definitely know we need to improve our cost model for determining whether one plan is better than another. I can go into more detail later about the kinds of things you could do here, but if you work on the query optimizer, the very first thing you have to do is send me your CV or resume, because this is the one thing all the database companies want to hire for. You saw that if you were in the intro class last semester: the guy from Oracle came and said, we don't want JavaScript programmers, we want query optimizer people. And I get emails like this one, from a pretty famous database person at a database startup, who was emailing a bunch of people (I was on the list, although I'm not a senior database person, whatever) asking whether anybody knew of any loose query optimizer people they could hire. They're super hard to find, because query optimization was a big thing in the '80s and '90s, and the people who know it deeply are not very likely to leave their companies and go join startups, so these places cannot hire fast enough. And all the NoSQL database companies that said "we don't need a query optimizer, we're just going to do gets and sets on JSON" soon realized, oh, we do need a query optimizer, and they're struggling to find people too. This guy was a bit more vulgar in his request, but it was the same thing: he wanted to hire query optimizer people. So you can work on the query optimizer.

Related to this, for the cost model: we don't have any good way to maintain statistics, information about what the data looks like, that we can feed into our cost model to make estimations about plans. So we need a way to collect statistics that we then feed into the system. We could also do sampling, which is another technique, one that Microsoft uses, where you make a small mini-table, copy some data into it, and do your estimations based on that. This could then also be used for a new cost model that we hook into the query optimizer, and that would be amazing as well.

If you're more interested in how we execute queries, we want to support common table expressions. Right now, again, we support TPC-H, and TPC-H is from the '90s: it's pretty straightforward and doesn't have CTEs.
TPC-DS is a more complex decision-support workload, a more complex OLAP, analytical workload than TPC-H, and it has a bunch of CTEs, but we don't support any of that. So this would be extending the parser to support the WITH and UNION clauses, modifying the optimizer to reason about the WITH clause and potentially rewrite it into joins in some cases, and then (I don't know whether this is true or not) you may also have to modify the execution engine to actually support CTEs, depending on what the queries look like. I'll fully admit I don't know all the ways you can unroll or decorrelate or rewrite CTEs, but there's actually a really good textbook about query optimizers, not from the HyPer guy in Germany but from another German, that we have access to, and it describes all the various techniques for this. So this is something we could do as well.

We also need support for adding and dropping indexes. Right now you can call CREATE INDEX when you create the table, and as you insert tuples into the table we'll populate the index; but if the table already exists with a bunch of data in it and you call CREATE INDEX, we can't go back and populate it. The easiest way to do that is to pause the execution of all transactions, populate the index, and then turn transactions back on; but obviously that's bad, because you're blocking everything. So being able to build the index while you're still updating the tables is super interesting. A bunch of systems are only just adding this; Postgres added it in the last couple of years. It would also be awesome to do this with parallel threads: think of it as doing a sequential scan with multiple threads that are all inserting into the index.

Related to this would be multi-threaded queries. The current version of the system that you're working on only supports single-threaded queries, but my PhD student Prashanth has a branch that supports parallel queries. He only has the execution side of things, though, so the idea would be porting over his parallel query implementation and then modifying the rest of the system's infrastructure to recognize that queries can run in parallel, and to make sure plans come out the right way so that they can be parallelized. Once we get the basic parallel engine working, we could also start adding support for the more data-placement-aware techniques that the HyPer guys are doing with morsels.

Prepared statements are actually also super important, and we currently don't support them. Again, a prepared statement says: I'm going to execute this query over and over again, and here are some placeholders for some parameters. So you need to cache that, and then you can invoke it and reuse the plan over and over again. There are a bunch of design decisions you have to consider when you do prepared statements, like when you actually run it through the optimizer and when you decide to re-plan things; the various database systems all do different things, and this would be sort of an evaluation of the different techniques. You'd have to modify the wire protocol to handle this.
What would actually be super awesome, too, is if, instead of having a prepared-statement cache and then a compiled-query cache, we could unify them together; that would be a big win. My PhD student Matt is actually looking into this now, so if you're interested in pursuing it, let me know and we can start talking this week or next.

Next, checkpoints. We do support the write-ahead log, and I think we can replay the log upon restart; actually, that's not true, I think we can only rebuild the catalog. But as I said, you also want to be able to take checkpoints so that you don't have to replay the entire log, so this would be adding support for taking checkpoints. I think the simplest way to do this would be a consistent checkpoint under snapshot isolation, where you just piggyback on the sequential scan implementation that's available in the system now. And here you can see an example of working in tandem with another team: if another team is doing parallel queries and they support parallel sequential scans, you could then use that in your checkpoint algorithm and do parallel sequential scans to write the checkpoint out. And then we want to be able to load the checkpoint back in after we restart and then replay the log from there; so it's not just "I take a checkpoint and I'm done." Integrating it into the full system is tricky.

We want to support constraints: CHECK, uniqueness, foreign keys. This is modifying the catalog, modifying the front end of the system to be able to handle this information, and modifying the execution engine to enforce these constraints. If I have a CHECK clause that says a value can't be negative, then when I'm inserting something I have to apply that predicate and see whether it evaluates to true; same thing with foreign keys. There are two additional things you could do. One is online constraint changes with ALTER: if I add a constraint that says this column can't be NULL, I have to scan through to make sure nothing is NULL, because trying to apply a constraint that is already violated would be bad. Another cool thing would be to extend the query optimizer to be aware of some of these constraints: if I know that something can't be negative, I may want to change how I do a join or a scan. The high-end systems actually do this, and that would be super cool.

Sequences are the auto-increment keys. This would be adding support for storing sequences in the catalog and supporting the nextval function or the SERIAL type, so that we can easily do auto-increment keys. This one is a bit more tricky, because you want to do caching above the catalog so that everybody calling nextval doesn't always update the catalog; you could have something sitting in memory that hands values out in batches. The tricky thing, though, is that you then have to write out in the write-ahead log how that value got incremented, so that if I crash and come back, the counter doesn't start at zero again.

Different data types would be interesting to do as well. We talked about how the numeric type from the Germans is, they claim, faster than floating-point numbers, and they point to this book called Hacker's Delight, which you can find if you Google it, that describes the underlying methods for doing low-level bit operations on fixed-point decimals. It doesn't describe exactly how to do this in the context of a database (there's a bunch of stuff we'd have to add), but the bit-manipulation techniques can be found in this book.
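To make the fixed-point idea concrete, here is a minimal sketch, purely my own illustration and not HyPer's actual numeric implementation, of a decimal stored as a scaled 64-bit integer. A real implementation needs overflow checks and the rounding tricks from the book; this just shows why the arithmetic stays in cheap integer instructions instead of floating point.

```cpp
#include <cstdint>

// A decimal stored as value * 10^scale in a 64-bit integer, e.g. 123.45 at
// scale 2 is stored as 12345. This sketch supports scales up to 5.
struct Decimal {
  int64_t value;
  uint8_t scale;  // number of digits after the decimal point
};

static const int64_t kPow10[] = {1, 10, 100, 1000, 10000, 100000};

// Addition at equal scale is a single integer add, with no float rounding.
// Assumes a.scale == b.scale; otherwise one side must be rescaled first.
inline Decimal Add(Decimal a, Decimal b) {
  return {a.value + b.value, a.scale};
}

// Multiplication adds the scales, so divide one 10^scale back out to keep
// a's scale. Truncates instead of rounding and ignores overflow for brevity.
inline Decimal Mul(Decimal a, Decimal b) {
  return {(a.value * b.value) / kPow10[b.scale], a.scale};
}
```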
I'm actually super interested in this as well. I don't know how to do it, but we can sit down and learn it together. If you're interested in types and want something easier, the ENUM type would be another one you could do, and this is just like the enums you have in C++ or Java. The idea is that you'd add support in the catalog to keep track of the enum's values as an array, modify the binder to be able to enforce the enum constraint (so if someone gives us a value that doesn't exist for the enum, we throw an error), and then support this in the execution engine so that it knows it's operating on an enum and materializes the correct value when it produces results to the application. I'm going through these quickly, but all the slides are online, and I'm happy to talk about any of this afterwards.

We haven't really talked about views, but think of a view as a virtual table: I can define a view on a SELECT query and give it a name, and it looks like a table; now I can run a query on that view and treat it as if it were a table. Typically the way the system handles this is to rewrite your query to do the lookup on the actual underlying view query. So we'd have to extend the catalog to support these, and modify the binder to transform the view into the original query. As far as I know, I don't think we'd have to modify the execution engine: you can rewrite the view as a nested query and I think it just works, and the optimizer can then decorrelate or optimize as necessary.

All right, the last two. Concurrent schema changes: if I add or drop a column, can I do that without having to lock everything? This is actually super tricky; this is the second year we've tried it. We think we now have the infrastructure to do it in a very interesting way, with a lazy method: you say "add a column," we tell you we added it, but we don't actually go through and shuffle data around to add that memory location. Only when you insert new things, or try to read back old data, do we materialize it on the fly.

The last one is data compression. We support Apache Arrow, and we have support for their dictionary compression scheme, which is very straightforward, but it's not turned on, meaning we can't take hot data and convert it into a compressed, cold data block. We used to have that infrastructure, but we don't anymore, so this would be adding support for doing that conversion. The reason we don't have it is that you also have to update the indexes, because you're moving tuples from one location to another. And then the tricky thing is going to be modifying the execution engine in some way so that it can process the compressed data directly, because right now we can't do that: we'd have to have the data table decompress the block before we can do scans on it, which defeats the whole purpose of compression.

All right, so how do you get started? If your team is already more or less done with project two, you should meet with your team, discuss your potential topics, and possibly look over the source code, or contact me and I'll point you at the parts of the source code you'd want to look at. I'm around all next week, or send me an email; I'm happy to discuss the potential topics you could look at. OK. All right, sorry for going over time.
Next class you have your proposal presentations; I'll post more details on Piazza, but it's five minutes, and that's a hard limit. OK, all right guys, awesome. Thank you, see you.