Let's start lecturing again. So, quick overview of what's coming up. Homework number four is due this Wednesday. The midterm exam will be in class here a week from now, on Wednesday, October 18th. I will discuss what the midterm will look like and hand out a practice exam on Wednesday in the next class. It will cover everything up to and including what we talk about on Wednesday; next week's lectures, starting on Monday, won't be included on the exam at all. And then project number two is due Wednesday, October 25th, after the exam. So if you look at the assignment schedule for the course, there are six homeworks, but four of them come before the midterm. That was done on purpose, because we want the focus to be on the projects after the midterm. So we're front-loading the basic, introductory things you need to understand to get started on the database system, and then we'll focus more on programming in the second half of the semester. Any questions about any of these things? Yes. The question is, are you allowed to bring a cheat sheet for the exam? Yes, you can have one handwritten sheet, and you can put anything you want on it. Any other questions? Single-sided? Eight and a half by eleven. All right, double-sided, eight and a half by eleven. That should be enough. That's fine, yes, that's fine. Okay, so the other thing we'll quickly talk about: there was a major announcement in databases last week. Did anybody see it? Does anybody know what it was? It was on Hacker News, it was on Reddit, it was everywhere on Twitter. Nobody? Postgres came out with version 10. Version 9 has been around for a while, and version 10 is a pretty significant improvement. I'm not going to go through everything that's new in it. The one thing I think is most notable is that they're adding better support for parallel query execution, which is what we will talk about on Monday next week. The reason this has always been an issue in Postgres is that the way the system is architected comes from a time when threads weren't as viable an option for parallel execution as they are now. Postgres was designed in the late 1980s and early 1990s, before we had POSIX threads. So the way they got parallelism in the system was to fork off a process for every single new connection, but within a single process, a query could only be executed with a single thread. What they're adding in the newer versions of Postgres is that different processes can now work on behalf of the same query, rather than having one process be dedicated to the entire query. MySQL still can't do parallel query execution as far as I know, at least in version 5; I don't know what's coming in version 8. But this isn't an issue in any of the major commercial systems, and any new system built now can, for the most part, do parallel query execution. In my opinion, this is the most significant thing in the latest version of Postgres. So last week, when Joy taught, he spent time talking about external merge sort. As he discussed, this is a way to sort a table, relation, or intermediate result as you execute a query without requiring that the entire data set fit in DRAM.
If everything fits in memory, everything fits in your buffer pool, then you just run quicksort, as you learned in the intro class. But if you need to spill to disk, then you want to be a bit smarter about how you do that, and that's what external merge sort does. He then built on that and started talking about different join algorithms. He talked about nested loop join, which is the most basic approach: it's just two for loops nested in each other, and you iterate over the outer relation and the inner relation. And then he talked about sort-merge join, which basically builds off of the external merge sort technique. So in general, there are essentially three major classes of join algorithms: nested loop is one, sort-merge is another, and the one we're going to focus on today is the hashing-based algorithms. The spoiler is that in most cases the hash join is going to be the preferred option. There are some cases where sort-merge is better, and some cases where a nested loop join is actually better, but in general, in practice, systems always try to make the hash join work the fastest, because that's where most of the time is going to be spent. When you look at profiles of where the CPU time or computation time goes for queries, a large majority of it is always going to be in the hash join. And this becomes a bigger issue when we talk about in-memory databases in the advanced class in the spring, because then you're not worried about page faults or swapping things in from disk into your buffer pool; everything's always in memory, so getting your hash join to run as fast as possible is a big win. So in the textbook, and in general in lectures, when people talk about join algorithms, they always describe it at a high level: you join these two tables together, there's some output, and it goes to the next operator. But I think it's actually worth spending some time to talk about what that output actually looks like, because I looked at the textbook, and I've seen other textbooks, and they don't really talk about this. But it is actually a big deal when you implement a join algorithm in a real system. So at a logical level, the output of a join operator is essentially the concatenation of a tuple from the outer relation and a tuple from the inner relation. In abstract terms, you just define it as: you have tuple r and tuple s, and the output is the join of those two together. But in practice, when you actually build this, the contents of the output, what those concatenated tuples look like, can vary greatly based on the implementation of your database system. It can depend on things like the query processing model that we talked about in lecture 10: are you pushing single tuples, are you pushing the entire output in the materialization processing model, or are you pushing a vector of tuples? It can depend on how you actually store your tuples; again, whether it's a row store or a column store. And it can also depend on the query. Maybe your query doesn't need all the columns or all the attributes to be passed up to the next operator, and so maybe you don't want to store a bunch of extra stuff in your hash table when you combine things together. Maybe you only pass along the things you actually need. It's sort of like pushing a projection down inside of the join operator.
So I want to talk about two approaches for doing this. This will come up later when we talk about the hash join algorithm and what we're actually going to put in our hash tables to do our joins. The most obvious or simplest thing to do, and this is how most people think about join algorithms, is that the output is just a straight copy of all the attributes from the outer relation and all the attributes from the inner relation. So let's say we're doing this simple join query on tables A and B. Table A has two columns, id and name. Table B has three columns, id, value, and creation date, or cdate. If we join them together, then we just get a concatenation of the two attributes from the outer relation and the three attributes from the inner relation, and that creates a single tuple. And of course, as you see here, I'm doing an equi-join; I have an equality predicate, a.id equals b.id. I could be smart about it and say, well, I don't really need to store id twice for the outer and inner relation, but I do it anyway because it's less bookkeeping to figure out where things actually are. So this tuple is essentially what the output of the join operator will be, and it then goes up to the projection. The projection operator knows to say: I'll take a.id, jump to that offset, then I'll take cdate, jump to that offset, grab those values out, concatenate them together, and that's the final output of the query. The advantage of this is that the subsequent operators in the query plan never have to go back to the base table to get any additional information to compute the answer they need. The input to the projection operator has everything it needs to do the projection. And the query planner, the thing that takes the SQL and generates the query plan tree, will bake into the operator information about where the offsets are for the attributes it needs. So the projection will know: I expect my input to have five attributes, and I need to jump to the first one and the last one to get the two attributes I need. That's all baked in ahead of time in the query plan, so you're not figuring it out on the fly. So again, the advantage of this approach is that you don't have to go get more information. What's the downside? It should be obvious, right? Correct, it takes more space in memory. In this case here, I'm passing along the name and the value even though I don't need those attributes ever again in the query plan. The projection will strip them out for me, but it would be nice not to have to copy them for every single tuple that matches. We won't cover this in this lecture, we'll talk more about it next class, but one of the things the database system can do is estimate what the selectivity will be for your join operators. In this case it was an equi-join and a really simple example, so there were two matches. But imagine you're joining really large tables: maybe 100% of the tuples of the two tables will match, or maybe less than 1% match. And so based on what the selectivity is, how many tuples you expect to come out of your join operator, you may want to choose one approach versus another.
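To make the contrast concrete before the record-ID alternative is described next, here is a minimal Python sketch; the table layouts and helper names are made up for illustration and are not how any particular system lays this out.

```python
# Minimal sketch of the two ways a join can produce output (illustrative only).
# Early materialization: copy all needed attributes into the output tuple.
# Late materialization: emit only join keys plus record IDs; fetch the rest later.

# Hypothetical base tables: record_id -> tuple of attribute values.
A = {101: (1, "alice"), 102: (2, "bob")}                      # (id, name)
B = {201: (1, 99, "2017-10-01"), 202: (2, 7, "2017-10-02")}   # (id, value, cdate)

def early_materialized_join():
    # Output carries every attribute from both sides; no later lookups needed.
    return [a + b for a in A.values() for b in B.values() if a[0] == b[0]]

def late_materialized_join():
    # Output carries only the join key and the two record IDs.
    return [(a[0], rid_a, rid_b)
            for rid_a, a in A.items()
            for rid_b, b in B.items()
            if a[0] == b[0]]

def project_id_and_cdate(late_output):
    # The projection goes back to the base table only for what it needs (cdate).
    return [(jid, B[rid_b][2]) for jid, rid_a, rid_b in late_output]

print(early_materialized_join())
print(project_id_and_cdate(late_materialized_join()))
```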
Or you may want to push a projection down much earlier in the query plan so you strip these things out ahead of time. So the alternative is to just pass around record IDs. Say again we have our two tables from before and we do our join. Now the only things we're going to include are the join attributes from the two relations, plus the record IDs. The way to think about this is that as we go up the query plan, we're passing along the record IDs of the tuples we're looking at. And in this case, because we're doing the join on the id value, we have to go get the values for that column and put them into the form we need to actually do the join. So in this case, the output of the join has just these four attributes, and two of them are the internal record IDs that the system uses to say: here's the page and the offset to go find the thing that I need. So what's the problem with this approach? Correct, yeah, I need cdate from the inner relation to compute my projection, so then I have to go back to the base table and actually go get it. Now this is where the selectivity stuff comes up again, because if my selectivity is really low, meaning I'm only going to generate two tuples out of a billion that match my join, then going to get that cdate is not a big deal; I'm only going to have to fetch two things. The advantage is that passing around record IDs is really great for a column store, because each operator only goes to get the thing it actually needs, when it needs it. This is sometimes called late materialization. Materialization here means taking the record ID and generating the exact tuple with all its contents. So with late materialization, I can wait until the very end, in the projection operator, to go fetch the creation date, and then it essentially puts the tuple back together. In a column store this is great, because if there are a bunch of columns I don't need to compute my query, I never have to go get them: I don't waste disk I/O fetching them, and I don't waste memory putting them into my buffer pool, even though I'm never actually going to need them. So this is what I was saying before: the ideal approach, whether you pass around record IDs or actually make copies of the data, depends on all these different factors. If it's a column store, then using record IDs can be a big win. If it's a row store, it doesn't really help, because you're going to fetch those attributes anyway and bring them into the buffer pool, since everything is stored contiguously. In this case the query only needed the creation date, but if you needed all the attributes, then passing the data along from operator to operator is probably the better approach versus passing record IDs. So again, this will come up in a few more slides when we talk about what we're actually going to put in our hash table to do our joins. We'll talk about what data we need to store in our hash table to be able to compute the join, and then, based on what you had in that hash table, that determines what data you need to retrieve or what data you can throw away as you pass the result of the join up to the next operator. Okay? So today's agenda: we're going to talk about basically two categories of things.
And this is the last lecture we'll need to define how we actually do query execution. We started off talking about how to run the access methods that read the data in the underlying tables and get the tuples out. Then Joy talked about how to do sorting and the basic join algorithms. So the last two major things we need to cover are the hash join algorithms, because as I said they're the most important join algorithm, and then how to use hashing again to do aggregations. And this is basically all you really need to implement basic SQL in a database system. Other things like LIMIT and HAVING are all fairly trivial to do in some ways; everything we're doing here is the complex part, and then we can use these as building blocks for more complicated queries. And I'll show some examples at the end in a demo with Postgres. Okay, so for hash join, the intuition for how it works is this: you have a tuple from your outer relation R and a tuple from your inner relation S, and if they're going to satisfy the join condition, does something equal something, then the values on those join attributes have to be the same. For now we can ignore anti-joins, since those are the negation of that. For this, it's "does something equal something": in order to match the join predicate, the two tuples have to have the same values. Now, because they have the same values for the join attributes, if we hash that value from the outer relation's tuple and hash that same value from the inner relation's tuple, they have to produce the same hash value; they have to hash to the same key. So the idea is that we hash the join attributes on the outer relation and the join attributes on the inner relation, and matching tuples should end up hashing to the same location in our hash table. That means we only need to do a comparison on the join predicates for the tuples that hash to the same bucket, the same location. That's the basic idea. You're using hashing as a way to split up the work so that you're not comparing against something that can never possibly match your key; you're only looking at things that could match. So hash join has two phases. The first phase is called the build phase: you do a scan on the outer relation, and for every single tuple you use some hash function to hash its join attributes and populate a hash table. For our purposes in this class, we don't care what the hash table implementation is. We can use any of the things we talked about before: open hashing, chained hashing, all the dynamic hash tables, cuckoo hashing. It doesn't matter which hash table we use; the basic algorithm works the same. I can maybe talk a little about this at the end, but there are some trade-offs for which implementation might be better than another. Typically in hash joins you may want to use a fixed-size hash table, because it's much faster to do the lookup. But of course, that's a problem if you underestimate the size you need: then maybe you have to rebuild things, or you get long chains that are slow. But for our purposes here, we don't care.
All right, so once you've done the build phase, you have a hash table of all the tuples in your outer relation. Then in the probe phase, you do a scan on the inner relation from beginning to end, and for each tuple you hash it using the same hash function. That lands you in some bucket in the hash table, and then you just do a scan to see whether you have a match with a tuple from the other side. That's the basic idea. Visually it looks like this: in the first phase, the build phase, we pick a hash function and populate our hash table from the outer relation. Then in the probe phase, we look at every single tuple from the inner relation, use that same hash function, jump to some location in the hash table, and do a comparison against the keys contained in that bucket. If we have a match, then we know it satisfies our join predicate, and we go ahead and emit it to our output. So any questions about this? Yes. The question is whether hashing is only done on the join attributes. So what else would you want to hash on? By hashing on the join attribute, the system is using it as the direction for where to jump in the hash table. If I start hashing on stuff that's not part of the join key, and I now come to the other side in the probe phase and do the same thing, well, first of all, the other side might not even have those attributes, so it can't hash them, and they're not part of the join predicate anyway, so why would you want to hash on them? Again, we said this before about hash functions: for the exact same value, the hash function always produces the same hash key. So if you have the exact same value on the right side and the left side, they point to the same location, and then you just have to figure out whether you find an exact match in your bucket. Yes? Right, so his statement is that this doesn't work for all cases. First, if you're doing a join on a range predicate, then you're absolutely right: hashing doesn't help you, because this only works for an equality predicate. And your second statement was, what if I'm doing a join on an attribute that has really low cardinality, like sex: for simplicity's sake, say it's just male and female, so you have two possible values. Then you're right, you're going to have two buckets with really long chains, and you're essentially doing a nested loop join inside the bucket. Yes, and that's maybe the point I'm missing here: when you hash on the probe side and land in a bucket, and we'll talk about what these buckets look like, you're essentially doing a nested loop join, or a sequential scan inside the bucket, comparing against all the tuples from the outer relation that are in that bucket to see whether you have a match. So if the cardinality on the outer table is really bad, meaning you only have two values, then you're essentially doing a worst-case nested loop join, which is always the worst join to do. Okay, so the key for the hash table, as I said, is always the attributes you're trying to join on, and for simplicity, assume we're always doing an equality join. But the value that goes inside the hash table will vary by implementation, and it varies based on the things we talked about before, where the other operators make some assumption, or have to know, about what their input is going to be.
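Putting the two phases together, here is a minimal in-memory sketch in Python. The names and tuple layouts are illustrative, and it stores full tuples as the values, which is one of the two layout choices discussed next.

```python
from collections import defaultdict

def hash_join(outer, inner, outer_key, inner_key):
    """Minimal in-memory hash join sketch: build on the outer side, probe with the inner."""
    # Build phase: hash every outer tuple on its join attribute.
    table = defaultdict(list)
    for r in outer:
        table[hash(outer_key(r))].append(r)

    # Probe phase: hash each inner tuple and compare keys only within that bucket.
    output = []
    for s in inner:
        for r in table.get(hash(inner_key(s)), []):
            if outer_key(r) == inner_key(s):   # re-check: distinct keys can still collide
                output.append(r + s)           # early materialization of the result
    return output

# Tiny illustrative example: A(id, name) joined with B(id, value, cdate) on id.
A = [(1, "alice"), (2, "bob")]
B = [(1, 99, "2017-10-01"), (3, 7, "2017-10-02")]
print(hash_join(A, B, outer_key=lambda t: t[0], inner_key=lambda t: t[0]))
```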
So in order to produce the output in the form that other operators in the query plan need, we may have to store some additional information inside the value portion of our hash table. There are two approaches. The first is that the value contains the entire tuple, or the subset of the columns that are needed, for the tuple from the outer relation. You always have to store the join attributes in the value, because hashing can produce collisions: two keys that are distinct may end up hashing to the same bucket. So when you land inside the bucket and do that scan to see whether you have a match, you need a way to compare the actual keys. So you always have to have the keys. Now the question is whether you also want to store the additional attributes you may need further up in the query plan. And again, it depends on what the operator above in the query plan expects to have. And of course, the downside is that you're essentially making a second copy of the tuple and storing it in memory, because you have to store it in your hash table. You can be smart about certain things, maybe push down a projection so that you only store attributes in the hash table that you may need later on in the query plan, but the simplest thing to do is just make a complete copy. And then the alternative is to use the tuple identifier, which again is ideal for column stores, because you don't fetch any additional data until you need it. Now the tricky thing about this is that there are some cases where the query planner could say: I'm doing a join on two attributes, but because of some selectivity, or how I'm storing the data, maybe I only want to build the hash table on the first attribute, and then do a full comparison of the attributes when I do the sequential scan inside the bucket. So in that case, you will have to copy some extra things along with your tuple identifier. But the main thing to point out is again all the things I talked about before: what the operator above you in the query plan expects will determine which one of these you want to use, and there are trade-offs for each. The first one uses more memory; the second one requires more disk reads if you have to go get additional information. So now, the join algorithm that I just showed you assumes your hash table fits entirely in main memory, so obviously it's kind of easy. In the same way that in the sorting algorithms from last class, if everything's in memory, you just use quicksort, and if it's not, you have to spill to disk, now we have to deal with the situation where our hash table doesn't fit in main memory, and how we want to compute the join in a way that's really efficient. So the dumbest thing to do is just let the buffer pool manager inside our system swap out hash table pages whenever it wants. So we do a sequential scan on the outer relation and populate our hash table. Now we start doing a sequential scan on the inner relation, but now the buffer pool manager has swapped out a bunch of pages, and as we go along, we do a lookup. Why is this a bad idea, or why is this the worst-case scenario? What is it about the access patterns of hashing that is detrimental here? His statement is that all the buckets are in the same page.
So all the values in a bucket are together; typically the bucket is the same size as a page. Yes. So locality is the big issue here. Hashing takes a value and can jump randomly anywhere inside your hash table. Now, if your tables are very skewed, which is often the case in real workloads, it may be that all the attributes you're trying to join on are the same, so they always hash to the same bucket, but in practice that's usually not the case. So if the buffer pool manager starts swapping out the pages of our hash table willy-nilly, then in the worst-case scenario, every single tuple we look at in the inner relation as we scan through is going to be a cache miss, and we'll always have to go out to disk to get the page. So what we want to do is be more careful about this and try to maximize the reuse of pages once we've fetched them into memory, when we do our hash join. The technique, or the algorithm, to do this is called the grace hash join. Probably no one here has heard of the GRACE database system. It was a database machine developed in Japan in the 1980s, and the grace hash join comes from their paper proposing how to do hash joins when your tables don't fit in memory; this is the technique they used. And this picture, from the University of Tokyo website, shows what the actual GRACE database machine looked like. Who here has ever heard of the term database machine? No? It doesn't surprise me, because this was a big term in the 1980s; people don't use it now. Who here has heard of the term database appliance? No one, okay. So a database appliance is, say you're a database system vendor: you can sell people machines, rack units, that are tuned and tailored to run your particular database system. So instead of buying a machine from Dell or HP or whatever, loading your own OS, loading the database system, and setting it all up yourself, you can buy a machine that comes pre-configured and is set up to run that particular database system. The most famous database appliance vendor is Oracle; Oracle has a bunch of database appliances. The most expensive one is this thing over here, Exadata. I don't know exact numbers, but I think it starts in the millions to buy one of these things; they're super expensive. Clustrix was a NewSQL startup out of San Francisco, and for a while they sold a database appliance. And then the one in the back is IBM Netezza. Again, you buy this whole rack that has fairly standard Xeon processors, but it's set up and tuned in a way that's ideal for the database system. Now, we won't talk about hardware much in this class; we'll talk about it later on. But except for the major vendors, it's super hard to break into the appliance business. There are a lot of people with a lot of money that run Oracle, so Exadata sells really well for them. The Clustrix guys eventually decided they weren't making any money in the appliance business, so they got rid of it, and now they just sell software. The way to think about this is that if you're building a new database system, from a business side it's always best to target commodity hardware.
Meaning, if you can't run your database system on Amazon EC2, then people are not likely to adopt it, because they don't want to get locked in to some appliance you're going to sell them, where the company goes bankrupt and you're stuck with unsupported hardware. Oracle can do this because it's Oracle, and IBM can do it because these are big, big companies. Now, a database machine is actually a system that has specialized hardware to run a database system. The appliance is off-the-shelf hardware that's just set up to run a database system; a database machine is where they add proprietary stuff to make the database system run faster. Again, that's even harder to sell, because you're selling exotic hardware that people aren't going to trust or won't want to support. And, maybe it's not so much the case now, but in the 1980s, part of the reason database machines never really took off was that by the time you got your specialized hardware to market, Moore's Law and all the other advancements in commodity hardware had caught up to the point where you no longer had a big advantage. So people don't really build database machines anymore, although you are seeing some new stuff starting to come out in the last one or two years. There might be a resurgence of this, because the Xeons are sort of plateauing in the benefits you can get out of them. But anyway, when we say grace hash join, just so you know, it comes from this very famous project from the 1980s; it's not named after a person. Okay, so this particular hash join algorithm we're going to talk about allows us to do a join when we don't have enough memory to store our hash table, or both tables, at the same time. The basic way to think about it is that we're going to build a hash table on both sides, on both the outer relation and the inner relation, and then we're going to do our comparisons only between buckets that are at the same level. So the first step is to build our hash table for R, the outer relation, and then we do the same thing, using the same hash function, for the inner relation. Now the way to think about this is that at every single level in our hash tables, we know that any values from the first hash table and the second hash table that could match have to have hashed to the same key. So when we want to do our comparison, the only thing we need to do is examine the buckets at each level by themselves. For bucket level zero, on either side, I never need to look at any of the buckets at the other levels, because I know those hashed to something different. And the reason we want to do this is that, say we only have two pages in our database system; being able to bring in just two buckets at a time is enough for us to do our join. So all of these buckets can be swapped out to disk, and then when I scan through the different levels and do the comparison between the two corresponding buckets, I just go fetch the first page from the outer relation and the first page from the inner relation, bring them into memory, and do a basic nested loop join between them.
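Here is a minimal sketch of that partition-then-join idea; the partition count and names are illustrative, writes to disk are only noted in comments, and the recursive partitioning for overflowing buckets discussed next is left out.

```python
def grace_hash_join(outer, inner, key, num_partitions=8):
    """Sketch of grace hash join: partition both inputs with the same hash
    function, then join corresponding partitions with a small nested loop."""
    def partition(rel):
        parts = [[] for _ in range(num_partitions)]
        for t in rel:
            parts[hash(key(t)) % num_partitions].append(t)  # would be spilled to disk
        return parts

    outer_parts, inner_parts = partition(outer), partition(inner)

    output = []
    # Only corresponding partitions can contain matches, so we bring in one
    # pair of partitions at a time and do a cheap in-memory nested loop join.
    for r_part, s_part in zip(outer_parts, inner_parts):
        for r in r_part:
            for s in s_part:
                if key(r) == key(s):
                    output.append(r + s)
    return output

A = [(1, "alice"), (2, "bob")]
B = [(1, 99, "2017-10-01"), (2, 7, "2017-10-02")]
print(grace_hash_join(A, B, key=lambda t: t[0]))
```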
And we said nested loop joins are bad, but for this it's fine, because the buckets aren't going to be really big, since they have to fit within a page, and everything's in memory, so we're not worried about going back to disk over and over again. So typically, I think most database systems will implement the join this way anyway, even if you don't have a limited amount of memory, and particularly the in-memory systems do something very similar to this, where everyone basically tries to do a hash join. So what's one problem with this? What's one thing we have to overcome? The statement is that you're assuming double the memory. No, because in this case all the other levels can be swapped out; I only need to have those two pages. We do need more pages in terms of the number of levels. So I think what he's saying is there could potentially be wasted storage, because every bucket won't be full, and so when you go fetch something from disk, you're partly fetching wasted space. Right, fetching empty space. So your statement is, when they find a match, they skip a page? Yeah. All right, so his statement is: if I have 10 tuples and they could all fit in a single page when they were stored in the table heap, but when I do my hash join they get split across 10 pages, now I'm fetching 10 pages for what I could have done with a single page. Again, the database system will know the size of the tables, it will know the selectivity, and it will say, oh, I only have 10 tuples, let me just do a nested loop join because that's good enough; we can try this in Postgres. The next issue is actually related to running out of memory, and it's also related to what you're saying. In all these examples we're assuming a uniform distribution, so we assume every page will have roughly the same number of tuples. But in real-world databases you see skewed distributions all the time. So it may be the case that this first bucket at level zero has way too many tuples, and then we have all these overflow chains, and now it becomes problematic: since I'm doing a nested for loop, I hit all the same terrible things we talked about in last class, where you have no locality as you're doing a scan, and even if you're doing the block nested loop join, you hit this problem. So if everything hashes to the first level, then you're basically back where we were before, with a nested loop join and terrible performance. So what we can do, if our buckets don't fit in memory at a particular level, is what's called recursive partitioning, where we keep splitting the oversized buckets into further mini hash tables, or internal hash tables, and then we know how to go back the other direction when we want to do a probe and jump to the right bucket that we need. This is probably easier to understand through an example. All right, so let's say we're in the build phase of our hash join. We run the first hash function and we get a bunch of these buckets, and what we see is that one level has a bunch of extra pages, because everything is hashing to that same location. To do recursive partitioning, what you do is pick a second hash function that's completely different from the first hash function, and typically you do this by adding a salt value to the input of the hash.
And then I split what used to be that one bucket into three different buckets. And I can do this as many times as I want: if everything keeps hashing to the same bucket, or my buckets are too full, I can keep recursively splitting further and further. Yes? So his statement is that this only helps you if the keys are unique. If they're not unique, then regardless of what hash function you use, they're always going to hash to the same value. That is correct, yes. No, but the salt thing: the salt is an input to the hash function, so if you hash the same key multiple times using the same salt value, it's always going to hash to the same result. Yeah, you can't go the other direction. Right, so again, this works for unique keys; it doesn't work for non-unique keys. Okay, so now the way to think about it on the build side: our hash table is these five buckets. The inner ones were generated with the second hash function, and the first and the last were generated with the first hash function. So now when I want to do my probe, I always run the first hash function, and I keep track of which ranges of levels need the second hash function applied. So if I hash to level zero, that's fine; if I hash to the last level, that's fine. But if I hash to level one, I know that this bucket was split on the build side, so I want to run the second hash function, and that tells me exactly where I really need to go. Yes? The statement is that you have to maintain some additional internal data structure, some metadata, to know about this. That's correct, yes, but that's trivial: for every level, here's the bucket, here's the hash function I should use, or how deep am I in my recursion. That's nothing. Okay, so now we can talk about the cost of running a hash join. So assume we have enough buffers. The cost of doing the grace hash join is always 3(M + N). And why is it 3(M + N)? In the first phase, the partitioning or build phase, you always have to read every page of both tables: the outer relation is M pages, the inner relation is N pages. Then you have to build the hash tables, and assuming we write the entire values back out into the hash table, and assuming pages are uniformly distributed and nothing is highly skewed, you have to write them all back out, so that's another M + N. Then in the probe phase, we read back in every page of the outer relation and the inner relation, one by one, and do the nested loop join between the two at each level. So Joy was showing a running example with two tables last class to do real-time cost calculations. Say the outer relation has M pages, which is 1,000, and the inner relation has N pages, which is 500. Then we can compute the hash join for these two tables in 4,500 I/Os, which, if you're using an SSD that takes a tenth of a millisecond per I/O operation, means we can compute this in 45 seconds. And I think when we looked at the other algorithms, in the worst case the nested loop join was 500 seconds. So again, in general, the hash join is always going to be the fastest approach, because you only have to read and write the tables three times. Okay, so I talked about this a little bit earlier, but, wait, I think that math's wrong.
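Spelling out that arithmetic with the numbers from the example (the wall-clock figure gets corrected in a moment):

```python
M, N = 1000, 500            # pages in the outer and inner relations
io_cost = 3 * (M + N)       # partition-phase reads and writes, plus probe-phase reads
ms_per_io = 0.1             # assumed SSD latency: a tenth of a millisecond per I/O
print(io_cost)                       # 4500 I/Os
print(io_cost * ms_per_io / 1000)    # 0.45 seconds, not 45
```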
That should be 0.45 seconds, sorry. We'll fix that later. Okay, as I said before, the database system is going to try to estimate the number of tuples it's going to have in the outer relation and the inner relation, as well as the number of tuples it expects to emit based on the selectivity of the join predicate. And based on this estimation, it will pick a hash table size that has enough space. We'll see next class why this is problematic; this is actually one of the notorious problems in a query optimizer, trying to get these selectivities correct. It's really easy when you're doing joins on primary keys or foreign keys, when you know exactly the cardinality and you're doing an equality predicate. But if you deviate from that, it gets tricky, and if you join after another join, things get really messed up. So if you know the size, and you think your prediction is going to be pretty accurate, then you just use one of the fixed-size hash tables we talked about before, and this is ideal because there's less computational overhead to do the build and the probe. As you saw in the extendible hash table project, you always have to check what number of bits you're using for the hash at the global level and the local level, and that takes time. And if you don't know the exact size of your hash table, you want to either use one of the dynamic hashing schemes we talked about before, or you just allow for overflow pages or overflow chains. Again, it's up to the database system to decide which of these approaches it wants to use. Typically though, at least in the open source systems, they often try to use the fixed-size ones. I think Postgres recently added support for dynamic resizing. All right, just to summarize, and this number is correct: when we said M equals 1,000 and N equals 500, here are all the join algorithms, and as you can see, the hash join is always going to be the fastest one. Typically, this is why most database systems will choose to use it. So can anyone think of an example where the sort-merge join might actually be better? If the output of the query has an ORDER BY clause, and it's on the same thing you're trying to join on, then you get the sorting for free, so you'd just do that. And we can look at some examples of this at the end in Postgres. All right, so that's basically it. As I said, these are basically the three classes of join algorithms that are out there, and everything, for the most part, is a variation of these. So now we can talk about how we actually want to compute aggregates. So recall that an aggregation function takes one or more tuples and collapses them down, computing some function of them that produces a single scalar value: AVG, MIN, MAX, COUNT, SUM, the basic SQL-92 ones. So in the same way that we had this choice between sorting and hashing for joins, we have the same choice when we want to compute aggregations. So let's start with the simplest example and do a DISTINCT query. DISTINCT is basically collapsing down multiple duplicate values and producing a single result. So if you want to do this using a sort, then the basic query plan is: first we do the filter based on our WHERE clause.
Then we do our projection to remove any columns we don't want, and then we end up doing our sort. Now, to do the DISTINCT, we know that in the sorted order, going from top to bottom, any values that are the same will be contiguous with each other in the list. So all we have to do is iterate, or scan, through our final output and check whether the current value of the tuple I'm looking at is the same as the last one. If it is, then I know it's not distinct, so I can go ahead and eliminate it, and this is how I compute my answer. So this is really simple to do: we don't have to build any extra hash tables, and the sorting can just be the same external merge sort algorithm that we talked about last class. Now the question is, when do you want to use hashing versus when do you want to use sorting? Typically, any time you have a GROUP BY clause, you want to use a hash table, because it's really simple to say: here's the attribute, or set of attributes, I want to group by, hash those, and that jumps me to a location in my hash table, and then I can start calculating the various aggregations that I need. So the hashing approach to aggregations is typically going to be the better approach, and even if it doesn't fit in main memory, everything works roughly the same way. So I'll walk through an example now of how you do aggregation with hashing. And again, as we said, it's basically using the hash table as a way to quickly identify when you have values that are in the same group, or values that are the same. If it's a DISTINCT, you throw the duplicate away; if it's a GROUP BY, then you compute some running calculation. So we'll walk through a simple example, but just like the hash join, there are two phases. The first is partitioning, and this is essentially where you're splitting up the tuples based on some hash function, and you know how to spill them out to disk if need be. Then you can fetch the partitions back in one by one once they're all set up, and compute the aggregation you want. For this, we'll assume we have B buffers, or B pages, in our buffer pool that we can use to write out data. So we go back to the DISTINCT query that we showed before: we do our basic filter, we remove any columns, and then we run some hash function and write out these keys to the different partitions. In this case we're doing a DISTINCT, so we could easily just check whether we already have a match, but if you're doing some calculation, you basically want to fill these partitions with all the data they need. And then in the second phase, you go back and read the partitions in, and do an additional hash on them into another hash table, and then compute the aggregation you want. So you're essentially doing two passes over the data, and you're computing two separate hash tables, but the idea is that in order to compute the aggregation, you need to know all the data for that particular group. So you put it all into the same bucket, and then you can compute the thing you actually want. So going into the second phase: here are all the buckets we have from the first pass, and then we run a second hash function that populates the hash table. In this case here, there should be another 15-445 here.
But you'd see 15-445 twice, so we see that it already exists in the hash table when we populate it, so we just ignore it. And then the final output of your query is just all the value portions of the hash table. So how big a table can you handle with this? If you have B minus 1 spill partitions, then your table can be at most B times (B minus 1) pages. In the first pass you go through and have N blocks, and the square root of the number of blocks you have roughly determines the number of buffers you're going to need, assuming you have a uniform distribution. If you don't have a uniform distribution, typically what you do is have a fudge factor, where you maybe pre-allocate some extra blocks ahead of time, so that you don't end up with too large an overflow chain and you don't have to write out to disk more than you actually need. I'm going through this very quickly, but it's pretty straightforward, and the focus is really just on the idea of using hashing in another scenario, to compute aggregations. So again, if it doesn't fit in memory, you just use recursive partitioning, and it works exactly the same as we saw in the join. What I really want to focus on, again, is this idea of sorting versus hashing: what are the trade-offs? In the hashing case, if you have N blocks, then you can do this with roughly the square root of N buffers, and in the case of sorting it's roughly the same; it's always going to be about the square root of N. So hashing and sorting are roughly the same in terms of what they can handle, but in practice, hashing is almost always going to be better. Okay, so to finish up real quickly, because I want to get to the demo: as we said before, the choice of whether you use hashing versus sorting, for either joins or for aggregations, depends on a lot of different things. It depends on what the query looks like and it depends on what the data looks like. We already talked last class about optimizations for sorting, and all of them are still applicable here, because the sorting for the aggregation is just doing an external merge sort, and you know how to scan through and compute the aggregation as you go along. In the case of hashing, we can actually use the hash table to calculate our aggregations as we go along, rather than having to build the second hash table and then do another scan over it to compute the aggregation. This is sometimes called inline or online summarization, or hashing summarization. Again, the basic idea is that as we populate the second hash table in the rehash phase, we compute our aggregation at the same time. So it basically looks like this: the value portion of the hash table is a pair where we store the GROUP BY key and our running total. So any time we want to insert a new value into our hash table, we check whether we have a GROUP BY key match, and if so we update our running total as needed. If we don't have a match, then we just insert a new entry. And again, I think this is best shown through a visualization. All right, say we're computing an average here on our table. We've already gone through the first phase and computed our buckets, and now in our hash table we have our running totals. So now in the value portion, you see that we have our GROUP BY key, which in this case is the course ID.
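Here is a minimal sketch of what that looks like in code; the names are illustrative, and it keeps a (count, sum) pair per group, which is exactly the extra metadata described next.

```python
def hash_aggregate_avg(tuples, group_key, value):
    """Sketch of online hash aggregation for AVG: one pass, keeping a running
    (count, sum) per GROUP BY key instead of buffering all the values."""
    running = {}  # group key -> [count, sum]
    for t in tuples:
        k = group_key(t)
        entry = running.setdefault(k, [0, 0])
        entry[0] += 1          # COUNT would stop here; MIN/MAX/SUM are similar one-liners
        entry[1] += value(t)
    # Finalize: AVG is the only one that needs this extra division step.
    return {k: s / c for k, (c, s) in running.items()}

# Illustrative enrollment-style rows: (course_id, grade).
rows = [("15-445", 90), ("15-445", 80), ("15-721", 100)]
print(hash_aggregate_avg(rows, group_key=lambda t: t[0], value=lambda t: t[1]))
```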
And then we have this inner pair that has some extra metadata about the aggregation we're trying to compute. So in this case, if you're computing the average, then you need to maintain the count of the number of tuples you've seen, plus a running total of the sum, because once you've scanned everything, you just divide the running sum by the count, and that computes the average you want. For all the other ones it's really trivial: for COUNT you just add one every time you have a match; for MIN and MAX, you check whether the value you're trying to insert is less than or greater than the lowest or highest value you've seen before, and you update it; and SUM is just doing an addition as you go along. It's only really AVG where you have to maintain some extra metadata. This will come up later on when we talk about distributed queries and parallel execution, because when you want to compute aggregations across multiple machines, in the case where you're computing an average, you want to send that same metadata from one node to the next, because then you know how to combine them all together to compute your final aggregation. It's the same idea here: we have some pre-computation that we need for the average, and at the end we just do our simple division and produce the final answer. Again, the main idea is that we're trying to use our hash table to maintain the data we need to compute our aggregation, without having to do a second pass over the values at the end. Yes, correct. So I'm showing the SQL-92 aggregates because they work nicely with this; with medians and other things like standard deviation, you couldn't do this, correct. Yes, okay. So the main thing to remember from this discussion at the end is that hashing is almost always going to be better than sorting, for either joins or aggregations, and most systems will always use hashing. Sorting is typically better for non-uniform data, and sorting is also really good when the output needs to be sorted in the same way anyway. And as we'll see next class, it's the optimizer's job to make the decision about which of these types of algorithms it wants to use to execute your query, and it does this by relying on the internal metadata it collects about your database and what your query is actually trying to do. So we'll see this next class, but the database system maintains internal statistics about the distribution of your data, and it uses those to help it decide whether it wants to use one versus the other. Okay, so in the remaining time, I want to pop open Postgres. For this example, I'm just going to show you, in actual real-time performance, the difference between the sort-merge join and the hash join. You can't see it, sorry. I think I just need to dump out of PowerPoint, or end the slide show. See that? Good, okay. So I have three tables here. This is data collected from the San Francisco bike share program; they have those city bikes where you can go pick up a bike anywhere in the city and ride it around. So this is data collected about when someone took a bike from one station and then rode it to another station. You have the station table, and then you have trips of where they went: it says the start time, from what station to what station, who took it, and what bike it was.
So, timing. Here's a really simple query where we do a join between the trip table and the station table where they have the same station ID, and we want to find all the trips where the bike ID was less than 200. So if I run EXPLAIN ANALYZE: EXPLAIN by itself dumps out the query plan without actually running it; if you add ANALYZE, it actually runs the query and computes the answer, but it also shows you the query plan and what actually happened when it ran. So you see here that it tells you it wants to do a hash join, and it tells you what the condition was that it did the join on. And it tells you that internally, it picked the station table as the outer relation, and it did a sequential scan on it, because it has to populate the hash table: it looks at every single tuple in station and populates the hash table. Then it did a sequential scan on the trip table to do the probe into the hash table. And it tells you here that it executed the query in 128 milliseconds. Now in Postgres, you can actually tell it not to use certain join algorithms. In this case I can say SET enable_hashjoin = false, and this tells the Postgres planner: when the query comes in, don't try to do a hash join at all, pick other join algorithms. Postgres supports nested loop joins, sort-merge joins, and hash joins. So now if I go back to my same query and run EXPLAIN ANALYZE, I see that it did a sort-merge join. For the station table it used quicksort, and for the trip table it used external merge sort, and it tells you how much data it had to write out to disk. And then it tells you it did the merge above that. So I think the hash join was taking 128 milliseconds, and the sort-merge here took 200 milliseconds. It's still a sub-second query, and the tables are actually not that big: the trip table has only about 60,000 tuples, and the station table has, it looks like, 70. So it's not that big, but the execution time was almost twice what it was with the hash join. So now we can do the same thing again and tell Postgres to also disable the merge join, the sort-merge join. So what should the join be here? Nested loop, right? So let's see. We're trying to do a join on trip.station_id and station.station_id, so let's look at what indexes these tables actually have. Trip has an index on the end station and the start station ID. And for station, actually, for this join, the trip table does not have an index on the station ID. For a foreign key, you have an index on the parent table, because when you insert a new record in the child table, you have to check whether that value exists in the parent table, so you need the index on the parent table. So in this case, trip has a foreign key relationship that references the station table, so the station table has to have an index on the station ID, and in this case the station ID is the primary key, so it tells you it has a B-tree index. So what should the join look like here? We can't do a merge join, we can't do a hash join.
What should be the inner relation, and what should be the outer relation? My guess is that the outer relation should be the trip table, because you're going to sequentially scan every single tuple of it, but then you can use the index on the station ID for the inner relation, in that inner for loop, to do the lookup. I was wrong, okay. No, no, that's right, sorry. It did a nested loop: here's the sequential scan on the trip table, so that's the outer relation, and then the inner loop is doing the index scan, the index probe into the station table. And look at this, it says the execution time was actually less than the sort-merge join, because that index probe is really fast. Now, typically in OLTP databases, the index nested loop join is actually usually going to be the best choice, because you're not scanning the entire table and you're not trying to compute a really large join; you're only getting a single tuple per tuple of the outer relation. We know in this case there's only going to be one station per trip, the station ID it's pointing to, so it's only going to grab one tuple. So it doesn't make sense to build another hash table, or to do all the sorting and all this extra work, when we know we're only going to find a single tuple at a time. So now, in Postgres you can also disable what kind of scan you want it to do. So we can say SET enable_indexscan = false, and now go back to the same query. And now this is the worst of the worst, because this is the sequential scan version: we're doing a sequential scan on trip, and then we're doing a sequential scan on station. And lo and behold, our execution time is over half a second. We went from the hash join case of 128 milliseconds for the query to the worst-case scenario of a bit over 600 milliseconds for the plain nested loop scan, because I prevented it from being able to use the index lookup to do the join. So again, in practice, the hash join is always going to be the best approach, and in some scenarios the nested loop join, or the index nested loop join, might be better. So maybe we do this: let's say we want to do a lookup on a single trip. So let's just pick a random trip ID, which I think is just trip.id. So, SELECT id FROM trip, ORDER BY... this works in MySQL, it might not work in Postgres... random, LIMIT 1. Nope, it did not like that. Nope, let's just do ORDER BY bike_id. Sorry, let me just order by this. There we go. So basically, what this is doing is assigning a random value to every single tuple and then sorting based on that. It's a hacky way to jump to a random location, to a random tuple. Don't do this, because it's actually really slow. All right, so let's go back to our join, and now instead of saying bike ID less than 200, we'll say trip.id equals that value. Actually, let's kill this: let's go back and re-enable the index scan and everything I turned off before. So we set this to true, and this to true, sorry, thank you. Actually, this makes it easier, done. So here again, now we're doing our lookup on the trip ID, and we know we have an index for that, and then for the station, to do that join operation, we have an index on that. So what join algorithm do we think it's going to use here?
Raise your hand if you say hash join. Raise your hand if you say sort-merge join. Raise your hand if you say index nested loop join. So why would it pick the index nested loop join? Yeah, because of this predicate, I know I'm going to grab a single tuple. Oh, it still did a hash join. So look at that: here we do our index lookup on trip, we get back a single tuple, and then it still built a hash table, because it did a sequential scan on station. So again, this is another example where the optimizer may not do what we think is the right thing to do. It's probably because there are only 70 tuples in station, so it said, that's so small, I'll just make that the outer relation. So let's do this: let's actually look at the value of this tuple. No windows, how do I go home? No, here we go. How are we doing on time? All right, five minutes. So this is hard to see, but let's just do this. So again, from station, let's just get the station ID. This keyboard is really uncomfortable; I shouldn't complain, because they gave it to me. All right, so station ID 70. So now we go back to our query and we add another predicate: AND station.station_id = 70. All right, raise your hand if you think hash join. Raise your hand if you think sort-merge join. Index join? All right, nobody's really confident about this. All right, sorry, EXPLAIN ANALYZE. Nested loop with an index scan, right? So in this case, it saw that it had one tuple on the inner relation and one tuple on the outer relation, and it picked the nested loop join. All right, so we could write a program that deletes tuples from the station table one by one and keeps running EXPLAIN ANALYZE, and see what the threshold is inside Postgres where it says: all right, I think you have few enough tuples, let me go to the index scan. So I think this is a good stopping point, and I think this again highlights the thing we'll talk about next week: just because we think the optimizer is going to do something in a certain way, or we think we know what the right plan is going to be, the optimizer is this black box, this magical beast that can generate query plans in all sorts of weird ways that may not be the best thing. Now, maybe in this case, it actually tells you how many buckets it generated: it allocated a hash table with 1,024 buckets, and it only took one kilobyte. So maybe it just said, all right, well, the hash table is so small, I'll just use that. But then clearly it didn't do that for the single-predicate case. So one of the things we'll talk about on Wednesday is that a lot of times in these optimizers there are a bunch of heuristics baked in that make decisions based on some hard-coded properties or rules that somebody wrote when they built the optimizer. And so there might be a rule inside Postgres, and we could poke around and find it, that says: well, the number of tuples I'm trying to join on is less than 100, or greater than one or greater than two, so therefore I'll always use a hash table. We can look at Postgres and see what it actually does. But the main thing I want to stress is that the optimizer is not always going to get things right, and often the DBA has to go back and manually tune the query plan.
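If you want to poke at this yourself, here is a rough sketch of the experiment using psycopg2. The connection settings are placeholders, the column names are guesses based on the demo, and the trip and station tables are assumed to be loaded the same way they were in class.

```python
import psycopg2

# Placeholder connection settings; adjust for your own setup.
conn = psycopg2.connect(dbname="bikeshare", user="postgres")
cur = conn.cursor()

query = """
    EXPLAIN ANALYZE
    SELECT * FROM trip JOIN station ON trip.station_id = station.station_id
    WHERE trip.bike_id < 200
"""

# Run the same query while progressively disabling the planner's join choices,
# as in the demo (the settings accumulate within the session).
for setting in (None, "SET enable_hashjoin = off", "SET enable_mergejoin = off"):
    if setting:
        cur.execute(setting)
    cur.execute(query)
    plan = "\n".join(row[0] for row in cur.fetchall())
    print(setting or "planner defaults")
    print(plan, "\n")
```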
But in addition to picking which join algorithm and which scan algorithm to use for computing the query, one of the things we haven't talked about is how it's going to decide which table is the outer relation versus the inner relation. Doing that for two tables is really easy; doing that for more than two tables, for n-way joins, is when things get really tricky. And this is where the commercial systems will significantly outperform the open source ones, okay? All right, so that's it for today. Any questions about hashing or aggregations? All right, awesome. We'll see you guys on Wednesday, and we'll talk more about query optimization. Thanks.