Hello friends, today we are going to see hash join, that is, how the join operation is performed using the hash join technique. The learning outcome for this session is that students will be able to apply hash join and analyze its cost. Hash join is used especially for natural joins and equi-joins. A hash function is used to partition the tuples of both relations. The basic idea behind hash join is to partition the tuples of each relation into sets that have the same hash value on the join attributes; this avoids comparing every tuple with every other tuple. Here h is a hash function that maps the join attributes to a value from 0 to n_h, where the join attributes are the common attributes between r and s that are used in the natural join. r_0, r_1, ..., r_{n_h} denote the partitions of r tuples, and each tuple t_r in r is put into one of the partitions based on its hash value: the hash function is applied on the join attributes of the tuple, and the tuple is sent to the partition given by the resulting value. The same is done for relation s. So look at this: r is partitioned here into partitions 0 to 4, s is partitioned likewise, and then the join is performed only between r_0 and s_0, r_1 and s_1, and so on, because those pairs have the same hash value. If an r tuple and an s tuple satisfy the join condition, then they must have the same hash value; if they have different hash values, it means their join-attribute values (for example, their IDs) are different. So this hash value is used for the join: r tuples in r_i need to be compared only with s tuples in s_i. We have just seen this, that r_1 will be matched only with s_1. For example, let d be a tuple in student and c be a tuple in takes.
So, d and c are tested if and only if h(d) = h(c). For example, if 10 is the ID in the student tuple d and the hash function is mod 10, then its hash value is 0, whereas if 22 is the ID in c, then 22 mod 10 is 2; 0 and 2 do not match, and that is why this pair will not be tested. If h(d) is not equal to h(c), then d and c must have different values of ID, as we have seen: 10 and 22 are different. But if the hash values are the same, it is not necessary that the values of the join attributes are the same; in that case we have to check all the tuples having the same hash value to see whether the actual values also match. Now, this is the algorithm. In the first step we partition the relations r and s: the for loop says that for each tuple t_s in s, we find its hash value and add the tuple to the corresponding partition, where H_{s_i} is the partition whose hash value is i. In the same manner we partition relation r. After partitioning, we process the partitions one by one, starting with the first. Here relation s is called the build input, whereas relation r is called the probe input, and n_h is the number of partitions that have been generated. We read H_{s_i}, the build input, and build an in-memory hash index on it, using a separate hash function, different from the one used for partitioning. Then for every tuple of H_{r_i}, the corresponding partition of r, we look up the value in that hash index built on s; if the values match, the pair is added to the result. In this manner we process all the partitions.
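The two-phase algorithm just described can be sketched in Python. This is a simplified in-memory sketch of our own: real systems write the partitions to disk, and the names `hash_join`, `key_r`, `key_s`, and `n_h` are illustrative assumptions.

```python
from collections import defaultdict

def hash_join(r, s, key_r, key_s, n_h):
    """Sketch of hash join: s is the build input, r is the probe input.
    key_r / key_s extract the join attribute from a tuple; n_h is the
    number of partitions produced by the partitioning hash function."""
    h = lambda v: hash(v) % n_h            # partitioning hash function

    # Step 1: partition both relations on the join attribute.
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t_s in s:
        s_parts[h(key_s(t_s))].append(t_s)
    for t_r in r:
        r_parts[h(key_r(t_r))].append(t_r)

    # Step 2: for each partition i, build an in-memory hash index on
    # H_s_i (here simply a Python dict, standing in for the separate
    # hash function), then probe it with every tuple of H_r_i.
    result = []
    for i in range(n_h):
        index = defaultdict(list)
        for t_s in s_parts[i]:
            index[key_s(t_s)].append(t_s)
        for t_r in r_parts[i]:
            for t_s in index.get(key_r(t_r), []):   # only equal keys join
                result.append(t_r + t_s)
    return result
```

Note that tuples in r_i are compared only with tuples in s_i; tuples whose join attributes hash to different partitions are never tested against each other.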
Let us look at an example. We have two relations: department, with department id and department name, and the other one, employee, with employee id, name, salary, and department id. We want to compute department join employee. The first hash function we use for partitioning is department id mod 3, and the same hash function is applied to both relations. Here the department ids 1, 2, 3, 4, 5 are distributed over 3 partitions: 1 and 4 go to H_{s_1}, 2 and 5 go to H_{s_2}, and 3 goes to H_{s_0}. So 3 partitions have been generated for department. In the same manner the hash function is applied to employee, and again 3 partitions are generated. The first step of the algorithm, partitioning both r and s, is now done. In the second step we take a separate hash function; here we have taken department id mod 2. We take the first partition of the build relation, H_{s_0}, apply this function, and build the hash index. Here it is a single record, so it goes into a single bucket. Now every tuple of H_{r_0} is taken: the first record has department id 3, which is looked up in the index, and if it matches, it is added to the result. Here the value is 3 and an index entry is present for 3, so it matches, and the result contains 3 records. Next we take H_{s_1}: since our function is department id mod 2, its tuples (ids 1 and 4) go into 2 buckets, and again we probe with every tuple of H_{r_1} using that index and add the matches to the result. Departments 1 and 4 are also present in employee, so they join, and 4 more records are added to the 3 we already had.
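The first partitioning pass of this example can be reproduced directly. A minimal sketch, using the department ids 1 to 5 and the hash function department id mod 3 from above:

```python
from collections import defaultdict

# Partition the department ids 1..5 with the first hash function,
# department_id mod 3, as in the worked example.
dept_ids = [1, 2, 3, 4, 5]
hs = defaultdict(list)
for d in dept_ids:
    hs[d % 3].append(d)

# hs[1] = [1, 4], hs[2] = [2, 5], hs[0] = [3]
print(dict(hs))
```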
Now the third partition is taken. Here we again have 2 values, 2 and 5, but employee has records only for department 2, so only those match. The final result of the join contains all departments except 5, because there is no record for department 5 in employee. We are not concerned about the result itself; the result will be the same whichever join technique we use. What we have seen is how hash join works: first partition the records, then take the partitions one by one, build the hash index on the build input, and check the tuples of the corresponding partition of the other relation using this index. Now, when we are partitioning, it may happen that the number of partitions required is greater than the number of memory blocks available. In that case we have to repeat the process of partitioning: in one pass we cannot split the relation into all the required partitions, because only a limited number of memory blocks are available for output buffers. So each bucket generated by one pass is taken and partitioned further until it fits into main memory, that is, until it yields sufficiently small partitions. The hash function used in one pass must be different from the one used in the other passes. For example, suppose 4 records have been generated in a partition but the memory block cannot accommodate 4 records; then we have to divide that partition further. This repeated process is called recursive partitioning: the first partition is taken and partitioned again and again until it fits into main memory. When we do the partitioning there can also be the problem of skew. Skew means that some partitions have many more tuples than the average number of tuples; there will be hash-table overflow if s_i does not fit into main memory.
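Recursive partitioning can be sketched as follows. This is a simplified model of our own: `m` memory blocks of `block_size` tuples each, integer join keys, and a radix-style "different hash function per pass" are all assumptions of the sketch, not part of the lecture's algorithm.

```python
def recursive_partition(tuples, key, m, block_size, pass_no=0):
    """Recursive partitioning, sketched: split `tuples` until each
    piece fits in m memory blocks of `block_size` tuples each, using
    a different hash function on each pass."""
    if len(tuples) <= m * block_size:        # partition fits in memory
        return [tuples]
    if len({key(t) for t in tuples}) <= 1:   # skew on a single value:
        return [tuples]                      # no hash function can split it
    fanout = m - 1                           # one block reserved for input
    buckets = [[] for _ in range(fanout)]
    for t in tuples:
        # per-pass hash: look at a different base-`fanout` digit of the
        # key on each pass (real systems use an independent hash family)
        buckets[(key(t) // fanout ** pass_no) % fanout].append(t)
    parts = []
    for b in buckets:
        if b:
            parts.extend(recursive_partition(b, key, m, block_size,
                                             pass_no + 1))
    return parts
```

With 100 distinct keys, m = 4 and block_size = 5, the first pass produces 3 oversized buckets, and a second pass with a different hash splits each of them into pieces that fit in memory.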
This may happen because many tuples in s have the same value for the join attributes: assume that in our example every employee belonged to department 2; then all the records would go into only one partition. Choosing a bad hash function will also result in skew. So what can be done to handle this hash-table overflow? We can increase the number of partitions by a small multiplicative constant, called the fudge factor; but even after increasing the number of partitions in this way, hash-table overflow can still occur. How can it be handled then? It can be handled with two approaches: one is overflow resolution and the other is overflow avoidance. Overflow resolution means that if a partition s_i overflows, it is partitioned again using a different hash function, using the repeated partitioning we saw, so that it fits into main memory; r_i is partitioned in the same way. So s_i, which is itself a partition, is further partitioned using a different hash function. In overflow avoidance, on the other hand, the build relation is initially partitioned into many small partitions, and then some partitions are combined in such a way that each combined partition fits into main memory; the same combining is done for r. But sometimes both approaches, resolution and avoidance, fail: when a large number of tuples have the same value for the join attributes. In that scenario we have to use other join techniques. So, now let us look at the cost of hash join. We have to check two cases, depending on whether we have used recursive partitioning or not. First look at the case where recursive partitioning is not required. The first step is partitioning, and partitioning the two relations r and s requires 2(b_r + b_s) block transfers, because each entire relation needs to be read once and written out once.
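The overflow-avoidance idea, many small initial partitions combined so that each group fits in memory, can be sketched as a simple first-fit packing. The function name and the greedy strategy are our own illustration, not the lecture's prescription:

```python
def combine_partitions(part_sizes, m):
    """Overflow avoidance, sketched: given the sizes (in blocks) of
    many small initial partitions, greedily combine them into groups
    whose total size is at most m blocks (the main-memory budget).
    Assumes every individual partition already has size <= m."""
    groups = []                      # each group: [total_size, [sizes...]]
    for size in sorted(part_sizes, reverse=True):
        for g in groups:
            if g[0] + size <= m:     # first existing group it fits into
                g[0] += size
                g[1].append(size)
                break
        else:
            groups.append([size, [size]])
    return [g[1] for g in groups]
```

For example, partitions of sizes 6, 5, 4, 3, 2 blocks with a 10-block memory budget combine into two groups, each totaling at most 10 blocks.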
That accounts for one read and one write of each relation. The build and probe phases then read each of the partitions once again, requiring another b_r + b_s block transfers, so the total is 3(b_r + b_s) block transfers. Now, the partitions actually occupy slightly more than b_r + b_s blocks because of partially filled blocks; accessing these partially filled blocks adds an overhead of at most 2 n_h block transfers for each of the two relations, where n_h is the number of partitions, since each partition can have one partially filled block that is written and read back. So the total number of block transfers required is 3(b_r + b_s) + 4 n_h. Now assume that b_b blocks are allocated for the input buffer and for each output buffer. Each relation is then read and written in units of b_b blocks, so partitioning requires 2(⌈b_r / b_b⌉ + ⌈b_s / b_b⌉) seeks. The build and probe phases require only one seek for each of the n_h partitions of each relation, since each partition can be read sequentially; therefore the total number of seek operations is 2(⌈b_r / b_b⌉ + ⌈b_s / b_b⌉) + 2 n_h. Now, if recursive partitioning is required, what is the additional cost? Assume that each pass reduces the size of each partition by a factor of M − 1, where M is the number of memory blocks, and that passes are repeated until each partition is of size at most M blocks. The expected number of passes is then ⌈log_{M−1}(b_s) − 1⌉, and since in each pass every block of s is read and written out, the total number of block transfers for partitioning s is 2 b_s ⌈log_{M−1}(b_s) − 1⌉. The same number of passes is required for partitioning r as well.
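The non-recursive cost formulas above can be turned into a small cost calculator. A sketch; the figures used to check it (student = 100 blocks, takes = 400 blocks, 5 partitions) are the lecture's own example, while b_b = 3 is an assumed buffer size:

```python
from math import ceil

def hash_join_cost(b_r, b_s, n_h, b_b):
    """Cost of hash join without recursive partitioning, per the
    formulas above: 3(b_r + b_s) + 4*n_h block transfers and
    2(ceil(b_r/b_b) + ceil(b_s/b_b)) + 2*n_h seeks."""
    transfers = 3 * (b_r + b_s) + 4 * n_h
    seeks = 2 * (ceil(b_r / b_b) + ceil(b_s / b_b)) + 2 * n_h
    return transfers, seeks

# Lecture example: student = 100 blocks, takes = 400 blocks, n_h = 5.
transfers, seeks = hash_join_cost(100, 400, 5, 3)
print(transfers, seeks)   # with the n_h overhead terms included

# Dropping the n_h overhead (partially filled blocks, per-partition
# seeks) gives the rounded figures quoted in the lecture's example.
print(3 * (100 + 400), 2 * (ceil(100 / 3) + ceil(400 / 3)))
```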
So, that is why the total number of block transfers is 2(b_r + b_s) ⌈log_{M−1}(b_s) − 1⌉ plus the b_r + b_s transfers of the build and probe phases. Now, how many seeks are required? Again, if b_b blocks are allocated for buffering each partition, and if we ignore the relatively small number of seeks during the build and probe phases, then the number of seeks is 2(⌈b_r / b_b⌉ + ⌈b_s / b_b⌉) ⌈log_{M−1}(b_s) − 1⌉. So this is the cost of hash join. Now take the same example of takes and student: the memory size is 20 blocks, and the numbers of blocks are 100 for student and 400 for takes. Student is partitioned into 5 partitions, each of size 20 blocks, in one pass, and takes is partitioned into 5 partitions, each of size 80 blocks, in one pass. The total cost, ignoring the cost of partially filled blocks, is 3(100 + 400) = 1500 block transfers and, with b_b = 3, 2(⌈100/3⌉ + ⌈400/3⌉) = 336 seeks. So this is how hash join works. Hash join can be improved further if the main memory size is large: when the entire build input can be kept in main memory, n_h can be set to 0, so no partitioning is needed, and we require only b_r + b_s block transfers and 2 seeks. This is the best case. Thank you.