Hello everyone. Welcome to the session on parallelism in relational operations for parallel databases. Talking about the learning outcome: at the end of this session you will be able to apply parallelism to operations like selection, projection, and aggregation on any given relation.

Earlier we saw a few intra-operation parallelism techniques, namely parallel sort and parallel join. You can refer to my videos for those. Today we will talk about the other relational operators: selection, duplicate elimination, projection, and grouping with aggregation. When a large amount of data is involved and we want to run these operations on it, how can we apply parallelism? Let us see.

We start with the selection operation. The query may be a simple select star from some relation, but the relation has a large number of tuples and we want to evaluate it in parallel. We take the employee relation as our example; we will look at it shortly. Assume the table is partitioned and stored across n disks, D0, D1, up to Dn-1, and these disks are associated with the processors P0, P1, up to Pn-1. Every processor works in parallel on the operation, which may be selection or any other operation. The partitioning technique we have applied to the employee relation is range partitioning.

So, this is the scenario. We have the employee relation, and it is range-partitioned on salary. You can see that salaries below 25,000 are in this partition.
Salaries of more than 25,000 and less than 30,000 are stored in this partition, and salaries of more than 30,000 are stored in this partition. So there are 3 partitions spread over 3 disks, and every partition is associated with one processor: P0 here, P1 here, and P2 here.

Now let us see how we can apply parallelism to this. Recall that there are 3 types of queries based on how the data is accessed, as we also saw in partitioning.

The first type is the point query. In a point query, the where condition compares an attribute with a single value using equal to, and we get back one tuple or a few tuples.

The second type is the range query, where we want the data that falls in a range. The range may be greater than or equal to some value, less than or equal to some value, or between value 1 and value 2.

The third type is the scan of the entire relation, where the whole relation is read. In point queries and range queries the entire relation need not be scanned: if the relation is partitioned on the attribute used in the query, the query goes straight to the particular disk and gets the value. In the third case the query is on a non-key attribute, meaning an attribute other than the one we partitioned on, so we need to scan the entire relation, and that scan is what we run in parallel. For example, if the partitioning was done on the same attribute that we are searching on, the query is handled as a range query over the partitions.
Otherwise, if we search on some other attribute instead of the partitioning attribute, we call that a non-key attribute, and we have to do a complete scan of the relation. So these are the three types of queries. Based on this, let us see how we can parallelize them.

The general form is: select some list of attributes from the table where some condition theta holds. Theta is the condition applied in the where clause.

The first case is the point query. Here theta is of the form Ai equal to some value, where Ai is an attribute. For example, with our employee relation, named EMP here: select star from EMP where salary equals 20,000. The partitioning attribute is also salary; we have partitioned our data on salary itself. So the system checks which range the value 20,000 falls in. It falls in the first range, so the query goes directly to the first disk, which holds the salaries below 25,000, and gets the value immediately. Because the data is partitioned, there is no need to check the other disks, and the time required is less because only the disk that holds the data is searched.

The second case is the range query. Here theta checks attribute Ai against a lower value and an upper value. The example is: select star from EMP where salary is between 25,000 and 30,000. This is the range of the second disk.
So how is it checked? The query goes directly to that disk, because the range of that disk is more than 25,000 and less than 30,000, and all the data on it comes back as the result. Because the relation is partitioned, disks D0 and D2 are not checked unnecessarily; only the one disk that holds the data is read.

The third case is very important: a complete scan is required. Here we are searching for the employee whose name is Sawyer. But the partitioning attribute is salary. So for this query we have to search the entire relation. You have seen the partitioning scenario; you can pause this video and think about how this query will execute in parallel.

How does it work? The partitions are scanned in parallel, because the tuples are not partitioned by name. The name Sawyer is searched for here, in parallel it is searched for here, and it is searched for here as well, and the matching data is collected. The entire relation is scanned because the partitioning was not done on the attribute we are searching: the attribute is employee name, but the partitioning is on salary. So all three disks are searched in parallel by their processors, and then we get the result. That is how parallelism is applied here.

The next operation is duplicate elimination. It is very simple: to eliminate duplicates, the first thing to do is a parallel sort.
So, we sort the entire relation using one of the parallel sorting techniques we saw earlier, such as range-partitioning sort and parallel external sort-merge. Any of them can be applied. The entire relation is sorted in parallel, and duplicates are eliminated as soon as they are found during sorting. Alternatively, the tuples can be partitioned using range partitioning or hash partitioning, and duplicate elimination is then done locally at each processor. Every processor removes the duplicates in its own partition.

The next operation is projection. There are two cases: projection without duplicate elimination, and projection where duplicate elimination is required. Projection without duplicate elimination can be performed like the normal selection operation. If duplicate elimination is required, we apply the parallel duplicate elimination, which is based on sorting.

Now, grouping and aggregation. Say we want to group the data and compute something over it, and we want to do that in parallel as well. Part of the computation is done at each processor. Let us look at the scenario first. Say we want to do the sum operation on this relation. The relation is partitioned across a number of disks. To calculate the sum, a sum is first calculated locally on every disk: this may be sum 1, this sum 2, this sum 3, sum 4, sum 5. Finally, all the sums are added. So the aggregation has been done in parallel.
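The local-sum-then-add idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the salary values are made up, the three lists stand in for the partitions on disks D0 to D2, and a thread pool stands in for the processors. The names `partitions`, `local_sum`, and `parallel_sum` are hypothetical and do not come from the session.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical salary values, range-partitioned on salary across
# three disks as in the employee example (values are invented).
partitions = [
    [18000, 20000, 24000],   # D0: salary below 25,000
    [26000, 28000],          # D1: salary between 25,000 and 30,000
    [32000, 45000, 50000],   # D2: salary above 30,000
]


def local_sum(partition):
    """Work done by one processor Pi on the tuples of its own disk Di."""
    return sum(partition)


def parallel_sum(partitions):
    """Compute one partial sum per partition, then add the partial sums."""
    # One worker per partition stands in for one processor per disk.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_sums = list(pool.map(local_sum, partitions))
    # Final step: the partial sums are added to give the overall sum.
    return partial_sums, sum(partial_sums)


partial_sums, total = parallel_sum(partitions)
print(partial_sums, total)
```

The same two-step shape would apply to count, min, and max; an average would need each processor to return both a partial sum and a partial count.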
The same thing is stated here. Consider the sum aggregation. Each processor Pi performs the aggregation on the tuples stored on its disk Di, and every disk Di holds one partition. The results are stored as partial sums, and the partial sums are then combined for the final result. So every processor does the local aggregation at its end, and finally the results are merged. That is how it works in parallel.

Now, the cost of parallel evaluation of operations. If there is no partitioning skew and no execution skew, and no overhead, then ideally the parallel operation takes 1/n of the time when we have n processors. But if there is skew, overhead, or communication delay, the time is not exactly 1/n; it is estimated as somewhat more than that.

How is it calculated? It is the sum of three things. The first is the time taken for partitioning. The second is the time taken to collect the results, which you may call assembling. The third is the time of the operation itself: since all the processors work in parallel, the one that takes the maximum time is the one that counts, so we take the maximum over all the processors. Adding these three gives the cost of parallel evaluation of the operation.

These are my references. Thank you.
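As a compact restatement of the cost estimate described in the session, the three terms can be written as a formula. The symbol names are chosen here for illustration and are assumptions: T_part is the partitioning time, T_asm is the time to assemble the results, T_i is the time taken by processor Pi, and T_seq is the time on a single processor.

```latex
% Ideal case: no skew and no overhead, n processors
T_{\text{par}} \approx \frac{T_{\text{seq}}}{n}

% General estimate: partitioning + assembly + the slowest processor
T_{\text{par}} = T_{\text{part}} + T_{\text{asm}} + \max\left(T_0, T_1, \ldots, T_{n-1}\right)
```

The max term is why skew matters: one processor holding a much larger partition sets the time for the whole operation, however quickly the others finish.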