Hello everyone, today we are going to see the external sort-merge algorithm. The learning outcome for this session is that students will be able to apply the external sort-merge algorithm and analyze its cost.

Sorting is required many times in query processing: when an SQL query demands output in sorted order, for example, select * from employee order by name, or for efficient query processing, such as a join operation, where we may have to sort the relations first. Sorting can be done logically: build an index on the sort key and use that index to read the relation in sorted order. But in this scenario, reading the records leads to many disk accesses, which is very expensive, and that is why we should go for physical sorting; it is desirable to order the records physically. Now, if the relation fits in main memory, we can use any standard sorting technique such as quicksort or merge sort. But if the relation is bigger than main memory, we have to go for the external sort-merge algorithm. Sorting a relation that does not fit in main memory is called external sorting.

So, let us look at the algorithm. It works in two stages. In the first stage, the relation is divided into parts according to the main memory space available; each part is sorted individually and saved back to the disk. In the second stage, all the sorted runs generated in the first stage are merged. Let M denote the number of blocks that fit into main memory. Read M blocks of the relation, sort them in memory, and write the result back to the disk; this generates the first run. Then read the next M blocks, and so on, until the whole relation R has been processed. This generates a number of runs, each of which is sorted.
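The run-generation stage just described can be sketched in Python. This is a minimal sketch under some assumptions not in the lecture: the relation is a plain text file with one record per line, and M counts records rather than disk blocks; the function name and temp files are illustrative.

```python
# Stage 1 (run generation): read M records at a time, sort each chunk in
# memory, and write each sorted chunk back to disk as a separate run.
import os
import tempfile

def create_runs(input_path, m):
    """Split the file at input_path into chunks of m records, sort each
    chunk in memory, and write it to disk as a sorted run file."""
    run_paths = []
    with open(input_path) as f:
        while True:
            # Read at most m lines (our stand-in for m blocks).
            chunk = [line for _, line in zip(range(m), f)]
            if not chunk:
                break
            chunk.sort()  # any in-memory sort works here, e.g. quicksort
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    return run_paths
```

Each call to `create_runs` leaves a list of run files on disk, every one of them internally sorted, ready for the merge stage.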
That is the first stage. In the second stage, we bring these runs into main memory, compare the tuples one by one, and write the smaller or larger value, depending on whether increasing or decreasing order is required, to the output. That is, read one block of each of the N run files generated in stage one, compare the first tuple of every block, and write the appropriate value to the output. Once a tuple has been written to the output, it is deleted from its buffer, and when a buffer becomes empty, we read the next block of that run. We iterate this process until all the run files have been read completely. The output of this merge stage is, of course, the sorted relation. The output is buffered to reduce the number of disk accesses for the write operation, and since it merges N runs, it is called an N-way merge.

Let us look at this with an example. Consider a file, that is, a relation, of size 5 GB, with only 1 GB of main memory available. The complete 5 GB file will not fit in main memory, which is why in the first stage we divide it according to the memory size: the 5 GB file is divided into 5 parts. Take the first part, which is 1 GB, sort it, and write it back to the disk. In the same manner, each part of the relation is sorted and the result written back to the disk. So 5 run files are generated, that is, N = 5. Now, in the next stage, the main memory size is still 1 GB, so if we bring in all the runs entirely, they will fill the memory and we will not be able to write the output.
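The N-way merge of stage two can be sketched as follows, again assuming one record per line. Python's `heapq.merge` holds only the current record of each run in memory at a time, which mirrors the one-block-per-run buffering described above; the function name is illustrative.

```python
# Stage 2 (N-way merge): open one reader per run and repeatedly emit the
# smallest current record across all runs until every run is exhausted.
import heapq

def merge_runs(run_paths, output_path):
    """Merge the sorted run files into a single sorted output file."""
    files = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            # heapq.merge keeps only one record per run in memory,
            # mirroring the one-block-per-run input buffers, and the
            # file object buffers the output before it hits the disk.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
```

Writing through a buffered file object here plays the role of the output buffer in the lecture: records accumulate in memory and are flushed to disk in larger chunks.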
So, what we do instead is read only some part of every run file that has been generated. Assume 150 MB is read from each file; then the total input data is 750 MB, while 250 MB remains available for writing the output. We take the first 150 MB of every run, compare the first tuple of each block, and whichever is smaller (if we want increasing order) is written to the output. A tuple that has been written is deleted from its buffer, and we go on to the next record. In this way, all the data is compared and written. While writing the output, whenever the 250 MB output buffer becomes full, we write it to the disk and make it blank again. When all 5 input buffers become blank, we read the next part of each file, and so on; that is how the runs are merged.

Now consider the employee relation again, with the query select * from employee order by name. How will it be executed? Let us assume we can fit 3 buffers in main memory, with 1 tuple per buffer. The first stage divides the relation into as many buffers as fit in main memory. Only 3 buffers fit, and each buffer holds one tuple, so the first 3 tuples are brought into main memory, sorted, and the result is written back to the disk; the second group is handled the same way, then the third is sorted, and in the last group only 2 tuples remain, which are likewise sorted and written back. This is how 4 runs are created, every run sorted, and now we have to merge them. In the second stage we use M − 1 input buffers and 1 buffer for writing the output; here M is 3, which is why we can read from only 2 runs at a time.
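The head-to-head comparison in the employee example can be traced with a small two-way merge over in-memory lists; the names used here are hypothetical stand-ins, not the lecture's actual tuples.

```python
# Two-way merge of two sorted runs, record by record: compare the head
# of each input buffer, output the smaller one, then advance that run.
def two_way_merge(run_a, run_b):
    out, i, j = [], 0, 0
    while i < len(run_a) and j < len(run_b):
        if run_a[i] <= run_b[j]:
            out.append(run_a[i])
            i += 1
        else:
            out.append(run_b[j])
            j += 1
    # One run is exhausted; copy the remainder of the other.
    out.extend(run_a[i:])
    out.extend(run_b[j:])
    return out

# Hypothetical employee names standing in for two sorted runs.
print(two_way_merge(["asha", "meena"], ["irfan", "zoya"]))
# -> ['asha', 'irfan', 'meena', 'zoya']
```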
So, now we have 4 runs, but one block of each of the 4 runs cannot fit at the same time, so we take as many as we can fit: we read the first 2 runs, merge them, then take the next 2 runs, and so on. So the first 2 runs are taken; one buffer holds a block of the first run, one buffer holds a block of the second, and one buffer is available for output. The tuples are compared and the result is written to the output buffer; when it is full, it is written back to the disk. Then the next blocks are read, compared in the same way, and the result written to the output. Once the first 2 runs are completed, we go for the next 2; since all 4 cannot fit, we proceed by that factor of M − 1. These are again compared and merged. So from 4 runs we have now got 2 merged runs; these 2 are compared and merged again, and we get a single sorted output. This is stage two, where the runs have been merged.

Now let us look at the cost analysis. Here B_R denotes the number of blocks in relation R. In the first stage, all the blocks are read, sorted, and written back to the disk, which is why it requires 2 * B_R block transfers in total. The merge phase starts with ceil(B_R / M) runs, and we have seen that each pass reduces the number of runs by a factor of M − 1, so the total number of merge passes required is ceil(log base (M − 1) of (B_R / M)). Each pass requires 2 block transfers per block, one for reading and one for writing, except in two scenarios.
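Merging the runs in groups of M − 1 per pass, as in the 4-run example above, can be sketched as follows. This is an in-memory sketch where each run is simply a sorted list; the function name is illustrative.

```python
# Multi-pass merge: with only M - 1 input buffers available, merge the
# runs in groups of M - 1 per pass until a single sorted run remains.
import heapq

def multi_pass_merge(runs, m):
    """runs: list of sorted lists; m: number of buffers in memory."""
    while len(runs) > 1:
        # One pass: merge each group of m - 1 runs into one longer run.
        runs = [list(heapq.merge(*runs[i:i + m - 1]))
                for i in range(0, len(runs), m - 1)]
    return runs[0]
```

With m = 3 and 4 runs, the first pass produces 2 runs and the second pass produces the final sorted output, exactly as in the example.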
The first is the final pass, where the sorted output may not be written back to disk but passed directly to the next operation; the second is that in some passes a run may not need to be read and written again. Considering all this, the total number of block transfers is B_R * (2 * ceil(log base (M − 1) of (B_R / M)) + 1), which is nothing but the number of passes times 2, plus 1, multiplied by B_R.

What about the seek cost? In the first phase, we require 2 * ceil(B_R / M) seeks, because we take M blocks at a time into main memory, once for reading and once for writing. In the merge phase, assume we read b_b blocks at a time from each run; then each pass requires ceil(B_R / b_b) seeks for reading and the same for writing. So the total number of seek operations required is 2 * (ceil(B_R / M) + ceil(log base (M − 1) of (B_R / M)) * ceil(B_R / b_b)). Of course, the ceiling operation means we take the next integer value.

Consider the example of the takes relation, with 10,000 records and 400 blocks, and M = 3. The number of passes required is ceil(log base 2 of (400 / 3)), so the total is 8 passes. The assumption is that b_b = 1. The number of block transfers according to the formula is 6,400 plus 400, that is, 6,800 in total, whereas the number of seek operations is 6,668. For the student relation, with 100 blocks, the number of passes is 6; the number of block transfers according to the formula is 1,300, and the number of seeks is 1,268. So, this is how the external sort-merge algorithm works. Thank you.
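These cost figures can be checked with a short calculation that follows the pass-count, transfer, and seek formulas as stated in this lecture; note that some textbook presentations count the seeks of the final pass slightly differently, so this is a sketch of the lecture's version.

```python
# Cost of external sort-merge, per the formulas above:
#   passes    = ceil(log_{M-1}(ceil(B_R / M)))
#   transfers = B_R * (2 * passes + 1)
#   seeks     = 2 * (ceil(B_R / M) + passes * ceil(B_R / b_b))
import math

def sort_merge_cost(b_r, m, b_b=1):
    runs = math.ceil(b_r / m)            # initial runs from stage 1
    passes = math.ceil(math.log(runs, m - 1))
    transfers = b_r * (2 * passes + 1)
    seeks = 2 * (runs + passes * math.ceil(b_r / b_b))
    return passes, transfers, seeks

print(sort_merge_cost(400, 3))  # takes:   (8, 6800, 6668)
print(sort_merge_cost(100, 3))  # student: (6, 1300, 1268)
```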