So, let us launch into today's topic, which is query processing. What kind of queries are we talking about here? Do we mean RTI queries, where the responsible officer gets fined for not processing them? No. We are talking primarily about SQL queries. However, our focus is not going to be on the SQL language. Instead, we are going to assume that the queries have been translated down into relational algebra. Many people asked me early in the course what the motivation for relational algebra is. The motivation is that it is a clean abstraction for the operations that any real system needs to perform on a relation, and SQL is translated into relational algebra in a relatively straightforward manner. It turns out that the basic relational algebra operations we have seen are not quite sufficient; there are a few extra operations. In particular, people have extended relational algebra with an operator that models sub-queries, that is, nested sub-queries and how they get executed. That was not part of relational algebra originally, but there are extended versions with an operator modeling correlated sub-query execution. We are not going to go too much into correlated sub-queries in this talk; we will stick to the basic relational algebra, but we will briefly discuss correlated sub-queries at the end.

So, here are the basic steps in query processing. You give a query, typically in SQL. It is parsed and translated into relational algebra, and that is fed to an optimizer. Now, what does the optimizer do? It has to output an execution plan which is expected to be the cheapest plan. How does the optimizer know the cost of various plans, and how does it generate alternative plans? These are the two questions we are going to look at in more detail in chapter 13, the next chapter, which is also today. Today's session was labeled big data; I do not know if we will be able to get to big data today. What I will do is cover query processing and optimization, and we will start on big data at the end.

So, coming back, we have an optimizer which has to look for alternative expressions which are equivalent, and it takes statistics about the data. What kind of statistics? For example, how many tuples are there in a particular relation? How many distinct values are there for a particular attribute? What is the distribution of values, perhaps a histogram or some such, to estimate how many tuples will satisfy a particular selection, and so forth. There are a lot of statistics which any database maintains, and using those the optimizer chooses an execution plan. That plan is passed on to an evaluation engine, which takes the actual data from the database, processes the query, and outputs the query result. That is a summary of query processing; we have already seen this.

Now, first let us focus on alternative plans for a particular query. Let us start with relational algebra expressions, and then we will go deeper into specific algorithms for evaluating specific relational algebra operations. Let us start with a very trivial query, not realistic, but it still shows what the issues are. Here is a query which says: project instructor on the salary attribute and then select salaries which are less than 75,000. So, this is pretty straightforward.
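To make the expression concrete, here is a minimal sketch in Python of σ_{salary < 75000}(Π_salary(instructor)); the tuples are made up, and a real engine of course works on disk blocks rather than Python lists, but it shows how one operator's output feeds the next.

```python
# A toy sketch of sigma_{salary < 75000}(Pi_{salary}(instructor));
# the instructor tuples below are hypothetical.
instructor = [
    ("10101", "Srinivasan", "Comp. Sci.", 65000),
    ("12121", "Wu",         "Finance",    90000),
    ("33456", "Gold",       "Physics",    87000),
    ("76543", "Singh",      "Finance",    70000),
]

def project_salary(rows):
    # Pi_salary (SQL semantics: duplicates are kept)
    for row in rows:
        yield row[3]

def select_below_75000(salaries):
    # sigma_{salary < 75000}
    for s in salaries:
        if s < 75000:
            yield s

print(list(select_below_75000(project_salary(instructor))))  # [65000, 70000]
```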
If you wanted to actually implement it, you would have to implement the project operator, which would probably do a scan of the relation and, assuming this is the SQL version which keeps duplicates, simply output just the salary attribute and ignore everything else. That result would then be passed on to the select operator, which keeps all those salaries which are less than 75,000 and returns them, filtering out the others. Now, it should be very clear that this particular expression is equivalent to another expression which does what? It first does the select on salary, and afterwards it does the project. These two are fairly simple examples, and the costs of the two may not vary all that drastically, but when you have joins of a number of relations, there are alternative join orders which can vary wildly in cost. We will come to that later. Each of these operations, select and project, can itself be executed using different algorithms, and we are going to see in this chapter what the different algorithms for select, for project and so forth are.

Now, an optimizer's job is to give out an execution plan, or evaluation plan, which the evaluation engine can directly execute. So, that plan must specify exactly how to run the query, including how to do the select and how to do the project. For example, the plan may say: use an index on salary to find instructors with salary less than 75,000. The alternative is to look at all instructors and apply the selection condition to each one to see if the salary is less than 75,000. Which plan is cheaper? It turns out it depends on what kind of index it is, whether it is clustered or not clustered on salary, and so forth. So, the job of the optimizer is to look at this information, what indices exist, whether they are clustered, and so on, and determine which of these plans is likely to be cheaper.

As I already said, query optimization's goal is to find the cheapest plan. It cannot actually find the truly cheapest plan in terms of real execution, because you cannot exactly predict how long a particular plan will take. Instead, we estimate how long it will take based on statistics. The estimate may not exactly match the actual execution time, but it is usually a reasonable predictor, and that is fine as long as you get a plan which is close to optimal and do not choose a plan which is really way off.

So, I have been talking about cost. How do we estimate the cost of evaluating a query? One simple measure of cost is how long it takes: you start the query and see when it finishes. But how long it takes is a function of what other queries are running in the system, whether some other activity is going on in the operating system while the queries are running, and so on. That is hard to predict. What you can estimate is the resource consumption: how many resources is this particular query going to consume? What are the resources? Disk access, CPU time, and, in a system which runs queries in parallel or in a distributed manner, network communication cost. Typically, in hard-disk-based databases, disk access has been the predominant cost. This is changing with flash memory, but for simplicity let us focus on disk accesses as the measure, because it is easier to understand what is going on at an abstract level. What about the CPU cost itself? Every real optimizer does model CPU cost, and there are ways to model it.
For example, each algorithm can estimate how many operations it does and then apply some relative weightage of CPU cost versus disk cost, because depending on how fast your CPU is and how fast your disk is, the relative costs may differ. You have to combine the costs in some way to get an overall cost; that is what any real optimizer does. For simplicity, we are only going to look at disk access, and here again there are several factors. There are three things we actually need to model. One is the number of seeks, which we multiply by the average seek cost. The exact cost of a particular seek depends on which track the disk arm is currently at and which track you need to access. That is for a single-disk subsystem; for a multiple-disk subsystem there are other issues, such as which disks you are reading from, and it also depends on what else is in the queue for that disk. We cannot figure that out at optimization time, so we take an average cost. Similarly, given a particular plan, we can estimate how many blocks are read; we multiply the number of blocks read by the average block-read cost, and likewise we estimate the number of blocks written and multiply that by the block-write cost.

Why do we differentiate between read and write cost? Writes are usually more expensive: on disk, a write typically requires the platter to spin around completely so the data can be read back before the write is declared successful. On the other hand, some systems mask this by writing the data to a non-volatile buffer and declaring the write done immediately, and some disk subsystems mask it further by not actually writing to disk at that point at all, keeping the data in a main-memory buffer and saying the write is done. Do we need to worry about these differences? Not really, at least not yet. For the purpose of query optimization this is not relevant; what we want is the actual disk resources consumed by the write operation, and for comparing plans we do not care about the time delays that get masked.

So, as I said, we are going to focus on just transfers and seeks. In fact, we are going to simplify our life further and not bother too much about whether we are doing a read or a write. We could distinguish them; it is just that, to keep life simpler, we simplify the formulae: we count the number of block transfers, treating reads and writes uniformly, and we keep the number of seeks separate. This matters because some algorithms transfer less data but do more seeks, and vice versa, and depending on whether you are using flash memory or a hard disk, the cost of a seek and the cost of a transfer can be quite different, so you want to keep them separate. So, the cost is the number of blocks b that we transfer times the average transfer time t_T for one block, plus the number of seeks S that we do times the average time t_S for one seek, and we ignore CPU cost. Note also this last point: our cost estimates do not include the cost of writing the output to disk. The reason is that many algorithms create output; sometimes you have to write that output to disk completely before you can start the next operation, but in other cases you can pass the output of one operation directly to the next without writing it out to disk.
So, depending on the overall plan, the data may or may not get written to disk, and when we estimate the cost of an operator we will not include the cost of writing its final output. There are also issues with buffering, which we are going to ignore. I did not want to show you that slide because you would be tempted to read it, but the bottom line is that an algorithm in general has to read data from disk, yet the data it wants might already be in the database buffer, in which case the cost of the seek and the transfer is avoided entirely and it is much cheaper. The problem is that the query optimizer really has no clue what is in the database buffer: it is coming up with the plan right now, and by the time the plan is actually executed the contents of the buffer may have changed dramatically. There has been some work on doing optimization taking buffer residence into account, but it is hard to predict, so typically optimizers do not bother about it.

Now, why is this a reasonable thing to do? The plan assumes the worst case, meaning nothing is in the buffer, and it will tend to favor a plan which fetches the smallest amount of data with fewer seeks. There may be another plan which reads more data, which would be much slower if the data were on disk but might be cheaper if the data were in memory. So, the relative cost of plans might be affected by this, but the generally accepted wisdom is that the relative cost of plans does not change as much. The actual time to execute will reduce drastically if the data is in memory, but if you have two options, plan A and plan B, the question is their relative cost when the data is not in memory versus when it is already in memory, and the simplifying assumption, which is not always true in fact, is that this relative cost will not change all that much. So, if you choose a plan assuming the worst case, nothing in the buffer, and then run it with everything in the buffer, the plan which was better in the worst case is hopefully not too much worse than the best plan when things are in the buffer. That is the assumption.

Now let us look at specific algorithms. If you have a selection and you do not have suitable indices, the only way to answer it is to read every record in the file containing that relation, look at the record, and see if it matches the selection condition. We will call this algorithm file scan, or linear search: a file scan means reading everything in the file, and this particular algorithm is called linear search. It can be applied regardless of the selection condition, the ordering of records, the availability of indices, and so forth. Now, when you search for data in memory, binary search on sorted data is sometimes viewed as an option. This usually does not make sense in databases because, first, the data is not stored sequentially in a sorted array, it is stored on disk, and second, binary search can be much more expensive than using a proper B+-tree index: the number of seeks will be much smaller with the B+-tree index. Which brings us to the next algorithm, index scan. When can we use an index scan? When we have a condition that matches an index we have created. So, suppose we have built an index on attribute ID of student, and I get the selection ID = 12345 on student; I want the student whose ID is 12345. A B+-tree on the ID attribute is perfectly suited to this query, and that is probably what I would use for it.
So, that corresponds to a primary (clustering) index with an equality on a key. ID is a key attribute, so this is equality on a key: a single record satisfies the condition. What is the cost of this index selection? I will write h_i for the height of the index we are using. Remember what happens when we use a B+-tree: we start at the root, compare the key value, go to one of the children, compare key values, go to its child, and so on until we hit a leaf; the total height is h_i. When we hit a leaf, the assumption is that we get a record pointer and do one more I/O to fetch the actual record. So, h_i + 1 I/Os are involved, and each of them involves a seek followed by a transfer. The cost is therefore (h_i + 1) * (t_T + t_S); this is the formula we will use.

Now, suppose I have a clustering index, but on something which is not a key. Take the same student relation and, instead of clustering it on ID, suppose we clustered it on department name, and I have a query: find all students in the CS department. Here I have a clustering index, so I can go down the index to find the first student record in the CS department, and then all the other matching student records will be consecutive. Why? Because this is the primary or clustering index, which means the actual relation is sorted on department name in this example. So, I am going to read a number of consecutive blocks which all contain relevant data. Let b be the number of blocks containing matching records; you can estimate it as the estimated number of matching records divided by the number of records that fit in a block. The cost here is h_i * (t_T + t_S) to get to the leaf, then one more seek to reach the first block containing the required data, and then b * t_T for reading the b consecutive blocks. I am assuming the blocks holding the data are stored reasonably consecutively. They cannot be exactly consecutive, but the idea is that if, say, extents of 50 pages are stored consecutively, then only one in 50 page reads requires a seek; that is not much, so we can ignore it. So, the cost for a primary index with equality on a non-key is h_i * (t_T + t_S) + t_S + b * t_T. This is also pretty efficient, because the transfer rate is high and the time to transfer one block is usually very small; remember, we can read data from disk these days at something like 50 to 100 megabytes per second, which is pretty good.

Now, the next case is a secondary index with equality on a non-key. By the way, there is one more case, a secondary index with equality on a key, which I have not shown here, but that turns out to be similar to the primary-index, equality-on-key case. What I mean is, if the key is unique and I am fetching only one record, then whether it is a primary index or a secondary index does not really matter; the cost is the same. The cost difference comes when I have a secondary index with an equality on a non-key. What do I mean by equality on a non-key? The secondary index is going to give me a number of records, but these records are going to be scattered all over the underlying file.
So, what happens is this. If I want a single record fetched, which is the equality-on-key case, I traverse down the index and then follow one more pointer to the actual record: h_i + 1 operations, and each involves a seek followed by a single block transfer. That is the cost when the search key is a candidate key. Equality on a non-key is the more general case, and here we have, say, n matching records. The problem is that each of these n matching records may be on a different block, so I am going to do n separate I/O operations, each on a different block. Let us assume the relation is large and these are random I/Os. When I fetch a particular block and the buffer is full, I have to evict something; when I fetch the next block, I evict something else. If the relation is significantly larger than the buffer, then most of the time the block I fetch will not be in the buffer: the probability of finding a block in the buffer is small when the buffer is small relative to the relation. This is again a worst-case assumption; if the relation is actually very small, the actual running time will be much less. People have indeed taken this into account, and more realistic formulae, which many databases implement, consider how big the relation is and how big the buffer is. But for simplicity we use the worst case: the buffer is very small, and every one of the n separate I/O operations results in a seek and a block transfer. So, the cost is (h_i + n) * (t_T + t_S).

This can turn out to be very expensive, and not just in theory. If you actually do this with a relatively small memory and a large relation, the cost of the seeks dominates; seeks are much, much more expensive than the data transfer for a single block. It turns out that a secondary index on a non-key can actually become a liability: a simple file scan might be much faster. So, this was the state of affairs: the optimizer had to estimate how many matching records there would be and then choose whether to use the secondary index or do a file scan. That was a little problematic, because how well can the optimizer estimate how many records will match? It turns out this is not necessarily easy. Suppose the query is on, say, a city: I am trying to find all orders from a particular city. The number of orders from Mumbai for something like Flipkart is going to be way bigger than the number of orders from some small town. If the optimizer does not have exact statistics about how many orders come from which town, it might make a simplifying assumption and say that on average each town has so many orders. Now, if I run the query for Mumbai, the actual number is going to be way more than average. If the actual number were average or below, n would be small and it might be worth doing separate seeks per order; but if it is Mumbai, this cost blows up and you are better off doing a file scan. So, the optimizer had to do its best, and if it did not do a good job, the plan could blow up in its face. This was the state of affairs until people decided you could have a hybrid algorithm.
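Before getting to the hybrid, here is a rough back-of-the-envelope comparison, a sketch rather than a real cost model: it plugs the formulas above into assumed hard-disk numbers (the seek time, transfer time, relation size and index height are all made up) to show where the crossover between a secondary index and a file scan can lie.

```python
# Plugging the selection cost formulas into assumed numbers; every value
# here is a hypothetical illustration, not a measured figure.
t_S = 4e-3      # average seek time in seconds (assumed)
t_T = 0.1e-3    # time to transfer one block in seconds (assumed)
b_r = 100_000   # number of blocks in the relation (assumed)
h_i = 4         # height of the B+-tree index (assumed)

def linear_scan():
    # one initial seek, then transfer every block of the relation
    return t_S + b_r * t_T

def secondary_index_equality_nonkey(n):
    # (h_i + n) * (t_T + t_S): one seek + one transfer per matching record
    return (h_i + n) * (t_T + t_S)

print(f"file scan:                {linear_scan():6.1f} s")
print(f"secondary index, n=100:   {secondary_index_equality_nonkey(100):6.1f} s")
print(f"secondary index, n=10000: {secondary_index_equality_nonkey(10_000):6.1f} s")
# With these numbers the index wins for 100 matches (~0.4 s vs ~10 s),
# but for 10,000 matches the per-record seeks push it to ~41 s,
# slower than simply scanning the whole file.
```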
So, I am going to go off the slides and use the whiteboard to spend a minute on an algorithm which PostgreSQL implements, called the bitmap index scan. PostgreSQL has something called a bitmap index scan, and in the lab today you will actually be seeing query plans; it is not hard to look at what plan the optimizer has chosen for a particular query. In PostgreSQL you just prefix the query with EXPLAIN, for example EXPLAIN SELECT * FROM student WHERE ID = 12345, and PostgreSQL will tell you what plan it is using. Your lab exercise is to look at these plans; there are actually two exercises in today's lab. One part is about creating indices and seeing the impact of an index on the plan and on the running times of various queries, and the second part is trying out various queries and seeing the plan, the impact of indices on the plan, the impact of different join and selection conditions on the plan, and so on.

So, coming back, PostgreSQL has something called the bitmap index scan, which, by the way, should be parsed as bitmap index-scan, with the hyphen there. Do not confuse this with bitmap indices; bitmap indices are described in chapter 11. I did not cover them in detail yesterday for lack of time, but for those of you who are interested, they are in the slides and in the chapter; that is a separate concept. What PostgreSQL calls a bitmap index scan is completely different from bitmap indices; in fact, PostgreSQL does not have bitmap indices as of now.

So, what is the bitmap index scan? The idea is that I have a B+-tree, let us say on city, and I scan down it to find the record IDs for the particular city I am looking at. It could be Mumbai, it could be Puducherry, or wherever. Now, the optimizer may not be sure how many matching records there are going to be. If the optimizer is pretty sure there are few, it will just do a regular index lookup; it is not going to use the bitmap index scan. However, if the optimizer is not sure, it does what is called a bitmap index scan, and the idea is this: when it comes down to the leaf, it looks at all the record IDs stored there; this is a secondary index, so the B+-tree stores record IDs. What it then does is create a bitmap. What is a bitmap? Simply an array of bits, one bit per block of the file. Say the index is on a sales relation where city is one of the attributes; then there is one bit per block of the file containing sales. That is the data structure being used, which is why it is called a bitmap index scan. This bitmap is pre-built in the case of bitmap indices, but here it is not pre-built; it is built on the fly. We go through the list of record IDs fetched from the index, and each record ID contains a page number, so we mark the corresponding bit as 1. Initially all the bits are 0, so I will show them as blank. Suppose the first record ID was from the fourth page: that bit is set to 1. The next one is from the second page, so that bit is set to 1. The third one was from page 7, so we set that bit to 1. The next one may be from page 1, so we set that bit to 1.
So, what we have now is a number of bits which are 1, and the other bits are 0; I will make the 0s explicit here now. Next, a file scan of the sales relation is done, but during the scan I only fetch pages whose bit is set to 1; if the bit is 0, I do not fetch the page. What is the point of this optimization? I am no longer doing random I/O. I am reading the sales relation sequentially, with essentially no seeks; seeks are very rare in effect, and we will pretend there are none. I am just reading the relation consecutively but skipping over the pages whose bits are 0, so I do not incur the cost of fetching those pages. Theoretically there is a seek: if you skip a bunch of pages and the next page is on another track, yes, you have to seek to that track, but those seeks are cheap, and if I have a long string of consecutive 0s there is a seek which may be a little longer, but it is still a local seek, most likely not a seek to the other end of the disk. Again, there is no guarantee, as I said; we have no control over the placement of data on disk, but the number of actual long seeks required here is going to be much smaller, so we can ignore it.

So, the number of pages fetched is now small: we are not doing a full scan, which would fetch every page, we are only fetching a few pages, and moreover, because the data is accessed in order, the number of seeks is much smaller. What is happening is that I am fetching only the relevant pages and I am avoiding the seeks. If the number of matching pages were just one or two, the overhead of creating this bitmap, setting the bits, and then going over it would be unnecessary, but it is not a big overhead anyway. If, on the other hand, there are lots of matching records, then many of the bits may be set to 1. If I had done a plain secondary index scan and accessed those records directly, I would jump from here to there to there, and assuming this is a large relation, most of the time the data would not be in the buffer and I would be doing an actual I/O each time, so the cost would be very high. Now, when I bring a page in, if that page has five matching records, I get them all together, because I fetch the page and then scan inside it to find all the matching records. Earlier I would have done, in the worst case, five separate I/Os for that page, because there were five separate entries in the index; now, when I fetch that page, I fetch it exactly once, find all five records by scanning inside the page, and then that page is done; I never have to look at it again. So, each page is fetched at most once, and the number of seeks is pretty small. It is a very efficient strategy, actually a very nice algorithm. You can think of it as a hybrid between an index lookup, as I will call it (the practical community which builds databases calls it an index scan), and a file scan, which scans the entire set of records in the relation. When you do today's lab, you are going to see several occurrences of the bitmap index scan.
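Here is a minimal sketch of the idea in Python; the index and file APIs (lookup, num_pages, read_page) are hypothetical stand-ins, and a real implementation, PostgreSQL's included, is considerably more involved (buffer management, lossy bitmaps, and so on).

```python
# A sketch of the bitmap index scan idea with made-up helper APIs.
def bitmap_index_scan(index, data_file, search_key, predicate):
    # 1. Use the secondary index to get record IDs (page number, slot).
    rids = index.lookup(search_key)           # hypothetical index API

    # 2. Build the bitmap on the fly: one bit per page of the data file.
    bitmap = [False] * data_file.num_pages()  # hypothetical file API
    for page_no, _slot in rids:
        bitmap[page_no] = True

    # 3. Scan the file in page order, fetching only pages whose bit is 1,
    #    and check the predicate on every record in each fetched page
    #    (this is the "recheck" step you will see in the plan).
    for page_no, bit in enumerate(bitmap):
        if bit:
            for record in data_file.read_page(page_no):
                if predicate(record):
                    yield record
```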
In fact, if you look at the plan, it will have another operation called something like recheck, a recheck of the predicate or filter. The idea is that when I fetch a page whose bit is set to 1, that page has some relevant records but may also have a lot of irrelevant ones. Because I am simply going through all the records in the page, without going back to the index, I have to check which records satisfy the selection; that is the rechecking of the predicate that was originally applied on the index. Now, why not just keep all the record IDs and fetch exactly those records? The issue is that the number of record IDs may be pretty large, so keeping all of them may be expensive; it is easier to just read the block and filter within it. I hope that is reasonably clear.

So, let us come back to selections using indices. Those were the basic algorithms; now, what about complex selections? What we saw so far were simple selections like r.A = 5, or r.A between 7 and 27, and such like. But conditions can be conjunctions; what do we do then? If I have an equality condition on two attributes, maybe I can use a composite (multi-key) index, if one is available; that is one algorithm. On the other hand, if one is not available, suppose I have a selection on student where department name = 'Comp. Sci.' and total credits > 100. I could use the index on department name to find the relevant record IDs, or I could use an index on total credits to find the records with more than 100 credits. So, I have a choice, and I will pick whichever of the two is estimated to be cheaper. If the number of records with department name equal to Comp. Sci. is small, then maybe that index is the good one: I find all the Comp. Sci. records, fetch them, and then apply the condition total credits > 100. On the other hand, if the optimizer decides that few students have more than 100 credits, it is more efficient to use the index on total credits, fetch those records, and then apply the filter on department name; that plan is chosen. That is algorithm A7, and there are a few other algorithms which I have not shown here, one of which is: fetch the record IDs using both of these indices and then intersect the record ID sets to find which records satisfy both conditions. By the way, A7 can be used even if I have only one of the indices; I do not need indices on all the selection conditions. Maybe one condition involves an indexed column and another is on a column which is not indexed; that is not a problem. I use the index for the indexed column and test the other condition after fetching the record into the buffer. The bottom line is: given any selection condition, there are many possible algorithms with different costs, and it is the job of the optimizer to pick the cheapest among them.

The next topic is how to do the remaining operations, and it turns out there are two basic ways of doing pretty much all the other operations: they are based on sorting and on hashing. All of these algorithms work even if the underlying relations are much bigger than memory. This is one of the nice things about databases: they do not die if the data is larger than memory.
Things which people hand-built earlier had this problem: they would work perfectly fine with small data, but the moment the database size became bigger than memory, they would die. Databases do not behave like this; they behave very nicely even on very large data. So, what is the trick? The trick is to use algorithms which sort the data, or, as we will also see, hash-partition the data, so that even if it is large things still work. The very first such algorithm is sorting. How do you sort data which is much larger than memory? We will use sorting as underlying infrastructure for many other operations. This is the external sort-merge algorithm. I am going to show you the picture first and then describe the algorithm.

So, here is the picture. First we have the initial data, which is big and not sorted; our goal is to sort it on the first column, the first letter here. For illustrative purposes I am going to assume that memory is very small, so this relation is much bigger than memory: in our case memory is essentially big enough to hold 3 records at a time for sorting. That is really a toy size, but it illustrates what we are doing nicely. So, I read 3 records from the relation into memory and do an in-memory sort. The first 3 records were G, A, D; reading them into memory and sorting gives A, D, G, which I write out to a file. Similarly, I pick the next 3 records, C, B and E, sort them, and write them out; the sorted order is B, C, E. Notice that the other attribute goes along: B 14 comes first, C 33 is second, and E 16 is third. These two runs have been written out to disk as files. Next I read the following 3 records, R, D and M, sort those and write them out, giving D, M, R, and finally P, D, A sorted gives A, D, P. So, at the end of this step, which is called run creation, I have these 4 files, each of whose size is approximately the size of memory. It is actually a little less, because we keep some buffer blocks for reading in the relation and a buffer block for writing out the output, but it is roughly the memory size in blocks; we will come to that in a moment.

The next phase is merging. Here we are doing a binary merge, which you are all familiar with, but in general what you will have is a k-way merge, because you do not want to merge only 2 runs at a time. There is typically enough memory to read in 1 block each from a large number of runs, so in one pass I can merge maybe 100 runs, and the number of passes needed will be much smaller. In this toy case, because memory is so small, we can only read in 1 block from each of 2 runs plus 1 block for output, so it is only a 2-way merge. So, we take the first 2 runs and merge them. We look at these two, A D G and B C E. Compare the first records, A and B: A is smaller, so A 19 is output and the pointer moves to D 31. Between D 31 and B 14, B is smaller, so B 14 is output and that pointer moves to C 33. Between C 33 and D 31, C is smaller, so C 33 is output and the pointer moves to E 16. Between D 31 and E 16, D is smaller, so D 31 is output. Now the comparison is between G 24 and E 16: E is smaller, so E 16 is output, and there are no more records in that run.
So, I simply output all the remaining records in the other run, which is just G 24. The merge resulted in one run which is, in this case, twice the size. If I had merged 100 separate runs, the resulting run would be 100 times the size of any one of the input runs. Similarly, I merge the next 2 runs to get another run, and in this case I need one more pass to merge these two longer runs to get the final result. In that merge pass too the input runs are much bigger, but I still need only 1 block to read data from each input run and 1 block to hold the output before it is written. So, even though the runs are much longer, I can still merge 2 of them using the really tiny memory I have. That is the final output, and that is the intuition.

Now let us look at the algorithm. The first step is to create sorted runs, which works as follows. Repeatedly read M blocks of the relation into memory, where M is the memory size in pages or blocks (I will use the terms page and block interchangeably; they mean the same thing). That is the amount of memory available for sorting. Do an in-memory sort; any standard in-memory sort will do, quicksort, in-memory merge sort, or whatever else. Now I have an in-memory sorted run, and I write it out as run R_i; initially i is 0, so the very first write goes to run 0, the next time to run 1, then run 2, and so on. Let the final value of i be N; in other words, I have created N separate runs.

Now we want to merge those N runs, and for now we will assume that the number of runs is less than the number of blocks of memory available, so I can keep one block of memory for buffering each of the N input runs and one block to buffer the output. Why do we want to buffer? A block is fairly big: a few kilobytes at minimum, and 16 kilobytes is fairly common these days. One block is the preferred unit for reading from disk; you could read a single sector, but that is too small. When I read a block, I get a lot of records and I can use all of them; if I read just one record at a time, I would have to read the block again later to fetch the next record, and the number of seeks would explode. Here I do only one seek per block, and that is why the buffering is important. Now I read the first block of each run into its buffer page, and then repeatedly do the following: select the first record, in sort order, among all the buffer pages. This is where a 100-way merge is also possible: among all 100 runs I find which run has the smallest first record, and that is the smallest record overall. I write that record to the output buffer and delete it from its input buffer page. At this point the buffer page for that run may become empty, in which case I read the next block from that run into the buffer, and then I continue with the merge, repeating these steps. Of course, at some point a particular run may become empty, meaning we have completely processed it, and from then on we ignore it. Finally, there may be only one non-empty run left, and I simply output all its records. That is a very simple merge algorithm; it was a one-step merge.
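To tie the picture and the algorithm together, here is a minimal in-memory mock-up in Python: each run is just a Python list standing in for a file on disk, one record stands in for one block, and the second attributes of the later records are made up, since the lecture only spells out the first few. heapq.merge plays the role of the N-way merge, keeping one buffered position per run.

```python
# A toy mock-up of external sort-merge; real runs live on disk, not in lists.
import heapq

def create_runs(relation, M, key):
    # Phase 1: read M records (standing in for M blocks) at a time,
    # sort in memory, and "write out" each sorted run.
    return [sorted(relation[i:i + M], key=key)
            for i in range(0, len(relation), M)]

def merge_runs(runs, key):
    # Phase 2: N-way merge; heapq.merge repeatedly emits the smallest
    # head record across all runs, like the one-block-per-run buffers.
    return list(heapq.merge(*runs, key=key))

relation = [("g", 24), ("a", 19), ("d", 31),   # values beyond the first
            ("c", 33), ("b", 14), ("e", 16),   # two runs are made up
            ("r", 10), ("d", 21), ("m", 3),
            ("p", 2),  ("d", 7),  ("a", 14)]

runs = create_runs(relation, M=3, key=lambda rec: rec[0])
print(runs[0])                                  # [('a', 19), ('d', 31), ('g', 24)]
print(merge_runs(runs, key=lambda rec: rec[0])) # fully sorted on the letter
```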
In fact, with today's memory sizes, a multi-step merge is rarely needed. When I create the initial sorted runs, I typically have at least several hundred megabytes of memory. Say I have 128 MB, a common size; then I create runs each of size 128 megabytes. If each buffer block is, say, 100 kilobytes, then 128 megabytes of memory gives me roughly 1,280 buffer blocks, so I can merge roughly 1,280 runs at a time; let us round that off to 1,000 runs. If I merge 1,000 such runs, each of 128 megabytes, then in one merge pass I can sort a relation of about 128 gigabytes, which is pretty big, using 128 megabytes of memory, which is pretty small. And if you are dealing with even larger relations, you are almost surely using a system with much larger main memory. So, practically speaking, a multi-step merge is not required anymore today. But in case you need it, what happens is that each merge pass divides the number of runs by some factor. What is that factor? If I am doing a 100-way merge and I had 1,000 runs initially, I take the first 100 and merge them into one run, the next 100 into another run, and so on; the number of runs after that pass is 1,000 divided by 100, which is 10. So, each merge pass divides the number of runs by a factor of, technically, M - 1, because one block is reserved for output. The total number of merge passes can therefore be estimated as follows (forget the other details on the slide): b_r is the number of blocks of the original relation, M is the memory size in blocks, and b_r / M is the number of initial runs, since I read M blocks at a time, sort them in memory and write them out. Each pass decreases the number of runs by a factor of M - 1, so the number of merge passes is ceil(log_{M-1}(b_r / M)). Practically, we will assume it is just one merge pass, and we will skip the further details of the cost analysis of merging.
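As a quick sanity check of that formula, here is the arithmetic from the example above in a few lines of Python; the block size, memory size and relation size are just the assumed round numbers used in the lecture.

```python
# Checking ceil(log_{M-1}(b_r / M)) with the assumed numbers above:
# 100 KB blocks, 128 MB of memory (M ~ 1280 blocks), ~128 GB relation.
import math

def merge_passes(b_r, M):
    initial_runs = math.ceil(b_r / M)
    if initial_runs <= 1:
        return 0                       # fits in memory: no merge pass needed
    return math.ceil(math.log(initial_runs, M - 1))

M = 1280                               # memory size in blocks (assumed)
b_r = 1_280_000                        # ~128 GB relation in 100 KB blocks
print(merge_passes(b_r, M))            # -> 1, a single merge pass suffices
```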