So, what is query processing? We all know what queries are; the question is how they are processed. The basic answer is that there are several steps. The first step is to parse the query and translate it into a lower-level representation, which is usually relational algebra or some variant of it. Many databases actually preserve the select-from-where structure, but recall that the from clause is a cross product, the where clause is a selection and the select clause is a projection, so for all practical purposes the representation they use is relational algebra. Now, the actual execution has to be done by a series of relational algebra operations which take relations and perform selections, joins, projections and so on. The given query, if you execute it literally as a relational algebra expression, can be very, very inefficient. If you take a cross product, you are dead: on large relations a cross product would take forever to compute and its result would be enormous, so you really cannot afford to do that. What you need to do is turn the query into a series of joins on meaningful join attributes, not a cross product, plus selections, projections, group by and other relational algebra operations, to compute the final result of the query. Now, there are many relational algebra expressions which are equivalent to the same query, and there are two issues here. First of all, given a relational algebra operation, how do you implement it? You have to do a join; how do you perform the join? There are in fact many ways to perform a join, and we are going to see at least three or four of them in the next hour or so. So the first question is which of these ways to use for a join. The second question is that there are many, many relational algebra expressions which are equivalent to the same query, meaning they will produce the same result, but they have different efficiencies: some will take more time, some will take less. So the question is how to pick an overall query plan which is relatively cheap. We want an execution plan, which is basically a relational algebra expression annotated with instructions such as "do this join using this particular join method" or "do this selection using an index" and so forth. The optimizer is the module which takes the input expression and statistics about the database. Why statistics? Because to compare two query plans and decide which one is cheaper, you have to estimate the cost of each plan. You cannot actually run them, because that would take too long and there are many plans. So what you do is estimate how much time each plan would take and, from the alternative plans, pick the cheapest one. That is the job of the optimizer. The optimizer outputs the plan it has estimated to be the best, and that is passed on to the evaluation engine, which takes the actual data, executes the plan and outputs the result. So, our coverage of query processing is split into two parts. One part is the evaluation engine, where we are going to focus on how to implement the different ways of doing a join, select, aggregation, union, intersection, etcetera. The other part is, given a query, how do you find the best plan for it? That is the optimizer. The optimizer is actually fairly complex.
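To make that correspondence concrete, the generic select-from-where skeleton maps to relational algebra as follows; this is the standard textbook notation, where the A_i are the attributes in the select clause, the r_j are the relations in the from clause, and P is the where predicate.

```latex
% SELECT A_1, ..., A_n  FROM r_1, ..., r_m  WHERE P
\Pi_{A_1, \ldots, A_n}\bigl(\sigma_{P}\,(r_1 \times r_2 \times \cdots \times r_m)\bigr)
```

Executing this expression literally, with the cross product computed first, is exactly the inefficient plan we want the optimizer to avoid.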
We do not have time to cover it in great detail, so what I am going to do is give you only an outline of optimization, which is chapter 13. Chapter 12 I am going to cover in a little more detail, so that you have an idea of how things are actually executed in the database. In the lab you will see query plans in PostgreSQL to understand how queries are evaluated, and you will see the connection between what I am talking about today and the corresponding things in the query plans you will be seeing tomorrow. I have already told you about parsing, translation and evaluation. I claimed that a query may have many equivalent relational algebra expressions, so let us take a very simple example. Here is a query which projects the salary attribute of instructor: it takes all the instructor rows and outputs just the salary attribute, and then selects salary less than 75,000. It is a fairly pointless query, but a very simple one; its goal is to output all salary values which are less than 75,000. Now, here is another way you could have executed the same query. First you take all the instructor tuples and eliminate those whose salary is greater than or equal to 75,000; in other words, you select those with salary less than 75,000. That is the inner part of the expression, after which you apply the projection operation on that result. So what is the difference? Here, before doing the projection, we already check whether it is worth doing: if the salary is not less than 75,000, we are not going to project that tuple at all. The difference here is not very big, but it illustrates the point that there may be multiple plans for a query. As I said, there are many algorithms to evaluate each operator, to evaluate the select, the join and so forth. As an example, take the selection operation salary less than 75,000. One way is to scan the whole relation, look at each tuple, look at its salary value and output it if it is less than 75,000. This is called a linear scan of the relation. But if there is an index on the salary attribute, an alternative plan does the following. It goes to the index looking for values less than 75,000: it starts at the minimum salary, minus infinity, or zero if salaries cannot be negative (I assume they cannot), which is the left end of the index on salary. Then it goes along the leaves, finding all the salaries which are less than 75,000, and when it hits the first one which is greater than or equal to 75,000 it stops. That gives the result we want. So that is an alternative, using the index if it is present. There are two ways of doing this selection, and it is the job of the optimizer to decide which way to do it. We are going to see how to implement all these alternatives in chapter 12 and very briefly discuss how to choose the most efficient one in chapter 13. So, as I said, query optimization is finding the lowest-cost plan amongst the alternatives, and the database catalog has statistics such as the number of tuples in each relation, the size of the tuples and the number of distinct values. For example, how many distinct departments are there? If I have a query on salary, I want to know roughly how many people have salary less than 75,000. I do not need the exact number, but to choose the best plan I may need a rough estimate of how many have salary less than 75,000.
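For reference, the two equivalent expressions just described can be written side by side in the usual relational algebra notation; the first projects and then selects, the second selects first and then projects, and both produce the same result.

```latex
\sigma_{salary < 75000}\bigl(\Pi_{salary}(instructor)\bigr)
\;\equiv\;
\Pi_{salary}\bigl(\sigma_{salary < 75000}(instructor)\bigr)
```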
How do I get this estimate very quickly? We will see a little bit of that in chapter 13. So now, if I say I want to pick the cheapest plan, how do I measure the cost of a plan? One way is to estimate the exact CPU time plus disk I/O and so on, so that I can say that on this system this query will take 52 seconds to run and that one will take 17 seconds. Unfortunately, it is very, very difficult to actually estimate this number. The first reason is that the exact time taken depends on what is in the file system buffer and in the database buffer at that time. If the buffers are empty, every read may turn into an I/O; if a lot of the data is in the buffer, the system just gets it from the buffer and does not go to the disk. So the actual time may vary significantly depending on when the query is run, which makes it very hard to estimate. The second problem is that the time taken may depend on how many CPUs the system has, how many disks it has, and so forth. So, instead of going into all this complexity, what many databases do is estimate, not the actual execution time, but the amount of effort needed to answer the query. What do I mean by effort? It is measured by the amount of CPU time required, the amount of disk I/O required, the number of distinct I/Os, the number of bytes transferred and so forth. Note that this effort does not directly correspond to the running time, because while I am doing CPU work a disk I/O may also be in progress in parallel; so although the effort is CPU time plus disk I/O time, the two may overlap and the query may finish faster. There are optimizers which take this into account, but in our simple model we do not; we just estimate the total effort. As a further simplification, to keep our life simple, we are going to ignore CPU time entirely. Real optimizers do take CPU time into account, but for a database system running on disk it is often the case that disk I/O dominates the time taken, especially random I/O. So what we are going to do is estimate the number of seeks a particular operation will require, the number of blocks that will be read and the number of blocks that will be written. What do I mean by seeks? If I do a random I/O, I say go to disk page 734 and read that page; that requires a seek and then one block is read. Or I may say go to that same block and then read 1000 blocks from there consecutively, in which case I have a single seek followed by 1000 block transfers. So, in our simplified cost model in the book, we estimate the number of seeks in the worst case. Why worst case? Because I do not know what will be in the file system buffer, or even in the operating system buffer, when the query is run; I may find the plan for the query now and run it after 5 minutes, at which point everything has changed. So it is going to be worst case: assuming there is nothing in the buffer, how many seeks will be required, how many blocks will have to be read and how many will have to be written. We are going to multiply each of these by the average seek cost, average block read cost and average block write cost. Note that writes are usually slower than reads. In this way we can estimate the disk cost of a query, and this is the measure which we use in this book.
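To see why the number of seeks matters so much, here is a small back-of-the-envelope comparison of the two access patterns just mentioned. The per-operation times are only illustrative values of the kind a magnetic disk might have, not measurements from any particular system.

```latex
% Illustrative values only: t_S = one average seek, t_T = one block transfer
t_S \approx 4\,\text{ms}, \qquad t_T \approx 0.1\,\text{ms}

\text{1000 consecutive blocks: } 1 \cdot t_S + 1000 \cdot t_T \approx 104\,\text{ms}

\text{1000 randomly placed blocks: } 1000 \cdot (t_S + t_T) \approx 4100\,\text{ms}
```

The same number of blocks is read in both cases; the factor of roughly forty comes entirely from the seeks.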
It turns out this is a surprisingly effective measure for query optimization, although it fails in some cases. In reality you need to add the CPU time as well, but CPU time plus disk I/O works quite well in practice. So, some notation: we are going to use t_T as the time to transfer one block. Again, for simplicity, we treat reads and writes as costing the same, just to keep our description easy; real optimizers will take the actual costs into account. t_S is the time for one seek. So, if an operation requires b block transfers plus S seeks, the total time is the number of blocks transferred times the time to transfer one block, plus the number of seeks times the time for one seek: b * t_T + S * t_S. I assume the blocks within one transfer are consecutive; if the blocks are not consecutive, the extra seeks are added on. This is the sort of equation we will be using all over the place. As I said, the cost of an actual execution depends on what is in the buffer. The second thing is that for several algorithms, such as sorting, the cost is greatly influenced by the amount of buffer space they are given. In our estimation we are going to assume some minimum buffer size is provided, which they need, but many of them can actually run faster if more buffer space is provided. Again, real optimizers may take this into account in more detail; we are going to keep life simple, assume each operator is given some minimum it needs to work efficiently, and use worst-case estimates. So, that was the preliminaries; now let us look at the actual algorithms. The first set of algorithms is for selection. How do you do selection? The simplest way is a file scan: read all the blocks of the relation, take each record from each block, check if it satisfies the selection condition, output it if it does, otherwise skip it. This is called linear search, and we will call it algorithm A1. Now, if you did a search on an in-memory array and the array were ordered, you would certainly have been told in a data structures course to use binary search on the array rather than reading it sequentially. It turns out binary search does not directly help in a database system, because the data is not actually stored consecutively on disk; even if the file is ordered, you need some kind of index on top to find things. So we are going to ignore binary search. The next alternative is to use an index if one is available. So when can you use an index? It depends on the condition in the selection. If the selection condition is something crazy like r.A + 5 * r.B < r.C, no index supports that kind of condition; for complex conditions like this I have to read each individual record, find r.A, r.B and r.C, compute the expression, check if it holds, output the record if it does and move on if not. So I cannot use an index for certain selections, but most selections are much simpler, of the form "select all records with ID equal to 12121". That is a very simple selection and an index can be used for it: I search down the B+-tree index and find the record or set of records. The algorithm is the same in each case, and it is called an index scan, but the cost of the algorithm depends on several factors. It depends on whether the index used is a primary index or a secondary index.
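Before moving to the index-based variants, here is a minimal sketch of algorithm A1, the linear search just described. The block-at-a-time structure mirrors the cost model above (one pass over all b_r blocks); the toy relation and predicate are just stand-ins for illustration.

```python
# Minimal sketch of linear search (A1): scan every block of the relation,
# test each record against the predicate, and emit the matches.
# The "relation" here is just a list of blocks, each block a list of records.

def linear_scan(relation_blocks, predicate):
    for block in relation_blocks:     # one block transfer per block (b_r transfers, few seeks)
        for record in block:          # per-record CPU work, ignored in our cost model
            if predicate(record):
                yield record

# Toy usage: instructors with salary below 75000.
blocks = [
    [{"name": "Srinivasan", "salary": 65000}, {"name": "Wu", "salary": 90000}],
    [{"name": "Mozart", "salary": 40000}, {"name": "Einstein", "salary": 95000}],
]
print(list(linear_scan(blocks, lambda r: r["salary"] < 75000)))
```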
It also depends on whether the condition is an equality on a key, meaning only one record will be retrieved, or an equality on something which is not a key, meaning multiple records may be retrieved, or a range selection, again meaning multiple records may be retrieved. So the cost depends on these things, and we have listed out several variants with their associated costs. I am going to cover one or two of these and you can read the rest offline. The first one is a primary index with an equality on a key. Say I ask for ID equal to 12121; that is one record. What is going to happen? I take the B+-tree, start from the root and search down. Each time I go from a node to the next level, I assume that node is not in memory at that point, that it is not in the database buffer, so I may have to perform an I/O and fetch it from disk. The number of such I/Os is h_i, the height of the index, and then there is a plus 1 because when I hit the leaf level I have a record pointer to the actual record (that is the default assumption), so I have to go and fetch the record as well. So there are h_i + 1 operations, each of which seeks to a random place on the disk and reads one block. So what is the cost? It is (h_i + 1) times (t_T + t_S), where t_T is the transfer time for one block and t_S is the average time for one seek. That is how we estimate the cost of this operation: equality on a key using a primary index. Now there are variants. Suppose I still have a primary index and the selection is an equality, but no longer on a key. Then I will find the first record, but there may be many more records with the same value; the good thing is that since it is a primary index, the relation is sorted on this attribute, so all those records are going to be contiguous in the file, and I just have to read them one after another with no more seek operations. The cost then becomes h_i * (t_T + t_S) to go down the index, plus one more seek, t_S, to reach the first record, plus the time to transfer the blocks. How many blocks are transferred? In the previous case, equality on a key, only one block is transferred; now there may be many consecutive blocks which are all fetched, so we add b * t_T, the time to read all b of those blocks. So that gives you a flavour of how we do the estimation. The next case is a secondary index with an equality on something which is not a key, and this turns out to be an interesting issue. If I use a secondary or a primary index with an equality on a key, it actually does not matter; the cost is the same. How does it matter whether the records are consecutive or not, when I search down the index, find one pointer and fetch that one record? But with an equality on a non-key attribute, say department, and a secondary index, the set of all people in the given department, say CS, is going to be scattered around the file; the records are not consecutive. So if it is a large relation, I may end up doing one I/O for each record fetched. What does the index have? The index has one pointer for every such record. So I get the required record IDs from the index, but for each record ID I have to go and do an I/O operation to fetch that record. So what does the cost work out to?
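Before working that out, here are the two primary-index cases just covered, written in the notation of our cost model (h_i is the height of the B+-tree and b is the number of contiguous blocks holding the matching records); the secondary-index case is answered next.

```latex
% Primary index, equality on a key (exactly one record):
(h_i + 1)\,(t_T + t_S)

% Primary index, equality on a non-key (b contiguous blocks of matches):
h_i\,(t_T + t_S) \;+\; t_S \;+\; b \cdot t_T
```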
The cost comes to (h_i + n) * (t_T + t_S): the h_i part is for traversing down the tree, and then, if I estimate that there are n records satisfying department equal to CS, for each of those n records I have to do a seek and a transfer. That is the cost formula here. Now, if you look at it, it may not seem like a big deal. The problem is that each t_T + t_S is of the order of 10 milliseconds, maybe a little more; let us say 10. Now, what is h_i? We took great care to build B+-trees whose height is small; it is typically 4 or 5 at most even for very large relations, because the nodes are very fat, so the trees are short and fat. So h_i is small. The problem is n. What if n is 1000? That does not seem like a lot of records to fetch, but 1000 records times 10 milliseconds is 10 seconds just to fetch those records, which is a little slow. What if the number of records is 100,000? Now we have a serious problem: it is going to take about 1000 seconds to fetch those 100,000 records, and that is a long time, well over a quarter of an hour; you are going to twiddle your thumbs while these records are fetched. So the moral of the story is that if you use a secondary index and fetch a lot of records using random I/O, it may not make sense. Instead of spending all that time, it may turn out to be much cheaper to just scan the whole relation and find which records satisfy the condition. What is the difference? When you scan the whole relation you are not doing random I/O; you are reading the blocks almost entirely consecutively. The blocks will occasionally require some seeks, but that number will be small, because operating systems take a lot of care to keep the blocks of a file together. So there will not be too many seeks, and as a result we may finish the entire file scan in a minute or two, as opposed to many minutes with random I/O. So, moral of the story: even if there is an index, sometimes it is best not to use it, particularly when there is a secondary index on a non-key attribute, and query optimizers actually take these factors into account and will choose the correct plan. This, by the way, is a major motivation for why relational databases succeeded. In the days before SQL and optimizers, programmers had to deal with all of this; they had to figure out which was the best plan. A very smart programmer would usually get the best plan, but if the system changed after that, whatever assumptions that programmer made no longer held, so what was a good plan before was no longer a good plan, and you had to get that programmer back to rewrite all the queries, which is very painful. So when System R, the original relational database prototype, was built, one of the important things they did was build an optimizer, which even today is regarded as a major milestone in the progress of relational databases. Before the System R optimizer, people viewed relational databases as a toy. They said, hey, you know, you guys are a bunch of researchers playing around with toys; we are real men, we run our airline reservation systems and bank databases, and we are not going to fool around with SQL queries which run very slowly. But once there was a very good optimizer, these people could go back and say: let us see what your programmers did, and let us see what our optimizer does. And they could actually convince them, though it took a while.
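To make the trade-off concrete, here is a small sketch of the comparison an optimizer effectively performs when choosing between the secondary index and a file scan. The timing constants are the same illustrative values used earlier, and h_i and b_r are made-up figures, not catalog statistics from any real system.

```python
# Sketch of the choice between a secondary-index scan and a linear file scan.
# Illustrative timing constants (seconds); real optimizers use catalog statistics.
t_S = 0.004      # one seek
t_T = 0.0001     # one block transfer
h_i = 4          # height of the B+-tree

def cost_secondary_index(n_matching_records):
    # traverse the tree, then one random I/O per matching record
    return (h_i + n_matching_records) * (t_T + t_S)

def cost_linear_scan(b_r):
    # one initial seek, then read all b_r blocks sequentially
    return t_S + b_r * t_T

b_r = 1_000_000  # blocks in the relation (illustrative)
for n in (100, 1_000, 100_000):
    idx, scan = cost_secondary_index(n), cost_linear_scan(b_r)
    choice = "use the index" if idx < scan else "use the file scan"
    print(f"n = {n:>7}: index {idx:8.1f} s, scan {scan:8.1f} s -> {choice}")
```

With these numbers the index wins easily for a few hundred or a few thousand matches, but once the matches run into the hundreds of thousands the sequential scan is cheaper, which is exactly the moral of the story above.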
It took, you know, 10 or 15 years to convince everybody that a relational database could actually run very fast and could beat a programmer in many situations, and that is when the older line of databases, the network and hierarchical databases, eventually started losing out, by the mid 80s. So it took almost 10 years from the first relational prototypes for people to say: now it is real, and I can use it to build real, solid applications. So, coming back, this was an example of why an optimizer can do a good job by looking into the details. Then there are more kinds of selections to deal with: if a selection has an AND of multiple conditions, what do you do? There are such cases; I am going to skip them because we do not have too much time, but you can go read it up from the book, it is fairly straightforward stuff. So now let us ask a very simple quiz question, the first quiz for today. Given a choice between a secondary index and a file scan to answer a query "select * from R where P", where P is some predicate, the options are: one, it is always better to use the secondary index; two, it is always better to use the file scan, that is, the linear scan (file scan is another term for linear scan); three, it depends on the number of records fetched; four, none of the above. Those are the options. Do not click the answer yet. Participants, please activate your remotes and be ready when we tell you to submit the answers. We are going to start the timer in just a few seconds, please be ready. You can answer the question at this point; please go ahead and select your option. So, coming back, the answer here, as I just discussed, is that it depends. If the predicate P is such that I can use the index and only a few records satisfy it, then I have only a few I/Os to fetch those records and the index is a good idea. On the other hand, if I use the index and find that a thousand records scattered all over the relation satisfy the predicate, I have to fetch all those thousand records in random fashion, and then the random I/O, the seeks, becomes very expensive, at which point it may be cheaper to do a file scan. So in that case the file scan may be cheaper than using the index directly. The correct answer is 3: it depends on the number of records fetched by the query. Let us see how many centres have responded; it looks like very few centres have actually managed to upload their answers. Still, 129 responses, that is good; of those centres that did respond, a lot of people have answered. The winner, by a very narrow margin, is option C, which is the correct option; as I discussed, it depends on how many records are fetched. Now, there is an interesting twist to this which is actually implemented in PostgreSQL. In your lab, where you will see query plans, you will see something called a bitmap index scan. Let me write it here. The term is a little confusing, because there is another index type, as I mentioned, called a bitmap index; the two have no connection. What PostgreSQL calls a bitmap index scan is actually something slightly different. What PostgreSQL does is say: here is an index; I can find the records which satisfy the predicate using the index, but if the number of such records is large, I may end up doing a lot of random I/O. The trick is: can I somehow use the information from the index to improve the sequential file scan? The answer is that it maintains a bitmap with one bit per page.
It can actually do tricks to reduce the space even further, but for simplicity let us assume one bit per page. All the bits are initially 0. Now, what it does is go down the index and find the record IDs of all the records, however many there are, which satisfy the index condition. It goes to the corresponding bits in the bitmap and sets them to 1; the remaining bits stay 0. So we have a very large bit array, and if there are 1000 matching records, up to 1000 bits of this array may be set to 1 (maybe fewer, since several records may fall on the same page); the rest remain 0. Then it does a relation scan, but now there is a trick: whichever page has a 0 in the bitmap, it is not going to fetch. It skips those pages and goes directly to the pages marked 1. So it reads the relation in sequential order, it is a file scan, but a file scan which skips a lot of pages. Why bring those pages in and look inside them, when we have already used the index to figure out that all the pages which remained at 0 contain no relevant record? So we skip them. What we have here is a hybrid strategy which combines an index scan with a file scan. Tomorrow you will be looking at such plans. When people first see this operator popping up, they do not know what this bitmap index scan is, and this is what it is: it uses the index to create a bitmap indicating which pages contain matching records, and then it does a file scan which fetches only those pages. Moving on, the next set of operations are all based on one of two basic operations: either sorting or hashing. First let us look at how to sort data, and then, using sorting, we will see how to do a number of other operations. Sorting is a very fundamental operation in databases, and there is also the alternative of hashing, which turns out to be quite symmetrical to sorting. So how do you sort data in the first place? One naive way is to build an index on the relation: you build an initially empty index, insert the records of the relation one by one into the index on whatever key you are sorting by, then read the leaf pages of the index to find the record IDs and fetch the records one at a time. It turns out this is a terrible way of doing sorting for very large relations, because you may end up doing a lot of random I/O, one random I/O per record, which is very, very slow. So what is actually done has been well known for many years; from the earliest days of computing, sorting was known to be an extremely important operation. In fact, it was especially important in the days when there were no hard disks, memory was tiny and all data used to be on tape: the only way to do bulk data processing was to sort data between tapes and then use it to join information. So sort is a very well studied operation, and in particular sorting where the data does not fit in memory is called external sorting; we are going to look at external sort-merge. There are other ways of doing external sorting, but let us focus on external sort-merge. What does it do? The first step of external sorting with this algorithm is as follows. If we have a certain amount of memory, we read that much data from disk into memory and run an in-memory sorting algorithm. So what we have in memory is sorted data, but it is not all of the original relation; it is as much of it as fits in memory.
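Here is a minimal sketch of that hybrid strategy under simplifying assumptions: the index is just a precomputed mapping from search-key values to (page, slot) record IDs, and the bitmap is a list of booleans, one per page. It is meant to illustrate the idea behind what PostgreSQL calls a bitmap index scan, not its actual implementation.

```python
# Sketch of a bitmap-index-scan style selection: use the index to mark which
# pages contain matches, then scan the file sequentially, fetching only those pages.

def bitmap_index_scan(pages, index, search_key, predicate):
    # Phase 1: build the bitmap (one bit per page) from the index entries.
    page_has_match = [False] * len(pages)
    for page_no, _slot_no in index.get(search_key, []):
        page_has_match[page_no] = True

    # Phase 2: sequential scan, skipping pages whose bit is 0;
    # the predicate is re-checked on the pages that are fetched.
    for page_no, page in enumerate(pages):
        if not page_has_match[page_no]:
            continue                      # this page is never fetched from disk
        for record in page:
            if predicate(record):
                yield record

# Toy usage: pages of instructor records, index on dept_name -> record IDs.
pages = [
    [{"name": "Wu", "dept_name": "Finance"}, {"name": "Mozart", "dept_name": "Music"}],
    [{"name": "Srinivasan", "dept_name": "Comp. Sci."}, {"name": "Katz", "dept_name": "Comp. Sci."}],
]
index = {"Comp. Sci.": [(1, 0), (1, 1)], "Music": [(0, 1)]}
print(list(bitmap_index_scan(pages, index, "Comp. Sci.",
                             lambda r: r["dept_name"] == "Comp. Sci.")))
```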
So, typically these days memory may be a few gigabytes, but you cannot devote all of it to sorting; the system may devote, say, 256 megabytes to a sort buffer. So it will read 256 megabytes of the relation into memory and sort it, and the sorted data is written out to a file on disk. Then it reads the next 256 megabytes, sorts them in memory and writes them out to another file on disk. What this step does is create a number of runs; these sorted files are called runs. Each run is a part of the relation which has been sorted in memory and written out in sorted order. We keep doing this until we have consumed the whole relation and created a number of sorted runs. Now, each run is locally sorted, but there are many such runs, and to get the final sorted relation I have to do a merge: take all these runs, find the minimum element amongst them, output it, find the next minimum, output it, and keep doing this until all the runs are emptied out. Instead of reading the pseudocode here, let us see this by example; it makes it a lot clearer. The initial relation is on the left; it has two attributes and I want to sort it on the first attribute. For this toy example we assume a very, very small memory: it can hold three records, that is it. In reality memory is much bigger, but relations are also much bigger. So we take the first three records, whose keys are g, a and d, sort them in memory and write them out to a file; the first run has a, d, g in that order. Then we read the next three records, sort them and write them out to another run: b, c, e. The next three have keys r, d, m, which go out as d, m, r, and finally p, d, a, which goes out as a, d, p. So what we now have are four runs, each of which is sorted, and now I have to merge them. Ideally I would like to merge all four runs in one go, but in this case our memory is exceedingly small, so there may not be enough memory to keep records from all four runs; we are assuming memory holds three records. So what we do is read one record from the first run and one record from the second run, find whichever is smaller, a here, output it, and then fetch the next record from that run. Now the first two are d and b; b is smaller, so output b and read the next one, which is c; comparing d and c, c is smaller, so output it; and keep going like that. That is how you do a merge. In reality we do not read one record at a time, that is too inefficient; we read a whole block at a time from each run. So what we need in memory is one block of space for the first run, one block of space for the second run and so on. If my memory has, say, 11 blocks, I keep one block aside for output, because output is also done in units of blocks, and with the remaining 10 blocks I can read one block from each of 10 runs; so in one merge pass with 11 blocks of memory I can merge 10 runs and output them as one run. In our toy example we are assuming we can only merge two runs at a time, so the first two runs merge into one run and the other two merge into another run. Now what have we done? We had a number of runs and we have halved it; in general, if we can merge 10 runs at a time, the number of runs after one phase of merging will be one tenth of what it was before.
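Here is a compact sketch of the two phases just walked through, run generation followed by merging. For illustration everything stays in Python lists: the run files are simulated as in-memory lists, the "memory" holds only three records as in the toy example, and the merge is done over all runs at once; the second attribute values in the toy data are arbitrary.

```python
import heapq

# Sketch of external sort-merge at toy scale: "memory" holds M records,
# and runs are lists standing in for sorted files on disk.

M = 3  # records that fit in memory, as in the toy example

def generate_runs(records, memory_size=M):
    """Phase 1: read a memory-sized chunk, sort it in memory, write it out as a run."""
    # sorted() on tuples orders primarily by the first attribute, which is our sort key
    return [sorted(records[i:i + memory_size])
            for i in range(0, len(records), memory_size)]

def merge_runs(runs):
    """Phase 2: merge the sorted runs into one sorted output stream."""
    return list(heapq.merge(*runs))

# Toy usage mirroring the lecture's example keys; second attribute values are made up.
records = [('g', 24), ('a', 19), ('d', 31), ('c', 33), ('b', 14), ('e', 16),
           ('r', 16), ('d', 21), ('m', 3), ('p', 2), ('d', 7), ('a', 14)]
runs = generate_runs(records)
print(runs)              # four sorted runs of three records each
print(merge_runs(runs))  # the final sorted relation
```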
In fact, the fan-in can be much larger; memory is not so scarce these days and I can easily merge 1000 or 10,000 runs with today's large memories. So in reality what happens is that the first step creates a number of runs, and then in almost all cases, except for ridiculously large relations, all of them can be merged in one go into a single output file. So usually only two phases are required: generate the runs, then merge all the runs to get one single sorted result. That is the basic idea; the detailed algorithm is provided in the book. The question is whether this has done better than one random I/O per record, and it turns out it is a lot, lot better. First of all, in the worst case I do one random I/O to read each block (I can actually reduce that even more, but worst case it is one random I/O per block), and similarly when I write it out. Now, a block may have 40 or 100 records, so instead of one random I/O per record it is now more like one fortieth of a random I/O per record. Of course, the data has to be read, written out, read again and written again, so there may be a little more, but overall this reduces the number of random I/Os tremendously. There is a detailed analysis here. The first part of the analysis is how many merge passes we have to do in the worst case, and we can bound that by the logarithm to the base M minus 1 of (b_r / M). What is all this terminology? b_r is the number of blocks in the relation and M is the number of blocks of memory. Remember that one run has M blocks in it, so the number of runs initially is b_r / M. Now, in each merge pass we have M blocks of memory, so we can merge M minus 1 runs at a time, with one block as the output buffer. Memory may be scarce, so I may have to do multiple merge passes, and the number of such passes is log to the base M minus 1 of (b_r / M). As I said, in most cases this is 1. Then there is some analysis of the blocks transferred, which is all based on the number of merge passes, and that is down at the bottom; I am going to skip the details because you can always read them offline. Then there is the number of seeks. During run generation it is on the order of b_r / M, roughly one seek to read and one to write each run, which is still a lot better than one seek per record: b_r, the number of blocks in the relation, is itself much smaller than the number of records, and b_r / M is smaller still. The number of seeks in the merge phase is a little more, and there is a detailed formula; again, for lack of time, I will not get into the details, but this number is not that large. In fact, a commonly used optimization, given that we have a lot of memory, is not to read one block at a time: from a run we read a large number of blocks, and we write a large number of blocks, in one go. The analysis takes that into account: if there is plenty of memory and we read or write b_b blocks at a time, where b_b may be 16 or 32 or some such number, then the number of seeks reduces even more, and the detailed formula gives that count. I am not going to cover it in detail here; you can read it offline.
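For reference, the counting argument just sketched can be written out as follows, with b_r blocks in the relation and M blocks of memory. Run generation reads and writes every block once, each merge pass reads and writes every block once again, and, as in the textbook's analysis, the write of the final result is not counted.

```latex
\text{initial number of runs} = \lceil b_r / M \rceil

\text{number of merge passes} = \left\lceil \log_{M-1}\!\left( b_r / M \right) \right\rceil

% 2 b_r for run generation, 2 b_r per merge pass, minus b_r for the uncounted final write:
\text{block transfers} \approx b_r \left( 2 \left\lceil \log_{M-1}\!\left( b_r / M \right) \right\rceil + 1 \right)
```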
So, the moral of this story: from a complexity viewpoint, external sort-merge is much better than inserting into an index and then reading from the leaves. From a practical viewpoint, there are very efficient implementations which even on a simple machine can sort a gigabyte in just a few minutes, and on a parallel machine they can sort even faster. In fact, there is an interesting story here. There were benchmarks which asked how fast you can sort. The initial benchmarks, long back, 20 years ago, considered a gigabyte to be a lot of data, so they asked how much time your system would take to sort 1 gigabyte, and at that time memories were like 1 or 2 megabytes, so it took a while to sort a gigabyte. Then memories grew, and the gigabyte benchmark suddenly became pointless because it just turned into an in-memory sort; the original applications which had 1 gigabyte of data can actually get away with an in-memory sort today. This is true of many small applications: IIT Bombay's database, as I said, is a few gigabytes in total, so each relation is smaller and can easily be sorted in memory. But there are also databases out there today, and a lot of people have been asking about cloud databases and so on, which actually parallelize over many machines, simply because the amount of data has also exploded. In fact, for many websites the amount of data has grown faster than CPU speeds and faster than disk sizes have increased, which means they have so much data that it has to be sorted over thousands of disks. So sorting at that scale is an important task, and there are benchmarks for very large scale sorting. So, that was sorting; now let us come back and look at other relational algebra operations.