Last few topics on query processing. So far, I have given a very quick, high-level overview of the individual relational algebra operations. All of SQL is translated into these operations, and we get an evaluation plan. What does the evaluation plan say? It says: do this operation, then this one, then that one. Let me explain using this tree; it will be clearer.

The simplest way of evaluating a tree like this is to start at the bottom. Start here: scan department to compute the selection building = "Watson", and output that result to disk. It can be a large result, so you write it out. Now I have finished this part of the tree and the result is stored: it is materialized. Materialization means compute it and store it. Now that this is materialized, I can take the join of these two. I compute the join and materialize it as well. Finally, there is a project operation, maybe with duplicate elimination; I apply that operation and store its result. At each step, I am computing a result and storing it. This is called fully materialized evaluation.

But you can see that maybe we should not have done it this way. Maybe we could have saved the cost of storing intermediate results and reading them back. For example, take the join result. It is being sent to the projection, which may do duplicate elimination. Do we really need to store the result and read it back again? Can we not start the external sort on this result as it comes in? What does the external sort do? I did not cover it, but basically it first creates what are called runs: it fills an in-memory buffer with some data, sorts it, writes it out, and keeps repeating this. Then it merges the sorted runs. You see what is happening here? The join is generating tuples.
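The run-generation phase of external merge sort that I just sketched can be written in a few lines of Python. This is a toy sketch under assumptions of my own: `buffer_size` is the number of tuples that fit in the in-memory sort buffer, and a real system would write each run to disk rather than keep it in a list.

```python
def generate_runs(tuples, buffer_size):
    """Fill a buffer, sort it, emit it as one sorted run; repeat.

    In a real engine each run would be written out to disk; here we
    just collect the runs in a list for illustration.
    """
    runs = []
    buffer = []
    for t in tuples:
        buffer.append(t)
        if len(buffer) == buffer_size:
            runs.append(sorted(buffer))   # one sorted run
            buffer = []
    if buffer:                            # last, possibly partial, run
        runs.append(sorted(buffer))
    return runs

runs = generate_runs([5, 3, 8, 1, 9, 2], 3)
print(runs)   # [[3, 5, 8], [1, 2, 9]]
```

The merge phase then combines the sorted runs; in Python that is essentially what `heapq.merge(*runs)` does.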
Instead of writing them to disk as is, I might as well fill a buffer with the join result, sort it, and output it as a run for the next operation. So the join result is being sent directly to the sort, instead of being written to disk and read back again. I have saved a lot of disk I/O: the cost of writing it out and reading it back. Is that clear? In many cases I can save this cost of writing and reading back.

Take this one. Suppose I choose to do a hash join. I can do the selection, maybe using an index, or maybe with a file scan; either way, I am getting out tuples for departments in the Watson building. Since I have chosen a hash join, I can send these tuples directly to the partitioning step of the hash join. I do not write them out; the hash join partitions them and writes the partitions. It does not write out the selection result, read it back, and then partition. So again, tuples are being pipelined from the select to the join operation.

This is called pipelining. Pipelining means: pass on tuples to the parent operation even as the current operation is executing. Do not wait for it to finish before starting the next operation; while this one is running, start that one too and pass tuples directly to it. In contrast, what does materialization do? It completely generates this output, and only then starts the next operation. Those are the alternatives.

Here are some more details on pipelining. I do not have the figure here, but in general what you have looks like this. There is a tree. The leaves of the tree are relations R1, R2, and so on. The intermediate nodes are operations: this may be a join, this may be an intersection, this may be an outer join, this may be a group by, and maybe this is again a join. This is the query evaluation tree, and on it I am going to write down which algorithm I am using for each node.
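The select-to-parent pipeline just described can be sketched with Python generators, where each operator yields tuples to its parent as soon as they are produced and nothing is materialized. The relation contents and operator names here are made up for illustration.

```python
# Pipelined plan: scan -> select -> project, as Python generators.
# Tuples flow upward one at a time; no intermediate result is stored.

def scan(table):
    for row in table:
        yield row

def select(pred, child):
    for row in child:
        if pred(row):
            yield row

def project(cols, child):
    for row in child:
        yield tuple(row[c] for c in cols)

department = [
    {"dept_name": "CS",   "building": "Watson"},
    {"dept_name": "EE",   "building": "Taylor"},
    {"dept_name": "Math", "building": "Watson"},
]

plan = project(["dept_name"],
               select(lambda r: r["building"] == "Watson",
                      scan(department)))
result = list(plan)
print(result)   # [('CS',), ('Math',)]
```

Note that building the `plan` object does no work at all; tuples only flow when the consumer asks for them, which is exactly the demand-driven behaviour discussed below.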
For a join, the optimizer will choose whether to do a hash join or a merge join. Similarly for intersection: for each of these steps, an algorithm has been chosen. This is a query evaluation plan, and the plan will also note whether each output is to be pipelined or not; it makes this choice. If the output is to be written out, the plan might say "materialize": an annotation meaning materialize this result, and only then start the next step. So in the end we have a query plan, annotated with how each step is to be evaluated, pipelining or materialization, and it is sent to the evaluation engine, which then does the job of executing it.

All databases allow you the following: you can give a query and ask the database, "tell me how you are going to execute this query." The optimizer looks at the query and at the statistics: how big each relation is, what indices are present. Using all such information it makes cost estimates and picks the plan it thinks is cheapest. In reality it may not be the cheapest, because these are only estimates; but based on the estimates it picks a plan, and it can show you the plan it is going to execute. In the lab tomorrow we are going to look at these plans and see what they look like.

There is one last point I want to make about these plans. When you have pipelined evaluation, how exactly do you pass tuples to the next operator? I just said "you pass them on", but how do the operators coordinate? There are several ways of doing this. One way is to run each operator in a separate process or thread, with inter-process or inter-thread communication: when this thread has a result, it passes it on to the operator running in the next thread, and you can actually get some parallelism out of this.
But practically speaking, what people have found is that in any such system you have two different threads accessing a common queue: one generates, the other consumes. That is a producer-consumer setup, and if you took an OS course you know that producer-consumer needs synchronization, a semaphore. If you try to acquire and release a semaphore per tuple, you are dead: that overhead will kill everything else. So what they found is that it is much cheaper not to run the operators in separate processes, but in a single process.

Now, how do you coordinate the generation of tuples across different operators that are all in a pipeline? The answer is that you have two choices. Let us look at just one pipeline; forget materialization. Say there are three operators in a pipeline: the output of one goes to the next, and so on. There are two ways of executing such a pipeline: demand driven and producer driven.

What do I mean by demand driven? Demand-driven evaluation starts at the top of the pipeline. The top operator tells its child, "give me a tuple." That child in turn may have to go to its own child and say "give me a tuple," and that child may be a relation scan, which reads the next record from the relation and passes it up. The middle operator may be a select: it checks the condition on the record. If the condition succeeds, it passes the record up. If not, it goes back to its child and asks for the next record, until it finds one that satisfies the selection, and passes that up. That is the easy case. What if the operator is a merge join? The parent of the merge join says "give me a record"; the merge join has to find a matching pair and output it. When it is asked again, it has to find the next matching pair and return that.
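The pull loop just described, a select operator repeatedly demanding records from its child until one satisfies the predicate, might be sketched like this. The class names and the use of `None` as an end-of-stream marker are my own conventions, not those of any particular engine.

```python
# Demand-driven (pull) evaluation: every operator exposes next(), and a
# parent calls its child's next() on demand. None marks end of stream.

class Scan:
    def __init__(self, rows):
        self.rows = rows
        self.pos = 0                    # state: position in the relation

    def next(self):
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row

class Select:
    def __init__(self, pred, child):
        self.pred = pred
        self.child = child

    def next(self):
        # keep demanding records from the child until one qualifies
        while True:
            row = self.child.next()
            if row is None or self.pred(row):
                return row

op = Select(lambda r: r % 3 == 0, Scan([1, 3, 5, 6, 7, 9]))
print(op.next())   # 3
print(op.next())   # 6
```

Each call to `next` does just enough work to produce one tuple, then returns; the position stored in the scan is the state that lets the pipeline resume.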
It turns out that all the algorithms we have seen can be modified slightly to do the following: instead of outputting all their records at once, they can leave off in the middle, and when called again they restart from where they left off and return the next record. So an operator returns a tuple, and when it is called again, it returns the next tuple. Between these two calls the operator has to maintain state: what did it last do, where was it, so that next time it can produce the next tuple. That is the key idea in pipelining. This is demand-driven, or lazy, evaluation, and it is what most databases implement: PostgreSQL, Oracle, everybody does this.

There is an alternative called producer-driven, or eager, pipelining, where the pipeline is driven not from the top but from the bottom. An operator generates a tuple and pushes it to its parent: "here, take this tuple." The parent, now holding a tuple, says, "I have a tuple; let me see if I can generate an output tuple and push it to my parent," and so forth; then the child gets a chance again and generates more tuples. That is producer-driven, or push, or eager pipelining. It is also used, but more in parallel processing: when one operator runs on one machine and another operator on another machine, producer-driven is often very useful. On a single machine, pull is typically what is used.

So, coming back: the operator has to maintain state and then return the next tuple. Those of you who know Java know the notion of an iterator. What is an iterator? You can call next on it and it gives you the next object; any time you call next, it gives you the next result. It is the same concept here.
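Producer-driven pipelining can be sketched with callbacks: each operator pushes its output upward instead of the parent pulling. This is a toy sketch with illustrative names; real push-based engines are considerably more involved.

```python
# Producer-driven ("push") pipelining: the scan drives the pipeline by
# pushing each tuple into its parent operator, which in turn pushes
# qualifying tuples onward.

def push_scan(table, parent):
    for row in table:
        parent(row)                 # "here, take this tuple"

def make_select(pred, parent):
    def op(row):
        if pred(row):
            parent(row)             # push qualifying tuples to my parent
    return op

out = []
sink = out.append                   # the root just collects results
sel = make_select(lambda r: r % 2 == 0, sink)
push_scan([1, 2, 3, 4, 5, 6], sel)
print(out)   # [2, 4, 6]
```

Notice the inversion of control compared with the pull sketch: here the leaf decides when tuples flow, which is why push fits naturally when operators sit on different machines.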
These operators are now iterators: you can call next on an operator and it gives the next result. They also support a few more operations, such as reset, so that you can start again from the beginning. Sorry, I forgot to mention: in the iterator model there are three basic calls. There is open, to initialize; next, which is called repeatedly to get the next tuple; and close, to shut down. Three basic operations. Every single operator, whether hash join, merge join, index nested-loop join, whatever, in any standard database system is written with these three functions; there may be some extra ones, but these three are the minimum.

For example, consider file scan. What do I mean by file scan? I mean I want to scan all the records in a file. What does open do? It physically opens the file: an operating-system-level file open. What is its next function? It reads the next record from the file and returns it. What is the state? Where in the file it is. If you think about it, when you open a file in the OS and say read, read, read, the OS is tracking where you are in the file: when you say "read 10 bytes," it reads the next 10 bytes. A file read is an iterator in that sense: it remembers where it was in the file, and when you say read, it returns the next piece. It is the same here.

For merge join, on the other hand, what would open do? If the relations were not sorted, open might actually go and sort them at that point. Well, in reality the query optimizer will create a separate sort operator: the merge operator does not sort; there is a sort operator in the plan which gives sorted input to the merge operator.
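The open/next/close interface for a file scan might look like this in Python. A toy sketch under my own assumptions: one line per record, and `None` signalling end of input; a real engine would read fixed-size pages and records.

```python
import os
import tempfile

class FileScan:
    """File-scan operator with the iterator interface: open/next/close."""

    def __init__(self, path):
        self.path = path
        self.f = None                # state: the OS file handle tracks our position

    def open(self):
        self.f = open(self.path)     # OS-level file open

    def next(self):
        line = self.f.readline()     # the OS remembers where we are in the file
        return line.rstrip("\n") if line else None   # None = end of file

    def close(self):
        self.f.close()

# usage: scan a small two-record file, one record per next() call
tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt")
tmp.write("rec1\nrec2\n")
tmp.close()

scan = FileScan(tmp.name)
scan.open()
r1, r2, r3 = scan.next(), scan.next(), scan.next()
print(r1, r2, r3)   # rec1 rec2 None
scan.close()
os.remove(tmp.name)
```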
So that is what happens in a realistic plan; but if you had a merge operator that also did sorting, yes, it would sort at open. What is the state? It is a pair of pointers into the two relations. On next, you move those pointers, find the next matching pair, and return it. It has to remember where the pointers are, and it also has to remember whether it has to rewind the second relation in case of duplicates. The second relation may have five occurrences of D and the first may have three. The first time, D and D match: you output the pair. Next time, the pointer on the second side moves, the second D matches, and you output that; then the third D, fourth D, fifth D. After the fifth D, you find the next value is E, but there are more D's on the first side; so you have to rewind the second side back to its first D and move the first side to its second D. All of this is part of the state of the merge operator: it has to remember what it last did in order to know what to do next.

Whether a demand-driven or a producer-driven approach is better depends on which two consecutive operators sit in the tree. Suppose there is a join whose parent is a project operator. The project works fast compared to the join; if evaluation is demand driven, the project will demand tuples, but the join may take time to produce them. If you run these in different processes, this matters: one process sits idle while the other works. That is one more reason they are typically not run in different processes; one reason was the semaphore overhead, the second is that generally one operator is the bottleneck and all the rest sit idle, so it is a bad way of parallelizing. What actually happens is that both run in the same process, so nothing is idle: one simply calls the other.
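The duplicate handling just described can be sketched as a generator over two key-sorted lists of (key, payload) pairs. The nested loop over the two runs of equal keys is exactly the "rewind" the operator's state has to support; names and data here are made up.

```python
# Merge join over two sorted inputs, emitting every matching pair.
# When both sides have runs of equal keys, the full cross product of
# the two runs is output -- this is where the "rewind" happens.

def merge_join(r, s):
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            key = r[i][0]
            # find the full run of equal keys on each side
            i2 = i
            while i2 < len(r) and r[i2][0] == key:
                i2 += 1
            j2 = j
            while j2 < len(s) and s[j2][0] == key:
                j2 += 1
            # nested loop over the two runs: rewind s's run for each r tuple
            for a in range(i, i2):
                for b in range(j, j2):
                    yield (key, r[a][1], s[b][1])
            i, j = i2, j2

r = [("D", "r1"), ("D", "r2"), ("E", "r3")]
s = [("D", "s1"), ("D", "s2"), ("D", "s3"), ("E", "s4")]
out = list(merge_join(r, s))
print(len(out))   # 7: the 2 x 3 D-pairs plus one E-pair
```

A pipelined version would not collect the runs up front but would keep `i`, `j`, and the rewind position as state between next() calls; the generator form hides that bookkeeping.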
So the processor is not sitting idle; it is doing work continuously, and from the viewpoint of keeping the process busy, pull versus push does not matter. But practically, most databases use the pull model, for multiple reasons. One reason is that in many cases, after fetching a few tuples, the parent operator may stop. If you have an EXISTS subquery, it fetches one tuple and stops; there is no point computing lots of results. Similarly, when you are outputting results to the user there may be a LIMIT 10: you only want to generate 10 tuples, after which you stop. So there are many reasons why the pull model is currently favored. There are people arguing that maybe this should change, but as of now this is what is popular.

Using multiple processes might be faster on a multiprocessor system, but there is a trade-off with synchronization. So typically, in parallel processing, you divide up the plan in a different way: you partition the data, and then on each partition you run a plan. That plan may do pipelining, demand-driven pipelining even, but it will not coordinate with the others; the killer is coordination. So in parallel query processing, what is typically done is to have multiple stages: in each stage you first partition the data in some way, then run locally on each partition. The partitions do not talk to each other; each is a single process, maybe even a single thread. When that stage finishes, you may repartition the data and again run in parallel. That has proven more successful than one operator generating tuples and passing them on to the next, for the reason I gave: the speeds may differ, and then things sit idle. You do not want that; everybody should be busy. This happens in a human pipeline too: if you go for lunch now, there is a pipeline of people serving food, and the pipeline speed is determined by the slowest
server; the others sit idle waiting. So there is a physical reason why pipeline parallelism is difficult here. In query processing, pipelining is used to avoid writing to disk; it is not used to parallelize.

A student asks: "In slide 12.25, if S contains the same value of D, what happens?" Yes, that is what I was trying to explain earlier. You have to match every D here with every D there. Effectively, once you know where the first D and the last D are on both sides, you do a nested loop: for each D here, for each D there. Can you repeat the question? Repetition of D in S? Yes, that is what I am saying: you can have multiple D's in S and multiple D's in R, and all the pairs have to be matched. Repetition is allowed; who said the join attribute is unique? It can have duplicates, and you have to generate all the matching pairs.

Another question: "In a distributed file system, how do you maintain the index tree? Is it centrally located or scattered across the different nodes, as in BigTable?" That is a good question: how do you split an index across multiple machines? The way it is done in BigTable is that there is a top-level structure over tablets. Let me draw it; it will be clearer. Instead of having one huge index on the whole thing, BigTable does the following. These are tablets: tablet one, two, three, and so forth. The data is range partitioned across tablets. What do I mean by that? Say the first tablet has keys from "aa" to "ap", the next from just above "ap" to, say, "bz", and so forth. Now, given a key, by using this top-level structure, which is not a B+-tree; it could be an array or some other in-memory data
structure. It is kept in memory, so given a value I can find out which tablet it maps to. I need only two values per tablet, the beginning and the end of its range, plus a pointer to the tablet: I need to know which tablet it is. Essentially this is an in-memory index, or map, call it what you will. It is stored on one machine, the master; you can replicate it and so on, but I will not get into the details. For simplicity, assume it is available everywhere, or in some known place.

Now if I want to do a lookup, I first look up the value to see which tablet it falls in; if it is a range, it may span multiple tablets. Then I go to the tablet. Inside each tablet there is a B+-tree, and a tablet is not small: it is a few hundred megabytes, which could be several million entries. So within the tablet I have a tree, and I look the key up in that tree. I may search down and find a single entry in the B+-tree, or, if it is a range scan, I may end up scanning everything in this tablet, everything in the next, and up to somewhere in a third: a range scan can cross multiple tablets.

A student asks how we identify on which node a particular tablet is available. That is the other part: where is the tablet residing? That is one more mapping, and the reason it matters is that a machine may die; then the tablet has to be handled by another machine. So at any point I want to know which machine is taking care of each tablet. What if the machine taking care of a tablet dies? Well, the tablet data is on the distributed file system, so another machine can take over, do some recovery for the tablet, and then start serving it. That is part of the fault tolerance of BigTable. At the index level, conceptually, there are many small indices plus an in-memory structure telling you which index to go to; each small index is on one machine. So I need to know which tablet, and where that tablet is: there is another map, a tablet-to-machine mapping, which the system also takes care
of. So when you have a lookup request, it goes through the top-level map, then the tablet-to-machine map, and then to that machine. Any other questions?

A student asks: "We are using the same hash function; suppose I am taking the modulus function. Won't the sizes be the same for both hash tables?" So the question is: one large relation, one small relation, the same hash function, will the sizes be the same? No. The number of partitions is the same, yes: if I have five partitions here, I have five partitions there. But the question is how many records fall into each partition. Say R has a million rows and S has 20,000 rows, and I divide each into five partitions. R's million tuples divided into five is 200,000 per partition; S's 20,000 divided into five is 4,000. So obviously the partition sizes differ, and that is why I prefer to build on S and probe using R. R need not fit in memory; only the S partitions need to fit in memory. When I am probing with R, what am I doing? I take an R tuple, look it up in the hash table, find the matches, output them, throw away the R tuple, and move on to the next. I am just scanning that R partition; there is no requirement that R fit in memory.

"So we will not break S; we will only break R?" No, we have to partition both R and S; we cannot get away with partitioning only one. Well, there are some variants that partition only S, but then they have to scan R multiple times, so there is a trade-off. If the number of partitions of S is very small, say two, people have found it useful to simply scan R twice rather than read it once, partition it, and read it again. But those are special cases; that 200-page article on query processing takes all these special cases and optimizes them. There are a lot of details, but we will not get into that. Okay, I think people are hungry; I will stop here.
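As a footnote to that last point: the build-on-S, probe-with-R idea for one partition pair of a hash join can be sketched as follows. This is an in-memory toy with made-up names and data; in the real algorithm each partition pair is processed like this after the partitioning phase, with only the S partition required to fit in memory.

```python
# One partition pair of a hash join: build a hash table on the smaller
# side (S), then stream the larger side (R) through it tuple by tuple.

def hash_join_partition(r_part, s_part, r_key, s_key):
    # build phase: hash table on the smaller input S
    table = {}
    for s_row in s_part:
        table.setdefault(s_key(s_row), []).append(s_row)
    # probe phase: scan R, look up each tuple, then discard it --
    # R never needs to fit in memory
    out = []
    for r_row in r_part:
        for s_row in table.get(r_key(r_row), []):
            out.append((r_row, s_row))
    return out

r_part = [(1, "a"), (2, "b"), (1, "c")]
s_part = [(1, "x"), (3, "y")]
out = hash_join_partition(r_part, s_part,
                          lambda t: t[0], lambda t: t[0])
print(out)   # [((1, 'a'), (1, 'x')), ((1, 'c'), (1, 'x'))]
```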