Welcome back after the break. So, we will start off with questions from remote centres. Right now, we have chosen Regency Institute, Pondicherry. If you have a question, please go ahead.

Hello, good morning sir. What is a function-based index?

So far, the indices we have seen support queries like "attribute value equal to 5" or "r.A between 5 and 17". Those are the kinds of queries an index can answer, but sometimes you want variants. For example, suppose we have stored names in an index, and we have kept the names as is, with a mix of capital and small letters. Now, if I want to find everybody with a particular name regardless of case, how do I use the index? A capital letter does not compare equal to a small letter, so a plain equality lookup will fail.

One answer is to convert all the names to lower case or upper case up front, store the result as an extra attribute, and build a normal index on that. A function-based index avoids this step of creating and storing an extra attribute. What it does is let you specify a function, in this case say the lower or upper function, and the result of that function is what gets indexed. So we do not create a new column holding the converted name; we directly create an index on upper(name) or lower(name). That is what a function-based index is.

Now, how do you look up this index? If I have taken a name as input, I first convert it to upper or lower case as appropriate. Then I write the predicate as upper(name) equal to that converted input, and the query optimizer knows there is already an index on upper(name), so it will use that index. Similarly for joins: if my join condition is upper(r.A) = upper(s.B), and I have built a function index on the name attribute of at least one of the two relations, then an index nested loops join can be used. Even if there are no indices, I can still use merge join; the only thing is that I have to first compute upper of the attribute, sort on that result, and then merge. Likewise, I can use hash join by first computing the function and then hashing on its result. And more generally, it does not have to be upper or lower; it can be any function of the attributes of the tuple.

So, I hope that answers your question. If you have a follow-up question, please go ahead.
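To make this concrete, here is a minimal JDBC sketch of the idea, assuming a PostgreSQL-style expression index and the instructor table used elsewhere in the course; the connection URL, credentials, and index name are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class FunctionIndexDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; substitute your own.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/university", "user", "password")) {

            // Index the *result* of upper(name), not the raw column.
            // PostgreSQL calls this an expression index.
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE INDEX instr_upper_name_idx ON instructor (upper(name))");
            }

            // Case-insensitive lookup: because the predicate is written as
            // upper(name) = upper(?), the optimizer can match it to the index.
            String sql = "SELECT ID, name FROM instructor WHERE upper(name) = upper(?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "einstein");   // any mix of cases will match
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("ID") + " " + rs.getString("name"));
                    }
                }
            }
        }
    }
}
```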
Okay sir, I have one more question, on joins. If I want to join data from three tables, how many joins are minimally required?

So, the question is: if I want to join three different tables, what is the minimum number of join operations required? If you use only binary joins, and this is the most common situation because systems typically implement only binary joins, then with three relations we need two joins. Suppose the relations are called R1, R2, R3. I can first join R1 with R2 and then join that result with R3, or use any of the other orders: first R2 with R3 and then with R1, and so forth.

There are research papers which talk of multi-way joins. I do not know if anyone has implemented them in an actual system as of now, but potentially, if you have three relations all being joined on the same attribute, the condition is something like R1.A = R2.A and R2.A = R3.A, meaning one attribute from each of the three relations is equated to the same value. This kind of thing happens when, for example, I have a particular employee and multiple relations containing details about that employee, and I want to take this employee and look up the details in each of the other relations. This is a good case for a three-way join, meaning I do not necessarily materialize the intermediate results. I can take one employee and fetch the matching tuples from relation 1, relation 2, and relation 3 at one go. So the query is R1 join R2 join R3, where an attribute of R1 is used to look up the other relations; I can fetch all those tuples, combine them at one go, and output the result, instead of creating intermediate results. This special case shows up in data warehouses, where there are what are called fact tables and dimension tables; the dimension tables are the lookup tables. For that special case I believe many data warehouses do something like this, but in the general case I am not sure anybody implements multi-way joins. Any follow-up to that? Back to you.

Yes sir, thank you. If I have an index on a column of a table, and there are constraints on the table, then to drop the index do we have to remove or disable the constraints, or can we drop the index without that?

So, if you have a constraint and an index which is used to support that constraint, will the system allow you to drop the index without dropping the constraint? I think this is a decision of the individual system; it is not part of the SQL standard. My suspicion, and I have not tried this out, is that a system which says "this person knows what they are doing" would allow you to drop it. Even a primary key declaration requires an index; if you drop that index, primary key enforcement is going to become expensive. So, some system may ask "are you sure you want to do this", some system may refuse, and another system may say "fine, I trust you" and go ahead. I do not think I can give a standard answer to this question; you can try it out on PostgreSQL and let us know later what happens.

Let us take one or two questions from some other centre. We have Sri Shankaracharya in Raipur, Chhattisgarh; please go ahead.

Sir, good morning. We have two questions. First, how is a B+-tree suitable for range queries? Second, why is a copy of the data created in an internal node during split operations?

So, let me answer the first question; that is the easier one. If you look at a B+-tree, you will see that all the data is sorted, so maybe I can do a screen share and answer this. Here is a B+-tree, and we want a range operation. Suppose I want to find all instructors whose names are between Einstein and Singh. What I will do is traverse this tree down to the very first entry, which is Einstein, and then walk along the leaves of the tree, collecting all the instructor names, until I reach Singh or something larger than Singh. In this case Singh is present; supposing Singh were absent, I would reach Srinivasan, which is alphabetically greater than Singh, and stop at that point. Similarly, I may not find Einstein itself, but I will find the first key larger than Einstein. Suppose both Einstein and Singh are missing and El Said is there, with the query range Einstein to Singh: I will find El Said, then walk through all the following keys until I hit Srinivasan, at which point I know I have gone too far, and I will stop. So, that is how a range query is answered. Now, of course, for each of these keys I will have to fetch the record ID and then the record, as appropriate.
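Before the second part of the question, here is a minimal sketch of that leaf-level walk, assuming we have already descended the tree to the first leaf that could contain the low key; the node layout is deliberately simplified to leaves only, and the record fetch is left as a comment.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified B+-tree leaf: sorted keys plus a pointer to the next leaf.
class Leaf {
    List<String> keys = new ArrayList<>();
    Leaf next;           // leaves are chained, which is what makes range scans cheap
}

public class RangeScan {
    // Collect all keys in [lo, hi], starting from the leaf reached by
    // descending the tree for 'lo' (the descent itself is not shown).
    static List<String> rangeScan(Leaf leaf, String lo, String hi) {
        List<String> result = new ArrayList<>();
        while (leaf != null) {
            for (String k : leaf.keys) {
                if (k.compareTo(lo) < 0) continue;      // before the range: skip
                if (k.compareTo(hi) > 0) return result; // past the range: stop
                result.add(k);  // in a real tree we would fetch the record via its record ID
            }
            leaf = leaf.next;   // follow the leaf chain; no need to go back up the tree
        }
        return result;
    }
}
```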
The other part of your question was about the split operation, and I was not sure at first what context you meant, because there is also a split operation in query processing; I thought maybe you were talking of that. In the context of B+-trees, of course, the split operation happens when a node is full.

When we insert more data into a node, then during the split of that node a copy of the data is created in an internal node. What benefit do we get from that?

Fine. So, coming back here: when you split, you are not creating a copy of the data in internal nodes. The sole purpose of the internal nodes is to act as a guide; the key values in the internal nodes are basically helping you decide which child pointer to follow. You are not creating any copy of the underlying data at all. What looks like copying is simply storing an appropriate key value in an internal node, which helps you distinguish between going to the left child or the right child of the split node. Without that, your B+-tree structure would not be correct, because you would not know which way to go when searching; you need that information in the internal node. So, I am not sure why you have a problem with it. It is not duplication; it is essential for the B+-tree.

Sir, I have one more question. Hello sir, is there any query for the division operation?

And the answer is no. SQL used to have a "contains" operation, and I mentioned it earlier when I covered division in the context of SQL, when I talked about the subquery approach. The example query was to find students who have taken all biology courses, or equivalently all core courses, and there I showed you how to do it in SQL using subqueries with NOT EXISTS and set difference. And I told you there is a construct called CONTAINS which, if it were available, would have made life easier. Unfortunately, it is not supported in SQL today. Weirdly, it was supported in one of the earliest implementations; later it got dropped and never came back. That is the closest construct supporting division that I have seen. But you can think of writing the query using CONTAINS and then translating it to the NOT EXISTS form.
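Since CONTAINS is gone, the division query has to be phrased with doubled NOT EXISTS. Here is a sketch of that formulation, assuming the university schema used in the course with columns student(ID, name), takes(ID, course_id), course(course_id, dept_name); the connection details are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DivisionDemo {
    // "Students who have taken ALL Biology courses": relational division,
    // written with doubled NOT EXISTS since SQL has no division operator.
    static final String DIVISION = """
        SELECT S.ID, S.name
        FROM student S
        WHERE NOT EXISTS (              -- there is no Biology course ...
            SELECT C.course_id
            FROM course C
            WHERE C.dept_name = 'Biology'
              AND NOT EXISTS (          -- ... that this student has not taken
                  SELECT *
                  FROM takes T
                  WHERE T.ID = S.ID AND T.course_id = C.course_id))
        """;

    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; substitute your own.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/university", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(DIVISION)) {
            while (rs.next()) {
                System.out.println(rs.getString("ID") + " " + rs.getString("name"));
            }
        }
    }
}
```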
I think in the interest of time, let me get back to the query processing slides. No more questions right now.

Let me wrap up the outer join operation. The outer join operation can again be computed using sorting or hashing. I am just going to focus on sorting, because it is a lot easier to see with sorting. If you want to know how it is done using hashing, read it up from the slides or from the book; it is pretty straightforward in that case too.

So, let us use merge join to compute a left outer join; this is the easiest case. What do we do in merge join? We have two inputs which are already sorted, and we are stepping through them. For a plain join, I said that if there is no matching tuple on the other side, go to the next tuple. For a left outer join, all we have to do is the following: if a particular tuple tr from R does not match any tuple from S, output tr padded with nulls, and then move on to the next tuple. That is the only change to the merge join algorithm; a very trivial change, in fact. Earlier we said "if there is no match, go to the next tuple"; now, for R left outer join S, we say "if there is no match for the R tuple, output it padded with nulls". Right outer join is completely symmetric, and full outer join is also very straightforward: for either of the relations, if a tuple has no match, we output it padded with nulls to get the full outer join. So, it is very easy.

Hash join is again fairly straightforward for left outer join, as long as S is the build relation. We take an R tuple and probe the build relation; if there is a match, we output the joined tuples. If there is no match at all for that R tuple, we output it padded with nulls; very easy. If, on the other hand, the build relation is R, it is still possible, but a little bit more complicated, though not hard at all; I am going to skip the details.
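Here is a minimal sketch of that one change to merge join, assuming both inputs arrive already sorted on the join key and, to keep the sketch short, that the key is unique in S; tuples are reduced to (key, payload) pairs.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeLeftOuterJoin {
    record Tuple(int key, String payload) {}

    // Both inputs must already be sorted on key; S's keys are assumed unique here.
    static List<String> leftOuterMergeJoin(List<Tuple> r, List<Tuple> s) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < r.size()) {
            // Advance S past keys smaller than the current R key.
            while (j < s.size() && s.get(j).key() < r.get(i).key()) j++;
            if (j < s.size() && s.get(j).key() == r.get(i).key()) {
                out.add(r.get(i).payload() + " | " + s.get(j).payload());
            } else {
                // The only change from a plain merge join:
                // an unmatched R tuple is emitted padded with nulls.
                out.add(r.get(i).payload() + " | null");
            }
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tuple> r = List.of(new Tuple(1, "r1"), new Tuple(2, "r2"), new Tuple(4, "r4"));
        List<Tuple> s = List.of(new Tuple(1, "s1"), new Tuple(3, "s3"), new Tuple(4, "s4"));
        leftOuterMergeJoin(r, s).forEach(System.out::println);
        // r1 | s1
        // r2 | null
        // r4 | s4
    }
}
```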
So, the last topic in query processing. So far we have seen how to evaluate a single operation, but a query is rarely a single operation; it is usually many operations which form a tree. How do you evaluate a full query which is a tree of operations? Essentially, there are two alternatives for each part of the tree. One is materialization: you generate the result of an expression and then treat it as a stored relation when evaluating the next operation. The alternative is pipelining, where you pass tuples from one operation directly to the parent operation.

So, let us see this in more detail. Here is a query: select building equal to "Watson" on department, joined with instructor, then project on name, let us say with duplicate elimination. How do you evaluate this? One way is to start at the lowermost operation, the selection, and run it. How do we do it? We have a choice of using an index scan or a file scan; we create the output and store it. Now, you can think of snipping off that part of the expression tree and replacing it by the result we just computed, which is a stored relation; it is materialized. Materialize means create it and store it. Next, we do the join of that temporary relation with instructor and store the result. Then we take the project operator, use that relation as input, run it, and store the output, which is the final result. This is called materialized evaluation. In general, you do this bottom-up on the tree, or equivalently a depth-first traversal gives you the same thing: first child materialized, next child materialized, then compute the operator.

The alternative is pipelining. Coming back to the same expression, suppose I have an index on instructor. I use maybe an index scan or a file scan on department, it does not matter, and I have found the departments in the Watson building. Now, instead of storing that relation, we pipeline it into the join; suppose this is an index nested loops join. It receives one department tuple, immediately looks up the instructor index, in this case on department name, finds the matching instructors, and passes them on. Again, it does not store them; it passes them on to the project operator, which gets out the name.

Now, suppose the project operator is doing duplicate elimination; there are some extra steps here. Maybe it has to sort its input to do duplicate elimination. So, what it might do is take the incoming tuples, start creating sorted runs, and output the sorted runs; later on, another merge operator computes the final output. So, duplicate elimination is essentially split into two operators: the first part creates the sorted runs, and the second part does the merge. You can pipeline data from any lower operation into the run creation: instead of storing the join result, you pass it straight to run creation. Run creation keeps aside some amount of memory, collects tuples until the memory is full, sorts in memory, and outputs a run. So, you pipeline from the join into run creation. Once you finish all the runs, you do the merge and output the final result of duplicate elimination. Clearly, pipelining does not happen between run creation and the merging step. It is best to think of duplicate elimination, or sorting, as two operators, run generation and merging: pipelining into run generation is possible, but pipelining from run generation to merging is impossible, because until you have generated all the runs you cannot merge them. So, you have to materialize the runs and do the merge later on; I hope that is clear. That is pipelined evaluation.

Now, pipelining requires coordination: one operator generates tuples and passes them on to the next, which consumes them. One way to think about it is that each of these operators is running in its own thread; as soon as an operator generates a tuple, it passes it on to its consumer through some kind of intermediate queue, and the consumer thread processes tuples as soon as it sees them in its input queue. This is conceptually a nice paradigm, but there are a lot of overheads: you have to maintain queues, two threads are accessing each queue, and you have to synchronize between them. It turns out the cost of actually processing a tuple is pretty small, so the synchronization cost starts to dominate in this model.
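Here is a toy sketch of that thread-per-operator model, assuming a bounded queue between a scan (producer) and a selection (consumer), with a sentinel value marking end of stream; the operators and tuple type are illustrative. Note the per-tuple queue handoff: that is exactly the synchronization cost described above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineThreads {
    static final int EOF = Integer.MIN_VALUE;   // end-of-stream sentinel

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16);

        // Producer operator: a scan that emits tuples into the queue.
        Thread scan = new Thread(() -> {
            try {
                for (int tuple = 1; tuple <= 100; tuple++) queue.put(tuple);
                queue.put(EOF);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Consumer operator: a selection that processes tuples as they arrive.
        Thread select = new Thread(() -> {
            try {
                for (int t = queue.take(); t != EOF; t = queue.take()) {
                    if (t % 10 == 0) System.out.println("passed tuple " + t);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        scan.start(); select.start();
        scan.join(); select.join();
        // Every tuple pays a queue handoff; with cheap per-tuple work this
        // synchronization dominates, which is why real systems prefer the
        // demand-driven (pull) or producer-driven (push, batched) models.
    }
}
```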
So, this model is not widely used; instead, what is used is one of two things, called demand-driven pipelining and producer-driven pipelining. That is coming up on the next slide.

Demand-driven, or lazy, evaluation is the approach used most frequently. The system has a query tree; the tree is actually a query plan, annotated with the algorithm to be used for each operator and so forth. The query processing system asks the top node of the tree: please give me the next tuple. Now, that node has not yet created a next tuple to output, but what it does is get the next tuples from its children: it tells a child "give me the next tuple", then creates its output and returns it. In between two calls, each operation has to maintain state, so it knows what to return next.

So, let me illustrate that using the same example. The first time, the system asks project: give me a tuple, and project tells join: please give me a tuple. What does the join do? Suppose it is an index nested loops join. It will tell its left input: please give me a tuple. What does the left input do? It uses the index, finds the first tuple matching building equal to "Watson", and returns it. Now, the join will find the first matching instructor using that department name and return that first matching tuple up to the project operation. When project needs its next input, it will again tell the join: give me the next tuple. Now, the join has to remember that it got a particular department last time, and that maybe it returned only the first instructor in that department; maybe there are more. So, it may continue the index scan, find the next instructor, and return it. What just happened is that in between two calls to next on the join, it has to remember where it left off: it was looking at a particular department and had returned one of its instructors, so the next call may have to look for the next instructor. If there are no more instructors in that department, it will in turn tell the select operator: give me the next department, and then find instructors in that department. At some point the select will say "I do not have any more departments, I am done"; then the join will say "I am also done" and tell project there is no more input, and project will finish up its merge phase and finish. So, that is demand-driven pipelining.

Producer-driven pipelining goes the other way. The best way to think about it is that each of these operators is running in parallel, and as soon as it has a batch of tuples, it pushes them up to the next operation. The next operation, say the join, gets batches of tuples from below, and when it has enough tuples it does whatever join work is required and passes the results on. In this case, if it is an index nested loops join, it does not get input from the instructor side; it only gets input from the select on department. The select generates a number of departments in Watson and pushes them to the join; the join does the index lookup on instructor, on department name, and pushes its output up, and so forth. Demand-driven, the first one we saw, is what is used far more often, although producer-driven is useful in parallel databases.

There is some material here on how to implement demand-driven pipelining; essentially there are three methods. If you are familiar with iterators in Java or newer versions of C++ or C#, you know what an iterator does: you first have to initialize it, and in this case the call is called open. When you call open on a particular node, join or select or whatever, it may do some initialization; for example, merge join might sort its input relations in preparation for the merge phase. When you call next on that operation, you are saying: give me the next tuple. For a file scan it is easy: it just outputs the next tuple, advances, and stores the file pointer, so the next call after that gives the one after, and so forth. For merge join, it might have done a partial merge up to now; it finds the next matching tuple in the merge and returns it, and so forth. Finally, close shuts down the pipeline, and that is it.
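As a minimal sketch of this open/next/close interface, here is an illustrative in-memory scan and a selection operator, with null signalling end of stream; real operators keep richer state behind next(), such as file pointers or partial merges.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// The classic demand-driven (pull-based) operator interface.
interface Operator<T> {
    void open();   // initialize state (e.g., merge join would sort its inputs here)
    T next();      // return the next tuple, or null when exhausted
    void close();  // release resources, shut down the pipeline
}

// Leaf operator: scans an in-memory "relation".
class Scan<T> implements Operator<T> {
    private final List<T> data;
    private Iterator<T> it;
    Scan(List<T> data) { this.data = data; }
    public void open()  { it = data.iterator(); }
    public T next()     { return it.hasNext() ? it.next() : null; }
    public void close() { it = null; }
}

// Selection: pulls from its child on demand; the child's position is the
// state remembered between next() calls.
class Select<T> implements Operator<T> {
    private final Operator<T> child;
    private final Predicate<T> cond;
    Select(Operator<T> child, Predicate<T> cond) { this.child = child; this.cond = cond; }
    public void open()  { child.open(); }
    public T next() {
        for (T t = child.next(); t != null; t = child.next())
            if (cond.test(t)) return t;   // first child tuple satisfying the condition
        return null;                      // child exhausted: we are done too
    }
    public void close() { child.close(); }
}

public class IteratorModelDemo {
    public static void main(String[] args) {
        Operator<Integer> plan = new Select<>(new Scan<>(List.of(1, 2, 3, 4, 5, 6)),
                                              t -> t % 2 == 0);
        plan.open();
        for (Integer t = plan.next(); t != null; t = plan.next())
            System.out.println(t);        // 2, 4, 6
        plan.close();
    }
}
```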
So, that was a very quick overview of how an entire expression is evaluated. What I will do is take a couple of minutes for questions and then we will move on to query optimization. Let us take some live questions; we have NIT Trichy, please go ahead.

How can we fix the maximum number of users for any web application?

I interpret this question to mean that if there are more than that many concurrent users, the system gets into trouble, so you want to prevent any more users from joining. If that is the question, one way is that when somebody attempts to log in, you check how many concurrent users there are. How do you know this? You need to know how many sessions are active. How do you find out how many active sessions there are? I think the servlet session machinery should provide some way of finding this; I am not sure. If it does not, then when a session times out, it should be possible to say "call this function on session timeout"; I am not 100 percent sure about this. That function could decrement the count of active sessions, and then if somebody retries, you allow them in if the number has come down. So, I would have to dig into this; the key issue is that the session times out automatically, and at that point we have to decrement the count. How to do this is something we would need to look into. Does that answer your question? Over to you.
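On the mechanism the answer was unsure about: the servlet API does in fact provide this hook. An HttpSessionListener is notified both when a session is created and when it is destroyed, including destruction by timeout. Here is a minimal sketch of the counting idea; the limit value and rejection policy are illustrative, and the package is jakarta.servlet in recent containers (javax.servlet in older ones).

```java
import java.util.concurrent.atomic.AtomicInteger;
import jakarta.servlet.annotation.WebListener;
import jakarta.servlet.http.HttpSessionEvent;
import jakarta.servlet.http.HttpSessionListener;

// Counts active sessions; the container calls sessionDestroyed both on
// explicit logout (invalidate) and on session timeout.
@WebListener
public class ActiveSessionCounter implements HttpSessionListener {
    private static final AtomicInteger active = new AtomicInteger();
    static final int MAX_USERS = 1000;   // illustrative limit

    @Override public void sessionCreated(HttpSessionEvent e)   { active.incrementAndGet(); }
    @Override public void sessionDestroyed(HttpSessionEvent e) { active.decrementAndGet(); }

    // The login path can consult this before admitting a new user.
    public static boolean admitNewUser() {
        return active.get() < MAX_USERS;
    }
}
```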
I mean as implemented in Moodle, sir, where 9000 users can access it through A-VIEW. So, how did you implement the limit on the number of users?

Was the question for Moodle, or was it for A-VIEW? Over to you. Moodle. Yeah. So, we did not implement Moodle, and Moodle to my knowledge does not actually restrict the number of users. It may run out of resources, but it is not counting, and that problem showed up when we ran a quiz on Moodle. The quiz module is very database intensive, apparently. Actually, I am sure it could be rewritten in a much more efficient manner; I mentioned optimization through caching earlier. The quiz module, I think, is dynamically querying the database every time you click next or submit a result, which is a very inefficient way of doing things. What it should do is pre-generate the entire quiz for you and send it to you, so that clicking next is a completely local operation that does not go back to the database. Then the number of interactions with the database would come down sharply. I am sure that with a little rewriting the quiz module could actually handle many thousands of people at a time, because the server on which our Moodle system runs is pretty high end. I do not know the exact configuration, but I think it has something like 64 gigabytes of memory, so it is not memory constrained, and a fairly high-end CPU with many cores. It is a pretty robust system, so this kind of thing should not have happened; some amount of tuning, I am sure, could improve the module a lot.

The other option is to restrict the number of people taking the quiz at a time. What I was told was to break it up into batches of maybe 500 people taking the quiz at a time, but that is a lot of batches: we have over 5000 people active who took the quiz at some point of time, and coordinating all that is a pain. So, I decided not to do it for this course; it is too much overhead. I was hoping that the clicker would give us a better solution. With the clicker, the trick is that the answers are collected not all centrally, but each remote centre has something running locally which collects its own answers. So, each remote centre collects maybe 30, 40, 50, 60 answers, some as few as 5 or 10, and then the 250 centres each upload everything together, which is very efficient. So, that is why the clicker software easily handles 1000 users without any problem; that is what we have seen, and I am pretty sure it would scale even to 10,000, if only the clicker software and systems were running at all the remote centres, without any performance issues. Does that answer your question?

Sir, do we need to specify the number of users in our server lab program, or is it network dependent?

No, none of these things has any direct built-in mechanism to control the number of users. Essentially, it lets as many users come in as arrive, and eventually it runs out of resources, becomes very slow, and bad things happen; that is about it. I do not think any of these does admission control. Conceptually, admission control is a straightforward thing, and there are systems which implement it, but to the best of my knowledge Moodle does not, and you would have to do a bunch of work to make it happen.

So, let us move to some other question; I would prefer questions on today's topics. We have NRI Institute, Bhopal.

Yes sir, good morning sir. My question is regarding the JDBC session. With a single object of the Connection class, is it possible to make multiple connections to multiple databases?

No. If you want multiple connections, you have to create multiple Connection objects, with separate DriverManager.getConnection calls. You could probably make two different connections to the same database; I do not just think that is allowed, I know it is, because we have used it. So, you can make separate connections to a database, but on a single connection you can only be doing one thing at a time: if you have executed a query, it has to complete before you can do anything more. Does that answer your question?

So it is not possible to make multiple connections? You can make multiple connections; that is not a problem. You can make multiple connections to different databases, or even to the same database if you want, but per connection you should have only one thing active at a time. Thank you, sir.

Before you ask the next question: there are some extensions which have been proposed that allow asynchronous submission, so you can submit multiple queries and get the results back whenever they finish. I do not think this is part of the JDBC standard yet, but maybe it will appear sooner rather than later, I think.
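A small sketch of that point: one Connection object per database session, obtained by separate DriverManager.getConnection calls; the URLs, table names, and credentials below are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TwoConnectionsDemo {
    public static void main(String[] args) throws Exception {
        // One Connection object per session: two databases need two objects.
        try (Connection c1 = DriverManager.getConnection(
                 "jdbc:postgresql://host1/university", "user", "password");
             Connection c2 = DriverManager.getConnection(
                 "jdbc:postgresql://host2/inventory", "user", "password")) {

            try (Statement s1 = c1.createStatement();
                 Statement s2 = c2.createStatement()) {
                // Each connection handles one statement at a time; separate
                // connections are what allow separate work in flight.
                ResultSet r1 = s1.executeQuery("SELECT count(*) FROM instructor");
                ResultSet r2 = s2.executeQuery("SELECT count(*) FROM item");
                if (r1.next()) System.out.println("instructors: " + r1.getInt(1));
                if (r2.next()) System.out.println("items: " + r2.getInt(1));
            }
        }
    }
}
```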
Good morning sir. My question is: please explain the advantages and disadvantages of a DSN connection.

So, that is just a lower-level implementation detail: a data source which you create and then connect through. I think it is a Microsoft-specific thing, if I am not mistaken, but I do not know much about it, so I do not have anything to say about it. Any other question?

Good morning sir. My question is from the JDBC section. What is connection pooling?

Connection pooling is something which everyone should be aware of. When you open a connection, there is a lot of overhead to opening it: a password is passed, there is verification, handshaking; a lot of stuff happens to open a connection. Now, suppose you have 10,000 users using the system, with requests coming one after another. Suppose for every request that comes, every request on a servlet, I open a connection, process the request, and close the connection; there is a lot of overhead. What connection pooling does is this: instead of closing the connection, I put it aside in a pool of connections which are currently unused. When a new request comes, instead of opening a brand-new connection, I pick one up from this pool. So, there may be multiple connections to the same database, in a pool; I just pick up a connection from this pool which is currently not in use and use it to send my query and get the result back. When I am done with it, say one servlet has finished processing a request, it does not close the connection; it returns it to the pool.

This has a very significant impact on performance. In many cases the database interaction is very small, just one small lookup that may actually be cached in the buffer, so it is very fast at the database end, and opening the connection is actually the big bottleneck. Connection pooling can give you many factors of improvement in performance; 4 or 5 times improvement is not uncommon.

Sir, how many connections can we put in a connection pool?

That is under your control. You need a library which handles connection pooling, and some servers, I think Tomcat is one, support connection pooling internally. If you use the right API, you just say "open the connection", and instead of closing it you release the connection back to the pool. So, you can look up the API for connection pooling and use that, or you can build your own if you wish; although, since it is available today, why would you build your own? Use an existing connection pooling library, either directly from the servlet engine like Tomcat, or one of the other libraries which you can download and use. So, the limit is up to you. It depends on how many connections the database can handle simultaneously; you do not want to open 1000 connections to a database, ideally, so the connection pool would not be that big. But beyond that it is a function of performance; there is no other hard limit.
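Here is a toy pool to show the mechanics, assuming a fixed-size pool and callers that always return what they borrow; in practice you would use the servlet engine's pooling or an existing library rather than rolling your own.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SimplePool {
    private final BlockingQueue<Connection> idle = new LinkedBlockingQueue<>();

    // Pay the connection-opening overhead once, up front.
    public SimplePool(String url, String user, String pass, int size) throws SQLException {
        for (int i = 0; i < size; i++)
            idle.add(DriverManager.getConnection(url, user, pass));
    }

    // Borrow an unused connection; blocks if all are currently in use.
    public Connection acquire() throws InterruptedException {
        return idle.take();
    }

    // Instead of conn.close(), return the connection to the pool for reuse.
    public void release(Connection conn) {
        idle.add(conn);
    }
}
```

A servlet would call acquire() at the start of a request and release() at the end, in place of getConnection() and close().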
I think we should get back to query processing related questions, if you do not mind; others, please stick to query processing questions. We have SIES, Navi Mumbai; please go ahead.

My question is based on sequential file organization: why is an overflow block used even if, at a given point, there is only one overflow record?

So, this is again not related to query processing, but I will answer it very quickly since you have asked; I will not take any more questions like this, though. The answer is that in a database, you store data only in units of blocks. Even if you want to store one record, you have to store it somewhere, and that unit is a block. You are only going to get a 3 BHK here; you are not going to get a one-room kitchen to store your one record. All blocks are the same size. You can put many records in that 3 BHK, but you cannot get a separate small room for a single record by itself. Let us move on to any other query processing related questions. What I will do now is take one or two questions from the chat, answer them, and then we will move on to query optimization.

There is a question about query processing on multiple sites: how do you run a query which accesses data from multiple sites? That is a good question. The answer is that typically you break up the query processing into parts that are done at different sites. In the simplest case, all the processing is done at one site, and the other sites merely send their data over to this central site. For example, if I am doing r join s, where s is on one site and r on another, I can ship the entire contents of s to the site which contains r and then do the join locally there, or I can do it the other way. But in general, you might do something more. If there are selections on s, maybe I apply the selection condition at s's site and ship only the relevant tuples. Then, given the join condition between r and s, if s is very large and r is very small, I might send the r tuples to s's site, do the join there, and ship the join results back to the original site, maybe where r was present; that is more efficient than sending s over. So, there are many alternatives for executing queries across multiple sites. This is part of distributed query processing, and optimizers have been built for optimization in this context; anyway, optimization is coming up next. The key point here is that there is one extra operator, sometimes referred to as the data exchange operator, which ships data from one place to another.

Now, in the context of parallel processing, the issues are somewhat similar, except that you are running a single operator on many, many different nodes of a machine. What is the machine here? Maybe a bunch of computers linked by a network, and I want to break up an operation such as a join, or sorting, or sort-based or hash-based aggregation, and do parts of each of these on each of the nodes in a highly parallel system. I will come back to this when we cover big data, which was supposed to be today, but I do not know if we will have time for it. What I want to say here is that parallel processing is eminently possible, but the key step is to be able to move data from one machine to another, so that it is available where it is required. Again, there is an operator which does this, called the exchange operator.

How you run parallel query plans is actually very nice; there are two alternating parts in any such plan. One step is the data exchange step, which ships data between different nodes; how to do this, we will see later. Then you run the same plan in parallel at each site, using only local data: you have moved all the data that each node needs to that node, and each node does its part of the query plan. Then the results might again be redistributed to different nodes, which is another exchange operator, and again you run the operators locally. The same plan runs at each site, but accessing only whatever data is already available at that site, and so forth. So, you have alternating steps: things running locally with no exchange of information, and then data exchange between sites with no local processing. If you alternate these, you get a query plan that can itself be represented as a tree, where the tree has ordinary operator nodes plus these exchange operators which move data between nodes. That is how parallel processing is done: conceptually the same kind of query plan, still a tree, everything more or less the same, but easily extended to the parallel setting.
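Here is a tiny sketch of the data movement an exchange operator performs, assuming hash partitioning on a join key across n nodes; if both relations are partitioned this way, tuples with equal keys land on the same node, and each node can then run the same join plan on purely local data. The tuple type and node count are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class ExchangeSketch {
    record Tuple(int joinKey, String payload) {}

    // Decide, for each tuple, which node it must be shipped to: tuples of r
    // and s with equal join keys land on the same node, so each node can
    // join its two local fragments afterwards without further communication.
    static List<List<Tuple>> hashPartition(List<Tuple> relation, int nNodes) {
        List<List<Tuple>> fragments = new ArrayList<>();
        for (int i = 0; i < nNodes; i++) fragments.add(new ArrayList<>());
        for (Tuple t : relation) {
            int node = Math.floorMod(Integer.hashCode(t.joinKey()), nNodes);
            fragments.get(node).add(t);   // in a real system: send over the network
        }
        return fragments;
    }

    public static void main(String[] args) {
        List<Tuple> r = List.of(new Tuple(1, "r1"), new Tuple(2, "r2"), new Tuple(5, "r5"));
        List<List<Tuple>> byNode = hashPartition(r, 3);
        for (int n = 0; n < byNode.size(); n++)
            System.out.println("node " + n + ": " + byNode.get(n));
    }
}
```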
So, let me answer one last question from the chat. It asks: can tables be indexed on a picture data type? A picture data type is just a binary object, and there is no way to meaningfully build an index directly on a binary object. But if you do some processing on the image and come up with some features, you could index on those features. For example, I think many websites today offer face recognition on photos; within a circle of friends, they might do face recognition against your friends. They have actually processed the picture and done face recognition, so you might actually index the picture on the people who are present in it. It is possible to do such things, but it must be done after some feature extraction from the picture.