So, good morning everyone and welcome back to this course. Yesterday we covered query processing, and we ran out of time at the end, and I am sure many of you have questions about query processing. So, let us start off today by taking a few questions on query processing, after which we will move on to query optimization. I will set aside about 5 minutes now for any questions. If you have questions at this point, please indicate it on A-VIEW or send it by chat.

So, let us start by seeing if anybody out there has a question. I see that Warangal has a question. Warangal, you are on now, please go ahead.

My question is related to the previous slide. In Chapter 12, on query processing, a quiz was given in which random I/O was involved on both flash disk and hard disk, and we were asked to find out which of the query processing methods would work best. Could you please explain that again, sir?

So, the question was: we had 4 different join algorithms, nested loop, index nested loop, sort-merge and hash join, and the question was, supposing you switched from a magnetic disk to a flash or solid-state disk, which of those 4 algorithms will benefit the most, that is, which will see the maximum increase in speed. That is not to say it will always be the most efficient, but which will benefit the most, and the answer logic is as follows.

First of all, as we discussed, a hard disk is a mechanical device where an arm moves to a position on the disk, the disk spins, and eventually reading a particular piece of information from the disk takes on the order of 10 milliseconds. In contrast, a flash disk can read it within microseconds. So, this is the basic difference between flash and hard disk: random access, where I say go to some random place and read one block, is much faster on flash. If I say go to some random place and read, say, 5 megabytes, then the difference between flash and hard disk is not as great, because going somewhere and reading 5 megabytes is essentially sequential access. So, the question is which of these algorithms benefits the most, and to answer that you have to understand which of these algorithms does the maximum amount of random I/O, because that is the one which will see the largest reduction in cost.

As we discussed, nested loops join has two loops: one loop goes over the outer relation, and one loop goes over the inner relation repeatedly. Now, if you look at it, the access to each relation is sequential; occasionally you go back to the beginning of the inner relation and scan down again, but most of the I/O is sequential. So, nested loops join will benefit somewhat, but not that much. Similarly, for merge join, the actual merge phase is entirely sequential, so it is not going to benefit; the sorting phase has some amount of seeks, so it will have some benefit, but sorting is also predominantly sequential, so its benefit will also be limited. Similarly for hash join: in the phase where you read the original relations and partition them, the input is basically sequential, and the output is not fully sequential, but it is not fully random either, because you write out pieces of the partitions in fairly large blocks, typically multi-megabyte blocks.
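To make the answer concrete, here is a minimal back-of-the-envelope sketch. It is not from the lecture: the device timings, relation sizes, and the simplified seek and transfer counts per algorithm are all illustrative assumptions, loosely modelled on the usual textbook cost formulas.

```python
# Toy I/O cost model: cost = seeks * seek_time + block_transfers * transfer_time.
# All numbers below are illustrative assumptions, not measurements.

HDD = {"seek": 10e-3, "transfer": 0.1e-3}  # ~10 ms per seek, 0.1 ms per block
SSD = {"seek": 20e-6, "transfer": 20e-6}   # random costs about the same as sequential

b_r, b_s = 1000, 2000   # blocks in outer / inner relation (assumed)
n_r = 100_000           # tuples in outer relation (assumed)
lookup_ios = 3          # I/Os per index lookup: traverse index + fetch tuple (assumed)

# (seeks, block transfers) per algorithm, heavily simplified
algos = {
    "block nested loop": (2 * b_r, b_r + b_r * b_s),
    "index nested loop": (n_r * lookup_ios, b_r + n_r * lookup_ios),
    "merge join":        ((b_r + b_s) // 100, b_r + b_s),           # sort cost ignored
    "hash join":         (2 * (b_r + b_s) // 50, 3 * (b_r + b_s)),  # large seq. runs
}

for name, (seeks, xfers) in algos.items():
    hdd = seeks * HDD["seek"] + xfers * HDD["transfer"]
    ssd = seeks * SSD["seek"] + xfers * SSD["transfer"]
    print(f"{name:18s} HDD {hdd:8.1f}s  SSD {ssd:8.2f}s  speedup {hdd / ssd:6.1f}x")
```

With these assumed numbers, index nested loops shows by far the largest speedup when moving to flash, precisely because its cost is dominated by seeks, while the other three are dominated by sequential transfers.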
So, the number of seeks required for a hash join is again not that high. Whereas for an index nested loops join, every index lookup involves at least one seek, assuming the inner relation is big; if the inner relation fits in memory it would not benefit at all, but if the inner relation is too big to fit in memory, then there are going to be at least one and usually several I/Os for each outer tuple: you look up the index, then fetch the tuple. So multiple I/O operations, or multiple seeks, are involved per outer tuple. Now, if you have 1000 tuples in the outer relation and even 2 or 3 seeks each, that is something like 2000 to 3000 seeks for a fairly small input, and it can be much larger. In that case you are going to see a very big benefit. I hope that answered your question.

Let us see if anybody else has a question. PSG Coimbatore has their question flag up; PSG, you are on, please go ahead and ask whatever you wanted to ask.

Hello, can you hear me, sir? Yes, we can, please go ahead. Sir, are there situations where the main table has a primary key that has null values? In what situation will the table accept null values, sir?

So, first of all, if something is declared as a primary key, you cannot store a null value in it; in SQL that is a constraint which all database systems currently enforce. There were earlier versions of SQL which allowed null values in a primary key column, but the current standard, since at least SQL-92 if not earlier, disallows null values. So you cannot do that. Now, I interpreted the question as follows: there is a situation where most records have an identifier which you want to use as the primary key, but there are a few records for which you do not have that identifier, and you still want to record that information in the database. One way to do that is to simply declare the column as a unique key, not as a primary key. If you declare it as a unique key, it will make sure that two records cannot have the same value unless they are null; depending on the implementation, I believe unique allows multiple records with null, but I have to cross-check this. So you could try using a unique key if you need it in this particular situation. Let me go back to you to see if that was the correct interpretation of your question, or if you have something more to say. Back to you. Thank you, sir. Thank you, PSG.

SGS ITS, Indore, you are on now. Indore, please go ahead with your question.

Sir, I have a practical question. If we are working with a proxy server, then there are many rules, sometimes 10,000 or more than 10,000 rules, with fields like a priority, source address, destination address and many more. How do we perform indexing on these rules? If we arrange the rules according to priority, that is a simple top-to-down approach; then what is appropriate for indexing? Thank you, sir.

Okay. So, let me first elaborate on that question. The question is: in a proxy server there are often many rules about what can be allowed, what should be blocked, who is allowed to access what. Those are the kinds of rules you are talking about, right? Please confirm that that is what you are talking about. Back to you briefly to confirm that my interpretation is correct and that the goal is to index these rules, right? Okay.
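As a quick way to cross-check that unique-key suggestion, here is a small sketch using Python's built-in sqlite3 module. The table and column names are made up for illustration; the behaviour shown, duplicate non-null values rejected but multiple nulls admitted in a UNIQUE column, is how SQLite and most other systems implement it, though it is worth verifying on your own database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# govt_id is a unique key, not a primary key, so nulls are permitted
cur.execute("""
    CREATE TABLE person (
        govt_id TEXT UNIQUE,
        name    TEXT NOT NULL
    )
""")

cur.execute("INSERT INTO person VALUES ('ID-1', 'Asha')")
cur.execute("INSERT INTO person VALUES (NULL,  'Bala')")    # accepted
cur.execute("INSERT INTO person VALUES (NULL,  'Chitra')")  # also accepted: nulls may repeat

try:
    cur.execute("INSERT INTO person VALUES ('ID-1', 'Dev')")  # duplicate non-null value
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # UNIQUE constraint failed: person.govt_id
```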
So, the question is: you have rules like this, basically pattern-matching rules, and for any request that comes through you have to index into these rules to see which rules apply to it; amongst the rules which apply there may be multiple matches, and then there is a priority mechanism which usually says that the first of the rules which apply overrides the remaining ones. So the question is: you have rules which involve many fields, a request comes in which has several attributes, and you have to index to see which of these rules apply.

That is actually a very interesting question. Normally, when we store relational data in a database, the number of fields is fixed and the indices are on fixed fields. What we do is: given a value, or a range of values, for a particular attribute or for a list of attributes, the index lets us retrieve all the tuples satisfying that equality or that range. The question here is actually inverted. You can think of it the other way: you can think of these rules as a kind of query. They have a predicate; the predicate says, if A equals 5 or B equals 10, then do something. More realistically, in this proxy example, it says: if site equals star dot xxx, that is, a pornographic domain, then block it, do not allow it. If site equals some other such site, block it.

Now, there is one country in the world which has a huge number of such rules. Guess which one? Our neighbour, China. They put a lot of restrictions on what sites people from inside China can view. So this technology, I am sure, has been developed vastly in China, because they actually block outgoing requests for the entire country. We usually think of a proxy as operating within an institute, and each institute here is a tiny fraction of the country; they apparently have a bunch of servers enforcing the same kind of restriction on however many terabits of bandwidth the whole nation has going out.

So, the question is: how can you efficiently implement these rules? The first thing is that it is an indexing problem, but it is usually not an external-memory indexing problem; B-trees are designed for hard disk or flash, and if you had to go to hard disk, checking one request would take at least 10 milliseconds for one random I/O, which is way too slow. So all of this is in-memory indexing.

The next question is: how do you index predicates? This problem, incidentally, has been studied not only by Chinese government security agencies but also by many other people, and in a different context. The context in which it has primarily been studied in the database area is people subscribing to something. They can give a bunch of keywords; you can do this with Google, for example. You can register a query with Google, saying that I am interested in results for this set of keywords, and whenever the results change and something significant comes up, Google will send you an email saying here are some new results for your query, which is actually quite a nice way to track what is going on. One of the uses of this kind of tracking is by people who own the copyright in something like music or books, who would set up a Google query which says free download of XYZ, and any time Google finds a new site with a free download of XYZ, it is going to send an email saying here are these new sites which offer a free download of XYZ. You can guess why I know about this; you can substitute for XYZ some three words which are very close to home.
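To illustrate how the roles of data and query get reversed for keyword subscriptions, here is a minimal sketch. This is my own illustration under simplifying assumptions, not Google's actual machinery: each registered query is a set of keywords that must all appear, the queries themselves go into an inverted index, and a new page is used to probe that index.

```python
from collections import defaultdict

# Registered subscriptions: query id -> set of keywords that must all appear.
subscriptions = {
    "q1": {"free", "download", "xyz"},
    "q2": {"database", "indexing"},
}

# Inverted index over the QUERIES, not the data: keyword -> ids of queries using it.
index = defaultdict(set)
for qid, words in subscriptions.items():
    for w in words:
        index[w].add(qid)

def matching_queries(page_words):
    """Return the subscriptions fully satisfied by a new page's words."""
    page_words = set(page_words)
    candidates = set()
    for w in page_words:                 # probe the query index with the DATA
        candidates |= index.get(w, set())
    # a candidate matched on at least one keyword; verify the rest
    return [q for q in candidates if subscriptions[q] <= page_words]

print(matching_queries(["free", "download", "of", "xyz", "today"]))  # ['q1']
```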
So, in fact, we found a few sites which were offering free downloads of some earlier editions of a book, and of course that was completely illegal. So, the question is: given a large number of such requests which people have registered with a company like Google, how does Google efficiently find out which of these requests are satisfied by a set of new pages? They have a set of new sites which they have found, and they need to match it against the queries that people have registered. So this is reverse indexing: the queries are indexed, and the data is used to look up the queries, to see which queries match the data which is coming in. I cannot go into all the techniques; it is not that hard for keyword queries, actually. You can easily reverse the roles of data and query in a keyword search engine like Google, but it is a little bit harder for database queries.

Even in the database context, people have done a lot of work on publish-subscribe systems. What is publish-subscribe? You can publish data in a database, and you can have people subscribing to the data. Google queries are similar to publish-subscribe, but here the subscription is more specific. Your subscription could say, give me all updates on relation X, or it can say, give me all updates on relation X where city equals Mumbai, or institute equals IIT Bombay, or department equals CS, or whatever. So you have conditional subscriptions, which subscribe to just those updates. Now you have the same indexing problem: an update comes in, and you need to know which subscriptions are affected by the update.

So, how do you do this? For any relation there is a fixed set of attributes, which is usually the case. The issue here is that you have a bunch of queries which look at different attributes, and each query may refer to multiple attributes. Whenever an update comes in, whether an insert or, in general, an update or a delete, which satisfies a query's condition on an attribute, that query is a candidate. Now, there are all kinds of interesting issues here. There may be certain attributes where basically every piece of data that comes in satisfies the query, but there may be some other attribute where the updates which satisfy the query are relatively rare. This can in fact be turned on its head, as the question: given a query, what is the best index to use for it? The best index is the one which returns the fewest tuples. Inverting the problem, we select the best attribute for the query, but what we do with that attribute is not index the data; we index the query. If the query has an AND condition, this and this, the query is indexed on one of those conditions, and whenever that condition is satisfied by any new thing which comes in, or by anything which changes at all, you retrieve that query, check its other conditions, and then see if it produces a result.

So, there are a lot of interesting issues here. If you send me an email later through Moodle, send me a note, I will send you pointers to some work on indexing queries like this. In fact, one of the well-known papers in this area came from a person from Oracle who also spent a fair amount of time at Amrita in Kollam, I think about 5-6 years ago, a person called Jagannathan Srinivasan. There are other papers in this area as well; I will be happy to send you pointers. So, that was for the database area.
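Here is a small sketch of that idea of indexing each query on just one of its conditions, ideally the most selective one, and verifying the rest at match time. Again, this is my own illustration under simplifying assumptions: equality predicates only, and hand-supplied selectivity estimates.

```python
# Each subscription is a conjunction of equality predicates on attributes.
subs = {
    "s1": {"relation": "takes", "dept": "CS"},
    "s2": {"relation": "takes", "city": "Mumbai"},
}

# Assumed selectivities: lower means fewer updates match, so a better index key.
selectivity = {"relation": 0.5, "dept": 0.01, "city": 0.02}

# Index each subscription under ONE (attribute, value) pair: its most selective one.
index = {}
for sid, preds in subs.items():
    key_attr = min(preds, key=lambda a: selectivity.get(a, 1.0))
    index.setdefault((key_attr, preds[key_attr]), []).append(sid)

def affected(update):
    """Find the subscriptions whose every predicate the update satisfies."""
    hits = []
    for attr, val in update.items():            # probe the index with the update
        for sid in index.get((attr, val), []):
            # the indexed condition matched; now verify the remaining predicates
            if all(update.get(a) == v for a, v in subs[sid].items()):
                hits.append(sid)
    return hits

print(affected({"relation": "takes", "dept": "CS", "city": "Delhi"}))  # ['s1']
```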
Coming to a proxy server, now there is no question of disk-based indexing and suchlike; everything is on the fly. So the specific set of techniques is usually driven not by database issues but by data-structure issues. The problems are very similar, but the solutions will look a little bit different, because everything is in memory and everything has to run super fast, in microseconds at most. So there is a lot of data-structure hacking and so on, including hacking at the firmware level and at other levels on routers. The same kind of thing which you are talking about for proxies also happens on internet routers: there are a lot of rules in routers which have to be matched very fast. So they have specialized hardware; Cisco, for example, is well known for this. They have highly parallel specialized hardware to do this kind of matching on the fly. So, I hope that answers your question.

If anybody else has a question, we can take that. Otherwise, let us switch over to today's topic. Okay, good. Looks like there are no further questions.