And this is a good point to take a few questions; we have gone about one hour without questions. So, let us take some live questions, and then I will take the chat. S V U College, Tirupati. "Hello sir, my question is on database connectivity: how can we connect to remote databases in a remote location using PHP or any other language, and what kind of connectivity is to be used for that?" So, the question is how do we connect to a remote database using PHP, and what kind of connectivity to use. First of all, there are standard protocols to connect to a database, whether it is local on your own machine or elsewhere. For Java it is JDBC; ODBC was the standard for C; and pretty much all the scripting languages like PHP have their own version of the ODBC API, and usually some other APIs as well. For PHP there is another API, the PEAR API, for connecting to databases. So, internally these languages provide an API, and the way they talk to the database is using the ODBC connectivity that the database provides. That is what you would do: use the API for whatever language you are using, and it does not matter whether the database is local or remote; you can use the same thing. Many databases also provide proprietary protocols and corresponding tools. Oracle has its own connectivity protocol, and if you are using Oracle tools you could use that, but if you want to use Java then you probably cannot. Just to wrap up, there are also other ways of accessing data remotely. There are what are called web services, which in the simplest avatar mean you do an HTTP GET on a particular URL and you get data back; there are much more complex versions. And the page which you get back is not necessarily an HTML page; you may get back JSON data or XML data. So, this is called a web service: a particular URL is invoked with some parameters, and the response is not intended for direct display but for consumption by a program. So, instead of low-level database access, you might use a web service like this and build your own interface to that database to implement whatever functionality you want. That is sort of what we do when we build an application: in that case the end user is an actual user interacting through a web interface, but if the end user were a program, the equivalent thing can still be done using HTTP or other protocols.
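To make the web-service idea concrete, here is a minimal sketch in Python; the URL and the JSON field names are hypothetical, made up purely for illustration. The point is that the response is JSON for a program to consume, not HTML for display.

```python
import json
import urllib.request

# Hypothetical web-service endpoint; the URL and the response fields
# below are invented for illustration, not a real service.
url = "https://example.edu/api/instructors?dept=Physics"

with urllib.request.urlopen(url) as response:
    data = json.load(response)  # parse the JSON body of the HTTP response

# The calling program consumes structured data directly.
for instructor in data["instructors"]:
    print(instructor["name"], instructor["salary"])
```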
Mount Zion, Tamil Nadu, please go ahead. "Sir, my question is: how do I create a new database, and how and where should I start normalization? It is a question we have been waiting to ask for the last two days; can you tell me about that, sir?" So, the question is: when you want to build a database for some application, how and where do you do normalization? The first step is to understand the needs of that application; the whole process is laid out in the sample project proposal document, so make sure you follow the steps required in building a new system. You do not jump straight into normalization. You first put down what functionality you need, in English or in whatever language you prefer; then you sketch out the kinds of data you need to store; then you do entity-relationship modeling of the application; then you convert to relations; and then you figure out what functional dependencies and multivalued dependencies there are and see if any normal form is violated. That is where the normalization process happens. Part of this is figuring out which functional dependencies hold, not just now, but are likely to continue to hold. If something is not likely to hold, it may change tomorrow, and it is probably better not to take note of it when you design the schema, because you could get into trouble. Like I said, if you assume that students are in one department, but you believe there is a chance that tomorrow students may be in two departments, you should not use that functional dependency. For our colleges we know pretty much that students are given a department and that is their major, so that is perfectly fine for us. But if there is something you are not sure of, say advisors: we currently have only one advisor per student, but tomorrow that may change and we may want two advisors; that is not something you want to bake into the design, so do not use that particular functional dependency during schema design. After that, the steps of normalization, looking at the dependencies and figuring out whether a relation is in BCNF or 3NF and so forth, can be used as is. For the moment, let us go back to query processing, and then I will come back to the quiz. So, we saw sorting with external memory, and it is pretty efficient; today people can sort a terabyte relation in a few minutes on parallel machines with multiple disks. There is some interesting history here. At one time a gigabyte was thought to be big, so a benchmark called the gigabyte sort was created: let us see how fast you can sort a gigabyte. A gigabyte of main memory was unimaginable back then; main memories were 10 or 20 megabytes in size, so it was a genuinely external-memory benchmark. About 5 to 7 years after the benchmark was launched, main memory sizes grew to 1 gigabyte; it was expensive back then, but it was available. Today, of course, every one of us has many gigabytes on a desktop, so the 1 gigabyte sort benchmark turned into an in-memory sort rather than an external-memory sort. Now there are much larger sizes, and the current benchmarks are at the terabyte scale, but there are already people with a terabyte of memory, so those will probably have to go to 10 terabytes or so. So, coming back: that was sorting; now let us focus on join algorithms. There are many different ways of doing joins. The simplest is the nested loops join; let us start with that and then look at alternatives. The nested loops join is applicable regardless of the join condition. I am going to do a theta join. What does that mean? It is essentially R join S on some condition theta; natural join is a special case of this, with projection, but this is the most general case. The condition could be anything: R.a < S.b and R.c = S.d and some function of R.a is greater than some other function of R.a and S.b, and so forth; it can be an arbitrarily complex condition. Regardless of how complex it is, we can always do a nested loops join: for each tuple t_r in R, for each tuple t_s in S, two nested for loops, test the pair of tuples to see if they satisfy the join condition, and if they do, output the concatenated tuple to the result. That is it; this is a very simple algorithm, sketched below. R, the relation in the outer loop, is called the outer relation; the relation in the inner loop is called the inner relation.
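As a minimal sketch, here is the nested loops join in Python, with in-memory lists of tuples standing in for relations on disk; in a real system the loops run block by block, but the structure is the same. Passing the predicate in as a function is what makes the algorithm work for an arbitrary theta.

```python
def nested_loops_join(r, s, theta):
    """Theta join of two lists of tuples: works for ANY join condition,
    but examines every pair, so the cost is len(r) * len(s)."""
    result = []
    for t_r in r:              # outer relation
        for t_s in s:          # inner relation
            if theta(t_r, t_s):
                result.append(t_r + t_s)   # concatenate the matching pair
    return result

# Example with an arbitrarily complex condition:
r = [(1, 5), (2, 7)]
s = [(6, 2), (3, 9)]
print(nested_loops_join(r, s, lambda tr, ts: tr[0] < ts[0] and tr[1] != ts[1]))
```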
So, this is very general, but also very expensive, because it is quadratic. It is not feasible for any pair of large relations, although occasionally there is a very complex join condition on small relations where you are forced to do this; for two large relations it is going to be horribly expensive. The next algorithm is the index nested loops join, which is actually fairly simple. The edits I made to this slide did not make it, but just look at this here: for each tuple t_r in the outer relation R, use the index to look up tuples in S that satisfy the join condition. For this, the join condition had better be something which the index can support, and the most common case is an equi-join, R.A = S.B. Let us say R is the outer relation, S is the inner relation, and I have an index on S.B. Now, for each tuple in the outer relation R, I take that tuple, look at its R.A value, use the index on S.B to find all matching S tuples, and then output all the pairs of those S tuples with that particular R tuple. Then I move to the next R tuple, do the same, and so forth through all the R tuples. So, this is the outer loop here: for each tuple in the relation R. I am going to skip the details of the join cost, but essentially the cost is the number of records in the outer relation times the average cost, denoted C, of traversing the index and finding all the matching records; if there are many matching S tuples, C will be higher. And if the index is on disk and the relation is very large with small memory, each index traversal could involve disk I/O, so C could be fairly expensive, because it may involve on average one or more seeks per lookup. So, the index nested loops join can be very efficient if n_r is small, but if n_r is very large, that one seek per outer record can blow up in your face and become very expensive. A sketch follows.
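Here is a minimal sketch of the index nested loops join, using a Python dict as a stand-in for the index on S.B; in a database the index would be a B+-tree or hash index, possibly on disk, and each lookup may cost one or more seeks.

```python
from collections import defaultdict

def index_nested_loops_join(r, s, r_attr, s_attr):
    """Equi-join r.r_attr = s.s_attr using an index on the inner relation.
    Cost is roughly n_r * C, where C is the per-lookup cost."""
    # Stand-in for an existing index on s_attr (here built on the fly).
    index = defaultdict(list)
    for t_s in s:
        index[t_s[s_attr]].append(t_s)

    result = []
    for t_r in r:                              # one index lookup per outer tuple
        for t_s in index.get(t_r[r_attr], []):
            result.append(t_r + t_s)           # output each matching pair
    return result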
There are alternatives, though, which work when the outer relation is large, with a lot of records. The first of these is merge join, and it is very easy to understand. The first step is to sort both relations on the join attribute; maybe they are already stored sorted, in which case there is no need to sort, and otherwise we run external merge sort and get the sorted output. So, here is the relation R, sorted on attribute a1; you can see a, b, d, d, f, m, q. Similarly, relation S is sorted on the join attribute; in this case R.a1 = S.a1 is the join condition, so we have sorted it also on a1. And now the merge join is very easy. We keep a pointer to the first record on each side; they match, so we immediately output this pair, (a, 3) with (a, A), as the first output. If it were a natural join, the two copies of a1 would be replaced by one copy, so the output would be (a, 3, A); for natural join we can do the projection immediately. Now, what do I do? I move this pointer to the next record, b. Is it equal to this a? It is not; b is larger. At this point I have to move the other pointer to the next larger value, which is b, and I get b with b, which is a match, so that is output next. What do I do? I move this pointer again, to the next one, which is c. Does it match? No. Which is smaller? b, so I move that one, and I compare d to c: no match, and c is smaller, so I move that pointer to the next one, d. Now we have a match: (d, 8) matches d, and I output it. Now, if I look ahead on this side the next record is m, but on the other side the next one is still d; it has not changed. So, I actually need to do a little bit of look-ahead to find which records on each side have the same value. If there were, say, 3 records on one side with the a1 value d and 2 records on the other side with a1 value d, all of them match, and I essentially have to do a cross product of those two subsets, each record in one run matching every record in the other run. In this case there is just one d here, so I have to output (d, 8) with (d, n) and also (d, 13) with (d, n); two tuples have to be output. Then, moving on, I find f and m; f is smaller, so I move ahead to m; m and m match, and that is output. I move ahead here to q; this one matches nothing, and that is it. So, merge join is a very simple, very intuitive algorithm; if you are familiar with the merge operation, merge join is a simple extension of it, sketched below. It is also very efficient: the total cost is basically just transferring all the blocks, and we will skip the details of the seeks. If the relations are sorted it is super efficient; if the relations are not sorted, you have to add the cost of sorting.
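Here is a minimal sketch of merge join in Python, on two lists already sorted on the join attribute, assumed here to be attribute 0 of each tuple. The inner while loops do the look-ahead: they collect the run of equal values on each side so the cross product of the two runs can be output, just as in the d-tuples case above.

```python
def merge_join(r, s):
    """Merge join of two lists of tuples, both sorted on attribute 0."""
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1                      # advance whichever side is smaller
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            # Look ahead to collect the run of equal join values on each side ...
            v = r[i][0]
            i2 = i
            while i2 < len(r) and r[i2][0] == v:
                i2 += 1
            j2 = j
            while j2 < len(s) and s[j2][0] == v:
                j2 += 1
            # ... and output the cross product of the two runs.
            for t_r in r[i:i2]:
                for t_s in s[j:j2]:
                    result.append(t_r + t_s)
            i, j = i2, j2
    return result
```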
And the last major technique is hash join; we did cover hashing earlier, and we will now explain it in the context of the hash join. This particular algorithm, just like merge join, is applicable to equi-joins and natural joins. What is an equi-join? It has a condition such as R.a = S.b. You can also have multiple attributes being equated, R.a = S.b and R.c = S.d; for merge join we could sort on the pairs of attributes, R on (a, c) and S on (b, d), and then do the matching, and similarly we can hash on the pairs. We will ignore those details for now. So, what does the hash join do? The first step is to use a hash function to partition the tuples. The goal of the partitioning is to break this large relation R and this other, potentially large, relation S into pieces such that at least the smaller relation's pieces fit in memory. S is the smaller relation here, and I want each of the pieces of S to fit in memory; that determines the number of partitions I need to create, and I choose the hash function appropriately. How do I do that? Typically I have a hash function which generates a value from, say, 1 to max, and the number of partitions I need is the size of S divided by the size of memory, plus some fudge factor to allow for non-uniform breakup. Supposing S is 100 blocks and memory is 10 blocks, 100 divided by 10 gives 10 partitions, and allowing for some non-uniformity, say 20 percent, I create 12 partitions. Now, the number of partitions of R has to be exactly the same; in fact, the partitioning function is exactly the same. I use the same hash function to partition R and S: in the case of R it is applied on the join attributes of R, and in the case of S on the join attributes of S. What is important is that the partitioning is done on the join attributes. If I had two tuples, one in R and one in S, which would have satisfied the join condition, what can I say? Their join attribute values would have been the same, and therefore their hash values would be the same: if two tuples could match at all, both of them hash to the same value i. In other words, a pair of matching tuples in R and S will land in the same partition number: if the S tuple lands in partition 1, that R tuple lands in partition 1; if the S tuple lands in partition 3, that R tuple lands in partition 3. So, once I have done the partitioning, I only need to match tuples in r0 with tuples in s0; tuples in r0 cannot match tuples in s1, s2, s3 and so forth, and vice versa, tuples in s0 can only match tuples in r0, not any others. So, I can do the joins locally between r0 and s0, r1 and s1, r2 and s2, and so forth. Furthermore, each of the relations s0, s1 and so forth fits in memory. So, how do I finish the join? It is described here. The first step is to partition R and S with hashing; we have already discussed that. Then, for each i, we load s_i into memory and build an in-memory hash index on it, on the join attribute. The hash index must use a different hash function from the earlier one, because under the earlier hash function all of these tuples map to the same value i; so I cannot reuse that hash function, and I use some other hash function to build the hash index. Now, I read the tuples of r_i one block at a time and consider the tuples one at a time. For each tuple, I use the same in-memory hash function which I used for s_i and locate the matching tuples using the in-memory hash index; I compare the actual values, not just the hash values, and if they match, I output the concatenation as a join result. And that is pretty much it; that is the hash join algorithm, very simple. The basic idea is: partition the relations into smaller pieces such that each s_i fits in memory; then load each s_i, one i at a time, fully into memory and build a hash index on it; S is called the build input. Then take the tuples of r_i and probe the in-memory index I have just built; R is called the probe input; find the matches, output them, and move on to the next R tuple. Note that, if you look at this picture, r_i does not have to fit in memory, because I am not reading all of r_i into memory at a time; only the s_i's have to fit in memory. So, that is the hash join algorithm, sketched below; there are some more details in the book, which I am going to skip.
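Here is a minimal sketch of hash join in Python: one hash function, taken modulo the partition count, splits both inputs on the join attribute (assumed to be attribute 0), and a Python dict then plays the role of the in-memory index on each build partition s_i, while r_i is streamed through as the probe input. Everything is in memory here; in a real system the partitions would be written to and read back from disk.

```python
from collections import defaultdict

def hash_join(r, s, n_parts=12):
    """Equi-join on attribute 0; s is the smaller (build) relation."""
    # Step 1: partition both relations with the SAME hash function,
    # applied to the join attribute, so matching tuples land together.
    r_parts = [[] for _ in range(n_parts)]
    s_parts = [[] for _ in range(n_parts)]
    for t in r:
        r_parts[hash(t[0]) % n_parts].append(t)
    for t in s:
        s_parts[hash(t[0]) % n_parts].append(t)

    result = []
    for i in range(n_parts):
        # Step 2: build an in-memory index on s_i; a real system must use
        # a different hash function here than the partitioning one.
        index = defaultdict(list)
        for t_s in s_parts[i]:
            index[t_s[0]].append(t_s)
        # Step 3: probe with r_i, one tuple at a time; the lookup compares
        # actual key values, and matches are concatenated into the result.
        for t_r in r_parts[i]:
            for t_s in index.get(t_r[0], []):
                result.append(t_r + t_s)
    return result
```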
So, hash join and merge join both work for equi-joins and natural joins, which are a special case of equi-joins anyway, an equi-join plus projection. Now, what about other conditions, which are more complex? Earlier I said that if the condition is too complex I may have to do nested loops, but it is actually not so bad. Suppose I have a conjunction of conditions where at least one of the conditions, theta_i, is an equi-join. Then I can use one of the algorithms, merge join, hash join, or index nested loops join, on that condition theta_i. And what about the remaining conditions, theta_1 to theta_(i-1) and theta_(i+1) through theta_n? I use them as filters on the tuples that match the join condition. So, it is actually better than it looks: only when no condition is an equi-join do we fall back to the nested loops join. For example, if I have one condition which is an equi-join and another condition R.a < S.b, I first apply the equi-join, and the matching tuples are then filtered on the second condition. Let us wrap up the other operations; I want to finish this in the next 5 minutes, and after the break we can move on to optimization. The other operations are all done very easily using sorting and hashing. The first is duplicate elimination. I think it is fairly obvious that if you sort, all the duplicates will come together. The only question is: what do we sort on? The answer is that you sort on all attributes. That ensures that two tuples which are identical will be adjacent to each other in the sort order. Once I have sorted on all attributes, I just scan, and when I see a new tuple I check whether it is equal to the previous tuple: if it is new, output it; if it is the same as the previous tuple, skip it and keep going; a sketch follows below. So, that is sorting. It can also be done similarly with hashing. The trick there is that we use hash partitioning to break the relation into pieces where each piece will fit in memory, and once I have partitioned it into such pieces, I build a hash index on each piece. As I am building the hash index, I check whether the tuple is already present in the index: if it is, I do not even add it to the index, I just throw it away, it is a duplicate; if it is not present, I add it to the index and output it. So, it is very simple; if this went by too fast for you, go read it up from the slides or the book. The projection operation is basically two steps: the first step is removing the unnecessary attributes, and the second step is the duplicate elimination which we just saw. That is very simple.
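A minimal sketch of the sort-based duplicate elimination in Python, over an in-memory list; in a real system the sort would be an external merge sort over disk blocks, but the scan is the same.

```python
def eliminate_duplicates(tuples):
    """Sort on all attributes so identical tuples become adjacent,
    then output each tuple only if it differs from the previous one."""
    result = []
    prev = None
    for t in sorted(tuples):        # sorting a tuple sorts on all attributes
        if t != prev:               # new tuple: output it
            result.append(t)
        prev = t                    # identical to previous: skipped
    return result

print(eliminate_duplicates([(1, 'a'), (2, 'b'), (1, 'a')]))
# [(1, 'a'), (2, 'b')]
```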
The next major operation is aggregation. Now, if you remember the aggregate operation, how did we do it? We first got the records which belong to the same group together; we had to group the relation. And if you recall how we grouped the relation in our example, we simply sorted it. Sorted on what, is the question: we sorted on the group-by attributes. If you sort on the group-by attributes, all the records with the same value for the group-by attributes come together. It would be nice if I could flip to the other slide, which gives an example of the group-by operation, but it is too slow switching between slides here; I hope you remember that we sorted on the department attribute to bring all instructor tuples for the same department together. Now, if you want to find the sum, count, min, or max, all those tuples are adjacent to each other. I hope you can see this example now. This was: take instructor, group by department, and output the average salary. So, what have we done here? We have sorted on department name, and the first one is Biology. To find the average, we actually maintain the sum and the count separately, and we output the average when we move to a new group. So, the first step: we take the first tuple, Biology; the sum is initially 72000 and the count is 1. The next one is Comp. Sci., so there are no more tuples in Biology, and I output the sum of salary divided by the count of salary, which is the average; in this case it is 72000, so we output Biology, 72000. Moving forward, it is Comp. Sci.: initially the sum is 75000 and the count is 1; then I add 65000 and get 1,40,000, count 2; then I add 92000 and get 2,32,000, count 3. The next one is Electrical Engineering, so I have reached the end of the group, and I output Comp. Sci. with 2,32,000 divided by 3. Moving forward, I get Electrical Engineering, 80000, count 1; that is the only tuple, so I output it; similarly Finance and so forth. So, in a single pass, keeping just a couple of variables for the sum and count, I have been able to output the averages; a sketch follows at the end. Similarly, min, max and so on can also be done very easily using very few variables. If I want to find the median, or a few other aggregates like it which are called holistic, I may have to load all of the values of a group into memory, sort them, and then find the median; again, there are tricks for the median which involve presorting. So, I can compute all the aggregates I am interested in pretty efficiently by sorting. I can also do it using hashing; for lack of time I will not cover all the details, but maybe I should spend one second on aggregation using hashing. The basic idea is that as I get tuples, I build a hash index on the group-by attributes, and when I get a new tuple I add to the sum or the count, or update the min or the max; I scan through all the tuples and keep putting stuff into the index. The hash index is on the group-by attributes, but it stores the aggregates computed so far: sum, count, min, max. Now, as long as the number of groups is small, this hash index fits in memory, and it is actually very efficient: I do not even have to do sorting, one single pass over the data and I am done. So, it is actually a very good algorithm, but if the number of groups might exceed the amount of memory available, it gets into trouble; there are variants which deal with that, and I will not get into the details. Then there are the set operations. Union is straightforward: duplicate elimination is the only extra step there. I am going to skip the details of intersection, set difference and so on for lack of time, and what I am going to do is wrap up; there are a couple of things more I need to do, and we are a little behind time. So, I will break here for your tea break.
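As promised above, here is a minimal sketch of the sort-based aggregation just walked through: sort on the group-by attribute, scan once, and emit sum divided by count at each group boundary, carrying only two running variables.

```python
def average_by_group(tuples):
    """tuples are (dept_name, salary) pairs; sort on the group-by
    attribute, then scan once, emitting the average at each boundary."""
    result = []
    group, total, count = None, 0, 0
    for dept, salary in sorted(tuples):       # grouping via sorting
        if dept != group:
            if group is not None:             # group boundary: emit average
                result.append((group, total / count))
            group, total, count = dept, 0, 0
        total += salary                       # running sum for the group
        count += 1                            # running count for the group
    if group is not None:                     # emit the final group
        result.append((group, total / count))
    return result

print(average_by_group([('Biology', 72000), ('Comp. Sci.', 75000),
                        ('Comp. Sci.', 65000), ('Comp. Sci.', 92000)]))
# Biology -> 72000.0, Comp. Sci. -> 232000 / 3 = 77333.33...
```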