 So, slide here is set operations which we covered very quickly yesterday. So, we can apply union intersection set difference to the results of SQL queries. And as I told you earlier an SQL queries result is a relation. A relation is a set. So, I can do union intersect except on sets or a multi set. Now the syntax is straight forward in SQL. I just have a query select from where union or intersect or except. Now the next question is how do set operations deal with duplicates. So, it turns out that in SQL by default the set operations union intersect and except all eliminate duplicates by default. Now do not ask me why this is the default. But if you do not want them to eliminate duplicates you can say union all except all or intersect all. Now what do these mean? If you say union all and each of the relations has some duplicates or maybe the union inputs do not have duplicates the same tuple though may be present in both the results. So, in union if a copy is present in both the inputs it will appear only once in the result because union eliminates duplicates by default in SQL. But if I said union all both the copies will be present. So, if a tuple was present in both the relations it will be present twice in the result. Union all is actually very cheap to implement you simply take the first result concatenate the second result and you are done union requires duplicate elimination. Now what if a tuple appears multiple times in the first and in the second it is easy to see that union all if it concatenates if a tuple appeared m 1 times here and m times here and n times there then it appears m plus n times in the output. So, that was easy. Now what about intersection? If a tuple appears 0 times in the first and 1 times in the second it should not appear in the output. If it appears once in both it should appear in the output. What if it appears once in the first one and twice in the second? How many times should that tuple appear in the output and the way SQL defines this is to say it will appear a minimum of m comma n times. So, if it appeared once here twice there it will appear only once in the output. If it appeared 4 times here and 2 times in the other one it will appear 2 times in the output because minimum of 4 comma 2 is 2. So, that is how it is defined it is fairly intuitive. And finally, the except this set difference outputs a result if it is present in the first, but not in the second output argument to the minus. What if there are duplicates? It should be clear that if it appears one time in the left side and one time in the right side it will not appear in the output there are no duplicates. What if it appears one time in the left and twice in the right again it should be clear it should not appear in the output. But what if it appears twice in the left and once in the right? The set version of except eliminates duplicates first. Therefore, even if it appears twice it will take one copy here and the other one it appears once it takes one copy. And since it appears in the second it is eliminated and that is it it will not appear in the result. But the multi set version of except which is except all shown here in that the tuple will appear m minus n times. So, if it appears twice here and once in the right hand side it will appear 2 minus 1 which is one time in the output. Well, but we have to be careful what if it appears once in the left hand side and twice in the right then 1 minus 2 is minus 2. So, it cannot appear minus 1 times in the output. Therefore, it basically appears 0 times. So, the formula is max of 0 comma m minus n. So, if m minus n is negative the maximum will be 0. So, that is how set operations deal with duplicates in SQL. Now, moving on we saw the null value. The null value is actually very important and we really need to understand what SQL does with null values. Now, we already saw why we need null values. There are situations where we do not know if what the value of an attribute is. And in systems which did not allow null people used some particular flag. Now, what flag they used dependent on who wrote the program. So, for example, in IIT Bombay's old database this was back in the 80s we used FoxPro which is a old pre-relational. It was kind of relational, but it did not support SQL database and it did not support null values at that point. So, they had a schema which had a retirement date for people. Now, when they first loaded the data they did not know the retirement date for people. This should have been known, but it would have had to be entered manually. So, the program has decided we will say it is unknown, but there was no null value. So, what was unknown? They had a year which was 2 digit and they wanted to do year comparison to say the current year is less than the retirement date. They are still there. Therefore, they said the retirement date as December 31st 1999. That was their null value which seemed fine in the 80s. Late 80s this seemed fine. 1999 is more than 10 years away. Why do we bother? And guess what? Eventually, 1999 did come around and according to a database all of us over here who were there at that time were to be retired on that date. So, we would have all been fired. We would not have been happy. Obviously, this kind of error is going to be caught and before that 1999 December 31st happened the programs were rewritten and the actual year was increased to 4 digits and the actual year of retirement was put in. But a safer solution would have been to put null over there if you didn't know what it is and that is why nulls get introduced. But once nulls are introduced you have to deal with them. If I say that a query uses an attribute value which is null you have to be explicit in the SQL semantic as to what will happen. I cannot say well do whatever you want. I have to say explicitly what happens. So, to do that I have to see the impact of null on each operation in SQL. We will start with the most primitive operation which are expressions. So, let's start with arithmetic expression. If I say 5 plus null what should it be? I have 5 which I know. I am adding to it an unknown value. I don't know what the result should be. Therefore, 5 plus null will come out to be null in SQL. Now, what about other operations? I am not going to list all of them, but I suggest you try it out. You can write an SQL query which says select 5 plus null from some small relation. Let's say what is the smallest relation you have because you will get duplicates. You can create a new relation with just one record in it. So, it has to have one record at least. And then let's call this temp. In Oracle there is such a relation called dual which many people use. It's nothing but a relation which has one tuple and you generally don't actually access its attributes. It's for writing queries which work on constant expression. So, I say select 5 plus null from this relation dual. In PostgreSQL it doesn't exist, but you can create it and see what is the result. Similarly, in today's lab I suggest you try out other things. If I say the string let's say John concatenated with null. Find out what happens. It's worth trying this. So, that is for arithmetic and string expressions. Now, what about other things which are predicates? We are going to come to that in the next slide. If I say 5 less than null what is the result? So, that is coming up in the next slide. But before that SQL let's just check explicitly for null. By saying for example, select name from instructor where salary is null. So, these are all instructors whose salary I don't know that it doesn't know. What if I said select name from instructor where salary equals null. Now, equal null doesn't have any meaning because null stands for unknown. Unknown equal to unknown? Well, you know unknown value may be 5, it may be 10. So, I can't say null is equal to null. 5 may not be equal. 5 is not equal to 10. I don't know what they are. It may be that both the nulls actually represent 5, but I don't know. Therefore, if I say something equal to null, it will basically say I don't know the result. The equal will say unknown. So, if I want to check if a value is null, I have to say is null in a special syntax. I cannot say equal to null. So, that is the query here. I can also say is not null which will be the opposite. Now, coming to comparisons with nulls, if I had an attribute value which was null and my selection condition was 5 less than r dot a. What happens? r dot a is null. Therefore, I am actually evaluating 5 less than null. Is 5 less than null? I can't say. May be null was 4 in which case it is false. May be null was 6 in which case it is true. I don't know. Now, similarly null equal to null or null not equal to null. All of these we don't know. So, really we should say we don't know. But if you take Boolean logic which has only true or false, it does not have any value called unknown. Therefore, if you stuck with Boolean logic, you will have to say true or false and this causes problems. If I say it is true, obviously it is wrong. So, maybe I can say it is false. 5 less than null is false. That also causes a problem to have an example here. But let me bring it up on the slides here. Let's take this query which says select star from r where r dot a is less than 5. So, I can say it is false. So, I will say if it is false, then that apple will not appear in the result. That is if r dot a is null where r dot a equal to null, null less than 5. If I say it is false, then I get a result which is acceptable in the sense that that particular r dot r tuple will not appear in the result. But now, supposing I write a slightly different query. Instead of writing r dot a less than 5, I could say not r dot a greater than or equal to 5. Now, you would expect that these two conditions in the two boxes here should be the same. In normal algebra, they are the same. r dot a less than 5 is really the same as not r dot a greater than or equal to 5. But let's see what happens if r dot a is null. I cannot say that r dot a greater than or equal to 5. So, supposing I say false, what happens? This is false. Not a false is true. So, now, look what happened if r dot a is null, the first one here would evaluate to false while this one here would evaluate to true. Now, that is a problem. These two should have been the same value, but suddenly they are not. Depending on how you write the query, you get two different results. That's rather disturbing. And it turns out the problem is that our Boolean algebra forced us to say that this is either true or false. The solution is to introduce a new value called unknown. So, what we are going to say is if we say r dot a less than 5 and r dot a is null, we will say that the result is unknown. So, now, what do we do with this unknown result? So, note, look at the third bullet here which says not of unknown equal to unknown. So, r dot a less than 5 unknown. Not r dot a greater than or equal to 5. Well, r dot a greater than or equal to 5 is also unknown. Not of unknown is unknown. So, what I have just done is using three valued logic which includes a new truth value unknown. These two things which were different earlier are now both the same unknown. Now, how do you deal with unknown in general with other operations and and or? Well, this truth table shows it. So, unknown or true? Well, it is very clear that whether unknown is true or false whatever it turns out or true has to be true. Therefore, the result of unknown or true is true. Now, how about unknown or false? Now, the false part is certainly not going to make it true, but the unknown depending on whether it is true or false the result would be true or false. So, I cannot say which it is. Therefore, unknown or false is unknown. Similarly, unknown or unknown is unknown. I cannot say what it is. Now, what about and? True and unknown? Can I say it is true? No. If unknown is true, it is true. If unknown is false, true and false is false. So, the result of true and unknown is unknown. How about false and unknown? Now, whether unknown is true or false? False with anything is false. Therefore, false and unknown I can say is false. And how about unknown and unknown? Well, again I cannot say anything. The unknowns may be true or false and therefore, I will say the result is unknown. So, what we have just done is Boolean logic with unknowns. Well, so far so good, but that does not tell me what to do with the result of an SQL query with this thing appearing. So, wait one other construct SQL has explicit control is called is unknown which evaluates to true if the predicate p evaluates to unknown. So, that wraps up the three valued logic, but that has not solved the overall problem yet. If the query had a condition where r dot a less than 5 and I evaluate this and for a particular tuple, the result is unknown. Now, should that tuple appear in the result of the query or not? The wire clause condition said r dot a less than 5. I do not know whether it is true or false. So, conceptually the tuple may appear or it may not appear. We are not sure. So, people have actually worked on versions of relation algebra where a tuple presents in a result is itself uncertain. It may or may not be present. However, this complicates the semantics and query processing and many other things tremendously. So, SQL decided not to go that far. It said a relation is deterministic. Either thing is present or it is absent. Therefore, what it does is it pushes the unknown along as far as it can. So, as long as it is evaluating a predicate, it will keep the three valued logic with unknowns, but at the end when it has to make a decision on whether a particular tuple satisfies the predicate or not, it will only output tuples where the predicate evaluates to true. If the predicate evaluates to false, it is out. If it evaluates to unknown, again the tuple is out. So, what SQL is doing is saying I will only output tuples which I am sure are true. If it is unknown, it will not be in the query result. So, that is a limitation which is forced because of efficiency considerations. So, this is a good point to take a break and we have another quiz question. So, do not answer the questions yet. Just press the s t button and meanwhile, let us see the questions. Consider the two predicates r dot a less than 5. We will see. You get it up and we will also see the response for the previous question. So, go ahead and set it up. So, this question is r dot a less than 5 and r dot a is not null. The question is either too equivalent, equivalent only if r dot a cannot be null or equivalent only if r dot a is null or a never equivalent. So, you have some time to answer it. Do not answer it at this point. Before you do, we will get the response for the previous query setup. We will see it and then set up the current query. The question was if you omit the order by clause, there are several options. The first option is there is no default ordering. That is true. If you omit the order by the SQL system can choose whatever ordering it wants. The second alternative was the result may turn out to be ordered. In fact, this is also true. The SQL engine may choose a particular way of evaluating it which results in some order. Maybe it ends up, you know, if you have a query on instructor. It may turn out that the SQL engine chose to sort instructors by their ID and the result may happen to be sorted by ID. But that is not guaranteed. If you run the same query on a different database engine, it may very well give you a different sort order. Or if you move to a different version of the same database engine, it may give you a different sort order. So, if you need a sort order, don't depend on what happens to show up, but you have to explicitly sort. So, in fact, 1 and 2 are both true. So, the correct answer is 3 or C which is both the above are true. 4 which says neither 1 nor 2 is true is obviously wrong. Now, what are the results? Today, the audience has lost. Quite a few have said 1 is true. Most probably, you saw that 1 is true and immediately pressed A. So, this tells you, you should read your questions carefully, because you were partially right. 1 is true, but it is not the only true answer. Several have said 2 is true, which is again a similar state where it is true, but it is not the best choice. And third place was the correct answer C, which is both are true. And roughly tied for the first place is D, which is also wrong. So, please read the alternatives carefully next time around. So, I hope audience will get back to its winning ways and stop losing. Now, let's come back to the current question. Please read the question while we enable the quiz. I will tell you when you can answer it. Now, the quiz is active. Please choose the alternatives here. I hope you have read the question carefully for this quiz. So, now let me explain how you would go around solving a question like this. Now, clearly the only case where they differ is if a value is null. So, this obvious way of answering this question what happens if r dot a is null. So, in the first case if r dot a is null what happens you will get less than 5 is unknown. In the second case what will happen r dot a less than 5 it is still unknown and r dot a is not null will turn out to be false. So, unknown and false is false. So, now if you just look at these two expressions it may look like they are different because one is evaluating to unknown and one is evaluating to false. But if this is all there is in the query that is the where clause is exactly this there is nothing more in the where clause what is the impact on the query. As long as the final result is unknown or false it does not matter whichever one it is the corresponding tuples from the from clause are going to get eliminated. So, if you take the query as a whole these two are actually equivalent. If you just look at the result of the where clause well they are not quite equivalent. So, now what are the options the two are equivalent I is actually what I intended because I intended to say this is the full query the result of the query. But I did not phrase the question properly therefore, that is not necessarily the correct answer. If you interpreted the question as if the final result of the query is it equivalent the answer is one which several people have chosen. So, that is a correct answer. Now the two is the two are equivalent only if r dot a is declared as not null that is certainly true whichever way you interpreted if it is declared as not null this case cannot happen. Therefore, everything will be fine and that is the majority choice which is in this case it is the correct answer. And the third choice is they are equivalent only if r dot a is null which is wrong and they are never equivalent is also wrong because we have just seen the cases where they are equivalent. So, the either if you chose either one or two that is an acceptable answer if you chose three or four that is it for that quiz let us move on. So, the next topic for this morning session is aggregate functions and SQL defines five basic aggregate functions although many implementations add a few more of their own the five basic ones are average min max sum and count. So, what is an aggregate function it aggregates a bunch of data. So, you have a set and actually an SQL a multi set which may have duplicates you have a multi set of values on that you compute some aggregates that is thick. So, here are a few queries which illustrate the use of the aggregate functions this is select average salary from instructor where department name is comm psi what is this doing first of all it is going to apply the from and the where clause. So, it is going to find instructors who are in comm psi now there is a whole number of these from these it is selecting average salary. So, it is going to get the salary values which is a multi set and then it is going to compute the average and return one single value which is the average salary. Now, that is the simple query it is an average or it could have been count min max whatever you want in place of average it will work just the same. Now, in this case we wanted the distinct values for the average to be correct, but there are cases where we do not want distinct values. So, take the next one which is find the total number of instructors who teach a course in the spring 2010 semester. So, how many distinct people are teaching courses? So, how do we know who is teaching courses? Well, the teacher's relation tells you who is teaching. So, I can apply a selection on the teacher's relation which says semester is spring year is 2010 and I will find who all are teaching courses. Now, if I do a count on this there is a problem though what if somebody is teaching 2 courses or 3 courses they will get content multiple times the question did not want that the specification was how many different people are teaching courses. So, how do I deal with it? I am going to say select count of distinct IDs. So, each of the aggregate operations has a distinct version where the keyword distinct is added inside the brackets. So, what is this doing? It is going to eliminate duplicate IDs. So, if a particular instructor is teaching 2 different sections or 2 different courses in a semester it does not matter the duplicate copies of ID are eliminated and then the count is applied. So, this is going to give me the correct count. There is another variant called count star which is useful in many situations says find the number of tuples in the course relation. Now, I do not have to explicitly list the attributes like which attribute do I choose I could have chosen the primary key attribute, but instead of dealing with that I can just say count star from course it counts how many tuples are there in course. I am going to come back a little bit later to how aggregate functions deal with nulls there are some complications there. So, I will come back to it later after covering aggregates ignoring null values. So, getting a single aggregate value is certainly very useful, but there are many cases where I want to get aggregates per group. So, for example, I want to find the average marks for each course or the average marks for each exam in a course or the CPI which is the total credits of each student. So, there are a number of queries like this which actually want to compute an aggregate per something and the way you do this in SQL is by introducing the group by clause in addition to the aggregate function. So, let us take a simple query find the aggregate average salary of instructors in each department. So, each department should clue you in that I need to do it separately for each department. And therefore, in the SQL query a clause is introduced which says group by department name. Now, what does this clause do? Well, the evaluation is done as follows the from clause relations are evaluated doing a cross product as required then the wire clause predicate if any is evaluated in this case there is not any. So, the from clause just have instructors. So, the result of the from and wire clause is a set of tuples. Next the group by is applied what does the group by do? It breaks up the set into groups based on which tuples have the same value for the attributes listed in the group by what is the attribute here department name. Therefore, we form groups if you see the first table here it has already been it is the instructor relation, but already broken into groups. The first group is biology there is just one instructor in biology. The second group is computer science there are three instructors here. Electric engineering has one and so forth physics has two. So, we have grouped it. Now, once you have done the grouping the aggregate functions are applied separately per group. So, in the first group there is just one row, but anyway the aggregate function is applied on that row the average salary happens to be 72000 which is obvious. And the result contains what it has department name plus the aggregate value. So, now you can include an attribute name here it is ok as long as either you are aggregating on it or that attribute is present in the group by list here. So, department name can be included what if instead of department name I said select let us say id comma average salary and group by department name there is a problem in a particular group there may be different values for id. So, if I say select id which value of id can I choose it cannot arbitrarily choose one id I mean there are five instructors in a department which of their id is chosen it does not make sense. So, that is actually disallowed in SQL you can only list an attribute if either it is there in the group by class over here. Therefore, we know its value is unique in a group or that attribute is inside an aggregate function. So, if there are many values they get aggregated together into a single value. So, moving on to this one we have compsci with three tuples. So, those three values and their average is 77,333 ignoring the decimal places electrical engineering is just one tuple 80,000 finance had two tuples whose average is 85,000 and so forth. So, we get all the departments here with their average salary well almost what we get is all departments which have instructors if a department does not have any instructor what happens well the group by was on instructor. Therefore, the department name will not appear here at all and the department name will not appear in this result. Now, if you do wish that department name to appear you have to do something more you have to rewrite the query somewhat differently we will see that later on. So, this query is an erroneous query which I told you on the last slide itself verbally, but here you can see it on the slide it says select department name id average salary from instructor group by department name as I said id is not unique in the group. Therefore, this query would give you a syntax error SQL would catch it at the before executing it. Now, this one more clause called having clause which is like the wire clause, but it is applied after the group by and after the aggregate values are available. So, over here this query computes the average salary and what it does in the having clause it says let us read the query let us read what the query is supposed to do find the names and average salaries of all departments whose average salary is greater than 42,000. So, here is a predicate on the result of an aggregation there are multiple ways of writing the query in SQL the simplest way is to introduce the having clause. And therefore, we say select department average salary from instructor group by department name. So, far the same as the previous query, but the extra condition was we only want department name whose average salary is greater than 42,000. So, we added having average salary greater than 42,000. So, to understand this the having clause can have aggregate functions the wire clause cannot it is not allowed to have aggregate functions that is the difference. The having clause is applied after grouping the wire clause is applied before forming the groups what I would like to do for questions and ok Vidya Pradeshan. Good morning sir. Sir, I have we have a small query on clickers. So, I would like to ask a question to clicker team most of the participants are not getting functionality of the clickers which button is to be press and when the timing is going to be start. If there are the multiple questions are there are there then how much time slot is given for the one questions and when to click the next button that that is not getting to our participants. Before I ask you to give the answer for a particular question I am displaying that question and asking you to answer that specific question and you get a time window I am announcing when you should start pressing your buttons and when you should stop when you can no longer press your button. So, that is done manually at this point. So, once I say you can answer the assumption is that once it has been enabled here all the clicker software at the local centers also gets that signal and starts accepting questions. Now it is possible that there is some delay in this thing going from here to the individual centers. So, what I have been doing is actually giving you about 5 seconds after we start the reception here before I tell students to answer the question. So, I hope that this 5 seconds is enough for all the centers to catch up and turn on their collection. I hope that answers your question. There is a question on chat which says how do we retain the duplicates in relational algebra. The answer is relational algebra could either be based on the set or the multi-site version and the like as a relational algebra is used essentially inside of database system is not usually exposed to users. But most systems which do this have multi-site version of operator and they also have a duplicate elimination operator. So, if you do not do anything if you just use the multi-site version the duplicates are retained. If you want to eliminate duplicates you use the duplicate elimination operator and that is what actually happens internal to the database system. Since this is not made available to programmers you cannot actually program on it. But if you wish to write queries in relational algebra and try it out you basically have to define first which version of the relational algebra you are using. If you are using the multi-site version you can introduce a new duplicate elimination operator which is often called delta D for duplicate elimination and Greek delta and that eliminates duplicates. I hope that answers the question. I just accepted Amrita column. We are able to see you. Amrita column, do you have a question? If you want to show the data in Malayalam, original language. How do you store it? That is a good question. How do you store text in non-ASCII that is in Indian languages in a database. Today the standard is Unicode. So now there are two parts. One is how to get the text in Unicode. There are many editors today which will let you type in Indian characters. So you do have websites which let you do that. You have other editors in Linux for example in Windows which let you enter text in Malayalam or Dev Nagri or any other Indian script and that can be copy paste. I am giving a very low level description. At the level of the database system a string can be in anything. So you can insert a statement in SQL where the string is in Unicode and as long as you have declared the type of that attribute to be n care or n wire care it will accept Unicode strings and store it in Unicode. So if you get the Unicode representation using whatever means from the input and you pass that Unicode representation on through SQL the database can store it. If you declare it as n care or n wire care or in Oracle you will have to declare it as wire care too. So if you do that you can store it. Now some versions of some databases do not support this unless you specify it as a database option somewhere. So if you run into trouble with this look up the manual to make sure your database has support for this turned on. These days it is turned on by default but some old versions may have the support turned off and you may have to do something in configuring your database to turn it on. I hope that answered your question. If there is another question please ask otherwise please tell us over. It is over. People on our high bandwidth server. So we have NIT Varangal you are coming up next. Sir I want to know what is the default behavior of nulls in ordering of the tuples? Can we change it? If so how? That is a good question. When you do order by what is the default order of nulls? Now the SQL language actually lets you specify whether null should go first or null should go last and there is a specific syntax for this. You can say nulls first or nulls last. And this is actually at the level of, SQL has this whole notion of collation and so for a particular type you can say this is the sort order and in this sort order null should go first or null should go last. So you can look up your manual to see how to control this behavior and again if you do not say anything what is the default? I am not sure if a database system stick to the SQL standard. So my suggestion is you try it out on your database system and find out what it is doing. If you want to control it you can do that. Again look up the syntax on your SQL systems manual. So that is not a very complete answer. I am basically telling you to go read the manual. But it is a good question. Do you have any follow up questions? How to get top end records from a relation? That is a good question. How do you find the top few records from a relation? And the answer to that question is there is something in the SQL standard which lets you do this. If I have time I will cover it later. But there is also what is implemented in many database systems which is a lot easier to use than the SQL standard. And this varies by database system. So in PostgreSQL for example, there is a limit clause which you can add on at the end of a query. You can say limit 10. Now again the exact syntax varies. So other databases may say stop after n rows or first n rows or whatever. So look up your manual to see exactly how to do it in your database. But since we are using PostgreSQL in this course, the clause is limit and whatever number you want to specify. Now the question is which, supposing I say limit 10, which 10 rows are going to be output? If I don't have an order by clause, that depends on what ordering the query evaluation engine chose for evaluating that particular query. Now that is not very satisfactory because the 10 answers which I get may vary depending on what the optimizer did. So if I want to get something deterministic, I should really combine this with an order by clause. I will say order by something and then top end. So if I say order by name and then limit 10, I am going to get the first 10 names in ascending order. Now SQL standard actually has a much more powerful way of specifying how to get the top end rows. Here the limit n is at the end of a query. So it is like ordering. It only applies to what the user sees. The SQL standard has other syntax. If we have time, I will get to it, which actually lets you do this inside of a query, get that result and then use it in a later part of the query. So I will deal with that probably tomorrow if you have time. That is a good question. Are there any other questions? For a department, if I want at least one course registered department, I will be joining department and course. If I want a department with more than two courses or three courses, how would be the query? I am going to defer this to later because we are going to cover sub-queries in the second part and with sub-queries, it is easy to write a query like this. So I will cover this in the second half.