 So where we left off yesterday was, I hope all of you can see the new quiz question. Again we are not able to take it on the quiz presumably because there is some glitch. But I will let you read the question and figure out the answer. The question is the query, select name, course ID from instructor, teacher, match, which of the following. If you look at this, what you probably intended or you meaning, whoever wrote this, probably intended was to find out which instructor teaches which course. But there is no join condition here. And as a result, the query returns all pairs of instructor teacher's tuples, which is not what was probably intended. So how do you do that? We already, how do you write the query properly? There are several ways. We already saw one way, which was to add a wire clause with conditions matching all the corresponding attributes. Now what are the corresponding attributes between instructor and teacher's? Teacher's is something which says, who is teaching which section and what is the common attribute between the two, it would be the ID of the instructor. So you could have the previous query and add down here, where instructor.id equal to teacher's.id and that would be the right answer. However, another way of writing it is using natural join. Now we saw natural join in the context of relational algebra. In the context of SQL, it is exactly the same thing. It takes the two tables and finds which columns have the same name, equates them, in this case ID. And if you see the remaining column, name, department, name, salary are from instructor, while ID, course ID, section ID, semester year are from teacher's. So the table has all the columns of the two input table, but the common column which is ID is not repeated. It occurs only once over here. So that's the natural join. Now in SQL, the operations such as natural join and then a few more which we will see later are done in a special way. They are written in the from clause. So we are saying from instructor natural join teacher's. You can parenthesize this to do a series of joins. So somebody was asking what's the use of relational algebra in the context of SQL. If you see here, you're directly writing a join operation in SQL. You're essentially doing relational algebra inside of SQL, at least for joins. And this slide is basically the same thing as the previous one. It shows the query with the condition instructor ID equal to teacher's.ID, which is equivalent to this one, as long as they select the same list of attributes. Now there is a risk to natural join. And this can be a little tricky and things which seem to work may not actually be correct. And the reason is as following. Very often, relations have attributes name which are the same, but are really totally different things. For example, many relations might have a attribute called remark. If you do a natural join of two relations which have a remark column, you're going to insist that the remark values be the same, which is of course meaningless. In our context, if we had a three way join, instructor join teacher, join core, you'd think that this is going to give us information about instructor, the courses that they teach. And the thing is in the teacher's relation, there is no course title. There's a course ID, but not a course title. And therefore, a natural join with course will additionally ensure that the course ID will have an associated title coming from the course relation. So, this looks good, but there is a subtle problem. And the problem is that instructor had a department name, course had a department name also. And if you did this three way natural join, what happens is it not only sets instructor ID equal to teacher's ID and teacher's dot course ID equal to course dot course ID, but it also makes course dot department name equal to instructor dot department name. So, we have a problem. What is this resultant? Are we going to get extra tuples? Are we going to lose tuples? And the answer is we are going to lose tuples. Supposing I had an instructor who is from the computer science department, but taught an EE course, that instructor should appear with that course. But what happens here? The course department name is EE, the instructor department name is CS, and this condition removes that particular tuple from the result. And you might think, well, testing should have caught this, you know. Whenever you write a query, you do test it out on some sample data. But as it turns out, most of the time people teach courses from their own department. Only occasionally do people go and teach a course from another department. In our test data, if we did not have such a situation, where somebody from CS taught a course in EE, then this query would give exactly the right answer. You tested it, it gives the right answer, you're happy with it. But unfortunately, it's not a correct query. Tomorrow, when the data changes, you will realize, or you may not even realize, you may just lose it quietly, you may lose data, and the query is wrong. What do you do about this kind of situation? The first answer is when you write queries with natural join, be very careful. Don't abuse natural join. Check carefully if there are shared columns, and check if you intended to equate the shared columns. That's one option. But this is still vulnerable to tomorrow somebody going and adding an attribute to one of the relations. Let's say, a remark's attribute was added to both the relations. Now, your query was correct earlier, but this new column which was added, suddenly results in your query becoming wrong, because of that, one extra column added to each of the relations. So, that is also risky. So, the safe solution is shown down here. Instead of natural join, we say from some glitch here. This should not be natural join teacher. It should be, sorry, this one is okay. Look down here. Here, we had natural join course, and the natural join course is replaced by join course using course ID. This part is the same. Instructor natural join teacher is the same as before. That's okay. The problem was, when we joined course, we equated department name. Here, what you're saying is, instead of natural join, we say join using course ID. So, this only equates course ID. It does not equate department name. And this is safe, because we know what columns were supposed to be equated. We do this. Tomorrow, if somebody adds a remarks column to course, and there's a remarks column in, say, teachers, no problem. They won't be equated. Of course, you could have also written it like this. Instructor natural join course. Instructor natural join teacher, comma course. And then this condition is in the where clause. So, if you see, putting it in the where clause is more or less equivalent to using clause down here. There is a subtle difference. Over here, course ID appears only once. Over here, there is a course dot course ID and teachers dot course ID. So, it would appear twice with different names. Of course, in both cases, we are selecting name, title. So, there is no difference in the final result. So, there is another interesting related point. So, if you are a teacher and you give assignment like this, how do you know your students have written a query correctly? People will write queries in all possible ways. Somebody will write this query. Somebody will write the second one. Somebody will write the third one. And of these, two are correct. One is wrong. So, how do you correct the assignment? How do you find out what are all the correct queries? One way is to read each one carefully and check it. And use your logic to see if a particular thing is correct or not. But it is very easy to make mistakes. A second way is to create a number of data sets, each of which is designed to catch certain errors. And this is something which is often done. It is done manually. It is done a little haphazardly. And the idea is you create several data sets. If a query is wrong, hopefully on at least one of those data sets, the given query, let us say this one, which is incorrect. And the correct query, let us say this one, these two should give different results. In this case, what is such a data set? It is a data set where there is an instructor teaching a course in a different department. But how do you figure out what all such data sets to create? And that's a hard problem. It's possible to go over a number of potential errors which students may make. What kind of errors would students make? Well, we know that they forget join conditions. They add extra join conditions. They may use a natural join which adds extra unnecessary conditions, like in this case. And then there are many other such errors. What you could do is create a list of such errors and try to create data sets which will catch each of these errors. So this was kind of the accepted way of doing things. But is it necessarily good enough? How many of us are able to do this and find the right thing? How much time does it take? This, it turns out, is a very interesting research problem. A lot of you out there are looking for research problems. There are many research problems staring us in the face. It's just that we don't realize it. And sometimes we realize it is a problem, but don't know how to solve it. So when you have both sides coming together, we realize there's a problem, and we come up with a solution for it. Well, then you have a piece of research. So in this particular case, this came up a few years ago when I was teaching this course and realized we had a problem on our hand. We would like to catch errors. So what did we do? We actually built a tool which would generate data sets. And we have a couple of publications on this tool already. The tool is called xData. I wish we could use this tool to generate data sets for this course. So in fact, it's easy enough to use the tool to generate data sets. The problem is that it generates lots of data sets, and then having you manually run each of those data sets can be pretty time consuming. So we have not actually put it up as of now. But what we are doing is we are extending this tool to actually just submit a correct query and then another query. And it will generate all appropriate data sets for the correct query. And then when you feed in any student query, it will run it on all the data sets and tell you if there's a problem. So this tool is still under development. It catches some mistakes but not others. In particular, this error which we have shown here is not yet one of those that it catches. It's an ongoing process. It takes time to build these tools. It takes years, in particular with student support and students doing it part-time for their thesis. It takes time. But we will get there, and eventually I will release this tool to you. Maybe six months down from now, I will make this tool available for people to try out. Okay, so that's with respect to errors with natural join. Now, maybe we can take a few questions from remote center. Let's have a little bit of interaction. I see a lot of people have raised a question. So I'll let somebody here. Okay, Walton, you're on. Please go ahead and ask questions. Good morning, sir. My question is, yesterday, while discussing the foreign keys concept, you told that foreign keys can be null. But what I think is that foreign keys must be the primary keys in the referred relation, right? Correct. So if the primary key, foreign key is the primary key in the referred relation, how it can be a null? That's a good question. In the referred relation, it's a primary key. There, it cannot be null. The null is in the referencing relation. So let's say that a department name of student, the student is the referencing relation. You can say department name to null for the student. That's not a problem. You can't say department name to null in the department relation. So what this means is that null doesn't actually appear there. That's okay. It's allowed by SQL. The foreign key constraint is not enforced if attribute value is null. Does that answer your question? Thank you, sir. Thank you. Let's go to some other center. PSD College. PSD. Coimbatore. PSD, please go ahead. Sir, yesterday you were talking about relational calculus, relational algebra. Actually, I finished college six years back and during theory classes, we learned all of that. And after that, I've been doing a lot of data mining. And after that, I've been doing a lot of data mining and all that stuff. But I've never used algebra and calculus. But intuitively, we try to frame queries and use it. So calculus and relational algebra is it only to understand how databases implement things inside? That's a good question. So first of all, both the algebra and the calculus can actually be used to write queries. And in the pure form, they're not very convenient for users to write. I mean, imagine trying to write sigma on a keyboard. You don't really have a key for sigma. So basically, you can build languages around these which add syntactic sugar, but underneath it what you have is relational algebra or calculus. And SQL is really based on relational algebra at some level. It adds syntactic sugar, but it is really relational algebra. Similarly, for relational calculus, you have languages like QBE, which was very popular at one time, although of late, not being used that much. But it is an important language. And you do see it in certain specialized settings. And then you have data log and other languages which are based on calculus. So they are important. And practically, languages built around these add some syntactic sugar to make it easier to use. But the underlying concepts are algebra and calculus. Now let me take a few questions that came over the chat. The first question on chat is, student relation has ID attribute. An instructor relation also has ID attribute. Is it proper naming convention? So this is really a question about database design. And maybe we will talk about it a little bit more later. But since the question has been asked, how do you name relations? How do you name attributes of relations? There are many possible conventions out there. For example, we could have called the relation student or students instructor, instructor. How do we decide to call it ID versus instructor ID and student ID? Should we give it different names? And there are different schools of thought. One school of thought is an ID is an ID. And maybe there is a person relation on top which has ID. And instructor and student are specializations of the person relation. So then the ID field is inherited. And both of them have the same ID value. That's one way to look at it. Even though in our schema, we don't actually have a person relation. Another school says that they are different. If you equate them, it's a mistake. Therefore you should give different names. So for example, the department name in course and the department name in instructor should have been perhaps called in depth and course depth. And then the natural join would not have got into trouble. On the other hand, the natural join between instructor and department would not work. Because now in department, it's depth. In instructor, it is in depth. So there is a trade-off there. So in any design issue like this, there are trade-offs. There's no one answer. So let's leave it at that. One more question. What is query optimization? I've been mentioning this term a few times. Query optimization basically says, what is the best way to evaluate a given query? It turns out there are many, many different ways of writing a query. We already saw that in SQL, even with very simple queries, there are three, four ways to write the query. Now in the relational algebra, similarly, there are many ways to write the same query. And then correspondingly, there are many ways to execute that query. The thing is that these different ways may have different costs. Some may be cheap, some may be expensive. The job of the query optimizer is to find out what is the cheapest way to execute a given query. Now how does it do this? It needs a way to estimate the cost of various ways of executing a query. And then it needs to figure out what are the different ways, what are their costs, and then pick the cheapest. So we're going to look at this in more detail when we look at internals. But for the moment, this is all that you need to understand about optimization. So coming back, SQL has many, many features. One of the features is renaming. And renaming is done using an as clause. It can be used to rename columns or to rename relations. So the first example on the slide here, rename attribute salary divided by 12. It doesn't have a name because it's an expression. It gives it a name monthly salary. So initially the salary is an annual salary divided by 12 to get a monthly salary. The next query actually has two copies of the instructor relation. And renames them one as t and one as s. What is this query doing? This query is computing a join of two copies of instructor. This is called a self-join. And what is the join condition? So far we have focused on equality, r dot a equal to s dot b. But here is a case where the join condition is an inequality. And the join condition is t dot salve greater than s dot salve. What's going to happen here? Conceptually, every pair of tuples in instructor, think of it as having two copies of instructor. Here is one and here is the other. And you take every tuple from here and every tuple in the other one and match it up. And if the salve of the first one is greater than the salve of the second one, that passes the test. Well, then there's one more end condition. And s dot department a equal to comp sign. So the second condition is that the second one, instructor should be in comp sign. If this happens, we select t dot name. So what are we doing here? We're going to select people whose salve is greater than the salve of at least one person in comp sign. Now what happens if somebody's salve is greater than the salve of five different people in comp sign? The same instructor will match five different instructors in comp sign and will appear five times. So to avoid duplicates we are saying, select distinct t dot name. So each name comes only one. Now note that the keyword as is optional. So you can just write instructor t. And in fact, if you use oracle, you should not use as. You should only use this instructor t, not instructor as. So this is actually a very inefficient way of finding people who have a higher salve than some instructor in comp sign. There are much more efficient ways. We'll come to it later. Now SQL has a lot of string operation. There is a like operator, which does simple pattern matching. It has a percent and underscore symbol. The percent matches any substring, while underscore matches any single character. So for example here, select name from instructor where name like percent d a r percent would match Sudarshan. It would also match any other name with dar in the middle. Now coming on to the bottom of the slide, like is just one example of string operation. More common operations involve concatenating strings, converting from upper to lower case, extracting substrings and so forth. For all of these, SQL has three defined functions. For example, the two vertical bars denotes concatenation. Upper and lower, there are functions to do this. You can say upper of name or two upper depending on the database you're using. Substrings, string length, there are functions which are somewhat database specific. So look it up on your database. So now here is next feature, which is ordering of tuples. If you recall, we saw yesterday that in the relational model, tuples are not ordered. The model does not specify an order. And this is actually very nice, because the query optimizer is free to choose any order in order to do something more efficiently. However, when you want to print answers out to the user, there's often a need to have it in some sorted order. So in SQL, you can add an order by clause at the bottom, order by name. And you can specify descending, order by name, descending. You can also order by department name comma name. So when two people have the same department name, they are further sorted by name. So that's an order by clause. So let's move on to the next slide. Now we come to something new, which is duplicates. So the pure relational algebra does not have duplicates. But SQL does. And why? Why does SQL allow duplicates in relation? It's mostly an issue of convenience, at least in the context of duplicates in a relation, or duplicate names in an output. It's mostly convenience in the sense that removing duplicates may have a cause. And sometimes people don't care. They may be willing to see duplicates. But more often, there may not actually be duplicates, but SQL doesn't know. At least the early implementation of SQL could not figure out that there are no duplicates. They were not smart enough to figure it out. So what do they do? They do duplicate elimination each time and spend a lot of effort to know a way. They still get no duplicates after wasting a lot of time, which they could have got right in the beginning. So from an efficiency viewpoint, the SQL people said, let it be. We won't do duplicate elimination. So all this was fine. But then people soon realized that depending on what plan the SQL engine chose, it might give two duplicates or four duplicates or something else. You don't know. And the same query, if you run it tomorrow with a different size of the database, you might get a different number of duplicates for the same initial data. This can get very confusing. So at some point, they said, let's print down how many duplicates will be generated. While allowing the execution to be efficient, we will avoid duplicate elimination. But let's say what is the correct number of duplicates to be generated? And any implementation must generate this number. So it is deterministic. So this was important and got included in the SQL standard after some time, quite early on actually. So let's see what happens with duplicates. Where do duplicates come from? Sometimes from the initial relation, there may be duplicates. In this case, we have two relations, R1 and R2, with the schema AB and just C here. You see R2, there are already duplicates initially. There are two copies of three. Now, if I do project on B of R1, what happens? Initially, R1 didn't have duplicates, but when I projected, the result is AA. It has duplicates. And how many duplicates are present? Well, the project operation just does a one-to-one mapping. It does not remove duplicates. It just takes each tuple, outputs the required column, and you will get exactly the same number of tuples as in the input. So that's how the project operator will define. It just doesn't do duplicate elimination. Now here is another operation, the select operation. How many duplicates does it have? Well, the select operation is also very simple. It just goes over each tuple in the input. It checks the condition. If it is satisfied, it outputs it. Otherwise, it throws it out. So what happens here? Sigma C greater than 2 of R2. Let's look at the first tuple. C is 2. Is it greater than 2? No. Throw it out. The next two are both greater than 2 for their output. So you have two copies. So far, so good. The multi-site versions of project and select are very easy to define. Now finally, let's look at the cross product. How many copies does it have? Let's take this query. Project B of R1 cross product R2. Project B of R1 is this one with two copies away. R2 is here, two copies of three and one copy of two. How many duplicates are there? Well, the Cartesian product operator again does not do remove duplicates. It simply pairs every tuple on the left side with every tuple on the right side, as simple as that. So what happens here? This A gets mapped with two, three, and three. So you have A, well actually I have shown it differently. The two copies of A here are paired with two here. The two copies of A here are paired with three here, and the two copies of A here are paired with three here. So we get A2, A2, A3, A3, and again A3, A3. So how many duplicates are here? Four. How did we come up with that? There were two copies here of A, and two copies of three here. So two into two, four copies of A3 in the results. So this tells us two things. A, I can take the Cartesian product without worrying about duplicates, I will get the right answer. B, another alternative to evaluating this is to keep counts. Instead of keeping the duplicates, in cases where there are many duplicates, an alternative is simply to keep the count. I will say that A occurs twice here, three occurs twice here, and then when I match A and three in this Cartesian product, I multiply two and two and get four. So A3 appears four times. So internally the database is free to do any of these, as it chooses. The SQL semantics merely says how many times it should appear. So that is relational algebra, this is not SQL. What about in the SQL language? It's based on the same thing. So the slide first formalizes the multi-site version of select, project and cross product. I am going to skip that, because I just explained it to you using an example. But if you are not clear using the example, feel free to read this. Coming to SQL's duplicate semantics, let's take the simplest form of SQL query, select some list of attributes from a list of relations where some predicates. This is defined as take the cross product with the multi-site cross product, because that can be duplicates. Do the selection with the multi-site selection, that is it doesn't remove duplicates either. And finally do the projection with multi-site projection, i.e. don't remove any duplicates. So that is how SQL duplicates mantis is defined for the simplest case. For other constructs with join and so on, they are actually defined in a similar way, we won't go over the details. I hope so much is clear. And this does matter, because you know you want to see the right number of duplicates sometimes, you want duplicates, you don't want duplicates, you want the right number of duplicates. Now this last part, I want duplicates, but I want the right number, where does it happen? There are many places where it happens. For example, when we have aggregates which is coming up, we need duplicates in the correct numbers and we will see that. So the next operation, set of operation, actually set operation, union intersect and set difference which is except, the union here find courses that ran in, fall 2009 or spring 2010. This first query is select course ID from section, where some is fall, year is 2009. These are all the courses which ran in fall 2009. This one is all the courses that ran in spring 2010 and this is the union. Note that this is SQL, but it could equally well be relational algebra. We are taking a relation and uniting it with another relation. The next one is also SQL, but this time find courses that ran in both and instead of union we use intersect. I have just used the letters A and B here to avoid repeating this long query here. A denotes this first one, B denotes the second one. And the last one is find courses that ran in fall, but not in spring in this year. So that becomes A except B. If you are using oracle, note that instead of except, oracle uses minus, so you would say A minus B. It is the same thing. Now it turns out interestingly that unlike the earlier operation, union all intersection except actually removed duplicates in SQL. I am not quite sure why that was the standard, but that's how it is. However, there are also multi set versions of each of these. And you can use those by saying union all, intersect all and except all. So how do you define union all? Union all is actually very, very easy. Union all simply concatenates one relation with other. And what about the duplicates? Well, if a particular tuple appeared m times in r and appeared n times in s. In r union all s, it would simply appear m plus m. That's if you concatenate the relations, this is what will happen naturally. What about the intersect? Here it's less clear what it should be. And the way SQL defines it is that it should be the minimum of the number of times it appears in r and in s. In other words, if a tuple appears four times in r and two times in s. In r intersect s, it will appear minimum of four comma two, which is two times. And except all, similarly, is defined as follows. What you do is go over the first relation and remove a tuple which is found in the second relation. Remove one copy. Now, if the same tuple appears again in the second relation and it is also there again in the first relation, remove the second copy and so forth. So, if you keep doing this, what do you end up with? If the number of times appeared in r is less than or equal to the number of times it appeared in s, it would get removed completely. Otherwise, some number of copies which is m minus n would be left behind. So, what is, but m minus n may be negative if m is less than n. So, what we do is max of zero comma m minus n. Let me repeat this. Supposing it appeared twice in r and four times in s, two minus four is minus two. The tuple cannot appear minus two times. So, max of zero comma minus two will give us zero. That is, it does not appear in r except all x. However, if it appeared four times in r and two times in s, what happens? Four minus two is two. It appears max of zero comma two, which is two times in r except all x. So, that is a very quick overview of duplicates set of this. Hence out that over here m plus n is actually issue of efficiency and convenience. But, theoreticians will point out that there are some theoretical issues with using m plus n. If this were min, logically this should have been max, max of m comma n. Hence out max of m comma n is more expensive to compute than m plus n, because we actually have to look for duplicates and keep whichever is the higher number. Whereas here we can just calculate. So, again this is the issue of efficiency not one of theoretical cleaning. This may be a good time to take some questions. So, remote centers, please raise your flag if you have questions. Sir, how we can calculate this equation and covariance matrix or variance and covariance? Variance and covariance. So, you are asking for more complex statistical functions in SQL. So, the basic SQL standard does not define these things. But, most databases do support a number of other statistical functions including things like this variance, covariance and many other statistical functions. So, do look up your SQL manual and you will find this. So, you can have a set of values, you can have a set of pairs or various other situations and apply aggregate functions on these sets including these statistical functions. Let us move on to another college. We have Kurukshetra. My question is how to implement the divide operation in the SQL? That is a good question. I am going to come up to this in a little bit. I have some slides on it. So, I will defy your question and I will answer this question coming up in a little bit. Next one is NPR in Natham, Savinath. Hello sir, in our remote center we are asking the question about what is the need for triple equal symbol here. Triple equal symbol that we mentioned the last day ago. What does the triple equals symbol mean? That is a good question. It is not actually part of SQL or anything. That is notation which is used in mathematics which says it is equivalent. So, I just used it in that sense that it is equivalent. It is nothing to do with databases per se. So, triple equals means equivalent. What does equivalent mean? It depends on the context. It has the same effect or it is basically the same thing although there may be some minor differences. You know at St. Francis Borivali. St. Francis, if you have a question please go ahead. Hello, good morning sir. My resource center ID is 1067. I have a question in regards to outer joints. It is a topic which you had covered yesterday. If you could please explain that concept again and give an idea of some of its applications. That will be great. Yeah. So, outer joints is again a topic which I am going to cover in the next chapter. Chapter 4, we will get to it today. So, please hold off that question. I will cover it. Next is Charu sir in Nadia Gujarat. Please go ahead. Good morning sir. Sir, I have a question regarding does our database compiler converts our high level language syntax into relational algebra before execution? Yes, it into some form of relational. It may not be exactly the form which we are using. There are usually some extra operators, some differences, but conceptually yes it does. And then there is an execution level. Again, we will see this in the database internal segment next week. Okay. Sir, one more question is there. Whether we have to consider SQL is a combination of procedural and non-procedural language because sometimes we are specifying what we want, sometimes we are specifying what and how exactly we want. So, is SQL a pure declarative or is it a mix of procedural and declarative? So, SQL actually has a complete procedural sub-language which we would not go into too much detail in this course, but it has imperative, it has stored procedures with a function. Those functions can have loop, if then else and so forth. And that is purely procedural. It has trigger whose bodies can similarly have if then else and a sequencing of statement which is also purely procedural. So, SQL has procedural sub-elements, but the core part of SQL select from where is more declarative and it is not really procedural. You can argue about what is declarative, what is procedural at some level. You know, is relational algebra declarative or is it procedural? So, you can argue both ways, but I would say that relational algebra is more declarative than it is procedural. So, some people would say calculus is purely declarative because it says nothing about how to execute it. Relational algebra actually says do this then do that. So, it is a little bit procedural. So, yes, but there is some truth to that. So, where do you put the cut-off? It is not clear, but I would say that as long as it is very simple, this algebra it is pretty much declarative. Whereas, if you are writing loops and conditions and iteration and other imperative constructs then it is procedural. So, the core SQL is declarative. Sir, so that means we have to consider that initially all these database operators are defining mathematical form and then after it is implemented into some high level language called SQL, MySQL, SQL server and all these things. So, I think the base is relational algebra and calculus. Yes, you could say that the formal basis for all of these is the relational algebra and then the whole system is built on top of this formal basis. Let us move to another centre. We have Sarvajanik in Surat. My question is regarding order by. My question is which algorithm it uses for sorting. Means when we have got very large number of tuppers to retrieve, is it better to get it default values and using some good sorting algorithm in our program or using order by? So, the question is if you want to sort it, is it better to use the SQL's order by clause or use your own sort function? And the answer is it is almost always better to do it in SQL. A, because SQL can handle even very large relations. Supposing you want to sort a relation which is extremely large and doesn't fit in your memory, you are not going to be able to do it easily. It can be done, but your effect implementing what SQL already has built in. So, you really should not be doing it in your program. Leave it to SQL to do it. And how does SQL do it? We are going to see it in the next week's internal segment. We will talk about sorting. Okay. Thank you, sir. Sastra University, Sanjavur. My question is suppose if any field is declared as a primary key. So, it means that it cannot have null constraints, right? So, what is the need for to declare that it has to be have a null word there? It cannot have null. What is the use of it? Yeah. So, if you are declaring something as primary key, there is absolutely no need to declare it as not null. It's automatic. But there are fields which are not primary key like name which you might want to have not null. You cannot have an employee without a name. So, that's why we want a not null constraint which is an addition to the primary key constraint. Does that answer your question or was it something else? Yes, sir. One more question, sir. Using natural join, using keyword, right? Using natural join, using keyword. That's an alternative to plain natural join. Yes. Yeah, fine. You can lost some information because when you have some conceptual columns are matched, right? Two columns are matched, then using keyword can have some wrong information. That is your loss of information will be there. To overcome this, I can use either using keyword or I can use equijoin. But using keyword can be used when the two fields are same. Yeah. Suppose my two fields are different, how the using keyword should be used? That's a good question. So, the question is when can you use the using keyword? It makes sense to use the using keyword only if the fields are exactly the same name in both the relations. If they have different names, don't use it. You can use the normal from r, s, where condition or there is an alternative which is called on condition which we will be coming to later on. So, I'll discuss that syntax later. We'll move on to the last center for this round of questions. So, Sardar Vallabhai. My question is we have just learned about union intersect and accept and union all intersect all and accept all. So, can you give some example to be more clarified the differentiation between these two? So, I think I'll disconnect now and go back to the union all slide to explain that and then I'll continue on with further slide. So, union all retains duplicate, union removes duplicate. So, why would you want to retain duplicate? So, this slide is kind of abstract. It does not give you an actual concrete example. But supposing I have a situation where I want to know how many times the course is offered across two semesters. So, I may have a query which finds how many times how many sections of the course are there that is I have let me write it on the white board here. So, if I want to find out what all sections of the course run in a particular semester, I can say select course ID from section where let us say year equal to 2009. Now, I also want to know how many run in 2010. I am not here done counting on this, but the goal is to do counting. So, select similarly course ID from section where year equal to say 2010. I hope you can read this may not be very clear. So, now if I do a union of these two queries, each course will appear exactly once, but if I do union all depending on how many sections of the course there were in these two years that course ID will appear in appropriate number of times. Now, from this I can now apply an aggregate function on top of this and how to do that we are going to come to the SQL syntax for this. But think of this whole thing as a relation. On this relation I can apply an aggregate function. We know how to do it in relation algebra and SQL syntax also we will be coming up with very short use. So, this is an instance where a union all is used. But if I do not want duplicates then I would simply use union. I hope that answers that question. Let me go back to the slides and continue. The next topic is null value which was a lot of interesting issues in SQL. We saw the motivation for null values earlier. It means we do not know what the value there may or may not actually be a value in that spot. Now, what do we do when a null value is used in an expression say 5 plus null. Now, nobody is going to write 5 plus null. But if I say r dot a plus 5 and r dot a happens to be null. What we are actually evaluating is 5 plus null. So, what is the result of 5 plus null? We do not know. If null was value 6, 5 plus null will be 11. If it is 3 it would be 8, but we do not know. So, the best we can do is 5 plus null returns null. So, any arithmetic expression involving null is null. Similarly, string expression and so on. Now, how do you check if a value is null? You might be tempted to say salary equal to null. Unfortunately, anything which is equated to null also returns cannot succeed, cannot be true. So, equal to null will always fail and there are some issues there which are coming up in just a moment. But before that, since it does not make sense to say equal to null, the structural provides a construct which says is null. So, if I want to know which all instructors have a null salary, I can say select name from instructor where salary is null. Equal to null would return nothing. It would be empty. But exactly how does equal to null work? What exactly happens? How does any comparison with null what happens? And the answer is shown here. Any comparison with null returns unknown. Now, what is unknown? It is a new value. This is called three valued logic. It is not true. It is not false. It is a third true value, truth value which is unknown. Now, to see the motivation for having a third truth value, consider the following two ways of writing something. I can say r dot a less than 10 or I can say not r dot a greater than equal to 10. So, if you apply basic logic, you can say that these two should be equivalent. So, what does it matter which way I write it? But now, supposing I have r dot a equal to null. Now, is r dot a less than 10? It is not. So, it cannot be true. Supposing I say it is false. Well, fine. r dot a less than 10 is false. That is acceptable. But now, let us look at this side. r dot a is still null. Is it greater than equal to 10? Supposing I say it is false. What is not a false? It is true. So, what have we just done? We have found this side is true and this side is false on r dot a equal to null. We have a problem. These two ought to be the same. But when we decided that a comparison with null is false, we have landed ourselves in a soup. If we say it is false, we are in trouble. Obviously, we cannot say it is true. So, we are in a soup. So, what to do? And the answer to that is to introduce a new value unknown. So, in what happens is, in this case, r dot a less than 10. In reality, I do not know what is r dot a. It could be 5. It could be 15. If it is 5, less than 10 is true. If it is 15, it is false. I do not know. So, I am going to introduce a new truth value unknown. So, r dot a less than 10 is unknown. How about on this side? r dot a greater than equal to 10, again it is unknown. But now, what is not of unknown? Can I say not of unknown is true? I cannot say it. Unknown could be true or false. And then, not would be false or true. Therefore, I will say not of unknown is unknown. So, what is the net result? This whole thing gives unknown as does this. So, both give the same value which is unknown. That is the motivation for introducing a new truth value unknown. So, let us push this truth value unknown and look deeper into it. So, here are some more examples. 5 less than null is unknown. Null not equal to null is also unknown. What about null equal to null? Is it true or false? You cannot say. Null means I do not know the value. It could be true. If they may have the same value, they may have different value. So, for both of these the result is unknown. Although SQL has some funny treatment of nulls with aggregates. We will come to that later on. So, coming back we have unknown as a new truth value. And I already told you that unknown not of unknown is unknown. That is down here. What about other combinations of unknown with true and false with and or? So, let us take the first one. If I have something to be true, I say that or unknown. What is the result? Or we will return true regardless of whether this was false or true. Unknown could be false, it could be true. False or true is true, true or true is true. In both cases returns true. So, the net result is true. So, unknown or true is true. How about unknown or false? Well, if unknown were true, true or false would be false. If it is false, it would be false. True or false would be true, false or false would be false. So, it could be either way for the result is unknown. Similarly, unknown or unknown is unknown. And for and? True and unknown. Well, now it depends. If unknown is true, true and true is true. If unknown is false, true and false is false. So, I do not know. Therefore, true and unknown is unknown. How about false and unknown? Well, here regardless of whether this is true or false, the result of false with anything is false. So, I can say this is false. And lastly, unknown and unknown. Again, I do not know what it is. It is unknown. So, this completes the truth table for the unknown value. And what we have just defined is three-valued logic with unknown. So, three-valued logic has an old history. It much predates it well. It dates back to the early work on mathematical logic. And people used it in those days too to deal with situations where things are unknown. And that turned out to be a nice match for its dual unknown. Now, so far so good. When we are in a where clause, we can push things through with the three-valued logic. But what about the final result? My query was select something from some relation where a predicate. Now, supposing that predicate in the where clause returns unknown. Should that final tuple be output or not? So, let us take a very simple example. Select names of instructors where salve is greater than 100,000. Let us say a particular instructor salve is null. I do not know whether it is greater than 100,000. Should I output this instructor's name or not? Ideally, I should say that this name may or may not be present in the result. But carrying this kind of uncertain tuple in a result is a big thing. It complicates life greatly. And SQL made the choice not to do this. And said that in the final result of a where clause, unknown is treated as false. We carry unknown through as far as we can. But finally, when we have to decide whether a tuple is to be output or not? We output it only if the predicate is true. If the predicate is unknown, we treat it as equivalent to false and drop that tuple. So, that is the final semantics of unknown in the where clause. But in subparts of the where clause unknown is carried through. When the final result of the where clause is available, unknown is treated as false. So, let me just start on aggregate before the break. And after the break, we will continue and wrap up aggregate. We already saw aggregates in the context of relational algebra. So, you are familiar with average min, max, sum, count. And as somebody asked, there are many more aggregates in most SQL implementations. We have variance, median, mode, covariance, a whole bunch of other functions. So, what is the syntax for writing aggregates in SQL? If you do not want to group by anything, the syntax is simple. In the select clause, we just say the aggregate function on an attribute. So, the first one is select average salve from instructor where department is equal to comps size. This is the average salve of instructors in the computer science department as you would expect. The next one shows a new feature. It says find the total number of instructors who teach a course in spring 20 times. Read the question carefully. It does not say how many courses they taught. An instructor may have taught two courses, may have taught one course, may have taught five courses. It does not matter. I want to know how many different instructors taught courses. So, what you can do is the following. So, first of all, I want to know instructors who taught a course. So, that information is there in the teacher's relations. And the teacher's relation has instructor ID. So, select something from teachers where semester is spring and year is 20 times. But now, I do not want to double count an instructor. They taught two courses or two sections of a course. They should be counted only one. So, I am going to add the distinct class inside of the aggregature. Select count of distinct ID. So, that will count each ID exactly one, even if it appears five times. And finally, there is a count star which is, I do not really care about a specific attribute. I want to know how many tuples there are overall in something. So, for example, how many courses are there overall is simply select count star from course. So, this is the basic aggregate. And we already saw an example of group by in relation algebra. How do you do this in SQL? Well, here is the same example we saw earlier, but this time in SQL. We wanted to find the average salary in each department separately. So, we said we wanted to group instructor by department. So, in SQL, select something from instructor, group by department. So, take this table here. We have sorted it by department and grouped it. Now, within each group, we can output something. So, for each group, we are outputting department name and average salary. So, note that inside a group, department name is a fixed value. Because the group is defined by department. I can output that value here. If I output name, it would be incorrect. Because within a group, this group comm sci has three different names. Which name do I pick if I say name? I cannot do that. But department name can be output. Now, in comm sci, there are three different salaries. But I have said average of the salary. And so, the average of these three salaries is going to be output. So, the result is here, biology is 7200. Comm sci average of these three, which is 7733 and so forth. That is the result of this particular SQL query. Now, as we saw in relation algebra, a department without an instructor will not appear in the result. Because there is no tuple in instructor. You only get groups which actually appear in the relation. Now, if you want departments to appear, even if there is no instructor, there are different ways of doing it. We are going to see it later when we look at outer joints. Even before that, we will look at subquery. This is an example of an erroneous query. I said select department name ID average salary from instructor group by department. As we saw in a particular department, there may be many different IDs. Which one do you output? You cannot pick any one of them. So, SQL declares this as a syntax error and rejects it. And the last thing which I want to cover is having clause. What SQL allows is in the where clause, you can have conditions for individual tuples. Supposing I want to have a condition which uses an aggregate value. In this case, I want to find departments whose average salary is greater than 42,000. Now, I cannot do this on individual tuples. This is an aggregate owner department. So, I group by department and find average salary. But now, I want to filter out department whose average salary is less than 42,000. I do not want to see them. So, I can have a having clause which can use an aggregate result here. Average salary greater than 42,000. There are other ways to represent this query. We will see that later on. But the having clause is not essential, but it is a useful feature. So, I think we will stop here. Thank you and see you after the break.