 Okay, so now let's move back to questions first and then we will move on to new slides. Sri Ramakrishna in Poimbatu, please go ahead. Sir, good morning sir. This is Jyothi from Sri Ramakrishna Institute of Technology. I have two questions. One is, what is the difference between Warkare and Warkare 2? The second question is, how to motivate the students in designing the database by using the normalization concept because while doing the project, the students have less plan in designing the database. So how to motivate? What strategy we can adapt to motivate the students to go for normalization for designing their database during their project work? Over to you, sir. Thank you. Both are good questions. The first was, what is the difference between Warkare and Warkare 2? Warkare is a normal single byte character string. Warkare 2 is 2 byte character string and in the SQL standard, it's called nWarkare for national as in not just ASCII, but character sets which are there in Unicode. So if you are going to store names in any language other than English, you better use Warkare 2 or nWarkare depending on what is supported in your database. If you use plain Warkare, you may not be able to support Unicode in full generality, although you might be able to store utf8, which is Unicode encoded in 8-bit format. You may be able to use that, but nWarkare or Warkare 2 avoids issues with this. So that's what that is. The second part of the question was, how do you motivate people to use normalization for their projects? That's a good question, but I would like to defer that to the point when we cover normalization and it turns out that when you do ER modeling right, most of the time you don't have much to do with respect to normalization, but it is something which you do need to check as a sanity check and we will return to this question when we cover normalization a few days from today. Periyar, when am I in Valam? Hello, good morning sir. I have talked about two questions regarding sync operations. Like operator used to be used two characters, percentage and underscore. What is the major difference between these two characters? And another thing, how null value and unknown values are stored distinctly in a database? First part was percentage and underscore. What do they mean? These are called wildcard characters which are used typically to match prefixes. So many times you know a name partially, but you are not very sure about the spelling of the name later on. So if you get the first three or four characters of a name, you can easily look at a list of names and pick the names that you want. This is a very, very common occurrence. So supposing you know, let's take my own name Sudarshan. Out here in Mumbai people tend to spell it right, but back when I was in Chennai, it would be spelled in all possible manners and the right spelling was more an exception than the norm. On the other hand, most people would say that Sudarshan, it must start with S-U-D although there was some who would spell it as S-U-N and turn it into Sundarshan and various other things. But let's say that they got at least the first three characters, right, S-U-D. So you could look for S-U-D percent and then you would get Sudarshan in that list. You might get other things. You might get Sudama. You might get many, many other names. But let's say that you are looking at just instructor names out of 500 or student names out of a few thousand. The number of things which start with S-U-D is small enough that you can browse and pick the one you want. So this is a typical example of using the percent symbol or to look for names which start with something. In other cases, you don't know whether name is the first name or the last name and the name may be stored in the database in either order. So then you might say take a name and then search for it as percent Sudarshan percent. So anywhere if Sudarshan appears in a name, it will show up. Whether it was written as S-U-Darshan, Sudarshan S or it would even match something like Sudarshan M because, you know, if space is not considered special here, percent can match space. It can also match non-space. But all of those would show up as answers. So it's used for searching when you know things approximately. Underscore is a special case where it's just for one single character, not multiple characters. So that is useful in certain cases where you are not sure of one character but the other ones are clear. Does that answer your question? The second part was how do you store null values in the database? We are actually going to cover this in database internal. Now for strings, it's easy enough to store a null string. If you are used to Java or C++, you know that a null pointer, it can be used to represent a null string. But what about integers? There is no pointer. Every single value for an integer is a valid value. So what do you do? So the way it's done in SQL implementation is to have an extra bitmap for such a type. So there is a bit which says that is this value a null value or not? It's stored separately from the value. So in a tuple, there is a single bitmap for all the integers which occur in the tuple and maybe all the cares, not the wirecares but the cares which occur in the tuple and other such types and one bit for each attribute and if the bit is one, it's null. If it is zero, it's not null, something like that. I hope that answered your question. Thank you sir for answering our question. We have two more questions to be answered. Go ahead. First question is related with null value. When we are checking some condition against null value, we instead of making use of equal to operator, we make use of ease. Is there any specific reason for making use of ease here? Another question is related with group by function what we have seen already. That is, we have seen one query which is grouped using department name. In that you have mentioned, if one department name is available but no other data are available there, then the department will not be available in the answer. But assume one situation, we have no department available. That is department value is null but all the other fields are available. So many rows are available but all those rows, department value is null. In that case, null value will be available in the department where aggregate function will be applicable for all the other data. Let me answer your second question first. So the question is, supposing I group by an attribute and in a relation that attribute value is null. So department name is null. So what happens? I am going to answer that in the context of nulls with aggregate which is coming up in the next few slides. So I will answer it when I resume my talk but before that your first question was, what is the question? Can you just remind me? When we are checking some condition in relation with null value. So the question was, why do we need this null? Why do not we simply use equal to null? And the reason is that equal is defined on nulls to always return unknown. So if I say some value equal to null, null value equal to null value is unknown. So it will never show up in the results. But sometimes I do want to find those tuples for which a value is null. And I cannot do anything. If I say equal to null it will fail. If I say not equal to null, well the not equal to null will be unknown. Not of unknown is unknown. So every time it will be unknown. So I have two options. One is to use the is null syntax which will match exactly those values which are null. That is why I have the new syntax for is null. There is also another option which says is unknown. Which I can use on top of a comparison with null. So supposing I say r dot a equal to 5 and r dot a happens to be null, r dot a equal to 5 is unknown. So I can say is it unknown? It is unknown. So that is a separate thing. I think I have some slide on it later. But if not you can look it up. So that is what you could use instead of equal to. So with that hopefully I have answered the question. I think we should get back to the slides now. There are many more questions I am sure. So let us get back. So this answers the second part of your question. Null values and aggregates. So there are multiple places where null values can occur. It can occur in the group by column. It can occur in the column on which we are aggregating the data. So let us take these two cases separately. Let us first take the case where the null value appears in a column which we are aggregating. So here is a query. Select some salary from instructor. Now what happens if one of the instructors has a salary which is null? Now technically if I add a null value to any value that value is going to be null. So technically this should give me null. If any one instructor has a null salary the result should be null. But practically speaking that is not very desirable. Just because one or two instructors have a null salary I do not want to say I know nothing about the total salary. It is okay if I give an approximate value but I should not give null just because one or two values are null. So this is the basic principle which SQL uses in the context of aggregates. So most of the time when I use aggregates it is okay if I get an approximate value. It may not be exact. That is fine. So what SQL says is any aggregate operation sum, count, min, max whatever it is is simply going to skip the null values. Not even going to look at them. So some salary will just sum up the salaries which are not null and throw away the salaries which are null. So the only exception to this is count star which is counting the number of tuples there are and therefore count star will include a tuple even if its values are all null. It does not matter. It will include it. Now there are a few special cases. Supposing I said some salary from instructor where department name equal to computer science. Supposing there is no instructor with computer science then what has happened? I have a set with no values at all. It is an empty set. What is the sum on an empty set? That is one situation. The second situation is maybe there is one instructor in computer science but that instructor's salary is null. Now what happens? Using our previous principle null values are ignored. So the set had only null. After removing the null value the set became empty. So in either case we have some on an empty set. So what happens here? The answer is count returns zero but some returns null. That is everything else returns null. Count of an empty set is zero. Everything else is null. That is the basic principle which SQL follows. It can be a little arbitrary but it is designed to do the right thing for most applications. Now we have some time to try out a few more queries. I have put down a few queries here. But I am willing to take queries which are asked by remote center participants. So let me take maybe the first couple of queries here and then I will open it up to others to ask queries. And then we will try writing them. So I hope you are reading the queries here. The first query says the total number of courses taken by Shankar. How do we do that? We have the name of the student but that name is in the student relation. The courses are in the takes relation. And therefore we need to first join the student with the takes relation. We want to restrict it to students name Shankar. And then we want the total number of courses. So how do we find the total? Well we use the count. So let us write this to select something. We will fill that in. From we need the student relation because the name is in the student relation. And then we need the takes relation because the course information is in the takes relation where name equal to Shankar. And we need a join condition between these two. What is the join condition? Student dot id. Id is the field which is common between the two. Equal to takes dot id. So far what have we got? We have done the join. From this we want to find the number of courses that Shankar has taken. So here we want to count the number of courses. So we need a count of what? We will include course id. So that is the final query. So we are including here courses which Shankar has taken but may not have passed yet. In our schema we are going to differentiate courses which have been taken but not yet completed by having a null grade in there. Once the course is completed the grade will be not null. So this includes courses which Shankar has taken but not yet finished. Now supposing we want to find courses which Shankar has actually finished. Regardless of whether Shankar passed or failed the course that is not our concern. We want to eliminate courses which are still incomplete. So what do we want to do? We want to eliminate those courses from this list where the grade is null. So we can say and grade is not null. So what you have done is eliminated those courses which are ongoing and now the new query gives completed courses regardless of whether Shankar passed or failed. On top of this supposing we want to find courses which Shankar has passed. In other words we want to further eliminate courses which Shankar has failed. How do we know failed? Let us assume that grade is equal to the character f for failed courses. So we want to eliminate those. We will say and grade is not equal to. So we can write many different variants of the same query by minor changes. I hope this is clear. I will go back to the document. Let us take the next query. This is a very easy query. Find the first year in which CS801 ran. So what do we mean by course ran? It means there was an offering of this course. In other words a tuple in the section table which is down here. And the section table has both course ID and it has a year. So how do we find the first year, the earliest year? That is easy. We just say find min year from section where we want to restrict it to course ID equal to 801. So we will say where course ID equal to CS801. So that is a very easy query. The last one is find IDs and number of courses taken by all students who have taken more than 20 courses. So what are the things we have to do? Find for each student we want to know the number of courses taken. Just a moment back we did this for Shankar. Now we want to know it for all students. And furthermore after finding out the number of courses taken we want to filter out those students who have taken less than 20 courses. So let us go back to our earlier query and make a few changes to do it not just for Shankar but for all students. I am going to switch back to the whiteboard. So I am going to keep this query as it is. But now I am going to make some changes. First of all here we said name equal to Shankar. Now we do not want to do it only for Shankar. We want to count it for many students. Let us strike that out. But now I want to count the number of courses taken by each student. Therefore I need to do a group by what? We want to group by the student. What identifies the student? The ID. So group by ID. So this is going to give me then for each student first I am grouping all the courses taken by that student and then count of course ID is the number of courses taken by that student. But I also need to know the ID of the student because what will happen right now as shown I am going to get many different counts once per student. But I also need to know the ID. So let me add this field ID over here. So what we have done is select ID comma count course ID and the rest of the query is as you can see group by ID. So for each student I have the count of course ID. But the last part of the query is restricted to those students who course count is greater than 20. So I can add a having clause here. We saw the having clause earlier. Having count of course ID greater than 20. So I hope this query is clear. Now I will take a few questions. And what I would like is not question but I would like suggestions for queries which we can write online. So you ask the query and I will answer it. But before I answer it as the query is being asked I would like every participant to think about how to write that query and then I will answer it. So let's go over to remote centers. We have Amity University. The question is that since the power of expression of SQL and query by example is completely similar and since QB is based on relational calculus which is theoretically simpler than relational algebra then why do we have SQL at all? Why don't we use query by example in all the databases? So first of all query by example did not have a language syntax. It was meant more as a user interface. It turns out that most of the uses of SQL are not directly in a user interface. It's from an application program. Most people don't write SQL but pretty much every single application which we use these days at the back end talks to some kind of data storage system. Most of them SQL some of them other data storage system but it's not interactively writing query. As a result QB's focus was on the user interface and as a it didn't pick up. It did survive in Microsoft Access which had a version of QB but Microsoft Access itself is not used all that much these days. Whereas SQL had a language syntax and it could easily be used for building applications. That's why it succeeded. So it's not to do with the convenience of the language but rather for writing queries but rather the ability to create a language and write queries in that language. So what I would like now is to focus on writing SQL queries. So please anybody else who's asking queries please suggest the query for me to write. That's all I want right now. Everything else we will discuss later. Nagole Institute Hyderabad please suggest the query for me to write. How can we delete special characters from a database table? Again you're not answering what I want. SQL queries for me to write that is the purpose of the current exercise. I'll be happy to take on these questions later but right now please suggest queries for me to write. How to delete a special character from a database table? I want you to write the query. You want me to write a query to delete special characters from a database table. This is purely a string manipulation exercise and you will need appropriate string functions and there are functions which can take a string and do string manipulation, replace characters in there. These are tend to be database specific. It's not part of the SQL standard per se. If you want to delete those special characters, look up your database to find the string match and replace functions and then use that to do the replacement. I cannot answer it in a generic sense. What I want to do is SQL queries which use the functionality which we have seen so far. That's the purpose of it. Javahlal Nehru Simoga Do you have a SQL query for me to write? Sir, write a SQL query to select nth maximum entry in the table. That is, suppose if a marks based on average marks, 10th average marks, 10th highest average marks or 12th average highest marks. That's a very good question. That's a nice challenging kind of question. One which you could ask your students if you want to challenge them a bit. Let me repeat the question. Supposing we have a table of marks, the question said average marks, but to make it just a little simpler, we'll say we have a table with marks already. If you want to find the average, you could do that and use that as an input to the main query. Now the goal is not to find the maximum mark, but to find the 10th mark in class. How do you do this? It turns out that this is a little tricky in the core SQL which we have looked at so far. Now what everybody will say, intuitively, sort the relation by the mark and then go to the 10th row and output that. This might actually work in certain SQL dialects where you can say order by and then say row number or something which counts the position of the row in the output, but that is not part of the SQL standard. These are non-standard extensions because they are not fully declaration. However, there are two answers to this. The first answer is there are extensions to SQL to do ranking and there is syntax for this which we will cover if time permits in chapter 5. If you don't cover it, it's okay. You can go read it up, but it is in fact quite possible using the constructs we have seen so far. So, let me answer the question using the constructs we have seen so far. For this, the first point is we need to somehow state declaratively what is the 10th highest mark in the class? What is the highest mark in the class? It's a mark such that there is nobody who has got a higher mark. What is the second highest mark? It's a mark such that there is only one mark which is higher than it. The third one is the mark such that there are two different marks that are higher than it. Now, note that marks and ranks can be a little bit different. So, supposing two people both got the highest mark. What is the second highest mark? You know, supposing two people got 99 out of 100 and the next one got 98. What is the rank of 98? Is it 2 or is it 3? If you count the number of students, there are two first rankers and then the next one would be the third ranker. That is one way to talk about it and this corresponds to what in SQL terminology is called sparse rank. Let me write it on the white board. So, what is the sparse rank? Supposing I have marks 99, 99, 98. This would be rank 1. This would also be rank 1 and this would be rank 3. In contrast, dense rank does not leave gaps. So, this is sparse rank. Dense rank has no gap. So, this would be 1. This is the same mark. So, it also has to be 1. We cannot make one of them 1 and then the next one would be 2. So, first let us get this concept clear. We have a sparse rank and a dense rank. Now, the next thing is how to calculate the dense rank and this is what I was referring to earlier. For each person, the rank is the number of people or the number of distinct marks which are higher than that person's mark. So, how do we do this? It turns out we can do a self join with of the marks relation with itself. So, first of all let me put down the marks relation. So, marks id, let us say that this is a very simple marks relation on which we want to write this query. So, now the first thing is to pair up each mark with every mark which is higher. How do you do this? So, I am going to use a self join in SQL. So, select I will fill in the details of the select in little bit. From I want two copies of marks. Marks let us call it a marks d there are two copies of marks where a dot marks a is the person for whom we want to find the rank a dot marks. So, I want to find all the marks which are greater. So, less than b dot marks. Now, I want to find for each student how many others have got higher marks or how many distinct marks there are above this. So, I am going to group by a dot id. So, what do we have so far? I have paired up each mark like what here each student with every other student who has higher marks. So, now I want to count let us say how many distinct marks there are that are higher. How do I do this? I already have group by a dot id. So, now I can select id comma count b dot it is supposing I want distinct. So, I want to know how many distinct. So, this corresponds to the first of all the dense rank is how many distinct marks there are above sparse rank is how many marks there are above. So, depending on what I want I will add distinct or not. So, let us say here b dot marks and then I add a keyword distinct which is optional. If I have the distinct it is going to give me the dense rank. Because in this case one has nothing above it sorry plus one here the count would be 0 well there is a little bit more tricky stuff here. First of all the person with the highest mark would not have anything here. And the right way to do it would actually be to use a left outer joint which we have not yet covered it is coming up. But let us say that for the moment let us forget the top ranker the question was how do you find the tenth highest ranker right. So, we can deal with that coming up. So, now the highest ranker is going to be lost from here, but the second highest ranker that is somebody with 2 will be paired up with 2 will be paired up with both of these. And now if I say count distinct b dot mark these 2 have the same mark 99 distinct will remove duplicates and then that will give me 1 count is 1 and then plus 1 will give me 2. So, I have the rank of the second highest for whoever is second highest the rank will be 2 whoever is third highest the rank will be 3 correspondingly. Now, supposing I want the tenth highest I can have a having clause having count again distinct depending on which version I want b dot marks plus 1 equal to 10. So, for the dense part it is easy somebody will be 10 with the sparse thing there is always a possibility that nobody has the tenth rank there may be 2 ninth rankers and then the next one is 11. So, there is no tenth rankers in many of our entrance exams this is done. So, I think there are gaps in the ranks. So, that corresponds to the sparse but depending on your application you may want to use sparse or dense. I think I will stop on the queries here because I want to make some progress. This was a very interesting query I am sure many more have interesting query, but I do want to cover certain number of topics today. So, I am going to switch back to the lecture. So, we have tried writing some queries. Now, the next topic is method sub query which allows us to put sub queries where we expect either relation or we can put it inside of where clause or even inside a select clause. So, let us look at some examples. So, the first example is actually something we have seen before but it is just written in a different way this is just to keep life simple to introduce the syntax later we will see more meaningful usage. So, what is this example? Find courses offered in fall 2009 and in spring 2010. So, the outer query is finding courses which were offered in fall 2009 and that is an extra condition which is course ID in select course ID from section where semester spring 2010. So, this will first filter out courses to those which are offered in fall 2009 and additionally check that the course ID is in the result of this sub query. So, what is this sub query doing is getting a set of value and this condition in condition make sure this value is in this set. Similarly, I can say not in which is the opposite which is course offered in fall 2009 but not in 2010. So, earlier how did we did not in 2010 we said except or in oracle minus here we are using not in. Now, here is another slightly more complex example which is find the total number of distinct students who have taken course sections taught by a particular instructor. So, I want to know how many people were taught by the instructor do not double count if a particular student took two courses as an instructor count that student is only one. So, basically the outer query is or you would expect count select count distinct ID distinct avoid duplicate from takes relation but the key thing is to make sure that this particular student took a course taught by a particular teacher. Now, if I simply look at course ID you know many people may have taught a course this student may have taken a section taught by a not a section taught by ID 1 0 1 0 1 which is what we want. So, what we have done in this particular query is first of all find which all sections this teacher has taught how do you do that select course ID, second ID semester year from teachers where teachers dot ID equal to this value. So, this is giving me all the sections taught by this instructor and now this has become a sub query to this outer query which checks that the takes couple is included provided the course ID, second ID semester year is in this list. And then I find the count of distinct ID. So, this is just illustrating one more query with in. Now, many queries with in can be naturally replaced by different query using a join construct. So, quiz question 8 says above query can be rewritten using natural join aggregation group by order by there are many options I have kind of spoiled the suspense here by saying that you can rewrite it using join I decided in the interest of time to tell you how to rewrite it using join also. So, what would you do here this is the these are the sections taught by this person. Now, I want to join it with teacher and what is the join condition I could just do a natural join which will equate course ID, second ID semester year those are all the things which are common between teachers and takes is it we have to be careful here is these the only things common we have to go back to the schema and check. So, if you look at takes we need we have the ID of a student who has taken it in teachers we have an ID of a professor whose teaching. So, in fact this quiz question 8 is wrong I forgot to correct this is raised earlier the problem is if you do a natural join of takes and teachers the ID will get equated. So, we will only get courses taught by a person who took it which is idiotic nobody would take a course they teach themselves. So, instead of natural join we can use join conditions explicitly or using. So, the query is actually simple let me flip back to the white board in just a second. So, the query is select count this distinct takes dot ID because I only want to count students from takes but I cannot do natural join because extra attributes would be equated. However, I can use the join using takes join teachers using the common attributes what are the common attributes we have course ID, section ID year semester. So, this is matching students who have taken courses which are offered by some teacher. So, we also need to narrow it down to where teachers dot ID equal to 1 0 1 0 1. So, that is the final query to find the number of distinct students taught by this particular teacher. So, the moral of this story is many queries which you could write using the in construct can actually be rewritten using the join. Now, this actually raises another interesting issue with respect to how to evaluate the query. We will come back to this later on but at this point I want to mention that a naive way of evaluating the first query is to go over every student and then evaluate the inner query. Let me flip back to the screen here. A naive way of evaluating this particular query is to take every row of takes and then evaluate this sub query completely every single time which is expensive. You can imagine if 1 0 1 0 1 has taught many courses we compute that whole set over and over again and each time we check whether this particular set of values is in this value I mean this couple of values rather is in this set here. That can be very expensive. So, naive evaluation of next set queries can be very expensive. Now, most databases have a good optimizer which can take a query like this and actually turn it into the join query which we saw and that will be more efficient. However, if you write more complex queries sometimes the optimizer is not clever enough it depends on which database you use. If you use PostgreSQL the optimizer is moderately clever but not overly so and it will handle this query very efficiently but somewhat more complex queries it may not do it very efficiently. In contrast if you use SQL server which actually has a really good optimizer it will take pretty much any query you nested query you throw at it and find out the best way of evaluating it. But if you are using PostgreSQL or MySQL my advice would be use nested queries with care. If you can translate it to a join query reasonably easily do that. Don't depend on the optimizer to do it for you because it may not. So, that was nested query. There is a little more stuff here on set comparison which is illustrated here. Find names of instructors which shall be greater than that of some many at least one instructor in the biology department. So, we can write it using a self join I am going to skip this for the sake of time and instead come down here. What do we have here? Select name from instructor where shall be greater than some and then a sub query. So, what we have done earlier is the sub query is linked using an in clause. Now we have replaced that in clause by the greater than some clause. So, this will succeed if there is at least one value here such that this condition is satisfied. You can also have greater than all which means it must be satisfied for every thing in here. So, now take this query and this one. Now, let me come back to this query. There is a quiz question here. So, let me first explain this query. This has instructors as p instructor as a self join. What is the join condition? p dot shall be greater than s dot shall be. We just saw something like this when we did ranking, but here our goal is not to count. We just want to make sure that there is at least one person in biology with a lesser salary. So, s dot department name is biology and t dot shall be greater than s dot shall be. That is our condition. So, we want these instructors. Now, the problem is that if you have a particular instructor with shall be say 100000 and there are five people in biology with shall be less than 100000. Every one of those five will match this and the name will appear five times the same person's name. So, what did we do? We said select distinct t dot name so that the name will appear exactly once. That is the query. Now, these two queries look to be the same and in fact, if you do not have duplicates at all there is no problem. They are exactly the same. Now, this is a very small variant of the previous query where we said let us say salary is greater than all select salaries from instructor where department name is biology. So, now what the only change is instead of greater than at least one person we want to say greater than all the people in biology. So, this is valid in SQL. It works. The question is can we rewrite these queries using aggregates? So, just think for a moment I want instructors whose salary is greater than all the salaries of instructors in biology. So, how do we rewrite this using aggregates? It should be intuitively clear that if it is greater than all of the salaries it must be greater than the max of the salary and vice versa. It is greater than the max. It is clearly greater than all. So, the basic idea is to find the max salary and then check if it is greater. So, that is in the next slide. Select name from instructor where salary greater than select max salary from instructor where department name is biology. I hope this query is clear. Now, I am going to rush through a few more constructs and then take a break for question. The next one is test for empty relation. You can have exist or not exist. Exist is if a relation is not empty, not exist is true if the relation is empty. P here denotes the empty fact. P is the Greek letter C denoting the empty fact. So, here is a query which says select course ID from section SS where some fall year 2009 and exist select star from section ST. So, note that I am naming this copy of section SS, this copy of section ST and I am doing something more here. Where semester equal to spring year is 2010 and here is one extra thing which we did not see before. What we are doing is this S over here section SS. I am using in the subquery and I am checking if S dot course ID equal to this copy of section is called t, t dot course ID. So, because I had section inside and outside I rename them otherwise I can use the original relation. So, what we are doing here is we are running a subquery which effectively takes as a parameter a value from the outside query. So, this value is think of it as a parameter which comes from the outer query and this kind of a situation is called a correlated subquery. Why correlated? Because there is a parameter from the outer query which is used in the inner query and this thing here S dot course ID is really from the outer query. This is called a correlation name or a correlation variable. This occurrence here in the subquery is called a correlation variable. So, how is such a query executed? Again the way all subqueries are executed is intuitively simple. We take the outer query and execute it ignoring the subquery initially and then check the subquery condition. So, once we have got a section here which satisfies all 2009 we will then check the subquery condition exists blah blah blah. So, now when we are looking at a particular section a particular instance a particular row in the section table S dot course ID is now meaningful. So, for that course ID I will execute this query. So, what we are checking is that particular course offered in spring 2010. If it is then a tuple will be generated here and select star outputs that tuple exists succeeds because a tuple is output. Supposing that particular course is not offered in spring 2010 then what is the result of this select query is the empty relation because there is no matching section here. And therefore, exists will fail and then this outer query will not output that particular course ID. So, I hope this correlated execution is clear that is the default semantics of subquery. Again as we saw there are many cases where we can rewrite the subquery to turn it into a joint. And usually that is faster than the subquery. So, if you have the choice go with the joint do not do a subquery. In this case it is actually very easy we saw how to do this query earlier using intersection that is perfectly fine you can use intersection you can actually do this particular case using joint also. So, just take this query and then similar query with spring 2010 and then if you do a natural joint on those only course IDs which appear in both the relations will appear in output. So, that is in this particular case is equivalent to intersection and intersect with respect to duplicate count it is not the same as intersection. But if we do a select distinct then it would be the same as intersection. Similarly you can have a not exist. Now somebody had asked me about division. Now this particular query illustrates division and let me explain the query first before I come back to the term division. The query says find all students who have taken all courses offered in the biology department. Now why would we care about all courses offered in the biology department. This is really a simplified version of a query which is actually very useful. Most universities have a set of courses which a student must have completed and then a set of electives out of which the student can complete any of say 5 electives or 10 electives. But there is a bunch of core courses which the student should have passed unless the student passes all the core courses the student is not given a degree. So, this is a simplified version of it which says all courses offered in the biology department. But you could replace this by if you had a relation which says these are the core courses you could substitute that relation here. So, that was a motivation. So, let us look at the query as shown here students have taken all courses offered in the biology department. So, what is going on here I am finding students where something is true. Now what is this something let us break it into pieces. The first sub query here is select course ID from course where department name equal to biology what is this doing it is finding all courses in the biology department. Now take the second sub query what is it doing select t dot course ID from takes as t where s dot id equal to t dot what is s here s is actually this outer thing from student. So, for a particular student I will take I have a particular s dot id. So, I am taking in effect I am doing a selection on takes to find all course IDs taken by this student. So, now what am I doing let us call this s q a and let us call this s q b s q a is all biology courses s q b for a given student is all the courses that the student has taken and my goal is to make sure the student has taken all the biology courses. So, before you understand the above query let us jump ahead here you can think of the above query as logically as select distinct this part is the same from student as that part is the same where s q b contains s q a. So, what is our goal here s q b is the set of all courses taken by the student s q a is all the biology courses what I want is students who have taken all biology courses. So, what I want is cases where s q b courses taken contains all the biology courses. Now, this is actually fairly nice simple syntax and weirdly enough it was part of the s q l language early on in the very first version of s q l in addition to you know things like in not in they had this contain sub query contain sub query for some reason this fell by the way side and was removed from s q l and nobody reintroduced it I do not know why this happened probably it made it easy for people to write very expensive query and rather than allow people to write queries it can bring the database to a halt because it is so expensive may be they removed it from the language but it is a very useful construct. But in its absence what you do can you get the same effect in fact you can and what this means is you could have equally well brought the database to its knees by writing a somewhat different query but people may be thought that this makes it too easy to write bad queries will make it a little bit harder. But if you did want to write this query here is how you do it so we have two levels of query this is where not exist select course ID from biology except this thing what is s q a all biology courses except the courses taken by the student what is this result going to be if this student took every single biology course this result s q a except s q b is going to be empty if the student missed even one biology course not going to be here and that course will show up in s q a except s q b and now we are checking not exist so if the student took every single biology course then this will be empty and not exist would return true and therefore the student would be output if the student missed even a single biology course that course would appear in this result this except that not exist would fail that student would not appear in the result so I hope that is clear. So this is how you would write this query in current generation s q a I do want to point out that you cannot write this query using equal all in its variant many people think that can we write this query where student the set of courses taken by the student equal to all set of courses taken by a set of all biology courses you cannot do this equal to all takes on its left hand side a single value not a set and sometimes people write where you know write other kinds of queries with equal to all which are totally meaningless in this context. So it is a little complex query to write but this is currently the only way to write it in SQL but what you should do is first of all think of writing this query in this form with contain and then rewrite it to this form to get the query that you want so that is how you think about it in SQL it turns out that this is the equivalent of the relation algebra division operation so the some reason there is a box line here ignore it so what is the division operator this it is a little hard to read but I will give an example which shows the division operators is given relations r with some schema capital R and s with some schema capital S the important thing is the schema of s must be a subset of r. So the second relation must have a subset of the attributes of the first relation the division operator says r divided by s is the largest relation p on what attributes on attributes in r minus the attribution that is all the remaining such that p cross product with s is subset of r. So think about this if you are doing division on real numbers how do you define the result of division you can define it using multiplication a divided by b is simply the number such that b times that number equal to a so that is what we are doing here p times the denominator equal to the numerator this is r divided by s. So p times s is equal to r is what we would do if we are doing real numbers if you are doing integers how would we define it we want integer division which means we want to truncate we want 9 by 4 integer division will give 2 not 2.25 but 2 we truncate. So what is the result of integer division it is the largest number such that p into say 9 by 4 that number into 4 is less than 9 what is that largest number it is 2 the same intuition applies for relational division it is the largest relation such that p cross product this is the Cartesian product now s is subset of r. So that is intuitively how this is defined now let me switch to white board and let us take an example with biology suppose I have biology courses I have a sub query to compute the biology courses let us say I have a set of those bio 101 bio 102 and bio 210 let us say those are the only courses in biology there are 3 courses now I have a student ID and the courses they have taken. So ID this is the takes relation let me get rid of the year semester and so on in fact I have to project it out I will remove those attributes and all that I want so this is course ID it is important before doing division to remove all extraneous attributes. So I am going to project takes on just the ones that I want so ID 20 comma course ID other attributes will cost trouble when I do the cross product I need to project it out on just the minimum attributes that I need. So let us say ID 25 has taken bio 101 ID 25 has also taken bio 210 but has not taken 102 whereas ID 37 has taken all the bio courses. So what should let us call this R and let us call this S what should the result of R divided by SB it should include 37 because 37 has taken all the bio courses. So the expected result is 37 now if I go back and so this is a relation with a single tuple in this case you could have more and a single column ID. So note the schema the result is ID course ID the schema of S is course ID. So this is R sorry this is capital S the schema this schema is capital R. So if I do R minus S is just ID that is what we saw in the definition so that is what I want here if I take 37 cross product with this what are the tuples I get 37 bio 101 that is present next one is 37 bio 102 is present 37 bio 210 it is present. So certainly 37 is in the result supposing 25 we wanted to check if 25 is in the result we take 25 101 sure it is in the input relation 25 102 sorry it is not there 25 cannot be in the result. So this is the answer and this is exactly the same thing which we did in the SQL query in a kind of roundabout way with the not exist you know sub query 1 except sub query 2. So we did it in a more convoluted way but there are many ways of actually doing it in SQL one way of doing it is actually using aggregates supposing I want to do the same thing what I can do is I can subset this takes relation only include courses in biology. So how do I do that it is actually is very simple using relational algebra you can write it in SQL equivalently supposing I do natural join this is step one. So what is the result of natural join these are the biology courses now in this case all the course of the biology but let us say I also had 25 with course chemistry 101 this could be in the R series. So what does R join S do it is going to output all of these because they are bio courses this one chemistry 101 will get eliminated because it is not here. So what we are left with is just IDs with bio courses now what I can do is count the number of bio courses taken by each student. So if I want to write this in SQL I can say select ID comma count course ID we can do distinct in case students can take the same course twice I need to put in distinct from takes natural join takes join I want to find biology courses so there is a relation here select course ID actually I could I could do natural join that is not a problem here let me stick with natural join takes natural join select course ID from course where department name equal to biology whatever I achieve so far I have taken a natural join and then I have done well I need to complete the query group by so I have taken a natural join of the takes relation with all the courses from biology and the natural join is on the course ID attribute that is the only common attribute and then I have found for each student how many distinct courses biology courses in this case the student has taken now how do I check if a student has taken all the biology courses think about it for a moment what I have done so far is I have got a count of the number of biology courses a student has taken number of distinct biology courses a student has taken how do I know a student has taken all the biology courses and answer is very simple I simply count how many biology courses there are and check if the counts are equal so I am kind of out of space here so I am going to write this here having count distinct course ID equal I want to know how many biology courses there are that is simple select count star from course where department name equal to biology so I have effectively done division in a two step process first for the right hand side I count how many couples there are then I am doing a join of the left hand side with the right hand side to eliminate tuples which are irrelevant in this case CH 101 was irrelevant I do not want to count CH 101 so the natural join eliminated all the irrelevant tuples and then I did a group by and did a count to find how many biology courses in this case the student had taken and then make sure that the counts match if the count for a particular student is less then the student has not completed all the biology courses can the count be greater well it cannot be because I am only counting biology courses which the student has taken it cannot be greater than the complete set of biology course so it turns out division can be done using aggregation like this but aggregation is not essential for division division can also be done using the basic relational algebra operation and in fact I am not going to spend time on it now but the slide on division operation which is at the end of chapter three slide shows you how to do it I am not going to spend time on it little complex but it shows you how to do it using only cross product set difference project that is it just using the basic relational algebra operators you can do