 Okay, there's a lot of topics here in the two hours that we have, I cannot possibly cover all of it. I'm just going to pick a few slides and go over the examples. So first of all, types in SQL, again you're probably familiar with this. We have character types and you can specify how long it is, where care, it turns out on some databases, this business of care versus where care can be a pain, in Oracle for example. If you compare a care with a where care, things which look identical will return as not equal. Anyone seen this? Some of you would have used Oracle, have you come across this? What happens? What is the difference between care and where care? Care means it's fixed length, where care is variable. So if you have only three characters in a care of length 10, what does it do? It blanks for the remaining one. So now if you do a compare of something which is where care of 10, care 10, both are the same size, but you put the same text in both, but it's not length 10, in Oracle it will say it's not the same. Then what do you do? What do you need to do? How do you compare them? It's a pain. You have to trim, which is remove the spaces, join conditions go for a toss. It's a big mess. So the general advice is don't use care. Use where care. Use it uniformly and you will not have these problems. So don't use care, it's the model of the story. Use where care always. Even if you know that the length is fixed, today it may be fixed, tomorrow it may not. So our role numbers for example, we always knew it was fixed at eight characters till a couple of years back our academic office decided for some silly reasons to increase it to nine. We had been asking them to reduce it from eight to something less, but they went the other way. So where care would be safer in such situation? The next type is int or integer. These are all standard. There are variants of it, numeric, which is useful for numbers. It turns out that SQL comes from a commercial heritage, used for commercial purposes. Then commerce, what do you use? Do you use binary or do you use decimal system? Decimal, right? And does this actually matter whether you use internally binary or decimal? Why don't you just use floating point or real or something? Does it matter? It turns out it matters sometimes because certain things can be represented precisely using decimal, but not using binary. So the string becomes very long or infinite. So if you convert something from numeric to binary and back, sometimes you get into trouble. So in IIT we had developed software for the academic office, all in Java, and we had to do the calculation of the CPI, the credit performance index. So how many credits, what grade you got, so average it out. And for the longest time, academic office would come back and say your program is all wrong, it's giving wrong results sometimes, and we were mystified. It's a very simple program. What can go wrong in a small loop which just does average, it's not even a loop. It was done as a loop. It could have been done in SQL, but it was done in Java. It's a very simple program. What could go wrong? It turned out that they were storing it as a real in Java, putting point, and then converting it back to decimal at the end, and the academic office was using a calculator. Guess what? That calculator used actual decimal, and they got different results. So what do you do? What do you do in such a situation? Curve? No. No. The bottom line is if you want to get the same, I mean, one level you could say, I don't care. But at a different level, so the problem happens because you're rounding it off to two digits or something. To go to n digits, the difference is infinite smell there. When you round off, it might go up or down. And some people get bothered by this. It doesn't matter in some sense, but it does in another. So anyway, what we ended up doing was there is a numeric package for Java. We use that, which actually does computation in software. It keeps the decimal and does stuff. So anyway, so these do matter in the commercial world. Numeric types matter. Now here is a standard example. For our schema diagram which we use, we had an instructor relation. So this is the schema we actually use. This matters in the lab because you're going to be inserting tuples and so on. So the types matter here. You see the ID, often we have made it five and like IIT is eight or nine digits, we kept it short at five. We didn't want long strings cluttering of examples in the book. We made it care. Maybe you should have made it wire care. Name is wire care and we have said it's not null. It cannot take null value. One name we have made wire care, salary numeric, because this is a business thing. And now we have insert into instructor, we can give values out there. Note that we are giving a string value for character, so they are quoted. But for numeric, you should not quote it. Sometimes people forget this and get into trouble. And then we have integrity constraints on these. Again, you must be familiar with most of these. The not null integrity constraint we just saw. Primary key is again a standard feature and primary key in SQL also implies not null. You cannot have a null value for a primary key attribute. And foreign key ensures what? What does it ensure? What is the point of reference integrity? Depends on primary key. The point is that a value present in a table cannot be arbitrary. It has to be one of the values present in the other table. New values are not permitted. Yeah, new values are not permitted. Because if you allow it, you have chaos. So on Facebook, many of you would have entered your education institution and so on. Maybe a restaurant you like, whatever. If Facebook enforces anything here, any foreign key restriction, let you put in any text you want. Well, they realized after some time that they were messing up. That they could not get a review for a restaurant because it is referred to in 100 different ways. So one of my friends is actually in charge of a feature in Facebook which helps users to correct these later. So at some point they'll be asked, well, you entered the review for this. Did you mean, is it the same as this one? And they are trying to standardize it. So they're cleaning up after the fact. If any database person could have told them, hey, you guys are screwing up, you are going to get into trouble. Well, they did it anyway and they got into trouble and now there's a lot of work keeping my friend gainfully employed, cleaning up after the mess. So in databases, of course, I've known this for many years and foreign keys constraints are enforced, but for a small KBI. And that is, if you make a field null in a foreign key, it no longer has to be present in the other one. That's an escape clause, has to be present or maybe it is null. So this is something you have to keep in mind unless you declare it as non null, then you will actually enforce that it is actually present. My default SQL allows it to be null. We have these other tables, a student which has department name, which is a foreign key referencing department, all these types are fairly standard. We have a student table takes this student ID, takes this course ID, section ID, semester. If you note here that year has been made, numeric 40. Now we could have done it differently, SQL has a date type and then there are some extensions which say it has to be a date with an accuracy of a year, but not more precise than that. But for simplicity, we just, and there are some non standard different databases do stuff differently. Oracle being one of the culprits. So I think Oracle, please, as of some point, did not support the date type. So only I had a time stamp or date and time type. So we chose to keep it simple with a numeric field. And what else do we have here? We have primary key declaration, which says these four fields together constitute the primary key of 6. And the foreign key can also involve multiple attributes, course ID, section ID, semester year, references section. What is the order of these attributes? Does it matter? It does. It has to match the order of what? The primary key declaration in the other, in the section date. And then we have course, which is a fairly straightforward table. Credits is numeric to 0, we could have made it an integer, we just chose it to demonstrate numeric type. You cannot have more than 99 credits for a code. This is something which IIT Bombay has almost challenged. We have an MTech project, which is, I think, 50 credits or something. So it's halfway to overflowing this thing. So this design is, sometimes you make a decision and then you may have to reuse it. But luckily, most databases will let you change the type of already existing table. So it doesn't mean you have to rebuild the whole thing. There are a couple of questions. Primary key constraint ensures that primary key values are not repeated and are not known. And the foreign key constraint on course ensures all departments have associated courses. No, remember the order in which it is. One name occurs in the department relation. That's all that this enforces, unless it is null, because we have not declared it as not null here. So the correct answer is none of the above, it should be unless it is null, okay. There's always, we saw creating table, we can drop table. Alter table, I won't get into the syntax details, but you can change the types, add and drop attributes. Okay, so now let's see some sample SQL queries. Look at this query here, select distinct department name from, as we saw, distinct the most duplicates. Now this is a common error when you ask students to, you give a problem and say, write this query. It's implicit in your mind that you don't want duplicates in many cases. And this is something which very often students forget, not only students, instructors also forget. I have given assignments where I just said, find the names of students, it was implicit in my mind that there should be no duplicates. But students did not realize that, they didn't think about that aspect and they gave a query. And it actually worked fine on whatever database there was because there were no duplicates in that particular database. But then if you ran it on something else, you're going to get duplicates. This is a bit of a problem when you correct assignments in SQL. In fact, I want to spend a minute on this. Many of you have conducted labs with SQL, so how did you correct the assignments? Read the query and see if it looks all right. That's what most of us do. When you conduct an exam, you give an SQL query. The problem is you can express the same thing in many, many different ways. So you look at the query and you read it and say, does this look reasonable? And then you give marks. Well, let's say exam, it's okay, if a student got it 95% correct in some sense, forgot the distinct, you also forget to check for distinct. It's not the end of the world, as far as the exam is concerned. But when you build an application, it may look bad when you have duplicates with a user of the application. They say, why are you giving this name two times, it's odd. So you really need to be able to test for such situations. And this is actually not easy. So somebody was talking to me about research. So this particular problem, how do you check for correctness of an SQL query? How would you do it? What are the other alternatives? It's a good research topic. You can read it, that's totally manual. Is there some other way? This is an obvious way, right? Many people use this. What is it? It's not for an exam, but for a lab assignment, where the query has been entered and provided to you. It's all online. What can you do? You can go execute the query on a database and see if it gives the right result. Is that enough? Yeah. Yeah. Testing with different test cases. You really need to test with many test cases. How many? That is enough. There's no simple answer to that. So this is an interesting research problem. And this is something which we decided to work on a few years ago. And we have a few publications in this area. So at the risk of, again, digressing from this, what we have built is a tool which looks for certain common errors. What are the common errors? We'll see some more common errors as we go along. It is one of them, which we just saw. Another common error is where you should have done an outer join. You do an inner join. We'll come to outer joins in a bit. That's a common error. People routinely make these mistakes. Another case is when you have an outer join, you have a condition in the wrong place. Again, I'll come to that. So there are a number of common cases like this. And what we have is a tool which can take a schema and a query and generate data sets which can catch these common errors. Can't catch all possible errors. That's very hard. But if it can catch most of the common errors, you already have a leg up over, just running it on one database. So in fact, we compared it to running the same queries on the standard. So along with the book, we provide two different sample databases. One is a small one, and another is a large one. So we ran these queries across both these databases supplied with the book. And then we ran the same queries on the databases we generated. And already the queries had been corrected by TAs who use some combination of running it on this database and then reading the query and seeing if it was all right. So we ended up doing a comparison which showed that many errors which our tool caught which the other things didn't catch. So my goal is to develop this tool into a point where we can actually distribute it. So I hope that I can make it available later on in the coming year and some of you can use it in your courses. So when it's ready, I will put it up on Moodle so all of you can try it out. Okay, so coming back, we have more SQL constructs. We have the where clause again. All of you know this. I am going to skip obvious things. The from clause, you all know this. Let me come to join. We already saw natural join. This is the other standard form of the join in SQL. SQL supports natural join. But it was a feature added much later. The original SQL was only Cartesian product with select. So you would say if you want to find instructor would teach specific. What is the teach relation? What does it relate? Instructors with a section of a course. Which section of which course they teach. If you want the more details of course, you have to go from section to course. So each of these steps involves a join. So between instructor and teachers, what is the connection? So here is the schema. One has a course ID, sorry, no, for this query. This schema is for the second query here. For the first query between instructor and teachers, what is the common thing? It is the ID of the instructor. Teachers has the ID of the instructor who is teaching it. Therefore the join condition has to be instructor.id equal to teachers.id. And this is another one between section and course where the common attribute here is course ID. So the condition here is from section comma course where section dot course ID equal to course dot course ID. And this particular query was for course ID semester year title of each course offered by the computer science department. So department name equal to computer science. So there are duplicates in this particular thing. Can there be? Yes or no? It depends on the schema. You have the schema for section there. Look at the list of attributes in the select clause. So how can there be duplicates over here? When you join a section with a course, the section uniquely identifies the course. So you cannot have two courses for a section. So then how can duplicates come depends on what attributes you project. If you project all the primary key attributes, there cannot be duplicates. Have we projected all the primary key attributes here? Have we? What is not section ID? Section ID if you look here, the primary key is shown here, course ID, section ID semester year. But section ID is not occurring here. So there can be duplicates. If a particular course is offered twice in a semester, you will have duplicates. So what should we have done if we didn't want duplicates? Should have been a select distinct. Like I said, this kind of error can happen and there may be examples. I don't remember if this one is in the book or not, but it's quite possible there are a few mistakes like this in the book. So if you catch any, let me know. So when I said SQL tutorial, the idea is that we can have an interactive session. So why don't you suggest some queries and let's try writing them. So what I'll do now is show the schema, go back to that one. I'll leave this on the screen and let's have an interactive session. Who wants can suggest, let's start with very simple queries all using this schema. I may make mistakes if you catch them, very good. And when you teach this course to students, I'm sure you will also make mistakes. The students catch it, that's fine. That's part of life. Nobody is perfect, but it's important to run these even at the risk of making mistakes, it's okay. So let's do it. Some useful information which you may get from this particular schema. So what are the prerequisites for a particular course? So let's say the course, pick a particular course, say CS 101. Well, 101 probably doesn't have prerequisites. But let's say our database course, which is, we have to identify the course explicitly. So let's say in IIT Bombay, that would be CS 317. So we have to find out the prerequisites of 317. So how would you do it from here? You can walk through the schema diagram and that will suggest to you what are the things that you need to do. So from 317, what can you match here? So you have several options here. You could match course ID in course or you could go directly to prerequisite and match the course ID. So now, if you use the latter, it's actually a very simple query. It's simply, select prerec ID from prerec where course ID. So that's a very simple, let's try some other one. Any suggestions? Raise your hand if you have a suggestion. So that's a good point. So here we just found the prerec ID. So supposing you want to find the name of that course. So let's come back to that same sheet. How can we modify this particular query? We have to connect it to the course. And what is the connection between these two? Pre-req dot. Pre-req dot course ID equal to course dot course ID. So the question is, do you want the title of the database source or of the prereq? Pre-requisite. So then what should be? Pre-requisite ID. We should match with the course ID in the course. So prerequisite.pre-requit ID equal to course dot course ID. Exactly, yeah. So prerequisite.pre-req ID equals course dot course ID. And so that is the join condition. And along with the title in it. And that gives the complete information. Completing with the course ID and title of the prerequisite. Next one. So the question is, select all the instructors who teach in a particular building. So this might be important because maybe some work is going on in the building. You want to tell them, watch out for something. So where is all this information available? So first of all, you need to know which courses run in a building. So where is that information available? Building is there in the classroom relation. And from that we can go to the section relation. From that, teach. From that, so we have a lot of join. So that happens often in SQL queries. So what is the query one would write? I wish we could have both screens around at the same time. But I'm forced to switch. So just remember this. What is the connection between classroom and section building? Between section and ticks, lot of things have to be connected. So I'm going to use the natural join construct in SQL to do this, because it simplifies life. So we could equate each field, but it's easier to do the natural join. And then from there to, takes, sorry, teachers, from teachers to instructor. So before that, let me remind you of the natural join construct. So this is the syntax for natural join in SQL. So this is select star from instructor, natural join teacher. Now, note that if I want to do multiple natural joins, I can nest that. So I can say that natural join something else and so forth. So I'm going to use that and going back to the, there is one other issue with natural join. So we should be careful with natural join, but I'll come back to that. Okay, so what all natural joins can we take? Classroom. Excuse me, sir. Yeah. Can we throw light on rename clause first before we go to natural join? Because all relations may not have same names. Yeah, so I'll, I'll come to that in a bit. But just to simplify writing this particular query, I jumped ahead to natural. So what do we do here? Classroom, natural join section, so that will force what? Building is already there in section, so we don't actually need to join with classroom. So if you had a building name, then you would have to go through that. Yeah, that, that's a good point. We don't have to do this. So you can have a redundant join in the query. A student might write it by including a join with classroom, it's still correct. So in this case, we can skip that particular joint and we can go directly from building to, in section two, teacher. So section natural join teacher and all the four common attributes there will get equated. And then, from there to instructor, so that again if you do a natural join, ID will be equated to ID. So that's a slightly simpler query. Select name, which is instructor.name from section natural join, sorry, water relation, teachers. And then on that we again want to do natural join. And we just wanted it for a particular building in this case. So where building equal to, let's say, a lecture hall complex. LHC is the short name for this particular building, so we can use that. Yeah, so if you want, if you want to get the ID of the instructor, ID, what do you want to use? Nested queries. Nested queries, yeah. So you can use nested queries, there is a use for it, but you should not overuse it. So there is one reason for not overusing nested queries is that some databases are not very good at evaluating nested queries efficiently. So use it when you need it, but don't use it unnecessarily. Or your query may run slowly, in certain cases. Ideally it should not matter how you write an SQL query. So there are many ways of writing it, you can use nested queries. You can't always do it, by the way. For a join like this, a nested query is probably not the right way of doing it. You have to connect three relations, so there is no natural way of doing this with nested queries. But in general, don't use it, but in this particular case, how would you write it using a nested query? Instructor where ID equals select ID from teachers, where session ID equals session, select session ID from session where building equals, so and so. Okay, so already there is a problem with that, when you say equals, you have to say in, because there may be multiple things. So you can write it using nested queries, and a good optimizer would convert it. But it's usually better to write it directly as a join. So when you go through that schema diagram, you know what all things to connect. Turn them into joins, don't write it as a nested query. It's easier to read this query than the other one, with the nested query. Okay, good, any other one? We'll move on to other SQL, I'll take one last one before we move on. A join in general is better for performance. Like I said, a good optimizer should be able to figure out, even if you write it with the nested queries, that it is really equivalent to this other one with join. And both should run with the equal efficiency, ideally. But the fact of life is, the very few optimizers which will always work properly on nested queries. They will handle the simple cases, but more complex cases, it depends on the database. So we are using Perscrisql, which is actually not all that great with nested queries, it doesn't do as well on many nested queries. You will not see that performance difference on the tiny database. But when you have a large database, that is when you will feel that. In fact, Oracle is a little bit better. Microsoft SQL server is much, much better at optimizing nested queries. So this brings in the issue of query optimization. We don't really have time to cover it, but we will cover query plans in the main codes, even in the labs here. So you can see what are the plans. Some of the plans will correspond to actually doing it in the nested fashion. As written, so that query which I'd select from instructor where the instructor is in the teachers, where the building is, so and so. Where the teacher is in the section. So the natural way is to go over each instructor. For that instructor, go over each teacher's relation, a tuple which matches it. For each teacher's relation, tuple matches it. Go to the section and then compare. So it's like a set of nested loops. And if you don't have the right indices, nested loops can be very, very inefficient. It can be quadratic or cubic if you don't have the right things. It is possible to rewrite this with a much better plan which is actually close to linear in the size of the relations. Ideally, the database should figure this out. If you do it as joins, it's very easy for the database to do it right. If you write a query as joins, pretty much all databases will figure out a good way of doing it. If you write it as nested queries, it depends on the query. Sometimes they will do a perfectly good job. It will be identical plan to whether it's joined on a set query. Sometimes it works. But join is usually much better. Yeah. Whether that's over, source databases, is it not comfortable with that large databases for example, data warehouses? Open source and data warehouses. You were talking about MS SQL as well as Oracle will give some good results. So their optimizers are slightly better than the Postgres SQL optimizer in this context. So in overall, the SQL server optimizer is a very nice piece of work. It can do some really clever stuff on your plans and get very good plans. Oracle is not quite as good, but it's not too bad. Postgres is also not bad. It does a pretty decent job on nested queries, but probably not as good as. What about MySQL? MySQL is probably a little lower down in the picking hierarchy. Maybe comfortable for small databases like that. So like I said, there are two parts. One is the efficiency of doing a particular operation. And here MySQL is pretty good. The other part is if you give it a complex query, can it find the best way of running that query? And there MySQL traditionally had not been as good as the other ones. In fact, it really sucked some years ago. But they have obviously improved things, so it's not so bad now. But Postgres SQL, look, I don't want to compare this thing. But I think it's probably one notch ahead of MySQL for this. Now, if you want to handle really large data, there are two parts. One is coming up with good plans. The other is actually executing queries on very large databases. So all of these can handle very large databases. On a single machine, they can handle databases with many hundreds of gigabytes. But of course, it may be slow. And people have built parallel systems using these. Some of the largest parallel databases have been built with MySQL or Postgres SQL underneath. There are also parallel databases with Oracle or SQL server underneath. But because these are free, if you buy 1,000 copies of Oracle, you will go bankrupt, okay? So 1,000 copies of Postgres SQL costs till zero. So some of the largest data warehouses in use today are built on top of these. Because one of my students asked me that, how we can evaluate the performance of the databases? Is there any tools are available? For example, how many seconds or nanoseconds is taking to complete that particular query like that? Yeah, so there are benchmarks. You can run your benchmarks on the database. There's a family of benchmarks for your TPC, Transaction Processing Console benchmarks, which are widely used. So you can download them and run them on your favorite databases and compare them on whichever machine you want. But the bottom line is, it depends on the application. And for most applications today, which are not really large scale, most of these databases will be comparable, except for some queries where if you write it carefully, you are okay. Okay, let's come back to SQL. Yeah, we saw the natural join. And I mentioned that natural join can be dangerous. So here is a case. List the names of instructors along with the titles of courses that they teach. So this is the query which we wanted. So we said, look, course has a course ID, teachers has a course ID, teachers has an ID of an instructor. Instructor has an ID of the instructor. So we want their names, we have to join all three tables. So the first cut might say, select name, title from instructor, natural join, teachers, natural join, course. But there is this bug here. What is the bug? That depends on the schema, which you may not remember. But let me remind you of the schema. The course relation had a department name. What is that department name? So which department is that course in? Now instructor also had a department name. That is which department is that instructor in? What does a natural join do? It's going to equate it. So when you do that join, what will you miss? Something will get eliminated, which was not intended. Which is, yeah, if a particular instructor in one department taught a course from a different department, that combination would be eliminated because of that condition. And it's actually hard to test it. If you take most data sets, most of the time, people teach courses in their departments. How often do people go and teach courses in other departments? They do, but mostly it's within their department. If your database did not have an instance of a person teaching in another department, this would work. This would seem to give the right result. But in fact, it's not. If you have that situation, it could give a wrong result. So there are many cases. In an IIT Bombay, we have quite a few cases. Like there are courses sometimes in computer science, which are taught by faculty from electrical and vice versa. And then there are people in computer science who teach courses in the center for technology alternatives for rural areas. So there are interdisciplinary programs. So it's actually not uncommon here. But in sample databases, often it's missing. So how do you deal with it? One way is to use the rename as somebody was suggesting. Or don't use natural join at that point. Maybe instructor natural join teaches is OK. And you go to course. Don't do a natural join. Use a explicit condition. On teacher's course ID equal to course course ID. So course department is not being equated to instructor department. There's another syntax which SQL supports. Instructor natural join teaches join course using course ID. It's an equate only course ID, which is common in both places. Department name is common, but don't equate it. That can be done. Or you can rename. You can rename in subqueries. Here the renaming is shown as the main query. You can create a subquery which renames. And then that name can be used in the outer query. So here the first example, select salvy divided by 12. That's an expression. What is the name of the expression? If you don't give a name, that database system will give some name. You don't know what it is. Not very useful from the query point. But you can give it an explicit name in this case as monthly salvy versus annual salvy. And this is particularly important when this is renaming of attributes. There's also renaming of relations. This case, you have a self-join. Instructors join with itself. What are we trying to do? Find names of all instructors who have a higher salvy than some instructor in computer science. This is actually not a good way of writing the query, but it's still useful to illustrate the point. What are we doing here? Select distinct t.name from instructor st, instructor s, s. Two copies of instructor, where t.salvy greater than s.salvy and s.department name equal to computer science. What is this doing? It will output a particular instructor's name, provided there's at least one instructor in computer science who earns less money than that instructor. Now, how many times will a particular instructor's name appear if you remove that distinct? Appear as many times as there is different instructors in computer science with a lower salvy. That's why we had the distinct. Now, what would be a better way of writing this query? Yeah, you can use an SZ query, but what is the idea? Select names from instructor were salary greater than select max salary from instructor back of department name. So that's a right. Minimum salary. All clause can be also used. So there are different ways of doing it. So the cleanest way in this case, we want to find instructors who earn more than some instructor. So why compare with every instructor in computer science? Let's find a minimum salary in computer science and use that. That's much more efficient. In fact, very few optimizers will be able to take this and come up with the other plan. What is the cost of this? Can be very high. If there are many instructors in computer science, we want to compare with each of them. That can be pretty slow if the data is large. Whereas the other one, to find the minimum salary in computer science, one scan of computer science, you've got the minimum salary. Now, one scan of instructor, find those who's salary is greater than that minimum, you're done. So it's a much more efficient plan. We'll see how to write it in just a few minutes. So this syntax was used to illustrate the fact that you could have a relation occur more than once in a firm clause. And the as keyword is optional. So you could just say instructor t. Again, you will find some in Oracle, for example. If you say as, it will not accept it. If you just say instructor t, it will accept it. Non-standard features. There are string operations. I'm going to skip it. Order by, I think you're all familiar with this, right? I will not go into that. You can, oh, by the way, order by is, if you want it in a particular order, if you omit it, sometimes the result may come in the order you want. Does that guarantee it will always come in that order? If you put the order by, it will come in that order. If you don't put it, you run the query, you happen to get it in the order you wanted. It just happened to be true at that point. So the question is, if you run this query again on a different instance of the database, will it still be ordered? And the answer is not necessarily. Why? What is the default order? If you don't say order by, what is the order with SQL guaranteed? Descending, ascending, no? Some order, some order. We don't know, not exactly random. It depends on what particular query execution plan the optimizer chose. So different plans might give different orders. And this is one of the good things in fact about SQL. If you don't care about the order, you're giving the optimizer the option of choosing whichever order is the cheapest to generate. So the query can run faster. Now, why would you not care about the order? Here you might when you're printing it to the user. It's usually good to have an order. But if you're using it internally for something else, why force an order if it is not used frequently? So there is no guarantee about the order unless you put an order by class. And the plan might change depending on the, if you put more topics in a database, a different plan might become the cheapest and then the order might change. Okay, I'm gonna skip these queries which are set operations, union, intersect, accept. Find a minute on null values. What is five plus null? What does null mean in some sense? I don't know what the value is. How do I check if a value is null? If I say equal to null, so the query here says select name from instructor where salary is null. Supposing I say salary equals null, what will happen? For anything which is not actually equal to null, equal will fail. What about if it is actually equal to null? Is null equal to null? Null equal to zero? No, no, no. Of course not. So is null equal to null? What should null be considered as equal to null? I don't know. Supposing I say I don't know your name, I don't know his name, you're both null. Are the names equal? Can't say they're equal. So null equal to null should fail. Turns out an SQL, not just an SQL in general, if you say that any comparison with null is false, there is a problem. Supposing I say five equal to null, is it true or false? Let's say it's false. So now I write not five equal to null, not open parenthesis, five equal to null, close parenthesis. You said five equal to null is false. Not of false is true. Can you say this with certainty? I don't know what the null is. How can I say it is not equal to five? I can't say that. I can't say it's five. I can't say it's not equal to five. We have a dilemma. Both are wrong. It's not true, it's not false either. What is it? I have a comparison, five equal to null. It's not true, it's not false. Both are wrong. What is it then? It's null. Yeah, it's a particular, so for Booleans, null is called unknown. There's a specific word there. So it's not true, it's not false, it is unknown. So in general, SQL, when you deal with nulls, it's actually using what is called three-value logic. The logic uses true, false, and unknown. So I have some slides here. So what is or of unknown and true? That is, sorry, unknown or true. Let's go over the samples here. Well, let's start with this one. Five less than null, is it true, false or unknown? It's unknown. Null not equal to null, unknown. Null equal to null, unknown. So in contrast in this query, if I said salve equal to null, it would be unknown. But I want to check if salve is unknown. Therefore, SQL offers a special syntax, it says is null. Is it unknown? Yes, if it's null, it is null, but it's not equal to null. So coming back here, unknown or true, it's all. So it has to be true. Unknown or false has to be unknown, regardless of whether, you know, okay, take this case. Unknown or unknown? Again, I don't know what it is, it's unknown. True and unknown? Again, I don't know what it is. If unknown is false, it's false. If unknown is true, it's true. So it's unknown. False and unknown, doesn't matter whether it's unknown, it's true or false. The other false will make the result false. So false and unknown is false. Similarly, unknown and unknown is unknown. Not of unknown is unknown, this was important. And you can also say P is unknown. Just like we say is null, you have to say is unknown. Now all this is fine, but eventually when you write a query which says select star from this where name, sorry, whatever, let's say salve equal to 500. And a particular salve value is null. So what can I say? Null equal to 500 says unknown. But that is the where clause. Is this row going to be output or not? We don't know, but we have to make a choice because SQL does not have the ability to carry forward a tuple which may or may not be in the result. There are some extensions which people have proposed which let us carry this further. We can tell the user this tuple may or may not be in the result, I don't know. But SQL doesn't allow it. It only allows a tuple to be there or not there. So at this point you're forced to make a choice. So finally we can carry this three-valued logic as far as we go, but finally when we apply a select in the relational algebra terms, the select operation, is this particular row going to be in the output or not? SQL will output it if it is true. If it is false or unknown, it won't be output. So at that point you make a choice, whether it is to be included or not. So the good thing with that is if you had say null equal to five, if you had made this false and then we added a not, it would become true. But now because we carry unknown here, null equal to five, unknown, not of unknown, unknown. Finally, will the tuple go out if it's part of the where clause? No, because it's unknown, it will fail. So that's how it works. Okay, moving on. We come to aggregates. All of you are familiar with the SQL aggregates, min, max, sum, count. And then most databases support many other aggregates, standard deviation, variance, even median, a variety of aggregates. So here are some simple queries. Find the average instruct, a salary of instructors in the computer science department. So fairly simple query. Select average salary from instructor where department is computer science. So what is this aggregate on? It's all tuples which are generated by that SQL query. So that's a simple case. This is another simple case. Find total number of instructors who teach a course in the spring 2010 semester. So select count, now note the distinct, you have to be careful there, from teachers where semester is spring and year is 2010. If you forget the distinct, what will happen? Particular instructor may teach two courses or two sections of a course in the same semester. And then the count will get bumped up. But the goal was to find how many instructors taught a course. Again, you have to be careful here because your data set may not have had such duplicates and the other query may slip through as being correct. So this is important. But there's also count star, which is the number of tuples. So there is a small detail here with nulls which I should point out. If a tuple has a null value, so should it be counted as being present or not present? Now, if I say, well count, let's say sum, let's say more intuitive example. If I said here, select sum of salary. And let's say a particular instructor in computer science had a null salary, what should you do? Logically, you should say, I don't know the total salary in computer science because one of the persons, I don't know the salary. But it turns out this is a little pessimistic in the sense most of the time these queries are used to get some overall idea. And because one person's salary is not known, if you say I don't know the salary, you're being very pessimistic. So SQL chooses a pragmatic approach of saying I will just ignore nulls when I compute aggregate values. So for sum, what will I do? I will ignore it. I'm not treating it as zero, I'm ignoring it. For count, what should I do? I include that person, the average will come down. So for count also I ignore it. Everything is ignored unless you say count star, that will count nulls also. So nulls are essentially removed before computing the aggregate. What if the site is empty? The only one instructor in computer science and that person's salary is null, I'm now computing something on the empty site. So by the way, here it's not a problem because ID is part of the primary key, it cannot be null. So here it's not an issue. But in general, if the site is empty, what will happen? What is the result of any aggregate on an empty site? I don't know. So the result is null. I could say some is zero, but I think on any empty, on any site with no tuple at all, SQL defines it as null. Okay. Now the next basic feature is the ability to create groups of tuples and compute aggregates within groups. I'm sure all of you have used this. In SQL the syntax is group by. So this is select, find the average value of instructors in each department. So here I say, select department name, average value from instructor group by department name. So what does it do? The group by breaks up the input that it gets from the previous step. Here it's just from instructor. In general you may have from multiple relations, all that is done to get a set. That set is taken here and then broken into groups by the group by. How does the group by do it? In this case, department name. So all things with the same department name form one group. You get multiple such groups. For each group we compute the, in this case the average cell V and output it. This is a simple query. Now supposing I say select name, department name, average cell V from instructor group by department name. So the only changes I add name in the select class. What will happen? So what is the logical problem? We'll see that first and then see what SQL does. What is the logical problem? Problem is that in a group there may be many people with different names. Which name do you put? SQL is a syntax error. If you try to output a value which may vary across different members of the group it has to be inside an aggregate. If I say average name, well average doesn't make sense on strings but you say min name then I can do that. That is correct. There is a unique value for a group. But just select name is a syntax error. So the SQL engine will reject it. Now you are suggesting select the minimum name or the first name. If you want the first name traditionally SQL didn't have it. More recently there are some extensions which by the way it is a little trick here. What do you mean by first name? I said things are unordered in SQL, right? What is the first name? If you mean the minimum name you can do that. But if you mean the first name in some other order let's say the first name by salary or something there are some extensions to SQL now. Not supported by all databases but you can do other tricks. You can say order it by salary and find the name of the first because no one is really salary. So there are some other things but we won't get into that. Any other questions? By default it is showing ascending order. No, no, no. So by the way this is simply for convenience of showing it on the slide. There is absolutely no ordering by default. So is there anything shall I associate order by with the group by? Not in the group by. You can add one more line to the SQL query order by department name. So here why did I sort it on department name? Because it would bring all the people in the same department together which is what I want. So it may happen that the SQL engine does the same thing. So that all instructors with the same department are adjacent. So then it can scan it from the top. The first one is biology. Is the next one biology? No. So that ends the biology group with only one row. Find the average output. Next one is compsci. Keep going as long as there are compsci. Keep accumulating the sum and the count. When it changes from compsci to electric engineering output the average. And keep going that way. Internally many databases would do this. They would sort. But they don't have to sort. There are other ways using what is called hashing which would also get the same result but now there is no guarantee of order. And even if they sort they may sort it in the opposite order. You don't know to be in the order. Any other questions? Any restrictions Adi? And group by section? For example if I am using the field which is not available in the selection class. If I am using that same field, some of the different field group by will it work or not? So supposing I drop department name from the select. Yeah. But keep it in group by that is fine. The only thing is you will get many average salaries. You don't know which one is from which department. But if that's what you wanted SQL will happily give it to you. So there is no restriction that something in group by must appear in the select class. But the other way is required. If it is there in the select class it should either be aggregated or present in the group by. So this is an example with this which I just told you about. ID is extra there. Not available. Now there is one more class called the having class which lets you put conditions on the result of an aggregate. So this is called the having class because the where class is evaluated first. So let's see the syntax and the query. So first of all what is the query? Find names and average salaries of all departments whose average salary is greater than 42,000. So I want the names of the departments. So I can do it in two steps. The first step is select department name average salary and now the condition is the average should be greater than 42,000. So I can say having average salary greater than 42,000. Note that I could have used average salary here and I may have not even used it here. That's okay. I could drop it from here. I could say sum of salary here and average salary here. It doesn't matter. It will still be correct. Note that having class predicates are applied after the formation of groups whereas the predicates in the where class, this one doesn't have a where class but if there were a where class it would be done first before forming groups. This is often a thing which students get confused about. They don't know where to put what. They put it in the wrong place and you will get a different result. There are a few slides on nulls and aggregates. I already mentioned this. So this is the point I mentioned. All aggregate operations except count star ignore tuples with null values on the aggregator attributes. And then this query what is collection as only null values that already answered. So this is a special case. Count returns 0. Count on an empty set is 0. Everything else returns null. Note that the input to these are multi sets. Why? If I say some of salve from instructor there will be 2 instructors with exactly the same salve. You don't want to remove the duplicates before doing the sum. So take this case. 1133 the average is what is the average? If you had treated null I removed the null from the sum but not removed from the count then you would get a different thing. 2 is the answer. But if you kept the null when you did the count you would get a wrong result. So you have to remove the null from the count also in order to get 2.