 Today we are going to be talking about SQL, a lot of SQL. Today is SQL day, primarily. In the course of the lecture, I'm going to go a little bit slow and give you the opportunity to ask questions. As you just saw, we also have database of you are ready. So we can even try out queries. We can write queries on the fly and see how they work. So I want it to be an interactive session. So please do ask questions. So yesterday, I guess you a little bit about SQL, and in the lab you tried out some queries. But we will go over the same things more formally today, and give you more time to understand what is going on. The key thing to remember is that when you program in SQL, it's different from programming in COBOL. You are all used to an imperative style of programming, where you say, do this step, do this next step, and so on. You have a loop, you have an if-then-else. SQL is very different. There is no if-then-else. Well, there is, but that's in procedural SQL, not in the basic SQL. So there is no if-then-else, there is no loop. So the entire way you think about how to write queries is completely different from what you're used to. So it takes a little bit of time to adjust to this, and to understand how to write queries, especially complex queries. The most simple query is obvious. You select from where you're done. But as things get a little more complex, it's a little bit different. And today we are going to try to help you start on that path. So what all are we going to cover today? We'll be talking about the basic query structure of SQL, some basic site operations. We'll also talk about aggregate functions, null values, and nested sub-queries. So you saw examples of SQL yesterday. This is the general form of a simple SQL query. There are, of course, a lot more constructs. But the basic SQL query looks like this. Select, and then a list of attributes or column names from a list of relation names where some condition p. What exactly does this mean? SQL queries like this can actually be understood in terms of the relational algebra. I'm not sure how much of relational algebra was covered yesterday. But did you look at the join operation yesterday? Yes, the select operation and the project operation. So what an SQL query like this formally means is you take the relations which are listed here, r1 through rm here, whatever the relation names are. Formally, this is not actually how it is executed. But to understand what the meaning of the query is, think of first taking a cross product. What is a cross product? Of two tables? Combine every pair of rows from those two tables. If you have three tables, well, every combination. So if you have m tables, every combination of a row from this table, a row from the second, from the third, and so on. So take all possible row combinations. Now from the examples you saw yesterday, you will realize that most of these row combinations are pretty irrelevant from the viewpoint of the query. You don't want all combinations. You want a customer to be matched to their accounts, not to all possible accounts. Whereas a cross product does what? Does every possible pairing. So the where clause over here, the condition p, gets translated into a select. So you have all the combinations of rows. And then from that, you filter out only those rows that satisfy whatever predicate you have given. We'll see more examples. And finally, you project out the columns which you have specified in the select clause. Now the relational algebra select denoted by sigma here is obviously different from the SQL select here. The SQL select here, the clause, is actually corresponds to the projection. And the relational algebra select corresponds to the where clause. And the from clause corresponds to taking the cross product. So it's very simple. And the result of an SQL query is a relation. That's what is nice about these relational languages. You've taken one or more relations, and then you output a relation. Now this, in turn, can be used to do other things. So if you want to build a complex query, you can break it up into parts, get a temporary relation for one part, then get another temporary relation for another part maybe, then join them, and then aggregate them, and so forth. So the select clause, as you've seen, lists the attributes. And here is another simple example. You've probably seen this already, but let me just describe it anyway. You have a relation loan with three attributes, loan number, branch name, and amount. And this query, select branch name from loan, results in this relation, which has a single column branch name, and then the values, downtown, Redwood, and Perry Ridge, which are the same as here. What if there are two loans from the downtown branch, let us say? Then what will happen here? This result, there will be two downtowns here. So duplicates can occur in SQL. If you don't want duplicates, you can get rid of them as we will see. Another thing to note when you do programming in SQL is that SQL names are case-insensitive. It doesn't matter what case you use, upper all over. I think I believe COBOL is also similar. So you're used to that. And as we just saw, SQL allows duplicates. And if you don't want duplicates, just add the keyword distinct after the select. So select distinct branch name from loan guarantees that each branch name will occur exactly once. And if you want to explicitly say that, do not remove duplicates, you can say select all. But if you don't say anything also, it is the same. That is the default behavior. Now, why would you want duplicates? We will see. There are cases where you want to get all the balances and then sum them up. Then you don't want to remove duplicates. But in other cases, you want to remove duplicates, so that control is in your hands. There are some syntactic features which are useful. If you want to select all columns, you can say select star from loan. So that is all the columns of loan. And the select clause can also contain any expression. In the simplest case, you have arithmetic expressions with numbers, integers, and floating point numbers, and so on. You can do the usual plus, minus, star, division, and so forth. However, there are also other things you can do in there which are also very useful. You can have string functions, as you will see, concatenating strings, and so forth. So here is a simple example. Select no number, branch name, and amount times 100 from loan. So if this were the input, this is the output. You have loan number, branch name, which is the same. And amount 3,000 becomes 30, 1,000, 4,000 becomes 40,000, and this becomes 17,000. What about the name of this column? So whenever you have a query in SQL, the output has column names. This is important, because the next level up, if you want to use that relation, you need to be able to refer to a column of the relation. 1,700 times, oh, sorry. My arithmetic. I need to go back to kindergarten. I should have put 10 there. So there, let's fix it. So what is the name of that? Oh, multiplied by 100. Let's do it really right. There we go. So what about the name of this column? For the two columns, which were directly there in the select list, the name is the same. What about this guy's name? What do you call it? It's not clear. Can you call it amount? No, you could have added two columns to get it. So there is no obvious name to give it. So the name in this case would be sort of system dependent. SQL does not define exactly what name should appear there. However, if you want to, if you're just putting it to the screen or if you have a program that picks up the third attribute, you can always refer to a column position. So this is the third column. Even if it does not have a name, if you access it from COBOL or Java and so forth, you can actually access the third column of the result. So you don't need to give a name necessarily. But if you wish to give it a name, there is a way, which we will see in a couple of slides from now. So now coming back to the where clause, this specifies whatever conditions the result must satisfy. And here is an example where we have two conditions. This is the same select loan number from loan, where branch name is peri-rich and amount greater than 1,200. So you can, just like in any programming language, you can have arbitrary combination. We have seen the comparison operations examples here. This is all standard, just like in COBOL or any other programming language. There are a few other details, which you can use. For example, SQL has between comparison operator. So you can say, select loan number from loan, where amount is between 90,000 and 100,000. And what does between mean? It's an inclusive between, less than or equal to 100,000 and greater than or equal to 90,000. It's a syntactic sugar, as it can obviously be written using the end of two conditions. So now let's come to the from clause. This is where the Cartesian product is occurring. And here is a query, which does not even have a where clause. This is a valid query. It may not be a useful query, but it is a syntactically valid query. So what does this query do? Select star from borrower loan. So let's take this very small loan and very small borrower relation. And the result would have nine rows, three times three, when I don't have space here to show all of them. So I've just shown the first three rows. Actually, the order of the rows is not defined. This is important. SQL does not guarantee anything about the order in which output comes. If you want ordering, there is a way, as we were using order by. So this is something different from your normal imperative language, where if you're reading a file, you know the order in which the data is in the file. It's going to come in that order. And then if you do an operation on that, you know exactly what order the result is going to be. But in SQL, there is absolutely no guarantee. So this can be a little troublesome at times, in that you write a program, you get everything in the order you want, and then you deploy it somewhere else on a different database. And suddenly, the order is completely different from what you have been used to. This can happen. So anywhere where you depend on the order, you have to explicitly do an order by, which we will see. So anyway, coming back to this, the cross product, in this case, would have nine tuples. Three of those tuples would correspond to the first row here, matching each of the three rows here. The next three in whatever order would be maybe the second row here and all three here, and then the third row here matched with these three. I'm going slow here, so that I don't lose anybody. This is critical to understand the rest of SQL. No, select star does not mean cross product. From whatever loan, without a wire clause, results in a cross product. If you add a wire clause, you will have a cross product logically. And then from the cross product, you will filter out certain rows, and only output those rows. The select star says all columns. So all attributes. So in this case, all of them appear in the result. This from clause, without a wire, gives a cross product. The star does not mean cross product. The star is not a multiplication. Maybe that's what you're thinking. Absolutely not. The select star simply means output all the columns. Don't suppress any column. Whereas the cross product is for different rows. And that is purely from the from clause. The from clause always logically creates a cross product. And then the wire clause filters out. Now, can you actually do it this way? What if the tables had a million rows each? You can't take a million times a million to take forever. So that's not actually how it is implemented. But don't worry about efficiency. This is another very important thing with SQL. You don't worry about how efficient the operation is at the time when you write the query normally. Now, there are some things you may need to do to make operations efficient. But you can do that separately. After the program is written, you can actually make some changes to the database, like adding indices and so on, to make the thing go faster. You don't have to rewrite the program. So you don't think imperatively. As a COBOL programmer, you might think I need an index sequential file. I need to index on this and fetch this record, and then find some value of this record, and then go fetch another record using an index from another file. In SQL, when you write the query, don't even worry about it. Later, for performance, you may need that index. If you do not have an index sequential file, if you are searching sequentially through a file, things can be very inefficient. So you may have to go back and make sure those appropriate indices exist. But that is a separate concern. So this separation of concerns means that you can ensure correctness of your query much more easily. It's very good what this query does. If you wrote the same query with index lookups, and fetches, and a for loop, and so on, a two line query will stretch to probably 100 lines. I'm sure many of you would have written such queries, which will access two tables. How many lines would that be? Something like this. Make a guess. We'll take the average. Come on. 20 lines, 30 lines? 30, 30. You have to open both the files. You have to read both the files. Then join it, and then again, go around 50 lines. 50, 60. And then there are going to be bugs. Yes. When you check for the file is available. Right. So there are a lot of things you do. It will soon stretch 200 lines even. And it'll take maybe a day or two to write things, unless you're an expert programmer. Maybe you can do it in an hour. But here you can type this in 10 seconds, and you're done. So now, let's take a different query where you want to find the loan number, the name, and the amount of all customers. The name is the customer's name of all customers having a loan at the Redwood branch. Now, the cross product of these two tables, loan number and customer, has all combinations. But look at the query that we want. We want to find customers who have a loan at the Redwood branch. So the customer name is here. So that is available. The loan number is here. And then the mapping from loan number to customer name is in this table. So this query cannot be answered using this table alone or by using this table alone. There's not enough information in either of those tables. But together, they have the required information. And they are linked. How are the tables linked? Loan number over here. So what we want to do is find matching rows of this. You want to match this row, L170 down to 3000, to this row of customer, and this row to this row. So the matching is done on the loan number attribute. So how do we ensure that we match the rows appropriately? First you do the cross product. And then you filter out rows which are equal on these two. So there's an additional condition here that it's at the Redwood branch. So there are two steps. First you select whatever attribute you want from borrower loan, where borrower dot loan number equal to loan dot loan number. What is that part doing? The from class does a cross product. And then the first condition ensures that you consider only rows that actually match. The other rows in the cross product are thrown out. So we have only matching rows, where a customer is matched with the loans that they have. And from that, we are further filtering out those which belong to the Redwood branch. In this case, this would be the sole answer. Of course, you can have multiple answers here because the output in general is a relation with multiple rows. It could also be an empty relation. Any questions? What's the execution part? Whether it is on the right or left side? No, there is no guarantee about how it is executed. In fact, there are a variety of smart algorithms for implementing joints like this. The algorithms make sure that you don't first take the cross product, but rather they match the things properly. Now you have written code to do the matching. That is one kind of joint, the nested loops with indexing to find matching tuples. That is one kind of joint algorithm. But that may not always be the best way of doing this joint. So in fact, any database system has multiple ways of doing joints. And it actually will figure out the cost of different ways and pick the best way. So you don't worry. We'll come back to this on day 4, that is. Day after tomorrow, we will come back to this issue. Where it will not make a difference in the where clause if I put this first? No. Like, France name is equal to Redwood. And then I do this. It does not make a difference. Correct. That's a good question. It does not matter in what order you put it. So it may look like doing this first might save some effort, but the database takes care of it. I will borrow the number is going to be selected first. Maybe it is matching the number is going to be saved. Right. So why is that in the select clause? Good question. If the SQL implementation is very clever, it can realize that these two are the same, so it does not matter. But to keep things simple, the SQL implementation says if the column name appears in two relations, you should say which one you want to pick it from. Because in this case, you have made them equal. But you could have written a different query. You could have said borrower.loan number equal to loan.loan number plus 5, meaning less query. But you could write it that way. And then what happens? You have to explicitly specify whether you want borrow a loan number or loan number. So you may find some implementations which will actually allow you to just say loan number. However, to be safe, do it this way. In this case, you could say loan.loan number. But you cannot just say loan number. That would the compiler would complain, saying loan number is defined in two tables. Which one do you mean? Yeah. It is an ambiguity. Correct. That is an ambiguity. And the point that she was making was actually the value is going to be the same here. So why can't it be clever? Sorry. But when we don't give the word, it requires the. Yes. Exactly. It requires the. It is required. It is required. What I'm saying is there may be some implementation which is very clever. But don't depend on it. That is in the SQL standard, you should make it unambiguous like this. Yeah. If the borrower, the row corresponding to, let's say, L230 is absent. OK, L170 is absent. Fine. So first of all, in the cross product, what happens? L170 is paired with 230 and 155. But then when you apply the low number equals low number, there will be no output. So what has happened is if there is no matching row here for this row, it vanishes from the join. So this is an important thing to keep in mind. If you use a join, if there is no matching row, boom. It just vanishes. And this is the cause of a lot of errors in SQL programs where you think there must be a matching row. But maybe there isn't. If you know the schema and you're guaranteed there is a matching row, then that's fine. If there may not be a matching row and yet you want the loan to appear, well, there is a way of doing it. It's called outer join. We will be covering that tomorrow. So now let's look at a few string operations. These are very useful. They occur all the time. There are some basic string matching primitives of which the two most widely used are percent and underscore. So these are what are called wild card characters. So here is an example. Select customer name from customer where customers treat like percent main percent. So what is this percent? It's a special character which says it can match anything. It can match an empty string. It can match any string. So when I say like, the like operator is special. It interprets the right argument of this. So the like operator takes its right argument. And if there's a percent, it says it can match any string. In this case, we have percent main percent. Or actually, I guess it's percent blank main percent. It's hard to tell from here. With fixed width font, it'll be clear. So now what all strings with this match? Any string in which main appears inside the string, main with capital M, small a, i, n. If that appears anywhere in the string, then this will match. So all streets which are of the form first main, second main, third main, and so on, would match this. Even if you had first main east or something or some such thing, following main, that is also OK. That'll match. So 2% ensure that main occurs anywhere inside the string. It'll match. It can be right at the beginning. The street name can be just main. Nothing more. That also matches. Similarly, underscore is a special thing which matches any single character. So that's also useful sometimes. Now what if you want to look for strings using the like operator, which actually contain the character percent? So then you have to have an escape character. And in SQL, you make it explicit. So you say like main backslash percent, escape backslash. You can use any character here, actually. And in the context of this thing, the escape character says the immediately following character should be taken literally, not as a wild card. The slash means that the following character, in this case, percent, should be taken literally. So what will this match? Like main backslash percent will match exactly the string main percent. The percent should appear in the string. It matches strings which have the characters main followed by the percent character. Otherwise, if you say percent, it'll match anything. It is not case-sensitive. It is case. So in the SQL standard, string matching is case-sensitive. So if you give this, even with the like operator, it'll only match things which have capital M followed by small a i n, followed by percent in this case. If you have small m a i n, it won't match. If you want it to match, you have to do something more, which is here. So the standard way, which you must have, many of you would have encountered, is to first convert the input string into, let's say, upper case. It could be lower also. So where upper customer street, like percent, capital, all caps here, main percent. So whatever combination of lower and upper case you use, this will match. Any questions? Now be aware that there are some implementations of SQL which don't follow the standard. In particular SQL server and MySQL, they will allow matching, ignoring the case by default. Yeah. No, any string here. By default, string matching ignores case in both of these databases. In SQL server, you can tell it to turn off the default and be case sensitive. In MySQL, I think you can't even do that. It's always cases. And there are also other operators for string matching, which allow more complex patterns. More recent version of SQL has a tilde operator. But I won't cover that here. It's not that widely used, but it can be useful. Another useful string operator is the concatenation operator. So if you have different parts of a name, which you want to put together. So if you have stored a name, first name, and so on separately, but in the output, you want a single column name. What do you do? You can say first name, and then the concatenation operator, which is the two vertical bars. And then last name. So that will give you a single string, which is the concatenation of these two. So there are many other functions. For example, you can find the length of a string. You can extract a substring from a string. Some of these are defined in SQL. But before SQL got around to defining them, many databases provided their own functions. So you have a lot of non-standardness. So exactly what the function is called depends on whether you're using PostgreSQL, or Oracle, or whatever other database you use. So you should look up that database and use the appropriate functions. So here is another query, which says, find a list in alphabetical order the names of all customers having a loan at the IAT branch. This is the query we want to write. So let's do it in steps. First of all, say where all do we get the information from? Customers having a loan at the IAT branch. This is the same query we saw before. So I mean, it's a variant of a query we saw before. So we know that the information is there in the borrower and the loan information. These are the two tables containing that information. So when we write a query, this is the first step, where all is the information, which tables contain the information. This is the first thing we want to look at. It may all be in one table. It may be distributed across tables. So in this case, we know two tables contain the information. So we put those two into the from clause. And then the matching condition as before is borrower loan number equal to loan.loan number. So that is the matching condition for these two tables. Now at the IAT-POWY branch becomes end branch time equal to IAT-POWY. And we want to list the names. So select customer name. Can you have duplicates here? We have put a distinct. But without the distinct, could there be duplicates? Yes. Multiple. Yeah. The same customer can have multiple loans. In fact, we are making an assumption in this schema that customer names are unique, which is of course silly. Obviously, two customers can have the same name. So why do we use that schema? Just to keep the examples short, fit width of a page or a slide. But in the exercises which you will be doing in the afternoon, we actually use a schema which has a customer ID in there. So we are not assuming that customer names are unique anymore in the exercise schemas which you will be doing in the afternoon. So but in the book and in the slides, you will find that customer name is assumed to be unique. But still, because a customer can have two or more loans, customer name can appear multiple times here. So we should put a distinct to ensure just one occurrence. And then order by customer name ensures that it is sorted. What is the default sort order? Ascending, alphabetically ascending. SQL has actually a variety of the correlation sort orders. SQL has been much better than most other languages in supporting most of the computer languages, I should say, in supporting human languages, because it is used to store different languages. So the sort order for English is different from the sort order for Hindi. So the SQL language actually allows you to specify different sort orders. Although, of late, the Unicode standard is sort of superseding these things. But earlier, SQL could store it in ASCII, you could store it in ISCII, you could store it in anything else. And then define sort orders and get it sorted according to the specific language which you have stored. And if you want it descending, you just say descending. So order by customer name descending. More than one thing, just list them one after another. So the primary sort will be the first one. The next one will be secondary. So if you want to sort on customer name descending, and then in this case we have just selected customer name. But say we also selected the loan amount. So we want to find the customer names and the loan amount. So we sort on customer name primarily. And then secondarily on the loan amount. You just list one after the other. So you could say order by customer name descending, amount ascending. You can choose whatever combination you want of descending ascending. This may be a good point to try out some of these queries and see that they actually work. So I had this query, select star from account. And I executed it. And here are the accounts that I have. Account numbers are a1 through a10. And then there's some branch ID. In the slides in which we have been using, we assumed a branch name is unique. But again, more realistically, internally you would have a branch ID, which gives you the ability to change the branch name without changing the entire data. So if Bombay became Mumbai, you could change the name from something Bombay to something Mumbai without affecting all the rest of your data. So in this case, we have used a branch ID. So this account is in this branch. And this is the balance in that account. Now let us look for branch. I hope you're able to read this here. The font is a bit small. I can, it is font size. I'm not sure how to change the font size here. But anyway, if you're able to read it, I won't waste time on that. So we have these six branches with a branch name, a branch city, and assets. These columns are the same as in the slides. So we have a customer. Now let's look at borrower, which we were seeing. So earlier we had a name directly in borrower. Now we only have a customer ID, which maps to the customer relation, which we saw. And these are the loan numbers. Let's look at loan. So you have these eight loans at corresponding branch IDs with corresponding amounts. So these are the tables, which we have. So let's write some queries joining multiple of these tables. So select star from loan comma borrower. Oops. What has happened now? Cross product. As you can see, there are lots of rows here. We had, I think, eight loans and corresponding things in borrower. I think there were eight things in borrower, or maybe nine. Some of the loans may have been shared. And so you've got a lot of rows here in the cross product. Now obviously, most of those are meaningless. So what should the where clause here be? Loan.loan number equal to borrower.loan number. So in this case, the second row matches, because both are L1, the first row here does not match the connection. So now we have got a much smaller number of rows, which actually match. And you can verify that the loan number here is equal to the loan number here. Note that the column name shown here say loan number twice. So the fact that it is borrower.loan number or loan.loan number is actually hidden. Internally it is there, but it's just that the display is shown like this, just shown as loan number. And then if you wished you could say loan dot. Well, let's just try loan number and see what happens. So what is the error? Loan number is ambiguous. So now if you just say connection closed, we lost the connection. I think we lost the wireless connection. Now claims to be back, but it's not working. Oh well, so since we are using VNC, hopefully the session will still be there. If you do loan.loan number, because it came finally. So there's some lag here, but it's network delay. But anyway it's there now. So we've got all the loan numbers correctly. Now if I wanted order, we just saw the order by right. So let's turn off the screen. So let's add an order by clause. In this case, there's just one field. So we'll have to order by loan number. Now it's already, in this case, it happens to be sorted in ascending order. But like I said, there is no guarantee. It just so happened that this is the case. Now let's execute that. And you want to try any other queries on this? Any of the other things we saw? What are all the tables? That depends on the tools which we use. In this case, with NetBeans, you went into the services, you went to a database, and there you have tables. And you can see here that you have account table, when I expand on that, it's showing the columns of that table. But you can see the tables here. Account borrower, brands, and so on. If you didn't see it, that means it did not connect successfully. You couldn't see it. I don't know what specifically went wrong. But normally you should be able to see it here. Otherwise, how do you find out what are the tables in the database? There are actually different ways of doing it in different database systems. In PostgreSQL, you can type slash d and get a list of tables, or backslash d. But that will not work directly from here. If you use the psql command. So from there, you can do that. Here, you don't need to do it that way. Yeah, it depends on the database. The exact command to list the tables varies by database. One of the nice things about NetBeans is it hides some of these details. You can use the same whichever database you use. It will take care of it. And then if you type public. Yeah, I think there is some connections. NetBeans, if you're not careful, it connects with. So what is the schema business? The database can have multiple schemas. And over here, if you saw earlier, it said lyc.lyc. When I say view data, what did it run? It ran this query, select star from lyc1.account. So lyc1 is a schema name. What we have done now is create different schemas called lyc1, 2, 3, each of which has its own copy of the data. So that two of you don't use the same data. Then you can modify it without affecting others. So there is also some other schema called information schema, which actually has data about what tables are there in the database and so forth. So you shouldn't be using that schema, really. But when you set up the connection, which schema should it connect to? If you don't specify it, I think it defaults in NetBeans to information schema, and then you get into trouble. So we'll explore this in the lab. Let's not worry about it here. Questions? Let's go back to the talk. So we were, at this point, ordering the display. So we already saw that there can be duplicates. And you just think. Now, if you have a relation which has duplicates, and then you do an operation on it, you can get duplicates. If you had a relation which did not have duplicates to start with, if you project on some column, that column may have duplicate values. So if you looked at the depositor or the borrower table, the person may have multiple loans. So if you look at the customer name from there, there can be duplicates over there. So if you project, you can get duplicates. Now, if you have duplicates like this, and then you perform an operation such as select how many duplicates do you get in the output? If you perform a cross product, how many duplicates do you get in the output? All of this is formally defined in SQL. Most of the time, you don't have to worry about this. You don't mess around with duplicates. But if you're interested in the details, they are there in the textbook. If you, at some point, need to worry about the number of duplicates, it's well defined in SQL. I'm going to skip it over here. And of course, most of the time, in the output, you don't want duplicates in output. You use a distinct. And the only place where duplicates really matter is when you do aggregation. We will discuss that when we come to aggregation. So now, as I told you earlier, SQL takes sets of tuples. What is a relation? It's a set of tuples. In fact, it's a set which can have duplicates. Technically, this is called a bag or a mighty set in mathematics. But we will use the word set loosely to allow duplicates also. So you take a set of tuples and output a set of tuples. Now, it makes sense that when you have sets, you should support set operations. The operations we saw so far were not the usual set operations, which you would have seen long ago, but different ones. Now, what are the most common set operations which we use? Union, intersection, set difference. All of these are there in SQL. And there are queries where these are useful. What is interesting is that the default union intersect and except in SQL actually remove duplicates in that output. They can have duplicates in the input. The output removes duplicates. But if you don't want that, you can also use the all version, so that preserves duplicates. And then how many copies are there in the output? That's all precisely defined. We won't get into that. So here are some examples of queries. This time, I have not shown the query. I've just shown the SQL query. I've shown the English query. So now, let us try to write these in SQL. So find all customers who have a loan, an account, or both. How do you do this? What are all the names of customers who have a loan, or an account, or both? So first of all, how do you find out who all have loans? Remember the relations which are relevant here. We have the loan relation that does not have customer names. We have the borrower relation, which has a loan number and a customer name. So from the borrower relation, we can find the names of all those who have loans. Similarly, there is an account and a depositor table. The account table is exactly like the loan table. It has an account number, a branch, and a balance in that account. And then there is a depositor table, which is just like the borrower table. So I have shown here the borrower and the depositor table. So the borrower table says that this guy has this loan. The depositor table says that this person has this account. So these are the two tables which have the information which we need to output. So the query here is find those who have a loan or an account, or both. And implicitly, we may not want the name to appear more than once. You just want the names to appear once. So how will you write this query? Select distinct customer name from borrower. From borrower comma depositor? No, if you write a query, so here is a suggestion. Select distinct customer name from borrower comma depositor. What will that do? It will take every pair from borrower and depositor. But there are two customer names, one from here and one from there. So that doesn't actually do what you want. So if you say select distinct borrower.name from borrower comma depositor, it will only tell you who are all the people who have a loan. Similarly, if you say select distinct depositor.customer name, it will only tell you those who have accounts. It does not tell you those who have either. So you have to use union. So you will do the following. Select customer name from depositor. So this will list all these names, John Smith and Hayes. Union, select customer name from borrower. Sorry, the other way, depositor is here. This will give Hayes, Johnson, Johnson, Jones. And this will give Jones, Smith, Hayes. And then the union gives what? Each name, but exactly once because union removes duplicates. Now, why does union remove duplicates? It's some quirk of SQL. Don't worry about it. The default behavior is to remove duplicates. But if you don't want to, use union. So that query is straightforward. So we have the same tables, except the schema is slightly different. Instead of customer name, we have customer ID. So these are the customer IDs from borrower. Are there duplicates? Yes. Customer ID 7 is duplicated here. Similarly, say depositor, are there duplicates? Yes, one is duplicated. So is 6 and so is 7. So now, if I say select star from depositor, so what are we? No, it didn't execute the whole thing. This didn't work. In fact, I can't quite do this because the column names don't match. So what does union mean in this case? Yeah, I should do customer name. Now, why is it hanging? It should complain. Yeah, I should. But I'm just wondering why this is not complaining. What is happening? It should not do this at all. It seems to be just doing the first part of the query and then stopping. Network is really messing things up. Sorry, customer ID. You're right. In this case, it's customer ID. It looks like it does. Name does not exist. So at least they gave an error this time, customer ID. That should do it, hopefully. So now, in this case, actually, all customers, I think, everyone had a loan and account. So it doesn't really show a difference. But anyway, that would have made a difference. And then if you say intersect, you will get the same result. So the next topic is aggregates. So what are the set of aggregates? What is an aggregate? You have a set of values. And then you combine them in some way. It's an aggregate. It's a typical aggregates are adding things up, some. Counting how many things there are, that's a count. But you can also do average, min, max. And these, in fact, are the five basic aggregates which a SQL supports. There are also other aggregates like the standard deviation, median, mode, and so forth, which are also supported. We will use these for our examples today. And what do these aggregates operate on? They have to operate on a set of values. In fact, is it a set? No, it actually can allow duplicates. So if I want to sum up the balances of accounts with a particular branch, what will I do? I will have to first get all the balances at that branch and then sum it up. So the balances of the branch can contain duplicates. You should not remove them, and then you sum it up. So here are a few sample queries. And we are going to run them on this relation with branch name, account number, and balance over here. So find the average balance of the peri-rich branch. You do it as follows. Select average balance from account where branch name is peri-rich. So what have we done here? There are two steps. First, find out all accounts at the peri-rich branch. So in the from clause, you have account. And branch name is peri-rich in the where clause. So those are the first two things. Think of them as going in that order. First the from clause, the cross product if required. Here there's just one relation. Then the where clause. And then the select clause is supplied as the third point. Here in the select clause, instead of just projecting out specific columns, what we have done is used an aggregate function. The aggregate function is going to take all the values that appear, and then aggregate them into a single value. So here it's going to give us the average of the balance. So what will the average balance here be? Well, sum up these five numbers and divide by five. Get whatever you get. Now here is the peri-rich of the first two, 400 and 900. So 1,300 by 2, 650. I need a coffee badly. So take the second query. Find the number of accounts in the bank. How do you do that? Use the count. Select count. In this case, we use a special notation star. Count star means count how many rows there are. So it's like saying, listing all the attribute names there. But it's just a short form. I'm not counting how many branch names there are in account. How many account numbers there are in account. How many balances there are in account. That's not what I'm counting. The number of rows use count star. So far, these results are going to give you a single value. It's called a scalar value. It's not going to give you a set. Because we are taking an entire set of things and then applying an aggregate on it and getting a single value. So the question is, in this case, I found the average balance at peri-rich. But what if I want to do this at every branch? Not just for peri-rich, but for every branch, the average at that branch. So what I really want to do is something like this. I want to take this table. Don't look at the query yet. Just look at the table, the input that we have. I want to group it saying, let's take all the accounts at the peri-rich branch, all the accounts at the bright end branch, all the accounts at the right-good branch. And for each of those, individually, I want to apply whatever thing. Here, I'm doing some instead of average. So here, the sum of these two is 1,300. The sum of these two is 1,500. And the sum of this is itself 700. So this is what I want. And the way I write it is for this. Select branch name, sum balance, from account, group by branch name. The important thing here is the group by clause. So the steps that happen here are first the from clause is applied, then the where clause. Here, there is no where clause. So no rows are filtered out. All the rows are there. Then you do the group by. So what is the group by here? Branch name. So what effectively happens is you take this table and break it into groups. How many groups do you have here? Three groups. So take the input table, break it into groups, and then finally come to the select clause. So what does the select clause do? Select branch name. Now, the branch name is unique in each group. Because we have grouped by branch name, the branch name is unique. So that is well-defined. The second thing in the select clause is sum of balance. Now in each group, you can have multiple different balances. So we are going to sum up all of them to get the sum of balance. So that query will give us this result. Here, I've given the name of the output column as sum balance. But as I said, this is database specific. It's not a standard name for the column. So you can't have multiple separate group by. But you can say group by branch name, comma, something else if required. So then the groups will be finer groups. So for one, if you say, as you were saying, account type, you have three types of accounts. Let's say savings, current, and some or two types, savings and current account. Then if you say group by branch name, comma, account type, you will get perirage savings, perirage current, bright and savings, bright and current, and so forth. All the combinations which exist in the data. If it so happens, a Redwood branch does not have a current account, that combination will not appear. Only the combinations which appear will be there. So group by can have multiple columns. So the ultimate output will be the combination of both. Like your branch name and then type of account. Whatever you put in the group by and in the select loss. So. Since the order of the field, you must be getting the supposition of branch-wise account-wise. We can also do account-wise, branch-wise. You can do that. In fact, the only difference between those two is the sort order. Otherwise, it doesn't matter. So there is no difference if you say group by branch name account type. It is entirely equivalent to saying group by account type, branch name. Because you have to order. You can add an order by at the end. If you want it sorted, the branch and within each branch, the account type, then you use an order by. Just saying group by branch name, comma, account type does not guarantee it will be sorted in that order. Short is on other relation which is generated from the select. Exactly. So the steps are from clause. You take the cross product. Where clause is applied to filter out rows. Then you do group by. Then select. And then order. Yes. And then underneath, you have different types of accounts. Yes. And suppose I want savings underneath that branch. If the branch is 50. Yeah. And then you want a subtotal for savings. Yeah, something like that. OK. So let me explain this a little more carefully. If you do group by branch name, comma, account type, what are the combinations that you get? Every combination of branch name, account type that exists, you get. Now if you say group by account type, comma, branch name, again you get the same set of groups. There is no difference in the groups that you get. It's exactly the same. But perhaps the query you're looking for is a sort of subtotal query, which is group by first on, let's say, a branch and then on account type. So I'll say for peri-rich savings account, the sum is 5,000. For peri-rich current account, the sum is 6,000. For peri-rich, all account types, the sum is 11,000. And then move on to the next branch. This is the sort of statistics which you want to output. That cannot be done directly using the SQL aggregation. SQL actually has some more advanced constructs for doing that. But typically people don't use that construct directly. There is a group by roll up. There are some extensions which we can talk about later if you want. It's there in the textbook if you want to read it. It's in one of the later chapters which talks about decision support systems. So this sort of query is used for data analysis and decision support. So you can write such queries in an SQL version which supports it. I'm not sure if PostgreSQL even supports that. But there are a lot of systems for data analysis which will let you write such queries. Most of the Oracle SQL server, IBM, all of them support it. I'm not sure if PostgreSQL does. But the commercial ones do support such tabular reporting with subtotals. And then, of course, that is still going to output it in some funny tabular format which looks a bit odd. If you want it to be formatted properly as a report, you don't write it at the SQL level. You use a report generating tool which will give you a formatted report. And it will generate the required SQL query. So you don't have to worry about the SQL query. You can use the reporting tool to do it. That is typically what people do in practice. They use reporting tools for this. So you can use either there are some commercial reporting tools, crystal reports, and others, which you can use against a variety of databases. There are also some public domain reporting tools. Or you can use reporting tools which come with the database and then use that to generate such reports. It's an important business requirement. So we are not covering it here. But the use of a relational database makes all of that very easy, generating reports. You can just open a GUI interface, enter a few pieces of information, and your report is ready. So we saw the group by clause. So here are some more examples with the group by clause. Find the number of depositors for each branch. The previous query, previous thing, we said number of depositors for each branch. Sorry, not that. There were two queries. One was find the number of depositors across the entire bank, regardless of the branch. And the other one was sum of balance, so forth. So now what we are doing is find the number of depositors in each branch. So suppose if something is appearing, some attribute thing is appearing between select and clause. Yeah, that is in the select clause, you mean? Yes. It appears in an aggregate clause. No, so if you're not using an aggregate, it had better appear in the group by clause. The other way is not required. So I can do select count distinct customer name from depositor, blah, blah, group by branch name. So what that means is there can be something in the group by which does not appear in the select clause. That is OK. But if it appears in the select clause outside of an aggregate, then it had better appear in the group by. Otherwise, there's problem that you have a group with different values for that, and therefore you cannot output a single row for that group. So when you have a group by, you'll output only one row for that group. That's guaranteed in this group. At most one row for each value there. So in this case, we did a group by branch name, and we did count distinct customer name. So the number of distinct customer names in each branch is counted. Is this equivalent, if you add up these across all the branches, will that be equal to this one? Select count distinct customer name from depositor. Is this equivalent to first doing this and then summing up the things across each branch? Yeah, summing up the counts. I hope you understood the question. So here, I'm finding the number of distinct customers in each branch. If I add this up, is that the same as the number of distinct customers overall? Exactly. In other case, that would have been eliminated. Correct. So if a customer is there in two branches, then we have counted that customer only once in the previous query. Here, the customer is counted twice. But within each branch, a customer is counted exactly once, even if they have two accounts. So this point says what we just discussed. Attributes in the select clause outside of aggregate functions must appear in the group by clause. This is what we just saw. So the last topic we will cover before the break is the having clause. So the having clause lets us do the following. It lets us do the previous steps, which is the from clause, the where clause, the group by clause, the select clause, all that is done. We have something. Now we want to filter again. So look at this query. Find names of all branches where the average account balance is more than 12,000. So there is a condition, more than 1,200, on the result of an aggregate. This cannot go in the where clause, because in the where clause, you don't have aggregate value. The where clause is applied first immediately after the from clause cross product, Cartesian product, the where clause is applied. At that point, you do not have groups. You do not have aggregates. So this must be done later. And the way SQL lets you do it is by adding a having clause. So what we have done is select branch name, average balance from account, group by branch name, having average balance greater than 1,200. So in fact, the having clause is done just before the select clause. So what it does is the from clause is applied, the where clause is applied, the group by clause is applied. So you've got the groups. Now you apply the having clause. So for each group, you compute the average balance. Check if that is greater than 1,200. If it is, that group is there in the output. Otherwise, the group is eliminated from the output. Now if the group is there in the output, what are you outputting? You're outputting the branch name and the average of the balance. In this case, you're outputting the same average balance, but you could output many more things. You could also output count of account numbers. You could output any other aggregates you want. But only those groups which satisfy the having clause will appear in the result. Is this clear? The aggregates can be different in the select and in the having, or they can be the same. It doesn't matter. But only those that satisfy the having clause will appear in the output. Yeah, you can have where. So if I wanted to, yeah. So the where clause is applied just after the from clause cross product. At that point, you do not have groups. Group by has not been applied. You do not have aggregates. Aggregation has not been done. So in the where clause, you cannot use average balance and so forth. However, the having clause is done after the group by is done. So from where group by, you've got the groups. Now you do the having clause. At this point, you have the groups. Therefore, it is meaningful to do the average balance on each group. If you don't have a group by, that is OK. Then the entire table is one group. And then you are doing the having clause on the entire, then the single group, which is the whole table. And if that satisfies the having clause, then you do the output. Otherwise, that particular group will not be there in the output. So if I drop the group by branch name, so just take the same query, remove branch name from group by. There is no group by. Remove branch name obviously from the select also. It cannot be in the select anymore. Then what happens? What does that query mean? I will check if the average balance across all accounts, the whole bank, across all branches, is greater than 1,200. If so, I will output it. Otherwise, I will not output it. That is the meaning of that query. So the result can be empty or the average. In the having clause are applied after the formation of groups, whereas predicates in the where clause are applied before forming groups. I think I'll stop here.