 Let us move to the next chapter which is introduction to the relational model. So, our goal in this chapter is to familiarize people with the concepts of the relational model and in particular to look at a number of operations which are called the relational operations which are used on relations. And these operations form the basis for the SQL language and if you want to understand how SQL is implemented internally, even if you want to understand how to write queries in SQL, you have to understand the basic operations on the relational model. So, on this slide you see the same example of a relation which we saw before, but now you will notice a few things. First I mentioned that columns are called attributes, this is shown explicitly in this slide. So, the name attribute comes from the formal modeling of things as relations from the theoreticians, columns came from the practitioners, so both are used interchangeably. Then you have a notion of rows or tuples, a tuple has a formal definition as a set of n values each from some domain. Again there is a little more formalism on this in the book, but in the interest of course time we have cut it down here and we will call rows as tuples or rows interchangeably. Now each attribute has a type which is called the set of values which is defined by the type is called the domain of the attribute. Now each attribute value is normally required to be atomic that is indivisible. What do we mean by this? Is a name divisible? If you break a name into first name and last name, yes it is divisible, but if we treat a name as just the name with no further break up it is indivisible. Now dividing name into first name and last name is not really a big deal because instead of storing one attribute called name with two parts we could store two attributes first name and last name, so that is not really a big issue. The bigger issue is when you start interpreting the parts of an attribute in different ways and we will come back to this when we look at normalization theories. So the special value null is part of every domain in any database system. Now what does null represent? It represents that either that particular tuple does not have a value associated with that attribute or maybe it has a value but we do not know what that value is. The value null represents both these possible situations. We do not know exactly which one of these situations a particular null represents but it just means we do not know values in there. So coming back the null value which means unknown actually complicates many operations and we are going to look at its impact on each of several operations coming up in. Now the columns or attribute we will usually refer to formally as a1 through an and we will have we will call r with a capital letter being a list of attribute names as a relational schema. Now this is in some sense a proto schema because it does not include attribute types and so forth. We will give a list of attributes and call it the relational schema and for the purpose of design we are not going to worry about the types initially and then add the types as a second part of the design. We will also call relation or table we will use these two words interchangeably. Again the word relation comes from the formal people and the theoreticians and the word table comes from practitioners but they mean the same thing over here and tuple as is to row is also something we already saw. Now something which people should note is that relations are inherently unordered. Now when you look at the contents of a relation the system has to show you the rows in some particular order but what I want to emphasize here is that the relational model does not assign any inherent order to the rows of a relation. So earlier we showed the instructor relations ordered on id that was just the convenience for presentation. It might also be how it is stored but at the logical model level the sorting the sort order is irrelevant that is a part of the physical model which we do not worry about when we write queries. So what does this mean? When you write a query you say you know show me all the tuple instructor tuples it might be that the tuples are shown in increasing order of id but there is no guarantee. If you do want it to be shown in increasing order then in the query language you have to say show me the contents of the instructor relation sorted on id in increasing order. So all that is part of the query language. It is not a property of the logical relational model. Sort order is not part of the logical model it is a physical property. Now what is the database? It is basically a collection of multiple relations. Now in the university case we are going to break it up into a number of relations. We have instructor, we have student, we have course, we have which instructor teaches which course actually it is a little more complicated. A course like CS 101 has an existence independent of a particular offering of the course. The same course is offered every year or maybe even twice a year. So a particular instructor teaches a particular offering of the course in a particular year or maybe a particular section of that course. So all these kinds of details we are going to look at coming up. But before we get into specifics I want to show another example of bad design where we combine instructor id name, department name, salary, student id and whole bunch of other stuff from three different relations. Instructor, student and advisor. Now why on earth would we combine information like this? It seems idiotic, you know. So in all our examples we are saying first here are the relations and by the way if you combine it it is a bad idea. But you might say why would you combine it? What is the point of normalization? And the point here is how do we get at this correction of relations in the first place? And there is a history to this. If you look at the formal work on normalization theory which we will again see later on. It was based on the notion that we put all the information about a particular enterprise into one single relation called the universal relation, put it all together and then break it up into smaller pieces based on certain properties such as functional and multi-valued and other dependencies which we will see later on. However in practice that approach did not take off. What took off was the entity relationship modeling approach which does things differently and starts off by breaking things up and then comes up with a set of relations. And the point here is that if somehow you came up with a relation which combined irrelevant information you might land up with some problems. In fact one of the problems which is repetition of information we saw earlier. We saw that earlier that if the same department name appears twice should have the same building and salary. Here if you have a instructor, department name, salary, student ID and so on, supposing so this was combining students with instructor through an advisor relation. So this instructor is an advisor to this student. So if there are two students with the same advisor all the fields of that instructor name, department name and so on would get repeated. But there is another problem supposing I am storing your student ID, name of the student and other information but a particular student does not have any advisor. How do I store information about that student in this particular relation? The only way I can do it is by having null values for instructor ID, name, department name and salary because there is no associated advisor. So all of these cause problems so we want to break it up into several parts. Again some notation this is about keys again many of you may be familiar with this. So first of all there is a notion of what is a super key? What is a super key? It is a subset of attributes which of the original schema are not here that we are saying let k subset of or equal to r what we mean by that r is a set of attributes k is any subset of those attributes. We say that k is the super key of r if the values for k are sufficient to identify a unique tuple for each possible relation small r of r. Now there are several notations which are combined in this one line. First of all small r open parenthesis, capital r close parenthesis what does that mean? First capital r is a set or a list of attribute names that is a schema. Small r in parenthesis like this says that small r is a actual relation with that schema and what do we mean by each possible relation? Well let us suppose that the relation in this case is instructor. Today there is a set of instructors in IIT tomorrow somebody retires somebody joins the set of instructor changes. So the relation instructor remains but the contents of that relation keep changing. So by relation instance we mean the current contents of a particular relation in say the instructor relations instance is may be the current contents of or the current set of instructors. Now a set of attributes is a super key if the values in there are sufficiently sufficient to uniquely identify a tuple in every possible instance of that relation. Now what do we mean by every possible instance? Now there are some real world constraints. In this case if it were the instructor relation we might say that id is a super key because it uniquely identifies instructor. Now somebody goofs up and gives the same id to two different instructors id is not a super key that is a mistake. This is an illegal state of the world. In fact I have a colleague here who was issued a PAN number which is identical to the PAN number issued to somebody else with the same name in Chennai. And he has this big problem every year when he files his returns there are two sets of returns filed with the same PAN number and that has been causing endless trouble for him and he luckily being in the computer science department here had colleagues and others who were or he had consulted for people who knew people in the income tax department and they were trying to figure out how to fix it. It is a big problem if you have a goof up like this. So anyway this brings up the issue of a legal state of the world. In any legal state of the world no two people should have the same PAN number and that is what we are talking about. So here id is a super key assuming no two instructors are given the same id. But also id comma name is a super key what do we mean by this? Well if you have a given value for id name is impossible for two people to have the same value for id name in this case because id by itself is already enough. But there are other cases where id by itself is not enough you need to add something more. For example, how do we identify a particular course offering? There is a course CS 101 it is offered maybe in summer 2013 and that is a particular offering and there may be two sections of that course in maybe not in summer maybe in autumn there may be two sections of CS 101. So we are going to identify an offering of CS 101 by the course id that is CS 101 in the year 2013 the semester autumn or summer or whatever and a section id in case there are two offerings of the same section in the same course in the same semester. Together these four attributes will uniquely identify a particular offering of the course. So it is a super key. Now a super key is said to be a candidate key if it is minimal. So for example id comma name is it a super key yes it is is it minimal no because even if you drop name id by itself is still a super key. Therefore id comma name is not minimal therefore it cannot be a candidate key. On the other hand id by itself is a candidate key because it is a super key and it is minimal. Our relation may potentially have multiple candidate keys. For example we may give id's to students but they may also have other id's right now many most have but maybe not all next year maybe all will have well barring foreign students. So maybe we can use adhara as a candidate key also but we have to choose one. So let us say we choose id not adhara and the next concept is of a foreign key. This is actually a constraint which says values in one relation must appear in another relation. So we saw this already. We saw that a department name in instructor must appear in the department relation. Some terminology here we have a notion of a referencing relation the instructor would be the referencing relation for department name and department would be the referenced relation for this particular foreign key. So the foreign key constraint is on a table which says the value here must appear in another table. Now here is a schema diagram which shows the relations in our university database. I want to spend a few minutes on this because all our queries everything which we do in the first few days is based on this particular schema diagram. This represents part of the information of a university again it is abstracted simplified but it is still complex enough to show many issues. So let us start off with this part of the diagram which shows a student. The student has a id name, department name and total credits. What do we mean by total credits? The total number of credits which the student has completed and passed. Now moving down here we have department which is department name, building and budget. Now if you note in both these tables the first attribute is underlined. Underline of an attribute indicates that it is a primary key. Now in some cases here you will find there are in section relation there are four attributes which are underlined. So section has course id, section id, semester year all of them are underlined. What that means is all of them together have been chosen as a primary key. Of course that means they uniquely identify the particular tuple in the section relation. Now coming down here we have instructor with id name, department name and salary. And moving here we have a course relation which has course id, the title which is the name of the course. A department name every course we insist has to be associated with one department. And then we have credits, number of credits for that course. So we have courses which are six credits, eight credits, three credits which reflect how much effort it takes to participate in that course as a student. Now you will notice arrows between all these tables. So these represent foreign keys. So take this arrow here from student department name arrow to department indicates that department name is the foreign key from student to department. Similarly course has a foreign key from department name to department. Instructor similarly has a foreign key from department name to department. Now let us look at this thing section as I said of course can be offered many times. So this particular offering will use the name section to refer to particular offering of the course. Now as I said this has four attributes which together uniquely identified, but it has some extra attributes. It has a building and room number which indicates in which room that section meets. So we are assuming that every time this section meets this course is going to be in exactly the same room and building in this particular semester. Then there may be another offering of this course that is another section which may meet somewhere else. So that is those are two extra attributes and note that these two are foreign keys into a classroom relation with where a classroom is uniquely identified by a building and a room number and it has one extra attribute capacity. So we do want to make sure that the enrollment of a course is less than or equal to the capacity of a classroom. Now this is not a constraint which we can specify in SQL, very nice if you could, but it makes it too complex and makes it inefficient. So that is usually left to the application programmer to enforce when students are added to a course maybe you enforce that it does not exceed the strength. In fact the enforcement here may be soft in the sense that you may allow it to exceed, but then you flag it so that the course is reassigned to a different room. So now coming back to course, the course ID forms a foreign key to the course relation. So section has course ID which is a foreign key to course and section also has attribute time slot ID which indicates at what time that section meets. Now there are universities where the course may meet at exactly the same time every day. So they may or a particular set of days. So they may say that the time slot is Monday, Wednesday, 11.30 to 12.30 or something. In other places it may be more complex. So we are going to have a time slot ID which indicates at what times during a week a particular section meets. So all of these are based on a weekly schedule. So we have a time slot which has an ID, a day, a start time and an end time. Now the same time slot ID may meet say 3 times in a week. So it will appear twice, may be different days with different start times. Note that time slot ID, day and start time are all underlined. End time is not. Why is that? These three are unique enough to uniquely identify a particular time when the course meets. It is not possible for a section to start at the same time but end at two different times. That is impossible. But on a particular day the section may meet once in the morning and once in the afternoon. That is legal. So we want the day and the start time as part of the primary key but end time should not be. It would not even be a candidate key if you include it. It would be a super key but not a candidate key. And finally these last two relations, we have a prerequisite relation which indicates that some course is required before you take some other course. For example in our department here in IIT Bombay, we have a data structures and algorithms course and most of our other courses require that a student have taken this course. Now in the old ways of doing things, these prerequisites were implicit. It was assumed that a course in semester 3 should be taken before a course in semester 4. But these days like many other universities, we give a lot of flexibility. Students can take courses out of order and so forth. However the course depends on the student having understood the material of a preceding course. So that is recorded in the prerequisite relation which has two attributes, course ID and prerequisite. If you see both are foreign keys back to course ID of the course table but they serve different roles. The course ID is which course and prerec is which is the course which is the prerec for this one. So CS 317 may have CS 101 as a prerec and may also have CS 313 as a prerec. And finally we have the advisor relation which indicates which student is advised by which instructor and in this case I think we have underlined S ID which is the student ID which indicates that a student can have at most one advisor they cannot have multiple advisors. So this is the schema diagram we will keep coming back to it. We have already told you about this issue of non-procedural versus procedural and within the we are going to focus on the declarative or non-procedural languages. Within this there are several pure languages which means they have formal without too much syntactic sugar and then there are real world query languages which people use to build real things. So the pure languages there is a relational algebra and then there are two forms of relational calculus. We are not going to cover it in this we are not going to cover the two calculates in this course. We are going to focus on relational algebra and we are going to look at the number of relational operators because that is important to understand what goes on in a database. But after wrapping up quickly with relational algebra we are going to spend more time on the SQL query language because that is what you need in the practical in the real world. But the relational algebra is very important to understand how a SQL query is executed. So when we do internals we are actually going to use the relational algebra operations. Now I had a few instructors ask me what is the point of relational algebra nobody is going to write queries in relational algebra why should our students learn all this it is just one more painful thing for them to learn and the answer is it is very important to understand internals. So by all means do cover it the relational calculates on the other hand are not that important which is why we are skipping those they are useful there are query languages based on those but in the interest of time we are not going to cover it. So now we are going to have a quick tour of relational algebra by looking at the several operations which together form the relational algebra. Now all of you are familiar with algebra from school. So what is algebra in mathematics when you deal with plus minus time division and so forth what do they operate on they operate on numbers what kind of numbers integers real numbers may be even complex numbers and so forth. In the context of the relational model each of these operators which we are going to look at operates on relations. So an operator takes one or more relations as input generally one or two and it outputs also a relation just like plus 5 plus 5 is 10 what are the types integer plus integer gives back an integer real plus real gives an integer real plus integer may give integer let us not worry about those details in the relational world you have taken relations and output relations. So here is an example of the select operator which selects tuples. Now those of you who know SQL know that SQL has a select clause which is actually completely different from this it corresponds to different relational algebra operation called project. This select relational algebra operator takes a table like this and outputs some subset of rows from this table. Now in this case let us say we want to select tuples where a equal to b and d greater than 5. So let us look at the first table is a equal to b both are alpha yes d greater than 5 yes it is in the output. Next one a equal to b no that fails it therefore that particular tuple is not in output. The third one it does satisfy a equal to b but it fails d greater than 5. So notice that it is not in output and the last one satisfies a equal to b because both have the same value beta and d is 10 which is greater than 5. So that is also in the output. So this is the output of the select operator we write it formally like this sigma is the Greek letter corresponding to s. So select is s so Greek sigma sigma a equal to b and d greater than 5 on r. So that is the output there of this particular operation. Now there is a quiz question here and fortunately I do not think we can run the quiz on the clicker right now unless a network is back. But here is something that you can check out I will just give you a couple of minutes and let you answer this question for yourself. So read the question and the choices and I will give you a minute. The question was the select a not equal to b or d less than 7 again we will go over each of these tuples. Does the first tuple satisfy a not equal to b no it fails d less than 7 that also fails that is out. The next tuple satisfies a not equal to b it is in the third one satisfies d less than 7 although it does not satisfy a not equal to b. So that is also in and the last one fails a not equal to b and it also fails d less than 7. So the output is the middle two tuples and the answer is number two two tuples. So that is a small exercise. The next one is selection of columns which is the project operator in relation algebra and the slide is titled selection of columns because this is what sql means by select we will come to sql later. But there is a clause called select there and that is used to select columns or attributes. So here is a relation with three attributes and different tuples you will note that each of the tuples is different there is no repetition of tuples in this relation. And if you select only columns a and b which is in future we are going to call this projection to avoid confusion with the selection operation. So if you project columns a and b which is denoted here by pi because pi is the Greek letter corresponding to p for project pi acr. So project only columns a and c of r. So what do we get here? We have alpha 1 another copy of alpha 1 from the second tuple the b value is different but a and c are the same then beta 1 and beta 2. Now you will notice that the first two tuples are the same after projection this can happen after projection and in the pure relation algebra duplicates are removed. So the net result is this relation here without duplicates. So again there is a quiz question here the projection operation does it remove duplicates or does it not as this thing here shows it does remove duplicates. Now I should also mention that there is a variant of the relation algebra which does not remove duplicates and I will come to that later and it turns out that this is the variant that SQL uses for various reasons which we will come to later. Now let us stick to the basic relation algebra. Here is an example which joins two relations through a Cartesian product. What is a Cartesian product? It outputs every pair of tuples. Now this is by itself it is a pretty useless operation in most cases there are only some rare instances where an actual Cartesian product is useful but it forms the basis for the next operation which we will see coming up. But first let us understand Cartesian product. So here is in relation r here is a relation s every tuple in r is paired with every tuple in s. So if you see here alpha 1 is paired with these four tuples. So here are the four results with alpha 1 and each of these tuples for c d e if you look down here. Similarly beta 2 appears here four times once with alpha 10 a next with beta 10 a and so forth. So that is a Cartesian product shown by a x over here. And the typical use of a Cartesian product is to use it as the first step and then follow it up with selection. So in this case what we have done is we first did r cross s as in the previous slide but we have added a select on top of that which is sigma a equal to c on r crosses. So what is this doing? It is going to remove all pairs where a is not equal to c and retain only those where a equal to c. Now we have something more useful we are pairing up tuples which match on something. And here what happened? Which are the ones which did match? The first one a and c above alpha it is retained. The next three tuples a and c do not match they are all thrown out. The fourth tuple again a is beta c is alpha it is out. The next two a and c are both beta. So those two tuples are in in the result and the last one again does not match. So this is what we have. Now if you go back to foreign keys the most common use of this particular operation it is called the joint which first Cartesian product followed by an equality selection like this is to match up tuples from two tables which are linked by foreign key. It is not a requirement there is absolutely no need that there be a foreign key relationship between it and there are many queries where such a relationship is not there. But a lot of queries do in fact use joint conditions on such foreign keys. And then there are a number of other operations. Now the next three are all standard set operations union intersection and set difference. So given these two tables r and s the union puts together all the tuples and appear in both the table. If you notice here alpha 2 appears in r and in s and in the result of the union it appears only once not quite again duplicates have been eliminated in this version of relation algebra. There is another version of relation algebra where the union would have two tuples that is called the multi set relation algebra. There would be two copies of alpha 2 in that case. Intersection here the only common tuple is alpha 2. So that is the only thing present here. And finally set difference is all tuples in this case r minus s is all tuples in r which do not appear in s. So the first tuple is not there in s it is there. Second tuple is in s it is out. The third tuple is not in s so we have this as the result. I think this would be a good point to see what questions are out there. So the couple of questions the first one was discuss schema and instance in detail. As I said schema is like the type what are the attributes what are the types and so forth. So if you are writing a program language like java or c the schema would correspond to the actual variables which you have in the program. Now an instance is the contents of those variables. So if I have a variable x in a java program at a particular point in time x may have the value 5. At some other point in time as the program continues x may get the value 10. So instance is the contents of the variable at a point in time. In the relational world a relation r is has a schema that. So the schema may refer to the set of attributes but it is the type of the relation without worrying about its content. The instance is the contents of the relation at a point in time. The next few questions are talking about care and where care and so forth people are jumping ahead we will do that when we come to SQL. And the last question which I am seeing here is what is the exact meaning of minimal regarding candidate. So what is minimal? This is a good question. All of us know what is minimum. Minimum means the smallest value. What is minimal? Minimal is meaningful in the context of a set. A set is minimal if with respect to some property in this case being a super key. A if it satisfies the super key and B no subset of that set is also a super key. So how do we check if set is minimal? Well we first drop one attribute check if it is still a super key then we drop the next attribute. We try it for every way to drop one attribute. If all of these fail in being a super key we know that no subset of that set can be a super key. Therefore, that set is a minimal super key and therefore it is a candidate key. So in our context we had ID and name as a super key. Now if we dropped name what is left we have ID. Is ID a super key? Yes it is. Therefore, ID comma name is not minimal. But if we have just ID I mean anything with just one attribute is trivially minimal unless the relation has only one possible tuple which is silly. So in this case we cannot drop ID and still expect to have a super key. Therefore, ID is minimal and it would be a candidate. I hope that answers the question. If you have questions on relational algebra please pass it on to your coordinators to type in now. Let me move ahead the next two things on relational algebra. But do pass on your basic questions now. So the end of this I will take the questions on basic relational algebra. So here is an example of the natural join operation which basically does the following. It is shown symbolically here with this bow tie symbol and what it does is it takes two relations. First of all it checks which attributes are common between these two relations. So here what are the common attributes? A is not common, B is common, C is not common, B is common and that is it. So B D are the common attributes. Now it is going to take every tuple here and match it to every tuple here which have the same values for B and D. So take the first tuple B is 1, B is A. Now look at the first tuple here B is 1, B is A. Yes it matches. So we have an output alpha 1, alpha A, alpha 1, alpha A and the E attribute comes from here which is alpha. So that is one of the tuples in the natural join. The next thing here, is there any other match for this tuple? Is there any other 1 A? Yes there is. So the third tuple here is also 1 A but E is gamma. So we have the same first tuple output with gamma as the value for E. So that is also part of the natural join. Now moving on similarly for the next tuple B D is 2 A which does not exist here. So this second tuple is lost in the natural join it is not in the output. The third tuple has B D being 4 B that is also not present it is lost. The fourth one is 1 A again and that appears once with alpha and once with gamma. So here alpha 1 gamma, alpha 1 gamma A with alpha and then match with this one here with gamma. And finally delta 2 beta B here 2 and B matches this one tuple here 2 B. So we have that one with delta. So that is the natural join example. There is again a quiz question. So coming back to this quiz question the natural join operation as we just saw. So matches rows whose values for common attributes are equal. So very simple question. My goal here was to ask this quiz question based on the output here and then explain natural join. So now this slide formally defines natural join of two relations. I am going to skip the details but what this says is it shows how to obtain the natural join. It says consider each pair of tuple if they have the same value in each of the attributes in R intersection S. So what is R intersection S? Again remember the notation capital R capital S are the attributes of the relations. The relations are called small r and small s. R intersection S is the common attributes. So if each at common attribute has both tuples have the same value for each common attribute. You add a tuple to the result which has the same value as P R the tuple from R on attributes in R. This should have been capital R and capital S pardon the error. So for all attributes from this it has the value from here. And for all attributes from the second relation it has the value from the relation S. So we have already seen this. So what we have seen so far were what were called the basic relation algebra operation. Later people realize that there is a very important class of operation which people need to do with relation which cannot be expressed using the basic operation which we had seen so far. And these are what are called aggregate operations. It turns out some aggregate operations can in fact be expressed using the basic relation algebra but not all. So what is the aggregate? This was a function that takes a collection of values and returns a single value. So average min, max, sum, count and so on are all aggregate functions. These are the 5 basic aggregate functions in SQL. However SQL allows many more aggregates and most databases have many more aggregates. For example, standard deviation, variance, median and so on. These are all things which most databases support. Now we will be seeing SQL syntax later on for aggregation but at the relational algebra level we are going to use the following syntax. We will use this funny letter here which is the calligraphic capital G, the calligraphic point. G stands for group by but we are actually migrating towards gamma. Initially this is what several people used but later for consistency with the other relational algebra operations such as sigma for select pi for project and so on. The Greek letter gamma is used these days by many people. So instead of this we could use gamma. So the two sets of arguments means the same thing and there are two sets of things, arguments to this group by operator. We are showing one set on the left and one on the right. So on the left are the group by columns and I will explain what this means coming up. On the right are the aggregate functions on specific attributes. Now again notation wise some books will put these also on the right hand side. They are all on one side and we know which is a group by attribute because these guys here will have an aggregate function. Anything without an aggregate function is a group by attribute. So again these notations are equivalent. It does not matter which you use and this is an operator which operates on a relation. Like all other relational algebra operations they operate on relations and give a relation. This one takes a relation I am calling that E. So here is a simple operation where the group by attributes on the left hand side of this is empty. So here is a relation and this is group by sum of C r. So C is this one. Sum of C on r adds up all the C values. In this case we get 27 that is the output. Not the result has an attribute name which we have labeled as sum C. That is kind of system dependent in actual implementation. So you can use aggregation by to aggregate all the values in a relation into one value. But aggregation can also be used with grouping. So take this particular query find the average salve in each department. Now we have an instructor relation which is shown below where each instructor has a department name and a salary. To find the average salve in each department we are going to group the tuples by department. So if you notice here biology there is just one instructor that is there. Compsi has three instructors they have been brought together. Again finance has two they are brought together and you will also notice here that it is sorted. Sorting it turns out is a very easy way to bring things which have the same group by attribute together. So we have sorted by department name. Not essential but it is one of the ways of doing that. Now once you brought it together we can find all the corresponding tuples and find the average. So for biology there is just one tuple the average is equal to 72000. For commpsi there are three different values and the average is this case 73000, 333, 0.33 you will ignore that and so forth. So for finance there are two tuples and the average is 85000 which is shown here. So that is the complete group by aggregate operator. Now you can put any aggregate operation you want here. We could have said sum which would have added of these instead of finding the average. We could have said min which would find the min salary. For commpsi the minimum salary would be 65000 max salary would be 92000. We could have a count which is how many people there are in effect here. So for department computer science you would have three biology would be one and so forth. So you can use any you can use have actually multiple things here. I can say average salary comma sum salary to get multiple pieces of information using one operator. And if you want a name for it to be used later as I said the name may be unclear. So what we are saying is AVG of salary as and give it a name. So this is just a syntactic thing which gives a name to this attribute. All the other attributes have names because they came from the name in the input table. There is also a rename operation in the relation algebra. I am going to skip it for simplicity. So with that we have done with this particular chapter. It is a good time to take some questions. So the first question which I am seeing here is why relation algebra when all these operations are in SQL as well. Now what I want to say here is SQL is a complex language. The SQL standard back in 1992 covered a few thousand pages around a thousand pages. By 99 I think it was some seven thousand pages. Now we have lost count of the number of pages. On the other hand relation algebra has a handful of operators. We have already seen the major set of operators. There are more there is division and other things but you do not really need any. So the other operators can actually be expressed in terms of these. So at the formal level you do not need any more operators unless you want more complex things. There are some cases where it is convenient to have other operators but it is a very very simple language. And if you are implementing a database system it actually makes sense to take a complex language like SQL and translate it down into a simpler language which is actually implemented underneath. So in fact the relation algebra forms the basis for database system implementation. They take SQL queries and turn them into some form of relation algebra and then that is what is executed. It is actually a little more complex because in addition to understanding what are the operations they are usually tagged with how the operation is to be executed. So people sometimes call this the logical algebra versus the physical algebra and so this is what is used inside to actually execute query. So if you want to understand the internals you should understand relation algebra. Now of course anything which you can do in relation algebra you can do in SQL but the point is that we want a simple thing which covers all the functionality which we need and then build a language on top. The next question is what is the need to cover SQL relation and tuple relational calculus if you use SQL in application development. So I think I just answer this question if you want to understand internals you must understand relation algebra. The calculus are less important so it is exactly why I am not even covering it in this course. In the question which says is there any aggregate function to calculate the second maximum value that is a good question. So many times you want not only the maximum but the next one and sometimes the third and the fourth and fourth. Sometimes you want to rank I want the rank of people in class the maximum will be people with the first rank the maximum marks people will be the first rank the next mark will be the second and then the third and so forth. So is there an operator for this and the answer is in the basic relation algebra no there is no operator to do this. The second maximum you can actually compute by a thing which first takes finds the maximum then does a set difference to remove it from that set and then does a maximum again. So you can express second maximum in relation algebra by a cascading set of operation. Is this an efficient way of doing it? No it is not. Now what if you want all the ranks of students in a class can you do it using relation algebra? So yes you can but it is very inefficient. In fact it becomes a quadratic operation whereas you know very well that to find the ranks you just have to sort by marks and then assign the ranks one two three so forth that is how we do it. So it turns out that from the viewpoint of efficiency even though the rank operator can be implemented using other relation algebra operators I leave it as an exercise how to do this using aggregations and cross product and select and so on it is possible I would not give you the solution now. It turns out it is a good idea to implement it as a basic operator. In fact today ranking is a basic construct in SQL most databases now support the ranking construct which is defined as part of the SQL standard and correspondingly their implementation will have a ranking operator which gives ranks to tuples first second third rank and so on based on some attribute and some ordering of that attribute. The next question says DBMS book by Ragu Ramakrishnan does not include aggregates as part of relation algebra that is another good question. Instantly Ragu was my PhD advisor so you know he and I can have a little fight about this maybe but seriously the basic relation algebra did not include the aggregate operator and when it was first defined COD and other showed that it is equivalent in expressive power to what is called first order logic. So, that is a branch of mathematics which is a long history and basically the point they wanted to make is that relation algebra is as expressive as first order logic. What they did not say is that first order logic is all that one requires in the real world but they defined relation algebra with a particular set of operations. Soon enough practitioners started using it and they realized in the real world people often need aggregate and you cannot express aggregates in either basic relation algebra or first order logic. So, then people defined what is called extended relation algebra which includes aggregates. So, even in our book we differentiate between the basic relation algebra and extended relation algebra for the purpose of this course I papered over the differences in the end the equivalence to first order logic is a theoretical concept which most of us do not need to worry about real life. But what we do need to know is that these relation algebra operators exist and are important in its real implementation. The last question in this set is how to apply aggregate function on another aggregate function for example, maximum of count. So, this is one of the basic things in any algebra whether it is your algebra which you learnt in school or the relation algebra. The output of one operator is of a type which can be fed into another operator. So, for example, we can group by department count instructor to find the number of instructors in each department. This is a relation on this you can apply another operation. So, let me use the drawing board here and group by we wanted the number of instructors in each department. So, the input to that what is the input not department because department does not keep track of how many instructors there are. But the input to it is the instructor relation where the instructor has a department name attribute department name. So, that is a group by and then we have count we can just say count star which is how many instructors there are in each department. So, the output of this thing might say CS3, physics 2, history 1, economics 2 and so forth. Now, on this we can apply the max. So, we can say group by max now this has to have a name. So, let us say count star as let us call it C. So, we go in an attribute name C now we can say max of C on this. So, this is an operator whose input is this whole thing here from here to here. So, now we found the count using this part of the query and then we got the max from that. So, that is how you can cascade these to get aggregates and on that another aggregate and so forth. Now, you can write these in SQL as well I am not going to emphasize too much writing these complex queries in relation algebra. Like I said our goal is to use it for internals, but it is evidently possible to write queries directly in relation algebra. And it is also important from the view point of understanding the internals when you see a query plan it is going to be in some form of relation algebra you need to understand what it does you may not need to write it yourself, but you should be able to understand what it means. So, next question is what is null in a relation is it true or false I am going to cover nulls in a lot more detail. So, I will answer these questions coming up I am going to answer it in the context of SQL. You could define all of these in the context of relation algebra also just like multi sets nulls and so forth aggregates with null there are many issues with null. So, I am going to answer all of these in the context of SQL rather than relation algebra. There are a few other questions which I will one more question where can we write relation algebra queries to extract results. So, there is no implementation of relation algebra out there which you can use, but one of the assignments which I have given as to my students I think it is also there in this course as an optional assignment for JDBC is to build a simple relation algebra interpreter on your own. So, you can actually take input and execute relation algebra queries, but you have to build it yourself. But practically it is not relevant because nobody wants to write relation algebra queries SQL is a lot easier to write queries. I am going to stop there on the questions.