 So, in this chapter, we will have a very quick introduction to the relational model. The concepts in this chapter are fairly simple. However, towards the end of the chapter, we are going to introduce you to relational algebra. So, what is the relation? We already saw this particular relation before with its attributes or columns and rows or tuples. This particular one is an instructor relation. I am going to cover a bit of terminology here. The set of values which are allowed for a particular attribute is called the domain of an attribute. Now, most attributes have a domain which is very very large. So, you cannot, you would not normally list them out. For example, salary is an integer or maybe a numeric with two decimal numbers and that is all you are going to say or maybe you will say that the salary should be at least whatever is the minimum salary according to the rules. And it cannot be more than let us say 100 million. Now, it is 100 million a correct bound. Why put that bound? Maybe it is to catch errors in data entry. Of course, periodically you have people who are paid such ridiculously large amounts that a constraint may be violated even though it is not really an error in which case you have to update the constraint. So, the domain could have a type as well as constraints which restrict it to certain values within that type. Now, the attribute value are normally required to be atomic that is atomic in the sense of indivisible. Now, what is this? What does this mean? We will see it in more detail later, but an example of something which is not atomic is a set of values. So, for example, an attribute of an instructor may be phone numbers which lists multiple phone numbers. Now, this attribute is not atomic because it has multiple values in there. Now, what do we have against nonatomic value? It turns out that nonatomic values complicate querying, they complicate the representation and they have other problems. And people realize that it is simpler to normalize these relations into what is called first normal form where there are no such nonatomic values. So, there are no such values actually. There are other instances of nonatomicity which we will see later. And those are all ruled out in the relational model although in extensions of relational model people have reintroduced them. So, they have some uses, but if they use carelessly they only hurt they do not help. There is also a special value called null which we will be seeing repeatedly which is to represent the fact that a particular attribute value is either missing that is it just does not exist or we do not know what it is null could mean either. For example, if we do not know the salary of an instructor we could set it to null, if we do not know the department of an instructor we could set it to null. Now, it is possible that we have an unpaid instructor and there really is not a salary it should have been 0, but we do not know so we set it to null. Now, null values cause complication in the definition of this operation in the SQL and relation algebra and we will see this what is the impact of this later on. For the moment we will not worry too much about null. Now, a schema is a set of attributes in the simplest form. Now, a more fully detailed schema will include the types of the attribute, but initially at least we are going to ignore the types to simplify our life and treat a schema as simply a set of attribute. So, the schema for instructor in this case would be ID name, department name and salary that is a set of attribute. Now, certain people like to think of it as a set which means it does not have an order, but when we deal with the real query language the order in which you list the attributes becomes important for to ensure that the result is deterministic and therefore, in SQL the attributes in the schema are listed in order although the basic relation model some people feed them as unordered. Formally a relation instance is a cross product of the domains of its attributes, what does that mean? It is a subset of the cross product rather. What does that mean? Any particular tuple has one value for each of the attributes. Now, that value must be from the domain. So, now what is the cross product? It is every possible combination in this case if I have domains d 1 through d n the cross product is every possible value chosen for each of the attributes. So, any tuple has to be an element of the domain and any relation instance has to be a subset of this cross product of the domains. The instance is the current value of a relation is what is actually stored in the database. Now, although the attributes are usually represented in a particular order left to right the rows of a relation in the relational model and in SQL are by default unordered. So, we saw earlier the instructor relation if you recall we had sorted it by I d. So, that is easier to look up a particular instructor given an I d, but in the model the order of rows does not have any significance. So, this copy in this slide the same instructor relation, but now the tuples are not sorted on anything they are kind of jumbled up randomly, but it is really the same relation there is no difference. Now, of course, when you output information to a user the user may wish to see things in a particular order and therefore query languages include the ability to output results in a desired sort order, but in the relational model there is no such sort order. A database has multiple relations in our sample database for a university we have many things including instructor, student then there is an advisor relation which says which instructor is the advisor for which student. Then there are many more which we will see later such as what are the courses which courses are offered in which semester, which instructor is teaching which courses which student has taken which course offering and so on. Again here is a small example of a bad design and the relational model does not per se deal with good or bad design, but when you design a schema we do need to worry about good and bad design. A bad design could result in repetition as we saw earlier or it could result in the need for null values. For example, if I had a student along with an attribute called advisor id, if there is no advisor we would have to store the null values and null values as I said can cause problem. So, if we can avoid them we should avoid them. Some more notation which is going to be seen very very often. So, if you are not familiar with this please pay attention a key is any subset of the set of attributes of the relation other than the empty subset problem. Now, we say that a particular key is a super key if the set of values in that key uniquely identifies a couple for every legal instance of the relation R. What does this mean? Let us say that in any university instructors have an identifier. The identifier has to be unique we should not have two instructors with same identifier that is a mistake. Therefore, we can say that the id value is a super key for instructor because it uniquely identifies one instructor. If you have a relation instance where two people have the same id that is an error. So, a super key uniquely identifies a row. Now, the combination id comma name of an instructor is also super key because it uniquely identifies instructor, but as should be obvious the id alone is enough. How about name alone? Is it enough to uniquely identify an instructor? Well, in most colleges or university this is likely to be true, but you cannot be sure. There are certain common names for example, certain last names are very common in India like Jain, Shah and so on. Sanjay Jain for example, is a very common name. There are multiple Sanjay Jain used to be in single class multiple students in the same name. So, clearly name cannot be a super key because it is possible for two people to have the same name. So, name is not a super key id is a super key id comma name composite single using two attributes is also super key. Now, we generally do not want to tag on a necessary attributes with a key. So, a candidate key is a minimal super key. What does that mean? A candidate key is a set of attributes such that if you drop any attribute from that set it will no longer be a super key. So, is the combination id comma name a super key? Yes it is. Is it a candidate key? Well, we know if we drop name id will still be a super key. As a result id comma name is not a candidate key although it is a super key. So, whenever we choose to store something in another relation to refer to a row in a second relation we are going to use something which is minimal and do not throw in unnecessary things. Now, there may be multiple candidate keys for a particular relation. So, id is probably a candidate is a candidate key as we said. How about email id? Now, pretty much everyone has a unique email id these days. So, we could have used email id also as a candidate key. However, you have to choose one of these two as a unique representation. It gets very confusing if some people refer to instructor by id and some other by people I mean some other relations refer to instructor by email id that will cause confusion. So, we will choose one candidate key and call it the primary key. Now, the primary key usually has to be chosen carefully would you choose the email id of an instructor as the primary key? It turns out the email id is the bad choice for the following reason. Many people change the email id. If your email id is linked to your college when you move to a different college your email id changes. So, it is not a good idea to have a primary key which changes. So, typically you will pick a candidate key which is not going to change. Even if you stay within the college you may change your email id for whatever reason. So, a unique id generated by the college called the id is probably a better choice in this case. Now, all of you have no doubt heard of the UID project now renamed as Aadhar which is being run by Nandini Lekhani. So, that is going to give a unique id to every person in India. So, tomorrow you could use that universal id as an id for across all applications in India. In fact, for it to be universal it turns out they will have to have not just Indian citizens, but also everyone who is connected with something going on in India. There is another notion called a foreign key. A foreign key constraint says that a value in a particular attribute of a particular relation must occur in a specified attribute of another relation. So, for example, we had a department name for instructor. Now, for this to be meaningful the department name must appear in the department name attribute of the department relation. If it is missing you have an instructor in a ghost department. Now, one of the goals of the UID project is to prevent ghost consumers of ration shops. So, apparently people make a lot of money by creating ghost consumers who do not actually exist and then siphoning of rations. So, that is one of the major motivations for the UID project. So, in a database context you know this would correspond to a value which should have been a foreign key, but it is not. There are again two terms here. There is a referencing relation in the referenced relation. So, in instructor dot department name is a referencing attribute in the referencing relation instructor. The referenced relation is the department relation. In the referenced attribute is the department name attribute of the department relation. So, here we have a schema diagram. I am pretty sure you will not be able to read it on your screens because the points are very small. So, what is the schema diagram? It shows all the relations for our sample university and it is of course a toy. It contains only a small fraction of information, but let me tell you a little bit about what it covers. It has students. It has teachers which are called instructors here. It has departments and these are three things we have seen already. On the right hand side there is an advisor relation which links students with instructors. For the moment do not read the attributes of each of these relations. Just read the relation name which is on top. It is shaded in blue and look at the lines between them which indicates that this relation is connected to this relation and this relation through certain attribute values. Moving on to this side, there is a course over here. Each course is associated with a department and each course may be offered one or more, zero or more times rather in a particular semester. So, we are going to use the term section to denote a particular course offering. The same course may be offered this year, next year or this semester, next semester or may be two times in this semester because there were many students and therefore we wanted to have two sections of the same course. You will notice here that the section relation is linked to course. Of course, we need to know which course that section corresponded to. It is also linked to student through an intermediate relation called takes. So, this relation called takes indicates which student took which section while down here is a relation called teachers which indicates which instructor taught which section. Now, it should be clear that you may have many students who take a single section. How many instructors would be teaching a section? Well, it depends on the university. Some universities may say a section should be taught by only one instructor, but most places in IITB do not insist on this. So, you can have many instructors teaching a section. Now, you will notice that a section relation has multiple attributes which are underlined. The underlined attributes here form the primary key. So, to uniquely identify a section, you need to know which course it is for, which semester, which year and even within the semester and the year there may be two offerings of the same course. So, they would have to have different section ID. So, maybe they have section ID 1, 2, 3 and so on. So, these four attributes together uniquely identify the section of the course. You will notice that the relation takes between student and section has all these four attributes. So, that it uniquely identifies a section. It also has an ID of a student. So, a student is uniquely identified. So, one instance of the takes relation links one section with one student. Now, there may be many such rows in that relation which link different students to different sections. Now, if you look further down in the section relation, there are other attributes such as which building and room number the section is running in and which time slot ID that section is running in. Now, what is the time slot ID? It is you need to say at what times of what days a particular course section will be running. Now, how do you do this? One way is to list each time saying it is going to run on Monday 8.30, Tuesday 9.30 and so on. What we have done in our schema is to split this into two parts. We are going to have a notion of a identifier for a time slot. So, we will say that this section runs in time slot 1 or time slot 2. We are going to separately say which are all the times on which day of the week that time slot 1 runs and the time slot 2 runs. So, that is the time slot relation. So, if you notice the time slot relation has an time slot ID value. It also has a day, it has a start time and an end time. If you are able to see the diagram clearly, you will notice that the first three attributes are underlined which means that you can have a single row in time slot is identified by a combination of time slot ID, day and start time. What does this give us? It lets us have the same time slot run twice in a day if you wish. However, it does not include end time as a primary key attribute. Now, why do we choose this? Now, it would be very silly to have a particular time slot start at let us say 8.30 and end at both 9.30 and 10. That makes no sense. So, there cannot be two distinct values of end time given a particular value for the time slot ID, day and start time, which is why only the first three are underlined and a part of the primary key. If you include end time, it would still be a super key, but it would not be a candidate key given the common sense constraints on what we can do in a particular section. There are a few more relations in here. For example, a course may have a prerequisite. You are allowed to take the database course only if you have done a data structure course. Otherwise, you would not be able to understand what is going on here. Then, we would have a prerequisite which says that each course has an ID. So, whatever is the ID for the data structure course will be listed as a prerequisite for the ID of the database course. So, that is the prerequisite relation here. You will notice that prerequisite has course ID and prerequisite ID, which says that prerequisite ID is required in order to take course ID. Course can have multiple prerequisites. So, course ID by itself is not a super key, but the combination of the two would be a super key. In this case, that would be the only meaningful super key and it would be the primary key. So, we have a few more relations. We have a classroom relation, which tells us information about the classroom. So, if I want to allocate courses to classrooms, I need to know how many people are enrolled for the course and how big is the classroom. So, the classroom relation in our context has basically three attributes, a building number and a room number, which together uniquely identified and a capacity which indicates how many people can sit in that classroom. Looking up at the takes relation again, it is uniquely each tuple in there uniquely identifies one student and one section, but it has one more attribute at the very bottom, which is the grade attribute. This is going to represent what grade the student got in that course. So, when a student takes the course, does the student have a grade? No, grade is assigned at the end of the semester after the final exams are over. So, till then what value should the grade attribute take? So, we will have to set it to the null value in our schema design. We could have designed the schema differently. We could have had one relation, which indicates that the student is registered for the course and a separate relation, which indicates what grade the student got for the course. But to keep our schema simple, we decided to keep just one relation and allow them values. And this will show up in some of the queries, which we run, where we have to deal with the fact that we may have a takes tuple without a gradient. So, if you want to know what all courses the student has completed successfully, we need to look at only those rows of takes where the grade is not null. And we will see this in the course of our query. Now, student has several attributes, ID, name and department name. It has one more attribute, which is total credits. So, this is actually what would be called a derived attribute. It is something, which can be computed from the other thing. Now, what is this total credits mean? We are using it to represent how many credits the student has completed successfully. So, if the student has passed courses whose credits total 80, total credits value would be 80. Now, it should be clear that if our database is complete, we can compute the total credits from the takes relation, including the courses, which they have passed successfully. We will have to eliminate those where the grade is null. We have to eliminate those where the grade is failed, because failed courses do not count against total credits completed. But we can combine this information and aggregate it to get what we want. But it is often useful to store it, so that we can look it up without having to run a query to compute it each time. So, that is a derived attribute. And we have the advisor relation, which links students with advisors. And in this case, it is not very clear over here, but we have underlined just the student ID value, which indicates that for the advisor relation, student ID is a primary key. Now, what does this mean? How many advisors can a student have? The fact that it is a primary key means that a student can have at most one advisor. Does it mean a student must have an advisor? No, it is possible that a particular student row does not even appear in the advisor table and therefore, does not have any advisor. But a student cannot have more than one advisor with this particular constraint. If you remove this and set that the primary key consists of student ID and the advisor ID, in this case, SID and IID, then we would allow a student to have multiple advisors. So, I have spent a fair amount of time on this slide, because it is going to be very important. So, please study this schema later during your breaks before the lab session. Now, how do you deal with information from the database? How do I extract it? As I said, we have procedural versus declarative and within the declarative family of languages, we have what are called pure languages, which are not designed to be easy to program in, but are designed to be very minimalist kind of languages, which simplifies the job of a piece of software, which has to figure out how to evaluate it clearly. A simple language is always easier to write an interpreter or compiler. However, humans do not want to deal with such simplified languages. They need more complex constructs, which simplify their life. So, there are going to be real life languages. We are going to look at SQL also, although there have been others. And typically, what will happen is the first stage of query processor will convert your human readable query in SQL into something like the relational algebra, which is a pure language without syntactic sugar. And then a second stage will figure out how to evaluate. We are going to look in this chapter at the relational operators, which are a part of relational algebra. We are not covering relational algebra in great detail, but it is very important to understand these operators. So, what are these operators? So, the first operator, let us see what is going on here. We have a relation and the goal is to find certain piece of information from this relation. Now, do not look at this slide yet, but think if you had an instructor relation and I ask you to tell me what is the salary of the instructor with ID 22222 or what is the name of that instructor? How do you get this information? If you as a human were to execute it, you would search down that table till you found a row for 22222 and then output that information. So, what you are doing is you are selecting a particular tuple from that relation. Now, the query may not ask for just one, it may say find me all the instructors in the computer science department, in which case your selection is to find all the instructors. So, then what do you do? You are going to return multiple rows, but both of these are really the same operation called the selection operation. So, here is operation which selects tuples with satisfy two conditions. The first condition is that a equal to b and additionally the second condition is b greater than 5. If you go down the rows of these relations, it should be clear that the first tuple of a satisfies a equal to b and it satisfies b greater than 5, it is there in output. The second one fails a equal to b because attribute a has value alpha, attribute b has beta, they are not equal, that is all. The third row satisfies attribute a equal to attribute b, but if you look at it is b value, it is 3. So, that fails, it is all. The last one satisfies a equal to b because both are beta, b is 10 which satisfies b greater than 5. Therefore, it is in the result also. So, that is the result of the selection operation. Now, in the relational algebra notation, we will write the selection operation using this notation in the third bullet here, which says this Greek letter sigma which incidentally is the Greek equivalent of the English S, S for selection therefore, sigma. And the condition is a equal to b and b greater than 5 and the selection operation is applied on the relation R. So, that is what we have done there. So, this is the mathematical notation in the relational algebra to get a subset of the rows of the relation. Now, here is a quick quiz question which you can try out. I have sigma a not equal to b less than greater than in SQL indicates not equal to. So, I want the condition to be a is not equal to b or b is less than 7. How many rows in this relation satisfy this? Is it 1, 2, 3 or all? So, I will give you a few seconds to work out by going over the relation. If you have had enough time, if you go through that relation, you will notice that the first row fails a not equal to b, but there is an R condition. So, is d less than 7? Unfortunately, it fails that also. So, it is out. If you look at the second row, it does satisfy a not equal to b. So, we do not care what the b value is. The third row fails a not equal to b, because both are beta, but it satisfies d less than 7, because b is 7. So, we have 2 rows so far. If you go to the third row, it also fails the fourth row rather. It also fails a not equal to b. Thus, we have result containing the middle 2 rows of the relation. So, the answer to this quiz question is 2 terms. The second basic operation is the selection not of rows, but of columns of the relation or attributes. In the relational algebra, this operation is called the projection operation. Although, in the SQL language, this operation is referred to as the select operation. So, there is some confusion when we switch between the 2. To avoid the confusion, we can say selecting rows versus selecting columns. So, in this case, I have a relation with 3 attributes. For some reason, I want the output without the b attribute. I want just the a and c attribute, because that is all I want for this particular query. Therefore, in the relational algebra, I will write it as follows. I am going to say pi. Why pi? Because pi is the Greek equivalent of the letter p, which stands for projection. The subscript is a and c, which means output these 2 attributes a and c. This operation is applied in this case to the relation r. It should be clear that we are going to get 4 rows initially with the same values for a and c as the 4 rows in the original relation. But, you will note that the first 2 rows differed only in the b values. Once we project on a and c, they are actually the same after projection. So, the projection operation in relation algebra removes duplicates and outputs just 3 rows. But, as you will see later, in SQL, the duplicates are not eliminated by default, unless you specifically say select distinct to indicate that duplicate should be removed. If you do not remove duplicates, then this one over here with 4 rows is the result. If you do remove duplicates, this one on the right with 3 rows is the result. Quiz question 2 is a trivial question, which ask does it remove duplicates or does not. In the relation algebra, it does remove duplicates. The next basic operation is combining information from 2 relations. This is a very, very important relation, very important operation in relation algebra, because it is very clear from the examples we have seen that information is split across multiple tables and we have to combine it to answer many queries. So, in this case, we have 2 simplified relations with attributes a, b, c, d, e. So, how to combine information from 2 relations is actually split into 2 operations in relation algebra. The first operation does something which seems very stupid, but we will see why it is useful in the next slide. So, the first operation is the Cartesian product operation, which simply takes 2 relations and combines every pair of attributes from the 2 relations. So, here you have r with 2 rows and you have s with 4 rows. r cross s, the Cartesian product of r and s will have 2 times 4 rows, which correspond to all combinations. So, if you take alpha 1, it occurs with each of these 4 rows. If you see here, the first 4 rows are the result, are really alpha 1 with this row, with the second row, third and fourth row of s. And then similarly, beta 2 is repeated with 4 rows of s. That is the Cartesian product. By itself, the Cartesian product is not useful, but we will combine it with selections as we will see to do useful stuff. But before we see how to do that, let us wrap up the other basic relations, operators of relation algebra. The next one is the union operation, which should be obvious. Take 2 relations with the same attributes. The union simply puts their rows together. If there are duplicates, because both the relations have the same row, in relation algebra, the duplicates are removed. In SQL, it turns out by default also that duplicates are removed, although you can tell SQL to keep duplicates. The next operation is set difference r minus s will give you rows of r, which do not occur in s. And in this case, r has alpha 1, which does not occur in s, alpha 2, which does occur in s. So, that is removed. Beta 1, which does not occur in s. So, the result of r minus s has alpha 1 and beta 1. So, these are all fairly intuitive operations. Then there is the intersection operation, which is also obvious. It outputs rows, which are there in both relations. So, those are the basic operations in relation algebra. There are some more, which we will see later, aggregation and so on. But, with just the basic operation, you can already do a lot of stuff. For example, if I want to combine information from two relations, I need to join them. Now, typically the join is going to be done with, by insisting that a particular attribute value of a relation is equal to a value from the other relation. So, for example, I have r and s. r has attributes a, b, c, d and s has b, d, e. I am going to define the natural join operation, which as it turns out can be expressed using cross product selection and projection. But, it is still a very useful operation. So, let us see what it does. The natural join operation is going to take all the common columns of these two. In this case, b and d are the common columns. It will take a row of r and find all matching rows from s, which have the same value for all the common columns, which are b and d. It outputs all those combinations. So, if you see here, alpha 1, alpha a, that is the first row, has b d as 1 alpha. If you look at s, here is the first row with b d, sorry 1 a, not 1 a, it is 1 a, b d being 1 a. But, if you go down on s, you also have the third row, which is also 1 a. So, the first row of r combines with these two rows of s and in the output, you have these two rows here. If you go further down this table, there is one more row of r, which also has 1 a for b d and this one also combines with the same 1 a over here and 1 a over here and those are the next two rows. Finally, the last row of r combines with exactly one row here. 2 b occurs exactly once in s. So, there is one more tuple for that. So, that is the natural join. It links up things with the same values for the shared attributes. You will also notice that since the values are going to be the same on the shared attribute names, these attribute names occur only once. b d are there in both the relations, but they do not occur twice in the output. They occur only once. In contrast, if we took the cross or the Cartesian product of r and s, there would be two copies of b and two copies of d, which we would refer to as r dot b, r dot d, s dot b, s dot d. There is a simple quiz question here, which you can try out later, which says the natural join matches rows whose values for common attributes are not equal and a few other silly things. The answer should be obvious. Again, it was just to make sure people are awake. And to wrap up this chapter, here is a summary of some of the basic relationship operations. There are others, but the most commonly used ones are the selection, the projection, the natural join, the Cartesian product, the union. Of course, we also saw the intersection and the set difference. And there are some examples of all of these. I will also note that a query may stack up these operations. Just like when you write an expression in algebra, you can say x plus y times a plus b. So, you build a complex expression with multiple operations. So, you can similarly build complex expression. Why? Because each of these relation algebra operations takes as input 1 or 2 relations. And its output itself is a relation, which means the output of an operation can be the input of another operation. So, just like addition, multiplication, division, output numbers, which themselves can be input to the next operation. This is also true of relation algebra, which lets you construct complex queries by putting these together. With that, we will stop here. Thank you very much.