 Last time what we discussed was the notion of an entity model. One of the things that we discussed was an entity relationship diagram, whereby we represent information pertinent to entities that we wish to handle in our business. I had taken an example of the academic environment here and I had suggested that if we have to model, let's say the registration and great processing data and get important information out of that, what are the entities that we need to model? And we figured out that we have to model a student entity which represents a set of students, a course entity which represents a set of courses. In modeling we said last time that an entity model is represented by a rectangle wherein you write the name of the entity set. Please remember it's a set. When I write student, I mean whatever 4,500 or 5,000 students of IIT Bombay or whichever place. Using ellipses, I would define attributes which are associated with or which characterize each student entity. So each student has a name, hostel, room, CPI, roll number. Notice the naming. I am using s-roll, s-name, sort of to signify the student roll number, student name, etc. Notice here that there are two attributes which are shown by double ellipses. All others are single ellipses. This is to indicate that single ellipse indicate attributes which have single value for a student. There is exactly one name for a student, one hostel, one room. However, a student may be registering for many courses. A student may have multiple hobbies. The underlining means that s-roll is an attribute which is unique for a student. There could be multiple students in the same name. 300 students may be staying in the same hostel. Same room in different hostels will be occupied by different students. But each student has a unique roll number. And we said that such a thing will be called a primary key. So this was our modeling for a student entity. Likewise, we said that if we want to represent the actual data, please note that this representation is called meta data. There is no data. There is no actual name of the student, actual roll number, actual hostel in this model. So this is data about data. I actually want to deal with roll numbers, names, CPI, hostels of students. But that data in whichever way I represent is being abstracted and being represented through this model. This is called meta data, meta data meaning data about data. The actual data could be represented in a table form where we said we write roll number, we write name, we write hostel, other attributes. And inside this table we write rows, each row representing values, set of values which are the values of the attributes for each student. So this is the roll number of one student, Ajay Shah is the name of that student, zero two is the hostel number of that student, etc. So each row represents the values for one student. And there are as many rows as there are students in my actual database of file. Needless to add, this table could be represented physically in the form of a register or a file. Or in terms of a computer data, it could be stored inside a file. The file could be a sequential file, index file, whichever way, which you will. Similarly, we said there could be a course entity. The course entity could also have a primary key which we said course code, course name, credits. And we have double ellipse as to indicate that the number of students registered for a course could be many. Faculty could be many, several people could be teaching a course. We actually discussed many other things which I am not repeating here. But one of the important things that we discussed is that the attribute, any attribute for an entity could be of different types. We see two types here. This is called a single value attribute or an atomic attribute. This is a multi-valued attribute. Multi-valued attributes is very difficult to handle. How do I represent it in a table? The multi-valued attributes say course and students. So there are 100 students registering for this course. So if I have a course table like this, I can definitely write the course code. I can definitely write the course name. I can definitely write the course credits. But the moment I come to see students, this dot dot dot will not be a single value, but a list of 100 roll numbers and along with the roll numbers, names, everything else. There are 100 students to be represented. So it's a mess. Please understand that if I have a simple table containing attributes which are single-valued, I could always sort this table or the associated computer file on any one of the single-valued attributes. So I could sort this on the course code. I could sort this on the course name. I could sort this on the course credit and get all courses with six grades before head crates and so on, descending and ascending, which are where. But if I have a list, there is nothing to sort. There are 100 students. Some student is taking some other courses. So that same student's roll number or name will appear at multiple places, not necessarily the same order in that list. So I can't sort or arrange information in a way such that I can extract information from this. Consequently, we said that in our model, we will never ever permit multi-valued attributes to be sure. If that is so, then how do we handle these C students and C faculty are going back to the previous slide. How do we handle the number of courses registered by a student, number of hobbies that a student may have? We then said that in our modeling, we notice one very interesting thing in this small model that we have established, that student has courses as a multi-valued attribute and a course of students as a multi-valued attribute. You will notice in your modeling experience that very often a multi-valued attribute for an entity itself is usually another entity. In fact, indeed that's the case just now. We have a student as an entity, course as an entity. How do we handle them this problem of showing that a student may take multiple courses or a course may be registered to by multiple students? This we said will handle through an association or a relationship. That is what gives the model the name entity relationship model and is represented by an entity relationship diagram or ER diagram. This is an entity diagram. This is a table representation for that entity diagram. This is also an entity diagram. This is a table representation for that entity. We said that we will model this by removing all multi-valued attributes from any entity and put an association set something of this sort. We say that actually each student takes multiple courses, each course is taken by multiple students. So what we really want to represent is the collection of all these lines. They represent the association between students and teachers. Imagine that a student takes on an average six courses and imagine that there are 5000 students. Totally how many lines do you expect? 5000 into 6 is 30,000 lines. If you were to draw a line from each of these roll numbers to each of these courses that will be actually 5000 multiplied by 500 if they are 500 courses. That will be like a cross product of these two sets. Out of this cross product the relevant information from our perspective is the information which pertains to actual students taking actual courses. So we want to represent these lines. Since these lines represent an association or a relationship we introduce the notion of a relationship between these two entities and we say that relationship actually represents the set of all these lines. Please note that each line is uniquely identified by the two endpoints. Since a roll number uniquely identifies a student, course code uniquely identifies a course. We have a unique identification of this line. If we specify this and this then that is a unique identity. Consequently we said that we shall talk about this relationship set where there is a student, there is a course. I have not shown all the other attributes but I have shown only the primary key attributes which uniquely identify the respective entity. And then I show the relationship with the diamond saying student registers for courses. Notice that just as I can treat student as a set of elements where set of students each student represented by so many attributes. Course is also an entity set, set of entities called courses. Can I not imagine registers for also to be a set? It is a set of associations between these two entities. So if this entity has 5000 elements a entity set, if this entity set has 500 elements then this entity set does not have 5000 into 500 but as many elements as there are associations between students. In short if I treat this itself as an entity set, registers for is an entity set then this will be equivalent to having an entity set which has what are the attributes of this entity set now? No, student is not an attribute. This is now I am taking this to be a entity set. So let's just switch over here. Set of elements for the student is equivalent to another set of elements here and this, this set if I treat this as a set this is equivalent to all the lines that you can see between these. So this is actually set of these lines. Since it is characterized by the end points uniquely attributes of this entity set. So if there are 5000 into 6 on an average elements of this set then each element which represents an association is uniquely identified in course score. Consequently if I were to treat this as a entity set then I can say that it has one attribute which is S roll, another attribute which is C code and both together constitute the primary key for this set. Can you imagine an attribute which this set should have but neither student nor course could have the question which we had raised and was answered correctly last time. So people who attended it last time should not answer. Any attribute that you can think of? See student has attributes, hostel number, room number, additional attributes, height, weight, address, marks and grades code. Notice that the grade of a student can neither be an attribute of a student because each student will have up to 6 different grades depending on how many courses he has registered. Or can it be an attribute of course because each course would have 100 or 300 grades depending upon which student, how many students are there. So grade actually is an attribute of each line, each association has a grade. So this student in this course got AB. The same student in other course got BC. The same student in third course got A. So consequently we said that grade shall be an attribute of this artificial entity set which represents an association. Is that clear? So the relationship set attribute is grade here. And when you model information systems, you have to be very careful in identifying all relevant attributes of participants. First you have to identify participating entities. Then you have to identify attributes of participating entities, identify the primary keys of those entities and subsequently you have to identify relationships between them and you have to identify attributes that the relationships may have. Is that clear? How will you represent the relationship set as a table? This set as a table. Let me show you the student table. This is what you call a course table. Similarly, this was a student table. This would have 5000 elements. That course table may have 500 elements. How do you represent in table form the data which is representative of the association? Let me call this table as the registration table or age table. How many attributes this table should have? Exactly the same attributes that we identified earlier for the association set which means it should have s-roll, it should have c-code and grade. Do you agree that these are the three attributes which uniquely identify and completely describe the association between student and a course? Consequently, there could be a table for age which will say roll number, let's say 07010513 and let's say code is cs63407010513 courses let's say cs691 etc. Notice that if a student has registered for say six courses, there shall be six entries for that student in this table. So the student's name will appear six times. Similarly if a course has an average of 50 students, a course code will appear 50 times. But still each row in this table is unique and that is because the combination of roll number and course code is unique. Last time we had asked this question that if students have their registration data where they fill up a form online or whichever way or by hand, how do you create a role list for a teacher? A student fills up a form giving up to six courses that he or she registers for whereas as a teacher, I would like a role list of 100 roll numbers who are attending my course. And we have gone through the exercise of how to create that list. We finally concluded that I don't need to maintain either the registration form or the course list independently. If I maintain the data for all students in the student table, all courses in the course table and all valid registration data in this form alone. If I do that then there is absolutely no problem because any time I want to construct the registration form, I can sort this table on roll number and collect all entries belonging to same roll number, I will get all the registration data. Any time I want to create a teacher's role list for a course, I will sort this on course code and collect all rows which have CS 634 let's say as the course code. All those 100 students will come together. Indeed this is what is done. If you see the back end system, you see only the front end of the registration system where you fill up the registration form. I see the front end of the registration system when I look for the course list. The course list is not explicitly stored. Your registration data is not explicitly stored. What is stored is this. Every time I ask for the data, it actually looks up this registration table, it hunts out for all the students who have registered for CS 64 and show me that. What is the advantage? If it were to store your registration data, then the processing is not very easy. Today if you modify your course registration in exactly three seconds, I can see the reflection of the modification because every time I make a query, the report or the form is regenerated based on what is the data there. You realize that this is not only efficient, but this can reflect absolutely the current information without any problem. Later on when we talk about schema design in databases, we will talk about what we call the theory of normalization and we will conclude that this kind of set of tables is said to be in the third normal form. There are only terms right now. We will get back to this later. So far this is clear to you. Now we come back to processing of the information. Imagine that we have actually got all the students information in the student table which has been put in as a computer file. We have got a table which is a course, which is a course file and we have got a rich file which has this registration data as we just saw. How do we extract meaningful information out of this? I mentioned that a student can see this report by sorting that table and etc. But an ordinary student does not write programs to sort. Let's say a programmer. A programmer at the back end will have to write programs which can give this report, that report, that report. Ordinarily the programming would, if done in a conventional programming language, would still not be very easy. You would be familiar with programming languages such as FORTRAN, BASIC, C, whatever, whatever. Some programming languages all of you are familiar with, right? So if you were to write a program to produce information that we need, it would still take some effort, some testing and so on. There have been a lot of development over the last several decades. A particular development in this field which started with Professor Kaur who was working for IBM resulted in a database model called relational database model. That is based on a firm mathematical foundation of representing information, extracting information and that model is called the relational model which we are going to briefly look at. The relational databases are entirely based on that relational model. Please note that the word relation and relational model is not the same as the word relation in entity relationship diagram. Actually that's a bad word in the sense that what we are talking about there is association. So better remember the entity relationship model as an entity association model because the word relational model refers to a completely different notion of relation. We will briefly look at the mathematical foundation of the relational theory, relational algebra and see very briefly again an extremely powerful expression mechanism which is now available through a programming language or a scripting language called structured query language. So this relational model was developed by Professor Kaur. By the way the databases notionally existed even earlier. Databases tried to separate out the management of data in files and file structures from the way you wrote programs. Earlier people would write conventional procedural programs. In those programs they will be required to describe the data files. Any program that you write in Fortran or C or whatever you would be familiar that you would have to describe that file. And that file definition then becomes intrinsic to that program. Now therefore there are 10 programs which are referring to 5 files. The 5 file definitions happen in all those 10 programs. Any 11th program to be written must incorporate those file definitions. Suppose you change the design of the file then you have to change all 11 programs because the definitions are included. To separate out the physical data management and the logical access to that data the notion of databases evolved. The earlier databases used a slightly different model which was more around the way you access the data. This was the first model which was developed around a formal mathematical notion and that is the notion that we shall be looking at. So this was developed by Professor Kaur as a part of the system R research. System R was the first relational system which was implemented at IBM. This paper came out in 1972. So this is a fairly matured technology now. This provides a mathematical foundation for database modeling and it uses relational algebra and relational calculus. The query language, a simplistic query language which was defined for this model for people to access data instead of having to write elaborate programs with sorting algorithms etc. It was originally called a structure English like query language S-E-Q-U-E-L which is pronounced as SQL naturally which was subsequently standardized through an international standard organization and that is now called SQL. It is still pronounced as SQL by most people in honor of the original one. So SQL is the programming language which is now the lingua franca for data access. Let's look at the relational model. Please forget the relation word of the earlier thing. As I already said, you have to recall that it is an association between two sets. The relational model defines mathematically relations which are sets. Each relation is a set of n-tuple. What is an n-tuple? This word is composed to rhyme on a similar expression that we have for a set of values. For example, a couple means two values. A triple means three values. A quadruple means four values. An n-tuple means n values. N values together form a single element. And many such elements is the relation. In short, a relation is nothing but a set of rows in a table. Each row, how many values does each row have? As many parameters are attributes that you have. So if a table has five attributes, you have a five-tuple. And set of five-tuples is the entire table, which is the relation. Each relation has a schema. This is a new word. It comes out of scheme. You know about scheme. Not scheming in the English language sense, but the scheme of things, like organization. So organization of a table is nothing but a list of attributes. If you describe the list of attributes or the column headers in a table, if you are defining the metadata of the table or you are defining all the attributes for that entity, consequently you are defining the schema of a relation. So I am just suggesting a mapping from your ER model onto a relational model. Here is a sample relation schema for the simplistic information system that we looked at. Student is a relation. So actually this is not a relation but what is being shown here is the schema of that relation. What is the relation itself? The relation itself is the set of all rows in that table. This is the relation schema. That means student as a relation schema is defined as a set of five-tuple, which is roll number, name, student hostel, student roll number and CPA. At the moment I am restricting to this. The source is represented by a three-tuple. That means there will be three tuples in that table. Course code, course name, course credit. Rage as a relation will have, as we agreed last time, if on an average a student registers for six courses, then that table in that table form would have 5000 into 6 which is about 30,000 elements. Each element will have a roll number, a course code and agree. So is that clear? Relation is actually all the data in the table. This is relation schema. You will also agree that in so far as course registration and course grading system is concerned, these three would form the crux of the scheme. So we can say that this is the database schema or the main database schema for a database which handles academic information system for the institute. In actual practice it would have another 20 other tables or whatever, whatever will go into those details. But you understand the meaning of this? So this is the relational model. Given that this is the schema, each schema must also have a unique attribute which identifies the entity within that relation uniquely and that we call the primary key from the earlier modeling. We will examine this and elaborate the concept of keys a little more in this session. But suffice it to say that when I underline one or more attributes, those one or more attributes together form a unique key. That means the actual values in the end tuples will form the unique key for those entities inside that table or inside that. Is this clear? Here is a student relation. This circle represents a set. Please note that in the earlier diagram I had drawn only roll numbers. I have written here dot dot dot dot dot dot dot. What does it mean? That each element in this set actually is not just roll number but roll number, name, hostile number, room number, and whatever, CPR, any number. So this is like an element of this set. This is another element. This is third element. This is fourth element. The reason I have written roll number is very obvious because roll number is the key in the schema for relation. How many rows will this set have or how many elements will this set have? Number of elements is equal to number of students. Agreed? This student schema going back to the previous slide has five attributes. Consequently each element in this mathematical set is a five tuple. Five tuple means it has five attributes. Is this clear? Just a couple of questions here. In practice I implement this relation. Now this is a relation. If this is a relation that means it has as many elements as there are students today. If I represent this relation as a file then I can handle all the information and I can apply the same relational algebra or mathematical operations on that file as I apply them here. In this context I would like to see what are the relational operations and what are the equivalent SQL expressions for those operations. But one thing I would like you to understand very clearly if I have a file containing all students' information and let's say one student leaves the institute what do I do in case of a computer file? I delete the record of that student but I still continue to call it the same file student file, right? Suppose in July a whole lot of students join the institute or 5000 students some students have passed out 600 new students are joining. Consequently that file will have 600 additional entries but I still call the student as a student file only. However mathematically will that be the same relation as earlier? The relation is a mathematical concept. It's a set. If even a single element consider the following. A student who is there in the student file changes the room in a hostel. Happens quite often. As far as the file is concerned does the file change at all? In terms of number of records it has? No. There are 5000 students earlier there are 5000 students now. Only one student's data value for the room number has been updated. We still call it the same file. Mathematically is it the same relation? No. So please understand this. Any changes that you make in any element of the set for any one of the five tuple or six tuple that you have or if you add or subtract elements what you get is a new relation. Its physical implementation may be the same file but it's a different relation. It is possible then to imagine that as a student file maintained by a mundane data processing system I still keep referring to it as a student file, student file, student file perpetually but every time a change is made there I'm actually creating a new relation. Mathematically therefore you should be very clear on what is a relation. Its manifestation as a physical file is incidental. It remains a single file. Is that clear? We are now talking of maths so we should be very clear on that. So mathematically we have what we call a relation. This is a student set. A student relation as a table. So here I have shown this as an element of the set. There is one element, this second element there are 5000 elements. As I said some time ago if I change the hostel number, room number, somebody changes the name. Somebody can change the roll number? No, not easy. But I think when people ask for a change of branch sometimes the roll numbers used to be changed. I don't know what happened, whatever. It doesn't matter what. If something changes here although as a table I may still call it the same table as a mathematical relation it is not the same relation. It is a different relation. As long as you understand this the theory will be easier. Since we are dealing with relations where elements of the relation are n-tuples we now require mathematical operations on these relations in order to extract meaningful information out of these relations. We are very clear that each relation contains valuable information for us. How do we extract information out of this? Since relation is a set and since sets permit themselves to be operated upon by algebraic operations first a core thought of defining an algebra called relational algebra. A relational algebra has operators which are defined over relations. A relational operation results in a new relation. If you operate upon a relation, if you operate upon numbers you get another number. If you operate upon a set which is a relation you create another set. This new set, this new relation may have different set of rules. May have different schema depending on what operation you have or may be both. That is clear, right? And I operate upon the set. So it is in this context Prof. Scott defined some very simple and fundamental relational operations which help us in extracting information out of these relations and their manifestation called structured query language helps us to extract similar information out of a data table in the company. First we take the notion of keys. You are all familiar with primary key. In the context of relational algebra or the relational theory we define different types of key. We define a super key, we define a candidate key and we define a primary key. So let us understand what is a super key. Super key is nothing but a set of attributes which together uniquely characterize a specified end top up. What is the difference between super key and primary key? We will come to that difference. But to begin with you will agree that if in a table I take all the attributes then their value will uniquely identify a student. So for example, roll number plus name plus hostel number plus room number plus CPI will uniquely identify the student. Course code plus course name plus course credit plus course slot whatever will uniquely identify natural means common set. After all even if a roll number can uniquely identify then roll number plus anything will always uniquely add. How many super keys can you have? No, no, not how many super keys in a table. How many super keys for a table can you have? It's a metadata. Number of columns? I am not so sure. For example, I have five columns. Are there five? A super key is a combination of attributes. I just explained that one combination which combines all five is a super key. We also by common sense say that roll number alone is a super key where it can uniquely identify. So one sample is roll number, another sample is all attributes roll number plus name plus this. How many such super keys can you think of for the student table? Can name plus hostel be a super key? No. No, because there could be an asant living in hostel 4 and certainly another in hostel 4 in different rooms. Can name plus roll number be a super key? No, another joker in the same name may be another hostel. So super key is a mathematical concept. When you say unique it is guaranteed to be unique. Primary key has to be part of the super key. But the primary key as roll number that we decided was an artificial choice. Can you not imagine a table with 100 attributes which might have multiple possibilities for primary keys. Indeed that is the next notion of a candidate key. Mathematically candidate key is a minimal super key. What is minimal? You take five attributes which form a super key. Now you remove one attribute from it. Does the remaining part still remain a super key? If yes, then that is a lesser representation of that super key. It is minimal. You keep on doing that. You will hit upon a set of attributes beyond which if you don't anything it will no more uniquely characterized. Then that is called a minimal super key and that is known as a candidate key. We take some examples. If you have multiple candidate keys in a schema then you have a choice of designating one of those candidate keys as a primary key. The choice that you make is called the primary key. The primary key does not descend from the sky. From your information modeling candidate keys will come out. If you have only one candidate key, choice is simple. That has to be the primary. But if you have 10 candidate keys possible, you may choose one of them as the primary. So let's illustrate this by some example. Take the student. Roll number, name, student hostel, room number, CPR. How many super keys are there? S roll plus S name is a super key. SH plus SR, hostel number plus room number. This you are not familiar with but last time we made an assumption. Currently it's a hypothetical assumption but when IIT was established that was true. Namely that one student is allocated one room. Today there are many rooms in which unfortunately two students have to live. Assume that hypothetically and in Golden Jubilee we are launching a campaign that in five years we shall ensure that a student has a room to himself or herself. That's the design. Assume that it is true. Then theoretically is it not correct to say that the hostel number plus room number uniquely characterizes a student? If that assumption is correct, then SH plus SR is also a super key. SH plus SR plus S roll is also a super key. SH plus SR plus S name is also a super key. So you can actually multiple super keys. Now we come to the candidate keys. Candidate keys are reduced to minimal super keys. So if you reduce from this and this and anything else, everything that still retains the uniqueness then you find out that S roll is a minimal super key. So it is a candidate. Similarly you look at hundreds of such super keys, you will find SH plus SR is also a minimal super key. And you will find only these two as possible super keys for this particular model. In actual practice you have to hunt for super keys, I mean minimal super keys and identify candidate keys. In normalization theory we shall see that one of the normalized form definition actually depends upon your ability to identify candidate keys. And then we say we choose S roll as the primary key because it is a practical choice because as long as the student is here, we would like the student to be uniquely identified throughout his or her stay in the campus. If a student changes the room, even in a single room concept, we will have to change the primary key for that student. So till yesterday the student was identified uniquely by something else, from tomorrow he is to be identified by something else. Not very practical. Not only that, after a student passes out, another student may come and stay in the same hostel and room them. And 10 years later when those two meet and if they say I was identified uniquely like that, he says what nonsense, I was identified uniquely like that, there will be problem. Ideally we would like, as far as the issue is concerned, that a real number is never repeated in life for anybody. And obviously then S roll is then a better choice. Please also note that this is a standard dilemma in real life information systems, that you cannot uniquely identify an entity by natural attributes which that entity has. Roll number is not a natural attribute. You were all given nice names by your parents when you were born. Roll number is something artificial given to you by this institute. You all assume it to become an integral part of your information system but it's not natural. And you will find invariably that in computerized systems, unique identification requires an injection of an artificial attribute. Roll number, course code, part number in an inventory system, train number, seat number, all these are artificial entities required to make something unique. But that's okay, we can live with that. Is the concept of super key, candidate key and primary key absolutely clear to everyone? Is this clear? Okay, fine. And now we come back to conclude that in any schema that we design for a relational setup, we must have a primary key identified for every relation. If a relation does not have a primary key, then we refuse to take it as a relation. In the worst case, all attributes of the schema together must form a primary key. Why? Simple, a relation is a set of elements. No two elements can be same because if they are, these elements cannot be identified separate. So in any mathematical set, all elements must be unique. To guarantee uniqueness, we must associate a primary. Is that very clear? So in a mathematical set, each element must be unique. In our context, the uniqueness is guaranteed by the primary key. We must therefore identify primary key. Here is a sample relation. S-Roll, S-Name, S-Hostel, S-Room, SCPI. So Vijay Amre, Parag Samarth, Arul Manazze, Sajay Shah, Rajam Tambay, Hostels, Rooms, SCPI, etc. We now look at this relation as containing basic information for the student and we now look for a relational operator which will help us extract meaningful information out of this. In physical practice, we will have a file and we are looking for manifestation of that relational algebra expression to be converted into a simplistic programming script by which I can extract information from that table of database. So here are the basic relational algebra. This is the sample relation. Here I am just exemplifying the fact that in this thing, you take, there are one, two, three, four, five rows. So there are five elements. One, two, three, four, five, six, ah, something happened. Sixth element got added. You might think it is the same table but as I told you, no, this is a different relation. Why? Because a new row has been added here. You delete this row. You get back to the original relation, no, this is the third relation. It incidentally happens to be same as first relation, that doesn't matter. If there is a change here, change in the 7.25 from the previous table where here Rajan Tambi had a CPI of 6.85. He slogged his ass out and got a CPI of 7.25 or other way round. Then that makes a relation a new relation. Just to exemplify that relations keep coming as when you add, subtract, change, whatever. We are now talking about extracting information out of the table but before that just read this out. If any value in a table or a file changes because of add, delete or update, such a change can be made in the same table or file physical. However, a relational operation results in a new relation always. So distinguish between the mathematical model and the real-life model. The real-life model is supposed to represent the mathematical model but it has this dilemma that in practice you don't keep creating new files every time you make a small file, you refer to the same file. Relational algebra operations are used to extract meaningful information from one or more relations and each relational algebra operation can be mapped directly into structured query language. That's what we are going to see now. These are the four fundamental relational algebra operators. One is called selection. This operator gets you rows from a relation and creates another relation. The other is called projection. This gets you column from a relation and makes another relation. When you get only few columns from a relation, then the schema itself is changed of the new relation. Old relation adds five columns, its schema adds five attributes. New relation has only three columns, the schema has three columns. It might have all the rows but it's a new relation. Cross-product combines rows with multiple relations. If you imagine the student set and the course set that we saw earlier with 5000 elements here and 500 elements here, this cross-product will get you a relation with how many elements? Cross-product. 5000 into 500. Cross-product of two relations is number of elements here, multiply by number of elements there. 5000 into 500. You will all agree that this 5000 into 500 will be large number of elements. Not each one represents meaningful information to me. Cross-product means take one element from here, take one element from here combined. So all of that may not contain meaningful information. We shall see an example and therefore we shall deduce that we will need something more than cross-product. In relation algebra it's called a JOIN. JOIN means get meaningful information from cross-product which may contain a lot of rubbish. We shall see some examples from this. Here is the selection operation. It is represented by the symbol sigma. And there is a condition P which is called a predicate which is written next to sigma. Sigma is a selection operation which says select certain rows from the relation R which satisfy the condition predicate P. This will result in a new relation R1. Since I am selecting certain rows from the relation R, the schema of R1 is necessarily same as schema of R. I am selecting rows, full rows, right? P must be a valid condition on R if I choose student relation. And I want to select rows which represents let's say people who have scored more than so and so CPI or people who live in hostel 8. Then that's a condition. But I cannot put a condition there saying select the rows from the student where course code is something. The condition must be relevant and therefore must be composed of comparison operations on the attributes of that relation schema only. Common sense? Let's look at this. So R1 equal to sigma PR is a representation of a relational operator called selection. Here is an expression. Sigma SCPI greater than 9 and SH equal to 3. Let's say I want to select all the academic top performers from hostel 3. In the sample you might have noticed there was a row containing this name. Arul Maniz actually was a CS proper of old times. Not even 98 I think is much earlier batches. But anyway, so when I apply this operator to student relation, what will I get? I will get all those rows which are hostel number 3 and which are CPI greater than 9. I have shown one row based on that sample table. But if you have all 5000 students and quite likely that there are let's say 20 students in hostel 3 whose CPI is greater than 9, you will have 20 rows. These 20 rows will constitute the new relation R1. Would you agree that such an extraction of information is fairly powerful? Because I can quickly extract meaningful information because I can write conditions which are meaningful in my information processing. If you were to implement this relational operator called selection in a C program or a Komal program or a Fortran program, can you imagine how will you write the program? You will have a table which is a file, say file called STD file or whatever. Then first you have to define what that file is. Then you will have to define what the attributes are. Then in the actual program itself, the function or procedure that you write, you will have to say open this file, start reading a record. When you read one record, examine whether this CPI value is greater than 9, hostel number is this. If so, print this. If not, go back. Now you can't keep going back because the file is suddenly n. So you will have to put an appropriate n condition checking. At the end, you will have to close those files and get out. At least a 30 line program that you will have to write. The beauty of the relational model is Professor Card stipulated that I can write a structured query language query in SQL for doing this operation which can be written like this. Select start from student where SCPI greater than 9 and SH equal to 3 semicolon. This is a complete SQL program. He'll ask me where is the student defined? Well, of course, before you write this query, the student table in SQL must have been defined separately and somebody must have inserted the data. The point is the student table needs to be defined only once. Course table needs to be defined only once. Data optimization can happen as and when changes happen. But once the table definitions are there, a database management system uses those table definitions for answering any queries. In the query itself, you need to write only this problem. I can bet you that there is no simpler way of expressing this query than in an SQL. That is why SQL is called a fourth generation programming language, not a third generation language, not a procedural language. You're not telling it what to do. You're saying this is what I want. Select start from student. What is a start here? Star means select all attributes of this. So if there are 5000 rows in that table or 5 million, if the number of rows which qualify as per this predicate, this is a predicate, where? Condition. This is a predicate where there are 1 million rows which satisfy the predicate or 10 rows or 1 row or 0 rows. This SQL query will work exactly like that, giving you that. Can you see the beauty now? The relational model, a simple relational algebra operator mapping directly onto SQL. Second operation. This relational operation is called projection. Projection as in, at least some of you would be engineers, right? So you would understand first angle, third angle projection in your drawing. For even others, projection is you take some structure and you project only one part of that structure. And that is called a projection. Here we are saying that this relation R has a schema consisting of so many attributes. I want to project only a certain subset of those attributes given in the list. This relational operation is represented by the symbol Pi. So Pi list R means take this relation and forget all the rows, all the attributes are not mentioned. Only those attributes are mentioned in the list. Extract all the rows for that. Basically take a subsection of the entire table and that project it out. That is this new relation. R1 has a new schema. What is the cardinality? Cardinality in the context of set is the number of elements in that set. How many elements will this set? R1 have any idea? Same as R? Yes. 100% sure? Absolutely everybody agrees? Okay. Let's see some examples and see whether that is really true. In general, by the way, what you say is true because it's common sense. I'm just projecting means I'm taking a part of the structure. But if you recall that a relation is a mathematical set and must therefore have unique elements in it, then you will see that this may or may not be true. Let's look at some examples. Consider this. R1 is equal to pi CNM comma credit course. Since usually all course names will be unique anyway. If there are 500 courses in the course table, then this relational operator means I will take the names and credits of all 500 courses. In this case, the number of rows in the resultant relation R1 will be same as number of rows or number of elements in the relation course. This is exactly the answer that you gave. So this is generally correct. We shall see an exception soon. However, before that, let us look at the representation of this relational operator in an SQL program. The SQL query for this operation can be written as select CNM comma C credit from course. Notice that in the earlier query that you saw, we had said select star. And I explained that star in SQL means all attributes. If I don't want all attributes, as in this case, I want to project only part of the attributes. I say select CNM comma credit. In short, for the relational operator pi, whatever is the list, you write the list as list of attributes to be selected. Unfortunately, the word select selects, that's the selection operation. It also does the projection operation. That is because stupidly, the SQL query language right from the beginning said, if I want to do, if I want to extract any information from the database, I shall have a word called select. So it uses select to represent everything. If you discount that nomenclature, then select star from something where something, where something, the where is the commanding clause representing the selection relational operation. You can understand that because the predicate is written here. Similarly, select specific sublist is the projection operation higher from, etc., is incidental that details the relation on which you are operating. So is the projection operation clear? Before going further with the combination, which all of you must have guessed, I can combine selection and projection. But before that, I want to answer this crucial question where everybody in the class seems to be unanimous that the projected relation must have the same number of elements as the original relation. So let's look at this. Project, student hostels from student. How many students? 5,000 students. The SQL query will be select SH from student, right? How many elements would R1 have? If there are 5,000 students, R1 should have 5,000 elements? Because I am now projecting only one attribute. That attribute stupidly does not have 5,000 unique values. They have only 13, 14 hostels in the campus. They have 5,000 students. So if I try to extract information from this table, sample table, let's say, you will notice that if I extract all the, truly project a portion of this table, then I will have 0, 2, 0, 8, 0, 3, 0, 3, 0, 8, etc. These are real values, of course. But as a mathematical relation, do they make sense? As elements of a set? No. Elements of a set must be unique. Please remember that the new set that is created out of projection has its own schema. We said that. What we forgot temporarily is that any time a relation is created with a schema, it must have a primary key which must be unique. If the new relation has to have a unique primary key, it cannot have elements which are identical. Consequently, the relational algebra operator will automatically chew up all the common values, rather duplicate values and give you only a set which will ultimately have how many values? 0, 1, 0, 2, 0, 3, 0, 4, up to 1, 3, period. It cannot have more than that. However, unfortunately, the SQL query language is not as smart as mathematics. So if you run, if you do a mathematical operation, you will get only these 13 unique values. If you do a SQL query, select SS from student, it will stupidly list 5,000 values if there are 5,000 elements. In order to ensure that we have a construction in structured query language which mirrors the relational algebra mathematical purity, there is a modification that is provided. That modification is select distinct SH from student. If you serve the word distinct, as you can see, it means only distinct hostile number should be extracted. This will give you the result as 2, 8, 3, whatever. You appreciate this point? The obvious extension. We can write a relational expression combining selection and projection. Sigma SCPI greater than 9 and SH equal to 8 on student. All this in bracket project S name. The reason we can write this is because this is a relation. The relational operator sigma operates upon this, creates another relation. Since that is a relation, I can do a projection on that relation. So project S name of this bracketed quantity. Please remember when you write relational algebra expressions in your notebook or something, don't use big bracket, smaller bracket, smaller bracket. You use same size bracket. Starting bracket and end bracket identifies what is inside. What is inside as far as projection pi is concerned is a relation. It incidentally happens that it's not an absolute relation, but itself is a result of some relational operator being applied on another. And now as you would have all guessed, the representation of this expression in SQL is pretty straightforward. Select S name, that is this projection operation, from student which is the base relation, where SCPI is greater than 9 and SH equal to 8, that is the predicate. So this will give you only names of those students from hostel A who have an academic performance of greater than 9. Can you write any programming in any known programming language this easily? This is the power of expression of SQL. And this power derives not in an ad hoc fashion, somebody has designed a programming language, it's very simple, but it derives from a strong mathematical formulation, strong mathematical theory behind it. We're not going to deal with relational algebra in the rest of the class at all. This session will look at it. We are going to study SQL in some more details. But it's important for you as scientists and engineers to understand that relational databases are not developed in an ad hoc intuitive fashion. They have evolved out of a strong theoretical base, which is the combination of relational algebra and relational calculus. Those who are interested can read up the standard text by Sudarshan and others. He describes both these adequately. Cross product is the first operator, in fact the only operator really, generally independent operator, which works on two relations. Cross product, you are familiar with cross product of sets. R1 cross R2 is cross product of two sets. It results in a new relation R. In the context of the relation that we are talking about, it concatenates each tuple with R1 with every tuple of R2. Roll number, name, this, that, that, combined with, let's say course, course, code, whatever, whatever. Is there any meaning in combining and calculating the cross product of a student in a course set? It's completely meaningless, right? Completely unrelated information. But the cross product is the only mechanism to extract any information out of multiple tables. The earlier two operators are called unary operators because they operate upon a single relation, gives you a single relation or single table extracting information. This is the only operator which works on multiple tables or multiple relations. So, if there are M and N elements in the two relations, the resultant relation has M into N elements. What is the schema of the resultant relation? The schema of the first relation plus the schema of the second relation. That is the schema of the, could get pretty messy. Here are two example tables. Notice I have chosen the tables very carefully. The student table and the rage table. You will agree that taking a cross product of student table and course table will never give any meaningful information at all. I am concatenating two completely independent entries. But student table has student information and rage table has the association information for every student with courses. If for example, I want to find out the names of students who are registered for this course CS64, the rage or registration relation will give me the roll number of students who have registered for CS64. But I cannot get the names. This table has names of people but has no information about courses. It is very obvious intuitively that if I want to get the names of students who are registered for CS64 somehow I must combine information from these two tables and extract something meaningful out of it. You agree? Indeed whenever you take cross products or subsequently the joint that we shall define it always involves one of the tables to be the association representation that diamond representation and the other table to be one of the entities participating in that relation. Is that clear? So here is an example table. Treat this as a relation. Treat this as another relation. Incidentally these are not artificial names. Sunita Sarawagi is a brilliant researcher in data mining and machine learning. Kesha Nori used to be a professor. He is now the chief knowledge officer of Tata consultancy services. Krishna is a retired professor from IIT Madras who was a great computer science researcher in the last four decades. So these are real names. So please remember that this table has 5000 entries. This table has how many entries? No, no. 5000 into 6. 30,000 entries. Now if I take a cross product of these two how many total entries I will have in the relation? 5000 multiplied by 30,000. Count the number of zeros that you have. The information that I shall have out of such a cross product is this. 8900501 to Sunita Sarawagi then 8900501 to CH413. Notice this row is being combined with this row. But the cross product means this row will also be combined with this row. It means this row will also be combined with this row. Please note that when I combined 501 to Sunita Sarawagi with 89007017 to CH634 I have no meaningful information. But unfortunately that is the stupidity of cross product. Cross product will combine this with this, this with this. So I will get such a table. You will agree? Why do I do such a stupid thing because it's not giving me any meaningful information? I do it because in relation to algebra cross product is the only way to look at information from two tables. There is no other way, period. So I have to do cross product. But when I do a cross product I notice that while there is some meaningful information somewhere. For example, this row definitely represents meaningful information because this row number is same as this row number. So therefore this registration record pertains to Sunita Sarawagi is known. There is some association. Similarly, this row has an association. But this row is meaningless. This row is meaningless. This row is meaningless. Similarly, Keshavnori getting associated with the registration record of Sunita Sarawagi does not give any information. If I analyze this, I will notice that in this table, in the sample, sub-table of cross product I have this row which is meaningful, this row which is meaningful, this row which is meaningful. If I have 5000 entries in the student table and 30,000 entries in the registration table the cross product which will have 30,000 to 5000 elements how many elements are likely to contain meaningful information? 30,000. It is only those 30,000 rows of the second table when combined with the appropriately corresponding row of the student table which will contain meaningful information. All other thing is junk. So, if the relational theory says that cross product is the only way to extract information out of these two I will say alright, perform the cross product but don't give me all the junk. Give me only the meaningful information. Now, can I not find the meaningful information out of this cross product by doing a selection on this entire big table knowing that this big table is a relational and I can apply relational algebra on this. What would be the kind of predicate that I would like to stipulate for this meaningful information extraction? I would like to say that out of these millions of rows that you have please ensure that this roll number which came from the first apple is same as this roll number which came from the second apple. You agree? All such rows and only such rows will have meaningful information. In short, what am I doing? I am joining the two tables on the common value of an attribute that isn't both the tables. That is why the notion of join as a relational algebra. So, I hope you have understood that join is nothing but a selection on the cross product in order to separate out the shaft and gives you only those rows which have a joining condition which have identical value in some common attribute or attributes that you have identical. That is precisely what the join operation is. Join is defined to get meaningful set of rows from a cross product. Mathematically, it is defined as R1 cross R2 being applied further by a selection operator with a P as a predicate giving you a new relation R. However, this P is not any general predicate that you can think of. For join, the predicate provides for meaningful information extraction. The meaningful information extraction can be done in one of the two ways. One is called a natural join. The other is called a theta join. Mathematically, let's see what these are. Natural join ensures equality of values of all common attributes. Clearly, that is the join we are looking for. What are the common attributes in two schemas? Roll number. So, if a join operation guarantees that it will extract those rows of the cross product which have same value for the common attribute roll, then I call it a natural join. Natural join need not be defined any further except by saying instead of showing only a cross product, I put two vertical bars around cross product. Student natural join range is how you read this. Student cross range will give you the cross product. Student natural join range will give you the join operation. This is exactly same mathematically as this expression. Student cross range, you form the cross product. On the resultant relation, you do a sigma. What is the condition however? The s-roll of the student is same as s-roll of the range. Notice the way you identify because there are two s-rolls. You say student dot s-roll equal to range dot s-roll. This is the way in a scale also you will qualify the same way in relation algebra you will qualify the same way. That is the name of the relation followed by the attribute. You don't have to say that if you express it like that. Of course, you must have common attributes on both the tables otherwise natural join is meaningless. There is another thing when you form a natural join. What should be the schema of natural join? We said schema of natural join must be schema of the cross product and that will have s-roll, s-name, this, that, s-roll, c-code, etc. Notice however that once I say that s-roll from here and s-roll from here must be identical, then there is no meaning in keeping two columns of s-roll in the final relation. Consequently in the natural join. This is the s-query for the natural join. Select star from student natural join range. See the simplicity of expression. This is a later edition after the international standard where SQL was upgraded from 86 to 89 and so on. Earlier you wrote an SQL query like that. Select star from student, range where student dot s-roll equal to range dot s-roll. The key feature is select star from student, range. We are writing two table names in this from clause. We can write three, four, whatever. Please note the mathematical implication and the actual implication of writing multiple table names. If you write more than one table, then SQL always implicitly performs a cross product of those two tables. So if you stop this query here, select star from student, range you will actually get the cross product. Since you don't want to let the computer do the gola query because then you will have to do the additional gola query of reading through millions of rows to figure out what it is. You will always say select star from student, range where student dot s-roll equal to range dot s-roll. This again is a complete SQL program. Try writing something similar in four terms. You will find that extraction of some information from a single table was a lot simpler. Join algorithms even implemented inside SQL engine are not easy to write. A whole lot of research has been happening. Sudarshan in fact specializes in query optimization as his area and there is a huge lot of research going on in IIT Bombay on that. It's a perpetual. What is a theta join? Theta join is nothing but an extension of natural join. Theta is any meaningful predicate. Usually has a join condition. I'm sorry. There is a mistake here. This is not okay. You may write it. This is the way you write it. But this is not a selection out of natural join. This is another selection out of the cross product itself. What is the difference? R1 cross R2, sigma, some predicate. Earlier in natural join, the predicate was fixed. Namely equality of values in all common fields. In this particular case, the theta join permits you to give a partial join condition because without it you will not get meaningful information anyway or full join condition and in addition some other condition that you may want to put. For example, look at this theta join. R student theta join range where theta is student dot s roll equal to range dot s roll and SCPI greater than 9 and SH equal to 9. Notice we are very similar to this. Students from hostel 8 who have a CPI greater than 9. If I extracted information only from student table I will get their roll numbers. This is the representation. Select star from student comma range where student dot s roll equal to range dot s roll and SCPI greater than 9 and SH equal to 8. I am doing a theta join. Is it required at all? What information am I getting out of this? Okay, let me ask you this question. If I wrote select star from student where this condition? It would not make sense because the condition is student dot s roll equal to range dot s roll. The other condition is what? SCPI greater than 9 and SH equal to 8. So what actual information I am getting in a semantic sense from my database? Am I getting any information which I would not have got from only one table? The courses that such students have registered. All the courses so I am actually getting the course list of a student's registration form but with full information about name, roll number, hostel number etc. Ideally what kind of thing I would like to extract in such a case? CPI or whatever. Suppose I want to get the names of the students of CS634 who have a CPI greater than 9 and who stay in hostel 9. How will I modify this query? Select star from student, range where student dot s roll equal to range dot s roll and SCPI greater than 9 and SH equal to 8 and C code equal to CS634. Out of these, these are theta conditions. This is the joint condition. Again notice that there is no equivalent simple mechanism in any programming language to get this kind of information. We need to stop here.