 With this, we start our second session, which is about relational models and an introduction to databases. First, we look at the relational model. We will have a very brief introduction to relational algebra, which is the theoretical foundation. In the process, we will also introduce two structured query language, the programming language, which is used mainly for database, which is what you are going to then practice over the next five days. The relational model was developed by Professor Codd, as I mentioned earlier and this formed the basis of a great research work at IBM in system R. System R was relational system. Before that, the databases, which existed, which were defined by Codacil. Codacil is the committee on data system languages, which defines these databases. The same committee actually defined many standards related to business information processing, cobalt programming language. How many of you know the origin of cobalt programming language? This curiously came as a committee defined standard. There did not exist any cobalt. The first programming language to come in this world, higher level language was FORTRAN that appeared in 1956, but it was found to be inadequate for solving many of the problems related to data processing. So, a committee was appointed to study and define a programming language. Then that committee was taking a very long time. So, it was renamed as long term committee and a short term committee was appointed to come up with specification of a programming language, which could be used in the interim till the long term committee works. There is a lady from United States Navy, who headed that committee. The Department of Defense in United States used to be the purchaser of 50 percent of all computers manufactured in the world. So, they were major users. So, they formed the chairperson was that lady and there was a group. It was a very curious discussion. They defined all the features that you see them even today, like short features that all they defined. On the last day, when they were concluding that report, somebody said we have not yet come up with a name for the language and some names were doing the round. One of the younger participants said how does it matter? This whole language that we are defining will stay alive only for 2 or 3 years, because the long range committee is going to define things later. So, the names which were being discussed were COSIL, COCOSIL, etcetera, rhyming on CODASIL, which was the committee. He suggested COBOL as common business oriented language. This name was never discussed in that committee till that point and it was discussed exactly for 5 minutes with the rationale that anyway this name and the language is going to die because the long range committee will actually work later. So, they submitted the report. The report was so appreciated that the long range committee said we have no more work to do and they wound themselves up. That is how COBOL has stuck with us all our lives. So, the funny thing that happened in the history of development. Database came much later. As I said the network databases and the hierarchical databases were essentially an improvement on navigating associated records in a large file system. But many features of the databases like this atomicity of transactions or like the concurrency control, like the recovery, they were actually made part of the database. The notion of schema was already developed. But the relational model came out of this work quite independent. So, Professor Codd is actually the creator of this model and the system R which resulted in DB2, the database 2 of IBM mainframe, which is now available on Unix and other systems. There was the world's first major commercial database. Simultaneously Ingress was the world's first database, which was born in the university environment. The possess SQL which you are going to use is actually a outgrowth of the Ingress database management. There is mathematical foundation for database modeling. There is a relational algebra and relational calculus of which we are going to briefly look at relational algebra and the most important part is structured query language. Originally it was called structured English like query language and was termed SQL, S-E-Q-U-E-L. When the databases became international standards, the name changed to SQL, but it is still pronounced as SQL in honor of the original name SQL. A relational model merely formalizes something that we have briefly discussed in the previous set. You remember those sets of students and sets of courses and the association sets that we drew. Imagine that each set is a mathematical entity. Based on that fact that each set can be recorded as a mathematical entity comprising of elements of a set and mathematical operations can be defined on these sets. Came out the relational algebra or the relational theory that professor Kaur proposed. He defined a set, a relation to be a set of n-tuples. So, what is an n-tuple? It is a new name for many of you. You probably know a couple. Couple means what? Two values. Triple means what? Three values. Quadruple means what? Four values. So, if some entity has two attributes, we say it has couple of attributes. So, entity has three attributes. It is a triple attribute. Quadruple attribute. N-tuple means n attributes. So, n-tuple is nothing but something rhyming on double, triple, quadruple and therefore the name n-tuple. So, n-tuple is nothing but a set of n values which represent an entity. So, for example, if a student is represented by roll number, name, hostel number, room number and performance. So, these are five attributes. Then we say there is a five tuple which represents one student. How many n-tuples will be there in the set? As many as there are students. If there are 5000 students, there are 5000 n-tuples. If there are 300 courses, there are 300 n-tuples. If the range set, the range set has n-tuple is what? Roll number, course code and grade. So, it is a three tuple. How many such three tuples will be there describing the range? 30,000 may be. The number is not important. What is important is each element of the set is unique and each element comprises of n-tuples. These correspond directly to rows in a table as we have seen. Further, each relation has a schema. What is the schema? Effectively, it is nothing but FD entry for that five, except that it is guaranteed that there is no occurs clause. All values are single valued attributes and there is a primary key associated with the table or five. So, this is similar to column headers in a table or FD entries in a cobalt file. Conceptually, there is not much difference. The fundamental conceptual difference is in treating these entities as sets, mathematical sets, because that is what gives rise to very powerful paradigms of expressions which result in SQL data. Here is a sample relation schema. Student S roll, S name, S hostel, S room, S CPI, course, course, course, C code, course name, course credits, range, S roll, C code, grade. You are all familiar with this. So, this is the schema of the relation. The actual, so this is schema is nothing but metadata. The actual relation is those 5000 tuples in the schema for student, 300 tuples within the schema of course and may be 30000 tuples in the schema for range. That is the actual data. So, relation is the actual data. Relation schema is metadata. This much is clear and greatly similar to our concept of files. Now, the fun starts. So, here is a student relation which is looked as a set. What is important is to imagine this not as a file of individual records, but as a mathematical set. So, this is a mathematical set. The number of elements in this set is equal to number of students and each element in the set is a 5 tuple if the student has 5 attributes. Similarly, there will be a course set and there will be a range set. A student relation as a table can be seen like this. So, every row is a n tuple or a 5 tuple. So, this is one element. This is a second element, third element, fourth element as many elements. The relational algebra is nothing but a mathematical algebraic foundation where each relation is treated as an element and operators are defined over relations. So, mathematically an operator is what consider an addition operation. Addition is defined over a set of integers let us say and 3 plus 2 is 5. 5 is also an integer. An important part of the mathematical operations of this type is that whenever an operator operates upon elements the result is an element of the same type, the same set. So, it remains in the same boundary of the overall mathematical structure. As we shall see it may not be an element of this set or that set. It may be element of another set because a new schema may get defined in the process, but it is still an element which can be treated exactly like a element of any other set in this mathematical structure. A relational operation results in a new relation. In terms of our table definition or file definition we can consider this to mean that it may result in a table with different set of rows, but with the same schema. It may result in a different schema, but with the same set of rows or it may result in both, but it will still be an n tuple of some schema of some relation. Now this simple and elegant mathematical structuring permits us to derive immense value in terms of our power of expression to extract information out of these sets as we shall see in a moment. First of all we discuss the notion of keys. We will discuss it briefly here. We will discuss it more formally later. We are all familiar with primary key, but let us consider the nomenclature that is used. One is called super key. The next is called candidate key and the third is called primary key. So, let us see what these animals are. Super key is a set of attributes that uniquely characterizes a specific n tuple in a set. So, any combination of attributes whose values uniquely characterize a particular tuple is a super key. If you take all possible super keys and try to see which is a minimal super key that means if you remove one of the attributes from the set of attributes, the remaining thing is still a candidate is still a super key then the original was not minimal. This one is minimal you keep breaking it like that. When you get a minimal value then you say that is a candidate key. There could still be multiple candidate keys each one of which is minimal in this sets. Then out of those candidate keys we choose a particular key depending upon practicalities. That particular choice is called the primary key. So, primary key is nothing but a chosen candidate key. Let us examine this in the context of our sets. Student has S roll, S name, S H, S R, S C P R. So, which are the super keys which are the unique identity fires. S roll is apparently unique. S roll plus S name is also unique. In fact, S roll plus S name plus S S everything. S roll plus anything is unique. So, please note that there are many possible super keys. If we assume that every student lives in a single room which is allocated to every student, then as we saw S H plus S R is also a super key hostel and room number. S H plus S R plus S roll is also a super key. S H plus S R plus name is also a super key. So, there are large number of super keys which are possible. Which are the candidate keys? Candidate keys are reduced minimal super keys. So, for example, S roll plus S name is a super key. If I remove S roll and S name separately, is S name a super key? No, but S roll is. Since S roll is a super key, S roll plus S name cannot be a candidate key because it is not minimal. All such minimum super keys are candidate key. It is very obvious in this example that S roll and S H plus S R are the two candidate keys. Out of these two candidate keys, I have to choose one. As we discussed briefly, it makes eminent sense to choose this artificial number S roll as the primary key. Therefore, primary key is S roll. In a nutshell then, we may have many super keys. A super key is nothing but a combination of attributes whose value uniquely identifies a tuple. Out of all those super keys, we find out minimal super keys. They constitute the set of candidate keys. Out of multiple candidate keys, we may choose anyone to represent unique identifier for that entity which becomes then or unique we will not talk of entities now. These are elements, mathematical elements, uniquely identifying element. Then, we call it a primary key. You identify yourself. Myself, I am Blash Kumar. Basically, this means that all possible combinations of attributes which can act as a primary key are candidate keys. Candidate keys can have a field which where null value can be applicable. No. Never. In fact, no super key can have an attribute which can be nullable. What will happen is that if there are null values, then there is no uniqueness that you can guarantee. No, because that is a combination of fields. Correct. Suppose there are five attributes forming a super key. Fifth attribute can be null value. Imagine two elements which have the same value for the first four attributes. For the fifth attribute, both of them have null value at a instance of time. Then, you cannot uniquely associate any one of these with one particular element. In fact, in the entire discussion on keys is a very important point that he is making. In the entire description of keys, we should never have an attribute which is a null valuable attribute as a part of the key because we will lose the important feature of the key, namely unique guaranteed, unique identification of an element. Therefore, null values or null valuable attributes should not be part of this. For example, CPI which could be a null value. Of course, CPI will never occur alone or CPI will never occur in combination of other things in this particular example, but you are very right. There should be no null valuable attribute. Super key, candidate key and primary key is clear. Later on, we will have a more mathematical definition of this. Here is a sample relation S-roll, S-name, S-h, S-r, S-C-P-I. This relation is shown as a table. A table intuitively, all of us consider to be a file. There is one important difference between a relation and a file or a table. For example, suppose a new student joins in and suppose you have a file in cobalt. What will you do? You will insert that student in that file. The original file remains as that file. If you call that file as student 5, then the student file remains. Mathematically, however, when you insert a new student in a relation, you get a new relation. The original relation adds so many rows, the new relation has so many plus 1, the relation is different. Mathematically, these two relations are not same. So, you have to call it student and student 1. This is very, very important to appreciate that mathematically, a relation where even a small change occurs anywhere is not same as the original relation. That is why when you say operation, operation will always produce a new relation. Here is a sample relation where S CPI is shown here. Consider this, Vijay Amre Parag Samarth, Arul Manages Ajay Shah, Rajan Tambay from the previous slide. Here, I have Sanjay Mistry, Vijay Amre Shaukat Ali, Arul Manages Ajay Shah. You see one row has been added here. So, this is a new relation. Consider that Sanjay Mistry is deleted and Rajan Tambay's CPI is changed to 7.25. For us as cobalt programmers, this is the same 5 as the values have been updated. That is what indeed happens even with database schema tables. But mathematically, this is a different relation. It has different set of attribute values. Therefore, it is a different relation. This is just important to conceptually understand the mathematical model. Otherwise, in practice, we are not going to have every time something changes, you are not going to have a different file. But mathematically, it is a different entity. As long as we recognize that, that is okay. In practice, it does not affect because such updates or inserts do not change the schema of the relation at all. The schema remains the same. Therefore, the mathematical structure also remains the same. If any value in a table or a file changes because of add, delete or update, such a change can be made in the same table or file. However, a relational operation results in a new relation. This is the conceptual difference I was talking about. Relational algebra operations are used to extract meaningful information from one or more relations. That is the importance of relational algebra. We shall see relational algebra operations and we shall see for each operation a corresponding SQL query to which a relational algebra operation can be directly mapped. That is what gives the power of expression to anybody who is trying to extract information out of such tables. We shall briefly see that. What are the relational algebra operators? There are four operators. This is a new terminology for many of you. You might want to pay special attention to the mathematical aspect. Selection. In plain English, it means gate rows from a relation. If your table has, if your set or relation has 5000 students, then you want to select some 200 students. On what basis would you select? That basis is called the selection predicate or the condition on which you want to select rows. But the operator mathematically is called a selection operation. Projection is to get columns from a relation. Let us say you have roll number, names, CPI, hostile number, etcetera, etcetera. You just want roll number and names. Now, when you want just roll numbers and names, you want two of the column values out of that entire table of five columns, then you are extracting columns from a relation. Here again there is some condition. What is that condition? You give the list of attributes which you want to be extracted out. That operation is called projection. Projection term, those of you who have done some engineering drawing would remember. Anyway, if you have a three dimensional entity, you keep it in the sun. The shadow is two dimensional. So, the shadow is the projection of that original thing. It is like this. You have a five dimensional or ten dimensional entity in a relation. Some dimensions you project and that is called, that is why it is called a projection. Cross product is another mathematical term which is very easy to understand. It simply combines rows from multiple tables. For example, student table has 5000 rows. Course table has 300 rows. What are the total possible rows that you can make out of a combination of these two tables? 5000 into 300. That is called a cross product. A cross product simply results in number of elements which is equal to the product of individual number of elements. The way the final schema of the cross product is defined is very simple. You concatenate the schema of one and schema of another. The totality is the schema of the cross product. The number of elements in that is you take one element from one table, concatenate one element from second table, that is one row. First element of first table, second row of second table, that is second row and so on. That is how you get 5000 into 300 rows in that concatenated table which is called a cross product. Clearly, the cross product table that show results or relation show results is highly unlikely to be useful at all. It will have millions of rows which will contain some kind of rubbish information. But cross product is the only fundamental operation in relational algebra if you want to extract information from more than one table. From a single table if you want to extract information, you have selection and projection operation and you can obviously combine them. But if you want to extract information from multiple tables, then cross product is the only fundamental operation. Since cross product does not result directly into any meaningful information, you apply further selection and projection on to that cross table to extract only the meaningful information and that operation is called the join operation. So, the join operation is meant to get meaningful information from the cross product. That is the conceptual framework. Let us look at examples of each one and also try to see how a relational algebra operation is mapped as an expression into structured query language. We shall study structured query language in greater details later, but it is just the first introduction. Here is the selection operator. The selection operator is denoted by sigma. You are familiar with the Greek letters. So, sigma, there are not many that we shall use. Sigma is the Greek letter that is used to denote selection. The way you write the selection algebra operator is r 1 equal to sigma subscript p in bracket r. What does it mean? r is the original relation, say student relation. On that you apply a condition called p which is a valid condition on r. p is called a predicate. Predicate is the mathematical term. Predicate means condition. So, applying this predicate p to this condition r, you will get a new relation which is denoted by r 1. The properties of this selection is that r 1 has the same schema as that of r. That is very obvious because you are selecting only rows and p is any valid condition on r. So, if you are putting a condition like hostile number should be equal to 4, that is valid. But height equal to whatever 5 feet 7 inches is not a valid condition because the condition must be defined in terms of attributes of that particular relation only, otherwise mathematically invalid. This will permit you to extract any desired conditional information from a single relation. This is a selection that is shown as an example here. r 1 is equal to sigma sc p i greater than 9 and s h equal to 3 in bracket student. You all remember the student relation? You all remember that there was a attribute called student hostile and there was an attribute called performance index. What are you trying to say here? Get me all those students who live in hostel 3 and who have a performance of greater than 9. This relational algebra expression is a complete expression of that desire. Nothing else needs to be stated because the relation schema is already known. In that schema context I am writing this. The resulting relation in this case if you go back to the example table that I had shown, you will get one row because Arul Manages used to be in hostel 3 in room number 146 and he had a CPI of 9.2 far. If there were multiple students having CPI greater than this and hostel equal to 3, all of them will come out and they will form the new relation r 1. Mathematically this is very clear. In practice the clarity is exactly as simple as that. The structured query language of the databases that we are going to study, the SQL query for this relational operation can be written as this. So, look at this query. Select star from student where a CPI greater than 9 and sh equal to 3, semicolon. This by the way although I have written it in four lines can be written on a single line and it forms a complete and valid SQL program. Imagine the COBOL program that you will have to write to get the same information out of the same file. What will you do? Describe all the files, all FD entries, then in procedure division open this file, move some high values, low values or add and something. You set up a iteration, read a file record, then examine if sh equal to 3 and CPI greater than this, then write this record into another file, go back at end of file, go down close file, whatever, whatever, blah, blah, blah. Come say come at least a 20, 25 line procedure code and a 30 line FD entry, etcetera, etcetera. If you have defined the schema earlier into the relational database as we shall see, this is the complete SQL program to do the design. The star here means all attributes. We shall see later that star can be replaced by something when we discuss the projection operation. So, is the selection operation clear? Select star from this, where this and this or this, you can imagine now. Any complex condition can be written here. Here is the projection operation. Projection operation specifies the list of attributes which are to be projected out. So, for example, if you have a relation R, the projection operation is denoted by pi. This is the capital pi. So, pi list, this is the list of attributes. The resulting relation R 1 will now have a new schema. Why? Because I am projecting only a few columns. So, original columns were 5. I am projecting 2. The schema will have those two column headers as the attributes of this. This has to be a formal relation. R 1 has to be a formal relation. That is the requirement of the mathematical structure. What will be the cardinality? Cardinality means number of rows in a table or number of elements in a relation. If I have 5000 students, there should be 5000. So, number of rows will be exactly same because there is no condition. Well, we shall see that. There is a surprise waiting for us. Consider this first. Project course name and course credit from the course. So, I just want to know what are the course names and credits that are offered. This will give me from my sample list information system 6 credit, data mining 6 credits, professional ethics in IT 3, grades, etcetera. This will be the output that will come out due to the relational algebra expression if the original relations were this. The corresponding SQL query for this relational algebra expression you would have guessed instead of that star which denoted all attributes. Now, you say select CNM, C credit from course semicolon. Now, you will notice that the characteristic of the statement written earlier for selection operation was the wear clause, not the select statement. Curiously, SQL uses select statement for getting any information out of the database. So, select statement is used for selection, it is used for projection, used for join, used for clause, used for everything. The distinction is made by the clauses. The selection is characterized by the wear clause which puts the predicate. The projection is characterized by the list of attributes which is written in the select statement and there is no predicate. As we agreed, this list would have course name and course credit and as everybody says, it will probably have exactly the same number of rows because I am not reducing the number of rows since I am not putting any condition but weight. Consider this. From the student relation, I project S H which is the student hostel. The net result is R 1. The SQL query for that is select S H from student. No, there are no null values. All hostels are properly written but there are 5000 students staying in 13 hostels. So, if I get all 5000 elements of that S H column, same hostel will repeat many times as many times as there are students sitting there. In a file, I can have multiple values. In a relation mathematically, I cannot have identical elements because then it is no more a set. Please remember what I said. A relational operator results in a relation and that relation must be a mathematically valid relation. Effectively, it means that mathematically valid relation must have a primary key because then only it is unique all elements. If S H is the only column I am projecting, then S H itself must be the primary key of the resulting thing. If S H is the primary key and if there are exactly 13 hostels in IIT, there cannot be more than 13 elements in the resulting. This is a fundamental difference between the relational algebra mathematically interpreted and the SQL query. SQL query unfortunately behaves as stupidly as programming languages do. It will produce exactly 5000 rows repeating the hostel again and again. If you do not want that, if you want the SQL query to return what is properly a relation, you have to say distinct. Select a distinct S H from student, then it will remove the duplicates. Please note that in programming, removing the duplicate is not as easy as it sounds. It will have to get all the 5000 rows, sort them and out of each sorted group, take only one element. But if you want that, then there is a separate structure. So, how many elements would R 1 have? Here is an example. Projection, if I want to project only S H, then I will have 0 2, 0 8, 0 3, 0 3, 0 8. Now, these two jokers belong to the same hostel and these two belong to the same hostel. As a mathematical relation, I will have 0 2, 0 8, 0 3. That is it. Select distinct S H from student. Therefore, the result should be S H 2, 8 and 3, where I added the word distinct. So, remember projection operation will always result in a distinct relation. However, SQL, if you say select this, comma this, comma this from this relation, you may not get only the unique thing. You have to add distinctly. It is a minor problem. So, selection and projection is now very clear. Combination is an expression. This itself is an expression, a relational algebra expression. This is not a mathematical expression in a simple form. It is a relational algebra expression. Sigma S C P I greater than 9 and S H equal to 8 in bracket student, then the whole thing in bracket, because this whole thing is a relation. Student was a relation. When I apply this selection, I get a relation and on that relation, I apply the projection. I get only S name. So, let us say I am warden of hostel A and I just want to know all the brilliant people in my hostel by name. I am not interested in roll number as it is an artificial thing. This query will get me that. This query will first get student roll number, name, hostel number, SPI, room number, whatever, whatever and then it will project only S name. And as I think all of you would have guessed, the corresponding SQL query will be a combination of a where clause and a list in the select state. So, select S name from student where S C P I greater than 9 and S H equal to 8 semicolon. This again is a complete SQL program to extract this information from this state. Do you see the, do you feel the power of the SQL now and the power comes not out of some ad hoc programming language design. It comes out of a very strong mathematical foundation. A few important things. In the relational algebra, when we put this thing first in inner bracket and then we put the projection, we are describing to the relational algebra system the sequence in which these operations are to be done. This has to be done first and then this has to be done. Would it make sense to say pi S name student and then apply a sigma? Why it would make sense? Correct. When I apply pi S name to student, I will get only names though two unique names. Now, on those names I cannot apply this condition because that relation this condition is not valid on that relation. The relational algebra therefore is in some sense procedural. It describes the order in which it has to be done. What about SQL statement? We are nowhere saying first take me S name or first do me this where clause and that is the beauty of SQL. SQL is non-procedural. True to implement this program in real world the database management system will have to create some procedure. Read file, do this, search this, compare this, whatever. The beauty is that the database management system takes the responsibility of determining not only what is the correct way of giving you the answer but also what is the most optimal way. If it does not enter into consideration here, we shall discuss it later but the correct way that I must first apply this condition and then only project is automatically understood by the backend databases. So, is this clear? The projection operation, the expression and the way it is evaluated and the expression the way it is translated. Now, imagine you do not know anything about relational algebra. You will still feel comfortable in learning about how to extract information from a table given some conditions and given some requirements of projection. So, while the mathematical formalism is important, it is not mandatory to be completely understood while using SQL. The reason we are discussing this is to tell you the beauty and the beauty can be further elaborated in your better understanding. This book contains a whole lot of elaborate mathematical structure. Those of you are interested can actually look at it in greater details. Now, we come to the cross product. The cross product is the fundamental thing which can extract information out of multiple relations. Remember what I said? Selection and projection in combination also apply on one table but if I have information in student table, information in course table, information in the rest table and I want to extract information from multiple tables. In the example that we have been considering, one simple case is I want to get all the courses registered by one student and I want to print the name of the student lost their number etcetera and all the courses. Now, that I cannot get unless I take information from the student table and the rest table. Unfortunately, the only mechanism to extract information out of multiple tables is called the cross product. So, let us look at the cross product. If there are two relations R 1 and R 2, then the cross product is defined by R 1 cross R 2. Simple multiplication and multiplication in a relation sense is all elements of this concatenated with each of the elements of the second table form the cross product. So, concatenates each tuple from R 1 with every tuple of R 2. If there are m and n elements in the two relations, the resultant relation has m into n elements, chaos. Imagine if you were to form a cross product of all your policy masters and all your premium transaction records, you will cross product will fill up all the available disc in the country. So, obviously this is not a very beautiful way of doing things, but mathematically it is the most elegant way of doing. The schema of R is concatenation of two schemas of R 1 and R 2. This is the mathematical foundation. So, here is an example. Let us take student table and the rest table. I have given an example where I have three rows. There could be multiple rows. These names are not hypothetical. By the way, Motokrishnan is a renowned professor of computer science in IIT Madras. He is retired now. Keshav Nori is a great academician working with Tata consultancy services. Sunita Surawagi is the youngest trustee on the VLDB trustee committee. She is a colleague of mine. She is a data mining and machine learning expert. So, since all of them were students at some time or the other, I have just put them as students. These are the courses which people have registered for. Here again, I have given some sample entries. There are of course, as I said, there would be about 5000 students in this table and there would be about 300 things in this table. So, the total will be 5000 into 300. I cannot represent that big table in any presentation here. So, I have taken this sample. There are three elements here and there are five elements here. Totally, how many elements will be there in the cross product? Three into five, 15. Notice that Sunita Surawagi's roll number is 5012 has registered for two courses. Keshav Nori has registered for two courses and Mutu has registered for one course. That is the fact of life. Now, let us look at the cross product. If I form a cross product, the cross product will have 15 rows. Again, I have shown some early rows in that because even 15, I cannot fit into this. Notice what is happening? S roll S dem, etc. from the first table and S roll C code, etc. from the second table. I have made one distinction here because there are two S rolls. I have called this range dot S roll because this S roll comes from the second relation. Correspondingly, I could call this as student dot S roll. This incidentally is the formal method of distinguishing between the columns of different tables in databases. So, this is called prefixing the relation name or the table name. Anyway, notice that first row has 5012. This also has 5012. This has 5012. This has 501. The third row, however, combines the first row of the first table of Sunita Surawagi with the third row of registration table, which actually belongs to Keshav Nori's but cross product is a Dumbo fellow. It is simply concatenate. Now, can you tell me what meaningful information does this row convey to you? Absolutely nothing. Sunita Surawagi's roll number and name and Mutu Krishnan's Keshav Nori's registration has no meaning. Obviously, one of the tasks before us is to ensure that such a meaningless piece of information are discarded. We shall see how they can be discarded. But this is the cross product. You all understand this cross product? In this cross product, if you will see these two rows are important. Why? Because these two rows represent the registration of Sunita Surawagi. How do we infer that? We infer that because in the first attribute S roll the value is 5012 and in the corresponding attribute extracted from range, the roll number is same. So, that means if the attribute value is same for the common attribute of two tables, then that row is meaningful. As we all have been saying, we are looking for association, relation and the relation is established because the same roll number here, same roll number there, that is the relation. Everything else is non-relation. So, out of this large table of 15 elements, we see these two rows being relevant. We see this row again being irrelevant because this row has Keshav Nori's roll number here and again Keshav Nori's roll number for one registration. So, effectively, if there are five elements in the range, then there will be only five meaningful rows in the cross product. There cannot be anything more. But I will get 15. How do I extract 5 out of 15? The trick that we discover is to say that somehow if I can choose only those rows which have the same S roll value here and same S roll value here, I have meaningful information. That is what you do in a operation beyond the cross product. That operation is called JOIN. In plain English JOIN means meaningfully joining information from multiple tables. We have seen what is meaningful. Meaningful is that common attribute value should be saved. Correspondingly, the JOIN is defined to get meaningful set of rows from a cross product. It is defined as R1 cross R2. So, fundamentally, the operation of extracting information from two tables is still a cross product. R1 cross R2 forms a relation which is that big huge Rumangas relation. On that relation, if I apply the selection operation with a row which is another predicate, this time it is written as a JOIN predicate. That is why it is not written as a P. It is written as row, but you can write it as P as well. It is a predicate, but this predicate is a special predicate. This predicate is not arbitrary condition. This is greater than that, that is less than this. This predicate has to do with some commonality or relation between the two things so that JOIN is meaningful. There are two types of JOINs each providing for meaningful information extraction. One is called a natural JOIN. Natural JOIN is one in which the equality of the common attributes is indicated. You will all agree that the S-Roll being same on the two sides of the student and reg is the natural relationship is therefore a natural JOIN. The other JOIN is called a theta JOIN, where the predicate is replaced by theta here. Instead of P or row, you write a theta and theta is another predicate. That theta predicate can be a complex condition, but one of the conditions is necessarily some kind of a meaningful JOIN. Otherwise, you will end up getting meaningless. So, is this clear? Now, let us see what happens when we put a JOIN. Natural JOIN ensures equality of values of all common attributes. The resulting schema is the union of two schemas as the schema of the class product, but wait, it is union. Union means in a set union, you remove the common elements. If there are two sets of attributes, when you join them, remember the roll number attribute will appear twice. So, you remove one of them and you say, because it has the common value, natural JOIN means there is no point in having that column, because that column and the first column are exactly identical in the JOIN, not in the cross product. So, when you do that, you get a natural JOIN. Natural JOIN is specially denoted in relational algebra by writing a cross product symbol and by putting two vertical bars. So, X is cross product and X preceded and succeeded by vertical bar is a natural JOIN. So, R is equal to student natural JOIN range is the symbol and its meaning in terms of the relational algebra operators is I first do a cross product, student cross range and then say sigma student dot roll equal to range dot roll. This is the most important condition. It is called the natural JOIN condition. A natural JOIN in general terms is far more rigorous than this. If there are three columns which are common on both sides, then equality on each one of those three columns is ensured by natural JOIN. The reason you require theta JOIN is sometimes you want to join on equality of only two or one of those three and put some other condition. Then you call it a theta JOIN. So, natural JOIN and theta JOIN are not much different. You can say natural JOIN is a fundamental strong JOIN and theta JOIN is a liberated version of that natural JOIN. But the JOIN condition is required. So, student cross range I have shown first eight rows. I have shown this, this and this. In this we notice that in the natural JOIN, this row will remain, this row will remain and this row will remain because they have the same values. These three rows will be extracted in the natural JOIN. The natural JOIN is represented in SQL by simply saying select star from student natural JOIN range. See how beautiful and concise the SQL command is. Select star from student natural JOIN range. Earlier in the early versions of SQL in 1986 and 89, you did not have the natural JOIN as a syntax available in SQL. In those days, you were required to write select star from student comma range. Student comma range means a cross product of student and range. If you put a semicolon here, you will get those millions of dirty rows. So, select star from student comma range or select star from x comma y comma z comma t is all cross product in sequence. Now, this time where is not a ordinary selection condition, it is a selection condition being applied to the cross product, which says student dot s roll is equal to range dot s roll. You agree that this is the JOIN condition in this example. This then is a complete SQL program to extract the JOIN of two tables. This is the classical syntax. This is the modern syntax. Yes, sorry. I am Jay Raj actually. From when this JOIN is available actually? Oh, it is available in all SQL implementations after 1989. So, the current standard, we shall see that 93 standard SQL implements natural JOIN. What happens is, just like COBOL programs, there are some programs in LIC, which were developed 20 years ago. There are some which are being written now. Now, you are using micro focus COBOL a certain version. Earlier you were using soft tech COBOL a certain version. COBOL was still same. Many new features have been introduced in COBOL, which people today will be using if they are writing new program, but the old programs are still continued. 80 percent of the SQL code in the world of old systems will use this second type. It is not necessary to use it at all anymore. So, when you migrate, if anybody uses this, I will be very upset. Why are you using? This is not necessary unless you have a theta JOIN, which means you do not want a complete natural JOIN. You want something like that.