 So, these are the nomenclature problems. Secondly, you see an underscore here. What could it mean? You could guess easily. The value of s-roll, we have not yet formally introduced the notion of primary key. But what we mean for an entity is that even if there are 10000 entities, for each entity the roll number attribute will be unique. So, this becomes an identifying attribute which in our terminology we called a primary key and that is the correct terminology. So, we generally put an underline to represent a primary key in all the names. You see single ellipses here, but you see double ellipses here. What could they represent? Multiple values. We saw that already. In the course registration form, the student was putting in multiple courses that he has registered. So, double ellipses means there are multiple values to this. Now in general, it is not considered a good idea for an entity to have multi-valued attributes. We shall see in a moment why it is so. Similarly, okay, first of all this is the most important part. If I represent entity set by this model, then I can directly map all the information about all entities into a simple table which can be stored as a file. This is how the table will look like. Roll number, name, hostel, CPI, whatever, whatever and the multi-valued attributes will appear as a list. So, six courses, ten courses, seven hobbies, whatever one. In short, this will look like a record in that complex file. And if there are 5,000 students, there will be 5,000 such records. There is similarly a course entity which we had already identified. We will not talk of teacher and other department, we will just talk of student and course which will elaborate the point that we are trying to make. Here you can easily see that course code is the primary key because it identifies uniquely every course entity here, course name, course credits. And like in the student, we had multi-valued attributes. Here you have C student, that means course students registered, multi-valued attributes. C faculty, like this course is being taught by Professor Sudarshan and Professor Fatak. So, there are two faculty teaching this. Again, this can be mapped onto a table. So, I have course code, course name, course credit and of course, course students which will be in our list. How many? Maybe 600, depending primary modeling of an entity. But is this a good model? Perhaps not. We take a clue from our earlier problem in those two files that we had. Apparently, this does not seem to be a good model. Why we can also identify is not a good model? You notice here, in course, you have multi-valued attribute called student. But student itself is an entity. Whenever an entity comes out as a multi-valued attribute of another entity, you should notice there is a problem in the model. Entity should be pure play entity, its attribute should be pure play attribute of only that entity. If another entity guesses in, you will have to do something about it is a intuitive field and we shall see how we do that. First, we see the problem here. In the table representation of the course entity, where I have C code and other attributes, I have faculty which is say Deepak Fatak, Umesh Belur, Sudarshan, Krithi, whatever whatever. And this will be the multi-valued attribute. So, there will be a list, not one value, but list and we shall see what happens when you have such multiple values. Before that, we shall discuss all the possibilities of these characteristics of these attributes. We have seen single value and multi-value as a important thing, but there are other things as well. The attributes can be of the following type. A single-valued attribute which has a single atomic value is of course common sense like roll number, name, whatever. Derived attribute, what is a derived attribute? Derived attribute is one whose value is derived based on some other attribute. For example, age of an insurer. So, I am insured my age is 60, but 60 is a derived value. Two years later, my age will be 62, but your record will show 60 because you have made the mistake of creating an attribute which is not absolute, which is derived from some absolute value. What is the absolute value here? Date of birth. So, date of birth is a proper attribute, whereas age would be a derived attribute. In general, it is a bad practice to model derived attributes as important attributes of an entity because they can be derived at any point in time. Composite attribute, we say name, but if I want to distinguish between first name and last name, I have a problem because somebody may write Deepak Phatak, somebody may write Phatak DB, somebody may write P Deepak and there could be confusion. You know in the past course, you have to give these names and there are unfortunate problems. For example, in South India, there is no notion of a last name and you should see the fun at the US immigration point or something where there is a lot of Maramari when he asks what is your name and he tells the real name and then he says, no, this is not you. So, all these kinds of problems happen, but what it is, you want to identify and isolate components of an attribute. This is not multi-value. These are multiple segments of a single value and therefore, you call them as composite attributes. The correct way to represent composite attributes is to represent them as separate attributes as many components as you have. So, you will have first name as an attribute, last name as an attribute, middle name as an attribute, fifth name as an attribute, whatever, not fifth, but typically. Or if you want, let us say address, then address itself could have street, number, city, pin code, these must be separate components. Then there are attributes with null values. This is a very special class, we shall briefly discuss this. And then there are multi-valued attributes, all of us have understood. Multi-valued attributes individually can be composite attributes also, but we generally recognize these kind of different types of attributes. Single-valued attributes, these are some examples, very clear, roll number, course name, date of birth of a person, capacity of a lecture hall, price of sugar at a shop, whatever, these are single-valued attributes. Derived attributes, age of a person is correct only on the date of recording, is better to store date of birth, obvious, common sense. Total salary of an employee is another derived attribute. Total amount of premium paid by a person is another derived attribute. You will have some files in which you will be storing this. Why do you store this? Because otherwise to calculate the total value of premium by summing up all the previous premium records over the last 10 years is a cumbersome job. You do it for a specific operational reason, but file modeling, that is not the correct way of modeling, because it is a derived value. Total premium values are actual values, actual attributes. This total sum is summed. So, when you do data warehousing or when you do reporting, it is natural to store these derived values, but in the original modeling, you do not generally, you try to avoid. So, for example, if total salary is sum of basic pay, allowances, minus deductions, it is better to store each separate attribute value and compute the total pay at any time that you require by referring to all the components. Here are composite attributes. Donor address. This could be a single attribute technically in your conceptualization of attributes for an entity called donor. Donor is not relevant to you. Nobody donates to you, but to IIT Bombay, there are many people who donate money or who donate things. So, we are very interested in donor's address. So, these are the components. Name itself will be last name, middle name, first name. Street itself could be house number and street name. City, state, pin code, country, in future, planet, solar system, whatever one. So, these are the kind of conceptualization. This is a very funny example. The purpose is different. The purpose is to stress that if you can think of 10,000 years hence at which time your no system will be operational, I can guarantee you, then maybe you will definitely think of 15 years which is essential for modeling this. And please remember your application software may die after 15 days like you are migrating today, you may migrate after 15 years. But the basic model may still say LIC will still be in the business of life insurance and therefore the basic model is very, very important and therefore long term consideration is important. I will come back to the null-valued attributes. This is something which we cannot handle effectively in a conventional programming land. For example, suppose the value of the attribute is not more. Cumulative performance index or your total marks in the final year when you pass out from IED. Every year you will accumulate marks. Every semester you will get marks. One semester you will get 65 percent. Second semester you will get 62 percent. Third semester you will get 85 percent, etcetera. When you are studying in the first semester, what is your total percentage marks? You are not even given a single exam. Should we say your marks are zero? That will double you as duffer. So your CPI or cumulative performance index cannot be zero. Is it five? No. It does not exist and therefore you require a special representation for it. In COBOL, how will you handle it? If you write zero as a value, you have a problem. Typically, COBOL programmers do a strategy. They put minus five or minus one or minus nine or high values or low values. Basically saying, look, these values will never occur in life. Therefore, any time a program sees this value, it should say the value does not exist. Effectively, you are trying to implement null. But remember, every time you take a decision, that is an individual programmer's decision. Tomorrow, some other programmer makes some other decision and that interpretation will have to be made by your programs every time. It is not a very clean way of handling. Anyway, we will discuss the implementation later. But this is one of the problems in conventional program. The other reason why a null-valued attribute could be there is not only because the value is not known but may not be applicable. This example may be outdated for 21st century because it says what is the number of children if the marital status is unmarried, ordinarily number of children should be zero. Forget it. I am just trying to illustrate the basic principle that there could be a situation where the attribute itself is not applicable in a particular context. Or in academic sense, are you a PhD? If there is a question, are you a BTEC? If you are not even a BTEC or you have not passed school examination, you ordinarily will not hold PhD. How do you represent such values? If the values are strings, how do you represent strings? Do you effectively handle such things by putting all stars in a string or all pluses? What is the most favorite of the LIC? When a value is not known, what do you put in a string field or in a numeric field? Come on. Zeroes and blanks. Spaces. That is blanks. So blank is because it may be an invalid value. Ordinarily, no blank will be there as a valid value. Nobody's name can be black-black-black. But a zero is a very dangerous thing. For example, consider this. Here are roll numbers and marks. 78, 72 dash, 62 dash. Dash means these two people were suffering from malaria, admitted to hospital or had a headache. Usually, our students develop a headache just one day before the exam. So they go to the hospital, of course, teach them, give them a medical certificate, and then they give an examination one month later. So this is a very standard practice. But now, if I as a teacher, find out what, or forget me as a teacher, let's say my head of the department finds out how have students performed in FATAC scores and calculates the average. And let's say one of the LIC cobalt programmers has implemented that system putting zero and zero here. You'll put 78 plus 72 plus zero plus 62 plus zero divided by five. What is the correct average? This, this, this. That means in any aggregation, you're not supposed to take cognitions of null values at all. And this is not easy to handle in conventional programming language. In my model, however, I must notify during modeling that look, these attributes could be null value. These attributes need not be null. These attributes must have a mandatory feed, etc. Multi-valued attributes we have already seen. Number of courses taken by a student, number of teachers teaching a course, number of hobbies of a student. And the problems that we face is each of these becomes a list of values. So identity of individual element in that list is blurred at the external interface. A programming interface when you wanted to sort, for example, that registration five, you need not know, did not know how to sort. One value here, one value there. So this is what we saw. For example, Umesh Bello is teaching CS634 with me and he's also teaching 628 individually. Now, how will you sort? How will you figure out in a simple operation that Umesh Bello is teaching these two? That is why multiple values are not good. Here if it is a course entity and I am having see faculty as a multi-valued attribute, I will immediately notice, look, bloody, faculty is an entity in its own right. Faculty members have names, have designations, have salaries, whatever, whatever, expertise. So I must model faculty independent. And what I am trying to show here by multi-valued attribute is actually a relationship between the subject and the faculty. Just as by multi-valued attributes of the courses that one listed in the registration, one was trying to represent the relationship between student and the courses. So people who designed this ER model figured it out long time earlier that multi-valued attributes invariably will imply a relationship between two independent entities and it is best to model such relationships differently. So remove all multi-valued attributes from your entity model and then worry about how to represent those multi-valued attributes as they represent relationship values. The entity model, in the model each entity has a primary key which is a set of attributes. Value is unique amongst entities of the same and obviously avoid multi-valued attributes. We shall see if multi-valued attributes come in the first phase of your modeling thinking how to take care of it we shall see in a moment. What are the primary key choices? Here is an arbitrary choice I have made in the entity for student. Roll number name, hostel room. Originally the roll number would be the primary key we saw that. But suppose each student is guaranteed to be given a single room in the hostel. You know IIT is a residential institution. Then hostel number and room number together would be unique. If hostel number and room number is unique then it can be used as a primary key. What are the problems with such a choice? Sorry? No, I may still have roll number as a unique but I am merely saying that instead of using roll number which is artificial I will use the place of residence as the primary. You were saying something? No, we are saying single occupancy. But you are right, suppose that student moves and some other student comes. Now that student's room number and hostel number combination will change. Ordinarily it should not worry me because it will mean I am changing the primary key of a person. But during the lifetime of that student on the campus it is ideal if the identification of that student remains unique during that time. In fact it is ideal if that identification remains unique throughout. Because even ten years later he comes back and says give me my mark sheet. I should be able to extract it very easily. That is the reason why such choices are never made. And you end up making an artificial choice. Please note that roll number is not an attribute natural to a student. The natural attribute of a student is a name. The parents when a child is born do not give him or her a roll number. They give him or her a nice name. Similarly course is given a nice name. Everything is different. But part number, roll number, course number, policy number, branch code. These are all human endeavours to ensure that all entities in a set are uniquely identity. And therefore primary key in nine out of ten cases will be invariably an artificial number created by us. It is therefore very important to create that artificial attribute very carefully. We had when the issue began we had four digit roll number. Then we changed it to six digit. Then we wanted to change it to seven digit when a bright system analyst Mrs. Rekha said that look again five years later you will do some funny thing. So she designed a system which is a eight digit roll number and which accommodates the department, the degree, this, that, whatever, whatever. And so far we have worked well with it. But someday we may run out of it. How many digits you have in the policy number? Nine is all what? Hundred crores or no? Ninety nine crores. So some person in Jharsugula will not be insured by you because you cannot get the policy number which is unique. If policy number is unique, these are the long term considerations. Anyway, the table representation of any entity set is very clear. The attributes will become the column headers and the rows will become the records in the file. Or every value of every attribute for one entity will constitute one record or one row in that table. Now column headers are metadata. They are called metadata which is data about data. Metadata is an important term. All the nomenclature that you use in your file description is actually metadata. You need to specify more about these attribute names. You need to specify type of value whether it is integer or string. For example, hostel in IIT Bombay is a two digit number. But the numbers are between 0, 1 and 1, 3. In COBOL, if you write picture 9, 9, 75 is a valid hostel number. And you will have to do something about it by programming. By writing a program saying the value should not be more than this. And this checking will have to be done individually by every COBOL program that you ever write for the system. There is another pain point in conventional program. Please remember all the pain points that I am mentioning are obviously because the database management system has mechanisms to help you take care of these automatically. Which is why that programming paradigm is preferred. A row represents information about one entity of a set or one object of a class. No two rows are same. That is important. In any table, no two rows are same. Or in any file, no two records are same. How is it guaranteed? Simple. There exists a primary key. And primary key is unique. So even if all other attributes are same, name is same, the student is saying the same hostel has been allocated a double room, so same room studies the same degree, does the same course and funnily enough performs exactly in the same manner as the other. So everything else is identical. A roll number is different. So no two rows will be identical in a table. And set of rows represents the entity set depicted by the model on one hand and it permits easy implementation of a file. Line sequential file, index file. You can implement all the records as a file. So you can do any conventional programming if you want to. Basically ER model therefore is a link between the conceptual strength of the representation of information on one hand and ease of implementation in programming on the other hand. That is why the importance of this. There is an entity diagram as I said, rectangle and associated ellipses. It must be accompanied by a detailed data dictionary. This data dictionary is a new term. If you are not used to it better jot it down. Data dictionary is nothing but complete textual description of everything in the model. So your model has entity, then for each attribute name, value representation, typical values, constraints and primary key attributes for entity set must be recorded in that data dictionary. ER model or ER diagram in that each object is an entity set may have association with one or more objects of another entity set. One student takes several courses. One course is taken by several students. This is the obvious conclusion that we have reached. A diamond connecting these two rectangles is used to show this association. Here is a student and course association. Here is a set of students. Student set, all roll numbers are there. Here is a set of... So this is a course set. These are the course course. They are the students. It is very obvious to you that there are many students who will register for one course. Then there will be a same student who will register for many courses. So there is a many to many relationship between these two sets. Those of you will remember elementary set theory and maths. These are two sets of values. I have shown only roll number, but please note that this is not a roll number. Roll number, name, hostel number, CPI, all of this is one piece of one entity. Similarly, course, code, course name, credits, slot, etc., is one thing. So there is an association. When you model a student set, you have got all student attributes. When you model course set, you have got course attributes. In order to model those two funny multi-valued attributes that you saw in each, now we see a way out. We see those multi-valued attributes coming in both the places as nothing but these lines connecting these. What does each line represent? It represents an association or a relationship between one student and one course. How many lines will be there if there are approximately 5,000 courses and if there are 5,000 students and there are approximately 300 courses offered in a semester? The maximum possible number theoretically is when each student registers for every course. That will be 5,000 into 300. But that will never happen. We said earlier the assumption was that there will be on an average six courses a student will take and therefore there will be about 3,000 such lines. Our problem is that in our model we want to represent such lines as independently existing features of my information system in addition to my student synthesis. I do that by showing this diamond. So I have a student entity and I have a course entity and I have put a diamond in between connecting them to these entities saying student registers for a course. Student is represented by a role. It is a primary key but there are other attributes and please remember once I put a primary key everything else is identified for that entity. Similarly for course I say course code and other things. Now this registers for is a diamond which represents an association. I know that in my model A this diamond is actually representing on an average those 30,000 lines going between student and course. Each line is represented uniquely by what the role number and the course code combination. The two end points of a line are unique the combination. There may be multiple lines from one student there may be multiple lines to one course but role number and course code is unique. Why not then I say that just as I have student as a entity set which has about 5000 entities course as an entity set which has about 300 courses why not I say registers for is also an entity logically it is an entity. How many elements does it have? 30,000 maybe 35,000 maybe 20,000 whatever. What are the attributes of each of the supposed entity within this? Only two attributes are required. Number and course code. These two attributes completely and uniquely define this hypothetical entity set called registers for and what is the primary key for this? The combination of these two is the primary key. So we come to a very important and significant conclusion relationship between entities is shown by diamonds. Diamonds are taken to be equivalent of entity sets where the primary key of such entity set is the combination of the primary key of the participating entities and they uniquely represent this. Just as student has other attributes and course has other attributes does this entity set registers for also has other attribute ordinarily not required but in this case there is something which you skip because that was part of the detail in the example. You get marks or grades in a course, right? Your grade in a course is it the attribute of a student? If you are a student you will be registering for six courses and you will be getting six different grades. Is it the attribute of a course? A course has 300 students and therefore course will have 300 grades. The grade is an attribute of each association line. One student taking one course gets one grade. So, the grade is an attribute of that line connecting the two sets and therefore grade is an attribute of this entity set. So, you depict that attribute in this model here. Is this clear? The table for relationships you saw the table for student table for course where now saying you create a table for the relationship set to represent the relationship. This can be represented as an entity set. The primary keys for participating entities form the primary key for this set may have additional attributes like the grade. Thus s-roll, c-code, grade is the schema for this relationship set. Schema is nothing but collection of attributes. So, student schema is roll number, name, hostel number, room number, etcetera, c-pia, but no multi-valued. Course schema is course code, course name, course credits, slot, whatever. And the reg schema or the registers for schema has the attributes s-roll, c-code and grade where s-roll and c-code together form the primary key. And this can be made into a table. So, s-roll, c-code, grade s-roll, c-code, grade somebody has not yet given the grades for HS 699 because the teacher has not completed the correction. The grade will be null, not zero. And now, consider that you have modeled the information system not from a cobalt programmer's perspective which who captured the forms and the lists as the basic thing, but from the perspective of information system thinker who captured the entities and their attributes. Then how many files will you have? A student file, a course file and a registration file. Suppose a student fills up a registration form. You will not store it as a registration form. What you were to do while breaking up that record for sorting, you will actually do at the point of capture. You will take the registration form. You can do something additional. You can say whether the student exist in my student file or not. That's a cross check you can do then for every course that student has registered in that form you will create one record in this table and this is your master table of relations. It need not be sorted in any order but you can build a key on each of these two fields because these are now single valued fields. Anytime a student wants to know what are the grades in my courses you search this on that student volume. Anytime a teacher wants to know what are the students who are registered for my course you just search and do this and this search can be made dynamic. In our institute information system for example in the early days of the registration because people can register over a number of days every time I go on to the online interface to find out who are the students who are registered for my course that information can change every five minutes. Anytime a student registers for a course that registration information goes into a table like this anytime I run a query to find out who are the students registered for my course that tables are searched and then a new list is composed and it is shown to me. I think we will take a break for about 17 minutes then come back at 11.30 but I hope you appreciate the way of looking at the model in a slightly different form than a conventional way of program. The next session there are some design issues which are mentioned in the slides but effectively what I said you know use of entity sets versus attributes what is important is that what you are modeling is what will define how you will model so what to choose as entity what to choose as attribute so typically possible guideline is that whenever you have multivalued attributes try to see whether those multivalued attributes naturally are part of another entity or an entity itself in which case you model it separately and model the entity set. There are things like binary versus n-nary relationships what we have seen a binary relationship between two entities but it is possible that there could be relationship between multiple entities for example if you introduce a teacher so there could be many teachers for a course and consequently many students may be associated with multiple teachers in different course so how do you model this so typically the correct way of modeling is that you split the n-ary relationships to multiple binary relationships so if you have teacher, course and student you make three binary relationships student and teacher, teacher and course course and student so like that you can have multiple relationships depicting the scenario essentially ER model captures the static characteristics associations can be one to one meaning exactly one element of one set is associated with one element of another set it could be one to many, many to one or it could be many to many as we saw in our case example it was a many to many relationship and most important is easy to map the structure to files or to database schema or tables you put one table for each participating entity set and one table for every relationship or association set we saw that as I said you must make a data dictionary which is a elaborate text description typically if you have let's say 20 entities and about 50 relationships you will typically have equivalent of 70 tables or 70 entity sets all told for each entity set you must have a repository for entire metadata for the whole information system for one entity you must give the name you must give the approximate number in that entity set so that you know how much of this space ultimately will be required frequency of changes in the information that will happen so for example a policy number never changes insured value may not change but the address may change say over a 10 year period on an average it may change 5 times 3 times now these are the reflections that you have to capture from the business people since you yourself are the business people you will have to note these things down similarly for every attribute you will have to give the name of the attribute what are the permissible values this is very important I mentioned that hostel number here is between 1 and 13 so picture 9 9 is not good enough you have to say the number must be greater than or equal to 1 and less than or equal to 30 enumerated list like degrees so degree is picture x 3 or x 6 whatever but it does not mean that any arbitrary abracadabracad will do so you will say the degree must be beta, mta, mfill, msc whatever what so similarly grades grades cannot be anything we have a grade description saying a a a b b c et cetera d d f f x x x r something like those are the only valid values why is it important because any validation that you want to do on the base data that you collect has to be done later on we shall see that these validations the database management system is capable of doing automatically once you describe the rules of the game those rules are called integrity constraints as we shall see in a moment then if there is any other constraint for example there cannot be a record in the registration table unless the student role number exists in the master table and unless the course exists in the course table otherwise it will be a invalid combination then there are typical authentic source of actual value where do you get this actual value for example rate of birth has to be verified from the originator of the person whose rate of birth you are represented and you might mention that there will be a document these these documents are acceptable et cetera all the business rules they will have to be prescribed in your data