 Just spend about half an hour describing the notion of entities, attributes and the file and record design. We will not discuss any programs today, no quiz today. I would like to observe that this clapping is completely uncalled for because postponing a quiz is like postponing your death. There is not much merit in it. You might as well die once so that you will live again. Anyway, so we will just discuss the notion of entities and attributes. I think I will just use this title slide and use the paper to explain things. Excuse me, I think the next 20 minutes or so we should try to spend by concentrating on the discussion that we have because this has important implications for the projects that you will be doing. Because any large data handling requires a very careful design of your files and the fundamentals behind a good file design comes out of this modeling. If you recall, when we discussed the basics of software engineering, we had a couple of items which we said we will not discuss then. This was related to ER model or modeling of information. ER stands for entity relationship model. What we will do is we will understand the basics of that modeling and we will try to see how that kind of model of any information system, any programs that you write to handle data in large volumes can be used to convert that model into appropriate files which you desire. So first of all entities, practically everything that you see is an entity. It could be an abstract entity. It could be a physical entity. Consider student. Student is a physical entity. A course on the other hand is an abstract entity. A teacher is another physical entity. Hostel is a physical entity but could have abstract attributes. We shall see exactly what these things are. When we look at information related to entities, we find that that is exactly what we wish to capture, what we wish to store, what we wish to process and what we wish to produce as output results. 
Each entity generally will belong to a class which will be a collection of all similar entities. For example, all students in IIT Bombay could be said to belong to an entity class called student entity class. Every member of that class will have identical attributes, the list of attributes, not identical values of it. What is an attribute? Attribute is a feature which characterizes an entity. For example, take a student. Every student has a date of birth. Every student has a name. Every student has height, weight, color, whatever. In fact, if you look at a class of entities, then an entity could generally be described by hundreds of different attributes like this. Do we model for the purposes of data processing? Do we model all possible attributes of an entity? Perhaps not. We model only those attributes which are relevant for the particular problem that we are trying to solve. For example, if we are trying to solve a problem of processing academic performance of students, then we are obviously not interested in date of birth, but we are interested in the courses that the student takes. We are interested in the grades that the students score in the courses. We may additionally be interested in some basic information about student which is independent of the data that you are processing for particular problem. For example, where does the student live? Why is it academically relevant? Well, suppose I want to contact a student. I should know which hostel, which room that student lives in. Consequently, while an entity may have hundreds of attributes, you identify those attributes for your modeling. Remember, you are doing modeling for solving a problem. So you consider only those attributes for modeling which you believe are relevant for solving that specific problem. The entity and attributes are displayed as a part of what is known as ER diagram. An entity diagram, an entity class is represented by this diagram. 
Typically the entity class is shown by a rectangle and you write the name of the entity class. So we are trying to describe a student. When we say describe, we are trying to capture which attributes will describe the student most as far as the relevance of those attributes to our problem is concerned. Whenever a student belongs to a class of entities called student entity, the first thing that comes to our mind is some attribute which will uniquely characterize a student. We would ordinarily consider name as a characterizing attribute. Unfortunately, when there are large number of entities in a class, there could be multiple students having the same name and that could cause a confusion. One of the important things in any entity set is that all elements of the set be distinctly recognizable. Let us just identify other attributes of a student which may be relevant. Let us just look at the attributes which I have listed. Incidentally in this model, every attribute is shown by an ellipse connected to the rectangle. So name is an attribute of a student, hostel is an attribute of a student, room is an attribute of a student, courses, notice the plural. All the other three attributes which I have listed here are single valued attributes. That means a student will have one name, a student will have one hostel associated, a student will be staying in one room, but a student can register for multiple courses. When a student can register for multiple courses, you typically represent this by double ellipse and this is called a multi valued attribute. As I said, we always look for a unique identity for every entity in a class. We have let us say 6000 students on the campus. We should be able to name one attribute value. That value will be unique for a student. As I said name need not be unique. So what is the unique identity attribute or identifying attribute that we use typically? This roll number is not a natural attribute of a student. You are not born with roll number. 
In fact when you are born, your parents name you. Name is a natural attribute. So roll is an artificial attribute created by the system to uniquely identify an entity. Such attributes which uniquely identify an entity are called primary keys. If you were to represent information about students in this set, you could consider a table. Can you relate to this table very easily? A table in which as the heads of the columns, we put all the attributes that we have identified in our model and then we insert rows in this table. One row corresponding to one entity. So information about the student 1000, 1012 for example, with the name of the student, hostel of the student, et cetera, et cetera, could be written in one row. Another row will represent another student. How many rows will be there in this table? As many as there are students in the institute. This is the fundamental of entity modeling. Why is this modeling relevant to us? When we want to represent information for processing by our programs for any problem, we would like to store that information for later retrieval. That is why you have files and other things. But what information we should store, in which fashion we should store, needs to be decided upon. That is a design decision. So whether I store roll number, name, hostel, et cetera, whether I store roll number, name, date of birth and something, whether I store height, weight, whatever, that decision has to be taken before I start processing the data or before I start putting the data. This modeling helps us to take that decision. Because when three or four of us sit together, somebody writes this model and somebody says that look, I need to also process the mess bills for the student in the problem. In which case, I may need more information related to mess bills. Suppose somebody says, I also need to access this data to identify who are the football players, who are the hockey players so that I can select my teams for my hostel. 
In which case, one would perhaps list another multi-valued attribute called hobbies. As I said, there could be hundreds of attributes. And indeed, you should actually think of an arbitrarily large number of attributes of an entity before you say, all right, out of these, I need to pick up this, this, this, this, this for my model. This model is not very difficult to understand because we are essentially identifying attributes belonging to an entity set. And in a table representation, such a table could be called a student table for example, we can store information about individual students in terms of the values of the attributes that we have identified for the model. What is important is that for every student, we should try and put all the attribute values which are listed here. How is it relevant to our processing of information? It is relevant because this particular model of a table can be treated as equivalent to a file. Can you not see that I could convert all information inside a table into information that I keep inside a file? After all, if you look at the mid-semester marks file that we saw earlier, what was it? There was a line representing one student. What were the contents of that line, do you remember? There was a serial number. There was a roll number. There was name. There was lab batch. There were marks in individual questions and there were total marks. This was the information that we represented as far as mid-sem marks were concerned. Notice that even in the mid-sem marks, you did need the name of the student. You did need the roll number of the student. You did need the lab batch of the student. This is not relevant from the performance point of view alone, but it is relevant because when you analyze information, you may want to correlate pieces of information across different attributes, and that's why you need to do this.
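The student entity class and its table, as described so far, can be sketched in C++ roughly as follows. This is only an illustrative sketch: the names `Student`, `StudentTable` and `addStudent`, and the choice of `std::map` for the table, are assumptions made for this example and are not part of the lecture's programs. The roll number plays the role of the primary key, and `courses` is the multi-valued attribute.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One entity of the student entity class: one row of the student table.
struct Student {
    std::string roll;                  // primary key: artificial, unique
    std::string name;                  // single-valued, need not be unique
    int hostel;                        // single-valued
    int room;                          // single-valued
    std::vector<std::string> courses;  // multi-valued attribute
};

// The table: rows are indexed by the primary key, so a given
// roll number can appear at most once.
using StudentTable = std::map<std::string, Student>;

// Insert one row. Returns false if the roll number already exists,
// which is how uniqueness of the primary key is enforced here.
bool addStudent(StudentTable& table, const Student& s) {
    return table.emplace(s.roll, s).second;
}
```

Calling `addStudent` twice with the same roll number leaves the table unchanged the second time, whereas two students with the same name but different roll numbers are both accepted. That is the difference between a natural attribute like name and a primary key.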
So you understand entities, you understand attributes. Next we consider files. How can files be used to store information about entities in a class? This is the mechanism of translating our model into file design. We are designing files, by the way. This is not an analysis. We are now saying I want to represent the student, so I will do a design of a file as to how I will keep the student's information. You remember the file that we created: the program that we saw that day was very artificially picking out one line of input, was adding five stars to it at the beginning, again arbitrarily, and was writing a larger line as output onto a file. But instead of writing a larger line in that arbitrary fashion, I could have extracted pieces of information, identified them as what are known as fields in a record, and pushed them onto the file. In fact, if you consider this entity diagram itself, a particular row of this table would become the equivalent of a record in a file. Do you agree with this? The record that we had in the text file had, as I said, some five stars, serial number, name, batch number, all separated by commas. What were the commas doing really? Well, if you see these attributes, roll, name, hostel, et cetera, et cetera, and when you convert them into actual data in a file, you need to identify where the roll number ends and where the name starts, where the name ends and where the hostel starts. Consequently, each attribute is mapped onto some kind of field in the file. So this is one field, this is another field, this is the third field and so on. How many fields would you normally have in a record? As many as there are attributes that you have modeled. So the task of system analysis consists of identifying entities whose information you want to represent, and there may be more than one entity.
Then identifying possible attributes of that entity in total, then selecting those attributes which are meaningful for your work, then writing some sample data here just to get a hang of what type of values are going to be there for each of these attributes and then saying, okay, I will now keep all this data in a file. So this file will have records, one after another, and one row in this table will correspond to one record in my file and each record will have fields. The field values will be different but the fields should be ordered in the same way. Would it make sense, for example, if one line of the record contains roll number as the second field and fourth line of the file contains roll number as the fifth field? You will not be able to write a program to handle that at all. You would expect that the fields appear, field values appear in the records in exactly the same order in which you have described this. And how can you ensure that for processing purposes? Remember the struct construct, you can define a structure and in that structure, if you define 1, 2, 3, 4, 5, 6, 7, 8 fields, then exactly that order is the order in which the information will be found in a file if you wrote the complete structure as one entity. So a structure in our program is nothing but a record description of a file and that contains various fields. When you define a structure, the boundaries between fields are implicitly defined as far as C++ is concerned. But when you look at data on a file in a text form, nothing is implicit because the name is a variable value. Where does the name end? Where does something else begin? Consequently for files which are particularly text files, you require an artificial field separator. What is the field separator that we used last time? Comma. In a table, various records look very nicely separated. First record in one row, second record in second row. 
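The role of the comma as a field separator can be sketched as below. This is a minimal sketch and not the program from the earlier lecture: `splitFields` and `joinFields` are hypothetical helper names, and the sample line used in the note after the code is made up. One line of text is treated as one record, and the fields are always written and read in the same order.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split one text record into its fields at each separator character.
// Note: with this simple approach an empty field at the very end of the
// line (a trailing separator) is dropped.
std::vector<std::string> splitFields(const std::string& record, char sep = ',') {
    std::vector<std::string> fields;
    std::stringstream in(record);
    std::string field;
    while (std::getline(in, field, sep)) {
        fields.push_back(field);
    }
    return fields;
}

// Join fields, in the given fixed order, into one text record.
std::string joinFields(const std::vector<std::string>& fields, char sep = ',') {
    std::string record;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i > 0) record += sep;  // separator goes between fields only
        record += fields[i];
    }
    return record;
}
```

For a line such as `1,R001,Asha,5`, `splitFields` gives four fields, and the roll number is always at index 1 because every record keeps the same field order. A name that itself contains a comma would break this simple scheme, which is one reason the choice of separator is a design decision.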
But if you consider a file as a continuous sequence of bytes, how does your program know where a record ends and the record for the next student starts? How will you know that? In a text file that is known by a new line character, because you say one line represents one student. So when you convert all of this into a file, you will have a backslash n here at the end. If you did not have this backslash n, you could not have processed information for different students. All these descriptions that I have given, about comma being a field separator and backslash n being a record separator, you have seen before. You remember the awk script that we briefly saw? In an awk script, this comma is known as a field separator, or rather there is a field separator which could take different values. It could be a blank, it could be a comma, it could be a pipe symbol; any symbol that you choose you can define. That was for the awk program. In general, in a text file you would have comma separated values or blank separated values. Similarly, in awk you call this a record separator. A record separator is a new line character for text files. You remember we mentioned binary files. A binary file is one which does not necessarily maintain information in a textual form. For example, hostel is a number, room is a number. I could store hostel and room as int type. An int stored that way is not visible as text. A comma separating one int from another int does not make sense. A new line character separating one binary record from another record does not make sense either. Therefore the binary files of C++ are arranged such that you don't have to worry about record separation. But in order to understand where one record ends and another record starts, you would generally insist that each record be of a fixed length. 40 bytes, 80 bytes, 120 bytes, 173 bytes, 2428 bytes, whatever. Every record will have exactly the same length. But you have values which are of different lengths in each record. For example, name.
Some name could be 10 characters, some will be 40 characters. What would you like to do in such a case? Yes, any suggestion? What is done to convert such variable length fields into fixed length fields is you make an assumption that no name of a student in IIT can exceed, say, 40 characters. In which case you say I will deliberately define a string of 40 characters or 41 characters to accommodate backslash 0 which is the null. And I will say every name will be converted to a 41 character. If one name is shorter I will pad it with blanks so that all names are uniform. If we do that, if we said that the name was 40 characters, if we always said that the roll number is 9 characters, now there are some students who have 8 character roll numbers. So we can take the same stance that as far as character fields are concerned, even if some roll number is smaller than 9 characters we will pad it with blank at the end. That is our decision. But if we take that decision, then we suddenly realize we don't need a field separator at all. Because we read the first 9 characters, interpret it as roll number. We read the next 40 characters, interpret it as name. We read the next 4 bytes and interpret as integer number. We read the next 4 bytes, interpret as another integer number, etc. And we know exactly how many bytes are required for representing all the attributes that we have chosen. Consequently the size of that complete structure becomes the record size. And that is exactly what is done in binary files. So when you open a binary file for writing, you usually write a structure as one record in that file. And for every student you keep creating one record, second record, third record, fourth record, fifth record, etc. The advantage is you don't require a record separator, no new line character is required, because when you read you will read exactly so many bytes for a record. There is however one additional feature which is required. We know that files can be directly accessed. 
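The padding idea and the "structure as one record" idea can be sketched as follows. This is a sketch under assumptions: `StudentRecord`, `padField`, `makeRecord` and `appendRecord` are hypothetical names, the field widths (9 and 40 plus the terminating null) are the ones mentioned in the discussion, and the compiler may add a few alignment bytes inside the structure, so the record size should always be taken from `sizeof` rather than added up by hand.

```cpp
#include <cassert>
#include <cstring>
#include <fstream>
#include <string>

// A fixed-length record: every field has a fixed width,
// so no field separator and no record separator is needed.
struct StudentRecord {
    char roll[9];   // 9 characters, blank padded, no terminator
    char name[41];  // 40 characters, blank padded, plus '\0'
    int hostel;
    int room;
};

// Copy text into a fixed-width field and pad with blanks on the right.
// Text longer than the width is cut off.
void padField(char* dest, std::size_t width, const std::string& text) {
    std::memset(dest, ' ', width);
    std::memcpy(dest, text.data(), text.size() < width ? text.size() : width);
}

StudentRecord makeRecord(const std::string& roll, const std::string& name,
                         int hostel, int room) {
    StudentRecord r;
    std::memset(&r, 0, sizeof r);  // also clears any alignment bytes
    padField(r.roll, sizeof r.roll, roll);
    padField(r.name, sizeof r.name - 1, name);
    r.name[sizeof r.name - 1] = '\0';
    r.hostel = hostel;
    r.room = room;
    return r;
}

// Append the whole structure to a binary file as one record.
bool appendRecord(const std::string& path, const StudentRecord& r) {
    std::ofstream out(path, std::ios::binary | std::ios::app);
    out.write(reinterpret_cast<const char*>(&r), sizeof r);
    return static_cast<bool>(out);
}
```

With this layout an 8-character roll number simply gets one blank at the end, and every record written by `appendRecord` occupies exactly `sizeof(StudentRecord)` bytes in the file. Such a file is only guaranteed to read back correctly with a program built from the same structure definition on the same kind of machine.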
I have something called file position and I can say go to the fifth record, go to the 104th record. However, this ability is currently constrained by the requirement that I have to give not a record number but a byte number: that file position is byte number 0, byte number 40, byte number 80, or byte number 0, byte number 100, byte number whatever. I can specify any byte number. And obviously, if I know the record length, and since each record has the same length, then suppose I want to access the tenth student: 10 minus 1, multiplied by the size of the record, is the starting point of that student. So if I have a serial number associated with every student, it will be much easier for me to directly access that student's record from the file. I don't have to read all the records of the file to figure out what the information about that student is. That is the reason why you will find that in the mid-sem example, the marks file, as well as in the files that we will create for some time, you will have an additional field here called serial number. And this will be 1, 2, 3, 4, 5, like that. For the first serial number on the file, the record starts at byte position 0. For the second serial number, where does the record start from? 0 plus the size of this record. For the third, 0 plus 2 times the size of this record. So you can see, if I am given a serial number SNO as a variable, using that SNO I can arithmetically find out what position in the file I should go to in order to read that record. And that is how direct access is possible. Otherwise, imagine you are processing not 5,000 students' records but, let us say, a census record of 120 crore Indians. To find one person, even though you have a unique ID as the national ID, and you know the unique ID, what will you ordinarily be doing to find the information about that unique ID, like we did till last time? Read a record. Is the roll number equal to this roll number? No. Read next record. Read next record. Read next record. Read 100 crore records.
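The arithmetic for direct access, (serial number minus 1) multiplied by the record size, can be sketched like this. It is a minimal sketch, assuming fixed-length records and serial numbers that start at 1 with no gaps; `MarksRecord`, `recordOffset` and `readRecord` are hypothetical names, and the `total` field is only a placeholder for whatever the real record holds.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// A hypothetical fixed-length record whose first field is the serial number.
struct MarksRecord {
    int sno;    // serial number: 1, 2, 3, ...
    int total;  // placeholder payload field
};

// Byte position at which the record with serial number sno starts.
// Serial number 1 starts at byte 0, serial number 2 at recordSize, and so on.
std::streamoff recordOffset(int sno, std::size_t recordSize) {
    return static_cast<std::streamoff>(sno - 1) *
           static_cast<std::streamoff>(recordSize);
}

// Read record number sno directly, without reading the records before it.
// The stream is expected to have been opened in binary mode.
bool readRecord(std::ifstream& in, int sno, MarksRecord& out) {
    in.clear();  // reset any earlier end-of-file state
    in.seekg(recordOffset(sno, sizeof(MarksRecord)), std::ios::beg);
    in.read(reinterpret_cast<char*>(&out), sizeof out);
    return in.gcount() == static_cast<std::streamsize>(sizeof out);
}
```

With 40-byte records, `recordOffset(10, 40)` gives 360, which is the "10 minus 1 multiplied by the size of the record" case. `readRecord` returns false when the requested serial number lies beyond the end of the file, since fewer bytes than one full record could be read.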
And consider that reading one record from a file is roughly 1000 times costlier than accessing a memory location for that record. 1000 times. So when you start searching for a person who is, let's say, 5 years old today, by the time that person becomes 60 years old you will say, ah, your name is so and so. Not very useful. Not very useful. You should be able to say, tell me your serial number. Or give me some mechanism of converting your unique ID into a serial number. Once I know the serial number, I can directly access the record. For the time being, we shall be using an artificial serial number as the primary key for our records in the file. Later on we shall see how special file structures called index files can be created. These can do a quick mapping between your roll number and serial number, and we will be able to directly access that file. So I don't think there is much merit in going forward here. There is also another matter. You remember I mentioned that we will be putting up the possible list of project proposals. That has not yet happened. We have a short coordinators' meeting this evening, in which case most probably it will happen by tonight. But I hope in the meanwhile you have all started thinking about project ideas. Do look up Moodle tonight. You will get some examples there. And you should start concentrating on finalizing the project that you will be doing. I think we will stop for today.