If you recall, we discussed the notion of files last time. To recapitulate, we have been using files primarily for input redirection or output redirection. When we have voluminous data, instead of typing it every time you run a program, you edit a file and give that file as input through the redirection mechanism. Files are actually more important than that. As a prelude to understanding the importance of data in files, we will first try to understand the notion of structures. After that, we will do a very quick review of the simple things about files that we have seen earlier. Then we will look at some algorithms which work on files that contain data. More important for us will be the notion of direct access to records in a file. In that context, we will discuss files of a kind we have not seen so far, namely files containing encoded data, called binary files, as opposed to the text files that we are all used to. So, there is an issue. Typically, I would declare variables or arrays in my program to handle different types of values. Take, for example, a hypothetical roll number, not an IIT Bombay roll number which has a capital D and such things in it. At most other civilized places, a roll number is an integer value, so we will take it as int. Name will be what? A char array, because the name could be 10 characters, 20 characters, 30 characters. A lab batch, as we know it, is 1, 2, 3 or 4, so it is again an int value. Marks, should they be int? Float. You do not want to lose the 0.5 marks that you gain, so marks should be float. For all intents and purposes, these four are independent variables. We understand that all four of them describe attributes of a student, but as far as the program is concerned, these are just four independent variables. Memory has been allocated to each, and computations are done as and when you operate upon them.
What we would actually like to do is treat all four as if they are part of a single structure. For example, if my roll is located somewhere here, and we assume for the moment that memory is allocated contiguously, then this will be followed by my name, which would have many characters. Assuming four bytes for roll, there could be 30 bytes for name, for example. Then there would be batch and then there would be marks. What we would like to say is that all of this should be part of a single structure or a single organization. So this entire thing, the actual memory locations for all four parts together, we would like to refer to by some name, let us say s. Ideally, we would like to take this entire s, put it in memory, take it out of memory, put it on disk, take it out of disk, while still retaining the individual components and access to them. Such a facility is available in C++, and in most other languages, through the possibility of defining what is known as a structure. Here is an example of records in a text file. You would have seen almost exactly the same file, minus the marks, in the last lecture's example before the midsem. There I had put the values separated by commas, and we had introduced an extra comma at the end, so that we could process the three components using the comma as a delimiter. At that time, we had seen a program which extracts these into three different arrays of characters. So notice that we were treating 10101 as a five-character array, not as an integer number. The name will have to be treated as an array of characters because there is nothing else that you can do. But the other components we would like to treat as int, float, int, whatever. And as I said, we would like the entire conglomerate of information about one student to be referred to by a single name. Actually, if you look at the records in a text file, they are lines in a text file.
So, there is some additional terminology that I would like to introduce. If you look at each line, what does it describe? Information about one student. The second line describes information about another student. Does the second line contain information of a different kind from the information in the first, third or fifth line? For example, the first line has roll number, name, batch number and marks. If the second line contained the roll number, height and weight of the student, it would not be in consonance with what is there in the first line. It is very obvious and natural for us that whenever we construct such a text file, we have exactly the same attributes in each of those lines. The values will be different for different students. The naming for this kind of notion comes from the world of data processing or information processing, and these are the new terms that I would like to introduce. File, you are already aware of. As far as you are concerned, a file is a sequence of bytes. If it is a text file, you can open it, see it, edit it and so on, because you see the text symbols there. In our text file containing the students' data, we see different records. There is one record for one student. More importantly, each record has values of exactly four attributes. So the notions of record and attribute are important. The attributes that we have here are roll number, name, batch and marks. These are called the fields of a record. The first field is roll number, the second field is name, the third field is batch, the fourth field is marks. In the text file that we have seen, these fields are separated by one or more blanks. Last time we had seen fields separated by commas. Can a blank always work as the separator of fields in a text file? Can you always recognize different fields if they are separated just by blanks? Possibly not. Suppose instead of writing my name as Deepak, I write Deepak Phatak.
I won't write it as a single, continuous word. I will have a space between Deepak and Phatak. So a space is a natural component of any string. Therefore, if any string is a field in your data layout, that field has to be separated from other fields by some other means, not by a blank. This separation of fields is a different issue. We have tackled it earlier by using comma-separated fields and so on. But imagine the other possibility: suppose you can pre-define the width of every field. Suppose you can say that name will be exactly 30 characters long, and the roll number in character form will be exactly 5 characters long. Then you actually don't need to separate these fields. You can just count the number of bytes from the start, and you know exactly where one field ends and another begins. If you agree with this, the same concept can be applied to a file which does not hold data in text form. Suppose I have an internal representation of an int followed by an internal representation of a float. You all know that on our machines the internal representation of an int takes 4 bytes, and the internal representation of a float takes 4 bytes. Suppose I wrote these two fields one after another in a file, as a record containing the internal representation of the int followed by the internal representation of the float. It will still be valid, because if I read 8 bytes, I know the first 4 bytes are the int and the next 4 bytes are the float. Of course, I can't read them using a normal text editor, because the data is in internal form. How nice it would be if I could take an int value and write it directly to a file as an int, and later, when I read it back, read it directly without any scanf or cin, just taking it and putting it in the memory location. No conversion from string to int, int to string, et cetera, would be required. One of the things that the binary files I mentioned do is handle data in exactly this form.
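To make the idea concrete, here is a minimal sketch of writing an int and a float to a file in their internal form and reading them back with no text conversion. It uses the standard C functions fopen, fwrite, fread, ftell and fclose; the helper names write_pair and read_pair and the file name passed to them are made up for this illustration, and the 8-byte total assumes 4-byte int and float.

```cpp
#include <cassert>
#include <cstdio>

// Writes an int followed by a float to `path` in their internal (binary) form.
// Returns the number of bytes written, or -1 on failure.
long write_pair(const char *path, int roll, float marks) {
    assert(path != nullptr);
    FILE *fp = std::fopen(path, "wb");          // "wb": write, binary
    if (fp == nullptr) return -1;
    std::size_t ok = std::fwrite(&roll, sizeof roll, 1, fp);
    ok += std::fwrite(&marks, sizeof marks, 1, fp);
    long written = std::ftell(fp);              // position after writing = bytes written
    std::fclose(fp);
    return ok == 2 ? written : -1;
}

// Reads the two values back. The bytes go straight into the variables;
// there is no string-to-number conversion anywhere.
bool read_pair(const char *path, int &roll, float &marks) {
    assert(path != nullptr);
    FILE *fp = std::fopen(path, "rb");          // "rb": read, binary
    if (fp == nullptr) return false;
    std::size_t ok = std::fread(&roll, sizeof roll, 1, fp);
    ok += std::fread(&marks, sizeof marks, 1, fp);
    std::fclose(fp);
    return ok == 2;
}
```

Calling write_pair("pair_demo.bin", 10101, 78.5f) and then read_pair on the same file should give back the same two values. Opening the file in a text editor would show unreadable bytes, which is the price of skipping the conversion.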
However, coming back to the nomenclature, I would like to introduce one more term, which is metadata. You saw the file; let's go back to the previous slide. This slide has actual data. This is the first roll number, the second roll number, the third roll number, and so on. Of course, you'll call each line one record for a student, and each record has four fields. The record has a name, each of the fields has a name, and each field has characteristics. Roll number, for instance, is an integer value between this value and that value. If you recall the midsem example, some fields had stipulated prefixes: the account number, for instance, was stipulated to begin with 411. These are some of the characteristics. The description of the names of the fields, the name of the record and these characteristics is called metadata. Metadata means data about data. So metadata is a description of what the record layout is, what the different fields in that record are, what their characteristics are, et cetera. What is the metadata for us? The name of the record and the attribute names. Observe once again the name field. This has nothing to do with the names of fields that I just spoke of; it is a string holding the name of a person. That string could be of different lengths for different students. That is one of the reasons why you see some records shorter and some records longer. Look at this slide, records in a text file. All that I have done is reproduce the data, the actual records, and indicate the names of the fields at the top. So I have roll, name, batch, marks, and then we have record number zero, record number one, et cetera. That's because all counting in programming begins with zero. In fact, when you have a text file like this, you will refer to the first line as the 0th record, the next line as the first record, the next line as the second record and so on.
Observe also that these records are not of the same length. What could be the advantage if they are of the same length, and what is the disadvantage if they are not? Imagine an array that you declare, say an array of int or an array of float. Each element of the array is of exactly the same length as any other element, four bytes, for example, for a float. Suppose you have named the array a. To go to any element of that array, say the ith element, you say a[i], and we have already seen that C++ can quickly locate a[i] by taking the base address, that of a[0], adding i times the element size to it, and saying, okay, this is where the ith element must be. This calculation could not have been done if different elements were of different lengths. Then you would not know where the ith element is unless you looked at each one sequentially. So the advantage of fixed-length information is great when you organize several such items. Whether you organize them as an array in memory or as records on a disk, the advantage is the same. What we do not yet know is whether, even if a file contains records of fixed length, we can say "get me the ith record from the file" directly, the way we say a[i] = something or p = a[i], without having to scan sequentially. The fact is, yes, it can be done, and that is the reason why, whenever you have to maintain meaningful and critical information on the disk, you generally don't maintain it as varying-length records. You maintain it as fixed-length records. The point is this. Suppose this is the 0th record and this is the first record, and suppose they are all of exactly the same size, say 100 bytes each. These are written on the disk one after another: 100, 100, 100, 100. And suppose I told you that on a disk you can directly say "go to the 500th byte". If the records are of fixed length, then when you say go to the 500th byte, you know you are going to the beginning of the 6th record.
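The arithmetic behind this is the same as for array indexing, and can be sketched in a few lines. The function names record_offset and record_at are made up for this illustration; records are counted from 0 as described earlier, so byte 500 with 100-byte records is the start of record number 5, which is the sixth record.

```cpp
#include <cassert>

// Byte offset at which record number `index` (counting from 0) begins,
// when every record occupies exactly `rec_size` bytes.
long record_offset(long index, long rec_size) {
    assert(index >= 0 && rec_size > 0);
    return index * rec_size;
}

// The inverse question: which record does a given byte offset fall in?
long record_at(long byte_pos, long rec_size) {
    assert(byte_pos >= 0 && rec_size > 0);
    return byte_pos / rec_size;
}
```

With variable-length records no such formula exists, because the start of record i depends on the lengths of all the records before it.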
If the record lengths were different, on the other hand, there is no way you could connect the starting position of a particular record with any pre-calculable value. The comparison I made was with an array. The fact that you can say a[i] and the computer can quickly go to the ith element is because each element of the array is of exactly the same size. So, starting with the base address, you can calculate the address of a[i]. That is the point.
Structures in C and C++ permit us to do two things simultaneously. One, they permit us to collect different pieces of information, typically the attributes of an entity or the fields of a record, put them together, and give a single name to that conglomerate. Two, such a conglomerate is always of a fixed length, which will help us later on when we deal with records in a file. struct is the keyword that is used. It is useful in representing entities which have many attributes of distinct types. Suppose we wish to handle, say, roll number, name, hostel... sorry, there is an error on the slide. What is the error? I have roll and name, but not hostel and room; that is what I originally started with. The other two are batch and marks. Now observe what C++ permits us to do. It permits us to define a new data type. Just like int and float, we can define a data type which we are calling here student_info. student_info is not the name of a variable. It is the name of a new data type which we have defined. It does not exist in C++ by itself, and we are saying that this new data type is a structure which has the following components. The components have names. So I have int roll, char name[30], int batch, float marks. See the way it is defined here: struct, student_info, opening curly bracket, four components with their own individual types and names, closing curly bracket. There should, of course, be a semicolon at the end. So this is the definition of a struct.
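Since the slide itself is not part of this transcript, here is the definition as it is read out above. The spelling student_info is an assumption about how the slide wrote the type name.

```cpp
// A new data type with four components, in the order read out in the lecture.
// This only defines the type; it declares no variable and allocates no memory.
struct student_info {
    int   roll;       // roll number, an integer
    char  name[30];   // name, an array of 30 characters
    int   batch;      // lab batch
    float marks;      // marks, float so that half marks are not lost
};                    // note the semicolon after the closing bracket
```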
Now, having done this, I can define variables of type struct student_info. I will write the type name in full as struct student_info, which is how C requires it; C++ also accepts plain student_info. Just like int and float, struct student_info is a new type. The other types that we have seen, int, float, char and so on, are called native or built-in data types in C++. In contrast, these are user-defined types, which you will sometimes hear loosely called abstract data types. Later on, when we discuss object-oriented programming concepts, we will see that you can take this much further: you can define what we call user-defined classes, and you can build types upon types upon types, et cetera. So you get a lot of expressive power, but currently we limit our discussion to this. Note that a struct definition by itself does not define any variable, and therefore no memory is allocated. For example, int has no memory allocated to it. int x has memory allocated to x because x is a variable. Similarly, struct student_info has no memory allocated to it. It is a structure; it is a new data type definition. But suppose I define a variable like this: struct student_info s. Now s is a variable of the struct type, and memory is automatically allocated for it. How will memory be allocated? There are two important facts. The memory allocated to a structure is always one contiguous block, just as array elements are contiguous. No confusion there; this is ensured by C++. That means the first four bytes will contain what? roll. The next 30 bytes or so will hold name. This will be followed by batch, and this will be followed by marks. If I had another variable of this type, say s2, then if this block is s, there will be another block for s2 here, laid out in exactly the same way. As many variables as I have, so many blocks of memory will be allocated to those variables.
If I declare an array, struct student_info list[1000], then 1000 such blocks will be allocated, and list[i] will refer to the ith block. That is all very nice. But suppose I don't want to operate upon the whole conglomerate. I want to operate upon the individual roll number, marks, et cetera. How do I tell C++ that when I say roll, I mean the roll of student s and not that of s2? I have to distinguish, therefore, between components belonging to different variables of the struct type. That is done by what is known as qualification. In our program, if we write s.roll, s.name, s.batch and s.marks, where s has been declared to be a variable of type struct student_info, then we automatically mean those particular memory locations which are allocated to s. Is that clear? So it is a new kind of name. So far, the names of variables or arrays were single words formed as per the rules. This is a qualified name: s.something means that this something is part of the variable s, which is a variable of the structure type. Yes, the question is right: the allocation of memory is always in the same order in which the definition has been made here, roll, name, batch and marks. It is interesting to go back to the notion of files and fields that I mentioned, and that will convince us why this must be so. We said that there are four fields in every record of the student file. We also said that the first is roll number, the second is name, the third is batch. Suppose the first record contains roll number followed by name followed by the rest, and the second record contains marks followed by batch followed by name. Wouldn't there be confusion? These are only values, and some program is going to interpret these values.
If the order in which the field values are given is not identical in every record, we would have a problem, and that is why the order is important. So, to recapitulate, s is the structure variable. Its components are s.roll, s.name, s.batch, s.marks. And just like any other type, struct student_info also has a size. What is the size of int? Four bytes. What is the size of float? Four bytes. What is the size of a char array? As many bytes as the size of that array, so for name[30], 30 bytes. Now there is some confusion. In this particular case, the size will be 44 bytes. Unfortunately, if you look at the individual components, they don't seem to add up. Let's go back. int roll, how many bytes? Four. char name[30], how many bytes? 30, so 34; then batch makes 38, and marks makes 42. There is nothing extra or hidden inside a structure. In some peculiar files there may be some extra markings made by the operating system, but not in structures. As far as structures are concerned, there is only one rule that prevents us from getting 42. The rule says that all int and float variables, which occupy four bytes, must start on what is known as a word boundary. A word is a four-byte entity. So, while memory is accessible byte by byte, byte number 0, 1, 2, 3, and you can go to any byte, an integer value cannot be stored in bytes 2, 3, 4, 5. It has to be stored in bytes 0, 1, 2, 3, or in 4, 5, 6, 7, etc. Notice that the first element was an int, which occupies the first four bytes. The next is a character array, which occupies 30 bytes. 30 plus 4 is 34. By the rule, the next integer value cannot start at the 35th byte. It has to start on the next boundary, meaning an address which is divisible by 4. Consequently, two extra bytes are wasted in that structure. But those bytes are part and parcel of the memory allocated. If you ever write this structure to the disk, those two extra bytes will get written. What will they contain? God alone knows.
That is because we have no access to those bytes. We can only access the components of the structure variable that we have in memory. So those two bytes are extra bytes. They are called padding bytes, but this padding is not blanks or any such thing. They are just two extra bytes allocated by C++ because of this rule. That is how you get the total number of bytes in this particular structure as 44 and not 42. You can guess that in any structure which has even a single int or float element, the total size will typically be divisible by 4. This holds even if the char array comes at the end of the structure: compilers typically round the total size up, so that in an array of such structures every element still starts on a word boundary. Here is the declaration and the way to find the size of the structure. After your #include <iostream>, using namespace std, etc., you will say struct student_info, with int roll, char name[30], int batch, float marks, and then when you begin your main program, you can say struct student_info s. It is not necessary to define the structure outside the main program. You can define it inside. There is a reason why I am defining it outside, and the reason will become apparent. In fact, as I had once commented, you can define these structures in a separate file, and just as you say #include <iostream>, you can say #include "structure.h". That tells the compiler: there is a separate file in which I have defined this structure; Mr. Compiler, before compiling, include that file as text, put it in here and then go ahead. That is possible. Anyway, this is the way I define a variable of type struct student_info. Suppose I have also defined an integer variable rec_size to hold the size of this structure variable. Please note that all variables of this type will be of the same size, and the size of the type is what we will use.
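The padding can be observed directly. The sketch below assumes the struct as read out in the lecture and a typical platform where int and float are 4 bytes and must start on 4-byte boundaries; on such a platform the three helper functions should give 42, 2 and 2. offsetof (from the standard header cstddef) reports the byte position of a component inside the structure; the helper names are made up for this illustration.

```cpp
#include <cstddef>   // offsetof, std::size_t

struct student_info {
    int   roll;
    char  name[30];
    int   batch;
    float marks;
};

// Sum of the sizes of the four components, ignoring any padding.
constexpr std::size_t sum_of_members() {
    return sizeof(int) + 30 * sizeof(char) + sizeof(int) + sizeof(float);
}

// Bytes the compiler added beyond the components themselves.
constexpr std::size_t padding_bytes() {
    return sizeof(student_info) - sum_of_members();
}

// Number of unused bytes between the end of `name` and the start of `batch`.
constexpr std::size_t gap_before_batch() {
    return offsetof(student_info, batch)
         - (offsetof(student_info, name) + sizeof(student_info::name));
}

// The alignment rule from the lecture, checked at compile time.
static_assert(offsetof(student_info, batch) % alignof(int) == 0,
              "batch must start on an int boundary");
```

The exact numbers are decided by the compiler and the platform, so a program that writes structures to disk should always use sizeof rather than a hand-computed size.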
To get the size of a type, what do we use in C++? It is the sizeof operator. Whenever you say sizeof(int), sizeof(float), and so on, as we discussed in one lecture, you get the size. Here, if you say sizeof(struct student_info), the C++ compiler will calculate how many bytes will be allocated to any variable of that structure type and give you that many bytes. That is how we get rec_size. If you print it, it will be 44 bytes. The problem that we wish to solve is putting students' data into binary files and attempting direct access to those binary files. For that purpose, I have separately written a single function which, if I give a structure to it, will print the values of its various components. You realize that it would be troublesome to print an entire record at any point in time through its individual components. You would have to keep saying cout s.roll, cout s.name, cout s.batch. So, rather than doing that at many places in your program, you write a function. If you ever say print_student(s), that structure variable s will come here. The function does not return a value; it only prints output. That is why the function is void. Notice that instead of cout, which I could have used, I am using printf, which gives me what? A formatted print. So, for example, s.roll will be printed as a five-space integer, s.name will be printed as a 30-character string, s.batch will be printed as a three-space integer, s.marks will be printed as a five-space floating point number with two digits after the decimal point, and in between each there will be exactly one extra blank. That is the meaning of this printf, and that is how the values will be printed. So this is a standalone function. Now, where do we define functions? Before int main.
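Here is a hedged sketch of such a print function, reconstructed from the description above since the slide is not in the transcript. The format string follows the widths read out in the lecture. format_student is a hypothetical helper added here so that the formatted line can be inspected; the lecture's function simply calls printf. The precision in %30.30s is an addition of this sketch, so that at most 30 characters of name are ever read.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// The type must be defined before any function that mentions it.
struct student_info {
    int   roll;
    char  name[30];
    int   batch;
    float marks;
};

// Hypothetical helper: formats one record into `out` and returns the
// number of characters the full line needs (as snprintf does).
int format_student(const student_info &s, char *out, std::size_t out_size) {
    assert(out != nullptr && out_size > 0);
    // %5d     roll in 5 columns
    // %30.30s name in 30 columns, reading at most 30 characters
    // %3d     batch in 3 columns
    // %5.2f   marks in 5 columns with 2 digits after the decimal point
    return std::snprintf(out, out_size, "%5d %30.30s %3d %5.2f",
                         s.roll, s.name, s.batch, s.marks);
}

// Prints one record on a line of its own; returns nothing, hence void.
void print_student(student_info s) {
    char line[64];
    format_student(s, line, sizeof line);
    std::printf("%s\n", line);
}
```

From main, after filling in a variable s of this type, a call print_student(s) prints the whole record in one go instead of four separate output statements.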
But if in that function I have referred to struct student_info, then while compiling that function the C++ compiler will say, now what on earth is this struct student_info? It will not help if struct student_info is defined somewhere later in my program. That is why, let us go to the previous slide, I have defined it before int main. In fact, right after this definition is the position in which I can define that function. When the compiler reads the program, it first understands that this is a new type that you are defining as a structure, and then it meets a function which uses that type.
We now come back to the discussion on files. This is a recap of a few things that we discussed last time. First of all, the notion of a file. A file has certain properties. It has a name, an extension, a path, permissions, a size. What is the size? The total number of bytes recorded in the file. You see these on your Ubuntu: ls -a or just ls gives you the names of the files, and with further options you also get the size, when the file was created or modified, and a lot of other information. All this information is maintained by the operating system. The physical location of the file and its properties are known to the operating system. You do not know them. What you see is the name, and you can request the operating system to give you more information with a command such as ls -l. So you will know the size, the permissions and so on, but the file itself is maintained by the operating system on the disk. You never directly write to the disk or read from it. It is all done via the operating system; the operating system is an intermediary. As far as we are concerned, a C++ program treats a file as if it were an array of bytes. Let us say the size of a file is 70,000 bytes. It may be a text file, it may be an encoded file, whatever. These 70,000 bytes are treated logically as if they form an array of 70,000 bytes on the disk.
Notice that an array has a property: I can go to any element of the array. The disk, as we shall see some time later, has the property that I can directly go to any point on the physical disk and say read from here or write to here. Since the operating system knows where the bytes of my file are located, if I say go to the ith byte of that file, the operating system can figure out where the ith byte is, go there directly, and at that position read or write, say, 200 bytes. That is the peculiar advantage of a disk. What are the differences? The difference is that a disk read or write is far costlier in terms of time than reading, writing or processing the elements of an array. To access an array element, you go to that location in main memory. Do you know how long that generally takes on modern computers? Anywhere between 10 and 100 nanoseconds. In 10 to 100 nanoseconds you can go to any main memory location, because it is electronic memory. Extracting, say, 20 bytes of character data or 4 bytes of int data from that memory to the processor may take a few additional nanoseconds, or maybe 100 nanoseconds. On the disk, however, if you want to go to a specific point, it takes at least 5 milliseconds. So if you are reading, let's say, 200 bytes from the disk, you spend 5 milliseconds in going to that point and a few additional milliseconds to read the data, because the disk is physically rotating. As a rule of thumb, the disk is slower than main memory by a factor of 1000 or more. We will just park this information in our minds, and we shall see why it becomes relevant. Now, we have seen text files so far, but a file can also be defined to contain encoded data.
These files are called binary files. There is a clear distinction between the two. Binary files will typically hold the internal representation of int, float, double, whatever you have. Notice, therefore, the structures that we have defined. Structure variables can be written directly to disk and brought directly back from the disk. They will constitute a binary file. There could be other data, such as the digital pictures that we talked about. They are not float or anything like that; for monochrome images we in fact interpret each byte as an integer value. But if you take, for example, a JPEG file, it has a very complex structure. And there could be many other kinds of files, such as voice data. If I sample voice data for digitization with, say, 5-bit sampling, then every sample value will occupy 5 bits, 5 bits, 5 bits. So consecutive bytes will not contain individual sample values in full bytes. I may have to take a whole chunk of bytes and internally say, these first 5 bits are this value, the next 5 bits are this value, etc. All that is possible. Anyway, in a nutshell, files are either text files or binary files. Now, this is what is important. How do we handle a file within a program? We have seen this last time, and it is important to understand the concept absolutely clearly. A file is handled through a file pointer. So this is my file, let's say marksData.txt. This is one file; there would be hundreds of files on the disk. I want to open this file and read bytes from it. In my program, how do I refer to this file? Do I say marksData.txt? No, that is the name known to the operating system. Insofar as C++ programs are concerned, this file will be referred to through a special data type called FILE, written in all capital letters. A pointer to that special data type is the effective name of the file inside our program. Effectively, if this is your memory, there will be the operating system somewhere here, which will actually do the job of I/O.
And this is your program, let's say a C++ program. Inside this program, you will define a file pointer by saying FILE *fp. This fp is effectively the name of the file as far as your program is concerned. Of course, fp in your program does not yet know that it has to be associated with marksData.txt. That is done when you open a file. At the time of opening a file, you associate the real file name with this pointer. From then onwards, this pointer becomes the name of the file. So this association is actually to this file. Is that clear? It is possible that you open this file by associating fp with marksData.txt, read it, and close it. After closing, you open another file and attach it to the same pointer fp. Now fp becomes the name of that other file. So a file pointer is a special variable available to you, which can at any time be associated with one file. Of course, if you want to have five files open simultaneously to read from, and write to another three files, then you will require eight file pointers. In general, you require one file pointer for every file that you are handling at a time within the program. So is this definition clear? FILE *fp. Please note that it is called a pointer, but this is not a pointer to some position or information within the file. That is not the objective. This pointer points to the file, so it is effectively the name of the file. Of course, this alone is not sufficient, because after opening a file I may want to read, and I need something to indicate where inside the file I am. So there are two things: first, the file itself, and second, the position inside the file. Inside the file, what do I have? Bytes. How many bytes? As many as the size of the file, whatever that size is. We have already said that as far as we are concerned, the file is logically an array of bytes. Now here is a difference between an array of bytes as a file and an array of bytes in memory. For an array of bytes in memory, only the contents are there.
Nothing else is maintained for that array. When we say a[i], we are saying go to the ith byte, if it is a char array. For a physical file, however, whenever the file is open, since it could be read sequentially or written sequentially, where exactly you are inside the file at any moment has to be known to the operating system. For example, suppose you have read one record from an input file. Next time, when you just say read again in an iteration, does it read the same record again? No, it reads the next record. How does it do that? The operating system has to keep track: if this fellow has already read 44 bytes, the next thing it reads will be the 45th byte. So internally it maintains an index which shows the current position in the file. This position is automatically incremented whenever you read something, by as many bytes as you have read, and is automatically incremented when you write something, by as many bytes as you have written. Initially, when you open the file, this internal index is naturally at 0. This is what is shown here. We consider the file to be logically an array of bytes. We can go to a position and access a number of bytes. Suppose we want to go to a position indicated by pos. If pos is 350, I want to go to the 350th byte. Please note that such a facility exists. But how will I specify pos? I can specify pos as an absolute number: no matter where you are inside that file, Mr. Operating System, go to the 350th byte counting from the beginning. Or I might say, go so many bytes ahead with respect to the current position, wherever you are. Please note that the current position is maintained internally by the operating system, and it could, for example, be here. Normally the so-called current position is not visible to you. It is something which is maintained for you by the operating system.
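The automatic advance of the current position can be watched with the standard C function ftell, which reports the position as a long. The sketch below is only an illustration: the helper names make_file and positions_while_reading, and the idea of filling a scratch file with the letter x, are made up here. With a chunk of 44 bytes, the reported positions should be 0 on opening, 44 after the first read and 88 after the second.

```cpp
#include <cassert>
#include <cstdio>

// Creates a file of `n` bytes (all 'x') so that there is something to read.
bool make_file(const char *path, long n) {
    FILE *fp = std::fopen(path, "wb");
    if (fp == nullptr) return false;
    for (long i = 0; i < n; ++i) std::fputc('x', fp);
    return std::fclose(fp) == 0;
}

// Opens `path`, reads `chunk` bytes twice, and records the position reported
// by ftell on opening, after the first read and after the second read.
bool positions_while_reading(const char *path, long chunk, long pos[3]) {
    assert(chunk > 0 && chunk <= 256);
    char buf[256];
    FILE *fp = std::fopen(path, "rb");
    if (fp == nullptr) return false;
    pos[0] = std::ftell(fp);                         // just opened
    std::size_t got1 = std::fread(buf, 1, chunk, fp);
    pos[1] = std::ftell(fp);                         // advanced by the first read
    std::size_t got2 = std::fread(buf, 1, chunk, fp);
    pos[2] = std::ftell(fp);                         // advanced again
    std::fclose(fp);
    return got1 == static_cast<std::size_t>(chunk)
        && got2 == static_cast<std::size_t>(chunk);
}
```

Nothing in the program adds 44 to anything; the position moves purely as a side effect of reading.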
As I said, when you open a file the current position is zero, and as you keep reading or writing the current position advances. But it is possible for you to set this current position here or there. Just as you say a[i], you can say I want to go to the pos-th byte: no matter where your current position is, take it to this point and start reading or writing from there. That facility exists in C++. It is given by specifying a position in certain special functions which have to be invoked, and which we shall see in a moment. This position value has to be a long integer. It cannot be a normal integer, because a normal integer is often not sufficient to count all the bytes that can fit on a disk. Please remember that a disk has a large size. I might have a very large disk file, and I should be able to specify an individual byte number within that large size. So generally the position value or position variable that we use has to be a long. The function fseek sets the position indicated by pos. fseek is a special function. Just as you have seen cin and cout, or printf and scanf, whenever you deal with files there are corresponding functions; last time we saw fscanf and fprintf. Today we will see additional functions: fread, fwrite and ftell; fopen and fclose you already know; and fseek. Of these, fseek is the most important. fseek can be used to prescribe where you want the next read or write to happen. So you are, in a sense, seeking a position on the disk. That is why the word seek. In fact, the technical term for moving the disk arm forward or backward to get to the right track on the disk is the seek operation; that is where the name fseek comes from. Now this pos is a displacement value specified relative to either the beginning of the file, the current position or the end position. The file beginning is this, the current position is this, the end position is this. So in my fseek I can say pos, which is say 500, and it could be with respect to the beginning position, which means absolute 500.
If I say 500 with respect to the current position, then it will mean current position plus 500. If I said 500 with respect to the end position, what will it mean? It will mean go 500 bytes beyond the end of the file. It is like travelling 500 kilometres west of Mumbai: you will fall into the Arabian Sea. So obviously, if your relative positioning specification is SEEK_END, meaning the end of the file, the pos value should be negative; otherwise it will not be meaningful. Whatever value of pos you specify, it is always added to either the current position, the beginning position or the end position. If by mistake you happen to go beyond the end of the file, for which there is a special indication called EOF, capital E O F, then you will hit an error, as you will get an EOF. Now the fread and fwrite which are specified will read or write so many bytes at this position. How many bytes? We shall see those functions in a moment, and we shall see how the read or write happens. Please note that any time a read or write happens, whether it is due to fread and fwrite calls, or due to fprintf and fscanf calls, or, when stdin has been redirected from a file, due to cin or cout, the internal position is automatically advanced by that many bytes. Here are some file processing functions that we will discuss. fopen you are familiar with. Notice that, in the context of today's discussion, fopen is a mechanism to give our own name to a physical file, and that name will be a file pointer. So when I say FILE *fp, I am defining a file pointer, and when I say fp = fopen("mydatafile.txt", "r"), there are two parameters here. One is the name of the file as it is known to the operating system.
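The three reference points for fseek described a little earlier (beginning, current position, end) can be tried out with a sketch like this one. The helper positionAfterSeek and its `start` parameter are assumptions for illustration: `start` merely simulates having already travelled that many bytes into the file before the seek of interest is issued.

```cpp
#include <cassert>
#include <cstdio>

// Opens `name`, moves to byte `start` (to simulate earlier reading), then
// seeks by `pos` relative to `origin` (SEEK_SET, SEEK_CUR or SEEK_END).
// Returns the resulting absolute position, or -1 if anything fails.
long positionAfterSeek(const char *name, long start, long pos, int origin) {
    assert(name != NULL);
    FILE *fp = std::fopen(name, "rb");
    if (fp == NULL) return -1L;

    long result = -1L;
    if (std::fseek(fp, start, SEEK_SET) == 0 &&   // pretend we already read `start` bytes
        std::fseek(fp, pos, origin) == 0) {       // the seek being demonstrated
        result = std::ftell(fp);                  // where did we land?
    }
    std::fclose(fp);
    return result;
}
```

With a 1000-byte file and start = 200, a pos of 500 lands at 500 under SEEK_SET and at 700 under SEEK_CUR, while a pos of -500 under SEEK_END lands at 500 again, counted back from the end.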
So I am saying: look, Mr. Operating System, from now onwards, whenever in my program I say fp, please remember that I mean that file. And how is that file to be opened? I can either read from or write to that file. I am opening that file with "r", meaning read. r stands for read, w stands for write, and a stands for append; append means add something at the end. Look at another file, FILE *fpout. Again, it need not be called fpout; this is my choice, but obviously the name suggests that I would like to associate this file pointer with a file which I will be creating and writing output to. Notice that I am saying fpout = fopen with db.bin as the file name. There is a mistake here. What is the mistake? There should be double quotes: it has to be a string, and it has to be a valid name that is known to the operating system. Incidentally, it could also be a normal character array in which you have put the name. In fact, many times you do not know in advance the physical name of the file that you need to process. For example, you have asked ten friends to do data entry for ten different mark lists. Now some fellow has made file1.dat or .txt, somebody else has myfile.txt. Do you want to change the program every time to read that one? So what you would do is declare an array, char filename[30] or [40], ask that fellow to first enter the file name as a string, and then, wherever you have the file name, instead of "db.bin" in double quotes you put that array name. That becomes the file. So the first parameter is a file name physically known to the operating system. The second parameter is the mode: r means read mode, w means write mode, a means append mode. If you just say r, w or a, by default the file is a text file, but if you want a binary file you have to add the character b. So rb means read mode binary, wb would mean write mode binary, and ab would mean append mode binary. Given the logic that was used to name fpout, is this mode correct? fpout stands for what? We want to have
an output file: we want to read from a text file and write to an output file. When I want to write to an output file, can I open it in read mode? If I write such a program, the file will not open, because the file may not exist; after all, I want to create it. This is where it becomes important to know whether a file was opened or not. Please note that this can happen even with the first file. Suppose mydatafile.txt does not exist; then what will this fp = fopen do? Can it open a non-existent file for reading? Obviously there will be an error. So you must have a mechanism to check, and the only mechanism that is available to you whenever you invoke a function is to look at the value returned by that function. Consequently, all file processing functions return some value or other. In the case of fopen, since you are supposed to attach a pointer to an external file, if the file is actually opened then a valid pointer is attached. Whenever a file is not opened, the C++ way of telling you is to give you back a null pointer. A null pointer is a universal SOS in C++, or in any program; a null pointer means sorry, I could not figure out anything. Consequently you can check for that null pointer. This is what is done in the next statement. There is an error here also: it says if fp == NULL. Notice that fp was the input file pointer; sorry, it should be fpout, because you assigned the return value to fpout, which would have been a valid pointer. If that pointer is null, the file was not opened. That is why you see the statement: if fpout == NULL, then cout "cannot open output file". So I cannot carry on. In fact, immediately after this, if my entire program was about creating this file, I should abort; I had better say return -1, because I could not open the file. I should do the same thing with every file that I open. So, consequently, the first set of statements shown here is not very well written: I am just saying FILE *fp, fp = fopen of my data file; I am assuming that the file exists.
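The checks discussed above can be put together in a short sketch. The function name openBoth is hypothetical, and the file names are whatever the caller passes in; what matters is the pattern: the output file is opened with "wb" (write, binary) rather than a read mode, and every fopen is followed by a test against NULL, with -1 returned when a file cannot be opened.

```cpp
#include <cassert>
#include <cstdio>
#include <iostream>

// Opens an input text file for reading and an output binary file for writing.
// Returns 0 if both could be opened, -1 otherwise. A real program would do
// its reading and writing between the opens and the closes.
int openBoth(const char *inName, const char *outName) {
    assert(inName != NULL && outName != NULL);

    FILE *fp = std::fopen(inName, "r");        // must already exist
    if (fp == NULL) {
        std::cout << "cannot open input file " << inName << std::endl;
        return -1;
    }

    FILE *fpout = std::fopen(outName, "wb");   // "wb": created if missing
    if (fpout == NULL) {
        std::cout << "cannot open output file " << outName << std::endl;
        std::fclose(fp);                       // do not leak the input file
        return -1;
    }

    std::fclose(fpout);
    std::fclose(fp);
    return 0;
}
```

Because the name is an ordinary `const char *`, a character array filled in at run time (the `char filename[30]` idea from the lecture) can be passed in exactly the same way as a quoted literal.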
There are so many possibilities. By the way, the file must exist in the current directory, where you are at the moment. Suppose you created the data file in the lab 4 directory; today you are in lab 5, and you write this program and run it in that directory. It will not find mydatafile; the system will bomb and the program will terminate. So it is better to check: if fp == NULL, then give some error message and get out. Here is the specification of a desired position. In fact, at this stage I would like you to look at page 2 of the handout. Page 1 you can read later; it is a more general, conceptual thing. But page 2 describes several major functions which are used for file operations. You want to go to a certain position on the disk. You have opened a file, and you want to specify that you want to perform a read or write operation at this particular place. You remember we had actually described fseek earlier. So what is the format of fseek? What is the first parameter? Please read the fseek entry. It is the file pointer. So let's say I want to specify repositioning within the file fp. What is the next parameter? The position. I will typically have declared long int pos, or simply long pos, and I would have said somewhere pos = something, let's say 1200. So pos has a numerical value, and the next parameter is pos. Ordinarily, if I stopped here, it should automatically mean go to the pos-th position counting from the beginning. But C++ permits three possible points of origin with respect to which pos can be prescribed. So here you can say, for example, if you want to specify that pos is to be counted from the beginning of the file, what will you write? The three possible names for the third parameter are SEEK_SET, SEEK_CUR and SEEK_END. These are not values; these are flags, and they are internally defined. How does the C++ compiler know that SEEK_SET is a predefined thing? You have not defined it anywhere. It knows because you have to say in
your program, just as you say #include <iostream>, #include <cstdio>, which means the C-like standard library for I/O functions. When you include that, all these, which are called macros, SEEK_SET, SEEK_CUR and so on, get defined. In fact, SEEK_SET is defined to be 0, SEEK_CUR is defined to be 1 and SEEK_END is defined to be 2. If you print the values of SEEK_SET, SEEK_CUR and SEEK_END, because these are all known values, you will get 0, 1 and 2. Obviously you are not going to add 0 or 1 or 2 to anything; these are only indicators. They say whether the beginning is to be treated as the starting point for counting, or the internal index maintained by the operating system, or the end. No, pos will not become negative on its own. pos is the value which you are setting; pos is not a value which the internal system knows. If pos is positive and the origin is SEEK_END, then it will attempt to seek a position which is way beyond the end of the file. It applies this without any intelligence. In fact, that is a good point; that is what I have explained there. I have used a word like origin. This origin can be SEEK_SET, SEEK_CUR or SEEK_END. The formula that is used internally is very simple: whatever is the origin, that plus pos is going to be the new position. If the origin is the beginning, that means it is zero, then it is 0 plus pos. If the origin is the current position, then it is whatever the current position is: suppose you have already travelled 200 bytes, it will be 200 plus pos; if you have travelled 500 bytes, 500 plus pos. It may well be that you have travelled almost to the end of the file; you have not reached the end yet, but only 20 bytes are remaining. So even if you say SEEK_CUR, you may still go beyond the file. That is your problem. 99 percent of the time you will be specifying an absolute value, relative to the 0th byte, but there are variations where it is easily
possible. For example, you want to traverse backward in a file. How could you do that? You say SEEK_END with minus 20, SEEK_END with minus 100, SEEK_END with minus 1000; it will simply do the arithmetic. Is that clear? Now this is something which I would like all of you to understand very clearly. Internally, whenever you open a file, the operating system maintains an internal index which knows where the next operation has to be done. At the beginning it is at zero. As you read or write, the index advances. The only way to set this internal index to a different point is fseek; there is no other way. However, you may want to know where exactly the index is, whether it is at the 500th byte or the 200th byte or the 753rd byte. You do not know where it is, because it is hidden from you. There is another function which will tell you that index value, and that is called ftell. So if you say p = ftell(fp), where p is another long int that you have declared, then the operating system, which knows where you are, will tell you: I am at the 5253rd byte. But please understand that whenever I do a read or write operation, this value will change. So if you do ftell after every read or write operation, you will know how quickly the internal index is advancing. The way to set this internal index is through this statement: when you say fseek(fp, pos, SEEK_SET), the internal index is brought to the point that you have specified. Let me give an example. Suppose I say fseek with fp and position 0. What does it mean? Suppose the current internal index is somewhere here; the system will be forced to bring it all the way back, and it will be positioned at the beginning. So to find out where the internal index is, you use ftell. Ordinarily that will not be required. It is required when, say, I am at a certain position, I call ftell, I capture that in p and keep it with me. Now I do some read and write operations; later on I want to go back to that very position. Unless I have captured that position and kept it somewhere, how will I go to exactly that position? I can then use
this p to set the internal index by another fseek command. That is the objective. Now there are many variations which are possible, but these variations are non-essential for normal programming. Normal programming follows a simple principle, so let me state that principle, and then you can later read those programs and understand more. The simple principle is this: if I want to read or write sequentially, I open a file and keep reading or writing; the internal index moves automatically as I read. Of course, it is my job to read or write the specified number of bytes, which we shall see in a moment. If I want to directly access anything, I must use fseek followed by a read. If I want to directly write at any position, I must again do fseek and then write. So the logic is very simple: for sequential files, I keep reading, reading, reading, or writing, writing, writing. For a direct access file, which I may want to read and modify, I must fseek and read, and I must again fseek and write. Let's see that after we discuss the read and write statements. Reading from a specified position: this statement is fread. This is followed by what? The file pointer? Always? What is the second parameter of fread? Is fp the first parameter? What is the first parameter of fread? So why did you agree with me when I said fp? You have not reached that point. Please read the definition of fread very carefully on page 2. You agreed to fp because so far you have always seen the file pointer as the first parameter in most of these functions. The read and write functions do not have the file pointer as the first parameter; they instead have another pointer. Remember the student record s. Either I want to read from the disk as many bytes as describe that student record and put them in s, or I want to take whatever is inside my memory in s and put it on disk. Read and write operations require a source and a target. In the read case, the source is the file and the target is somewhere in memory. You have to specify both. That is why the file pointer alone is not
adequate. The file pointer is of course required, because it describes the source in the case of a read and the target in the case of a write. In this particular case, &s is the first parameter, which becomes the target: whatever you read from the file, please deposit it here. That is why the pointer is &s; it points to the entire memory structure of s. This is followed by what I will call the record size, because I am talking in terms of records. So if, for example, you are using a 44-byte structure, this will be 44. This is followed by a count: how many records? So I read one record of 44 bytes. Notice that it is possible to read 20 records at a time. However, if those 20 records are put into a single structure variable s, you will have huge problems, because only 44 bytes are allocated to s; there is nothing more. The point is that this pointer &s need not be a pointer to a structure. It can be a pointer to an array of 1 million bytes, just a block of bytes, and you can say read 10,000 bytes from the disk and put them in the first 10,000 positions here. That is all possible; that is why this is a general statement. In general, whenever you are doing file processing involving information about entities, this count will almost always be 1, because you are reading one record at a time. If it is not 1 but something else, then the pointer to the target must be set appropriately. The last thing, of course, is the point at which this read will occur: at whatever the internal index is, or wherever an earlier fseek has set it. So you will read from here. Please note that after executing this statement the internal index automatically advances by that many bytes. Suppose there was a student 10115 and his record was somewhere here, at this point, and you read this record. Now you want to modify the marks of that student and rewrite the record. Can you just say fread followed by fwrite? No, because if you did, fwrite would write here, after the record, because the index has automatically moved.
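The parameter order just described (pointer to memory, record size, count, and then the file pointer) might look as follows in code. The Student layout below follows the four attributes from the start of the lecture, but the field names, the 30-character name length and the wrapper functions are assumptions for illustration. sizeof(Student) is used rather than a hard-coded 44, since the exact size depends on the compiler's padding.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical record layout: roll number, name, lab batch, marks.
struct Student {
    int   roll;
    char  name[30];
    int   batch;
    float marks;
};

// Writes exactly one record at the current position of `fp`.
// Source: the memory at &s. Target: the file.
bool writeRecord(FILE *fp, const Student &s) {
    assert(fp != NULL);
    return std::fwrite(&s, sizeof(Student), 1, fp) == 1;
}

// Reads exactly one record from the current position of `fp`.
// Source: the file. Target: the memory at &s.
// Returns false at end of file or on error.
bool readRecord(FILE *fp, Student &s) {
    assert(fp != NULL);
    return std::fread(&s, sizeof(Student), 1, fp) == 1;
}
```

Both functions return how the call went by comparing the number of records actually transferred with the count requested (1), which is the value fread and fwrite hand back. Each successful call also advances the internal index by sizeof(Student) bytes.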
That is why you will invariably have a series of statements: fseek, fread, again fseek, fwrite. You have to go back; you have to bring this fellow back here. So after reading, you have to do fseek again. And fwrite has exactly similar syntax. What is the fwrite syntax? Again &s, record size, count; no difference. The only difference is that source and target are swapped: in fwrite the source is &s, the pointer to the structure, and the target is the file, while in the other case the source was the file and the target was &s. But at what point do you write? You had better put an fseek before this to say where. In general, in a program you will not find these read or write statements without a preceding fseek, to make doubly sure that you are exactly where you want to read or write. Okay. fscanf and fprintf we have already seen; they work exactly like the printf and scanf statements. So, for example, with printf you would have had a string which says something like "hello %d\n", and then, let's say, m. Suppose m is 25; this will print hello 25. Now if you want to write this as an output text line to a file, you just put fp here as the first parameter; that's all, nothing else. So when you say fprintf, you are doing printf to a file. We have already seen sprintf, where I can do this to an internal string. After all, printf converts internal binary int and float values into character strings so that you can see them, and scanf works the other way round: it reads character strings and converts them into int, float and so on. All that you are saying is: operate on files. This is the problem. I will not discuss this problem in class, because I have already given two programs, which are on page 3 and page 4. Do not read them now; that is homework. Basically, the problem is solved by these programs. The first program says: read records from the text file, extract the four component values from each record, assign these to the appropriate parts of a structure, and create a binary file called
studentdb.bin, student database dot bin. It is a binary file, so it will have the structure for one student, the structure for another student, the structure for a third student, and so on. If there are 500 students in the course, there will be 500 structures. This particular sample, which works on a sample data file, creates a file with ten students. Then it says: find the marks of a student given a roll number. That second program is named fileops.cpp. It performs different operations. First it tries to locate a student by reading the records sequentially and finding out whether the roll number is found. Notice that this is what you would do in a normal sequential search in an array, and this is where it is important to understand that it will not work for large files. Imagine you go to an ATM to withdraw money. The moment you insert your card, your account number travels miles away through the network to a disk which contains all the accounts of all the customers of that bank. For State Bank of India, it is something like 18 crore customers. How would you like it if your record were found by reading 18 crore records? How long would you be standing at the ATM? So there must be some mechanism to convert your account number into a relative position in the file, say your record is the 53 lakh 48 thousand 434th record, and go directly to it, so that you get your money and other people can also withdraw. That is why direct access is important. The issue of how you translate your account number into a relative position is still critical. To simulate that, here I have roll numbers; it so happens that in the example I have given, a certain roll number is at a certain position. If I know the position, I can directly access the record. So the second part of the problem says: get me the record in the nth position. And the third part says: modify the marks of such and such student. It is like modifying your account: because you have withdrawn 500 rupees, reduce your balance by that amount and rewrite the record. So please go through these. We
will continue the discussion on some additional aspects. Thank you.
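As a rough sketch of the operations described for the second program (this is not the handout's fileops.cpp, which is not reproduced here), the following shows direct access to the nth record and the fseek, fread, fseek, fwrite sequence for modifying marks. The Student layout, the function names and the 0-based record numbering are all assumptions for illustration. The file is expected to be open in a binary update mode such as "rb+" or "wb+".

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical record layout: roll number, name, lab batch, marks.
struct Student {
    int   roll;
    char  name[30];
    int   batch;
    float marks;
};

// Direct access: jump straight to record number `index` (0-based) and read it.
bool readNth(FILE *fp, long index, Student &s) {
    assert(fp != NULL && index >= 0);
    long pos = index * static_cast<long>(sizeof(Student));
    if (std::fseek(fp, pos, SEEK_SET) != 0) return false;
    return std::fread(&s, sizeof(Student), 1, fp) == 1;
}

// Sequential search by roll number, followed by an in-place update.
// ftell remembers where each record starts, so that after fread has advanced
// the index we can fseek back and overwrite the same record.
bool updateMarks(FILE *fp, int roll, float newMarks) {
    assert(fp != NULL);
    if (std::fseek(fp, 0L, SEEK_SET) != 0) return false;
    Student s;
    for (;;) {
        long p = std::ftell(fp);                      // start of this record
        if (p < 0) return false;
        if (std::fread(&s, sizeof(Student), 1, fp) != 1)
            return false;                             // end of file: not found
        if (s.roll == roll) {
            s.marks = newMarks;
            if (std::fseek(fp, p, SEEK_SET) != 0) return false;  // go back
            if (std::fwrite(&s, sizeof(Student), 1, fp) != 1) return false;
            return std::fflush(fp) == 0;              // push the change out
        }
    }
}
```

readNth costs one seek and one read regardless of the file size, whereas updateMarks may read every record before it finds the roll number; that difference is the ATM argument in miniature.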