As I mentioned last time, I want to describe the notion of a random access file, and particularly binary files. We have seen sequential files and how to read and write them. But in larger problems we have to handle a very large amount of data organized in the form of records, such as the data you are handling for your projects. It could be any type of data. It could be records representing bank accounts, balances and so on. It could be records describing employees and their salaries and other things. It could be records describing inventory, for example, the parts which are stored by manufacturing companies, the amount stored and so on. So we do need to handle very large amounts of data, not all of which can fit into the computer's memory. Also, when we want to frequently change the contents of such files, it is not easy to do so if you can read and write the contents only sequentially. That is the main reason why we need random access files. So first I will describe binary files, and then you shall see the notion of random access in files and why it is necessary. Then we will find out how we can actually open a file, read any arbitrary portion of that file, and then write that portion back in place, as long as the number of bytes that we write is exactly equal to the number of bytes that we had read originally. In that case, even if you have millions and millions of records on the disk, which cannot all be put inside the computer's memory, you can pinpoint a specific record, read it out, change it and rewrite it in place. So, almost like updating an array in memory, you can treat the records as if they were an array of records on the disk.
First, the notion of binary files. Text files are what we have already seen. Whenever you interact with the machine through the keyboard and the monitor, what you see on the monitor is typically text information.
So these are symbols which are graphically representable, which can be seen and understood. Not all information can be represented symbolically like this. For example, the fingerprint files contain binary data, grayscale data: one byte per pixel, and the value stored is anywhere between 0 and 255. It does not represent a printable symbol. It just represents the intensity of the picture at that particular position in grayscale. Similarly, you could have digital photographs, digital audio and video files. You need not go that far. Consider the internal representation of some of the data values that we handle. Take int, for example. We have seen earlier that integers are represented internally in a binary format, which could be signed or unsigned. Similarly, fractional numbers are represented in a floating-point format. Now, if you have a four-byte area of memory in which different bits are interpreted differently, some meaning mantissa, some meaning exponent and so on, there is no way that by reading the bits you can immediately see any printable symbol. So this is not textual information. What you do is type in a value in decimal notation, in character form, and cin converts it into the internal format. Similarly, when you say cout, the internal format is converted into text form and you get to see the digits. Now, that is all as far as human interaction is concerned. But is it necessary to store such values as text even on the disk? The point is that the amount of storage you require could become very large, could become excessive. Take, for example, very large integer numbers. What is the largest value that a four-byte integer can store? Four bytes is 32 bits, so if you use one bit for the sign, the largest value of a signed integer will be 2 to the power 31 minus 1. Now, this is a very large value, 10 digits, or 11 characters with a sign.
If you store a single digit as an ASCII code, then each digit takes one byte. So very large values would require 10 or 11 bytes to store. On the other hand, the internal representation is quite economical. Take grayscale values. A single byte can store a value between 0 and 255. But most of the values in your fingerprint image were 120, 140, 170, each one requiring three decimal digits, and if you were to represent them as text, you would require three characters each. That is the reason why your .xpm file becomes so large. You would generally store these things as compact binary images. Essentially, then, you tend to use binary storage even on the disk, not just inside the computer's memory, whenever you want to conserve space. Ordinarily, when you work with PCs in the labs or on your laptops at home, with the kind of problems that you solve, it is not very apparent to you why you need to conserve space. But imagine you are dealing with millions upon millions of records. Each record is not 20 or 30 or 40 bytes, but could be 5,000 bytes, 10,000 bytes. Then suddenly you realize that you need to conserve space. Space being costly is not the only reason. It is largely because the more information you have to read and write from the disk, the longer it will take for your computer program to process that data. So even from a performance point of view, it makes sense to store a minimal amount of information on the disk. This is the reason why you would like to use binary files.
Now, if you have binary files, it is not necessary that all information within a record of the file should be completely binary. A fingerprint, of course, would be a completely binary file. But it is possible, for example, that I have a record comprising a few bytes. The first few bytes may be, let us say, my name. So this is 1, 2, 3, 4, 5, 6 bytes.
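The space argument made here, text digits versus internal binary form, can be checked with a small sketch. The function names `textBytes` and `binaryBytes` are hypothetical, chosen only for this illustration, and a 32-bit integer type is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bytes needed when the value is stored as decimal text,
// one byte per character (a minus sign also costs one byte).
std::size_t textBytes(std::int32_t value) {
    return std::to_string(value).size();
}

// Bytes needed when the same value is stored in its internal binary form.
std::size_t binaryBytes(std::int32_t value) {
    return sizeof(value);
}
```

For the largest 32-bit signed value, 2147483647, the text form needs 10 bytes while the binary form needs 4. A grayscale value such as 170 needs 3 bytes as text, but fits in a single byte when stored as an unsigned char.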
Now, if I want to store some information such as my date of birth, it need not be stored as 2 digits for the date, 2 digits for the month and 4 digits for the year, amounting to 8 bytes. It can be stored as a 4-byte number or even a smaller number. In this case, the next 4 bytes contain something which may be called a date of birth, but it might be a fixed-point number, for example. There could be my salary next, and the salary will be another 4-byte number, which could be a floating-point number. Then there will be my address, which may have 21 characters, such as A-15, whatever. So, you see, it is possible to mix text information with the internal binary representation of some other information and create a composite record. It is precisely these kinds of records that we often deal with in computer files, and it is important for us to understand how exactly we can read or write such records. The random access portion we shall come to in a moment. But the point I am making is that even structured records of mostly textual information may be stored in binary files. Now, such files have to be declared and processed as binary files. So there is an additional specification that is required when you define such a file. Although some of the data may be in text format, overall the file contains non-text information, and therefore we declare the whole file as a binary file. A binary file can, of course, contain text data as well. So how do you define such files, and how do you read and write them?
From a random access perspective, it is useful to have records which are of fixed size. If you take digital photographs, for example, which you want to store along with a person's individual information, and the photographs are of different sizes, each individual record will have a different size. Records of different sizes are not conducive to random access.
So what you do is freeze the record size at the maximum information that can be stored for one individual or one item or one entity of a set. In some cases the information may be small, in some cases it may be more, but the maximum is this much. You define that as the record size. And once you define that as the record size, you may say: in this file I have one million records, and each record is exactly 2,534 bytes long, or 5,720 bytes long, or 48 bytes long, or 38 bytes long, whatever; a fixed-size record.
It is in this context that it will be useful to look again at how files are organized on the disk. A file on the disk is essentially a stream of bytes. As you are all aware, a disk may contain a very large number of files. The disk is organized into various directories and sub-directories, and within a directory or sub-directory you will have multiple files. What we are looking at is the constitution of any individual file. As far as we are concerned as programmers, a file is referred to by what we call a file pointer, typically denoted as fp, but it could be any name. So when you say myfile or studentdatafile, or whatever names you use in your program, these are nothing but file pointers. When you open a file, you associate a specific file on the disk with your name, and that name is nothing but a file pointer. So once you open a file, you get a file pointer. It is also called a handle, a file handle. Now, once you get a file handle, the entire file contents can be treated merely as a sequence of bytes, and that is the most important way of looking at it. That is why we call it a stream of bytes. So first byte, second byte, third byte, fourth byte, fifth byte, sixth byte, etcetera, etcetera. 20 billion bytes, it doesn't matter. A file can be as large as 20 gigabytes or as small as a few bytes.
The entire file is nothing but a series of bytes. The easiest way to picture this is to consider the file to be an array of bytes, an array whose size is equal to the size of the file. The way you access this file should therefore logically be that if you are given a particular position, say a value pos, then the element of the array that pos points to is the content of the file at that position. Traditionally zero is the first position in any array. So if pos is zero, you are pointing to the character M. If pos is one, you are pointing to the character I. If pos is thirty-eight, you are pointing to some character. What you have there may not be an individual character. For example, if you are in the middle of a four-byte integer number, then you may not be able to make any sense of this byte alone. It is for this reason that you ought to know how the records within the file are structured. So there are two ways in which you look at the file. One is the logical way, as per the definition of records that you have given in your program. So you say, first I have the name of the person, then I have the roll number of the person, then I have the hostel number of the person, then I have marks in tests, whatever. That is one way of looking at it. That is the logical way. The way the operating system handles files is merely as a sequence of bytes. Consequently the operating system has the provision to read or write any byte in the file. So you can arbitrarily read the 5,434th byte of the file. You can arbitrarily read twenty-eight bytes starting from any position. Consequently the most important things for physical reading and writing of the file become: one, the file pointer; two, the current position; and three, the number of bytes.
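The three ingredients just listed (the file, a position, and a number of bytes) can be sketched with the standard C++ file stream calls seekg and read. The helper name `readBytes` is hypothetical and exists only for this illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read `count` bytes starting at byte offset `pos` (0-based) of the file at `path`.
// Returns fewer bytes (possibly none) if the file is shorter or cannot be opened.
std::vector<char> readBytes(const std::string& path, long long pos, std::size_t count) {
    std::ifstream file(path, std::ios::in | std::ios::binary);
    if (!file) {
        return {};
    }
    std::vector<char> buffer(count);
    // Move the current position to `pos`, counted from the beginning of the file.
    file.seekg(static_cast<std::streamoff>(pos), std::ios::beg);
    file.read(buffer.data(), static_cast<std::streamsize>(count));
    // Keep only the bytes that were actually read.
    buffer.resize(static_cast<std::size_t>(file.gcount()));
    return buffer;
}
```

Calling `readBytes("somefile.bin", 5433, 1)` would fetch the 5,434th byte, and `readBytes("somefile.bin", pos, 28)` would fetch twenty-eight bytes from any position. What those bytes mean is still up to the program that knows the record layout.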
The easiest way to remember this is that for any file that you have, you open the file and get a file pointer, you maintain a current position in the file, and at that position you can read or write any number of bytes. That is how you have complete control over how the file is handled. If the file contents are not all viewable as text, the file is automatically a binary file. The point is, when you read individual bytes, you must know exactly what those bytes contain so that you interpret them properly. So, for example, if the four bytes at some arbitrary position represent a binary number, then it is pointless for you to read just one byte and try to interpret it. You must read all four bytes, and you must put the contents of these four bytes into an integer variable whose representation is exactly the same as what you had written there. Then let your C++ program interpret that integer as whatever value it has, for which programming constructs are available. So this is the point that I would like you to remember.
You can have records of a fixed size; we shall see that now. I am extending the same problem that we discussed last time, namely creating a database of student records. I have modified studentinfo.h a bit. So we had char name, 31 characters: a 30-character name followed by backslash 0. Then we had roll, which was a character string. Now I say int hostel. Then I say float marks[5]. Let's say some teacher decides that there shall be five evaluations, so there will be five different marks: test one, test two, test three, test four, test five. And finally there is a char grade, a two-character grade: AA, AB, etc. Observe that of all these, the first, second and fifth components are printable character components. In between you have an integer number and five floating-point numbers.
Assuming that an integer takes two or four bytes, how many bytes in total will be required to represent this information? Can you count? 31 plus 9 is 40. Each floating-point number is four bytes, so five into four is 20 bytes. You have 40 bytes here and 20 bytes there, so you have 60 bytes. The char grade is three bytes, so 63 bytes. The int hostel is, let's say, four bytes. So how many bytes do you get in total? The problem is that you will not get the number by simple counting like this, because whenever you define a structure, the way the compiler allocates memory internally could be different from a simple juxtaposition of the member byte counts, since padding bytes may be inserted for alignment. For that purpose, whenever you define composite entities together like this in a struct, just as you can find out the size of a char, the size of an int or the size of a float, you can find out the size of a struct as well. So if that size comes to, say, 67 bytes, then each structure will occupy 67 bytes. If you assign values to the different components and write the whole structure to the disk as is, it will occupy as many bytes as are given by the size of that struct. So for the purpose of our discussion, I have modified the studentinfo structure to contain character elements as well as integer, floating-point and other elements.
I am now creating a binary file out of the data that we had seen last time. I shall look at that text data. Basically we had as input the name, roll number and hostel. For the purposes of initializing this file of student records, I am putting all marks as 0 and the grade as star star, because at the beginning of a semester no marks are there and no grade is there. So this is the print student function. You are already familiar with this from last time. All that I have added is that for int i from 0 to 4, I print s.marks[i], and I also print s.grade.
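The gap between counting bytes by hand and what sizeof reports can be seen in a short sketch. The struct below is reconstructed from the spoken description; the exact member names and array sizes are assumptions, as are the names `naiveSize` and `recsize`.

```cpp
#include <cstddef>

// Record layout following the description in the lecture.
// Member names and sizes are assumptions reconstructed from the spoken text.
struct studentinfo {
    char  name[31];   // 30 characters plus the terminating '\0'
    char  roll[9];    // roll number as a character string
    int   hostel;
    float marks[5];   // five test marks
    char  grade[3];   // two characters plus the terminating '\0'
};

// Simple sum of the member sizes, ignoring any padding.
constexpr std::size_t naiveSize =
    sizeof(char[31]) + sizeof(char[9]) + sizeof(int) + sizeof(float[5]) + sizeof(char[3]);

// What the compiler actually allocates for one record, padding included.
constexpr std::size_t recsize = sizeof(studentinfo);

static_assert(recsize >= naiveSize, "padding can only add bytes, never remove them");
```

With 4-byte int and float, the hand count is 67 bytes, and many compilers round the struct up to 68 so that its size is a multiple of 4. The exact figure depends on the compiler and platform, which is why the program asks sizeof and never hard-codes the number.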
That means I have the student list as well. Observe that in this particular case I have the luxury of declaring a complete array to contain the information for all students. But if the number of students was not 100 but, let's say, 2 million or 5 million, I would not be able to declare an array of this kind, and I would have to read and handle individual records from the file. I have a few additional things here. recsize is the record size. So what is the record size? Let's go back a couple of slides. Whatever is the size of this structure is going to be the record size for one student. As many students as I have, so many records will be there in the file, and therefore I have an integer variable, recsize, which holds this size. The way I define the file, I can use a file stream, for example an input file stream, ifstream. I open it with .open. I give the name of the file as usual, but I now also specify the mode as ios::in. That is, it is an input file. If I say out, it would be an output file. I am sorry, there is one more mistake here. This is the batch file. This is the input file in which I have typed the name, roll number and hostel. All that is text information. So this is not a binary file. This is the input file. I define s to be of type studentinfo and now I calculate sizeof(s). sizeof is built in. It gives me, in bytes, the total size that is internally allocated by the C++ compiler to the structure s, and that I capture in recsize. Now, this is again similar to what we saw last time. We first read the value of n, the number of students that I have, and then for 1 to n I keep reading the records: s.name, s.roll, s.hostel. As I mentioned at the beginning, I arbitrarily set marks to 0 for all the tests and I arbitrarily set the grade to star star, since the grade does not exist at the beginning. And incidentally, to print student I have added one more parameter, a serial number.
This i is the serial number. i varies over 0, 1, 2, 3, 4, 5, 6, 7, 8, etc. This serial number I propose to treat as the key to the student. Why this serial number? Because the serial number implicitly gives me an index into an array of records, an address. Let's go back to this picture. If this is the 0th position and the record of one student is recsize bytes long, let's say 68 bytes, then where will the next record be? Bytes 0 to 67 hold the first record, so the next record starts at byte position 68, which is the 69th byte. So if 68 bytes is the length of the record, every 68 bytes I get the 2nd student, the 3rd student, the 4th student. Correspondingly, if I have a student number as the key, it is the easiest way for me to get a particular record given the number. So you just say, give me the record of the 243rd student. I simply multiply 242 by the record size and that takes me to the beginning of that record.
Here I am just saying that the data has been read from the batch file. Now I will create an output binary database file. For my information, I am just printing out the size of each record. This is the way you create the binary file. Observe that I have dbfilepos, the database file position, just like pos. Please note that declaring pos or dbfilepos as mere integers will not be adequate. Why? Because the number of bytes that you can have on the disk is very large. A 4-byte integer is not adequate. So you always define it as a long integer, so that you typically have 8 bytes to represent a byte position. Now I am creating an output file stream for my file. I am also opening it in the same statement. So I have studentdb.bin. bin is an arbitrary extension I am giving to indicate that it is a binary file. Also note that the mode is set to ios::out, so I am creating an output file. Second, I am also saying ios::binary. So this statement opens a file for writing and says that the file contents will be binary. Of course part of it could be textual, but that is incidental.
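The position arithmetic described here is a single multiplication, done in a 64-bit type so that large files do not overflow. The function name `recordOffset` is hypothetical, used only for this sketch.

```cpp
#include <cstdint>

// Byte offset of the record with 0-based serial number `sno`,
// for fixed-size records of `recsize` bytes each.
// A 64-bit type is used because the product can exceed what 4 bytes can hold.
std::int64_t recordOffset(std::int64_t sno, std::int64_t recsize) {
    return sno * recsize;
}
```

With 68-byte records, the 243rd student (serial number 242) begins at byte 242 × 68 = 16456. With 5 million records of 5,000 bytes each, the last offsets are around 25 billion, far beyond the roughly 2.1 billion limit of a signed 4-byte integer.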
Now look at what I do. Very simple. For i from 0 to n minus 1, I just copy the i-th element of the student list into my structure s and I simply write it. Now, this is a complex write. What am I writing? I am writing s, the structure. The way to understand it is that what you write is actually a number of bytes, namely the record size. At which position do you write? At the current position. Every time you execute the write statement, the file position is automatically moved that many bytes further. So if you write a file sequentially like this, the position keeps moving forward and you keep writing successive records one after another.
This is a peculiar cast. You remember we have seen typecasts: you can say int or float in brackets and cast a particular thing into a different type. Normally such typecasting has to be between compatible types. For example, you can cast an int into a float, or one numeric quantity into another. You can treat a character string as a character pointer, because the string's name gives the address of its first character. Now, when you write to a file, write is not really worried about what contents it is going to write; it expects a character pointer and the number of characters to be written. And that is the issue if the entity which you are writing is not a character string. In this case, obviously, s is not a character string. It is a structure. So this is an artificial way of telling C++: whatever this structure is, please treat the pointer to it as a const char star, that is, a character pointer. This is called reinterpret_cast. It is a special cast operator in C++. In ordinary C programming you would not require this kind of thing, but it is standard practice in C++. So even if you don't understand it, you can blindly write this, followed by &s, the address of the structure that you want to write.
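The write loop with its reinterpret_cast can be sketched as below. This is a minimal illustration, not the program shown on the slides: the struct members are reconstructed from the spoken description, and the function name `writeDatabase` is hypothetical.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Record layout assumed from the lecture's description.
struct studentinfo {
    char  name[31];
    char  roll[9];
    int   hostel;
    float marks[5];
    char  grade[3];
};

// Write every record in `list` to a new binary file, one after another.
// Returns true if the file was opened and all writes succeeded.
bool writeDatabase(const std::string& path, const std::vector<studentinfo>& list) {
    std::ofstream dbfile(path, std::ios::out | std::ios::binary);
    if (!dbfile) {
        return false;
    }
    for (const studentinfo& s : list) {
        // write() wants a character pointer and a byte count, so the address
        // of the structure is reinterpreted as a pointer to const char.
        // Each call advances the file position by sizeof(studentinfo) bytes.
        dbfile.write(reinterpret_cast<const char*>(&s), sizeof(studentinfo));
    }
    dbfile.close();
    return !dbfile.fail();
}
```

After writing n records, the file is exactly n times sizeof(studentinfo) bytes long, which is what makes the later position arithmetic work.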
What is important is recsize. recsize bytes will be written each time you execute this statement. So n records will get written, and finally you should close the file. I forgot to close the file here. Then you return. That's the end of this program. Here is the input data in the batch file. So there are six students; I have just modified some numbers and names. This data goes in. It is textual data, but once I put 0 for the marks and so on, I have floating-point numbers, integer numbers, etc. If I look at the contents of the output file, I can't see them on the screen. I can't see them using gedit, because the contents are not editable. They are not text everywhere. So I use a special program called octal dump or hexadecimal dump. od -xa will give me a hexadecimal dump of the file studentdb.bin. When you type it out, you will get a few lines at a time. What I have shown here is just one line. This 0000000 means the first position. Then 0000020: what would it mean? Not 20 bytes. You are not talking decimal now; you are talking either octal or hexadecimal. But these are the sequential bytes. Now, you observe M, I, L, I, N, D: Milind is the text name that you have given. But the name has been declared to be thirty-one characters long. What you have is a null-terminated string here. All other characters after it are bunkum. They don't mean anything. Whatever was there in the computer's memory got written here. But while interpreting, you are not going to interpret any of those bytes. At the first backslash zero that you find, you terminate your string. So you can interpret it properly. If you look at other values here, for example, these values are not text zeros. They are binary zeros for an integer. So in general you will not be able to interpret everything by looking at the bytes. You will have to write a program to read these and interpret them.
Now comes the crucial question. Can I arbitrarily update a single record in the file?
For example, suppose one student's marks in test three have to be modified. Another student's marks in test one have to be modified. Presume that over the semester the marks have been filled in for test one, test two, test three, test four. The procedure will be exactly the same as we shall see in this particular program. Just as you wrote a query language processor, here the query is very simple. Given a particular test number, the marks and the student's serial number, can I change, in the file, that student's marks in the particular test which I have specified? If I can do that directly, then I have a random access capability. That is what is being demonstrated by this program. So, as usual, I have testmarks as a float and testno as an integer. The serial number is the key. So the serial number is 0, 1, 2, 3, 4, 5, 6. I will say that it is a proxy for the student number. Please remember this is artificial. The actual key to a student, which is unique, is the roll number which has been allocated. However, you cannot easily translate a roll number into a position on the disk. So we are creating this artificial key: 0, 1, 2, 3, 4, 5, 6. It is declared long, for the reason I explained earlier. So if I give some serial number, and if I give a test number and test marks, the issue is, can I update the record in place?
Notice how I open this file. I say neither ifstream nor ofstream. It is not an input file, it is not an output file. It is a file which is open for both input and output. So I simply define it as fstream. The file is the same studentdb.bin. But here I say ios::in and ios::out as well. Of course, additionally, ios::binary. So it is a binary file which I will both read from and write to. I will be careful about when I read and when I write. Please note that I have a position for reading and a position for writing, and unless I am very careful I might read from somewhere and write somewhere else.
The basic capability, which I forgot to mention, is not actually read and write. The basic capability is get character and put character. Of course you have to mention the file. So let's say your file pointer is fp. You will say fp.get or fp.put. This is the basic capability: at the current position you read a character, or at a given position you write a character. That is the fundamental capability that the operating system has, and it permits you to do exactly the same thing at the level of a C or C++ program. Where you get from and where you put to is decided by the position pointer. So you have to actually tell the operating system: in my file I want to read or write at a specific position, please seek that position for me. Since for reading I use get, the corresponding function is called seekg. Since for writing I use put, the function that sets the position for writing is called seekp. So seekp and seekg are the functions associated with any file by which you can give a particular position value and ask the file to go to that position directly. From that position you can either read or write. The example given here will illustrate that.
So this is my file, and I use it for input and output. I just say, give a key value for the student, and I read the serial number. Now I ask for the test number and test marks to be updated, and I read in the test number and test marks. Please note that I will type the test marks as text, but they will get converted into floating point because internally testmarks is defined as a float. Internally testno is defined as an integer, etc. How do I read a student's record based on the serial number? Very simple. I know where I have to go. The 0th student starts at the 0th byte. The next student starts at byte recsize. The next student starts at byte 2 into recsize, etc. So I simply multiply sno by recsize and I get dbfilepos.
So if I want this particular student's record to be either read or written, it should be read or written at this particular position. Then I use the seekg function, because I want to read. I want to get. I don't want to get a single character; I want to get multiple characters, for which you use the function read. If you say get, you will get one character. If you say read, you will get more characters. But before that you have to set the position pointer. So when you say myfile.seekg(dbfilepos), the operating system internally moves a logical pointer to dbfilepos and says, all right, I am ready to read one or more characters at this point. Later on, you issue a read command in exactly the same way: reinterpret_cast to char star, and give recsize. It will read that many bytes from that position and put them into your structure s. So it is as simple as that. You have defined a structure. You put values into its components and write it. Or you read a particular set of bytes from the file and put them inside that structure. C++ will automatically interpret the different components of that structure, because the bytes that have come in fall into place as integer, floating point, text or whatever. So once you have read this, all that you need to do is change the test marks. First I print the record. Then I change the test marks: s.marks[testno - 1] = testmarks. Remember, the third test will mean index 2, since the indices are 0, 1, 2. Now I want to rewrite this. Please note that I will not rewrite only the test marks. I have read the complete structure, so I must rewrite the complete structure. But not the whole file. So I have changed some component of the structure. It doesn't matter which; the structure stands changed. I now seek the position for writing: seekp, p for put. Which position? Obviously the same position, because I want to write where I read from. And I say myfile.write with this structure. So this is direct access. I read so many bytes into s. I modify s.
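The read, modify and rewrite sequence described here can be sketched as one function. This is an illustration under assumptions, not the slide code: the struct members are reconstructed from the spoken description, and the function name `updateMarks` and its range checks are hypothetical additions.

```cpp
#include <fstream>
#include <string>

// Record layout assumed from the lecture's description.
struct studentinfo {
    char  name[31];
    char  roll[9];
    int   hostel;
    float marks[5];
    char  grade[3];
};

// Change the marks of test `testno` (1 to 5) for the student with 0-based
// serial number `sno`, rewriting that one record in place.
// Returns true on success, false if the record could not be read or written.
bool updateMarks(const std::string& path, long long sno, int testno, float testmarks) {
    if (sno < 0 || testno < 1 || testno > 5) {
        return false;
    }
    // Open for both input and output, in binary mode.
    std::fstream myfile(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!myfile) {
        return false;
    }
    const std::streamsize recsize = static_cast<std::streamsize>(sizeof(studentinfo));
    const std::streamoff dbfilepos = static_cast<std::streamoff>(sno) * recsize;

    studentinfo s;
    myfile.seekg(dbfilepos, std::ios::beg);               // position for reading
    if (!myfile.read(reinterpret_cast<char*>(&s), recsize)) {
        return false;                                     // no such record
    }
    s.marks[testno - 1] = testmarks;                      // test 3 is index 2
    myfile.seekp(dbfilepos, std::ios::beg);               // same position, for writing
    myfile.write(reinterpret_cast<const char*>(&s), recsize);
    myfile.flush();
    return myfile.good();
}
```

A call such as `updateMarks("studentdb.bin", 2, 3, 17.5f)` would touch only the bytes of the third record. The rest of the file, however large, is neither read nor written.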
I print if necessary. And then I rewrite it. After that, just to confirm that the changes have happened in the file, since what I have written might have gone somewhere else, and because I am writing this program for the first time, I again read back the same record and print it. I have printed it before the update and I have printed it after the update, so I will know whether any changes have happened, and the changes will be confirmed on the disk. At the end I can close this file.
In general, when I keep updating my records, every 5 days, 10 days or 20 days I would like to get a list of all updated records. What is the current status? Originally I started with roll numbers, names, hostel numbers, zero marks and star star grades. At the end of the semester I will have all the marks and appropriate grades there. So at any point in time, if I want to print all the records in a file, I would like to sequentially read that file and print all the records. That is not very difficult. The rest of the program remains the same. I open the file for input. But this time I can say while not myfile.eof(). This is a very cute trick. eof is a member function of the file, and not myfile.eof() means the file has not ended. One caution: eof becomes true only after a read has actually run past the end of the file, so check that each read succeeded before you use the record. So the loop runs as long as the file has not ended. Observe that this is unlike the earlier case, where we first read n as the number of records and then read that many records. Such luxury will normally not be available to you, because the number of records may increase or decrease depending upon how many students come in or go out. So you will generally not keep the number of records in the file. What you will do is start reading the file. When the file ends, the records are finished. As many as you have read, that is the number of records in the file. So in general you read through the file by checking for end of file. What are we doing here? We simply set dbfilepos to sno into recsize.
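The sequential listing being described can be sketched as follows. Note two deliberate differences from the spoken version, both assumptions of this sketch: the success of read itself is used as the loop condition instead of a separate eof test, and no seekg is issued per record because each read already advances the position by one record. The function name `readAll` and the struct members are hypothetical reconstructions.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Record layout assumed from the lecture's description.
struct studentinfo {
    char  name[31];
    char  roll[9];
    int   hostel;
    float marks[5];
    char  grade[3];
};

// Read every complete record from the binary file at `path`.
// The number of records is not stored anywhere; it is simply however many
// full records can be read before the file ends.
std::vector<studentinfo> readAll(const std::string& path) {
    std::vector<studentinfo> records;
    std::ifstream myfile(path, std::ios::in | std::ios::binary);
    studentinfo s;
    // The loop stops at the first read that cannot deliver a full record,
    // so a failed read at end of file is never mistaken for real data.
    while (myfile.read(reinterpret_cast<char*>(&s), sizeof(studentinfo))) {
        records.push_back(s);
    }
    return records;
}
```

The size of the returned vector is the record count. Collecting everything in memory is only for illustration here; for a file with millions of records, each record would instead be printed or processed inside the loop.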
sno is set to 0 to begin with, and we keep increasing sno by 1 inside this loop. All I do is seekg, that is, seek the get position, to dbfilepos, and then myfile.read at this position as many characters as recsize. And I simply print that student's info, because I know that every time I am going to get one structure full of a student's information. The list of records in the updated file will look like this. You will notice that two changes were made. Let's say Ashank's marks in the third test were changed, and Vinita's marks in the first test were changed. So any update happens directly on the file.
The purpose of all this discussion was to ensure that you should be able to comfortably handle files of any size, no matter whether those files can fit inside your computer's memory or not. You have exactly the same facility as you have to access array elements. Even if you have a large array of 1 million elements, by giving an index you can access an element value. An element could be an integer, floating point, whatever. In exactly similar fashion, there could be a structure which is at the nth position in the file. If you know n, you can read that structure and you can write that structure. After reading, you can modify and rewrite. There is an additional mode which I have not discussed, which is called the append mode. If you have a file and you open it in the append mode, additional records get written at the end of that file. So the file gets extended. That is quite useful when you are adding records, for example. There are many intricacies to file processing. In fact there are advanced courses on database management and so on. Some of you will do those courses later. But for the purpose of this course, this much discussion is sufficient.
We will stop for today, but I will remind you that hopefully you have set the test questions. You remember, last time I asked you to do that.
Every lab batch has to submit three questions: one simple, one medium and one complex. One which will take 15 minutes to solve, another which will take 30 minutes to solve, and a third which will take 45 minutes to solve. I am reminding you that this submission will have to be done. Those batches which don't submit will lose marks, because there are marks associated with the submission. Once again, the questions could be set by any one- or two-person team, but at least three persons must attempt to solve each question in the stipulated time of 15 minutes, 30 minutes or 45 minutes. No names are essential, but there has to be a genuine effort, and that effort must be submitted as part of the submission. About the quiz assignment I will make an announcement on Monday. Thank you.