As I said, we will first look quickly at the queries raised in previous sessions about unions and jagged arrays, and about the execution time of Turbo C programs, and then proceed to the discussion of files. We were supposed to have discussed text files yesterday, so what we covered as basic I/O will now be extended by properly defining files. We will look at the possible file operations and, more specifically, at binary files and random access files. Then we will look at two examples: one on binary search, which I will cover explicitly, and the multiplication arithmetic example, which will be relayed to you later through a sample set of slides and programs. We will also discuss the workshop projects, for which we will take some examples, and the contributions to be made to the question bank. I understand that many participants have questions on the workshop projects, so we will discuss those in detail today.
First, the queries from the previous sessions. Thanks to the person who pointed out that unions apparently require less storage than structs. So why is it that we prefer structs and not unions? I studied unions in fairly great detail and found that a union does not accommodate all the components defined in it simultaneously. It is a special structure that permits the programmer to keep different types of data values in the same location. Obviously this cannot happen at the same time: at one instant there will be one set of values, and at another instant there could be another set of values, not just of the same type but of another type. In short, the program is able to interpret the data stored in a common location differently, depending on the context, that is, on what values are kept in the allocated memory. So unions and structs are not comparable at all. This is the point that I would like to make.
It is wrong to compare unions and structs. Their purposes are entirely different. When I define a struct, and then a variable or an array of that struct type, explicit memory is allocated, and real values of the appropriate types can reside in those memory locations for as long as we wish. A union, on the other hand, does not allocate separate memory to each of the components defined as part of it. Rather, a union is a mechanism that permits different interpretations of the data stored in the same location, obviously at different times. Here is an example of the use of a union. Suppose we wish to prepare a text data file containing the names and details of our remote centers and, for each center, the names and details of the participants. As you all know, we have differing numbers of participants in different places. For example, in Ahmedabad we have, let us say, 30 people. Then we will have one record describing Ahmedabad as the center with 30 as the number of participants, followed by the participant records: Anita, Chirag and the others, 30 of them. I am sorry, I did not have the actual enrollment numbers accessible last night at about three o'clock, so I have given artificial enrollment numbers. The names, however, are real. Similarly, if we have 32 participants in Amaravati, there will be 32 participant records with enrollment number and name (this is a simplified version), and one record naming the center Amaravati and stating that there are 32 participants there. Notice that the total number of records will equal the total number of participants in the workshop plus the total number of centers. However, the data in the record "Ahmedabad 30" or the record "Amaravati 32" is different from that in all the other records, which contain the enrollment number and name of a participant.
Now if I were to read each record of this data into a memory location inside my computer, I would have to interpret a center record in one way and a participant record in another way. Obviously, the two will never coexist in that memory location; I will be reading either one or the other. Unfortunately, before I read a record, I cannot make out whether it describes a center or a participant at the center. To distinguish them, I have included one more column, one more data field, in each record, which I briefly call the record type. Here, for example, 1 is the record type for records that describe centers. Consequently, "Ahmedabad 30" has record type 1 and "Amaravati 32" has record type 1, but all the records that hold participant information have record type 2. So the first record will read as 1, blank, Ahmedabad, blank, 30. The participant records will read as 2, blank, 174, blank, Anita; 2, blank, 392, blank, Chirag; and so on. To describe these I could obviously define two different structures, one for each kind of record. However, since I do not know in advance whether to read the record into one type of structure or the other, it is possible to use a union here effectively. For example, I define struct center_rec with char rec_type, which has the value 1, char cname[40], and int num_participants, which is the number of participants. That is one type of record. The other structure is struct participant_rec, with char rec_type (I have deliberately given it the same name, but this time it will have the value 2), int serial_no, which is the serial number of the participant, and char pname[40]. Observe that both structures incidentally have the same size, although that is required neither by my data processing needs nor by the union.
It is clear that if I define a union of these two together, the question our friend raised the other day becomes pertinent. Individually, each of these structures seems to require 40 plus 1, that is 41, plus 4 for the int, which means 45 bytes (assuming a 4-byte int and ignoring any padding the compiler may insert). The other one will also require 45 bytes. When I define a union of the two structures, only 45 bytes are required in total. So he is right in terms of memory allocation. However, he is wrong in that the two structures cannot hold different data simultaneously. At any point in time I can read 45 bytes and store them in the memory allocated to the union, but I am restricted to using either one record structure or the other. To conclude, a union is a way to interpret the same memory locations in different contexts, depending on what is stored there at any one point in time. And while I might have a union of multiple records, structures or types, it is not possible for all of these data items to hold values at the same time.
There was also a question about jagged arrays. I read more about the jagged array and found that it is essentially an array of arrays. It differs from the normal multidimensional array, in which the number of columns, or elements, in each row is the same. In a jagged array, each row can be defined to have a different number of elements, and the rows are themselves arrays. This is a special data structure defined in most object-oriented programming languages. What could be the objective of using a jagged array? Why would I require a two-dimensional array to have a different number of columns in each row?
Well, one example could be an extension of the same illustration that we took for centers and information about participants. Let us say the information about each participant is contained in a structure. For each center there will be as many participants as are enrolled there, and therefore an array of participants, a different array for each remote center. Ordinarily, to accommodate this in a conventional two-dimensional array, I would have 22 rows corresponding to the 22 remote centers, and in each row I would have as many columns as the maximum number of participants in any center. So let us say one center, Chennai, has some 80 participants. Then each element of the array of centers will be another array of 80 elements, each of which contains information about one participant. With a jagged array, however, while I still have 22 elements representing the 22 centers in the first dimension of the array, the second dimension, namely the row for each center, need not be exactly 80 elements long. It could be 35 for one center, 12 for another, 48 for another. This permits us to economize on storage and to have exactly as many elements as are required. The jagged array, unfortunately, does not exist as a data type in conventional programming languages. As a matter of fact, it does not exist as a special or native data type in other programming languages either. However, in programming languages such as C++, C# or Java, jagged arrays are permitted as an abstract data type.
That was one issue that was raised. The next one was in relation to the demonstration I had given of execution time, where we had briefly discussed the notion of algorithmic complexity.
The question was: in Unix I can use the time command to find out how long a program takes to execute, and also how much time was spent by the instructions of my program, how much was operating system overhead, how much real time elapsed, and so on. What do I do if I am using the Turbo C environment? My first observation is that Turbo C, GCC or any other C compiler does not and cannot have any mechanism to tell us the execution time of a program written and compiled using that compiler. The execution time depends on the computer system on which the program is run, and the operating system alone is capable of giving us the time information. So how do we find this information? Just like the time command that we have under Unix or Ubuntu, there are commands and tools under Microsoft Windows through which any executable generated by Turbo C can be run and its execution time measured. However, for the less initiated, who are not very sure which command and what structure should be used to find the execution time, I would suggest a simple mechanism. Open a command window in your Microsoft environment. In that command window give the command time /t, or simply time. The form time /t prints the current clock time without prompting you to set a new one. Then run your program, and then give time /t as a command again. The difference between the two times tells you roughly the clock time that elapsed between the two commands, and therefore the total time taken to run your program. This is not as accurate as the time command. For better accuracy you can use the instrumentation provided by the operating system, whether Microsoft Windows or Unix, or something that can be incorporated in your program itself.
For example, you can make a system call to get the current system time, and you can measure the execution time not only of the entire program but of any segment of it: you make one time call at the beginning of the segment and another after it has executed. In this context I would also like to mention an extremely important tool called the profiler. As we know, our programs have many iterations, if statements, conditionals, heavy computations and so on. If you take a 500-line program, not every line takes the same amount of execution time. Obviously, if there is an iteration, it will take much longer. If there is a nested iteration, as we saw in our example, that will take longer still. If there are multiple iterations at different places, each running for a different amount of time, I would not really know which part of my program takes the most time. For this purpose there is tooling available, by which I mean software that is part of the compilation process. Whether you are in a Microsoft environment or a Unix environment, there is a utility called a profiler. When you compile your program with the profiler enabled, it builds on the analysis that the compiler does of your program (the compiler forms data flow graphs and so on) and inserts counters for the different segments. When your program is executed, these counts are updated, and at the end they are output. So you know which portions of your program ran longer, or accumulated higher counts, that is, executed more instructions. That way you can find what we performance people typically call the hot spots in your program. It is a common rule of thumb that, for more complex software, 20% of your program (not a single program, but the whole program system) takes about 80% of the execution time. But you do not know which 20%.
The profiler will help you identify this 20%, after which you can concentrate your attention on that portion of the code and optimize it by rewriting it properly or writing it differently.
I would also like to mention the power of the web. Many of you will be aware of it, but I am not sure how much of this power you use. I strongly suggest that you can and should use as simple a thing as a Google search to find answers to technical queries. I have constructed an example based on my experience last night. I was discussing the workshop slides with my colleagues, Tushar Kamli and Nagesh Karmali, and a query was raised about a structure within a structure. The first question was: is it permitted? I answered yes, of course it is permitted. While I wanted to give an example to illustrate such usage, I also wanted to know what the web has to say about it. So I put a query to Google which simply said "struct within a struct". I got thousands of responses, one of which typically looked like this. This response came from this particular site; I have put the site address here, so those of you who are curious can paste it into a web browser and get it. The person asking the question has given an example problem, an artificially constructed one, to show the difficulty. He has defined a structure called a_structure containing three float fields: float a, float b, float c. He wants to incorporate this entire structure as part of another structure in a header file, test.h. So in test.h he wrote struct b_structure, and inside it he wrote a_structure location.
So location is meant to be a variable of the type a_structure, followed by float d and something else. He says this was the problem line: when the program was compiled he got an error saying "syntax error before a_structure", and he could not figure out what the syntax error was. Is the opening brace wrong? Is this syntax wrong? And so on. The correct answer, as many of you will have guessed, is that a_structure location is the wrong way of specifying the member. In fact, this was pointed out by two people immediately. The first response said that in C you have to write struct a_structure whenever you reference that type. Notice that, unlike int or float, the name of that type is not just a_structure, although you have defined it to be a structure. The whole name of the type is struct a_structure. If you want a_structure by itself to be the name of the type you are newly defining, then you have to use a special statement called typedef. He gives that answer, which I thought I would share with you. Most of you will know this, but for those who do not, this is another way of defining the data type a_structure. You say typedef struct, then you give the structure definition, and at the end you say a_structure. Notice that if you had said struct a_structure, given the definition, and written any name at the end, that name would have become a variable of type struct a_structure. But when you say typedef, a_structure itself becomes a type, like int or float. The compiler recognizes this, and subsequently you can simply use a_structure wherever you want to declare a variable of that type. If you had done this, then saying a_structure location on the previous page would have been right. However, since a_structure was defined simply as struct a_structure with float a, float b, float c, it is wrong to say that, and you have to write struct a_structure.
This is exactly what was pointed out by the second person. That response says: either add struct before a_structure in the definition of b_structure, or use typedef. He gives the code: struct b_structure containing struct a_structure location, then float d and whatever else. You see how quickly and succinctly people write, pointing out the errors. The web is an extremely powerful collaborative medium. It is not just a one-way answer-giving machine. There are collaborative platforms such as this one. If you go to this site, you will find that it has an abundant set of questions that various people like you and me have raised, and a large number of experts who have kindly responded with answers. This is what enriches every human being who undertakes that kind of activity, in this particular case programming. In fact, the subject portal I mentioned, which we hope to launch sometime in March or April for effective teaching and learning of computer programming, will be an example where all of us can continue to collaborate and extend this collaboration to a still larger number of teachers and students.
With this, the need for a struct within a struct remains to be illustrated. The previous example, about which the question was asked on the web, was an artificial one; I have tried to construct a realistic example. Suppose there are people working in some factory or business, whom we call employees, and their addresses need to be recorded. Consider the address of one employee. Typically it will be represented as a collection of pieces of information: house number, street name, city, pin code and so on. So I might use the following structure definition for the employee address: struct emp_address, with int hno for the house number, char street[40], char city[40], and int pincode.
This completes the employee address. To refer to an individual component, I will of course first define a variable of this structure type. Suppose I define a variable called e; then I can say e.hno, or e.street, which will be an array of 40 characters describing the street name, or e.pincode, for example. Now imagine that the company in which this employee works needs to maintain data about the employee other than the address: an employee code, the name of the employee, the salary paid, the date of joining and so on. These are different pieces of information. To define such a record I will use another structure called struct emp_info. This struct emp_info will have ecode, ename and so on. But now I also want to record the address, and I want to keep the totality of the address as I defined it in the other structure, so I incorporate that entire structure here. This is the way to do so: int ecode, char ename[40], struct emp_address eaddress, float esalary and so on. Here eaddress is the address of the employee and struct emp_address is its type. If I define emp1 to be a variable of type struct emp_info, then the employee code is emp1.ecode, the name is emp1.ename, and the salary is emp1.esalary. The entire address structure can be referred to as emp1.eaddress. If I want to refer to the city in which the employee lives, or to the pin code, I can simply say emp1.eaddress.city or emp1.eaddress.pincode, because emp1 is a struct, its element is eaddress, and the element of that in turn is pincode. This is a very neat way of organizing complex pieces of data as a structure, or record. When we discuss data structures in our courses, the ones typically discussed are stacks, queues, linked lists, graphs and so on.
Stacks, queues, linked lists, graphs and the like are extremely important in very specific applications. For example, if you want to implement sorting and searching, you cannot do without understanding the notion of trees: binary trees for in-memory searches, or B-trees and other multi-node trees in the case of file searches. What is important to understand is that in a large majority of application fields, the most often required data structure is an organized structure like the one we just saw. Multidimensional arrays and organized structures like this, that is, structs, usually account for more than 90% of the usage of data structures in applications.
So what is a file? A file is regarded as a large collection of bytes stored outside the main memory, typically on a disk. That is as simple a definition of a file as can be given. As a matter of fact, the C programming language recognizes a file in exactly this simple fashion, namely as a collection of bytes. The more important thing is that a file ordinarily resides outside the memory, on a disk. We have not discussed the organization of files on a disk at all in this subject so far. When you teach programming to your students, in most syllabi you will be required to explain not only the hardware organization of the basic computer but also something about the disk, the file system, the operating system and so on. In that context you will be telling them some preliminary notions about files. But when we discuss files under the C programming language, we are talking about how to manipulate the data contained in these files inside the computer's memory. Clearly, then, there must be a movement of data between files and memory. How exactly that movement is done is what file input/output in a programming language describes. Once the data has come from the files into the computer's memory, my normal knowledge of C programming constructs is adequate to handle it.
I can have my expression evaluation, while loops, for loops and so on, and I can manipulate the data. Once I have calculated the desired results, I will again require file I/O to put them back to the outside world. In short, input/output is an extremely important aspect. The very purpose of running a program is to get some data from the outside world, process it, and give the results back. Therefore we need to understand how input/output is done. Yesterday we discussed input/output in a generic sense, particularly in the context of formatted input/output, which is what we usually deal with, because we want to give input data as character strings typed on a keyboard, and we want to collect the output of our programs and see it on a computer terminal. This is the most typical I/O. Generically, however, a file is defined as a collection of bytes stored outside the main memory, typically on a disk. Files on the disk are managed by the operating system. There is a component called the file manager, or file system manager, which manages the organization of files and data on magnetic media such as disks. It provides for directories, subdirectories and so on, and within the subdirectories you have the files. Files can be created, data bytes can be written to them, data bytes can be read from them, files can be deleted, files can be extended, and so on. Additionally, if you have an existing file, you can insert data in it at practically any arbitrary location, or delete data from it at practically any arbitrary location. Suppose you are using a word processor to edit a file into which you have entered a large amount of text. Let us say you are typing a chapter of a book that you are writing, and suddenly you decide that three pages should be deleted from that chapter because they do not make sense.
When you delete the three pages on your screen with a simple delete operation of your word processor and then save your document, the corresponding pages have to be deleted from the original file. The operating system, incidentally, does not keep creating a new version of a file every time you save it, and the file that you save may differ from the original file. All of that is managed because files on the disk can expand and shrink through insertion and deletion. Most importantly, files can be created, data can be written to them, and files can be opened and data retrieved from them for reading. Each file has properties: it has a name, it has a path, it has permissions for reading, writing and so on. The physical location and contents of any file are known to the operating system. In short, when I write a program in which I have to read data from or write data to a file, I have to use the operating system as an intermediary. Of course, I do not call the operating system explicitly. But it is worth remembering that any input/output operation is necessarily done through internal calls to the features of the operating system that actually permit this kind of input/output.
Now for the first peculiarity of the C language that we would like to understand. The C language treats a file as a stream of bytes. That is why you see the word stream: in C++, if you notice, you have ifstream, iostream and so on. It is a stream of bytes, like a stream or a river. A river, of course, flows in a single direction and water keeps flowing in it. A stream of data is a stream of bytes, and that is what a file is in C. A file can be opened for reading bytes from it, which we call an input operation, or it can be opened for writing bytes to it, in which case it is called an output operation. Or you can do both. The bytes are simply treated as characters.
Now that is very interesting, because a char in C is treated as a small integer. Therefore when you read bytes from a stream, you actually get int values. They are, of course, generally characters if you have a text file, and they can be treated as chars if you wish. A file is referred to through a special pointer of type FILE *, which points to a file object, declared for example as FILE *fp. I would like to repeat what I mentioned earlier: the C programming language per se does not have any direct constructs for input and output. Input and output are all implemented through a set of library functions. The type name FILE, in capitals, and all the functions that deal with input/output are part of the standard C library. So when you include stdio.h, or stdlib.h for example, you are including all of these things in your program. For such a file, C is capable of positioning a pointer at any byte within the file, from which or at which bytes can be read or written. This is an important concept to know. Here is an illustration. First of all there is FILE *fp; that is, fp will point to the file object. Let us say this is a file. I have shown it as a strip containing a series of bytes: one, two, three, four, five, six. How many bytes can a file have? Millions and billions. The total number of bytes in the file is called the size of the file. When you open a file, it will usually be opened at the beginning. So if you are reading, when you say "get me a character" you will get the first character; when you say "get another character" you will get the next character, and so on. The operating system, and therefore the C program, maintains a pointer to the location currently pointed to within the stream. Typically we use a name like pos, or position, for it. So each file, in short, has a file pointer, which refers to the entire file.
Any operation on this file will require us to use the file pointer, once we have associated it with the file. The second thing is the position, pos, which at any given point in time points to a specific byte on this entire strip. When I say "points to", the pointer is actually notional. For example, suppose your entire data is on disk, say 2 million bytes, and you set pos arbitrarily to byte 253. There is nothing like byte 253 in your memory, where the pointer resides; the data is still on the disk. So this is an internal computation, which amounts to positioning a pointer at a location on the disk. It is possible for you to go to the file system and say, please read me so many bytes starting from this position. The operating system will use the value of pos, go to the disk file, and read as many bytes as you prescribe. You can also write as many bytes as you want at that position. Of course, this provision is available for what we call direct access files. Most of the files that we have seen so far are sequential files, which means you keep reading the data in sequence and you keep writing the data in sequence. When you do that, you obviously do not need to know about the position pointer pos. That is the reason why, in formatted input/output, we never had to worry about anything called pos. We simply said scanf, and scanf read consecutive bytes. Once so many bytes had been read and we wanted to read something more, scanf automatically read from that point onwards. It is true that we did not explicitly use a file; we were using scanf in the context of keyboard input. Similarly, when we said printf, the written output went to the terminal. So where is the disk file in this context? We shall shortly see that both of these are exactly equivalent to a disk file.
There is no difference in behavior, except that they are sequential files and therefore the position pointer is very rarely used with them. A file can be either a sequential file or a random access file. Random access means I can make use of the position pointer to read or write any arbitrary number of bytes at any arbitrary place. Similarly, a file can be either a text file or a binary file. We are familiar with the text file, which contains text data. A binary file is one that can be used to store non-textual data such as digital images. There are open function calls (not really statements) which permit you to open a file. Remember what I said: every file has a name. Now, there are modes for opening a file. I have tried to list all the modes here. The ones you will most frequently require are the r or rb mode, for opening a file for reading, and the w or wb mode, for opening a file for writing. Notice that when I use the character b, the file is treated as a binary file; without b it is a text file. So the simple difference is that r is for read and w is for write, and r+ or r+b is for reading "plus", which means reading and writing. If you want to do both input and output you use r+, or r+b if it is a binary file. If you want to be able to create a file and then read and write it, you have to use w+. There is a subtle difference between r+ and w+. With r+ the file is primarily for reading, but you may also write. With w+ the file is primarily for writing, which means you can create the file, but you can also read from it. Otherwise, you can do practically all I/O operations in either of these modes. You also have a special mode for opening a file called the append mode. In append mode data can only be written; append is obviously an output mode. You cannot read data in append mode, and you can write data only at the end of the file.
So, consequently, no matter what value you associate with that pos variable I mentioned, and you might try to position pos somewhere in between, in append mode data will always be written at the end of the file, after the last location. So this is about the various modes for file opening that you have.
Very briefly, let us discuss sequential files. Sequential files are read or written only in sequence, as I mentioned. So the position pointer is essentially managed automatically by the read and write functions. We do not have to do anything explicitly. Now, most text files are treated as sequential files, and formatted input and output can be done using these. However, it is not impossible, in fact it is definitely permitted, to have text files as random access files also, but mostly they are sequential files.
This is perhaps the most important observation, which I would like you to tell all your students emphatically: whenever a compiled C program starts executing, the operating system automatically gives it three files to work on. One of them is called stdin, or the standard input file. Another is called stdout, or the standard output file. And the third one is called stderr, or the standard error file. A C program need not open or close these files. These are opened by the operating system and are made available to you. When your program stops executing, the operating system automatically closes these files. So you do not have to use open or close statements, but you can use all the associated input and output statements. These are standard files meant for specific purposes: input is the purpose of stdin and output is the purpose of stdout. As a matter of fact, the sequential file stdin is implicitly connected to your keyboard. Remember what I said: a file is a stream of bytes. So a stream of bytes emanating from your keyboard constitutes stdin. Consequently, when you say scanf, for example, you do not have to give any file name.
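The three standard files can be handled like any other FILE pointer. The sketch below assumes only the standard fscanf and fprintf calls; the function name sum_integers is hypothetical. It takes its input, output and error streams as parameters so that the same code works on stdin, stdout and stderr or on disk files.

```c
#include <stdio.h>

/* Hypothetical helper: add up the whitespace-separated integers found on
   `in`, print the total on `out`, and send any complaint to `err`.
   Returns the total of the integers read before input ended or went bad. */
long sum_integers(FILE *in, FILE *out, FILE *err)
{
    long total = 0;
    int value;
    int status;

    /* Each call continues from where the previous one stopped; the
       position within the stream is managed for us. */
    while ((status = fscanf(in, "%d", &value)) == 1)
        total += value;

    if (status != EOF)      /* 0 means the next item was not an integer */
        fprintf(err, "warning: stopped at non-numeric input\n");

    fprintf(out, "sum = %ld\n", total);
    return total;
}
```

Calling sum_integers(stdin, stdout, stderr) from main reads from the keyboard and prints on the monitor. In the same way, scanf("%d", &value) behaves like fscanf(stdin, "%d", &value), and printf(...) behaves like fprintf(stdout, ...).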
scanf implicitly assumes stdin as the file. Similarly, when you say printf, the output always goes to stdout, which means it goes to your monitor. stderr is an output file like stdout; you have to mention explicitly that you are outputting to stderr, but it is also connected to the monitor by default. The operating system provides features to disconnect these three files from their default connections and to connect them to other files instead. That is called redirection. So whenever you execute a program, let us say you compile myprog.c and you say ./a.out, ordinarily all input will have to be given from the keyboard. But if you want to read all input from a file exactly as if you had typed it, you can simply say ./a.out < myfile, where myfile is a disk file. What the operating system does is disconnect stdin from the keyboard and temporarily connect it to myfile, so that all data is read from myfile. But otherwise the behavior of the stream of bytes remains the same, whether stdin is connected to the keyboard or to any other file. These are some of the library functions.
Can we go back to the query that was raised here? Can we go over to Nagpur briefly? Yes, Nagpur.
Sir, I want to ask about the jagged array. Could you please elaborate on that point? How do we use a jagged array in C? Can you give an example?
I think I missed stating that there is no notion of a jagged array in C at all. You cannot use one because it does not exist, and you cannot define a jagged array in C. The jagged array is an artificial data type, or abstract data type, which has been defined in C++, C# and Java. It is possible because object oriented languages permit not only the definition of abstract data types but also, through operator overloading, the definition of new operators, the interpretation of new operations, etc. Such extensibility does not exist in C.
The limited extensibility is through typedef, whereby you can merely define structures as new types and so on. Therefore, I am afraid, jagged arrays have no place in C programming. You cannot use anything like jagged arrays or any such esoteric data structures naturally. You will have to do that artificially, in a completely different way, and there is no convenient way of doing so in C. So as far as C programming is concerned, just tell your students that jagged arrays do not exist in C. We must take a 5 minute break. Thank you so much.