So, here again there are detailed specifications. The format string can contain white space or ordinary characters. Anything except white space or the percent character is an ordinary character, and the next character in the input stream must match it. Incidentally, formatted input means that the input values are given in some format. Notice the way cin works: I have one value, one or more blank spaces, and another value. Both values are read and associated with two different variables. I can do the same thing with scanf. Just as cin is incapable of reading a complete line, with spaces and things like that, as a single string, so is scanf, which is essentially a formatted input function designed to distinguish between different values typed on a line with some separators. We require something special to handle strings even here. Of course, individual strings can be read as per the format that you specify. Again, there are lots of details about conversion specifications. But I will just note that there are other functions, such as getchar and putchar, gets and puts, or getc and putc, which handle reading and writing of individual characters or strings.

Here is an example of reading simple characters without using scanf, in a proper C program using getchar. So: include stdio.h, include stdlib.h (this is the standard library), and ch = getchar(). Notice the fundamental dichotomy that the C language has. I am reading a character, as the name indicates, but I am reading it into an integer variable. The sad problem with life in C programming is that C uses the integer and character types interchangeably. As a matter of fact, it hardly distinguishes between the two, because when a character is read, it is actually its value that is read. Maybe it is the ASCII value or whatever, but it is a numerical value, an integer, and C merrily treats that numerical value as an integer.
Therefore, when I get a character, I can read it into an integer variable. I can test whether that character is, let us say, the character a. Notice that ch is an integer, but when I compare ch with 'a' to see whether it is not equal, what happens is that the character a is taken, its ASCII value is found, and that value is compared with the value stored in ch. So if the character that I type on the keyboard is b or t or d, it will not match. If I type a, it will match. The moment it matches, the while loop will break and I will get out. Otherwise I check whether the character given is \n, which is the newline character. When I press the Return or Enter key on my keyboard, I actually send a newline character. If that is what came in, the program prints that the character was a newline character. But notice how an ordinary character is printed: printf with the format "ch was %c, value %d\n". The \n in this format, by the way, has nothing to do with the \n in the comparison. This \n is part of the formatting, which says that at the end of whatever gibberish you print, go to a new line. What is important is to look at %c and %d. The difference is that %d is a format specifier for printing a numerical value, while %c is a format specifier for printing the character represented by the ASCII code stored inside that variable. So in both places I am printing ch and only ch. But the first ch will be interpreted as a character, and that character will be printed. So if I have not typed a newline character but, let us say, b, then this printf will print b. Next time, suppose I type z: the first specifier will print z as a printable character, because that is what the format specifies. In the same way I am saying value %d. So I will get, say, the character b, then the word value, and then some number. And what will that number be? It will be the ASCII code of that character.
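The program described above can be sketched roughly as follows. This is a minimal sketch, not the code from the slide: the function name echo_codes is hypothetical, and it reads with getc from a stream passed as a parameter (getchar() is equivalent to getc(stdin)) so that the same loop can be exercised on any input. It also stops at end of input, which the lecture does not mention; being able to hold the special EOF value is one practical reason why ch is declared as int rather than char.

```c
#include <stdio.h>

/* Hypothetical helper: reads characters from `in` until the character 'a'
 * or end of input is seen, and writes each character with its numeric code
 * to `out`. Returns the number of characters reported. */
int echo_codes(FILE *in, FILE *out)
{
    int ch;        /* int, not char: it must also be able to hold EOF */
    int shown = 0;

    while ((ch = getc(in)) != EOF && ch != 'a') {
        if (ch == '\n')
            fprintf(out, "ch was a newline character, value %d\n", ch);
        else
            /* the same variable printed twice: once as a character, once as a number */
            fprintf(out, "ch was %c, value %d\n", ch, ch);
        shown++;
    }
    return shown;
}
```

Calling echo_codes(stdin, stdout) from main gives the interactive behaviour described in the lecture: every typed character is echoed with its code until an a arrives.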
In short, what this program does is this: if I keep typing characters, it will print every character and show its ASCII code. The exception is the character a: when I type a, the program will come out of the loop. A newline character is not an exception; even if I press Enter, the program will keep going again and again. So the program will terminate only if I give a. Notice that after printing whatever character I have input, the program reads another character and goes back to the while loop. So this is a sort of infinite iteration: till I type a, I will not get out. And in this way I can get the ASCII code for every character that I input, except of course for a, because when I type a I will get out.

As I said, the C language treats a file as a stream of bytes. A file can be opened for reading bytes from it, for writing bytes to it, or for both. These modes are called input mode, output mode and I/O mode. The bytes are simply treated as chars. Each individual char is a numerical value between 0 and 255, which is what can be put inside a byte. So when we invoke the function scanf, or the cin operation which we have been using, the program actually reads from stdin and hence from the keyboard. Similarly, when I say printf or cout, the output is written to stdout and hence to the monitor. And as I said, OS error messages are output to stderr. One point which I have not raised concerns redirection on Unix. I indicated that I can redirect input to come from any file that I have pre-written and stored on the disk. Similarly, I can redirect the output to a file on the disk, so that a file will be created and the output will be written to it. But suppose I want to store the messages which go to stderr, that is, I want to redirect stderr. How do I do that? Actually it is a simple mechanism; since this is not a tutorial on operating systems, I will not spend time on it.
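For readers who want something concrete to experiment with, here is a minimal sketch of a function that writes to both output streams. The function name report_marks and the message texts are hypothetical; the shell commands in the closing comment use standard Unix shell redirection syntax and assume the compiled program is called a.out.

```c
#include <stdio.h>

/* Hypothetical helper: reports one marks value.
 * Normal output goes to stdout, the complaint goes to stderr.
 * Returns 1 if the value was accepted, 0 if it was rejected. */
int report_marks(double marks)
{
    if (marks < 0) {
        fprintf(stderr, "warning: negative marks %.1f, student treated as absent\n", marks);
        return 0;
    }
    printf("marks: %.1f\n", marks);
    return 1;
}

/* With a main that calls report_marks, a Unix shell can separate the streams:
 *   ./a.out > out.txt              stdout to a file, stderr still on the terminal
 *   ./a.out > out.txt 2> err.txt   stderr to a file of its own
 *   ./a.out > all.txt 2>&1         stderr sent to wherever stdout is going
 */
```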
But I would request all participating teachers, if they do not know this already, to experiment with funny commands like 2> and 2>&1. How the operating system recognizes these at the OS level, for command interaction through the shell prompt, is an interesting thing to look at.

Now I come to the purpose of handling files. The purpose of handling files, as I said, is not to master the technique of opening files, closing files, writing somewhere in between, seeking some position on the disk and so on. These are details which we are required to deal with, but the objective is always to handle the data inside the file and to handle a specific way of organizing that data. We must remember this objective very clearly, always, and must communicate it to the students. The syntax of the formatting details, the syntax of the open statement and the close statement, the syntax of fseek, etcetera, is actually irrelevant. It is a necessary evil. We must tell our students that the objective of all these functionalities is to be able to read data meaningfully from the disk, to write data meaningfully to the disk, and to organize this reading and writing in the most optimal and convenient way. So what is of paramount importance is the data and its organization. Since we have to store the data on the disk and retrieve it from the disk, we need all these commands. Since we might want to do this in a peculiar way, we need additional features. All of these are provided by the C functions. I emphasize this because I have always seen, and many of you have corroborated, that in our teaching the details of the syntax are often what is emphasized. I think we should never let our students lose sight of the main objective of everything that we do.
Just as the main objective of writing programs is to solve problems, the main objective of all file I/O statements and their complicated syntax is to efficiently handle input and output of data with the external world, period. Let me illustrate how important the data is, how important its handling is, and how reading and writing are just an accompanying exercise that we have to fulfil. So here is an example. This example is from my CS 101 examination marks. We conduct a mid-sem exam and an end-sem exam. After the mid-sem examination was conducted, we got the marks entered in a text file. Actually the marks were entered in a spreadsheet; all of you are familiar with spreadsheets. When we want to process the data, we cannot write a program to read the spreadsheet directly. So what my teaching assistants did was export the data from the spreadsheet using what is known as the CSV format, comma separated values. So they took these comma separated values, and these are the records they created. There is a serial number, which comes whenever you export data from a spreadsheet unless you are careful. Then there is a comma. Then there is a roll number. Then a comma. There is a name. There is a comma. This is the lab batch. Remember we told you we had batches 0A, 0B, 0C, 0D, 1A, 1B, 1C, 1D, etc. And the last value is the marks. You can see here 44.5, 44, etc. The TAs were advised that if somebody is absent, they should give negative marks in the spreadsheet. Different TAs gave different negative marks. Somebody gave minus 1, somebody gave minus 2, etc. So in short I have this funny data. More complications arise because, let us say, for one student the name was not there at all. The TA had forgotten to enter it, or we did not have it on record. Now what happens when you convert data from a spreadsheet? Well, if there is a null string in a place where there should have been characters, the spreadsheet simply puts two successive commas.
These commas are called delimiters. That is why comma delimited, or comma separated values, is the name of the format. So the very first line I got is one in which there is no name. So here is the data. What I have given is sample data. In total there were more than 800 students, so more than 800 lines were created. Look at the peculiar way in which the data exists. I have already told you that this fellow does not have a name. This person is absent. This person is absent. Look at something interesting. Look at this person, Shailendra Sarabh: the batch of Shailendra Sarabh is not known. Now this is very funny, but this is what real life is when you have 800 students. For example, right now 500, 600, 800 participants are there. Technically all of them have entered their data, but look at some mistakes that we made in designing the form in which we captured the information. The name was captured correctly, and details like where one is serving were captured, but the name of the institution and the city in which the participant is serving as a teacher were not captured explicitly as separate fields. As a result we know the name of the participant, and we know which remote center the participant is attending, but we do not know, in an exactly decipherable way, the institution where the participant is serving. Suppose my friend and I are attending this conference. I have written IIT Bombay and my friend has written Indian Institute of Technology Bombay. As far as my analysis of the data is concerned, these are two different strings. If there is no postal code, I may not even be able to detect that both of us are from the same city: I might write, old style, Bombay, and he might write Mumbai.
Such are the things which actually happen whenever you collect data, and this is universal. Whether you are filling in a form for a train journey reservation or a form for the municipality or whatever, the data will have problems, and the challenge in writing our programs is to handle the data in this form. So what we understand from this description is that each line represents one record for a student, and it has several fields separated by commas. That much is obvious. How many fields are there? There is a serial number, a roll number, a name, a batch and the marks. So it is not very complicated, but complications may arise because of the incorrect or inaccurate data that has been given to us. Since this will always be the case, we have to write programs that handle this data properly. So first we begin by identifying the meaning of the values in the different fields from our knowledge of the implicit metadata. You are teachers, you have teaching assistants, and they have given you this spreadsheet. Because you are the teacher, you know exactly what the data is: you will be able to figure out that the first field is a serial number, the next field is a roll number, the third field is the name, the fourth field is the batch, the fifth field is the marks, and so on. We also note the possible missing field values: the name is missing here, the batch code is missing there. We also note that in the roll number here we see a capital D. By the way, this is a peculiarity of IIT, which was introduced a few years ago when we started a dual degree program. As some of you will know, the digits of roll numbers are also used to symbolize something: undergraduate, postgraduate, different branches, etc. So somebody decided that in the eight digit roll number, a D will indicate a student who is a dual degree student. Consequently roll numbers, which were traditionally numerical, suddenly became character strings.
This causes a serious problem in many spreadsheets. As long as there is a D, the roll number will be treated as an alphanumeric string. But if the number is 0 9 0 0 7 0 1 0, it will be treated as a numeric value, and the first 0 may disappear when you take a printout or look at it on the screen. So you can see that these are uncanny problems, and we need to handle them properly. Anyway, in order to process the data, I have created a file; it is called, say, input data dot txt or whatever. I now need to read each line as a string, separate out the field values, and store them in appropriate variables for the serial number and marks, and in character arrays for the name, batch code and roll number. Please note that my conventional scanf, which is an input statement, will break down completely. scanf will have no clue how to handle two successive commas; it probably will not be able to read comma separated values at all. So the formatted input capability implemented by scanf or cin will not be suitable whenever I am handling real life data in large quantities, or in terms of records like this. That is why I will have to do the following: read a complete string, that is, the whole line as a string, separate out its various portions by finding out where the commas are, and then make meaning out of the portions in between. For example, if a roll number contains a non-digit, how do I handle it? So we would generally like to store basic information such as this, for all students, in a file. So what will be the objective of a C program that we write?
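The line-splitting step just described can be sketched as follows. This is a sketch under assumptions rather than the lab solution: the function name split_fields and the constant MAX_FIELDS are hypothetical, and the record in the usage note is made up for illustration. The sketch walks the line with strchr instead of using strtok, because strtok treats two successive commas as a single separator and would silently skip an empty field such as a missing name.

```c
#include <string.h>

#define MAX_FIELDS 8   /* assumed upper bound; the marks records have 5 fields */

/* Hypothetical helper: splits `line` in place at every comma.
 * Each comma is overwritten with '\0' and fields[i] points at the start of
 * field i, so an empty field between two commas becomes an empty string.
 * Text beyond max_fields fields is ignored. Returns the number of fields. */
int split_fields(char *line, char *fields[], int max_fields)
{
    int n = 0;
    char *p = line;

    line[strcspn(line, "\r\n")] = '\0';   /* drop the line ending, if any */

    while (n < max_fields) {
        fields[n++] = p;        /* current field starts here */
        p = strchr(p, ',');     /* find the comma that ends it */
        if (p == NULL)
            break;              /* no more commas: that was the last field */
        *p++ = '\0';            /* terminate the field, move past the comma */
    }
    return n;
}
```

For a made-up record such as 12,09D07010,,7A,44.5 the function returns 5, and the third field comes back as an empty string, which the caller can then report as a missing name. The roll number stays a string, so a D or a leading 0 in it causes no trouble.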
We read this data line by line, separate out the fields, understand and interpret the field values, and probably store them in a nicely structured form, one record per student, in a database file where the data has been massaged properly: numerical values are numbers, string values are strings, spaces on either side are removed, etcetera, and you have perfect records. This is, by the way, a lab problem that has been given today: I have given this entire text file and said, do precisely this, and store the extracted information in a database file. What is the database file? It is nothing special; it is a file in which each record is a structured record of some sort. There are no commas in it; they are not required. Commas are required for visible separation for us, and therefore they have to be tackled by any program which reads this as input.

Now we consider some simple processing requirements. First, as a teacher, what was my processing requirement? I had these 800 students and their mid semester marks, and I want to find the average marks: what is the level of performance of my entire class? Similarly, since the students are divided into different batches, I would like to know the batch wise average.
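One possible shape for the bookkeeping behind this requirement, in C, is sketched below. It assumes that the batch code and the marks have already been extracted from each record. All names (struct summary, add_marks, MAX_BATCHES and so on) are hypothetical, and treating marks of exactly 0 as present is an assumption. The batch table is a small array that is searched linearly and grows whenever an unseen code arrives, so a mistyped or missing code simply gets an entry of its own.

```c
#include <stdio.h>
#include <string.h>

#define MAX_BATCHES 64   /* assumed capacity: 40 real batches plus some mistakes */
#define CODE_LEN 8       /* assumed maximum length of a batch code, incl. '\0' */

struct batch_total {
    char code[CODE_LEN];
    double total;        /* sum of marks of present students in this batch */
    int count;           /* number of present students in this batch */
};

struct summary {
    struct batch_total batch[MAX_BATCHES];
    int nbatches;        /* distinct batch codes seen so far */
    double total;        /* sum of marks of all present students */
    int present;
    int absent;
};

/* Looks up `code` in the table, adding a new zeroed entry if it is not there.
 * Returns NULL only when the table is full. */
static struct batch_total *find_or_add(struct summary *s, const char *code)
{
    for (int i = 0; i < s->nbatches; i++)
        if (strcmp(s->batch[i].code, code) == 0)
            return &s->batch[i];
    if (s->nbatches == MAX_BATCHES)
        return NULL;
    struct batch_total *b = &s->batch[s->nbatches++];
    snprintf(b->code, sizeof b->code, "%s", code);   /* truncates over-long codes */
    b->total = 0.0;
    b->count = 0;
    return b;
}

/* Accounts for one student: negative marks mean absent, anything else is
 * added to the class total and to the total of the student's batch. */
void add_marks(struct summary *s, const char *code, double marks)
{
    if (marks < 0) {
        s->absent++;
        return;
    }
    s->present++;
    s->total += marks;
    struct batch_total *b = find_or_add(s, code);
    if (b != NULL) {
        b->total += marks;
        b->count++;
    }
}
```

Starting from a zero-initialized struct summary, calling add_marks once per record and then dividing total by present gives the class average, and dividing each entry's total by its count gives the batch wise averages.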
So the batch wise average marks and the class average are all that I need. My batch codes are there and the marks are there, so what will I have to do? I will have to read every line. First of all, I will have to maintain a sum total for the total marks. I can do that very easily: sum total is equal to 0 initially, and every time I read marks, I add them, and add them, and add them. So I can get the final average for the entire class. How will I get the average for a batch? Observe that the batch is the fourth field; the fifth field is the marks. Now if the fourth field is 0D, then I have to maintain a variable which separately sums marks only when the batch is 0D. If the batch is 0A, the program will have to maintain a separate total, and so on. It is not a very easy thing to do. There are 40 batches. Observe that somebody might make a mistake in typing the batch name for a student. Suppose somebody says ZZ. Now ZZ is not even a valid batch, so in my C program I may not have accounted for it. I may just have said 0A, 0B, 0C, 1A, 1B, 1C, 1D, etc., up to 9A, 9B, 9C, 9D, which are my standard batches. But what if a data value is wrong? I have to account for it somehow; I have to give at least an error message. These are the complications. So effectively, if I want to find the batch wise average marks and the class average, I will have to write a program which first of all reads a line, then separates out the values, then identifies the batch code and identifies the marks. If the marks are negative, it says the fellow is absent. If the marks are positive, first it will have to add them to a sum total, which will give me the class average. Second, for every mark I have to add it to a separate total, which I must maintain for each of the batches. If I have 40 batches, I must maintain 40 separate totals. But if two people have made mistakes in giving me the data, one fellow has written batch ZZ and another fellow has written batch XX, then there are two extra batches: not 40 but 42. And how many mistakes will be
there, I do not even know in advance. Suppose I decide: look, for every batch give me the average for that batch as per the correct batch code, but if there are students whose batches are not properly written, not one of these, then please give me a separate report for them. I hope you will agree that it is not a very trivial job to write this program.

At this juncture I will go over to a completely different programming paradigm and show how this problem can be solved most easily. By the way, I had used this example in my class last time, and I was encouraged by the reception: people appreciated the simplicity of alternative forms of programming, but they also understood more clearly what exactly was being done, and later they wrote C programs to do exactly the same job. So here is a different solution. Before considering C programs to solve this problem, I am going to use a programming language called awk. Some of you will be familiar with awk, some of you may not be. awk is actually a scripting language which was designed in the 70s at Bell Labs. Alfred Aho, Peter Weinberger and Brian Kernighan are the awk authors. Most of you will recall Brian Kernighan as the co-author of the first book on C programming, Kernighan and Ritchie. All of these are stalwarts of computer science, by the way. This language makes heavy use of the string data type and of associative arrays. We have seen the idea behind an associative array. Suppose you want to compute the histogram of an image. Since the histogram has a maximum of 256 values, you declare an array of 256 elements, and whenever you get a pixel value you directly use that value to access the array element itself. In short, a value becomes an index into the array. The histogram is a good example in this context: you do not have to maintain 256 separate count variables to count the number of pixels with each value. You maintain an array, and whatever value comes, you use that as an index to go to the
right element in the array and increment the count there. At the end, when you have analyzed millions of pixels, each element will tell you how many pixels had value 0, how many had value 1, etc. Why do I give this example? What we want to do here is the same thing, except that in the data, instead of getting a pixel value, I am going to get a batch code. I want an array where the indices of the array elements are not 0, 1, 2, 3, 4 but 0A, 1D, 3C; that is, I want the batch code itself to be used as an index. This is not naturally possible in C or in many other programming languages, but awk permits it. So those of you who are not familiar with awk might find this discussion interesting.

awk is a language which processes files of text. A file is treated as a sequence of records, and by default each line is a record. Further, awk automatically breaks up each line into a sequence of fields, so that we can think of the first word in a line as the first field, the second word as the second field, etc. Then the awk program can be written. Each instruction in an awk program consists of a pattern and an action. Every time awk reads a record, it applies the pattern to that record. If the pattern matches, the action is taken; if the pattern does not match, the action is not taken. What is a pattern and what is an action? Well, let us first look at some of these details. As I said, awk handles records in this fashion. This is my record, let us say. I can indicate to awk, by the way, that the words in my line are not separated by blanks but by commas. I can specify whichever symbol I like. The favorite symbol of awk users, incidentally, is the pipe symbol. Why? Because a pipe symbol ordinarily does not occur in any real life data. Our names are not D pipe B pipe Phatak, but they may be D dot B dot Phatak or D space B space Phatak. So pipe is a favorite symbol, but comma is as good. So you have this kind of record. Now what awk does is read this line and automatically allocate values to
internal variables, which are actually the fields. These variables are called $1, $2, $3, as many as there are fields in the record. So if this record is read, $1 will be associated with 13, $2 with this roll number, and $3 with Guru Raj Shaileshwar. Notice that the blanks do not count as separators: as far as $3 is concerned, it holds the complete string from this point up to this point, because awk treats the comma as a separator and nothing else. $4 is 7A. This is crucial: $4 has a value which is a string, a 2 character string, 7A. $5 is a numerical value, 44.5. So awk separates out the various fields as it reads records and assigns values to these variables.

What do we want to do now? First of all, we want to check whether the marks are positive or negative. If the marks are negative, we want to ignore them. So we want to write a pattern such that, if it matches, some action is taken. The pattern is $5 < 0, and the action is to increment a count variable for absent students. So if there are negative marks, increment a count, so that I can count the absent students. If the marks are positive, then I want to do a variety of things: increment the total marks, and also increment the batch counts. So for every batch, 7A, 2B, whatever record I get, the corresponding count should be incremented. In short, I need an associative array. Interestingly, awk automatically provides associative arrays. Let us look at the awk program that does this. This is the awk script. The first statement says $5 < 0 and then, in braces, absent count ++. Notice that absent count is obviously a variable. It is not initialized. In awk, the moment you use a variable, it starts with an initial value; it gets auto-initialized. So when I say absent count ++, awk will, the first time it executes this, create the variable, allocate a location and initialize it to 0, and on the first occurrence of a negative value it will add 1 to it. Please note that this program is going to
repeatedly be applied for every record that is read, so the iteration is also set up automatically. Iteration is implied in awk, because its objective in life is to read record by record by record till the file ends. So for every record it will check: if the marks are negative, it will increment absent count. If the marks are positive, then what do I do? First of all, I have to count the students, because I have to find the average. So count ++; this count will represent the total number of students present. Then tot marks += $5, exactly the C syntax. What is $5? It is the value of the fifth field, which is the marks. Since the marks are positive, tot marks will accumulate the total marks of all the students. So it is very simple: no initialization required, just tot marks += $5. Since this statement is guaranteed to be executed for every record where $5 is greater than 0, at the end I know that tot marks will contain the total marks and count will contain the total count, and division will give me the class average. But wait, I do not want only the class average; I also want the batch averages. So I want to do exactly this kind of counting of occurrences and totaling of marks individually for each batch. Look how simply awk does it: batch tot[$4] += $5 and batch count[$4]++. Notice that these are very similar to count ++ and tot marks += $5, except that these are arrays. Notice that I am not required to declare an array: the moment I use an array element, the array comes into existence. How many elements does it have? It has a dynamically increasing number of elements. For example, when it reads the first record and $4 is, let us say, 0D, then it will create an element which is indexed by 0D, initialize it to 0 and add the marks to it. Similarly it will initialize the 0D element of batch count to 0 and add one to it. If the next record comes with 7A, then it will create another element, indexed by 7A. So the indices of the elements of these arrays, created internally by awk, are actually the batch codes,
which is $4. How many elements will get created at the end, when all my data has been read? Well, as many as there are distinct batch codes in your data. In short, this is the only program that needs to be written, and this is the entire program: if the marks are negative, absent count is incremented; if the marks are positive, the count, the batch count, the total marks and the batch total marks are updated. That is it.

Of course, there is something that I need to do when the records end. These are the patterns which are matched for every record: a record is there, and the pattern is matched against it. But when the end of file comes, none of these patterns is matched. awk gives a special pattern called END. So END is a pattern, and END followed by an opening brace means: when the end of file has been reached, forget everything else, do not iterate any more, come and do whatever is written here. Ordinarily I may have a simple print statement here. What am I doing here? For i in batch, print i, batch count, batch total. What does i in batch mean? It means: what are the index values in batch. So I print i, the batch count for i and the batch total for i. Please remember that i is not a numerical value; i represents the index values which were created by awk, so these are batch codes. So it will actually print 0A, 1B, 2C, 7D, and for each one it will print the batch count, and the marks total divided by the batch count will directly print the average. Similarly I say the total students are this number, the number absent is this, and the class average is tot marks by count. That is it. This is the complete program, and it will run perfectly correctly.

These are the results. Notice the command line. Incidentally, nawk stands for new awk; over the decades awk has also transformed itself. This is the command that you give: -F with a comma in double quotes tells awk that the field separator is a comma. Ordinarily space or tab would be the field separator by default, but here it is a comma. Then -f is followed by analyze midsem dot awk. What is this? This is the
file name which contains those 10 lines of program, and the command says: redirect input from midsem marks dot txt. Obviously midsem marks dot txt is the text file which contains all that data, with lots of problems. All that the awk program does is read every record, and it cleanly produces this output. These lines are decipherable: these are batches 0A and 0B, and these are the averages, 23.736 and 30.76; 1A has 25.36. We suddenly notice that after 1A, 0D has come. This is not in sorted order. Actually, those of you who know the Unix pipe will see that I can pipe the whole thing to the sort command in Unix and get it sorted. But the reason why awk produces this output is that the array elements which are created are not 1, 2, 3, 4; they are created whenever an actual batch code comes in. So this roughly indicates the order in which the batch codes came in the data. In any case, I will have all the batches. All of this is calculated perfectly well, and I get all the batch averages here. I have 40 batches, so 40 lines, and then at the end: total students 819, number of absentees 10, class average 25.929.

Notice that the output produced by this program actually has not 40 but 41 batches. Look carefully at the first batch; I am going back to the previous slide. Look at the first line: it simply says nothing here, then 1, 22.5. You know what it is. You remember there was one record in which the student did not have a batch code. Since the student did not have a batch code, awk took blank, the null string, as a batch code, and it is actually printing that batch code and saying that there is a student with that batch code. So the number of students in that batch is 1. If people had made mistakes while entering data and had given batch codes such as XX or ZZ, then for each of those also awk would have produced a line, and if there were two students with batch XX, it would have said that XX has two students and it would have calculated the average. Please note that it is not very easy to do these things in a C program where I want to actually
handle the entire data and make sense out of it. This is the power of awk. As I told you, I discussed this awk script and the students were thrilled. But then they asked: what was the objective of all this? awk is so simple and works so well, so why use complicated programming languages? So I answered that awk is superb for such problems but has very limited capability to handle data of all kinds. For example, if I want to do matrix multiplication or matrix inversion, there is no easy way to do it in awk. Also, it is an interpreted language, so the program scripts are not separately compiled into machine instructions, and therefore operationally it is less efficient. But that is not the more important part. The more important part is that awk is superb for this kind of activity but not very useful for general computation, and hence programs are still written in conventional programming languages. awk is not good, for example, for reading data from an indexed file or a database, something which a C program can do. And therefore we get back to our C program.

There is a remote center which wishes to raise a question. Hello, good morning. Is there any comment that someone would like to make from Amrita Bangalore? Over to you, Bangalore. I am going over to IMCC Pune now. IMCC Pune, you have raised a flag; I am handing over control to you. Over to you.

Good morning sir, myself Amar Modiraj from IMCC Pune. I want to know one thing: how can I print a superscript on the output screen? That is, instead of writing square of 2 equals 4 in words, I just want to write it with the superscript. Over to you sir.

Thank you very much. The query raised was this: if I want to print, meaning not necessarily on printer paper but on a terminal, I want to show a superscript. So I want to show, for example, 10 raised to 4, or if I want to show a logarithm, I may want to show log of something to the base 2. I want to print this string, and how do I print it? Very good question, and sadly there is no direct answer; there
is no easy way. The formatting specification that is available in C is completely inadequate to do this. So let us first understand how it is actually done on the terminal. Whenever you see 4 raised to 5 on a terminal, for example in my slide or in Word or something, how do you get to see a superscript or subscript? It is actually done by the processor sending an escape sequence to the terminal. You will have heard of escape sequences. Every terminal has processing power, and the way the display is controlled by any program is that whenever it wants to show something at a particular line number and column number, it gives a special command. These commands are called escape sequences. So these are special sequences in which the escape character followed by some code means go to a particular line number and column number, or the escape character followed by a certain code means that the subsequent characters should be shown bold, or highlighted, or underlined, or in color. All of these combinations can be controlled by escape sequences, and these have been standard with all terminals. Ordinarily, when I use software such as a word processor or a spreadsheet, my specification to the word processor saying show this as a superscript is translated into the appropriate escape sequences, which are given to the terminal. If I want to do the same thing through my C program, I will have to know the escape sequences which will show a particular character not on the same line but as a superscript. This can be done, but it requires the programmer to be aware of the terminal capabilities. In operating systems like Unix there has traditionally been a termcap file, or termcap database, which for any terminal tells you what the escape sequences are. But the right answer is that, unfortunately, in C programs
we cannot directly and easily do this. But can it be done? The answer is yes, it can be done. Thank you very much. So with that, I think we will break for tea now. Thank you, over and out.