Today I want to start the discussion of another important topic, which will lead us to files: files of various kinds in which we maintain data, operations on files such as reading, writing and directly accessing a particular record in a file, the notion of a record, and so on. To begin with, we will look at how data is ordinarily structured inside a file. We have had some examples. For the image that we looked at, we said even in the mid-sem question that you were required to design a file format in which the data would be created by you as a text file. Files are not only text files; there are multiple types of files. Another popular type is the binary file, where the data is stored in its internal format and is therefore much more compact and occupies less space. We will discuss these things in due course.
Today we will continue our discussion with the help of an example which relates to the performance of CS101 students in the mid-semester exam. Typically, people would use a spreadsheet for this. How many of you are familiar with a spreadsheet? The Microsoft spreadsheet, most of you are. Those who are not might want to look at one. A spreadsheet is nothing but a two-dimensional array of cells, in rows and columns, and in each cell you can keep some value. Spreadsheets are typically used whenever you want to keep a large amount of data about the same type of entity, where there are many such entities, such as students. So you put one student's data in one row of the spreadsheet, another student's data in the next row, and so on. And the data can have as many fields as you want. For example, for the mid-semester exam, I would put a serial number in one column, a roll number, which is actually a string, in another column, and the name in the third column. I might put the lab batch in the fourth column.
Then, if I want a detailed analysis, I might put the marks scored in question 1A in the fifth column, question 1B in the sixth column, and so on. Finally I would put the total in one more column. A spreadsheet not only accepts data of this kind and maintains it in files on the disk, it also provides a whole lot of computational facilities. For example, if you have a total column, you can write a generic formula in that column which says: add up all columns from column 5 to column 18 and put the total in this column. Once you enter that formula, it is applied to all rows automatically. That way you can cross-check whether the totaling done by your teaching assistants is correct, because you get an independent total there. A spreadsheet can also do a whole lot of statistical computations, and it can sort data. It has therefore become an extremely important tool in the hands of all those who need to analyze data: financial analysts in banks or financial institutions, statisticians, or people who just want to maintain data, such as the academic office or teachers like me.
Now a spreadsheet is itself an extremely sophisticated programming system. People have written the application programs which constitute the spreadsheet. Spreadsheets don't fall from the sky. Most spreadsheets, including the Microsoft spreadsheet, would have at the back end programs written in C, C++, what Microsoft calls C#, Visual Basic, or whatever programming language the developers chose. And there would be hundreds of thousands of lines of code written to make a perfectly working spreadsheet for you. Our objective is not to see how spreadsheets are made. Our objective is to look at data that has to be analyzed in this fashion.
In what way could I maintain that data in a text file? Exactly the same kind of data that I described. And then how could I analyze that data using my C++ programs? That is the objective. Here is a sample mid-sem file. You can see that it is not a spreadsheet file. But whenever you enter data in a spreadsheet, every modern spreadsheet permits you to export that data in a variety of formats. One of the commonly used formats is known as the comma-separated value format, or CSV format. So while the spreadsheet data is contained in different cells, one row per student as I told you, when I export the data in CSV format I get a text file which looks something like this. You can see a whole lot of things separated by commas. These are called comma-separated values or comma-separated fields.
This one is the first field, which says 1. You'll notice 1, 2, 3, 4, 5, etc. These are serial numbers. It is not necessary for me to export serial numbers into a data file, but for the sake of convenience I have done that. The next field is the roll number. The next field is the name. Notice that our notion of reading data from input so far consists of using the cin statement. The cin statement is able to read numerical values, or strings which do not contain a space. We therefore do not know how to read a name of this kind: Joshi Aditya B, or Mandane Amal B, or Abhishek Kumar. Whenever there is a blank between any two parts of the character string, the cin statement will stop reading there. We shall handle that problem properly in subsequent lectures. This is just to tell you that it is a limitation of the input statements as we have studied them so far, in a very limited fashion.
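The limitation just mentioned can be seen in a small experiment. The following is only a minimal sketch, not the method the later lectures will necessarily use: it contrasts the `>>` extraction used with cin against `std::getline` with a comma as delimiter. The function names `readWithExtraction` and `readUpToComma` are made up for this illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads one token with operator>>, the mechanism behind "cin >> name".
// Extraction stops at the first blank, so "Abhishek Kumar" yields "Abhishek".
std::string readWithExtraction(std::istream& in) {
    std::string token;
    in >> token;
    return token;
}

// Reads everything up to (but not including) the next comma.
// Blanks inside the field are kept, so a full name survives intact.
std::string readUpToComma(std::istream& in) {
    std::string field;
    std::getline(in, field, ',');
    return field;
}
```

Both functions accept any input stream, so they can be tried on `std::cin` or on a `std::istringstream` built from a sample line such as "Abhishek Kumar,11".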
However, the point being made here is that these are different fields and each field is separated from the next by a comma. That is why they are called comma-separated fields. I have now got my data into such a file. Let's say I call this file midsemmarks.txt, some text file. You are all familiar with how I can write a C++ program and use input redirection to say ./a.out < midsemmarks.txt. Then this whole file, line by line, can be read by my program. How exactly do I write a C++ program which will analyze this data and get me a variety of details?
What kind of details would a teacher be interested in? A teacher would of course be interested in the individual marks obtained: who are the best performers, who are the weak performers. Additionally, a teacher would like to know the class average. In fact, the class average is a very good indicator, because individual students can then figure out whether they are below or above it. We have lab batches, so we would also like to know how each batch has performed. So I would like to have batch averages. You will notice that the fourth field contains the batch number. So 11, 11, this 14 here is a batch number. There are batch numbers up to, whatever, 75, and all that data has been collected.
Data management, by the way, is a completely non-trivial exercise. It involves good programming, but just as much actual data management and meticulousness: entering values, cross-checking. For example, when a typist enters a value in a spreadsheet, he could easily make a mistake. One of the reasons why we had a notion like a check digit was to catch such mistakes. We don't have check digits with roll numbers. We don't have check digits with names. We certainly don't have check digits with marks. People could make mistakes while entering the data.
The objective of writing all programs is to analyze some data. If the data itself is not properly handled, any amount of good programming will be useless. So you have to spend time and attention to ensure that the data that is captured, either by you or by somebody else who is going to use your programs, is captured as meticulously and as properly as possible. The way we handle it, for example, in an examination with such a large number of students, is that one person enters the data, another person enters the data for some part, and then these two people cross-check each other. Additionally, because marks are a sensitive issue, we put the entire list on Moodle and request all students to check whether their marks have been entered correctly. If there is a discrepancy, a student can point it out and we can correct it. So there is non-trivial work, other than programming, that needs to be done in real life.
However, we come back to this problem. Notice this record. It has comma, comma, comma, comma. Obviously there are no values. The reason there are no values is that this student was absent. To indicate that the student was absent, a special value has been written here, minus 5, since we know that nobody is awarded minus 5 marks in the exam. So any record which ends in negative marks belongs to a person who was absent. That is a logical decision that was taken before entering the data. Actually my staff entered AB, for absent. But for the purpose of illustration here, I do not want to get into string comparisons, because we do not even know yet how to extract those small strings from the end or from in between. So I have replaced that by minus 5. There would be 8 or 10 such students who have minus 5, and there would of course be other students who have marks in the various subsections.
You will notice that what I have here after the batch number are the marks in question 1A, 1B, 2A, 2B, 2C, and so on. How will I write a C++ program to analyze this? I am going to discuss just the outline of that problem later in the class. But first I want to show you another extremely powerful programming mechanism called AWK. AWK is the name of a programming language, or a scripting language. The letters A, W and K are the initials of three people. I will give the details of what the AWK programming language is in the slides which I will make available. As I said, please read those slides later. But I want to show you how simple an AWK script or program can be.
It reads very well. Its internal constructs are very similar to C++ constructs, so you will not have much problem in appreciating it. It uses the powerful concept of associative arrays, which it builds as it goes. Unlike conventional programming languages such as C, C++ or Java, which are compiled, AWK is not compiled: the lines of the program are interpreted. No declaration of variables or arrays is required. No initialization is required. AWK automatically creates variables or arrays on first use, initializes them appropriately and uses them. That makes the language extremely powerful and simple to use, particularly for this kind of data processing.
What I want to show you is a simple AWK program which analyzes this data and produces two things: the batch-wise average marks and the overall class average. Just so that we understand what is involved in the computations, let us look at the logic first. Whichever programming language I use, the logic of my processing would be as follows. My program, let's say a C++ program, would be written to read one line at a time.
When it reads one line, I would like to capture the marks scored by that student and accumulate those marks in some kind of a total. Suppose there are 568 students. At the end, I would like to divide this total by 568. That will give me the class average. Of course, I must discard those students who are shown to have negative marks, because they are actually absent. So while I read the data, I should also count the number of students who are present and the number who are absent, and take only the students who are present into account while calculating the class average.
This part is simple, but I also want to find the batch averages. The batches are 11, 12, 13, 14, 15, 16, and as you know they go up to 71, 72, 73, 74, 75. How exactly could I do the batch average computations? Any idea? Let's see: to compute an average, what do I need? I need the total number of students in that group and the total marks scored by that group. I divide the total marks by the number of students and I get the average. This is what happens for the whole class, so the same logic must hold for individual batches as well.
Now there are multiple batches. So what I could do, for example, is have an array which holds the batch counts. I initialize it to 0, and every time I get a fellow from batch 11, I would like to use an appropriate index to increment that value from 0 to 1. I get another 11, I would like to change it to 2. This gives me the counts. Let's call this array batch count. I will also have another array, let us call it batch tot, which accumulates marks. The element corresponding to batch 11 also starts at 0. When I get a student from batch 11, I increment the count in the first array and I add the marks that the student has scored to the corresponding element of the second array.
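The simple part of this logic, the class average with absentees excluded, might look as follows in C++. This is a minimal sketch under stated assumptions: the totals have already been extracted into a vector, a negative total marks an absent student as in the lecture's data, and the names `ClassSummary` and `summarize` are invented for the illustration.

```cpp
#include <vector>

// Running figures for the whole class.
struct ClassSummary {
    int present = 0;          // students with a non-negative total
    int absent = 0;           // students marked with a negative sentinel
    double totalMarks = 0.0;  // marks accumulated over present students only

    // Average over students who wrote the exam; 0 if nobody did.
    double average() const {
        return present > 0 ? totalMarks / present : 0.0;
    }
};

// Walks over the per-student totals once, skipping absentees.
ClassSummary summarize(const std::vector<double>& totals) {
    ClassSummary s;
    for (double marks : totals) {
        if (marks < 0) {
            ++s.absent;            // sentinel such as -5: do not add to the total
        } else {
            ++s.present;
            s.totalMarks += marks;
        }
    }
    return s;
}
```

For totals of 18, -5, 25 and 20 (illustrative values), the sketch counts three present and one absent student and divides 63 by 3, not by 4.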
So these 18 marks will be added to two different totals. One is the class total and the other is the batch total for the corresponding batch. The point is that I have to figure out which is the corresponding batch, because this 18 should be added to the class total and also to the score for batch 11 here. If some student comes with batch 14, then whatever is the index for batch 14 in that array, I should increment that by, no, sorry, this fellow is absent, so we have to leave him out. But let's say there was a student who scored 25 marks. The 25 marks should be added here and the count should be incremented in the corresponding element.
The problem is that when I start looking at this data, I have no clear-cut mapping between batch number and index. Batches are 11, 12, 13, 14, 16. There is no batch number 9. There is no batch number 18 either. There are only batches 11 to 17 on Monday, and the batches for Tuesday start from 21. So while my last batch number may be 75, I don't have 75 batches, and I don't even know which batches are missing.
There is an added complication. I might initially make an array with the batch numbers filled in, because I know which batches I have made. And then some typist makes a mistake in typing out the batch number. Instead of 11, he types 01. There is no batch number like that. The student has appeared for the exam, so I need to account for that student. But the batch total for one batch will be wrong because the batch number is wrongly written. So when I analyze the data with a program, my program must point out that there was a person whose batch number did not match any of the known batches and that this data has therefore not been taken into account. In general then, the C++ program could become one with fairly complex logic, but you can write one, and you will have to write programs to do this kind of analysis as we go forward. Coming back to the AWK script.
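Before going back to AWK, the C++ bookkeeping described above can be sketched roughly as follows. It is one possible arrangement, not the only one: the known batch codes are supplied up front, a linear search maps a code to an array index, and a record with an unrecognised code (such as the mistyped 01) is counted separately so that it can be reported. The type `BatchTable` and its members are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct BatchTable {
    std::vector<std::string> codes;  // known batch codes, e.g. "11", "12", ...
    std::vector<int> counts;         // students present, per batch
    std::vector<double> totals;      // marks accumulated, per batch
    int unknown = 0;                 // records whose batch code was not recognised

    explicit BatchTable(const std::vector<std::string>& known)
        : codes(known), counts(known.size(), 0), totals(known.size(), 0.0) {}

    // Maps a batch code to its array index; returns -1 for an unknown code.
    int indexOf(const std::string& code) const {
        for (std::size_t i = 0; i < codes.size(); ++i) {
            if (codes[i] == code) return static_cast<int>(i);
        }
        return -1;
    }

    // Adds one present student's marks to the matching batch.
    // Returns false (and counts the record as unknown) if the code is not listed.
    bool add(const std::string& code, double marks) {
        int i = indexOf(code);
        if (i < 0) {
            ++unknown;
            return false;
        }
        ++counts[i];
        totals[i] += marks;
        return true;
    }
};
```

The caller is expected to add the same marks to the class total regardless of what `add` returns, since the student did write the exam; the `unknown` counter only says how many records could not be attributed to a batch.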
The AWK programming language does the separation automatically. First of all, it reads any text file line by line and treats each line as a record. So when I write an AWK script and run it, the script will start reading lines one after another. For every line that it reads, it separates out the fields, and the fields so separated are associated with predefined variables which are available to every programmer. These predefined variables are $1, $2, $3, etc. Those are the predefined names. Consequently, whenever you are processing a particular record in the AWK script, $1 refers to its first field. If you say $2, it refers to the second field of the first record if you are on the first record. If you are reading the third record, $2 will be the second field of the third record. This assignment is done automatically as the input file is read.
How does AWK know which is $1, which is $2, which is $3? The different fields are assigned to $1, $2, $3 and so on, and therefore AWK must understand how the fields are separated. Consequently AWK has the notion of a field separator, called FS. Ordinarily the field separator is a blank, just as with your cin statement: when you give data, a blank is treated as the separation between two fields. But I can assign a value of my choice to the field separator in advance. Some popular field separators are the vertical bar or pipe symbol, the comma, the tab, etc. Of these, conventional programs like spreadsheets are capable of giving you exported data as comma-separated fields or tab-separated fields. I have used comma-separated fields here. So when the AWK script reads these lines, every time it reads a line it will automatically separate out the fields, because it now knows beyond any doubt that a comma means the end of one field and the beginning of a new one.
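What AWK does automatically with FS has to be written by hand in C++. The following is a minimal sketch, assuming a single-character separator and no quoted fields (a real CSV export may quote fields that contain commas, which this does not handle). The names `splitFields` and `field` are invented; `field` uses 1-based numbering only to mirror $1, $2, $3.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits one record into fields at every occurrence of sep.
// Empty fields (two separators in a row) are kept, as in ",,,".
std::vector<std::string> splitFields(const std::string& record, char sep = ',') {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = record.find(sep, start);
        if (pos == std::string::npos) {
            fields.push_back(record.substr(start));  // last field, possibly empty
            break;
        }
        fields.push_back(record.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}

// 1-based access in the spirit of $1, $2, ...; returns "" when n is out of range.
std::string field(const std::vector<std::string>& fields, std::size_t n) {
    return (n >= 1 && n <= fields.size()) ? fields[n - 1] : std::string();
}
```

With a made-up record such as "3,R0003,Abhishek Kumar,14,,,-5", `field(f, 3)` gives the whole name including the blank, and the empty marks fields survive as empty strings. Note that AWK's default separator behaves differently: a run of blanks or tabs counts as one separator, which this sketch does not imitate.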
So notice that handling this string, for example Abhishek Kumar, is no problem even though there is a blank in between. Whatever comes between this comma and this comma will be taken as the string value of the variable $3, because this is $1, this is $2, this is $3. How does AWK know whether a value is numeric or a character string? AWK provides a beautiful interpretation. It says you may use the value as either a character string or a number depending upon the context. If you add the value of a field to something, AWK will treat it as a numeric value. If you concatenate that value with another string, AWK will use it as a string. Obviously it is for me to write my logic correctly, such that values which are expected to be numeric are treated as numeric and values which are expected to be strings are treated as strings. But this is an extremely powerful generalization which AWK permits.
Now since the data is prepared by me, I know in advance what is a string and what is numeric. I know, for example, that this name is a string. I know this roll number is a string although it looks like a number, because there could be a character D in it and so on. I would like AWK to treat this as a string. I know for sure that these are marks, so I would like AWK to treat each one of these as a number. And I would like it to take decisions based on the numerical value. For example, AWK will have to decide that if the number in the last field is negative, then that record has to be discarded.
Finally, when a data line is read, I would like AWK to update the arrays which I have described: the array of counts for the different batches and the array of marks for the different batches. And I would like it to automatically update that particular element of each of these arrays which corresponds to the batch number in the record, 11, 14, whatever. Since I am not going to prescribe in advance what the various batch numbers are, AWK has a difficult task. Imagine AWK reads the first record.
The first record has batch number 11. At that time there is no array. But I suddenly say, update the element for 11 in this array. What AWK does is create the array on the first reference. It creates an element, actually a cell, which can be indexed by 11. That is the associative array. So 11 is the batch code, and it need not be a number. It could be a string. AWK will increment the count here and add the marks here. Next it reads another record, and it is 11 again, so it updates the same element. In the next record it reads 14. Assume that the marks are not negative. Then it will create another element, for 14, update the count here and add the marks here. Whenever it gets 75, it will update the element for 75. What this means is that as many elements are automatically created in my array by AWK as there are unique batch numbers actually present in my data.
You see the power of this facility. We don't have such a facility in the plain C++ arrays we have studied so far. We have to decide on the size. We have to decide what each element will contain. We have to map the index, which is 0, 1, 2, 3, 4, 5 up to the size, to the corresponding batch number, and so on. It is not very straightforward. You get the point?
So let us look at the AWK script now. This, incidentally, is the complete program to analyze the data. Let us look at what it says. As I told you, each statement of an AWK program is of this type: a pattern followed by an action. That is an AWK statement. An AWK program or AWK script is nothing but a series of such pattern-action statements. There are two special patterns that AWK provides. One is called the BEGIN pattern and the other is called the END pattern. The BEGIN pattern matches before any reading of records has started. That means all the actions specified after BEGIN are executed by AWK before it starts reading any data from the file.
So you can say this is something like pre-processing. The END pattern matches the end of file. That means all the records have been read and nothing more is to be read, but you want to take some concluding actions. Then you write END followed by whatever actions you want to take. What actions are we likely to take after end of file? Well, print the averages, print the totals, print the sums. I can't take that action in between, while I am reading the records, because I have not yet calculated all the totals. So in almost any data processing task there will be something that needs to be at least printed at the end. That job is done by writing the END pattern.
The BEGIN pattern is typically used to initialize something which AWK does not initialize by default, for example the field separator. Read the first line. It says BEGIN, FS equal to comma, in double quotes. After executing this action, AWK understands that whatever data is going to be read by my program has fields separated by a comma. If I don't write this statement, AWK will assume that fields are separated by blanks. Actually white space is the natural field separator: one or more blanks, a newline character or a tab character are all natural field separators. But when I want a special field separator, I say this. So now you know what this BEGIN with FS is for.
The rest is my program, made up of pattern-action statements. There are exactly two such statements that you see here. One is this, and the other is this entire group. This is how AWK works. After initializing anything that you have stated in BEGIN, it starts reading records. For every record that AWK reads, it applies all the patterns to that record. If any pattern matches, the corresponding action is executed. If no pattern matches, no action is executed.
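The processing model just described (read a record, try every pattern, run the actions of those that match) can be imitated in C++ to see what AWK is doing on our behalf. This is only an illustrative sketch of the model, not how AWK is implemented. The names `Rule`, `splitOnComma` and `runRules` are made up, the separator is fixed to a comma, and a trailing empty field is dropped by this simple splitter.

```cpp
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using Fields = std::vector<std::string>;

// One pattern-action statement: a condition and what to do when it holds.
struct Rule {
    std::function<bool(const Fields&)> pattern;
    std::function<void(const Fields&)> action;
};

// Splits a line at commas, standing in for FS = ",".
Fields splitOnComma(const std::string& line) {
    Fields fields;
    std::istringstream in(line);
    std::string item;
    while (std::getline(in, item, ',')) {
        fields.push_back(item);
    }
    return fields;
}

// Reads records line by line and applies every rule to every record.
void runRules(std::istream& input, const std::vector<Rule>& rules) {
    std::string line;
    while (std::getline(input, line)) {
        Fields fields = splitOnComma(line);
        for (const Rule& rule : rules) {
            if (rule.pattern(fields)) {
                rule.action(fields);
            }
        }
    }
}
```

In this picture, whatever the caller does before calling `runRules` plays the role of BEGIN, and whatever it does afterwards, such as printing totals, plays the role of END. A rule for absentees would test whether the last field, converted with `std::stod`, is negative, and its action would increment a counter.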
Now you can see that patterns are nothing but conditions which say what you want to do with which record. Consider, for example, the previous slide, in which we had the data. If you count the number of commas, you will find that the marks, which is the last field, is actually $19. So what you want to do is examine whether the value of the 19th field is negative or not. If it is negative, you want to discard it. If it is not negative, you want to count it towards all your computations. Look at what the program says. The program says $19 < 0. This is a condition. This is called a pattern in AWK. Notice I have no variable name, no declaration. I am depending on the fact that when AWK reads a record, it will automatically separate out the fields and assign them to $1, $2, $3 and so on. And I know for sure that the last value will be called $19. So when I write $19 < 0, that is the condition. If it holds, take the action.
What is the action? absentcount++. Why? Because a negative number means the student was absent. Notice I have not declared absentcount as int or anything. Whenever I use a variable which has not been used earlier in the program, AWK creates that variable. It also initializes it to 0 if it is given a numeric interpretation, or to the empty string if it is given a string interpretation. Since AWK may not know at the time of creation what the usage is going to be, it actually maintains both initializations and uses the one which is appropriate. But once you start using that variable, you have to be consistent in how you use it. So what will this statement do? If the first record which is read has $19 negative, it will create absentcount, initialize it to 0 and add 1 to it.
Subsequently, when I get another record which has $19 less than 0, it will add 1 more to absentcount. You will agree that when it has read all 569 lines, the absentcount variable will contain the total number of students who were absent, as simple as that. To the same record I will apply this second pattern. To the same record I will apply a third, a fourth, any number of patterns that I write. This is important to remember: all patterns are applied to each record which is read. However, the way I have written these patterns makes them mutually exclusive. Either the value is less than 0 or it is greater than or equal to 0. So for every record which I read, I will take either that action or this action.
An action can consist of multiple statements, all of them grouped in curly brackets. We are familiar with that. Look at what is happening inside. count++, which you understand, is exactly like absentcount++. This count will be the total number of people who have taken the exam, because they have a non-negative score. totmarks += $19. You understand this statement now? It is totmarks = totmarks + $19. $19 is my marks, and I want to add them to a variable called totmarks. Again totmarks is created on first use and initialized to 0, and it will keep accumulating this value.
Let us look at batch. batch[$4]++. What am I doing here? I am creating an array called batch. Its index is $4, which is 11, 12, 13, 14, 16, 17, 21, 22, 73, or 7Z if somebody has wrongly typed 7Z as a batch. An element will be created for 7Z. I am actually using this array only to keep track of the unique batch identities which are discovered in the data. The counts and the totals I am maintaining separately. batchcount[$4]++. What does this do? It counts, one by one, the number of students in that batch. And what will this do? batchtot[$4] += $19.
That means the element of the array batchtot which corresponds to, let us say, batch 11 will be incremented by the marks if that record has 11 as its batch. If I get another record with batch 12, the element corresponding to 12 will be updated. In short, when I repeatedly execute these statements for all my 568 or 569 rows, at the end I will have the count of people who have appeared for the exam and the total marks scored by the whole class, and division will get me the average. batchcount will give me the count of students present per batch and batchtot will give me the total marks per batch. When you write a C++ program, which you will have to write to solve this kind of problem, you will notice that it is not as easy as this. I will comment on the data that you saw slightly later. Let us look at the results.
The script is not yet complete. Remember I told you about the END pattern, and the patterns in between which are applied to the data. The END pattern says for (i in batch). What does for (i in batch) mean? batch is an array. How many elements does batch have? It has as many elements as there are distinct batches found in the data. Please understand the crucial difference between the distinct batches found in the data and the distinct batches which exist in your course. These two are obviously supposed to be the same, but they need not be. Somebody, as I said, might make a mistake. Instead of writing 27, somebody might write the batch as 20. And that is a valid batch as far as AWK is concerned. So it will create an element for 20, add one to that count, and the marks will get added there. But if my data accurately represents what I have in reality, then I will have as many elements in batch as there are batches. And these elements will not be indexed 0, 1, 2, 3, 4 up to 40. They will be indexed 11, 12, 13, 14, 15, 16, 17, 21, 22, 23, exactly corresponding to the batches.
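C++ does have something close to this in its standard library, although it goes beyond what has been covered in the lectures so far: `std::map` keeps one entry per distinct key and creates the entry on first use. The following is a minimal sketch of the per-batch accumulation under that assumption. The names `BatchStats`, `BatchMap` and `addStudent` are invented for the illustration.

```cpp
#include <map>
#include <string>

// Per-batch running figures.
struct BatchStats {
    int count = 0;       // students present in this batch
    double total = 0.0;  // marks accumulated for this batch

    // Average for the batch; 0 if no student has been added.
    double average() const {
        return count > 0 ? total / count : 0.0;
    }
};

// Keyed by the batch code exactly as it appears in the data ("11", "7Z", ...).
using BatchMap = std::map<std::string, BatchStats>;

// operator[] creates a zero-initialised entry the first time a code is seen,
// much as an AWK associative array does.
void addStudent(BatchMap& batches, const std::string& batch, double marks) {
    BatchStats& stats = batches[batch];
    ++stats.count;
    stats.total += marks;
}
```

As with the AWK array, a mistyped code such as 7Z simply gets its own entry, so the program should still compare the keys it finds against the batches that really exist. One difference worth noting: iterating over a `std::map` visits the keys in sorted order (string order here), so printing batch by batch needs no separate sorting step.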
This statement, for (i in batch), varies i just as we vary i in a loop with i = 0 and i < n. Here, for (i in batch) will make the variable i take the values 11, 12, 15, 17, 73, 62, whatever, all the values for which independent batch identities have been created. For each of those, using it as an index, it will print the value of i, it will print batchcount[i], and it will print batchtot[i] divided by batchcount[i], which is the average of that batch. The for loop ends here, but the action does not end here. Only the for loop ends. There is another instruction: print that the total number of students is count + absentcount. You agree? There are two separate counts I maintained. The total number absent is this, and the class average is simply totmarks divided by count. This is a complete, valid, correct AWK program.
To run this program on the data that we have, this is the command that is given. mawk is the name of the version of AWK that I am using, followed by -f. Many times an AWK program is just two or three statements, so I don't create the program in a separate file as I create a .cpp file. I can write the program immediately after mawk, right there on the command line. But when I have a larger program such as this, I would say -f and then the file name: analyze, midsem, 2010, v1, dot awk. So I created an AWK file, because I have as many as 12 or 13 statements in this program. It is an unusually long AWK program, by the way. Most AWK programs are two, three, four lines. mawk will read that program, and then I give the text file.
You will notice that I am using the pipe symbol here. The pipe symbol tells Unix that after running this program it should not put the output on my terminal, but send the entire output to another program, and put the output of that program on my terminal. What is that other program? It is a utility in Unix called sort.
It will sort all the lines that are given to it. So this is almost like redirection, but it is more than redirection. Redirection will create a file. A pipe feeds the output of one program as input to another program, and the output of that second program is what we see here. Why do I need to sort? Because I would like my batch averages to be printed in sorted order. If I don't do that, the batches will be printed in whatever order those elements happen to be held internally by AWK for that array. So if I have a student of batch 72 early in the file, element 72 will be created early and may well be printed first. When I say sort, the lines are sorted on the batch. So you can see this is batch 11: the batch, the number of students, the batch average. Batch 15: batch, students, average. And so on. After it prints all the batch averages, it prints that the class average is so and so, that the number of students absent is so and so, and that the total number of students is so and so.
Your job now is to write a C++ program to do precisely this. How long will it take to write that program? Suppose this was the mid-sem exam question. In an exam where you can assume the data to be properly ordered in this fashion, you can work out the solution in about 45 minutes. But if you have to write a professional program which does this, then you have to take care of the following. This data is extremely well formatted. I might have data in which there are blanks on either side of a value. How will you take care of that? Ordinarily in C++ you will first read the entire line as a single string. After reading the entire line, you will have to hunt for the occurrence of a comma.
Once you find a comma, you have to extract the characters from where you started scanning up to that comma into a small string, and convert that string into either a numeric value or a character string value. And if it is a batch, you will have to find out which element you have allocated for that batch code and increment the count there. This would take a non-trivial amount of time and some amount of testing to get running. But that is how you would learn programming, and that is how you would have to do the programming, because C++ is what we need to use.

Why, by the way, do we use C++ for handling such things? Why not use AWK? It looks good. So let us have some comments on why. Why do people use programming languages such as C, C++, and Java for doing such tasks when scripting languages like AWK exist? Any idea? What is your opinion? One answer that he is giving is that you probably want to make your solution platform independent. The fact of life is that AWK is available on almost as many platforms as C, C++, and Java are. The same is true of other scripting languages, by the way, such as Python. And languages such as AWK and Python are indeed popular, and a lot of serious data processing scripts are written in them.

But let me give you the downside of using such tools. First of all, these are non-compiled programs. That means execution time will generally be much larger than for a compiled program, whose generated code can be highly optimized. Second, I will not always have data in simple text files like this. Imagine now not the data for the students of CS101 but, let us say, census data for Maharashtra. Eight crore people, a variety of different information. I would in general not keep that data in a text file. Why? Suppose I am one of the persons who has been counted, and I am recorded as living in the state of Maharashtra. Now I change my address.
That change of address must be reflected in the data that they hold. If I have five-sixty students, I have six or seven pages of spreadsheet and a lot of TAs; I will say, okay, you have changed from hostel four to hostel seven, I will update it here. But when you have eight crore people, you don't update it like that. There will be an official process by which people confirm the address proof, this, that. And finally, when the data has to be updated, somebody will say: Dr. Deepak Phatak, address to be updated from, say, B14 to A15 in IIT. At that time, exactly one record out of all eight crore records must come in front of the person who will verify what the old address was and what the new address is. AWK processes its input sequentially, line by line; it is incapable of handling anything related to directly accessing a record in a file. So AWK is good for data analysis, but AWK is not good for data management. And in real life, you require real applications to do data processing, data analysis, and data management.

Take, for example, the railway reservation system. How many of you have reserved a railway ticket on the internet? Many of you. You will find that you are able to give a train number or train ID. You get the equivalent of an application form. If berths are available, you are immediately told that berths are available. Imagine there is an AWK program sitting at the back end. Whenever you name a station, it will read the whole text file and find out the total number of trains running between these two stations. That information by itself is not valuable to you. You want information to be extracted out of a database which can later on be updated when you say, book Frontier Mail, Bombay to Delhi. And when you say book, as an added attraction, the railway reservation system has to go to a payment gateway, to a credit card or debit card provider, and say, okay, collect this much money.
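The earlier point about bringing exactly one record in front of the person who updates an address can be illustrated with direct access to fixed-length records. The sketch below is only an illustration under assumptions: the Person layout and the names readRecord, writeRecord and updateAddress are hypothetical, every record is assumed to occupy exactly sizeof(Person) bytes, and the stream is assumed to be opened in binary mode for both reading and writing. Writing a struct's raw bytes like this is not portable across machines, and a real system would use a DBMS rather than code of this kind.

```cpp
#include <cassert>
#include <cstring>
#include <iostream>
#include <sstream>

// Hypothetical fixed-length record: every record has the same size on disk.
struct Person {
    int id;
    char address[32];
};

// Read record number `index` (0-based) without reading the records before it.
bool readRecord(std::istream& file, std::size_t index, Person& out) {
    file.clear();  // forget any earlier end-of-file or failure state
    file.seekg(static_cast<std::streamoff>(index * sizeof(Person)), std::ios::beg);
    file.read(reinterpret_cast<char*>(&out), sizeof(Person));
    return static_cast<bool>(file);
}

// Overwrite record number `index` in place; other records are untouched.
bool writeRecord(std::ostream& file, std::size_t index, const Person& in) {
    file.clear();
    file.seekp(static_cast<std::streamoff>(index * sizeof(Person)), std::ios::beg);
    file.write(reinterpret_cast<const char*>(&in), sizeof(Person));
    return static_cast<bool>(file);
}

// Fetch one record, change only its address, and write it back.
bool updateAddress(std::iostream& file, std::size_t index, const char* newAddress) {
    Person p;
    if (!readRecord(file, index, p)) return false;
    std::strncpy(p.address, newAddress, sizeof(p.address) - 1);
    p.address[sizeof(p.address) - 1] = '\0';  // keep the text null-terminated
    return writeRecord(file, index, p);
}
```

With a std::fstream opened using std::ios::in | std::ios::out | std::ios::binary, a call such as updateAddress(file, 1, "A15") touches only the second record, however many records the file holds.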
So data management is a far more serious part of most real-life applications, for which you require conventional programming languages and other tools. These are not really just tools; they are called data management packages. The conventional name is Database Management System, or DBMS. But to invoke the functionality of a DBMS, you would typically use C++ or Java to call the DBMS functions. It is not common to combine conventional programming language programs with AWK scripts, although that can be done.

In conclusion, then, this illustration was given to you to demonstrate what processing logic needs to be applied for analyzing data from such text files. That is of material value to us for this course because we have to do this kind of analysis by writing C++ programs. Additionally, you have got some exposure to a language called AWK, which is incidental. We are not going to have AWK programming as a part of the syllabus or anything. But it is for you to appreciate that such powerful tools exist. However, our job is to write C++ programs.

Now, coming back to some comments on the actual performance. This analysis is done on the raw data, so the marks that are there in my file are not the final marks. You will remember I mentioned that we will be giving some bonus marks to people who have done proper indentation and proper commenting in their programs. That job is being done by the individual lab TAs. They will complete all the lab batches by Saturday, and only on Saturday will I get all the final data. So these are not final marks. Yet I would like to comment that the average of 16.77 is rather a poor average, so I am a bit concerned. There are several students who have scored less than 10 marks. If I were one of them, I would be deeply concerned, because as we analyze the exam questions, up to 10 marks should be easily scored by everybody.
I would like all those people who have for some reason not been able to demonstrate a better performance in the mid-sem to introspect and find out whether they lost marks because of some silly mistake or because they have a lacuna in understanding. I am still discussing it with my TAs, but here is a proposal that I have: all those people who have got less than 10 marks will be given a makeup test about three weeks later. The makeup test syllabus will be the same as the mid-sem syllabus. It will be another 40-mark exam. If they score 10 or more marks in that exam, then their present marks, which are less than 10, will be replaced, up to a maximum of 9.5. So people who have already got 9.5 marks need not worry about it. People who have already scored 10 marks need not worry about it. But suppose I have scored only one mark or two marks or zero marks; then I would be concerned about my total marks being at least passing marks. So this is the opportunity which will be given to those people. I hope you agree with this? This kind of makeup test is only for a selected few.

Now, this is about the marks and evaluation, but there is something more to it. Why are there poor marks? Why did I get two marks to begin with? I got two marks to begin with because either I was unable to study and spend time, or maybe I have genuine difficulty in understanding some basic concepts. So I would like to personally conduct extra classes or extra tutorial sessions where we will not have lectures but will actually solve simple problems pertaining to this portion of the syllabus. It does not matter if those students are not quickly able to grasp the object-oriented concepts that we are going to discuss. But fundamentally, every student of programming must know these basic concepts of programming. So I will be announcing these classes next week for those of you who have these weaknesses. These classes will not deal with any material that we will be discussing in the subsequent weeks.
They will have only specific tutorial-like exercises, where I will have my TAs with me, to ensure that before the makeup test happens three weeks later the students are well prepared to take that exam. I think we need to handle both the performance in exams, because ultimately marks get converted to grades, and also the basic understanding. So that is the proposal; I hope you approve of it. The maximum marks scored in the raw data, incidentally, has been 39 out of 40, so that is a very good score. I will be putting up an honours list next week when I have collected all the bonus marks and everything. Thank you.