So, today is practically the last day of the TAship engagement; we will of course have some meaningful activities on the 21st when we discuss setting up question papers and correcting question papers, but today we are going to see some more examples, and you will have a lab immediately after this and also the programming test in the afternoon, which will be an online test in the lab itself. So, the TAs should note that the lab has to run in the afternoon. Where are the TAs? The lab has to run in the afternoon, where, whatever is the assignment for the lab in the morning, you will be extending your own programs as a part of the TAship. So, today we continue to deal with more examples. Yesterday we looked at arrays; today we will see multi-dimensional arrays and a matrix computation example. We will then look at handling data from text files. In any data processing you will have numerical values and character string values mixed up, split into various fields, and you will be required to handle those values, including their validation and computations based on them. In fact, your lab assignment today is essentially based on this kind of problem. And finally, we will discuss digital images, how they are represented, and specifically the histogram equalization problem. Has any one of you done a course in image processing? Yeah, just a few people; it must have been an elective in your final year, okay. But basic image representation and handling is pretty simple and straightforward, so I will describe that, and that will lead to some discussion on the kind of projects which are given in CS101. We will discuss those projects a bit and then we will describe today's lab assignment. Multi-dimensional arrays, again, are well known to you, of course.
By the way, I hope you have figured out that what I am explaining is not so much an explanation to you, but an illustration of how these things would be explained to first year students, so keep that perspective in mind, because many of these examples are used as-is in CS101, along with several more about which you will learn from time to time. So, effectively we say that just as we declare a one-dimensional array, we can declare a multi-dimensional array. This one has 50 rows and 40 columns, but the indices run from 0 to 49 and 0 to 39, and each element is accessed by a pair of index expressions; if it is a three-dimensional array, then by a triplet of index expressions, and so on. The rules for index expressions stay exactly the same as for a single-dimensional array. We then go into handling matrices. Some interesting problems with floating-point arithmetic get exemplified when we discuss these computational problems. The problem being discussed here is the representation of a system of equations in multiple variables. The example shows two equations, 1 and 2, in two variables y and z, and these can be represented as a coefficient matrix multiplied by the vector of unknown variables, equal to the right-hand side. Typically, the Gaussian elimination method which we describe uses two facts about a system of linear equations. One standard fact is that the system of equations is unaffected if an equation is multiplied by a constant. So, we take equation number 1 and multiply it by 0.5, giving equation 1' alongside equation 2. Observe that what we have achieved is a 1 in the diagonal position. The next step merely represents equations 1' and 2 in terms of the coefficient matrix, the unknown variables and the right-hand side. The second fact is that if an equation is replaced by a linear combination of itself and any other row, the system of equations remains the same.
These are standard things; I suppose they are known from school-level algebra. Solving simultaneous equations in algebraic variables is well known, fine. At least for the JEE entrants we found that they had all done this. So, we show that we can replace the second equation by subtracting from it a multiple of the modified first equation: we multiply equation 1' by 4 and subtract that multiple from equation 2. Effectively we are trying to get a 0 in the first position, and the 3 becomes whatever it becomes; the matrix now reads 1, 2 in the first row and 0, minus 5 in the second. Notice that we are slowly moving towards an upper triangular matrix. And of course, since this is minus 5 here, we can get 1 on the diagonal by further dividing equation 2' by minus 5, and then we get this. The advantage of this is indicated to the students by saying that if I have the system of equations in this form, then I can start from the last equation, where I directly get the value z equal to 3, and then I can do a back substitution of z into the first equation: I substitute the value 3 and I get y plus 6 equal to 4, or y equal to minus 2. If I had more equations, I could keep on doing this back substitution and get the result. I have found, by the way, that this also helps people to refresh their memories of simultaneous equations, because some of them might have done it only partially. Some of them have given it up as a matter of choice, because in standard exams you solve any six questions out of ten, that kind of stuff. I do not know what you did, but I used to drop off portions of the syllabus in my third, fourth and fifth years of engineering. That will happen any time you are given a choice. So, this also helps them to revise their own notions of simultaneous equations.
So, we conclude that the essence of this method is to reduce the coefficient matrix to an upper triangular form, in which all elements on the diagonal are one, and then to use back substitution. Then we point out that this method is susceptible to round-off error. Multiplication and division do not cause so much of a problem, but consider the subtraction step: to get a zero you have to calculate a factor, multiply an equation by that factor, and then subtract or add. When you do that, if there are coefficients which are drastically different from each other, you will have round-off error. So, floating-point errors are very important here. Then we say that there are other variations, such as Gauss-Jordan elimination, which also involves pivoting, or LU decomposition. We leave just the names here; those who are interested can go and refer to them, or they will study these things in subsequent years in their own respective departments. I think I mentioned Numerical Recipes in C, C++, FORTRAN, etc., by its four authors. It is an extremely famous book; if you get a chance, do not miss looking it up. Then we describe the general representation: a linear system in n variables can be represented like this, and then we say that although this is the standard mathematical notation, since array indices begin from zero and not from one, we would rather prefer this representation. It is an artificial replacement of the standard mathematical notation, but it is very easy for people to map from 1..n to 0..n-1. And then we say the Gaussian elimination technique will reduce the coefficient matrix to the upper triangular form, which will look like this, and once we get this we can use back substitution to solve the problem.
So, in general, a system of n equations can be written as a two-dimensional matrix A multiplied by a one-dimensional matrix X, the unknown variables, equal to B, the right-hand-side matrix. Then we explain, using the array notation, how the equations will look: A[0][0] into X[0] plus A[0][1] into X[1] plus A[0][2] into X[2], etc., equal to B[0], and so on. Then we take the example equations in two variables and show how they look in matrix representation form, which is very straightforward, and this is shown reduced to an upper triangular matrix with the solution found by back substitution. With this background, we then go ahead and explain the program which implements Gaussian elimination. We have the matrix A, the matrix B and X, and we define things like divisor, factor, sum, etc., which are obviously temporary variables required during the computation. This is just the matrix input: we read the two matrices, matrix A and the right-hand-side matrix B. Then we start with Gaussian elimination. For each row, I divide the row by the coefficient on the diagonal. So, I calculate a divisor, and I set the diagonal element a[i][i] directly to 1, because that is what I am supposed to get. Notice it is useful to do that, because if you actually divide, the division may or may not yield an exact one. Not in this case, because you are dividing the value by itself, but in several cases this kind of thing will happen. Now you recalculate all the remaining coefficients in that row and then normalize the corresponding right-hand-side element, because the right-hand side also has to be taken care of. And finally, you replace the subsequent rows by subtracting multiples of the i-th equation from them, as we saw earlier, and this is the simple program. We go for k equal to i plus 1 to n minus 1, because we are looking at the i-th row, so we have to do it for the (i+1)-th row to the (n-1)-th row, and these computations are very straightforward.
And then we show back substitution: the last variable, x[n-1], is computed directly, and then for n-2 all the way back to the 0th element we sum up that particular row with the values of x which are already determined. This is exactly back substitution. Finally, we have the values of x[i], and then we output the result: we output the matrix A, we output the matrix B, and we output the values of x. This of course takes much longer in a class, but there are two things I would like you to notice. First, the matrix used in Gauss elimination is not treated as a programming problem; it is treated as a mathematics problem that we are trying to solve. Second, sufficient time is spent explaining what Gauss elimination is, although most people will know it, because the framework in which we are putting Gauss elimination is different: it is a two-dimensional array in a C/C++ programming context rather than plain mathematical notation. We believe it is important to do so, so that the students acquire a fairly good mental model of both the programming implementation and Gauss elimination itself. Notice also the series of comments, which we generally make mandatory, to explain the various steps of the algorithm. It is not uncommon in many engineering colleges to teach programming without insisting on writing comments. Occasionally you will be told to write comments, but as long as you write some documentation accompanying your program, that is considered adequate. As I mentioned earlier, it is not so: external documentation is not a substitute for inline documentation. Both must exist and both have their own role to play. Generally useful facilities of the operating system, such as redirection, we discussed yesterday. I had one or two queries on this, so I thought I would include these in the slides.
I will not spend any time on it, but since all these slides will be available to all of you, I have included them. Your accounts have been activated now. Did you receive a mail? Somebody would have checked this morning, because I sent it late last night. I sent it to the M.Tech-1 group. The confusion is that today you are officially M.Tech-1, but till last week there was another M.Tech-1 group, so I do not know who received my mail. If your seniors have received it, they must be wondering why Phatak is asking them to write such a program, but that will be interesting. Did you receive it? No? Okay, good. Then it must have reached you; you will find it in the lab. Anyway. So, yesterday what I showed you is that instead of reading from stdin and writing to stdout, you can use the redirection operator to read the data from a file. What is important to emphasize is that it will read all input only from the named file. You cannot have some input read from the keyboard and some input read from that file; once you redirect, everything is redirected. So, naturally, if you want to give even one value as input from the keyboard, you cannot use redirection; you have to open the file and handle it in your program. Similarly, output can be redirected: stdout can be redirected to an output file. The slide shows the input data file, the redirection of stdout to a text file, and the sample results. This I have found to be quite useful, and many of my colleagues endorse it: when you discuss a program, it is useful to show the sample output that you will get after running that program. It also forces me as a teacher to actually run the program, because these results are not hypothetical; they are actual screenshots from the Ubuntu machine on which the programs were run. So, then we can do some massaging of this output. For example, here we point out that, look, the values are not decently organized.
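The redirection behaviour described above can be demonstrated with any program that reads stdin and writes stdout; `cat` is used here only as a stand-in for the compiled program, and the file names are illustrative:

```shell
printf '1 2 3\n' > input.txt    # prepare a small data file

# '<' makes ALL stdin reads come from input.txt;
# '>' sends ALL stdout writes to output.txt.
# There is no way to mix keyboard input with redirected input.
cat < input.txt > output.txt

cat output.txt                  # prints: 1 2 3
```

For the Gaussian elimination program the invocation would look the same, e.g. `./gauss < matrix.txt > results.txt`, with `gauss` and the file names standing in for whatever you actually built.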
So, you have 0.05, 0.5 and then 1 here; 1.5 is spaced like this, but 1.4 comes here, and minus 1.08696 comes like this. In short, these 4 by 4 elements do not appear in any well-organized fashion in the output. So, you just point this out and say this is not good, and we would like the results to look something like this. This is the motivation for formatting. Till this point in the class, no printf is discussed, no scanf is discussed. You will notice a fundamental difference from, let us say, a course where you start with C programming: there you have no choice but to discuss scanf and printf right up front. Now that you have done a lot of programming, you know exactly what those are. But recall the first few days of lectures: would you make sense of printf and scanf? The percent symbols make no sense, the hash include looks mysterious, int main is completely unknown, and printf and scanf only add to the confusion. Of course you people, like all students, are smart, so you figure out that even if you don't understand it, the damn thing works; so let us reproduce exactly what is written in the book or given by the teacher. Which is a very bad practice. In order to avoid it, we say don't use any such thing; simple operators like cin and cout are good enough. And that is why, indeed, we don't call our course by any programming language name. It's a course on computer programming, and what we use is a combination of C and C++. Incidentally, we do talk about objects, but we don't talk about methods until pretty late. We mention classes as natural containers of objects, but nothing beyond that. So, we don't teach object-oriented programming for at least half the semester. We teach programming: we teach basic constructs, we teach basic control structures, we teach arrays and matrices.
And for any of these you don't need any mention of object-oriented programming, because conventional procedural programming is adequate. So, that is what is conveyed first. We already discussed DDD. For the first year students DDD is described, but more importantly, they are told that when you have a long program and it is misbehaving in portions, you would typically like to include some statements to identify errors. So, you print some variables, print something, etcetera; these would be called debugging statements. A useful facility from C, or C++ for that matter, is that you can have #ifdef, some name, and #endif. All statements included within this block are treated very differently. The name can be any tag; for example, say NOTSURE is my tag. So, in my code I have: #ifdef NOTSURE, then cout << i << j << sum, then #endif. Normally this block will be ignored completely. However, if I compile my program with the -D option, giving the name of the tag that I have used, then all those statements are considered part of my source code, and they are put by the preprocessor in the exact sequence where they occur. What this means is that when the program is compiled with this option, it will execute with this cout statement and all other such statements that I have. Once I have confirmed, made changes and corrected things, instead of going through the rigmarole of removing all such debugging statements or commenting them out, I just recompile without the -D option, and again the blocks are ignored by the compiler. So, you will have a clean run of your program. Ordinarily the compiler ignores the block, but if you compile with this option, these statements are compiled and executed as part of the program.
So, this is about matrices and some attendant issues, attendant facilities that the operating system and the compiler have. We will now look at a programming paradigm which probably only those of you who have worked with Unix would be aware of. And even those who have used Unix may not have used awk or nawk. Have you? You have used it. Okay, good. This is introduced to first year students as an alternate paradigm for solving some problems. First we will look at the problem. This was a large assignment after the mid-sem exam. Once the mid-sem, a 45-mark exam, was concluded, the students' marks were collated by teaching assistants such as you. For simplicity, they put them in a spreadsheet. A spreadsheet is much simpler to handle: you can sort it, arrange it, whatever, in much the same way as you would in a database, but you do not require SQL statements to do everything; you just work with the data directly. So, they collated and compiled everything. Now, when they had to give me the entire data, they simply exported it in the comma-separated value format. You are familiar with the CSV format; we discussed it, I think. Basically, when you take a spreadsheet and say export this spreadsheet as a text file in comma-separated format, this is the kind of records, or lines, that you will get. So, observe what we have here. 1, 2, 3, 4: these look like serial numbers. Then 9002040 and the like: these look like roll numbers. But notice what happens in a spreadsheet: this number is taken as a numeric value, and therefore the leading zero is chopped off, whereas what you actually have are eight-character roll numbers. A spreadsheet is very bad this way: if it sees a number, it treats it as a number, so it removes leading zeros. Consequently, it's a wrong representation of the roll number. The next field is the name. But in the spreadsheet, if the name is blank, you will get two consecutive commas; there is no name here.
The next field, 0B and the like, is not known to you: this is what we call a lab batch code. So, 0A, 0B, 0C, 0D, 1A, 1B, 1C, 1D, etc.; there were 40 lab batches. And the last field, of course, is the marks: 44.5, 44, 22.5, etc. What does minus 1 or minus 2 mean? Well, minus 1 and minus 2 were used by my TAs to represent somebody absent. Somebody chose minus 1, somebody chose minus 2, like that, but a negative mark represented an absent student. Notice that these artificial design decisions, taken by a few people, impact the life of all programmers who have to deal with that data. If you were using a database, then if there are no marks, that is, if the marks field is null, it means the person's paper has not been evaluated, whether the fellow is absent or whatever; SQL will draw a single conclusion and work correctly when calculating averages or whatever else you want. Not so in text files. So, you have to be very careful when you make assumptions or take design decisions of this kind; a design decision should be as generic as possible. Nevertheless, this is a good example of simply structured data values for every student, yet more complicated because of these kinds of problems. What we want to do is find the batch-wise average marks and the class average. On the face of it, this appears very simple: this is the batch and these are the marks. There were 800, whatever, 810 students last year. So, we have roughly 810 students' marks divided into 40 groups. We want to total the marks of the students in each group, divide by the number of students who appeared for the exam in that group, and we get the batch-wise average. Seemingly a computationally straightforward problem. But the data is in this form, so we have to handle that.
This is where we say that, ordinarily, as C/C++ programmers, you would open files, read them, extract information from this complex comma-separated record, fill those values into appropriate arrays and then do your computing. But there is a simpler paradigm, a scripting paradigm, using a programming language called awk. Aho, Weinberger and Kernighan wrote it. Kernighan is a familiar name: Kernighan and Ritchie are the creators of the C programming language. They wrote the awk programming language, which makes heavy use of string data types, associative arrays and regular expressions, standard stuff in any Unix kind of environment. You would have heard of a language called Perl; Perl is a successor of awk, which enhances many of its functionalities. So, these are the awk language fundamentals. For those of you who have not used awk, it might be interesting to see this. First of all, it's a language for processing text files. Since it is an interpreted language and doesn't get compiled, we often call it a scripting paradigm: you write an awk script, which is essentially an awk program. An input file is treated as a series of records, and each record has a set of fields. When a text file is read line by line, each line is treated as one record, and as a line is read, it is broken up into a sequence of fields. You can think of a line as the first word being one field, the second word another field, the third word a third field, and so on. Naturally, if we are talking about, say, five or seven fields, then awk must know where one field ends and the next begins. This is done through a field delimiter. The default field delimiter, if you don't specify anything, is a blank or a tab, but you can specify the delimiter yourself. Here we specify the field delimiter as a comma, which means whenever a comma comes, the next thing is the next field, and so on.
The pipe is generally the preferred delimiter in the Unix environment for most people, for the simple reason that the pipe symbol does not occur in names of human beings, does not occur in numerical values, and does not normally occur anywhere as a valid value. A comma, on the other hand, could occur in, say, an address field. But anyway, the comma has been a value separator for ages, so that is what we have used here. What does an awk program do? An awk program is nothing but a series of pattern-action statements: any statement in awk is a pattern followed by an action. What awk does when it starts reading lines is iterate over them. Say there are 10,000 lines: it reads the first line and applies all the patterns you have specified in the program to that line. If any pattern matches, or more than one pattern matches, the corresponding actions are taken on that line. Then the next line is read, patterns are matched, then the next line, etcetera, till all lines are read and handled. In short, it's record-by-record processing of input lines, where each line is taken in, patterns are applied and actions are taken. We shall see what patterns and actions are. Finally, when the file ends and there is no more input available, there is a special pattern called END, which is applied only after all records have been processed. It's like end-of-job processing: if you want to calculate averages and so on from accumulated sums, that is where you use the END pattern, and its matching action is taken. There is also a pattern called BEGIN, which is applied before even the first record is read. This is used for initializing certain things if you wish, such as the input field separator or the output field separator. Incidentally, variables and arrays in awk need not be declared and need not be initialized.
The first time you use a variable or an array element, it gets declared and initialized. That is how the programming becomes so simple here. Let's see an example. So, now we analyze a record of the file. This is a typical record, and we know the fields: serial number, roll number, name, batch code and marks. awk creates internal variables to hold the values of each field: it separates out the fields as it reads the record and assigns the values to field variables, which are called $1, $2, $3, $4, as many as there are. If there are 20 fields in a line, $20 will be the last one; $0 refers to the entire record as a string. So, automatically, when you read a line, it is as if you had done cin >> into $1, $2, $3, etc.; it's almost like that, but automatic. Now, what do we want to do for every line that we read? First of all, we want to locate the absent students and discard them, maintaining a count, because they are of no use to us in the average calculation. So, what is the pattern? The pattern is that the marks should be negative, which means $5 is less than zero. If $5 is less than zero, we increment an absent count. However, for every other record, where the marks are non-negative, we need to update not only the total marks for the whole class but also the marks for the batch. So, for the other pattern: increment the batch counts, the mark totals, etc., and at the end, print the accumulated results. This is what... yeah, please. Sorry? Does it take care of typecasting? Oh, very, very good question. As I told you, awk is almost a free-form language. There are no types except string and numeric. Okay? If something is not discernible as numeric, it is treated as a string.
There is also a mixed type, called numeric string, where you may use a value as a string and also as a numeric if it has a numeric value. So, it is completely in contrast to the conventional strict typing that we are familiar with in programming languages. This is not that kind of language at all. Its basic intention is to permit you to do something simple, simply. There are no hard rules; if you put hard rules, then you have to write longer code because of the hardness of the rules. So, this makes life simple. Three types, then: numeric, string and numeric string, and which one applies is decided on the fly based on the value that awk sees: if the value looks numeric, it is numeric; otherwise it is non-numeric, which is a string. Here is the awk script, the awk program. In this particular case we do not require a BEGIN pattern, which means nothing needs to be done before awk starts reading the file. So, there are exactly two patterns with their actions. One pattern says the value of $5 is less than zero; if that pattern matches, what is the action? abscount++. Notice: no int abscount, no initialization to zero; all that awk does automatically. The first use of a variable name brings it into existence and also initializes it: it is initialized to zero if it is used in a numeric sense, as it is here with ++, and to a null string if it is used in the context of a string. So, the first use determines what the type of that particular variable will be. The other pattern is the value of $5 greater than or equal to zero, which means people who appeared for the exam. Notice that zero cannot be an absent indicator, because somebody may score zero. Actually, in IIT exams people can get negative marks; my TAs did not know that.
Luckily, in that exam nobody got negative marks, or maybe, because the TAs had decided to use negative values to indicate absence, they did not award negative marks to anybody. Anything is possible. But whatever; as I said, you generally have to be very careful about the design decisions you make. Now look at these statements. count++: that means for every record with non-negative marks, count is incremented. This will be the count for the entire class; all the students who appeared for the exam get counted, because these statements are executed every time this pattern matches, and this pattern matches every record except those which represent absent students. Similarly, totmarks += $5. Observe that the syntax is very similar to C/C++; there is no difference whatsoever. In fact, the creators of the C programming language created this, so what else do you expect? But notice again that totmarks need not be initialized: it is brought into existence initialized to zero, and updated on subsequent iterations. If there are 820 lines, this statement is executed 820 times; well, not quite 820, minus those records which have negative marks. The trick is that we now want to maintain batch-wise totals. 0A, 0B, 0C, 0D, 1A, 1B, 1C, 1D, etc., up to 9A, 9B, 9C, 9D, represent the 40 batches. Ordinarily, you would have declared an array of 40 elements, and since 0A, 0B, 0C, etc. are character strings, which you cannot use as indices of an array, you would have used indices 0, 1, 2, 3, and so on, and mapped 0A to 0, 0B to 1, 0C to 2, etc. But awk permits you to use an associative array: a string value can be an index into the array. So, when you say batchtot[$4], where $4 is the batch code, on first use the batchtot array is created. Elements are also extendable; the memory allocation is dynamic.
So, the first element is created when one value of $4 comes in; let's say that is batch 1A, so the first index will be 1A. And when you say += $5, the marks are added to that array element, which was initialized to 0 automatically when it was brought into existence. After four or five records, suppose a student with batch 0A comes along; then, because I have indexed by $4, the 0A element of batchtot will be updated. This is an associative array: based on the contents, you are actually accessing the array element. So, you see how extremely simple life is. In exactly the same fashion, I need a batch count, because I need to calculate the batch-wise average marks as well as the overall average. So, I have a total count and a total marks accumulator, and a batch-wise marks total and a batch-wise student count. And that's it; that's the complete awk program. When this script runs over all 820 student records, all these counts will have been updated. Of course, at the end I have to do some work to compute the required results, and that is where I use the END pattern. Remember, I told you END is a special pattern that is matched only when end of file has occurred, meaning there is no further record left, and therefore I can do whatever I want there. First let us look at the last three lines; these are very straightforward: print total students as count plus abscount; print the number absent as abscount; print the class average as totmarks divided by count. These are the counts, and we are dividing by them. This one needs some explanation, because what I am printing is actually the batch code and the batch average. The batch average is found as batchtot[i] divided by batchcount[i] if I am looking at the i-th element; similarly, print i will print the value of i. Ordinarily, with a numerically indexed array, I would have varied i from 0 to n minus 1 or some such thing.
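The whole script can be put together as below. The file name, the sample records and the variable names (abscount, totmarks, batchtot, batchcount) are illustrative, not the actual assignment data; `awk` is used in place of `nawk`, which behaves the same for this script:

```shell
# Sample CSV records: serial, roll number, name, batch code, marks
# (a negative mark means the student was absent).
cat > marks.csv <<'EOF'
1,09002040,ANU,0A,44.5
2,09002041,RAVI,0A,22.5
3,09002042,,1B,-1
4,09002043,SITA,1B,40
EOF

awk -F, '
$5 <  0 { abscount++ }                      # absent students
$5 >= 0 { count++; totmarks += $5           # whole-class totals
          batchtot[$4]   += $5              # associative arrays,
          batchcount[$4] += 1 }             # indexed by batch code
END     { for (i in batchtot)               # iterate over string indices
              print i, batchtot[i] / batchcount[i]
          print "total students:", count + abscount
          print "absent:", abscount
          print "class average:", totmarks / count }
' marks.csv
```

Note that `for (i in batchtot)` visits the batch codes in no guaranteed order; if you want sorted output, pipe the result through `sort`.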
0 to 39, or 0 to 40, or 1 to 40. But this is an associative array where the index is not a numerical value — the index in fact is a character string which represents the batch code. For that, awk has a special provision: for (i in batch_tot). What this means is that the batch_tot array has been created — we don't know with how many elements; as many elements as there are batches — and we are asking awk to go over all the actual index values that it has encountered. So i would be 0A, 1B, 2C, 1D, 0B — all the unique values, because those many elements have been created for that array. And for each of those values it will print; and this i is not numeric, this i is actually the index into the array, so it will print 0A, 1B, whatever. Similarly, the batch_count element indexed by 0A, or by 1B, etc. will be printed. So you can see the power of associative arrays. Just try to do this simple thing in C, and you will find the kind of matching that you will have to do — at the very least a table lookup. Yes? We are not discussing the implementation of awk itself — figure it out, you are a computer science student; we are discussing how to use awk. But that's an interesting question: people who write compilers or interpreters should worry about that. We are not worried about it here. Yeah? Where we haven't defined — we haven't defined batch anywhere; there is a mistake here. Okay, so you tell me, what is to be done? Yes: for i in batch_count is okay, because batch_tot and batch_count both have the same index values — $4 in both cases — so either one of these names is okay. That's the error here; you are right. So I can use i in batch_count, or i in batch_tot, which is what I am printing here. This is the output that the program produces. The way to run the program is to say nawk. The original awk is currently out of fashion; there is a replacement called new awk, or nawk, and that's the command you will find on most Unix machines.
Notice the syntax of executing nawk: you say nawk, minus f, then the program file, and then the data file. Now what about batch codes which do not exist? That would be hard to handle in C — in fact, it would be hard anyway. So what do you do in conventional data processing cases? Whenever you get such data, you run rigorous validation checks first, and you throw out the data which does not meet the validation rules. How can there be a student who does not have a batch? So go catch hold of that TA, bash his head, tell him to write the batch code, whatever — because there are people responsible who sometimes make mistakes, and these things happen. But awk, in the same program, will not bother; it will give you these results. So it will produce output like this, and at the end it will say: total students are 819, number absent is 10, class average is this much. You can see the power now. So I do introduce this to describe file handling and the need for validation, etc. And the natural question that our students have is: if awk is so simple and works so well, why use complicated programming languages? And then we announce the fact that it is superb for such problems but has limited capability to handle functions for general computation. For example, if you were to do a Gaussian elimination or an LU decomposition, you would have problems here. Further, it is an interpreted language, which means it is not compiled into machine instructions, and therefore operationally it will be less efficient. However, for a limited amount of data — 50,000, 80,000, 1 lakh rows, etc. — awk can be extremely useful; on modern hardware it is quite fast and quite adequate. So you use it as a special-purpose tool: if you want to quickly find out something, you can use awk to do it. Of course, as far as our first-year BTech students are concerned, we tell them: get back to your C/C++. And that is what will be our assignment today. The lab assignment will require you to read the midsem marks.txt file.
The same file exactly has been given to you in the zip bundle. And you have to find the class average, sort the data on marks, and produce an honours list of the top 10 performers. So in the first part, in the lab immediately after this, you don't have to do too much analysis of the various fields, etc. — although if you feel like it you can add validations now itself — but essentially your job is to find the class average, to sort the data on marks, and to produce the honours list of top 10 performers. The list you should get out correctly is roll number, name and marks. Here we are not talking about batch-wise performance, etc. But in the afternoon test, which will be conducted online in the lab itself, you will be required to extend your program. Now, I do not know whether you have a common directory, because you are all logging in as a single user — if you do that, you can't store so many programs there. One suggestion is that at the end of the lab, when you mail me the program file, mail it to yourself also, so that in the afternoon I will mail you the test, you take that test, and you also download your own program from your own copy of the email, which you can then extend. So we use the email box as the common repository to be shared between you and me — something of that sort. Is that okay with everybody? Fine. I am sorry, it's already five minutes over, but I will just take about ten minutes to describe the last part of a typical CS 101 course, which is about the course projects. Here we use digital images as a description; the actual course projects involve handling of fingerprints, which I will come to later. So first we explain to them what digital images are.
We tell them about pixel values — for example, that the picture is made of picture elements, the number of rows and columns is determined by the width and height of the photograph, and the digitization sampling rate decides how many pixels you will have. Then we say each pixel can be represented — if it is a black and white image — by just one bit, zero or one. If it is a grayscale image, monotone, that is zero to 255, where zero represents black, 255 represents white, and you have shades of gray in between; that's why you call it a grayscale image. And finally, a full color image, which is represented by three bytes per pixel: red, green and blue. So you have an intensity value of red, an intensity value of green, an intensity value of blue, and the mixture of these intensities creates a color. So effectively you can have 16 million colors. Incidentally — this is interesting; I don't know whether all of you know this — the capacity of the human eye is limited to a small range, 200 to 2000 colors, as far as distinction is concerned. So given 16 million colors, you really cannot distinguish two neighboring colors at all by the human eye. But in any picture you will not have 16 million colors — in fact, you will not have more than 200, 300, 400 colors — and that is why the human eye is able to distinguish even neighboring pixels if they are different. So contrast plays an important role in the human recognition of images, or digital images. But this is roughly it about digital images. Then we tell them that when we want to store information about an image in a file, all that we need is the value of the width, the value of the height, the type of colors present, and a value for each pixel — that's all. So images such as black-and-white fingerprints have small size, 500 by 300; and notice that one bit can represent one pixel, because it is either black or white. Of course, when you capture a fingerprint you get a grayscale image.
That grayscale image has to be converted, through some thresholding, into a binary image, etc. For large images, of course, compression is mandatory. Take modern cameras — you have heard of 5 megapixel, 8 megapixel, 12 megapixel cameras. A 12 megapixel camera can produce 36 million bytes per image: 12 megapixels means the sensor captures 12 million pixels in a single photograph, and each pixel is 3 bytes, so 36 megabytes would be the size of just one image. Even with modern storage available in small form, this would be very large, so you need to compress it. And then we tell them that compression can be either lossy or lossless, without going into too much detail. We just explain that lossless compression means that from the compressed form we can always recover the original image without any loss of information; and lossy means that the reverse transformation will not be exact, but the so-called lossy transformation is adequate for most human purposes — and that is why you have formats like JPEG and so on. Incidentally, there are so many formats: RAW, PNG, BMP, TIFF, GIF, JPEG, XPM. Those who have done courses in image processing — have you seen all these formats? Wikipedia is an extremely good source for the details of these formats. It's interesting to see how these formats are defined, how they are described in a program, and how images can be read, displayed, etc. Anyway, now we come to the brass tacks of programming. So we say that the pixel values in a digital image can be read into a matrix — a two-dimensional matrix — for further processing, and each picture point, each pixel, will have an associated tonal value. Tonal value means intensity, as we said: for grayscale it will be 0 to 255, where 0 represents black and 255 represents white. And consequently, each element of such an image matrix would contain a value that can be of a small type — a char; one byte is adequate.
So consequently, at one byte per pixel, it will require W into H bytes to represent an image, where W is the width and H is the height. Then we describe the notion of a histogram, where we say the histogram just tells you how many pixels in an image have the same value. So this is the example that I used last year: a sample image, 8 by 8 pixels. Notice that it's a grayscale image. The center square, which looks like white, is not actually white — you can see it's not pure white. Similarly, the corner square is not jet black. So obviously there is no pixel with value 0 and no pixel with value 255, and that is the reason why the contrast of this image is so poor — it sort of appears blurry. So these are the pixel values in the sample image; this entire example is from Wikipedia. Observe that the smallest value is 52 and the largest value is somewhere here, 154, and all pixel values are in between. There is only one element with value 52; 55, if you notice, has a few more elements, and so on. And that is what we do when we compute the histogram. For any value — 52, say, for example — how many pixels have 52 as their intensity? The count of that is the histogram value at 52. How many elements have 63 as their value? It's 1 and 2, maybe, so the histogram at 63 is 2. Like that, you calculate the histogram values. So at 52 you have one pixel; at 55, three pixels; at 58, two pixels; at 59, three pixels, and so on. Observe that there is no pixel value less than 52 and no pixel value greater than 154. Then we define a cumulative distribution function. The cumulative distribution function is nothing but: up to this pixel value, how many total pixels are there? So, going back to the previous slide: if 52 has one, 55 has three, 58 has two, then for all values up to 55 there are 1 plus 3, four pixels; up to 58 there are 4 plus 2, six pixels; up to 59 there are 6 plus 3, nine pixels.
So that is the cumulative distribution function: one, four, six, nine, et cetera. What is the importance of the cumulative distribution function? What is the importance of the histogram? Well, we then say that one of the reasons why the contrast was not good in that image was that the lowest value was not zero and the highest value was not 255. Obviously, if you had the image stretched over all intensities, the contrast would be better. Now, the histogram tells you the limits of the distribution of these pixel values, and if you stretch that — which is called histogram equalization — then hopefully you will get better contrast. So we then explain the notion of histogram equalization. We show this formula without going into its details, and the equalization for the example image works out to be this. And then finally we say, using this histogram equalization, we can map, for example, the value 78, whose CDF is 46. What it means is that, using that formula, if a pixel value is 78, it should be replaced by 182. So what you find out using histogram equalization is: given a present pixel value, what would its final value be in the stretched histogram? Obviously, the lowest value should be stretched to 0 and the highest value should be stretched to 255 — but it is not arbitrary stretching. The stretching is in consonance with the nature of the image, and that is why the histogram of the image and the distribution function have to be taken into account. This much, first-year students are also able to appreciate, and then you show them the results. So these are the pixel values after histogram equalization. You show them: you see, you got a 0 here, which was 52, and there was a 154 — it has become 255. And all the other elements have been stretched accordingly.
And then we say: look, if we had read the pixel values into a two-dimensional matrix, calculated the histogram, then done the histogram equalization, got these pixel values and re-displayed the image, it would look like this. And we show the contrast. So this is the original image — you can see the contrast is much better now. Of course, here a pixel is shown very large; an actual pixel will never be like that. In fact, an 8 by 8 pixel image will never be seen like that by the human eye. But still, this example is good enough to convey the basic principle and to consolidate their understanding. I had used another example, again from Wikipedia, which was this. So this is a grayscale picture. Can you discern it from a distance? There are some trees — it's like a hilly area, and you can see some trees and so on. Now, if you look at the histogram, it looks like this: you can see almost all pixels are concentrated between 100 and 200. And the cumulative distribution function is like this. Now you want to stretch this. So I tell the students that when I stretch this, I want it like this: I want pixel values to start from 0 and go up to 255. Essentially, histogram equalization will give you a linear cumulative distribution function — that is the objective. So, with respect to the original function, how do I stretch it into a straight line? That is what the histogram equalization formula does. And if you apply that, you get this. Just contrast it with the original — this is what you get. I found last year that this example could actually convince people of the usefulness of histograms, histogram equalization, et cetera. And later, when you add that Photoshop or any such tool automatically does this histogram equalization, they are able to relate it to actual worthwhile software products which intrinsically contain such simple computations. In fact, modern cameras can do histogram equalization easily.
So even a camera has this as embedded software. Then we tell them how to write a program for this. So we show them the program to calculate the histogram. We take an image; we write, of course, comments like this. And then we read it in: for i, for j — it is a square image that we are assuming. So you take an input pixel and also output it; after every row you go to a new line, so that you can just confirm the input. Then we set the histogram to zero. Observe that what we are doing is: irrespective of the actual pixel values present in the picture, I need to calculate the histogram for all possible values — zero, one, two, three, up to 255. So we set each element of the histogram array to zero, and then we set up a double iteration to go through the entire matrix. And all that we do is: whatever element we are looking at, whatever its pixel value is, we use that value to index the histogram array and update the count. So notice: histogram[image[i][j]]++. image[i][j] is the pixel value, and that is being used as an index into the histogram — another example of an associative array, except that I can do this association in C/C++ programming because image[i][j] is a numerical value, and very happily that value is between 0 and 255. If it were something else, I would have some more complications. It is useful to emphasize the notion of an associative array because it keeps coming up in a variety of guises in different applications. And then we, of course, print the histogram. This is just an extension to find the maximum value in the histogram. The maximum value in the histogram, the minimum value and the average are often used for thresholding, for converting a grayscale image into a binary image, etc. Okay. My God — I will just take five more minutes. So I will just describe the course projects. Generally, a lab batch of about 20 students develops one programming project.
As I said, it is not uncommon for the actual final program they produce to be to the tune of 500 to 5000 lines — it depends upon how different people do it. And it is not just that they are able to write large programs; they learn an extremely important thing, namely that modern programming is a team activity. No single individual can ever create a million-line code system; you require a team. And therefore, the principles you people have studied in software engineering — right in the first year, we try to establish these principles in their minds, without using the bombastic terms of software engineering. So effectively, each batch is divided into teams, each with four members. We teach them how to form a hierarchy: each team is led by a team leader, all the team leaders in that batch form some kind of a coordinating council, and they elect one of them as the batch coordinator. So a leadership is established, a hierarchy is established. Then each batch has to select a problem from amongst those which are specified. We generally permit batches to suggest problems also, if they can come up with an idea, and we accept those ideas — provided they amount to the requisite amount of effort. Typically, the effort for every course that students put in outside the classroom, in lab engagement, is about four to six hours per week. So if you are taking six courses, you are expected to spend about 24 to 36 hours of your own time per week studying for the courses, apart from the classes and so on. So we can calculate the person-hours that are available for a team. It is a 20-person batch, and the project typically runs for four to five weeks, so we know exactly how much effort is available. Now, the TAs and the teacher are generally able to judge whether the project suggested by them will be adequate or not. And this is another important thing: all projects are necessarily open-ended.
That means, if required, somebody can extend the functionality of that project to build a 100,000-line code system — that is the openness. The idea is that people should understand that in real life they are not going to solve the kind of problems that are given in exams; they are going to solve real-life problems, and in a limited time they may not be able to solve the entire problem — they will be able to solve only a portion of it which is good enough. So the batch then carries out a systematic development activity. Weekly reports are submitted, weekly diaries are submitted, and finally we evaluate them. The projects which were given last year were from the national UID project. Every Indian has to be fingerprinted, and duplicate fingerprints are to be caught, so that no duplicate ID is given — the ID must be unique. So essentially, since Nandan was busy setting up the whole national framework, the bio — what should I say — biometric standards part, that is, what kind of biometrics to use, they decided on fingerprints and iris. So we took the fingerprint part and said: you have to capture fingerprints, classify fingerprints and store them, then check for duplicates, and then use an application. So I provided fingerprint capture devices in the lab. Every batch would come and capture the fingerprints of all its students once; they would also capture fingerprints additionally, by trying to register themselves as somebody else. Then another team would take these fingerprints, classify them and store them. A third team would locate the duplicates, if any — so they had to do heavy duplicate matching. I arranged special lectures from outside experts for the first-year students on duplicate catching, on the image analysis of fingerprints and so on. So it actually became a fairly advanced course in image processing for some of the teams. But people did excellent work there. The evaluation was very unique: 25 marks were allotted to the project.
So notice: 25 marks out of a total of 100 marks for the entire course. So projects have a substantial weight in deciding the grade that they will get. 15 marks for the project report and 10 marks for self-evaluation. So each student decides: I have worked so much, I deserve 6 marks, I deserve 7 marks, I deserve 10. Nobody ever says I deserve 0. But if I have done nothing and I give myself 5 marks, then there is a peer review process, where all the team leaders and the coordinators sit together, and all the other team members sit. And when I say 5 marks, my team leader is there to say: you were sleeping the whole month, you have done nothing. So there were cases where 0 marks were awarded to students by their own team through the peer review process. We found this to be an excellent mechanism. I now want to increase the marks for the self-evaluation and peer review to 20 out of 100. See, this is something which is peculiar to an IIT system — I can do these kinds of innovations. In a university system, a teacher who does that will probably be thrown out of the college or something. I end this with a sample image of a fingerprint, just to show you what a grayscale image will look like. There are ridges here; you can see these breaks — these are branches — and then there are minutiae points. And it's a very complicated logic, very interesting logic. Those of you who are interested can look at fingerprint analysis software. There is an open-source system — a huge system, almost a million lines — developed by NIST, the National Institute of Standards and Technology in the US, for the entire image processing pipeline, including duplicate detection, classification based on minutiae, etc. And that is available for anybody to use. The Indian government is also adopting that as the base system on which different people will develop things. So we'll end here. Any questions? Yesterday I had some time late at night to peruse your assignment submissions.
So I was glad to observe that about 79 of you have submitted. I don't know how many are attending, but 79 have submitted. In some efforts, of course, I could see that you haven't put in too much effort: you're almost just copying the methodology that was given in the sample program. So you're still insisting on reading 70 students — the number should not be more than 70, etc., etc. See, at the intake level — and this is a signal that I would like to give to you, and through you to any students that you coach as TAs later — so far you are used to solving a problem as stated. The sample example says up to 70 students, so your mindset is 70. But that is not correct. If a sample says an array of 100 elements, you will invariably put in your program an array of 100 elements. Now that we have discussed a case where there are 820 students, you will grudgingly put an array of 850 or 900. But will you automatically put an array of 10,000 elements, for example? You will not. And that is because you — like all of us; when I was a student, the same thing happened with me — for four or five years you are used to solving problems with exactly stated specifications, using examples which have exact statements, and therefore you have stopped imagining, stopped looking at things in a generalized way. Now, that is the message that I want you to adopt first. So look at a problem, but don't go by the problem statement alone: apply your mind to what all the issues will be in real life. Think beyond whatever is prescribed, because that is how you will be solving real-life problems. So generalize as much as possible, extend as much as possible, and state that with comments. If you see in a sample program an array of 20 elements and for some reason you don't agree with it, you should say: I don't agree with this assumption. It is perfectly alright to counter a teacher. This is something which we do not see in ordinary institutions.
Here the students learn to counter a teacher, to ask questions, to argue, and to prove that: no sir, you are wrong, I am right. And in the process, sometimes they have to accept that they thought wrongly; sometimes the teacher has to accept that he thought wrongly. But what is the net result? Knowledge increases. Thinking increases. So I would like you to do that. I was very pleased to see several efforts — I am still going through them — which are absolutely excellent. People have written proper functions for insertion sort; somebody has done a selection sort; there are very good comments written there. So I have mixed feelings about the assignment submissions: some efforts have been extremely good, but some efforts bordered on the just-OK type. So for your own benefit — I mean, these are not marks, you are not going to get any grade or anything like that, it doesn't count toward your cumulative performance index — I would humbly suggest, for your own sake: use this opportunity to really become an absolutely top-grade programmer. You have just one more day to do that. That is why I have put the test not as a conventional written test but as actual bits of programming which you have to do on this same problem that you saw. You have the printout of the assignment; the data file has been mailed to you. If you don't receive it, I will mail it to the senior TAs — I have a list of all the orientation program TAs — so if I mail it to them, even those who are in the lab will get it, and they can then in turn mail it on. But MTech1 at CAC is the correct mailing list now for the first-year MTech students. Then what is your mailing list? Oh, you have become MTech2. But there might be a few unfortunate friends from the actual MTech2 who have not yet passed out. So are they thrown into Powai Lake, or are they concatenated with the new MTech2? I don't know how the group is handled.
You might want to ask the demigods of courses what exactly they have done with those few poor people. I know one of the students is doing his R&D project, so he is still around. Anyway, that's a separate discussion. Thank you so much. Oh, one second — the schedule remains as it is. Tomorrow there is no engagement in the department, because the institute is conducting a — I suppose they have distributed a schedule for that? Not yet? Very funny. Half of last night I was preparing for my talk tomorrow; if they cancel it, I will have to beat somebody. Anyway, most probably you will have a session tomorrow from 9 till the afternoon, and they will explain the use of Moodle in more detail. And the day after tomorrow we will meet briefly in the morning here. Thank you.