Today we are going to continue our examination of searching for a given value in an array. Yesterday we had seen how to search for a given roll number in an array of roll numbers. We said that we will start with the first element, compare the first element of the array with the given roll number. If that does not match, we go to the next element, the next element, and so on till the end of the array. If we find a match, of course we have found the given roll number, and we get out of this iteration. If we do not, we keep iterating till all values in the array are examined, at which time we come out and announce failure. We call this a linear search. The algorithm is order n: we will actually make n comparisons if we go through the entire array, but if we terminate the search as soon as we find the given roll number, then on an average we will do n by 2 comparisons. If n is large, either n or n by 2 will also be large. So imagine there are 20,000 students and you are doing a linear search: you will make on an average 10,000 comparisons, worst case 20,000 comparisons. There has to be a technique which is faster than this to search for a roll number, or any other information, in an array. I had mentioned yesterday that the way you would search such information quickly in the human processes you are familiar with is when the values are arranged in ascending order, in which case you don't go stupidly searching one after another; you go somewhere in between, look at the value, and then decide whether you have to look prior to that point or after that point. A typical example I had given was that of a dictionary search. You don't search for a word in the dictionary by going through each and every word from start to finish. You look somewhere in between, look at a word.
If the given word is alphabetically larger than that word, you search in the second half; otherwise you search in the first half. Effectively you divide the search space into two and search in either the first half or the second half. If you repeat this process, the number of words that you will have to examine, look up, compare, will be much less than n or n by 2. That is the principle of what we call binary search, and that is what we are going to look at. So, searching an element in an array; in fact you might want to jot down: searching an element in a sorted array. That means an array in which the elements are arranged in ascending or descending order. There is a special search that can be applied to such cases, which we call the binary search. Today we will also look at another problem, of manipulating multidimensional arrays, particularly two-dimensional arrays, by looking at the problem of matrix multiplication. All of you are familiar with matrix multiplication? Yes? Yes. So first let us revisit the search problem. I had mentioned that the algorithm has a great conceptual similarity with the algorithm which we discussed earlier in the course, of finding the root of an equation using the bisection method. Do you recall this diagram? There is some function and you start with two values called high and low. High is somewhere on the higher side of the x-axis, low is somewhere on the lower side of the x-axis. What you know is that the value of the function at high is, let us say, of one sign, in this case positive. The value of the function at low is negative, which confirms that if at all there is a root, the root is in between these two. In fact in this case, if the values of the function are of opposite signs, there is bound to be one real root. That is the conclusion we had drawn, because the curve has to cross the x-axis somewhere or the other. And the way we proceeded to find that root was not to examine every point of the real line from low to high.
But to examine the midpoint of low and high. Please note that the function value at the midpoint has no relevance to the function value at low or the function value at high. But it ought to be something in between, hopefully, or it will be closer to the root; that is the estimate. So when we look at the midpoint and evaluate the function value, we notice that the function value is positive. That means my root, if at all, will lie on this side, meaning that I have reduced the search space by half. Now I have to search only this space from low to mid, whereas earlier I was searching between low and high. This concept should be clear because we had solved this problem; we had written a program for it. What I am trying to show is a conceptual similarity between this approach and the binary search within an array for a given element. Effectively, in the bisection method, instead of treating this line as a real line, consider that this line is made up of points 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, etc. How many points can there be? Truly speaking, infinite points if it is a real line. However, we arbitrarily decide some end points, let's say. Observe that we are trying to lead to the similarity between this situation and the situation of a sorted array. Because array elements will be 0, 1, 2, 3, 4, 5, 6 and so on. So similarly, if I regard this real line as consisting of points, this, this, this, this, etc., and I have noticed that low is this one point and high is this one point, then as far as I am concerned the mid similarly is one point here. Now I have not shown this point exactly at the middle here because it's a real line. But you can compare this with the similar situation of an array, which we shall see in the next slide. I have just inverted that figure. You may not be able to read everything unless you twist your neck a bit. But I suppose you can see the similarity now. This is the line in which this is the low value and this is the high value.
The difference between a function and an array is that while a function has a value at every real point on the x-axis, an array has values only at discrete points. So there is a 0th element of the array, a first element, second, third, fourth, fifth, sixth and seventh. Which means there are 8 elements in this sample array. Observe that this is the list of roll numbers, which is in sorted order: 1001, 1002, etcetera. Notice that some roll numbers are missing. They are not necessarily in sequence. But whatever roll numbers are there are arranged in sorted order. Also notice that there are corresponding marks given in the data. You have handled this while reading this information in a program, so you are familiar with this kind of structure. The point I am making is, if there is a given roll R and I want to find out whether R exists within this array, instead of searching linearly, which is what we did yesterday through the search algorithm, I can actually apply a method which is conceptually similar to the bisection method of finding roots. So for example, I will consider this to be the low point, which is the equivalent of the first element, and this high point to be the last point, which is equivalent to the last element of the array. Now I have to search between these points. These are discrete points, however. If there are n students then low will be 0, high will be n minus 1. Now, the equivalent of finding the midpoint and the function value: this is the midpoint and this is the function value. I will calculate the midpoint, which will be low plus high divided by 2. Please note we are now doing integer arithmetic. So low is 0, high is 7, and therefore 7 plus 0 by 2 is what? Not 3.5, because there is nothing like a 3.5th element of an array. It is an integer division, so I will get this as 3. Observe that when I have an even number of elements, the middle element will not strictly be the middle, because at the middle there is nothing.
So it will be either one towards the low or one towards the high. It doesn't matter which one. But effectively I am now looking at the third element. Please note that 3, which is an index in this array, is equivalent to this midpoint, and this roll number 1004 is equivalent to the function value here. Just as I compare the function value there, whether it is close to 0, here I want to compare this array value with the given roll number. And here the comparison is very discrete. Either the given roll number is equal to this, or, if not, the given roll number is either larger than 1004 or smaller than 1004. If it is smaller than 1004, what do I have to do? I have to search in this part. Suppose it is 1002; then 1002 cannot be in the higher part of the array. It has to be in the lower part. So consequently, what should I do? I should keep low as 0 but I should move the high. To which point? Yes, what should be the value of high, reset for the second iteration? I am still doing iterations. If I set high to 3, I will be unnecessarily including this number also in the search. Remember I have already examined this value. Had this value been equal to the given roll, which I am assuming to be 1002, I would have got out because I would have found the value. So it is clear that at the midpoint the given roll number does not exist. That is why I have to search further. So there is no point in including this point in the search. I would rather go one less. So consequently, if this is the midpoint and if I am searching for 1002, for example, then I will set high to mid minus 1. On the other hand, suppose I am searching for 1010. Observe that 1010 does not exist in this list. But I do not know that. I have just gone to the midpoint, which is 3. I have examined the roll number 1004. Is it equal to 1010? The answer is no. So I have not found it. However, I notice that the given roll number is larger than this. 1010 is larger than 1004.
And very clearly then I have to search in this half. To search in this half I retain the value of high as 7, but I shift the value of low, which was 0, to what? Should I shift it to 3? No, I have already examined the third element. So I will shift low to mid plus 1. Now I have redefined the values of low and high, and I repeat the process. If I repeat this process, either I will find the element, or at some place low will not be smaller than high; either it will be equal, or it may even go greater, because I am adding one or subtracting one. If that happens I must terminate the search, saying I have not found anybody. However, if I locate the given roll number, I must terminate the search there itself. Is that logic clear? Let us look at this evaluation by marking out the low and high points. First I will show you the algorithm which I have written. I am assuming that I have read the array of roll numbers and marks. I start with low equal to 0, high equal to n minus 1, again assuming that n is the number of students. You agree this is the definition of low and high? Since we are going to analyze this algorithm by actually executing it for certain values, I recommend that you write down these steps of the algorithm quickly in your notebook, because you will need it in front of you. When I switch over to the handwritten page I will not have this in front of me, so you will not be able to see this. This is not a full program; this is a segment of the program. Obviously I have not included the #include <iostream>, the using namespace std, etc., or int main. Nor have I included return. So this is obviously a portion of the code. I hope you remember we had used this earlier: the break statement immediately gets you out of the while iteration. A return statement or a break statement has the impact that you get out of the iteration that you are currently in.
Since you are inside this while iteration, ordinarily you will keep iterating again and again as long as high is greater than low. However, if you locate the element, that is, the roll at mid is equal to the given roll, then you set some flag called found equal to 1 and break. That means you do not have to carry on further; you have found the element. Only if you do not find the element, that is the else: if this is not true, then if the roll at mid is greater than the given roll, you set high to mid minus 1; otherwise, you set low to mid plus 1. This is the logic. We are going to execute this logic for the given data that we had seen last time, which was there in the previous slide. I have written down here the array elements: these are the rolls and these are the marks. You have the algorithm written in front of you. What I have written here is, at the beginning of the first iteration, this is the iteration number, an artificial number. The value of low is 0. The value of high is 7. 7 plus 0 by 2 is the midpoint, which is 3. What is the next instruction that is executed after calculating mid? I compare the roll at mid with the given roll. Obviously, I must take some value of the given roll. Let us imagine that the given roll is equal to 1002. If the given roll is equal to 1002, then the midpoint, which is this, will be examined: is 1004 equal to 1002? No, it is not. So obviously, I have not found the roll number. What is the next statement in the algorithm? If the roll number at the midpoint is greater than the given roll; well, the answer is yes, 1004 is greater than 1002. If that is so, then what do I do? I recalculate high, setting it to mid minus 1. Because mid is 3, high is reset to mid minus 1; it is set to 2. The next statement again is the else; it does not apply. So obviously, you go back to the next iteration. Consequently, you start iteration number 2 with low equal to 0, high equal to 2. And now you calculate the midpoint again. What is the midpoint? 2 plus 0 by 2.
That is 1. Again, you get into the iteration. Is the roll at mid equal to the given roll? Well, the roll at mid is 1002. The given roll is 1002. You have found the fellow. So your algorithm terminates here. How many comparisons did you make in this particular case? You made one comparison with the first mid-value, and another comparison with the second mid-value. Only two comparisons for an 8-element array, where in the worst case you would have made 8 comparisons, and in the best case or on an average, 8 by 2 or 4 comparisons. So clearly, the number of comparisons has reduced. Let us execute this algorithm for some other roll number which is not present. For example, let the given roll number be 1010. Observe that roll number 1010 does not exist in this array. Let us start once again. First iteration. The value of low will be 0 as usual, this point. The value of high will be 7 as usual, this point. The value of mid will be calculated since high is greater than low; you will enter the iteration of the while, and you will calculate the midpoint as 3 again. You will examine the third element, 1004, against this value. Is it equal? No, it is not equal. Then you go to the next statement. Is the roll at mid greater than the given roll; is 1004 greater than 1010? No. So you will skip that if also. So you will come to the last else statement, where you set low to mid plus 1. Please look at your program: low to mid plus 1. What would it mean? The low value now will become 3 plus 1, which is 4. The high value will remain 7. So what are we doing? We are now searching this part. Observe that earlier we were searching the entire array. Now we are searching only half of the array. So we have reduced the search space by half. This is the conceptual similarity with the bisection method of finding roots. At this point, what is the midpoint value? 7 plus 4, 11 by 2, 5. I will now go to the second iteration with this midpoint value.
In fact, this midpoint value will be calculated when I go for the second iteration. The fact that there will be a second iteration is guaranteed because high, which is 7, is still larger than 4, which is low. My while loop says: high is larger than low, continue. So I will get into the second iteration. Calculate mid; I should actually write it here because this is done within the second iteration. Now what do I do? The same thing. I compare the fifth element with the given roll, which is 1010. The fifth element is this. Is 1008 equal to 1010? No. So I have not found the element. So I will go to the next statement, which is the else-if. Is the roll at mid greater than the given roll number? Is 1008 greater than 1010? No. That means the possible position for the given roll number is in the latter half of this array, towards the higher indices. So what should I do in that case? The else statement will be executed. Again, this time low shall be set to mid plus 1. The high remains at 7. Low becomes how much? Mid is 5, so low becomes 6. The high remains 7. I come to the third iteration. In the third iteration, what is the midpoint now? 13 divided by 2, which is 6. So I now examine this element. Is 1009 equal to 1010? No. So that if statement will not find the position. The next statement says: is the roll at mid greater than the given roll? No, 1009 is not greater than it. That means I have to still search further in the array. So consequently I will recalculate low once again. This time low will be set to mid plus 1, which is 7. High remains 7. The midpoint is 7. I examine index 7. Is the roll at 7 equal to 1010? No, it is not. Is it greater than 1010? Yes, it is. The roll at mid is greater than the given roll. So what will I do? I will set high to mid minus 1 now. What is the value of high? 7. It will be set to 7 minus 1, that is, 6. The low remains 7. Observe that low and high have become topsy-turvy.
Effectively it means that I have looked at the entire array and I have not found the element. This time when I go back for the next iteration, the iteration will not continue at all because high is no more greater than low. So I will come out. We have taken two example values: one given roll equal to 1002, which was there, and another given roll equal to 1010, which was not there. In both cases we found that this algorithm (a) correctly locates the number which is found and (b) correctly concludes when the number is not found. However, does that guarantee that my program is absolutely correct? As somebody pointed out a while ago, there are problems with this algorithm. Having looked at this analysis; this, by the way, is called testing. I am testing my program. I have given some test values, 1002 and 1010. I will now give you two minutes to test this algorithm, because you have the algorithm in front of you and you have these values. Test this algorithm for 1011; the given roll is 1011. Test this algorithm for 1001. Test this algorithm for 900. Test this algorithm for 1200. Why am I asking you to do that? You will observe that if, in general, the given roll number is actually among the array elements, there appears to be a good chance that I will locate it. In general, if the roll number is not in the array but is within the range of the first and last roll numbers, there appears to be a reasonable chance that it will work, because I have tested this through some sample values. The correct testing of any program or any algorithm requires that you examine the behavior of the algorithm for extreme values. Which are the extreme values? Typically, for array bounds like this, we are starting with the 0th element and ending with the last element; somewhere the counting may go topsy-turvy and you might fail to examine either the first element or the last element.
So you should always have a test value which will test whether the first element is found, and whether the last element is found. Similarly, you should test for a non-existent roll number which is below this range or above this range. That will tell you that in those extreme cases also your algorithm works. In general, hand-execute an algorithm of this kind where you are manipulating array indices. Note that you are not doing any manipulation with the array values; you are not adding or subtracting roll numbers, you are just comparing them. What you are manipulating is the index. Sometimes low is 0, sometimes it is 4, sometimes it is 6, sometimes it is 7, and that manipulation is happening as per some logical plan that you have for the search. I will leave it to you to examine this, but already some students have pointed out that this algorithm could run into trouble. I would like to draw your attention to the fact that it does not appear that this algorithm is excessively smarter or more efficient than the earlier algorithm that we had. Consider this; once again we go back to this analysis. How many comparisons did we make in this case? 1, 2, 3, 4 comparisons. On an average I would have made 8 by 2 or 4 comparisons in any case. So big deal; what is the advantage? The true advantage can be sensed if you imagine that the array has not 8 elements but 800 elements, or 8,000 elements, or 80,000 elements. In general, the order of the previous algorithm, which was searching through all the values in the array, the time complexity of that algorithm, was order n. In absolute terms it required either n or n by 2 comparisons. So if n was 8, in the worst case it required 8 comparisons, on an average 4. If n was 80,000, you would have required 80,000 comparisons in the worst case, on an average 40,000 comparisons.
How many comparisons will be required by this algorithm if the number of elements is 80,000? To calculate this number, let us revisit what is happening in this algorithm. Remember, the algorithm is reducing the search space by half. Originally it was searching all 80,000 elements. The first time I calculate mid, I reduce the search space to only 40,000 elements, either in this half or in that half. So consequently I am searching only half the original number of elements. Every time I go through the iteration, this search space reduces by a further half. So in the first iteration I come from 80,000 to 40,000. In the second iteration I come from 40,000 to 20,000. In the third iteration I come from 20,000 to 10,000. Fourth iteration, 10,000 to 5,000. Fifth iteration, 5,000 to 2,500, etc. Can you guess what is the likely number of comparisons that you would make in this algorithm in the worst case? To get that, the best way is to take n equal to some power of 2. So let us take n equal to 128. At the end of iteration 1 the search space, I will call it sp, is reduced to 64 elements. You agree? Either I will be searching the upper 64 or the lower 64 elements. I have divided the range in half. At iteration number 2 this search space will be limited to 32 elements. At iteration number 3, 16 elements. At iteration number 4, 8 elements. At iteration number 5, 4 elements. At iteration number 6, 2 elements. And at most I might have to do a seventh iteration where I compare with only one number, or get out. What is the relationship of the number of iterations, iter, with n? It is useful to consider 128 as equal to 2 to the power? How much? Yes? 7. What is the log of n to the base 2? Agreed? Log of n to the base 2 is 7.
In fact, whenever you reduce the search space successively by half, half, half, half, you are having a logarithmic reduction. Since it is half, it is log to the base 2. That means if the total number of elements is 128, at most you will require 7 iterations, or 7 comparisons. Yes. That's right. So for 128 elements, he says, you will require 8 iterations. In general then, if I were to write a formula for the number of comparisons required, this will be equal to some constant k1 times log n to the base 2, plus k2. What you are saying is that in this case k2 will be 1 and k1 will be 1. So in the worst case you will require log n to the base 2 plus 1 comparisons. Very good. In the micro view you are saying not 7 but 8. But please understand the importance of this algorithm. It is not whether it is 7 or 8; it is much smaller than 128 or 64. So if you have 1 million elements, the total number of comparisons will not be 1 million or 1 million by 2 as in the linear search, but log of 1 million to the base 2, plus or minus 1, which does not matter. And therefore, in the macro view of algorithmic complexity, we shall say that this binary search algorithm is of the order of log n to the base 2. And this is clearly much more efficient than any algorithm which is order n. Do you agree? And that is the point we are making. That is why we said: look for algorithms which reduce the number of computations done by an order of magnitude. The order need not be a polynomial order; it could be a logarithm instead. This reduction cannot be matched by any brilliance that you might show on the linear search, like stopping after you have found the element or something of that sort. Note that this requires the array to be sorted. We will conclude this discussion here.
But please remember always that you should be on the lookout for algorithms which are more efficient, preferably by an order of magnitude, than any other known algorithm. If that is impossible, then within the same order of algorithm you should try to reduce the coefficients appropriately so that the number of computational operations is minimized. Please also note one more point. In this particular case we are talking about comparisons. But we are assuming that all roll numbers have been read into the array. Is that right? We have read all these roll numbers into the array here. What does that assume? It assumes that there is enough space in Dumbo's memory to allocate locations for a very large array. Well, we do not know the exact hardware constraints that Dumbo may have; as of now we have not discussed that feature. But surely you would agree that the number of locations is not infinite. Dumbo has a finite amount of memory. And in actual practice the amount of memory that Dumbo's cupboard has, which we call the random access memory, is truly limited. Although today it is much more than what it used to be in the 1960s or 1970s, it is still limited. For example, you might have heard terms like two gigabytes, four gigabytes. This is the kind of memory in bytes that modern computers have. A gigabyte is roughly a billion bytes. And four bytes are required to contain a numerical value. But consider problems of much larger magnitude. Imagine the following problem. Instead of roll numbers, I have the unique national ID number that the government is planning to give to every Indian citizen. And since Nandan Nilekani is after it, you will get it within one and a half years. How many such numbers? 102 crore numbers. Obviously you are not just going to store those numbers. For each number you are going to store information about that person: could be name, address, date of birth.
Let us assume that roughly 300 characters worth of information, in text and numbers, is stored for every number, and let's say that includes the unique ID also. We are talking about 300 characters worth of information for every ID. Why do I have to search? Imagine that somebody wants to masquerade as DB Fatad. And he goes to a shop and says, give me that 500-rupee book; this is my credit card; my name is Fatad. Now the shopkeeper would like to cross-check some more details. Suppose he says: oh, you are Fatad; what is your date of birth? Since that thief does not know my date of birth, he will say some date of birth, he will say some address. Wouldn't the shopkeeper like to use his computer to search the national database for that ID and find the detailed information, so that he can compare what is claimed there? Somebody in India would like to give a given ID to a back-end system which stores all this information and quickly retrieve the one record pertaining to that ID which contains all those 300 bytes of information. 100 crore people multiplied by 300 bytes is so many crores of pieces of information, which cannot be accommodated in a computer's memory at all. What it means is that even these search algorithms will be of no consequence, because we simply cannot read all the data into an array, even though we maintain the national IDs in ascending order. Consequently, these elements will have to be stored on what we call a disk, where you store your files, which has much larger capacity than the computer's memory. You can never read all that data into memory, so you will have to make the search on the disk. And if you recall, reading something from a disk, as compared to comparing something in an array, is at least 1000 times costlier. Imagine if you do a linear search at 1000 times the cost of one comparison, and do it for 102 crore entries. The merchant will find that this fellow is not Fatad after about 8 years or something. Not worth it. He has to find it in split seconds.
That is a challenge. Later on in this course we shall discuss how we handle situations where the data is so large that it cannot fit into an array. But to begin with, we must understand how to handle data effectively and efficiently when the data can fit into memory. To fit it into memory we have structures of data. The simplest structuring is variables, in which we allocate values, read values, write values and compare values. The next, so to say, complex data structure is an array, which is a list of values; that is what we have studied. Arrays can be one-dimensional or two-dimensional. However, to handle complex data in the world, and I don't mean complex numbers but complex data, where you have names, addresses, where an address may have a street number, city and pin code, where you have such a multiplicity of pieces of information, you need additional abstractions to manage this data. Later on in this course we shall discuss the notion of a structure, and how you can handle structured data, where a group of elements is put together and associated with one person. There are very interesting and esoteric abstractions such as linked lists, queues and stacks, where abstractions of that kind, and the powerful computational mechanisms that we create for those abstractions, will be extremely useful in solving some peculiar problems. We can't, of course, discuss all of that in the first course, but I'm just trying to tell you it's an extremely interesting development that has happened, where computer programming languages permit us to represent simple and complex data in a form which is amenable to efficient algorithms. The second point I would like to make before going further is that, just as I talked about big data, where the data may consume all the memory that you have or may overflow that memory, in exactly the same way the programs that you write will become bigger and bigger and bigger.
What is the largest size of program that you have seen so far in your labs? Observe that in one of the lab assignments I had asked you to count the number of lines that you have in the program. Most of the programs that you have in the assignments are one or two pages long, which is, let us say, 20, 30, 40 lines. Because many of you are beginners, even a 40-line program appears to be a very large program. In programming parlance that is considered a trivial program. A non-trivial program must be at least 500 lines, which is called a small program. A medium-size program is typically 2000 to 5000 lines long. It is obviously not a single program. As we have already seen, we do not write monolithic programs; we break the program into modules, which we call functions, which perform different tasks, and such a collection of functions, when invoked through a main program, gives us the total program. When you do your project, which will be announced after the mid-sem, each group will be expected to actually compose, write, test and work with at least 2000 to 5000 lines of program. Today it might appear very difficult for you, when you are struggling with simple syntax, where to put braces, where to put less-than, equal-equal, whatever. But very soon you will get accustomed to doing that. In general, then, I would like to tell you that the main objective of this course is to teach you programming concepts, but the other important side objective is also to expose you to large programs and the ability to handle large data, because that is what you will be doing in your professional life. Not necessarily with programs which only you have written; you may be using programs which others have written, such as libraries.