Okay, we will get started. Today's plan is to continue with matrices and matrix algorithms, specifically of a numerical flavor. Before we do that, let's quickly recap a little bit. By the way, the midterm syllabus will not include any kind of sophisticated numerical algorithms on matrices; as long as you know how to declare a matrix, write into a cell using one or two indices, and retrieve a value from a cell using, again, one or two indices, that will be enough. The specific algorithms I am describing today, nothing like that is included in the midterm.

So this is how you declare a native matrix with two dimensions, and we already said that logically you need the row index to be between 0 (inclusive) and rows (exclusive), and the column index to be between 0 (inclusive) and columns (exclusive); if you violate that rule you might be accessing weird memory locations with no idea of what value you are supposed to get. Internally the C/C++ compiler makes no distinction between a 2D matrix which is 3 by 5 and a 1D matrix which has 15 elements; they both have 15 elements. In the case of the 1D matrix, or vector, the elements are numbered 0 through 14. So the only facility C and C++ will give you is this: given the allocation statement, where rows and columns are actually instantiated as numbers, it will use that to calculate the one-dimensional cell number corresponding to a two-dimensional cell number. Since I know the number of columns in the matrix, I can calculate the cell number of (Rx, Cx) as Rx times columns plus Cx, and we saw that this corresponds to a row-wise unrolling of the elements of the matrix. So element (0, 0) starts at byte 0 and is called cell 0, and so on right through the first row, then the second row, then the third row, up to the last element, which is cell number 14 but bytes 56 through 59 (this is for an int matrix where each cell takes exactly 4 bytes).

You saw a demo of this, and you also saw that you can create illegal indices which walk outside the legal area of cells that you have allocated; but as long as the C/C++ compiler can dumbly calculate a cell number, it will just compute it and access it. Even if it is outside the 15 cells you can access it, as you already saw in the case of one-dimensional vectors, and if it is inside, there is of course no further protection against getting nonsense values.

Then you looked at some very basic operations on matrices. You can initialize a square matrix to the identity by assigning the diagonal to 1 and the off-diagonals to 0. That changes very slightly if you want to initialize an upper triangular matrix with 1's, in which case you just test that the row is less than or equal to the column and then assign 1; that is the light gray area. And if Rx is greater than Cx then those cells are 0's, as represented by the dark gray area. Those are very simple things to do. Generally speaking you will read large matrices from data files; you will usually not enter them by hand. But today, because we will see very small examples, I will set elements to specific values inside the code itself.
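To make the row-major layout concrete, here is a minimal C++ sketch of the unrolling just described; the 3-by-5 size matches the example above, and the printed cells are just for illustration:

```cpp
#include <cstdio>

int main() {
    const int rows = 3, cols = 5;
    int a[rows][cols];                  // a native 2D matrix: 15 ints in all

    // Cell (Rx, Cx) lives at 1D cell number Rx*cols + Cx (row-wise unrolling).
    for (int rx = 0; rx < rows; ++rx)
        for (int cx = 0; cx < cols; ++cx)
            a[rx][cx] = rx * cols + cx; // store each cell's own cell number

    int *flat = &a[0][0];               // the same 15 cells viewed as a 1D vector
    printf("a[2][4] = %d, flat[%d] = %d\n",
           a[2][4], 2 * cols + 4, flat[2 * cols + 4]);
    // The last cell is number 14; as 4-byte ints it occupies bytes 56 through 59.
    return 0;
}
```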
How do you transpose a square matrix? The idea is to not swap twice: do not go below the diagonal; stay above the diagonal and interchange two elements across the diagonal at a time. So here is the first shot: the 3 by 3 array has elements 1 through 9 as listed. First you go to the cells colored yellow, 2 and 4, and you interchange those; then you look at the cells 3 and 7 and interchange those; and finally you look at the cells 6 and 8 and interchange those. So you see that you only needed to look at above-diagonal elements and swap them with their mirror images across the diagonal. What is the mirror image? The mirror image of cell (Rx, Cx) is the cell (Cx, Rx); that reflects it in the diagonal where Rx equals Cx. And there is no reason to try to swap the diagonal itself, because nothing will change; that is why we have strictly a plus 1 here. So there is no need to swap anything on the diagonal. That is transpose, easy enough.

Then matrix-vector multiplication, which all of you have done by hand, and by now the index arithmetic should be fairly clear. If A has rows rows and cols columns, and we are calculating y = Ax, then the sizes are: this is rows, that is cols, this is A; x will have to have cols elements, and the answer you get is y, which again has rows elements. That is the sizes and proportions, and the code itself is extremely simple. The outer loop is on Rx; once you lock a particular row, you are also going to write into y[Rx]. Within that, we first set y[Rx] equal to 0, and then we walk Cx simultaneously across the columns of A and the rows of x: A[Rx][Cx] times x[Cx], multiplied and then accumulated onto y[Rx]. That is the copybook definition of matrix-vector multiplication, and there is really no need to do it any smarter; this is fine, this is the best you can do.

Now of course this is true for dense matrices and vectors. In case the matrix or the vector becomes sparse, things get more interesting, and we will see many occasions where we will need very vast, very sparse matrices. For example, here is one matrix that could be very interesting to us. Suppose people in, say, Facebook are numbered 0 through n minus 1 along this axis and also 0 through n minus 1 along that axis, and if person i is friends with person j then I set that cell to 1; cells where i and j are not friends are set to 0. This is a very nice logical representation of the friendship graph, but because most people are not friends with most people, this matrix has a vast majority of 0s and very few 1s. So you would of course not want to hold it in a dense format, but now it is complicated by the fact that there are 2 dimensions. One way to store it would be to record, for each entry, the i, the j, and the value: you can have three parallel arrays, one giving the row index, another giving the column index, and then a value array giving the value. That would be a standard sparse representation, and if you had that, then matrix-vector multiplies would be much more interesting. We will talk about that perhaps after the midterm, but for today at least we are dealing with entirely dense matrices, and we will probably continue that for another lecture.

The matrix-vector multiplication animation just follows that chalked-up diagram there: I look at a[0][0] and multiply it by x[0], a[0][1] multiplied by x[1], a[0][2] multiplied by x[2], and that gives me y[0], and so on.

Now, then I started talking about matrices as transformations, and some of you may already have known about this, but just to bring everyone up to the same point: in the 2D plane, with a 2 by 2 matrix, you can take a 2-element vector like x, which is (x0, x1), and multiply it by A to find y.
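Before moving on to transformations, here is a minimal plain-C++ sketch of the two loops just described; the fixed size N and the function names are my choices:

```cpp
const int N = 3;

// In-place transpose: visit only strictly above-diagonal cells (note the rx+1,
// which skips the diagonal) and swap each with its mirror image (cx, rx).
void transpose(double a[N][N]) {
    for (int rx = 0; rx < N; ++rx)
        for (int cx = rx + 1; cx < N; ++cx) {
            double t = a[rx][cx];
            a[rx][cx] = a[cx][rx];
            a[cx][rx] = t;
        }
}

// y = A x, the copybook definition: lock a row Rx, then walk Cx across the
// columns of A and the rows of x in lockstep, accumulating onto y[Rx].
void matvec(const double a[N][N], const double x[N], double y[N]) {
    for (int rx = 0; rx < N; ++rx) {
        y[rx] = 0;
        for (int cx = 0; cx < N; ++cx)
            y[rx] += a[rx][cx] * x[cx];
    }
}
```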
Now, it is a little confusing for those of us who are used to the x axis being x and the y axis being y; this is not the same thing. Each of x and y has 2 dimensions, and the axes are axis 0 and axis 1. So I apologize for the slightly confusing letters used; y is not the y axis. I start with a point (x0, x1) which is plotted right here: the horizontal coordinate of that point is x0 and the vertical coordinate is x1. Then I multiply it by a 2 by 2 matrix, and that amounts to transforming this vector x, from the origin to this point, into a different, new image vector y, which is (y0, y1). In this case y0 = 2 x0 + 1 x1 and y1 = 1 x0 + 3 x1. So I do this mapping and I find the vector y joining the origin to the tip of (y0, y1), and that is a transformation.

Now if you remember, we have already seen things like these when we talked about people migrating from city to city. Effectively we were saying that y0 is the next population of Mumbai and y1 is the next population of Pune, and that is some linear multiple of their old populations. So these things are all over the place. In particular, in electrical engineering I could make up a linear circuit with resistors like this, and consider every node in it. How many of you have heard about Kirchhoff's current law? Charge cannot accumulate at a node; it has to go somewhere, so the total current flowing into a node has to equal the total current exiting it. Let us say the potential of node i is V(i) and the potential here is V(j); then the current here is (V(i) minus V(j)) divided by R(i, j). That is another kind of linear relationship, because I can now write linear equations across all the V's and try to solve them. These sorts of things are all over engineering; that is why eigensystems are so important.

So we saw that if we choose (x0, x1) to always lie on this locus, which is the unit circle around the origin, then the transformed points y fall, in general, on an ellipse; we map this circle to an ellipse like this. Observe one thing: the pre-images are on the circle, the post-images are on an ellipse, and the important thing is that there are two directions which are not changed by the transformation A, as we already saw: that is the major axis and that is the minor axis. One other thing to note quickly is that near the minor axis the rays are diverging away from the minor axis, while near the major axis they are sort of converging onto the major axis. That is a very critical difference; it is why it is so easy to find the first eigenvector and first eigenvalue, but you need more complicated procedures to find the lower-order eigenvalues and eigenvectors. We will discuss that in more detail now. So this is fine: circle goes to ellipse, the ellipse has of course two axes, and these two axes are stretched but not turned by the transformation A. Let us say these two directions that are stretched but not turned are called u0 and u1; those are the red arrows on the screen.
Now observe that if u0 is not turned then neither is minus u0, so throughout this discussion, whenever I say find u0, your numerical procedure may well find minus u0, the opposite direction along the same axis; the direction is the same, only the sign differs. The lengths of the semi-major and semi-minor axes are called the eigenvalues, because the unit vector which finishes on the circle manages to get stretched into a longer vector here; in general it could also get shorter, since the ellipse might be thin enough to fall inside the circle at some points. Lambda 0 is the length from here to there, so it is the ratio by which a unit vector got stretched; similarly lambda 1 is the length from here to there. Certain matrices of course have lambda 0 equal to lambda 1, in particular the identity matrix, but in general lambda 0 and lambda 1 are different, and by convention lambda 0 is the larger one.

Now u0 and u1 are also perpendicular to each other, because by definition the major and minor axes of an ellipse are perpendicular, and let us say they also have unit norm. So in fact u0 is the unit vector going from here to there, and lambda 0 u0 is its stretched image; u1 is the unit vector going from here to the circle, and lambda 1 u1 is the one that goes up to the ellipse. The two red arrows of unit length going up to the circle are now like a new axis system: they are perpendicular to each other and they have unit norm, so we can express any point given in the earlier coordinate system as a point in the new coordinate system.

Suppose I take any old vector, say one finishing here where the cursor is. That has a coordinate in the original system, but I can also take its projections onto the two red axes and calculate a new coordinate with regard to u0 and u1. Because the two red axes are perpendicular to each other, I can express any point in the new coordinate system. Now let us say that in general I pick an initial vector x which has coordinates 1 and c in this new system, for some arbitrary constant c. By changing c I can point x in pretty much any direction, and it will eventually turn out that it does not matter where we start x off anyway; so feel free to choose any c. (The one assumption hidden here is that x has a nonzero component along u0, so that it can be scaled to have coefficient 1 on u0.) So basically we start off with some x which is proportional to u0 plus c u1. Because I cannot draw easily on the LCD screen, and I should not: here is the origin, here is the circle, here was the new coordinate system u0, u1. What I am saying is, let us pick any vector like so, whose projections on u0 and u1 are 1 and c respectively. By changing c I can turn the vector x around as I wish, so there is no loss of generality.

Now let us see what happens if I take this x and keep multiplying it by A. Power iteration finds the largest eigenvalue, lambda 0, and the corresponding eigenvector, u0, of the matrix A by initializing x arbitrarily, for example to u0 plus c u1, and then repeating two steps: find y equal to A x, which is a matrix-vector multiply, and then overwrite x with y divided by the norm of y. Let us say, for the sake of argument, that before starting out I normalize x to unit norm, so x always remains unit norm; like here, after picking this general direction, this is x: I can always drop it back to the unit circle, and take that up to the blue vector, so that is my first x.
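In symbols, one round of power iteration is:

$$x^{(0)} \text{ arbitrary (unit norm)}; \qquad y^{(k)} = A\,x^{(k-1)}, \qquad x^{(k)} = \frac{y^{(k)}}{\lVert y^{(k)} \rVert_2}, \qquad k = 1, 2, \ldots$$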
So in general what I am going to do is start from some arbitrary vector x, which I will call x super 0, and then pull it out to the ellipse, drop it back to the circle, pull it out to the ellipse again, and keep doing that, as we saw last time, until I end up on the major axis of the ellipse. Why does this work? We already saw a demo using Scilab, but today I will show an implementation in C++. Let us call the sequence of values of x and y: x super 0, y super 1, x super 1, y super 2, and so on. We use superscripts for this because the elements of x and y will use the subscripts.

So the setup so far is clear, the notation is set up: I have an initial vector x which can be expressed as (1, c) in the u0, u1 coordinate system, and I keep multiplying it by A. This is what happens when we do that; we will go line by line carefully. Remember I said I will initially drop x super 0 down to the unit circle, which means even if I was given the original vector as u0 plus c u1, I have to normalize it to unit norm. Now u0 has unit norm, u1 has unit norm, and they are perpendicular, therefore the norm of the numerator is the square root of 1 plus c squared. I have to divide by that, and now x super 0 has unit norm.

At this point let us find y super 1, which is A x super 0. If you just multiply through, it is easy to see that multiplying u0 by A gives just lambda 0 u0, and multiplying u1 by A gives just lambda 1 u1. What I have done is pull the lambda 0 outside, to leave u0 plus c (lambda 1 over lambda 0) u1 inside, and the denominator so far is the same. What is the norm of y super 1? You can pull a constant out of a norm at any time; what remains inside has norm equal to the square root of 1 plus c squared times (lambda 1 over lambda 0) squared, because u0 and u1 have unit norm and are perpendicular. So that is the norm of y super 1. Now for the next step, when I replace x super 0 by x super 1, I get y super 1 divided by the norm of y super 1. Various things cancel out: the square root of 1 plus c squared cancels, the lambda 0 cancels, and I am left with u0 plus c (lambda 1 over lambda 0) u1, divided by this big square root here.

And now you see the pattern, and you can repeat this iteratively a few times until you see it in general: after k steps, where k is greater than or equal to 1, x super k is going to be u0 plus c (lambda 1 over lambda 0) to the power k times u1, divided by its norm, which is the square root of 1 plus c squared times (lambda 1 over lambda 0) to the power 2k. Everyone comfortable with this? Now what happens when k goes to infinity? Remember that by definition lambda 0 is the bigger guy and lambda 1 is smaller, so lambda 1 over lambda 0 is a strict fraction. Therefore as k grows without bound, the second part just vanishes and we are left with u0. In other words, if I start off with any vector which is some linear combination of the first and second eigenvectors, and I keep multiplying by A and normalizing, the coefficient of the second eigenvector will diminish to 0 and only the first eigenvector will survive. That is the reason why power iteration works. Now as you can see, this sort of algorithm will no longer suffice to find the second eigenvector; we will have to do something special there.
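Collecting the derivation above into one formula:

$$x^{(k)} = \frac{u_0 + c\left(\lambda_1/\lambda_0\right)^{k} u_1}{\sqrt{1 + c^2 \left(\lambda_1/\lambda_0\right)^{2k}}} \;\longrightarrow\; u_0 \quad \text{as } k \to \infty, \qquad \text{because } \left|\lambda_1/\lambda_0\right| < 1.$$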
I also noted that this is one occasion where the limitations of finite precision arithmetic will actually save you from wandering around the various axes of the ellipse. What will happen is this: in the picture it is already somewhat intuitive that near the minor axis the blue rays are diverging away from the red line, whereas near the major axis they are kind of coinciding. So there is actually a stabilizing effect: if you are close to the major axis, you will not move far away. Whereas even if you initially line up x exactly along the second eigenvector, on the minor axis, after multiplying by A there will be some inaccuracy, as you have already seen; that inaccuracy will move you slightly away from the red ray, and then the blue rays will take over and move you away from the minor axis.

Now in general, for matrices of larger sizes: if I have a 5 by 5 matrix, then instead of an ellipse I am going to have an ellipsoid, something like an American football, with 5 axes turned around in that space. But it has one major axis, and all the other axes are smaller than it, so all the lower-order eigenvector components will die away in the same way; it is very easy to extend the argument to more dimensions. The only messy part is if there are two dominant eigenvalues with exactly the same value, but you can read up in any linear algebra book about how to deal with that; we are not going to tackle it in this class. This is, after all, a programming class.

So how do I implement this? I have to stop somewhere. We stop when x super k minus 1 and x super k are sufficiently close to each other, and one way to check that is to set up some threshold epsilon and say that when the L2 norm of the difference between the last x and this x drops below epsilon, I will quit. For you to explore: how do you find u1 and lambda 1, and, in the case of larger matrices, how do you find, say, u2 and lambda 2, u3 and lambda 3? We will come back to that when we do more linear algebra after the midsems, but you can try to think about it.

So let us write Boost code for power iteration. The building blocks we will need: of course, matrix and vector, and we have seen some of that already; it is very easy to declare a matrix and a vector and fill them with values. We also need vector difference, so that we can calculate x super k minus 1 minus x super k; we need the norm; and we need matrix-vector multiplication. Those are all provided by the Boost uBLAS library, and the code will be extremely short, in fact almost as short as the conceptual pseudocode we have already outlined. This is actually running Boost code. So: an infinite loop; I already have A and x initialized outside, filled with values. Now I say vector of double y is the product of A and x. Then I say vector of double newx is y divided by norm_2 of y. Boost already defines a function called prod and a function called norm_2. But observe that y slash norm_2(y) means something different from the usual slash. Usually slash is for dividing an integer by an integer, a short by a short, a float by a float, or a double by a double; here it seems like I am dividing a vector by a double, since norm_2(y) returns a double. Boost overloads slash so that if you apply it to matrices or vectors in some meaningful way, like dividing a vector by a constant number, it can do that silently; inside, it rolls out into a lot of code. So how do you divide a vector by a constant?
Well, you divide every element by that constant. I think by now everyone should be able to write that code, but it is already implemented for you. Then what I do is: I have newx, I have x, so I can subtract them. Observe again: this subtraction is not subtracting two doubles or two floats. Boost defines a new minus which can deal with arrays or vectors of the same size on the two sides, and it does element-wise subtraction. Similarly with element-wise addition: if the sizes are conformant, then sums and differences of matrices are well defined, and Boost will do it for you. On top of that I can now apply norm_2, and if that drops below epsilon, which you have defined somewhere, then you break out of the loop; otherwise you set x to newx and continue. As on many other occasions, there is no reason to store all the historical x's; I just need the previous one and the next one, to compare whether I am close enough, and then I quit. So extremely simple code calculates eigenvectors and eigenvalues for you.

Let's see a demo of that. The other mumbo jumbo in front is just to get x set up; there is nothing real going on. I define my epsilon to be, say, 10 to the power minus 5. A is a matrix of doubles of size 2 by 2, and I initialize it to the same numbers used to produce those pictures: 2, 1, 1, 3. Now I create this vector of doubles, and I don't really care how I initialize it. All you know is that for any reasonable initialization of x there exists a value c such that x can be expressed as a combination of u0 and u1 with coefficients 1 and c; that is for the sake of analysis only. To run the algorithm you don't need to know what c is; it is just to prove that convergence will happen. So this is an example of a piece of code where convergence is non-trivial to prove. Convergence doesn't happen because you are running through indices and you run out of an index, or you reach the end of an array, or you have finished merging two arrays. Convergence happens because of certain mathematical properties of eigenvalues: you have to analyze that and convince yourself that you will exit the loop at some point, for any positive epsilon.

So now I start iterating, and here I actually record the iteration number, just so I can print things out. (Whether x is a row or a column depends on how you interpret it: when you take a product with a matrix on the left, it will be interpreted as a column vector, so it can be right-multiplied by a matrix. I haven't tried left-multiplying it, so you can try that out and see whether it throws an error or not.) Inside the loop it's the same as what was on the slide; the only additional baggage is printing things out, otherwise nothing has changed. I produce y, which is A times x. I produce newx, which is y divided by the norm of y, and I calculate the error: error is norm_2 of newx minus x. Then I print out the iteration number; the current vector (let's see, maybe I should print out newx instead, because that's closer to a normalized eigenvector); the eigenvalue estimate, which is the ratio of the new length to the old length, the old length being 1; and the error, which is how much I have changed from the previous estimate of x. And then, if error is less than epsilon, I break; otherwise I set x to newx and continue the loop. Very, very simple code.
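Here is a minimal sketch of that demo, assuming Boost uBLAS; the loop body follows the description above, while the exact variable names and print format are mine:

```cpp
#include <iostream>
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <boost/numeric/ublas/io.hpp>

using namespace boost::numeric::ublas;

int main() {
    const double epsilon = 1e-5;

    matrix<double> A(2, 2);          // the matrix used to produce the pictures
    A(0, 0) = 2; A(0, 1) = 1;
    A(1, 0) = 1; A(1, 1) = 3;

    vector<double> x(2);             // any reasonable initialization works
    x(0) = 1; x(1) = 0;
    x = x / norm_2(x);               // start on the unit circle

    for (int iter = 0; ; ++iter) {
        vector<double> y = prod(A, x);          // pull x out to the ellipse
        vector<double> newx = y / norm_2(y);    // drop back to the unit circle
        double err = norm_2(newx - x);
        std::cout << iter << "  x = " << newx
                  << "  lambda ~ " << norm_2(y) // stretch of a unit vector
                  << "  err = " << err << "\n";
        if (err < epsilon) break;               // close enough: quit
        x = newx;
    }
    return 0;
}
```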
So most of the complexity has been hidden inside the Boost libraries, and in fact after midsems, when you start declaring your own functions more extensively, you will see that this is the only way to code. You don't want to unfold your code in all its detail in one place, because reading it becomes unmanageable. You want to write modules, such as taking a norm or multiplying a matrix and a vector; you give each one a name, and not only can you reuse it from multiple places without repeating the same code again and again, it also makes the code much easier to read. This is much easier to read than if I rolled out all the loops and made it four levels deep and so on.

(A student asks whether one can simply test for the error being zero, or "not error".) Technically, error will never be equal to zero. And error is a double, so "not error" is not directly defined; you can only apply not to integers in C and C++, and even that is like a blight, so don't do it. All I need to say is that the error is small. This floating point number, the double error, will shrink towards zero, but actually squeezing out the last bit to get exactly zero may take a long time, and may never happen at all. The reason is that you could have x super k minus 1 and x super k oscillate around the value required to give you exactly zero, and then x super k plus 1 will be x super k minus 1 again. You will do a little alternation just because of finite precision; you will never reach that x star which gives exactly zero error. So you definitely need this epsilon: otherwise you may oscillate at the end, never reach zero, and never exit the loop. That's an important question; because of finite precision it may well happen that you cycle at the end, so it's better to have this epsilon and quit on that.

So this code is clear, right? In fact I'm done in only 11 iterations. I started off with this vector, which was 0.3, 0.9-ish; the eigenvalue estimate was 3.16 and the error was 0.3, which is quite large. In iteration one I updated my point on the circle to 0.44, 0.89 (you can verify that all these points lie on the unit circle), the eigenvalue estimate was updated to 3.53, and the error got smaller. You can see the error decreasing quite drastically, and finally around iteration 11, between 10 and 11, the vector hardly changes: it's 0.5, 0.8, and that's the eigenvector. The eigenvalue is 3.61, and the error has come down to about 9.6 times 10 to the power minus 6. So this convergence is quite fast. In fact, for those of you interested in numerical linear algebra, you can try to find theorems which describe, as you go down iteration by iteration, at what rate the number of leading zeros after the decimal point increases. There's actually a very reliable connection between the number of iterations and the number of zeros; I don't exactly remember what it is, but it's monotonically increasing. It also depends on the ratio of lambda 1 to lambda 0: if lambda 1 and lambda 0 are very close together, you're going to have slow convergence, and if lambda 1 is much smaller than lambda 0, you're going to have fast convergence; we saw that already from the formula.

The other comment I need to make is that being that vulnerable to the ratio of lambda 1 and lambda 0 is not always a nice thing. For example, Google needs to compute the eigenvector of the web graph, by which I mean the following. Look at the matrix I drew here: these two people are friends, other people are not friends, and so on.
There's a similar graph that Google needs to mine, which is the web graph of links: what page links to what page? It turns out that a very important signal in Google's ranking algorithm is the dominant eigenvector of that matrix. So Google needs to compute the dominant eigenvector of a matrix that is 20 billion by 20 billion, whereas we are starting with 2 by 2. Once you start computing the eigenvector of a 20 billion by 20 billion matrix, which is very, very sparse (every row and column typically has only 30 to 40 non-zeros; each page links to about 30 pages on average), you realize that you want to save every cycle you can. And there are algorithms to speed up the rate of convergence beyond the elementary one I have talked about; people have done whole PhD theses on this. I can point out at least 10 PhDs on how to speed up eigenvector convergence at that scale. So that's another very intriguing area of numerical analysis, with enormous impact on current web companies and many other areas of data mining.

So let's try a small tweak on the code. We'll remember the last couple of values in a buffer, and then we'll change the initialization a little bit and see what happens. That was the old initialization; now let's pick some other x. I'll pick minus 1 and minus 1, so the initial vector points in some other direction. Let's recompile and run the code. Now what happens? The vector now is both negatives, but the eigenvalue is the same, and I've converged slightly earlier, by one iteration. If I compare this to what I had last time, I get exactly, or almost exactly, the negative of the old vector, but the ratio is the same. This corresponds exactly to snapping to the opposite end of the dominant eigen-axis: if I start closer to one side, I'll snap to that end; if I start closer to the other side, I'll snap to the other end of the ellipse. So there is no distinction between finding u0 and minus u0: the axis is the same, and the ratio, the semi-major axis length, should also be the same.

Any questions on finding eigenvectors and eigenvalues? Now some of you might be thinking at this point that I kind of rigged the example: I picked a value of A which gave, in particular, real eigenvalues. Not all matrices have real eigenvalues or real eigenvectors; they could be complex. So if you are expecting your matrix to have complex eigenvalues or eigenvectors, then you have to start it off with complex numbers. We'll discuss that after the midterms; we haven't even talked about complex numbers and what to do with them, so we'll do a little bit of that once we come back after the exams. Some of you might know how the eigenvalues look when plotted on the complex plane: complex eigenvalues will come in conjugate pairs. Some eigenvalues will be on the real line, and those can come singly, but any complex eigenvalue has to come in a conjugate pair. We'll talk more about this later on.

So, any questions about eigenvalues and eigenvectors? Of course, if you increase the size of the matrix, you'll get that many more eigenvalues and eigenvectors; the algorithm will still give you the top one, lambda 0 and u0. Yes? (A student asks how small epsilon should be.) No, it's application dependent; you have to decide what accuracy you need. And you have to define epsilon in the code, otherwise it'll be an undeclared variable.
It entirely depends on what precision your application needs. Okay, so to continue with more matrix operations: the next is matrix-matrix multiplication. This looks busy and clumsy, but if you know how to multiply a matrix and a vector, you already know how to multiply two matrices; it's just that the second vector became a stack of vectors. That's all. So there are three matrices here; let me draw the proportions again. The point is I can't hold both on screen together, so the board is better. The first matrix, a, is of size l by m. (It's generally a bad idea to give variables single-letter names in code, because they're harder to search for: you search for that single letter and you find it all over the place, not the variable. So it's a good idea to have multi-letter variables; I call them lsize and msize in the code, but on the board I'll show each as just one symbol.) So a is l rows by m columns; to multiply with it, b has to have m rows and, say, n columns; and that results in a matrix c, which of course has size l by n. That's the geometry.

Now the code proceeds by three nested loops. You should think of the two outer loops as targeting c; that's why their index names start with c. You basically say: I want to fill up cell (crx, ccx) of c, where crx is the row index into c and ccx is the column index into c. How do I fill it up? I initialize that cell of c to 0, and then I have this combined cursor abx: I start at row crx of a and at column ccx of b, and abx walks across that row of a and down that column of b, running through the two vectors in lockstep; multiply the two values, accumulate onto c[crx][ccx]. So three loops; it's quite like matrix-vector multiplication, but b has many columns, so it's a triple-nested loop.

And the time required is lsize times msize times nsize. Why? The outermost loop is on crx, and the number of its iterations is clearly lsize. The next loop is on ccx, and clearly the number of its iterations is n, so whatever is inside is executed l times n times. And the stuff inside is another for loop, which takes abx from 0 to m minus 1, so it takes m time. Therefore the overall time required is l times n times m. Any questions about matrix multiplication? It's really simple, and you don't need to write it anyway, because Boost already does it for you.
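As a sketch of the triple-nested loop just described, using flat row-major arrays; the names lsize, msize, nsize follow the lecture's convention, the rest are mine:

```cpp
// c = a * b, where a is lsize x msize, b is msize x nsize, c is lsize x nsize.
void matmul(int lsize, int msize, int nsize,
            const double *a, const double *b, double *c) {
    for (int crx = 0; crx < lsize; ++crx)          // row of c: l iterations
        for (int ccx = 0; ccx < nsize; ++ccx) {    // column of c: n iterations
            double sum = 0;                        // initialize the target cell
            for (int abx = 0; abx < msize; ++abx)  // lockstep cursor: m iterations
                sum += a[crx * msize + abx] * b[abx * nsize + ccx];
            c[crx * nsize + ccx] = sum;            // total time: l * n * m
        }
}
```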
So now that we can multiply matrices with vectors and matrices with matrices, the next extremely important application we'll look at is the least squares fit. A lot of you will hopefully do experiments and then collect data from them. For example, in a high school electricity lab, you might be putting a battery across a resistor, changing the battery voltage by putting more and more cells in series, and measuring the current through the resistor. You're not given the value of the resistor; you have to find it. That's a typical lab, and there are more complex things people do. So suppose you are given readings of battery voltage and current, and you do a scatterplot of those. If your experimental conditions were absolutely ideal and your meters were perfect, then you would get an exact straight line through the origin, because V = IR. In practice this will not happen. We've already seen occasions to do measurements earlier in this course.

For example, I'm trying to do, say, selection sort, and I'd like to find out how much time my selection sort takes as a function of N, the number of elements in the array. One way would be to just time it using the system time command and then plot time against N. And because selection sort is expected to be a quadratic-time algorithm, you expect your running time to be something like t equals some constant times N squared. But it will never be exactly N squared; there will be some initial setup time for the code, so it will be more like A0 plus A1 N plus A2 N squared. That's your estimated model of the running time.

So suppose I collect numbers like this: I have N and I have t, and my estimated model is that t = A0 + A1 N + A2 N^2, but I don't know what A0, A1, and A2 are. Why would it be interesting to estimate them? Well, A0 would give you some estimate of the time that your Linux kernel and bash shell take to set up your process and tear it down; that's independent of N. A1 might give you a good estimate of how much time your runtime library takes to allocate an array of size N, initialize it, and read the values from files; that should be roughly proportional to N. And the actual sorting time you expect to be proportional to N squared. But the three constants of proportionality might differ. Reading from files is very slow, so you expect A1 to be pretty large. Setting up a process, tearing it down at the end, printing things to the terminal: that's also slow, so A0 might be quite large as well. But sorting inside RAM is very fast (we know that access to random access memory is very fast), so you expect A2 to be quite small. Beyond this you have no initial idea, so instead of guessing, you use a whole bunch of values of N, collect the corresponding values of t, and then do a fit.

So how do you fit this? Here is the process. You collect a matrix called X. Let me set the variables straight, because this might get a little confusing: the array size is capital N, and let's say I do a sum total of small n experiments. So I run the code, say, 100 times, meaning small n equals 100, whereas capital N, the array size, might go into the millions; you want to sort bigger and bigger arrays. What I do is line up those n experiments as rows of a matrix, each row corresponding to one experiment: the experiment that started with N elements and took time t. My matrix is going to have three columns: in the first column I put a constant 1 for all rows; in the second column, that experiment's N; in the third column, its N squared. Those are the three factors which I believe govern the running time t. Now I multiply it by this three-element vector a, which is (a0, a1, a2). So clearly, for the ith experiment, the predicted time is a0 plus a1 Ni plus a2 Ni squared; meanwhile, I have used a stopwatch to also measure the time actually taken by the ith experiment, and that's ti. So now the rest is clear: I would like to find a0, a1, a2 such that this matrix X times a is roughly equal to the t vector. That's written down in a slightly different notation on the screen: my observations are x0 through xn minus 1.
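Written out (a reconstruction of the board; Ni is the array size used in experiment i):

$$X = \begin{pmatrix} 1 & N_0 & N_0^2 \\ 1 & N_1 & N_1^2 \\ \vdots & \vdots & \vdots \\ 1 & N_{n-1} & N_{n-1}^2 \end{pmatrix}, \qquad a = \begin{pmatrix} a_0 \\ a_1 \\ a_2 \end{pmatrix}, \qquad Xa \approx t = \begin{pmatrix} t_0 \\ t_1 \\ \vdots \\ t_{n-1} \end{pmatrix}.$$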
These rows are the observations; the model I'm trying to fit is w, which in this case corresponds to the three coefficients a0, a1, and a2; and what I'm fitting to is this observation vector called y, which is my running time in the case of timing the code. So if I want Xw to be roughly equal to y, what do I mean? We have done some of this in the past: if you want two vectors to be roughly equal, then you want the norm of their difference to be very small. So I say: find a w which minimizes the L2 norm of the difference between two vectors. One is Xw, the modeled part: if you multiply these out, you get the predicted running time of your code. The other is y, the actually observed running time. So I want to find a w which minimizes the discrepancy between the predicted running time and the observed running time.

Now it turns out that the first line is equal to the second; the second line is just a way to write an inner product. Let me write it down: Xw is this skinny thing, a column vector, and then I'm subtracting y, so that is the discrepancy of running times. I take the transpose of that and multiply it by itself: (Xw minus y) transpose times (Xw minus y). What is the inner product of a vector with itself? If I have a vector which looks like, say, p1 through pn, and multiply its transpose by itself, I get the sum over i of pi squared, which is exactly the L2 norm squared. So everyone okay with the last line on this slide? That's what we want to minimize with regard to w.

Now let's write this out, and at this point you may or may not be totally comfortable with the notation. There's nothing sublime or profound going on beneath; it's just that the notation might be a little unfamiliar, so I'll go through it very slowly. f(w) is what we are seeking to minimize with regard to w, and as we have already written, f(w) is (Xw minus y) transpose times (Xw minus y). Now, the very elementary rules of transposition are: (A plus B) transpose equals A transpose plus B transpose, and (AB) transpose equals B transpose A transpose. So the very first line is pretty simple: you just expand out term by term, and we get w transpose X transpose X w in the second line, minus two middle terms. If you look at those two terms they appear different, but they come to the same value. Why? The first way the middle term appears is y transpose X w; the second way is w transpose X transpose y. They are the same: y transpose looks like this long skinny row, X looks like this, and w is like that; that's y transpose X w. The second one is w transpose X transpose y. You can verify that they have exactly the same value, and that value is one number, just a scalar. Why? Because on multiplying this by that, I get a skinny column vector, and the product of the row with that column gives me a scalar. Similarly here: after multiplying this fat matrix with this skinny vector, I get one short column vector, and after multiplying those two I get a scalar. So what I am writing as f is entirely a scalar, although there are matrices and vectors involved inside.
The final value is a scalar. Why? Because what is w transpose X transpose X w? We didn't tackle that one before. In shape, it looks like w transpose, X transpose, X, and w. So let me collect terms: this row combines with every column of the next matrix, giving one row like this; that row combines with the next matrix, giving just a row again; and then that combines with the last thing to give a scalar value. So the second line also has the value of just one number; it's not a matrix or a vector. And finally there's plus y transpose y, but it doesn't matter, because I'm trying to minimize f(w) with regard to w and that last term doesn't even have a w in it, so it's immaterial. How many people are okay up to line 2 now? Line 2 is okay.

So I'm trying to find a minimum of f(w). Therefore I take something like g(w), which is f prime of w, and set it equal to 0; that's how you minimize a function. Again, there's some compact notation that you may or may not be familiar with. When I take a form like w transpose X transpose X w and differentiate it with regard to w, what does that even mean? For those of you who are only familiar with scalar calculus, it's sort of like this: if w were a scalar, the first term is written in the form w times something times w, which in scalar land would be called a w squared; the second term is like b w; and then there's some constant. That's what I'm minimizing with respect to w, and in matrix land it turned into that strange expression, but there's no real difference. When I differentiate a w squared with regard to a scalar w, I get 2 a w.

When w is a vector, and I have these forms, w transpose (some matrix) w and w transpose (some vector b), what should the derivative be? It's the derivative with regard to every element of w: when f reaches a minimum, df over dwj should equal 0 for every j. So when I write df over dw without any subscript, I mean the whole vector: df over dw0 right through df over dw(m minus 1), if w is a length-m vector. That's what I mean by the gradient. And now it's not difficult to see: suppose f(w) equals w transpose b, where b is some fixed vector. That means f equals w0 b0 plus, et cetera, plus w(m minus 1) b(m minus 1). Then what's the derivative of f with regard to the first coefficient? Just b0. With regard to the second? Just b1. In other words, in this case df over dw is nothing but the vector b itself. As long as you line up your objective to look like w transpose b, its derivative is just b, exactly as when you differentiate in scalars: if y equals m x, the derivative is m. In exactly the same way, the quadratic term can be done in parts, this part and that part, and if you add up the parts you will see that it comes to exactly twice X transpose X w. Beyond this you can manipulate the indices at home and convince yourself, but take the leap of faith: in scalar algebra the derivative of a x squared is 2 a x; here you get 2 X transpose X w.
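Putting the expansion and the two differentiation rules together:

$$f(w) = (Xw - y)^\top (Xw - y) = w^\top X^\top X w \;-\; 2\,w^\top X^\top y \;+\; y^\top y,$$

$$g(w) = \frac{df}{dw} = 2\,X^\top X w \;-\; 2\,X^\top y.$$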
So that's f prime, and if I set it equal to 0, I get X transpose X w equals X transpose y. Now if I want to solve this for w, how am I going to do it? I have to take off the X transpose X, and in matrix land you can only do that by inverting. So w star, the solution which minimizes my f(w), is going to be (X transpose X) inverse times X transpose y. This in one shot gives you the best model: given X and given y, the best w, or w star, is given by that last expression. How many people were already aware of that expression before they came to class today?

The problem is that, at least as far as CS101 is concerned, we don't know how to invert matrices yet; we'll learn soon, but not yet. So could there be a way to find the best w without inverting a matrix? Observe that if I write a standard scalar expression, a x squared plus b x plus c (this is a scalar), and I'm trying to minimize it, the minimum of a quadratic function like this is well defined only if a is greater than 0, in which case the surface looks like a bowl, a parabola. There is a counterpart in matrix land: M, here X transpose X, has to be what's called positive definite; we can talk about that later if you want. Assume M is positive definite; if it's positive definite, then it's also invertible. Now, if this is the form of f, what is the form of g? g is just linear, right? You differentiate a quadratic expression, you get a linear expression. So trying to find the minimum of f is equivalent to trying to find a zero of g.

How do you find the zero of a function? You can use the Newton-Raphson method, or you can use gradient descent, which means the following. I define f(w) as there, and g(w) as f prime of w, and I initialize with some arbitrary w0. This is again iterative; I want to do gradient descent, I want to minimize. So I repeat: w at the next time step equals w at the previous time step, minus something called a step size times g of w at the previous time step. You want to reduce the function, so you want to walk against the gradient: the gradient tells you the direction of increase, and you want to decrease the function. Fix an arbitrary step size, a small constant rather like the epsilon we saw before; it depends on what you can get away with, and there's a huge theory of step size selection that I can't even go into. Start with some w0, some arbitrary guess. Repeat for time t equals 1, 2, et cetera: set wt to be wt minus 1, minus the step size times the gradient at the previous time step, and once again stop when wt and wt minus 1 become very close.

Suppose I am here; this is my wt minus 1. In the standard gradient method, you find that the gradient here is positive; I would like to reduce the function, therefore I have to go left. If instead I was here, the gradient would be negative, and so I have to go in the positive direction. What happens in 1D is very obvious: you find the local gradient and you walk against it to minimize the function. The exact same thing happens in matrix land: you find the multidimensional gradient vector and you take a step in the negative direction of that vector.
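So the two routes, side by side:

$$X^\top X\, w^\star = X^\top y \;\Longrightarrow\; w^\star = (X^\top X)^{-1} X^\top y \qquad \text{(needs a matrix inverse),}$$

$$w_t = w_{t-1} - \eta\, g(w_{t-1}) \qquad \text{(gradient descent: no inverse needed).}$$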
Once I am armed with more C++ graphics capability, after midterms, I am going to show you contours of equal function values and how the w roams around in that space. We can't finish that today, but I will show you some nice animation after the midterms. Broadly, though, the intuition extends directly from one-dimensional optimization to multidimensional optimization. In one dimension, to the right of the minimum the gradient is positive, so if I am stuck on the right I should go left; to the left of the minimum the gradient is negative, so I should go against it to reduce the function value. This holds element-wise: if I find a gradient vector, I can always take its negative and take a small step along that, so that I move towards a smaller function value. How many are comfortable with this way of minimizing?

In this specific problem, this is actually pretty much guaranteed to converge for any reasonable eta; it doesn't matter how carefully you pick eta. There are other problems where f is complicated enough that you need a lot of care with eta, but not this one: this problem is one simple parabola in multiple dimensions, so the process is very simple and reliable.

So how do we write Boost code for it? By now the Boost code looks really, really simple. Let's see; I will set up the following C++ code. I will give you three points to fit a straight line through, in 2D, so small n there is 3. Being in 2D means each observation is only one number, but we will pad it with a one, as before, because I am going to fit y = mx + c; to fit the c, I need to pad every observation x with a one. What I am given is a chart of x and y values. I take my x values and pad them with ones, and this bigger matrix becomes my capital X. My w corresponds to w0, which is my c, and the second entry, which is m. So what I am fitting is c and m, in that order: if I take, say, the row (1, x1) times (c, m), I get c plus m x1; the next one will be c plus m x2. And I also have the y. So the problem is all set up.

Now I am going to try an initial bug-checking or test run, where I give x and y values which are extremely simple: (1, 1), (2, 2) and (3, 3). There is no least squares required there; a single line fits all of them, no problem. Then we will jiggle the points around a little bit and see what happens. This is my starting point, and that's how the code looks. Actually, the code here is a little flipped, in that the first column of X is the given values 1, 2, 3 and the second column is the pad of ones, and y is 1, 2, 3. So clearly the relationship should be y = x + 0: m equal to 1, c equal to 0 is the correct relationship. Now I initialize my w vector with two entries, both initially 0. It doesn't matter: this bowl has one global minimum, so no matter where you release a ball, it will always roll down to the bottom; it doesn't matter where you start. And I have a step size, which is an arbitrary 0.01: I will take a step which is one hundredth of the gradient in every dimension.

Now I have a loop, and what I do in it is pretty simple. I have X; I take its transpose and call it xtrans, because I will be reusing it a few times. If it were a simple small scalar expression, like the square root of x, then computing it a few times might not be very expensive; but if you are reusing X transpose in a bunch of places, you are better off calculating it only once, otherwise you are going to waste time.
The C++ compiler is not smart enough to realize that your X will not be changed and that its transpose can be reused, so I create X transpose once. I create X times w once, which is the product of X and w. Now, g(w) was f prime, and I had an expression for it: X transpose X w minus X transpose y; if you consult your notes, that is the expression for g(w), and the factor of 2 does not matter. In Boost this is written out directly; it is very easy to code X transpose X w minus X transpose y. Then the next w is w minus eta times the gradient. Currently I am not quitting; I am just going to show you what happens if I run it forever. I print out the current w and the next w, w gets set to the next w, and then I continue. So: a very simple loop. Let me run this.

I started off from (0, 0). After a little walk along the gradient it became (0.14, 0.06); that became (0.256, 0.10), and so on. Around here it is (0.78, 0.32); keep going, keep going: (0.92, 0.16), (0.95, 0.10), (0.97, 0.05), (0.98, 0.02), (0.99, 0.00). So you see that I am gradually converging to y = 1 x + 0: I have figured out that the slope is 1 and the offset is 0.

So let us tweak the values to get some non-trivial slopes and offsets. Suppose my offset was, say, 1: so let us put 2, let us put 3, let us put 4. And just for some fun let us jiggle them, say to 0.2, 2.9, 2.1, so I get some rather different values. The best fit is not going to be exactly y = x + 1, but let us see what we get; in the end we should get something close to (1, 1), because that is how I generated the points. So I start off with some arbitrary stuff: 1 and 0.4; 1.16, 0.544; then 1.05, 0.966; it kind of converges. In fact the true best fit may not be exactly (1, 1): given the jiggled points, the least squares fit will be a little different from (1, 1). So that is how you do least squares fits. I have only done it with 3 points, but of course you can use 3 million points if you want.

I can post instructions on Moodle about generating random numbers. Try generating synthetic data: take a straight line, use random numbers to distort the points a little bit, get 2 million synthetic points, and see if your Boost program can deal with them. Try that offline before the next lab, after the midterm.
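Pulling together the loop just described, here is a minimal runnable sketch in Boost uBLAS; the fixed iteration count (instead of running forever) and the variable names are my additions:

```cpp
#include <iostream>
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <boost/numeric/ublas/io.hpp>

using namespace boost::numeric::ublas;

int main() {
    // Three points (1,1), (2,2), (3,3): the line y = 1*x + 0 fits them exactly.
    matrix<double> X(3, 2);
    X(0, 0) = 1; X(0, 1) = 1;     // first column: the given x values,
    X(1, 0) = 2; X(1, 1) = 1;     // second column: the pad of ones,
    X(2, 0) = 3; X(2, 1) = 1;     // so w = (m, c) in that (flipped) order
    vector<double> y(3);
    y(0) = 1; y(1) = 2; y(2) = 3;

    vector<double> w(2);
    w(0) = 0; w(1) = 0;           // start anywhere: the bowl has one minimum
    const double eta = 0.01;      // arbitrary step size

    matrix<double> Xtrans = trans(X);   // compute the transpose once, reuse it

    for (int iter = 0; iter < 1000; ++iter) {
        vector<double> Xw = prod(X, w);
        // g(w) = Xt X w - Xt y; the constant factor of 2 is absorbed into eta
        vector<double> grad = prod(Xtrans, Xw) - prod(Xtrans, y);
        vector<double> neww = w - eta * grad;
        std::cout << iter << "  w = " << neww << "\n";
        w = neww;                 // no stopping test here: just watch it converge
    }
    return 0;
}
```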