Welcome back to our Machine Learning 1 course. Today we're going to talk about multiple linear regression, which is linear regression with more than one predictor.

The example I had last time was predicting the height of a person from the person's age. So you have one predictor, the age, and your response variable, as it's called, is also just one-dimensional: it's the height that you're predicting. You have two parameters, beta_0 and beta_1, intercept and slope, that describe your regression line, which you can draw in two dimensions very easily: for each value of x, which is just a number, you get some value of y, which is also just one number.

Today we want to talk about the situation where you have two or possibly more predictors. If you have two predictors — say you don't just have age, but also the weight, or any other number describing a person — call them x_1 and x_2, and you're still predicting the height, then you need three dimensions to draw that, and instead of a regression line you have a regression plane. We're still fitting a linear function, so it's still linear regression, but now it's a regression plane: for each combination of x_1 and x_2 you get one value of y, and this defines a plane. To parameterize a plane you need three numbers: an intercept beta_0, and then beta_1 and beta_2 are the slopes in the direction of x_1 and in the direction of x_2. Just think of a tilted plane living in this three-dimensional space.

Last time we talked a lot about the loss function. In the one-dimensional case we have two parameters, beta_0 and beta_1, and we have a loss surface that we can nicely visualize in 3D, and it clearly has a minimum. If you have two predictors, you have three parameters, so in order to draw the loss surface I would need four dimensions, which is why there is no picture here. And if you have more predictors than that — which is often what you do, you have a lot of predictors — then you're lost: there is no way to visualize the loss surface anymore.

This means that conceptually, what we're going to do today is not more complicated than what we had last time. It's the same thing; we just want to generalize simple linear regression to the general case of any number of predictors. Technically, however, it is quite a bit more involved, because you need to deal with the high dimensionality, with multiple predictors and multiple betas, and keep track of everything. So one usually writes it down in matrix and vector form, and today will be a somewhat technical lecture where we develop the formalism that allows us to write these things down easily.

Okay, so let's start. Our model for multiple linear regression is a linear function of p predictors: y = beta_0 + beta_1 x_1 + ... + beta_p x_p. I will always use p to denote the total number of predictors in the model; remember that the sample index goes from 1 to n, and the predictors go from 1 to p. There is an additional beta_0, the intercept, so you have p + 1 parameters in your model given p predictors. Written out like that it is a little cumbersome, so I will never do this again: it is convenient to write it in vector form, and for that it helps to additionally define x_0, so that the beta_0 doesn't just hang there without an x and look different from everything else.
Let's write it more symmetrically: we define x_0 = 1, always — x_0 is just another way for me to write the number one. If we do that, we can assemble all the x's from 0 to p into a vector, which I'll call the vector x, and we assemble all the betas from 0 to p into a vector beta. Then our linear function is just the scalar product of the two vectors beta and x, which I assume everybody knows: by definition, the scalar product is the sum over all coordinates of the products of the coordinates of the two vectors.

Another way to write exactly the same thing — the way that is more standard in statistics and machine learning textbooks, and which I will also adopt in these lectures — is not to draw vector arrows, but to use bold font to denote a vector. Whenever you see a bold x it means a vector, and the bold beta also refers to a vector. To denote the scalar product I will write beta^T x, beta transpose times x. I will always assume that a vector is a column vector: if I just write bold x, it is a column vector with all coordinates sitting one under another, and the transpose of the vector beta is then a row, from beta_0 all the way to beta_p. I'm not writing the dot anymore; this is a matrix multiplication between these two matrices — think of them as very simple matrices, and we'll talk more about matrix multiplication in a moment. Here it just means you multiply beta_0 with x_0, beta_1 with x_1, and so on, and sum everything up. It's the same thing as above, just a different way to write it. Very simple.

Okay, so that's the function beta^T x. Let's now proceed to the loss function. The loss function is now a function of the bold beta, and we'll use the same loss as before, the mean squared error. The mean squared error means that you sum over all training examples from 1 to n — well, you actually average, which is why there is a 1/n in front of the sum. I will use this bracket notation, which is a little cumbersome, to denote the i-th training example: y^(i) is just one number, the value of the response — the height, in our example — for training example number i. And beta^T x^(i) is the output of our linear regression model applied to x^(i), the vector of all predictors for the same training example; it is also just one number, the prediction. We subtract y^(i) from the prediction, we square the error, and we average, so we get the mean squared error:

L(beta) = (1/n) * sum_i (beta^T x^(i) - y^(i))^2.
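To make the vector notation concrete, here is a minimal NumPy sketch — my own illustration, not code from the lecture; the data values and variable names are made up — that evaluates beta^T x for one example using the x_0 = 1 convention, and then the mean squared error over a tiny training set, written exactly as the per-example sum.

```python
import numpy as np

# One training example with p = 2 predictors (made-up numbers: age, weight).
x_raw = np.array([25.0, 70.0])
x = np.concatenate(([1.0], x_raw))    # prepend x_0 = 1 for the intercept
beta = np.array([150.0, 0.5, 0.1])    # beta_0, beta_1, beta_2 (arbitrary values)

y_pred = beta @ x                     # the scalar product beta^T x: one number
print(y_pred)

# Mean squared error over n = 3 made-up examples, written as the per-example sum.
X_raw = np.array([[25.0, 70.0],
                  [30.0, 80.0],
                  [22.0, 60.0]])
y = np.array([170.0, 180.0, 165.0])
n = len(y)
loss = sum((beta @ np.concatenate(([1.0], X_raw[i])) - y[i]) ** 2
           for i in range(n)) / n
print(loss)
```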
Now remember what we need in order to proceed. We have a loss function, and then either we compute its gradient and use gradient descent to go step by step towards the minimum, or we set the gradient to zero and try to find an analytical solution for the point where it vanishes. Either way we need the gradient, and either way we need the partial derivatives. So let's compute the partial derivative of the loss with respect to beta_k. We have p + 1 different betas, so we will have p + 1 partial derivatives; I'm taking one of them and denoting its index by k, which can be any number between 0 and p. The partial derivative is nothing else than the derivative of a square: the 2 goes in front, the bracket stays the same, and what comes out of the bracket is the derivative of beta^T x^(i) with respect to beta_k — remember beta^T x^(i) is a sum over all coordinates, so only one term depends on beta_k, and its derivative is x_k^(i), the k-th coordinate. The sum over all training examples stays. So:

dL/dbeta_k = (2/n) * sum_i (beta^T x^(i) - y^(i)) * x_k^(i).

Nothing complicated, but as you see, it becomes a bit of a mouthful even to pronounce, so make sure to pause the lecture and think about every step here so that you understand the formulas — that's important.

This is one partial derivative; we can write it for beta_0, beta_1, beta_2, and so on up to beta_p, and now we want to combine them into one (p + 1)-dimensional vector, which is called the gradient, as we discussed in the previous lecture. That's easy, because the expression in the brackets is the same for every k: if we want a vector, we can simply replace the scalar x_k^(i) by the whole vector x^(i), and every component of the resulting equation is exactly what we had above. So the gradient is

grad L(beta) = (2/n) * sum_i (beta^T x^(i) - y^(i)) * x^(i).

In principle, this is all you need to run gradient descent — there is nothing else. For any given value of beta, for example your initial guess, and given the data, which sits there fixed and never changes, you can compute the value of the gradient, which is a vector, and then you update beta: you make a little step, with some learning rate, in the direction of the negative gradient. So in a way we're done. But there is a better way to write all of this that is more convenient both mathematically and practically, and that is to rewrite everything in matrix notation. That's what I want to do here.

We have a collection of vectors x^(i); every x^(i) is a (p + 1)-dimensional vector, and there are n of them. We can combine all of these x data into a matrix, which in statistics is called the design matrix. Here is how it looks. One row of the matrix corresponds to one training example, so each row has p + 1 numbers, and there are n rows, for training examples 1, 2, 3, and so on up to n. You can also look at the columns: each column is a feature. If you look at the first column, it contains the values of x_1 over all training examples; I will call this the feature vector x_1 — all values of your first feature. You have p such feature vectors plus the additional intercept feature vector, which is just ones: 1, 1, 1, ..., 1. In terms of notation, if I write x with (1) in brackets, it means the first training example, a row; if I write x_1 with the index at the bottom, it refers to the feature, a column. The matrix is not square — it looks a little square on the slide, but of course it isn't: it has n rows and p + 1 columns, and those can be very different.
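As a sketch of the same thing with a design matrix — again my own illustration with made-up numbers — here is the gradient computed coordinate by coordinate, exactly as in the formula above, after stacking the examples into a matrix with a leading column of ones.

```python
import numpy as np

# Hypothetical data: n = 3 examples, p = 2 predictors.
X_raw = np.array([[25.0, 70.0],
                  [30.0, 80.0],
                  [22.0, 60.0]])
y = np.array([170.0, 180.0, 165.0])

# Design matrix: a column of ones (the x_0 "feature") followed by the predictors.
n = X_raw.shape[0]
X = np.column_stack([np.ones(n), X_raw])   # shape (n, p + 1)

beta = np.zeros(X.shape[1])                # some current value of beta

# Gradient exactly as derived: for each k,
# dL/dbeta_k = (2/n) * sum_i (beta^T x^(i) - y^(i)) * x_k^(i)
grad = np.zeros_like(beta)
for k in range(len(beta)):
    grad[k] = 2.0 / n * sum((X[i] @ beta - y[i]) * X[i, k] for i in range(n))
print(grad)
```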
We will also collect all the y values, the responses, into what we call the response vector: a vector of length n containing y^(1), y^(2), and so on up to y^(n). These are the objects that are very convenient to work with, in linear regression and actually everywhere; this is a very standard setup. And by the way, if I write a capital bold X, or any capital bold non-italic letter, it refers to a matrix. So: uppercase bold is a matrix, lowercase bold is a vector, non-bold is just a scalar.

So what can we do with these objects? We can rewrite everything we had before. If you have X, your whole predictor data set, and some fixed value of beta, how do you compute the predicted values y-hat? It turns out you just multiply X by beta, which is super convenient — this is why matrix multiplication is so useful here. Let's make sure everybody understands what happens when you take the entire matrix X and multiply it by the vector beta. Matrix multiplication, for those of you who forgot or never saw it, always works row by column: to get the first element of the result, you take the entire first row of X and the column vector beta, multiply the first element with the first, the second with the second, and so on, and sum them up. If you look at what happens, you'll notice you get exactly your prediction: beta_0 times x_0, plus beta_1 times x_1, and so on across the whole first training example — that's y-hat for the first example. For the second element you take the second row, for the third element the third row, and so on until the end. So if you multiply X times beta, following the definition of matrix multiplication, you simply get the vector y-hat. A very simple way to write it.

Can we now also write the loss function using this X matrix and the beta and y vectors? Yes, of course. We take the definition of L from before, and it becomes the following — again, pause and think about it to make sure it really follows. You compute the product X beta, which as we just saw is the vector of predictions. Remember that in the mean squared error you subtract the actual value from the predicted value, then square and sum over all examples. Think of the vector of errors, which is the vector y minus the vector y-hat (I should have written it down explicitly, but picture this error vector): you are computing the sum of its squared elements, and that is by definition the squared norm of the vector. So we simply have

L(beta) = (1/n) * ||y - X beta||^2,

which is super simple: you have y, you have X beta, which is the prediction, you subtract, and you take the squared norm. Very nice — no indices, no sums; much more convenient to write down.
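A minimal sketch of the matrix form, assuming synthetic data generated with NumPy (the numbers are arbitrary): the predictions for all n examples are one matrix–vector product, and the loss is one squared norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix, n x (p+1)
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta = np.zeros(p + 1)                    # some current value of beta
y_hat = X @ beta                          # predictions for all n examples at once
loss = np.sum((y - y_hat) ** 2) / n       # (1/n) * ||y - X beta||^2
print(loss)
```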
You can also write the same thing, which is sometimes useful, as a scalar product of the error vector with itself: L(beta) = (1/n) * (y - X beta)^T (y - X beta). You just get the sum of the squared elements of this vector — the squared norm of any vector is by definition the scalar product of the vector with itself; that's all that is written here.

Now we need the gradient, and this may again get a little confusing, so take your time to really make sure you understand what's happening. We take the loss from the top, and previously we worked out the gradient in the summation form, grad L = (2/n) * sum_i (beta^T x^(i) - y^(i)) x^(i), combined into one vector equation. The question is: can we write it without the sum? The answer is yes, and I'm just writing it down — it is not completely obvious in advance that this is the correct way, but you can check it. Take your matrix X and transpose it — remember it had n rows; transposing means rotating it by 90 degrees, so now it has n columns and p + 1 rows — and multiply it by the error vector X beta - y. You get the same thing: each coordinate is the same as in the summation form. So the gradient is

grad L(beta) = (2/n) * X^T (X beta - y).

I really recommend writing it down on a piece of paper and verifying that this matrix multiplication works out like that. This is not a joke — if you don't do it, you will be confused; we will also have exercises that force you to do it.

A great thing here is that if you are familiar with how the calculus of derivatives with matrices works, you can go directly from the loss ||y - X beta||^2 to this gradient. You need to know how to differentiate the norm, but if you look at it, it's basically the same as school calculus: you have something squared, so you're left with two times the same thing times the derivative of what's inside, and the derivative of what's inside with respect to beta is just X — so an X appears. It's wonderful how this notation so conveniently generalizes the usual rules of one-dimensional calculus. The only thing one has to be careful about are the transposes: there were no transposes in the loss, but in the gradient there is suddenly an X transpose, because otherwise you cannot even multiply these matrices. So one either has to remember how this works, or do the sanity check in your head of which way the matrices can be multiplied, or go the hard way, coordinate by coordinate, and confirm at the end that it makes sense.

So that's matrix calculus, in a way. And notice what we have in the end: X, the matrix of predictors, does not change; the only thing that changes during gradient descent is beta. For every value of beta we need the value of the gradient, which is a vector, and here we can compute it just by multiplying these two matrices — it's literally one line of code. Mathematically it is definitely more convenient than the very confusing index form where you need to keep track of where every bracket and index goes.
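Here is a small sketch (my own, on random data) verifying that the one-line matrix form of the gradient agrees with the coordinate-by-coordinate sum from before.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

# Matrix form of the gradient: (2/n) * X^T (X beta - y) -- literally one line.
grad_matrix = 2.0 / n * X.T @ (X @ beta - y)

# The same thing written coordinate by coordinate, as in the earlier derivation.
grad_loop = np.array([
    2.0 / n * sum((X[i] @ beta - y[i]) * X[i, k] for i in range(n))
    for k in range(p + 1)
])
print(np.allclose(grad_matrix, grad_loop))   # True
```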
But it's not only mathematically convenient. If you implement this in a programming language — something you may have heard, especially for languages like Python or R or anything that lets you work with matrices — then writing the computation as a matrix multiplication, one matrix times another, is much, much faster than writing it out by definition, with a for loop that computes the row-by-column sums manually. If you write a for loop that computes the sum, versus writing it in matrix form, and your matrix X is large enough that the computation takes a non-trivial amount of time, you will see that the matrix form is faster. It is also, of course, less error prone, because you have one line of code instead of looping over your arrays and keeping track of the indices. Sometimes it is more convenient to work with the explicit index notation — it depends on what you're doing — but computationally, whenever you have a vectorized form, this matrix form, it is very beneficial.

All right. Now we have the gradient, so we can implement gradient descent.
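As an illustration of what the gradient descent loop could look like with the vectorized gradient — a sketch on synthetic data, with the learning rate and number of steps chosen arbitrarily, not a recipe from the lecture:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([2.0, 1.0, -0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta = np.zeros(p + 1)       # initial guess
lr = 0.1                     # learning rate (arbitrary choice)
for step in range(1000):
    grad = 2.0 / n * X.T @ (X @ beta - y)   # the vectorized gradient
    beta = beta - lr * grad                  # small step against the gradient
print(beta)                  # should end up close to beta_true
```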
What is the next step on our list? In the case of linear regression we can also obtain the analytical solution, by setting the gradient equal to zero. Let's see how that works here. I take my gradient and set it equal to zero; I put a hat on beta, because that is the condition for beta yielding the minimum of the loss — so it's beta-hat in there. That is just a linear equation for beta, which should be easy to solve, and it is, if you know how matrix algebra works. We open the brackets, and we can of course drop the 2/n because the other side is zero anyway, and we are left with: some matrix, X^T X, times beta-hat equals some vector, X^T y.

Let's think for a second about what X^T X is. You take your matrix X, which has n rows, transpose it, and multiply it by the non-transposed X: the transposed one has n columns, the other has n rows, so you multiply and sum over n, and you are left with a square matrix of size (p + 1) by (p + 1). If everything is well, you can invert this matrix. I will just write this down assuming people are more or less familiar with it — if not, we can discuss it in our tutorial time. How do you solve this equation? If these were just numbers, you would say: I want to divide by the thing multiplying beta, and then I have the answer. And in fact that is what you can do here, if the matrix X^T X is invertible — let's assume for a second that it is. The matrix inverse is defined as follows: for any matrix A, the inverse A^{-1} is a matrix such that A^{-1} A is the identity matrix, and the identity matrix, denoted I, is zero everywhere apart from the diagonal and has ones on the diagonal (remember, we're talking about square matrices here). So if you multiply X^T X by its inverse (X^T X)^{-1}, you get the identity matrix, and if you multiply beta by the identity matrix you just get beta. Imagine taking the equation X^T X beta-hat = X^T y and multiplying both sides from the left by this inverse: on the left you are left with just beta-hat, and on the right the inverse ends up in front of X^T y. That is the analytical solution for the general case of multiple linear regression:

beta-hat = (X^T X)^{-1} X^T y.

Let's compare that with what we had before. In the one-dimensional case — super simple, what I called baby linear regression in the previous lecture — the formula was the sum over x times y divided by the sum of x squared. Now we have the complicated case, but take a second to appreciate that it's basically the same formula: here you have X^T y, the analogue of the sum of x times y, and you "divide" by the whole matrix X^T X, the analogue of the sum of x squared — you kind of divide by it, because it's the matrix inverse. Or let me put it the other way: make sure you understand that if the matrix X has just one column, which is what we had in baby linear regression, then the formula at the bottom simplifies to the formula at the top. And of course it does — it's the same thing, just the general case.

So now we have the analytical solution, and we have the formula for the gradient in matrix form, so we could go ahead, code it up, and obtain the results; in a way, we're done here.
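A sketch of the analytical solution on synthetic data. The first line follows the formula literally; the other two use standard NumPy routines (np.linalg.solve, np.linalg.lstsq) that solve the same problem without forming an explicit inverse — we come back to why that matters at the end of the lecture.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([2.0, 1.0, -0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# The normal equations written literally: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# The same solution via a linear solver and via least squares (no explicit inverse).
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_solve), np.allclose(beta_hat, beta_lstsq))
```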
But I want to talk a bit about the role of this (X^T X)^{-1} term, and to develop some ways of thinking that will help you analyze what goes on with the predictions and with this estimation procedure depending on the matrix X; essentially everything that follows in this lecture is a way to analyze this equation.

The (X^T X)^{-1} term is a bit of a confusing one: you have a matrix, you "square" it, and then you need to compute its inverse — who even remembers how to compute a matrix inverse on paper? I, for example, wouldn't be able to do it for anything larger than two by two, at least not without some effort. So suppose that magically we did not have this term — if it were just the identity matrix, everything would simplify a lot: we would have beta-hat = X^T y, which is super simple; everybody can compute that, you just multiply X^T by y and you're done. So let's think: is this ever true? In fact, yes — it can be true, or at least almost true. Consider the X^T X matrix again. Its element (i, j) — row i, column j — is given by the scalar product of the two feature vectors x_i and x_j. Wanting X^T X to be the identity matrix (and then of course its inverse is also the identity) is the same as saying that the features are orthonormal: "ortho" means orthogonal, "normal" means norm one. Concretely: if you multiply any two different features x_i and x_j, with i not equal to j, you get zero — that is the definition of being orthogonal — and if you take x_i times x_i, you get the squared norm of x_i, and that should be equal to one.

So what does this condition tell us? The features having norm one is already clear; let's think a little about what orthogonality means, especially remembering that the first feature we have is the intercept column — the column of ones, the whole x_0 feature. If you take x_0 times some x_j, the scalar product is just the sum of all values of x_j, because the entries of x_0 are all ones. And that should be zero, because the result should be the identity matrix, which has ones on the diagonal and zeros everywhere else. The sum of the elements being zero means the feature has mean zero: the average of this feature across all training examples is zero.

Now, remembering that under this condition all our features have mean zero, take two real, actual features x_i and x_j — neither of them the intercept column — and multiply them: you get the sum over training examples of the products, and our condition says it is zero. Since both features have mean zero, this in fact means that the two features are uncorrelated. To see it, if you don't see it immediately, write down the definition of correlation, which is basically the same thing — you just need to subtract the means before multiplying, but the means are zero, so it simplifies to exactly this. I will leave this as an exercise; really make sure you understand why it holds.

So what it says is: if each feature has mean zero, all pairs of features are uncorrelated with each other, and every feature has norm one (unit variance, up to the 1/n factor), then the matrix X^T X has this particularly simple form, and the whole solution simplifies such that each coefficient beta_j can be obtained by just multiplying the corresponding feature vector with the response vector: beta_j-hat = x_j^T y. And that's all. It means that the regression coefficient for each predictor can be computed independently: instead of one multiple regression with p predictors and this complicated matrix equation that gives you all the betas at once, if everything is centered, uncorrelated, and of unit norm, you can compute each coefficient separately — the problem decomposes into p separate regression problems, which is of course a large simplification.

I always say the intercept works a little bit separately here. But in fact, if all features have mean zero, as we just said, then we can work out the optimal value of beta_0-hat very easily — I leave this as an exercise too: in this case beta_0-hat is just the average response over all training examples. And notice that requiring your features to have mean zero does not affect the predictions in any way: if you require that x_1, for example, has mean zero, it means you subtract from x_1 its own mean, so a term beta_1 times the mean of x_1 appears, and we can simply absorb it into the intercept. If you center the features, the intercept changes, but the predictions stay exactly the same; the same holds for every other feature.
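Here is a small numerical illustration of this special case — my own construction, not from the lecture: I make the columns orthonormal with a QR decomposition and drop the intercept for simplicity. When X^T X is the identity, the joint least-squares solution coincides with computing each coefficient separately as x_j^T y.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3

# Build a design matrix with orthonormal columns (for illustration only;
# real data will not look like this).
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = Q                                   # now X^T X is the identity matrix
y = rng.normal(size=n)

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]   # the usual joint solution
beta_separate = X.T @ y                             # each beta_j = x_j^T y on its own
print(np.allclose(beta_full, beta_separate))        # True
```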
So it is often very convenient — not so much for implementation, but for mathematical analysis — to say: let's imagine all the features are centered, i.e., they have mean zero; then the intercept is just the average of the response vector. And if we imagine that the response vector is also centered, also has mean zero, then the intercept is just zero and we can forget about it. So a very convenient way to think about regression problems is to imagine that everything is centered — the response is centered, the predictors are centered — we know what the intercept has to be to achieve that, and we can think about all the other coefficients while forgetting about this annoying intercept. You will see in a second why this is useful.

But before that, I want to talk about the y-hat vector and give you a bit of geometric intuition behind this matrix algebra formula. If you want to predict y given your X and your model coefficients, you compute X beta, and we already have a formula for beta-hat — actually this is missing a hat: y-hat should be X times beta-hat. If we put it all together, we get

y-hat = X (X^T X)^{-1} X^T y:

some complicated-looking matrix consisting of four copies of X, transposed and inverted in places, multiplied by y. This matrix transforms y into y-hat — it puts a hat on y — which is why it is called the hat matrix.

It turns out that one can think about this hat matrix as a projection matrix in n-dimensional space. That is a bit confusing the first time you think about it, but let me try to explain. We usually think of linear regression as something happening in p-dimensional space: each training example is a p-dimensional vector (plus an intercept, so maybe (p + 1)-dimensional). But one can turn it upside down and think of everything happening in n-dimensional space, where the axes are the samples. In this n-dimensional space, each feature vector is a vector, and the response vector is also a vector there. If we have two feature vectors x_1 and x_2, they span a plane, sort of representing the whole predictor matrix X — let's call it the predictor plane — and the response y points somewhere off the plane. What I'm claiming is that the whole linear regression solution is nothing else than the orthogonal projection of y onto this plane. The vector y-hat is X times beta-hat, so it has to lie in the plane: it is a linear combination of the feature vectors. But why is it the orthogonal projection — why does orthogonally projecting y onto the X plane give you y-hat? In fact that is very easy to see, because the whole setup is constructed so that we want to minimize ||y - X beta-hat||, the norm of the residual vector. So imagine you have y, and on the plane you are looking for some y-hat such that the distance from y to y-hat is minimal. What's the answer? You take the orthogonal projection — that is where the distance is minimal, because for any other point in the plane the distance would be longer. Another way to see it is to look at the gradient formula from before: it equals zero at the minimum of the loss, and that is nothing else than the condition that the error vector — the residual — is orthogonal to every feature in X. So you can see it directly from the loss, or more explicitly from the gradient; either way, y-hat is the orthogonal projection of y onto the X plane.

This is a very useful image to have in mind whenever you think about linear regression problems, or even about more complicated generalizations: algebraically it might look a bit involved, but conceptually it is just an orthogonal projection. You can conceptualize linear regression as orthogonal projection, and the hat matrix is just the way to write down this projection operator in n-dimensional space.
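A quick numerical check of the projection picture, on random data (my own sketch): the hat matrix applied twice is the same as applying it once, as any projection should be, and the residual y minus y-hat is orthogonal to every feature.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # the hat matrix, shape (n, n)
y_hat = H @ y                              # "puts a hat on y"

print(np.allclose(H @ H, H))               # projecting twice changes nothing
residual = y - y_hat
print(np.allclose(X.T @ residual, 0))      # residual orthogonal to every feature
```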
Okay, and now the final chapter of today's lecture: the singular value decomposition. The singular value decomposition is a way to decompose any matrix, and it turns out to be super convenient almost everywhere you look in statistics and machine learning, so I want to introduce it already here; later in the course we will use it every now and then, and you will see how convenient everything becomes.

We said five minutes ago that when X has orthonormal columns — all features orthogonal to each other, uncorrelated, centered, norm one — then everything simplifies a lot; the whole regression problem becomes a bunch of separate one-dimensional regression problems. Usually, of course, X does not have orthonormal columns. Can one still somehow relate to this simple orthonormal case? Yes, one can, because one can transform X so that it has orthonormal columns, and in a way that is exactly what the singular value decomposition does.

Here is a non-trivial fact, a theorem if you want — I'm not going to prove it, just state it: any matrix X can be written as a product of three matrices, X = U S V^T, and these components U, S, and V have particular properties. The matrix U has orthonormal columns: every column has norm one and every column is orthogonal to every other column. The matrix V has orthonormal columns too. And the matrix S that sits in between is a diagonal matrix — zero everywhere apart from the diagonal — and the values on the diagonal are called the singular values of X. The columns of U are called the left singular vectors, the columns of V are called the right singular vectors, and the elements of S are the singular values. The condition that U has orthonormal columns can be written as U^T U = I, and likewise V^T V = I for V. Here I'm assuming that X has more rows than columns (n larger than p + 1), so the matrix U is also tall — it has more rows than columns — while S and V are square. And for a square matrix, if V^T V = I holds, then V V^T = I also holds, which we may need to discuss separately in the tutorial time; for now let's just accept that any matrix can be written like this.
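In NumPy the decomposition is available as np.linalg.svd; here is a minimal sketch on a random matrix, checking the stated properties.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 3
X = rng.normal(size=(n, p))

# "Thin" SVD: U is n x p with orthonormal columns, s holds the p singular
# values (the diagonal of S), Vt is the p x p matrix V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(U.T @ U, np.eye(p)))          # orthonormal columns of U
print(np.allclose(Vt @ Vt.T, np.eye(p)))        # orthonormal columns of V
print(np.allclose(U @ np.diag(s) @ Vt, X))      # the product reconstructs X
```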
So will it help us? Yes. But before I show you how it helps with the linear regression problem, I want to give you a bit of geometric intuition behind this X = U S V^T decomposition. Let's assume that all features and the response are centered — remember, ten minutes ago we discussed that we can center everything; it doesn't really change the problem, it only affects the intercept, but it can simplify things conceptually. On this slide we assume we did that: all features are centered. Now consider the SVD of the X matrix, and take just two feature vectors: X has two columns, x_1 and x_2, and there is some correlation between them — the scatter plot shows a positive correlation between x_1 and x_2, so X itself definitely does not have orthonormal features.

What the decomposition essentially tells you is that you can tilt your head, so that if you look at the scatter plot with your head rotated, you see features that are uncorrelated; and if, with your head still rotated, you additionally squeeze the cloud along the axes, the features will have norm one. So imagine the scatter plot is fixed and you keep rotating your head — or, what is easier to imagine, at least for me, the scatter plot is on a piece of paper and you rotate the paper. There is positive correlation before you start rotating; whatever the data are, at some point you will reach an angle where the correlation is zero (if it's still positive, you just rotate a little more). Once you reach that point you stop, and then you squeeze each axis so that the sum of squared values along each coordinate equals one. When you've done that, what you have is the matrix U: its columns are uncorrelated and have norm one. You can then transform it back into X: you unsqueeze — that is the multiplication by the diagonal matrix S, each feature getting stretched by its own singular value — and then V, a square orthogonal matrix, is nothing else than a rotation matrix, so after the stretching it takes the whole thing and rotates it back, and that is the X you originally had. So the conceptual way to think about the SVD is: whatever data you have, you can start with something uncorrelated with norm one, scale it to give it any norms you want, and then rotate it to introduce the correlations back — and that is your actual data. In two dimensions I hope it is fairly obvious that this is possible; proving it in full generality requires some mathematical care, and we're not going to do that.

What we will do instead is simply apply it to our setup. For example, take the orthogonal projection y-hat: that is the hat matrix times y, which is X (X^T X)^{-1} X^T y. Let's compute the SVD of X and just plug it in. It looks a bit scary the first time you see it written down, but the great thing is that everything starts cancelling, and you will see how useful this is for analysis. What happens inside the brackets? U^T U is the identity matrix, so we can remove it; S times S — remember S is diagonal — is just the diagonal matrix with squared elements on the diagonal, so I will call it S^2. So inside the brackets you have V S^2 V^T, and you need to take the inverse of that.
The inverse simply turns the 2 into a minus 2 and leaves the rest intact: (V S^2 V^T)^{-1} = V S^{-2} V^T. You can verify that multiplying this with what you had in the brackets before gives the identity matrix, because V^T V is the identity and so is V V^T. And now the V's cancel again, and the S times S^{-2} times the other S also cancel each other, and you are left with just U U^T acting on y. So all that scary stuff cancels, and in the end the hat matrix can be obtained using only the left singular vectors of X:

y-hat = H y = U U^T y.

The orthogonal projection can be written very easily in terms of the left singular vectors of X. That may be convenient for computation, but it will also be very convenient later on — perhaps in the next lecture or in two lectures, when we start tinkering with linear regression a little — because it will help us understand what is going on.

And another way to look at it: that was y-hat from the previous slide; what about beta-hat itself? It is almost the same thing, without the X in front, and you can verify yourself that if you write it all out, what you are left with is

beta-hat = V S^{-1} U^T y.

I have two remarks about that. The first remark is that the original formula, beta-hat equals the inverse of X^T X times X^T y, is fine for mathematical analysis — it is clear what is happening — but it is actually terrible for computation. You need to form the square matrix X^T X, which can be large if you have many predictors and which squares everything, so you can lose precision, and then you need to take an inverse, which is a computationally intensive operation that is not recommended unless you really have to — and here you don't have to. If you program a linear regression solver, not as an exercise but for practical use, you don't want to form this matrix and invert it; there are computationally much faster and numerically more accurate ways, and one of them — maybe not the absolute best, but definitely better than that — is to compute the SVD of X first and then use the formula above.
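A sketch comparing the two routes on random data (my own illustration): the literal normal-equations formula versus the SVD-based formulas beta-hat = V S^{-1} U^T y and y-hat = U U^T y. On well-behaved data like this they agree; the SVD route avoids forming X^T X and its explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Normal-equations form (fine here, but fragile when X^T X is near-singular).
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# SVD-based form: beta_hat = V S^{-1} U^T y, and the hat matrix is U U^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y) / s)
y_hat_via_H = U @ (U.T @ y)

print(np.allclose(beta_normal, beta_svd))
print(np.allclose(X @ beta_svd, y_hat_via_H))
```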
That was remark number one. Remark number two is that now we can finally gain some insight from these SVD manipulations. Think again about the rotation and scaling from the previous slide. If you have strong correlations between the features, what does that mean for the scatter plot? If the correlation between x_1 and x_2 is large, the points are spread out along a thin, elongated cloud. This means that when you do the rotation and then stretch everything so that each direction has norm one, one singular value will be large — that is the direction in which you have to stretch the data a lot — but the second singular value will be very small: that is the direction in which you have to squeeze the data a lot to get this high correlation. The same holds with many features: if your regression problem has highly correlated features, then X (and hence X^T X) will have some small singular values, possibly close to zero. And then you need to invert — here we see it explicitly: we invert the elements of S, the singular values, and if some of them are small, after inversion they become very large. This means the beta-hat vector can get very large coefficients; and if you think about the uncertainty of these estimates, whenever you have to divide by something close to zero, the number blows up and the uncertainty blows up too. So you get large coefficients, high uncertainty, and possible numerical problems whenever you divide by something close to zero — it's a mess. This is why strong correlations between features often indicate a problem, or a whole bunch of problems, for the performance of your model, because of these inverses. How to deal with that, and what it means in practice, is what we will be talking about in the next two lectures. Thank you.