Hello, and welcome to lecture 10 of our introduction to machine learning course. Today we're going to talk about PCA, principal component analysis, and let me start with the same reminder as last time. We spent a lot of time talking about supervised learning, where we have some input data X and some output data Y, and the task is to learn the mapping between X and Y; those are prediction problems. Then we started to talk about unsupervised learning, where we don't have any Y, so we are not predicting anything. We just have some data, the matrix X, and we try to learn some structure that is present in the data. This is not a prediction problem anymore; it's an exploration problem, if you will.

Last week we talked about clustering. In clustering we want to split the rows of X, which correspond to the samples in the dataset (as a reminder: rows are samples, columns are features), into several groups, and we say this is cluster one, this is cluster two, this is cluster three, if, for example, we want three clusters. That's the output of a clustering algorithm. Today we're going to talk about a different thing, dimensionality reduction, which operates on the features, the columns of the X matrix. We want to reduce the number of columns in X. Usually we don't want to just pick some of the existing features and discard the others; we want to somehow treat all the features together and reduce their number. Let's say we had a thousand features in the dataset and we only want to keep 10: we want to transform the existing thousand features and produce 10 new features, and these are the features we use later on. That is what we will understand by dimensionality reduction: reduce the dimensionality of the data. Importantly, we're not just selecting features here, we're transforming them.

So why would we want to do that? Very broadly, I think there are two reasons why one may want to do something like this. One reason is to explore the data, to obtain some insight into its structure: after you obtain this reduced number of features, you inspect them, you look at these new features, and hopefully you learn something about your data. A different reason is to do it as a pre-processing step: maybe you reduce the number of features from a thousand to ten and then put this new, smaller dataset with only ten features into some other algorithm, perhaps a supervised learning algorithm that predicts something else. There can be different reasons for wanting to do that. We're going to talk about both reasons today, as applied to principal component analysis; PCA can be used for both goals.

All right. Let me start, as almost always, with some toy data. Here is a two-dimensional dataset, feature one and feature two; they are correlated, and these are my samples. In this very simple toy setting I'm going to reduce the number of features from two down to one. PCA is a linear method, which means that we are transforming our data X into X times some vector w, which means we're just projecting all the samples onto a new axis. So again, we're not just picking one of the existing axes.
We're projecting onto a new axis. Let me try to illustrate this with a ruler. This is going to be my new axis, and it can be anything; let's say I put the axis like that. Then each point gets projected onto it orthogonally, and we look at the coordinate obtained on this new axis. Then imagine that you take this one-dimensional data and forget about the original data: now you have the positions of your points on this ruler, in one dimension, and that's your new dataset. You reduce dimensionality by projecting everything onto one dimension.

The question is how to choose that dimension. You can project like that, or like that, or like that; you can imagine this axis rotating, and that is the freedom you have. You could also just project everything onto the horizontal axis, or onto the vertical axis, or onto some diagonal axis. So what axis would you choose here, if you are only allowed one? Is there, in some sense, a special axis? Maybe you can already guess from the shape of this scatter plot what you would choose as the best axis. I haven't defined what "best" means yet, so I want you to look at the scatter plot and think a little bit about what it might be. One remark: when we define the problem like that, it's enough to consider only unit vectors, vectors of unit length, because any axis can be defined by a unit vector pointing in the direction of that axis.

Okay, so here is one particular axis drawn with the linear projection. How would we choose an axis? I can give you several possible loss functions, several possible criteria by which to choose an axis; in particular, I will give you two. One is that you may want to minimize the reconstruction error. By reconstruction error I mean these distances here, the distance between the original point and its projection onto this one-dimensional subspace. Sometimes they are larger, sometimes smaller. Imagine you sum up these squared distances across the entire dataset: that is your squared reconstruction error, and it makes sense that we may want to minimize it, because we want to preserve as much information as possible about the original data in our new, reduced data. Okay, that makes sense.

Here's a completely different criterion: we may want to maximize the variance. We project the data onto one dimension, now we just have these coordinates and nothing else, and we compute their variance. That's it: you compute the variance and you want to maximize it. Why would you want to do that? Because, intuitively, if you had some variation in your original raw data, and after projecting everything onto some dimension the variance of the resulting data is very small, then maybe you lost a lot of the useful variation that was present originally. So you want to find a projection where the data is spread out, hoping that when the data are spread out, you are also preserving some useful structure. That's not guaranteed, but that's the hope.
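To make this concrete, here is a small numpy sketch of the projection and of both criteria. This is my own illustration, not code from the course, and the toy covariance numbers are made up: for a few candidate unit vectors w it computes the variance of the projected coordinates Xw and the mean squared reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
# correlated two-dimensional toy data, centered
X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 2.0]], size=500)
X -= X.mean(axis=0)

for deg in [0, 30, 60, 90]:
    a = np.deg2rad(deg)
    w = np.array([np.cos(a), np.sin(a)])            # unit vector defining the candidate axis
    coords = X @ w                                   # one-dimensional coordinates after projection
    recon = np.outer(coords, w)                      # projected points put back into two dimensions
    var_proj = coords.var()                          # variance of the projection
    mse = ((X - recon) ** 2).sum(axis=1).mean()      # mean squared reconstruction error
    print(f"{deg:3d} deg  variance={var_proj:.3f}  error={mse:.3f}  sum={var_proj + mse:.3f}")
```

If you run this, the last column comes out the same for every angle, which is exactly the equivalence we are about to prove.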
And by the way, this can be set up in any number of dimensions: you can have ten original dimensions and still be looking for a single axis, so you project everything onto one dimension even starting from ten dimensions; it doesn't have to be two. You project everything onto one dimension and then you either want to minimize the reconstruction error or to maximize the variance of the projection.

Now, the amazing thing is that these are actually the same objective; the two criteria are equivalent, and principal component analysis does both. You can introduce principal component analysis one way or the other; these are equivalent. Let me prove that; it's actually pretty easy to see. I said it's amazing because the first time you see these two criteria defined they don't look very similar, but once you think about it a little, it's actually pretty clear. Here is my two-dimensional data, and I'm focusing on one point: this is x_i, I consider this one sample, and here is the axis. Let's say the axis is now fixed and I'm projecting onto it. This e_i is my reconstruction error, and this d_i is the new coordinate of the point after the projection. By Pythagoras' theorem, this squared length plus this squared length equals the squared norm of the original vector: d_i squared plus e_i squared equals the squared length of x_i. I can rotate the axis as I want, but d_i squared plus e_i squared will always equal this squared norm, which does not depend on the axis; it's just some constant. That's neat. We can sum this up over the entire dataset and divide by the number of samples, and what we then have on the right-hand side is just a constant: it depends on the dataset but not on the chosen axis. On the left we have the reconstruction error, the sum of the squared errors, plus the sum of squared deviations from zero. Let's assume for the moment that all features are centered, i.e. we subtracted the mean in advance; then this second term is the same as the sum of squared deviations from the mean, which is the variance by definition. So what we have is that, for centered data, variance plus mean squared reconstruction error equals a constant. If one goes up, the other goes down, which means that whenever one is smallest the other is largest. End of proof; that's what we wanted to show. We can decide to minimize this term or to maximize that term; it's actually the same thing. Okay, so that's neat.

One comment here: whenever I say "minimize the squared error" you may think of linear regression, because in linear regression we are also minimizing the squared prediction error across the dataset. This may seem similar, but it is a very different thing. Let me try to illustrate this so that there's no confusion. This is my illustration of PCA, the same picture as before: we have two features, and some line like that will probably make these orthogonal errors small.
Something like that may well be the solution to the PCA problem. But I can instead say that I'm predicting this feature from that feature: I'll call this one y and that one x, and then I can consider a regression problem. The difference is that the loss function there is also a sum of squared distances, but those are squared errors between my prediction and the corresponding actual y, so all these error segments are parallel to the y-axis; that's how we defined regression in one of the first lectures. Whereas here, all these segments are perpendicular to the chosen axis; they are not parallel to either coordinate axis. And this makes a big difference; it makes these problems not equivalent. Regression is not symmetric: it treats the two features differently, because we take x as given and try to predict y, so the errors go vertically. Here it is symmetric: x1 and x2 are on the same footing, and I'm not trying to predict anything. In fact, the two problems may have, and usually will have, very different solutions.

This may not be super obvious from the scatter plot I drew, so let me make it clearer with this example, where the dataset is almost spherical but stretched just a little bit in the diagonal direction, only a little. So the data cloud is almost spherical, slightly stretched. If you think about minimizing the orthogonal errors, it's clear that as soon as the cloud is stretched even a tiny bit in that direction, the diagonal is the axis that minimizes those errors, basically by symmetry; it's pretty obvious. For regression that is not the case: the diagonal is a bad regression solution, because you have a lot of points with large error below the line on one side and a lot of points with large error above the line on the other side, so the regression loss actually decreases if you rotate the line towards the horizontal axis. And if you think about what happens as you start with a spherical distribution and gradually stretch it, the regression line starts horizontal and then slowly rotates, because the regression solution is a continuous function of the input data; remember the formula for beta hat: transforming the X data a little bit will only change beta hat a little bit. This consideration, by the way, immediately tells you that for PCA this is not the case: in PCA you can change the data a tiny little bit and the solution of the PCA problem can jump elsewhere. It's not a continuous function of the data, but we will see this later on. Okay, so: not equivalent problems. Here we're talking about reconstruction error, and that's a different kind of error.
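Here is a small sketch of my own, with made-up numbers, contrasting the two solutions on the same cloud: the OLS slope for predicting x2 from x1, versus the slope of the leading eigenvector of the covariance matrix, which we are about to derive as the PCA solution.

```python
import numpy as np

rng = np.random.default_rng(1)
# almost spherical cloud, slightly stretched along the diagonal
X = rng.multivariate_normal([0, 0], [[1.1, 0.1], [0.1, 1.1]], size=2000)
X -= X.mean(axis=0)
x1, x2 = X[:, 0], X[:, 1]

beta = (x1 @ x2) / (x1 @ x1)                   # OLS slope, vertical errors: stays close to horizontal
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
w = eigvecs[:, -1]                             # leading eigenvector, orthogonal errors
print("regression slope:", round(beta, 3))     # roughly 0.1
print("PCA axis slope:  ", round(w[1] / w[0], 3))   # close to 1, i.e. the diagonal
```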
Good. Now let's try to define this mathematically. What is the loss function? We'll write it in formulas and then try to solve it. Here is minimizing the reconstruction error: it's the squared norm of what, exactly? I have my original data X, and then I have my projection: I take my X data and project it onto some w, which is just a unit vector defining the axis. These are my projections, but Xw is now a one-dimensional object, and I need to somehow compare it to the original two or more dimensions. How do we do that? Think about the scatter plot on the previous slides: you're projecting these points, so I had some points and I projected them somewhere. What is the two-dimensional coordinate of a projected point? It's w, the unit vector along the axis, times the coordinate after projection. So Xw gives the coordinates after projection, and then I multiply by w transpose again to put this back onto the axis in the original space, and now I can subtract; these differences are my errors. The Frobenius norm, as we discussed before, sums over the entire dataset, so this is my summed (or averaged) reconstruction error across all samples. By the way, this w w^T that appears here is mathematically known as a projection operator, because what w w^T does is take x and project it onto the one-dimensional subspace while still keeping both coordinates: the result is a set of vectors in two dimensions, but all of them lying on one line. Okay, so now we have a loss function.

But let us also consider the maximizing-variance objective. We project everything: Xw are my new coordinates, and, as we agreed, X has all features centered. So if I want to compute the variance, I just take the squared values and sum them up; that's the scalar product of this vector with itself, so it's the vector transposed times itself, divided by the number of samples. What appears in the middle is X^T X over n, which is the sample covariance matrix of my data, so I can rewrite it like that: the covariance matrix of the data gets multiplied by w on the right and on the left, and the resulting thing is just one number, which is the variance of my projection. This is subject to w being a unit vector, having unit length; otherwise it doesn't make sense to maximize this, because I could take w very long and that alone would increase the value of the product: it can go to infinity, which makes no sense. So we say w has to lie on a circle, or on the unit sphere in higher dimensions, and under this constraint we want to maximize this objective. C is given: it's the covariance matrix of the data. So we have the matrix C and we want to maximize w^T C w subject to this constraint. This formulation is equivalent to the reconstruction-error one; I proved it for you with a simple argument using Pythagoras' theorem. One could also write it down algebraically and see that it's the same, but it would be the same proof, just put down in formulas.

Okay, now we want to solve it. How do we solve it? We want to maximize this term in the title of the slide under the constraint that w has unit length. To solve a problem under a constraint we can use Lagrange multipliers, which we discussed briefly when we talked about ridge regression, I believe back in lecture four. In this case it's actually very simple. We have the term we want to maximize; I'll put a minus in front so that we're minimizing it, and then we have the constraint that w^T w equals one, so we put a Lagrange multiplier in front of it, and that's what we minimize now; that's how Lagrange multipliers work. We set the partial derivative with respect to w to zero. Everything here is quadratic, so the derivative is very simple: from the first term you get C w times two, and from the constraint term you get lambda w times two; I've already cancelled the twos, and the leftover constant times lambda doesn't contribute anything to the derivative.
So we get C w = lambda w: whenever we are at a maximum or minimum of this function, this holds. This is a curious equation. It means that our unit vector w has to have a particular property: when multiplied by the matrix, it just gets scaled. Usually, when you have a vector and a square matrix (and C is a square matrix; in ten dimensions it's ten by ten and the vector is in ten dimensions), the matrix acting on the vector rotates it somewhere and also changes its length; it will generally not point in the same direction after multiplication. But sometimes you have the situation where you multiply by the matrix and the vector just gets scaled: it stays on the same line, stretched but not rotated. Such a vector has a name: it's called an eigenvector of the matrix C, and whenever C w = lambda w holds, w is an eigenvector of C, with lambda being the eigenvalue of that eigenvector.

So, since a matrix can have several eigenvectors, what we proved here is that if we want to maximize the variance, we should take one of the eigenvectors as the solution. Which one should we take? Well, the variance will be given by w^T C w, which is w^T lambda w, and since w^T w is one, the variance is just lambda. The eigenvalue gives you the resulting variance. So if you want the maximum variance, you have to take the eigenvector with the largest eigenvalue. Here's the solution: we take the eigenvector of the covariance matrix with the largest eigenvalue. Well, we "solved" it in the sense that we said you need to find this eigenvector; how to actually find eigenvectors is a complicated problem that you can't do by hand, and for which you use standard implementations existing in any programming language that will give you the eigenvectors of a square matrix. This is not a numerical optimization course, so we are not going to discuss how the program actually finds the eigenvectors, which is another interesting question, but not for today. Let's say we have a routine that gives us the eigenvectors of any given matrix C. If you have that, you take the one with the largest eigenvalue, and that's your principal component number one.
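In practice you just call such a routine. Here is a minimal numpy sketch, my own, with made-up toy data: np.linalg.eigh does the eigendecomposition of a symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 2.0]], size=500)
X -= X.mean(axis=0)                       # PCA assumes centered features

C = X.T @ X / len(X)                      # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues come back in ascending order
w1 = eigvecs[:, -1]                       # eigenvector with the largest eigenvalue: PC1 direction
print("first principal axis:", w1)
print("variance it captures:", eigvals[-1])   # equals w1 @ C @ w1
```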
Okay, I now want to spend five minutes talking about eigenvectors and eigenvalues of covariance matrices more generally. In our case we're dealing with a covariance matrix, which is a symmetric square matrix with all real entries. One can prove (I won't prove it here) that any such matrix has p eigenvectors: if it lives in ten dimensions, it has ten eigenvectors, and you can choose these ten eigenvectors so that they are all orthogonal to each other. The second part is actually easy to show: you can quite easily prove that if you have two eigenvectors with different eigenvalues, then they have to be orthogonal. That's an exercise for you; for the first part, just take my word that you can find p of them.

One thing that I do want to prove is this: if you have two eigenvectors w1 and w2 and they are orthogonal, as above, then the term w1^T C w2 is also zero. This follows immediately from the above, because C w2, with w2 being an eigenvector, just scales w2; the lambda comes out in front, and you're left with w1^T w2, which is zero. So if you have two orthogonal eigenvectors, you can plug the covariance matrix in the middle and the product will still be zero, which means that after projecting your data onto w1 and onto w2, the covariance between the two projections is zero, and hence the correlation is also zero. This is very important: the correlation of your data after projecting onto two different orthogonal eigenvectors of the covariance matrix is zero. That's actually pretty neat, and I'll show you why in a second, but let's continue with the math a little bit.

This implies that if you rotate your whole coordinate system so that every basis vector is an eigenvector, i.e. you rotate into the eigenvector basis, then the covariance matrix becomes diagonal: zeros everywhere and only the lambdas on the diagonal. Let me walk you through this. X is my data. Say I take all eigenvectors of C, stack them as columns in a matrix, and call it V. Then I multiply X by V to get my rotated data; V is an orthogonal matrix, so it just rotates the coordinate frame, and XV is the new, rotated data. If we compute the covariance matrix of this rotated data, which is the data transposed times itself over n, then in the middle you get C sandwiched between eigenvectors, and by the property above, whenever the entry is off-diagonal (two different eigenvectors) you get zero. So you get a diagonal matrix; I call it Lambda. It has the eigenvalues on the diagonal and zeros everywhere else. Which means I can also multiply by V and V^T on the left and on the right and write my original covariance matrix as C = V Lambda V^T. So any square symmetric matrix, any covariance matrix, can be decomposed like that; it's called an eigendecomposition: p eigenvectors, the corresponding eigenvalues, and you can rewrite the matrix in this form.

Let's step back after this math and consider again what we learned. Here is my scatter plot from the beginning of the lecture, and we're talking about choosing the best projection axis. In fact, I can now give you a third way to think about choosing the best axis. We can maximize the variance of the projection: look, the variance is really small if you project everything onto this axis, but here it's larger, larger, larger, and somewhere here the variance is largest, as measured along my ruler. Or I can minimize the reconstruction error, and here the reconstruction errors are as small as they can be; this is equivalent, we proved that.
And the third way to think about it is to consider the second, orthogonal axis that rotates together with the first. In this position there is no correlation between the projection onto the ruler axis and the projection onto the orthogonal axis: if you turn your head and look at the scatter plot in this rotated frame, it becomes uncorrelated. In the original data there is correlation; in any other rotated position there is still correlation; there is one position where the correlation is zero, and that is the same axis that maximizes the variance and minimizes the error. That's the eigenvector, the first eigenvector with the largest eigenvalue of the covariance matrix, and orthogonal to it is the second eigenvector, with, in this two-dimensional case, the smallest eigenvalue; there are only two. These eigenvectors are orthogonal, and the projections onto them have correlation zero. If you think about this in ten dimensions, you will have ten eigenvectors that are all orthogonal, and the projections onto them are all uncorrelated.

Okay. So far I was always talking about choosing the first axis: I defined the loss function for choosing the first axis that maximizes the variance and minimizes the error. But PCA actually works in a greedy fashion: you can similarly define subsequent principal component axes, and they turn out to be just the subsequent eigenvectors. The first eigenvector defines the direction of the first principal component; the second eigenvector defines the direction of the second principal component, which has the maximum variance under the constraint of being orthogonal to the first one, and so on and so forth. So once you have done the eigendecomposition of the covariance matrix, you have all principal axes already in there: the eigenvectors give you the principal axes of your data. In this sense you sometimes hear people saying that PCA is "just a rotation": PCA takes your original data and rotates it so that all correlations become zero. That's an equivalent view of principal component analysis.

Okay, one last remark and then we will proceed to some examples: the relationship of all this to the singular value decomposition. You might have noticed that this is pretty similar to what we discussed before for SVD. We talked in several lectures already about the fact that you can take your data X and decompose it as X = U S V^T, where U holds the left singular vectors, V the right singular vectors, they are orthogonal, and S is a diagonal matrix with the singular values. Now we can compute the covariance matrix, which is X^T X over n, plug this decomposition in, U^T U drops out because it's the identity matrix, and what you're left with is V (S squared over n) V^T. S squared over n is a diagonal matrix, so we can call it Lambda. So if you first do the SVD of X, the corresponding covariance matrix has exactly the same form as the eigendecomposition from two slides back: these are the eigenvectors of my covariance matrix, and these are the eigenvalues. Which means immediately that the eigenvectors of the covariance matrix are the same thing as the right singular vectors of the data matrix, and the eigenvalues of the covariance matrix are the squared singular values divided by the sample size. So eigendecomposition and singular value decomposition are very, very similar things, and you can obtain the eigendecomposition of the covariance matrix by doing the SVD of the original data matrix.
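Here is a quick numerical check of this identification, my own sketch with random toy data: the SVD of the centered data versus the eigendecomposition of its covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # some correlated 5-dimensional data
X -= X.mean(axis=0)                                        # centering matters here
n = len(X)

U, s, Vt = np.linalg.svd(X, full_matrices=False)           # X = U S V^T
eigvals, eigvecs = np.linalg.eigh(X.T @ X / n)             # covariance eigendecomposition, ascending

print(np.round(eigvals[::-1], 4))                          # eigenvalues of C ...
print(np.round(s**2 / n, 4))                               # ... equal the squared singular values / n
w1_eig, w1_svd = eigvecs[:, -1], Vt[0]                     # leading eigenvector vs first right singular vector
print(np.allclose(w1_eig, w1_svd) or np.allclose(w1_eig, -w1_svd))   # same direction up to sign
print(np.round((X @ Vt.T).T @ (X @ Vt.T) / n, 3))          # the rotated data X V has a diagonal covariance
```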
All of that holds (important remark, just a reminder) only if X is centered, which means all features have mean zero. Why? Because if that's not the case, then X^T X over n is not the covariance matrix: the covariance between two features is the average product of their deviations from the mean, so we would have to subtract the means before multiplying. But if the means are zero, there's nothing to subtract and we can write it like that. So, as was always the case in the regression problems, it's very convenient to first center the X matrix.

Okay. As I said in the beginning, one can do PCA for two reasons, or maybe there are more, but these are the two I usually encounter: either you want to explore the data, or you want to pre-process the data. Let's consider these two goals now and see how PCA helps with each. We start with data exploration, and here is one figure that I particularly like. This is a small and, in a way, simple dataset where the samples are wines (Italian wines; every dot here is one particular wine), and each wine was measured with respect to something like ten different features. The features can be the alcohol content of the wine, the concentrations of various compounds in the wine, and, I see, color: the color was somehow coded numerically. I think these are all red wines, but the hue of the red can still differ a little. So you measure every wine with these ten or so characteristics, and that's your data matrix X.

Then you can do PCA on that and plot the result. When you see PCA results plotted, it's usually two dimensions: you don't plot things in one dimension, you find principal component number one and principal component number two, which corresponds not to finding the best one-dimensional projection but to looking at the best two-dimensional projection. Where is the plane in the original ten-dimensional space such that, when I project everything onto it, the projection has the highest variance, and so on? That plane is shown here: PC1 on the horizontal axis and PC2 on the vertical axis. If you look only at the points for now and ignore the arrows, then from the colors of the points it's clear that there is quite some structure captured here. What are the colors? These are three grape varieties, three cultivars, and they are neatly separated here: one variety here, another here, another there, and they almost don't overlap. But PCA didn't know about that, right?
It just found the plane with the most variance. But it turns out that if you know these are three different grape varieties, then of course you would want this to be visible in a two-dimensional projection, and it is, even though this was an unsupervised problem that knew nothing about the varieties. In fact, if you didn't know about the varieties and plotted everything in black, you would maybe still see some structure here; perhaps you wouldn't see three clusters, but you might see that there seems to be some one-dimensional structure.

Okay, so that's the points. The second nice thing here is the original features. One can make a plot like this; it's called a biplot. You plot the projected points, but then for each original feature you also find the correlation of this feature with principal component one and with principal component two, and plot an arrow whose coordinates are these two correlations. So if something is strongly correlated with PC1, you see a long arrow pointing to the right or to the left, and if something is strongly correlated with PC2, it points up or down. By looking at these arrows, focusing on the longest ones, you can see which original features basically drive principal component one and principal component two. This may help you interpret PC1 and PC2; often, in the exploration case, you want to understand something like: here is my principal component one, the largest mode of variation in my data, and it is actually associated with, say, phenol levels in my wines. So this kind of plot can be helpful.

Okay. One thing that I didn't explain about this plot: it says 37% explained variance here and 19% explained variance at PC2. What does that mean? Let me explain this bit. A very important object here is the sum of all eigenvalues. After you have transformed your covariance matrix into diagonal shape, you can sum all the diagonal values, and this is called the trace of the covariance matrix; the trace of any matrix is the sum of its diagonal elements. The nice thing about the trace is that if you rotate your coordinate frame, the trace doesn't change. So the sum of the eigenvalues is the same as the sum of my original variances, because that was the trace before I did the PCA: I had my covariance matrix C, it had some values off the diagonal, and on the diagonal it had the variances of my features; if you sum those up along the diagonal you get the sum of the variances, and by this property of the trace, that's the same as the sum of the eigenvalues. It's actually easy to see if you know how to work with traces: the trace of my original covariance matrix can be written as the trace of V Lambda V^T, and under the trace one can use the property (which I won't prove, it's just an algebra fact) that in a product of three matrices you can move the last one to the front without changing the trace; you bring V^T to the front, V^T V is the identity matrix, and you're left with just the trace of Lambda. So this is very useful.
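A two-line sanity check of this trace argument, my own sketch with random toy data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))    # correlated toy data
X -= X.mean(axis=0)

eigvals = np.linalg.eigvalsh(X.T @ X / len(X))
print(X.var(axis=0).sum())    # sum of the original feature variances, i.e. the trace of C
print(eigvals.sum())          # sum of the eigenvalues: the same number
```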
This means that you started with some features that had some variances, you did the PCA, and the sum of the eigenvalues is the same as the sum of your original variances. We order the principal components by their eigenvalues, which means we say the first principal component explains, or captures, the fraction lambda 1 over the trace of the total variance; the next one captures the fraction lambda 2 over the trace, and together all of them capture 100 percent of the variance. That's a metric that allows you to quantify how much of the original, high-dimensional variance present in the data is captured by each of your principal components.

One comment on that example: you may have noticed that the features of the wine dataset were very different. There was a concentration of something, and there was the color of the wine; maybe the concentration is given in some units and the color in completely different units. And we're talking about variances here, so one feature can have a much larger variance than another. You often have data like that, and it almost always makes sense to normalize all the features before you proceed with PCA. If your data has features in different units, it probably doesn't make sense to do PCA on it just like that; it makes sense to standardize all the features, which means center them and divide by the standard deviation. After you've done this, every feature has standard deviation equal to one, but there are still correlations between them, so you can still do PCA. Now you actually know the trace of the covariance matrix: it just equals the dimensionality, because the original variances are all one; but the eigenvalues can be anything.

I will give one example of that. This is a different dataset: the performance of several athletes in a competition where they compete in several different disciplines. They run 200 meters, run 400 meters, run 800 meters, jump, throw the javelin, and so on. My rows are people, athletes, and my columns are the times in seconds (how fast they ran), the distances in meters (how far they threw something), the heights in meters (how high they jumped), and so on. If you take this data as it is and do PCA, you get something like this. Look at this biplot now: I just have two very long arrows, one almost exactly horizontal and one almost exactly vertical, and they are "run 800 meters" and the javelin throw result. Why does it look like that? Well, it looks like that because these will be the largest numbers: running 800 meters takes time, so the result, probably in seconds, will just be a large number, and if these are large numbers, they will also tend to have large variance. So you have a lot of variance here, and similar reasoning shows that you have a lot of variance there, compared to all the other features.
So you just end up with PC1 basically equal to this feature and PC2 basically equal to that feature, and that's not very informative: you didn't need to do this analysis to see that these have the highest variance. You get a much more informative picture if you standardize all features and then do PCA. Another way to say it is that you do PCA on the correlation matrix rather than on the covariance matrix: you do the eigendecomposition of the correlation matrix, or, equivalently, you standardize everything and then do the eigendecomposition of the covariance matrix. Then you get something that looks much more meaningful. Now all features contribute, and you can see that running 800 meters and running 200 meters are correlated (they point in roughly similar directions), and they seem to be anticorrelated with the jumping results: some people run faster and some people jump higher. The javelin axis is orthogonal to that, so it's an orthogonal axis of variation. A very meaningful, actually quite insightful plot that you get here after standardizing, but not before.

A question that often comes up when you do this is how to choose how many principal components to look at. So far I only showed you two-dimensional plots, but you can in principle make a similar plot for principal component three versus principal component four, and so on. So how many do you choose? The right object to think about here, mathematically, is the spectrum of the covariance matrix, which is the set of all eigenvalues. If you plot them sorted, you have p eigenvalues that decrease, and what you want to focus on are the initial, large eigenvalues. Usually, for high-dimensional data, you have a tail of small eigenvalues that are all very similar and just slowly decrease. The eigenvalues that stand out usually correspond to some interesting modes of variation in the data, and the rest is basically noise that gives you this slowly decreasing tail. There are a lot of heuristics for how to find the number of interesting eigenvalues. For example, you look for the place where the decay is rapid at first and then starts going down more slowly; you find a place like that and say: these first four eigenvalues may be meaningful to me. This is called looking for an "elbow" in the plot. Or you can say: I will take as many PCs as needed to capture, together, 90% of the variance; I keep summing up eigenvalues, and once I reach 90% of the total sum, I stop. These are all just heuristics. There are more objective methods; I know of two broad families: one can either do cross-validation, or one can use shuffling of the features. It takes a little while to explain how to do cross-validation for PCA, so I will not explain it today, but I do want to explain the shuffling-of-the-features method.
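Before the shuffling method, here is roughly what the standardize-then-PCA pipeline and the explained-variance bookkeeping look like in scikit-learn. This is a sketch of mine using sklearn's built-in wine data, which is similar in spirit to the wine example from earlier, and the 90% threshold is just the heuristic mentioned above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data                          # wines as rows, features on very different scales
X_std = StandardScaler().fit_transform(X)     # center and divide by the standard deviation

pca = PCA().fit(X_std)
ratios = pca.explained_variance_ratio_        # lambda_i divided by the sum of all lambdas
print(np.round(ratios, 3))
k = np.searchsorted(np.cumsum(ratios), 0.90) + 1
print("components needed for 90% of the variance:", k)

scores = PCA(n_components=2).fit_transform(X_std)   # the PC1/PC2 coordinates you would scatter-plot
```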
Back to the toy example: here are the original eigenvalues of my original matrix. One can do a very simple but nevertheless useful trick: take the X matrix and shuffle every column independently. Shuffling means I just change the order of the values in column one, and in column two, but separately from column one, and so on for all columns. What happens is that the variance of every column is unchanged (I changed the order of the values, but the variance is the same), so if you think about the covariance matrix, the diagonal values stay exactly as they were. But if feature one and feature two were correlated, then after you shuffle them independently they are not going to be correlated anymore: you killed the correlation. Well, you will not have exactly zero correlation, because after shuffling there may still be some correlation left by chance, but you will have a covariance matrix with the same values on the diagonal and small random values off the diagonal; the larger your sample size, the smaller these off-diagonal values will typically be.

So you shuffle once, do the eigendecomposition, and plot the eigenvalues. What you will typically find is that you no longer have any very large eigenvalues. If your original data had a lot of correlations (remember the scatter plot from the first slides: correlated data gives you a large mode of variation along the diagonal axis), then after shuffling you don't have strong correlations anymore, and therefore you don't get those large eigenvalues; you will typically have smaller values here. However, the total sum is the same, and that's important, because the variances of the features, the trace, stayed the same, so the sum has to be the same: if these went down, those had to go up. You can do this a bunch of times and then look for the eigenvalues that are above what you get by shuffling; in this case, let's say there are three of them, and I say these are my three significant eigenvalues. They are, in some sense, above the noise level that you obtain by shuffling the data. That's a very useful trick, actually, useful in other contexts too.
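Here is a sketch of this shuffling trick; it's my own implementation of the idea, with made-up toy data, not code from the course.

```python
import numpy as np

def shuffled_eigenvalues(X, n_repeats=20, seed=0):
    """Eigenvalues of the covariance matrix after permuting every column independently."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_repeats):
        Xs = np.column_stack([rng.permutation(col) for col in X.T])   # kills the correlations,
        out.append(np.linalg.eigvalsh(Xs.T @ Xs / len(Xs))[::-1])     # keeps means and variances
    return np.array(out)

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))    # correlated toy data
X -= X.mean(axis=0)

real = np.linalg.eigvalsh(X.T @ X / len(X))[::-1]          # sorted descending
null = shuffled_eigenvalues(X).max(axis=0)                 # largest value seen under shuffling, per rank
print("eigenvalues above the shuffling noise:", np.sum(real > null))
```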
Next is principal component analysis for pre-processing the data. So far I gave you a couple of examples of using PCA for data exploration; now, as I mentioned, one can also use PCA for pre-processing. I take my data X, find some leading eigenvectors (say I take k of them, a small number), put them as columns in a matrix V_k, and multiply X by V_k to get a smaller matrix with only k features; then I use this smaller matrix for downstream processing. This can have some advantages. What are the properties of this matrix? It's smaller, obviously; it takes less space; if you use it in some regression problem, you have fewer coefficients to fit. It has lower dimensionality and smaller size, and also all correlations are zero; we discussed this before, so all these new features will be uncorrelated. And if you kick out all the small eigenvalues, the result will not have small eigenvalues anymore; remember, in the lectures on regression we talked a lot about how small eigenvalues cause problems (large uncertainties, standard errors, and estimation problems) in the regression setting. So it's often a good idea to get rid of small singular values, or eigenvalues. Of course, you can also just use all the eigenvectors, which means you merely rotated the data: it has the same size and the same number of features; all features now have correlation zero, but all the small eigenvalues are still present, so you just rotated the data and usually this does not achieve anything. You only get something different if you rotate the data to the eigenvector basis and then kick out the principal components with very small variances.

It turns out that if you use this for regression, it is actually very closely related to ridge regression. I think this is an often under-appreciated fact, and I would like to stress it here. So let's discuss the relationship between ridge regression and this procedure where you take the X matrix, do PCA, keep only the leading principal components, and then use those to predict some y. This even has a name: it's called principal component regression, PCR. So let's discuss the relationship between PCR and ridge regression. I will only talk about this very briefly. As a reminder, if you have the matrix X and the vector y, then the ordinary least squares solution beta hat is (X^T X) inverse times X^T y, and the predicted values, which I called y hat, are X times this beta hat. If you plug in the singular value decomposition of X, this simplifies a lot and you just get U U^T y, where U holds the left singular vectors of X. So this is standard regression, just written differently, with no regularization.

Now, we talked in the lecture on ridge regression about what happens if you add the ridge penalty. It turns out that if you add the ridge penalty, you add a term inside the inverse, and if you rewrite everything in terms of singular vectors and values, then between this U and U^T you get a diagonal matrix with the squared singular values divided by the squared singular values plus lambda. If lambda is very small, this doesn't do anything; but if lambda is large, then it penalizes the small singular values (the small singular values get pushed towards zero with the help of lambda), while the large ones, the ones much larger than lambda, basically don't change much. So, just to recap what I said in lecture four: what ridge regression does is shrink the directions in which the original data had little variance, the directions corresponding to small singular values. Principal component regression does a very similar thing, but it's a hard thresholding of the singular values: we say, here are my singular values, I will just take the first ten and kick out the rest. You can think of that as using a diagonal matrix that has ten ones, and then the rest is just zero.
That's my diagonal matrix; I plug it in there instead, and that gives the solution of the PCR problem. So, as you see: one version has smoothly decreasing values on the diagonal, the other has ones followed by zeros on the diagonal. One does hard thresholding of the singular values, the other, you could say, does soft thresholding of the singular values. But it's very similar in spirit and qualitatively. And the parameter k can serve as a regularization parameter: if you take a larger k, you regularize less, which means you have smaller bias and higher variance; and if you increase the regularization strength, meaning you decrease the number of PCs you keep, then your variance goes down but your bias goes up. So you have the same bias-variance tradeoff, now as a function of k. In practice you're usually better off just using ridge regression and not thinking about this, but I think it conceptually helps a lot to have this mental picture of ridge regression doing something qualitatively similar to kicking out the small PCs in the data.

This is not guaranteed to always work, of course. You can have a situation where the smallest principal component is actually the one that predicts your response y: you have a lot of high-variance noise in the data, and there is some direction in the data that has very small variance, but that is exactly the direction that predicts the response. You can construct a situation like that, and in that case PCR will fail, and ridge regression will also fail. And if you don't have enough data (your sample size is so low that you can't use ordinary least squares because of the high-variance problems and overfitting) and you don't have a priori knowledge that this small-variance direction is the one you should be looking for, then you're just doomed; you will not be able to extract these regression coefficients reliably. So in a sense, ridge regression, principal component regression and so on all carry the implicit assumption that it is the large-variance directions that are meaningful; or, to say it another way, the assumption that the regression coefficients should be small, because that's how ridge regression operates. And this is just an assumption: it often works, but it doesn't have to.
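To see the hard-versus-soft thresholding point concretely, here is a sketch of mine; it assumes centered X and y with no intercept, and the lambda and k values are arbitrary. Both fitted-value formulas follow from plugging the SVD into the expressions above.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))   # correlated features
X -= X.mean(axis=0)
y = X @ rng.normal(size=10) + rng.normal(size=100)
y -= y.mean()

U, s, Vt = np.linalg.svd(X, full_matrices=False)

lam = 10.0
ridge_factors = s**2 / (s**2 + lam)                   # soft shrinkage: large s barely touched, small s pushed to 0
k = 4
pcr_factors = (np.arange(len(s)) < k).astype(float)   # hard thresholding: keep the first k directions

y_hat_ridge = U @ (ridge_factors * (U.T @ y))         # ridge fitted values
y_hat_pcr = U @ (pcr_factors * (U.T @ y))             # principal component regression fitted values
print(np.round(ridge_factors, 3))
print(pcr_factors)
```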
Okay, the last thing I want to cover today, briefly, is called probabilistic PCA, which is another perspective on principal component analysis. I want to mention it because it's a very useful perspective; it's related to many other things in more advanced statistics and machine learning, and if you keep studying, you will encounter all these topics, so I want to mention it now so that you have the right perspective on it. PPCA, probabilistic PCA: what is that? I will introduce the problem, and at first it will have nothing to do with PCA, but stay with me and you will see where I'm going.

We consider here a probabilistic model, a latent variable model, and I set it up as follows. There are latent variables z that we do not observe (that's why they are called latent), and they come from a spherical Gaussian in k dimensions: mean zero and identity covariance matrix in k dimensions. These are my hidden variables. And then there are the vectors that we actually observe, the x's. Given z, x comes from another Gaussian distribution whose mean is given like that: I take my z, transform it linearly with some matrix W, and maybe add some mean vector mu; that is the mean of x conditioned on z, and then there is some noise on top.

What does that mean? It means I have some hidden variables, say in two dimensions, so picture a Gaussian in two dimensions. These get mapped with W (I multiply by some matrix W and maybe shift somewhere), and the result can now live in ten dimensions. For each particular value of z, it gets mapped to some ten-dimensional point, and around that point there is another Gaussian, and that's where x comes from. It's a two-stage, hierarchical latent variable model: you first draw z from the spherical Gaussian, then you map it, use the result as a mean, and draw another random vector, and that is what you actually observe. So if you have the matrix X where you put all your data, the rows of this matrix come from this distribution, and each row corresponds to some value of the hidden variable z. That's my latent variable model.

One can see that the overall mean of x is just given by mu, because z has mean zero; that's easy. And it's pretty easy to see that the covariance matrix of x is basically given by W: W directly specifies the covariance matrix, plus sigma squared on the diagonal, which doesn't matter that much for now. The task here would be: I give you some data, the matrix X, and you want to find mu, W and sigma squared under the maximum likelihood principle.
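Just to make the two-stage sampling concrete, here is a small sketch of mine with hypothetical sizes, k = 2 latent dimensions, p = 10 observed dimensions, and an arbitrary noise level.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, k = 2000, 10, 2
W = rng.normal(size=(p, k))       # linear map from latent space to observed space
mu = rng.normal(size=p)           # mean vector
sigma = 0.3                       # isotropic noise standard deviation

Z = rng.normal(size=(n, k))                            # z ~ N(0, I_k): the hidden variables
X = Z @ W.T + mu + sigma * rng.normal(size=(n, p))     # x | z ~ N(W z + mu, sigma^2 I)

# the marginal covariance of x is W W^T + sigma^2 I; the sample estimate gets close as n grows
gap = np.abs(np.cov(X.T, bias=True) - (W @ W.T + sigma**2 * np.eye(p))).max()
print(gap)
```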
So you want to find the parameters of this model that give the maximum probability of observing this matrix X. Notice that this is a latent variable model and we want its maximum likelihood solution, and I mention that because just last week we talked a lot about Gaussian mixture models, and I said that whenever you have a latent variable probabilistic model, you can use expectation-maximization to fit it. That applies to this latent variable model too; the latent variable is not discrete anymore as it was there, here everything is continuous, but you can nevertheless use the EM, expectation-maximization, algorithm to fit it, and it's actually very natural and very simple to formulate here. You have the E step and the M step, as in the previous lecture. In the expectation step, given some fixed values of W, mu and sigma squared, you find the posterior distribution over the z's, over the latent variables; and then in the maximization step, holding the z's and the X fixed, you find the parameters W, mu and sigma that maximize the likelihood given X and given z. With some work one can derive the update formulas for both the E step and the M step; it's actually not that difficult, but I don't have time for that today. I just want to convey the general principle: you can do this as an E step and that as an M step, you iterate, you converge, and thanks to the expectation-maximization theory you will reach a maximum, at least a local one; in this case you will actually reach the global maximum.

The nice thing is that one can prove (I will not prove this either) that the solution you get, the maximum likelihood W, is basically the PCA solution. The maximum likelihood estimate of the W matrix consists of the leading eigenvectors, scaled in a particular way, but the directions of the columns of W are just the eigenvectors. So if you assume a two-dimensional latent variable model, you will get the two leading eigenvectors as the columns of W, which means you are basically doing PCA. That's why it's called probabilistic PCA. This shows that you can give a probabilistic interpretation to a method that didn't look probabilistic at all during the first hour of this lecture; one can view it through this lens. And I mean, it's a little bit different: there's the scaling and so on.
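The update formulas are not derived in this lecture, but for completeness here is a compact numpy sketch of EM for this model, my own implementation of the standard updates (as found, for example, in Bishop's PRML), together with a check that the fitted W spans the same subspace as the leading eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, k = 500, 10, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, p)) + 0.3 * rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                          # work with centered data; mu is just the sample mean

W = rng.normal(size=(p, k))                      # random initialization
sigma2 = 1.0
for _ in range(500):
    # E step: posterior moments of the latent variables under the current parameters
    Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(k))
    Ez = Xc @ W @ Minv                           # E[z_n], one row per sample
    sumEzz = n * sigma2 * Minv + Ez.T @ Ez       # sum over samples of E[z_n z_n^T]
    # M step: maximize the expected complete-data log-likelihood over W and sigma^2
    W = (Xc.T @ Ez) @ np.linalg.inv(sumEzz)
    sigma2 = (np.sum(Xc**2) - 2 * np.sum(Ez * (Xc @ W))
              + np.trace(sumEzz @ W.T @ W)) / (n * p)

# compare the fitted subspace with the ordinary PCA subspace (top-k eigenvectors)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)
Vk = eigvecs[:, -k:]
Qw, _ = np.linalg.qr(W)                          # orthonormal basis for the columns of W
print(np.linalg.norm(Qw @ Qw.T - Vk @ Vk.T))     # close to zero: the same k-dimensional subspace
```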
There's a lot more one could discuss about PPCA here, but I think the important thing is that it's actually nearly equivalent to PCA: the subspace you get is identical. Meaning, if you're looking for a two-dimensional latent variable, you will find the same two-dimensional subspace of your data as PCA does. That's actually a pretty remarkable result. And on this last slide I will mention that you can make this latent variable model more complicated in various ways, and by making it more complicated you obtain genuinely useful latent variable models that are used in different fields. I will just mention one example. Everything stays the same except one thing: previously I had sigma squared times the identity matrix, and now I put in a matrix Psi, which is a diagonal matrix, but an arbitrary diagonal matrix. So if x is a ten-dimensional vector, then Psi has ten diagonal values, and I treat these ten values as parameters of my model: I don't only want to fit W and mu now, I also want to fit these ten variances in the Psi matrix. This is called factor analysis. Some of you may have heard of it; it's a model that is very popular in some of the social sciences, where you also have high-dimensional data and you want to find factors, hidden latent factors, in the data; that's what they are called there, and you do factor analysis for that. And in fact this is very closely related to doing PCA, or at least to doing probabilistic PCA, because it's a probabilistic latent variable model that is slightly more general than probabilistic principal component analysis. There are other ways to generalize it too, and then you get the whole field of machine learning and statistics that deals with latent variable models. With this, I finish this lecture. Thank you.