So I shall be discussing principal components, which probably many of you are aware of, but in order to complete the picture I will also try to give a somewhat different way of looking at principal components. I think all of you know what a covariance matrix is. If you have a vector x — here I am changing the notation slightly: let us say it is a capital D dimensional vector, x = (x1, ..., xD) — then its covariance matrix, usually denoted by sigma, is the D x D matrix whose (i, j)-th entry is the covariance of xi with xj:

    sigma = [ Cov(x1, x1)  Cov(x1, x2)  ...  Cov(x1, xD) ]
            [ Cov(x2, x1)  Cov(x2, x2)  ...  Cov(x2, xD) ]
            [     ...           ...     ...      ...     ]
            [ Cov(xD, x1)  Cov(xD, x2)  ...  Cov(xD, xD) ]

Till now, when we were talking about feature selection, we had a criterion function, defined based upon some particular characteristics which we believe the features should possess. With principal components we are not going to talk about classification. What we are going to do is this: we are given some capital D number of features, and we would like to see where there is more variance, because more variance gives more of an idea about that particular variable, or that particular combination of variables. Look at a data set with two variables: x1 has some variance and x2 also has some variance, but if you take a suitable linear combination of the two, that combination may have more variance than either x1 or x2 alone. This is essentially what we are trying to capture with principal components.

Let me explain. You have a total of capital D dimensions. In this D-dimensional space we would like to look at all possible directions — and there are uncountably many of them. Among them we would like to find the direction which provides maximum variance. What is the meaning of providing maximum variance? It is the following. Take your data set and take a direction, say the x axis, as the first direction. Project each point onto this direction. If the direction is the x axis itself, the projected value is just the x coordinate of each point. But if it is some other axis, you still take the projections, and for each point you measure the distance of the projected point from the origin.
These will be the projected points: when a point is projected onto the direction, the corresponding value is the distance from the origin to the projected point. These distances carry a sign: if you take one side of the axis as the positive side and the other as the negative side, projections falling on the positive side get positive values and those falling on the negative side get negative values.

So for any direction: you take the direction, project all the points onto it, and you get a single one-dimensional value for each point. Then you can calculate the variance of these values. So for each direction you get a value of the variance. Now find the direction for which this variance is maximum. How to find it I will come to later; suppose you have found it. Store that direction. Now look at all the directions perpendicular to it, and among them find the one with maximum variance. Now you have two directions. Next, look at all the directions perpendicular to these two, and among them again find the one with maximum variance. You just go on and on doing this. When you come to the last one — that is, when you have found D-1 orthogonal directions — the Dth direction is uniquely determined, because there are in total capital D orthogonal directions and you have already found D-1 of them. On that last direction too you project the points and compute the variance.

Now take any point y, a capital D dimensional point. Corresponding to the first direction, the one with maximal variance, you get a coordinate value for y; corresponding to the second direction you get another value; and so on for all the chosen D directions. That is the transformed value of the vector y: we are going from the old set of axes to a new set of axes, because all these directions are mutually perpendicular and pass through the origin. These directions are called the components. If you want only a small d number of components, you take the first small d of them; they are called the principal components. There is also another term associated with this, which comes from electrical engineering — does anyone here have a background in electrical engineering? — this is the discrete Karhunen-Loève expansion.

Why is the variance given importance? Because the variance tells you where you have more variation in the data set.
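To make "the variance along a direction" concrete, here is a minimal sketch in NumPy. The data matrix X and the direction u are hypothetical names introduced only for this illustration; the point is simply that projecting every point onto a unit vector gives one signed number per point, whose variance you can then compute.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))              # 200 points, D = 2 (hypothetical data)
    X[:, 1] = 0.3 * X[:, 1] + 0.8 * X[:, 0]    # make the two variables correlated

    def variance_along(X, u):
        """Variance of the data projected onto the direction u."""
        u = u / np.linalg.norm(u)              # make sure the direction has length 1
        scores = X @ u                         # signed distance of each projected point from the origin
        return scores.var(ddof=1)

    # variance along the x1 axis, the x2 axis, and a 45-degree direction
    print(variance_along(X, np.array([1.0, 0.0])))
    print(variance_along(X, np.array([0.0, 1.0])))
    print(variance_along(X, np.array([1.0, 1.0])))   # a linear combination can have larger variance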
You do not want to lose that information about the variances, so if you have to represent the whole data set by a single component, you choose the component with maximum variance; that is why at each step you look at the maximal remaining variance.

Now the next question is how to calculate all this. What I described is the basic principle, but it looks extremely complicated: you take a direction with maximal variance, then consider all the directions perpendicular to the chosen direction and again find the one with maximal variance, and so on. This seems to be a highly cumbersome process. There is a simple way of doing it: take the covariance matrix of the vector from which you have obtained all these observations, find its eigenvalues and eigenvectors, and write down the eigenvalues in decreasing order.

There is a very basic question here. I just said to write down the eigenvalues in decreasing order — but what happens if an eigenvalue is a complex number? Can I still order them? After all, the variance-covariance matrix is a square matrix, and for every square matrix you can calculate eigenvalues and eigenvectors; it is not necessarily true that the eigenvalues are real, they can in general be complex, and if they are complex you cannot write them down in decreasing order. So my question is: is it possible that for a covariance matrix the eigenvalues are complex? The answer is no; for a covariance matrix the eigenvalues can never be complex. Why? A covariance matrix satisfies several properties, and one of them is that it is a positive semi-definite matrix, also called a non-negative definite matrix — the two terms mean the same thing. If I write the covariance matrix as sigma, then a' sigma a >= 0 for every vector a; equality is allowed, which is why it is called semi-definite. Non-negative means not negative, that is, zero or greater than zero; positive means strictly greater than zero, and semi means zero is included, so non-negative definite is the same as positive semi-definite. So sigma is a non-negative definite, or positive semi-definite, matrix. Moreover sigma is a symmetric matrix, and for a symmetric positive semi-definite matrix the eigenvalues are not only real, they are also greater than or equal to 0. This is a proven statement from matrix algebra: for sigma the eigenvalues are real and non-negative.

Also, for these matrices the determinant is the product of the eigenvalues, so if there is at least one eigenvalue equal to 0, the determinant is also equal to 0. Usually, however, covariance matrices are positive definite — a' sigma a > 0 for every nonzero a — and in that case all the eigenvalues are strictly greater than 0.
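The "simple way" — eigenvalues and eigenvectors of the covariance matrix, written in decreasing order — can be sketched as follows. The data matrix X is again a hypothetical placeholder; np.linalg.eigh is used because it is meant for symmetric matrices and returns real eigenvalues, which for a covariance matrix also come out non-negative up to rounding.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                # n = 200 points in D = 5 dimensions (hypothetical)

    sigma = np.cov(X, rowvar=False)              # D x D sample covariance matrix
    evals, evecs = np.linalg.eigh(sigma)         # eigh: for symmetric matrices, eigenvalues are real

    order = np.argsort(evals)[::-1]              # write the eigenvalues in decreasing order
    evals, evecs = evals[order], evecs[:, order]

    print(evals)                                 # all >= 0 (up to rounding): sigma is positive semi-definite
    print(np.isclose(np.linalg.det(sigma), np.prod(evals)))  # determinant = product of the eigenvalues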
Now let me ask you a question: corresponding to an eigenvalue, how many different eigenvectors can you have? We generally write that corresponding to this eigenvalue you have this eigenvector. My question is: do you have a unique eigenvector, unique in the sense that both the magnitude and the direction are the same, or is the direction the same while the magnitude can differ? The direction is the same and the magnitude can differ. That means, suppose for the matrix sigma lambda_1 is an eigenvalue and x is a corresponding eigenvector, so that sigma x = lambda_1 x. Take any constant c; then sigma (c x) = c (sigma x) = c (lambda_1 x) = lambda_1 (c x). So corresponding to an eigenvalue you get vectors in the same direction but of different magnitudes.

Now, before considering the case where two eigenvalues are equal, suppose all the eigenvalues are different; can you say anything about the corresponding eigenvectors? Here, by corresponding eigenvectors I mean only those vectors whose magnitude is one. Corresponding to an eigenvalue you will actually get two such eigenvectors, since you take a square root and (-1) squared is 1 just as (+1) squared is 1 — two vectors in opposite directions along the same axis — but you take any one of them, no problem. Similarly for lambda_2 you take one such eigenvector, and so on up to lambda_D. My assumption is that all these lambdas are different. Then what can you say about the corresponding eigenvectors? If I call them a_1, a_2, ..., a_D, then a_i' a_i = 1, because a_i' a_i is the square of the magnitude, which is 1; and since no two eigenvalues are the same, a_i' a_j = 0 whenever i is not equal to j. That means the eigenvectors are orthogonal. So if all the eigenvalues are different, this property is satisfied.

But if two eigenvalues are the same, can you say anything about the eigenvectors? Then you can have problems. Take the identity matrix: what are its eigenvalues? They are 1, 1, ..., 1, all the same, and every vector is an eigenvector. So if some eigenvalues are equal, this orthogonality property need not hold automatically; but if all the eigenvalues are different, the eigenvectors are orthogonal. Now try to remember what I told you at the very beginning: I said that you somehow find a direction and then look at all the directions perpendicular to it.
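Two of the facts just stated can be checked numerically: scaling an eigenvector gives another eigenvector for the same eigenvalue, and the unit-norm eigenvectors of a symmetric matrix with distinct eigenvalues satisfy a_i' a_j = 1 if i = j and 0 otherwise. A small sketch, reusing the same kind of hypothetical data as before:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                # hypothetical data, as in the previous sketch
    sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(sigma)

    a1, lam1 = evecs[:, 0], evals[0]
    c = 3.7
    print(np.allclose(sigma @ (c * a1), lam1 * (c * a1)))   # c*a1 is still an eigenvector for the same eigenvalue

    # a_i' a_j = 1 if i = j and 0 otherwise: the unit-norm eigenvectors are mutually orthogonal
    print(np.allclose(evecs.T @ evecs, np.eye(len(evals))))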
So each of these eigenvector directions is perpendicular to all the others, precisely because of this property: if all the eigenvalues are different, every eigenvector is perpendicular to every other eigenvector.

Now, if you take the first eigenvector a_1, the real number I said you should get for a particular point x is a_1' x. Remember the diagram: this is the direction, and these are the projected values. So for a particular vector x, the value corresponding to a_1 is a_1' x, the value corresponding to a_2 is a_2' x, and so on; the value corresponding to a_D is a_D' x. These are the projected values. And I was talking about variance: the variance of these projected values is lambda_1 for a_1, lambda_2 for a_2, and lambda_D for a_D. If I have to write it mathematically, Var(a_i' x) = lambda_i. These are exactly the eigenvalues and eigenvectors of the variance-covariance matrix. So the eigenvalues give you the variances in those directions, and the eigenvectors provide the directions themselves. Since we are looking for the direction with maximal variance, you take the largest eigenvalue and the corresponding eigenvector; that is your first component. Then, assuming the second eigenvalue is strictly less than the first, you get a_2, the second eigenvector, which corresponds to the second component. You continue like this up to lambda_d, with its direction a_d. These are your small d principal components, and the corresponding variances are lambda_1, lambda_2, ..., lambda_d.

In fact, principal component analysis is used extensively because of these properties. And there is another property. Since I have been talking about variances: is there any connection between these eigenvalues and the diagonal of sigma? Note that every diagonal element of sigma is a variance term: Var(x1), Var(x2), ..., Var(xD). Is there any connection between these diagonal elements and lambda_1 to lambda_D? The answer is yes. The connection is that the sum of the variances, which is the trace of the matrix sigma — I hope you all remember that the trace is the sum of the main diagonal elements — equals the sum of the eigenvalues. I am not giving you the proof, but please check it: the sum over i from 1 to capital D of lambda_i equals the sum over i from 1 to capital D of Var(x_i). So basically what we are doing is partitioning this total variance, keeping the part corresponding to the larger variances and removing the part with the smaller variances.

And what is a_1' x? It is a linear combination of the original variables x1, x2, ..., xD.
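The two claims in this passage — Var(a_i' x) = lambda_i, and sum of lambda_i = trace(sigma) = sum of Var(x_i) — can be checked on the same kind of hypothetical data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                        # hypothetical data, as before
    sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(sigma)
    order = np.argsort(evals)[::-1]                      # eigenvalues in decreasing order
    evals, evecs = evals[order], evecs[:, order]

    scores = X @ evecs                                   # column i holds a_i' x for every point x
    print(np.allclose(scores.var(axis=0, ddof=1), evals))        # Var(a_i' x) = lambda_i
    print(np.isclose(evals.sum(), np.trace(sigma)))              # sum of lambda_i = trace(sigma)
    print(np.isclose(evals.sum(), X.var(axis=0, ddof=1).sum()))  # ... = sum of Var(x_i)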
We have taken small d such linear combinations; in the original setup there are capital D such linear combinations, all orthogonal to each other. We keep the small d of them corresponding to the larger variances, and the remaining capital D minus small d correspond to the smaller variances.

If the variance of a component is small and we remove it, does that create a problem? Let me ask the question another way: if the variance is small, how does that help us? If the variance is small, we can replace all the values of that component by the corresponding mean, because when the variance is small the values stay very close to the mean. In that way we are losing some information — I am not saying that we lose nothing — but the information loss is small. So by keeping the components with larger variances and removing those with smaller variances, yes, we lose some information, I am not denying that, but it is really not that much. The loss in this procedure is measured by the sum of the discarded eigenvalues, and some people take a ratio instead: that sum divided by the total sum of the eigenvalues. You can measure the loss either by the plain sum or by the ratio.

So the whole procedure I was describing can be carried out simply by looking at the eigenvalues and eigenvectors of the covariance matrix. There is a theorem and proof for this relationship between those maximal-variance directions and the eigenvalues and eigenvectors of the covariance matrix; it is available in many pattern recognition books and also in many electrical engineering books. I will not go into the details of the proofs; whoever is interested can always go through the corresponding proofs in the books, which are easy to find. This is a really popular procedure because of all these properties that I mentioned, and it is used in very many places. In fact, one reason statisticians and computer scientists have to go through the literature on eigenvalues and eigenvectors is precisely principal components, and these wonderful properties of the covariance matrix. And the fact that the sum of Var(x_i) is the same as the sum over i from 1 to D of lambda_i is a very strong property: it partitions the total variance very neatly. There is also PCA plus LDA, which Dr. Sukhendu Das will anyway teach you.
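A small sketch of how that loss is usually quantified when only the first small d components are kept — the sum of the discarded eigenvalues, or that sum as a fraction of the total; again the data here is a hypothetical placeholder:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                 # hypothetical data, as before
    evals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]   # decreasing order

    d = 2                                         # keep the first d principal components
    loss = evals[d:].sum()                        # sum of the discarded (smaller) eigenvalues
    loss_ratio = loss / evals.sum()               # the same loss as a fraction of the total variance
    print(loss, loss_ratio, 1 - loss_ratio)       # 1 - loss_ratio is the fraction of variance kept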
There are also many other variations of PCA which are used in very many places, and about one of them I shall probably take a lecture tomorrow: principal components used for feature clustering. In fact, principal components have been used in many, many places, and one of the recent lines of work concerns principal components for sparse matrices. A sparse matrix is a matrix in which you have more zero elements than non-zero elements.

That means, basically, your data set is such that you have too many dimensions, and in most of those dimensions the entries are zero. I mentioned an example yesterday and I will give the same example today. Your data set is something like a web mining data set: you have a collection of web pages, say some documents — let us say 100 documents. In each document you have some words and some sentences; let us say the number of words per document is of the order of 50 — it may be 47, 48, 49 or 51, 52, 53, some documents may have many more and some much fewer, but say 50 on average. You have 100 documents, so 100 x 50 gives, let us say, 5000 words, and for the sake of convenience let us assume all these 5000 words are different; even if some of them repeat, the number of distinct words will still be quite large. Now we represent each web page by a 5000-dimensional vector: if word 1 is present in the document you write 1 in the first position, otherwise 0; if word 2 is present you write 1 in the second position, otherwise 0; and so on. So each vector is a 5000-dimensional vector of 0s and 1s, and with 100 documents you have 100 such vectors. The number of dimensions is 5000 but the number of vectors is only 100.

Now, if you have to compute the corresponding variance-covariance matrix to find principal components, you are going to have problems. In fact, if you look at real web pages there are far more words — the number of words may be in lakhs, that is, in the hundreds of thousands — and then your covariance matrix will be 1 lakh x 1 lakh, a huge matrix. Moreover, most of the entries are zeros: a single web page may contain hardly 100, 150, 200, let us say up to 500 words, while your vocabulary has 1 lakh words, so all the remaining entries are 0. Your data matrix is basically a sparse matrix. Nowadays many data sets are like this, and for these sorts of data sets you may have to develop some new methods. The reason is that whenever we do this sort of analysis there is an inherent assumption that the number of data points is much larger than the number of dimensions, whereas for many real-life problems that is not true: the number of data points may be much smaller than the number of dimensions. I have mentioned the web mining data set.
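The document example can be sketched as a binary document-by-word matrix. The numbers (100 documents, 5000 words, about 50 words per document) are the ones used in the lecture; the construction itself, with randomly chosen words, is of course only an illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n_docs, vocab = 100, 5000                      # 100 documents, 5000 distinct words
    A = np.zeros((n_docs, vocab), dtype=np.int8)   # row = document, column = word, entries 0 or 1
    for i in range(n_docs):
        words = rng.choice(vocab, size=50, replace=False)   # roughly 50 distinct words per document
        A[i, words] = 1                                     # 1 if the word is present, 0 otherwise

    print(A.shape)                                 # (100, 5000): far more dimensions than data points
    print(A.mean())                                # about 0.01, i.e. roughly 99% of the entries are zero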
You also have many other such data sets — many data sets in bioinformatics, for example gene expression data sets for cancer patients, where again the number of dimensions is of the order of 2,000, 3,000 or 4,000 but the number of points may be of the order of 100, 150 or 200. These are some of the latest problems where, when you try to apply principal components, you may face difficulties, because the number of dimensions is much larger than the number of data points. Your computer may not be able to support finding eigenvectors of, let us say, a 5000 x 5000 matrix, whereas for a 100 x 100 matrix it probably can. Sparse data and sparse matrices occur many times and in many real-life applications, so there are papers where people are trying to find principal components when the data matrix is sparse. There is one paper by Tibshirani in this regard; I think it appeared in one of the statistics journals. Tibshirani is a famous person working in machine learning — and I hope by now you know that many of these things which we are calling pattern recognition, some people call data mining, some call machine learning, and some call artificial intelligence; many of these topics occur in many disciplines. Tibshirani and a few other such statisticians call themselves machine learning people, and they are working on this; some papers are already published. There are several open problems related to principal components in very high-dimensional data sets, because your computer may not be able to support finding eigenvalues and eigenvectors of such high-dimensional matrices. I am stopping here; if you have any questions, please ask me. No? No more questions, then.
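The remark that a 100 x 100 eigenproblem is feasible even when a 5000 x 5000 one is not points at a standard workaround which the lecture does not spell out: when the number of points n is much smaller than the dimension D, you can eigendecompose the small n x n matrix Xc Xc' / (n-1) built from the centred data instead of the D x D covariance matrix; its nonzero eigenvalues are the same, and multiplying its eigenvectors by Xc' (and normalizing) recovers the principal directions. A rough sketch with hypothetical data:

    import numpy as np

    rng = np.random.default_rng(0)
    n, D = 100, 5000
    X = rng.normal(size=(n, D))                     # hypothetical data: far fewer points than dimensions

    Xc = X - X.mean(axis=0)                         # center the data
    G = (Xc @ Xc.T) / (n - 1)                       # n x n matrix instead of the D x D covariance matrix
    g_evals, V = np.linalg.eigh(G)                  # a 100 x 100 eigenproblem: easy for the computer

    order = np.argsort(g_evals)[::-1]               # decreasing order, as before
    g_evals, V = g_evals[order], V[:, order]

    d = 10                                          # keep the leading d directions
    U = Xc.T @ V[:, :d]                             # D x d matrix of unnormalized principal directions
    U = U / np.linalg.norm(U, axis=0)               # normalize each direction to unit length

    # the variance of the data projected on each direction equals the corresponding eigenvalue
    print(np.allclose((Xc @ U).var(axis=0, ddof=1), g_evals[:d]))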