So, welcome back to the lecture series on pattern recognition. We have discussed many types of classifiers, both supervised and unsupervised. To name a few, we started with the Bayes classifiers on the supervised side, and we also discussed other methods of classification and clustering such as DBSCAN, the k-nearest neighbor rule, the K-means algorithm and so on. There are two other variants of classifiers which are very commonly used in pattern recognition applications; one of them is supervised, the other is unsupervised. We will start with the unsupervised one, a method which can be used for both classification and clustering, called principal component analysis, or PCA. It is a widely used method. Its main applications have been data dimensionality reduction, compression and so on, but we will see that it can also be used for classification, and we will call it unsupervised classification because in this case the class labels of the data samples are not known. There are a few other names under which PCA appears. One is the KLT, the Karhunen-Loève transform, as given in the slide. People also casually refer to PCA as SVD, the singular value decomposition, which we studied much earlier in the course under the basics of linear algebra and vector spaces. PCA and SVD are sometimes used interchangeably; strictly speaking that is not correct, but it is easy to see why, because SVD is the main tool used for obtaining a PCA. So what are principal components, what is principal component analysis, and why SVD? Because we need to do an eigen decomposition. If we look at the slide, the KLT or PCA is based on an eigen analysis, and its main purpose is to obtain a set of eigenvectors derived from the eigen decomposition of a scatter matrix. We will define what the scatter matrix is; it has also appeared in other concepts discussed in this course. The slides describe the concepts of PCA as well as SVD, so whatever I tell you, you can always go back and watch this video along with them, and we will also take a toy example at the end to illustrate PCA. As in many other lectures, we will take toy examples in 2D and 3D. But remember that this involves eigen decomposition and matrix analysis, so if you have not gone through the couple of lectures on matrix decomposition, please go through them, although I will revise one or two concepts of eigen decomposition and SVD with the help of slides. So our aim is to obtain a projection set that best explains the distribution of the representative features of an object of interest. This object of interest could be any particular pattern or object we are trying to retrieve from an image, but of course PCA is applicable to any other type of data. People have applied it to speech signals and to other multidimensional signals that occur in practice, such as image, video, acoustics, music and so on, and in many other applications as well. What this projection set is, and what this representation of the distribution means, we will now look at.
In some sense, you must remember that although we are talking about a representation, this is a non-parametric, parameter-free method of obtaining a representation of the data. PCA chooses a dimensionality-reducing projection that maximizes the scatter of all projected samples. We will see with an example what we mean by the scatter of the samples. Scatter is a very important term; we have modeled it using the covariance matrix when we talked about distance functions, and any method of clustering or classification you use involves looking at the scatter of the data. So it becomes very important to model the scatter and use it in our analysis. The main aim of PCA is to reduce the dimensionality of the problem and, in that process, to choose the best set of dimensions in a different domain. It is not that, given n-dimensional data, you simply pick some arbitrary best m dimensions with m less than n; that is not what PCA does. Instead we transform the data to another domain, the dimension of that projected domain is less than the original dimension of the data, and in the projected domain we choose the directions along which the scatter is maximized. We will see how this is done. This last idea is important: we are trying to maximize the scatter of the samples when they are projected onto another set of dimensions. Carrying on the discussion on PCA, I have used a set of N sample images as an example, but it could be any other data. What this basically means is that x1, x2, x3 up to xN are different image samples; instead of image samples it could be any other data. I am choosing images because in many of our examples, including some of the toy examples and analysis we are going to show, we have used image samples, since they are an easy means of illustration and visualization which can help you understand the concepts better.
So what we are trying to see here is this: you are given N different images (it could be any other data, mind you), and these N images pertain to one type of object. Within that type we have classes. Say you have image samples of fruits, or of cars, or of faces for the example of person authentication in biometrics. You are given N such image samples, all of the same or similar type but of different classes. That means, if you have N sample images of c different individuals, say c is 10 and N is 100, you have 100 sample images of 10 different individuals, so on average about 10 samples per person or per class. Similarly, if you have 10 different fruits, 10 different cars or 10 different bicycles, and for each such class you have 10 samples, then in total you have 100 image samples. This could be any other data as well: speech, audio, or vibration signals observed using some measuring device. So you have N different samples, and each data point has a dimensionality; in the case of an image, if each image is of size d cross d (or w cross w), the dimensionality of each data point is d squared (or w squared). So, looking back at the slide, there are N sample images x1, x2, x3 up to xN, and each of these x's is a vector of a certain dimension depending on the feature space you are working in. If instead of images you work with feature vectors of a certain dimension, because many data sets given for pattern recognition, pattern classification or machine learning applications come as feature vectors of dimension 30, 100 or a few thousand, then the total number of samples over all classes is N, the dimension is n, and each sample belongs to one of c classes. This c indicates the number of class labels, and typically c will be much less than N, because for each class you have a certain number of samples; the average number of samples per class multiplied by the total number of classes gives you N. So remember, capital N is the number of samples, c is the class label, and a lower-case x indicates a data point of a certain dimension. What we want to do with the help of PCA is a linear transformation mapping the original n-dimensional image space to an m-dimensional feature space. Be careful with the notation: lower-case n is not the same as capital N. Lower-case n indicates the dimension of a sample, so for a d cross d (or w cross w) image, n equals d squared (or w squared), and for some other feature data set it is whatever the dimension of the problem is; in many previous lectures we used d for this dimension. Here, just be careful that we are using capital N and lower-case n as separate notations. So we want to map, or project, from the original n-dimensional image space to an m-dimensional feature space where typically m is less than n.
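To make the notation concrete, here is a minimal sketch (not from the lecture slides; the sizes and the random stand-in "images" are hypothetical) of how N sample images of size d by d are flattened into n = d*d dimensional column vectors x1 ... xN and stacked into a data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 8                      # hypothetical: 100 samples, 8x8 images
images = rng.random((N, d, d))     # stand-in for real image samples

n = d * d                          # dimension of each sample (lower-case n)
X = images.reshape(N, n).T         # shape (n, N): each column is one sample x_k
print(X.shape)                     # (64, 100)
```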
Sometimes m can be much, much smaller than n. So we have the new feature vectors; henceforth we will not distinguish between image pixels, the original data points, and feature vector dimensions, since whatever we say holds for the original sample points as well. The feature vector is of dimension n and we are projecting it down to a lower dimension m, and we represent the result by yk: each xk is projected to yk, where each yk is of dimension m. The linear transformation is given by the expression on the slide, yk = W transpose xk. We talk of a certain matrix W; for the time being you can think of it as a sort of weight matrix, but it is the one responsible for the mapping or transformation from the higher-dimensional data space of dimension n to the smaller dimension m. Looking back, xk is of dimension n, which is larger than the dimension m of yk, and since n is more than m, W is responsible for the reduction. So what will the size of W be? It will be n cross m, because yk is of size m and xk is of size n, so W transpose is m cross n. It is a matrix with orthogonal columns representing a basis of some feature space. The purpose of PCA is to find this W, which maps x to y such that you not only get a reduction of dimension, but the projected samples yk also have some nice properties, in particular maximum scatter along the new dimensions. It is possible that certain of the n original dimensions are redundant, or do not carry scatter representative of the samples; in some sense they will be given less importance, or effectively eliminated, in the transformed space. The transformed space of dimension m (again, less than n) will hold the maximum scatter, and we will define the criterion that PCA actually optimizes. So remember this expression: W is the weight matrix to be obtained with the help of PCA, which will let us implement this projection, and the index k runs over all the N samples available in the original, larger dimension. To obtain W we look at the expression of a scatter matrix S_T, which is given by the sum over k from 1 to N of (x_k minus mu)(x_k minus mu) transpose, where x_k is the k-th sample and mu is the overall mean of all the samples. So x_k minus mu means you subtract the overall mean from each sample. Looking at the expression, x_k minus mu can be considered a column vector and (x_k minus mu) transpose a row vector, and their product is an outer product: it gives you an n cross n matrix, since n is the dimensionality of the problem. You subtract the mean from each sample, form the outer product, and sum over all the samples; that gives you the scatter matrix for the overall set of samples.
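As a small illustration of the scatter matrix just described, the following sketch (hypothetical data; variable names follow the lecture notation) computes S_T both in vectorized form and as the explicit sum of outer products.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 5, 20
X = rng.random((n, N))                    # hypothetical data, one sample per column

mu = X.mean(axis=1, keepdims=True)        # overall sample mean, shape (n, 1)
Xc = X - mu                               # mean-subtracted samples
S_T = Xc @ Xc.T                           # outer products summed over all k; (n, n)

# equivalent loop form, matching the summation on the slide
S_loop = sum(np.outer(Xc[:, k], Xc[:, k]) for k in range(N))
print(np.allclose(S_T, S_loop))           # True
```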
What properties does this scatter matrix have? It has properties related to the distribution of the samples; remember it is non-parametric, it is just a matrix computed from the samples you have. There is also an expression for an element sigma_ij, which is nothing new to you: an element of the scatter matrix can be viewed as the expectation of the product of the mean-subtracted i-th and j-th components, so the scatter matrix is essentially the (unnormalized) covariance matrix. Along the diagonal you have the variances, and in the off-diagonal terms you have the correlation between two particular dimensions i and j for an arbitrary element (i, j). Capital N is the number of samples and mu is the mean over all samples, of images or any other data. The scatter of the transformed feature vectors (we will prove this very soon) after the transformation using W is W transpose S_T W. That is, if S_T is the scatter of the samples xk in the original domain, then the scatter of the samples yk is W transpose S_T W. We will prove why this is so with a little bit of analysis, and show that this quantity is what PCA maximizes along certain directions, in a certain order. So we have to choose W; that is our main aim in principal component analysis. The subscript "opt" indicates the optimal choice. Remember, you can choose any arbitrary W; it will give you some transformation, and if you take a random matrix W of size n cross m you can certainly project the samples and get a lower-dimensional mapping, but it will not satisfy the criterion of maximum scatter along directions in decreasing order; the projections yk will not have those properties. So we are looking at an optimal W, chosen to maximize the determinant of the total scatter matrix of the projected samples, W_opt = arg max over W of the determinant of W transpose S_T W. S_T, as we saw in the previous slide, is the scatter matrix of the original samples x in dimension n, and the W that maximizes this determinant is the one we want. The columns w_i of this W are the n-dimensional eigenvectors of the scatter matrix S_T corresponding to its m largest eigenvalues. You can check the proof in many books, including the book by Fukunaga, so I am skipping it here. What it basically means is that you look at the n-dimensional eigenvectors of S_T (and here comes our decomposition based on eigen analysis or SVD), take the eigenvectors corresponding to the m largest eigenvalues, and form W from them; that W is optimal with respect to the criterion of maximizing scatter in the transformed domain.
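A minimal sketch, with hypothetical data, of how the optimal W could be formed in practice: eigen-decompose S_T, keep the eigenvectors of the m largest eigenvalues as the columns of W, and project the mean-subtracted samples.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, m = 6, 50, 2
X = rng.random((n, N))
Xc = X - X.mean(axis=1, keepdims=True)
S_T = Xc @ Xc.T

evals, evecs = np.linalg.eigh(S_T)        # symmetric matrix -> real eigenpairs
order = np.argsort(evals)[::-1]           # eigenvalues in descending order
W = evecs[:, order[:m]]                   # n x m matrix of the top-m eigenvectors

Y = W.T @ Xc                              # projected samples y_k = W^T (x_k - mu)
print(Y.shape)                            # (m, N)
```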
Continuing the discussion on PCA, before we get into a little more analysis: if you talk of eigenvectors derived from image data, they are also called eigen images or basis images, or the facial basis if the images happen to be faces. Remember, the images could be of any other kind of sample, such as cars, bicycles, humans or buildings, but they should typically be samples belonging to the same type of object, because it is for that set that we compute the scatter. It is not that you take all the assorted images in your gallery, or make a repository of your own from Google or Yahoo image search, and then compute a scatter and a PCA on that; you can, but in some sense it would be meaningless. It is done for a set of samples of a particular type of image, possibly of different classes: different faces of individuals, different car models, different bikes, different humans, different buildings, or different kinds of boxes used as travel bags, say. So for data sets which have different samples belonging to a set of similar classes, these eigenvectors form the basis vectors, and any data point, say a face, can be reconstructed approximately as a weighted sum of the collection of images that define the facial basis, the eigen images, plus a mean image of the face. What this sentence basically means is: remember you had the set of samples x in some n-dimensional space; assuming you have done the eigen analysis and found the optimal W, you project the samples x to y in a lower dimension, from n down to m with m less than n. The columns of W form your basis images, and if the images are faces they are called the facial basis. PCA is a very common method for classification or even clustering of facial images; it was proposed, if I am not wrong with the date, around the early 90s by Turk and Pentland in their paper on eigenfaces, and of course there are other beautiful papers, for example by Belhumeur on Fisherfaces versus eigenfaces. PCA then became very popular for representing faces, although there are better methods now for facial classification or clustering. The face is just an example; it could be any particular type of image. So what you are doing is projecting the samples of the face, or any other type of data, to a lower dimension; the eigenvectors form your basis for representation in the lower dimension, and you also have a mean image which you can compute from the original data. Remember we did a mean subtraction before computing the scatter matrix. Now, can you get back your x given y? Remember what y is: y is W transpose multiplied by x. Maybe I will write that expression on the board, because that is the very important projection you are trying to do. So we have all of these equations so far; this is what PCA is supposed to do: project from a higher-dimensional space to a lower-dimensional space with m less than n, and sometimes m much less than n. W is obtained by PCA, which satisfies the criterion of maximizing the scatter in the projected space of y.
This W is obtained from the eigenvectors of the scatter matrix. Strictly speaking this is an unnormalized scatter matrix; sometimes you may need a normalizing term, and with that normalization the scatter matrix is the same as the covariance matrix. We will see how an SVD done on this scatter matrix helps us obtain W; there is a proof, which I am skipping, that this gives the optimal W, and then using this W you do the corresponding projection from x to y. Given these expressions, which we have just seen on the board and earlier on the slides, you can get y from x. The next question is: can you get x back from y? Yes, to some extent, because y is a representation satisfying the criterion of maximum scatter along certain directions; you choose the directions of maximum scatter, and there are a few other properties, but that is the main one to concentrate on. Given the y which you obtained by projecting with W, can you get back x? You should be able to, because W is built from orthogonal components, and if you keep W square you can invert it and get back x by back-projecting y into the original dimension. The question is whether it gives back the original x exactly. The other thing to keep in mind is that you have to add back the mean vector which you subtracted from the samples in order to recover the original samples. The reconstruction is exact if you keep m equal to n, that is, if you keep all the directions, all the eigenvectors, in y; then you get a perfect reconstruction. But in general we also want dimension reduction with PCA, so depending on the number of dimensions you keep, you get back a reconstructed signal which is a very close approximation of x, but not the same x, mind you. It will be very close. So if you take face images, do a PCA, go to a lower dimension and come back, the faces may not exactly match the data samples you started with in the original dimension; they will look slightly different. We will try to show in the next class what eigenfaces look like when you project the samples and reconstruct them back. So just remember: it is a low-dimensional representation, it gives an approximation when you project back, and, as the last slide highlights, any data point can be reconstructed approximately as a weighted sum of a small collection of images that define the facial basis or eigen images, plus the mean image of the face which you had subtracted to compute the scatter matrix.
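The reconstruction argument above can be checked numerically. This sketch (hypothetical data and sizes) back-projects with W, adds the mean back, and shows that the reconstruction error vanishes only when all n directions are kept.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, m = 8, 40, 3
X = rng.random((n, N))
mu = X.mean(axis=1, keepdims=True)
Xc = X - mu

evals, evecs = np.linalg.eigh(Xc @ Xc.T)
order = np.argsort(evals)[::-1]

for m_kept in (m, n):
    W = evecs[:, order[:m_kept]]
    X_hat = W @ (W.T @ Xc) + mu           # project down, back-project, add the mean
    err = np.linalg.norm(X - X_hat)
    print(m_kept, round(err, 6))          # error is ~0 only when m_kept == n
```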
So the data form a scatter in the feature space, represented through a projection set called the eigenvector set, and the features are extracted from the training set without prior class information. I do not have any class information available here, hence it is also called a method of unsupervised learning; it is like learning without a teacher. We know the difference between supervised and unsupervised: in the supervised case you have labeled class samples, here you do not, so it is called unsupervised, and that is also why PCA is sometimes called a method of clustering. It will not group the data as such, but the data will tend to form groups depending on the directions of maximum scatter, and it may help discriminate between classes. You may be wondering, and I will show an example of what this means: when the class information is not there, what do you learn from the data? That is a fair question. We are talking of unsupervised learning, and for learning you would expect to need class samples; if those are not given to you, what will you learn and what will be the output? We will see with examples that the main purpose of PCA is to extract a low-dimensional representation of the data which satisfies a criterion of maximum scatter along a certain set of dimensions, and the other application is dimension reduction in terms of representation. In some cases this is driven towards unsupervised learning with the hope that, if you are extracting dimensions along the directions of maximum scatter, the data will form groups along those directions according to the number of classes; if you have two or three different classes, they tend to form groups along those directions. With that hope we can, very loosely, call PCA a method of unsupervised learning, although you must remember that this is not its main aim, only a secondary application. In the true sense of the term in pattern recognition you cannot really learn class structure, because the class labels are not known; but anyway, keep in mind that the features are extracted without prior class information and hence it is also called unsupervised learning. Here is an example which shows what the scatter is. Look at the set of samples: you have one group indicated by a certain color and symbol, and a second group indicated by another symbol with a different color; this is a two-dimensional example. What you expect PCA to give is this: the first eigenvector gives the direction indicated by the arrow, which says this is the direction of maximum scatter, spread, width or separation; scatter, spread, separation, whatever term you want to use, but we will use the word scatter from now on, which could mean separation or spread as well. You can see that is the direction of maximum scatter. There is also an orthogonal direction; the figure may not reveal it clearly, but the second eigenvector is orthogonal to the first, and along it you have the second-highest, that is, smaller, scatter compared to the first. Remember, since this is a two-dimensional problem, there are only two directions onto which you can project, and the first eigenvector gives you the maximum scatter. Here is another, more interesting example, which shows that PCA is not always able to produce a direction which separates the two classes. In the first example, if you project the samples onto the axis of the first eigenvector, one class
forms a cluster here and the other forms a cluster there, and that should help you do some unsupervised clustering or classification. That is not the case in this second example: look at the direction of maximum scatter. If you project all the samples along it, there is a huge amount of overlap; it is in fact the second direction which gives you the separation. Here is another example using a set of contour points from an image. If you take the x and y coordinates of the points and do a PCA, this is the direction along which you get the maximum scatter (this example is from the SQUID home page for object shapes). What are the samples here? The data is 2D, and each sample is an (x, y) coordinate pair: one point on the contour is one sample, the next point (x2, y2) is the second sample, and so on. You take the set of N samples, do a PCA, and it gives you a direction in 2D; the first eigenvector gives the direction of maximum scatter. So these illustrations are showing you the direction of maximum scatter, spread or width of the data. Let us now go to some analysis of the PCA, but before that I will give you a few of the interpretations people use for PCA. It is also described as a technique used to reduce multi-dimensional data sets to a lower dimension for analysis; we have already talked about this, that it is considered a tool for producing lower-dimensional data, that is, dimensionality reduction, and it can be used in predictive models and exploratory data analysis. It involves the computation of the eigenvalue decomposition or SVD of a data set, usually after mean-centering the data for each attribute; we talked about subtracting the mean of the data. Whether you write the data matrix as X or X transpose is only a matter of notation: all the samples put together form the matrix X with zero empirical mean, meaning the empirical mean of the distribution has been subtracted from the data set. Each column is made up of results from a different subject and each row the results from a different probe; this terminology basically means that if X is the matrix containing all the data samples, then each column is a sample point and each row is a dimension. So if you take the previous example of contour points on an object, for which we just showed the first PCA direction, the matrix X will have 2 rows and as many columns as there are contour points. This is only a notation: the columns are sometimes called subjects (or samples) and the rows probes, where a probe picks up a particular value as a feature. In the case of images, each gray-level value, each pixel, could be a probe; it is an observation made by a sensor. When you do this on X, which is our data, the PCA of the data matrix is given by the projection: remember, you take X and project it onto Y using W transpose.
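Tying together the contour-point illustration and the notation just described (columns are samples, rows are the two coordinate dimensions), here is a small sketch in which a synthetic elongated point cloud stands in for the contour data (it is not the SQUID data itself); after mean-centering, the first eigenvector of the scatter gives the direction of maximum spread.

```python
import numpy as np

rng = np.random.default_rng(4)
t = rng.normal(size=200)
# synthetic (x, y) points, elongated along one direction
pts = np.stack([3.0 * t, 1.0 * t + 0.3 * rng.normal(size=200)])   # shape (2, 200)

ptsc = pts - pts.mean(axis=1, keepdims=True)      # mean-centered samples
evals, evecs = np.linalg.eigh(ptsc @ ptsc.T)
order = np.argsort(evals)[::-1]
print("first eigenvector (max scatter direction):", evecs[:, order[0]])
print("second (orthogonal) eigenvector:", evecs[:, order[1]])
```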
The projection can also be written via the SVD of X: if you write the SVD of X as X = W Sigma V transpose, with Sigma the diagonal matrix of singular values and V another set of singular vectors, and substitute this into Y = W transpose X, then W transpose W is the identity because W is an orthogonal matrix, and you are left with Y = Sigma V transpose. This is another representation, although most people use the first one; but if you take the SVD of X you obtain it directly. So basically W is obtained from the SVD of X, and it is the left set of singular vectors, because you have two sets of singular vectors, W and V. The goal is to find this orthonormal matrix W from the data matrix X and then use it to project to Y, such that the covariance of Y, which with a normalizing term is Y Y transpose over (N minus 1), is diagonalized. So if you take the covariance of Y after projecting with W, only the diagonal terms remain; that is the basic aim. The columns of W, which is the matrix used for the projection, are the principal components of X, and they are also the eigenvectors of the covariance of X. Unlike other transforms that exist in the literature, such as the discrete cosine transform (DCT), the discrete Fourier transform (DFT), the discrete wavelet transform (DWT), the Hadamard transform and other transforms from signal processing, matrix algebra and communication theory, PCA does not have a fixed set of basis vectors; it depends on the samples. Remember, W is composed of eigenvectors derived from the data, and the data here is the matrix X. That means if you change X (and how can you change X? very simply: remove a few sample points, add a few, or change a few), you have a different X; take the SVD of that and you get a different W, a different projection matrix, and a different set of projections. So the eigenvectors responsible for the dimensionality reduction depend on the data itself, which is different from the other transforms in the literature. Remember, the DCT and DFT have major applications in many branches of science and engineering, as does the discrete wavelet transform for multi-resolution signal analysis, and they have a fixed set of basis vectors or representation matrices with which you do the decomposition or projection; here it is totally data-dependent. Since we have seen that SVD is the main tool behind performing PCA, we have one slide here which gives a revision of singular value decomposition, in case you skipped the discussion we had earlier in the course when we covered vector algebra and spaces. SVD takes a matrix A of arbitrary size, n cross p, and the theorem states that you can decompose A into a unitary matrix U, another orthogonal matrix V, and a diagonal matrix S, as A = U S V transpose. The calculation of SVD (these are just the key points; this is not a lecture on SVD) basically finds the eigenvalues, which appear in S, and the eigenvectors of A A transpose and A transpose A, which you get in U and V respectively.
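Coming back to the PCA use of SVD described just above, here is a sketch (hypothetical data) verifying that projecting the mean-centered X with the left singular vectors gives the same result as Sigma V transpose; note that numpy's svd returns V transpose directly as Vt.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 5, 30
X = rng.random((n, N))
Xc = X - X.mean(axis=1, keepdims=True)

W, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = W @ diag(s) @ Vt
Y_direct = W.T @ Xc                                 # projection with left singular vectors
Y_svd = np.diag(s) @ Vt                             # Sigma V^T
print(np.allclose(Y_direct, Y_svd))                 # True
```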
What this means is that the columns of V are the orthonormal eigenvectors of A transpose A, and the columns of U are the orthonormal eigenvectors of A A transpose; you can form A A transpose and A transpose A out of A, both are symmetric, and their eigenvectors give you U and V. In fact our W is taken out of U. Also, the singular values in S (sometimes written as Sigma; in my slides you will also find it written as a Sigma matrix) are the square roots of the eigenvalues of either A A transpose or A transpose A, and the interesting point is that they are arranged in descending order: if S is a diagonal matrix, the top-left diagonal element is the largest singular value, the next one the second largest, and so on down to the smallest singular value at the bottom-most diagonal element. U and V are also real matrices when A is real. Some important observations on SVD: the singular values are the diagonal entries of the S matrix, arranged in descending order, and they are always real and non-negative; some books call the matrix M, but it is the same as the matrix A we have seen. The right singular vectors, the columns of V corresponding to the vanishing singular values of M, span the null space of M, and the left singular vectors in U corresponding to the non-zero singular values of M span the range space of M. We may not use the concepts of range space and null space right now, but they will come up when we discuss LDA after PCA. Carrying on with PCA, which is also called the Karhunen-Loève transform or KLT: it is equivalent to finding the SVD of the data matrix X, as we know, and then obtaining the reduced-space representation Y of the data matrix by projecting X; we have seen that Y is W transpose multiplied by X, with the reduced space defined by the first L singular vectors (this L is our m), so you can choose only a few of the singular vectors rather than the entire dimension. Assuming the SVD gives you X = W Sigma V transpose (you saw this expression a few slides back, where Y can be written as Sigma V transpose), W is the matrix of left singular vectors of X. You can then see that the matrix W of singular vectors of X is the same as the matrix of eigenvectors of the observed covariance matrix C = X X transpose. Look at the slide: we are talking about the covariance X X transpose, and if you substitute X = W Sigma V transpose and X transpose = V Sigma transpose W transpose, it is very easy to see that V transpose V cancels out and you are left with X X transpose = W Sigma Sigma transpose W transpose = W D W transpose, where D is a diagonal matrix containing the squares of the singular values in Sigma. So the covariance of X obtained from the SVD is a diagonal matrix D flanked by the matrix W of left singular vectors on both sides.
Carrying on the discussion of PCA and the covariance matrix with which we started: assume the matrix W is square of size n cross n for the moment, with the constraint W transpose W = I, and we claim that the covariance matrix of Y = W transpose X is diagonal, which can be proven as follows. The covariance of Y is the expectation of Y Y transpose; substitute Y = W transpose X into the expression (I repeat, take Y = W transpose X and substitute it), and you get the expectation of W transpose X X transpose W. Take the matrix W out of the expectation term, and this gives the covariance of Y as W transpose times the covariance of X times W, and we have just derived that the covariance of X is W D W transpose, so the covariance of Y becomes W transpose W D W transpose W, which is just D. So that is the interesting result: the covariance of Y, where the samples Y are obtained by doing a PCA projection of X, is strictly diagonal. What does this basically mean? D is a diagonal matrix containing the squares of the singular values from the SVD, and these appear in descending order, so the scatter of the samples Y is maximum along the first dimension, second largest along the second dimension, and so on; that is what the covariance of Y indicates. And if the covariance is strictly diagonal, what does it mean? The off-diagonal terms are zero. If the covariance matrix is strictly diagonal, the off-diagonal terms sigma_ij, for i not equal to j, are zero, which means there is no covariance, no linear relationship, between any two of the new dimensions; they are orthogonal to each other and their correlation is zero, so each projected dimension is uncorrelated with the others. So you are projecting onto dimensions where not only do you have maximum scatter along the first, then the second and so on, but you also have no correlation between any pair of them: the first is uncorrelated with the second, the third and so on, and the same holds for every other pair. If you take any two dimensions i and j, with i not equal to j, in the transformed domain after doing a PCA, they have no correlation, and that is what is reflected in the diagonal covariance matrix of Y. That is the interpretation. You can also prove that the covariance of X multiplied by W is the same as W multiplied by D, that is, each column of W times the covariance of X gives that column scaled by the corresponding eigenvalue. This is a simple bit of analysis; I leave it as an assignment for you to derive from the fact that the covariance matrix of Y is diagonal.
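A quick numerical check of the claim just made, that the covariance of Y is diagonal with the variances in descending order, using hypothetical correlated data.

```python
import numpy as np

rng = np.random.default_rng(6)
n, N = 4, 500
X = rng.random((n, n)) @ rng.random((n, N))          # correlated dimensions (hypothetical)
Xc = X - X.mean(axis=1, keepdims=True)

cov_X = Xc @ Xc.T / (N - 1)
evals, W = np.linalg.eigh(cov_X)
W = W[:, np.argsort(evals)[::-1]]                    # columns sorted by decreasing variance

Y = W.T @ Xc
cov_Y = Y @ Y.T / (N - 1)
print(np.round(cov_Y, 6))                            # off-diagonal terms are ~0
```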
Let us take a hand-worked toy example of PCA. The data here is 3-dimensional, and I have just three sample points; of course you could take many more. The samples are stacked up as columns to give the data matrix X. So what is the job of PCA, how do you do it? You have to do an SVD, but before the SVD you must not forget one important step, which is to subtract the mean of the data. So let us calculate the mean of this data. It is a 3D problem, the number of samples is also 3, each column is an observation (a sample) and each row is a variable (a dimension). The sample mean mu: how do you get the value of 1/3 for the first dimension? The entries of the first dimension are 4, minus 1 and minus 2; their sum is 1, divided by 3 (the number of samples) gives 1/3. The second dimension sums to 3 plus 1, which is 4, divided by 3 gives 4/3. The third is 2 plus 1 plus 3, which is 6, divided by 3 gives 2. That is how you calculate the mean. For the mean-subtracted samples, you take each of the individual samples x1, x2, x3 and subtract the mean; I leave it as an exercise to check. It is the same as taking each of the columns and subtracting the mean vector from it. Now you can construct the SVD from the mean-subtracted sample matrix, written X tilde, where the tilde indicates mean-subtracted data; you take the three mean-subtracted columns one after another to form this new matrix. Let us look at the covariance term, which is X tilde X tilde transpose divided by 2; remember there is a 1 over (N minus 1) factor, which is how the 2 comes in. I am giving you the answer on the slide; you can use any mathematical toolbox to compute it, and since it is a 3 cross 3 matrix it would not even be difficult to multiply it out by hand. So you have the covariance, or the scatter as it is called, and we will do an SVD of this scatter matrix on the next slide. This is what we have done so far; this is the scatter matrix example. You get the mean-subtracted data and you can compute the scatter matrix by summing the individual outer products: following the expression for the scatter matrix, the first term of the summation is C1, the outer product of the mean-subtracted x1 with itself; similarly for x2 you get C2 and for x3 you get C3. I repeat: from the first sample, using the first term of the summation, you get C1; from x2 the second term gives C2; from x3 the third term gives C3 (k runs from 1 to N, there are three samples, so there are three terms). Sum all of them up and that gives the scatter matrix.
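A sketch of this toy computation in code. The data matrix below is only a guess reconstructed from the arithmetic narrated above (dimension sums of 1, 4 and 6, giving the mean (1/3, 4/3, 2)); the actual slide values may differ.

```python
import numpy as np

# hypothetical data consistent with the narrated mean; columns are the three samples
X = np.array([[4., -1., -2.],
              [1.,  3.,  0.],
              [2.,  1.,  3.]])

mu = X.mean(axis=1, keepdims=True)        # [1/3, 4/3, 2]^T as worked out above
Xt = X - mu                               # mean-subtracted data, "X tilde"
C = Xt @ Xt.T / (X.shape[1] - 1)          # covariance = scatter / 2

U, s, Vt = np.linalg.svd(C)
print(mu.ravel())                         # [0.333..., 1.333..., 2.0]
print(np.round(s, 4))                     # eigen spectrum, in descending order
```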
You can check a few entries if you want. For example, the last diagonal element is 2 because it is 1 plus 1 plus 0. How do you get minus 3? Sum the two elements minus 5/3 and minus 4/3, which gives minus 9/3, that is minus 3; and adding the other pair of elements gives 6. So you can just take the last column, it is easy to add up, and the same is applicable for the last row, where adding the corresponding two elements again gives minus 3; that is very simple. Then you get the covariance, which is the scatter divided by 2, and I repeat, this is the same covariance you obtain from the summation expression; let us go back, this is what we got. It is a symmetric matrix and it is easy to verify the entries: if you take the half inside, a scatter entry of minus 3 becomes minus 1.5 and an entry of 2 becomes 1, and you can check that these are indeed the values that appear in the covariance. So you just divide by 2 and you get the answer. You can compute this scatter matrix either using the summation expression or directly as a covariance; it gives you the same thing, and this is what the scatter matrix from the last slide looks like. Sometimes people take X transpose X instead of X X transpose; you will get a different matrix, a variant which does not look the same. This is sometimes done when the dimension is too high. But let us do an SVD of both, so that I can show you the SVD of X X transpose next to the SVD of X transpose X. The one on the left-hand side, X X transpose, is the one that was actually defined; the other is just a variant. Look at the U matrices of the two decompositions: they do not appear the same. But look at the diagonal matrices: the eigenvalues agree. If you are interested only in the eigenvalues, the set of eigenvalues for the data sample X, you can take either X X transpose or X transpose X (the covariance or scatter in whichever form) and compute the SVD; the diagonal matrix gives you the set of eigenvalues, which is sometimes also called the eigen spectrum because it is arranged in descending order with the largest value as the first element. The eigenvectors, however, will not look the same, so you must be careful when doing PCA: look back at the slide; if you use X X transpose then you must use its eigenvectors, and those are the ones which form your W. Here is another example: a set of 6 points arranged to create the matrix X. Let me quickly run through it for lack of time. It is a 2D problem with 6 sample points; again each column is an observation and each row is a variable. This is the mean vector: take the sum of the first row divided by 6 to get the first component, and the sum of the second row divided by 6 to get the second. This is the mean-subtracted data: you take all the sample points and subtract the corresponding mean vector. This is the covariance X X transpose; X transpose X would give you something different, and you can compute either, but X X transpose is the correct one to use here since it is a 2-dimensional problem. The covariance of the mean-subtracted X X transpose is given, and let us look at U, S and V. The S matrices differ in size and U is different, but look at the entries of S: the non-zero eigenvalues are the same, while V will be different. You can also see that V is equal to U here, because the matrix being decomposed is symmetric.
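The remark that X X transpose and X transpose X share the same non-zero eigenvalues, while their eigenvector matrices differ, can be verified with a small sketch on hypothetical 2 by 6 data.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.random((2, 6))                    # 2D problem, 6 samples (hypothetical)
Xc = X - X.mean(axis=1, keepdims=True)

U1, s1, _ = np.linalg.svd(Xc @ Xc.T)      # 2 x 2 scatter
U2, s2, _ = np.linalg.svd(Xc.T @ Xc)      # 6 x 6 variant
print(np.round(s1, 6))                    # two non-zero eigenvalues
print(np.round(s2[:2], 6))                # the same two values; the rest are ~0
```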
Since X X transpose (and likewise X transpose X) is a symmetric matrix, U and V of its SVD are the same, and you can use the eigenvectors from either of them. So that is what PCA does. There are other variants of scatter matrices available in the literature, and we will look into those, see an example of another type of eigen analysis and decomposition which is possible, and then compare both with an example. Let us look at a scatter matrix called the within-class scatter matrix. The expression is now a little different from what we have seen so far: in PCA we had a scatter matrix without the subscript W, with a single summation over (x_k minus mu)(x_k minus mu) transpose. What is different here? You see that the mean now has an index i for a particular class, and this index i runs from 1 to c, the number of classes. So this scatter matrix requires that the samples belonging to each class are separately grouped, in order to compute the within-class scatter, that is, the scatter within a particular class. You must know which samples belong to each class. If you have pictures of 10 different persons, face image samples, then for each particular class (each person, in this case) you must group the samples belonging to that class and then compute the inner sum of the expression, because you cannot compute mu_i unless you know which samples belong to the i-th class. So the x_k in the inner sum are the samples belonging to class X_i, and the class labels must be given. We are not doing a PCA here; we are doing something else, which I will describe soon. You sum over all classes, and for each class you take the samples belonging to that class X_i and compute their scatter around the class mean; the rest is the same as before. Once you have the class mean mu_i and the data samples x_k belonging to the class, you can compute the within-class scatter matrix; the subscript W here indicates within-class, not the weight matrix. Similarly, you have a between-class scatter matrix, which is the scatter of the class mean vectors around the mixture mean of the entire data. Look at this expression: you have the overall data mean, which you used for PCA earlier, and you also have the individual class means, and using these you form the between-class scatter, the outer product of (mu_i minus mu) with itself, multiplied by the number of samples in class i, summed over all classes. Sometimes people casually call these the intra-class and inter-class scatters: the intra-class scatter means the scatter within each particular class, and the inter-class scatter means the scatter between classes. So if you compute these expressions as S_W and S_B, they have certain very nice properties, not all of which I can discuss due to limitations of time. What is the within-class scatter? It shows the scatter of the samples around their respective class expected vectors, the class means. How are the class means themselves scattered? That is the second term, S_B, where B stands for between. We will keep this notation of W indicating within-class scatter and B indicating between-class scatter.
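A minimal sketch, with hypothetical labelled data, of the within-class and between-class scatter matrices just defined; it also checks the property mentioned next, that the total mixture scatter is their sum.

```python
import numpy as np

rng = np.random.default_rng(8)
n, c, Ni = 3, 3, 10                       # 3 classes, 10 samples each (hypothetical)
X = rng.random((n, c * Ni))
labels = np.repeat(np.arange(c), Ni)      # class label of each sample

mu = X.mean(axis=1, keepdims=True)        # overall (mixture) mean
S_W = np.zeros((n, n)); S_B = np.zeros((n, n))
for i in range(c):
    Xi = X[:, labels == i]                # samples belonging to class i
    mu_i = Xi.mean(axis=1, keepdims=True)
    S_W += (Xi - mu_i) @ (Xi - mu_i).T    # within-class scatter around mu_i
    S_B += Xi.shape[1] * (mu_i - mu) @ (mu_i - mu).T   # N_i times outer product of means

S_T = (X - mu) @ (X - mu).T               # total (mixture) scatter
print(np.allclose(S_T, S_W + S_B))        # True: S_T = S_W + S_B
```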
The overall mixture scatter of the data samples, which we have already seen, is the sum of these two, S_T = S_W + S_B. This is just a property; the proof is given in many statistics books. Now, a criterion for class separability needs to convert these matrices into a single number; we will see how this is done, and this number should be large when the between-class scatter is large and the within-class scatter is small. Why do you want this? Forget for a moment the mechanics of converting matrices into a number; we will form different criteria using S_W and S_B. Remember, when we talked about classification versus clustering much earlier in this course, we said that the classification problem becomes easier, and the job of a classifier becomes easy, if the within-class scatter is small and the between-class scatter is large. If there are two classes, class A and class B, you would prefer a scenario where the distance between the individual class means is wide and the scatter of the samples within each class is not large; then the samples will not overlap, leading to an easy formulation of a good decision boundary, linear or nonlinear as the case may be, between the two classes. So you want a larger between-class scatter and a smaller intra-class, or within-class, scatter. Going back to the expressions of S_W and S_B, I want small values of S_W and large values of S_B, and that will be the basis of the criterion I want to optimize in order to obtain a new projection. Typical examples of such criteria are these: take, for instance, the trace of S1 divided by the trace of S2, where S1 and S2 are S_B and S_W; I want one of them to be large and the other small, so the between-class scatter S_B, which should be large, goes in the numerator, and the within-class scatter, which should be small, goes in the denominator. If the trace does not suit you, you can also use determinants or other matrix norms as variants to convert the matrices to a scalar quantity. This leads us to a new method of classification which falls under the category of supervised learning, because the training set is labeled: you need samples of the different classes available to you to compute the within-class and between-class scatters. It is called linear discriminant analysis, or LDA, and it tries to shape the scatter in order to make it more reliable for actual classification. So now we are trying to select the W which maximizes the ratio of between-class scatter to within-class scatter. The between-class scatter we have already defined as the sum over i of N_i times (mu_i minus mu)(mu_i minus mu) transpose, where mu_i is the mean of class i and N_i is the number of samples belonging to class X_i, and the within-class scatter S_W was defined earlier. What we want to do is maximize the ratio of S_B to S_W: we must select a projection matrix such that, in the projected domain, the between-class scatter becomes large and, if not the individual scatters, at least the ratio becomes larger, so that classification works better. To compute these scatter matrices you need the class labels, hence it is called supervised learning, unlike PCA, which is unsupervised because you did not need the class labels.
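As an illustration of the LDA criterion just described, here is a sketch with hypothetical Gaussian class samples. The particular solution route used here, taking the leading eigenvectors of S_W inverse times S_B, is an assumption on my part; the lecture only states the ratio criterion.

```python
import numpy as np

rng = np.random.default_rng(9)
n, c, Ni = 4, 3, 20
means = [0.0, 3.0, 6.0]                   # hypothetical, well-separated class means
X = np.hstack([rng.normal(m, 1.0, (n, Ni)) for m in means])
labels = np.repeat(np.arange(c), Ni)

mu = X.mean(axis=1, keepdims=True)
S_W = np.zeros((n, n)); S_B = np.zeros((n, n))
for i in range(c):
    Xi = X[:, labels == i]
    mu_i = Xi.mean(axis=1, keepdims=True)
    S_W += (Xi - mu_i) @ (Xi - mu_i).T
    S_B += Xi.shape[1] * (mu_i - mu) @ (mu_i - mu).T

# leading eigenvectors of inv(S_W) @ S_B give directions favouring S_B over S_W
evals, evecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(evals.real)[::-1]
W_lda = evecs[:, order[:c - 1]].real      # at most c - 1 useful directions
print(W_lda.shape)                        # (4, 2)
```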
When you do not have class labels you can apply PCA; when you do have class samples you can apply either PCA or LDA, and in fact we will see why PCA usually precedes the LDA method. I think we will stop with this.