Dealing with high dimensional data is always a problem. It could be the pixel features of an image, or the gene expressions used in cancer research. Processing such large feature sets takes time. So, is there a way around this? Usually such data can be compressed without much loss of information. By reducing the dimensionality, we get faster processing and potentially improved performance. There are two ways to perform dimensionality reduction. One is feature selection, where we reduce the dimensions by choosing a subset of the original feature space. The other is feature extraction, where we derive a smaller set of new features from the original ones. In this video, we're going to see exactly how to perform such dimensionality reduction with a famous feature extraction technique, principal component analysis, or PCA. And we'll get into the math that powers it. So, let's get to it.

First, let's understand our goal. Input data carries a kind of "good" noise: the variation that makes each sample what it is. This noise is basically the variance of the data, and hence it represents information. The goal of PCA is to reduce the dimensionality of the data while retaining as much of this variance, that is, as much information, as possible. Simple enough.

Now, what is a point? Let a point x be a vector in a very high dimensional space of dimension d. We want to reduce this dimensionality by projecting it onto a lower, m dimensional space. So, what is PCA doing? PCA is a method of linear dimensionality reduction. Using a linear transformation u on an input x, we should be able to transform it into some z in the reduced feature space. Now, this is just one point. If we consider the case of n points, we have a matrix of features instead.

Now, what is information? The information is represented by the covariance matrix Sx. Remember, the goal is to reduce the dimensions while minimizing information loss, so we have an optimization: maximize the variance retained after the projection. This maximization has no upper bound on u, so u could grow arbitrarily large. Let's add the condition that every vector in this matrix has unit magnitude. Now that we have an optimization problem with equality constraints, we can solve for u using the method of Lagrange multipliers. I've discussed this in detail while deriving the dual formulation of the SVM; you can check it out in the info card at the top. Let's quickly go through the general technique here, though. In the general case, x is a variable for which we seek to minimize a function f under a set of constraints determined by functions g sub i. The new set of variables, called the Lagrange multipliers, is the set of lambda i. Take the derivative of the Lagrangian with respect to the variable, set it to zero, and we get an equation. I'll stop here because this is as far as we need to go.

Applying this to our current optimization, we introduce lambda as our Lagrange multiplier. Once we differentiate with respect to u and equate to zero, we get a final equation: Sx u is equal to lambda u. Since this is a matrix equality, each column on one side must equal the corresponding column on the other, so we really have a set of m equations, one per column of u. This looks just like an eigenvector equation, and we can solve it by eigen decomposition of the covariance matrix Sx. However, Sx is a d cross d matrix. Hence, if it is diagonalizable, it leads to d pairs of eigenvectors and eigenvalues. So how do we obtain m pairs from these d pairs? Let's talk about that.
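Before unpacking eigen decomposition itself, here is a minimal sketch of the full linear PCA recipe we're building toward: center the data, form the covariance matrix Sx, eigendecompose it, and project onto the top m eigenvectors. The function and variable names are my own illustrative choices, and I use NumPy's eigh since the covariance matrix is symmetric.

```python
import numpy as np

def pca(X, m):
    """Project n points in d dimensions onto the top m principal components.

    X : (n, d) data matrix, one sample per row.
    m : target dimensionality (m <= d).
    Returns Z (n, m), the projected points, and U (d, m), the projection matrix.
    """
    # Center the data so the covariance matrix captures variance about the mean.
    X_centered = X - X.mean(axis=0)

    # Covariance matrix Sx is d x d.
    S = np.cov(X_centered, rowvar=False)

    # Sx is symmetric, so eigh gives real eigenvalues and orthonormal eigenvectors.
    eigenvalues, eigenvectors = np.linalg.eigh(S)

    # eigh returns eigenvalues in ascending order; sort descending and keep the top m.
    order = np.argsort(eigenvalues)[::-1]
    U = eigenvectors[:, order[:m]]  # d x m matrix of the top m eigenvectors

    # z = u transpose x for every point, i.e. Z = X_centered @ U.
    Z = X_centered @ U
    return Z, U

# Example: reduce 5-dimensional points to 2 dimensions.
X = np.random.randn(100, 5)
Z, U = pca(X, m=2)
print(Z.shape)  # (100, 2)
```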
Consider a square matrix S. Eigen decomposition splits, or decomposes, this square matrix into a product of three matrices: S is equal to P D P inverse, where P is a matrix of eigenvectors and D is a diagonal matrix whose diagonal entries are the eigenvalues. A diagonal matrix has all its non-diagonal elements equal to zero, and for this reason eigen decomposition is also called matrix diagonalization. So say we want to diagonalize a matrix S. We basically want to find the two matrices P and D, and hence we want to find pairs of eigenvalues and their corresponding eigenvectors. If S is an n cross n square matrix, we have n such pairs.

But why are we doing this? Why do we want pairs of eigenvalues and eigenvectors? Say we want to find the eigenvalue and eigenvector pairs of S. Then they must satisfy the equation S u is equal to lambda u, where lambda and u form one such pair. Now, what does this represent? Well, just read it off the equation: applying the transformation S to the vector u only changes its magnitude, by a factor of lambda, and not its direction. If we project our points in the direction of the vector u, we retain a variance proportional to lambda. Since an n cross n matrix has at most n eigenvalue and eigenvector pairs, the sum of all n eigenvalues corresponds to the entire variance of the data. If, however, we choose to project the data along a subset of these n vectors, say the top m eigenvectors, then the variance retained is the sum of just those m eigenvalues. Hence the amount of information retained can be expressed as a percentage of the original: the sum of the selected eigenvalues divided by the sum of all the eigenvalues, times 100. By picking a subset of these eigenvector and eigenvalue pairs, we are able to retain most of the information while only using a fraction of the original dimensions.

So here's another question: which eigenvector and eigenvalue pairs do we select? Look back at the equation we obtained while maximizing the retained variance. For a covariance matrix Sx of shape d cross d, we first determine the d eigenvector and eigenvalue pairs. How? Matrix diagonalization. Now sort these pairs by their eigenvalues. Since the eigenvalues are proportional to the variance retained, we select only the top m pairs with the highest eigenvalues. Why the top m? Because this is the smallest set of eigenvectors that retains the maximum variance. Each of these eigenvectors corresponds to a column ui, and hence the entire matrix u is actually the matrix of the top m eigenvectors of Sx.

So we now have a transformation matrix u. The final question is: how do we actually perform the transformation? Remember, z is equal to u transpose x. So we can transform a d dimensional vector x into an m dimensional vector z while still retaining most of the variance, and hence the information, in the data. Now that's awesome. This is the fundamental idea behind dimensionality reduction with PCA.

But wait, this is simple linear dimensionality reduction. What if instead of the simple base vector x, we have a complex nonlinear basis feature phi of x? If we perform the same reduction we did before, we arrive at the same optimization. But this time the covariance matrix is harder to handle, because the difficult-to-compute phi term stands in the way of determining the transformation u. So let's rearrange the terms. Now we are left with this phi transpose phi term. How do we deal with it? I've mentioned this in the last two videos, but I'll give you a brief recap here. It's through kernelization. Kernels are functions of the base features x that satisfy two properties: they are symmetric and positive semi-definite, as the small sketch below illustrates.
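Here is one concrete example of such a kernel, checked numerically. The choice of the RBF kernel and all the names here are my own; the video doesn't pin down a specific kernel. Note that the kernel matrix K is built directly from the base features x, with no explicit phi.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """RBF (Gaussian) kernel matrix built directly from the base features x."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

X = np.random.randn(50, 4)
K = rbf_kernel(X, gamma=0.5)

# Property 1: symmetry, K[i, j] == K[j, i].
print(np.allclose(K, K.T))
# Property 2: positive semi-definiteness, no meaningfully negative eigenvalues.
print(np.all(np.linalg.eigvalsh(K) > -1e-9))
```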
It is because of these properties that the inner product of the complex basis features can be written in terms of the base features x using such kernels. And it is because of such kernels that algorithms like SVM are so versatile. So let's now try to kernelize this expression. Since we want to dimensionally reduce n points in phi to a new feature space z, we want z equal to phi times u. We use this in our expression for u, group the phi terms, and replace them with our kernel matrix K. Now what does this remind you of? Yep, the eigenvector equation. So to get z, all we need to do is diagonalize the kernel matrix K and we're done. To repeat, when I say diagonalize K, I mean that from this n cross n kernel matrix we first determine n eigenvector and eigenvalue pairs. Then we sort these pairs by their eigenvalues and select the top m vectors to constitute z. And that's what we want in the end. We are able to determine the final transformation z without ever knowing the true nature of the complex nonlinear basis phi. Once again, through the power of kernels. This is why kernel PCA can be used for dimensionality reduction even from a complex nonlinear feature space.

I've linked resources on the exact steps to diagonalize a matrix by hand, so check those out in the description below. The mechanical process isn't too difficult. I'll be making different kinds of videos, from paper discussions to algorithm discussions to comedy shorts, just making sure my channel has a variety of high quality content. If you liked the video, hit that like button, hit that subscribe button, ring that bell for notifications, and I'm looking forward to your support.
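Putting the kernel trick and the diagonalization step together, here is a rough sketch of kernel PCA. Again, the names and the RBF kernel are my own illustrative choices, the centering step is a standard detail not spelled out in the video, and scaling each selected eigenvector by the square root of its eigenvalue is one common convention for forming the projected coordinates z.

```python
import numpy as np

def kernel_pca(X, m, gamma=1.0):
    """Reduce n points to m dimensions by diagonalizing the n x n kernel matrix K."""
    n = X.shape[0]

    # RBF kernel matrix computed directly from the base features x (no explicit phi).
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * sq_dists)

    # Center the kernel matrix, which corresponds to centering phi(x) in feature space.
    one_n = np.ones((n, n)) / n
    K = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Diagonalize K: n eigenvalue/eigenvector pairs, sorted by eigenvalue, keep the top m.
    eigenvalues, eigenvectors = np.linalg.eigh(K)
    top = np.argsort(eigenvalues)[::-1][:m]

    # Scale each selected eigenvector by the square root of its eigenvalue to get
    # the projected coordinates z for all n points.
    return eigenvectors[:, top] * np.sqrt(np.maximum(eigenvalues[top], 0.0))

X = np.random.randn(150, 3)
Z = kernel_pca(X, m=2, gamma=0.5)
print(Z.shape)  # (150, 2)
```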