Can you hear me? Now yes, better. Is everybody back? Okay, I will start. Hello to everybody again. For me it is a bit strange to look at the camera, but I will try to do it. Today we are going to continue talking about dimensionality reduction.

Let me start with a short reminder of what we saw last day. We are talking about unsupervised learning, and in particular, last lecture and today, about dimensionality reduction. By reduction we mean that our original data, which we can express as a vector for each data point, live in a space with a large number of dimensions D. What we try to do is to project these data into a different space, so that each data point is described by a small number of dimensions d, where d is usually much smaller than D. I explained how this is related to physics by comparing it with a human-made dimensional reduction: the case of two free atoms versus two bonded atoms. I also explained one of the possible methods to obtain this transformation, which is called PCA. If you remember, this method tries to maximize the trace of the covariance matrix by applying a linear transformation to the data. So for using PCA the only thing you need to do is to compute the covariance matrix, in which each element is C_ij = (1/N) sum_n (x_i^(n) - mean(x_i)) (x_j^(n) - mean(x_j)), a sum over all the data points, each one centered, and then diagonalize it in order to reduce the dimensionality.

What I did not explain last time, and it was a question from one of you, is how to decide what is "big" and what is "small". That is the first thing I am going to explain now. But before that, let me ask if there is any question after this fast reminder. There are no questions, so let me explain how to decide the number of eigenvectors.

You have the eigenvalues and you can sort them. Suppose my data lie in a hyperplane; let me take the simplest hyperplane I can imagine, a straight line in two dimensions. If I apply PCA here I find a new basis. One eigenvector, let us call it b_1, is associated with an eigenvalue lambda_1 that is the variance of my data along this straight line. Since I am exactly on a line, the second eigenvalue lambda_2 is strictly 0. Once I build this intuition, what happens if I am only approximately in a hyperplane? If I plot lambda as a function of its rank, the first one is the biggest, the second one a bit smaller, the third one a bit smaller, but at a certain point, if the data is something similar to a hyperplane, you will see a gap in your spectrum of eigenvalues. If I see something like that, I can say: here I can retain just the first three eigenvalues, so the first three components of my PCA, and those are the principal components, and ignore the other ones. This is what happens when your data is nearly in a hyperplane.

What happens if this is not the case? Then you will find a spectrum that is gapless, with no clear cut. In this case what you can do is, for instance, decide that you want to keep a given percentage of the covariance explained. As we were saying last day, the fidelity of your reproduction is f = sum_{i=1..d} lambda_i / sum_{i=1..D} lambda_i, the sum of the eigenvalues that you keep divided by the sum of all of them, and it lies between 0 and 1.
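To make the recipe concrete, here is a minimal numerical sketch (illustrative code, not the exact procedure from the board) of PCA done by hand: center the data, build the covariance matrix, diagonalize it, and look at the sorted eigenvalue spectrum and the cumulative fidelity. The toy data set is an assumption chosen so that a clear gap appears after the first eigenvalue.

```python
import numpy as np

def pca_eigenspectrum(X):
    """Center the data, build the covariance matrix C_ij and return its
    eigenvalues (sorted, largest first) and eigenvectors."""
    Xc = X - X.mean(axis=0)               # center each feature
    C = Xc.T @ Xc / X.shape[0]            # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh because C is symmetric
    order = np.argsort(eigvals)[::-1]     # decreasing variance
    return eigvals[order], eigvecs[:, order]

# Toy data: points close to a line (a "noisy hyperplane") embedded in 3D.
rng = np.random.default_rng(0)
t = rng.normal(size=(500, 1))
X = t @ np.array([[2.0, 1.0, 0.5]]) + 0.05 * rng.normal(size=(500, 3))

lam, V = pca_eigenspectrum(X)
print(lam)                          # a clear gap after the first eigenvalue
print(np.cumsum(lam) / lam.sum())   # fidelity f for d = 1, 2, 3
```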
So one way to reduce your dimensionality when the spectrum is gapless is to say: I want a fidelity of roughly 0.9, and I choose my d in such a way that f is equal to 0.9.

There is a question in the chat, so let me explain again the case where there is a gap. If the data lie exactly in a hyperplane, there is a gap by definition, because after the pure rotation only the significant variables keep a nonzero variance and the other ones are exactly 0. If you are in something that is only close to a hyperplane, you still see a gap, but of course, since the data are not exactly in a hyperplane, the remaining eigenvalues cannot be 0. It is the difference between having an exact straight line and a slightly scattered one: in the second case the variance in the extra dimension is small, but it is not 0. In any case, you see a gap if your data is similar to a hyperplane. This is the best way of performing PCA: you check if there is a gap, and if there is one, you can be quite sure that your transformation is meaningful.

Somebody asked what the fidelity implies for the PCA. The point is that PCA performs a rotation, and then you keep as many components as you want. If you keep all of them you have 100% fidelity, but then you have only rotated your data; you are not reducing the dimensionality. If I want to reduce the dimensionality, I want my data in the reduced space to be as similar as possible to the data in the original space, and the fidelity quantifies that. The fidelity, let me write it again, is the sum of the eigenvalues that you keep in your representation divided by the total sum. If you look at this equation, if I keep everything and only rotate, I am not altering my data, because a rotation is not a change of my data, and f = 1. If d is smaller than D, as it usually happens, f can be any number between 0 and 1. If I obtain exactly 1, it means that my data live in a hyperplane whose real dimension is d, embedded by a rotation in the D-dimensional space. So the fidelity is a way to quantify how similar my projection is to the original data.

Still talking about dimensionality reduction: PCA consists of a linear transformation. That is by definition, because I take my transformation to be a linear one, but it means that if my data is not linear I cannot use PCA properly. Imagine that you have a data set in two dimensions shaped like a curve. Whatever transformation I do with PCA, whatever rotation, will not describe my data properly, right? Even though this data is approximately one-dimensional, it is not embedded in a hyperplane. You can see that this is a problem: the manifold in which my data lives is not linear, it is a bit twisted. I can even imagine having something like a spiral, why not? Data are odd. In these cases PCA cannot reduce the dimensionality, understood?

I am going to explain to you, in the next fifteen minutes or so, methods that allow you to reduce the dimensionality, that is, to obtain this transformation, in cases in which the transformation is not linear. The first thing I have to explain is called multidimensional scaling. Multidimensional scaling is a different way to think about dimensionality reduction. Imagine that in this original space we are talking about we have the Euclidean distance between two points, which is nothing else than the square root of the sum of the squared differences of their coordinates, delta_ij = sqrt( sum_k (x_k^(i) - x_k^(j))^2 ). Is this formula clear to everybody?
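As a small illustrative sketch of both points above (the function name and the toy data are my own assumptions): picking d from a fidelity threshold, and why this fails to reveal a nonlinear manifold. Points on a circle are intrinsically one-dimensional, but their two PCA eigenvalues are essentially equal, so there is no gap and no single linear direction captures the data.

```python
import numpy as np

def choose_d(eigvals, target=0.9):
    """Smallest d whose cumulative fidelity reaches the target
    (eigvals assumed sorted in decreasing order)."""
    f = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(f, target)) + 1

# Intrinsically one-dimensional but nonlinear data: points on a circle.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 1000)
X = np.column_stack([np.cos(theta), np.sin(theta)])

Xc = X - X.mean(axis=0)
lam = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / len(X)))[::-1]
print(lam)            # both eigenvalues are about 0.5: no gap at all
print(choose_d(lam))  # PCA needs d = 2 even though the curve is 1D
```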
The distance in the projected space, d_ij between y_i and y_j, is defined in exactly the same way, okay? Exactly the same formula. The idea that underlies multidimensional scaling is to choose the y in such a way that these distances are as similar as possible to the original ones. What allows you to do that is a function called the stress, S, and the difference between the different multidimensional scaling methods comes from how they define this comparison between the two sets of distances: the stress depends on the distances in the original space and on the distances in the projected space.

"Excuse me, could you repeat the name of the S?" Oh, stress, okay, thank you. Sorry, I know my spelling is not really good.

Okay, so let me have some space. We can define the stress in many ways. The classical way of defining it, what is called classical multidimensional scaling, is a sum over all pairs of points comparing the squared distances in the two spaces; let me put it like that. Classical multidimensional scaling has the advantage that its minimization, which is what you need to do, can be solved analytically, and when you minimize it the solution is almost identical to a PCA, in the case that you use Euclidean distances. You could say that this result is useless, but it is not, because Euclidean distances are not the only way we have to compare configurations. If you use a different kind of distance, not the Euclidean one, in this multidimensional scaling, you obtain something with the same characteristics as PCA but built on a different set of distances. You always arrive at a diagonalization, so it is something really similar to PCA. I am not going to derive that, because we don't have time.

You can also think of a different kind of stress, what is called the metric stress. You take a different definition: S is the sum over all pairs of the squared residual between the original and the projected distance, divided by the original distance, S = sum_{i<j} (delta_ij - d_ij)^2 / delta_ij. It is different, as you can see, and as opposed to the previous one it has no analytical solution: to minimize it you need to employ numerical minimization methods. Compared with classical multidimensional scaling, probably the most important difference is that it does not depend in the same way on the scale of the distances: by dividing by this quantity you are weighting the long distances less. So it is a bit different, but it is an option that is, let's say, not so commonly used, mostly because it does not necessarily converge to the same minimum: numerical minimization usually finds local minima, not global ones, so it is not always convergent.

There is also a variant in which we do a different thing and, for defining this stress, we use weight functions of the distances, in an expression really similar to the one I employed for the metric multidimensional scaling. This is a really powerful method, but you need to define these functions. The idea is that, if you choose them wisely, you do not have to care in the multidimensional scaling about all the distances equally but, for instance, only about the ones that are close, the small distances. In this way you can reproduce the local environment of your data without caring too much about the long distances: things that are far should end up far anyway, but you don't care so much whether they are at one kilometer or at ten kilometers.
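Since the classical case reduces to a diagonalization, it is easy to write down. Here is a minimal sketch (the standard double-centering construction; the function name is my own) that takes a matrix of pairwise distances and returns the projected coordinates; with Euclidean distances the result coincides with the PCA projection up to signs.

```python
import numpy as np

def classical_mds(delta, d=2):
    """Classical multidimensional scaling from an (n, n) matrix of
    pairwise distances `delta`: double centering plus diagonalization."""
    n = delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (delta ** 2) @ J          # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:d]    # keep the d largest eigenvalues
    L = np.clip(eigvals[order], 0.0, None)   # guard against small negatives
    return eigvecs[:, order] * np.sqrt(L)    # projected coordinates y
```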
The point of this weighted, non-metric flavor of multidimensional scaling is that, in many, many applications of multidimensional scaling and of dimensionality reduction, what you are really interested in is your nearest neighbors, and the points that are far away are not so important. This is a problem of PCA, because PCA, if you express it in the language of multidimensional scaling, usually weights the long distances more. So when you apply this, what you can do is choose these weight functions wisely, in such a way that small distances count normally while long distances only matter in the sense that what is far should stay far, but you don't care about exactly how far. For instance, you can apply an exponentially decaying weight or other kinds of functions. For sure it means that your scaling will depend on the functions that you choose, so it is a choice you have to make, but it can be really useful in some cases. I am going to stop here for questions, because we are about halfway through.

There are some questions. One question is what is the difference between multidimensional scaling and least squares. Well, in the case of the first stress that I wrote, it is a least squares: you apply least squares to your distances. The idea is that you use a least-squares-type loss, but what you are optimizing, let me write again what we were talking about, what you optimize by minimizing this S, is the set of y. So it is a bit more complex, because this distance d_ij depends on the y, and this distance delta_ij depends on the x; what you vary when you are optimizing S is the y. So you can think of it as a least-squares loss function. The sum over i and j means a sum over every possible pair of different configurations. The x always correspond to the coordinates in the original space, and they define delta_ij, which is nothing else than the distance computed with my coordinates in the original configuration space; d_ij is the same but in the projected space.

"So, in order to get the small d_ij, we need the y's. But do we first choose an ansatz function that maps the x's to the y's?" No, we just optimize the y's directly through the stress. In the case of the classical stress you can obtain them directly: it is a diagonalization, and you directly obtain the minimum of the stress as a function of the y's. In the other cases, what you have to do is obtain the derivatives of the stress with respect to the y's by applying the chain rule, and then, starting from random y's, you arrive at a local minimum. "Okay, so it is not necessary for us to have an explicit function that maps the x's into the y's?" No, we don't have that; we obtain the y's directly by minimizing the stress. "Okay, thank you."

No more questions, so I am going to go on and mention some other methods, those that are most used, but we are not going to explain the maths behind them; we are just going to give the concepts that underlie them. One method that is still pretty used is Isomap. What do you do in Isomap? You compute a multidimensional scaling in which the distances in the original space are not the Euclidean ones but what is called the geodesic distance. What is the geodesic distance? I will explain just the concept, without going too much into the details. Imagine that you have... I will just draw it like that. Okay.
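To make the idea of weighting distances concrete, here is a rough sketch of a weighted stress minimized numerically; the exponential weight exp(-delta/sigma) and the use of scipy's general-purpose minimizer are illustrative assumptions, not the specific choices made in the lecture, and, as discussed, a random start only guarantees a local minimum.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

def weighted_stress(y_flat, delta, n, d, sigma=1.0):
    """Stress with weights F(delta) = exp(-delta / sigma): pairs that are
    close in the original space dominate, far pairs barely count."""
    y = y_flat.reshape(n, d)
    dist_y = pdist(y)              # pairwise distances in the projected space
    w = np.exp(-delta / sigma)     # illustrative weight function
    return np.sum(w * (delta - dist_y) ** 2)

def weighted_mds(X, d=2, sigma=1.0, seed=0):
    n = X.shape[0]
    delta = pdist(X)               # pairwise distances in the original space
    y0 = np.random.default_rng(seed).normal(scale=1e-2, size=n * d)
    res = minimize(weighted_stress, y0, args=(delta, n, d, sigma),
                   method="L-BFGS-B")  # numerical minimization, local minimum only
    return res.x.reshape(n, d)
```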
You see that the meaningful structure here is something like this... let me just draw it, something like that. If you compute the Euclidean distance between these points, you obtain, for instance, from this one to this one, a distance that is fine, but you also obtain this straight-line distance across the fold as a Euclidean distance. With the geodesic distance, what we try to do is to compute the distances within the manifold. Okay, I will reply to the other question later; let me finish this.

So what you try to do is compute the distance within the manifold, and for doing that what you do is approximate the distance from this point to that point as a sum of distances between neighbors. This way of computing depends on how many neighbors you consider, or you can put a cutoff: for instance, you put a cutoff here and compute the Euclidean distances to the points inside it, then you put a cutoff there and compute the Euclidean distances again, and at the end, from here to there, you find the path that gives you the minimum total distance, and that is your geodesic distance. By using the geodesic distance, what you are doing is unfolding your data, unfolding your manifold: for points that are close on the manifold the geodesic distances are almost the same as the Euclidean ones, but when you go far along the manifold the two become very different. So by computing the geodesic distances you are computing the distances in the manifold in which the data lives, and once you have these distances you try to reproduce them in the projected space with classical multidimensional scaling. I don't know if that is clear.

Let me try to reply to the pending questions. I did not explain how to determine the weight functions F_ij because it depends: there are many different ways of determining them, but usually what you try to do is to have functions that decrease when the distances are really big. Each of these choices in the scaling can give you a different result, depending on what you want to do: for instance, sometimes it is better to have a good projection than a bad one, even if you cannot reproduce that projection for new data; if your data looks like that, yes, you can use PCA and obtain a function that you can apply to new points, but maybe it is not the optimal choice. There was another question: the distances that you use to build the nearest neighbors are usually the Euclidean ones, but there is no restriction; you can build the geodesic distance using whatever distance is better for your data, and for many data, if the manifold is not Euclidean, it would probably be better to employ different distances.

The questions could continue, but we are running out of time. There are many, many methods, based either on PCA or on multidimensional scaling. The example I gave you is Isomap, but there are many ways of doing this; I will put in the materials some reviews, because we don't have time to look at all of them.

What is important, what I just want to explain now, is what makes dimensionality reduction powerful: real data usually lies on manifolds of relatively low dimensionality, and that is because in interesting data there are correlations, as I explained yesterday. In physics, too, the interactions between the atoms, the Hamiltonian of the system, determine a lot of correlations, and these correlations lead to spaces where the dimensionality is low. But it is not trivial to check.
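Putting the geodesic-distance procedure just described into code, here is a minimal sketch of the Isomap idea (the neighbor-graph and shortest-path routines are standard library calls; the parameter values are assumptions, and the neighbor graph is assumed to be connected, otherwise some geodesic distances come out infinite).

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap(X, n_neighbors=10, d=2):
    """Geodesic distances from a k-nearest-neighbor graph, then classical
    multidimensional scaling on those distances (a sketch of Isomap)."""
    # 1. Euclidean distances only between each point and its nearest neighbors.
    graph = kneighbors_graph(X, n_neighbors, mode="distance")
    # 2. Geodesic distance: length of the shortest path along the graph,
    #    i.e. the distance measured within the manifold.
    geo = shortest_path(graph, method="D", directed=False)
    # 3. Classical MDS on the geodesic distances (double centering + diagonalization).
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (geo ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))
```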
So there are methods that are oriented not to project your data onto a given manifold, but just to tell you the dimensionality of the manifold in which your data lies. These methods are usually called intrinsic dimension estimators, and what they try to do is, directly from your data, estimate the dimension of the manifold in which the data lies. In many cases this corresponds to the minimum number of coordinates, of features, that you need to describe your data. In the case of PCA, when we look at the spectrum and decide that, after a gap, three dimensions are enough, what we are saying is that the intrinsic dimension of my manifold is three.

To determine this quantity I will just give you a really, really fast idea, and it is that the number of points within a given radius of a given point, the density of points, depends on the dimension. Imagine that I am in one dimension, points on just a line: if I look at this point, the number of points within a given radius, let's put it, within this radius, is proportional to the radius to the power one. In two dimensions, if my data is uniform, the number of points within a given radius is proportional to the area of the circle, so proportional to the radius squared; in dimension three it is proportional to the radius cubed, and so on. So many of the methods that try to estimate this just take the distances from one point to its neighbors and use how these distances scale: for instance, in this case you would have that the logarithm of the number of neighbors is proportional to the logarithm of the radius, and the slope gives you the dimension.

"Excuse me, is this dimension the capital letter D related to the original data?" No, it is the d, the one in which you want to project. "So we should define the small letter d first and then come to this intrinsic dimension?" Yes, this one is not the D; it is the intrinsic dimension, it is what you want to compute. Here you are using the distances in the original space, and you are computing which is the dimension of the manifold in which the data live. Then, if you have a good projection method, you can try to project into this d; projecting into a d that is smaller than this one would give you some error, because you would not be able to fully express the manifold in which your data live.

"I have a question. Finally, what is the relation between the PCA, for example, and the d in this formula?" There is a relation, yes. Let's say that if your PCA is well done (and it is not that you are going to employ exactly this estimator, but it is probably the simplest one), this method would give you a value of d, and if your data do not live in a hyperplane you would need more principal components than this d, because this d is, let's say, the minimum number of variables that you need for the projection. If your data live in a hyperplane, then this d is the one that gives you the gap; if not, you will probably need more principal components than the intrinsic dimension. "Okay, thank you."

If there are no more questions, we can leave it here. Next day I will start with clustering methods. I will change the format: I will do the lectures from my house, with a PowerPoint, with some slides. So thank you very much and see you next day.
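Here is a minimal numerical sketch of this scaling argument (a correlation-dimension-style estimate; the data set, radii, and function name are my own illustrative choices): count how many neighbors fall within each radius and fit the slope of log N(r) against log r.

```python
import numpy as np
from scipy.spatial.distance import cdist

def intrinsic_dimension(X, radii):
    """Estimate the intrinsic dimension from the scaling N(r) ~ r^d by
    fitting the slope of log <N(r)> versus log r."""
    D = cdist(X, X)  # all pairwise distances
    counts = [(D < r).sum(axis=1).mean() - 1.0 for r in radii]  # exclude the point itself
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

# A curved one-dimensional manifold (a helix) embedded in 3D:
# the estimate should come out close to 1, not 3.
rng = np.random.default_rng(2)
t = rng.uniform(0, 4 * np.pi, 2000)
X = np.column_stack([np.cos(t), np.sin(t), 0.1 * t])
print(intrinsic_dimension(X, radii=np.array([0.05, 0.1, 0.2, 0.4])))
```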