Okay, so that's day two, and just as a recap of what we saw yesterday: you found out how to go from the FASTQ files to having a count matrix. Then you saw which QCs you have to look at in order to remove, for instance, dying cells or doublets. We then discussed the normalizations you should be using on your dataset, how you can do feature selection, and how you can remove confounding factors by regressing them out when scaling the data.

Today will be about this part here, mainly — until probably here, I guess — where we will try to understand how we can visualize our data and figure out which cells we have in it. That is what we will be aiming for today. To go from the count matrix to something visual, we will have to discuss dimensionality reduction methods.

So what is dimensionality reduction about? What you are trying to do is simplify the complexity of your dataset so that it becomes easier to work with. The idea is that you start with a huge dataset containing all of your cells and all of your genes, and you want to end up with something that still contains all of your cells but that you can visualize, say in two dimensions, or at least grasp more easily, so that the dataset is easier to work with for some methods. You try to find the redundancy in your data and remove it, identify the most relevant information, and filter out the noise, keeping only what is informative. You also want to reduce computational time for downstream procedures: there are many functions that rely on having a smaller dataset because they cannot compute otherwise. Removing the redundancy and keeping only the most relevant information also facilitates clustering, since some algorithms struggle with many dimensions. But most of all, what you are aiming for is to visualize your data, and as I said, that means going down to two, or at most three, dimensions so that you can plot it.

So let me go out of this and share my screen again — the internet — just for you to see it. There is a web page called scRNA-tools.org, which I think is quite relevant when working in single-cell RNA-seq, because it is a sort of repository where most of the new tools that come out in single-cell RNA-seq are listed. They have been categorized into some 30 categories, so you can have a look at the tools used for visualization, the tools used for dimensionality reduction, and so on, and then you can sort by whatever is interesting for you. So now we are talking about dimensionality reduction. If I may click it... it doesn't want me to click. Okay, it's loading. Once it has loaded, you can click on whatever is interesting for you — for instance, dimensionality reduction — and then you can filter.
And what's nice about it — again, it needs to load; it should be faster, I guess it's because I'm on Zoom at the same time — is that you then get all of the tools for dimensionality reduction in single-cell RNA-seq. You can have a look at what these methods are, whether they are in R or in Python, and also how many people download them, so how popular a tool is. Here, this one has 73 downloads per month, for instance; this one, which is t-SNE, has 272 per month; and then you have those which are much higher. You can also see whether a tool is on Bioconductor or not, and that is quite helpful for understanding what is out there. You can click on one — let's take this one — and you get a small description: where to find the code, whether it is Python-based or R-based, the license, and how long it has been around. Something else quite interesting is the number of citations, because that also tells you how popular the tool is. So that was just for scRNA-tools, and now I can go back to my presentation, which was here.

For dimensionality reduction you can see the number of tools that exist, or at least that are in this repository right now: 348. We will certainly not discuss all 348 tools today. I chose the most popular ones, and the ones that are actually inside Seurat, and I will try to explain how they work, so that you have a general understanding of what you should be careful about, what the parameters are, and how to change them in which situation. In Seurat, which is what we will be using today, there are three tools that are mainly used: principal component analysis (PCA), which has been around for more than a hundred years already; t-SNE, which is t-distributed stochastic neighbor embedding; and UMAP, which is uniform manifold approximation and projection. We will discuss these three, but so that you already know how to run them in Seurat: it's quite easy, the functions are called RunPCA, RunTSNE and RunUMAP, and we will practice them in the exercises.

So I now have a poll question — 26... and 27, okay. Which dimensionality reduction are you mostly used to? Is it UMAP, is it t-SNE, is it others, or are you not aware of any of these dimensionality reduction methods? So I can stop it. UMAP is the most popular one; some of you are used to t-SNE, some said others, and some people are not really familiar with any dimensionality reduction method, so I guess this course will be quite useful for you. I'm quite curious whether one person wants to speak up about the other methods they are used to. Should I have included PCA in one of the answers? "Yeah, for me it's PCA." PCA, okay, I will add it. PCA is the first one I will discuss, so it's nice that you are already used to it; maybe you already know all about it, but I just want to make you aware of what PCA is really about, and also of what Seurat will be telling you as an output, so that you can understand what is going on there.

So PCA is a method that is based on the variance in your data, and it is really just a way of changing the original axes into a new axis system: finding the best angle from which to see and evaluate the data.
The new axes that you generate are linear combinations of the original axes, so it is a linear method. How do you choose these new axes to look at your data, and how are they determined? It is an optimization procedure, which is why it is sometimes already counted among the machine learning algorithms. This is how it works. This would be your original dataset, and it is a dummy one where you have only two genes — not what we actually do, but it lets you visualize what PCA is doing: I take a two-dimensional dataset and change it into another two-dimensional dataset. Here is how it looks: you have your two genes, gene 1 and gene 2, and this is how all of your cells are distributed. First you try to find the axis along which your data has the most variance. You can see that the direction in which your data mostly spreads is this one, so this will be your first axis, Z1. Your second axis will then be an axis orthogonal to the first. The choices you would have here are this one, or the one starting here and coming out of your screen; but since the data only varies in this direction, the second best axis to choose, once you have captured that first variation, is this one. That is how you define your second axis.

Now, once you have understood where the most variance is, you can describe that axis in terms of your original axis system. The original axes were E1 and E2, and you can work out how to express the new axis in that system: you go 2 times in the direction of E1 and 1.7 times in the direction of E2. That is your new first axis. The second one is an orthogonal, or uncorrelated, axis to the first, and it is where you have the second most variation; here you go 1 time in the direction of E1 and 1.2 times — minus, this time, so downwards — in the direction of E2, and that is your second new axis. What is important to know is that the axis points in this direction here, but it could just as well have pointed in the opposite direction. You will see in the exercises that some people get the mirror image of what we have in our solution; this is not a problem, it is still the same principal component analysis, the same projection — just the direction of an axis is flipped.

So this is your new axis system, and you can now project your points onto it, onto your two new axes (B1 and B2 on the slide). Some principal component analysis implementations then put these axes on the side, or you can have the origin sitting in the middle of your data; this is just a way of plotting it. Mathematically speaking — and this is a slide you can keep on the side if you want — what PCA is doing is calculating the eigenvectors of the covariance matrix. The eigenvectors are the directions of the new axes, so they are calculated to point where you have the most variance. And then you have the eigenvalues corresponding to these eigenvectors, which are the coefficients attached to them.
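To make that concrete, here is a minimal sketch — toy data and names made up, not the course dataset — using base R's prcomp on a two-gene example. It shows that each principal component is a linear combination of the original gene axes (the rotation) and that the eigenvalues come out as the squared standard deviations:

```r
# Toy illustration: PCA on two correlated "genes" with base R.
set.seed(1)
gene1 <- rnorm(100, mean = 0, sd = 2)
gene2 <- 0.8 * gene1 + rnorm(100, sd = 0.5)   # correlated with gene1
toy   <- cbind(gene1, gene2)                  # cells in rows, genes in columns

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

pca$rotation   # each column is one PC written in the original gene axes (the loadings)
pca$sdev^2     # eigenvalues of the covariance (here: correlation) matrix
head(pca$x)    # the cells projected onto the new axes (the PC scores)
```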
These eigenvalues tell you how much variance is carried by each principal component. In a dataset where you compute all the principal components, you can work out the proportion of variance carried by each of them: for instance, PC1 carries 22% of the variance, PC2 carries 20%, et cetera. You get this simply by taking the eigenvalue associated with each principal component and dividing it by the sum of all the eigenvalues. Adding these proportions up gives you the cumulative proportion of the total variance carried by the PCs you include: for instance, if you include PC1 and PC2 in your projection, you have already covered 42% of all the variance in your data, and so on. You could then decide to keep, say, 80% of the variance and consider that the rest is perhaps noise that can be forgotten about. You would look at this graph and see: okay, I need to include eight principal components to carry most of the variance in my data — and with that you have reduced the complexity of your dataset.

In R, RunPCA is a fast way to compute principal component analysis: it is an approximation in which you do not compute all the principal components, just a subset. By default that subset is, I think, 50 principal components in the newer versions of Seurat. Since it is only an approximation, you cannot calculate the exact percentage of variance carried by each PC, but you can still make a graph with the eigenvalues: since the eigenvalue of each principal component is correlated with the variance it carries, that is enough to understand where the informative principal components stop and where the noise begins. There is a classic plot, the elbow plot, that is used to select which principal components to keep so that you capture the informative variance in your data. It works like this: you plot the eigenvalues associated with each of the principal components, you consider the curve to be an arm, and you look for the elbow of that arm. Here you can see the elbow would be around this point, so you would know — sorry, it's cut here — that you have to include about four principal components, and that after that the curve starts to flatten. Where it flattens out, the amount of information you gain by adding each further PC stays almost the same and gets smaller and smaller, so you can consider those PCs to be noise; up to the elbow is what you consider informative.

RunPCA also prints an output message, which is quite important: it tells you which genes contribute most to the PCs, positively and negatively. This is something I would really like you to understand, because it is where you should actually have a look.
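In Seurat this looks roughly like the sketch below — a hedged example in which the object name `sobj` and the numbers of PCs are only illustrative:

```r
# Assumes a Seurat object `sobj` that has already been normalized and scaled.
sobj <- RunPCA(sobj, features = VariableFeatures(sobj), npcs = 50)

# Elbow plot of the standard deviations of the PCs, to look for the "elbow".
ElbowPlot(sobj, ndims = 50)

# The stdev of each PC is tied to the variance it carries; note this is only
# relative to the 50 computed PCs, since RunPCA is an approximation.
pct <- Stdev(sobj, reduction = "pca")^2 / sum(Stdev(sobj, reduction = "pca")^2)
cumsum(pct)[1:10]

# The output message discussed just above: the genes contributing most
# (positively and negatively) to each PC.
print(sobj[["pca"]], dims = 1:5, nfeatures = 10)
```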
These printed values come from the linear coefficients that were computed to describe in which direction each new axis goes. If we go back to our easy example: there we had only two genes and two dimensions, and we know that the first principal component goes 2 times in the direction of E1 and 1.7 times in the direction of E2. So we know it goes a bit more in the direction of gene 1 than of gene 2, but almost equally. Now imagine that your first principal component goes a lot in the direction of a certain gene and much less in the direction of the other genes; then you would know that what is mostly represented by that principal component is the gene with the very high coefficient — and this coefficient is what we call the loading of that gene. These coefficients form what we call a rotation matrix, which is simply how you pass from the original axis system to the new axis system, and understanding these parameters is how you understand what the principal components represent.

What I want you to understand is this: you will reduce dimensions, so you will only keep the directions where you have most of the variance, and you will therefore have lost some information. RunPCA gives you the contribution of the different genes to each of the PCs, so you can reason as follows: if you would like to distinguish, for instance, T cells from B cells very well, and you know a gene that definitely marks T cells, and this output message tells you it only shows up in PC number five, then maybe you should also include PCs four and five in order to capture this difference and separate your T cells properly from the other cell types. So this message is quite important, and it is something we will also practice in the exercises.

Here is just what I said out loud before: I would like you to understand that the PCs, the principal components, are linear combinations of the original axes. The estimated parameters of this linear combination are known, and therefore we can know which genes are positively or negatively related to each PC. Your original axes are your genes — gene 1, gene 2, gene 3, up to gene 12,000, for instance — and each new axis is a combination of those original axes: a1 times the direction of gene 1, plus a2 times gene 2, plus a3 times gene 3 for the first axis, then the same thing for the second axis, et cetera. The ai's with the highest positive and most negative values correspond to the genes that are most strongly represented by that axis. By default, I think it might be 30 and not 10 genes — I will have to check again — whose most positive and most negative loadings are displayed in R by the Seurat package, but you can make that number bigger, and you can also access those genes directly if you want. And one remark I really want to stress here is that scaling is super important for principal component analysis.
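As a small hedged demonstration of that point — toy data again, not the course dataset — you can see with base R what happens when one gene sits on a much larger scale than another:

```r
# The same underlying signal, but gene_b is measured on a ~100x larger scale.
set.seed(1)
gene_a <- rnorm(100, mean = 5, sd = 1)
gene_b <- 100 * gene_a + rnorm(100, sd = 30)
toy    <- cbind(gene_a, gene_b)

prcomp(toy, scale. = FALSE)$rotation  # PC1 is almost entirely gene_b (largest raw variance)
prcomp(toy, scale. = TRUE)$rotation   # after scaling, both genes load roughly equally on PC1
```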
Indeed, if you did not scale, some genes would be on a much bigger scale and would therefore have a much higher variance, and those would be the genes contributing most to the different axes. What would dominate the PCA would mostly be the genes that are most highly expressed, rather than the genes that have the highest variance once scaled. So it is quite important to scale the data before going to PCA — but this is what we already practiced yesterday.

Now, what PCA does and what it does not do. It is a linear method, which is nice because you can really interpret what it is doing. The top principal components contain the most variance from the data, and they can be used for filtering; but there are several ways of choosing them. Some people prefer to take all the PCs that explain at least one percent of the variance. Some use methods that give you p-values — there is a method called JackStraw that gives a p-value for how important each principal component is, and some people use that. Some people always use the first five to ten PCs, and some people use the elbow plot I mentioned before. There are also packages that let you correlate the principal components with the metadata, so that you might include all the PCs up to the point where the metadata information is covered; that might also be something you want to use for representation.

What is important to understand is that there is a little problem: in single-cell RNA-seq, the first two principal components often account for only a few percent of the total variance. This is very different from bulk RNA-seq, where the first two principal components usually account for enough variance to give you a good visualization of your data. Here that is not the case, so visualizing with only two principal components is not enough to understand the variability in your single-cell RNA-seq data. That is why we actually want to use a second dimensionality reduction method after PCA. PCA is quite powerful at removing what is correlated in the dataset — genes that are highly correlated with each other are collapsed by PCA — and it is quite powerful at keeping only the directions with the most variance, so the most information, thereby also reducing the complexity of your dataset. However, using only two PCs is not enough to represent your data well.

I guess that was enough on PCA, so I'm happy to take questions at any moment; feel free to interrupt or to ask. The next question is the following — should I enlarge this again? Because I see 20 people. This is about seeing whether you understood what I said, and I'm happy to repeat if the answer is wrong, so feel free to put the wrong answer too; it's anonymous. I see 27 people now. The idea is to understand what these genes are that are associated positively and negatively with PC1 — the ones output by the Seurat method. Are they correlation scores between the PCs and the gene expression? Are they the genes with the highest and lowest values in the rotation matrix? Or are they genes found to be differentially expressed, positively and negatively, between PC1 and all the other principal components? So: differential expression, rotation matrix, or correlation?
A few more seconds... Okay, perfect. So it's correct: most of you understood that it is the rotation matrix, the matrix that lets you pass from the first set of axes to the new set of axes. The correlation score answer was wrong — although it is true that sometimes the PC coordinates and the gene expression are correlated, that is not what is calculated. And there is no differential expression calculated between PC1 and the other PCs, so there is no such thing as a p-value being given; it is just a matter of loadings. I think the next question is about integration — yes, okay, so I go back to my presentation.

That was PCA, and now I will tell you something about t-SNE, since some of you answered that you are used to seeing t-SNE plots, and just so that you understand what it is all about. These were the two scientists involved in describing this algorithm, the t-distributed stochastic neighbor embedding algorithm, with their original paper, which I really find difficult to understand. And here you have a YouTube video from StatQuest with Josh Starmer — a YouTube channel where he describes statistical questions his followers have — in which he explains how t-SNE works; you will see that some of the pictures I show next are inspired by what he describes there.

The idea of t-SNE is quite simple, and it is non-linear this time, so it is quite a different approach; it is model-based, as you will see. You start with a dataset — here I again show a two-dimensional dataset — and what you want to obtain is a projection in a reduced dimension that still keeps the structure of the clusters you have in your original dataset. What it does is randomly project the points into the lower dimension — and since the projection is random, it is quite important to set a seed — and then, little by little, it moves the points closer to the points they were close to in the original dataset. That is the general idea. What is important to know is what it means to be "close" in the original space, and this is what I will describe now.

Closeness in the original space is based on a distribution, and the way you put a distribution on the points is as follows. You take a point and you put a normal (Gaussian) distribution around it: the mean of this distribution is the point itself, and the variance is given by what we call sigma, which depends on the point — on the density of points around it. Having a normal distribution means that most of the neighbors should fall in the red circle, a little fewer in the blue circle, even fewer in the green circle, and points in the white part here are very unlikely to be neighbors. What I mean by that is that you calculate a probability of being a neighbor of this point: it is very likely if you are in the red circle, less likely in the blue circle, even less likely in the green circle, and really unlikely if you are in the white part. Concretely, you calculate something called the similarity of data point b to data point a.
This similarity score is a probability: the probability that a would pick b as its neighbor if neighbors were picked in proportion to a Gaussian distribution centered at a, with a certain variance. So it really is a probability that you calculate for two points to pick each other as neighbors. If we go back to the example: the blue point is very unlikely to have the red point as its neighbor, because most of its neighbors should be in the red circle, a little fewer in the blue circle, a little fewer in the green circle, and whatever is outside of this range is very unlikely to be one of its neighbors. And, as I said, there is this variance to consider: you have a Gaussian — a normal — distribution around the point, the point being the mean, and the variance is calculated from the density around that point: the more cells close to it, the lower the variance you use in this normal distribution.

So, to repeat: you take a point a, you take another point b, you place the normal distribution around a, with a being the mean and with a certain variance, and you understand how b relates to a by looking at where it falls in this normal distribution. These two, for instance, are very likely to pick each other as neighbors. This is called the unscaled similarity, because you then do a scaling such that the similarities add up to one. Importantly, there is also a little trick: if you calculate the similarity of a to b and the similarity of b to a, you might get different results, so you correct for that by taking the average of the two values.

Most of what I just said is not so important — it is the mathematics behind it. What you need to understand is that it is based on an assumption of a distribution around the points, and what you get at the end is a table like this one, with a score of how close the points are in your original dataset. That is the picture here: the blue points will pick each other as neighbors in a normal-distribution way, the red points will pick each other as neighbors in a normal-distribution way, and the orange points likewise.

Now you randomly project your points into the lower dimension and you calculate, in a similar way, a similar set of scores of who is a neighbor of whom. And then you see that you did something wrong here: these two have a very low similarity, so they are not picking each other as neighbors, whereas in the big dimension they do. So you know you need to move these two closer together, and in each iteration t-SNE moves the points closer to how they should be in the big dimension, trying to reach a picture that is as close as possible to what you have in the big dimension. How it does that — moving the points as close as possible in each iteration — is through an optimization procedure, and the function it tries to optimize is the following.
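The formula itself is not in the transcript, but for reference — these are the standard definitions from the original t-SNE paper (van der Maaten & Hinton), which match the verbal description that follows:

```latex
% High-dimensional similarity of x_j to x_i (Gaussian centered at x_i):
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \quad \text{(the averaging step described above)}

% Low-dimensional similarity (heavier-tailed Student-t, hence the "t" in t-SNE):
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Cost that the iterations minimize (Kullback--Leibler divergence):
\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```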
In words: it takes the similarities in the big dimension and the similarities in the low dimension, divides one by the other, and adds everything up, and this is what it tries to minimize. There are parameters for t-SNE, and the one you should worry about is this one: the perplexity. The perplexity is a sort of number of neighbors used to calculate the density around the points — so to set the sigma parameter of the Gaussian distribution around each point. This is the important parameter, and it is where you can play if you want to change the picture you get.

This is an interesting web page because it illustrates how the perplexity works and what t-SNE is doing; it also shows the formula for the perplexity, which you can forget about if you are not interested. What I want you to see is how the perplexity behaves. Perplexity is really about the number of points you consider to be neighbors when estimating the spread of the Gaussian — the variance — around each point. If you take a value that is too low, then nothing is considered a neighbor of anything, and you get a picture that falsely represents what you have in your original dataset. If you take a number that is too big, then everything is a neighbor of everything, and you again get a somewhat wrong picture. In between, you get the picture quite correctly — yellow points there, blue points there. Here, two would be too low and a hundred too high; with a perplexity of 30 or 50 you get quite a nice picture of what was in the original dataset, even though at low values you still see some smaller clusters appear. So it works quite nicely.

What I want you to understand is this: say your original dataset has 100 cells. Then with a perplexity of 100 you get a completely false picture, and with a perplexity of 2 you also get a completely wrong picture; with a perplexity in between you are in a quite good range. The default is 30, and you usually do not have to change it, because it works quite nicely to separate clusters. You would have to change it if you subset your cells and end up with, say, only 50 cells — with 50 cells, keeping a perplexity of 30 is again the wrong thing to do.

One more important thing about t-SNE: distances between clusters do not matter, and this is very important to understand. They are not included in the calculation of the optimization procedure, and since they are not included in that calculation, they are not recapitulated in the lower dimension. Here is a picture with three clusters, two of which should be a little closer to each other — that is the original dataset — and you can see that with a perplexity of 50 you might see it a little, but with a perplexity of 30 everything is equidistant, so you would not figure it out. And then, on their web page, they took exactly the same clusters but simply doubled the number of cells in each cluster, and now you can see that even with a perplexity of 100 you do not recapitulate the picture of the blue and yellow points being closer together.
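In Seurat, a hedged minimal call could look like this (the object name `sobj` and the number of PCA dimensions are assumptions; `perplexity` is forwarded to the underlying Rtsne implementation):

```r
sobj <- RunTSNE(sobj,
                dims       = 1:30,  # run t-SNE on the PCA dimensions kept earlier
                perplexity = 30,    # roughly: how many points count as neighbors
                seed.use   = 42)    # the initial projection is random, so fix a seed
DimPlot(sobj, reduction = "tsne")
```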
So, that was t-SNE: it is based on an assumption about the distribution, and that is what is important to remember. Now UMAP is the other method — and I think I still have a little bit of time, yes. UMAP is the one that is becoming most popular nowadays, and by the end of this presentation you will probably see why. It stands for uniform manifold approximation and projection. It is non-linear again, and it is a graph-based method — graph-based, not based on an assumption about the underlying distribution as t-SNE is. It is quite efficient and it can use several different distance metrics, which is also nice. But what is really nice about UMAP — spoiler alert — is that it considers both local and global distances, which means that you get clusters that are right, but also distances between the clusters that actually mean something. And importantly, if you add new data points, the picture does not change: the new points are simply added. That is not true of t-SNE, because adding new data points changes the density around points, hence the sigma of each point, and hence the picture.

UMAP was defined by these people — a mathematician, a computer theorist and someone doing computing in the arts — who came together to write this article. It is quite a complex article, but the mathematician gave a talk at a conference, here at this YouTube link, which I think is quite understandable. And since the talk was popular and understandable, he wrote down what he says in it in an article where he tries to describe the same thing, which is also quite nice and readable.

So UMAP works with a graph-based approach, and the graph it tries to generate is a higher-order graph that we call a simplicial complex. To understand what a simplicial complex is, you just need to look at this picture. A 0-simplex is just a point; a 1-simplex is two points and the link between them; a 2-simplex is a triangle; a 3-simplex is a tetrahedron; and you can go on to higher orders — 4-simplices, 5-simplices — by making links between points in those dimensions. A simplicial complex is then a way of linking all the points you have in your dataset such that links between three points are filled in with a triangle, links between four points are filled in with a tetrahedron, and so on. In computational terms this is combinatorial: you only have to represent the points and the links, so it is very easy to implement — well, "very easy", I probably could not implement it myself — and it keeps the information about the global structure, because you have an understanding of what should be linked together in a simple way. What is nicest about it, for me as a mathematician, is that there are nice theorems which prove what kind of structure it preserves from the original space; one of them is called the nerve theorem.

Now the question is how you would build a simplicial complex on top of the cells that you have in your original data space.
This is how it works. It assumes an underlying object — here the underlying object is this wavy structure — and it draws unit balls, with a metric that you can choose, around the points. Whenever two balls cross each other, you put a link between those two points; whenever three balls cross each other, you put a triangle between those points; whenever four cross, and so on. This is how you summarize your data with a graph-like structure, and it is this graph-like structure that you then use to decide how to project the points into the lower dimension.

An assumption of the method — of summarizing the data with a simplicial complex instead of taking the points themselves — is that the data is uniformly distributed on the underlying object; the underlying object here would be the wavy curve. To be able to recover the structure of your original space from the graph alone, you actually need a uniform distribution of the points on the original, or underlying, manifold. That is not the case, because data is never so nicely distributed and you do not have infinite data. So they came up with a solution to this problem, which is to vary the notion of metric and, in that sense, still be able to recover the underlying manifold and the distribution of the points. They do this with what they call fuzzy topology: in regions that are less dense, they change the notion of radius — they no longer use a unit ball but a radius that varies, with a certain range of certainty — and with that they decide how to build the simplicial complex. As you can see, there are regions that are colored darker and regions that are more transparent: if balls cross in a region that is more transparent, there is less certainty that there should be a link, and if the darker, denser parts of the balls cross, you are more certain of the link between the points. With that they generate an idea of a simplicial complex — here it is not exactly a simplicial complex but a directed graph, because you have links between points with certain weights. You can see, for instance, that the weight here is a little lower than the weight there, simply because this ball crosses that one only with its more transparent part. They then solve the problem of having a directed graph rather than a simplicial complex by using this formula here — it is not important to understand it; just understand that at the end they obtain a notion of a graph, a simplicial complex, with links between the points and a weight on each link, and with that they understand what the original space looks like.

To use the theorem they rely on, they actually need a second assumption, which is that the manifold is locally connected. In practice, this means that you cannot have isolated points in your dataset; that will not happen.
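The "formula here" the speaker points to is not reproduced in the transcript; for reference — my reading of the UMAP paper (McInnes et al.), not the slide itself — the directed edge weights and the step that turns them into a single undirected fuzzy graph are usually written as:

```latex
% Directed ("fuzzy") weight from point x_i to point x_j:
w_{i \to j} = \exp\!\left(-\,\frac{\max\bigl(0,\; d(x_i, x_j) - \rho_i\bigr)}{\sigma_i}\right)
% where \rho_i is the distance from x_i to its nearest neighbor and \sigma_i is
% chosen from the local density (the varying radius described above).

% Symmetrization into one undirected weight (a probabilistic union of the two directions):
w_{ij} = w_{i \to j} + w_{j \to i} - w_{i \to j}\, w_{j \to i}
```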
What they do afterwards is this: now that they have a notion of what the manifold looks like in the big dimension, they want to know how to project the points into something humanly graspable, like two dimensions. And they do exactly the same thing as in t-SNE: they randomly project the points into two dimensions, they use the fuzzy topology to compute the simplicial complex in the two dimensions, and then they measure how this relates to the big picture. For that they calculate this formula, this time, to understand how close the low dimension is to the big dimension — maybe you can see that part of it is exactly the same as, or very similar to, what we saw in t-SNE — but they have added something else, which is there to get the gaps right as well, so the distances between the points.

Here is the summary: the first phase consists of constructing this kind of graph in the big dimension; the second phase is an optimization procedure to make the low-dimensional representation — the low-dimensional graph — as close as possible to the one in the big dimension. This is done with a formula called the cross-entropy, in case you want to know, and it enables you to get the clumps right but also to get the gaps right. If we go back to this picture, what we end up with is that picture: projecting this two-dimensional dataset into two dimensions. As you can see, it does not obviously resemble what we had in the big dimension — it is not like PCA, which is linear and where you can understand exactly what you did; that is not the case anymore — but you know that the representation should get the clusters right and also the distances between the clusters.

In terms of parameters there is one that is very important, the neighborhood parameter (n_neighbors), which plays a role similar to the perplexity in t-SNE: it controls how the neighbors are determined when building the local metric, the metric in the low dimension, and it is where you can play if you want to change the output of this dimensionality reduction a little. The rest is not so important. Maybe min_dist is also something you might want to change: it is really a cosmetic parameter — if you have too many points on top of each other, you can make it a little larger and the points will spread a little further apart — but it has nothing to do with the dimensionality reduction itself.

Now I will show you some pictures that compare t-SNE and UMAP, so that you get a sense of how they differ. They have several datasets on which they compare the performance of t-SNE and UMAP, and one thing that is very important to see is that UMAP is much faster; that is something to know. And here is how the results look. With PCA — here you have handwritten digits, so people had to write zeros, ones, twos, and so on, and what you would like is for the zeros to cluster together, the ones together, the twos together, et cetera — you can see that the ones are quite correctly identified as a cluster.
Maybe the zeros as well, somehow, but the rest is quite blurred. t-SNE gets the clusters quite right, as you can see: a very nice cluster of zeros here, a very nice cluster of ones here, et cetera. But the distances between the clusters are all similar, so there is nothing you can read from them. If you look at UMAP, the picture is quite different: you can see that the zeros are very distinct — maybe closest to the sixes — and then you have a cluster, or sub-cluster, forming of — and this is maybe not exactly right — fives, eights and threes; maybe people sometimes write a three very much like a five, I don't know. And then you have other clusters forming.

It is perhaps more intuitive in this example, where they have pictures of fashion items: bags, shirts, sandals, coats, and so on. You can see that PCA does perform in the sense that the trousers are maybe a bit further apart, and the sneakers — sneakers and sandals together — can be distinguished reasonably well, but everything is a little blurred in the middle. t-SNE does not function badly: it groups the bags together, it groups the trousers and the t-shirts, et cetera, but the distances between the clusters are again not representative. Whereas if we look at UMAP, we can see a very distinct cluster forming with everything related to feet — boots, sneakers and sandals; here everything related to shirts, I think, because you have coats, dresses, shirts and tops together; the bags are separate; and at the bottom you have the trousers. So this is something different: the clusters here are more meaningful, and the distances between the clusters are more meaningful.

As I said, in Seurat these are the three functions you will have to use — RunPCA, RunTSNE and RunUMAP — and this is something we can then practice. I think that was it for what I wanted to say about dimensionality reduction methods. It is quite complex, but I just wanted to give you an intuition. At least, what I want you to remember is this: using only PCA is not enough, because the first two components will represent only a few percent of the variance, so I do not want you to visualize your data with PCA alone. I want you to understand that t-SNE is based on a distribution — that is important — and that its optimization procedure only makes you get the clusters right. And UMAP is based on a geometric assumption; it is a graph-based procedure, the graph it produces in the low dimension is supposed to look like the graph in the high dimension, and the optimization procedure tries to optimize two things at the same time: to get the clusters right and to get the gaps right. That is what I want you to remember.

So that's it; I think we are quite on schedule. Do you have any questions? No questions? Ah, in the chat — Roxanne, yes. "Thank you very much for the explanation, it was very good. What I didn't understand yet is whether you use all of them on one dataset and then look at where you get the most information, because from what I understood from you, mainly UMAP is the best — so why would you use the other two?" Yes. So it is important to know that UMAP and t-SNE are not interpretable.
They just take the big dimension and reduce it into the low dimension, trying to keep as much of the information of the big dimension as possible. So you really need to help t-SNE and UMAP in the best way possible, and that is why you would always do PCA first, before running t-SNE or UMAP. This is important to keep in mind: we always run PCA first, because it reduces the dimensions in a way that keeps only the relevant information and removes the redundancy in your dataset. That is not the case for UMAP and t-SNE, which just take everything and push it into the lower dimension; that is why it is important to run PCA first.

Now, t-SNE and UMAP do function in different ways, because UMAP has a geometric assumption and t-SNE a distribution assumption. So if, say, your points really are nicely normally distributed, then t-SNE will work just fine — and as we could see in the pictures, t-SNE did work just fine to get the clusters right. In the end it is a matter of visual representation: everything you do here is to get a nice picture for your paper, or a good way to look at your data. It is not what will allow you to draw any conclusions — that is what we will do with clustering, for instance, or with differential expression. So in the end, if you are happy with the picture you get, you can stop there. I don't know if that answered the question. "Yes."
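To tie the answer together, here is a hedged end-to-end sketch of that order of operations in Seurat (the object name `sobj` and the choice of 30 PCs are illustrative, not prescriptive):

```r
sobj <- ScaleData(sobj)                        # scaling matters for PCA (see above)
sobj <- RunPCA(sobj, features = VariableFeatures(sobj))
ElbowPlot(sobj, ndims = 50)                    # decide how many PCs to carry forward

sobj <- RunUMAP(sobj, dims = 1:30,
                n.neighbors = 30,              # plays a role analogous to t-SNE's perplexity
                min.dist    = 0.3)             # cosmetic spread of the points
sobj <- RunTSNE(sobj, dims = 1:30)

DimPlot(sobj, reduction = "umap")
DimPlot(sobj, reduction = "tsne")
```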