Okay. So now that you have generated a nice picture, you might wonder what the clusters in your data actually are. And this is the next step. The idea of clustering is that you start with your single cells, and what you would like to gain at the end is an understanding of which populations you have. At which granularity you define those populations is up to you to decide: there is no such thing as correct or wrong in terms of how refined or how in-depth you want to get.

The idea is that cells of the same cell type should also have quite similar gene expression. And having similar gene expression means that, once you use a clustering algorithm that groups together similar patterns of expression, you should be able to tell what a cell population is. So you start with your points, and you want to be able to determine, in some way, the different clusters in your data set.

According to the handbook of clustering, there are two different types of clustering. Either you do partitioning clustering, where you say this is cluster one, this is cluster two. Or you go for hierarchical clustering algorithms, where you either start with everything separate and merge groups together until you only have one group, or the other way around: you start with everything connected and split it apart until you only have leaves. These are the two big categories, and partitioning clustering is what we will aim for here: being able to say these cells are cluster one, and therefore cell type one.

Within partitioning clustering there are several different approaches. There are convex methods, which give you a convex partitioning of your data; k-means is a popular example. Then there are density-based approaches, of which DBSCAN is a popular example. Then there are model-based approaches, where you assume an underlying distribution of your data; mclust is an example. And the last category, which is the most popular one for single-cell RNA-seq, is graph-based approaches, where you generate a graph on top of your data and figure out how to cut that graph.

The idea of any graph-based approach is that the nodes are your cells and the edges between cells represent a similarity measure, and this can really differ between the graph-based approaches that exist. Two quite popular graph-based approaches, which are part of Seurat and which you can build with FindNeighbors, are the KNN graph and the SNN graph. The KNN graph, which stands for k-nearest-neighbor graph, is a graph in which two vertices p and q are connected by an edge if the distance between p and q is among the k smallest distances from p to the other points in the data set, meaning that q is among the k nearest neighbors of p. So k is something you choose, for instance 3, and you would connect p to its three nearest neighbors; this is how you connect the points. The shared-nearest-neighbor graph, the SNN graph, is similar, but it is a graph in which edge weights define the proximity or similarity between two nodes in terms of the number of neighbors they have in common. So if p and q have three neighbors in common, then the weight of the edge between p and q is 3. That's the idea of the SNN graph.
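To make these two definitions concrete, here is a minimal base-R sketch (toy data, all variable names hypothetical) that builds a KNN adjacency matrix and then derives SNN weights by counting shared neighbors:

```r
set.seed(1)
pts <- matrix(rnorm(20), ncol = 2)  # 10 toy "cells" in 2D
k   <- 3

d <- as.matrix(dist(pts))           # pairwise Euclidean distances
diag(d) <- Inf                      # a point is not its own neighbor

# KNN adjacency: connect each point to its k nearest neighbors
knn <- t(apply(d, 1, function(row) rank(row) <= k)) * 1

# SNN weights: entry (p, q) counts the neighbors p and q share
snn <- knn %*% t(knn)
diag(snn) <- 0
snn[1:5, 1:5]                       # a weight of 3 = three shared neighbors
```

Note that the KNN relation is not symmetric (q can be among p's nearest neighbors without the reverse holding), which is one reason the symmetric SNN weights are often preferred for community detection.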
And as I said, in Seurat, FindNeighbors is the function that generates these graphs. Graph-based approaches work with either an adjacency matrix or a weight matrix. The adjacency matrix just records whether there is an edge between two points: for instance, a and b are connected, so you have a 1; a and c are connected, so you have a 1; but there is no direct link between a and d, so there you have a 0. Other methods work with the weight matrix, where on top of knowing whether there is a link between two points, you also have a weight associated with it. There you can see, for instance, that the edge between a and b has a higher weight than the edge between b and d, and this is what you report in the weight matrix.

But for all of them, the question remains how you go from having a graph to having communities, or clusters. There are quite a few community detection algorithms, and most of them work by optimizing a measure that favors having more edges inside a group than edges linking nodes of the group to the outside. So here, for instance, you would like to have more links inside that red group than links to cells in other groups. This is how you detect communities: groups of nodes that have a higher probability of being connected to each other than to members of other groups.

But how do you do that, and how do you then separate the clusters? One possibility would be to choose the smallest cut, meaning you remove the fewest edges possible, and then you have one cluster and a second cluster. But this is usually not the best cut in terms of biology. So this is usually not how graph-based clustering algorithms work: they don't search for the smallest cut, they search for the best cut. And defining what the best cut in a data set is, is quite difficult; there are many methods that work in different ways to find this best cut. Some are listed here, but I would again point you to the single-cell tools database: you can see that for clustering there are roughly 250 methods that exist right now. You can go and have a look at those; maybe there is one that fits your data perfectly.

In terms of Seurat, as I said, the first thing you do is construct the KNN graph and the SNN graph, based on the Euclidean distance in PCA space. So you first project the points into PCA space, and in PCA space you calculate Euclidean distances, which tell you which points are neighbors of which. With that you can construct the SNN graph, where you link together the points that share neighbors, and the weights are defined by how many neighbors they share.
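In code, this graph construction is a single call. A minimal sketch, assuming a Seurat object called seu that has already been normalized, scaled, and run through RunPCA (the parameter values are illustrative, not a recommendation):

```r
library(Seurat)

seu <- FindNeighbors(
  seu,
  reduction = "pca",  # neighbors are found in PCA space
  dims      = 1:30,   # which principal components to use
  k.param   = 20      # the k of the KNN graph
)

# Both graphs are stored in the object, e.g. for the RNA assay:
names(seu@graphs)     # "RNA_nn" (KNN) and "RNA_snn" (SNN weights)
```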
And then an optimization procedure is used, as always. An optimization procedure means you have a function that you try to optimize, and here that function is the modularity function (in Seurat this optimization is done by default with the Louvain algorithm). The modularity function is basically a cost function that determines how well the cut was made. Associated with the modularity function is a parameter called resolution, and resolution basically tells you how refined you want to be when generating your clusters. A higher resolution gives you a much more refined view, so you will have many more clusters, and a lower resolution gives you fewer clusters. It is then up to you to decide what is best for your data set: do you only want broad knowledge, like these are the B cells and these are the T cells, or do you want a more refined view? If you need the exact formulas for this modularity function, you can have a look at the papers here.

And here is the way to do it in R with Seurat, with the function called FindClusters, which has the resolution parameter that you can choose. By default I think it's 0.8, if I remember correctly; maybe I'm wrong, we would need to check that. But there is a default resolution, and you can also try many different resolutions and get an idea of how the clusters change across resolutions.

There is a benchmarking paper that looked at different clustering algorithms and how they perform on several different data sets. They use the adjusted Rand index, which is one if two clusterings are exactly the same and close to zero if they agree no better than chance. You can see that, depending on the number of genes that were chosen and depending on the method, they perform similarly or very differently. The methods with the highest adjusted Rand index across all the tested data sets include Seurat, which does function quite well, and Monocle, which we will maybe discuss tomorrow, also performs quite well. Seurat also seems to be less dependent on which genes were filtered. You can have a look at the full paper, where they describe several measurements of how good clustering algorithms are. TSCAN, for instance, is a method that works with model-based clustering.

So in terms of clustering, there are quite a few challenges to take into account. You need to answer for yourself: what is a cell type for you, in your data set? Do you need to go more in depth, do you want the broad picture, or do you want both? You should also ask yourself what number of clusters you want to obtain, which goes back to the same question of how refined you want to get. Do I want five major cell types, B cells, T cells, et cetera, or do I want to go more in depth? This will help guide your choice of the resolution parameter, for instance, and will help guide the clustering. You should also check the QC metrics after clustering, to see whether the cells in your clusters all look fine. For instance, what I usually do is start with major clusters, say B cells and T cells, and then subselect only the B cells and check whether there are more clusters in there, by redoing all the steps from ScaleData to RunPCA to RunUMAP. That way I can tell whether the cluster I visually saw on the UMAP as a single cluster is actually several clusters once I go to a more refined view. This is something you could do as well.
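As a minimal sketch of both points, scanning resolutions and then subclustering one population, assuming the seu object from before (the resolution values and the "B cells" identity label are hypothetical):

```r
# Scan several resolutions; each run stores its labels in meta.data
for (res in c(0.2, 0.5, 0.8, 1.2)) {
  seu <- FindClusters(seu, resolution = res)
}
table(seu$RNA_snn_res.0.2)  # few, broad clusters
table(seu$RNA_snn_res.1.2)  # many, refined clusters

# Subcluster: take one major population and redo the steps
# (for simplicity, variable features are reused from the full object)
bcells <- subset(seu, idents = "B cells")
bcells <- ScaleData(bcells)
bcells <- RunPCA(bcells, npcs = 30)
bcells <- RunUMAP(bcells, dims = 1:30)
bcells <- FindNeighbors(bcells, dims = 1:30)
bcells <- FindClusters(bcells, resolution = 0.5)
DimPlot(bcells)             # was the "one" B-cell cluster really one?
```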
And then I really want to say that clustering is subjective; there is no ground truth. There is no clustering that is better than another one, or one clustering algorithm that is wrong while the others are correct. It depends on what you want to see and what you then want to do with it. You should always be aware that the clusters you get are just one way of partitioning your data. This doesn't mean that you don't have any wrong labels, and it doesn't mean that you have the perfect labels. There is no such thing as a ground truth, because in the end you don't know which are the true T cells and which are the true B cells in your data; you can only have an idea. So the idea, once you have the clusters, is that you have a way to justify biologically what you see there. I would urge you to generate, for instance, a justification DotPlot or FeaturePlot that shows what it is we see in the partitioning you got: are these T cells, are these B cells? And this is what we will discuss afterwards, once we get to the annotation of the clusters.

Some clustering algorithms have been shown not to be stable, meaning that if you had only half of the data, you would get very different clusters. This is something you could also play with if you have time: reduce the data set to only half of the cells and see if you still get the same clusters (see the sketch at the end of this section). You could also take only one sample, annotate that one sample, and see how different this annotation is from the one on the full data set. Then you have an understanding of how stable the clusters you got are. Yes, this last point is the same as the previous one. Also note that single-cell experiments have grown a lot, and in the last years the scale has exploded. Some clustering algorithms worked very well in the early days of sequencing but are not scalable at all. So when you go for a clustering algorithm, it is important to know whether it can be used at the order of magnitude of cells you are at now. And this is how I would go about it. That's already it for clustering algorithms.

So, what is the optimal number of clusters in a data set? Option one: the unsupervised clustering, FindClusters, will determine it, I don't have to choose. Option two: it depends on the resolution parameter; one resolution parameter gives me the optimal number of clusters. Or option three: there is no ground truth, clustering is subjective, so there is no such thing as an optimal number of clusters. Yes, most of you understood the message: it is up to you to decide when to stop and what is optimal for your setting, so there is no such thing as correct or wrong. "It depends on the resolution parameter" is also correct, in the sense that each resolution gives you a number of clusters for that resolution; but this doesn't mean that this is the optimal number of clusters in your data set, it is just the number of clusters at that resolution. And the first option was also partly correct: FindClusters will determine a number of clusters, that's true, but this doesn't mean those are the optimal clusters for you. It really depends on the data set and on the message you want to give.
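Coming back to the stability point above, here is a minimal sketch of such a check: subsample half the cells, recluster, and compare the labels with the adjusted Rand index via mclust::adjustedRandIndex (object names and parameter values are hypothetical):

```r
# Subsample half the cells and redo the clustering steps
half <- sample(colnames(seu), floor(ncol(seu) / 2))
sub  <- subset(seu, cells = half)
sub  <- ScaleData(sub)
sub  <- RunPCA(sub, npcs = 30)
sub  <- FindNeighbors(sub, dims = 1:30)
sub  <- FindClusters(sub, resolution = 0.8)

# ARI near 1: the subsampled clustering matches the full one (stable);
# near 0: agreement no better than chance (unstable)
mclust::adjustedRandIndex(
  as.character(Idents(seu)[colnames(sub)]),
  as.character(Idents(sub))
)
```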