All right. So after clustering, we found that some cells are quite similar to each other and form distinct groups. And now we want to know what these cell types actually are, so we will start looking at cell type annotation. Before doing that, we ask ourselves: what is a cell type? The cell is considered to be the fundamental unit of life, and at the beginning a cell type was defined in terms of function, location, tissue type or cell morphology. So here we have an example image of a C. elegans where the nuclei of all of the cells have been stained. We can see that the organism has different organs, and of course, within each organ, the different cells perform different functions. Based on their location in the body and their function, we can define different cell types. So the initial definitions were in terms of function and location, but you can of course extend them to other features, such as the presence or absence of cell surface markers. For example, if we have one cell type, say immune cells that circulate in the blood, we can have several subtypes within this immune cell type depending on the presence or absence of surface markers. Then you can extend this definition to gene expression: based on the full molecular profile within a type of cell, you can define additional subtypes from their gene expression patterns. So for the moment, a cell type is really defined within a biological system, and different biological systems will define cell types differently. You can define them based on cell cycle phase, for example, or on whether they are differentiating or not, or migrating or not. For example, in tumor tissue, some cells will stay in the tumor and others will mutate and start to spread and metastasize. So you can even have different cell types within a tumor cell type.
So the answer to why we should identify cell types is, of course, that samples are heterogeneous. Remember the image of a smoothie we had at the beginning, where a tissue is a mix of different cell types. So when we process a sample through single cell RNA-seq, we first want to see which cell types we were able to recover and what they are. We want to find the different cell types in our heterogeneous sample. Another question could be to profile cells from healthy and tumor samples and see how much the tumor cells differ from normal cell types: if you have tumor cells, how does their gene expression pattern differ? For that you have to identify these different cell types. Another goal of identifying cell types would be to find new cell types that have not been analyzed before. Some of you may be familiar with immunology, where cells are analyzed a lot with flow cytometry using surface marker proteins. When you analyze with flow cytometry, you are limited to a small number of surface markers, so it is possible that a cell type present in your sample is missed because you did not stain for the particular surface marker that is expressed in that cell type. With single cell RNA-seq, you process the whole mix of cells, so this otherwise missed cell type will be present in your data. It also allows you to follow cell fate and determine cell differentiation mechanisms. For example, in a time course experiment, at each time point you want to re-identify the same cell types and see how they evolved. So this is why it's important to identify what your cells are. Finally, something that is quite fashionable now is using single cell RNA-seq data to determine how cells communicate with each other. We will probably see in tomorrow's lecture some ideas about cell-cell communication tools, where we determine which cells express, for example, ligands and which other cells express the corresponding receptor.
And these two cell types might be able to communicate. Finally, we can of course try to compare the abundance of cell types in different conditions, and so on. Probably all of you attending this course have a good idea of your biological systems and know which cell types you want to identify. So here is an example of changing abundances in two different conditions. Here we have some lymphatic endothelial cells that come from wild-type mice. And then, after knocking out a transcription factor, we see this green population of cells appearing that is nearly absent in the wild-type cells. So the idea of using cell type annotation tools here is to identify what this new population is. I would like to give a warning about analyzing differences in abundance in single cell RNA-seq data. The frequencies of the different cell types in your dataset are not always a good image of what you had in your actual sample. In this experiment, we can see that the differences are quite striking, with one population disappearing and another appearing. And using biological replicates, three mice per condition, these differences were conserved across all mice. So we sort of trusted the differences in abundance. Plus, they were validated with flow cytometry. So again, a word of caution about interpreting differences in abundance in single cell RNA-seq data: you always have to validate that your differences are actually true. Now a small note about surface markers. They're often considered to be the gold standard, especially in immunology. These markers are actually proteins that you find on the surface of the cells. You can tag them with antibodies and identify which cells have these surface markers and which don't. And depending on the presence or absence of these surface markers, you classify your cells as belonging to different cell types.
So here we have an example where the CD45 surface marker is expressed in all cells, so probably all of these cells belong to the immune cell type. But within this CD45 population, we have additional surface markers, which define new sub-populations, and so on and so forth. It is important to know that in single cell RNA sequencing, it can happen that surface marker protein expression is not reflected by the level of mRNA that you have in your single cells. So the level of mRNA and the level of protein at the surface are not always correlated. Here we have an example with the CD4 surface marker, where we have a single cell RNA-seq data set and some cells are high for this gene. At the same time, for this data set, the surface proteins were tagged. So remember in Khart's presentation yesterday, he showed you, for example, this hashing method where you can tag a surface protein with an antibody. Basically this is what was done here, so for every cell you have the mRNA level of your gene of interest and also the measured surface protein. And here we see that, even though these cells have the surface protein present, at the mRNA level CD4 is rather low. So to identify cell types based on surface markers, it is important not to take just one gene and call it a day; it's better to use several genes. One option, which we will see tomorrow, is to use a method to find markers of the different clusters, and then take the list of all of the genes that are significantly more expressed in, for example, this cluster or this one, and see if they belong to your cell type of interest or not. You could also use a combination of several genes. For example, if you know that CD4 cells secrete certain proteins, and the corresponding genes are expressed both here and here, then you can more confidently say that these cells are CD4 cells, even though they don't have the CD4 mRNA.
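To make the idea of relying on several genes rather than a single marker concrete, here is a minimal sketch in plain Python. It is only an illustration: the expression values are made up, `GENE_A` and `GENE_B` are hypothetical placeholder names for genes known to accompany the cell type, and the rule of requiring at least two detected genes is an arbitrary assumption.

```python
# Toy, made-up expression values per cell (not from any real data set).
# GENE_A and GENE_B are hypothetical placeholders for additional markers.
cells = {
    "cell_1": {"CD4": 0.0, "GENE_A": 2.1, "GENE_B": 1.4},  # CD4 mRNA dropped out
    "cell_2": {"CD4": 1.2, "GENE_A": 1.8, "GENE_B": 0.0},
    "cell_3": {"CD4": 0.0, "GENE_A": 0.0, "GENE_B": 0.0},
}
panel = ["CD4", "GENE_A", "GENE_B"]


def panel_support(expression, panel, threshold=0.0):
    """Count how many genes of the panel are detected in one cell."""
    return sum(1 for gene in panel if expression.get(gene, 0.0) > threshold)


def call_with_panel(cells, panel, min_genes=2):
    """Call a cell positive when at least `min_genes` panel genes are detected."""
    return {
        name: panel_support(expression, panel) >= min_genes
        for name, expression in cells.items()
    }


calls = call_with_panel(cells, panel)
print(calls)
```

In this toy example `cell_1` is still called positive although its CD4 mRNA is zero, because the two other panel genes are detected.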
So about possible ways to annotate cell types: there are two of them, the manual way and the automatic way. Manual annotation is used very commonly because people know their biological setting very well and know which genes the cells should or should not be expressing. So they have a list of marker genes for the cell types they expect in their sample, and they just look at where these genes are expressed: if one gene is expressed in cluster 0 and not in cluster 1, then I can say my cluster 0 is this cell type. The drawback is that it can be time consuming, because you have to check many genes. And of course, it requires expert knowledge about your system, and it's not always easy to find out exactly which cells should express what. So it can be a bit subjective and inaccurate if you work in a biological setting that you don't know 100%. Then there is the automatic way, and here you actually need a reference. What I mean by reference is that you have an expectation of the cell types that are in your sample. So, for example, in the literature you find a publication of purified cell types that were profiled with bulk RNA-seq or with microarray. And then you use this list of reference cell types and their expression profiles to annotate the unknown cell types in your data set. As I just said, this reference can either be a published bulk RNA-seq data set of sorted cells, or it can even be a published single cell data set. It could be, say, cells coming from a big organ where you're only interested in a subset of these cells; you could manually select these cells in the published single cell RNA-seq data and use them as a reference.
Again, there is a drawback, which is that it can miss some cell types if your sample contains cells that are not present in the reference. So always make sure that the reference is quite broad and includes all the possible cells that you have in your single cell experiment. The way the method works is that every unknown cell in your data set receives a label. This assignment of cell type can be done either per cell or per cluster. We will see this with the method that we use in the exercises, SingleR: you have the choice to predict a cell type label either at the cell level or at the cluster level. And the way it works is that it correlates the different single cell expression profiles to the profiles of your reference. But let's first explore a little bit how manual annotation can work using marker genes. So this is a single cell RNA-seq data set from one patient suffering from a brain glioblastoma. This data set is actually available on the 10x Genomics data website. 10x Genomics has several data sets that you can use to play with. So after this course, if you want to test your code on a data set other than the one we propose, feel free to go to the 10x data set website. You can download the three files that you get after Cell Ranger, so the filtered matrix, the barcodes and the features files, and just follow the Seurat pipeline. So here what we have is different cell types from the brain. For example, we have microglia, where we know that CCL4 should be expressed, and we color our cells according to the expression of this gene. It seems that this group of cells expresses this gene and the rest don't. And here we have another marker, GFAP, which is usually expressed in astrocytes. And here we have another group of cells that is colored.
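The manual reasoning above (CCL4 high in one group of cells, GFAP high in another) can be written down as a small rule. The following is only a sketch in plain Python with made-up values; the cluster numbers, the expression values and the `min_mean` cut-off are assumptions for illustration, and in practice you would judge this by eye on the UMAP or with a proper marker test.

```python
from statistics import mean

# Made-up normalized expression per cell, with the cluster id from clustering.
cells = [
    {"cluster": 0, "CCL4": 2.5, "GFAP": 0.0},
    {"cluster": 0, "CCL4": 1.9, "GFAP": 0.1},
    {"cluster": 1, "CCL4": 0.0, "GFAP": 3.0},
    {"cluster": 1, "CCL4": 0.2, "GFAP": 2.2},
    {"cluster": 2, "CCL4": 0.1, "GFAP": 0.0},
]
# One marker gene per expected cell type, as mentioned in the lecture.
markers = {"microglia": "CCL4", "astrocyte": "GFAP"}


def annotate_clusters(cells, markers, min_mean=1.0):
    """Label each cluster by its best marker, or 'unknown' if no marker is high."""
    by_cluster = {}
    for cell in cells:
        by_cluster.setdefault(cell["cluster"], []).append(cell)

    labels = {}
    for cluster, members in by_cluster.items():
        # Average expression of each marker gene within this cluster.
        means = {
            cell_type: mean(member[gene] for member in members)
            for cell_type, gene in markers.items()
        }
        best = max(means, key=means.get)
        labels[cluster] = best if means[best] >= min_mean else "unknown"
    return labels


labels = annotate_clusters(cells, markers)
print(labels)
```

Cluster 2 expresses neither marker at a meaningful level, so it stays "unknown" and would need further marker genes.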
So we can start to see which potential cell types we have in our data set and start to annotate our cells. Of course, if you use manual annotation, you will need a list of marker genes for every cell type that you expect in your data set, and sometimes these are difficult to find if you don't have previous knowledge. So here are a few databases of cell type markers that you can browse. There is PanglaoDB, for example, which is a collection of curated single cell RNA-seq data from mouse and human, and they compile lists of markers for different cell types. So maybe we can have a look at this website, which is here. For example, let's say we would like a list of markers and we go back to the brain. Here is the home page, and we can go to data sets, cell type markers. We can browse a list of different tissues, and for each tissue you get a list of different cell types. So let's have a look at astrocytes, for example. Here we have a table of gene expression markers that were compiled from a collection of public single cell RNA-seq data. And it's nice because for each gene it lists whether it is a marker in both species, mouse and human. In this case, we see that these are mostly markers for astrocytes in both species, but sometimes, for other cell types, a gene will be a marker only for human or only for mouse. Then you also have some specificity measures, which can be interesting. The specificity measure tells you about expression outside the cell type: of all the cells that are not astrocytes, what proportion express that marker? For GFAP, for example, this seems to be quite low, so it seems to be a marker that's quite specific to astrocytes. This can help you select which markers are the cleanest. For example, ALDOC could also be highly specific for human, but maybe a bit less for mouse, et cetera.
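To make these two marker measures concrete, here is a minimal sketch in plain Python on made-up data. The function and variable names are hypothetical, and this is not PanglaoDB's own computation: it just counts how often a gene is detected inside the cell type of interest (sensitivity) and how often it is detected in all other cells (the off-target fraction, which you want to be low).

```python
# Made-up cells with a known cell type and a GFAP expression value.
cells = [
    {"type": "astrocyte", "GFAP": 2.0},
    {"type": "astrocyte", "GFAP": 1.5},
    {"type": "astrocyte", "GFAP": 0.0},
    {"type": "astrocyte", "GFAP": 3.1},
    {"type": "microglia", "GFAP": 0.0},
    {"type": "microglia", "GFAP": 0.0},
    {"type": "microglia", "GFAP": 0.0},
    {"type": "neuron", "GFAP": 0.4},
    {"type": "neuron", "GFAP": 0.0},
]


def marker_stats(cells, gene, target_type, threshold=0.0):
    """Return (sensitivity, off_target_fraction) of `gene` for `target_type`."""
    in_type = [c for c in cells if c["type"] == target_type]
    others = [c for c in cells if c["type"] != target_type]
    # Fraction of target cells in which the gene is detected.
    sensitivity = sum(c[gene] > threshold for c in in_type) / len(in_type)
    # Fraction of all other cells in which the gene is detected.
    off_target = sum(c[gene] > threshold for c in others) / len(others)
    return sensitivity, off_target


sensitivity, off_target = marker_stats(cells, "GFAP", "astrocyte")
print(sensitivity, off_target)
```

A good marker combines a high sensitivity with a low off-target fraction; specificity in the classical sense would be one minus the off-target fraction.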
So then you can compile a list of marker genes like that. Then there is CellMarker, which is also a database for mouse and human, and it also contains lists of markers. I think I have it here. Yes, perfect. So when you're on the CellMarker website, let me see if I can open another link here. Yes, you have, again, mouse and human. And then you can select your different organs. It's quite small and I can't zoom in, but basically, if we go back to the brain, we can select astrocytes, for example. The website is a bit slow, but if you wait patiently, you should see a cloud of genes, and the size of each gene name tells you how specific this gene is for this cell type. So again, we see that we have many genes, but GFAP is quite a good marker and you can be quite confident with that one, though you could add and look at others as well. I like it when these websites are visual like that. Then we have other possibilities for finding cell type markers or even reference data sets. For example, SingleR. SingleR is the package for cell type annotation that we will use in the exercises. And via SingleR, you have access to another package called celldex, which is like a database of microarray data from sorted, purified populations. When you use SingleR, you have the unknown cell types in your single cell RNA-seq data set and you need a reference to do the annotation, and this gives you direct access to such references. Then you have the Human Cell Atlas. This is a consortium that aims to create a cellular reference map of the whole human body. I think they also have some mouse data. So here again, you have the option to use the Human Cell Atlas as a reference. And finally, the Single Cell Portal. This is an interesting online tool because it contains published single cell data sets, and you actually have the option to visualize the data.
So you can view UMAPs for the published single cell RNA-seq data, and you can also download the normalized matrices for the data sets. You could then use this downloaded data as a reference in your single cell annotation. The good thing is that authors who submit data to the Single Cell Portal also provide the cell type annotation that they did. So if you find a data set that is nicely annotated, you could download it and use it as a reference. Again, if after this course you want to play further and practice all of the scripts that we did, the Single Cell Portal is a nice source of data because you can download the matrices and just import them into Seurat. So here is the concept of a module score. We have seen, for example, in PanglaoDB that you can have several marker genes for astrocytes. If you have, say, 200 genes that should be specific to a cell type, you don't want to go and check where each of these genes is expressed. Instead, you could calculate a score for your list of 200 genes. In Seurat, you have the AddModuleScore function, which allows you to calculate this score for a set of genes. What it does is look at the expression level of the genes that belong to your list, which here I call a signature. It could also be a list of genes involved in a pathway if you want. You compare the expression level of the genes in your signature to some randomly selected genes, called the control genes, which have a similar expression level across all cells compared to your signature genes. The score is then just the deviation of the expression level of the genes in your signature from the expression level of your control genes. And so for every cell, you get a score, which will be added to the metadata of your Seurat object.
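The logic of such a score can be sketched in a few lines of plain Python. This is a simplified illustration of the idea described above, not Seurat's implementation: the gene names, the toy values, the number of bins and the number of control genes per signature gene are all assumptions chosen to keep the example small.

```python
import random
from statistics import mean

# Toy expression matrix: gene -> one value per cell (3 cells, made-up values).
expr = {
    "S1": [4.0, 0.0, 0.0],  # signature gene
    "S2": [5.0, 1.0, 0.0],  # signature gene
    "C1": [1.0, 1.0, 1.0],
    "C2": [2.0, 2.0, 2.0],
    "C3": [0.0, 0.5, 0.5],
    "C4": [3.0, 3.0, 3.0],
}
signature = ["S1", "S2"]


def module_score(expr, signature, n_bins=2, n_ctrl=2, seed=0):
    """Per-cell score: mean signature expression minus mean control expression."""
    rng = random.Random(seed)
    n_cells = len(next(iter(expr.values())))

    # Rank genes by average expression and cut the ranking into bins,
    # so that control genes have a similar overall expression level.
    ranked = sorted(expr, key=lambda gene: mean(expr[gene]))
    bin_size = -(-len(ranked) // n_bins)  # ceiling division
    bin_of = {gene: index // bin_size for index, gene in enumerate(ranked)}

    # For each signature gene, draw random control genes from the same bin.
    controls = []
    for gene in signature:
        pool = [g for g in ranked if bin_of[g] == bin_of[gene] and g not in signature]
        controls.extend(rng.sample(pool, min(n_ctrl, len(pool))))

    scores = []
    for i in range(n_cells):
        signature_mean = mean(expr[g][i] for g in signature)
        control_mean = mean(expr[g][i] for g in controls)
        scores.append(signature_mean - control_mean)
    return scores


scores = module_score(expr, signature)
print(scores)
```

Only the first cell expresses the signature genes above the level of the matched control genes, so it is the only one with a positive score.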
And then you can color your UMAP according to the score. So here, I think this was a pathway activity, so genes linked to a pathway. We have some cells that are very high, some that are intermediate and some that are rather low. But you could use this for marker genes or for any list of genes that is of interest to you. There is an additional package. As you saw in Mahasha's presentation on clustering, you have hundreds of methods for dimensionality reduction, and again, for scoring lists of genes, you have several methods available. I just showed you the AddModuleScore of Seurat, but of course these kinds of methods keep being developed. Here is one called UCell that also does signature scoring. One of its advantages is that you can specify which marker genes should be absent. So for example, if you know that one cell type expresses one gene but absolutely does not express another gene, you can provide this second gene as part of your signature, whereas with the Seurat module score you cannot have negative genes. So in this case, you can include the genes that should be absent and maybe get a more precise score. All right, so let's dig a bit into SingleR. This is the tool that we will use in the exercise. As I just said, it gives easy access to reference data. There is the Human Primary Cell Atlas, which is manually annotated. Every cell type included there has several levels of annotation: a main or broad cell type, like immune cell, for example, and then, for each of these samples in the reference, further subtypes. So you have both levels of annotation, and it contains many samples. Then you also have access to the data from Blueprint and ENCODE. In this case, it includes reference samples from bulk RNA-seq and has several broad types and also subtypes.
And if you're working on mouse, you have access to the immunological cell types through ImmGen and also a mouse RNA-seq data set that's linked to brain, for example. So again, here we are a bit biased towards human and mouse. But you can, of course, use SingleR for a non-model organism if you have a reference. That's always the important part. You could generate your own reference and then use it to annotate your single cells. So you can classify each unknown cell in your data set to either a main type or a subtype, and do the labeling either for each cell or for each cluster. How does SingleR work? Here we have an example of a single cell. It is unknown and has to be annotated. This is its barcode or cell name. And here we have an example of our reference data set. This is, for example, several cell types that were purified and profiled using bulk RNA-seq, and each dot represents one sample of that cell type in our reference. So for neutrophils, for example, we had several samples included in the reference, but for dendritic cells we had just one. So if you're really interested in annotating dendritic cells in your data set, maybe this reference would not be the best choice. Again, it's always good to choose your reference really carefully. But it can give you a first idea of what type of cells you have if you don't have another reference. So we have several different types. The first step in SingleR is to select a list of genes that are differentially expressed across your reference cell types. Because it calculates a correlation for every single cell, the computation can be quite long, so it only calculates the correlation on the top differentially expressed genes across the reference.
And there are several methods to select these genes, but typically it will select the genes whose median expression in one reference cell type is higher than their median expression in all other cell types. Then it takes the gene expression of your unknown cell and correlates it to the expression of these genes in each reference cell type, and you obtain a Spearman correlation score against each sample in your reference. So you can see the distribution of the correlation scores for each reference cell type. Here we see that our unknown cell has an overall low correlation score to neutrophils, so it's probably not a neutrophil. And here the cell types were just sorted according to their median correlation score. I don't know how good the correlation is in absolute terms, but the highest correlation scores were obtained for CD8 and CD4 T cells, and these scores are very similar. So this unknown cell is probably a T cell, but it's hard to say if it's CD8 or CD4. So SingleR includes a step called fine-tuning. It reselects the genes that specifically differ between these close matches, CD8 and CD4, recomputes the correlation scores, and then makes a final cell type label decision. Once you have assigned a label to each of your single cells, you can do some annotation diagnostics. Here are some of the plots that you can generate on the SingleR results. One is called plotScoreHeatmap. It takes the correlation scores of every single cell. In the columns, we have the unknown single cells in our data set, and in the rows, we have the different reference cell type labels. We color according to the correlation score obtained for each single cell against each reference cell type. So here we see a group of cells that has a very high score for duct cells, for example. And at the top, you can see the final assigned label for each unknown single cell.
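The per-cell scores that fill such a heatmap can be sketched as follows, in plain Python on made-up data. This is a simplified illustration and not SingleR's code: it assumes the marker genes were already selected, summarizes each reference label by the median Spearman correlation over its samples (the summary mentioned in the lecture, which may differ from what the package does internally), and leaves out the fine-tuning step.

```python
from statistics import mean, median


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return covariance / (var_x * var_y) ** 0.5


def annotate_cell(cell, reference):
    """Return the best label and the per-label score for one unknown cell."""
    per_label = {}
    for label, profile in reference:
        per_label.setdefault(label, []).append(spearman(cell, profile))
    scores = {label: median(values) for label, values in per_label.items()}
    return max(scores, key=scores.get), scores


# Made-up expression over five pre-selected marker genes.
unknown_cell = [5, 4, 1, 0, 2]
reference = [
    ("T cell", [6, 5, 0, 1, 2]),
    ("T cell", [4, 5, 1, 0, 3]),
    ("neutrophil", [0, 1, 5, 6, 2]),
    ("neutrophil", [1, 0, 4, 5, 3]),
]

label, scores = annotate_cell(unknown_cell, reference)
print(label, scores)
```

The label with the highest score wins, which is also why every cell receives a label, however low its best score is.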
So it took the one that had the highest score, and the label is indeed duct cells, and the same goes for the others. This is nice to have a look at, because if some cells have rather high scores for several cell types, you may need to choose another list of reference samples to better define your cell types, or to look manually and see if you can discriminate and make a final decision based on marker genes, for example. Another plot that you can generate with SingleR is called plotDeltaDistribution, also on your SingleR results. For each cell, it calculates the deviation of the score for the assigned label from the median of the scores obtained across all cell types. So here every dot is one cell, and you see that the deltas for the cells assigned as acinar cells are quite high. But for a few cells, this deviation is rather low, so maybe you would have to check manually whether these are really acinar cells or something else. So try these during the practical exercises. As I was just saying, with SingleR you have direct access to references through the celldex package, for example. But if you have, for example, a non-model organism, or if your reference samples are not included anywhere, you can use your own reference. So here's an example where we have some cells. Again, this is mouse, but it's just an example to show that you can use any reference to annotate your cells. These were immune cells exposed to different strains of LCM virus, and we wanted to know how well these clusters correlated with bulk RNA-seq of different populations that were available in the lab. So it's really a custom reference that we used. And when you use this with SingleR, the reference is just a matrix of normalized gene expression for each of the samples included in your reference.
And of course, you also need the cell type name or label that belongs to each sample in the reference. Then you can see which clusters correlate most with each reference population. So here we have naive, etc. This is typically what you can do with SingleR using your own custom reference. Here is a paper that evaluates different methods for cell type annotation. In terms of performance, I think the paper concluded that the tools called SVM, scPred and singleCellNet were the best ones. They evaluated Python methods and R methods. They also checked whether you need prior knowledge of the cells or not, and whether there is a rejection option. The rejection option is about what happens when a cell has a low score for all of the references: do you force a label on that cell, or do you leave it as an unannotated or undetermined cell type? So we see, for example, SingleR, which is R-based. Here you can see the different underlying methods that are used. As we just discussed, SingleR is correlation against a reference set. You don't need any prior knowledge, and there is no rejection option. So you have to keep in mind that when you use SingleR, all of your cells will receive a label, even if these cells are not part of the reference. But you have the diagnostic plots to help you decide, more or less manually, whether some cells are wrongly assigned. So this is something to keep in mind. Other tools do have this rejection option. SingleR was not the top performing method, but it was among the best, and because it's quite easy to use, we often use it day-to-day in our analyses. There are other options for cell type annotation that are based on Seurat functions. One is the cell type label transfer.
So you have seen today with the integration that you can use this exact same method to annotate unknown cells using another single cell data set as reference. And this is what was done here, for example. The commands are again quite short. First, you find the anchors, and then you assign a cell type label using the TransferData function. Here, the reference is single cell RNA-seq data from annotated cell types, so we have four cell types. Then you can have single cell ATAC-seq data, for example, and see which cell types you have in this data, using the cell type label transfer for the prediction. The cells that you want to predict don't have to be ATAC-seq; they could be another single cell RNA-seq data set that you want to annotate. So using label transfer is also an option. And here, I think there is no rejection of cells, so all cells will get an identity. The authors of the Seurat package also developed Azimuth, and you have two options to use it. I think it's basically based on label transfer, but they offer a list of references for cells in different tissues, so you have PBMCs, cells from fetal development, human lung, etc. Here you have the vignette. If you want to run it locally, you just install that package, and then it's quite easy: you use RunAzimuth on your query with the reference you want from those provided, and then you'll be able to do cell type annotation. The screenshot here is actually from a web app version of this package, where you can upload some raw data and it will do all of the processing and annotation online. So these are the two ways that you can use this package. Finally, I think the last example is, again, another method for cell type annotation. Here, the idea of the authors of this package was to take published single cell data from several sources.
Each of these cell types is present in each of these sources, but only in small numbers. So by combining these data sets, you get a bigger data set that you can use as a reference and maybe annotate more easily. Then, if you have your unknown single cell data set, you can project it on top of this exact same representation. You get a density plot that shows you, for example, that I have a lot of CD8 Tex and a few CD8 naive cells. So basically, it projects your unknown cells onto this reference map, and then you can see where your cells fall and, in the end, annotate your cells. You don't have to use it that way. You don't have to project; you can also simply use it to annotate your cells and then show the annotation on your own UMAP. So this is another way to annotate cells. You can also use the algorithm with your own reference. So if you don't work on immune cells, you don't have to use their reference; you can use your own. And in this case, there is a rejection method, so if a cell does not belong to one of the reference types, it will not be annotated. I think that's it. We haven't mentioned the OSCA book yet, I think. OSCA stands for Orchestrating Single-Cell Analysis. It's an online book created by people at Bioconductor, so it's mostly based on Bioconductor packages and no Seurat is involved. But what I like about this book is that each section has explanations and a bit of background. So even if you don't use these tools, I think reading about the theory is always interesting. Then, if you work with multimodal single cell data, meaning that you measure RNA plus ATAC-seq or other combined modalities on your single cells, there is a paper here that can help you in dealing with it.
And finally, there is a more recent cell annotation review than the one I showed, which you can also have a look at. Nice. So the question is: which cell annotation method will you choose for your own dataset? Manual annotation, automated, or "I will try both and maybe compare". There is no wrong or right answer. It's just to see what your gut feeling is. All right, 25 people have answered. Nice. So it's good that you're brave and want to try several methods and see which one gives the best annotation. I'm happy to see that, because it all depends on your biological system. If you work on a non-model organism, it may be that only manual annotation is feasible because there is no reference available. But if you work on human or mouse, there is so much published reference data that automated annotation does a very good job. So it all depends.