There we go. So this presentation is an introduction to the tools you can use for downstream analysis and the quality control you can do during downstream analysis, at least once you have the count table. So now we have run Cell Ranger and we have our filtered count table. What's next? This is what a count table can look like. This is actual single-cell data; I selected a bit for genes with counts, because many genes often have no counts at all. So we have the genes in the rows and the cells in the columns. The cells are identified by their cell barcode, which I believe is 16 base pairs. Over here I've shown only four characters, but typically the cell names, let's say, are the cell barcodes. So we have the genes in the rows and then of course the counts, which are the unique UMIs, basically the deduplicated alignments aligning to that gene in a certain cell. And you can already see that there are quite specific expression patterns here, so we are probably dealing with different cell types. For example, if we look at these C1QA and C1QC genes, and I have to be honest with you, I don't know what they exactly are, they are relatively highly expressed in the third cell, while they have no expression, or at least their expression is not measured, in the first two cells. And also have a look at, for example, this other gene, which has a high expression in the middle cell and a very low expression in the other two. So probably we can do stuff with these expression profiles. But first, before we do that, we have to do some quality control and, for example, some scaling and normalization. So what kind of tools can you use to do this kind of analysis? Well, there are quite a few. A lot of tools have been developed in recent years because single-cell transcriptomics has gained a lot of popularity, so luckily there are also a lot of tools, which is very nice.
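To make the count-table structure concrete, here is a tiny sketch in Python with pandas. The barcodes and counts are made up (the course itself works with Seurat in R, not Python); the point is only the layout: genes in rows, barcode-named cells in columns, deduplicated UMI counts as values.

```python
import pandas as pd

# Toy count table: genes in rows, cells in columns.
# Column names are (hypothetical) cell barcodes; values are deduplicated
# UMI counts per gene per cell. Real tables have ~20,000+ genes and are sparse.
counts = pd.DataFrame(
    {
        "AAACCTGA": [0, 0, 12, 3],
        "AAACGGGT": [0, 0, 15, 1],
        "AAAGATGC": [9, 11, 0, 2],
    },
    index=["C1QA", "C1QC", "CD3E", "ACTB"],
)

# The third cell expresses C1QA/C1QC while the first two do not:
print(counts.loc[["C1QA", "C1QC"]])
```

Already in this toy example you can see the kind of cell-specific expression pattern discussed above.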
What these different tools do is that they all perform at least the following. They perform quality control. They do some normalization and scaling, because, for example, you want to be able to compare genes on the same scale, and you want to be able to compare cells and normalize, for example, by the number of reads you have generated per cell. That's what normalization and scaling are about. And most of them also do some dimensionality reduction, so you get a nice two-dimensional visualization of your cells, like a UMAP or t-SNE. In Python, the most popular tool for this is Scanpy. We won't be looking into that in this course, but there are a lot of very nice tutorials online. If you are working in, let's say, the Bioconductor universe, typically you would work with scater and scran to do your analysis, so that would be within R. An alternative to that is Monocle 3, which also does a lot of similar things. It's not on Bioconductor, not even on CRAN, but only installable through GitHub. And I have to be honest, I have some doubts about how much people are still investing in developing Monocle 3 further. But it is still a package that is used very frequently, and it's particularly strong in, for example, trajectory analysis. It's not only there for trajectory analysis, though; you can also do typical quality control and dimensionality reduction with Monocle 3. And then the last one I wanted to mention, and of course there are more than these four, but these are the four most used as far as I know, is Seurat. Seurat is on CRAN, but not on Bioconductor, and it's also an R package. I think it's still the most used tool today for single-cell transcriptomics analysis, and we will also use Seurat in our exercises. So what do you typically do? What are the typical analysis steps during a single-cell transcriptomics analysis?
So usually, in addition to the filtering that, for example, Cell Ranger already did, you do some additional cell filtering. That can be based on the number of detected genes, the number of reads, but also, for example, on the number of reads aligning to mitochondrial genes. We will go into that later on. Then there is a normalization step, where you normalize per cell for the total number of reads or UMIs generated for that cell, so you can better compare cells. Then there is usually feature selection. What I mean with that, or what the developers of Seurat mean with that, is that you select the most variable genes. You could do the dimensionality reduction with all genes, but it's computationally much more efficient to do it only with the most variable ones. So you pick a few thousand, usually two thousand, most variable genes, the ones that vary most between cells, and use those for scaling. Scaling is important for the dimensionality reduction. After the scaling, we do the dimensionality reduction: we start with a PCA, principal component analysis, more about that tomorrow, I think, right, Rochelle? And after the PCA, you usually continue with a UMAP or t-SNE. Once you have that, you have a nice visualization. Typically, you then continue with, for example, clustering: you want to cluster cells of the same type together. And annotation: you want to see which cell types you actually have, so to specify which of the cells are, say, the T cells, which ones are the B cells, and so on. And after that, you very often want to do differential gene expression, for example, and then of course many other things. So in this course, the tool we will use for most of the steps is Seurat, including the filtering, normalization, feature selection and scaling. For annotation, we will have a look at SingleR.
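The feature-selection step described above, picking the few thousand most variable genes, boils down to ranking genes by how much they vary across cells. Here is a minimal sketch of that idea in Python with made-up data; note that Seurat's actual method is more sophisticated (it uses a variance-stabilized ranking), so this is only the core concept.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy expression matrix: 100 genes (rows) x 50 cells (columns).
expr = rng.poisson(1.0, size=(100, 50)).astype(float)
# Make a handful of genes much more variable than the rest.
expr[:5] *= rng.uniform(0.0, 5.0, size=(5, 50))

n_top = 10
gene_var = expr.var(axis=1)                     # variance of each gene across cells
top_genes = np.argsort(gene_var)[::-1][:n_top]  # indices of the most variable genes

# Downstream steps (scaling, PCA) then use only these rows.
subset = expr[top_genes]
print(subset.shape)
```

Restricting the matrix to these rows is what makes the later scaling and PCA steps computationally cheap.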
And for differential gene expression, we will use limma, and for, for example, enrichment analysis, you'll be using clusterProfiler. But most of the steps, especially today and tomorrow, will be done in Seurat. So a little bit more about cell filtering. You have already seen what Cell Ranger does: it basically calls the cells. It orders the cells by the number of UMIs and tries to find this steep drop in order to say, okay, these are cells indeed, and these aren't. But there are also other ways to filter cells. For example, if you find a very high number of UMIs, this points to a very complex library, and that could point to a doublet: that you have, for example, two cells in one droplet, and therefore many different transcripts in there. Another way to filter cells could be by the number of detected genes. Maybe you do have a lot of reads, so a lot of transcripts were measured, but only very few genes were detected. That could point to empty droplets, but it could also point to, for example, cell types you don't want in there, like erythrocytes. A third one could be mitochondrial UMIs, so reads that align to mitochondrial genes. A high percentage of those typically points to dying or very stressed cells. The reason for that is that transcripts that are inside the mitochondria do not leak out of the cell as quickly when the cell membrane becomes permeable. Those transcripts stay in the cell, and therefore you measure a relatively higher fraction of mitochondrial transcripts, which is why this can point to dying or highly stressed cells. Typically people also look at ribosomal UMIs, so, to say this correctly, reads from genes that code for the ribosomal proteins. I have to be honest, I have never really found a good reason to use this percentage to filter cells.
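Each of the filtering metrics just mentioned, total UMIs, detected genes, and the percentage of mitochondrial reads, can be derived directly from the count matrix. A sketch in Python with made-up gene names and counts (in Seurat these per-cell metrics are computed for you, e.g. by `PercentageFeatureSet` for the mitochondrial percentage):

```python
import pandas as pd

# Toy count matrix: rows are genes, columns are cells.
genes = ["MT-CO1", "MT-ND1", "CD3E", "HBB", "ACTB"]
counts = pd.DataFrame(
    [[50, 2], [30, 1], [0, 10], [5, 0], [15, 40]],
    index=genes, columns=["cell1", "cell2"],
)

total_umis = counts.sum(axis=0)               # library size per cell
n_genes = (counts > 0).sum(axis=0)            # detected genes per cell
is_mito = counts.index.str.startswith("MT-")  # mitochondrial genes by name prefix
pct_mito = 100 * counts[is_mito].sum(axis=0) / total_umis

# cell1 has 80/100 = 80% mitochondrial UMIs: a candidate dying/stressed cell.
print(pct_mito)
```

Filtering then just means dropping the columns (cells) whose metrics fall outside your chosen thresholds.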
A lot of people do, but I'm not entirely sure why, and I'd be happy to hear an explanation of why you would want to do that, but it's a measure you can generate per cell. Another one could be the percentage of reads aligning to globin genes; these could point to, for example, erythrocytes, and very often people do not really want these erythrocytes in the dataset. And of course, what you can also do is look at the relationships between these variables. For example, a low number of detected genes combined with a relatively high percentage of globin could point to erythrocytes. Stuff like that. An example of that you can find over here; this is actually the dataset you will be working with. So just to make that point: we have here three samples, with the total number of UMIs per cell on the x-axis and the number of detected genes on the y-axis. Mostly, the relationship between the total count and the number of features is relatively linear: the more reads you have for a certain cell, the more genes you measure, because lowly expressed genes are then detected as well. But apparently there's also a sizable group with relatively high counts but not a lot of detected genes, probably because not a lot of genes are expressed in these cells. So what could that be? Well, what we can do, for example, is generate the same plot but color the cells according to the percentage of globin reads. Then we see that most of those cells have a very high percentage of globin, so probably these are erythrocytes. So we have an erythrocyte contamination; whether you would call it contamination depends on whether you were aiming to sequence erythrocytes. Yes, I think there is a question of general interest in the Slack channel, so I will read it out loud.
How has the cutoff for mitochondrial DNA been determined, and what if we expect mitochondrial genes to be highly expressed, for example under oxidative stress? Yeah, so about the cutoff: you're saying, okay, we want to filter out cells with high mitochondrial expression, so with a lot of reads coming from mitochondrial genes. That very much depends on the dataset. Typically you go back and forth in the analysis. Usually what I do, and I think Tanya and Rachel do very similar things, is filter relatively mildly for mitochondrial reads, so you only take out the cells with, let's say, the highest mitochondrial expression. You continue with the analysis through to the UMAP, for example, and then what you can do is color the UMAP, so the dimensionality reduction, or even the clustering, based on the percentage of mitochondrial reads. If you then see a very clear cluster of cells with very high mitochondrial expression, those are most likely not clustering together because they have the same cell type, but because they have such a high mitochondrial gene expression. Then you can check what percentage of mitochondrial reads they have, go back to the filtering again, and carry on. That's typically how you decide, because it depends very much on the dataset what kind of mitochondrial expression you can expect, how high the percentage can typically be. My experience with tumor cells is that mitochondrial gene expression can be relatively high, while for other tissues it can be expected to be very low. I hope that answers the question. Great, so that's it about cell filtering, and cell filtering you typically already do with Seurat. Seurat has quite a specific way of storing your analysis, in an object called a Seurat object. If you are a little bit familiar with R, you know that there are different types of objects in R, right? So you have, for example, data frames,
you have factors and you have lists, and there are also more complex objects. An object is basically a set of rules for what a certain type of data should look like, so we can do specific calculations on it. A Seurat object is an S4 object with slots, and within these slots you can store different types of data. All of these slots are always there in a Seurat object, and they can be filled or not. What is always filled in a Seurat object is the assays slot. You can kind of consider an S4 object like this as a list, although it is not a list; it is basically a list of slots, that's how you can see it, and all these slots have a name. So we have the assays slot, and that assays slot contains again a different S4 class, called Assay. That Assay can again have multiple slots, among which counts, which are the raw counts, so basically what was generated by Cell Ranger; data, which is the normalized data; scale.data, which is the scaled data; and var.features, which are the variable features. If you haven't selected them yet, that slot is not filled. And there is some metadata about these features, which is not very frequently used. So you have the raw counts stored in the counts slot within the assay, the normalized data, the scaled data, and the variable genes, or features. Then you can store metadata per cell, and you usually add metadata per cell during your analysis. For example, if you are clustering cells, so you want to specify clusters of cells that are more similar to each other than to the rest, you specify that in the metadata slot. The metadata slot just contains a data frame, a very typical R object, which is just a table with columns specifying information about each cell. It can be the cluster, it can be the total number of reads, it can be the annotation if you're further along, all kinds of those things. So the information per cell is stored in the meta.data slot. Then there is
the graphs slot. For now I will just skip active assay and active identity; I'll talk about those later. In the graphs slot, you store the graphs that are used for the clustering. So if you are doing clustering with Seurat, specifying which cells belong together, you can also store that in the same Seurat object. Then the dimensionality reductions, so those would be the PCA, UMAP or t-SNE: you can store those together in the reductions slot. And again, that is a different class, the DimReduc class, also with its own slots specifying information about, for example, the PCA. We typically call those embeddings, the coordinates in, for example, the two dimensions of the UMAP. Then there is a slot that is not very frequently used, but that I consider to be very convenient, and that's the commands slot. In the commands slot, all the commands that have been run to generate the object you're looking at are stored. Sometimes you get a Seurat object from a colleague: the colleague did her analysis and then gives it to you in order to, for example, do differential gene expression analysis or whatever, but you do not really have a script. In that commands slot, the commands that were used to generate this particular object are there; that can be very convenient to figure out the history of a Seurat object. Then the active.assay and active.ident slots. active.assay refers to the assay currently in use, because we can have multiple assays. We can have the default assay, which is typically called RNA, and later on, if you're going to integrate multiple datasets, we can have a second assay with only the integrated data. The integrated data is used for dimensionality reduction, but not for, for example, differential gene expression analysis, so therefore,
we keep them separate; they contain different information. More about that tomorrow. The active.ident is mainly used for plotting: it refers to a column in the metadata slot, so over here in the data frame in the metadata slot, specifying which kind of metadata information, for example the original sample name, you're using during plotting, for example when coloring a UMAP according to sample name. And the project name you always have to specify. That's pretty much it, what I wanted to discuss; all the other slots are less relevant to you. Then a few words about normalization and scaling. What's the difference between the two? Well, normalization is per cell: you remove technical effects per cell, usually the library size, so how many UMIs you counted for that particular cell. That way you are able to compare cells, because, for example, if you have generated a lot of reads for one cell, it becomes very difficult to compare it to a cell for which you have generated only a few reads. Then there's scaling, which you usually do per gene, where you standardize the range, the mean and the variance. That is important for principal component analysis; typically you always scale your observations for principal component analysis. So actually, both normalization and scaling are mainly there for the purpose of dimensionality reduction. If you continue to differential gene expression analysis, there is usually also normalization and scaling, but those are typically handled by the algorithm that is calculating, for example, the p-values and the log fold changes for you. One other thing you can do with a dataset is regress out variables. That's typically done after you have visualized your dataset with, for example, a UMAP: if you see that your UMAP is structured according to a variable that you're not interested in, you can try to regress it out.
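The per-cell versus per-gene distinction can be made concrete with a small numeric sketch in Python. The first step mirrors the idea of Seurat's default log-normalization (divide by library size, multiply by a scale factor, log-transform); the second is plain per-gene standardization. The matrix is made up.

```python
import numpy as np

# 3 genes x 3 cells; cell 2 is a 10x-deeper copy of cell 1, cell 3 differs.
counts = np.array([[10.0, 100.0, 2.0],
                   [ 5.0,  50.0, 6.0],
                   [ 5.0,  50.0, 2.0]])

# Normalization: per CELL. Divide by library size, rescale, log-transform.
lib_size = counts.sum(axis=0)                 # [20, 200, 10]
norm = np.log1p(counts / lib_size * 10_000)
# Cells 1 and 2 only differed in sequencing depth; now they match:
print(np.allclose(norm[:, 0], norm[:, 1]))    # True

# Scaling: per GENE. Standardize each gene's mean and variance across cells,
# so every gene contributes on the same scale to the PCA.
scaled = (norm - norm.mean(axis=1, keepdims=True)) / norm.std(axis=1, keepdims=True)
# Each gene now has mean ~0 and standard deviation 1 across cells.
```

Normalization makes cells comparable to each other; scaling makes genes comparable to each other, which is exactly what the PCA needs.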
So what we typically see, depending a little bit on the dataset, is that you very often have cells in there that are dividing, so, to say this correctly, cells that are in the G2 or the S phase. But you're not interested in clustering according to cell-cycle phase; you're interested in, for example, clustering according to cell type. What you then see is that cycling cells of different cell types, so let's say cycling B cells and T cells, cluster together, because their expression patterns are very similar since they're all cycling, while they do not cluster with the B-cell and T-cell clusters. Maybe you want them to cluster together with the B cells and the T cells, because you want to compare those regardless of whether they are cycling or not. Then you can regress that out, and there's a pretty nice feature in Seurat that enables you to do that. It can be a bit challenging; depending a little bit on the dataset, it works very well or it doesn't work very well. If it doesn't work very well, you can always choose to take out the cycling cells, recluster those, and annotate them again as a B cell or a T cell or whatever.
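Under the hood, regressing out a variable amounts to fitting, per gene, a linear model of expression against the unwanted variable and keeping only the residuals; Seurat's `ScaleData` with `vars.to.regress` implements a version of this idea. A minimal sketch in Python with a simulated gene, a made-up cell-cycle score, and a made-up cell-type signal:

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 200
cycle_score = rng.normal(size=n_cells)      # unwanted variable, e.g. a cell-cycle score
celltype_signal = rng.normal(size=n_cells)  # the biology we want to keep

# One gene whose expression mixes both effects plus a little noise.
expr = 2.0 * cycle_score + 1.0 * celltype_signal + 0.1 * rng.normal(size=n_cells)

# Fit expr ~ cycle_score by least squares and keep only the residuals.
X = np.column_stack([np.ones(n_cells), cycle_score])  # intercept + covariate
beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
residuals = expr - X @ beta

# The residuals no longer correlate with the cycle score,
# but they still carry the cell-type signal.
```

After this, the cycle-driven part of the expression is gone, so cycling B cells and T cells can fall back into their respective cell-type clusters.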