Yeah, so now we're going to talk about quality control and normalization. So this is going to be the first step in what is a quite long process of going from your single-cell RNA-seq experiment to your beautiful UMAP with all your different cell types and your differential expression, answering your important biological question that's going to save all of humanity. So right now, we're going to start at the left here with quality control and normalization to get our normalized expression matrix. Then tomorrow, we're going to talk about feature selection and dimensionality reduction, and then cell-cell distances, creating a k-nearest neighbor network, and then clustering and differential expression. Right, so we're starting at the beginning here.

So quality control. The point of quality control is to remove dead or damaged cells from your dataset. Because when we're handling our tissue and ripping it apart into individual cells, we're going to be disturbing those cells, and some of them are going to get killed, or they're going to freak out and react to our handling of them and undergo apoptosis, or they can get damaged or die just from the physical act of going through the tubes in our 10x machine. Now, all we have to work with is our UMI count matrix. We can't see our cells, we can't use some sort of stain to see whether they're alive or dead. All we have is our UMI counts. So what kind of statistics could we calculate to estimate what our cell quality is? Yep, so one of them is the percentage of mitochondrial genes. What else might be a measure of how healthy our cells are? Number of zeros. Yeah, number of zeros, or the number of detected genes. Yep, the number of molecules per cell. Any other ideas? Counts of housekeeping genes. Yep, so counts of housekeeping genes or metabolic genes to show our cells are still alive. These are all possible statistics. So how do we determine which ones are best?

So, as some of you may already know, we tend to use percent mitochondrial and number of detected genes. We tend to make up a little story for why that is, but the real reason is that if we go way back in the day of single-cell RNA-seq, by that I mean 10 years ago, back when we only had the Fluidigm C1, we didn't have 10x Genomics, we didn't have SPLiT-seq, all of these things. There we could actually see the single cells we were capturing, and we could actually take a picture of them and see whether they were broken or dead or damaged or still alive. So a group in Oxford, or no, in Cambridge, calculated every statistic they could think of to measure the quality of the cells and checked whether they agreed with the images of dead and damaged cells or not. These are some of the ones they came up with: expression of apoptosis-related genes, housekeeping genes, ribosomal genes, mitochondrial genes, yada yada yada, mapping quality, everything. Everything you could possibly imagine. They tested all of them, and the best two were the mitochondrial percentage, which you can see is very high in broken cells, and the number of detected genes, which was very low in empty cells, the ones where we didn't capture a cell. And from then on, that's what everyone uses. So that's quality control solved. We're done. We can do that ourselves.
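As a concrete sketch of that filtering step, here's roughly what it looks like in scanpy; the input path, the "MT-" naming convention, and the cutoffs are illustrative assumptions, not values from the talk:

```python
import scanpy as sc

# load a CellRanger-style count matrix (path is a placeholder)
adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")

# flag mitochondrial genes by name ("MT-" for human, "mt-" for mouse)
adata.var["mt"] = adata.var_names.str.startswith("MT-")

# adds n_genes_by_counts, total_counts, pct_counts_mt, ... to adata.obs
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None,
                           log1p=False, inplace=True)

# keep cells with enough detected genes (not empty) and a low
# mitochondrial fraction (not broken); these cutoffs are illustrative
# and should be chosen per dataset by looking at the distributions
adata = adata[(adata.obs["n_genes_by_counts"] > 500)
              & (adata.obs["pct_counts_mt"] < 20)].copy()
```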
Next, we have to think about normalization, which brings us to the library-size problem. Right, so here's some output from a single-cell RNA-seq experiment. This is actually output from CellRanger, and you can see the number of reads we sequenced there. The number of reads we sequence is determined when we tell the sequencing facility how much sequencing we're willing to pay for, and that's how many reads we get. So what would happen if we doubled or quadrupled how many reads we sequenced? Yep, the number of reads increases, and the number of UMIs we detect in each of our cells will increase, and for each gene in that cell, it will also increase.

So if we go back to our barcode rank plot and think about what this actually shows, remember this is a log scale on the y-axis. So the difference between the cell indicated in blue and the cell indicated in red is two-fold: there are twice as many UMIs detected in that blue cell as in that red cell. So if we were just to compare these raw counts to each other, every single gene in that blue cell will look like it's two times more highly expressed than the genes in that red cell. That's not very helpful. So we want to normalize this away so that all of our cells have an equal total amount of expression.

So the first solution was counts per million, and "million" is in quotation marks here because it's not actually per million anymore. Here what we do is simply take each cell, divide it by the total number of UMIs in that cell, and then multiply it by 10,000 so that our numbers are back in a readable range. So we took the gene expression divided by the total number of UMIs; most of the values are going to be something like 0.0003, which is annoying to deal with and look at, so we multiply by 10,000 so that we get 3 instead. Some tools won't use 10,000; sometimes they'll use the median total UMIs across all of your cells instead, but it's effectively the same thing.

However, this also has a problem. Imagine we have a cell here where these are the actual concentrations of these genes in the cell, and the percentage of total RNA on the right there is what we measure with our UMIs. If we increased the amount of gene A by 10-fold and looked at the percentage of total RNA, because that's how we're normalizing our data, what happens is that the percentage of total RNA, so the normalized gene expression, for all of the other genes also changes. They're all going to go down because we've added a whole bunch of A, and that's not really ideal either. The actual concentration of those genes is not changing, so it would be good if their normalized expression values didn't change either.
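Here's a tiny numpy sketch of that "counts per 10,000" scaling, plus the compositional problem just described; all the numbers are made up for illustration:

```python
import numpy as np

counts = np.array([[10.,  5.,  5.],    # cell 1
                   [20., 10., 10.]])   # cell 2: same composition, 2x depth

# "counts per million" style scaling: divide each cell by its total
# UMIs, then multiply by 10,000 to get readable numbers
cp10k = counts / counts.sum(axis=1, keepdims=True) * 1e4
# both cells now have identical normalized values, as they should

# the compositional problem: boost only gene A 10-fold in cell 2
counts2 = counts.copy()
counts2[1, 0] *= 10
cp10k2 = counts2 / counts2.sum(axis=1, keepdims=True) * 1e4
# genes B and C in cell 2 now look down-regulated even though their
# underlying concentrations never changed
```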
We came up with more solutions to this, so those of you who did the bulk RNA-seq course have probably heard of some of these normalizations already. What we do is assume most genes are not differentially expressed. We might want to think about whether this is true in our single-cell experiment, because it may not be. For something like mouse or human, there are a lot of housekeeping genes that won't be differentially expressed, but if you're looking at a strange organism like a malaria parasite, their entire transcriptome will change between different stages, and this assumption will not be true. So you need to be a little bit careful there. Then we can calculate these statistics. So DESeq uses a statistic based on the median relative expression versus the geometric mean. And edgeR uses trimmed mean of M-values: basically, take each gene, calculate the log fold change for that gene versus a reference sample, then look at what the average of those fold changes is in this particular sample, and that becomes the size factor you divide the total RNA by instead.

However, these have a problem in that they assume all of the genes are detected in all of the samples, which in bulk RNA-seq is totally fine; that is the case. But in single-cell RNA-seq, only about 10% or less of your genes will be detected in every single cell, so these no longer work very well. So our solution was scran, which is a method where we take our single cells and randomly pool them together, so that now we have like a miniature bulk RNA-seq sample and all of our genes are detected. We calculate our trimmed-mean size factor for that pool, and we do that many, many times. Then we use linear algebra to decompose the size factors we get from each of those different pools back into a size factor for each cell. That's how scran works.

The other part of normalization: normalization is called normalization because we want our data at the end to follow a normal distribution, because many statistical tools assume the data follows a normal distribution. If you do linear regression, you're assuming your data follows a normal distribution. If you do ANOVA, you're assuming your data follows a normal distribution. If you're doing PCA, you're assuming your data follows a normal distribution. UMI counts for single-cell RNA-seq do not follow a normal distribution; they follow a negative binomial distribution. So we need to fix this by transforming our counts so that the transformed UMI counts follow a normal distribution closely enough that our tools will work.

Okay. So here are some UMI counts for a gene. These are the raw values, and you can see they're super skewed. You can counts-per-million normalize them, and it's still super skewed, but at least it looks like an exponential distribution there. If we log-transform that, now we get kind of a normal distribution here, plus the spike at zero. So we can just log-transform our normalized counts to get a fairly normal distribution for our data, and then we can use things like PCA and linear regression and so on.

The other option that's become popular is Pearson residuals. Pearson residuals here are calculated as (O − E) / √E, that equation there. O is your observed count, and E is your expected count given a model. In this case we know our data follows a negative binomial distribution, so we can model it with a negative binomial, figure out what the expected counts are given that model, plug them in there, and then divide by the square root of that expectation as well. That will give you something that looks like this on the right. You can see that also kind of looks like a normal distribution, so it's considered generally close enough to a normal distribution that we can now use it for things like PCA.
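As a sketch, here is one common analytic form of Pearson residuals, assuming a simple Poisson-style independence model for the expected counts; this is my illustration, not the exact model from the talk, and negative-binomial-based tools add an overdispersion term to the variance:

```python
import numpy as np

def pearson_residuals(counts, clip=None):
    """(O - E) / sqrt(E), where E[i, j] = cell_total_i * gene_total_j
    / grand_total. Assumes genes with zero counts overall were already
    filtered out, so no division by zero occurs."""
    counts = counts.astype(float)
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    expected = cell_totals * gene_totals / counts.sum()
    resid = (counts - expected) / np.sqrt(expected)
    if clip is not None:  # residuals are often clipped, e.g. to sqrt(n_cells)
        resid = np.clip(resid, -clip, clip)
    return resid
```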
When we're doing normalization, we might also want to consider other potential confounders we want to get rid of. A confounder is some sort of effect that will change the UMI counts in our gene expression matrix but is not what we care about. These are things like the cell cycle; genetic background, right, if we're using patient data or tumors, those cells will have different genetic backgrounds, and we'll detect gene expression differences because of that; it could also be tissue storage, so some of our samples were kept in the freezer for three months and some of them for one month, and you might see a difference because of that. You have environment, lifestyle, date and age of processing, circadian rhythm, cell stress, etc. You can think of all kinds of these.

For these there are lots of potential solutions. Ideally we would design our experiment to minimize these confounders. However, sometimes that's not possible. You might have a certain timeframe in which you can collect your samples, and you have to store some of them in the freezer until you have the next batch of samples ready. Or you might be working on a disease where some of your individuals have the disease and some of them don't, and obviously they're going to be genetically different from each other. In those cases, what we can do is use a regression model to fit that particular confounder and its effect on the gene expression, and then remove it. That works great; however, it takes a long time to compute, and if it's confounded with your biological effect of interest, it won't work. The other option is we could just remove the cells that are most affected: part of our QC was finding cells that were really damaged or stressed and removing them from our dataset. We can also remove the genes that are most affected. So if you're looking at cell stress, you might take all of the stress pathway genes, remove them from your gene expression matrix, and not consider them, to reduce the effect. You can also include these confounders as covariates when you do your differential expression: if you use a model-based method for differential expression, just include them as covariates. Or you can just ignore it and hope it doesn't cause problems, which is sadly often the option people go for.

Obviously there are lots of these, but I'm just going to talk a little bit about the cell cycle, because I'm guessing many of you work on cancer or development; that tends to be what a lot of people doing single cell do. With the cell cycle, you have to be a bit careful. Obviously it can be a confounder, and you saw even in Trevor's talk that some of his clusters were different because of the cell cycle; they're in different stages of the cell cycle. If you're working on development, usually the cell cycle is a confounder: usually all of the cells are cycling to some degree, they're just in different stages of the cell cycle, and you don't care about that, you're interested in their phenotype besides that. If you're working on mature tissues, the cell cycle is probably not an issue, because most of the cells aren't cycling, so you can just ignore it. However, if you're working on cancer, often the cell cycle is biologically interesting: you probably care about which cells are cycling and why those cells are cycling and not the other cells. But if you just use the tools to regress out the cell cycle, those tools are pretty dumb, so they will remove any and all differences between the cells that are cycling and those that aren't. So if that difference is something you care about, you probably shouldn't regress out the cell cycle. But if that difference is not something you care about, you can regress out the cell cycle.
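In scanpy, that regression looks roughly like this; the gene lists below are truncated examples of the Tirosh et al. cell-cycle lists used in the standard scanpy and Seurat tutorials, and `adata` is assumed to be the already normalized, log-transformed AnnData object from earlier:

```python
import scanpy as sc

# truncated examples of the Tirosh et al. cell-cycle gene lists; the
# full lists (~40 genes each) ship with the standard tutorials
s_genes = ["MCM5", "PCNA", "TYMS", "FEN1", "MCM2"]
g2m_genes = ["HMGB2", "CDK1", "NUSAP1", "UBE2C", "TOP2A"]

# score each cell for S and G2M phase activity
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)

# regress the phase scores out of every gene; note this removes ANY
# difference that lines up with the cell cycle, wanted or not
sc.pp.regress_out(adata, ["S_score", "G2M_score"])
```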
And lastly, I'll talk about imputation. In our single-cell RNA-seq datasets, a lot of the entries are zeros, as you saw when you were handling some in the lab; 90% of the data matrix is zero. A lot of the time those genes are actually expressed in those cells, we just don't observe them, for whatever reason. Usually it's random chance, because we don't sequence every single molecule in our library, and we don't capture every mRNA molecule in the cell when we're creating our sequencing library. So a lot of people have come up with tools to impute the data, where you take those zeros and replace them with your best guess of what the gene expression should be for that gene in that cell. Which seems like an awesome, great idea: we get rid of all of these zeros, we can find out what the actual gene expression is for our cells. However, it has some problems.

So here's a little example dataset. This was just some fake data where I generated some random single-cell RNA-seq data, where half of the genes, here in red and blue, were differentially expressed between two populations, and half of the genes were not; they're just random and evenly expressed across all of my cells, at different levels. And you can see these are the correlations between the genes: all the genes that were up-regulated in one cell type are correlated with each other, and all the genes that were down-regulated in that cell type are correlated with each other. Then I applied the imputation methods to see what would happen. And you can see all of them cause some of these genes in gray, the ones that are not differentially expressed between my cell types, to become correlated with each other and appear to be differentially expressed. Some of them are worse than others: scImpute only changes a few of them, but if you use something like MAGIC, all of your genes that aren't actually differentially expressed between cell type A and cell type B now look differentially expressed between cell type A and cell type B, and to a lesser degree with the other methods. What this means is that because these tools are essentially saying "this cell is similar to this other cell, therefore the expression of this gene should be similar to the expression of that gene in that other cell," you are creating signal in your data, even when there's just random noise.

Yep. Yeah, I think the argument is that it makes sense, I just don't see any motivation behind imputation. The main argument for imputation is, hey, it makes the plot of the correlation between my two genes look really awesome? Pretty much, yeah. The differential expression of these genes: oh, look how strong it is now after I did imputation. Oh, these two genes that I thought were correlated and before imputation had a correlation value of 0.1, now it's 0.9. Isn't that great. So I recommend not using this. Just don't do that. It's not good. Even though there are like 10 or more methods to do this now, just don't.
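As a toy stand-in for the experiment just described, here's a small simulation where a crude k-nearest-neighbor smoothing (the shared core of many imputation methods) manufactures correlations among genes that are pure noise; all the parameters are arbitrary:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n_cells, n_genes = 200, 40
labels = np.repeat([0, 1], n_cells // 2)

# first half of the genes are differentially expressed between the two
# populations, second half are pure noise with identical means
means = np.full((n_cells, n_genes), 2.0)
means[labels == 1, : n_genes // 2] *= 5
counts = rng.poisson(means).astype(float)

# crude k-NN smoothing: replace each cell's profile with the average
# of its 15 nearest cells (by Euclidean distance on the raw counts)
knn = np.argsort(cdist(counts, counts), axis=1)[:, :15]
imputed = counts[knn].mean(axis=1)

def mean_abs_corr(x, gene_slice):
    c = np.corrcoef(x[:, gene_slice], rowvar=False)
    return np.abs(c[np.triu_indices_from(c, k=1)]).mean()

noise = slice(n_genes // 2, n_genes)
print("noise-gene correlation, raw:    ", mean_abs_corr(counts, noise))
print("noise-gene correlation, imputed:", mean_abs_corr(imputed, noise))
# smoothing shares information across overlapping neighborhoods, so
# genes that were pure noise now appear correlated: created signal
```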
All right, so to summarize. For quality control, we use a low number of detected genes per cell to indicate an empty or dead cell, and high mitochondrial reads to indicate a damaged cell. And we need normalization for our downstream analysis; bulk RNA-seq normalization doesn't work, so we use pooling-based or model-based normalization. If you use Seurat in the lab, we'll do SCTransform, which is a model-based method of normalization. Essentially what it does is take each gene and ask: is your expression correlated with library size? If yes, I will remove that correlation. That's essentially what it does, although it doesn't do this for every gene, because that would take forever; it does it for about 2,000 genes and then assumes the rest of the genes follow the same pattern. And there are many biological confounders, and you'll have to think about how you want to deal with them for your particular system and your biological question. And of course, imputation should only ever be used for visualization, to convince your viewers that your data is awesome. So don't draw any conclusions from it yourself.
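As a loose sketch of that "remove the correlation with library size" intuition, not the actual SCTransform implementation (which fits a regularized negative binomial GLM per gene and returns Pearson residuals), something like:

```python
import numpy as np

def remove_library_size_effect(counts):
    """Toy version of the idea behind SCTransform: per gene, fit the
    dependence of log expression on log library size and keep the
    residuals. A linear fit stands in for the real regularized NB GLM."""
    log_expr = np.log1p(counts.astype(float))
    log_depth = np.log10(counts.sum(axis=1))
    X = np.column_stack([np.ones_like(log_depth), log_depth])
    # one least-squares fit per gene, vectorized across all genes
    beta, *_ = np.linalg.lstsq(X, log_expr, rcond=None)
    return log_expr - X @ beta
```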