Okay, so we're going to continue on with dimensionality reduction and visualization of single-cell RNA-seq data sets. Yesterday we covered the first step of our workflow, how to do QC and normalize our expression, so yesterday was the boring technical aspect of single-cell RNA-seq analysis. Today we're going to do all the fun stuff, the biologically relevant stuff: getting to our biological interpretation and answering our question. So the next step that we're going to talk about right now is feature selection and dimensionality reduction. The main tools for this are highly variable genes for feature selection, and for dimensionality reduction we're only going to talk about PCA. Some of you are interested in integrating really big data sets, and for that you would use the other method listed here, autoencoders, which are machine learning tools, black boxes that do dimensionality reduction in a somewhat magical way, and that works really well and is super scalable. That is what people use for integrating data sets of 4 million cells. But if you've only got, like, 50,000 cells, you can use PCA and the tools we're going to talk about in detail.

So first we've got to understand why we need to do all of these steps, and that's because of the curse of dimensionality, which is a really hard thing to understand. I looked up all kinds of different explanations for it and none of them were that great, so this is my best shot at it. Essentially, if you imagine your data on a straight line, if you have 10 data points, they're barely contained on that line. If you have those same 10 data points in two dimensions, on a plane, now you have a lot more space for those 10 points to be in. If you then go to three dimensions, a cube, you have even more space, and your data is essentially going to get more and more spread out as you increase the number of dimensions. And in single-cell RNA-seq, each gene is considered a dimension, so we have 20,000 dimensions, which is a lot.

The other way to think about this is that each dimension, so each gene, contains some amount of noise, but the effect size, the biological difference between our cells, stays about the same regardless of how many genes we have. This means that as we increase the number of dimensions, the noise in each dimension keeps adding up, and eventually the noise from all those dimensions will swamp out the signal and make it really hard to find the biological truth, the biological effect. So we want to get rid of this extra noise, get rid of these dimensions that aren't helping us actually understand our system.
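To make that noise-accumulation argument concrete, here's a tiny simulation sketch in R. This is purely hypothetical data, not from the lecture: two groups of cells differ in only 10 "signal" genes, and every additional gene contributes pure noise.

```r
# Toy curse-of-dimensionality simulation (hypothetical data, illustration only)
set.seed(1)
n_cells <- 50  # cells per group
for (n_genes in c(10, 100, 1000, 10000)) {
  signal <- matrix(0, nrow = 2 * n_cells, ncol = n_genes)
  signal[1:n_cells, 1:10] <- 2  # group 1 is shifted in the first 10 genes only
  expr <- signal + matrix(rnorm(2 * n_cells * n_genes), nrow = 2 * n_cells)
  d <- as.matrix(dist(expr))    # Euclidean distances between all cells
  within  <- d[1:n_cells, 1:n_cells]
  between <- mean(d[1:n_cells, (n_cells + 1):(2 * n_cells)])
  cat(sprintf("%6d genes: between/within distance ratio = %.2f\n",
              n_genes, between / mean(within[upper.tri(within)])))
}
# The ratio drops toward 1 as noisy dimensions pile up: the two groups become
# nearly indistinguishable by distance even though the biological effect size
# (the shift in those 10 genes) never changed.
```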
The first way to do that is simply feature selection. In single-cell RNA-seq we've assayed tens of thousands of genes. All of those genes have some degree of technical noise, just from random sampling. All of them also contain biological noise: transcription isn't perfect, and there's noise in that process at the biological level. And only some of those genes are actually going to be differentially expressed between our cell types. So we'd like to look at just the genes that are different between our cell types and ignore all the rest that are only noise. So what does an interesting, differentially expressed gene look like? Here are some examples: three immunoglobulin genes and a ribosomal gene. The three immunoglobulin genes are pretty obviously biologically relevant; the ribosomal gene is probably not. So what characteristics do they have? They have high variance: in some cells they're highly expressed, and in other cells they're lowly expressed or absent entirely. They can be correlated with each other: IGHG4 here is very highly correlated with IGHG1, because they're both expressed in the same cells. And they're relevant to the known biology: these cells are B cells, and this is the B-cell receptor, so you know they're relevant to these cells. These are the kinds of genes we want to pull out.

But there are a few problems. If we just look at high variance in single-cell RNA-seq data, just our UMI counts, the mean and the variance are correlated with each other. So this is what you get: you can see the key highly variable genes here in red, the immunoglobulin genes, but the background distribution of genes has an upward trend to it, the mean versus the variance. This means that if I just drew a horizontal threshold saying genes above this amount of variance are highly variable, I could either set it low and pick up genes that aren't biologically relevant, noisy genes that only have high variance because they're highly expressed. Those tend to be things like ribosomal genes: ribosomal genes are very highly expressed and have very high variance, but they're not necessarily biologically relevant. Or I could draw my threshold higher and know I'm missing some of the genes that are biologically relevant.

We could look at the gene-gene correlations. However, that's really slow: to calculate the correlations, we have to compare the expression levels of every pair of genes across all of the cells. That takes time proportional to the number of genes times the number of genes times the number of cells, so 20,000 times 20,000 times 50,000. That's a lot of computation, and no one wants to sit around and wait for it. We could look at the genes with known biology. However, most of the time we don't actually know which genes are relevant to our biology, or we only know a few of them, and that's why we did the whole experiment in the first place.

Out of these three problems, the easiest one to fix is the first. All we have to do is fit a line to the relationship between the mean and the variance and figure out which genes are significantly above that line, and that's what we mostly do for highly variable genes.
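As a minimal sketch of what that mean-variance fitting looks like in practice with Seurat (assuming a QC'd, normalized Seurat object called `seu`; the object name and parameter values are illustrative, not from the lecture):

```r
library(Seurat)

# "vst" fits a trend to the mean-variance relationship, then ranks genes by
# their standardized variance above that trend rather than by raw variance
seu <- FindVariableFeatures(seu, selection.method = "vst", nfeatures = 2000)

# mean vs standardized variance, with the selected genes highlighted in red
p <- VariableFeaturePlot(seu)
LabelPoints(plot = p, points = head(VariableFeatures(seu), 10), repel = TRUE)
```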
However, when we think about biological features of our cells, we generally don't think about each individual gene as a feature. We think about things like the cell cycle stage: are they in G1/S or G2/M? Are they using particular metabolic pathways? Are they involved in some particular type of signaling? Do they have some structure, like a cilium? Are they in some particular state, like senescence? These are all combinations of multiple genes that create some biological feature. And we don't necessarily want to treat each gene as its own thing, because then one of these features, for instance metabolism, will have more weight, because there are more genes involved in metabolism than in something like a cilium. Maybe we want to treat the presence or absence of cilia as equally important as whether the cells use oxidative or non-oxidative metabolism. If we just treat each gene as a feature, metabolism is going to be considered more important because there are more genes. So we'd like to condense these things down into one feature per biological concept. However, this again has some problems, in that we don't know what all the relevant features are for our particular system. Sure, you could go get the GO pathways, but the GO pathways don't cover absolutely every system or pathway out there. We also don't know which genes contribute to each of these features: for some genes we know what they do, but for lots of genes we don't really know.

So for this we're going to use dimensionality reduction. Feature selection is technically a type of dimensionality reduction, but we tend to treat it as separate, because there all we're doing is choosing features that already exist in our data set, whereas dimensionality reduction typically refers to creating new dimensions from combinations of existing features. We can do this in a supervised manner: we can take those GO pathways and turn each pathway into a new feature, which is essentially what GSVA does. But most of the time we instead use unsupervised methods, because our pathway databases aren't that great. In that case we just use the correlations between genes to decide which features to merge together into new features, which is basically what PCA does: genes that are highly correlated with each other get collapsed into a single principal component.

So what does PCA do? There's lots of linear algebra behind it; if you want to learn that, you can go take a linear algebra course, and I'm not going to talk about it here. But intuitively, you have a cloud of data and you want to draw a straight line through it that captures the maximum variability in that data. Once you've got that first principal component, you look at all the possible lines that are at a right angle to it and pick the one that maximizes the remaining variance explained. You just keep doing that, over and over, until you've completely described your data. Or, in the single-cell world, we keep doing that until we get bored and stop caring about additional principal components, because we've got, like, 50 of them and each new one is adding 1% or less of the variance explained, which isn't really important. We've got the key components, and that's all we're going to use.
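A minimal sketch of running that PCA on the variable genes with Seurat (same assumed `seu` object; the number of PCs is an arbitrary choice):

```r
# PCA assumes roughly normal, centred data, so scale the log-normalized values
seu <- ScaleData(seu, features = VariableFeatures(seu))
seu <- RunPCA(seu, features = VariableFeatures(seu), npcs = 50)

# the "get bored" point: look for the elbow where each extra PC
# adds about 1% or less of the variance explained
ElbowPlot(seu, ndims = 50)

# correlated genes collapsed into single components: top loadings per PC
print(seu[["pca"]], dims = 1:3, nfeatures = 10)
```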
So the assumptions here are that each component is linear, a straight line. If you have some sort of exponential curve in your data, maybe the cells suddenly take off and differentiate into a new cell type, that's not going to be captured very well by principal components, because we're assuming a straight line. The components are also orthogonal, which is what I meant by the right angles. But most importantly, PCA assumes your data has normally distributed errors, which means we have to log transform our single-cell RNA-seq data before we can do PCA. This isn't strictly true, because there is a method developed for single-cell RNA-seq that assumes a zero-inflated negative binomial distribution and does PCA, called ZINB-WaVE; I don't remember exactly what all the letters stand for. Mathematically it should be better, but unfortunately no one uses it because it's slow. So that's dimensionality reduction.

Now we'll talk about visualization. Visualization is not dimensionality reduction. In dimensionality reduction we're trying to preserve the distances and information in our data set; in visualization we're trying to make a pretty picture. Lots of people use visualization techniques and call them dimensionality reduction. They are not; they are visualization. And we need to keep in mind, whenever we're doing visualization, that all of these plots are going to be wrong. They capture only a small portion of the information in our data set. But that doesn't mean they're bad: they provide some information that is useful for trying to understand how things work.

So we're going to talk about t-SNE and UMAP. Again, I'm not going to go into the mathematical details; both of them have math underpinning them. But in concept, imagine a box, a three-dimensional structure, and we want to make a two-dimensional plot of that box. With PCA we could draw some straight lines through our box and basically squash it onto the page along those lines, so we'd see something like two faces of the box in our PCA plots. And sure, using those two plots we could, by thinking about it really hard, figure out that this was a box. But we'd have to think about it really hard, which is not ideal. The alternative, which is what UMAP and t-SNE do, is essentially to unfold the box and lay it flat on the page. Now we can see all the sides of our box at the same time, which is nice. But we get some strange things: this edge and this edge up here are touching each other in 3D, but on the plot they're very far apart. Same thing with this edge: this edge and that edge are touching, or very close, in 3D, but on the plot they look very far apart. So it's important to keep that in mind when you're looking at a t-SNE or UMAP: things that are very far apart on the plot may not actually be very far apart.

So here's an example. This is a little toy data set I made, plotted with PCA, UMAP, t-SNE, and diffusion maps. Each of these techniques captures and preserves a different aspect of the data, so all the plots look different. PCA preserves the accurate distances between points, but it doesn't show us that much of the data at once. This is why PCA is dimensionality reduction, and we can use it for things like clustering and trajectory analysis, but it's not that great for actually visualizing our data. If you want to look at the overall structure, UMAP is generally the best for that. If you want to see distinct clusters in your data, t-SNE is generally the best. And if you're looking at trajectories or gradients, that's what diffusion maps were designed for: finding and visualizing trajectories and gradients.
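A sketch of producing several of these views from the same PCA space with Seurat (assuming the `seu` object from before; the dims and perplexity values are illustrative, and diffusion maps would need a separate package such as destiny):

```r
seu <- RunUMAP(seu, dims = 1:30)                   # overall structure
seu <- RunTSNE(seu, dims = 1:30, perplexity = 30)  # distinct clusters

# look at more than one picture before believing any of them
DimPlot(seu, reduction = "pca")
DimPlot(seu, reduction = "umap")
DimPlot(seu, reduction = "tsne")
```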
So, looking at these four plots, what do you think the data really look like? How did I generate these data? Take a few minutes to think about that. I hope you've all got a mental picture of what you think this data actually looks like. This is how I generated it: red and blue are parallel to each other, and purple comes off of red in an independent third direction relative to red and blue. How many of you thought that was the case? I see some hmms and ahs. So you see, these visualizations are helpful, but it's hard to figure out exactly what's going on. And this is just three dimensions; you're going to be looking at single-cell maps with dozens of clusters in very high dimensions. So you've got to be a bit careful and remember that these are limited visualizations. They're helpful, but they're not telling the whole story, because it's really hard to understand huge, high-dimensional data sets. We do the best we can.

So here's a UMAP of some real data. Think about how you would interpret this UMAP, particularly the two clusters I've got the black arrow next to. Would you say those two clusters are similar to each other or not?

[Audience response, partially inaudible: they're not similar, for a technical reason; the plotting order of the colors matters, so the brownish cluster drawn first can be hidden by the blue one layered on top.]

Yeah, that's a good point, that the points get layered on top of one another. The ordering of the clusters here, by the way, since this is done with Seurat, is always from largest to smallest: cluster 0 is always the biggest cluster, and cluster 22 here is the smallest.

So here's a t-SNE of the same data. Now would you say these two clusters are similar or not? Right, here it's super easy to see that they're not. And compare these two plots of a real data set: look at the big cluster. Over here you might say that one's super different from everything else, but on the t-SNE it doesn't look super different from everything else at all. So when you're looking at visualizations, I generally recommend looking at lots of them. Don't just trust one visualization to interpret your data: use different tools, use different parameters. You'd still get these two clusters: over here they look super similar to each other; over here, not so much, they look fairly different.

So here's another data set, plotted twice using UMAP, but with one change. UMAP is based on a KNN graph, and the KNN graph you create is used by UMAP to lay out your data. So here I changed the k of my KNN graph before running my UMAP. As you can see, on the left these two clusters are clearly distinct from each other and look super far apart, but on the right they're connected to each other; they're part of the same developmental trajectory.
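A sketch of that neighborhood-size effect (again assuming `seu`; the two k values are arbitrary). In Seurat's RunUMAP the equivalent knob is n.neighbors: small values emphasize local structure and can split a trajectory apart, while larger values can join it back up.

```r
seu <- RunUMAP(seu, dims = 1:30, n.neighbors = 5,
               reduction.name = "umap_k5",  reduction.key = "UMAPk5_")
seu <- RunUMAP(seu, dims = 1:30, n.neighbors = 50,
               reduction.name = "umap_k50", reduction.key = "UMAPk50_")

# same data, same method, different k: compare before interpreting
DimPlot(seu, reduction = "umap_k5")
DimPlot(seu, reduction = "umap_k50")
```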
So to summarize: single-cell RNA-seq data is high dimensional and noisy. We can use feature selection to identify the genes that show biological differences, not just noise; here we use highly variable genes. Other options are available, but basically no one uses them. We always do dimensionality reduction to condense multiple genes into single features, almost always with principal component analysis, although technically it's SVD, singular value decomposition, because we can approximate that quickly, but it's essentially the same thing. And you can use t-SNE and UMAP to visualize the data, but don't treat them as dimensionality reduction. Never do clustering on your t-SNE or UMAP components. Lots of people do it; it's bad. So if you're doing clustering using the t-SNE or UMAP x and y coordinates of your cells, don't do that. Use the PCA, which is what Seurat does; it's good on that front. But I still see papers where people do clustering on the t-SNE or UMAP directly, which is bad. Don't do that.
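For completeness, a minimal sketch of the right way round in Seurat (assuming the same `seu` object; the dims and resolution values are illustrative): build the graph and the clusters on the PCA space, and use the UMAP only to look at the result.

```r
seu <- FindNeighbors(seu, reduction = "pca", dims = 1:30)  # KNN/SNN graph on PCs
seu <- FindClusters(seu, resolution = 0.8)                 # graph-based clustering
DimPlot(seu, reduction = "umap", label = TRUE)             # visualization only
```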