Okay, hopefully everyone's back. I want to get started on time because this is the longest and most complicated lecture. I'm sure there'll be many questions. I may skip some questions so that we can all get lunch on time and you're not all starving, but there'll be plenty of time for questions at the start or the end of the lunch break. So we will get to everyone's questions about everything, but this is a long lecture. Okay, so now we're going to do biological analysis, which is all the awesome stuff. We're going to answer our biological question. So here we need to keep in mind: what is your question, and what data do you need in order to answer it? Here are some example questions. You could ask something like: what are the progenitor states that exist between hematopoietic stem cells and differentiated blood cell types? So for that you need to get some bone marrow samples. You only have the normal condition, so you only need a bunch of samples for that. Whereas if you're asking something like the effect of Huntington gene variants on neuronal identity and function, you're going to have a case-control design. You've got to have controls, you've got to have cases, and you want some sort of sampling method that will capture neurons. So you don't want to do single-cell RNA-seq — single-cell RNA-seq won't capture neurons because they're all tangled up. You're going to want to do single-nucleus RNA-seq. You can also ask things like how many cell types there are in some particular region of a tissue, or look at communication between different cell types, or at what transcription factor regulatory networks are there in your single-cell RNA-seq. We're not going to cover cell-cell communication or transcription factor regulatory networks, mainly because there's no good way of knowing if any of those tools work. There are lots of tools to do it, and they will give you an answer.
I cannot tell you whether that answer is true or not. So we're not going to cover those, but feel free to explore them. Okay, going back to our classroom workflow. We did that yesterday — we just did feature selection and dimensionality reduction. So now we're going to cover how we define k-nearest neighbor networks and do clustering, and then we're going to do differential expression. We're also not going to cover trajectory inference. There are lots of tools to do that — over 42 of them as of the last benchmarking paper. But out of those 42, PCA ranked number three or number five, depending on what metric you used, and it was the most consistent performer across different types of data. So whether you had a circular trajectory, a forking trajectory, or a straight-line trajectory, PCA would work pretty well, whereas each of the other methods could outperform it on one of those cases but not the others. So generally I recommend just using PCA for trajectory analysis, but there are other options available. We're just not going to cover those because you have limited time and they're not that great. So for clustering, we use graph clustering, which was not actually invented for single-cell RNA-seq — it was invented for social networks. This was invented to analyze Facebook data, basically. But when single-cell RNA-seq came around, it was like: oh, we've got tons of data now, we need a clustering method that's good for big datasets — what's available? Oh, that thing. It works great for huge datasets. Let's use that. So for that, we need a graph. A graph is made of vertices, which are things — in our case, they're cells — and edges, which are relationships between them. An edge doesn't really have a deep interpretation in single-cell RNA-seq; it's basically just a measure of how similar those cells are. If they're fairly similar, we have an edge; if they're not, we don't have an edge. So how do we turn our single-cell RNA-seq data into a graph?
So we use k-nearest neighbor networks. I've already been talking about this a little bit. Here we just take each cell and connect it to its k nearest neighbors. In this case k is three, and each of these dots is a cell. So we take a cell and connect it to its three nearest neighbors, take another cell and connect it to its three nearest neighbors, and keep doing that for all of the cells, and we get something that looks like this. So hopefully you can immediately see we've got some problems here. This guy in yellow and this guy in yellow are outliers. They shouldn't be connected to any of the other cells, but they are, because by definition, with k equal to three, they will be connected to three other cells. You can also see this small cluster of only three cells: because each of them has to have three neighbors, one of those neighbors is going to be outside of that cluster, so it's going to get stuck to a second cluster using k-nearest neighbors. The solution to this is a related method called shared nearest neighbors. Here we take that k-nearest neighbor network, and for each edge we ask: for you and your neighbor, how many of your neighbors are shared? For this cell and this neighbor, both of these two cells are neighbors of both of them, so the edge gets a weight of two. Same for this one — it has two shared neighbors, so it gets a weight of two. The yellow cell there and this cell have two shared nearest neighbors, these two other red cells, so this line here will have a weight of two. Same over here: the yellow guy down here and one of these red ones get a weight of two because they've got two shared neighbors. But this yellow guy at the far end — he goes to this neighbor, these are its neighbors, and none of those are shared with the yellow cell, so that edge disappears. And the same for this one up here:
No shared neighbors, so that edge will get removed. Yeah — each cell here was connected to its three nearest neighbors, but other cells could connect to it as well; that's how you can end up with more. So this cell, for instance, will be connected to its three neighbors: that one, that one, and that one. But this cell's nearest neighbor is also that one. So this one connects to that one, even though that one doesn't count this one among its own three nearest neighbors — but because we don't care about the direction, they end up connected. So we define the three nearest neighbors for each cell, but the relationship doesn't have to be mutual: this cell can be the nearest neighbor of that cell from that cell's perspective, while from this cell's perspective it's not among its three nearest neighbors. So you can end up with more than three edges per cell. Okay, so we've got this graph, and we need to define clusters in it. The method we use is called Louvain — or Leiden, which is a variation of Louvain — and it's based on the modularity score. This is the modularity score Q: roughly, Q = (1/2m) Σᵢⱼ [Aᵢⱼ − kᵢkⱼ/2m] δ(cᵢ, cⱼ). It's a big scary equation, so I'll try to simplify it. First, the delta of cᵢ and cⱼ: that is one if these two cells are in the same cluster, or zero if they are not. So we're only going to sum up the score across every pair of cells that are in the same cluster; pairs in different clusters get a score of zero. We then calculate Aᵢⱼ, which is the number of edges — or the weight of the edge — between a pair of cells in the same cluster, and we subtract from that the expected number of edges between those two cells if we had completely randomized our network. The 2m here: m is the number of edges, so 2m is the number of ends of those edges — the total number of edge endpoints across all of our edges. And kᵢ is the number of edges involving one of these cells.
So kᵢ and kⱼ: kᵢ divided by 2m is basically the fraction of all the edge endpoints that cell i is involved in, and kⱼ over 2m is the fraction that cell j is involved in. So together this term is the expected number of edges between cell i and cell j. Hopefully everyone on this side is following that as well. And then we just have a normalization factor of 1/2m out front. So this is defining clusters as a connected set of points that have a higher density of edges within the group than we'd expect to be there by chance. So if we had a network like this, and these in green and purple were our clusters, we'd get a modularity score of 0.4. That's bigger than zero, so that's good — these are good clusters. Whereas if we clustered it like this, we'd get a modularity score of negative 0.1. That's fewer edges than expected by chance, so these are bad clusters. Okay, so how does Louvain work? Louvain is just going to optimize a set of clusters to get the highest modularity score possible, and it does this in a fast way so that it can be applied to large networks. Basically, it's greedy: it takes steps iteratively, making the best choice it can at each step. That makes it fairly fast, because it gets to an answer quickly, but it's also stochastic — it randomly picks the order of the steps it takes. All right, so what does it actually do? You randomly choose a node in your graph — a cell — and you start with every cell in a different cluster. You then ask, for that cell you randomly picked: if I merge you into the cluster that one of your neighbors is in, does my modularity score go up? If the answer is yes, we merge that cell into that cluster. And we just keep doing that over and over again until our clusters don't change. So here I've picked this one random cell.
I asked which of its neighbors' clusters would increase my modularity score — oh no, my animations aren't working. So that one would turn green, that one would turn purple, that one would turn blue, that one would turn purple, that one would turn green, that one would turn purple, this one would then turn green, and then that one would turn green. And then we'd be done, and we'd have our nice clusters again. However, you'll notice that when I first presented this equation, there were no parameters in it. But if you've used Seurat before, you'll know it has a parameter called the resolution. So I slightly lied. How Louvain actually works is that it has this other parameter in there called gamma — you can see it there in red — and that is what the resolution parameter is. So when you run Seurat and you set your resolution parameter, that is what you are setting. If you make it really high, above one, you need more edges than you'd expect by chance to get a good cluster. If you set it lower, you're allowing clusters with fewer edges than you'd expect by chance to still count as a good cluster. Okay, so high means many small clusters that are super dense with edges; low means large, dispersed clusters that are less dense. You also have the parameter k from how you designed your network. If you have a high k, you have lots of neighbors and you'll get large clusters; if you have a low k, you'll have few neighbors and you'll get small clusters. It's not intuitive to change in Seurat — it's just extra steps to do that. Okay, so there's a question: we've been talking a few times in the course about rare cell types, and rare cell types are hard to find. We've covered the whole pipeline here from raw data to clusters — where do they get lost? So basically you can rely on the default, which is k = 20, and then you can play around with it. No one's really bothered to try to figure out what the right k is.
I think the Seurat people did some sort of benchmarking when they designed it, but it was like a footnote in supplementary figure 1,000 or something. Mostly, 20 seems to work — so people use 20, based on the datasets we were analyzing five to seven years ago. So if you're using a much bigger dataset than we had back then, you may want to increase it. [Audience:] The question was where you might lose the rare cell types — I was thinking in the feature selection stage. The logic was that with highly variable genes, rare cell types have marker genes that would have very small variance, because they're only expressed in the rare cells themselves. [Instructor:] Yes — with highly variable gene feature selection, the markers of small, rare cell types would have high variance within those cells, but in terms of total variance in the dataset it would be small, because they're only expressed in a few cells. Yep. And with your k: if you have a high k, those rare cell types are going to get merged into another cluster. So those are the two main places. You can also lose them at the dimensionality reduction stage if you pick too few dimensions, because rare cell types tend to show up in the higher-numbered dimensions. And you can lose them in your clustering: if you pick a really low resolution, you will lose your rare cell types. So, suppose we've now run our clustering algorithm and we've got a set of clusters. Do we believe our clusters are real cell types? You can run Louvain clustering on absolutely anything, and it will always give you a bunch of clusters. It's almost impossible to get it to give you only one cluster and say there's nothing there. Theoretically it's possible — the authors say in their paper that the algorithm can tell you there are no clusters in your dataset by giving you just one cluster — but in practice it is almost impossible to get it to do that.
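To make the ideas so far concrete — the kNN graph, shared-nearest-neighbor weights, and the modularity score with its resolution parameter — here's a minimal pure-Python sketch. This is my own toy illustration, not Seurat's or Louvain's actual implementation, and it uses the equivalent per-cluster form of Q rather than the pairwise sum:

```python
import math

def knn_graph(points, k):
    """Connect each point to its k nearest neighbors.
    Edges are undirected, so a point can end up with more than k edges
    when other points pick it as one of their neighbors."""
    edges = set()
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        for _, j in dists[:k]:
            edges.add(frozenset((i, j)))
    return edges

def snn_weights(edges, n):
    """Shared-nearest-neighbor weight: for each edge, count how many
    neighbors the two endpoints have in common (weight 0 drops the edge)."""
    nbrs = [set() for _ in range(n)]
    for e in edges:
        i, j = tuple(e)
        nbrs[i].add(j)
        nbrs[j].add(i)
    return {e: len(nbrs[min(e)] & nbrs[max(e)]) for e in edges}

def modularity(edges, clusters, gamma=1.0):
    """Modularity Q with resolution gamma, in per-cluster form:
    Q = sum_c [ e_c/m - gamma * (d_c/2m)^2 ], where e_c = edges inside
    cluster c and d_c = total degree in c. This is equivalent to the
    pairwise sum of A_ij - gamma*k_i*k_j/2m over same-cluster pairs."""
    m = len(edges)
    internal, deg = {}, {}
    for e in edges:
        i, j = tuple(e)
        deg[clusters[i]] = deg.get(clusters[i], 0) + 1
        deg[clusters[j]] = deg.get(clusters[j], 0) + 1
        if clusters[i] == clusters[j]:
            internal[clusters[i]] = internal.get(clusters[i], 0) + 1
    return sum(internal.get(c, 0) / m - gamma * (deg.get(c, 0) / (2 * m)) ** 2
               for c in set(clusters.values()))
```

On two tight blobs of three cells each, putting each blob in its own cluster gives Q = 0.5, while a mixed-up assignment goes negative — exactly the good-versus-bad clustering contrast from the slides. Raising gamma shrinks Q for the same partition, which is why a high resolution demands denser clusters.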
So you will always get a bunch of clusters. Do we believe those clusters are correct? Here's some data — how many clusters do you think are in this data? I'll give you a few moments to think about that. So hopefully everyone's back and settled, and that was not too disruptive, because we're going to try to keep going and get this done before lunch. Okay, so how many clusters do we see here? I heard some fives. Anyone disagree? Three. Other numbers out there? Five. Four — okay, three fours. So these are my guesses for what you would come up with. How many thought of the first one here on the left? Yeah, a few. How many think the middle one? No one. How many think the one on the right? Most people. The middle one was actually correct. So hopefully you can see from this that figuring out how many clusters are in your dataset is really hard. Even as people we have a hard time of it; explaining how to do this to a computer is even more challenging. So here are some of the ways we can try to figure out if we've got the right number of clusters and the right clusters. You can look at whether your clusters are robust to your clustering parameters: if we change the resolution a little bit, do we get the same clusters? If we change the k a little bit, do we get the same clusters? If yes, they're probably good clusters and they represent some sort of biology. You can look at the marker genes: do we have a whole bunch of significant marker genes for our different clusters? If yes, they probably represent biology; if no, they probably don't. We can look at known marker genes: okay, I know this is a marker of neutrophils — do I have a cluster that's exclusively expressing that marker, yes or no? We can use quality statistics like the silhouette index that basically measure how dense my cluster is compared to other clusters — are they well defined and distinct from each other?
Good scores suggest good clusters; low scores don't. You can look at consistency across experimental replicates: if you have five samples, you can cluster each one separately — do you get the same clusters? Or you can compare against a reference dataset: do the clusters I found match clusters in a reference dataset for the same tissue? You can use tools like scmap for that. If you have some sort of spatial data available, either from publicly available data or as part of your experiment because you have lots of money to do spatial transcriptomics alongside single-cell, you can look to see whether there's spatial structure for your different clusters, and if there is, that's a good validation. Or you can do some sort of experimental validation and spend hours and hours in the lab, and tons of money, trying to figure out if your clusters are functionally different from each other. Note that "they look good on my UMAP" is not a metric on this list. The reason is that the visualization tool packaged with your clustering method will be based on the exact same assumptions as your clustering method, so they will always agree with each other. Your clusters will always look good in your UMAP, regardless of whether those clusters are good or not. So I'll cover a couple of statistics here you can use. One is the silhouette index. Here we've got a blue cluster and a yellow cluster. C(i) is the cluster that cell i is a member of, and d(i, j) is the distance from point i to point j. The silhouette index takes each cell and asks: what is the average distance from you to the other cells in your cluster, and what is the average distance between you and all the cells of the next closest cluster to yours? Then it takes the difference between those, divided by the maximum of the two. So we get a score between negative one and one, where one means our clusters are completely distinct from each other, and negative one means they're completely mixed together.
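The definition just described can be written out in a few lines. This is a toy sketch for illustration (real analyses would use a library implementation), and it assumes every cluster has at least two cells:

```python
import math

def silhouette_scores(points, labels):
    """Per-cell silhouette: s(i) = (b_i - a_i) / max(a_i, b_i), where
    a_i is the mean distance from cell i to the other cells in its own
    cluster, and b_i is the mean distance to the nearest other cluster.
    Assumes every cluster contains at least two cells."""
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in set(labels):
            ds = [math.dist(p, q) for j, q in enumerate(points)
                  if labels[j] == c and j != i]
            if ds:
                mean_dist[c] = sum(ds) / len(ds)
        a = mean_dist[labels[i]]                                    # own cluster
        b = min(d for c, d in mean_dist.items() if c != labels[i])  # next closest
        scores.append((b - a) / max(a, b))
    return scores
```

On two well-separated blobs, labeling each blob as its own cluster gives scores near one for every cell, while a scrambled labeling produces lower and even negative scores — the negative scores flag cells sitting closer to another cluster than to their own.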
So high scores are good. We get a score for each cell, which we can aggregate at different levels. So here's a bunch of clusters, and here's the score for each cell in those clusters — you can see all the scores are pretty high, so these are well-defined clusters. Here I've changed the clustering resolution and got an extra cluster, and now you can see the silhouette scores for these two clusters — the orange and green down here — have gone down, and we're starting to see some negative scores. Yeah, there is no threshold. They're basically compared between different clusterings — one clustering might have a higher score than another, and whichever clustering gives you the highest score wins. You can also use it in other ways. Here we could see these two clusters have lower scores, so maybe when we come to annotation we want to give them the same cell type and merge them, or we can change our clustering resolution to merge them. It's also important to consider that there might be many good resolutions for your clustering. If we think about some sort of tissue, you might be able to say: I've got a bunch of T cells, I've got some endothelial cells, I've got some stem cells. And that's a good clustering resolution — those are well-defined cell types. But then you might also say: if I increase the resolution, now I've got T helper cells, T regulatory cells, and so on, and a bunch of different types of endothelial cells. And that might also be a good clustering resolution, because those are well-defined cell types as well. So there can be multiple good resolutions for any particular dataset — explore. And hand in hand with your clustering goes annotation, because you want whatever clusters you settle on to represent cell types. So you can go back and forth: I've got a bunch of clusters — can I annotate those clusters as different cell types?
If I can't figure out the cell type of a cluster, should I change my clustering resolution and have it merged with a different cluster, because I think they're the same cell type? Oh, I should mention — this was a paper we did with Delaware, one of your TAs, so they get credit for this as well. There are lots of ways to do annotation; this is what we came up with as a good, efficient pipeline. You start with your unclustered data, you take a reference dataset, and you do some sort of automatic annotation, annotating what cell types you can based on that reference. So on one side the unannotated data, on the other a reference, and you get some sort of automatic annotation. Some of the clusters won't get annotated well — you don't know what they are. Maybe they get some sort of mixed annotation, like half T cell, half endothelial cell, with really low scores. Whereas some of your others get really good annotations. And quite often you might have lots of clusters getting the same annotation — you might get four different clusters that all get annotated as T cells. [Audience:] Can you use single-cell data to annotate single-nucleus data? Yes — you can use single-cell to annotate single-nucleus, depending on which automatic annotation tool you use. Some of them will work well across data types, particularly ones based on marker genes, whereas ones based on correlations between cells won't work as well between single-cell and single-nucleus. You can also use multiple different automatic annotation tools and make sure they agree with each other. And you should make sure whatever reference dataset you're using is appropriate for your data. If you're looking at liver, don't use a brain reference dataset — that's not going to give you good results. That's particularly important to keep in mind if you use one of the tools that has a built-in database of cell types.
SingleR, for example, has a whole bunch of built-in cell types it can annotate, but it might not have the cell types for your tissue. So for liver tissue, it could annotate hepatocytes and T cells, but cell types that didn't exist in its reference it couldn't annotate at all. So make sure your cell types exist in the reference. The other approach is based on marker genes: you use your reference annotation, find the marker genes for the clusters in your reference, and then look at the expression of those markers in your new dataset. How well this works will depend on the reference dataset — some of them are really well annotated down to deep cell subtypes. My guess, though, is that if you're looking at one very particular cell type, the best you'll be able to get out of automatic annotation is something pretty generic, like "glutamatergic neuron". That's an okay cell type, but it's not going to be exactly your specific cell type, so you'll have to do a subsequent step of manual annotation. [On cross-species references:] It depends what resolution of cell type you're looking at, obviously. For things like immune cells versus endothelial cells, you can absolutely use mouse data. If you're looking at specific subtypes of T cells, probably not. So yeah, it depends on the resolution you're looking at. Can you combine more than one reference? Usually, if you want to use multiple references, it's better to run the annotation on each one separately rather than combining the references and running on the merged version — run on multiple different references and see if they agree. Any other questions? Yeah — yes, that is another option: you can take your dataset and the reference dataset, integrate them together, and transfer the labels that way, directly comparing them. I don't know if it's any better.
It just makes a prettier plot for your paper, because you can see the cells sitting on top of each other, rather than a heatmap with the correlations between them. The annotation tools, as you'll see in the last section today, aren't necessarily as great as we want them to be. And integrating them together has the same problem: you have to get a good integration, which can take three or four tries with different methods. So once you've automatically annotated your cells, you're going to do manual annotation, both to check that your automatic annotation worked — because sometimes it does silly things, depending on the method. The methods will always assign a cell type, even if the cells don't look like anything in the reference; they'll give their best guess, and that best guess is often wrong. So you want to double-check that these are right, usually using marker genes from the literature or from one of the various databases that exist now. You can also take each of these clusters, find marker genes for them, and do something like pathway analysis to confirm their identity as well. And manual annotation is typically required for annotating subtypes. If you're looking at different types of T cells, very few automatic annotation tools will be able to tell those apart; you'll almost certainly have to do that manually using marker genes, because there are very few gene sets that define them. And then finally, do some sort of validation or verification. If you want to say in your paper that you discovered a new cell type, you have to prove that it's real now. You didn't used to have to prove it was real, but fortunately reviewers have smartened up and said: actually, no — can you prove that's real and not just some computational artifact you found? Right. So when you're using marker genes, it's also
important to take into consideration that just because someone says in a paper that a gene is a marker for a cell type doesn't necessarily mean it's a good marker gene to use. If you look in the databases or in papers, you'll find that ITGAM and FCGR3B are both always listed as marker genes for neutrophils. And you can see here, one of them is a great marker for neutrophils; the other one, not so much. So you have to consider sensitivity. Some marker genes you might not be able to capture with your single-cell RNA-seq — something like CD4 is notoriously hard to capture with single-cell RNA-seq, even though in every paper it's a key marker for its cell types. So sure, we'd love to be able to use CD4; it might not actually work. It's important to consider how high its expression is: lots of cell type markers are transcription factors that define a cell lineage, and those tend to be pretty lowly expressed, so it's going to be hard to detect them. Then you need to consider specificity, as on the right: is it actually expressed only in that cell type, or is it one of many cell types that express that gene? And look at the log2 fold change if you can: is it really differentially expressed, or is it only slightly more highly expressed in one cell type versus the other? You also want to make sure you're using enough markers. Single-cell RNA-seq is noisy, so it's probably not good to rely on a single gene to annotate your cell types; try to find multiple markers that agree on a particular annotation for your cells. And finally, what if you have multiple samples? There are essentially two different approaches here. One is that you merge your samples first, do all your joint analysis of clustering and everything together, get a set of clusters, and annotate them once at the end with your merged data.
The other option is that you take each sample, do the full pipeline, get your clusters, and then compare those clusters across your different samples. So you can do the analysis separately and compare, or jointly in one analysis. The obvious advantage of integrating and then doing a joint analysis is that you only have to do the pipeline once and annotate once — you'll have far fewer clusters to annotate, so it saves you a ton of work. The other side is that you usually have to use some kind of integration, and as mentioned, integration is a bit tricky, because the methods don't necessarily work that well, depending on the particular data you're working with. So you might actually have many more steps to get a good integration. And some integration tools can actually cause you to lose certain cell types and certain clusters if you analyze them jointly. So depending on the size of my dataset: if I only have a small number of samples, I will analyze each of them separately first — or at least each biological condition separately first — and then merge together my different biological conditions and analyze them a second time. Any other questions about annotation? Yeah. So, merging: merging just puts your data together into a single object. It doesn't actually make the samples similar in any way. So you always have to merge first. Ideally, when you merge your samples together, they all line up on top of each other perfectly in the UMAP and you're done — you don't have to do anything else. Sometimes this happens, sometimes not. If they don't line up, then you can do a second step of integrating them, which applies some sort of computational process to make your two samples more similar to each other, either at the gene expression level or in the lower-dimensional space. That's a great question — do you want them to be the same? If not, maybe you don't want to do integration. Merging won't make them the same.
It'll just put them on the same UMAP, so you'll still be able to see them as separate things on the same UMAP. No, integration won't increase the number of populations you find — all integration does is take two samples that look different from each other and make them look more similar to each other. And yes, you have all the cells: just merging them, you'll have all of the cells from all of your samples in one object. [An audience member asks, partly inaudibly, about analyzing one sample separately and running Harmony on it because it may be a different condition.] So if you analyze them separately, all you're really doing is adding metadata to your SingleCellExperiment object or your Seurat object with the results of your analysis. You can do multiple rounds of analysis — you can analyze your data 20 different ways if you want, and then look at whether the different analyses give you the same answer or different answers. [Audience:] If I'm comparing two samples that are biologically different, and I want to discover which biological processes differ between them, would you not recommend the type of integration you were talking about before? It depends — it depends how you integrate the dataset, or whether you integrate it at all. Sometimes you can integrate and still see cell types that are unique to the different biological conditions; sometimes you can't. We're going to cover different types of integration in the next section. Lastly, we've got these beautiful clusters, and now we want to use differential expression to answer our biological question, because everything in biology comes back to differentially expressed genes between A and B. So the first type of differential expression we do is finding marker genes for our clusters.
Here we're asking what genes are unique to a particular cluster, or what genes are different between cluster one and cluster two. To understand why the tools work the way they do, you have to think about what we care about when we're asking this question. Do we care about the sensitivity of our test — about finding every single gene that's differentially expressed between our different clusters? Usually no, because usually we're just trying to use these genes for annotating our clusters, so we only care about something like the top 50 genes. We don't actually care if we find every single differentially expressed gene. What we do care about is the specificity: we want only genes that really are very different between our clusters. We often also don't really care about the significance of the difference — we're just taking the top 50 differentially expressed genes, and those are going to be super significant basically all the time, so we don't care much about significance. We do care about the effect size: we want the genes to have a big fold change between clusters. And often we care about how long it takes to calculate, because we may be doing this for many clusters, in many samples, over many different versions of our clustering to figure out which clustering we like. So we want something fairly quick. This is what the test that Seurat uses, with its default parameters, is designed for. If you use the default parameters for finding markers in Seurat, do not talk about the P values — those P values are meaningless. If you want the P values to matter, you will have to change the parameters to make those P values meaningful. But if all you care about is the top 10 marker genes, it works great.
If you actually care about having accurate p-values, there's only one method that actually calculates them properly, because there is a bias introduced by the fact that we've done clustering. Imagine we have a single distribution of one gene's expression across all of our cells, and we cluster the cells based on that gene's expression. We would cut this distribution in half: all the cells on the right would be in one cluster, all the cells on the left would be in the other cluster, and this gene would come out significantly differentially expressed, even though it's a single normal distribution across all of our data. By clustering, we've created a differentially expressed gene between our clusters. This means that if you just do a standard differential expression test on clusters, your p-values will be skewed to be more significant than they really are. And here's what you can see: with a t-test for differential expression between clusters, almost everything is super significant, with the observed p-values piled up at the left of the x-axis, whereas the expected distribution under the null is uniform. So they're all going to be biased. If you want to do it properly, see this paper and use their test that corrects for this. It's going to take way longer to calculate, so almost no one uses it. But if this is important to what you want to say in your paper, use it; for just annotating clusters, you don't really need it. And Seurat actually pre-filters genes by log2 fold change by default before it tests for markers, which will make this even more skewed and more biased. So ignore the p-values and just use the rankings from Seurat. Then there's the question of what kind of test to use, and there's a whole bunch of different tests that people have developed.
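Here's a quick simulation, my own illustration rather than anything from the lecture slides, of that double-dipping problem: one gene drawn from a single normal distribution, "clustered" by splitting at the median, then t-tested between the two halves:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One gene, one homogeneous population: a single normal distribution.
expr = rng.normal(loc=5.0, scale=1.0, size=1000)

# "Cluster" the cells on that same gene by cutting the distribution
# in half at the median -- the double dipping described above.
left = expr[expr < np.median(expr)]
right = expr[expr >= np.median(expr)]

# A standard t-test between the two "clusters" is wildly significant,
# even though no real difference exists in the population.
t_stat, p_value = stats.ttest_ind(left, right)
print(p_value)
```

The p-value comes out astronomically small despite there being only one population, which is exactly why post-clustering p-values can't be taken at face value.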
Now pretty much everyone just uses the Wilcoxon rank-sum test, which is circled there in red, because out of all of these tests, most of them worked reasonably well and about evenly. Except you can see at the left there are a few that didn't work very well at all, and those are ones the Seurat authors developed deliberately for finding marker genes in single-cell RNA-seq data. This slide doesn't have the date on it, but about five years ago a paper was published saying, hey, actually that's the worst possible test, even the stupidest, most simple test does better. And the Seurat authors said, oh, okay, we'll go back to just using the Wilcoxon test. So now Seurat uses the Wilcoxon test. The Wilcoxon test is independent of the distribution of the data values. It doesn't assume a negative binomial distribution, it doesn't assume a normal distribution, it works on anything, because it just ranks the data and asks: are the average ranks of group one higher than the average ranks of group two? This is great because it can be applied to batch-corrected, scaled, normalized, regressed data, anything at all. Any kind of data anywhere in the world, you can use a Wilcoxon test on it. But it doesn't account for confounding factors or anything else; you have to account for that yourself. It just says whether group A is bigger than group B, for any data. If you want to do case-control differential expression, that's much fancier and more complicated. So here we've got a question like: what is the effect of a gene mutation on microglia? We'll have some sort of experimental design; here I've just got a bunch of mice, three mice per biological condition: wild type, single knockout, and double knockout. This is about the maximally complex experiment most people in Canada will be able to afford to do. This is one we actually did with collaborators, and it was $60,000 for this experiment, so I doubt you'll do anything more complex than this.
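As a sketch of why the rank-sum test is distribution-free, here's a toy example of my own using SciPy's `mannwhitneyu` (the Mann-Whitney U / Wilcoxon rank-sum test). Note how a monotonic transform like log-normalization leaves the result untouched, because the ranks don't change:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Raw counts for one gene in two clusters (negative binomial, like UMIs).
group1 = rng.negative_binomial(n=2, p=0.2, size=200)  # mean around 8
group2 = rng.negative_binomial(n=2, p=0.5, size=200)  # mean around 2

# Wilcoxon rank-sum: compares ranks only, so it makes no assumption
# about the distribution of the values themselves.
u, p = stats.mannwhitneyu(group1, group2, alternative="two-sided")

# A strictly monotonic transform (e.g. log-normalization) doesn't change
# the ranks, so the test statistic and p-value are identical.
u_log, p_log = stats.mannwhitneyu(np.log1p(group1), np.log1p(group2),
                                  alternative="two-sided")
print(p, p == p_log)
```

That invariance is why the same test can be run on counts, log-normalized data, or batch-corrected values without changing its assumptions.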
So we get our data: for each of these mice we collect a whole bunch of cells, several thousand cells, and then we want to do differential expression comparing these different groups. How many biological replicates are there for my wild type versus my APP knockout comparison in this experiment? Let me give you a moment to think about that. Anyone want to hazard a guess at how many biological replicates we've got? Yes. So the number of cells we sequence doesn't matter at all; what matters is how many mice we did. So we have three per group for our comparison. If we took all of our single-cell, or in this case single-nucleus, RNA-seq data and did differential expression treating each cell as a biological replicate, that would be wrong. We would get everything coming out significant, when in truth we'll probably be able to detect very few things, because our power is pretty low with three biological replicates. The easiest way to do this, and there are fancier ways that we're not going to cover, is what's called pseudobulk, where we take the expression in each cell type and aggregate it per biological replicate. So we take the average: for our question about microglia, we take all the microglia from mouse one and average expression across all of those microglia, so we have one replicate for mouse one, and we do the same thing for mouse two and mouse three. We get the average expression in a particular cell type for each biological replicate. Now we can treat this like bulk RNA-seq with three replicates, and we can choose any model or tool we'd use for bulk RNA-seq and use it on our single-cell RNA-seq data for our case-control comparison. These are all based on the generalized linear model, the GLM. So I'm going to try to explain this; "generalized linear model" sounds scary, so we're going to start by thinking about it as linear regression.
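A minimal pseudobulk sketch, my own toy example with made-up Poisson counts and mouse labels: the whole idea is just a group-by over mice, so each mouse contributes exactly one row:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: 300 microglia cells x 4 genes, pooled from 6 mice
# (3 wild-type, 3 knockout); the labels here are invented.
genes = ["Gene1", "Gene2", "Gene3", "Gene4"]
counts = pd.DataFrame(rng.poisson(5, size=(300, 4)), columns=genes)
counts["mouse"] = rng.choice(["WT1", "WT2", "WT3", "KO1", "KO2", "KO3"],
                             size=300)

# Pseudobulk: one row per mouse, averaging expression across that mouse's
# microglia (summing raw counts is also common for count-based GLM tools).
pseudobulk = counts.groupby("mouse")[genes].mean()
print(pseudobulk.shape)  # (6, 4): six biological replicates, four genes
```

Those six rows, not the 300 cells, are what go into the downstream bulk-style differential expression test.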
So hopefully everyone has done linear regression. Say we had length of a repeat in a gene versus cognitive decline. You do linear regression, you get a straight line, y = mx + b, and we fit our slope and our intercept. For these purposes we're going to change our notation a little bit, and call the intercept b0 and the slope b1. Exact same model, just linear regression. These b's are called coefficients, and they're what you see in the results of your test. Here x and y are continuous. So what if x is actually discrete? Pathogenic or not, mutant or wild type. Here we've got discrete groups for x, but to use linear regression, x has to be numbers. So where I've got "no" or "yes" for pathogenic, I can just change no to 0 and yes to 1. Now our x is numeric, and I can just do linear regression. I've got a slope, and that slope is simply the difference between not pathogenic and pathogenic. So I can do the same thing with, say, APP mutant versus wild type coded as 0 and 1. Okay, but I have two different mutants, APP and ApoE, each wild type or knockout. So imagine this in three dimensions now: APP on one axis, ApoE on a different axis, and my expression on the third axis. Now my y = b0 + b1*x just gets an extra term, b2*z, for my second predictor. B1 is still APP mutant versus wild type, that slope, in both of my groups. And b2 is now the slope with respect to ApoE knockout, the slope going in that direction. And my b0 is still my expression in the wild type for both genes. Just one more step and you've done a generalized linear model. It's not so bad.
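To see the 0/1 coding trick numerically, here's a small sketch of my own with made-up expression values: fitting y = b0 + b1*x by least squares when x is a dummy-coded group label recovers the group means exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# 30 wild-type samples coded x=0, 30 mutant samples coded x=1.
x = np.repeat([0.0, 1.0], 30)
y = np.concatenate([rng.normal(5.0, 1.0, 30),   # wild-type expression
                    rng.normal(8.0, 1.0, 30)])  # mutant expression

# Fit y = b0 + b1*x by least squares.
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

# With a 0/1 predictor, b0 is the wild-type mean and b1 is exactly the
# difference in means between mutant and wild type.
print(b0, b1)
```

So "the slope" in this setting is nothing mysterious: it is the mutant-minus-wild-type difference.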
However, because we're looking at different mutants, we might have a third question: what if the effect in the APP and ApoE double knockout is different from just adding those two effects together? For that we add what's known as an interaction term, b3 times x times z. If x is 0, then x times z is 0; if z is 0, x times z is 0; only if x and z are both 1 is x times z equal to 1. So b3 captures only the difference between the double knockout and what you'd predict by adding the two single-knockout effects: that's the interaction. Now you're all prepared to write a generalized linear model for your complex experimental design, using your mice as replicates, not your cells as replicates, and using bulk RNA-seq tools. And these will run on raw counts, because we're using a model that accounts for the negative binomial distribution; that's another feature of generalized linear models, we can use other distributions. So this will always be on the raw counts. That's our model, and our differential expression test is just testing whether b1, b2, or b3 is significantly different from zero. And it's basically lunchtime, so I'm going to leave this description on the board. Think about how you would design the analysis for this experiment before or after lunch, because I'm sure everyone's thinking: let's break for lunch now.
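The interaction term can be sketched numerically too. This is my own toy example, with invented effect sizes, showing that the x*z column picks up exactly the non-additive part of the double knockout:

```python
import numpy as np

# 2x2 design: x = APP knockout (0/1), z = ApoE knockout (0/1).
# One value per condition for clarity: WT, APP-KO, ApoE-KO, double-KO.
x = np.array([0.0, 1.0, 0.0, 1.0])
z = np.array([0.0, 0.0, 1.0, 1.0])

# Toy expression where the double knockout is NOT additive: baseline 5,
# APP effect +3, ApoE effect +1, plus an extra +2 that only appears
# when both genes are knocked out.
y = 5 + 3 * x + 1 * z + 2 * (x * z)

# Design matrix for y = b0 + b1*x + b2*z + b3*(x*z); the interaction
# column x*z is 1 only in the double-knockout row.
X = np.column_stack([np.ones(4), x, z, x * z])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(b, 6))  # b3 recovers the non-additive part, +2
```

In a real analysis you'd have replicate rows per condition (the pseudobulk mice) and a count-based GLM instead of least squares, but the design matrix is built the same way.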