This afternoon, we are going to go over the basics of differential gene expression analysis for RNA-seq data. Note that CVW offers a more in-depth workshop on differential gene expression analysis for RNA-seq; here we are just going to give you the conceptual overview of how you would start with RNA-seq data. We're assuming the data have been processed already, meaning that you've started with the raw reads, they've been aligned to the genome, and you've used tools to call counts for gene expression. We are going to use the edgeR library to call differentially expressed genes, and along the way we're going to make use of the concepts we've been learning all along, notably the GLM concept from this morning. So let's get into it.

At the end of this lecture, you will understand the key steps in identifying differentially expressed genes using RNA sequencing data. You will learn how to use QQ plots (we decided p-value histograms would be too much) to gauge how much signal you have after multiple hypothesis testing. We tend to do a lot of multiple hypothesis testing when we do genome-wide studies, because we are trying to find a signal amidst all our genes, so how do you adjust for that? And then we will create a very common visualization called a volcano plot to look at the results of differential expression analysis.

So here, in a nutshell, is the workflow for differential gene expression analysis. You start with a raw expression count matrix. Remember that in RNA-seq you are measuring how many molecules of a particular gene you found in your library, so you get integer counts, arranged in a matrix where your rows are your genes or transcripts and your columns are your samples. This is where we start. Then we exclude the genes that have a low read count to limit the number of tests we are doing, because of the multiple testing burden; we'll talk about that.
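To make that starting point concrete, here is a minimal sketch of what a raw count matrix looks like. It is in Python rather than R purely for illustration, and the gene and sample names are made up: rows are genes, columns are samples, and every cell is an integer read count.

```python
import pandas as pd

# Toy raw count matrix: integer read counts, one row per gene,
# one column per sample (all names here are hypothetical).
counts = pd.DataFrame(
    {
        "control_1": [1500, 30, 0, 210],
        "control_2": [1320, 25, 1, 198],
        "case_1":    [1480, 90, 0,  40],
        "case_2":    [1390, 85, 2,  35],
    },
    index=["GeneA", "GeneB", "GeneC", "GeneD"],
)

# Rows = genes/transcripts, columns = samples.
print(counts.shape)
```

Note that GeneC is barely expressed in any sample; that is exactly the kind of gene we will filter out before testing.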
Then we have to equalize the samples for some sources of technical variation between them. And then finally we build our statistical model. edgeR does this out of the box, but it helps to know what's happening under the hood: we run a gene-by-gene hypothesis test, which tries to see which genes' expression is explained by our variable of interest; say, disease versus not disease, or responder versus non-responder. Then we correct for multiple testing. The genes that survive multiple testing correction have an adjusted p-value; you get one for each of your genes, you apply a cutoff, and the genes with an adjusted p-value below that cutoff are your differentially expressed genes. Then you can go on to look at the signal: how many genes have I got, how many are up-regulated, how many are down-regulated? And then you can go and do fun downstream things, like asking what pathways these genes are in and what else you can say about them. That is a topic of other courses covered under the CVW umbrella.

So let's start at the top. When we do our RNA-seq analysis, instead of measuring one response variable, we are measuring as many variables as there are genes. Those are all the measures we're looking at. We tend to do this in genomics: if you're doing DNA methylation analysis, you might be measuring anywhere from half a million to almost a million measures. And at the end of the day, you're looking for which parts of the genome have values that can be explained by your factors of interest. So you've got 20,000 measures, and at the heart of it you're doing 20,000 individual statistical tests. With this model that we've conceptually put up here, you're fitting one model per gene, and the p-value you get is for your treatment effect, your effect of interest.
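The "one test per gene" idea can be sketched in a few lines. This is a conceptual stand-in only: a simple two-sample t-test per gene on simulated values, where edgeR would really fit a negative binomial GLM per gene. The point is just the shape of the computation: one model, one p-value, per gene.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy data: 1000 genes x 6 samples (3 controls, 3 cases), simulated
# log-scale expression values purely for illustration.
expr = rng.normal(loc=5.0, scale=1.0, size=(1000, 6))
group = np.array([0, 0, 0, 1, 1, 1])  # 0 = control, 1 = case

# One hypothesis test per gene: is this gene's expression explained
# by the group variable? (A t-test stands in for edgeR's per-gene
# GLM fit; the one-test-per-gene structure is the same.)
pvals = np.array([
    stats.ttest_ind(gene[group == 0], gene[group == 1]).pvalue
    for gene in expr
])

print(len(pvals))  # one p-value per gene
```

With 20,000 genes you would get 20,000 p-values out of this loop, which is where the multiple testing burden comes from.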
So you've got 20,000 p-values. Now, can somebody define what a p-value is for me? We talk about them all the time. What is a p-value? What does it mean? You're all getting in the right area, but the definition is this: a p-value is the probability that you would see a value as large as the one in your data if there were no effect, that is, if the null hypothesis is true. It's the probability that your effect size would be this large under the null hypothesis, where the null hypothesis means the treatment effect is zero. So a p-value of 0.05 means there is a 5% chance you would get a value this big if the null hypothesis is true: if there were no effect, 5% of the time you would still see a value this large.

So what happens if you do 100 tests? If you did 100 tests and accepted anything with a p-value less than 0.05, that p-value is telling you that 5% of the time you would get a value like this even if your treatment effect were zero. This comes from doing the tests over and over again. This is the multiple testing burden, and it is something we deal with whenever we do a large number of statistical tests, as happens in genomics. So what do you do about the multiple testing burden? You limit the number of tests you do, and you correct for the number of tests you did. For example, if I did 100 tests, a very conservative way of correcting is to take that 5% p-value threshold and divide it by 100, because I did 100 tests. That very conservative method is called the Bonferroni correction. It's very simple mathematically, and it follows directly from this definition of the p-value, but it tends to be conservative.
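You can see the false-positive problem, and the Bonferroni fix, in a quick simulation. Here all 100 tests are run on data where the null hypothesis is true by construction, so every "hit" at p < 0.05 is a false positive (t-tests are used as a generic stand-in for any per-gene test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 100 tests where the null is TRUE for every one: both groups are
# drawn from the same distribution, so any 'hit' is a false positive.
n_tests = 100
pvals = np.array([
    stats.ttest_ind(rng.normal(size=10), rng.normal(size=10)).pvalue
    for _ in range(n_tests)
])

hits_uncorrected = (pvals < 0.05).sum()   # expect about 5 by chance
bonferroni_cutoff = 0.05 / n_tests        # divide by number of tests
hits_bonferroni = (pvals < bonferroni_cutoff).sum()

print(hits_uncorrected, bonferroni_cutoff, hits_bonferroni)
```

Roughly 5 of the 100 null tests will slip under 0.05 uncorrected, while the Bonferroni cutoff of 0.0005 removes essentially all of them; the price is that it can also remove genuine effects, which is why it is called conservative.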
So there are other methods of correcting, and a very popular one is called the Benjamini-Hochberg correction. All we're going to say here is: when you do multiple tests like this, you're going to have a multiple testing burden, and if you want to ask the question "how many of my genes are differentially expressed?", you do not want the uncorrected p-value, you want the corrected, or adjusted, p-value. And how do you try to limit this burden? You try to reduce the number of tests you do, and then there are some tricks built into edgeR, which we'll go into in a bit.

Okay, so one way of reducing the multiple testing burden is to remove from the tests those genes which are unlikely to show any effect. For example, if I have a gene that's practically unexpressed in all the samples, it's a waste of my time and my multiple testing budget to test that gene. So one thing we do is a heuristic: we exclude genes that are expressed very, very low, where what "expressed very low" means is a field convention. If, out of 20,000 genes, 5,000 of them aren't expressed at any appreciable level, you've already removed 25% of your tests, so your multiple testing burden is that much less. That's why we exclude genes with a low read count. Chances are you're not going to find a signal in there; you'd have to imagine a scenario where a gene is expressed so low that you barely have any measurements, and you still see an effect. Can you believe that effect? Could it be sampling variation?

There are other assays as well. In methylation analysis, when you have close to a million data points, sometimes you will rank your measures by their variance (variance is the standard deviation squared) and say: I'm only going to test the top 50% most variable probes.
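For reference, the Benjamini-Hochberg adjustment itself is simple enough to write down. Below is a minimal reimplementation for illustration; in practice you would use `p.adjust(p, method = "BH")` in R or `statsmodels.stats.multitest.multipletests` in Python rather than rolling your own.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure).

    Minimal sketch for illustration; real analyses should use
    p.adjust(method="BH") in R or statsmodels' multipletests.
    """
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)                        # ascending p-values
    # Scale each sorted p-value by n / rank.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards.
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adj, 0, 1)
    return out

print(bh_adjust([0.01, 0.02, 0.03, 0.5]))
```

Unlike Bonferroni, which controls the chance of even one false positive, BH controls the false discovery rate, the expected fraction of false positives among your called genes, which is usually the more useful guarantee when you expect many genes to change.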
I'm not going to test the others. It's a heuristic, and yes, you lose signal, but you try to balance it against your experimental design. If I have only 10 samples, then I'm trying to fit this model with 10 data points at each of a million positions; that's a serious multiple testing issue. How big is your signal going to be? Is half the genome changing? Are you expecting one or two genes to change, or five, or ten? Or do you not know? That's why we exclude genes with low counts.

Remember, try to think about where the counts in the count matrix are coming from. You are sequencing, that is, sampling, the RNA molecules in your sample, and that sampling is going to vary from sample to sample. Some samples happen to get sequenced a little more, some a little less; your sequencing assay does not guarantee the exact same number of total reads from each sample. So there is going to be some variation. Think about it this way: suppose that, just by chance, your cases got sequenced much more than your controls, say by a factor of two. Now you look at the raw count matrix, and all of your cases are a factor of two higher than your controls, across the board. What happens if you don't normalize for that variation? You're going to call everything significantly different, but it's not real; it's technical variation. So you need to normalize for that. Then you build your statistical model.
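The simplest version of this normalization is counts per million (CPM): divide each column by its total read count. This is a sketch of the idea with made-up numbers where the case samples were "sequenced" twice as deeply; edgeR's actual normalization (TMM) is more robust than plain CPM, but the intuition is the same.

```python
import numpy as np

# Toy counts: 3 genes x 4 samples (2 controls, then 2 cases).
# Suppose the case columns were sequenced twice as deeply as the
# controls (hypothetical numbers chosen to show exactly that).
counts = np.array([
    [ 10,  12,  20,  24],
    [100,  90, 200, 180],
    [  5,   6,  10,  12],
], dtype=float)

library_sizes = counts.sum(axis=0)      # total reads per sample
cpm = counts / library_sizes * 1e6      # counts per million

# After scaling, every column sums to one million, so the 2x
# sequencing-depth difference no longer masquerades as
# differential expression.
print(cpm.sum(axis=0))
```

After this step, sample 1 and sample 3 (which differ only by sequencing depth) have identical normalized profiles, which is exactly what we want before testing.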
At the heart of it, you've got a null model, which says expression is just explained by the intercept plus some variation, and you've got the full model, which says expression is a function of the intercept and of the disease state. That's what we want; those are the genes we want, the ones where the full model wins. So what RNA-seq tools like edgeR do is, for each gene, fit this model and that model, and use a likelihood ratio test to ascertain which model explains the data better: do the data fit the full model or the null model better? The p-value you're getting is from this likelihood ratio test. And because this is count-based data, with additional variation coming from biological factors, you use a generalized linear model with a negative binomial distribution. Depending on what genomic assay you use, your model has to follow the assumptions of those data: you're not fitting straight lines to your RNA-seq counts. Then you correct for multiple testing, take your adjusted p-values less than 0.05, and finally you evaluate your signal. Okay. So now we are going to go look at a worked example of RNA-seq analysis.