We're just going to do a pretty quick lecture on some of the high-level concepts related to expression and differential expression, and then we'll transition back to the practical exercises of really going through the next steps of the workshop. So this is the expression and differential expression module. We're hoping to cover concepts including expression estimation for known genes and transcripts. We're going to talk a little bit about these FPKM-style expression estimate measurements versus the raw count approach. We're going to review one expression estimation method, which is StringTie, that you guys will be using in the practical lab. And then we're also going to do a very brief discussion of downstream interpretation of expression and differential expression data. I need to full screen that. So you've already seen now, just recently, a screenshot or a live IGV session showing some expression data. This is very similar to what you were just looking at. So this is showing two different tracks, two different samples. This could be, let's say, a tumor versus a normal. And then in these sections of IGV, you're seeing the individual read alignments. In this case, you can see that they're spliced alignments. So these little narrow lines are connecting the aligned portion of one exon to the aligned portion of the next exon. And they also roughly correspond, in this case, to some known gene annotations. And then the example that Malachi just recently showed you also revealed that even at this level, you could start to potentially see something like 3' bias. So here, if you look at the coverage, so this top track is a coverage track, it's a little bit hard to tell from looking at the read alignments because you're only seeing a very small subset of the read alignments. But basically, there are very few reads aligned over here, as indicated by the very low coverage bars. And then towards the 3' end, those coverage bars get quite high, meaning that there's kind of a huge stack of aligned reads here of which you're only seeing the top few. And that's probably indicative of some kind of 3' bias in this sample. Maybe this was a polyA selected sample that was somewhat degraded. You can also see that maybe there's evidence of down-regulation here, in that you have quite a bit more coverage in the top panel than you do in the bottom panel, suggesting that maybe this transcript is more highly expressed and therefore represented by more sequences, more alignments. So this really leads to this super basic idea that you could quantify gene expression or transcript expression by the number of fragments of cDNA that you observe in your sample for any given gene. And one of the most common ways to express that expression level is through the FPKM or, formerly, the RPKM measure. So these are actually basically the same thing. Originally, they were called RPKM, which stands for reads per kilobase of transcript per million mapped reads. But then when we started doing paired-end sequencing, it got confusing. When you would say reads, did you mean the read pair or individual reads? So they started calling it fragments. So the fragment represents the read pair, or both reads of a read pair. So you wouldn't count a read twice; you would align both reads, which is evidence for that particular fragment. And so we call it fragments per kilobase of transcript per million mapped reads.
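For reference, this is the formula being described in the next paragraph, written out (a reconstruction from the narration; C is the fragment count for the feature, N the total number of mapped fragments in the library, and L the feature length in base pairs):

$$\mathrm{FPKM} = \frac{C}{(N/10^6)\,(L/10^3)} = \frac{10^9 \, C}{N \, L}$$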
And the idea is that in RNA-seq, the relative expression of a transcript is proportional to the number of fragments that originate from it. But of course, the number of fragments is also going to be somewhat biased towards larger genes. So larger genes are simply bigger pieces of RNA transcribed, which means they can fragment into more pieces and are going to be overrepresented in the pool of fragments that you sequence relative to, say, a small gene. And it's also the case that the total number of fragments is correlated with the total size of the library. So if you have two samples and you sequence one sample to two million reads and another sample to one million reads, and then you see twice as many fragments mapping to a gene in the one versus the other, it doesn't necessarily mean that gene is upregulated, right? The more you sequence, the more you're going to sample those fragments. So you have to control for that. So you can either try to sequence exactly the same number of reads for every sample, which in practical terms is not actually that easy, or you can normalize for the total size of the library. So that's what this FPKM measurement attempts to do: simultaneously normalize for the size of the gene and the total size of the library. And it's a very simple formula. There are a few different ways of expressing it, but this is one way, where you have the number of mapped read fragments for the gene or transcript or exon that you're trying to estimate the expression of. So that's C. And then you divide by the total number of mapped read fragments in the library, and then by the number of base pairs in the gene/transcript/exon, so the size, and then multiply by this factor, which is a million reads times a thousand bases, to get this per million, per kilobase effect. And what these factors are is a little bit arbitrary, right? So we could talk about gene sizes in kilobases, or we could pick another arbitrary size. That's just a kind of convenient size. Same thing with a million. It could have been per billion. It's really arbitrary. So you might have also heard of TPM, which has become a popular alternative metric, maybe more popular than FPKM now. And really, there's a very subtle difference between these two measures, and it really just has to do with the order of operations of how you calculate them. So we already went through FPKM. This is another way to describe that formula. So you're taking the total library fragment count for the sample and dividing by a million. So that's your per-million scaling factor. And then you're taking each gene's fragment count and dividing it by that per-million scaling factor, and then dividing by the gene or transcript length in kilobases, to get FPKM. For TPM, instead, you're taking the gene or transcript fragment count and dividing it by the length of the gene in kilobases first. So that gives you your fragments per kilobase. Then you sum up all those FPK values for the sample and divide by a million. This gives you, like, your per-million scaling. And then you divide the first by the second to get transcripts per million. So you're actually using essentially all the same data. You're just calculating in a slightly different way, but it has a significant effect, in that the sum of all TPMs in each sample is the same. They always add up to the same total value of a million, whereas the FPKMs don't add up to a fixed total. So with FPKM it's a little bit harder to do a direct comparison of a value between one sample and another.
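To make that order-of-operations difference concrete, here's a minimal Python sketch; the counts and lengths are made up purely for illustration:

```python
import numpy as np

# Toy data: fragment counts per gene for one sample, and gene lengths.
counts = np.array([300, 1500, 50, 2000])      # fragments mapped to each gene
lengths_kb = np.array([2.0, 10.0, 0.5, 4.0])  # gene/transcript lengths in kilobases

def fpkm(counts, lengths_kb):
    # FPKM: normalize by library size first, then by feature length.
    per_million = counts.sum() / 1e6          # per-million scaling factor
    return (counts / per_million) / lengths_kb

def tpm(counts, lengths_kb):
    # TPM: normalize by feature length first, then by library size.
    fpk = counts / lengths_kb                 # fragments per kilobase
    per_million = fpk.sum() / 1e6             # per-million scaling factor
    return fpk / per_million

print(fpkm(counts, lengths_kb))
print(tpm(counts, lengths_kb))
print(tpm(counts, lengths_kb).sum())          # always 1e6, by construction
```

The only difference is which normalization happens first, but that's exactly what makes the TPMs sum to the same total in every sample.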
So that's an advantage of TPM versus FPKM. I think we'll show an example later where we calculate both for the same data set. And the difference in some cases is surprisingly subtle. Like, it's not a dramatic difference in the values and the results you get from doing differential expression. But people do find that TPM is somehow a little bit easier to work with. Yeah, actually the current versions of the software from the same group that created Cufflinks, which are the StringTie and Ballgown tools, they use TPM now. Or it's available as well. I actually don't know which they're using underneath in the statistics. That I don't know, but I suspect yes. I think they've generally moved from TopHat and Cufflinks to HISAT and StringTie. But yeah, I don't know whether they're kind of doing bug fixes or maintaining Cufflinks or doing improvements to it. But my impression is that they've kind of moved on, and that the Tuxedo suite is now actually StringTie and HISAT, not Cufflinks and TopHat. Yeah, they run a lot faster. So we're going to use StringTie. We're going to talk about it briefly, just at a high level, how it works. And this is a figure from the paper first describing StringTie. And they actually contrasted it with the Cufflinks and Traph approaches. And all of these approaches start with the same starting point, which is that they align RNA-seq reads to the genome. A difference with the StringTie approach is that you can assemble reads into super-reads, which is basically combining individual reads into larger kind of linked or overlapping reads. And then both of them have a clustering step, which isn't well depicted in this diagram, but the clustering step is about grouping all of the reads that align to the genome according to what you believe are gene loci. So when you want to go and estimate the expression of a gene and the expression of different isoforms of that gene, you want to make sure you're considering only the fragments that actually map to that gene and not the fragments that map to the next gene. So imagine you have kind of two genes in the genome next to each other, and a bunch of reads align here and a bunch of reads align here. The clustering is about gathering all of these reads together into one cluster and gathering all of those reads together into another cluster. And it sounds easier than it is, because the genome is a pretty busy place. There are sometimes transcripts or gene loci that are quite close together. There are biological phenomena like transcriptional read-through, where the transcriptional machinery will sometimes kind of make a mistake: it's transcribing from one exon to another in a gene locus, and it'll just, like, get a little bit excited and keep going and transcribe into the next gene over. It's a very common phenomenon. There are also overlapping gene loci on opposite strands, right? And depending on whether you have stranded or unstranded data, that might result in some ambiguity. So it's not as simple as it sounds to actually do a correct job of clustering and say, okay, we're going to treat all of these reads as one thing, and try to estimate expression for the gene and its isoforms from that locus, and then do the same for every other locus. So quite a bit of work goes into the algorithm of trying to correctly cluster. I'm sorry, can you say that again? With polyA isolated? Yeah. Yeah, it probably would help in some way.
So if you use polyA selected RNA, for one thing, you're simplifying the transcriptome a little bit, so you're getting rid of the possible confusion between reads from your polyadenylated transcripts versus non-polyadenylated ones. And maybe you're also enriching for processed RNA and depleting unprocessed RNA, which could also be a source of confusion for the clustering step. So it's a reasonable assumption that that might actually help with this step. It has other disadvantages, but yeah. Okay. So we're going to align our reads. We're going to assemble them, in the case of StringTie, into super-reads. We're going to cluster the reads into clusters. And then StringTie does this complicated procedure where it looks at the alignments and builds a splice graph. We'll go through in more detail in the subsequent slides what we mean by that. From that splice graph, it also constructs flow networks. And then it computes the maximum flow through each flow network, estimates expression based on that, then removes that individual flow network, or possible isoform, and repeats the procedure for all of the isoforms it has inferred. So just to kind of illustrate this in more detail with an example, here's a set of RNA-seq reads. We're going to do the step where we assemble them into super-reads. So reads that can be sort of easily assembled into overlapping contiguous reads are assembled together. And these are aligned to the genome, and the transcript isoforms are inferred. So in this case, if you look at the alignments, you can see there are three possible isoforms. The one on the top here involves part one being connected to part three, in the green, and then part three being connected to part five. So that's one isoform. There's another potential isoform here where you have parts two and three, so here's maybe, like, an alternate UTR or something, connected to part five. And then for the final isoform, you have this part one connected to parts four and five. So that's depicted in the three diagrams below. So there are these different splicing patterns that are consistent with three possible isoforms. And based on those three sets of splicing patterns, we can construct this splice graph. So this is just depicting what we just walked through, right? So there are three paths through this graph. One is 1-3-5, one is 1-4-5, and one is 2-3-5. The s and the t are what are called the source and the sink; that's terminology that's specific to graph theory, or to flow network theory. That's basically the start and the end. So you have this flow network, and what you do is use some graph theory to estimate the maximum flow through the graph, assign an expression estimate based on that, and then remove that isoform and repeat the procedure to estimate the expression of subsequent isoforms. Yeah, so here's a depiction of one of these flow networks. So you create this flow network, and this one is just related to the first possible isoform. So we've got 15 total fragments here. The gray bars are fragments. And any two nodes in the network are connected if a fragment starts and ends at those nodes. So if a fragment starts at node one and ends in, or at, node three, we would say there's a connection between one and three. If a fragment starts at node one and ends in, or at, node five, we would say there's a connection between one and five.
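Before walking through the actual numbers on the slide, here's a toy version of such a flow network in code: a minimal sketch using networkx's max-flow solver, where edge capacities are fragment counts. The first few capacities mirror the walkthrough that follows; the rest are made up for illustration:

```python
import networkx as nx

# Toy flow network for one candidate isoform path (1 -> 3 -> 5).
# Edge capacities are illustrative fragment counts, not values taken
# from the StringTie paper's figure.
G = nx.DiGraph()
G.add_edge("s", 1, capacity=7)   # fragments whose alignments start in part 1
G.add_edge(1, 3, capacity=6)     # fragments connecting part 1 to part 3
G.add_edge(1, 5, capacity=1)     # fragments connecting part 1 directly to part 5
G.add_edge(3, 5, capacity=4)     # fragments connecting part 3 to part 5
G.add_edge(5, "t", capacity=7)   # fragments whose alignments end in part 5

flow_value, flow_per_edge = nx.maximum_flow(G, "s", "t")
print(flow_value)       # maximum flow = coverage estimate for this isoform
print(flow_per_edge)    # how much flow each edge carries
```

StringTie's actual formulation is considerably more involved, but the maximum flow computation is the same basic idea.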
So you construct this flow network. So if you look at just the first node, which is represented here, there are seven fragments supporting this node. It is connected to node three by six fragments and to node five by one fragment. Again, the connection here is based on the start and end of the fragment. So this fragment goes from one, through three, to five. So it's being counted here as one. Yeah, exactly. Which is a little bit confusing, but that's just how this works. And then there are 13 fragments that support this second node, three of which connect to the third node. And so what's the maximum flow through this flow network if you were just to sum up the different paths? Yeah, that could be. So it's like seven plus six plus 13 plus three plus four, whatever that works out to, versus seven plus one plus four. So it should be this path, right? So the maximum flow, the most likely isoform, is one to three to five. So StringTie estimates the coverage level of the transcript by solving this maximum flow problem. And there's a whole field of math behind that, based on graph theory and optimization theory. If you go to the StringTie paper, you can see there are pages of formulas. I'm not a statistician or a mathematician, so I probably can't explain it properly, and it would probably take me two hours even if I could. So there's a lot of complicated math that goes on behind this, but conceptually, what it's doing is trying to find the most likely path based on the data. So, what the alignments support in terms of which isoforms are most likely present, and then the relative expression levels of those isoforms. Does that make sense? It's using this spliced alignment data to, like, simultaneously infer what the isoforms are and estimate their expression levels. Yeah. Yeah, so I remember for Cufflinks, which is the predecessor of this, they definitely had an estimate of certainty on the isoform and the expression level of the isoform. And I'm sure that's true for StringTie as well, but I forget the details of how it works. But yes. And I'm sure this is a part of the statistics as well, right? So we're going to run StringTie. Something else I just want to briefly mention, because you're going to use this tool, is StringTie merge. This allows us to merge together all the gene structures from all the samples. And the reason we need to do this is that some samples may only partially represent a gene structure, or even only represent subsets of the genes or transcripts. So because we're inferring all these isoforms from the data, and the data are different, there are different expression levels, what we're going to get from each sample is, like, a different world view of all the possible genes and transcripts. But then if we want to do differential expression analysis, we need to be comparing apples to apples, right? So if we consider an extreme example, let's say in one sample there's a highly expressed gene, and there are all these read alignments that support that gene, and maybe a specific isoform of that gene. And in the other sample, the gene is literally not expressed at all. So there are no reads for it. It's just not there. So StringTie is not going to be able to assemble an idea of that isoform there, because there's no data supporting it. So then, when you go to do the differential expression analysis, you're going to be comparing something to the absence of something, which will be confusing for the software.
So exactly, it probably is real, or could be real. So by doing this merge step, what we're doing is creating a single consistent representation of the isoforms across all of the samples, so that when we do the differential expression analysis, we can compare the data in sample A, which has this gene expressed, to sample B, which doesn't have it expressed, and we'll get the proper answer, which is some high expression value versus zero expression. And that same logic applies not just to the simple contrived example of expressed versus not expressed; maybe they both express the gene, but they don't express the same isoforms of the gene. Or maybe there's support for an isoform in one sample, but it's only partial support. So it's missing some exon-exon junctions. It only has partial coverage of that transcript, but it has enough that we can probably assign some expression value to that transcript. But we need to have, like, a common, consistent representation of all the isoforms. So this StringTie merge tool is going to allow us to build that. And it also allows us to compare, or merge, what is inferred from the data with some existing knowledge about the transcriptome. So we have this GTF file, right? For human, we actually know quite a lot about what transcripts are expressed and what they look like. So we might want to say: given what you inferred from this data, map it onto this existing knowledge about transcript isoforms, so that it's consistent with our existing knowledge about transcription. A related tool is GFFcompare. So you can compare, for example, your merged transcripts with the known annotations. You might want to know, for example, how comprehensively you've recapitulated the known transcriptome using your data. And then finally, we're going to use a software package called Ballgown for differential expression. This uses a parametric F-test to compare models, where you fit basically two models to each feature, using expression as the outcome. One includes your covariate of interest, which is maybe case versus control, or tumor versus normal, or in this case, brain versus body tissue, and one doesn't include the covariate. And then an F-statistic and p-value are calculated using the fits of those two models. So it basically measures how much the covariate explains the difference in expression, how much your condition explains the difference in expression. And then this is adjusted for multiple testing using an FDR-based multiple testing correction. And in addition to doing the statistics, Ballgown also produces some nice visualizations. So we're going to run Ballgown, and that lets us do things like create histograms or box-and-whisker plots showing the distribution of expression values for genes. We can also compare between different conditions. It will also show the relative expression estimates for different transcript isoforms of a gene, and so on. So that's the sort of StringTie view of the world. If you study differential expression or expression estimation, you'll see that a lot of people use a kind of main competing category of analysis, which is the raw read count based approach.
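Before we get into the count-based world: Ballgown itself is an R package, but the two-model comparison just described is easy to sketch generically. Here's a hypothetical Python version using statsmodels with simulated data (every number here is made up), fitting each feature with and without the tissue covariate, computing the F-test p-value, and then applying a Benjamini-Hochberg FDR correction:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_per_group, n_features = 6, 200

# Made-up log-expression matrix: 200 features x 12 samples (6 brain, 6 body).
# The first 20 features get a real tissue effect; the rest are null.
tissue = np.array(["brain"] * n_per_group + ["body"] * n_per_group)
expr = rng.normal(5.0, 0.5, size=(n_features, 2 * n_per_group))
expr[:20, n_per_group:] += 1.5

pvals = []
for y in expr:
    df = pd.DataFrame({"logexpr": y, "tissue": tissue})
    full = smf.ols("logexpr ~ tissue", data=df).fit()  # model with the covariate
    null = smf.ols("logexpr ~ 1", data=df).fit()       # model without it
    f_stat, p, _ = full.compare_f_test(null)           # F-test of nested models
    pvals.append(p)

# FDR (Benjamini-Hochberg) correction across all features tested.
significant, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(significant.sum(), "features significant at FDR 0.05")
```

This is just the nested-model idea under stated assumptions, not Ballgown's actual implementation, which also models transcript structure and works from the StringTie output.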
So instead of doing this somewhat complicated estimation of FPKM, with graph theory and flow networks to discover the isoforms and estimate expression, people will typically just do differential expression at the gene level, simply based on the counts of reads that align to a gene in one sample versus another sample, or in a set of samples for one condition versus another condition. We use HTSeq-count, but there are various other options for producing these counts. This is really very simple. It basically produces a single number, which is just the count of reads that align to, or overlap with, a set of features that you define. What's described here is the different modes you can run HTSeq-count in, to define overlap in different ways between the read alignments and your features of interest. And yeah, BEDTools is basically the same idea. I don't see why you couldn't do that. I can't say that I recall seeing someone use BEDTools instead of one of these. I think you would probably have to do a little more work to get it to work with BEDTools, whereas these tools have just made it very simple: give them a BAM file and a GTF file, and they'll output the counts per feature, like, in one command. Yeah, I don't know. Yeah, it should be possible. It may be very easy. So just be aware that there are different modes that you can run HTSeq-count in, and there are different consequences to that in terms of what is considered overlap between your reads and different features. And then, I think we have it here: if you go to SEQanswers, or read the documentation and forums for HTSeq-count, there's a discussion having to do with estimating expression levels for different isoforms. The take-home message seems to be that you should use it for that with extreme caution. Basically, assigning counts to isoforms is hard, right? Because most transcripts in the human genome are very similar to each other. So the differences between them can be quite subtle. There may be very few reads that truly distinguish one transcript from another transcript. And methods like StringTie have done a lot of work trying to solve that problem in sophisticated ways, whereas simply doing the count-based approach maybe kind of underestimates the complexity of the problem. And basically, the HTSeq-count authors essentially don't recommend using it for transcript-level expression. So really, it's meant for when you're just looking for gene-level expression estimates. So you're giving that to HTSeq-count? Yep, it just comes up with the counts. You give it the BAM file, and then you tell it the location of the genes that you care about with the GTF file. So we're going to do all of that in the labs. So if you do have raw count data, then you're probably going to want to do differential gene expression. There are a number of packages, especially in R, like DESeq2 or edgeR and others. I think the example we have is edgeR, but we've used both. So which one should you use? This is kind of a long-standing debate. There are definitely camps who really believe in the statistical approaches underpinning the count-based methods, and other people who prefer the FPKM style, or the StringTie/Cufflinks kind of view of the world. I don't really have a strong opinion on either of those. We use them both.
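Before getting into when each is preferable, here's a toy, hypothetical sketch of what a counter roughly in the spirit of HTSeq-count's union mode is doing under the hood. Real tools work from BAM and GTF files and handle strands, spliced alignments, and quality filters; the gene coordinates and reads here are made up:

```python
# Toy gene annotation: gene -> (start, end) on one chromosome.
genes = {"GENE_A": (100, 500), "GENE_B": (800, 1200)}

# Toy read alignments as (start, end) intervals.
reads = [(120, 220), (480, 580), (900, 1000), (600, 700), (450, 850)]

def count_reads(reads, genes):
    """Count reads per gene: a read is counted only if it overlaps
    exactly one gene; reads overlapping zero or multiple genes are
    discarded as ambiguous."""
    counts = {g: 0 for g in genes}
    for r_start, r_end in reads:
        hits = [g for g, (g_start, g_end) in genes.items()
                if r_start < g_end and r_end > g_start]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts

print(count_reads(reads, genes))  # {'GENE_A': 2, 'GENE_B': 1}
```

How you treat the ambiguous and partially overlapping reads is exactly what the different HTSeq-count modes mentioned above control.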
When you want to leverage the benefits of things like the Tuxedo suite, so when you do want to do isoform-level estimation, it's good. For visualization, I find the FPKM values, or TPM values I guess, are relatively easy to visualize and directly compare to each other, and for calculating fold changes; a lot of people use FPKM for that. Whereas the counts are really good for the robust statistical methods for measuring differential gene expression, especially if you have more complicated experimental designs. So if you have, like, more than two groups, time series data, that sort of thing, there are a number of R statistical packages that are count-based and accommodate those different statistical designs. Can you go between them? Yes, let's say that's relatively common. So we calculate both the FPKMs and the counts, we actually do statistics with both, and if something is significant, say, by the count-based method, you can still look at and visualize the data using the FPKM values, and sometimes you will see, like, oh yeah, it kind of looks like it's differentially expressed there as well. I mean, visualizing the raw counts is not going to be that useful, so you probably do need to convert to something like an FPKM for visualization purposes. Yeah, there probably are tools in some of the R packages to calculate the FPKM values. Yeah, interesting. I think the argument was that if your gene is smaller than average, for example, then you're not going to get a similar expression estimate. I would think you would. Didn't you say it was overestimating the expression? So you're not representing... Yeah, I see. It has a built-in factor to try to compensate for what it anticipates would be the bias. Yeah, so I'm not familiar with that issue, but it doesn't surprise me that that would be the case. I guess you could try comparing the results between StringTie and Cufflinks and see if this size-based bias goes away or if it's still a problem. Yeah, and then you have to really understand every single step you take. I certainly see the value in having the statistical analysis, but in fact there are quite a few things, like these overestimations, that are not obvious in terms of what you find out. I mean, I guess I would hope they're at least consistent from sample to sample, so that you would still be able to do comparisons between samples, but... It's still a very high value, but not as high as when you have 20 million... Huh? If you compare them and you see a difference, then you're wondering, okay, who's right? Yeah. It'd be interesting, I guess, to compare the two and just see the gap. Yeah, so I mean, in all of these cases, these are all just estimates, right? They have uncertainty attached to them, sort of known uncertainty and kind of unknown uncertainty, and also weird biases. There have been a number of studies that have looked at the degree to which even what would be widely considered the best approaches for isoform detection and estimation, like StringTie or Cufflinks, actually get it right. I'd argue that... Yes. Yeah. So you can always calculate the sort of simplistic FPKM and compare. I mean, they would say that their FPKM is a better FPKM, because the things they're doing to adjust it are based on some thought process. It's like a normalized FPKM or something, yeah.
Sure, yeah. But what I was going to say is that there are some publications that have shown that short-read sequence data is fundamentally not that good at actually estimating expression levels of different isoforms. So especially at the isoform level, you know, all of this should be taken with a pretty big grain of salt. Yeah, I think that 100 or 150 is still short-read, but maybe 250 is getting closer. Oh, yeah. I mean, we're doing mostly 2 x 150, and because of the sort of standardized library fragmentation protocol we have at the production center, we're seeing tons of read-through, including even read-through into adapter on the other end, which is creating problems that we never had with our pipelines. Yeah. So I think this is a good segue, actually, into the fact that multiple approaches are advisable. So I wouldn't take any of these individual software tools as ground truth. And even for the data set that we're analyzing here, for the larger version, we took something similar and analyzed it with three popular approaches at the time: Cuffdiff, which went with Cufflinks; edgeR; and DESeq. And there's a significant amount of overlap. So you see, you know, 3,000 genes that were determined to be differentially expressed in common between the approaches. But you also see substantial non-overlap. So depending on the goals of your study, you may want to limit yourself to things that are very reproducible and identified by multiple approaches, and take the intersection of these strategies. Or, if it's more of a fishing expedition where you definitely don't want to miss anything interesting, you might want to take more of a union of all of these different approaches. And in all cases, you probably want to validate in some way, right? So this is never the end. It's always kind of the beginning of your study. So just quickly, some lessons learned from the microarray days. People have published on this. It's kind of a trope now, but sequencing technology does not eliminate biological variability. Arguably, it just measures the variability more accurately. So what that means is that you still need to think about your power to detect differences. You need biological replicates to have statistical robustness and reproducibility. There are a number of tools and forums you can read to get advice about power analysis for your RNA-seq experiments. I think especially in the early days of RNA-seq, and still to some extent today, there are a lot of advantages to RNA-seq. People want to use it because you get more information; you can do more with it. But it's still very expensive, right? Much more expensive than microarrays. And so budgets are not unlimited. And a lot of times what people have had to do is sort of sacrifice replication for the increased knowledge that they hope to gain from what RNA-seq can teach them. But the result is that you're maybe underpowered. I think, you know, we're just being human. I think we have a link to a video you can watch about this, where there's a biologist and a statistician talking, and the biologist is trying to convince the statistician to just say, yeah, three is the right number. But it's obviously not. Yeah, there's really no real reason why people do three, except that it sounds better than two or one, and it's less expensive than four or five. It's really, yeah.
Yeah, so a lot of the statistical packages will simply error out with less than, certainly, two replicates. I mean, almost nothing will do a one-versus-one comparison. Some will probably error out with two. Once you have three, the software will run, right? But that doesn't mean it's giving you statistically robust results. Yeah, so what I'm saying is that if you have two conditions and one replicate of each, a lot of tools just won't even operate. Yeah, I mean, if you don't have replicates, you don't have a measure of variability, right? And most statistical tests are based on variability. So a lot of these statistical tests will just say: sorry, there's nothing I can do for you if you don't have replicates. Okay, so they actually are replicates? Yeah, yeah. And it's actually hard to find tools that will do that. Like, literally, if you have two samples from two conditions, you did RNA-seq on one sample from condition A and RNA-seq on one sample from condition B, and you try to identify differentially expressed genes, you will have a hard time finding software that will do that for you. It's, you know, not impossible, but it's essentially meaningless to do a statistical test on that. All you can really do is calculate a fold change and believe that it means something, but it probably doesn't. Okay. So multiple testing correction is another problem. Let's assume you have a well-designed experiment that's well-powered and has lots of replicates. But then you go ahead and you look for differential expression. Just like in the microarray days, you have tens of thousands of genes, or thousands of genes, and you do thousands of tests, and eventually, by chance, just due to variability in the data, you'll find some significant results which would not necessarily be replicable. And this problem has just been compounded by RNA-seq, right? Because now we have so many more things that we can measure. So we can look at not just genes, but individual isoforms. We can look at the exon level. We can look at the individual exon-exon junction level. So there are more and more possible tests, which means there's more possibility of false positive results. So you really want to think about multiple testing correction. Can you filter the data ahead of time? Yeah, you can do test reduction. So you can say: I'm not going to do 100,000 tests; I'm going to think, in an unbiased way, about what would be a sensible subset of tests to do. A very common thing would be to say: let's only consider genes or transcripts that are expressed in at least some percentage of the samples. If they're not expressed anywhere, why waste a test on them? And then similarly, you might say: unless there's some coefficient of variation across all the samples, independent of the class that you're trying to distinguish, you could eliminate those genes. If they're just constantly expressed across all samples, independent of condition, they're probably not that interesting. So you might eliminate those tests as well. So that's another really common approach. And then finally, downstream interpretation. This would really be the subject of another course. And actually, we do have another course on genomic visualization and interpretation, which I've linked to at the bottom here, and which includes sections on differential expression and pathway analysis. We're going to do a little bit of this in StringTie and Ballgown. And these results could be fed into many downstream pipelines, right?
So you can do clustering, heat maps, classification analysis, pathway analysis, and more. And we've listed a few suggested packages for that, but really there's just a plethora of such options. So that is the introduction to module three. And what we're going to do next is move on to the expression estimation and differential expression steps. So by now, you guys should have your raw data, your reference genome, and your gene annotations, and you've aligned your reads with HISAT. And so you should be ready to do expression estimation with StringTie and then differential expression with Ballgown. So that'll be module three. Thank you.