So we're moving on to module three now, which is getting into expression and differential expression analysis. You've been introduced to the general idea of RNA-seq, we've done the preparatory work to identify reference genomes and gene annotations, and you've produced your alignments. So now the task is to estimate expression and do differential expression analysis based on that alignment data. That's what we're going to talk about: estimating expression of both genes and transcripts, FPKM versus raw count approaches (we'll use both in the practical exercises), a brief look at differential expression methods, and then some downstream topics. What we're trying to get to is a point where we can determine the abundance of a transcript based on this alignment data. In the last section you looked at some of your RNA-seq alignments in IGV, and at a basic level what we're trying to do is go from those alignments to a quantitative measure. Roughly, you can do it by eye. Here are two samples, let's say a tumor and a normal. We have decent coverage of these transcript isoforms for both samples, but in the top case, using the coverage bar, you can see there are quite a lot more alignments in the top sample than in the bottom sample. So, assuming these libraries were sequenced to similar depths, you might infer just from looking at this that there's more evidence for high expression of this particular gene in the top sample versus the bottom sample. The other thing you can notice from this visual is the three prime end bias. Notice how the coverage is really low at the five prime end and gets higher towards the three prime end. Does anyone know why that might be the case in this particular data set? Yeah, this was probably a polyA-selected RNA-seq library. There's a certain amount of degradation of full-length RNAs that occurs, and when you're using the polyA tail at the three prime end to enrich, you inevitably bias yourself towards sequences at the three prime end of transcripts. So you've probably heard the term FPKM or RPKM. It's still maybe the most commonly used way of summarizing expression levels. It started out, I think, with the very first RNA-seq papers and RPKM, which stands for reads per kilobase of transcript per million mapped reads. FPKM is exactly the same concept: we switched to talking about fragments instead of reads when we started doing paired-end sequencing, because we then had two reads for one fragment. So instead of counting reads per kilobase per million, we count fragments per kilobase per million, but they're essentially the same idea. The reason we start with this very simple normalization is that while the relative expression of a transcript is thought to be proportional to the number of fragments you sequence, there are at least a couple of major confounders. One is that larger genes tend to produce more fragments, even for the same number of expressed copies of that gene.
So for a very large gene, even if there's just a single copy of that RNA floating around, it gets broken into multiple fragments during library construction, and you may sequence many reads that ultimately originated from just one expressed RNA molecule. So you need to account for gene size. The other major factor is library depth: if you sequence more, you get more fragments, so the apparent expression level of genes could go up if you're not accounting for the total depth of sequencing you performed. Yeah? Yeah, so GC percentage is potentially a factor. It was something we thought and talked about a lot in the microarray days, because with hybridization-based approaches there was definitely a bias towards or against GC-rich regions being preferentially hybridized. You hear less about it for RNA-seq, but it is possible that GC bias could affect things like amplification during library construction, and if there's a cDNA capture step in your RNA-seq library, then again we're talking about hybridization, so GC bias could play a role. So yes, that's a kind of downstream normalization you might think about doing. Yeah? Yeah. So I have here a simple formula for FPKM, but basically all you're doing is taking the number of mappable reads for a particular gene or transcript and dividing it by the total number of mappable reads for the whole library and by the size of the gene in base pairs. And because that results in a very small number, we apply a multiplication factor to express it per kilobase and per million, which makes the numbers a little nicer. Another very commonly used metric is TPM, or transcripts per million. We added a quick overview of the differences between FPKM and TPM. In general TPM is probably gaining in popularity, although you still hear as many people quoting FPKM as TPM. They're actually very similar, so the difference between them is quite subtle, and it comes down to the order of operations. For FPKM, you first determine the total sample fragment count and divide by a million; that's your per-million library scaling factor. Then you divide each gene or transcript fragment count by that factor, which gives you fragments per million, and then you divide by the length of the gene in kilobases, which gives you fragments per kilobase per million. TPM is almost exactly the same thing, except in a slightly different order: you first divide each fragment count by the length of the transcript or gene in kilobases, then sum those per-kilobase values for the sample and divide by a million to get a scaling factor, and then divide the first quantity by the second to get transcripts per million. So it's basically the same operations in a different order, but that leads to some subtle and possibly important differences between FPKM and TPM. There's actually a nice blog post on this, and I put the link to it here; I'm not sure if these links actually work, but if you want a detailed explanation, even with a video explaining the differences between TPM and FPKM, you can read that. One of the main ideas is summarized here: when you use TPM, the sum of the TPMs in each sample is the same, so you have the same denominator across samples.
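To make that order-of-operations point concrete, here's a minimal sketch in R; the counts, gene lengths, and sample names are made up for illustration. It computes FPKM and TPM from the same toy count matrix, and you can see that the TPM columns always sum to the same total, which is the same-denominator property.

```r
# Toy example: raw fragment counts for 4 genes in 2 samples (made-up numbers)
counts <- matrix(c(100, 200, 300, 400,
                   150, 250, 350, 250),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("sampleA", "sampleB")))
gene_length_bp <- c(gene1 = 1000, gene2 = 2000, gene3 = 500, gene4 = 4000)

# FPKM: scale by library size (per million) first, then by gene length (per kb)
fpkm <- t(t(counts) / colSums(counts) * 1e6) / (gene_length_bp / 1e3)

# TPM: scale by gene length first, then by the per-sample sum of those values
rate <- counts / (gene_length_bp / 1e3)          # fragments per kilobase
tpm  <- t(t(rate) / colSums(rate) * 1e6)

colSums(fpkm)  # differs between samples
colSums(tpm)   # identical (1e6) in every sample: the "same denominator" property
```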
That consistent denominator is supposed to make it easier to compare values between samples, a little more of an apples-to-apples comparison. We're actually going to calculate both FPKM and TPM later and look at the correlation between these values for our test data sets, to see how big a difference it seems to make. So we're going to be using StringTie. This is the latest, or current, version of the transcript assembly and abundance estimation tool from the Tuxedo suite. There's a depiction here from their Nature Biotechnology paper where they compare how StringTie works to how Cufflinks and Traph worked. Cufflinks was the previous iteration of their software for transcript assembly and abundance estimation, and Traph is a related or competing approach. Of course, in their paper they claim StringTie is the best; that's how papers work. It does seem to be quite good, and it's very widely used. We're going to present this workflow as well as a count-based approach, which is another popular approach using something like edgeR or DESeq for the differential expression. And I'm going to go through a couple of slides that show at a high level how StringTie works. So this is the high-level description of how it works. Basically, you align your RNA-seq reads to the reference genome using a splice-aware aligner, of course; this is a reference-based approach, not a de novo assembly-based approach, that we're going to talk about first. There's an optional step that StringTie introduced that you didn't have with Cufflinks, which is to assemble super-reads. These are reads assembled from individual read pairs, or sets of read pairs, into kind of mini-contigs, which can improve the ability to resolve different transcript isoforms. But as I said, that's optional. So you either have the individual read alignments, or the individual read alignments plus the super-read alignments. Then what StringTie does is build a splice graph. The splice graph basically connects exons to exons, as you would imagine for something depicting splicing. It determines the heaviest path through the splice graph, and from that it builds a flow network and computes the maximum flow through that network to infer the isoform and its abundance simultaneously. And it does that iteratively: it takes the heaviest path through the splice graph, assumes that represents a possible transcript isoform, builds the flow network, determines the maximum flow, then removes all of the reads that support that putative isoform and repeats the process, identifying the next heaviest path through the graph, creating another flow network, determining the maximum flow, and continuing iteratively. So here's another depiction of that. You have your RNA-seq reads; some of them have been assembled into super-reads. Then you map those reads and super-reads onto the genome. Here we're showing a simple region of the genome with several exons, five exons depending on how you define them, with a couple of different splice sites being used for some of these exons, and it creates the splice graph.
So based on the read support that exists, there are paths through this graph that involve going from this first exon, the orange one, directly to this green exon, and then directly to this red exon, so something like 1, 3, 5 is that particular isoform. And that, in this case, is being depicted as the heaviest path, meaning the largest number of reads support that path through the graph. But there are other possible paths based on the pattern of splicing. Some reads also start at this other exon boundary, so that creates another possible exon as a starting point, which then goes through exon three, and then you see some splicing again over to exon five, and so on. So it determines the heaviest path, in this case 1, 3, 5, and then constructs a separate network for that, which represents this particular isoform. It calculates the maximum flow through this network, which defines the isoform and determines its expression level. It then removes the read or fragment support for that particular isoform and repeats until it has built up all the possible isoforms. So there's another isoform 2 that has that alternate start site, and then another isoform 3 that has an alternate splice site for exon 2. Yeah? So I guess it uses the information from the spliced reads and from the exon coverage? Yes, it's considering both; the next slide depicts this. So here's an example. Let's take this first isoform: we determined this is the heaviest path through the graph, and now we look at this specific example as a flow network. You have about 15 reads aligned to this putative isoform. Some of them are reads like you're talking about, that splice across introns from exon to exon. Some of them are just internal to exons; here are some reads that don't splice, but they do contribute coverage towards that exon, so they get factored into the calculation of maximum flow. In the flow network that's built, each node represents one of the exons: the large orange circle represents the first exon, the green circle the third exon, the red circle the fifth exon, and so on. And it determines the number of reads supporting each exon and each connection between exons. This is a very simple example, so the maximum flow is quite easy to determine, and it's clearly through exons one, three, and five. There is one read that actually connects exon one to exon five directly, although it has an intermediate alignment in exon three, so it supports both. But yeah, the total coverage at this exon, as well as these other exons and the connections between them, is what's ultimately used to estimate the abundance of that transcript isoform. Any other questions about this? So if you read the StringTie paper, and I've read it numerous times, there's a lot more going on here. There are at least two kinds of graphs being constructed using graph theory, there are heuristics they use for determining the heaviest path through those graphs, there's optimization theory used to calculate the maximum flow, and there are probably about 50 equations in the supplementary methods of the paper. Going through all of that would be a whole course in itself, and probably the creators of the software teach such courses.
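To give a flavor of the heaviest-path idea, here's a toy sketch in R. This is not StringTie's actual algorithm (StringTie also folds in per-exon coverage and solves a real maximum-flow problem); it's just a dynamic-programming walk over a small splice graph whose edge weights are made-up read counts supporting each exon-to-exon junction.

```r
# Toy splice graph: edges are exon-to-exon junctions, weights are
# (made-up) numbers of reads supporting each junction.
edges <- data.frame(
  from   = c("E1", "E1", "E2", "E3", "E3", "E4"),
  to     = c("E2", "E3", "E3", "E4", "E5", "E5"),
  weight = c(  4,    9,    3,    2,    8,    2),
  stringsAsFactors = FALSE
)
nodes <- c("E1", "E2", "E3", "E4", "E5")   # already in genomic (topological) order

# Dynamic programming over the DAG: best[v] = heaviest total support of any
# path ending at exon v; pred[v] remembers where that path came from.
best <- setNames(rep(0, length(nodes)), nodes)
pred <- setNames(rep(NA_character_, length(nodes)), nodes)
for (v in nodes) {
  incoming <- edges[edges$to == v, ]
  if (nrow(incoming) > 0) {
    scores  <- best[incoming$from] + incoming$weight
    best[v] <- max(scores)
    pred[v] <- incoming$from[which.max(scores)]
  }
}

# Trace back from the heaviest endpoint to recover the path (a putative isoform)
end  <- names(which.max(best))
path <- end
while (!is.na(pred[path[1]])) path <- c(unname(pred[path[1]]), path)
path        # "E1" "E3" "E5" for the weights above
best[end]   # its total junction support
```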
But the full details are probably beyond the scope of this lecture. This cartoon depiction gives you a general idea of how it works and why it works, but if you really want to understand the underlying math and theory, you'd have to read the paper and probably several other papers. One thing to keep in mind is that these isoforms are being inferred from the data, and we're going to run StringTie in several different modes. There are ways you can give it hints about what the transcript structure should be using, for example, known gene annotations. For human, we have a pretty good idea now of all of the known transcripts, so we can provide that as an input to StringTie and say: here's some guidance on the isoform structures you should be estimating for. But you can also run it in a mode where you don't do that, where it infers everything directly from the data, and there's an in-between mode where it uses the reference annotation but can also infer novel isoforms. That's really useful, especially if you're trying to identify novel splicing, maybe aberrant splicing. The disadvantage is that if you run the software on two samples, you could get different inferences about which isoforms are there. You could have incomplete inferences: if the coverage is marginal, it may not assemble the transcript correctly, and then you don't have an apples-to-apples comparison. You want to ask, is transcript X higher in sample A than in sample B, but the two runs aren't even talking about the same transcripts, because they inferred different sets of possible transcripts from the data. So a common step is to merge these inferred isoforms so that you have one consistent representation, and then in some cases you go back, supply that merged annotation as input, and re-estimate expression for all of your samples against a consistent set of possible isoforms. StringTie provides a merge command to help you do that, to create a merged GTF file that represents the union of inferred isoforms across all of your samples, so you can get one consistent set of expression estimates. Yeah? Would you want to keep treatments separate? For example, if you have a treatment that is causing a certain isoform to be expressed that isn't there in the other treatment, wouldn't you miss it if you merge them all together? I would say no, you actually want to merge them together, so that you can confidently say: if I look specifically for this isoform that I think has been induced by some treatment, I want to show definitively that it's not expressed in the other condition. And the best way to do that is to tell StringTie about it and say, tell me if you can find any evidence for this novel isoform that I think is being induced by whatever my treatment is. So that's actually when you would want to merge. Another question: could you supply a junction file directly, or would you have to go back through HISAT when we talk about putative transcripts from a StringTie merge? So are these novel junctions? They could be any junctions reported, inferred from the data. Yeah, so there's a couple of different things.
For one, when you create the HISAT index, remember we actually supplied exons and junctions. So one thing you could do is use that information to supplement the building of the original index, which might help the aligner do a better job of aligning reads to what you believe are important or novel junctions. But in general, yes, any other approach you use to define isoform structure could in theory produce a GTF file, or be merged with another GTF file, and be included as guidance in one of these different modes of running StringTie. Yeah. Is there a quality control function in StringTie, or some quality control method for these predictions? So one thing is gffcompare. One way you could assess it is to take the transcript definitions determined by running StringTie in one of these modes and compare them back to, say, the Ensembl or UCSC reference GTF, and simply ask what percentage of known isoforms were inferred correctly and how much novel stuff there was. That might give you a general sense of how it's performing, like whether it's 99% concordant with our current understanding of human transcription. But it's an interesting question whether there's an isoform-by-isoform level of confidence that a prediction is a true or correct isoform; I'm not aware that there is. You have the expression estimate, the abundance, which is in a way a level of confidence in that inferred transcript, but it's not really specific to what makes that transcript novel. There could be only a few reads that actually support the novel junction between exons that no one has observed before. That's a good question; I don't know the answer to that. Okay, so we're also going to use gffcompare in the way I described: we take the isoforms that are predicted or inferred from the data and compare them back to a reference. Then we're going to use the Ballgown software. This is an R package that works with StringTie output to compute differential expression statistics. It uses a parametric F-test, and the way it works is that it fits two models: one that uses the covariate of interest and one that does not. So if you're comparing case versus control, tumor versus normal, pre-treatment versus post-treatment, it builds two models, one that includes that variable as a covariate and one that doesn't, and calculates an F statistic and a p-value for the fits of those two models. A significant p-value basically means that the model including the covariate of interest fits much better than the model without it. We then adjust for multiple testing using an FDR q-value. Ballgown also comes with a lot of visualization options: it'll calculate your differentially expressed genes or transcripts, and then it has a number of prepackaged visualizations to help you interpret your data. So we're going to go through that and look at examples using StringTie and Ballgown for differential expression and visualization. But there are alternatives to this pipeline, and we provide one such example.
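Before we get to those alternatives, here's a minimal sketch of what that Ballgown F-test looks like in R. The directory layout, sample names, and the "type" covariate here are assumptions for illustration; the tutorial itself walks through the exact paths.

```r
library(ballgown)

# Phenotype table: one row per sample, with the covariate of interest ("type")
# (sample names and groups here are made up for illustration)
pheno <- data.frame(ids  = paste0("sample", 1:6),
                    type = rep(c("UHR", "HBR"), each = 3))

# Load the StringTie output directories (one per sample) into a ballgown object
bg <- ballgown(dataDir = "ballgown", samplePattern = "sample", pData = pheno)

# Gene-level differential expression: nested-model F-test on FPKM values,
# comparing a model with the covariate against one without it
results_genes <- stattest(bg, feature = "gene", covariate = "type",
                          getFC = TRUE, meas = "FPKM")

head(results_genes[order(results_genes$qval), ])  # top genes by FDR q-value
```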
So the most common alternative to using FPKMs for differential expression statistics is using raw read counts. There are a number of reasons why people prefer this, but basically, instead of calculating FPKM, you simply assign reads or fragments to a defined set of genes or transcripts. Right there, there's an important distinction: you need to know what features you care about. Now we're not talking about inferring novel isoforms; we're telling the software, I know about these 20,000 genes, tell me how many reads support each gene. Of course, you can use a combined approach, where you use something like StringTie to infer isoforms and define your features, and then do a count-based approach where you assign counts to those features, but in general people do this for known genes and known transcripts. The software we're going to use is htseq-count. Another popular, widely used one is featureCounts. There's an example of how you run this command. There's an important caveat, which is that the author basically doesn't recommend it for transcript-level analysis. In general, especially for species like human, there's so much ambiguity between transcript isoforms, meaning there are often many isoforms that look essentially the same, where they share almost all the same exons and maybe there's just one exon skipped out of 20, which means only one unique junction. This count-based approach isn't very good at determining which reads are unique to one isoform versus another when the majority of the isoforms overlap. That's what the more sophisticated approaches like Cufflinks or StringTie were designed for. So even the author of the tool, if you read the posts, basically says: don't do that, that's why these other tools exist, it won't work very well for that. So it's really used at the more discrete feature level, where either you throw everything into one big bin at the gene level and don't try to tell whether a particular read is from one isoform versus another, or you go down to the exon level, where there isn't that much overlap between exons. But at the transcript level it doesn't work very well. Basically, all you're doing is asking the question, does my read overlap with my feature, where a feature could be a gene or an exon. But there are different ways of deciding what constitutes overlap. Overlap could mean the read is completely contained within the feature, or that it overlaps at least part of the feature, and then the question is how much of the feature. These things matter. If you have a read that spans two exons across an intron, or one that is inclusive of the intron, those could be counted differently depending on how the software is set up. So htseq-count provides several different modes. We recommend the intersection-strict method, which basically requires that the read overlap the gene but, for example, doesn't count a read towards gene A when the read includes the intron; we only want to count it when it spans the intron. So that's a nice feature of intersection-strict versus the other options. And it will not be able to assign a read in cases like this, where two genes overlap and the read doesn't have any unique component to it. But if the read generally overlaps gene A better than gene B, it still assigns that read to gene A; it doesn't fall into the ambiguous bin.
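If you prefer to stay in R, the GenomicAlignments package implements the same family of counting modes. Here's a minimal sketch, assuming you have a BAM file and a GTF annotation on disk; the file names are placeholders, not files from this course.

```r
library(GenomicFeatures)
library(GenomicAlignments)
library(Rsamtools)

# Build gene-level features from a GTF (placeholder file name)
txdb  <- makeTxDbFromGFF("annotation.gtf", format = "gtf")
genes <- exonsBy(txdb, by = "gene")   # exons grouped by gene

# Count paired-end fragments per gene with the intersection-strict rule
bam <- BamFileList("sample1.bam")     # placeholder BAM
se  <- summarizeOverlaps(features = genes, reads = bam,
                         mode = "IntersectionStrict",
                         singleEnd = FALSE, ignore.strand = TRUE,
                         fragments = TRUE)

head(assay(se))   # raw counts, suitable as input to edgeR or DESeq2
```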
So it is a little complicated which of these modes you choose, but I think the intersection-strict mode is generally quite popular. Once you have these counts, gene- or exon-level counts, you can do differential expression on them. You've probably heard of methods like DESeq, DESeq2, or edgeR. These are R packages that do differential expression analysis with many different types of statistics, and they assume that you're giving them raw counts. So you don't want to supply FPKM or TPM values to these; you want to start with raw counts, which means you need to use something like featureCounts or htseq-count. There's been a long-running debate about which approach you should use, and you can find countless papers and blogs arguing that the only correct thing to do is to use counts with a statistical method like edgeR or DESeq. You can find other papers and blogs arguing that the StringTie-style approaches are more accurate for various reasons. I think in general the count-based approaches are pretty widely regarded for straight gene-level differential expression. But FPKM-based approaches do have some advantages, especially if you want to leverage the benefits of the Tuxedo suite, in particular the isoform deconvolution. The count-based approaches don't give you novel isoforms and they don't give you transcript-level estimates, so you pretty much need one of the more sophisticated approaches if you want that. FPKMs, I think, are fine for visualization, like making heatmaps with visualization-friendly values, and they're fine for calculating fold changes in general. But if you want more robust statistical methods for differential expression, the count-based R packages, I would say, accommodate more sophisticated experimental designs, like time series or experiments with many different factors, and they offer a lot of options for different kinds of statistical tests. We typically do both and take the union. On the data set in this course, at one point we ran DESeq, Cuffdiff, and edgeR, and we showed that while there was substantial overlap or intersection, there were also quite a few differentially expressed genes that were unique to one method or another. In particular, there was a lot picked up by DESeq that was missed by the other two approaches, so it seems to be quite a sensitive method. So it all depends on your downstream application. If you're looking for something you can be really confident in, maybe the intersection genes are the most interesting, because they've been reproducibly predicted to be differentially expressed by multiple strategies. If you want to be more comprehensive and make sure you don't miss anything, and you're going to go on and do downstream functional validation, if this is more of a hypothesis-generating exercise, you might want to use the union of these approaches, or at least more than one approach, to make sure you don't miss anything important. Yeah, I have not personally used limma, but it's also very highly recommended and very popular. You can certainly find blogs and papers saying to use limma and nothing else. I think it's considered a very statistically robust approach; limma and DESeq2 are probably the most popular. Yeah.
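As a flavor of the count-based route, here's a minimal edgeR sketch. The count matrix here is simulated and the group labels are made up purely for illustration; in practice the counts would come from htseq-count or featureCounts.

```r
library(edgeR)

set.seed(1)
# Toy counts matrix (1000 genes x 6 samples) standing in for real htseq-count output
counts <- matrix(rnbinom(6000, mu = 100, size = 10), ncol = 6,
                 dimnames = list(paste0("gene", 1:1000), paste0("sample", 1:6)))
group <- factor(rep(c("UHR", "HBR"), each = 3))   # made-up group labels

y <- DGEList(counts = counts, group = group)
y <- y[filterByExpr(y), , keep.lib.sizes = FALSE]  # drop very lowly expressed genes
y <- calcNormFactors(y)                            # TMM library-size normalization

design <- model.matrix(~ group)
y   <- estimateDisp(y, design)                     # estimate biological dispersion
fit <- glmQLFit(y, design)                         # quasi-likelihood GLM fit
qlf <- glmQLFTest(fit, coef = 2)                   # test the group coefficient

topTags(qlf)                                       # top genes with FDR-adjusted p-values
```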
Is there an expectation of what reviewers will typically ask for for these kinds of things, or a best-practice intersection we should be going for? I mean, in general, what we've presented we hope are best practices and widely used; they're certainly defensible, with numerous precedents. But to be honest, and it depends on what kind of journal you're talking about, for the most part the kind of reviewers you get for biology journals don't really probe the differences between these methods, as long as the approach sounds reasonable to them. Just in my personal experience, having submitted probably dozens of papers involving differential expression analysis, it's rare that a reviewer comes back and says, why did you use edgeR or HTSeq in this mode, you should have done something else. That's just not that common; they're normally complaining about something completely different. I certainly encourage you to document the methods you use well, justify them, and provide citations for why you think they're best practices, but I don't see it come up that often with reviewers. It probably should, but it generally doesn't. So, a few lessons learned from the microarray days. We're getting to the point now where maybe many of you don't remember the microarray days, but there was a long period where, I think, people were so excited about RNA-seq and the advantages it offered over microarrays, in terms of sensitivity and dynamic range and the ability to discover what's really going on in your transcriptome, novel splicing, fusions, that people kind of forgot some of the basics of expression analysis and statistics. So, important things to remember. Power analysis: because RNA-seq is still quite expensive, it's super common to see unfortunate choices being made in experimental design. There are numerous papers, especially from the early days, published with one-versus-one comparisons, and there's really not much statistical validity to that.
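As a concrete, heavily simplified illustration of what a power calculation looks like, here's a back-of-the-envelope sketch in R using the kind of normal-approximation formula implemented by RNA-seq power tools such as the Bioconductor RNASeqPower package. The depth, biological CV, and fold-change numbers are made-up inputs for illustration, not recommendations.

```r
# Approximate per-group sample size for detecting a given fold change:
# n ~ 2 * (z_alpha/2 + z_power)^2 * (1/depth + cv^2) / log(fold_change)^2
# where "depth" is the expected read count for the gene and "cv" is the
# biological coefficient of variation between replicates.
rnaseq_n_per_group <- function(depth, cv, fold_change, alpha = 0.05, power = 0.8) {
  z <- qnorm(1 - alpha / 2) + qnorm(power)
  ceiling(2 * z^2 * (1 / depth + cv^2) / log(fold_change)^2)
}

# e.g. ~20 reads covering the gene, 40% biological CV, looking for a 2-fold change
rnaseq_n_per_group(depth = 20, cv = 0.4, fold_change = 2)   # roughly 7 per group
```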
So just because RNA-seq is in many ways a better method than microarrays doesn't mean the biological variability goes away. It's still there; we're actually measuring it better, but you still need replicates. There are some tools we provide links to for doing power analysis for RNA-seq experiments, and for thinking about the need for biological replicates and good study design, which really hasn't changed at all but has sometimes been overlooked. Multiple testing correction: it's not only still a problem, it's probably even more of a problem, because, as you know, as more attributes are compared, it becomes more likely that any given difference appears by chance alone, and RNA-seq gives us access to far more features or attributes than microarrays. A microarray might have had 20 or 30 thousand genes on it; now we're measuring those same 20 or 30 thousand genes, but we can also get expression levels for individual exons, for every individual isoform, for every specific junction, and you often end up doing statistical tests at multiple different levels, or for all of these kinds of features. So you have more tests than ever, which means you have a bigger multiple testing problem than ever. Keep multiple testing in mind: there are good packages for multiple testing correction, and you may need to think about test reduction methods as well. And then the downstream interpretation of expression analysis is really a topic for another course or another day, so we don't have time to go into it all here, but the expression and differential expression estimates you get out of these pipelines are obviously the input for many kinds of downstream analyses: clustering, producing heatmaps, classification analysis, pathway analysis, and so on. We've recommended some of our commonly used packages and tools for these types of analyses, but unfortunately those are probably a topic for another course. If you have specific questions about any of these topics, like this is what you're trying to do with your data, come and talk to us and we can give you some pointers or point you to some resources. So that's the lecture for expression and differential expression. We're going to move back to the practical exercises, and just to orient you again, we're over here in the workflow now: we've got all of our input data, our raw sequence, our reference genome, our gene annotations; we've performed our read alignments; and we're now going to do transcript compilation and then expression estimation, differential expression calculations, and visualizations. I don't think we actually need a break right now, so we'll probably just go until lunch. I'll give a really quick overview of the tutorial and then we can jump right into it. What we're going to do in this tutorial is generate gene- and transcript-level expression estimates, perform differential expression analysis with Ballgown, and then summarize and visualize some of those results using R. Optionally, we're also going to do the count-based approaches alongside. This is just a different way of depicting the overall RNA-seq analysis flow; we're down here at transcript assembly and abundance estimation, but now using the StringTie approach instead of Cufflinks. For alignments, you already have BAM files from the previous steps. We're going to use a couple of options now, in this part of the course, to do the so-called reference-guided method, so we're going to
use the -e option with StringTie to calculate expression values for just the known transcripts. So for right now, we're not going to try to use the data to infer novel isoforms; we're just going to calculate expression levels for transcripts we already know about, based on the Ensembl GTF. We're going to generate both FPKM and TPM values using StringTie. We're also going to play around with the count-based approach, which means we're going to have two kinds of expression estimates at the end: the HISAT2 alignments together with StringTie, and then the htseq-count approach. Then we're going to perform differential expression analysis using Ballgown. From the six libraries, we're going to combine expression estimates across the replicates and compare UHR and HBR to identify significantly differentially expressed genes and isoforms between these two very different sets of source samples. This is an artificial comparison of brain tissue to a mixture of other tissues, so we should see lots of differences. Then we're going to use Ballgown to create some visuals to summarize the data. Optionally, we can also use some old-school R commands; we've provided some alternatives to the Ballgown approach if, for whatever reason, Ballgown doesn't do what you need it to do and you need to drop down to base R for other kinds of visualizations. We provide a number of sample scripts for that. If you want, you can compare the results of your differential expression analysis between the htseq-count/edgeR approach and the StringTie/Ballgown approach. And then we'll also look back at the ERCC spike-in data and see how well the expression estimates we come up with correlate with the abundances expected from those spike-ins. I don't know if we've talked about the ERCC spike-ins yet, but just remember that these are transcripts corresponding to non-human, I think bacterial, sequences that aren't expected to be in the sample other than because you spiked them in. There's basically a gradient of known expression levels, and there are two mixes, mix 1 and mix 2, where different transcripts are present at different known quantities. So not only do you have an expectation of known expression for a bunch of transcripts, you also have an expectation of known differences between mix 1 and mix 2. We spiked mix 1 into condition A and mix 2 into condition B, so we can also look and see how well the differential expression, the fold change or p-value, corresponds with the level of difference we would expect based on the spike-ins. Those are sanity checks that what we're getting out is sensible.
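As a preview of that sanity check, here's a minimal sketch in R of the kind of correlation we'll look at. The file names and column names are placeholders (assumptions for illustration); the tutorial provides the actual ERCC concentration table and expression output.

```r
# Placeholder inputs: a table of known ERCC concentrations and a table of
# observed expression estimates (e.g. StringTie TPMs) for the ERCC transcripts
ercc_known <- read.delim("ercc_expected_concentrations.tsv")  # expected amounts
ercc_obs   <- read.delim("ercc_observed_tpm.tsv")             # observed TPM per ERCC id

m <- merge(ercc_known, ercc_obs, by = "ercc_id")              # assumed shared id column

# Expected vs observed on a log2 scale; a good experiment shows a strong correlation
plot(log2(m$mix1_concentration), log2(m$tpm + 1),
     xlab = "log2 expected concentration (mix 1)",
     ylab = "log2 observed TPM + 1")
cor(log2(m$mix1_concentration), log2(m$tpm + 1), method = "spearman")
```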