Okay, so this is Module 3. Just like before, it's Creative Commons Share-Alike. We're going to cover the next topic in the workflow, which is expression and differential expression. The learning objectives for this module are to introduce you to the concept of expression estimation for known genes and transcripts, to talk about the terminology and definitions around FPKM and how that differs from using raw counts, to briefly cover some differential expression methods, and, very briefly, the downstream interpretation of expression and differential expression.

You've probably already started doing this in your mind, looking in IGV at the alignments that you made yesterday, where you have, in an IGV view, each of your reads aligned to some part of the genome. Since this is RNA-seq data, we were generally seeing quite nice alignments to the expected transcribed portions of genes. Some things you can see immediately from those coverage plots: remember, the top row gives you a kind of histogram estimate of the number of alignments, the number of reads, at any given position. From that, you can already start to get the idea of expression and differential expression levels, right? If you saw a lot more reads piling up in the tumor than the normal, you might say that this gene is up-regulated or down-regulated. I guess in this case the tumor is on the bottom. You can also see things like, in this case, 3' end bias: the coverage is really low at the 5' end, and it increases and is highest at the 3' end. This is quite commonly observed for poly-A selected libraries where your RNA is a bit degraded and fragmented.

So really the whole purpose of this module is to formalize the estimation of expression levels and differential expression along the same lines as what you're doing in your brain when you compare the numbers of alignments in one condition versus another. Probably the most common way this is done is by calculating an FPKM or RPKM value. These are basically the same concept, highly related terms: they stand for reads per kilobase of transcript per million mapped reads, or fragments per kilobase of transcript per million mapped reads. As for the distinction between fragments and reads, pretty much everyone refers to FPKM now. In the early days, before you had paired-end reads, each fragment of RNA that was sequenced was represented by just a single read, so you could say, okay, let's calculate reads per kilobase. But then, with paired-end data, you had two reads per fragment, so it was a little ambiguous what you meant by reads per kilobase per million. The terminology was changed to fragments because the two reads are measuring the same single fragment.

In RNA-seq, the relative expression of a transcript is hopefully proportional to the number of cDNA fragments that originate from it. But there are, of course, some biases that you have to take into account. The number of fragments is going to be biased towards larger genes: the bigger your gene is, the more pieces it fragments into when you're constructing your library, and the more chances that a fragment will randomly come from that big gene versus a small gene, just because there's more space from which fragments can occur.
So you need to take gene size into account, especially when you're comparing expression-level estimates between genes. A large gene and a small gene might have the same number of transcripts per cell, but you will get more fragments from the large gene because there's more RNA to fragment and to get sequence reads off of.

The other thing you need to account for is the total number of fragments in the library. If you have two libraries from, let's say, biological or technical replicates, and you do one lane of sequencing for one replicate and four lanes for the other, then for any given gene or transcript you're going to have, in absolute terms, a lot more reads or fragments from the library that you sequenced four times as much, right? And there's always some variability in how much sequence data you get. Even if you try to make it as similar as possible, do one lane each, keep the cluster density very similar and so on, you're still going to have some variability in the total output of data from each lane. That doesn't have anything to do with the number of transcript copies in the cell; it's just a function of how efficient the run on the machine was, or how many lanes you ran, or whatever. So you need to account for that library depth.

The FPKM or RPKM is basically trying to do exactly that: normalize for the size of the gene and the total depth of the library in one go. And it's just a simple formula. It takes the number of mappable reads or fragments for a given feature, which could be a gene or a transcript or an exon, multiplies that by 10^9 (which is the thousand times the million), and divides by the total number of mappable reads in the library times the number of base pairs in the gene, transcript, or exon. That's how you calculate an FPKM. There are some good Biostars posts on this topic if you want to read more about FPKM. The constant is arbitrary: we just decided to normalize to an imaginary gene of 1,000 bases and an imaginary library of 1 million reads, and 1,000 times 1 million is 10^9. It could have been FPKB, fragments per thousand bases of transcript per billion-read library. It's just a choice that was made; it's not important, you could use something else. It's just a convention.
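To make the arithmetic concrete, here's a minimal sketch of that formula in R; the function name and the example numbers are made up for illustration:

```r
# FPKM = (fragments mapped to the feature * 10^9) /
#        (total mapped fragments in the library * feature length in bp)
fpkm <- function(feature_fragments, total_fragments, feature_length_bp) {
  (feature_fragments * 1e9) / (total_fragments * feature_length_bp)
}

# e.g. 500 fragments on a 2,000 bp transcript, in a library of
# 25 million mapped fragments:
fpkm(500, 25e6, 2000)  # returns 10
```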
So how does Cufflinks work? Cufflinks is a very sophisticated piece of software with very complex statistical models built into it. It is not a straightforward thing to explain in one slide, but that's what we're going to do. Conceptually, the way it works is that you have your TopHat-aligned reads: your paired-end reads have been aligned to various places in the genome, and you may or may not have known transcripts to map those to. Overlapping fragments are bundled together and assembled, and then these assembled bundles of fragments are connected in an overlap graph. What Cufflinks tries to do is find the minimal number of paths required to cover the graph that's created. You'd have to read about graph theory to understand the details, but basically it looks at mutually incompatible fragments, which result in different paths through the graph, and tries to infer the minimum number of different isoforms that would explain that graph.

So in this toy example, you get, for example, these three transcripts: one with exon one and exon three, skipping over exon two; another isoform that includes all three exons; and then one that basically reads through the whole region. It tries to infer all of these possibilities from the sum of the different fragments that align to a particular locus. Then the abundance of each of those inferred isoforms is estimated with a maximum likelihood probabilistic model, which tries to make use of information like the fragment length distribution and other details. So it basically tries to assign each fragment to one or more of the isoforms that it has inferred from the data itself. The good thing about this is that it doesn't necessarily need to know what the transcripts are: it's determining the transcript structure, the different isoforms, from your data. You can then map that back to known transcripts, and that's what we're going to do today using the reference-only mode of running Cufflinks. Does it accept known transcripts as an argument? Yeah, pretty much, and that's the way we're going to run it today.

There's actually a pretty good overview if you go to this link. I added it recently, so it might not be in your printouts, but it's in the version on the Wiki, or if it isn't, I'll put it there. Or you can just search for "cufflinks how it works" and you'll find it. It was created by the creators of the tool, and they briefly explain each of the different components of Cufflinks and how it works, and they link to all of the papers. There are probably five or six different papers that they recommend you read to really understand what Cufflinks is doing, but they do a nice job of giving an educational overview and pointing you to the key papers.

Same kind of thing with Cuffdiff. It's fairly sophisticated, maybe not quite as complicated as Cufflinks. What it does is try to determine the variability in fragment count for each gene or transcript across replicates. The fragment count for each isoform is estimated in each replicate as described on the last slide, and then a measure of uncertainty is determined that arises from the ambiguity of read mapping. In simple terms, transcripts with more shared exons and fewer uniquely assigned fragments will have more uncertainty. I think it's easier to understand if you think about a simple example. Pretty much all genes are alternatively spliced, right? There's more than one transcript for almost every gene. Say two isoforms differ by an exon skip: there are going to be certain features, certain reads, for example reads that map across the skipping junction, that uniquely differentiate one transcript from the other. But the more similar the different isoforms of a gene are, the harder it is to know where a read came from; any reads that align in the shared regions could have come from either isoform. When you have genes whose isoforms have lots and lots of similarity and only a few subtle differences, the individual isoform-level estimates of expression have more uncertainty in them.
So the more unique features there are differentiating the isoforms, the more certainty you have, and Cuffdiff tries to estimate that uncertainty. It combines the uncertainty from mapping ambiguity with the variability you observe across your replicates to come up with a kind of total measure of uncertainty, or variance, for each expression estimate. That's then modeled in the test statistic to determine statistical significance. So if you're comparing a set of 10 normals and 10 tumors, it's going to take into account the uncertainty in the estimates and the variability within the conditions to determine whether a certain fold-change difference between the two conditions is significant or not.

In this tutorial today we're going to run Cufflinks, Cuffdiff, and Cuffmerge, so we have a quick slide on why Cuffmerge is necessary. The reason for Cuffmerge is that assembly process that Cufflinks is doing. It's not just mapping reads to known isoforms; it's trying to infer the isoforms from the data. Even if you give it a GTF file and you're guiding it, it still goes through this process of saying: based on this RNA-seq data, I think there's a transcript that looks like this, and a transcript that looks like that. The problem is that when you run it on two different samples, it doesn't come up with exactly the same set of possible transcripts. The data is different, even for technical replicates, just because of random variability in the reads, so there might be slight differences between the transcripts inferred in one sample versus another. When you then go to do differential expression, you don't have an apples-to-apples comparison. You're not able to directly compare them until you run Cuffmerge, which goes through all of the samples you've run Cufflinks on and comes up with one common representation. It merges the different assemblies into one consistent assembly of the different isoforms, and expression estimates can then be assigned consistently across all the samples.

The simplest example of how this works: let's say in sample one, Cufflinks inferred two isoforms because it had reads supporting these exons as well as some supporting these junctions. In another sample, maybe the second isoform wasn't expressed; only the top isoform was detected, and there were no reads supporting the skipping junction, so Cufflinks didn't infer that isoform. When you run Cuffmerge, it'll say: okay, across all the samples, these two isoforms exist. This isoform in this sample is the same as that one; the other one is missing in this sample, but I'm going to assume it could also exist there and assign it a value of zero. It's not there, but in the data file it will now have a value. Then, when you're doing differential expression, it can say: okay, there's a value of zero here and something there, so maybe this is an up-regulated isoform. That's why you need Cuffmerge. It also does some filtering steps. Again, at this stage you can provide a reference GTF to merge these novel or inferred isoforms with known isoforms from Ensembl or UCSC or something like that, and it makes a final GTF file that you can use with Cuffdiff.
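Just to make that zero-filling idea concrete, here's a toy illustration in R; the isoform names and FPKM values are made up:

```r
# Sample 1 detected two isoforms; sample 2 only detected one.
sample1 <- data.frame(isoform = c("iso1", "iso2"), fpkm = c(12.0, 3.5))
sample2 <- data.frame(isoform = "iso1", fpkm = 11.2)

# Merge onto one common set of isoforms, like cuffmerge does conceptually,
# and give the undetected isoform an explicit value of zero.
merged <- merge(sample1, sample2, by = "isoform", all = TRUE,
                suffixes = c(".sample1", ".sample2"))
merged[is.na(merged)] <- 0
merged  # now an apples-to-apples comparison is possible
```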
So once you've run Cufflinks, Cuffmerge, and Cuffdiff, you're basically at the stage where you want to summarize, explore, and interpret the data, and they've provided an R package, CummeRbund, to help you do this. I don't want to say it's user-friendly, because it's quite complicated, but it's a way to get to fairly sophisticated representations of your data really quickly. If it works, you can basically take your Cuffdiff output, feed it into CummeRbund, and crank out all these different kinds of visualizations and plots: distribution plots, correlation plots, MA plots, clustering, heat maps. I've shown a few examples here, and we're going to run through this, so you're going to get a file with all of these kinds of plots for the data you've been working with. You can get your typical heat map showing the expression level of genes in, say, tumor versus normal. You can look at expression changes at the per-isoform level; these are four different isoforms of the same gene. You can also get representations of the isoforms that were predicted by Cufflinks, compare those to known isoforms from Ensembl, and look at other tracks, like conservation tracks. So you can do a lot of useful stuff with it.

Sometimes there are issues. Usually, when I've had problems, it's been version compatibility: you've upgraded your Cuffdiff software but not your CummeRbund package, and then you try to import your Cuffdiff output and it throws a bunch of errors because it didn't understand it. In my experience it tends to be all or nothing. It usually works, and if you have a problem, it's usually something like an incompatibility between the software versions you ran. Especially if you ran Cuffdiff, say, six months ago, and now you fire up R and don't realize that you've updated the CummeRbund library, so CummeRbund is newer or older than the Cuffdiff output. Yeah, you can sometimes have issues.

So in the course materials we provide two parallel paths to get representations like these. Not all of them, but they're highly overlapping. One we've called the old-school R tutorial, which just uses core or popular functions in R to produce these kinds of plots. It has nothing to do with the Tuxedo suite or Cufflinks; it's just regular R. And the other is CummeRbund. I don't know which one we'll go through today. There are advantages and disadvantages to both, but we'll go through one of them, and you can go through the other on your own time, so you'll see there are two or three different ways to produce these kinds of plots. If you find CummeRbund too confusing or too black-box-like, and you want to create plots more from the ground up, the supplementary old-school R tutorial is probably a better starting point. No matter what you do, you'll find there's something about a plot you want to change, and I usually find it's easier to tweak plots I've created from the ground up, as opposed to CummeRbund ones, where I don't know all the idiosyncrasies of how they created those plots.
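To give you a flavor, here's a minimal sketch of the CummeRbund workflow; the output directory name and the sample labels ("normal", "tumor") are assumptions that depend on how you ran Cuffdiff:

```r
library(cummeRbund)

# Load the Cuffdiff results from its output directory (hypothetical path)
cuff <- readCufflinks(dir = "cuffdiff_output")

csDensity(genes(cuff))                     # FPKM distributions per sample
csScatter(genes(cuff), "normal", "tumor")  # pairwise expression scatter
csVolcano(genes(cuff), "normal", "tumor")  # fold change vs. significance

# Pull out significantly DE genes and draw a heat map of them
sig <- getSig(cuff, alpha = 0.05, level = "genes")
csHeatmap(getGenes(cuff, sig), cluster = "both")
```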
So there are alternatives to FPKM. It's probably the most common way of representing expression values for RNA-seq data, but the next most common would be to use the raw read counts, and that approach is very popular specifically for differential expression analysis. Instead of calculating FPKMs, you just assign each read or fragment to some definition of genes or transcripts and determine the raw counts. So in these examples, we would literally say: I have this definition of an isoform, and I'm going to count up the reads that I see aligning to it. There are different parameters you can use to deal with reads that could be counted towards one isoform or another. At the gene level, you would collapse everything into one representation of the gene and count up all the reads that align to that gene locus, so you'd have a raw count, like 1,522 reads aligned to this gene locus. You can then use statistics that are specifically designed to work with raw counts to determine differential expression between your conditions.

The software that we're going to use to do the counting is called htseq-count. There are others; we'll go through how to run it in the modules. There are some caveats to doing transcript-level analysis this way. Basically it's kind of tricky, and not that advised; the author of the tool himself doesn't really guarantee good results at the transcript level. So we personally use this mostly for gene-level differential expression, but you can read the debate on transcript-level use.

So which should you use, FPKM or raw counts? We tend to use both, side by side, for different purposes. If you want to leverage the benefits of the Tuxedo suite, like the de novo assembly of transcripts, the general ease of use of the software, the CummeRbund visualizations, and all those kinds of things, then you'd probably go with FPKM estimates. FPKM values are good for visualization; it's tricky to visualize with raw read counts, because as soon as you start to use them for that, you tend to recreate some version of an FPKM value, for the reasons we discussed: raw counts don't take into consideration the total library depth or the size of genes. As soon as you start comparing one gene to another, you're going to mislead yourself if you're not accounting for gene size and so on. So for creating visualizations, calculating fold changes, that sort of thing, we'd typically use FPKM. But in terms of having robust statistical methods for differential expression, the raw count methods are very popular, and the software packages that do differential statistics on raw counts tend to have more sophisticated options for specifying experimental designs. If you have a complicated study, maybe a multi-factorial design or a time series or something like that, you might find that some of the count-based differential expression software in R works better for you. I've just listed a couple of the more popular ones: DESeq and edgeR. There are others that do count-based differential expression.
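As a taste of the count-based approach, here's a minimal edgeR sketch; the counts file name and the three-replicates-per-condition design are assumptions for illustration:

```r
library(edgeR)

# A gene-by-sample matrix of raw counts, e.g. assembled from htseq-count output
counts <- read.delim("gene_counts.tsv", row.names = 1)
group  <- factor(c("normal", "normal", "normal", "tumor", "tumor", "tumor"))

y <- DGEList(counts = counts, group = group)
keep <- rowSums(cpm(y) > 1) >= 3   # pre-filter weakly expressed genes
y <- calcNormFactors(y[keep, ])    # TMM normalization for library depth
y <- estimateDisp(y)               # model biological variability across replicates

topTags(exactTest(y))              # top DE genes, with FDR-adjusted p-values
```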
I think it was actually the authors of Cuffdiff who produced this figure, showing how the recent version of their software, Cuffdiff 2, compared against the count-based methods. The numbers shown are the numbers of differentially expressed genes predicted to be significant in a particular study. The sets are highly overlapping, but there are significant non-overlapping portions as well: some genes unique to Cuffdiff, some unique to edgeR, actually quite a large number unique to DESeq, and a pretty good overlap overall. So I think you can argue that if you really don't want to miss anything, if you want to catch every potential differentially expressed gene or feature and then go validate those at the biological level, taking the union of more than one method may be advisable. That's what we typically do.

Just a few reminders: just because RNA-seq is such an awesome technology, it doesn't absolve us from the basic facts of statistics and experimental design. You would never be able to tell that from reading the RNA-seq papers of the last five years, but it's true. You still have biological variability that you need to account for. It's expensive, I get it. RNA-seq was very exciting but too expensive, so in the early days there were literally papers published comparing one RNA-seq library to another RNA-seq library for two conditions. So you have an N of 1. Statistics are impossible, and though you do see people calculating some kinds of statistics in those situations, they're not really valid. There are tools now, like Scotty, that help you do a power analysis for your RNA-seq experiments, to get a sense of how many replicates you might want. You do need replicates. There are Biostars posts discussing this in detail, and also discussing different study designs.

Another important lesson that was well learned in the microarray days is the need for multiple testing correction. I don't know if most of you are aware of the concept of multiple testing correction. Not really? Okay, some of you. The idea is that as more attributes are compared, there's a greater chance that the treatment and control groups will appear to differ on at least one attribute by random chance alone. Because you have variability in your data, when you compare, say, a set of normals to tumors, there's a certain amount of random noise. Just by chance, if you keep comparing gene after gene, eventually the cookie is going to crumble in such a way that all the samples were a little bit higher in the tumor and a little bit lower in the normal, and it's going to look like a significant difference. But if you repeated the experiment in another biological replicate of the whole study, that difference would disappear, because it was just random noise. It's simply a function of doing multiple tests that more and more of the results will be spurious, due to random fluctuations in the data. This was well known from array studies, where you had tens of thousands of genes, or maybe hundreds of thousands of exons, and you needed to correct for the number of tests you did. Whatever your p-value is, you have to penalize it for the number of tests you ran, so that when you see a corrected p-value, you can really believe: okay, even given the possibility of multiple-testing-related false positives, this is likely to be truly significant.
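As a toy illustration of that penalty, with made-up p-values (base R's p.adjust does the correction):

```r
# Six raw p-values from six hypothetical gene-level tests
raw_p <- c(1e-6, 0.0004, 0.011, 0.03, 0.2, 0.7)

# Benjamini-Hochberg false discovery rate correction
p.adjust(raw_p, method = "BH")

# The more conservative family-wise correction, for comparison
p.adjust(raw_p, method = "bonferroni")
```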
With RNA-seq, the problem is bigger than ever because of the complexity of the transcriptome. Now we're not measuring just genes and exons; we're measuring every exon-exon junction, we're measuring intron expression, we're inferring all these novel isoforms. You could have literally millions of different features that you might potentially want to run a differential expression test on, so multiple testing is more important than ever. You really want to think about issues like pre-filtering your data, to limit the number of tests to begin with, and running multiple testing correction. There's a really good Bioconductor package called multtest, with reasonably good documentation, that you can read on how to do this. Also, because this is a well-recognized issue, a lot of software now builds it in. When we run Cuffdiff, it's going to give us both raw p-values and corrected p-values that take the multiple testing issue into account. But I don't know whether all of the count-based methods build that in or leave it up to you, so you need to read the methods, and if they haven't provided a reasonable multiple testing correction, you'll want to do it on your own.

Downstream interpretation is definitely something we'd like to have a whole separate course on. Basically, the expression estimates and differential expression lists you get from Cufflinks and Cuffdiff, or the other workflows we're going to discuss, can be fed into many of the analysis and interpretation pipelines that exist and are used for other kinds of data, like microarray data. We do have the supplementary R tutorial and the CummeRbund tutorial, which give you a start on some of the kinds of visualizations you might want, like heat maps and so on. If you want to do classification analysis, there are lots of good packages in R; WEKA is a good learning tool, and I often use the randomForest R package. There's actually a series of four tutorials on Biostars on how you would go through that process. It's based on microarray data, so it involves some microarray preprocessing, but then there are tutorials on how to do classification analysis and survival analysis, and those kinds of approaches work just as well for RNA-seq data as they do for array data; there's really no important distinction. You can do pathway analysis using things like DAVID, Ingenuity Pathway Analysis, or Cytoscape, and there are many good R/Bioconductor packages for that as well.

So the last, or not the last, the second-last module is going to cover these sections. We have our raw sequence data, we've performed our read alignment, and we're now going to do transcript compilation and expression estimation using Cufflinks. We're going to merge those, in this case onto our known GTF of known genes, then run Cuffdiff for a tumor-versus-normal comparison, and then visualize those results in either CummeRbund or the old-school R tutorial. To briefly recap what we're going to do in the exercises: we're going to generate expression estimates with Cufflinks, run Cuffmerge and then Cuffdiff, and then summarize the results with either CummeRbund or the old-school R methods. There are a lot of files for all those parts. The main tutorial, Module 3 Part 1 (Linux), you'll see on the Wiki; that's what's going to get you through the Cufflinks, Cuffmerge, and Cuffdiff commands, all the Linux commands, and it creates the files that you need to run the CummeRbund part (that should actually say .R, which is the R script). It also creates the files you need for the supplementary, old-school R tutorial.
And then there's an optional part where, instead of using Cuffdiff for differential expression, we've provided some sample code on how you would do it with edgeR and the raw counts. We probably won't have time to get to that, but you can take it away with you, and if you decide that you want to go with the raw count method, there's some sample edgeR code to get you started.

So, generating expression estimates: we're going to start with the aligned SAM and BAM files that we generated yesterday, and these are going to be used by Cufflinks. We'll do it for all transcripts on the target chromosome, which is still chromosome 22. There's an option in this step called -G (capital G); I guess there was a similar option in TopHat. This forces Cufflinks to calculate expression values only for the known transcripts in the GTF file that you provide. To discover novel transcripts with Cufflinks, you would drop the -G option and do de novo transcript assembly; that's going to be done in Module 4, so you can just wait and we'll show you how to do that. You can also use the -g (lowercase g) option instead of the -G (capital G) option, in which case the known transcripts are used as a guide, but novel transcripts are still predicted. Malachi is going to go over these tomorrow. Later today, sorry, not tomorrow; this afternoon, I should say. These are basically the three main modes of running Cufflinks. Today we're just going to do the simplest one, which says: we know what the transcripts in our genome are, so we'll work with known isoforms. We're going to get one isoform-level and one gene-level expression file for each library, containing the FPKM estimates; we already talked about what FPKM is.

We're also going to run some parallel, optional alternatives as we go along. We already started this yesterday, right? We ran both the TopHat and STAR aligners, so we're going to keep going with those if you want to see how it differs, or doesn't differ, to use the STAR alignments versus the TopHat alignments. And then we're also going to explore the count-based method, so you'll run htseq-count. In the end we'll have three expression estimates for each sample: TopHat alignments followed by Cufflinks, STAR alignments followed by Cufflinks, and TopHat alignments with htseq-count. Of course, we could also have done STAR with htseq-count, but it just gets a little bit ridiculous.

Then we'll do the differential expression analysis. First we have to run Cuffmerge, which combines the transcript assemblies from the four Cufflinks runs into one combined result. We'll then compare tumor and normal to identify differentially expressed genes and isoforms. These commands can get pretty complicated when you have replicates, so especially if you're doing the integrated assignment later, be really aware that spaces, commas, and positioning matter in the way you set up these commands. The comparison we're going to do is just tumor versus normal across all replicates. And then we'll summarize and visualize those results using either CummeRbund to produce these kinds of plots, or some fairly limited post-processing of files followed by pulling them into R for the old-school R analysis, which replicates a lot of what CummeRbund does.
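If you want a quick look at the per-library Cufflinks output from R before any of that post-processing, here's a minimal sketch; the directory name is hypothetical, but genes.fpkm_tracking is the standard gene-level output file name:

```r
# Cufflinks writes tab-delimited FPKM tables per library
gene_fpkm <- read.delim("tumor_rep1_cufflinks/genes.fpkm_tracking")

# A few columns of interest
head(gene_fpkm[, c("gene_id", "gene_short_name", "locus", "FPKM")])
```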
So we'll look at things like technical replicates and the correlation between the different libraries, visualizing the differences between and among replicates, and then examine the differential expression estimates and get a list of the top differentially expressed genes. If you follow through all of the materials, including the optional and extra material, you will eventually have gene expression estimates and differential expression estimates from both Cufflinks/Cuffdiff and edgeR. You can see the result I got when I did it, which is that there is fairly significant overlap, and also non-overlap, between those methods. So again, you may want to consider using one or the other, or both, when predicting differentially expressed genes from your data. Okay.
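If you do end up with both lists, a few lines of R are enough to compare them; the gene names here are made up for illustration:

```r
# Hypothetical significant-gene lists from the two methods
cuffdiff_genes <- c("TP53", "EGFR", "MYC", "BRCA1")
edger_genes    <- c("EGFR", "MYC", "KRAS")

intersect(cuffdiff_genes, edger_genes)  # found by both methods
union(cuffdiff_genes, edger_genes)      # the "don't miss anything" set
setdiff(cuffdiff_genes, edger_genes)    # unique to Cuffdiff
```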