So, this is module 3, where we take the results of modules 1 and 2, basically our alignments, and try to calculate expression and differential expression estimates from that data. The learning objectives are to talk about how we generate expression estimates for known genes and transcripts, to talk a little bit about FPKM versus raw counts, to introduce the differential expression methods, and then to talk very briefly about some of the downstream interpretation issues, which are really the topic for a whole other course.

We've already seen a view like this, but this is the basic concept of RNA-Seq when it comes to expression analysis. We're starting from alignments; that's our raw data at this point. We're looking for cases like this, where, based on the coverage track, we see different amounts of expression between different samples, and we're also looking qualitatively at which transcripts we're seeing evidence of expression for. This is a case where we're seeing some down-regulation, and I'll just point out that the 3' bias Malachi described earlier is quite apparent here: the coverage really tails off as you move away from the 3' end.

So, FPKM and/or RPKM, you're going to hear a lot about. This is the most common measurement described for expression analysis of RNA-Seq. It was invented right in the first days of RNA-Seq, or really even earlier; back in the days of SAGE we had a similar concept. They're basically the same thing; it's really just terminology: reads, or fragments, per kilobase of transcript per million mapped reads. Of course, there are a lot of different ways to arrive at these values. Conceptually they're the same, but in the details they can differ quite a lot.

They're based on the concept that in RNA-Seq the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. But we have to be aware that the number of fragments is biased towards larger genes. If you're just randomly sampling, and you take a larger gene and break it into pieces, you're going to get more reads from that gene simply because it occupies more space, and we don't necessarily want that reflected as a higher expression level for that gene. Also, the total number of fragments is related to the library depth: the more you sequence, the more your fragment counts go up, and apparently your expression levels go up, but that's not reflecting any change in the underlying expression of that gene in that sample; it's just being sequenced more deeply. So these two measures, FPKM and RPKM, are an attempt to simultaneously normalize, or scale, for gene size and library depth, so that you have some kind of apples-to-apples comparison between genes and between samples. Yeah? No, sorry, it's RPKM slash FPKM, and these are basically the same thing: just reads versus fragments. The "fragments" terminology basically came about when we started to do paired-end sequencing, and we didn't want to call them reads anymore because we're counting two reads as representing one fragment. You'll see FPKM much more commonly now; RPKM isn't discussed as much.

The normalization is being done in this formula. You're taking the number of mappable reads for the transcript and dividing by the total number of mappable reads in the library, and by the number of base pairs in transcript X, and so on. And this factor of 10^9 is what gives you the "per thousand bases" and "per million fragments"; the gene size is factored in through the L value.
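Written out, the formula described above takes the usual form (the standard FPKM definition, not shown explicitly in the lecture):

```latex
\mathrm{FPKM}_X = \frac{C_X \times 10^{9}}{N \times L_X}
```

where C_X is the number of fragments mapped to transcript X, N is the total number of mapped fragments in the library, and L_X is the length of transcript X in base pairs. The 10^9 is just 10^3 (per kilobase) times 10^6 (per million fragments). RPKM is the identical formula with reads in place of fragments.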
And it's simultaneously trying to account for both the gene length and the total depth of the library. There are more normalization schemes than just these two, and there are a bunch of Biostar posts that go into great discussion about what an FPKM really is and what it means.

So, Cufflinks is a really complicated method. This figure from their paper tries to summarize it; in reality it's much more complicated than this, and even I find this figure fairly complicated. But the general idea is that you're taking alignments, and you have these bundles of alignments, which are basically assemblies of reads. Those assemblies of reads are mapped into an overlap graph, and then transcript isoforms are inferred as the minimum number of paths required to cover that graph. It's not a perfect method, but it's based on taking a bunch of reads in a region, making an assembly of them, and asking: what is the minimum number of transcript paths that could explain this graph? Based on that, it defines transcripts, and it assigns abundances to each of those isoforms using a maximum likelihood probabilistic model, which is also very complex. So if you want to know the nitty-gritty details, you really have to read the Cufflinks paper and probably get a statistics degree.

In the end, Cufflinks produces basically its own concept of the transcriptome, and we're going to run it in a few different modes: you can guide it to use our current known idea of the transcriptome, from something like Ensembl, or you can run it in a totally de novo mode, where it infers all the transcripts in your samples without any preconceived notion of what the transcriptome should look like. And there are modes in between those two extremes. It also tries to be clever and make use of information like the fragment length when it's assigning each read to one of these putative transcripts. It'll ask: if this read looks like it's spanning this many introns, going from exon 1 to exon 5, is that likely given the reported fragment size of this library? It uses these and other pieces of information to assign reads to the most likely transcript and then come up with estimates per transcript. So it's much more sophisticated than a lot of the simple counting methods, and it generally works pretty well, but it's not perfect either. From that you get Cufflinks-based estimates, in the form of FPKMs, for all the transcripts in your transcriptome, which has been inferred by some process of assembly.

Then Cuffdiff basically takes those values, and it models the variability in fragment counts for each gene across the replicates, if you hopefully have replicates. The fragment count for each transcript is estimated in each replicate, as before, along with a measure of uncertainty that arises from ambiguously mapped reads. So transcripts with more shared exons and few uniquely assigned fragments have greater uncertainty, and that makes sense. Basically, say there are two transcripts and they're largely the same, and this happens all the time, which is why it's such a challenging problem, and maybe they just have a very slight difference in the last exon. There's a lot of uncertainty.
A read that maps to exon 1, which is used in exactly the same way in both transcripts, gives you no way of knowing whether that read came from isoform A or isoform B. So Cuffdiff tries to do its best with the regions of unique content, or the distinct junctions, that distinguish the transcripts. But at the end of the day, the more similar a set of transcripts is, the more uncertainty there is about which isoform a fragment comes from, and that is incorporated into the model. It combines these estimates of uncertainty, along with the cross-replicate variability it observes, into what they call a beta negative binomial model of fragment count variability, to estimate count variances for each transcript. So now you've got FPKM estimates of expression level, plus some concept of variance between samples for those transcripts, and those variance estimates are used for statistical testing to identify differentially expressed genes and transcripts. Again, this figure outlines the method, and the paper goes into much greater detail. So if you want to really understand how Cuffdiff works, become a statistician and then read the paper.

Cuffmerge is a step we're going to run, and it's necessary because of this thing I mentioned, where Cufflinks tries to infer the transcriptome. Take the simplest case, where we're just comparing sample A and sample B. Cufflinks looks at the data for sample A and says: for the most part, I'm not just going to believe what Ensembl says the transcripts are; I'm going to infer what the transcripts are. Then it does the same thing for sample B. The problem, of course, is that it comes up with slightly different answers; even if the samples are total technical replicates, it won't come up with exactly the same structure for the transcriptome. So when you try to do a comparison between the two, you're not really comparing apples to apples. Cuffmerge is what makes an apples-to-apples comparison possible: it merges those two concepts of the transcriptome structure into one consistent concept, so that a direct comparison can be made. And you can optionally provide a reference GTF, which we're going to do, so that when it merges the isoforms it has inferred, it also takes known isoforms into account. That way, where possible, you can assign expression and differential expression to genes we already know about, not just to potentially novel transcripts. Yeah, so you're asking about the Cuffmerge stage: it will take the evidence for a transcript from the one sample, and that concept of the transcript will appear in the merged GTF file, but the transcript will be assigned a zero value in the sample that doesn't have any evidence for it.

CummeRbund is the package the Tuxedo suite group developed for visualizing and summarizing the results of their differential expression pipeline. One of the things you're going to see when we run Cuffdiff, and Cufflinks for that matter, is that it produces a huge array of confusing files. There are files at the level of isoforms, CDS, genes, and so on, and each of those comes in multiple flavors, FPKM tracking files and the like, and none of them are really convenient for proceeding with your analysis.
So usually, before CummeRbund, what you'd do is go into those files and munge them, reformat them, and combine them across different folders, so that you'd end up with something that looks like the gene expression matrix you'd get from an old microarray experiment. We still do that often, but the alternative is to take CummeRbund and basically just point it at the differential expression folder; underneath, it has the knowledge to assemble the data in a useful way. It can create all kinds of useful things: distribution plots, correlation plots, classic plots like MA plots and volcano plots. It will do clustering, PCA, and MDS analysis for you. It can make heat maps at a global level, or for a selected subset of genes, or just the significantly differentially expressed genes, all the way down to a single gene. And you can create some pretty nice gene- and transcript-level plots showing the transcript structure, which is useful especially where Cufflinks is defining its own transcript structures, along with the expression levels of those transcripts and genes.

It produces things like this: a scatter plot of normal versus tumor; distribution plots, to see what the overall distribution of the data looks like; how the samples cluster relative to each other; a typical heat map of genes versus samples for the most significantly differentially expressed genes; and expression levels for a particular gene broken down by isoform. For this gene we looked at before, TST, it predicted four isoforms, and most of the action is happening in one particular isoform, which is being down-regulated from normal to tumor, or vice versa, I can't remember. And you get plots like this, where you actually see the transcript structure of those four transcripts. You can compare that back to what Ensembl has in terms of transcripts for that gene, look at other tracks you can load from UCSC, like conservation, and plot it against an ideogram. So it can do a pretty good job of making more or less publication-ready figures.
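As a concrete example, here is a minimal CummeRbund session in R. It assumes a Cuffdiff output directory called "diff_out" and two conditions labelled "normal" and "tumor"; the directory and condition names are placeholders, and TST is the example gene from the lecture:

```r
# Minimal CummeRbund sketch; assumes Cuffdiff wrote its results to "diff_out"
# and the two conditions were labelled "normal" and "tumor".
library(cummeRbund)

cuff <- readCufflinks("diff_out")          # parse the Cuffdiff folder into a database

csDensity(genes(cuff))                     # FPKM distributions per sample
csScatter(genes(cuff), "normal", "tumor")  # normal-vs-tumor scatter plot
csVolcano(genes(cuff), "normal", "tumor")  # fold change vs significance

# Heat map of the significantly differentially expressed genes
sig.ids   <- getSig(cuff, alpha = 0.05, level = "genes")
sig.genes <- getGenes(cuff, sig.ids)
csHeatmap(sig.genes, cluster = "both")

# Isoform-level expression for a single gene, e.g. TST
my.gene <- getGene(cuff, "TST")
expressionBarplot(isoforms(my.gene))       # per-isoform FPKM across conditions
```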
So we're going to go through the whole Tuxedo suite, and I recommend it. It's relatively easy to run and it's very powerful, especially if you need to get something together quickly. But it's not everything. A lot of people don't believe that FPKM estimates are perfect, they don't believe the transcript predictions are perfect, and some people don't believe the statistical methods used by the Cufflinks package are good. So a lot of people are going back, especially just for the differential expression analysis, to raw read-count-based methods, which they think model the data better. Instead of calculating an FPKM, you basically just assign reads or fragments to a defined set of genes or transcripts, which we can define using a GTF file, and then you determine raw counts: simply how many reads map to gene X or transcript Y. Doesn't that reintroduce some of the same biases FPKM tries to normalize away, like gene length? Yeah, and the reason that's okay is that people typically only use this for differential expression, where you're comparing two samples that have the same gene sizes, and the library depth is modeled as part of the statistic. So don't worry about gene size, because you're always comparing two things with the same gene size, and the statistics do account for the library depth issue.

So those things are taken into account, and the claim is that the data are modeled better with these negative binomial, or Poisson, distribution statistics than they are with something like FPKM from Cufflinks, and these methods are pretty popular. The catch, of course, is that you need to get those raw counts. We're going to show you how to do this with one piece of software, htseq-count, though there are others; this is just an example of what running it looks like, and we're going to go through it (there's also a minimal sketch below of pulling its output into R). There's an important caveat here, and there's a discussion of it on SEQanswers, which is basically that the counts are only as good as the features you define to count against. It suffers from exactly the same problem I described for Cufflinks: if two transcripts are nearly identical, how do you know which one a read belongs to? Say a gene has two transcripts that are exactly the same except for one tiny exon at the end, and most of the reads, maybe all of the mapped reads, come from the regions of ambiguity shared between the two isoforms; you have no way of knowing which transcript to assign any particular fragment to. Tools like htseq-count tend to make simplifying assumptions; they usually bin those reads as ambiguous and don't count them. So at the transcript level, it's going to work okay for some transcripts and really badly for others, and in that forum thread I think you even have the author of the tool saying it was designed for gene-level analysis, and making no claims that it's effective for transcript-level analysis. So these methods sidestep, in some ways, the problem Cufflinks is trying to deal with; it's a difficult problem, and they may not be dealing with it perfectly, or very well at all. Doing things at the transcript level, especially where transcripts aren't very distinct from each other, is still very challenging.

One thing that's confusing with htseq-count is that you still have to normalize: the raw numbers depend on how many reads were sequenced in the lane, which might be 200 million or something, so you'd still want something like a per-million-mapped-reads scaling. So it seems like the two methods are very similar, or, since I have very little experience with this, maybe very different? Well, how are you planning to use the data? That's what I would ask. htseq-count is a read-count method, basically at the gene level, but you still have to normalize to the number of reads sequenced in the lane, so you still have to go through a normalization procedure. It depends what you're going to do with the data. If you're using something like edgeR, or one of those count-based differential expression packages, that's part of what they handle. But if you want to take those read counts and do something else with them, like display them in a heat map and compare two genes, then yeah, you're back to the same problem of having to normalize for gene size and library size. So what's the advantage of the count-based approach, then? For me, the only advantage is that you can use the counts with the count-based statistics. That's the only thing I use them for. If I want to say something like "does this gene have a higher expression level than that one," or if I want to make a heat map, I go right back to some kind of FPKM.
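To make the "what do you do with the counts" question concrete, here's a minimal R sketch that combines per-sample htseq-count output files into a count matrix. The file names are hypothetical, but the two-column (feature ID, count) format is what htseq-count writes:

```r
# Combine htseq-count outputs into a genes-x-samples count matrix.
# File names are placeholders; htseq-count writes a two-column table
# of feature ID and raw count for each sample.
files <- c(normal_1 = "normal_1.counts.tsv",
           normal_2 = "normal_2.counts.tsv",
           tumor_1  = "tumor_1.counts.tsv",
           tumor_2  = "tumor_2.counts.tsv")

read.counts <- function(f) {
  x <- read.table(f, sep = "\t", stringsAsFactors = FALSE,
                  col.names = c("gene_id", "count"))
  # drop htseq-count's summary lines (e.g. "__no_feature", "__ambiguous")
  x <- x[!grepl("^__", x$gene_id), ]
  setNames(x$count, x$gene_id)
}

counts <- sapply(files, read.counts)   # matrix: one row per gene, one column per sample
head(counts)
```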
And then, yeah, why not just use Cufflinks? Unless you want something more simplistic: you can still normalize for gene length yourself, and basically calculate your own FPKM from the htseq-count counts using your own definitions of genes or transcripts (there's a minimal sketch of this below). Is it possible to compute FPKM on an exon basis? Yes, it is, but it has the same problems I described for transcripts, because exons can also overlap; maybe less than transcripts do, but they definitely can.

So what are the advantages; basically, what should you do? This is what we do. We typically use FPKM when we want to leverage the benefits of the Tuxedo suite, when we want to do things like visualization, calculating fold changes, et cetera. And we use the counts when we want to make use of the more robust statistical methods for differential expression, especially because they seem better at accommodating sophisticated experimental designs. With the edgeR and DESeq methods you can set up really complicated experimental designs, where it's not just five tumors versus five normals, and it's not clear that Cuffdiff is quite as good at that. But in all the places I've seen people using DESeq or edgeR with count-based data, they still tend to go back to FPKMs for something: filtering, or calculating fold changes, or display purposes. So I guess the question is whether you want to compute FPKM on your own or let Cufflinks give it to you. We usually let Cufflinks give it to us.

So, I mentioned we can get these raw counts, and I've listed a couple of the packages. We're going to do a really quick tutorial on how to use edgeR for these raw count approaches (a minimal sketch follows below), and there are lots of others if you search for them. It's probably still advisable not to trust just one method for differential expression. This figure is actually from the Cuffdiff 2 paper. They showed that most of what Cuffdiff predicted as differentially expressed in their test dataset was also detected by one or more of the other methods they looked at, edgeR and DESeq. But there's definitely a substantial number of genes predicted by one method and not the others, and they didn't say which of those they think are true; maybe they're all true. So if you don't run edgeR, you're losing all 87 of these potentially interesting observations; or maybe it's just more noise, we don't know whether they're good or bad. But it did seem to be at least the most comprehensive in terms of the total number passing. Also, I would guess it depends on the P value and fold change cutoffs they assigned. Exactly. I would presume that for this comparison they tried to set comparable P value and fold change thresholds, but the methods use different underlying statistics, so it could be that a lot of these genes were just barely significant in DESeq and just barely not significant in Cuffdiff. You'd probably want to do a more comprehensive analysis of how they differ. But it depends: if you're in a hypothesis-generating exercise, you want to make sure you're not missing anything, so it might not be such a bad idea to try more than one method. And maybe you have more confidence in the genes that are predicted by multiple methods.
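Going back to the point about calculating your own FPKM from htseq-count output, here's a minimal sketch. It assumes the `counts` matrix from the earlier sketch plus a named vector of gene lengths in base pairs; where those lengths come from (for example, summed exon lengths from your GTF) is up to you:

```r
# Hand-rolled FPKM: counts / (library size in millions) / (gene length in kb).
# Assumes `counts` (genes x samples) from the earlier sketch and
# `gene.lengths.bp`, a named vector of gene lengths in base pairs.
fpkm <- function(counts, gene.lengths.bp) {
  lib.size.millions <- colSums(counts) / 1e6
  length.kb         <- gene.lengths.bp[rownames(counts)] / 1e3
  # divide each column by its library size, then each row by gene length;
  # equivalent to counts * 1e9 / (N * L) from the formula earlier
  t(t(counts) / lib.size.millions) / length.kb
}
```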
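And for the count-based testing itself, here's a minimal two-group edgeR sketch, again using the hypothetical `counts` matrix. The negative binomial model it fits has variance equal to mean + dispersion × mean², with the dispersion estimated from the replicates (at dispersion zero it reduces to Poisson), which is the same overdispersion idea behind Cuffdiff's beta negative binomial:

```r
# Minimal edgeR sketch for a two-group comparison (normal vs tumor),
# using the hypothetical `counts` matrix built above.
library(edgeR)

group <- factor(c("normal", "normal", "tumor", "tumor"))
y <- DGEList(counts = counts, group = group)

# drop genes with essentially no counts, then TMM-normalize library sizes
keep <- rowSums(cpm(y) > 1) >= 2
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)

# estimate the negative binomial dispersion from the replicates,
# then run the classic exact test between the two groups
y  <- estimateDisp(y)
et <- exactTest(y)
topTags(et)   # top genes, with BH-adjusted P values (FDR) included
```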
One thing we have to remember, and so far the community is generally not doing a good job of this in RNA-Seq, is that some of the lessons from the microarray days should have been learned but never were. This paper gives an amusing review of the issue, which is basically that sequencing technology does not eliminate biological variability. You still have biological variability, and in order to find significant events you still need replicates. Almost nobody is doing this, but it's recommended, and the Marth Lab has actually developed a tool to help you do power analysis for your RNA-Seq experiments. It takes as input, I think, things like total library depth and read size, estimates the variability, and tells you approximately how many samples you need to run for a particular kind of experiment, or what depth you need. Was that mentioned in the intro? It might have been. It certainly talks about the issue of needing biological replicates, but I don't know if it covers power analysis specifically. So: we do need biological replicates for RNA-Seq, and study design matters. There's a bit of a tendency to just run RNA-Seq experiments without always thinking about what you're doing or why, and there are some good Biostar posts on these topics.

Another lesson learned, which is more important than ever, is multiple testing correction. As you compare more features, it becomes more likely that the treatment and control groups will appear different for at least some features by random chance alone. That's the simplistic explanation of the multiple testing problem. We've known about this for a long time; it's well known from array studies, where you have tens of thousands of genes, or in the later days hundreds of thousands of exons, and you need to do multiple testing correction before you can put much confidence in a significant differential expression result. But with RNA-Seq it's more of a problem than ever, because now we have essentially all the complexity of the transcriptome, not just what we thought to put on the array, and there's an almost infinite number of potential features. You can look at things at the gene and transcript level, you can go down to the exon level, you can look at junctions, retained introns, microRNAs, lincRNAs, et cetera. Basically, as Malachi said, the whole genome is transcribed at some level, and we can look at, and conduct tests at, all of those levels. So there's just a massive number of tests, and you really need to think about issues like test reduction and multiple testing correction when you're doing your analysis. I usually use the Bioconductor multtest package. I find it's pretty good for the really simple single-step corrections, like Bonferroni or Benjamini-Hochberg, and if you want something more sophisticated, it has a function called MTP, which offers quite a few statistical tests and a random permutation approach that's pretty good. What about Cuffdiff; doesn't it do its own multiple testing correction? Yeah, exactly. Would one prefer this? I would probably be at least preliminarily satisfied with what Cuffdiff is doing; they seem to know their business. I was thinking more of when you get into doing analysis outside the Cuffdiff framework. I'm not sure, but I think edgeR also has built-in multiple testing correction, so they're obviously aware of these issues.
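For the simple single-step corrections, here's a minimal multtest sketch; the vector of raw P values is simulated purely for illustration, and base R's p.adjust does the same simple corrections if you don't want the extra dependency:

```r
# Adjust a vector of raw P values with multtest's simple procedures.
# The raw P values here are simulated purely for illustration.
library(multtest)

rawp <- runif(10000)                 # stand-in for per-gene test P values
adj  <- mt.rawp2adjp(rawp, proc = c("Bonferroni", "BH"))

# mt.rawp2adjp sorts by raw P value; adj$index maps back to the input order
adjp <- adj$adjp[order(adj$index), ]
head(adjp)                           # columns: rawp, Bonferroni, BH
```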
Downstream interpretation of expression analysis: just getting to the point where we have the data to start this has already taken the whole workshop, as you can see. It really is a topic for another course, and there are, I think, CBW courses that cover a lot of these topics. What you get out of this pipeline is expression estimates and differential expression lists, and those can be fed into many or most of the analysis pipelines you have used for years with your microarray experiments. We do provide a supplemental R tutorial on how to format Cufflinks data and start manipulating it in R. We're not going to go through that today, but you can take it home if you want to start developing your skills beyond the canned analysis provided with CummeRbund. CummeRbund does do some downstream analysis; it lets you do things like create heat maps. But if you find the heat maps it generates are not to your liking, there are a number of other R packages you can use to do your own custom analysis.

For things like classification analysis, a lot of the time we don't yet have the sample sizes, or the clinical details, but this is changing. For projects like TCGA, you're starting to see hundreds or thousands of samples for a particular tumor type, sometimes with clinical annotations, and we can start designing prognostic signatures from RNA-Seq data in the same way we have been from microarray data. If you're interested in that, I suggest WEKA as a good learning tool, and there are many, many R packages for classification analysis. I usually use random forests, and there's a Biostar tutorial being developed for that, so you can watch for it.

Pathway analysis: this is something people always ask for and never really know what to do with when you give it to them. But people will ask you, as soon as you can do the stuff in this course: oh, you've got expression data and differential expression analysis, now tell me which pathways are significant. DAVID is a really easy web tool; you can basically just paste a list of gene IDs in, and it's not bad. I've used Ingenuity Pathway Analysis quite a bit, though it's commercial software. Cytoscape is not bad; it's good for visualization and it has some pathway plugins. And there is a ridiculous number of R/Bioconductor packages; if you search this link, I think it finds something like 250 packages. So you probably have to go on Biostar and see what people actually recommend and use, or Google around a bit.

Sorry to blast through that, but we do want to get to the tutorial material. Just to remind you where we're at: we're done with getting our data, we have made our alignments, and now we're going to get expression estimates with Cufflinks. Then we'll use Cuffmerge and Cuffdiff to do differential expression analysis.