So today we're going to continue on with a fairly brief lecture on some basic concepts around expression and differential expression, and then we're going to dive into probably the most extensive practical exercises of the different modules. We're going to learn how to estimate expression abundance and calculate differential expression, we're going to tackle it with a couple of different approaches, and we're going to do quite a bit of work in R as well as at the command line. So this is module three, expression and differential expression. It's three out of five modules; the fifth one is actually missing here, which is the reference-free expression estimation approaches that Malachi is going to cover this afternoon. The learning objectives for this module are to talk about estimating expression for known genes and transcripts, to briefly review FPKM-style approaches for expression estimation versus raw counts, to talk about some differential expression methods, and to briefly touch on some of the challenges with downstream interpretation of expression and differential expression.

So I think by now you have already looked at some of your RNA-seq alignments in IGV, is that correct? Then you should have already seen a visual that looks very much like this. Maybe last night you were just thinking in terms of the alignments, but what we're going to use those alignments for is to get a concept of abundance. At a basic level it's intuitive: the more fragments you have from a specific locus, the more highly represented those sequences are, and perhaps that represents a more highly transcribed or expressed locus. IGV calculates, for example, this coverage track, the thin track along the top; you probably can't make out the numbers, but it will tell you the range, something like zero to a thousand x coverage. At a very basic level, all other things being equal, if you saw very high coverage in one sample, say your tumor sample, and much lower coverage in another sample, say your normal sample, you might conclude that this gene is up-regulated in one versus the other, or down-regulated. Of course it's not quite as simple as that, and there are very sophisticated methods that attempt to normalize for things like the sequencing depth of your library, different gene sizes, and other potential sources of variation, but in general that's the basic concept.

You can also see in this particular example a case of 3-prime bias. This gene happens to run 5-prime to 3-prime on the positive strand, and you can definitely see there's much higher coverage towards the 3-prime end compared to the 5-prime end. This was a common feature, especially in the days of poly-A selection, where you're pulling down mRNA based on the poly-A tail, which is at the 3-prime end. If your RNA was fragmented, you were more likely to pull down just parts of the RNA, biased towards the 3-prime end, so this was a common pattern. You don't see it as much with RNA-seq approaches that do not use a 3-prime poly-A selection type method.

So if you've thought or talked or worked with RNA-seq at all, you've probably heard of FPKM or RPKM, so I wanted to quickly review those metrics.
RPKM goes back, I want to say, almost to the first paper describing RNA-seq, and basically it's the idea of normalizing the number of reads mapped to a given transcript or gene locus by the library size and the gene size. From the first time anyone created an RNA-seq library, or probably even more so when they created the second library and started to think about how to compare them, you immediately come to the conclusion that you can't simply compare the counts from one to the other without taking into account the fact that you might have sequenced one library twice as deeply as the other, right? Similarly, you can't directly compare the number of reads aligning to one gene versus another, because one gene might be much, much bigger, so a single copy of transcript from that gene might actually produce more reads than a single copy from a smaller gene. The RPKM and FPKM metrics were designed to account for that, and basically all they do is normalize the number of fragments by the length of the gene and the total library depth. There's an equation here; they're usually expressed per kilobase of gene, that's the K, and per million reads, somewhat arbitrarily. There are a variety of ways to write the formula, but essentially you're taking the number of reads and dividing by the kilobases of gene length and the millions of reads in the library. There's another way of expressing it here, and also a comparison of FPKM to TPM. I should mention that the reason we went from talking about RPKM to FPKM happened around when paired-end reads became common: there was this potential confusion where you have two reads, but they're from the same fragment, and what we're really counting are the fragments, so that's why we stopped saying reads and started saying fragments.

So for FPKM, one simple way to calculate it is to sum all the fragments in the sample, divide by a million to get a library-size factor, divide each gene or transcript fragment count by that factor, and then divide by the length of the gene in kilobases to get an FPKM. TPM, which you might have heard of and which is now maybe more popular, or equally popular, as a way to express RNA-seq expression, is very similar; the only difference is the order of operations. Here you start with the gene length first: you divide the fragment count by the length of the transcript in kilobases to get fragments per kilobase, then you sum those per-kilobase values for the entire library, divide by a million, and divide the former by the latter. You would think these are basically the same thing, but this order of operations actually has a really big effect. What it means is that with TPM, because of the order of operations, the sum of all the TPMs in each sample is the same. You're basically setting a theoretical total number of transcripts, and what this allows you to do is more easily compare TPM values directly between one sample and another. In contrast, with FPKM the sum of the normalized values in each sample can actually differ, so it can be a little bit harder to compare samples directly (a small worked example follows).
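To make that order-of-operations point concrete, here is a minimal sketch in R with made-up counts and gene lengths; all the names and numbers here are purely illustrative, not from the course data:

```r
# Toy fragment counts (rows = genes, columns = samples) and gene
# lengths in base pairs; values are invented for illustration.
counts <- matrix(c(100, 200, 300,
                   150, 250, 350), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("sample1", "sample2")))
gene_length_bp <- c(geneA = 2000, geneB = 4000, geneC = 1000)

# FPKM: first normalize by library size (per million fragments),
# then by gene length (per kilobase).
fpkm <- t(t(counts) / (colSums(counts) / 1e6)) / (gene_length_bp / 1000)

# TPM: same two factors, opposite order; first per kilobase, then
# scale so each sample sums to exactly one million.
rpk <- counts / (gene_length_bp / 1000)
tpm <- t(t(rpk) / (colSums(rpk) / 1e6))

colSums(tpm)   # identical (1e6) for every sample
colSums(fpkm)  # generally differs between samples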
Now, which of these is more correct is a subject of some debate; you can go into the literature and see many people arguing TPM versus FPKM. To some degree it depends on your assumptions: it may be convenient to assume that there are the same total number of transcripts in one cell or population of cells compared to another, but of course that's not necessarily true. I think in general TPM has gained in popularity over FPKM, but you still see both very widely used.

So, I'm not sure which paper, but you're asking about correcting for 3-prime bias, or, oh, the random hexamer bias? Yeah, I've not noticed anyone attempting to correct for that specifically; that's an interesting question.

So, StringTie. I'm not going to go into great detail; if you read the paper there's a lot of math behind it, having to do with graph theory and flow networks. It's the latest evolution of methods from the group that also produced Cufflinks; it's part of the Tuxedo suite, and it's the currently recommended approach for estimating expression from their suite of software. They transitioned from recommending TopHat, Cufflinks, and Cuffdiff to now recommending HISAT, StringTie, and, what is it, no longer CummeRbund, Ballgown, thank you, I'm blanking on their various tuxedo terms. So StringTie is an evolution of that method. I'm a little more familiar with Cufflinks because we had been using it for years and only recently switched to StringTie. There's a nice paper in Nature Biotechnology where they compare the methods; they basically compare how StringTie, Cufflinks, and Traph perform (I'm not at all familiar with Traph).

Basically, what StringTie does is construct a splice graph and then extract the heaviest path, the most supported path through the splice graph. It computes what's called the maximum flow through that heaviest path to estimate an abundance, then updates the splice graph by removing the reads that were assigned by the flow algorithm, and repeats the process until all the reads have been assigned and abundance estimates have been created for each of the different paths, which correspond to different isoforms, because this is a splice graph. What this gives you is some sort of relative abundance of one isoform versus another at that gene locus.

One of the challenges or gotchas we have to watch out for with these methods is that StringTie, and Cufflinks before it, are not just estimating abundance but also defining the transcript identities; it's essentially building a transcriptome assembly as it analyzes your aligned reads. That's great; it has the advantage of producing abundance estimates not only for known transcripts but potentially for novel isoforms, which is often of interest, right? But a challenge is that when you go to do differential expression, you may find yourself comparing apples to oranges: for one sample, the exact set of isoforms inferred from the alignments might be slightly different, or very different, from what was inferred from another sample. So they provide tools to merge those representations into one consistent view of the transcriptome and then recalculate the abundances given that new merged, consistent set of transcripts. We're going to use StringTie's merge function to merge together the gene structures from all the samples (the overall workflow is sketched below).
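As a rough sketch of how that workflow fits together, here is what the sequence of StringTie calls might look like, driven from R with system() since StringTie itself is a command-line tool. The sample names, BAM files, and GTF paths are placeholders, not the course's actual files:

```r
# Hypothetical sample names; each is assumed to have an aligned,
# coordinate-sorted BAM of the same name.
samples <- c("tumor_rep1", "tumor_rep2", "normal_rep1", "normal_rep2")

# Step 1: assemble transcripts per sample (reference-guided mode:
# -G supplies known annotations, novel transcripts still allowed).
for (s in samples) {
  system(paste0("stringtie ", s, ".bam -G ref.gtf -o ", s, ".gtf"))
}

# Step 2: merge the per-sample assemblies into one consistent
# transcriptome so every sample is measured against the same models.
writeLines(paste0(samples, ".gtf"), "gtf_list.txt")
system("stringtie --merge -G ref.gtf -o merged.gtf gtf_list.txt")

# Step 3: re-estimate abundances per sample against the merged GTF
# (-e restricts quantification to the supplied transcripts, -B also
# writes the table files that Ballgown reads).
for (s in samples) {
  system(paste0("stringtie ", s, ".bam -e -B -G merged.gtf -o ",
                s, "_requant.gtf"))
}
```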
Merging in this way corrects for the fact that some samples may only partially represent a gene structure, and it allows known transcripts to be incorporated alongside assembled, potentially novel transcripts. We're going to run StringTie in several different modes. The first is the so-called reference-only mode; in that case the StringTie merge step is not really necessary, because we're saying we have a set of transcripts that we believe in and we're forcing the algorithm to only assign abundances to those known transcripts. But you can run StringTie in other modes: a reference-guided mode, where it uses the GTF of known transcripts but is also allowed to identify or define new transcripts, and a de novo mode, where it comes up with purely its own GTF based on the data. It's really for these modes, the de novo and reference-guided modes, that we need the merge step: we're going to rerun StringTie with the merged transcript assembly so that all of the samples have estimates for the same set of transcripts, which allows us to directly compare them.

You can also use a tool called gffcompare (the previous suite shipped a similar tool, cuffcompare) to compare the merged transcript GTF that you create with StringTie back to another set of annotations, for example your reference annotations. So if you had RNA-seq data from a bunch of samples, ran StringTie, got a bunch of predicted transcript structures, merged them to create one comprehensive set of all the transcript isoforms from that set of samples, and then wanted to know how comprehensively and how accurately you had recreated the known human transcriptome, you could compare it back to, for example, an Ensembl GTF. It would tell you how many known transcripts you successfully recapitulated and how many potentially novel isoforms you detected. It's a useful little tool, and we're going to try running it as well.

We're going to use Ballgown for differential expression. As I mentioned, this replaced what we used to do with Cuffdiff, and I think it also simultaneously replaced CummeRbund: you used to run Cuffdiff on the Cufflinks output to get differential expression results, and then load those into an R package called CummeRbund for various visualizations, and now those things are combined together in Ballgown, which is also an R package. It performs a parametric test comparing nested linear models. Basically, it fits two models to each feature using expression as the outcome: one including the covariate of interest, in this case something like case versus control or whatever your differential expression variable is, and one not including the covariate. An F statistic and p-value are then calculated from the fits of the two models, and you get a significant p-value when the model including the covariate of interest fits significantly better than the model without it. It also automatically provides, I believe, a multiple-testing-corrected q-value to control the false discovery rate, and often people use those q-values to identify their list of significantly differentially expressed genes. We're going to go through Ballgown in the practical exercises (a minimal sketch of this test follows below). It does a lot of useful visualizations with very little work on your part.
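To give a sense of what that nested-model test looks like in practice, here is a minimal Ballgown sketch. It assumes the per-sample tables written by StringTie's -B option live under a ballgown/ directory with one subdirectory per sample, and that we have two tumor and two normal samples; the paths and group labels are assumptions for illustration:

```r
library(ballgown)

# Load the StringTie/Ballgown tables; samplePattern picks up the
# per-sample directories (assumed to contain "rep" in their names).
bg <- ballgown(dataDir = "ballgown", samplePattern = "rep", meas = "all")

# Attach phenotype data; the group labels must be in the same order
# as sampleNames(bg), so check that ordering on real data.
pData(bg) <- data.frame(id = sampleNames(bg),
                        group = c("normal", "normal", "tumor", "tumor"))

# stattest() fits the model with and without the covariate and
# reports the F-test p-value plus a q-value for FDR control.
results <- stattest(bg, feature = "transcript", covariate = "group",
                    getFC = TRUE, meas = "FPKM")
head(results[order(results$qval), ])
```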
With those visualizations you can quickly get, for example, a box-and-whisker plot for a particular transcript, in this case comparing female expression to male expression; overall distributions of expression across your samples or across genes; and representations of the different transcript isoforms identified at a specific gene locus, with the relative expression of each isoform versus the others. So it's quite a useful set of visualizations.

There are alternatives to these FPKM-style or TPM-style approaches. Especially for differential expression statistics, a lot of people advocate working directly from the raw read counts. Of course the raw read counts are not just used directly in some kind of simple fold change; these methods still account for things like library size, though generally not gene size, because typically you're comparing differential expression of a gene between samples, so the gene size is fixed and isn't expected to affect that comparison. They deal with the issue of library size within the statistic directly, and a lot of people prefer this approach. Some of the packages that compute differential expression statistics from raw counts tend to be more configurable: they allow for more kinds of experimental designs, like time series or designs with many different conditions, and they offer some quite robust statistical approaches. We're going to show you how to use one method for getting gene-level raw counts, and basically it's fairly straightforward: it just counts up the number of fragments or reads mapping to each gene locus. For this particular tool there is a good post on SEQanswers about why you should be careful with transcript-level analysis; basically it's not recommended for transcript-level analysis, so it's more useful at the gene level.

So which should you use, FPKM or raw counts? There's not really an easy answer to this; we do both on a regular basis. We find the FPKM values are quite useful, certainly if you want to leverage the benefits of the complete Tuxedo suite, and they seem to be useful for visualization: the raw counts themselves are not that good for creating, for example, a heat map of expression levels across genes and samples, you have to do more work on the read counts before they're useful, whereas FPKM is fairly ready for that kind of visualization, and it's somewhat simpler to calculate fold change from FPKM. But as I mentioned, for the counts there may be more robust statistical methods available, and they accommodate more sophisticated experimental designs. And you're going to see that the two produce not exactly the same results. For raw counts I would say DESeq is probably the most popular; edgeR is also very popular, and there are a number of other, mostly R-based, statistical packages for computing differential expression from raw counts (a short DESeq2 sketch follows). There have been papers that looked at these different approaches and essentially came to the conclusion that while they do have substantial overlap, there's also quite a bit of non-overlap, and for this reason using multiple approaches is advisable. Then it depends on your goal: if you really want the most confident differential expression predictions, maybe you only have faith in the intersection; if you want something more comprehensive, where you make sure you don't miss anything, and maybe you're going to go on to validate with additional experiments, maybe you take the union of all the predictions.
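For the raw-count route, here is a short sketch of what an analysis with DESeq2 (the current version of DESeq) might look like. It assumes a gene-by-sample matrix of integer counts, for example from the gene-level counting tool just described, is already loaded as raw_counts; the object names and group labels are placeholders:

```r
library(DESeq2)

# Sample metadata; the condition labels must match the column order
# of the count matrix.
col_data <- data.frame(row.names = colnames(raw_counts),
                       condition = factor(c("tumor", "tumor",
                                            "normal", "normal")))

dds <- DESeqDataSetFromMatrix(countData = raw_counts,
                              colData = col_data,
                              design = ~ condition)

# DESeq() handles library-size normalization internally, which is
# why raw counts (not FPKM/TPM) are the expected input.
dds <- DESeq(dds)
res <- results(dds)
head(res[order(res$padj), ])
```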
This slide is just to remind us, I talked about this a little bit yesterday, of lessons learned from the microarray days, from pre-RNA-seq days. Sequencing technology does not eliminate biological variability; that was actually the title of a paper in Nature Biotechnology, written in response to many of the early RNA-seq experiments that, as we talked about yesterday, did something like one versus one and then tried to draw conclusions. I mentioned that there were some tools for this; here's the one whose name I couldn't remember, Scotty. I think it's still up: you can basically go there and enter details about your sequencing experiment, I think it considers things like total library depth and read length, and it attempts to perform a basic power analysis to tell you how many replicates you might need for your experiment.

Similarly, things like multiple testing correction are not only still important, I would argue they're more important than ever, because there are more tests than ever, right? When you were doing a microarray experiment, the number of tests had an upper limit based on the number of spots on the array: if your array technology was capable of measuring 20,000 genes or 50,000 transcripts, that was essentially the upper limit on the number of statistical tests you might perform. Probably you'd do some test reduction based on variability or expression, but then you would calculate thousands of differential expression tests across all the array targets. With RNA-seq there's almost no upper bound on the number of potential tests, because you have all those same genes and transcripts, anything that is potentially expressed, but now you're also measuring intron expression, you can get estimates of individual exon-exon junction expression, you can get estimates at the exon level, and any of those are potentially interesting in and of themselves in terms of differential expression. It could be really interesting if there's a significant difference in intron retention from one condition to another, or a significant difference in the use of a specific novel exon-exon junction. So you might quickly find yourself expanding to millions of different tests for all these different kinds of features you're now able to assess with RNA-seq data, and that creates a huge multiple testing problem. As you do that, and given the generally small number of RNA-seq samples you can typically afford to sequence, you could quickly get into a position where nothing can possibly pass multiple testing correction. You have to keep that in mind: you need to set up your experiment in such a way that you're sufficiently powered to detect things, and you still need to consider multiple testing correction, more than ever, because there's a danger you can just keep trying more and more things until you find something significant, and at that point maybe you're just looking at essentially random noise (a toy illustration of this follows).
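As a toy illustration of that danger, here is a tiny simulation in R under the null hypothesis, where nothing is truly differentially expressed; the number of tests is made up to echo the millions-of-features scenario above:

```r
# Simulate one million tests where the null is true everywhere.
set.seed(1)
pvals <- runif(1e6)

sum(pvals < 0.05)   # ~50,000 "significant hits" by chance alone

# Benjamini-Hochberg FDR correction, as used for q-values.
qvals <- p.adjust(pvals, method = "BH")
sum(qvals < 0.05)   # essentially none survive correction
```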
So, downstream interpretation of expression: a lot of this is actually going to be covered in the modules after RNA-seq, things like pathway analysis; really that's a topic for an entire course. You can feed the estimates of expression or differential expression from StringTie and Ballgown into many different analysis pipelines. We provide a supplemental R tutorial that does some basic data manipulation and visualization, things like clustering and heat maps. This kind of thing used to be provided by CummeRbund; I'm not sure if it's provided by Ballgown, but we do provide an example of making a heat map in the supplementary R tutorial (there's a sketch of the idea at the end of this section). You could also do classification analysis, which is a common use of RNA-seq data; if you're interested in that, we recommend Weka as a good learning tool. I have also developed, though this is now out of date, a series of Biostar tutorials on how to do a random forest classification exercise on expression data. I think the test data I used there was microarray data, but the principle is exactly the same for RNA-seq data. And I think you're going to cover pathway analysis in detail, and perhaps some of these other topics as well, in the upcoming sections of the course.
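For reference, here is a hedged sketch of the heat-map idea mentioned above, using base R's heatmap() on log-transformed FPKM values; fpkm_matrix (genes by samples) is a placeholder for values exported from Ballgown, not an object the course tutorial defines:

```r
# Log-transform FPKM values; the +1 avoids taking log of zero.
log_fpkm <- log2(fpkm_matrix + 1)

# Keep the most variable genes so the clustering is interpretable.
top <- order(apply(log_fpkm, 1, var), decreasing = TRUE)[1:50]

heatmap(log_fpkm[top, ], scale = "row",
        main = "Top 50 most variable genes")
```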