So guys, we're more than halfway through this boot camp, right? So I'm going to skip past this preamble that you've seen enough times already. We're on module 3 out of 4 of the RNA-seq section, and today we're going to talk about expression and differential expression. The lecture is not going to be very long, but we're going to spend a bit more time taking you from your alignments to actually estimating the abundance levels of transcripts and genes in these demonstration data sets, and then seeing if we can calculate differential expression metrics for our two conditions. So that's where we are in the course. For this module, we're going to talk about how you go about estimating expression of known genes and transcripts. Later, we're going to talk about how you would get estimates for potentially novel isoforms. We'll briefly review some of the differences between the so-called FPKM expression estimate approaches versus the raw count strategies, look quickly at the differential expression methods, and then talk a little bit about some of the downstream interpretation steps you might do, which unfortunately we don't cover very well in this course because we just run out of time, but we can talk a little bit about it.

You already started doing this yesterday, looking at the alignments from your first sets of RNA-seq data. We looked at a view very much like this in IGV, right? And we were talking about how, with IGV, you can already start to think about whether something is a highly expressed gene or not; if you've looked at a lot of genes you can say, oh, that's a lot of coverage, or there are a lot of reads aligning to this gene. And we looked at whether we saw three-prime bias. In this case you see it much worse than in the example we were looking at yesterday: there's a lot of coverage here that tails off as you get towards the five-prime end of the gene, so it's biased towards the three-prime end. And you could say maybe this is a down-regulated gene, because you have more coverage in the top condition than in the bottom condition. But, like we said yesterday, that's a bit of an over-simplistic view of things; it's not statistically solid. So we do need a formal approach to this.

Probably the most commonly used measure, and I think this, or a version of it, was introduced in the first RNA-seq paper, is the RPKM or FPKM measure. Mostly you hear about FPKM now. RPKM came first, and it referred to reads per kilobase of transcript per million mapped reads. Really the only reason we switched to talking about FPKM is that rather than thinking about things at the single-read level, we started thinking at the fragment level, because we had two reads per fragment once we started doing paired-end sequencing of the fragments. So it's basically the same concept, but you're not counting the two reads of a paired-end read twice; you count them once, as a fragment. The principle is the same, and the basic idea is that in RNA-seq we hope the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. But even just casually thinking about it, you'll realize that the number of fragments is biased towards larger genes.
So you can't just take gene A and gene B and compare total fragment counts if gene B is three times as long, because even one copy of the longer transcript will break into more pieces and potentially be sequenced in more reads than one copy of the smaller gene would. So we have to deal with the fact that the size of the feature matters. Another obvious problem is that the total number of fragments is related to the total library depth, which is arbitrary; you can just keep sequencing. If you sequence condition A to a million reads and condition B to two million reads and you don't account for that, any differential expression estimate you come up with between those two conditions is going to be flawed, right?

FPKM attempts to account for both of these issues simultaneously, and it's a pretty simple formula: you take the count of mappable reads or fragments for your gene, and you divide by the total number of fragments in the library and by the number of bases in the gene, transcript, or whatever feature you're looking at. The problem is that this usually gives you a tiny fractional number, so we multiply by one thousand times one million, a billion, so it works out to per thousand bases of gene per million reads of library. That's really just a convenience so the numbers are easier to work with; the sketch below shows the naive version of the calculation. There's also a lot of discussion on Biostars about how to calculate FPKM and how to interpret it.
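Here's a minimal Python sketch of that naive FPKM calculation; the gene names, counts, and library size are invented for illustration.

```python
# A minimal sketch of the naive FPKM calculation described above, assuming
# you already have per-gene fragment counts. All numbers are made up.

def fpkm(fragment_count, gene_length_bp, total_fragments):
    """FPKM = fragments / (gene length in kb * library size in millions)."""
    return fragment_count * 1e9 / (gene_length_bp * total_fragments)

counts = {"geneA": 1500, "geneB": 4500}   # mapped fragments per gene
lengths = {"geneA": 2000, "geneB": 6000}  # feature length in bases
library_size = 25_000_000                 # total mapped fragments in library

for gene in counts:
    print(gene, round(fpkm(counts[gene], lengths[gene], library_size), 2))
# geneB has 3x the counts of geneA but is 3x as long, so both come out
# at FPKM = 30.0; the length bias discussed above has been normalized away.
```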
Now, Cufflinks produces an FPKM estimate, but it's a lot more complicated than that simple formula; it's actually really complicated. So you have to be a little careful, because not all FPKMs are created equal. You can, and we sometimes do, implement a very simplistic FPKM calculation where we literally just align the reads to genes or transcripts and do the calculation from the last slide. But more often we use software like Cufflinks, which takes a much more sophisticated approach. It's not just accounting for differences in library size and gene size, although those are incorporated into the method as well. It's also trying to deal with things like mapping uncertainty, where a read could align equally well to different isoforms, and it's trying to produce a good estimate for all the isoforms of a gene locus despite all that uncertainty about where each read maps. Some reads clearly belong to a specific isoform, because maybe that isoform is the only one that has a particular exon or a particular exon-exon junction. But a lot of the time the different isoforms expressed from a locus share much of their exon content: they'll all use exons 1, 2, and 3, then one will use exon 4, one will skip 4 and use 5, and one will have a slightly different UTR, subtle differences like that. Cufflinks tries to assemble transcripts in a way that gets at that information.

They provide a figure in their paper that's quite complicated, but to summarize at a really high level: they take the mapped fragments and construct bundles of overlapping fragments, they look for mutually incompatible fragments, and the fragments are connected in an overlap graph. The isoforms are then inferred from the minimum set of paths required to walk through that graph. Once you have those paths, they do some fancy probabilistic statistics to assign expression estimates to each of the isoforms defined by those paths. Here's a relatively simple example they provide: you have these fragments, they look at how they overlap, and they identify the mutually incompatible ones. There's a yellow fragment that doesn't overlap with the blue fragment, which doesn't overlap with the red fragment, and vice versa; all the gray fragments overlap with something among the yellow, blue, and red. Then they construct paths through the different possibilities. An isoform could involve these early fragments and then go down through this unique bit of red here, or it could go through the early fragments, through this blue exon, and then to the end, and so on. From that they try to build a picture of the likely isoforms in the assembly. And they're doing this genome-wide for all these bundles of fragments, where each bundle roughly corresponds to a gene locus.

Depending on which mode you run Cufflinks in, it may or may not use information from the known transcriptome. If you have a GTF file for a well-annotated genome, where we have a good idea of what all the genes are, it will use that information to guide the process. But you can also run it in a mode that says: I don't know anything about my transcriptome, figure it out from the data. And it uses this approach to do that. Once it has these isoform definitions, which could be totally novel, assembled by Cufflinks, or guided by what we know about the transcriptome, it goes back and assigns reads, or portions of reads, to each isoform to get some concept of the likely abundance of each isoform in that sample. And it tries to make use of things like the fragment length distribution. You can give it the mean fragment size and the standard deviation of fragment size for your library, and it will say, okay, I have this alignment and these candidate isoforms; which isoform is this fragment most likely to have come from, given that I know the fragments are on average, say, 200 bases? So for one fragment, considering isoform A, maybe it maps to exon 1 and exon 6. An isoform that includes all the exons in between would imply a fragment of several hundred bases, maybe a thousand, and it might say that's unlikely given the fragment size distribution: this fragment more likely belongs to one of the isoforms that skips those intermediate exons. See how that logic works? It's trying to be quite clever and use all the information available to it to make the best guess of where each fragment belongs among the different isoforms in the sample. So you get expression estimates, expressed as FPKM-type measurements, but they've come from this very complicated probabilistic model; the toy sketch below illustrates the fragment-length logic.
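To make that fragment-length reasoning concrete, here's a toy Python sketch, not Cufflinks' actual model: two hypothetical isoforms, a read pair spanning the first and last exons, and a made-up Normal(200, 40) fragment-size distribution.

```python
# Toy illustration (not Cufflinks' real algorithm) of how a fragment length
# distribution helps assign a read pair to an isoform. The isoform
# structures, coordinates, and Normal(200, 40) fragment size are invented.
from math import exp, pi, sqrt

def tx_coord(genomic_pos, exons):
    """Map a genomic position to transcript coordinates for one isoform."""
    offset = 0
    for start, end in exons:                  # exons ordered 5' -> 3'
        if start <= genomic_pos <= end:
            return offset + (genomic_pos - start)
        offset += end - start + 1
    return None                               # position not in this isoform

def fragment_likelihood(left, right, exons, mean=200.0, sd=40.0):
    """Normal density of the implied fragment length under this isoform."""
    a, b = tx_coord(left, exons), tx_coord(right, exons)
    if a is None or b is None:
        return 0.0
    implied_len = b - a + 1
    return exp(-((implied_len - mean) ** 2) / (2 * sd * sd)) / (sd * sqrt(2 * pi))

# Isoform A keeps all four exons; isoform B skips the middle two.
iso_A = [(100, 300), (500, 700), (900, 1100), (1300, 1500)]
iso_B = [(100, 300), (1300, 1500)]
left_mate_start, right_mate_end = 250, 1450   # pair spans first and last exon

for name, iso in [("A", iso_A), ("B", iso_B)]:
    print(name, fragment_likelihood(left_mate_start, right_mate_end, iso))
# Under isoform A the implied fragment is ~600 bases, very unlikely at
# 200 +/- 40; under isoform B it is ~200 bases, so the pair most
# plausibly came from the exon-skipping isoform B.
```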
Then you can try to estimate differential expression between two samples, or ideally between a large number of samples for condition A and a large number for condition B, and the way Cuffdiff does that is by looking at the variability in fragment counts across the replicates. That's standard statistics, right? You're looking at the variability across all your replicates in condition A compared to condition B, considering both the average distance between the expression levels in the two conditions and the variability within each condition. If they're very different and not very variable, they're more likely to be differentially expressed. But it also tries to work in the uncertainty in the alignments themselves. The fragment count for each isoform is estimated as in the last step, but varying amounts of uncertainty go into those estimates: transcripts that have a lot of shared exons and very few uniquely assigned fragments are going to have greater uncertainty. Take the example I described at first, where you have two isoforms that are basically identical: they share all the same exons but have one small exon at the end that belongs to one isoform and not the other. For the vast majority of the reads or fragments at that gene locus, you don't know which of the two isoforms they came from, right? They align to exons shared between the two isoforms. So there's less certainty about which isoform those fragments belong to, and that makes it harder to observe differential expression for that gene compared to another gene, especially if you're looking at the transcript level, because you're less sure what the actual transcript isoform abundance is. The algorithm combines these alignment uncertainty estimates with the cross-replicate variability into a single model to estimate variances, and then determines whether the differences are statistically significant between conditions across your set of samples.

When we go through the exercises, the first thing we're going to run is Cufflinks. But before we run Cuffdiff, we're actually going to run Cuffmerge, and maybe you can guess why that's necessary based on what we've already talked about. For each sample, Cufflinks is trying to infer the transcriptome, depending on which mode you run it in; it's trying to say, for this sample, which transcripts are present and how they are defined in terms of their exon-exon structure and so on. If you do that process for two samples, you can't guarantee the assemblies will match; in fact, it's almost certain the assembled transcripts will differ a little. So when you then try to ask whether there's a difference in expression level between the two samples, you don't have an apples-to-apples comparison, right? The transcripts may have been defined quite differently. So you need to merge them together. Cuffmerge takes all of your Cufflinks assemblies across all your samples and creates a single consistent assembly. An alternative approach would be, instead of running Cufflinks individually on each sample, to create one giant pool of all your RNA-seq data, use that to get an understanding of all the transcripts present across your samples, define that as your new GTF file, and use that for Cuffdiff. But that's not usually how you do it.
So usually you run Cufflinks on each sample, you get a predicted transcript assembly from each, and then you merge them together to make one consistent representation of the transcriptome across your pool of samples. Cuffmerge also does a few other things: it filters out problematic transfrags. There are known issues where there'll be a huge spike in coverage; it could be for a biological reason, but usually it's a mapping artifact, just a lot of alignments piled up there, and that creates problems for Cuffdiff and Cufflinks. So Cuffmerge will remove some of those possible artifacts. Like I said before, you can also provide a reference GTF to merge the novel isoforms that Cufflinks predicts into the known isoforms; that can really help, and it's what we're going to do today. In the end you get an assembly GTF produced by Cuffmerge that is suitable for use with Cuffdiff, and that lets you compare apples to apples across all your samples. If you have one sample, no, you don't need to run it, because you'll just have the one transcriptome assembly from Cufflinks, you'll get expression estimates from there, and you'll stop there.

Yes, you can use single-end sequencing with Cufflinks. Certain information will be unavailable to the model; the fragment size distribution is going to be handled differently, and so forth, but it will do its best. I do think Cufflinks, especially now over the years, has been optimized towards paired-end data, because that's what most people have and what they've probably been developing the method with. Cufflinks has gone through a lot of development; it's been maybe five years now, many versions have come out, and they've changed it quite substantially over the years, so it matters which Cufflinks version you're running.

Yeah, all samples across all conditions. It's okay; it doesn't try to harmonize in the sense that... let's say, and this is actually the intended use case, you're comparing brain tissue to skin tissue, or something else very different, and they're going to have quite different transcriptomes. They're also going to share a lot, though. When you run Cufflinks on each of those, the transcriptome gets defined, and you want to know which transcripts defined in this sample are the same as those in that one. But if there's a unique isoform that's only expressed in brain tissue and not in skin tissue, that's okay: it'll define it, the merged GTF will have a representation of it, and it doesn't mind that maybe only one sample or one condition has that isoform. Exactly, it's a union, not an intersection.

Yes, that's a good question. It would be something to consider; I think it would depend. For human, I would tend to think it wouldn't be necessary, because we now have a quite comprehensive representation of the transcriptome, of all the things that are ever expressed. It's probably not 100% complete, but it's pretty good. If I'm trying to do novel isoform discovery, that's a different thing.
If I'm trying to get good estimates of known genes and known transcripts, I would probably still start with, say, the Ensembl GTF, because you're guiding it towards what is really well established, and that may prevent it from going down a funny path. This is a really hard problem that Cufflinks is trying to address, and the data it has is actually not that great: two-by-100-base-pair reads, even 100 million of them, is not what you want for assembling a transcriptome. There have been papers published showing that it actually works pretty terribly. So it's good that they're thinking along these lines and doing their best, but I wouldn't take ten RNA-seq libraries, run Cufflinks and Cuffmerge, and say I have a good representation of the human transcriptome at that point. I would much, much more trust what has been developed over the last 10 or 20 years through things like full-length cDNA sequencing.

Okay. CummeRbund. When you're done with Cuffdiff, we're going to use an R Bioconductor package called CummeRbund, which is also part of the Tuxedo suite, and it's a convenient way to jump-start analysis of your differential expression results. It provides a lot of functions to generate commonly used visualizations for expression and differential expression. You can get things like MA plots, if you're familiar with those from the microarray days, volcano plots, clustering, PCA and MDS plots, and heat maps. You can get nice gene- or transcript-level plots that show the structure of the genes, which is especially useful if Cufflinks has defined new transcripts and you want to visualize what they look like and the expression levels of those new isoforms. We'll show you some examples of that output, and we'll go through this in detail in the lab, so this is just a quick sneak preview. This is what I was talking about: you can very quickly make quite nice-looking plots that have a chromosome ideogram and all the isoforms predicted by Cufflinks, and then you can pull in the isoforms from, say, Ensembl to compare what was predicted to what's known, and you can get conservation tracks from UCSC. So it's quite powerful for creating visualizations that are close to figure-ready for a paper.

Now, alternatives. It is by no means accepted that this is the only way you should do things. There are two major camps. One is probably led by the Cufflinks crowd, who are the most outspoken about their approach, obviously, but it is very widely used and regarded. But there are a lot of other people who say that, for differential expression analysis especially, you should just use raw read counts with the right statistics. So instead of calculating an FPKM, you simply assign the reads or fragments to a defined set of genes or transcripts. This doesn't work if you don't know what the genes and transcripts in your genome are, but if you do, as in the human genome for example, you can determine the so-called raw counts for each gene or transcript; there's a toy sketch of the counting idea below. The transcript structures could still be defined by something like Cufflinks, so if you wanted to use Cufflinks to define the transcriptome and then switch to a raw-count-based approach, you can do that, and in the latest version of the Tuxedo suite they now directly support that within their own tools.
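Here's a deliberately simplified Python sketch of the raw-counting idea, the concept behind tools like htseq-count rather than their actual implementation; the gene models and read coordinates are invented, where a real tool would read a GTF and a BAM file.

```python
# A simplified sketch of gene-level "raw counting". Gene models and read
# positions are invented; real tools parse a GTF and a BAM file.
from collections import Counter

genes = {                                    # gene -> list of exon intervals
    "geneA": [(100, 200), (400, 500)],
    "geneB": [(450, 600)],                   # overlaps geneA's second exon
}

def genes_hit(read_start, read_end):
    """Return the set of genes whose exons a read overlaps."""
    hit = set()
    for gene, exons in genes.items():
        if any(read_start <= e_end and read_end >= e_start
               for e_start, e_end in exons):
            hit.add(gene)
    return hit

reads = [(120, 170), (410, 460), (480, 530), (550, 590)]
counts = Counter()
for start, end in reads:
    hit = genes_hit(start, end)
    if len(hit) == 1:                        # count only unambiguous reads,
        counts[hit.pop()] += 1               # as htseq-count does; reads
    # hitting 0 or >1 genes are dropped (no_feature / ambiguous)

print(counts)  # Counter({'geneA': 1, 'geneB': 1}); two reads were ambiguous
```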
So even they recognize that sometimes, for differential expression, you want these count-based approaches. We're going to use a tool called htseq-count. There are different modes of running it, and we'll go through this in the lab: you can tell it whether your data is stranded or not, how to deal with overlapping alignments and how to assign them, and whether to summarize at the exon, transcript, or gene level, and you give it a GTF file. There are some important caveats to transcript-level analysis with htseq-count; in fact, the author himself is not that supportive of using it for transcript-level estimation, for some of the same reasons that make Cufflinks' estimation challenging. Where you have those two isoforms again, that overlap almost entirely and have just a little unique content between them, your ability to come up with unique counts for that difference is really limited, and so the count-based approach breaks down a bit there. So a lot of people use it mostly at the gene level, if you're just interested in straight differential gene expression.

Yeah. So TPM is a kind of variant of FPKM; it stands for transcripts per million. Let me describe this right: you do the gene-length normalization first, and then the library normalization, scaling things so the values in each sample sum to a million. So you can use it to compare abundances of transcripts within a sample, and some people like it for that, to get an idea of which genes are highly versus lowly expressed within a condition. But I don't know that it's recommended for cross-sample comparisons. As for counts, there's a Biostars... sorry, a SEQanswers post that you might want to read if you're going to try transcript-level analysis with the count-based approach.

So which should you use? For me, FPKM: I use it because I like to leverage the benefits of the Tuxedo suite, which is quite well developed, if you want a complete workflow that lets you do things like de novo transcript identification and has the visualization built in with CummeRbund. The FPKM measure itself is convenient for visualizing, for heat maps, for calculating fold change. It's much more in line with, say, a GC-RMA-normalized value you would get from an Affymetrix experiment: I could take these FPKM values and feed them into the same pipelines I used when I was analyzing microarray data. Whereas a count is a very different thing, right? It's just a raw count of reads, and in and of itself it's not that useful. But there are some really robust statistical methods for differential expression that take counts as input. A lot of the really statistical crowd is focused on developing these statistics to do a good job of identifying which differentially expressed genes are significant based on the counts. They also tend to handle more sophisticated experimental designs: there are count-based differential expression packages in R, for example, that can handle time-series experiments, or experiments with many different conditions and complicated relationships between them. Cufflinks is slowly adding more and more of these features over time, so I would say it's gaining parity in the ability to deal with sophisticated experimental designs. Another advantage is just computational, especially in the past: running the Tuxedo suite takes a lot of time, and it's very computationally intense.
The more samples you have, the more that scales up, roughly linearly. If one sample takes maybe a couple of days to run through all these pipelines and you have hundreds of samples, that's hundreds of days, right? So you need a big cluster to get it all done in a couple of days. And then, when you went to do the Cuffmerge and differential expression steps, in the past, once you got above a certain number of samples, because it was trying to do so much, it would just run out of memory and not complete successfully. With the count-based approach, you're really distilling things down to a very simple matrix. You run the counting on each sample, which is quite quick, and then you build a samples-by-genes matrix that just holds count values. That takes essentially no room to store and is easy to compute on: you can load it into R in a second and run most of these count-based statistics in no time at all. So it could be used in cases where you had really large RNA-seq data sets that you would have a hard time pushing through Tuxedo. But, having said that, within the last maybe six months they've tried to deal with this by adding yet another tool called Cuffquant, which converts your Cufflinks output for each sample into a much more efficient binary representation that Cuffdiff can use, allowing you to run Cuffdiff on much larger numbers of samples. So they're addressing that limitation.

Yeah, I'll show you one, and I can list off a few. edgeR is one. limma is a package you can use for differential expression; I believe you have to do some transformation of the count data before using limma. We're also working on a kind of review of RNA-seq analysis methods; Malachi has gone through and compiled some crazy lists of different tools and recommendations, probably a very long list, and we're going to provide links to that. Among the others I just mentioned, DESeq and edgeR: there's a working example of code for those as part of this set of tutorials. DESeq is another one that's actually, I think, developed by the same person who developed htseq-count. There are a bunch of others, and we'll post those on the wiki soon.
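To make the count-matrix idea concrete, and the TPM measure mentioned a moment ago, here's a toy Python sketch; all the gene names, lengths, and counts are invented.

```python
# Sketch: after per-sample counting you just have a small genes x samples
# matrix of integers. This also illustrates the TPM idea from earlier:
# normalize counts by gene length first, then scale each sample to sum to 1e6.

gene_lengths = {"geneA": 2000, "geneB": 6000, "geneC": 1000}
count_matrix = {                     # gene -> counts in [sample1, sample2]
    "geneA": [1500, 3200],
    "geneB": [4500, 9000],
    "geneC": [300, 200],
}

def tpm(sample_index):
    """Transcripts per million for one sample of the count matrix."""
    rates = {g: count_matrix[g][sample_index] / gene_lengths[g]
             for g in count_matrix}
    total = sum(rates.values())
    return {g: r / total * 1e6 for g, r in rates.items()}

print(tpm(0))
# TPMs within a sample always sum to one million, which makes within-sample
# comparisons easy but puts each sample on its own relative scale.
```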
So I would say it's still the case that multiple approaches are advisable. In our pipelines, by default, we always run both the Cufflinks and Cuffdiff style approach and the htseq-count plus edgeR approach. There's a pretty decent overlap between these methods, but there's also a lot that is unique to each. Depending on your goals, a lot of the time this is hypothesis-driven research, right? We're not looking for the definitive answer so much as clues, or potential experiments to test in the lab, and you really don't want to miss out on potential predictions because of some idiosyncrasy in how Cufflinks works: because of some alignment challenge there was a weird fragment bundle and it got thrown away, so you get no estimate for that locus, gene, or transcript. By using a few different methods, you hedge your bets that you won't miss anything biologically important. You'll probably also bring more positives to the lab that you can test in downstream analyses or experiments.

For example, one gotcha with Cufflinks is that there are parameters that affect the maximum bundle fragments. When it's looking at all your alignments and building these bundles that roughly correspond to gene loci, it says, okay, there's a crazy amount of alignments here; this is probably expression of a gene from this one genomic location, so I'm going to make a bundle out of it. But it has a parameter that says if this bundle gets to be more than, let's say, a million reads, it's going to cause computational problems downstream, just because of the way Cufflinks works. So it assumes it's an artifact: it's unlikely that many reads would align to one place in the genome, it's probably a low-complexity region, like a repeat where tons and tons of reads pile up and create a big spike. It doesn't trust it, it tosses it, and the locus gets marked with a status I think they call HIDATA, or something like that, and it doesn't get an FPKM estimate. So now you have no idea what the expression of that possible gene is; you just have this vague status, and a lot of the time you don't notice that, right? It's invisible to Cuffdiff, because without an FPKM there's no way it can calculate differential expression. And in the expression output it's quite common to sort on FPKM or on the status, and usually you look for the OK status, so you'll simply lose these HIDATA loci. A lot of the time they are artifacts, just problematic regions for alignment, but almost every time I've looked there have also been 10 or 20 genes in there that are known to be really highly expressed genes, and you've thrown them away. I've seen cases in cancer samples where it was an amplified gene that was overexpressed, probably driving the tumor, and that's why it was so highly expressed, but it failed this max bundle size parameter that you probably didn't even think about when you were running Cufflinks. We've noticed that, thought "oh crap", and gone back and bumped the parameter up, but of course there's a cost: as you increase it, every Cufflinks run takes longer and uses more memory. So you have to find a balance. What's practical on our cluster, given that we're going to run hundreds of samples through it? Can we afford to bump this up for every one of them, or do we run at the lower setting, look for problems, and then rerun certain samples? So that's an example of how you could lose an interesting biological observation just from the way Cufflinks works, an observation you wouldn't lose if you also had a count-based approach to hedge your bets.

Yeah, similar. I think we do that in this exercise, if you go all the way to the end, through all the optional stuff. Here we have this toy data set that's quite small, but even there I want to say the overlap was at least 50%. Usually it's better than that, but there's always a pretty substantial non-overlap as well. Like I say, if you go all the way to the end, you do all the CummeRbund and Tuxedo-based approaches and all the count-based approaches, and you come up with your two lists of differentially expressed genes; comparing them is essentially the set operation sketched below.
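Here's a hypothetical Python sketch of that comparison, with placeholder gene names standing in for whatever your two runs actually produce.

```python
# Hypothetical sketch of the end-of-lab comparison: overlap two lists of
# differentially expressed genes, one from the Tuxedo route and one from
# the count route. Gene names here are placeholders.

cuffdiff_de = {"TP53", "EGFR", "MYC", "GAPDH", "BRCA1"}
edger_de    = {"TP53", "EGFR", "KRAS", "BRCA1", "PTEN"}

both       = cuffdiff_de & edger_de     # found by both methods
cuff_only  = cuffdiff_de - edger_de     # unique to the Tuxedo route
edger_only = edger_de - cuffdiff_de     # unique to the count route

print("both:", sorted(both))
print("cuffdiff only:", sorted(cuff_only))
print("edgeR only:", sorted(edger_only))
# These three sets are exactly what a Venn diagram of the two lists shows.
```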
And then I think almost the last thing in the lab is to go to a web-based Venn diagram tool and create a Venn diagram of those two lists. The statistics behind the lists are handled for you by the packages; it's all done internally, so you end up with two sets of differentially expressed genes and you can see how well they agree.

So the question is: for the count-based approaches, do we also need to think about gene size and library size? And the answer is yes. So she's asking, well, how is it really different then? It's not different in the sense that both approaches try to deal with those factors. The difference is that with the count-based approach you never get a normalized expression value that you can then go and do something else with; you just get whether the gene is differentially expressed, and a p-value. The only information you have about those genes is still the raw counts, you see what I'm saying? The normalization for library size is internal to the statistic. As for gene size, you don't actually need to take it into account for differential expression, because you're only ever comparing gene A in tissue A against gene A in tissue B; the gene size is constant, so it cancels out. You get more counts because the gene is bigger, but you're only comparing the bigger gene to itself, so it's okay. And if you manipulate the data beforehand, you may violate some assumption about the distribution, so I wouldn't say pre-normalizing is recommended; part of the raw-count philosophy is that the statistical methods are thinking about the data in a certain way, and to some degree they expect the data to arrive unprocessed. And we're only talking about gene size here, but there are actually a lot of factors like this. Cufflinks, for example, also tries to take into account additional biases, like GC content, which matter if you want to compare gene A to gene B within a single data set, across the range from the most highly expressed to the most lowly expressed genes. Again, if you only compare gene A to gene A across samples, you don't need to account for that, because gene A has the same GC content in both samples. So there are many, many factors that cancel out, and the raw-count people basically say: you can't properly account for all those factors, so for relative expression you're better off just using the raw counts and letting all those things cancel. There have been something like 20 papers published comparing these different approaches, and every one I've seen considers just the straight differential expression application: I want to know which genes are significantly different between two conditions, should I use approach A or approach B? And they come to different conclusions; nobody really gives a firm answer either way. Straight differential expression is the common denominator, but you're absolutely right that once you start thinking about more sophisticated analyses, like co-expression analysis, you need an expression value to start from.

So, I put in this section called Lessons Learned from Microarray Days. This is less of an issue now than it used to be, but it's still an issue. Often I'll have collaborators come to me really excited about RNA-seq: it's going to solve all their problems, they're going to make all these amazing biological discoveries, and they propose to sequence their cell lines with one replicate for condition A and one replicate for condition B.
And that's going to cost about $5,000. Wow, that's expensive, but we'll get a lot of data and we'll learn a lot from it... right? The transition from microarrays to RNA-seq hasn't changed anything about the need to consider biological variability. Some people have recognized this: the MARF lab has a nice tool for doing power calculations for your RNA-seq experiments, and there are some good posts about study design and the need for biological replicates. Basically, there's no reason you need fewer biological replicates for RNA-seq than you did for microarrays. In some ways you need more, because you're actually measuring a lot more, and this brings up the question of multiple testing correction. Are you familiar with the idea of multiple testing correction? Pretty much everybody? Basically, the more attributes we look at, the more likely it is that we'll see differences between the treatment and control groups by random chance. When you had 30,000 genes on your array, it was already a problem: given a certain amount of random variability, if you compare five samples from condition A to five samples from condition B, every so often, by chance, there'll be a significant difference between them that is actually just random fluctuation in the data. Multiple testing correction methods attempt to correct your p-values, or your statistics, for this well-known phenomenon; there's a small sketch of one common correction below. When you move to RNA-seq, you're potentially adding orders of magnitude more tests, because now you have not only the tens of thousands of genes and the hundreds of thousands of exons, you have all the complexity of the transcriptome. You have genes, transcripts, and exons, but also unique junctions, every single known or novel exon-exon junction predicted from your data. You might have retained introns that you think are a real or potentially interesting biological phenomenon. You have other species of RNA that probably weren't on your typical array, like microRNAs and lincRNAs and so on. So you have a huge multiple testing problem, and if anything this requires more data, more replicates, to get around, rather than less.
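Here's a minimal Python sketch of Benjamini-Hochberg FDR adjustment, one common multiple testing correction; the p-values are invented, and in practice packages like edgeR or DESeq report adjusted p-values for you.

```python
# A minimal sketch of Benjamini-Hochberg FDR adjustment. The p-values
# below are invented; DE packages normally compute this for you.

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values), preserving input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end            # 1-based rank of this p-value
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(pvals))
# With six tests, a raw p of 0.039 is adjusted to ~0.06, no longer
# significant at FDR 0.05: the multiple testing penalty in action.
```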
When you're talking about pooling samples: pooling can be appropriate in some circumstances. In some ways, methods like Cufflinks are doing that in silico with the Cuffmerge step, you know. They recognize that you have imperfect or incomplete sampling in each sample, so there's a concept of uncertainty in the estimate for each transcript built in, and when you merge everything together you get a single representation of the transcriptome and then go back and reassign probabilities against that merged transcriptome. But no, we don't ever do that, physically pooling replicate libraries, though I have seen it done. I think if you have really small RNA-seq libraries, like in the older days, it might make more sense, since each one is less of a complete representation of anything; a lot of the variability you're seeing there is really an insufficient-sampling problem, and pooling helps with that.

So I think this is the last slide in this section, just to remind you that once you have your expression estimates, or your list of differentially expressed genes, there's of course this whole downstream set of applications. That's really the topic for an entire course, and actually the Canadian Bioinformatics Workshops probably offer such a course, but we don't have time for it here. Really, the expression estimates and differential expression estimates that you get out of one of these pipelines can be fed into a lot of the existing analysis pipelines that we might have used in the microarray days. We provide a supplementary R tutorial, just an optional extra, that will help you get your Cufflinks data into R and manipulate it, and will show you how to do some clustering and make some heat maps, for example. A lot of this functionality is also provided through the CummeRbund package, which we will go through.

Let's see... classification analysis. That's a common application of RNA-seq data, at least for us, where we're looking to build a signature that predicts outcome between different types of tumors, or that can distinguish between benign and malignant expression signatures, that sort of thing. I think RNA-seq data is quite powerful for those kinds of approaches, because it gives you such a comprehensive representation of what's going on. I did an exercise where I compared the ability of different technologies to correctly classify, for example, breast cancer cell lines into their subtypes, or into drug responders versus non-responders. We compared RNA-seq data, microarray data, copy number data, a whole bunch of data types, and RNA-seq was really the most powerful; it just had the best signal to noise. So I think it's really good for that. You could do machine learning on, say, the gene and transcript expression levels, where you have, say, 10 samples in one condition and 10 samples in the other condition, and the FPKM values for each gene are the features. Right, exactly, we use those, hopefully more than 10 and 10, with the goal of classifying unknown samples: based on the expression values you measure for a new sample, can you accurately predict whether it's resistant or not?

I actually provide a series of tutorials on Biostars, four tutorials that walk you through this for a microarray data set. The reason it's for microarray data is mostly that you need quite a lot of samples, and there aren't that many big RNA-seq data sets out there with good clinical outcome information that you can do classification on. But the methods are really exactly the same: if you learn them on this array data set, you can definitely port them directly to your RNA-seq data.

Pathway analysis, of course, is a common downstream application. If you're looking for something really quick and easy, there are web tools like DAVID; I don't know that I would really recommend it, but it is quick and easy. There are commercial solutions like Ingenuity Pathway Analysis. Cytoscape is pretty good for pathway analysis. And there are many, many R Bioconductor packages; if you go to this link, you can see a list of them.

So the question is, what is a sufficient sample size for classification?
We always say this, but there's no right answer to that question, because it depends on the size of the effect. If your two conditions are really, really different, like if you're trying to build a classifier that tells one tissue type from a very different tissue type, say liver tissue from brain tissue, you'll be able to build a very strong classifier on a very small number of samples. Of course, that's not really a useful classifier, because it's not hard to tell those tissue types apart. But yeah, probably tens of samples per condition, when the difference between your conditions is quite distinct, with many genes separating them, might be enough to build a classifier. That is not usually the case, at least for us in the cancer world. We're always being asked to build a classifier to tell who's going to respond to treatment, or who's going to progress, where you have maybe ten years of follow-up and a baseline sample. Those kinds of differences tend to be fairly subtle, and you need a lot of samples to build even a marginally useful classifier, which I think is why this hasn't really moved into the RNA-seq realm that much yet: nobody has had the money to build such data sets. But they're starting to come, and they're going to become more common, and probably as soon as RNA-seq moves over to something like the X Ten instrument, and you can produce an RNA-seq data set like we're talking about here for a couple hundred dollars, then I think you'll start to see it happen. The biggest RNA-seq data set with outcomes that I know of is from TCGA, like the TCGA breast cancer set.

I usually use random forests; that's why I made the random forests part of the tutorial, because that's what I'm most comfortable with. It has some nice features. It's fairly good at taking any kind of variable as input, like a mix of categorical and continuous variables. It has built-in feature selection, so you don't have to decide ahead of time what you think the good features will be; that just comes out of the method, and the features are ranked, so you'll see which features were most useful in the classifier. It's pretty good at cross-validation, too: it has this built-in out-of-bag internal cross-validation that, in my experience, does a really good job of predicting future performance on new independent test data. That's something that trips people up a lot in machine learning or classification: if you're not careful about your practices, you can over-train and think, oh, I have this great-performing classifier, and then you go test on some new independent data and it doesn't work very well, because you over-trained on your original training data. The way random forests is set up takes that out of your hands a little bit, because it's built into the method; it's a little harder to screw up. Having said that, the WEKA package is a good learning tool for cross-validation. It has a nice interactive, graphical mode where you can import a data set, plug in all these different classifiers, connect them to a performance or accuracy estimator, run your data set through SVMs and random forests and all these other methods, and get the performance estimates for each. I've done that for several of my classification problems. A lot of the time, if you have real biological signal there that is classifiable, lots of machine learners will find it; I just like random forests. There's a sketch of the random forest idea below.
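Here's a hypothetical Python sketch of that classification setup using scikit-learn's random forest: samples as rows, per-gene expression values (e.g., FPKMs) as features, condition as the label, with the out-of-bag score as the built-in cross-validation just described. The data is random noise plus one artificially informative gene; real work needs far more samples and proper validation.

```python
# Hypothetical sketch: random forest classification of expression profiles.
# Fake data: lognormal noise standing in for FPKMs, one informative gene.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_genes = 40, 200
X = rng.lognormal(mean=2.0, sigma=1.0, size=(n_samples, n_genes))  # fake FPKMs
y = np.array([0] * 20 + [1] * 20)              # condition A vs condition B
X[y == 1, 0] *= 5                              # make gene 0 differential

forest = RandomForestClassifier(n_estimators=500, oob_score=True,
                                random_state=0)
forest.fit(X, y)

# Out-of-bag accuracy: the built-in cross-validation mentioned above.
print("OOB accuracy:", forest.oob_score_)
# Built-in feature ranking: gene 0 should float to the top.
print("top feature:", int(np.argmax(forest.feature_importances_)))
```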
Do we use Ingenuity? We do use Ingenuity Pathway Analysis sometimes, just because it has nice graphical features for making figures. Other than that, I would use one of the Bioconductor packages, though I don't know which one I would recommend. Yeah, a lot of universities will have a shared site license or seat license so you can get access to it, but on your own I wouldn't recommend it. Instead of the Ingenuity stuff, I use DAVID. But as I think I mentioned before, a lot of the work we're doing now is N-of-one tumors, where pathway analysis doesn't really make sense; you can't really do a meaningful pathway analysis on one patient. I've had the same experience. I don't really know what the answer is, but I usually try to steer the people I'm working with away from pathway analysis. There's this alchemy of "oh, we need pathway analysis, we'll find something amazing," but that hasn't been my experience, even though we run a bunch of different pathway analyses, cancer pathways and so on, on the differentially expressed genes. I would say I'd rather map my differentially expressed genes onto pathways so that I can look for patterns myself, rather than rely on the established enrichment-style pathway analysis where you get a list of statistically significant pathways; those just never seem to work out for me. Mapping the differentially expressed genes onto pathways and coming up with my own hypotheses seems like the more useful option.

Oh, I don't know, I don't actually know what the subsequent modules are. So you're doing network analysis and pathway analysis? When I think about network analysis, I'm usually thinking about building your own pathways based on interactions between genes and proteins, right? Those could be literal protein-protein interaction data, or things like gene co-expression measures that say these genes' expression moves together, so maybe they're related, maybe they're in the same pathway, but the pathways aren't defined by canonical sources like KEGG or whatever. So there's probably utility in overlaying or integrating pathways predicted from your raw data, like gene co-expression or protein-protein interactions, with the canonical pathways that have been defined by biochemical experiments, or however they've been defined. But I really don't know what the next modules are about, so I think we'll have to wait and find out. Oh, that's it, actually. So are we going to take a coffee break, Michelle?
Well, maybe I'll just introduce what you're going to be working on; I think it might make more sense to do the break and then start the lab exercises. You're moving on to this part of the workflow. We're going to use Cufflinks, as described here, to do transcript assembly, basically to define the transcripts in each of your samples, and then Cuffmerge and Cuffdiff to identify the genes and transcripts and assign expression values to them. And then in this module I think we're also going to do the CummeRbund part, using it for visualization as well. For the data sets, we'll just continue with the three UHR and three HBR samples. It's really not an ideal data set to look at for differential expression. We will find differentially expressed things, because these are very different tissues, so even though we have so little data, you'll see differentially expressed things coming out. Will the differences be similar between the methods, or focused on one problem or the other? Yeah, they'll be similar. It's really just to illustrate how to run the pipeline; you'll have to use your imagination a little about what it would be like with a proper large data set. So yeah, why don't we take the break.