So we're moving on to module three today. We've finished introducing you to the cloud, preparing your data, your annotations, and your reference, and doing your alignments. The next step, of course, is to estimate actual expression levels for the transcripts and genes based on those alignments. We're going to continue working our way through this running example of the RNA-seq analysis pipeline, and in this module specifically we'll cover estimating expression from those alignments for genes and transcripts. We'll talk a little about this concept of FPKM, which you may have heard of, and its use versus raw count based approaches. We'll talk about some of the different differential expression methods, and then a little about downstream interpretation of expression and differential expression, with some gotcha topics.

You've already looked at some of your data in IGV (you should have, anyway), and hopefully you saw something like this: very clearly spliced alignments, where most of the reads align, obviously, to the exons of a gene, with some alignments spanning exon to exon across the splice junctions. We started talking yesterday, in a very general way, about how you could start to think about expression and differential expression based on that data. The amount of alignment is really some kind of measure of expression. There's a certain amount of noise, with the possibility of misalignments, and biological noise in the form of transcriptional noise, but in general, if you see a pattern like the one depicted on the screen, you can already feel fairly confident that this gene is expressed. This is very convincing evidence of fragments that are clearly spliced RNA; it would be a crazy coincidence to see alignments like that if you just had some genomic DNA in your sample. There's definitely some RNA from these transcripts there, so you can already say there's probably expression of this gene.

The more interesting question, obviously, is how highly expressed it is compared to other time points, other samples, other conditions, and so on. You can get a general feeling for that too, simply by looking at the alignments, with some caveats. You'll recall from this view (if I can get a mouse somehow) that there's a thin bar along the top, the coverage track, which shows a histogram, if you will, of the total depth of coverage at any position. That gives you a high-level summary of the total amount of alignment at any given region and over this locus in general. If your two libraries were sequenced to similar depth and you're looking at one gene between two samples, and you see a massive pile of data in one sample and practically no data, very few alignments, in the other, you might already conclude that there's evidence of differential expression. You can get a crude sense that this thing is more highly expressed than the other thing. And that is indeed the case here: clearly there are a lot more reads on the top than on the bottom, although visually that's hard to tell because you have a sliding window.
If you were to scroll down in the bottom panel you'd see that the reads fall away quite quickly, whereas in the top panel you could scroll down and down and just see more and more reads, which is reflected in the higher coverage bars.

Something else you can see in this view is evidence of some three prime bias. That's something you also want to think about and watch for, especially for applications that require good coverage across the whole transcript, for example if you're looking for variants or splicing patterns and you have much better coverage for some parts of the gene than others. Here the coverage tails away to almost nothing towards the five prime end, so you have less power to detect events at that end of the transcript. This was something you very commonly saw with polyA-selected libraries, which I would guess this probably was: you have a certain amount of RNA degradation, and you're capturing based on one end, so it's inevitable that you enrich for the end you're capturing from. That's perhaps another advantage of non-polyA approaches to RNA-seq.

So what is FPKM, or RPKM? You've probably heard these terms; they're probably the most commonly used metrics for expression levels based on RNA-seq data. I believe this goes back to the very first RNA-seq papers, where they had the concept of RPKM and may even have coined the phrase at that time. RPKM stands for reads per kilobase of transcript per million mapped reads, and it's really just a simple normalization to account for the fact that we can't simply count up the number of aligned reads for a particular transcript and compare that to other transcripts or other samples; there are some basic normalizations that have to be performed first. One has to do with the size of the gene: larger genes and transcripts will overall tend to produce more fragments, and therefore have more reads sequenced, even if there are the same number of copies of that transcript as of some shorter transcript. The other is library depth: if you sequence one sample to ten times greater depth and don't account for that, all of the transcripts and genes in that sample will appear more highly covered, and therefore possibly more highly expressed, than in your less deeply sequenced library. So you can't just compare raw counts without thinking about those issues, and RPKM is a simple way to account for them. FPKM is exactly the same concept; we just switched to the "fragments per kilobase" terminology when paired-end sequencing became popular and common. With paired-end sequencing you don't want to count the two reads of a pair twice: they are two reads sequencing the same fragment, representing the same piece of a transcript. Since your goal in counting is to tally up copies of transcripts in a cell, or in the pool of cells you've sequenced, you want to treat both reads as evidence for one fragment, which is evidence for one piece of one copy of that transcript. So, as I said, in RNA-seq the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it, but you want to normalize for the size of genes and the total depth of your library, and FPKM and RPKM attempt to do that.
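To make that concrete, here's a minimal sketch of the calculation in R. This is purely illustrative, with hypothetical variable names, and it uses the column sum as a stand-in for the total mapped fragments in the library; the real tools layer further normalizations on top of this.

    # counts:  vector of fragment counts per gene (hypothetical input)
    # lengths: gene or transcript lengths in base pairs (hypothetical input)
    fpkm <- function(counts, lengths) {
      per_million <- sum(counts) / 1e6   # total mapped fragments, in millions
      per_kb      <- lengths / 1e3       # feature length, in kilobases
      counts / per_million / per_kb
    }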
In words, the formula is very simple: take the total number of mappable reads for a particular feature, whether it's a gene, a transcript, or indeed another feature like an exon; multiply by a thousand and by a million to get the "per kilobase" and "per million"; and divide by the total number of mappable reads in the whole library and by the length of the feature you're looking at. That's what gives you the classic FPKM. You can do this calculation yourself, but most software, like Cufflinks, produces an FPKM-like measurement for you while also incorporating other, more sophisticated normalizations, things like GC bias corrections and other statistical refinements, all built on this fundamental concept.

So how does Cufflinks work? Cufflinks is actually incredibly complicated in how it assigns read counts to different isoforms and then calculates FPKM-like values from them, and the reason is that Cufflinks is trying to do something really ambitious. It's trying not only to assign expression values to every transcript, but to determine the identity of all of the transcripts in the sample at the same time. You can guide that with known transcript structures, but it is still trying to be very clever: to decide, based on the data, which transcript species it believes are in the sample, and then assign reads in a probabilistic manner to those different putative isoforms. It does this by first assembling overlapping bundles of fragment alignments. It sees a pile of alignments that look like they belong together, calls them a bundle, and that defines a possible locus: a region of the genome with overlapping alignments that look like evidence of expression from a gene locus. It then tries to assemble those fragments into transcript isoforms using concepts from graph theory. It creates an overlap graph, in which the different isoforms are inferred from the minimum number of paths required to cover the graph. That's what's depicted here. Among all these fragments there is a small number of what are called mutually incompatible fragments: fragments that carry unique information, such that two isoforms couldn't both contain them. They indicate putative differences, where for example in one transcript an exon might be skipped and in another it might be retained. A single transcript can't both skip and retain an exon, so if such differences exist in the data, you will see some reads spanning the junction from, say, exon one to exon three, and others going from exon one to two and from two to three. That right there indicates that there must be at least two different isoforms in this population.
So it uses that information and attempts to assemble, using a complicated probabilistic model, the most likely minimal set of paths through these transcript fragments, until it comes up with a representation. In this case it believes, based on these fragments, that there are at least three transcripts being expressed from this locus: a yellow transcript, a blue transcript, and a pink transcript. Then it goes back to the reads and assigns them across these transcripts to estimate an expression level for each. It doesn't do a simple, direct one-to-one assignment; there are going to be cases where, for example, in this part of this exon, many reads are equally supportive of any of the three isoforms. That is the main problem with using short reads to estimate transcript expression: in most cases, almost all of the alignments don't distinguish between the transcripts and could equally come from any of them. So it uses the information from the mutually incompatible fragments to get a sense of proportionality, and it comes up with the most likely assignment of reads, or proportion of reads, to each isoform. It also uses information it has like the fragment length distribution: it has a sense of the typical size of your fragments, and based on that it knows that a particular alignment may be more likely to be evidence for one transcript versus another.

Sorry, you had a question? It's a good question, and the answer is both, or either, depending on which mode you run it in. There are several different modes you can run Cufflinks in, and we're going to try at least three of them. In some modes you give it a GTF with known transcript structures, and that really drives the assignment; in others you can run a pure de novo mode, tell it nothing about the known transcriptome, and it attempts to infer everything from the data.

I've read this paper (I'm not a statistician) numerous times and tried to understand exactly how this procedure works. It's very complicated. It seems to work reasonably well, I can say that. I encourage you to read the paper too; probably some of you will understand it better than I do. But it's kind of like the Bentley of expression estimation: they've really thought hard about this and tried to make as sophisticated a model as possible. We're going to compare and contrast this with the complete opposite end of the spectrum, the simplest approach, which is a count-based method. You'll see that for expression and differential expression purposes there is reasonably high concordance between them, but there are definitely significant differences. In our pipelines we use both, because the truth is probably somewhere in between, or it's a third thing that hasn't been invented yet.

Yes? Yep, it starts from the TopHat alignments that you generated yesterday. It doesn't have to be TopHat: recall that we had those optional aligner steps, and we've tried to set up the STAR alignments and the HISAT2 alignments in a way that is compatible with Cufflinks. Indeed, the future supported aligner for Cufflinks is HISAT2, which was developed by the same people as TopHat2; it's essentially the updated aligner.
I was looking at the alignments from all three aligners yesterday, at a whole bunch of genes in the data we're using, and there are definitely differences, but they're quite subtle; the alignments are very, very similar. I think the main improvements of STAR and HISAT over TopHat are performance-based, in terms of run time and so forth, plus other options, but when run in the Cufflinks-compatible mode, HISAT and TopHat especially produce very similar alignments. So those should be somewhat interchangeable.

So how does Cuffdiff work? Again, at a really high level, what you're doing is looking at the variability in fragment count for each gene across replicates. That's like any statistic, right? However you come up with a measure of expression, there's going to be variability in that measure between samples, and a statistic like a t-statistic takes into account the degree of variability between the set of samples in condition A versus condition B. But there's also this concept of confidence in the expression estimate itself. The fragment count for each isoform is estimated per replicate as before, but there is also a measure of uncertainty in that estimate, which has to do with the nature of the alignments: if you're looking at two transcripts that share a lot of exons and have very few unique features to which fragments can be uniquely assigned, you're going to have more uncertainty.

I find it's easier to explain this if I draw it. Imagine a really simple gene with two transcripts, where there's just a really subtle difference: this one is just a little bit shorter than that one. That's one situation. Then imagine another situation where there are quite substantial differences, say one transcript skips a whole exon, part of this exon, and then another exon. Call the first gene A and the second gene B. So gene A has two transcripts that are very, very similar, and gene B has two transcripts with quite a lot of differences. Now imagine you're aligning data to these two sets of transcripts; I'll use single-end reads for simplicity. You're going to get reads stacking up on the exons, reads spanning the introns that are junction-supporting, and then a few reads down here that are actually unique to this isoform and absent from the other. But apart from those, when you look at all the other reads, you have no way to know whether a given read came from this isoform or that isoform. The situation is very different for gene B, where there are going to be lots of reads that are specific to one isoform or specific to the other, because the isoforms have so many differences. So when Cufflinks tries to assign an expression estimate to each of the isoforms in gene A and gene B, it can have a lot more confidence or certainty in the values it assigns to B1 and B2 than in the values it assigns to A1 and A2.
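You can get a feel for this with a toy calculation. This is not Cufflinks' actual model, just an illustration of the idea that only the isoform-distinguishing reads inform the isoform proportions, and that the uncertainty shrinks as the number of such reads grows. The numbers here are hypothetical.

    # hypothetical counts of isoform-distinguishing reads for gene A and gene B
    informative <- c(geneA = 10, geneB = 500)
    p_hat <- 0.6   # suppose 60% of those reads support isoform 1 in both genes
    se <- sqrt(p_hat * (1 - p_hat) / informative)   # binomial standard error
    round(se, 3)
    # geneA geneB
    # 0.155 0.022   -> far more certainty about gene B's isoform proportions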
For gene A, it's going to assign more or less the same value to both, and maybe, if it sees something happening over here where, as I've depicted, quite a few more reads are piling up, it will say: based on this very limited unique space I have to work with, I think the expression of isoform A1 is a little higher than A2, but my certainty about that is much lower than my certainty about the expression estimates for B1 and B2. That concept of certainty, together with the variability (and that was all within one sample; now I have this sample and another sample and another, with variability in the total amount of alignment to these different isoforms in each), those two concepts, variability between samples and uncertainty in the individual transcript-level estimates, are incorporated into the statistical model for differential expression. So it combines the estimates of uncertainty and the cross-replicate variability, uses a beta negative binomial model to compute a test statistic, and basically comes up with a p-value. It has built-in multiple testing correction as well, so you'll also get a q-value, and that gives you your differential expression statistic. I believe there's a separate paper describing the whole Cuffdiff differential expression statistic, or maybe it's in the Cuffdiff paper.

Yeah, sure. So the question is: you were looking at the per-gene output rather than per-isoform, which I assume is much simpler, and the top of the list of the most significantly differentially expressed genes was full of genes that had very low expression levels in both samples, some of them zero in one and very low in the other. How do you get rid of those? Can you filter them out somehow, or correct for this? Yeah, you can definitely filter them out. I'd say this is a common problem with count-based statistics. Yes, Cuffdiff too is ultimately based on digital counting; the underlying data is not like a microarray, where hybridization gives you an analog measure. Going back to the beginnings, SAGE was the first count-based gene expression sequencing method, and we always had this problem where one versus zero ends up looking significant. You would really hope the statistic would do a better job of dealing with that, that it would be built in, because there should be very high uncertainty in the expression estimates for something with so few alignments. So it's disappointing that those are coming through as significant for you, but you can certainly filter them easily, and it's very common to have minimal FPKM cutoffs. A common rule, for example, is that at least a certain percentage of the samples in your dataset must have at least a minimum FPKM value in order for you to trust the statistic.
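As a minimal sketch of that kind of filter in R, assuming a hypothetical 'fpkm' matrix of genes by samples, with cutoffs you would tune to your own data:

    min_fpkm <- 1      # hypothetical minimum expression level
    min_frac <- 0.25   # require at least 25% of samples above the cutoff
    keep <- rowSums(fpkm > min_fpkm) >= min_frac * ncol(fpkm)
    fpkm_filtered <- fpkm[keep, , drop = FALSE]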
How do you set the value? Yeah, what's that? These analyses are typically capped off with a last step of ad hoc summarization, and we're actually going to go through some R scripts where we do exactly this kind of thing: visualize the results and ask, okay, these things are called significant, but is there really much of a fold change? You apply a filter, or graph things, so that you can see which genes are different and actually highly expressed, versus different but only weakly expressed. You can apply that filter at multiple steps; I guess I would probably apply it at the end, just because you run through the whole workflow once and then you have all the data. The problem with that is that the statistics, I think, are based on the entire set, so you weren't testing only the subset. Yeah, I don't know how you'd practically do that; it's a good point, because at least the multiple testing corrections and things like that are based on the complete result. I don't know of an easy way to remove those genes inline in the workflow and basically force Cuffdiff not to consider them. If I were worried about that, I would probably switch to one of the other statistics that lets you more easily pre-filter before applying the test. With edgeR, for example, you're applying the statistic to almost the raw data, so you could do this exercise, find the genes you think are questionable, and not even bother applying a statistic to them. You might also be able to change some Cuffdiff parameters so that more genes simply don't get tested.

Yeah, we do the same exercise and compare; it sounds like your overlap was worse than ours. We did have substantial overlap between the significant results of the HTSeq-count plus edgeR approach and the Cufflinks/Cuffdiff approach. Arguably you trust the intersection more than the things outside it, but if you see virtually no overlap, it doesn't give you much faith in either approach, and that's challenging. It's also possibly a matter of the number of replicates. Most of the time, given the still-high cost of RNA-seq, we can't afford many replicates, and if you have few or no replicates the differential expression statistics are, I would say, almost worthless. You can compute them, but you'd probably be just as well served looking at fold changes or some other crude metric, because the statistics really break down without replicates.

So when you don't tell it about known transcript structure, it attempts to infer it: it basically builds up its own GTF file, defining transcripts itself. It only knows the structure to the degree that it was able to build the model. You're asking how it knows that a read aligning here belongs to a transcript with this exact structure; it has tried to infer that structure by looking at the junctions, and it only has a certain confidence that the isoform actually exists, but it tries to determine which isoforms exist and creates a definition of them, which may or may not be correct. It does its best job based on alignments spanning junctions and so forth.
I'm not sure I follow what you're asking: between two isoforms of the same gene, or between two genes? Between two isoforms of the same gene. Well, as I said, it tries its best based on content that is unique between the isoforms, because that's really the only thing it has to go on. If there's an exon junction, a skipping event, or an extra piece of an exon that one isoform has and the other does not, that's really the only thing it can use to guess at the expression level of one isoform versus the other. For all of the shared content, the exons that are exactly the same between the two isoforms, it really can't do anything with that information in terms of telling apart the expression level of one isoform versus another. It knows there's expression at the gene level; for that it can just ignore the different isoforms, group everything together, and say this is all evidence of expression from this locus. But if and when it tries to assign an expression level to this specific isoform over that specific isoform, it can only do so based on differences between the isoforms. If there are substantial differences, like exon skipping events, it's not that bad, because some of the reads spanning the junctions will be truly unique: they could only represent isoform one, and other reads could only represent isoform two. It can use that information to ask, across all the unique isoform-one junctions, how much read support do I have, and across all the unique isoform-two junctions, how much? You can use that to get a sense of whether one isoform is much better supported, and more highly expressed, than the other. But if there are very few differences, it has a really hard time telling, and that's why the uncertainty value increases in that situation, which in turn makes it harder to detect significant differential expression for those isoforms, because there's greater uncertainty in them.

At the gene level: a particular gene has, say, three different isoforms; does it aggregate the isoforms to get the gene estimate? Essentially, yeah.

You said the statistics break down if you have only one replicate, but yesterday you said technical replicates are very good, that you get very good correlations? Right, but you need biological replicates. You won't gain that much from technical replicates, because they are so similar to each other that they're basically just giving you the same measurement again. That gives you confidence in the measurement you have for a sample from condition A, but when you're trying to show a significant difference between condition A and condition B, you need multiple samples from condition A and multiple samples from condition B. Unfortunately we're still at the point with RNA-seq where it's quite expensive, and the vast majority of studies out there have not had enough replicates, just for practical reasons.

Oh yeah, you can still get all of the above, basically. When we run Cufflinks we will have gene-level estimates of expression, we will have transcript-level estimates, and possibly estimates at other individual feature levels, and sometimes those estimates will be good and sometimes they won't be, depending on the complexity of the gene locus.
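On the gene-level numbers: my understanding is that the gene-level FPKM is the sum of the isoform-level FPKMs rather than their average, something like this sketch, with hypothetical vectors pulled from the isoform-level output:

    # isoform_fpkm: FPKM per isoform; gene_id: the gene each isoform belongs to
    # (my understanding of how the gene-level value relates to the isoform values)
    gene_fpkm <- tapply(isoform_fpkm, gene_id, sum)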
What kind of outliers, sample outliers, like "does this sample not belong"? Sorry, yeah. I think all of the practices of expression analysis from twenty years ago, from the microarray days, still apply, so you can certainly do an outlier analysis once you have all your individual expression estimates. We would do clustering analysis, we would do outlier analysis, and if a sample looks very suspicious, isn't clustering with the conditions you would expect, is an extreme outlier, then you may exclude it, using the same logic you would for any dataset before RNA-seq existed.

Are you talking about transcripts that Cufflinks inferred from the data, potentially novel isoforms and so forth? Yeah, it's pretty easy to get those. It builds a GTF file; we saw GTF files yesterday, and they basically lay out the structure of genes and transcripts and features, and there are tools you can use to extract, say, a FASTA sequence based on a GTF file and a reference genome. I don't know that Cufflinks produces a FASTA file by default; I don't think it's an existing output of the tool, so you would have to run another tool, but it wouldn't be hard to get that.

We're also going to use a tool called Cuffmerge at one point during the workshop, and the reason for this is really tied up in what we've been discussing: Cufflinks' ability to infer transcript structure. Cuffmerge allows you to merge different Cufflinks assemblies together, and this is necessary because, even with replicates, Cufflinks won't always assemble the same number and structure of transcripts. If you give it data even from the exact same sample, say you took your sample, divided it in two, sequenced each half to great depth, provided the data to TopHat and then Cufflinks, and let it infer all the transcripts, even in that ideal situation it won't come up with exactly the same answer. It might correctly identify this structure in one half, but then in the next sample, because there's a hole in the alignments or something, it comes up with a slightly different structure, and you'd have differences in the identity and the start and stop of exons between the inferred isoforms. Then how do you compare? You've got two samples and you're trying to compare the expression levels of transcripts, but you don't have the same transcripts. That's obviously a problem, an apples-to-oranges problem. So Cuffmerge basically takes the GTF files that each Cufflinks assembly produced and creates a consistent GTF file, where it tries to use all the data at once to create a single representation of all the different isoforms that jives together. This is even more important when you think about samples that aren't replicates. Let's say we're doing a comparison between brain and liver, and there are some genes expressed in liver that just aren't expressed in brain. For those liver-specific transcripts, Cufflinks will infer isoforms in the liver sample, but in the brain sample it doesn't have any data, because the gene isn't expressed, so it can't possibly infer that those isoforms exist. The GTF file it creates from the brain sample will simply not have certain isoforms represented that are represented in the other sample. But now we want to do differential expression analysis, and those are exactly the things we're looking for: the transcripts that are highly expressed in one tissue and not in the other;
maybe they're at zero in the other. So we need to create a common map that says: here are all of the isoforms, in total, and now I'm going to tally up the counts and expression estimates against this merged set of isoforms, so that I can find exactly those cases where I have really high expression in one sample or set of samples and little or no expression in the other.

So when you merge, you don't do it by condition, right, you actually combine all the liver and all the brain? Yes, you give it your entire dataset and do a merge across the whole dataset, irrespective of condition.

The last major Tuxedo tool we're going to use is CummeRbund. This is an R package that generates many of the commonly used data visualizations: distribution plots, correlation plots, MA plots, volcano plots, clustering heatmaps, and even views of the individual gene and transcript structures that have been inferred by Cufflinks. These are just some samples. They're largely common visualizations that you would otherwise create in R or some other visualization package, many of them inspired by microarray analysis, but CummeRbund is supposed to provide an easily accessible, easy-to-use tool to get at these visualizations really quickly. You can generate some of them with just one or two R commands, whereas if you were just given the output of Cufflinks you might have to write a pretty lengthy R script to recreate them from scratch. So they're convenience functions, helper functions, to get to some of the commonly desired visualizations more quickly. Unfortunately, I will say that I don't think CummeRbund is very well supported compared to the rest of the tools in the Tuxedo suite. I think the postdoc whose project it was has moved on, and it has not been maintained very well, so I feel like every year less of it works. We do have old-school R scripts to generate most of the same visualizations; of course those are more complicated, because they do exactly the thing I was telling you CummeRbund is supposed to help you avoid, but we do provide both, so you'll have sample R scripts for how to create some of these visualizations the old-school way as well.

So, there are alternatives to FPKM, and the largest category would be raw read counts. Instead of calculating an FPKM, you simply assign reads or fragments to a defined set of genes or transcripts and determine the raw count for each gene or transcript. You could still do this by first using something like Cufflinks or another transcript assembly program to generate a GTF, so if you didn't want to use known genes and isoforms you could try to infer them and then use that as input to the counting tools. But typically these are used for the case where you know what your genes and transcripts are, you're going to trust Ensembl or RefSeq to have figured that out, and all you want to do is assign read counts to those known genes and transcripts, then use one of the other available statistical packages to determine differentially expressed genes or transcripts. We're going to go through how to use HTSeq-count to do the counting, and then we'll use a package, I believe edgeR, to calculate differential expression from the raw counts.
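As a preview, the count-based route looks roughly like this in R. This is a minimal sketch, assuming a hypothetical 'counts' matrix from HTSeq-count (genes by samples) and a 'group' factor of condition labels; we'll walk through the actual commands in the exercises.

    library(edgeR)
    dge <- DGEList(counts = counts, group = group)
    keep <- rowSums(cpm(dge) > 1) >= 2   # pre-filter: >1 count per million in at least 2 samples
    dge <- dge[keep, ]
    dge <- calcNormFactors(dge)          # normalize for library depth and composition
    dge <- estimateDisp(dge)             # estimate dispersion across replicates
    results <- exactTest(dge)            # classic two-group test on the counts
    topTags(results)                     # top differentially expressed genes

Note how easy it is to pre-filter weakly expressed genes here, before any statistic is applied, which is the point I made earlier about the count-based route.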
So which should you use, FPKM or raw counts? They definitely each have advantages and disadvantages; as I mentioned before, in our RNA-seq analysis pipelines we just do both. The FPKM-style measurements are good when you want to leverage all the benefits and usability of the Tuxedo suite. I prefer them for visualization: if I'm trying to make a heatmap and show global expression patterns and differences between sets of samples, I find FPKM a very convenient starting point, raw counts not so much, and the same goes for calculating fold changes and things like that. Where the counts are beneficial is when you want to use some of the more robust statistical methods, especially if you have appropriate replicates, and maybe if you have more complicated experimental designs like time series, or if you're doing some kind of multivariate analysis. There are many more options among the count-based statistical packages to accommodate those different designs. So yeah, basically, we do both.

If you just use raw counts purely on their own, that's problematic. These statistical packages aren't just doing a t-test between the raw counts: they take into account things like library depth and so on, and their authors would argue that they've thought through the statistical problems of count data more carefully, or in a better way, than the FPKM statistic in Cufflinks. So raw counts by themselves are not very useful, but together with packages like edgeR or DESeq or others you can get good statistical measures of differential expression, with the appropriate normalizations and the right assumptions applied.

Okay, alternative differential expression methods. We're going to try edgeR; edgeR and DESeq are just a couple of examples of packages that use raw count approaches. As I said, we advise multiple approaches, and the reason is really expressed in this Venn diagram, where you can see there is substantial overlap between the approaches but also a huge amount of non-overlap, and I don't think it's clear yet which of these answers is correct; probably none of them is quite correct. Applying multiple approaches gives you a little more confidence in the things that are consistently identified between them, and it also gives you a more comprehensive result if you're worried about missing something. But as with anything, you're going to want to think about validation experiments and so on for any predictions that come from only one method.

Lessons learned from the microarray days: in the early days of RNA-seq, I think people really did forget some of these lessons. RNA-seq is very exciting, and it's extremely powerful for detecting new kinds of events, but when you're talking about differential expression analysis and clustering and so on, really most of the lessons we learned the hard way in the microarray expression days are still valid and apply. Things like power analysis: there's a tool from the Marth lab you can use to try to determine how many replicates you need to be sufficiently powered to detect significant events under certain conditions. Most people will not like the answers they get from it, but there's not much we can do about that. There's a definite need for biological replicates; as I said before, the tools just really break down when you don't have replicates. And there are quite a few discussions about RNA-seq study design that you can find on Biostars.
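Just to give a flavor of what a power calculation says, here's a crude sketch using a plain t-test on log-scale expression. This is not the dedicated RNA-seq tool, just an illustration of why the answers tend to be unpopular.

    # how many replicates per group to detect a one-standard-deviation shift
    # at 80% power and a 0.05 significance level?
    power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.8)
    # n comes out around 17 per group; with the two or three replicates typical
    # of RNA-seq studies, only much larger effects are reliably detectable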
Cufflinks is set up for a stratified experimental design; are there tools that allow you to analyze an epidemiological design? Yes. I don't know which package would be best, but I suspect that one of the count-based packages I mentioned would have a lot more options in terms of your study design. You're talking about trying to determine correlation or regression statistics between a continuous variable and expression. You can certainly do that with the output of Cufflinks as well, but you just probably would not run Cuffdiff; you would import the results into R and then start applying the many existing R packages for those kinds of statistics. The count-based packages, though, I'd bet have those abilities built into them.

Multiple testing correction is more important than ever. The main reason is that as you compare more attributes, it becomes more likely that treatment and control groups will appear different for at least some attributes by random chance alone, and when I say attributes I mean genes, transcripts, exons, and so on. With microarrays we had something like 20 to 50 thousand genes or transcripts being measured. With RNA-seq we have all the complexity of the transcriptome: potentially custom-defined isoforms, maybe a hundred thousand of them, and all the individual features can now be assayed very discretely, down to the individual exon, junction, or intron level, plus microRNAs, lincRNAs, et cetera. So we're doing more tests than ever, which means our multiple testing problem is bigger than ever. We usually use the Bioconductor multtest package, and multiple testing correction is also built into much of the software, like Cuffdiff.
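The basic correction is one line in R. A minimal sketch, assuming 'pvals' is your vector of raw p-values from whatever test you ran:

    qvals <- p.adjust(pvals, method = "BH")   # Benjamini-Hochberg false discovery rate
    sum(qvals < 0.05)                         # how many features survive at 5% FDR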
And of course, RNA-seq doesn't solve any of the downstream problems, which would really be a topic for a whole other course. You can feed the results of your expression and differential expression analysis into many existing analysis pipelines, pathway analysis and that sort of thing. We do provide a supplementary R tutorial on some basic clustering and heatmaps, both with CummeRbund and with custom R scripts. If you want to do classification analysis, I think RNA-seq can be really powerful for that. I did a study of breast cancer looking for classification models that could distinguish between drug responders and non-responders, and I had very comprehensive genomic profiling of those samples: microarray data, exome sequencing mutation data, copy number data, and a bunch of others. I found that RNA-seq was actually the most powerful data type for distinguishing between the samples, again just because it has such rich information, so many features to distinguish between samples. There are packages like Weka, a good machine learning tool for classification, and I usually use random forests in R when I'm doing classification analysis. And again, for pathway analysis there are many existing tools and packages, and you can read about those on Biostars, but unfortunately we're probably not going to get far enough in the time allotted to focus on those.

So, module three. Going back to the practical exercises, we might just rewind slightly, because I think there was maybe one exercise we didn't finish yesterday, and if you guys want to finish that we can go back and finish it, but then we'll continue with the module that basically picks up where you left off. What we've done so far: we've obtained our raw sequence data; we've produced alignments for that data against the reference genome that we also obtained; and we've obtained annotations, which we're going to use together with our alignments to compile transcripts and estimate expression for those transcripts and genes. We'll then merge together those transcript representations from the different samples to create a consistent view of the transcriptome, which will allow us to do the Cuffdiff comparison between our two conditions, and then ultimately the visualizations based on those.

So the question is whether the data we got is stranded, and yes, it is. And is there any setting or file where we have to specify that it's stranded? The answer is yes: there were options in TopHat, and I think there is an option in Cufflinks as well, where we specify strandedness, but for sure in TopHat. So that's something you have to know about your data, although once you align the data you can often figure it out and reverse-engineer it, because stranded data has a very distinct alignment pattern compared to unstranded data. I don't know if you did this exercise, but in IGV, if you color the reads by first-of-pair strand (I think that's what the option is called), you can see very clearly which strand the alignment is coming from for each transcript, and in stranded data you'll see all of the alignments on one strand for each gene, switching strands depending on which strand the gene is transcribed from. So there isn't a file, but there are options you choose when you run the tools to specify the stranded nature of your data, and there's a whole table on the wiki that explains what those settings are for the different tools we use in the workshop; I think it's supplementary table two.