Okay, so you guys are used to these slides by now. In module 3 we're going to move on to expression and differential expression. This is your third out of four modules, and the objectives for this module are to talk about estimating expression for known genes and transcripts. We're going to talk a little bit about the idea of FPKM versus raw counts, very briefly review a couple of different differential expression methods, and talk about downstream interpretation of expression and differential expression estimates. I think we're going to have a slide or two just talking about multiple testing correction and a few other topics. We actually already started to think about expression estimation for known genes and transcripts when we were reviewing the IGV alignments yesterday. When you were looking at IGV you could see the alignments of your raw data against the reference genome, and you could line those up against known gene models. From that data you could already start to think about, say, the total expression level for one gene versus another, or for a tumor versus a normal, just by comparing the coverage for those two genes. I don't think we got into this, but you can probably imagine how you would start to think about alternative expression as well, right? You can see alignments that span certain junctions, and you can make inferences: if there are more reads supporting this junction than another junction, maybe that means this particular isoform is more highly expressed than another isoform. Of course you can only really do that manually in IGV. It's nice to confirm by eye, but we need a more comprehensive solution, and that's what Cufflinks is going to give us. I'm just showing here an example of a gene that is perhaps down-regulated, where the coverage is higher in the top sample than the bottom sample.
You can also see that there's evidence of a three prime bias in this sample. As you go towards the three prime end of this gene you're getting more coverage, and it kind of drops away towards the five prime end. That's pretty common to see, and part of the reason is that if you're doing a poly-A selection you're pulling down fragments from the three prime end, and if you have degraded RNA you're going to have a harder time pulling down fragments from the five prime end. So, has anyone heard of FPKM or RPKM? This is probably the most commonly used measure of expression level from RNA-seq data. I believe the commonly cited first paper for an RNA-seq experiment introduced the concept of RPKM, or something very similar. RPKM is an acronym for reads per kilobase of transcript per million mapped reads. FPKM is almost exactly the same concept, but it refers to fragments instead of reads, and the main reason that came about is that when we switched from single end sequencing to paired end sequencing we started thinking about the fragment that was being sequenced from both ends, rather than a single read of a fragment. But they're basically the same concept; it's just a different convention. In RNA-seq the relative expression of a transcript is proportional to the number of fragments that originate from it. If you have more RNA fragments in your sample for a particular transcript, you expect to get more random fragments from it, and therefore, because you're randomly sequencing these fragments, you're going to get more sequences and more alignments from that transcript. That's the theory. But there are a couple of important sources of bias. Bigger genes, right, like a huge gene, just have more space.
So when you randomly break up the RNA in your sample, more fragments are going to tend to come from large genes than from small genes, because there's just more RNA there to create fragments from. Similarly, the total number of fragments is related to the total library depth. If I have two samples and I sequence one of them with one lane of Illumina and one of them with two lanes of Illumina, I'm going to get, just by the very nature of it, twice as many total fragments from the two-lane sample as from the one-lane sample. And you wouldn't want to start comparing raw fragment counts from those two samples without first correcting for that difference in library depth, right? There are always differences, too: even if you do one lane for each and try to make them exactly the same, there will be differences in the efficiency of cluster formation, which will result in more fragments for one sample than for another. So you always want to correct for the total library size. RPKM and FPKM have a fairly simple formula: you take the number of mapped reads or fragments for a gene or transcript or exon, or whatever feature you're trying to calculate an expression estimate for, and you divide it by the total number of mappable reads in the library and by the number of base pairs in that feature. The K and the M come from multiplying by a thousand and by a million so that the numbers work out nicer. If you just divide by the total library size and the size of the gene you end up with these weird fractional numbers, so they decided to make everything relative to a hypothetical thousand base pair gene and a million read library. That's why you're multiplying by a thousand times a million, which is ten to the nine. Does that make sense?
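To make that formula concrete, here's a minimal sketch in Python; the counts and lengths are made up for illustration:

```python
def fpkm(fragment_count, feature_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of feature per Million mapped fragments.

    Dividing by feature length and library size, then scaling by
    10^3 * 10^6 = 10^9, makes values relative to a hypothetical
    1,000 bp feature in a 1 million fragment library.
    """
    return fragment_count * 1e9 / (feature_length_bp * total_mapped_fragments)

# Hypothetical example: 500 fragments assigned to a 2,000 bp transcript
# in a library with 20 million mapped fragments.
print(fpkm(500, 2000, 20_000_000))  # 12.5
```

The same function works for any feature you care to quantify, as long as you know its length in base pairs and the total mapped fragments in the library.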
You can calculate this on your own from raw counts, but in Cufflinks this is done for you, so the main output you get from the software is an FPKM measurement. So how does Cufflinks work? This is probably the topic for a whole presentation; there's been more than one paper published on how Cufflinks works. It's quite a sophisticated piece of software, and you probably can't even make out all the details of this figure, so we're just going to think about it in really high level terms. You can read the paper; there's reasonably good documentation online and lots of discussion about the specifics of how it works. The general concept is that it assembles bundles of fragment alignments. Fragments that overlap are assembled together and connected in a graph, and it tries to draw paths through transcripts that are connected by overlapping fragments. From that you can infer, or attempt to infer, transcript isoforms using the concept of the minimum set of paths required to cover the graph. So it will have a graph like this and, using probabilistic methods, try to say: okay, this is probably a transcript, this is probably a transcript, and this is probably a transcript. Depending on the options you use with Cufflinks, if you have known isoforms available, you can use those known isoforms to guide it. It'll say: okay, I already know about these real transcripts; this part of the graph fits perfectly with this known model. But it can at the same time say: there's this other exon here that isn't explained by any of the known transcripts, and looking at the graph I predict that there's a fourth, novel transcript for this gene. Or you can run Cufflinks in a totally de novo mode, where it will try to infer all of the transcript structure directly from your data. The quality at which it does that will depend on the quality of the data you have and how good the algorithm is.
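Just to illustrate the graph idea, here's a toy sketch where exons are nodes, observed junction-spanning fragments are edges, and candidate isoforms are paths through the graph. The exon names and junctions are made up, and this is nothing like the real Cufflinks algorithm, which scores paths probabilistically rather than enumerating all of them:

```python
# Toy splice graph: each pair (a, b) means we observed fragments
# spanning a junction from exon a to exon b. (Hypothetical data.)
junctions = {("E1", "E2"), ("E1", "E3"), ("E2", "E4"), ("E3", "E4")}

def isoform_paths(node, goal, edges, prefix=()):
    """Enumerate all exon paths from `node` to `goal` through the edges."""
    prefix = prefix + (node,)
    if node == goal:
        yield prefix
        return
    for a, b in edges:
        if a == node:
            yield from isoform_paths(b, goal, edges, prefix)

for path in sorted(isoform_paths("E1", "E4", junctions)):
    print("candidate isoform:", "-".join(path))
```

Here the two paths E1-E2-E4 and E1-E3-E4 correspond to two candidate isoforms, one including exon E2 and one including exon E3; the real tool then has to decide which candidates are well supported.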
From experience it's far from perfect. It does a pretty good job, but sometimes it will definitely make mistakes. You want to keep in mind that when it's making these inferences about transcripts it's essentially just doing its best guess. The transcriptome is complicated and it's a really hard problem. You have cases where there are genes right next to each other that have really complicated, possibly overlapping exon structures. They might have a lot of different isoforms that share 95% of their exon content and just have these subtle little differences, which sometimes are basically impossible to deconvolute with short fragment data. We're starting from a very imperfect measure. Ideally you would pull down every transcript in its complete form and sequence it entirely. But that's not what we're doing at all, right? We're pulling down transcripts, breaking them into pieces, sequencing only little pieces of the pieces, and trying to reassemble reality from that. So mistakes happen basically all the time. You'll sometimes see cases where you look at the Cufflinks predictions against the alignments in IGV and see that it has mistakenly added an exon of one gene as an exon of the gene next to it. Of course the other problem is that biology is also messy, and sometimes that's real. There really are biological read-throughs of one transcript into another, and sometimes that has functional significance and sometimes it doesn't; sometimes it's just a mistake of transcription. So it's a very hard problem. No, yeah, it can do it without known annotations; that's the de novo mode, exactly. It will define your transcriptome for you as best it can from your data. And you could do that with, let's say, 100 samples; you could do it with as much of your data as possible to try to get a complete picture of all the different transcripts across your population of samples.
And then you can use that to save a definition of the transcriptome, kind of like your custom annotations, and go back and estimate expression for those definitions on a sample by sample basis. Yes, yeah. I don't know if there is a performance advantage there, maybe. If you have a good definition of transcripts and you provide them, it gives Cufflinks a lot of information to start with, right? It will probably do a better job of estimating expression for the genes and transcripts that you know about, but where it can't fit a piece into what's known it will create a new transcript, if it feels like there's one there that isn't explained by the known annotations. [Inaudible] option? I'm not sure, I'd have to double check. I don't think we run with that mode in this lecture. Where were we? The abundance of each isoform, right, is estimated from a probabilistic model. It looks at these definitions and assigns fragments to each definition, coming up with a likelihood that each fragment belongs to each inferred transcript, and then assigns an expression value to each transcript. It tries to make use of other information like the fragment length distribution: depending on the distribution of fragment lengths, there are different likelihoods that you would be able to measure certain kinds of events, so it tries to take that into account when it's estimating expression. You can read the paper, and there's a lot of math and statistics in it, but if you really want to understand the guts of it it's probably more than we can cover right now. And there's a lot of debate online about whether it's the best method and what other methods are better.
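As a rough intuition for that probabilistic fragment assignment, here's a heavily simplified expectation-maximization sketch with made-up data. This is not Cufflinks' actual model, which also incorporates fragment length distributions and bias corrections; it just shows how ambiguous fragments can be split among isoforms in proportion to the current abundance estimates:

```python
# Each fragment lists the isoforms it is compatible with.
fragments = [
    ["A"],        # unique to isoform A
    ["A"],
    ["B"],        # unique to isoform B
    ["A", "B"],   # ambiguous: compatible with both
    ["A", "B"],
]

theta = {"A": 0.5, "B": 0.5}  # initial relative abundances

for _ in range(200):
    # E-step: apportion each fragment among its compatible isoforms
    # in proportion to the current abundance estimates.
    soft = {"A": 0.0, "B": 0.0}
    for compat in fragments:
        denom = sum(theta[i] for i in compat)
        for i in compat:
            soft[i] += theta[i] / denom
    # M-step: new abundances are the normalized soft counts.
    total = sum(soft.values())
    theta = {i: c / total for i, c in soft.items()}

print(theta)  # converges to roughly {'A': 0.667, 'B': 0.333}
```

With two fragments unique to A, one unique to B, and two ambiguous, the fixed point assigns two thirds of the expression to isoform A, which matches the intuition that the ambiguous fragments should lean towards the more abundant isoform.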
There are actually some blogs that some of the authors of Cufflinks have, and other people have, and there's been an ongoing argument online about what you should do for expression and differential expression estimation, including some really inflammatory conversations and some name calling and stuff. So I think it's reasonably good, but it's not ground truth necessarily. We use combinations of methods because we don't totally believe in Cufflinks alone. So in your lab materials we're going to go through two parallel approaches for estimating gene and transcript expression levels. We're going to spend most of the time working on Cufflinks, because that's the main workflow of this lab, but you guys have been doing STAR alignments on the side. We're also going to do optional HTSeq counting instead of using the TopHat-Cufflinks approach, and then there's an example of using edgeR instead of Cuffdiff. So how does Cuffdiff work? Basically Cuffdiff takes the outputs of Cufflinks and compares two sets of Cufflinks outputs to come up with a differential expression estimate. It takes into account the variability in fragment count for each gene across replicates. If you have replicates, which I highly recommend, each replicate will have its own fragment count for each transcript that's been defined, and the variability in those counts will be modeled. The fragment count for each isoform is also estimated in each replicate, so it basically works at the gene level and at the transcript level, and at the transcript level there's also a measure of uncertainty in the estimate arising from ambiguously mapped reads. Remember I said there's a lot of complexity in the transcriptome, and sometimes two transcripts can be very similar, right? So you could have a...
This is a simple case, but if you have two isoforms of this gene, one that includes this middle exon and one that doesn't, all the reads that map here are ambiguous, right? These reads could be counted towards this isoform or towards that isoform, whereas reads here, or reads that span this junction, give you unique information about the top isoform, and reads that span this junction give you unique information about the bottom transcript, right? So when you're estimating the expression of any particular isoform, the amount of non-unique versus unique content matters, and the less unique content there is, the more uncertainty there is in your estimate of the expression for that isoform. Cufflinks tries to take this into account and assign a measure of variability or accuracy to each isoform, depending on the quality of the information it has about that isoform. Those variance estimates are then used in statistical testing to report significantly differentially expressed genes and transcripts. They use modifications of the typical kinds of statistics you would use to calculate differential expression. We're also going to do a step in between Cufflinks and Cuffdiff, which is Cuffmerge. So why is Cuffmerge necessary? Basically you need Cuffmerge if you have more than one Cufflinks assembly. Remember that even if you're not using the de novo mode, even if you're using the so-called reference guided mode where you have some concept of the known transcripts, Cufflinks is still, to some degree, going to build transcripts of its own. It'll give you estimates for the known transcripts, but then it's going to say: there's this other transcript that I think is here, I'm going to call it X001, and it's a new custom transcript defined just in this Cufflinks run.
When you run two different samples, that's going to happen both times, and because they contain different data they're going to come up with different creations: there are going to be different, potentially novel isoforms defined. Now when you go to calculate differential expression it's not an apples to apples comparison, right? You've got some unique isoform defined in sample A and some other unique isoform defined in sample B; how do you say what the differential expression is between those? So Cuffmerge basically takes these two assemblies, overlaps them, and tries to come up with one consistent set of definitions. Even though this isoform was only observed, or predicted, in sample A, it's now going to be looked for in sample B as well, and vice versa. So it's going to create a kind of union of all the different transcript definitions. Cuffmerge is basically a necessary step to compare more than one Cufflinks assembly. It also filters out some artifacts, and then optionally you can provide a reference GTF again to merge the novel isoforms and known isoforms. The ultimate goal of Cuffmerge is basically to make a new GTF file. You have your GTF file for your known transcripts, and you're going to create a new GTF file that is a kind of superset of the known and novel predicted isoforms from Cufflinks, so that you can run Cuffdiff and compare apples to apples, like I said. The last thing that we're going to do with this data is run CummeRbund. CummeRbund is basically a data visualization tool. After you've got your alignments, run Cufflinks, merged your Cufflinks assemblies together, and calculated differentially expressed genes and isoforms, you want to visualize the data: get a sense of the overall quality, how many differentially expressed genes there are, what they look like. You might want to look at things like MA plots or volcano plots.
You might want to look at cluster diagrams, or do some principal components analysis, or typical heat maps, all those sorts of things. CummeRbund is a really quick way to get all these kinds of visualizations basically automatically, by just pointing it at your Cuffdiff result. Here I've just shown a sampling of a few of the graphics that you get. We're going to produce all of these in the lab and go through them, but you get things like your typical heat map showing, say, normal versus tumor for a bunch of different genes; this is the kind of thing you often see in a paper. It can actually produce some fairly sophisticated graphics. This one, for example, would take you a long time to figure out how to produce in R: getting an ideogram, getting a representation of the transcripts that were predicted by Cufflinks, comparing those to, say, known isoforms from Ensembl, and then adding in other tracks like conservation tracks and so on. CummeRbund basically provides examples and code to do all that for you without you having to figure out how to do it. Of course, as soon as you want to change or modify something, you have to understand what they're doing, but they provide a lot of functions and tools that make your life easier in R. So, there are alternatives to FPKM. The biggest or most obvious alternative is just raw read counts. Instead of calculating FPKM, you assign reads or fragments to some defined set of genes or transcripts and determine raw counts. This is usually done against known definitions of genes and transcripts, although you could use Cufflinks to assemble a transcriptome, assign reads to that new GTF, and pull counts out of that. We're going to use HTSeq-count, which is just one of the various methods you can use to assign counts to genes or transcripts or exons. I'm just showing an example of how you run it; it's reasonably straightforward. We're going to do this in the lab.
I've put a link here to a SEQanswers forum post about an important caveat of transcript-level analysis with HTSeq-count. The author of the software himself doesn't really think that the transcript level analysis is that reliable, and it's for the reasons we talked about: for most fragments you just can't know which transcript they really should be assigned to. There's only a small number of unique ones, and because of that the estimates you get for transcripts are less accurate. So usually when we use count-based approaches for differential expression we do it at the gene level. Yeah, so that's what FPKM is trying to do, account for the size of the library. With a raw count-based approach you of course still have to account for the size of the library, but it's done inside the differential expression statistic. You feed in just the raw counts, and as part of the differential expression statistic it models the expected distribution of fragments given the total number of fragments, and incorporates that directly into the statistic. So you can't do that much with the raw counts except calculate your differential expression p-values. They don't mean for you to really think about the individual count numbers; they don't want you to compare one gene to another, or the counts for one gene in one sample to another sample. They're just input into the statistics for calculating differential expression. If you want to start making the heat maps we were talking about, where you're just going to directly visualize and think about this gene versus that gene, or this sample versus that sample, you need something that's normalized, like FPKM, so that you're taking into account gene size and library size.
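To see why raw counts aren't directly comparable across samples, here's a tiny sketch with made-up counts showing a counts-per-million (CPM) normalization. Note that CPM only corrects for library size; unlike FPKM it does not correct for gene length, so it's still not suitable for comparing one gene to another:

```python
# Made-up counts: the tumor library was sequenced 3x deeper, so its raw
# counts are ~3x higher even though relative expression is the same.
samples = {
    "tumor":  {"counts": {"GENE1": 1200, "GENE2": 300}, "library_size": 30_000_000},
    "normal": {"counts": {"GENE1":  400, "GENE2": 100}, "library_size": 10_000_000},
}

for name, s in samples.items():
    for gene, count in s["counts"].items():
        cpm = count * 1e6 / s["library_size"]
        print(f"{name:6s} {gene}: raw={count:5d}  CPM={cpm:.1f}")
```

Comparing raw counts would suggest a threefold difference between tumor and normal for both genes, but after dividing by library size both samples give the same normalized values; this is essentially the correction that count-based differential expression packages perform internally.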
So if you've done some kind of capture or selection that's skewing your fragments towards some specific sub-population of the transcriptome, obviously you want to make sure that you keep conditions as similar as possible between your biological comparators, but then yeah, I think you might want to use something like an edgeR or HTSeq-count approach. Yeah, so I'm not that familiar; I know that Cuffdiff has slowly improved with regard to more complex experimental designs like time series or multi-way comparisons. Actually, I think on the next slide one of the advantages, or reasons to use the count based approaches, is that the differential expression packages for the count based methods tend to have a lot more options for setting up your experimental design. They'll have options for time series or multi-way comparisons; you just need to set up the differential expression function correctly. It's set up in much the same way a regular statistical package would be, like an ANOVA or something like that, where you just need to define the test correctly. For a time series type of thing I would probably use raw counts, because the packages that take raw counts have better options for that. But it's not to say you couldn't take FPKM values; you might just need to pull them into R and define your own statistical method, since Cuffdiff might not do what you want for a time series. I've never checked into that, though. I mean, if you Google a Cuffdiff time series experiment, someone might say, oh yeah, there's this option in Cuffdiff that I haven't heard of; I don't think there is, but for sure those options exist in some of the other count based packages like edgeR or DESeq. So I tend to use the FPKM approach when I want to leverage the benefits of the Tuxedo
suite, which is that it's reasonably easy to install and run and has a lot of flexibility in terms of whether your transcripts are known or whether you want to build novel transcripts. FPKM is good for visualization: if you want to make a heat map, it tends to work out better using FPKM than trying to make sense out of raw counts. We usually use FPKM when we're calculating fold changes, again because it's more of an apples to apples comparison; you can't calculate a fold change very easily directly from raw counts, because you haven't taken into account things like the library size or the size of the gene. But with the count based approaches, many people have argued that the statistical methods are more robust, so the identification of which genes are differentially expressed might be more accurate from the count based approach, at least at the gene level. It also accommodates more sophisticated experimental designs: things like time series or block designs, or different kinds of experiments that have more samples or more types of samples. So we're going to look at edgeR if we have time; if not, you've been provided with the optional exercises and code and so forth to try out edgeR. If you search on Bioconductor or SEQanswers you'll find all kinds of packages for doing differential expression from RNA-seq data. There was a paper recently that actually looked at the differences between these approaches. They compared Cuffdiff (the latest version at that time), edgeR, and DESeq, just as three examples, and what they're showing here is the overlap in the genes that were predicted to be differentially expressed for some sample comparison. You can see there certainly is overlap, but there's also a lot of disagreement: there were 1600 genes found to be significantly differentially expressed by DESeq alone, and smaller numbers for Cuffdiff and edgeR alone. So depending on what you're trying to do, if you're trying to be really comprehensive, like you don't want to miss anything at
another level, let's say with qPCR or some other experiments, then maybe you want to take the union: everything that was predicted by any one method. If it's really important that you feel confident in the differential expression, you might want to just look at the intersection: things that were predicted by more than one method, or all three methods. We typically run both edgeR and Cuffdiff; maybe we should think about DESeq, because we could be missing a lot. I think it's really important to remember the lessons we've learned from the microarray days. Maybe a lot of you are too young to remember some of those days, but with RNA-seq we haven't really gotten to the point where we have tons and tons of samples; we're still doing RNA-seq on ten samples, or maybe a hundred samples, because it's pretty expensive. We're slowly getting to the point where it's similar to the microarray days, where you might compare 500 normals to 500 tumors or something like that. But most of the issues with statistics are the same, and we've kind of forgotten those in some ways, because everyone's very excited about trying out RNA-seq. Then they're faced with the reality of the cost, so they'll do things like compare literally one RNA-seq library to another, calculate differential expression, and then make conclusions about the genes that were differentially expressed in that one versus one comparison. And they justify it because at the time it cost them $20,000, but with no biological replicates the conclusions you're drawing from that are extremely shaky. So we should remember that the need for biological replicates is just the same as it ever was. And study design matters: I think people forget about study design, again caught up in the excitement of doing RNA-seq, so they won't think carefully enough about what they're really comparing in terms of a biological question. [Question, partly inaudible, about whether it can make sense to start with one biological replicate for a simple question.] I mean, it makes sense on one level; I think that's an extreme example though, the one versus one. You might say let's do at least three, or ten. It all depends on what you're trying to do. If it's a fishing expedition and you're trying to generate hypotheses that you're going to test at another level, then maybe it's okay, but you have to interpret it in that context. You can do it, but I would just be cautious, because without enough biological replicates you don't have an idea about the variability from sample to sample. Maybe you have five genes you're interested in, and you might very early on conclude that, oh, this gene's not interesting, it wasn't differentially expressed between these two samples, and move on. And that could be the most interesting biological effect happening in that gene, which you've missed because you just looked at one sample. The same thing could happen in the opposite direction: it could just be random fluctuation or noise that caused one of the genes in your list of interest to be much higher in one sample versus the other, and you spend all this time in the lab doing qPCRs, chasing this gene down and making a story about it, when if you'd looked at three samples, or ten samples versus ten samples, you'd see that there's a lot of variability in that gene, and actually, averaging across a series of biological replicates, it's not significant at all. So yeah, you just have to think about it carefully, and that's the only point I'm trying to make: lessons learned. Sure, I mean, there are like a hundred flavors of biological replicates, right? So yeah, that could be a biological replicate. Each kind of biological replicate is different and has different caveats that you need to think about. But if that's the question you're asking, whether mutations in this gene or mutations in this pathway result in changes in expression that are important, you're certainly better off to look
at ten or a hundred patients with mutations that you believe are acting in a similar way, compared to looking at just one or two. But yeah, I would consider those biological replicates. Yes, I would say it's better. Generally what we're finding with RNA-seq now, with the maturity of the technology, is that technical replicates have insanely good correlation, like they're almost a waste of time at this point. Doing things like comparing the estimates you get from one lane versus another, or taking your sample, dividing it in two, isolating RNA, and sequencing them independently: there's very little variability between those. I mean, it's not a horrible idea to do, but if I was going to spend my money I would much rather spend it on biological replicates, even imperfect biological replicates, than on technical replicates. Really all you're doing is getting more depth of the same thing, because the data is so consistent on the technical side. There are also about ten different kinds of technical replicates, though. Depending on what your experiment is, you might consider a technical replicate at various levels on the wet lab side as well, right? And those can have more sources of variability, like, you know, plates from the same colony but pulled from the incubator on different days. There's a kind of gray zone between biological replicates and technical replicates, especially in in vitro experiments. But on the sequencing side, the data production side, technical replicates tend to be extremely reproducible, so much so that they don't really have much point. When we do whole genome sequencing or RNA sequencing of a sample and we feel like we didn't get enough depth, we'll go back and just do another lane of that same library and add it to the same BAM file, basically as if it was just more of the same. We don't stop and ask, are these really the same, because those experiments have been done to death, and it's quite safe to combine those kinds of technical
replicates and treat them as just more data. Okay, any more questions? Yeah, multiple testing correction. So, has everyone heard of, or is everyone familiar with, multiple testing correction? Is anyone not familiar with it? No? So the basic concept of multiple testing correction is that when you do a statistical test, when you compare a set of samples for one gene, there's a certain chance that you're going to see a significant difference between those two sets of conditions by accident, just because of noise. There's variability in your data, just like we've talked about, and when you compare group A to group B, sometimes you'll see a gene that looks significantly differentially expressed, and that doesn't get borne out by repeating the experiment, for example, but it had a significant p-value. There's just a certain risk of that happening, and maybe it's a small percentage, but as you do more and more tests, the chance that some of your significant results are those kinds of false positives increases; basically, a certain fraction of your significant tests are bound to be not real. That's the problem with multiple testing: the more you test, the more chances there are of finding something spurious, basically. The danger is that when you do these genome-wide studies you're looking at tens of thousands or hundreds of thousands of features, and if you don't correct for the possibility that, with this many tests, some of them are going to be fake or spurious results, then you could again spend a lot of time chasing down phantom observations. This has been a very well established field since forever; in the microarray days you would always make sure to run some kind of multiple testing correction on your p-values. Yeah, I mean more at the level of the features. So let's say you have 10 tumors and 10 normals and you're comparing those replicates; each test is at the gene level. So I say: for gene A, look at the expression level across these 10 normals and these 10 tumors, and ask if it's higher in
the tumor than the normal according to some statistic. That's one test, and doing that again for every gene in the genome, those are the multiple tests I'm talking about. The problem with RNA-seq is that we've taken that problem and compounded it by several orders of magnitude, right? Because you've now got not just 10,000 genes or transcripts, but hundreds of thousands of exons, hundreds of thousands or millions of possible junctions at exon-intron boundaries that you might be interested in, small RNAs; there's just a huge variety of things you might possibly want to test for. And you can just keep doing more and more tests for all these things until you find all kinds of significant results, but unless you correct for the number of tests you're doing, many if not most of those could be false positives. So the RNA-seq field didn't need to do any work to develop these correction methods; they're already very well established, and one of the very popular packages is the Bioconductor multtest package. You can take the p-values from the many statistical tests you performed and correct them for multiple testing with a package like that. I think Cuffdiff actually has this built in, so when you run Cuffdiff you're going to get both raw p-values and corrected values; they usually call them q-values when the p-values have been corrected. So you want to be aware of that when you're interpreting your differential expression data. And then the downstream interpretation of expression analysis, that's a whole topic for probably another course, but I think Malachi is going to go into some aspects of downstream analysis from the perspective of alternative splicing. You can take the expression estimates from Cuffdiff or alternatives, and they can be fed into many different analysis pipelines. We also provide a supplementary R tutorial, if you want to practice your new R skills, that's a little bit more accessible than the CummeRbund stuff. So CummeRbund, they've
basically added a lot of layers of abstraction to R and created a lot of complex functions that allow you to very quickly say, run this function and give me this pretty graph. But if you want to see how to build that up yourself from scratch, using custom plotting methods and things like that, we provide a supplementary R tutorial for that, and it goes into examples like how to make the typical clustering heat map you see in a lot of papers, or how to do classification analysis (I don't think that's covered, but that's another topic). Things like pathway analysis, of course: your differential expression results from Cuffdiff can be fed into those kinds of packages or websites or software, just like results from any other kind of platform. So the tutorial, I guess we could start now or after a break; when is the first break scheduled? Okay, so we have lots of time. We're looking now at this part here, right? We've got our raw sequence data, we've run Bowtie and TopHat, so we now have our alignments. We're going to do transcript assembly now with Cufflinks, along with expression estimation; we're going to use Cuffmerge to merge together the different Cufflinks assemblies from our different samples; then we're going to run Cuffdiff to do some comparisons of samples; and finally we'll use CummeRbund to visualize the results of those comparisons. Okay, so just a few slides to review what we're going to do in the exercises. We're going to generate gene/transcript expression estimates with Cufflinks; we're going to perform a differential expression analysis using first Cuffmerge and then Cuffdiff, like was described; and then we're going to summarize and visualize those results with CummeRbund. Then, optionally, you can play around with the old school R methods. So, just to explain all the files that are available on the wiki: you have this tutorial module 3 part 1, which is the Linux commands that are going to run Cufflinks and Cuffmerge and Cuffdiff, and that's what we're going to go through next. When that's done,
we're going to use this CummeRbund R script (it should say .R) to summarize and visualize the results of part 1. And then, kind of optionally, you have this part 3 supplementary R and edgeR analysis, which we probably won't have time for, but you can take it back with you, and if you want to do the htseq-count/edgeR approach for differential expression, you have an example of what that would look like. So, to generate expression estimates, we're going to start from the alignment BAM files that we generated yesterday, and those are going to be used in Cufflinks to calculate expression estimates; we're going to do this for all transcripts on the target chromosome, chromosome 22. There's an option in this step called -G, which forces Cufflinks to calculate expression values for known transcripts only; to discover novel transcripts with Cufflinks we would not use the -G option (I think that's actually now covered in Malachi's next module, and should have been pulled out of this lecture). But basically this step of generating expression estimates with Cufflinks is going to give us FPKM values, where each fragment corresponds to a read pair aligned to the genome. There are going to be some optional commands to use additional alignment files: we did both TopHat and STAR alignments yesterday, so basically everything we're going to do today could be done with either the TopHat alignments or the STAR alignments, and it should work more or less exactly the same. I think in the tutorial we're just going to go through with the TopHat alignments, but if you wanted to satisfy yourself that it works just as well with the STAR alignments, you could run the commands with those. The other alternative that we're going to explore today is the count-based method, so we're going to use htseq-count. Remember, this requires a slightly differently sorted SAM file; I don't remember if we did that yesterday or if we're going to do that today. And in the end, if you do all of the different options, you would have estimates from
TopHat alignments with Cufflinks, or TopHat alignments with htseq-count, and you could also do STAR alignments with htseq-count if you wanted. This is one of the challenges when you start doing alternate methods: you have so many paths you can go down. Once we have the Cufflinks results, we're going to use Cuffmerge and Cuffdiff. That's going to combine the expression estimates from our four libraries into more convenient files, combining expression estimates across replicates, and then we're going to compare the tumor versus normal samples and identify significantly differentially expressed genes and transcripts. These commands can get complicated when you have replicates, so we have to be careful how we set them up. They've been set up for you, but when you're doing this on your own, the ordering and positioning of the samples that you give to the Cuffdiff command really matters, because that's how the statistical tests are going to be defined. We've pretty much talked about this already: once we have the Cuffdiff results, we're going to visualize them in CummeRbund, and we're hopefully going to see nice figures like this. Then we're going to do some post-processing of files, which is necessary if we want to do the old school R analysis. The output of Cuffdiff is perfectly designed to go into CummeRbund, because it's part of the same software package, but if you want to do other analysis, just your own custom analysis in R, you kind of have to munge the files a little bit into a format that's more useful for that purpose. Yeah, so this is the optional supplementary R script. It does things like examine expression estimates, look at how reproducible the technical replicates are, and look at how the different library methods correlate. If you go through those exercises you can answer some of these kinds of questions, and if you do all of the optional things you can also compare the differential expression results from the Cufflinks analysis to the htseq-count analysis,
and you'll see that there's both overlap and substantial non-overlap between the significantly differentially expressed genes from the two methods.
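The Cufflinks, Cuffmerge, and Cuffdiff steps described in this section look roughly like the following on the command line. These are illustrative commands only: the file names and paths are placeholders, not the tutorial's actual files, and you should consult the tutorial wiki for the exact invocations.

```shell
# Expression estimation against known chr22 transcripts (-G = known only;
# drop -G to let Cufflinks discover novel transcripts). Run once per BAM.
cufflinks -p 8 -G chr22_genes.gtf -o tumor_rep1/ tumor_rep1.bam

# Merge the per-sample Cufflinks assemblies listed in assembly_list.txt.
cuffmerge -g chr22_genes.gtf -s chr22.fa assembly_list.txt

# Differential expression. The ordering and grouping of the BAMs defines
# the statistical test: replicates of one condition are comma-separated,
# and conditions are space-separated, matching the -L labels in order.
cuffdiff -o cuffdiff_out/ -L Tumor,Normal merged.gtf \
    tumor_rep1.bam,tumor_rep2.bam normal_rep1.bam,normal_rep2.bam
```

The last point is the one emphasized in the lecture: swapping or mis-grouping the BAM arguments silently changes which comparison Cuffdiff performs.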
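To make the multiple testing correction discussed above concrete, here is a small sketch of the Benjamini-Hochberg false discovery rate procedure, which is the kind of adjustment that turns raw p-values into the q-values Cuffdiff reports (this is an illustrative reimplementation, not Cuffdiff's or multtest's actual code):

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values) for a list of raw p-values."""
    n = len(pvalues)
    # Indices of p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    prev = 1.0
    # Walk from the largest p-value down, scaling by n/rank and
    # enforcing that adjusted values never increase as p decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        q = min(prev, pvalues[i] * n / rank)
        adjusted[i] = q
        prev = q
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
qvals = bh_adjust(raw)
```

With ten tests, a raw p-value of 0.039 is adjusted up to 0.084: nominally "significant" results near 0.05 often stop being significant once you account for how many tests were run, which is exactly the phantom-observation problem described above.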
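The FPKM values mentioned above come from a simple normalization idea: divide the fragment count by transcript length (in kilobases) and library depth (in millions of mapped fragments). Cufflinks' actual estimates are more sophisticated (it probabilistically assigns fragments that are ambiguous between isoforms), but the core formula is just this sketch, with made-up example numbers:

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    kilobases = transcript_length_bp / 1000.0
    millions = total_mapped_fragments / 1_000_000.0
    return fragment_count / (kilobases * millions)

# e.g. 500 fragments on a 2 kb transcript, in a library of 25 million
# mapped fragments: 500 / (2 * 25) = 10 FPKM
value = fpkm(500, 2000, 25_000_000)
```

This is also why raw counts alone are misleading for comparing genes: as the lecture notes, a bigger gene accumulates more fragments at the same expression level, and the per-kilobase term is what corrects for that.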
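Comparing the significant gene lists from the two methods, as suggested above, is just a set operation. A minimal sketch with hypothetical gene lists (these are stand-in gene symbols, not results from the tutorial data):

```python
# Hypothetical significant-gene calls from the two pipelines.
cuffdiff_sig = {"ERBB2", "TP53", "MYC", "EGFR"}
edger_sig = {"TP53", "MYC", "KRAS"}

shared = cuffdiff_sig & edger_sig          # called significant by both
cuffdiff_only = cuffdiff_sig - edger_sig   # Cuffdiff-specific calls
edger_only = edger_sig - cuffdiff_sig      # htseq-count/edgeR-specific calls
```

In practice you would read the gene lists out of the Cuffdiff output and the edgeR results table; the overlap and the two method-specific sets are what the supplementary exercise asks you to examine.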