How are you guys doing? Brains recovered? So we have our splash screens to remind you about the Creative Commons license. This is Module 3, where we're going to talk about expression and differential expression. This is one where we're really going to learn more by doing, so this is not going to be a very long lecture; we'll get into the lab relatively quickly. This is the third of four lectures, so you're halfway there. The main purpose of this lecture is to introduce you to the concept of expression estimation for known genes and transcripts. We're going to talk about two of the common metrics you hear discussed for expression estimation of genes and transcripts: FPKM-style measurements and the raw-counts approach. In the lecture, or sorry, the lab materials, you're actually going to run two parallel workflows, one count-based and one FPKM-based. We'll talk briefly about differential expression methods, and then we'll discuss downstream interpretation of expression and differential expression estimates.

You've actually already seen a view like this yesterday, from your initial alignments of your merged BAMs from condition one versus condition two. From that, you could probably already start to get a sense of how you would identify differential expression between genes in one condition versus another, at least in a manual way: you could look and ask, basically, are there more reads aligning in one condition than in the other? Here's an example showing two sets of alignments, with a couple of things highlighted. The coverage tracks are on the same scale, so you could very crudely infer that this sample is down-regulated, or down-expressed, relative to the top sample. That would possibly be a slightly dangerous oversimplification, but it gives you the general idea. You can also notice in the more highly expressed sample that there's perhaps evidence of 3' end bias: the coverage is high at the 3' end and then tails away to almost nothing towards the 5' end of the gene. The differential expression methods we're going to talk about today attempt to formalize what we're doing with our eyes and our brains when we compare the amount of coverage at a gene locus for one sample, or one set of samples, versus another.

Probably the most basic, and I think the first, way of estimating, or normalizing so to speak, gene expression from RNA-seq is the RPKM measurement, now more commonly referred to as FPKM. RPKM/FPKM stands for reads (or fragments) per kilobase of transcript per million mapped reads. The hope, or theory, is that in RNA-seq the relative expression of a transcript is proportional to the number of fragments that originate from its locus. However, the number of fragments you expect is biased towards larger genes: bigger RNA transcripts, larger sequences, fragment into more pieces and have a higher chance of being sequenced simply because there are more pieces, even if you imagine just one copy of a transcript from a large gene versus a small gene. I think you can see how you would get more reads from a large gene than a small gene, because it contributes more small fragments to the pool. Similarly, the total amount of sequence you have matters.
So we've been talking about how much sequencing you should do: should I do one lane, should I multiplex three samples per lane, five samples, ten samples? When you make those initial design decisions, you're dramatically altering the total number of reads that will be available for a particular library or sample being sequenced. So you have to be careful if you're crudely comparing coverage like we were on the last slide. If the top sample was sequenced three times more deeply, we might expect to see increased coverage of that sample purely because of the additional reads available to it, and it might have nothing to do with biology or increased expression of that particular gene. We need to account for that, and FPKM, or RPKM, attempts to do so. It started out being named reads per kilobase back when there were only single-end reads, but when we switched to paired-end reads we started thinking in terms of sequencing a fragment observed via its two paired reads. So we're counting fragments, not individual reads; that's why we mostly say FPKM now.

The formula for FPKM is quite simple. You take the number of mapped fragments for your feature of interest, say a gene, divide by the total number of mapped fragments in the library and by the size of the feature (the gene length), and multiply by a scaling factor of 10^9, which is 1 million (for the library size) times 1,000 bases (for the feature length):

FPKM = (fragments mapped to the feature x 10^9) / (total mapped fragments in the library x feature length in bases)

What you end up with is fragments per million mapped fragments per kilobase of feature length. You can calculate this yourself from the raw reads, knowing the lengths of your features and the total depths of your libraries (see the small sketch below), and that is still somewhat commonly done under certain circumstances. But most people use a software package like Cufflinks to do it for them.
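To make that concrete, here's a minimal sketch in R of the naive FPKM calculation just described; the counts, library size, and transcript length are made-up numbers for illustration.

# Naive FPKM: normalize a feature's fragment count by library size
# (per million mapped fragments) and feature length (per kilobase).
fpkm <- function(frag_count, total_mapped_frags, feature_length_bp) {
  frag_count * 1e9 / (total_mapped_frags * feature_length_bp)
}

# Hypothetical gene: 500 fragments on a 2,000 bp transcript, in a
# library with 25 million mapped fragments.
fpkm(500, 25e6, 2000)   # = 10 FPKM

This is exactly the simple equation above; Cufflinks, as discussed next, replaces the raw fragment count with a model-based estimate.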
Cufflinks does more than just calculate an FPKM value in what it reports. Question: is the FPKM measurement done per sample, per library? You don't pool them all together? Yeah, because one of the things you're trying to account for is differences in total depth between samples. Question: so we're not using the merged BAMs? No, we merged those just for convenience of display, so that when we went into IGV we would see more data. Question: so you want to feed in each individual sample, even the replicates? Yes, in this case we're going to run Cufflinks on each individual sample, because we want that per-sample normalization, and because part of the way Cuffdiff works, which we'll talk about in a minute, is that it uses variability from replicate to replicate in its statistics to call differential expression. So we have to start with individual replicates.

Question: what do you mean by technical replicate? There are a lot of different levels of technical replicate. The simplest would be that everything is exactly the same but you run another lane of data, for example; you could think of that as a technical replicate. That's a good question. In general, those kinds of technical replicates are so highly reproducible, with so little variability between them, that you can probably pool them. That's commonly done: say you run one lane, then decide you didn't get enough total depth and run another lane; people will often pool those together. But in that case it's the same library. Exactly. But suppose it's a different library construction: you made one library, sequenced it, and weren't totally satisfied, maybe you saw a high duplicate rate or something else that concerned you, and then you made a new library from the same sample and resequenced it. It's not a biological replicate, it's still the same biological sample, but it's a more distinct technical replicate. Then it gets trickier whether you merge that before running this analysis or keep it as a kind of replicate. That's a hard question to answer, but you're almost doing two experiments at that point: you're comparing two biological conditions, and you're now also potentially comparing two processing methods. So you probably want to keep them separate and look for differences between the processing methods. If you can convince yourself that, despite having redone everything with new libraries, the data are highly consistent with very little variability, maybe you can convince yourself it's safe to merge before proceeding with the biological comparison.

We will definitely always try to sort out these kinds of experimental details, like which library prep method to use, decide on something, and then produce the data for the biological experiment consistently, and have that be the basis for your comparisons. Otherwise, even if the prep methods are fairly similar, there will be definite differences between them, and if you merge them so that some of this mixing of methods happens in your normal samples but not your tumors, you may think you're seeing tumor-normal differences when you're really seeing differences between preprocessing methods. You're introducing the very real possibility of batch effects. There are statistical methods to try to correct for batch effects, but you need to think about it. Unfortunately we don't cover this in this course, but there are certainly ways of detecting batch effects and ways of attempting to correct for them. There's a software package in R called ComBat that is, I would say, the one I hear about people using most, the most popular, for controlling for batch effects. Question: would you detect them with something like clustering? Yeah, exactly: a PCA plot or some kind of clustering analysis will usually pull it up.

There was a recent controversy involving RNA-seq data where a group did expression profiling of different tissues from mouse and human and compared them, and the authors claimed to show there were more differences between species than between tissues. That was contrary to what had been reported previously: earlier work had shown that, say, mouse liver and human liver cluster more closely together than different tissues within a species; tissue mattered more than species, basically, and they found the reverse. But when other people saw this published and started digging into the methods, they found that the libraries for the mouse tissues and the human tissues had been constructed in different labs, at different times, with different protocols, and after some detective work they were fairly convinced it was all batch effects, basically. So it's definitely something you want to think about.
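Since ComBat came up, here's a minimal sketch of how it's typically called from the Bioconductor sva package; the expression matrix, batch labels, and condition labels are all hypothetical placeholders.

library(sva)

# Toy genes x samples expression matrix (log scale in real use).
set.seed(1)
expr <- matrix(rnorm(100 * 6), nrow = 100)

# Hypothetical batch and condition labels for the six samples.
batch <- c("labA", "labA", "labA", "labB", "labB", "labB")
condition <- factor(c("normal", "tumor", "normal", "tumor", "normal", "tumor"))

# Protect the biological signal with a model matrix, then adjust for batch.
mod <- model.matrix(~ condition)
expr_corrected <- ComBat(dat = expr, batch = batch, mod = mod)

Note this only works when batch and condition aren't completely confounded, which is exactly the problem in the mouse/human story above.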
Okay, so Cufflinks. Cufflinks is the method we're going to use for expression estimation. It's one of the most popular methods, probably the most popular. The way it works under the hood is quite complicated, but from a user's perspective it's fairly easy to use. They're one of the groups that have done a good job of documenting all the options, how it should be run, and the different workflows, and of actually maintaining the software, so that when you download it, it actually installs and runs. That is not nothing. There are a lot of RNA-seq analysis methods out there that might convince you, through their statistical methods and algorithmic design, that they're superior, but then you go to use them and you can't get them to work; that's like 90% of them, almost. So there's a lot to be said for someone actually standing behind their software and making sure it works. That's one of the reasons we use Cufflinks.

It does produce this convenient FPKM output, but it's much more sophisticated than the simple equation on the last slide. It does things Malachi talked about, like trying to take GC content into account. The fragment size distribution of your library is an input to the method, and it uses that when trying to infer whether a particular read comes from one isoform versus another. The other sophisticated thing it does is assembly. The first thing it tries to do, basically, is infer the transcriptome present in your sample: you can give it a known transcriptome as a kind of guidance, but in almost every mode it still attempts to infer the transcriptome itself. So if there's a unique or novel exon-exon junction resulting from some aberrant splicing pattern in your data, Cufflinks can detect it, add it to its own model of the transcriptome, and then correctly assign reads to that novel junction, and potentially a novel isoform, rather than forcing them into some existing isoform. If that all works as advertised, it has a lot of advantages.

The first thing it does is take all the reads across the genome, because remember we aligned with a spliced aligner but still to the whole genome, and create bundles of fragments. It basically says, it looks like all these alignments belong together. What it's really trying to do is spot gene loci: rather than doing a transcriptome-wide assembly all at once, it first focuses on one gene locus and says, all of these million reads are probably coming from this locus, from different isoforms, but from this one gene. Then, within that bundle of fragments, it tries to determine, or assemble, the different isoforms. That's what's depicted here. What you'll find for most gene loci, at least in human, which is maybe one of the more challenging genomes for this, is that the different transcript isoforms, and almost all genes have multiple transcript isoforms, we've come to learn, share a lot of content. A gene might have 10 or 20 exons, and two isoforms may be basically identical, using all the same exons in the same splicing pattern, except that one exon is skipped. So it's actually quite challenging to tell from RNA-seq data which reads belong to which isoform, because most of the exon content is completely overlapping and could belong equally to any of the transcripts that include those exons.
So Cufflinks tries to basically spot the differences. It looks for reads that are spliced in such a way that they indicate mutually incompatible isoforms: where there's a splice junction across what is otherwise an exon, so that this read couldn't possibly belong to the same isoform that this other read belongs to. It uses these mutually incompatible fragments to draw different paths through a graph. It basically produces a connected overlap graph, where, for example, these reads probably come from a common exon shared by all the isoforms, but then they diverge at some point, and these fragments, these alignments, aren't compatible with each other. It then tries to identify the minimal set of paths that would explain the graph and produce a model of the transcripts at that locus. In this case it infers three transcripts: one that is basically one big exon, one with splicing between two shorter exons that overlap the first, and another with three exons.

Once it has inferred these transcripts from the graph, it goes back and tries to say, okay, which reads belong to which isoform? Some of those are already decided, because they were part of the information that created the transcript model in the first place: this yellow fragment comes from the yellow transcript, the blue fragment comes from the blue transcript, and so on. But all the other reads could belong to one or more of these putative isoforms with a certain probability. All these gray or black fragments could equally belong to any of the three isoforms; some of these red fragments could only belong to the red isoform; and this pink one could belong to the blue or the red, but not the yellow.
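To illustrate just the compatibility logic (this is a toy sketch, not Cufflinks' actual implementation), you can think of each isoform as the set of splice junctions it uses, and a spliced read as compatible with an isoform only if every junction the read spans belongs to that isoform. All names and coordinates here are invented.

# Each isoform is represented by its splice junctions ("donor-acceptor").
isoforms <- list(
  blue   = c("100-200", "300-400"),
  red    = c("100-250", "300-400"),
  yellow = character(0)   # the single-exon transcript: no junctions
)

# A spliced read that spans the 300-400 junction.
read_junctions <- c("300-400")

# A read is compatible with an isoform only if every junction it spans
# is used by that isoform.
compatible <- sapply(isoforms, function(j) all(read_junctions %in% j))
names(isoforms)[compatible]   # "blue" "red" -- but not "yellow"

An unspliced read (no junctions) would come back compatible with all three under this rule, like the gray fragments in the figure.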
You can see how that logic works. From that, Cufflinks tries to assign a probability that each fragment belongs to each isoform and come up with an estimate of the likely expression of each of the isoforms given all of the data. The model for that is, as we said, quite sophisticated, and it tries to take into account information like GC content and the fragment length distribution. Question: does the uncertainty in assigning reads get accounted for? It does, and that actually comes in when it starts to do the differential expression estimation. Some isoforms, as you point out, are easier to map fragments to: they basically have more unique content, maybe a lot of splice junctions that belong to that isoform and no others. That's a much easier isoform to estimate expression for, because you have all these reads you're really confident come from it, so the level of uncertainty on its expression estimate will be lower than the uncertainty on another isoform that shares all of its content with a lot of other isoforms. And it uses that amount of uncertainty in the differential expression statistic: when it compares expression levels for one gene across multiple samples, it factors in both the variability between replicates, in terms of your experimental design, and the uncertainty of each individual expression estimate, based on the challenge of assigning reads to that isoform or not.

Question: what about problems in library preparation, like differential levels of genomic contamination? That could definitely be an issue. It will in some cases cause Cufflinks to make mistakes and start to imagine transcripts that aren't really there, things like that. It will probably also result in greater uncertainty in the expression estimates at those loci, because it's basically more noise; so it may be taken into account in the statistics and might reduce your power to detect significant differences.

So we've actually kind of already talked about how Cuffdiff works: the variability in fragment counts for each gene across the replicates is modeled, the fragment count for each isoform is estimated in each replicate, as on the last slide, and then there's this measure of uncertainty we talked about, estimated from the amount of ambiguously mapped reads. At loci where you have several transcripts that are highly similar, sharing most of the same exon content, with very few uniquely assignable fragments, there's going to be a lot more uncertainty in the assignment of expression levels to one isoform versus another. It does its best job, but it may basically not have a good idea of the expression levels of those different isoforms. A lot of this is simplified when you boil the analysis down to the gene-locus level, because you kind of just group it all together; but when you start to really tease apart transcript isoforms, it gets a lot more complicated, and success rates are lower, I would say. There have actually been some studies, some papers, showing that our current RNA-seq standard, 2 x 100 bp reads, is actually just woefully inadequate for this problem: Cufflinks is doing its best job, but it doesn't have the information it needs to really succeed. The reads just aren't long enough.
With only 2 x 100 bp reads, and given the problem I described, where you have lots of transcripts that are quite similar, in many cases you just don't have the power to properly tell one complete transcript isoform from another and accurately estimate it. So it's pretty challenging; basically, it's going to work at some loci better than others. I'm not saying don't do it, you can definitely make interesting discoveries, but you can't expect a totally complete, accurate, and comprehensive result at this point. Question: and that would be because there's much less ambiguity if you have no splicing? Way easier, way way easier. As for overlapping transcripts, overlapping loci: at the gene level, in general, if you have sufficient depth, which is probably less than a whole lane of HiSeq data, you tend to see, as Malachi showed with some of his experiments and many others have shown, very high correlation with other expression estimation methods. So at the gene level this works very, very well; at the transcript level it varies depending on the complexity of the locus you're looking at.

If you have stranded RNA-seq data, there's a parameter you can set when you run Cufflinks to say, I know what the strand is, and it will use that information as well. That will actually help it a lot in cases where you have overlapping genes on alternate strands, because it removes a big additional source of uncertainty. Just like the problem where, with multiple isoforms, it can be hard to tell whether a read comes from one isoform or another: if you have unstranded data and genes that overlap on opposite strands, you have exactly the same problem, you don't know whether a read should go with the gene on this strand or the gene on that strand. So with stranded data, at the bundling step, where it determines which bundles of reads to try to assemble and make graphs through, it will create separate assemblies for the two strands.

So, the variance estimates from the biological replicates and these measures of uncertainty are used in a statistical test to identify differentially expressed genes, under a beta negative binomial model. I'm not a statistician, so if you want to know more details about exactly how the statistic works, you can read the paper; it has a lot of detail in it.

One thing we need to talk about, if you guys are okay with moving on, is Cuffmerge. This is a tool we're going to use after Cufflinks but before Cuffdiff, and the reason for it is that we need to merge the different Cufflinks assemblies together. Someone asked whether we're going to assemble from our merged alignments or the separate alignments: we're going to use the separate alignments. For each sample, we'll run Cufflinks, and it will try to infer its own idea of the transcriptome from that individual sample. Even though we're going to give it the known transcriptome as a guide, in some cases it's still going to say, you know what, I think there's an exon-exon junction here. The problem with that is it doesn't happen the same way every time in different samples. In some cases there are legitimate differences between samples, unique alternative splicing happening in one condition and not the others; but there's also a sampling problem, where if you don't have enough data, then just because of the distribution of reads across your transcriptome, some details might be missed in one sample versus another. So their representations of the transcriptome aren't going to exactly line up.
But when you want to do differential expression analysis, they need to line up: you can't compare the expression level of a gene or transcript across all your samples if your genes and transcripts are defined in different ways. The Cuffmerge step basically brings these assemblies together into a parsimonious representation of the transcriptome. It also does some filtering: it filters out problematic bundles where there are known issues with alignments, in a kind of heuristic way, eliminating, for example, a locus with a bundle that has a crazy amount of alignment that's unrealistic even for a really highly expressed gene, although you have to be careful with that, because sometimes it can be fooled by a genuinely highly expressed gene. You can provide a reference GTF so that it merges the novel isoforms with known isoforms. Basically, it makes a new GTF file that's based on your data but is a consistent representation of the transcriptome across all your samples, so now you can compare apples to apples. It will contain representations of transcripts that might exist in one sample but weren't detected in another, and we need that; in fact, those might be the most interesting things we're looking for.

After we run Cufflinks, Cuffmerge, and Cuffdiff, we're also going to look at CummeRbund, a kind of final tool that lets you take the output of these tools, specifically the Tuxedo suite tools, and produce useful visualizations: distribution plots, correlation plots, MA plots, PCA plots (we talked about those), heat maps. It even does a nice job of showing gene- or transcript-level expression alongside the actual transcript structures that have been predicted, which can be especially useful with the Cufflinks modes that determine novel transcript isoforms and things like that. We're going to go through all these plots in the lab, so I'm just showing a highlight of the kind of output you can get from CummeRbund. I wish this package were a little better supported than it has been; I want to say the postdoc who created it has moved on, or something. It still works, but I feel like it's slowly working less well over time, which is a shame, because it gets you really far, fast. A lot of the steps we've run until now are reasonably straightforward: once you understand how to create the command, you can run it on your data, and running it on one BAM file is the same as running it on the next; once you've figured that out you'll be pretty comfortable, and you could almost automate it. But when you get to the point of having your expression estimates and differential expression results, it becomes, what now? There's all this possible analysis you could do, and each piece is like, oh crap, how do I learn to make a histogram, how do I make a heat map, how do I make a PCA plot? CummeRbund does a lot of this for you with one or two relatively simple commands, and it understands the Cufflinks and Cuffdiff output, so you can just point it at a Cuffdiff analysis and say, give me heat maps, give me PCA plots, whatever, and really quickly get quite nice visualizations to help you understand your data. So it's really useful in that respect. The disadvantages are that it's maybe not as well supported as it used to be, and that, because there are so many layers of abstraction getting you these nice plots from simple commands, tweaking them is in some ways more challenging than if you had created the plots through a more old-school method, like your own R code or a software package you're more familiar with.
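To give you a flavor of those one-or-two-command plots, here's a minimal CummeRbund sketch; it assumes a finished Cuffdiff run, and the output directory name is a placeholder.

library(cummeRbund)

# Point CummeRbund at a Cuffdiff output directory (placeholder path).
cuff <- readCufflinks(dir = "cuffdiff_output")

# One command each: FPKM density plot, sample dendrogram, PCA plot.
csDensity(genes(cuff))
csDendro(genes(cuff))
PCAplot(genes(cuff), "PC1", "PC2")

# Heat map of the significantly differentially expressed genes.
sig_ids <- getSig(cuff, level = "genes", alpha = 0.05)
csHeatmap(getGenes(cuff, sig_ids), cluster = "both")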
So we're going to provide example code for both in R: using CummeRbund, and do-it-yourself R analysis.

Alternatives to FPKM. There are kind of two camps in the world: some that like the FPKM approach, and others that believe everything, at least when we're talking about differential expression analysis, should be done directly from the raw read counts. So instead of calculating FPKM values, they basically just assign each read or fragment to a known set of genes or transcripts and tally raw counts. For this you absolutely need to know what your genes, transcripts, and exons are; otherwise you can't assign a read to them. So there's no step that tries to do an assembly or infer transcript isoforms. You can do that yourself and then provide those definitions to one of these counting tools, and count against your new representation of the transcriptome, but it's more common to just obtain a GTF file from, say, Ensembl or UCSC and say: given this BAM file of alignments and this representation of known genes, assign each read as best you can to a gene and just give me the counts. With those counts you can then do a lot of different statistics for differential expression, using mostly R packages. The tool we're going to use is called htseq-count; there are others. The output is basically a read count table, so this analysis is much simpler than FPKM.

There are pros and cons. I usually use FPKM when I want to leverage some of the benefits of the Tuxedo suite: if I want to infer novel isoforms, if I want to use CummeRbund, or if I want a normalized value for visualization. Raw counts aren't very good for creating heat maps, for example, because they're not normalized in any way, and you have to be very cautious about interpreting raw counts visually. FPKM is usually really useful for that, for calculating fold changes, and so on. People typically use the raw count method because they believe the statistical methods for differential expression are more robust. Also, until quite recently I would say more sophisticated experimental designs were possible: maybe you have time-series data, or some kind of condition A versus B versus C versus D, a complex multi-level experiment, and there are specific R packages that accommodate those experimental designs and do the right kind of ANOVA or ANCOVA or whatever statistics the design calls for. That wasn't really available with Cufflinks and Cuffdiff; it was more just A-versus-B differential expression. It seems they've slowly added more of that feature richness to the Tuxedo suite, though, so I think that differentiation is not as clear as it used to be: there's an additional step in the Tuxedo suite now where you can use a tool called Cuffquant to get a kind of count-like measure and set up more complicated statistical tests. So they're converging in some ways. We're going to provide an example that uses edgeR, I think as an optional alternative, showing how you would analyze your count-based data, and DESeq is another popular one.
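As a taste of the count-based workflow, here's a minimal edgeR sketch for a simple two-group comparison; the counts file name and group labels are placeholders, and the lab materials walk through a fuller version.

library(edgeR)

# A gene-by-sample table of raw counts, e.g. merged htseq-count output
# (placeholder file name).
counts <- read.table("gene_counts.tsv", header = TRUE, row.names = 1)
group <- factor(c("A", "A", "A", "B", "B", "B"))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)   # library-size (TMM) normalization
y <- estimateDisp(y)      # model biological variability across replicates
et <- exactTest(y)        # A-versus-B differential expression test
topTags(et)               # top genes with p-values and FDR-corrected values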
There are a whole bunch of other statistical packages for the count-based approach. The bad news, of course, is that they don't all agree, and we don't necessarily know that one is right and another is wrong. Many people have reported that if you look at the genes predicted to be significantly differentially expressed by Cuffdiff versus edgeR versus even another count-based method, you can get quite different results. Depending on what you're doing, you may want to take the intersection of multiple methods, to keep just the most confident predictions, or you might want to take the union: take everything and say, you know what, I'm going to do downstream validation and I don't want to miss anything important, so I'll take all the predictions from multiple methods. In our own pipelines we use both Cuffdiff and htseq-count/edgeR, the methods we're presenting here, and we usually consider both.

Before moving on to the labs, I wanted to review a few lessons learned from the microarray days. It's getting to the point now where some people haven't really even been exposed to microarray data, because we've been doing RNA-seq for a little while now, but there are definitely lessons that maybe haven't fully been carried over. Probably the most important is that fancy sequencing technology does not eliminate biological variability. In the early days, and even now, you'll have people designing experiments where they essentially compare one RNA-seq library to another, because RNA-seq is expensive, it might cost them four or five thousand dollars, and they want to try it out. They might discover something interesting in the way of a really unique novel alternative splicing pattern or something, but if you're looking for differential expression you can't really do that: you need to account for biological variability if you're going to do differential expression statistics. It's no different for RNA-seq than it was for microarrays. Five or ten years ago, if you submitted a paper comparing three normal samples to three tumor samples with microarray data, it's very likely a reviewer would say, whoa, this is way underpowered to say anything; you need to run more samples, more replicates, a larger study. Because RNA-seq has been quite expensive, there have been a lot of studies done like that, and it's proof-of-principle type work, but it's not really powered to do what it's trying to do. So I think the price of RNA-seq basically still needs to come down before we start seeing a lot of large-scale, good RNA-seq studies. If you look through the Gene Expression Omnibus, you can find hundreds or thousands of studies involving hundreds or thousands of samples; I'm not really aware of any RNA-seq studies of that scale quite yet. There are a few big projects, like GTEx and TCGA, that have done quite a lot of RNA-seq, but those aren't really experiments per se, they're surveys of what's out there. That's great if you're looking for, say, recurrent fusions, but if you plan to replace what you were doing with microarrays with RNA-seq, you need to think about exactly the same kind of power analysis issues and the need for replicates. There are good debates about this on Biostars, and there's also a power analysis tool from Gabor Marth's lab that can help you determine how many replicates you would need for your RNA-seq experiment.
Multiple testing correction is another really important one. Are you guys all familiar with the concept? Has anyone not heard of multiple testing correction before? The basic idea, for those who haven't, is that as you compare more things, more genes, more exons, more transcripts, and do more and more statistical tests, then, given that there's noise in the system and natural variability, there's an increasing chance that you'll start to see significant differences by chance. You can take randomly generated data, imagine you have two conditions, take a typical distribution of RNA-seq values, randomly assign values to samples, and do a bunch of statistical tests, and you will get significant p-values from totally random data. The more tests you do, the higher the chance that a statistically significant result is there just because you've done enough tests that, even in a noisy system, you'd expect to see some significant differences; they aren't necessarily related to biology at that point. So there's a whole suite of different corrections you can apply, where you take your p-values and basically correct them, penalize them, for the number of tests you performed; this is called multiple testing correction. Question: I thought Cuffdiff did that itself? It does; Cuffdiff has multiple testing correction built into it, which is convenient, so it produces both p-values and corrected values, often referred to as q-values. You just need to be aware that you have to consider multiple testing correction here. This was an issue in array studies too, but in many ways it's actually worse now, because there are a lot more features, which means a lot more potential tests: instead of maybe 20,000 genes on the array, you now have as many genes as you can define based on current gene annotations, plus all of their transcripts and exons and junctions that you might be interested in examining for differential expression. So it creates a huge multiple testing problem. We usually use the Bioconductor package multtest to correct for this after the fact, or, as was mentioned, in the Tuxedo suite this is kind of built in.
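The correction itself is a one-liner in R; here's a tiny illustration of the random-data point above, using base R's p.adjust (Benjamini-Hochberg) rather than the multtest package:

# 20,000 "genes" with no real signal: two small groups of random values each.
set.seed(42)
pvals <- replicate(20000, t.test(rnorm(3), rnorm(3))$p.value)

sum(pvals < 0.05)                        # roughly 1,000 "hits" by chance alone
qvals <- p.adjust(pvals, method = "BH")  # Benjamini-Hochberg FDR correction
sum(qvals < 0.05)                        # essentially nothing survives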
Unfortunately, the downstream interpretation of expression analysis is kind of a topic for an entire course, and I think CBW probably does have courses like that; there was one last week, so if you missed it, sorry, maybe next year. The expression estimates and differential expression lists you'll get from Cufflinks and Cuffdiff, sort of the end of this workshop, can be used for many downstream purposes. We do provide some examples with CummeRbund, and a supplementary R tutorial about how to format the output and start manipulating it in R, making different visualizations and doing things like clustering and heat maps. But there's all kinds of other analysis you could do. Maybe you want to do classification analysis, looking to build a model that can predict one condition from another; we do that in the cancer world a lot. I provide a couple of tips there, for learning purposes, if you're interested in getting into classification analysis: WEKA is a good learning tool, and I use the randomForest R package. This slide is actually out of date, because we have since developed a Biostar tutorial, actually four tutorials, that will take you through it. We use array data as the example, again because you need a lot of samples if you're going to think about doing classification analysis, but the principles are the same. I think in the first tutorial we pull in the array data and create a matrix of samples by gene expression, and then the subsequent tutorials go through how you would build a classifier based on that data.

Pathway analysis is a common question. Once you have your differentially expressed gene list, people often want to look for over-represented pathways. There are commercial options, like Ingenuity Pathway Analysis, which we use sometimes, and then there are free options like Cytoscape. There's a ton of Bioconductor packages if you want to do pathway analysis in R, and there's a whole section on the Bioconductor pages discussing the different pathway analysis options.

So that's it for the lecture side; we're going to jump into the tutorial for Module 3 now. Recall that what we've done so far is take our raw data and align reads against the reference genome, using the GTF file for our gene annotations. Now we're going to compile transcripts and estimate expression levels, then do differential expression after the Cuffmerge step, and finally we'll visualize the results with CummeRbund and/or the supplementary R tutorials I mentioned.