 module we have. It also has a tutorial component. It's really only meant as a brief introduction to the topic of isoform discovery and alternative expression analysis anyway. That is a very complex area with a lot of tool development still underway. The tools that we're going to use here are trying to do something very elaborate, and they do an okay job at it, but it's an area that I think still needs a lot of development. It's a challenging problem to actually do isoform-specific expression analysis with RNA-seq data, and I'll explain why that is during this lecture. What we're going to do in the hands-on component here is use Cufflinks again, but in different modes that are a little more geared towards isoform discovery and abundance estimation. Both of these modes still require a reference genome sequence.
Just to remind ourselves, up to this point we've been focusing to a large degree on the gene level and not paying much attention to the specific isoforms. But of course, for most genes in a species like human, and in many other species, when you go from a pre-mRNA molecule to an mRNA molecule there are often multiple different paths that can be taken, and you can get isoforms that differ, sometimes quite substantially, in their exon content.
Quite a few methods have been developed to study splicing by RNA-seq. This is taken from a review that I believe was published on bioRxiv. It's just a slice of a huge table that the author put together summarizing tools that can be used for different aspects of alternative splicing analysis. You can review some of those. It was a little while ago now, so it probably needs to be updated. On the wiki, there are also some splicing-related sections in the tools table that probably include a more up-to-date list of tools that you can use for splicing analysis with RNA-seq data.
There are a number of useful resources and discussions linked out from this slide: general discussion of best approaches, how to detect alternative splicing, and identifying genes that express different isoforms in cancer versus normal RNA-seq data. There's some discussion of how the Cufflinks and Cuffdiff outputs differ, and then some tools for visualizing alternative splicing events. Some of these events can be quite complicated to interpret and visualize.
The next two slides just show some of the classic patterns of alternative splicing. At the top you have a simple transcription and splicing event, where you have a hypothetical gene with three exons and two introns. The introns are removed and you get a canonical isoform with the three exons stitched together. But you can have alternative transcript initiation, where one or more alternate first exons are used, which gives you transcripts that are longer or shorter depending on which exon is used. Then there's classic cassette exon splicing, where you have a simple skipping of an exon. With exon 2 in this case you have two isoforms, one that includes exon 2 and one that skips it. Then alternative 5' splice site selection. These are alternate donors, where you have a different edge of the exon being used in one transcript compared to another transcript. You can have the same thing at the 3' splice site, where different acceptor sites are used, and this gives you a different leading edge for an exon in different transcripts. You can have mutually exclusive exons, where you wind up with two transcripts that have the same number of exons but use a different set of exons in the middle. You can have complete retention of an intron. This generally happens with relatively small introns. Generally, if you include a large intron, you'll introduce a nonsense event, and that will lead to nonsense-mediated decay.
And then at the 3' end of the transcript you can have alternative polyadenylation: just like you can have alternative transcript start sites, you can have alternative polyadenylation sites that give you different ends to the transcript.
People have been studying this for many, many years. It was discovered a long time ago that alternative splicing was a thing, and various sequencing platforms have been used to try to measure isoforms specifically. The gold standard for actually resolving the structure of isoforms is full-length cDNA sequencing. This is what we would ideally love to do. If we could just sequence each transcript from beginning to end, and do that in a massively parallel fashion that was very cheap, we would do that instead of what we're doing now, which is sequencing pieces of RNA molecules, relatively small pieces actually. They're generally on the order of 1%, maybe 5 to 10%, of the average full transcript length. That means we're piecing together the pieces of a puzzle to figure out what the structure of the full-length transcript actually was. So it's a lot of inference.
We're not actually sequencing full-length RNA molecules pretty much at all. Occasionally an RNA-seq fragment will be long enough, and the gene that it corresponds to short enough, that the two reads of your read pair actually meet in the middle. If the transcript was only a couple hundred bases long, you sequence that transcript from end to end. But most transcripts are not like that. Most transcripts are 1 kb or 2 kb or even larger, and they have many, many exons that make them up. We're just sequencing this little piece, and then we're looking after the fact at all of these little pieces that we've sequenced and trying to think, okay, what did the full-length thing look like? But we can't really know anything for sure.
We can just infer what we think it might look like based on the patterns that we see when we assemble all of these little fragments together. Cufflinks attempts to do this, and what it's attempting is very clever. But it's a fundamentally hard problem. So I think we should be skeptical of what it produces, because it has a lot of guesswork to do, and guessing is hard.
This is basically the idea that it pursues. It breaks the problem into three pieces: differential splicing, differential promoter usage, and differences in the resulting coding sequence. You can think about that by looking at this top diagram. In this hypothetical example you have a relatively simple gene locus that has three alternative transcripts, and they differ in subtle ways. This is already pretty complicated, but compared to a typical human locus it's still really, really simple. A lot of human loci, for example, will have mixes of all of these things mashed together. There'll be 10 alternative isoforms that are known, and they all have subtle differences in where their transcript start site is, where their polyadenylation site is, and which exons they use in the middle.
But Cufflinks breaks this problem into three pieces, and they provide this simple illustration to describe those three pieces. The first one is to compare a set of transcripts by the way that they start. It does that by looking across the set of transcripts at a locus and asking, how many different transcript initiation sites are there? So it breaks the transcripts into two categories and says, okay, I've got these two transcripts, A and B, that use this transcript start site here, and I've got another transcript that uses this transcript start site here. So I've got two classes under the transcription start site category.
And so it'll look at the RNA-seq reads, and it'll say, okay, how much evidence do I have for this transcript start site being used, and how much evidence do I have for that transcript start site being used? It will allow you to use Cuffdiff to try to get at whether there's differential promoter usage, based on those groupings, between the two conditions that are being compared.
The second thing it does is look at the splicing preference. It asks which exons are being used, and it breaks them down into two categories: we've got A and B, where you've got this exon being included or this exon being skipped.
The third thing it does is look at the CDS. What coding sequences are there? That's the fatter part of this sequence. We've got a UTR here and then a coding stretch that starts here. So we effectively have an ORF, a long ORF, and then a short ORF. We've got one transcript that's using the long ORF and two transcripts that are using the shorter ORF, and it's going to compare those two groups.
So there are three ways of breaking up this set of three transcripts. It then performs alternative splicing tests on those groupings, and it splits the results into three output files. We're going to run Cufflinks and look at these output files, at least very briefly. It'll give you a splicing.diff file, where it's looking at differences in the inclusion of exons. It'll give you a promoters.diff file, where it's looking at differential transcript start site usage. And then there's a cds.diff file, where it's looking at the different ORF usages.
Okay, so in this tutorial, which will be the last one that we do today, we're going to run Cufflinks again, but we're going to change its mode in two different ways. We've already run it in the so-called reference-only mode.
But we're now going to run it in reference-guided and de novo modes. The commands are very similar, and really the only thing that's changing is how we utilize prior knowledge of transcripts in the human transcriptome.
When we ran Cufflinks already, we supplied a GTF file, and we said, hey Cufflinks, this is what we think the transcripts look like in human, please use that information and take it very, very seriously, because we believe that these are really high-quality predictions of what the transcripts look like. It gave us an output that is pretty close to one result for each of those transcripts. So it's like we have this pile of transcripts, and we asked it, please tell us what you think about the abundance of all of these transcripts that we already know about.
Now we're going to go back and run Cufflinks again, but we're going to tell it two different things. The first time we're going to tell it, we have these transcripts, but we just want you to use them as a guide. So don't limit yourself to these transcripts, but use the information to guide the process of assembling the transcriptome. And then in the last mode, we're not even going to tell it anything about transcripts. We're just going to say, here are the RNA-seq reads, you're on your own. You tell me what transcripts are there. I'm not going to give you any information about what transcripts are present. This is the true assembly or transcript discovery mode. So if you had a species with a reference genome sequence, so someone had actually sequenced the genome, but not much transcript annotation had been done yet, so not very much cDNA sequencing or EST sequencing, you could actually do RNA-seq to try to figure out what all the transcripts are in that genome.
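As a rough illustration of how these three modes differ on the command line, here is a minimal Python sketch that only builds the Cufflinks argument lists and does not run anything. It assumes Cufflinks' `-G` option (use the GTF as the only set of transcripts) and `-g` option (use the GTF as a guide), plus `-o` for the output directory and `-p` for threads. The function name, file paths, and thread count are hypothetical placeholders, not the course's actual commands.

```python
# Sketch: build (but do not execute) a Cufflinks command for each mode.
# Mode names and file paths are illustrative assumptions.

MODE_TO_FLAG = {
    "reference_only": "-G",    # quantify only the transcripts in the GTF
    "reference_guided": "-g",  # use the GTF as a guide, allow novel transcripts
    "de_novo": None,           # no GTF at all, assemble from alignments alone
}


def cufflinks_command(bam, out_dir, mode, gtf=None, threads=1):
    """Return the argument list for one Cufflinks run in the given mode."""
    if mode not in MODE_TO_FLAG:
        raise ValueError("unknown mode: %s" % mode)
    flag = MODE_TO_FLAG[mode]
    if flag is not None and gtf is None:
        raise ValueError("mode %s needs a GTF file" % mode)

    cmd = ["cufflinks", "-p", str(threads), "-o", out_dir]
    if flag is not None:
        cmd += [flag, gtf]
    cmd.append(bam)  # the aligned reads always come last
    return cmd


if __name__ == "__main__":
    for mode in MODE_TO_FLAG:
        cmd = cufflinks_command(
            "accepted_hits.bam",         # hypothetical TopHat output
            "cufflinks_" + mode,         # hypothetical output directory
            mode,
            gtf="known_transcripts.gtf", # hypothetical annotation file
            threads=8,
        )
        print(" ".join(cmd))
```

The only thing that changes between the three printed commands is whether a GTF is passed and under which letter, which is the whole difference between the modes.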
And so there are people doing this now, sequencing a new genome and then using RNA-seq to annotate the genes that are in that genome. This has really revolutionized the genome annotation process in a lot of ways, because it's a lot cheaper than doing a lot of cDNA sequencing, which is a very labor-intensive, expensive proposition. So we're going to learn how to do that part of it as well.
Then we're going to take the assemblies that you get from running Cufflinks in these ways and do some merging, just like we did before to account for differences in the predictions from the different replicates. Then we're going to do the differential splicing analysis with Cuffdiff. And then we're going to do some other, more targeted analysis. We're going to look at the junction readout that comes directly from TopHat. This is a way of getting at splicing patterns without even running Cufflinks; you can get it straight from TopHat. And then we're going to visualize in IGV some of the TopHat junction counts and some of the Cufflinks assembled transcripts, if we have time. We might run out of time before we get to that.
So, running Cufflinks in reference-guided and de novo mode. We're going to walk through this step by step, and there are pretty detailed instructions in the wiki. But it all really comes down to the use of these different G options. They have longer, more descriptive versions, which help, but it gets a bit confusing, all of the different places that this G is used. TopHat actually had a big G option (-G) where we specified a transcriptome GTF, and that is pretty much unrelated to what we're talking about here. When we told the aligner about transcripts, we were giving TopHat a heads-up that we already knew something about transcripts and that it should try to align against known transcripts as well as against the reference genome.
But it wasn't trying to predict transcripts or estimate their abundance or anything. It's purely used for deciding where the best place to put each read is. TopHat also has a little g option (-g), which is also unrelated. They just happened to use that letter for the option that specifies the maximum number of multiple mappings for a single read. So you can decide how many mappings of a read you want it to report for cases where a read matches here, and it matches somewhere else, and it matches somewhere else. Again, that's unrelated to this.
The G options in Cufflinks are really what we're talking about here. We already ran Cufflinks using the big G option (-G) to supply a transcriptome GTF, and we call this the reference-only mode. When we use the little g (-g) instead of the big G, that's when we tell Cufflinks to interpret that GTF just as a guide instead of really concrete information. And if we don't supply the GTF file at all, so we don't use either of these options, that's what we're going to call the de novo mode, where it's just trying to predict the structure of transcripts directly from the reads that were aligned against the reference genome.
And then this is just a brief introduction to the junctions.bed file that TopHat gives you. This is something that is automatically generated when you run a TopHat alignment. You get this BED file, which is a tab-delimited file where each line corresponds to a single exon-exon connection, along with the number of reads in the alignments that were observed to support that exon-exon connection. We're going to look at a few of these files to demonstrate what each of the columns means. This is what it looks like when you view that junctions.bed file in IGV. You can see that you've got little arcs that are in red or pink here.
Each one corresponds to one or more reads that were observed to span, in this case, from the edge of this exon all the way over to the edge of that exon. And then this is the next pair of exons, and this is the next pair of exons. This was all done without necessarily knowing that this is where the exon-exon connections are, but as you can see it's done a pretty good job of matching up with the gene model. So the reads that we observe in our RNA-seq data correspond very nicely with the expected connections of exons, and the darker these curves are, the more reads were observed that supported each of those junctions.
This illustrates a really simplistic alternative splicing or transcript discovery exercise that you can do. There's actually evidence for a novel exon in this data. Can anyone see where it is, and why you would be able to tell that's a novel exon? Yeah. Yeah, that one. There, right? You can see there are three arcs covering this space. One of them goes from here to here, which corresponds to the edges of these two known exons. But then you have these two arcs that come in here to a bit of a dead zone where there isn't any exon annotated. The fact that they both come in from a known exon and land in the same place from both sides suggests that there probably really is an exon here that just hasn't been annotated. It's not part of whatever isoform was sequenced to create this known gene model, which I believe comes from RefSeq. I think that was actually in this data, so you can probably go into your alignments and find that example.
Cuffmerge we've already talked about. It's just going to combine our predictions into a unified GTF. We're going to get GTFs from each of the different modes, and we'll be able to see how some of the modes predict different things than the reference-only mode did. We'll go through that in IGV together.
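To make the junctions.bed description above a bit more concrete, here is a minimal Python sketch of reading such a file. It assumes the layout that TopHat documents for this file: a 12-column BED line per junction, with the score column holding the number of supporting alignments and two blocks whose sizes are the maximal read overhangs on each side. The two data lines are made up for illustration and do not come from the course data; the function and field names are hypothetical.

```python
from collections import namedtuple

# One record per exon-exon junction; coordinates are 0-based, end-exclusive.
Junction = namedtuple("Junction", "chrom intron_start intron_end name reads strand")


def parse_junctions(lines):
    """Parse TopHat-style junctions.bed lines into Junction records."""
    junctions = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("track"):
            continue  # skip blank lines and the BED track header
        fields = line.split("\t")
        chrom_start = int(fields[1])
        chrom_end = int(fields[2])
        # blockSizes holds the overhang on the left and right of the junction
        block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
        intron_start = chrom_start + block_sizes[0]  # first intronic base
        intron_end = chrom_end - block_sizes[1]      # first base of next exon
        junctions.append(
            Junction(fields[0], intron_start, intron_end,
                     fields[3], int(fields[4]), fields[5])
        )
    return junctions


# Invented example lines, for illustration only.
EXAMPLE = [
    'track name=junctions description="TopHat junctions"',
    "chr22\t100\t400\tJUNC00000001\t12\t+\t100\t400\t255,0,0\t2\t50,60\t0,240",
    "chr22\t350\t900\tJUNC00000002\t3\t+\t350\t900\t255,0,0\t2\t50,40\t0,510",
]

if __name__ == "__main__":
    # List junctions from best to least supported.
    for j in sorted(parse_junctions(EXAMPLE), key=lambda j: j.reads, reverse=True):
        print(j.chrom, j.intron_start, j.intron_end, j.strand, j.reads)
```

Comparing the intron start and end coordinates against the exon edges in a known gene model is one simple way to spot junctions like the novel exon example, where one end lands on an annotated exon and the other does not.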
And then before starting the last hands-on tutorial, I usually just briefly discuss what you should do when you get back to your own work and you have questions, or you can't get things to work on your own data. There are quite a lot of materials that we try to provide with this course that are difficult to go through in the time provided, so it's possible that the answer to your problem is there already. The guys who created the TopHat, Cufflinks, Cuffdiff suite of tools wrote a Nature Protocols tutorial that's quite nice. You can check that out. It's listed along with a lot of other resources like that in the wiki. Of course, we've talked already about searching through Biostars, and SEQanswers is another option. If your question's not already answered on Biostars, then you can ask it there. The TopHat and Cufflinks guys provide a very lame troubleshooting guide, which has only three problems listed on it, or maybe it's four, of the probably hundred things that could possibly go wrong. But the supplementary tables for the RNA-Seq Wiki paper have a much larger list of common questions and challenges, so there's probably a better chance that you'll find it there. And if you can't find it anywhere, then ask it on Biostars, and hopefully someone will have already experienced that problem and have a quick answer for you.