Okay, welcome to Module 4. This is pretty much an evolution of everything that you were doing yesterday, both in terms of theoretical material and practical tutorial materials. We're basically just going to continue on with the tutorial from where we left off. And it seems like most or all of you have got through the day one tutorial, at least the Linux part of it, quite successfully. So hopefully that will go smoothly. What we're really going to talk about now is more of the isoform discovery and alternative expression analysis. And there are some components of Cufflinks and Cuffdiff that help us to perform this kind of analysis. So this is where we are in the tutorial, Module 4. Basically the main learning objective of this module is to reuse Cufflinks in a few different modes. So Cufflinks has this reference annotation based transcript assembly mode where it's going to use a GTF file representing the transcriptome to assemble what it thinks the transcriptome looks like in your sample, but using some notion of the known transcriptome as a guide. And then there's sort of a fully de novo mode where it doesn't know anything about the transcriptome. You might not even have a GTF file for your species. And it's going to try to assemble the transcriptome basically from scratch. So this could allow you to basically create the first notion of a transcriptome for a species that perhaps hasn't had its transcripts annotated yet. But both of these do require a reference genome sequence. So we're using "reference" in two different contexts. There's the reference genome assembly that we align our reads to, and then there's the reference annotations of transcripts that have been annotated on that reference genome. And whenever we're using Cufflinks, Cuffdiff, et cetera, we always need the reference genome, but we don't necessarily need the reference transcript annotations. We can use them if we want to, but we don't have to use them.
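As a rough illustration of the modes just described, here is a minimal Python sketch that builds the corresponding Cufflinks command lines. It assumes the commonly documented Cufflinks options (`-g` for a GTF used as a guide, `-G` for quantifying only the annotated transcripts, `-o` for the output directory, `-p` for threads); the helper name, mode names, and file names are hypothetical, and you should check the options against the version of Cufflinks you have installed. The annotation-only mode is included for comparison.

```python
import shlex

def cufflinks_command(bam, out_dir, mode, gtf=None, threads=1):
    """Build a Cufflinks argument list for one assembly mode (illustrative helper).

    mode is one of:
      "reference_only"   -> -G: use only the transcripts in the GTF
      "reference_guided" -> -g: use the GTF as a guide, still allow novel isoforms
      "de_novo"          -> no GTF at all; assemble from the alignments alone
    Every mode still needs reads aligned to a reference genome (the BAM file).
    """
    cmd = ["cufflinks", "-p", str(threads), "-o", out_dir]
    if mode in ("reference_only", "reference_guided"):
        if gtf is None:
            raise ValueError("mode %r needs a GTF annotation file" % mode)
        cmd += ["-G" if mode == "reference_only" else "-g", gtf]
    elif mode != "de_novo":
        raise ValueError("unknown mode: %r" % mode)
    cmd.append(bam)
    return cmd

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for mode in ("reference_only", "reference_guided", "de_novo"):
        gtf = None if mode == "de_novo" else "genes.gtf"
        args = cufflinks_command("sample.bam", "out_" + mode, mode, gtf=gtf)
        print(" ".join(shlex.quote(a) for a in args))
```

The helper only prints the commands; running them is left to the tutorial steps themselves.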
Does that mean that neither of these would be very relevant? No, I wouldn't say that, because we haven't discovered every gene in human yet, and in particular we haven't discovered every isoform of every gene in human. So we know that most human genes have many isoforms, but a lot of them have not been well characterized or discovered yet. So we're going to see how, basically, if you look at any RNA-seq data, you can pretty quickly find evidence for isoforms that are not currently annotated in Ensembl. I think that our view of the human transcriptome is far from complete, and that's because it's a very complex, large problem. So just to review this gene expression diagram, which I showed the first day: really what we're going to start talking about in more detail is this part here, where we have the splicing machinery that comes along and converts pre-mRNA, an immature RNA molecule, into a mature RNA by removing the introns and splicing the exons together. So what's depicted here is this happening in one way. So we have three exons being defined in a particular way. The introns between them are being removed, and we get a single messenger RNA. And there's a very, very complex machinery that regulates this and controls this process, and there are many regulatory features that are recognized. And generally, as I said, for most human genes, you don't actually just get one form. You get multiple forms. So exon 2 might be skipped, and various other things may differ, and often you'll get multiple isoforms being generated from the same locus in the same tissue. And there's a whole field of biology studying the effects of alternative splicing and the functional consequences of alternative isoforms. As far as RNA-seq goes, there's also a whole field of bioinformaticians and mathematicians and statisticians and analysts studying ways to take RNA-seq data and try to learn something about alternative splicing.
So you don't need to worry about reading this, but I provide it as a reference. It comes from a blog, the RNA-seq blog, which is a really useful blog for keeping on top of RNA-seq developments. It's a guy in Spain that has a real interest in RNA-seq, and he regularly posts interesting papers that come out, tools that are developed, and so on. And he has recently submitted the paper that has this figure, and I think it still hasn't been published, but it's been made available. That's basically sort of an ongoing map of different tools that you can use to study alternative splicing from RNA-seq data, and they're sort of broken down by categories. There are tools that help you with the mapping, tools that help you reconstruct isoforms from your alignment files, tools that help you quantify those transcripts once you've identified what their structures are, and then tools that help you compare the expression of those transcripts between conditions. And as you can see, there are many, many tools, and we're just going to use a small number of them. I've also provided here a list of useful resources and discussions that have started to accumulate on Biostars. So for example, there's been some discussion of what the best approaches are to predict novel and alternative splicing events from RNA-seq data, and there's a couple of forum posts there to help you get going on that topic. Alternative splicing detection, same thing. Identifying genes that express different isoforms in cancer versus normal specifically. And then some questions about the Cufflinks and Cuffdiff output in particular. And then a post that tries to summarize some of the ways that we can visualize alternative splicing events using RNA-seq data. So I'm just going to spend a few slides here reviewing the types of alternative expression, or alternative splicing; I'm going to use those terms somewhat interchangeably.
It's good to think about what the structures are that we're trying to predict here. So again, I'm showing sort of a cartoon model of a gene with three exons and two introns. And in the simplest case, it's going to be transcribed and result in our simple isoform with exons one, two, and three, and then it gets polyadenylated. Sometimes there'll be sort of a most common isoform that will be called the canonical isoform. And then there will often be a variety of alternative isoforms, and these can be generated by different mechanisms. So you can have alternative transcript initiation, where the polymerase starts transcribing the RNA at, say, this position or that position. And depending on which position is used, you'll get a transcript that includes exon one or starts at exon two. So basically in this case, you've got transcripts that have different five prime ends, and this may or may not affect the protein sequence. Alternative splicing deals with the things sort of in the middle of the transcript rather than the beginning or the end. So for example, in cassette exon skipping, you have the simple scenario where exon two might be included or it might be skipped. And again, you can have a combination of these two events happening at the same time in the same tissue, or perhaps this one is brain specific and this one is liver specific. All kinds of scenarios can occur. In addition to the whole exon being skipped, you can have alternative boundaries of each exon. So what's being shown here is an alternative three prime boundary of the exon, or donor site. So this exon has two donor sites that can be used, and depending on which one is used, you'll get slightly different mRNA sequences. And then you can have the same thing at the acceptor site, or the five prime end of the exon, where you can get alternate five prime boundaries being used. And again, these give you slightly different exons. And this difference can be very small, it can be just a few bases, or it can be quite large.
It might have a very pronounced effect on the protein sequence or it might have a very subtle effect. It can result in a nonsense signal being introduced, and then one of these might be degraded by nonsense-mediated decay. In this scenario, you have exon skipping, as I showed on the last slide, but in this case you've got mutually exclusive exons. This is a relatively rare but interesting pattern that's seen in some human genes, where exons one and three will always be included and then one of two alternate exon twos will be included. So you wind up with transcripts where basically the middle of them is different. The entire intron can also be retained. So in this example, you've got exon one and then instead of having exon two and three, effectively you just have one large exon two that gives you a much larger transcript. And these intron retentions will commonly introduce a stop codon and trigger nonsense-mediated decay, so they can effectively be a way of silencing a gene without actually stopping its transcription. Finally, so we talked about alternative transcript initiation that gives you different five prime ends of transcripts. You can have the same thing at the three prime end, where alternate polyadenylation signals are used, and it gives you transcripts that differ in the exons at the three prime end. So when we're thinking about analyzing RNA-seq data to find these kinds of events, it's useful to think about what these patterns are and whether our analysis strategy will be able to tell us something about each of these categories or whether it might be limited to particular categories. And as you look at these diagrams, you can start to think about what the sequences are that will help you distinguish what's going on. So really you have two things going on. You have the sequence content of the exons that are included. So one can imagine searching for this sequence and this sequence to try to get an idea of which of these two things is being expressed.
And the other form of information that we have is the connections between exons. So you have distinct paths through the genome, and we can look for the sequences in our RNA-seq data that are distinctive of those unique paths through the genome. So these are the splicing events that join two exons together, and that join is often called an exon-exon junction. So we're going to be looking at the junctions.bed file from TopHat, for example, which is basically a readout of this information from your RNA-seq data. And you can use it to reconstruct, in theory, a lot of these kinds of patterns, or at least infer what might be going on. So this is just sort of a bit of a history lesson on sequencing methods for studying alternative isoforms, which is going to culminate with Illumina sequencing. So again, I'm showing just an example region of the genome with a variety of hypothetical transcript variants that differ in their five prime ends, in their three prime ends, the exons that they include or skip, the boundaries of those exons that are used, introns being retained, and so forth. And really the sort of gold standard for resolving the structure of transcripts is full-length cDNA sequencing. If we could just sequence cDNAs from end to end without having to break them in pieces, and we could do thousands or hundreds of thousands or even better millions of those things, we would not bother with any of the analysis that we're doing today, because we would not have to do nearly as much inference. If we could just, for example, with nanopore sequencing, feed a transcript through a pore and read its sequence in its entirety from one end to the other, we wouldn't need to do any fancy analysis. We would know pretty well what the structure looked like, but unfortunately, there's just no way to do this in a really high-throughput fashion.
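To make the exon-exon junction idea from a moment ago concrete, here is a small Python sketch of reading a TopHat-style junctions.bed and looking for a possible cassette exon. It assumes the usual layout of that file: 12-column BED, where the score column is the number of reads spanning the junction and the two block sizes are the overhangs on either side of the intron. The function names and the three example junctions are made up for illustration; this is a toy inference, not what Cufflinks does internally.

```python
from collections import namedtuple

# One splice junction: the intron it spans (0-based, end-exclusive) and its read support.
Junction = namedtuple("Junction", "chrom intron_start intron_end strand reads")

def parse_junctions_bed(lines):
    """Parse TopHat-style junctions.bed lines (assumed BED12 with two blocks)."""
    junctions = []
    for line in lines:
        if not line.strip() or line.startswith("track"):
            continue  # skip blank lines and the track header
        f = line.rstrip("\n").split("\t")
        start, end = int(f[1]), int(f[2])
        left, right = (int(x) for x in f[10].rstrip(",").split(","))
        # The intron lies between the two overhang blocks.
        junctions.append(Junction(f[0], start + left, end - right, f[5], int(f[4])))
    return junctions

def find_skipped_exons(junctions):
    """Report exons implied by two junctions that a third, longer junction skips over."""
    events = []
    for skip in junctions:
        for up in junctions:
            same_donor = (up.chrom, up.strand, up.intron_start) == (
                skip.chrom, skip.strand, skip.intron_start)
            if not same_donor or up.intron_end >= skip.intron_end:
                continue
            for down in junctions:
                same_acceptor = (down.chrom, down.strand, down.intron_end) == (
                    skip.chrom, skip.strand, skip.intron_end)
                if same_acceptor and up.intron_end < down.intron_start < skip.intron_end:
                    events.append({
                        "exon": (up.intron_end, down.intron_start),
                        "inclusion_reads": (up.reads, down.reads),
                        "skipping_reads": skip.reads,
                    })
    return events

# Hypothetical gene: exons at 100-200, 300-400, 500-600; exon 2 is sometimes skipped.
example = [
    'track name=junctions description="TopHat junctions"',
    "chr1\t180\t320\tJUNC1\t30\t+\t180\t320\t255,0,0\t2\t20,20\t0,120",
    "chr1\t380\t520\tJUNC2\t28\t+\t380\t520\t255,0,0\t2\t20,20\t0,120",
    "chr1\t180\t520\tJUNC3\t5\t+\t180\t520\t255,0,0\t2\t20,20\t0,320",
]

if __name__ == "__main__":
    for event in find_skipped_exons(parse_junctions_bed(example)):
        print(event)
```

The nested loops are fine for a handful of junctions at one locus; for a whole genome you would index junctions by their start and end coordinates first.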
There have been some large-scale projects that attempt to capture some representation of a huge number of human cDNAs, but they're done at great expense by large genome centers over the period of decades. For example, the EST sequencing projects spent millions and millions of dollars and took many, many years to generate these huge databases of transcript structures, but it's just not a very high-throughput technology. So there have been various attempts to do this in a much more high-throughput way. So there are various small tag-based approaches that were attempted for a while, with various cryptic names like SAGE, CAGE, and GIS, which focus on the five prime or three prime ends of transcripts, or the beginning and end of transcripts in the case of GIS tags. 454 came along and this gave us larger numbers of reads, not as parallel as Illumina, and they're a bit longer, so they're not too bad for assembling transcripts. And then Illumina came along and it gave us a huge amount of data, but it's very fragmentary. So we have these small pieces, and we're trying to piece transcripts back together by assembling them and comparing against the reference genome. So all of these things give us a lot of data, but they require us to do a lot of inference about what the total structure of the transcript looks like, because these things are very small pieces compared to what the transcripts really are. Okay, so Cufflinks is what we're gonna use to try to get at this problem. And it does a number of alternative splicing tests. And I'm gonna just kind of briefly describe this. It's quite a lot of information, but basically it has sort of three forms of output to help us look at alternative splicing patterns. So for example, let's examine a simple example of a gene where we've got three transcripts that differ by their transcript start sites and the exons that they include or skip.
We're gonna try to get the relative abundance of these isoforms by basically measuring the parts of the sequences that are unique to each of the isoforms. And then Cufflinks does sort of three types of tests. One is sort of the splicing test, which looks within a group of transcripts that start at the same site. It tries to quantify the relative proportion of each of the transcripts that use that start site. So it basically takes all the transcripts out of the locus and says, these two start here, this one starts there. And then within that, it tries to look at the difference in splicing. So in this example, the splicing test is basically comparing transcript A versus transcript B at this start site. And then it also directly compares the transcript start site usage. So here it's basically considering these things together, because they both use the same transcript start site. And it's comparing the amount of those two things to the amount of this one thing. And then in the third mode, it considers everything with respect to the actual coding portion, or the predicted coding portion, of the transcripts, which is shown by the fatter area of these bars. You've got a narrow part and then it becomes fat. So it's like focusing on this coding portion versus the coding portion here. So you can imagine that this actually covers a fair amount of the scenarios that I was describing in these two slides, but not really all of the possibilities. So for example, it doesn't really tell you anything about the alternative polyadenylation sites directly. And I'm not sure why they didn't consider that. So this is a bit overwhelming, but the sort of take-home message is that when you finally get your differential splicing output for each of these three tests, these are the files that we're gonna be looking at from Cufflinks for each of these three scenarios. There's a splicing.diff file, a promoters.diff file, and a cds.diff file.
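The three groupings behind those tests can be sketched with a toy example in Python. This only computes the relative proportions that each test compares between conditions; it does not reproduce the statistical test itself. The transcript names, start-site and coding-sequence labels, and abundance values are all hypothetical.

```python
from collections import defaultdict

def proportions(values):
    """Convert a dict of abundances into fractions of their total."""
    total = sum(values.values())
    return {k: (v / total if total else 0.0) for k, v in values.items()}

def summarize_locus(transcripts):
    """Summarize one locus the three ways described above (illustrative only).

    transcripts: list of dicts with keys "id", "tss", "cds" and "fpkm".
    Returns (splicing, promoter_usage, cds_usage):
      splicing       - isoform proportions within each start-site group
      promoter_usage - proportion of the locus total coming from each start site
      cds_usage      - proportion of the locus total coming from each coding sequence
    """
    by_tss = defaultdict(dict)
    by_cds = defaultdict(float)
    for t in transcripts:
        by_tss[t["tss"]][t["id"]] = t["fpkm"]
        by_cds[t["cds"]] += t["fpkm"]
    splicing = {tss: proportions(isoforms) for tss, isoforms in by_tss.items()}
    promoter_usage = proportions({tss: sum(iso.values()) for tss, iso in by_tss.items()})
    cds_usage = proportions(dict(by_cds))
    return splicing, promoter_usage, cds_usage

# Hypothetical locus: A and B share a start site, C starts elsewhere.
locus = [
    {"id": "A", "tss": "TSS1", "cds": "P1", "fpkm": 30.0},
    {"id": "B", "tss": "TSS1", "cds": "P1", "fpkm": 10.0},
    {"id": "C", "tss": "TSS2", "cds": "P2", "fpkm": 60.0},
]

if __name__ == "__main__":
    splicing, promoter_usage, cds_usage = summarize_locus(locus)
    print("splicing within start sites:", splicing)
    print("start site usage:", promoter_usage)
    print("coding sequence usage:", cds_usage)
```

Computing these proportions in two conditions and asking whether they shift is the intuition behind the splicing, promoter, and coding-sequence outputs.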
And that's it, so short and sweet for the lecture. Again, we'll review this flowchart to see where we're at. So we've now gone through this entire thing. So this morning you did CummeRbund, you made it all the way to the end. What we're gonna do now is back up a bit to the Cufflinks step. And we're gonna rerun Cufflinks, so you're gonna get to review the Cufflinks command. But we're gonna use different options that are gonna allow us to run it in a way that helps us learn more information about splicing. So we're gonna rerun Cufflinks in two additional modes. And the mode that we already ran, we're gonna refer to as reference only. And by that I mean the reference transcript annotations. And now we're gonna run it in reference-guided and de novo mode to try to predict novel isoforms. So we talked a bit about this, and I just thought I would cover it because it's a common question. We didn't have time in this tutorial to go over the scenario where you don't have a reference genome at all. But I thought I would just talk about it briefly, because this is a common question both online and elsewhere: what if I don't have a reference genome for my species of interest? And the first thing I usually ask people when they ask me this question is, why don't you have a genome, and have you considered sequencing the genome of your species? There are legitimate reasons why you might not be able to do that: cost, the complexity of the genome, etc. But it is a really, really useful tool for studying the transcriptome to have a reference genome. Even a poor reference genome is better than none at all. And the cost of sequencing is low enough that fairly modest labs can start to think about taking on the task of sequencing the genome of their critter of interest. So it's something to think about.
But there are times where it's not practical to do that, or where you just simply prefer a transcript discovery approach that does not rely on prior knowledge of the genome or the transcriptome. And there are some tools that will help you in that scenario. We just don't have time to cover them. But if you look back at that complicated list of tools that I showed a few slides ago, there's a section that summarizes some of these tools for this particular scenario. And you can explore some of these on your own time. I particularly recommend Trinity and Trans-ABySS. They both have fairly good reputations. Okay.