 Okay, so this is the last lecture and just like the other modules, there'll be a brief lecture and then an introduction to the hands-on tutorial and we'll go through some hands-on exercises and then if there's some time left, we can go back to the integrated assignments. So isoform discovery and alternative expression, this is a huge topic so we're really just going to cover one aspect of it and that's the aspect of it that relates to the tool chain that we've been using which is the top hat cufflinks cuff diff paradigm. So the learning objective really of this module is to explore the use of cufflinks in sort of a slightly different mode. So we're going to go back and run cufflinks again with different parameters that are a little bit more optimized for alternative expression and isoform discovery. And just to remind ourselves where we're at, so we've been really focused on assembling and quantifying instances of transcripts in our RNA-seq data and what we're going to do now is go back and reanalyze the data but do it in such a way that we're going to be more able to quantify and capture cases where this mRNA gets made multiple different ways. So the depiction here is really simplified, you have a pre-MRNA transcript with two introns and three axons and those introns are removed and the axons are spliced together and this is being done only one way but in a lot of eukaryotic species there's multiple, multiple isoforms that are made from the same DNA template. And there's a huge field of people studying alternative splicing and even within RNA-seq bioinformatics there's a lot of people working on tools related to alternative expression analysis and this figure is taken from, believe it was a bio-archive paper and there was also a blog post about it. I'm not aware if this paper might have actually been published somewhere since then, we should check to see if we can update that but they basically did a review of existing tools and methods that use RNA-seq data to study transcript diversity and alternative expression patterns. This slide is also kind of a reference so here's some typical questions related to alternative splicing in RNA-seq and sort of interesting discussions that we've noticed that happened on bio-stars but each of these things generally relates to the attempt to understand some of the complexity of alternative transcript generation. So there's just to remind ourselves here's a depiction of some of the different forms of alternative splicing so in the very simplistic form you have simple transcription where you have a series of exons, introns are removed and the exons are spliced together to give us some mature mRNA and then there's all these alternate ways that can happen. So in this example you're using different transcript start sites or alternative transcript initiation so transcription either starts with this exon or it starts with the second exon and that gives you two transcripts that start at different places. Alternative splicing is sort of the most classic form of alternative expression where you're either including exon-2 or you're skipping exon-2 so again you get two isoforms of different their exon content, sort of similar pattern here where you're using alternative donor sites so in both resulting isoforms you have three exons but one of those exons has different boundaries you're getting a longer exon-2 in this case by extending the three-prime end of exon-2. You can have the same thing at the other end of exons by using alternative acceptor sites again giving you sometimes dramatically different isoforms or it can be quite a subtle difference and because RNA-seq is really has a base level resolution you have the potential to tell the difference between very subtly different alternative isoforms even though it's quite difficult. Mutually exclusive exons is kind of a special case of exon skipping where you have two isoforms being generated that both have three exons but two different exon-2s are being used so you get sort of two alternate paths that choose a different second exon the entire intron can be retained so as we've been looking at some data in the IGV screenshots you're often seeing reads that are aligning within the introns we have to oh interesting we have to consider sometimes the possibility that the reason we're seeing a lot of reads within an intron is that that intron may actually be retained so in this example you're getting an isoform that effectively has a very large exon-2 that combines all of the usual exon-2 and three plus the intron in between and usually this is a way of turning off a gene so when you include an intron it often will trigger nonsense mediate decay but sometimes you'll actually get different orfs resulting from the inclusion of small introns in different alternative isoforms and then just like alternative transcript initiation you can also have alternative polyadenylation where you have alternate exons at the three prime end of the transcript being used again giving you RNAs of slightly different lengths. So over the years there's been a lot of attempt in humans and other species to try to understand some of the transcript diversity that we see in the transcriptome of eukaryotes and there's been a number of sort of technological advancements over the last 15 to 20 years or so that have really allowed us to do a much better job at this so this is starting to get a bit out of date but this figure sort of shows some of the technologies over the years that have been applied so starting with and we can imagine a locus where we have some hypothetical fairly complex combination of isoforms that exist but we don't know what they are yet and they differ in which exons they include or introns that they retain or the boundaries of exons that they use and so forth probably what remains still really the gold standard way of investigating this diversity is to clone CDNAs and to do full length sequencing of those clones so you completely resolve the structure of a CDNA and then you can map that sequence, high quality sequence back to the reference genome and figure out exactly where the exon intron boundaries are. So there will be an attempt to try to tell whether there was actually splicing going on there or whether the intron was being retained and you can look at the coverage pattern and say okay normally we see really high coverage at exon 2 and then there's this intron and the coverage drops away and then the coverage goes up again for intron 3 but then when the intron is retained in some other sample we see sort of an even coverage all the way across from the beginning of exon 1 through the intron into the beginning of exon 2 through the intron to exon 3 you can use that pattern of coverage to infer that the intron is being retained and not skipped and you can also use the the junction splicing reads to try to get a sense of whether the intron was being spliced out or whether it's being retained. A lot of things yeah it's hard to do so depending on how clear the pattern is it might be difficult to say exactly what the transcript looks like unless you've actually like really sequenced all the way through it really well. I think it's one of the hardest categories in a way to detect exactly what's going on but you know if you have there are definitely cases where you have two exons and especially usually it's easiest to tell with smaller introns if you really see even coverage all the way across that intron into the surrounding exons you can get pretty you know pretty good evidence that the intron's being retained but yeah if you really want to know what the structure of a transcript looks like ideally you would sequence a full length cDNA or slightly less good but still pretty good is to to amplify if you have some notion of what the five prime sequence looks like and what the three prime sequence looks like you can amplify cDNAs based on that and then you can clone those amplicons and sequence those so you're kind of constraining your hypothesis to just looking at what the internal structure of an isoform is. So these these two methods are a really low throughput but there were some large-scale projects in human and mouse and other species that really tried to catalog a lot of transcript diversity across quite a few tissue types but it's extremely labor-intensive it's very expensive it takes forever. EST libraries were kind of the first one of the first attempts to make this a bit more high throughput so this is where you're still sequencing cDNA clones but you're generating them on a massive scale and you're using robots and then you're just sequencing the ends and you're really trying to automate this and then so for large and large transcripts you're missing out on the sequence in the middle but you're able to learn quite a lot of structural information by just sequencing in from the ends of cDNAs and it's quite a bit more high throughput than full-length cDNA sequencing is but not nearly as high throughput as some of the technologies that followed. So there are things like CAGE, SAGE and GIS which take an even more focused approach and just sequence the ends of transcripts so they can provide a lot of information about the five prime start sites of transcripts or the three prime polyadenylation sites or in the case of GIS you kind of get both at once sort of the beginning and end of the transcript and you don't know much about what's in between so it tells you a lot about transcript initiation and polyadenylation sites in the genome and then really what changed all of this was when sort of RNA-C came onto the scene and you started to have 454 and then SELEXA or now it's called Illumina so you have really really high throughput data for sequencing the transcript dome. The reads are small but you have a massive massive amount of them and this allows you to in one shot get a huge amount of information about the transcript dome in a single sample and this has really sort of revolutionized our ability to detect alternative isoforms and to measure their abundance although it remains a very technically challenging problem. So what we're going to talk about today a little bit is cufflinks attempt to look at this. So cufflinks doesn't you know tackle it doesn't solve all the world's problems but it kind of breaks this problem into three parts and that is to look at transcription start sites that was one of the categories that we looked at and it also looks at the CDS components of the part that's actually translated and sorry this is transcription start site this is the coding sequence part of transcripts and this is the splicing preference within transcripts so if you look at the color coding here we have this hypothetical example where we have three transcripts a blue a yellow and a red and they differ in slightly different ways so we've got two transcripts here that have the same transcription start site but one of them skips an exon relative to the other one so that we've got this exon being included in the blue transcript and it's being skipped in the yellow transcript and then we've got another set of transcripts that differ by their transcription start site so we've got two transcripts here that have that use this promoter site and then two transcript or one transcript here that uses a different transcript start site and then if we look at the CDS's of these the part that's actually an open reading frame again we have two but in this case the blue has sort of a long open reading frame and the yellow and the red share the same short open reading frame so cufflinks tries to basically look at each locus and break the transcripts into these different categories so it's it looks at all of the transcripts at a locus and says what are all of the promoters the transcription start sites that appear to be in play here and I'm going to bin the transcripts according to which transcription start site been used so in that case you're going to bin A and B together and compare it to C so because A and B use the same transcription start site and C uses a different one and then it does the same same thing with exon include it says okay which of these transcripts includes this exon two and which of them skip so it's going to compare A to B and then it says which of these transcripts have a particular open reading frame and which of them have another open reading frame and it bins them like that so it's going to compare A that has the long open reading frame to B and C that have the the shorter open reading frame which is shown here and each of these sort of tests or arms of the analysis is sort of divided into a different output file when you run cufflinks so you might have noticed in the the cufflinks output that you have all of these different output files so there's these categories here where in addition to the sort of gene differential expression analysis you also get a splicing differential expression files and that's the one that corresponds to these exons being skipped or included you've got the promoters files which correspond to this comparison of alternate transcript start sites so tss one versus tss two and then you've got these cbs files that compare the different transcripts with distinct open reading frames so really the tutorial is going to go back so we're going to circle back to the cufflinks transcript compilation step and we're going to rerun cufflinks and cuff diff in a slightly different mode that's more sort of attuned to detecting these kinds of events so i think we've already talked about these things so i'll just skip that all right any questions on that before we jump into the last set of slides okay so there's quite a lot of jargon in this section and one of the first things that we're going to talk about is the different modes of cufflinks which we refer to as reference only reference guided and de novo so you might have noticed as you've been running your through your cufflinks steps so far that you named the subdirectory ref only at one point we're basically going to go back and run very similar analyses and create like very similar output but in two additional sets for reference guided and to de novo modes and then we're going to run cufflinks again and we're going to combine our transcript elements from multiple cufflinks runs as we've done and we did this before we basically compared our transcripts to known transcripts but because we were really like biasing the results towards known transcripts in the first place we didn't really dig into the the result of that comparison very much because it was sort of expected what would happen but now that we're running we're going to run cufflinks in more of a sort of de novo or predictive mode we're going to compare the transcripts that are predicted back to our notion of what the known transcripts for human look like and we're going to look at how cufflinks classifies its predicted transcripts against the known transcripts so it's going to try to bin its predicted transcripts into different categories such as this transcript that I predicted looks exactly like one that we already know about and this other transcript that I predicted appears to be novel or different from what we currently know in the transcript dome and then we're also going to step back even further and go right to the top hat output so top hat actually gives you this junctions count file and this is much more focused on the counts for exon exon junctions but it can be quite revealing and useful for detecting alternative splicing patterns as well without even having to run cufflinks you can just really focus in on specific exon skipping events by looking at the read support for each of the exons being connected that's something that top hat spits out at at the end of the of its alignment process and then to try to make this a little bit more interpretable we're going to visualize some of these junction counts and the the different transcripts that get assembled by cufflinks in igb so we're going to get another opportunity to become even more familiar with igb as well so running cufflinks in these different modes in module four or sorry this should say module three stupid number in in module three we ran cufflinks in the ref reference only mode which basically gave us a sort of a one-to-one relationship between transcripts that we supplied in our transcript gtf file and transcripts that we had abundance estimates from four from cufflinks but now we're going to be able to handle or detect potentially novel genes and novel isoforms of known known genes so to accomplish this we're going to rerun cufflinks in the reference guided and de novo modes in the ref guided mode we're still going to use a known transcript dome but we're going to tell cufflinks to just use this as a guide and still try to predict novel transcripts and then in the de novo mode we're going to run cufflinks without a gtf at all so cufflinks will just be given the rna seek data with no prior knowledge about what exons are expected and what their connections are and the way this is all handled is by using different parameters when we run cufflinks and they all have this very confusing naming strategy so there's a bunch of options that use a either a small g or a big g and it's really easy to get them all mixed up so this is just a description of what they are that's kind of a reference so you have a big g that's used when you're actually doing the alignment so top hat itself has a an option to supply a gtf file and we did this so we when we were running our alignments we supplied a gtf file to top hat and it used that file to build a database of transcript sequences and to build a database of junction junction connections and it used that to help the alignment it's not trying to predict ice forms it's just trying to align reads against the transcriptome in the genome and this is how we tell it what the transcriptome is so it's purely to do with alignment and then top hat also has a small g option which has nothing to do with splicing really and that's used to tell top hat the maximum number of multiple mappings for a single reads the sort of tangential issue as to when you have a read that maps equally well to two or three or five or ten places this option basically says you know when do I stop making note of those if this read I think the default is 20 or 40 so it'll report multiple equally good alignments and then once it gets to this whatever number you put here or the default if you don't specify then it'll stop reporting all of the alignments for that that read that maps all over the place okay so we're not talking about we're not really going to be dealing with those what we are going to be doing is changing the the g options for cufflink so it has a big g option and this is what we use to supply a transcriptome gtf file and this is basically how we triggered those the reference only mode that we ran in module three it also has a small g option again you use it to supply the same transcriptome gtf file but this time it's you when you say use the small g you're telling cufflinks to use the this gtf just as a guide instead of as in the reference only mode and then if you don't specify either the big g or the little g so you don't give it a gtf file at all then that's what we're going to call it the nobo analysis mode and then cuff diff also requires a gtf file and then just to make it even more confusing they decided to specify this gtf file without using a big g or a little g it just you put it in a certain order when you're creating a command so even though these are all developed in the same lab if this is sort of remarkably obtuse way of laying things out one thing you can do to sort of make it a little bit easier is to each of these options has sort of a long version so you can spell out the long version then it becomes a little bit less confusing and in the online uh or the hands-on tutorial we tried to use the the long version that are a little bit more descriptive and not quite so confusing okay and then the top had junctions bed file so this is really a much more focused and simplistic way of looking at splicing data but it can be quite revealing nevertheless so right after we alignment top hat creates this summary of all of the reads that spanned across exon exon junctions so these are not read pairs that span across different parts of the transcript but individual reads were part of the read aligned to the edge of an axon and then the alignment for the rest of that read continued on in the next axon and there's an intro in between and that's called a junction read and at the end of doing its alignment top hat produces this bed file where it basically goes through and it says find all of the unique instances where there appears to be a read spanning across an intron that connects two axons and build that unique list and then give me the count for the number of reads that span spanned across each of those unique junctions so you get this output where it just names these junctions arbitrarily one two three four and so on and you'll often get tens of thousands if not a few hundred thousand in your transcript dome and then it just gives you the count so it says okay I saw this junction and there were three reads for that axon-exon connection and there was five reads for this one six reads for that one three reads for that one and so on and we're going to visualize the that output in igb by loading this junctions bed file and then we're also going to experiment with a plugin for igb that's called sashimi that which helps you produce like really nice visualizations of splicing patterns right from igb and the first the first of those things is shown here so we have this junctions bed file that's been loaded into igb just like we would load a bam file or we can load gtf files in lots of different files that you can visualize an igb and what we're showing here is a junction file from one of our data sets and it's being compared against the known transcript and each of these red arcs corresponds to one of those junctions in the previous bed file and the the sort of strength or darkness of the arc is proportional to the read count so how many reads supported each of these junctions and we're going to load one of these files along with the bam file and then you'll be able to see okay here's the reads that seem to span across these junctions and then here's the summary of the number of those reads that were identified for each of those unique junctions and this is actually a case where you can you can demonstrate the sort of potential for alternative splicing analysis just by thinking about these junctions so we have our gene model here with all of our little exons that are the sort of higher boxes connected by the sort of intron lines and the junctions actually indicate a novel exon that might be expressed so there might be a novel isoform that contains an exon that isn't part of this gene model can you can you guys see where that is in those junction arcs does anyone want to point it out you mean this this medium-sized thing and that one yeah there's they're just split that way so that they're not piling up on top of each other not necessarily they're just so they're just being that's a good question so they're just being spaced on onto three lines here so that they're not piling up on on top of each other but the where which of these three spots they're put in is kind of arbitrary and they're just put that way to sort of so that they're not piling on top of each other and actually this is the expanded views you know how in idv there's a lot of this expanded collapse squished different views the default will often be the the collapsed view where they all they are all just piling up on top of each other and when you look at them alongside a gene model like this it's really tempting to think oh they must all be connected because they line up so beautifully with this known transcript and they're all seem to be in order and they probably do come from that transcript but you don't actually know that for sure it's it's an inference based on how they correspond but yeah there looks like there looks to be a novel exon here so most of these arcs are really lining up with the edges of known exons here and we have one that's lining up here and it's connecting all the way over there but then we have these two arcs that seem to be going over here so there's this one that goes from the edge of that axon into this space here where we don't have a known exon and then there's this one that's coming from the exon over there and it's coming over here somewhere also known known exon and presumably those two things are coming to the left and to the right side of an unannotated exon that was not known to I guess this is probably the ref seek gene trap okay so cuff merge we're going to run cuff merge again and as before it's going to combine transcripts from multiple data sets into a single unified sort of union view of the transcriptome we're going to do this before running cuff diff and then at the same time we're going to supply it with a gtf file of our known transcripts so that it compares all of our predicted transcripts back to the the known transcriptome and we'll take a quick look at this file and it's going to assign a code to each of these predicted transcripts that tells you something about how novel or known that transcript is and the description of those class codes can be found in the the cufflinks manual looks like these these are these links all need to be updated as well the manual has changed locations and then we're going to load some of these things into igb and just do some comparisons of the gts from each of the cuff links modes and what i'm so what i'm showing here is just an example where we had a reference only mode that summarized these three transcripts that presumably correspond to known unsolved transcripts and then in the ref guide into novel mode we're getting a prediction of something being transcribed from this region that doesn't correspond to a non-solvable transcript and here's just another example again of the sort of complexity of some of the this output so in this case we had the reference only mode and we had one known transcript so we got basically one abundance estimate and one transcript assembly for the reference transcript and then in the novel mode we get all kinds of stuff so this sort of speaks to that the comment about intron retention where we have a small intron retention being predicted here presumably because cufflinks okay there's a real spread of coverage across this intron that seems like it might be connecting these two exons so maybe this exon actually consists of the the regular edges of these two exons plus the intron being retained and we're seeing different three prime ends and exons being skipped or included they were getting mapped so we're not we're not going to change the alignments so if yeah it's kind of crazy right like we're fundamentally dealing with the same reads mapped in the same ways to the same places we're basically just interpreting those those alignments in increasingly speculative ways so in the in the way that we already did we were basically saying you know try to interpret all of these reads in the context of exon transcript structures that we already know about and so in that scenario a lot of these reads that map within introns or in spaces where there's no known transcript are just basically ignored they're sort of put it put aside and sort of like they don't contribute to any of the the abundance estimates because they don't map up match up with one of our prior expectations about a transcript in the de novo mode it's it's much more of an attempt to explain every read by a transcript and there's a compromise there that means you know it's kind of satisfying in a sense to say well if we have all these rnac reads they're they're all supposed to be resulting from transcription events so let's try to assign each of them to a predicted transcription event but doing that it becomes much noisier so you wind up with a lot of spurious likely false positive transcript structures and I think the reference guided mode is kind of a compromise between those two extremes and the data set that we use is so kind of optimized for the tutorial and massage that the difference is actually quite subtle which is a bit misleading if you just if you hadn't sort of purified the data for things that align quickly you would get a much bigger difference between the reference only and the de novo mode and the de novo mode tends to be quite wild like you get large numbers of transcripts being predicted for almost every locus and a lot of them are probably not real they're just there was just enough noise in that one intron that it thought okay maybe there's a transcript that includes that intron i'm gonna take a guess that's what happened um and transcription is just so noisy that you wind up with a lot of noise predictions but you also have the potential to discover a novel transcript or a novel gene and it it's what the authors recommend but you never quite know i mean they're obviously very proud of this aspect of the analysis so i think that a lot of people have been a little bit more skeptical about how noisy it is and it's just a really hard problem there's no way around it you might be underestimating the total transcriptional output from a locus because you're making an assumption that maybe that the that the transcriptional like the transcript diversity from that locus is really as simple as what you claim it to be in your gtf file so if your if the annotations in your gtf file are an oversimplification then you will tend to yeah fail to account for reads that really do belong to a transcript from that locus which means you will tend to underestimate the the overall expression output from that locus so the flip side of that is that you if you start predicting all kinds of different transcripts you're you're kind of dividing up the abundance amongst more things you're saying oh and now i believe that the cuff links are saying now i believe there are 10 transcripts here but each of them is an independent thing so i can only distribute the total expression output amongst them so that tends to bring down the abundance of each of them individually and if some of them were just false predictions then you're kind of spreading around the expression output to things that aren't real that's here you may be underestimating the abundance of a real transcript in order to to give some weight to a transcript that's false so you would definitely not want to be like comparing the output from cuff links in in one mode for one sample to the output from cuff links in a different mode in it for a different sample so like whatever you do you want to do it consistently across your samples so when our automated pipeline at mgi we run all three of these modes automatically and sort of for the first pass i just want to know is egfr expressed kind of thing we use the reference only because it tends to be robust to the kind of noisiness of various mad predictions that cuff links makes but then we have the other two more sort of speculative and exploratory results on hand if we want to go and look to see whether there's some interesting splicing event going on or whether there's a novel gene or whatever okay and then i think we've already talked about this what do you do when you can't get this to work on your own data so quite a few of you have been playing around with either other example data and integrated assignments or in a few cases even your own data which has been you know really impressive in some cases you probably maybe don't have the data on hand or you weren't able to get a down sampled version of it or that was practical to put on a laptop so inevitably you're going to go back and try to you know apply this to your own data and there'll be some hiccups and so there's a number of avenues to pursue obviously review all of the materials in this course for clues as to what might be going wrong there is this really nice nature protocols tutorial that the authors of cuff links wrote so it's sort of an alternative story to the one that we tell here but that's very very similar biostars cc answers and google of course and definitely biostars to ask any questions that you haven't asked and if you you know if you don't really hear back or you really think that your question is important and it hasn't been answered then you know you can reach out to us and we'll we'll try to do our best to to find time for that and then so the actually the motivation for creating that that table the question answer table that i showed you that will be that's already available on the wiki it's part of the the manuscript was really a response to this the cuff links offers authors produce this table of things that might go wrong it's a troubleshooting table for for cuff links and they have this comically short list of only four things that they could imagine going wrong which are like drawn seemingly randomly from the probably a hundred things that we've seen go wrong so that yeah the materials that we provide or attempt to try to you know flesh this out a bit more and catch more of the edge cases things that sort of typically causing you to run amok