Alright, you guys ready for module four? Last one. Alright, so in this one we're going to talk about isoform discovery and alternative expression. It kind of builds naturally on a lot of the stuff that you've already been doing with cufflinks. Mostly what we're going to be doing is rerunning some of the types of commands you were already running in cufflinks, but with slightly different options that are more geared towards isoform quantification and discovery. So some of the learning objectives of the tutorial that goes along with this lecture are, actually, sorry, I've got the wrong presentation here. Basically we're going to use cufflinks in what's sometimes called reference annotation based transcript assembly mode, or RABT mode, and also in a de novo assembly mode. So far we've been running cufflinks with a GTF file that provides an idea of what the transcriptome already looks like, and we've been basically asking cufflinks to really take that model of the transcriptome very seriously and to estimate expression for the transcripts that are described in the GTF file that we specify. Now we're going to run cufflinks in two additional modes. One where we tell cufflinks to basically just use that transcriptome as a loose guide, so to try to use the information there but not be totally beholden to it. And another mode where we don't tell it anything about the transcriptome. So we could be studying a species where we didn't even have any transcriptome annotations at all. And it's going to just try to assemble the transcriptome that it thinks is expressed in your RNA-seq sample without any prior knowledge of what the transcripts look like. So this is a slide, just a review from the first presentation. 
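As a rough sketch of the three modes just described, the snippet below builds the cufflinks command line for each one. It assumes the cufflinks options `-G` (quantify against the supplied GTF only), `-g` (use the GTF as a guide), `-o` (output directory) and `-p` (threads). The helper name `cufflinks_command` and the file names are made up for illustration, and nothing is actually executed.

```python
def cufflinks_command(bam, out_dir, mode, gtf=None, threads=1):
    """Build (but do not run) a cufflinks command for one of three modes."""
    cmd = ["cufflinks", "-p", str(threads), "-o", out_dir]
    if mode == "reference_only":
        flag = "-G"   # estimate expression only for transcripts in the GTF
    elif mode == "reference_guided":
        flag = "-g"   # use the GTF as a loose guide during assembly
    elif mode == "de_novo":
        flag = None   # no annotation supplied at all
    else:
        raise ValueError(f"unknown mode: {mode}")
    if flag is not None:
        if gtf is None:
            raise ValueError(f"{mode} mode needs a GTF file")
        cmd += [flag, gtf]
    cmd.append(bam)
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for mode in ("reference_only", "reference_guided", "de_novo"):
        gtf = None if mode == "de_novo" else "genes.gtf"
        print(" ".join(cufflinks_command("sample.bam", mode + "_out", mode, gtf)))
```

The only difference between the three commands is whether the GTF is passed with `-G`, with `-g`, or not at all.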
So remember we talked through this idea of the central dogma where we have information flowing from double-stranded genomic DNA template to a single-stranded pre-mRNA molecule where the introns are in place between exons, and then the splicing machinery comes along and removes the introns and assembles the exons. Again, this is mostly applicable to eukaryotes. And then this thing gets exported from the nucleus to the cytoplasm where it gets translated into a protein sequence that is then folded and various post-translational modifications occur. So what we're going to be really thinking about now is the way these mature mRNAs are structured and how there can be different versions, or isoforms, being expressed from each locus. And those versions come about by differences in the way splicing happens. You have a splicing machinery that comes along, it recognizes signals on the pre-mRNA molecule and it removes the introns. But for basically every human gene there are multiple ways that that happens, and there's a very complex regulatory system that controls exactly which isoforms are expressed from which loci. And this is a really active and deep area of research in various species, especially the mammals. So this slide is really just to illustrate that there's a lot of complexity to the analysis of RNA-seq data with respect to splicing. So I got this slide from the RNA-seq blog which, if you haven't checked it out already, I highly recommend. This blog does a pretty good job of keeping on top of the latest developments in RNA-seq technology and analysis. And there'll often be reviews of new tools that have come out or collections of different tools for different types of analyses. And this is actually from a sort of pre-proof manuscript that the author of this blog seemed to be writing, where he's listed a variety of tools for different types of RNA-seq analysis starting with mapping. 
So there's various mappers, tools that are used for reconstruction of isoforms and then quantification of isoforms, and then ultimately comparison of different conditions, or differential expression analysis. And there's lots of things that kind of fit into multiple categories or span across those boundaries. The next slide is also kind of just meant as a resource. So it's a list of some useful resources and discussion that I found over the last few years, starting with discussion of best approaches for predicting novel and alternative splicing events from RNA-seq data. So there's a couple of Biostars posts. So we've mentioned this forum a couple times. Again, if you haven't signed up for a Biostars account, you should do that, and in the future, when you go home from this tutorial and you have questions about RNA-seq, that would be the first place I would check for the topic area of your question. Maybe spend 5 or 10 or 15 minutes doing that. If you don't see a question that seems relevant, or the questions there don't have the answers that you need, then go ahead and ask the question that you have there. There's a pretty active community of bioinformatics folks that are involved in Biostars and you'll often get some really useful feedback and answers to questions. A couple more posts that are really relevant to this topic on alternative splicing detection are listed here. There's an interesting post on identifying genes that express different isoforms in cancer versus normal RNA-seq. Some discussion of how cufflinks and cuffdiff differ in what they're doing, and then a discussion of tools that you might use for visualizing alternative splicing events from RNA-seq data. To help us think about both the types of events that we're going to be looking at and the way to visualize them, there's a couple slides here with sort of cartoon depictions of the types of alternative expression that occur. 
So the first example at the top here is just simple transcription. This is just a recapitulation of what I showed on the central dogma slide, showing a simple gene with three exons and two introns, and it's being spliced into what is called the canonical isoform, which is usually a reference to the most common or representative isoform that's expressed from a particular locus. And then there are a variety of other categories in which transcription can happen differently. So for example, alternative transcript initiation is where the transcriptional complex sits down in different places and starts to transcribe at one position but may also do it at another position. So for example, we have two transcription start sites being depicted here and these could lead to two isoforms that differ by having basically a different first exon. So we get one transcript with three exons and then another transcript that has the same exons except it starts at the second exon and does not use the first exon. Alternative splicing is sort of a loose term to describe stuff that happens sort of in the middle of the transcript, and there's a variety of subtypes of alternative splicing. So something that's called cassette exon skipping is where you have two isoforms that differ by the exons that they include. So in the first example we have our canonical isoform with exons 1, 2 and 3 and then we have an alternative transcript where exon 2 has been skipped. You can also have alternative 5' splice site usage, and in this case the splice site is referring to the 5' end of the intron. So you've got two alternative ends to this exon basically. So basically what's happening is the splicing machinery is coming along and it's deciding to use this donor site or this donor site, and this gives you slightly different mature mRNA transcripts, in this case with a longer or shorter exon 2. 
Similarly you can have alternative 3' splice sites, so these are where you have an alternate acceptor site being used, and again this gives you isoforms that differ slightly in their length. In this case you've got a short exon 3 being used and a longer exon 3 being used in the alternative isoform. Mutually exclusive exon usage is very similar to the cassette exon usage, except in this case you're going to produce two isoforms, each with three exons being used, but the center or second exon used is different between the two of them, so you have sort of an exon 2a and an exon 2b. And then finally you can have intron retention, where you have the basic scenario where you have exon 1, 2 and 3 and then another scenario where instead of exon 1, 2, 3 you have exon 1 spliced to 2 and then the entire intron is included and it continues on to exon 3, so effectively you just have a really long exon 2 here. So you're saying you have basically an alternative gene inside another gene? Yeah, so it is possible that that kind of thing can happen. If you have stranded information, often these kinds of events are at least happening on the alternate strand, which is helpful. If things like that are happening on the same strand it can be very confusing to deconvolute what's going on. The good news is that those things often tend to be sort of single exon events that are within an intron, so there won't be evidence for them connected to the exons on either side, so you can use that information to try to tell which of them kind of go with the 5 or 10 exon gene that's being expressed and which of them are maybe a separate transcriptional unit within the intron. 
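To make the event categories from the cartoons concrete, here is a minimal sketch that compares an alternative isoform against a canonical one and names the kind of difference. It is an illustration under simplifying assumptions, not how cufflinks or any other tool does it: exons are sorted `(start, end)` pairs on the plus strand, the function names are invented, and mutually exclusive exons are not handled (they would be reported as a mix of other events).

```python
def introns(exons):
    """Introns implied by a sorted list of (start, end) exons."""
    return [(left[1], right[0]) for left, right in zip(exons, exons[1:])]


def classify(canonical, alternative):
    """Name how `alternative` differs from `canonical` (plus strand only)."""
    events = set()
    known = set(introns(canonical))
    # A different first exon start suggests a different transcription start site.
    if alternative[0][0] != canonical[0][0]:
        events.add("alternative transcript initiation")
    for start, end in introns(alternative):
        if (start, end) in known:
            continue
        if any(start < e_start and e_end < end for e_start, e_end in canonical):
            # The new intron swallows a whole canonical exon.
            events.add("cassette exon skipping")
        elif any(end == k_end for _, k_end in known):
            # Same acceptor, different donor (the 5' end of the intron).
            events.add("alternative 5' splice site")
        elif any(start == k_start for k_start, _ in known):
            # Same donor, different acceptor (the 3' end of the intron).
            events.add("alternative 3' splice site")
    for k_start, k_end in known:
        # A canonical intron sitting entirely inside an alternative exon.
        if any(e_start <= k_start and k_end <= e_end for e_start, e_end in alternative):
            events.add("intron retention")
    return sorted(events)


# Toy coordinates for a three-exon gene, invented for illustration.
CANONICAL = [(0, 100), (200, 300), (400, 500)]
print(classify(CANONICAL, [(0, 100), (400, 500)]))  # exon 2 skipped
print(classify(CANONICAL, [(0, 100), (200, 500)]))  # second intron retained
```

With real data the exon coordinates would come from a GTF file, and minus-strand genes would need the donor and acceptor roles swapped.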
But yes, it's possible that you could have sort of confusing situations like that where you inadvertently assemble a sort of chimeric or Frankenstein isoform that isn't really real, and that kind of all comes back to this sort of caveat or warning that you have to always remember that we're inferring a lot of things here, and sometimes you can put the pieces of the puzzle together wrong and it looks reasonable but you actually have a misrepresentation, and hopefully you would be able to sort that out in some kind of downstream validation before you went too crazy doing functional work on this thing that you think exists but is really kind of a misunderstanding of the data. Any other questions? So this slide is just kind of a quick summary of some of the history of sequencing methods for studying alternative isoforms. This has really been an area of development over the years. People have been interested in understanding the structure of mRNAs in human and other species for many many years. Once the genome was sequenced we had, in human, this amazing reference sequence to work with, but then going from that to what transcripts are actually expressed and what their exact structures are was also a big task. Initially a lot of the heavy lifting was done by full length cDNA sequencing, and this is really the gold standard for resolving the structure of mRNAs. So if you can isolate a full length cDNA, clone it, and then sequence that entire cDNA across the entire insert sequence, you really can tell with exquisite accuracy what the exact structure of a messenger RNA is by then mapping that complete sequence back to the reference genome and seeing where the exons and introns wind up. Unfortunately this is still a really low throughput activity, but if we could just somehow magically sequence full length cDNAs we would probably not do RNA-seq the way we do it. 
If we didn't have to fragment our cDNA into all these little pieces and then kind of shotgun sequence them and try to put the pieces back together after the fact, the analysis would be way way simpler. So one of the sort of dreams of the sort of next next gen sequencing is nanopore sequencing, where one could imagine feeding RNAs through a pore and reading the sequence off as those RNAs are fed through that pore, and not having to fragment the RNAs, just read the sequence off from the beginning to the end, and we would do that millions and millions of times and we would get both quantification and the complete structure of every RNA that was in your sample. But we're just definitely not there right now, and we've kind of gone through these stages of low throughput full length cDNA sequencing, and then, to try to get at cDNAs that were missing there, there was sort of targeted sequencing, designing primers to amplify, and then we sequenced many products from those amplifications to try to identify alternative splice patterns. A lot of EST sequencing was done, where we generated cDNA clones in a high throughput fashion and then just sequenced the ends of those in a fairly roboticized way and were able to accrue a lot of data. And these top three things are really where the bulk of the annotation of the transcriptome that we're using today comes from. Most of the transcripts that you're seeing in Ensembl or RefSeq came from these projects where teams of people running robots were sequencing full length cDNAs and ESTs. And that data quality is really high, but it's incredibly slow and expensive to produce on even one sample or a series of samples, so it's very difficult to do functional biology with those kinds of platforms because we just can't afford to go back and do it for our two conditions of interest or our drug treated versus untreated or all of the tissues of an individual or a particular tumor or whatever. 
So then we moved into a stage where there were much more cost effective and high throughput methods for getting at this kind of information. So there was a variety of really small tag sequence based approaches where we have some kind of enzymatic approach that gathers little pieces of RNAs, concatenates them together and then allows us to sequence dozens or hundreds of these little tags at once. So some of these are SAGE, CAGE and GIS, and those basically differ in terms of whether they go after the three prime end of transcripts or capture the five prime end of transcripts or both the five prime and three prime ends of transcripts. And those methods are really good for looking at alternative transcript initiation and alternative polyadenylation sites. And then shortly after those technologies became well established, the sort of next gen sequencing instruments arrived on the scene. First there was 454 and then there was Solexa, which is now called Illumina, and now there's also Ion Torrent and there are other platforms being developed, and these really took it to the next level in terms of data throughput. We were able to basically shotgun sequence all of the RNA in a particular sample and produce an amount of data that completely dwarfed the scale of the data production that we could do just a few years before. The only downside really is the size of the molecules that we're sequencing, so we're still limited to these paired 100-mers and we're gradually increasing that size, and some people who are really interested in alternative splicing will use a different sequencing strategy where they sequence single-end 300-mers or they try to do paired 250-mers, so pushing the limits of the Illumina platform to get longer reads so that you can do a better job of resolving where the exon and intron boundaries are and you don't have to do so much inference, so much piecing all these little pieces of the puzzle together, to get your full-length isoform prediction. 
And it's this kind of data that cufflinks is really meant for. So it was really designed with these short sequences in mind, and all of this sort of mathematical wizardry of it is about trying to make these inferences and piece together all these pieces and try to predict what the full-length isoforms really look like. And there's a simple depiction here, which is actually a very simple scenario relative to what is happening in most human genes, but even this very simplified sort of toy example gets very complicated when you start to really think about what's going on. So what's shown at the top here, this is from the cufflinks manuscript, are three hypothetical transcripts. The first two share the same transcription start site and the third one has a different transcription start site, and then they also differ in different ways. So two of them share the same 3 prime exon, the first one has this distinct exon that's being skipped in the second one, and so forth. So there's all these little subtle differences between these three, but they also share certain features. And I don't know if we've talked about this before, but this is a very common depiction where you have a transcript that's drawn with sort of a narrow rectangle and then a wider rectangle, and usually what that's indicating is the portion of the transcript that is coding, so the portion that becomes translated into protein, and then the narrow part is, in this case, the 5 prime UTR, then the coding portion, and then the 3 prime UTR at the right here. So you can also think about, in each of these isoforms, what would the open reading frame or protein coding portion look like. So you can see that B and C share the same short open reading frame here and A has a different coding sequence, A and B share the same promoter sequence and C has a different one, and so forth. So you've got all these different ways you can think about how these isoforms are different or similar to each other, and the first thing cufflinks does is try to 
estimate the relative abundance of each of these 3 isoforms, and then in this sort of splicing analysis component of cufflinks it tries to break down the isoforms into these different categories. So for example it will compare splicing within a transcription start site group, so it divides your isoforms into those that have transcription start site 1 and then it looks just within those at how well expressed they are. So we have 2 isoforms, the blue and the yellow, that share the same transcription start site, so we can ask, for that transcription start site, how does the expression of those 2 isoforms differ, and then when we're doing differential analysis between our 2 conditions we can look at the ratio between those 2 isoforms that use the same transcription start site in condition A versus condition B. Similarly we can now look at differential promoter usage, so in this case binning the isoforms that share the same transcription start site. So A and B both use this transcription start site and C uses the second transcription start site, so basically kind of pooling the data from A and B and comparing them to C, and again in our differential scenario we would look at the ratio of use of transcription start site 1 to transcription start site 2 in our first condition and our second condition. And then finally you can do a similar kind of thing with the coding sequence, so maybe what we really care about is the expression of isoforms that have a particular predicted open reading frame or coding sequence. So in that case we would bin isoforms B and C together because they share the same protein coding section and A has a different protein coding section, so in that case we would compare A to B and C, and in our differential analysis we would look at the ratio of B plus C to A in our first condition and our second condition. So that's like a very bewildering amount of numbers and letters, but all of this ties back to the output that we're going to get from 
cufflinks when we run it in the splicing mode, where we look at the files that are really aimed at investigating splicing. So for this first scenario, the output that corresponds to these kinds of comparisons is going to be in this file called splicing.diff, and then when we're looking at the usage of different transcription start sites, that output will be in the promoters.diff, and then when we're really focusing on the differential expression of coding sequences, that will be in this cds.diff file. Any questions on that? So we've been gradually working our way through this flow chart of the different steps. What we're going to do now is actually kind of circle back to what we've done before. We're going to go back to running cufflinks but we're going to change up the options. So cufflinks has many many options and you can run it in different modes with different goals in mind, with sort of multiple tools built into one, but we're going to do a sort of similar pattern where we do a transcript assembly with cufflinks, and then we're going to merge our transcripts together and compare them to known transcripts, and then we're going to use cuffdiff, except instead of just doing simple differential gene expression we're going to do differential splicing analysis or alternative expression analysis. So before we go on, I just had this slide that's sort of randomly thrown in here. I thought we would just address this because this is a question that has come up a few times already here, and it comes up every time we talk to any crowd of people doing RNA-seq analysis, and that's the question of what do I do if I don't have a reference genome for my species. And the bottom line is that we don't really have time to get into a lot of the details of that, you really have to take different analytical approaches, but one of the things that I often ask people to consider is why don't they have a reference genome, and in some cases, are you sure you don't have a reference genome, and the purpose of 
that question is just to sort of explore the idea that in some cases genome sequencing has actually become quite cheap, and if you're really struggling to do RNA-seq analysis without a reference genome, it actually might not be that hard to generate the reference genome for your bizarre critter that only you and five other people in the world are studying. The whole genome sequencing and assembly analysis methods have really come a long way, and it may actually be cost effective and a reasonable place to start to actually try to build your own reference genome. Sort of a related piece to that is that actually a bad reference genome is better than none at all, and there's nothing stopping you from doing a reference free analysis, but if you can quickly and relatively cheaply produce some kind of reference genome by sequencing the DNA of your species you should think about doing that. Now there's definitely some legitimate reasons why you may not already have a reference genome and why making it might be impractical. Some of the common reasons that we hear are the genome of my species is too large or too complex, so this is something that the plant people often have to deal with, where they have these just truly massive genomes that are hexaploid or worse, and it really probably would be quite computationally expensive and difficult to produce even a bad reference genome. Another place where you see this is metagenomics, so where you don't actually know what is in your sample, there are multiple species and they're all mixed together in the same sample. So there isn't really the simple concept of oh, I have a human cell line, therefore I'm going to compare to the human reference genome sequence, or I know it's a mouse because I can see the mouse in the cage and then I kill the mouse and I get its DNA. It's a very different situation from I isolated RNA from a gut sample from someone after they ate some Mexican food or something, whatever your experiment is, so 
it might not be practical. But basically one of the answers to what you do if you don't have a reference genome is to do transcriptome assembly, which will work without having a reference genome. The bad news is that de novo transcriptome assembly is beyond the scope of this workshop, and one of the reasons for that is it's fairly complex and the tools that do this are quite elaborate, but the good news is there's a lot of commonalities between running one bioinformatics tool suite and another, so a lot of the sort of basic skills that you learn here would be applicable to installing a transcriptome assembler and gathering input files and running a series of commands and then filtering the output, dealing with the output files and so forth. If you go to this link you'll probably have an updated version of this even by now. There's a couple de novo genome based assemblers there and a variety of other de novo transcriptome assembly programs. The ones that I'm most familiar with are Trans-ABySS and Trinity, and I've heard very good things about both of those. I've been involved in some projects that involve Trans-ABySS in Vancouver and they yielded very good results. The thing that I've heard about Trans-ABySS that really detracts from it is that it's difficult to install, difficult to run, there seems to be some black magic to getting it to run, and I haven't seen a lot of success stories for people outside of the group that developed it actually using it, which is obviously not good. Trinity, though, I have seen a lot more use of just in the community, so people just picking it up and using it, and there's a course in Cold Spring Harbor in the fall that has an RNA-seq component where they walk you through the process of doing analysis of RNA-seq data with Trinity. I've seen that a few times and it seemed to go fairly smoothly, so I guess if I had to recommend one tool that you might check out first, I guess that would be my recommendation. This is developed at the Broad, Trinity. OK, so as usual we're 
going to now switch over to a few slides about the tutorial itself and then we're going to dive into the tutorial, and that will be the last tutorial of the day. And we're doing pretty good for time so we may even be able to end a little bit early and then be able to spend some time talking to each of you about particular questions you had or challenges you have with your own experiments and so forth, and do the survey. If you want to spend a lot of time really thinking hard about the survey, you can do that too. So the learning objectives of this tutorial. We're going to run cufflinks again. We've already learned how to run it in what we call reference only mode, and now we're going to extend our use of it to the reference guided mode and to the de novo mode. So remember, reference guided is where we use our GTF file to guide cufflinks instead of really forcing it to consider those transcripts and no others, and then de novo mode is where we don't tell it anything about what the transcripts look like, we just tell cufflinks to tell us what the transcripts look like without any cheating or hints, and that's really making quite a demand of cufflinks. Then we're going to learn how to use cuffmerge to combine transcriptomes from multiple cufflinks runs. So we're going to run cufflinks on each of our samples, our two normal replicates, in quotations, and our two tumor replicates, and then we're going to combine the transcriptomes from all of those four runs to give us sort of a common reference point so that we can then go back and develop a unified set of expression estimates across the four samples. Then we're going to learn how to perform differential splicing analysis with cuffdiff, so this is basically going to involve running cuffdiff on the output from cufflinks where cufflinks was being run in sort of splicing aware or splicing tailored modes. We're also going to go back actually a little bit further and take a closer look at the TopHat junctions counts file and also 
dig into some of the cufflinks differential splicing files at the command line. So this TopHat junctions file actually can be quite revealing. I mentioned it earlier, I think in the first lecture, where you can use summaries of how many reads span across junctions to gather a sense of the quality of your library. Another thing that you can do is just take this junctions file that comes out of TopHat and look for particular interesting splicing events. So this is a much more segmented or focused analysis where cufflinks hasn't been run yet. All you've done is align your reads to the genome, and some of them have spanned across introns that represent exon-exon junctions, and this junctions file is automatically produced to summarize how many reads span across each junction. So you basically get a readout of every connection between two exons that was observed in the transcriptome that you sequenced in your RNA-seq experiment, and you can use that to identify interesting splicing events. You don't necessarily know what the whole transcript looks like, but you can still use it to find candidate alternative splicing events, and then once you find those candidates you can sort of back up and say what might this isoform look like. And it's sort of a very quick and focused way to look at the splicing output in your RNA-seq experiment without even having to run cufflinks, and you can visualize this file in IGV and it can be quite revealing just on its own. And that's what we're going to do, so we're going to visualize this TopHat junction count file in IGV and also the transcripts that come out of cufflinks when we run in reference guided and de novo mode. So the first thing we're going to do is start rerunning cufflinks in ref guided and de novo mode. As I said, in module 3 we were running cufflinks in this reference only mode, and the nice thing about the reference only mode is that it gives you kind of a simple output. You get one estimate for every transcript that you fed into the 
tool, which is kind of nice. You get this sort of microarray style output. I know I have 25,000 transcripts and I've got these 10 samples and I just want a readout that has 10 columns, one for each sample, and it's got 25,000 rows, one for each known isoform, and you can really quickly get into differential expression analysis without having to do a lot of complicated downstream processing of files. So it gives you this really microarray style output, but that's really an underutilization of the richness of RNA-seq data, so we're going to try to do a better job of taking advantage of that richness now. So in order to do that we're going to play around with these options, so we're going to talk a lot more about these dash G, dash little g and dash big G options in TopHat and cufflinks and cuffdiff, and it gets a bit confusing because, you know, the same letter keeps getting used over again in slightly different but related contexts. So just to review them now, and then we'll do this again when we're actually running the commands. So TopHat has a dash big G or GTF option, and this, remember, is used to supply the transcriptome GTF file during the alignment. So we're not trying to assemble transcripts, but we're trying to help the aligner do the best job it can figuring out where every read should go in the reference genome, and we're allowing it to use the transcriptome. Independent of what you do in the downstream step with cufflinks, you might decide to use the GTF file when you're doing your alignment or not, and then as an independent choice, when you're running cufflinks you might decide to use the GTF file to basically force cufflinks to give you estimates for those transcripts, or use it as a guide, or not use it in the de novo mode, or some combination of these things. So just to maximize the confusion, TopHat has both a big G and a small g, and in this case the small g is really an unrelated option that's used to specify the maximum number of multiple mappings for a 
single read. I don't know why they chose g for that, they probably should have chosen something else, but that option we're not really doing anything with. But we did use the big G to give the GTF file to TopHat and to tell it to use that information. Now cufflinks has a big G option and this is used again to supply our transcriptome GTF, and if you specify dash big G with a transcriptome GTF file, cufflinks is going to quantitate against those reference transcript annotations, and that's what we've been calling the reference only mode. That's what we did already, so big G is what we did in module 3. Now we switch to using the little g option. Again you supply the same transcriptome GTF file that we've been using, but this time it's going to use that to guide the assembly rather than really being strict about what transcripts are being used, and this is what we're going to call the reference guided analysis mode. And then if you run cufflinks without the big G or the little g, this is what we're going to call de novo analysis mode, so there's no GTF file even being specified. And then finally cuffdiff, it requires a GTF file but it is not specified with a big G or a little g option, it's simply supplied as a path when you're constructing your cuffdiff command. So I realize that's very confusing. We're going to go over it again when we're running each of these commands. We're going to go over this TopHat junctions bed file as well, but just to give you a brief introduction of what I mean by junctions. So after our alignments, TopHat creates a summary of all the reads that happen to have supported exon-exon junctions or spanned across an intron. So for example you might get a readout that says, for a particular gene, that a connection of exon one to two has five reads and a connection of exon one to three has nine reads, so this exon one to three is sort of implying that exon two was skipped, so this is sort of an exon skipping junction, and this file just has a very simple format 
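A small sketch of reading such a junctions file and spotting the exon one to two versus exon one to three pattern is shown here. It assumes the usual 12-column BED layout with the read count in the fifth (score) column, and it assumes that the two BED blocks are the flanking read overhangs, so the intron itself runs from `start + first block size` to `end - second block size`. The toy rows and the function names are invented for illustration; the counts of 5 and 9 are the ones from the example.

```python
from collections import defaultdict


def parse_junctions(lines):
    """Turn junction BED lines into (chrom, strand, intron_start, intron_end, reads)."""
    junctions = []
    for line in lines:
        if not line.strip() or line.startswith("track"):
            continue  # skip blank lines and the BED track header
        f = line.rstrip("\n").split("\t")
        chrom, start, end, reads, strand = f[0], int(f[1]), int(f[2]), int(f[4]), f[5]
        # Assumption: column 11 holds the sizes of the two flanking blocks.
        left_flank, right_flank = (int(x) for x in f[10].strip(",").split(","))
        junctions.append((chrom, strand, start + left_flank, end - right_flank, reads))
    return junctions


def shared_start_junctions(junctions):
    """Group junctions by intron start; keep starts that connect to more than one end."""
    by_start = defaultdict(list)
    for chrom, strand, intron_start, intron_end, reads in junctions:
        by_start[(chrom, strand, intron_start)].append((intron_end, reads))
    return {key: sorted(ends) for key, ends in by_start.items() if len(ends) > 1}


# Toy rows: exon 1 ends at 100, exon 2 is 200-300, exon 3 starts at 400.
ROWS = [
    ["chr1", "80", "225", "JUNC1", "5", "+", "80", "225", "255,0,0", "2", "20,25", "0,120"],
    ["chr1", "70", "430", "JUNC2", "9", "+", "70", "430", "255,0,0", "2", "30,30", "0,330"],
    ["chr1", "275", "428", "JUNC3", "7", "+", "275", "428", "255,0,0", "2", "25,28", "0,125"],
]
LINES = ["track name=junctions"] + ["\t".join(row) for row in ROWS]

for key, ends in shared_start_junctions(parse_junctions(LINES)).items():
    print(key, ends)
```

An intron start that connects to two different ends, like position 100 here, is a candidate for exon skipping or an alternative acceptor site and is worth a look in a genome browser.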
and it reports all of the unique exon-exon junctions, so every line in it is a unique coordinate combination, and the fifth column simply contains the junction read count: the number of reads that supported that exon-exon junction. And this is what it looks like if we view this file in IGV. You get these little red arcs that span across exon-exon junctions. So for example the one that's being shown here is going from exon one to exon two, that's what this arc is here, and then two to three, and three to four, and so forth. And the darkness or thickness of the arc is representative of how many reads supported that junction. You could also view the individual reads by loading your BAM file, and you could see all of the individual reads that supported this junction, and you could start to correlate how an increasing number of those gives you a fatter arc and fewer of them gives you a thinner arc. A quick introduction to cuffmerge. I think we've already used cuffmerge a few times, but basically the idea is that it combines transcripts predicted from multiple RNA-seq data sets into one view of the transcriptome. We run this before running cuffdiff so that we can compare across multiple conditions and have a unified reference set to compare to. You can also ask cuffmerge to simultaneously compare transcripts to our known transcript GTF file from Ensembl. So when we run cufflinks in the fully de novo mode, we're going to predict potentially novel transcripts from each of our four samples, and then we're going to run cuffmerge to merge those together into a unified transcriptome. But it's still totally predicted; we don't really know how it relates to the known transcripts from human. So when we run cuffmerge we can also say: here, now, at the end of everything, here's our known GTF file, just tell me how the transcripts you predicted correlate with what was known about the
transcriptome. But now the assembly is all done, so it's not going to influence what transcripts get assembled or don't get assembled; it's just an annotation after the fact. This is a simple comparison of merged GTFs from each cufflinks mode, just to give you a visual example of some of the results. What's shown on the top here is a track of UCSC genes, and then we're seeing output from cufflinks being run in three different modes here, one, two, three, and then this is the Ensembl gene track that would correspond to our GTF file. You can see a number of things are going on here. First, UCSC and Ensembl don't agree about what genes or transcripts are expressed at this region of the genome: you can see UCSC has a single gene here and Ensembl has one, two, three. And then when we ran cufflinks we got transcripts that appear to correspond to the known Ensembl transcripts, but when we ran in both the reference guided mode and the de novo mode we've now predicted a novel transcript that doesn't appear to correspond to any of the genes in either Ensembl or UCSC. So this is potentially a novel gene that's been predicted straight from our RNA-seq data. Is it possible to infer false positives? You definitely don't know that it's real just because it was predicted. It could be a false positive, but the question is how do you know. There are definitely pieces of information that might make you more or less confident. One thing you could do is cast the net wider in terms of what other types of data you compare to. In that case I was comparing to UCSC and Ensembl, but there are many, many other transcriptome databases out there that you could compare to, so you could say, well, there's some evidence from some prior experiments, or from dbEST, or libraries of cDNA sequences. There might be other things also about the nature of the sequence itself that make it more or less believable. So for example if it looks like it's in a region that would be
difficult to align to, like say a repetitive element, that might make it more likely to be just the result of a mapping artifact. And there's also a built-in prior based on what species you're working with and how well annotated that species is. Something like human really has been pretty heavily annotated, so I think you need to inherently be a little bit more skeptical about the prediction of novel genes, because there's a question of, well, how come it hasn't been found so far when we've been looking for these things so exhaustively for so long? Whereas if you're dealing with some species where there's hardly any annotation at all, you might expect plenty of novel transcripts to be found, and for quite a number of those to actually be real. I would also say that transcripts that are multi-exon, that have good coverage, that are longer, those things are less likely to just be artifacts of mapping. Whereas a small single exon, where just a block of reads aligns here and then cufflinks is like, oh, that looks like a transcript there, those things are probably more likely to be false positives than, oh, I found this thing that's 1.5 kb long and it's got tens of thousands of reads of support and there are 5 exons and 4 introns. It really looks like a transcript; it just doesn't line up with other known transcripts. So there are a bunch of things like that that you can do at the informatic level, but then of course you would want to definitely validate it by some orthogonal approach. The nice thing is that cufflinks really gives you what you would need to design that validation experiment. You basically have a prediction that the sequence looks like this, so it's fairly straightforward to go back to your sample and try to amplify that cDNA, clone it, and see whether you can actually rescue that. This is a similar idea, but this is where we've merged our GTFs from each sample, again showing the reference only and de novo modes. So in this case we've got, in the
reference only mode, just one transcript being summarized by cufflinks, because that's what we told it to do by saying do this in reference only mode. And then in de novo mode we're seeing, at the same locus, a lot of different transcripts being predicted, and each of them is very similar but they have subtle differences. So for example you can see that this one here has a retained intron, and this isoform has a predicted alternative transcription initiation site and a slightly longer exon than what is used in these other isoforms, and so on. And this is typical: isoforms that are predicted are usually very similar to each other, and they have some subtle distinctive difference. Yes? Yes, I'm not showing it here, and I don't remember; this is just a screenshot that I took at some point while I was exploring the data. But we're going to look at them side by side, so you'll start to get a sense of how much they differ and how much they vary. Depending on what data you use, they might differ quite a lot or they might be very similar. It will also depend, again, on the reference transcriptome that you supply when you're doing the reference guided mode. If it's really complete, it'll have a bigger influence than one that is quite incomplete. And generally just the cleanness of the data: I think if your data is noisy or you have a lot of misalignments, then you might tend to get more false predictions. But I think generally the authors, if you just ask them, like, I just want to run the software in one mode, I don't want all these different modes, I want to simplify the analysis but I want to do splicing analysis, they'll generally recommend that you use the reference guided mode, although I think that's changed over time. And on the wiki, some of the papers that are listed there are follow-up papers by the authors of this suite of tools, and one of them is another published tutorial that's sort of best practices for
using the Tuxedo suite, and it will describe which modes they think are appropriate for what types of goals. And that's actually related to this next slide. So what if you return to your lab and you can't get it to work on your own data? Obviously one thing you can do is refer to the materials that we provided with this course. We're really giving you a bit of a survey of a lot of these materials, and there's a lot of detail there that we're having to breeze past at times. If you go back, there may be additional annotation in the tutorial files that we didn't expressly cover, or that flew by and didn't sink in at the time, so you might find the answer just by going back. And then there's this Nature Protocols tutorial that Cole Trapnell wrote, and it provides a troubleshooting guide, which I have on the next slide, that explains some common problems that come up when you're doing this kind of analysis. So you can check for resources like that. I've already mentioned searching Biostars and SEQanswers, and of course there's always Google for every problem that you might have. If your question's not already on there, posting it is just a good way to get the question out there, and you never know, you might get a great answer. And then if you pursue the problem further, this is a best practice for these question and answer forums: if you find the answer, please come back and tell us at Biostars about what you discovered, because before too long you'll be the expert, not the people on Biostars. So here's this troubleshooting guide. It reviews some common problems that people have when using TopHat, cufflinks, and cuffdiff. For example, the first one is that TopHat cannot find Bowtie or samtools, and the possible reason is that you've got your PATH variable set incorrectly. Remember, we went through this exercise of setting our PATH variable to tell TopHat, and basically tell Linux, where to find Bowtie, TopHat, samtools, and so forth. So I'll
leave those extra problems for you to refer to on your own time.
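The first troubleshooting item mentioned above, TopHat not being able to find Bowtie or samtools, usually comes down to the PATH variable. As a rough sketch (this is not part of the course materials), one way to check which tools are visible on your PATH from Python might look like this. The executable names in `TOOLS` are assumptions about how the tools are installed, and `find_tools` and `missing_tools` are hypothetical helper names chosen for illustration; adjust them to match your own setup.

```python
import shutil

# Assumed executable names; your installation may use different ones
# (for example `bowtie` instead of `bowtie2`).
TOOLS = ["tophat", "bowtie2", "samtools", "cufflinks", "cuffmerge", "cuffdiff"]


def find_tools(names):
    """Map each executable name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    # Print one line per tool so a missing entry is easy to spot.
    for name, path in find_tools(TOOLS).items():
        print(f"{name:10s} {path if path else 'NOT FOUND: check your PATH'}")
```

If a tool shows up as not found, the likely fix is the same one described in the troubleshooting guide: add the directory containing that executable to your PATH and run the check again.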