So, we'll talk a little bit about the introduction to splicing while we wait for Kallisto to index the human transcriptome, something to pass the time. This is much like many of the other modules: it's going to build on what we've already been doing. We're going to pretty much be using the same tool that we've already been using, which is StringTie. We're going to run it in a few different modes that gear it towards transcript discovery a little bit more and have it be a little bit less guided by our prior knowledge of the transcriptome. The same principles of these tutorials apply, where we want it to be self-explanatory and self-contained. As I said, we're going to use StringTie. We can run it in a reference-only mode, where we're telling it: hey, we think we have a pretty good idea of what the transcriptome looks like; Ensembl or RefSeq or whoever has already done a great job of annotating where the transcripts are and what their sequences are. So just focus on those sequences. Even though we're giving you reads that were aligned against the whole genome and could have gone anywhere, really focus on the known transcripts. But of course that limits your ability to discover totally new things that maybe aren't even represented in your GTF file. It's entirely possible that you have RNA-seq reads that are aligning to a region of the genome in things that look like exons, and that basically looks like a region that's being transcribed but just hasn't been annotated by Ensembl yet. This is one of the features of StringTie: it can do that sort of de novo analysis of RNA-seq data against a reference genome. And one of the reasons that you're willing to expend the computational cost of doing things that way is to leave the door open to discover new genes and new transcript isoforms of those genes.
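To make the difference between these modes concrete, here is a minimal sketch that builds the StringTie command line for each one. It assumes the usual StringTie options (`-G` for a reference GTF, `-e` to restrict estimation to the reference transcripts, `-o` for the output GTF, `-p` for threads). The function name and all file names are hypothetical, not taken from the course material.

```python
from typing import List, Optional


def stringtie_command(bam: str, out_gtf: str, ref_gtf: Optional[str] = None,
                      ref_only: bool = False, threads: int = 4) -> List[str]:
    """Build a StringTie argument list for one of three modes.

    ref_gtf + ref_only=True   -> reference-only (known transcripts only)
    ref_gtf + ref_only=False  -> reference-guided (known plus novel)
    no ref_gtf                -> de novo (no annotation used)
    """
    if ref_only and ref_gtf is None:
        raise ValueError("reference-only mode needs a reference GTF")
    cmd = ["stringtie", "-p", str(threads)]
    if ref_gtf is not None:
        cmd += ["-G", ref_gtf]   # guide assembly with known transcripts
    if ref_only:
        cmd.append("-e")         # do not assemble anything outside the GTF
    cmd += ["-o", out_gtf, bam]
    return cmd


# Hypothetical file names, for illustration only.
for label, kwargs in [
    ("ref_only", dict(ref_gtf="ref.gtf", ref_only=True)),
    ("ref_guided", dict(ref_gtf="ref.gtf")),
    ("de_novo", dict()),
]:
    print(label, " ".join(stringtie_command("sample.bam", label + ".gtf", **kwargs)))
```

The lists can be passed to `subprocess.run` as they are. Keeping one output per mode mirrors the idea of producing parallel results that can be compared later.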
So just to remind ourselves what we've been talking about up to this point: we're really focused on interrogating RNA-seq reads that came from fragmented versions of these mature RNA transcripts. We have RNAs that were converted to cDNA, the cDNA was fragmented, and then we're sequencing those fragments. But to this point we've generally been assuming that there's one way that this is being done, or, if there are different ways, that we already know about them. Of course, for most human genes there are multiple ways that the pre-mRNA can be spliced together to form the mature mRNA. So some genes will have two, three, ten, twenty different transcripts that use different combinations of exons or different edges of those exons, and that's what we're going to do a brief intro to in this module. To go over the general types of alternative splicing or alternative expression, I've broken it down into eight general categories, and this is pretty consistent with the way a lot of people do it. There are four on this slide and then there'll be four on the next slide. The first one is just a reference point: simple transcription. We have an imaginary gene with three exons and two introns. It's got a transcription start site and a polyadenylation site. It gets spliced together by removing the first and the second intron. Now our three exons are put together, and it's capped and polyadenylated, and it gets exported from the nucleus into the cytoplasm where it can be translated. But we can also have alternative transcript initiation, where there are two possible choices for the first exon. The transcriptional machinery might initiate here and use this exon, which will then get spliced onto two and three, or it could initiate at the second exon, and then it's of course not going to have that first exon.
And this will give you two isoforms, one that has three exons and one that has two exons, with alternate transcript initiation sites. This is called alternate transcript initiation or alternate TSS usage. We've briefly looked at an example in IGV of the next type, the cassette exon, where there are effectively two paths through the same combination of exons. We have our three exons; sometimes exon two is skipped and sometimes exon two is included. The same transcript initiation site and the same polyadenylation site are used for both, so the outer bounds of the transcripts are the same, but there's something happening in the middle where exons are being excluded or included. This is a very simple case: you get a transcript with three exons, or you get a transcript with two exons where the second exon has been skipped. Of course this is a simplified example. Most human genes have 5, 10, 15, 20, 50 exons, and in some cases there is a huge combinatorial number of possibilities where many exons are skipped or included in different combinations. That's what gives you the large number of distinct transcripts for some loci. You can also have cases where you have the same number of total exons but the edges of the exons that are used are slightly different. In this case we're depicting an alternative 5' splice site being used, or effectively a different donor site. So the donor site could be here at the edge of this blue part or over here at the right-hand side of the blue part. This gives us two isoforms that have three exons each, but in one case the second exon is extended on the 3' side. And then you can have the same concept at the other side, in this case at the acceptor side, where alternative acceptor site usage gives you alternative isoforms, again with the same number of exons but slightly different edges of those exons being used.
You can have mutually exclusive exons. This is a modification of the cassette exon form, where you're going to produce two transcripts that each have three exons but you're going to use one of two alternate second exons. And then you can retain the entire intron. This generally only happens with smaller introns. You don't generally see a 50 kb intron being retained, at least not in a functional transcript. It's usually a way of effectively silencing the transcript by triggering nonsense-mediated decay, so you see that as well. But basically you go from having three exons to having two exons, where one is a very large one that is a superset of exons two and three plus the intron in between. And then finally, corresponding to alternative transcript initiation at the 5' side, at the 3' end you can have alternative polyadenylation sites, where effectively you end at an earlier exon or at an earlier part of the last exon, giving you different 3' tails on transcripts. Some of these can be quite dramatic: there are known human genes, for example, where you have a 20-exon form and a 30-exon form, and the reason is that one of two polyadenylation sites is being used. In some cases that's functional, so you may be chopping off something that would localize that protein to a particular part of the cell when the full-length version is there, and the protein goes to a different part of the cell when the shorter form is used. Okay, so that's a crash course in the terminology of the types and forms of alternative splicing.
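The categories above can be sketched as a toy classifier that compares two isoforms of the same gene. This is only an illustration under simplifying assumptions: both isoforms are on the plus strand, exons are given as sorted `(start, end)` pairs, and the isoforms differ by a single event. The function name and coordinates are made up for the example.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) on the plus strand, sorted 5' to 3'


def classify_event(a: List[Exon], b: List[Exon]) -> str:
    """Name the single event that distinguishes isoform a from isoform b."""
    if a == b:
        return "identical"
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    if len(longer) == len(shorter) + 1:
        # Cassette exon: dropping one internal exon gives the other isoform.
        for i in range(1, len(longer) - 1):
            if longer[:i] + longer[i + 1:] == shorter:
                return "cassette exon"
        # Retained intron: fusing two adjacent exons gives the other isoform.
        for i in range(len(longer) - 1):
            fused = [(longer[i][0], longer[i + 1][1])]
            if longer[:i] + fused + longer[i + 2:] == shorter:
                return "retained intron"
        if longer[1:] == shorter:
            return "alternative transcript initiation"
        if longer[:-1] == shorter:
            return "alternative polyadenylation"
    if len(a) == len(b):
        diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
        if len(diffs) == 1:
            i = diffs[0]
            (s1, e1), (s2, e2) = a[i], b[i]
            last = len(a) - 1
            if s1 == s2:  # same start, different end
                if i < last:
                    return "alternative 5' splice site (donor)"
                return "alternative polyadenylation"
            if e1 == e2:  # same end, different start
                if i > 0:
                    return "alternative 3' splice site (acceptor)"
                return "alternative transcript initiation"
            if 0 < i < last and (e1 < s2 or e2 < s1):
                return "mutually exclusive exons"
    return "other or complex"


reference = [(100, 200), (300, 400), (500, 600)]
print(classify_event(reference, [(100, 200), (500, 600)]))
print(classify_event(reference, [(100, 200), (300, 450), (500, 600)]))
print(classify_event(reference, [(100, 200), (300, 600)]))
```

Real genes combine several of these events at once, which is why the sketch falls back to "other or complex" instead of trying to decompose them.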
There's a huge number of methods and tools that attempt to utilize RNA-seq data to study splicing, and there are a couple of reviews, blogs, and Biostars posts that try to keep track of these tools. These slides here are really just to refer you to some of those materials. We've compiled this list of alternative-splicing-related Biostars posts that list methods, approaches, and tools designed with alternative splicing in mind. Over the years, the sequencing methods that are used to study alternative isoforms have really changed a lot. Where we are now is down here with the Solexa/Illumina reads, millions and millions of short reads that we're assembling into isoform transcript information. But I think it's useful to still think about those longer-read sequencing technologies, which are still quite relevant to alternative splicing, because it really is hard to do full-length cDNA sequence inference from these short reads. RNA-seq is pretty amazing, but there's still a lot of uncertainty when you're trying to predict what a full-length transcript looks like. So if you're interested in a particular transcript that's been predicted, you're probably going to wind up wanting to validate it somehow and get a sense of what the full-length transcript looks like, and I think there are some other technologies that are starting to become more useful. I think PacBio, for example, is pretty good at generating long sequences for a smaller number of reads.
So there might be a smart way to use Illumina sequencing and PacBio sequencing in combination to study isoform diversity, where you really need to have longer-range information about how all of the exons are connected in a transcript. I think there's also a lot of interest in the nanopore sequencer perhaps filling this role, where in theory we start to feed full-length cDNAs into these nanopore sequencers and read off the complete exon-intron structure instead of having to piece it together from the small fragments that we sequence by Illumina sequencing. It's still kind of early days for those two things, but I think it's an area of opportunity to do some interesting analysis. So, the last module: back to our flow chart here, we're basically going to repeat what was done previously with StringTie using some slightly different parameters, and to go into the details of those parameters a little bit more there's a last slide deck with a few slides here. Okay, so we've already learned how to run StringTie in reference-only mode. Remember, we labeled our StringTie output directory as ref only; that was in anticipation of producing the parallel results that we're going to generate now. We're going to do that by running in the reference-guided and de novo modes, and then we're going to use Cuffmerge to combine transcriptome predictions from multiple runs of StringTie (sorry, that hasn't been updated yet). Then we're going to learn how to perform differential splicing analysis, and we'll also do some visualization in IGV and examine some junction files.
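As a rough sketch of the merge step, the example below builds a merge command for the GTFs produced by several StringTie runs. It assumes the updated workflow uses StringTie's own merge mode (`stringtie --merge`, with `-G` and `-o`) rather than Cuffmerge, which is a guess based on the remark that the slide is out of date. The function name and file names are hypothetical.

```python
from typing import List, Optional


def stringtie_merge_command(gtfs: List[str], merged_gtf: str,
                            ref_gtf: Optional[str] = None,
                            threads: int = 4) -> List[str]:
    """Build a 'stringtie --merge' argument list for per-sample GTFs."""
    if not gtfs:
        raise ValueError("need at least one GTF to merge")
    cmd = ["stringtie", "--merge", "-p", str(threads)]
    if ref_gtf is not None:
        cmd += ["-G", ref_gtf]  # include known transcripts in the merged set
    cmd += ["-o", merged_gtf]
    cmd += list(gtfs)           # per-sample transcript predictions
    return cmd


# Hypothetical per-sample outputs from the reference-guided runs.
samples = ["UHR_Rep1.gtf", "UHR_Rep2.gtf", "HBR_Rep1.gtf"]
print(" ".join(stringtie_merge_command(samples, "merged.gtf", ref_gtf="ref.gtf")))
```

Leaving out `ref_gtf` gives a merge of the de novo predictions alone, which keeps the two sets of results separate for later comparison.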
We've already done a little bit of that. It looks like this slide deck has actually not been updated; there is an updated version, but somehow it hasn't been copied over. So I think what I'll do is skip that slide, because there's something very comparable for StringTie but it's not quite the same. We're going to look at a junctions BED file. Instead of getting one from TopHat (HISAT doesn't seem to produce this for you directly), I'll point you to an alternate tool that you can use to generate a junctions BED file from any BAM file. Basically the idea of these BED files is to produce these views. Remember, we looked at those arcs that represent the pattern of splicing in IGV. It's basically a way to get that information stored as a file, where each line in the file corresponds to a single predicted exon-exon connection. That's what each of these arcs is showing here, and the weight or darkness of the arc is a representation of how many reads there were that supported each of those unique exon-exon combinations.
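To show what one line of such a file holds, here is a minimal parsing sketch. It assumes the TopHat-style BED12 layout, where each junction has two blocks (the flanking exon pieces) and the score column is the number of supporting reads. Other tools may fill the columns differently, and the example line is invented.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    chrom: str
    intron_start: int  # 0-based position of the first intronic base
    intron_end: int    # 0-based, exclusive end of the intron
    strand: str
    reads: int         # supporting reads (BED score column, by assumption)


def parse_junction_line(line: str) -> Junction:
    """Parse one BED12 junction line into intron coordinates and support."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("expected a 12-column BED line")
    chrom, start = f[0], int(f[1])
    reads, strand = int(f[4]), f[5]
    block_sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in f[11].rstrip(",").split(",")]
    if len(block_sizes) != 2 or len(block_starts) != 2:
        raise ValueError("a junction line should have exactly two blocks")
    # The intron lies between the end of block 1 and the start of block 2.
    intron_start = start + block_sizes[0]
    intron_end = start + block_starts[1]
    return Junction(chrom, intron_start, intron_end, strand, reads)


# Invented example line: 50 bp and 60 bp of flanking exon sequence.
example = "chr22\t1000\t1300\tJUNC00000001\t12\t+\t1000\t1300\t255,0,0\t2\t50,60\t0,240"
print(parse_junction_line(example))
```

Each parsed record corresponds to one arc in the IGV view, with `reads` playing the role of the arc's weight.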
Cuffmerge is a generic tool that's used to combine GTF files together. These GTF files could be from wherever: they could be from StringTie, or previously they were from Cufflinks. It does two things. It takes two different sets of transcript predictions and merges them into a unified superset of predictions, and it also allows you to compare those predictions against some known transcriptome models, like those from the Ensembl GTF. Both of those applications are pretty handy; it's a way of merging and annotating transcripts that were predicted. So we're going to do some comparisons of the GTF files that we get out of running StringTie in the different modes, and the way we're going to do that is just by browsing around in IGV. There are some specific regions that you'll be pointed to in the online exercises to illustrate some of the differences that come out of these different modes.
We talked a little bit already about what to do when you return to your lab and you can't get this to work on your own data, which is probably inevitable; there will be some problems. As Jared said, you can email us with follow-up questions, or you can ask questions on Biostars or SEQanswers, where we're also active, so it might be one of us that answers your question there. There are troubleshooting guides for some of these tools, and the manuals are pretty good, or at least reasonable, but sometimes the answer isn't there. Google is often the most reliable way of getting answers to specific questions about these tools, just because things tend to be changing quite quickly.