All right, so this is the last full lecture, and it's one of the shorter ones. When do we end? Five? Okay, so we've got an hour and 20 minutes, and there are maybe 10 slides in this lecture. Then there's a very brief introduction to the last hands-on exercise, which is also one of the shorter ones. This is a pretty big topic, so we're not doing a really deep dive on it, just getting your feet wet a bit. I think Brian's going to talk a little bit more about this tomorrow as well, building on some of the things we're talking about today.
This module is all about running StringTie in a few additional modes that facilitate transcript and isoform discovery. The commands are going to be very similar to what you already ran for your expression abundance estimation. We're just going to run StringTie in slightly different ways to improve its ability to construct previously undiscovered transcripts and alternative isoforms of particular genes. This still requires a reference genome; the more reference-free approaches are going to come tomorrow.
Just to circle back and review the central dogma: so far we've been talking about mRNAs being converted to cDNA, fragmented, and sequenced by RNA-seq. Alternative splicing is really all about the previous step, where single-stranded pre-mRNA molecules are spliced by removing the introns and stitching together the exons. For many genes in many eukaryotes there are multiple ways to do this, and this manifests as alternative isoforms that differ in the combination of exons they use or in the boundaries of some of their exons.
There are a number of classic forms of alternative expression. We start with a simple transcript model at the top that has three exons and two introns and is transcribed into an mRNA with the three exons. Seven modes of alternative expression are listed on this slide and the next one.
The first is alternative transcript initiation, where there are two possible first exons. In this case you have a transcript that uses all three exons, or you might use a different transcript initiation site downstream and wind up with a transcript that has just two exons. I think we're actually going to find a few examples like that in the data you've been analyzing.
Then alternative splicing takes several forms. You can have cassette exon skipping, shown here, where there are two paths through the set of exons and exon two is either skipped or included. There's alternative donor site usage, also called alternative five prime splice site usage, where the three prime end of an exon gets longer or shorter because a different splice site of the same exon is used. Similarly, we can have alternative three prime splice sites, or alternative acceptor site usage, where the three prime end of the intron is spliced differently. That can give you a very subtle difference in transcript structure, or it can be quite dramatic, depending on how much more or less of the exon is included. It's kind of rare, but sometimes we'll have mutually exclusive exons, where two transcripts each have the same total number of exons but use alternative exons in the middle: in this case exon 2A or exon 2B, and not both. The last one in the alternative splicing category is intron retention, where the entire intron is retained and effectively becomes part of one large exon. Instead of having exons two and three, you just have a really large exon two.
And then finally, just as you can have alternative transcript initiation at the beginning of the gene, you can have alternative polyadenylation that gives you different exons at the three prime end of the gene, giving you shorter or longer transcripts.
The following slide is just reference material. It's from a review that looked at a whole bunch of different tools and methods for studying splicing by RNA-seq, and there are additional references in the wiki for some of these tools. The review is starting to be a few years out of date now, but we're trying to keep the tables up to date as new tools are released. Zooming in a little on the transcript reconstruction part, which is what we're going to focus on in the hands-on tutorial, I'm pointing out a few Biostars posts where there has been ongoing discussion of tools and methods for alternative splicing detection and analysis in various categories.
Just to summarize and put this in context: quite a few approaches have been used over the years to characterize alternative mRNA isoform expression. RNA-seq has become the most popular, but there are probably still applications for some of the other techniques. Someone mentioned PacBio sequencing earlier. If you really want to understand the full-length structure of a transcript, you might take advantage of the longer reads you get from PacBio and, maybe in a targeted way, try to sequence some full-length or much longer RNA sequences. People are still doing cDNA synthesis and sequencing where they really need that gold standard, or where they want to actually produce a reagent: a physical copy of the transcript in a plasmid that you can use for further experimentation. There's also a ton of existing data out there that you may want to leverage.
So just to review, we're going to take a bit of a step back to the steps we ran at the beginning of the expression module and rerun StringTie with some different options. Then we'll set up the files you would need to do the Ballgown analysis again, and we're going to look at some examples in IGV so you can see what this looks like in our example data.
To briefly introduce the hands-on component of this module: we've already been running StringTie in what we've called reference only mode. Now we're going to try running it in reference guided and de novo modes. Then we're going to learn how to use cuffmerge to combine transcriptomes from multiple runs (the slide should actually say StringTie runs) and compare the assembled transcripts to known transcripts. Then we're going to perform differential splicing analysis with Ballgown, or at least do the setup you would need for it. It's really exactly the way you already did it; you're just going to have GTF files and expression estimates that were generated in a slightly different way. We're going to examine some junction counts with a helper tool we developed called regtools. And then we're going to look at some junction files in IGV along with the StringTie-assembled transcripts we were able to generate.
Okay, so how do we run StringTie in reference guided and de novo modes? Basically, by taking advantage of some additional options. It all has to do with how, and whether, we take advantage of the GTF file of known transcript annotations that we've been using. We use this GTF file in several places, so we keep coming back to it over and over again.
Remember, during the indexing of our genome, before we even did our alignments with HISAT2, we used the GTF to give the aligner additional information about exon-exon junctions and exon boundaries to help it align the RNA-seq reads against the reference genome. Then we used the GTF file in StringTie by supplying it with the -G option, and we're going to do that again to run StringTie in the so-called reference guided mode. Previously, in reference only mode, we were basically telling StringTie: give me expression estimates for all of these transcripts. Now we're going to run StringTie and say: yes, do that, but also look for novel transcripts, using the GTF file as a guide. Use whatever information is in there to help you figure out where transcripts ought to be, but if you happen to see any novel exons or novel exon boundaries, we want to know about that.
In de novo mode, we're not going to give it a GTF at all, so it doesn't know about the known transcript structures. It's just going to look at the RNA-seq data and the reference genome and try to figure out what the structure of the transcripts ought to be. This is a much more challenging problem. It has no prior knowledge of the boundaries of the exons or how the exons are connected in the set of known transcripts, which for human is a really rich source of information. We have these really great, highly vetted annotations for the human transcriptome that have been built up over the years with high quality data from EST sequencing, cDNA sequencing and so on. StringTie is going to try to do everything it has been doing already without the advantage of any of that information. That's a big difference from reference only mode, where we use the -e option to say: just give me estimates for these known transcripts.
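To make the difference between the three modes concrete, here is a minimal sketch in Python that only builds the command lines and does not run anything. The -G and -e options are the ones described above, and -p (threads) and -o (output GTF) are standard StringTie options. The function name, the mode labels and all file names are hypothetical, and the hands-on exercise may well use additional options.

```python
def stringtie_command(bam, out_gtf, mode, ref_gtf=None, threads=8):
    """Build a StringTie argument list for one of the three modes.

    reference_only   : -G plus -e (estimate known transcripts only)
    reference_guided : -G without -e (known transcripts guide assembly)
    de_novo          : no -G at all (assemble from the alignments alone)
    """
    cmd = ["stringtie", "-p", str(threads)]
    if mode in ("reference_only", "reference_guided"):
        if ref_gtf is None:
            raise ValueError(f"{mode} mode needs a reference GTF")
        cmd += ["-G", ref_gtf]
    elif mode != "de_novo":
        raise ValueError(f"unknown mode: {mode}")
    if mode == "reference_only":
        cmd.append("-e")
    cmd += ["-o", out_gtf, bam]
    return cmd


# Hypothetical file names, for illustration only.
for mode in ("reference_only", "reference_guided", "de_novo"):
    args = stringtie_command("sample.bam", f"{mode}/transcripts.gtf",
                             mode, ref_gtf="known_transcripts.gtf")
    print(mode, "->", " ".join(args))
```

The only things that change between the modes are whether the known annotation is passed at all and whether -e restricts the output to it, which is why the commands look so familiar.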
And then after alignment, we're going to create a junctions BED file. An example is shown at the bottom here. This is just a way of summarizing which exon-exon connections were found in the RNA-seq data. We pull out all of the reads that span an exon-exon boundary and summarize them specifically, ignoring the RNA-seq reads that fall entirely within exons, to get a map of the connectivity of all of the exons. When we look at this data in IGV, we'll get a view that looks something like this: for each junction that was observed, there's a little arc showing where the read goes from and to as it spans an intron, along with the number of reads that supported each unique combination of two exons. Each arc corresponds to a connection from one exon to another, and the thickness of the arc represents how many reads supported that particular exon-exon combination.
This is an example where you can actually see evidence for what appears to be a novel exon. Most of these arcs correspond to known exons. We have a gene model up here that would have been represented in our GTF file for a particular gene. You've got pairs of exons, and the arcs seem to correspond to the edges of all of these known exons as we go from the beginning of the gene to the end. They match up really nicely. But when we get over to this intron, we can see that, yes, there is a junction that spans from this expected exon all the way over to this expected exon, but we're also seeing two additional exon-exon junctions that go into the middle of the intron and land quite close to each other.
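The junction summary described above can be sketched roughly as follows. This is an illustration of the idea and not how regtools is implemented: it assumes each spliced alignment is given as a chromosome, a 0-based start position and a CIGAR string, and it treats every N operation (a skipped region on the reference) as one junction. The reads and coordinates are made up, and the output is a plain tab-separated table, not the exact BED layout a real tool writes.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions_from_read(chrom, pos, cigar):
    """Yield (chrom, intron_start, intron_end) for each N gap in one read.

    pos is the 0-based leftmost aligned reference position, and the
    returned intron coordinates are 0-based, half-open.
    """
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            yield (chrom, ref, ref + length)
        if op in "MDN=X":  # operations that consume the reference
            ref += length


def count_junctions(reads):
    """Count how many reads support each distinct junction."""
    counts = Counter()
    for chrom, pos, cigar in reads:
        counts.update(junctions_from_read(chrom, pos, cigar))
    return counts


# Made-up alignments: two reads span the whole intron, one read hops
# into the middle of it and out again, and one read is unspliced.
reads = [
    ("chr22", 100, "50M200N50M"),
    ("chr22", 120, "30M200N70M"),
    ("chr22", 140, "10M80N20M100N70M"),
    ("chr22", 400, "100M"),
]

for (chrom, start, end), n in sorted(count_junctions(reads).items()):
    print(chrom, start, end, n, sep="\t")
```

In this toy input the third read contributes two junctions that both land inside the intron spanned by the first two reads, which is the same pattern as the pair of arcs dropping into the middle of an intron on the slide. The unspliced read contributes nothing, matching the point that reads falling within exons are ignored.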
Two junctions landing close together inside an intron like that can be really suggestive that there's actually a novel exon somewhere around here, with its acceptor site here and its donor site there. So we could zoom in on the RNA-seq data in that region and see how much evidence there is for a novel exon.
Yeah? (Question from the audience: the graph is separated into three rows; is there any difference between them?) Yeah, they're just being spread out for display purposes. In IGV you can compact them so they're all on one line, but then it sometimes gets a bit hard to see them properly when you have cases like this, where there's exon skipping and one arc goes over the top of others. So there's usually an option to expand it all out so that everything is easy to see. Any other questions on any of that? Okay.
We're going to play around a little bit with this tool called cuffmerge. This is a way to combine transcripts that were predicted from multiple RNA-seq data sets into one unified representation of the transcriptome. If you're working with multiple samples, or in a species where you actually want to come up with a representation of the transcriptome, you might decide to pool several runs. Say you have five conditions that you want to compare for biological reasons, so you're processing them separately. When it comes to predicting the structure of transcripts, though, you might decide to pool all that data together, because for the purpose of understanding what transcripts are there, we just want more data available. Cuffmerge is a way of amalgamating the transcript predictions from those five runs into a single set.
And then you could use that as a new reference transcriptome and go back and redo transcript expression estimation for each of your five individual conditions, using knowledge about the structure of the transcripts that was gleaned by looking at all of the data together. We're going to do some comparisons of our own in IGV of the merged GTFs that come from each of these different StringTie modes. This slide is just showing a couple of examples. I'm going to skip over it, because we're going to look at some examples ourselves using our own data, and that will be more fun than looking at it in slide form.
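As a rough illustration of the pooling idea behind the merge step, here is a deliberately simplified sketch. It is not the algorithm that cuffmerge or StringTie uses: it just treats two predicted multi-exon transcripts as the same transcript when they share a chromosome, strand and intron chain, keeps the exon coordinates of the first copy it sees, and remembers which runs supported it. The condition names and coordinates are invented.

```python
def intron_chain(exons):
    """Return the introns implied by a list of (start, end) exons."""
    exons = sorted(exons)
    return tuple((a_end, b_start)
                 for (_, a_end), (b_start, _) in zip(exons, exons[1:]))


def merge_predictions(runs):
    """Pool transcripts from several runs into one non-redundant set.

    runs maps a run name to a list of (chrom, strand, exons) tuples.
    Single-exon transcripts are skipped in this sketch because they
    have no intron chain to compare.
    """
    merged = {}
    for run_name, transcripts in runs.items():
        for chrom, strand, exons in transcripts:
            chain = intron_chain(exons)
            if not chain:
                continue
            key = (chrom, strand, chain)
            entry = merged.setdefault(
                key, {"exons": sorted(exons), "runs": []})
            entry["runs"].append(run_name)
    return merged


# Invented predictions from two conditions. The first transcript is seen
# in both (with slightly different ends); the second only in condition_2.
runs = {
    "condition_1": [
        ("chr22", "+", [(100, 200), (300, 400), (500, 600)]),
    ],
    "condition_2": [
        ("chr22", "+", [(90, 200), (300, 400), (500, 650)]),
        ("chr22", "+", [(100, 200), (500, 600)]),
    ],
}

for (chrom, strand, chain), entry in merge_predictions(runs).items():
    print(chrom, strand, entry["exons"], "seen in", entry["runs"])
```

The merged set here contains two transcripts: one supported by both conditions and an exon-skipping isoform seen only in the second. That pooled set plays the role of the new reference that each individual condition could then be re-estimated against.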