Okay, so welcome everyone to day three of the workshop. Today we're going to be discussing genome-free transcript reconstruction and the analysis framework we have for studying RNA-seq when you don't have a genome. And in the afternoon we'll be focusing on annotating transcripts, so trying to gain functional information about the transcripts that we reconstructed.

So, a few learning objectives for the modules this morning. One is just to understand the various challenges involved in reconstructing transcripts from RNA-seq data. We did cover a little bit of that yesterday, when Obi described the StringTie algorithm, and I'll make reference to similar approaches here and how it differs when you're working without a genome. We're going to become more familiar with the different algorithms and data structures that are involved in doing transcript assembly, particularly genome-free. You'll be able to appreciate the importance of strand-specific RNA-seq more so than maybe you had before, and we're going to learn various ways to assess the quality of the assembled transcriptome that we have.

So as we know with RNA-seq, especially with the Illumina platform, which is really the dominating technology these days, the reads that you get from the machine are much smaller than the transcripts you're actually starting with. So we need to have a way to take the short reads and reconstruct the full-length transcripts. And there are a couple of different approaches for doing this, and it depends on whether you have a reference genome to start with or not. Because if you have a reference genome, you could, of course, just align the reads directly to the genome as you've done in the last couple of days. And then we have tools that can take the alignments and reconstruct transcripts based on those alignments, reconstructing the full-length transcript splice isoform sequences that are best supported by the reads. But of course, there are limitations here, because you might not have a beautiful reference genome for the organism that you're working with. Or the genome you have might be a draft genome, or it might be just full of gaps, and it's still useful, but maybe you can't rely on it entirely for the work that you want to do.

So we have other options. We can just take the reads and assemble the reads themselves without using a genome. And then if we have our draft genome, or we have a genome of a related organism that we want to leverage to gain more insight into our transcripts, we can align those reconstructed transcript sequences to that genome using a spliced aligner. And that can give us some more information about where the introns and exons might lie within the structure of the transcript.

And there are lots and lots of tools that have been developed for this over the last decade. It's been a very active area of research and development, so you have lots of tools to choose from. Some tools are more popular than others. And you've gotten to learn about certain of these, like TopHat, STAR, HISAT; GSNAP maybe you haven't heard of, but it's another tool that was developed for doing short-read alignment. And then for transcript reconstruction, we had Cufflinks, and now we're using StringTie, but there are lots of different algorithms for doing that too. And similarly, when it comes to assembling transcripts de novo, genome-free, you now have lots of options.
We developed the Trinity software, but of course you have many different tools for doing this sort of thing. So there are some similarities between genome-guided reconstruction and genome-free reconstruction, at least conceptually. The main goal with any kind of transcript reconstruction is to take the RNA-seq reads and try to encode them into some sort of a graph structure. And the graph structure will encode the sequence information, the order of the reads, the orientation of the reads, and indicate overlaps among the reads. Here's an example of a graph structure that we might have, where the nodes represent sequences that have already been oriented, and the edges indicate the order of those sequences, their orientation, and the overlap. And the goal with reconstructing the transcript sequences is basically to find a path from left to right within this ordered graph that's best supported by the reads, or by the alignments in the case where we have alignments of the reads. And given a path through that graph, we can then reconstruct the transcript just by traversing those nodes of sequence in that order. And different paths through the graph would give you different transcript sequences, so different transcript isoforms.

Now, one way we could do this is just to take all the reads and identify all the overlaps among those reads. We could treat each read as a node in our graph and just traverse the overlaps to reconstruct the final transcript sequences. It's a fine approach, but the problem is that it doesn't scale well. It might work fine when you have thousands of reads, but once you start generating millions and billions of reads, determining all the overlaps among all those reads becomes computationally challenging. If you look at the number of alignments you require for determining overlaps as a function of the number of reads, you can see that it grows quadratically: with n reads you need on the order of n-squared pairwise comparisons, so a hundred million reads would mean something like 5 x 10^15 alignments. It's just not something that's feasible with short-read data, where you have millions and millions of reads. Feel free to ask questions, by the way — this should be very interactive. So in short, this is really an impractical way of going about short-read assembly.

So we want to avoid doing all these millions or billions of pairwise alignments to establish relationships between the sequences. And there's a very popular data structure that's been used to encode the information in the reads, and from that, determine which reads overlap each other and the extent of their overlaps. It's called the de Bruijn graph. The way you construct the de Bruijn graph is by breaking the reads down into individual k-mers. The k-mers are short strings of a fixed word size. In this example, we have a word size of five, but in general it might be 25 or 30. We basically just take all these overlapping words from the reads. We start at the left side and pull out the first k-mer, and each k-mer is going to become a node in a graph. So in this case, we're going to construct this de Bruijn graph by adding this one k-mer as a single node in the graph. And then what we do is we just move that window over by one base. You can see that this next k-mer overlaps the first k-mer by all but one base — it's a k-minus-one-length overlap. And we take that k-mer and add it as another node in the graph.
And then we can just draw an edge between those nodes, indicating that they're adjacent within the read and they overlap by k minus one bases. And we do this for all the reads. If we come across the same k-mer in a different read, we don't add it as a new node in the graph; we basically just reuse the existing node that's already there. So the number of nodes in the graph is just going to be equal to the number of unique k-mers that we have in our entire read data set. And once we do that, we can end up with a structure that looks a little bit more complex. But again, every node in the graph is a k-mer, and every edge indicates a k-minus-one overlap.

And we'll end up with these bubbles and branching patterns in the graph any time we encounter sequence variation. That sequence variation could be a polymorphism, because all we need is a single base change to give you a unique k-mer sequence. Or it could result from alternative splicing, because alternatively spliced isoforms are going to have some regions of shared sequence, but they're going to have other regions that are unique to those isoforms. At any of those points the graph will diverge, and you'll see these bubbles and branches.

So we've built this de Bruijn graph, and one of the first things we generally do is simplify it, because you'll find there are these long unbranched stretches where one k-mer just links to the next k-mer. We can collapse those down into single nodes, where each node just has more sequence information attached to it — we're basically collapsing those overlapping k-mers into one node with one sequence. This gives us a much simpler graph to work with. And then we can traverse that graph, just like we did before. We can find paths through the graph from left to right, and each path could give us a different transcript sequence, or could reconstruct a different transcript isoform.

Now, which path you choose to traverse in the graph is going to depend on a number of factors. Yesterday, when Obi was describing StringTie, there was a network flow algorithm where he was trying to find a maximum flow to define which path would be traversed first. So network flow is one way we could try to address this problem and determine which paths we should choose. But there are many different algorithms that try to address this, and a lot of it boils down to classical algorithms in computer science — like minimum path cover: determine the minimum number of paths you need in order to traverse every node in the graph at least once. There are lots of different ways of doing this. But ideally, the paths that we report are going to be those that are best supported by the underlying RNA-seq data. In this case, we would have four different paths, and if we traverse each of the four paths, we would end up with a different sequence. If we align those sequences, we can see that there are differences: there's a polymorphism in there, and you also have a large deletion.

So genome-free transcriptome assembly uses similar algorithms to genome assemblers. They both use this de Bruijn graph structure.
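Just to make that concrete before we get to the differences, here's a minimal sketch of this kind of k-mer graph in Python, with a toy word size of k=5 like the example above. This is purely illustrative — not how Trinity or any real assembler is implemented — the reads are made up, and there's no handling of cycles or reverse complements:

```python
from collections import defaultdict

def build_debruijn(reads, k=5):
    """Toy de Bruijn graph: nodes are k-mers, edges are k-1 overlaps within reads."""
    edges = defaultdict(set)   # k-mer -> set of adjacent k-mers
    counts = defaultdict(int)  # k-mer -> times seen across all reads
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[kmer] += 1
            if i > 0:
                # the previous window overlaps this one by k-1 bases
                edges[read[i - 1:i - 1 + k]].add(kmer)
    return edges, counts

def paths(edges, node, path=()):
    """Enumerate all paths from `node`; each one spells a candidate transcript.
    (No cycle handling -- fine for this acyclic toy example.)"""
    path = path + (node,)
    if not edges.get(node):
        yield path                       # dead end: a complete path
    for nxt in edges.get(node, ()):
        yield from paths(edges, nxt, path)

# Two toy "reads" differing by one base create a bubble in the graph.
edges, counts = build_debruijn(["GATTACAGATT", "GATTACGGATT"], k=5)
for p in paths(edges, "GATTA"):
    # first k-mer, then one extra base per following k-mer
    print(p[0] + "".join(kmer[-1] for kmer in p[1:]))
```

Run on those two toy reads, the single-base difference creates a bubble, and the two paths through it spell out the two original sequences — exactly the polymorphism case described above.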
But there are some important differences between how a genome assembler uses a de Bruijn graph to reconstruct genomes and how a transcript assembler uses a de Bruijn graph to reconstruct transcriptomes. And this is actually one of the key reasons why we built Trinity in the first place. It was about 10 years ago, when Illumina — it was Solexa at the time — came on the scene, and we suddenly had ways of generating millions and millions of short reads, and we needed ways to reconstruct transcripts from these data and use them for annotating genomes and doing transcript research. And there were no good tools developed at the time to do this. There were tools for doing genome assembly — tools like Velvet and ABySS that used de Bruijn graphs to reconstruct genomes. But if you took those tools and applied them to transcriptome data, to RNA-seq data, you would not get a very good result. We really needed tools that were tuned to the differences between assembling RNA-seq data to get transcripts versus DNA-seq data to get genomes. A lot of it boils down to the tools having different expectations and different goals.

With genome assembly, your expectation is that you're going to have relatively uniform coverage of your reads across the entire genome. There will be dips in coverage due to GC content and repetitive regions and certain biases, but overall you expect relatively uniform coverage across the entire genome. With transcriptome assembly, we have very different expectations. We have some transcripts that are expressed at very low levels and other transcripts that are expressed at extremely high levels; the dynamic range can be 10 to the 5th to 10 to the 7th between them. So we have very different types of transcripts in terms of the read coverage that we expect to see. With a genome assembler, if it comes across a region that has very, very high coverage, it's not going to think this is a highly expressed region of the genome — no, that doesn't make sense, right? It's going to think it's a repeat region. And a lot of the time it'll take those reads and just put them in a separate bucket and maybe deal with them later. So when you apply your genome assembler to transcriptome data, all of your highly expressed transcripts end up getting put aside into a separate bin and maybe not assembled at all. That's one of the huge problems.

With genome assembly, you expect to have a single contig reconstructed per gene — ideally, you get entire chromosomes reconstructed as single molecules. In the case of transcriptome data, we expect to have multiple contigs reconstructed per gene if we have evidence for alternative splicing. In the case of genome assembly, we're expecting to assemble large contigs, like entire chromosomes, like I said. In the case of transcriptome assembly, we're expecting not megabase-size sequences but smaller ones — on average maybe two or three kb — and many thousands of them.

Another key difference has to do with the type of data that you're inputting into the assembler. With genome assembly, the DNA-seq data that you're using is inherently double-stranded. But with RNA-seq data, we can generate strand-specific data. That's really critical, especially if you want to be able to differentiate sense transcription from antisense transcription.
These very good strand-specific RNA-seq protocols are very popular right now. If you're generating RNA-seq on your own and you have the option to do strand-specific RNA-seq versus non-strand-specific, always take the strand-specific option if you can. A lot of times they're not perfect, but still, 99% of the reads will have the correct strand after sequencing. And again, this is why we developed Trinity, about 10 years ago now: it was to tackle a lot of these problems. And Trinity has come a long way over the last 10 years to improve on this.

Another difference between Trinity and other assemblers — other genome assemblers, other transcriptome assemblers — is how we build our graphs. A lot of other tools will build one big graph and then just try to tease the transcripts apart from that one large graph structure. In Trinity, we try to partition the data, partition the reads, into smaller graphs, where ideally we'll have one graph per gene, and each graph will represent the transcriptional complexity of that gene. And this works out very well for us in terms of being able to perform parallel computing, because each graph basically becomes a separate compute job: reconstructing the transcripts for that gene represents a separate, isolated compute job. And if you have a big compute farm with hundreds or thousands of nodes, we can basically fan all these jobs out onto the compute farm and have them all run in parallel. And then once they're all done, all the results just come back and you have your final assembly. So it's very convenient.

So here's a general overview of how Trinity works. We originally called it Trinity because it had three major parts, and there were actually three of us who developed the code. Myself; Moran Yassour, who was a PhD student at the time — she's now a postdoc with Eric Lander, and this was actually her PhD project. I've been at the Broad for 10 years now, and I got involved in the project early on because I was particularly interested in having good tools for reconstructing transcripts from RNA-seq data to use for genome annotation. And then there's a very talented scientist and engineer, Manfred Grabherr, who had a lot of experience in genome assembly. He's since moved on; he's now a professor at Uppsala, at SciLifeLab. So I'm the one that remained, and I'm keeping the Trinity legacy going to the extent possible.

So we have these three different modules: Inchworm, Chrysalis, and Butterfly. I wrote the Inchworm tool, Manfred wrote Chrysalis, and Moran wrote Butterfly. Inchworm takes the RNA-seq reads and builds contigs from them. It does this very fast, and they're very draft sorts of contigs — it generates contigs, but they're not in any way supposed to represent the final product of transcriptome assembly. And I'll describe how this works: it just takes the unique k-mers within the reads and does a greedy extension to build out these contigs. We'll walk through the details, because it's fun as well as interesting. Once we have these linear contigs, Chrysalis takes over. These linear contigs basically represent the unique parts of isoforms, so you might have a full-length transcript isoform represented by a few different Inchworm contigs. And Chrysalis's job is basically to find the different Inchworm contigs that likely correspond to the same gene — that just represent different unique parts of alternative isoforms of that gene — and it clusters them together.
And once we have these clusters of contigs that represent the different parts of isoforms, Chrysalis then constructs the proper de Bruijn graph structure that I described earlier. And once we have this de Bruijn graph, Butterfly will come in, take the original reads, and thread the reads through the graph structure to try to figure out which paths are actually best supported by those reads in the graph. And then it will provide you with the final transcript sequences.

So I'm going to walk through this, starting with the Inchworm algorithm. What Inchworm first does is take a read sequence and, as we were talking about before with building a de Bruijn graph, extract the k-mer sequences. Now, we're not actually going to build a de Bruijn graph here; we're just going to do the first part. We're going to walk through and pull out all the k-mers that overlap by k minus one. And instead of building a graph, we're just going to build a table. The table has two columns: one column is the k-mer sequence, and the other column is the number of times we've seen that k-mer in the entire set of reads. In computer science, we tend to call this a hash table. We basically have key-value pairs. So in this case, our key would be a k-mer sequence, and if we look up that key in our table, it gives the count of how many times we've seen that k-mer in the data set.

And the first thing Inchworm does is determine a seed k-mer that it wants to use to start building out a contig. The way it defines a seed is it just chooses the k-mer that has the highest count in the table. So we'll pretend that that's GATTACA, the most famous of all k-mers — there's a movie made about it, so it's very famous. So we start with GATTACA, and GATTACA is found nine times. We pull that out, and what we're going to do is start from GATTACA and build a contig sequence by extending from both ends. We do this in a greedy fashion. So we start with GATTACA and ask: okay, if we're going to extend it by just one base, which base would it be? And since our alphabet for nucleotides only has four characters, we only have four possibilities: we can extend with a G, an A, a T, or a C. We just have to figure out which letter we're going to choose to extend our contig. And all we do is look up each of the four possible overlapping k-mers in our data set. The overlapping k-mer would start with ATTACA and end with one of these four bases. So we look it up in our table: how many times do we find the k-mer ending with a G? Four times. Then we look up the next one, ending with an A: it's only found one time. Ending with a T: we look it up in our table and it's not even there, so T is not an option — if we wanted to extend with a T, we just couldn't, because there's actually no such k-mer ending with T in our entire read set. And then we go down to C, and we find that k-mer four times. If one of them had the uniquely highest count, we would just take it and go with it, okay? But in the case where we have ties, it gets a little bit more complicated, because then we have to break the tie, all right? So we have a tie, between the G and the C, and we basically have to look out from each of them. We do the same sort of thing.
For the G and the C, if we just go out one more base, can we break the tie? We look up all the other k-mers, and we find, yes, we can break the tie: the A wins. So we're going to go in this direction. Okay. And we basically just keep doing this one base at a time. When we hit tie cases, we go out a little bit further to break the ties. So we just keep extending one base at a time until we get to a point where we look up each of the four possible k-mers and none of them are in our table. All right? Well, if none of them are in our table, then we have to stop. So that's how we stop extending at that end. Once we've finished extending to the right, we basically do the same thing, but now we extend to the left. We just go in the other direction, look up the k-mers in our table to figure out what the greedy extension would be, take that greedy extension, and keep on going until we run out of options, and then we stop. Once we stop, we report that contig sequence. And this would be our first Inchworm contig to be reconstructed.

And then we go through this contig, and any k-mer that's in that contig, we remove from our data table — remove from our hash table. Okay? So any k-mer that we use can only be used once; it'll only show up in a single contig. Yeah — so, we're going to get to that; that's why this is not the final output from Trinity. This is just a useful intermediate, a useful first phase, to pull out the unique parts of transcripts that are essentially best supported by the reads — because we're doing this greedy extension. When we have multiple options, you might have a sequencing error alongside the real base, and the real base will often have a much higher count than the sequencing error.

Yeah. Do you have a fast way to figure out which k-mer has the highest count? Otherwise it'd be n-squared if you have to go through the whole table, right? Yeah, you just sort. You do an initial sort step: after you build your table, you sort it by count, and then everything's already pre-sorted. So you only have to do the sort step once. And is this a hash table? Yes, so the lookups are quick — constant time.

Okay, so we remove those k-mers from the data table, and we basically just start the process all over again. We choose the next seed that hasn't been used yet, extend it to the right, extend it to the left, remove those k-mers, and keep going. And eventually your data table becomes completely empty, and that's when Inchworm is finished, and it reports all of the contigs that it was able to reconstruct.

Now, we do have this problem, as Francis pointed out, in the case of alternative splicing. Yeah — yeah, true. I mean, in terms of total numbers of reads, if you're starting out with like a billion reads, you end up with like a million contigs. But it also depends on the complexity of the data that you're working with; if you actually have a billion reads from something like a yeast, then it's huge coverage, yeah. Right, right. So, in the case where we have alternative splicing, we have, in this case, isoform A and isoform B. They have some regions that they share — in this case, they share the orange and they share the green — but isoform A has this yellow piece that is unique to isoform A.
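Before we look at what Inchworm does with those two isoforms, here's a rough sketch of the whole loop we just walked through in Python — build the k-mer table, pick the most abundant seed, greedily extend in both directions, retire the k-mers you've used, repeat. It's a toy, not the real implementation: it uses a tiny k, and on ties it just takes the first winner rather than looking further ahead the way I described:

```python
from collections import Counter

def inchworm(reads, k=5):
    """Toy greedy Inchworm-style contig builder."""
    kmers = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers[read[i:i + k]] += 1

    def extend(contig, rightward):
        """Greedily grow one end of the contig, one base at a time."""
        while True:
            stub = contig[-(k - 1):] if rightward else contig[:k - 1]
            candidates = [stub + b if rightward else b + stub for b in "GATC"]
            best = max(candidates, key=lambda km: kmers.get(km, 0))
            if kmers.get(best, 0) == 0:
                return contig            # none of the four k-mers exist: stop
            # (ties: the real algorithm looks further ahead; max() here just
            #  takes the first winner, which is enough for a sketch)
            contig = contig + best[-1] if rightward else best[0] + contig
            del kmers[best]              # every k-mer can be used only once

    contigs = []
    while kmers:
        seed = max(kmers, key=kmers.get)  # most abundant remaining k-mer
        del kmers[seed]
        contigs.append(extend(extend(seed, True), False))
    return contigs

# The third read shares a prefix with the dominant sequence but ends differently,
# like a minor isoform; only its unique tail survives as a second contig.
print(inchworm(["GATTACAGATTACA", "GATTACAGATTACA", "GATTACAGGGG"], k=5))
# -> ['GATTACAGATT', 'ACAGGGG']
```

Notice that the second contig only recovers the piece unique to the minor variant, yet it still shares a k-minus-one-mer with the first contig — which is exactly the hook that's about to matter for the isoform case.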
Alright, and if we run Inchworm on this, Inchworm is not going to be able to reconstruct both isoform A and isoform B, because every k-mer can be used only once. Each shared k-mer is going to show up in one isoform or the other — whichever one got it. If we had a graphical representation of these isoform structures, it might look something like this. And if isoform A is lowly expressed and isoform B is highly expressed, we might represent that by weighting the edges. In this case, isoform B is highly expressed, so we have this darker edge — let's see if I can actually use my pointer; there we go, yeah, this works. So we have a thicker edge here indicating that there's more read support for taking this path, because there are more k-mers for the more highly expressed transcript.

Alright, so if Inchworm is going to reconstruct a contig from the k-mers represented by these two isoforms, it'll probably just give us this: maybe it starts over here somewhere, takes this path, and gives us the green segment. So it gives us a nice full-length contig for the more highly expressed isoform. And then later on, it might re-seed on a unique k-mer that's in the yellow region. When it pulls out that contig, it's going to pull out mostly just the yellow bits, because these two contigs can have no k-mers in common — every k-mer that Inchworm uses has to be unique. But while they have no k-mers in common, they can have k-1-mers in common. Okay, so that's sort of the trick here: the k-1-mers help define relationships among these parts of the isoforms. If two Inchworm contigs share a k-1-mer, then there's a chance that they might actually be related. And we can confirm that relationship by looking at the reads and asking: do we actually find reads that support the link between these two contigs?

Alright, that's really the goal of Chrysalis. It comes in, takes the Inchworm contigs, and looks for k-1 overlaps. If it finds a k-1 overlap, then it looks at the reads and asks: okay, are there actually reads in the data set that support the junction between these two Inchworm contigs — do they really go together? And if that's the case, then it'll cluster them together. Here's a general overview of the Chrysalis framework. Starting with our Inchworm contigs, we integrate these contigs based on the k-1 overlaps, and we use the read information as what we call glue — the reads are used as glue to stick the contigs together. And we also use the paired-end information here. So if we have two contigs that maybe don't have a k-1 overlap — in this figure it looks like they actually do overlap, but in any case — we use the paired-end information in paired-end RNA-seq data to also try to associate Inchworm contigs with each other as part of the clustering. And once we have these clusters, we build the de Bruijn graph structure for each one. And this is really how we end up with our many thousands of clusters, our many thousands of graphs; this is how we're partitioning all the data.

So in the final step, Butterfly comes in. And Butterfly is actually going to do the transcript reconstruction, given the graphs and given the reads. For every de Bruijn graph, it first goes through that stage of compacting the de Bruijn graph: if we have these long stretches that are unbranched, it'll collapse the nodes down into single nodes that have more sequence information.
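That compaction step is conceptually simple: any run of nodes where each node has exactly one way in and exactly one way out can be merged into a single node. Here's a minimal sketch, reusing the toy edges dict from the earlier de Bruijn example — again just an illustration, with none of the real bookkeeping Butterfly does:

```python
def compact(edges):
    """Collapse unbranched k-mer chains into single 'unitig' nodes."""
    preds = {}                       # reverse edges: node -> set of predecessors
    nodes = set(edges)
    for src, dsts in edges.items():
        for d in dsts:
            preds.setdefault(d, set()).add(src)
            nodes.add(d)

    def chain_start(n):
        """True if n cannot be merged into its predecessor."""
        p = preds.get(n, set())
        return len(p) != 1 or len(edges.get(next(iter(p)), ())) != 1

    unitigs = []
    for node in nodes:
        if not chain_start(node):
            continue
        seq, cur = node, node
        # walk forward while the path stays strictly linear (one out, one in)
        while len(edges.get(cur, ())) == 1:
            nxt = next(iter(edges[cur]))
            if len(preds.get(nxt, ())) != 1:
                break
            seq += nxt[-1]           # k-1 overlap: each step adds one base
            cur = nxt
        unitigs.append(seq)
    return unitigs

# Using build_debruijn from the earlier sketch:
edges, _ = build_debruijn(["GATTACAGATT", "GATTACGGATT"], k=5)
print(sorted(compact(edges)))   # ['GATTAC', 'TTACAGATT', 'TTACGGATT']
```

On that little bubble graph, this collapses twelve k-mer nodes down to three unitigs: the shared prefix and the two variant branches.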
And then it takes the reads and threads them through the graph in order to identify those paths that are best supported by the paths individual reads take, in addition to the read-pair linking information from paired-end sequencing. The read pairs were used earlier during the clustering of the contigs in the Chrysalis stage, and they get used again here in the final reconstruction stage in Butterfly. Right, so Butterfly basically just gives us the resulting sequences.

We have a few examples of how this has been useful in mouse. Mouse is a nice test case for us because we actually have a reference genome, so when we reconstruct transcripts de novo, we can always go back and look at the genome and say: okay, well, how well did we actually do? Here's a very simple case. If we look at the structure we have for one of our graphs, we only have three nodes — this is almost as simple as it gets. In this case, we have a blue node, a red node, and a green node, and we have labels on the edges indicating how many reads actually support linking one node to the next. There are really only two paths we can take in order to reconstruct transcripts from this. We can go from the blue node straight to the green node, and we reconstruct one transcript. Or we take the other path, through the red node, and we end up with this other isoform, with this little red piece. The red piece is short — it's only 130 bases here in this red node. And if we didn't have a genome, then all we'd know is that we presumably have two transcript isoforms, they're different from each other, and one has the red piece and the other one doesn't. But we don't really know what that red piece is. Is it a skipped exon? Is it an alternative donor or acceptor splice site? What does it really correspond to in terms of the overall structure of the transcripts? You really need a genome in order to be able to infer that. And in this case, if we map these two isoforms back to the genome and look at where these sequences fall, we can see that the little red bit is actually a skipped exon — an exon that's skipped in one isoform and not the other. So having a genome, even if it's the genome of a related organism, is really useful for trying to gain more insight into the structure of the transcripts that you've reconstructed de novo.

Can you also let us know how complete RefSeq is? How complete RefSeq is? For mouse. Yeah, we're going to touch on that. But yeah, so for any of these data sets — no, no, this is an important point. This goes into the topic of how messy transcriptomics can be. In general, when we have reference data sets that we're working with for human or mouse — we talked about this yesterday — the human data set, which is GENCODE, has about 200,000 transcripts in it. How many genes is that? I think it was like 50,000 genes or something around that. But typically people ask how many genes there are in human, and there was a contest, and I think 30,000 or so was the number they settled on. But anyway, we're talking about tens of thousands of genes and maybe a couple hundred thousand total transcripts for mouse or human. And when you do a de novo transcriptome assembly, you end up with a lot more than that. There's a lot of transcription happening, and there are a lot of transcripts that you'll reconstruct.
And they're not necessarily included in your reference. There are a lot of reasons for that — I could go on for hours about this. In general, you'll capture a lot of the known protein-coding genes, which are pretty well characterized. But you'll end up with tons of non-coding RNA transcripts expressed from different regions of the genome, and they won't necessarily be in your reference annotation data set. Part of it is that the reference data sets are really based on well-studied tissues. Once you start digging into different tissues, or even into single-cell transcriptomes, you end up finding lots of novel transcription that's just not captured by the reference data set. Again, most protein-coding genes are pretty well captured, but there's just tons of non-coding transcription that is presumably biologically relevant. A lot of it is transcription where the act of transcription itself is what matters — and that happens in a lot of different ways. Enhancers are transcribed, for example. There's lots of transcription that happens that's biologically relevant, but whether the actual RNA product itself has a specific function is highly questionable. So how many genes do we have? How many transcripts are there? These are questions that will never be answered in a way that most people will agree on, at least within my lifetime. So just be aware of that. How's that — is that good enough? That's all right. No, it's good.

Just as an aside: I work on reference data sets, but I also work on lots of non-reference data sets. One of the studies we published most recently is the salamander transcriptome — the axolotl, the Mexican salamander. It's a model organism for studying limb regeneration, so it's really fascinating. But the genome is huge: 30 billion bases, or 10 times bigger than the human genome. We assembled the transcriptome for this and ended up with over 1.4 million transcripts. It's got a big genome, but how many genes does it have? It's thought to have around 30,000 genes, like human or mouse; the genome is just huge because of an ancient repeat expansion due to mobile elements. So why did we end up with 1.4 million transcripts? We started off with over a billion RNA-seq reads, and the more reads you sequence, the more transcripts you'll end up reconstructing. There's lots of stuff that's very lowly expressed. If you go deep enough — even in the human genome — it looks like the entire genome is basically transcribed. There's a lot of that, too. So, yeah, just things to be aware of.

But you can start applying things like minimum expression thresholds — let's say one transcript per million, or even one FPKM, which is thought to represent around one transcript per cell, if you want to consider that a reasonable threshold. If you filter out all the stuff that's less than, say, one TPM, you'll end up with more like 100,000 transcripts and maybe 30,000 genes. But there's just tons of stuff down there: once you go deep enough, you have enough material to actually reconstruct transcripts from, and you start seeing lots of regions of the genome lighting up that maybe you hadn't expected. And is it biologically relevant or not? Maybe. Different cell types, different tissue types — there are lots of really interesting things happening that you'll only see once you start digging in very deeply with a very sharp focus on that cell type. All right, back to business. So, another example with Butterfly.
Here we end up with a case where we don't actually have two isoforms — we actually have transcripts from two paralogous genes. But these genes are highly similar, like 95% identical. They have some regions that are unique and some regions that are shared, and we find that they end up getting clustered together into the same graph. So here's the graph, and we can see we have regions that are shared here in the middle and regions that are unique to each. But because we have good paired-end information, we can use the paired-end linkage information — and if we had longer reads, we could use that too, but in this case it was short-read data — so we primarily use the paired-end read information to tease these apart and basically unzip these structures into their separate paralogous transcripts. And this is just to show what we reconstructed and what it looks like when you map it to the reference genome: you can see they're basically spot-on with the reference transcripts.

As I mentioned earlier, given the option, strand-specific data is absolutely preferred. Any chance you have of getting strand-specific RNA-seq, you should capitalize on it. And there are a couple of reasons for that. One, as I mentioned earlier, is biological: you're able to separate sense transcription from antisense transcription, and I have a nice example that shows where that's biologically relevant. But another reason is that it simplifies the assembly process. If you have double-stranded data, then any given k-mer you pull out is going to be represented in the graph in both its forward orientation and its reverse-complement orientation, and they basically mean the same thing. If we have strand-specific data, then we can actually tell the difference between the forward k-mer and its reverse-complement k-mer — they're genuinely two separate nodes in the graph. All right? So that allows us to keep these things separate, and that greatly simplifies our ability to reconstruct the final transcripts.

A while back, Josh Levin, who's at the Broad Institute and is one of my collaborators, did a very well-known study looking at the different methods that are available for doing strand-specific RNA-seq. And he settled on this dUTP second-strand marking method as the leading protocol. I'll walk you through briefly what that is. This is basically the standard way that strand-specific RNA-seq is being done these days, and there are really good kits for it. So we start with our RNA template. The first thing we do is first-strand synthesis with just normal dNTPs to generate the first-strand cDNA. But then when we do the second-strand synthesis, instead of using thymine, we throw in uracil. And we know uracil is not a standard DNA base — it's an RNA base. You don't actually find uracil as a naturally occurring base in DNA, as far as I know; that would be a dangerous thing to have happen in the cell. But there are cases where DNA can get damaged and cytosines end up being deaminated to uracil, and that's a problem — if it's not repaired, it can lead to a mutation. But we have really good DNA repair enzymes that will recognize: hey, if I see a uracil in DNA, I know it doesn't belong — and it'll pop it out. And then the rest of the DNA repair machinery will come in, replace it, and restore it to its original base. And in molecular biology, we can actually leverage these enzymes.
We can put them in a tube and sell them as a kit. And any place we have uracil in our second-strand synthesis, we can basically remove it using these DNA repair enzymes. So this is a bit of a trick. Let's take a step back here. When we do the second-strand synthesis and include the U's in the second strand, we then do adapter ligation — you can see we have the yellow and the red Y-adapters. And if we didn't remove the U's and we went ahead with sequencing, say we started the sequencing at the yellow primer: then our sequencing would go into the blue for some sequencing products, and where we start over here in the yellow, it would lead into the pink for other sequencing products. We'd effectively be sequencing both strands. But if we use this fancy DNA repair enzyme to remove the U's, this pink strand cannot be used as a template in PCR — we're essentially destroying the pink strand. So now all the sequencing that we do from the yellow has to go into the blue; we'll never see any yellow going into pink. That's really the key point here, and it's how we get our strand-specific sequencing: only one of the strands can actually be used as a template for sequencing downstream. It's really just very, very clever.

So why is this important? I have a couple of examples here. It's important for a lot of things — sense and antisense, of course. But if you're working with compact genomes, where the genes are very closely spaced next to each other, there's a lot of transcription where the transcription from one gene can run off into the other gene, and having strand-specific RNA-seq can really help tease them apart. So here's a case from one of our favorite organisms that we do a lot of work with, S. pombe. Here we have strand-specific RNA-seq, and these are just the reference genes down here. We have one reference gene that's going to the right, and another gene on the bottom strand going to the left. And since we have strand-specific RNA-seq, we can separate out the coverage and look at the coverage on the top strand versus the bottom strand separately. Here we can see the top-strand coverage, and we can see that the transcription overlaps in the region between these two genes. But because we have strand-specific RNA-seq, we can tease that apart. And when we do the actual reconstruction using Trinity, we do reconstruct these transcripts as separate transcripts, each in its proper orientation, and they're allowed to overlap here in the middle — it's not a problem. But if we didn't have strand-specific RNA-seq, what would we end up with? Yeah, one big transcript. If we look at the coding regions, we see there's a huge coding region on the upper left and a huge coding region on the bottom right, but other than that, we'd just reconstruct it all as one big transcript. And that's really not ideal.

Here's another example, again in S. pombe. In this case, we have three genes, all on the bottom strand, and we have our strand-specific RNA-seq coverage profiles in the middle. We can see that, at least for the genes on the ends here, the RNA-seq coverage we have is consistent with the coding-region orientation of the reference gene. But in the middle here, for this meiosis-specific protein kinase, the transcription that we have is actually on the other strand.
If we didn't have strand-specific data, then we wouldn't recognize that basically every bit of transcription we have here is antisense transcription. And it actually turns out to be important for this organism that many of these meiosis-specific genes are down-regulated by antisense transcription. So there's important biology that we can capture here if we have access to the strand-specific methods.

Now, when we run Trinity — and you'll have some experience doing that shortly, if you haven't used it before — the main output is just a multi-FASTA file. So there's nothing super fancy to look at; it's just a FASTA file, like any other FASTA file. There are a couple of things that are useful to note here, though. We have the accessions for the transcripts — and I've got to update this slide with our current representation — but basically you have an accession value for each transcript, and the prefix of that accession is meant to represent a gene identifier. And it's a gene in air quotes, because in a lot of cases it's not like a real gene: these are just transcripts that are clustered together, and presumably they come from the same gene, but we're still not good at teasing apart paralogous transcripts from different isoforms of the same gene. They're related in some way — they have some sequence that's shared in common — and they'll have different isoform numbers. So in this case, here we have a seq1 and we have a seq2, but they share the same prefix. So we know they're basically from the same gene, but they're different isoforms of that gene; they have some sequences that are shared and some sequences that are unique. (There's a short sketch at the end of this section showing how to pull that gene-to-isoform mapping out of the accessions.)

And we can look at the path representation here. This is basically indicating the nodes in the graph — just a node number and the region of the sequence that corresponds to that node in the graph. So if we compare these two paths, we see the paths are the same, except this one here has this blue bit in the middle that's not in the other one. That's the sequence that's really unique to this isoform. In this example, this blue piece is actually the red piece from the mouse isoform I showed you earlier — the skipped exon that's in one isoform and not found in the other. So this can be somewhat useful.

And there are other tools. This is a tool called Bandage that we're not going to have time to play with, but it has a nice GUI: you basically upload your Trinity FASTA file into Bandage, and it'll give you a graphical view of pieces that are unique, pieces that are shared, and it'll give you a pretty structure. So just Google "Bandage Trinity assembly". It's pretty cool software, and they have different versions — Windows, Linux, Mac — but you can basically upload your Trinity assembly into this. We could try it later, depending on whether we have time for it. And — yeah, this picture is actually a genome, not a Trinity transcript; it would be a crazy Trinity transcript if that's what it was. But basically, all of your genes will be shown down here, and if you have alternatively spliced isoforms, you can click on one that has interesting structure, or zoom in on it, and gain some more insight into it.

All right, any questions? How often do you have k-mers that are shared between completely different transcripts just by random chance, right — so you get a spurious connection? All right, yeah, it doesn't happen that often, unless you have large gene families.
What is K? K is 25 for what we use in Trinity, and a 25-mer is relatively unique. Once you start getting below, like, 19, I think you start running into trouble. But yeah, 25 is a relatively unique k-mer size. You still have issues, though. Olfactory receptors in mice — you've got like a thousand of these things, and some are more closely related than others. Other genes, like kinases, have shared kinase domains. There will be certain domains that just show up at high frequency, and occasionally a k-mer will show up in more than one gene. So it happens, but in all of the experiments we've done trying to figure out the optimal K value for Trinity, 25 seemed to be the optimum.

Yeah, yeah, sure. So Illumina is pretty good — there aren't too many mysteries — but what if we're trying to bring in PacBio data, where you trade read length for error rate? How do you combine long reads that are error-prone with a bunch of short accurate reads? Right, right. So eventually there will be a version of Trinity that can handle a wide variety of different data types, but we're talking at least a couple of years away. The latest version of Trinity is actually capable of incorporating PacBio reads along with Illumina data — but these are PacBio reads that have been carefully error-corrected. So you have to go through the whole Iso-Seq pipeline or use some other error-correction strategy, because Trinity is not going to be able to handle long reads that have lots of errors — you know, like 85% identity to the reference; 15% error is another way of saying that. But yeah, ideally, in terms of the longer roadmap for where Trinity is headed: nanopore, dirty PacBio, and other read types.
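And since the question of the accessions came up earlier: because the gene identifier is just a prefix of the transcript accession, pulling the gene-to-isoform mapping out of the Trinity FASTA file takes only a few lines. Here's a small sketch — it assumes the current-style TRINITY_DN..._c..._g..._i... naming (older assemblies used a compN_cN_seqN scheme, where the same strip-the-suffix logic applies), and the file name is just a placeholder:

```python
from collections import defaultdict

def gene_to_isoforms(fasta_path):
    """Group Trinity transcript accessions by their gene-level prefix."""
    genes = defaultdict(list)
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                acc = line[1:].split()[0]      # e.g. TRINITY_DN1000_c115_g5_i1
                gene = acc.rsplit("_i", 1)[0]  # drop the isoform suffix
                genes[gene].append(acc)
    return genes

# e.g. report how many isoforms were reconstructed per "gene"
for gene, isoforms in sorted(gene_to_isoforms("Trinity.fasta").items()):
    print(gene, len(isoforms))
```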