So this is just going to be, hopefully, a quick introduction to the idea of expression or abundance estimation. And we've already started thinking about this idea, right? We were looking in IGV at a view very much like this, looking at a single gene locus, and starting to ask questions like: does this gene seem to be differentially expressed? So you might argue that this view is showing some evidence for down-regulation in the sample in the bottom track, or up-regulation in the sample in the top track, depending on your point of view. Even though superficially it looks like similar amounts of read alignments in the view, you know that if we scrolled along here, based on these coverage levels, there are a lot more alignments up here than there are down here. And the coverage level has been set to the same scale in both tracks, which we did manually, right? So there's just a whole lot more data, more alignments, which represents more fragments of cDNA, which represents more originating RNA molecules from our sample. So that suggests a difference in expression levels between these samples.

It's also, in this case, suggestive of some end bias, right? You're seeing a lot more alignments here at the 3' end relative to the 5' end. I think that tells us something about the preparation of this library: degradation at the 5' end. Anything else? Yeah, it was probably poly-A selected, so it's a combination of those two ideas, right? You've got poly-A selection of degraded products, so you capture a lot more of the 3' end than the 5' end.

Okay, so you've probably heard of RPKM or FPKM. This is one of the most common summary statistics; I think it was introduced in the first RNA-seq paper ever, the Barbara Wold lab paper, whenever that was. So basically the idea is counting reads, counting alignments to transcripts or genes, but then normalizing for the size of the transcript and the depth of the library. Right from the first time anyone sequenced one RNA sample, then sequenced another, and wanted to compare the relative amounts of alignments to a gene in one sample versus the other, they thought: wait, what if these were sequenced to different depths? Or what if I want to compare the relative amount of alignments to gene A with gene B, but gene B is, I don't know, 10 times as big, so it just generates a lot more fragments that can be sequenced when you fragment that RNA? So almost immediately people realized we need to normalize for these things.

It's RPKM because originally we were doing single-end sequencing, generating one read per fragment, but then at some point we started doing paired-end sequencing, so you're getting two reads per fragment. So we just changed the terminology to count fragments: you don't count both reads of a read pair separately; when you calculate your FPKM, you use those two reads as evidence of one fragment. So that's just a logical distinction between RPKM and FPKM, but effectively they're the same idea.

So what is FPKM? We kind of talked about this. Why not just count the reads? Because while the relative expression of a gene is proportional to the number of cDNA fragments that originate from it, raw counts are biased towards larger genes and towards samples with greater total sequencing depth. So we want to normalize for that.
So what you typically do is normalize per kilobase of transcript, which is where the K comes from, and per million mapped reads in your library, which is where the M comes from. So as we said, FPKM attempts to normalize for gene size and library depth; RPKM is basically the same thing. This is an example of the formula: C is the number of mappable fragments for a particular gene or transcript, N is the total number of mappable fragments in the library, and L is the number of bases in the gene or transcript, i.e. its size. Then FPKM is just C divided by (N × L), times 10^3 times 10^6, or equivalently 10^9 × C / (N × L). There are a bunch of different ways you can express this that all mathematically equate to the same thing; you can read more about that in these Biostars postings.

TPM has become equally if not more popular than FPKM. It's just a slight variation on this idea: you're still normalizing for gene size, you're still normalizing for library size, you're doing almost exactly the same thing, with a slight tweak in the order of operations. For FPKM, you determine the total fragment count and divide by a million to get a per-million scaling factor, then divide each gene's fragment count by that factor, and then divide the resulting fragments-per-million value by the length of each gene or transcript in kilobases. For TPM, you first divide each gene's fragment count by the length of the transcript in kilobases, so you get fragments per kilobase; then you sum all the fragments-per-kilobase values for the sample and divide by a million to get a scaling factor; and then you divide each gene's fragments-per-kilobase value by that factor.

The details here are not terribly important. There's a slightly different order of operations and you end up with values that are very, very similar, but TPM has one nice property: the sum of all the TPMs in a sample is always the same (one million). That makes it easier to compare values between genes in a sample and across samples, because there's a fixed denominator. Whereas with the FPKM method, if you add them all up, you can get a larger total in one sample than in another. Depending on how you think about it, that could even be correct: maybe there really is more total expression in that sample than in another. But for the most part we're happier to make the assumption that the total expression is about the same, and what we're more interested in are the relative changes in expression of certain genes or pathways. So TPM is, I guess, mathematically just slightly more convenient. At one point we produced a plot, which you'll see later, of FPKM versus TPM; at least in our data they're extremely highly correlated, so it's a pretty subtle difference. But generally TPM is just considered more favorable.
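To make the order-of-operations difference concrete, here's a minimal sketch in Python. The counts and lengths are made-up numbers for hypothetical genes, not anything from our actual data; it just shows that TPM values sum to one million by construction, while FPKM totals vary from library to library.

```python
# A minimal sketch of the FPKM vs. TPM order of operations, using made-up
# fragment counts and transcript lengths for three hypothetical genes.
counts  = {"geneA": 300, "geneB": 1500, "geneC": 200}    # C: mappable fragments per gene
lengths = {"geneA": 1000, "geneB": 10000, "geneC": 500}  # L: transcript length in bases

N = sum(counts.values())  # total mappable fragments in the library

# FPKM: scale to fragments per million first, then divide by length in kilobases.
fpkm = {g: (counts[g] / (N / 1e6)) / (lengths[g] / 1e3) for g in counts}

# TPM: divide by length in kilobases first (fragments per kilobase),
# then rescale so the values for the whole sample sum to one million.
fpk = {g: counts[g] / (lengths[g] / 1e3) for g in counts}
scale = sum(fpk.values()) / 1e6
tpm = {g: fpk[g] / scale for g in fpk}

print(round(sum(tpm.values())))   # 1000000 in every sample, by construction
print(round(sum(fpkm.values())))  # varies from library to library
```

Note that both measures apply the same two normalizations, just in a different order; that's why the values end up so highly correlated.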
So we're going to use software called StringTie. Much like with HISAT, this is a very complicated and detailed algorithm, for which you could probably take not only a whole lecture or workshop but probably a whole university course, or multiple university courses actually, to learn the underlying mathematics and theory. We're not going to cover it at that level; we're going to look at it at a very high conceptual level.

The way it works is: you take your read alignments, and you can optionally create "super-reads", merging reads together into longer reads or longer alignments. Then you create what's called a splice graph, extract the heaviest path from that splice graph, construct a flow network, and compute the maximum flow through that flow network to estimate the abundance for each isoform. Then you iteratively update the splice graph by removing the reads that were assigned by the flow algorithm in the previous calculation of maximum flow, and that process repeats until there are no reads left. We'll look at an example of what that looks like. But essentially, you're thinking about the connections between exons, right?
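To make that heaviest-path idea concrete before we go further, here's a toy sketch, not StringTie's actual code: a little four-exon splice graph whose edge weights are invented counts of fragments supporting each exon-exon connection, and a simple dynamic program that pulls out the best-supported path through the graph.

```python
# A toy illustration of the heaviest-path step, not StringTie's actual code.
# Nodes are exons; edge weights are invented counts of fragments supporting
# each exon-exon connection in a hypothetical four-exon locus.
edges = {
    ("exon1", "exon2"): 2,   # minor isoform includes exon2
    ("exon1", "exon3"): 9,   # major isoform skips exon2
    ("exon2", "exon4"): 2,
    ("exon3", "exon4"): 8,
}
order = ["exon1", "exon2", "exon3", "exon4"]  # topological (genomic) order

def heaviest_path(edges, order):
    """Dynamic programming over the splice graph (a DAG): best[n] is the
    weight of the heaviest path ending at node n; back[n] is its predecessor."""
    best = {n: 0 for n in order}
    back = {n: None for n in order}
    for (u, v), w in sorted(edges.items(), key=lambda e: order.index(e[0][0])):
        if best[u] + w > best[v]:
            best[v] = best[u] + w
            back[v] = u
    end = max(order, key=lambda n: best[n])
    path = [end]
    while back[path[-1]] is not None:
        path.append(back[path[-1]])
    return list(reversed(path)), best[end]

path, support = heaviest_path(edges, order)
print(path, support)  # ['exon1', 'exon3', 'exon4'] 17: the best-supported isoform
```

In the real algorithm, the fragments assigned to that path are then removed and the extraction repeats on the reduced graph until no fragments remain.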
So this is all about the question of how to estimate the expression level of different isoforms. If you're just trying to estimate the overall level of genes, it's fairly easy, right? You have a gene locus, you figure out all the reads that map to that locus in one isoform or another, and you come up with a TPM- or FPKM-style measure; that's, by comparison, very straightforward. When you want to estimate the expression level of individual isoforms, it becomes much more complicated, because now, for all these read alignments, some reads map to exons that are shared between transcripts, and some reads map to splice junctions, exon-exon junctions, that are shared by maybe only a subset of isoforms or only one. And you have to reverse-engineer the possible explanation for this pattern of alignments that you see: how can I extract from it isoform-by-isoform expression levels? It turns out that is a very hard problem, and the short paired-end reads, while they're good, are not everything we would wish for to really do it properly, or do the best job that we could.

The other problem is that your results will vary a lot depending on the complexity of the locus (how many different isoforms are we talking about?) and the expression level, because that affects how much sampling of that locus you get. If you have a gene that's pretty simple, say it has two isoforms, one with an exon-skipping event and one without, then maybe it's relatively straightforward to look for the reads that define that exon-skipping event and assign expression levels to those two isoforms. And if you have good expression from that gene locus, maybe it's relatively straightforward to come up with an accurate estimate for both. But in a lot of species, like human, you sometimes have 5 or 10 or 20 or 50 different isoforms, with subtle little differences between them. And then maybe that gene is only modestly expressed, so you have spotty RNA-seq alignment coverage across the locus. Just because of insufficient sampling, you haven't seen some of the exon-exon junctions that define certain unique isoforms, and so from that you conclude, what, that the isoform isn't there, and you assign a level of zero to it? It gets really challenging really quickly. And the relatively short reads we're using are a lot of the time still ambiguous: they're not long enough to distinguish between all the possible isoforms.

So what we really want is full-length, transcript-level sequencing, and I think Malachi alluded briefly to the fact that we're getting much closer to that being a reality: you will produce full-length cDNA, feed it through a Nanopore or PacBio instrument, and just fully sequence isoforms. And there will be none of this craziness that we're going to review here, because it'll be a pretty simple exercise of saying: yep, this is the full length of this transcript, it matches an isoform I know about, or maybe it's a novel isoform, and I can estimate its expression pretty straightforwardly. The thing that's holding us back is the cost of the throughput you need on the Nanopore or PacBio: you still need lots of counts, lots of reads, whether they're long or short, to come up with a good expression estimate. As soon as we can feed a large amount of data through the Nanopore at an effective cost, something like $300 or $400, this approach will probably go away and we will just be doing long reads. But we're not there yet. So for now we're depending on the incredibly smart people who develop software like this, who do the crazy backflips to try to infer what's happening at the individual transcript level.

What they're doing is creating a flow network associated with each transcript. We've got a very simple example here: a transcript with three exons, exons 1, 3, and 5, in orange, green, and red, and a flow network represented by these colored nodes. There are in this case 15 fragments, so each of these gray bars represents one of the fragments that has been aligned to this cartoon gene locus. And then you make connections between the nodes of the flow network based on the alignments: if an alignment starts in exon 1 and ends in exon 3, you add that as a connection between exon 1 and exon 3, and so on. You can similarly connect exon 1 to exon 5, or exon 3 to exon 5. So you map these values onto the nodes of the flow network, and then you figure out the heaviest path through the graph. In this case it's fairly obviously this path: there are a lot more counts along here than there are along here. And then, using flow theory, you compute a maximum flow through that flow network, and from that you assign an estimated abundance. You then remove the reads that contributed to that particular path through the graph, and you repeat the process with other paths through the graph until you've extracted all the possible paths.

So we're using a number of different areas of math here: graph theory, flow theory, some optimization theory, and some heuristics to estimate abundance levels through this graph. And this conceptually maps onto the exon-exon junction connections that you're getting from your alignments. Just to summarize, StringTie is using basic graph theory (the splice graph), heuristics (the heaviest path), more graph theory (the flow network), and optimization theory (the maximum flow). If you want to read all about this, there is a very good and detailed StringTie paper with the definitions and math underlying it. It works relatively well.
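To make the maximum-flow step a little more concrete, here's a from-scratch sketch of a classic max-flow algorithm (Edmonds-Karp) on an invented network shaped like the three-exon example above. The node names and capacities (fragment counts per connection) are made up, and this is an illustration of the general idea, not StringTie's actual formulation, which is more involved.

```python
from collections import deque

# A toy maximum-flow computation, loosely modeled on the three-exon example
# above. Node names and capacities (fragment counts per exon-exon connection)
# are invented for illustration.
capacity = {
    ("source", "exon1"): 15,
    ("exon1", "exon3"): 10,
    ("exon1", "exon5"): 3,
    ("exon3", "exon5"): 10,
    ("exon5", "sink"): 15,
}

def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly find a shortest augmenting path with BFS
    and push the bottleneck amount of flow along it."""
    residual = dict(capacity)
    total = 0
    while True:
        # Breadth-first search for any path with remaining capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for (a, b), c in list(residual.items()):
                if a == u and c > 0 and b not in parent:
                    parent[b] = u
                    queue.append(b)
        if sink not in parent:
            return total  # no augmenting path left
        # Recover the path and its bottleneck capacity, then push flow.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] = residual.get((v, u), 0) + bottleneck
        total += bottleneck

print(max_flow(capacity, "source", "sink"))  # 13 fragments of "flow"
```

The flow assigned along each path is what gets translated into an abundance estimate, and those fragments are then removed before the next round.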
And one thing that you may come across is the desire to merge together gene structures. Because we're doing this sophisticated StringTie approach, we can not only estimate the expression level of known isoforms, we can also infer novel isoforms. A complication of that is that you can run StringTie on one sample in a certain mode and come up with a list of transcripts, both known and novel, and then run it on another sample and come up with a different set of known and novel isoforms. Well, how do I compare the abundance levels between these? They might not even have the same transcripts discovered in them, right? So you can do a StringTie merge to create a common set of transcripts that have been discovered across samples, and then rerun the expression estimation on that common set of transcripts.

We are going to mostly be doing the so-called reference-only mode, so really just estimating the expression levels of known genes and transcripts, but we provide a whole section on how to run it in reference-guided or de novo modes if what you really want to do is discover novel isoforms. We can also do a comparison: if you are inferring transcript structures from your RNA-seq data on different samples and you want to ask how they compare, there's a tool to help you do that called gffcompare.
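Just to sketch what that looks like at the command line: the file names below are placeholders rather than our actual course files, so check the StringTie and gffcompare documentation for the full set of options.

```bash
# Reference-only mode: estimate abundance for known transcripts only (-e).
stringtie -e -G ref_annotation.gtf -o sample1/transcripts.gtf sample1.bam

# Reference-guided mode: allow novel isoforms by dropping -e.
stringtie -G ref_annotation.gtf -o sample1/assembly.gtf sample1.bam
stringtie -G ref_annotation.gtf -o sample2/assembly.gtf sample2.bam

# Merge per-sample assemblies into one common set of transcripts,
# then re-estimate expression against that merged set.
stringtie --merge -G ref_annotation.gtf -o merged.gtf sample1/assembly.gtf sample2/assembly.gtf
stringtie -e -G merged.gtf -o sample1/requant.gtf sample1.bam

# Compare assembled transcripts back to the reference annotation.
gffcompare -r ref_annotation.gtf -o gffcmp merged.gtf
```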