So far, we've learned some background on RNA-seq, library construction, and how the sequences are generated. Now we get to the important part, where we take the sequences and do alignment and visualization. In this module, the focus is on the challenges we face with RNA-seq alignments and the alignment strategies that are available out there. We will focus on Bowtie and TopHat and go over how these tools work. After that, we'll go over the output files from the alignments, the BAM and BED formats. For those of you who are not familiar with those formats, we'll go through what they look like, how you can manipulate them, and what you can do with them for further downstream analysis. The final part will be alignment QC. We have talked about QC at various points in this pipeline: at the library construction level, when we talked about RINs, and at the FASTQ level, where you assess the sequences themselves to see how they were sequenced. Here, we're going to look post-alignment. After you align, after you figure out where these sequences are coming from in the genome, we can do a lot more QC assessment at that level. I'll give you examples of QC metrics you can look at to assess the quality of your sample and to give you some biological information about the sample you're looking at. Now, when it comes to RNA-seq alignment, we face a few challenges. One is computational cost, a topic that was touched on multiple times today. Each lane of sequencing, depending on what kind of sequencer you use — for example, a HiSeq 2500 — can generate 300 to 400 million reads, and usually you run one to two samples per lane.
So we're talking about hundreds of millions of reads that you're going to have to take and align to the genome. That process is computationally expensive, because, just like we're doing today, you're going to need compute instances such as Amazon's, and the process is also time-consuming. Not only that, but the amount of data you generate is large. You get alignments, you get fusions — a lot of files that you have to store, so the storage footprint is much larger here compared to other data types, such as DNA. When it comes to alignment, we also get to deal with introns. These are large gaps that the aligner has to recognize. Aligners for DNA sequencing don't have to deal with these large intronic gaps when mapping reads back to the whole genome, but they have to be considered when you're dealing with RNA-seq data. Also, sometimes you ask: if I run it once, will I be done? No. There are a lot of updates to the reference, a lot of updates to the gene annotation, and a lot of tools come out that look at different aspects of RNA-seq data. So chances are you will go back and reprocess your data multiple times, depending on how many updates happen and how long your project goes for. Now, RNA-seq mapping tools can be classified into three categories: de novo assembly tools, tools that align to a transcriptome reference, and tools that align to a whole-genome reference. Which strategy is best? That depends on what data you have and what you're trying to achieve. For example, if you do not have a reference, or your reference is not sequenced, then you tend to do de novo assembly, and there are tools that will do that for you. De novo assembly is also useful in another situation.
If there's a lot of complexity or polymorphism in your data set and you don't want to force it into a reference — we talked about variation, and how the reference only represents certain genomes, and you don't want to force your variation into that one genome — you can do de novo assembly, because that will deal with the complex polymorphisms. There are also tools that align to the transcriptome. These are references that have all the possible combinations of transcripts and isoforms in the genome, and you simply take your mRNA, or cDNA, and map it back to the transcriptome without having to deal with assembling the transcripts or isoforms. But the most popular type of aligners are the ones where you take the cDNA, align it back to the whole-genome reference, and do some sort of assembly of the isoforms. And again, each of these strategies comes with its own complexity and its own set of tools and packages that you can try. Now, in terms of tools, here is a timeline of the different aligners available out there and when they were developed. The ones for RNA-seq are highlighted in red, and this is from 2015, so it's pretty recent. You can see that there are so many different aligners being developed every year. For this workshop, we have decided to pick two tools. We're doing TopHat, and I don't know if we'll have time to do STAR, but the commands are available. TopHat is a bit more established. It's been in the community for a while, and we like to start with TopHat because it's very established and user-friendly. There's a big community out there, so if you run into a problem or have a question, most likely it has already been answered — someone else has faced that problem.
And yeah, so that's TopHat. STAR is another great tool that was developed more recently. It's a lot faster — that's one of the biggest advantages of using STAR as an aligner; it's a lot faster than TopHat. But as I said, we'll get to use both. Again, there are a lot of other aligners, but once you learn to use one tool, you can take those skills and apply them to any other tool out there. It's just a matter of learning the specific parameters that come with each tool. Should I use a splice-aware or an unspliced mapper? Going back: we know that the RNA we have does not contain intronic regions, but if you plan to align it to a whole-genome reference, that reference is DNA — it contains introns and exons — so you will have to pick a tool that is splice-aware. If you're aligning to a transcriptome reference, then you don't have to pick a splice-aware tool. Now, the Bowtie and TopHat package is a splice-aware aligner and splice junction detector. The way it works is that Bowtie is the backbone aligner for this package. Bowtie is great at aligning short reads; the thing Bowtie cannot do is deal with very large gaps — those intronic gaps. That's why TopHat works as a wrapper: it uses Bowtie as the aligner and takes all the information from the alignments to piece things together and create the splice junctions. I'll go through an example of how it does that, one step at a time. So let's take this as an example. Let's assume we have two reads, read X and read Y, and this right here is the reference. In this reference we have an exonic region, an intronic region, and another exonic region: exon one, intron, exon two. We're going to assume that the reads we're dealing with are long enough that they can span multiple exons. Read X in this example spans two exons, exon one and exon two.
But read X, again, as a reminder, is cDNA coming from mRNA, so there is no intronic region in that piece. Read Y, we can assume, is a read that spans one exon. The way it works is that Bowtie first tries to align all the reads to the whole genome. Anything that maps perfectly goes into an aligned bin. Anything that partially maps — or partially fails to map, depending on how you look at it — goes into an unaligned bin, because those reads potentially span multiple exons, and TopHat will deal with them later. TopHat takes all the unaligned reads from Bowtie and tries to split them into smaller segments. It takes read X and splits it, in this case, into three segments. You get to decide how many segments you want, because you get to decide the length of each segment; by default it's 25, I believe — I don't know if that has changed. Then Bowtie aligns each of these segments to the whole genome again. You will notice that while the full read previously did not map, when you break it down, the first piece maps to exon one, the second piece still fails to map because it spans the exon-exon boundary, and the third piece maps. TopHat collects that information from Bowtie and tries to estimate the site where the splice junction is actually happening, based on where the small segments mapped to the whole genome. It does that for all the reads that did not map initially, and builds a sort of index of all the possible splice junction sites. Once it comes up with that list, it takes all the reads and aligns them again to the sites it created — to the splice library.
And while it's doing that, it also looks at the coverage distribution across the different pieces, to make sure everything is consistent. So is that process clear — the alignment process and the splice junction detection? Questions? Okay. [Question: what happens to X2 in this case?] The reason it's doing that is to figure out where the splice junction is happening. Because X1 aligned, X2 did not align, but X3 aligned, it knows the splice junction is between X1 and X3. It estimates that from the mapping and non-mapping information of these little segments. You're using it to obtain the splice sites. Once you get the predicted splice sites, you go back and align all the reads — the full reads, not the chunks — to those splice sites, to see if they actually align properly and whether the splice sites that were detected are real. Yes. [Question: what if a segment like X1 matches multiple locations?] I'm not sure exactly how it deals with that, but I'm sure there is a filter for pieces that map to multiple locations. Also, there is a different way of assessing mismatches, I believe, depending on the location within the read. At the beginning of the read it allows a certain number of mismatches, but at the end of the read it allows more, just because the end of the read is known to be prone to technical errors more than the beginning. So I'm sure it deals with that aspect of multi-mapping. And when you're deciding the segment length — something we'll do in the integrated assignment — you don't want the segments to be so short that they're not unique anymore and map to many different places in the genome, because that will introduce a lot of false splice junctions into your data. Okay. [Question: does it do the cycle once?] It does the cycle once, yes.
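The segment-splitting step just described can be sketched in a few lines of pure Python. This is only an illustration of the idea, not TopHat's actual code: the 25 bp segment length is the default mentioned above, the per-segment alignment results are hard-coded stand-ins for what Bowtie would report, and all names here are invented.

```python
# Illustrative sketch of TopHat-style segment mapping (not TopHat's real code).
# A read that failed to align as a whole is cut into fixed-size segments;
# an unaligned segment flanked by aligned ones brackets a candidate junction.

SEGMENT_LENGTH = 25  # TopHat's default segment size, per the lecture

def split_read(read, seg_len=SEGMENT_LENGTH):
    """Cut a read into consecutive segments of seg_len bases."""
    return [read[i:i + seg_len] for i in range(0, len(read), seg_len)]

def candidate_junctions(segments, aligned):
    """Given per-segment alignment success (True/False), report segment pairs
    that bracket an unaligned segment -- a candidate splice junction."""
    junctions = []
    for i in range(1, len(segments) - 1):
        if aligned[i - 1] and not aligned[i] and aligned[i + 1]:
            junctions.append((i - 1, i + 1))
    return junctions

read_x = "A" * 75                     # a hypothetical 75 bp read
segments = split_read(read_x)         # three 25 bp segments
aligned = [True, False, True]         # pretend Bowtie mapped segments 1 and 3
print(candidate_junctions(segments, aligned))  # junction sits between segments 0 and 2
```

In the real pipeline, the estimated junction coordinates from this step feed the splice library that the full reads are realigned against.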
When we run the alignment, you will see that it does this for read one and read two, and it does it once. We'll look at the output while we're aligning, and I think the process will make more sense as you follow the log during the alignment. Okay. Now, should I allow multi-mapping in TopHat, or in RNA-seq generally? It really depends on the application. When you're dealing with DNA-seq, for example — oh, sorry, question. [Question about the previous slide: what if a read doesn't span a splice junction within one gene, but is actually part of a fusion?] For fusions, it's a different algorithm — deciding whether something is a fusion rather than splicing. You have to specify it: if you're looking for fusions, the parameters it uses — the thresholds for the expected distance — change. Normally it expects, based on your library size (you tell it the library size, or it tries to predict it), a certain distance between the two pieces. But if you turn the fusion option on, that expected-distance threshold is increased, and then it can consider the event a fusion. Sorry — so, multi-mapping. With DNA-seq, it's common that a read can map to multiple places. Depending on the tool you use and what you want to do, sometimes you can report all the places a read maps to, and the tool can either randomly pick among the best alignments or pick the one with the highest quality — and sometimes multiple alignments tie for the highest quality, so one is picked at random. With RNA-seq, it really depends on what you're planning to do. If you're planning on doing variant calling, I would do the same thing as for DNA and not allow multi-mapping. If you're doing expression, you don't want to distort the dynamic range of your expression.
So it's recommended that you keep multi-mapping on, so the reads can map to multiple places, and the same with fusions: it's recommended that you keep multi-mapping. That's an option you will find in TopHat; when we run the commands, you'll see there's an option you can turn on and off. Now, once you run the alignment, what do you get out of it? You get a SAM file — SAM stands for Sequence Alignment/Map format. A SAM file is just a text file, and a BAM file is the compressed, binary version of a SAM file. It's recommended that you keep the BAM version, because you save space and you don't lose any information going from SAM to BAM. So it's safe to keep the BAM and get rid of the SAM if you can. There are a lot of tools you can use to convert SAM to BAM and back. One that I use all the time is SAMtools, if you've heard of it — I use that a lot to convert back and forth. When you try to view a BAM file, you can't just simply open the file, because it's binary. You're going to have to use a tool that converts it to SAM, or, like with zcat, you can stream the reads and look at them. However, that's not recommended, because you will get millions and millions of reads, so you can't really browse through them. Yes? [Question: how do the aligners handle SNPs?] As I said, the aligner allows for mismatches, and the rate of allowed mismatches differs depending on where in the read they are. At the beginning of the read the tolerance is a bit lower than at the end — I'll show you when we look at the QC: the quality of the bases in the read drops toward the end. So it allows for that; it's flexible that way. And you can change the number of allowed mismatches if you want to be more strict when you're doing the alignment.
The default settings tend to be fairly permissive. If you haven't seen a SAM or BAM file, this is what it looks like. You can break it down into two sections: the header at the top and the body. The header contains information about the sequencing run — I'll get into the details of what the header includes in a bit. The body contains the actual sequences that you have aligned. All these sequences are the same sequences you found in the FASTQ file, with another layer of information added: it's the sequence from the FASTQ file, plus where it actually aligned and the quality of the alignment. So you can pretty much take the BAM file and convert it back to FASTQ, because all the sequences should be there, aligned and unaligned. I've talked about that. You can also index BAM files, and that makes the retrieval of sequences a lot faster: if you want to pull a specific read coming from a specific position, you will have to index the BAM file to be able to pull those reads quickly. So this is the header section I was talking about. What kind of information does it contain? The @HD line, for example, records the format version and the sort order: how is this BAM file sorted — by coordinate, or by read name? The header also records the reference that was used to perform the alignment — which version of the reference, the contigs and their lengths, and the species — and the read group, which is the library identifier. It's all information that will help you: if you read the header, you should be able to go back and reprocess the data based on that information. It tells you the library, the sequencing center, what type of sequencer was used, and the sample name.
And finally, the program that was used to perform the alignment — and sometimes you even get the actual command that was used to produce the alignment. The richer the header is, the easier it will be for someone to reprocess the data without asking you, or whoever sent the BAM file, any questions. Now, the body of the BAM file contains a list of fields. I'm not sure if you can see the bottom, so I'm just going to go through the fields you'll find for each alignment. For each aligned sequence, you get information. There's QNAME, the template name: it contains the read name with the sequencing information appended to it, and it tells you whether it's read one or read two. You have the FLAG and the CIGAR — those are the two highlighted ones, and I'll explain them in detail in a bit. RNAME tells you the chromosome the read aligned to, then comes the position within that chromosome, and the mapping quality. It also looks at the pair: it tells you what chromosome read two aligned to, the position of read two, and how far read two is from read one. That information can help you build an insert size distribution, so you can see the distribution of the sizes of the fragments the library came from. Then you get the actual sequence, and another string for the quality: for each base, you get the quality score, just as was discussed before with the FASTQ files. So what's this FLAG I showed you earlier? It's a 12-bit bitwise flag that describes the alignment — they're trying to pack as much information about the alignment as possible into this one score. You can think of it as a combination: each bit is a power of two, from 1 up to 2048, and the flag you see is the sum of the set bits. The specification describes the bits in hexadecimal, which makes it a bit confusing.
But each bit represents a property of that alignment. It will tell you whether the template has multiple segments, whether the segment actually mapped or didn't map, whether it's a proper pair — if the second read also mapped to the same chromosome, the pair is classified as proper — whether it's the first read or the second read, whether it's a primary or a secondary alignment, whether it passes filters, whether it's a PCR duplicate, and whether it's a supplementary alignment. You can use these flags. With SAMtools, for example, you can use the flag to pull all the reads that come from proper pairs. You just provide the flag value associated with that description, and it will pull all the reads in the BAM file that come from proper pairs, or all the reads that are first reads rather than second reads. It just makes the retrieval of reads from a BAM file a lot easier. The CIGAR string is also in the BAM file. Each read gets a CIGAR string, and the CIGAR string describes the breakdown of the actual alignment. If we take this as an example — 81M859N19M — what does that mean? It means 81 bases of that read matched, then 859 bases were skipped (a gap), and then 19 bases matched. So for this 100-base read, the first chunk matched an exon, then there was a large intronic region, and then another exonic region. By reading the CIGAR string, you can visualize the read and where it aligned. And that's what BAM viewers, or alignment viewers, actually use: they pull the CIGAR string and use that information to help you visualize the read and its breakdown. You had a question? [Question: what's a secondary alignment?]
A secondary alignment, as I mentioned: if the read aligns to multiple places, the primary alignment is the best alignment, and the other locations for that read, with lower quality, are considered secondary alignments. Yes? [Question: the flag in your example says 99 — how do I interpret that? On the next slide there is no 99.] So 99 is a combination of several of those bits — in this case paired, properly paired, mate on the reverse strand, and first in pair. There is a website I usually like to show: you put in the number, and it breaks it down for you and tells you what combination of properties it actually represents. Let's see if we can find it — there you go. On this website you put in the flag, in this case 99, and it tells you what combination of properties it's coming from. You don't have to remember the whole link; all I did was search for "SAM flags explained," and this one is from Picard. You can go here, and if you're interested in a combination of filters, you can check them. You can also do the opposite: check the filters you're interested in, and the number will change. Then you take that number to samtools view, specify the flag you're interested in, and it will subset the file. When you're subsetting the file, you can make the list inclusive or exclusive: you can say "I only want reads that fit these criteria," or "I want all the reads that don't fit these criteria," and so on. There are so many different ways you can subset a BAM file.
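The two fields just discussed — the FLAG and the CIGAR — can be decoded with a few lines of pure Python. The bit names below follow the SAM specification; the flag value 99 and the CIGAR `81M859N19M` are the examples from the slides, and this is only a sketch of what tools like the explain-flags page do internally.

```python
import re

# SAM bitwise flag bits, as defined in the SAM specification
FLAG_BITS = {
    1: "paired", 2: "proper pair", 4: "unmapped", 8: "mate unmapped",
    16: "reverse strand", 32: "mate reverse strand", 64: "first in pair",
    128: "second in pair", 256: "secondary alignment", 512: "fails filters",
    1024: "PCR duplicate", 2048: "supplementary alignment",
}

def decode_flag(flag):
    """Break a SAM flag into its set bits, like the explain-flags web page."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

print(decode_flag(99))            # paired, proper pair, mate reverse strand, first in pair
print(parse_cigar("81M859N19M"))  # 81 matched, 859 skipped (intron), 19 matched
```

The same bit arithmetic is what `samtools view -f`/`-F` does when you subset a BAM by flag value.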
Going back: so we talked about the CIGAR string. The BED file is another format that is important when you're trying to subset a BAM file. A BED file is a text file that contains a few columns: the chromosome, start, end, and whatever feature you're interested in. Let's use an example. Say you wanted to pull a specific gene from your BAM file. In the BED file you can say the gene is on this chromosome, with this start, this end, and this gene name. Then you can use the BED file as a way to intersect the BED with the BAM. There are tools that take the BED file with the gene information — the gene boundaries — and the BAM file with the alignments, and you tell them: take these boundaries, overlap them with the alignments, and pull the alignments that overlap this specific gene, just these boundaries. And it doesn't have to be a gene — it could be anything you want: a region of the genome, a band, whatever. As long as it has chromosome, start, end, and a name, that's a BED file, and you can use it to intersect with a BAM file. You can actually intersect two BED files as well, if you want; it doesn't have to be a BED and a BAM. So here is a list of some of the tools you can use to manipulate BAM files or find intersections with BED files. SAMtools, as I've mentioned, I use a lot. You can use BamTools, you can use Picard. For BED files, you can use BEDTools or BEDOPS. For me, I mainly use SAMtools and BEDTools. BEDTools is very, very useful, especially when you're looking at coverage, for example: if you want to pull the raw number of reads that cover a specific region, you can use BEDTools.
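The intersect idea can be sketched without BEDTools: keep any alignment whose reference span overlaps a BED interval. Everything below is invented for illustration — the gene name, coordinates, and reads are made up, and real BED uses 0-based half-open coordinates, which BEDTools handles for you at scale.

```python
# Minimal sketch of intersecting alignments with a BED interval
# (conceptually what `bedtools intersect` does). All coordinates invented.

def overlaps(start1, end1, start2, end2):
    """Half-open intervals [start, end) overlap iff each starts before the other ends."""
    return start1 < end2 and start2 < end1

# A hypothetical BED record: chromosome, start, end, feature name
gene = ("chr7", 55_000_000, 55_200_000, "GENE_X")

# Hypothetical alignments: (chromosome, start, end, read name)
alignments = [
    ("chr7", 54_999_900, 55_000_050, "read1"),  # overlaps the gene boundary
    ("chr7", 55_300_000, 55_300_100, "read2"),  # same chromosome, downstream of the gene
    ("chr8", 55_100_000, 55_100_100, "read3"),  # wrong chromosome
]

hits = [r for r in alignments
        if r[0] == gene[0] and overlaps(r[1], r[2], gene[1], gene[2])]
print([r[3] for r in hits])  # only read1 falls within the gene's boundaries
```

The same interval test works for intersecting two BED files with each other, just as described above.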
How should you sort your alignments? There are two ways you can sort them. You can sort by position, and that's mainly for performance reasons, so that whatever tool you use downstream has an easier time retrieving the reads if it looks them up by position. You can also sort the BAM file by read name. The intention behind doing it that way is that sometimes — for example, if you're looking for fusions — you want to find out whether the first read maps to one chromosome and the second read maps to a different chromosome, and you want the two reads right next to each other in the file, so the check is quick. And there are various tools you can use for sorting a BAM file. Now, IGV is one way to visualize your alignments — has anyone ever used IGV before? Okay, that's great. You can simply load the BAM file into IGV, and the way it works is that you get a window with the ideogram — the chromosome, with the p-arm, the q-arm, the bands, and the centromere — and at the bottom you can load a gene track, so that for a specific gene it shows you the exonic and intronic regions. You can change the gene track and add whatever annotation you have or choose to use. And then you get the reads. This is a pileup of the reads you have sequenced, and this plot right here shows you the distribution of coverage over the exonic islands you have. The lines that connect these red fragments represent the intronic regions that the aligner — TopHat — predicted based on the reads you have. What you try to do is look at the predicted splice junctions, which are these light blue lines, and compare them to the known annotation, to see whether the islands of exons and introns match.
And most of the time they should match, unless you are dealing with novel genes, where you have novel splice junctions that might not be annotated in the database. But most of the time the splice junctions you detect are probably known, and they should match the transcripts at the bottom of the screen. Now, you can color-code the reads as well — for example, by whether they're coming from the first read or the second read. If you're doing strand-specific libraries, the color coding is very important, because it tells you whether a read is coming from the sense or the antisense strand. It adds another layer of information that can be useful: you can look at the color of the reads and at the direction of transcription — the arrow — and check whether that direction actually matches the strand your reads are colored by. Here are some alternatives to IGV; I personally haven't used any visualization tool other than IGV. And honestly, it's really hard to pick regions to inspect, because the transcriptome is so huge — unless you're interested in a specific gene, say if you're doing a targeted panel, or after you do differential expression and you have a list of genes of interest, and then you go back and just make sure the coverage looks okay. Otherwise it's really hard to assess the whole transcriptome using IGV. It would just take forever, and it's not an efficient way of assessing QC. That's why you should use tools to assess the QC for you. Here is a list of metrics you can look at to assess the QC of your alignments, and I'll go through each of these metrics and show you a plot you can generate to help you understand each one.
So we're going to look at the coverage bias, the nucleotide content distribution, the base quality, the read quality distribution, PCR artifacts, the sequencing depth — and how to determine how deep you should go when it comes to sequencing — the base distribution, and finally the insert size distribution. Now, all of the plots I'm going to show you today come from one tool, called RSeQC. You are not obligated to use that tool; you can go ahead and use whatever tool you want. Treat these as guidelines for the metrics you can look at: you can write your own scripts to pull these metrics, or use a combination of other tools. I believe Picard has a very nice tool for RNA-seq QC — CollectRnaSeqMetrics, I think it's called. Ideally you would combine multiple tools when you're developing a pipeline for your QC, but this is just a good starting point. The first metric we'll look at is 3' and 5' bias, or you can think of it as coverage uniformity. You want to look at the transcripts and make sure that the coverage across each transcript is uniform — that both ends of the transcript are covered the same way. Now, to calculate that, because the transcripts have different lengths, what RSeQC does is take all the transcripts and split each one into quantiles, or bins, from 1 to 100, and it looks at the coverage for each of these bins — that's what helps you visualize it. On the plot on the right, the x-axis represents the quantiles from 0 to 100 — you can think of them as bins — and the y-axis is the coverage. Once you make the plot, you realize there are two groups of samples. There is a group of samples that seems to have consistent coverage across all the bins, from the 5' end to the 3' end of the transcript.
But you see a group of samples that seems to have a lot more coverage at the 3' end compared to the 5' end. Now, this is alarming, because it could mean multiple things. It could mean that your RNA is degraded — the 5' end has very low coverage compared to the 3' end. If that's the case, you want to go back and check your RIN number and see whether it's low or high. It could also mean there is a 3' bias from the kit you used — maybe it's poly-A selection. If you're doing poly-A selection, you're only pulling fragments anchored at the 3' end. If that's the case, you may want to go back and reassess the library selection you've done, and perhaps choose another method. If you don't adjust for these differences, they will have bad consequences when it comes to expression estimation: if you have bias in your coverage, you will end up underestimating or overestimating your expression, depending on how severe the bias is. Now, there are computational methods where you can correct for this. If you absolutely cannot obtain new samples or rerun those lanes, then you can correct for it, but you have to be aware of it, figure out why it happened, and correct for it before you perform expression estimation. Any questions? [Question: how can you correct for it?] I don't have a specific tool on the top of my head that I'm aware of, but I guess it really depends on the severity. If all your samples have it, or if you have two groups and only one does, maybe you can use the clean group as a benchmark to normalize and try to correct the distribution in the other group. But at the end of the day, you will see the byproduct of this bias somewhere downstream unless you adjust for it in your downstream analysis.
So I'm not sure how good the tools are at correcting those. Ideally, you would like to rerun the sample, because you could be missing reads entirely. If it's a poly-A selection problem, then reads from the 5' end may simply not be there, so even if you correct for the bias, the reads and transcripts that were never captured remain false negatives. Yes? Is the x-axis read one, or the 5' end? It goes from the 5' end to the 3' end of the transcript, I believe. For the genes? Yes. So if a read aligns, do you get coverage over the entire RNA, or only at one end of it? The question was whether the counting changes: shouldn't the counts be distributed across all the exons, with no bias towards one end or the other, so that for gene expression analysis, whether a read aligns at the 5' end or the 3' end, the count is the same and this distribution shouldn't affect the analysis? Well, the counts should be distributed evenly, but if there is degradation in your RNA, then you're biasing towards the short transcripts: their expression will appear higher than that of long transcripts, depending on how severe the bias is. For short transcripts, if the counts pile up on the 3' end, you're overestimating their expression, versus very long transcripts that will have a lot of reads on the 3' end but nothing on the 5' end.
Couldn't it be okay if the bias is consistent across all of your samples? Sometimes it is, but you can't guarantee that the selection process didn't miss anything. It's a sign that you might have missed material you can't really correct for, because it was never picked up if the RNA was degraded. As I said, you need to figure out where it's actually happening. Go back and check your RIN, or the library quality, to see if it was good or bad; if it was really good, then it's a sign that it probably wasn't the library, and something happened after library construction. So try to trace the error back. You can produce a plot like this and check whether all the samples look the same. Most of the time when I've seen this, it was an indication of the kit: a reproducible error because of the particular kit that was used, so kit differences. [Inaudible question about whether the same pattern is observed with a particular kit.] If the RNA is good, you should get a pretty even profile; it shouldn't bias you towards one end of the transcript, unless the RNA is degraded. Another metric you can look at is the nucleotide content. Here we're looking at the position within a read. Assume our read is 35 bases long, a very short read. We look at the nucleotide frequency at each position across all the reads in your library, and what you expect is a uniform distribution: A, C, G, and T each at about 25% representation in your pool. However, when we do this for some Illumina samples, you notice that there is a deviation from the expected 25% at the beginning of the read.
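The per-position frequency check just described can be sketched as follows. This is a toy illustration under the assumption that all reads have equal length; the function name is ours.

```python
from collections import Counter

def per_position_nucleotide_freq(reads):
    """Fraction of A/C/G/T at each position across all reads in a library.
    A library without positional bias hovers around 0.25 for each base."""
    read_len = len(reads[0])
    freqs = []
    for pos in range(read_len):
        counts = Counter(read[pos] for read in reads)
        total = sum(counts.values())
        freqs.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return freqs

# Two toy reads: position 0 is always 'A' (biased), position 1 is split evenly.
freqs = per_position_nucleotide_freq(["ACGT", "ATGC"])
```

Plotting one line per base across positions gives the kind of figure discussed here, with the deviation concentrated in the first positions.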
And it seems that it's due to the random primers that reverse-transcribe the RNA fragments into cDNA, or double-stranded cDNA. They seem to be not quite random, and they cause certain patterns over the first few bases of the read. Now, this does slightly affect the alignment, so one way you can improve the alignment is by trimming or adjusting for those biases. I usually just trim the first bases of the read and then use the rest of the read, where A, C, G, and T are each at about 25%. Is it always the beginning of the read? Yes, the beginning of the read. How many bases? Between zero and ten; I don't see much variation beyond that. But before you trim, do the plot, see if you have such a distribution, and only then fix it. Is this RSeQC as well? No, this one is different. So there are multiple tools: there's RSeQC, and there's also Picard's CollectRnaSeqMetrics, and they don't all overlap in terms of the QC metrics. Each tool has some unique metrics, which is why I said the union of all of these tools would be great if you can run all of them. Another metric you can look at is the distribution of base quality across the read. Here we're looking at the same read, 35 bases long, and at each position we're looking at the distribution of base quality. The quality is usually reported as a Phred score, which encodes the probability that the base call is wrong. A Phred score of 30, or Q30, means there is a one in a thousand chance that the base call is wrong. So the higher the Q score, the better the quality of the base, because it's less likely to be an error. You notice that here most of the bases are above Q30, which is great. You also notice that there is a decline in quality towards the end of the read.
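The Phred relationship mentioned here is Q = -10 log10(p), where p is the probability the base call is wrong; FASTQ files store these scores as printable ASCII characters (Phred+33 for modern Illumina data). A minimal sketch, with function names of our own choosing:

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong: p = 10^(-q/10)."""
    return 10 ** (-q / 10)

def ascii_to_phred(char, offset=33):
    """Decode a FASTQ quality character into a Phred score (Phred+33 encoding)."""
    return ord(char) - offset

# Q30 means roughly a 1-in-1000 chance the base call is wrong.
q30_error = phred_to_error_prob(30)
# The FASTQ character '?' (ASCII 63) decodes to Q30 under Phred+33.
q_of_qmark = ascii_to_phred("?")
```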
That's okay, because that's just expected for all sequencing-by-synthesis techniques: the error rate increases as you go towards the end of the read. Just make sure that the overall quality distribution of your reads is above Q30. Another metric we look at is PCR duplication. In DNA sequencing, did you talk about collapsing your data? In DNA, you collapse the data because some of the reads can be PCR artifacts: you expect reads across the genome to overlap, not to pile up in one region. That's for DNA. When we talk about RNA, things are a bit different, because the start positions in RNA are not random; genes have transcription start sites where the reads begin. So there is a real likelihood that you will see reads piling up at a certain position, and you can't really differentiate between a PCR artifact and a true biological transcription signal. So I usually don't do collapsing when it comes to RNA. However, if you want to assess the duplication rate, there are various ways to do that. One way RSeQC does it is by looking at both reads of a pair, read one and read two. It looks at the positions, so the start point for read one and the end point for read two, and it also looks at the sequence content, checking that the sequence is exactly the same. It then calculates the duplication rate based on the sequence content and the start and end positions, and gives you an estimate. Based on that, it's up to you: if you see a very high level of duplication, go back and check why this might be happening. Is it the library prep? And then you get to choose whether or not you want to collapse, and check the dynamic range of your expression before and after collapsing to make sure that it doesn't change massively. Next, sequencing depth.
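The pair-based duplicate criterion just described can be sketched like this. This is a simplified illustration of the idea, not RSeQC's actual code: each pair is keyed on (read1 start, read2 end, sequence content), and every extra copy of a key counts as a duplicate of one original fragment.

```python
from collections import Counter

def duplication_rate(read_pairs):
    """Estimate PCR duplication for a paired-end library: key each pair on
    (read1 start, read2 end, sequence) and count every extra copy of a key
    as a duplicate of one original fragment."""
    counts = Counter(read_pairs)
    total = sum(counts.values())
    duplicates = total - len(counts)  # copies beyond the first of each key
    return duplicates / total if total else 0.0

pairs = [
    (100, 350, "ACGT"),
    (100, 350, "ACGT"),  # same coordinates and sequence: flagged as duplicate
    (100, 350, "ACGA"),  # same coordinates, different sequence: kept
    (200, 450, "ACGT"),
]
rate = duplication_rate(pairs)  # 1 duplicate out of 4 pairs = 0.25
```

Requiring both coordinates and sequence to match is what distinguishes this from the DNA-style position-only criterion, which would over-call duplicates at transcription start sites.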
So if you're trying to figure out how many lanes of sequencing you want for your study, or how many samples to multiplex in one lane, again, that question depends on many things. It depends on what kind of sequencing you're doing: total RNA or poly-A selection? What kind of analysis do you want to do downstream? Are you just doing variant calling from RNA? Expression estimation? Fusion detection? There are a lot of variables to consider when deciding how many reads per sample, but one simple approach is a pilot experiment: take one of your samples, run a full lane on it, and then do a saturation experiment, or saturation test. What the tool does is take the aligned BAM file and sample it at different levels: it samples 10% of the reads randomly, then 20%, 30%, 40%, up to 100%. At each sampling level, it calculates the number of novel and known junctions based on an annotation file that you provide, and then it produces this plot. The point where you want to stop sequencing is the point where your total junctions have saturated: no matter how many reads you add, you're not actually getting any more information, any more junctions. Now, you notice here there are three curves. The blue one represents all junctions, the red one the known junctions, and the green one the novel junctions. You will notice that the first one to saturate is the known junctions, and that's because whatever tool you use, you will get a lot of false positive junctions, and they show up as novel junctions. So you should consider both novel and known junctions and try to come up with a threshold at which you're not gaining any more junctions.
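The subsampling procedure can be sketched as follows. This is a toy version of the idea, operating on a list of junction observations (one entry per junction-supporting read) rather than on a real BAM file; the function name and fractions are ours.

```python
import random

def junction_saturation(junctions, fractions=(0.1, 0.25, 0.5, 0.75, 1.0), seed=0):
    """Subsample junction-supporting reads at increasing fractions and count
    how many distinct junctions each subsample recovers; the curve flattens
    once additional reads stop revealing new junctions."""
    rng = random.Random(seed)
    curve = []
    for frac in fractions:
        k = int(len(junctions) * frac)
        sample = rng.sample(junctions, k)
        curve.append((frac, len(set(sample))))
    return curve

# Toy read-level observations: many reads supporting only two distinct junctions.
reads = ["chr1:100-200", "chr1:100-200", "chr2:50-90"] * 50
curve = junction_saturation(reads)
```

With real data you would run this separately for known and novel junctions (splitting on the annotation) to reproduce the three curves described above.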
Now, you can do the same thing, though not in RSeQC, by looking at the expression of gene families. Do the same sampling experiment and see how many more genes you pick up by adding more reads. At the point where you saturate, where you're not detecting any new genes and the expression estimates don't change, that's where you can stop. That will give you an idea of how much sequencing you need to do, and then you can apply that to the rest of your samples. Base distribution: what this does is take the bases that have been aligned and figure out where in the transcriptome those bases are aligning. Are they in intronic regions, exonic regions, intergenic regions, and so on? That will confirm the type of library you've used. If you used total RNA, you'll obviously see more intronic and intergenic bases; if you used poly-A selection, then you should see a lot more coding, or exonic, bases. If there's any discrepancy, you want to go back and check why. You can also look at the insert size distribution. Your fragment is what you're trying to sequence: this is the fragment of cDNA, and you attach the adapters to it before you sequence it. So when we talk about insert size, we're talking about the fragment without the adapter sequences. When we talk about the inner distance, that's the distance between the two reads. So if this is your cDNA fragment, read one covers, say, 100 bases from one end and read two covers 100 bases from the other end, and whatever is left unsequenced in the middle of the fragment is called the inner mate distance, or inner distance. The reason I mention this is that we'll use that metric in the integrated assignment in a bit, so keep it in mind. But you can use that information; as I've mentioned, you can pull it from the CIGAR string.
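The relationship between insert size and inner distance described above is simple arithmetic, sketched here with our own function name:

```python
def inner_distance(insert_size, read1_len, read2_len):
    """Inner (inner mate) distance: the unsequenced stretch of the cDNA
    fragment between the end of read 1 and the start of read 2."""
    return insert_size - read1_len - read2_len

# A 300 bp insert with 2 x 100 bp reads leaves 100 bp unsequenced in the middle.
d = inner_distance(300, 100, 100)
# If the insert is shorter than the combined read lengths, the value is
# negative, meaning the two reads overlap.
overlap = inner_distance(150, 100, 100)
```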
Or rather, you can pull it from the TLEN (template length) column in the BAM file, which gives the distance from read one to read two. That information lets you build an insert size distribution, which approximates the fragment size of the library you started with. You can then check it against what was reported, or what you were told, and see whether the library size matches or not. And I believe you can also use IGV. If you have a specific gene or site in mind, you can load your data in IGV, and any base that does not match the reference will be colored, so variants or SNPs stand out. If you want to calculate the allele frequency, you simply count the number of reads that carry a different base over the total number of reads that cover that site, and that gives you the allele frequency for the variant you're interested in. However, this only works if you know which gene and exon you're interested in. Ideally, you want to use a variant calling tool, or you can run SAMtools mpileup, which goes through every single base in your genome and does that for you: it counts the number of bases that match the reference and the number of alternate bases, and it calculates that frequency for you at each site. Okay, so that concludes the second module.
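As a footnote, the allele frequency calculation described in this last part is just a ratio over the base counts at one site; a minimal sketch, with hypothetical names:

```python
def allele_frequency(base_counts, ref_base):
    """Variant allele frequency at one site: reads carrying a non-reference
    base over all reads covering the site."""
    total = sum(base_counts.values())
    alt = total - base_counts.get(ref_base, 0)
    return alt / total if total else 0.0

# Counts of each base observed in the reads covering one position.
site = {"A": 60, "G": 40}
af = allele_frequency(site, ref_base="A")  # 40 alt reads / 100 total = 0.4
```

A pileup tool does exactly this at every covered position, rather than at one hand-picked site in IGV.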