All right, so we're a bit behind schedule, but that's okay. There are some topics I'm covering that Malachi already covered, so maybe I won't spend as much time going over them. So far this morning and early this afternoon, you've learned about RNA-seq in general: how to construct RNA-seq libraries, how to perform sequencing, what kind of files you generate after you get the RNA sequences, and what references and annotation files look like. At this point, we are going to learn about RNA-seq alignment. In this module, which we'll spend the rest of the afternoon on, we'll talk about RNA-seq alignment, the challenges we face in RNA-seq alignment and how they differ from DNA-seq, and the different alignment strategies that are available. We are going to pick one of the aligners, HISAT2, and go over how it works. Then we'll go over the types of files that the alignment generates — SAM/BAM files, BED files, and how BED files can be used to subset a BAM file — and we'll cover some basic manipulation of BAM files if you're not familiar with them. Then we'll spend the rest of the module talking about QC. So far you've heard about sequence QC, but at this point we're going to be talking about alignment QC: after we map those reads to the genome or transcriptome, how well do they align? The concept is very similar, and some of the tools we mentioned we'll use again, but we'll highlight some of the QC metrics you can use in your lab to assess the quality of the alignment. One of the biggest challenges of aligning RNA-seq data is the computational cost, and that's a challenge that's actually shared between DNA-seq and RNA-seq. You get a lot of reads when you sequence RNA: a lane of sequencing can nowadays generate 600 to 700 million reads. That's a lot of reads, and what people usually end up doing is multiplexing.
So they would pool several samples in one lane, but still, that's a lot of reads that the aligner has to map to the transcriptome or the genome. In addition to that, advances in technology have enabled us to increase the length of the reads as well. As was mentioned before, our reads can go up to 150 or 200 bases now, and if you're dealing with PacBio, you have reads thousands of bases long. So read length is another computational cost that RNA-seq aligners face. In addition to that, what's unique about RNA-seq is the fact that introns are present: there are large gaps — the introns — that are present in the genome but not in the mRNA fragments, so the aligner has to figure out where those gaps fall when you're aligning reads to the whole genome. There's also the fact that with RNA sequencing, it's very unlikely that you're just going to align your data once and be done with it. With RNA-seq, you're using gene annotation, and gene annotation changes constantly, so chances are you're going to go back and re-align and reprocess your data. And not just that: there is so much you can get out of RNA-seq. You have alignments, you have expression, you have fusions, you can do variant calling. There are so many different tools, and these tools are always getting updated, so chances are you're going to be reprocessing frequently to keep up to date. Aligners, in general, are split into three classes: de novo assembly aligners, aligners that align to the transcriptome, and aligners that align to the reference genome. How do you know which aligner is good for your project? For example, if you're dealing with a project where no reference is available, then de novo assembly is a good choice.
But also, sometimes the reference is available but the samples you're dealing with have so many polymorphisms, mutations, or variations that they might actually be missed if you try to compare them against a fixed reference genome. Assembly in that sense is a good choice. If you do have the reference, you can choose to align either to the transcriptome or to the reference genome; the reference genome is more popular. There are a lot of tools available in each one of these classes, but especially for aligning to a reference genome. So here there's a timeline that highlights some of the most popular aligners available for RNA-seq, bisulfite sequencing, and microRNA. And each one of these tools, when they were published — sorry, question. "So for the alignment strategies, would it be appropriate, if you wanted to get expression data out of, say, a cancer sample that has these complex polymorphisms, to align using a reference genome for a regular expression analysis, but to do the de novo assembly when you're looking at mutations?" Yeah, that would be a good choice. It depends how complex we're talking, and how many genes or regions in the genome or transcriptome are affected, but yeah, that would be a good choice. So each one of these tools, when it came out, tried to compete with the preceding tools by doing three things: increasing the speed of the aligner, reducing the memory usage, and improving the quality of the calls. Now, I'm not going to go over all of these tools, obviously, but I'll highlight a few that are considered essential in RNA-seq alignment. The first one is TopHat, which was released around 2009. TopHat is used with Bowtie: TopHat is a transcript assembler.
It uses Bowtie as a backbone aligner: Bowtie aligns the short reads, and TopHat is like a wrapper that uses the alignment information from Bowtie to stitch everything together and come up with transcripts. It was very popular and used a lot in publications, and there's a huge community, so chances are if you run into a question about running TopHat, you will definitely find an answer online. The only problem with TopHat was that it took a lot of time: 100 million reads would take a few days to process. The next stop was STAR. When STAR was introduced, it was a revolution in RNA-seq alignment, because it cut processing time down by so much: 100 million reads went from a few days to about 23 minutes. The way it did that was by increasing memory usage, which went from about four gigabytes in TopHat to about 28 gigabytes in STAR. That's a lot of memory; sometimes you might not be able to run it on your local machine and you'll probably need a cluster. But it did cut down the processing time, and the quality of the alignments was comparable to TopHat. It wasn't until more recently that, I believe, the same team that worked on TopHat released HISAT, a tool that tried to cut down on both memory and processing time. And they managed to do that: 100 million reads can now be processed in about 23 minutes as well, but using only about four gigabytes of memory. So that's the tool we picked for this module. I believe we also have STAR; I don't know if we'll have time to run both, but instructions are available for both. When you're trying to decide whether you should use a splice-aware or a non-splice-aware aligner, the key point is the kind of reference you're aligning to. If you're aligning to a whole genome, then you're going to need a splice-aware aligner.
If you're aligning to a transcriptome reference, then you don't need a splice-aware aligner. For today, we're using HISAT2, and HISAT2 is a splice-aware aligner because it takes a whole genome as a reference. As I mentioned before, it's considered to be very fast, and the reason it's fast is that it uses a combination of indexing methods: a global index and local indexes. What does that mean? With the global index, HISAT2 takes the read and tries to anchor it — to find a unique location, a start, for that read within the whole genome. Once it finds a position in the whole genome, it switches to extension to extend the read, and using the local indexes saves a lot of time. So what are the local indexes? Think of the whole genome split into about 48,000 bins, so that when you look for part of a read, you're not searching the whole genome but only within one of those bins. You're narrowing down the search by a lot. I picked three examples here, and we're going to go through them and I'll explain how HISAT works for each one. In the first example, you have a read — let's assume it's 100 base pairs long — that fully overlaps an exon. In the second scenario, most of the read overlaps one exon and some of it overlaps exon two. In the third scenario, half of the read overlaps one exon and the other half overlaps another exon. With the first example, the way HISAT works is that it tries to anchor — to find a position where it can locate the read in the whole genome. It uses the global index to anchor the first 28 bases; it usually needs 28 bases to anchor the position of a read.
So it does that, and after the 28 bases it switches to extension: it just extends base by base until it reaches the end of the read. No problem there. Now, it will try to do the same thing in the second scenario. It starts with the global index, anchors the first 28 bases, and then extends, extends, extends until it gets to the first mismatch. That mismatch is an indication that there may be a splice junction there. Once it reaches that mismatch, it switches to local indexing, meaning it now tries to find the rest of that read within the local index — within that local bin. The difference from the previous tool, TopHat, is that for the rest of the read — say it's eight bases long — TopHat would have to search the whole genome to find a position for those eight bases, and such short sequences are known to be non-unique: you'll find many different eight-base spots across the genome. The advantage of having those local bins is that there is typically only one position in each local bin for those eight bases, which makes the search a lot faster. So once it switches to local, it anchors the eight bases in that local index and then stitches the information together. The third scenario is handled similarly to the second: again, it anchors the first 28 bases in the global index, switches to extension, extends, extends, and when it finds the first mismatch, it switches to local indexing. Within the local index you only need eight bases to anchor, so it anchors eight bases and then switches to extension within the local index. It's just a series of anchoring and extension — global indexing, local indexing — and then gathering all that information together to form a transcript. Any questions regarding that? Okay, perfect.
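To make that anchor-and-extend idea concrete, here is a toy Python sketch. To be clear, this is not how HISAT2 is actually implemented — the real tool uses FM-indexes built on the Burrows-Wheeler transform, not hash tables, and the anchor and bin sizes here are simplified — but the control flow (global anchor, extension to a mismatch, re-anchoring in one local bin) follows the three scenarios above.

```python
# Toy sketch of HISAT's two-level (global + local) indexing idea.
# Real HISAT2 uses FM-indexes (BWT); hash tables and sizes here are illustrative.

GLOBAL_ANCHOR = 28   # bases used to anchor a read in the global index
LOCAL_ANCHOR = 8     # bases needed to re-anchor in a local index
BIN_SIZE = 64        # the real genome is split into ~48,000 local indexes

def build_indexes(genome):
    """Hash every 28-mer globally, and every 8-mer within each local bin."""
    global_index = {}
    for i in range(len(genome) - GLOBAL_ANCHOR + 1):
        global_index.setdefault(genome[i:i + GLOBAL_ANCHOR], []).append(i)
    local_indexes = []
    for start in range(0, len(genome), BIN_SIZE):
        chunk = genome[start:start + BIN_SIZE + LOCAL_ANCHOR]  # slight overlap
        idx = {}
        for i in range(max(0, len(chunk) - LOCAL_ANCHOR + 1)):
            idx.setdefault(chunk[i:i + LOCAL_ANCHOR], []).append(start + i)
        local_indexes.append(idx)
    return global_index, local_indexes

def align(read, genome, global_index, local_indexes):
    """Anchor 28 bases globally, extend to a mismatch, then re-anchor locally."""
    hits = global_index.get(read[:GLOBAL_ANCHOR])
    if not hits:
        return None
    pos = hits[0]                          # ignore multi-mapping for simplicity
    i = GLOBAL_ANCHOR                      # extend past the anchor, base by base
    while i < len(read) and pos + i < len(genome) and read[i] == genome[pos + i]:
        i += 1
    if i == len(read):
        return [(pos, i)]                  # scenario 1: fully exonic read
    # Mismatch: candidate splice junction -- search only one local bin.
    local = local_indexes[(pos + i) // BIN_SIZE]
    rest = local.get(read[i:i + LOCAL_ANCHOR])
    if not rest:
        return None
    return [(pos, i), (rest[0], len(read) - i)]   # two segments across the gap
```

Aligning a read made of two exon pieces returns two (position, length) segments, one per exon, with the intronic gap skipped — which is exactly the information the aligner stitches together into a spliced alignment.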
Also: should you allow multi-mapped reads? Sometimes a read — in DNA, for example — can map to multiple places in the genome, and it will map equally well, with the same good score, in all of those spots. Some DNA aligners let you ask them to randomly pick one of those locations for you, or just report the first one encountered. The problem is that you can't really do that with RNA, because picking a random location or picking the top one might actually distort the dynamic range of your expression. So it really depends what you're trying to do: if you're doing variant calling, then maybe exclude multi-mapped reads, but if you're doing expression estimation, you may want to keep them and let the quantification tool handle them. You'll find with a lot of these alignment parameters that you really have to think about what you're doing afterwards — expression estimation? variant calling? — and things will change depending on that. The output of HISAT2 is a SAM or a BAM file. How many people have seen a BAM file or worked with one before? Okay, that's a good number. A SAM file you can think of as a text file; SAM stands for the Sequence Alignment/Map format. For every read that you tried to align from the FASTQ, you're getting that read back plus extra information: where it aligned and the quality of the alignment. That's pretty much the SAM file. The BAM file is just the binary version of the SAM, and you save a lot of space when you compress from SAM to BAM, so it's highly recommended that you keep all your alignments in BAM format. You can convert from BAM to SAM and back, and there are a lot of tools that let you do that, including SAMtools. So this is what a BAM file looks like: it's divided into two sections.
You have the header at the top — header lines start with an @ sign — and you have the body, which contains the sequence information. The header, as I mentioned, contains information about the reference you used, the FASTQ files you used, the parameters you used for the alignment, and the alignment tool with its version. The body contains information about the sequences themselves, where they aligned, and the quality of the alignment. This is just a list of the fields in the header: the reference, broken down per chromosome with the length of each chromosome; the read group, with the name of the sequencing center and the sample name; and then the program and version. Under the alignment section, you have the following information: the read name; the bitwise flag, which I'll discuss in a bit; and the reference name, i.e., the chromosome where it aligned. So let's go over this example. This is the read name; this is the bit flag; the RNAME just means this read aligned to chromosome one; this is the position where it aligned; and this is the mapping quality of that read. The CIGAR string is a description of the pattern of alignment for that read — here it's just telling you that 100 bases matched, and I'll explain that in a bit as well. Then it tells you where the paired read (the mate) aligned — an equals sign means the mate aligned to the same chromosome — and the position of the mate's alignment, and then the fragment length. And then you get the sequence, and then the quality: the per-base quality for that sequence. So what is the bitwise flag? It's a flag that describes the alignment.
Think of it this way: instead of having a dozen separate columns describing properties of the alignment, they squeezed all of those properties into this one number, built from bit values that go from 1 up to 2048, and this is pretty useful because you can use that information to filter your reads. SAMtools has a parameter for it — dash f, either lowercase f or capital F — where lowercase f means include (keep reads with those bits set) and capital F means exclude. So you can subset to, or exclude, reads that meet certain filtering criteria. Let's visit this website. All right, take this flag, for example: 99. On this website you can enter the flag, hit explain, and it breaks it down for you and tells you which properties that 99 represents. So 99 represents reads that are paired; mapped in a proper pair, meaning the second read was successfully matched with the first; mate on the reverse strand; and first in pair — the read you're looking at is the first read of the pair. So again, if you are interested in those criteria, you can do samtools view -f 99 with the name of the BAM file, and it will subset only the reads meeting those criteria. And it can go the other way around: you can use this website to select whatever criteria you're interested in, take the number it gives you, and put that into SAMtools. CIGAR, as I mentioned before, describes how each base in your read mapped. For example, looking at this one right here, it's saying that out of the hundred bases in your read, 81 bases matched (M), then 859 bases were skipped (N), and then 19 bases matched. The 859 skipped bases represent the gap — the intronic gap that the read spans rather than maps to. So you can use that information to figure out how long the intronic regions, the splice junctions, are.
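To make the flag and the CIGAR string concrete, here is a small Python sketch that decodes a SAM flag the way the "explain flags" website does, and pulls the intron length out of a CIGAR string. The bit names follow the SAM specification; the example values (99 and 81M859N19M) are the ones from the slides.

```python
import re

# SAM flag bits, as defined in the SAM specification.
FLAG_BITS = {
    1: "read paired",
    2: "mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read reverse strand",
    32: "mate reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "secondary alignment",
    512: "fails QC",
    1024: "PCR or optical duplicate",
    2048: "supplementary alignment",
}

def explain_flag(flag):
    """List the set bits of a SAM flag, like the 'explain flags' website."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def intron_lengths(cigar):
    """N operations are the skipped (intronic) gaps in a spliced alignment."""
    return [n for n, op in parse_cigar(cigar) if op == "N"]
```

So explain_flag(99) gives back paired / proper pair / mate reverse strand / first in pair, and intron_lengths("81M859N19M") recovers the 859-base intronic gap directly from the alignment record.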
It also provides information like which bases were soft-clipped, which bases were hard-clipped, and so on. The BED format is another format that's important when you're trying to subset BAM files. A BED file, for those of you who haven't seen one, is just a text file that has a chromosome, start, end, and feature name. For example, if you're interested in a certain region or gene, you can put that information in the BED file: chromosome one, 1 to 1000, gene X. And there are tools where you provide the BED file along with the BAM file, and they will extract the reads that belong to the feature you specified in the BED file. So if you're looking at coverage for a specific gene, or you just want to count reads for a specific gene, you can put that gene's coordinates in a BED file and extract that information from the BAM. Tools you can use with BED and BAM files: SAMtools, bamtools, Picard, BEDTools, BEDOPS — there are a lot of tools out there for working with that combination of files. Sorting BAM files: you get two options. You can sort them either by position or by read name, and it really depends what you want to do with them. If you're trying to retrieve reads based on position, then it's best to sort by position, because retrieval will be faster. If you're trying to maintain the order of the reads or the pairs — for example, if you're looking for fusions and you want to see pairs mapped to different chromosomes — then it's best to sort by read name. Visualization: you can visualize RNA-seq alignments using IGV. Have you used IGV before, or looked at alignments through IGV? I wasn't sure we were covering that in this workshop — but yes, it sounds like we will be loading BAMs in the expression section.
So I guess we'll save that until later. Perfect — it's just a tool that helps you look at the read distribution from the BAM file. You load the BAM file and you can see the chromosome; you can load the gene annotation at the bottom, and then the distribution of the reads. Those islands of read coverage represent exons; the places with no reads are intronic regions. You can load your region or gene of interest and look at the coverage distribution for your read depth. These are just alternatives to IGV. Okay, so the next section is dedicated to QC assessment. Again, some sections will be very similar to what Malachi covered, so we'll go through those quickly. This section is based on a tool called RSeQC; all the plots you're going to see next are from RSeQC, and we do have the commands available. I don't know if we'll have the chance to run the tool, but it's available if you want to use it, and there are other tools you can use for QC as well. This is just guidance on the kinds of metrics you'd look at when assessing library quality. One of the first things I usually check for is coverage bias: I want to see if there's any three-prime or five-prime bias across transcripts, and RSeQC provides that. This plot looks at the top thousand expressed transcripts. Because these transcripts have different lengths, it bins them — it creates 100 bins for each transcript — and then plots the coverage for each bin. It's just a way to normalize for the differences in transcript length. What you expect to see is uniform coverage across the transcript.
However, in this example, we're seeing two groups of RNA-seq samples: in one, the coverage is uniform across the transcript, and in the other there's a lot of coverage at the three-prime end but not much at the five-prime end. This could be an indication of the library technique you picked — it could be that there's a three-prime bias, or that some degradation is happening at the five-prime end of the RNA. Once you see that, you should go back and reassess the library prep protocol, figure out why it happened, and see if you can fix it. If not, I believe there are tools that can help you normalize the coverage or adjust for biases that you know of. And if you can't do any of that, make sure you keep track of these differences when you're doing expression estimation or differential expression: note that these samples come from this cohort and those samples from that cohort, so that you adjust for the differences in your model. Question: if you wanted to do this programmatically — to automate this check — how would you reduce that shape to a metric that flags samples you should look at? Yeah, so one metric you can use is the ratio of five-prime to three-prime coverage. With all these metrics, there isn't one number you can use as a universal threshold; that will really depend on your institute and on you. But you can take that ratio and come up with some sort of cutoff or threshold that flags the sample. RSeQC isn't like FastQC, where it says this QC check passed or failed — it just gives you numbers, and you have to make sense of those numbers yourself. Okay, this was already covered: when you plot the distribution of A, C, G, and T at each position, you'll see that instead of randomness there is some selectiveness happening over the first few bases of the reads.
Again, we discussed that this could be due to the random primers being not so random. You can choose to trim, depending on how it affects your alignment rate: you can trim the first 10 bases, but if it's not affecting your alignment rate you can just keep them — it's really up to you. Quality distribution: similar to FastQC, here we're looking at the distribution of quality at each base position in your reads. Most of the time when you talk about quality, you're referring to the Phred score, the Q score. It's minus 10 times the log10 of the probability that the base you called is not the right base, so a Phred score of 30 means a 1-in-1,000 chance the call is wrong. Usually a Phred score of 30 is the cutoff we use for good quality; anything above that, we move forward to expression estimation. If a lot of reads are below that, then you have to reassess your sequencing technology or library technique. PCR duplication is another thing: with DNA, we try to get rid of PCR duplicates by collapsing reads. PCR duplicates are reads that have the same exact start and end positions and are artifacts of PCR. When we amplify the fragments, some regions get amplified more than others, so you end up with piles of reads, and those are biases: if you have a mutation in that region, for example, you end up amplifying it just because of PCR, and it's not truly reflective of the mutation frequency in your sample or cohort. In DNA you expect random start points, so you shouldn't have reads piling up; what we do is collapse them, picking one read as the representative of that region. But with RNA it's not as straightforward, because transcription start sites are not random — they are fixed sites, unlike DNA fragment starts. So you have to be careful when you're collapsing.
Some people do collapse, others don't — it's really up to you. There are tools, and this is one of them, that try to give you an idea of how much PCR duplication there is in your sample by looking at the positions of the reads. It actually checks both reads of the pair — the start of the first read and the end of the second read — and it also checks that the sequences themselves match. Based on that, you can choose whether or not you want to collapse. Sequencing depth is another thing you can check in QC. I used to get asked a lot: how many lanes of sequencing do we need for this project or this sample? At the time that was an issue, because each lane of sequencing would generate around 100 million reads; now that we get many more reads per lane, it's less of an issue. But if you're limited in budget and you want to pool samples in a lane, and you want to figure out how many samples to pool, one way to do it is this: put one sample on one lane — that will give you, say, 500 million reads — take that BAM file, and then subsample from it. Take 10% of the reads, 20%, 30%, 40%, 50%, and at each level count how many known and novel junctions you're detecting. You're looking for the point at which you saturate — where adding more reads isn't finding any more junctions. That's the point where you stop and say, okay, I think we have enough reads; we've already detected all the junctions and we don't need more than that number. So if you do one lane of sequencing, take those 500 million reads, and find that the number of splice junctions saturates at around 100 million, then for the rest of the samples you can go back and sequence around 100 to 150 million reads, leaving a bit of a window.
And that will help guide you in figuring out how many reads you need per sample. Now, you see here there's a difference in the rate of saturation between known junctions and novel junctions: the known junctions saturate faster, because a lot of the novel junctions are probably false positives — just an artifact of the tools you use. So you can either look at the known junctions alone, or at a combination of the known and the novel, and figure it out from there. Another thing you can do, instead of looking at splice junctions by themselves, is look at the number of genes. How many genes are showing up in the expression? So instead of splice junctions, you look at expression and ask: is the expression changing for those genes? Am I seeing new genes being added as I add reads, or am I seeing the same set of genes regardless of how many reads I add? The splice junction saturation plot can be generated by RSeQC; for the expression version, you'd have to do some manipulation of the code to get that. Base distribution is another thing you can look at, and it will give you an idea of whether your library selection strategy worked: if you were doing whole transcriptome versus mRNA selection, you should expect to see differences in the proportions of coding, intronic, and intergenic bases, and that will be reflected in the distribution plots. I think I'll skip that. That's the end of this module.
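As a closing sketch of the subsampling idea described above: given one junction identifier (or None) per aligned read, you can shuffle once, take growing prefixes as the subsample fractions, and count the distinct junctions seen at each level. This is only an illustration of the saturation logic, not RSeQC's implementation; the function name and data layout are invented for the example.

```python
import random

def junction_saturation(read_junctions, fractions=(0.1, 0.25, 0.5, 0.75, 1.0), seed=0):
    """Count distinct splice junctions seen at increasing subsample fractions."""
    rng = random.Random(seed)          # fixed seed so the curve is reproducible
    shuffled = list(read_junctions)    # one junction id (or None) per read
    rng.shuffle(shuffled)
    counts = []
    for f in fractions:
        subset = shuffled[:int(len(shuffled) * f)]   # growing prefix = subsample
        counts.append(len({j for j in subset if j is not None}))
    return counts   # a flat tail means extra reads find no new junctions
```

For example, if you simulate reads drawn from 20 true junctions, the curve climbs and then flattens at 20 — the point where more sequencing stops adding junctions, which is the depth you'd pick for the rest of the samples.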