So now we're going to cover module two. So far you've learned how to obtain the reference files and how to get the data; you've looked at the data, the FASTQ format, the reads themselves and their format, and how to assess the quality of the reads. The next step is to take these reads and align them to the transcriptome or to the genome to figure out where they align.

Some of the learning objectives of this module: we're going to look at some of the challenges we face with RNA-seq alignment. Then we'll go over some alignment strategies for RNA-seq, pick one of the RNA-seq alignment tools, HISAT, and talk about the algorithm itself, how it works, and how it compares to other aligners out there. Then we'll talk about the output of these aligners, the SAM and BAM formats. For those of you who have not seen these files, we will go over the format and how to manipulate them, sorting being just one example, and then how to visualize those alignments using IGV. The second section of this module is about alignment quality assessment. I picked one tool to do the quality assessment for the alignments, but you don't have to use that tool. We'll go over some plots that you can generate in the lab to assess the quality of the library, whether it passes or fails, and whether you need to redo the library or realign.

Let's start by talking about some of the challenges we face with RNA-seq alignment. Sequencing technologies have advanced rapidly over the past few years, and we're generating a lot more reads than we did before. For example, when I started working on this workshop a few years ago, one lane of sequencing would generate up to 100 million reads. Now one lane can give you up to 600, sometimes 700, million reads. That's a drastic change in the number of reads you get per sequencing run. In addition, it's not only the number of reads that is increasing, but also the length of the reads. Back in the day, people started with single-end 36 base pair reads, then moved to paired-end sequencing at 75 and 100 base pairs. Now people do 100, 150, up to 200 base pairs. That's Illumina; with PacBio you can get reads that are far longer. Both the number of reads and the read length raise a computational challenge, because they increase the cost: we need more resources to be able to align such a large number of reads.

Now, you might think this challenge is not unique to RNA-seq, and DNA-seq does have the same issue. What's unique to RNA-seq is the fact that introns are not present. Unlike DNA, where the full sequence is there, in RNA-seq the introns are spliced out: exons are stitched together to form transcripts, and you're trying to align them back to a reference genome that still contains those introns. The aligner has to take care of those gaps, and that's another challenge it has to handle. In addition to that, RNA-seq is very rich: there's so much you can get out of it compared to DNA-seq. From RNA you can get expression, variant calling, splice variants, transcript assembly, isoforms, fusions. There are so many things you can get out of RNA-seq.
Running your data through one aligner, once, is not usually the end of it. You'll probably end up re-running the alignment depending on what you want to get out of the data. For example, if you want to get variant calls out of your RNA-seq, you might have to tune some of the alignment options, or even change the aligner, to be able to use the RNA-seq data for variant calling. And as was mentioned before, HISAT is not the only mapper available; there are many other aligners out there.

Aligners can be classified into three classes: de novo assembly approaches, aligners that align to a transcriptome, and aligners that align to a reference genome. So how do you know which strategy or which aligner is best suited for your data set? If we're talking about de novo assembly: usually, if the samples or the species you're working with do not have a reference available, you might resort to a de novo assembly method. Sometimes you actually do have a reference for your sample, but there is so much polymorphism in the samples you're looking at that if you compare them to the reference, you'll miss some of the events you're interested in, so it's best to do de novo assembly in that case. Aligning to a transcriptome works if you have short reads, because the reads will mostly sit within an exon and will not span exon-exon junctions. However, if you have longer reads, then it's preferred that you align to a reference genome. And each of these classes comes with a number of publicly available tools.

Yes? Is there a good resource that tracks that kind of polymorphism? Yeah, I'm not actually aware of any database that tracks the polymorphism. I'm not aware of any, but that's a good question, I'll keep that in mind. Do you know of any? A public resource of known polymorphisms, other than dbSNP, I guess. But just to know: would you try aligning to the reference first, and if it didn't work so well, try something else, like de novo assembly, for different strains or lines? Yes, exactly.

Question, yeah. Why, just because you have longer reads, would you not also align to a reference? I think the idea behind it is that because short reads are less likely to go over exon-exon boundaries, they're fine that way. You could still use longer reads, but you have to use an aligner that is splice-aware. So you have to be careful when you're choosing an aligner: you can't choose an aligner that's not splice-aware if your reads are long and span multiple exons. I think that was the idea behind it; it just has to be splice-aware. You're aligning to spliced DNA sequences? Yeah, so you could do that. Yeah, you can do that. You can discover novel ones.

All right, so as I mentioned, each of these classes comes with a bundle of aligners. Here's a list of some of the public aligners with a timeline of when they came out, when their papers were published, and you can see that around 2007 there was a spike in the number of aligners. They can be grouped as RNA-seq aligners, bisulfite DNA aligners, and microRNA aligners, and I've highlighted a few.
I'm going to go over a few of the aligners that I've either used myself or that we've used in this workshop, and talk about some of their specifications and what makes them stand out compared to others. Every tool that comes out tries to compete with the others by either cutting down processing time, reducing the memory used, or increasing the accuracy of the alignments it generates.

The first one I'm going to talk about is the TopHat/Bowtie package. I'm not sure if you've heard of it. Bowtie is the backbone aligner; it aligns short reads. TopHat you can think of as a wrapper around Bowtie: it takes the alignment information from Bowtie, tries to assemble those reads into transcripts, and puts the splice junctions together. TopHat was very popular, and we've been using it in these workshops over the past few years. The good thing about it is that it has a very good community and is well supported, so chances are, if you're new to RNA-seq and you run into any issues, you will find an answer online because a lot of people have used it. In terms of processing time, one lane of sequencing took about a day to two days to run, which is considered long.

It wasn't until STAR came out, around 2012, that there was a huge difference in processing time: instead of one lane of sequencing taking up to two days, it was taking about 20 minutes. The reason is that STAR used a different indexing approach, uncompressed suffix arrays. The only downside was that it used a lot more memory than TopHat did; I'll get into numbers in a bit to compare the different aligners.

The tool that we're using today is HISAT. It was released a couple of years ago, and I believe it's by the same developers as TopHat. It managed to cut down processing time while keeping the memory footprint low. We'll focus on HISAT, and I'll tell you how it works and how it manages that. But just to give you a rough idea: if you're trying to align 100 million reads with TopHat2, it would take about a day. With STAR and HISAT, it would take about 23 minutes to align those 100 million reads. In terms of memory, STAR would require about, I believe, 28 GB of RAM, versus TopHat and HISAT, which only require around 4 GB. So there is a huge difference in RAM usage between these tools.

Yes? What about very short reads, like microRNA? I don't know. I know they're capable of aligning; I think the recommendation was that reads cannot be shorter than 30 base pairs. I'm not sure. I know it's capable of aligning long reads, longer than 200 bases, so it can do that, but I don't know if it can handle very short reads like microRNAs.

Now, when you're trying to decide on an aligner, should you use a splice-aware or a non-splice-aware mapper? As was mentioned before, we start from DNA; the introns get spliced out, so you end up with mature RNA that is missing those introns and consists of just the exon regions. If you're trying to map mature RNA back to the whole genome, you really need a splice-aware aligner, because otherwise it will not work. HISAT2, STAR, and MapSplice are examples of splice-aware aligners that you can use. So the idea behind HISAT and HISAT2 is that it is a splice-aware aligner.
It maps to the whole genome, and it uses a couple of indexing methods, the Burrows-Wheeler transform and the FM index. It uses two types of searches, a global search and a local search, and it's the combination of the two that makes this algorithm a lot faster than others. It takes the read and first tries, globally, within the whole genome, to find a location it can map to. Then, once it has anchored the read, it looks for the rest of the read in the local indexes. When I talk about the local index, we're taking the whole genome and splitting it into bins; there are about 48,000 local indexes, or local bins, which overlap across the genome. This just cuts down on search time: instead of searching the whole genome for a read, you're searching within specific spots in the genome for that read.

I'll walk you through three different scenarios, just to show you how the algorithm works. Here we have a piece of chromosome 22 with an exon, an intronic region, and another exon, and the red chunks are the reads. The three techniques I'm going to highlight are the global search, the local search, and the extension. The three scenarios cover three different types of reads we can get. In scenario A, the read falls entirely within an exon, so it doesn't overlap any splice junction boundaries. In scenario B, most of the read aligns to exon 2, and only a small chunk of the read goes over exon 1. In the third scenario, half of the read overlaps exon 2 and the other half overlaps exon 1.

In the first scenario, where the whole read falls within exon 1, the first thing that happens in HISAT2 is the global search: it looks through the whole genome to try to find a spot for that read. It does that by looking at the first 28 bases of the read. Once it finds a spot in the genome, it stops after those 28 bases and simply extends; it doesn't have to search anymore. If there are no mismatches, it just continues to extend until it reaches the end of the read.

Now for scenario B, it starts off the same way: it does a global search on the first 28 bases, finds a location, then starts extending. It keeps going until it hits the first mismatch, and that first mismatch is there because of a splice junction. So it stops, and that's when it switches to the local search. When it stops at that point, there are 8 bases left in the read, so it takes those 8 bases and tries to find their position. This is what makes HISAT different from TopHat: the fact that it's using this local index, those bins I talked about. TopHat2 used to take those 8 bases and look throughout the whole genome to find a match for them, and you probably know that the shorter the sequence, the less unique it is, so you'd find many different spots in the genome where those 8 bases could align, and that complicates things. The nice thing about this local search is that it doesn't actually have to go through the whole genome to find a match for those 8 bases: because the global search already anchored the read, we know where the beginning of the read is.
Now we just narrow down the search to one bin instead of looking at the whole genome, and chances are you'll only find one match for those 8 bases in that bin instead of finding many different matches somewhere else. So it switches to the local search, finds a match for those 8 bases, then combines the results from the two chunks and reports the alignment.

In the third scenario, half the read aligns to one exon and the other half to the other exon. Again, the first step is the global search on the first 28 bases; you anchor the first spot, and after 28 bases you stop and just keep extending until you hit the splice junction, where it stops and switches to a local search. The local search again only looks at the first 8 bases; so the global search uses 28 bases, the local search uses 8, and after it finds the spot using the local search it just continues with the extension and then combines the results of the alignment.

Yes. How does it tell the difference between a splice junction and a variant? I think you can specify the number of mismatches that you allow, so that's something you can control. And you can also provide it with known SNPs, known germline variants, and annotated splice sites, so it can use that information to skip those regions or distinguish between splice junctions and SNPs.

The other question that you might get asked when you're aligning this data is: did you keep all the multi-mapped reads? In DNA sequencing, what sometimes happens is that a read maps to multiple locations in the genome with the exact same quality at each of those sites. What the aligner does in that case is sometimes just randomly pick one of those alignments and report it. In RNA-seq, it really depends what you're doing. If you're doing variant calling, sure, you can do that. However, if you're doing expression estimation after you align, then it's highly recommended that you do not do that, and that you actually report all the possible alignments. The reason is that you don't want to affect the dynamic range of the expression. Think of genes that have pseudogenes, for example, where the reads map to multiple locations in the genome: you want to keep that information when you're talking about expression and not get rid of those reads. Just report everything.

So what do you get out of HISAT after you run the alignment? Twenty minutes later, you get a SAM file. SAM stands for Sequence Alignment/Map format, and it's a text file, so you can open it; you can do a head, a more, or a cat. It's not recommended to view the whole file, because it's going to be a giant file, but the idea is that you can view it with text editors if you want. BAM is the binary version of that file, and usually you will end up converting your SAM into a BAM. The reason you do that is to save space, and a lot of the downstream tools will expect BAM format. The only downside is that you cannot easily view the aligned reads inside the file; you'll need other tools to view the alignments. There are other formats that you can use, and we'll talk about that in a bit.
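Just to make that concrete, here is roughly what a HISAT2 run looks like on the command line. Treat this as a sketch rather than the exact commands for the lab: the reference file, index name, read files, and thread count below are all placeholders.

# Build the HISAT2 index from a reference FASTA (placeholder file names)
hisat2-build chr22.fa chr22_index

# Paired-end alignment: -1/-2 are the read files, -S is the SAM output, -p is the thread count
hisat2 -p 4 -x chr22_index -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -S sample.sam

# Optionally, a file of annotated splice sites can be supplied with --known-splicesite-infile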
And then there are different tools that you can use to convert between BAM and SAM. This is what the file looks like if you haven't seen one before. It's divided into two parts: the header, which contains information about the run and the alignment, and the body, which contains the actual sequences that you passed in from the FASTQ file; after the alignment, it adds information like where each read was aligned and so on. We'll go into the details of what's in there.

The other format is CRAM, I don't know if you've heard of that. It manages to compress the SAM in a very efficient way and really reduces the footprint. However, you have to be careful when choosing a compression format, because you want to make sure that the tools you're using downstream, whether for expression or fusion calling, support that format. BAM is widely used and accepted; a lot of tools support it, so you should be fine with BAM files.

Obi talked about indexing and how indexing is useful. A BAM file needs to be indexed as well, and there are tools you can run to index it. The reason you want to index a BAM file: say you're interested in pulling reads from a certain region of the genome. You don't want to go through the whole file; if you're looking for a read on chromosome 22, you don't want to go through all of the other chromosomes to get to that read and pull it out. You want to get it out faster. That's why you index the BAM file, and that makes it easier to pull out specific reads or specific regions from the BAM file.

So I mentioned that the header section of the BAM file contains information about the run you've done and the sample. This is the kind of information contained in the header. The header line has a format version, which is the version of the SAM/BAM format you're using, and SO, which stands for sort order: how was the file sorted? Is it sorted by read ID, by position, or not sorted at all? Then there is the reference sequence dictionary: the reference sequence names and assembly, GRCh37 or GRCh38, along with the sequence length per chromosome and the species you're dealing with. The read group: just like the read header information in the FASTQ file that you looked at, the read group in the BAM file contains information about the library, which can be useful, plus the name of the sequencing center and the sample name or patient ID. Finally, there is the program that you used to perform the alignment and the program version, and usually it has the actual command that was used to run the aligner, so people know what parameters you used to generate this specific BAM file.

So that was the header. Now we move to the body of the BAM file, which contains information about each read that you passed in: it aligned it, and now it's telling you where it aligned and what the quality of that alignment was. So let's walk through an example alignment and go through the different columns that you'll find. It starts with QNAME, the query name, which is the read ID. Then comes the FLAG, which I'll explain in a bit. RNAME is the reference name, the chromosome; here it's saying this read aligned to chromosome 1, and the POS column gives the position it aligned to. The mapping quality comes next; that's a Phred-scaled score, the negative log of the probability that the read was mapped to the wrong place, so the higher the mapping quality, the better. Then there's the CIGAR string, which I'll also talk about in a bit. RNEXT is just saying: does this read have a mate, and where is the mate?
An equals sign there means that the mate of this read aligned to the same chromosome, so it's also on chromosome 1, and the next column is the position of the mate pair on chromosome 1. Then you get the actual sequence, and then the quality of each base within that sequence that you tried to align.

The two things that I skipped, the flag and the CIGAR string, I'll talk about now. The flag is a bitwise number, built from 12 bits, that describes the alignment. You can think of it as a way to consolidate a dozen yes/no properties describing the alignment into one number, and that one number can tell you a lot about the alignment. It can tell you whether the read mapped uniquely or to multiple places, whether it's the first read in the pair or the second read in the pair, whether or not the read aligned at all, whether it's a primary alignment, a secondary alignment, a PCR duplicate, a supplementary alignment, and so on.

So let's take a second and use a web page, because if I give you a number, you can't really decode it in your head; it's hard because it's a bitwise number. There's a page where you put in the number and it tells you exactly what information it encodes. If we go back to the example that I showed you, the flag in this case was 99. So let's try this. If you just search for "explain SAM flags", this page will come up. I should increase the size. So here, for example, 99, the flag we're trying to look up. Once you hit explain, it checks the boxes: 99 means that the read is paired, the read is mapped in a proper pair, the mate is on the reverse strand, and this read is the first in the pair. As I said, you can put in any number within that range and it will break it down for you.

Now, why is this important? It's important because if you're trying to subset a BAM file, you can use samtools view. There is a parameter in samtools view where you give it this number and it will pull all the reads that fit those criteria or conditions. So if I'm interested in all the reads that fit these four conditions, all I do is run samtools view -f with this number, and it will pull all the reads in my file that fit those specific conditions. If you want different conditions, you check the boxes on that page, it computes the number for you, and you go back and use samtools view to pull those reads. You can either pull the reads that satisfy the conditions or exclude the reads that satisfy them, by using lowercase -f versus uppercase -F. I find that pretty useful; I use it quite often.

So that was the flag. Now, CIGAR strings: you can think of a CIGAR string as a sequence of lengths paired with operation codes. What the CIGAR string is doing is describing the alignment status of every base in your read. If you have 100 bases in your read, the CIGAR string gives you a code for how each of those 100 bases aligned. If we take this string as an example, it says 81M859N19M. M, if you look at the table, stands for an alignment match, and N stands for a region skipped from the reference. So what this is telling me is that, out of the 100 bases in my read, the first 81 matched the reference.
Then the aligner had to skip 859 bases, and then it was able to match another 19 bases. The 859 represents the splice junction: it's saying, I aligned part of the read to an exon, then there was this gap I couldn't align, which corresponds to the splice junction, and then 19 more bases matched. The 81 plus the 19, everything that matches, should add up to 100, the length of your read. You will also see other letters, like soft clipping, hard clipping and so on, which can be useful for visualization. I believe that's what IGV uses: it reads the CIGAR and renders the alignment on the screen.

So I've mentioned one way to subset a BAM file, which is by using those flags I talked about before. Another way, if you're interested in subsetting a region, a gene, or some small region of the genome, is to provide another file, a BED file. In that BED file, you specify the chromosome, the start, the end, and an ID. So if you're looking for gene A on chromosome 22, you say: chromosome 22, this is the start of my gene, this is the end of my gene, and my gene is called gene A. There are tools that take the BAM file and that BED file and subset the BAM to pull only the reads that belong to that specific region or fall within those coordinates. So you can create a mini BAM for your gene, your transcript, and so on. Can you do that with a gene annotation file? I don't know if you can subset with that directly; usually it's a BED file. If you're using bedtools, then you will need a BED file to do that. But that's just one use of BED files; once you work with RNA-seq data or other types of data sets, you will find that BED files are very useful.

There are a lot of tools that you can use to manipulate and subset these BAM files. This is just a small list of some of them: samtools, as I've mentioned before, which you can use to view and subset files; bamtools; Picard, which you can use to sort your BAM file. For BED files, you can use bedtools or BEDOPS, and there are a lot of other tools out there, but these are the most common ones.

In terms of sorting, there are two different ways you can sort your BAM file: by position or by read name. Sorting by position is mainly for performance reasons; if you're trying to pull a specific read from a specific coordinate, it makes it faster to pull the reads from the BAM file. Sorting by read name is what you want if you're interested in read pairs. For example, for fusions, you want to look at the pair: pull the information for both mates and see whether they align to the same chromosome or different chromosomes and how far apart the two reads are. So, two different options for sorting your BAM file. And if you already have a BAM file, as I mentioned, you look at the top, in the header, and it will tell you how the file was sorted, whether by coordinate or by read name. A lot of the downstream tools that work on BAMs require the BAM file to be sorted, so that's the first thing you need to do after you get your BAM file.
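Since we'll be doing these operations in the lab, here is a rough sketch of what those samtools commands look like; the file names and the BED file are placeholders, and the flag value 99 is just the example from the slide, so adapt these to your own data.

samtools view -b -o sample.bam sample.sam                          # convert SAM to BAM
samtools view -b -f 99 sample.bam > proper_pair_first.bam          # keep reads whose flags satisfy all the bits in 99
samtools view -b -F 4 sample.bam > mapped_only.bam                 # uppercase -F excludes reads with these bits (4 = unmapped)
samtools view -b -L genes_of_interest.bed sample.bam > mini.bam    # subset to the regions listed in a BED file
samtools sort -o sample.sorted.bam sample.bam                      # sort by coordinate; add -n to sort by read name instead
samtools index sample.sorted.bam                                   # index the sorted BAM for fast region lookups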
If it's not sorted properly, make sure you sort it before you do expression analysis or anything else. And as I've mentioned, one way to visualize the BAM file after you've generated it is to load it into IGV. You've seen IGV before, have you used it before? Yeah. So I don't need to get into details, but this is the chromosome you're interested in, in this case I think it's chromosome 3, and these are the pile-ups of reads. This is the gene annotation: the blocks represent exons, and the lines that connect them are intronic regions. You expect to see reads piling up on exons and not really on introns, unless there is a novel discovery, a novel exon that has not been annotated before and that you discovered in your data set; that could be a reason for reads over an intron.

So now we're going to move to the second section. Now that you have the BAM file and you've sorted it, you want to find out if your library worked, if the library was good, if the alignment worked. So you do alignment QC. Just like we ran FastQC on the FASTQ files, you can run FastQC on the BAM file; however, FastQC is not really designed to assess the quality of a BAM file. What it's doing is checking the sequences, not how well the alignment was done, so it's only checking the sequence quality from the sequencing, not the alignment. There is a tool called RSeQC that you can use to get all the metrics that I'm going to talk about today, or you can implement these metrics yourself; I'll show you how, it's very straightforward.

Some of the metrics we're going to look at: 3'/5' bias, nucleotide content, base and read quality, how to assess PCR artifacts and what to do about them, sequencing depth, base distribution, and insert size distribution.

The first one is 3'/5' bias. The way the tool does it is that it takes the top 1,000 expressed transcripts, and since these transcripts will have different lengths, it divides each transcript into 100 bins so that the transcript length is normalized. Then it counts reads per bin, and that's what we're plotting: from the 0th to the 100th percentile of the gene body, we're looking at the coverage in each of those bins or percentiles. What you expect to see is even coverage across the transcript. Here we have two batches of samples: one batch where we see uniform coverage across the transcript, and another batch that has a lot more coverage at the 3' end than at the 5' end. That usually raises a red flag. It could be the library construction technique you used, maybe a poly-A selection, which has a bias towards the 3' end, or it could be degradation in your sample, where the transcripts are being degraded at the 5' end; that's also not a good sign. So if you see this, you want to go back and check what caused it; if it was the library technique, switch the library technique, and if you can't, there are tools out there that you can use to adjust and normalize for such artifacts.

I believe it was mentioned here before, when Obi was doing the FastQC part, that there is a way to check nucleotide content.
You expect the nucleotide content to be roughly proportional, around 25% for each of A, C, G, and T, throughout the read. Here we're looking at each base position of the read, from 0 to 35; we're assuming this is a short read with a length of 35 bases. For each position within the read, we're looking at the nucleotide content. As was mentioned before, in the first 10 bases or so there seems to be a bias in the nucleotide composition, and the claim is that it comes from the random primers that are used to reverse-transcribe the RNA to cDNA. They use random hexamers, which are just short chunks of DNA that bind to the RNA; we need that for the reverse transcriptase to initiate, just like a polymerase. And apparently these hexamers, which are six bases long, introduce a bias: they have a preference for binding certain sequences in the fragments, and that bias is what causes this pattern. It's usually not a big deal, because it's not happening at a fixed genomic position; it's happening at the beginning of each read, and you usually have a lot of reads overlapping any given region. But it's best to generate the plot, see if you can avoid it by using a different library prep strategy, or, if you have a long read and can afford to trim the first few bases, go ahead and do that, then align and see whether there is a difference between the alignment before and after trimming those bases.

Quality distribution. Here we're again looking at a 35-base read length, and at the quality of each base within the read, compiled across all the reads in the library. As I mentioned before, the Phred score is used, and the Phred score is the negative log of the probability that the base call is wrong. So if you have a Q score of 30, that means there is a one in a thousand chance that the base is wrong. Q30 is usually the threshold used when we assess quality, so you want everything to be above Q30. If you see anything below Q30, there are a few options: you can trim those bases, and a lot of trimming tools let you specify not just how many bases to trim, but to trim anything below a given quality score. The quality in this example is good; everything is well above the threshold.

PCR duplication. Duplicate reads are reads that have the exact same start and end positions. When you're doing DNA sequencing, one thing you want to avoid is pile-ups of identical reads. If you're looking at a region in IGV and you see a stack of reads that all have the same exact start and end, you don't want that, because if there is a variant in that region it will be amplified, and that amplification is not biological, it's artificial, it comes from PCR. PCR is the process where fragments get duplicated to give you more reads, and the error will propagate if you don't collapse your data. So what we end up doing is collapsing the duplicates and keeping only one read out of each set that has the same start and end position. That works for DNA sequencing because you expect the start sites of the fragments to be randomly distributed across the genome; there are no specific starting sites when we talk about DNA-seq. However, with RNA-seq, the case is a bit different.
You want to be careful when you choose whether or not to collapse, because there actually are fixed starting points in the genome: the transcription start sites, where transcription begins. So these pile-ups of reads could simply be a sign of transcription start sites, and if you collapse them, that will affect the dynamic range of your expression values. You have to be very careful to distinguish between a true PCR artifact and a biological transcription start site that's producing these PCR-like pile-ups. This plot can help you assess how severe the problem is, and based on that you can decide whether or not you want to collapse. Here we're looking at the number of reads versus the occurrence of reads, that is, how many times each read is duplicated. For a good library, you want it to look something like this; if you get something like this instead, where a large number of reads have a very high duplication rate, then again you might want to go back, check the library technique, and see if there is something you can modify to get rid of it.

Another question that gets asked is about sequencing depth: how many lanes of data do I need, or how many samples can I multiplex within a lane and still be able to answer my research question? This is very much dependent on what you're trying to do. With DNA sequencing it's pretty simple, because you can just look at the average coverage and say, okay, I want to sequence to 40x coverage, I want every site in my genome to have an average depth of about 40 reads. The problem with RNA is that you cannot do that, because the expression of genes varies: some genes, like housekeeping genes, have very high expression, and other genes are not expressed at all, so there isn't one number you can pick as your coverage across the transcriptome. And not only that, it also really depends on what you're doing: are you assessing expression, looking at fusions, doing variant calling? Each of those probably needs a different threshold and different coverage.

But one way you can assess how many reads you will need is by looking at this plot, which is a splice junction saturation plot. What the tool does is take your BAM file, the full file, and randomly sample different fractions of it: 10% of the reads, 20%, 30%, up to 100%. For each sampling level, it finds the number of splice junctions detected, novel and known, and makes this plot. What you're looking for is the point at which the number of junctions you detect saturates; saturation means that beyond this number of reads, you're not actually detecting any new splice junctions. Obviously, there's going to be a difference between novel and known junctions, because there are probably false positive splice junctions in your data set, coming from the aligner or the splice detector you used. Here, for example, the known junction curve saturates a lot earlier than the novel junction curve. So it's a balance between the known and the novel: you want to pick a point at which both start to saturate or slow down, and at that point you say, okay, this is probably the number of reads I need; I don't think I need anything more than that. So you can do that, and you don't have to do it for all the samples.
Maybe you can run one sample, do one or two lanes, do this kind of experiment, see at what number of reads the splice junctions saturate, and then go back and apply that desired number of reads to the rest of your samples, provided they all have similar biological conditions. Another way to do it, if you don't want to look at splice junctions, is to look at the number of expressed genes, or the number of new families of expressed genes, that you detect per sampling level. So instead of counting splice junctions, you ask: am I actually seeing new families of genes being expressed as I increase the number of reads? And again, if you reach a point where you're not seeing a big difference, you're not seeing new genes, then that's a point at which you can draw a line and say this is sufficient. The sampling is done by RSeQC; I think it only requires two files, a gene model annotation file and a BAM file. I don't think so. I mean, you can always downsample the first sample that you did: if it was 100 and you only needed 50, you can downsample it to put it at the same level as everything else. But again, I would make sure the other samples match the conditions you tested, or they might require a few more reads. As long as they're under the same conditions and are very similar, that works, and I would allow a range as well: if you think it's 50, then maybe add a bit of buffer just in case.

Another plot that you can generate is the base distribution. Here we're looking at all the bases that we aligned and asking what the distribution looks like: what percentage of those bases are coding, what percentage are UTR, intronic, and so on. That will give you an idea of what library was used. If you had no idea whether it was a whole-transcriptome or a poly-A library, this plot will give you an idea of what was used, and if you see something strange, again, you go back and talk to the people who prepared the libraries. For poly-A, you expect to see a lot more coding bases than for whole transcriptome, and so on.

Another check you can do is to look at the insert size distribution, and there are several terms that get used interchangeably here: you see fragment size, you see insert size, you see inner mate distance (when people use TopHat, that's the term used there), and it can get confusing, because they don't all mean the same thing; each term means something different, and you have to be careful about what you're plotting. When we talk about fragment size, that includes the whole fragment you're trying to sequence, read 1 and read 2 together with the adapters; that's fragment size. Insert size is usually the fragment without the adapters, and the inner mate distance is the distance from the end of read 1 to the beginning of read 2. So depending on what you're plotting, the distributions will probably look similar; it's just that the numbers will be a bit different depending on whether or not you've included the adapters, so you just have to be careful.
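If you want to generate these plots yourself, here is a rough sketch using RSeQC. The script names are the actual RSeQC tools, but the BAM and gene model file names are placeholders, I'm assuming the gene model is in BED format, which is what these scripts typically expect, and you should double-check the options against the RSeQC documentation for your version.

geneBody_coverage.py -i sample.sorted.bam -r gene_model.bed -o sample_qc     # gene body coverage, for 3'/5' bias
junction_saturation.py -i sample.sorted.bam -r gene_model.bed -o sample_qc   # known vs novel junctions at each sampling level
read_distribution.py -i sample.sorted.bam -r gene_model.bed                  # fraction of bases in coding, UTR, intronic regions
read_duplication.py -i sample.sorted.bam -o sample_qc                        # read duplication rate
inner_distance.py -i sample.sorted.bam -r gene_model.bed -o sample_qc        # inner distance / insert size distribution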
So here, for example, we are plotting the insert size distribution, and this just gives you an idea of what the fragment size was relative to your read length. The closer the numbers are to zero, the closer you are to sequencing into the adapter, because your reads might be too long and your fragments a bit short; so this gives you an idea of whether your fragments were too short or your reads too long, and you can go back and adjust. Here you're seeing a bimodal distribution, which is usually not a good sign: it means that when size selection was done, two sizes were probably selected, or there is some sort of degradation that is causing the different modes in your data set. The values can be negative if your fragment size is small and your read length is longer than the fragment. Say you're doing 2x100 and your fragment size is 80: you sequence the 80 bases of the fragment, but then you read 20 bases into the adapter, so you end up sequencing the adapter. If you want to avoid that, you can either trim adapters or shorten your read length; those are the two choices you have to avoid that problem.

If you're looking at variants, I'm not sure whether there are tools specifically designed to call variants from RNA-seq alignments. You can try GATK; they might have something specific for RNA. Another way, if you're interested in a specific gene or region, is to use IGV and look at any alteration that is present. For SNVs or SNPs, you will see that the base is colour coded; here, for example, the reference is a C and you're seeing a T. You might have some DNA-seq data and want to do some sort of validation, so you look at your RNA-seq data set just to make sure that the variant you saw in the DNA is actually present in the RNA. That's one approach: go back to IGV and check that the variant is present in the RNA, which gives you more confidence that the variant is not an artifact, that it's actually there.

Can I ask something? Have you compared the two, say you have the same sample and you try to call variants from the DNA sequence and from the RNA sequence of that same sample? Is it true that, because of differences in proofreading activity, you would see fewer variants from the RNA than from the DNA, and could you use just the mRNA to get a rough estimate of the number of variants? That's possible. You also have to keep in mind that in RNA alignments you're doing splice detection, so there is probably a high false positive rate. When you're looking at variants, make sure you check how far they are from the splice sites, because if you plot them you'll probably see hot spots of false positive variants around the splice sites: the tool is trying to find the splice junctions, and when it places one incorrectly it creates mismatches, and those can show up as false positives. So you will have more false positives among variants called from RNA, and you just have to be careful there. Thank you.