 All right, so we're on module two. So early this morning, you had the intro to cloud computing. You had the introduction to RNA sequencing. You did the tutorial. Now we're going to focus on the alignments, RNA-seq alignments, what aligner we're going to be using, what files we'll be generating, and how we can visualize alignment files. So objectives for this section would be to go over some of the challenges that we face when it comes to RNA-seq alignments, some alignment strategies, the tool of choice, as it was mentioned before. So we'll be using Bowtie and Tophot for this workshop. Then I'll introduce you to the BAM format, SAM format, and BED format, if you're not familiar with those. And then some basic manipulations of BAM files, alignment files, and then some visualization. And then the second section of this talk is going to focus on alignment QC. So what kind of QC metrics you could use to assess the quality of your library that you're about to sequence. And also a bit on BAM recounting, if you're trying to call variants from RNA-seq. So some of the challenges that we face with RNA-seq is the first challenge is the computational cost. We're dealing with hundreds and millions of reads. So as sequencing is getting cheaper, we're getting more and more reads that we have to process. The other challenge is the fact that unlike the DNA-seq, RNA-seq, we have to deal with introns. When we construct libraries from mRNA, so these are exons or coding regions, but when we try to align them against reference, we have a whole genome reference, which includes both introns and exons. So that's another challenge that the aligners face when they try to map exons to a whole genome reference. And the fact that RNA-seq data is so rich, it contains so much information about, you can get information about, not only expression, but you can get information about splice variants. You can get actually, you can call SNVs from RNA-seq. You can look at fusions. And each one of these different data types have its own set of pipelines and ways that you can process that data. So when it comes to mapping strategies, there are three different categories of mapping strategies. So there is the Dinova assembly. You can also align to a transcript tone if you have a gene model that you are familiar with or you know. Or you can use the whole genome reference and infer possible transcripts, abundance from that. So how do you decide which strategy you should pick when you're trying to choose an aligner? For Dinova assembly, you would pick that if you are, if you don't have a known reference genome, you would go with that. Or if you have a very complex polymorphism or mutation haplotype, that might be missed if you compare it to just a regular reference genome. You can also align to transcript tone, as I said, if you have short reads. But what we are going to be doing today is that we're aligning to a reference genome. And that's what most people do with the aligners, such as Top Hat and Botai. And as mentioned here, each strategy involves different aligners and assembly tools that come with it as a package. So this diagram here shows a list of the different aligners that have been published since 2001. And you can see there are a lot of aligners. The aligners that are for RNA are highlighted in red. I just want to highlight a couple of aligners. So the first one I want to highlight is Top Hat. It has been used for a very long time. There are a lot of publications that have been using Top Hat. It has a very good documentation and support. And most of the time, a lot of the questions that you have regarding how to run Top Hat, you'll probably find answers for them online if you just do a simple search in Google. Another tool, another aligner that has been also used lately is Star. It's known to be extremely fast when we, if you want to align a human paradigm library, one lane. Usually in Top Hat, it would take about a day. With Star, it takes a few hours. So it's a lot, a lot faster than Top Hat. So in the workshop today, we will have commands. You've installed Star in the first module. We will focus on Top Hat, but we'll also run Star if we have time. So all the commands for the Star alignments are going to be there. So we'll be doing the alignments with Star. And then we'll also do the expression and then differential expression using that alignment from Star and Top Hat. So should I use splice-aware or unspliced mapper? Again, as I mentioned, the libraries that you've prepared, they are mRNA libraries. And they only code for exons. And the whole genome contains both introns and exons. So if you're doing, if you're aligning against a whole genome, then you have to pick an aligner that is aware of splice junctions. Otherwise, if you don't want that, then you should align it to a reference that consists of only the transcriptome. So how does Bowtie work? So Bowtie and Top Hat. So Bowtie is the back-end aligner that Top Hat utilizes or uses. So what it does is this Top Hat is splice-aware and requires a reference genome, so a whole genome. And what it does, essentially, is that it takes the reads and then it breaks them into small chunks. And then it uses that information and maps those reads to the whole genome and uses that information to come up with splice dictionary. And I'll go into the details of or go through an example to show you how that works. So let's take this as an example. So we have two reads, read x and read y. And this is the reference genome, two exons. Exon 1, Exon 2, and then the entronic region in between. And let's say that the first read, read x, spans two exons, while read y only spans one exon. And you're trying to align that. So what Bowtie does, it looks at each read. And the first pass, it tries to align the read against the reference genome. So in this case, it's not going to have any difficulty with read y because it spans one exon. And it will map perfectly to the first exon. However, then it goes to read x and is trying to align that. It will have an issue because half of the read actually belongs to exon 2 and the other half belongs to exon 1. But the reference has an entronic region. So it will not be able to map it. So what it does, it throws it into an unaligned bin. And then it takes all the reads that are in the unaligned bin. And it breaks them into smaller chunks. And then it takes each one of these chunks. And it goes back and it aligns to the whole genome reference. So after breaking the read, now the first chunk is going to align to exon 1. The third chunk is going to align to exon 2. But the one in the middle is still going to have some issues aligned. But what it does is that it collects information, the map information, from the x1, x2, and x3. And then it tries to come up with a splice library of potential splice sites or where are the splice junctions. And then once it comes up with that library, it's going to go through the alignments again based on the splice junction library that it came up with. And it will re-align all the reads around those sites that it detected to see if those are true or not. And then it will annotate each read according to the splice junctions that it detected. So when you're doing alignment, another question that you will face is whether you or not you should use multi-mapped reads. So when we're dealing with DNA seek, most of the time what we do is we pick. Sometimes a read can map to multiple places. And it maps to multiple places with very high quality. So with DNA seek what we do is you randomly pick one of the top alignments if they all have the same exact quality score, if they all have the same high score. But with RNA seek, you don't want to do that because you don't want to affect the dynamic range of your expression. So if you're assessing expression, let all the reads that multi-map keep them in there. If you're trying to call variants from RNA seek, then you can do that method that you do with DNA seek and only pick one of the reads that multi-map. So what's the output? So you run your fast Q file through the aligner, through the splice junction detector in the assembly. The output will be a BAM file. So a SAM or a BAM file. A SAM file stands for sequence alignment map format and a BAM file is just the compressed binary version of that file. Now a SAM file is simply a text file so you can just open it up and then you'll be able to read the content of that file. With a BAM, you cannot do that. You will need to convert it to a SAM or use other tools that will help you open up the file and view the content. You're saving a lot of space by converting from SAM to BAM. So most of the time, if you're running out of space, it's best to convert your files to a BAM file and just keep the BAM because you can always go back to the SAM and uncompress. And a lot of the tools that are downstream, they take both SAM and BAM formats. How can you convert BAM to SAM? So throughout this stock, you'll see links to Biostars blog. One tool that I usually use is SAM tools. So you can use SAM tools to convert a SAM file to a BAM file. And that tool does not only convert but it does a lot more than just conversion. And we'll see some examples throughout the talk today and tomorrow. So what does a SAM or BAM file look like? So here we're looking at an example of a SAM file. I just want you to know that there are two components to that file. So there is a header and there is a body. And I'm gonna go through each section and then in detail show you what each section includes. So as I said, the SAM format consists of two sections. There's the header, there's the alignment section. And I also mentioned that the BAM is a compressed version of the SAM. Also, one thing to keep in mind is that BAMs file, they're usually indexed. So especially when you, we're gonna try and view these files in IGV, we'll have to index the file. So what that means is that you generate another file smaller version that is the index of the BAM file. And the extension is BAI. And what it does, it helps you retrieve alignments a lot faster with that index. You won't be able to view the BAM file in IGV unless you have the index file associated with it. Now this is a breakdown of the header section. So this is the information that the header section of the BAM file. So you start with the header line, which contains the format version and the sorting order of the alignments, whether you've sorted by coordinates or you've sorted by the readings. Then you get information about the sequence that you've used in your alignment. So the reference name, the reference sequence length, and also the species. The read group is the third section. You have a read group identifier and the name of the sequencing center and a sample name. So all of these things are things that you can modify yourself and add. And it's best if you annotate it properly and add the proper sample name, the name of the tool that you use, the version that you use, so that when you open a BAM file, you know what was used to process this BAM file instead of adding big information that, and then you won't know what was used exactly to generate this file. And then again, so you can add the program name and program version. If this wasn't done properly at the beginning, there are tools where you can go back and modify this information, the header, to annotate it properly. So that was the header section of the BAM. Now if you move to the body of the BAM, you will find a list of the file information. So I've highlighted two, which I will talk about in depth after this slide, but you'll find information for each, so you'll each line would consist of a read or an alignment for that read. So you'll have the template name, you'll have a fly, which describes the read, and I'll get into that in a bit. You have the reference name, which means the chromosome that read aligned to, the position, the map and quality. So there are two qualities that you'll get. You'll get a map and quality for the whole read, and at the end you'll get also a sequence, so a quality for each base in your read. You'll get a cigar string, and I'll talk about that in a bit as well. And you'll also get information about, if you have paired end reads, it will tell you where the, not only where your read aligned, but if it has a proper pair, and where that pair aligned as well. What chromosome and what's the position, or how far is the second read is from your first read. So for example, if you have a fusion, you realize that the second pair would be on a separate, on a different chromosome, so that's one way you can actually pull the fusion out from a band file. And yeah, as I said, so you get the sequence itself, so this is an example, so you'll get the sequence itself, and underneath it you'll get a quality per base for that sequence. So you get a quality per base, and then you get an overall alignment quality for that read. So what are, so the two things that I had are the flags and the cigar string. So what are the flags? So the flag that you see in the band file is just an 11 bitwise flag. It describes the alignment. So think of it as, instead of having 11 separate columns to describe the alignment, you can combine those 11 columns into one metric. And that usually, that number is between zero to 2047. And I'm gonna ask you to go to this website right now if you can check out this website. So this is a breakdown of the possible combinations of descriptions that will describe the read. And this number is important because we use it to subset band files. So if you're interested in a subset of the band or a subset of alignments, that don't map. So I'm only interested in things that don't map. So we can, through that webpage that I'm just asking you to go, if you check that box, the unmarked box, it will give you a code or a number. And you can use that number to filter the band file using samples. So I mentioned samtools, you can use it to convert SAM to BAM. You can also use it to subset the BAM file. So if you, in Samtools view, there is an option called dash F. And depending whether it's capital dash F or small dash F, you can only subset things that did not align or exclude things that did not align. That's the difference between the small F and the capital F. And you can actually use combinations. So if you check two things, so if you want things that are properly aligned and they are the first segment in the template, so the first read. So you can check this and this, and you'll get a number. And you can take that number and then use it in Samtools view to subset. So you generate a new BAM file that includes things that are properly aligned and you're only looking at the first reads. And so on. Any questions regarding the flags? The other thing that you'll find in the body is the cigar string. So each read will get something called the cigar string. And the cigar string is a breakdown of the alignment summary per read. So if you take this, for example, start with the example, this is an example of a cigar string. So you'll have 81M, 859M, 19M. So what this is telling you that out of your read, the first 81 bases were a match. And then the second 850, and then there was a gap of 859 bases in the reference. And then there was a 19 base match in the read. So when you see this gap in RNA-seq, it usually means intron. So there's an intronic region. So this is an exon, exon, and then there was a gap that represents an intronic region. So for each base, you will get information about whether or not the base matched, and whether or not there was an insertion, a deletion, or there was just a gap in the alignment. Another way you can subset BAM files is, so I said you can subset them according to the quality of the reads or description of the reads, but you can also, if you're interested in, for example, a list of genes that you wanted to pull out, or a list of positions that you wanted to pull out from your BAM file. Another format that is used very often is the BED format. So the BED format is a text file that has four columns, chromosome, start, end, and annotation. So in this case, it could be gene name, it could be whatever target that you want. If you're doing targeted sequencing or you're doing exon capturing, each one of these technologies, they will come with a BED file, list all these different targets. And so you can subset a BAM file using that BED file that you have, and a lot of tools and QC tools, they accept BED files if you're only interested in those regions, if you wanna generate QC for just those regions. So BED file is very easy to generate four columns and they're top separated. So this is a list of tools that you can use to manipulate SAM, BAM, and BED files. The ones that are used quite often, SAM tools, Picard, and BED tools for modifying BED file. So for example, if you're interested in looking at coverage for certain alignment using a specified BED file, you can use BED tools to do that. So after you generate the BAM file, you will probably need to sort it if the package or the tool that you use doesn't sort it itself. And there are two ways that you can do the sorting. You can sort the reads by position or by read name. So you sort them by position, it just makes the accessing reads within the BAM file a lot faster for tools that want to access reads within BAM file. However, if you wanna sort them by read name, the reason why you do that is you want to maintain the read one, read two association. Because if you sort by position, you might miss on the order of read one and read two. And for some tools, you want to keep that order, you wanna maintain that order. So you sort by read name and this makes sure that the order is maintained. So you can view the alignment in a tool called IGV. And as I said, you will need to index that BAM file because IGV will require the alignment file along with the index. And this is just a snapshot of what IGV looks like. So when you open up the BAM file, you will see, at the bottom you'll see a gene track. You will see also the genomic position right here. And then you will see piles of reads. So the blocks here represents coverage. So areas where there were reads that covered that part of the genome. And then the light blue dashes represents the intronic or predicted intronic regions from the tool that you used. And so what you do is you try to look at those predictions and your annotated gene track to see if it matches. If you're getting the same slicing adjunctions as it was predicted in an own track like UCSC or whatever track you can load. So you can load as many tracks as you want and you can actually load multiple BAM files at the same time. It gets too crowded sometimes. So the maximum of two BAMs is enough. Now, also you can color the reads to reflect the quality of the reads. You can also color it to reflect the strand that the read is coming from and so on. So there are so many different parameters. We'll get the chance to actually view the BAM files that we generate today through IGV. How many people, just curious, how many people have used IGV before? Okay. Oh yes, of course. So yeah, so this is just a list of alternative viewers to IGV, you can check Biostars as well. There's probably a list there. All right, so now moving on to the second segment, which is alignment QC assessment. So here we're gonna cover some simple metrics that we look at when we assess the quality, such as three prime and five prime bias, nucleotide content, base and read quality, PCR artifact, sequencing depth, base distribution and insert size distribution. So so far, when we talked about quality, we used FastQC, we were looking at the quality of the sequence prior to alignment. So we didn't really look at anything post alignment so far. You can use FastQC on a BAM file. So if you pass it a BAM file, it will still run. But again, it will not provide you with information about the alignment or how well the reads align. It will tell you that the reads have a certain quality but these are sequence quality, not mapping quality. So these things are interesting and they're really important to look at post alignment to assess whether or not your library actually construction worked. So the first QC is the three prime and five prime bias. And in these slides, I will be referring to a tool called R6C. There are many tools out there that you can use to generate QC reports. You can even come up with your own. So if you don't want to use this tool, that's fine. But I'm giving you a frame that you can work with. So if you look at the plot, it's something that you can generate your own if you don't want to use this tool, if it's not available. So for example, in this slot, what I'm looking at, I'm looking at in the x-axis, I'm looking at the gene body percentile. So the position, normalized position for the transcripts. So what that means is that if transcripts have different lengths, the length gets binned so that you bin the length of transcript to 100 bins so that you standardize the length. And then you look at each one of these bins and you look at the coverage for each one of these bins. So what you expect to see is you want to see sort of uniform coverage across the transcript. You want to see a similar depth across the transcript and you want it to be balanced across. So here you'll see two groups. The first group where the coverage is balanced but you also see another group where there is so much coverage at the three prime end. And you need to be careful when you're analyzing these things because if you have bias in your coverage, it will highly influence your expression estimation because short reads will have overestimated expression and very long reads will have underestimated expression. So if you detect such a thing at the alignment level, you need to either go back and figure out why that happened. Most of the time it's probably the kit that you use to construct the library, there is a problem with it or maybe your RNA is degraded. So and if you can't do anything about it then you will need to look for computational tools that will assess and fix such biases when they're assessing expression. Another plot that you can do, the nucleotide content. So here on the x-axis you can plot the position of the read. So this is a short read, there are only 35 bases and then you look at the percentage of ACGT within each position in your read. And what you expect, you expect it to be random, right? So it should be about 25% each. However, with Illumina sequencing, I'm not sure if some of you have generated such a plot before. You will notice that the first 10 bases they are not very random. I mean you see imbalance in the percentage of ACGT and Illumina has recognized that and it's due to the random primers that they use when they reverse transcribe RNA into a CDNA. And it turns out that the enzymes they are not very random and they introduce some patterns in the first few bases of the read. So you have to be careful about that because those actually affect the mapping, mapability of your data. So one thing you can do is you can trim the first few bases to eliminate that bias from your read. Yes. In the bacterial system, how do we interpret this graph where there is GC-hypnotizing some of the strengths? So in that case, the GC percentage will be very, very high in comparison to our DNA model. So how do we say, is it GC-contained in the bacteria or it is the bad data? Like how do we get the same data? I guess with the GC-content you look at the overall distribution and it's not positional effect. So it doesn't happen at the beginning of the read or at the end of the read or in the middle. So you look at the distribution of the GC-content and see if that overall distribution is what you expect it to be. This only happens at the beginning of the read. So it's position specific. So unlike GC, I guess. So GC is not read, it's position dependent. So again, before you trim, make sure you, generate a set plot, make sure that you have a set bias and if you do, then try to take care of it. Quality distribution, as I said, there are two types of quality. There is the base quality and there is the mapping quality. So here we're looking at the position of the read. So for each base across the read, what's the quality? And you will encounter this term a lot, the FRED quality. So what is the FRED quality score? It's simply the negative log, negative 10 times log 10 of P where P is the probability that the base calling is wrong. So when you see a FRED score of 30, that means that there is one in a thousand chance that the base calling is wrong. And usually 30 is a good number to use when you're trying to pick good quality reads. So you pick things that are higher than 30. But again, the higher, the better, the more accurate the calls are. PCR duplication. So duplicate reads are reads that have the same start and end position and they are usually a result of PCR artifact. So Malachi I think talked about PCR duplication and how people don't tend to collapse reads because it affects the dynamic range of your data. But if you want to visualize PCR artifact or possible PCR artifact in your data, one thing you can do is, the tool, the same tool I talked about, looks at the number of duplicates and the number of reads that have that duplication number. And what that percentage is, how many, what's the percentage of reads that have such a thing. And ideally what you want is you want to have a small number of reads that have a very high duplication number and you don't want a curve that goes that way. Now, what would you do if you see a high duplication? I guess I'll leave that up to you because again, collapsing might affect the dynamic range of your expression data. So you might wanna check the library construction and the PCR cycles to try and adjust for that. The other question that I used to get asked when I was in production is how deep or how many lanes do we need? How many lanes of sequencing do we need? So one way you can think of that, when you talk about DNA seek, it's a lot easier because you have a threshold you say, okay, I'm gonna sequence 50X or 100X, an average of 100X. However, it's harder with RNA because of the expression. So the expression for genes will be different. So you can't really set a threshold in terms of coverage, but one approach that this tool has taken is resampling technique. So what they do is they take the BAM file, the full BAM file, and then they start taking samples from that BAM file. So they would take 10%, 20%, 30% of the reads, and so on, up to 100%. And for each sample, they would go and look at the splice junctions, the number of splice junctions that they can detect from that sample and compare it to known splice junctions. And what they're looking for is they're looking for saturation effects. At what point do we saturate? We have enough data and we can't really discover any more splice junctions. And it's that point where you probably have enough reads and you don't really need any more. So if you run that on just one sample, try to figure out how much data you need based on this plot. You can go back and then apply that number on the rest of the samples in your dataset. Yeah. Yeah, so then I guess what you can do, you can take the same approach and look maybe at the number of genes, new genes that you discover, or new genes that appear in your expression. That could be a different way of looking at it if you don't have splice. And you notice that there are three lines. So one line looks at the novel junction. The other looks at the known junctions and the purple is all junctions. And you notice that the known junctions, they saturate a lot faster than the novel junction. And that could be because the slice junction detection tool produces a lot of false positives. So a lot of these could be false positives that might saturate, might not saturate, depending on the tool that you're using. So it's good to look at both the known and also the novel when deciding how much you need. Another important thing to look at is the base distribution. So how many bases align to a coding, a non-coding region. And depending on the library you use, or the library construction, whether it's a poly A or whole transcriptome, the distribution will be different. So with the poly A will get more coding. With the whole transcriptome, you'll get less coding. So it's important to look at that distribution after. And if you're not getting the desired proportions that you should go back and again, check why is that and reassess library techniques. Before I talk about insert size distribution plots, I wanted to highlight few terms that tend to be used interchangeably in the community. But I just wanted to clarify them. So here we're looking at a DNA fragment that we are interested in sequencing. So this is the second line. You were looking at the fragment plus the adapters attached at both ends. When you're doing a single end, we are sequencing one end of the fragment. When you're doing paired end, we're sequencing both ends of that fragment. And depending on the length of that fragment and the length of your reads, you'll get a gap in between the two reads. So these are the three terms that I wanted to highlight. When you talk about fragment, we're talking about the part that you want to sequence plus the adapters. Insert is just the part that you want to sequence without the adapters. And then the inner mate is the distance from the end of read one to the beginning of read two. And that's what I'm plotting right here. So I'm plotting the insert size. So the distance between the gap between the two reads. And you'll see that there is a distribution because when you do a size selection, it's a distribution. So you'll get some reads that are longer than how there is here, showing a bimodal distribution. So it seems that there are two fragments maybe that were selected. Usually you see a normal distribution and that reflects that there is one fragment that was selected. But you notice that here I'm plotting the distance between the reads. So sometimes you'll have a zero distance between the reads. And that's when there is no gap. So the total sum of the read length equals to the sum of the fragment. But once your fragment gets shorter and shorter and your read length is fixed, you'll actually have some overlap. And that's why you're getting negative distance. Is that when the fragment size is shorter than the sum of your two reads. So you'll get negative distance between them. So that negative distance, it will not affect your expression. However, what you want, ideally, you want to keep that distance. You want it to be zero or higher. In case, for example, you're looking at, you're calling variants. You don't want one variant to be called from two reads because these two reads are dependent. They're not independent. They're coming from the same fragments. So ideally you want to pick a fragment size with increases, maximizes the answer size more than zero. In terms of variant calling from RNA-seq data, you can do that. So you can do it various ways. Here we're doing it the IGV way. So you can look in IGV and if you see a base that is different from your reference, you will see that alternative base highlighted. So here, for example, the reference base is in a G while some reads have a T. And that highlights the variants in your RNA-seq data. So that's one way you can look at that. So this only works if you actually know where the positions are to go and view them. You can also do pile up if you want where you give it a reference and then for every single position, you will go and check what the base is in your sequencing data versus what it is in the reference. And if there are any differences, then we'll report those differences and you can take those and then filter them or view them in IGV. But there are also tools that call variants from RNA-seq data. I believe GATK now, which is known for calling variants from the DNA. A DNA-seq can actually call RNA, can call variants from RNA-seq. I haven't personally used it, but it can be done. You can also look at the, if you already have SNBs or variants from DNA, you can take that and then look at those positions in the RNA to see if there's any concordance between RNA and DNA variants. And this leads us to the tutorial for module two where we'll take all the FASQ files that you obtained from the previous module and we will run both top hat, the aligner to get BAM files, alignment files. And we'll also, I believe if we have time, we'll try to run FASQC and R6C. So we have a module that's optional. We'll see if we have time. We can run FASQC and generate the plots that you've seen. All the commands are there as well. If you want to take the commands and then run them in your lab, you are more than welcome.