All right, how's everyone doing? Must be tired — we're almost through the first day, one more hour to go. A lot of learning today. So far today you've learned about cloud computing and you've had a general introduction to RNA-seq: what RNA-seq is, and some of the challenges you face with RNA-seq that you don't see with other platforms like DNA-seq. You learned about sequencing, how the libraries are constructed for RNA-seq, and what kind of files you generate after you sequence — the FASTQ files. We haven't had a chance to go over assessing the quality of the sequences in the FASTQ files, but we'll do that either later today during the integrated assignment or tomorrow morning. We're going to spend the next hour or so focusing on alignments. We'll talk about some of the challenges specific to RNA-seq alignment that we don't see with other platforms, and about the different alignment strategies available for RNA-seq data. Then we'll focus on one specific tool, HISAT2 — I'll break down how it works and provide some examples. That's the tool we're going to be using for the rest of the workshop. Then we'll talk about the output of that aligner, the BAM file, and some other formats that are useful along with BAM files, such as the BED format, and how you can manipulate the BAM file. And then we'll talk about visualization of RNA-seq alignments. The second part of this module is QC — alignment QC assessment. What metrics should you look at to help you figure out whether the run was successful and whether the quality of the sample was good, and how do you visualize that? What kind of reports can you generate? And then finally, BAM read counting and determination of variant allele expression status — that will be the last thing. So, in terms of challenges that are specific to RNA-seq, computational cost is a big one when it comes to aligning RNA-seq data. One lane of sequencing nowadays can generate hundreds of millions of reads — I've seen HiSeq lanes, for example, generating 900 million reads. That's a lot of reads you're trying to align, a lot of data, and it requires a lot of computational resources. In addition, RNA-seq data is challenging because the sequences we're trying to align contain only exonic regions, and we're trying to map them to a reference genome that contains introns and exons. So there are these gaps — the splice junctions — that the aligner has to take care of, and that's something not present in DNA. Also, it's unlikely that you will align only once and be done with it. There are so many different kinds of information you can extract from RNA-seq, from expression to fusions to mutation calling, and each of those tools has to be run on your FASTQ files. Each of these tools also gets updated frequently — every year or two there's a new version that may require you to go back and reprocess, depending on how major the update is. On top of that, the transcript dictionaries, the GTF and annotation files you're using, get updated as well, and the reference files get updated too, which again requires reprocessing. So that's a lot of reprocessing, and a lot of computational cost every time you reprocess.
And as I mentioned, there are a lot of aligners available out there. HISAT2 is just the one we've picked for today, but there are different classes of aligners. There are mainly three: de novo assembly approaches, aligners that take the sequences and align them to a transcriptome reference, and aligners that take the sequences and align them to a reference genome. Depending on your project and what you need the RNA data for, you'll pick an aligner from one of those classes. For example, if you don't have a reference genome, you'll probably do de novo assembly. Most of the people here are working with human, so the reference is available and you might not need de novo assembly. Another case is if the samples you're dealing with have a very high rate of polymorphism or mutation — then you might want to go de novo, because you don't want to miss those when comparing to a reference genome. The second option is to align to a transcriptome; that's fairly rare — most aligners actually align to a reference genome. Now, each of these classes has a bunch of aligners that, again, get updated every single year. There's a very nice map that shows a timeline of all the different aligners — it's not specific to RNA; it covers DNA, RNA, bisulfite, and other data types — and along the x-axis you can see the date each one was published or released to the scientific community. I want to highlight two or three aligners that are very important when it comes to RNA-seq. The first one is TopHat, which was released around 2009. Each of these aligners, when it's released, is trying to do three things: speed up the alignment, minimize the amount of memory used to perform the alignment, and maximize the quality of the alignment. Let's use 100 million reads as an example and compare some metrics for the aligners I'll highlight. TopHat, aligning 100 million reads, would take about 1,000 minutes. The good thing is it didn't require a lot of memory — only about four gigabytes — but the problem is that it took a long time to process those reads. The other good thing about TopHat is that a lot of people used it back in 2009, so it had a very good community; chances are, if you ran into a problem running TopHat, someone had already asked the question and it had been addressed. It's a good starting point if you're learning how to run RNA-seq analyses. The second milestone was when STAR was released, around 2012. STAR was trying to cut down on that processing time, to speed things up as much as possible. For those same 100 million reads, it reduced the time from 1,000 minutes to about 24 minutes — but at the cost of memory. Memory went from 4 gigabytes to about 28 gigabytes, which is hard to run on local machines; you'd probably need a server for those samples. But cutting down the time is very important.
Because as time goes by you're producing more and more reads, and things were taking a lot longer with TopHat; STAR provided a solution where everything ran much faster. The third milestone is HISAT. What HISAT tried to do is cut down on both the time and the amount of memory used, by introducing new indexing methods — they came up with creative ways to index the reference to reduce both memory and time. So for the same number of reads we talked about before, it now takes around 20 minutes, about the same as STAR, and it only uses about 4 gigabytes of memory. So we're cutting down on memory and cutting down on time. When you're trying to decide whether you should use a splice-aware or a non-splice-aware mapper, the short answer is: if you're aligning to a reference genome, you need a splice-aware aligner; if you're aligning to a transcriptome, you don't. HISAT is a splice-aware RNA-seq aligner — HISAT stands for Hierarchical Indexing for Spliced Alignment of Transcripts. As I said, it requires a reference genome, it's very quick, and it uses a combination of a global index and local indexes — I'll explain what that means in a bit. There is a global index: when HISAT tries to map a read, it uses the global index to anchor the read, looking through the whole genome to find where the beginning of that read belongs. Once it does, it switches to a local index. Think of the whole genome split into bins — around 48,000 of them — and those bins overlap. Now, when you try to extend the read, instead of searching the whole genome again, you only search within that bin, which really speeds up the alignment: you're focusing on a much smaller space. So it's a series of anchoring the read and extending it — global index, then local index. It does that separately for read one and read two if you have a pair, then it collects the mapping information, puts it together, and reports it to you. Let me go through three examples and walk through how it does this step by step. In these three examples, the red chunk represents the read you're trying to map, the two yellow parts are two exons from the reference genome you're mapping to, and the gray section is the intronic region in the reference — because, as I said, the reference contains exons and introns. In the first example, you have a read that is fully contained within an exon. (Bless you.) In the second example, most of the read spans exon one, but a small bit of it spans exon two. In the third example, half of the read spans exon one and the other half spans exon two. Going through the first example: as I said, it's a series of global and local indexing. The first thing it does is try to anchor the read — to find its beginning. It anchors the first 28 bases using the global index, and once it finds the location in the genome where the read belongs, it starts a series of extensions. It extends base by base up to the last base; it doesn't find any mismatches, it keeps adding bases, and it's done. The second example starts the same way.
It tries to anchor the read within the genome: it anchors the first 28 bases using the global index and then starts extending, extending, extending — but then it encounters a mismatch. That mismatch represents the presence of a splice junction, so it stops, and now it switches to the local index. Previously, what TopHat used to do when it stopped was go through the whole genome again to find a location for the little chunk that's left, and that was very time-consuming. With the local index, instead of searching the whole genome, we already know which bin the first part of the read belongs to, so we just look within that bin to place the chunk that's left. That's what it does: it anchors the first eight bases of the remaining piece and then continues the extension. The first time, you need 28 bases to anchor; for the local index, you only need eight, and eight is enough because you're searching a much smaller space — things are more unique within smaller spaces. The third scenario is exactly the same: you anchor the first 28 bases, extend, and once you encounter the first mismatch you look within your local bin, anchor eight bases again, and continue extending. It's a series of global indexing, local indexing, anchoring, and extension — that's pretty much the idea behind HISAT. A question from the audience: how does it know which bin the remaining chunk belongs to? Well, you're trying to match that sequence to the sequences within the bin. But how does it handle the fact that the RNA read is spliced together from two pieces while the reference still has the intron between them? If you go through the reference genome there is a gap, and the aligner takes that gap into consideration — that's why you're searching within the bins, why you break the genome into bins in the first place. It won't find the remaining chunk in the intronic region, but it will find it downstream, in the other exon, and then it says: okay, there's this big gap between this chunk and the beginning of the read, so I'll classify it as a splice junction. That's how it builds a dictionary of splice junctions: by locating and matching the exonic parts against the reference, and hypothesizing that whatever is missing between the two exonic parts is an intronic region. Another question: how does it differentiate between that and, say, a deletion or a fusion? It's the distance. These are parameters you can tweak — you can set the maximum distance allowed between the two chunks, and anything beyond that would be treated as a fusion; I believe for HISAT to call fusions you run a separate, fusion-specific mode. But yes, it's a threshold that you specify: above that distance the event will probably be classified as a fusion, and below it, as a splice junction.
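To make that anchor-and-extend idea concrete, here is a deliberately simplified toy sketch in Python — not HISAT2's actual implementation or data structures, just an illustration of the two-level idea described above: a global k-mer index anchors the first 28 bases, extension proceeds until a mismatch, and then only an overlapping local bin around the anchor is searched with a shorter 8-base seed. The bin size and overlap are illustrative values.

```python
# Toy sketch of global-anchor / local-bin extension (illustrative only, not HISAT2).
GLOBAL_K, LOCAL_K, BIN_SIZE, OVERLAP = 28, 8, 64_000, 1_000

def build_index(seq, k, start=0, end=None):
    """Hash every k-mer in seq[start:end] to its positions."""
    end = len(seq) if end is None else end
    idx = {}
    for i in range(start, min(end, len(seq)) - k + 1):
        idx.setdefault(seq[i:i + k], []).append(i)
    return idx

def extend(read, genome, r, g):
    """Extend a match base by base; return how many bases matched."""
    n = 0
    while r + n < len(read) and g + n < len(genome) and read[r + n] == genome[g + n]:
        n += 1
    return n

def align(read, genome, global_index):
    hits = global_index.get(read[:GLOBAL_K], [])
    if not hits:
        return None
    g = hits[0]                                   # anchor the first 28 bases globally
    matched = extend(read, genome, 0, g)
    if matched == len(read):
        return [(g, matched)]                     # read fully contained in one exon
    # extension stopped: search only the overlapping local bin around the anchor
    bin_start = (g // BIN_SIZE) * BIN_SIZE
    local = build_index(genome, LOCAL_K,
                        start=bin_start, end=bin_start + BIN_SIZE + OVERLAP)
    rest = read[matched:]
    for g2 in local.get(rest[:LOCAL_K], []):
        if g2 > g + matched:                      # downstream hit: candidate splice junction
            n2 = extend(read, genome, matched, g2)
            if matched + n2 == len(read):
                return [(g, matched), (g2, n2)]   # two segments separated by an "intron"
    return None

# usage: global_index = build_index(genome, GLOBAL_K); align(read, genome, global_index)
```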
It does this separately for the two reads of a pair, comes up with a dictionary — a list of all the potential splice junctions it found — and, if you have evidence that splice junctions are present and provide a good annotation, it will use that as a guide as well: it compares what it found with what you provided and checks whether they match. Does that make sense? Does the process make sense? Okay. So after you perform the alignment, you might get some reads that map to multiple locations with the same quality — those are called multi-mapped reads. DNA aligners usually handle these in one of a few ways: they either pick the first location on the list if they all have exactly the same quality, or they randomly select one. With RNA, it's up to you how you want to deal with them. If you're assessing expression from RNA-seq, you would usually allow multi-mapped reads. If you're doing variant calling, again, it's up to you whether you discard them or select one of the locations within the genome or transcriptome. Now, after the alignment, what do you get? You get a BAM file. How many of you have worked with a BAM file in the past? Okay, cool. Strictly speaking, we first get a SAM file, not a BAM file. Remember the FASTQ we looked at, which contains all the RNA sequences we actually sequenced? The SAM file is very similar to the FASTQ except that, along with each sequence, you get a lot more information. For each sequence present in the FASTQ, you also get information about where it is now located in the genome: it will tell you, for example, I was able to locate this read on chromosome 9 at this position, and I mapped it with this quality. Each base gets a quality and the whole alignment gets a quality, so you get a lot more information. The idea is that every piece you passed in through the FASTQ comes back out in the SAM file — the aligned file — and if a read couldn't be mapped, the file tells you so; there's a flag for that as well. A SAM file is essentially a text file. You can open it up, although it's not recommended, because there's a huge amount of information and a lot of reads. What people usually do is compress it to BAM: BAM is the compressed version, and you save a lot of space that way. You also need to index the BAM file, because many tools can work with a BAM directly — you don't have to unzip the BAM back to SAM; you can parse reads straight from the BAM if you have a proper index along with it. So this is what a BAM file looks like. It mainly consists of two pieces: the header, which contains information about the run and the tools that were used — I'll go through an example of a header — and the body, which consists of the sequences you passed in, along with all the information I just described: where each one mapped and what the mapping quality is like. And, as I said, you need an index for your BAM; a minimal sketch of the sort, index, and fetch workflow is shown below.
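As a minimal sketch of that sort → index → fetch workflow (assuming the pysam library and a hypothetical file called sample.bam):

```python
import pysam

pysam.sort("-o", "sample.sorted.bam", "sample.bam")   # coordinate-sort the BAM
pysam.index("sample.sorted.bam")                       # creates sample.sorted.bam.bai

bam = pysam.AlignmentFile("sample.sorted.bam", "rb")
# with the index present, reads from a region can be pulled directly,
# without decompressing the whole file back to SAM
for read in bam.fetch("chr1", 11000, 12000):
    print(read.query_name, read.reference_start, read.mapping_quality)
bam.close()
```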
So what information can you find in the header? You'll have the format version number; information about how the alignments were sorted — by position or by read name; and information about the reference that was used: which version of the reference, whether it was human or another species, and the individual chromosomes — what they were called and how long each one is, in bases. You'll also have read group information — the name of the sequencing center, the sample name, the library name — and then information about the program used to do the alignment; in this case it will be HISAT2, and which version of HISAT2. The idea is that, by looking at the header, you should be able to reprocess the sample: if I look at the header, I know what was used to process it and I can go back and reprocess it with those parameters. It will have all the parameters — it will actually have the command that was used to run the alignment — so you can go back and either tweak them or run the same thing again. The body contains information about the alignments. You can think of these fields as columns, because that's how they appear in the body. QNAME is just the name of the read. FLAG — I'll talk about the flag and the CIGAR in more detail in a bit, but the flag tries to compress as much information about the read as possible into a single number, and I'll tell you exactly what it means shortly. RNAME is the chromosome the read aligned to. Then there's the position, the mapping quality, and the CIGAR string — I'll talk about that as well; it essentially describes how the bases within the read were aligned. You also get information about whether this was the first or the second read in your pair, and whether the mate mapped to the same chromosome or a different one — and you can use that, for example, to figure out whether something is a fusion. At the bottom of this slide you'll see an example with all these values. Here we have a read name, which looks very similar to what we saw in the FASTQ — it's essentially the same, so you could pull this read out of the FASTQ file by its name if you were interested. It says the read mapped to chromosome 1, the position was 11623, the mapping quality is 3, and the CIGAR string is 100M — I'll tell you what that means. This field means the mate aligned to the same chromosome: this is read one of the pair, and the second read also mapped to the same chromosome, so it's not a fusion. And how far away? This field tells you how many bases downstream the mate is. Then you get the sequence — the actual whole read — and a quality for each base within that read. So what's the flag, that second column we mentioned? The flag is a bitwise flag, with twelve defined bits, that describes the alignment. Think of it as condensing a dozen yes/no properties into one number: whether or not the read was mapped, whether or not its mate was mapped, whether it's the first or the second read of the pair, whether it's a primary or a secondary alignment, whether it's a PCR duplicate, whether it's a supplementary alignment. You can't just look at the number and know what it means, but there's a website that breaks it down for you, and the short sketch below does the same decoding.
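A small sketch of that decoding in Python — the bit values come from the SAM specification, and the example flag of 99 is just for illustration:

```python
FLAG_BITS = {
    0x1:   "read is paired",
    0x2:   "read mapped in proper pair",
    0x4:   "read unmapped",
    0x8:   "mate unmapped",
    0x10:  "read on reverse strand",
    0x20:  "mate on reverse strand",
    0x40:  "first read in pair",
    0x80:  "second read in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def explain_flag(flag):
    """Return the descriptions for every bit set in a SAM FLAG value."""
    return [desc for bit, desc in FLAG_BITS.items() if flag & bit]

print(explain_flag(99))   # ['read is paired', 'read mapped in proper pair',
                          #  'mate on reverse strand', 'first read in pair']
```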
If you go to that website and put in the number, it will list all the different properties the number represents, because all of those can be combined into that one bitwise flag. The CIGAR, on the other hand, describes how every base in your read was aligned. Look at the example I have here: 81M859N19M — that's the kind of value you'll see in that column. What it's saying is: within your read, I was able to map the first 81 bases; then there was a gap of 859 bases in the reference genome; and then I was able to map the final 19 bases of the read. Do you know what that represents? It mapped the first 81 bases — and if you sum 81 and 19 you get 100, the read length — so it mapped the first 81 and the last 19, with a gap of 859 in between. What is that gap? Exactly: it's the intronic region, a splice junction. It's saying: I mapped the first part of the read, which probably belongs to exon one, and the second part of the read, which is probably exon two, and the gap in between probably represents an intron. So just by looking at the CIGAR you can get a visual idea of how each base of your read was aligned. Now that you know what BAM files look like: BAM files are huge, and a lot of the time you're only interested in a specific region within the BAM — you don't want to look at everything. If there's a specific gene you care about, you can pull that region from the BAM file and just look at that, for example in IGV. How many people have used IGV before? Okay — IGV is a way to visualize your aligned reads. I don't know if we'll have a chance to actually do IGV; are you doing IGV tomorrow? Yes? Okay, cool. In IGV you can load the BAM file and see all the reads, any mismatches, and how the reads were aligned, and you can add tracks such as gene annotation tracks, so you'll know where the reads pile up and which strand the expression is coming from, forward or reverse, and so on. It's a very visual way of looking at the alignments. However, some BAM files are huge, so if you want to subset them you can use BED files. A BED file is a text file with columns: chromosome, start, end, and then, for example, a gene name. You can use samtools or bedtools to subset your BAM file to the specific regions you put in the BED file — one gene, two genes, ten genes — and then look at those regions in IGV or do whatever you want with them. Besides samtools there's bamtools, Picard, bedtools, and BEDOPS; all of these can index your BAM file, convert BAM to SAM and vice versa, and subset your BAM files. A rough sketch of pulling BED regions out of a BAM follows this paragraph.
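Here's a rough sketch (assuming pysam, a coordinate-sorted and indexed sample.sorted.bam, and a hypothetical regions.bed) of subsetting a BAM with a BED file — similar in spirit to `samtools view -L regions.bed` — and, while we're at it, spotting spliced reads by the N operation in their CIGAR strings:

```python
import pysam

bam = pysam.AlignmentFile("sample.sorted.bam", "rb")
out = pysam.AlignmentFile("subset.bam", "wb", template=bam)

with open("regions.bed") as bed:
    for line in bed:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, start, end = line.split()[:3]
        for read in bam.fetch(chrom, int(start), int(end)):
            out.write(read)
            # CIGAR operation 3 ("N") is a skipped reference region, i.e. the
            # read spans a splice junction, like the 81M859N19M example above
            if read.cigartuples and any(op == 3 for op, length in read.cigartuples):
                print(read.query_name, read.cigarstring, "spans a junction")

out.close()
bam.close()
```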
These tools can also sort your BAM file, and there are two different ways to do it: by position, as I mentioned before, or by read name. If you're interested in fast retrieval of reads, sort by position; but if you're interested in maintaining the relationship between the pairs — for example, if you're looking for fusions and want to pull both reads of a specific pair — then maybe you should sort by read name, because that way the pairing information is kept together. Is a BED just a subset of the BAM? No — a BED is a separate text file where you just put coordinates: chromosome, start, end, and, for example, a gene name. You pass these tools both the BAM file and the BED, and you generate a new file — either a new BAM or a new text file. It's also good for coverage: it will tell you how many reads overlap the regions you specified in the BED. So if you're interested in QC or coverage, you can ask, what's the coverage like in those regions, how many reads overlap them? This is roughly what IGV looks like. At the top you have the chromosome with its p-arm and q-arm, and these lines represent the reads — each one is a single read. At the bottom you have the annotation, the transcript annotation: the exons and the intronic regions, the known annotations that you load. The idea is that you compare the distribution of your read coverage to the transcript annotation. Ideally you don't want to see reads piling up on the intronic regions; that shouldn't happen unless your annotation isn't a good representation of the samples you're working with or isn't up to date — maybe you've discovered a novel splice junction that isn't in the annotation. Ideally you want your reads to pile up where the exons are in the annotation file, and that's what we see here. Yes — a question about why some of those reads are connected to each other. Which ones — these, the ones with a line in between, versus those without? That's a good question. The connected ones are usually single reads that span across a junction. But what about this chunk right here, with no connecting line? Those look like reads coming just from the intron; it's hard to tell whether they aligned only partially, or whether they were trimmed — that's also possible. And keep in mind that IGV downsamples: you're not seeing all the reads here, and there are many display parameters you can turn on and off, so it's hard to say from this view alone without interactively checking the reads and figuring out what they represent. In general, though, where do you think these intronic reads come from? What's the explanation for that? One possibility is reads that are simply misaligned — it's not clear where they really belong. Another is immature mRNA.
The sample could contain RNAs that aren't completely processed — inevitably, at the moment you broke open all those cells, some RNA was still in the process of being spliced, so you get some intronic sequence from that. It could also be genomic DNA; there's nothing to say it has to be RNA. You can do a DNase treatment, but it's not perfect, so there will be some low-level amount of genomic DNA in every RNA-seq sample. And it could legitimately be that that intron is sometimes retained on purpose. So it's a mix of all of those things — and probably some other things I'm forgetting right now. It could also be a splice junction that's novel, or a new exon that wasn't annotated before — you could imagine there was a chunk of exon here that corresponded to those reads. Do those reads end up in your transcript counts? It depends on the details of how the counting is done — the expression estimation module will probably cover those details — but with most approaches they'll end up being ignored, because they're either not part of the assembled transcripts you built from the RNA-seq data, not part of an exon in the transcripts of your GTF file, or not part of the k-mer index that kallisto built. Each of those methods relies, to some degree, on prior knowledge of where the exons are, so the stuff in the introns is mostly ignored. If you wanted to go back and analyze those reads and ask whether they represent a novel alternative isoform, that would be a separate analysis path where you specifically start to take those additional reads into account. And most of the expression quantification tools look at properly paired reads, so singleton reads that weren't properly paired might not be taken into consideration when quantifying expression for that region. Yes — each row is a read, right? So, for example, in the first row there's a whole bunch of red boxes; is that all coming from one read? They could just be separate reads that sit very close together — it's hard to see the edges because they start and end so close to each other. Basically what IGV does is pack the reads as efficiently as possible into the top rows and only move down to the next row when it runs out of space, so you tend to get this pattern where it looks sparser as you go further down. If two reads overlap, they go on separate rows; if they don't, they can share a row. Yes — another question, about poly(A) selection. Again, it's imperfect: inevitably some genomic DNA has enough homology to the oligo(dT) beads that it sticks and doesn't get washed away, so you may still see some. But I would say that, compared to a total RNA approach, with poly(A) selection you will see less immature RNA and less genomic DNA — the poly(A) selection does enrich for the signal we actually care about. Similarly, cDNA capture also tends to clean things up, so you'll see far fewer scattered reads in the intergenic space or in the introns. The noisiest protocol is probably total RNA with ribosomal RNA depletion, but it's a trade-off.
You can also see some real signal in the introns that corresponds to antisense transcripts or other non-coding features. All right — I'll skip that. The next part of this module is QC assessment. We're going to look at a few metrics that can help us tell whether the library construction went well, whether the alignments actually worked, and whether we can take these samples forward to expression estimation — or whether we should go back, re-align, redo the sequencing, and so on. I think this alignment step is one of the most crucial steps so far. Most of the metrics I'm mentioning here come from a tool called RSeQC — I think we've provided instructions on how to install it — but feel free to use whatever tool suits you. We're providing these metrics as a guide, so you can look for the same metrics in another tool, or even construct them yourself from the BAM, a BED file, and some bedtools; you can easily put these together. One of the first metrics I look at when assessing the quality of aligned RNA-seq is bias: whether there's any bias present in the sample. I do that by looking at the coverage distribution across transcripts. What RSeQC does here is take all the transcripts, split each one into 100 bins, and compute the coverage in each bin — transcripts have different lengths, which is why you split them into an equal number of bins rather than comparing raw positions; a rough sketch of that binning idea follows this paragraph. Ideally you want to see even coverage from the beginning of the transcript to the end of the transcript. What we're seeing here are examples of libraries where one group has even coverage across the gene body, while in another group one part of the gene is highly covered and the other part is barely covered at all. This is plotted 5' to 3', and the 3' end is covered very well while the 5' end is not. So there's a clear coverage bias — where could this be coming from? Poly(A) selection, yes. Any other suggestions? Single-end reads? Why would those contribute to this bias? Even with single-end reads you expect coverage to be spread across the whole transcript. Degradation — exactly. So degradation, or poly(A) selection. When you see this kind of bias in the coverage you really need to stop and ask these questions: what happened to cause this? Was it the library construction — maybe the kit we chose wasn't appropriate, maybe there was a problem during library prep? If it's degradation, let's go back and look at the RIN numbers — maybe they had bad RINs and we sequenced them anyway. You try to figure out which step of the process caused this. Ideally you want to fix it by going back and rerunning things — redoing the libraries or resequencing — but that can be very costly, and sometimes you're just given the data and can't do anything about it.
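As an illustration of that 100-bin normalization (a toy sketch, not RSeQC itself; it assumes you already have a per-base coverage array for each transcript, oriented 5' to 3'):

```python
import numpy as np

def gene_body_profile(per_base_coverage, n_bins=100):
    """Average, across transcripts, the coverage resampled into n_bins bins."""
    profiles = []
    for cov in per_base_coverage.values():        # dict: transcript -> coverage array
        cov = np.asarray(cov, dtype=float)
        if len(cov) < n_bins or cov.sum() == 0:
            continue                               # skip very short or uncovered transcripts
        bins = np.array_split(cov, n_bins)         # ~equal-sized bins along the transcript
        prof = np.array([b.mean() for b in bins])
        profiles.append(prof / prof.max())          # normalize so transcripts are comparable
    return np.mean(profiles, axis=0)

# profile = gene_body_profile(per_base_coverage)
# a flat profile suggests even 5'->3' coverage; a rise toward the 3' end
# suggests degradation or strong poly(A) selection bias
```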
It's important to detect it, though, because you can make some downstream adjustments. One example: you can identify the samples that show the bias and then flag them downstream when you're building the model for differential expression — add a covariate saying these samples belong to this group — so that anything you find differentially expressed is adjusted for this bias. As long as you know about it, you can adjust for it downstream. Another thing I look at is the nucleotide content within the read. Random primers are used in the step where we take the RNA and reverse-transcribe it to cDNA. They're called random and they're supposed to be random, but what often happens is that they introduce a bias: you'd expect each of A, C, G, and T to be around 25% at every position, since it's supposed to be a random process, but some bases end up being selected at a higher rate than others over the first few bases of the read. In this graph we're looking at a fairly short read, only 35 bases long, and we notice that over roughly the first 10 bases there's a fluctuation in the nucleotide frequencies, which then stabilizes. So one thing you can do is plot this and see whether it actually happens in your Illumina data, and if it does, one way to deal with it is to trim the first 10 bases and use the rest of the read — the remaining 90 bases of a 100-base read, say — to perform the alignment. You can also keep the full reads and look at the alignment rates, then trim and look at the alignment rates again, and see whether trimming actually affects your alignments; if it does, it's highly recommended that you trim that portion. You can also look at the distribution of quality at each position in the read. Here we're looking at the same short 35-base read, and for each base we're looking at the Phred quality score — the quality score reported alongside the bases and carried into the BAM file. What is a Phred score? It's simply -10 times the log10 of the probability that the base call is incorrect. A Phred score of 30, for example, means there's a one-in-a-thousand chance that the base was not called correctly. The higher the Phred score the better, because the lower the chance it's wrong; anything above 30 is generally acceptable. You do see the quality decline toward the end of the read, and that's expected, but as long as most of the bases in your read are above Q30, that's acceptable. If they're not, you can trim — get rid of the last few bases of the read that fall below Q30. A small worked example of the Phred score and that kind of 3' trim follows this paragraph.
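A small worked example (plain Python, hypothetical quality string) of the Phred relationship Q = -10·log10(P) and a naive Q30 trim from the 3' end:

```python
import math

def error_prob(q):
    """Probability of a wrong base call for Phred quality q."""
    return 10 ** (-q / 10.0)          # Q30 -> 0.001, i.e. 1 in 1000 chance of a wrong call

def quals_from_string(qual_string):
    """FASTQ/BAM qualities are ASCII characters with an offset of 33."""
    return [ord(c) - 33 for c in qual_string]

def trim_3prime(seq, quals, min_q=30):
    """Drop bases from the 3' end until the last remaining base is >= min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

quals = quals_from_string("IIIIIIIIIIIIIIIIIIIIIIIIIIIIII#####")   # a 35-base read
seq = "A" * 35
print(error_prob(30))                    # 0.001
print(len(trim_3prime(seq, quals)[0]))   # 30 bases survive the Q30 trim
```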
PCR duplication — I think dealing with PCR duplicates in RNA-seq versus DNA-seq was briefly mentioned earlier. One way of assessing whether you should collapse your reads in RNA-seq is to look at a plot like this one: on the x-axis is the level of duplication — the number of occurrences of a read — and on the y-axis is how many reads have that level of duplication. Here, reads are compared on their sequence and coordinates: if they have the exact same sequence and the exact same start and end, we call them duplicates (the short sketch after this paragraph groups reads in exactly that way). Ideally you want that curve to be as low as possible, because you want to minimize the number of reads with high duplication levels; a very bad sample would have a very high curve. Question from the audience: what's the trade-off between removing the duplicates and estimating the expression of a transcript from the number of reads? You're right — if you collapse the reads, you're affecting the dynamic range of the expression, because expression is essentially counting the number of reads that overlap a region. But for reads with the exact same start and exact same end, we're not sure whether they're true expression or PCR artifacts. That's the challenge: you have to be confident they're PCR duplicates before you collapse them, because if they're actually true reads coming from two different fragments and you collapse them, you're artificially reducing the expression. So make sure they're coming from PCR artifacts before collapsing; if you're not sure, just leave the reads and don't collapse them, because collapsing will affect your expression estimates. And how do we decide? Well, this comparison — same start, same end, same sequence — is meant to help with that. With DNA, when you look at coverage you expect the reads to overlap each other but not to all pile up with the exact same start and end; if they do, that's a sign of PCR duplication. What you want is some staggering, some overlap between the reads. The challenge is that this logic works for DNA because DNA fragmentation has no fixed start points, whereas transcription in RNA does have fixed start points — the transcription start sites — so you sometimes do expect some piling up. The rate of seeing identical start/end reads in RNA is higher than in DNA, and that's why it's more of a challenge: you look at a pile-up and have to ask, is this transcription or is this PCR? You could, I suppose, look at the transcription start sites and check whether the duplication is happening around those regions. You just have to do a lot more investigation with RNA than with DNA. With DNA, when you see this, you simply collapse the reads; but you're right that if it's actually true expression and you collapse it, the expression range will be distorted, and that's not good.
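A hedged sketch of that duplicate definition — grouping reads by identical chromosome, start, end, and sequence (it assumes pysam and the same hypothetical, coordinate-sorted, indexed sample.sorted.bam):

```python
from collections import Counter
import pysam

dup_groups = Counter()
with pysam.AlignmentFile("sample.sorted.bam", "rb") as bam:
    for read in bam.fetch():
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue
        # "duplicate" here = same chromosome, same start, same end, same sequence
        key = (read.reference_name, read.reference_start,
               read.reference_end, read.query_sequence)
        dup_groups[key] += 1

# distribution of duplication levels: how many distinct reads occur once, twice, ...
level_counts = Counter(dup_groups.values())
for level in sorted(level_counts):
    print(level, level_counts[level])
```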
Another thing you can check is sequencing depth. I think this was also touched on earlier, when one of the questions was how many lanes of sequencing we should do for a specific project, or how many samples we can pool in one lane — how deep can we go? Again, it really depends on the project and the question you're trying to answer: are you calling fusions, calling variants, measuring expression? One way to approach it is to run one full lane of data for a single sample, map it, and then do a saturation analysis on the BAM file: subsample the reads within that BAM. Say you had 500 million reads in the BAM — you take a random 50 million, then 100, 200, 300, 400, all 500 million, and at each point you look at the number of splice junctions you can detect, both novel and known splice junctions (a rough sketch of this subsampling idea is shown after this paragraph). What you're looking for is the point where you start to saturate: as you add reads, are you actually adding more information, or has it plateaued? You get to 300 million, you add more reads, and the number of splice junctions you detect stops increasing. That's what we're trying to visualize here: the known junctions are in red, and you can see they saturate a lot earlier than the novel junctions — partly because a lot of false-positive novel junctions get generated by these tools. So it's a balance of looking at the known and the novel: you look at both and ask, at what point do I stop adding information? That's one way of looking at it, using splice junctions. If you're interested in other things — gene families, for example — you can look at the expression of gene families and ask, as I add reads, how does the expression change? Am I picking up new members of the families, am I seeing new things as I add reads? It's a small, inexpensive experiment you can do on just one sample, and once you decide that, say, 100 million reads sounds like enough for your project, you can apply that to the rest of the samples — as long as the sample you picked is a good representation of the rest of your cohort, it gives you an idea of how deep to go. The other thing you can look at is the base distribution: of all the bases you've aligned, what is their composition? Are they coding bases or non-coding bases? You can make pie charts to help with that, and it can tell you whether your library kit worked. For example, with poly(A) selection you expect to see a higher fraction of coding bases; with something like a whole-transcriptome protocol you expect fewer coding bases and more non-coding bases. What do I mean by coding bases? Essentially exonic bases, versus intronic bases, ribosomal bases, UTR bases, and so on. This also helps with detecting contamination, like was mentioned earlier: if there's genomic DNA contamination, you'll see it in those fractions — you'll have fewer coding bases and the intergenic or intronic fractions will be a lot higher. Again, it depends on the library: with poly(A) you'll probably see more coding, with whole transcriptome a bit less, because you're getting more ribosomal material. But as long as you see consistency across your samples and the fractions match the kit you actually used, you should be good. Why look at bases rather than reads? Mostly because it gives a finer level of granularity.
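A rough sketch of that saturation idea (assuming pysam and the same hypothetical sample.sorted.bam; junctions are taken from the N operations in the CIGAR):

```python
import random
import pysam

def junctions_at_fraction(bam_path, fraction, seed=42):
    """Count distinct splice junctions seen after keeping each read with probability `fraction`."""
    random.seed(seed)                 # same seed so deeper fractions include the shallower ones
    junctions = set()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch():
            if read.is_unmapped or random.random() > fraction:
                continue
            pos = read.reference_start
            for op, length in read.cigartuples or []:
                if op == 3:                               # N: skipped region = splice junction
                    junctions.add((read.reference_name, pos, pos + length))
                if op in (0, 2, 3, 7, 8):                 # operations that consume the reference
                    pos += length
    return len(junctions)

for frac in (0.1, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(frac, junctions_at_fraction("sample.sorted.bam", frac))
# the curve flattening out suggests extra reads are no longer adding new junctions
```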
The other thing you can look at is the insert size — essentially the distance between the two reads of a pair. There are different definitions, and they're called different things: the fragment size is the distance including the actual reads and the adapters; the insert size is the distance from the beginning of read one to the end of read two; and the inner mate distance is the distance between the end of read one and the beginning of read two. That terminology was used in TopHat — I don't know whether HISAT2 still uses exactly the same wording — but keep in mind that there are different ways of measuring the distance: you can include or exclude the reads, include or exclude the adapters. So when you look at these numbers, make sure you know which definition is being used, because the reads are fairly long — at 100 bases each, that's a 200-base difference depending on whether they're included in the calculation. Here's an example of an insert-size plot (a minimal sketch of how to pull this distribution from a BAM yourself comes after this paragraph). What you expect to see is a single, roughly normal distribution; if you end up with a bimodal distribution, you probably selected more than one fragment size. Ideally you want one normal distribution of fragment sizes. Do you notice anything else wrong with this plot, besides it being bimodal — something about the sizes themselves? Yes: how can you have a negative insert size? Here's how that happens. This is the fragment, and you sequence these two reads from either end; the gap between them is what's being plotted. If it's negative, the reads are running into each other. You picked a fragment size that's too small: say your fragment is 100 bases but you decided to do 100-base reads — if you sequence 100 bases from this end and 100 bases from that end, the reads actually overlap, and you're sequencing the same part of the fragment twice, once from each end. Ideally you want your fragment size big enough that you can sequence from both ends and still have some space in between; the shorter the fragment, the more negative values you'll get as the reads start overlapping, and that's not desirable. So if reads are 100 bases, you might pick a fragment size of around 300 so there are at least 100 bases between the reads. It's a distribution — not every fragment will be exactly the same — but you want the fragment size large enough that you don't run into this issue. And you mentioned the bimodal case: that usually means the size selection didn't work — you picked multiple sizes instead of one. You know how you run the library on a gel and cut out a specific band? Maybe multiple bands were picked. The reason you don't want overlapping reads is that you're getting two reads from the same part of the same fragment. If you're doing variant calling, for example, you want reads that come from different fragments — that provides better evidence for a mutation. Two reads from the same fragment are not independent, and a lot of variant-calling tools assume the reads are independent, so you introduce a bias when you're doing mutation calling. I don't think it matters much for expression — it's mainly variant calling that gets affected by this.
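As a minimal sketch (pysam again, hypothetical sample.sorted.bam) of pulling the insert-size distribution from a BAM — here using the template length (TLEN), which corresponds to the read-1-start to read-2-end definition above; pairs from overly short fragments show up as values smaller than twice the read length:

```python
from collections import Counter
import pysam

sizes = Counter()
with pysam.AlignmentFile("sample.sorted.bam", "rb") as bam:
    for read in bam.fetch():
        # count each pair once, via the leftmost, properly paired read (positive TLEN)
        if read.is_proper_pair and not read.is_secondary and read.template_length > 0:
            sizes[read.template_length] += 1

for size in sorted(sizes)[:20]:
    print(size, sizes[size])
# a single roughly normal peak is what you hope for; a second peak suggests the
# size selection picked up more than one fragment length
```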
Yes — so how do you address it? That's a good question. Again, for expression I don't think it matters that much; but if you care about it and you think it's introducing a bias, maybe you can shorten the reads. You are losing information that way, so you'd want to do some tests where you trim the reads to a shorter length and see whether that changes the variant calls you made earlier. This is just a very simple slide on how you can look for variants in IGV. We're not doing IGV today, so maybe we'll save that for tomorrow, but one use of IGV is that it will detect any bases that differ from the reference base and highlight them for you. So if you already have variants called from DNA and you want to check them against the RNA, you can take the file I mentioned earlier, make a list of all the variants you have from your DNA, go to the BAM, pull those regions out of the BAM, and visualize them this way — then look through the variants you detected in the DNA and see whether they're present in the RNA or not. Ideally you'd do a proper validation, but this is another way of confirming that the variants you picked up in DNA are also present in RNA using a completely different technology. So that's the end of this session — any more questions? Tomorrow we're doing the practical session, where you will take the FASTQ files and run the alignments, and we're doing expression tomorrow as well, so you'll do the expression part too. I think at the beginning of the day we'll do the alignment first and then jump into expression.