Good afternoon. I hope you had a good lunch and that you're all fresh and ready for some RNA-seq. I'll start by talking about myself for a bit. My name is Vlad Youssef, and I'm coming from the Ontario Institute for Cancer Research. My exposure to RNA-seq comes from three areas. When I joined OICR, I worked on the production team, where I was responsible for assembling an RNA-seq pipeline that does alignment, expression estimation, differential expression, and QC metrics. We then passed the output files to other teams that would do downstream analysis, more statistics, and modeling. I did that for three years. After that I moved groups and joined a team that does more statistical work; we develop biomarkers for cancer and do a lot of modeling, prediction, and classification, so I used the RNA-seq output I used to generate before as input to those models and classifiers. I did that for four years. The third area is teaching. I've been part of CBW for quite a while now, five or six years, and I mainly teach this segment of the workshop. This segment is actually adapted from a more advanced, three-day RNA-seq workshop that CBW offers. Here we're only going to cover the basics of RNA-seq, to introduce you to what RNA-seq is and the main concepts behind it, and then we'll have some hands-on exercises. By the end of the module today, you'll be able to run some things yourself, whether it's alignment, expression estimation, or differential expression.

As I mentioned, these slides are adapted from the longer module, which I believe is offered in Toronto in May. You can check with Anna, or check the website, if you're interested in taking the other one. The section today is divided into two parts. The first part is about an hour and a half, where I'll go over the theory. The second part will be hands-on: I've picked a small data set, and we'll actually work through the exercises. The first part is itself divided into three sections. First, an introduction to what RNA sequencing is and some challenges we face in RNA-seq compared to DNA-seq. Then we'll move into alignment: what kinds of aligners there are, how they work, the output they generate, and some visualization. Finally, the third part is expression estimation and differential expression.

Here's a more detailed look at the first part. We'll cover the rationale for sequencing RNA, the challenges specific to RNA-seq, the general goals and themes of RNA-seq workflows, and some technical questions you'll be faced with when you work with RNA-seq. For the past couple of days, all you've been working on is DNA, and DNA only solves one part of the puzzle. There are many other layers of information you can use, in coordination with DNA or on their own, to answer many other questions. Here I'm showing a schema of the central dogma, starting with the DNA, which is what you've been working on, trying to identify SNVs. But there is another layer of information, which is RNA. All that information in the DNA is copied into RNA, and by quantifying the RNA we can quantify the expression of those genes.
And we can figure out how these genes interact and how they change under different conditions. The RNA is then translated into protein, which is the functional unit of every gene.

Any RNA sequencing workflow goes roughly as follows. You usually have two experimental conditions, for example tumor and normal, and you want to compare the expression levels of specific genes between the two. So you isolate the RNA, generate cDNA fragments, sequence them on whatever sequencing machine you have, for example Illumina, and then you take those sequences, map them, quantify them, and do downstream expression analysis. That's the overall workflow.

Now, what can we get from RNA that we cannot get from DNA? What's so special about RNA? RNA is widely used in functional studies. In a functional study the genome might be constant, and what's changing, and what you're trying to detect, is changes in expression level, which are not visible at the DNA level. For example, if you're comparing drug-treated versus untreated cell lines, you can do that with RNA-seq. You can also predict transcripts and transcript structure: the order of exons in a specific transcript is very hard to determine from DNA, so you need RNA sequencing to do that. There are also molecular features that are only present at the RNA level, such as alternative isoforms, fusion transcripts, and RNA editing. RNA can also be used to interpret mutations that do not have an obvious effect on the protein sequence; for mutations involved in the regulation of a transcript, RNA-seq gives you information that helps you work those out. And you can use it together with DNA-seq: you can take all the variants you detected this morning and prioritize them by looking at the expression associated with those mutations. People often do DNA-seq and RNA-seq and then overlap the data to prioritize mutations. If a gene is not expressed, a mutation in that gene might not be that interesting. If the gene is expressed but only the wild-type allele, that could indicate a loss of function. And if the mutated allele is actually expressed, that's really important, because it could point to a candidate drug target.

So there is a lot you can get out of RNA-seq. However, there are also a lot of challenges with RNA-seq that are not necessarily present in DNA-seq. Some of these challenges are at the sample level: the purity of the sample, the quantity you need in order to sequence, and the quality of the RNA you're trying to sequence. There's also the fact that transcripts consist of small exons separated by very large intronic regions, which is a challenge when you're trying to map reads back to the genome. The abundance of different transcripts also varies enormously, and RNA-seq works by random sampling, so if a small fraction of genes is very highly expressed, you want to make sure they are not consuming the majority of your reads and biasing the library toward those genes.
Also, the length of the RNA differs depending on what type of RNA you're looking at. There are very short RNAs that cannot be detected with a regular RNA-seq protocol; you have to pick a specific protocol for those. And for very long RNAs, poly-A selection, for example, can be biased with respect to short versus long transcripts. So the length of the RNA is also something to keep in mind. Finally, RNA is very fragile compared to DNA; it can degrade. We'll go over some methods for detecting RNA degradation and what you can do about it.

So we're going to walk through the process of RNA sequencing from step one, which is: you get the sample and you're asked to do the sequencing. The first thing you need to do is make sure the quality of the RNA is good. One way to do that is by looking at the RIN. Just a show of hands, I should have asked this first: how many people have actually worked with RNA-seq data? Great, wow, a lot of people. How many people have worked with RNA-seq data where a reference is available, so the human genome or similar? And how many people are working with RNA-seq data where the reference is not available? Oh, great, okay.

Getting back to assessing the quality of the RNA: you can use a metric called RIN, the RNA Integrity Number, a score that ranges from 1 to 10, 1 being very poor quality RNA and 10 being very high quality RNA. What you should know here is that about 80% of total RNA is ribosomal RNA. You can think of this technology as essentially gel electrophoresis: you run your RNA through the gel and look at its size composition, how much of it is ribosomal and how much is not. Here we're looking at trace plots of two examples: one where the RNA quality is very good, and one with a RIN of 6, which is not terrible but is considered poor compared to a RIN of 10. The two peaks represent the two subunits of the ribosomal RNA, and because ribosomal RNA makes up about 80% of the total, you see very clear peaks. As the RNA degrades, you start seeing additional peaks comparable in size to the ribosomal peaks, which shouldn't be the case.

Now, a lot of the time you have no control over what sample you get. You might get a sample with a RIN of 6 and be asked to do the sequencing. You can ask for another sample, but a lot of the time you're not going to get one. Yes, a question: what RIN score should the cutoff be? That's a very good question, and there is no single answer, because you can go ahead and sequence a RIN of 5 or 4, but the data you get back will not be good quality, so you're somewhat wasting money if you know the RIN is bad. I would say start around 7 or 8 or higher; if it's anything lower, I would request more material. But if you push for more and can't get it, then you've done your part. Yes, I think the instrument is the Bioanalyzer. And that's a good question about the axis; it could be the molecular size, maybe, I'm actually not 100% sure. The two peaks are your large ribosomal subunit and your small ribosomal subunit, so I'm guessing that's what you're seeing in the trace.
So the peaks might be shifted around a little bit, but if the quality is good you should still see tall, sharp peaks. I'm guessing the x-axis is some sort of size or length, and this would be the fraction. Is the RIN automated, reported by the instrument software? Yes. And a comment from the audience: the RIN values are, I believe, calibrated for human data; with other species the RIN can be unreliable, because you don't get two distinct peaks, you only get one, since the ribosomal subunits run at essentially the same size. So keep that in mind if you're working with non-human samples.

Once you have the RIN and you've established that the quality of your sample is good, you can move to the next step, which is preparing your RNA-seq library. There are many different protocols available to choose from, and the choice really depends on what you're planning to do afterwards. Once you isolate the RNA, as I mentioned, about 80% of it is ribosomal. You start with the total pool of RNA, which includes everything: tRNA, mRNA, rRNA. Depending on what you need, if you want to study ribosomal RNA and look at everything, you can take that total pool and go ahead, and it will contain everything. If you're only interested in the coding portion, the mRNAs, there are two different techniques: you can do a depletion, where you get rid of the ribosomal RNA, or you can do a selection, where you only pick out what you're interested in, for example poly-A selection. Either way you end up with an RNA-seq library; which one is right is very specific to what you want to do afterwards.

Another thing to keep in mind when you're constructing a library is whether you're going to make a stranded or an unstranded library. I think most RNA-seq libraries nowadays are stranded; the ability to retain strand information became common around four or five years ago. The idea is that previously you could not tell which strand the reads were coming from, so you couldn't tell which strand a gene was being transcribed from; you just got reads, like here. With a stranded protocol, the way the library is constructed, you can tell whether the gene is being expressed from one strand or the other, so that information is available to you. That's interesting because it enables strand-specific expression and many other things you couldn't do before. And when you look at IGV, if you load the gene annotation, it tells you which strand each gene is transcribed from, and you can match that against the reads you get to make sure they come from the same direction as the annotation.

Yes, a question. Think of scenarios where, for example, you have two overlapping genes: one gene is transcribed on one strand, and another gene on the other strand is transcribed in the other direction. This doesn't happen that often in the human genome, it's a small fraction of genes that overlap, but previously you would get a mix of both.
So you couldn't really tell which gene the reads were actually coming from. Now, with strand information, you can separate the two and tell where each read comes from.

Another thing to think about when you're designing an RNA-seq experiment is replicates, and again, it's one of those things where you will probably push to get replicates and chances are you won't get them. Most of the experiments and studies I've worked with don't have replicates, but I do want to emphasize how important they are. When I talk about replicates, there are technical replicates, which means taking the same sample and running it on multiple lanes of the flow cell, or on multiple days, to adjust for those biases or artifacts. The other type is biological replicates: if you're taking a sample from a tumor, maybe you take another sample from an adjacent region and use that as a replicate. That's very important, because doing the statistical analysis on just one sample per condition is not really a good representation of whatever you're trying to analyze. Once you have replicates, you can simply look at the correlation between them; if they're highly correlated, great, and if they're not, you can go back and figure out why, and whether other biological factors are contributing to the differences.

Unlike DNA-seq, there are many more things you can do with RNA-seq data once you've prepared the libraries and performed the sequencing: gene expression, differential expression, alternative expression, transcript discovery and annotation, allele-specific expression, mutation discovery, fusion detection, RNA editing. In addition, you can combine it with other data sets as well.

The general workflow goes like this. Once you prepare the library and sequence it, you get the reads. You take the raw reads and usually use an aligner; we'll talk about the different types of aligners. You align either to the transcriptome or to the genome, and then you quantify those reads using another tool that we'll also talk about. Once you have those quantifications, you normalize them somehow, which we'll also cover, and then you pass those numbers, at the gene, exon, or transcript level, to downstream analysis for visualization, heatmaps, and so on.

Now we're going to cover some of the most common questions that you'll probably ask yourself, or that other people will ask you, when you deal with RNA-seq. One of the first questions I get asked is: should I remove duplicates in RNA-seq? It's an interesting question. The concept of PCR duplicates was briefly mentioned this morning, so I'll assume you know what they are. With DNA sequencing, when you fragment the DNA and sequence it, you expect the fragments to overlap, and you don't expect reads to pile up with the exact same start and end. So one way you can handle duplicates in DNA-seq is to collapse them: any reads that have the exact same start position for the first read and the exact same end for the second read get collapsed, and you keep only one read out of that pile.
However, with RNA-seq things are a bit different, because in DNA-seq the starting point of each fragment is random, whereas in RNA-seq the starting points are not random; you have transcription start sites. So it is expected that there will be positions in the transcriptome where reads start and pile up. Keep that in mind when you're thinking about removing or collapsing duplicates. It's not something a lot of people do; most people leave them, because by collapsing you can actually change the expression profile of your RNA-seq. If you are planning to do it, we'll go over some QC metrics you can use to assess whether you should actually collapse your data.

The other question is: how much sequencing depth is needed for RNA-seq? That's a very vague question, because it depends on so many different things. It depends on what you're doing: are you doing expression estimation or mutation calling? Each of these requires a different depth or a different number of lanes. It also depends on what kind of library you picked. As we mentioned, there is total RNA and there is mRNA; did you do poly-A selection, or did you do ribosomal depletion? If you've done total RNA and you're only interested in coding genes, you'll probably need more depth than with a poly-A-selected library. It also depends on read length: usually you do 2x100, but now you can go up to 150. And it depends on whether you're doing paired-end, which most experiments nowadays are. So there are many different factors.

One way to deal with this is to look for a pilot experiment: do your research, look at what published papers have done, especially ones working on something similar to what you're working on, figure out from their methods how many lanes of sequencing they did per sample, and do the same. One lane of sequencing nowadays generates, I believe, 600 to 700 million reads. Usually that's sufficient for anything you want to do, from mutation calling to expression estimation to fusion detection. But it could also be too much. If you are on a tight budget, you can multiplex: pool three or four samples per lane, so you're down to maybe 150 to 200 million reads per sample. Again, I would look at other publications, see what they have done, and copy that. Another way is to run a test sample: run one sample on a full lane, analyze it, and see if you have enough information. If it's not enough, you can go back and scale up; if it is, you sequence that number of reads for everything else in your pool. And there is a third method, which we'll cover in the QC section, that you can use to estimate how much depth you need.

The other question is: what mapping strategy should I use for my RNA-seq? That's one of the reasons I asked whether you're dealing with samples that have a reference genome. If you don't have a reference genome, you can do de novo assembly; Trinity, for example, would be a good tool for that, and you don't need a reference. If you do have a reference, then there are two options.
You can either align to the whole genome or align to the transcriptome. Today we're going to focus on aligning to the whole genome: taking RNA-seq data and aligning it to the whole genome while taking the intronic gaps into account, which I mentioned earlier. We're not doing transcript assembly here; that's outside the scope of this workshop, but it's something you can look into. All right, any questions about RNA-seq in general? Yes. We'll talk about that in part two: how it actually works when you take RNA-seq data, which is spliced, and map it back to a whole genome that contains the introns.

Another question: for the cases you mentioned, for example estimating transcript levels versus detecting fusions, are there rules of thumb, like a lower bound on the number of reads you need? Again, it depends on many factors, like the quality of your RNA and what you're doing. I'll show you a method in the QC section that you can use computationally to estimate approximately how many reads you'll need. We'll cover that in a bit.

All right, so let's move to RNA-seq alignment. Here we're going to talk about some of the challenges specific to alignment, and then the alignment strategy. We have picked HISAT2; previously in this workshop we used TopHat, but a lot of people have switched over the past few years to HISAT, which is a better-performing aligner and a lot faster. We'll talk about that. Then we'll talk about the output of the alignment, the BAM file, which you've already seen, and the BED format, which I don't know if you've seen before, and then some basic manipulation of RNA-seq BAM files and visualization in IGV. And then the part I think is very important, the alignment QC: we'll walk through some QC metrics that you can compute yourself, or that tools will generate for you, to help you figure out whether your alignment is proper or not.

At this point, sorry, one second, at this point the only QC you can do is run FastQC. Have you used FastQC before? Okay, so you can run FastQC and it will give you an idea of how good the sequences are, and you do that right before you run the alignment. The alignment QC will only tell you how well the sequences aligned back to your genome; if you want to know whether the sequences themselves are good quality, you run FastQC before the alignment.

Yes, a question: does this work with RNA-seq data from long reads? I don't think so; I'm not sure, actually, but I don't think it does. I think 150 is probably the longest read length you can do. The depth is not going to be the same either, so it would be different, and I don't think it can handle reads that long anyway.

Some of the computational challenges we face with RNA-seq: as sequencing technology advances and gets cheaper and cheaper, we are generating more and more reads. I have actually seen lanes of sequencing, and this was maybe two years ago, with 900 million reads per lane. That's a lot of reads to process, and it comes at a computational cost. The other challenge, which I mentioned earlier, is the introns.
The RNA-seq data we have are spliced; the reads are sequences of exons only. When you try to map them back to the genome, the challenge is splitting those reads and mapping them across the large intronic gaps in the reference. There's also the fact that with RNA-seq, it's unlikely you'll do the alignment once and be done with it forever. Every few months there's a new gene annotation or a new version of an aligner, and many tools require rerunning things from the alignment level, so chances are you're going to be doing a lot of reprocessing. Again, that comes at a computational cost.

Regarding HISAT: HISAT is not the only mapper available; there are many other aligners, and I'll leave you a link so you can take a look at them. Aligners in general can be split into three classes: de novo assemblers, for when you do not have a reference; tools that align to the transcriptome; and tools that align to the reference genome. We've mentioned that. Here is a list of some of the most popular aligners out there; they are not RNA-seq specific, and this is a timeline showing when they were published. I'll start in 2009, when TopHat was published. Every aligner that came after tried to do three things: increase the accuracy of the mapping, cut down on processing time, and cut down on memory usage.

When TopHat came out, it did not require a lot of memory; however, it took forever. One lane of sequencing, say 100 million reads, would take on the order of a thousand minutes to process, the better part of a day, and people who have worked with TopHat know what I mean. It wasn't until STAR came out in 2013, introducing the idea of a suffix array, that this changed: you take the reference and index it in a way that speeds up the search. Those thousand minutes for 100 million reads became about 23 minutes, which was a huge transformation in processing time. But it came at the cost of memory: I think around 28 GB is required to store the index for the human reference, so chances are you're not going to be able to run STAR on your local machine. That's a challenge. Since then, TopHat tried to compete and came out with a new version, TopHat2, that cut down on processing time; it matched that speed at roughly 20 minutes for 100 million reads, but it also required high memory, about 28 GB. Then the real change came when HISAT came out in 2015. They changed the way the indexing is done, and that reduced the memory requirement to about 4 GB. So for 100 million reads, you can actually run it on your own machine: it takes around 20 minutes and doesn't require as much memory as the other tools.

TopHat was one of the first tools I used, and I used it for a very long time, because it had such a big community; if you ever ran into a problem, you would find an answer, because so many people were using it. It might not necessarily be the best tool out there.
And to this day there are still people using TopHat. I feel part of it is that you don't want to change your pipeline; you're used to the one you've established, and there's the big support community out there. But I would highly recommend switching to HISAT or HISAT2, because it improves the quality of your alignments. Yes, a question. It's not that difficult to switch: TopHat used Bowtie as its backbone aligner, and HISAT2 still uses Bowtie2 underneath. But whenever you switch, you will have to run some tests to compare your previous results to your new results, and I guess that's time consuming. I don't really know why people aren't switching; the installation and instructions are straightforward and very similar. You do have to install it again, but it's not that different from TopHat2. It could also be that you'd have to reprocess everything you've done in the past, otherwise you'll have one batch of data processed with TopHat and another processed with HISAT, and maybe labs are trying to avoid that and just want to be consistent. For sure. Yes: go with the latest.

The next question is whether you should use a splice-aware or non-splice-aware aligner, and the answer is pretty easy. If you're mapping to a whole-genome reference, you have to use a splice-aware aligner. If you're mapping to a transcriptome, you don't need one. For this module, we're using a splice-aware aligner and focusing on HISAT.

So how does HISAT work? As I mentioned, they managed to cut down on processing time and memory by using two types of indexes: a global index and many local indexes. You can think of it this way: you take the reference and split it into bins. The global index covers the whole genome; for the local indexes, you take that whole genome and split it into about 48,000 overlapping bins. HISAT takes a read and looks for it in the global index just to anchor it, to find a place to start. Once the read is anchored, it switches to the local index, and that reduces the search space by a lot: instead of searching the whole genome, you're only looking within one window or bin. That really speeds things up, because you don't have to scan the whole genome. The way it worked before, TopHat would take the read, split it into two or three chunks, and look for each chunk across the whole genome, which was very time consuming; that's why it took so long. That has changed in HISAT2.

We're going to go through three examples to show how this actually works. In each, you have a read, two exons, and an intronic region between them. In the first example, the read is fully contained within one exon. In the second example, most of the read is contained within one exon, but a small chunk comes from the second exon.
In the third example, the read is split roughly equally between exon one and exon two. We'll walk through each scenario and I'll show you how HISAT actually does the mapping. For the first example, I think in the simulation the HISAT authors did, about 60% of reads fell into this scenario, where the read is fully contained within an exon. As I mentioned, the whole technique works by global search, local search, and extension. You start with the read and first align it against the global index, searching the whole genome to find a starting position to anchor. Once you find that, you align 28 bases using the local search, the local bin, and then you extend. So after the first 28 bases of the read are aligned, the rest is simply extended.

In the second scenario, you do the same thing: take the read, look for it in the whole genome, anchor it, map the first 28 bases, and then extend using the local index. You extend and extend until you hit the first mismatch, caused by the intronic region, and then you stop. When you hit that first mismatch and stop, you look within the local space and try to map the remaining eight bases there. As I said, in TopHat you would have had to search the whole genome for that piece, which was very time consuming; here you're only looking within the local space, and it's defined for you: the FM index is bounded between this position and this position, so the search is restricted to that window. It does this per read, so if you have paired reads, it aligns read one and read two independently and then puts the alignment information for the pair together.

Yes, a question: does it always just look in the next exon, or in all downstream exons? I think it depends on which exons fall within the local bin, the local FM index, that you're looking at.

Finally, the last scenario, where the read is split between two exons: same thing. We anchor the beginning of the read, map the first 28 bases, extend until we hit the first mismatch, look for the rest of the read in the local index, map the first eight bases of the second part, and then extend until the end. And it does that for every single read until all the reads are processed.

Yes, that's a good question; I think it depends on how you run it. For fusions, and we're not going to talk about fusions right now, we're only talking about mapping, you run the fusion mode, and I think you allow a larger distance. Here there is a maximum distance, and anything beyond it is discarded; the parameters are different when you're running the fusion version, and you relax them to take those reads into account.

Also, with DNA sequencing you have to decide what to do with multi-mapped reads: reads that map to multiple locations in the genome with the same quality. What do you do with those? With DNA-seq, we usually leave them.
What you can also do is randomly pick one of those reads and use it, if all the alignments have the exact same mapping score. But with RNA-seq, if you're doing expression estimation, you don't want to keep all these multi-mapped reads; you want to get rid of them, because you don't want them to distort the dynamic range of your expression. If you're doing mutation calling, then maybe you leave them in.

Now, what is the output of HISAT2? It's a SAM or BAM file, and you've already seen a BAM file; this is what it looks like. The header contains information like the flow cell used for sequencing, the tool and the actual HISAT command that was run, your sample and library name, and so on. The body is where the alignment information is: it takes every read in your FASTQ file and appends alignment information to it. So it tells you, for example, that the first read in your FASTQ file is located on chromosome 2, mapped from this position to this position, with this mapping quality, and so on. For every read you put in, you get a record in the output with information about where it landed in the genome.

Another format you'll use a lot when dealing with RNA-seq is the BED format. A BED file is a text file that contains the chromosome, start, end, and the name of the region, which could be a gene name, for example. It's a tab-delimited file with those columns, and you use it to subset BAM files. Say you're interested in a specific region, gene, or exon and you don't want to deal with the whole BAM file, just that region. There are tools that let you do that: you give them the BAM file and the BED file, saying you just want this gene from this BAM, and they generate a new, smaller BAM file containing only the reads from the segment you specified. That's very helpful if you're on your local machine and don't want to load the whole BAM file into IGV; you just want a small subset so you can look at specific mutations, or overlap mutations with expression for a specific gene. Here's a list of tools you can use to work with BAM and BED files: samtools, BEDTools, Picard, bamtools; there are plenty out there.

Sorting SAM and BAM files: there are two ways to sort them, by position or by read name, and it depends on what you're trying to do. If you sort by position, it's much easier and faster to pull out reads for specific coordinates. But if you're interested in keeping the pairs together, for example if you're assessing fusions, then it's better to sort by read name.

We've already gone through IGV, so the BAM file you get from RNA-seq, just like the BAM file from DNA-seq, can be loaded into IGV, and this is what you get.
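Before we look at the IGV view, here is a minimal sketch of the subsetting idea just described, assuming pysam is installed and the BAM file is coordinate-sorted and indexed; the file names and coordinates are made up for illustration, not from the exercise data.

```python
# Minimal sketch: pull the reads overlapping one region of interest out of a BAM
# into a smaller BAM, so only that slice needs to be loaded into IGV.
# Assumes pysam; file names and coordinates are hypothetical.
import pysam

region = ("chr17", 7668402, 7687550)   # hypothetical gene region, BED-style 0-based start

with pysam.AlignmentFile("sample.bam", "rb") as full_bam:
    with pysam.AlignmentFile("subset.bam", "wb", template=full_bam) as subset:
        for read in full_bam.fetch(region[0], region[1], region[2]):
            subset.write(read)

pysam.index("subset.bam")   # index the subset so IGV can load it
```

The same subsetting can be done from the command line with samtools, for example `samtools view -b -L regions.bed`, which takes a BED file of regions and writes the matching reads to a new BAM.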
In the IGV view you get the gene track, and what's nice here is that unlike DNA-seq, where you have coverage everywhere, with RNA-seq you only get coverage over the exonic regions. So for the gene right here, these segments represent exons, and you'll see reads piling up only around those exonic regions. You should not see many reads over the introns, unless your gene is not annotated properly or you've discovered a new transcript that isn't in the gene annotation; then you might see some reads there. But if you do, I would go back and check that the mapping worked well and that you're using the right reference, GRCh38 versus GRCh37.

Yes, a question: what do you expect to see from unprocessed RNA? Well, you do also have DNA contamination, so that's possible. I haven't measured it myself, but you should be able to measure it with the QC metrics; they look at the bases that are coding versus non-coding and will give you an idea of the DNA contamination as well. What fraction you see will depend on the library you use.

Another question, about the pairs: HISAT does the mapping for each read of the pair separately, but then it consolidates the information for the pair. It does. And you'll see in expression estimation that it's based on the pair, because we use the pairs to quantify fragments, so the pair information is very important. Yes, we need it. I don't remember the last time I saw a single-end RNA-seq experiment, but you could do it; when you run the aligner you'll have to specify it so HISAT doesn't expect paired-end data. It will work with single-end, but I don't know if it will be as accurate, and, as we'll discuss in the expression section, the unit you get instead of FPKM is RPKM: you're estimating fragment abundance from individual reads rather than from read pairs. So you can do it, but it's not as accurate.

What is a fragment? A fragment is the cDNA fragment. You take your RNA, shear it into pieces, take a piece of RNA and convert it to cDNA; that's a fragment, and that's what you're trying to sequence: you read 100 bases from one end and 100 bases from the other end. That's what paired-end means. Multiple fragments per transcript, yes, correct.

So now that we have the BAM files, the next step is QC assessment. That's a huge part of working in production, and even if you're working on your own data set, you need to figure out whether the alignments were successful and whether the RNA-seq sample is of good quality, so that you can move forward to expression estimation. It's a required step before that. Here is a list of some of the metrics I'll talk about today; there are many more things you can look at, but these are the basics you can start with: 3' and 5' bias, nucleotide content, base quality, PCR artifacts, sequencing depth, base distribution, and insert size distribution. By the way, all of these plots are generated by a tool called RSeQC; this is the link to the tool. It's pretty straightforward: all you need to provide is a gene annotation file and your BAM file.
It will generate all these plots for you, and you can take them and put them in a report of your own, use another tool to make them, or even compute them yourself once I've told you what each one represents.

Here, what we're looking at is essentially the coverage distribution: how are the reads distributed across your transcripts? Are they evenly distributed or not? The way it works is that it takes all the transcripts and divides each into 100 bins, and for each bin it calculates the coverage, the fraction of reads that cover that specific bin. When you plot that, you'll notice there are two groups of samples here. There's one group where coverage is fairly evenly distributed across the transcript, and another group where there is very high coverage at the 3' end and very low coverage at the 5' end. Any idea what might be causing this? Reverse transcription primed from oligo-dT? Low quality? Any other ideas? One possibility is poly-A selection: a poly-A-selected library is biased toward the 3' end, and for long transcripts it will not capture the 5' end of the RNA. The other possibility is that the RNA is degraded, and degradation proceeds from the 5' end. Those are just two possibilities that should make you go back and ask why this is happening. The first thing I would check is the library: what protocol was used, was it poly-A? Then I'd go back and check the RIN: was it low, was the RNA degraded? You try to make those correlations and investigate what might have caused it.

This is a crucial step, because if you're getting a mix of distributions like this, it may indicate that two different library types or protocols were used, and you have to make a note of that, because it will affect the downstream expression: these samples will not have the same expression profile as those samples, and if you take the average it will be much lower for one group. The tools that do expression estimation have no way to tell; they don't do QC for you, they assume that whatever you pass them is good quality. One thing you can do at this point is use computational tools that adjust for these biases, although you have to be very careful with them. The other thing you can do, besides going back and requesting reprocessing, is flag these samples, so that when you do the downstream work, whether it's expression or differential expression, you don't let this artifact pop up as differential expression. If you mark them, we'll talk about ways to adjust for it: when you fit your model, whatever prediction you're doing, you can add that flag as a covariate and adjust for the differences that way.

The other thing you can look at is the nucleotide content. Assuming this is a 35-base read, a very short read, we look at each base position in the read and at the fraction of A, C, G, and T at that position.
What you expect to see is roughly equal proportions, about 25% each of A, C, G, and T; there shouldn't be a preference for one base over another. However, for some Illumina libraries the random priming is not so random, it's somewhat selective, so you end up with these odd fluctuations in the A/C/G/T content over the first 10 or so bases of the read. Any read you look at, if you check the composition, will have a skewed base distribution over those first bases. So check for that, and if it's present, it can reduce the quality of your mapping, because these are artifacts. One way to deal with it is to trim, say, the first 10 bases of your reads, run one version with the full reads and one with the trimmed reads, and compare the mapping quality. If it improves, then trimming is probably a good idea.

Similar to the previous plot, there's another plot where, for each base position in the read, you look at the quality. Quality here is reported as a Phred score. What is a Phred score? It's -10 times log10 of P, where P is the probability that the base call is wrong. A Phred score of 30 means there's a one-in-a-thousand chance the base call is wrong. So you want a higher number: the higher the score, the less likely the base call is wrong. One thing you can do is set a threshold; you can say you only want bases with a Phred score of 30 or better, and anything lower gets trimmed or discarded. Look at the distribution and set a cutoff. Here everything is quite high, above 50, which is fantastic, so you don't have to do anything. But if it drops below your cutoff, you'll need to trim your reads, and there are tools that let you trim either a fixed number of bases or based on per-base quality. Sorry, the x-axis: that's each base position within the read. If your read is 35 bases long, this is base one, base two, base three, and so on, and the y-axis is the base quality.

Some of these metrics are very similar to DNA-seq. PCR duplication: we mentioned that assessing PCR artifacts is harder with RNA-seq than DNA-seq because the start points are not random; you have transcription start sites. One way to do it is with RSeQC, which has a function that compares the sequences of your paired reads. It looks at the positions of the reads, the start of the first read and the end of the second read, and it also looks at the actual sequence, and it checks how many read pairs have the exact same start, the exact same end, and an identical sequence. Those are likely to be artifacts, duplicates. I think someone asked earlier whether you can have two or three duplicates, and that's exactly what's being checked here: the x-axis is the level of duplication and the y-axis is how many reads have that level of duplication. What you want is a curve that falls off sharply, because you want to minimize the number of reads with a high duplication level. You don't want anything that looks like the other curve, because that would mean that at high duplication levels you still have a very large number of reads. So look at this graph, and if it's closer to what you're seeing right now, that's probably fine.
If it's closer to the other curve, then maybe collapsing is a good idea.

Sequencing depth. We mentioned that when you're asked how many lanes you need, you can check publications, or you can run one sample and figure out whether it's enough. But how will you know what's enough? One way to assess it visually is to run one sample and then take subsets of it. Take the BAM file and subsample 20% of it, 40%, 50%, 60%, up to 100%, and in each subset count the splice junctions, both the known splice junctions and the novel ones, and plot them. Here we have the novel junctions, the known junctions, and all junctions. What we're interested in is the point where the curve saturates, where adding more reads doesn't add any more information at the splice-junction level. At that point you can stop and say: all these extra reads aren't adding anything new to my splice junction library. You'll notice the novel junctions keep increasing at a higher rate while the known junctions saturate earlier; that's because a lot of false-positive splice junctions are generated by whatever tool you use, so the novel curve is not going to saturate at the same rate as the known one. I would go with the known junctions, or maybe a combination of both, and figure out the point at which it saturates. For the known junctions, for example, maybe 60 or 70% is a good point to stop; then you ask how many reads 60% of that BAM file is, and you can go back and say that for the rest of your samples you'll sequence that many reads, and multiplex that many samples per lane.

RSeQC does this for splice junctions, but you can apply the same idea to the number of newly detected expressed genes: at each subset, ask how many new genes you're seeing at the expression level, whether that changes as you increase the subset, and whether it saturates. You can look at many different things this way and make saturation plots to assess whether you've sequenced deeply enough.

Another check is the base distribution. Here I'm looking at the coding bases, non-coding bases, UTR bases, and intronic and intergenic bases; the intronic and intergenic fractions give you an idea of the DNA contamination. The composition will differ depending on what kind of library you picked: if you picked a whole-transcriptome library that still contains ribosomal RNA and tRNA, the coding fraction will be a lot lower than for a poly-A library, which should select only mRNA. But this is another sanity check: if you've done poly-A selection but you see very low coding and very high intronic content, that's an indication something went wrong at the library level, and you report back to the team and try to figure out what happened during library construction. But wouldn't you expect that not to happen with poly-A? Yes, you wouldn't expect it, but it happens; there is contamination in any protocol. Yes, that's also possible. I'm not sure; I think it really depends on the protocol you use. I'm not sure exactly what fraction you're expected to see for the non-spliced reads, but I can check for you.
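To make the saturation idea from a moment ago concrete, here is a rough sketch of the subsampling logic, assuming pysam; the BAM file name is hypothetical, and in practice RSeQC's junction saturation module does this for you, including the known-versus-novel classification, which this sketch ignores.

```python
# Rough sketch of junction saturation: subsample the alignments at increasing
# fractions and count how many distinct splice junctions each subsample contains.
# Junctions are taken from 'N' (skipped region) operations in the CIGAR.
# Assumes pysam; the BAM name is hypothetical.
import random
import pysam

def junctions_in_read(read):
    """Yield (chrom, start, end) for each N (intron) operation in the CIGAR."""
    pos = read.reference_start
    for op, length in read.cigartuples or []:
        if op in (0, 2, 3, 7, 8):      # M, D, N, =, X consume the reference
            if op == 3:                # N = skipped region, i.e. a splice junction
                yield (read.reference_name, pos, pos + length)
            pos += length

random.seed(42)
fractions = [0.2, 0.4, 0.6, 0.8, 1.0]
seen = {f: set() for f in fractions}

with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or read.cigartuples is None:
            continue
        r = random.random()
        for f in fractions:
            if r <= f:                 # this read is part of the f-fraction subsample
                seen[f].update(junctions_in_read(read))

for f in fractions:
    print(f"{int(f * 100)}% of reads -> {len(seen[f])} distinct junctions")
```

If the junction count stops growing well before 100%, you have probably sequenced deeply enough for junction discovery; if it is still climbing steeply, more depth would still add information.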
Going back to the fragment question: what do I mean by a fragment? Here we're looking at a cDNA fragment; that's what we're trying to quantify, and the reads are read one and read two. The fragment length can be much longer than the sum of the two reads: the reads might each be 100 bases, but the fragment could be 300 bases. You select the fragment size at the library construction step, so if you pick a fragment size of 300, you'll have a gap of about 100 bases between the two reads. If you pick anything smaller, read one and read two get closer to each other, and at some point, and I've seen this a lot, they overlap. The fragment size is really a distribution, not a single value: if the mean is 300, you'll still get some fragments at 100 bases and some at 400 or 500. The ones at 100 bases are very short, so the two reads will overlap, and that's not good, especially if you're doing mutation calling, because you'd be counting two reads as if they came from two different fragments when they actually come from one. You're biasing toward that fragment and counting it as two records when it should only be one. Sorry, a question about the second read coming from the same fragment: right, what you really want is supporting reads coming from different fragments, because that increases your confidence; there could be a problem with that one fragment, so you want the evidence to come from different ones. Overlapping reads are becoming more common as read lengths increase: if you're doing 2x150 you have to increase the fragment size a bit, to 400, 500, 600, depending on what you're trying to achieve. If you're doing 2x100, try to stay around 300 or above; don't go lower.

I think this slide was intended to show that, just like with DNA-seq, you can actually use RNA-seq to call variants. You can load the BAM into IGV, and if you have a gene you're interested in, or a specific position you've already pulled out of your DNA-seq data and want to check, not really validate, but see whether it's present at the RNA level, you can go to that position in IGV and look at the pile-up to see whether the change is there. You can also run variant callers. I'm not sure whether there are variant callers specific to RNA-seq data, or whether you just run DNA-seq callers on the RNA-seq BAM, but you have to be very careful, because the error profile is very different in RNA-seq versus DNA-seq, especially around splice junctions. You will notice a lot of false-positive SNVs around splice junctions, because junctions that were not aligned properly produce mismatches there. If you plot the error rate, you'll see peaks around the splice junctions that you have to account for or filter when you're processing. All right.
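Here is a minimal sketch of the "check the position in the RNA-seq BAM" idea just described: counting the bases observed at a variant position, which is essentially what the IGV pile-up shows. It assumes pysam and an indexed, coordinate-sorted BAM; the file name and coordinates are hypothetical.

```python
# Minimal sketch of checking whether a DNA variant is supported at the RNA level,
# by counting the bases seen in the RNA-seq BAM at that position.
# Assumes pysam and an indexed BAM; file name and position are hypothetical.
from collections import Counter
import pysam

chrom, pos = "chr7", 140753336        # hypothetical variant position (0-based)

counts = Counter()
with pysam.AlignmentFile("tumor_rna.bam", "rb") as bam:
    for column in bam.pileup(chrom, pos, pos + 1, truncate=True):
        for p in column.pileups:
            if p.is_del or p.is_refskip:      # skip deletions and spliced-out positions
                continue
            base = p.alignment.query_sequence[p.query_position]
            counts[base.upper()] += 1

print(counts)   # e.g. a mix of reference and alternate bases: is the mutant allele expressed?
```

Keep in mind the caveat above about splice junctions: positions near junctions can show mismatches that are alignment artifacts rather than real variants.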
So yes, one reason you can find more variants, as I just said, is the splice alignment issue, but there is also RNA editing, which is an active field of research right now, so you'll see additional mutations from that. And note that if a variant is not present in the RNA, that's not really a validation result; you still have to perform validation of your DNA mutations. Combining the two gives you a better idea of the consequence of those mutations, not of whether they're present: if a mutation is expressed, that's great, but if it's not, that doesn't mean it isn't there at the DNA level. All right, how are we doing on time? Cool.

So let's quickly go through expression and differential expression, then take a break, and then we'll jump into the hands-on exercises and do some alignment and expression estimation. Now that we have checked the RNA quality, done the alignment, checked the BAM files, and made sure everything looks great, it's time to move on to expression estimation. Here we're going to talk about FPKM, if you're familiar with that term, versus raw counts, some differential expression methods, and then the downstream interpretation you can do with the data.

One way to assess differential expression, a very naive way, is to load the two conditions into IGV and try to judge the differences visually. For example, if I load a tumor BAM and a matched normal BAM, look at the same region, and compare the pile-up of reads between the two conditions, I might notice very high coverage over all the exons of this gene in one condition and very low coverage in the other. Right up here, I'm not sure if you can see it, IGV shows the range, the maximum number of reads within that window. So you jump to the conclusion that this gene is definitely down-regulated; you can see it, it's very clear. Do you see any problem with that? Yes? Okay. Right: by doing it this way, we're not looking at the coverage distribution, and we're not even looking at how deeply we've sequenced. Maybe here I multiplexed 10 samples in a lane, and here I ran one sample per lane. That information isn't visible here, and I'm jumping to the wrong conclusion by eyeballing it without normalizing for how deep I went and for the gene length. Here I'm looking at the same gene in two samples, but imagine comparing two different genes: of course a longer gene will accumulate more reads. So simply counting reads is not sufficient; it's not a fair comparison between a short gene and a long gene, or between the same gene in two patients sequenced to different depths.

That's where RPKM and FPKM come into the picture. RPKM stands for reads per kilobase of transcript per million mapped reads, and FPKM is the same thing but with fragments. The basic idea in RNA-seq is that the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it, and as I said, raw counts have two limitations: they are biased toward longer genes and they are biased by the total library depth.
So to correct for them, what you do for FPKM is sum all the fragments or reads in your library, divide each gene's count by that total (per million), and then divide that value by the length of the gene (per kilobase). So you're normalizing for two things: the length of the gene and how many reads you've sequenced. By doing that, it becomes a fair comparison; you can compare samples or compare genes because you've normalized for both.

Now there's another unit called TPM, and the way it's calculated is very similar, just in a different order. With TPM, the first thing you do is divide by the length of the gene, and then you sum those length-normalized values and divide each one by that sum. So you're still accounting for both the gene length and the total depth, but the order is different. And that order matters: TPM is a fairer unit for comparing a gene between samples, because it gives you the proportion of reads that cover that specific gene, and those proportions always sum to the same total in every sample. So you can use TPM to compare the same gene across different samples.

That normalized unit, FPKM or TPM, is what gets calculated for each gene, and the tool we use here is StringTie. I'm not going to get into the details of exactly how StringTie works, but what it does is generate splice graphs. You can think of them as networks of all the possible combinations of exons that could occur. It takes the coverage distribution over those regions and computes the probability that a given read fits into each hypothesized transcript structure. It assigns the read to the transcript with the highest probability, removes it from the pool, and then goes back to the pool and repeats this for each remaining read. After it comes up with all these hypothetical transcripts, there is a mode called StringTie merge that merges the transcripts generated across all the different samples. It's a way of uniting all these transcripts; StringTie then goes back to each sample and recalculates the expression for every transcript in that union, so the results are consistent across all the samples in your pool. There's also a tool called gffcompare that works with GTF files; a GTF file is just a file listing all the exons and all the known transcripts. And when you're doing the expression estimation, you can tell StringTie either to work on its own or to use annotated transcripts, say the annotated transcripts for human, as a guide or reference. It will go through those transcripts and assess the expression for each one, and if it finds anything novel, it will add it to the list.
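To make the FPKM-versus-TPM ordering concrete, here is a minimal sketch with invented counts and gene lengths; it follows the two recipes described above and is not taken from StringTie's actual implementation:

```python
import numpy as np

# Made-up counts and gene lengths: FPKM and TPM apply the same two
# corrections (gene length, library depth) in a different order, and TPM
# sums to a fixed total (1e6) in every sample.

counts = np.array([100, 400, 1000], dtype=float)   # fragments per gene
lengths_kb = np.array([2.0, 4.0, 10.0])            # gene lengths in kilobases

# FPKM: divide by total fragments (in millions) first, then by gene length (kb)
fpkm = counts / (counts.sum() / 1e6) / lengths_kb

# TPM: divide by gene length first, then rescale so the sample sums to 1e6
per_kb = counts / lengths_kb
tpm = per_kb / per_kb.sum() * 1e6

print(fpkm)        # [ 33333.3  66666.7  66666.7]
print(tpm)         # [200000.  400000.  400000.]
print(tpm.sum())   # 1000000.0 -- the same total in every sample, so TPMs
                   # behave like proportions and compare cleanly across samples
```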
But you can also run it in a mode where you don't give it any reference transcripts or GTF, and it will try to discover everything on its own; the false positive rate will probably be higher that way, but it's your choice.

Now, Ballgown is the name of the tool used for the next step. StringTie does the expression estimation; Ballgown does the differential expression. Previously, you could just use a t-test to compare the two conditions, because ideally what you're trying to do is compare the distribution of expression in condition one to the distribution in condition two; if the two distributions are very different, the gene is differentially expressed. The problem with a t-test is that you can only use one value, the expression itself, and you just check how the distributions differ. With Ballgown, they introduced a sort of linear model where the outcome, the y you're trying to predict, is the expression of each gene, and you fit two models. In one model you include the condition as a covariate, whatever condition you're trying to assess, cancer versus normal. And you can also fit as many other features as you want: if you know there was a bias at the start, you can code group one versus group two, coverage bias versus no coverage bias; you can put in sex, age, whatever you think will help distinguish the two classes. Then you compare the fit of the two models, the one with the condition and the one without it. If the model with the condition fits better, that tells you the gene is differentially expressed: adding the condition actually improves the fit, so the gene's expression differs between the conditions. Out of that you get an F statistic and a P value, and you use that P value to subset the genes that are differentially expressed.

It's also important to adjust for multiple testing. You can imagine: you're doing this per gene, and with 20,000 or 30,000 genes you're doing 20,000 tests. You can also do this at the exon level, which increases the pool of tests even further. When you perform so many tests, by chance alone you expect to see something significant. So you need to adjust for all these multiple tests and only keep the genes that remain significant after adjustment. After that, you can subset the 20,000 genes down to maybe a list of 100 genes with significant adjusted P values and maybe a good fold change, and use that subset for downstream visualization: box plots of a gene's expression across samples, comparisons across conditions, heat maps, some clustering.

Sorry, yes? Yes, but we'll talk about that. edgeR and DESeq are not using FPKMs, so the underlying distribution of the values being tested is different. This approach here uses FPKM, yes, FPKM or TPM.
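Since the multiple-testing adjustment mentioned above is central to picking the final gene list, here is a minimal sketch of the Benjamini-Hochberg procedure, one common correction method (not necessarily the exact one Ballgown applies); the p-values are invented, and in practice the R packages will do this for you:

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (false discovery rate)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)                        # sort p-values ascending
    ranked = p[order] * n / np.arange(1, n + 1)  # p * m / rank
    # enforce monotonicity from the largest rank down, then cap at 1
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(adjusted, 0, 1)
    return q

pvals = [0.0001, 0.004, 0.03, 0.2, 0.6]          # one (made-up) p-value per gene
print(benjamini_hochberg(pvals))                 # [0.0005 0.01 0.05 0.25 0.6]
# Genes whose adjusted p-value stays below your threshold (say 0.05) are the
# ones kept after correcting for the thousands of tests performed.
```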
So edgeR and DESeq are two different tools that also do differential expression, but they use raw counts. They're based on the first point we discussed earlier: before you do any normalization, you just take the raw counts. And because the underlying distributions are different, the tests you can apply there are quite different. To get the raw counts, you can use a tool called HTSeq: you pass it your aligned reads and an annotation, and it gives you a table with the raw count for each gene. Then you take these raw counts, pass them to edgeR or DESeq, and they run a different set of tests to assess how differentially expressed the genes are.

Now, which one should you use? It really depends on what you're interested in. If you just want to look at fold changes, generate heatmaps, anything visual, then I would recommend the FPKM route. If you're doing something more statistically advanced, if you're doing modeling, then it's better to go with the counts, because you have more flexibility there; you can apply tests that you cannot apply to FPKM values. And one thing you can do, if you really want to narrow things down to a list of genes to carry forward, maybe to validation, and you have a giant list of genes from whichever approach, the FPKM-based tests or edgeR and DESeq, is to draw a Venn diagram and see which genes come out significant in all of the tools you've run. Running these tools is not computationally intensive, because at that point you're dealing with very small files, just small matrices of counts, so you can run the three algorithms and look at the overlap. The overlap is not the only truth; it's likely you'll miss some genes. For example, maybe edgeR is doing better than the other two and is calling things that are unique to it and still correct. But it's one way of reducing that space to a smaller set that you can actually carry forward for validation.

I've talked about multiple test correction, and I believe the R packages around StringTie, Ballgown for example, do have an option to adjust for multiple testing, so you don't have to do it yourself. In terms of downstream interpretation, what you can do now is take those expression values and perform clustering, unsupervised machine learning, to figure out whether there are distinct groups within your samples. You can do classification, supervised machine learning, random forests if you want. You can also do pathway analysis; there are many different tools for that. The choices are unlimited once you go down that path of modeling and biomarker discovery.
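The Venn-diagram idea above boils down to a set intersection; here is a minimal sketch with invented gene names standing in for each tool's list of significant genes:

```python
# Made-up hit lists: in practice each set would come from that tool's own
# results table after filtering on adjusted p-value and fold change.

ballgown_hits = {"TP53", "EGFR", "MYC", "BRCA1"}
edger_hits    = {"TP53", "EGFR", "KRAS", "MYC"}
deseq_hits    = {"TP53", "MYC", "PTEN", "EGFR"}

consensus = ballgown_hits & edger_hits & deseq_hits
print(sorted(consensus))   # ['EGFR', 'MYC', 'TP53']

# The intersection is not "the truth" -- each tool may have unique, correct
# calls -- but it is one simple way to shrink the list before validation.
```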