I'm going to start by introducing myself. I work as a bioinformatician, and I've been at OECR for six years. For the first three years I worked in production, so we tested a lot of bioinformatic tools and assembled pipelines. We processed sequencing data, did alignments, variant calling, RNA-seq and DNA-seq, and then we passed the aligned files or variant calls to the people doing downstream analysis. Right now I'm focusing on biomarker analysis, taking that downstream data and working with it to develop biomarkers. Today I'm going to be talking to you about gene expression, specifically RNA-seq. I will give you a brief introduction to RNA sequencing and then we'll go over alignments. We'll talk about a couple of tools that we've picked for this module, we'll talk about QC and visualization, and the last part will be expression and differential expression. The first three parts are lectures. I'm hoping that in the first section of today's module we'll cover the lecture, and then in the second section I have a small RNA-seq data set: we have raw sequences, and we'll go through the alignment process and the expression estimation together. So hopefully we'll have time to do that. In this first section I'll give you an introduction to the theory and practice of RNA sequencing: the rationale behind sequencing the RNA, challenges that are specific to RNA-seq, general goals and themes for RNA-seq analysis, and workflows. We'll finish the section with common technical questions about RNA-seq analysis that people usually ask at the beginning of a project, before they start working with RNA-seq. When we're talking about gene expression, I wanted to start with the central dogma, because it helps us connect genotype with phenotype. You start with DNA sequences that are transcribed into RNA. Those RNA sequences are then modified: the non-coding regions, the introns, are spliced out, and the coding regions, the exons, are stitched together to form mature mRNA. That mature mRNA is exported to the cytoplasm, where translation happens: it is translated into functional proteins. In order to understand which genes are turned on and off in the cell at a specific time, and to understand the relationships between genes, we need to be able to quantify the RNA in the cell, for example by quantifying the mRNA fragments. RNA sequencing is one technology that enables us to do that. There are other technologies you can use, like microarrays, but currently most gene expression analysis is done using RNA sequencing, and you'll see why it's so popular over the course of this first section. The basic idea behind RNA sequencing is that you have a sample of interest, say a treatment or a condition whose effect on gene expression you're trying to study. You extract the RNA from those samples, you generate cDNA fragments, and you sequence those fragments on a sequencing platform; Illumina is a very popular one. You generate reads that are usually around 100 base pairs, and then you go through the process of taking those reads and aligning them to the genome or the transcriptome.
And then you try to quantify the reads. So why RNA? What does RNA have that DNA doesn't? A lot of functional studies use RNA because the genome is constant: the condition you're studying in the experiment affects gene expression, not the DNA. So if you're looking at drugs, for example, and you want to see the drug effect in treated versus untreated cell lines, that would be a great use for RNA-seq. Also, predicting the transcriptome from DNA is a very complicated procedure, and to this day there are genes and isoforms we're still discovering thanks to RNA sequencing. Some features are only present at the RNA level, not the DNA level, such as alternative isoforms, fusion transcripts, and RNA editing. We can also use RNA to interpret mutations that do not have an obvious effect on the protein sequence; these would be the regulatory mutations, where RNA helps us understand how mRNA isoforms are expressed and regulated. And finally, if you have DNA data and you've called variants, you can use RNA to help you prioritize those somatic mutations. For example, if a gene is mutated but not expressed, you might not be as interested in it. If only the wild-type allele of the mutation is expressed, that could indicate a loss of function. But if the mutant allele is actually expressed, that could be very interesting, because it could be a potential drug target. With RNA come a lot of challenges. Some you'll also see on other platforms, like DNA-seq, but many of these challenges are unique to RNA-seq. When we're talking about the sample, you have to worry about sample purity and quantity: do you have enough sample to perform RNA sequencing? Is the quality of the sample good enough to move forward and sequence it? Also, and this is unique to RNA, transcripts have small exons separated by very large intronic regions, so mapping is more difficult than in DNA-seq. The abundance of RNA also varies a lot between genes and between samples, so you have to keep that in mind. And because RNA-seq is based on random sampling, genes that are very highly expressed end up consuming the majority of the reads, so you can get bias in that sense. RNA also comes in different sizes: there are microRNAs that can be just a few bases long, and those need special capture methods. So you have to figure out what you want out of your RNA-seq before you start sequencing. Finally, RNA in general is more fragile than DNA; it degrades easily. That's another thing you have to be careful of, and check for, before you sequence. One metric you can use to assess the quality of your RNA sample before sequencing is the RIN. How many people have heard of RIN before? RIN is a score between 0 and 10 that characterizes sample quality: a score of 10 means the sample you're dealing with is very good quality, and 0 means it's very bad, it's degraded. So what does that number mean and where does it come from?
It's a technology by Agilent where, essentially, you run your RNA through gel electrophoresis and the RNA molecules migrate according to their size. Now, about 80% of the RNA in the cell is ribosomal RNA; there is only very little mature mRNA in the cell. So in this plot, what you expect to see is mostly ribosomal RNA, and ribosomal RNA consists of two subunits, 18S and 28S. If the sample is clean, you'll see two very clean peaks corresponding to the two ribosomal subunits. If the sample is degraded and not in good quality, you don't see those clear peaks; instead you see lots of fragments of different sizes, which indicates the RNA is degraded. So usually people look at the RIN first, before they go ahead and do the sequencing, and different people have different thresholds. I think we used to use a RIN of 7 or 8 as a cutoff, but it really depends on the samples you have. If you have samples with a low RIN, say 4 or 5, then maybe you can go back and ask for more sample of better quality; if you can't, keep in mind when you're sequencing that the quality of the sequences will not be as good as for a sample with a high RIN. Once you've established the quality of your RNA, you have to pick the capture method, the protocol you'll use for library prep. There are many different library prep techniques to choose from, and it really depends on what you're interested in: what do you want to do, what question are you trying to answer in your experiment? Once you isolate the RNA, you fragment it. The pool of RNA you're getting, as I said, is mainly ribosomal RNA; that's called total RNA. Then you have to ask yourself: do I only need mRNA? If you're only interested in the coding regions, you'll have to pick a capture method, either a selection or a depletion. With depletion, you get rid of the ribosomal content in the total RNA. With selection, you actively select for the mRNA, by doing poly-A selection or cDNA capture. So depending on what you want to do, you pick the appropriate library technique. In addition, you have to think about strandedness, whether you want strand information, because if you do, you're going to have to pick a kit that preserves the strand information in the library. Here we're looking at two examples. The first is a library that is not stranded: you will not be able to tell whether a read is coming from the leading strand or the lagging strand. With reads coming from a stranded library, you will be able to distinguish whether a gene is being transcribed on the leading strand or the lagging strand. Now, why would you need that? In humans it might not be as big an issue, but in other species it can be: you can have two genes in the transcriptome that overlap, with different degrees of overlap.
So one gene would be on the leading strand and one on the lagging strand, and if you don't look at the strand information, you might not count expression correctly, because you might be taking reads from one strand and counting them toward the gene that's being transcribed on the other strand. So that's useful: if you have any gene overlap in the gene annotation, you want the strand information. Q: So there is a reverse strand here. What is that? A: Sorry? When I was creating that slide... well, the DNA consists of two strands, right? Transcription can happen on either of these strands, the leading strand or the lagging strand. Q: Single strand, double strand, or reverse strand? A: I'm not sure what that means; maybe you're thinking of the lagging strand. It's leading or lagging, it would be one strand or the other; I'm not sure about "reverse." But that raises a good point: when you're thinking of direction, keep in mind that there are two points in sequencing where direction exists. The direction I'm talking about here is strandedness, which strand the gene is being transcribed from. There's also forward and reverse in the technical sense: when you sequence a read, you can sequence it from both ends. So these are two different things; they both involve direction, but one is biological and one is technical. Does that make sense? Q: I have a question about the RIN. You said it measures the quality of the ribosomal RNA. If we have a sample with a RIN of 10, which means the ribosomal RNA is very good, does that automatically mean the mRNA is also good? A: It's not really measuring the quality of just the ribosomal RNA; the instrument sees all the RNA, not just the ribosomal RNA. But because about 80% of the RNA in the cell is ribosomal, you expect to see a certain proportion of ribosomal versus non-ribosomal RNA, and the two peaks here are just the two subunits of the ribosomal RNA. Q: So how does it tell you anything about the quality of the mRNA? A: I don't think it specifically tells you the quality of the mRNA; you basically assume that if the ribosomal RNA is intact, the mRNA is too. You can check the paper: I think they did a lot of validation studies where they would take the RIN and then actually check the quality of the data afterwards, correlating the RIN with the quality of the mRNA, and they extrapolated from that. All right, so: we've found out the quality of the sample, we picked a capture method, we figured out whether or not we want strand information. Next you have to think about replicates. I know it's very difficult to get replicates in sequencing because replicates cost money, but it's extremely important to have replicates, both technical replicates and biological replicates.
When I say technical replicates, I mean taking the same sample and sequencing it on different lanes, different chips, or different dates. Biological replicates are different samples going through the same condition you're trying to study. Try to aim for at least three replicates; that will give you meaningful statistical summaries, so three is the minimum. Once you have these replicates, a lot of the downstream RNA-seq tools do take them into consideration when calculating expression and differential expression. And as a QC metric, before you do any of that, you can compare the read coverage between the replicates and just look at the correlation between your replicates. Some of the common analysis goals for RNA-seq are gene expression and differential expression, like I mentioned. You can also do alternative expression analysis. You can do transcript discovery. You can do allele-specific expression, especially if you have strand information; without strand information that will be difficult. You can also do mutation calling and mutation discovery on RNA-seq, as well as fusion detection and RNA editing. So there's a lot you can get out of your RNA-seq data, and you don't have to change the library prep protocol to get it: you can use the same protocol, and once you have the sequencing data, you can get all of this from it. The way it works is that you sequence the sample and get raw reads in FASTQ format; by this point in the week you've probably seen similar files in DNA-seq. You align the FASTQ files to get aligned reads. Then you take the aligned reads and do some sort of transcript assembly: you try to assemble transcripts for the different samples you have. Then you quantify the number of reads for each of the transcripts you assembled; that's expression estimation. Then you compare your samples and conditions to find genes that are differentially expressed between them. And finally you do some sort of visualization, heat maps and so on. That's pretty much the workflow that a lot of gene expression pipelines follow; the tools can differ, but the idea behind them is pretty much the same. (A minimal command sketch of this workflow appears below.) All right, now for common questions you should ask yourself when starting any RNA-seq project. The first one is: should I remove duplicates in RNA-seq? Do you know what duplicates are, or do you want me to go over that? I'll give you an example from DNA-seq that simplifies things. When you generate fragments for sequencing, you're cutting the DNA at random positions, so the reads you get from sequencing should overlap; they should not pile up. What sometimes happens is that you see pile-ups of reads that have the exact same start and end, and that is not expected. It happens because of the polymerase chain reaction used to amplify fragments during library prep: PCR is biased toward certain fragments, so it amplifies some fragments more than others.
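(Stepping back for a second: to make the overall workflow from a moment ago concrete, here is a minimal command-line sketch of those steps, using the tools covered in this module. The index prefix, annotation file, thread count, and sample names are hypothetical placeholders, so treat this as a sketch rather than a recommended setup.)

```bash
# Minimal RNA-seq expression workflow sketch (hypothetical file names).

# 1. QC the raw FASTQ reads.
fastqc sample_R1.fastq.gz sample_R2.fastq.gz

# 2. Align to the genome with a splice-aware aligner and sort the output.
hisat2 -p 4 --dta -x grch38_index \
  -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz |
  samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam

# 3. Assemble transcripts and estimate expression per sample.
stringtie sample.sorted.bam -G annotation.gtf -o sample.gtf

# 4. Differential expression then happens downstream (Ballgown, in R),
#    followed by visualization such as heat maps.
```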
Coming back to duplicates: that PCR bias is why you end up with all these fragments piled up. And that's not good because, taking DNA as the example, if you have a mutation in that fragment and PCR amplifies it, then when you try to quantify the frequency of the mutation you'll say: wow, there are a lot of reads carrying that mutation, so it's happening at high frequency. But in reality it's not biological signal, it's just a PCR artifact. To get around that, we collapse: any reads that have the exact same start and end point are collapsed down to one read. That's what's done in DNA-seq. RNA-seq is more challenging because the start points, unlike in DNA, are not that random: with RNA-seq we have transcription start sites. So before blindly removing these duplicates, make sure you assess them properly, and we'll go over ways to assess duplicates when we talk about QC. Because if you collapse, you might actually affect the dynamic range of your expression values. So we'll talk about methods to assess that in the QC section. The next question is: how much sequencing depth do I need? How many lanes should I sequence? How many samples do I pool in one lane? The answer to this is also complex, because it depends on many factors and variables. For example, it depends on what you're trying to do: are you doing gene expression, alternative expression, mutation calling? Each of these has different requirements for depth. And again, depth in RNA-seq is not like DNA-seq, where you get roughly uniform coverage across the whole genome. In DNA-seq you can say: I'm going to sequence to 50x coverage, I want each base in my genome to be covered by approximately 50 reads. The problem with RNA is that there is expression: some genes might be covered but not expressed, other genes have very high expression. So that's something you have to think about. Other factors are the tissue type of interest and the RNA preparation protocol: are you doing total RNA, or just coding mRNA, and what library construction? All of these affect how much depth you want. Also read length, and whether you're doing paired-end or single-end; most projects nowadays do paired-end, and I rarely see projects doing single-end sequencing. And finally, the computational approach and the resources you have available: more reads mean more computational resources for alignment and expression estimation. There are two approaches you can use to answer this question. First, sit down, write out all these variables and parameters, then try to find a publication with a similar set of parameters, see what coverage or how many lanes they used, and mimic that. If you can't, and what you're doing is so unique and different from what's published, then you can test it on one sample: do one lane of sequencing and then check. I'll show you in the QC section a way to do this with saturation plots: you subsample your sequenced sample and look at the coverage at every subsampling level. I'll go into detail there. So you can do it before or you can do it after.
Once you figure out how much depth you need for that one sample, you can go back and apply it to all the other samples in your cohort, and that way you're not wasting money. In general, though, one to two lanes of sequencing nowadays should enable you to do everything you want to do. One lane of sequencing, I believe, can go up to about 600 million reads now; I think I've seen 900 million, something crazy. It really depends on the protocol and the people doing the sequencing. But two lanes should be more than enough to do anything you want, including isoforms, fusions, and mutation calling. It can be expensive, though; that's why a lot of people would rather pool samples within a lane. What mapping strategy should you use for RNA-seq? There are different mapping strategies, and it really depends on what you want to do. There are two main approaches: you can align your data to a transcriptome reference, or you can align it to a genome reference. We'll talk about the available aligners and get into details there. What if you don't have a reference genome? That's not an issue here, because most of you have a reference genome to work with. But if you don't, de novo assembly is one approach, and there are a lot of tools that do that. It's outside the scope of this module, so we won't be covering de novo assembly today. All right, that was the end of part one. Any questions so far? No? Good. Okay, part two. We're going to focus on the alignment, QC, and visualization of your aligned RNA sequencing data. We'll talk about alignment strategies, as I mentioned before. We've picked a couple of tools to highlight and we'll go over how they work: Bowtie, TopHat, and HISAT2. Then an introduction to SAM and BAM formats; if you're not already familiar with alignment files, I'm pretty sure you've seen them in the past few days. We'll talk about basic manipulation of BAM files and visualization of RNA-seq using IGV. Did you use IGV this week? Perfect. And then the second section will focus on alignment QC assessment and BAM read counting to determine expression status. I've talked before about the general challenges of RNA-seq, but RNA-seq alignment itself has a lot of challenges as well. The first is computational cost. I just told you that each lane of sequencing has 500 to 600 million reads; that's a lot of reads to process and align, so you need a lot of computational resources. In addition, read length has increased a lot. The tutorial we're doing today uses an old data set where the read length was about 36 base pairs; it's really hard to find studies now with such short reads. Most read lengths nowadays are 2x100 or 2x150, and you also have PacBio, which has very long reads. The tools we're going to talk about today are mainly designed for studies doing paired-end 2x100 or 2x150, so they're not designed for reads that are extremely long or extremely short. Another challenge, which I mentioned as well, is the presence of introns. There are these large gaps you have to worry about when you're aligning your exons back, especially if you're aligning to the genome.
Because the RNA does not contain the introns, but the genome reference you're aligning to does. So you're going to have to somehow take your RNA reads, split them, and figure out where the exon-intron boundaries are during alignment. That's a big challenge for these tools. I also showed you how resourceful RNA-seq is: there are many different kinds of data you can get out of it. Because of that, it's very unlikely you'll process your data once and be done with it. The fact that there are so many different tools doing different things means you'll probably have to reprocess multiple times. Also, the gene annotation is constantly being updated: every year you get new genes and new transcripts, so chances are you'll go back and reprocess with updated assemblies. HISAT2 is one tool out of many available out there. We'll cover it today, but feel free to pick any aligner you want. Aligners are usually split into three classes: de novo assembly, alignment to the transcriptome, and alignment to the reference genome. Which alignment strategy is best? It really depends on what you have and what you're trying to do. De novo assembly, as I mentioned before, is meant for projects that don't have a reference. It might also be good for samples that are too complex: if there is a lot of polymorphism in your sample and you align it to the reference, you might miss some interesting events, so in that case you might want de novo assembly as well. You can also align to the transcriptome. The slide says short reads here, I guess because you're just aligning to the exons and you don't have to worry about exon-exon junctions, which is why it might be better suited to shorter reads. So you can align either to the transcriptome or to the whole genome. Today we're going to focus on aligning to a whole-genome reference. And as I said, each of these strategies comes with different tools and packages. In terms of aligners, there are a lot of them published and available for public use, and every aligner that comes out tries to do three things: increase the accuracy of the calls, speed up the processing time, and cut down the memory usage. Every new tool tries to beat the others by tweaking these three parameters. I just want to highlight a few tools that we have either taught in this workshop or used personally. One of the first is TopHat with Bowtie. These became really popular around 2008 or 2009. The way the Bowtie/TopHat package worked is that Bowtie was the backbone aligner; it was very good for short-read alignment. It would align the reads, and then TopHat would take the alignment information from Bowtie and use it to assemble the reads into transcripts. Then you'd pass those to an expression estimation tool, which was Cufflinks. It was good because it had very good support: a lot of people used it, a lot of papers used it, so if you ran into any problem with TopHat, you'd probably find an answer.
The only problem was that it was a bit slow: one lane of sequencing might take days to process. Then STAR came out, and STAR really cut down the processing time, but it did so at the cost of increased memory usage. To run STAR you need a lot of computational resources, a machine with very high memory, so you probably need a cluster; you can't really run STAR on your local machine. But it cut the processing time from days to minutes, which was great. The next thing that happened was HISAT. What HISAT tried to do was cut down on both processing time and memory by using a hierarchical approach to reference indexing, and it managed to do that. HISAT runs in about the same time as STAR but uses a lot less memory. To give you an example: for a lane with 100 million reads, TopHat would take a few days to process; STAR would take half an hour but require 28 gigs of RAM; HISAT handles the same 100 million reads in half an hour with only 4 gigs of RAM. So it was a huge reduction in memory usage, and that's the tool we're going to talk about today and use in the tutorial. Q: Can you comment on the accuracy of these tools? A: The paper claims the calls are very accurate, at the same level of accuracy as STAR, so STAR and HISAT are very comparable in accuracy. I haven't done the comparison myself; I'm basing that on the HISAT paper. Q: The y-axis there, is that time or memory? A: These are different programs; each of these is a different tool, and this is the time they came out. So it's just sorted by the time they were published. Next question: should you use a splice-aware or a non-splice-aware mapper? The short answer is: if you are aligning your RNA-seq to a whole-genome reference, you have to use a splice-aware aligner. If you're aligning to a transcriptome reference, you don't need one. There are a lot of splice-aware aligners out there: HISAT2, TopHat, STAR, MapSplice, and so on. If you read the HISAT paper, it compares all of these aligners in terms of performance, runtime, memory, and accuracy. So HISAT2, as I mentioned, really cut down on processing time, and it did that by using two types of indexes: global indexing and local indexing. What does that mean? Previously, with TopHat2 or TopHat, if you have a short read, say eight bases long, short reads are not unique, which means you can find them at multiple locations in the genome. Aligning short reads is time-consuming because you have to go through the whole genome, find multiple candidate spots, and then decide which of those locations is more suitable. What HISAT did is create a local index: it took the whole genome and split it into roughly 48,000 small local indexes, constructed so that when you look for a short read within one local index, there's only one copy.
You won't find multiple copies of that short read in the local index, and that's how it saves so much time during alignment. So let me walk you through three examples of how HISAT does the alignment. The three examples are right here: A, B, C. A is a read that fully spans an exon, so there are no boundaries; let's assume reads are 100 base pairs, so 100 base pairs fully inside the exon. In B, we have a read where, say, 90 bases span one exon and 10 bases span the second exon. And in C, half of the read spans one exon and the other half spans the other exon. When you align the first case, the tool goes through a series of global indexing, local indexing, and extension steps. First it looks for a seed for that read in the whole genome. Once it finds the first 28 base pairs in the genome, it anchors the read there, and after those 28 bases it just extends, one base at a time, until it reaches the end of the read. That's case one. In case two it does the same thing: it starts with the global index, finds the first 28 bases, anchors, then extends one base at a time until it hits this spot right here, the first mismatch. Hitting the first mismatch is an indication that you've reached an intron, so it stops and switches to the local index. That is the biggest difference. Before, it would switch to another global search: you'd have to look for those remaining 10 bases across the whole genome, find multiple spots, and pick the one best suited given the first part of the read. But here, because we have a local index, it just looks for the 10 bases within the local index, where there will be only one occurrence. It anchors there as well, but this time using only 8 bases: it looks at the first 8 bases, anchors, and then extends the rest of that small piece. In the third scenario, where half of the read is in one exon and half in the other, we again start the same way: global index, first 28 bases, anchor, extend, extend, extend. You get to the middle, you stop, you switch to the local index, you find the first 8 bases in the local index, you anchor, and then you extend the rest of your read. So it's just global index, anchor, extend, local index, anchor, extend; you cycle through that. Is that clear? Yeah, okay. Should you allow multi-mapping of reads? As I mentioned, short reads are not unique, so they can align to multiple locations in the whole genome; sometimes a read maps perfectly to multiple locations. In DNA, when that happens, a lot of the aligners have options to pick one location randomly, or pick the top one, because all of these alignments have the same score, so how do you decide which one to report? You can either randomly pick one or report everything; you have these options. In RNA, it again depends on what you're doing.
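(As a sketch of how these choices show up on the command line: building the HISAT2 indexes and running one alignment. The option names are real HISAT2 options, but the values and file names here are hypothetical, so treat this as illustration rather than recommended settings.)

```bash
# Build the HISAT2 index once per reference genome
# (this creates the global index plus the many small local indexes).
hisat2-build GRCh38.fa grch38_index

# Align paired-end reads:
#   --dta             tailor alignments for downstream transcript assembly
#   --rna-strandness  declare a stranded library (RF here), if you used one
#   -k 5              report up to 5 alignments for a multi-mapping read
hisat2 -p 4 --dta --rna-strandness RF -k 5 \
  -x grch38_index -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
  -S sample.sam
```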
Back to that question: if you are doing variant calling, you might want to disallow multi-mapping and not report multiply mapped reads. But if you're doing expression, you don't want to distort the dynamic range of your expression values, so you might want to let the tool report multi-mapped reads. In terms of output, all these aligners give you a SAM or BAM file. SAM stands for Sequence Alignment/Map format, and BAM is the binary version of SAM. You've worked with BAMs over the past week, correct? Okay, perfect, so I don't have to go through the details of converting BAM to SAM and so on. It's a good idea to convert everything to BAM; it saves a lot of space. And you've probably gone over this: in a SAM or BAM file you have the header section and the body. The header section contains information about the sample; you can put whatever information you want there. It also records the aligner that was used and all the parameters that were passed to it. So when we use HISAT2, the header will actually contain the HISAT2 command that was run. That's very important: if you receive a BAM file and don't know how it was aligned, you can just look at the header and it will tell you how the alignment was done and with what parameters. BED is another file format that is very important when dealing with RNA-seq. BED is simply a format with chromosome, start, and end. If you're looking at certain regions in the transcriptome, say a specific gene that you want to subset out of your BAM file because you don't want to deal with the whole thing, you can create a small text file, a BED file, that gives the chromosome the gene is on, the start of the gene, the end of the gene, and the gene name. There are tools where you provide that small BED file along with the BAM file, and they will subset out all the reads within the boundaries you specified. Then you can plot coverage, look at quality, and assess expression for just those regions of the genome or transcriptome. Some tools you can use to work with BAM files are SAMtools, BAMtools, and Picard, and for BED files there are BEDTools and BEDOPS. How you sort SAM and BAM files depends on the tools you're using, but there are two ways: by position or by read name. You sort by position if you want to easily pull out reads from a given position in the transcriptome. But if you're trying to maintain the pairing information, you might want to sort by read name, because the read name ends with read 1 or read 2, so when you sort by name, the pairs stay together. If you're looking for fusions, for example, it's better to sort by name; it makes it easier for those tools to pick up the pair information. Once you have the BAM file, you can visualize it: you can open it in IGV, and I believe you said you've used IGV. In IGV there is the coverage track, and this is the transcriptome assembly that you load into IGV.
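(Before the IGV screenshots, a quick sketch of those BAM manipulations with SAMtools. The gene, coordinates, and file names are made up for illustration.)

```bash
# Sort by position and index, for coordinate-based retrieval and IGV.
samtools sort -o sample.sorted.bam sample.bam
samtools index sample.sorted.bam

# Or sort by read name, so the read 1 / read 2 pairs stay adjacent
# (handy for fusion callers and other pair-aware tools).
samtools sort -n -o sample.name_sorted.bam sample.bam

# Subset the BAM to the regions in a small BED file (chrom, start, end, name).
printf 'chr17\t7668402\t7687550\tTP53\n' > gene.bed
samtools view -b -L gene.bed sample.sorted.bam > gene_subset.bam
```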
So it's really exciting to look at these views, because you try to match the exons with the coverage pileups, and sometimes you end up discovering coverage in intronic regions, which might indicate a novel exon or a novel transcript that isn't in the annotation, or it could be a mapping error. I highly recommend doing this after alignment, not for everything, just pick a sample. And instead of loading the whole BAM file, because it's big, you can use the BED subsetting I just showed to pull out a region of interest and look at that in IGV. Q: That region on the right, is that whole region one exon? A: This one? Yes. Q: So why does the coverage look uneven there? A: That could indicate poor coverage, or some bias, three-prime bias; there are many things that can go wrong. And as I said, it could be a novel transcript: it could actually be two exons rather than one, and the known annotation just has one exon. Let's see. Okay. So now that we have the BAM file, we want to assess how good the alignment was. The first level of quality assessment happened with the RIN, before we did any sequencing. At this point we're assessing how well the sequencing and the mapping did. There's a step I skipped, which is FastQC. Have you used FastQC or FASTQ quality assessment before? No? Okay. FastQC is a tool you run on the FASTQ file, and it tells you the quality of the sequences prior to alignment. That's usually done first. Then the third level of QC is this one: after the alignment is done, you look at how well the reads aligned to your transcriptome or genome. I'm going to go over some metrics you can use for your QC report: 3' and 5' bias, nucleotide content, base and read quality, and PCR artifacts, the duplication we talked about. These are just examples; there's a lot more you can look at, but you can start with this set. First, 3' and 5' bias. This bias can be introduced by the library prep technique you use. Remember how I talked about the different library prep techniques: when you have total RNA and you want mRNA, you can do poly-A selection or cDNA capture. With poly-A selection there might be some bias where you get more coverage toward the 3' end of your transcript than the 5' end. One way to check for that: this plot is from a tool called RSeQC that does these quality checks for you. What the tool does is take the top thousand transcripts and split each transcript into 100 bins, and for each bin it looks at the coverage. Each curve here represents a sample, and these are the normalized positions from 5' to 3'. You'll notice the cohort splits into two patterns. This first one is what you expect: even coverage across the transcript, regardless of whether it's the 5' or the 3' end. But you also get this group of samples with a lot more coverage at the 3' end than the 5' end, which could be due to the selection method. So what do you do at this point? There are multiple options. You can go back and figure out what caused this. Is it your library technique?
Maybe you can change the kit, because some kits can introduce bias. If you can't do that, you can adjust for it computationally: there are tools you can use to correct this bias before you do expression estimation. Because if you go ahead and estimate expression anyway, you will bias the expression estimates too; some genes will show more expression just because your kit selected them better than other genes. And if you can't fix it experimentally or computationally, then when you do downstream analysis or fit any models, you might want to include that as a covariate in your model, so that you adjust for it statistically downstream. Another thing you can look at is nucleotide content. Here I'm looking at the base distribution: how many A, C, G, T at every position in the read. This example is a very short read, only 35 bases, and at each position I'm looking at the distribution. You expect the distribution to be roughly equal, about 25% each of A, C, G, and T, which is what's happening around here. However, Illumina sequencing uses random primers for the reverse transcription of RNA fragments into cDNA, and it's the cDNA you're sequencing. It turns out these random primers are not so random: at the beginning of the read they have some bias or preference toward certain bases. One way to deal with that is to trim the first 10 bases. So plot this, look at your data, see if you have the issue; if you do, you can trim the first 10 bases, keep the rest of the read, and align that. You can also look at the quality distribution: for each base position within the read, across all the samples, we're looking at the distribution of base qualities. When I talk about quality, we're using the Phred score, which is minus 10 times the log10 of the probability that the base call is wrong. What does that mean? A Phred score of 30 means there is a one in a thousand chance that the base called is not the correct base. So you want a higher score, which means a lower likelihood of error. Q: Does that depend on the machine? A: Yes, it depends on the sequencing platform, and it depends on the library protocol too; there are a lot of things that can go wrong. You usually see a bit of a decrease in quality toward the end of the read. You want to make sure that most of the bases in your read have a Phred score of more than 30; 20 or 30 is the usually accepted threshold for quality. All of these plots are generated with RSeQC. I don't know if we have RSeQC installed, and I don't know if you'll have the time, because the instance will be shut down by the end of the day. If you take the RNA-seq workshop in July, you'll get to work with RSeQC. But you can try to install RSeQC; there are other tools that do this as well, and you don't even need a tool, you can implement it yourself, depending on how much background you have in bioinformatics, because it's not too complex. I would start with RSeQC, though. Next, PCR duplication. We said this could be an issue: we can't just blindly collapse reads, because in RNA the start sites are not random, due to transcription start sites.
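(For reference, this is roughly what running these QC steps looks like on the command line, assuming FastQC and RSeQC are installed; the BED gene model and file names are placeholders.)

```bash
# Pre-alignment QC on the raw reads.
fastqc sample_R1.fastq.gz sample_R2.fastq.gz

# Post-alignment QC with RSeQC modules, on a sorted, indexed BAM:
geneBody_coverage.py -i sample.sorted.bam -r gene_model.bed -o body_cov  # 3'/5' bias
read_NVC.py -i sample.sorted.bam -o nvc            # nucleotide content per position
read_quality.py -i sample.sorted.bam -o qual       # Phred quality distribution
read_duplication.py -i sample.sorted.bam -o dup    # the duplication plot discussed next
```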
One way you can assess PCR duplication is this plot right here. The x-axis shows how many times a read occurs, and the y-axis shows the number of reads with that occurrence: you see the read once, twice, three times, four times, a hundred times, and so on. Ideally you want a curve that looks like this, with a very low number of reads occurring many times. What the tool is doing is assessing duplication using position: the beginning of the first read and the end of the second read. That's one way. The other way is to look at the actual base sequence, comparing the sequences of the two reads. You want to make sure that both the sequence-based and the mapping-based duplication rates are very low. Q: Some kits have UMI attachments, so each cDNA has its own unique leading sequence that lets you find the PCR duplicates. Do you know if any tools can incorporate filtering on the UMIs? A: I'm not sure, actually; I'm not aware of a tool that does that. Q: How would the curve compare for a sample where no PCR was run at all versus one that is clearly biased? A: I don't have an example of that here, but if you didn't run PCR, you shouldn't have this issue. The issue arises from PCR or whatever amplification you do in your library prep; if there is an amplification step, you want to check for this, and if there isn't, it's not an issue. Q: Can you explain again exactly how it detects the duplicates? A: It does it by looking at the mapping: the position of the beginning of the first read and the end of the second read, because you want to consider the whole fragment, both reads, not just one. So for paired-end data, if you see multiple fragments having the exact same start of read 1 and end of read 2, or having exactly the same sequence content, those are flagged; that's why the tool checks both things. Q: And what exactly is on each axis? A: It can be a bit confusing. Here you're looking at the number of duplicates, the read occurring once, twice, three times, four times, a hundred times, and the y-axis is the count of reads in the transcriptome that have that level of duplication. Q: So a good sample would drop off extremely steeply? A: Exactly, extremely steep. A bad sample would have something like this, out here to the right. Now, sequencing depth: how many lanes? We talked about writing down your parameters and looking for publications, and we said that if you can't find a publication because what you're doing is so unique, then maybe you can do a saturation plot. This is what I meant by a saturation plot. So what are we doing here?
Let's say you did one lane of sequencing for that sample, because you really had no idea how deep to go. Then you took the BAM file, and what this tool does (this is also generated by RSeQC) is take the BAM file and sample 10% of the reads, 20%, 30%, 40%, up to 100%. At each level, it looks at the fraction of known and novel junctions, and that's what we're plotting here. On the x-axis we have the subsampling fraction, and on the y-axis the number of known junctions versus novel junctions. What we're looking for is the point at which the curve saturates, because at that point, no matter how many reads you add, you're not getting any more junctions: you've discovered all the junctions present in the sample, and you can stop there. Now, you'll notice that known junctions saturate a lot faster than novel junctions. That's because a lot of the novel junctions could be false positives: for a novel junction you're discovering, it could be that you just don't have enough coverage and it's a false positive, so you have to be careful. You can look at the combined curve, or at the known curve, and see where it saturates. For this one, we could stop at 20% to 30% of the reads. Then you go and say: okay, 30% of the reads, what number is that? I'm going to use that number, maybe add some padding on top, and say that's the number of reads I need. And you can go back and sequence that many reads for the rest of the cohort. You can also generate this kind of plot for expression. RSeQC does it for junctions, but you can do the same thing yourself: subset the BAM, estimate expression, and look at families of genes. Are you discovering any new gene families? Is expression changing for the genes you're interested in as you increase the depth? Again, you look for a saturation level at which no more genes are being discovered, and you say: okay, I'll stop there. You can do both and find a threshold that works for both expression and splice junctions. Q: But you've already sequenced this sample fully. A: Exactly. You do this for one sample before you do the whole cohort: you have 20 samples, you do it for one, figure out what the number is, and then you sequence the rest. That way it's cost-effective. Q: Could you merge sequences from multiple runs instead? A: Yeah, I see what you're saying. You would take the sequences and merge them after alignment, or actually even before alignment. Before alignment would be better; as long as you merge before alignment it should be okay. And it would be nice to maybe process them separately and then together, just to see whether any artifact is introduced by mixing the reads, or at least check the quality of the sequences separately, and if the quality is okay, then you can mix them. But make sure you do it before alignment, because the alignment feeds transcript assembly and you need more reads, more evidence: you'll have more power to detect transcripts if you combine the sequences.
Q: But don't you think there would be some sort of systematic difference between the sequencing runs? A: There always is, but it depends on how much variability: are you sequencing at two different centers, are you doing single-end versus paired-end? If you try to minimize the variables as much as possible, it should be okay. This happens a lot with DNA: we do multiple lanes, combine the raw sequences, and then do the alignment. But I guess RNA is a bit more sensitive; if you did the second run five months later, the RNA might have degraded, so you'd want to check that first. Q: Does RSeQC have any other way of doing the saturation plots, using the number of genes? A: It doesn't; it only does the splice junctions. I guess you'd have to write your own. Once you get the concept, it should be easy to implement, because you can do the BAM subsetting with BEDTools, I think, or you can actually use SAMtools: if you pass -s with a fraction, say -s .3, it will randomly select 30% of the reads in the BAM file. You can change the random seed of -s and generate different subsets (there's a sketch of this below), and then you can go and look for a gene yourself. Exactly. Okay. You can also look at the base distribution: what fraction of the bases you've aligned are coding, what fraction is UTR, intronic, intergenic, and so on. The fractions will look different depending on the technique you used: with total RNA you'll have fewer coding bases; with mRNA techniques, poly-A selection or cDNA capture, you'll have a lot more coding bases. So it's always good to check that it matches the library prep technique you used. Next, insert size. I wanted to define this because, with HISAT2, you don't have to specify the expected distance between the reads, but with TopHat it was an option you had to feed in. You tell it the expected distance between the reads, and it compares that against what it observes: if the observed distance is larger, it will classify the gap as an intronic region, or classify the pair as a fusion. It uses the number you provide to classify those reads. With HISAT2 you don't have to do that; I think it learns it on its own, so you don't have to worry about it. But the reason I put it here is that you also want it as a QC measure: you want to look at the insert size distribution from the mapping. Insert size can be defined in different ways. Here we have the two reads, read 1 and read 2. Fragment size is the distance from the beginning of read 1 to the end of read 2, including the adapters. Insert size is the same distance without the adapters. And the inner mate distance is the distance between the end of read 1 and the beginning of read 2. Q: For RSeQC, does it output a warning message when something is off? Say you have 50 samples; do I have to look at each plot for each sample, or will it tell me what to look at? A: RSeQC runs per sample; it doesn't run as a cohort, so it will not generate a cohort-level report. Each sample gets its own report, and you get text output and, I believe, HTML output.
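(Written out, that subsampling idea might look like the sketch below. With samtools view, -s takes SEED.FRACTION, so 42.3 means seed 42, keep 30% of the reads; the RSeQC junction script and file names are assumptions for illustration. Note that RSeQC's own junction_saturation.py does this subsampling internally in one pass.)

```bash
# Subsample a BAM at 10%..90% and re-run junction QC on each subset
# to see where the known-junction curve saturates.
for i in 1 2 3 4 5 6 7 8 9; do
  samtools view -b -s 42.${i} sample.sorted.bam > sub_${i}0pct.bam
  samtools index sub_${i}0pct.bam
  junction_annotation.py -i sub_${i}0pct.bam -r gene_model.bed -o junc_${i}0pct
done
```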
As for those per-sample reports: what we did in production was parse the text or the HTML to look for errors, and that's how we would flag samples. RSeQC itself doesn't have thresholds you can put in to say this sample passes or fails; you're going to have to parse the output yourself, set your own thresholds, and then classify the sample as good or bad. That's harder than with DNA. With DNA it was really simple: you can just look at coverage and say, okay, this sample has less than 50x, we're going to flag it as bad and look at it again; you look at coverage or quality. But here there are so many other variables to consider. How do you classify, computationally or in an automated way, a sample as bad? There are so many different metrics, as we saw, that could be good or bad, and sometimes it's borderline and you say: fine, even if one metric is not good, I can still go ahead and analyze it, and I'll adjust for it when I do the downstream analysis. So yes, automating the process is challenging. Q: Just a quick comment. I have used a tool called MultiQC, which creates a combined report when you have many samples, which we usually have. It recognizes the outputs of RSeQC and many other tools. A: Is it for DNA-seq or RNA-seq? Q: RNA-seq. A: Oh, perfect. MultiQC, okay, sounds good; we'll add it to the resources for next year, maybe. Thank you. Okay. One other thing you can do with IGV is look at mutations. Mutation calling itself is something we're not covering here, and this is a very simplified version of it. This slide is designed for people who have done DNA sequencing, who have looked at variants and filtered them, and who just want to do some sort of validation by checking whether those variants exist in the RNA-seq as well. You can do that in IGV: don't do it for all the variants in RNA-seq, because it will take forever, but you can look at the pileup of the reads, look at the alleles, and see what fraction of the reads are mutated and so on. Q: Is there any reason why calling variants from RNA-seq wouldn't be as reliable? A: There is the concept of RNA editing, which can introduce some variation. Also, in my experience, there are a lot of false positives around splice junctions, because of the artifacts introduced by the false-positive splice junctions that HISAT2 or TopHat produce. When a tool is assembling splice junctions, it's not going to be 100% accurate; it will try to find them, but it won't find them all, and because of that some sequences won't be aligned properly and you'll find a lot of spurious variants. If you call SNVs from RNA-seq and look at how close those SNVs are to splice junctions, you'll see an enrichment: a lot of these SNVs are very close. So you can add that as a filter: if a variant is within some proximity of a splice junction, get rid of it. Other than these two reasons, I don't see why it wouldn't be as good as DNA. Q: I think another problem is that RNA is not expressed evenly. A: That's true. You can use the expression information, as I said, to prioritize these mutations, but some might not be expressed.
How are we doing in terms of time? I don't have the time on me; do you guys know what time it is? 2:45? And the break is supposed to be at 3? Okay, so maybe we can try to finish this part and then take a break, and after we come back from the break we can work on the tutorial and run through it together.

So: we've aligned, we've checked the quality of the alignment, and everything is good. We can now go ahead and assess expression. We'll do expression estimation for known genes and transcripts; we'll look at FPKM, which is a unit for measuring expression, and compare it to raw counts; we'll look at differential expression methods; and we'll look at things we can do with expression and differential expression in terms of visualization and downstream analysis.

As I mentioned, one way to look at expression is IGV. Let's say you have a gene of interest and you're really excited: you get the BAM files and you want to see whether the gene is differentially expressed between two conditions. You can do that: pull up the condition-one BAM and the condition-two BAM, look at the different isoforms or transcripts, and assess expression visually by checking whether the gene has more coverage in one track than the other. However, there are a few problems with doing it this way. Number one, you're not considering the depth of the two libraries: one might have been sequenced on one eighth of a lane and the other on two lanes, so clearly one will have a lot more reads. You have to keep in mind how deep you sequenced; you can only make that visual comparison if you had exactly the same amount of sequencing for both. The other problem is gene length: you can't just compare two genes, because if one gene is longer than the other, it will clearly have more reads covering it, so you need to adjust for the length of the gene. So these are the two things you need to adjust for, how deep you've sequenced and how long the genes are, and that's where FPKM comes in handy.

RPKM stands for reads per kilobase of transcript per million mapped reads. What does that mean? You take the reads and normalize them by the total number of reads in the library and by the gene length; I'll show you the formula in a bit, but it's pretty much what I just said, you divide by those two numbers. The difference between RPKM and FPKM is that RPKM counts individual reads, while FPKM counts fragments: a read pair counts as one fragment. Otherwise it's the same thing: you count the fragments, divide by the total number of fragments in your library, and divide by the gene length. And just like I mentioned, that gets rid of the two biases, different library depth and different gene length. I believe StringTie also reports another measure called TPM. So how is FPKM different from TPM? For FPKM, you take the reads, divide them by the total depth, and then divide by gene length. TPM just switches the order: the first thing you do is divide by the length of the transcript, and then you take those length-normalized values and divide by their total, per million. So just the order in which you normalize is a bit different, and you end up with two different values; one consequence is that TPMs always sum to the same total in every sample.
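To make the order-of-operations difference between FPKM and TPM concrete, here is a minimal sketch on a toy count table (the counts and lengths are invented; in a real analysis they come from your quantifier):

```python
import numpy as np

# Toy data: fragment counts for three genes in one sample, lengths in kb.
counts = np.array([500.0, 1000.0, 100.0])
lengths_kb = np.array([2.0, 10.0, 0.5])

# FPKM: normalize by depth first (per million fragments), then by length.
per_million = counts / (counts.sum() / 1e6)
fpkm = per_million / lengths_kb

# TPM: normalize by length first, then rescale so values sum to one million.
per_kb = counts / lengths_kb
tpm = per_kb / (per_kb.sum() / 1e6)

print(fpkm)       # depth- and length-normalized; totals differ by sample
print(tpm)        # length- then depth-normalized
print(tpm.sum())  # always 1e6, which makes TPMs comparable across samples
```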
Once you get the FPKMs, you can pass them on and do statistical tests for differential expression. So what do we use for that? With HISAT, there is a package called StringTie that does expression estimation for the assembled transcripts; with TopHat, it was called Cufflinks, so if you're using Cufflinks, it's the parallel of StringTie. I'm not going to go into the details of exactly how it works, but roughly: it looks at all the possible paths for the transcripts, and for each read it assigns a probability of that read belonging to each of those paths. The read goes to the path, the isoform, with the highest probability; then that read is done and you go on to the next one. Once you assign a read to a specific transcript you set it aside, and you keep going until all the reads are assigned.

When you do that, you get a transcript assembly per sample: you run this once per sample and get one assembly for each. But sometimes you won't get the full picture of a transcript from just one sample. So it's good to go back, combine the transcripts from all the samples you came up with, and build one merged assembly based on everything you've sequenced. You'll have more confidence in that merged assembly, and then you go back and re-quantify the expression of each sample against it, because you trust the merged assembly more than the per-sample ones.

Another useful tool is gffcompare. It takes the assembly you generated and compares it to established, known annotations, and it tells you what fraction of your transcripts are known and what fraction are novel.

With StringTie, there are different modes, and again, we're not going to have time to go through all of them here, but know that they exist. If you're interested in known junctions or known transcripts only, you can provide StringTie with a file of known annotations and say: just follow these transcripts, don't go discovering new ones, and find me the expression for these. Or you can turn on novel discovery and say: whatever you discover, I'll take it. Or you can mix both: here is the annotation, go find new ones, but use this as a guide when assessing expression.

StringTie then passes these TPM or expression values to Ballgown, which is responsible for conducting the differential expression tests. The way it does that is it fits a linear model; think of the y in your linear model as the expression values for your gene. It fits two models, one with the condition you're interested in and one without, and it compares the fit of these two models with an F-test. If they're significantly different, the condition has an effect on your expression; if they're not, the condition has no effect.
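Ballgown itself is an R package, but the nested-model comparison it performs can be sketched in a few lines. This is a toy version, assuming log-scale expression values for one gene across four control and four treated samples; the real tool fits this per transcript with more bells and whistles:

```python
import numpy as np
from scipy import stats

# Toy example: log expression of one gene in 4 control and 4 treated samples.
expr = np.array([5.1, 4.8, 5.3, 5.0, 6.2, 6.5, 6.0, 6.4])
condition = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Null model: intercept only (condition has no effect).
rss0 = np.sum((expr - expr.mean()) ** 2)
df0 = len(expr) - 1

# Full model: intercept + condition effect, fit by least squares.
X = np.column_stack([np.ones_like(condition), condition])
beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
rss1 = np.sum((expr - X @ beta) ** 2)
df1 = len(expr) - X.shape[1]

# F-test comparing the two nested models: does adding the condition
# explain significantly more variance than the intercept alone?
f_stat = ((rss0 - rss1) / (df0 - df1)) / (rss1 / df1)
p_value = stats.f.sf(f_stat, df0 - df1, df1)
print(f_stat, p_value)  # a small p-value means the condition matters
```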
Ballgown takes the p-value from that comparison and adjusts it for multiple testing to get a q-value. So for each gene you get a p-value and a q-value, and you can set a threshold on the q-values to pick the genes that are significant.

Ballgown will also do the visualization for you. It's simply an R script; I've attached a Ballgown script for the tutorial, if we get a chance to run it. You don't have to change much: you tell it the directory where your expression files are, and you change the labels of the conditions. The defaults will be something like condition one and condition two, and you rename them to treatment one, X, Y, whatever. That's all you change; then you run the R script and it generates all these plots for you, plus a table of the most significant genes with their p-values and q-values.

What are the options other than FPKM? You can also use raw read counts. A lot of tools take that route and don't believe in the normalization that StringTie does: you just look at the raw read counts, and you can use HTSeq to produce them. I mean, these tools do look at the coverage; they just don't care so much about gene length, because they're not comparing genes against each other, they're comparing the same gene across different conditions, and the gene length is constant there. So you look at the raw counts and adjust only for depth.

Which one should you use, FPKM or raw counts? Whichever path you take, there are a lot of packages for it: a lot of packages for FPKM, a lot of packages for counts, and the results will not be the same. You'll get some overlap, but because the assumptions are different, you're going to get different answers. So, recommendations: if you're mainly interested in visualization, looking at heat maps, calculating fold changes and so on, you can go ahead with FPKMs. But if you're fitting advanced statistical models, it's recommended that you go with the raw counts, because those models are statistically more flexible and you don't have to worry about as many assumptions.

So, for differential expression, try to use the counts. There are tools like DESeq and edgeR that take the HTSeq output I showed you before: HTSeq produces the raw counts, you pass the results to DESeq or edgeR, and they do the differential analysis for you. The output will be similar, though I don't know that it's as user friendly; you won't get those fancy plots with DESeq and edgeR, but if you're only interested in a list of significant genes, you can use them. We're not going to be using edgeR and DESeq for this module; we'll focus on the HISAT, StringTie, and Ballgown workflow. One approach you can take is to run both and look at Venn diagrams, at the overlap between the gene lists you generate. For example, if you want to validate and you don't have a lot of funding, and you want to restrict to a smaller subset, you can run different algorithms and take the overlap, because you have more confidence in the overlapping genes than in the non-overlapping ones.

Also, multiple testing correction is extremely important, considering how many tests you're conducting. Just like I mentioned, you're doing that differential analysis per gene, and depending on the gene annotation model, you're doing the test on the order of 50,000 times. By chance alone, you are going to find things that look significant if you're just looking at the p-values, so it's important that you adjust for the number of tests you're doing. Most of these tools do the adjustment themselves and report q-values alongside the p-values; make sure you look at those and not the raw p-values. (A small sketch of that adjustment follows.)
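Here is the adjustment sketch mentioned above: a minimal implementation of the standard Benjamini-Hochberg procedure, which is one common way these q-values are produced (the p-values here are made up):

```python
import numpy as np

def bh_qvalues(pvals):
    """Benjamini-Hochberg FDR adjustment: returns q-values in input order."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)                   # ascending p-values
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards, cap at 1.
    q = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
    out = np.empty(n)
    out[order] = q                          # back to the original order
    return out

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(bh_qvalues(pvals))
# Genes with a q-value below your threshold (say 0.05) are called significant.
```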
So, what can you do with this data? You have a lot of it now: expression and differential expression. You can do clustering with your expression values and generate heat maps; you can work on classification problems using expression values; or you can do pathway analysis based on the genes that are differentially expressed. There are a lot of tools you can use for that as well; there's so much you can do past this point. (A small clustering sketch appears at the end of this section.) Any questions?

Question, yes. How do you analyze RNA splicing data? What do you want to do with it? You can: you'll get files that list all the splice junctions, so you can look at differential expression around the splice junctions if you want, or even compare the splicing between different conditions. You can do that visually; I don't know offhand of tools that would specifically compare the splice junctions, but in IGV you can definitely load the splice junctions as tracks and compare those as well. Yes, that's the diagram we were looking at.

So, are you mostly using the different approaches to differential expression to validate each other, or to find differentially expressed genes that the other methods miss? I mean, either; it depends on what you're interested in. Ideally you pick a method, take its raw results, and stick to them, and if you can validate all of those, go ahead. Looking at the overlap is not necessarily the right answer either, because by taking only the overlap you could miss some interesting things that were detected by some of the algorithms but not by the others. I'm just saying that if you're trying to reduce the set and focus on the most confident calls, then I would take the overlap. And actually, if you're doing validation, it would be interesting to select the overlap but also select things that show up in some methods and not others; then after the validation you can go back, compare, and see how accurate each method's calls were.
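Finally, as a pointer for the clustering step mentioned earlier in this section, here is a minimal sketch on a toy genes-by-samples matrix. Everything here is invented stand-in data, and the correlation distance and average linkage are just common choices for expression data, not a prescription:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Toy expression matrix: 20 genes x 6 samples (3 control, 3 treated),
# with the last 10 genes shifted upwards in the treated samples.
expr = rng.normal(5.0, 1.0, size=(20, 6))
expr[10:, 3:] += 3.0

# Log-transform (as you would with FPKMs) and cluster the samples.
logged = np.log2(expr + 1)
dist = pdist(logged.T, metric="correlation")  # samples as rows
tree = linkage(dist, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")

print(labels)  # samples falling in the same cluster get the same label
# From here a heat map is one call away, e.g. seaborn.clustermap(logged).
```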