My name is Fouad, and I work here at the Ontario Institute for Cancer Research as a bioinformatician. A few years ago I worked in the production team, where I was exposed to RNA-seq data and was responsible for assembling the RNA-seq pipeline for the production team at OICR. Today we're going to be talking about gene expression, but also about RNA-seq: I'll talk about RNA-seq technology, how it works, how libraries are prepared, and then alignment, expression, the whole thing. These slides are adapted from another workshop that we run through CBW, the RNA-seq workshop, which lasts two days. So this is just an introduction; if you're interested in learning more about RNA-seq in depth, you can take that workshop, where you get to do more tutorials and learn more RNA-seq concepts.

I'm splitting the morning module into three sections. The way it's going to work today is that for the first two hours we'll do just a lecture, then we'll have a coffee break, and then we'll do a tutorial all together, where we have a small data set and can run an alignment, do expression, and do differential expression. Hopefully we'll have time for all of that; if not, I have all the commands listed in the tutorial so you can run things on your own afterwards. So, the three parts for this morning: the first part will be an introduction to RNA sequencing; then I'll move into RNA-seq alignment and visualization; and finally I'll end with expression and differential expression. And feel free to ask questions: if there's something you don't understand, just raise your hand and stop me, and I'll be more than happy to answer it.

For each part, I'll go over the learning objectives before I get into it. The learning objectives of the first part are an introduction to the theory and practice of RNA sequencing, the rationale behind RNA sequencing, the challenges we face with RNA-seq data, and the general goals and themes of RNA-seq workflows and analysis. I'll end with common technical questions that a lot of people ask when they start working with RNA-seq data, questions that I asked as well when I started, and I'll provide some suggestions and solutions that might help you establish your own pipeline.

Before I get started, just a show of hands: how many people have actually worked with RNA-seq data? Okay. And how many of those worked on human RNA-seq? Okay. Any other organisms or species? I'm just curious to know. Yes. Okay. Fun. Anything else? Sorry? Mouse. Mouse, okay. Corn. Corn? Ah, okay. So the focus today will be on human, but you can take these concepts and apply them to the other organisms and species you're working with.

In genetics, gene expression is considered to be the most fundamental level at which genotype actually gives rise to phenotype. Through gene expression we learn a lot of things: which genes are expressed, what triggers genes to be turned on and off, how genes interact with each other in pathways, and why certain genes in certain cell types manufacture certain proteins. All of that information is available through gene expression. There are a variety of platforms you can use to assess gene expression, to measure transcript abundance: qPCR, microarrays, northern blots. But lately, a lot of studies have been using RNA-seq.
And the reason people are moving towards RNA-seq is that RNA-seq data is accurate and sensitive. It's also not restricted to a predefined set of genes or probes: even if you don't have a reference genome, you can still sequence and do de novo assembly afterwards, which is something you could not do with microarrays. You can also discover novel genes and novel transcripts. The data you get is very rich: not only do you have gene expression, but you can look at many other data types, like fusions, isoforms, mutations in the RNA, and so on, things you cannot really obtain using other platforms. In terms of money, it's very cost effective: the cost of sequencing is going down, so it will be a lot cheaper and more feasible to do RNA sequencing in the future. And there's the dynamic range: because it's cost effective, there isn't really a limit on the dynamic range of expression you can capture. If you have more money, you can sequence more and add more depth to your sequencing.

This slide shows just an overview of an RNA sequencing workflow. Don't worry about it; I'll go into the details of each step and explain it in a bit.

Now, why do we sequence RNA? What kind of information does RNA provide that DNA does not? For example, say you're doing functional studies looking at certain conditions or drugs, comparing different cell lines, and you want to see the effect of a specific drug on two cell lines. A lot of times in those functional studies the drugs actually affect the transcriptome, not the genome, and you run RNA-seq to compare the two conditions and the outcome of the drug on the two populations. Also, predicting transcript sequences is really hard by just looking at the DNA itself; it's a lot easier to do RNA-seq and use that to predict the transcript sequences. And some molecular features, as I mentioned before, are only present in the RNA, like alternative isoforms, fusion transcripts, and RNA editing; these are things you cannot obtain from DNA sequences.

In addition, you can combine the RNA data with DNA, and that helps you evaluate, for example, the somatic mutations you find in the DNA: you can evaluate the severity of those mutations. Let's say you have a somatic mutation in the DNA, but there is no expression of that gene at the RNA level; then you might not be as interested in that mutation. But say there is a mutation and only the wild type of that gene is actually expressed; that could suggest a loss of function. More interesting, if there is a mutation and you see expression of the mutated allele, that could be a drug target you want to go after.

RNA is great. However, there are a lot of challenges that you will face when dealing with RNA-seq. The first challenge is sample related. For a lot of the samples, when you get them, you have to consider the purity of the sample and the quantity: do you have enough material to actually run RNA-seq? And the quality: is the quality good enough? Can I move forward?
Is it worth taking it forward and actually performing RNA sequencing? It's expensive, and it's a waste of money if you take samples that are bad quality, run RNA-seq, and then it turns out you can't use any of the data you sequenced.

Another challenge, when aligning RNA-seq reads, is that the reference we use is a genome reference that consists of introns and exons, whereas RNA-seq data consists mostly of coding regions, the exons. So you're trying to map exonic reads to a reference that has both exons and introns, which means the aligners have to deal with a lot of large gaps, and that's a bioinformatic, computational challenge.

Also, the abundance of expression varies a lot. You have genes with very low expression alongside genes with extremely high expression, and those extremely highly expressed genes also vary across samples. Sometimes what the highly expressed genes do is consume reads from the other genes in the genome, and that can lead to false positives when you're doing comparisons between samples: you could get false positive differentially expressed genes. That's another thing you have to keep in mind when dealing with RNA-seq. Also, the sizes of genes vary, and the number of exons within genes varies, and that variation can also lead to coverage biases. For very long genes, the coverage might not be uniform across the transcript compared to short genes, and any technical bias coming from library prep will affect short genes or long genes more than vice versa. These are things that might not be present when you're sequencing DNA, but are present in RNA, and you have to consider them. Also, RNA is fragile compared to DNA; it's easily degraded when you handle it, so that's another thing you have to look at before you start processing your RNA.

Speaking of RNA being fragile, the first step you must conduct before you carry on with library prep and sequencing is to check the quality of your RNA. Previously, people used Bioanalyzer electrophoretic trace plots. For those of you who don't know what that is: simply, what you're doing is running your RNA through a gel and getting different bands, where each band represents a subunit of RNA. Now, the RNA pool in the cell consists mainly of ribosomal RNA; most of the RNA in the cell is ribosomal, and the two big subunits of ribosomal RNA are the 18S and 28S subunits. The plot here shows the size of the molecule on one axis, and on the y-axis the fluorescence signal you're getting from the gel. An example of good RNA is the plot on the right, where you see two very clear bands for the 18S and 28S ribosomal subunits, and the rest of the RNA is so small in comparison that you can't really observe it. Now, as RNA degrades, it gets split into smaller pieces, and when you run those through the gel you observe more than just the two subunits, because you're disintegrating the RNA. You start seeing multiple peaks in the trace plot, and that's the example on the left: you're seeing a lot of fragments in the RNA, and that's not good.
Now, instead of eyeballing trace plots to figure out whether the RNA is good or bad, Agilent came up with an algorithm called RIN, which stands for RNA integrity number: through some machine learning over all those trace plots, they derived a single score you can look at to assess whether the RNA is good or bad. Those scores are right underneath the trace plots: the plot on the left has a RIN of six, while the plot on the right has a RIN of 10. The range is zero to 10, zero being very degraded and 10 being intact. Usually we pick RNA samples that have a RIN of seven or higher, but a lot of times you're given samples that have a very low RIN and are just bad quality. Then you ask yourself, what do I do? You tell them it's bad; they say no, go ahead and sequence it. So what I suggest is that instead of sequencing the whole batch, you do a pilot study: run one sample and see what kind of data you can get out of it. If you can get usable data, you can go back and sequence the rest; but if it's terrible and you can't get anything out of it, there's no point in continuing, and maybe you should ask for better samples.

Once you've made sure your RNA is good, you can go ahead and start the library prep, and that process is also complex. The reason I say it's complex is that there are so many kits out there on the market. How do you choose? How do you decide which kit to use to create your RNA libraries? That is solely dependent on the purpose, on what you're interested in. The first step is you take your tissue and isolate the RNA, and then you run a DNase treatment. What the DNase does is break down the DNA in your RNA pool so that there's no DNA contamination, and once that's done you're left with an RNA pool. Now, in that RNA pool, as I mentioned before, there's a lot of ribosomal RNA, and it's a mix of coding and non-coding RNA.

Choosing a kit to prepare a library will depend on what you want. Are you interested in non-coding RNAs, long non-coding RNAs? If you are, then you start with total RNA, which consists of everything, and you run a ribosomal depletion. That's another kit, like Ribo-Zero or RiboMinus, if you've ever heard of those; what it does is eliminate the ribosomal content in your total RNA, leaving you with coding and non-coding RNA that you can use downstream. Now, if you're not interested in non-coding RNA and only care about coding regions, you can take your total RNA and run a poly-A selection, or poly-A enrichment. What that means is that as RNA transcripts get processed, a series of A bases is attached to them, and those molecules with poly-A tails are the mature RNAs that contain coding sequences. Those are the ones you're after if you're interested in coding sequences only, and poly-A enrichment just pulls out the RNA fragments that carry those poly-A tails. You can do that, or you can do a cDNA capture if you're interested in certain genes or certain regions of the transcriptome; you can think of it as the RNA equivalent of exome capture. So again, it's very dependent on what you're interested in, and there are many kits available on the market to do it. Another thing you might want to consider is whether you want stranded versus unstranded libraries.
Five years ago, the concept of stranded libraries did not exist; all the libraries available were unstranded. What that means is that when you sequence, the reads you get will not tell you where the expression is actually happening: which strand is that read coming from? For example, in the top left plot there, you have two genes being transcribed in two opposite directions, and you get reads that pile up. But the reads that pile up are coming from both strands, so you can't tell which strand a read came from. With the introduction of strand-specific libraries, you can actually tell: the reads that pile on top of a gene tell you which strand they came from.

You might ask, why do I need that? Why is it helpful? It's helpful because a lot of times you might not have the gene annotation for your species, or you don't have a reference genome, so you don't really know in which direction your gene is supposed to be transcribed. This is good for novel genes: it tells you the direction, which strand the gene is actually transcribed from. And if there are genes that overlap, with unstranded libraries you can't really tell which gene the reads are coming from, is it this gene or this gene? But with strand-specific libraries you can separate them, you can tell which gene the reads belong to, and that gives better expression estimates for genes that overlap. Yes? Well, for human, that doesn't matter, right? For human it matters less, yes, but for other genomes that might be an issue. Yeah. So you said with the earlier method you couldn't separate the strands, but if you read the legend of the figure, it actually says which strand the reads are coming from. I think that's just the sequencing: when you sequence the fragment, are you sequencing from this side or from this side? It doesn't tell you where the transcription is actually happening for that specific gene, just how you sequenced it. So the color coding for the unstranded plot is just how you're sequencing, but the color coding for the stranded plot is where the transcription is happening. Are you going to talk about how that works? Sorry, I'm not going to talk about that, no, but I can provide you with articles on it. Can you go back to why stranded isn't as important for human? It's because there aren't many overlaps between genes. What about loci that use the same exons in two directions for different gene products? Yeah, I don't know what percentage of the genes actually overlap. I would say so too; a lot of the non-coding RNAs would be confounding. Yeah, and I believe there are other genomes where it's more of an issue. So it's always recommended to use stranded libraries, because that information is useful. The procedure is a bit more complex when you prepare the library, but it's always better, because you can do allele-specific expression and so many other things that you cannot do with unstranded libraries. With unstranded it's not as severe as I made it sound, that's what I meant to say. Because you can always infer allele-specific counts from the BAMs. Yeah, correct.
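If you're handed a finished library or BAM and aren't sure whether it was stranded, one quick check (my addition, not covered in the lecture; the gene-model BED and BAM file names are hypothetical) is RSeQC's infer_experiment.py:

```bash
# Guess library strandedness from an existing alignment (hypothetical files).
infer_experiment.py -r refseq_genes.bed -i sample.bam
# It reports the fraction of reads consistent with each strand orientation;
# a roughly 50/50 split suggests an unstranded library.
```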
Now, when it comes to designing your RNA-seq experiment, it's really crucial to have replicates, and when I say replicates, there are two types. Technical replicates are the same sample, but you're running it on different flow cells, on different lanes within a flow cell, or on a different day; you just want to make sure the data you're generating is reproducible. You also need biological replicates, which are different samples that had the same condition or the same drug, for example; the dispersion, the variation, among those biological replicates will also help you downstream when you're looking for genes that are differentially expressed. And a simple plot, the first plot you can generate, is the correlation between technical replicates, which gives you an idea of how reproducible your data is. You're interested in highly reproducible data, so we're looking for a correlation coefficient of around 0.92 to 0.98; anything that high is good.

Sorry, did you say you want that for the technical or the biological replicates? Technical. But we need both: you need technical replicates and you need biological replicates. From experience, a lot of the time no one does technical replicates; people do a lot of biological replicates, so you get one run per sample. That's because it was costly to run multiple copies of the same sample, but now that sequencing is getting cheaper, it's highly recommended to have technical replicates as well as biological replicates. Sorry, in your opinion, have you actually seen technical replicates that don't agree? I personally haven't, because I haven't worked with a lot of technical replicates, but it's very possible, especially if you run a sample on two different flow cells; there is always variation when you're doing experiments, and from flow cell to flow cell there are a lot of differences as well. So you have to consider that. Sorry, yes. I wanted to ask about that, especially in cancer, where material is limited. Say you're limited to four samples, because the budget, it costs a couple of hundred dollars to prepare a library, and you have about four biological replicates; it's a bit of a challenge to do both biological and technical replicates. Are you sure? Who does it for that price? I said it's a couple of hundred dollars for each library. Who does it for a couple of hundred dollars? The library prep. And that's just the library; there's still the sequencing on top.

I've got a question for you. Let's say I give you three technical replicates; that tells you how good your sequencing is, how much noise there is. So then you take a mean of them in some way, right? Then I give you three biological replicates, and you compare your control to each of the biological replicates. Correct. So you take means of everything and then you compare them. Correct; that's the simplest approach. I will show you later, there are more advanced techniques; they take a similar approach but use different tests. How do you take the mean of three samples? One gives you, let's say, three mutations in a gene, another gives you two, another one. Well, here we're looking at expression, so we're looking at read counts: you're averaging read counts, not mutations. Read counts, yeah.
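As a concrete sketch of that replicate-correlation check (assuming you already have a per-replicate gene-count table, two tab-separated columns of gene and count; the file names are hypothetical), you can compute Pearson's r right at the command line:

```bash
# Pearson correlation between two technical replicates' gene counts.
paste rep1.counts rep2.counts | awk '
  { x = $2; y = $4;                       # counts from replicate 1 and 2
    sx += x; sy += y; sxx += x*x; syy += y*y; sxy += x*y; n++ }
  END { r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy));
        printf "Pearson r = %.3f\n", r }'
# Good technical replicates should land around r = 0.92 to 0.98.
```

In practice you would probably do this in R with a proper scatter plot, but the arithmetic is the same.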
So here is just a list of some of the common goals of RNA sequencing. As I mentioned before, you can get gene expression and differential expression; you can do alternative expression analysis; you can do transcript discovery and annotation; you can do allele-specific expression, mutation discovery, fusion detection, RNA editing. There are so many different things you can do with the data you get. Today's focus is going to be on gene expression and differential expression, and we won't be covering the other topics.

Most RNA-seq workflows consist of the following series of events. You take your sequences, which are in a FASTQ file, a text file that contains all the sequences you obtained from the sequencer, and you do some sort of quality check: you run the sequences through a tool that checks their quality, to make sure there's no bias and that the content of the sequences is okay. One of the most popular tools nowadays is FastQC. All you do is run fastqc, give it the directory where all the FASTQ files are, and it goes through them and generates a quality report per file. That's followed by an alignment: you take those FASTQ reads and give them to an aligner. We'll talk about different aligners in a bit, but the general concept is you take the FASTQ reads, you align them, and you get another file, a SAM or BAM, which contains the same exact sequences you started with, but with more annotation added: where the sequence aligned, whether its pair aligned, what the alignment quality was. All of that information is in the BAM file.

Now, that BAM file is very important, because you can take it and pass it through many different tools and packages. You can run it through tools that do expression and differential expression, tools that find fusions, tools that do variant calling. The output of each of these tools will be different, but mostly it will be a text file, a VCF, or a BED file that summarizes the results. And at the end of the day, you also want to do some sort of summarization or visualization, and I find that R is very helpful there, because there are a lot of packages you can use to generate plots and visualize these data sets.
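To make that workflow concrete, here is a minimal sketch of the QC, alignment, and conversion steps just described; all file names and index paths are hypothetical, and the parameters are illustrative rather than recommendations:

```bash
# 1) Quality-check every FASTQ in a directory with FastQC.
mkdir -p qc_reports
fastqc -o qc_reports/ fastq_dir/*.fastq.gz

# 2) Spliced alignment with TopHat on top of a prebuilt Bowtie index;
#    a known-gene GTF is optional but guides the junction search.
tophat2 -p 8 -G ref/genes.gtf -o tophat_out/ \
    ref/bowtie2_index sample_R1.fastq.gz sample_R2.fastq.gz

# 3) Sort and index the resulting BAM for IGV and downstream tools.
samtools sort -o sample.sorted.bam tophat_out/accepted_hits.bam
samtools index sample.sorted.bam
```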
Jumping to the common questions that get asked when you deal with RNA-seq. The first question is: should I remove duplicates in RNA-seq data? If you're not familiar with duplicates, I'll go over what that means. In DNA, when you're preparing libraries, you enrich for DNA through the polymerase chain reaction: you take the fragments and go through cycles of replication to get more copies that you can sequence. But that process can be biased: some regions of the genome can get amplified more than others, and what ends up happening is that instead of a uniform distribution of overlapping reads, you end up with reads that pile up in one region of the genome. You look at a region and there are thousands of reads piled up with the same exact start and end positions. Those are technical artifacts: they're not true reads coming from the sample itself, they're reads you introduced by PCR. And if there is an error in one of those reads, that error will be amplified by this bias, and you will end up with false positive results.

So in DNA, what they do is collapse the reads. If there are reads with the same exact start point, you assume that start points should be random across the genome; there's no reason why reads should all start at one specific position. So you collapse all those reads into one read, and there are tools that will do that for you; one example is Picard.

Now, when it comes to RNA, can you do the same? You have to consider a few things. Unlike DNA, the start points of the reads are actually not random, because you're looking at transcription, and transcription has specific start sites. That's number one. And if you collapse, that will affect the dynamic range of the expression in your sample. It hasn't actually been proven that PCR introduces artifacts in RNA-seq yet. But what you can do is try it yourself: collapse the reads, look at the expression profiles of your genes before and after collapsing, and see how much it actually changes. And if you're planning on doing that, don't do it on single reads; do it on the pair, so that you make sure the whole fragment, the beginning of the first read and the end of the second read, matches exactly. What would be even better is to compare the sequences and make sure they match exactly before you do the collapsing. Yes? How do you define duplicates? Those are reads that have the same sequence and the same start and end points, the same exact start and end points. And does that depend on anything? Are we talking RNA or DNA? RNA. Well, that's the thing: in DNA, those duplicates are purely PCR artifacts, but in RNA you can't tell whether it's PCR or transcription. That's why you have to be very careful when you're performing the collapsing, because you might actually collapse biologically relevant transcription information instead of PCR artifacts. Is that pretty common? We don't collapse RNA-seq, and we don't have a way to truly assess the PCR artifacts. I'll show you, when we do the QC plots, one way to get a better idea of how much potential duplication there is, and then it's up to you whether to collapse or not. So it's something I should consider for RNA-seq data? Again, that depends on the kind of analysis: if you're doing expression, don't collapse; if you're doing variant calling, then maybe it's a good idea to collapse and then call variants. That's a decision that's up to you.
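If you do want to inspect duplication without throwing anything away, one hedged option (hypothetical file names) is Picard's MarkDuplicates, which flags read pairs with identical outer coordinates but, with REMOVE_DUPLICATES=false, leaves them all in the BAM:

```bash
# Mark (don't remove) duplicates so you can inspect the rate first.
java -jar picard.jar MarkDuplicates \
    INPUT=sample.sorted.bam \
    OUTPUT=sample.markdup.bam \
    METRICS_FILE=sample.dup_metrics.txt \
    REMOVE_DUPLICATES=false
```

For expression work you would typically keep the duplicates; the metrics file at least tells you how severe the potential PCR duplication is.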
Another question that gets asked a lot is: how much library depth is needed for RNA-seq? How many lanes do I need? How many reads do I need? With DNA, it's very straightforward: you can say, if I'm doing a normal sample and a tumor sample, for the normal I want an average of 30X coverage, and for the tumor I want 60X. What I mean by that is, when you say 60X, you're saying I want every base in my genome to be covered by approximately 60 reads, averaged across the genome. With RNA you can't really do that, because the expression of genes varies: highly expressed genes will have more reads than genes that are not expressed. So that's a challenge.

There are many different variables you have to consider when choosing read depth. The first is your analysis plan: what are you interested in doing? Are you doing gene expression? Fusions? Mutation calling? Each of these requires a different read depth, but it's safe to go with a high depth so you can later extract all these different data types without being limited. Another dependency is your tissue type. Also the RNA protocol you used: as I showed you earlier, with total RNA only a fraction of your reads will come from coding regions, so you'll probably need more reads if coding regions are what you're after. The quality of the RNA: if you start with bad quality, if the RIN is not very high, you'll probably need more depth than with high-quality RNA. And sequencing: there are a lot of parameters to consider there as well. What kind of machine are you using? Are you doing single-end or paired-end sequencing? What's the read length? Nowadays, reads on a HiSeq can go up to 150. Are you doing 1 by 50, 2 by 50, or 2 by 100? All of these factors you have to consider, and that can be overwhelming if you're just starting out and someone asks you; there are just too many variables. So maybe the first thing to do is write down all these variables, then go look for a study that has done something similar to yours and see what coverage they used. And again, it's recommended that you do a pilot study: take those parameters and coverages, run one sample, and see if that's enough. In the QC section I'll show you some plots, like splice junction saturation plots, that tell you whether or not you've sequenced enough; we'll talk about those later. If you have enough, you can go back and sequence everything else; if not, you can increase the depth.

Just to give you an idea: on Illumina, each flow cell has eight lanes, and each lane can generate about 400 million reads nowadays. If you're doing expression, roughly, you want about 100 million reads to be able to do expression and everything else. So you can actually pool three to four, sometimes five, samples per lane if you're doing human transcriptome. That's a lot of samples in a lane.
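The pooling arithmetic behind that, using the rough numbers quoted above:

```bash
# Back-of-envelope lane math (approximate figures from the lecture).
reads_per_lane=400000000     # ~400 million reads per lane
reads_per_sample=100000000   # ~100 million reads for expression analysis
echo $(( reads_per_lane / reads_per_sample ))   # prints 4: samples per lane
```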
Sorry, if they're in the same lane, how do you determine which read goes with which sample? Your barcodes? Barcodes, correct. You have barcodes, and you don't have to worry about that: the tool that generates the FASTQs will do the demultiplexing for you, so it generates separate FASTQs per sample. You end up with a sequence file for that specific sample, for that specific barcode, and the barcode will be in the name of the FASTQ file. Yes? Has it ever been shown that there's a lane factor you should consider when analyzing your data? That's why it's good to have technical replicates, as I said: when you're doing the pools, you make sure each sample is distributed across different lanes, because there can be lane effects. It really depends on your flow cell and your run, and that day, and the technician who did it, and the machine, and the chemistry that day; there are a lot of variables. That's why doing technical replicates, running a sample on different lanes or different flow cells, is very helpful: you can then get rid of those technical biases.

What mapping strategy should I use for RNA-seq? This slide is a bit outdated, because it's considering reads that are less than 50 base pairs; most reads nowadays are 75 to 100, and I haven't really seen anything lately that's as short as 50 base pairs. So I recommend a spliced aligner. By spliced, I mean that if you're taking your RNA-seq data and trying to align it to a human genome, you need a splice-aware aligner: something that can take reads from coding regions and align them across the exon-intron structure of the genome. For that, the Bowtie and TopHat combination is recommended, and we'll get into the details of why in a bit.

Now, what if I don't have a reference genome for my species? What do I do? First: have you considered sequencing the genome? If you haven't, there are now a lot of tools that do de novo assembly; one that comes to mind is Trans-ABySS, which is a great tool for de novo transcriptome assembly. We're not covering de novo assembly in this workshop, that's outside its scope, but there is a lot of material and tooling out there for it. So you can still work without a reference genome; that's doable.

So that concludes the first part of today. Any questions regarding the introduction to RNA sequencing? Yes? Yeah, so what you're doing in de novo assembly is looking at your reads and looking at overlaps: you're trying to stitch the reads back together to come up with transcripts or isoforms. You try to stitch them back and find as many paths, or isoforms, as possible that can explain the variation in your data set. You can do that at the alignment level or at the fusion level. Yes? The kits now available for FFPE samples, are they able to detect the same amount of splice variants? Yeah, everything is the same: whether you're doing FFPE or fresh frozen, it should be the same. The quality might be a bit lower for FFPE compared to fresh frozen, so when you're looking at the quality of the RNA, in terms of coverage you might want to sequence more lanes for that specific sample, but in terms of bioinformatics you can run the same exact tools; it doesn't matter. Yes? When you're sequencing DNA, you call variants against the genome, the patient; here you talked a little bit about de novo, which would mean you reconstruct the transcriptome of the patient from the RNA. How do you do that here? Yeah, that will be challenging, because you don't have anything to compare to, and you have to keep in mind the error rate of the assembly itself: you're going to generate a lot of errors while assembling those reads, so there will be a lot of artifacts. It's not going to be as clean as data where you have a reference genome, for sure. Could you go back to the reference genome? Let's say you discover a fusion of two different proteins from two different chromosomes; couldn't you go back to the genome of the patient and see that the fusion is there?
Yes, but in terms of variants, that will be hard: if you're looking at SNVs, single nucleotide variants, that will be a challenge. Large changes may be easier; breakpoints might be a bit easier than small variants. But even there, you will introduce some artifacts when you're trying to assemble; you'll probably have some false positive events in the different data types. So there's no comprehensive map of RNA expression in human cells that you could use as a benchmark? Well, for human you do have a reference; I was talking about species that don't. But not for the RNA expression? No, not an RNA expression map. Well, it depends on so many things: the cells in each tissue have a different expression profile, each disease condition also affects the expression profile, and it even depends on the time when you take the cells; at different times you will probably see different expression. So there's a lot of variability that I believe would be very hard to capture in a database. Yes?

This is about quality, the relationship between the RIN RNA quality and the read quality. I have some data from autopsy brain tissue where the RIN quality is really bad, but the read quality is not bad at all. So, at each level of sequencing you can introduce errors. The error could be in the sample itself; if the sample is good, it could be the library prep; if the library prep is good, it could be the sequencing. There are many different levels where you can introduce artifacts or errors, where something can go wrong. That's why, for each of these levels, it's important to do a QC check. When you measure the RIN, you're checking the RNA quality; when you do the sequencing, you check the FASTQ quality. That tells you, if the RNA was good and then all of a sudden the reads are bad, that something went wrong at the library level, or maybe the sequencing. And based on the reports you get, you try to identify where that error is coming from. Is it a sequencing error? Is it a library prep error? Do you see the same error across all the samples you sequenced? If so, it's probably a technical error that you need to go in and deal with. So you don't blame the RNA quality itself, as long as the fragments are longer than your read size? Yeah; I mean, RIN is just an approximation, it's not a gold standard. There might be something wrong with your RNA that RIN was not able to detect. I'm not saying a RIN of 10 means the sample is great, but you have a higher chance of having good RNA quality. I believe, talking to a lot of technicians, there are other metrics now that they use to assess RNA quality, which are better than RIN; I can't remember the name. It's an estimate of the fragment sizes, the fraction of fragments above 200 base pairs, correct. So I think they look at everything, all the fragments, not just the two ribosomal peaks. The RIN focuses on the two ribosomal subunits and looks at their ratio: the smaller the ratio, the lower the quality of the sample. But I think that other score looks at all of the fragments in the RNA, not just those two, and that provides a better estimate of quality. Yes.
So my concern, the thing I want to ask, is this: unlike genomes, if we do genome sequencing or exome sequencing, it's fairly stable. But RNA quality, and the list of over-expressed and under-expressed genes, is very dependent on the warm ischemia time, when the surgeon clamps the vessel, and on the cold ischemia time from sample to fixation in formalin. Some biobanks and groups in Germany showed, for endoscopic resections of colonic tumors, RNA-seq data taken at time points after surgical resection of the biobank specimen: one third to one fourth of the genes came out over-expressed or under-expressed as a consequence of nothing but prolonging that time. Do you know of any resources or experiments where people have shown that, for example, primary tumor RNA-seq and xenograft RNA-seq remain the same? I feel that we are looking at a signature which has changed, which is not really what's there in vivo, and we're trying to devise diagnostic and therapeutic biomarkers based on that. Yep. But it's a snapshot which, I feel, is not really representative. Correct. So I don't know of any specific studies that have done that, but you're right: the environment really affects the expression profile, and any stress really fluctuates the expression profile. And that's why replicates are very important, because those samples are taken at different points, and if they're going through different environments and different stress points and you average, you're looking at consistency, at things that happen across all the replicates even though they come from different environments, so there's a higher chance that the signal is true. There are also times when you need to adjust for these differences before you run an analysis. But besides replicates, I don't know how you can actually control for all these variables, because there are a lot of them. You might not be able to control for everything, but with replicates you try your best to control for as much as you can. I don't know beyond that, thanks, but I will keep it for the panel discussion. Okay, sounds good. Sorry, yes. Yeah, actually it has been done. Okay. And it's surprisingly similar. Similar, okay. Between the xenografts and the patients themselves, surprisingly, few of the replicates showed any difference between this and that. Oh, I'd like to check that. Do you know the paper, by any chance? Anyway, we can talk about it later; maybe we'll share the article once we find it. Okay, sounds good.

Any other questions before we move on to alignments? Yeah? Okay. So, in this part I'll go over some of the challenges that RNA-seq aligners face, some of the tools used for RNA-seq alignment, and the file formats, for those of you who are not familiar with them or haven't touched RNA-seq or even DNA alignments before. Then we'll end with some QC metrics, things you should look at in your data sets, and I'll provide some plots that show you how to do that.

One of the bioinformatic challenges when it comes to RNA-seq alignment is the sheer abundance of reads. As I mentioned, as the technology advances you're getting more and more reads; for each lane of sequencing, we're talking 350 to 400 million reads that you need to align to your genome.
So in order to do proper RNA-seq analysis on human samples, you're probably going to need some sort of compute node or cluster that can parallelize the process, depending on how many samples you're running; it would be hard to run things on your local machine. Today we're running things on Amazon, since that's another option you have, but today is not a good example of what a proper analysis is like, because we've subset all the sequences down to certain regions and certain chromosomes. When you're doing a real analysis, you're doing it on the whole genome, so it will take days for things to finish. Just as an example, TopHat, which we'll be using for alignment, requires 8 to 16 gigabytes of RAM, with 16 gigabytes preferred, so you will need some computational resources. And not just that: the files you're generating are big files, so those BAMs, depending on how many samples you're dealing with, will have a big footprint, and you'll need a lot of space to store them. Yes? So that's what we do: we run them in parallel, each sample on its own node. The whole process takes about two days, and I'll get into the details of how long each step takes when we go over the alignment.

Another challenge is spliced versus unspliced alignment. You're taking reads that come from coding regions, and a read can actually span two exons, so you're trying to map it to a genome that has exons and introns, and the intron doesn't exist in your read, because it's RNA; there are no introns in RNA. So you're going to have to deal with large intronic gaps in the reference genome you're aligning to, and I'll show you an example and walk through it step by step.

And most of the time you don't just run your analysis once and say, I'm done, I'm never touching this again, because tools update so frequently. The gene annotation itself updates frequently; the genome changes as well; the boundaries of genes change. Most of the time, if you're doing a project, you find yourself going back, updating your reference genome, updating the gene annotation, and rerunning everything, just to keep it all up to date, discover new genes and new transcripts, and be able to do proper analysis and comparisons.
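As one illustrative way to fan samples out in parallel (my assumption, not the workshop's actual cluster setup; the sample list and paths are hypothetical), you could use GNU parallel, sizing the job count to your node's RAM, since each TopHat job wants on the order of 16 GB:

```bash
# Run one TopHat job per sample, four at a time.
# samples.txt: one sample name per line.
parallel -j 4 \
  'tophat2 -p 8 -o tophat_{}/ ref/bowtie2_index {}_R1.fastq.gz {}_R2.fastq.gz' \
  :::: samples.txt
```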
Aligners are actually split into three categories: there is de novo assembly, there is alignment to the transcriptome, and there is alignment to the reference genome. We'll talk about each of these. As I mentioned before, de novo assembly is an option for people who do not have a sequenced reference genome, and it can also be an option if you do have a reference but there is so much polymorphism in your data that aligning to the genome would introduce errors or make mapping those polymorphic reads very difficult; in that case it might be better to assemble the transcripts yourself. You can also align to the transcriptome: that's a reference containing all the known isoforms, and you take your RNA data and align it to all the possible combinations of isoforms. But the most common method is alignment to the reference genome, and each of these alignment methods comes with its own packages and tools that you can use.

As you can see, this slide is a bit out of date, 2013, but I just wanted to give you an idea of how many tools are being developed for RNA and DNA, how many aligners there are, and what we picked for today: TopHat. There are a few reasons why I picked TopHat. TopHat was developed back in 2009, and since then a lot of other tools have appeared; a lot of you, I'm sure, have heard of STAR, and how great and fast STAR is. I agree STAR is a lot faster, but as a beginner to RNA-seq I find it useful to start with a tool that's well established and user friendly and that has a big community. Community is essential, because it means most of the questions you'll have have probably been asked and answered before; you can just go online and find that someone has already run into your issue. It also means the developers have corrected the errors that other people have troubleshot, so you don't need to go through a frustrating troubleshooting process when you're just starting with a new technology; it's very unlikely you'll run into that with TopHat. And once you learn how to run one tool, it's extremely easy to run another: you just use another command, same files, the workflow is the same, a few parameters change, and off you go.

TopHat is a splice-aware tool: it aligns against the reference genome, which consists of introns and exons, so it has to deal with splice junctions, and it tries to detect splice junctions from the alignments. So what's the idea behind Bowtie and TopHat? TopHat uses Bowtie as its backbone aligner: Bowtie does all the alignments, but the problem with Bowtie is that it can't deal with large gaps, it can't align reads across very large gaps, and here the large gaps are intronic regions. TopHat works as a wrapper around Bowtie: it uses Bowtie to do the alignments, then takes that information and does the splice junction detection and transcript assembly, and the two work together. I'll walk through how the alignment actually happens.

Let's say we have two reads, read X and read Y; read X spans two exons and read Y falls within one exon, and the two reads are the same length. At the bottom we have the reference genome: exon one, an intronic region, and then exon two. The first thing TopHat does is employ Bowtie to align all the reads. It takes read Y, for example, and it aligns perfectly fine, because the read falls within that exon, so TopHat puts it in the aligned bin. Then it tries to align read X to the reference, but only part of the read aligns, the part that belongs to exon two; the other part does not align, because the intronic region is in the way.
So it takes that read and puts it in the unaligned bin, and it goes through all the reads like that: aligned versus unaligned. Then it takes the unaligned bin and goes through all the reads in it; it takes each read, splits it into smaller chunks, smaller segments, and redoes the alignment on each segment, using Bowtie again to align fragments X1, X2, and X3 to the genome. In this case, X1 aligns fine, X2 does not align, because it still crosses the splice junction boundary, and X3 aligns perfectly to exon two. So what is TopHat doing? It's taking all that alignment information from Bowtie and trying to interpret and predict approximately where the splice junction could be. Once it has done that, it builds a library of all the possible splice junctions, and then it takes your reads and goes through the alignment process again using that newly detected splice library. You can also provide it with any known splice junctions, so that it doesn't have to spend a lot of time searching from scratch: it can use the known junctions as a guide and also look for novel junctions, combining the two. Is that clear? Yeah?

It looks like it uses a backtracking recursive algorithm: it keeps going forward, and when it fails it comes one step back. Yeah, I don't know if it does exactly that, but you can think of it that way. So, let me just check: basically it first tries to align things; if they align, great; if they don't, it splits them into smaller bits and tries to align those. And the bit about how it actually looks, is it looking for predicted splice sites when it's splitting them, or does it do that afterwards? So the splitting happens purely size-wise: it takes the first 25 bases, then 25 to 50, 50 to 75, 75 to 100. It's not splitting intelligently, just according to size, and you can actually vary the segment size. I think the default is 25 bases, but if you have short reads you might want to change that size. You don't want to make it too small, because then when you try to align, the segments will not be as unique; you want to keep them long enough to be unique. So it's taking those small chunks and aligning them, and the chunks that don't align give you an indication that there is a splice junction there. In addition, it's also using coverage information: it looks at the coverage of the reads when it aligns them back across those splice junctions, to check it has done a proper job, because the coverage should be consistent when it assembles the isoform or transcript back together. There is also a coverage-search option you can use to get a better estimate of the junctions, but they usually recommend you don't run it, because it takes a very long time.
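Just as a toy illustration of that size-based splitting (the sequence here is made up), cutting a 100-base read into 25-base segments looks like this:

```bash
# Split a 100-base read into four 25-base segments, as TopHat does by default.
r=$(printf 'ACGT%.0s' {1..25})   # a fake 100-base read
echo "$r" | fold -w 25           # four 25-mers, each aligned independently
# Segments that fail to align point at a putative splice junction.
```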
Yes? So it looks like most of what comes out of Bowtie unaligned is reads that span a splice junction. Sorry, I missed the first part. Most of whatever Bowtie cannot align. Yes. Is spanning a splice junction. Yes, right. So, yeah, I don't know if you went into the details of how TopHat does it, but could it take all those unaligned reads and map them against mature RNAs, which have already been spliced, so it wouldn't need the splitting? Or is there no reference for that? I don't think so, because it takes everything and aligns it to the reference genome you provided; it's not aligning to a transcriptomic reference. But if there were a way to use the mature, spliced RNAs, you could just take all those unaligned sequences and map them to the actual transcripts, because the junctions would be present there. Yeah, it doesn't do that; it always aligns back to the reference genome. Yes?

So you said the read depth varies based on the expression level of the gene, right? Yeah. But the read length, you just arbitrarily set that in your experimental design? So the read length, a lot of the time, is 100 by default. It depends; you can do less if you want to cut down on cost, but I wouldn't go below 50, 50 is way too short; I would do at least 75 bases. That's decided when you're sequencing, at an earlier stage of the analysis, before you even do the library prep. You say, I'm going to sequence 2 by 75 or 2 by 100, and you tell the lab, and they construct the libraries accordingly. The reason I say you have to tell the lab is that they have to make sure the fragments they design are of a length that will support that read length; if the fragments are too short, you won't be able to do reads of a certain length. Yes?

I just wanted to ask: you said you can choose what size you want TopHat to break those reads into, the k-mers. Is it significantly more computationally intensive the smaller you go? It probably is, because Bowtie will try to align them and find a lot of non-uniquely mapped sequences; they'll map to multiple locations, and it then has to deal with that. So it is computationally intensive, and you'll probably also end up with a lot of false splice junctions, because if one chunk maps to multiple locations, you end up with random splice junctions everywhere. So it's recommended you stick to at least 25, I guess. And Bowtie also takes care of mismatches: when it does the alignment, the default, I believe, is two mismatches at the beginning of the read, and I think it allows more mismatches towards the end, just because the quality of the read drops towards the end. That's another thing you can alter: you can reduce the number of mismatches allowed or increase it; that's up to you. Yes? Once it splits a read and finds the segment with the junction, does it split that segment again into another set of k-mers, or does it stop there and say, okay, this segment contains the splice junction? I believe it does it once.

Now, while all of this is happening, should you allow multiply mapped reads? When you're aligning DNA, for example, and a read maps to multiple locations, a lot of tools will report all the potential mappings for that read: if a read has 10 potential mappings, they'll list all of them and tell you the quality score of each one. And it's up to you. You can pick the highest quality, because that's probably the proper one, but a lot of times you get two or three mappings that share the same exact highest score, and those can represent homologous sequences in the genome: the read aligns perfectly to multiple locations. When we're dealing with DNA, some aligners penalize multiple alignments and reduce the mapping quality of the read just because it aligns to multiple places, because you want to reduce the error rate and the false positive rate downstream. With RNA, again, it depends on what you want to do afterwards. If you're doing expression analysis, you should allow multiply mapped reads, because they probably come from homologous genes, and you want reads to map to multiple places to get proper counts, proper abundances. If you're calling variants, you might consider not using multiply mapped reads, picking just one alignment and going forward with that. For fusions, same thing: you might want to keep the multiple mappings. Yes? In terms of repetitive sequences, would that be less of a problem with RNA data? Yeah, I don't know; gene-wise, I'm trying to think whether there are more repetitive sequences than in the DNA. I'm not sure, actually. Second question?
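One common heuristic for the variant-calling case (an assumption on my part, not something prescribed in the lecture) is to filter on mapping quality, since multiply mapped reads get low MAPQ, while keeping the original BAM untouched for expression counts:

```bash
# Keep only alignments with MAPQ >= 10, dropping most multi-mapped reads.
samtools view -b -q 10 sample.sorted.bam > sample.uniq.bam
```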
As I mentioned before, after you do the alignment you get a SAM or BAM file. A SAM file, for those who don't know, is the sequence alignment map format: it contains the same sequences you have in the FASTQ, but with extra information, like where each sequence aligned and the quality of the alignment. BAM is just the binary version of that file, and it's condensed; it reduces the footprint. It's highly recommended that you convert all your SAMs to BAMs. How do you do that? There are a variety of tools; SAMtools, for example, is an excellent tool for the SAM-to-BAM conversion, and once you've converted, you can just delete the SAM file; you don't need it.

If you haven't seen a SAM or BAM file, this is what it looks like. It's split into two sections: a header and a body. The header at the top has information about the run, the workflow, the lane, where you've done your analysis: the command you used to run TopHat, the parameters, the TopHat version, the sample name. All the information you need to be able to go back and rerun that sample. In the body, you get information about the sequences and where they aligned. There is also useful information about the paired read: where the pair aligned and how far it is from the original read, read one to read two. You can use that information to construct an insert size distribution: you look at the distribution of distances between read pairs, and that gives you an idea of the fragment size you started with. If you didn't know that beforehand, you can construct that plot, and I'll show you how later; it gives you an idea of how big your fragments are.
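The SAM-to-BAM conversion mentioned above is a one-liner with SAMtools (hypothetical file names):

```bash
# Convert SAM to compressed BAM, then delete the bulky SAM.
samtools view -bS sample.sam > sample.bam
rm sample.sam
```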
Also in the body, for each read, there is the CIGAR string. What that does is summarize the read's alignment. It will tell you, for this read, there are five bases that matched, then 50 bases that did not align, which represent the intronic region, and then another 10 bases that mapped. So you can think of it as exon, intron, exon. That's actually what other tools use downstream: they look at the CIGAR string and try to interpret where the splice junction boundaries are for each read. Now, BAM files are huge. You can view them using IGV, but if you're trying to look for a gene, for example, it will be very hard to just grep through or open the SAM file and look for that gene, because the file is so big your computer will probably crash. To do that, what you need to do is generate a text file called a BED file, where you put in the gene that you're interested in. The columns in a BED file are chromosome, start, end, and a name, like the gene name. Then there are tools where you pass the BED file and the BAM file, and what they do is go through the BAM file and pull out the regions you listed in the BED file. So you can generate a new, smaller subset BAM file that you can distribute in the lab, use for viewing, use for counting, use to look at coverage, and so on. So the BED file is very helpful. It doesn't need to be a gene; it could also be a list of positions that you're interested in if you're looking at variants and so on. And you can do that for RNA-seq or for DNA, any kind of BAM and BED combination. What tools can you use when you're trying to subset a BAM file? There's samtools, there's bamtools, there's Picard, there's bedtools. The two that are used quite a lot are samtools and bedtools. bedtools is a great way to calculate coverage: it generates histograms, it subsets files; it's a very useful tool. And samtools is great because you can use it to, again, subset chromosomes, view things, pull reads, and so on. In terms of sorting your alignments: now that we've aligned the file, how do we sort it? There are two methods of sorting. Generally the BAM files are sorted by position, and that makes retrieval of the reads a lot faster for tools that just need to pull reads based on position. But there are certain tools that require the BAM file to be sorted by read name, which enables them to easily assess the paired reads and look at the insert size, the distance between the pairs. Especially in fusion detection, where it tries to find whether one read aligns to one chromosome and the other read aligns to another chromosome, it's easier if they're sorted by read name. So check the manual and see how the tool wants the file to be sorted.
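As a sketch of those subsetting and sorting operations (the coordinates and file names here are made-up placeholders):

    # a BED file is tab-separated: chromosome, start, end, name
    printf 'chr1\t100000\t200000\tMY_GENE\n' > genes.bed
    # keep only the reads overlapping the BED regions
    samtools view -b -L genes.bed -o subset.bam aln.bam
    # sort by position (the usual default), or by read name for tools that need it
    samtools sort -o aln.sorted.bam aln.bam
    samtools sort -n -o aln.namesorted.bam aln.bam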
IGV, as I've mentioned, is a great way to visualize your alignments. If you haven't seen IGV, I'll walk you through what it looks like; it's a tool that you can download. Have you guys used IGV? Oh, you covered it this week, excellent. So I don't really need to go through the details, but at the top you have the ideogram, and at the bottom you have the gene annotation, where the exons and the introns are, and the arrow indicates the direction of transcription, which strand of the gene is being transcribed. The reason I mention that is that you can color-code the reads to tell you what strand they're actually coming from. Because this was a strand-specific library, all the reads are colored one color, which means they all agree that they're coming from one strand. And that's the first thing I do: I look at the exon islands, I look at the coverage, and I check whether there's a match between the read islands, the gene annotation, and the coverage. There seems to be concordance between the islands, the annotation at the bottom, and the coverage track right there. So you're seeing reads pile up where they're supposed to. There's also a list of other tools here that you can use to visualize. Yes? What is coverage? So that's the count: what it tells you is the total number of reads that span that specific position. So it's not an artifact that it's pulling up all these reads? No. This is a chunk of a read; let me use the pointer, can you guys see this? Yep. So what it's doing is summing up all the reads that overlap this specific region, and it's giving you a count of all the reads that mapped there. So this is a chunk of the read, then there's an intronic region, a gap where it couldn't align, and then this is another chunk of the read, another splice. So IGV is using that CIGAR string I talked about, where it takes the read and says this many bases aligned, this many bases were a gap, this many bases aligned, and so on. Yes? So the darker lines in blue below are the exons, right? Yes, those are known annotations that you load, and you load that independently of your data, because you know where the genes are. So you have a lot of pileup on sort of two exons over here, and then very little on the other exons; is that telling you about splicing? Sorry, which exons, these ones? The two big pileups. Yeah, this and this. Yep. So that's dependent on the size of the exon: the larger the exon is, the more reads will cover it. So it's not telling you anything about splicing? Well, it is. The splicing, these light blue lines, those are the splice sites. What it's telling you is that you want them to match: you want the exon and intron regions to match, because this is annotated and this is your data. Your data is the observed, and this is the expected, and you see a match between the observed and the expected. You're seeing reads where you have islands, and that's expected, and you're not seeing reads where you have introns, which is also expected. Sure, but the amount of coverage, I guess it's telling you about expression levels, right? Telling you that the small exons are expressed at lower levels and the ones with lots of pileup at higher levels? No. It just means there's a smaller number of reads, just because of the size. We'll talk about that in expression; you have to keep it in mind. In RNA-seq, there's an assumption that the abundance is related to the number of fragments that you observed. But large genes will have more fragments.
Smaller genes will have fewer fragments. So you have to keep that in mind when you're assessing. So here, it's hard to just read off expression: we're looking at coverage, and it gives you a rough idea of what the coverage or the expression is like, but you can't quantify it by just looking at each exon. You have to normalize. If you look at the extreme left-hand side of the coverage track, wouldn't the height of that leftmost peak give you information about the expression level? No, because again, it depends on the size of the exon. So if the extreme left were a large exon, you'd probably have more reads. And you also have to keep in mind that this view can tell you about biases as well. If it's a long gene, there could be a three prime bias; it could tell you about degradation. If the RNA is degraded, and in this case I don't know if you can tell, you will see that one end of the gene has a lot fewer reads than the other end, provided the exon sizes are similar. In this case it's a bit tricky, because the last exon is extremely small, so you expect a very small number of reads spanning it. But if that exon were as large as the first exon at the three prime end, then you should expect the same level of coverage, assuming there is no bias. So the two things you look at: you look at the splices, whether the observed matches the expected, which here is good, and you look mainly at biases. You're not really trying to come up with an expression value for a specific gene just by looking at it; that's hard. But it tells you, okay, how did my reads distribute across all the exons? Is it fairly equal, or is there some sort of bias? Yes? How often do you see coverage in the intronic regions or the intergenic regions? There should not be any. So if you're seeing some reads there, that's something to look at. Could it be contamination? It could be that, but it also could be that the splicing was not done correctly, like there are some false positive splice junctions. It could also be that there are novel splice junctions, because the track at the bottom shows known junctions, but you could discover new junctions. I mean, there isn't enough evidence here; there are very few reads, so these are probably errors. If it were actually a novel junction, it would be a lot clearer: you'd see a big island of reads piling up in a region where there is no annotated exon. That makes you think, okay, wait, is there another isoform that hasn't been discovered before, and you'd go back and investigate, and so on. Is there a percentage you can tolerate, like a set threshold? Yeah, I don't have a threshold. Usually we don't do these things this way, because you're not going to go through 20,000 genes per sample in IGV; it's impossible. IGV is just for when there's a gene of interest in your study; it's a nice way to check and make sure that you've done things properly. It's highly recommended that you do a proper QC that looks at all the genes, the whole thing, and we'll get into that in a second, okay? Okay, so alignment QC assessment. I'm going to walk you through different ways that you can assess it. Now that we've aligned the reads, the amount of information you can get about the RNA is a lot richer.
So with the RIN, it was very simple: just one number that tells you if it's good RNA or bad RNA. But here you get a lot more information, now that things have aligned to the chromosomes: you know where the introns are, where the exons are. We're getting closer to expression estimation, but before we do expression, we want to make sure the alignment was done properly. So this is a list of parameters. All of these plots I've generated with a tool called RSeQC, and I think it's listed in your notes. It's an excellent tool. There are a lot of RNA-seq quality tools available out there, but this seems to be one that generates a very good, very comprehensive quality report, so I highly recommend it. But feel free to generate your own report if you want; you can write your own script and look at things your way. So, bias. We just mentioned that we can use IGV to visualize bias, but I said it's not computationally efficient. What you want is a tool that goes through all the genes, all the transcripts, and then comes up with a way to plot all of them and tell you if there is any bias. The way it does this, keeping in mind that the transcripts all have different lengths: to get around this issue, it takes each transcript and splits it into a hundred bins, or a hundred quantiles, and for each quantile it looks at the coverage in that quantile for that specific transcript. Then it generates this plot, where the quantiles run from the five prime end to the three prime end of the transcript along the x-axis, and the actual coverage is on the y-axis. And here you're seeing two groups of samples. One group of samples has consistent coverage across the transcript. The other group seems to have very low coverage at the five prime end of the transcripts and very high coverage at the three prime end. One possible cause of this is the way that you selected your RNA when you were making the library. This could have been a poly-A selected library, and what happened is that it enriched for the three prime ends, because that's where the poly-A tails are located. It could also indicate some degradation in your RNA, because the five prime end is usually the one that gets degraded first, and it's possible that end was degraded while the three prime end is intact. And that's why you're seeing these differences in coverage between the two ends. Now, when you're seeing this, it's extremely important that you detect it, that you find out why it happened by going back to the library construction level, and that you correct it. Correcting for it is the most important step, because if you don't, this error will carry through to expression and will affect your expression estimates. How? If you take these samples right here and run them through expression software, the short genes will have overestimated expression values, because the coverage is concentrated at the three prime end, and the very long genes will have underestimated expression values, because most of the five prime end of the transcript has no coverage. So you end up with two pools, over-expressed genes and under-expressed genes, and that's not good at all. So there are various ways and tools out there that can help you.
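The five prime to three prime coverage plot I just described is the kind RSeQC generates; as a sketch, assuming you have a gene model in BED format and a sorted, indexed BAM (the names are hypothetical):

    # per-transcript coverage from the 5' end to the 3' end, across 100 bins
    # the BAM should be sorted and indexed first (samtools index)
    geneBody_coverage.py -r gene_model.bed -i sample.sorted.bam -o sample_qc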
If you can't re-sequence, if you can't redo the prep, there are tools that will correct for such biases across the genome. It's highly recommended, though, that you redo the prep and get a better library, because you don't want even the corrections biasing your results at the end of the day. Another thing that you can look at is the nucleotide content distribution. In this plot, the x-axis is the position within the read; here I'm looking at very short reads, 35 bases long. On the y-axis is the distribution, the frequency, of each of the bases A, C, G, T. Now, setting GC content bias aside, you expect that if you look at all the reads in your library, you should have an equal distribution of the four bases: each base should have a frequency around 25% at each position. What ends up happening is that when we started plotting these, we saw some funky things happening in the first 10 bases of the reads. When we looked back and investigated, it turns out there are random primers that Illumina uses, the random primers used to reverse transcribe the RNA fragments into double-stranded cDNA, and they turn out to be not so random after all. They actually have a preference for certain bases when they bind, and that causes these patterns. There was a paper published on this, and they also suggested a method of correcting for it. But what we do is usually trim the first 10 bases of our reads, because if you keep them, we've noticed that the quality of the alignment gets reduced because of these base biases in your reads. So we just trim the first 10 bases and then continue with our alignment. Sorry, why? Why do we remove the first 10 bases? Why do we remove them? Because we expect an equal distribution of bases, and there seems to be a preference for certain bases when the fragments are being primed and enriched. We want a fair distribution of bases in those reads; we don't want a preference towards certain bases over others. That's why. But the base calls themselves aren't wrong? Yes, the calls are not wrong, but there's a preference, so the number of reads that you get might not be fairly, equally distributed: you're potentially getting more reads from certain areas than others. And the difference, I mean, it wasn't that huge, but it makes the alignment actually go faster, because it doesn't have to deal with these biases or patterns. Are there methods to correct for it rather than trim? They do have methods of correcting for it, but I haven't actually used those; I just trim the first 10 bases and then go ahead with the alignment. Yes? And is it the same across all the samples? Yes, I believe so; as long as they're Illumina, you will see that pattern. Would this plot change if we had an organism with high GC content? That's a good question, I don't know. I don't know how that would affect it. Question, yes? If you trim the first 10 bases, do you just never see the beginnings of genes? No. Those reads are evenly distributed across the genome; they're not localized to a certain area. All the reads overlap across the whole genome, and we're taking all those reads and just trimming their first few bases, so the overlap will still exist and we'll still get full coverage of the whole genome. It will not affect things: it's a position within the read, not within the gene. Oh, okay. Yes, the read, not the gene. Okay.
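One way to do the trimming, as a sketch, is with seqtk; any read trimmer will do, and the file names here are made up:

    # drop the first 10 bases of every read before alignment
    seqtk trimfq -b 10 sample_R1.fastq > sample_R1.trimmed.fastq
    seqtk trimfq -b 10 sample_R2.fastq > sample_R2.trimmed.fastq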
Alignment QC: another thing you can do is look at the quality distribution. The Phred score, if you haven't heard of Phred, is a quality score that's widely used to characterize the quality of the base that you're trying to call. The way that the score is calculated is Q = -10 × log10(P), where P represents the probability that the base you called is wrong. So if you see a Phred score of 30, it means there's a one in a thousand chance that the base you've called is not correct. And the Phred score is how quality is reported in your FASTQ file and in your BAM file; that's the standardized quality score that's used. So what we can do is look at the position within the read, from five prime to three prime, and then look at the Phred quality score distribution. Each box plot represents, for that position, the distribution of quality scores across all the reads. You want to make sure that the overall distribution is above Q30; that's usually the standard, to have base scores higher than Q30. You do see a decline from the beginning of the read to the end of the read, and that's just the sequencing-by-synthesis technique: all of them have that issue, where the certainty of the base calls decreases as you reach the end of the read. Remember we talked about PCR artifacts, in DNA and in RNA. This is one way you can visualize potential PCR artifacts, or duplicates. This plot can be a bit confusing, so I understand if you can't follow. On the x-axis is the occurrence of the read: how often does the read happen? Do you see it once, two duplicates, three duplicates, four duplicates, and so on? And on the y-axis is the number of reads that have that occurrence. Ideally you want curves that drop off steeply, because what that says is that a very high number of reads occur at very low rates, between one and ten times, while very few reads occur many times with huge pileups. And this is done both sequence-based and mapping-based: they've compared the exact sequence of the pileup and also the mapping start and end points to come up with this estimate. Now, if you see something like this, up here, which tells you there's a very high number of reads with a very high duplication rate, then you might go back and investigate why that happened, consider collapsing the duplicates, or check your PCR method to make sure that it hasn't introduced a bias. This is one way you can check depth. Remember how I said you can run a pilot study if you don't want to sequence all the samples at once? From the pilot study, you can take the BAM file that you've aligned and try to figure out whether you've reached enough coverage, whether you have sequenced deeply enough. The way this tool does it is that it takes the BAM file and then samples 10% of it, 20%, 30%, 40%, 50%, up to 100%. And at each level, it will look at all the splice junctions, known and novel, and tell you the number of junctions that it detected from 10% of the data, 20% of the data, and so on.
And what you're interested in is the point where, no matter how many reads you add, the number of splice junctions you're getting has saturated. If you add more reads and you're not getting any more splice junctions, there's no point in sequencing more reads, because you've reached that saturation level. Now, you look at the saturation in two curves: the known junctions and the novel junctions. The known junctions, as you can see here, saturate a lot earlier than the novel ones, and that's because a lot of the novel junctions are actually false positives, so the novel curve keeps climbing. So when you're trying to decide on a threshold, keep that in mind: look at the known junctions first, then consider the novel ones, remembering that there are a lot of false positives there. So what this is telling you is, okay, it saturates at this point, so I don't really need to do 500 million reads; I can do 20% of that, only 100 million reads, and that should be sufficient. Then you go back, set that new threshold for coverage, and run it on all your samples. Now, you can apply this same technique to genes instead of splice junctions: you sample at 10%, 20%, and you look at the expression, at the number of genes that are newly introduced at each level of sampling. As I'm increasing the depth, I'm seeing new genes being introduced; at what point do those saturate, so that I'm not seeing any new genes being expressed? And you can maybe combine that information with the splice junctions to come up with a coverage threshold that you can use. Base distribution. Here we're looking at all the bases that aligned and where they lie within the transcriptome. Are they within coding regions? Are they within intronic regions? Are they within a UTR? Are they intergenic bases, and so on? These are nice because they actually show differences in your library technique. If you've selected total RNA with ribodepletion, then you'll probably see less coding in the pie chart and a lot of non-coding. But if you've done mRNA selection or enrichment, then you'll see a lot more coding bases versus non-coding bases. So it's another way to confirm that your library prep went okay. The reason for this slide: I just wanted to highlight a few terms that are used interchangeably in the community, and it can get a bit confusing. When I talk about the fragment, I'm talking about the fragment plus the adapters. So this is the cDNA fragment that we're trying to sequence: the RNA fragment that we converted to cDNA, and then we're trying to sequence this piece right here. If we do single-end sequencing, we start from this end right here and, if we're doing 100 bases, we read 100 bases this way. If we're doing paired-end, then we read 100 this way and then 100 from the other end, two by 100. The gap in the middle is usually not sequenced, and that gap will vary depending on the size of the fragment that you select and the read length that you're doing. Ideally you want to select a read length that matches the fragment size, so that you reduce the unknown gap distance, but you don't want to go over it.
You don't want reads that will overlap, because as the fragments get shorter and shorter, if you keep the read length the same, read one and read two will overlap, and you don't want that. It's a waste of resources, because you end up sequencing the same base twice, and it isn't correct to count them as two calls, because they're actually coming from the same fragment, so they're classified as one call. Then you'd have to worry about that when you're doing expression and variant calling, and you don't want to worry about that. Now, the three terms that are used interchangeably: fragment size, insert size, and inner distance, or inner mate distance. When we talk about the fragment, I'm talking about the cDNA fragment with the adapters. When we talk about the insert size, I'm talking about the cDNA fragment itself. And the inner mate distance, or inner distance, is the gap between the two reads. One of the parameters in TopHat, at least it used to be there, asks you to estimate the inner distance: based on your library fragment size and your read length, can you tell me what the inner distance is? They used that as a way to judge how far apart the pairs should be, and if a pair is farther apart than a specific distance, then maybe it should be classified as a fusion, or as a chimeric region, and so on. But now I don't think you need to provide that: it picks the first batch of reads and learns all this information from the reads themselves. Here I'm just showing you, if you're trying to call variants: if there's a specific gene that you're interested in and you want to look at variants in a very inefficient way, you open IGV, you go through the gene, and you look at a position. Here these are the RNA reads, and this is the coverage, for example, and the reference genome says it's a C, and then the alternate bases are here, highlighted as T. So it's telling you that there is a mutation at this site. To calculate the variant allele frequency, what you can do is simply sum up all the alternate bases and divide by the total number of bases, and that gives you the frequency of that mutation at that site. It's not efficient at all. Instead you can use samtools mpileup, which will go through every single base in the genome and do exactly this: it will look at the reference and the alternates, and it will give you the number of alternate bases over the total number of bases. It's a huge file, because you're going through billions and billions of bases; with RNA it will actually be less, because you're only looking at the transcribed regions.
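As a sketch of that mpileup approach, restricting it to a list of positions so the output stays manageable (the file names are hypothetical):

    # one row per covered position: chrom, pos, ref base, depth,
    # the read bases (reference matches and alternates), and base qualities
    samtools mpileup -f genome.fa -l positions.bed aln.sorted.bam > sample.pileup

From the read-bases column you can count the alternate bases over the total depth to get the variant allele frequency at each site.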
So that concludes part two. Any questions regarding alignment, QC, visualization? Excellent, I think we're doing okay on time. Sorry, yes? What do you use for the alignment QC? RSeQC. Is that the only one you'd recommend? Well, that's the only one I have used. Well, I've used multiple, but that was the one that provided me with all the plots that I needed. With the other ones, I found there'd be one plot that's useful, and then I had to combine it with another plot from another tool. So it's recommended that you run multiple tools, because each tool has its own way of assessing things and its own plots. Ideally you want to come up with your own QC workflow, because that depends on your own project and your own interests. But just to make it easy, you can use RSeQC. Okay. So now we move to expression. Now that we have the sequences of the RNA and we've aligned them, all we need to do is perform expression estimation and differential expression. The learning objectives of this part of the module are to look at expression estimation, to look at some of the normalization methods used by Cufflinks, then to jump into the differential expression methods, and finally to look at some downstream analyses that you can do. I'm not going to go into all the details, because you could take a whole course on expression and differential expression alone; I find it really hard to condense all that information into one session. So I've picked Cufflinks to talk about the method and how it works, and I'm not even going into exactly how it works; I'm just giving you a background on the statistical model that's being used, how you can run it, and what you can do with the data. There are also a lot of papers I can point you to where they look at exactly that: the various tools, how they work, benchmarking, and we'll talk about that in a bit. But just keep that in mind. Back to IGV. Some people were saying, okay, well, I can look at the plot and assess expression from the IGV view. Well, I said it's hard, really, because it depends on the size of the exons and the number of reads that span them. One thing, again, that you can do is assess bias in expression: if you see exons that have very high coverage at one end versus the other end, that's one sign. Ideally, if there is actually differential expression between two samples, and you load them, sample one here, sample two here, and the known gene annotation here, then if there is down-regulation, you should see the coverage reduced, significantly different between the two samples. That's if you eliminate bias first. Now, one way to normalize for the differences, yes? Where do you see the down-regulation here? Oh, so remember the expression peaks, or, sorry, I mean the coverage peaks? If you compare the coverage peaks for that specific exon between the two samples, you see a reduction. But you have to make sure that there's no bias first, and correct for any bias that's present, before you do such a comparison. And again, this is a coverage comparison at the exon level, not an overall expression comparison; it's just to give you an idea. So this is what you need to do to be able to make the comparison I just mentioned. One normalization method, the one from Cufflinks, is called FPKM. You might also have heard of RPKM, which is another version. RPKM stands for reads per kilobase of transcript per million mapped reads, and we'll get into what that means. FPKM stands for fragments per kilobase of transcript per million mapped reads. The difference between them is that RPKM counts single reads, while FPKM counts the paired reads, the fragment, as one unit. So in RNA-seq, the relative expression of a transcript is proportional, as I have mentioned, to the number of cDNA fragments that you sequenced from your library prep, or you can say proportional to the number of reads that you generated from the cDNA. You assume that more reads means more expression. However, you have to keep in mind that longer genes will have more coverage just because they span more bases. And you also have to keep in mind the library depth. Say we have two samples, one where we sequenced three lanes and another where we sequenced only one lane. It's not fair to compare the two directly, because obviously the one with three lanes will have a lot more reads covering those exons than the one with a single lane sequenced. So the two things you have to keep in mind and normalize for are the gene length and the library depth, the total number of reads that you sequenced. And that's exactly what FPKM is: it takes the counts, the number of fragments, the number of paired ends spanning your gene, and divides by the total depth of that library and by the number of bases in that gene. That gives you a normalized unit you can use to compare genes within the same sample and genes between different samples, because it has been normalized by library size as well.
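Just to put numbers on that, the formula works out to

    FPKM = n_f / ( (L / 10^3) * (N / 10^6) )

where n_f is the number of fragments mapped to the transcript, L is the transcript length in bases, and N is the total number of mapped fragments in the library. So, with made-up numbers: 1,000 fragments on a 2,000-base transcript in a library of 50 million mapped fragments gives 1000 / (2 * 50) = 10 FPKM.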
So how does Cufflinks work? What Cufflinks starts to do is take the paired-end fragments and build overlap graphs: it tries to assemble the fragments wherever there's overlap between them and come up with overlapping bundles. And it tries to minimize the number of paths, where each of these is classified as a path, that will explain all the variation in your fragments. Each one of these paths represents a specific isoform. Now, a fragment can belong to multiple isoforms, so the way they quantify an isoform is as a linear combination of the probabilities of a fragment belonging to a certain path, a certain isoform. Once you quantify the number of fragments per path, per isoform, each of these has two things. First, an uncertainty: if a fragment belongs to multiple paths, there is more uncertainty in that fragment, an uncertainty score. Second, it looks at the variability of that measure across all the replicates that you have, the dispersion across the replicates. And it takes that dispersion, the variance between the replicates, and combines it with the uncertainty score that it generated to come up with a new score: the count variance, that's what it is. And that count variance seems to follow a parametric distribution called the beta negative binomial distribution. I might be throwing out a lot of terms that aren't familiar, but if you really want to learn more about the stats behind this, I highly recommend that you read the paper; it might still be a bit complex, but give it a try. So after you come up with that count variance score, you can simply do a t-test that compares the variance between the different replicates and biological conditions, and you'll obtain a p-value that you can use to assess which genes are significantly different between samples.
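Since we'll run these in the tutorial, here's roughly what the sequence of commands looks like, as a sketch with hypothetical file names; cuffmerge is the step I'll explain next:

    # per-sample assembly and FPKM estimation, guided by a reference annotation
    cufflinks -o clout_sample1 -g genes.gtf sample1.bam
    cufflinks -o clout_sample2 -g genes.gtf sample2.bam
    # merge the per-sample assemblies into one unified assembly
    printf '%s\n' clout_sample1/transcripts.gtf clout_sample2/transcripts.gtf > assemblies.txt
    cuffmerge -g genes.gtf -o merged assemblies.txt
    # differential expression against the merged assembly
    cuffdiff -o diff_out merged/merged.gtf sample1.bam sample2.bam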
Now, in that process there's a step called cuffmerge that gets used, so think of it this way. You take each sample and run Cufflinks on it. You get expression, and you get a transcript assembly: a list of all the transcripts present, their boundaries and so on, the isoforms, within that sample. Then you do the same thing for the second sample, and the same for the third. Now you want to compare the three samples, but you have three different transcript assemblies, one for each sample. So what cuffmerge does is take those assemblies and put them together into one unified assembly. And while it's doing that, it tries to filter out any technical errors or false positive assemblies, and at the same time it makes it a lot easier for cuffdiff to compare: it can look at a specific isoform across all the different conditions and compare the quantification between the conditions. So that's pretty much what cuffmerge does, and we'll get to run cufflinks, cuffmerge, and cuffdiff in the tutorial that we'll have next. Also, Cufflinks and cuffdiff come with an R package called cummeRbund, and it's extremely useful. They list all the commands on their website. What it does is generate many different plots based on the quantification and differential expression text files that you get from running cuffdiff: distribution plots, the distribution of coverage across the genes; heat maps; clusters, where it tries to cluster your patients to see if they cluster according to the condition you're interested in and whether there's any difference between the conditions; volcano plots, to look at significantly different genes between the conditions; and so on. So there are two pools of tools that you can use: ones that use FPKM, normalized expression values, and ones that use raw counts. With raw counts, the idea behind it is that you're not really interested in comparing genes within the same sample; you're interested in taking one gene and comparing it across conditions. That's what differential expression is. So you look at the raw counts instead of normalizing, because once you normalize, that limits the statistical tests you can run, since the distributions will be different. But if you use the raw counts, that enables you to use more statistical methods for the comparisons, so you'll have more power to do comparisons between the samples. One way you can do that is to run HTSeq, which pretty much goes through all the regions of the genes, a bit like the alignment step, and comes up with a count of how many reads pile up across each specific gene, or isoform, or transcript. And this is just a pure count: it's not normalized for gene length, and it's not normalized for library depth. You pass these counts to other tools that work with counts and do count comparisons. They do some normalization as well, not FPKM, a different kind of normalization to adjust for the different library depths. Some of those tools are DESeq and edgeR. DESeq and edgeR are very similar: the underlying assumptions are similar, the distributions are similar. So you can give those a try as well if you don't want to do the FPKM method of expression.
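Generating those raw counts looks roughly like this, as a sketch with hypothetical names; check the -s option against the strandedness of your library:

    # count reads per gene from a name-sorted BAM and a GTF annotation
    htseq-count -f bam -r name -s no aln.namesorted.bam genes.gtf > counts.txt

Those per-gene counts are what you then feed into DESeq or edgeR in R.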
Now, there are papers that have done benchmarking: they've taken samples and run them through all the different expression analysis tools out there, in hopes of finding the one best tool to use for everything. But what they found is that there isn't a single tool that works best for everything. It really depends on the conditions you're testing and the number of replicates you're looking at: with more replicates, some tools work better than others, and with fewer replicates, other tools do. And the benchmarking papers have very nice tables where they summarize the underlying distribution for each of these tools and the tests that they actually perform. For example, DESeq and edgeR use a Fisher-style exact test instead of a t-test, just because the underlying distributions are different. One thing you can do, if you want high confidence in your calls, is run them all, run two or three tools, and then look at the overlap: look at the genes that appear to be differentially expressed across the different tools, and that will give you confidence. Ideally you want to validate, but if you can't do that, then maybe that's a cheaper alternative to validation. Once you get these genes, if you have money, you can do qPCR and actually go look at the expression using a completely different quantification technique, to see if there really is a difference between the conditions. Yes? If I wanted to validate a transcriptomic signature using an alternative technique like immunohistochemistry, when filtering the gene list, is it better to look at the raw counts or the FPKM? The objective is to make a cheaper assay: choose four or five immunohistochemistry antibodies and try to replicate the transcriptomic signature. When you say a transcriptomic signature, what do you mean by that? For example, you look at gene expression models, and there are signatures that you get out of them, and you try to see if you can recapitulate that. Yeah. So that, again, depends on the number of replicates that you have. Because, as I mentioned, with a lot of replicates some tools work better, you might look at the count-based versus the FPKM tools. However, and I'll show you in a bit, FPKM is mainly used for visualization. So if you're doing things like that, it's highly recommended that you use the count-based methods: so edgeR or DESeq, or, I don't want to sound unfair and not mention the rest, there are a lot of other good tools that you can use. FPKM is mainly for when you're looking at heat maps, when you're doing clustering, those kinds of things. But if you're doing machine learning and classification, like the signatures you mentioned, it's better to use the count-based methods. Oh, question? I just wondered about TPM, are you going to discuss that at all? Is it used? Oh, sorry, what is TPM? Transcripts per million? Yeah. And how is that calculated, is there software that you use to calculate it? I'm not sure on the details. I haven't seen it personally; as I mentioned, there are a lot of other tools out there, and I haven't done RNA-seq within the past year, so I haven't tried the newer tools. But that would be interesting; I'd like to give it a try and check that out. Also, one thing to keep in mind when you're running these tools is that we're doing these tests at the exon level, or the gene level, and we have tens of thousands of genes and hundreds of thousands of exons. So what that means is that you're doing hundreds of thousands of tests, and it's expected that you will get a list of differentially expressed genes by chance alone, just because you're doing so many tests. So looking at the p-value on its own is not sufficient, and you will need to correct for the number of tests that you have conducted. A lot of these tools nowadays actually do correct for that: they do multiple-testing correction and report the q-value, and that's what you're interested in, whether the q-value is below a certain threshold, not the p-value.
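For reference, one common scheme behind those q-values is the Benjamini-Hochberg correction. As a sketch: if p(1) <= p(2) <= ... <= p(m) are your m p-values sorted in increasing order, the adjusted value for the i-th smallest is

    q(i) = min over j >= i of ( m * p(j) / j ), capped at 1

and a gene is called significant if its q-value falls below your chosen false discovery rate, say 0.05. Tools differ in the exact flavor of correction they apply, so check what yours reports.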
For downstream analysis, there's a lot of stuff you can do with the expression values. You can cluster them, supervised or unsupervised clustering, and try to see if the patients cluster according to some gene panel and its expression. You can do classification: come up with a signature from the expression profiles through machine learning and try to classify patients into high risk and low risk based on those profiles, which would be very useful, and random forest is a great utility for that; there's a package in R that's also really easy to use. You can do pathway enrichment analysis: look at the genes that were differentially expressed, see what pathways they belong to, whether certain pathways are enriched, and whether that actually relates to the biology of the question that you're trying to study. And we are done.