OK, perfect. So, hello everybody; tell me if I need to speak louder. I'm Florence Cavalli. I work at SickKids in Michael Taylor's lab. We work on pediatric brain tumors, and I'm a pharmacist by training. More recently, actually already three years ago, I joined a large Stand Up To Cancer project, which is led by Peter Dirks at SickKids and Sam Weiss in Calgary. We're especially looking at brain tumor stem cells to study ependymoma, other pediatric brain tumors, and glioblastoma, one of the most common brain tumors in adults. As the bioinformatics project leader, I deal with most of the bioinformatics sides of this project. We have multi-omics data: whole genomes, RNA-seq, single-cell data, epigenetic data, as well as complementary data on proteomics, metabolomics, and some drug screening. So it's a very large project, and I really enjoy it. Today is going to be about RNA-seq, and gene expression from RNA-seq in particular. Something about myself that is not on my CV: I used to do, and still do, a lot of track and field, and I'm a pole vaulter. A bit specific, and nothing related to science.

OK, let's start: gene expression and RNA-seq. First, how many of you have already used RNA-seq data? OK, not that many. And how many of you are planning to have RNA-seq data coming soon, or are analyzing it now? OK. For the rest of you, it will be of interest later if you want it.

This lecture is in three parts. First, an introduction to RNA-seq: why we do it and, briefly, how we do it. Then, once you have RNA-seq data, we'll talk about alignment and visualization. And then, once your reads are aligned, you assess expression and perform differential expression. Feel free to interrupt if you have any questions; I'm happy to answer at any time.

So, what about RNA-seq? What is the rationale for RNA-seq, and what are the challenges linked to it, compared to a whole-genome sequencing analysis, for example? What are the general goals of the analysis, and the workflow? I will also go through a few questions that most people starting RNA-seq have, technical questions like "how much data should I go for?", to give you some pointers on how to answer them.

In terms of gene expression, you are all aware that your genome is double-stranded, and one strand of a gene is transcribed into a single-stranded pre-mRNA, which is then spliced and processed into a mature mRNA. This mature mRNA is what we want to capture and sequence in RNA-seq, to get an idea of how much the gene is expressed. Later on in the cell, the mRNA is translated to produce a protein, but with RNA-seq we measure expression at the mature mRNA level.

To do that, we generally take several samples, quite often from two conditions. It could be tumor versus normal, as in most cancer studies, or different subgroups of patients with the same type of tumor. We isolate the RNA, cut it into fragments of about 300 base pairs, add primers and adapters to prepare the library, and sequence it on an Illumina machine, for example. Generally we sequence paired-end reads, from both ends of each fragment; I will have some detail later on about that.
When you get your fragment reads, you want to map them to the genome or to the transcriptome, most often to the genome, to reconstruct your transcripts and get an idea of their expression. We will go through all these steps in more detail.

So why do we want to sequence RNA, as compared to DNA? Because they give us different, complementary types of information. The genome is the same in all your cells, but the expression of genes varies as a function of cell type or condition: for example, a cell line treated with a drug versus without, or wild-type versus knockout mice. You can compare two conditions, or more, and see how much gene expression changes.

RNA-seq also allows us to predict transcript sequences a lot better than just looking at the genome. In the genome you have exons and introns, and it's sometimes hard to predict where the exons are and where the boundaries between exons and introns lie for the different transcripts. With RNA-seq, because you sequence the mature mRNA, you can identify those exon boundaries much more easily when you map to the genome. It also lets you look at features you can only get from RNA, such as alternative isoforms, fusion transcripts, and RNA editing. We're not going to go into detail about those, but they are things you can do with RNA-seq data as well.

Another important use of RNA-seq data is to help interpret mutations. Yesterday and today you've been talking about copy number and mutations. Somatic mutations can happen in regulatory regions and affect gene expression, but to know that, you need a readout of gene expression: how much does expression change in a sample carrying the somatic mutation, compared to a sample that doesn't have it? RNA-seq also helps to prioritize coding somatic mutations. For example, you might have detected a mutation in the coding region of a gene, but if the gene is not expressed, there will be no protein, so the mutation is unlikely to have any effect.

You can also look at which allele is expressed, the mutant or the wild type. If only the wild type is expressed, it could be a loss of function of the mutated allele; that's called haploinsufficiency. That's another piece of information you get. And if only the mutant allele is expressed, it could be a good indicator of a drug target, because it would be very specific to your tumor sample: the mutation is somatic, so it is only in the tumor, and if it's highly expressed there, it could be a good target, as compared to normal tissue that doesn't express the mutant.

Some challenges are linked to generating RNA-seq data. The first is to get good samples. In the context of tumors, there is always a question of purity. Some tumor types are by default quite pure, with a tumor content of 80, 90, 95%, while other tumors are a lot more diffuse. When the surgeon removes a diffuse tumor, some of the healthy tissue next to it comes out too, so the sample you get might not be as pure as you hoped.
That's something you want to be aware of: if the sample is not pure enough, you will have a mix of normal cells and tumor cells in your resulting RNA-seq data. There is also the question of how much material you can get from the sample. If it's a very small biopsy, you might not be able to generate enough RNA for a large library, enough to sequence as much as you want. And quality: RNA is a lot more fragile than DNA, so it can be harder to get a good-quality library. It has to be stored properly, and the library preparation is a bit more delicate, because RNA is more susceptible to degradation than DNA during preparation.

As you know, mRNA consists of small exons that are separated by large introns in the genome. So when you map reads to the genome, you need to take into account that some reads span two exons: placed on the genome, such a read would have an intron in the middle. You need an aligner that can find the start of the read in one exon, skip the intron, and map the rest of the read to the second exon after it. We'll look at this in more detail later.

RNA abundance is relative and spans several orders of magnitude. You capture a random pool of mRNA: for a highly expressed gene you get many more fragments than for a lowly expressed gene. That's something you have to take into account in your analysis, and it raises the question of how deep you need to sequence to catch the lowly expressed genes, because you will always capture the highly expressed genes more easily. The deeper you sequence, the more likely you are to catch the low ones.

In a plain mRNA library, without any selection, you get a lot of ribosomal and mitochondrial reads, which are not necessarily very interesting. You generally don't want to spend money sequencing those again, and again, and again, so you remove them at library construction. Similarly, RNA comes in a wide range of sizes. If you're interested in small RNAs such as microRNAs, which can be really interesting for gene expression regulation, you need a specific library type to capture those small 25 to 30 base-pair species; you would not get them in a normal library. And if you're more interested in protein-coding genes, which usually we are, you do a poly-A selection of the mature mRNA to enrich for them.

The first thing you can do is assess the quality of your RNA. For this we use a metric called the RIN, which stands for RNA Integrity Number. It comes from an electrophoretic measurement of the RNA and goes from 0, bad quality, to 10, very good quality. These are two electropherogram profiles; the plot on the right, with a RIN of 10, is the very good one that you would ideally get for your samples. The idea is to look at how much ribosomal RNA you have, the 18S and the 28S, relative to everything else. Ribosomal RNA should make up about 80% of your total RNA, so you should see two large peaks and not much else; you shouldn't see other peaks. If instead you have bad-quality, degraded RNA, you see all these other peaks appearing, as you can see on the left. That's an indication that the RNA has been degraded, and you might not want to sequence it.

Question: what exactly are your X and Y axes? Sorry.
The X axis is the size, in number of nucleotides, and the Y axis is the intensity, how much RNA you have at each size. You want a clear predominance of the ribosomal 18S and 28S peaks. That gives a good RIN, because it means your RNA is not degraded. If you see the other peaks, as on the left, that tells you something is going on with your RNA: it has been degraded into smaller and smaller pieces. Quite often that happens before the sample reaches you, so it doesn't depend on you, but you want to make sure you know what the RIN was. If it's 7 or 8 or more, you can go ahead and sequence. If it's below that, you might want to ask whether it's possible to get more sample, re-extract, and reassess the RNA quality. Because, of course, if what you put into the sequencer is poor-quality RNA, you're not going to get good sequence, and your analysis will depend on it. The better the RIN, the better your results at the end. But it's often not in your hands; some samples are really precious and you cannot get more. That's possible.

So how do we prepare this library? Generally we do some selection or depletion. From the tissue, you isolate the RNA and treat it with DNase. If you use total RNA, you get a large representation of ribosomal RNA, tRNA, and all types of RNA, and there will be a lot of reads you sequence but never use. So generally we don't use total RNA, unless you have a particular question that requires it. Most of the time we do one of two things: a ribosomal reduction, where you deplete the library of ribosomal RNA to remove all those sequences, or a poly-A selection of the mRNA. Poly-A selection is the more common one. And you want to do one or the other; you don't want to mix the two within one experiment. If you have a cohort done with poly-A and you want to add more samples, you should do the same: use the same library construction, the same kit, and not change it. Because, as you will see in the QC section later on, the read coverage can differ depending on how the library was prepared.

Question (about the ribosomal reduction: which RNAs are being removed there?). Sorry, let me go back to that slide. Yes. Those would be the ribosomal RNAs. If you use poly-A selection, they aren't captured; if you do a depletion of ribosomal RNA, you remove them directly. So these are two ways to enrich for protein-coding mRNA, and to get rid of tRNA as well.

Question: just on the previous slide, those methods of cDNA capture, poly-A selection or ribosomal reduction: is there a benefit to one, is one more frequently used, or is it just up to the design of the experiment? Answer: between poly-A and ribosomal depletion, both are widely used. As I said, if you're part of a project and they already use one type, you should go with the same type. If you're free to choose, you can choose. Yeah.

So, stranded versus unstranded library preparation. The first RNA-seq libraries were unstranded.
That means we didn't have the information of whether a read came from the forward or the reverse strand of the DNA. That can be a problem, because in some places of the genome you have a gene on the forward strand and a gene on the reverse strand. A read that maps at that location can't be assigned: you don't know if it comes from the gene on the forward strand or the gene on the reverse strand. With a stranded library, you can specifically say which strand the read comes from, because you have that information.

Comment from the audience: there is also antisense RNA; it's abundant, but you can only actually see it with stranded data. That's totally right. And at the same time, having the strand information is useful to assess antisense expression, because you know which strand each read comes from and what is expressed from each of the two strands. I must say most libraries now are stranded; there is really no reason for you not to do stranded RNA-seq. I just wanted to show you that there are two options; usually you pick the kits that are strand-specific.

To come back to the other question: there are some messenger RNAs, histone mRNAs being the classic example, that are not polyadenylated. And I remember as well, I wanted to call circular RNAs in my RNA-seq data, and the fact that I had poly-A selection allowed me to catch some, but not most of them. So that could be another reason to use ribosomal depletion instead.

What about replicates? Do you need some, and why? There are two types of replicates. Technical replicates: basically the same library preparation that you sequence on different flow cells, on different lanes, or with a different index. This is really technical, it's about the machine and the sequencing itself: would you get the same reads at the end, and what is the correlation between the two replicates? The correlation for technical replicates now tends to be very good, so people do them less and less. The sequencing facility does a lot of this at the beginning, to make sure their sequencers and pipelines work well, so you might not want to spend money on technical replicates. For the project I was involved in, we had some controls, normals, that we put on every single batch. Those were technical replicates that let us see the batch effect, because we knew they should show the same profile; that allowed us to assess the batch effect between the four batches, of hundreds of samples, that we ran.

Very important are the biological replicates. Here you isolate the RNA and prepare the library separately from the same sample, which gives very pure biological replicates, or you take different tumors of the same type. Most often in a cancer genomics study you actually have several samples of one tumor type versus several samples of controls. To perform a differential expression analysis with proper statistics, you need several biological replicates; I would say at least six. Some people go with three, and less than that is not enough. So you want two groups, with biological replicates in both groups. Generally the correlation between replicates is quite good, and it's something you can assess while you're doing the analysis.
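For instance, here is a quick way to check replicate correlation in R. This is a minimal sketch with a simulated toy matrix standing in for your real expression values; the sample names are hypothetical.

```r
# Toy genes-by-samples count matrix standing in for real expression data.
set.seed(1)
expr <- matrix(rnbinom(4000, mu = 100, size = 10), ncol = 4,
               dimnames = list(NULL, c("tumor_rep1", "tumor_rep2",
                                       "normal_rep1", "normal_rep2")))

# Correlation between two replicates:
cor(expr[, "tumor_rep1"], expr[, "tumor_rep2"], method = "spearman")

# All pairwise sample correlations at once, to spot outlier replicates
# or batch effects:
round(cor(log2(expr + 1), method = "spearman"), 2)
```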
Some common analysis goals of RNA-seq: most of the time you start by estimating gene expression, producing a matrix of how much each gene is expressed in every single sample, which is what we're going to do, and performing a differential expression analysis between two conditions, for example tumor versus normal. With RNA-seq data you can also do alternative splicing analysis, discover new transcripts (with StringTie, for example, which we're going to use tomorrow), and improve the annotation of your genome, for example if a transcript was not annotated. You can look for allele-specific expression related to SNPs and mutations; we're going to touch a tiny bit on this. You can do some mutation discovery, perform fusion detection, and look at RNA editing. I know the next module is about fusion detection, not in RNA-seq but on the genome, so that's related to what you're trying to do.

Now, the general theme of an RNA-seq workflow. All RNA-seq datasets and projects are different, but there are common steps everyone goes through. First you get the raw data from the sequencing facility, generally FASTQ files. Then you assess the quality, align the reads, assemble the reads into transcripts, and process those aligned reads to evaluate gene expression, with Cufflinks or other tools, or to look at fusion detection, with deFuse or other tools, as examples. With the output of those tools you then do downstream analysis in R, do a pathway analysis, and visualize it with Cytoscape; we're going to do this at the end of the workshop. At the end you can get out, for example, a list of differentially expressed genes, or the top candidates for validation. There is a lot of downstream analysis that really depends on your question, but the first few steps are usually the same.

A common question that comes back very often in RNA-seq analysis: should I remove duplicate reads? The answer is not as straightforward as it is for DNA; so, maybe, but it's more complicated. In DNA, you cut the genome randomly in many different places, so it's quite unlikely that two fragments start at exactly the same position. In RNA-seq, however, many fragments come from transcripts that start at the same position, so you have a lot of reads that legitimately share the same start and the same sequence after it. If you remove those, you lose information, especially about gene expression. At the same time, as in DNA, some duplicates are produced by PCR amplification during library preparation, and those are not real, informative reads. So, as I was saying: for highly expressed genes, duplicates are expected, and they are not due to PCR amplification. Most of the time we don't remove duplicates in RNA-seq analysis, but it's good to do a QC, see how many duplicates you have, check whether it looks like what you expect or is over the top, and take that into account.

How much library depth is needed? It depends on many factors. What question are you asking: do you want a simple gene expression analysis with differentially expressed genes, or do you want to look at alternative transcripts?
If you want to look at alternative transcripts of your genes, you need to go deeper, to have enough information to catch all the different transcripts of the same gene. It depends as well on the quality of your input, the library construction method, et cetera. The length of your reads matters too: longer reads carry more information, so maybe you don't need to sequence as much. Whether they are paired or single-end; most of the time now they are paired and quite long, 100 base pairs, and more recently 150. It also depends on the computational approach you want to take, and on your budget, because the more you sequence, the more expensive it is. Quite often you aim for something, get several quotes, and decide.

A good way to start, to get an idea of how many reads you want per sample, is to look at previously published studies that did something similar, at least in the same species. If you're working on human, it's easy. If you're working on another species, find studies that address a similar question, see how much they sequenced and whether they were able to catch what you want to look at, go with that recommendation, and then evaluate whether your read depth is good enough.

You can do a pilot study as well. You sequence one or two samples on a whole lane, so you have a lot of reads, and then, with the QC I'll show you later on, you assess how much information you gain by adding more and more reads. You take only 10% of your total reads and see what you can find, then 20%, et cetera, et cetera, up to 100%. If it plateaus at 70%, you will not get more information beyond that, so you don't need to sequence the rest of the samples as deeply. We will see this in the QC measures, and there is a small sketch of the idea at the end of this part. So yes, you can start with one or two samples on a lane and evaluate it this way.

What mapping strategy should you use? The first RNA-seq libraries had short reads, 36 base pairs. In that case you could use an aligner like BWA on the genome, with a junction database appended to the genome, so reads could still be mapped onto junctions; a read of 36, or anything under 50 base pairs, is unlikely to span more than one exon junction. That was one way of doing it. You could also use an assembly strategy, such as Trans-ABySS. Now that reads are longer, 100 to 150 base pairs, you really need a splice-aware aligner, able to map the beginning of the read in one exon, skip the intron, or even the next exon, and map the rest of the read on the following exon. We will see that with HISAT. Several splice-aware aligners have come out over the last few years: at the beginning there was TopHat, and in progression TopHat2; then STAR came out, and I will describe its improvements a bit; and more recently HISAT and HISAT2. We're going to use HISAT2: I'll describe it a bit today and we'll use it tomorrow.

What if you don't have a reference genome? I don't know how many people here work with a species that doesn't have a reference genome. No one? OK, we're all safe, good. Well, in that case you might consider actually sequencing the genome of your species; that could be really useful. If you're only interested in the transcriptome, you can do a de novo assembly of the transcriptome. That's out of the scope of this workshop, but you learned about de novo assembly of the genome with Jared yesterday, I believe, and you will learn about RNA-seq analysis today; a mix of those skills would help you understand and use the tools that exist for de novo assembly of the transcriptome, if you were interested in that.
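Before we move on to alignment, here is the pilot-study sketch I promised. It's a toy illustration in R: the junction counts are simulated just to show the saturation shape, whereas in practice you would get them from subsampled BAMs, for example with RSeQC's junction saturation module.

```r
# Fractions of the pilot sample's reads to use (5%, 10%, ..., 100%).
fractions <- seq(0.05, 1, by = 0.05)

# Toy saturation behavior: the number of detected known junctions rises
# quickly, then plateaus as extra reads stop adding new junctions.
known_junctions <- round(150000 * fractions / (fractions + 0.15))

plot(fractions * 100, known_junctions, type = "b",
     xlab = "% of reads used", ylab = "Known junctions detected")

# If the curve flattens well before 100%, the remaining samples can be
# sequenced less deeply than the pilot.
```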
OK, so now, RNA-seq alignment and some visualization. We're going to touch upon alignment challenges, some common questions and strategies, and say a few words about the different aligners that came out over the last few years. From an aligner you get a BAM file; you've probably already had a look at BAM files during this workshop. So we'll cover the BAM format and another very useful format for this type of analysis, the BED format; a very brief look at how you manipulate BAM files; some types of visualization you can do; the QC assessments you should do on your RNA-seq data after alignment, which are important; and how you can start looking at variants at specific sites with your BAM.

Some challenges with RNA-seq: you will have a lot of reads. For human cancer studies, we shoot for about 100 million per sample; that's roughly a good number to get information about genes and transcripts. And of course you don't sequence only one sample, so it easily becomes a lot of reads. That can cost money, depending on the server you're using and where you are, and you will need computational power to analyze the dataset. There is also the fact that there are introns in the genome, so you need a splice-aware aligner.

And there is the question: can I just align my data once and go with it? It depends, but usually not. You will find improvements; maybe a new aligner comes out that is faster and more accurate, so you might want to rerun. You might need to realign because another tool needs a different kind of input, or because you want to run a different analysis on the same data. However, if you're part of a large project, already have a large cohort, and want to keep adding samples in new batches, you do want to process them the same way as before rather than rerun everything. Well, it's a choice. For example, in our lab we have a lot of samples. We did all the alignments the same way; a few years later we decided we needed to do better and realigned everything with STAR, but now we're not realigning again. We're finishing the analysis, and other people can redo it if they want once it's published. We're not going to move to the next aligner, because it would be way too much work: all the downstream analysis would change. But if you start with a brand-new dataset, use the latest, best aligner right away. For example, use HISAT2 and don't go with TopHat; HISAT is the better version from the same people anyway.

HISAT2 is not the only mapper to consider. As I said, there were TopHat and TopHat2 before, and STAR is a very popular one; actually, I used STAR quite a bit. They all have advantages, and they came out sequentially over the past years, so each is usually better than the previous one; to publish, you have to show you do better somehow. There are three mapping strategies for RNA-seq: it can be de novo assembly; it can be aligning to the transcriptome, if you have very short reads;
or it can be aligning to the reference genome, which is what we commonly do and what we're going to do here. Which alignment strategy is best? I actually already touched upon that. De novo: if you don't have a reference genome, or if you believe your samples carry a lot of polymorphisms or unusual features that you want to be able to catch, where forcing the reads onto a human reference genome might not be the best. De novo assembly can be useful in those particular cases. Transcriptome only if you have very short reads, which is unlikely nowadays. In all other cases, align to the reference genome.

This is a plot from the EBI of the different mappers and aligners that have come out since 2001. They are not all for RNA; as you can see, there is a color code, with red for RNA. One of the latest was HISAT, in 2015, and there are more recent ones as well, which use even less memory and do a bit better, but I would say a lot of people use HISAT, HISAT2 now.

Basically, in 2009, when TopHat was published, it was a big revolution: it did things a lot better, and it didn't require a lot of memory, which was good because people could run their analysis on their own machine if they didn't have access to a large server. However, it took forever. It was really slow, and that needed to improve. Then STAR came out. It uses an index of the reference genome to make the search for where a read starts mapping a lot faster. For 100 million reads, that meant about 1,000 minutes, so a bit less than a day, with TopHat; when STAR came out, you could do it in 23 minutes. So you could process a lot more samples in a day. But it used a lot of memory: you needed 28 gigabytes of memory to hold the index of the reference genome, and that you cannot really do on a normal machine; you needed access to a server. So that was not ideal for a lot of bioinformaticians. Then TopHat2 matched pretty much what STAR was able to do, by using the reference genome as well; it was faster than TopHat, but it also used a lot of memory.

Then HISAT came, and it indexes the reference genome in two steps: a global, whole-genome search, and then a local search. That makes the memory footprint a lot smaller, so it's fast, doesn't use a lot of memory, and doesn't take a lot of time. Those were the advantages of HISAT, and it's more accurate as well, according to the comparisons its authors did. That's why HISAT and HISAT2 became very popular: you can run them on a local machine and it doesn't take too long. So basically, the three things everyone tried to do after TopHat were to increase the mapping accuracy, to reduce the amount of memory needed, and to reduce the processing time, as you would expect.

Should you use a splice-aware aligner? I already gave the answer: yes, you should, because reads may span large introns and you need to deal with that when you align. Except if you have very short reads, where you have the alternative options I described before, because the reads rarely span junctions. So, what about HISAT and HISAT2? HISAT is a splice-aware RNA-seq aligner. It requires a reference genome, it's very fast, and it uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index.
I won't go into the details of how that works, but basically it uses a multi-step type of indexing for the alignment, which makes it fast without using a lot of memory. You do a whole-genome FM search, with the whole-genome FM index, to anchor the first base pairs of your read, to find where the read most likely starts. Then, from this location, you do a local search with the other indexes to extend the read and continue the alignment. I'll describe that with the few examples here.

First, HISAT tries to find candidate locations across the whole genome by anchoring the first 28 base pairs of the read. That gives a small number of candidates on the whole genome. Then it selects one of the 50,000 or so local indexes for each candidate to continue aligning the remainder of the read. If you have paired reads, it does this separately for both mates. Most pairs should fall within your library insert size, 300 or 500 base pairs depending on how you constructed it, so the start and end of the pair should be separated by roughly that distance. But sometimes the second read of the pair maps very far away, or even on another chromosome, and that can be an indication of something happening in the genome: a copy number change or a translocation. So the two mates are mapped separately; most of the time they land next to each other, and that's information you get. When they don't, it can be evidence of something else, and that's something you want to use.

Here is the first example. You align the read by searching the global index, which is the slow part. Once at least 28 base pairs are exactly matched to one location, HISAT switches to extension mode: it simply continues mapping the remainder of the read onto the base pairs next to this first anchor. For a read that is completely within an exon, that's all it does. That's panel A.

Another example: you start again with the global search until the first 28 base pairs match, then you extend; in this case, say, you extend to 93 base pairs but hit a splice site, so you cannot extend further. For the remaining 8 base pairs, HISAT switches to a local search, with a small local index, to find where those 8 remaining base pairs map, most likely the next exon; at least it should be local. When there is a good match, it checks the compatibility and combines the two pieces, so you get the splice information. That's the second example, which is what you see in panel B.

And the third one: you start the same way, first a global search to find where the first 28 base pairs of your read anchor, then an extension; here it hits a splice junction. You map the first 8 base pairs of the remaining piece locally with a local FM index, find the place, and then extend from there. That was panel C. So roughly, well, exactly, that is how HISAT works.

Should you allow multi-mapped reads? It depends on the application. What I mean by a multi-mapped read is a read that has an equally good alignment at several positions in the genome. This can happen because of repeated sequences, for example. In DNA analysis, it's common for the mapper to randomly select one location for a multi-mapped read. The shorter your reads are, the more likely you are to have multi-mapped reads.
The longer they are, the fewer you get, but you can still have some. In RNA-seq it's less common to remove them. You might want to do that if you're doing variant calling, because you wouldn't want the same read counted toward variants in many different places in the genome: it actually comes from one mRNA molecule, and its allele shouldn't be counted many times, identifying different mutations in many different places. However, if you want to look at gene expression, you want to keep them: they have the same effect in your different samples anyway. If you compare one sample against another, the multi-mapped reads have the same effect in samples A and B in condition one and in condition two, so they are not what's going to make a gene differentially expressed, and by removing all the multi-mapped reads you would lose their information. So you keep them.

Which output do you get from these aligners? Basically, you get a SAM or BAM file. SAM is the Sequence Alignment/Map format, and BAM is the binary version of a SAM file. You've probably already seen SAM and BAM in this workshop. Yes. The BAM is compressed, so you need tools to access it and work with it, and you probably know by now how to convert a BAM to a SAM and vice versa. One thing that's important for your RNA-seq data: at the top of the BAM, in the header, you have a record of the command that was used to align your sample. So if, down the line, you go back to your dataset and you didn't keep track of which parameters or which aligner version you used, you can go look at the top of your BAM and find the command that was run, along with some other information. After the header come the alignment records: the read ID, how the base pairs match, the sequence of the read, the quality of the read, and other information that you've seen before.

Another useful format, which you'll use when working with BAM files to look at them and extract information from them, is the BED format. A BED file is really simple, but it's a standard format used in many contexts. Basically it has basic information: the chromosome, a name, and the start and end positions. It could be the location of a gene, an exon, a transcript. Question: what would you put in it in this example? Yeah, the content would be different, but you can use the same file format, BED and some others. It's a way of storing this kind of information in a given layout, so you can have a BED of anything with coordinates, yes.

So the BED format holds basic information about regions, and here is where it's really useful. Say you have your RNA-seq BAM file, you're really interested in one gene, and you want to look at it in IGV, for example to see whether you have evidence of different transcripts, with a sashimi plot. You don't want to download your whole BAM, which can be large, from the server to your local machine and try to open it; that could take a lot of time. Instead you make a mini-BAM of your BAM file, extracting only the reads that are in the region you're interested in.
So you create a BED file with the region, starting a bit before and ending a bit after the gene, and with some tool you extract the reads that map to this particular region, using the BED information, save them in a very small BAM, and open that in IGV, as an example. I'll sketch this mini-BAM idea below. There are several tools, and you've started to use some of them, I believe, to manipulate BAM files: samtools, BamTools, Picard to mark or collapse duplicates; and for BED files, bedtools and BEDOPS. These are standard tools that we commonly use for processing and for the analysis.

There was someone in the lab, a wet-lab person who was getting a lot more into bioinformatics and was analyzing her own data. She came to my colleague saying: I want to do this and this, and I'm trying to code it, but I'm not sure; she was learning how to code. And one of the first comments, it was Rana actually, was simply: have you checked whether a tool for this exists? Most of the things you want to do, people have already thought of, and there is probably a good tool out there. I think that's very good advice. Maybe not later on, when you're an expert bioinformatician and you sometimes come to an analysis for which there is no tool, but for processing, doing a comparison, or extracting things, you don't need to reinvent the wheel. The tools out there have been tested and are available, on GitHub and Bioconductor and other repositories. It could be that what you need is specific and you have to check the parameters or a particular function within samtools or bedtools, but those tools do a lot of things. So check first whether what you want to do can be done with them. I didn't list all of them, but these are tools that can be really useful.

How should you sort your BAM or SAM file? There are two ways to sort them. One is by position: when you sort by position, you also build an index, which lets tools get back into the BAM file much more rapidly, because a tool using the BAM can skip directly to the right position. Other tools may want your BAM sorted by read name: in that case the two paired reads that come from the same fragment sit next to each other, which is useful, for example, for fusion detection, where the pairing is the first piece of information the tool wants, rather than the position. So depending on which tool you're going to feed your BAM to, it will need to be sorted in a particular way. The standard tools let you sort the BAM the way you want, with different parameters, and then create the index that goes with it; that's why you typically see a BAM file with its index file next to it.

As with whole-genome data, you can load your RNA-seq BAM in IGV and start looking at your sequence. It looks a bit different, of course, because you have reads that map only on the exons and not all along the genome. Here you have evidence of, for example, spliced reads: the start of the read in one exon, then a big gap, and the end of the read in another exon, as you would expect. And some reads that are not spliced, where the whole 100 base pairs map within one exon, or at the end of the exon here.
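Here is that mini-BAM sketch. It's a minimal example using Bioconductor's Rsamtools, assuming a coordinate-sorted, indexed "sample.bam" and a gene of interest at hypothetical coordinates; the same extraction can be done on the command line with samtools view.

```r
library(Rsamtools)
library(GenomicRanges)

# Hypothetical region of interest: the gene plus some padding on each side.
region <- GRanges("chr17", IRanges(7565000, 7590000))

# Keep only reads overlapping the region; the output is a small, indexed BAM
# that opens quickly in IGV.
filterBam("sample.bam", destination = "sample.mini.bam",
          param = ScanBamParam(which = region, what = scanBamWhat()))

# Sorting: by position (the default) for indexed browsing, or by read name
# (byQname = TRUE) when a tool wants the two mates next to each other.
sortBam("sample.bam", destination = "sample.byname", byQname = TRUE)
```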
You had an IGV practical on the first day, right? So you already have it, and you know what you can see with it; you've used it a little. You can use it with your RNA-seq BAM files as well. IGV is not the only browser; there are others. Savant is another one that is used by many people. It has functions that IGV does not necessarily have; I would say they're complementary. If you're really used to IGV and you're happy with it, that's good; if you sometimes miss something you would like to be able to do, check the other viewers to see whether they allow it. Savant lets you annotate a bit more and do a bit of variant analysis integrated within the viewer, whereas IGV is just for viewing, and you load the information into it.

Let's move on to QC assessment. There are several QCs you can do; here is a list of what I would call the basic, important ones: the 3'-to-5' bias, the nucleotide content, the base quality, how many PCR artifacts you have, the sequencing depth (evaluating whether you have sequenced enough or not), the base distribution, and the insert size distribution. All these QCs can be done with several tools, and one of them is RSeQC, whose address you have here. It's easy to install and run, and it produces the type of plots I have on my slides.

The first one is the 3'-to-5' bias. Basically, every single transcript, regardless of its length, is divided into 100 bins, and you calculate the coverage, how many reads you have, in each bin. That produces the plot on the right: the gene body percentile, from 0 to 100 (the 100 bins, that is, the full length of the gene), against the coverage. You can observe two different distributions here. One set of samples has good coverage that is pretty much uniform along the whole gene body, the whole transcript, and another set of samples has a clear 3' bias. There are several reasons for that. It could be because it's a poly-A library, so you have an enrichment at the 3' end; it could be degradation of the 5' end as well. The point is to notice it. It's OK if all your samples have the same bias, because it affects them all the same way, and you're not going to call differential expression because of it. However, if you have a mix of samples of the two types, that can be a problem: you might call a gene differentially expressed just because one set of samples has good coverage along the whole transcript while the other is biased toward the 3' end. If you notice that, you need to flag your samples and either use one set or the other if you can, remove the ones that are different, or use the bias as a covariate in your differential expression analysis. You shouldn't get differential expression just because of this, so be aware of it. Generally, if you use the same library selection, if you didn't pull data from different public datasets, FASTQs you got from different projects, and you use the same library preparation, you will have the same bias in all your samples.

Another thing you can observe comes from the fact that random primers are used to reverse-transcribe the RNA fragments.
At the beginning of your reads, you do not see the roughly 25% representation of A, C, G, and T that you would expect from the genome sequence; as you can see for the first 10 bases on the plot, the frequencies go up and down. You can easily observe this, and usually we simply trim these first 10 base pairs, because they correspond to the primer and are not the sequence you're interested in. You could try aligning while keeping them, then trimming the 10 base pairs and aligning again, and see whether you improve the quality of the alignment; those 10 base pairs are not the real sequence, so they won't match the genome properly. You want to trim them if that hasn't been done already.

Another QC you want to do is to look at the base quality. For this we use the Phred quality score; I'm pretty sure you've already touched upon this in the workshop. It's minus 10 times the log10 of the probability that the base call is wrong: Q = -10 * log10(P). So a Phred score of 30 means a one-in-a-thousand chance that the base call is wrong, and a Phred score of 30 or above is quite good. You just want to check that. It's common to have a decrease of the quality toward the end of the read, because the sequencer makes more and more errors as the read gets longer, so it's typical to see the base quality going down. You don't want the quality of your mapping to be low because the end of the read doesn't match well, when actually it was not the right base that was sequenced. So you might want to trim your reads at the 3' end, either removing a fixed number of base pairs or removing all the base pairs below a quality of 30. There are tools to do that; Trimmomatic is one, and it takes the end of the read and trims until you get a good Phred score, keeping the rest of the read.

What about PCR duplicates? Duplicate reads are reads that have the same start position and the exact same sequence. In DNA, the PCR duplication rate is a standard metric, you don't want duplicates, and you tend to collapse those reads for the DNA analysis. For RNA-seq, however, we really want to keep them, but you want to check that you don't have too many. You can produce the kind of plot you have here with RSeQC: the occurrence of reads versus the number of reads. You want this plot to have this decreasing shape, with only a few reads having a high number of duplicates. If you had something flat, or something that goes up, that would be concerning, and you might have to deal with it, or talk to the facility and figure out why you have so many duplicates in your RNA-seq data. So you want only a few reads with a high duplication count, as you can see on this graph.

Regarding the sequencing depth: how much should you sequence, and how much information do you gain by having more reads from the same sample? In DNA, you can determine that by looking at the average coverage over a particular region and checking whether it's above a threshold; for a good whole-genome sequence, we use something like 60x.
That gives you a lot of information; 60x coverage is good. But we don't really have a measure like that for RNA-seq, because one gene is more expressed than another: there is no standard coverage to expect for a highly expressed gene versus a lowly expressed one. So that's not something you can use; the challenge is different for RNA-seq. One way to do it is to look at how much information you gain by having more reads, looking specifically at splice junctions: how many known junctions, and how many novel junctions, can you detect with more and more reads? To do that, you subsample, or resample, your reads. You use only 5% of your reads and count the number of known and novel junctions you can detect, then 10, 15, 20%, et cetera, up to 100%, and you plot this kind of curve, a saturation curve, and look at where it saturates. If it plateaus, having more reads doesn't give you more information, so you don't need to sequence as deep. Of course, the novel junction curve rises a lot faster and doesn't plateau as quickly as the known junction curve, because you have a lot of false positives; those calls are not perfect, and you would often, maybe always, keep finding novel junctions. So you rely on the known junctions, or all junctions, check where the curve saturates, and say: OK, I had 120 million reads in my library and it saturates at 70%, so, doing the proportion, maybe 100 million reads would be enough. Here, maybe the red curve plateaus at 60 or 70%, so you could sequence the next samples a bit less, if you want to save money.

The next QC, not quite the last, there is one more after: after your alignment you can look at the base composition, how much of your reads map to coding bases, and how much to intergenic, intronic, or UTR bases. If you made a whole-transcriptome library, you will have a more evenly distributed proportion of reads in those different categories. If you did a poly-A selection, you really want enrichment for coding bases, so you should see a larger proportion of coding bases. It's never perfect: you have some contamination in your libraries, which is why you see some intronic or intergenic bases, or those could be true exons that were simply not annotated. You will never get 100% coding bases, but a distribution like this one, with mostly coding bases, is what you would expect for a poly-A library.

Another thing we can look at is the insert size. When you prepare the library, you cut your mRNA into fragments of roughly the same length; you aim for 300 base pairs, or 500. Then you sequence from both ends, to get your paired reads. What you don't want is fragments so short that the two reads overlap in the middle, because then you're sequencing the same piece of mRNA twice, and that's not what you want. Of course, even if you aim for 300 base pairs, you will get some shorter fragments and some longer ones. You want the fragments to be big enough, in relation to the length of your reads, that the mates don't overlap and carry independent, useful information. In particular, if they overlap, you don't want to use those reads for variant calling, because they come from the same mRNA molecule and you would count the same bases twice,
whereas the support should come from two different mRNA molecules. So that is something you have to pay attention to. It's common to use a 300 base-pair fragment size when you have 100 base-pair paired reads; now that reads are longer, 150 base pairs, you could go up to 400 or 500 base pairs. It's a choice, and the facility can recommend a fragment size. Another QC is to actually plot the distribution of the fragment sizes you got in your library, since you aimed for a particular mean.

Then, with your BAM, you can load it in IGV and start looking at the variant alleles, and see whether a particular variant is expressed and the other one is not, or both are expressed. In this case, for example, you have reads in the gene being expressed, and both variant alleles are expressed. Clearly, RNA-seq is not ideal for variant calling. There is no really dedicated tool for it; most variant callers are meant for whole-genome sequencing, DNA-seq, especially because in RNA-seq you have a very different distribution of the error rate close to the junctions. You're more likely to make an error at a splice site, and you don't want to call a lot of mutations at splice sites when they are actually unlikely to be true. In some cases you really do want to do it; we use HaplotypeCaller from the GATK suite to call mutations on RNA-seq. But again, if you want good mutation calls, you should have whole-genome data rather than RNA-seq.

The other thing, especially in cancer, is that you will never be able to say whether a variant is somatic or germline, because you don't have the matched reference. When you do whole-genome sequencing, you usually have your tumor sample and the normal, the blood of the patient, so you can call your mutations and keep only the somatic ones. With RNA-seq, any variant you see could be somatic or could be germline; you don't know, and you have no way to figure it out. You also get a biased representation of your mutations, because you can only catch mutations in genes that are expressed, or highly expressed. For the lowly expressed genes you won't have enough evidence, and for the ones that are not expressed you have no reads at all, so you miss those mutations. So it can be useful information, but it's clearly not ideal compared to whole genome. It depends on what your question is. Again: what is your question?

About QC tools: there are many of them, and many are good. I want to point out this tool, MultiQC, which you can run in a particular directory; it looks for all the log files and outputs of all the QC tools you've been running on your data, pulls them all together, and builds one large HTML report. It supports 73 tools, so it's very likely that the tools you've been running, FastQC or whatever QC tool you use, are supported by MultiQC. It's really there to summarize your overall QC measurements. This example is a screenshot from the MultiQC website. You can see an output for FastQC, with all the different plots FastQC produces listed on the left. FastQC normally gives one report per sample, and if you have a lot of samples it's hard to go through all of them; here, all your samples are summarized in one plot for each of the different assessments.
And you can see the alignment stats as well, how many reads have been aligned, how many mapped and unmapped, or multi-mapped, for example, pulled together in one good summary. So it's a very useful tool. OK. Let's move on to the last part. Any questions so far? OK.

What about expression and differential expression? We're going to talk about how to estimate expression for known genes or transcripts. We use different measures, FPKM, or you can base your analysis on raw counts; there are different reasons why you would use one or the other. Then, how to perform differential expression analysis with different methods. And then we'll touch a tiny bit on the downstream interpretation of differential expression results: you can use heatmaps, some classification, and pathway analysis afterwards. We're not going into detail on those, because it usually really depends on your question, but briefly, that's what you can do.

When you have your RNA-seq BAM, you might want to start by looking at it in IGV. You have the coverage track at the top of each sample, and that gives you a first indication of how much a gene is expressed in the different samples. However, it's clearly not going to give you a reliable answer. For example, there is a 3' bias in the first sample, and it looks like the sample at the bottom has less coverage, so you might think the gene is down-regulated there. That's the conclusion you might jump to, but you have no idea how many reads were sequenced in those two samples: if the bottom sample was sequenced a lot less, it's normal that its coverage is lower. So you need to correct for this when you estimate the expression of your genes.

To do that, the first RNA-seq papers introduced a measure called RPKM: reads per kilobase of transcript per million mapped reads. Then, when we started to have paired-end fragments rather than single reads, FPKM was used: fragments per kilobase of transcript per million mapped reads. In RNA-seq, the relative expression of a transcript is proportional to the number of cDNA fragments that come from it, but two things strongly influence the raw fragment count. First, the number of fragments is higher for a longer gene: of course, if you have a longer mRNA and cut it into many little fragments, you get a lot more fragments from the long transcript than from a short one, so you need to correct for gene length to have an accurate measure of expression. Second, the total number of fragments in your library: in a library with 100 million reads, a gene will of course seem more expressed than in a library with 50 million reads, because you catch it more often. That's the other thing you need to correct for. A simple way to do both is to use the RPKM or FPKM value; now it's FPKM, because reads are paired. The formula is as you can see here: you take the number of mapped fragments for your given gene or transcript, apply the per-million scaling, and divide by the total number of mapped reads in the library as well as by the number of base pairs in your gene or transcript, to correct for the gene length.
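Written out, the formula just described corresponds to:

$$\mathrm{FPKM}_i = \frac{f_i \times 10^9}{N \times \ell_i}$$

where $f_i$ is the number of fragments mapped to transcript $i$, $\ell_i$ is the transcript length in base pairs, and $N$ is the total number of mapped fragments in the library; the factor $10^9$ combines the per-kilobase ($10^3$) and per-million ($10^6$) scalings.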
After FPKM, another value was introduced: TPM, transcripts per kilobase per million. It uses the same idea of normalizing for library size and gene length, but the order of the operations is different. FPKM works as I just described; TPM is different because it looks more at proportions. You first divide your fragment count by the length of the transcript, then you sum all those per-kilobase values over the library, per million, and then you divide the first value by the second. So it basically gives you the proportion of a gene's expression within your library. If you sum all the TPM values in one sample and in another sample, they sum to the same total. So for a gene in sample A and the same gene in sample B, you can more easily compare the proportion of that gene in sample A to its proportion in sample B, because the totals are the same. With FPKM, the totals are not the same, so it's a bit harder to compare sample A to sample B. FPKM and TPM are widely used. I must say a pure statistician would not call them the right way to do it: they correct for the two important factors, but they're not perfect. There are other methods, based on raw counts, that do their own normalization with size factors, and some models that I'll touch on at the end of the lecture, that are more rigorous. You'll see FPKM and TPM in many papers and it's okay to use them; personally I'm not a fan, but they are widely used.
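A minimal sketch of the TPM computation, using the same hypothetical `counts` and `gene_length_bp` as in the FPKM sketch; note the reversed order of operations, which makes each sample sum to one million:

```r
# Minimal sketch of TPM (same hypothetical inputs as the FPKM sketch)
tpm <- function(counts, gene_length_bp) {
  rate <- counts / (gene_length_bp / 1e3)  # first divide by transcript length (kb)
  t(t(rate) / colSums(rate)) * 1e6         # then scale: each column sums to 1e6
}
```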
So when you have your BAM file, your reads aligned to the genome, what you want to be able to do is find the different transcripts. StringTie is part of the suite that follows HISAT2: the output of HISAT2 can be fed into StringTie, and we're going to use it tomorrow. It evaluates transcript expression: how many reads belong to each transcript, and which transcripts you have in your samples. For the same gene you can have several transcripts, so you need a method that finds all the reads belonging to the same transcript. StringTie uses a splice graph: it looks for the most abundant transcript first, puts all the reads it can into that transcript, evaluates its expression, removes those reads from the pool, and then looks for another transcript of the same gene with another combination of exons, until there are no reads left. So StringTie tells you, for a given gene, which reads belong to which transcript and how many reads belong to each particular transcript. It can work in two ways. It can evaluate expression on transcripts that are known and annotated: if you give it a GTF file with the transcript information, it will use that information. And it can do de novo transcript identification, meaning finding its own transcripts, possibly new ones that are not annotated. It's common to run StringTie in both modes, as in: evaluate the transcripts I know from the reference annotation, but if you find new ones, I want to see them too. If you use the novel transcript discovery mode, you then have to merge the results across the different samples, using stringtie --merge. In one sample, you'll have the known transcript annotation plus some new transcripts, and in another sample you might find another set of transcripts that was not found in the first sample. At the end you want a matrix of all your transcripts versus all your samples, so you need to quantify the new transcripts found in sample B in sample A as well; that is, you need to merge all the new transcripts across all the samples. That's what stringtie --merge does: it allows you to incorporate known and novel transcripts, and it then reassesses all the transcripts in all the samples, to give you a uniform, combined matrix at the end.

Tomorrow we're going to use HISAT2 and StringTie, but only on the reference annotated transcripts, so we won't look for novel transcripts and won't use the merge step. But it can be really interesting to find novel transcripts, even in the human genome: yes, there are still some genes that are not well annotated, and it could be useful for your analysis. Something you can use after this is gffcompare, to compare your novel transcripts, the output of your transcript assembly, against a GTF with the well-known annotated ones: how many new transcripts do you have, and which ones are they, just to compare to what you already know. That's another tool compatible with all this output.

Once you have the output of StringTie and have quantified your transcript expression, so you have an FPKM for each transcript in every single sample, you can use Ballgown for differential expression analysis. It basically uses a parametric F-test comparing nested linear models. You compare two conditions: you include the covariate of interest in the analysis and compare two models, one with the covariate, say tumor versus control, and one without, and you see whether including the covariate gives a better fit. A significant p-value means that the model including the covariate of interest fits significantly better than the model without it, and you can call differential expression. The output is a fold change and a p-value, and then a p-value corrected for multiple testing, a q-value, with the false discovery rate controlled at 5% if your q-value is below 0.05, as in a standard analysis. We'll do that tomorrow. The Ballgown package has some particular plots: you can look at the log FPKM values across your different samples, colored here by group, male versus female in the first plot. You can do a box plot of the expression of a given gene or transcript in the two conditions of the experiment, males versus females here. You can as well plot the different transcripts of a given gene in a given sample, with the expression of each transcript color-coded, as you can see on the right plot. The Ballgown object within R, which you'll get to play with tomorrow, has a particular structure: you can extract expression at the transcript level, at the exon and intron level, and at the gene level, and it has a set of indexes that link the exons, the introns, and the genes to the transcripts. So it's a very structured object, with a set of functions to query it and extract what you need. We'll use that tomorrow in the lab.
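Here is a minimal sketch of what such a Ballgown run can look like in R. The data directory, sample pattern, and phenotype table are hypothetical, and it assumes one StringTie output folder per sample (StringTie run with its Ballgown output option) under "ballgown/".

```r
# Minimal sketch of a Ballgown differential-expression run (hypothetical paths
# and phenotype table)
library(ballgown)

pheno <- data.frame(ids = c("sample1", "sample2", "sample3", "sample4"),
                    condition = c("tumor", "tumor", "normal", "normal"))
bg <- ballgown(dataDir = "ballgown", samplePattern = "sample", pData = pheno)

# Nested-model F-test on transcript FPKM: condition vs. no condition
res <- stattest(bg, feature = "transcript", covariate = "condition",
                getFC = TRUE, meas = "FPKM")
head(res[order(res$qval), ])  # fold change, p-value, q-value per transcript
```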
There are several alternatives to FPKM. Raw read counts can be used for differential expression analysis instead of FPKM. Raw counts means, in this case, simply the number of reads that map to a given transcript or gene, without correcting for library size or gene length. How do you get the raw counts? You can get them with htseq-count, from the HTSeq tool published some time ago: from a BAM file, you run htseq-count with the particular parameters described here and get a raw count matrix. Or you can use the STAR aligner: it has a counting option, --quantMode GeneCounts, which directly gives you a table of how many reads belong to which gene, just as you would get from htseq-count. So you don't have to align and then run htseq-count to get raw counts: with STAR you can add this option and directly get a read count file for every single sample. These files, collected for every sample, can then be the input of differential expression packages such as DESeq2 and edgeR, which need raw counts as input.

Yeah, I'm nearly done. I just want to point out that both htseq-count and STAR actually return three count columns: the total read count, then the count from one strand, and then from the other strand. It seems weird, but depending on how your library was constructed, you may need to take the reverse-strand column. It really depends on how the library was constructed and which kit was used; most kits are strand-specific. If your library is not strand-specific, you take the total in the first column, but if it is strand-specific, you need one of the two other columns, and, surprisingly, it's often the reverse one because of how the library is constructed. If you look at your read count file from STAR, you'll easily see that the highest numbers are in the last column, and that's the one you should take. It's something to pay attention to, because you might take the other column because it seems more intuitive, and then your read counts will be much lower than they should be. The same thing happened with RSEM, another tool to evaluate transcript expression, which has a strandedness parameter: it was really not intuitive, and it was not the default value; you needed the reverse setting to get the right read counts. So yeah, something to keep in mind.

Would you use FPKM or read counts? With FPKM, as I said, you can leverage the Tuxedo suite, all the tools we went through today; it's good for visualization and, I would say, more intuitive. However, with count-based analysis, you can use more rigorous statistical methods for differential expression and model more sophisticated experimental designs with appropriate statistical tests. I prefer the second option, but it's just personal. To do that, you would use a tool like DESeq2 or edgeR. DESeq2 lets you do the differential expression analysis, and also gives you a matrix normalized with size factors: that's another way to normalize for total library size, not just by dividing by the library size, and a better one. And then you can transform the data with the variance stabilizing transformation within the package, which controls the variance of the gene expression values.
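Putting the count-based route together, here is a minimal sketch in R going from STAR gene-count files to DESeq2. The file names and sample layout are hypothetical, and the third count column is taken on the assumption of a reverse-stranded kit, as discussed above.

```r
# Minimal sketch: STAR ReadsPerGene.out.tab files -> DESeq2 (hypothetical files)
library(DESeq2)

files <- c(tumor1  = "tumor1.ReadsPerGene.out.tab",
           tumor2  = "tumor2.ReadsPerGene.out.tab",
           normal1 = "normal1.ReadsPerGene.out.tab",
           normal2 = "normal2.ReadsPerGene.out.tab")

# Skip the 4 summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous);
# data column 3 = reverse-strand counts (use column 1 for an unstranded library)
tabs <- lapply(files, read.table, skip = 4, row.names = 1)
counts <- sapply(tabs, `[[`, 3)
rownames(counts) <- rownames(tabs[[1]])

coldata <- data.frame(condition = factor(c("tumor", "tumor", "normal", "normal")),
                      row.names = names(files))
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ condition)
dds <- DESeq(dds)    # size factors, dispersion estimates, model fit
res <- results(dds)  # log2 fold change, p-value, BH-adjusted p-value (padj)
vsd <- vst(dds)      # variance-stabilizing transformation for clustering / ML
```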
The point of that transformation is that you don't want the low-expressed and the high-expressed genes to have a larger variance just because they are low or highly expressed; that's what it's good for, and it's useful for clustering or if you want to apply some machine learning methods. Note that you don't need to care about gene length when you compare all the samples in condition A versus all the samples in condition B, because your gene is equally long in both conditions: you're basically comparing the expression of gene A in one set of samples to the expression of the same gene in the other set. If the gene is long, it will have a lot of reads in both; if it's differentially expressed, you'll have more reads in one of them. So gene length is not corrected for within DESeq2 for differential expression, because you don't need it there. However, if you extract your normalized matrix and then want to rank gene expression within a sample, to run GSEA for example, a pathway analysis, then you will need to correct for gene length, because you don't want your long genes at the top, looking highly expressed just because they are long. So for cross-sample comparison you're fine; for within-sample comparison, you need to correct for gene length. DESeq2 has an option for that: once you have your matrix normalized with the size factors, you can apply a gene length correction and get an FPKM value. That FPKM is not exactly the same as the one you would get with the formula I gave you before, but it's something to point out.

Of course, the different methods give you different answers; it's never perfect. You could take a multiple-approach option: run your differential expression with three different tools and only keep the genes found in common. We do that a lot for mutation calling, not necessarily for RNA-seq, where you usually stick to one method. And of course you need to correct for multiple testing, because you test all the genes in your genome, or at least the ones that are expressed: you're more likely to make an error on which ones are differentially expressed because you did so many tests. So, as usual, you need a multiple testing correction. You can do that using the q-value: you have a p-value and then a corrected one, the q-value, to control the error rate. Ballgown, DESeq2, and edgeR all output a q-value, so usually you use that and put a threshold on it to say: these are the genes that are differentially expressed, because their q-value is below 0.05.

And then my last slide: what do you do with all of that? That could be the topic of an entire course. You can do downstream analysis using the differential expression results: some clustering, looking at heat maps, pathway analysis, maybe trying to build a classifier using a random forest approach or something like that. I really don't know what your question is and what you want to do with the data, but at least you have the data processed and analyzed, and you can do more with it. We can chat about it if you want to. And that's it.