So, this module this morning is about gene expression profiling, and especially RNA-seq analysis, and it gives a fairly wide overview, up to differential expression. Actually, I'm going to go back to that in a second. A very few words about myself: I'm Florence Cavalli. I'm currently a junior PI at the Institut Curie in Paris. I did my post-doctoral fellowship at SickKids with Michael Taylor, working on pediatric brain tumours, medulloblastoma especially. I then joined the Stand Up To Cancer Canada Dream Team, working on glioblastoma and cancer stem cell datasets as a bioinformatics project leader, before I applied and started my lab just a year ago. So the lab is still quite young and it's a lot of fun.
This lecture is about RNA-seq, and I must give credit for most of the slides to Obi. The content of this workshop is open: you can share it, reuse it, and modify it, as long as you acknowledge the source. Most of the lecture was created by previous instructors, and I added a few things drawn from my own experience.
So, the objectives: I will give an introduction to RNA sequencing, talk about alignment and visualization, and then we're going to explore the expression and differential expression analysis that you're going to do on your own during the lab. We're going to talk about the rationale for sequencing RNA and what the challenges are. When you go through an RNA-seq analysis you are going to ask yourself some questions, so we're going to go through the most common ones and talk about technical questions related to RNA-seq analysis.
This is something you are probably very well aware of, just to bring it back to the central dogma: the genome is expressed, we have the primary transcript, and splicing gives a mature mRNA. This is what we're exploring with RNA sequencing, and it is what gets translated into proteins, the active elements of the cell. A very broad overview of RNA sequencing in a cancer setting: usually you have different conditions, or different samples or tumor types. A typical experiment would have a tumor sample and, next to it, a normal sample of the same tissue. First the RNA is isolated, for example using the poly-A tail; then the cDNA is generated, cut into fragments, and linkers are added for sequencing. We get short reads, very short reads, that we align to the genome, and then we evaluate the expression of the genes that we know of, or we can refine the annotation, and there is downstream analysis.
So why are we sequencing RNA as compared to DNA? This is essentially for functional studies. The genome may be constant, but in certain conditions different genes can be expressed: you can have different expression when a cell line is drug treated versus untreated, or in wild-type versus knockout mice, and you can compare a human tumor versus a normal sample of the same tissue. RNA-seq also helps because predicting transcript sequences from the genome alone is difficult, so the information from RNA-seq allows us to better predict the transcripts and get a better gene annotation. And you can do further analyses, looking at alternative transcription and RNA editing.
Another point is that it can help to interpret mutations that you find in the DNA. If you have both whole genome sequencing and RNA-seq, it could possibly explain the regulatory function of a mutation in a non-coding area that affects expression. So if you have a mutated gene in your tumor, it is the RNA-seq analysis that gives you some evidence towards that. And you might want to explore whether a mutation detected in your whole genome or exome analysis is actually expressed or not.
So, the challenges of RNA-seq, linked to the generation of the data and the analysis. First is the quality of the sample, with the purity. If we consider tumor samples: are all the cells in your bulk sample tumor cells, or are there infiltrating, non-tumor cells? Then, as you know, RNA consists of small exons which are separated by large introns, so when we map the reads to the genome it can be challenging, because a read can be split across exons with an intron in the middle, and we're going to talk about this. Of course, the relative abundance of RNAs varies widely, over orders of magnitude, and RNA sequencing is based on random sampling. So you catch RNA from a pool of RNA, and the deeper you sequence, the more fragments you catch, but it is always a random sampling. And there is a predominance of ribosomal RNA, so there are different techniques to enrich for coding genes, for example, and get rid of it. There are different sizes of RNA. You need a specific protocol for the smaller RNAs, such as microRNA, to be captured and analyzed. And if you do poly-A selection, which we often do, it can be 3' biased. So one of the first things you need to check is the quality of the RNA in your samples. One of the measures is the RIN, the RNA integrity number, that you can get from the Bioanalyzer.
At the maximum, a very good quality sample would have a RIN of 10, which is the plot on the right, where you can clearly see the two big peaks of ribosomal RNA. That is what you expect, because it makes up about 80% of the RNA in a sample. A RIN of 6 would give a plot like this one, where you have RNA of a lot of different sizes, which is not what you want, because the RNA is degraded. So quite often we require a RIN of 7 or above to go on and generate RNA-seq for a given sample. But we might go lower if it's a very precious sample and you still want to try, knowing that you might spend money and not get very good quality afterwards. So it's a choice you have to make at the beginning of your experiment: which samples you sequence, as a function of their quality.
There are different ways to enrich for what you want when generating RNA-seq. You can consider total RNA. In this case you would have a lot of ribosomal RNA, for example, but that might be interesting for particular analyses. You could do ribosomal RNA reduction. Another way is poly-A selection, where you use the poly-A tail of the transcripts to select for them and enrich for them. So you need to know what type of library you have, and choose the one that corresponds to the question you're going to ask with your RNA-seq.
This next point is a bit historical, I would say, because now 99% of libraries would be stranded. But in the past, with the first RNA-seq kits, you had the option of a stranded or an unstranded library. And we realized that having the strand information is very useful, especially when you have two genes located on the two different strands of the DNA that could both be expressed. If you have a read in the region where the two genes overlap, you cannot know if it comes from gene A on the forward strand or gene B on the reverse strand. However, with a stranded library, you are able to do that.
So the estimate of the expression of the gene is a lot more accurate. So we do stranded RNA-seq.
Could you please explain a little more what is stranded and unstranded? I didn't understand that.
In the process of creating the library, you are able to keep the information about which strand the reads come from. When you align, this information is taken into account by the aligner, and then you know where each read comes from. You can also visualize it later on in IGV: there is a color code, and you can look at the reads coming from one strand or the other. And we know whether the gene is annotated on one strand or the other, so it allows you to place your reads more accurately. But honestly, if you sequence a new sample now, I don't think they would make a non-stranded library for you. It's very useful information, and it is very common now.
Yes, replicates. Replicates are always very important in our analysis, to have the statistical power to actually reach significant conclusions. However, it's always a choice of how many replicates you need and how many replicates you can afford in your experiments. If we consider tumor samples, most of the large studies would consider each tumor of a given type as a replicate of another tumor from another person. So we're not necessarily going to take two pieces of the tumor of a given person, except if you have a particular question about that tumor. So in a differential expression analysis, if we take tumor versus control, all the tumors would somehow be replicates, and the controls would be replicates as well. However, if you have experiments with mice, you do want to do replicates, because you can control it better and have more samples coming from mice with or without drug, for example, or with cell lines.
If you do experiments with cell lines, drug and no drug, you do want several replicates. When you have replicates, you want to check that the correlation between them is very good. If it's not, you have to double check and figure out why. So there are biological replicates and technical replicates.
Common analysis goals of RNA-seq analysis. We generally look at gene expression: how much a gene is expressed, and what the differences in expression are between condition A and B, for example. We can look at alternative expression analysis, and discover new transcripts, particularly ones expressed in a given tissue or sample. We can study allele-specific expression, which relates to SNPs or mutations, although RNA-seq is not ideal for identifying them; we're going to talk a little bit about that. Yes, mutation discovery. There are also tools to detect fusions from the sequencing data, as well as RNA editing. Once your RNA-seq data is normalized, you can also do sample clustering and find which samples are more similar to each other, for example to group the subtypes of a particular tumor type. And then you can combine with other datasets. You can also integrate clinical data, and you will talk about that later during the week.
So roughly these analyses have the same pipeline. The library is prepared, you choose the depth you want, and you obtain the raw data from the NGS facility where it was done. You need to check the quality. Then you need to align the reads to your reference genome, and then process the alignment with different tools to estimate expression or detect fusions, for example. Differential expression comes once you have the expression estimates and some post-processing. Downstream you generally use R, MATLAB, Cytoscape, and we're going to see some of the applications. Quite often you want to create a list of candidates for validation.
But there are many other types of analysis you can do, depending on your question.
There are some common questions when you start analyzing. So actually I'm going to ask, I don't know if you can all raise your hand: how many of you have already analyzed RNA-seq data? Let's try. I can see one, two. I guess there is a yes option. Three. Okay. Thank you for raising your hand.
So for those of you who are already analyzing RNA-seq datasets, maybe you asked yourself: what do I do with the duplicates? Should I remove them or not? This is a different question and a different setting from DNA and whole genome sequencing, because here the start of a read is constrained by the start of the transcript. So the reads are not randomly distributed, because they all come from particular transcripts, and the start is defined by the transcript. So in general, we don't remove them. But if you do want to, for whatever reason, pay attention to assess them at the pair level and not at the single-read level. We don't want to distort the distribution by removing them, because it's rather unlikely that they are true duplicates, I mean duplicates due to amplification.
How much library depth is needed for RNA-seq? This really depends on your question and what type of analysis you want to do. If you want to do, I would say, just a basic expression analysis and find the most differentially expressed genes, you don't need to go very deep. But if you want to look at alternative transcript expression, or even mutation calling, you do need to go deeper in your sequencing, to have more reads to support the identification of an alternative transcript, for example. It also depends on a few other things: the read length, and whether it's paired-end.
Now we pretty much always do paired-end. And it depends on how much computational resource you have, but I hope most of you have access to a cluster to do analysis. One way to start, when you have no idea how much you need for your library, is to identify publications which have done something similar to what you want to do, with similar goals. You can for sure talk to your sequencing platform. They are quite used to that, and they will likely recommend the standard library size used for a particular analysis or tissue. You can perform a pilot, spending a bit more money to sequence deeper, and then evaluate how much information you gain by down-sampling your reads, compared with the answer you want to get. Just to give an idea: we often say that 100 million pairs is a good depth, a good size for the typical analyses of RNA-seq data, just to give a ballpark to start from. You can go higher or lower depending on the budget you can afford for RNA-seq data generation.
What mapping strategy should you use for RNA-seq? If you have short reads, which is less and less common now, you can align to the genome plus a junction database with an aligner such as BWA, or you can do an assembly, with something such as Trans-ABySS. But most of the time your reads are longer than 50 base pairs, and you want to use a spliced aligner such as Bowtie/TopHat, which was one of the first ones, and now we usually use STAR and HISAT, HISAT2, which is what we're going to talk about in this lecture.
What if you don't have a reference genome? This is unlikely for people who work in cancer, because usually we have data coming from human or maybe mouse models. But just for you to know, if there is no reference genome, there may be the possibility to actually sequence the genome.
Or you can do a de novo assembly of the transcriptome, but that's outside the scope of this lecture, for sure. So, is there any question about this introduction to RNA-seq?
Florence, maybe you could just talk briefly about the computational requirements for RNA sequencing alignment, because sometimes it can be pretty large, especially for index generation. So what sort of resource should the students be looking for?
I'm going to talk a bit about that when we talk about the aligners and which one we use. Oh, great. Thank you.
Yes? So I have a question about removing the duplicates. You said it's not good to remove the duplicates in an RNA-seq experiment, right? It's better to keep them because they can be useful information, and they are less likely to be true PCR duplicates. So how do you distinguish whether it's a PCR duplicate or not? I mean, do you need to first remove the duplicates and then do the analysis without the duplicates and see whether there's a change, or something like that?
You can do that once, to convince yourself that you should keep them. But think about the start of a transcript on the genome: you're going to have plenty of reads that start at the same place, because it's a gene that is really highly expressed and you have an enrichment of reads there. So you don't want to remove all these reads, which could be PCR duplicates but are more likely to come from several transcripts of the same gene. I will actually show you a plot that can give you an idea of whether you have a lot of duplicates and whether this should be a worry or a concern. Thank you.
Can I just add a little bit to this discussion? Yeah. The other consideration is with PCR duplicates.
That is a real consideration for mutation calling, which might have been discussed in previous modules; for RNA-seq you're not usually doing mutation calling. You want to quantify your gene expression, so you don't want to remove part of your actual gene expression. Especially for small genes: you can only break that RNA in certain ways, so you're going to see the same breakage over and over again. So by removing duplicates you're introducing a bias: you risk removing the wrong thing in small genes, and in genes that have a really high expression value, where you're likely to get the same breakage over and over again. So to know if your library is good or not, I think looking at library complexity is the best thing to do. But nobody really removes duplicates anymore for RNA-seq data. Okay, so that's my addition. All right, thank you.
And so in the second part, we can talk about the basics of alignment and some common questions: what are the strategies, what kind of aligner you could use, and the output of the aligner, which is a BAM file, and BED files. I believe you have already gone through looking at them. You can visualize the alignment with IGV. And then we're going to see a set of QC assessments you can do after alignment to check that everything is all right to continue the analysis.
So with RNA-seq you have a lot of reads, and these have to be placed on the genome, considering that they're likely to come from exons with introns in between. So you want a splice-aware aligner. If you have never done that, you can run the aligner once, but you're probably going to have to play with the parameters, estimate the quality of your alignment, and do it a few times before you are happy with it.
Hi, sorry, there's a question on Slack that I thought might be useful to everybody.
What do you mean by library complexity?
So library complexity has to do with how much input material you had. If you didn't get much RNA out of your tissue and you make a library from it, then, say, those 1,000 transcripts that you isolated from your tissue, you're going to make a big library that just represents that low-complexity sample. Whereas if you get a really good sample from your tissue, you're going to see 25,000 genes, right, and all their isoforms, etc. And so the deeper you sequence, the more unique things you find. So when you look at how many genes you are identifying as you increase the number of reads, that line should go up. And then eventually it's going to tail off, because you've found everything and you're just going to find more transcripts of those same genes. But if your library complexity is low because you didn't start with much, you're going to sequence a bit, find those 1,000 things, add more sequence and not find anything new, so you've plateaued. So looking at library complexity is one of the QC steps that you'd want to do with your RNA-seq data. And if your library complexity is low and you're finding the same things, you're going to have all these duplicates, which are real PCR duplicates, because you didn't have much to amplify to begin with. So hopefully that makes sense.
There's going to be a plot which supports what you're saying, just after the QC. Yeah.
So there are different aligners that you can use to align your RNA-seq reads to the genome. One of the common ones that people use is HISAT2. For example on Biostars, you probably know this website where people ask questions, you can see conversations about which one you should use, what the advantages are, and comparisons between the different ones.
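The library complexity check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what any particular QC tool does: `saturation_curve` is a hypothetical helper, and it assumes you already have one gene identifier per aligned read. It subsamples the reads at increasing fractions and counts how many distinct genes are detected at each depth.

```python
import random


def saturation_curve(read_genes, fractions=(0.1, 0.25, 0.5, 0.75, 1.0), seed=0):
    """Count distinct genes detected at increasing subsampling depths.

    read_genes: one gene identifier per aligned read (an assumed input format).
    Returns a list of (number_of_reads, number_of_distinct_genes) pairs.
    """
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    shuffled = list(read_genes)
    rng.shuffle(shuffled)              # random order, so a prefix is a random subsample
    curve = []
    for fraction in fractions:
        n_reads = int(round(len(shuffled) * fraction))
        n_genes = len(set(shuffled[:n_reads]))
        curve.append((n_reads, n_genes))
    return curve


# Hypothetical toy libraries: one dominated by 5 genes, one where every read is new.
low_complexity = ["gene%d" % (i % 5) for i in range(1000)]
high_complexity = ["gene%d" % i for i in range(1000)]

for name, library in (("low", low_complexity), ("high", high_complexity)):
    print(name, saturation_curve(library))
```

A curve that flattens almost immediately, as in the toy low-complexity library, is the plateau described in the answer; a curve that keeps rising means deeper sequencing is still finding new genes.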
And so there are three strategies. You can do de novo assembly if you don't have a reference. You can align to the transcriptome, but only if your reads are really short. But more generally we align to a reference genome with a spliced aligner, and this is what this slide says: more and more often you will align to a reference genome. The other strategies, de novo assembly and alignment to the transcriptome, we are not going to talk about today.
This is a plot that I find really interesting, but I didn't find a newer version of it. It was published probably in 2018. It shows the different tools as they were published and how people progressively started using them. One of the very common aligners was TopHat, which had a very large user community; it came out in 2009, I believe. But it requires a lot of memory and it takes a lot of time. For a 100-million-read library, you would need a day to run it. You couldn't run it locally. So it was high memory usage and high time usage back then. It was very useful at the time, and the best at the time. The newer aligners had several aims. One was better mapping accuracy: you want each read to have the best placement on your genome, and to be sure that it comes from that particular place. And you want to not use too much memory, so reduce the memory usage, and reduce the time the tool needs to align your samples. So a few years after TopHat you had TopHat2, and then you had STAR, which was a lot faster: from one day it went down to about 20 minutes to align a sample of 100 million reads. However, you still need 28 gigabytes of memory, because you need to store the reference genome, and this is a very large reference.
STAR uses the index in a new way, so the search is done differently because you have the index, and that's how the time is reduced. But the memory usage was still quite large. And HISAT actually solved, or I don't know if I can say solved, but improved, both the memory usage and the time. You need about 4 gigabytes to hold the different indexes, so you can run it on your laptop, on a local machine. And it takes about the same time as STAR, so you need about 20 minutes to align a 100-million-read RNA-seq sample.
So here I'll talk about HISAT and HISAT2, very briefly how it works and what makes it quite fast and efficient. Of course, you have the problem that RNA-seq reads can span large introns, so you need to figure out which part of a single read matches a particular exon, skip the intron, and then match the next exon. So how do you find that? Initially TopHat, like other aligners, would take your read and try it against the whole genome, and it would take a very long time. Now HISAT is a spliced aligner for RNA-seq, and it uses two types of indexes to do the search in two steps. There is a general, global search: you look for a match of the beginning of the read, up to 28 base pairs, and when you find a very good match for this beginning of the read, you then do a local search with different indexes. Doing it in two steps really reduces the amount of time you need to find the best match. So yeah, there are about 48,000 local indexes that you use in the second step to finalize your mapping. I will go through three examples. Case A is a read that maps fully within an exon, which is the simplest case.
So the idea is that you first align the read with the global index, which is, I would say, the rather slow part, and once these 28 base pairs map to exactly one location, that leads to the extension mode, which you can see as this purple arrow. Then if you continue mapping the read perfectly up to the end of the read, you have a good match and that's it. That is the first, simplest case. Scenario B is a read that maps mainly to one exon, and then you need to skip an intron and map the rest to another exon. In this case, you do the first global search until you have an exact match of 28 base pairs, then you extend as before. You map, for example, 90 base pairs this way, and when the read doesn't match anymore, you switch to a local search to map the remaining base pairs. And in the third scenario, example C here, half of the read maps to one exon and half maps to the other exon. So you start with a global search to find the match for the first 28 base pairs, then you extend, and when it doesn't map anymore, you use a local search in this area to find a new match and then extend again. So this is how they solved the memory and time usage problem.
And I actually have a question for you. Has anyone who has already done some RNA-seq analysis used HISAT? Or I'd like to know if you've used another one.
We are using STAR.
STAR is very good as well. You do need a cluster for it, for sure. And it's part of many pipelines as well. I use it as well. Let's continue.
So should I allow multi-mapped reads? Again it depends on your application and which kind of analysis you want to do. If you want to do expression analysis with tools such as StringTie, you should allow multi-mapped reads.
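The three read scenarios walked through above (global seed search, extension, then local search for the remainder) can be illustrated with a toy sketch. It is not HISAT's implementation: Python's `str.find` stands in for both the global and the local indexes, only exact matches on the forward strand are handled, at most one intron is allowed, and the seed is 8 bases instead of 28 so the example stays small. `spliced_align` and the toy sequences are hypothetical names made up for this illustration.

```python
def spliced_align(read, genome, seed_len=8, window=1000):
    """Toy two-step alignment: global seed lookup, extension, then local search.

    Returns a list of (read_offset, genome_offset, length) blocks, or None.
    Assumptions: exact matches only, forward strand only, at most one intron,
    and str.find is a stand-in for the real global and local indexes.
    """
    if len(read) < seed_len:
        return None
    # Step 1: "global" search for the first seed_len bases of the read.
    anchor = genome.find(read[:seed_len])
    if anchor < 0:
        return None
    # Step 2: extension, base by base, while read and genome still agree.
    matched = seed_len
    while (matched < len(read)
           and anchor + matched < len(genome)
           and genome[anchor + matched] == read[matched]):
        matched += 1
    blocks = [(0, anchor, matched)]
    if matched == len(read):
        return blocks  # scenario A: the read lies entirely within one exon
    # Step 3: "local" search for the rest of the read, only in a window
    # downstream of where the extension stopped (scenarios B and C).
    rest = read[matched:]
    start = anchor + matched
    hit = genome.find(rest, start, start + window + len(rest))
    if hit < 0:
        return None
    blocks.append((matched, hit, len(rest)))
    return blocks


# Hypothetical toy genome: two exons separated by one intron.
EXON1 = "ACGTACGTTAGC"
INTRON = "GT" + "A" * 20 + "AG"
EXON2 = "CCATGGATCCAA"
GENOME = EXON1 + INTRON + EXON2

exonic_read = EXON1[2:12]                  # scenario A
junction_read = EXON1[2:12] + EXON2[:8]    # scenarios B and C: spans the intron
print(spliced_align(exonic_read, GENOME))
print(spliced_align(junction_read, GENOME))
```

The point of the design is that the expensive whole-genome lookup happens once per read, and everything after that is restricted to a small neighbourhood of the anchor.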
You should also allow multi-mapped reads if you want to look at fusions, because that can help you to actually identify a fusion. In DNA-seq, when a read is multi-mapped, it is often just randomly assigned to one particular place, which might not be the right one. With the read's other possible placements, you could detect a fusion, for example, that you would otherwise miss. So you want to keep the option of having reads aligned to several equally good places, and so you keep multi-mapping open. But it's a question you need to ask yourself, depending on what you want to do.
The output of the aligner is a SAM or a BAM file. SAM stands for Sequence Alignment/Map format, and BAM is the binary version of SAM, which takes less space. I believe you have already looked at some BAM files in the previous days, so I don't need to explain much. You have information about the alignment at the top, in the header: for example, you can find the command that was used. If you get a BAM from somebody else, or a dataset that you download, and the method is not well described in the paper, you can look at the header of the BAM to find exactly which parameters were used to align it. And you also have information in the header about what the different fields are. Then in the body you have the information about the alignments: the read name, where it mapped, over how long, whether it was a perfect match or not, the sequence and the quality, and a lot more. And the more you process your BAM file, the more information can be added to the file.
Working with BAM is very common, and quite often you want to access a particular location directly, or look at a particular region of a gene. For this you can use a BED file to specify where you want to go, or which reads you want to retrieve at a particular location. This is very useful if you have a big BAM and you want to look at a particular place.
So if you want to look at your favorite gene in IGV, you can create mini BAMs, which you can much more easily download and look at in IGV locally, rather than waiting a long time for your big BAM to load. You just specify, in BED format, the chromosome name, start position and end position of your region of interest, your exon or your gene, subsample the BAM to the reads of interest that map in this region, and load that in IGV, for example. Yes, is there a question? Okay.
There are tools to manipulate BAM and SAM files, such as samtools and Picard, and tools to manipulate BED files, such as bedtools. You're going to use samtools for BAM in the lab, and you have probably used it before. How are BAMs sorted? Generally they are sorted by position, but it actually depends on the tool you're going to use afterwards to query your BAM file. Most tools require them to be sorted by position, but some want them sorted by read name, so that the pairs are next to each other for particular analyses. So you might need to re-sort your BAM depending on the software that will be used. With samtools you can easily sort by position or by read name. A common error is that your tool doesn't run because your BAM is not sorted the way it needs to be. So just keep in mind that this could be an issue.
So, IGV: you already opened it yesterday when you looked at mutations. You can load a BAM file from an RNA-seq alignment in IGV, and you can look at the reads, and you can see here that some reads are spliced: part of the read maps to one exon and the end of the read maps to another exon. One of the IGV options is to show the strand as well. There are also alternatives to IGV. You might have your favorite.
One commonly used alternative is Savant, and there are others. It's, I would say, a matter of taste; IGV is very common and useful.
We're going to talk about a few ways to assess the quality of the alignment, with several plots you can produce thanks to RSeQC, which is linked at the bottom. There are other tools as well, but this is a very useful and common one. One of them is to look at the alignment QC for 3' or 5' bias. Basically, for every single transcript, the positions are binned into a hundred bins, and then you look at the read coverage across these hundred bins over all transcripts. Then you can check whether your reads map properly all along the transcript between the 5' and the 3' end, or whether you have a clear bias. You can see in the second distribution here a lot more reads mapping at the 3' end. This is something you want to be aware of. In that particular case, it's likely that you have libraries that have been prepared differently, and you do want to consider that across all your samples, because this distribution can bias your analysis later on. If you don't pay attention to it and you find clusters of samples at the end, they might come from this difference at the beginning. It's unlikely that you will have different libraries in your own data; if so, the sequencing facility should tell you. But if you download data from a public dataset, you should check that they all have, I would say, a good QC. And in this case, for example, where some samples are of one type and some of the other, you could maybe put them together with some batch-effect correction. The 3' bias is often there because you have more degradation of the RNA at the 5' end, so it can be quite common. Another thing you want to assess is the nucleotide content at the beginning of the reads.
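The binned coverage check described above can be sketched as follows. This is a minimal illustration of the idea for a single transcript, not RSeQC's actual code: `gene_body_profile` is a hypothetical helper, and it assumes read positions are already given in transcript coordinates, 0-based and half-open, running from the 5' end to the 3' end.

```python
def gene_body_profile(transcript_len, reads, n_bins=100):
    """Mean read coverage in n_bins equal slices of one transcript, 5' to 3'.

    reads: (start, end) pairs in transcript coordinates, 0-based, half-open.
    This input format is an assumption made for the sketch.
    """
    coverage = [0] * transcript_len
    for start, end in reads:
        for pos in range(max(0, start), min(end, transcript_len)):
            coverage[pos] += 1
    profile = []
    for b in range(n_bins):
        lo = b * transcript_len // n_bins
        hi = max(lo + 1, (b + 1) * transcript_len // n_bins)  # never an empty bin
        chunk = coverage[lo:hi]
        profile.append(sum(chunk) / len(chunk))
    return profile


# Hypothetical example: every read sits in the last 20% of a 1,000-base transcript.
profile = gene_body_profile(1000, [(800, 1000)] * 5)
print(profile[:3], profile[-3:])
```

Averaging such profiles over many transcripts gives the curve shown on the slide; a curve that rises steeply towards the last bins corresponds to the 3' bias discussed here.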
So the primers should be random, so you should have about as much A, C, G and T across all of the reads. But when you actually plot the frequency of A, C, G and T as a function of the read position, you will often see that over the first 10 base pairs it is not as flat as the expected 25%. We usually trim those, because it's not real information and it can mess up your mapping a bit. So you can try to keep them, do the alignment and some QC, and then do the same after trimming the first 10 base pairs. And if the mapping improves, you would likely trim the beginning of your reads.
We have another question on the Slack. Beena has asked: what exactly is the 3' and 5' bias? Could you explain it biologically? So there is a natural tendency of your RNA to be degraded, and it is more degraded at the 5' end, so you are less likely to get reads that come from the 5' end than from the 3' end. And when you do poly-A selection, you get a bias toward the 3' end.
Yeah, I have a question. Is removing the primer and removing the adapter two different things, or is it the same thing? Because you can remove adapters using Trimmomatic. So if you remove the adapters with Trimmomatic, it only removes the adapters but not the primer section, right? It's been too long since I used this particular tool, so I cannot tell you yes or no with 100% certainty. Okay, so are removing primers and removing adapters two different things? Yeah, correct me if I'm wrong, Serena. You do want to remove the adapters, and then you get to the primers, and usually you remove them as well. Would you agree? So the adapters are the extra sequences you're adding so that your cDNA sticks to the flow cell, right? The primer is random, and usually it's something like nine base pairs long, and yeah, it is, because you can see it there on the plot.
It is a random primer, and the ones that are complementary to your RNA are what will bind to your RNA. So they have the adapter on one end, then the primer, and they bind perfectly and extend when you make your cDNA. So it actually matches your RNA molecule. You don't want to remove the primer, because that was already a sequence that was in your RNA molecule, and then you extended it, right? So you made your cDNA, which now has adapters, so you can sequence from the adapters in. The reason you see this pattern is, I guess, that the random primers are not perfectly random: you always see some skew in their base distribution, which is not the same distribution that you're going to see in the human genome, right? So that's okay. It's just that those first cycles in the sequencing read will be skewed away from the rest of the read, which is not the primer, sorry, from the rest of the mRNA. You're just starting off with the primer. So you don't want to remove the primer. You just want to remove the adapters. The primers should match what your RNA bases were; they'll be complementary. Okay. Even though you see this spike? Yes. Okay. All right. Thank you.
Another thing to check is the quality of the base calling. The way we do it is with the Phred quality score, which is minus 10 times the log10 of the probability that the base call is wrong. So a Phred score of 30 means that there is a one in 1,000 chance that the base call is wrong. So you want a Phred score above 30, for example. The plot on the right is actually a good plot, because all along the positions of the read the Phred quality score is quite high and stays above that threshold. So this is a good quality sequence. Let's go back to the question of the PCR duplicates.
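As a small illustration of the Phred relationship described above, Q = -10 · log10(p), here is a minimal Python sketch. The function names are made up for this example, and the ASCII offset of 33 used to decode a FASTQ quality string is an assumption (it is the usual encoding for current Illumina data, but check your own files).

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for a given probability of a wrong base call."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(char) - offset for char in quality_string]


if __name__ == "__main__":
    # Phred 30 corresponds to a 1 in 1,000 chance of a wrong call.
    print(phred_to_error_prob(30))
    print(decode_fastq_quality("I?5"))
```

With this, a per-position quality plot is just the decoded scores of many reads summarized position by position.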
So one plot you can produce, the one on the right, gives you an idea of whether duplicates could be a problem in your library, or whether something went clearly wrong when you generated it. You compute the occurrence of the reads, that is, how often a read maps to the same position, against the number of reads, on a log scale. Of course you have a few reads that are duplicated like this, but the curve should go down: most of the reads are unique and not duplicated.
This is another plot generated by this tool that can tell you how much you need to sequence: the saturation plot. Basically you consider all of your reads, take a percentage of them, and evaluate how many known junctions or novel junctions you get when you only use 10%, 20%, 30%, 50%, etc., up to 100% of your library. Do you gain information or not? Here we can see that the known junctions plateau at some point and the curve stays flat, which means that by using more and more reads you don't get more information. This is linked to what I was mentioning: when you downsample your reads, you can look at how many genes you find as being expressed with 10, 20, 30, 40% of your reads. And if it becomes flat at 50% of your reads, you actually spent half of the money for this library re-sequencing the same thing. You don't gain more information with more reads. So that's one way to estimate, if you do pilot experiments for example, whether you need to sequence more or less for the next samples.
You can check the read distribution as well, meaning where the reads map. If you have a poly-A library, you expect most of them to map to coding regions. If it's a whole-transcriptome library, this proportion could be smaller, and you will have some intronic reads and some reads that map to intergenic regions.
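The downsampling idea described above (how many genes do you still detect with 10%, 20%, ... of your reads?) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what RSeQC does internally: the input is a hypothetical list with one gene identifier per mapped read, and "detected" simply means at least `min_reads` reads.

```python
import random
from collections import Counter


def saturation_curve(read_genes, fractions=None, min_reads=1, seed=0):
    """Number of genes detected when only a fraction of the reads is used.

    read_genes: one gene identifier per mapped read (hypothetical input).
    Returns a list of (fraction, number_of_detected_genes) tuples.
    """
    if fractions is None:
        fractions = tuple(i / 10 for i in range(1, 11))  # 10% ... 100%
    rng = random.Random(seed)
    shuffled = list(read_genes)
    rng.shuffle(shuffled)  # prefixes of one shuffle give nested subsamples
    curve = []
    for fraction in fractions:
        n_reads = int(round(fraction * len(shuffled)))
        counts = Counter(shuffled[:n_reads])
        detected = sum(1 for c in counts.values() if c >= min_reads)
        curve.append((fraction, detected))
    return curve


if __name__ == "__main__":
    # Toy library: one dominant gene and a few rare ones.
    toy_reads = ["A"] * 50 + ["B"] * 30 + ["C"] * 15 + ["D"] * 4 + ["E"]
    for fraction, detected in saturation_curve(toy_reads):
        print(f"{fraction:.0%} of reads: {detected} genes detected")
```

If the curve is already flat well before 100%, the extra reads mostly re-sequenced what was already seen, which is the argument made in the lecture for using a pilot experiment to choose the depth.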
So you need to know that this is going to be different depending on the way you generated your library. What about insert size? Your RNA is fragmented with a target length, but this length has a distribution. For example, when you prepare the library you want fragments of about 300 base pairs. Then you read from both ends, 100 and 100 base pairs, so you have a gap in the middle that would be 100 base pairs if the fragment is about 300. One thing that you don't want is fragments that are too short, because then, if your reads are long, they actually overlap. That's not something you want, because you would sequence the same thing twice with the two reads of your pair, and if you look at mutations you might overestimate the support for them. That's a bias you don't want. Most commonly now you have paired-end reads of 100 or 150 base pairs. So you want fragments of at least 200 to 500 base pairs to be able to sequence the pairs and still have the gap, and not have overlapping reads from the same fragment. I hope this was clear.
Here, you can load your RNA-seq BAM in IGV and zoom, zoom, zoom to look at a particular position and get an estimate of the variant allele frequency. But be careful with this type of analysis in RNA-seq, because you never know whether it is biased by the fact that both alleles are expressed or not, or expressed equally or not, as well as whether the variant is somatic or not. This is not something you can determine from the RNA-seq data alone. There is a tool, which is part of GATK, to call mutations from RNA-seq. But be aware that there are a lot of caveats, and it is not as good as calling mutations from a whole-genome or exome library.
I would like to recommend MultiQC, which is a tool that gathers all the output of the different QC tools you might have run on your FASTQ files, on your BAM files, etc.
It supported 102 tools when I checked last week, and it pulls everything together very nicely, so it's a lot easier to go through. For example, you might have run FastQC and you get a file for every single sample; when you have 100 or 200 samples, at that scale you cannot go through them all. But this tool allows you to flag samples a lot more simply: you have all the QC, and you get the output of most of the tools together. It works with RNA-seq tools, but it also works with tools for other omics data. And then we're going to talk about evaluating the expression of genes and differential expression. Is there a question?
One question. So when you're looking at sequencing depth, you should always construct those rarefaction curves to make sure that they plateau; that's how you know you reached enough sequencing depth, right? That's one way to check that you sequenced enough, but this plot was looking at how many new junctions you would get compared with what you already know, with the reference you have for the different transcripts. So if it plateaus, that tells you you're not getting more information, so that's good information, but you might want to do the same with how many genes are detected as being expressed. If I get more and more reads, that should plateau as well. So that's a metric you can look at. Okay, thank you.
So, would you like to add something about this? Do you have a good way to estimate how deep you need to go? I mean, you can. It depends on your experimental design. If you're going to compare two conditions, you kind of want to have sequenced to sort of the same depth across both. Right. So it depends on your experimental design, but this plot is really useful. Right, but if you're looking at, say, a cancer sample versus a non-cancer sample, so a healthy one.
I'm trying to get straight in my head how that would look for a cancer sample where one or two genes dominate, whereas in a healthy sample you wouldn't have that. What do you mean, one or two genes would dominate? Well, I'm assuming, I don't know how many genes would be mutated when you're looking at a cancer sample, but it will certainly be a subset of all the possible genes. And so those would tend to dominate. We would look for which genes are upregulated or downregulated compared with the normal. Right, up or down. That's right. Yeah. And it would be a set of genes, but you still want to know whether, as you get more and more reads, you detect the expression of more genes or not. That's one piece of information, both in the normal and in the tumor. But you should aim to sequence a set number of reads in both of your samples from the beginning, really. You don't want to sequence less of your normal because it might be, I don't know, maybe less complex. So that's why I mentioned that I would advise starting with the same amount of reads when you start pilot experiments.
Yes. I was just going to say, it's an interesting comment that you might expect your normal tissue to be less complex. I don't know that that would be the case. It really depends. Cancer grew from one cell, right? It's a clonal expansion. Your normal tissue is a chunk of tissue which has tons of different kinds of cells, usually not just that one cell type that the cancer grew from. So usually the normal sample is pretty complex in terms of the cell types represented within that tissue, unless you're doing something like, I don't know, pancreatic islet cells. And then you just sequence insulin over and over, right? But if you sequence deeply enough, you'll start to see other genes as well.
But in most tissues, your normal sample, like if you're doing brain cancer and you take a chunk of brain, is going to contain dozens of cell types and millions of cells. So you'll have a really diverse population represented in your sequencing data. Got it. Thank you.
Here's another hand raised. Yes. So I have a question about sample preparation at the initial stages. Let's say you have a few biological replicates, but when you get the samples, you don't have enough material to do the sequencing. In that case, if you combine all the biological replicates together and then sequence them, is that advisable, given that you don't have enough sample to do each biological replicate? Do you mean enough material at the beginning? Like you haven't extracted enough RNA? Yeah. I mean, let's say it's one condition, but you have different mouse samples. If you pool everything together as one sample and sequence it, is that a good way to do the sequencing or not? You should still keep the information about the different libraries from the different samples at the beginning. Then in your analysis, you might merge them or not. But you want to check that no bias has been introduced along the way. Okay. So you need to make sure you have a way of differentiating each biological replicate, right? I think if you want to do a differential expression analysis, yes. If you don't have enough tissue because you're working with a really rare cell type or something, then there are some library prep kits that take very small amounts of starting material. But if that's still not suitable, then you'd want to consider single-cell sequencing or a different kind of library prep and data generation, right?
Or you could pool the super rare cell type from a bunch of mice and make one pool, and then you make another pool and another pool. You need about three replicates to get a good differential expression analysis. You don't want to do one versus one. So in the end you need a distribution, at least three versus three, when you do the differential expression analysis. Oh, okay. Thank you.
Excuse me, I have another question, about what you said, that normal samples are also very complex and can contain different cell types. Where do we use this extra information? As far as I know, we just compare the tumor cells to normal cells. So if these normal cells come from different cell types, where do we use this information? You need single-cell information to do it. But to build on what was just said, it's a new way we're starting to think about the analysis, I would say. In the past, we were just considering bulk. With bulk RNA, one of the images is that it's a mix. It's a smoothie: you have all your fruits in together. So this is the single-cell era, and the analysis we were able to do with single-cell data is what allowed us to really show that there were all these cell types. So you would need another dataset to be able to take that into account. In bulk analysis, usually we just compare one condition versus another, or normal versus tumor, or something like this. In bulk, you might have your normal sample and then two kinds of tumor, and you just compare them to the same reference and find the things that are different from the normal and then different in A versus B, right? So you have to think about what your normal is and what your experimental setup is, and that will help you interpret your results.
I just don't remember from the previous session about single cell. Can we understand which part of the normal cells has now become part of the tumor cells? I think this afternoon we're going to have the session on single-cell RNA-seq, so that will build on this module. Yesterday was DNA sequencing, which doesn't have anything to do with expression, but you'll discuss expression differences between single cells after Florence teaches you about expression differences between bulk samples. Thank you. Those questions are very long-standing ones. It's one of the main questions we're asking with single-cell data, which we were not able to answer with bulk RNA-seq before.
Professor, can I ask a question? Sure. Yeah, so I'm working with TCGA and it has two types of normal samples. One is blood normal and the other one is solid normal. Which one would you recommend? Let's say you're doing lung cancer. For the RNA-seq analysis, I would take the normal lung samples that they have, because you don't want to compare the transcriptome of blood versus the transcriptome of lung. It would be different because of the tissue type. The blood normal is very useful to get the reference genome for this particular patient and to call somatic mutations. That's why we take blood and do exome or genome sequencing on it: to be able to know whether a mutation is already present everywhere, such as in the blood, or whether it's somatic and only in the tumor. So for somatic variants, you would recommend blood, but for differential expression, you would recommend the nearby normal, right? Is that correct? Okay. Thank you. Appreciate it.
Yeah. I just want to add a comment. For brain tumors, for example, we usually don't have much normal tissue. They don't take extra pieces of the brain. So we tend to cluster and compare one tumor type versus another. That's another question. Okay. Someone else had a question? Yeah.
So what I wanted to know is, let's say you did your DNA sequencing and you found a synonymous variant in the last amino acid, right at the exon-intron junction. And you wanted to see if the synonymous change actually resulted in exon skipping. And you did your RNA-seq, your mRNA sequencing. How would you know if that synonymous variant is actually causing exon skipping or some other event in the mRNA? So you would align your reads to the genome, and then you would estimate the expression of the different transcripts and see whether, in the samples with or without this variant, a different transcript is being expressed, with or without exon skipping. Okay. So you would see it as a novel transcript in IGV, or how would it show up? It depends, and we're actually going to talk about that. For example, with StringTie, you can identify transcript expression and novel transcript expression. So you would go through the StringTie output very carefully to see if you can observe different splicing in the samples with or without the variant. And could you use IGV to visualize it? After that, yes, because you would see the reads that map at the junction before or after. But that would just be a visualization. You will need some statistics and a differential transcript expression analysis to really prove that there is a significant difference. But you could visualize it with IGV, yes.
So here we're going to talk about how we estimate the expression of a gene or a transcript, what measures we usually use, such as FPKM, and why you would use raw counts as well. Then some of the differential expression methods, and a little bit of the downstream analysis. So here, as an example, you load the BAM files from your tumor and your control in IGV and look at a particular gene.
You see the coverage at the top, and you could possibly think that this particular gene might be differentially expressed, because clearly there is less coverage in the bottom track than in the data at the top. But this has a lot of caveats. You cannot do that just by looking at IGV, because you might not have sequenced the samples to the same depth. And, for example, if you compare different genes, the gene lengths could be different, so you wouldn't have the same number of reads on them. So it's not the way we do it. It's a way you could go back and have a look, but it's not the way you detect differential expression.
So what we do is estimate the expression level using some measure. One of them initially was RPKM, reads per kilobase of transcript per million mapped reads. Now that we have paired-end reads, we use fragments instead: FPKM, fragments per kilobase of transcript per million mapped reads. In RNA-seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it, but the number of fragments is also biased towards larger genes. Of course, if you have a larger gene, you're more likely to get fragments from it. And the total number of fragments is related to your total library depth. So we need to correct for this. One way to do it is the FPKM value, for which you take the number of mappable fragments for the gene and divide by the total number of mappable fragments in the library and by the length of your gene or transcript. Another, I would say measure, I'm not sure what the best way to call it is, is the TPM. Basically, you want to correct for the same things, the length of the transcripts and the number of reads in your library, but you do it the other way around.
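To make the two normalizations concrete, here is a minimal Python sketch of FPKM and TPM computed from a table of fragment counts and gene lengths. The gene names and numbers are made up, and as a simplifying assumption the library size is taken as the sum of the counts in the table rather than the total number of mapped fragments reported by an aligner.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments.

    Library size is assumed to be the sum of the counts given here.
    """
    total_fragments = sum(counts.values())
    return {
        gene: counts[gene] * 1e9 / (total_fragments * lengths_bp[gene])
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length first, then library size."""
    # Step 1: fragments per kilobase for each gene.
    fpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    # Step 2: the "per million" scaling factor for this sample.
    per_million = sum(fpk.values()) / 1e6
    # Step 3: divide step 1 by step 2.
    return {gene: fpk[gene] / per_million for gene in counts}


if __name__ == "__main__":
    counts = {"geneA": 100, "geneB": 200, "geneC": 700}       # toy counts
    lengths = {"geneA": 1000, "geneB": 2000, "geneC": 7000}   # toy lengths (bp)
    print(fpkm(counts, lengths))
    print(tpm(counts, lengths))
    print(sum(tpm(counts, lengths).values()))  # sums to one million by construction
```

In this toy example the three genes have different raw counts but the same length-corrected expression, which is the gene-length effect the lecture warns about. The TPM values sum to one million in every sample, which is why they are easier to compare between samples.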
For the TPM, you first divide the fragment count by the length of the transcript, which gives you fragments per kilobase, FPK. Then you sum all the FPK values in the sample and divide by a million, which gives you a per-million scaling factor. Then you divide the first by the second and you have the TPM. So basically the TPM gives you the proportion of the reads for a given gene or transcript in your library. People like it because you can compare more easily between samples, since the sum of all TPM values is the same in every sample. So it's easier to compare proportions.
So you have your reads from your RNA-seq and you want to estimate transcript-level expression. For a gene, it's easier to compute the FPKM, because you have the annotated gene, you get the number of reads that map to the gene, and you compute the FPKM. For a transcript, you need to define which transcript is actually expressed, I mean which combination of exons is the right one. And how do you assign the reads that map to exon 1 to transcript 1, transcript 2 or transcript 3, when exon 1 is part of all three transcripts but exon 2 is, for example, only part of transcript 2? There are different tools that try to evaluate transcript expression. This is not an easy problem. One of the early ones was Cufflinks. Later StringTie came along with a more accurate estimation of the transcript levels. Basically StringTie looks for the path, going from exon to exon, to which it can assign the most reads, the path with the heaviest coverage. It builds a flow network for that path to estimate its coverage, then updates and does another pass with the remaining reads, etc., to try to estimate the expression of the different transcripts as accurately as possible. You can discover new transcripts as well.
At the end, you need to run StringTie merge, because you can have some gene structures that were identified in some samples and not in others. For example, a transcript of a given gene may be quantified in one sample but actually not be present in the output of the other samples. So you want to merge all those transcripts to have expression values for all the different transcripts identified across all samples. That's why you would want to run StringTie merge. It also allows you to incorporate known transcripts together with the assembled ones. If you ran StringTie in de novo mode, you need to run it. You can use gffcompare as well to compare the merged transcripts against a known annotation. We already have annotated genes and transcripts, and you can compare the output to them to detect what is known. So one way to perform differential expression analysis following StringTie...
I don't know if it's a relevant question, but for transcript quantification I've seen kallisto being used. What's the difference with the tools that you just mentioned? I haven't used kallisto, so I wouldn't be able to tell you right now what the big difference is. Yes? Yeah, so I use kallisto for my RNA-seq analysis. The big difference between kallisto and the aligners that we've talked about here is the mathematical basis, which is well beyond my capacity to explain, but I will post a link to the Pachter lab's kallisto help page in the channel. I do recommend it because it's computationally very efficient and straightforward to use. And there's a follow-up suite for differential expression called sleuth that's also great, and I love it. But yeah, I will post the link and the papers so you can look at them, because the maths is way above my head.
Probably above mine too. Yeah, okay, that's what I heard, that it was really fast compared to Cufflinks, let's say, so I was wondering what the trick behind it was. Okay, thank you very much. I have a question for you, Emma, on kallisto. I thought it does not produce an alignment like a BAM. It just does a pseudoalignment on the fly and calculates, and you get counts and that's it. So if you want to do anything with the BAM afterwards, kallisto is not your tool? It can be your tool; you can generate a BAM. Okay, and that takes longer though. Yeah, it definitely takes longer. But yeah, kallisto doesn't align in the same way that we've been talking about. It does a pseudoalignment, where it aligns a certain proportion of the read and then mathematically estimates the likelihood that it belongs to that fragment as opposed to any other fragment. It's very complicated and I don't have the maths. Yeah, so you can skip the alignment, and those are different ways to evaluate transcript expression.
So, to perform differential expression analysis, one of the tools in the suite you can use after StringTie is Ballgown. We can briefly talk about other pipelines, because there are other ways to do that as well. You can use a parametric F-test comparing nested linear models. So we actually compare two model fits, with and without the feature of interest, using the expression as the outcome. You want to estimate the fit of your gene's expression with or without the covariate of interest, which is for example case or control, or time if you have a time course, depending on what you're comparing. The F-statistic and the p-value are calculated using the fits of the two models. If the p-value is significant, that means you actually have a better fit when you take this feature, case or control, into account, and that is differential expression.
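The nested-model comparison just described can be illustrated with a minimal Python sketch for one gene and one two-level covariate. This is not Ballgown's implementation (Ballgown is an R package); it only shows the idea under simple assumptions: the null model fits a single mean to all samples, the alternative model fits one mean per group, and the F-statistic compares their residual sums of squares. The expression values and group labels are made up.

```python
def nested_f_statistic(expression, groups):
    """F-statistic comparing 'one overall mean' against 'one mean per group'.

    expression: expression values for one gene, one per sample.
    groups: group label for each sample (for example case / control).
    Returns (F, df1, df2). Turning F into a p-value requires the F
    distribution (for example scipy.stats.f.sf(F, df1, df2)), which is
    left out here to keep the sketch dependency-free.
    """
    n = len(expression)

    # Null model: intercept only, so the fit is the grand mean.
    grand_mean = sum(expression) / n
    rss_null = sum((y - grand_mean) ** 2 for y in expression)

    # Alternative model: intercept plus group covariate, so one mean per group.
    by_group = {}
    for y, g in zip(expression, groups):
        by_group.setdefault(g, []).append(y)
    rss_alt = 0.0
    for values in by_group.values():
        group_mean = sum(values) / len(values)
        rss_alt += sum((y - group_mean) ** 2 for y in values)

    df1 = len(by_group) - 1   # extra parameters in the alternative model
    df2 = n - len(by_group)   # residual degrees of freedom
    # Note: rss_alt == 0 (no within-group variation) would divide by zero.
    f_stat = ((rss_null - rss_alt) / df1) / (rss_alt / df2)
    return f_stat, df1, df2


if __name__ == "__main__":
    expr = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]          # toy log-expression values
    labels = ["control"] * 3 + ["case"] * 3
    print(nested_f_statistic(expr, labels))
```

A large F means that adding the case/control covariate explains much more of the variation in expression than the overall mean alone, which is what "a better fit when you take the feature into account" means above.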
Of course you want to correct for multiple testing and consider the q-value rather than the p-value, with the standard threshold being a q-value below 0.05. I think there is a slide about multiple testing that's going to come after. So these are some of the visual outputs, the plots, you can get from Ballgown. You can look at the log2 FPKM values of your different samples. You can recognize the basic box plot for a particular gene, looking at the expression level in the groups being compared, and you can look at the expression of the different transcripts of a given gene. And this is the structure of the Ballgown object that you're going to play with in the lab.
As an alternative to FPKM: actually a lot of us, I would say, don't necessarily use FPKM, but prefer a raw read count based analysis, which you can use with DESeq2. So raw read counts are an alternative for differential expression analysis. Instead of calculating FPKM, you simply assign a number of fragments to your gene or your transcript. You can get this number with HTSeq, using htseq-count, or, if you run STAR, with the quantMode option in your STAR command, to get the counts for each gene right away in the output. And just a side note, because we had experience with that: you need to pay attention to which column you use in the STAR output for the counts. It might not be the column you expect. For example, it can be the fourth and not the third column. That's something you want to check the first time you get your STAR output, to make sure that you're taking the column that makes sense. So that's another way to get raw counts for your genes, right after your STAR alignment.
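To illustrate the point about picking the right column, here is a minimal Python sketch that reads a STAR per-gene count table (the `ReadsPerGene.out.tab` file written when gene counting is enabled). It assumes the documented layout: four tab-separated columns (gene ID, unstranded counts, counts for read 1 on the same strand as the RNA, counts for the reverse-stranded case), preceded by four summary rows whose names start with `N_`. The labels "unstranded", "forward" and "reverse" are names chosen for this example only.

```python
import io

# 0-based index of the count column to use for each library type.
# The labels are hypothetical names for this sketch.
COLUMN_BY_STRANDEDNESS = {"unstranded": 1, "forward": 2, "reverse": 3}


def read_star_gene_counts(handle, strandedness):
    """Return {gene_id: count} from a STAR ReadsPerGene-style table."""
    column = COLUMN_BY_STRANDEDNESS[strandedness]
    counts = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed lines and the N_unmapped / N_multimapping /
        # N_noFeature / N_ambiguous summary rows at the top.
        if len(fields) < 4 or fields[0].startswith("N_"):
            continue
        counts[fields[0]] = int(fields[column])
    return counts


def column_totals(handle):
    """Total counts per column, a quick check of which column makes sense."""
    totals = [0, 0, 0]
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0].startswith("N_"):
            continue
        for i in range(3):
            totals[i] += int(fields[i + 1])
    return totals


if __name__ == "__main__":
    example = (
        "N_unmapped\t10\t10\t10\n"
        "N_multimapping\t5\t5\t5\n"
        "N_noFeature\t3\t50\t4\n"
        "N_ambiguous\t1\t0\t1\n"
        "GENE1\t100\t2\t98\n"
        "GENE2\t40\t1\t39\n"
    )
    print(column_totals(io.StringIO(example)))
    print(read_star_gene_counts(io.StringIO(example), "reverse"))
```

In this made-up example almost all reads fall in the last column, which is the pattern you would expect from a reverse-stranded library, so the fourth column of the file is the one to use, as in the case mentioned in the lecture.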
So you can do differential expression with FPKM. It would be used when you want to benefit from the Tuxedo suite, so StringTie, Ballgown and all the tools within the suite. It's good for visualization, such as heat maps, because the values are normalized by library size and gene length. So you can quickly compare expression levels and calculate fold changes. However, using raw counts allows you to use more robust statistical methods for differential expression analysis, and it accommodates more sophisticated experimental designs, as you can do with DESeq2 and edgeR. Those are two of the most popular tools to perform differential expression analysis. With DESeq2, you can use a complex design matrix, specifying of course your case/control status, but also more of the information you have about your samples, to better perform your differential expression analysis with an informed design matrix. Within DESeq2 you can run the variance stabilizing transformation, which you do after the size factor normalization, and you could use that for clustering or visualization. It's good for between-sample comparison as well. Just a side note: after this normalization and VST, the values haven't been corrected for gene length. So if you want to compare the expression of genes within a sample, you want to correct for gene length, so compute something like an FPKM instead. You can compare gene A between sample 1 and sample 2 directly, but not gene A and gene B in the same sample, if you haven't corrected for gene length. The other link is for edgeR, another very popular tool, which I believe uses similar statistical methods to DESeq2, with a particular way to run the analysis. They are two very good tools.
If you run several methods and several types of analysis, you will not get the exact same list of significantly differentially expressed genes at a q-value below 0.05 at the end, for multiple reasons. So if you want to be really conservative and propose genes for, maybe, some validation, you might want to run several methods and take the overlap. Genes that are only called by edgeR, for example, or by DESeq2, are not necessarily false positives. There is no perfect method. Sometimes you just have to choose which one you go with, and it's a question of functionality as well.
About multiple testing correction: I believe you're aware of it, but it's important to realize it's something we all have to do with this kind of test, when you compare the expression of every single gene in the genome, or in your transcriptome, to see whether it's differentially expressed between condition A and B. We do 20,000 or 25,000 tests. So you're more likely to have a test that comes out significant when it shouldn't than if you were doing one or two tests, for example. So you want to correct for this, and what you get is an adjusted p-value, a q-value, from a multiple testing correction, which comes in the output directly alongside the p-value. So you always consider the q-value rather than the p-value in the output of your differential expression analysis, because of multiple testing.
So what can you do after? You can do a lot of things. It really depends on your question. Usually you move on to maybe some clustering, heat map representation, or you do pathway analysis. You're going to do that in a module later on. So I made a few slides, because you might be interested in using already published datasets related to the tumor type you're interested in.
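As an illustration of what a multiple testing correction does to a list of p-values, here is a minimal Python sketch of the Benjamini-Hochberg adjustment. The lecture does not say which correction produces the q-values, so the choice of Benjamini-Hochberg here is an assumption; it is a common default for adjusted p-values in differential expression output. The p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value can never
    # exceed the adjusted value of a larger p-value.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    pvals = [0.01, 0.04, 0.03, 0.005]   # toy p-values for four genes
    qvals = benjamini_hochberg(pvals)
    for p, q in zip(pvals, qvals):
        print(f"p = {p:.3f}  adjusted = {q:.3f}  significant at 0.05: {q < 0.05}")
```

With 20,000 genes instead of four, the same procedure is what turns a raw p-value column into the adjusted column you should filter on.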
And you have access to a lot of data sets that are in GEO. They are mainly array based, but there are some RNA-seq data sets, especially for models. It's a question of whether the people who generated the samples were allowed to deposit the data in GEO or not, because GEO is open access, not controlled access. Most of the human RNA-seq data are under controlled access, such as in EGA, but you can actually find some in GEO. The issue is that it must not be possible to personally identify where a sample comes from in the end, when it comes from a particular tumor. But still, if you look at mouse models or organoids that are models of your tumor, you might find a lot of data sets. The methylation data are in GEO as well, because they are mainly array based, the 450K and 850K data. The other place where you might deposit your own samples when you publish, or where you might want to get access to data, is EGA, managed by the European Bioinformatics Institute and the Centre for Genomic Regulation, the CRG, in Barcelona. There you need to request access from the data access committee, and there is a data access agreement that needs to be signed to allow you to download the raw files of a particular data set. This can take a bit of time, but it's worth it because there is a lot of cool data out there. It really depends, again, on your project and your question. I wanted to point out that there is a lot of data out there across cancer types, and one of the very large projects that you're probably aware of is TCGA, run by the NCI: over 20,000 primary tumors and matched normals have been sequenced with different types of omics, and they span 33 cancer types. You have probably seen the big papers that came out on the different cancer types. There is a portal where you can query the processed data quite easily and retrieve it. I was actually curious to see what kind of pipeline they use for the data they share, and this is the RNA-seq mRNA analysis pipeline that they use.
So actually, they use STAR for the splice junction detection and for the alignment. They get the BAM, and then they use HTSeq-count to get the gene expression. They provide the gene expression as counts from HTSeq, as FPKM, and as FPKM-UQ, which is FPKM with an upper quartile type of normalization. They run fusion detection as well: transcript fusion analysis with Arriba and with STAR-Fusion, two different ways. So that's how they obtain the counts after the alignment, together with the fusion information. And this is the type of data they actually give access to: the RNA-seq alignment as a BAM file, the HTSeq read counts, the STAR read counts, and the FPKM and FPKM-UQ values. So you can start from one of those files and do your analysis, and use those samples in your own analysis as well. All the documentation is on the website. The thing is that here it's all at the gene level. You wouldn't have transcript-level information, which might be something you want, and that is one of the reasons you would use StringTie. And that's it for this lecture. Thank you, Danny.