 You are now listening to the RNA-seq tutorial. I'm Bernice Batu. I'm a postdoc researcher from the Freiburg Galaxy team at the University of Freiburg in Germany. Before you have listened to the transcriptomic lecture from Fortis, and now we will apply what you learn during these lectures to RNA-seq data, and yeah, we will apply that to real data, and for that we will use the European Galaxy server that you can reach at usegalaxy.tv. You should have learned before when following one of the Galaxy introduction tutorials how to log in, so please log into this server, and then we will follow the tutorials. So the tutorials you can find that on the training on the Galaxy training materials, so you can open a new tab and type training.galaxyproject.org, you would be redirected to this page where you have all the tutorials listed, you go to transcriptomics, and then you go to the reference-based RNA-seq data analysis. You will have these tutorials. As you may have already seen it, you can also have the tutorial directly visible into Galaxy. So to do that, you click on this small hat on the top, which is seek Galaxy training materials, you click there, and then you go, you are redirected to the Galaxy training page, and then you go to transcriptomics, and you search for the reference-based RNA-seq data analysis tutorials, you click there, and the tutorials will be open. And it's a tutorial we will follow now. So, and with these tutorials, we will try to answer the different questions like what are the different steps to process RNA-seq data? How to identify differentially expressed genes across multiple experimental conditions? What are the biological functions impacted by the differential expression of genes? And at the end of this tutorial, you should be able to do several things. You should be able to check a sequence quality report generated by Fascusi for RNA-seq data. You should be able to explain the principle and specificity of mapping of RNA-seq data to an okaotic reference sheet. You should be able to select and run a state-of-the-art mapping tools for RNA-seq data, evaluate the quality of mapping results, but also describe the process to estimate the library's sonness, estimate the number of reads per gene. Explain how the count normalizations is done to perform before sample comparisons, construct and run a differential gene expression analysis, analyze the seq output to identify, annotate, and visualize differentially expressed genes, but also perform a gene ontology arrangement analysis and perform and visualize an arrangement analysis for keg pathway. You probably, most of the term that you heard before, you may never heard before, and but it's not him. It's not everything is fine. We will, you will learn that by following these tutorials. Before this tutorial, you should have gone through at least one of the galaxy introduction tutorials and also followed at least the quality control and mapping tutorials. You can find the links onto this tutorial there and follow this tutorial. As Fortis did for the introductions, with the lectured introductions, RNA-seq has become a widely used technologies to analyze the continuous changing cellular transcriptome. It's one of the most common and aim of RNA-seq is to identify gene, is to profile the gene expression and identify genes and molecular pathways that are differentially expressed between different conditions. And it's what we will do now in these tutorials. We'll try to identify differentially expressed genes and pathways from real data. And for that, we will use the data that has been published in 2011. And in this, in this study, the authors wanted to identify genes and pathways that are regulated by the depletion by a pastilla genes. But what they did, they deleted the pastilla genes in some trozophila by RNA interference and they extracted the total RNA and to prepare and prepare a single end and parent RNA-seq libraries for the samples that are where the pastilla genes have been depleted, but also for some samples from control samples where nothing happened. This library were then sequenced to obtain RNA-seq crates for each samples. You can find the old data available in CBI. You can find the detail there. But for these tutorials and to learn how to analyze this data, we will use only seven of the samples. Four that are, that we can call control samples or unsweeted samples and three treated samples where the pastilla genes has been depleted. So each samples constitute a separate biological replicate. But for, for some of the samples, some of the samples has been sequenced using parent sequencing and some where sequenced using single end. And we will learn how to manage both of these, these different types of sequencing. So through these tutorials, we will deal through a normal RNA-seq data analysis pipeline. So we will do a quality control mapping, counting the number of reads per annotated genes, run a differential expression analysis, visualize it and run a functional arrangement analysis. We now need to have the data into Galaxy. For the first part of this tutorial, we will use the files for two or out of the seven samples to demonstrate the first steps of the pipeline. The data, so for to getting the data into Galaxy, we need first to create a new history. So we are now in the tutorials, we want to go back to Galaxy interface. So we click outside of the tutorials here anywhere. We need to create a new history. So you go on the history panel on the right, you click on create a new history. We rename the history, rename it RNA-seq tutorial. And then we need to get data into, we need to get to add our data there in this history. If you go back to the tutorials, you will see that you can import FASQ file for the difference for these two samples from Zenodo where they are stored by copy the different links that are there and import them. If you don't know how to do that, you can expand this little box and it will be explained. But if you are using the European Galaxy server, we already put the data on a shared data libraries for you. So we will do that. We will import the data using the data libraries. So to do that, go back to Galaxy. You will go on the top to shared data. You click on data libraries. You will be redirected to a new page where you have different folders. You may have different folder than mine, but you should have all the same, that is called GTN materials. You click on GTN materials. It's the materials from all the training materials tutorials. You go then to in transcriptomics. You go to the reference-based RNA-seq data analysis. You click on DOI, the title of the day. And then we got different, we got different files. You have just some GTF file that we will use later. And then you have for different, for the different samples that are named GSM4611 something, you have different files. So you have a fast glue sanger, you have accounts, etc. for each of the sample. We need to go back to the tutorial to check which samples we want to use. So please, you can select there. You see that you are redirected to the home page because we changed the page, so it's not loaded anymore. So we go to transcriptomics. Again, the reference-based RNA-seq data analysis tutorials. You can go directly to the correct sections if you go on the left in the table of content. And you see that for this, for these tutorials, we need to load the 77 sample, that is a control and treated sample, and the 80, that is a treated sample. So we will select the file, the fast queue file for these samples. And as it's mentioned here, it should be 77 underscore 1, 77 underscore 2, 80 underscore 1, 80 underscore 2. So we need to select four files there. So we select 77 underscore 1, 77 underscore 2. And then we need to go on the second page and to find the 80 underscore 1, 80 underscore 2. So you should have selected four data sets and that we want to export to our history as a data set. Then you select in which story you want to import them and you click on import. It will import the select items into your history. When it's green here, you can go back to the main page of Galaxy by clicking on analyze data. And then on the right in your history, you should have now four data sets. But you see that the names of these data sets are the URL names. So we need to rename them. So we need to edit attributes. And then we remove the first part of the URL to keep only 77 underscore 1. We save that and we do that for the other data sets here, underscore 7. Save that. We now go to 80. We do the same. We do that again for the last data set. And we now have four files with a better naming. But we want to be sure that these both data sets that are from the same sample. So 77 underscore 1 and 77 underscore 2 are somehow linked together. We want to add something that is called a tag. So to do that, we expand the data sets by clicking on it. And we click edit data set tag. And we add a tag, which is the name of the sample. So GSM 46 11 77. And we want to be sure that this tag propagate that it's visible for each other outputs for for the tool that I run for that. So we add a hashtag first. We copy this tag that has been added because we want to add the same tag to the other samples. So you do that. And once you're ready, you can click on the return key on your keyboard. You do the same for the same. And then you passed the tag that you already added. Then it's automatically added there. You see that now we have below the names of the of the files. You have a small tag there. And you would see later how it's can be really useful. Then we are the same for the other one. GSM 46 11 80. You copy it also because you will add it again to the other file. That I said here and here we are. So we now have four files with the correct naming in our history and with the tags. Check again on the tutorials. Did we forgot any steps? So we need to reload. I'm sorry tutorials. So we go to transcriptomic, reference-based, RNA-seq, data analysis, data upload. We are in the steps. So we created a new history that we named correctly. We imported the FASQ file. We renamed each data set according to the sample. Ah, we forgot this step. Check the data type. If the data type is FASQ Sanger and not FASQ. So to do that, we expand the file. We see that FASQ Sanger is there. FASQ Sanger, FASQ Sanger, FASQ Sanger. So all good for our data sets. Let's go back to the tutorial. And then we added a tag for all the sampling. And then we have a question. How are the data sequences stored and what are the different entries in the file? So if we open one of the files, we see that we have a FASQ file, which means that each sequence is stored as four lines. The first line is always starting with a hat, then the sequence ID and some information from the sequencing facility. Then the second line is the sequence itself. The third line starts with a plus. And the fourth line is a sequence of quality for each nucleotide. So this is a quality score, the i for the t. But if you already, you should have already followed the quality control tutorials where everything is already detailed there. So I will not go through the details there now about that. So let's go back to the tutorials. So we finished that. The files that we just uploaded to Galaxy contain the read data from the sequencing machine. So no pretreatment has been done on the data. So the first step that we need to do for any RNA-seq analysis is to asset the quality. Because during sequencing, some errors can be introduced such as incorrect nucleotide being called. And they are due to technical limitation of the sequencing platforms. But you probably already know more from the quality control tutorials. And so now we will follow the same step that you did during the quality control tutorials. We will run FastQC to create a report of the quality sequence quality. We will run MultiQC to aggregate the generated report and could adapt to improve the quality of sequence via trimming and filtering. So let's go back to Galaxy. So first we need to run FastQC. So we go back to Galaxy. We search and the toolbar on the left, we search for FastQC. And then we select FastQC, read quality report. And then we want to run that on all our force files. So we click on multiple data sets. And with the shift key on your keyboard, you select the four files there. And then you can execute. And you will see that in your history, for each of the samples, two new data sets are generated. So FastQC on the data one, web page and FastQC on the data one, for example, for the data. And what we want to do afterwards is to aggregate the FastQC report using MultiQC. So we can start, we can already launch MultiQC, even if FastQC is not, is already running, is still running. So we search for MultiQC in the toolbar. We select on which in which tools has been used to generate the report. So we search for FastQC, FastQC. Then we need to say which type of FastQC output we will use the raw data. And then using again the shift, again, now the command or the keyboard, you can select the different, the raw data that you need to have. So the four, you should have select four files, so file raw data. And then you can execute. So MultiQC will wait until FastQC. All FastQC are done to run. So we are a few minutes now to wait. And you see already that the tag that we added there are automatically added to the FastQC. So it's a way for us to keep track of which, so that this FastQC has been run on data one, which is this one, which means this sample here. So it's a better way for us to keep track of that. So if we open, for example, the FastQC report with page for the data one, you see that you have some basic information, basic statistics, and some reports that has been generated. But you already go through each of these reports by following the quality control tutorials. So I just check on the tutorials. So we are in the end zone, a quality control. So we see that FastQC, rerun FastQC. And it's, we have a question there. What is the red length? So how do we know the red length? So we open the web page report from FastQC. And we see that the length, the sequence length is 37. So it's quite small sequences there. But it's also, remember, it's all data from 2000, that has been purchased in 2011. Oops, sorry. We go back to the tutorial here. We see that we are already running MultQC. And then once MultQC is done, we need to inspect the web page output for MultQC. And we have two questions to check then. What do you think about the quality of the sequence? What should we do? So MultQC is, we are waiting for MultQC to run. And if you see already that, because we run MultQC on the output of these files, that have these different tags, we see that the both tags has been added there for this file. So MultQC is running. We are waiting until MultQC is done. And should be fast now. And once it's done, we will open it. So it's green, everything is fine. We can open it. And you see that the different report for the different, for the different samples has been aggregated together. And we can go through the different output from FastQC. And we see the different sample in one report. So one graphs for the sequence quality histogram and with a different sample there. So the question was, oh, sorry again. So what do you think about the quality of the sequence? So the quality is quite good. So it's mostly for most of the sample, it's green. So the quality is okay. It's decreasing at the end, which is expected for most of the Illumina sequencing. But we see that one of the sample, the 80 underscore two, is the decrease is much higher. So it's much more important. And it's labeled as a range by FastQC. So maybe the quality is not so good there. I'm sorry, I will be used to that again. So what should we do? So here you have more detailed answers if you want to check. So the quality, the histogram of the mean quality score per sequence is quite okay, except for 80 underscore two. We see that here we may need to trim the end of the sequence because it's dropping too much. The GC content is okay. There is some end contents or some nucleotide data ends. We have a quite duplication level, quite high, but it's expected for aeronistic data. And I think that's most of the things. Yep. But if you want more details about what we can say about that, I recommend you to check the detailed answer that you have there. But we have, so as we said, we should trim the read to get rid of bases that were sequenced with high centenity. So it's mostly the end of the reads. And also removed the reads with overall bad quality. We have another question in the tutorials. What is the relationship between underscore one and underscore two for each of the samples? We are talking here about parent sequencing. So the underscore one is the forward reads and the underscore two are the reverse reads. And you have a detailed explanation there about what does that mean? So now we want to remove the bad quality nucleotide but also the read. So for that, we will need to run cut adapt. So we need to go to the tools. We search for cut adapt here, remove, we use that. And we have, as we mentioned, because we checked that we have underscore two, underscore one, we have parent data. So we say parent. And then we need to select. As the dash one. So it's the forward reads with the underscore one. The underscore two, then as the reverse. What are the different? So to check, I don't remember all the details. So we want to have a filtering option. So remove all the read that are, that are shorter than two. We want to, to get rid of the reach where the quality, the mean quality is below 20. But also in the, if the quality dropped too much. So we go back, sorry again. So in filtering options. So we need to go down filter options here. We want to have a minimum length of 20. In read modification option, we want a quality cutoff of 20. That we mean that, that the quality base, it will trim the quality base for that. There, based on the quality score of 20. And in the output options, we want a report to be, yes. So output options, we want to create a report that can be used afterwards for multi-QC. Globally, I recommend you to go through all the parameters to check that. But here I go quickly. You can see more details during the, with the quality control tutorials, but also by reading through the documentation. So regenerator report that we can again aggregate using multi-QC. So I recommend while cut-adapt is running to also already start multi-QC. So then for multi-QC, which should have been used, you select, you search for cut-adapt. And then you need to select the both, the report. And you have two report that has been generated. Because, so now we don't have four cut-adapt that we run, but only two, because we are used already the, the do two samples together. So we have two reports that we need to aggregate using multi-QC. You can take a short break now, because cut-adapt can take sometimes to run. And then we come back in a few minutes when cut-adapt is done. It's now finished, and as well as multi-QC. So we can inspect the multi-QC output to check the result of the, of the, of the trimming and the cutting of removing of the data sets. So if we see that, if we open the multi-QC, we see that for the 77 samples, 25, 2.5% of the base pair has been trimmed. So especially at the end of the, at the end of the reads. And for the 80 samples, it seems that 12% have been removed. It's something we expected given the results. And when we saw that for A, for at least for the, for the 80 samples, the quality was not so good at the end. And we can see also if we look at the percentage of filter reads, we see that quite a lot of reads has been removed also. So for the 77 sample, 1.4% of the read were too short after the cutting. So they were removed. And for the 80 sample, 9% has been removed. So it's, it's, it's, it's, it's, it's, it's, it's, it's, it's, it's, it's removed. So it means that, uh, so a lot of the base pairs as we have been cut out of the end of the read. And then the read would become too small to be kept, too short to be kept. So it's what it said there. So if we go back to the tutorials, there were, uh, the VCE questions, uh, where exactly we're asked then in the tutorials. So now we can move to the mapping part. Now that the quality control is done, we need to do a mapping. So the mapping, the idea of the mapping is to make sense of the read, to try to identify where this read comes from, from which gene they come from. For that, we need to locate where they come from on the genomes to then associate this part of the genome where the belongs to is part of the genomes where the belongs correspond to these genes. So because we are talking about Drosophila, and for Drosophila there is a reference genome available so we can use that to help us finding the locations. This process is called aligning or mapping the reach to the reference genome. You should have already learned a bit about mapping by following the tutorials on mapping. If not, please feel free to follow it. So in this study, we used Drosophila melanogaster cells to extract the RNA so we need to use the same, we need to use to find the reference genome of Drosophila to extract the sequence to that. There is a question on the tutorials about what is a reference genome, what are the different versions of a reference genome, why do we have that, and which reference genome we should use. Please check the solutions on your own, but the answer is we need to use the reference genome once we start using a reference genome and with one version of the reference genome we should keep using this one along. So when we talk about a chaotic transcriptor, most reads origin from mRNA, so from messenger RNA, so from genes like the introns. So the reads come from exon 1, exon 2, or exon 3 and they can somehow, what we called, span over these different exons. For example, the blue reads span over exon 1 and exon 2 and miss the intro in between exon 1 and exon 2. So then because of that, because of the specificity of a chaotic transcriptor, we cannot just map the reads to the genome as we usually do for DNA data. We need some specific mappers that are developed to efficiently map the reads to different exon. So the idea of most of these mappers or these tools is they try to map the reads over the different exons, so like this, so they map this read to this exon 1 and this one. So we call them the mapped reads and once out of the read that didn't map, they try to split them over different exons. So it's the idea of splicing aware mappers and it's used for a chaotic transcriptomic data. We gave here in this box more details about the difference of spliced aware mappers and especially the history behind them. I recommend you to read that. I will not go through the detail now, but just to remember there are different generations of mappers and with different implementation there. So now we will map our reads to the Drosophila melanogaster genome using a tool that is called STAR. For that, we need to have first annotations of where the different exons are in the genomes. We need to import that into our history. As for the data, this file which is called a GTF file is available in the data library, so we import it. So please go back to Galaxy, go to shared data, data libraries. You go then to the GTL material. Once you are there, you search for transcriptomic. You go to the reference-based RNA-seq data analysis, then the DOI and then you use this one. So the third file here, we want to import it to history. Yeah, we want to import in our history. So it's here to be sure that we use the same file, but just to be sure, so if you go to Galaxy and the data libraries, so if you go back to shared data, data libraries, you can see that there are other genomes and annotations available there where you can find the annotation, the reference genome, etc. Here we will use this one because to be sure that we all use the same between different Galaxy servers, but here I just showed you where you can find other reference genomes on your European Galaxy server. So let's go back to Galaxy. And then so we have now our GTF file there. We will rename it as we did for the FASQ file at the beginning. So we remove the first part, we remove the GTF at the end and we save it. And I think, oh, we need to load again the tutorials as every time we change a page on the interface. So we go back there, we were in the mapping part and we are here now. So we need to rename this history. We need to check that the GTF, the data type is GTF and not GFF and that the database is DM6, which is the version of Drosophila Minnanogaster. So the GTF, which one we need to, we need to be sure that it's GTF, the data type, and the database is DM6. So we have the GTF here, we need to change now the database. So we go to edit attribute and then here in database build you can search for DM6 here and you save it. So now it's to be sure that this file is associated with that. Okay. So now we want to run a RNA-STAR, which is a mapper with the following parameters to map the read to the function. So we want to use parent dataset. So we search for star in the tools and you use the RNA-STAR, gap-treat mappers for RNA data. We say that we have parent data as individual datasets. We now need to use... So now we need to put the forward reads and the reverse read here. We have two samples, so we need to collect to run and we want to run the same parameters so we need to select multiple datasets. We need to select now the output of cut-adapt, that where the trimmed read. So for the forward it's always the read one output and for the reverse the read two. Be careful when you select them so it's important that you select the correct one. Custom or built-in reference genomes. So we need to use a built-in... We want to use what is again... So we want to use genome reference without built-in genome. So without we select the reference genome, which is DM6. Then we can give a gene model for the splice junction so it's mean where are the different exomes. So for that we can use a GTF file that we just uploaded. Then there is the length of the genomic sequencing and it say the interval value is read length minus one. Do you remember what were the lengths? We thought that in the output of FASQC so it was 37. So here we need to put 36. Then what are the different parameters? So 36 and that's all. So you can check the other outputs but take time when you have the time to do that. I will not do it now. So then you can execute a star. It will take some time so please feel free to take a break now for a few minutes, just the time it's running. Error in a star is now done. So you have now in your history you have three files for the AT sample and three files for 77. As we did for FASQC or for Catadapt we can aggregate the result from star and the report that are there in logs using MultiQC. So search for MultiQC. In the tools you search for star are the tools that can be used for generate. We use the log and you select the two log file there and you can execute that. It will give us some information like which how many reads has been mapped to the reference genomes how many read has been multiple mapped so it's meaning to several locations in the reference genome, etc. And so while MultiQC is running we can check, so we have a question afterwards what is the percentage of read exactly once for both samples and what are the other available statistics and so you have some details here so we will check the results once MultiQC is done and seems to be done now so I will open the web page so it says that 83% of the read the first data set has been mapped or aligned to the reference genome 79% for the second data sets if you will look at the percentage so 80.3% has been uniquely mapped so it means that they are mapped to only one locations and then we have 5.5% of the read that has been mapped to multiple locations it means probably that the read are two shorts and because of some repetitions we can see that and some has been mapped to too many locations at the same time and some were too short to be correctly mapped and we have also the informations for the second data set here sorry here so we go back here so it means that according to the report more than 80% of the read for both samples has been mapped at least once to the reference genomes so we can then proceed with the analysis and if the percentage were below 70% we should probably investigate for possible contaminations it can be due during the sequencing during the RNA extraction or something one thing you can do then is trying to in such a case you can try to map to the human reference genome or to other reference genome or maybe bacteria to see if you have some possible contamination that you can remove them but the main output of and the more interesting output for RNA-STAR is the BAM file but if you follow already the BAM mapping tutorials you already know what is a BAM file and which information you have in a BAM file so I will not go too much in details I will just try to find some example here so here so if you scroll down after the first line of comments you have the name of the sequence and some extra informations like the size and on which chromosome it has been mapped on which locations etc etc but you can find the detail here on the mapping here on how it looks like so the question is which information do you find in a BAM file, in a SAM file and what are the different informations compared to the FASQ file so the more information that you either the BAM file and SAM file compared to a FASQ file is that you have the location of the read and mapped on the reference genome the BAM file contains as BAM file contains information for all our reads it makes it difficult to inspect and export especially with this format, the text format so there is poor tool now to visualize the content of a BAM file for example using IGV so I recommend you to install IGV and start it so I already started locally for me so feel free to follow the instruction to install it and then we can visualize the BAM file for the 77 SAM file together so we go back to Galaxy we already started the IGV so you expand you go to the BAM file here and what you can do you can say display with IGV local and then it will launch IGV for you and then you can see here the different information here we will zoom in a bit so to zoom you select IVR there and you can see there you need to zoom again a bit more again, you don't see anything you have here, okay now it's zoom out a bit so here, no it's loading slowly so IGV is a quite slow tool but yeah, it's a good one so here you have a good example so here you have the different reads here and while they are mapped on the reference genome that is on the top you see their locations and you have what we call the courage here which represents the density of read mapping to the different locations here so if we go back to the tutorials on Galaxy, sorry I need to close that if we go back to the tutorials we would like to zoom to this specific location so on the chromosome IV so please copy this the this text here, the chromosome IV etc. and go back to your IGV and what you do, you can pass this location here on the top here and it will go directly to the specific region that is there so you can see exactly the region that we want to display within IGV that is from the tutorials so if we go back to the tutorial we should have something like this what is the information up here on the top of the gray peaks this one is the courage as we already mentioned what do the connecting lines between the sum of the aligned reads indicate so if we go back to IGV here you have some example here of these lines, the horizontal lines between some between there these are the spanning the introns in between so you have a read that mapped here one part of the read here and one part here so it means that this read is spanning over this intron that is there is what is represented by that another way to inspect the splicing junction is using what is called a sashimi plot so to do that, to create a sashimi plot what you can do is right click on IGV and click on sashimi plot here so it will create this sashimi plot here where you have the coverage here location here, you have this arc here with some number on the top here it's the number of read that's spanning over this intron so here we have 7 reads that mapped and span over this specific intron 30 here, 20 here, 22 etc this watch is represented here and the next question is why do we observe different stacked group of blue lines here and the bottoms these represent a different transcript of the gene that are represented in the in the in the GTA file so in the annotated file we see that so we know that there is an exon here and some some intron here and exon here but we know that they can be used by different genes so this one is one gene this one in another gene etc so we have different gene name here that use the same exon find the annotation and once you did that once you are done with the inspection of the mapping result you can do some extra checking for the mapping we gave a lot of details in this expandable box here to check for example the percentage of duplicate reads using a tool that is called mark duplicate the number of reads mapped to each chromosome the body coverage the read distribution across features across genes, across intron, exon etc we will not go through that I recommend you to do it on your own and to apply that also to your own data set then later so we now for the mapping we have now the information of where the reads are located in the reference genome and how well they were mapped and so the next step in our RNA-seq data analysis is to quantify the number of reads mapped to a specific genomic feature so the genes like now to identify the genes that are differentially expressed because of the passilla genes depletions so for that we need to we now so we have where the reads are mapped but on the locations they are mapped on the genomes but not to which genes they correspond to so to then compare the expression of sequences between different conditions so between with depletion or without depletions we need to to identify the number of reads that are we need to quantify the number of reads per genes or specifically we need to quantify the number of reads mapping to the exon of each genes as we already mentioned before though the reads are mapping only to exon not the entrant so when we need to count the number of reads per genes we need to count the number of reads that are mapping over the exons so if we take this example of these figures the question so we have two questions here so how many reads are found for the different exons so for example for the exon one we have three reads mapping there exon two, two, exon one in the second genes one, two, three, exon two one, two, three, four and exon three, one, two, three but then if we count at the level of the genes we have different numbers so it's not just an addition of the number of read mapping for the different exons we need to take into account this case of these three that map that span over to exons so here for this this gene so gene one we have one, two, three, four reads that that map to the genes and not five if we would have just make an addition between the two and here for the gene two it's the same because we have one, two three, four, five six and not one, two, three, five, six, seven eight, nine, ten, eleven so we need to be careful when we do the counting of just not to count the different reads inside the genes by taking into account the exon and the reads that span over the different exons so different tools are available for that which just account or features count we will use today features count because it's currently faster and requires far less computational resources to do that so in principle this accounting was it's a fairly simple task as I mentioned but there is some detail that needs to be given to features count for example the strengthness and because the RNA are typically targeting a simple single strength and of polarity the strengthness is usually lost after both strength of CNA are sanitized and selected however keeping this information can be useful especially for read located on the overlap between two genes so for example if you have the read one here and it's map here you are quite sure that it's mapping and it's belonging to gene one but if you talk about the read two here that mapped here at the overlap between read gene one and gene two gene one being on one strength of the DNA and gene two on the other strength and we don't know to which one, to which genes read two belongs to and some library preparation protocol creates some what is called a stranded RNA-seq library and by extracting only the RNA mapping belonging to the genes from one strength so we need to do this information so if our library preparation is unstreaded or stranded and on which strength it belongs to and if you want to know more about strengthness you can read more we added much more information here about the strengthness information usually this information is provided with your FASQ file with when you got your FASQ file from your sequencing facility if not you need to try to find on the site where you donated the data or in the corresponding publication this information it's usually come with the library preparation but if you have no way that you find this information what you can do you can use a tool that is called inferexperiment from airseqqtoolstreet this tool takes a bound file from the mapping and select some sample of the reads and compare where these reads are mapping to the coordinates and strength of using the annotations so based on the strength of the genes where we can find the read we can estimate whether the sequencing protocol is strength specific or not so for example if we know that all the genes are mapping to only genes all the reads are mapping only to genes belonging to the forward read we know that it's a standard library for what same for reverse and if we are read that mapped on both statistically on every sequence on the genomes then we can estimate that the library in unstranded so here we will assume that we don't know the library's strength so we need to estimate that so we need first to convert the GTF the annotation file to a bed 12 because it's needed to the tool inferexperiment so the first step that we need to do so we need to do GTF to BED so it will take this our trozofilamilano-gaster annotation file and transfer it into a bed file a bed 12 file here execute so the bed 12 is just another it's also a tabular file it's a different way of a different column order than the GTF and then once you have that we will use a tool that is called inferexperiment inferexperiment that speculate our RNA-seq where configured so we need to provide the BAM file so we provide the both BAM file we provide the reference genome that is in bed 12 here we need to select how many reads sample from the BAM file to use so here there is 200,000 we will use the same number so 200,000 the mapping quality, minimum mapping quality of 30 so just for information the mapping quality is encoded the same way as the nucleotide the mapping quality for sequencing is between 0 and 40 so here we keep a quite high then it will generate two files so two outputs for the inferexperiment that are from the BOSS sample and this tool generates one file which is a text file with information if the library is parent or single-hand and the fraction of reads that were failed for which we don't know if they become to one strand on the others or map to a specific genes and then for the different type of library they will say how many reads are mapping the fraction of reads that can be explained because mapping to forward-strand genes the fraction for the reverse-strand reads so if we open the this one that is from for the AT sample here we can see so this is parent data the fraction of read that were failed to determine is 9% the fraction of read explained by 1 plus plus, 1 minus blah blah blah, etc. is 0.35 so it's being that 45% of the read are somehow mapped to the forward and 45% to the reverse read to the reverse-strand so then here because the fraction is I mean we don't it seems that there is no clear distinctions between the two types so we can we cannot say we have to say that the library seems to be unstranded if you want more details about the strenness because it can be sometimes difficult to find out which settings correspond to the others given when we talk about the strenness we created a small table that can be useful to identify the library type that you need to fill or the information that you need to fill for the different tools so depending on the library preparation and the tool here what are the difference so if it's parent and and forward or reverse, etc. so now we know that our reads are forward and our library preparation is unstranded, we can run features counts to count the number of reads per annotated genes so we search for feature counts here in our tools so feature feature counts we define we need to give our to-bam file and we say that its library is unstranded we will use a locally cached gtf file we will in your history, sorry the annotation is in our history we will use the gtf file we want the output to be a gene ID slash read counts that can be used for multi qc and dsec2 we want to create what is called a gene length file and we will need that later on what do we want to have too in the option for parent read, so option for parent we have several information we want to count fragment instead of read we want to count what do we want already I'm sorry I'm lost we want to count the fragment so we want to count the fragment instead of the reads and in advanced options so we want to be sure that we filter that we count the number of read mapping on the exons and aggregate that at the level of the gene that's what we say there we don't want the reads to map to multiple locations the minimum mapping quality per read is zero there and I think that, oh no we say 10 here sorry to be sure that we are consistent here so we want at least we want to consider only the read that have at least a mapping quality of 10 and then we can you can check the other parameters I recommend you to check the other parameters but then you can run the FAST QC sorry, features count we will as we did already for the others tools so we will aggregate the report from features counts using multi QC once when it's okay for the tools to be searched okay there is some delay in the interface it happens, no worries everything is fine I will just reload the interface it can happen sometimes that we lose the interface can be a bit slower slower to load I will search for the tool again to see then I will start it why, oh features count is already done so I can select features count I use, I select the both summaries that has been generated in the history and I will run execute oh I lost it, yeah true because I reload the interface so I go back to transcriptomics reference-based RNA-seq data analysis counting the number of reads per counting reads per jeans and I'm currently there multi QC I aggregate the result and then the different question that we have are how many reads has been assigned to a jeans and when we should worry about assignment rate and what should we do so we just wait until FAST QC multi QC is done it will be fast it should be fast but then ok it's done so I can visualize 60% of the around 60% of the reads has been assigned to a feature specifically here we selected only the exam so has been assigned to exam if we look there we have a percentage 60% 63% has been assigned to exam 9% has been neither unsigned or unmapped some were unassigned because the mapping quality was too low some were not assigned because multi mapped some were not assigned because there were no features really interesting there and some were ambiguous the next question was when should we worry I will say that when the percentage is below of assignment is below 50% then you should get where your reads have been mapped so are they mapped inside jeans and check if the annotations correspond to the correct reference genome version so it's happening also that you select one version of your reference genome for the mapping and then the annotations you have a different versions so then they are mapping they are mapped correctly but not to jeans that are known or not in the correct location of the jeans but the main output of features count that we can see is this what we call the counts file if you open it you will see that it's a big table where you have the first column is the gene ID so here you have the ID of the jeans and the second column is how many reads have been mapped to this specific gene here so here you have 197 reads that have been mapped to this specific gene and one question that we have see which features are the most count on both samples for that we can use the sort tools to sort these tables by descending order to have on the top the output and you can identify that I will let you do that on your own and so now we have the count we have counting the number of reads that map to the different genes for the two samples so 80 and 77 but it will be for you I will recommend you to do the same procedure as we did so the upload quality control mapping counting for the other data sets to check how did you if you understood the correct how to do that but also how to do when you have single end data so you will find the same the data on the same locations in the data libraries so we have the FASQ file on the top and where you can also find them on the nodal