You are now listening to the RNA-seq tutorial. I'm Bérénice Batut, a postdoctoral researcher from the Freiburg Galaxy team at the University of Freiburg in Germany. Before this, you listened to the transcriptomics lecture from Fotis, and now we will apply what you learned during those lectures to real RNA-seq data. For that we will use the European Galaxy server, which you can reach at usegalaxy.eu. You should have learned how to log in when following one of the Galaxy introduction tutorials, so please log into this server, and then we will follow the tutorial. You can find the tutorials on the Galaxy Training Materials: open a new tab and type training.galaxyproject.org. You will be redirected to a page where all the tutorials are listed. Go to Transcriptomics, and then to the reference-based RNA-seq data analysis tutorial. As you may have already seen, you can also open the tutorial directly inside Galaxy: click on the small graduation-cap icon at the top ("Galaxy Training Materials"), you are redirected to the Galaxy training page, go to Transcriptomics, search for the reference-based RNA-seq data analysis tutorial, click on it, and the tutorial will open. That is the tutorial we will follow now. With this tutorial, we will try to answer several questions: What are the different steps to process RNA-seq data? How do we identify differentially expressed genes across multiple experimental conditions? What are the biological functions impacted by the differential expression of genes? At the end of this tutorial, you should be able to do several things. You should be able to check a sequence quality report generated by FastQC for RNA-seq data.
You should be able to explain the principle and specificity of mapping RNA-seq data to a eukaryotic reference genome; select and run a state-of-the-art mapping tool for RNA-seq data; evaluate the quality of mapping results; describe the process to estimate the library strandness; estimate the number of reads per gene; explain why count normalization must be performed before sample comparison; construct and run a differential gene expression analysis; analyze the DESeq2 output to identify, annotate, and visualize differentially expressed genes; and also perform a Gene Ontology enrichment analysis, and perform and visualize an enrichment analysis for KEGG pathways. You may never have heard most of these terms before, but that's fine: you will learn them by following this tutorial. Before this tutorial, you should have gone through at least one of the Galaxy introduction tutorials and also followed at least the quality control and mapping tutorials; you can find the links in the tutorial itself. As Fotis explained in the lecture introduction, RNA-seq has become a widely used technology to analyze the continuously changing cellular transcriptome. One of the most common aims of RNA-seq is to profile gene expression and to identify genes and molecular pathways that are differentially expressed between different conditions. That is what we will do now in this tutorial: we will try to identify differentially expressed genes and pathways from real data. For that, we will use data published in 2011. In this study, the authors wanted to identify genes and pathways regulated by the pasilla gene, so they depleted the pasilla gene in some Drosophila cells by RNA interference.
They then extracted the total RNA and prepared both single-end and paired-end RNA-seq libraries: for the samples where the pasilla gene had been depleted, but also for control samples where nothing was done. These libraries were then sequenced to obtain RNA-seq reads for each sample. All the data are available on NCBI; you can find the details there. For this tutorial, and to learn how to analyze these data, we will use only seven of the samples: four that we can call control, or untreated, samples, and three treated samples where the pasilla gene has been depleted. Each sample constitutes a separate biological replicate. Some of the samples were sequenced using paired-end sequencing and some using single-end, and we will learn how to manage both of these types of sequencing. Through this tutorial, we will go through a standard RNA-seq data analysis pipeline: quality control, mapping, counting the number of reads per annotated gene, running a differential expression analysis, visualizing it, and running a functional enrichment analysis. We now need to get the data into Galaxy. For the first part of this tutorial, we will use the files for two out of the seven samples to demonstrate the first steps of the pipeline. To get the data into Galaxy, we first need to create a new history. We are now in the tutorial, so we want to go back to the Galaxy interface; we click anywhere outside of the tutorial. We go to the history panel on the right, click on "create a new history", and rename the history "RNA-seq tutorial". Then we need to add our data to this history.
If you go back to the tutorial, you will see that you can import the FASTQ files for these two samples from Zenodo, where they are stored, by copying the links given there and importing them. If you don't know how to do that, you can expand the little box where it is explained. But if you are using the European Galaxy server, we have already put the data in a shared data library for you, so that is what we will do: we will import the data using the data libraries. To do that, go back to Galaxy, go to "Shared Data" at the top, and click on "Data Libraries". You will be redirected to a new page with different folders. You may have different folders than mine, but you should all have one called "GTN - Material": it contains the materials for all the training tutorials. Click on it, go to Transcriptomics, then to "Reference-based RNA-seq data analysis", and then click on the DOI, the title. There we have different files. There is a GTF file that we will use later, and then, for the different samples, which are named GSM4611-something, there are several files: a fastqsanger file, a counts file, and so on for each of the samples. We need to go back to the tutorial to check which samples we want to use. You see that you are redirected to the home page, because we changed the page so the tutorial is not loaded anymore. So we go to Transcriptomics again, then the reference-based RNA-seq data analysis tutorial; you can go directly to the correct section using the table of contents on the left. You see that for this tutorial we need to load the 77 sample (GSM461177), which is an untreated sample, and the 80 sample (GSM461180), which is a treated sample. So we will select the FASTQ files for these samples. As mentioned there, these should be 77_1, 77_2, 80_1, and 80_2.
So we need to select four files. We select 77_1 and 77_2, then go to the second page to find 80_1 and 80_2. You should now have four datasets selected, which we want to export to our history as datasets. You select which history to import them into and click on "Import"; it will import the selected items into your history. When everything is green, you can go back to the main Galaxy page by clicking on "Analyze Data". On the right, in your history, you should now have four datasets. But you see that the names of these datasets are the URLs, so we need to rename them. We edit the attributes of each dataset and remove the first part of the URL to keep only 77_1; we save that and do the same for the other datasets. We rename 77_2, save it, then rename the two 80 datasets the same way. Perfect. We now have four files with better names. But we want to make clear that both datasets from the same sample, 77_1 and 77_2, are somehow linked together, so we want to add something called a tag. To do that, we expand a dataset by clicking on it and click "Edit dataset tags". We add a tag which is the name of the sample: GSM461177. We want this tag to propagate, so that it is visible on all the outputs of the tools we run on this dataset; for that we add a hash sign (#) first. We copy this tag, because we want to add the same tag to the other dataset of the sample. Once you are ready, press the Return key on your keyboard. You do the same for the second file of the sample, pasting the tag you already copied, and it is added automatically. You see that we now have a small tag below the names of the files, and you will see later how useful it can be. Then we do the same for the other sample, GSM461180.
You copy it as well, because we will add it again to the other dataset of that sample. And here we are: we now have four files in our history, with correct names and with tags. Let's check the tutorial again: did we forget any steps? We need to reload the tutorial, so we go to Transcriptomics, reference-based RNA-seq data analysis, "Data upload". We are at this step. So: we created a new history and named it correctly; we imported the FASTQ files; we renamed each dataset according to the sample. Ah, we forgot this step: check that the datatype is fastqsanger and not fastq. To do that, we expand each file, and we see fastqsanger, fastqsanger, fastqsanger, fastqsanger, so all good for our datasets. Back to the tutorial: we then added a tag with the sample name for each dataset. Then we have a question: how is the sequence data stored, and what are the different entries in the file? If we open one of the files, we see that it is a FASTQ file, which means that each sequence is stored as four lines. The first line always starts with an "@", followed by the sequence identifier and some information from the sequencing facility. The second line is the sequence itself. The third line starts with a plus. And the fourth line is a quality score for each nucleotide of the sequence. You should have already followed the quality control tutorial, where all of this is explained in detail, so I will not go through the details again now. Let's go back to the tutorial. So, we finished that. The files we just uploaded to Galaxy contain the reads as raw data from the sequencing machine: no pretreatment has been done on the data. The first step of any RNA-seq analysis is therefore to assess the quality, because during sequencing some errors can be introduced, such as an incorrect nucleotide being called.
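To make the four-line FASTQ structure concrete, here is a minimal Python sketch that holds one record and decodes its fastqsanger (Phred+33) quality string; the read identifier and sequence are made up for illustration, not taken from the actual samples.

```python
# One FASTQ record is always four lines (identifier, sequence, "+", qualities).
header = "@read1 FLOWCELL:5:1:944:1798 length=37"  # hypothetical identifier line
seq    = "GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC"   # the 37-base sequence itself
sep    = "+"                                       # separator line
qual   = "I" * 37                                  # one quality character per base

assert header.startswith("@") and len(seq) == len(qual)

# fastqsanger encodes Phred scores as ASCII code minus 33,
# so 'I' (ASCII 73) means a Phred quality of 40.
phred = [ord(c) - 33 for c in qual]
print(phred[0])  # 40
```

The offset of 33 is exactly what distinguishes the fastqsanger datatype we just checked from older FASTQ variants with other offsets.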
These errors are due to technical limitations of the sequencing platforms, but you probably already know more about this from the quality control tutorial. We will now follow the same steps you did there: run FastQC to create a report of the sequence quality, run MultiQC to aggregate the generated reports, and run Cutadapt to improve the quality of the sequences via trimming and filtering. So let's go back to Galaxy. First, we need to run FastQC. We search in the tool panel on the left for FastQC and select "FastQC Read Quality reports". We want to run it on all four of our files, so we click on "Multiple datasets" and, with the Shift key on your keyboard, select the four files; then you can execute. You will see that for each of the samples, two new datasets are generated in your history: for example, "FastQC on data 1: Webpage" and "FastQC on data 1: RawData". What we want to do afterwards is aggregate the FastQC reports using MultiQC. We can already launch MultiQC even while FastQC is still running. We search for MultiQC in the tool panel and select which tool was used to generate the reports: FastQC. Then we need to say which type of FastQC output to use: the RawData. Using the Shift key again, you select the four RawData outputs; you should have selected four files. Then you can execute. MultiQC will wait until all the FastQC runs are done, so we have a few minutes to wait. You can already see that the tag we added was automatically propagated to the FastQC outputs; it is a way for us to keep track of which sample each FastQC report was run on.
So it is a good way for us to keep track of that. If we open, for example, the FastQC webpage report for one of the files, you see some basic statistics and the different reports that have been generated; you already went through each of these reports when following the quality control tutorial. Let me check the tutorial: we are in the hands-on section on quality control, and we ran FastQC. We have a question there: what is the read length? How do we know the read length? We open the webpage report from FastQC, and we see that the sequence length is 37, so these are quite short sequences. But remember, this is old data, published in 2011. We go back to the tutorial. We are already running MultiQC, and once it is done, we need to inspect its webpage output. We then have two questions to check: what do you think about the quality of the sequences, and what should we do? We are waiting for MultiQC to run. You can already see that, because we ran MultiQC on the outputs of these files, both tags have been added to this file. MultiQC is running; we wait until it is done, which should be fast now. It is green, so everything is fine, and we can open it. You see that the reports for the different samples have been aggregated, and we can go through the different FastQC outputs and see all the samples in one report: one graph for the sequence quality histograms, with the different samples on it. So, the question was: what do you think about the quality of the sequences? The quality is quite good: for most of the samples it is green, so the quality is okay.
It decreases at the end, which is expected for most Illumina sequencing. But we see that for one of the samples, 80_2, the decrease is much stronger, and it is flagged with a warning by FastQC, so the quality may not be so good there. What should we do? Here you have a more detailed answer if you want to check. The histogram of the mean quality score per sequence is quite okay except for 80_2; we may need to trim the end of the sequences there because the quality drops too much. The GC content is okay. There is some N content at the read ends. The duplication level is quite high, but that is expected for RNA-seq data. And that covers most of it; if you want more detail about what can be said here, I recommend checking the detailed answer in the tutorial. So, as we said, we should trim the reads to get rid of bases that were sequenced with low confidence, mostly at the end of the reads, and also remove the reads with overall bad quality. We have another question in the tutorial: what is the relationship between _1 and _2 for each of the samples? We are talking here about paired-end sequencing: the _1 files are the forward reads and the _2 files are the reverse reads. You have a detailed explanation there about what that means. So now we want to remove the bad-quality nucleotides, but also the bad-quality reads. For that, we will run Cutadapt. We go to the tool panel and search for Cutadapt, the tool that removes adapter sequences and trims reads. Because we checked that we have _1 and _2 files, we have paired-end data, so we select "Paired-end".
As Read 1 we select the forward reads, the _1 files, and as Read 2 the reverse reads, the _2 files. What are the different parameters? Let me check the tutorial, as I don't remember all the details. We want a filtering option to remove all reads that are shorter than 20 bases, and we want to get rid of low-quality bases: trim where the quality is below 20, including where the quality drops too much at the end. So, back in Galaxy: in the filter options we set a minimum length of 20; in the read modification options we set a quality cutoff of 20, which means bases will be trimmed based on a quality score threshold of 20; and in the output options we set "Report" to yes, so that a report is created that can be used afterwards by MultiQC. In general, I recommend you go through all the parameters; here I go quickly, but you can find more details in the quality control tutorial and by reading the documentation. This generates a report that we can again aggregate using MultiQC, and I recommend, while Cutadapt is running, to already start MultiQC. For MultiQC, as the tool that was used to generate the reports, you search for Cutadapt, and then you select both reports that have been generated. We now have only two Cutadapt runs, not four, because the two files of each sample were used together, so we have two reports to aggregate with MultiQC. You can take a short break now, because Cutadapt can take some time to run. It is now finished, as well as MultiQC, so we can inspect the MultiQC output to check the result of the trimming and filtering. If we open the MultiQC report, we see that for the 77 sample, 2.5% of the base pairs have been trimmed.
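To illustrate what these two settings do, here is a simplified Python stand-in for the Cutadapt step: trim low-quality bases from the 3' end (quality cutoff 20), then drop reads that end up shorter than 20 bases. Cutadapt's real quality trimming uses a partial-sum algorithm, so this is only a sketch of the idea, and the example reads are invented.

```python
def trim_and_filter(reads, q_cutoff=20, min_len=20):
    """reads: list of (sequence, per-base Phred scores).
    Strip trailing bases below q_cutoff, then discard reads
    that become shorter than min_len."""
    kept = []
    for seq, quals in reads:
        end = len(seq)
        while end > 0 and quals[end - 1] < q_cutoff:
            end -= 1  # trim one low-quality base from the 3' end
        if end >= min_len:
            kept.append((seq[:end], quals[:end]))
    return kept

reads = [
    ("A" * 37, [30] * 30 + [10] * 7),   # 7 bad 3' bases: trimmed to 30, kept
    ("A" * 37, [30] * 15 + [10] * 22),  # trimmed to 15: too short, dropped
]
print([len(s) for s, _ in trim_and_filter(reads)])  # [30]
```

This also shows why trimming and filtering interact: the second read is dropped not because it started short, but because trimming made it shorter than the minimum length, exactly what we will observe in the MultiQC report.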
That is especially at the end of the reads. For the 80 sample, it seems that 12% have been removed, which is something we expected given the earlier results, where we saw that, at least for the 80 sample, the quality was not so good at the end. If we look at the percentage of filtered reads, we see that quite a lot of reads have been removed as well: for the 77 sample, 1.4% of the reads were too short after the trimming, so they were removed, and for the 80 sample, 9% have been removed. That means that many base pairs were cut from the end of the reads, and those reads then became too short to be kept. That is what is said there. If we go back to the tutorial, these are exactly the questions that were asked. So now we can move to the mapping part. Now that the quality control is done, we need to do the mapping. The idea of mapping is to make sense of the reads: to try to identify where each read comes from, from which gene. For that, we need to locate where they come from on the genome, to then say: this part of the genome where the read belongs corresponds to this gene. Because we are working with Drosophila, and for Drosophila a reference genome is available, we can use it to help us find the locations. This process is called aligning, or mapping, the reads to the reference genome. You should already have learned a bit about mapping by following the mapping tutorial; if not, please feel free to follow it. In this study, Drosophila melanogaster cells were used to extract the RNA, so we need to find and use the reference genome of Drosophila melanogaster to map the reads to it.
There is a question in the tutorial about what a reference genome is, what the different versions of a reference genome are, why we have them, and which one we should use. Please check the solutions on your own, but the short answer is: once we start using one version of a reference genome, we should keep using that same version throughout the analysis. With eukaryotic transcriptomes, most reads originate from processed mRNAs, so from transcripts without the introns. The reads come from exon 1, exon 2, or exon 3, and they can span over these different exons: for example, the blue reads here span over exon 1 and exon 2 and skip the intron in between. Because of that, because of this specificity of eukaryotic transcriptomes, we cannot just map the reads to the genome as we usually do for DNA data. We need specific mappers that were developed to efficiently map reads over different exons. The idea of most of these tools is that they first try to map the reads contiguously within exons, the mapped reads, and for the reads that did not map, they try to split them over different exons. That is the idea of splice-aware mappers, and it is what is used for eukaryotic transcriptome data. In this box we give more details about the different splice-aware mappers, and especially the history behind them. I recommend you read it; I will not go through the details now, but just remember that there are different generations of mappers, with different implementations. We will now map our reads to the Drosophila melanogaster reference genome using a tool called STAR. For that, we first need annotations of where the different exons are in the genome, and we need to import them into our history.
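As a toy illustration of the splice-aware idea (this is not how STAR actually works internally), the sketch below first tries a contiguous match and, if that fails, splits the read in two and places the halves at separate genomic positions, leaving an "intron" gap in between; the genome and read are invented.

```python
def splice_aware_map(read, genome, min_seed=8):
    """Toy splice-aware mapper: return a list of (start, end) blocks.
    One block = a contiguous (exonic) hit; two blocks = a read that
    spans an intron between two exons."""
    pos = genome.find(read)
    if pos != -1:
        return [(pos, pos + len(read))]            # maps within one exon
    for i in range(min_seed, len(read) - min_seed + 1):
        left, right = read[:i], read[i:]
        lpos = genome.find(left)
        if lpos == -1:
            continue
        # the right half must map strictly downstream: the gap is the intron
        rpos = genome.find(right, lpos + len(left) + 1)
        if rpos != -1:
            return [(lpos, lpos + len(left)), (rpos, rpos + len(right))]
    return None                                    # unmapped

#          exon 1          intron       exon 2
genome = "AAAACCCCGGGG" + "TTTTTTTT" + "ACGTACGT" + "AAAA"
read = "CCCCGGGG" + "ACGTACGT"   # spans the exon1/exon2 junction
print(splice_aware_map(read, genome))  # [(4, 12), (20, 28)]
```

The returned two blocks with a gap between positions 12 and 20 correspond exactly to the intron that the read skipped, which is the information a DNA mapper cannot recover.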
As for the sequencing data, this file, which is called a GTF file, is available in the data library, so we import it from there. Please go back to Galaxy, go to "Shared Data", "Data Libraries", then to the "GTN - Material" folder. Once you are there, search for Transcriptomics, go to "Reference-based RNA-seq data analysis", then the DOI, and use this one, the third file here; we want to import it into our history. Just to mention it: if you go back to "Shared Data", "Data Libraries", you can see that other genomes and annotations are also available there, where you can find annotations, reference genomes, et cetera. Here we use this particular file so that we are sure we all use the same one across the different Galaxy servers, but I just wanted to show you where you can find other reference genomes on the European Galaxy server. So let's go back to Galaxy. We now have our GTF file there. We rename it as we did for the FASTQ files at the beginning: we remove the first part, we remove the ".gtf" at the end, and we save it. And we need to load the tutorial again, as every time we change a page in the interface. We were in the mapping part, and we are here now. We need to check that the datatype is GTF and not GFF, and that the database is dm6, which is the version of the Drosophila melanogaster reference genome. We have the GTF here; we need to change the database. We go to edit attributes, and there in "Database/Build" you can search for dm6 and save it. Now this file is associated with that genome build. Okay. Now we want to run RNA STAR, which is a mapper, with the following parameters, to map the reads to the reference genome.
We have paired-end data. We search for STAR in the tools and use "RNA STAR", the gapped-read mapper for RNA-seq data. We say that we have paired-end data as individual datasets, and we now need to provide the forward and reverse reads. We have two samples and want to run with the same parameters, so we select "Multiple datasets" and select the outputs of Cutadapt, which are the trimmed reads: for the forward reads, it is always the "Read 1 Output", and for the reverse, the "Read 2 Output". Be careful when you select them; it is important that you select the correct ones. Custom or built-in reference genome? We use a built-in genome index: we select the genome reference without a built-in gene model, and as reference genome we select dm6. Then we can provide a gene model for the splice junctions, which tells STAR where the different exons are; for that we use the GTF file that we just uploaded. Then there is the parameter for the length of the genomic sequence around annotated junctions, and it says the ideal value is the read length minus 1. Do you remember what the read length was? We saw it in the FastQC output: it was 37, so here we enter 36. What about the other parameters? You can check them; take the time to do that when you can, I will not do it now. Then you can execute STAR. It will take some time, so please feel free to take a break for a few minutes while it is running. RNA STAR is now done. You now have in your history three files for the 80 sample and three files for the 77 sample. As we did for FastQC and Cutadapt, we can aggregate the results from STAR, the reports that are in the log files, using MultiQC. So search for MultiQC; as the tool, you search for STAR.
As the type of STAR output, we use the log; you select the two log files and execute. It will give us some information: how many reads were mapped to the reference genome, how many reads were multi-mapped, meaning mapped to several locations in the reference genome, et cetera. While MultiQC is running, we can check the next question: what percentage of reads were mapped exactly once for both samples, and what other statistics are available? You have some details here. We will check the results once MultiQC is done, and it seems to be done now, so I will open the webpage. It says that 83% of the reads for the first dataset were mapped, or aligned, to the reference genome, and 79% for the second dataset. If we look at the percentages, 80.3% were uniquely mapped, meaning they mapped to only one location. Then 5.5% of the reads were mapped to multiple locations; that probably means those reads are too short and, because of repeated regions in the genome, they map to several places. Some were mapped to too many locations, and some were too short to be correctly mapped at all. We also have the information for the second dataset here. So, according to the report, more than 80% of the reads for both samples mapped at least once to the reference genome, and we can proceed with the analysis. If the percentage were below 70%, we should probably investigate possible contamination; it could have been introduced during the sequencing or during the RNA extraction. One thing you can do in such a case is to try to map the reads to the human reference genome, or to other reference genomes, maybe bacteria, to see if there is some contamination that you can then remove.
But the main, and most interesting, output of RNA STAR is the BAM file. If you already followed the mapping tutorial, you know what a BAM file is and which information it contains, so I will not go into too much detail; I will just show an example here. If you scroll down past the header lines, you have the name of the sequence and some extra information, like the size, on which chromosome it has been mapped, at which location, with which quality, et cetera. You can find the details on how it looks in the mapping tutorial. The question is: which information do you find in a BAM (or SAM) file, and what is the additional information compared to a FASTQ file? The extra information in a BAM or SAM file, compared to a FASTQ file, is the location where the read has been mapped on the reference genome. Because a BAM file contains information for all our reads, it is difficult to inspect, especially in this text representation. There are powerful tools to visualize the content of a BAM file, for example IGV. I recommend you install IGV and start it; I have already started it locally, so please follow the instructions to install it, and then we can visualize the BAM file for the 77 sample together. We go back to Galaxy, expand the BAM file, and click "display with IGV local"; it will launch IGV for you. Then you can see the different information here. We will zoom in a bit: to zoom, you select an interval there, and you may need to zoom again a bit more. It is loading slowly; IGV is a rather slow tool.
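To see what such an alignment line actually carries, here is a sketch that splits one made-up SAM record into its 11 mandatory tab-separated fields; the read name, chromosome, and position are invented for illustration, and the position fields are exactly the extra information a FASTQ file does not have.

```python
# 11 mandatory SAM fields; the values here are made up for illustration.
sam_line = "\t".join([
    "read1",        # QNAME: read name
    "0",            # FLAG:  bitwise flags (0 = mapped, forward strand)
    "chr4",         # RNAME: reference sequence (chromosome)
    "560",          # POS:   1-based leftmost mapping position
    "255",          # MAPQ:  mapping quality
    "37M",          # CIGAR: 37 bases aligned as matches/mismatches
    "*", "0", "0",  # RNEXT, PNEXT, TLEN: mate information
    "A" * 37,       # SEQ:   the read sequence
    "I" * 37,       # QUAL:  per-base qualities, as in the FASTQ file
])
names = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
         "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
rec = dict(zip(names, sam_line.split("\t")))
print(rec["RNAME"], rec["POS"])  # chr4 560
```

Note that the sequence and quality fields are the same data as in the FASTQ file; everything before them is what mapping added.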
But it is a good one. So here you have a good example: you have the different reads, and where they are mapped on the reference genome, which is shown at the top, you see their locations. And you have what we call the coverage here, which represents the density of reads mapping to the different locations. If we go back to the tutorial on Galaxy, we would like to zoom to a specific location on chromosome 4. Please copy this text, the chromosome 4 coordinates, go back to your IGV, and paste this location into the box at the top; it will take you directly to that specific region. Then you see exactly the region from the tutorial displayed in IGV. If we go back to the tutorial, we should have something like this. What is the information in the grey peaks at the top? This is the coverage, as we already mentioned. What do the connecting lines between some of the aligned reads indicate? If we go back to IGV, you have some examples of these horizontal lines between read parts. These are the introns in between: you have a read where one part maps here and one part there, which means this read spans over the intron in between; that is what these lines represent. Another way to inspect the splice junctions is what is called a Sashimi plot. To create one, right-click in IGV and click on "Sashimi plot". It will create this plot where you have the coverage, and at this location you have arcs with numbers on top: the number of reads spanning over each intron. So here we have 7 reads that map and span over this specific intron, 30 here, 20 here, 22, et cetera. That is what is represented here.
And the next question is: why do we observe different stacked groups of blue boxes and lines at the bottom? These represent the different transcripts of the genes that are annotated in the GTF file. So in the annotation we see that there is an exon here, an intron there, another exon there, and we know that the same exons can be used by different transcripts: this one is one transcript, this one is another transcript, et cetera. So we have different transcripts here that share the same exons in the annotation. Okay. And once you are done with the inspection of the mapping results, you can do some extra checks on the mapping. We give a lot of details in this expandable box: checking, for example, the percentage of duplicated reads using a tool called MarkDuplicates, the number of reads mapped to each chromosome, the gene body coverage, or the read distribution across features (coding regions, introns, exons, et cetera). We will not go through that here; I recommend you to do it on your own and to apply it later to your own datasets. So we now have the mapping: we have the information of where the reads are located on the reference genome and how well they were mapped. The next step in our RNA-seq data analysis is to quantify the number of reads mapped to genomic features, here the genes, in order to identify the genes that are differentially expressed because of the pasilla gene depletion. We know at which locations on the genome the reads are mapped, but not to which genes they correspond. So to compare the expression between the different conditions, with depletion and without depletion, we need to quantify the number of reads per gene, or more specifically, the number of reads mapping to the exons of each gene.
As we already mentioned, the reads map only to exons, not to introns. So when we count the number of reads per gene, we need to count the reads that map over its exons. If we take the example in this figure, we have two questions. How many reads are found for the different exons? For example, for exon 1 of the first gene we have three reads mapping; for exon 2, two. For exon 1 of the second gene: one, two, three. Exon 2: four. And exon 3: three. But if we count at the level of the genes, we get different numbers: it is not just the sum of the reads of the different exons, because we need to take into account the reads that span over two exons. So here, for gene 1, we have four reads that map to the gene, not five as we would get by simply adding the two exon counts. And for gene 2 it is the same: we have six reads, not the ten we would get by addition. So when counting per gene, we must be careful to count each read only once, taking into account the exons and the reads that span over several exons. Different tools are available for that: HTSeq-count or featureCounts. We will use featureCounts today because it is currently faster and requires far less computational resources. In principle this counting is a fairly simple task, as I mentioned, but some details need to be given to featureCounts, for example the strandedness. RNA is single-stranded and has a polarity, but this strand information is usually lost after both strands of cDNA are synthesized during library preparation. However, keeping this information can be useful, especially for reads located on an overlap between two genes.
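The "count each read only once per gene" rule described above can be sketched in a few lines of Python. This is only an illustration of the principle, not featureCounts itself: if we track which read IDs hit which exon, the gene-level count is the size of the union of those sets, not their sum, so a read spanning two exons contributes once.

```python
def count_reads_per_gene(gene_exons, read_hits):
    """gene_exons: {gene: [exon names]}; read_hits: {exon: set of read IDs}.
    A read overlapping several exons of the same gene is counted once:
    we take the union of read IDs over the gene's exons, not the sum."""
    counts = {}
    for gene, exons in gene_exons.items():
        reads = set()
        for exon in exons:
            reads |= read_hits.get(exon, set())
        counts[gene] = len(reads)
    return counts
```

With three reads on exon 1 and two on exon 2, one of which spans both exons, the gene gets 4 reads, not 5, exactly as in the figure.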
So for example, if you have read 1 here, mapped here, you are quite sure that it belongs to gene 1. But read 2 here maps on the overlap between gene 1 and gene 2, gene 1 being on one strand of the DNA and gene 2 on the other, and we don't know to which gene read 2 belongs. Some library preparation protocols create what is called a stranded RNA-seq library, by keeping only the RNA coming from one strand. So we need to have this information: whether our library preparation is unstranded or stranded, and if stranded, which strand. If you want to know more about strandedness, we added much more information here in the tutorial. Usually this information is provided with your FASTQ files, when you get them from your sequencing facility. If not, you need to try to find it on the site where you downloaded the data or in the corresponding publication, because it comes from the library preparation. But if there is no way to find this information, you can use a tool called Infer Experiment from RSeQC. This tool takes the BAM file from the mapping, selects a sub-sample of the reads, and compares where these reads map with the coordinates and strands of the genes in the annotation. Based on the strand of the genes where the reads are found, it can estimate whether the sequencing protocol is strand-specific or not. For example, if all the reads map only to genes in the forward orientation, we know that it is a stranded forward library; same for reverse. And if the reads map equally to genes on both strands across the genome, then we can estimate that the library is unstranded.
So here we will assume that we don't know the library strandedness, so we need to estimate it. First we need to convert the GTF annotation file to BED12, because that is what the Infer Experiment tool needs. So the first step is to run "Convert GTF to BED12": it will take our Drosophila melanogaster annotation file and transform it into a BED12 file. Execute. The BED12 file is also a tabular file, just with a different column layout than the GTF. Once we have that, we use the tool Infer Experiment, which "speculates" how the RNA-seq library was configured. We need to provide the BAM files, so we provide both BAM files, and the reference gene model, which is the BED12 file here. We need to select how many reads to sample from the BAM file; here it is 200,000 and we will use that number, with a minimum mapping quality of 30. Just for information, the mapping quality is encoded the same way as the nucleotide qualities from sequencing, between 0 and 40, so here we keep a quite high threshold. It will then generate two outputs, one per sample. For each sample the tool generates a text file with the information whether the library is paired-end or single-end, the fraction of reads for which the strand could not be determined, and then, for the different library types, the fraction of reads explained by each: the fraction consistent with a forward-stranded library and the fraction consistent with a reverse-stranded one. If we open the output for the GSM461180 sample here, we can see that this is paired-end data, and the fraction of reads for which it failed to determine the strand is 9%.
The fraction of reads explained by "1++,1--,2+-,2-+" et cetera is around 0.45, and similarly for the other configuration. So roughly 45% of the reads are consistent with the forward orientation and 45% with the reverse. Because there is no clear distinction between the two types, we have to say that the library seems to be unstranded. Since it can sometimes be difficult to find out which setting of one tool corresponds to which setting of another when we talk about strandedness, we created a small table in the tutorial that can be useful to identify the library type, or the value that you need to fill in for the different tools, depending on the library preparation and the tool (paired-end or single-end, forward or reverse, et cetera). So now we know that our library preparation is unstranded, and we can run featureCounts to count the number of reads per annotated gene. We search for featureCounts in our tools. We give our two BAM files, we say that the library is unstranded, and for the annotation, the GTF file is in our history, so we use the one from the history rather than a locally cached file. We want the output to be in "Gene-ID / read-count" format, which can be used by MultiQC and DESeq2. We want to create what is called a gene length file, which we will need later on. Then, in the options for paired-end reads, we have several settings: do we want to count fragments instead of reads? Yes, we want to count fragments instead of reads. And then we go to the advanced options.
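The decision we just made from the Infer Experiment output can be written down explicitly. The tool itself only reports the fractions; the call (forward-stranded, reverse-stranded, or unstranded) is up to you. Here is a small sketch of that decision rule, with the 0.2 margin being an assumption of mine, not a value from RSeQC:

```python
def guess_library_type(frac_failed, frac_forward, frac_reverse, margin=0.2):
    """Decision one makes from RSeQC Infer Experiment output.
    frac_forward: fraction explained by '1++,1--,2+-,2-+' (paired, forward);
    frac_reverse: fraction explained by '1+-,1-+,2++,2--' (paired, reverse);
    frac_failed: fraction for which the strand could not be determined.
    If neither orientation clearly dominates, call the library unstranded."""
    if frac_forward - frac_reverse > margin:
        return "forward (stranded)"
    if frac_reverse - frac_forward > margin:
        return "reverse (stranded)"
    return "unstranded"
```

With fractions like 0.45 / 0.45, as in our samples, this returns "unstranded", which is the conclusion we drew above.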
So in the advanced options, we want to be sure that we count the number of reads mapping on the exons and aggregate that at the level of the genes; that is what we specify there. We don't want to count reads that map to multiple locations. The minimum mapping quality per read is 0 by default; we set it to 10 here, to be consistent, so we only consider reads that have a mapping quality of at least 10. I recommend you to check the other parameters too, and then you can run featureCounts. As we did already for the other tools, we will aggregate the reports from featureCounts using MultiQC. There is some delay in the interface; it happens, no worries, everything is fine, I will just reload it. It can happen sometimes that the interface is a bit slow to load. I search for the tool again, MultiQC. And featureCounts is already done, so I can select featureCounts as the tool, select both summary files that have been generated in the history, and click execute. Because I reloaded the interface, I lost the tutorial page, so I go back to transcriptomics, reference-based RNA-seq data analysis, the section on counting the number of reads per gene, where I currently am. With MultiQC I aggregate the results. And the questions that we have are: how many reads have been assigned to a gene, when should we worry about the assignment rate, and what should we do then? So we just wait until MultiQC is done; it should be fast. Okay, it's done, so I can visualize it. Around 60% of the reads have been assigned to a feature, and specifically here, since we selected only the exons, assigned to exons.
If we look there, we have the percentages: 63% have been assigned to exons. Some reads were unassigned because they were unmapped, some because the mapping quality was too low, some because they were multi-mapped, some because there was no feature at their location, and some because the assignment was ambiguous. The next question was: when should we worry? I would say that when the percentage of assignment is below 50%, you should investigate where your reads have been mapped: are they mapped inside genes, and does the annotation correspond to the correct reference genome version? It happens that you select one version of the reference genome for the mapping and a different version for the annotation; then the reads are mapped correctly, but not to genes that are known, or the genes are not at the correct locations. But the main output of featureCounts is what we call the counts file. If you open it, you will see that it is a big table where the first column is the gene ID and the second column is how many reads have been assigned to this specific gene. So here you have 197 reads assigned to this specific gene. One question we have is: which features have the most counts in both samples? For that, you can use the Sort tool to sort these tables in descending order, to have the top features at the top, and identify them. We will let you do that on your own. So we have now counted the number of reads mapping to the different genes for the two samples, GSM461177 and GSM461180. I recommend you to apply the same procedure as we did, the upload, quality control, mapping and counting, to the other datasets, to check whether you understood how to do it, but also how to handle single-end data.
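The "sort the counts table in descending order" exercise mentioned above is a one-liner in most languages. As a sketch of what the Galaxy Sort tool does for you (the gene IDs here are made up for the example), over featureCounts-style "gene-ID tab count" rows:

```python
def top_features(count_lines, n=3):
    """count_lines: featureCounts-style 'geneID<TAB>count' rows.
    Returns the n features with the most assigned reads, i.e. what the
    Galaxy Sort tool gives you with descending numeric order on column 2."""
    parsed = []
    for line in count_lines:
        gene, count = line.split("\t")
        parsed.append((gene, int(count)))
    parsed.sort(key=lambda gc: gc[1], reverse=True)
    return parsed[:n]
```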
You will find the data in the same location in the data libraries; we have the FASTQ files there, and you can also find them on Zenodo. Now that we have the counts for the samples, we would like to do the differential gene expression analysis. To do that, we need all the datasets of the seven samples to be processed with the same procedure. As I mentioned, you could do it on your own, but to save time we already ran the previous steps for you, and we obtained seven files with the counts for each gene of Drosophila for each sample. So now we will import these seven files into a new history and run the differential expression analysis together. The first thing we need to do is to create a new history, and we will call it "RNA-seq differential expression analysis". Then we need to import the data; as before, it is located in the same place. You go to the top, Shared Data, Data Libraries, then the GTN material, transcriptomics, reference-based RNA-seq data analysis, and the folder there. And then we select all the files whose names end in ".counts". Be careful: don't use the ".exon.counts" files, but the ".counts" ones. You should select seven files, from GSM461176 to GSM461182, and then you import them into your new history. You go back to the main Galaxy interface, and you rename these files to keep only the interesting part of the name. Let's do that; it takes a few moments for each sample. One more, and it should be good now. So we now have seven files in our new history, from GSM461176 to GSM461182. Ah yes, true, I always forget that we need to reload the tutorial after we change page, so I go back to the reference-based RNA-seq data analysis tutorial.
And then on the left, I go to the analysis of the differential gene expression. So, we renamed the history. If we look at each of the files, with a quick inspection we can see that each of them has a first column with the gene IDs, the same in every file, and a second column with the number of reads that mapped to this particular gene for this dataset. We could compare the files directly and calculate the extent of differential gene expression from that, but it is not that simple. Let's imagine that we have RNA-seq counts from three samples for a genome with four genes: sample 1, sample 2, sample 3, and genes A, B, C and D. If we look at the samples, we can see that sample 3 has many more reads for each of the genes than the other replicates, regardless of the gene. It means that sample 3 has a higher sequencing depth than the other replicates. The other thing we can look at is that gene B is twice as long as gene A, which might explain why it has twice as many reads, regardless of the replicate. So the number of sequencing reads mapped to a gene depends on the sequencing depth of the sample, but also on the length of the gene. To compare samples and gene expression, we therefore need to normalize the counts. For that, we could use something called TPM (Transcripts Per Million), or RPKM or FPKM; if you want more explanation of these different terms, I recommend you to look at this expandable box. But RNA-seq is often used to compare one tissue to another, for example muscle with epithelial tissue, and then you could have a lot of muscle-specific genes that are transcribed in muscle but not in the epithelial cells. That is what we call a difference in library composition.
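The TPM normalization mentioned above corrects for exactly the two effects in the example: gene length and sequencing depth. As a small sketch of the arithmetic (not a tool we use in the tutorial, just the definition): divide each count by the gene length to get reads per kilobase, then rescale so the values in the sample sum to one million.

```python
def tpm(counts, lengths_kb):
    """Transcripts Per Million: divide each count by the gene length in
    kilobases (reads per kilobase), then scale so the sample sums to 1e6.
    Corrects for gene length and sequencing depth within one sample."""
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]
```

For instance, a 2 kb gene with 20 reads and a 1 kb gene with 10 reads get the same TPM, because per kilobase they are equally covered.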
We could see a similar difference in library composition after the knockout of a gene. For example, let's imagine that we have RNA-seq counts from two samples, sample 1 and sample 2, that have the same library size: both have 635 reads in total for a genome of six genes. The genes have the same expression in both samples, except one: sample 1 transcribes gene D, but sample 2 does not, and at a really high level, more than 500 reads. It means that, since we have the same total number of reads, the 500 reads that go to gene D in sample 1 have to be spread over the other genes in sample 2. For the read counts, it implies, for example, that gene F looks almost 10 times more expressed in sample 2 than in sample 1, but that is only an artifact of one differentially expressed gene, gene D. TPM, RPKM or FPKM do not deal with this difference in library composition in their normalization, so we need to use a more complex tool like DESeq2. DESeq2 is a great tool for dealing with RNA-seq data and running differential gene expression analysis. It takes the raw counts, as we have them now in our history, applies a normalization for sequencing depth and library composition, and performs the statistical analysis for the differential gene expression. If you are interested in how the normalization in DESeq2 works, correcting for both sequencing depth and library composition, I recommend you to read this expandable box. So DESeq2 runs, as I already mentioned, the differential gene expression analysis, where the two basic tasks are to estimate the biological variance using the replicates for each condition, and to estimate the significance of the expression differences between any two conditions.
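The normalization scheme DESeq2 uses for sequencing depth and library composition, described in the expandable box, is the median-of-ratios method. The real implementation lives in the DESeq2 R package; what follows is only a plain-Python sketch of the idea: build a pseudo-reference sample from the geometric mean of each gene across samples, then take each sample's median ratio to that reference as its size factor. Because it is a median, a few strongly differential genes (like gene D above) do not distort the factor.

```python
import math

def size_factors(samples):
    """Median-of-ratios size factors, one per sample.
    samples: list of per-sample count lists over the same genes."""
    n_genes = len(samples[0])
    # pseudo-reference: geometric mean of each gene across samples
    # (genes with a zero count in any sample are skipped, as in DESeq2)
    refs = []
    for g in range(n_genes):
        vals = [s[g] for s in samples]
        refs.append(math.exp(sum(math.log(v) for v in vals) / len(vals))
                    if all(v > 0 for v in vals) else None)
    factors = []
    for s in samples:
        ratios = sorted(s[g] / refs[g] for g in range(n_genes) if refs[g])
        mid = len(ratios) // 2
        median = ratios[mid] if len(ratios) % 2 else (ratios[mid - 1] + ratios[mid]) / 2
        factors.append(median)
    return factors
```

Dividing each sample's raw counts by its size factor gives the normalized counts table that DESeq2 outputs.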
This expression analysis is made from the read counts and, to some extent, corrects for the variability in the measurements using the replicates. So we need at least three replicates, and ideally five replicates per condition, to have good statistical power. And I am talking about biological replicates, not technical replicates; if you want to know more about the difference between technical and biological replicates, again, have a look at this expandable box. In DESeq2, multiple factors with several levels can be incorporated in the analysis to describe the different possible sources of variation: it can be the treatment, the tissue, the gender, batches, et cetera, and it helps to do the correct statistical analysis. In our datasets, we know two possible factors that can impact the expression counts: the treatment, so the depletion of the pasilla gene, but also the sequencing type, paired-end or single-end, which can really impact the number of reads mapped to the different genes. What we are most interested in is the impact of the treatment, so we will put that as the first factor. The sequencing type is information that we want to correct for, but it is not our main question. A quick comment: we recommend you to add all factors you think may affect gene expression in your experiment; it can be the sequencing type, it can be other things. Just before running the DESeq2 analysis, we recommend you to slow down, take the time, and evaluate what the different possible factors are. So now we will run DESeq2. You go back to Galaxy, search for DESeq2 in the tool box, and open DESeq2. You say that you want to select datasets per level, and then we will define the different factors and levels. The first factor will be the treatment.
We put the treatment as the first factor, and the different levels of this factor will be treated and untreated. The first level is "treated", and we select the samples that are treated; it should be three files, and it is already written in their names. Then you put "untreated" and you select the four files for that. So here we have the first factor, the treatment, with treated and untreated. Then we have a second factor, as we mentioned, which is the sequencing: we put the factor name "sequencing", the first level will be "paired", and we select the four samples with "paired" in the name; then "single", with the single-end data. The data doesn't have a header here; if you look at the files, there is no header line. The count data, the input files, are from featureCounts. We want to visualize the analysis results, and we want to output the normalized counts table, and that's all. Then you can start DESeq2. Just quickly check again... yep. As you saw, I clicked "select multiple datasets". If you have more complex or larger sample sets, we have created a tutorial that explains how to use group tags and other features to help you organize your data correctly for DESeq2. So DESeq2 will generate three outputs. The first output is the normalized counts table: it is a combination of the seven count files of the different samples into one big table, where the first column is the gene ID and then there are seven columns with the normalized counts of the different samples for each gene. The second file from DESeq2 is a plots file, with some graphics to check the quality of your analysis.
And then we have what is called the DESeq2 result file, which is a big table with some statistics for all the genes, and it is what we will use to extract the differentially expressed genes. DESeq2 is now done and, as expected, it generated three outputs. The first output that we can see in our history is the normalized counts table. If you expand it and open it, you will see a big table with the first column being the gene ID, the other columns being the different samples, and in each cell the normalized count for that sample and that gene. We will not use it directly now; we will use this file later. The second file that DESeq2 generates is the plots file, in which we have some plots to check the quality of the analysis. The first plot is what we call a PCA plot, where we can see how the variability of our data is represented when projected on a 2D graph. I will not explain in detail what a PCA is; if you have never heard about it and want to know more, I recommend you to expand this box and read about it. Here we can see the first two components of our PCA analysis, and especially the first one, which represents more than 50% of the variability of our data. This is based only on the normalized counts table; no statistical analysis has been done yet. And with this first component we can already discriminate two groups of samples: this one on the left and this one on the right. You see that all the samples on the left are the untreated, so the control samples, and on the right are the treated samples.
So what we can extract from this plot is that we can clearly see a discrimination between the treated samples and the untreated samples, just based on the normalized counts table, without any other statistics; we don't yet know which genes are involved, it is just based on the big table that we opened before. And the second axis of variability, which is represented there, seems to discriminate the paired-end datasets from the single-end datasets on the top. It means that the first factor we used, treated versus untreated, is a good way of discriminating our samples from each other: it is a good factor, and the main one differentiating our samples. And the second factor that we used, the type of sequencing, still explains a good part, almost 30 percent, of the variability of our data. So it is really good that we put it in our factors for the proper statistical analysis. The second plot that we see there is the sample-to-sample distance heatmap. It takes the normalized counts table, computes the distance between each pair of samples based on the normalized counts, and displays that as this heatmap. So it is normal that this one and this one, which are the same sample compared with itself, have a really small distance: they are identical, but that is expected for the same sample. The darker a cell is, the closer the two samples are together. And you have some sort of clustering on the top to represent that.
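The sample-to-sample distances behind that heatmap are just pairwise distances between the samples' normalized count vectors (DESeq2 additionally log-transforms the counts first, which this sketch skips). A minimal illustration, assuming Euclidean distance:

```python
import math

def distance_matrix(samples):
    """Pairwise Euclidean distances between samples, each given as its
    vector of normalized counts. The diagonal is 0: a sample compared
    with itself, the darkest cells of the heatmap."""
    n = len(samples)
    return [[math.dist(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]
```

Samples from the same condition end up with small mutual distances, which is why the treated and untreated groups cluster together in the heatmap.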
So you see that these two samples are close: this sample and this sample are really close to each other. The dendrogram, the clustering that you see on the top and on the left, is the same. You see that these ones are close to each other, and these ones too, and it is expected: here we have the treated paired-end data close together, here the untreated paired-end data together, and then you see that the treated samples are in the same group, the untreated in the same group, et cetera. So it is clear that the treatment has a big impact, as we already saw in the PCA components. I will not go into detail about the other graphs; I recommend you to check the tutorial, where there is more explanation about the other plots. Then let's have a look at the next file, which is the DESeq2 result file. If you open it, you will see a big table with the first column being the gene ID, the same as before, and then some statistics: the base mean, the log2 fold change, its standard error, the Wald test statistic, the p-value, and the adjusted p-value. For each of the genes, we can see whether it is differentially expressed or not, with which log2 fold change, and whether it is significant, using the adjusted p-value. Just be careful with the sign of the log2 fold change; I always need to check it again. The log2 fold changes are computed on the primary factor, level 1 versus level 2, so the order that you used for the levels of the first factor is really important. Because we put the treated samples first and the untreated after, the first level is the treated one. So if the value is positive, the gene is up-regulated, more expressed, in the treated samples compared to the untreated samples, and if it is negative, the gene is down-regulated, less expressed, in the treated samples.
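The sign convention is easier to remember with the formula in front of you. With "treated" as the first factor level, the reported value is log2(treated / untreated). This sketch shows only the interpretation of the sign, not DESeq2's actual shrunken estimator:

```python
import math

def log2_fold_change(mean_treated, mean_untreated):
    """log2(treated / untreated), with 'treated' as the first factor level:
    positive = up-regulated in treated, negative = down-regulated."""
    return math.log2(mean_treated / mean_untreated)
```

So a gene twice as expressed in treated samples gives +1, and a gene at a quarter of the untreated level gives -2.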
I recommend you to check this again regularly, because I always struggle with it. So based on that, you can extract which genes are up-regulated or down-regulated in the treated samples compared to the untreated samples, and also keep only the ones that are significant. There are some questions here: is this gene differentially expressed because of the treatment, and if yes, by how much? So it is FBgn0003360, this one. It seems to be down-regulated, so less expressed in the treated samples than in the untreated samples, with a log2 fold change of almost minus three. There are other questions afterwards; I recommend you to go through them. I will not do it now, but take the time and do it. The next step is to extract and annotate the differentially expressed genes. The first thing we want to do is to extract the genes with a significant p-value, meaning that the p-value, or more specifically the adjusted p-value, is below a threshold of five percent. So please follow this hands-on section and extract the most differentially expressed genes by following the different steps that are explained there. As I mentioned before, I already ran the Filter tool, so I can show you quickly: I ran it on the DESeq2 result file with the condition to extract all the rows where column 7, the adjusted p-value, is below 0.05. I executed it and renamed the output right after, by clicking there. And what we can see here, if you expand the dataset, is that we started with more than 17,000 lines, but after this filtering, keeping only the rows with a significant adjusted p-value, we have a bit more than 1,000 lines. So this is the number of genes with a significantly differential expression.
Then I ran the Filter tool again, on this file with the significant adjusted p-values. The idea was to extract the genes where the adjusted p-value is significant, but also where the log2 fold change is higher than 1 or below -1. The column with the log2 fold change is column 3, so that is what we put there, and we keep the genes where the absolute value of the log2 fold change is above 1, which means the fold change is above 2 or below 1/2. We extracted 130 lines; that is what we have now. If we expand this file, we have the same information as before, the gene ID and the statistics. But the gene IDs alone are not so informative; it would be more interesting to add extra information, for example the location of these genes, but also more readable gene names. To do that, we can use a tool called "Annotate DESeq2/DEXSeq output tables". Sorry, I don't understand why the page is reloading every time; we are in the section on extraction and annotation of the differentially expressed genes. For that, we need the GTF file that we used before, so you can import it from your previous history and then use the "Annotate DESeq2/DEXSeq output tables" tool, which will add extra information like the chromosome, start, end and strand, the feature type and the gene names. I will do it quickly, without going into too much detail, but you can follow what I am doing. I go to the histories view to see all my histories, and what I want to do is get my GTF file from there: you can do that by dragging and dropping it here, or you can import it again from Zenodo.
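The two Filter steps combined boil down to one predicate per row: adjusted p-value below 0.05 and absolute log2 fold change above 1. As a sketch (the tuple layout here is simplified to three fields, unlike the full DESeq2 table where log2FC is column 3 and the adjusted p-value column 7):

```python
def significant_genes(rows, padj_max=0.05, min_abs_lfc=1.0):
    """Mimic the two Galaxy Filter steps: rows are (gene_id, log2fc, padj)
    tuples; keep padj < 0.05 and |log2FC| > 1, i.e. more than a 2-fold
    change in either direction."""
    return [r for r in rows
            if r[2] < padj_max and abs(r[1]) > min_abs_lfc]
```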
Then you go back to Galaxy and search for the tool "Annotate DESeq2/DEXSeq output tables". The tabular output to annotate is the filtered one; the input type is DESeq2 data, and you add the reference genome annotation, which is the GTF file there. The annotation is now done. If we expand the output, we see the different columns with the gene ID and, further along, the extra information that was added: the chromosome, the position on the chromosome, the strand, and extra information about the type of feature (protein coding) and the gene names. But as we can see, we lost the header, i.e. the names of the different columns, so it would be useful to add this information back: we want to add an extra line with the names of the columns. To do that, I recommend you go to the tutorial and find the line of column names to add; you copy it from there. We will add the column names as a new dataset and then concatenate the datasets. So copy that, go to the upload button at the top, choose "Paste/Fetch data", paste what you copied, set the type to tabular, check the settings, then start it and close the window. Then we will check the output of the pasted entry, and we will use a tool called Concatenate datasets. There are two of them; I recommend you use this one. We just wait until the upload is done; the idea is then to put the pasted entry first and the annotated DESeq2 output after it. We first check the output of the pasted entry, to be sure that the 13 column names were added correctly; here you have the names, with the number of columns shown at the top.
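The "paste a header, then concatenate" trick amounts to prepending one tab-separated line to the headerless annotated table. A minimal sketch, with assumed column names matching the annotation step just described and a made-up data line:

```python
# Header line to prepend; these 13 column names are assumptions based on
# the annotated DESeq2 output described above
header = ["GeneID", "Base mean", "log2(FC)", "StdErr", "Wald-Stats",
          "P-value", "P-adj", "Chromosome", "Start", "End", "Strand",
          "Feature", "Gene name"]

# One hypothetical data line from the headerless annotated table
data_lines = [
    "geneA\t100.0\t-2.9\t0.1\t-29\t1e-50\t1e-47\tX\t100\t900\t-\tprotein_coding\tnameA",
]

# "Concatenate datasets" = header dataset first, then the data lines
table = ["\t".join(header)] + data_lines
print(len(header))              # 13 columns, as checked in the pasted entry
print(table[0].split("\t")[0])  # GeneID
```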
Then we can run the Concatenate datasets tool. In the tutorial, we recommend you rename the generated file, to keep track of what is inside; you can do that even if the job is not finished yet. Then check the output: you should have 131 lines, the first line being the column names. We now have our file with the significant genes, the ones with interesting adjusted p-values and fold changes, and we would like to visualize them. One thing we could do is plot the log2 fold changes, but what we usually want is a heatmap of the expression across the different samples. So that is what we will try to do now: visualize the expression of the differentially expressed genes. We will first extract and plot the normalized counts for these genes for each sample using a heatmap, and then compute, extract and plot the Z-scores of the normalized counts. This is just a quick, simple visualization; if you want to go more in depth, I recommend the two tutorials linked there: "Visualization of RNA-Seq results with heatmap2", which goes further than what we will do now, and the one on volcano plots. Now, visualization of the normalized counts. The first thing to do is to extract from the normalized counts the lines for the 131 genes that are interesting for us. So we need to join the normalized counts with the genes with significant adjusted p-value and fold change, keeping just the rows that overlap on column 1, i.e. that have the same gene ID. Then we keep only columns 1 to 8, which are delimited by a tab, and we rename the result "Normalized counts for the most differentially expressed genes". We will then have a table of 131 lines, and from that we will be able to plot the heatmap. So I will do that now.
So: Join two Datasets side by side. The first dataset is the normalized count table, using column 1, where the gene IDs are; the second dataset to join with is the genes with significant adjusted p-value, also on column 1. Keep lines of the first input that do not join: no. Keep lines of the first input that are incomplete: no. Fill empty columns: no. Keep the header: yes. (I think I need to check that again; feel free to verify against the tutorial.) Then we can execute it. The second step is to Cut, to keep only the columns with the normalized counts, which are at the beginning. To explain: the join really joins the two datasets, so it appends the columns from the second file, the statistics for these genes, next to the normalized counts; we then need to cut away the columns that were added. There is a tool for that, called "Cut columns from a table": you say you want to cut and keep only columns c1 to c8 of the joined dataset, and you execute it. Just to check what we did: here you can view the output of the join. You see that you have the gene ID column, the 131 lines, the sample names on top, and then the statistics that were appended; with the Cut we keep only the first eight columns. Once the Cut is done, we can rename the output "Normalized counts for the most differentially expressed genes" and save it; if you expand it, you see that you now have only the eight interesting columns. The next step is to create a heatmap of that. The tool is heatmap2; you can search for it and then give heatmap2 the normalized counts for the most differentially expressed genes.
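The join-then-cut sequence corresponds to an inner join on the gene ID followed by a column selection. A pandas sketch, with hypothetical gene IDs, sample names and values:

```python
import pandas as pd

# Normalized counts: gene IDs plus one column per sample
# (sample names and values here are hypothetical)
counts = pd.DataFrame({
    "GeneID": ["geneA", "geneB", "geneC"],
    "treated1": [10.0, 5.0, 3.0],
    "treated2": [12.0, 6.0, 2.0],
})

# Genes with significant adjusted p-value (only the IDs matter here)
significant = pd.DataFrame({
    "GeneID": ["geneA", "geneC"],
    "log2FC": [2.5, -1.8],
})

# Join on column 1 (the gene ID); an inner join keeps only the rows
# present in both tables, like Galaxy's "Join two Datasets"
joined = counts.merge(significant, on="GeneID", how="inner")

# Cut: keep only the count columns, dropping the joined statistics
subset = joined[counts.columns]
print(list(subset["GeneID"]))  # ['geneA', 'geneC']
```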
What we want is to plot a heatmap of the normalized counts. For the data transformation, we want to plot the log2 of the values, so: transform my data, log2. We want to enable data clustering, we want to label the columns but not the rows, so we have the sample names on top but we do not list all the gene names, and we want the coloring to be blue for values below zero, white at zero and red at the top. Then we can execute it. It will create a heatmap similar to this one; check later that yours looks similar, and once it is done, please answer the question that is there. While the heatmap is running, we also want to compute the Z-scores, because that is something we often see in publications. The Z-score gives the number of standard deviations that a value is away from the mean of all the values in the same group. It is a good way to normalize the data and make the expression levels more comparable between the different genes. To save time in this tutorial, we will compute the Z-scores directly on the normalized counts that we extracted for the most differentially expressed genes, but normally you would do it on all genes, i.e. on the normalized count file from DESeq2, and then filter for the interesting genes; I recommend you do it that way. To compute a Z-score we need two steps: first, from each value, so from each cell of the normalized count table, we subtract the mean value of the whole row; then we divide these computed values by the standard deviation of the same row.
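The two steps implement the usual row-wise Z-score. Here is a pandas sketch of the same computation, with made-up counts and hypothetical sample names:

```python
import pandas as pd

# Normalized counts for a few genes (all names and values hypothetical)
counts = pd.DataFrame(
    {"treated1": [10.0, 4.0], "treated2": [12.0, 6.0],
     "untreated1": [2.0, 1.0], "untreated2": [0.0, 1.0]},
    index=["geneA", "geneB"],
)

# Step 1: subtract the row mean from every cell
centered = counts.sub(counts.mean(axis=1), axis=0)

# Step 2: divide each cell by the standard deviation of its row
zscore = centered.div(counts.std(axis=1), axis=0)

# Every row of the Z-score table is now centered on zero
print(bool((zscore.mean(axis=1).abs() < 1e-9).all()))  # True
```

This is the same arithmetic the two Table Compute steps below perform inside Galaxy.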
So we need these two steps. First the subtraction: we compute the mean per row (the "1" in the expression means per row), and then we subtract from each row its mean, on the normalized count table. It is a bit convoluted, but that is the way it is. Then we do the same with a division: this time we have two tables, the first being the normalized counts and the second the output of the first computation, and we divide table 2, the one we just computed, by the row standard deviations of the normalized counts. Same idea as before: in these expressions, "sub" is for subtract, "div" is for divide, "std" is for standard deviation and "mean" is for mean. So let's do it. We search for Table Compute ("computes operations on table data"). For the first step we have a single table, so we select "Single table", pick the normalized count table, and perform what is called a full table operation with a custom expression. I recommend you copy the expression given in the tutorial; we added an explanation next to it, so feel free to go back and check it, and then execute. Then we run a second Table Compute, this time on multiple tables, where table 1 is the normalized counts for the differentially expressed genes and table 2 is the output of the first Table Compute. The custom expression is the one I mentioned before: dividing table 2 by the row standard deviations of table 1. You can execute it, and while it is running you can also rename the output here,
and save it. Once it is running or done, I recommend you look at the questions: what is the range of the Z-scores, what can we say about the Z-scores of the differentially expressed genes, and can we use the Z-score to estimate the strength of the differential expression? The two Table Compute jobs are now done, so if you expand the Z-score dataset you should have a file like this, with your eight columns: the gene IDs and the values. You have the questions I mentioned, and you have the Z-scores there. Now we would like to plot a heatmap of the Z-scores; that is what is asked in the tutorial, to plot a similar heatmap. So we use the Z-score table, plot the data as is (no transformation this time), enable data clustering, label the columns but not the rows, and use the same coloring as before, blue through white to red. While that heatmap is running, you can check the first heatmap; you should already have done that, but I want to display it. Here you should have a heatmap similar to the one in the tutorial; let me just open it. On this one, computed on the normalized count table, you see that the treated samples cluster together and the untreated samples cluster together. We expected that, if you remember the plots in the DESeq2 output. And if we look now at the Z-score heatmap, the last one, you see the same thing: the treated samples are together and the untreated samples are together, and you see a clear distinction between groups of genes,
and there are these groups: one group of genes is highly expressed in the treated samples compared to the untreated, and it is the reverse for the other group, so you have a clear separation into two groups of genes. It would be interesting to look at these genes: why do they cluster together, do they have similar expression patterns, and so on. To investigate that, you could look at these particular genes one by one, gathering more information about them from different databases. But another way to help with this is what we will do now: a functional enrichment analysis, to find out whether they come from similar categories or belong to similar biological functions. So, looking at this file, we see two groups of genes, as I mentioned, and the question is: how are these genes related to each other, do they perform similar functions, why do they cluster together? Checking all these genes manually is one option, but a functional enrichment analysis is a much more practical one, and it is what we will do now. (Sorry, the page reloaded, so I need to go back to the tutorial again.) There are different types of analysis we can do, and the first one is called a Gene Ontology (GO) analysis. You can open the Gene Ontology link there; it redirects to the GO resource, which is a comprehensive computational model of biological systems, from the molecular level up to the organism level, across the different species of the tree of life. The idea is to define groups of genes that are similar
across the different organisms. We can apply this to our differentially expressed genes, to see whether they group together, whether they come from the same GO categories or not. There are different methods for doing this, but the standard methods can give biased results for RNA-seq data, because differential expression is detected more easily for long genes or long transcripts. So we will use a tool called goseq, which provides a method that takes the length of the genes into account. We will apply it first on GO terms (Gene Ontology terms), but it can also be used on KEGG pathways, as we will see further on. For goseq we need two types of files. How do we know that? If you search for the goseq tool and open it, you see that it needs two files. The first is the differentially expressed genes file, which, as explained there, is a tabular file whose first column is the gene IDs and whose second column is TRUE or FALSE. If you scroll down, you have more details about the inputs; the first column should be the gene IDs, and as you see, the gene IDs should be uppercase. So we need to create this type of file, with the gene ID and TRUE or FALSE depending on whether the gene is differentially expressed. The second file we need is a gene length file, with the gene ID and the length of the gene; we already computed that when we ran featureCounts, if you remember. So the first step is to create the first file. For that, we take our DESeq2 result file, where we have the gene IDs and the statistics; the information we want, in particular, is whether a gene is differentially expressed, i.e. whether its adjusted p-value is significant, and we want to add to this file an extra column that says TRUE or
FALSE, depending on whether the gene is differentially expressed or not. This TRUE or FALSE is called a boolean in computer-science language, and we want to transform the information "is column 7 below 0.05?" into a boolean; it is exactly what is described there, and we will then do some formatting. The first step is to search for the Compute tool, "Compute an expression on every row". The expression we want is whether column 7 is below 0.05, returned as a boolean, and we run that on the DESeq2 result file and execute it. Then, since we do not want to keep all the DESeq2 statistics in our file, we cut it down to just the gene ID and the last column that Compute added, so c1 and c8. Once that is done, if you look at the gene IDs, they are not all uppercase; you can see lowercase letters here, and we know that goseq needs uppercase. So we use a tool called "Change Case of selected columns" to convert the first column to uppercase: on the cut output, we select column c1 and change the case to uppercase. I run that, and once it is done I want to rename the resulting file "Gene IDs and differential expression", so I rename it here. Let me first check what each step did. Compute reports that it created a column 8 with the boolean expression; if we open the output, we see the gene IDs and the information we got from DESeq2, plus an extra column, which is TRUE here. You need to scroll down quite a lot to see the point where it changes from TRUE to FALSE, when the adjusted p-value crosses 0.05; it should be soon. Here you see that we are
below 0.05, so it is still TRUE, and once we are above 0.05 it is FALSE, meaning that from this gene on downwards the genes are not differentially expressed. One thing to notice is that the number of lines here is not the full number: some lines were skipped because they had no values anymore. From a certain line on, there was no information about whether the genes are differentially expressed, so the Compute tool removed those lines; the Cut then kept only the two columns, and we end up with a correctly formatted file with the gene IDs in uppercase and TRUE or FALSE. Now we also need the gene lengths. We already created that file earlier with featureCounts, so go back to your histories and drag and drop the file into the new history. As before, the gene IDs are not uppercase, so we apply Change Case again to convert the first column to uppercase. And we want to rename it; I do not remember the exact name (ah, right, the page changed, so we need to reload the tutorial; we are working on fixing that reloading issue, it should be better pretty soon): "Gene IDs and length". So we rename it accordingly, and when that is done we can run goseq. We will give it the gene IDs with differential expression status and the gene IDs with gene lengths. Just a quick thing: we do not need the intermediate datasets anymore, so feel free to delete them, so they do not pollute your history. So, for "Differentially expressed genes file" we use the gene ID plus TRUE/FALSE file, and for "Gene lengths file" the gene ID and length file we just created. You want to
fetch the GO categories. You do that for the dm6 genome, the gene IDs are Ensembl gene IDs, and we want to extract the GO terms for the three ontologies: cellular component, biological process and molecular function. You can have a look at the different method options and advanced options, but what we want in the output now is to plot the top GO terms and to extract the differentially expressed genes for the different categories, so that when we find an interesting GO term we know which genes belong to it. And we can execute that. It will create three files. The first file we will look at is the ranked category list. It contains statistics: you have the category, the over-represented p-value, the under-represented p-value, the number of differentially expressed genes in the category, the total number of genes in the category, and the term. To explain the columns: here we have the GO term category; the p-value telling whether this category is over-represented among the differentially expressed genes is there; the p-value telling whether the differentially expressed genes are under-represented in this category, compared to all the genes, is there; and from the term we know, for example, that this GO term is related to the extracellular region. Once you have the GO terms from this file, you can look a category up: you can search for the GO term on the GO website to get more details, like indirect annotations, the list of genes for "extracellular region", and so on. That is something you can do once you have extracted some interesting GO terms. But the first step, and it is what is asked in the tutorial, is to find how many GO terms are over-represented with a p-value below 0.05, and how many are under-represented. So,
to do that, you can use the Filter tool, which keeps only the matching rows; we already used it before. (Where is it... the full name can be a bit cumbersome to find; searching for "filter data" may help. This one? No, this one is Concatenate. Here is the Filter tool.) Since we already ran it, we can simply re-run it, but on the goseq ranked list: we want the rows where the second column, the over-represented p-value, is below 0.05. That is what we ask there, and you can do the same on the under-represented p-value to get that side as well. There is a question here: you will find that you have 31 GO terms, about 0.27 percent of the GO terms, that are over-represented, and 83 that are under-represented. For the over-represented ones, you can group the data to find out whether they come mostly from molecular function (MF), cellular component (CC) or biological process (BP); for that, you can use the Group tool on this data, and here you see what you get. The second file you get from goseq is a graph with the top 10 over-represented GO terms. If you open this file (sorry, I sometimes have issues with the Galaxy interface loading the PDF, so I need to reload my interface), you see a graph with the top 10 over-represented categories for CC (cellular component), BP (biological process) and MF (molecular function), together with the percentage of differentially expressed genes in each category. So you learn, for example, that for the extracellular region, more than 20 percent of the genes in these
categories are differentially expressed, and similarly for the other top categories, such as some metabolic processes. So it can be a good way to identify differentially expressed genes of interest. Let me check whether we had a question; I go back to the tutorial, to the functional enrichment analysis section. Yes, there is a question: what is the x-axis and how is it computed? It is the percentage of differentially expressed genes in each category. The third file is a table with the differentially expressed genes associated with the GO terms. What you can see in this file is, for each category, which differentially expressed genes it contains. This is useful if you are interested in a particular category and want to plot, say, the Z-score heatmap for its genes; this file helps you identify interesting genes to plot. If you want to go further with this type of analysis, we have another tutorial, called "RNA-seq genes to pathways", where you will learn about other types of gene set enrichment analysis. The last thing we want to cover in the functional enrichment analysis is the KEGG pathways. The KEGG pathway database is a collection of pathways representing the current knowledge of molecular interaction, reaction and relation networks. It is a good way to represent your different components, genes, proteins and RNAs, in a general way. For example, you have this pathway called dme00010, which represents glycolysis, where you see the different components of the pathway. So we can do a KEGG enrichment analysis to identify pathways containing many differentially expressed
genes. For that, we can use exactly the same tool, goseq. What we do here is simply re-run goseq, but this time we say that we do not want to fetch the GO categories but the KEGG pathways, and we do not want the top-10 plot, but we do want to extract the differentially expressed genes for the different categories. You execute it, and it will take some time, as before. While the KEGG analysis is running: the statistics will be similar to the ones for the GO terms, so you will get a big table where you can find how many KEGG pathways were identified, how many are over-represented or under-represented, what the over-represented KEGG pathways are, and so on. I recommend you work through the corresponding questions afterwards to get more information. Similarly, you will get a table with the KEGG pathway IDs and the differentially expressed genes associated with each pathway. But there is one more nice thing: you can also represent your differentially expressed genes on a pathway map, in a visual way. That means we would like to see this type of pathway diagram and see which parts of the pathway are differentially expressed. To do that, there is a tool called Pathview that generates exactly this kind of output. What we need to do is extract the interesting columns: Pathview needs the genes with a significant adjusted p-value together with their log2 fold change, because that is what is shown on the map, the log2 fold change of the differentially expressed components of the pathway. So we extract, from the genes with significant adjusted p-value, the gene IDs and their log2 fold change, and we then need to say which pathways we want to display. So
we can pick two of them here, the two pathway IDs given in the tutorial, and then use Pathview to display them. Just quickly, because goseq has finished: we have 127 lines, which means 126 pathways were identified, and similarly to the GO terms we have the categories, the p-values and so on. You can inspect them, and also inspect the KEGG pathway tables with the differentially expressed genes. But what we want to do now is create the Pathview input, so we first need to create the file with the gene IDs and the fold changes for the genes with a significant adjusted p-value. For that we use Cut columns: we want c1 and c3, delimited by a tab, on the genes with significant adjusted p-value; and I think it was this dataset, yes, only the one filtered on the significant p-value. That will create the file, and we rename it right afterwards. Because Pathview can take some time, we will just run it quickly on one single pathway now; you could create a new tabular file with all the pathway IDs you want to display, but here, to be faster, we use only one. So search for Pathview. We want to display only one pathway, and the KEGG pathway ID is this one, the one we would like to view. The species: I first looked for Drosophila or dm6, but sorry, it is "fly" that we need to put here. We have a gene data file (it has been a while since I used this input): we use the genes with significant adjusted p-value and fold change; does this file have a header: no; the gene IDs are Ensembl gene IDs. For the output, we want the native KEGG plot, on the same layer: yes. Pathview will create one file per pathway, with a plot similar to this one. Pathview is already done, so if you open the generated file you will have a representation of
your pathway, where you see the different components of the pathway and, for the differentially expressed genes, the log2 fold change displayed with this range of colors. So it is a good way to represent an interesting pathway, but you need to do it pathway by pathway, or create a file with several pathway IDs, in which case it will generate one output file per pathway. And with that, I think we have covered everything. In this tutorial we took real RNA-seq data and ran several steps on it: we did quality control, mapping, and an estimation of the strandedness, which is needed for counting; we then ran a differential expression analysis with DESeq2, using all seven samples (the first steps we had done on only two samples); out of the differentially expressed genes, we extracted some, annotated them, and visualized them using heatmaps of the Z-scores we computed; and we finished with a functional enrichment analysis, extracting GO terms and KEGG pathways and visualizing the KEGG pathways with Pathview. The whole approach we ran in this tutorial is summarized by this schema, and it is more or less a standard approach that you can apply to any RNA-seq dataset; I recommend you go through it step by step the first time on your own data. If you want more information about the different tools, we put the references there, so you can have a look at the useful literature for all the tools we used. We would also ask you to fill in the feedback form embedded at the end of the tutorial, telling us whether you liked the tutorial and how we could improve it. And if you use this tutorial on your own and want to cite it, you can use the citation details given there. With that: first, congratulations on successfully running the whole tutorial! Here we list some tutorials you can follow afterwards to get more information
about RNA-seq analysis. Thanks a lot for following this long tutorial!