Hello, my name is Lucille Delisle. I'm a postdoc in bioinformatics in Denis Duboule's lab at EPFL in Switzerland, and today I will guide you through the tutorial entitled reference-based RNA-seq analysis. During this tutorial, we are going to ask ourselves the following questions: What are the steps to process RNA-seq data? How to identify differentially expressed genes across multiple experimental conditions? And what are the biological functions impacted by the differential expression of genes? After this tutorial, you will be able to check a sequence quality report generated by FastQC for RNA-seq data, explain the principle and specificity of mapping RNA-seq data to a eukaryotic reference genome, select and run a state-of-the-art mapping tool for RNA-seq data, evaluate the quality of mapping results, describe the process to estimate library strandness, estimate the number of reads per gene, explain the count normalization to perform before sample comparison, construct and run a differential gene expression analysis, analyze the DESeq2 output to identify, annotate and visualize differentially expressed genes, perform a gene ontology enrichment analysis, and finally perform and visualize an enrichment analysis for KEGG pathways. During this tutorial, we are going to analyze a public dataset. RNA-seq is a technique that allows us to sequence the different RNAs that are present in cells, and it's widely used. Usually people use RNA-seq to describe the effect of a treatment, and this description can be either qualitative or quantitative. Here we are going to focus on the changes in gene expression between untreated and treated samples. As I said, it's based on public data, and the study that we are going to use to get the datasets is Brooks et al.
In this study, the authors used Drosophila cells and treated some of them with RNA interference, a process to deplete the expression of one gene. In this case, they depleted the expression of the pasilla gene. In this study we have seven datasets in total: three treated samples, so samples that received the treatment and the depletion of the pasilla gene, and four control datasets which are untreated. The authors used both paired-end sequencing and single-end sequencing. Before following this tutorial, I highly suggest that you follow the introduction to Galaxy analyses tutorials and the sequence analysis tutorials on quality control and mapping, as I will not spend much time on these aspects. To follow this tutorial, I suggest you go to training.galaxyproject.org. Here are all the Galaxy training materials. If you go to the section Transcriptomics, then to Reference-based RNA-seq data analysis, this is the tutorial that we are going to do together. Here are the questions and objectives that I already described. Here you have the links for the tutorials that I recommend you do before performing this tutorial. And here is the introduction. As I said, there are seven datasets. Here are the public links published by the authors where the data are available. In order to make things easier, we already extracted the FASTQ files, which are the raw sequencing data, onto Zenodo, to enable you to more easily get the data into Galaxy. So as I said, there are seven samples, but we are going to process only two samples as a demo. If you feel brave enough, you will be able to process the other five; if not, we will provide the results for the whole seven samples so we can work on the second part. For the two samples, we will go through the quality control, the mapping, and counting the number of reads per annotated gene. So let's start. We need to go to a Galaxy instance.
You can see the list of the Galaxy instances that support this tutorial on the training material. They are listed here, and you can click and check the one you want. I will demonstrate today on usegalaxy.eu, but you can choose the one you want. Make sure you're logged in: you just need to check that here it's written 'User' and not 'Login'. If it's the first time you're using Galaxy, you also need to confirm your subscription using the link in the email that you just received. You can also see the tutorial within Galaxy: I'm clicking on this little hat, and this is a mirror of the Galaxy training website. There again you can go to Transcriptomics and then Reference-based RNA-seq data analysis, and you will see exactly the same content. We are now going to do the data upload. As I said, there are two samples that we are going to pre-process instead of the seven that are available. The first step when you start a new analysis is to create a new history. Here I have a blank history, but if you don't have one, please click on this plus icon to create a new history. The first reflex that you should adopt is to rename it. Let's rename it 'Galaxy training network, reference-based RNA-seq', and this is part one, where we do the pre-processing. I click on save, and now I can see that the name of my history has been updated. First we need to get the FASTQ files into Galaxy. One easy way is just to copy the four URLs that correspond to dataset 77, which is one of the samples, and dataset 80. You can see that each time we have two FASTQs, corresponding to the forward and the reverse reads, as they are paired-end. As I said, among these seven datasets, some are paired-end, others are single-end. Together we will do the pre-processing for the paired-end ones, but then you will be able to do both the paired-end and the single-end.
So let's just copy these four lines. You can either use the keyboard to copy or just this button, and then you click outside of the tutorial window, so you're back in Galaxy, and then you go to the data upload. Within this new window, on the bottom you have different buttons, and you need to click on 'Paste/Fetch data'. Here you have a box where you can simply paste the four URLs, and then you just need to click on start. The status is 100%, so you can leave this window either by clicking on close or clicking outside. You can see that you now have your four first datasets on the right, in the history panel. But they are gray; that means that they are waiting in the queue to be launched on the Galaxy server. I will pause the video and wait until they become yellow and then green. Now they are yellow, and after the download they will become green, so I just take a small break in the video recording. The four datasets are green, which means that the upload is finished and I will be able to process them within Galaxy. We decided to tidy these datasets into a paired collection. This enables you to run the steps together on the two datasets. To create a list of paired datasets, you need to click on the checkbox which is on the top left of the four datasets. Once you have clicked on it, you can either select all on the right or check each of them individually. Now that they are all selected, you click on 'All 4 selected' and you choose 'Build list of dataset pairs'. This brings you to a new window which enables you to pair the forward and the reverse reads together. Because the FASTQs were named exactly as expected, it automatically paired the R1 with the R2. Here, what we are going to do is rename the datasets so they carry their metadata. The 77 corresponds to an untreated sample, so we will use underscore untreated and then underscore paired.
It is very useful to add this information directly on the samples, because then we will be able to retrieve it later. I do the same for the 80: the 80 is a treated, paired sample. Once you have renamed your datasets, you give a name to this collection. You can see the collection as a directory if you want. We will call it '2 pairs fastqs'. Once it's done, we just click 'Create collection'. Now, instead of having four datasets, we have a single entry that is in fact a list with two pairs. To leave the checkbox mode, we just re-click on the checkbox. You can see on the right that you now have one active dataset, which is your list of pairs, and then you have four hidden datasets. You may ask, how do I know how to rename the datasets? Everything is written in the tutorial. If you miss one step, don't forget to go here in the training material. You can see here: create a paired collection named '2 pairs fastqs', rename your pairs with the sample name followed by the attributes. During the tutorial, you may have some questions, and you have the solutions. Some of them I will do with you, but some you can do by yourself. Now we are going to do the quality control. During sequencing, it's possible that some errors occur, and usually it's mainly at the end of the reads. So what we will do is check the quality of the sequencing, and for this, we use a tool called FastQC, for fast quality control. Unfortunately, FastQC does not deal well with lists of pairs. So, just to run FastQC, we will need to change the configuration of our dataset. That means that instead of having a list of pairs with two pairs inside, named 77 and 80, each with forward and reverse inside, we will flatten it to create a simple list with only one level and four datasets: the forward and reverse for 77, and the forward and reverse for 80.
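Conceptually, flattening a list of pairs is just collapsing two levels of nesting into one, joining the element names so no metadata is lost. A minimal Python sketch (the sample names and dataset labels below are illustrative, following this tutorial's naming scheme):

```python
# A list of pairs: two samples, each holding a forward and a reverse FASTQ.
# Sample names follow the renaming done in this tutorial; dataset labels are invented.
paired_collection = {
    "GSM461177_untreated_paired": {"forward": "fastq_77_R1", "reverse": "fastq_77_R2"},
    "GSM461180_treated_paired":   {"forward": "fastq_80_R1", "reverse": "fastq_80_R2"},
}

# 'Flatten collection' turns this into a single flat list of four datasets,
# with the nested names joined into one identifier per dataset.
flat = {f"{sample}_{mate}": dataset
        for sample, pair in paired_collection.items()
        for mate, dataset in pair.items()}

print(sorted(flat))
```

The flat list is what FastQC can iterate over: one job per FASTQ file.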
This is achieved by the Galaxy tool called 'Flatten collection'. You can either click here directly in the tutorial material, which will bring you to Flatten collection, or you can use the search box on the left where it's written 'search tools'. You can look for 'flatten' and you have here 'Flatten collection'. Once you have the tool in the middle panel, you check the input, which should be '2 pairs fastqs', and then you just run the tool. There are two places where you can click on 'Run tool': one on top and one at the bottom; they do the same thing. This is almost instantaneous: we now have a list of four datasets. If I click on it, I can see 77 untreated paired forward and reverse, and 80 treated paired forward and reverse. If I go back, I can show you the difference: in '2 pairs fastqs' I have untreated paired and treated paired, and within each I have forward and reverse. Okay, now we are ready to run FastQC. Again, either we look here for FastQC and click, or we go to the training material and click on FastQC; this gives the same result. Remember, if you want to leave the tutorial panel, just click outside. We will run FastQC with the default parameters. The first parameter is the raw reads, and we need to provide the flattened collection. By default it proposes to select a single dataset. We don't have multiple datasets, but we have a dataset collection, so we click on 'Dataset collection'. Here we have the choice between the flattened collection or the list of pairs, and we will use the flattened one. As we don't change the other parameters, we can just use the top 'Run tool' button. This launches FastQC in parallel on the four FASTQs, and it will generate a webpage for each of the four, as well as raw statistics for each of the four. As previously, they are gray, meaning that they are waiting in the queue before being run.
When they are yellow, it means that they are running, and when it's green, it means that it's done and you can check the result. So again, I pause my video and come back when it's green. You can see now that three are finished and one is running, which allows us to check the ones that are finished. To check one, I go to the one that I'm interested in, for example the 77 forward, and then I click on the eye icon. Here I can see the FastQC report. In order to see it larger, I can collapse the panel on the left: at the bottom left, click on the arrow. What I can see is that I have basic statistics on the name, the sequence length, the percentage of GC. Then I have different panels that correspond to the sections listed on the left, with icons saying whether each passed the filters or raised a warning. We can see, for example, per base sequence quality: across all reads, what is the quality at the first base pair, the second, the third, et cetera, until the end of the read. I also collapse the panel on the right to see it properly. Then we have different other statistics, like the per sequence quality scores and the per base sequence content, that is, at each position, for example the first position, how many A, T, C, G do I have? I can see that I have much more C and G at the start than A and T. We seem to have a bias, which is expected for RNA-seq and comes from the library preparation. Then we have the GC content: the theoretical distribution versus what we actually have. In RNA-seq, it's common to have a GC bias. But this is just for one sample, and if we want to see it for each of the four, we need to scroll through all four reports. Actually, there is a tool called MultiQC that allows you to compact different reports, coming from different tools, into a single report. So I will just re-expand the tool bar on the left.
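To make the per base sequence quality plot concrete: FASTQ quality lines encode Phred scores as printable characters (ASCII code minus 33 in the standard Sanger/Illumina 1.8+ encoding), and FastQC averages those scores column by column. A small illustrative sketch of that idea (not FastQC's actual code; the quality strings are made up):

```python
def phred_scores(quality_string):
    """Decode a FASTQ quality line into Phred scores (Phred+33 encoding)."""
    return [ord(c) - 33 for c in quality_string]

def mean_quality_per_position(quality_strings):
    """Average Phred score at each read position across many reads --
    the curve drawn in FastQC's 'Per base sequence quality' panel.
    Assumes all reads have the same length, as in this dataset."""
    columns = zip(*(phred_scores(q) for q in quality_strings))
    return [sum(col) / len(col) for col in columns]

# Invented example: three 8-base reads whose quality drops toward the 3' end.
quals = ["IIIIIHHH", "IIIHHGG#", "IIIIIHH#"]
print(mean_quality_per_position(quals))
```

A score of 20 means a 1% chance the base call is wrong, 30 means 0.1%, which is why the plots use roughly 20 and 28 as the red/yellow/green boundaries.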
I click on the little arrow, and on the right I also re-expand the history panel and come back to my history by clicking on the top. I will now run MultiQC. I look for MultiQC, and there is a single result: MultiQC. The first parameter is: which tool was used to generate the logs? I don't know if you remember, but we used FastQC. You can either look for FastQC in the list, or type FastQC and then select it. Then we need to say which type of FastQC output we have, so we click on the plus. We have raw data; this is dataset number 12 in my case. Again, we need to choose between multiple datasets or dataset collections, and as written here, it is a list with four datasets, so it's a collection. We click on collection and we choose the raw data. We don't change the other parameters, so we can use the top 'Run tool' button. This will generate two new datasets: one is the webpage, something that we can look at using the little eye, and the other one is the stats, just tabular text tables with the results. Again, first it's gray, then it becomes yellow, and then it will become green. Now it's green, meaning it has finished, so I will check the webpage; we click on the eye icon. I see that there are some general statistics about each of my four FASTQs. The percentage of duplication seems quite high for the 77 forward and reverse and for the 80 forward, which is expected for RNA-seq; however, it's really low for the 80 reverse. The GC content is more or less the same, and we can see that we have 10 million sequences for 77 and 12 million sequences for 80. Then we can go through the different sections. The first one is sequence counts: as I said, 10 million, and if I hover over the bar plot, I get the detail of how many are unique reads and how many are duplicated reads; I have about 12 million for the 80. Then there is the sequence quality histogram.
This is what I told you previously: starting from the first position, what is the average quality of the first bases across all the reads of the FASTQ, and along the X axis we go to the end of the reads. You can see that the quality drops. The average was about 40 at the beginning, but then for three of the samples it goes down to about 30 on average. And here you have the 80 reverse, which has the lowest quality: at the beginning its quality is already decreased compared to the others, and at the end it really drops. This means that we will need to treat this FASTQ to get a cleaner one. Then there is a plot that shows the average sequence quality over the full read, averaged over all its bases: you count how many reads have a mean quality of 39, of 30, et cetera. You can see that for three of the samples, most of the reads are of very good quality, in the green area, a few are in the yellow, and none in the red, while for the 80 reverse, we have some reads in the red and some in the yellow; therefore, the proportion of reads in the green is lower. Then there is the per base sequence content: how many A, T, C, G we have at each position. We have a bias that is consistent across all four samples: on the first positions, as I said, we have a lot of C and G, and this is the same for all. This is a known bias of the library preparation. Then we have the GC content, where three samples are in warning, but this is just normal for RNA-seq. They also show you how many Ns you have: N means that the sequencing software was not able to determine whether a base was A, T, C, or G, so it assigned N. Here it's very good, because we have very few Ns, less than 5% at any position. For the sequence length, all the samples have sequences of length 37, and you need to keep this in mind because it is something that we will use later on. And finally, the sequence duplication levels.
We can see that we have some reads with more than 10 copies, more than 50, or more than 100 copies, and this is normal for RNA-seq. We then check the over-represented sequences, and there are none; this is where you sometimes find the adapters that were used to make the library. There is also a specific section for adapter content, and there is no adapter, so we don't need to remove adapters from the reads. Then there is a summary where, on the X axis, you have all the sections, and on the Y axis, you have the four samples, and you can see for each section whether they passed or failed. As we saw, there is a tendency for the quality to decrease at the end of the reads, and this may affect the mapping step, which is the step where we assign the reads to a region of the genome. To clean the dataset a bit, we will use Cutadapt, and we will just remove the bad-quality bases from the end of the reads. So we look for Cutadapt, the first hit. The tool asks you: is it single-end or paired-end reads? We have paired-end reads, and they are tidied into a collection, so there is a single proposition, which is your '2 pairs fastqs'. Then, if you want, you can use the Read 1 options to remove adapters, but as I said, we don't have adapters, so we won't set them; to collapse this section, I click on its title. Read 2 options, same thing: it is for removing adapters, which I don't need, so I collapse it as well. What I want to do is, in the read modification options, set a quality cutoff. That means I will remove the low-quality bases from the 3' end that fall below the cutoff, and here I arbitrarily chose 20: the software will remove the bases from the end that have a quality below 20. Then I set a filter that will discard a pair if one of its reads becomes too short.
This is because the mapping algorithm we will use cannot reliably map reads that are too short. Again, I chose something arbitrary: 20 base pairs. Once I've set all these options, I can find them here: if you go to the trimming section using Cutadapt, you can see the parameters that need to be set: paired collection, the '2 pairs fastqs', minimum length 20, quality cutoff 20. This will generate one report per sample, and if I want to aggregate them, I will again use MultiQC; so in the output selector, I need to check that I want a report, to see how many bases have been discarded. Now I can click on 'Run tool'. Again, I pause the video and come back when it's done. The Cutadapt step has finished, and we can check the results. The first result is the reads that have been trimmed. We can click and check: it's a list with two pairs; inside, we have the two datasets, and inside those, the forward and the reverse. I come back to my history. The second output is the report. If I click on it, I can see that I have two items. The first one is the 77, and if I click on the eye icon, I can see the details: the command line that has been launched and the number of reads, 10 million, and then how many base pairs were processed and how many were trimmed. For example, we can see that for read 1 we have 5 million trimmed and for read 2, 8 million. We can also check the second one, so we click on the eye icon. Remember that for the 80, the reverse was of relatively bad quality, let's say, and we can see here: read 1, 10 million base pairs trimmed, and read 2, 50 million. Here we checked the two individually, but if you have more than two, it can be useful to use MultiQC, so we will do it together. We look for the MultiQC tool, and here the parameter 'which tool was used' is Cutadapt.
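The two operations we just configured, 3' quality trimming with a cutoff of 20 and discarding pairs with a read shorter than 20 bases, can be sketched in Python. This is a simplified illustration, not Cutadapt's actual code, though it follows the running-sum idea Cutadapt documents for quality trimming:

```python
def quality_trim_3prime(seq, quals, cutoff=20):
    """Trim low-quality bases from the 3' end.
    Running-sum method: subtract the cutoff from each Phred score and cut
    where the partial sum computed from the 3' end is minimal."""
    s = 0
    min_s = 0
    cut = len(seq)                    # default: trim nothing
    for i in range(len(seq) - 1, -1, -1):
        s += quals[i] - cutoff
        if s < min_s:
            min_s = s
            cut = i
    return seq[:cut], quals[:cut]

def keep_pair(read1, read2, min_len=20):
    """Discard the whole pair if either trimmed read is shorter than min_len,
    mirroring the minimum-length filter we set in the tool form."""
    return len(read1) >= min_len and len(read2) >= min_len

# Invented 5-base read whose last two bases have quality 5: they get trimmed.
print(quality_trim_3prime("ACGTT", [40, 40, 40, 5, 5]))  # ('ACG', [40, 40, 40])
```

Filtering on the pair rather than on each read keeps the forward and reverse files synchronized, which the mapper requires.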
So we have 'Cutadapt/Trim Galore!'. Then, what is the output of Cutadapt? Again, it's a collection, and we select the report, and we just run the tool. If we want to check the status of the job, we go back to the history by clicking on top, and we can see it's gray. While the MultiQC is running, I will explain the next step. Now that we have cleaned our reads, we are going to do the mapping. Mapping means assigning the reads to a place on the genome; if you want more information, you can follow the training on mapping. Here we are doing reference-based RNA-seq; that means that we are working with an organism for which we have a reference genome available, the sequence of each chromosome, and we also have annotations available, which are the coordinates of the genes. If you remember, we are using Drosophila, so we are going to map onto the Drosophila genome. Maybe some background about RNA processing in eukaryotes. A gene is composed of different exons, which are represented here by the yellow boxes: exon 1, 2, 3. The RNA polymerase will generate a pre-mRNA containing the whole DNA sequence of the gene, from exon 1 to exon 3; then the RNA is processed to become a mature mRNA, and the introns, the regions between two exons, are excised. When we do RNA-seq, we are sequencing the mature mRNA, and we are therefore sequencing molecules that do not contain the introns. You can have reads that fully match exons, like the two red ones here: they will match the reference genome exactly. But you can also have reads that need to be split in two, because they correspond to fragments of the RNA that spanned two exons, and that means that they don't fully match the reference genome: you need to split the read into two parts to make it match the reference genome. This is the case of the two blue ones here.
Depending on the size of the exons and the sequencing length, you may even have reads that span three, four or more exons. That means that you need to use an algorithm adapted to this configuration. Indeed, if we use a regular mapping algorithm, it will be able to map the reads that fully match exons, but it will report all the other reads, those that need to be split, as unmapped. That's why we need to use a dedicated algorithm. Here, if you expand with the plus in the tutorial, you have an explanation of the different spliced mappers that are available and how they work. We can cite TopHat, which was enhanced in TopHat2; then there is HISAT2; and the one we are using is STAR. We are using it because it's very fast and still very accurate. So we will map onto the genome of Drosophila melanogaster, the fly. There are different versions of each genome, because over the years the reference genome is improved, the chromosome sequences are made more accurate, et cetera, and today we are using version 6 of the Drosophila melanogaster genome. In order to help the STAR algorithm split the reads, we will provide a GTF, which is a file that describes where the exons are. There are different versions of the GTF too, and today we are using Ensembl version 109, which was the latest available when we generated the tutorial. So what we do is import this GTF file into Galaxy. Before doing that, I will just check the result of the MultiQC that was run while I was explaining the principle. If you remember, we ran MultiQC on the Cutadapt reports, so I can check it with the eye icon. We have a summary of the percentage of base pairs that have been trimmed: you can see that we trimmed more in the 80 sample, but we knew this because its reverse read was of lower quality.
Then there is a big summary of how many pairs we kept after cleaning, because you remember that we discarded any pair with one read shorter than 20 base pairs. We can see it as raw counts, but we can also click on percentages; we see that for the 80 we discarded more reads, and this is again due to the lower quality of read 2 of the 80. Good. Now for the mapping: as I said, we are going to get the GTF from Zenodo. To do this, we copy the link with the copy icon, we click outside of the tutorial, and we click on upload data. Like the first time, for the FASTQs, we go to 'Paste/Fetch data' and paste the URL there. This time we will add some information on this file: we will tell Galaxy that it's a GTF, to be sure it assigns the correct datatype, and we will specify the genome: it's dm6. You can see I clicked to filter, because there is a huge list of genomes available; I just typed dm6 and then selected it. Once I have set the GTF datatype and the genome, I just start the upload. I click outside because it's at 100%, and I can check how it's doing: it's gray, waiting in the queue. What is good with Galaxy is that even if it's not done, I can still set up the next step. The next step is to map the reads, the cleaned ones that we obtained with Cutadapt, onto the genome. As I said, we are going to use STAR, and you can click here. The first parameter to set is: do we have single-end or paired-end reads? We have paired-end reads, tidied into a collection, so 'paired-end as collection'. Then we need to choose the input: we have multiple choices, either the Cutadapt output or the original reads, and since we cleaned the data in order to use it, we select the Cutadapt output. STAR needs an index to work: a pre-processing of the genome that lets it quickly find where to map each read. Fortunately, on usegalaxy.eu we have a built-in index for Drosophila.
So we will choose a built-in index, and we will use a genome reference without a built-in gene model, because we will provide a GTF. For the reference genome, we don't want to map on the mellifera genome that is selected by default; we want to map on dm6. Then we provide the GTF, and the GTF is indeed the dataset 39 that we just downloaded. The next parameter is the length of the genomic sequence around annotated junctions, and the ideal value is the read length minus one. Do you remember the read length? It was the same for all samples: 37. So we subtract one, and it gives 36. The next parameters are not strictly mandatory, but we will use them. STAR can count the number of reads per gene automatically while mapping, and it does not take more time, so we ask for the count per gene. Something that can be useful if you want to check the quality in specific regions is, at the very bottom, the compute coverage option: we will ask for coverage in bedGraph format. We leave everything else at default and we just run the tool. This step may take a while, so I think it's time for you to take a break, relax, stretch, and we come back when it's done. STAR has finished running, and you can see that there are a lot of outputs. The first four are coverage, an indication of how many reads fell into each region. Then there is the reads per gene, so the counting. The most important output is the BAM file, mapped.bam. Then you have the splice junctions BED file, an indication of where the junctions are, and we have the log, which gives you information on the quality of the mapping. To aggregate this log, we will use MultiQC again. I click on MultiQC; the tool was STAR, and I add the STAR output: it's the log that I would like to aggregate, so I select the collections and look for the log output of STAR on the collection. And I run. As always, I now have two datasets. As I said, the most important output is the BAM, and there is a way to look at it.
First, we can check it in Galaxy. We click on the collection, mapped.bam, and we select, for example, the first one, the 77. Then we click on the eye icon, so it displays on the left. You can see that you have what we call a header, whose lines start with an @ symbol, and then some information about the BAM: here, that the BAM is sorted by coordinates. Then you have information about all the chromosomes and their lengths, and if you go down far enough, after all the chromosomes, you have information about the programs that have been run. Finally, you have one line per read: the read ID, then a flag that tells you how the read maps, the chromosome name, the position on the chromosome, the mapping quality, some information that we call the CIGAR string, then the read sequence, the quality, and some attributes. But this is not very visual, and a way to visualize it is to use a software called IGV, which you can have locally on your computer. I have it, and I will show you how to use it. I just launch IGV, and then within Galaxy, I choose the BAM I want to check, for example the 77. I go to this icon that looks like a bar chart, I click on it, and there is 'display with IGV local'; I click on this, and it will allow me to see my data in my local IGV. If I go back to my IGV, I now have three different tracks that correspond to the BAM: one is the coverage, the second one is the junctions, and the final one shows each read individually. But we need to zoom into a region to see something. If I go back to the tutorial, with the little hat, and go to the IGV section, here is a region that I can use. I just copy it, go back to IGV, paste the region in the box on top and click on go. This brings me to chromosome 4, between 540,000 and 560,000. We have on top the coverage, so the number of reads that fall on each base.
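The flag field I just mentioned is a bitfield, so the number you see in the BAM encodes several yes/no properties at once. A small illustrative decoder (bit meanings come from the SAM specification; this is not a Galaxy tool, just a sketch):

```python
# Commonly used SAM FLAG bits, from the SAM format specification.
SAM_FLAGS = {
    0x1:   "paired",
    0x2:   "proper pair",
    0x4:   "unmapped",
    0x8:   "mate unmapped",
    0x10:  "reverse strand",
    0x20:  "mate reverse strand",
    0x40:  "first in pair",
    0x80:  "second in pair",
    0x100: "secondary alignment",
    0x400: "PCR/optical duplicate",
}

def decode_flag(flag):
    """Return the human-readable properties set in a SAM/BAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

# 99 = 0x1 + 0x2 + 0x20 + 0x40: a properly paired read, mapped forward
# (its mate on the reverse strand), first in the pair.
print(decode_flag(99))
```

Flags like 99/147 and 83/163 are what you will most often see for properly mapped pairs; IGV's blue/red coloring of reads is derived from the reverse-strand bit.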
We can see that we have sections with reads, then sections with gaps: no reads at all. We have the junction track, showing how many split reads support the junction between two parts. And then we have each individual read. Below, we have the gene annotation track, which tells you which gene we are looking at and what its structure is. We can see that we have a gene with exons on the left, and this gap corresponds to an intron. If I want to check reads individually, I need to zoom in, so I just drag. Here, each blue bar corresponds to a read mapped forward, and the red bars correspond to reads mapped in reverse. If you go close to the exons, you can see that there is a thin line indicating that the read has been split between this part and this part. You can see here that you have both forward and reverse reads supporting this splice junction, and the same here. Depending on the size of the introns, you can have long lines or short lines. Another way to look at the data in IGV is the Sashimi plot. To access it, you right-click on the track on the left and select 'Sashimi plot'. This is what you obtain: on the same region, you have the annotations below and the coverage, and if you zoom out, you can see the junctions, with a number that indicates how many reads support each junction: 20, 13, 22, et cetera. This can be useful if you're interested in splicing. Okay, let's go back to our MultiQC, which may have finished running. We zoom out and go back to the history, and once the MultiQC is done, I click on the eye icon to check it. We have information on the percentage of pairs that mapped uniquely: around 80%, which is good; everything above 70% is really good, and if you have something below, you may want to investigate what went wrong. You also have a bar plot that shows how many reads were uniquely mapped, mapped to multiple loci, mapped to too many loci, or were too short.
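In the BAM itself, those split reads are recorded through the CIGAR string: a read spanning an intron gets an 'N' (skipped reference) operation, for example '18M1000N19M' means 18 aligned bases, a 1000 bp intron, then 19 more aligned bases. A sketch of how the aligned blocks, the pieces IGV joins with a thin line, can be recovered from a CIGAR (coordinates and CIGAR below are invented for illustration):

```python
import re

def reference_blocks(start, cigar):
    """Return the genomic intervals (half-open) a read covers,
    splitting on 'N' (intron) operations in the CIGAR string."""
    blocks = []
    pos = start
    block_start = start
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=X":
            pos += length            # aligned bases consume reference
        elif op == "D":
            pos += length            # deletion consumes reference, same block
        elif op == "N":              # intron: close current block, jump over it
            blocks.append((block_start, pos))
            pos += length
            block_start = pos
        # I, S, H, P consume no reference
    blocks.append((block_start, pos))
    return blocks

print(reference_blocks(540000, "18M1000N19M"))
# -> [(540000, 540018), (541018, 541037)]: two exonic blocks around an intron
```

Each pair of consecutive blocks from such a read is one piece of evidence for a splice junction, which is what the Sashimi plot counts.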
You remember that we discarded everything that was below 20 base pairs, but maybe even a read above this 20-base-pair threshold was still too short to map, so maybe we should have increased this number. We have 5% in this case that were too short, and here we have 11%. Okay, so now we have, for each read, the information of where it mapped. But at the end, if we want to know which genes have been deregulated by the treatment, which is the RNA interference against the pasilla gene, we need to count the number of reads per gene. So I go back to the tutorial to show you, because we introduced some schemas. Here is an example with two genes: gene 1 has two exons and gene 2 has three exons. The BAM tells us where the reads fall within these exons. And then you can count either by exon, so for example exon 1 would have three reads and exon 2 would have two reads, or by gene: one, two, three, four. What we want today is to estimate the number of reads per gene, so we'll use an algorithm that counts, per gene, the number of reads. There are two main tools available: one is HTSeq-count, published in 2015, and the other one is featureCounts, in 2013. If you remember, when we did the mapping with STAR, we also ticked that we wanted to count while mapping. And in fact, the counting given by STAR is exactly the same as HTSeq-count. For most analyses, I would recommend using the STAR counting; however, if you need more customization, it's better to use featureCounts. So you might think that it's quite easy to count the number of reads. However, there is one parameter that needs to be taken into account when we estimate the number of reads: the strandedness of the library. As you know, DNA is double stranded, but RNA is single stranded, and the reverse strand would not encode the same protein.
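The per-gene counting idea from the schema can be sketched in a few lines of Python. This is purely illustrative (hypothetical coordinates, not the actual HTSeq-count or featureCounts implementation): a read is assigned to a gene if it overlaps any of that gene's exons, and reads touching more than one gene are treated as ambiguous.

```python
# Exons of two hypothetical genes, as (start, end) intervals.
genes = {
    "gene1": [(100, 200), (300, 400)],               # two exons
    "gene2": [(600, 650), (700, 750), (800, 900)],   # three exons
}

def count_reads(reads, genes):
    """Count reads per gene: a read is assigned to a gene if it overlaps
    any of that gene's exons; reads overlapping exons of more than one
    gene are ambiguous and are not counted."""
    counts = {g: 0 for g in genes}
    for start, end in reads:
        hits = [g for g, exons in genes.items()
                if any(start < e_end and end > e_start
                       for e_start, e_end in exons)]
        if len(hits) == 1:          # unambiguous assignment only
            counts[hits[0]] += 1
    return counts

reads = [(110, 160), (150, 210), (320, 370), (640, 710), (1000, 1050)]
print(count_reads(reads, genes))   # -> {'gene1': 3, 'gene2': 1}
```

The last read overlaps no exon, so it is counted as "no feature", just like in the real tools.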
It happens at some places in the genome that we have genes that overlap in different orientations. Here we show you an example where we have gene 1 on the first strand and gene 2 on the second strand, and they are convergent, meaning that they go in convergent orientations and they have a region here which is common between the two genes. That means that if we have a read that falls in between, we may not know to which gene it belongs. This is what we call an ambiguous read. However, there are different library protocols. When you give your RNA to the sequencing facility and they generate what we call the library, so something which is sequenceable, there are different protocols. Some protocols are unstranded: during the process you lose the information of whether the RNA molecule was going forward or reverse, and in this case we can't assign such a read to either gene, so it's ambiguous. However, some library preparations keep this information. And within the stranded libraries, as we call them, there are two types: either the forward stranded library, where the sequencing read keeps the orientation of the gene, or the reverse stranded library, where the sequencing read is on the opposite strand of the mRNA. And therefore, depending on whether it's a forward or a reverse library, the read will be assigned either to gene 1 or to gene 2. So here I put some schemas to explain how it would look. If we have a forward stranded library, which means that the gene is in the same orientation as the read, we can see that the reads that map in the forward orientation, depicted in red, mainly cover gene 1, and the reads that map in the reverse orientation mainly cover gene 2. And with a library like this, all the reads that fall in the overlap between gene 1 and gene 2 can be unambiguously assigned to gene 1 or to gene 2, depending on their orientation. The other type of stranded library is the reverse one.
That means that the reads that map in the forward orientation correspond to the gene which is on the reverse strand, and the reads that map in the reverse orientation correspond to the gene that is on the forward strand. Again, everything that is in the middle will be unambiguously assigned to gene 1 or gene 2. If we have an unstranded library, reads will be equally distributed between forward and reverse, and all the reads that fall into the overlap will not be assigned. So usually you can have the information on the strandedness: either you generated the data and the sequencing facility tells you which library they prepared, or if it's public data, you can usually find it, for example, in the description of the protocol or directly in the methods. They should say if it was a stranded library and which type of stranded library it was. Unfortunately, sometimes you cannot find this information, and therefore you need to determine the strandedness of the library from the data. There are four ways to estimate the strandedness from the results, and I will go through all four approaches, but in a regular analysis, when you need to analyze your own data, you don't need to do all four. A first approach is to check in IGV. As I told you, in theory we should easily see whether the forward reads match gene 1 and the reverse reads gene 2, or the contrary, or whether they are evenly distributed. Unfortunately, we are using paired-end data, and therefore it's a bit more complicated, because the orientation is decided by the first read in the pair. So I come back to my IGV, not the sashimi plot, the other one, and I check the orientation of the reads, or rather not the reads but the pairs. So I right-click and select "View as pairs". I need to zoom out a bit, and we can see that now the reads are paired together: we have the forward read and the reverse read together.
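The assignment rules for the three library types can be condensed into a tiny decision function. This is a hedged sketch of the logic only (assuming, as in the schema, gene 1 on the forward strand and gene 2 on the reverse strand), not any real counting tool:

```python
# How the library type changes read-to-gene assignment in a region where
# gene1 (forward strand) and gene2 (reverse strand) overlap.

def assign_overlap_read(read_strand, library):
    """read_strand: '+' or '-' (the strand the read maps to).
    library: 'forward', 'reverse', or 'unstranded'."""
    if library == "unstranded":
        return "ambiguous"            # cannot tell gene1 from gene2
    if library == "forward":          # read strand == gene strand
        return "gene1" if read_strand == "+" else "gene2"
    # reverse library: read strand is the opposite of the gene strand
    return "gene2" if read_strand == "+" else "gene1"

assert assign_overlap_read("+", "forward") == "gene1"
assert assign_overlap_read("+", "reverse") == "gene2"
assert assign_overlap_read("-", "unstranded") == "ambiguous"
```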
However, like this, I don't know which one is the forward and which one is the reverse. So I right-click and I group alignments by first-of-pair strand. Now I have here all the negative ones, that means all the pairs where the first read is on the negative strand, and below I have the positive ones, where the first read of the pair is on the positive strand. So here I can say that this is the forward and this is the reverse; well, here, this is the reverse and this is the forward. If we zoom out, we can see that on this gene, which is in the reverse orientation, we have roughly as many pairs on the negative strand as on the positive strand. So this looks like an unstranded library. In the tutorial, I put some screenshots of how it would look if we had a stranded library with paired-end data. We can see that this one is an unstranded library, with reads on both negative and positive. And here, this was a stranded library, and we can see that they all go into one category and there are nearly none in the other. So this is what you would see with a stranded library. This was the first approach, but it's a bit annoying if you need to do it for each sample; this can be painful. So another way is to use pyGenomeTracks, which is a software that allows you to make publication-ready figures. What we will do is use the coverage that we generated with STAR and check how it looks on the exact same gene. When we use pyGenomeTracks, we need to describe, from top to bottom, which tracks we want to see. First, I copy the region that I want to see, and then I click on the pyGenomeTracks tool. You can also look for it from the tool search. So the region to plot is chromosome 4 between 540 and 560. Then we will carefully follow what is written here: on top we will include a track, which will be a bedgraph track.
So here we have "Include tracks in your plot", we choose the style of the track, and we select bedgraph. Then we don't put a title, so the title will be the name of the dataset. And then we select dataset collection, and here you have the bedgraphs that are proposed. I propose to use the uniquely mapped strand 1 coverage and to give it a color that you will remember as the first strand. In the demo I propose blue. Let's do it: I click and I choose blue, whatever blue I want, but you can choose the color you want. Then for the minimum value we put 0, and for the height we put 3; it's in centimeters. And finally, show the visualization of the data range: yes. This way we will be able to compare the number of reads that fall on one strand and the other. Okay, then I insert a new track, and I will again use a bedgraph. I don't put a title, I select the collection, and this time I take strand 2. And I take another color, red. Similarly, I put the minimum value to 0 and the height to 3, and I decide to show the visualization of the data range. So this will allow me to see the coverage on forward and reverse. And then I will just insert the genes. So I click on insert, and here I select, oops, gene. This time I can put a title, because there is only one track; it will be displayed as "genes". I have the GTF, and I think I can leave everything by default except the height, which I put to 5 to have enough space to see. Let me check. Yeah, that's all. Okay, then I just run the tool. The advantage of using the coverage output of STAR is that STAR automatically takes into account the fact that the data is paired-end, and so it will flip the second mate to produce a coverage that fits what we sequenced. I'll just take a break while it's running. pyGenomeTracks has finished running. To check the result, I click on the eye icon. And what I see is that for both samples, the untreated and the treated, we have the strand 1 coverage, and I can see on the left that the maximum is about 2.
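In Galaxy, the tool form assembles this configuration for you; on the command line, pyGenomeTracks itself is driven by an ini file. A rough sketch of an equivalent configuration (file names are hypothetical placeholders; the exact keys may vary with the pyGenomeTracks version):

```ini
# Approximate pyGenomeTracks configuration matching the tracks
# selected in Galaxy (file names are hypothetical).
[strand 1 coverage]
file = sample_unique_str1.bedgraph
color = blue
min_value = 0
height = 3
show_data_range = true

[strand 2 coverage]
file = sample_unique_str2.bedgraph
color = red
min_value = 0
height = 3
show_data_range = true

[genes]
file = annotation.gtf
title = genes
height = 5
```

You would then run something like `pyGenomeTracks --tracks tracks.ini --region <your region> --outFileName coverage.png` to produce the same figure.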
And then I have, for the same samples, the reverse coverage, and it's also around 2. So we have as much coverage in the forward orientation as in the reverse orientation, which means this is an unstranded library. We can also check, as in IGV, that we have a transcript, Thd1, that goes in this reverse orientation and has introns, and we can see that indeed we don't have coverage in the introns. So this was the second way to estimate strandedness. Another way: maybe you found what I said a bit weird, because I told you that we need to know the strandedness to be able to count, but at the same time I told you that STAR counted. This may appear a bit strange. In fact, if you check the reads per gene output of STAR and you click on one of them, you can see that the result is a table with four columns. On the left you have the gene ID, and then you have the three different library types: STAR is counting in the three different configurations. So if it were unstranded, you would have, for example, for the first gene, 780 counts; if it were a first-strand library, you would have 420; and if it were a second-strand library, 366. Just checking like this, it appears that you have many more counts if it were unstranded, so this goes in favor of the unstranded hypothesis. And as usual, instead of looking at each report individually, we will use MultiQC to aggregate them. So we just run MultiQC, and we aggregate the output of STAR: I click STAR and then I click to insert a STAR output. If you remember, the first time we aggregated the log to check how many reads were uniquely mapped, et cetera. Now what we are going to aggregate is the gene counts. So I click on gene counts, then I click on dataset collection, and I select the "STAR on collection 26: reads per gene" collection. And I run the tool.
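This eyeball check of the three counting columns can be automated. Below is a hedged sketch (the file name and the 4x threshold are my own choices, not an official cutoff) that sums each column of a STAR `ReadsPerGene.out.tab`-style table, where column 2 holds unstranded counts, column 3 first-strand counts, column 4 second-strand counts, and the first four rows are summary lines (`N_unmapped`, `N_multimapping`, `N_noFeature`, `N_ambiguous`):

```python
def guess_strandedness(path):
    """Guess the library strandedness from a STAR reads-per-gene table
    by comparing the totals of the three counting columns."""
    totals = [0, 0, 0]
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if fields[0].startswith("N_"):
                continue   # skip the summary rows at the top
            for i in range(3):
                totals[i] += int(fields[i + 1])
    unstranded, forward, reverse = totals
    # Heuristic: one stranded column dominating suggests that library.
    if forward > 4 * reverse:
        return "forward"
    if reverse > 4 * forward:
        return "reverse"
    return "unstranded"
```

With counts like 780/420/366 for a gene, neither stranded column dominates, so the function would return "unstranded", matching what we see by eye.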
So you can see in this dataset that, as I said, for each gene we have the quantification. Let me preview it, yeah. And on top, you have some statistics like the number of unmapped reads, the number of multi-mapped reads, the number with no feature, and the number that are ambiguous. You can see that, partly because we have an unstranded library, we have some ambiguous reads: they fall into two different genes and we can't say which one it was. Either they are in the same orientation, and the library type would not change anything, or they are in different orientations, and if we had used a stranded library, we would have decreased this number. I just go back to my history; MultiQC is running, and I just pause the video. MultiQC has finished. I click on the eye icon to view the web page. What I see is a summary of what was in the table: we have the gene counts and the three possibilities, unstranded, same stranded, reverse stranded. We can see the number of reads, or, as I prefer, the percentages. In the bar plot: in blue, the reads that overlap genes; in black, the number of reads that map to no feature; and in purple, the ambiguous ones. The other two categories, multi-mapped and unmapped, do not change. So we can see that in reverse stranded mode, about 40% of reads have been attributed to genes; for the same strand, it's the same, about 40%; and if we use unstranded, we jump to 78%. So this also shows that this is an unstranded library. The last, fourth way to estimate the strandedness is to use a tool dedicated to it, which is called Infer Experiment. This tool is part of the RSeQC package. And I forgot to mention, but in the tutorial, after the mapping step, there are different additional QC steps that you can do. Let me find it back. It's here: a details section with further checks of the quality of the data.
If you click here, you will be guided through different QC steps that I'm not going to show you. But they may be interesting, especially if you find during your analysis that something is strange with one sample; it can be very useful to go through this. So, coming back, sorry, to the strandedness: the fourth way is to use, as I said, Infer Experiment. Infer Experiment requires a BED file to work. We have different formats to indicate where the exons are on the genome; we were using a GTF, and now we need a BED. There is a tool in Galaxy called "Convert GTF to BED12", and we will convert the GTF we have into BED. I run this tool. And then I will run the Infer Experiment tool, which is here. So I need to give a BAM, the file that contains the alignment information; it's in a dataset collection, so it proposes me the good one. And then the reference gene model: this is the BED12 that we are generating. Even if it's not ready yet, you can still launch the tool. Because this tool is very slow, we won't run it on the whole BAM; we just downsample the number of reads, and it proposes 200,000. I think this is also what is proposed in the tutorial, yes. Then we can set the minimum mapping quality, and we leave it at the default. Perfect, we just run the tool, and then we wait for the result. So again, I make a little pause in the recording. Infer Experiment has finished running. I click on it and I can check the reports individually. So for the 77 sample, it's indicated that it's paired-end data. Then it's written that there are 10% of reads for which it could not determine whether the read was in forward or reverse orientation relative to the gene. And then for "1++,1--,2+-,2-+", which would indicate a forward stranded library, you have 46% of reads, and for the reverse library ("1+-,1-+,2++,2--"), you have 43%.
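The way we read these two fractions can be written down as a small decision rule. This is a hedged sketch (the 0.8 threshold is my own rule of thumb, not an RSeQC default): `forward_frac` is the fraction explained by "1++,1--,2+-,2-+" and `reverse_frac` the fraction explained by "1+-,1-+,2++,2--".

```python
# Interpret the two strand fractions reported by RSeQC's
# infer_experiment.py for paired-end data.

def call_library(forward_frac, reverse_frac):
    """Call the library type from the two fractions; the 0.8 cutoff
    is an illustrative rule of thumb, not an official threshold."""
    if forward_frac > 0.8:
        return "forward stranded"
    if reverse_frac > 0.8:
        return "reverse stranded"
    return "unstranded"

# With roughly 46% forward-like and 43% reverse-like reads:
print(call_library(0.46, 0.43))   # -> unstranded
```

When neither fraction dominates, as here, the two strand configurations are balanced and the library is called unstranded.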
So this indicates that it's unstranded. And if we check the 80 sample, we have similar percentages. All four methods gave the same result: this is an unstranded library. So now we will be able to count the number of reads, and for this, you can choose your own path. You can either follow the video that explains how to use featureCounts, which, I remind you, is more customizable but needs an additional computation, or you can use the STAR output that we already have; we just need to reformat it to fit the next steps. So I let you choose one of the two next videos, either the featureCounts video or the STAR video.