So now we are back together, both the people that chose to use STAR and the people that decided to use featureCounts. I'm showing here the history with featureCounts, but the people with STAR will still be able to follow. So in the tutorial, what is suggested is to determine which gene has the most counts in the datasets. Whether we used STAR or featureCounts, we have a file whose first column is the gene ID and whose second column is the count. In order to know which gene has the highest number of reads, we can use a tool called Sort. So in the left panel, we click on Sort data, and what we want to use is the collection. If you followed the featureCounts path, you take the counts output: featureCounts on ..., Counts. If you followed STAR, you have a similar counts output. The number of header lines depends: if you used featureCounts, you have one header line; if you used STAR, just put zero. And what we want is to sort on column 2 in descending order, so the highest number will be on top. We leave everything else at the default and run the tool. I make a short break here. So the Sort tool finished; I can click here to see the result. If I select the 77 sample, it goes to the preview, and we can see that the top gene is FBgn0284245. And if we check in the second sample, it's the same one. Sometimes it can be very useful to see two datasets side by side, and for this Galaxy has a mode called the window manager. We can click on it, and then when we click on the eye icon, it opens a new window; we click on the other dataset and it opens another window. We can then move the windows so they sit next to each other, and this way we can scroll down both and compare. When you want to leave this mode, you just click on the window manager again, and you can close the windows, or reduce them if you want to see them afterwards.
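The Sort step above boils down to an ordinary descending sort on the count column. A minimal sketch in plain Python, with made-up gene IDs and counts:

```python
# A minimal sketch of what the Galaxy Sort tool does here: sort a
# two-column (gene ID, count) table by the count column, descending.
# The rows below are made up for illustration.
rows = [
    ("FBgn0000042", 125),
    ("FBgn0284245", 320),
    ("FBgn0000017", 87),
]

# featureCounts output has one header line to skip first; STAR counts
# have none -- that is why the "number of header lines" setting differs.
rows_sorted = sorted(rows, key=lambda r: r[1], reverse=True)
print(rows_sorted[0])  # the most-counted gene ends up on top
```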
OK, so if I come back to the tutorial, we have finished the first part, which was to count the number of reads per gene. We did it only on two datasets, and therefore I propose that you do it on the other datasets, for which you can find the links here. You can see that you have both paired-end and single-end reads, and here you have all the information on how to run this on the single-end reads; there are really few changes. So if you want to check that you understood the steps, you can do this. Either way, we can go on with the second part, which is based on the counts, and where we will be able to identify which genes have been regulated by Pasilla in this experiment on Drosophila cells that received the RNA interference treatment. In order to save time, we already pre-processed the seven datasets and put them on Zenodo, so we will be able to import all seven datasets from there. As I said previously, it is easier if you have one history per analysis, and I would say that this second part is like a new analysis. So first of all, we will create a new history by clicking on the plus, and we will immediately rename it: you click on the pencil next to the unnamed history, and we will call it "GTN reference-based RNA-seq part 2". Then I click on Save, and the name of my history changes. I come back to the training material, and I just copy all the links, click outside, go to Upload Data, Paste/Fetch data, I paste, then I press Start, I close, and I wait for the download to finish. So the upload has finished, and again we will group all these datasets into a collection. But this time we don't build a paired-end collection, just a simple collection that we call a list. I click on the checkbox icon, which is on the top left of the datasets, then I click Select all. For the "All 7 selected", I choose Build Dataset List. And here I see that I have my seven samples.
And for each of them, I have the featureCounts counts. I could have done the same analysis with the STAR counts, because the results are really similar, but just so we are all on the same page, we take the featureCounts counts. These labels will be used for all the plots and all the results, so it's important to relabel them the way we would like them to appear on the plots and in the results. So I will double click (or simply click) and remove the "featureCounts counts" part, so I just keep the name: the GSM ID, then the treatment, then paired-end or single-end. This is something that you don't really enjoy doing, but you need to do it. So now we have all seven, and what we do next is give a name to this collection; we can call it "all counts". I leave the checkbox mode by clicking on the checkbox icon again, and now I have one item in my history, which is a list of seven datasets, called "all counts". I come back to the tutorial, as I will do a bit of theory. So here is an example of what we could have obtained with just four genes, A, B, C, D, of variable length, and three samples, 1, 2 and 3; and here are the counts. You can imagine that if we are doing a full-length library, meaning we take the mRNA and shear the RNA before building the library, then the longer a gene is, the more reads it will get, and the shorter it is, the fewer reads it will get. Additionally, depending on how deeply you sequence a sample, you may have a bias linked to the sequencing depth. For example, you can see that between gene A and gene B, gene B is twice as long, and it has roughly twice the number of reads; in fact, this would indicate that gene A and gene B are equally expressed. And then you have also the sequencing depth itself: for example, sample 1 has only 35 counts in total, while sample 3 has many more.
So these are the two factors that we need to normalize for, and depending on the order in which we normalize for them, we generate what we call RPKM or FPKM (the difference is just paired-end versus single-read) or TPM, transcripts per million. Here you have the different explanations. For example, for the RPKM, you first divide by the sequencing depth: we generate a scaling factor by summing the number of reads that have been assigned to genes for each sample, and then we divide by this, which means that now the sum for each sample is the same. And then we divide by the length of the gene. If we divide by the length, we can see that gene A and gene B now have the same RPKM value. So the difference between FPKM, fragments per kilobase per million, and RPKM, reads per kilobase per million, is just paired-end versus single-read. The other acronym that we often see is TPM, transcripts per million. It's the same idea, but you invert the operations: first you divide by the size of the gene, so you divide by 2 kb or 4 kb, and then you divide by the sequencing depth. This gives you more the proportion of each transcript in your library. As a consequence, when you sum all the TPM values of a sample you get a constant number, while the sum of RPKM or FPKM values may differ between samples. OK, this was how you can normalize your data to generate FPKM, RPKM or TPM. However, none of these normalizations takes into account the library composition. Here is an example of what could happen if you have a gene which is expressed in only one of your samples. Let's imagine we have gene D, which is just overexpressed in sample 1 and not expressed in sample 2, while the other genes are expressed similarly. If you normalize with FPKM or TPM, you will see that all genes appear affected: A, B and C look more expressed in sample 2, and D looks less expressed in sample 2.
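The two normalizations just described can be written out in a few lines. This is a sketch of the order-of-operations difference between RPKM and TPM, with toy gene lengths and counts (not the tutorial's real numbers):

```python
# Sketch of the RPKM and TPM calculations described above, on two toy
# genes; lengths (in kb) and raw counts are illustrative only.
genes = {"A": (2.0, 10), "B": (4.0, 20)}  # name: (length in kb, count)

depth = sum(count for _, count in genes.values())  # reads assigned to genes

# RPKM: divide by the sequencing depth (per million) first, then by length
rpkm = {g: (c / (depth / 1e6)) / l for g, (l, c) in genes.items()}

# TPM: divide by gene length first, then rescale to one million in total
rate = {g: c / l for g, (l, c) in genes.items()}
tpm = {g: r * 1e6 / sum(rate.values()) for g, r in rate.items()}

print(rpkm)                # A and B end up with the same RPKM
print(sum(tpm.values()))   # the TPM column always sums to one million
```

As the two prints show, gene B (twice as long, twice the reads) gets the same normalized value as gene A, and only the TPM column has a fixed per-sample total.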
But in fact, this is just due to the library composition, and we should find as a result that only D is differentially expressed. For this, we use an algorithm called DESeq2, which is one of the gold standards that people use for differential gene expression; it normalizes for both sequencing depth and library composition. If you want the details on how DESeq2 normalizes the data, you can click here and it's explained. I will not go through it, and just tell you that DESeq2 will first normalize the data, then estimate the biological variance using the replicates, and then estimate the significance of the expression changes between two conditions. Importantly, you need at least two replicates to do this differential analysis, and three is better, or even five if possible. What is interesting is that DESeq2 allows you to compute the statistics while taking into account factors that could bias your analysis. For example, in this configuration, you remember that we have seven samples: three that are treated and four that are controls, and we have both paired-end and single-end samples. We know that the sequencing type may affect the counting, so it is good to tell the algorithm that we have these two sequencing types; DESeq2 will take this into account and correct for it when determining what is differentially expressed. OK, so first, we need to use the names that we gave to the counts to attribute some tags to our datasets; these will then be used by DESeq2. To do so, we will first extract the element identifiers. This is a tool that you can access by clicking here, or you can go there and search for "Extract element identifiers". Then you just select "all counts", and this will give you a text file with seven lines that correspond exactly to these seven labels. Let's wait a bit for it to run.
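For the curious, the normalization DESeq2 applies is often summarized as "median of ratios". This is a hedged sketch of the idea on a toy counts matrix (DESeq2's actual implementation also handles zero counts and other details):

```python
import math
from statistics import median

# A hedged sketch of DESeq2's median-of-ratios size factors, the
# normalization that handles both depth and library composition.
# Toy counts: rows are genes, columns are two samples; the last gene
# plays the role of the composition outlier (like gene D above).
counts = [
    [10, 20],
    [30, 60],
    [100, 40],
]

# 1. Per gene: the geometric mean of its counts across samples
geo_means = [math.exp(sum(math.log(c) for c in row) / len(row))
             for row in counts]

# 2. Per sample: the median of the count / geometric-mean ratios
n_samples = len(counts[0])
size_factors = [median(counts[g][j] / geo_means[g]
                       for g in range(len(counts)))
                for j in range(n_samples)]

# 3. Normalized counts: raw counts divided by the sample's size factor
normalized = [[counts[g][j] / size_factors[j] for j in range(n_samples)]
              for g in range(len(counts))]
print(size_factors)
```

Because the median ignores the outlier gene, the first two genes come out identical in both samples after normalization, while the outlier keeps its real difference.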
And then what we will do is to generate a new tabular text file where we will put the identifier followed by different columns giving the groups it belongs to. We will use a tool called Replace Text, which we can also find with the search bar: "Replace Text in entire line". We will apply it to the extracted element identifiers. Here it's a bit complicated. What we have in this text file is lines like "GSM..._treat_single". And here we will use what we call a regular expression. What we say is that the line contains something we want to capture, then an underscore, then something else we want to capture, then an underscore, and finally again something we want to capture. Dot means any character; star means as many as needed. So this will match and capture what is before the first underscore, between the underscores, and after the last underscore. Then we define the replacement. We can get the captured groups back using a backslash followed by a number (you can copy it from here): \1 will be the first group, the "GSM..." part. What we want is first to reproduce the same label, "GSM..." then treat or untreat, then single or paired, exactly as it was at the beginning. But now we put a tabulator, which is written \t, and we add another column with a group tag containing the second capture, treat or untreat. And then a third column: again \t, and a group tag with the third capture, single or paired. To be sure, the best way is just to copy this from the tutorial and paste it in. OK, and then we run the tool. I come back to my history, and when these two datasets are done, I resume the video. These steps may have been a bit obscure for some of you.
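The capture-and-replace logic just described can be sketched with Python's `re` module. The sample identifier and the `group:` tag prefix in the replacement are assumptions here; the exact replacement string to paste is the one given in the tutorial:

```python
import re

# Sketch of the Replace Text step described above. Each identifier
# looks like GSM<id>_<treatment>_<sequencing>; we capture the three
# parts and append them as extra tab-separated tag columns.
# (Hypothetical identifier; the tag prefix "group:" is an assumption.)
line = "GSM461179_treat_single"

pattern = r"(.*)_(.*)_(.*)"                   # three capture groups
replacement = r"\1_\2_\3\tgroup:\2\tgroup:\3"  # label + two tag columns

print(re.sub(pattern, replacement, line))
```

The greedy `.*` groups backtrack to the underscores, so the three captures come out as the ID, the treatment, and the sequencing type, and the original label is reproduced before the two new columns.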
So let's just check the outputs. You can see that now we have the name, which matches the identifiers of the "all counts" collection; then we have a group column where we extracted the second part (untreat, treat, treat, treat, untreat, ...); and then a column with single, paired, et cetera. OK, great. Now we can go back and check the datatype of this dataset. It's txt, and I would like to have a tabular. To change it, I just click on the pencil, then go to Datatypes, set the new type to tabular, select tabular, and click Save. Now you can see that it's a tabular, and if I display it, the columns are properly separated and aligned. Right, so now we can use the tool to assign tags: it's "Tag elements". We will tag the elements of "all counts", using the Replace Text output, and we run this tool. Oh, OK, so I need to wait until the dataset has been accepted as a tabular; I guess it's a bit slow, so I just pause the video. It's done, I can run the tool. And now I have a new collection, and you can see that the elements have tags corresponding to their names. These tags can be used in the next step. So we will now run DESeq2, an algorithm that identifies the genes that are significantly differentially expressed between two or more conditions. We click on the tool, which you can also find in the tool search, and we select "group tags" to define the levels: this is why we added these tags. For the count file collection, we need to choose the collection that has the tags. Then we specify the factors, that is, the differences between the groups, and I make sure it matches exactly what's here. The factor name, we will call it "Treatment"; this is the factor name. And then we have the levels, the different values for Treatment: the first one will be "treated".
And for this level, we will select the group tag "treat". So I click here, and you can see the tags that are proposed, so treat. Then I insert another factor level, another value for Treatment: I click and I type "untreated", and I select the group tag "untreat". OK, so this is the main factor, the factor for which we would like to know the differences. Then we specify to DESeq2 that we have another factor that may introduce a bias, and this factor is the sequencing. So I just type it: "Sequencing". For the sequencing, we have two values: one is "paired", for which we select the paired tag, and the other is "single", for which we select the single tag. OK, I just check; yes. Then: do the files have a header? To answer, I just need to click here. I clicked on the collection element, and I can see that indeed these files have a header, so I put Yes. And which type of counts are they? They are featureCounts. In the output options, I want to have the plots along with the results, and I also need the normalized counts. As I said, DESeq2 uses a normalization which is not FPKM or TPM, because it takes the library composition into account, so we are interested in having these normalized counts. OK, and then I just run the tool. I go back to my history, and I see that there will be three outputs: one is the normalized counts, a big table with all the counts; then there will be some plots; and finally the result file, a table that will tell me, for each gene, the log2 fold change, the p-value and the adjusted p-value. OK, I pause the video here and come back when it's done. Now it's done, and we can check the results. As I said, the first result file is a table, with one row per gene, starting with the gene ID. The base mean is the average normalized expression across all samples, so all seven samples.
Then there is an estimate of the log2 fold change, and then a standard error on the log2 fold change, which tells how confident we are in the log2 fold change value. Then there is a statistic, the Wald statistic, and then a p-value and an adjusted p-value. So why do we have both? The p-value is the probability that we would see such a difference if the data were randomly distributed and there were no real difference. Because we are computing a lot of tests, one per gene, we need to correct for that: if not, say we use a threshold of 5%, then just by chance we would call 5% of genes significant. To correct for that, we use the adjusted p-value: it is adjusted for the fact that we did a lot of tests. So this is the first result. The second output contains the plots, and I click on the eye icon to display them. The first one is a PCA. It's an analysis that is very common in RNA-seq: it represents the variation between the samples in a two-dimensional plot. On the x-axis we put the first principal component, the direction that explains the most variance; we can see here that 50% of the variance is explained by this axis. Then we take another direction, orthogonal to the first: PC2 is the second component explaining the variance, and it explains 33%. That means that with this plot we display 83% of the variance, which is a lot. And we can see that most of the variance is explained on the first axis: on the left we have untreated single and untreated paired, and on the right we have treated single and treated paired. This is very good news: on the left we have the untreated, and on the right the treated, so the main variation comes from the treatment. You remember that we treated the Drosophila cells with RNA interference to decrease the expression of the pasilla gene.
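Coming back to the adjusted p-value for a moment: the multiple-testing correction DESeq2 reports by default is Benjamini-Hochberg. A sketch of that procedure on four toy p-values (not real results):

```python
# Sketch of the Benjamini-Hochberg adjustment behind the "p-adjusted"
# column: each p-value is scaled by (number of tests / rank), then
# monotonicity is enforced from the largest p-value downwards.
def bh_adjust(pvalues):
    """Benjamini-Hochberg FDR-adjusted p-values (toy implementation)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, keeping the running minimum
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvalues[i] * m / rank)
        adjusted[i] = prev
    return adjusted

print(bh_adjust([0.01, 0.04, 0.03, 0.5]))
```

Note how 0.04 and 0.03, each nominally below the 5% threshold, can end up above it after adjustment, which is exactly why we filter on the adjusted column later.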
So this explains most of the variation between samples. And the second source of variation, 33%, is explained by the sequencing: you can see that you have single, single, single on one side and paired, paired, paired, paired on the other. So this is a good result. Then you have here the sample-to-sample distances. Each row and each column is a sample, so on the diagonal it's the distance between a sample and itself: it's zero, there is no distance. But if you take another square, it represents the distance between one sample, say the 77, and another, the 80; and you can see the distance is relatively large. On top you have a clustering, which shows how the samples group together, and the length of the branches indicates how far apart they are. For example, we can see that the two closest samples are the 80 and the 81, which are both treated paired-end. Then we have these two, which are also very close, the 77 and the 78, which are both untreated paired-end. And if you follow the tree, you can see that the first split is between two groups: one group with four samples, which are all untreated, and one group with three samples, which are all treated. This confirms what we saw with the PCA: what is driving the difference between the seven samples is the treatment. Then we have a plot of the dispersion estimates: each dot is a gene, and the plot shows how much dispersion it has. On the x-axis we have the mean of normalized counts, so on the right are highly expressed genes and on the left are lowly expressed genes. Then we have the histogram of the p-values: what is on the left is significant, what is on the right is not. And finally, we have an MA plot: mean versus fold change. On the x-axis is the mean expression, so what is on the right is highly expressed and what is on the left is lowly expressed; and on the y-axis we have the log2 fold change.
So what is high is more expressed in the treated and less expressed in the untreated, and the contrary below; in blue is what is significant. OK, and the last output is the normalized counts file: it's a table where each row is a gene (you can see the gene IDs here), each column is a sample, and you have a header with the names that we gave in the collection. OK, so once we have these results, I go down in the tutorial. What we would like to do is to annotate the results, because everything in the DESeq2 result file, this table, is labeled with gene IDs, but usually we are more familiar with gene names, and these names do not appear in this table. We will be able to add them using the GTF: if you remember, we used the GTF in the first part of the tutorial. So we will use the side-by-side history view of Galaxy. On the right, you click on the little arrow and you choose "Show histories side by side". Sometimes you directly see your history; if you don't, for example if it looks like this, you go to "Select histories", and then you select your current history, and you also need to take the part 1 history that we used previously; then you click on "Add selected". Now they are side by side: you have part 1 and part 2. What you want is to look for the GTF in part 1, which was just before STAR here, and drag it into part 2; it will now be here, and also on the right in your current history. OK, so now we can just use the tool that allows us to annotate: it's "Annotate DESeq2/DEXSeq output tables". The tabular output is the DESeq2 result file, it's a DESeq2 file, and we give it the GTF. We run the tool. It's done, and I can check by clicking on the eye icon. What is the result? We still have the same first columns: the gene ID, average normalized expression, log2 fold change, the error on the log2 fold change, the statistic, the p-value, the adjusted p-value. And now we have the coordinates of the gene.
So chromosome, start, end, the strand, the type (protein coding here), and then the gene name. Now, because the table is sorted by increasing p-value, with the most significant on top, we can see that the most differentially expressed gene is SPARC. If we check its log2 fold change, it's negative: that means it is decreased in the treated samples. And we can see, for example, that the most increased gene is this one, Ant2. What we really need to check is the pasilla gene, because if you remember the design of the experiment, everything is based on a treatment with RNA interference that is supposed to decrease the expression of the pasilla gene. So if we go down, we can find pasilla, which hopefully is not so far down the list. We look at the adjusted p-value: it's around 10 to the minus 29, so it's highly significant. And if we check the log2 fold change, it's minus 1.6, so it is indeed decreased in the treated samples. Everything is good. What is missing here is the header: I can tell you what is inside, but it's better to have a header. So what we will do is go to the tutorial; here there is a hands-on called "Add column names". We will just copy this line and create a new file: we go to Upload Data, Paste/Fetch data, we paste, and we say that it's a tabular (this one). We press Start. And then, oops, I should have given it a name. What we will do next is concatenate, so as to put this header line on top of the annotated table. The way we do it is with "Concatenate datasets (tail-to-head)": the first dataset to take is the unnamed header dataset (even though I should have given it a name), and since this goes top to bottom, below it we put the annotated table. I can run the tool. I pause the video and come back when it's done. So it's done, and I just have a look, clicking on the eye, to check that everything went smoothly.
So we indeed have GeneID, base mean, log2 fold change, the standard error on the log2 fold change, the Wald statistic, p-value, adjusted p-value, chromosome, start, end, strand, feature and gene name. Perfect. Now we will rename this dataset, because we are going to use it. To do so, I click on the pencil next to the concatenate output and give it a meaningful name: it's proposed to call it "Annotated DESeq2 results". Perfect. And now we would like to know how many genes are differentially expressed. Of course, it depends on the definition you use for differentially expressed, but usually people use 5% on the adjusted p-value, and it's up to you to choose 1% or even less. So we are going to filter for the rows that have an adjusted p-value below 0.05. To do so, we use a tool called "Filter data on any column using simple expressions", on the annotated DESeq2 results, with the condition c7 < 0.05 (c7 for the column 7), and there is one header line that we would like to skip. Perfect, I run the tool and just wait until it's green. So the filtering is done; we can click on the dataset to get a bit of detail. We filtered using c7 < 0.05, and it's written that we kept about 4% of the lines. At the end, we can also check: 967 lines. We know we have a header, so that means we have 966 genes that are significantly differentially expressed. We will just rename this output: click on the pencil and rename it "Genes with significant adjusted p-value" (it's quite easy to make a typo here). And we save, perfect. Now we will do an additional filtering to only keep the genes with more than a two-fold difference, so genes that are either overexpressed two-fold or underexpressed two-fold, meaning below half the expression. To do so, if I click on the eye to check what's inside, we have a column that is the log2 fold change, and a fold change of 2 in log2 is 1.
So we want values either below minus 1 or above 1, and a way to write this directly is to use the absolute value: we just say that the absolute value of this column should be above 1. So we will use the Filter tool again, this time on the genes with significant adjusted p-value, and the condition is abs(c3) > 1, the absolute value of column 3 above 1; and there is a header line that we want to skip. I run this second filtering step. Okay, the filter has run; we can check by clicking on the title. We filtered with abs(c3) > 1 and it kept about 12% of the valid lines; in total, we now have 114 lines, so 113 genes. And we will again rename it: "Genes with significant adjusted p-value and abs(log2FC) > 1". Okay, we can have a look at this table, which is now really quite small, and at the gene names: we still have SPARC and Ant2, and we still have pasilla, which is reassuring. Good, so now we have a selection of the genes that are significantly differentially expressed and also have a high fold change, and we would like to visualize them using a heatmap. I go to the tutorial; we are now in the visualization part. Just to tell you, there are two advanced tutorials on visualization that you may be interested in: one uses heatmap2, the tool we are going to use now, and the other is about volcano plots. If you're interested, do not hesitate to click on these links and check the tutorials. So what we would like to do is to visualize the expression, in each of our samples, of these 113 genes. I don't know if you remember, but we have a table with all the expression values: it's the one called "Normalized counts file". Of course, this file contains all genes, which is quite a lot, but fortunately it has the gene ID, here, which matches what is in the dataset we just created.
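The two filtering steps can be sketched in a few lines of Python. The rows and values below are made up; they only illustrate the c7 < 0.05 and abs(c3) > 1 conditions:

```python
# Sketch of the two Filter steps: keep rows with adjusted p-value
# below 0.05, then rows with |log2 fold change| above 1.
# Rows are (gene_id, log2_fold_change, adjusted_p); values are made up.
rows = [
    ("geneA", -2.1, 1e-10),
    ("geneB", 0.4, 0.001),   # significant, but small fold change
    ("geneC", 1.7, 0.20),    # large fold change, but not significant
    ("geneD", 1.3, 0.003),
]

significant = [r for r in rows if r[2] < 0.05]        # c7 < 0.05
strong = [r for r in significant if abs(r[1]) > 1]    # abs(c3) > 1
print([r[0] for r in strong])
```

Only the genes passing both thresholds survive, which is how we go from 966 significant genes down to the 113 used for the heatmap.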
So, as a way to subset the big expression file and only keep the genes that are significant and with a large fold change, we will use a tool called Join: "Join two Datasets side by side on a specified field". You can also access it just by clicking here; this way it's less error prone. What we want to join, on the left, is the normalized counts: so it's here, the normalized counts file, and the gene ID is in column 1. We would like to join it with the last output that we have, the genes with significant adjusted p-value and abs(log2FC) > 1, where the gene IDs are also in column 1. Do we want to keep lines of the first input that do not join? No, because the second file is what we are using to restrict. Keep lines of the first input that are incomplete? No. Do we keep the header line? Yes. And now we can run the tool. The join has finished; let's have a look, clicking on the eye icon. First of all, there are not so many lines anymore, so it has indeed been restricted to the differentially expressed genes. Then, in the columns, we have the normalized expression, and then the gene ID and everything that was in the DESeq2 output, which we don't need. So what we do is just cut, to keep only the first columns. This is a tool that we already used: "Cut columns from a table". We will take column 1, which is the gene ID, plus the seven next columns, which are the normalized expression values; so in total we go from c1 to c8, from the join output. Yes, we run the tool. It's done; I check with the eye that I got what I wanted: the first column is the IDs, and 1, 2, 3, 4, 5, 6, 7, I have all my samples. Perfect, I just rename this dataset "Normalized counts for the most differentially expressed genes". Okay, and now I will be able to display it as a heatmap.
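The join-then-cut combination above amounts to intersecting the counts table with the selected gene list on the ID column. A minimal sketch with hypothetical IDs and counts:

```python
# Sketch of "Join two Datasets" followed by "Cut": keep, from the
# normalized-counts table, only the genes present in the filtered list,
# and only the ID column plus the seven count columns (toy data).
normalized = {
    "FBgn0000001": [12.0, 8.5, 10.1, 9.9, 30.2, 28.7, 31.0],
    "FBgn0000002": [5.0, 4.8, 5.1, 5.3, 5.0, 4.9, 5.2],
}
selected_ids = {"FBgn0000001"}  # the significant, high fold-change genes

# The join restricts the rows; keeping ID + 7 counts is the c1-c8 cut
subset = {g: counts for g, counts in normalized.items()
          if g in selected_ids}
print(sorted(subset))
```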
So a heatmap is a way to summarize the information: instead of displaying the values, we use a gradient of colors. The heatmap tool we are using is heatmap2. As input, we use the normalized counts table that we just created. We don't need to put any plot title. We will transform the data: we transform in log2 of value plus 1. We do plus 1 so that, first of all, if we have zeros it will not raise an error, because log is only defined for strictly positive values; and the other advantage of log of one plus is that it only generates positive data. And we will only label the columns and not the rows, because we still have 113 genes, and if we displayed the gene labels they would overlap each other. For the colors, you can choose the color map you want, but I suggest we use a two-color gradient. And then I run the tool. So the result is now available; I click on the eye and I can see it. What we see is that on the x-axis we have our seven samples, with a clustering, and we can see that, on the selected genes, the treated samples go together and the untreated go together. Within the treated, we can see the two paired-end samples go together, and similarly in the untreated we have the two paired-end and the two single-end together, confirming that there is still an influence of the sequencing type. Each row is a gene, and the color indicates the log2 of one plus the normalized expression. And we can see, for example, that we have one group of genes here that is underexpressed in the treated; here we have a mix of genes; these genes, on the contrary, are overexpressed; and here we again have underexpressed and overexpressed genes. And finally, here we have genes that are overexpressed here and underexpressed there, but usually with a lower value. In fact, this is how the clustering is done: by proximity in the normalized values, not in the behavior.
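As a quick aside, the log2(1 + x) transform chosen above is easy to check on toy values:

```python
import math

# Sketch of the log2(1 + x) transform applied by heatmap2 here: the +1
# keeps zero counts valid (log is undefined at 0) and keeps every
# transformed value non-negative. Toy counts for one gene across samples.
row = [0.0, 3.0, 7.0, 15.0]
transformed = [math.log2(1.0 + x) for x in row]
print(transformed)
```

A zero count maps to exactly 0 rather than raising an error, and large counts are compressed, which is what makes the color gradient readable.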
So all the upregulated genes are not together, and the downregulated are not together. If we want that, it's better to work with Z-scores. If you go to the tutorial, in the visualization part, it's written exactly what a Z-score is: a Z-score measures how far a value is from the mean, in units of standard deviation. So a Z-score of plus 2 means the value is two standard deviations above the mean, and minus 1 would be one standard deviation below the mean. And here is the way to compute Z-scores on all genes, if you're interested in that. But heatmap2 can display the Z-scores directly, and we would get something like this. So let's do it: we will rerun heatmap2. We can either click here to get a clean form to regenerate the plot, or we can use something in Galaxy to rerun a job; we will do this. We click on the heatmap item and we click on the circular arrow icon, "Run this job again". Because we are just changing a few parameters, we will use this. We don't change the data transformation, and we say that we would like to compute Z-scores prior to clustering, on rows: that means that for each gene, the mean and standard deviation of its expression will be computed. Then we leave the rest as it was, except the gradient: instead of a two-color gradient, since values will now be above or below the average, it's better to have a gradient with three colors, so it's symmetric. And we run the tool, and I come back when it's done. So heatmap2 has just finished; we click on the eye and we see the result. Now we can see that the genes that are overexpressed cluster together at the top, and at the bottom we have all the genes that are underexpressed following the pasilla depletion.
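The per-row Z-score that heatmap2 computes before clustering can be sketched like this, on toy expression values for one gene:

```python
from statistics import mean, stdev

# Sketch of the row Z-score used for the second heatmap: each gene's
# values are centered on the gene's mean and scaled by its standard
# deviation, so "above average" and "below average" become comparable
# across genes. Toy expression values for one gene across four samples.
row = [2.0, 4.0, 6.0, 8.0]
m, s = mean(row), stdev(row)
zscores = [(x - m) / s for x in row]
print([round(z, 2) for z in zscores])
```

After this transform, the clustering groups genes by behavior (up in treated versus down in treated) rather than by absolute expression level.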
And then we can also see that we have, for example, a cluster here of genes that are more expressed in the single-end compared to the paired-end treated samples. And similarly here we have a cluster of genes that are more expressed in this sample 82 compared to the other untreated samples. Okay. I don't know if you remember, but when we selected the genes that were significantly differentially expressed, we got a list of genes, and sometimes you want to know if they have common features. This is what gene ontology is about: are they associated with common biological processes, or common cellular compartments? This is what we call gene ontology analysis, and we will do it using a tool called goseq, which allows this type of analysis. goseq requires two input datasets. The first one is a tabular file where the first column is the gene ID, which needs to be uppercase, and the second column is a Boolean saying whether the gene is differentially expressed (True) or not (False). The second file that goseq requires is a file with information about the gene lengths, which it uses to correct for length bias. So we will first generate the first input, the True/False table. What we will do is compute an expression, and we will use the DESeq2 result. We click on Compute (we can also search for it). Here we will look for the DESeq2 result file, and this input has no header. If you don't remember, you just need to scroll down in your history, click on dataset 18, and you can see that there is no header. What is the expression that we want to compute? We want to convert to Boolean, so I will just copy-paste it from the tutorial: the expression is float(c7) < 0.05, so column seven, the adjusted p-value, below 0.05. And the way we will add it is to append.
That means it will go into a new column, the eighth column. This is an expression, and then we would like to have error handling, because, as I did not show you, there are some rows, some genes, where the adjusted p-value is NA, not assigned, and those will not fit into this expression. So we don't want to auto-detect the column types, and we don't want the tool to fail: if we don't manage to compute the expression, we would like to fill in a replacement value, which will be False. I just check that it matches. Yes. And we run the tool. As I said, this will be a tabular file, and we will have in the eighth column the value we are interested in. So what we need to do is use Cut to keep only the first column and the eighth column, and we do it on the Compute output. We are close to having the right input, because, if you remember, we need the first column to be uppercase, and for this we have a tool, Change Case. On the Cut output, we change the case of column one to uppercase, and we run this. When all these steps are done, I will resume the video. Everything has run. I just check the result with the eye, and indeed I have the first column, the gene IDs, all in capital letters, and then the True or False. I will rename it: I use the pencil on the right and rename it "Gene IDs and differential expression". Great. So now what I need is the gene lengths, and here, depending on whether you used the STAR or the featureCounts option, there are two different ways to proceed. I will again use the window manager to show the histories side by side. So we have the first part, where I have the featureCounts outputs, and I need to select my history: if I used featureCounts, I need to take the feature lengths. If I used STAR, it's even easier, because I have at the end a dataset called gene lengths that I just need to drag.
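The three Galaxy steps just described (Compute with a replacement value for NA rows, Cut, Change Case) amount to a small amount of column arithmetic. Here is a Python sketch with hypothetical example rows; the column positions follow the DESeq2 result layout described above (gene ID in column one, adjusted p-value in column seven):

```python
def goseq_true_false(rows, alpha=0.05):
    """Build the goseq input: (uppercased gene ID, differentially
    expressed or not).  Rows where the adjusted p-value is 'NA' fall
    back to False, mirroring the 'fill in a replacement value' option."""
    out = []
    for cols in rows:
        gene_id, padj = cols[0], cols[6]
        try:
            de = float(padj) < alpha  # the float(c7) < 0.05 expression
        except ValueError:            # 'NA' rows cannot be converted
            de = False
        out.append((gene_id.upper(), de))
    return out

# Hypothetical DESeq2-style rows (only columns 1 and 7 matter here)
rows = [
    ["fbgn0000001", "", "", "", "", "", "0.001"],
    ["fbgn0000002", "", "", "", "", "", "NA"],
]
print(goseq_true_false(rows))
# -> [('FBGN0000001', True), ('FBGN0000002', False)]
```

The uppercasing at the end corresponds to the Change Case step, since goseq expects uppercase gene IDs.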
For featureCounts, I got a collection, and inside the collection each element contains the feature lengths, so I need to extract one of them. To do so, I will just use the tool to extract datasets from a collection. From the featureCounts feature lengths collection, I will just take the first one, but they are identical, so I can choose whichever I want. It will be named after the first element, but we know that inside it is the gene lengths. So what I do is use the Change Case tool again: I need to change the case of column one to uppercase. I will pause my video and resume when it's done. So Change Case finished, and we will rename the dataset to find it more easily in the next step: we just click on the pencil and rename it "Gene IDs and length", I think. Yes, "Gene IDs and length", and save. Okay, now we have our two inputs ready for the goseq analysis. So we use the goseq tool: the differentially expressed genes file is "Gene IDs and differential expression", and the gene lengths file is "Gene IDs and length", this one here. With this tool, we can either get the GO categories from the installed package or provide them ourselves. Here we will just get the categories, because the genome we are using is available in the list. The gene ID format is Ensembl gene ID; this is correct. First of all, we will use these categories: cellular component, biological process and molecular function. If we go to the output options: do we want a plot of the top GO terms? Yes. Do we want the diagnostic plots? No. Extract the differentially expressed genes for the categories? Yes. And now we run the tool. I pause the video and come back when it's done. goseq has finished. We can see that we have three outputs. The one on top is the link between the GO terms and the genes: on the left the categories, and then the genes that belong to each category and are differentially expressed.
Then we have a plot that represents the top over-represented categories in cellular component, biological process and molecular function. On the x-axis is the percentage of genes in the category that are differentially expressed, and for the color, the darker it is, the more significant, and the lighter the blue, the less significant. The size of the dots represents the number of genes. So, for example, at the top we have the extracellular region, then the response to stress, oxidoreductase activity and small molecule metabolic process. Then we have two gene ontology terms for which 80% of the genes are differentially expressed: glycogen biosynthetic process and glycan biosynthetic process. Then we have septate junction assembly, establishment of the blood-brain barrier, and, with less significant p-values, transferase activity and glutathione transferase activity. The last output is a table where each row is a category. We have a p-value expressing whether this category is over-represented, a p-value saying whether it is under-represented, the number of genes that are differentially expressed in the category, the number of genes that belong to the category, the term (extracellular region, for example), the ontology (whether it is cellular component, biological process or molecular function), and then the adjusted p-values. So if we want to know how many, and which, gene ontology terms are significant, we can check which are over-represented by selecting on this column, and if we want to see the under-represented ones, we check that other column. Let's just check what is over-represented. We will use the Filter tool that we already used: on the goseq ranked category list, we will select the rows where, it was column eight, I think, let me just check again here, yes, column eight is below 0.05. This will give us the categories that are over-represented.
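The Filter step is just a numeric comparison per row. A minimal sketch, with hypothetical rows and the over-representation adjusted p-value in column eight (0-based index 7):

```python
def filter_below(rows, col=7, alpha=0.05):
    """Keep rows whose value in column `col` (0-based) is below alpha,
    skipping rows where it is not numeric, e.g. a header line."""
    kept = []
    for cols in rows:
        try:
            if float(cols[col]) < alpha:
                kept.append(cols)
        except ValueError:  # header or non-numeric entry
            continue
    return kept

# Hypothetical goseq-style rows: a header, one significant category,
# one non-significant category
rows = [
    ["category"] + [""] * 6 + ["p_adjust_over"],
    ["GO:0005576"] + [""] * 6 + ["0.0001"],
    ["GO:0000001"] + [""] * 6 + ["0.9"],
]
print(filter_below(rows))  # keeps only the GO:0005576 row
```

Switching the column index from eight to nine gives the under-represented categories instead, just as described below in the transcript.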
And we can do the same for the under-represented: instead of c8, we just use c9. The filter has finished. We can see that there are 60 lines, so only 0.5% of the categories are kept, and among them we have extracellular region, response to stress, et cetera. If we want to know how many there are of each ontology, which is column seven, we can use a tool called Group, "Group data by a column". We group by column seven, and in the operation we just say that we will count on column one, which is the ID. And we run the tool. Instead of grouping, it could also be interesting to just filter only cellular components, or only biological processes, and get the list. So the Group tool is running now. Yes, now it's done; I click on the eye, and I see that there are 50 biological processes that are enriched, five cellular components and five molecular functions. Okay, something else that we can do with the data is to check the KEGG pathways. If I go down, the KEGG pathway database is a collection of pathway maps, which look like this: they represent a pathway, each box represents genes, and there are connections between them, like activation or inhibition, and they can contain genes, proteins, RNAs, chemical compounds, et cetera. Here we have an example of one of the pathways, dme00010, which is the glycolysis process. So if we want to perform pathway analysis, it is just to find whether, among the differentially expressed genes, there is an enrichment in one of the pathways. We will just rerun goseq. I don't know if you remember: to rerun a tool where you just change one or two parameters, you click on one of its outputs, then you click on the circular arrow. What we need to change here is just the category source: we use KEGG instead of gene ontology, and in the output options we don't ask for a plot, because we can't have a plot for KEGG. And then we run the tool.
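The Group step earlier in this section (group by column seven, count on column one) is equivalent to a simple counter. A sketch with hypothetical rows, where column seven (0-based index 6) holds the ontology:

```python
from collections import Counter

def count_by_column(rows, group_col=6):
    """Count how many rows fall into each value of `group_col`
    (index 6 = column seven, the ontology: BP, CC or MF),
    like the 'Group data by a column' tool with a count operation."""
    return Counter(cols[group_col] for cols in rows)

# Hypothetical filtered categories
rows = [
    ["GO:1"] + [""] * 5 + ["BP"],
    ["GO:2"] + [""] * 5 + ["BP"],
    ["GO:3"] + [""] * 5 + ["CC"],
]
print(count_by_column(rows))  # -> Counter({'BP': 2, 'CC': 1})
```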
So now we have only two outputs, because we don't have the plot: not three, two. Similarly, this first output gives, for each KEGG pathway, which differentially expressed genes are inside it, and this one gives, for each KEGG pathway, whether it is over-represented or, sorry, under-represented. It's running now, and I come back when it's done. The second goseq run has finished, and we can have a look. Here are the categories and the genes, and we can see that there are 128 lines including a header, so 127 pathways, and here we can see the result with the p-values. What we want is in column six, the over-represented p-value, so we just filter on the ranked list and say that we want column six to be below 0.05; what I forgot last time is to skip the first line, which is the header. This should show us the over-represented pathways; I'll be back when it's done. So the filter is done and we have three lines. One is in fact the header, and then we just have two KEGG pathways: one is 00010, and this is exactly the one shown here, the glycolysis, and the second one is 01100, which is all metabolic pathways, so it's a huge, huge pathway. I won't do it, but you can check what the under-represented KEGG pathways are. And if we are interested in projecting the log2 fold change onto the picture of a KEGG pathway, there is a tool called Pathview that allows you to do this. It needs two inputs: the pathway IDs to plot, either just one ID or a file with more than one ID, and a tabular file with the genes and the log2 fold changes. To generate this, we will get the log2 fold change from the genes with significant adjusted p-value, as we don't want to plot the log2 fold change of genes that are not significant: that can just be a high log2 fold change with a high variation, and this we don't want to display.
So what we will do is use the Cut tool with columns one and three on the output that was, I can't remember the name of the input we are using. Yes, "Genes with significant adjusted p-value". This is what we have, and we can check on dataset 26 that the log2 fold change is indeed the third column. I check here: we have a header, and it is the log2 fold change. Perfect, so I run the tool, and when it's done, I resume the video. So Cut has run; I just check the output by clicking on the eye, and indeed I have the gene ID and then the log2 fold change. I will rename the dataset: click on the pencil and change the name to "Genes with significant adjusted p-value and log2 fold change". Okay, so as I said, the tool we are using is Pathview. Here, if we want to plot a single pathway, we put its ID, for example 00010. But if we want more than one, we need to specify a dataset containing one column with the KEGG pathways to plot. To do so, we just go to the tutorial and copy these two lines, go to Upload Data, Paste/Fetch data, and we paste them; this time I remember to give a name, "KEGG pathways to plot", and it will be a tabular. So here I can select dataset 49. Does it have a header? No, it was just two lines. The species to use is the fly. Do we have gene data? Yes, it's "Genes with significant adjusted p-value and log2 fold change". Does the file have a header? Yes, it has a header. What is the format of the gene data? It's Ensembl gene ID. Do we have a compound data file? No. Then we leave everything by default and we run the tool. I'll resume the video when it's done. So Pathview has finished. This is a list with two datasets, and they correspond to the two pathways.
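The Cut step just performed (keep columns one and three for the Pathview gene data) can be sketched the same way as the earlier column operations, with hypothetical rows:

```python
def cut_columns(rows, keep=(0, 2)):
    """Like the Cut tool with c1,c3: keep only the gene ID (column 1)
    and the log2 fold change (column 3) for the Pathview gene data."""
    return [[cols[i] for i in keep] for cols in rows]

# Hypothetical significant-gene rows: ID, base mean, log2 fold change
rows = [
    ["FBgn0000001", "12.3", "1.5"],
    ["FBgn0000002", "4.2", "-2.0"],
]
print(cut_columns(rows))
# -> [['FBgn0000001', '1.5'], ['FBgn0000002', '-2.0']]
```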
So if I check the first pathway, which was one that was over-represented, we can see that we indeed have quite a lot of red, meaning genes that are over-expressed, but also some green, genes that are under-expressed, which can be a bit counterintuitive as a color code. So this is a type of representation that we can make. And here was an example of a pathway that was under-represented, and indeed we see that we don't have a lot of colors, just two genes that have been colored. Okay, so here is a workflow of what we have done in this tutorial. We had FASTQs, we ran FastQC and integrated the reports with MultiQC. We used Cutadapt to trim the bases that were of poor quality. Then we mapped with RNA STAR and obtained BAMs, but also a log telling us the quality of the mapping. We used Infer Experiment to get the strandedness, though there were also other ways to get the strandedness. Then we counted with featureCounts or with STAR. With this, we were able to do a DESeq2 analysis to identify the genes that were differentially expressed, and from this we could do some goseq analysis, so gene ontology and KEGG, which we visualized with Pathview. So I hope you enjoyed the tutorial. You have here the different references that we used, and below there is a feedback form. Please fill in this feedback form: this is feedback that we read, and it will help us improve the tutorial. Please do not give feedback on the video, but just about the tutorial material. With that, I would like to thank you very much, and I hope I will see your comments.