Hello, my name is Paul Zierep, I'm from the Galaxy group in Freiburg, and I will walk you through this tutorial on metatranscriptomics analysis. Before you watch this tutorial, please make sure that you have gone through the introduction to Galaxy, which you can find at this link; apart from that, there are no other requirements. This tutorial goes through how to analyze metatranscriptomic data, what information can be extracted from this data, and how to assign taxa and functional information to the identified sequences. So why do we want to study the microbiome? Well, one outstanding piece of information is, for example, that apart from the genes the human genome encodes, there is also a rich, one could say second genome, which comprises all the bacteria, fungi and archaea that are part of the human body as well, actually about ten times more cells than your own. So there is a huge amount of information here that affects, for example, your health. Another good reason is that microbiomes have a huge impact on nature as well, which can be investigated using these technologies. Microbiome analysis does not only cover metagenomics, but also metatranscriptomics, which is what we focus on here, as well as metaproteomics and, in some regard, metabolomics studies. This tutorial is largely based on the ASaiM workflow, which is described here. It has a large pre-processing part, which is always an important step for any read-based analysis pipeline; it then removes the RNA that does not encode proteins, and then makes use of the two tools HUMAnN and MetaPhlAn, which are difficult to run on your own machine but simple to use with the Galaxy workflow management system. Before we start, let's take a short look at where our data comes from.
So this is read data, whole-genome read data from a biogas degradation experiment, where the authors wanted to analyze how the microbiome changes over time while digesting cellulose. To keep things short, we only took one data point from the series, but the overall data analysis can in theory be extended to large data series, as is also noted later; that would simply take more time than we have in such a short tutorial. In order to start this tutorial, please make sure that you are logged in to usegalaxy.eu. Unfortunately, this tutorial does not run on usegalaxy.org yet. Once you are logged in, there is a very nice feature of Galaxy: you can actually follow the tutorial inside Galaxy itself, without having to open another window. You can click here on the Galaxy training materials, and then you will see that you can search for a tutorial; in this case that is the metatranscriptomics tutorial written here. Then you can follow the steps one by one while having the guide visible at the same time. So like I said, we want to do this metatranscriptomics tutorial, which follows the ASaiM workflow, so you can already scroll down here. The first step we need to start with is the pre-processing. And before we start with anything, of course, we need to upload data and create a new history. Unfortunately, this view here is rendered a little incorrectly; I hope this will be fixed next week, but you can still read the text even though the diffs are not perfectly rendered. There are two datasets from Zenodo which we need for this tutorial, which are the raw reads from the biogas experiment. In order to download them, you can copy the links here and then go back to your Galaxy instance. You see here that I have already created a history where all the steps have been run; that's because some of those steps can take longer, and I cannot wait while recording this tutorial.
You, however, can follow this tutorial step by step and wait until the jobs are done; in my case they will already be ready. So what you need to do before we start is, like I said, create a new history. For that, click on this button to create a new history (I'll do it here as an example) and name it accordingly; in this case we can call it, for example, Metatrans, and save it. Then we need to upload the data. To upload data, click on the upload button here, and a window will open. In this window you can choose remote files or 'Paste/Fetch data', and there you paste the links we copied before. If you paste the links here and press start, the upload begins and the data is loaded into your new history. You can close the window, and you will see that this data now appears in your history, Metatrans. I will now switch back to my history, where all of this has already happened; in your case it might take a little while before the data is completely loaded. Once the data is completely loaded, you can investigate it by having a look at the FASTQ data. These are the raw reads from the metatranscriptomic experiment. So what is the next step in our experiment? We have investigated the files; they are completely uploaded. Let's look at the training material. We might want to rename the files so we can remember them more easily; in this case, for example, just call them T1A forward and reverse, for the forward and reverse reads. You can rename a file by clicking here on the edit-attributes icon and then renaming it, for example like this: remove that part, save it, and do the same for the other file. For your files you also have to make sure that the file type is correct, so check that the datatype is fastqsanger. fastqsanger is the correct file type here, so we don't need to change anything.
If for some reason it's not fastqsanger, you would need to change it to the correct datatype. You can choose any datatype here; it simply tells Galaxy, for example, which tools can work with which files, and helps you that way. So what's the next step after we uploaded the data? As always, we look at the tutorial in between. The first step, like I said, is the pre-processing. But before we start with that, you should be aware that this tutorial comes in two major versions, basically: you can run a short version, or you can run a long version. I will go through the long version. In the long version, we will call each of the tools needed in this workflow individually, look at the outputs, and see what is happening. This can take quite some time, because there are multiple tools and we go through each of them step by step. What you can also do is switch to the short version by clicking on this button, which switches the steps shown in the tutorial; there you run a pre-defined workflow that already chains the tools together and runs them one after another, which makes the processing faster. It is completely up to you whether you want to look at some of the tools in depth, or just want an overview of the workflow itself; just be sure to click these buttons to switch between the two versions. Like I said, I will go through the long version. So let's have another look at what exactly we are going to do in this tutorial. Like I said, we have metatranscriptomic data, which are the reads from our experiment. Then we do pre-processing, meaning that we remove the reads which have low quality or which are duplicated, and we also remove adapters, chimeras and other things we don't want to keep.
Then we want to separate out the RNA that does not encode proteins, because the functional analysis is based on the protein-coding information; but we still want to use that RNA for the taxonomic quantification, and we will see how we split it up later. Then we use two tools for the two major tasks that are very important for metatranscriptomic data analysis. The first is taxonomic quantification, meaning we want to determine the abundance of taxa in our sample: which bacteria, which species and genera exist in our samples. The second is functional annotation, where we want to determine which genes exist in our sample across all the species, and which pathways, that is, sets of genes or proteins that together carry out a function. We can also go to gene families and gene ontology terms to further group those genes and derive information. And lastly, we can combine the taxonomic and the functional information into a taxonomy-function quantification, meaning that we can see which functions are contributed by specific taxa and not just by the overall sample. Additionally, we also use some visualization tools, especially GraPhlAn and Krona, which make the taxonomic information visually quite appealing, which is also important, for example, if we want to write papers or similar. So, let's start with the quality control steps. We will use three tools for quality control: FastQC, a tool which creates a report on your reads and lets you inspect their quality; MultiQC, which can combine multiple bioinformatics outputs, for example the outputs of FastQC, into one overall report, which is useful since we have multiple read files, the forward and reverse reads; and Cutadapt, a tool for trimming and filtering, which removes the reads with low quality from our data.
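Since both the FastQC report and the Cutadapt quality cutoff below work on Phred scores, it may help to see how those scores are encoded. Here is a minimal sketch, assuming the standard Sanger/Illumina 1.8+ Phred+33 encoding that the fastqsanger datatype implies:

```python
# In the fastqsanger encoding, each quality character stores the Phred
# score as its ASCII code minus 33, where Q = -10 * log10(P_error).

def phred_scores(quality_line: str) -> list[int]:
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(c) - 33 for c in quality_line]

def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

# Example: 'I' encodes Q40 (1 error in 10,000), '5' encodes Q20 (1 in 100).
print(phred_scores("II55"))   # [40, 40, 20, 20]
print(error_probability(20))  # 0.01
```

This is why the "all positions above 28" observation in the FastQC plot below means an error rate well under 1 in 500 per base.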
So, let's dive into FastQC. We need to open the FastQC tool, which you can find via the search on this side, and call it with the parameters given in our tutorial. We want to use data from our history, so we select multiple files here and pick the forward and the reverse datasets; you can select multiple files by holding the Ctrl key while clicking. So now we have selected both files. What else do we need to do? Nothing; we are not using any fancy parameters here. So what we can do now is just run the tool by clicking on this button. I'm not clicking it because I already did, but in your case you can click it and the job should show up in the history panel. So what happens after that? Please pause the video and wait a little until your tool has finished, so we can compare our outputs. The output of this tool is a nice web page which shows us the quality of our reads. We get here, for example, a base quality for each position in our reads, and we can see that the overall quality is good, shown by the green area here; overall, all positions are above a score of 28. We also get a per-sequence quality score and the per-base sequence content, where we see a more or less even distribution of the bases in our samples. We have the GC content and other metrics which we can analyze, some overrepresented sequences, the adapter content, and so on. So FastQC is a very nice tool for quality control of your samples. You can have a look at the forward reads, and you can also have a look at the reverse reads. Keep in mind that you can always see how the outputs of a tool are related to its inputs, because the dataset numbers are shown here; for example, this is FastQC run on data 2.
So, that's the FastQC analysis of the forward data, and you can also have a look at data 1, which is the FastQC analysis of our reverse data. The next step after FastQC is to use MultiQC to combine our outputs into one overall view of our data. Once again, type in the tool you are looking for, MultiQC, and you will see that you can select multiple files here. We select our FastQC outputs; note that we actually need to use the raw data outputs, not the web pages, so we choose the raw data and create a report. Once again you have to click run; in my case that was already done. If you look at the MultiQC output, you can see that you now have a nice aggregate over all your data. You can see the number of reads in both forward and reverse files; they have the same number, which is expected, because each forward read corresponds to one reverse read. You can see that the quality of both files is good, and you can also inspect the other metrics for all the data. What would be the next step? First of all, we can try to answer the questions in the tutorial: how many files do we have, what is the quality score, is there any bias in the base content? All of those can be answered from the MultiQC output. Now we have a rough idea about our data, and we see that it is overall good, but we still want to remove the sequences which are not. There are multiple tools that can be used for this, actually a plethora of them; some are mentioned here, such as Cutadapt, Trimmomatic, Trim Galore! and others, and they focus on different specific tasks. In this case we will use Cutadapt. Cutadapt can be used to remove adapter sequences which are left over from library preparation, primers, poly-A tails and other unwanted sequences. So in order to run Cutadapt, please open the tool and set the parameters shown here.
So I will open Cutadapt, and we choose paired-end, because we have paired reads: like I said, we have a forward and a reverse read, and we need to use the forward and the reverse data here. Then we need to set some filter options. One is the minimum length, found under the filter options (you have to scroll a bit to find it), which we set to 150, matching the read length we are looking at. The quality cutoff can be found under the read modification options, where we set it to 20. And under the output options we request the report, which gives some statistics and can also be used in MultiQC later. If you want to learn more about the quality cutoff and so on, please have a look at the tool's documentation. Let's run this tool as well; I click run, and once again you need to click run on your instance. There is the question of why we run the trimming tool only once, on the paired-end data, and not twice, once per dataset. Like I said, in paired-end data the sequences correspond to each other: one read is from the beginning of the fragment and the other is its reverse complement from the other end, and they overlap to a certain extent. If the quality of one of those reads, or of the combined pair, is not good enough, we want to exclude both of the sequences, so that's an important step. All right, so now we have run Cutadapt; let's have a look at the Cutadapt report. You can see in this report that we processed about 260,000 read pairs, some of which were too short and some of which did not pass the trimming. The total number of reads that remain at the end of our trimming is this number, and they can be seen here in the output.
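To make the two settings we just configured concrete, here is a deliberately simplified sketch of the filtering logic: trim low-quality bases from the 3' end, then drop the whole pair if either trimmed mate is too short. Note that real Cutadapt uses a subtler BWA-style partial-sum algorithm for quality trimming; this only illustrates the idea.

```python
# Simplified sketch of Cutadapt's quality cutoff (20) and minimum
# length (150) in paired-end mode. Not the exact algorithm.

QUALITY_CUTOFF = 20
MIN_LENGTH = 150

def trim_3prime(seq: str, quals: list[int], cutoff: int = QUALITY_CUTOFF):
    """Drop trailing bases whose Phred score is below the cutoff."""
    end = len(quals)
    while end > 0 and quals[end - 1] < cutoff:
        end -= 1
    return seq[:end], quals[:end]

def keep_pair(fwd_seq: str, rev_seq: str, min_len: int = MIN_LENGTH) -> bool:
    """Paired-end filtering: both mates must survive, or both are dropped."""
    return len(fwd_seq) >= min_len and len(rev_seq) >= min_len

seq, quals = trim_3prime("ACGTACGT", [38, 38, 35, 30, 25, 19, 12, 2])
print(seq)  # ACGTA  (the last three bases fall below Q20)
```

The `keep_pair` function is why running the tool once on the pair matters: filtering the two files independently would leave orphaned mates behind.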
We also saw that many more base pairs were trimmed from the second read file, which you can see over here. If we compare that with the information we have from FastQC, or rather from the MultiQC output, we can see that the quality of that read file was a bit worse towards the ends than for the other one, and that explains why more bases had to be trimmed there. Okay, what would be the next step after Cutadapt? Like I already said, for the functional information we only want to keep RNA sequences that encode proteins. There are other RNA sequences, for example ribosomal RNA, which we don't want in our dataset for the functional analysis, but which we do want for the taxonomic information. So we use 'Filter with SortMeRNA' in order to separate out those sequences which are found in the SILVA or Rfam databases; those are sequences identified as encoding ribosomal RNA, and this RNA is not useful for the functional analysis, but we will use it for the taxonomic analysis, and that's how we split the data. So we click on 'Filter with SortMeRNA', and a nice feature here is that, if the tutorial is opened inside Galaxy, you can sometimes click on the tool name and the tool will open directly in the Galaxy frame. What we need to use here are the quality-controlled forward and reverse reads. So once again we select paired reads, and you need to make sure to use the quality-controlled forward and reverse reads, which are these two, the Cutadapt outputs. The option we set is to output both reads to a rejected file; that means if one read of a pair is detected as a ribosomal sequence, we remove both reads of the pair. We want to search through both strands, of course, and here we can choose the databases: we select all of them, because we want to remove all the sequences which cannot be used for the functional analysis.
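The "output both reads to a rejected file" option we just set can be sketched as a simple partition over read pairs; this is only an illustration of the pairing logic, not SortMeRNA's actual alignment machinery, and the `is_rrna` predicate stands in for the database search:

```python
# Sketch of SortMeRNA's paired-read handling: if EITHER mate of a pair
# matches an rRNA database, the whole pair goes to the aligned (rRNA)
# output; only pairs where neither mate matches are kept for the
# functional analysis downstream.

def split_rrna_pairs(pairs, is_rrna):
    """Partition (forward, reverse) pairs using a predicate flagging rRNA reads."""
    rrna_pairs, non_rrna_pairs = [], []
    for fwd, rev in pairs:
        if is_rrna(fwd) or is_rrna(rev):
            rrna_pairs.append((fwd, rev))
        else:
            non_rrna_pairs.append((fwd, rev))
    return rrna_pairs, non_rrna_pairs

# Toy example: pretend reads starting with "rRNA" matched the database.
pairs = [("rRNA_read", "ACGT"), ("CCGA", "TTGA")]
aligned, unaligned = split_rrna_pairs(pairs, lambda r: r.startswith("rRNA"))
print(len(aligned), len(unaligned))  # 1 1
```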
We say yes to 'Include aligned reads in FASTQ format', and that should be everything. Once again you have to click run, the tool will run, and you will see the output here. Once the SortMeRNA tool has finished, we can have a look at the log, and we see that we processed a total of more than 400,000 reads, and of those, more than 120,000 passed the E-value threshold needed to identify them as rRNA. So about one quarter of our data is rRNA, which we can set aside; we don't need it for the functional analysis. And if we look at which database was mainly used to identify those rRNAs, we see that the SILVA bacterial database had the highest number of matches, so we can expect that the main share of sequences in our data derives from bacterial species. Before we can continue with the extraction of the community profile, we need to do one more pre-processing step. The functional annotation tool actually expects only a single FASTQ file, but as you know we have paired files, two files with corresponding reads. The simplest way to pass only one file to the next tool is the FASTQ interlacer, which basically takes all the sequences from our two files and puts them together into one dataset where the mates alternate. You can do this by clicking on the FASTQ interlacer and then taking the outputs of SortMeRNA, because we want the reads where the rRNA was removed; the interlaced file will be used as input for our functional annotation tool. So we have to take the unaligned forward and unaligned reverse reads here, run the tool again, and you will see that you get the interlaced file as output, in which the mates are now combined and stacked over each other. So the sequences which were previously in two different files are now in one file, and they can be used as input for the functional annotation tool.
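The interlacing step itself is simple enough to sketch in a few lines; this is an illustration of the idea, assuming well-formed 4-line FASTQ records with mates in matching order:

```python
# Sketch of what the FASTQ interlacer does: take the unaligned (non-rRNA)
# forward and reverse files and emit one file in which the mates alternate.
from itertools import chain

def fastq_records(lines):
    """Group a FASTQ file's lines into 4-line records (@header, seq, +, qual)."""
    it = iter(lines)
    return zip(it, it, it, it)

def interlace(forward_lines, reverse_lines):
    """Yield lines of the interlaced file: fwd mate 1, rev mate 1, fwd mate 2, ..."""
    for fwd, rev in zip(fastq_records(forward_lines), fastq_records(reverse_lines)):
        yield from chain(fwd, rev)

fwd = ["@read1/1", "ACGT", "+", "IIII"]
rev = ["@read1/2", "TGCA", "+", "IIII"]
print(list(interlace(fwd, rev)))
# ['@read1/1', 'ACGT', '+', 'IIII', '@read1/2', 'TGCA', '+', 'IIII']
```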
So now our quality control steps are done, and we can start with the actually interesting part of the analysis. We will start by trying to extract the community profile, and there are different approaches for this; we also have other tutorials about it in the Galaxy Training Network, which you can have a quick look at. The most common approach to derive a community profile is based on the assignment of 16S or 18S amplicon data, but if you have metagenomic whole-genome reads, you would miss out on most of the data that way, because only a few reads cover those amplicon sequences. Instead you can make use of marker genes, a much larger selection of genes that can be attributed to specific clades, for example with the tool MetaPhlAn. MetaPhlAn has a database of about one million clade-specific marker genes, not only the rRNA genes but also other ones, built from a large database of reference bacterial, viral and eukaryotic genomes, and using those marker genes MetaPhlAn can very nicely assign a large share of the reads from whole-genome metagenomic data. So please click on MetaPhlAn, and we run the MetaPhlAn tool. This is one of the tools which has a lot of options, so make sure that you set all the correct inputs and values so that it runs correctly. What we need as input is the output of the Cutadapt quality control. We don't use the interlaced data, because we actually want to keep the rRNA reads in our dataset for this taxonomic analysis; those 16S and other amplicon sequences are very useful for taxonomic assignment. So we take the Cutadapt outputs here. We set the minimum length to 70. We use the locally cached database, which is the MetaPhlAn clade-specific marker gene database made available in Galaxy. We choose the analysis type 'Relative abundances' here, and we want the relative abundance reported for all taxonomic levels.
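MetaPhlAn will report these relative abundances as a tab-separated table with one row per clade; a later step in this tutorial keeps only the lineage and abundance columns of that table with the Cut tool. As a hedged sketch of what that table looks like and how such a column cut works (the rows below are made-up illustrative values, and the exact column layout can differ between MetaPhlAn versions):

```python
# Sketch: keep column 1 (taxonomic lineage) and column 3 (relative
# abundance) of a MetaPhlAn-style tab-separated table, skipping
# '#' comment/header rows.

def cut_lineage_and_abundance(metaphlan_lines):
    """Return (lineage, relative abundance) pairs from a MetaPhlAn-style table."""
    kept = []
    for line in metaphlan_lines:
        if line.startswith("#"):  # header/comment rows
            continue
        cols = line.rstrip("\n").split("\t")
        kept.append((cols[0], float(cols[2])))
    return kept

table = [
    "#clade_name\tNCBI_tax_id\trelative_abundance\n",
    "k__Bacteria\t2\t100.0\n",
    "k__Bacteria|p__Firmicutes\t2|1239\t68.9\n",
]
print(cut_lineage_and_abundance(table))
```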
What is also important to take care of is that we set the quantile value for the robust average to 0.1, which is a quality-control parameter, and one other output we want to enable is the output for Krona, the visualization tool which we will have a look at later; it's nice to have a view of our data, and therefore this option needs to be set. If you have made sure that all the values are correct, please run the tool. This can take a little while, so take a coffee and come back to the tutorial afterwards. Once MetaPhlAn has finished, you can have a look at the file 'Predicted taxon relative abundances' by clicking on the eye icon, and you will see that the output is a hierarchy of all the taxa detected in the sample: it starts from the kingdom and goes down over family all the way to species and strain. It also shows you the NCBI taxonomy ID corresponding to the names shown here, and the relative abundance, that is, the percentage of the reads found for this taxon. Here, for example, 100% of the reads in the sample are from the bacterial kingdom, and about 68-69% are from the phylum Firmicutes, and then it splits up further, so you can investigate the percentages step by step. Obviously, if a taxon has only one strain or species below it, the number for the family or the genus is identical to the numbers for the species and strain below it, which is for example the case here: the species and strain numbers are the same because all the strains found for this species are in fact one strain, so the numbers are identical. There are also other outputs from MetaPhlAn. For example, there is a collection which shows you the output per taxonomic level, for example only the classes, which in this case are only Clostridia and Coprothermobacteria, and likewise for the species and every other level as a separate file. There is the SAM output, which is the alignment of the reads against the marker database, and
there is also a BIOM file, which is comparable to the predicted-taxon file but in the BIOM format, which is often the input for other downstream tools; and there is the formatted version for Krona, which allows us to get a nice visual view. When analyzing this data for taxonomic abundance, one needs to be careful, because we are looking at metatranscriptomic data, not metagenomic data. In metagenomic data, each genome copy is assumed to contribute one copy of each marker gene, because the marker comes from the genome; but in transcriptomics we don't know how strongly each gene is transcribed into the transcriptomic dataset. So the reason we see many copies of a marker gene can be that many cells of the species producing it are present, or it can be that this species is transcribing that marker gene more strongly, which is a different thing altogether. One needs to be careful when interpreting RNA data here: the numbers cannot be 100% correlated with the abundance of the species, as they could also be due to differentially expressed genes. As a next step, some downstream tools only need the lineage and the abundance from the MetaPhlAn output, so we cut those two columns from our dataset using the Cut tool: we take the MetaPhlAn predicted taxon relative abundances as input and cut columns 1 and 3, run the tool, and we will use the output later. Now we want a nice visual view of our abundance table from MetaPhlAn, and that's why we also generated the MetaPhlAn output formatted for Krona. We can use it as input for the Krona pie chart: click on the tool, take the MetaPhlAn predicted taxon abundances for Krona as input, which you can find here, keep these options, run it, and have a look at the output. Once the output has been created, you can click on the Krona pie chart and open the HTML file, which gives you an interactive chart in which you can investigate the abundances in your sample. In this case we see that we have two bacteria, Acetivibrio
thermocellus and Coprothermobacter, and the abundance of each bacterium is made up of different strains. You can zoom in to have a closer look: zoomed out you only see the bacteria, and when you zoom in you see the strains as well, and you can also click on the segments to inspect them. This chart may not excite you so much here, but if you have other samples with a large amount of data, this visualization is very nice because it gives you a hierarchical view of your data. Another way to visualize this data is to use GraPhlAn, via 'Export to GraPhlAn', and that's where we need the trimmed table, basically the one where we cut out the two columns, because GraPhlAn can only work with that information. So please click on GraPhlAn and make sure you use the parameters given in the tutorial: as input file you need the cut file, not the complete file from the taxonomic output, and use these options; you can set the levels to annotate, and basically make sure that you follow the inputs here. If you run this, you can have a look at the GraPhlAn output, and you will see once again that there are two different bacterial strains in our data; so in this case GraPhlAn shows you the data at the strain level. This plot is also not so impressive yet, but it can be very nice if you have a lot of species and strains in your sample. Also, this is not an interactive view but a PNG file, which can for example be used for posters or for figures in your papers. With the taxonomic abundance we have now answered the question of who is present in our sample. The next important question about metagenomic data is what the microorganisms are actually doing there, or in other words, what functions are performed by those microorganisms; this can be asked for each microorganism present in the sample, but also for the community overall, because they can share pathways and the like. For this we use HUMAnN, which was originally developed for the analysis of
human microbiome data. So click on this tool, and when the tool opens, make sure that you use exactly the same parameters as shown in the tutorial. As input we use the interlaced non-rRNA reads, because we don't need the ribosomal RNA here: as I already explained, for the functional analysis we only want the genes that encode proteins. And we can bypass the taxonomic profiling step, because we can reuse the taxonomic profile from MetaPhlAn, so we use that file as input here. Please also make sure that all the other parameters are identical to the ones used in the tutorial, and then you can run the tool. HUMAnN is a very powerful tool, but it can take quite a long time to run, since it has to compare your sequences against a large database to find all the functional annotations. If you are running low on time, you can also just import the two result files into your history, as we did with the raw data at the beginning; that is exactly the output HUMAnN would generate, and then you don't have to wait for the tool to run. Once it has finished, or once you have imported the data, you can look at the gene families and their abundance, and you will see that we get all the gene families present in our data based on UniRef90, which is a clustering of known proteins at 90% similarity. What we get are the reads per kilobase for those protein families, for the complete sample but also for specific bacterial taxa. We also get the UNMAPPED reads, that is, the reads hypothetically attributed to unknown proteins. So this already gives us nice information about which proteins are present in our analyzed sample. If you want to learn more about the abundant proteins, you can for example use the 'Rename features of a HUMAnN generated table' tool, which allows you to replace the UniRef90 IDs with names. If you do that (unfortunately, I already tried it), then for the most abundant family
you cannot map a proper name; however, you can look up the UniRef90 entry and observe what the most abundant protein family is, and you see that this is an iron-sulfur ferredoxin domain, which is, I think, a very common domain across all domains of life. And you can see that the output here, as I already said, is both stratified and unstratified, meaning it shows the gene families for the overall sample as well as the gene families stratified by specific species. If you want to split this into two different files, you can use the 'Split a HUMAnN table' tool, which basically splits the file based on the lineage found in the rows. To use the split tool, you provide the gene families and their abundance as the stratified input table (the tool is made specifically for HUMAnN output), and once it has run you get two different datasets: an unstratified and a stratified table, the unstratified one for the whole sample and the stratified one per species. Once this tool is done, you can see that the unstratified data shows us the protein families over all the species in the sample, and the stratified table shows the same information broken down per strain. So you can for example see how many of the reads were attributed to unclassified species for a given protein family, which is usually a small amount, and also the reads for specific bacteria; in this case you can see the protein families attributed to the different species. Another output we get from HUMAnN is the pathways and their abundance. Instead of only the protein families, this is a kind of summary over protein families that act together in pathways: for each known pathway of protein families, HUMAnN can calculate whether this pathway is present. And that is a very nice functional analysis, because we want to know which pathways exist in our sample, and if
we can associate that with specific phenotypes, or with specific things that happen in our samples, then that gives us much more insight into why specific things happen in our sample or why specific attributes are correlated with our samples. So what we see here is the name of each pathway; once again this is stratified and unstratified, per species and for the overall sample, plus the UNMAPPED and UNINTEGRATED entries, and once again we get the reads per kilobase for this data. There is also another way to look at this: the pathways and their coverage, where the data is rather shown as a value between 0 and 1 indicating how completely each pathway is covered. Once again it should be mentioned here that this is metatranscriptomic data, not metagenomic data, so we cannot say with 100% certainty whether a difference in pathway or protein-family abundance is due to differential expression or to different taxonomic abundance; to resolve that, one would need both metagenomic and metatranscriptomic data. In this case we only have metatranscriptomic data, so for the sake of simplicity we analyze this data on its own, but to derive real abundances one would need to look at the metagenomic data as well. The values we have here are reads per kilobase, which is already nice because it is normalized by the length of our proteins; however, it is not normalized for the sequencing depth of the sample. So if you compare samples with each other and they have different depths, there could be vast differences between the abundances, and we need to normalize our data for that as well. For that we can use 'Renormalize a HUMAnN generated table'. There are different ways to normalize here, but the simplest one is relative abundance, that is, comparing the gene families with each other and normalizing each value by the community total; we also include the special features (UNMAPPED, UNINTEGRATED, UNGROUPED) here, and run this tool to get a normalized view of our gene families, and we
We can also run this renormalization step on the pathway abundances: you do exactly the same thing, running the tool with the same parameters, but this time on the pathways and their abundance, and then observe the normalized output. If you look at the output of this renormalization step, you can see that, for example for the pathways, instead of raw read counts we now have the relative abundance of our sample as a percentage, and you can for example see that the adenine and adenosine salvage pathway has 0.0% abundance in our sample.

After normalization of our gene families and pathways, we can also have a look at which gene families are involved in which pathways. At the moment we only know the abundance of the gene families and the abundance of the pathways, but we can combine that information to see the contribution of each gene family to each pathway. For this we use the Unpack pathway abundances tool, which shows the gene families for each pathway: we select the renormalized data for the gene families and for the pathways, and when we run this tool we get a combination of both data sets in one overview. Looking at this data, we can now see the contribution of each protein family to each pathway of one specific species. That is a very detailed overview of all the gene families of our species, including the pathways, which gives you comprehensive information on the functional coverage of the sample.

Although we now know our gene families and pathways, also per species, gene families can be quite difficult to interpret even with known names, and even more so with only the IDs. To make the data more transparent, and to group gene families together into groups that are jointly responsible for specific functions and roles in an organism, we can map them to Gene Ontology terms. The Gene Ontology can be understood as a knowledge base for genes which describes which genes combined are responsible for specific tasks, and that makes our data analysis much more straightforward and transparent. So we can use the Regroup a HUMAnN table tool on our gene families and their abundance and combine the grouped features using the sum, putting together gene families that belong to the same Gene Ontology term, using the mapping from UniRef to Gene Ontology. There are also other mappings available for further analysis which can be explored here. If you run this tool and click on the output of the regroup tool, you can now see the Gene Ontology terms instead of the gene families for specific strains.

This by itself doesn't tell you that much, so some modification is needed here as well: since the Gene Ontology IDs don't help much in interpreting our data, we can further replace those IDs with the names of the features. For this you use the Rename features of a HUMAnN generated table tool, with the regrouped data we generated before as input; you can use the advanced feature naming and map the Gene Ontology IDs to the corresponding names. The output of this is shown here: we now have the Gene Ontology IDs plus additional information about their function. For example, in this case we learn that this ontology ID refers to the phosphopyruvate hydratase complex, which is already comprehensible information; if, for example, this ontology term were enriched in specific samples, one could deduce interesting functional reasons behind that, which allows you to get a detailed understanding of the samples you investigate.

We can also see that the ontology terms are further grouped into different top-level ontology classes: MF for molecular function, BP for biological process, and CC for cellular component. One might now want to split this big data set up into different sub-data-sets where those overall functions are grouped together. For that we can use a tool which is, once again, also shown in our tutorial.
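The regrouping step from gene families to GO terms is essentially a sum over a mapping. A minimal sketch, assuming a toy UniRef90-to-GO mapping (the UniRef family IDs here are invented; GO:0000015 is the real "phosphopyruvate hydratase complex" term mentioned above, and GO:0003824 is "catalytic activity"):

```python
# Sketch of "Regroup a HUMAnN table": collapse gene-family abundances
# into Gene Ontology groups by summing, via a UniRef90 -> GO mapping.
mapping = {
    "UniRef90_FAM1": ["GO:0000015"],                # hypothetical family
    "UniRef90_FAM2": ["GO:0000015", "GO:0003824"],  # hypothetical family
}
abundance = {"UniRef90_FAM1": 6.0, "UniRef90_FAM2": 3.0}

grouped: dict[str, float] = {}
for family, go_terms in mapping.items():
    for go in go_terms:
        grouped[go] = grouped.get(go, 0.0) + abundance[family]

print(grouped)  # GO:0000015 collects both families: 6.0 + 3.0 = 9.0
```

Summing (rather than averaging) is the grouping function chosen in the tutorial, so a GO term's abundance is the total abundance of all families mapped to it.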
In Galaxy there is, for example, Select lines that match an expression, where we can match the CC, MF, and BP tags on the ontology terms. You can use this tool, making sure you run it as described in the tutorial, matching a specific pattern with a regex on our renamed file. If you run it, you will see in the output, which is already created in my pre-processed history, that only the ontology terms matching the specific tag are kept.

The last very nice analysis we can do is to combine the taxonomic information with the functional information, because we know from HUMAnN the relative abundance of a pathway attributed to a specific species, and we know from the MetaPhlAn output the abundance of that species, but we don't yet have the pathway abundance and the species abundance together, because the outputs are not combined. We can do this: there is a tool called Combine MetaPhlAn and HUMAnN outputs. However, when I adapted this tutorial, I realized there is currently a big problem using this tool: we updated HUMAnN and MetaPhlAn to use the newest databases and only later realized that some of the taxonomic information in those databases is not synchronized yet. The main problem is, for example, that we know from HUMAnN that most gene families are associated with one species, but in our taxonomic table we can only find that species under a different name: when we look it up at UniProt, the current scientific name of Hungateiclostridium thermocellum is actually Acetivibrio thermocellus, and you will have to excuse me, I am not a very good Latin speaker. Those species have to be renamed so that the HUMAnN and MetaPhlAn outputs match each other. Fortunately this can be done very easily with the text-processing features of Galaxy. So, to make this tutorial work at the moment, we need to replace some text in the normalized gene-family files; in later updates of this tutorial, I am sure we will have synchronized databases for HUMAnN and MetaPhlAn, and then this will not be necessary anymore.

For the sake of this tutorial, please use the Replace parts of text tool and make sure you use the same arguments that we use in the tutorial. The little hack we need to do here is to take the renormalized data from our HUMAnN output and replace the pattern describing the old species name with the new, updated species name which is now used in the MetaPhlAn output. So please just copy and paste those text parts and replace them, and then we can continue with combining the HUMAnN and MetaPhlAn outputs. When you have replaced the text, you should now see the new species name, Acetivibrio, in the HUMAnN output instead of the one used before.

When this is done, we can combine the HUMAnN and MetaPhlAn outputs. To combine the two analyses, we take the output from MetaPhlAn, where we can use the cut data, and the output from HUMAnN, which is the one we just created with the replace step; then we select gene families as the type of characterization and run this tool as well. If you run this analysis and observe the output, you will see that for each genus and each species there is an abundance for the given protein families, and then also the gene-family abundance: a comprehensive table including the taxonomic abundance as well as the gene abundance, which gives you a great overview of the functional and taxonomic information in your sample. You could analyze this data even further by using Group data by a column, where you group on the genus or the species information to get a better idea of how many gene families are associated with a genus and how many with a species; for that you can click on Group data by a column and follow the tutorial as well. I think it is a great time to stop the tutorial now, since we have the combined output of taxonomic and functional information.
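Conceptually, combining the MetaPhlAn and HUMAnN outputs is a join on the species name, which is also why the renaming hack is needed: a species whose name differs between the two tables silently drops out of the join. A minimal sketch with invented species names and numbers (the real Galaxy tool handles the file formats for you):

```python
# Sketch of "Combine MetaPhlAn and HUMAnN outputs": join per-species
# gene-family abundances (HUMAnN) with species abundances (MetaPhlAn)
# on the species name. All names and numbers are invented example data.
metaphlan = {  # species -> relative abundance (%)
    "Acetivibrio_thermocellus": 62.1,
    "Coprothermobacter_proteolyticus": 18.4,
}
humann = [  # (gene family, species, gene-family abundance)
    ("UniRef90_FAM1", "Acetivibrio_thermocellus", 9.0),
    ("UniRef90_FAM2", "Coprothermobacter_proteolyticus", 3.5),
    ("UniRef90_FAM3", "Hungateiclostridium_thermocellum", 1.0),  # old name!
]

combined = [
    (family, species, metaphlan[species], fam_abundance)
    for family, species, fam_abundance in humann
    if species in metaphlan  # mismatched names are silently dropped
]

for row in combined:
    print(row)
```

The third HUMAnN row disappears from the result because its species name does not match the MetaPhlAn table, which is exactly the situation the Replace parts of text step fixes.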
To recap a little bit: we went through the tutorial using metatranscriptomic data, used pre-processing to remove bad sequences and also to remove ribosomal RNA sequences from our data, and analyzed the data using taxonomic and functional tools, specifically MetaPhlAn 2 and HUMAnN 2. Saying MetaPhlAn 2 is already wrong by now; the version numbers are not part of the tool names anymore, so the newest versions on Galaxy are always just MetaPhlAn and HUMAnN, and I think by now we are at MetaPhlAn 4 and HUMAnN 3. We also saw different ways to further process the MetaPhlAn and HUMAnN outputs: to get relative abundances through normalization, to map the data to Gene Ontology terms, to unpack pathway abundances into gene families, and also to transform the taxonomic quantification into taxonomic abundance and visualize the taxonomic information using Krona and GraPhlAn. And we saw how this data can be combined to get the taxonomic quantification as well as the functional quantification, which gives you great information about your data set.

In further tutorials we are working on, we will also see how this information can be used, for example, for comparative metagenomics, where we compare samples with each other; there it would be very beneficial to see how pathways and gene families differ between samples under different conditions. Another thing that could be done from here is to use the pre-processed, sorted, or filtered reads to do an assembly, create contigs, and use those contigs for further functional and other analyses, but that is part of a different tutorial which we might add in the future.

From my side, I would say thank you very much for going through this tutorial. Please feel free to ask in the given chat channels about any information which might be missing or which you would like to have clarified. Thank you very much for listening; I hope most of it worked out quite nicely and that you now have a better understanding of how to get taxonomic and functional analyses of metagenomics samples. Thank you very much, and enjoy the other trainings as well!