Hello everybody, I'm Bérénice Batut from the University of Freiburg, and today we will talk about taxonomic profiling and visualization of metagenomic data. The idea of the tutorial we will go through today is to answer a few questions: which species, or which taxa at other taxonomic levels, are present in a metagenomic sample? What are the different approaches and tools we can use to profile the community in metagenomic samples? And how can we visualize and compare community profiles from different samples?

For this topic we will follow a tutorial from the Galaxy Training Network called "Taxonomic Profiling and Visualization of Metagenomic Data". You can find it on training.galaxyproject.org, in the Metagenomics section, but you can also open it directly from Galaxy. If you go to your favorite Galaxy instance (for this tutorial I will use Galaxy Europe) and click on the hat icon at the top, "See Galaxy Training Materials", you will be redirected to the Galaxy Training website. If you scroll down to Metagenomics and then go down to "Taxonomic Profiling and Visualization of Metagenomic Data", you get to the tutorial we will follow today.

The idea is that at the end of the tutorial you are able to explain what taxonomic assignment, or taxonomic profiling, is and how it works; apply Kraken and MetaPhlAn to do taxonomic assignment; apply Krona and Pavian to visualize the results of Kraken and MetaPhlAn; and identify the taxonomic classification tools that best fit your data.
So first let's go back to the idea. We talk here about metagenomics, and when we do metagenomics we mostly try to characterize the microbiome. If you take the definition from Whipps et al. in 1988, a microbiome is a characteristic microbial community occupying a reasonably well-defined habitat which has distinct physico-chemical properties. That doesn't refer only to the microorganisms involved but also to all their activities. Microbiome data can be gathered from different environments, for example soil or water, but also the human gut, as you may have heard, or different parts of your body. And there are different biological interests that rely on the question of how a microbiome present at a specific site can influence its environment, for example how your gut microbiome can influence your health.

When we want to study a microbiome, we need to use indirect methods like metagenomics and metatranscriptomics to get an idea of which microorganisms are there, what they could do in this environment and how they can interact with each other. What does metagenomics mean? It's when we extract the DNA from all the organisms at the specific site where samples were collected, and then we sequence this DNA to find out which organisms coexist in that niche at that exact time. We can also identify which genes are present in these different organisms. With metatranscriptomics we can also include the expression of these genes, so which genes are differentially expressed at a certain site compared to another one. Today in this tutorial we will focus on metagenomic data, but the idea is similar for metatranscriptomic data when profiling the organisms that are there. So when we investigate which microorganisms are present at a
specific site and their relative abundance, so whether certain organisms are more present than others, what we try to do is a microbiome community profile, a profile of the community of microorganisms there. The main objective is to identify the microorganisms that are present within a given sample, and then maybe to compare between different samples. Because we usually cannot identify each individual organism, what we do is identify the different taxa to which the different reads, the different pieces of DNA that we sequenced with metagenomics, belong.

So what is taxonomy? Taxonomy is a method used for naming, defining and classifying groups of biological organisms based on shared characteristics, for example phylogenetic or morphological characteristics. It is founded on the idea that similarities come from common evolutionary ancestors. Within this idea, defined groups of organisms are known as taxa. Taxa are given a taxonomic rank and are aggregated into different groups that create a taxonomic hierarchy. There are eight levels of hierarchy, going from the domain at the very top to kingdom, phylum, class, order, family, genus and species, and we can even go below that with strains and more specific levels.

For example, if we take the taxonomy of the cat, we go from the kingdom, Animalia, to the phylum Chordata, class Mammalia, order Carnivora; the family is Felidae, the genus is Felis and the species is Felis catus. And we see that, for example, the cat and the panther here belong to the same genus, no, not the panther, I don't remember which one it is, sorry. But you see that the panther is here and it belongs to the same family as the cat. If we take the dog and the wolf, they belong to the same genus, so both are different species but they belong to
the same genus. The bear belongs to a different family, and all of them belong to the same order, Carnivora.

The taxonomic classification begins with three domains that comprise all the living and extinct forms of life: the Bacteria and the Archaea, which are mostly microscopic, single-celled organisms, and then the Eukaryota (the domain Eukarya, or Eukaryota), which contains more complex organisms and to which, for example, humans belong. When a new species is found, it is assigned to a taxon in the taxonomic hierarchy. For example, if we identify a new cat species from the genus Felis, we add it as a species there and expand the different levels based on where it sits in the taxonomy.

From this classification we can generate what is called the tree of life, also known as the phylogenetic tree, which is a rooted tree that describes the relationships between all life on Earth. What is at the root is what we call the last universal common ancestor, and then we have the three main branches, Bacteria, Archaea and Eukaryota. If you are interested, you can go deeper into the tree of life and go through the different levels. I really recommend you play a bit with the tree of life from Lifemap to get more in depth.

So when we talk about metagenomic data, what we start with is sequences from DNA fragments, really small pieces of DNA, that are isolated from a sample of interest. Ideally these sequences come from all the microbes present in your sample. The idea is to compare the DNA sequences found in the sample to a reference database of sequences for which we know the taxon, which helps us to say: okay, this DNA belongs to this taxon. Then we can derive a list of all the microbes present in the sample. And when we're talking
about taxonomic assignment, or taxonomic classification, there are two main approaches. The first is called taxonomic binning, where we first bin, so cluster, the reads together based on similarity and afterwards assign a taxon to each cluster. Taxonomic profiling does the inverse: it compares the reads to the database first and afterwards aggregates the information per taxon to extract the relative abundance of the different taxa. Today we will talk only about taxonomic profiling. For taxonomic binning there is another tutorial that you will be able to follow to learn more.

To get the reads, we can use different approaches for metagenomics. We can use amplicon sequencing, or metataxonomics, like 16S or 18S, where only a specific part of the DNA is sequenced. When we do shotgun metagenomics, we have all the DNA of everything mixed together. Today we will use shotgun metagenomic sequencing.

When we do taxonomic profiling, as we do today, and not binning, there are different approaches to compare the DNA to the database. We can do a DNA-to-DNA comparison, so we compare our DNA to a reference database of DNA. This is used by tools like Kraken, which we will run shortly: we take all the reads and the full content of the reads and compare them directly to the database. We can also do a DNA-to-protein comparison, where we take the DNA and compare it to a database of proteins; that is used, for example, in tools like DIAMOND. Another approach is to target specific marker genes in the reads, for example the 16S or other marker genes. So we take all the DNA, but within this DNA we target only a specific portion, only some genes, which is usually faster but that can be
less informative, since we reduce the information a bit. Tools like MetaPhlAn do that. So those are the three types of database that we can use.

Then, when we want to compare the reads to the database, whichever database we use, there are different ways of doing that. We can use a genome-based approach, where we align the reads to the full genomes in the reference database. We can use a gene-based approach, which is similar to the marker-gene idea, so we target only genes of the reference genomes. Or we can use the k-mer-based approach, where we cut our reads into short strings, called k-mers, and compare the k-mer profiles to the reference database. That is the approach used by Kraken. You can read more about these different approaches here. Today we will go through the k-mer-based approach, DNA to DNA, and also use a marker-gene-based approach with MetaPhlAn.

For that we will use data that come from an oasis in the Mexican desert; you can find the publication here. The researchers were interested in the genomic traits that affect the rate and cost of biochemical information processing within the organisms there. They performed a whole-ecosystem experiment: they sampled the ecosystem before, then they fertilized the pond to achieve nutrient-enriched conditions, and then they sequenced it. So we have two datasets: one control, called JC1A, from before fertilization, and JP4D, from after fertilization of the pond. We will use these two datasets just as an example; everything you do can be extended to many more samples if you have more. These two datasets differ a bit in size,
but it doesn't really matter for the identification of the genomic traits, because some normalization is done.

So now what we want to do is get the data into Galaxy so that we can run things. Let's go to Galaxy. I am in Galaxy now, and the first thing I need to do is create a new history and rename it. I will name it "Taxonomic profiling and visualization of metagenomics data". You can name it another way; it's just an easy way to remember what is in my history.

Now I need to import the data. I will get my data from Zenodo. What you can do is click on "Copy" here in the tutorial, which copies the links to the files. Then you go back to Galaxy, click on "Upload Data", click on "Paste/Fetch data", paste your links there and click on "Start". So again: I go to the small hat icon, I click on "Copy", I click on "Upload Data", I click on "Paste/Fetch data", I paste my links and, once I'm ready, I click on "Start". When it's green you can click on "Close", and you should have four datasets appearing in your history.

Another thing we want to do is organize these datasets into what is called a paired collection. We see that for each of the samples we have two datasets, _R1 and _R2: _R1 is the forward reads and _R2 is the reverse reads. So we have paired-end data for each of the two samples, and we want to organize the data in a way that records that these two belong together and those two belong together. That is what a collection gives us. You can learn more about collections in the GTN; there are tutorials on collections.
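The pairing logic behind such a paired collection can be sketched in a few lines of Python. This is only an illustration, not what Galaxy runs internally: the function name `pair_reads` and the example file names are hypothetical, and the sketch assumes every file is named `<sample>_R1` or `<sample>_R2` followed by `.fastq` or `.fastq.gz`.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <sample>_R1.fastq(.gz) is forward, <sample>_R2.fastq(.gz) is reverse.
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq(?:\.gz)?$")


def pair_reads(filenames):
    """Group file names into {sample: {"forward": ..., "reverse": ...}}."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed naming scheme
        role = "forward" if match.group("mate") == "1" else "reverse"
        pairs[match.group("sample")][role] = name
    return dict(pairs)


# Hypothetical file names for the two samples used in this tutorial.
example = pair_reads([
    "JC1A_R1.fastq.gz", "JC1A_R2.fastq.gz",
    "JP4D_R1.fastq.gz", "JP4D_R2.fastq.gz",
])
for sample, mates in example.items():
    print(sample, mates["forward"], mates["reverse"])
```

Each sample ends up with exactly one forward and one reverse entry, which is the same structure the collection shows in the history.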
To create a collection, click on this small check icon here, "Select all", and when all four are selected, click on "Build List of Dataset Pairs". Then you need to say that you want _R1 and _R2. This one and this one belong together, so I click on "Pair these datasets", and the same for the other two. Then I want to remove this ".fastqsanger" from the names, so I click here on the name and edit it. And then I give a name, "Input reads"; that is the name of my collection in my history. So again, what I did: I clicked on the selection for all four items, and then I got boxes that show how to pair the datasets to create a collection. In your history you should now have a collection called "Input reads" with two pairs, JC1A and JP4D, and inside each you should have two items called forward and reverse. The data is downloaded from Zenodo, so it can take a few minutes; you can wait a bit.

Now we have the inputs. The next thing we want to do is use Kraken2, a k-mer-based taxonomic classification tool, to identify the microorganisms, or the taxa, in our reads. What we will do is compare the reads to a reference database, that is, a database of sequences for which we know the taxon. Kraken uses a k-mer approach for taxonomic classification: it uses a database containing DNA sequences of genomes for which we know the taxon, and in the database the sequences of these genomes are broken into short pieces of length k, called k-mers; k is usually 30 base pairs. And what does
Kraken do? It takes our input reads, the reads that we give it, also cuts them into short pieces of length k, and compares these k-mers from our input reads to the database. It searches for these short k-mers in the database, looks where they are placed within the taxonomy tree inside the database, and makes a classification with the most probable position. Each k-mer is mapped to the lowest common ancestor of all the genomes known to contain it.

So you have your query and the database here with all its k-mers. For example, this k-mer in orange is found in this individual here at the bottom and in that one, in both cases, so we know it's not specific to this one or that one; it's specific to this taxonomic level. The blue k-mers can be found in all the leaves below this blue node, so the lowest common ancestor is this blue node and these k-mers are assigned to it. And the same for everything else. If this k-mer in purple is found here, here and here, in these organisms, then the lowest common ancestor whose leaves contain this k-mer is this one, so we say this k-mer belongs to this taxon in the taxonomic tree. That is how the classification is done.

So that is the original Kraken. Kraken 2 is an extension of that, with a different data structure that makes it faster and gives it lower memory requirements. You can see the details in the paper here.

There are different databases we can use for Kraken, and we can also build our own. For this tutorial we will use a pre-built database called PlusPF, which contains the standard database, so archaea,
bacteria, viral, plasmid, human and UniVec_Core, plus protozoa and fungi. You can see a bit more about where it comes from: for example, the archaea come from RefSeq complete archaeal genomes and proteins, and the same for everything else, as you can see here. The database has been prepared and pre-built by Ben Langmead, and you can find the details on the page linked there.

So now we will run Kraken. You can click on the tool here directly in the tutorial and it will load Kraken with the correct version. We have a paired collection, and it directly finds the input reads. We will change the confidence threshold, that is, how confident we want to be in what has been identified. We set it a bit higher than the default confidence threshold to minimize false positives; Kraken is known to produce some false positives, so we increase the confidence a bit. Then we want to create a report: "Print a report with aggregate counts/clade to file". And then we need to select the database, PlusPF, the most recent one; we don't want PlusPFP, we want PlusPF. Then we can launch the tool.

So again: I click on Kraken, I select paired collection and "Input reads", the confidence is 0.1, under "Create Report" I enable "Print a report with aggregate counts/clade to file", and I want the PlusPF database that was downloaded in 2022.

If you are more interested in Kraken, you can read the papers, but there is also one thing I want to recommend: a podcast called Micro Binfie. They did two episodes with the Kraken developers, and it was quite interesting for understanding how Kraken works. I can really recommend you have a look there.
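The k-mer and lowest-common-ancestor idea that Kraken relies on can be sketched with a toy example. This is a simplified illustration in Python, not Kraken's actual implementation: the taxonomy, the genome strings, the value k = 4 and all function names are made up, and ties between candidate taxa are resolved naively.

```python
from collections import Counter

# Hypothetical toy taxonomy: each taxon points to its parent; the root has none.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
    "E. albertii": "Escherichia",
}


def lineage(taxon):
    """Return the path from a taxon up to the root (taxon first)."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors_of_a = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors_of_a:
            return taxon


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_db(genomes, k):
    """Map every k-mer to the LCA of all genomes that contain it."""
    db = {}
    for taxon, seq in genomes.items():
        for kmer in kmers(seq, k):
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db


def classify(read, db, k):
    """Pick the taxon whose root-to-taxon path collects the most k-mer hits."""
    hits = Counter(db[kmer] for kmer in kmers(read, k) if kmer in db)
    if not hits:
        return None  # no k-mer found in the database: unclassified
    return max(hits, key=lambda t: sum(hits[a] for a in lineage(t)))


# Made-up "genomes" that share only the k-mer ACGT.
GENOMES = {"E. coli": "ACGTACGGA", "E. albertii": "ACGTTTTGA"}
DB = build_db(GENOMES, k=4)
print(DB["ACGT"])                   # shared k-mer, stored at the common ancestor
print(classify("ACGTACG", DB, 4))   # mostly species-specific k-mers
```

A k-mer shared by both toy genomes is stored at their common genus, while a read made mostly of species-specific k-mers is pushed down to the species.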
Kraken will take some time to run, so I take a short break and come back in a second.

So we now have two collections that have been generated by Kraken, one called classification and one called report. The classification one is the standard output of Kraken. If we take JC1A, it's a long file with more than 130,000 lines, and you have five columns. The first column tells you whether the read is classified or not: U is unclassified and C is classified. The second column is the ID of your sequence, which corresponds to your FASTQ or FASTA file; here it's a FASTQ file. The third column tells you to which taxonomic ID it has been assigned, if it's classified; so only when you have a C here do you have information there. It's an NCBI taxonomy ID, so you can search for this taxonomic ID directly on NCBI, and for this one, for example, you see it's Homo sapiens. So this sequence has been assigned to Homo sapiens. The fourth column, if I'm correct, is the length of the sequence in base pairs, for the forward and for the reverse read, as you see here. And then the fifth is a space-delimited list indicating the lowest common ancestor mapping of each k-mer in the sequence. For example, if you have this list here, "562:13 561:4 A:31 0:1 562:3", it says that the first 13 k-mers mapped to taxonomic ID 562, the next 4 k-mers mapped to taxonomic ID 561, the next 31 k-mers contained an ambiguous nucleotide, the next k-mer was not in the database, and the last 3 k-mers mapped to taxonomic ID 562.
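To make the five columns concrete, here is a minimal Python sketch that splits one line of the classification output into its parts. The helper name `parse_kraken_line`, the read ID and the length value are hypothetical; the mapping list is the example discussed above, and the `|:|` token is assumed to be the separator Kraken 2 writes between the two mates of a paired read.

```python
def parse_kraken_line(line):
    """Split one tab-separated Kraken classification line into its five columns."""
    status, seq_id, taxid, length, lca_map = line.rstrip("\n").split("\t")
    kmer_hits = []
    for token in lca_map.split():
        if token == "|:|":
            continue  # assumed separator between forward and reverse mate
        taxon, count = token.split(":")
        # taxon is an NCBI taxonomy ID, "0" (k-mer not in database) or "A" (ambiguous nucleotide)
        kmer_hits.append((taxon, int(count)))
    return {
        "classified": status == "C",
        "sequence_id": seq_id,
        "taxid": taxid,
        "length": length,  # kept as text: paired reads show two lengths, e.g. "150|150"
        "kmer_hits": kmer_hits,
    }


# Hypothetical line: read ID and length are invented for illustration.
EXAMPLE_LINE = "C\tread_1\t562\t86\t562:13 561:4 A:31 0:1 562:3"
record = parse_kraken_line(EXAMPLE_LINE)
print(record["classified"], record["taxid"], record["kmer_hits"])
```

Summing the counts in `kmer_hits` gives the total number of k-mers examined for that read.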
So in our case here, 148 k-mers were assigned to this taxonomic ID, 19 were not mapped at all, then we have something ambiguous or nothing, 19 again to this one, 19 to nothing, and 148 again to human. That's how you can interpret the classification. So the question is: has this one been classified or not? The answer is yes, it has been, and you can see it here in the first column as well: it is classified, with this taxonomic ID, which we saw is Homo sapiens when we looked it up on NCBI.

For me the most interesting output from Kraken is the report. If we open that one (you can click on the icon here to see it), it's again a tabular file but with far fewer lines: here it's 536 lines. It regroups the classification by taxon, so it groups the information that was in the classification file and reports one line per identified taxon, except the first line, which tells you how many reads have been unclassified. In this case, for this sample, around 76 or 77 percent of the reads are unclassified. Then it tells you that 23 percent are assigned at the root, and then that in the domain Bacteria, 12-point-something percent of the reads have been assigned to Bacteria.

So the first column gives you a percentage. The second and third columns, which I always forget, are the number of fragments covered by the clade rooted at this taxon and the number of fragments assigned directly to this taxon. The difference is that this number counts the ones directly at the root, and this one counts the root plus everything below this level. Then you have the rank in column four: domain, phylum, class, order, family,
genus, species, etc. And if a taxon is not at any of these ranks, it gets the code of the closest rank above with a number, for example G2, because it sits between genus and species. So it's a bit more complex there. Then you have the NCBI taxonomy ID, and then the scientific name for that taxonomy ID.

So what is the percentage of classified and unclassified reads in both cases? If we want to see them side by side we can use the window manager here. I click here and it opens the report for this one, and if I click on the other one it opens that too. Ah sorry, I have a duplicate. Then it will hopefully load the report for both cases. Yes, it's loading. For JC1A we have 77 percent of the reads unclassified; for JP4D almost 90 percent of the reads are unclassified.

And what are the kingdoms found? For kingdom we need to search for K, or for domain, D. So you have Bacteria here. If you scroll down you should also find some Eukaryota at some point. Here we are still in Bacteria, Bacteria, Bacteria. We have some Viruses here, in really low numbers. And where is the human? Here we see Homo sapiens, so this is Eukaryota here: we have nine percent of Eukaryota and almost nothing of Viruses.

Another thing we could do, oops, no, that was not what I wanted. If you want to see only specific information, for example only the domain rows, you can filter. So I want to do
a filter on column four, so c4 == 'D': I want only the rows where column four is a D. Then I can extract this information for both cases. So we see that we have some Eukaryota there. It can be human contamination from the sequencing, but it can also come from the site.

Now we need to search for Proteobacteria. How can we see that? If you search for Proteobacteria, you find it, and you see we have a lot of things below it, quite a lot before we get to the next P, the next phylum. So there seems to be a lot of diversity there. In both cases we still have a lot of Proteobacteria, it seems. But we need a better overview; we need more visualization, because it's really not straightforward to read the output directly.

Let me check if the filter worked. Ah, it's running. And you see here: in JC1A we have Bacteria, Eukaryota and Viruses. In JP4D we have Eukaryota too, and also a bit of Archaea and Viruses, but the percentage is zero, so we cannot really count Archaea and Viruses; only 105 reads have been assigned to them, so I will not even count them. I will keep only Bacteria and Eukaryota for this one, and for the other one, okay, we can put Viruses too, but at a really low abundance.

Once you have that, to get a better abundance estimation you can use a tool called Bracken. It uses a probabilistic approach to generate the final abundance profile. I think it's a really good thing, especially when you want an abundance profile at a certain level, for example species: it redistributes the reads in the taxonomic tree.
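The filter on column four, together with the report columns described earlier, can be mimicked outside Galaxy with a short Python sketch. The report rows below are invented for illustration (only the six-column layout follows the Kraken report), and `domain_rows` is a hypothetical helper name.

```python
# Invented example rows: percent, reads in clade, reads assigned directly,
# rank code, NCBI taxonomy ID, indented scientific name (tab-separated).
EXAMPLE_REPORT = "\n".join([
    " 77.00\t1540\t1540\tU\t0\tunclassified",
    " 23.00\t460\t10\tR\t1\troot",
    " 12.50\t250\t5\tD\t2\t    Bacteria",
    "  9.00\t180\t20\tP\t1224\t      Proteobacteria",
    "  9.00\t180\t0\tD\t2759\t    Eukaryota",
])


def domain_rows(report_text, rank_code="D"):
    """Keep only the report rows whose fourth column equals the given rank code."""
    rows = []
    for line in report_text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        percent, clade_reads, direct_reads, rank, taxid, name = line.split("\t")
        if rank == rank_code:
            rows.append({
                "name": name.strip(),          # the name column is indented by depth
                "taxid": taxid,
                "percent": float(percent),
                "clade_reads": int(clade_reads),
            })
    return rows


for row in domain_rows(EXAMPLE_REPORT):
    print(row["name"], row["percent"], row["clade_reads"])
```

Passing another rank code, for example `"P"`, would list the phyla instead of the domains.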
And the results are more reliable. The only thing is that currently the tool has a bug, so I will not run it, because it would just give empty outputs. For now I will skip that part, but I recommend you have a look later; we hope to fix it in the next days.

As you saw, it's not really easy to visualize the data; it can be tricky to identify a certain taxonomic level and see the abundance at that level. For that we can use tools like Krona, Phinch or Pavian. For now I will show you Krona. Krona creates an interactive HTML file that allows you to see, for one sample, the hierarchy of the data and zoom into the different levels. It's quite nice; I really like this tool.

But the Kraken outputs cannot be used directly by Krona, so they need to be converted first. For that we use a suite of tools called KrakenTools, and we run the tool called "Convert Kraken report files to Krona". You click on the tool, select the collection, and be careful to select the report and not the classification. Then you can run the tool and it will create the outputs, again as a collection, of reformatted reports. If you open one of the files, you will see that the output is formatted a bit differently from the report. The first column is a number, the number of reads that have been assigned to this taxon. Then you have different columns that correspond to the different taxonomic levels: the kingdom, then the phylum, the class, the order, the family, the genus and the species. So you have eight columns. For each row you have a number that corresponds to the number of reads assigned to that row, and you see it's the same for
both cases.

Once you have that, you can run Krona. Krona takes these outputs; you need to say it's a tabular file, then select the KrakenTools output and run the tool, and it creates one HTML file for that.

Krona is finished. I now have one HTML file, and if I open it I get something that looks like this: an interactive HTML report. What I did is just hide the panels on the side, at the bottom left and bottom right. Both samples are available here, and you can click on the one you want. It's interactive: you can click, and you see the share of unclassified reads, 78 percent for the first sample, and here you have the Eukaryota. If you click on Bacteria, it says here at the top that Bacteria are 13 percent of the root, and from there you can go deeper into the different phyla and even down to the species. If I click on Proteobacteria, it tells me that it's 70 percent of the Bacteria and 9 percent of the root, and then I can go even deeper into the different levels, down to this genus and so on.

So I had a question: what is the percentage of classified and unclassified reads for JC1A and JP4D? I think we already found that. If I want to go back to the root, I click here. So I have 78 percent unclassified here, and 90 percent for the other sample. The next question was: what are the different kingdoms found? Here you have Bacteria, Eukaryota and Viruses, but really low numbers of Viruses, and for the first sample it's also mostly Bacteria, Eukaryota and Viruses. The next question is: where might the eukaryotic DNA come from? If we click on Eukaryota, for this sample it comes only from Homo sapiens, and for the other sample it comes mostly from human as well, but also from other things, maybe fungi, it looks like. So mostly from human, and that is probably human contamination.

And how is the diversity of Proteobacteria in both cases? I go back to the root, click on Bacteria, then Proteobacteria. Do we have a similar diversity in both cases? I think it's interesting to see. We have a lot of other Proteobacteria in this one, a bit less here. We have a lot of Alphaproteobacteria in this one, and a bit less in the other, where we have more diversity of other things. So it's a bit different, which is quite interesting. JP4D seems to be more dominated by Alphaproteobacteria.

Another approach we can use to visualize the data, and maybe compare a bit better, is Pavian. Pavian is an interactive tool for metagenomic data. It was mostly developed for clinical metagenomics problems, but it can be used for visualizing any type of metagenomic data. To use it, you click on Pavian, then, let me show my history again, I click on collections and use the report; again be careful to use the report and not something else, so the last one here. Then it says it is waiting for the interactive tool to become available. It's gray here, and when the interactive tool is available it becomes orange. Orange means it's running, but it also means it's available and you can interact with it. Don't wait until it's green: it will never become green, or only after one day, when the tool is automatically killed. For now you need to wait for it to become active. Either you wait until the link is available there, when it's orange, or your active interactive tools are also listed here. Oh, what happened? So it was killed
somehow. Interesting. So it should appear here in your history, and it should be open. You should have the link to click here, or you can also click on this icon here to see the running interactive tools, and if you click here on open interactive tool it will open a new tab with the interactive tool itself. That will take a bit of time to load, and once you are there you should have something like this. You can upload files, or you can also use the files that are put there directly from the server, so from Galaxy. So you click on Use data on server here, then you click this folder and click on Read selected directories. It will load all the files that are there, so it's good to have a collection, because it puts everything there correctly. Then it creates a table to ask whether you are okay with the sample set. You have two files, the names are JC1A and JP4D, and you have the path here. It's mostly things that need to be set up for this Shiny app, this Pavian app. So you can click on Save table, and once you have saved the table you can have a look at the results overview here. The results overview first gives you a table, here, where you see the name of the sample, the number of raw reads and some extra information. I will just go back to the tutorial to check what the next step is. So we did that. Then the question is: do both samples have the same size? Nope, we see that JC1A definitely has far fewer reads than JP4D. What are the percentages of classified reads for JC1A and JP4D? For classified we see 20 percent here and 10 percent here, so we have a lower percentage of classified reads in JP4D than in JC1A. And what about the percentages of bacterial reads, are they similar? We have a lower bacterial percentage here in JP4D than in JC1A. So, yeah, we have a similar order of magnitude, maybe just a bit lower. And another thing we can do
is now to inspect by sample. If you click here on the left on Sample, you have what is called a Sankey plot. So the question is: what is a Sankey plot, or Sankey diagram? A Sankey diagram is a visualization used to depict a flow from one set of values to another, and in this case the sets of values are the levels of the taxonomy hierarchy. So you go from the domain to the kingdom, to the phylum, to the families, genus, strains, etc. Here the domain and kingdom are really similar; if you look here you can see just, okay, Eukaryota, then fungi maybe, and so you can see the things here. And if you click on Proteobacteria here, so I clicked on Proteobacteria, you can compare the number of reads across all samples, so here across the two samples. You can see how much you have for JC1A and how much you have for JP4D. And do you have the same numbers? It's expected that you have more for JP4D because you have many more reads. Here it's really a number of reads, not a percentage, so it's not really comparable because you don't have any sort of normalization there. Now, if we would like to compare the samples, what you can do is go to Comparison here, and you can select things. Here you have both samples and several pieces of information. What we want to do, instead of comparing the raw numbers, is to compare percentages, so we can unclick Reads and click just Percentage here. You can click on Domain here to be able to compare the domains. So here we see the percentages and we can compare them: we see that there is a higher percentage of reads assigned to Bacteria in JP4D than in JC1A, for example. And one thing we could also do is filter species. I'm just a bit lost on how to select Homo sapiens in the filter. So here
I have this button, Filter taxa, and if I want I can filter out the host, here specifically Homo sapiens. It will remove all the reads that are assigned to Homo sapiens to make it easier to identify the percentages. So it's normalized: it removes Homo sapiens from the report and says that, if we remove Homo sapiens, almost all the reads in both cases are assigned to Bacteria. So then the proportions are really similar between both samples. And if we go to the classes, then we can even compare the bacteria: do we have a high diversity? We clearly see that in JP4D the vast majority, almost everything, is in Alphaproteobacteria, and not in JC1A. So we have much more diversity in terms of classes in JC1A than in JP4D. And what could that mean, if we say that JC1A is the control and JP4D is the sample from the fertilized pond? It seems that Alphaproteobacteria are more resistant to the fertilizer, so they have a selective advantage in the new environment, when there is fertilization, compared to the other ones. So the fertilization seems to have killed a bit of the diversity in terms of classes, and according to the authors of the paper this correlates with specific genomic traits that enable them to cope better with high nutrient availability. So that is some of what you can do. We didn't put in the BAM file, so you cannot have an alignment view, but you could also add that to the data if you are interested. You can really go in depth at the different levels, you can look at the rank, you can do a lot of things there, like which one is the highest in terms of rank. Yeah, that is information you can get here; you can also go to the taxon instead of the clade. So there are a lot of things you can do with Pavian. We are done with Pavian; we can delete it in the history so
that the job is completely killed. Instead of Pavian, you could also use another visualization tool for this, called Phinch. You have an explanation in these boxes, which you can expand, of how to use that tool. I will not show you now how to do that; I recommend you do it on your own later, or do it now if you want: pause the recording and do that. When it comes to taxonomic assignment or taxonomy classification, Kraken is not the only tool available, and there are several papers that benchmark the different tools. You can have a look, for example, at the CAMI challenge. CAMI is the Critical Assessment of Metagenome Interpretation, and it investigates different tools for metagenomics data for different tasks, so for binning and for assembly, but also for taxonomic classification. They use similar datasets for all of what they call challenges, then they ask people to run their favorite tools on them, they aggregate the results, and they created two papers from that. You can look in these papers at which tools people usually use for this step, to perform this analysis, and at how the CAMI authors aggregated the results to identify the best tools for the different tasks. But there is another paper, published in 2019 by Ye et al., where they performed a benchmarking analysis of different tools themselves. As I said, they have different approaches: CAMI is more a community-oriented approach, it's a challenge, while Ye et al. did the analysis themselves, and they also used a different approach and different datasets. You can see there the ones for CAMI and Ye et al. But several metrics are used every time to compare
the different results: the precision, meaning the proportion of true-positive species among those identified in the sample; the recall, the proportion of true positives divided by the number of distinct species actually in the sample; the precision-recall curve; and the L2 distance, to assess how accurately the abundance of each species in the classification reflects the abundance of the species in the original sample, so somehow the ground truth. That's just to give you a bit of explanation of how to interpret these papers. In terms of profiling tools, there are three types of profiling, and you can read a bit more about them, whether they are really precise or not, etc. But globally speaking about classification tools, there are a few tools that you can find there: mOTUs, MetaPhlAn, DUDes, FOCUS and Bracken. I will skip Bracken here because Bracken is mostly an extension of Kraken. You can see where each has been used and whether it's available in Galaxy or not. Currently it's mostly MetaPhlAn and Kraken with Bracken that are available in Galaxy. We are also working on integrating mOTUs, because you see it's the most memory efficient there. MetaPhlAn, for example, is recommended by Ye et al. for low computational requirements, and Kraken seems to be a really good one too. So that is for selecting which tool to use. For the others, if you are interested, we could try to integrate them into Galaxy as well, but you probably need to contact us for that, and we can maybe do it together, integrating these tools in Galaxy, if you would be interested in having them and if you need them. And if you have other tools that you want to add to this table, please contact us as well: we can expand this table, or show you how to expand it, to make it a better overview of everything that is really available. If we missed something there, please let us know; that would also be great to have
a really good tutorial there. Just to highlight how to use the marker gene-based approach: we could also use MetaPhlAn to do the classification. As mentioned, MetaPhlAn is based on marker genes. It uses unique clade-specific marker genes identified from around one million microbial genomes, so it's a big, big database of marker genes that they built. That allows unambiguous taxonomic assignment and accurate estimation of organisms' relative abundance, with resolution down to the species level for bacteria, archaea, eukaryotes and viruses, and it can also be used for strain identification and tracking. It is quite quick and can do clade abundance estimation. So, in a nutshell, MetaPhlAn identifies the clades present in a microbiome sample and their relative abundance. We can use MetaPhlAn in Galaxy, especially the new version. Just be aware that if you search for MetaPhlAn in the toolbar on the left in Galaxy, you may find MetaPhlAn and MetaPhlAn2. MetaPhlAn2 is currently an older version compared to the MetaPhlAn version that you can find in Galaxy. I will show you: if I click here and it opens MetaPhlAn, you see that the version is 4.0.6, and the rest means it's revision zero of this version of the tool. It's not version 2.something, it's version 4.something. That's because the MetaPhlAn developers renamed the tool: they initially used MetaPhlAn, then they said MetaPhlAn2, and starting from version 3 they called it just MetaPhlAn again, and not MetaPhlAn-something, in the command-line tools. So that's why it's a bit messy, and I'm sorry for that. So the only thing: if you want to run
MetaPhlAn directly, you cannot give it a paired collection as with Kraken; it expects a forward and a reverse collection. So the fun thing we need to do is split our input reads collection into two collections, one for the forward and one for the reverse reads. So we need to unzip it: the tool is called Unzip, and you can find it in the collection operations. It takes a paired collection and splits it into a collection with only the forward reads and a collection with only the reverse reads. It's quite quick, almost instantaneous, and once you have that you can run MetaPhlAn. For MetaPhlAn, you say you have paired-end files, then you select your collections, first your forward collection, second your reverse collection. And then, what else do we want? We want an output that can be used for Krona, so you need to scroll down a lot to the output for Krona. So globally: we said we want paired-end files, we said collection here, we selected forward and reverse here, we use the default parameters here, and we need to be sure that the correct database is selected, so the latest one, from October 22. The other ones you leave; you just scroll down to the end, and for the Krona output you say yes. Then you can launch the tool, and it should create five new collections in your history. MetaPhlAn is now done and we can inspect the outputs. I will look at the most important one, the main output of MetaPhlAn, which is this predicted taxon relative abundances. It's a table; we'll look at the JP4D one. It's a table that looks a bit like the Kraken report format, okay, slightly different, sorry. In this one, after the lines starting with a hashtag, there is one line per taxon, and the first column is the taxonomy description, where between two levels there is this
pipe to show that we are talking about a different level. Then you have which level you are talking about, represented by its first letter, like k followed by two underscores, and then the name of the level. So here we have kingdom Bacteria, phylum Bacteroidetes, etc., classes, etc. So that is how the MetaPhlAn output looks. Then here you have the taxonomic ID, the NCBI taxonomic ID, again with the different levels. So it's written like the lineage: the different taxonomy levels and then the NCBI taxonomy IDs of those levels, so it represents a complete lineage separated by the pipe. Then you have the relative abundance of this specific taxonomy level, and potentially any additional species there. So which kingdoms have been identified? For JC1A with MetaPhlAn we have nothing, so nothing has been classified for this dataset, and for JP4D only Bacteria have been identified. So much less diversity compared to Kraken; it's really reduced, and we have almost nothing identified with MetaPhlAn compared to Kraken. Then you have other outputs that have been generated. You have a predicted taxon relative abundances for Krona, which is similar to the Kraken tools output, with the first column being the percentage of reads assigned to the taxon and the other columns being the different levels. Then you have a collection where the taxonomy levels are split into different files. Where is it? No, I don't see it; maybe it has been removed. Then you have a BIOM file, and you have a Bowtie2 output and a SAM file. I think it's because I needed to select something else to have this collection with the same information. Afterwards you could run Krona on the output of MetaPhlAn, on this predicted taxon relative abundances for Krona. In this case it's not really interesting because the diversity
is so low that there will be nothing displayed here, only these levels that we can go to. So I will not run Krona; I recommend you do it on your own if you are interested. As I said already, the community looks a lot less diverse with MetaPhlAn compared to Kraken. It's maybe due to the reference database that is used, which may not be complete enough to identify all taxa, or maybe there are too few reads in the inputs to cover enough of the marker genes, so the marker genes may not be represented in that data. So I'm really not convinced; we probably need to change to different datasets for these tutorials to really show the power of MetaPhlAn compared to Kraken. Yeah, I'm sorry for that example; we really need to change that tutorial. So, just to sum up: in these tutorials we looked at how to get the community profile for microbiome data using Kraken and MetaPhlAn. Even if MetaPhlAn didn't give really good results for this dataset, I can guarantee it gives really, really good results on other data, especially on gut microbiome data; it's really powerful and gives you really nice results. Then we visualized the results using Krona, but we can also use Pavian and Phinch for that. And then we could discuss which tool to use in which context. I would say, for example, that MetaPhlAn is good for human data, the human microbiome, and Kraken is maybe better for soil data, for example, or environments that are less characterized. So I hope that with these tutorials you learned a bit about taxonomic profiling, taxonomic classification and community profiles, and how you can do that using Galaxy. I hope you enjoyed these tutorials, and if you liked them, give us feedback. We would really appreciate it if you filled in the feedback form at the end of the tutorials to tell us how to improve
these tutorials. And with that, thank you. You can follow one of the next tutorials, for example on taxonomic binning or metagenomic assembly, or on how to apply this to real-world data, for example the beer microbiome, or also the metatranscriptomics tutorials. You can learn all of those afterwards. Thank you.
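
If you want to look at the MetaPhlAn table outside Galaxy afterwards, here is a minimal sketch in Python of how the format described in the walkthrough could be read: header lines starting with a hashtag, then one line per taxon, with a pipe-separated lineage whose levels start with a letter and two underscores, the NCBI taxonomic IDs, and the relative abundance. The column order, the rank letters beyond `k` and `p`, and the function names `parse_lineage` and `read_profile` are assumptions made for illustration, and the example rows are made up; they are not results from JC1A or JP4D.

```python
import csv
import io

# Assumed mapping from rank prefix letter to rank name (only k and p were
# shown in the walkthrough; the others are assumptions).
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species", "t": "strain"}


def parse_lineage(clade_name):
    """Split 'k__Bacteria|p__X' into [('kingdom', 'Bacteria'), ('phylum', 'X')]."""
    levels = []
    for part in clade_name.split("|"):
        prefix, _, name = part.partition("__")
        levels.append((RANKS.get(prefix, prefix), name))
    return levels


def read_profile(handle):
    """Yield one dict per taxon line, skipping header lines starting with '#'."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        lineage = parse_lineage(row[0])
        yield {
            "rank": lineage[-1][0],          # deepest level on this line
            "name": lineage[-1][1],
            "lineage": lineage,
            "ncbi_ids": row[1].split("|"),   # one ID per lineage level
            "relative_abundance": float(row[2]),
        }


# Made-up example content in the assumed layout, not real tutorial output.
EXAMPLE = (
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species\n"
    "k__Bacteria\t2\t100.0\t\n"
    "k__Bacteria|p__Proteobacteria\t2|1224\t100.0\t\n"
)

records = list(read_profile(io.StringIO(EXAMPLE)))
for rec in records:
    print(rec["rank"], rec["name"], rec["relative_abundance"])
```

To use it on a real file, you would download the predicted taxon relative abundances table from your history and pass an opened file handle to `read_profile` instead of the in-memory example.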