Hi, Maxime here. Today for this bytesize talk I'm hosting Danilo Di Leo from Linnaeus University, and he's going to talk to us about the metatdenovo pipeline. I'm really looking forward to that because I have very little knowledge of this field, and I think it's going to be super interesting. So over to you, Danilo.

Hello. All right, thank you Maxime. I'm sharing my screen, can you see it? All good? Okay, perfect. Thanks for hosting me. This is a pipeline that we developed called metatdenovo, and the idea is to make a best-practice workflow for assembly and annotation of metatranscriptomic and metagenomic data. We will first go through what a de novo assembly is and a bit of background, then I will describe the pipeline, go through a test case that we developed while writing the manuscript, and finally look at some future improvements that we think would be nice to have in the pipeline.

So why did we choose to work on this pipeline? As probably everyone knows, metatranscriptomic and metagenomic projects are increasing over the years. You can see in this graph, from 2005 to today, that more and more papers involve metatranscriptomics, and more will probably come in the future, because this kind of method is becoming cheap and it's really useful for many reasons, mainly when you work with community and environmental data. There are many tools out there that you can use in metatranscriptomic or metagenomic projects, but there is no real standardization. That's how we ended up thinking it would be nice to collaborate with nf-core and create a reliable workflow that can actually be used for this sort of analysis. The main idea is to get a gene catalogue and then annotate it with taxonomy and function.

If you don't have reference genomes, the best solution is to build your own gene catalogue. To build a gene catalogue you need to prepare the reads from your samples: you clean your metagenomic reads, prepare them, and then, using a tool such as MEGAHIT in this example, you create a reference. This is a group of contigs in a single file, and these contigs can be used to realign the reads that came directly from the samples. The advantage of this method is that you are building contigs directly from the whole community. Each read can represent a fragment from some organism in the community, and by using the assembly tool you can build long sequences from which you can retrieve the genes, and the genes can then be used for several things, like assigning taxonomy and function.

If we go inside the pipeline and try to describe it, we have different stages. The first is the preprocessing stage, where we start by cleaning the reads and checking their quality.
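As with most nf-core pipelines, the input is a CSV samplesheet pointing at the reads. A minimal sketch of preparing one in Python, assuming the common nf-core column convention (sample, fastq_1, fastq_2); check the metatdenovo documentation for the exact columns it expects:

```python
# Hypothetical input preparation: write an nf-core style samplesheet.
# Column names follow the usual nf-core convention (sample, fastq_1, fastq_2);
# the real pipeline docs are the authority on the expected header.
import csv

samples = [
    ("sample1", "reads/sample1_R1.fastq.gz", "reads/sample1_R2.fastq.gz"),
    ("sample2", "reads/sample2_R1.fastq.gz", "reads/sample2_R2.fastq.gz"),
]

with open("samplesheet.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["sample", "fastq_1", "fastq_2"])
    writer.writerows(samples)
```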
The next preprocessing step uses a program called BBduk. With BBduk we want to remove all the contaminants; the user can provide their own FASTA file of sequences they consider contaminants, but by default we provide SILVA, which is mainly for removing rRNAs. These are, in most cases, the contaminating fraction you don't want to work with, because with metatranscriptomics, for instance, you mainly want to work with the genes that can actually be functionally and taxonomically annotated. Then we interleave the sequences when they are paired-end, and you can decide whether or not to normalize the reads before the assembly.

The reason we added the normalization tool, BBnorm, is that when you work with a huge amount of data, say 30, 40, 60 samples or even more, normalizing the reads before making the assembly will give you a better assembly: higher quality, longer contigs, and ORFs that are easier to find. That is mainly because some organisms express certain genes much more, and in a metatranscriptome those genes get sequenced more than others, so they can mask reads that are present at lower concentrations. BBnorm addresses this issue (a toy sketch of the idea follows below). In our experience with metatranscriptomic data, if you work on a smaller project, maybe 10 or 20 samples, it's still fine not to use it, but it's good to have it there when you're working with big datasets.

Then it comes to the assembly, where we provide two different tools. The main reason for having two is that it depends a lot on the kind of performance you need. MEGAHIT is generally faster and uses fewer compute resources, while rnaSPAdes can be more demanding on the server but can also give you a better result in terms of assembly. So depending on your resources you can decide to use one or the other. In our hands rnaSPAdes and MEGAHIT produce similar outputs, but of course when you work with your own data you know your environment, and you might want to try both tools, so it's good to have them both there.

It's generally the same when it comes to the ORF caller. Once you have the assembled contigs, you want to extract the genes to create the gene catalogue that we can then annotate, and we provide three different programs for this. Prokka is really good for prokaryotic datasets, but it has a lot of filtering steps: it gives a really nice output in terms of genes, but it can be a bit stringent, so some smaller or rarer ORFs can be filtered out. If you are working on a project where you also want to look at smaller ORFs, it may be better to use Prodigal, which is also aimed at prokaryotes. If you are working with eukaryotes, we suggest using TransDecoder instead. In any case, the output from these three tools is the same: a GFF file and the amino acid sequences, which are the actual ORFs we use downstream. With the GFF file and BBMap we can map the reads back to the contigs, assign them to the ORFs, and finally count them with featureCounts.
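The toy sketch mentioned above: a rough illustration of digital normalization, the idea behind BBnorm, rather than anything the pipeline itself runs. Reads are streamed, each read's coverage is estimated from its k-mer counts, and reads whose region is already covered deeply enough are dropped:

```python
# Toy digital normalization: stream reads, estimate each read's coverage from
# the median count of its k-mers, and drop reads that exceed a target depth.
# Conceptual sketch only; BBnorm is far more efficient and careful.
from collections import Counter
from statistics import median

K = 21
TARGET = 5          # toy target depth; real runs use something like 40-100x
kmer_counts = Counter()

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def keep_read(seq):
    counts = [kmer_counts[km] for km in kmers(seq)]
    if counts and median(counts) >= TARGET:
        return False                  # region already covered deeply: drop
    kmer_counts.update(kmers(seq))    # under-covered: record its k-mers
    return True

reads = ["ACGT" * 20, "ACGT" * 20, "TTGCA" * 16]  # stand-ins for FASTQ records
normalized = [read for read in reads if keep_read(read)]
```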
We also provide a tool that collects the featureCounts output from each sample and builds a table that is provided directly as output from the pipeline.

For the annotation step we use the gene catalogue built with one of those three ORF callers, and we have a bunch of functional annotation tools. The reason we have several is that you might want to approach the annotation in different ways. In our own projects we sometimes work with HMM profiles, so we target a specific group of genes; but you can also work with EggNOG-mapper, which gives you a list of COG categories, KO terms and so on; if you use Prokka you get another type of annotation; and you get one more with KofamScan.

When it comes to taxonomy, we decided to use EUKulele. It's a tool developed by a lab group that started working mainly with eukaryotes, but we found that it is actually really good for prokaryotes too. One of the advantages of EUKulele is that you can give it several databases in the same run. The way it works inside the pipeline is that you might have a default database you want to use to annotate your genes, or you can use your own custom database, and EUKulele can run against several of them in the same run. For instance, if I'm working with prokaryotes and I have my own database, I can provide it and at the same time also the GTDB database; that way I get two different annotations that I can merge to get a better resolution in my output. In our experience so far, if you work with prokaryotes, GTDB is still the best to use; we are quite happy with it, I think it's the way to go, and we also provide this information in our documentation.

In the end, like all nf-core pipelines, we provide a MultiQC report with several modules inside. There is also an output from TransRate that gives you statistics for the assembly, which is always good to look at and perhaps use when you write your manuscript. We also get the collected statistics table, which is really useful because it gives you the number of reads that actually mapped back to the contigs at ORF level, and also for each of the different annotations that were done.

This is actually maybe our best outcome from the pipeline: at the end of it you can focus on a single folder, called summary_tables, where you have all the tables I was describing: a counts table with the TPM values, a functional annotation table, and a taxonomic annotation table. All these tables can be merged together when you work in, for instance, R or Python, and then you can just start building your own graphs and doing your own further analysis.
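A minimal pandas sketch of that merging step, assuming tab-separated files keyed on an "orf" column (the file and column names here are illustrative; check the summary_tables folder for the real ones):

```python
# Merge the pipeline's summary tables into one data frame for analysis.
import pandas as pd

counts = pd.read_csv("summary_tables/counts.tsv", sep="\t")      # orf, sample, count, tpm
taxonomy = pd.read_csv("summary_tables/taxonomy.tsv", sep="\t")  # orf, domain, ..., family
function = pd.read_csv("summary_tables/eggnog.tsv", sep="\t")    # orf, COG_category, KO, ...

merged = (
    counts
    .merge(taxonomy, on="orf", how="left")
    .merge(function, on="orf", how="left")
)

# For example, sum TPM per family and sample for a quick overview barplot.
family_tpm = merged.groupby(["sample", "family"], as_index=False)["tpm"].sum()
```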
For instance, here is the test case that we are also using for our manuscript. Directly from the statistics table we can see the difference when we start from the trimmed reads and then remove all the contaminants, and then, from the decontaminated reads, how many actually mapped back and what we got with featureCounts. If we look at the taxonomy, we are making a comparison with the original paper whose data we wanted to test, and we can see that in our case we did quite a good job with this dataset: of the two groups highlighted in the original paper, one could not be assigned to anything there, while with the GTDB database and EUKulele we could assign it down to family level, which gives you a better resolution in the output. For the functional annotation, I plotted the output of the COG categories, ordered by decreasing abundance, from translation down through all the other functions.

So where can you actually use the pipeline? Besides building graphical representations like the ones I showed before, you can use it for in-depth analysis of gene diversity. We work a lot with certain groups of genes, so for us using HMM profiles is really good, because we can then use those ORFs to build trees and check the diversity inside the gene catalogue. It can also be used for gene expression analysis if you run controlled experiments.

There are some improvements we are thinking about for the future. We saw that even though the SILVA database is quite good for removing rRNA, sometimes some still remains, so adding another filtering step could address this issue; there are also other tools that would be interesting to look at, such as Kraken. We were also thinking of more specific applications, for instance removing host sequences from microbiome data; for example, we found sequences that we could use for the gut. Then, as the use of long reads seems to be increasing over time, we would also like to have an input option that allows you to use long reads. We would like to add a new functional annotation tool, run_dbCAN, which is really useful especially when you work with CAZymes. We also saw that it could be interesting to integrate other pipelines that already exist; for instance, differential abundance might be good to integrate inside the pipeline, but we are still thinking about it. And maybe we will add other features, like phylogenetic placement, for whoever is working with specific groups of genes.

We are actually now developing another pipeline, and the idea is for it to be complementary to metatdenovo. When you work without reference genomes it's always good to start from a de novo assembly, but if you are working with genomes you already have, or genomes you decided to download from different sources, you might want to map against those genomes instead. So we are building this new pipeline, which is still under development, and we are planning new features and news about it at the next hackathon, so stay tuned.

So that's it. I would like to thank the university and of course the people who have been part of this project, SBDI, the Swedish Biodiversity Data Infrastructure, which is actually helping us with the project, and of course Daniel and Emelie, who also contributed to the pipeline, and the #help channel in Slack, because that was really useful. Thank you.
Thank you so much, that was super good and super helpful, so thank you very much for all of that. You mentioned that you are looking into creating a new complementary pipeline, magmap I'm guessing, and you also seem quite interested in all of the other taxonomy pipelines, so taxprofiler and mag and the others. So I'm guessing you share a lot of components with each other and you are a tight community, right?

Yeah, exactly. We made several nf-core modules that are actually used in the nf-core community, and while building magmap we also found it useful that there are modules already present in nf-core/modules.

That makes sense. Now that you have modules, I'm guessing you will start making subworkflows too, so that you can share those between more components?

Yeah, we already have some subworkflows that might also be suitable for nf-core. For instance, when we were thinking about adding other nf-core pipelines to the one we are developing, we could maybe add them as subworkflows to the main workflow. So there are a lot of different combinations one can do with that.

Okay, very good. You also mentioned the differential abundance pipeline?

Yes, exactly. We saw that it actually seems really interesting and useful, even for metatdenovo, because in the end the output we get can be thought of as similar to what you get out of an RNA-seq pipeline: rnaseq starts from a reference genome, while we build our reference from scratch and then do the taxonomy and everything, but the output can be used in much the same way. I'm not sure it will work straight away; I think there will have to be some adjustments to the metatdenovo output the way it is now, but it could address that need if someone wants it. We are still thinking about it, because a differential expression analysis is something you want to do for specific kinds of projects. If I think about myself, I work a lot with environmental data, and there you usually don't have a control, or at least not many controls: in the environment you just fish out what you get, using triplicates for instance, so you repeat the analysis. But if you work with controls, maybe a mesocosm experiment or work in the lab, then it might be useful. So it depends a lot on the project.

No, it looks super interesting, thank you very much. Is there any question from the audience? Please don't hesitate, and feel free to unmute yourself.

Yes, I actually have some questions. First of all, those are two very interesting pipelines, and especially the second one I think fits more with what I want to do, but I will ask you some questions regarding the de novo assembler. You mentioned that for MEGAHIT and rnaSPAdes the performance is similar, but do you have any experience with these two assemblers on viruses? Do you know which one is better? Because I saw that SPAdes has different assemblers, for instance the ones you mentioned, rnaviralSPAdes or metaviralSPAdes. Do you have suggestions for viruses?
So first of all, I would say that if you have virus reads you can actually input them into metatdenovo, because rnaSPAdes has an option for viral RNAs that you can just use in that case. If you work with viromes, which is something I didn't mention because I was mainly talking about prokaryotes and eukaryotes rather than viruses, I would really recommend using rnaSPAdes. I think that's the way to go, and I know that people in our department have had really nice experiences with rnaSPAdes.

Have you tried the other algorithms, for instance rnaviralSPAdes and metaviralSPAdes? Do you think these two will outperform rnaSPAdes for viruses?

I cannot say, because I haven't tried them personally, so I cannot answer that question, but I would guess that using the dedicated tool for viromes would be much better. rnaSPAdes is really flexible in that case, so you can ask it to use the option for viruses and just run your data through it.

I could perhaps add a little bit there: if there is an assembly program for viruses, for instance, that people would like us to use, just let us know and we could try to integrate that into the pipeline as well.

Yeah, absolutely. If there are any tools out there that might be better, it can be worth it. I also have another question, maybe a very basic one: is it possible to include, for instance, principal component plots? Since you already have the featureCounts output, it would be easy to get, so we wouldn't have to use other tools to actually do it.

Yeah, absolutely. As long as you have the counts there, and there is this counts table, it can be used directly to build PCAs; those are among the first things we do when we work with metatranscriptomic data, so it's something that could be added without issues, I think. It was not on top of our minds, to be honest, because I feel we wanted to address the main point of this pipeline, getting the functional and taxonomic annotation, but now that it's out, every suggestion and improvement is the way to go. It's a never-ending process: you can always add more features.

For a PCA to be useful, one would also have to include some sort of metadata, at least some kind of grouping variable, so you could for instance colour samples that belong together, because a black and white PCA is not so interesting. For the time being we have chosen not to include that type of project-specific statistical analysis in the pipeline, and instead focused on making the output tables very easy to pull into other tools. That's sort of our way of thinking so far, but as Danilo said, we're happy to discuss anything.

Oh yeah, thank you. Okay, thank you very much for such a great presentation. I think it was super interesting, it definitely was for me, and I learned a lot today. So thank you very much. I will stop recording for now.
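Following up on the PCA question from the audience: a minimal sketch of building a PCA from the pipeline's counts table with pandas and scikit-learn, assuming a long-format table with "sample", "orf" and "tpm" columns (illustrative names; check the summary_tables output for the real headers):

```python
# PCA of samples from the metatdenovo counts table (sketch, not pipeline code).
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

counts = pd.read_csv("summary_tables/counts.tsv", sep="\t")

# Reshape to a samples x ORFs matrix of TPM values.
matrix = counts.pivot_table(index="sample", columns="orf",
                            values="tpm", fill_value=0)

# Scale, then project the samples onto the first two principal components.
coords = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(matrix)
)
print(coords)  # one (PC1, PC2) pair per sample, ready to plot coloured by metadata
```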