Hello everyone and welcome to this week's bytesize talk. I'm happy to present to you today Sabrina Krakau. She is situated at QBiC at the University of Tübingen, and she's talking today about the nf-core pipeline mag. And off to you.

Thanks Franziska for this kind introduction. I'm very happy that it finally worked out to also present the nf-core/mag pipeline to all of you. This pipeline can be used for metagenome hybrid assembly and binning. The goal of this pipeline is to analyze microbial communities by recovering individual genomes. This might be particularly useful, for example, if you do not have a complete set of high-quality reference genomes available. Such microbial communities could be basically anything, not only environmental samples but also host-associated communities such as the gut microbiome. The microbiome samples can be processed with metagenome shotgun sequencing, which generates short reads, and the nf-core/mag pipeline then essentially combines these reads and assembles them into larger contigs; a downstream genome binning step bins these contigs into so-called metagenome-assembled genomes, also called MAGs. These MAGs can then be further annotated and also taxonomically classified. So that's basically the concept of the nf-core/mag pipeline.

As for many nf-core pipelines, the development of this was quite a large community effort. There are many different contributors, so I'm just mentioning the most important ones: it was started by Hadrien Gourlé, then Daniel Straub contributed a lot since early on, then I joined, and since last year James Fellows Yates is the main contributor to this pipeline.

Now I would like to mention the key features of this pipeline. It can perform a hybrid assembly using both short Illumina and long Nanopore reads. It is called hybrid because assemblies generated only from short reads are often highly fragmented, and by additionally using longer reads the contiguity of the resulting assemblies can be improved. The pipeline also performs a genome binning step and optionally a binning refinement step, it can taxonomically classify the resulting bins, and it provides comprehensive QC statistics. Furthermore, it can utilize sample-wise group information. This can be used for co-assembly, which is important if you have data sets where you know that certain strains are present across multiple samples, such as within longitudinal data sets, because co-assembly increases the sequencing depth and thereby allows the recovery of more low-abundance genomes. Additionally, the group information is also used for the computation of co-abundances, which are used in the genome binning step. Furthermore, the pipeline also allows the handling of ancient DNA, because it contains an ancient DNA validation subworkflow, which is rather specific to this pipeline. The pipeline was already published at the beginning of this year in NAR Genomics and Bioinformatics, so if someone is interested in more details, you can also have a look at this application note.

Here you can see an overview of the pipeline. The pipeline starts with different pre-processing and QC steps, then the actual assembly is performed, followed by a final genome binning step. In green you can see the processes or different tools that are run by default by this pipeline.
And now in the following I would like to guide you through the different steps of this pipeline in more detail. But first, how can we actually run it? Here you can see an example of the Nextflow command that is typically used: in order to run it with default settings, you just provide a sample sheet as input file. And here you can see an example of how the sample sheet looks for this pipeline: it contains five columns. The first column contains the sample name, the second column contains a group name (in this case, all samples belong to the same group), and then you have to provide the paths to the input read files, either only the short reads or the short and long reads, so the long reads are optional. If you have only short reads, you can also just provide the FASTQ files directly instead of a sample sheet.

Starting from this sample sheet, the pipeline then pre-processes the short and long reads separately from each other with different pre-processing steps. I don't want to discuss them in detail; maybe just to mention that host reads can also be removed by mapping the reads against given reference sequences, and this information is also used indirectly for the long reads, since the long reads are filtered based on the already filtered short reads. The short reads can then also be taxonomically classified, which can serve, for example, as a quality control to check for potential contamination.

After these pre-processing steps, the actual assembly is done. This can be done sample-wise, or the group information can be used to run a co-assembly; by default, however, it is done for each sample individually. By default, both SPAdes and MEGAHIT are run. You should keep in mind, though, that if long reads are given and you are interested in a hybrid assembly, only SPAdes can be used for this. The tool QUAST is then used to assess the quality of the resulting assemblies, and the assemblies are further processed with Prodigal, which predicts protein-coding genes.

That's the assembly part. The contigs of these assemblies are then further processed in the genome binning step, where the tools MetaBAT2 and MaxBin2 are used, which bin the contigs to retrieve the actual genomes. The results of these tools can additionally be combined in a binning refinement step, which makes use of DAS Tool. The quality of the bins is likewise assessed with QUAST, and in addition the tool BUSCO is used, which makes use of single-copy orthologs to estimate the contamination and completeness of the retrieved genomes. Additionally, the pipeline uses a custom script which estimates the abundances of the individual bins, because that is also a relatively important output of this pipeline. Downstream, the bins are then taxonomically classified, by default using GTDB-Tk, and also annotated with Prokka. Finally, a MultiQC report is generated, as well as a relatively comprehensive bin summary report.

Regarding the output of the pipeline: besides all the individual results of the individual tools, the pipeline generates a clustered heatmap showing the MAG abundances across the different samples; here you can see an example of how this looks. If you would see, for example, that certain samples cluster together for which you know that they originate from different groups,
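To make this concrete, a minimal run command and sample sheet could look roughly like the following. This is only a sketch based on what is described above; exact parameter names and required columns can differ between pipeline versions, and the file names here are just placeholders, so please check the nf-core/mag documentation for the version you are using.

    nextflow run nf-core/mag -profile docker --input samplesheet.csv --outdir results

    # samplesheet.csv (five columns as described above; long_reads may be left empty)
    sample,group,short_reads_1,short_reads_2,long_reads
    sample1,0,data/sample1_R1.fastq.gz,data/sample1_R2.fastq.gz,data/sample1_ont.fastq.gz
    sample2,0,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,data/sample2_ont.fastq.gz

If a group-wise co-assembly is desired, as discussed later in the talk, a flag such as --coassemble_group can be added to the command; again, check the documentation of your pipeline version.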
this might indicate that something has gone wrong. The pipeline also outputs the bin summary, which I already mentioned. This contains, for each bin, for each MAG, the abundance information across the different samples, the QC metrics from the BUSCO and QUAST results, and also the taxonomic classifications from GTDB-Tk.

With this I've shown you a rough overview of the pipeline, and next I would like to show you the impact different assembly settings can have. For this I simulated a mouse data set in the past with the tool CAMISIM, and I generated hybrid data, containing Illumina data and Nanopore reads, in two groups, each with a time series of four samples. So this might be the ideal case where a co-assembly could be useful. Now I would like to show you some of the resulting assembly metrics that are commonly used. Here you can see, for example, the total length of the resulting assemblies, compared for different pipeline runs where different assembly settings were used. The lower two settings correspond to a sample-wise assembly, using either only short reads or short and long reads (hybrid data), and the upper two settings correspond to a co-assembly, again with short reads only or with short and long reads. What we can see is that the total length of the resulting assemblies increased significantly both by using the hybrid setting and by using the co-assembly setting. We see similar results when looking at the number of MAGs, that is, the number of genomes that could be retrieved from this data, and also when looking at the N50 values. So the actual setting that is used for the assembly within this pipeline can have a relatively large impact on the results, and it's definitely good that the pipeline provides different settings so that you can really choose the right setting for your input data; it might also be worth comparing different settings.

I would like to briefly mention the resource requirements, because this came up quite often in the Slack channel already, and it's also somewhat difficult to estimate in advance, because it really differs depending on the input data. The main requirements, both for memory and time, come from the assembly step, and as I mentioned, they really differ for different input data sets. I collected some numbers, just to give you a rough idea, from different pipeline runs that were actually run by Daniel Straub on our compute cluster. For one rather small sample, both MEGAHIT and SPAdes required less than 25 gigabytes and finished in a couple of hours. However, for a larger river sample data set, MEGAHIT already took more than 100 gigabytes of RAM and more than one day to finish, and SPAdes even took more than 900 gigabytes of memory and required more than nine days. And there was another very large data set containing 15 soil samples, for which also a co-assembly was performed; for this, MEGAHIT required one terabyte and more than 17 days, and SPAdes could not even be run because it would have required more than two terabytes of memory. This just shows that even for smaller data sets you cannot run this on your laptop. But in general one can say that it depends on the sequencing depth, the number of samples, the complexity of the underlying metagenome, and also on the applied tool and setting.
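To give a rough idea of how resources can be adjusted for a particular data set, a custom Nextflow config file can be passed to the run with -c. The process selectors and values below are only a hypothetical sketch; the actual process names and sensible limits depend on the pipeline version and on your data, so they should be checked against the pipeline documentation.

    // custom.config -- hypothetical resource overrides for the assembly steps
    process {
        withName: 'MEGAHIT' {
            cpus   = 16
            memory = 200.GB
            time   = 2.d
        }
        withName: 'SPADES' {
            cpus   = 16
            memory = 500.GB
            time   = 7.d
        }
    }

    nextflow run nf-core/mag -profile docker --input samplesheet.csv --outdir results -c custom.config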
For this it might be worth noting that both assembly tools are run by default, but MEGAHIT requires much fewer resources than SPAdes. So if you do not want to compute a hybrid assembly, it might make sense to consider the --skip_spades parameter. Additionally, the co-assembly also increases the required resources, because it pools samples, so at least for one individual task the required memory and time are much higher. This is something important to keep in mind, because if you want to run it on larger data sets, you might want to provide a custom config file, like the sketch shown above, in order to adjust the resources for your particular data set.

With this we have seen how we can run the nf-core/mag pipeline for modern metagenomic data sets. And as I mentioned at the beginning, it can also handle ancient DNA. For this, James and Maxime added an ancient DNA validation subworkflow, and this is particularly interesting because, as far as we know, there is no other such pipeline which can handle ancient DNA. What this essentially does is identify possible ancient contigs by modeling ancient DNA damage patterns, and then polish the contigs in order to remove the errors that are caused by the presence of such ancient DNA damage, in order to allow unbiased downstream analyses. So it might be interesting for some of you to know that this pipeline can also handle ancient metagenomic data analysis.

With this I'm already at the end of my presentation, so just a few words on the outlook. The next release, which James has already prepared (it just requires one more review), will contain another optional binning tool, and it will also optionally allow binning QC with CheckM and GUNC. For the mid-term future it would also be very nice if a functional annotation step could be added, depending on the strategy, for example using HUMAnN 3 or eggNOG. A standalone long-read assembly option would also be very nice, using for example the tool metaFlye, such that the pipeline could also be run without short-read data. In general, if you are interested in contributing, or if you have any questions or problems you would like to discuss, you can join us in the nf-core Slack channel dedicated to the mag pipeline, or have a look at our GitHub repository. We are always happy about feedback, and in particular about bug reports and issues.

With this, I would like to thank you for your attention. In particular, my colleagues from QBiC, and importantly Daniel Straub for many contributions, James and Maxime from the MPI for Evolutionary Anthropology, Hadrien of course, and importantly the whole nf-core core team and community for helping with the development, for reviewing, testing and creating issues. And with this I'm happy to take any questions.

Thank you very much. There is indeed one question already in the chat. It was at the very beginning, when you were talking about examples, and you mentioned CAMISIM; could you explain in more detail what this is?

This is a tool which was also used in the CAMI challenge to simulate metagenomic data. It basically uses different genome sources as input; in this case I used a set of mouse genome sources which was given from some mouse data sets. It can then simulate Illumina and Nanopore data and also simulate different taxonomic profiles. For more details I would have to look it up again, as it was quite a while ago. Was there any particular question about this?
No, it was just a question about what CAMISIM is, but I think James has now added some links to articles, so if anyone is interested, they can have a look at those. Yes, exactly. For anyone else: if there are more questions, you can now unmute yourself and just ask them straight away, or you can put them in the chat and I will read them out for you.

I would actually have a question. What happens to multi-mappers? I can imagine that if you have related bacteria, reads would also map to different ones. How does the pipeline deal with that?

I mean, this is handled by the assembly tools somehow. But are they removed, or added to all of them, or any idea? Is there someone who is more into the details of the algorithmic parts of the assemblers?

Franziska, do you mean when you're mapping back to the contigs, or during the assembly itself?

During the assembly. I mean, you map to the genomes, I guess.

No. I won't be able to explain the whole concept here, but basically there's some fancy maths magic that goes on which estimates which reads most likely go with each other, based on the number of mutations they have with respect to each other, and there's some clever maths which works out which is the best grouping.

Okay, then I misunderstood that part. Thank you. Are there any more questions from the audience? It doesn't seem so. If you have any more questions later on, as mentioned, you can always go to the nf-core Slack and ask questions there. So if that's all the questions for now, I would like to thank Sabrina again for this very nice talk. And of course, as usual, I would also like to thank the Chan Zuckerberg Initiative for funding the bytesize talks, and of course everyone in the audience for listening. Thank you very much. Thanks.