Hello everyone, and welcome to this week's bytesize talk. With us is Jasmin Frangenberg. I'm very happy that you're here. Thank you very much. And she's going to talk about yet another new pipeline that is going to be released very soon, which is nf-core/funcscan. Off to you, Jasmin.

Yes, thank you very much. So I will introduce this pipeline to you now, which is an nf-core pipeline to screen for functional components of nucleotide sequences from prokaryotic genomes or metagenomes.

So what are these functional components that we screen for? On the one hand, the pipeline screens for antimicrobial peptides (AMPs). They are important in innate immunity, and they are very short sequences, peptides of about 20 amino acids, so you can find them even in small or fragmented DNA and metagenomes. The same applies to antibiotic resistance genes. On the other hand, biosynthetic gene clusters (BGCs), here at the bottom, are quite big because they consist of a whole gene cassette, which codes for a whole metabolic function, secondary metabolites or natural products.

So who would be interested in a pipeline that identifies these compounds? Of course, people in natural product discovery, where you identify these compounds to develop therapeutics, in antibiotic research and in environmental metagenomics, or people who simply want functional and genomic annotations.

In these research fields, detection of these compounds is already being done with a couple of tools. However, there are certain issues. One of them is efficiency, because mostly you apply the tools manually, and each tool has only a very specific purpose. So you can identify a single compound, but it's not a very broad thing, and you have only a single algorithm that produces the output.
So it could be more feasible to have this whole process streamlined in a pipeline. Also, the output of these tools is not standardized. Another issue is reproducibility, because throughout the years the tools develop: new functions are added, bugs are fixed. So it's very important for researchers to record which versions of which tools they're using, which is hard if you always execute them manually on your samples. Then there is data privacy: a bunch of tools offer web services where you can upload your data and it is analyzed for you. However, this requires that you give your data to a third party, which is not always intended or even possible. Another issue is that bioinformatics skills are often needed. You may even have to write small bash scripts to execute the tools on your data, which is not feasible for everyone. If there are biochemists who just want to know what is actually in their data, they don't want to become trained bioinformaticians.

So these are many problems that our pipeline tackles. Namely, it is a very scalable Nextflow pipeline; all nf-core pipelines are Nextflow pipelines. They are very efficient and scalable: you can execute them on anything from your local computer or laptop up to the institute's HPC. They are reproducible, since they record all the tools and the versions of the tools. And of course you can decide where you want to have your data; you are not forced to put it on any web server. Also, the pipeline is very easy to execute, which you will see later when we come to the tutorial part.

So now I have emphasized how easy the pipeline is to use, but it didn't start very easily. I go back to October 2021, when we assembled the ideas to develop a pipeline out of many tools, and we brainstormed what would be needed for obtaining the resistance genes, the biosynthetic gene clusters and the AMPs. And not all tools were on Conda yet, not to speak of nf-core modules.
So we had to do a lot of work there. Then throughout the next year we streamlined the process a bit and the ideas got clearer. And we even got to the first sketch of the famous tube map. And finally, now in 2023, the pipeline is ready to use.

And this is the current version, so I will walk you through it. In the first step we have the input, which is being annotated. As I said, the input can be any genome sequence: it could be metagenome contigs, it could also be complete bacterial genomes. This data is then analyzed by one of the three annotation tools. After this, the data goes into one or all of the three workflows: the antibiotic resistance genes in the yellow workflow, the BGCs in purple and the antimicrobial peptides in red. Not all of the downstream tools need the annotated data, so for some we also use the input data directly.

Then, as I said, each of the workflows has a bunch of tools. For example, the AMP workflow has four tools here. And as I mentioned before, they follow different strategies. Some of them use, for example, deep neural networks and machine learning to identify AMPs, which would be for example ampir, or here DeepBGC for the BGC workflow. Other tools have rule-based strategies. So there are a lot of algorithms predicting the compounds, and the results are then very diverse, as you can imagine.

So now it is important to aggregate these outputs and summarize them into a nicely readable format, which is the third step. For this we use one tool per workflow. Two of them we developed ourselves, AMPcombi and comBGC, and hAMRonization was a tool that was already available.

So this was basically the overall workflow, and now I would like to show you how to apply the pipeline, and you will see that it's really very easy. We start with the input, which is a sample sheet, basically a table with two columns. The first one is your sample name. The second one is the path to your FASTA input file.
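
As a minimal sketch of the two-column sample sheet just described, the following Python snippet writes one as a CSV file. The header names `sample` and `fasta`, the file name `samplesheet.csv`, and the sample names and paths are assumptions made for illustration; the pipeline documentation is the place to check the exact format.

```python
import csv
from pathlib import Path


def write_samplesheet(rows, path):
    """Write a two-column sample sheet: sample name, then path to the FASTA file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Header names are an assumption, not confirmed by the talk.
        writer.writerow(["sample", "fasta"])
        writer.writerows(rows)
    return Path(path)


# Hypothetical sample names and FASTA paths, for illustration only.
rows = [
    ("sample_1", "/data/contigs/sample_1.fasta"),
    ("sample_2", "/data/contigs/sample_2.fasta"),
]

samplesheet = write_samplesheet(rows, "samplesheet.csv")
print(samplesheet.read_text())
```

Each row pairs one sample name with one FASTA file, so adding a sample means adding one line to the table.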
And then of course your FASTA file includes the ID of your sequence and the sequence itself. So this is what you need to actually run the pipeline, and it is as easy as running nextflow run. You give your input sample sheet, you give your output directory, and this is basically a minimal example of a pipeline run.

Of course it is recommended to use more parameters. One of them concerns the annotation step: the flag --annotation_tool, where you can decide which tool you want to use. The tools have different properties. For example, Prodigal is very fast. However, we noticed that with Prokka we get better downstream results. So it depends on your needs and ideas which tool you would like to choose. The default is Prokka.

After the annotation step we come to the actual identification of the compounds. You can activate each workflow with a flag, for example --run_amp_screening for the AMPs. By activating this, all the AMP tools are run on your data. You can also choose, for any reason, to deactivate any of the tools. You can switch them off with a flag starting with --amp_skip_ followed by the name of the tool. This might be because some tools are very slow, or because you think they are so specific that you're not interested in the output. As I said, for whichever reason, you can switch them off.

And this is the same for the antibiotic resistance workflow. You can apply the corresponding flag, it runs all four or five tools on your data, and you can skip any tool with the --arg_skip_ flag. The same applies for BGC identification. You have the flag, all the tools are run, and you can skip whichever you might want to skip. And of course you can use not only one of the flags per run, but all three flags at the same time. So your data is investigated simultaneously and parallelized as much as possible with Nextflow.

Okay, so these are the identification steps. Now we come to the summary steps for each workflow.
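
Before the summary steps, the flags described above can be collected into one command line. The sketch below only assembles and prints the command and does not run anything. The helper `build_funcscan_command` is hypothetical, and the flag spellings (`--input`, `--outdir`, `--annotation_tool`, `--run_<workflow>_screening`, `--<workflow>_skip_<tool>`) are assumed from the wording in the talk, so they should be checked against the pipeline documentation. A real nf-core run usually also needs a `-profile` option (for example for Docker or Conda), which is left out here.

```python
import shlex


def build_funcscan_command(samplesheet, outdir, annotation_tool=None,
                           workflows=(), skip=()):
    """Assemble a `nextflow run` command for the pipeline as a list of arguments.

    workflows: any of "amp", "arg", "bgc" (assumed short names of the workflows).
    skip: (workflow, tool) pairs for tools to switch off, e.g. ("amp", "ampir").
    """
    cmd = ["nextflow", "run", "nf-core/funcscan",
           "--input", samplesheet, "--outdir", outdir]
    if annotation_tool:
        cmd += ["--annotation_tool", annotation_tool]
    for workflow in workflows:
        cmd.append(f"--run_{workflow}_screening")   # activates one screening workflow
    for workflow, tool in skip:
        cmd.append(f"--{workflow}_skip_{tool}")     # switches off a single tool
    return cmd


# Minimal run: only input and output directory (annotation only, no screening).
minimal = build_funcscan_command("samplesheet.csv", "results")

# All three workflows at once, with Prodigal for annotation and one AMP tool skipped.
full = build_funcscan_command(
    "samplesheet.csv", "results",
    annotation_tool="prodigal",
    workflows=("amp", "arg", "bgc"),
    skip=[("amp", "ampir")],
)

print(shlex.join(minimal))
print(shlex.join(full))
```

Building the command as a list keeps each flag separate, so it could be passed to `subprocess.run` unchanged if one wanted to launch the pipeline from a script.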
Let's start with the antibiotic resistance, which is done by hAMRonization, a tool that is already out there. Here you can see the GitHub link. This tool can summarize the outputs of a bunch of resistance identification tools, and our pipeline currently includes the orange-tagged ones. The output of those tools is then summarized into a standardized gene report. And this is how it looks. It's basically a table with very many columns. You have here the sample IDs and the genes that have been identified, some information about the databases, which tools were run and so on. So these are all the column headers, which are very conclusive, and you can use this output table for downstream analysis in R or any statistics program.

This is very similar in AMPcombi, which we developed ourselves, or basically Anan and Louisa developed it, where you also have your sample IDs and then some information about the probability of AMPs. An additional feature is that it not only identifies your antimicrobial peptides, but it also does some back-aligning to a reference database to identify the taxonomic classification. It also infers some chemical properties like stereochemistry and provides the publication, so you can go back and read more about the compound identified.

Okay, the last tool, for the BGC workflow, is comBGC. In a similar fashion, we have the sample IDs, the tools which have been applied, and then more information about your candidate biosynthetic gene clusters.

So with this, you see that we now have a scalable workflow to identify these compounds, which are important for a couple of research fields: as I said, drug development, antibiotic research and so on. The pipeline is almost ready, and it's probably going to be released next week. Let's see about it. We have at least added all the modules and subworkflows. We do some more testing, and then the pull request will go out.
And so I can already advertise: if there is someone here in the chat who would like to review, please feel free to reach out to us on Slack.

Okay, in the future, we would like to include more screening modules and also to have a visual summary of the output, which would be a kind of graphical dashboard, probably with a Shiny app. Let's see about that.

So with that, I would like to introduce the development team, which is James, Louisa, Anan, Moritz, and me. And of course, we got a lot of help from the nf-core community, which is always assisting, a very nice community. I would also like to mention some colleagues here at my institute, who helped with biological and biochemistry knowledge, and my supervisor, Pierre Stallforth, from the Leibniz-HKI. So with this, I would like to close and point you to the repository and the documentation of the pipeline. If you want to interact with us, feel free to join us on Slack. And otherwise, I'm open for questions, either now or later on Slack. So back to you, Franziska.

Thank you very much. Very interesting. So anyone can now unmute themselves if they have any questions. They can also post questions in the chat and then I will read them out. So are there any questions from the audience? Otherwise, I actually have a question. You have shown a minimal command that you can run that doesn't actually specify the workflow that it's using.

Exactly.

Is that going to use all three workflows or a specific one, a default?

So here, for this one, you mean?

Exactly.

So in the default, we have specified none. So this would actually run only the annotation, which is then probably not very useful for you. So yeah, this is the current state of the settings. Maybe we will change this later. I don't know.

And would it make sense to run all three workflows at the same time? Or is that for different kinds of samples?

No, no. That's exactly what it's designed for: to run efficiently on all three workflows.
Yeah, it depends on your interest. If you are not interested in the resistance genes, then of course you don't need to run that workflow. But yeah, it's very efficient to use it as well.

Thank you. Are there any more questions at this moment in time? Otherwise, I thank you again. It was a very nice talk.

Thank you.

And of course I would also like to thank the Chan Zuckerberg Initiative for funding our bytesize talks, and all of our audience for listening to the talk. I hope to see everyone next week. Thank you very much. Bye.