Hello, everyone, and welcome to this week's bytesize talk. I'm very happy to have with me today Francesco Lescai from the Department of Biology and Biotechnology at the University of Pavia. He is very, very busy in nf-core and, among other things, he has also worked on Sarek, but today he's going to talk about another pipeline, nf-core/hgtseq. Off to you.

Thank you, Francesca. So, today I'm going to give you a bit of background on this pipeline and the motivation that inspired us to start this project. I'm going to describe the pipeline components, give you some usage indications, show the performance of the pipeline, and then describe a few future perspectives, which is our homework, basically.

I'm going to start with the acknowledgements. First, Simone Carpanzano, who is the lead author of this pipeline, but as you might imagine, he's heavily engaged in preparing the defense of his Bachelor of Science right now, so he couldn't present today. Then Mariangela Santorsola, who's a key person in my lab and has also contributed to the publication that describes this pipeline. And then, and this is very important, I think, the value of the nf-core community lies in the availability of all the modules, which we have also used in our pipeline. So, a very important acknowledgement goes to all the authors of the different modules that we have used, which are what makes the added value of nf-core so important.

So, starting from the background of this pipeline: horizontal gene transfer is a well-known and well-studied process in biological organisms, and it refers to the transfer of genetic material between two different species when they are in close proximity. This has been very important in evolution because it has contributed new traits, adaptation to new environments, and the capability to use new sources in different organisms.
It's been crucial in evolution, as I mentioned, particularly in archaea and bacteria, but not very much has been known about this phenomenon in higher organisms like mammals, for example.

So, our motivation was mostly inspired by a paper from several years ago that described the detection of microbial reads in exome sequencing data from human projects. That paper was really inspiring for us in the sense that it highlighted that microbial sequences have been found in exome sequencing data, which means the coding part of our genome. It opened a lot of questions about this phenomenon in higher organisms, and it definitely called for end-to-end tools to investigate what is happening there. Of course, I put here a picture of the microbiome, because if you remember the definition of HGT, which is the transfer of genetic material between species that are in close proximity, then we and many other mammals are a living example of this close proximity between different species: we have a whole set of microorganisms that live with us and contribute to our own biology. So, clearly, there's a lot to investigate here.

A couple of definitions for the pipeline that we have developed. First of all, when you map next-generation sequencing reads to a host genome, in our example a human genome, you can have several scenarios. The first scenario, which is the most common, is that if you do paired-end sequencing, both mates in the pair map correctly to the host genome. But you can also have a couple of additional scenarios: one where only one of the two mates in the pair maps and the other is unmapped, and one where both reads in the pair are not mapped to the host genome. We needed a definition for the pipeline, so we have labelled the pairs where only one read is mapped to the genome and the other is not as "single unmapped".
And then we have defined those where both members of the pair are unmapped as "both unmapped". You will find these short definitions recurring later on in the pictures and slides that we present in a moment. Of course, the importance of the pairs where one mate is mapped and the other is unmapped is that they allow us to make assumptions about a potential integration site. We can measure and evaluate the abundance of taxonomic IDs from every read that is not mapped to the host genome, but for those that are members of a pair where one of the two is actually mapped, we can additionally try to infer where that potential integration has happened, thanks to the coordinates that we have from the mapping of the mapped member of the pair.

So, this is the pipeline overview. The pipeline, I think, is relatively straightforward. It includes a part dedicated to alignment and quality control, then the conversion and parsing of the reads that I just illustrated, and classification, using Kraken at the moment. And then a last phase of reporting. We're going to see each of these steps in a moment.

The preprocessing is very important because it has been designed to be plugged in downstream of other studies. I made the example of the initial paper that inspired us to develop this pipeline, which was the discovery of microbial reads within human exome studies. So, our own idea, particularly because we have also contributed to Sarek, was to plug this type of pipeline downstream of pipelines like Sarek: accepting the BAM files, the alignments that have been produced by human exome or whole-genome sequencing studies, and then using the pipeline to process all the reads that have not been mapped. But the pipeline can also start from FASTQ, so using raw reads, and in that case it does a standard alignment to the host genome using BWA. Then we extract the reads that are single unmapped and the reads that are both unmapped. We do this using samtools and the bitwise flags 13 and 5.
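As a rough illustration of the bitwise flags mentioned here, the sketch below decodes the relevant SAM flag bits in Python. The bit values (0x1, 0x4, 0x8) come from the SAM format; the function names are hypothetical and chosen only for this example, and this is not the pipeline's own code.

```python
# SAM flag bits, as defined in the SAM format specification.
PAIRED = 0x1          # read is part of a pair
READ_UNMAPPED = 0x4   # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped


def classify_pair(flag: int) -> str:
    """Label a read by the mapping status of its pair (labels as used in the talk)."""
    if not flag & PAIRED:
        return "not paired"
    read_unmapped = bool(flag & READ_UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if read_unmapped and mate_unmapped:
        return "both unmapped"
    if read_unmapped:
        return "single unmapped"
    if mate_unmapped:
        return "mapped, mate unmapped"
    return "both mapped"


def has_all_bits(flag: int, required: int) -> bool:
    """Mimic a 'require all of these bits' filter, like samtools view -f."""
    return (flag & required) == required


# 13 = 1 + 4 + 8: paired, read unmapped, mate unmapped.
# 5  = 1 + 4:     paired, read unmapped.
for flag in (77, 69, 73, 99):
    print(flag, classify_pair(flag), has_all_bits(flag, 13), has_all_bits(flag, 5))
```

Note that a read with all bits of 13 set also has all bits of 5 set, so a filter on 5 alone would also match both-unmapped reads; isolating the single unmapped reads additionally needs the mate-unmapped bit (8) to be excluded. How exactly the pipeline handles this is not described in the talk.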
And then we further parse the potential integration site for the single unmapped reads using the information from the mapping coordinates of the mapped member of the pair.

At this moment, we are using Kraken 2 to taxonomically classify the reads. In particular, we have chosen this tool because we use the k-mer classification, which is given as a sliding window along the NGS read that we are analyzing, as a way of interpreting the results and doing further QC on the outcome of the taxonomic classification.

This all goes into the reporting phase of the pipeline. We generate traditional Krona plots, which are generated per group: if your analysis has one, two or three different groups, we group the Krona plots by the category of your samples. We use MultiQC, obviously, for the reporting. This also includes an overview of the classification of the reads, thanks to the parsing of the Kraken 2 outputs. And then we perform a preliminary analysis using R Markdown, with a parameterized R Markdown file, which also adds a couple of important pieces of information to the preliminary analysis. One is a classification score: we try to use the information that Kraken 2 gives us in the output to assign a classification score to each of the reads, which further allows us to filter based on the quality of the taxonomic classification. The important information here is the extent of the classification, so how much of the read has been classified and assigned to the taxonomic organism that appears in the result. And then we have also curated, from a number of publications, a list of contaminants that are known to affect DNA extraction kits, and we have further classified the contaminants depending on their potential role in human disease as well. Of course, this is because we are particularly interested in analyzing this phenomenon in humans.

A couple of indications about the usage. This is a typical command line to start the pipeline.
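As an aside on the classification score described above: the talk does not spell out how the score is computed, so the following is only a minimal sketch of the general idea, assuming the score is the share of a read's k-mers that support the assigned taxon. It parses the k-mer mapping field of a Kraken 2 output line, where each `taxid:count` token reports a run of consecutive k-mers, `A` marks k-mers with ambiguous nucleotides, and `|:|` separates the two mates of a pair. The function names are hypothetical.

```python
from collections import Counter


def kmer_counts(lca_field: str) -> Counter:
    """Sum k-mer counts per label from a Kraken 2 k-mer mapping string."""
    counts = Counter()
    for token in lca_field.split():
        if token == "|:|":  # separator between the two mates of a pair
            continue
        label, n = token.rsplit(":", 1)
        counts[label] += int(n)
    return counts


def assigned_fraction(taxid: str, lca_field: str) -> float:
    """Fraction of non-ambiguous k-mers that hit exactly the assigned taxid.

    Assumption: a simplified stand-in for a classification score; it ignores
    the taxonomy tree, so k-mers hitting related taxa are not counted.
    """
    counts = kmer_counts(lca_field)
    total = sum(n for label, n in counts.items() if label != "A")
    return counts[taxid] / total if total else 0.0


# Example k-mer mapping string in the format documented for Kraken 2 output.
example = "562:13 561:4 A:31 0:1 562:3"
print(kmer_counts(example))
print(assigned_fraction("562", example))
```

In this example, 16 of the 21 non-ambiguous k-mers support taxon 562. A score of this kind could be used as a filter threshold, but the actual definition used in the pipeline's report may differ.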
We use the input sample sheet as a comma-separated values file, as in most nf-core pipelines. When we use the iGenomes genome indication, we indicate the host genome there, and this is used by the first part, which performs the host genome alignment. And then we pass the host taxonomic ID, which is used to filter the results in the R Markdown report. Two very important parts of this command line are the path to the Kraken database and the path to the Krona database, each of which can be indicated either as a path, if you have it locally, or as a tar.gz file, which can also be online or in a repository that you might have in a cloud resource. The inputs, as I mentioned at the beginning, can be either raw reads, with a FASTQ input as you can see in the first example, or already aligned BAM files coming from another pipeline, which you can see in the second example of input. And here I also have to say that the database for Kraken is obviously crucial for the classification, because the whole point of this pipeline is assigning a taxonomic classification to the unmapped reads. So the way the Kraken database has been built will obviously have a huge effect on the results that you're able to report, and so on the taxonomic IDs that you're able to detect in your reads.

A couple of words about the performance. We have tested this pipeline on different species, to demonstrate the existence of the phenomenon not only in humans but also in other mammals. This is an overview of the execution of the pipeline on 10 human exomes, and you can see that they are processed on our local cluster in about three hours. So this is quite good; the pipeline runs very smoothly. And then we have also reported CPU and memory usage for the most intensive tasks. There's nothing major to discover here. In terms of memory in particular, Kraken and Qualimap are quite intense.
Again, the amount of memory used by Kraken definitely depends on the database that is used for the classification. Qualimap is known to be quite greedy with memory; in Sarek it has been swapped for mosdepth, and we might do the same in a future version of the pipeline for the same reasons.

So, homework, mostly. As I mentioned, Kraken has a very useful type of output where you can see the assignments to taxonomic IDs along a sliding window of the k-mers the read has been split into. This will allow us to draw much more information, in terms of classification filters or heatmaps, that will allow us to better investigate the biology regulating this type of event. We will probably dedicate some work to the optimization of the computing part of the pipeline, I just mentioned the issues with Qualimap, and certainly to improving the preliminary analysis report, which currently runs only on humans. We are also considering the introduction of alternative taxonomic classifiers, and here we have a number of examples in other nf-core pipelines.

So I hope this is enough of an overview for now. We have very recently published a paper in the International Journal of Molecular Sciences, where the nf-core community is a collective author as well. There you can find more details, particularly about the scientific findings that we have collected by analyzing the different species we used for testing the pipeline. And I'm open to take any questions.

Thank you very much. I have now enabled everyone to unmute themselves if they want to ask any questions, or you can also write questions in the chat and I will then read them out. It seems that it was very clear to everyone. So if there are any further questions, you can always ask them in the nf-core Slack, or you can ask Francesco directly, I assume.

Yes, definitely. Either on Slack or via email.
Then I would like to thank you again, Francesco, and also the Chan Zuckerberg Initiative for funding our bytesize talks, and all of the audience for listening. Thank you very much.

Thank you.