I'm an ancient DNA hunter, working as a computational biologist at the Department of Computational Biology of the University of Lausanne and at Vital-IT at the Swiss Institute of Bioinformatics. I will present mapache, a flexible pipeline to map ancient DNA, a workflow we published in January 2023 in Bioinformatics. The talk is aimed at everybody interested in running an ancient DNA mapping experiment, beginner or experienced. The workflow is, however, not limited to ancient DNA; it may also be used to map modern DNA.

So what can ancient DNA tell us? We use ancient DNA to reconstruct the human past, to get clues about past environments, to study extinct species, and to investigate the spread of microbes. As sequencing technology evolves, we can extract ancient DNA from more and more samples, including ever older ones, and we obtain more and more DNA from a single sample. So studying ancient DNA is becoming more and more important.

So what is ancient DNA, and why is it so difficult to work with? Here I list the characteristics of ancient DNA to show you why it is so important to have a good workflow to map it. Typically, ancient DNA is sparse: often only around 1% of the sequences are of human origin. Due to degradation processes, the DNA is fragmented into small pieces, typically less than 60 base pairs in length, which creates problems during sequencing and mapping. Moreover, molecular damage accumulates after death and is visible towards the ends of the fragments, as shown in the figure: typically you see an increased rate of C-to-T and G-to-A substitutions, depending on the library preparation protocol. Finally, ancient DNA is often contaminated: first by environmental microbes, because the dead body was lying in the soil, but also by humans during excavation and lab work. And since ancient DNA is sparse, such contamination is difficult to distinguish from the real ancient DNA.

So there are several challenges to deal with during mapping. Since ancient DNA is sparse, the chance of sequencing the same DNA molecule several times increases, so we have to remove these artificial duplicates. Because the fragments are short, the sequencer reads into the adapter following the DNA insert, so we have to remove the adapters before mapping the reads to the reference genome. The increased substitution rate towards the ends of the reads introduces artificial errors, so we either trim the ends of the reads or adapt the mapping to allow for this increased error rate (a small sketch below shows how this damage signal can be measured). And the mixture of different taxa, or of different humans, is problematic because we cannot simply sort it out: we can assess the contamination, and if it is too high, we have to exclude the sample and focus on the better samples.
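To make this damage signal concrete, here is a minimal sketch of how the C-to-T substitution rate along the read could be quantified. This is an independent illustration, not code from mapache: it assumes pysam is installed, that mapped.bam (a placeholder name) contains reads aligned with MD tags (required by get_aligned_pairs(with_seq=True)), and, for simplicity, it profiles the 5' end of forward-strand reads only.

```python
# Sketch: tally C->T mismatches by distance from the 5' read end,
# the classic ancient-DNA damage signal.
import pysam

N = 25                # positions from the 5' end to profile
ct = [0] * N          # C->T mismatches per position
c_total = [0] * N     # reference C's observed per position

with pysam.AlignmentFile("mapped.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or read.is_reverse:
            continue  # forward strand only: query position 0 is the 5' end
        seq = read.query_sequence
        # with_seq=True also yields the reference base (lowercase on mismatch)
        for qpos, rpos, ref in read.get_aligned_pairs(with_seq=True):
            if qpos is None or rpos is None or qpos >= N:
                continue  # skip indels and positions beyond the window
            if ref.upper() == "C":
                c_total[qpos] += 1
                if seq[qpos] == "T":
                    ct[qpos] += 1

for i in range(N):
    rate = ct[i] / c_total[i] if c_total[i] else 0.0
    print(f"position {i + 1}: C->T rate = {rate:.3f}")
```

In genuine ancient DNA, this rate is elevated at the first position and decays over the first few bases; a flat profile suggests modern contamination or a library protocol that repairs the damage.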
So you see, mapping ancient DNA is challenging. It is repetitive, since the same sample is often re-sequenced across several runs to increase coverage, and it is time-consuming. So it is really important to have a mapping workflow which is reproducible and efficient; that was the aim of this work.

We thought about what we wanted as requirements and decided on a lightweight mapping workflow which simply goes from the sequences, the FASTQ files, to the alignments, the BAM files. Of course, it should be automated: if the workflow gets stuck for any reason, it should be possible to restart it without re-computing the files that are already present. The workflow has to be efficient in speed but also in space. Since ancient DNA libraries are highly variable in size, the workflow has to scale well with different data sizes, but also with different computing infrastructures: ideally, you could run the workflow on a notebook to test it, on a cluster, or even in the cloud. A reproducible workflow has to be readable, not only for the developer but for everybody, in order to find problems or bugs in the code, but also to extend or modify it. And finally, a workflow has to be informative: at the end, we want information about the final BAM file, but we also want intermediate statistics to know whether the workflow succeeded.

These are a lot of requirements, so we took advantage of a workflow manager. We decided to use Snakemake, a great workflow manager developed in Johannes Köster's lab. On top of it, we developed this workflow, called mapache, a flexible pipeline to map ancient DNA, and I would like to acknowledge all the co-authors.

So what does mapache do? We have FASTQ files with the sequences and a mapping workflow we want to run; this is a simplified view of it. For this, we need a reference genome and a configuration file, which allows us to parameterize the workflow. At the end, we get the alignments, the BAM file, and from this BAM file we want some statistics to check whether the mapping makes sense. Everything is encapsulated in an HTML report. Of course, we usually do not have a single FASTQ file but multiple files, which lead to multiple BAM files, for multiple individuals for example. In a recent study in our lab, we analyzed six ancient Greek samples (Clemente et al., 2021). Initially we had 295 FASTQ files, which led to six BAM files for these six Greek individuals.

Let's see what this means in numbers of tasks. A single FASTQ file leading to one BAM file requires 16 main tasks. If we consider all tasks, including reporting and the computation of summary statistics, we already have 43 tasks for a single FASTQ file. And for the entire study of Clemente et al. (2021), we have almost 5,000 tasks in the workflow. We are really happy to have such great workflow managers to deal with these 5,000 tasks, so that in the end all of them are computed and none of the files is corrupted.

We benchmarked mapache against another workflow for ancient DNA, nf-core/eager, which was published in 2021. Normally a workflow keeps all intermediate files, as nf-core/eager does. mapache, by default, removes the intermediate files and keeps only the final ones, but it can also be run keeping all intermediate files; for this benchmark we ran mapache both ways. The benchmark showed that mapache is slightly faster than nf-core/eager, but more importantly, mapache uses significantly less storage. nf-core/eager used 44 GB for this dataset, and its intermediate files can be deleted only after the run, which brings it down to 16 GB of disk space. Running mapache in the same mode, keeping the intermediate files, results in 33 GB of disk usage, about one quarter less than nf-core/eager. Moreover, mapache can be run in its default mode, removing intermediate files on the fly.
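In Snakemake terms, this on-the-fly removal is what the temp() flag on an output file provides: the file is deleted as soon as every job that consumes it has finished. The following is a hypothetical rule sketch, not mapache's actual code; the rule name, file paths, and the AdapterRemoval call are only illustrative.

```python
# Hypothetical Snakemake rule: the trimmed FASTQ is an intermediate file,
# so it is marked temp() and deleted once all downstream jobs have used it,
# keeping peak disk usage low.
rule trim_adapters:
    input:
        "fastq/{sample}.fastq.gz"
    output:
        temp("results/{sample}.trimmed.fastq.gz")
    shell:
        "AdapterRemoval --file1 {input} --output1 {output} --gzip"
```

Conceptually, whether such outputs are wrapped in temp() or not is the difference between the two modes compared in this benchmark.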
In this default mode, the peak disk usage is less than 10 GB during a run, and the final files remaining after the run need about 10 GB less disk space than what nf-core/eager leaves behind. This disk-space efficiency of mapache is important for large mapping experiments, or when remapping libraries to another reference genome.

To run mapache, you need a configuration file where you can parameterize the mapping: whether to subsample the initial FASTQ files, whether to clean the FASTQ files, which mapper should be used, and so on. That is only a small part of the configuration file. You also need a sample file, which links each FASTQ file to a library and to a sample: a sample may have multiple libraries, and a library may have multiple FASTQ files. Running mapache is quite simple: you just type snakemake, the name of the workflow manager, and pass the number of cores to use. The workflow then starts and lists a summary of what it will do. This is really transparent and shows you whether the configuration file was read correctly.

At the end of the run, we are of course interested in the BAM files, but also in the HTML report. The report contains workflow metrics, which are really useful: for each task, we see how long it ran, which allows us to optimize the workflow for a given dataset. It also reports the time point at which each task was executed, which is great for reproducible research. We are mainly interested in the summary statistics, reported as plots, to quickly check whether the mapping experiment went well; but the plots are only one side, and we also get all the numbers and other summary statistics as a table. We report summary statistics for each FASTQ file, for each library, and for each sample. Apart from this, the HTML report also contains the configuration file and the workflow schema. So everything is encapsulated within a single HTML file, which is great for reproducible research: this file can be shared with your collaborators and allows the entire mapping to be reproduced.

The key features of mapache are that it scales well for different data types and sizes. mapache can be used for an initial ancient DNA screening, where you have many small FASTQ files and want to know whether there is any human DNA in them. It can be used for high-coverage genomes, as I showed for the Greek study, where you map large FASTQ files to a single genome. But you can also use mapache for metagenomics, where you map FASTQ files to multiple reference genomes, for example to all viral reference genomes. Importantly, mapache uses little disk space, as it deletes intermediate files on the fly. mapache is able to map to multiple reference genomes, which is not the case for other workflows. And finally, the typically low-coverage mappings can be improved by imputing missing genotypes using a reference panel, which may become more and more important in the future. mapache is freely available on GitHub.

And with that, I would like to thank all my collaborators, and you for listening. Thank you.