Hello, everyone. My name is David Dylus. I was formerly a postdoc in Christophe Dessimoz's group at UNIL, and I currently work as a senior scientist in the systems biology group in the research division of Roche pRED. What I'm going to tell you about today is a project that I started in Christophe's group and continued after hours over the last couple of years, including during my time here at Roche. It's called Read2Tree, and it's about the inference of phylogenetic trees from raw sequencing data.

Let me start with a small introduction. Let's assume that I'm a nature enthusiast and I like to walk through the forest, and in this forest I see a nice butterfly, depicted here in yellow and blue. I'm curious: I want to understand what species this butterfly is and what its closest relatives are. Specifically, by building a phylogenetic tree, I want to understand whether the species is currently invading our ecosystem. If that species, for instance, grouped with other butterflies that you find in a very distant location, I would be a bit concerned that the species is invading our own ecosystem, and I might inform some conservation organizations.

So how would I go about doing that? The standard way would be to look the butterfly up in a butterfly guidebook, like the one depicted here, using phenotypic characteristics such as the yellow color of the wings and the blue circles, and check what species it is. But since I'm a bioinformatician, I would take that specimen, sequence its genome, and build a phylogenetic tree to determine exactly what species it is, what its relatives are, and so on, and arrive at a very precise definition of the species. In the past, and actually still today, what bioinformaticians would do is resort to high-throughput sequencing.
They would take the specimen and generate what we call reads, which are fragments of DNA made up of the letters A, C, G and T. Then you would follow a standard pipeline: read filtering, genome assembly, error correction and annotation; then you would infer homology to other species, then orthology, and finally you would align the genes, and from this alignment you could infer a tree. You see that this is quite a big, time-consuming process. It requires many different tools, and each step can introduce errors. Usually we would also need around 100x coverage for a good assembly, so quite a lot of data to get to a good tree.

What we propose in our recent paper is to sidestep this whole procedure by using reference orthologous groups. This is where prior knowledge comes in: for instance, knowing that this is a butterfly, we would look for other butterflies. We feed the output of the high-throughput sequencing directly into our pipeline, called Read2Tree, and do the tree inference from there. So we use a single process to sidestep all the steps before.

So how does this process work? Let's take our reads, the blue lines here, and our orthologous groups. These are different gene families from species A to D, which we take from the OMA browser. First we align each group, so that the corresponding nucleotides are aligned. Then we map our reads to the different sequences in this data set, and from the mapped reads we build consensus sequences, each a single sequence representing the reads mapped to one reference. Because we get multiple consensus sequences, we select the best one based on some criteria. We place the selected sequences back into our multiple sequence alignment and use exactly this to infer a tree, and you see here that our butterfly is very closely related to species D.
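The mapping, consensus and selection steps described above can be sketched roughly as follows. This is a minimal illustration, not Read2Tree's actual code. The talk only says the best consensus is chosen "on some criteria", so the criterion used here (the fraction of positions that are not `N`) is an assumption, as are the function names `build_consensus`, `resolved_fraction` and `select_best`, the toy reads, and the simplification that each read is already placed, ungapped, at a known start position.

```python
from collections import Counter


def build_consensus(ref_length, placements):
    """Majority-vote consensus from ungapped read placements.

    placements: list of (start, read_sequence) pairs on one reference.
    Positions without any read stay 'N'.
    """
    columns = [Counter() for _ in range(ref_length)]
    for start, read in placements:
        for offset, base in enumerate(read):
            position = start + offset
            if 0 <= position < ref_length:
                columns[position][base] += 1
    return "".join(
        column.most_common(1)[0][0] if column else "N" for column in columns
    )


def resolved_fraction(consensus):
    """Share of positions that are not 'N' (assumed selection criterion)."""
    return sum(base != "N" for base in consensus) / len(consensus)


def select_best(consensus_by_reference):
    """Return the (reference_name, consensus) pair with the highest score."""
    return max(
        consensus_by_reference.items(),
        key=lambda item: resolved_fraction(item[1]),
    )


# Toy orthologous group: reads placed on the sequences of species A and D.
placements = {
    "A": [(0, "ACGT")],
    "D": [(0, "ACGTT"), (3, "TTCGT")],
}
consensus_by_reference = {
    name: build_consensus(8, reads) for name, reads in placements.items()
}
best_reference, best_consensus = select_best(consensus_by_reference)
print(best_reference, best_consensus)
```

In this toy case the reads cover the species D reference completely and the species A reference only halfway, so the consensus built on D is the one that would be placed back into the alignment.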
So these are the inner workings of our pipeline. Now that we understand how the pipeline works, we need to show that it's actually doing a good job. For that, we need to characterize how it performs depending on the distance to the closest reference, because we said we rely on some prior knowledge. There are also different sequencing technologies, so we need to characterize it with regard to those. We can also use different coverages, so very little data or a lot of data, and see how it works with that, and also across different phyla. In the publication we have examples where we used A. thaliana, the thale cress, a plant; a yeast, Saccharomyces cerevisiae; and a mouse data set with all sorts of different animals.

I'm going to explain how we set up this experiment on the mouse data set. What we want to assess is how well we can place mouse reads into this tree, that is, how accurately we can reconstruct the tree. So we remove the mouse from the tree, take mouse reads from a different data set, do the whole mapping, and then see whether the tree we reconstruct is the same as the one here. Then we remove more and more data: next it would be without mouse and rat, so the closest species would now be 90 million years apart; then we remove mouse, rat, human and the other apes, and the closest neighbor would already be 312 million years away; and so on. This gives us an understanding of how well our tool performs when the closest species is a very distant neighbor.
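The progressive removal of close relatives described above can be sketched as a small helper that, for each level, keeps only the references at or beyond a given divergence time from the query. This is only an illustration of the experimental design under stated assumptions: the function name `removal_levels` is hypothetical, and apart from the 90 and 312 million-year steps mentioned in the talk, the species names and the value for rat are placeholders, not figures from the study.

```python
def removal_levels(divergence_mya, query):
    """List (closest_remaining_distance, remaining_species) per removal level.

    divergence_mya maps each species to its divergence time from the query,
    in millions of years. At every level, all references closer than the
    cutoff are dropped, so the cutoff is the distance to the closest
    reference that is still available.
    """
    others = sorted(
        (distance, species)
        for species, distance in divergence_mya.items()
        if species != query
    )
    cutoffs = sorted({distance for distance, _ in others})
    return [
        (cutoff, [species for distance, species in others if distance >= cutoff])
        for cutoff in cutoffs
    ]


# Placeholder divergence times from mouse (only 90 and 312 come from the talk).
divergence_from_mouse = {
    "mouse": 0,
    "rat": 20,       # placeholder value
    "human": 90,
    "chimp": 90,     # stands in for "other apes"
    "chicken": 312,  # placeholder species for the 312-million-year step
}

for closest, remaining in removal_levels(divergence_from_mouse, "mouse"):
    print(closest, remaining)
```

Each printed level corresponds to one reference set onto which the mouse reads would be mapped before the tree is reconstructed and compared with the full reference tree.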
If we do this, we see quite nicely, looking specifically at tree precision and tree recall, which relate the number of correct branches to the number of branches that we obtain, that with close distance and large coverage we have a tree recall of one and a tree precision of one. This is reduced when we are extremely far away, here at levels five and six, so around 600 million years away, and also when we use data with extremely low coverage; here we have 0.2x and 0.5x, which are extremely low coverages. You see that even with these low coverages we can still place our mouse quite well into the tree for not-so-distant neighbors, as depicted in the top line. Then we have to repeat this: this is an example where we used PacBio technology, and we can also look at Illumina and Oxford Nanopore technologies. We see that we do a similarly good job, and for Illumina even an extremely good job, in placing the species at the correct position and producing a tree that is very similar to our standard reference tree.
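Tree precision and recall as mentioned above can be illustrated with a small sketch. It assumes the common definition based on bipartitions (splits): precision is the share of internal branches in the reconstructed tree that also occur in the reference tree, and recall is the share of reference branches that are recovered. The talk does not spell out the exact formula, so this definition, the nested-tuple tree representation, the function names and the toy trees are all assumptions for illustration.

```python
def leaf_sets(tree, collected):
    """Return the leaf set of `tree`, appending each internal node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, collected) for child in tree))
    collected.append(leaves)
    return leaves


def splits(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of leaf names."""
    collected = []
    all_leaves = leaf_sets(tree, collected)
    anchor = min(all_leaves)  # fixed leaf used to pick one side of each split
    result = set()
    for side in collected:
        if anchor in side:
            side = all_leaves - side
        # Skip trivial splits that separate nothing or a single leaf.
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def precision_recall(inferred, reference):
    """Return (precision, recall); 0.0 if a tree has no internal branches."""
    inferred_splits = splits(inferred)
    reference_splits = splits(reference)
    correct = len(inferred_splits & reference_splits)
    precision = correct / len(inferred_splits) if inferred_splits else 0.0
    recall = correct / len(reference_splits) if reference_splits else 0.0
    return precision, recall


# Toy trees: the reconstructed tree misplaces rat relative to the reference.
reference_tree = ((("mouse", "rat"), "human"), ("chicken", "frog"))
reconstructed_tree = ((("mouse", "human"), "rat"), ("chicken", "frog"))
print(precision_recall(reconstructed_tree, reference_tree))
```

When both trees are fully resolved and contain the same species, they have the same number of internal branches, so precision and recall coincide; they diverge once one of the trees contains unresolved nodes.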
Having done that, we have a good understanding that our tool, across technologies, coverages and so on, does a reasonable job of placing the species correctly and computing a tree that is quite correct. But what we set out to do is to show that we do a good job in comparison to a standard pipeline. To do this, we take our reference tree, which is again for mouse, the thale cress or the yeast, and compare it with the tree obtained from the assembly. This gives us some type of distance between the two trees, and you see they are a little bit different here. We do the same for our reference versus our Read2Tree-reconstructed tree. When the difference between the two distances, the Read2Tree distance minus the assembly distance, is negative, we say our tool outperforms the standard pipeline.

If we do that and look at just the Illumina data, we see that for very close distances we actually do a better job than assembly, and even for higher distances we do an equally good job, depicted here in gray. Then for some species, when the distance is extremely large and the coverages are high, the other tools perform better. What is important to mention here is that for extremely low coverages, 0.2x and 0.5x, we cannot even do an assembly; no assembly can be produced, and there we are still capable of producing quite accurate trees for all three species that were tested. We see the same observation if we look at other technologies; this is just depicted here for completeness. We can summarize that for low coverages only our tool can place the species in the tree, and it can do so quite accurately, and that for increasing distances and high coverage levels the standard pipeline starts to produce more accurate trees. Specifically, we see that here for the yeast and for mouse, and for the thale cress
our tree-generation tool actually performs better than the standard assembly pipeline approach. So this is all well and good, but we set out to build a simpler process, so we also wanted to show whether the simpler process is faster. What we see here is a direct comparison across the different technologies for the different data sets that we have, and we see quite nicely that Read2Tree is around 10 to 100 times faster for most technologies; specifically, for 20x coverage we are nearly always faster in producing phylogenetic trees.

Finally, you might think: that's cool, but where would you apply such a tool? In the paper we show a couple of applications, and you can have a look later to learn more about those. Specifically, we show one application for the recent COVID outbreak that we all went through. Here you see that we applied it to 10,000 raw SARS-CoV-2 data sets from different strains, and that we are capable of placing the different sequences into the right CDC variants that were recognized. We also accurately recovered the main coronavirus genera and all their subgenera, and you see that we get all the different strains here. So that's quite nice as an application: the tool is extremely simple to use, and we can produce a tree like this in quite a short amount of time.

To summarize: in this study we produced a novel tool called Read2Tree, a simple, efficient tool to quickly obtain phylogenies from raw sequencing reads. It is as accurate as a standard pipeline, it is 10 to 100 times faster than the standard approach, and it works at extremely low coverage levels across a range of technologies. I've put different QR codes here: there's a nice blog post about the story behind the paper and the project, and there's the paper itself,
and here is also the link to the GitHub. This is the title of the paper, so please feel free to have a read. I just wanted to mention that this was definitely not a project done by myself alone: it was a strong collaboration with Fritz Sedlazeck from the Baylor College of Medicine and Christophe Dessimoz from UNIL and the Swiss Institute of Bioinformatics. I specifically also want to thank Adrian and Sina for helping to finalize the project, for doing computations for it, and for helping with writing the paper. Thank you all for your attention, and enjoy this tool.