So, welcome everyone to the virtual SIB Computational Biology Seminar Series. Today we have the pleasure of hosting Christophe Dessimoz from the Computational Evolutionary Biology and Genomics group. Christophe obtained his Master's in Biology in 2003 and then his PhD in Computer Science in 2009, both from ETH Zurich in Switzerland. After a postdoc at the EBI, the European Bioinformatics Institute in Cambridge, he joined University College London as a lecturer in 2013; he became a reader two years later, and since 2015 he has been at the University of Lausanne, both at the DEE, the Department of Ecology and Evolution, and the CIG, the Center for Integrative Genomics, as a Swiss National Science Foundation Professor. He also retains an appointment at UCL, where part of his group remains active, and since this year he is also a group leader at the Swiss Institute of Bioinformatics. Working at the interface between biology and computer science, Christophe's laboratory seeks to better understand the evolutionary and functional relationships between genes, genomes and species. The group's activities are divided between the development of bioinformatics methods and resources, and their application, most typically in collaboration with experimentalists. So today Christophe will share with us how to make the most out of noisy, low-quality genomes. Christophe, thank you again for accepting this invitation, and the floor is yours.

Thank you, everyone, for coming this afternoon, online or in person. My talk is going to be about how to make the most out of low-quality genomes. You may think: why make a big deal about low-quality genomes? Isn't the technology progressing so quickly that these will be irrelevant soon? But if we look at the current state of affairs, it's actually rather concerning.
If we look at the Genomes OnLine Database, the GOLD database, which some of you might know, we can see that most of the genomes deposited now are in "permanent draft" status, which basically means that people have no plan to improve them further, and that massively dwarfs the complete genomes. And that's really just the tip of the iceberg. If we look at the rest of the data being generated these days, we also have a lot of projects that are still in draft, where people have some hope of perhaps improving them but no very precise schedule; all the projects that are not even registered; increasingly, genome sequencing projects initiated by individual labs; and then all of the efforts that don't aim at generating a full resource anyway, maybe because they are only interested in exomes, or are doing transcriptomics of some sort that samples part of the genome, or even use techniques that are only ever going to recover a small fraction of the data. So although we can hope for technological improvements, I think it's going to get a lot worse before it gets better, and we really have to find ways of coping with all this noise and uncertainty. To think about the consequences of this noise, it helps to think about the analyses we perform: typically we start with some sequencing, then assembly, then genome annotation, homology clustering, orthology inference, and then, depending on your interest, perhaps you want to build species trees and relate this diversity in terms of entire species; or maybe the interest is more in terms of genes, in which case this will be correlated with functional aspects such as expression, traits or functional annotations.
And then really the question is: if we have low-quality genomes, how do errors propagate across all of these pipelines, and how can we develop methods that are able to cope with this uncertainty, that are somehow robust? This is something that we are concerned about, and today I have four small projects to report on that are all, in one way or another, aiming to make the most out of these noisy genomes. I'll start with orthology benchmarking; I will get into the details very soon, but as you can see in the pipeline, orthology is rather at the basis of many of these analyses. Then, if we are combining information from multiple genomes, we have to think of a framework for doing so in a way that takes all of the signal while leaving out the noise. I will also talk a little bit about how to use these comparisons to infer split genes, the fragmented genes that are quite common in many genomic data sets. And I will finish with an approach to looking at the heterogeneity that is visible in some of these analyses, which sometimes reflects complex processes but quite often is just an artifact of low-quality genomes.

First, I have one introductory slide. Many of you will be familiar with these concepts, but I think it's worth reminding everyone, because they will appear a few times in the talk. So everyone starts with the concept of the species tree. In this tube here you see an ancestral genome; I'm using the mouse pointer so that the online audience can also see this. We start with an ancestral genome, and then two speciation events give rise to the amphibian, the human and the dog here. It's obviously a simplification.
But if we look inside these species, we can think of a gene here, and in this case we can think of it as the unit of evolution. In this model we will just treat it as a unit, which undergoes a duplication here, so we have two copies in that common ancestor, and those get carried over into the extant species, except in the dog, which loses a copy. So this is a gene tree inside our species tree. It looks maybe a bit unfamiliar because of these overlaps, but really it's a gene tree that is represented here, and the gene tree conveys a lot of information about the evolutionary scenario; but it's a rather complex data structure to work with, and so we like to think in terms of relationships among sequences, or even pairwise relationships. That's why we have these other concepts of homology. Homology is the idea of common ancestry, so we would say that all of these sequences here are homologous, and then we can subdivide this into more fine-grained subtypes of homology: pairs of genes that have diverged by a speciation event we call orthologs; pairs that have diverged by a duplication event, like this blue and orange gene, which we see arose through a duplication when we go back in time, we call paralogs. We can define other evolutionary relationships, such as xenologs for genes that result from horizontal gene transfer, or homoeologs for those that arise through hybridization, but in this talk we just focus on orthologs and paralogs. These are useful concepts to summarize some aspects of the evolution without going into too much detail, for instance if we are interested in functional propagation or in functional aspects that are common to a clade.
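Since orthologs and paralogs are defined by the event at the last common ancestor in a reconciled gene tree, the definition can be sketched in a few lines of code. This is purely illustrative (the nested-tuple tree format and the gene names are my own toy example, not anything from the talk): internal nodes carry an "S" (speciation) or "D" (duplication) label, and every pair of leaves is classified by the label at their last common ancestor.

```python
# A node is ("S", left, right) for a speciation, ("D", left, right)
# for a duplication, or a leaf string such as "human_ADH1A".

def leaves(node):
    if isinstance(node, str):
        return [node]
    return leaves(node[1]) + leaves(node[2])

def pairwise_relations(node, rels=None):
    """Map each unordered gene pair to 'ortholog' or 'paralog'.

    The LCA of any leaf under `left` and any leaf under `right` is
    this node, so its event label decides the relationship."""
    if rels is None:
        rels = {}
    if isinstance(node, str):
        return rels
    event, left, right = node
    kind = "ortholog" if event == "S" else "paralog"
    for a in leaves(left):
        for b in leaves(right):
            rels[frozenset((a, b))] = kind
    pairwise_relations(left, rels)
    pairwise_relations(right, rels)
    return rels

# Toy scenario from the talk: a duplication, then speciation; the dog
# has lost one of the two copies.
tree = ("D",
        ("S", "human_ADH1A", "dog_ADH1A"),
        "human_ADH1B")
rels = pairwise_relations(tree)
print(rels[frozenset(("human_ADH1A", "dog_ADH1A"))])   # ortholog
print(rels[frozenset(("human_ADH1B", "dog_ADH1A"))])   # paralog
```

Note how the dog's surviving gene is simultaneously an ortholog of one human gene and a paralog of the other, which is exactly why these are pairwise, not group, relationships.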
It's often useful to look at the orthologs because, for instance, if we look at the human and the dog, these two genes here, which form two pairs of orthologs, were the same gene in the last common ancestor. So if there is any function that was retained, it's likely that we can learn something from these genes. That's why orthologs are interesting. And if we want to build a species tree, we obviously want genes that are related only through speciation events.

That brings me to the first part of the talk, which is about orthology benchmarking. This is work that was done with Adrian Altenhoff and the Quest for Orthologs consortium, and the motto of this work is: you can't improve what you can't measure. That sounds quite banal perhaps, but it turns out to be quite hard to measure orthology quality; in fact, surprisingly hard for something so fundamental. The main conceptual reason is that, as you just saw from the definition, you are trying to infer events that happened maybe millions, or hundreds of millions, of years ago, and you have a very limited amount of information to do these inferences. If the events happened at a relatively fast pace, it may in some cases be impossible to reconstruct them. You know all the controversies that arise around the placement of entire species; you can imagine that if you're dealing with individual genes, it is all the more difficult. Those are the conceptual problems, but there are also very practical issues. If you look at the resources that are made for orthology, they vary quite a bit. For a long time they varied simply in how they defined groups of orthologs, in what they really meant by orthology. I just gave you what is now the canonical definition, and I think it is accepted, but for quite a long time this was rather inconsistent.
And even when resources agree on what it is they are trying to infer, they may differ in the taxa they cover, in the releases of the databases they analyze, and in their identifiers, which creates serious barriers to comparison. You might think, okay, let's take a common set of data and run these methods; but it turns out that most of them are not available as standalone methods, so you can't actually run them on your own data. And even if you could, it would perhaps take a very long time to run them on a representative set of genomes. These are really big practical problems, and we might say, okay, we'll just give up; but giving up would mean not measuring the quality of our algorithms, and how can we make progress that way? So there's really no way around it. And even if we did put in all the time and effort to assess the situation once, because it's such a fundamental problem there are new methods proposed almost on a monthly basis, so the moment you finish your analysis it's probably already out of date. We've been working on this problem for quite some time; we came up with different ideas about how to map information across resources, or doing simulations, or looking at some special case studies. All of that gives some idea of how these methods perform, but it's really not so satisfactory. What we realized is that we cannot solve this problem just by ourselves, and that was one of the main motivations for the Quest for Orthologs consortium.
So basically, over the course of a number of years, we've worked as a community to tackle this problem. It started with agreeing on definitions, which was surprisingly challenging, so it took some time just to get there, and then with agreeing on reference data sets that everyone would find worthwhile and representative. It's only after several years that we started having the first results from these common benchmarking efforts, although admittedly most of the activity was happening shortly before each of the meetings; but nevertheless, it's not something that can happen overnight. And I want to report on the new orthology benchmark service that we just published this month. The idea is that we have a public reference proteome set, and people who develop methods can run their predictions on it and submit them; then a battery of tests is run on these predictions by a web service, using different criteria: different reference species trees, experimental functional annotations to look at the conservation between orthologs versus non-orthologs, and so on. At the end, the results are summarized and fed back to the submitter, who can then decide either to make the results public, presumably perhaps alongside a publication, or decide that the method is not quite ready for prime time and that a bit more work is needed. And if the results are public, then when we have a new method or some refinement, it's possible to compare the results.
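At its core, any such benchmark boils down to scoring predicted ortholog pairs against some reference. A toy precision/recall computation gives the flavor (the gene names and pairs are invented; the actual service relies on species-tree congruence and functional-annotation tests rather than a simple gold standard):

```python
# Toy scoring of predicted ortholog pairs against a reference set.
# Pairs are unordered, so each is normalized to a frozenset.

def precision_recall(predicted, reference):
    predicted = set(map(frozenset, predicted))
    reference = set(map(frozenset, reference))
    tp = len(predicted & reference)                 # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

reference = [("humanG1", "mouseG1"), ("humanG2", "mouseG2"),
             ("humanG3", "mouseG3")]
predicted = [("humanG1", "mouseG1"), ("humanG2", "mouseG3")]
p, r = precision_recall(predicted, reference)
print(p, r)   # 0.5 0.3333333333333333
```

The precision/recall trade-off mentioned later in the talk is exactly what such a score pair exposes: a method can make few, safe predictions (high precision, low recall) or many, riskier ones.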
And it looks like this: there is a web page, people can submit their data, and then after a few minutes or a few hours, depending on the test, they get this type of plot, where they see the performance of their method compared to everyone else's; this addresses a lot of the practical issues I discussed a few minutes ago. So I think this is the main contribution: being able to measure the quality of ortholog predictions in a way that is much more practical. But along the way we also observed a few things that may be of interest. One of them, which was by the way perhaps easier for a consortium to accept, is that there is really no single winner. The methods that end up being compared typically make some sort of trade-off between precision and recall: how many predictions they make versus how good these predictions are. However, we have to keep in mind that people who are unhappy with their submission can keep it private, so this is almost a consequence of the design; there were quite a few submissions that ended up not being released, either because they revealed bugs or some really unexpected behavior. I think it was also surprising because people like to think in terms of conceptual differences among methods, and yet when we look at the performance, there were no obvious differences between the different conceptual approaches, perhaps because these genomes sometimes contain contamination or artifacts or things that violate the models under which the predictions are made. In particular, it was striking that the methods that assume a particular species tree didn't seem to do better, even when looking at the congruence of the predictions with an accepted topology.
So it is difficult to summarize in just a few lines what a method is doing when you are dealing with these complex pipelines, and that's really why measuring is so important. On a more positive note, we also noticed a surprising consistency across the different benchmarks, even though some of them use very different criteria and cover very different parts of the tree of life. So there does seem to be some general consensus.

Okay, I'm moving already to the second part. Let's assume that, with this way of measuring performance, orthology inference can do a great job of inferring pairs of orthologs; but if we are dealing with many genomes, and in particular low-quality ones, we want a way to combine this information across more than two species at a time. The example here is a classic one when we talk about gene evolution: alcohol dehydrogenase, which as you might know is good for breaking down alcohol and has duplicated quite a bit in the primates, perhaps as a result of our lifestyle. Here I have part of the gene tree for this gene in primates, and you can see the human has three copies: ADH1A, B and C. There are quite a few duplication nodes, so already with just a few species things get a bit complicated and messy.
When people talk about this family, they think in terms of subgroups: the nomenclature committees, for this family, will call them ADH1A, B and C. But if you think about it, that's a very human-centric point of view: we've got three genes, so we've got three names. But if you're a baboon, then you have four genes, and actually no ADH1C, three ADH1A and one ADH1B; obviously no one is defending the baboon's interests in the nomenclature committee. You also have the problem that, depending on the resolution, you might want to cut the gene tree at different levels, so you can see that this kind of representation is going to be difficult to scale.

So how do people currently integrate information from multiple genomes? You could work with pairs of species; this is the classical way of doing things. Remember that orthology is defined in terms of pairs of genes, so for instance if we just compare human and baboon, we might see four orthologous relationships: ADH1B with its baboon counterpart, and human ADH1A with the three duplicated baboon ADH1As, forming a one-to-many relationship. By the way, this pairwise orthology view, if we are very human-centric, can already bring us quite a long way: we can compare each model organism to human, and in the plant world Arabidopsis has all the functional data anyway. But it's not so satisfactory, because you're dealing with two species at a time; if you look at triangles of species you may get inconsistent predictions, and it's really cumbersome; the onus is on the researcher to integrate all this information, which is probably a bad idea at this level of complexity. At the other end of the spectrum, people work with clusters of one-to-one orthologs. That's very neat, because now we've got one copy in each species, and then
everything can be compared with a very simple mapping across species. So comparative genomics projects looking at other aspects, such as the evolution of transcription or of splicing, tend to use these one-to-one orthologs. The problem is that one bad genome, one low-quality genome, can really mess up the clusters; they can become very fragmentary, and in fact as we add more and more genomes the clusters can only get more fragmentary, which is not very satisfactory. So it's quite fragile, and it doesn't scale very well. Or we could just go back to gene trees which, as mentioned before, contain most of the information; but they are quite difficult to infer and difficult to interpret, and again, a low-quality genome may cause trouble in the tree and even affect the predictions among the high-quality genomes. So the paradigm that we are trying to push, and that we are using in the lab, is that of hierarchical orthologous groups. The idea is related to all of that, but there is a subtle and important distinction. To explain it, I'm going back to all these gene copies, just organized according to the species they belong to, and then we can think of the ancestry of these species. Crucially, this is not a gene tree but a species tree, so it's a tree we are quite familiar with. Then we can think about how these copies map to ancestral genes. Here, for the situation with ADH1A, B and C, we may infer, and I'm not looking at the inference process right now, just the concept, that all these genes descended from three copies in the ancestral simian; and if we go further back in time, in the ancestral primate there was really just one copy. Those of you who are used to dealing with gene trees may say: well, isn't this cheating,
you're just showing us a gene tree where you arrange things in the third dimension so that things look a bit simpler. Okay, there is nothing wrong with having an appropriate visualization, and I think it helps to see how these genes are related; we'll see how this is then used to infer the HOGs. What I also wanted to mention is that once we've inferred the ancestral genes, we can go back to the present-day genes and see, for example, that these three copies really descend from that ancestral ADH1A, and that some copies were lost along the way. If we think in terms of these relationships, that's how we can define the HOGs: with respect to the ancestral simian, there is one group of genes that contains all of the descendants of one ancestral gene; and if we go back to the ancestral primate, then all of these genes should be in the same HOG, because they all descended from that one ancestral gene. Now you can see that if we can infer these ancestral genes, and the HOGs attached to them, in a robust way, that gives us a much more scalable framework. For example, we might think that there were indeed two duplications at this level, or perhaps that was really just a very poorly sequenced genome and we really have just ADH1A, B and C; that may complicate the tree locally, but it's not really affecting our picture of the common denominator very much. In a way, this gives a much more natural framework for integrating good-quality but also poor-quality genomes, and the evidence we see on the terminal branches may reflect both the biology and the quality of these genomes. So this is the concept, but how do we get to this type of prediction?
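One way to picture how ancestral genes could be recovered from pairwise orthology alone, which anticipates the bottom-up approach described next, is to treat genes as nodes, orthologous pairs as edges, and read each connected component as one ancestral gene, i.e. one HOG at that level. A minimal union-find sketch under idealized conditions (the input format and gene names are invented for illustration; this is not the published algorithm):

```python
from collections import defaultdict

def hogs_from_orthology(genes, ortholog_pairs):
    """Union-find over the orthology graph; each connected
    component is read as one HOG (one ancestral gene)."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]   # path halving
            g = parent[g]
        return g

    for a, b in ortholog_pairs:
        parent[find(a)] = find(b)           # union the two components

    components = defaultdict(list)
    for g in genes:
        components[find(g)].append(g)
    return sorted(sorted(c) for c in components.values())

genes = ["human_A", "chimp_A", "human_B", "chimp_B"]
pairs = [("human_A", "chimp_A"), ("human_B", "chimp_B")]
print(hogs_from_orthology(genes, pairs))
# -> [['chimp_A', 'human_A'], ['chimp_B', 'human_B']]
# two components, so two ancestral genes at the human/chimp ancestor
```

In practice the graph is noisy, genes are fragmented, and orthology edges can be wrong, which is exactly why the real inference needs to be more careful than this idealized picture.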
I also want to mention that the SIB is a hotbed for hierarchical orthologous groups, because it's home to leading projects that predict them, with colleagues from OrthoDB and eggNOG. So how do we infer HOGs? Going back to the sequences: we start with annotated genomes, and then we can do some homology inference to find the genes that share common ancestry. From there, the traditional route would be to build gene trees and infer the duplication and speciation nodes; this is a reconciliation, and once you have these trees, it really is just a matter of parsing them to get the HOGs. That's quite straightforward, but it can be difficult to infer trees at a large scale, and if the trees have mistakes, those will be reflected in the HOG reconstruction. The route that we like to pursue is to use pairwise orthology inference which, as we've seen in the previous part, is quite mature and very scalable; if we can skip the tree-building step and go straight to the HOGs, then we are happy. In terms of robustness, it's also intuitive that in pairwise comparisons any single bad genome might have less of an effect than in a joint analysis; arguably it may be more robust, but for sure it's going to be faster. I don't have time to really go into the details of the algorithm, and most of it is already published, but I want to tell you a little bit about the bottom-up approach, because I think it's a new way of building these HOGs that we are currently developing; it's quite intuitive and it does quite well. We start bottom-up with the present-day proteins and work along the species tree. Basically, we can look at the orthology graph, the graph that has an edge for every pairwise orthologous relationship, and we can show that, under good conditions, the connected components in
this graph should match the HOGs at every level. So here we infer that in the ancestor of these two primates there were three ancestral genes, and in the rodents here there are four genes; then we can look at the orthology graph between these HOGs and infer that there were really two ancestral genes, just working up the tree. If we do that, and then look at the pairwise relationships these HOGs imply (a HOG is a bit richer than just pairwise relationships, but we can always project it down to pairwise relationships), we can use our orthology benchmarking service to see how we're doing. It looks like this increases the recall quite a bit, so the number of correct predictions that we make, without having too bad an effect on precision; in fact, precision even slightly increases here. So that's quite promising, though I have to say this is a particularly favorable data set, where the genomes are relatively close to one another, and we're still working on extending this approach to more distant species. We also get a very nice speed-up with this bottom-up approach, which is important when we are doing things at this scale. I will now revisit our gene family, broadening it a little to related homologs. Here we have our three copies; in this representation we have the species tree, human, chimp, and so on, and each square is a gene. Given a particular ancestral node, the big boxes give us the HOGs. Here, for instance, the prediction is that at this level ADH1A, B and C were really just one gene, and that all of these genes descended from one common ancestor; in fact the prediction is that there were seven genes in that common ancestor. Now if you look at the result, it's quite interesting: for about two years we had these predictions in a non-visual form, and then all of a sudden coming up with a visualization makes
you realize how well you're doing in some ways, and also how poorly you're doing, because then you realize that, for example, it is entirely implausible that we had two ancestral genes here that were lost everywhere else except in these two species. The mistakes stick out like sore thumbs in this representation. On the other hand, if you look at the aggregate of the data, we're doing a pretty good job of automatically classifying all of these families into these groups. And, I'll come back to this in a few slides, you can see there that three genes are split off, where the prediction would be that the duplication happened here, at the common ancestor of the apes. Just to give you an idea, the same data viewed as a gene tree would be quite difficult to deal with, so I think the representation is really key here; and if we can go straight from the data to this type of inference, without the difficulties of building a tree, I think we have something interesting. Once we've organized things in terms of the HOGs, it's also possible to go around the species tree and collect aggregate statistics: the number of losses, duplications, and novel genes that we cannot map anywhere else, on each branch of the tree. Interestingly, what we see is, for instance, that this species seems to have many, many losses on its terminal branch, but the genome here is of very poor quality; in general, most of the events happen on the terminal branches, reflecting the limited quality of many of these genomes. But if we look at the ancestral, internal branches, there we have much fewer events, and they are supported by multiple genomes. So it would actually be quite interesting to go and look, for instance, at genes that have been duplicated in a particular clade and
see how this is associated with, for example, the emergence of new functions. To give another illustration: we have a PhD student, whom I co-supervise, who is interested in the Xenacoelomorpha, a clade of marine worms. I don't know if there are any fans of these in the audience, but some zoologists get very excited about them, because although they look very simple, when you build a phylogeny there are some indications that they are actually quite derived, and so they might have simplified during evolution. That gives a simple hypothesis to test: can we infer how many genes there were in the last common ancestor of these xenacoelomorphs? If there were many genes, maybe it was more complex; if there were fewer, then maybe not so much, with all the caveats that we know about the correlation between complexity and gene number. Our collaborators collected eight specimens and did transcriptome analyses, which is really routine, and you get these really awful-quality transcriptomes with typically 100 to 150 thousand sequences. We can just put these into the analysis, and interestingly, it is able to get rid of a lot of this noise; if we look at the ancestral branch, it converges to an estimate of around 20,000 genes. This is still preliminary work, and if we look at the genes we see some contamination and other artifacts, but the reduction we get just from comparing these genomes is already quite remarkable. I think that lends credence to the notion that, using this framework, we might be able to cope with very noisy genomes, in contrast to, for instance, building one-to-one ortholog clusters or looking at pairwise relationships between a hundred genomes that each have 150,000 sequences. If you're
interested in seeing more of these hierarchical orthologous groups, you can have a look at the OMA browser, or you can run the standalone version that we have on your own data; you can even mix in public genomes, for which we already have pre-computed comparisons, and just add your own data. We have, I think, quite a few people already using the Vital-IT cluster to do this type of analysis, and we would welcome more such analyses, although perhaps some would disagree with that. If I go back now to my earlier picture, I promised to come back to one of these weird cases. We looked into it, and indeed what happens is that this sequence is a fragment: it's missing a big part here, and so it should really be part of that cluster; in fact the other one, which also has quite a large part missing, actually is in that cluster. So this is really a fragmented gene. Our method doesn't assume that you're dealing with fragments, but these are actually really common, and here we should really go back to the drawing board and think of a way to deal with this more robustly; in the meantime, just by visualizing the results, we can already make sense of them. And in this tree reconstructed from the sequences, you see the two short sequences here; the branches are a bit messed up because they are based on extremely short segments. But that gives me a transition to the third project, which asks: if these split genes, these fragments, are so common, can we maybe use comparative genomics to identify some of them? This is work with a PhD student in the group. So consider the sequence alignment here, this cartoon multiple sequence alignment, and let's assume that these two sequences come from the same genome. Then perhaps these are actually fragments of the same gene, and we could maybe stitch them together;
but of course they might also come from paralogous sequences, in which case we should not stitch them back together. So how can we infer which is which? Being phylogeneticists, we like to build trees, so we build the tree and then look at where the red and the orange sequences lie in it. Maybe the tree will look like this, in which case we'll say these are definitely paralogs; and perhaps, if the two fragments sit right next to one another, we'll say they come from the same gene, that it really is a split gene. However, this never happens; we can never observe such a scenario, and the reason is actually quite simple: these red and orange sequences have no overlap, so we have no way to know how divergent they are from one another. You don't know whether they connect here or somewhere else; the only thing we know is how far each of them lies from the other sequences. So what it actually looks like is that the two fragments connect at roughly the same place, but never exactly the same place, in the tree. That already suggests a first test, a first way to go about it: collapse all the branches that are insignificant, that are just due to stochastic error in the tree reconstruction. If the two attachment points then become the same, and the fragments come from the same genome, we infer them to be split genes, and we can stitch them back together. So that's one test, and it's nice: it's very simple, it's very fast, and it's robust to variation in the rate of evolution along the sequence, because one part might connect far away and the other much closer, but they will still attach at the same place. But what sort of threshold should we use? It seems a bit too simple, so it's probably not going to be statistically very powerful; and also the PhD student thinks, I can't write a thesis based on this idea,
really. So we need something a little bit more sophisticated. The second test is more sophisticated; whether it's really better at the end of the day is something we still have to investigate in more detail. The idea is to have two hypotheses and perform a likelihood ratio test: either the fragments are a split gene, in which case we can put the two sequences together, or they are paralogs, in which case we have to build a tree with one more taxon. We then compare the likelihoods of these two hypotheses and see whether we really get an improvement in fit when we add these extra degrees of freedom. This is hopefully more powerful, and definitely more sophisticated, but there are also some disadvantages. The hypotheses are not really nested, so we need a way to estimate the distribution of the test statistic empirically, which requires resampling that is time consuming. And there is something awkward about the formulation of the problem: as you know, in such a test the simpler hypothesis is usually the one that, if all goes well, gets rejected, in which case we accept the alternative hypothesis. But here the prediction we care about, the split gene, plays the role of the null hypothesis, so failure to reject leads to an inference. There is something a bit awkward about that formulation, but regardless, we can still treat this as a black box and look at the performance of the approach. And actually it looks like we can make quite a few predictions: when we take, for instance, the wheat genome, artificially split some of the genes, and hide the non-overlapping parts, we can reconstruct quite a few of these events, so this is looking quite promising. Finally, the last short story is about clustering genes that have a common evolutionary history. Arguably, what I just described was a way to use comparative genomics for
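The test machinery itself can be sketched generically. This is a hedged sketch, not the actual implementation: the log-likelihoods and the resampled null statistics stand in for real phylogenetic likelihood computations, which I am not reproducing here. Because the hypotheses are not nested, the p-value comes from an empirically estimated null distribution rather than a chi-square approximation, and note the inverted logic: failing to reject the split hypothesis is what triggers the inference.

```python
def split_vs_paralog_test(lnL_split, lnL_paralog, null_stats, alpha=0.05):
    """Likelihood-ratio-style test of 'split gene' (the null) against
    'paralogs' (the alternative: one extra taxon, extra degrees of
    freedom).  null_stats holds the statistic recomputed on resampled
    data under the split hypothesis, since the hypotheses are not
    nested and no chi-square approximation applies."""
    D = 2.0 * (lnL_paralog - lnL_split)
    # empirical p-value with the usual +1 correction
    p = (1 + sum(1 for d in null_stats if d >= D)) / (1 + len(null_stats))
    # awkward-but-intended logic: failure to reject the null (split)
    # is what leads to stitching the fragments back together
    verdict = "paralogs" if p < alpha else "split gene"
    return D, p, verdict

# Toy numbers (made up): adding the extra taxon barely improves the fit
null_stats = [0.5, 1.2, 3.5, 0.8, 2.9, 4.1, 1.0, 0.3, 2.2, 1.7]
D, p, verdict = split_vs_paralog_test(-1000.0, -998.5, null_stats)
print(D, round(p, 3), verdict)
# → 3.0 0.273 split gene
```

The time-consuming part in practice is generating `null_stats`, since each resample requires re-estimating trees and likelihoods.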
scaffolding, so it sits quite low in the pipeline that I was showing at the beginning; now we are moving much closer to the application side. In the field of phylogenetics in general, people are realizing more and more that a single tree does not fit all the data, and depending on who you ask, they will say that's because there is lateral gene transfer, or because there is incomplete lineage sorting and we have to look at population-level effects, or maybe that it's due to hybridization, or you name it. The problem is that sometimes we really don't know what the source is, and we would like methods that just don't make any assumption about the source of incongruence, and in particular that might be able to identify variation that is due to artifacts in the data. So here's an approach that we've been developing with Kevin Gori in the group. We start again with some sets of putative orthologs, so hopefully we have just one sequence per species, and ideally they should all have the same topology. What we do is build a tree for each locus, so for each gene we have a tree, and then we look at which trees are similar and which are different, and do some clustering in this tree space. If all the loci agree, we'll have one topology; but if it is justified to add an extra topology to fit the data, the method will do so, without any assumption about how the incongruence arises. Of course we can validate this on simulated data, and I'm not claiming that this is particularly insightful, but maybe it helps to see a bit how the method works: we have four different underlying topologies, each gene is sampled from one of these four topologies, and the method is able to find back which gene came from which topology. We also applied this to a data set that was published a few years ago on yeast
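The tree-space clustering idea can be illustrated with a minimal sketch (my own toy version, not the actual package): each gene tree is summarized by its set of clades, the dissimilarity between two trees is a Robinson-Foulds-style count of clades present in one tree but not the other, and genes whose trees are close get grouped together.

```python
def rf_distance(clades_a, clades_b):
    """Robinson-Foulds-style distance: clades found in only one tree."""
    return len(clades_a ^ clades_b)

def cluster_gene_trees(gene_trees, max_dist=0):
    """Greedy single-linkage grouping in tree space: a gene joins the
    first cluster that already contains a tree within max_dist of its
    own; otherwise it starts a new cluster (a new topology is added
    only when the data justify it)."""
    clusters = []
    for gene in sorted(gene_trees):
        for cluster in clusters:
            if any(rf_distance(gene_trees[gene], gene_trees[m]) <= max_dist
                   for m in cluster):
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters

# Toy data: rooted trees on taxa {A,B,C,D}, encoded as their clade sets
topo1 = frozenset({frozenset({"A", "B"}), frozenset({"C", "D"})})
topo2 = frozenset({frozenset({"A", "C"}), frozenset({"B", "D"})})
gene_trees = {"g1": topo1, "g2": topo1, "g3": topo2, "g4": topo2}
print(cluster_gene_trees(gene_trees))
# → [['g1', 'g2'], ['g3', 'g4']]
```

Crucially, nothing in this procedure assumes why the trees differ: lateral transfer, incomplete lineage sorting, hybridization, or plain artifacts all surface the same way, as genes landing in a separate cluster.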
phylogeny, and one of the selling points of this data set was the extensive curation that had gone into building all these orthologous groups. Yet when we just ran the method, it identified two main clusters, one of which is very tight. This is just a projection to see what's happening: each point is a gene, and points that are close to one another have trees that are similar. So here we see many genes that seem to have a very similar tree, and a whole bunch of genes that seem to have a different tree. When we looked at the results, it turned out that about 90% of the genes were reconstructing the canonical topology, and 10% were giving something completely crazy, with an extremely long branch here. It turns out that these are all errors in the orthology inference, which, by the way, was not done using our methods. It's quite interesting because it's 10% of the data, and in this case, if you aggregate all the data, you still get the right tree, so it's very difficult to notice this error unless you use a method that is able to deal with this heterogeneity. In this case the topology is not contentious, but you know that many parts of the tree of life are really highly debated, and if you have 10% of the data with an extremely long branch, it will bias the result. So making very few assumptions about the source of noise, and in particular accepting that some of the noise may be due to artifacts, can be a very powerful approach. But of course this is also useful for detecting all the legitimate sources of heterogeneity, so depending on your point of view, you might be interested in the approach to look at other phenomena that give rise to this variation in the trees. There's a software package for this, which I encourage you to use if you have a related problem.
So I'm coming to the end, and I'd like to briefly summarize. Orthology benchmarking used to be very difficult, and we think it's now very easy, so if you're either using some of these methods or developing your own, you should have a look at this. The HOG framework, I think, is something that could really move the field forward in comparative genomics, by making it possible to integrate information across multiple genomes; downstream analyses in terms of regulation, functional Gene Ontology annotation, or expression could be cast in that framework, and I think we might then be able to deal better with all the complexity that gene evolution entails. The fragments pose problems to our methods and to other people's methods, but if we use comparative genomics we have a chance of at least identifying and repairing some of them. And finally, I think these process-agnostic approaches to dealing with variation in the tree topologies in the data are very useful, not only to uncover genuine biological processes but also artifacts in the data. I'll finish here. Thank you very much for your attention; if you have any questions, I'll be very happy to answer them.