So, welcome all to the CIB Virtual Computational Biology Seminar Series. Today we have the pleasure of having Georgie Krobitsch from the Laboratory of Artificial and Natural Evolution at the University of Geneva. Georgie studied computer science during his bachelor at the Faculty of Electrical Engineering and Computing at the University of Zagreb in Croatia. He earned his master in computer science and evolutionary computation in 2011 from the same university, and he joined Michel Milinkovich's group in April 2012 as a PhD student, developing meta-heuristics for phylogeny inference in the framework of the MetaPIGA software.

About the group: for 10 years the group's core activities have revolved around the production of experimental data and the development of software tools and algorithms in evolutionary genetics. Since 2008 they additionally combine evolutionary and developmental biology, or evo-devo, and the study of physical processes to understand the mechanisms generating complexity and diversity in the living world. The group specializes in non-classical model species among reptiles and mammals, and they integrate data and analyses from comparative genomics, molecular developmental genetics, as well as computer modeling and numerical simulations. Today Georgie will show how to explore phylogeny space with the MetaPIGA software. So, Georgie, thanks again for accepting our invitation, and the floor is yours.

Thank you, Diana. Thanks to the CIB for inviting me to speak here, and thanks to all of you who are sharing this room with me today, for your attention. This is a quick overview of what I am going to talk about. I will briefly introduce the problem of phylogeny for anybody who is unfamiliar with it, then the ways we reconstruct phylogenies, and at the end I will talk specifically about my project. So, what is the phylogeny problem?
We have entities and certain characters that are homologous, so we assume that these characters have a common ancestor, and we want to put them in a tree-like structure where we try to infer which groups of entities are more closely related than other groups. In this case we have four animals, and the characters we are using to reconstruct their evolutionary past are DNA sequences. On the right side you have a typical, very simple example of a tree that tries to explain this data; in this case we grouped, for example, rat and pig together and cat and bat together. This is just an arbitrary example.

So how do we evaluate this model that explains the observed characters we have on the left? The simplest method for doing so is the parsimony method, which is very straightforward and very simple to interpret: we just calculate the minimum number of changes needed on the tree to explain the data. In the upper example, if we arrange the tree like so, and say that bat and rat have some character Y and cat and pig have some character X, then in order to explain these characters in green, which are the observed data, we need at least two mutations: either on the two branches I showed here, or alternatively on the upper branches with Y, in which case the ancestral, internal characters would be X. There is no way to arrange this tree with fewer mutations. In the lower example we see that if we rearrange the tree, we can explain the observed data with fewer mutations; only one mutation is needed here, meaning that the left internal node would be Y and the other one would be X. We would consider the lower tree more parsimonious, and therefore better.
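As an illustration, the minimum-mutation count described above can be sketched with Fitch's small-parsimony algorithm on the four-animal toy example. The trees, species names and character states below are hypothetical stand-ins for the slide:

```python
def fitch(node, states):
    """Return (candidate state set, mutation count) for a subtree.

    `node` is either a leaf name (string) or a (left, right) tuple;
    `states` maps leaf names to their observed character.
    """
    if isinstance(node, str):                 # leaf: its own state, no cost
        return {states[node]}, 0
    lset, lcost = fitch(node[0], states)
    rset, rcost = fitch(node[1], states)
    common = lset & rset
    if common:                                # children agree: no extra mutation
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1     # disagreement: one mutation here

states = {"bat": "Y", "rat": "Y", "cat": "X", "pig": "X"}

# Tree 1: ((bat, cat), (rat, pig)) mixes the two states -> 2 changes needed
_, cost1 = fitch((("bat", "cat"), ("rat", "pig")), states)
# Tree 2: ((bat, rat), (cat, pig)) groups matching states -> 1 change suffices
_, cost2 = fitch((("bat", "rat"), ("cat", "pig")), states)
print(cost1, cost2)  # 2 1
```

The tree with the lower count is the more parsimonious one, exactly as in the slide's upper and lower examples.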
This method is still commonly used if you have characters such as morphological characters, for example wing shape or number of teeth, where you have no other information about how the characters actually evolve or change along the tree. On the other hand, this method is too simplistic for genomic data. If we have, say, genes to compare, and if we assume again that they all have a common ancestor, we can introduce a more complex method of evaluating our trees that involves time and a more probabilistic, more exact approach to inferring a tree.

In order for this method to work, we need a substitution model, which models the rate at which each state transforms into another state. In this case we have nucleotides, but you can have something more complex, like proteins or codons. In phylogenetic reconstruction, mostly these three models are used. They have different complexities, and they are all time reversible: in order for our methods to work, they all have to have the same rate in both directions. We also need to introduce the concept of branch length, which is a combined measure of the rate of mutation on that branch and time. We can imagine that if one branch is fast evolving, or if more time passed in that period, the branch would be longer; the longer the branch, the more substitutions we expect to occur. This also lets us model parts of the data that evolve or mutate at different speeds, and incorporate that into our model.
The simplest model of evolution is Jukes-Cantor, where all of these parameters are equal, and the most complex is the general time reversible (GTR) model, where all of these are free parameters of the model, so you have to optimize over them too, since prior knowledge about the substitution rates is often not available.

When we calculate the likelihood, we take one column at a time: we assume that all the characters in one column have a common ancestor, so we compare them within that column. For the entire matrix we take the joint probability over all columns: we superimpose each column on the tree, calculate the likelihood of the tree for that column, and multiply over all columns. At the end we have a value which means: given all of the free parameters, the branch lengths, the substitution rates and the topology of the tree, what is the probability that this model would generate the observed data?

Now that we have this model, we can define a space: each topology configuration, if we switch leaves on these trees, and each parameter value defines a point in this space, and the likelihood defines a kind of surface over it. Our goal is to find the best model to explain the observed data: we need to find the tree topology and the configuration of parameters on this tree that maximize this surface. And I have to mention that as you add more species or entities to the rows of this matrix, the number of trees that can possibly be generated with this set of species grows super-exponentially.
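The per-column likelihood described above can be sketched for the simplest case: the Jukes-Cantor model on a two-taxon tree with an unknown ancestor at the root. Real programs handle arbitrary trees via Felsenstein's pruning algorithm and richer models; the sequences and branch lengths here are made up:

```python
import math

def jc_prob(i, j, t):
    """Jukes-Cantor transition probability P(i -> j) along a branch of
    length t (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

BASES = "ACGT"

def column_likelihood(x, y, t1, t2):
    """Likelihood of one alignment column (x, y): sum over the unknown
    ancestral state at the root, with a uniform prior of 1/4."""
    return sum(0.25 * jc_prob(a, x, t1) * jc_prob(a, y, t2) for a in BASES)

def alignment_log_likelihood(seq1, seq2, t1, t2):
    """Columns are assumed independent, so the alignment likelihood is
    the product over columns (here: the sum of logs)."""
    return sum(math.log(column_likelihood(x, y, t1, t2))
               for x, y in zip(seq1, seq2))

ll = alignment_log_likelihood("ACGTAC", "ACGTAT", 0.1, 0.1)
```

Changing the branch lengths `t1`, `t2` (or, on larger trees, the topology) changes `ll`, which is exactly the surface the search algorithms below try to maximize.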
So you can imagine that with 100 or 200 species, checking every configuration of branches is an infeasible problem: you cannot check every one of them, see which has the highest likelihood, and pick that as the model. Just to show, for example, how we move in this space: we change the topology as explained; in this example I cut the tree here, so we have two subtrees, and I re-branch it here, and this is one move in the space of topologies. Ideally we would have to try all possible combinations of these changes, and all possible combinations of branch lengths and substitution model parameters, to find the one that best explains our data. Of course we cannot do that, so we have to come up with smart algorithms that help us, not to find the best one, but to cope with this problem in some way.

The first approaches to solving this problem are two representative algorithms. The first one is neighbor joining. Basically, it builds a distance matrix between each pair of species, measuring how distant they are from each other, and it is essentially a clustering algorithm adapted to problems in biology. It groups the closest ones together and then finds a kind of midpoint: say, for example, these two are closest together; it groups them into a cluster, calculates the midpoint, and then for all future clustering this midpoint is compared to the other species or nodes.
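The infeasibility claim can be made concrete: the number of distinct unrooted binary tree topologies on n labelled leaves is the double factorial (2n-5)!!, which is what makes exhaustive search hopeless. A tiny sketch:

```python
def num_unrooted_trees(n):
    """Number of distinct unrooted binary tree topologies on n labelled
    leaves: (2n-5)!! = 1 * 3 * 5 * ... * (2n-5)."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

# 4 taxa -> only 3 topologies, but 10 taxa -> ~2 million,
# and 20 taxa already exceed 10**20 topologies.
print(num_unrooted_trees(4), num_unrooted_trees(10), num_unrooted_trees(20))
```

And this counts topologies only, before optimizing branch lengths and substitution parameters on each one.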
Another one is a greedy search. It starts with either a random tree, or we can use, for example, neighbor joining to produce a starting tree for the search. From that point it evaluates all possible immediate steps: it does not try all combinations, but from this point it rearranges branches, moves to the best neighbor it finds, then explores the space around that point and moves again, until all the neighbors of the current tree are worse than the current tree. Then we stop and hope we found the best tree. The problem with this: if you imagine this mountain to be our likelihood function, and this space to be our space of topologies, branch lengths and substitution parameters, then if we start here, we just climb up to the first local optimum and stop there, even though other peaks may be higher than the one we found. Still, for some purposes this is sometimes enough.

Another approach, used in phylogeny but also in a lot of other branches of engineering, is genetic algorithms, which try to emulate natural evolution. We have a population of candidate solutions that defines the population at one time point, one generation, and they produce offspring; in our case an offspring could just be another point around a current point. In order to survive, the offspring compete: we have some sort of selection function that, in a stochastic way, pushes the entire population toward better solutions. The higher they are on this surface, the more likely they are to survive to the next generation. It is slower because we have to deal with more points, but it is a parallel search, and we hope it avoids the problem of getting stuck at a local optimum: for example, some individuals, with their jumps, may actually reach the top.
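The contrast between greedy hill climbing and a genetic algorithm can be sketched on a toy one-dimensional "likelihood surface". The function, the starting points and all settings below are made-up stand-ins for tree space, chosen only so that the surface has a small local peak and a higher global one:

```python
import math
import random

def fitness(x):
    """Toy multimodal surface: a local peak near x = 2 (height 1)
    and the global peak near x = 8 (height 2)."""
    return math.exp(-(x - 2) ** 2) + 2 * math.exp(-(x - 8) ** 2)

def hill_climb(x, step=0.1):
    """Greedy search: move to the better neighbour until none improves."""
    while True:
        best = max((x - step, x, x + step), key=fitness)
        if best == x:
            return x
        x = best

def genetic_search(pop_size=30, generations=200, seed=1):
    """Minimal genetic algorithm: mutate every individual, then keep
    the fittest half of parents plus offspring."""
    rng = random.Random(seed)
    pop = [rng.uniform(0, 10) for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [x + rng.gauss(0, 0.5) for x in pop]
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

local = hill_climb(0.0)   # climbs to the first peak near 2 and stops
best = genetic_search()   # the population's spread lets it find the peak near 8
```

The hill climber gets stuck on the local optimum it first reaches, while the population-based search, at the cost of evaluating many more points, ends up near the global peak.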
Again, it is not a method that will always give you the best possible solution, but it has been shown on multiple occasions that it actually performs better than greedy search. Of course, in a super simplistic example with just one peak, the greedy search performs just as well.

In the early 2000s, my supervisor and a collaborator came up with the idea of consensus pruning. It is a modified genetic algorithm where we have multiple independent populations, as you see here, for example these red ones and blue ones. And, specifically for the problem of trees, if these independent populations agree on certain branches, certain clades, as in this example, the algorithm will not try to break them when making moves between trees; it will not break these red branches. This is the so-called consensus: if several different populations evolve in such a way that they all agree that this is a good clade, we will not try to break it. This helps us reach the top faster, because with a normal genetic algorithm, generating offspring, propositions of new trees, by trying all possible combinations of pruning branches and regrafting them in other places can be very cumbersome. By agreeing not to touch these parts of the tree, we can significantly reduce the space we explore and hopefully reach the peak faster. Everything I am talking about here is very computationally expensive: if you have a huge number of species and a huge number of characters, this can sometimes take weeks, so any improvement in speed is welcome.

So far I have talked about optimization, what we call maximum likelihood: the goal is to find the best solution, the best model that explains the data.
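The agreement between populations that consensus pruning exploits boils down to intersecting the clades of the populations' current trees: any clade found in all of them becomes off-limits for pruning and regrafting moves. A toy sketch (the trees and taxon names are hypothetical, and real implementations work on bipartitions of much larger trees):

```python
def clades(tree):
    """Collect the leaf set of every internal node of a tree given as
    nested tuples, e.g. (("A", "B"), ("C", "D"))."""
    found = set()
    def walk(node):
        if isinstance(node, str):                      # leaf
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves
    walk(tree)
    return found

# Hypothetical current best trees from two independent populations
tree_pop1 = ((("A", "B"), "C"), ("D", "E"))
tree_pop2 = ((("A", "B"), ("D", "E")), "C")

# Clades both populations agree on; a consensus-pruning move generator
# would refuse to break any branch defining one of these.
consensus = clades(tree_pop1) & clades(tree_pop2)
```

Here both populations agree on {A, B} and {D, E}, so those clades would be protected, shrinking the space of topology moves the search has to consider.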
Another approach is not really about optimization in the first place; it is about what we call robustness. Robustness means that when we find a clade, we want some metric that tells us how good this clade is compared to alternative clades. For example, here we can group A with B in a clade, or B with D; one of these could be better than the other, but they can be very similar in their explanatory power, and we want to know that. If we came out with one clade as our solution and said, okay, this is the better one, there might be an alternative that is nearly as good. For example, if tree three has a likelihood of around 30% and tree one also around 30%, we cannot be sure which one is correct, whether due to insufficient data or something else, so we want to have this metric.

So far, the most popular way to achieve this is Bayesian statistics. It allows you to include prior knowledge: maybe you know something in advance that restricts your search space or your model, or you know that some parameters behave in certain ways, and you can incorporate this knowledge into your analysis. You then get a posterior distribution, which is the probability of the model given the data, whereas so far we had the probability of the data given the model. Another way, which we implement in MetaPIGA, is to repeat the analysis, see how often we arrive at certain peaks, and include this as a metric of robustness. A third way is so-called bootstrapping: you resample your data, a well-known statistical method, to arrive at exactly this kind of metric.

Now, for the end, I will talk about my project. MetaPIGA was a software originally made to
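For alignments, the bootstrapping mentioned above means resampling alignment columns with replacement, rebuilding a tree from each pseudo-replicate, and reporting how often each clade reappears. A minimal sketch of the resampling step (the alignment below is made up):

```python
import random

def bootstrap_columns(alignment, rng):
    """One bootstrap pseudo-replicate: resample alignment columns with
    replacement. `alignment` maps species name -> sequence string."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {sp: "".join(seq[i] for i in cols) for sp, seq in alignment.items()}

alignment = {"bat": "ACGTTA", "rat": "ACGTTG", "cat": "TCGATG", "pig": "TCGATA"}
rng = random.Random(0)

# In practice one rebuilds a tree from each replicate; a clade's support
# is the fraction of replicates in which it appears.
replicates = [bootstrap_columns(alignment, rng) for _ in range(100)]
```

Each replicate keeps the same species and alignment length but reweights the columns, which is what makes the resulting clade frequencies a robustness metric.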
explore the efficacy of consensus pruning. When I started the PhD, they decided they wanted a more user-friendly, more flexible piece of software, because they were frustrated with tools that are usually command line, where you need to read a manual of five thousand pages to understand how they work. So they started with the idea that it should be easily portable over multiple platforms. Initially it had only the maximum likelihood method to find the optimum, and the simple repeated-analysis method to calculate robustness. For the new version, which is not yet published but which I worked on, I implemented Bayesian inference, so you can have full Bayesian analysis, like in MrBayes.

Among the features: you can use nucleotides as your input data, you can use amino acids, and you can use codons. From within the software you can easily exclude or partition certain parts of the data, or immediately set an outgroup; it has a nice, user-friendly interface. We also implemented the trimAl algorithm, developed by the group of Toni Gabaldón in Barcelona, which actually tests your data: often people put in data that has a lot of noise, and this is a tool that warns you about problematic data sets so you don't come to some weird conclusions. We also implemented multiple models, from the very simple JC to GTR and everything in between, a lot of empirical models for amino acids, and two possible codon models for optimization, all through the user interface.

Also, if you want, you can run the search directly, or you can define all your parameters in the graphical interface, export a file, and then run it on a cluster somewhere without all of this graphical goodness, if you want to parallelize things. When you receive your results, you can immediately, in the software, check the topology, reroot it, and then
see the support values on the branches and the model. You can import other trees; it is not as powerful as software dedicated to tree exploration, but it helps if you want a quick overview of your results. It also has a very simple Bayesian method for inferring ancestral states at internal nodes: you can export this as a file with numbers, and it can tell you, for example here for node 5, the inferred ancestral sequence. I also implemented the codon models; at that time (I don't know about the situation now) only a few pieces of software actually had this, because it is very computationally burdensome. I implemented it so that you can run it on NVIDIA GPUs if you have one, and at least for codon analysis it is up to 20 times faster.

And that is the point of my PhD: the goal is to find the best out-of-the-box method that you can use for your analysis. You see here that you have robustness and optimization problems, and algorithms that are used for one of these problems and not the other; the aim is to find out in which situation which one is better, to try to combine them into one, and to make all of this faster and better. If you have any questions, I invite you to send emails; maybe you have some ideas about what to implement, and I am open to all that. If you want to try the software, the new version is not yet out because it is still in beta and I am still polishing everything, but just send me an email and I will distribute it. Thank you.