So, I'm very pleased today to welcome you all to this SIB Virtual Computational Biology Seminar Series, and I appreciate that the Computational Biology retreat (is it not called that?), the retreat of the Department of Computational Biology, has joined us today. We have the pleasure to host here today Erik van Nimwegen, Associate Professor in Computational Systems Biology at the Biozentrum of the University of Basel. Erik did his undergraduate studies in theoretical physics at the University of Amsterdam in the Netherlands, and he obtained his PhD in 1999 at the Santa Fe Institute in the USA and at the Theoretical Biology and Bioinformatics department of Utrecht University in the Netherlands. He then did one year of postdoctoral research at the Santa Fe Institute before becoming a fellow at the Center for Studies in Physics and Biology at the Rockefeller University in New York in the USA. Then he came back to Europe and in 2003 was appointed assistant professor at the Biozentrum of the University of Basel, and group leader at the SIB, the Swiss Institute of Bioinformatics, in 2004, and since then, as an associate professor, he has been leading the Genome Systems Biology group. The group's main research interest is the study of genome-wide regulatory systems, in order to reconstruct them from high-throughput molecular data, understand and model how they have evolved, and search for design principles in their construction. In particular, they are developing and applying new algorithmic tools for the automatic reconstruction of genome-wide regulatory networks from comparative genomics, deep sequencing and other high-throughput data. In addition, methods are being developed for studying genome evolution, and the evolution of regulatory networks in particular. The group has been developing the SwissRegulon Portal for many years now, a set of software tools and knowledge bases for regulatory genomics. I invite you to have a look at their website to know more about this portal.
So today Erik will tell us more about prokaryotic genome evolution, and I would like to thank you again for accepting this invitation. The floor is yours.
Alright, thanks Diana. So as Diana said, the main research topic of my group is genome-wide regulatory networks, how they function and how they evolve. But today I would like to tell you about the topic that got me into biology in the first place as a PhD student, and this is genome evolution, in particular genome evolution in prokaryotes. And so I would like to start by provoking you a little bit. As I said, as a PhD student I started studying evolutionary dynamics, started studying very theoretical models, mathematical population genetics models. And so this is how I got into the field as a physicist. But over time, over the years, I've sort of started developing this feeling of unease about the connection between experimental data and theory in evolution. So why is there still almost no predictive quantitative evolutionary theory? It's still extremely rare for evolutionary theorists to make concrete predictions such that you can then go out and gather some data and check. And so I've been thinking about why this is, and I want to provoke you by saying that I think there are some very deep reasons for why this is. To introduce this, I'm going to quote for you Richard Feynman. He's a well-known physicist, and this is from his first-year introductory lectures for Caltech undergraduates, and he's explaining to them what you do when you do physics. And in this sort of chatty Feynman style, he's explaining: well, the first thing we do is we observe phenomena that interest us. Some things are happening in the world, we think it's interesting, we want to understand what's going on. And the first thing we have to do is we have to formalize the phenomena by rigorously defining measurable quantities.
We want to come up with some things that you can measure, that you can reproducibly measure and that you can quantify. That's step one: find some experimental things that you can do, that you can measure, that are reproducible. And then you start looking for quantitative relationships between these things that you can measure. So you find basically phenomenological laws that connect the measurable quantities to each other. And then finally, the glory of physics is when you find a theory that, in one fell swoop, explains many of these relationships and makes sense of them. Right, so Galileo is sliding balls down slopes and measuring how far they move as a function of time. And at the same time, Kepler is saying that all the orbits of the planets are ellipses that obey certain laws between the periods and the radii. And then finally somebody like Newton comes and says, well, if you actually think about it like this, then all these things follow automatically. And so what I believe, if you now look at what has been happening in mathematical population genetics, where there's basically now 100 years of, you know, sometimes very sophisticated theory about how evolution might work, is that you see that this is really very, very differently structured from this little outline here. In particular, there's no collection of phenomenological laws about measurable quantities in evolving populations that the theory is aiming to explain. This is not what it started from. It's often not even clear what the measurable quantities are that you should be analyzing to look for such laws. What things should you be measuring about evolving populations? What has happened is that this man over here, Darwin, came up with his qualitative framework for understanding how evolution works, and then people just started writing down the mathematics of toy models. So they said, what if we assume this? What if we assume that? What would happen?
And the key thing that worries me here is that some of the key quantities in mathematical population genetics are impossible to measure. Almost all mathematical population genetics models have things like fitness and effective population size, which virtually everybody will agree are unmeasurable. Okay. Most people would not call that fitness. Let's talk about this afterwards, and I will formally convince you that it's not so simple. So my feeling has been (the aim is to provoke, and I think I've already provoked Philip, so that's good) that there's a fundamental problem when you don't know how to measure, for your experimental system, the quantities that appear in your theories. I mean, if you think about it, if you're going to define your theory in terms of quantities that you admit you don't know how to measure, you're going to have real trouble connecting your theory to data. So this is sort of the point I want to make. And so I think there's really an opportunity in the study of evolution to focus on these steps here. Because these steps are typically not done so much: to say what are rigorously defined quantities that you can measure about evolving systems, and what are phenomenological laws that these quantities obey. And with the large availability of genomes today, there's a lot of opportunity to look for such phenomenological laws. So basically I just want to make a case for doing this. Because most evolutionary theorists, especially physicists that moved into this field, all imagine they're going to work on this step four here, and there are very few people that want to work on this step.
Okay, so with that introduction, I'm going to go to the topic of today. The topic of today is: how does E. coli evolve in the wild? And when I say the wild, this is the wild. This is at the shore of Lake Superior in Minnesota, so there's a car park, you can go fishing here and so on.
And the lab of Mike Sadowsky, about 10 years ago, went there one summer, and from the edge of the water, so from the water itself, from the watershed soils, the sort of red earth by the edge of the water, and also from soil further inland, they isolated E. coli. Basically they looked for all E. coli that were living there, and they sent them to us. So we have about 500 strains of E. coli that they gathered there, in one place at one point in time. And we focused basically on one of these plates, and we have extensively characterized their genomes and also their phenotypes. So we have measured the ability of those E. coli to grow under many, many different conditions. But I'm not going to talk about the phenotype data today at all. I'm just going to tell you about what we learned from the genotype data, because I think there's already quite a bit. Sorry. It's 4 times 96. It's 384. 384. There's this thing here, right? Oops. This. You see this? All right.
So the first thing to do, or what virtually everybody does when they start analyzing a set of strains, is to build a phylogeny of the strains. So here I'm showing you the phylogeny of our strains together with a collection of already sequenced E. coli strains from the database. In red are all our strains; their name is just the well on the plate where they were sitting. In black are all kinds of strains from the database. In green are well-known K-12 strains that are used in the lab. And then there are also some Shigella here, and there is also Escherichia albertii as an outgroup. All right. And this is made from the core genome, that is, the pieces of genome that exist in all strains. So the first thing that is quite remarkable, I think, is that every one of these strains was different. They didn't sequence them beforehand, right? So we could have found that half of the strains were all the same, but we didn't find that.
They were all different, which means that the diversity of E. coli in that place must be quite high, because otherwise, if there was any strain that was at some reasonable percentage in that environment, you would expect to see it more than once. But we didn't see that. So E. coli must be very diverse. And also you will see that these red strains essentially spread over this entire circle, which means that we find strains from all over the known E. coli phylogeny. So the entire diversity of E. coli that is known is essentially represented there, at one spot in one summer. All right.
Okay, so this is the phylogenetic tree. So what do we typically think the phylogenetic tree means? Sometimes, before I go to sleep, I like to think of this idea that every cell is the result of a cell division, which is the result of a cell division, which is the result of a cell division, all the way back to the first cells of life. Every cell living today on Earth is basically connected in some giant tree that goes all the way back to the beginning of life. This I find astounding. So in a sense, when we study evolution, we're sort of studying the shape of that tree. If you were a mathematician, you would say, well, you're just studying what shape that tree has. That's what you do. So for a set of strains, there is also some history of cell divisions that connects them into a tree. And if it was the case that DNA was only transmitted through replication at cell division, then by comparing the sequences here at the leaves, of the strains that we have, we could reconstruct this phylogenetic tree, that is, the ancestral relationships between the strains. And that's typically what people imagine the tree that they reconstruct from their data means. And that's also how the tree is reconstructed. These maximum likelihood algorithms basically find the tree that makes it maximally likely to transmit all the DNA vertically and end up with the strains that we see.
And this is also important because virtually any model for the evolution of sequences is formulated on a tree, right? The changes of the sequences happen along the branches of such a tree. However, we also know that DNA can be non-vertically inherited. That is, there are various mechanisms by which pieces of DNA can, for example, be taken up from the environment and incorporated in the genome. Viruses, phages, that infect bacteria may sometimes package pieces of DNA from the host and then inject those pieces into another E. coli cell, and they can be put into the genome. And there are other mechanisms as well by which DNA from one E. coli or another bacterium can make it into another E. coli and be incorporated into the genome, in particular through homologous recombination. If a piece of DNA that gets injected into a genome, into a cell, sorry, has ends that are homologous to something that already exists in the cell, it can actually get recombined in at replication. This is, in fact, how we engineer genes in bacteria. So this has been recognized in the field for a long time, and the question has been: what is the effect of this homologous recombination and horizontal transfer on our ability to reconstruct clonal ancestry and phylogenetic trees?
So we decided to look at this question, and the first thing we did is we took the genome alignment that we had, this core genome alignment from our 96 strains, and we cut it into blocks of about 3 kb each, and then we reconstructed a tree from each block. That's one of those things you can do. And you will find that every block has a different tree. So no two blocks give you the same tree and, moreover, every block significantly rejects the trees of all other blocks. So the statistical differences are not small. The log-likelihood differences are big enough that the alignment in each block confidently rejects the trees of all the other blocks. All right, so that's observation number one.
Observation number two is that if you now compare the trees of these blocks with the tree that you made from the whole core genome alignment, you find that these trees don't look very similar. It could be that they're statistically significantly different, but they still look very similar. But this is not the case. So this is basically showing you here the core genome tree for our strains and, in color, it's showing you in what fraction of the 3 kb blocks the same split occurs. Every branch in the tree is a split, a bipartition of the set of strains into two, and you can ask for each split in what fraction of the blocks it occurs, and you see that most of the splits occur in less than half of all blocks, and half of the splits occur in less than a quarter of the blocks. The trees of these blocks are not just subtly different, they're really quite different. However, and this has been something that people have been observing for a long time, if I don't just take one block at a time and make a tree, but I take a large collection of blocks, let's say I randomly select 50% of all my blocks and I build a tree from these 50%, the tree looks very much like the whole tree. That is, the more blocks you take, the more your trees start to converge to the tree that you get if you run on all blocks. And so this raises the question: does the fact that this thing converges to one clear tree, which you get when you use all blocks, mean that this is the clonal phylogeny? My reading of the literature on this question basically suggests that a large number of people in the community say yes. They basically say, well, if you take enough blocks and you always converge to the same tree, that must mean that that's the clonal phylogeny.
Probably what's happening, they say, is that horizontal gene transfer is common, but some core genes are less affected, and these effects of horizontal transfer that you see in individual blocks will be averaged out if you take enough blocks, and the phylogenetic structure that you're left with is actually the clonal phylogeny coming through when you average over enough blocks. And many researchers, in fact, are so convinced that this is true that they develop algorithms that basically first make the clonal phylogeny from all blocks and then detect horizontal transfer by looking at how much blocks differ from this core phylogeny that you make from all blocks. However, in recent years, there is an increasing number of people in the community, and I'm just mentioning two here, that say: no, we believe that this recombination is so common that you really cannot meaningfully reconstruct a tree. They don't really say so in black and white, but that is sort of the suggestion when reading the papers. So we wanted to look at this more carefully and say, okay, can we not rigorously work out which it is?
To explain how we did this, I just want to give you a couple of numbers for our core genome alignment, to give you some idea of what this thing looks like. The core genome that we have consists of 2.6 million base pairs, right, so this is 96 by 2.6 million, and in most positions all the letters are the same. All the strains have the exact same letter at that position of the genome. In 8.5% of the positions, there's a polymorphism. That is to say, there's not the same letter in all strains. But 95% of those SNPs have only two alleles. That is to say, you see only two letters. You see an A and a T, or a C and a G. The fraction of the time that you see three letters or four letters is less than 5%, that is, 5% of all the cases where you see at least two letters.
Now, what this means is that even though each position in the genome might have a different phylogeny, it is still the case that in each position, except for those 5%, there was either zero or one mutation. So when all the letters are the same, there was most likely no mutation. When you see two letters, it means one mutation occurred in the history of that position. Okay? So this is sort of what it looks like, right? You have your strains by the positions, and you see these two colors, right? Some strains have one letter, some strains have another. Now the fact that these things almost always correspond to a single mutation has an important implication, because even though I do not know what the phylogeny is at each point, a SNP gives me essentially one bit of information about the phylogeny at that point, right? So if I see this SNP, where A, B and C have one letter and D, E and F have another letter, then I know that whatever the tree is at this position, it must have A, B, C on one side and D, E, F on the other side. That is, it must have a split in it, a branch in it, that separates A, B, C from D, E, F, and this mutation occurred on this branch, right? So you can now play these games and say, okay, for this one, I know that this is a mutation that happened on the branch going to strain D, because only strain D has this letter. This mutation, I know, must have happened in the ancestor of E and F, and so E and F must be nearest neighbors in the tree because they share this, and the mutation happened on the branch leading to the ancestor of E and F. So I can now combine these, right? That resolves a bit more. Now if I also take this SNP, I say, aha, so there's another mutation that splits A, B, C from D, E, F and happened on this branch. So as I'm gathering more and more of these SNPs, I can make more and more splits, and I can make more and more precise what the tree must be in this area. However, I can also have clashes.
All right, so if I look at this guy here, this says there is a letter that B, E and F have, and A, C and D have another letter. Now this is still consistent with this one, and it's also consistent with this one. All three of these can go in one tree, but these two columns are inconsistent with each other. I cannot make a tree that has both this split and that split, all right? So now I already know that somewhere between this position and that position, the phylogeny must change. Just from these clashes between the patterns of these bipartitions, I can tell that there must be a change. So now I can ask, for example: how long are the stretches along the alignment where the SNPs are compatible with one tree? How many of the SNPs in total are consistent with the core phylogeny that I made, and how many of these SNPs are compatible with any single phylogeny that I could construct?
All right, so let's look at the first thing. Stretches of compatible SNPs are very short. If you ask how many SNPs in a row you can go before you get a clash, you get this distribution. On average it's, you know, four or five SNPs and then there will be a clash. And you know these SNPs come every 10 base pairs or so. So every 70 base pairs or so, there's a clash, and you have to switch phylogeny. And you can look at it and find that, for example, the phylogeny in this E. coli core genome alignment must switch at least 30,000 times as you go along the core genome alignment. And actually these kinds of estimates are done using the same kind of tests that people typically use to do this for human genomes. But people have not applied these things to bacteria, simply because of that. Sorry? I don't know.
So now one may still say, well, okay, maybe it's true that the phylogeny changes many times, because maybe there are many very short regions where the phylogeny is different, but maybe there's still some sort of background phylogeny that is dominating.
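The clash test described here can be sketched in a few lines. This is a minimal illustration, assuming each biallelic SNP column is encoded as a string of 0s and 1s over the strains A to F. The function names `compatible` and `min_phylogeny_changes` and the example columns are made up for illustration, and the greedy scan is just one simple way to get a lower bound on the number of switches; it is not necessarily the test that was used for the numbers quoted in the talk.

```python
from typing import List


def compatible(col_a: str, col_b: str) -> bool:
    """Return True if two biallelic SNP columns can sit on one tree.

    Each column is a string of '0'/'1', one character per strain.
    Two splits clash exactly when all four allele combinations
    (00, 01, 10, 11) are observed across the strains.
    """
    return len(set(zip(col_a, col_b))) < 4


def min_phylogeny_changes(columns: List[str]) -> int:
    """Greedy lower bound on how often the tree must change.

    Scan the SNP columns left to right and start a new stretch
    whenever the next SNP clashes with any SNP in the current stretch.
    """
    changes = 0
    stretch: List[str] = []
    for col in columns:
        if any(not compatible(col, prev) for prev in stretch):
            changes += 1
            stretch = [col]
        else:
            stretch.append(col)
    return changes


# Hypothetical columns over strains A, B, C, D, E, F (in that order).
snps = [
    "000111",  # A, B, C versus D, E, F
    "000100",  # only D carries the other letter
    "000011",  # E and F share the other letter
    "010011",  # A, C, D versus B, E, F: clashes with the first column
]
print(min_phylogeny_changes(snps))
```

In this toy example the first three columns fit on one tree, and the fourth forces one change of phylogeny somewhere before it.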
It's just interspersed many, many times with little blocks with another phylogeny. So we can ask: what fraction of all SNPs is consistent with the core phylogeny? If I take my core tree, what fraction of all my SNPs actually corresponds to a split in that tree? It turns out that it's less than 30%, okay? So more than 70% of the SNPs that I have do not correspond to a branch of this tree. And, this goes a bit further, normally the length of a branch that you've inferred with your maximum likelihood should be proportional to the number of mutations that you expect to see on that branch. But we see that this is orders of magnitude off for multiple branches. So this basically says that the core tree doesn't summarize the SNP statistics at all. Now you say, well, maybe this core tree you made by assuming a maximum likelihood model that assumes one tree; maybe you should just ask, can you make one tree that captures as many of the SNPs as possible? Can I make a tree that has as many of the SNPs as possible as splits in the tree? And then you see that you can do a tiny bit better than with the core genome phylogeny, but basically it doesn't help. So it's impossible to find any one tree that captures even 30% of all the SNPs. All right.
Now you may say, well, okay, right now you're looking at all strains at the same time. Maybe there's still a nice phylogeny that you can reconstruct. It's just that there's a small number of strains that are messing everything up, that are strange, and that by adding them in you get all these breaks, all these inconsistencies. And if I were to look at a subset of strains, or subsets, I could still see a phylogeny. So to do this, we say, well, let's look at the smallest subsets that still have a discernible phylogeny, and these are quartets.
So if you have a quartet of strains I, J, M, N, there are three possible topologies: either I and J are nearest neighbors, and M and N, or I and M and then J and N, or I and N and M and J. So we collected a large number of such quartets, both quartets that were close and quartets that were further apart, and along the y-axis is the fraction of SNPs that clash with the SNPs that define the topology. So just to explain: when you have mutations, the only mutations that tell you something about this topology are mutations where two strains have one letter and two other strains have another letter. Because if it's three-one, it's a mutation on a terminal branch, and that doesn't tell you anything about the tree. So the only SNPs that tell you something about the tree are the two-two SNPs. So you can ask, for all two-two SNPs, if I take the tree that has the most SNPs, what fraction of SNPs supports another tree? And you will see that if there was just one tree, all these dots would be down here, but you see that all the dots are up here, and some of them are actually really, really close to having every possible tree occur equally often. So also for small sets of strains there is no clear phylogenetic structure that you can infer.
All right, so finally you can go and look at pairs of strains. Take a pair of closely related strains. So here, this is a zoom-in from the tree: F3 and E4, they're really close. I mapped the reads from F3 to the assembly of E4, and then basically I went along E4 and, in windows of 500 base pairs each, I asked how many mutations, how many SNPs between these two strains, are in each window. And then you will see that for most of the genome this is very, very low, but it's interspersed with these regions that are maybe 30 to 70 kb long where the SNP rate is way, way higher. In fact the SNP rate there is more like 1%, versus maybe 10 to the minus 3 overall, so this is a much, much higher rate.
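Going back to the quartet test described above, the counting of two-two SNPs can be sketched as follows. This is a minimal illustration, assuming four aligned sequences of equal length given as plain strings; the function name `quartet_clash_fraction`, the topology labels and the toy sequences are all hypothetical. If a single tree explained the data the returned fraction would be close to 0, and if all three topologies were supported equally often it would approach 2/3.

```python
from collections import Counter
from typing import Optional, Tuple


def quartet_clash_fraction(i: str, j: str, m: str, n: str) -> Tuple[Optional[float], Counter]:
    """Count two-two SNPs for strains i, j, m, n and report clashes.

    Only columns where two strains share one letter and the other two
    share a different letter are informative about the topology.
    Returns (fraction of two-two SNPs that disagree with the best
    supported topology, counts per topology). The fraction is None
    when there are no informative columns.
    """
    counts: Counter = Counter()
    for a, b, c, d in zip(i, j, m, n):
        if a == b and c == d and a != c:
            counts["ij|mn"] += 1
        elif a == c and b == d and a != b:
            counts["im|jn"] += 1
        elif a == d and b == c and a != b:
            counts["in|jm"] += 1
    total = sum(counts.values())
    if total == 0:
        return None, counts
    best = max(counts.values())
    return (total - best) / total, counts


# Toy aligned sequences (made up for illustration).
seq_i = "AAAAAC"
seq_j = "AAGATC"
seq_m = "GAGTAT"
seq_n = "GAATTT"
fraction, support = quartet_clash_fraction(seq_i, seq_j, seq_m, seq_n)
print(fraction, dict(support))
```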
And so these regions are precisely the regions that have been horizontally transferred. E4 and F3 are very close together, because in most of the genome SNPs are rare, so they have a recent common ancestor there, but in some places one of them has recombined in a piece of E. coli genome that came from much further away, and because of that you get these segments that have a much higher SNP rate than others. So one way you can look at this now is to go along the genome and, in blocks of let's say one kb long, ask how many SNPs are in each block, and then look at the histogram of the number of SNPs per block. And this is what you get for this pair. This is on a logarithmic axis: by far the most common are blocks with 0, 1 or 2 SNPs in one kb, but then there is a long tail that looks like an exponential, with on average 13.6 SNPs per kilobase. So basically these regions here are the regions that have not yet recombined, in this case 95% of the genome, and the number of SNPs has a Poisson distribution in these regions. And then these regions here are the regions that have recombined, and there the numbers of SNPs are drawn from another distribution that has a much higher mean. So now we try to do this systematically for every pair. For every pair we fit this distribution of how many blocks there are with a certain number of SNPs with a mixture of a Poisson, which is coming from the part that has not yet recombined, and a negative binomial distribution. A negative binomial is what you get when you take a mixture of Poisson distributions with gamma-distributed rates, and we find that this fits well the distributions that we get in these regions. So basically for every pair I can now make this distribution, and then I can fit this mixture model, and I can work out for each pair what is the fraction of the genome that is still not recombined and what is the fraction of the genome that has been recombined.
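The mixture idea can be sketched with a small example. This is a simplified illustration and not the fitting procedure used in the talk: it assumes the Poisson rate `lam` and the negative binomial parameters `r` and `p` are already known and only estimates the mixing weight, that is, the fraction of blocks that are still clonally inherited, by expectation maximization. The function names, the parameter values and the synthetic block counts are all assumptions made up for illustration; a full fit would estimate all parameters together.

```python
import math
from typing import Sequence


def log_poisson(k: int, lam: float) -> float:
    """Log probability of k SNPs in a block under a Poisson with mean lam."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


def log_negbin(k: int, r: float, p: float) -> float:
    """Log probability of k under a negative binomial with mean r * (1 - p) / p."""
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1.0 - p))


def clonal_fraction(counts: Sequence[int], lam: float, r: float, p: float,
                    iterations: int = 200) -> float:
    """Estimate the fraction of blocks drawn from the Poisson component.

    EM on the mixing weight only; the component parameters stay fixed.
    """
    w = 0.5
    for _ in range(iterations):
        responsibility = 0.0
        for k in counts:
            clonal = w * math.exp(log_poisson(k, lam))
            recombined = (1.0 - w) * math.exp(log_negbin(k, r, p))
            responsibility += clonal / (clonal + recombined)
        w = responsibility / len(counts)
    return w


# Synthetic SNP counts per 1 kb block: mostly empty blocks plus a few
# blocks with many SNPs (made-up data, not from the talk).
block_counts = [0] * 95 + [14] * 5
mean_recombined = 13.6             # tail mean quoted in the talk
r_shape = 2.0                      # assumed shape parameter
p_success = r_shape / (r_shape + mean_recombined)
estimate = clonal_fraction(block_counts, lam=0.5, r=r_shape, p=p_success)
print(round(estimate, 3))
```

For this synthetic input the estimate comes out at roughly 0.95, matching the share of near-empty blocks that was put in.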
So basically what I'm estimating for each pair is the fraction of the genome that is covered by these peaks, and you will see that as you take genomes that are further and further apart, these peaks get more and more common and they start filling up the whole genome. So that's what you see here. This is the example of the pair that I took: their divergence, the total SNP rate, is about 1 in a thousand, and 97% of their genomes is still ancestral. Now as you start going to pairs of genomes that are further apart, so this is 6 times 10 to the minus 3, here 54% is still ancestral; at 0.008, 20%; and by the time you get close to 1% divergence you start getting a very small ancestral fraction, here it's 10%, and now it's gone. Once you're past a distance of 1.3 percent or so, the entire genome is now covered by these guys. And so as you take guys further apart, more and more this stuff fills up, and at some point you've lost everything that came from your common ancestor. So in summary, for very close pairs, with less than 1 in a thousand total divergence, most of the genome is clonally inherited, but then there is a fairly narrow transition regime where you start losing the clonally inherited genome, and by the time you get to 1% or so all is lost, and all pairs that are more than 1% diverged have been completely recombined. And that's in fact the vast majority of all pairs: almost all pairs are here, not on this other side. So that means that for the vast majority of pairs you simply have no information anymore to tell how long ago their common ancestor was. That information is gone, because their entire genomes have been overwritten. And so that means that you cannot possibly reconstruct the ancestral phylogeny, because the information has just been lost; for most pairs there's nothing left that came from the common ancestor. So some key observations that we made are that in E.
coli the phylogeny changes at least 30,000 times along the core genome alignment, roughly every 80 base pairs. We also know, although I didn't tell you about the length of these recombined regions, that a typical locus in the alignment has been overwritten by recombination more than 100 times in its history. The vast majority of pairs are fully recombined: no sequence derives from their common ancestor anymore, and it's thus impossible to estimate the distance to the common ancestor. And even for close pairs, if you look at what fraction of their SNPs lies in recombined regions versus clonally inherited regions, they're still 10 to 1 dominated by recombined regions. Now given all that, you may ask: so why is there an apparent phylogeny? If I now build a tree, and everybody has recombined so much with everybody, and everything has been overwritten 100 times in the genome, why doesn't this look like a star? It should look like a star phylogeny where everybody's equally far from everybody else. It doesn't look like that at all. In fact we've seen that it nicely converges to some tree structure. All right, so this I think is actually the most interesting part. The way I think you can understand this is by looking at a particular type of SNPs. I'm now going to look at SNPs where only two strains share the mutation and nobody else has it. In particular I'm going to look at the SNPs where the SNP occurs in only two individuals, one of which is A1, and then some other guy X. So I'm going to look at those SNPs. Now notice that when you see such a SNP, it means that in this piece of the genome, at this locus, X and A1 must be nearest neighbors, and the mutation happened on the branch leading to the ancestor of X and A1. So if there was one phylogeny across the whole genome, then A1 could only have one nearest neighbor. There would be only one other strain that could be the nearest neighbor of A1, and so you could only see such SNPs of one type, A1 and its neighbor, and never with anybody else. But that's not
what we see. We see such pair SNPs of A1 with 17 different guys. But in the other limit, where you say, well, everybody is just recombining with everybody in the population, then A1 can be a partner with anybody, and in fact you expect it to be paired with any other guy equally often. But that's also not what we see. Many of them you only see once or twice or a few times, and some pairs, like this one, many more times. So there's not one tree: A1 has recombined with many other guys and can be a nearest neighbor with many other guys, but with some guys A1 is a nearest neighbor many, many more times than with other guys. So what this says is that this recombination is biased, right? Everybody can recombine with everybody, but some strains are much more likely to recombine with each other. So if you now plot this as a graph of all the 2-SNPs, this is only for A1; if you now do it for all of them, it looks like this. So many strains share such SNPs with many others, but the thickness of these lines is the logarithm of the number, and you see that there is a very wide distribution of thicknesses. Some of these lines are much, much thicker. So now you can actually ask, what does this distribution of the thicknesses of these lines look like? Most pairs share such a SNP never or once, but there are some pairs that have such a SNP, where they are the only two that have the mutation, like thousands of times. And it's roughly a power-law distribution. I mean, it's not really a power-law distribution, but it's fairly straight over three orders of magnitude. So it's a long-tailed distribution. Now why is this important? Because it's saying that the population structure basically exists on all scales. Instead of there being one phylogeny or a completely mixed population, there is this wide distribution of relative rates of recombination that all the individuals have with each other, and this is really a continuum that goes over more than three orders of magnitude. All right, so instead of just pairs you can also
play this with triplets or quartets or quintets or twelve guys, right? So you could say: how many times do I see a SNP that is shared by a particular group of twelve, and make a distribution. And you see this also, and all these things show roughly straight lines in a log-log plot, but with different exponents. Now, I've been telling you about E. coli, but of course we've been checking all these things for other genomes as well. I can tell you that these patterns are the rule and not the exception. This is what most bacterial species look like that we've looked at. So here I'm showing you some results: this is for Chlamydia, this is for Staphylococcus, this is for Salmonella. For all these guys it is true that most pairs are fully recombined, the genome has been overwritten many, many times by recombination, and you find this complicated, sort of scale-free population structure. Right, so we then also, Sven asked: what does this look like for humans? So if you take human genomes you also get straight lines for how often you see pairs, triplets and so on of humans sharing SNPs, but these lines have very different exponents: they're all close to three. So here I'm summarizing these results. As you go from pairs to triplets to quartets to quintets and so on, you see that for bacteria these exponents go from somewhere around one and a half to two and a half, whereas for humans they're up here. So what this says is that the phylogenetic structure that we see is not the ancestry of the species, but it's giving you information about the population structure. And the population structure of these species is not such that you can divide E. coli into a couple of discrete subpopulations, with complete mixing within a subpopulation and not between them. No, there is basically structure at all scales: every E.
coli strain has different rates of recombination with the ancestors of all other strains, and there is this very long-tailed distribution, characterized by these scaling exponents. And at the moment I still cannot tell you what these scaling exponents mean. We're thinking about this, but basically the way we think about this is: as you go to larger and larger numbers of guys sharing a SNP, you're going back in time. A SNP that is shared by two guys is recent; a SNP that is shared by more guys is typically older, so it's an event that happened further back in time. So these scaling exponents tell you something about the population structure at different points back in time. Alright, so with that I've come to the end. Summary: for most pairs of strains, none of the DNA in their alignment stems from their clonal ancestry; it has all been recombined. Recombination drives genome evolution: it introduces substitutions at an almost tenfold higher rate than mutations. You cannot reconstruct the clonal phylogeny from the DNA sequences of the species, and the apparent population, sorry, the apparent phylogenetic structure reflects population structure. It's not a single recombining population or separate subpopulations, but a long-tailed continuum of relative recombination rates, and you can characterize those by these exponents of these n-SNP distributions. Alright, with that I've come to the end, so if you have any questions I'd be very happy to help.
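The n-SNP counting described in the talk can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the group's actual pipeline: the alignment is assumed to be given as one string per column (one character per strain), only biallelic columns are used, and the minority allele is taken as "the mutation". The function names `count_n_snps` and `loglog_slope` are hypothetical, and the least-squares fit on log-transformed values is only a crude stand-in for reading an exponent off a log-log plot.

```python
from collections import Counter
import math


def count_n_snps(columns, n):
    """Count, for each group of exactly n strains, how many biallelic
    alignment columns have their minority allele in exactly that group.

    columns: iterable of strings, one per alignment column,
             one character per strain (assumed input format).
    """
    groups = Counter()
    for col in columns:
        alleles = Counter(col)
        if len(alleles) != 2:
            continue  # skip monomorphic or multi-allelic columns
        minor, minor_count = alleles.most_common()[-1]
        if minor_count != n or 2 * minor_count == len(col):
            continue  # wrong group size, or a tie with no clear minority
        carriers = tuple(i for i, a in enumerate(col) if a == minor)
        groups[carriers] += 1
    return groups


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


# Toy alignment of 5 strains (made-up data, for illustration only).
toy_columns = ["GGAAA", "TTCCC", "ACAAA", "AAGGA", "AAAAA"]
pair_counts = count_n_snps(toy_columns, 2)

# How many pairs share exactly k such SNPs (the "line thickness" histogram).
thickness_histogram = Counter(pair_counts.values())
```

On real data, one would pass the values and frequencies in `thickness_histogram` to `loglog_slope` for each n and compare the fitted slopes across n, which corresponds to the comparison of exponents between bacteria and humans mentioned above.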