 Okay, so welcome all to this SIB Virtual Computational Biology Seminar series. Today we have the pleasure to host Laurent Escoffier, who is co-director of the Institute of Ecology and Evolution at the University of Burm, and also the head of the Population Genetics Division. So Laurent got his PhD in Molecular Anthropology in 1988 at the Department of Anthropology of the University of Geneva here in Switzerland. He then went on to a postdoctoral training at the Center for Theoretical and Applied Genetics at the Red Girls University in the US. In 1991, Laurent came back to Europe, and he became metrassistant at the Department of Anthropology in Geneva, and in 1995, he became maître d'enseignement et de recherche in the same department. And in 2001, he got promoted to full professor in Population Genetics in the Institute of Ecology and Evolution of the University of Burm. So since 2008, Laurent is also a group leader of the SIB, the Swiss Institute of Bioinformatics, leading the Computational Population Genetics group. So Laurent's group is interested in the development of computational methods to understand evolutionary processes at the population and the species level. The group is also involved in the development of statistical methods, which rely on massive computer simulations to reconstruct the past demography of a species from its genetic diversity and to test among various alternative evolutionary scenarios. The group also devotes time to the maintenance and the extension of various computer programs, which you can find at the list on the group webpage, including the popular program called Arlequin for the analysis of population genetics data. So today Laurent will tell us about demographic inference from NGS data. Laurent, thanks again for accepting this invitation, and the floor is yours. Okay, thank you very much, Dianne. So it's my pleasure to be here to present to you a bit some of the work we've been doing mostly in the last two years. And as Dianne said, we are interested in trying to develop methods to infer past demographic events from genomic data. And that's what I'm going to present to you today. So what we are trying to do is to take some samples. And from these samples, try to reconstruct the past of a given population or species. And usually the true history of a species is very complex. So it involves some population splits, some subpopulation that got extinct, and then some other population at mix and from new populations that progress in time and space until the present day. So that's usually the true history of a species. And that's what we would like to reconstruct. But of course, we cannot really do that. So what we are doing is just take some samples, I said, and try to sequence the genome of these guys and infer polymorphic position snips, get the genotypes, compute the frequencies. And we try to reconstruct a history that is a bit simpler than the real true history of the real population. So what we are doing is that we are building some models. So this could be super simple models like those ones, but also more complex models that I would present to you a bit later. And given a particular model, then we try to estimate the parameters of this model. So you still need to realize that the truth is more complex than what we are, or what we can try to reconstruct. But we hope to try to get the main features of the history for a given species. So in this talk, I'll first try to present you with how we can connect patterns of genomic diversity to past history. And then I present to you a bit the method we are using to reconstruct this demographic history that is based on this site frequency spectrum. I will introduce you to these things. And then I'll go into some applications of these methods. One on the settlement of Australia by modern humans. And the other one on trying to determine if there was some gene flow between bonobos and chimpanzees in the recent past, okay? So first, how can we connect demography to genetics? So on this slide, which is relatively complex, we can see some potential scenarios of populations. So you can have a stationary population, some population that went through recent expansion or a population that went through some kind of recent bottleneck, right? And in each of these cases, we can try to predict a bit what will be the genealogy of G that we are sampling. And these gene genealogy or these coalescent trees are quite different under these different evolutionary models. And then we can just add mutations on the branches of these genealogy. And if you look at the different mutation frequencies under those three scenarios, they have quite different properties. So in a constant population of stationary population, we have a mixture of rare mutations that occur on the terminal branches and a mixture of more frequent mutations that occur on the internal branches. In the case of recent expansion, we have most of the mutation that will occur on terminal branches. And so we will have mainly rare mutations. And if we have a recent bottleneck, then we'll have most mutations that actually can have almost any frequency. So it implies that the allele frequencies that we observe at different sites on the genome are informative about the demographic history of those populations. So and we try to summarize these patterns of allele frequencies, but what we call the side frequencies spectrum. And this side frequency spectrum is just the distribution of these frequencies of the derived allele or of the mutations. And just to show you a bit how it works. So suppose that we have just five sequences. And here we are just listing the polymorphic positions. And we also assume that we sort of know what is the ancestral state of the alleles and what is the derived allele. We can just count what is the frequency of the derived allele for each site. So if we take the first site, the derived allele is just a frequency of one. It's just present in one time out of five sequences. And at the second site, this derived allele has a frequency of two. At the third site frequency of one, etc. So we can just summarize all these patterns of diversity by this distribution of allele frequencies. And here we are just trying to see how many times do we see a derived allele that has a frequency of one, of two, of three, or of four. And we are just counting these frequencies among the sites of our sequence. Okay, so that's what we call the site frequency spectrum. And if we look at what is the shape of this site frequency spectrum under different scenarios, we can see it's very different. Right, so in blue you have the case of a stationary constant size population and you have this kind of exponential decay of the frequency of the site that are single turns, double turns, triple turns, etc. When we have a position extension, as we have seen, we have an excess of low frequency variance. We have many single turns. And when we have a bottleneck, we usually have a much flatter distribution. So these are a bit extreme cases, but it just shows you that the self-frequency spectrum can be very different under different demographic scenarios. So we are just trying then, as we shall see, trying to fit what is, different to find what is the most likely scenario that can explain this site frequency spectrum. So we can see here a site frequency spectrum within a single population. But when we have a bit more complex cases where we have two populations, we can see that depending on the time of divergence of these populations, on the level of migration between these populations, we can have different joint site frequency spectrum. So on the lower side, we have the site frequency spectrum that corresponds to the figure on top. And so in case of isolation, we can see... So let me explain you this self-frequency spectrum. So on the x-axis is the frequency of an allele in population one. On the y-axis is the frequency of this same allele in population two. So if we consider what is the diagonal, it's the sites that have the same frequencies in the two populations. And then there's a color code from very frequent cases to less frequent cases. So it means here that we have a majority of the sites that are fixed in an at low frequencies in one population or in the other population. And this is because the two populations have diverged for a long time and therefore they have drifted away and we have some private alleles in the two populations. But when we have migration between the two populations after divergence, then we have much more correlated allele frequencies. So it means that the allele frequencies are much more similar between the two populations. So we can play this game in two dimensions, we can play this game in three dimensions and we can extend that to multi-dimensions if we have more than three populations. So now the game consists in trying to see or to infer what is the expected shape of this site frequency spectrum under different scenarios. And for this we are using, in our case, computer simulations. And we are using some relatively recent, not so old theory about which is the coalescence rate where we are trying to reconstruct the genealogy of some genes that are currently sampled. So in this case we have, let's say, at a given locus seven genes and we are trying to reconstruct what are the series of coalescent events that allows us to pass from seven to six, five, four, three, two, and one single lineage until we reach this most recent coalescence system. We are just trying to reconstruct a tree that is possible under a given scenario. And given this simple tree, we are asking ourselves what is the probability that if I have a mutation, this mutation has a frequency of one, two, three, four, five, six, etc. And, for instance, in red you can see branches on which if there is a mutation, this mutation will have a frequency of two. So if a mutation falls on this B2 branch, this mutation will be found in these two lineages. It will have frequency of two. Same thing if a mutation occurs on this other B2 branch and same thing on this other B2 branch. So the overall probability that if there is a mutation, this mutation has a frequency of two is just, for this tree, the sum of the red segments over the total length of the genealogy. If it falls on something else, it will have a different frequency. So, and we can compute this probability that it has a frequency of two as the sum of these B2 segments. We can compute the frequency that it has a frequency of one with sum of the B1 segments and these B4 segments would be the probability that it has a frequency of four and this B6 segment would be the probability that it has a frequency of six. So, with a single simulation, we can reconstruct a very crude side frequency spectrum which looks like this and this is just obtained by summing these branch lengths. So what we can do is to then simulate another coalescent tree under the same scenario and we are doing the same thing. In this case, these red branches are smaller, I mean less long than in the former case but these are recent relative probability to have a mutation as frequency of two. So we can somehow add this side frequency spectrum that we obtained on this second one and we can repeat these things many times. So, typically we repeat that 100,000 times or one million times to get a very smooth side frequency spectrum under any given scenario. And that's a bit what we get here. And we shall see this is implemented in the program that we have developed called Fasting Cloud 2. So that's how we're using simulations to try to get this expected side frequency spectrum under any given scenario that we can simulate. Now, the next step consists in trying to find the side frequency spectrum under a given model that is as close as possible as the observed one. So that's a bit this game that we are trying to play and we are using some likelihood framework to try to find the expected side frequency that maximizes the likelihood of the model. And again, we get this expected side frequency spectrum using simulation and the problem there would be to try to get the maximum likelihood side frequency spectrum. But that's what we have been trying to do. And this was done in a program that we published a few years ago where we, as I said, use color simulation to estimate this side frequency spectrum and approximate the likelihood. We use some specific algorithm to try to find the maximum likelihood parameters. This program is relatively fast. Now it's multi-threaded and it can explore quite a wide variety of models and also we can explore the parameter ranges relatively well and it can also handle in principle a bit an arbitrary number of percolations but for most practical purposes up to five, six, eight percolation it's okay. If it's above then it becomes a bit more complicated. So we just checked that this method works well on simulated data and we just had some kind of toy examples which was a model like this with three percolations that diverged at some time and there was some bottleneck. So it's a history that could represent a bit human demography with some ancestral bottleneck and divergence of, let's say, Europeans and Asian percolation at some stage that split from some African percolation and we're trying to reconstruct the age of this bottleneck, the divergence time, the migration rate, the time of the size of this bottleneck, et cetera. So we'll not tell you in details but the red dots represent the true value of the parameters and the black box but represent the estimated value from generalization and we see that on average our method works well and a bit better than some previous approach for complex models and this previous approach that it was limited to three percolations but with this method we can go to more than that and in this case it's a case with 12 percolation that we analyzed in some specific context of a continent that sends migrants to different islands and we can see that we can reconstruct or infer this degree of isolation of migration rates from this continent to the islands that do well even in these complex cases where we have as much as 12 percolations. So it seems to work relatively well. So we've been analyzing some kind of conventional datasets but then we were contacted by S. K. Vilslev and Anasafo Malastinas who were in Denmark asking us to try to estimate some relatively complex demography based on full genome in the context of the settlement of Australia. And so in our lab, S. K. Vilslev and Anasafo Malastinas contributed to these computations. So just to briefly introduce you to the settlement of Australia so what is remarkable for the settlement is that it's very old. So there are traces of human presence for about a bit more than 50,000 years from archaeological artifacts and close to 50,000 years with real fossils in Australia. So if we look at the map of Australia we see that currently it's detached from New Guinea but maybe up to 10,000 years ago there were still some kind of connections some land bridges between these two islands and this formed some kind of big continent called Sahu and this was also disconnected from another subcontinent called Sunda which included many Southeast Asian islands. And it's not really clear how and exactly when this Sahu continent was settled but what is relatively clear is that to go from Southeast Asia to Sahu you need to cross at least 8 to 18 sea crossings that were at least 30 kilometers long. So it re-subgest that the people did that when they're just drifting like a tandem but maybe had some kind of boat technology that allowed them to colonize this continent. But there's no trace of such boats, there's no remains and there's no real proof that it was the case. So it's a very old settlement which is much older than the re-discovery of this continent by Europeans it was only in the 17th century. So to study a bit the diversity, the diversity of this continent Esquivelus Leve and these collaborators got samples from 83 Australians that are scattered over these different samples there with the stars represent the sampling sites these have the sample sizes and so they are scattered over different regions so northeast, central desert and some coast samples and also in this study we had available some 25 Papuans from the highlands in New Guinea. So I will not present you in details all the diversity of these samples but just something that one needs to realize is that these samples, these ocean samples where some of them are mixed. So we computed the admixture rates of the different samples and it varies between non-admixture to up to 80% admixture per individual but these volunteers were just self-reporting their ethnicity so all of them consider themselves as Austrian Aborigines. So but of course to infer properly the history of these individuals we needed to look at the least possible admixtures and that's why we concentrated at least for this demographic inference on some individuals from the western central desert that had been in contact with Urbans very recently. So one of the questions that we were asked to address was to try to see if we can distinguish between different scenarios about the causation of Eurasia and Oceania by modern humans and until recently the main model that was valid was a model where we had an expansion out of Africa, a single exit out of Africa and a single migration wave that led some wave to go directly to Australia and New Guinea and some other wave that went to Europe and Asia and this Asian wave also led to the colonization of the Americans. But relatively recently maybe in 2011 there was a study also led by S.K. Bieloslav that analyzed a single sample from Australia, some tough of hair that was at least 100 years old so devoid of any admixture with Europeans in principle. And from the analysis they inferred that it could have been actually two waves out of Africa. One that would have gone directly to Australia and the second one that would have started maybe 15,000 years later that would have led to the colonization of Eurasia. So there were still these two competing hypothesis and one wanted to see with these new samples which has much more information and before high coverage what we could see about these scenarios. But to study these scenarios, well we had also to take into account some other populations because now it is known that when human populations migrated out of Africa they hybridized with some archaic humans. So the current scenario is that there was maybe some first averageization with Neodotol that occupy this range of western Asia and maybe later on for some population that led, that went to Oceania, some later admixture with some archaics that are called denizovans. So this denizovan population, we have no trace of them in East Asia but we have just a trace of them in the Altai mountains in this denizovan cave, that's why they're called denizovans. So just to really, really summarize that very grossly somehow what has been found is that non-Africans have 1, 2, 3, 4% maximum of Neodotol like mixture so up to 4% of their genome is from Neodotol regions but in Australians and in New Guinea we have in addition to that 3% to 6% of the genome that is of denizovan origin. So which means that these Australians have something very special compared to non-Australians and so we needed to take this admixture into account because if you discount that then you wouldn't expect them to be as different as they are from the other humans. So we needed to distinguish basically models where you would have a single wave out of Africa from models where we have two separate waves out of Africa. And to address this we used this list of full genomes so we compared actually seven Australians from this western central desert location to some Asians from China and to Europeans from Sardinia and to Africans in Nigeria, Europeans. And we also added, as I said, some Neodotol genome and some denizovan genome. So in total we got the full genomes of these needles at more than 50x and we really concentrated on genetic regions to remove any potential effect of selection. We removed CPG islands to avoid multiple hits by mutations and we also excluded some regions that were difficult to align to Neodotol sedenizovans just to prevent some mismapping of some reads. So in total we had almost one gigabase basis of data more than four million SNPs and we used the six-dimensional cypherxy spectrum to fit our model. So what we get after many trials and tests of different models we get something like this. So we have on these graphs the relationship between the four modern humans, population from West Africa and from New York, Europeans, East Asians and Australians. And in this model we allowed for two possible exits of other Africa. And we also assumed that this exit occurred not directly from Europeans, but from maybe some East African population that diverged some time ago from this West African population. And if we examine a bit this scenario in more details what we find is that even though we could have had two exits out of East Africa say what we find is that the divergence time from this East African population is super similar. And actually we check that the likelihood if we just allow a single exit is identical to what we find in this case. So we have no evidence of two separate time of these exits out of Africa. So we also have relatively strong bottleneck that is associated to this exit there. And also we find that we have a main source of unmixed with other tools that occurred just after this bottleneck. So it's really as if there was some bottleneck to get out of Africa and then mixing immediately with another tools. And then later on we have some evidence of some additional unmixed pulse in Eurasians but almost nothing anymore in Australians. So if we really had two exits out of Africa we would not see this shared unmixed event but we really expect to see two strong and independent unmixed events in both ancestor of Australians and the ancestors of Eurasians. So this scenario really supports the view that there was actually a single exit out of Africa. Quite interestingly if we, and this actually scenario was obtained by explicitly modeling admixture with arcades. So we also had this strong input of Denisovan admixture into Australians. And this has been clear contrast to the results we find, the best we find if we do not allow for admixture with arcades. When we don't allow explicitly to have admixture between another tools and non-Africans and Denisoans and Australians then what we find is some evidence of really two separate times of exit between Australians on one side and Eurasians on the other side. And that's really compatible with what was found before but also in this previous study that the second mixture was not accounted for. So it really seems that not taking into account this past history of admixture with arcades humans would give you the impression that the Australians are really super special compared to the other Eurasians but actually they are not that. So to summarize this scenario so we find evidence of initial exit out of Africa maybe 72,000 years ago then some separation of these Australians from these other Eurasians maybe 50,000 years an arrival in Australia around 50,000 years and for them quite strangely some data admixture that is after this arrival in Australia but still the confidence interval is overlapping so we really don't know where this admixture occurred but it must have occurred really at a time that was really close to the settlement of Australia. And only later on maybe 40,000 years ago separation between Europeans and Asians. And we also try to account for admixture with additional archaics other than other social settlements we didn't find any evidence for this and some groups have reported potential admixture of this Australian wave with some ancient humans that could have left Africa even earlier than this but we didn't test this but we don't know at this stage. Okay, so and then the direct example I wanted to present to you today is some study that was led by Thomas Marquez Bonnet from Barcelona and in which we have done a bit the same type of game trying to find a bit the demographic history of some population but this time not of humans but of chimpanzees and bonobos and again this work was mainly driven by Victor Souza and with the help of Zabel Dupont. So if we look at the current distribution of these chimpanzee and bonobo populations we see that they occupy quite distinct regions. So the bonobos are just south of the Congo river and all the chimpanzees are north of this Congo river. And also other subspecies of chimpanzees are separated by some rivers and it seems that these rivers may play a role in preventing gene flow maybe not completely but at least to some extent. So currently we have four recognized subspecies of chimpanzees the western chimpanzees Niger Caron group chimpanzees and they have their own name and they are quite different from the bonobos which are different species. So this current distribution can be explained potentially by some past climatic changes because we if we look at the current distribution of the rainforest because these chimpanzees and bonobos live in the rainforest currently we have a large fragment of the rainforest on Central Africa but also in some in Western Africa maybe 8,000 years ago which corresponded to a much wetter phase of the climate then this rainforest was completely continuous between Western and Central Africa but during the last place on maximum actually this forest were much smaller and quite well separated. So of course it does not correspond to events that necessarily occurred during those times but it's likely that you had during the Pleistocene many cycles of dry and wet phases that could have led to the isolation of some species in some patches of forest. So we think that maybe it's a climate that drove a bit this kind of species. So if we just do a simple principle component analysis we see that Western chimpanzees are very different from both the Central and Eastern which are quite similar and then that are quite far away from the Niger Cameroon populations. And that's only if we consider chimpanzees but then the question was to see if we find any trace of potential gene flow between these two species chimpanzees on one side and bonobos on the other side. And there were some hints of some potential gene flow by looking at very simple statistics like these D statistics. So these D statistics are obtained by comparing an odd group in this case humans to three populations. And when you consider polymorphic sites where bonobos have a derived allel then you're asking yourself is this derived allel more or less shared with two other populations which could be Central and say Western chimpanzees. And if this derived allel is not more shared between bonobo and X and bonobo and Y population then this D statistics should be zero. So if it's shared more with the X population this D statistics should be larger than zero if it's shared more than with the Y population it should be smaller than zero. And when you compute this statistic between different X and Y chimpanzee populations then you see that you have positive values which implies that you have had some gene flow between the bonobo and X because this additional gene flow makes you depart from the value of zero. And the largest discrepancies between Central and Western which implies that Central had much more gene flow from bonobos than Western. So this statistic was really easy to compute and there was really some evidence that there could have been some gene flow but when, how much, that was still a bit of a problem. So to answer these questions we looked at the genomic diversity of 5 million per species we had about 1 gigabase again and in that case we had like 6 million SNPs to play with. We had to add additional filters actually to this data set because it was not really clear how one could define the ancestral state in the ancestors of bonobos and chimpanzees. So that's why we actually looked at what we call the minor allele frequency spectrum which is the fraction of the size of these frequent alleles. And we also had to deal with another problem is that we potentially had many sides that went through multiple mutations and we were really looking at recovering species it's not like in humans where we just compare populations so the longer the gene energy the more likely it is that a site had been hit many times by mutations. So for this we used by some statistic that is called GERP statistics which are just some conservation scores and these GERP scores between minus 2 and 2 it is supposedly neutral and is not moving too fast if it has a statistic that is smaller than minus 2 it implies that you had a lot of mutations at those sites so we removed those potentially fast evolving sites from the analysis. And then we consider models where we had only 4 populations to simplify things. So then we compare different models without bonobos gene flow. So for instance we use this model where we have a primary split between bonobos and common chimpanzees and so that could have happened somewhere here and we have this split no gene flow and then we have a lot of phase where we have divergence between these more western chimps and central eastern ones and then we still have some gene flow between the two we allowed for gene flow between the two and then this gene flow would stop at some time and we also wanted to determine when this time was. So this is a bit the kind of scenario that we envisioned a simplified version and when we have scenarios with bonobos gene flow somehow we assume a bit similar things between gene flow between bonobos and either the ancestor of central and eastern and common chimps and maybe more recently only with central chimpanzees. And again we allow for gene flow between chimpanzees. So when we compare these models with and without gene flow what we find is something like this we just compute the likelihood and look at the difference in likelihood and we find that this difference is huge. So this is in lockdown likelihoods and even though we cannot formally test that using likelihood ratio because we have composite likelihood and not proper likelihoods we still think that this difference in lockdown likelihood is so huge that really this we can clearly say that this model that includes migration with bonobos is really favored by the data. So what we find finally so it's a scenario where we would have very old divergence between bonobos and chimpanzees actually models that do not include migration between bonobos and chimpanzees would date these divergence maybe to less than one million years and we find more than 1.5 million years. So it's a much older diversion than what people would have thought previously. Then we also can date a bit the divergence time between these other chimpanzees species. Maybe half a million years between a clay that would lead to western and Nigocongo from a clay that is leading to eastern and central chimpanzees and maybe a quarter million years ago diversion between western and then Nigocongo and very recently less than 200,000 years diversion between eastern and central. So also we can try to measure this level of gene flow. What we find is that in the old part of the phylogeny we have some kind of asymmetric gene flow from bonobos to chimps than from chimps to bonobos but it's relatively comparable. Then once the different subspecies of chimps emerge then we had some gene flow in both directions actually some from ancestors of these these two lineages with bonobos but more recently we had mainly gene flow from bonobos to central and no real trace of direct gene flow or even indirect gene flow to western chimpanzees. So and of course we can also measure these level of gene flow between current populations actually. So I think that you can see that these nitrogen sequencing data can allow you to really infer relatively complex scenarios. So most of these scenarios we've envisioned had more than 40 parameters so it was some very complex scenario that we can only investigate using these old genomes. But these NGS data somehow needs to deal to some proper handling. So for instance when you have known PCR free protocols then you can have PCR duplicates and somehow includes some PCR errors and excessive singletons in your spectrum so that's a problem so it's better to use these PCR free protocols. What we also did was to really only consider the side that we are devoid of any missing data that we need to pull about the polarization of the ancestral state which is not always easy. You need really to have relatively high coverage data or find a way and there are some tools for this to try to infer the side frequency spectrum that with low coverage data that's something also we learned that you need to remove these fast mutation rate sites that have specific comparisons. You can just handle intraspecific and intraspecific data in the same way. Also we have been a bit wary of some other filtering like Hardivine-Berger-Germ filtering because when you do these Hardivine-Berger filters on a pool of samples somehow you would tend to exclude sites that have different frequencies in different populations and somehow you would only exclude sites that are compatible with high migration rate if you want and so you should really do Hardivine-Berger-Germ filtering just within population, not on whole pooled samples. So these demographic inference procedures seem to be nice but they require still a lot of computer time. Last year we used more than half a million CPU hours to actually do this computation for these two studies so it's really super co-intensive because actually I just showed you the final results but to read these final results you need to test many different models and you need to repeat experiments you need to compute confidence intervals and that requires really a lot of computer time. And something that I would also mention that is worth considering is that even though you can just take one single scenario and try to infer the parameters under this scenario it's relatively easy but what we have also learned is that you should really check that the scenario that you're envisioning is really the best possible scenario and for that you need to examine different scenarios that includes on the gene flow, that includes all these different specificities some publication expansions, bottlenecks to be really sure that in the end you've got the right one even though we've never had completed true history but you should really take or put some efforts into exploring different models in your data. Okay so this project was done with a lot of different people so Vito and Isabelle had to develop another program in testing the checking of different things and then you have all the people that were involved in our Austrian project and in Chippen-Sebenen project and of course this is only a subset of people that currently studies there are dozens and dozens of other people that are not listed here but you have seen the author list in the previous slides that needs to be acknowledged and also in the case of people who also need to acknowledge all the participants to the sampling and the people who contributed their blood actually to make this study possible and so I would be happy to answer any questions if you have any and thank you for your attention.