So we start. Welcome, all, to the SIB Virtual Computational Biology Seminar Series. Today we have the pleasure of hosting Jérôme Goudet, who is Associate Professor in Population Genetics at the Department of Ecology and Evolution at the University of Lausanne, and is also a group leader at the SIB Swiss Institute of Bioinformatics. Jérôme studied biology and agronomy in Paris, at the University of Paris 7 and the Institut National Agronomique in France. He earned his PhD in 1993 by studying the genetics of geographically structured populations at the University of Wales, in the UK. Then, from 1993 to 1995, he pursued his career with postdoctoral training at the University of Lausanne, where he then became first assistant and, in 1996, maître d'enseignement et de recherche. He then worked partly at the Institute of Ecology at the University of Lausanne and partly at the Genetics and Biometry Laboratory in Geneva. In 2001, Jérôme obtained his tenure-track position in Population Genetics at the Department of Ecology and Evolution here, and for seven years he was also the vice head of the department. Since 2004, he has been Associate Professor in Population Genetics in the same department. Since 2016, he has also been the head of the master's program called Behaviour, Evolution and Conservation. The main focus of the group concerns the understanding of the interplay of population structure, trait architecture, and selection. For this, group members use different approaches, from theory and the development of statistical tools to field observations. The main biological models currently used in the group are the barn owl and [unclear] bats. On the theoretical side, they investigate the dynamics of multilocus genetic systems under the influence of selection, migration, and drift. And they develop comprehensive individual-based models as well as statistical methods to infer selection, mating systems, and population structure.
So today, Jérôme will share with us his insight into past demography using, on the one hand, today's barn owl and human genetic data and, on the other hand, computer simulation. So thank you again, Jérôme, for accepting our invitation, and the floor is yours.
Thank you. Thank you for the invitation, thank you to the audience, and thank you to the people online. One thing that is perhaps a bit different from usual seminars is that, since there are people online, it is not recommended that we stop during the presentation. So unless there is something crucial that you don't understand, in which case just say so and ask, it would be great if you could hold your questions until the end. So yes, what I want to talk about today is how we can use genetic data to infer past demography. And when I talk about genetic data, I am talking about today's genetic data. I am going to use two examples. One is based on the barn owl: I have a now quite long-standing collaboration with Professor Alexandre Roulin, also in Lausanne. He has been following barn owls for 30-plus years, and I came into the game a few years ago. I will also use the treasure trove of human genetic data available to show you what we can and cannot do when inferring past demography from genetic data. So the question is: can we infer the past demography of species based on today's genetic data? And I think the key point here is today's genetic data. Why? People interested in the past are historians; they look at books where things are recorded, or they look at bones, trying to get material from the past. Here we are not doing that at all, although the field is changing and some people are starting to use ancient DNA. What we are going to talk about today is present-day genetic data. How can it be used to understand the past? So why are we interested in such questions?
Well, first of all, because past demography is important for understanding the history of a species. Has it been small for a long time? Did it go through a bottleneck? Has there been an expansion? Did a new range become available that it could then colonize? All these kinds of questions are relevant to understanding where species come from and whether they might be on the way to speciation, dividing into two groups that can no longer reproduce with each other in the long term. Connected to this is who has been connected to whom, and how long ago. Has there been recent migration between populations, or does it date back a long time? Again, this is related to the idea of speciation and to how differentiated groups of populations are. We might also be interested in the selective pressures exerted on population X or population Y. For instance, if a range opens and a species suddenly goes up into the mountains or encounters a new environment, what are the effects? What genetic changes can we see there? And last, can the pattern observed at gene Z be due to selection? We see a pattern that does not look normal, that does not look as if it arose by chance. But can we really say that? These are the types of question we might be interested in. So why are today's genetic variants relevant for these kinds of questions? They are today's genetic variants, so they can tell us something about what is going on today. But what is in them that tells us something about the past? Well, for instance, we know that if you see an excess of very rare variants, singletons or doubletons, alleles that are at very low frequency (and "in excess" means that we need to have a comparison), it is a signature that the population has expanded recently. If we see high diversity in a population, much more variation than in other populations, it is a signal that the population has been large and established for a long time.
If we see large differences in allele frequencies between two sets of populations, it means that there has been little exchange between these populations. But all these are qualitative arguments, and they need to be quantified. So how can we quantify this? Ideally, I would take a pen and write an equation: the frequency of my different variants is a function of the mutation rate, of the effective size of the population here, of the effective size of the population there, of how much migration there has been between these two populations, of the ancestral population size, and all these parameters. Well, we are not there yet. Or at least we do not have that yet for complex demographic models involving many populations. So what we do, and this is where computational biology comes in, is resort to simulations. One way to do it is to obtain the likelihood of a given allele count as a function of all these parameters, but this is quite time-consuming and heavy. The other way is to use approximate Bayesian computation (ABC). Often when people hear the word Bayesian, half the room freaks out and the other half says, yeah, great, Bayesian. Well, you will see in a minute that behind approximate Bayesian computation there is nothing very tricky, no complex mathematics or anything. We are just going to use computers. In a nutshell, we have observations from our species. These observations of genetic variants are due to a series of unknown parameters, those I mentioned before: the mutation rates, the sizes of the different populations, whether they grew or shrank in the past, whether they were connected and then split. All this produces the frequencies observed today, but we do not know how to write the equation. So we have these observations, and from them we get summary statistics. What are summary statistics?
Well, it could simply be the distribution of frequencies, but it could be some more derived statistics, like the squares of these frequencies, from which we get the genetic diversity, or several variants of this. And we have this for the observations. Then we sit down, we think about where the species could be coming from, and we elaborate a model. Elaborating a model means that we know the land, we have an idea about past climates and that sort of thing, and we develop ideas about how populations were connected or not in the past, how many there were, et cetera. This is our base model, and we plug it into computer simulations. We run some demography. On top of this demography we have the genetics. From the genetics we obtain summary statistics, we compare the summary statistics from the simulations to those of the observations, and we try to minimize the difference between them. Initially we sample the parameters from a very large set of possible values, this is our prior for the different parameters, and by comparing the simulated summary statistics to the observed ones, we reduce the range of viable parameter values for our data. So the summary statistics are statistics obtained from the data sets. The question is obviously which ones we should use. The simplest one we can think of is the distribution of allele frequencies, known as the site frequency spectrum. We have this either for the whole set of observations or for each population independently. I am not going to discuss this at length because Laurent Excoffier, in a virtual seminar I think one year ago, presented what he has been doing using the site frequency spectrum. Another way to tackle the issue is to take a large panel of summary statistics. So when I say a large panel of summary statistics, what could it be?
It could be the number of alleles per locus, or the distribution of the number of alleles per locus. It could be the genetic diversity in each population. It could be the genetic distance between populations. It could be all sorts of things. There is a huge range: population geneticists have been very good at generating new estimators and new descriptors of genetic diversity. So we could use all of them and compare them between the observations and the simulations. But when I say this, you already see that there is a problem. The more statistics we have, the more difficult it is to match the two sets: we are moving into a space of higher and higher dimension. So people using an exhaustive set have tended to reduce it with some form of multidimensional scaling. Another approach, and the one I am going to focus on in my two examples, is to use a reduced, expert-inspired set of statistics. Basically, the people collecting the data know their species, they know their statistics, they know what is going on, and they know which statistics should be relevant for the pattern they are seeking. And in the perspectives, I will discuss another way to approach this.
So let's move on to the first example. The first example is the barn owl, and you have a picture here of two birds from one brood. The striking thing you see is the difference in color: from the same brood, we have white birds and we have dark birds. This polymorphism is known across Europe. This bird is very cosmopolitan: you find it in Europe and all over the world. Actually, it is one of the birds present on all the continents, and this color polymorphism is also present on almost all continents. Not only do we see this color polymorphism, but the distribution of the color is not random.
Across Europe, we see a cline of color from the southwest of Europe, from Portugal, to the northeast of Europe. This is true for color, so the brown and the white, but also for spottiness. On the right-hand side, you have maps of Europe. The left column is for males and the right column is for females. The top row is for redness and the bottom row is for spottiness. What is striking across these four panels is that, as we move toward the northeast of Europe, we get darker and darker birds and spottier and spottier birds. Why? We don't know. Alexandre Roulin has been working on this for many years and has several hypotheses, but we do not know for sure why we have this pattern. So an initial question is: could it be due to chance? Could chance have created this pattern? In order to find out, Sylvain Antoniazza and people from Alexandre's group wrote to several people and traveled across Europe to sample birds. The dots on the map represent sampling locations, and the numbers above them represent the number of individuals sampled at each location. From these individuals, we have two things. We have feathers, and from feathers we can get blood and therefore DNA. And the feathers also give us the color: feathers from the belly tell us something about the color of the bird. We have 17 microsatellite loci. Today, when we talk about genomic data and several billion markers, that is quite small, but even with this kind of information we might be able to learn something about the past of the species. We had 20 populations across Europe and around 20 individuals per population. If you are interested, you can get this paper from Molecular Ecology, where we describe what is going on. And remember, I mentioned summary statistics.
One key observation, and I am going back here: Evora is this place here in Portugal. On the x-axis, I have the distance from Evora, so I am moving away from Portugal. On the y-axis, I have the mean allelic richness per locus per population, that is, how many variants I have at each of these microsatellite loci. We clearly see that as we move away from Evora, this mean number of alleles goes down. Similarly, we can look at the genetic distance between populations with a measure called FST, pairwise FST. As populations get more and more distant geographically, we see that this genetic distance increases. So we could go on and say, okay, let's simulate these populations and take the allelic richness of each population and the pairwise FST of each pair of populations. But this would make a very high-dimensional data set. Instead, what we use is simply the slope and the intercept of the regression here, and the slope and the intercept of the regression here. The simulations were performed with quantiNemo 2. quantiNemo is a program that was developed in my group by Samuel Neuenschwander. The way we used it is that we ran the demography forward in time. Around 20,000 years ago, Europe was covered with an ice sheet, and owls cannot survive where there is ice: they need open ground to forage. So as the glaciers retreated, territory became available for the owls. What we did is simulate the demography of the owls across Europe forward in time, and then the genetics was simulated backward in time. So what we did, compared to many ABC approaches, is that we had something that contained isolation by distance via stepping-stone migration. If you look at many of the papers using ABC, they use container simulations.
Basically, you have a container for each population; the containers exchange migrants and grow in size, but there is nothing that mimics isolation by distance. Here we have isolation by distance. What we want to do is estimate six demographic parameters: the local carrying capacity (how many individuals are living in each spot), the population growth rate, the size of the refugium, the migration rate, the mutation rate, and the start of colonization. So here you have a map of Europe, and we conjectured that there was a refugium in Spain, or in the Iberian Peninsula. We simulated growth from Spain across Europe in a stepwise fashion. In the first step, we let the population grow across Europe, colonizing France, then England, then central Europe, all the way to Poland and the Balkans. Once we had done that, we saved all of this information, with the migration rates et cetera, into a database, and we then simulated backward in time to get genetic data from this demography, using the coalescent. So this is one model, but obviously there is a strong prior in saying that the refugium was in Spain. We know from other species that refugia could also be in Greece, in Italy, and in other places. So we decided to go for a series of different models and find which one was the most likely. We had models with one refugium in Iberia, and we also played with the notion that carrying capacity may not be constant across Europe. We might have larger carrying capacity in the south and smaller in the north. We might have an extinction rate as we move north, et cetera. We tested all of these different models. We had models with two refugia, one in the Iberian Peninsula and one in Greece, and again we had several variants, with a cline in carrying capacity, with extinction rates, and with different migration models.
And then, as a null model, we had one with no colonization, meaning that Europe was occupied from the start, and also meaning that there is no longer any signature of the past in our data. The results are presented here. Here we just compare the likelihoods of the different models, and we see that the model with one refugium and a single carrying capacity is the most likely, followed closely by the carrying capacity cline model and the other one-refugium models. But really, what this graph shows is that the likelihood of models with two refugia is very, very small. So based on that, we picked this model here, the one with a single carrying capacity, because it is the most likely and also the most parsimonious, and from it we inferred the six demographic parameters mentioned before: the carrying capacity of each patch, the population growth rate, the refugial population size, the migration rate, the mutation rate, and when colonization started. The gray line corresponds to the prior distribution and the black line corresponds to the posterior distribution, that is, the set of simulations that best fitted our observations. You see that for some parameters the prior and the posterior are quite similar; for instance, for the population growth rate, we do not have much information. On the other hand, for the mutation rate and for the start of colonization, we have quite a bit of information. And if you look at the estimated parameter values and compare them with what is known about the movements of birds, the population sizes of birds, et cetera, they fit pretty well. For instance, we have a carrying capacity of 200 for a square of 50 by 50, and this corresponds to the density that bird watchers have been finding. The migration rate was estimated at 37% between neighboring patches, and this corresponds to the average distance between the birthplace and the breeding place of birds. So these estimates make some sense.
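The ABC loop described in this part of the talk (draw parameters from a prior, simulate, compute a summary statistic, keep the draws whose statistic is close to the observed one) can be sketched in Python. This is a toy rejection sampler, not the quantiNemo pipeline used for the owl: the simulator, the single parameter `theta`, the uniform prior, and the tolerance are all assumptions made for illustration. The only population genetics in it is the infinite-alleles expectation for heterozygosity, H = theta / (1 + theta).

```python
import random
import statistics


def simulate_mean_heterozygosity(theta, n_loci, n_individuals, rng):
    """Toy simulator (hypothetical): every individual is heterozygous at a
    locus with probability H = theta / (1 + theta), the infinite-alleles
    expectation. Returns the observed proportion of heterozygotes."""
    h_expected = theta / (1.0 + theta)
    trials = n_loci * n_individuals
    hets = sum(rng.random() < h_expected for _ in range(trials))
    return hets / trials


def abc_rejection(observed, n_sims, tolerance, n_loci, n_individuals, rng):
    """Keep the prior draws whose simulated summary statistic falls within
    `tolerance` of the observed one; the kept draws approximate the posterior."""
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(0.01, 10.0)  # assumed flat prior
        simulated = simulate_mean_heterozygosity(theta, n_loci, n_individuals, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(theta)
    return accepted


rng = random.Random(1)
true_theta = 1.0  # value used to generate the pseudo-observed data
observed = simulate_mean_heterozygosity(true_theta, 17, 20, rng)
posterior = abc_rejection(observed, 5000, 0.02, 17, 20, rng)
print(len(posterior), "accepted draws; posterior median:", statistics.median(posterior))
```

Comparing the spread of `posterior` with the flat prior gives the same kind of picture as the gray and black curves: a parameter is informative when the accepted draws are much more concentrated than the prior.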
Okay, so now we have a demography. We have an estimate of how Europe was colonized by the barn owl. The next question is: could the pattern seen in color be due to neutral processes? For that, we simulate again, using the parameters we just estimated, a neutral trait this time, corresponding to the color, and we test several genetic architectures for this trait. The goal is to see whether simulations can produce by chance as big a cline as the one in the observed data, quantified by something called PST. The results are presented on this panel, and I am going to walk you through the different steps. At the top, the three plots correspond to a trait encoded by codominant markers: it could be one locus with two alleles, one locus with 50 alleles, or 25 loci with two alleles each. So the trait could be encoded by a single major locus, one locus with two alleles. In passing, we know that in owls the color is affected by one locus, MC1R, and variation at MC1R accounts for 50% of the variation across Europe. So the idea of a major locus is not completely ridiculous, but there might be more variants, or there could be a polygenic architecture. All these dots and box plots represent the distribution, in simulations, of the difference between neutral markers and the trait, and the two vertical lines correspond to two limits. The black line is the value observed in the real data set. And you see that all the simulations in these situations are below it. Another type of architecture that might be more favorable to such a pattern emerging is one where the color allele is recessive. If it is recessive, individuals in the refugium, which were white, might have a high frequency of this allele without expressing it.
And so there is a better chance for the allele to be carried along and to advance. We see again that under this scenario, none of the simulations gives a value as high as the observed one. It is only when we force, in the simulations, the front of the expansion to contain a high frequency of the color allele that we obtain, under a neutral scenario, values as high as in the observed data. So this is an extremely conservative scenario, and even in this case only one or two simulations show something as high as the observed value. So the take-home messages are these. ABC need not be restricted to container-type simulations: we showed that we were able to do it with a real geography. A few well-chosen key summary statistics are sufficient to infer demography. And ABC can help to identify traits under selection. One step further that we have not been able to take is inferring the strength of selection from this type of approach: how strong must selection be to allow such a cline to be in place?
Okay, the next example is human expansion. If you have read the literature, in the year 2000 there was a paper by Rosenberg saying that humans were grouped into five to seven clusters, using a very famous program called STRUCTURE. The question I am asking here is: can this clustering pattern be due to a simple expansion process? The data I am going to use come from the Human Genome Diversity Panel (HGDP). This time we have 400 microsatellites. Again, we use quantiNemo. We use both exhaustive summary statistics with multidimensional scaling, and pattern statistics. The pattern statistics are the same as for the owl: basically, the slope and the intercept of isolation by distance, and the slope and the intercept of the decline in diversity with distance from the origin. The details can be found in this publication.
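The pattern statistics used in both examples reduce many per-population values to the slope and intercept of a regression against geographic distance. A minimal Python sketch of that reduction follows; the numbers are made up, and the function names and the scaled Euclidean distance between observed and simulated statistics are assumptions for illustration, not the code or the distance measure used in the studies.

```python
def regression_slope_intercept(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pattern_distance(observed, simulated, scales):
    """Scaled Euclidean distance between two vectors of pattern statistics
    (an assumed choice; each statistic is divided by its own scale)."""
    return sum(((o - s) / sc) ** 2 for o, s, sc in zip(observed, simulated, scales)) ** 0.5


# Made-up illustration: allelic richness declining with distance from the origin.
distance_km = [0.0, 1000.0, 2000.0, 3000.0]
richness = [6.0, 5.5, 5.0, 4.5]
slope, intercept = regression_slope_intercept(distance_km, richness)
print("slope per km:", slope, "intercept:", intercept)
```

With two regressions (diversity against distance from the origin, and pairwise FST against pairwise distance), each simulation is summarized by only four numbers, whatever the number of sampled populations.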
So this is a projection map of the world. The origin of the expansion is well known to be East Africa. So we simulated, forward in time, the expansion from East Africa across the world, through the peninsula here; there is no passage here; and the world was colonized. All the crosses correspond to the populations sampled in the HGDP panel. The framework of the simulations is again as follows. We have real data and their summary statistics. We perform simulations and obtain their summary statistics. We obtain posterior estimates from these statistics. Once we have these estimates, we first sample from them and rerun some simulations to generate genetic data. We obtain patterns from that, and we compare the real and the simulated data. So first of all, the type of estimates we are getting. The time since the start of the expansion was 130,000 years, which is not completely out of line with what other estimates have given. A population size per patch of close to 4,000 individuals. A [unclear] population size of 5,000,000 individuals. A migration rate between neighboring populations of 4%, et cetera. What I want you to look at here is the results of the pattern statistics from the simulations and from the observations. On the left side you have the observations; on the right side you have the simulations. And you see that the simulations match the observations very well. This is not so surprising, because we used these very statistics to run the simulations. What would be more interesting is to use some completely different statistic. The completely different statistic we are going to use is an admixture analysis, whereby we try to assign proportions of the genomes of individuals to different clusters.
What you see here is for different numbers of clusters: K equal to two, three, or four. Each time, on the left-hand side you have the results of the simulations, and on the right-hand side you have the results of the observations. Just looking at these graphs, you see a very good match between the observations and the simulations. So we have clusters emerging from a pure range expansion process. In the simulations, we have no selection. We have no barrier to migration apart from the shape of the land masses. And through this, we are able to recreate exactly the pattern we observe in a STRUCTURE analysis.
So I have been talking about microsatellites. What an old fart I am. Today we have much larger data sets. We have genomic data, with millions of markers. How can we scale up to genomic data and use realistic models? People have been doing ABC with genomic data, but with very small models: two populations, presence of migration or not, but not models with isolation by distance. I think a key point is that we need to choose computationally cheap summary statistics. If we need to perform a lot of calculation on each simulation to obtain something, it will be just impossible to run ABC-type approaches. People have been looking at the multi-population SFS, the site frequency spectrum. These are useful. Laurent has shown that we can go up to a handful or perhaps 10 populations, so 10 SFS. It is very difficult to go further than that. So the question is: how can we simulate realistic populations? I have a hint, and the hint is a statistic that I have been working on since my PhD. It is called FST. FST is a way to quantify the distance between populations. And we showed in a recent paper with my colleague Bruce Weir that FST is a simple function of matching probabilities, that is, of whether two alleles match or not.
This is something quite commonly seen in forensics. One nice feature is that it is a method-of-moments estimator, meaning that it is very straightforward to calculate. You do not need to run simulations, and it is not computer-intensive. The picture here shows that we can work out expectations for it. The solid lines represent the expectations and the dots represent the simulations. So we can work out, given population sizes and migration, what the value of the overall FST will be, or the average value of the population-specific FST. This can be calculated population by population, and you see that the fit is pretty good. So the question I have is: does this contain information about past demographic events? I will illustrate it now with data from the 1000 Genomes Project. What I did here is take the 1000 Genomes data and pool all the individuals from Africa into one group, all the individuals from South Asia, in brown, into another group, all the individuals from Europe into a group, and all the individuals from East Asia into a group. Then I calculated the population-specific FST and the average FST, not over the whole data set but per category of allele frequency. So I take the singletons and calculate FST on the singletons, I take the doubletons and calculate FST on the doubletons, et cetera, and I obtain this type of curve. The red dashed line corresponds to the average worldwide FST. You see that this average worldwide FST is rather insensitive to allele frequency: it is more or less constant over the whole range. On the other hand, the population-specific FSTs show a very different pattern. For Africa, at low frequency, we have very negative population-specific FST. It goes up thereafter and then comes down again for very high frequencies.
And we see almost the reverse pattern for Europe, East Asia, and South Asia. So the question now is: can we exploit this? Is there information here about the past demography of the Africans, of the Europeans, of the South Asians, et cetera? As a toy example, I go back to container simulations and consider two containers, Africa and South Asia, with South Asia derived from Africa some years ago. The different containers are linked by migration. I have been running simulations using the ms program, and I obtain an estimate of the size of South Asia of 9,000, compared to 10,000 for Africa. These estimates are not miles away from what is obtained using other techniques. What I show here is the same graph as before, with in black the observed population-specific FST as a function of allele frequency and in red the results of the best-fitting simulations. You see that the best-fitting simulations fit the real observations pretty well, at least at the beginning of the curve. The key point here is that it is much easier to visualize something like this than a multidimensional site frequency spectrum. In two dimensions we can do a pairwise comparison with different colors, but in three or four dimensions it is impossible to visualize. Here we can visualize things and see how close the observations and the simulations are. So what remains to be done is to explore the statistical properties of this population-specific FST, to derive a correct expression for it, and to investigate the sensitivity of population-specific FST as a function of allele frequency to different demographic parameters. By that I mean how long ago an expansion happened, how long ago the populations split, how much migration there is between these populations, et cetera. And if you are interested, I am looking for a postdoc to work on this. And so I come to an end.
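The idea of computing population-specific FST per allele-frequency class can be sketched in a few lines of Python. The sketch assumes biallelic loci and assumes the matching-probability form (M_W - M_B) / (1 - M_B), where M_W is the probability that two gene copies sampled from the same population match and M_B is the average probability that two copies from different populations match. The function names and the tiny data set are hypothetical; this is not the code used for the analysis described in the talk.

```python
from collections import defaultdict
from itertools import combinations


def matching_within(x, n):
    """Probability that two distinct gene copies from one sample match,
    given x copies of allele A among n sampled copies (biallelic locus)."""
    return (x * (x - 1) + (n - x) * (n - x - 1)) / (n * (n - 1))


def matching_between(x1, n1, x2, n2):
    """Probability that one copy from each of two samples match."""
    return (x1 * x2 + (n1 - x1) * (n2 - x2)) / (n1 * n2)


def population_specific_fst(counts, sizes):
    """counts[i][l]: copies of allele A in population i at locus l.
    sizes[i]: number of gene copies sampled in population i.
    Returns one value per population, assuming (M_W - M_B) / (1 - M_B)."""
    n_pops, n_loci = len(counts), len(counts[0])
    m_within = [
        sum(matching_within(x, sizes[i]) for x in counts[i]) / n_loci
        for i in range(n_pops)
    ]
    pairs = list(combinations(range(n_pops), 2))
    m_between = sum(
        matching_between(counts[i][l], sizes[i], counts[j][l], sizes[j])
        for i, j in pairs
        for l in range(n_loci)
    ) / (len(pairs) * n_loci)
    return [(mw - m_between) / (1.0 - m_between) for mw in m_within]


def fst_by_minor_allele_count(counts, sizes):
    """Group loci by overall minor allele count (1 = singletons,
    2 = doubletons, ...) and compute the statistic within each class.
    Monomorphic loci are skipped because M_B would equal 1."""
    total = sum(sizes)
    classes = defaultdict(list)
    for l in range(len(counts[0])):
        x = sum(pop[l] for pop in counts)
        mac = min(x, total - x)
        if mac > 0:
            classes[mac].append(l)
    return {
        mac: population_specific_fst([[pop[l] for l in loci] for pop in counts], sizes)
        for mac, loci in sorted(classes.items())
    }


# Hypothetical data: two populations, 10 gene copies each, two loci.
example = fst_by_minor_allele_count([[1, 5], [0, 5]], [10, 10])
print(example)
```

In this toy data set the population carrying the singleton gets a negative value in the singleton class, because its two copies match each other less often than copies drawn from different populations do. This is only meant to show how a negative population-specific value can arise at low frequency.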
I want to acknowledge the people from my group, in particular Samuel Neuenschwander, who developed quantiNemo. Frédéric Michaud, who took over from Fabriela, is now working hard to get version 2 out. Sylvain Antoniazza worked with the barn owl and did the barn owl ABC simulations. Ricardo and Elsa did the work on humans. And Bruce, my colleague, is collaborating with me on the population-specific FST. For funding, thank you to UNIL, the Swiss NSF, and Vital-IT. And thank you to you all.