So, welcome all to this SIB virtual computational biology seminar series. Today we have the pleasure to host Daniel Wegmann from the Department of Biology at the University of Fribourg, who is also a member of the SIB. Daniel got his PhD in computational population genetics in 2009 with Laurent Excoffier, also a member of the SIB, at the Institute of Ecology and Evolution at the University of Bern. From 2009 to 2011 he pursued postdoctoral training with John Novembre in the Department of Ecology and Evolutionary Biology of the University of California in the US. Then Daniel came back to Europe and did a second postdoc at the School of Life Sciences at EPFL, here next door, in Jeff Jensen's lab. Since 2013, he is an Associate Professor at the University of Fribourg, leading the statistical and computational evolutionary biology group, which is also affiliated with the Swiss Institute of Bioinformatics. The primary aim of Daniel's group is to better understand the underlying evolutionary and ecological processes that have been shaping genetic diversity over the course of evolution on our planet. To achieve this, the group designs and evaluates new statistical and computational approaches to infer complex evolutionary histories, developing and applying machine learning algorithms. They then apply these approaches to the wealth of data currently being generated, mostly in collaboration with experimental groups. And they are further committed to making all their developments available to the scientific community by releasing easy-to-use software packages. I encourage you to have a look at the web page of Daniel's group on the website of the University of Fribourg if you're interested in knowing more about the packages. So today, Daniel will tell us how bioinformatics and population genetics can help trace the spread of farming into Europe using ancient DNA. So Daniel, thanks again for accepting this invitation.
And the floor is yours. Let's see how this plays. This is the easy way to lose the people online, right? You just tap on the mic. OK. Thank you very much for the introduction, and thanks for hosting me here. It's the first time I'm giving a seminar where half of the audience or so is actually virtually absent from the room. So I don't know, it's kind of a weird feeling, but we're going to get through it. I'm going to talk about, as was said, the spread of farming in Europe. But before we start, I just have a picture here, for the people listening online: in case you're wondering who that person is that is talking for the next 45 minutes or so, well, it's me, this guy here that you see a picture of. Anyway, before we head into all the bioinformatics and population genetics, I'm going to start my talk by giving you a very ignorant and super high-level summary of what I think was probably the most dramatic social or cultural revolution in the history of Europe. In order to better understand that revolution, we're going to turn back the wheel of time about 12,000 years. At that time, when we look at Europe, most of the continent was colonized by societies that based their subsistence on foraging: hunter-gatherers all over the continent. We by now know that many of these different groups were genetically distinct; there was a large diversity. Very little is known about it, but we know there was a large diversity of different groups, and they were pretty much all doing the same thing, that is, hunting animals and collecting berries and fruits in the forest. Then about 11,000 years ago, something quite remarkable happened: we see the first farmers arising in the Middle East, and more precisely in an area known as the Fertile Crescent.
The change from foraging to farming is quite a dramatic change. It's not only a change in terms of what you eat; with it comes a whole set of other things that shape culture. First of all, it means becoming sedentary. Farmers tend to live in one place. They're not mobile; they don't follow herds traveling around; they stay in one place. And that is quite a change: if you think about it, it probably even changes the way we look at property and a lot of questions like that. On top of it, becoming a farmer means adopting a lot of technical things, a lot of new tools, and working with specific domesticated animals and plants. So these cultures, a farming culture and a hunter-gathering culture, are very, very different. What we see then is that after farming arose, it didn't take that much time for this way of life to spread throughout the continent. Around 8,500 years ago, the first farmers appeared on the European continent, especially down here in the Aegean region, and established their first societies there. And then within a few thousand years, all of the continent was pretty much covered by farming societies, and the last hunter-gatherers disappeared around that time. So one of the questions we're interested in is: how did this cultural revolution actually play out? There are really two different kinds of theories you can put forward about how this could have happened. For the first one, let's just turn back the wheel of time a little bit again, so we have farmers appearing in the Middle East. The first hypothesis states that the hunter-gatherers themselves were kind of jealous of this awesome lifestyle, copied the way of life, and became farmers themselves. If that hypothesis is true, then modern Europeans are direct descendants of the foragers or hunter-gatherers that lived in the same place.
These people became farmers as farming spread to Europe as an idea, as a way of life, and not necessarily through people. The alternative hypothesis is that it was actually the farmers themselves that colonized most of Europe, displacing most of the hunter-gathering people — apart, of course, from this tiny little area up here in Brittany — but otherwise pretty much replacing the hunter-gathering societies. If that hypothesis is true, then we as modern Europeans are direct descendants of the first farmers from the Fertile Crescent, and not of the hunter-gathering societies that were living in this area before. Of course, you can imagine every kind of intermediate scenario between these extremes, and one of the big questions of the last decade was trying to figure out how it actually played out. Looking at existing ancient DNA, we can get a glimpse that this change in lifestyle was most likely driven by colonization through people. What I show here is an admixture analysis. This is an analysis in which, for all the individuals included, you try to estimate the proportion of their genome they inherited from each of a number of groups. Here is an analysis run with two groups: we partition individuals into two groups, and for each individual estimate how much of their genome they inherited from each. This is all ancient DNA data that existed before we started the study. In the lower part here, we have hunter-gatherer individuals, followed by early Neolithic people, so the first farmers that actually colonized Europe, followed by samples taken much later in time, in the Middle and Late Neolithic. What we can see is that genetically, it's super easy to distinguish hunter-gatherers from the early farmers. They are really completely different groups.
That suggests, therefore, that it was really people moving into Europe, bringing agriculture with them, and it was not the hunter-gatherers that were picking up the lifestyle of being a farmer. However, very interestingly, in later periods there's kind of a resurgence of hunter-gatherer ancestry, suggesting that before hunter-gatherers disappeared, there was quite some gene flow between the two groups, so that modern-day Europeans actually have some of their ancestors from both groups. However, even knowing now that the spread of people was really, really important, it was still pretty much unclear where exactly these farmers came from. We know that farming was developed in the Near East, but the question was: were these early farmers that we see in Europe the descendants of these people in the Near East, or not? One major reason why we did not know this is that the preservation of DNA in ancient bones is strongly affected by climate. The cooler the climate, the better it's preserved, and it's just a very, very difficult task to obtain ancient DNA from warmer regions. So when we started this study, the southernmost samples in Europe were all from the Balkans, and there weren't any samples available from further south. Okay, for some reason I'm blanking on a slide here. What we then did is we started out by analyzing samples from the Aegean region, as well as from Anatolia, and trying to see whether these samples were in some way related to the modern samples that we have in Europe. This is work that was mostly led by Joachim Burger and his group in Mainz, and they have one of these super sterile, super clean rooms to work with ancient DNA. The issue is that if you actually touch an ancient bone, there's probably more of your DNA on that bone than original DNA in the bone. So it's a really challenging task.
And so these people were working really hard to get libraries out of that, okay? This is then the data that we got in our lab and were trying to analyze to answer some of these questions. Before I really delve into how we actually did some of these analyses, I want to walk you through some of the major characteristics of ancient DNA, because these characteristics really influence the way we can deal with this data from a bioinformatics point of view. The first thing is that if you work with ancient DNA, most of the data sets have really, really low coverage or sequencing depth. There are multiple reasons for this. The first one is that we generally get extremely low DNA content out of the bones. What I show here on the right is, for each of the samples — or I should say, for the five best samples that we analyzed: the lab screened up to almost 100 samples and then identified the five most promising for sequencing, and these are those five. What you see on the X axis is the amount of endogenous DNA. That's the amount of DNA you retrieve that is actually human, most likely of the person that we want to analyze. All the other DNA comes from soil bacteria and all the other things you sequence, but it's not human, okay? So in the case of, for instance, this Rev5 sample, more than 80% of what you sequence is not human. That means, if that were the only problem, just to get the same sequencing depth as for a modern person, you would have to sequence five times more, right? But there are more challenges associated with this. The other issue is that because the DNA content is very low, if you generate a sequencing library, you often get the same fragment many times: you have to amplify the DNA using PCR, and you're going to sequence the same fragment multiple times.
What I show on the Y axis for each of these samples is the number of unique fragments we're actually getting. Down here are a bunch of blanks that were run in the lab. If they prepare a library in this ultra-sterile clean room, we still see here about 10 to the 5, so about 100,000 unique fragments in those libraries — it's very hard to be so clean as to have no DNA at all in your library. If you compare that to the best sample we have in this respect, Bar8, at about 10 to the 7, that's only about two orders of magnitude more human DNA in a library prepared from an actual sample compared to one prepared from what we think was pure water at the start. So we get so little DNA that it's really hard to have a lot of different fragments, and as a result, again, you have to invest quite a lot of money, quite a lot of sequencing lanes, to push up the coverage. The next issue we have is that because the DNA was lying in the ground for a very long time, it's usually extremely fragmented. This here shows the distribution of the lengths of the fragments that we got for three of our samples. Two of them were sequenced with a 2x150 base pair paired-end library, right? So we can estimate the fragment lengths exactly, and what we see is that even though we could have fragments up to 300 base pairs, the vast majority of fragments were around 40 base pairs in length. The DNA is already extremely fragmented in the ground, so even if you invest a lot in having 2x150 base pair reads or something, you anyway just get something around 40 base pairs per read, okay? As a result, the later samples were just sequenced with single-end 100 base pair reads, because it wouldn't make much of a difference, okay?
So coming back: not only is the majority of reads you're going to sequence not from the sample you're interested in, those that are from the sample are probably half or even less in length compared to modern reads, further lowering the amount of interesting DNA you get out of this approach, okay? Now, the last issue that is common when dealing with ancient DNA is the presence of so-called post-mortem damage. Post-mortem damage refers to chemical modifications that occur to the DNA molecules over time and result in certain changes, or apparent mutations, that do not really reflect the person's DNA or the diversity that we're interested in, but are just errors that accumulate over the course of time. The most common form of post-mortem damage is the deamination of cytosines, okay? A deaminated cytosine turns into uracil, and if you do a PCR, that turns into thymine, okay? As a result, you're going to see a lot of T's instead of C's when you look at these ancient fragments. And clearly, if we were to believe that these T's are all real, we would make large mistakes in our genotyping, and as a result get completely the wrong impression of the relationships among people, or even of the diversity. It's important to note that all these C-to-T changes also appear as G-to-A changes if the read maps to the opposite strand. So these are the two types of errors we see. Luckily, these errors do not just occur randomly — or rather, they do occur randomly among fragments, but there is still a characteristic distribution with respect to the position in the read, okay? What I show here in the picture are these distributions as we inferred them for two of our samples.
The way we inferred these numbers is by simply mapping all the reads against the human reference and then counting how frequently we see a T in one of these reads where the human reference has a C. Now, certainly you do expect real differences between the human you look at and the reference. You do not expect a person to always be exactly like the reference. So there's a certain fraction of sites at which you do expect a true difference, where the individual has a T but the reference has a C, okay? That's, I don't know, something like one in 1,000 or one in 4,000 bases, but you would actually expect that, okay? However, what we see here — on the X axis you have the position in the read — is that at the beginning of the read, in certain cases more than 50% of the time that a read maps against a C in the reference, we actually see a T, and that's exactly this post-mortem damage I was mentioning before. The reason why this really accumulates at the beginning of the read is that this chemical modification occurs much more readily if the DNA is single-stranded. The DNA fragments in the ground, losing some nucleotides at the ends; we get sticky ends with single-stranded DNA, and that's where these chemical modifications appear. This is known because we can actually replicate that process in the lab: we can expose DNA to various harsh conditions and reproduce it, so the chemistry by which this happens is quite well understood. Okay, and you see that this is a common pattern in all of the libraries we looked at. There are different types of libraries you can use for ancient DNA. We have paired-end libraries, here in black, for the five samples. You can treat them with a chemical that turns uracil back into cytosine before sequencing.
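The counting just described can be sketched in a few lines. This is a minimal illustration on toy data, not the actual pipeline: for each read position we record how often the read shows a T where the reference has a C. The reference string and the (start, read) alignments are made up for illustration.

```python
# Estimate per-position C->T mismatch rates from reads aligned to a
# reference, as described above. Toy data, not from the actual samples.

def ct_mismatch_rates(reference, alignments, read_len):
    c_total = [0] * read_len  # times the reference shows C at this read position
    ct_obs = [0] * read_len   # ... and the read shows a T there
    for start, read in alignments:
        for j, base in enumerate(read):
            if reference[start + j] == 'C':
                c_total[j] += 1
                if base == 'T':
                    ct_obs[j] += 1
    return [ct_obs[j] / c_total[j] if c_total[j] else 0.0
            for j in range(read_len)]

# Toy example with damage concentrated at the 5' end of the reads
ref = "ACCTGCCATG"
reads = [(0, "ATCT"), (2, "TTGC"), (4, "GCCA"), (2, "CTGC")]
rates = ct_mismatch_rates(ref, reads, 4)  # -> [0.5, 0.5, 0.0, 0.0]
```

In real data one would of course also count G-to-A mismatches on reads mapping to the opposite strand, and subtract the expected rate of true differences.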
So you would apply that chemical first, before doing PCR, and that way you would lower the amount of these changes. They're still somewhat present, but to a much lower degree. The downside is you actually lose a lot of material, so that's why we stopped doing this. And then for Bar8 we used single-end sequencing. It is interesting to see that, because the reads do not necessarily reach the end of the fragment, they are affected by post-mortem damage at the beginning, but reads that do not reach the end of the fragment are not affected on the other side. That's why you see here in the lower panel, in blue, those reads that are exactly as long as the number of cycles we used on the Illumina machine: if you sequence 100 base pairs and your read is exactly 100 base pairs long, then most likely that read does not reach the end of the fragment, and if you look at those, we see a reduced rate at that end. But other than that, this damage is pervasive in the data we analyzed, okay? So these are the three things: we generally have a very low amount of data, and we have these errors, okay? And now we are interested in working with this data to infer genetic diversity in these ancient samples and compare them to modern samples. Generally, low sequencing depth just means ambiguous genotyping, okay? That's something we're all aware of; that's why people try to increase depth, okay? But what it means for us is that we cannot simply increase depth. We have samples which have less than 1x sequencing depth, so coverage less than 1x, and they were already sequenced on many, many lanes of Illumina. So money is not really going to solve this problem in the short term, right? If you just say, oh, let's go up to 10x, it becomes unaffordable even for super-rich labs. So we have to deal with the fact that some of these samples have depths of around 1x.
Clearly, if we are interested in inferring the genetic diversity of a sample that has just 1x coverage, then first calling genotypes and then counting all the sites that are heterozygous seems a pretty stupid idea, right? So we have to do this in a somewhat different way. In population genetics in recent years, a lot of tools have been proposed that try to infer population genetic parameters or statistics without calling genotypes. This is the whole idea that we can invest money in more biological samples rather than in technical replicates for a single sample. One of these commonly used tools is ANGSD, which stands for Analysis of Next Generation Sequencing Data. It offers an algorithm that infers allele frequency distributions, or site frequency spectra (SFS), without ever calling genotypes, okay? And they show that you can infer these distributions even if you have coverage of 1 to 2x per sample. However, ANGSD is not really applicable to our data for two reasons. First, it requires a priori knowledge of which are the major and minor alleles in the population. Now, clearly we have a lot of knowledge about, say, human populations from Europe; we could know which ones are most likely the major and minor alleles. But we are very careful in the sense that we don't want to bias the genotypes of our ancient samples through modern knowledge, right? Otherwise, they're just going to appear more modern than they probably are. So we want a way to infer things that does not require knowledge from modern populations. The second problem with ANGSD is that it does not handle post-mortem damage, and if we don't handle that, we're going to grossly overestimate diversity, okay? So we needed a different approach.
So this is where we came in, and together with a postdoc in the group, Athanasios Kousathanas, and a PhD student, Vivian Link, we developed methods to analyze this data and infer genetic diversity. Now, since this is a bioinformatics series, I'm quite excited because I can actually show some equations, and I'm sure that at least a fraction of the audience will really appreciate that. That's not the case if you give talks to other audiences. So I can benefit from that by having a couple of slides with a few equations, okay? So bear with me, because at least I'm excited about it, okay? The model we propose to infer genetic diversity is relatively simple. We're just interested in the genetic diversity of a single individual; you can call this heterozygosity. The population genetic statistic for that would be theta, okay? Which is given by 2T mu. So what is this? T is the time to the most recent common ancestor of the two sequences of a single individual. So if these are the two sequences in one individual, these two sequences have a common ancestor, let's assume T generations ago. The total length of the tree connecting the two is two times this time, and if we multiply by the mutation rate mu, we get the expected number of mutations accumulating on that tree, okay? So that's the theta we're interested in estimating. We will never be able to estimate either the mutation rate or T separately, okay? However, it clearly puts a model in place with which we can explain the genetic diversity of a single individual, and that's what we're going to use here. In addition, we said: okay, we don't want to use any modern information, but we're going to use the sequence composition in the region as information about what type of alleles to expect. If you're in a GC-rich region, you're more likely to see G and C alleles than A and T alleles. So neighboring loci do tell you something about the probability of seeing a particular allele.
So that's what we're going to say: we have regional allele frequencies pi, and these frequencies obviously sum to one, but they give us some information about what type of alleles and mutations to expect in the particular region for which we want to infer genetic diversity. If we have these two, we can build an easy model, okay? Here I just put down the likelihood for this: a likelihood in which we have a product across sites, assuming that mutations are independent between sites. That's an assumption generally made, and I think it also makes sense: just because my great-granddad had a mutation at one locus does not prevent my grandmother from having a mutation at a locus just nearby, right? So we can assume that mutations are independent between sites, and take the product across sites. But then for each site, because we do not know the genotype, we're going to integrate out the genotype: we ask, what's the probability of the sequencing data d_i given the genotype that I integrate over, times the probability of the genotype given the theta and pi that I have, okay? This way the genotype is a latent variable that we integrate out, and I'm interested in finding estimates of both theta and pi without having to know anything about the genotype in between. That's the basic idea. In the next few slides, I'm just going to walk you through the models that we actually employ for these two terms. First, what we call the emission probability: that's the probability of the sequencing data given the underlying genotype. And then the probability of the genotype given the parameters we want to know, okay? For the substitution model, that is the probability of the genotype given pi and theta, we employ the classic Felsenstein substitution model — the famous Felsenstein 1981 model, very commonly used in phylogenetics.
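Written out, the likelihood just described takes this form (this is my transcription of the slide, so the exact notation is approximate):

```latex
\theta = 2T\mu, \qquad
\mathcal{L}(\theta, \pi \mid d)
  \;=\; \prod_{i=1}^{n} P(d_i \mid \theta, \pi)
  \;=\; \prod_{i=1}^{n} \sum_{g} P(d_i \mid g)\, P(g \mid \theta, \pi),
```

where the product runs over the sites in the window, the sum runs over the possible diploid genotypes g, P(d_i | g) is the emission probability of the sequencing data, and P(g | theta, pi) is the Felsenstein substitution model.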
It has some advantages over other models, namely that it allows for recurrent mutations at the same site, okay? The model is, in that regard, very simple. The formula may look a little complicated, but if I walk you through it, you'll see that it's actually a very simple model. What we have here is the probability of a genotype given the frequencies pi and the heterozygosity theta. Theta we can see as a kind of overall mutation rate for that sequence. If we look at the lower line here: if the two nucleotides of a genotype are different, k not equal to l, then we require at least one mutation on the tree, okay? The probability that there is at least one mutation is given by one minus e to the power of minus the rate. And if that's the case, we multiply with the probability that, following the tree, the nucleotide on one side is of type k and the nucleotide on the other side is of type l. So that gives you pi_k times pi_l times the probability of at least one mutation. If the two nucleotides are equal — that is, a monomorphic site — then there are two options. One option is that there is no mutation at all, which is e to the power of minus theta; the other is that there is at least one mutation, but the last nucleotide you mutate to is the same as the first one, okay? That's why we multiply again with pi_k. So it's a very simple model: it allows for recurrent mutations and has just the parameters theta and pi that we require for our model. The emission probabilities are best explained with a little example. Consider a case in which one read at one position has the data d_j — d is the data and j is one position in the read — and we see here an A nucleotide being read.
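The genotype prior just described can be sketched as follows. This is my reading of the model as presented in the talk, not the authors' actual code; in particular, the factor 2 for heterozygous genotypes (accounting for the two orderings kl and lk) is my assumption, chosen so that the probabilities sum to one over unordered genotypes.

```python
import math

# Sketch of the Felsenstein (1981) genotype prior: P(g = kl | pi, theta).
# `pi` maps each base to its regional frequency (summing to 1), and
# theta acts as the overall per-site mutation rate.

def genotype_prior(k, l, pi, theta):
    p_mut = 1.0 - math.exp(-theta)  # P(at least one mutation on the tree)
    if k != l:
        # heterozygous: needs at least one mutation, ending in k on one
        # lineage and l on the other (factor 2 for the two orderings)
        return 2.0 * pi[k] * pi[l] * p_mut
    # homozygous: either no mutation at all, or at least one mutation
    # that ends up at the same nucleotide k again
    return pi[k] * math.exp(-theta) + pi[k] * pi[k] * p_mut

# Example with made-up regional frequencies and human-average theta
pi = {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}
p_het = genotype_prior('C', 'T', pi, 1e-3)  # tiny: requires a mutation
p_hom = genotype_prior('C', 'C', pi, 1e-3)  # close to pi['C']
```

Note how the recurrent-mutation term keeps the model consistent: summing the prior over all ten unordered diploid genotypes gives exactly one.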
And we ask: what's the probability that the read turns out to be an A if — and here for simplicity I'm going to assume the haploid case — the genotype is just a T? In this haploid case, it means there is a sequencing error happening at this location. A sequencing error happens with probability epsilon_j, and because a sequencing error can turn a T into an A, a C or a G, only in one third of the cases does it actually turn out to be an A. Clearly this is just the haploid case; you can spin this further for the diploid case, it's not very hard. I just use this as an example to show where post-mortem damage comes into play. Because if we have again the haploid case, in which the underlying genotype is a C but the base we read is a T, then there is now more than one way to end up with a T in your sequence if the underlying genotype was a C. One of the ways is easy: it's again a sequencing error. That's the second term here, epsilon_j over three times the probability that there is no post-mortem damage — so the DNA that I get from the bone still contains a C, and the T appears through the machine making a mistake. Or there is the probability that I do have post-mortem damage, D, so that the C was actually turned into a T before I even retrieved the DNA; the machine is not making any mistake and everything is fine, so I have one minus the probability of a sequencing error. You can spin it further and develop this for all the genotypes you want; we did this for all the diploid genotypes that exist. And if we have this, we can build the model by taking these emission probabilities directly into the component where we integrate over genotypes.
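The haploid emission probability from this example can be written down directly. This is a minimal sketch under the assumptions stated above (function and variable names are illustrative, not from the actual software): damage first turns a template C into a T with probability d, then the machine misreads with probability eps, uniformly over the three other bases.

```python
# Emission probability for the haploid case: probability of observing
# base `obs` when the true base is `true`, given a sequencing-error
# probability `eps` and a post-mortem C->T damage probability `d`.

def emission_haploid(obs, true, eps, d):
    def seq_error(o, t):
        # machine reads the template correctly with prob 1 - eps,
        # otherwise picks one of the three other bases uniformly
        return (1.0 - eps) if o == t else eps / 3.0
    if true == 'C':
        # template is still a C with prob 1 - d, but was deaminated
        # to a T with prob d before it ever reached the sequencer
        return (1.0 - d) * seq_error(obs, 'C') + d * seq_error(obs, 'T')
    return seq_error(obs, true)

# Observing a T over a true C with a perfect machine (eps = 0) happens
# exactly with the damage probability:
p = emission_haploid('T', 'C', eps=0.0, d=0.3)  # -> 0.3
```

G-to-A damage on reads mapping to the opposite strand would be handled analogously, and the diploid case averages the haploid emissions over the two alleles of the genotype.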
The question is: where do we get the post-mortem damage probabilities, the D here, from? Well, for this we can use exactly the curves that I showed you before. We see this pattern of exponential decay, fit a generalized exponential decay function to it, and then assume that this gives us the rate. So a priori, we estimate this from the data and then assume it to be known, okay? Once we have figured out these probabilities, we can assume that the individual reads at one location are independent, so we just take the product across all the individual data points we get for the same site to get our likelihood. So how does this actually perform? Does this work? Can we estimate theta? What sequencing depth do we require? These are some results we got from simulations. We first simulated data in a one-megabase window for different coverages or sequencing depths, from 0.2 to 3.2. What you can see is that as soon as coverage is a little over 1x, we seem to get quite accurate estimates if theta is at least 10 to the minus 3, which is about the average in the human genome, okay? So we get quite accurate estimates, even if coverage is only slightly above 1x, if we combine data from about one megabase. Similarly, we can simulate 1x coverage but then ask how the window size affects things: how many sites do I require to accurately infer heterozygosity at a given coverage? I should say, one coverage per site — that is the average coverage we simulate in the window. Again, we see that as the window size approaches about one megabase, we get quite accurate inference. If we have much less data than that, there is some bias, okay? So generally, 1x coverage and about one megabase — one million sites — are required.
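The damage-curve fit mentioned above can be sketched like this. The simple two-parameter form d(j) = a * exp(-b * j) and the crude grid search are my assumptions for illustration (the talk mentions a generalized exponential decay function, and a proper optimizer would be used in practice).

```python
import math

# Fit an exponential decay d(j) = a * exp(-b * j) to observed
# per-position C->T mismatch rates, giving a damage probability for
# every read position j.

def fit_decay(rates):
    best_sse, best_a, best_b = float('inf'), None, None
    for a in (i / 100 for i in range(1, 81)):       # amplitude at position 0
        for b in (i / 100 for i in range(1, 201)):  # decay rate
            sse = sum((r - a * math.exp(-b * j)) ** 2
                      for j, r in enumerate(rates))
            if sse < best_sse:
                best_sse, best_a, best_b = sse, a, b
    return best_a, best_b

# Noise-free toy rates generated with a = 0.4, b = 0.5
obs = [0.4 * math.exp(-0.5 * j) for j in range(10)]
a_hat, b_hat = fit_decay(obs)
```

The fitted curve then supplies the position-specific damage probability D used in the emission probabilities.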
Interestingly, the window size and the sequencing depth are two factors that work jointly here: you can either have larger windows and less coverage, or smaller windows and more coverage. One way to plot this is in the last plot on the right, where I plot, for all combinations of window size and coverage, the amount of error we make in the estimation. This is from simulations, so it's a somewhat noisy density plot, but what you should see is that all combinations of window size and coverage that result in about the same number of sites — the same amount of individual read information — give you the same performance. That means that whenever we are interested in inferring genetic diversity from an exceptionally low-coverage sample, we can just increase the window size. We may not be able to infer heterozygosity at a one-megabase window scale, but at a ten-megabase window scale it should be just fine. If we have a sample with a lot more data, then we might zoom in and get those estimates at a much finer scale. Clearly, the performance is affected by the actual diversity in the sample: the higher the diversity — the higher the theta — the less data is required for proper inference. I think that makes sense. If in a one-megabase window you have a single heterozygous site, you need quite a lot of data to confidently see it. However, if every second site is heterozygous, you don't need that much data to realize that there's plenty of diversity in that window. So we see this relationship, and that means that for very low theta values, like 10 to the minus 4, we need slightly higher coverage, around 3x, if we use one-megabase windows — or we might just use larger windows if we expect theta to be relatively small. So that seems to work quite fine.
We were quite excited, and then just took our three Greek samples here. These samples are up to about 10,000 years old, with sequencing depths between 0.8x for the worst and 3.5x for the best sample. And then we ran our estimator. These are the results we got, compared to our expectations. The average human genetic diversity is about 10 to the minus three, as I said before. Now what you see is that we get super different estimates. We get this blue guy up here, which has something like 10 to the minus two or even more; that means one in a hundred base pairs seems to be heterozygous. While this red guy is kind of jumping up and down between two values: one is one in a hundred thousand, and the other is, I don't know, something like one in ten billion sites being heterozygous. Now that should appear a little bit strange, right? I mean, in the end, these are all humans; it's just 10,000 years ago. Even if you go nowadays to the most distinct human populations that you can assemble and look at, it's hard to find anything like that range in diversity. So there's a real question: why are these estimates so different? We don't expect them to be so different. In the end, they even come from, we think, the same population or same region at the same time, right? These are pretty much the first farmers arriving in the Aegean; why should the first farmers differ so much in their genetic diversity, okay? So the answer is that these are not very good estimates; there's something else in the data that prevents us from getting proper estimates. And what we figured out quite quickly is that all these tools that try to estimate diversity without calling genotypes require accurate information about the sequencing errors at the particular site. We do take quality scores into account, right?
And if these quality scores are completely off, we're gonna get completely wrong diversity estimates. So the point I wanna make here is: we absolutely need to recalibrate quality scores if we wanna estimate genetic diversity. If the sequencing machine tells you, I'm very, very sure that this is the base you're gonna see, but it's actually wrong, you're gonna have very biased estimates of diversity. And as you will see, we can explain these differences solely by this issue with recalibration. The question is just how you recalibrate this kind of data, especially if you don't wanna use modern information to do it. Most current recalibration tools require information about which sites are actually polymorphic in the genome, which one is the reference allele and so forth. And we don't wanna make our ancient samples look more modern by imposing all that information. So we were therefore thinking hard about how we could devise a recalibration strategy that does not require any sort of reference information. Okay, here's our solution. We're relatively excited about it because it also works, of course, for all the species for which you do not have that information available anyway, right? If you're working on a non-model organism and you're the first one to sequence it, there is no dbSNP around, there's no perfect reference genome. So how are you gonna recalibrate with GATK's BQSR tools, right? You can't. Well, I think this is one way by which it's feasible, so I'm gonna walk you through it. The thing is that we make a particular assumption, namely that we can find one part of the genome for which we may not know the sequence, but we do know that it should be monomorphic throughout. And one example of this type of sequence is X-linked data in males.
So it happens that most of our Greek samples were actually males, by pure chance, because in the screening it was pretty much 50-50 regarding sex, but for some reason we got mostly males in this set. And so if we just look at the X-linked data, then for all these sites, we don't know what the allele on the X was, but we know the individual carried a single copy of the X, so we should never see a heterozygous site. Now we can use this information in the following way. We assume that there is some sort of transformation that transforms the machine quality score into the true error rate, and this transformation we write down here in two equations. First we say there is this nu, which is a linear combination of the information that we get for every base, and this nu is then transformed with a logistic function into an actual error rate. The logistic function is just there to bound the error rate between zero and one; you cannot have an error rate below zero or above one, right? So this is why we use it. And the linear function is just a combination of information that we get for every base, such as, for instance, the actual quality score, but also the position in the read, what type of nucleotide the position just before was, and all the kinds of information that BQSR also uses. But we use it here in a linear function, okay? Once we have that, we can build a likelihood function for our haploid data; this is the first equation here, where we express the probability of the data given these coefficients, the parameters of our linear model that we wanna estimate, okay? So for each site, we're gonna first integrate out the unknown hidden genotype, because we do not know the genotype. Say the genotype is A; then I'm gonna go through all reads that cover that site.
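In symbols, the idea is nu = beta . x for a covariate vector x of each base, and epsilon = logistic(nu). A minimal sketch of this transformation (the covariate set and the example coefficients are illustrative, not the exact ATLAS parameterization):

```python
import math

def logistic(nu):
    """Maps any real nu into (0, 1), so it is always a valid error rate."""
    return 1.0 / (1.0 + math.exp(-nu))

def recalibrated_error(beta, covariates):
    """epsilon = logistic(beta . x): the recalibrated per-base error rate."""
    nu = sum(b * x for b, x in zip(beta, covariates))
    return logistic(nu)

def covariates_for_base(machine_qual, read_pos):
    # illustrative covariates: intercept, quality, quality^2, position, position^2
    # (a real recalibration would also include sequence-context terms)
    return [1.0, machine_qual, machine_qual ** 2, read_pos, read_pos ** 2]

# an example beta under which higher machine quality means lower true error
beta = [-0.5, -0.2, 0.0, 0.02, 0.0]
```

With these example coefficients, a base reported at quality 40 gets a much lower recalibrated error rate than one reported at quality 10, and both always stay strictly between 0 and 1.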
And I ask: what's the probability of seeing this data here, given that the genotype is A, the error rates that I have for all my base pairs, and all the PMD probabilities that I have for my base pairs, okay? And then the product of this, multiplied by the probability that the genotype is actually A, right? And then the sum over all possible genotypes just integrates out the genotypes. And then I wanna know which coefficients beta of my linear model maximize this likelihood, okay? So I try to find the transformation of quality scores that best explains the data in a region in which I do not expect to see any polymorphic sites, but without actually making any assumption about what the reference base is, okay? So this is the basic idea, and we use an EM algorithm to work that out; it runs relatively fast. Does it work? This is again from simulated data, where we distorted the qualities using four coefficients: the first one distorting using the actual quality score of the machine, one where you take the square of that (so that if you have nonlinear effects, you can capture them with a square term), and the position as well as the position squared, okay? In practice, we also add all the different sequence contexts that you can imagine as additional factors. So we then ran our toolbox to actually estimate these coefficients from simulated data, again within a one-megabase window for different sequencing depths. And it seems we need a little bit more data than for theta, but if you have something like 2x coverage in a one-megabase window, it seems to be possible to estimate these coefficients quite nicely. Or if you have lower depth at 1x (this is data not shown), you need about 10 megabases of data; then you're on the safe side and you can estimate these coefficients very, very reliably, okay?
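The haploid likelihood being maximized can be sketched like this (again an illustrative simplification: each read's error rate is taken as given, where in the real model it comes from the logistic transformation of the covariates, and PMD terms are omitted for brevity). One would then sum these log-likelihoods over all monomorphic sites and maximize over beta, e.g. with the EM algorithm mentioned above.

```python
import math

def haploid_site_loglik(reads, base_freqs):
    """Log-likelihood of one haploid site (e.g. X-linked in a male).
    reads: list of (observed_base, error_rate) pairs.
    Integrates out the unknown allele g:
      P(reads) = sum_g P(g) * prod_reads P(base | g, eps),
    with no assumption about which base is the reference."""
    total = 0.0
    for g in 'ACGT':
        lik = base_freqs[g]
        for base, eps in reads:
            lik *= (1.0 - eps) if base == g else eps / 3.0
        total += lik
    return math.log(total)

uniform = {b: 0.25 for b in 'ACGT'}
```

Reads that all agree are far more likely under this monomorphic model than conflicting reads, which is exactly the signal that lets the fit decide which quality transformation best explains the data.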
So we applied this to our Greek samples, and this is what we get. This is the quality transformation table, where on the x-axis we have the quality score reported by the machine, on the y-axis the one that we get after our recalibration, and the colors here represent density. And what you see is that for all three samples, the quality scores given by the machine are actually off, right? The machine tells you, for instance here, the quality score is 40, but the dashed line here is where both qualities are the same, and all these points are actually below it. So the high-quality sites in particular are actually not at all as high quality as the machine suggests, okay? So we needed to recalibrate that. You also see that the pattern is quite complex; it's not just a simple line. That's also because there are lots of different libraries sequenced on many different lanes and even different machines, and of course for each library and each lane, we had to recalibrate separately, okay? We cannot just recalibrate pooling all the data together; we recalibrated for each library and each run independently. That's why we get these complicated pictures. But anyway, we see a clear difference in these quality scores. And if we then use these recalibrated quality scores and now infer genetic diversity for our samples, things suddenly start to make a lot of sense, okay? Here is just an example. We of course downloaded all the other available ancient data as well and started to play with it, especially all the males, because we know this works for males, okay? And I should say, we have some strategies for females now: we use ultra-conserved elements, and we also tried mtDNA. You just need pieces of the genome where you believe there's no polymorphism, right? And that seems to work. So far we mostly have males, so we're quite happy with this.
We still have to experiment a little with how to handle females, but we're quite confident that ultra-conserved elements should provide sufficient data to run the same strategy there, too. So here I show you in the first panel a trace of heterozygosity, of diversity, along the genome; it's just a piece of chromosome one. And along the genome, for different samples, we have TSI2 and GBR2; these are just two male samples from the 1000 Genomes project. And in comparison, we have KK1 and Bichon; these are two hunter-gatherer samples, one from a Swiss cave, and this one here, KK1, from the Caucasus region. So what we see is the trace here along the genome. Interestingly, the two modern samples fall almost on top of each other, right? It's not only that if you take a British sample and TSI, a Tuscan sample, they have roughly the same genome-wide average heterozygosity, shown here as a dashed line; even if you just follow megabase by megabase through the genome, these two samples are almost identical in their diversity. However, interestingly, the hunter-gatherer samples here are a little bit different. They have genome-wide slightly lower heterozygosity, which you would expect given the lower population size, and their traces look a little bit different, too. We then ran this on a bunch of samples. So we have here three modern samples from Italy and from Britain; we have BR2, a Bronze Age sample; we have the two hunter-gatherers; then we have three modern African samples and an ancient African sample here, the Mota sample. What we find is shown here as the distribution of heterozygosity in the windows. We see that the Bronze Age sample seems to have pretty much the same diversity as a modern European; that's kind of expected. The hunter-gatherers have slightly lower diversity. We see that Africans have higher diversity than Europeans; that's well characterized.
Interestingly, this ancient African also exceeds modern Europeans in diversity. We can then ask how correlated the distributions of diversity in the genome are. So we just go along the genome with the trace that I showed in the first picture, and we then compute correlations between samples. Here we use the Spearman correlation, and we turn the pairwise correlations into a tree, shown here. What we see is that the modern African samples and the modern European samples each cluster together quite nicely. Basal to the Africans we have the Mota sample, and we have the Bronze Age sample basal to the modern Europeans, suggesting that they are not only very similar to modern samples in their overall diversity, but also in their distribution of diversity along the genome. However, the two hunter-gatherer samples are really basal to all of this. They have a very, very different pattern of diversity in the genome. And interestingly, even between them, the difference is about as big as between either of them and any of the other samples, suggesting again that these two hunter-gatherers actually come from completely different groups. It's not just that they are two old samples; they represent completely different groups. Maybe that's not so surprising, one from Switzerland, one from the Caucasus region; that's quite far apart, right? But it's kind of interesting that we can pick this up. Okay, so this was our confirmation; things seem to work quite nicely. Of course, we can then use the very same kind of model that we developed for this heterozygosity estimation also for genotype calling. So we also implemented a genotype caller for ancient DNA that takes post-mortem damage and everything into account. If we do that using our MLE caller, we see that we can easily outperform GATK, even if GATK is combined with mapDamage, another tool to deal with PMD-affected sites, as we did here in the simulation.
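The correlation step is straightforward: the Spearman correlation of two heterozygosity traces is just the Pearson correlation of their ranks, and 1 minus the correlation can serve as a pairwise distance for tree building. A self-contained sketch (the traces you would feed in are the per-window heterozygosity values; the tree-building step, e.g. hierarchical clustering, is not shown):

```python
def ranks(xs):
    """Average ranks, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

From the matrix D[i][j] = 1 - spearman(trace_i, trace_j), any standard distance-based method then yields the kind of tree shown on the slide.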
You see, this is the amount of errors that we make, here as a solid line, okay? This is for ATLAS, our tool (Analysis Tools for Low coverage and Ancient Samples, as we call it), and this is for GATK. Interestingly, our caller does not seem to have a reference bias: if you compare our line here for reference-reference homozygous calls with the one for alternate-alternate calls, we see exactly the same pattern, while GATK clearly has a reference bias. You make many fewer mistakes if the true genotype is reference-reference than if the true genotype is alternate-alternate. So the only way in which GATK beats us here on reference-reference calls is by calling too many things reference-reference. You also see this in the number of sites at which the tools make no call at all: GATK has a rather large number of sites at which it just does not produce a call, while we are willing to call a lot of sites. That may lead to a slightly higher error rate in the reference-reference case, but our error rate is lower in the other cases, even at low coverage, okay? So clearly, for ancient samples, it's easy to outperform GATK. And for those of you using GATK, just be aware (we're not the first ones to notice this) that it has quite a strong reference bias: most of your errors are towards the reference. So now that we have genotype calls, diversity estimates and everything, let's go back to the biological question we started with, right? If we apply all these tools to our ancient data, what can we learn about how farming spread from the Fertile Crescent to Europe? Okay, this is the slide that should have been at the very beginning; it's here because I stitched together LaTeX and other slides, a bad habit. But anyway, let's jump to the PCA. This is a PCA we did on our called genotypes, at the SNPs for which we also have a lot of modern samples available.
So this is a reduced number of sites. This is a PCA run on all modern samples, and we have different groups. You see quite nicely, as is well known especially for Europe but really for most regions, that if you do a PCA on genotypes of modern samples, you pretty much get a map, because isolation by distance is the predominant factor explaining differences between humans, okay? We have the Sardinians down here, the Basques, Southern Europe, Central Europe, the Brits, Slavic people, okay? They kind of line up all around here. So here we have the Mediterranean, and then it goes towards the east. Now the question is where our ancient samples fall. We can project the ancient samples onto this. We have far fewer SNPs for the ancient samples, and all ancient samples have different sets of SNPs, so you need a way to actually project them onto the map, a so-called Procrustes analysis. And if we do that, okay, these are all the samples that existed before our study. We have the European foragers showing up here, and we have the early farmers down here, again confirming that these were really two distinct groups, right? But that we know already. The question is now: where do our Aegean farmers fall? And they fall right into the group of early European farmers. So we can say that, genetically, the very first farmers that showed up in Greece from the Middle East are the direct ancestors of the very first farmers in the rest of Europe. So we can trace back the genetic ancestry of the farmers at least to the Aegean region. We see that there were hunter-gatherers living in Europe and people moving in; the people that we see establishing new farming settlements in Central Europe or Western Europe are genetically identical to the very early farmers we see in the Aegean region. But now, can we trace that even further back, to the Fertile Crescent?
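The projection idea can be illustrated with a tiny 2D Procrustes fit: find the rotation, scaling, and translation that best map one set of PC coordinates onto another, which in two dimensions has a neat closed form using complex numbers. This is a toy sketch of the principle only, not the actual procedure used in the study:

```python
def procrustes_2d(src, dst):
    """Least-squares similarity transform dst ~ a*z + b, where points
    are complex numbers z = PC1 + 1j*PC2. The complex factor `a`
    encodes rotation and scaling, `b` the translation."""
    n = len(src)
    ms = sum(src) / n
    md = sum(dst) / n
    num = sum((d - md) * (s - ms).conjugate() for s, d in zip(src, dst))
    den = sum(abs(s - ms) ** 2 for s in src)
    a = num / den
    return a, md - a * ms

def project(points, a, b):
    """Apply the fitted transform, e.g. to place ancient samples
    onto the PCA map of the modern reference panel."""
    return [a * z + b for z in points]
```

In practice one would fit the transform on samples that have coordinates in both PCA spaces (the full-SNP and the reduced-SNP one), then apply it to the ancient samples' coordinates.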
The answer is no, because our first Iranian farmer, the one sample we have from Iran, shows up here, in a completely different place, suggesting that the Iranian first farmers are genetically not related to the Aegean first farmers. We can look at that further using a different type of analysis. This is an analysis done by Garrett Hellenthal, a collaborator of ours, on our calls, running a tool called ChromoPainter. These are our two ancient samples: Bar8, our highest quality sample from the Aegean, and WC1, our highest quality sample from Iran. And for both of these samples, you try to explain the haplotype structure you see in that sample with a model in which you copy haplotypes from modern populations, okay? So for Bar8 here, you go along the genome and stitch together the best matching haplotypes at every location from modern populations, okay? It's kind of an admixture-type analysis. All these round dots are modern populations, and their size and color represent how much of the haplotypes is copied from that population, okay? And you run this with all populations at once, once for Bar8 and then everything again for WC1, okay? And then what we see here is the relative contribution of the populations to these samples. What we find is that our ancient Aegean sample seems to be, in terms of haplotypes, not only individual sites but haplotypes, extremely similar to modern Europeans, because you can explain the chromosome best by stitching together haplotypes seen all over Europe, okay? On the other side, our Iranian sample is best explained by stitching together haplotypes we find further east, mostly in India, in Pakistan and in modern Iranian populations. So what does this suggest? It suggests that the Iranian farmers are actually not the direct ancestors of our Aegean, and therefore European, farmers.
They may be the direct ancestors of the farmers that spread towards the east. But these two groups, the Aegean farmers and the Fertile Crescent farmers, are genetically distinct. They're different. So what does that mean? We're not really sure. Does it mean that there were multiple groups developing farming together? Does it mean that these farmers here learned, or kind of picked up, farming from these other people? We're not really sure; we think the latter sounds quite plausible. So I think we can conclude that the spread of farming from the Fertile Crescent to Europe involved two different factors. First, farming most likely spread in some way by cultural diffusion, be it that hunter-gatherers learned to farm from other farmers, or be it that there was a bunch of different groups developing farming together and exchanging ideas; we don't know. And then, once these people that had started to farm reached European shores, they spread demically and colonized the rest of Europe. So ancient DNA tells us that the spread of farming into Europe is complex, and it depends on which region we look at. From the Aegean into Europe, it's mostly by the replacement of people, but from the Fertile Crescent to Europe it's a much more complicated story that involved, at least to some degree, cultural diffusion as well. Okay, so let me sum up some conclusions. First, some general conclusions that I think we can learn from working with this type of data and maybe apply to other data. First of all, errors are an unavoidable part of NGS data. All of you working with NGS data know that, and we have to find a way to deal with them. Just ignoring them seems not to be a very clever strategy, as we've seen, at least in this case. Filtering introduces a lot of biases, and if you're interested in genetic diversity, it usually introduces a bias towards too little diversity.
We're only willing to accept heterozygous or alternative sites when we're really sure, which means we have too few of them and we underestimate diversity. The trick, and this is what most people in population genetics are moving to, is to avoid calling genotypes altogether and rather use tools that estimate what we're interested in by integrating over the genotype uncertainty. If we do that, we have the benefit of being able to work with low coverage data. And if we work with low coverage data, that means we can invest in more biological samples rather than more reads of the same individual: investing in biological replicates rather than technical replicates, as I said before. However, whenever we do this kind of hocus pocus, we require that the quality scores we're working with are actually reliable. We can only integrate over genotype uncertainty if our uncertainty about the genotypes is properly estimated itself. And we learned the hard way here that recalibration is super important when working with low coverage data, to the degree that our diversity estimates were pretty much useless without recalibration. Now, applying all this logic to the ancient DNA case: I presented some tools that we implemented in ATLAS, the Analysis Tools for Low coverage and Ancient Samples that we developed. It's a toolbox specifically designed to work with ancient DNA. Ancient DNA has two major issues to deal with, low sequencing depth and post-mortem damage, and both have to be properly taken into account. I showed you how we developed a reference-free base quality score recalibration; I think that might also be interesting for people working with non-model organisms, for instance. And with these tools we can infer the genetic diversity of individuals accurately even if they have sequencing depths as low as 1x. If you think about it, that's quite remarkable.
You have, on average, one single copy per site, but you can still tell how many sites are heterozygous in the region, just by integrating over all the uncertainty. And in the end, if we use all these tools, we also arrive at unbiased genotype calling: we don't bias our genotypes towards the reference nor towards modern samples, and only this allows us to properly compare ancient samples with modern samples. Applying these tools to our samples of Aegean and Iranian first farmers, we first found that we can trace the ancestry of the European first farmers directly back to the Aegean, meaning that people from the Aegean spread into Europe, bringing farming with them. However, this direct genetic line breaks somewhere between the Aegean and the Fertile Crescent; we don't know really where, but it suggests that there was at least some cultural diffusion involved in the spread of farming. With that, I would just like to thank the many people involved in this work: first the people from my group, Athanasios Kousathanas and Vivian Link, a postdoc and a PhD student who worked a lot on this, and my long-term collaborator Chris here, a mathematician I collaborate with a lot. Then the leading people from the palaeogenetics group in Mainz who generated all the data; this group is led by Joachim Burger. Zuzana prepared all the samples in the lab for the Aegean project and was also heavily involved in many of the population genetic analyses. Farnaz took the lead on the Iranian sample, including two years of lab work to actually get that done, and then some of the analyses there. Christian is the embedded bioinformatician in the group, whom we collaborate with a lot. And then a large number of people were involved in these projects: archaeologists digging out the samples and helping us with the interpretation, and lots of people running analyses to make this work.
Okay, and then thank you all for your attention, I'm happy to take any questions if you have some. Thanks.