The topic of this first session is going to be the Bayesian foundations of phylogenetic and phylodynamic inference. There's not so much about R in this talk; it leans more on the Bayesian inference side, and on a particular application of Bayesian inference that our research group here at the BSSE at ETH is heavily involved with. So it's okay if you don't understand all of these words yet. I'm sure at this point in the course you understand "Bayesian"; "phylogenetic", probably a lot of you do; "phylodynamic", maybe fewer of you do. But introducing these concepts is the purpose of this talk.

Before I do, I just want to describe the outline. The talk is structured so that in the first hour I'm going to cover, or attempt to cover, four initial points: we'll talk about phylogenetics, then phylodynamics, then how we use Bayesian inference to wrap this all together, and finally we'll discuss a particular application of these methods. In the last half hour, if we have time, I'd like you to be able to spend some time at least starting a BEAST2 tutorial, where you'll get a chance to actually try out some of the software I'm talking about today. And as Patricia mentioned, please feel free to jump in at any point with your questions.

Okay. I'm going to warm into this very slowly, which I guess is appropriate first thing in the morning, and apologies if this is not new for some of you, but just so that everyone's on the same page, I'm going to start right at the beginning with phylogenies and sequence evolution. (I'm just going to move some of my windows out of the way so I can see what I'm talking about.)

So, going right back to the beginning: what is a phylogenetic tree? If you type "phylogenetic tree" into Wikipedia and read the first paragraph of the article that comes up, you'll find something like this: a phylogenetic tree is a branching diagram showing the inferred evolutionary relationships among various biological species or other entities, based on similarities and differences in their physical or genetic characteristics. That "based on similarities and differences" means a phylogenetic tree is usually not something we directly observe. It's something we have to infer from data, namely those similarities and differences in physical and genetic characteristics. The process of connecting these things to an actual tree is the topic we're addressing here.

Now, this concept of a tree in biology goes back a very long way. If you go right back to the beginning, you see diagrams where people ordered all living things, starting from the simplest organisms right up to complex humans, with a divine being at the top. Much more recently, although from our standpoint still a long time ago, around the time of Darwin but just before him, people were already starting to organize species into tree structures. This figure is a beautiful depiction of the relationship between the appearance of species and the various geological time periods, by Edward Hitchcock, who incidentally was a critic of Darwin's theories and removed these figures from his textbook once they started to be regarded as evolutionary trees. But regardless, the idea existed.
Of course, Charles Darwin was the first to actually regard these branching patterns as depicting an evolutionary process in which species transition from one to another. The first known depiction of what we would now call a phylogenetic tree comes from one of Darwin's notebooks. It's a very rough sketch: you can see a tree relating some token species A, B, C, D at the leaves, and the tree is meant to represent the idea that some of these species are more closely related to one another than others. Note that time is not shown in this figure, in contrast to the figure on the right-hand side, which is the only figure in Darwin's On the Origin of Species. It again displays a branching pattern, but here the vertical axis has time increasing up the page. You can follow these lineages, and at each of these points there's a burst of speciation, with lots of different species being produced and only some of them surviving to the present.

And this brings us to what we now consider a phylogenetic tree. This more modern depiction over on the right, still very schematic, shows the relationship between humans, chimps and other primates. These trees, as the Wikipedia article said, are inferred from characteristics of the entities they relate. These characteristics are often genetic, but they don't have to be; they can also be, for instance, phenotypic. Phenotypic data is very important to include if you're considering trees derived from fossils, because we don't often have DNA associated with fossils, but we can measure certain phenotypic characters on them and use these to reconstruct the phylogeny as well.

One thing that's very important to point out, even at the beginning, is that phylogenetic trees don't just occur at one level. In the context of macroevolution we can consider phylogenies relating species, which is what we've been discussing so far. Here we have this larger tree, if you ignore the thin lines inside it: a species tree whose tips, A, B, C, D and E, are five different species related by the tree. However, from each of these species one can imagine collecting some genetic data, say a marker gene sampled from multiple individuals belonging to that species, and then using those sequences to infer the phylogenetic relationships between the individual gene samples. Those relationships are what the thin lines within the fatter lines display, and this is what we call a gene tree.

The important thing to note is that while the species tree does constrain the shape of the gene tree, in the sense that you can't have a common ancestor between a lineage in B and a lineage in A before the corresponding speciation event, it doesn't completely determine the shape of the gene tree. You can see there are a lot of differences between the shapes of these gene trees and the shape of the species tree. And there's a really nice analogy between this nested tree structure and what we see in pathogen evolution during an epidemic: we can talk about a transmission tree describing the relationships between infected hosts within an epidemic.
And that's somewhat analogous to the species tree. You can also talk about the pathogen phylogeny derived from sequencing the pathogen genomes, and that would be the analogue of the gene tree.

Another important thing to point out at the beginning is the so-called neutrality assumption. In statistical phylogenetics as it's commonly applied, the assumption is that the shape of the tree affects the distribution of characters, but not the other way around. To visualize this, consider this tree here, never mind the details, together with the sequences associated with each of the leaves. These are, say, our present-day samples, and this is the phylogenetic relationship between these individuals, going right back to a common ancestor at the root. The way we assume this data could have been generated, and of course it wasn't actually generated this way, is that the tree existed first; we then choose a random sequence to place at the root and evolve it down the edges, according to some process we'll talk about shortly, until it reaches the leaves. The important point is that the sequence then has no say in the shape of the tree. The shape of the tree determines the sequences we observe, but the speciation rate, or the birth rate along lineages in this tree, is not determined by the particular sequence being carried. Of course, that's not true in reality: in biology the sequence plays a very important role in determining the fitness of an individual. But the point is that we can still make headway with this assumption, provided we steer clear of sequences that code for things under selection. So it's possible to approximately satisfy this assumption with real data, even though it's not true in general.

So if we have these sequences, the phylogenetic inference problem is to take them and perform some kind of clustering to produce a tree. You can do this in several different ways; treated as a crude clustering problem it's quite easy, and there are some very crude approaches to it. But the result would just be a diagram giving an ordering of mergings between the different samples; we wouldn't have much insight into what the lengths of the branches mean, or into the uncertainty associated with the overall topology of the tree. So the question is: can we do better than this? That is the topic we'll be discussing.

The way we do better is to take a more model-based approach. We consider, again, a tree like this, and we imagine there's some arbitrary sequence assigned to the root, the first lineage of this tree. This first sequence evolves according to some process described by this mysterious Q variable here, and that gives us our sequences at the tips. Now, what is this Q? It's really not that complicated. The sorts of models we choose are very basic continuous-time Markov chain models: models in which a given sequence undergoes discrete changes at random times, and the rates of all these changes depend only on the current state of the sequence.
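To make this concrete, here's a minimal simulation sketch of such a continuous-time Markov chain acting on a single site, written in Python. The rate scheme here is made up purely for illustration; it is not one of the named substitution models we'll meet in a moment.

```python
import random

# Purely illustrative 4x4 rate scheme (not one of the named substitution
# models discussed below): for each current state, the rates of changing
# to each alternative state.
RATES = {
    "A": {"C": 0.3, "G": 0.5, "T": 0.2},
    "C": {"A": 0.3, "G": 0.2, "T": 0.5},
    "G": {"A": 0.5, "C": 0.2, "T": 0.3},
    "T": {"A": 0.2, "C": 0.5, "G": 0.3},
}

def evolve_site(state, t_end, rng=random):
    """Evolve a single site until time t_end. The waiting time to the
    next change and the choice of new state depend only on the current
    state, which is exactly the Markov property."""
    t = 0.0
    while True:
        total = sum(RATES[state].values())
        t += rng.expovariate(total)            # time of the next change
        if t > t_end:
            return state
        r = rng.uniform(0.0, total)            # pick the new state
        for new_state, rate in RATES[state].items():
            r -= rate
            if r <= 0.0:
                state = new_state
                break

print(evolve_site("A", t_end=2.0))
```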
So it's Markovian. The way we can set this up, and I know it's a little early in the morning to start discussing maths, but I'm sorry, this is the way it's going to be, is as follows. Let p_t^i(x) be the probability that the nucleotide at site i has state x at time t, where x can be G, T, C or A. The time development of this probability distribution over the possible characters at this particular site is determined by the transition rate matrix Q. And I apologize for the next little bit, but it's really not as bad as it seems: the probability distribution at time t + dt, so after some tiny time increment, is given by a matrix product, the product of a matrix with the probability distribution at the earlier time, p_{t+dt} = (I + Q dt) p_t. The matrix I + Q dt is the transition probability matrix over that increment, and the little matrix Q is the transition rate matrix. It's composed simply of the individual rates of transitioning from any starting character to any ending character per infinitesimal time unit: for instance, Q_GA is the rate of changing from G to A, Q_GC from G to C, Q_GT from G to T, and so on. Don't worry so much about the diagonal elements; they're just there to make the maths work out (each row sums to zero). For the most part, the only difference between the various substitution models we consider in statistical phylogenetics is the transition rate matrix; they all have this structure. And usually the average rate of substitution, often called mu, is factored out of the rate matrix, so we usually think of time as being expressed in units of the expected number of substitutions. Just remember that you can ask questions at any time.

Okay, so the simplest model of this kind is the so-called Jukes-Cantor model. It's the simplest because it assumes that the transition rate between any pair of character states is the same. That's why we have all these one-thirds: if we start from character G, the possible changes are to T, C or A, so there are three possibilities, each with a normalized rate of one third. It's a rather boring model in the sense that it doesn't include much biological detail. In particular, there's no difference between the rates of transitions and transversions: the rate of change from G to A is the same as from G to C, which is not what we want in a biologically reasonable model. The other thing about this very simple model is that its equilibrium distribution is uniform. What this means is that if you imagine a single site evolving according to this model for long enough, then even if you start with a known character at that site, eventually you're equally likely to find it in any of the possible states. That might not seem like much of a problem, but if you consider real sequence evolution and count the numbers of G's, T's, C's and A's across a real sequence, you'll generally find a bias towards particular characters. It makes sense to encode that sort of information in the substitution model, and that's not possible here.
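Because the Jukes-Cantor rate matrix is so simple, you can check these claims numerically. Here's a small sketch, assuming NumPy and SciPy are available, that builds Q with the mean rate factored out and computes the transition probability matrix P(t) = exp(Qt); for large t, every row approaches the uniform distribution (0.25, 0.25, 0.25, 0.25).

```python
import numpy as np
from scipy.linalg import expm

# Jukes-Cantor rate matrix with the mean rate factored out, so that time
# is measured in expected substitutions per site: off-diagonal rates of
# 1/3, diagonals chosen so every row sums to zero.
Q = (1.0 / 3.0) * (np.ones((4, 4)) - 4.0 * np.eye(4))

for t in [0.01, 0.1, 1.0, 10.0]:
    P = expm(Q * t)       # transition probability matrix P(t) = exp(Qt)
    print(t, P[0])        # probabilities of ending in A, C, G, T from A
```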
But despite this, the Jukes-Cantor model is a useful toy model because it lets us think a little about the way sequence evolution happens. In particular, we can easily compute things like this quantity p_diff: the probability that a nucleotide, evolving over a certain amount of time t, winds up in a different state from the one it started in. Keep in mind that this is not the probability that the site undergoes a substitution or a mutation; it's only the probability of ending up in a different state. So if we start with G, it's the probability of the ending state being C, T or A. If we plot this, we see this red line here. It starts at zero: immediately after the start we're almost certain to have remained in the same state. It then increases, linearly at first, but then you can see it asymptotes to a value, and this value happens to be 0.75. This is basically saying that after a long enough time you're equally likely to be in any of the four characters, so your chance of being different from the starting state is 75%, because there's a 25% chance that you wind up, just by chance, in the same character you started with.

This is important for thinking about inference, because these differences are the raw material we use to infer phylogenies. If we're somewhere near the start of this curve, we can take the differences between the characters we observe and use them quite reliably to estimate the evolutionary time separating a pair of sequences. If, on the other hand, we're looking at a pair of sequences that are both at saturation, up here, then it's going to be very difficult to reliably infer the time between them. A basic way to infer this time is to use this estimator, which is just a rough-and-ready, nothing Bayesian here, method-of-moments sort of estimator: we take the observed fraction of differing characters and turn it back into a time t (a small sketch of this follows below). And this becomes numerically unstable if you try to measure the evolutionary time separating sequences that are at this so-called saturation.

Now, as promised, that was the simple example, the Jukes-Cantor model. There are much cleverer substitution models out there. They all still fall far short of biological reality, but they're at least a little closer. The so-called HKY substitution model, named after its authors, ticks many of the boxes that Jukes-Cantor didn't. It allows for non-uniform equilibrium base frequencies: if you run this model for long enough, you reach an equilibrium distribution that is parameterized by the pi's appearing in the substitution rate matrix. It also includes a transition/transversion rate ratio, kappa: by setting kappa not equal to one, we allow a difference in rates between transitions and transversions. These two things arguably make it a sensible model of neutral DNA evolution, at least one that's commonly used for phylogenetics.
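For the Jukes-Cantor model, p_diff has the closed form p_diff(t) = (3/4)(1 - exp(-4t/3)) when time is measured in expected substitutions per site, and inverting that gives the estimator just described. A minimal sketch, showing how the estimate blows up near saturation:

```python
import math

def jc_distance(p_diff):
    """Invert p_diff(t) = 3/4 * (1 - exp(-4t/3)) to recover the
    evolutionary time, in expected substitutions per site, from the
    observed fraction of differing sites."""
    x = 1.0 - (4.0 / 3.0) * p_diff
    if x <= 0.0:
        raise ValueError("at or beyond saturation; distance undefined")
    return -0.75 * math.log(x)

print(jc_distance(0.10))                     # stable, well below saturation
print(jc_distance(0.70), jc_distance(0.74))  # near 0.75 it blows up
```

Near p_diff = 0.75, tiny changes in the observed fraction of differences translate into enormous changes in the estimated time, which is exactly the instability described above.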
Going beyond the HKY model, there are more general models. A commonly used one is the so-called GTR model, the general time-reversible model, and it is exactly that: the most general time-reversible model. Time-reversibility is a technical property that is sometimes desirable for actually performing inference, but I won't go into it here. This model has ten parameters, nine if you divide out the average substitution rate, and all the other time-reversible models are special cases of it: HKY is a special case, and Jukes-Cantor is a special case.

So, going back to this abstract picture of a tree: we have our starting sequence that evolves down the tree, and we have the sequences we observe at the leaves. These sequences constitute our alignment. One thing to point out here is that we've omitted any discussion of insertions and deletions in this process of molecular evolution, and for this reason, when we talk about our sequence data, we're only ever really talking about an aligned set of sequences; we assume the alignment has already taken place. If we call our tree T, then, given this transition rate matrix and our understanding of what it means, we can compute the probability of our sequence alignment given the tree, the transition rate matrix and the equilibrium parameters. This is actually very easy to compute, in linear time, using dynamic programming; the special algorithm that does exactly this is known as Felsenstein's pruning algorithm (there's a small sketch of the idea just below). It integrates over all the things we don't know: we don't need to know the sequences at the ancestral nodes, although maybe we'd like to, in order to compute the probability of the alignment we observe given a tree. That's very handy, and very important, because this probability is the basis for Bayesian phylogenetic inference: it is our likelihood.
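To give a flavour of the pruning algorithm, here is a minimal sketch on a made-up three-taxon tree under the Jukes-Cantor model from earlier. This is just the idea, not BEAST's implementation; the tree, branch lengths and taxon names are invented for illustration.

```python
import numpy as np
from scipy.linalg import expm

# Jukes-Cantor rate matrix, as in the earlier sketch.
Q = (1.0 / 3.0) * (np.ones((4, 4)) - 4.0 * np.eye(4))
IDX = {c: i for i, c in enumerate("ACGT")}

# A toy rooted tree, (tA:0.1,(tB:0.2,tC:0.2):0.1) in Newick notation.
# Each node is a (branch_length, children) pair; a leaf's "children"
# entry is just its taxon name.
TREE = (0.0, [(0.1, "tA"), (0.1, [(0.2, "tB"), (0.2, "tC")])])

def partial_likelihood(node, site):
    """Felsenstein pruning: for each possible state at this node, return
    the probability of the observed leaf characters below it."""
    _, below = node
    if isinstance(below, str):            # leaf: indicator vector
        L = np.zeros(4)
        L[IDX[site[below]]] = 1.0
        return L
    L = np.ones(4)                        # internal: product over children
    for child in below:
        P = expm(Q * child[0])            # transition probabilities P(t)
        L *= P @ partial_likelihood(child, site)
    return L

def site_likelihood(site):
    # Sum over root states, weighted by uniform equilibrium frequencies.
    return float(np.full(4, 0.25) @ partial_likelihood(TREE, site))

# The probability of the whole alignment is the product over sites,
# computed without ever enumerating the ancestral sequences.
alignment = [{"tA": "A", "tB": "A", "tC": "G"},
             {"tA": "C", "tB": "C", "tC": "C"}]
print(np.prod([site_likelihood(s) for s in alignment]))
```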
Does anyone have any questions on this section? Okay, I'll go on. So that was phylogenetics; that's really all I'm going to discuss on that topic. Phylodynamics takes this a step further. It's a pretty buzzwordy term, but it's been around for a long time now, so I can't feel too bad for using it. It was introduced by Grenfell and co-authors in 2004, in the context of epidemiology; in their paper they describe it as the interplay between immunodynamics, epidemiology and evolutionary biology, and the effect this interplay has on the shape of pathogen phylogenies. That introduces even more buzzwords, but in the end it's a fairly simple concept. The idea is that population-level processes, be it an epidemic, or population dynamics in some population of animals, or the number of species varying through time, affect the shape of the tree. For instance, if we have a population that's growing exponentially, we might expect to see a certain kind of tree, with long pendant edges like this. If the population size is constant, we get a slightly differently shaped tree. If there's structure in the population, this can also affect the tree, and if there's some non-neutral evolution going on, some selection, this affects the shape of the tree too. The idea behind phylodynamics is to exploit this relationship between the shape of the tree and what's going on at the population level: to use inferred phylogenetic trees to further infer properties of the populations within which they're embedded.

Just some more examples of this. Here we see several different pathogen trees: a measles phylogeny, an influenza A phylogeny, and a dengue virus phylogeny. The point is that these all have qualitatively different shapes; the overall pictures are clearly different. In particular, the ladder-like tree structure for influenza is very well known and very reproducible, as is the structure for dengue. So there's something going on here, something we can learn about just from the shapes of these trees, and phylodynamics is how we aim to do this in a quantitative fashion. Basically, the question phylodynamics seeks to answer is: what does a phylogenetic tree tell us about a population? It places the emphasis on the population itself rather than on the subset of individuals that happen to be involved in a particular phylogeny. This is a really important point. When you're studying phylogenies of species, the actual phylogeny is often of real interest in itself: you really care whether humans and chimps diverged at one point in time or another. But if you're looking at, say, the phylogeny of a bunch of SARS-CoV-2 samples taken from an epidemic, the particular tree you infer from those sequences is not necessarily of interest. It's an ephemeral thing; it will be different for every COVID outbreak you look at. What you are interested in is the population-level process that generated the tree, and that's where phylodynamics comes in. Of course, "population" can mean a lot of different things: we've talked about viruses and animals, but the individuals can also be species. This is very broadly applicable.

How do we do this? To infer characteristics of a population from a tree, we need to model the tree-generation process. Just as we modelled the process of sequence evolution down the tree when we wanted to connect sequences to a tree, we model the process of generating the tree from a population when we want to connect the tree to a population. There are two main families of models that we will discuss. The so-called birth-death sampling models assume the tree is the result of a population evolving under a forward-in-time birth-death model with an explicit sampling process. Coalescent models are more retrospective: they assume the tree is the result of a backward-in-time process that successively merges pairs of lineages, and they relate the rate of that merging to the population size.

To briefly introduce these two: a birth-death sampling model you might depict this way. This box represents a single compartment of individuals, a population. These could be infected hosts, which is why I've written I here, but they could also be species. Members of this population undergo several different kinds of events: birth events, which occur at rate lambda and add new individuals to the population, and death events, which remove individuals at an overall death rate mu (not to be confused with the mutation rate). We can also say that a subset of the individuals that are removed are sampled, that is, they become our samples, with probability s. There are many different kinds of birth-death sampling models; this is just one. We also have the length of the process defined, so you can simulate such a process (a minimal simulation sketch follows).
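Here's a minimal forward-in-time simulation sketch of the birth-death sampling process just described. It only tracks event counts; a real simulator would also record the genealogy, which is what lets you build the tree in the next step.

```python
import random

def simulate_birth_death(lam, mu, s, t_end, rng=random):
    """Forward-in-time birth-death sampling process: each individual
    gives birth at rate lam and is removed at rate mu; each removed
    individual is sampled with probability s. Only event counts are
    tracked here; building the tree also requires recording who
    descends from whom."""
    t, n, sample_times = 0.0, 1, []
    while n > 0:
        t += rng.expovariate(n * (lam + mu))   # time to the next event
        if t > t_end:
            break
        if rng.random() < lam / (lam + mu):    # a birth...
            n += 1
        else:                                  # ...or a removal,
            n -= 1
            if rng.random() < s:               # possibly sampled
                sample_times.append(t)
    return n, sample_times

n_final, samples = simulate_birth_death(lam=2.0, mu=1.0, s=0.3, t_end=5.0)
print(n_final, len(samples))
```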
And if you keep track of all the birth events, the death events and the sampling events, then you can build a phylogenetic tree from those samples. You'll have to take my word for this, but this kind of process allows us to compute the probability of a tree given the parameters we've discussed: the birth rate, the death rate, the sampling probability and the length of the process.

On the other hand, we have the coalescent models. As I said, these are more directly related to the population size. One way to picture the coalescent is to start with a basic Wright-Fisher model. We have some individuals here belonging to one generation of a population; the total number of individuals in this population is very small, just the number of circles on this line: one, two, three, four, five, six, seven, eight. Three of these have been sampled: the coloured circles. Now consider the previous generation. In this model we assume the population size is fixed, it doesn't change, so we've got the same number of circles in the next row, corresponding to the previous generation. The question becomes: what is the probability of a merging between any pair of these coloured individuals? In other words, what's the probability that any pair of them share a common ancestor in the previous generation? Under the Wright-Fisher model, we relate individuals to their parents in the previous generation simply by letting each child sample its parent uniformly at random from that generation. This is making assumptions about how well mixed the population is, random mating, et cetera; it really is a very basic model. But that's the assumption, and given it, we can approximate the probability of a so-called coalescence, one of these common ancestors being found in the previous generation, as the number of pairs among our sampled lineages multiplied by 1/N. That's because the probability of any two individuals in generation i sharing a common ancestor in generation i-1, under this assumption of random mating, is just 1/N. This is an approximation: it ignores the possibility that one parent could be the common ancestor of all three individuals. But when the population size is much, much bigger than the number of sampled lineages, that probability becomes essentially zero and the approximation becomes very good.

With these probabilities you can then simulate a tree, starting from the sampled individuals and going up the page. Overall, the rate of coalescence is the number of pairs among the lineages present at a particular time, multiplied by 1/N. So the population size appears in the rate of coalescence as we go back in time. Once we make a large-population assumption, and assume we're operating on time scales where time can be treated as continuous rather than as discrete generations, we can use these rates to write down the probability of a tree given the population size. Actually, what appears is N times g, because it's the population size multiplied by the generation time that matters: you can't disentangle these two. We often just write N, but it's important to keep in mind that the generation time is a factor here as well.
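And here's the corresponding sketch for the coalescent: simulating the waiting times between merging events for k sampled lineages, using the rate, number of pairs times 1/(N·g), described above.

```python
import random

def coalescent_times(k, ne_g, rng=random):
    """Waiting times for the standard coalescent: while k lineages
    remain, the next pair merges at rate k*(k-1)/2 * 1/(N*g), where N*g
    is the population size times the generation time (only their
    product is identifiable)."""
    t, times = 0.0, []
    while k > 1:
        rate = k * (k - 1) / 2.0 / ne_g
        t += rng.expovariate(rate)        # exponential waiting time
        times.append(t)                   # one coalescence; k drops by one
        k -= 1
    return times                          # event times, going back in time

print(coalescent_times(5, ne_g=100.0))
```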
Okay, so that's the coalescent approach. In general, the idea is this: we have some population dynamic through time, we have samples that occur at different times, and these samples, in combination with the population dynamic, give rise to a distribution over phylogenetic trees. This allows us to write down the probability of the tree given eta, where eta is either the set of birth-death parameters or the coalescent rate function. This is our phylodynamic likelihood: we can turn this probability distribution around and describe it as the likelihood of eta given the tree T.

This is already hinting at what we want to do next, which is to wrap all of these things together in a Bayesian framework. The way we do this is not rocket science. We take our phylogenetic likelihood, the probability of the sequence alignment given the tree and the substitution rate matrix, and multiply it by the phylodynamic likelihood, the probability of the tree given our population-model parameters eta. Then we have our parameter priors on the substitution rate matrix and on the phylodynamic parameters. All of this is normalized by the marginal likelihood down here, the probability of the sequence alignment integrated over all of these unknown parameters and objects. On the left-hand side we have the target of our inference: the posterior probability over the tree, the substitution rate matrix and the phylodynamic parameters (written out in symbols below). That's the goal. The nice thing is that if we can characterize this object, we automatically get a measure of uncertainty in all of our various inferences, and by including appropriate prior information, we can bring other sources of information into the inference quite naturally and easily.
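Writing A for the alignment, T for the tree, Q for the substitution-model parameters and η for the population-model parameters, the decomposition just described reads as follows (this is simply Bayes' theorem applied to the quantities above; the explicit notation is mine):

```latex
P(T, Q, \eta \mid A)
  = \frac{P(A \mid T, Q)\; P(T \mid \eta)\; P(Q)\; P(\eta)}{P(A)},
\qquad
P(A) = \int P(A \mid T, Q)\, P(T \mid \eta)\, P(Q)\, P(\eta)\,
        \mathrm{d}T\, \mathrm{d}Q\, \mathrm{d}\eta .
```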
So that all sounds wonderful and very easy; why is this even difficult? Well, it's difficult because integration is difficult. If we look at Bayes' theorem in general, we have this denominator, which I mentioned is the marginal likelihood. This is the probability of the data given the model assumptions we're making, and you can think of it as just a normalizing constant for the posterior distribution. As such, it equals the integral of the numerator of this fraction over all possible model parameters. The difficulty is that we can't solve this integral for most problems of interest, because the parameter space is far too large, and usually we can't even brute-force it numerically. So we need another approach, and the approach we take, as I'm sure has come up already in the course, is to go to Monte Carlo. In particular, this is the casino in Monte Carlo that inspired the name of Monte Carlo methods. Monte Carlo methods are algorithms that produce random samples of values in order to characterize a probability distribution over those values. Our goal is to characterize the posterior distribution, and we're going to use Monte Carlo methods to do it. Ideally, these algorithms would produce arbitrary numbers of independent samples of the parameters whose posterior distribution we're trying to characterize. But in practice, we're usually stuck with so-called Metropolis-Hastings Markov chain Monte Carlo (MCMC).

This is an algorithm that is capable, in principle, of producing samples from a distribution, say f(x), by generating a random walk over the possible values of x. Imagine this is our f(x), a very simple one in this case for illustrative purposes, and this is our starting value of x, somewhere around 0.5. The algorithm proceeds by making some modification of x: for instance, we could draw a new value of x from within some window that we define arbitrarily, say of width 0.2. If the new value of x has a higher probability than the old value under this distribution, we immediately accept it as the new value, move there, and repeat the process. If the new value has a lower probability, we still make the downward move, but only with a probability equal to the ratio between the new, lower probability and the original, higher one. That ratio is less than one, so for downward steps there's a chance you stay where you are. The result is a biased random walk that tends to explore the most highly probable areas of the probability surface. The important thing is that the only place f appears is in this ratio of the probability of the new value to the probability of the old value, which means we don't need to normalize f. In other words, we don't need to compute that horrible marginal likelihood that was causing us pain (a small sketch of this sampler follows below).

This is what we use for tree inference. But of course trees are a very different space from the real numbers, so the picture is a little different for trees, although in our heads we can still think of this random walk on the real number line. In reality, when we do MCMC with trees, we make local modifications to a phylogenetic tree using a wide variety of proposal distributions. Here are some examples of moves you sometimes see in these kinds of algorithms. For instance, there are moves where you take a randomly chosen subtree (by the way, the dotted lines represent a truncated portion of the tree), disconnect it from wherever it's attached, and reattach it at some other random point in the tree. There are moves where two subtrees are selected and simply swapped. Obviously you need to sample node heights as well, so you can move them up and down, and you can do things like scaling the whole tree. But really, the only thing that matters for MCMC is that we're able to get from any tree to any other tree that has non-zero posterior probability in our distribution. As long as that's the case, if you run the algorithm for long enough, you will eventually converge to the correct answer.
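Here's a minimal sketch of that random-walk Metropolis sampler on the real line, targeting a deliberately unnormalized density; the window size and the target are arbitrary choices for illustration.

```python
import math
import random

def metropolis(log_f, x0, steps, window=0.2, rng=random):
    """Random-walk Metropolis: propose uniformly within a window around
    the current value and accept with probability min(1, f(x')/f(x)).
    Only the ratio of densities appears, so f need not be normalized."""
    x, log_fx, chain = x0, log_f(x0), []
    for _ in range(steps):
        x_new = x + rng.uniform(-window, window)
        log_fx_new = log_f(x_new)
        if rng.random() < math.exp(min(0.0, log_fx_new - log_fx)):
            x, log_fx = x_new, log_fx_new   # accept; otherwise stay put
        chain.append(x)
    return chain

# Target: a standard normal, deliberately left unnormalized.
chain = metropolis(lambda x: -0.5 * x * x, x0=0.5, steps=20000)
print(sum(chain) / len(chain))   # near 0 once the burn-in washes out
```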
When you run MCMC, and here I'm hopping back to the real-number case with our x, you run the algorithm for a large number of steps. It starts at some arbitrarily chosen value that has nothing to do with the actual distribution we're trying to characterize, but then it converges, in this case very quickly, to the portion of the state space that has actual support under the probability distribution, and from then on it just wiggles up and down there. If we take the values visited by the algorithm, which is just a long column of numbers, one per step, and compute the relative frequency with which the different points in the state space were visited, we can compare this to the target distribution f(x). You can see that after a long enough time, the MCMC histogram starts to look more and more like the dashed line we're actually trying to target.

Of course, there are some details here. The fact that we start from an arbitrary point means that the first period of the chain is biased: it has essentially nothing to do with the probability distribution we're trying to characterize. We usually call this the burn-in period, and we try to discard it somehow before actually analyzing our samples. More generally, it's good to keep in mind that, because of the way MCMC works, adjacent MCMC samples are correlated. So, far from having 2,000 independent samples from f(x), at the end of the day we have some much smaller number of effectively independent samples contributing to all of these points. But if we run for long enough, we eventually get a good characterization.

Okay. It's all well and good to plot figures like this for x, an arbitrary real number, but how do we do this sort of thing for trees? What we can do is look at summary statistics of the trees. For instance, here is a histogram resulting from an MCMC analysis where we've plotted the tree height, the age of the most recent common ancestor of all our samples. That's a real number, so we can visualize a marginal posterior distribution for it quite easily. We can also try to depict the posterior distribution of the trees themselves, by doing things like plotting them all on top of each other, which is a fun exercise, but it can be difficult to interpret once there's a large amount of uncertainty.

Okay. We've only got a few more minutes, but I'm going to quickly talk about an application of these methods to some SARS-CoV-2 data. The focus of this application is to infer the so-called basic reproductive number, or R0, for a number of different SARS-CoV-2 outbreaks that were occurring across the world at the beginning of 2020. You've probably seen this basic reproductive number even if you'd never encountered it in a scientific context; it appeared in the news quite often over the last few years, so you probably have a rough idea of what it means. It represents the average number of secondary transmissions resulting from a host in a completely naive population: imagine an infected host appearing in a population with no other infected hosts, and consider the number of secondary transmissions that host would be responsible for. The average, or expected, number of such transmissions is the basic reproductive number. It's useful because if it's below one, you on average expect the outbreak to just die out, and if it's above one, then on average it's going to grow exponentially (the little simulation below illustrates this threshold).
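Just to illustrate the threshold at one, here's a toy branching-process simulation. The Poisson offspring distribution is an assumption made purely for this illustration; it's not part of any of the models discussed in the talk.

```python
import math
import random

def sample_poisson(mean, rng=random):
    """Poisson sample via Knuth's method (fine for small means)."""
    L, k, p = math.exp(-mean), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def dies_out(r0, max_cases=1000, rng=random):
    """Toy branching process: each case causes a Poisson(r0)-distributed
    number of secondary cases (an assumption for illustration only).
    Returns True if the outbreak goes extinct before reaching
    max_cases."""
    cases, total = 1, 1
    while 0 < cases and total < max_cases:
        cases = sum(sample_poisson(r0, rng) for _ in range(cases))
        total += cases
    return cases == 0

for r0 in [0.8, 1.5]:
    extinct = sum(dies_out(r0) for _ in range(1000))
    # Below one, extinction is certain; above one, only a fraction die out.
    print(r0, extinct / 1000)
```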
The way R0 is often estimated is from recorded case counts: if you look at the number of detected cases in a population through time, you can look at the rate of growth of this curve, fit an exponential growth curve to it, and figure out the rate that way. But of course, this is contingent on having good estimates of the case counts. There are other ways to do it, though, and in particular there are ways involving phylodynamics. Soon after SARS-CoV-2 emerged, various institutions began sharing genomic sequences on the GISAID platform. This was extremely useful, because these sequences had the potential to yield much more robust estimates of R0 via phylodynamics. This particular analysis involved a number of different outbreaks from a number of different countries, including, down here, not a country but a cruise ship, the Diamond Princess. We had a fairly small number of sequences from many of these outbreaks, but still enough to start considering these analyses. An important point about the sequences we included in the analysis is that we chose them so that the last included sequence predates the implementation of any real public health measures. The goal was to infer the transmission rate in a completely naive population without public health measures: something that should be more a quality of the virus itself, rather than of the virus plus those measures.

There were lots of challenges for this sort of inference early in the epidemic. The main one was limited diversity: a lot of the sequences looked very similar. Sample collection is always full of unknown biases, and when you don't have many samples this can be a real problem. And sequencing errors weren't well characterized at this point, which is obviously also a problem. All of which is to say: interpret these results carefully.

We analyzed these sequences using the birth-death skyline model. So not a coalescent model, but a birth-death sampling model, one in which the rates are allowed to change through time. We assumed that each outbreak evolved independently, from a single introduction, under this birth-death sampling process. As I said, we're interested in the basic reproductive number, and to that end we used a more epidemiologically focused parameterization. Instead of thinking in terms of birth rates and death rates, we thought in terms of the reproductive number, which is the birth rate divided by the becoming-uninfectious rate, and the becoming-uninfectious rate itself, which combines the sampling rate and the removal (death) rate (a small conversion sketch follows below). By the way, "death" here doesn't necessarily mean death: in this setting it just means removal from the infectious population, so quarantining an individual also counts as mathematical death in this sense. We fixed the becoming-uninfectious rate to correspond to an infectious period of about ten days, which was a sensible choice based on the information we had at the time, and we placed weakly informative priors on the other parameters.
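For concreteness, here's how such a reparameterization works arithmetically, assuming a sampling rate alongside the death rate; the exact conventions vary a little between birth-death model variants, so treat this as a sketch rather than the definitive parameterization used in the study.

```python
def epi_parameterization(lam, mu, psi):
    """Convert a birth rate lam, death rate mu and sampling rate psi
    into the epidemiological parameterization. Conventions differ a
    little between birth-death variants; this follows the common one in
    which the becoming-uninfectious rate combines death and sampling."""
    delta = mu + psi        # becoming-uninfectious rate
    r0 = lam / delta        # reproductive number
    s = psi / delta         # sampling proportion
    return r0, delta, s

# A becoming-uninfectious rate of 0.1/day corresponds to an average
# infectious period of ten days, as fixed in the analysis:
print(epi_parameterization(lam=0.25, mu=0.08, psi=0.02))
```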
Long story short, this is the result. We were able to do a joint analysis of all of these sequences together. Some parameters were shared amongst the outbreaks, but for the most part they had independent R0 values, obviously, and independent sampling proportions. You can see that lots of outbreaks have very similar values, somewhere around two. These are posterior distributions, by the way: marginal posterior distributions of the R0 values for each of these outbreaks. But you also see that some outbreaks have much larger values, and in at least four of these cases we believe these are artifacts of the sampling process. The evidence for this is just the distribution of sample times that went into each analysis: the outbreaks with very large inferred R0 are the ones with very tiny sampling windows. So we really believe that this is the reason behind those results.

That's all I'm really going to say about this application; I'm sorry it was very short, but I didn't have a lot of time. The take-home message is that these epidemiological parameters can be inferred from genetic data, and I don't think that's necessarily obvious unless you really think about it. These inferences are complementary to parameter inferences from traditional case-count data. The reason they're complementary is that they don't rely only on the times and numbers of observed cases, but also on the timing of the inferred transmission events further back in the tree, so they can be less biased. They do remain susceptible to sampling biases, though, because the phylodynamic model necessarily includes sampling rates. They're also generally better equipped to deal with confounding factors such as travel cases: once you have a phylogeny, it's much easier to see that a particular sample doesn't fall within your tree but comes from elsewhere, and thus shouldn't be included in the analysis. Okay, I know that was extremely quick. Does anyone have any questions about the application, or any of the other points I've raised so far?

If not, I'm going to shift on to the quick tutorial. I just want to talk briefly about BEAST2. BEAST2 is a free software package that performs Bayesian phylogenetic and phylodynamic inference using exactly this kind of MCMC. You can get it from this website here; there's documentation and there are some tutorials there as well, but there's also a dedicated tutorial website, tamingthebeast.org, that our group here is primarily responsible for, and it contains a much larger and more heavily curated set of tutorials for the software.

BEAST2 itself is the software that implements the MCMC. There's a companion program called BEAUti, as in Beauty and the Beast, that provides a GUI for setting up the analysis input file. There's a program called Tracer that is useful for summarizing parameter posteriors. The TreeAnnotator program takes the large set of sampled trees from the tree posterior and produces summary trees from those samples. FigTree is a different tool, not within the BEAST ecosystem, but used a lot by BEAST users for visualizing trees. These things all fit together like this: you have your sequence data, you fire up BEAUti, you load in your sequence data and set up a model; this produces an XML file that contains all of the details, including your sequence data, that specify the analysis, and you feed that into BEAST2.
Usually you'll be running BEAST2 on a cluster; if this is a real production analysis, the MCMC process can take a while. Once BEAST2 is finished, it will have produced a log file, which contains the set of sampled parameter values, and also a trees file, which is essentially a log file just for the phylogenetic trees, and these are then fed into the appropriate programs for analysis.

I know we only have a very short amount of time, 23 minutes maybe, but at this point I'd like to encourage you to download and install BEAST2 from beast2.org if you can, open the Taming the BEAST page, and locate the Introduction to BEAST2 tutorial. Given that we don't have time, I'd encourage you to focus on the grey boxes within that tutorial, which tell you exactly what you need to do. We don't have time to read all the text; obviously the text is important and interesting, and I would encourage you, if you're interested, to go back and complete this tutorial properly, but for the purposes of just getting familiar with things, that's not necessary. I see these links have been shared in the chat, so thank you very much, Patricia. I'm going to give you a small amount of time to at least get started on this tutorial, and then maybe five minutes before half past I'll wrap up and briefly go through the tutorial results, so that everyone at least gets a chance to see roughly what the output looks like.

Okay, so this was the tutorial. The first bit is mostly about installing the software and understanding how to load the data. The first thing you interact with is this BEAUti program that I mentioned: we load in the primate mitochondrial DNA data used in this tutorial, and set things up in this panel so that we have several different partitions. These correspond to the different codon positions, because we expect those codon positions to experience different evolutionary rates, so we want to keep them separate, and obviously the non-coding portion of the sequence is kept separate as well. They all share the same tree; they just have different substitution models associated with them. All of the samples were collected at the same time. Next we set up a site model for these different partitions; the site model is the substitution model from the lecture, and here we've chosen HKY for all of them and allowed their overall rates to vary. Then the clock model: again, this is not something I talked about, but we're assuming a strict clock, the simplest case. Relaxing this would allow the overall rate of evolutionary change itself to vary over the tree to some extent, but in our case we keep it fixed.

The priors panel is where a lot of the twiddling happens when setting up a BEAST analysis. In this case we've chosen a particular tree prior, essentially a birth-death kind of model: the calibrated Yule model. The important thing is that we've also explicitly put a prior on the date of the divergence between humans and chimps: we've said that this divergence time is distributed according to a normal distribution with a mean of six million years ago and a particular standard deviation. There's a little bit to say about this particular kind of calibration: it's somewhat of a fudge, owing to the fact that without it there would be no other information about the absolute timing of events in this phylogeny. Just be aware that it's something you have to be a little bit careful about.
The last thing to set up is just some details of the MCMC: the number of steps, and the number of times the MCMC state is written to disk. And remember that we always have these different kinds of logs: the trace log, which in this case is just logging the parameters of the model we're inferring, and the tree log, which is the same thing but specifically logs the trees themselves. If we save the analysis, BEAUti produces an XML file, this primates.xml file, and we run it using BEAST. I'm not going to show what that looks like; it's pretty boring if everything is working correctly, and it's non-interactive. At the end, once the analysis is complete, and in this case it's extremely fast because it's a very simple analysis, although in general it could take a lot longer, we get the log file and the trees file.

The log file we can load into Tracer. We could also load it directly into R: it's just a flat text table, with columns corresponding to the different parameters and each row corresponding to a sample from the MCMC (see the sketch below). But in Tracer we get this nice visualization, with all of the parameters, or summary statistics, they're not all real parameters, in this table on the left-hand side. We can select them: for example, we have the tree height here, and if we click it, the trace on the right-hand side shows the tree height starting from some value, passing through the burn-in period I mentioned, and then oscillating. This is what we're looking for in MCMC: the fuzzy-caterpillar picture, without any long-term trends. But in case we don't believe our eyes, we can also look at this ESS value, the effective sample size estimate, which aims to estimate how many effectively independent samples you have. Note that some parameters here are coloured red, indicating that we don't actually have a very large number of effectively independent samples for them; we might want to run the chain a bit longer. The tutorial also has us compare some of these parameters: the mutation rates corresponding to the different partitions, that is, the first, second and third codon positions and the non-coding region. It's the third codon position that has the elevated rate; if we add a legend, we can see the non-coding partition is actually over here, with a rate similar to the first and second codon positions, which is interesting.
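Since the log file really is just a flat tab-separated table with '#' comment lines, you can read it outside Tracer too. A small sketch follows; the file and column names are the ones from this tutorial run (adjust them to your own analysis), and the ESS computed here is a crude version, not exactly Tracer's estimator.

```python
import numpy as np
import pandas as pd

# A BEAST2 trace log is a tab-separated table with '#' comment lines,
# so it's easy to read outside of Tracer. File and column names here
# come from the tutorial run; adjust them to your own analysis.
trace = pd.read_csv("primates.log", sep="\t", comment="#")
x = trace["TreeHeight"].to_numpy()
x = x[len(x) // 10:]          # discard the first 10% as burn-in

def crude_ess(x, max_lag=1000):
    """Very rough effective sample size: n divided by an integrated
    autocorrelation time, summing lagged autocorrelations until the
    first negative one. Tracer's estimator differs in its details."""
    x = x - x.mean()
    var = np.mean(x * x)
    tau = 1.0
    for lag in range(1, min(max_lag, len(x) // 2)):
        rho = np.mean(x[:-lag] * x[lag:]) / var
        if rho < 0:
            break
        tau += 2.0 * rho
    return len(x) / tau

print(x.mean(), crude_ess(x))
```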
And then the tree itself: we've got this trees file, and first I'll show what it looks like when you load it into the DensiTree program. These are the posterior samples of the estimated time trees, so this is a visualization of the marginal posterior for the phylogenetic trees. The trees are simply plotted on top of each other with some transparency, which gives you an indication of the uncertainty in the timing of the different events, but you can see that these trees all agree very, very nicely; in particular, the topology is basically the same for every tree that appears here. Another way to look at these trees is to take the trees file, process it with TreeAnnotator, which is described in the tutorial, and produce this so-called maximum clade credibility summary tree, which is what we see here, visualized in FigTree. So this is a summary of the full posterior. We can see, for instance, this human-chimp divergence time, which is essentially the prior we placed on it, and the bars represent the uncertainty in the divergence times. The numbers here, and I'm not sure you can see them very well, are almost all one: they represent the posterior probability that the particular clade appears in the samples from the posterior, so one means it appeared in all of them. This is basically the case for all of these clades, reflecting the fact we saw in DensiTree that the topology is very nicely resolved. There is one clade here with a posterior probability of 0.9967, so at least one of the sampled trees from the posterior must not have contained this particular clade, but anyway, that's still very close to one.

And that is essentially it. What we haven't seen in this tutorial, really at all, is the phylodynamic aspect, although of course there are birth rates being inferred here as well, so I suppose this is fundamentally phylodynamics; but we haven't done anything like trying to infer the dynamics of population size through time. If you're interested in that, definitely go back to the Taming the BEAST web page: you'll see there's a wide variety of tutorials there, including ones describing how to infer population dynamics through time. So with that, I know I'm five minutes over, and I'm very sorry for eating into your coffee break, but thank you very much for sticking with me.