All right, so as the title suggests, I'm going to talk to you today a little bit about the Bayesian foundations of phylogenetic and phylodynamic inference. I know that's a buzzword-laden title, and I'm going to unpack it as we go. But perhaps it would have been better to call this talk the foundations of Bayesian phylogenetic and phylodynamic inference, because I'm going to talk not only about the Bayesian foundations of these concepts, but also just the foundations of these concepts. I think that's important because not everyone knows what these topics are about, and I want us all to be on the same page. Okay, so this is the outline of the presentation today. The first three sections are all foundational material; hopefully we'll get through those within the first hour, up until the coffee break. After the coffee break, we can get started on applications. And then in the final 45 minutes, hopefully, my intent is to walk you through a BEAST 2 tutorial that makes use of some of the concepts that I'll have spoken about, and that hopefully you'll have learned about, by the end of the course. Okay, and I just want to reiterate that you're absolutely free to interrupt at any point with questions. It makes everything more interesting for everyone, I think, and I don't want you to be sitting there not understanding what's going on. Okay, so to get started, I'm going to go right back to the beginning and talk about phylogenies, phylogenetic trees, and their relationship to sequence evolution. And when I say right back to the beginning, I mean right back to the beginning. So let's look up phylogenetic trees on Wikipedia, and we see that we get this definition: a phylogenetic tree is a branching diagram showing the inferred evolutionary relationships among various biological species or other entities.
And we're inferring this phylogeny, because we're usually not in a position to observe it directly, based on similarities and differences in the physical or genetic characteristics of the organisms that make it up. Okay. Now, trees in biology, and related structures, go back a long way. Even before Darwin, there were theological concepts built around the idea of a hierarchy of organisms: some heavenly being at the top, then different organisms right down to very, very basic things and minerals at the bottom. So people have been thinking about similar sorts of structures for a long, long time. Much more recently, entering into the time of Darwin but just before he published his work, people were publishing things that start to look very much like phylogenetic trees. This figure is taken from Edward Hitchcock's Elementary Geology. He was actually a skeptic of Darwin's work, but as a geologist he drew these wonderful pictures showing the occurrence of different organisms through different geological periods. But of course, it wasn't until Darwin that we actually get to today's interpretation of phylogenetic trees as representing the evolution of different species or organisms from a common ancestor. Very early on in his notebooks, this figure appeared, and it has been copied in many different places, but it's just some rough notes of Darwin sketching out potential evolutionary relationships between different species. You'll notice that time is not really present in a diagram like this; it just says that we have these different species, A, B, C, D, and that they are somehow related to one another and result from some speciation process. The sole figure in the published On the Origin of Species is actually a phylogenetic tree, and this time we actually see, on the vertical axis here, time playing a role.
So there is a sense in which, as we go up the page, we're moving forward in time, and what we're seeing here are all of these ancestral speciation events producing new species going forward in time, with only some of those species actually making it all the way to the present. Okay. This is a more modern, but still schematic, representation of a phylogenetic tree relating different primates. Trees are inferred from characteristics of the entities that they relate. The important thing is that we have some property of an organism or a species that is changing through time and is inherited, and our observations of this quantity, which in the case of species trees usually come from the present, although we can have fossils as well, can be used as the basis for inferring phylogenetic relationships. These characteristics are often genetic these days, but they may also be phenotypic, and that is very important when you're including fossils in this sort of inference, because we don't usually have DNA sequences associated with fossils. Now, it's important to note that there's not just one kind of phylogenetic tree; phylogenetic trees occur on multiple levels. Originally, these concepts represented relationships amongst different species, but we can also draw ancestral relationships between genes that are possessed by individual members of those species. And it's the genes that often constitute our data. So if we have sequenced some particular marker gene in members of different species, we can use the methods described in this talk to infer the phylogeny corresponding to those particular genes. But if you're interested in the species phylogeny, while that's related to the gene trees, it's not exactly the same. So in this figure, you see a diagram of this.
So what this is meant to represent is that we have these five different species at the present, and from each of these species we have several different samples of a particular marker gene. Using those genetic sequences, one can build the phylogenetic relationships between them, and that's what's represented in these inner trees here. These lines represent the individual lineages of each of these genes, and you see that at some point a common ancestor is found between a pair of these lineages, and so on and so forth. What you can also see is this larger structure, the species tree, that actually represents the speciation events in the ancestry of these species. While the structure of this tree does impose some constraints on the structure of the gene trees, they're not the same thing; you can't say that the gene tree always gives you the species tree. The gene trees are nested within species trees. And this sort of relationship is important to point out early on, I think, because it doesn't just exist in the context of species trees and gene trees, but also in the context of transmission trees and pathogen trees. You can draw a completely analogous diagram where, instead of a species, you have different hosts that are infected by some pathogen in an epidemic. The inner tree, these lines here, would represent the phylogeny that you would reconstruct based on the genomic sequences of those pathogens, whereas the larger structure, the outer tree, would represent the transmission tree, where branching events represent transmission events between hosts. Okay, another thing that's important to point out early on is that in phylogenetics, a very common assumption is that the shape of the tree affects the distribution of characters, the sequence data that you observe, but not the other way around, in a causal sense.
Now, of course, this is not true in general, but we try very hard to ensure that it is roughly true for the sequences that we use for phylogenetic inference, and that's the important thing. The idea here is that on the left-hand side we have some tree; you imagine that this tree is already there. The relationship between this tree and the sequence data is that the sequence data is the result of taking some unknown ancestral sequence at the root of this tree and evolving it, according to some process that we'll talk about, down the edges of the tree, eventually producing the sequence data at the tips. The important thing is that the rate, for example, of branching in the tree is not related to the sequence carried by a particular lineage. Now, as I said, that's not true in general. Of course, we have something called selection, which means that the genetic sequences, on a genomic level, are extremely important for the fitness, and therefore the branching rate, at various points in the tree. But if you're looking at just some portion of the sequence, it can be the case that this portion is evolving neutrally and therefore is not directly affecting the shape of the tree. So we're focusing here on neutral evolution. Now, if we have our sequences, it stands to reason that, at the very least, we can perform some simple clustering and get back something like this hierarchical relationship here, where edge lengths might represent genetic distance or time. But we want to do better than this; we need a model-based approach. The Bayesian approaches that we're focusing on today are very much built on generative processes: we actually try to model the processes that generate our data, and we want to employ that approach when we're inferring our phylogenetic trees. The basic way we do this is to consider a tree like this and imagine that we have some ancestral sequence.
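To make this generative picture concrete, here is a minimal sketch of my own (not code from the talk) of evolving an ancestral sequence down the branches of a tree: along each branch, every site accumulates substitutions as a Poisson process, and each substitution replaces the character with one of the other three, chosen uniformly, which is the simplest possible choice and matches the Jukes-Cantor model discussed below.

```python
import random

ALPHABET = "ACGT"

def evolve(seq, branch_length, mu=1.0, rng=random):
    """Evolve a sequence along one branch.  Each site experiences
    substitutions as a Poisson process with rate mu; each substitution
    replaces the character with one of the other three, uniformly."""
    out = []
    for base in seq:
        t = 0.0
        while True:
            t += rng.expovariate(mu)   # waiting time to the next substitution
            if t > branch_length:
                break
            base = rng.choice([b for b in ALPHABET if b != base])
        out.append(base)
    return "".join(out)

# An ancestral sequence at the root, evolved down a two-leaf tree with
# one short and one long branch (branch lengths in expected substitutions).
rng = random.Random(1)
root = "".join(rng.choice(ALPHABET) for _ in range(30))
leaf_a = evolve(root, 0.1, rng=rng)
leaf_b = evolve(root, 0.5, rng=rng)
```

On a zero-length branch no substitutions can occur, so the sequence passes through unchanged; longer branches accumulate more differences on average.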
And then we go about defining a process by which this sequence is iteratively modified as it evolves down the tree, giving rise to the genetic diversity that we see at the present. Now, I've introduced this mysterious variable here, the symbol Q, and the question is: what is it? The answer gets to the heart of the models of sequence evolution. In the traditional models used to reconstruct or infer phylogenetic trees, we model nucleotide substitution as a continuous-time Markov chain. Basically, we consider each site in a sequence independently, and we assume that the character at that site changes through time according to this continuous-time Markov chain. The way that works is that we let P_t^i(x) represent the probability that the nucleotide at site i at time t is in some state x, where x could be G, T, C or A in the case of DNA evolution. The time development of this probability distribution is determined by this Q, which is now revealed to be the transition rate matrix. So if we consider the result of a small increment, we've got some starting probability for G, T, C and A, that's the vector on the right-hand side here, and the way we get from this probability to the probability at some infinitesimally advanced time is by multiplying by this matrix. Evaluated as a whole, this matrix would give you the transition probability matrix for updating the state at this particular site through this small Δt time increment, but in general we can expand it out in terms of the transition rate matrix. Each element of this rate matrix, apart from the diagonals, which are a little bit special, represents the rate, the probability per unit time, at which a particular site transitions from one character to another. Okay.
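This master equation, dP/dt = QP, is easy to check numerically. The sketch below is my own illustration (not from the talk): it sets all off-diagonal rates equal, which anticipates the Jukes-Cantor model introduced next, and Euler-integrates the equation starting from a site known to be in state A, showing the distribution relaxing to its equilibrium.

```python
# Rate matrix Q for a 4-state chain (order A, C, G, T): all off-diagonal
# rates equal 1/3, and each diagonal is -1 so that every column sums to
# zero and total probability is conserved.  Any rate matrix works the
# same way; equal rates are just the simplest choice.
r = 1.0 / 3.0
Q = [[-1.0 if i == j else r for j in range(4)] for i in range(4)]

def step(P, dt):
    """One Euler step of the master equation dP/dt = Q P."""
    return [P[i] + dt * sum(Q[i][j] * P[j] for j in range(4))
            for i in range(4)]

# Start from a site that is in state A with certainty, integrate to t = 10.
P = [1.0, 0.0, 0.0, 0.0]
dt = 1e-3
for _ in range(int(10 / dt)):
    P = step(P, dt)

print(P)   # close to the uniform equilibrium [0.25, 0.25, 0.25, 0.25]
```

Because the columns of Q sum to zero, the Euler step preserves total probability exactly; the distribution itself decays exponentially toward the uniform equilibrium of this particular Q.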
And basically all of the substitution models that we use in phylogenetic inference behave this way; the only difference between different substitution models is generally this transition rate matrix, how you define its elements. I should say that we usually factor out the average rate of substitution, so we can talk about the average rate of mutation or substitution separately from the relative rates that appear in the transition matrix. Okay. So, just to be a little bit more concrete, the basic model is the Jukes-Cantor model, so called because it was introduced by these particular authors. It's the simplest possible substitution model. [Audience] Can I ask a question? [Speaker] Yes, please, sure. [Audience] Yesterday we learned about these Markov chains, about how they look locally before moving and looking at the whole space. Is that something biologically relevant here? [Speaker] I'm not actually sure I understand what... [Second instructor] If I can jump in real quick: you have to be careful not to confuse Markov chains, which are a general kind of tool, with Markov chain Monte Carlo, which is a tool to explore the space of your posterior distribution using local jumps. This is just another kind of application of Markov chains here. [Audience] Oh, thank you. [Speaker] Yeah, thank you very much for that clarification. This is an important distinction, particularly because we will use Markov chain Monte Carlo here as well, so it's important to get that out in the open early on. Okay, yes, here we're just talking about a biologically motivated process: the changes occurring to the character at a particular site in a sequence through time, due to various random effects during replication. Okay, so this simplest possible substitution model might look a little bit weird, because we've got all these one-thirds everywhere, but essentially that's just for normalization.
The important thing here is that every one of the off-diagonal elements of this transition rate matrix is the same, which means the rate from G to C, versus G to A, versus G to T: these are all identical. And this is not particularly realistic; we know that certain mutations occur at higher frequencies than others, but it's a good starting point. In particular, there's no difference between transitions and transversions here. So a C-to-T mutation, a G-to-A mutation, a C-to-A mutation: all of these kinds of mutations have exactly the same rate. Another important thing is that the equilibrium distribution is uniform, meaning that if I take some site and evolve it under this model for long enough, the probability that it winds up in any one of the four possible nucleotides is the same, which again is usually not what you want. However, it's useful for analytical calculations, and it still models important things, for instance the possibility of back mutations, where we transition from some ancestral wild-type character at some site to a new character and then back again. That's already a property that is not in all models of evolution, so this is useful in its own right. Okay, and just to demonstrate this, we can use the Jukes-Cantor model to quite easily compute things like this, which is the probability that after some amount of time, and here we've rescaled time so that it's measured in expected substitutions, which is why the μ parameter doesn't appear at all, a site will wind up in some state, have some character, that's different from the starting character. So this p_diff starts from zero but increases as we go forward in time.
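For reference, this p_diff curve has a simple closed form under Jukes-Cantor. With time measured as an evolutionary distance d in expected substitutions per site, p_diff(d) = (3/4)(1 − e^(−4d/3)), which starts at zero and saturates at 3/4. A quick sketch of my own:

```python
import math

def p_diff(d):
    """Jukes-Cantor probability that a site differs from its starting
    character after an evolutionary distance of d expected
    substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

print(p_diff(0.0))    # 0.0: no time, no change
print(p_diff(10.0))   # ~0.75: the saturation limit set by the uniform equilibrium
```

For small d the curve is approximately linear with slope 1 (p_diff(0.1) is roughly 0.094), which is the near-linear initial regime visible in the plot.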
So this is p_diff, in red here, and what you see is that it starts off increasing linearly right at the beginning but then quickly asymptotes to this value of 0.75, which is just due to the uniform equilibrium distribution that I talked about. This means that, based on the relative number of differences between a pair of sequences, we can get an empirical estimate of this probability, turn it on its head, and compute an estimate of the evolutionary distance. Now, this is definitely not Bayesian inference; this is just a very straightforward method-of-moments sort of estimator. But it nonetheless gives you some intuition as to what you should be looking for if you're actually trying to infer these distances. You don't want to be inferring distances in this region out here, because if you're at the point where the number of differences between a pair of sequences is close to the limit you reach on average, then very small changes in the relative difference are going to result in massive changes to the inferred evolutionary time, meaning that your estimate will be extremely noisy.
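Inverting the closed-form p_diff gives the standard Jukes-Cantor distance estimator, d = −(3/4) ln(1 − 4p/3), where p is the observed fraction of differing sites. The sketch below (mine, not from the talk) also illustrates the noise amplification near saturation: the same small error in p maps to a far larger error in distance when p is close to 0.75.

```python
import math

def jc_distance(p):
    """Method-of-moments Jukes-Cantor distance from the observed
    fraction p of differing sites (undefined once p >= 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# A fixed error of 0.01 in p translates to very different errors in d
# depending on where on the curve you sit:
eps = 0.01
near_zero = jc_distance(0.10 + eps) - jc_distance(0.10)
near_sat  = jc_distance(0.70 + eps) - jc_distance(0.70)
print(near_zero, near_sat)
```

Near saturation the estimator's sensitivity blows up like 1/(1 − 4p/3), which is exactly the "extremely noisy" region warned about above.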
Okay, so beyond the Jukes-Cantor model we've got more complicated things. This HKY substitution model, named after its authors, is a classic, because it basically ticks all of the main boxes: it allows for non-uniform equilibrium base frequencies, which are encoded in the transition rate matrix by these π's, and it also allows for differences in rate between transitions and transversions. Together, these two things make it basically the simplest sensible model for neutral DNA evolution, and I think it's the simplest model that's often used in practice. But we can of course go further. There's the general time-reversible model, the most general model that also satisfies the property of time reversibility that people often want to build into these models of sequence evolution for technical reasons. It has 10 parameters, so it's already starting to have quite a large number of parameters, and all of the other time-reversible models, including HKY and Jukes-Cantor, are special cases of it. Okay, so that's just some background to get you used to some of these concepts and models that you'll hear a lot when you're looking through the phylogenetic literature and using phylogenetic software. The phylogenetic likelihood is now something that we can define. We have a process that takes us from the unknown ancestral sequence to our observed sequences at the tips, the leaves, of this tree. The sequences at the leaves we're going to assume that we know, and that they're all nicely aligned. Notice that in this model there's nothing about insertions or deletions; everything is just homologous point mutations, meaning that in order to do this inference you need an aligned sequence. So this is our sequence alignment, our observation. If we call our tree T, then what we can write down, at least in principle, is the probability of our sequence alignment given the tree,
given this substitution rate matrix, and usually also given things like the equilibrium starting frequencies for the characters at the root. For the models I've described so far, this can be evaluated in linear time using dynamic programming. It's still quite expensive, and by linear time I mean linear in the number of leaves, but it's a very useful thing to be able to do, and it is the basis for Bayesian phylogenetic inference. So in principle, this is your likelihood; you throw in some prior over tree space, turn the wheels of the Bayesian machinery, and you get out your posterior over trees. Of course, nothing's ever quite as simple as that, and we'll talk about the way it's done in practice later on. But before I get to that, I'm going to jump over to the second part, which is phylodynamics. Before I do: does anyone have any quick questions about this first part? Keep in mind that I'm aware I've skipped over a lot; I can't talk about everything. But is there any major lack of understanding? Okay, that's good. So, section two: phylodynamics. What is phylodynamics? This is probably the most buzzwordy term from the title. It's a fairly recent term, introduced by Grenfell and others in 2004 in a particular Science paper, and, to quote the paper, it refers to the interplay between immunodynamics, epidemiology and evolutionary biology, and the effect this has on the shape of pathogen phylogenies. That sounds all very high-flown, but essentially it boils down to the fact that the population, the context within which the pathogen evolved, affects the shape of its phylogenetic tree. By the shape I mostly mean here the timing of the events that appear on the tree, and this can be influenced by the population dynamics through time. So here we've got a population that's increasing exponentially through time, and it gives rise to a characteristic tree shape; populations that are constant through
time have other, qualitatively different, tree shapes on average. Beyond population dynamics, it's also possible that there is some structure in the population, meaning there are subpopulations between which there is potentially limited gene flow, or some difference in growth or decay rate, and this gives rise to differently shaped tree topologies. And then, although I've said that we're ignoring it, we are of course also going to see differences in the shape of these trees due to the presence or absence of selection. So all of these things affect the shape of the tree, and we can see this if we look at trees reconstructed from pathogen sequences for different diseases. You can see qualitative differences between, for instance, influenza A virus phylogenies, which tend to have this ladder shape, and Dengue phylogenies, where the different Dengue strains appear as very distinct clades in the overall tree. Within-host HIV phylogenies have their own particular shape, which starts to look a little bit like influenza on the population level, and so on. This is just to say that there is a very strong influence of the context in which the evolution is happening on the shape of the tree, and whenever there's an effect like this, it starts to make us think that we can do inference somehow. Now, these days we tend to use a broader definition of the term, and say that phylodynamics is the study of the relationship between phylogenetic trees and the populations that produce them. With phylogenetics, the question is: what does the genetic sequence data tell us about the tree? For phylodynamics, the question is: what does the tree tell us about the population that produced it? Now, of course, population can mean a lot of different things. The original context of phylodynamics was epidemiology, but of course we could be talking
not just about pathogen populations or host populations; we might be talking about populations of animals in some ecological context, or, at a very large scale, about populations of species. And here the word population is a very strange kind of idea to apply to species, but we can still do that. Okay, so again, just as we needed a generative process, these substitution models, to connect the genetic sequence data to a phylogenetic tree, we need a tree-generating process to connect the phylogenetic tree to its encompassing population. There are two main families of tree-generating processes. The first that I'll talk about are the so-called birth-death sampling models. These assume that the tree is the result of a population evolving under a forward-in-time birth-death model with an explicit sampling process. This is a particular model where you basically have two events that can happen at any point in time: an increase by one in the population size, or a decrease by one in the population size. Then there are coalescent models, which assume, in a very general sense, that the tree is the result of a backward-in-time process. It's still connected to forward-in-time processes, but mathematically it's the result of a backward-in-time process that successively merges pairs of lineages together at rates related to an effective population size. So, this birth-death sampling model that I mentioned: this is a diagram depicting a very simple example, where this box I represents a compartment of individuals of a particular type. All of the individuals in this compartment are completely indistinguishable, and they're reproducing at rate λ, so each individual here has a fixed probability per unit time of producing new individuals. And then there's a death rate, μ, which is
again a probability per unit time, which every individual in this compartment has, of being removed from the compartment. And then this peculiar little dot here with the s represents a probability s with which every individual removed from the compartment results in a sample, i.e. a leaf on our phylogenetic tree. So this is a sampling probability. There are many different versions of this process, and some are parameterized differently, but in this case our main parameters are the birth rate λ, the total death rate μ, and this sampling probability s. If the sampling probability were one, that would mean that every death, or every removal, in the population resulted in a leaf on our phylogenetic tree, whereas if it were zero, we would never see a phylogenetic tree at all. Usually, in the cases we care about, this sampling probability is less than one but greater than zero. The other parameter of importance is the length of the process, the so-called time of origin. Now, with this kind of model, although I'm not going to describe how, we can write down the probability of a phylogenetic tree given all of these parameters, and this paves the way to using phylogenetic trees to infer those parameters: this becomes the likelihood for the parameters given the tree. Now, coalescent models, as I said, are somewhat different. I'm going to describe them very roughly in the context that they're often related to, which is the Wright-Fisher model. The Wright-Fisher model is a model from population genetics that traditionally involves a fixed population size, and this row of circles here, forget about their colours for the moment, represents all of the individuals that are alive in one generation. What we're going to do is imagine that we can look back in time and see the previous generation. Now, if we have these three sampled individuals in generation i, the present-day generation, we can ask what the probability is
that any of these sampled individuals share a parent in the previous generation. Now, this is a very easy question to answer in the context of the Wright-Fisher model, which assumes that parents are drawn uniformly at random from the whole previous generation. So if I ask what the parent of this particular individual is, it is chosen uniformly at random from the previous generation; it could be any one of these individuals, meaning that the probability that this second individual has exactly the same parent as the first individual is just going to be 1/N. So for two individuals the answer is just 1/N: the probability that two individuals share a parent in the previous generation is 1/N. For three individuals one can answer this exactly, but what we do in the context of these coalescent models is make an approximation, because eventually we're going to assume the population size is very large. So we assume that our population is very large, which lets us treat each pair of individuals independently and approximate the probability that any pair of the individuals sampled in the present share a parent in the previous generation as simply three times this pairwise coalescent probability. This is correct assuming that N is much greater than three. Now, if you follow this Wright-Fisher model back, according to the rule that each parent is chosen uniformly at random from the previous generation, we get this red tree, and we can also observe that in general, again employing this approximation, the probability that a coalescence, i.e.
a sharing of a parent, will occur is K choose 2 multiplied by 1/N, where K is the number of sampled lineages we're considering at that point in time. So we start out with three here; once we've found the common ancestor of these two individuals, K becomes two, because there are only two lineages left that we care about. So by successively going back through the generations and asking what the probability of coalescence is, we can compute the probability of seeing a tree like this red tree, given a particular population size. Now, again, I'm going to gloss over details, but what we're doing here is taking the limit as the population size becomes extremely large. There's another little detail: because we're shifting to continuous time, we need to consider the time between successive generations. If we define this to be little g, then the particular limit we take is N to infinity and g to zero, such that the product Ng remains constant. If we do that, the per-generation coalescent probabilities that we talked about on the previous slide become coalescent rates, and these give rise to exponentially distributed waiting times between coalescent events. Meaning that, under this model, if you ask what probability distribution governs the time between the present and this first merging event here, the answer is an exponential distribution with rate K choose 2 multiplied by 1/(Ng), where K here is 4. So that's an exponential distribution for this first interval; you have another exponential distribution for the second interval, and another for the third, each time just changing K to the appropriate number of lineages. By doing that, we can evaluate the probability of a phylogenetic tree under this coalescent model. So in general, with the birth-death model we have some population size that's determined by a birth-death process, and this birth-death sampling process also produces events corresponding to sampled individuals,
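These exponentially distributed waiting times are easy to simulate. Below is a small sketch of my own (with `Ne` standing in for the product Ng from the limit above): each inter-coalescent interval is drawn from an exponential distribution with rate K choose 2 times 1/Ne, and a Monte Carlo average recovers the expected time Ne/6 to the first coalescence of four lineages.

```python
import random

def coalescent_times(k, Ne, rng=random):
    """Simulate the cumulative times of coalescent events for k sampled
    lineages: while more than one lineage remains, draw an exponential
    waiting time with rate C(k,2)/Ne, then merge one pair."""
    times, t = [], 0.0
    while k > 1:
        rate = k * (k - 1) / 2.0 / Ne
        t += rng.expovariate(rate)
        times.append(t)
        k -= 1
    return times

# Average time to the first coalescence of 4 lineages; the expectation
# is Ne / C(4,2) = Ne / 6.
rng = random.Random(42)
Ne = 100.0
first = [coalescent_times(4, Ne, rng)[0] for _ in range(20000)]
avg = sum(first) / len(first)
print(avg)   # close to 100/6, about 16.7
```

Note that four lineages produce exactly three coalescent events, matching the three intervals described above.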
so these are the times at which we see our sequence data, and there's a relationship between this birth-death trajectory, this realization of the process, the sample times, and our phylogenetic tree. In the context of the coalescent model, instead of birth rates and death rates in the general sense, the parameter becomes the coalescent rate through time. Another way to think about this is the effective population size through time, which can be given a parametric form if you like, but can also be left quite free. In general, though, we're just going to write that the tree T is related to our phylodynamic parameters, be they birth and death rates or coalescent rate functions, via this particular conditional probability, which we're going to variously call the phylodynamic likelihood or the tree prior, depending on how you like to think about things. And then all that's left to do at this point is to join these concepts together and actually talk about Bayesian phylogenetic and phylodynamic inference. Given that we have a phylogenetic likelihood and we can compute it, and we have our phylodynamic likelihood, our tree prior, and we can compute that, we can combine these two functions with some priors on the parameters, which in this case are the substitution rate matrix and the parameters of the phylodynamic model, to produce this expression, which is essentially Bayes' theorem. On the left-hand side we're left with our joint phylogenetic and phylodynamic posterior: a posterior distribution over trees and all of the parameters of the models we've been talking about, conditional on our sequence alignment. So, as I said, P(A | T, Q) is the phylogenetic likelihood, we have our tree prior, and we have our parameter priors. Now, there are lots of subtleties that I'm glossing over, I'm sorry, but just here I'll mention some, because one can ask whether
this tree prior is really a prior in the usual sense does it depend on the data the way I've written it here it seems to not depend on the data but then consider the sorts of trees that we've been thinking about if we go back here and think about this tree here this red tree the red tree is a collection of edge lengths and a collection of node times and among the node times are also the times of these leaves at the end so the times of our samples and these are usually assumed to be fixed in the analysis we it is possible to estimate these as well in certain situations but generally they're assumed to be fixed which means that when we compute this object here we usually this tree does include some information that is fixed by our data meaning that this even though it's sometimes called a tree prior does include some data on the left hand side of the conditional another thing that's maybe interesting to think about is what this PA actually represents this is our normalization constant our marginal likelihood for our our base theorem here but from a practical sense in order to evaluate this you would need to expand this out into something like this top line here the numerator and integrate for every combination of tree and different parameter values which of course is extremely difficult and infeasible in general that aside some practical characteristics of the Bayesian approach include the fact that we're jointly inferring the phylogenetic tree the substitution model parameters and the birth and death rates and the birth and death rates or population sizes all together despite the fact that in many cases you may be interested in only some of these things so in the context of pathogen phylogenies you're often only interested in inferring the rates the particular phylogenetic structure that you see isn't necessarily of importance in contrast if you're looking at species trees maybe you are genuinely interested in the tree itself in which case this tree prior really 
would just be a tree prior and you're not really interested in the parameters of it okay but because we're doing this joint Bayesian approach it correctly accounts for uncertainty at all levels both in the phylogenetic tree itself and the model parameters the nice side effect of doing everything this way I'm sure you've already seen in the course this allows you to include other sources of information by way of the priors on the parameters so we could include some additional information that tells us what the population size is through time and uses this as the starting point for our analysis and that would naturally include both this information and the information from the genetic data okay and then so once you have the posterior then this includes the uncertainty all of these uncertainties in the inference results another thing that's we've talked about briefly already but something that's very clear if we go back to this factorization here this is only possible because of this neutrality assumption and the reason that is is quite clear when you consider that doing this factorization means that it's entirely possible that our data could have been of course they weren't but they could have been assuming these models are correct they could have been produced in this fashion we could have drawn some parameters from our prior distributions so drawn some substitution rates drawn some birth and death parameters etc simulated a tree using these parameters and then only after that simulated the sequence alignment that would be statistically indistinguishable from doing it all together and because we've separated out the sequence simulation here it means then the sequences themselves can have no impact directly on the shape of the street this is not to say that the sequences you observe don't have don't don't tell you things about the shape of the street of course they do just in terms of the causal relationship there's no sense in which the sequences at these internal 
edges a result in differently structured trees what's so difficult about all this it all sounds very easy of course the difficulty is integration so the fact that our base theorem has this denominator our marginal likelihood that involves this vast integration over tree space makes the whole thing impractical and we can't do this analytically we can't do this in any really obvious brute force fashion either because the size of tree space is enormous so what do we do and I think you're given what you've learned in the previous days know the answer to this question we've got to take a trip to Monte Carlo and in particular we need to go to the casino which is where the Monte Carlo methods get their names there are games of chance so in our context these Monte Carlo methods are algorithms which produce random samples of values in order to characterize a probability distribution over those values and usually these algorithms we deal with seek to produce an arbitrary number of independent samples of our parameters not just the model parameters but also the trees drawn from the joint posterior distribution the way this goes I'm not going to dwell on this because you've obviously seen it before but this algorithm produces we're going to use a particular Monte Carlo algorithm metropolis hastings which produces samples ideally from a distribution of values of X by generating a random walk over possible values of X so imagine that we've got this distribution of X we want to draw samples from it we have some starting value of X that's arbitrarily chosen and what we're going to do is propose a modified version of X so we're just going to choose a new value X prime that's somewhere within this window and what we do is just evaluate F of X prime and compare that to F of X if it's bigger we definitely make this move if we and we iterate this process and if the new value is smaller then we only make this move to this new value with a particular probability of acceptance that's 
related to the F evaluated at this new X prime divided by F evaluated at X and so this is nice because the walk explores mostly high probability areas so you're focusing on the distribution part of the parameter space that has significant support because F of X only appears in this algorithm as this ratio the normalization of F is not important so in other words because we can evaluate our posterior distribution up to some multiplicative constant we can compute this acceptance probability without having to normalize it and that's what allows us to use MCMC to tackle these much larger problems now that's MCMC in general for MCMC in the phylogenetic tree context we need to think a little bit more abstractly about what it will concretely actually about what it means to traverse this state space because our state space involves all of the possible phylogenetic trees that are out there so to do that you need to modify not just your substitution model parameters and your birth death model parameters you also need to modify your tree and modifying the tree usually involves employing a large suite of different tree moves or operators that update various aspects of the tree so there's a move that this tree on the left here this is a portion of some tree and you've got some sub tree here and we disconnect it from the rest of the tree and reattach it somewhere else randomly on the tree that's an example of such a move and there are lots of others that adjust the tree in various ways but the important thing is that through combining these moves you can create a tree that is a possible explanation for the ancestry of a particular set of sequences and provided that's the case you can use this to construct a valid MCMC algorithm I think I'm going to have to pause there because I think we have a coffee break for 15 minutes so that's just a rough idea of how MCMC works in tree space and how it works exactly the way as it works in other spaces as well okay so as you know the result 
of an MCMC algorithm, particularly the very simple univariate algorithm I used as illustration, is this so-called trace, where we see the particular variable values explored by the random walk as time progresses. If we plot a histogram or kernel density estimate of the states that are visited, we get something like this: the dashed line is the target distribution, the blue line is the kernel density estimate from the MCMC trace, and as the chain runs longer and longer, the blue line should converge to the dashed line. Of course, we're never going to run it forever, so we have to deal with imperfection, and in doing so we have to account for the fact that adjacent samples from the MCMC process are correlated. You can see that if you just look at this trace through time: for adjacent steps, the actual states don't change much at all in a relative sense, which means you need to let the chain run for a certain amount of time before you can start counting its states as in any sense independent.

The first decorrelation you have to worry about is this first period, the so-called burn-in. The first value of x, or the first tree that we pick in a phylogenetic inference, is somewhat arbitrarily chosen, and it's not chosen based on the posterior distribution that you're trying to characterize. For this reason it's important to discard this hard-to-define first period, the burn-in period, where we're essentially waiting for the MCMC algorithm to forget the starting state; only then can we start thinking of the current position in state space as being related to the posterior. And once we've taken a sample, we need to wait another amount of time to get the next sample, and so on and so forth. So this is the picture we have to keep in mind. In practice, when we're doing these phylogenetic analyses using the tools that you'll see later on in the demonstration and the tutorial, what things look like is this: we get a histogram for a particular parameter. Notice on the left-hand side here is a table containing essentially all of the different parameters, with some useful summary statistics, corresponding to a particular inference setup. There are a lot of them, but as I said, often you're only interested in a handful of parameters, and so it's possible to single out those parameters, plot the histogram of states visited by the MCMC for just that parameter, and treat it as the marginal posterior for that parameter. In this case, this is the marginal posterior of the tree height, the age of the root of the tree, in an analysis of Zika virus samples.

The tree itself is obviously a little more complicated to visualize, because it is not a one-dimensional object, but you could do worse than using tools like this. This is the so-called DensiTree output, from a piece of software designed by Remco Bouckaert, and what we're seeing here are trees sampled as part of the MCMC process, just laid on top of each other, so you get a rough idea of the amount of uncertainty involved in the phylogenetic reconstruction. The blue tree that's overlaid is essentially a point estimate, a summary of the posterior distribution, but you can see in green all these overlaid lines: particularly back toward the root of the tree, there's a lot of uncertainty about the particular ordering and timing of these merging events between the different lineages, whereas toward the leaves you can see things are a lot more certain.

Are there any questions about Bayesian inference applied to phylogenetics and phylodynamics before I get into the applications, particularly any conceptual questions? If not, then I will proceed to talk about applications, two separate applications today, both involving SARS-CoV-2, which is
not too much of a surprise: it's been on the minds of a lot of people doing this sort of work for the last few years. For the first of these, I'm going to discuss a project I was involved with, trying to infer the basic reproductive number associated with individual outbreaks of SARS-CoV-2 right at the start of the epidemic. This basic reproductive number, for those of you who have not encountered it before, represents the average number of secondary transmissions resulting from a host that finds itself in a completely naive population of other hosts. It is traditionally estimated directly from case count data, so just looking at the number of positive test cases through time and trying to use this to estimate the growth rate of these curves. But there are several reasons why this is sometimes suboptimal, not least of which is that the results can depend heavily on the way testing is done. So hopefully, with a genetics-based approach, we can avoid some of these biases.

Happily, soon after SARS-CoV-2 emerged, lots of institutions began sharing genomic sequences on the GISAID genome-sharing platform. This was very widespread, unlike any epidemic in the past, I think; the scale was certainly very different. As I said, these sequences have the potential to yield more robust estimates via phylodynamics. The process involves jointly reconstructing trees, inferring the birth and death rates of a phylodynamic model, and relating these to the basic reproductive number. For this particular analysis we're looking at really early sequence data, the data that was available before public health measures were implemented, because of course, once such measures are implemented, the whole aim of those measures is to affect this reproductive number; we want the raw value, before it was modified by these measures. You can see we've got a wide range of outbreak sizes: the largest single outbreak we consider is this one in Washington State in the USA, comprising 217 genomes, but we also have very, very small outbreaks, such as this one in Australia and this secondary outbreak in Washington State, for which we only had nine samples.

Of course, there are lots of challenges to do with tree inference early in an epidemic, or in epidemics in general. Firstly, we have limited diversity: remember that the signal we're using to reconstruct phylogenetic trees is based on sequence diversity, so we need a large number of mutations, or substitutions, to have appeared in the ancestry of a particular set of sequences in order to be able to reconstruct the tree. If all of our sequences are identical, we're not going to be able to reconstruct anything. Another problem (it's always a problem) is that sample collection is always going to be full of unknown biases. And sequencing errors at this point weren't very well characterised, so it wasn't always clear which sites to ignore in sequence alignments. All this means we have to be very careful when interpreting these results.

We analysed these sequences under the so-called birth-death skyline model. This is exactly the same model we discussed earlier, the simple birth-death model, the only difference being that we let the birth rates, death rates, and sampling rates change in a piecewise-constant manner through time. So we allow there to be time intervals during which all of these rates are constant, but the rates can differ between intervals, and it's very easy to modify the model to allow for that. As I said, we've got these birth rates and death rates, and in this case a sampling rate rather than a sampling probability. As I said earlier, there are different ways of parameterising the model, and we can make the parameterisation very epi-focused by defining a reproductive number. Here I've written Re, but in the context of this early period it should really be R0. It's related to the birth and death rates via a simple ratio: the reproductive number is just the birth rate divided by the death rate. We fix some of these parameters: we fix the total death rate so that its inverse, the average length of time that an individual host remains infectious, is 10 days, roughly corresponding to what was known about the disease at that point in time.

And this is what we see: a large number of these outbreaks had very similar basic reproductive numbers, which is sort of what we expect, ranging between 1.5 and 2.5, so somewhere around 2. There were outliers, for sure. Iceland is definitely an outlier; for this particular small Washington State outbreak we infer a very high R0; and for Wales and the Diamond Princess we also inferred quite high reproductive numbers, although for the Diamond Princess in particular I'm inclined to believe the number. In looking for an interpretation of why some of these reproductive numbers were inferred to be quite high, it's interesting to note that the sample ranges, the total lengths of time for which we have samples, were shortest for the outbreaks with a very high estimated R0: this WA State 2 outbreak and both Iceland outbreaks. This axis is in days, just to give you an idea of how short these time windows were; this is five days here, so the WA State 2 outbreak spans only about three days of sequence data, and Wales just over five or six days. Outbreaks with the largest inferred R0 have the shortest sampling windows, which suggests that biases due to the sampling process may be a problem here. Another thing we looked at was the possibility that the R0 values for different outbreaks were actually identical. To do this we placed what's known as a Dirichlet process prior on the outbreak-specific R0 values, which allowed us to infer the number of unique values, and it seems like this data really wants
each outbreak to more or less have its own R0 value.

Now, we can go beyond this as well. Although we're not really interested in the trees here, beyond the R0 parameters we can also infer the actual case numbers through time, and the way we go about doing that is to use what's known as particle filtering to impute trajectories. I'm using the word "impute" here because these trajectories, the actual population dynamics for these outbreaks, the number of infected hosts through time, are something that already exists as a concept in the model; up to this point we've simply integrated it out of our probability calculations. So all we need to do is put it back in, and we do that with particle filtering. Without introducing too much complexity: imagine we've got some tree, a tree sampled from our MCMC analysis, so at one particular step of the MCMC we have a tree and a corresponding set of parameters. Conditional on this tree and these parameters, our goal is to sample the population dynamics through time. To do that, we break the tree up into partial trees, called G1, G2 and so forth, and for each of these we simulate a birth-death process. Here is such a realization of the process, and for a given realization we can step along it and compute the probability of observing this portion of the tree, conditional on the parameters that we have and on this realization of the birth-death process. This just looks like a product of terms involving a coalescent probability, and these coalescent probabilities, which look like this, are very easy to derive and to compute. We essentially do this over and over again for lots and lots of different realizations: we compute these weights for each realization and perform sequential importance resampling in order to condition the realizations on the tree and the parameters. By doing this, we get posterior distributions over the birth-death trajectories through time.

Here are some examples from some of these outbreaks. This is the example from the Diamond Princess. The Diamond Princess analysis was a little different to the others: this is the cruise ship that you might remember from the news right back at the start of the epidemic, which was denied entry to a lot of countries because it had a massive outbreak on board. A lot is known about that particular outbreak. Namely, we know exactly when patient zero arrived on board, we know when the outbreak was discovered, and we know that quarantine happened immediately after. So in this analysis we assume that the birth rate, and hence the reproductive number, is different in the first interval, prior to the quarantine, compared to after the quarantine, and that's why we see this little inflection point here. The gray lines are all of the sampled case-number trajectories, the pink is the 95% highest posterior density interval, the orange is the median, and these little diamonds represent the recorded case numbers. You can see there's a relatively nice correspondence in this case. There are also examples where the correspondence is not great: in this example from Spain we get very poor correspondence, and also from France here. But if we just look at the final case numbers (these are all cumulative case numbers) and compare the inferred final case numbers to the actual recorded case numbers, this is what we see. Counterintuitively, a lot of the recorded case numbers are higher than what we infer, which is peculiar because, in principle, if our model is correct we should be inferring the total case numbers, regardless of whether or not cases tested positive or were actually picked up by a testing centre. Our explanation for why we infer smaller values is that the sequences we have don't reflect the true genetic diversity in each of these outbreaks, so we're actually missing significant portions of the epidemic. There's one main counter-example: for the Diamond Princess we actually infer larger numbers, although in this case I believe this is due to the fact that our model is not capable of treating the upper bound on the number of cases that can exist aboard a confined cruise ship.

So, some take-home messages from that: we can, in principle at least, infer valuable epidemiological parameters from this genetic data; these inferences are complementary to those from the more traditional line-list data, the case records; but they do still remain susceptible, at least in the way we've used them, to sampling bias through the distribution of sample times. That said, it's important to note that even breaking up the dataset into these different sub-outbreaks necessitated the use of sequence data, so we would not even have been able to do this analysis properly without the genetic sequences. Despite the fact that maybe the results are not fantastic and have high uncertainties attached to them, being able to divide up the sequences this way is only possible because we were able to build a larger phylogenetic tree and isolate clusters from it. Did anyone have any quick questions about that last application?

Okay, so then moving on: this second application still uses data from the same year, but 2020 was a very long year, and toward the end of it we had far more sequence data. Together, Swiss institutions sequenced over 11,000 SARS-CoV-2 genomes in 2020 alone, and despite Switzerland being a relatively small country, this was still the sixth-largest sequencing program globally, so
it's something to be proud of. Over the course of the same time period, contact tracing was being applied in an attempt to terminate individual transmission chains. This was a very important method used early on to try to stamp out transmission chains: some time after testing positive, you would provide information on all of your contacts for the last few days, and a dedicated team of people would track those contacts down, test them, and isolate anyone who tested positive. What we aimed to do with this huge amount of sequence data was to determine the degree to which contact tracing in Switzerland was effective at reducing the propensity for ongoing transmission.

In order to do this, and because we had so much sequence data that we were able to try to counteract some of the sampling biases, we first sub-sampled sequences, up to 5% of the confirmed cases per week in each canton, all throughout 2020. We built a maximum-likelihood tree (not a Bayesian tree, given the large number of sequences) using these sequences together with similar sequences from abroad. The goal was to break this tree up into subtrees that actually represent local transmission clusters, because even though you might have sequences just from Switzerland, if we build a phylogenetic tree we're inevitably going to have a lot of detail on that tree representing the evolution of the pathogen outside of Switzerland. If we're really interested in the course of the epidemic inside Switzerland, it's important to draw a box around those portions of this very large tree that correspond to evolution within the country, and that's what we did here. We had two subtly different approaches to this partitioning, yielding either large clusters or small clusters, and for all of the analyses we used both sets of clusters in order to check whether our results were robust to this choice.
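To make that partitioning step concrete, here is a toy sketch, not the actual pipeline used in the study, of splitting a phylogeny into local transmission clusters: maximal subtrees all of whose samples come from the focal country. The tree encoding (nested tuples, with leaves as (name, country) pairs) and the function names are hypothetical, purely for illustration.

```python
# Toy illustration: cut a phylogeny into maximal subtrees whose tips all come
# from the focal country (e.g. "CH"). Internal nodes are tuples of children;
# leaves are (name, country) pairs.

def local_clusters(node, focal="CH"):
    """Return (all_focal, clusters): whether every tip below `node` is from
    `focal`, plus the maximal all-focal subtrees found strictly below it."""
    if isinstance(node, tuple) and len(node) == 2 and isinstance(node[0], str):
        # Leaf: (name, country)
        return node[1] == focal, []
    child_results = [local_clusters(child, focal) for child in node]
    if all(ok for ok, _ in child_results):
        # Every child is all-focal: defer, the parent may still be all-focal.
        return True, []
    clusters = []
    for child, (ok, sub) in zip(node, child_results):
        if ok:
            clusters.append(child)   # maximal focal subtree: cut the tree here
        else:
            clusters.extend(sub)     # keep clusters found deeper down
    return False, clusters

def split_into_clusters(tree, focal="CH"):
    """Return the list of local transmission clusters for `focal`."""
    all_focal, clusters = local_clusters(tree, focal)
    return [tree] if all_focal else clusters
```

For example, a four-tip tree with one German sample yields two Swiss clusters: the cherry of the two Swiss tips on one side, and the single Swiss tip whose sister lineage is foreign. Real pipelines work on much larger trees and must also handle uncertain ancestral locations, which this sketch ignores.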
We also repeated these analyses with publicly available New Zealand sequences, just for comparison. Now, this is a figure from the paper that aims to describe the phylodynamic model we used, and there are a few important things to note. Firstly, this is the time period we're looking at, all of 2020, and you see that we've broken the year up into three intervals. These are intervals in which we allow the contact-tracing effectiveness to be different: one degree of effectiveness in the first interval, another for the orange, and another for the purple. The idea is that we've got a background reproductive number changing through time on a weekly basis, with smoothing priors to keep it in check; this is the reproductive number you would expect to determine the dynamics of the epidemic in the absence of contact tracing. Then, for every individual transmission cluster like this one here, we say: the first sample appearing on this tree is this one my mouse cursor is pointing to, and from two days after it, the reproductive number effective for the rest of this tree is the background reproductive number scaled by a damping factor that's specific to whichever of the time intervals the transmission chain lies in. So there are lots of variables in this model, and it's a little difficult to convey, but that's roughly what's going on.

This is the result. At the top, for comparison, are the numbers of cases through time, coded according to which interval they fall in: the first wave, then the summer period, then the autumn wave up here. Down at the bottom we see the posterior support for different transmission-rate damping factors associated with contact tracing. For the spring wave you see quite different values depending on whether you're looking at the "many introductions" (large clusters) or small clusters partitioning of the large tree: with many introductions we see effective contact tracing, while for the other set of clusters we don't see any effect, so it's very difficult to say anything about spring. For summer, both partitionings show what seems to be a robust damping associated with contact tracing, whereas for fall we see a robust lack of damping. Our explanation for this is that by that point the contact-tracing infrastructure was basically overwhelmed, and it was impossible to perform impactful contact tracing. In the comparison with New Zealand, contact tracing always seemed to work; this was really a calibration, since we expected contact tracing to be effective in New Zealand, where the epidemic was much smaller, throughout 2020 at least. So this at least tells us that our inference is basically working.

Of course, because this is joint Bayesian inference, we're also inferring the reproductive number through time. As I said, it's very noisy, and you can see it in the background: the pink here is the reproductive number inferred through the various weeks of 2020 under the model with the damping factor, and the green is without the damping factor. What you can see, generally, is that the pink is higher than the green, which is what you'd expect: with the green we're ignoring the presence of contact tracing, so the model assumes that any reduction in Re is a genuine reduction in Re, whereas for the pink there's also the possibility that it's contact tracing doing the work. There's also a comparison here with the Re estimated just from the case count data. One thing that's important here is that we used two different priors on the sampling rate, one that puts an upper bound on the fraction of individuals in the total population that have been sampled and included in our study, and one without this bound, and this really does make a big difference to the results, which is again something to potentially be concerned about.

Some take-home messages from that: the analysis provides some evidence that contact tracing was effective in reducing the duration of transmission chains in summer but not in autumn, potentially because contact tracing became overwhelmed in autumn. And the poor estimation of some jointly inferred parameters does suggest that we must be cautious in interpreting these results.

Does anyone have any questions about this last application? If not, I'm going to start to dive into a tutorial. I'm not sure how long we'll really get to work on it, but I wanted to at least point you in the direction of the first tutorial we always recommend to people getting started with these Bayesian phylogenetic and phylodynamic methods. It's based on a piece of software called BEAST2, and BEAST2 is the workhorse behind all of the results I've been presenting: it's the platform we use for phylogenetic and phylodynamic inference. The project website is beast2.org; you can download the software there, and you can access documentation, frequently asked questions, and tutorials. But for tutorials, I would really recommend that you visit the dedicated tutorial website named Taming the BEAST, which contains a fairly large list of curated tutorials covering the various packages available in BEAST2 for doing this sort of inference.

BEAST2 itself is the software that implements the Bayesian Markov chain Monte Carlo for parameter and tree inference. There's a companion program called BEAUti (as in Beauty and the Beast) which provides a GUI for setting up the input file, the analysis file you feed into BEAST2, which then just runs by itself. There's also a tool called Tracer, which is the tool for summarizing parameter posteriors, and a tool called TreeAnnotator that performs a similar role for tree posteriors, turning a tree posterior into a summary tree. There are also various tools for visualizing trees; FigTree is one that's usually recommended. The main workflow is that you take your sequence data, open up BEAUti and load it in, set up an analysis, and save it to an XML file, which is the input format for BEAST2. You then run BEAST2 on that XML file (often people run BEAST2 on a cluster, because these things can take a long time), and BEAST2 generates a parameter log file and a tree log file, both of which can be analyzed using the other tools.

So, to proceed with this tutorial, the first thing you need to do is download and install BEAST2 from beast2.org. It's not too big a download, and it is self-contained; there are versions for macOS, for GNU/Linux, and for Windows. Once you've done that, you can open the Taming the BEAST page and locate the tutorial. The website, I hope you can see it, is taming-the-beast.org, and you can either click on the link or go directly to /tutorials and find this Introduction to BEAST2 tutorial there. Now, because we don't have a lot of time, I would say focus on the gray boxes in this tutorial. We're not going to finish it, for sure, especially if you read through all of the text. I'm not saying the text is useless at all, but if you want to get a feel for the way BEAST2 works and what it's capable of, you can go a lot faster if you just focus on the main bits.
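As a small illustration of what a tool like Tracer does with that parameter log file, here is a minimal sketch of summarising one logged parameter: discard a burn-in fraction, then report the posterior mean and a 95% HPD interval. It assumes only that the log is a tab-separated table whose comment lines start with "#", which matches the BEAST2 log format; the function names are my own and not part of any BEAST2 tooling.

```python
# Minimal sketch (not Tracer itself): summarise one column of a BEAST2-style
# tab-separated parameter log after discarding burn-in.

def read_log_column(lines, column):
    """Extract one named column from tab-separated log lines as floats."""
    rows = [ln.rstrip("\n").split("\t")
            for ln in lines if ln.strip() and not ln.startswith("#")]
    header, data = rows[0], rows[1:]
    i = header.index(column)
    return [float(r[i]) for r in data]

def hpd_interval(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples (simple HPD estimate)."""
    s = sorted(samples)
    n = max(1, int(mass * len(s)))
    # Slide a window of n samples and keep the narrowest one.
    best = min(range(len(s) - n + 1), key=lambda i: s[i + n - 1] - s[i])
    return s[best], s[best + n - 1]

def summarise(lines, column, burnin_frac=0.10):
    """Posterior mean and 95% HPD for `column`, after discarding burn-in."""
    vals = read_log_column(lines, column)
    vals = vals[int(burnin_frac * len(vals)):]   # discard the burn-in period
    return sum(vals) / len(vals), hpd_interval(vals)
```

In practice you would, of course, use Tracer, which additionally reports effective sample sizes so you can judge whether the chain has run long enough; this sketch is only meant to demystify what "summarising a parameter posterior" amounts to.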