Module seven is all about the joys of phylodynamics. First, a quick introduction to myself. I'm Finley McGuire. I'm an assistant professor at Dalhousie University, jointly appointed in Computer Science and Community Health and Epidemiology despite having a microbiology background. I'm also the Pathogenomics Bioinformatics Lead for the Shared Hospital Laboratory in Toronto, which is based at Sunnybrook Health Sciences Center, a very large medical microbiology lab where we do a lot of sequencing, COVID and otherwise, and work with many of the other fine faculty here today.
Okay, so today we're going to be covering, one caveat, a lot of area, right? Phylodynamics is a very big topic. There very easily could be, and in fact there are, entire workshops dedicated to just aspects of phylodynamics. So we're going to have a high-level overview of phylodynamics, as well as a couple of different types of modeling techniques and a couple of different types of analysis: looking at temporal inference, looking at spatial trait inference, looking at some epidemiological parameter estimation, and looking at some inference of evolutionary pressures and forces, such as selection. In the lab practical, we're going to look again at some of these phylodynamic analyses. We're going to do them using a small dataset involving zoonoses in SARS-CoV-2, which several people, I noticed in the attendee list, are familiar with and, I think, were actually on the paper, looking at time estimation, ancestral state reconstruction, and selection testing.
So you want to understand the epidemiological dynamics of infectious diseases. You're all masochists and you really want to understand what is happening with the disease and how it is happening over time. And I noticed, again from some of your backgrounds in the introductions on the first day, that some of you are coming from the epidemiological world rather than the genomics and microbiology world.
So some of this will be very familiar to you. Generally, one of the go-to tools we use for modeling infections and their dynamics focuses entirely on the human side and the cases. These are compartmental models, known as SIR models in some cases, and essentially the idea is that we have the number of people in the susceptible category who could get infected. We have a parameter, beta, an infection rate that transitions people into the infectious category, and then a parameter, gamma, for how quickly people recover. Then they might be immune for a while or, in some cases, they might become susceptible again. Disclaimer: this is one of the simplest compartmental models there is, the archetypal Kermack-McKendrick model, the first that was developed. So S is the number of susceptible individuals at each time, I is the number of infectious individuals, and R is the number of recovered, deceased, or immune individuals at any given time. And then we use, depending on your background, either some scary or some very familiar and simple differential equations to model how the number of people in each of those categories changes over time, right? And again, these two key parameters, beta and gamma, keep coming up in these equations because they determine how people move between these categories.
And we can calculate those parameters, right? We can calculate those values for beta and gamma from observed cases. We can use a likelihood approach, like we dealt with in phylogenetics, where we ask: what's the probability of certain observed case dynamics, like the number of cases over the pandemic, for a given value of beta and gamma? A simple example would be: okay, you toss a coin 20 times, you get 20 heads, and you can ask yourself, what is the probability of 20 heads given it's a fair coin?
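As a rough illustration of the two ideas just described, here is a minimal Python sketch. It assumes the frequency-dependent form of the SIR equations (dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I), simple Euler integration, and made-up parameter values; none of those choices come from the lecture slides. The second function computes the coin-toss likelihood from the example.

```python
from math import comb

def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Euler integration of a basic SIR model (assumed frequency-dependent form)."""
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r
    trajectory = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_infections = beta * s * i / n * dt   # S -> I, governed by beta
        new_recoveries = gamma * i * dt          # I -> R, governed by gamma
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((k * dt, s, i, r))
    return trajectory

def binomial_likelihood(heads, tosses, p_heads):
    """Probability of the observed number of heads given a value of p_heads."""
    return comb(tosses, heads) * p_heads**heads * (1 - p_heads)**(tosses - heads)

# Hypothetical values, chosen only for illustration.
trajectory = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, days=160)
peak_time, _, peak_infectious, _ = max(trajectory, key=lambda row: row[2])
print(f"Infectious peak of about {peak_infectious:.0f} people near day {peak_time:.0f}")

# Likelihood of 20 heads in 20 tosses if the coin is fair.
fair_coin_likelihood = binomial_likelihood(20, 20, 0.5)
print(f"P(20 heads | fair coin) = {fair_coin_likelihood:.2e}")
```

Fitting beta and gamma works the same way in principle: swap the coin model for the SIR model and ask which parameter values make the observed case counts most probable.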
Probably pretty low, right? So that's a likelihood. That's the entire basis of likelihood work. So we can use likelihood approaches to fit these values to our case count data, right?
So if we can do all that from our case count data, if we can build these classic epidemiological models with which we model the dynamics of an infectious outbreak or a pandemic, why do we need genomic data? Why is genomic data useful? Why is genomic epidemiology a thing? Which I hope some of you are starting to understand by this point in the course. Well, the great thing about genomes is they can be used to infer unobserved events, right? Especially early in an epidemic, early in an outbreak, we might not actually have well-controlled case data. I mean, look at the discussion about the origin of SARS-CoV-2. Beyond the larger political debates and conspiracy theories and all that kind of mess, one of the challenges is we just don't have a lot of data about that early outbreak. We have a lot more data than we would have had for many outbreaks in the past, but we still have a limited amount of data. How do we work out things about that unobserved part of an outbreak? When did it start? What was the basic reproduction number, so the ratio of how many secondary cases are caused by each case, which is related to those beta and gamma parameters I showed in that SIR model? How do we do that when we weren't collecting case data at the time? And the answer is we can actually use genomic data to fill in the things we didn't see. We can infer what those previous states were, these historical things that we didn't directly observe, by looking at the relationships between the genomes and using some of the modeling methods we're going to talk about today.
The other thing genomes are really great for is they tell us more explicitly who infected whom, right? And they tell us the population structure of things. So on the right here we have a transmission network, right? These red dots are when an infection starts. So this person here got infected, they're infected, they're infected, they're infected, then they're no longer infected. But meanwhile, they infected two other people, right? This one infected this person, this one infected this person, and then that person's infected for a while and they can infect other people, and so on and so forth, right? And so we can use epidemiological data, we can use the classic epi approaches, case linkage, all that kind of stuff, and try to infer this transmission network based on, okay, there are two people that were in the same space at the same time and both got sick, so one probably infected the other. That's a reasonable, classic epi approach. But with genomic data, we can often directly investigate that. We can test that theory and see whether this person likely infected another person or not. We can tell more about the population structure. Again, we can see the cases, and we can learn something about the cases we didn't see and didn't sample by using genomic data.
The other thing is the case information itself doesn't really tell us much about the pathogen's evolution. With any infectious disease, as we've heard today, we heard about AMR, we saw the evolution of resistance, we've talked about phylogenetics, evolution is a huge component of infectious disease. It's not what makes it special, but it's one of the features that is important to investigate and important to pull out, because it leads to things changing and moving about over time. And this evolution happens at lots of different scales, which adds a lot of complexity.
You know, within the multiple copies of a given genome within a single cell, we're gonna have different mutations present, different mutations occurring, different sites under selection. We're gonna have evolution happening within organs in the body, within the host as a whole, between hosts, or even, in the case of zoonoses, between different species. So all of that's gonna lead to these differences in dynamics and different changes over time, right? And without genomes we can't access that evolutionary information, right? Genomes are the substrate of that evolution. So unless we're looking at the genomes, we have no way of really analyzing and inspecting that in any detail. Sure, we might see something like, oh, something's changed about the pathogen, there are a lot more cases all of a sudden, but we don't know why, and we don't know whether maybe that was just a super-spreader event, right? So unless we're getting the genome data and using some of the methods that we're talking about in this workshop, we're not gonna be able to access that evolution.
Okay, so I've made the case that epidemiology is a useful thing, that learning about disease dynamics is a useful thing. We can learn something from cases, but genomes can give us much richer information: things we don't see, as well as more complex bits of the relationships between pathogens and how they're changing over time. So how do we actually link genomes to epidemiology? As Fiona, Dr. Brinkman, showed us in module two, we can infer a phylogeny from genomic data. So say we've got six genomes here and these dots represent, say, SNPs. So C has this pink mutation and is the only one with this pink mutation, all of them have this light green mutation here, A and B have this purple mutation, some have an orange mutation, and so on and so forth.
We can use the pattern of the mutations, the sharing of them between samples, and a variety of different approaches, maximum likelihood as talked about in that module, as well as parsimony, distance measures, and so on, to infer that tree. So here's the simple parsimony approach, the simplest explanation being: okay, they all share this green mutation, so that probably happened earlier on, right? So it doesn't really let us split any of these apart into a branching tree structure. Okay, but here, D doesn't have any more mutations, so we branch that off. And here we have A, B and C in purple, and E and F. They share different sets of mutations, so they probably form separate branches. E and F we can split again, because F has this extra mutation that E doesn't have. A, B and C we might split again based on these orange and pink mutations. And so we can use that pattern of mutations to build phylogenetic trees that show us the evolutionary relationships between the samples.
I'm gonna say this is a simple parsimony example. Maybe some of these mutations actually occur lots of times. Maybe this green mutation actually occurred independently on every single one of these branches. You might have other information that might support that. So this is not a guaranteed way, but the general principle is you can use that pattern of mutations to generate phylogenetic trees, ideally with some degree of modeling: dealing with probability, dealing with uncertainty, dealing with prior information about how common certain mutations are, how commonly mutations are lost again, whether you see reversion. But what does this tree actually represent? When you see a phylogenetic tree, what does it actually represent? And especially in the context of infectious disease, this is a key principle. Dr.
Taboda kind of talked about this a bit with the whirlwind figures and stuff like that with the subtyping. So an individual phylogenetic tree represents a sampling from an underlying process. In the case of infectious diseases, quite often the underlying process we're interested in is that underlying transmission network. So again, this is the pattern where all these dots are a person becoming infected, or an animal or whatever, and these vertical lines are transmission, when they're infecting someone else, right? And so what we have when you have those A, B, C, D, E and F is samples we've taken from this process. So the light blue here, those are the genomes. That's where we've taken a swab from somebody and we've sequenced it and we've done all the work that Dr. Simpson talked about: variant calling, mapping to the reference, getting a genome out, getting a consensus sequence out at the end. And so we have a sampling of this process. We have these light blue dots. Those are the bits we're observing, and the phylogeny is essentially a blurry view back onto that transmission network. So here's the transmission network with our samples, and here's just the samples. Here's just the parts we see directly. Okay, yeah, there are a few lines that don't really look like a normal phylogeny, but if we just tidy that up a little bit and straighten everything, look, we have a phylogeny, right? So a phylogeny is just a sampling of this underlying process.
So the next question is, what determines the shape of this underlying process? What determines that transmission network? The answer is there are many different forces that determine the shape of that underlying transmission network, which we're then getting a blurry view of with our phylogeny.
So this is a short interactive component: if anyone wants to turn on their microphone and just shout out some examples of things that might determine the shape of an underlying transmission network. Mutation. Yeah, mutation rate could be one of them. Latency period. Yeah, exactly. So the type of virus we're dealing with and the kind of disease. Yeah, exactly. So long-term infections, short-term infections, something that causes a very immediate severe disease versus a long asymptomatic disease, 100%, yeah. Interventions, like public health interventions? Extremely, yeah, it's gonna have a huge impact, and actually there's a really, really nice seminar done in Switzerland, I think from Tanya Sadler's group, showing how you can use some of the methods I'm gonna talk about today to actually evaluate public health interventions and how well they worked. Anything else? Behavior and contact networks? Yeah, both in terms of animals and in terms of humans, and all those aspects of human population structure. How immunologically similar is everyone in a population? Getting access to a naive population causes a big burst of infection, as we saw when China lifted some of their restrictions. We had a huge number of cases. And that's gonna be reflected in that underlying transmission network and the shape of it. Immune escape mutations? Yeah, massively, right? So again, similar thing: those mutations are gonna change the dynamics of the disease. And again, that's gonna be reflected in the shape of the transmission network. So I'll say keep thinking of things. You were on. I just muted myself. Yeah, sorry. Habit after speaking. So yeah, here's just a subset of those examples, some of which you mentioned.
Generation time, again, relates to the pathogen and how quickly new copies are made. Population structure, vaccination, host migration, people moving around. When we saw lockdowns nationally, we actually saw an interesting thing that we hadn't seen in infectious disease for a long time, which was structured populations based on individual countries' geography. Generally, movement and flights and stuff like that have led to more mixing globally in that sense. And all that evolution aspect. So the underlying process is a combination of ecological factors, epidemiological factors, and evolutionary factors, both in the host and the pathogen, though we're mostly gonna be focusing on the pathogen here.
So all phylodynamics is, is learning about this process from the shape of the phylogeny. We've got the underlying process, we're sampling from it, and we're building a phylogeny from that, which gives us a blurry view back onto that original process. So by looking at the shape of the phylogeny and modeling things based on that shape, we can start to unpick some of the mess of that underlying process. So phylodynamics is trying to go back from the phylogeny to work out aspects of that underlying process and, to some degree, vice versa.
So let's start with the simple part of phylodynamics, the simple idea and the simple analysis, which you'll see is not necessarily simple: trying to reconstruct a transmission network from the phylogeny. So not dealing with any of those complex epidemiological and evolutionary features, but literally just going from the phylogeny back to that underlying transmission network. And the problem is that this is complicated by the fact that transmission biologically represents a sampling of a sampling, right?
You have diversity of the pathogen within your body. With SARS-CoV-2 there's relatively little intra-host diversity compared to other pathogens, certainly in short-term infections as opposed to long-term infections. But still, if you cough on somebody when you have SARS-CoV-2, which ideally you're not, because you're being sensible and masking when you're ill and avoiding contact, they're going to get a sub-sampling of the internal diversity of your SARS-CoV-2, right? And depending on when you infect somebody, this person here may get infected by a different subset of the virus in this person than this person. And here's another view on that. At the start of an infection we see a homogeneous population, because a small number of viruses caused it, and we get diversity appearing over time during that infection. Then, depending on when someone infects others, they might transmit a different subset of that diversity, or they might even transmit a mixed infection. You might transmit more than one variant at a time. So transmission is a sample. So when we're looking at this transmission network from a phylogeny, really it's a sampling of a sampling. That adds a lot of uncertainty to these inferences.
And so we end up with a situation where the same tree could actually be explained by five different transmission scenarios. So here we have A infecting B infecting C. We have A as the base, as this ancestral state, this then infecting B, then infecting C. Whereas over here, we have, say, A infecting B and C. So A is this ancestral state here and here, infecting B and C. Here we have the opposite: we have B as the ancestral state infecting A and C. So the same phylogeny can be consistent with multiple different transmission networks.
But some of those are gonna be more likely, more probable, than others, right? We can build in some of the things we understand about intra-host diversity. We can build in what we know about how infectious a given pathogen is. But we need a framework in order to handle all of that uncertainty and incorporate it into our estimates when we're trying to do this reconstruction. And the solution to that is something called probabilistic inference, using probabilistic methods like the maximum likelihood approach we talked about in the phylogeny module. A lot of phylodynamics tends to be based on a very useful probabilistic inference framework known as Bayesian inference: Bayesian phylodynamics. And again, whole modules, courses, graduate courses could be done on just Bayesian modeling of phylogenies by itself, let alone the phylodynamics aspect. So this is just a very high-level idea of what is going on here.
So in Bayesian inference, we have our data. That could be our A, T, G, Cs, our alignments. It could also be some metadata: who the host was, was it a human, was it a deer, was it a bird, right? It could be location, all those kinds of bits of data. And then we're developing a model for that data. That model is made up of multiple parts. The tree, the shape of the tree, is part of that model. Our epidemiological model, so something explaining the demographic change over time, an SIR model, et cetera, can be part of that model. We'd have an evolutionary model: we know certain mutations are more common than other mutations. We might have a temporal model, right? We might know how quickly mutations happen over time. Does that change over time? We're gonna talk about that a little bit more later on. And the nice thing about this framework is we can fit anything else we want into it as part of the model.
We can fit in spatial aspects, how things move and migrate over space. We can build in immune function, right? We can know that infection with this is more likely to protect against this. So we can incorporate some kind of antigenic cartography directly into these models. And the whole idea of Bayesian inference is being able to calculate the probability of a model we've specified given the data we have, right? The way we do this is using something called Bayes' theorem, where we essentially have our likelihood, which is the exact same idea as the likelihood from earlier: it's the probability of the data given the model we specified. We also have some prior probabilities, some prior information we might build in. We know certain tree shapes are more likely than others. We know certain demographic models are more likely than others. We have lots of data about which mutations occur more often than others. So we might incorporate some of that prior information into our model. We can also have an uninformative prior. We can have a uniform prior, right? We could do this in a way where we're not incorporating lots of technical detail. We might just say we want to be mostly led by the data, but we're going to be less effective that way. And then the denominator here is a nightmare term where we're trying to sum the probability of the data over all possible models. We can't actually calculate this with most real datasets. So what we do is a lot of tricks where we look at a ratio and cancel this term out. And the way we actually get these values, what's called a posterior probability distribution, is again a whole area in itself. Generally we use a sampling approach. We use Markov chain Monte Carlo, but we won't go into the depths of that. I'll just say here, this is the likelihood we were talking about the other day.
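To make the "look at a ratio and cancel the denominator" trick concrete, here is a minimal Metropolis sampler in Python for the earlier coin example (20 heads in 20 tosses) with a uniform prior on the probability of heads. This is only a toy sketch with assumed settings (step size, chain length, seed); real phylodynamic software samples trees and many parameters jointly, which this does not attempt.

```python
import math
import random

def log_posterior_unnormalized(p_heads, heads, tosses):
    """log(prior * likelihood); the denominator P(data) is deliberately left out."""
    if not 0.0 < p_heads < 1.0:
        return -math.inf  # uniform prior on (0, 1): zero probability outside
    return heads * math.log(p_heads) + (tosses - heads) * math.log(1.0 - p_heads)

def metropolis(heads, tosses, n_samples=20000, step=0.1, seed=1):
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_posterior_unnormalized(current, heads, tosses)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior_unnormalized(proposal, heads, tosses)
        # Acceptance uses the posterior ratio, so P(data) cancels out.
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

samples = metropolis(heads=20, tosses=20)
kept = samples[2000:]  # discard early samples as burn-in (assumed length)
posterior_mean = sum(kept) / len(kept)
print(f"Posterior mean of P(heads): {posterior_mean:.3f}")
```

With a uniform prior the exact posterior here is a Beta(21, 1) distribution with mean 21/22, so the sampled mean can be checked against a known answer, which is rarely possible with real phylodynamic models.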
And the disclaimer is that for the actual lab today, we're going to be focusing on the subset of these phylodynamic methods that use these likelihood frameworks by themselves rather than a full Bayesian framework. The Bayesian methods are great. They're incredibly powerful and incredibly flexible, but unfortunately they're a little bit unwieldy for the logistics of everyone running this, because they involve that random sampling and exploring of the space. It'd be a bit challenging to run in a lab of this size. But if you're interested in this area, there are some links in the lab to other resources. In particular, the Taming the BEAST workshops are a whole workshop, like CVW, about using one of the main Bayesian modeling workflows for this kind of genomic epidemiology and phylogenetics.
So now we've had a very quick introduction to the high-level idea of phylodynamics. Let's look at a couple of specific analyses. Let's look at a couple of the ways we can use that tree to learn about the underlying process. And just remember, in the back of your head, that when we're doing this we're working in a probabilistic framework to deal with all the uncertainty involved in that blurry view we're getting of the underlying network.
So one example of a problem that we're going to talk about in the lab is zoonoses: knowing when zoonoses happen is key to trying to reduce them. A zoonosis is when we have a pathogen in another species, or even in humans, that spills over to a different species, right? So in this case, SARS-CoV-2 is spilling over to humans, with lots of within-species transmission in humans. And then we've actually seen spillover from humans to other animals. In some cases these are dead-end hosts where there are no ongoing infections. In other cases we see a stable infection. It infects a new creature, and then we see signs of within-species transmission and the establishment of a new reservoir.
This is a nasty problem because it can spill back into humans. So knowing when these zoonotic events are occurring, and knowing where, which is the next step we're going to talk about, might help us identify interventions that reduce these from occurring. But to work out when something's happened, we need to be able to convert our tree, which is based on distance. These are substitutions per site, so just the number of mutations normalized by the size of the genome. We need to be able to convert this distance tree to a time tree, right? We want all these dots to be positioned based on time, and we want these branches to be based on how much time has passed rather than how many changes there have been. And one of the ways we can do that is to try to estimate the mutation rate, right? How quickly do mutations happen? We've got the time associated with all these dots, right? We've sampled them, so we know when we sampled that genome. So we've got times for the tips, but for these internal branches we don't have any information. We've got to infer them, right? So to do that, we need to work out the mutation rate. How quickly do mutations occur over time? Because if the mutation rate is very slow and we see a lot of mutations, it's going to have been a very, very long time. Whereas if the mutation rate is very high, so we see a lot of mutations in a short time, then even branches that have a lot of mutations on them are going to be relatively short, right?
So one of the ways we do this is something called root-to-tip regression. The idea here is we're taking the distance between all of these points and the root of the tree, all the way back here, versus the time when they were sampled, and we're making a scatterplot of them, right? So this is that distance from the point all the way back to the beginning of the tree. And this is the sampling time.
And we can see we get a scatterplot, and we can do a simple linear regression here. We can draw a line through this and look at the gradient of that line to work out what that mutation rate is, right? So we see here it's 0.0041 per year. That lets us infer the mutation rate on this tree in these samples. And then what we can do is estimate how long these branches should be based on the times that all these individuals were sampled and that mutation rate, right? Again, with a high mutation rate, we're gonna have lots of very short branches potentially, even when they have a lot of changes. With a very slow mutation rate, the branches that previously were very long, that have lots of mutations on them, are gonna be potentially very, very long, right? And so what this lets us do is take this distance tree and convert it to a time tree. The overall shape of the tree, the topology of the tree, stays the same, right? We're not changing the topology of the tree, but we're changing the branch lengths to be time calibrated instead of distance calibrated. And then when we want to work out when an event occurred in the tree, we can just look at the x-axis here, right? And we could say, okay, so this group here split off around late 2011 or early 2012, right? So we can get an estimate of when these internal nodes that we can't observe actually happened. So say this represents spillover into humans. Say all these are animal viruses and these are human viruses, or bacteria. We can estimate, okay, when did this actually happen? And a lot of these analyses have been done, again, for the origin of SARS-CoV-2, to try to work out when those initial spillovers occurred. We can fit much more complex models; we can do this in a much more sophisticated way. Mutation rates can vary over time. So we might want to incorporate that.
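The root-to-tip regression described above can be sketched in a few lines of Python. The sampling dates and distances below are hypothetical, constructed to lie exactly on a line with the 0.0041 per year slope quoted from the slide and an assumed root date of 2008.0; real data would scatter around the line, and dedicated tools also choose the root position, which this sketch takes as given.

```python
def root_to_tip_regression(sampling_dates, root_to_tip_distances):
    """Ordinary least squares of root-to-tip distance against sampling date."""
    n = len(sampling_dates)
    mean_t = sum(sampling_dates) / n
    mean_d = sum(root_to_tip_distances) / n
    covariance = sum((t - mean_t) * (d - mean_d)
                     for t, d in zip(sampling_dates, root_to_tip_distances))
    variance = sum((t - mean_t) ** 2 for t in sampling_dates)
    rate = covariance / variance        # gradient: substitutions/site/year
    intercept = mean_d - rate * mean_t
    root_date = -intercept / rate       # date at which the distance is zero
    return rate, root_date

# Hypothetical tips: decimal sampling dates and distances to the root.
dates = [2010.5, 2011.0, 2012.0, 2013.5, 2014.0]
distances = [0.0041 * (t - 2008.0) for t in dates]

rate, root_date = root_to_tip_regression(dates, distances)
print(f"Estimated rate: {rate:.4f} substitutions/site/year")
print(f"Estimated root date: {root_date:.1f}")

# Converting a branch from distance to time: divide by the rate.
branch_length = 0.0041                  # substitutions per site (hypothetical)
branch_years = branch_length / rate
print(f"A branch of {branch_length} subs/site spans about {branch_years:.1f} years")
```

The same division by the rate is what rescales every branch when a distance tree is turned into a time tree under a single, constant rate.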
You know, say it spills over and this lineage evolves a higher mutation rate; maybe we want to calibrate that on the tree. We want to include that in our model. So we can build those more sophisticated models quite easily using these standard probabilistic frameworks.
Here is an example of a time tree letting us estimate the timing of an unobserved event, which we're gonna use in the lab. So here in yellow we see human infections, and in orange we see SARS-CoV-2 from deer. So we can see here, it looks like the ancestor was human. Something happened here. There was a whole bunch of deer infections, and there seems to be a human infection nested within those deer infections. So this might represent a transmission from human to deer to human. There's a lot of unsampled evolution here, a lot of things with uncertainty. And the nice thing about using the time tree approaches is we can estimate when these things occurred, right? So this branch is approximately a year long. There was a whole year of unobserved infections happening somewhere. Could have been humans, could have been wildlife. We don't know; we weren't observing them. And then when did this spill over? When is the ancestor of this? We're gonna dig into this data in more detail in the lab.
Another example is we might also want to know where something happened, right? So that's when something happened, but maybe we want to know where something happened. An example of this might be that we're trying to trace the source of an outbreak. So say we have a spillover, sorry, we have an outbreak, and I'm sorry, I was distracted by Emma mocking the virus naming standardization. So we might be interested in an outbreak in a hospital, right? And say we've got our tree, we're back to our simple ABCDEF tree, and we can use metadata. We can say, where did these samples get collected?
So D was collected in the community, A, B and C were collected in hospitals, and E and F were collected in the community. And so we have a trait, right? We have a piece of information on the tips of the tree again, just like before where we had the dates on the tips of the tree. And we want to look back in the tree and work out where in the tree it is most likely, highest posterior probability in the Bayesian framework or maximum likelihood in the likelihood framework, that transmission from community to hospital occurred, where we saw that transition in the location trait, right? So with this trait here, these are both in the hospital. So here is probably a hospital sample. It's pretty unlikely that we were in the community here and there were two separate introductions into hospital. It looks more like this. Similarly here, all three of these are hospital samples. So likely the ancestral state back here in the tree is going to be hospital, right? And when we see a single grouping like this, that suggests a single source. That might be a useful way for us to identify, okay, there was a single source and, using the time inference, it was on this date. So say these were a series of healthcare workers. Maybe we want to check for asymptomatic carriage, right? Maybe they're carrying the disease and shedding and they're not sick. We might want to check them. And infection prevention and control investigations are increasingly using these methods. Those are the teams in hospitals that will be trying to track the sources of, particularly, nosocomial, hospital-acquired infections. Things like MRSA, Candida auris, all that kind of fun.
Here's another example of why we'd like to do that reconstruction. So say we have two hospital samples up here and a hospital sample back here.
You know, yeah, this could be a case where there was an ancestral infection, there's one transmission to the hospital, and then people left hospital and caused infections in the community. That's a possible scenario. But again, with an ideal method, probably the most likely scenario you would expect based on a simple tree like this would be that there were two separate introductions into the hospital from the community. From the circulating diversity in the community, there have been two separate transmissions into the hospital. So there's hospital one, hospital two. So these two dots here. So we want to be able to infer those internal ancestral states from the observed tips. I see your question, Sydney, it's a great question. I can get to it at the end unless a TA jumps on it first. So we might want to do that inference of those internal states. So this could be the hospital-based state, and these could be physically in the community. And so we're interested in the transitions between these states. So we might try and reconstruct on the tree the series of changes that have happened along the internal branches. And the way we do this is we usually develop some form of continuous-time Markov chain, or some kind of probabilistic inference like that, where we have some degree of understanding of what the rates of the different changes are. You know, is it far more likely for someone from the community to enter the hospital, or for someone in the hospital to leave and cause an infection in the community? Which is the more likely pattern? It depends on the particular trait we're interested in, and on the data we have.
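To make the continuous-time Markov chain idea concrete, here is a minimal sketch for the two-state community/hospital trait. The function name and the example rates are hypothetical, chosen only for illustration; the formula is the standard closed-form solution for a two-state chain, not the specific model used in the lab.

```python
import math


def two_state_transition_probs(rate_c_to_h, rate_h_to_c, branch_length):
    """Transition probabilities along a branch for a two-state CTMC.

    States are 'C' (community) and 'H' (hospital). The two rates are
    hypothetical inputs: how fast a lineage moves community -> hospital
    and hospital -> community, in events per unit of branch length.
    """
    total_rate = rate_c_to_h + rate_h_to_c
    # Long-run (stationary) share of time spent in each state.
    stationary_h = rate_c_to_h / total_rate
    stationary_c = rate_h_to_c / total_rate
    # How much of the starting state is still "remembered" after the branch.
    memory = math.exp(-total_rate * branch_length)
    return {
        ("C", "C"): stationary_c + stationary_h * memory,
        ("C", "H"): stationary_h * (1.0 - memory),
        ("H", "C"): stationary_c * (1.0 - memory),
        ("H", "H"): stationary_h + stationary_c * memory,
    }


# Illustrative rates only: entering hospital assumed rarer than leaving it.
example = two_state_transition_probs(rate_c_to_h=0.5, rate_h_to_c=2.0, branch_length=0.1)
```

On a short branch the trait almost always stays where it started, and on a very long branch the starting state is forgotten. That is why both the rates and the branch lengths matter when inferring internal states.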
And we're going to use these transition probabilities, these rates that we're either inferring from the tree, or tuning based on prior information we have, or both, to try and infer these internal states in the tree. So here, it's most probable that this internal node was physically within the hospital, right? This probably represents a genome we didn't sample that would have been within the hospital. Whereas back here, it's more likely to be in the community. There are lots of different approaches, again, that we can use for this internal inference of ancestral states. We can use a parsimony approach. You know, Dollo parsimony is an example used a lot for certain genomic things where, if something happens, it's fairly unlikely it reverts, right? So once a mutation or gene fusion happens, say back here, it's fairly unlikely it's going to split apart again. Once you get transmission to a hospital, you know, maybe it's unlikely you're going to get people moving back to the community, or moving to a particular part of the community, long-term care or something like that. We might use maximum likelihood approaches. Again, we move back in the tree, same idea, and that's what we're going to use in the lab. Or we might use these big Bayesian modeling approaches, right? So again, I know this is fairly light on details, and that's mostly just because of the amount of time we have to dedicate to covering a large area. But the key idea to take away from this is we can use information about the tips of the tree, so states or characters or traits, the metadata we have associated with those tips, especially if it's all nice and organized metadata following things like Emma's module. And we can use probabilistic frameworks to infer what the status of that character, that trait, was earlier on in the tree, right? Where we weren't able to observe it.
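As a rough illustration of inferring internal states from tip states, the sketch below uses Fitch parsimony, the simplest unordered parsimony method. It is swapped in here for brevity: it is not Dollo parsimony and not the maximum likelihood method used in the lab. The tree topology is a hypothetical arrangement of the six tips A to F, since the lecture's tree shape is only shown on the slide; the tip locations follow the lecture.

```python
def fitch(node, tip_states):
    """Fitch parsimony on a rooted binary tree written as nested tuples.

    Returns (possible_states_at_this_node, minimum_number_of_state_changes).
    A tip is a string naming the sample; an internal node is a (left, right) tuple.
    """
    if isinstance(node, str):
        return {tip_states[node]}, 0
    left_states, left_changes = fitch(node[0], tip_states)
    right_states, right_changes = fitch(node[1], tip_states)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no change is needed here.
        return shared, left_changes + right_changes
    # The children disagree: keep both options and count one change.
    return left_states | right_states, left_changes + right_changes + 1


# Hypothetical topology; tip locations as described in the lecture.
tree = ((("A", "B"), "C"), ("D", ("E", "F")))
tip_states = {
    "A": "hospital", "B": "hospital", "C": "hospital",
    "D": "community", "E": "community", "F": "community",
}

abc_states, _ = fitch(tree[0], tip_states)       # ancestor of A, B and C
root_states, n_changes = fitch(tree, tip_states)
```

With this assumed topology the ancestor of A, B and C is reconstructed as hospital and the whole tree needs only one change of state, which matches the single-introduction reading in the lecture. The root itself stays ambiguous under parsimony alone, which is one reason the rate-based likelihood and Bayesian approaches are attractive.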
So there's plenty of examples of this being used. So this is a really nice example, one of the big early Zika papers. And it basically is trying to trace back where Zika originated. Where did the spillover causing disease first occur for Zika? They did spatial trait reconstruction here, moving all the way back up the tree, and found that it most likely originated in Northeast Brazil, right? So they can reconstruct those internal states in the tree. With these kinds of models, especially these geographic models, you might often incorporate some understanding of geography, right? When we're parameterizing these transition matrices and the priors on them, maybe we're going to incorporate geographic distance into that, right? Or how commonly people take flights between these two countries, right? We can use that migratory, that demographic information to parameterize how likely some of these changes are. You know, far more people, say, come to Canada from Mauritania than go to Mauritania from Canada, right? So it's more likely people will move from Mauritania to Canada than people from Canada will go to Mauritania. So we can use some of these things to parameterize these models. And that's exactly what this Zika phylogeographic model does, and it's spatiotemporal, as you have time inference as well. And this is really a nice gold standard example of this kind of analysis and this kind of reconstruction. The other thing is, as I alluded to a little bit when I talked about gene fusion, you don't have to just use geography as the trait on these tips, right? You can use exactly the same approach to infer, say, mutations, and when mutations occurred within the tree. So we have something associated with all these tips. We have, okay, all of these have the A mutation, but all of these have the B mutation.
And we can reconstruct these internal nodes, again, that we didn't observe directly. We can reconstruct, you know, what was the most likely mutation at these internal nodes? And so we can find the first one above our threshold criteria and try and infer when that mutation first occurred in the tree. So when did that mutation occur? Maybe that's the evolution of immune evasion. Maybe that's the evolution of a higher transmission rate. Maybe it's jumping and adapting to a new host, right? So we can infer ancestral traits on the tree using the same idea of ancestral state reconstruction.
Okay. So that's kind of the spatial and temporal aspects. And again, very high level. Check out that Taming the BEAST workshop for far more detail on this. What about, you know, I was talking about epidemiological parameters at the beginning, I was talking about epidemiology. How do we access some of those classic epidemiological modeling parameters using our tree? So again, the idea here is that one of the aspects when we're looking at number of cases, and often a denominator in all of those models, is essentially the pathogen population size, the effective population size of the pathogen, or its structure. And so is there a way we can look at the genomes, then look at the phylogeny, to try to infer how much pathogen there was at a given time? This gives us an idea of, you know, how many of the actual cases did we observe? How much asymptomatic carriage was there that we just never actually managed to identify? And one of the really nice properties is that the shape of a given tree actually relates to the population size and the underlying structure. So you see here, for a relatively small population size, we'll probably see this kind of series of, essentially, founder effects, right? We're seeing, you know, there's a little bit of diversity, and then these form the basis of this next bit of diversity, and then they form the basis of this next bit of diversity.
Whereas, on the other hand, if we have a very large population size, we see a more drift-like pattern over time, right? You know, it's a very large population. There'll be a bit of selection going on in there, and selection will change the shape even further, and it'll be another layer we add on. But when we have a very large population size, the tree will kind of form this gradual expansion and dribble, dribble down, because the structure of the population is not imposing a bunch of constraints on the topology. So again, you don't necessarily need to remember the exact patterns we see here, but the general idea is that the shape of the tree relates to population size. Why is that the case? What is that actual relationship? So one of the ways we can access that, especially in this modeling world of phylodynamics, is using things called coalescent processes. So say we've got two trees here. So this is what the tree looks like when there's a constant population size over time, when the viral population is not getting any bigger over time. And the red bits are the genomes we sampled, right? Whereas here, we have what the shape of the tree would look like if we had a growing population. So the number of pathogen genomes is increasing over time, the number of pathogens out there, and the amount of diversity in the pathogens is increasing over time. And we use something called a coalescent process to try to model that. So this looks like a big complex scary figure, but really all it is, is basically we say, we have these genomes, these red genomes we sampled, and these dots represent the population of the virus, right? So at this given time, this is this very small population we're showing, this is 20-odd, something like that. I can't count quite that quickly while presenting.
And we've sampled these three genomes from that population. So we can move backwards in the tree and go, okay, what's the probability that each pair of these is gonna share a common ancestor in each generation? And the probability that any two genomes are gonna share a common ancestor is based on the number of possible ancestors, right? So here these two, say we have a kind of random walk back over time, and we can see, by this point, they're likely to coalesce. They're relatively close genomes, so at this population size, they're gonna have a coalescent point in the past here. And that's gonna be that internal node on that tree, essentially. And so how quickly we see that coalescence, how quickly they are likely to share a common ancestor, is based on how wide this population is, how big the population size is at each given time point. So each of these time points is representing a simplifying assumption of distinct, non-overlapping generations. So here's generation one, and all the viruses very cleanly have a second generation, here's the second generation, and so on and so forth. The reality doesn't look like this, but it's an assumption we're gonna deal with for now. And it makes our maths a lot simpler for these things. So the coalescent process is literally just based on one over N, right? How likely any two genomes are to coalesce is based on how big the population size is at that time. So when we see a population size like this, we can see the coalescent pattern is gonna be different, right? Because the population size is decreasing, the likelihood they're gonna coalesce actually increases. The probability increases because it's the inverse of the population size. So we see a shorter tree, with stubbier branches.
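The one-over-N idea can be checked with a small simulation. This is a minimal sketch under the same simplifying assumption of distinct, non-overlapping generations and a constant population size; the function names, replicate count, and seed are hypothetical choices for illustration.

```python
import random


def coalescence_probability(population_size):
    """Chance that two sampled lineages share a parent one generation back."""
    return 1.0 / population_size


def generations_until_coalescence(population_size, rng):
    """Follow two lineages backwards until they pick the same parent."""
    generations = 0
    while True:
        generations += 1
        # Each lineage picks its parent uniformly at random from the previous generation.
        if rng.randrange(population_size) == rng.randrange(population_size):
            return generations


def mean_coalescence_time(population_size, replicates=2000, seed=1):
    """Average waiting time (in generations) over many simulated pairs."""
    rng = random.Random(seed)
    total = sum(
        generations_until_coalescence(population_size, rng)
        for _ in range(replicates)
    )
    return total / replicates
```

Calling `mean_coalescence_time(20)` and `mean_coalescence_time(100)` should give averages close to 20 and 100 generations respectively. Smaller populations coalesce sooner, which is why a shrinking population gives the shorter, stubbier tree described above.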
And so there's a statistic, Tajima's D, that we can use to measure deviation from neutrality, so how neutral things are in terms of selection and in terms of population change over time. Tajima's D is kind of our measure of whether we see an increasing population size or a decreasing population size. And yes, I see a quick question from Alan in the chat. There are a lot of extensions to this kind of model and to the coalescent approach. It's particularly used a lot in human genomics, right? When you have sexual reproduction, that adds a whole bunch of complexity and mixing. Again, it's a very wide area, and I'm giving a very high level overview in a very short amount of time. So Taming the BEAST, again, will go into a lot more detail about these coalescent processes and how to use them. But the general idea is we can use the shape of the tree to infer things like the population size of the underlying pathogen, based on things called coalescent processes.
Okay, so on to our last aspect of phylodynamics, and that's looking at evolutionary forces such as selection. So how do we measure whether selection is occurring on a tree, in a genome? So one of the classic ways in which we do this is something called dN/dS. What dN/dS is, it's the ratio of non-synonymous mutations, so mutations that cause a change in the protein, to synonymous mutations, which don't cause a change in the protein. And both of these have to be normalized by the number of opportunities there are in the genome for synonymous mutations and non-synonymous mutations, because they're not equal. The general thinking here is, if we see about equal numbers of normalized synonymous and non-synonymous mutations, it probably represents drift. We're probably not seeing any sign of strong selection going on in the genome, because they're kind of gradually occurring over time and we're seeing the same number, proportionally, of mutations that change the protein and mutations that don't change the protein.
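The normalization step can be sketched as follows. This is a deliberately simple counting-based version with no correction for multiple hits, not the codon-model approach used in the lab; the function name and all the example counts are hypothetical.

```python
def dn_ds_ratio(nonsyn_mutations, syn_mutations, nonsyn_sites, syn_sites):
    """Ratio of normalized non-synonymous to synonymous changes.

    nonsyn_sites and syn_sites are the numbers of opportunities in the
    sequence for each kind of mutation (most positions in a coding
    sequence are non-synonymous opportunities, so the two are not equal).
    """
    if syn_mutations == 0:
        raise ValueError("dS is zero; the ratio is undefined without synonymous changes")
    dn = nonsyn_mutations / nonsyn_sites  # non-synonymous changes per opportunity
    ds = syn_mutations / syn_sites        # synonymous changes per opportunity
    return dn / ds


# Hypothetical gene with 690 non-synonymous and 210 synonymous opportunities.
raw_ratio = 23 / 7                          # raw counts look heavily non-synonymous
normalized = dn_ds_ratio(23, 7, 690, 210)   # per-opportunity rates are about equal
```

In this made-up example the raw count ratio is above 3, but after normalizing by opportunities the ratio is about 1. That is the pattern the lecture associates with drift rather than strong selection.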
And this is under the slightly sketchy assumption that synonymous mutations don't have much of an impact on fitness. In some cases, especially in things like viruses, they likely do have some degree of an effect, just because of the very compact genome. They change things like codon frequencies and all that kind of stuff. And there's a lot of debate on the Shen paper in terms of yeast recently. There's been a big critique of it published recently. But for the simplicity of what we're dealing with today, let's assume synonymous mutations don't cause a selection effect. So when we see equal numbers, proportionally, of non-synonymous and synonymous mutations, it's probably reflective of drift. There's no strong selection going on. If we see more non-synonymous mutations, so more mutations that change the protein than synonymous mutations, it's probably a sign that positive selection is going on. We're actively selecting and retaining and fixing mutations that are changing the proteins, more so than mutations that don't change the proteins. And the opposite, maybe we have a very stable thing, and any mutation that changes the protein is gonna cause the pathogen to become less fit. We might see this ratio being less than one. That's an example of purifying, or negative, selection. So the only mutations that are allowed to be retained in the population are these synonymous mutations. But, okay, so we have dN/dS ratios, and we can look at the dN/dS ratios in genomes. So why do we need phylogenies for looking at this? And what are the challenges of that? Well, one of the challenges is that mutation rates vary over time and between groups. I talked about that earlier a bit in the temporal models. But depending on which part of the tree we're on, mutations are gonna occur faster or slower, there's gonna be a higher mutation rate or a lower mutation rate. And that's gonna affect these estimates. Similarly, across the genome, mutation rates are gonna vary.
There's gonna be some parts of the genome that mutate far more often than others, synonymously or non-synonymously, right? And again, this is just related to the underlying biological process determining how mutations are occurring. And then finally, the big one, and the one I really want you guys to take away today, because it's a very common mistake in analysis of biological data coming from machine learning, statistics, epidemiology, is that genomes are related, biological things are related to one another. So mutations, things like that, are not independent events. They're not IID, so you need to incorporate that dependency structure into your model. Otherwise all your error terms are gonna be messed up and likely wrong. What do I mean by this? So say we have, again, our fun six genomes, A, B, C, D, E and F, and we have the synonymous mutations and non-synonymous mutations. So synonymous in purple, non-synonymous in orange. We can just count the number of non-synonymous and synonymous mutations in each of these genomes. So there's two purple in A, and one orange, one non-synonymous, in A. D has only one synonymous mutation. F has two synonymous and one non-synonymous mutation. And say we just naively took this and calculated the ratios, the dN/dS ratios, when we normalize them. I'm conscious I've messed this up. If we ignore the phylogeny, we're gonna overcount some of these mutations. So see, we've counted these two mutations here, so the mutation in E and the mutation in F, as separate mutations that we just tally naively. This is all just one mutation. So the same mutation is just present in A and is, oh, it should be. Okay, I've messed that up again. So we're just saying, this one mutation is occurring in this part of the tree and just happens to be present in two genomes. It doesn't mean the same mutation has happened twice. So again, we can use ancestral inference and look at internal states here.
But we need to use a tree again here. So we see A, B, and C. This purple mutation, this synonymous mutation, occurred in the common ancestor of A, B, and C. So we're actually counting this as three different mutations if we just tally the genomes, when it's only one mutation. That mutation only occurred once. So we're gonna hugely overestimate the number of mutations unless we use the underlying phylogenetic structure. So what a phylogeny does in this case is it captures the dependency structure of the genomic data, and so it informs the error terms of our models. And so in the lab, we're gonna use something called the adaptive Branch-Site Random Effects Likelihood model, aBSREL, bit of a mouthful, which just tests, and controls for all these aspects: is there a significant proportion of sites, so positions in the alignment, within selected branches that have a dN/dS ratio greater than one? So, is there a subset of branches in our tree that have signatures of positive selection? And we're gonna dive into that in a bit more detail in the lab. So the example of this, and the last example we're gonna talk about today, is a paper [name unclear] that one of the people in the workshop actually led. It's looking at, when we treat patients with Remdesivir, an antiviral that interrupts viral replication, and we have shortened courses, so when people don't complete their full recommended course of this antiviral, do we actually select for antiviral resistance mutations? Do we see selection for mutations that cause resistance to Remdesivir in the individuals that didn't complete their course? So we build a big phylogeny, and we have all the purple branches, we have the individuals that had shortened Remdesivir courses. And we have the time points at which they were sampled. And we can ask, in the orange branches, do we see signatures of a dN/dS ratio greater than one in a significant proportion of sites?
So we can directly test our hypothesis of, yes, we think a shortened Remdesivir course leads to selection for resistance mutations. We can use this phylogenetic approach, use something like aBSREL, to actually directly test the hypothesis. And so there's a great tool called HyPhy, Hypothesis Testing using Phylogenies, that we're gonna dive into in the lab, that lets us do lots of different ways of testing for selection, using the underlying phylogenetic structure to make sure we don't goof up and miscount things and overcount things.
So in today's lecture, we talked about the ways in which pathogen evolution and epidemiology are intrinsically linked. Phylogenies are structured by the sampling, ecology, evolution, and epidemiology of the underlying process. A phylogeny gives us a blurry view into the underlying epidemiological process that determines the transmission network. It also gives us a way to access insights into evolution and unobserved events. So we can use similarities and differences between genomes to try and infer things we can't directly observe, or can't easily directly observe. Phylodynamics in general tends to be heavily based on Bayesian phylogenetic models. We are going to use likelihood models in the lab, sorry about that. And we can use these approaches, Bayesian or otherwise, to do many things and understand many things about the epidemiology of our pathogen and the underlying epidemiological process. We can reconstruct transmission networks. We can infer the time and location of outbreaks and events. We can identify when certain mutations occur. We can determine the values for certain epidemiological parameters. And we can test for evolutionary things like selection. So, yep, that is the overview, that is the whistle-stop tour of phylodynamics.