Alright everybody, my name's Gary, and I'm going to deliver the lecture on phylogenetics and phylogenomics for you. This is a version of a lecture that I've been giving for about eight years. The first time I tried it, it was about three hours long, and everybody keeps telling me to scale it back. I'm trying to get it into about 60 minutes, so I've had to cut a lot of stuff out. Even so, there's a lot of information in here, and there's more in the lecture notes than I'm actually going to talk about. I'll talk about every slide, but on some slides you'll see there's a lot of text. That's there for you as an additional resource for understanding phylogenetics and phylogenomics.

Before I get started, a show of hands. Who here has built a phylogenetic tree before? Okay, about half. Who is comfortable interpreting a phylogenetic tree? Okay, a smaller number of people. Can anybody tell me the difference between a neighbor-joining tree and a maximum likelihood tree? You don't have to actually say, just put up your hand if you know. One person. Okay, then I think I've got this about right. By the end of this lecture you will know the difference.

So let's get started. The objectives for this module are to understand the basics of phylogenetic trees: the terminology and the different parts, a little bit about how to interpret phylogenetic trees, some sections on building phylogenetic trees, and then the extension out to phylogenomic trees.

Evolution is simply described as descent with modification.
Essentially, we observe that the offspring of our ancestors have similar traits, but sometimes those traits change over time, and the accumulation of those changes can result in speciation and differentiation. So there are some changes, and there are some things that stay the same, between a group of related organisms.

Phylogenetics is a scientific method and a computational method that studies the evolutionary relationships between biological entities. These could be species or genes, they could be parts of limbs, or really anything that you're interested in that is related through evolution. These days we mostly use DNA, because genes are the unit of heredity and they do a good job of allowing us to determine the evolutionary relationships of a group of organisms. But before DNA, and still within the field of what's called cladistics, there were other ways of inferring the evolutionary relationships between a group of organisms: for example, morphologies like bone structure or flower structure. Language can actually be used to infer evolutionary relationships too. So there are lots of different ways to go about it. We're going to focus on using genetic and genomic data, and using that data we can build these inferred evolutionary relationships, which we refer to as phylogenetic trees.

In most circumstances when we build a phylogenetic tree, the data available to us is for the existing organisms, those biological units that we call taxa. We normally don't have access to the ancestral data. Sometimes we do: if we have access to the fossil record, then we actually do have ancestral data that we can make comparisons with, but normally that's not the case,
especially when we're working within the field of microbial informatics. With very rare exceptions, we are only using the extant data that's available to us: isolates that were recently living and that we've been able to extract DNA from. So what we need to do is infer, from the existing data, what the relationships are and what the ancestors were that gave rise to their progeny. In order to do that, we have to make some assumptions about the process of evolution that led from the ancestors to their progeny. What we're saying here is that we have data for the organisms on the very ends of the tree, the tips, and everything internal is inferred.

So here we're going to talk a little bit about tree terminology. The phylogenetic tree, writ large, is the structure that models the evolutionary history of a group of sequences or organisms or whatever the taxa are. The tree itself consists of nodes, the little dots here, and branches, which connect those nodes together and give us the relationships. The nodes that we have data for are called the terminal nodes, or the leaves, or sometimes the tips. More generally they're called operational taxonomic units, or OTUs. People who are familiar with metagenomic analysis may also have heard of operational taxonomic units, where they conceptually define a somewhat different set of data, although ultimately they're describing the same kind of thing. I'm not going to get into it. I just want to make sure that people who do have some metagenomics background aren't confused: the OTUs there, which are basically similar sequences clustered together, are not the same thing as the OTUs in phylogenetic trees. The
OTUs are the single nodes that exist on the tips of the tree.

The internal nodes are the inferred, hypothetical ancestors of their descendants, and the ancestor of all of the taxa, all of the OTUs and all of their ancestors, is called the root, and that's here. So we have our terminal nodes, we have our internal nodes, we have the branches that connect them, and then we have the ultimate ancestor, which is called the root. But not every phylogenetic tree actually has a known root. Any questions about tree terminology so far?

Okay, a little bit more. We can group subsets of the tree into logical subclusters called clades, or monophyletic groups. Which group of species or OTUs you pick is arbitrary, but normally a clade is used to define something that is important, like an outbreak clade versus a non-outbreak clade. The clade itself contains the ancestor and all of its descendants, so it always includes the terminal nodes, not just the internal ones. Then there are sister taxa, which are the species or groups of species that arise from the same node. So we have here A and B; these are sister taxa, and they share this internal ancestor here. C and D are sister taxa. And the clades themselves, the AB clade here and the CD clade here, clade 1 and clade 2, are also referred to as sister taxa because they are also connected by the same common ancestor.

One of the things that we assume when we're building a tree is that an ancestor gives rise to two descendants under a process of speciation. That means that every internal node will have three branches: one to its own ancestor, and two to its descendants. That's called a bifurcating tree, or dichotomous tree, and that's normally what we see. Here's a fully resolved
tree, where you can see every node has three connections, with the exception of this root node here. So that's fully resolved, but not all trees have enough information to provide that fully resolved, bifurcating structure. Those trees can sometimes have more than two descendants from an ancestor. If you have three or more descendants from one ancestor, that's called a polytomy, and polytomies are of two types: hard polytomies and soft polytomies.

A hard polytomy is when you know that, instead of a single speciation event from a common ancestor like this one, there has been more than one at essentially the same time, so that there are three or more descendants directly related to a common ancestor rather than just two. Can anybody give me an example of how something like that might occur? One possibility is that the descendants of the ancestral organism get separated from each other into separate geographical locations at around the same time. Think about after the last ice age, when the ice started to melt and the water started to rise. You could have a large island that gets flooded as the water rises and becomes multiple islands. So instead of one island with, say, three peaks on it, you now have three separate islands, and the organisms on them are going to start to diverge from each other, but they all diverge from a common ancestor. This is actually the case for several organisms. There's a type of fruit fly in the Seychelles islands which is known to form a hard polytomy, where there are more than two descendant lineages from one ancestor.

The other, more common one is called a soft polytomy, and this is where you have three or more descendants from a common ancestor
that likely underwent regular speciation events, but you don't have enough phylogenetic information to tease out what that pattern of speciation was. There's just not enough information, so you group those together into what's called a soft polytomy. Those are called partially resolved trees. You can also get a fully unresolved tree, called a star tree. Those are a special case that I will talk about a little bit later when we discuss neighbor-joining trees.

There are two main ways to depict a phylogenetic tree: cladograms and phylograms. The cladogram is the simplest type of tree. It only shows the relative recency of common ancestry. In this example we have three OTUs, represented out at the leaves, and this cladogram shows us that A and B are related by a common ancestor, and that the common ancestor of A and B is related to C by another common ancestor. That's the relationship of their ancestry. It doesn't tell us anything about the degree of divergence that has occurred among those three OTUs. That means the branches in those trees may have different lengths, but the lengths have no meaning. In a cladogram it's only the topology of the tree, the way that the tree is organized, that gives us any phylogenetic information. Are there questions about cladograms?

Okay, so the phylogram contains more information than a simple cladogram. It's the one that does contain information about the relative rates of evolution of the different OTUs in the tree. It contains the relative recency information in the same way that a cladogram does, and this one has the same topology that I showed in the last example, but we can see that the branch lengths are different here. Normally in a
phylogram you can place these numbers on the branches, which quantify the amount of divergence between the different species. Normally when an ancestor gives rise to its progeny, they will evolve at a linear rate only for a short time, and then they will acquire different evolutionary rates. One may acquire mutations slowly and another may acquire mutations much more quickly, and that means that the rate of divergence is different between those two progeny. That's essentially what is represented here. We can say that A has evolved at twice the rate of B, and C has evolved very slowly relative to B and A.

In this vertically depicted tree, these vertical lines contain the evolutionary distance information. The horizontal lines do not contain any information; it's just the vertical ones.

We can orient trees any which way we desire. Normally we have vertical trees or horizontal trees, and there's no informational difference. It's really just a preference in the way that you wish to display the tree, so there's nothing to be gained, information-wise, by choosing one orientation over another.

The order of the leaves is also not informative. If you think about the vertical branches in these trees, you can swap them around a node, and that changes the left-to-right order in which the OTUs appear without changing the topology of the tree. So when we swap this one here, from this ABCD we get a DCBA arrangement. We can swap this one here and get BCDA. These all have a different leaf order, but the topology is unchanged. So this is trying to reinforce
here that it's important for interpretation to understand that only the topology and the distance information convey anything about the evolutionary history of those organisms.

Trees can be rooted or unrooted. The root of the tree, as I mentioned earlier, is the hypothetical ancestor of all of the leaf nodes, all of the OTUs, all the organisms that are under study and are being used to build the tree. A tree that has a root gives you an absolute direction of descent. If you have a known root, then you can say this one evolved from this one, and then this one evolved from this one, and so on. You've got that sort of timeline fixed when you have a root. But you don't always have information about which organism is ancestral to all the others in your study. In those cases the best that you can do is build an unrooted tree. In an unrooted tree the taxa have an evolutionary relationship to each other, but the timeline is unknown. So unrooted trees are less informative, because they don't give you that absolute direction of evolution, but they still give you the relative evolution between the organisms in the tree. An unrooted tree is normally presented in what's called a radial format, like this. It doesn't appear that there's a common ancestor to all of them; they just appear related to each other. Any questions about unrooted trees?

Again, just trying to reinforce rooted versus unrooted trees: this is the same set of OTUs depicted in a rooted tree and in an unrooted tree. When you have a rooted tree, you know which node is the one that is most
ancestral. So you have your ancestral node here. Oops, I'm not very good at these Macs. So modern species four diverged from this common ancestor more recently than the ancestor of species one, two and three, and the ancestor of two and three diverged more recently than the ancestor of one, two and three, et cetera. We have that absolute order of descent.

In an unrooted tree we don't have the root. We know that there is an ancestor of species one and two, and we know that there's an ancestor of species three and four, so we have these ancestors here, but we don't know which one of those ancestors came first, which one is ancestral to the other. That's the difference. We can't determine that absolute order of descent.

It's possible to root a tree if you have additional information about the organisms beyond just the sequence information that you're using to build the tree. If you have additional information that tells you that one of the organisms diverged before all the others, that one is called the outgroup, and the rest of the species are called the ingroup. Then you're allowed to say, okay, I can place my root between that taxon and the ingroup. So here, cows, humans and chimps are all mammals. Zebrafish is a vertebrate, but it's certainly not a mammal; its lineage split off earlier. If we add the root to this unrooted tree between the zebrafish and the mammals, then we get a rooted tree that says the zebrafish diverged from the common ancestor before the ancestor of the cow, human and chimp did. It's maybe a little bit more complicated than that. Are there any questions about rooted versus unrooted trees?

Normally when you build a phylogenetic tree, there's no information about your outgroup in it, but when you visualize the tree you're
allowed to choose one of the organisms and say, I know that this is the outgroup, and the tree visualization software will rearrange the tree from an unrooted tree into a rooted tree for you.

The number of trees that you can possibly build is a function of the number of organisms that you're studying, and it grows extremely quickly. This capital pi here is the multiplicative equivalent of the summation sign, which people are normally a little more familiar with. With a summation you add a series of values together; here you're multiplying a series of values together, and what we're determining is the number of trees as a function of the number of OTUs. For unrooted bifurcating trees with n OTUs, that product is 1 times 3 times 5 and so on, up to 2n minus 5. The smallest regular bifurcating tree that you can build requires three OTUs, and there's only one tree for three OTUs. If you go to four OTUs, then you can have three possible trees, three different topologies that relate those OTUs together. At five there are 15 possible topologies, and at 10 there are about two million. By the time you get up to 50 or so, you start approaching numbers like the number of atoms in the universe. It grows very, very quickly, and this combinatorial complexity becomes a major issue when working with trees and trying to figure out how to build the best tree. But let's see how that's done.

When we're working with the sequence data that we've collected from a group of taxa, a group of OTUs, the way that we use that data to build a phylogenetic tree is with a multiple sequence alignment. Here are our taxa, A, B, C and D, and here are their sequences. We place the sequences together in a multiple aligned block, and that is the data that
we're going to use to infer the tree. We also need a model of evolution, which I'll talk about briefly later. Some people say I don't talk about it briefly enough, but we'll talk about it later. The model is sometimes assumed, and I'll show you what I mean by that. And then we need some algorithm that's used to build the tree.

There are three main tree-building methods: distance-based, character-based, and Bayesian. There are some additional, kind of esoteric ways to build trees that don't fall into these three main categories. I'm only going to talk about the distance-based methods and the character-based methods. The Bayesian methods were developed more recently, in the last 20 years. They are more powerful and provide a good statistical framework for interpreting the tree, but they are complicated. I tried to give an introduction to them last year in the class [unclear], but I still have those lecture notes. So if anybody is interested in learning how Bayesian trees work, we might be able to find some time outside of the class over the next two days. Maybe I can put up a little sign-up sheet or something. It is actually a very powerful method, and it can do a lot of cool stuff that the character-based and distance-based methods can't, so it's worth knowing about. But it's complicated, so I'm going to leave it out.

Anyway, let's start with the distance-based methods. In the distance-based methods, what we're simply trying to do is count the number of differences in our multiple alignment and then use those counts as the metric for the evolutionary distance between the different organisms. That's the information that we use to build our tree. So we take the multiple alignment and then we use it to
build something called a distance matrix, and the distance matrix is then used to build our phylogenetic tree. There are two main distance-based methods: one called UPGMA and one called neighbor joining.

Let's take a quick look. We have our sequences here, and we have a distance matrix where we list all of our taxa, A, B, C and D, on the columns and again on the rows, and then we count the number of differences between them. For A and A, of course, it's always going to be zero, because they're exactly the same. But let's take a look at A and B. In this column here we can see there's one difference, a T and an A. Here's another difference, a T and an A. Here's a third difference, an A and a T. Then that's the same, that's the same, that's the same, that's the same. Three differences, so we put three in there. You can do that for every cell in your matrix, and you will tabulate an all-against-all difference table for that multiple sequence alignment. Later on you might want to go and confirm that this is the actual distance matrix that you get from this multiple sequence alignment.

This looks a little bit more complicated, but essentially what I'm saying is that the input to the tree-building algorithm, for a distance-based method, is that distance matrix, which we call M. The idea is to try to build a tree where each leaf corresponds to a sequence in your distance matrix. So we've got A through E here in the matrix, and we've got A through E here in the tree, and if you add up the branch lengths along the path between any two OTUs in the tree, the total will match the corresponding entry in the distance matrix. For example, the distance here from A to B is seven plus five, that's 12, so AB is 12 here.
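
The difference-counting step that fills in the distance matrix can be sketched roughly as follows. This is only an illustration: the four sequences are hypothetical (they are not the ones on the slide, although A and B are chosen to differ at exactly three of seven positions, as in the walk-through), and the function names `count_differences` and `distance_matrix` are made up for this example.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Count the alignment columns where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def distance_matrix(alignment):
    """Build an all-against-all difference table from a dict of aligned sequences."""
    names = list(alignment)
    # Start with zeros everywhere; the diagonal (A vs A, etc.) stays zero.
    matrix = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = count_differences(alignment[a], alignment[b])
        matrix[a][b] = d
        matrix[b][a] = d  # the table is symmetric
    return matrix


# Hypothetical multiple sequence alignment (assumption, not the lecture's data).
alignment = {
    "A": "TTAGCGC",
    "B": "AATGCGC",
    "C": "TTAGCGA",
    "D": "AATGGGA",
}

m = distance_matrix(alignment)
for name in alignment:
    print(name, [m[name][other] for other in alignment])
```

Reading the printed rows against the alignment by eye is the same check suggested above: pick any two sequences, count the mismatched columns, and compare with the cell in the table.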
Right. Well, it turns out that if you have the tree with known distances, it is trivial to build up that distance matrix. You just add the numbers up and put them into the matrix. If you have a matrix and you want to build a tree whose distances map exactly back to the matrix, that's not easy. It turns out there may not even exist a tree that exactly recreates the distance matrix. What you can do is try to get as close as you can, but it may be the case that you can't build a tree from a distance matrix where all of those distance constraints are exactly satisfied. Are there any questions about that? No? Okay.

So instead of trying to fit the tree exactly to the data, the process of tree building is one where we want to minimize the differences between the distances in the tree and the distances in our distance matrix. We're trying to get them all down to zero if we can, but maybe we can't, so there might be a discrepancy of one or two or something like that. Essentially, the process is to develop a tree that satisfies what's called the Cavalli-Sforza criterion, which says: minimize the sum of the squared differences between the distances in your tree and the distances in your distance matrix. Does that make sense? The reason that you square them is to get rid of the whole negative/positive problem, where some differences may be less than zero and some above zero. So it's really just taking the sum of all the discrepancies between your tree and the matrix, and minimizing it.

The problem here is that getting the best-fit tree is what's called an NP-complete problem. Does anybody know what NP-complete means? You know? Okay. Well, you don't. Okay. Do you want to give
us an answer for what NP-complete means? That's right. It means, in practice, that the only known way to determine for sure that you actually have the right answer is to look at all possible variations, essentially to brute-force your way through. So you'd have to take all the possible trees for that distance matrix, evaluate all the different distance values, and then find the tree whose distances satisfy the Cavalli-Sforza criterion, the one that minimizes that discrepancy. That might be okay if you have, say, four or five taxa, but if you start getting up to 10 or 15 or 25, then the number of computations that you'd have to brute-force becomes what's called intractable. There's no known way to solve it in what's called polynomial time. So we need to use heuristic methods, which don't guarantee that we're going to get the correct tree, but which narrow things down toward what is likely the correct tree, or close to it.

The UPGMA method, which stands for unweighted pair group method with arithmetic mean, is one of the simplest methods that can be used to build a tree. It doesn't require that you evaluate all the possible trees. It just uses a procedure that creates one tree for you: it starts with the distance matrix and ends up with the tree. The way that it works is like this. It looks around and asks which two of the OTUs have the smallest distance between them. It looks like three and five have the smallest distance between them, so it says, okay, I'm going to connect those two together, and they get a common ancestor. Then you do another calculation to see which are the next closest two, and in this
case we look here and say, okay, one and two are closest together after connecting three and five, so we'll connect those together. Then you look again and ask which are the next closest. It turns out that four and the ancestor of three and five are the closest, so you connect those together. So you can see what we're doing: we connect three and five here, then we connect one and two here, then we connect three-five and four here. Then there's nothing left to connect except one-two, so that's the last step, and that's how the tree is built. That's the UPGMA method. Very simple, very quick.

So it's very quick, but it has some limitations. For example, it infers one ancestor per step. It basically says, I'm going to connect these two together, and that's my one ancestral step, and it connects them together in the middle, so there's no relative divergence between the two that you connected. That same process is applied over and over again as you build up the tree. So you can see here, this is one step, this is another step, this is another step, and then the final step. The evolutionary distance between the hypothetical ancestor and each of the taxa is the same: one, two, three, four; one, two, three, four. The distances are all exactly the same. It's called an ultrametric tree, and that is a very poor assumption about how real organisms evolve, because they do acquire different rates of divergence after they speciate from each other. So it's a very quick way to get the topology of the tree, but the distance information is likely not correct.

Are there any questions about UPGMA trees?
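
The UPGMA steps just described (join the closest pair, place their ancestor midway, average the distances, repeat) can be sketched as below. This is a minimal illustration under stated assumptions, not a production implementation: the five-taxon distances are invented so that the joins happen in the same order as in the walk-through (3 with 5, then 1 with 2, then 4 with the 3-5 ancestor), and the function name `upgma` and its return format are made up for this example.

```python
from itertools import combinations


def upgma(names, pairwise):
    """Cluster taxa with UPGMA.

    names:    list of leaf labels
    pairwise: dict mapping frozenset({a, b}) -> distance between leaves a and b
    Returns (tree, height): a Newick-like topology string and a dict giving the
    height of every node, i.e. its distance down to any of its leaves.
    """
    d = {pair: float(value) for pair, value in pairwise.items()}
    size = {n: 1 for n in names}        # number of leaves under each cluster
    height = {n: 0.0 for n in names}    # leaves sit at height zero
    active = list(names)
    while len(active) > 1:
        # Pick the two current clusters with the smallest distance between them.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        # The new ancestor is placed midway, which is what makes the tree ultrametric.
        height[merged] = d[frozenset((a, b))] / 2
        size[merged] = size[a] + size[b]
        rest = [c for c in active if c not in (a, b)]
        for c in rest:
            # Arithmetic mean over all leaf pairs, via a size-weighted average.
            d[frozenset((merged, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / size[merged]
        active = rest + [merged]
    return active[0], height


# Hypothetical distances (assumption): everything not listed is 10 apart.
names = ["1", "2", "3", "4", "5"]
close = {("3", "5"): 2, ("1", "2"): 4, ("3", "4"): 6, ("4", "5"): 6}
pairwise = {frozenset(p): close.get(p, 10) for p in combinations(names, 2)}

tree, height = upgma(names, pairwise)
print(tree)
print(height[tree])
```

Because every ancestor is placed at half the distance between the clusters it joins, every leaf ends up the same distance from the root, which is exactly the ultrametric limitation discussed above.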
No? Okay. So neighbor joining is a second popular distance-based clustering algorithm for inferring phylogenetic trees. It can incorporate the relative distance information in a way that UPGMA doesn't, so that you can actually get different branch lengths for the different taxa and for their ancestors. The way that it works is using something called star decomposition. The big difference is that UPGMA just looks at the closest neighbors, but neighbor joining takes into account the distance from each node to all of the other nodes in the tree.

It goes like this. You start with your distance matrix, and you calculate a new matrix called a Q matrix. The Q matrix is essentially asking: how close are these two nodes to each other, and how far away are they, on average, from all of the other nodes? So we're getting information not just about the two nodes that are under consideration, but about the relative distance between those two and the rest of the nodes. It's kind of like asking how close they are to the center of this completely unresolved star-topology tree, which is the one I introduced briefly when we were talking about polytomies. This is the completely unresolved tree.

What the Q matrix allows us to do is create a new node that takes into account the distance between the two individuals that you're grouping together and their distances to the rest of the taxa in the tree. So instead of connecting them right in the middle, like UPGMA does, it applies a kind of weighted proportion and says, well, A is closer to the rest than B is, so I'm going to put the node closer to A than to B. So you can now see a distance of one here and a distance of four here. So now we have that
differential length information incorporated in there. You now repeat this, in the same way that you do with UPGMA trees. You calculate a new Q matrix, in which the new node stands in for A and B alongside the rest of the members, then you find the next pair that is closest together and most distant from the rest, and you generate a new internal node. At the end you'll have your fully resolved tree, which has a better representation of the actual divergence between the members of the tree. So the relative distance information is captured in there.

Normally you will build an unrooted tree this way. Remember, the UPGMA tree infers a root, because it goes from the two closest to the two least close, so it's working toward a hypothetical common ancestor, which it treats as the root, but that may very well not be the true root. Neighbor joining builds an unrooted tree, and that's where you can apply your outside information. If you have information about which one is the outgroup, say A diverged before the others, you can say, okay, I'm going to put my root right there, and now you will have that absolute ancestry. Questions about neighbor-joining trees?

Just summarizing the distance methods: they are heuristic methods, rule-of-thumb type methods, that allow you to get around having to calculate every possible tree. You don't have to brute-force your way through, and that reduces the problem from intractable to tractable. The neighbor-joining tree is preferred over UPGMA because it's not ultrametric; it allows the relative rates of divergence to be incorporated into the tree. But the distance methods throw some of the information away, specifically information that is contained within the actual character data, the multiple alignment itself. Like when you
see an A switch to a G, that just counts as a distance, a change, of one. But a change from an A to a G may be different from a change from an A to a C. At the molecular level there are different forces acting on the DNA sequence that result in its divergence, and the mutation from one character to another can be under different pressures, so some changes are acquired more easily and some less easily, for example. So all that information is being thrown away. The character methods work with the same process that you use with the distance-based methods, so you start with a multiple alignment, but you include the character information when you build those trees. These character methods are also called discrete methods. They don't just work on the distance information; you don't calculate a distance matrix. You work directly on that multiple sequence alignment, you take into account the actual mutations, and you can apply different numerical values to the rates of mutation from one nucleotide to another. OK, and there are two main types of character-based methods: the maximum parsimony method and maximum likelihood. Maximum parsimony is easier to explain, but it is not really that widely used; most people, when they use the character-based methods, will use maximum likelihood. So I'm not going to describe maximum parsimony in detail, I'm going to describe maximum likelihood for you. But the maximum parsimony algorithm is trying to find the tree that describes the sequences with the fewest evolutionary steps. So you can reconstruct, basically, what the sequences were
that gave rise to the ancestors for the internal nodes, and you choose the tree that requires the least number of mutations to get from any one taxon to the other taxa. OK, that's a little bit complicated; it will become a little bit clearer as I go through maximum likelihood. So maximum likelihood is a more sophisticated approach than maximum parsimony, and it essentially involves finding the tree that best describes the data, which is the multiple sequence alignment. It is a probabilistic method, so in order to do that it has to have a probabilistic framework, and that framework is incorporated through the use of what are called substitution models, or evolutionary models. So the maximum likelihood method is trying to find the tree that maximizes the probability of observing the data. We're trying to formulate and describe this in more probabilistic terms here. Don't panic, this is mostly a probability-free talk. So you have your model of sequence evolution, which is a probabilistic model, and you have your observed data, that's your multiple sequence alignment, and you're trying to find the tree that best fits that data given that evolutionary model. The way that's depicted in probability notation is that we're asking: what is the probability of seeing that data, given that tree and that model of evolution, P(D | T, M)? And we're trying to maximize that probability, the probability that tree T with model M gave rise to that data set. OK, so let's take a look at a simple evolutionary model. This one is the transitions versus transversions model. Transitions are the mutation, or interchange, of two purines or two pyrimidines. The purines are the nucleotides that contain two rings
here and here, and the pyrimidines are the ones that contain just the one ring, here. So the transitions are an interchange between the two two-ring nucleotides, the purines, or between the two one-ring nucleotides, the pyrimidines. And then the transversions are all of the other possible mutations, so anything that involves a mutation of a one-ring nucleotide to a two-ring nucleotide, or the other way around, is called a transversion. So just using standard combinatorial mathematics, because you can have one, two, three, four possible transversions among the As, Gs, Cs and Ts and only two possible transitions, you would think that you'd get about twice as many transversions as transitions. But it turns out that that's not the case; actually the transitions occur at a higher frequency than the transversions do. Can anybody suggest why? They tolerate each other for the swap better, right. Yeah, they look more like each other, so they're better tolerated when there's a mutation: they're not recognized and excised during proofreading as often as the transversions are. Also, transitions are less likely to result in amino acid substitutions. We know that there are 64 different codons that code for 20 different amino acids, so there's degeneracy in the codons that can code for an amino acid; for example, there are six different codons that can code for leucine. The differences between those codons are overwhelmingly transition-type mutations versus transversion-type mutations, and they also occur in what's called the wobble base. So because the amino acid that is encoded after a transition is more often the same than it is after a transversion, that means the function of the protein that's generated from that codon is the same, and so it's better tolerated than it would be if it was a non-synonymous mutation, one that resulted in an amino
acid change. All right, so that's transitions and transversions. So now we know that an A to a T, where if it was a distance-based method we would just count that as one, may be different than, say, an A to a G, because of the different evolutionary pressures on transitions versus transversions. We can take that information into account in a maximum likelihood approach; that is our model of evolution, our substitution model. So in a maximum likelihood tree we're going to go column by column through our multiple sequence alignment, and we're going to try to find the tree that best explains the mutations that we see here. What we're asking is: what is the probability of data one, which is our column one, given this tree here? Now, how many trees can we have if we have four taxa? Anybody remember? The smallest case is three taxa, which gives one tree, and the next is four taxa, which gives three trees. So there are three possible trees, but we're going to just try this one tree at the start. We want to see what the probability is that this tree explains these observed mutations in this one column of our multiple sequence alignment, knowing that transitions have a probability of, say, 0.3, transversions have a probability of 0.1, and no mutation has a probability of 0.6. Those are the values that we use to apply the weights to transitions versus transversions. The way to do this is to reconstruct the ancestral states at the internal nodes here. We want to say, OK, what if this were, say, a G? Then you'd have a G-to-A transition here and a G-to-A transition here. Let's say this was a G here as well: you'd have no mutation here, and then you'd have a G to G and a G to G here. That's you reconstructing one
ancestral state. We have to reconstruct all the different ancestral states, and there are four different nucleotides and two positions, so it's four squared, or 16, different combinations that we're going to have to evaluate, looking at the probability that that tree with that ancestral state explains that data set. So this is what we do. Here, let me place a G and a C, for example; that's our starting one, and we're going to have to do another 15 of them. Then, using our model of transitions versus transversions, we can plug in and say, OK, what's the probability of seeing a G to A here, and a G to A here, and a G to C here, and a C to G here, and a C to G there? That's what all this is here. We multiply those together, and that gives us the one probability that this tree with this ancestral state gave rise to this column of mutations. Any questions about that? Go ahead. Don't ask that question. Don't deal with them. OK, so there are two ways to deal with deletions and insertions: there's what's called the double deletion model and there's the full deletion model. A double deletion model says that if you see two deletions in one column, then you can consider that to be like a fifth character, and you can include that as actual information, as long as you have an evolutionary model that can explain those gaps. The problem is that it's missing information, so it's hard to apply a value to that, and so it's not normally used in maximum likelihood. It's normally used more in other character-based or distance-based methods, where you can say, that's a change, I can add one to my distance matrix. You normally don't do it for maximum likelihood trees. The other one is called the full deletion model, which is: do not consider any deletions; any gaps are just excluded. That's what we use here for maximum likelihood, and that by far and away is the approach that is used to generate basically
nearly all phylogenetic trees. OK, good question. So, getting back to here, just reviewing quickly what we've done: we've created one ancestral state on one tree and we've evaluated the probability for one column. Now we have to do it again with a different ancestral state, and that would be our second case, and then our third case, until we have all 16 different ancestral states. Then we add those up together, and that is the probability that that tree with that mutation model gave rise to the mutations that we saw in that one column. Then we have to do it for all the other trees. There are three trees here, so we're going to have to do another 16 on a second tree and another 16 on the third tree, and then we can choose the maximum likelihood by choosing the tree that has the highest chance of explaining that one column. That's maximum likelihood. Then we have to do it for all the columns. So which method would you guys think would be the least computationally expensive: UPGMA, neighbor joining, or maximum likelihood? UPGMA, right. The second easiest would be neighbor joining, yeah. And the one that involves the most computational expense is maximum likelihood; it absolutely is. There are tricks that can be used in these more complicated tree-building methods that allow you to say, look, I can tell from looking at this subset of trees that I will never get a likelihood that's better than the one I have now. So it's basically a tree pruning method, where I say I don't have to evaluate that subset. There's a way that you can tell that you won't improve your likelihood, so you don't have to evaluate every possible tree. But you still have to do an enormous amount of computation in order to build a maximum
likelihood tree. So it's more computationally expensive, but the evolutionary model means it can incorporate a lot of different information, and you can swap out the rates of mutation for molecular clock data, which can actually give you a timeline of evolution. So the values that you normally place on a phylogram, those distances, are normally substitutions, but with these maximum likelihood methods you can start to swap those out for actual time. For example, you can predict the time to the emergence of a common ancestor: at what time did this ancestor occur and give rise to the progeny? So there's a lot more information that can be built into, or interpreted from, the maximum likelihood approaches. OK, so which is the best tree? I think we kind of answered that question already: it really depends on the circumstances. If you have an enormous number of taxa in your tree, hundreds or thousands, then the maximum likelihood approach might not be the best approach, although there are programs for building maximum likelihood trees that work in high-performance computing environments, like RAxML, which can distribute the processing amongst thousands of processors. That's what we do over at the NML. There are other methods like FastTree, which has less of a guarantee of giving you the best tree; it makes a whole bunch of additional, maybe not correct, assumptions, but it can more quickly generate a tree for you, and it is still based on a maximum likelihood approach. But essentially the best tree-building method really just depends on the number of taxa and on whether you have access to that character information, which you may or may not have. We're going to see that in the next
module, where we're using gene-by-gene based approaches and the actual character information is not really available to you, so that's when you're going to want to use the distance-based methods. But if you have the computational horsepower and you have the character data available to you, then the maximum likelihood approach is probably going to give you the most accurate tree. Then there's Bayesian, but we're not talking about that. OK, any questions up to this point? How am I doing for time? What time is it? It's 12, so I've been going for an hour. Well, I'm getting much closer to the end, OK, so bear with me; I think we'll be able to get through it in maybe about another 10 minutes. OK, so each of these approaches is going to build a tree for us, but there's still a question about how correct that tree is. We don't know; there may not be a lot of really strong phylogenetic information contained in the multiple sequence alignment in the first place. You're always going to get a tree with these methods, but whether you're going to get a tree that's actually representative of the evolutionary history of those organisms is at that point still unknown. So bootstrapping is a semi-statistically defensible way of inferring the robustness of that tree and whether it is actually giving you a correct tree, and it can be used with all the different methods that involve a multiple sequence alignment; the distance-based methods can work as well. The idea here is that you're going to generate a whole bunch of trees, but you're going to do it by shuffling that multiple sequence alignment around a bit, generating trees from these shuffled alignments, and then seeing if you get the same tree or a very different tree. If you get lots of trees that are really the same, then that means that the phylogenetic information is so strong in that shuffled multiple sequence
alignment that it doesn't really matter. But if you get a whole bunch of different trees as a result of the shuffling, then that tells you that the information is really weak and there are a lot of confounding factors in there that are giving you different trees at the end. So the method is essentially to shuffle your multiple sequence alignment, rebuild the tree, and then start to count the nodes across the trees that have identical clades, where they contain the same taxa under a certain internal node. I'll show you what I mean here. OK, so here's the idea. The shuffling is called sampling with replacement, and the idea is that we're going to choose columns from the original multiple sequence alignment to build a new multiple sequence alignment, but we're not going to choose every column, and some columns we might choose more than once. It's a random sampling with replacement type of procedure. So imagine that you have an eight-sided die, and you're going to roll that die, and the outcome is going to tell you which column to choose, and you keep doing that until you have a new multiple sequence alignment that's the same size as the original. So in this example you might have rolled six, one, six, eight, or something like that, so we have six chosen twice and maybe two doesn't get represented, but we're starting to build up our resampled alignment. Some of the columns that get chosen are hopefully going to contain that strong information, and some are going to be swapped out; maybe those contained some strong information or maybe they didn't. Maybe it's all kind of the same amount of strong information, in which case it doesn't really matter. This is what the bootstrap is going to tell us. You get that new resampled alignment, oops, you build a tree from it, and then you do that a
hundred times, and then you kind of glom all those trees on top of each other and look to see which of the ancestors contain the same taxa underneath them across all the trees. So for example, here we can see that red has A and B under it in two trees, and green has C, B and A under it in three trees, oops, and purple has A, B, C and D under it in two of the trees. So we say, OK, three and two and two, and we can place those at those nodes here, and that is our bootstrap tree. So if you've built, say, a hundred trees, a node that captured the same subclade every time is going to get a value of 100. Others may have been the same in, say, 70 out of the hundred trees, so that node gets a 70. That's your bootstrap value. Essentially it's telling you how robust the trees are that you're building from that data set, how strong the phylogenetic information in that data set is, and that's a metric of the robustness of the tree. Normally people will choose a value of, say, 70 or 75 percent: a node at 75 percent or higher is considered to be a robust node within the tree, one that can be interpreted as correct. OK, any questions about bootstrapping? All right. So, very quickly: we looked at transitions and transversions as an evolutionary model, and the idea here is that we want to more accurately represent the rates of divergence that are inherent in the evolutionary relationships of the organisms that we're studying, and the evolutionary models help to provide that information for us. One really simple model is called the p-distance, and all you do here is calculate the fractional difference in your multiple sequence alignment: you just take a look at the number of columns that have differences
versus the columns that do not, and you create a ratio, and that's your distance. So if you have two sequences that have L positions, and the number of positions where they differ is d, then you have d over L, and that's p. That's your evolutionary distance; substitutions per site is essentially what that is. But it's not really a very serious measure, for a lot of reasons. For example, the different columns, those different sites that we're looking at, can have different rates of divergence. Think about a protein sequence: say you have a protein that is on the surface of a virus or a bacterium. It may have sections that are highly conserved; say the membrane-spanning region is really highly conserved, and it may have a receptor site or something that can't really change because that would abolish its function. But it may have other epitope regions that are the parts getting attacked by the immune system, and those are under positive selection, so those are going to mutate much more rapidly than the functionally important, conserved parts. So there are different rates of divergence even within the same protein sequence, and that is very common. The p-distance does not take any of that into account. There's something called a gamma distance correction that does take that into account. Essentially what it's doing is trying to capture the spread of the rates of divergence within the multiple sequence alignment that you're looking at. This is the equation for it; it uses this parameter alpha, which changes the shape of that distribution. You kind of have to know what that alpha is, but for protein sequences it typically ranges between about 0.2 and about 3.5. So if you have an idea of what that rate is, then you can apply something called a gamma correction, or gamma distance correction, to it.
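To make the two distance measures just described concrete, here is a minimal Python sketch. The p-distance is d over L exactly as defined above. The slide's gamma equation is not reproduced in this transcript, so the corrected form used here, d = alpha * ((1 - p)^(-1/alpha) - 1), is an assumption: it is the commonly cited gamma distance for amino acid sequences. The function names and toy sequences are made up for illustration.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ (d / L)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def gamma_distance(p, alpha):
    """Gamma-corrected distance (assumed form): alpha * ((1 - p)**(-1/alpha) - 1).

    alpha is the shape parameter of the gamma distribution of rates across
    sites; smaller alpha means more rate variation and a larger correction.
    """
    if not 0 <= p < 1:
        raise ValueError("p must be in the range [0, 1)")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return alpha * ((1 - p) ** (-1 / alpha) - 1)


if __name__ == "__main__":
    p = p_distance("ACGTACGT", "ACGTACGA")  # one difference in eight sites
    print("p-distance:", p)
    # Try the two ends of the alpha range mentioned for protein sequences.
    for alpha in (0.2, 3.5):
        print("alpha =", alpha, "gamma distance:", gamma_distance(p, alpha))
```

The corrected distance is always at least as large as p, and the gap widens as alpha shrinks, which matches the idea that strong rate variation among sites hides more substitutions than the raw fraction of differing columns suggests.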
Then that's denoted by gamma. So sometimes you'll see, when people are reporting their evolutionary models, that they'll say, I'm using Jukes-Cantor plus gamma, and what that's referring to is that they're trying to capture those different rates of divergence. Then there are these other substitution models. Transitions versus transversions is one type of substitution model, and there are a number of different types that are essentially trying to capture the relative rates of substitution from one nucleotide to another in an all-against-all kind of table. So there are more sophisticated ones, and there are a couple that are more popular, like Jukes-Cantor, K80 and HKY85; I've listed them here for you as well. Without getting into too much detail, what these are explaining for you is the assumptions that are being used to calculate those different probabilities of mutation from one nucleotide to another. So there's a bunch of them, actually there are over 20. So how do you choose the best model? Well, it's kind of difficult to know what the best model is. One way to choose is to build a couple of trees with different models and then see how well they explain the data set: what's the most likely evolutionary model to explain that data set? Kind of like how we use maximum likelihood to find the best tree and model to explain a data set, here we are basically just saying, let's take a tree, try a bunch of different models, and see which one best explains that data set, which one gives us that maximum likelihood. It's called a likelihood ratio test, and there's a program called jModelTest that you can download that will run through a bunch of different evolutionary models, compare them all to
each other, and give you the best evolutionary model for that data set. OK, so we're just about done. I want to talk about whole-genome phylogenies. They are not conceptually any different from regular phylogenies, which normally just use a single gene; it's just that we are now extending the amount of phylogenetic information out to an entire genome. These are more recently developed approaches, since the widespread introduction of whole-genome sequencing has allowed us to generate data sets of that scale and to use them. They are the main method now used for doing things like surveillance of a lot of bacterial foodborne diseases, and essentially of other public health priority pathogens: if you can acquire a whole-genome sequence for them, then you're going to do a whole-genome-sequence-based phylogeny. There are two main methods: one is distance and one is character. I'm not going to talk about the distance-based methods; we're going to talk about those a little bit in module three. I'm going to talk about the character-based methods. But I just want to show you an example here where they have 91 genomes, with one phylogeny built from a single gene, the 16S ribosomal RNA gene, and a second phylogeny built using whole-genome distance information, essentially just by BLASTing the genes together and looking at the amount of similarity between them to create the distance matrix. The takeaway message is that the tree on the bottom, which is built with whole-genome sequence information, is much better resolved than the tree up here. And what do I mean by resolved? It's that whole polytomy issue: if you don't have enough phylogenetic information to discriminate the different ancestors, you kind of have to group them together into a soft
polytomy. Soft polytomies here are represented by these triangles, and you can see lots of triangles here that basically say there's a bunch of organisms that are all coming off of this one node, and we can't resolve the one from the other. So there are a lot of them. The two trees generally have the same kind of gross topology, but here you can see there are far fewer polytomies; there are some up here and one over here, but generally you're getting that individual resolution. So it's much, much more powerful and more accurate. OK, here I have 'the most popular approach is based on reference mapping'. That's no longer true, so I'll have to update those slides, but one popular approach is based upon reference mapping to extract the single nucleotide variants from each genome, which you then tabulate into an alignment, and then you use that to build your phylogenetic tree. This is a suitable approach for genomes that are highly similar, because we have to map them to a reference genome, and the reference mapping approach assumes that the genomes that you're mapping to the reference are highly similar. This is kind of what reference mapping looks like. At the top we have this black line, and that represents a reference genome, something that we have, say, extracted out of the public archive as a full, closed, finished, high-quality genome from, say, Salmonella. The blue arrows here are the reads from a newly sequenced Salmonella genome, and what we're going to do is align every read to the reference genome and find its optimal position. So this one maps to this position, this one maps to this position. At the end, after you've mapped all the reads, you'll end up with a pileup. Then you can scan through the pileup column by column and look for the reads that have a nucleotide that differs from the
reference sequence. That is your single nucleotide variant, or single nucleotide polymorphism, so you can extract that out. So it's basically a mapping and then a detection of variants from there. Here's the idea: we have a population of genomes that we've sequenced, for instance. Using the same common reference, we're going to take the reads from genome one, map them to the reference, scan along, and then pull out the variants. But you can't just pull out any variant; you want to pull out the ones that are high quality. A low-quality variant might be one where there's not much coverage, where there are not many reads covering that one position. There are other things as well, and we're going to talk about those a little bit more in the lab section, but essentially what you do is grab the ones that are high quality: there's enough coverage and they pass another set of criteria. Then you can collect those, and that's our first line here. Then you just wash, rinse, repeat with every set of reads from every newly sequenced genome against that original reference sequence, and now we find another set of SNPs. Some are going to be identical to each other in the newly sequenced genomes but different from the reference sequence; some will be the same as the reference but different from each other. This is essentially what you're going to get. When you've finished mapping all of the reads from all of your genome sequences against that reference, you're going to get this collection of either wild types or single nucleotide variants. And you have to throw out a bunch of these columns: all the ones where, say, you have a gap, because there's no sequence information or maybe it's missing in that one genome versus the reference genome. Only the positions that contain either a
wild type or a variant, and that are present within the entire collection, are the ones that can be kept. That's called a core genome SNP phylogenomic analysis. Once you have them all, you can collect them together and create what's called a SNP alignment. This is just like our regular multiple alignment, except that we've excluded all of the identical columns, because that's too much information to build a phylogenetic tree from, although some programs can handle it. So essentially you're going to get this sort of compacted alignment where every column contains some type of variant. That's your SNP alignment, and then you can either use a distance-based approach or maximum likelihood, or anything that's available to you that takes a multiple sequence alignment, and you use it to build your tree. That's it for building SNP-based phylogenomic trees. So just to summarize: it uses just reads, so you don't have to use assemblies, which are computationally expensive and introduce problems. But you can get what we call bad SNPs, not high-quality SNPs, ones that don't actually contain good phylogenetic information. Paralogs, copies that occur in multiple places, and internal repeats can cause problems, so you want to get rid of those. You can get what are called homoplastic SNPs, which are ones that are not consistent with the true evolutionary history of those organisms, and those can occur from things like recombination or horizontal gene transfer. These last couple, there are just three more slides here, are not in your handout, so forgive me for that; I sent the wrong ones in for printing. But there are just three slides, and of course you guys have the electronic copies. So recombination is essentially this
process of breaking one chromosomal segment, or genomic segment, and then incorporating a new segment in its place. There are two main types. One is called homologous recombination, where you have two really highly similar strands, say one that's harbored by the organism and another one that the organism took up from the environment, which they do, and if they're highly similar then the one chunk can be swapped out and replaced from the one genome into the other genome. So that's homologous, and then there's non-homologous recombination, but those are kind of special cases that we're not really going to talk about. Don't worry about all the complications in the diagram; the idea is that, starting with your black and red genomes, you end up with combinations of black and red genomes at the end. They're highly similar, and the way that they manifest themselves is that, because they have a different evolutionary origin when they're popped into that genome, when you map them against the reference sequence they show up with an abnormally high localized density of variants relative to the rest of the collection. So there's a way to find them. Doing it in a statistically rigorous way is very, very computationally expensive, but there are some cheaper methods that allow you to identify possible recombination, essentially by just looking at the rate of occurrence of those variants in your collection of genomes against that reference sequence. And then finally, I just wanted to talk a little bit about genomic islands. Genomic islands are essentially clusters of genes in genomes that have evidence of a possible lateral gene transfer, or horizontal, origin. Recombination is one source of that, but with genomic islands it's
more like a big chunk came from some external environment and was incorporated into the genome, just as an additional insertion. Okay, there's a bunch of different types, like integrons and phage and the integrative conjugative elements, etc. Not going to go into any of the detail about this, but they are important because that is a source of confounding evolutionary information. So when you're trying to build your phylogenetic tree, what you typically want is to capture the clonality, well, the evolutionary relationship of the clonal expansion, right, not this lateral gene transfer that's coming in from random places. So if you include this in your trees, it can confound the trees and make it so you don't get the correct tree. So the primary resource for the detection of genomic islands is called IslandViewer. Actually it's developed in Professor Brinkman's lab here. It can take your genome sequence data: you upload your data and it will go and find those regions of possible horizontal origin. And even though it's not a straightforward thing, hopefully it will be soon, these days what you can do within that site is download those regions, and then you can put them into a phylogenomic tree building tool, like the SNVPhyl tool that we're going to look at after lunch, and say mask that region out. And that's one of the things that you guys should keep in your mind when you go through the lab: issues like recombination and horizontal gene transfer can confound a phylogeny at an entire genome scale versus at a gene scale. Those types of phylogenies are vulnerable to having this confounding phylogenetic information that you don't want in there. So a good phylogenomic tree building program will have some functionality to allow you to say, do not, like,
block, mask out this region, and do not include these regions for consideration in the final phylogeny. Okay.
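To make the last few ideas concrete, here is a minimal Python sketch of three steps described above: collapsing an alignment down to a core SNP alignment, flagging windows with an unusually high SNP density as possible recombination, and masking those regions out before the tree is built. This is a toy illustration under stated assumptions, not how SNVPhyl or IslandViewer work internally. The function names (`core_snp_alignment`, `dense_windows`), the toy sequences, and the window size and threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Region = Tuple[int, int]  # 0-based, half-open [start, end) on the reference


def core_snp_alignment(
    aligned: Dict[str, str], masked: Sequence[Region] = ()
) -> Tuple[List[int], Dict[str, str]]:
    """Keep only columns that are variable, confidently called in every
    genome (the "core"), and not inside a masked region."""
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("all sequences must be aligned to the same length")

    positions = []
    for pos in range(length):
        if any(start <= pos < end for start, end in masked):
            continue  # user asked for this region to be excluded
        column = [aligned[name][pos].upper() for name in names]
        if any(base not in "ACGT" for base in column):
            continue  # missing or ambiguous in some genome: not core
        if len(set(column)) > 1:
            positions.append(pos)  # at least one genome differs here

    snp_aln = {n: "".join(aligned[n][p] for p in positions) for n in names}
    return positions, snp_aln


def dense_windows(
    positions: Sequence[int], genome_length: int, window: int, max_snps: int
) -> List[Region]:
    """Cheap recombination screen: flag non-overlapping windows that hold
    more than max_snps variant positions."""
    flagged = []
    for start in range(0, genome_length, window):
        end = min(start + window, genome_length)
        count = sum(1 for p in positions if start <= p < end)
        if count > max_snps:
            flagged.append((start, end))
    return flagged


# Toy data (hypothetical): a reference and two isolates, already aligned.
aligned = {
    "ref":  "ACGTACGTACGTACGTACGT",
    "iso1": "ACGTACGTACGAACGTACGT",
    "iso2": "ACCTACGTTGCAACGTACNT",
}

positions, snp_aln = core_snp_alignment(aligned)
flagged = dense_windows(positions, genome_length=20, window=4, max_snps=2)
masked_positions, masked_aln = core_snp_alignment(aligned, masked=flagged)

print("SNP positions:", positions)
print("Flagged windows:", flagged)
print("Masked SNP alignment:", masked_aln)
```

The resulting `masked_aln` is the kind of compact alignment you would then hand to a distance-based or maximum likelihood tree builder. In real data the regions to mask would more likely come from a tool such as IslandViewer or a dedicated recombination detector than from a fixed window count like this one.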