Alright, so I am going to give you an introduction to phylogenetics, or really a review of phylogenetics, just to get everybody onto the same page, and then move on to phylogenomic trees built using SNP information. I have been giving variations of this phylogenetics lecture for probably about seven years. The first time I gave it, it was about three hours long, and I keep trying to trim it down, but every time I do it seems to end up getting longer and longer. So I have done my best, but this may take up the better part of an hour, and I don't want to rush through it, so I may not go through all of it. The information that I am giving you here is more than you will need, but it's really the minimum amount to properly explain phylogenetic trees. So if you find that it's getting harder and harder to understand, don't worry about it. It's all in the lecture notes, so you can go and review the presentation again, and of course all the instructors here are experts in phylogenies. Some of the top experts in Canada on phylogenies are here. Not me, but I'll struggle through this and see how I do. Just stop me if you need any clarification.

So the objectives of this module are to understand the fundamentals of phylogenetic trees and the terminology, a little bit about how to interpret phylogenetic trees, and some of the ways that we build phylogenetic trees. Phylogenetics, simply stated, is the study of the evolutionary relationships among a given set of biological organisms. Evolution, most simply stated, is descent with modification. Some progenitor reproduces, and in the process of reproducing introduces mutations and variation. Those variations are subject to selection based on their fitness to survive in a given environment or ecological niche.
So there is this process of reproduction and selection, depending on whether the variations introduced in this imperfect reproduction process are beneficial or detrimental to that organism's survival. What we try to do is infer that evolutionary relationship, and the data that we normally use to do that is DNA sequence data. Traditionally it's been genes. Some genes are better than others, but they do a pretty good job of capturing the variation and the relationships between organisms. That's not the only type of information that can be used, though. In early systematics they used things like the shapes of organisms, bone structure, for example, or flower structure. Variation in languages can be used to infer phylogenies, and actually Rob has a really good one on maximum parsimony where he uses variation in language to determine the relationship between triad and, I think it's, Othocetia. So anyways, that's the essence of it.

The core of phylogenetic analysis is the phylogenetic tree, which is a graphical representation of that inferred, hypothetical evolutionary relationship between the organisms under study. It's basically the estimated family tree for the biological entities under study, typically genes or species. The most notable features of the tree are that it is composed of nodes and branches. The nodes represent the common ancestor of all the descendants underneath that node, and the edges can have variable meaning. Sometimes they correspond to the degree of divergence between the species under study, sometimes not. We'll cover that here.
So when we want to build phylogenetic trees, the data that we have access to is typically only from living organisms. Not all the time; there are some very interesting studies where they can actually extract ancestral data. But most of the time, and for our purposes, what we're doing is collecting genetic or genomic sequence information from things that are living, or were recently living, from our sample collections. So for those internal nodes that connect them, the common ancestors, we typically don't have data. What we're trying to do is use the data that we do have to infer those nodes and the connectivity between them, and that ultimately gives us our phylogenetic tree, which represents the ancestry of all of the organisms that we have data for. The data that we have are represented at the tips of the tree, at the very edge of the tree, called the leaves, also sometimes called the OTUs, or operational taxonomic units.
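To make the vocabulary concrete, here is a minimal Python sketch of a tree as a data structure. The nested-tuple representation, the example tree, and the function names `leaves` and `count_internal_nodes` are illustrative assumptions and not something shown in the lecture: a string stands for a leaf (an OTU we have data for), and a tuple stands for an internal node (an inferred common ancestor).

```python
# A leaf (OTU) is a string; an internal node is a tuple of its descendants.
# Hypothetical example: A and B share an ancestor, which shares an ancestor with C.
tree = (("A", "B"), "C")


def leaves(node):
    """Return the OTU names at the tips below this node, left to right."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node:
        result.extend(leaves(child))
    return result


def count_internal_nodes(node):
    """Count the inferred ancestors (internal nodes) at or below this node."""
    if isinstance(node, str):
        return 0
    return 1 + sum(count_internal_nodes(child) for child in node)


print(leaves(tree))                # ['A', 'B', 'C']
print(count_internal_nodes(tree))  # 2
```

The point of the sketch is only that the data live at the leaves, while every tuple is something we had to infer.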
So the trees that are used to represent the process of evolution make a couple of important assumptions that don't always hold. One of the most important ones is that the trees are bifurcating, which means every node has two and only two descendants. So the tree itself is the structure that we use to depict the evolutionary history of that group of sequences or that group of organisms. We have the nodes and the branches. As I mentioned, the terminal nodes are called leaves or operational taxonomic units. It's called that less often now, because that phrase has been co-opted in metagenomics and can start to confuse people, but you'll see it occasionally here in this lecture. The OTUs are the ones that we have the data for. Those are the ones that exist right out on the tips of the tree, the leaves or the terminal branches. They're connected to each other by their nodes, which are the common ancestors for those descendants, and the branching pattern is bifurcating.

Sometimes you might see a tree that has more than two branches coming off of one single node. That's referred to as a polytomy, and it can be one of two types: a hard polytomy or a soft polytomy. A soft polytomy is when you don't have enough information to determine the actual bifurcating relationship between a group of organisms. There's no informative data that can separate them. They all look sort of the same; they look like they've equally evolved from their common ancestor. So that's called a soft polytomy. Sometimes you'll see an entire clade represented just by a triangle in a phylogenetic tree, and that's also called a soft polytomy. Sometimes you can have a hard polytomy. There are rare instances where, instead of speciating into two different species, organisms speciate into three
different species. This can occur sometimes from unusual phenomena. Say there is an earthquake or something like that, which causes a group of organisms that are geographically restricted to end up being separated out into multiple different areas, more than two, like three or four, and then they're allowed to evolve on their own. They basically have one common ancestor that evolves into maybe three or four different progeny, and that's called a hard polytomy.

Okay, so there are two types of trees in phylogenetics: the cladogram and the phylogram. The cladogram is the simplest type of tree. It only shows the relative recency of common ancestry. It doesn't give you a sense of how much the organisms have diverged from each other, only which descendants are associated with which common ancestors. So the important information in a cladogram is the topology of the tree, not the lengths of the branches. In this example we have three OTUs, that's what we have data for, and it just tells us that A and B have a common ancestor, and that they share that common ancestor more recently than either A or B shares one with C. We'll talk about rooted trees, but if this is a rooted tree, what we would say is that this is the most ancient common ancestor and this is the most recent common ancestor.

Phylograms show the same information. The topology of the tree determines the relationships between the progeny and their ancestors, but the length of the branches gives us additional information about the amount that they've diverged from each other. In this example we have again three OTUs, or three taxa, and this tree shows us that A has acquired more substitutions than B in the time since they shared a common ancestor. You can see there's a lot of divergence here between A and B,
whereas there's very little divergence here between the ancestor of A and B, and C. The way that we determine that is by adding up the numbers on the branches, which represent the amount of divergence. Between A and B we have four and two, so we have a relative distance of six, whereas between A and C we also have six, but between B and C we only have four. Okay, questions about phylograms?

Okay, so the stuff here is pretty easy. Trees can be in any orientation; it doesn't affect their interpretation. Whether they're horizontal or vertical is just an arbitrary way of orienting a tree that has nothing to do with the information it contains. The order of the leaves is also not informative, and that's something people sometimes find a little more surprising. It's really only the connectivity of the nodes that is informative. What we can do, and this actually gets done a lot, and you'll see some more interesting examples of this in Rob's lecture, is rotate the branches to get new orders of the taxa without changing the evolutionary relationship between them. So you can rotate around any node and preserve the information in the tree.

Okay, so trees can be rooted or unrooted, and this has to do with determining the absolute ancestry and the order of descent of the collection of organisms that you're studying. The root is the ancestor of all of the organisms in your tree, and if you have a tree with a root, it will give you that absolute ancestry from older to newer. That's an absolute order of descent. It's important to remember that you don't always have that information when you're doing a phylogenetic tree-building exercise. In fact, under most circumstances you don't have that information, in which case you have an unrooted tree, and an unrooted tree does not give you the absolute order of descent, only
the relative order of descent between the organisms that are studied, and so it's less informative than a rooted tree. Sometimes, in order to convey that a tree does not have a root, it will be depicted in what's called a radial format. This slide doesn't do a great job of showing it, but it's kind of like a star pattern.

Okay, so here again is a little bit of a comparison of rooted versus unrooted trees. A rooted tree will give you the absolute order of ancestry, that order of descent. Here in this rooted tree you can see the taxa that you have data for: modern species one, modern species two, modern species three. This is the ancestor of two and three, this is the ancestor of one, two and three, and then this is the ancestor of one, two, three and four. If you don't have a rooted tree, then this is all the information that you have here. You can't tell which organisms are ancestral to which other organisms, only their relative ancestry to each other. Does that make sense?

Okay, so how do you root a tree? There are a number of ways to root a tree. One of the simplest is to take an additional organism that you know branched off before all the rest, an outgroup, and include it when you build the tree. Because you know that it sits outside all those other organisms, you can determine the root; you're basically placing it at the root. In this example we have zebrafish, chimps, humans and cows. Three are mammals, one is not, right? So we know that the zebrafish lineage split off before the mammals diverged from each other, and we can add a root here. In this case we have now determined the absolute order of descent among these four species. In our world of microbial genomics, you're typically working with a lot of organisms, a population of genomes that are from the same species, or maybe even from
the same subspecies, and so if you want to root your tree it's a pretty simple matter to go into GenBank and choose something from another species but within the same genus. That one is basically going to sit outside everything that's all of the same species, so it's easy to root a tree that way. Outgroups can be problematic, though, because if they're too divergent they can make it difficult to build a tree. If you don't have a suitable one, there are alternate procedures to root a tree. If you have molecular clock data and phylogeny-building software that can incorporate time information into the analysis, then you can essentially infer the root of the tree without having to add an outgroup. It's less commonly done, and this is where we start to talk about Bayesian programs like BEAST or MrBayes, but it can be done.

Okay, so that covers a lot of the introduction to the trees and a little bit about their interpretation. Now we want to talk a little bit about how we build them. The important concept here is that the number of possible trees grows faster than exponentially with the number of organisms that you want to study. This is the equation that defines the number of trees, but if you take a look here you can see how quickly the number of possible trees grows as the number of organisms that you want to study grows. If you have three OTUs, there's only one tree that can depict them. There are three trees that can depict four OTUs, fifteen for five, about two million for ten. I think it's somewhere on the order of 50 OTUs where the number of possible trees becomes greater than the number of atoms in the universe. Something like that; I may be wrong, but it's on that order. So you really need to build trees intelligently, and some ways are better than others.
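The equation on the slide is not captured in the transcript. The counts quoted (1, 3, 15 for three, four and five OTUs) match the standard count of unrooted bifurcating trees, (2n - 5)!! = 3 × 5 × ... × (2n - 5), so the sketch below assumes that formula. The function names and the 10**80 figure for atoms in the universe are likewise assumptions made for illustration.

```python
def num_unrooted_trees(n):
    """Number of distinct unrooted bifurcating trees on n >= 3 labeled OTUs.

    Assumes the standard double-factorial count (2n - 5)!!.
    """
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd numbers 3, 5, ..., 2n - 5
        count *= k
    return count


def first_n_exceeding(limit):
    """Smallest number of OTUs whose tree count is greater than limit."""
    n = 3
    while num_unrooted_trees(n) <= limit:
        n += 1
    return n


for n in (3, 4, 5, 10):
    print(n, num_unrooted_trees(n))  # 1, 3, 15, 2027025

# Rough, commonly quoted estimate of atoms in the observable universe (assumption).
print(first_n_exceeding(10**80))
```

Python integers have arbitrary precision, so the count stays exact even when it runs to dozens of digits.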
But keep in mind that there are lots of possible trees that you can build with your taxa, and what you really want to do is build one that's correct. There's going to be one that's probably most correct, others that may be close, and a lot that are not, and finding your way through that is a bit of a difficult problem.

The process of building a tree, in essence, using modern tree techniques, is to take a multiple sequence alignment, which you have up here, and a model of evolution. Sometimes the model is assumed and sometimes it's explicit. It gives you information about how the nucleotides or the amino acids are mutating within that data set: the rates of mutation, the substitutions, etc. We'll take a look at some variations of those models later on. With that combination of the multiple sequence alignment, which is the data, your model of evolution, and some type of algorithm, typically some type of clustering algorithm, you can use the data and that evolutionary model to build your tree. We'll take a look at a couple of different ways to do this. In the past I would give an explicit example, going all the way through to show you how to build a tree, because even though it seems complex, in most cases it's actually pretty straightforward. But it does take up a lot of time. If anybody's interested in seeing a little bit more detail about how to actually build trees, you can come talk to me or to Rob later.

So there are essentially three major types of tree-building techniques. One is called distance-based, one is called character-based, and the third is Bayesian. The distance-based methods are the ones that are pretty straightforward to understand and very quick to build. The idea of these
is, well, the central data structure used in these distance methods is something called a distance matrix. We look in our multiple sequence alignment at the number of differences that we see in an all-against-all comparison, and then we use that information to infer our tree. A common and very simple tree-building method that uses a distance matrix is called UPGMA, the unweighted pair group method with arithmetic mean. There's another one called neighbor joining; we'll talk about that in a sec.

So essentially the procedure here is to look at the number of differences that we see between all the sequences when we compare them to each other. If we take a look at A and B, we can see there's one difference here, two differences here, three differences here, and the rest are all the same. So there are three differences between the two, and we pop a little three into our distance matrix there. Do the same thing for all of them in an all-against-all comparison and you'll fill out the matrix like you see here.

Once you have your distance matrix, the next step is to try and come up with a tree that incorporates all of those distances, so that they map one-to-one to each other. That's the idea of it, but it turns out that it's not so straightforward. It could be the case that there is no tree that can exactly explain the distance matrix that you provide, so what you want to do is come up with one that's pretty close. If you have a distance matrix, you can use it to generate a tree, and you can use that distance information to supply the divergences on the branches of the tree, but it may not be the case that they all add up so that every possible pairwise distance maps exactly
back to that distance matrix. So what you want to do is come up with the best fit, and that's basically trying to minimize the sum of the differences between the tree and the distance matrix. That's our criterion for the best fit. It turns out that finding that best fit is a hard problem. In computer science terminology it's called NP-hard. Does anybody know what that means? Well, I'm not a computer scientist, but my understanding of it is that, as far as anyone knows, the only way to guarantee that you've found the best answer is essentially to evaluate every possible solution. I'm not going to get into it, I don't have time, but that's the idea of it. So you can use a brute-force method to evaluate every possible tree and see which one is closest to your distance matrix. But remember, when you have 10 OTUs you have 2 million trees to evaluate, and that can take a long time. So the brute-force method is probably not the best method, which is why we want to come up with simpler methods that, although they're not guaranteed to build the correct tree or the most correct tree, do a pretty good job of getting us close enough.

So the UPGMA method is essentially an iterative clustering method. It looks at that distance matrix and starts by combining the two OTUs that are the most similar to each other, the ones with the least distance between them, and assigns a common ancestor to them. In this case here, if these two have the smallest distance, it will draw a little circle around them. Then it asks, where's the next smallest distance? Here's another small distance, between these two OTUs in our little distance space, so it draws a circle there. Then it asks, what's the next smallest distance? It turns out it's between this OTU and this collection.
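As an aside at this point in the walk-through, the two steps described so far (counting differences all-against-all, then repeatedly merging the closest pair) can be sketched in Python. This is a minimal illustration under stated assumptions, not the lecturer's implementation: the four sequences are made up, the function names are hypothetical, distances are raw difference counts with no evolutionary model, and the function returns only the branching order as nested tuples, without branch lengths.

```python
from itertools import combinations


def hamming(a, b):
    """Count positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(alignment):
    """All-against-all difference counts, keyed by a frozenset of two names."""
    return {frozenset((p, q)): hamming(alignment[p], alignment[q])
            for p, q in combinations(alignment, 2)}


def upgma(alignment):
    """Merge the two closest clusters until one remains; return nested tuples."""
    dist = distance_matrix(alignment)
    size = {name: 1 for name in alignment}  # cluster -> number of OTUs inside
    while len(size) > 1:
        pair = min(dist, key=dist.get)      # closest pair of current clusters
        a, b = sorted(pair, key=str)        # fixed order for reproducible output
        merged = (a, b)
        for other in size:
            if other in pair:
                continue
            # Arithmetic mean over all member OTUs: weight by cluster size.
            dist[frozenset((merged, other))] = (
                size[a] * dist[frozenset((a, other))]
                + size[b] * dist[frozenset((b, other))]
            ) / (size[a] + size[b])
        # Drop every distance that still refers to the two merged clusters.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


# Hypothetical aligned sequences, chosen only for illustration.
alignment = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACTTGCGA", "D": "TCTTGCGA"}
print(distance_matrix(alignment)[frozenset(("A", "B"))])  # 1
print(upgma(alignment))  # (('A', 'B'), ('C', 'D'))
```

Each pass through the loop corresponds to drawing one circle on the slide: one new inferred ancestor per merge.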
So we'll draw a circle there, and then finally it's this collection here and this collection here that are the most similar, so it draws another connector around those, and then that's your tree. Essentially what it's doing is iteratively assigning common ancestors to the two organisms or groups in your tree that have the smallest distance from each other, until you get a tree. The advantage is that it's extremely quick. You can see that there's only a limited number of operations, a linear number of merging steps, involved in connecting up your tree: one ancestral sequence per step. The last inferred ancestor is made the root of the tree, so you are inferring a kind of root for the tree, and that may not really be true. It also assumes that all the lineages evolve at the same rate. It just connects them equally together, so the branch lengths from an ancestor down to its tips are always the same. That's called an ultrametric tree. It says everything is just dividing and evolving away from each other at the same rate, which is probably not true. So it doesn't build the most accurate trees, but it will build you a tree very quickly, so it's good to use when you're working with large numbers of elements in your phylogenetic tree.

Neighbor joining is an improvement over UPGMA, most notably because it can assign different relative branch lengths in the process of building the tree. Whereas UPGMA is just looking at the two closest neighbors, neighbor joining tries to take into account the relative distances between all of the taxa in your tree. It starts with the same type of process as building a UPGMA tree, where you make a distance matrix, but then it takes a different tack. What it does is use this information to create a second matrix called a Q matrix, and the Q matrix is basically, what
it's looking for is kind of the distance from each OTU to the center of mass of all of the different OTUs. It will effectively give you a value that tells you the degree to which two of the OTUs, or taxa, are similar to each other and far away from the rest of the taxa, and it maximizes that distance. When you have that information, rather than just attaching the two at an equal distance like UPGMA does, it can say these two each have their own relative distance from the center of mass, and that means I can place that node asymmetrically between the two. So instead of having a distance of one and one on each new branch that you're connecting, you can put, say, one and four, or one and five, because you have this additional information.

You start out with a tree in what's called a star decomposition process, because the tree itself is first depicted as a star topology, or a polytomy. Then as you go through that process of analyzing that Q matrix and assigning the most recent common ancestors, you decompose that star tree into a bifurcating tree, iteratively, until the tree is completely resolved. At this stage you have a tree that is unrooted, but it has assigned different branch lengths to the different branches in the tree that correspond to the rates of evolution of those OTUs from each other. So it's an improvement over the UPGMA approach. It is also not guaranteed to find the most optimal tree, but through trial and error, basically, it has been shown to be a very good, very accurate method. You will see a lot of neighbor-joining trees in the literature, especially for large trees.

Okay, so that sums up the distance methods. I have a slide here that I'm not going to go through, because I just explained it. I'll just leave it for you guys to maybe read later if you want to get
that summary. But yeah, go ahead.

It depends on what they are. Whenever I'm reviewing any paper, my overriding motivation is to make sure that the conclusions are supported by the data. So if they have a UPGMA tree, it's fine as long as the conclusions that they draw from it are supported by that data, right? So it really depends. I think sometimes people might overreach a little bit. I would be skeptical if they were to use a UPGMA tree to identify the most recent common ancestor of all of the species in the collection. But most people who build a UPGMA tree are doing it because they want a quick, rough-and-dirty look at what that relationship could be when using a large collection.

Are there any other examples where it's an advantage to use a UPGMA tree? Maybe just when you don't have much additional information available to you about the rates of evolution.

Nobody uses it? Because I'm working on a project right now where the person I'm working with gave me a UPGMA tree, and I'm like, why did you bring this? Because it's in China, this is what, it's plug-and-play, right?

Just on that point, in the States, PulseNet through CDC, we're doing whole-genome MLST, and it's allelic distances.

Yeah, I understand. So PFGE is where you're just looking at restriction digest band patterns. You can cluster those together, but you might be making too much of an assumption that the number of differences you see in the band patterns is related to the degree of divergence of the organisms, because a difference could happen essentially as a random event, the incorporation of a phage or something like that. So things that might look very distant from each other are actually very close to each other, and so a UPGMA tree, which doesn't make those assumptions,
would be a more unbiased representation of that type of data set. Let's move on.

Okay, so we've covered the distance-based methods. The character-based methods are an alternate and, some people feel, more accurate way of building a phylogenetic tree, if you have that data available to you. There are two major types: one is called maximum parsimony and the other is called maximum likelihood. The maximum parsimony approach involves finding the tree that describes the data, the multiple sequence alignment, with the fewest evolutionary changes contained in that tree. Maximum likelihood involves finding the tree that maximizes the probability that the data set would have been produced by that tree. You can include probabilistic models of sequence evolution in a maximum likelihood tree, and that's one of its big advantages.

So let's take a look at maximum parsimony. I'm not going to go through all of this, but just to give you a sense of how it works, here's our multiple sequence alignment of four taxa, and we're asking the question: which tree explains this in the simplest way, with the least number of evolutionary changes required? You know that if you have four OTUs, or four taxa, there are at most three trees that you can build. So let's build them. Then we place the characters on the tips of the tree, for all the different possible trees, and we look to see which tree can give us back that column of data with the least amount of changes. So the question is which tree explains the data in the simplest way. We've taken the data here: one is A, two is A, three is G and four is G. These are all the different ways that we can arrange one, two, three and four. We place the corresponding characters here, and now we basically have all the different possible combinations.
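The change counting for this one column can be checked with a short Python sketch. It uses Fitch's small-parsimony counting, which the lecture does not name, so treat that choice as an assumption; the function name `fitch`, the tree labels, and the decision to write each unrooted four-taxon tree as a pair of pairs (rooting it on the internal branch, which does not affect the count) are also illustrative.

```python
def fitch(node, states):
    """Return (candidate ancestral states, minimum number of changes) for a subtree.

    A leaf is a taxon name (string); an internal node is a tuple of two subtrees.
    """
    if isinstance(node, str):  # leaf: its observed character, no changes yet
        return {states[node]}, 0
    left, right = node
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:                 # the two sides can agree: no extra change here
        return common, left_changes + right_changes
    # The two sides cannot agree: one change is needed on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


# Column one from the lecture: taxa 1 and 2 have A, taxa 3 and 4 have G.
column = {"1": "A", "2": "A", "3": "G", "4": "G"}

# The three possible groupings of four taxa.
trees = {
    "(1,2),(3,4)": (("1", "2"), ("3", "4")),
    "(1,3),(2,4)": (("1", "3"), ("2", "4")),
    "(1,4),(2,3)": (("1", "4"), ("2", "3")),
}

for name, tree in trees.items():
    _, changes = fitch(tree, column)
    print(name, changes)  # 1 change for the first tree, 2 for the other two
```

Repeating this for every column and summing the changes per tree gives each tree's total parsimony score.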
Now we have to look and see how we get from one character to the other within each tree: what type of evolutionary change had to happen? Here, the common ancestor of these A's can still have an A at this position, and these ones would have a G, so the ancestor would have had to have an A-to-G transition in there. In this arrangement you would have to have an A-to-G transition to reconcile this branch here, and then you'd have to have another A-to-G transition to reconcile this branch here. So now you have two changes that are needed to explain this tree, and so on. In all of them except for this one, you have two changes that are required to explain the tree. So which is the simplest tree? It's the first one, right? That's right. So this is our best tree for column one.

Now we have to go through columns two, three, four and five. We're going to find which tree is the best explanation for each column, and then at the end we add up the changes each tree needed across all the columns and find out which one has the smallest number of changes. It turns out that if we do that exercise, it's the first tree. It's our most parsimonious tree: only five total changes had to happen.

Interestingly, if you take a look here, you can see it kind of makes sense, because you've got A's here at the top and G's here at the bottom, T's at the top here and C's here, so splitting them right down the middle seems to make good sense, and that's what we did for that first tree. The problem is position three, where that actually doesn't make sense, and that's why the first tree is not the best tree for position three. In position three, this second tree here actually explains it with only one change. So that SNP is called a homoplastic
SNP. It's basically something that is not consistent with the ancestry inferred by that tree. Homoplasies are like when things look the same in the descendants but weren't the same in their common ancestor. The term is often used to describe convergent evolution. The formation of the eye has been hypothesized to have occurred independently at least 10 times. The genes that gave rise to our eyes and the genes that gave rise to octopus eyes are not the same genes; they don't have a common ancestor, so they arose independently. But sight is so important for life that these kinds of things can and will arise, because they're the most fit for a certain scenario, even though they don't arise from a common ancestor.

All right, so that's maximum parsimony, and you can see that for a given alignment there can be more than one tree. If you have a small number of OTUs, you can just build all the different trees and do all those calculations, but as you know, the number of trees explodes as you go up to very large numbers of taxa. So now you have to use pruning algorithms to figure out whether there are some sets of trees that you can safely ignore, because you know that other sets of trees are always going to be better. There's a branch-and-bound algorithm that's used essentially to trim the neighborhood of possible trees down to a smaller neighborhood of optimal trees, and that works pretty well for maximum parsimony.

Maximum likelihood kind of works the same way, where you're going to go column by column and try to evaluate what the best tree is, but now you can incorporate directly, and you can do this
for neighbor-joining trees as well, but it's a lot more intuitive for maximum likelihood. In fact, maximum likelihood requires that you have a probabilistic model of evolution, so that you can create probabilistic formulations and mathematically and quantitatively calculate what the best tree is, rather than just using this smallest-number-of-changes type of approach. So in this approach, what we're going to do is take a look again at the first column, and there are going to be a bunch of different trees. But instead of just saying that an A to G substitution is the same thing as an A to C or an A to T, now we can say those may not happen with equal probability, and in fact they don't. So let's incorporate a model of evolution that can assign the relative probabilities of different mutations. That's where the model of evolution comes in. One of the simplest models of evolution is one that incorporates transitions versus transversions. A transition is a purine-to-purine or pyrimidine-to-pyrimidine exchange, so that's an A to a G or a T to a C. Because those bases have similar structures, when a cell is replicating its DNA you can swap them more easily than you can swap the more different skeletons. Transversions, where you're getting these two-ring to one-ring swaps, occur less often because they're just not tolerated very well. Also, it turns out that in coding sequences a transition, a purine-to-purine or pyrimidine-to-pyrimidine switch, more often gives rise to a silent mutation, so that it's not reflected in the encoded protein; basically it's
not giving an amino acid change, and so that's easier to tolerate; it's not subject to the same type of selection pressure, so those will actually accumulate more often. Even though there's a two-to-one chance of getting a transversion based on simple odds, the way it actually works out is that transitions are observed about twice as often as transversions. So you can incorporate those probabilities into your evolutionary model, the probability of getting an A to G transition or an A to T transversion, et cetera, directly into your calculation. So here's your probability, which is essentially looking at all the possible variations and coming up with an overall probability. You have to go through every possible tree and look at every possible substitution, and you come up with the one that gives you the maximum likelihood for that tree in that one position. Then you have to go through every column that contains an informative site and redo that calculation, but at the end what you will have is an actual number that tells you which tree best explains that data set. So it requires searching an enormous number of trees, and so it's very computationally intensive. It also uses some heuristic methods to try and reduce the amount of time, but it is by far the most computationally intensive method. But if you have a small number of OTUs, and maybe you have a very fast tree-building program like RAxML, which can be distributed across an entire high-performance computing cluster, then you can build some very accurate maximum likelihood trees. So which is the best tree-building method? Well, the correct answer is there's really no single best method, but if you asked me, I would tell you it's maximum likelihood. But it really
depends on the size of the tree that you're building, how much computational muscle you have access to, and really the application for which you're building the tree. So moving on to bootstrapping. When we build a tree and we say, okay, we think this is a good tree, well, how do you defend that tree? We're not really saying there's a 95 or 99 percent chance that this is the correct tree. All we're doing, especially with things like maximum likelihood, is saying this is the best tree that we could find. But does that mean that it's correct? We don't have that kind of information yet. Bootstrapping is a semi-statistical procedure that you can use to help argue that the tree that you generated is the correct tree, and you can use it for distance methods or for character methods. Essentially, the idea behind it is that you're going to take the input multiple sequence alignment and manipulate it a little bit, swap out some columns for some other columns, rebuild a new tree from that, and see how much it looks like the original tree. If the evolutionary data that has been captured in that multiple sequence alignment is so strong that it can handle a little bit of perturbation and will still build you the same tree, then you've got a lot of confidence that you've got a good strong evolutionary signal in there. If, on the other hand, you have, say, 15 columns that you're using to build your tree, and you swap one column out and all of a sudden you get a completely different topology, well, probably neither of those trees is a very good tree, and you don't really have a very strong phylogenetic signal in the first
place in that alignment that you're using to infer the tree. That's the concept. The idea is to do this a bunch of times, typically a hundred or maybe a thousand times: rebuild the tree, then compare all the trees and look to see how much they are the same and how much they are different. So here's the process, where we are doing what is called sampling with replacement. We choose columns at random by throwing an eight-sided die, I'm sure everybody's got one in their pocket right now, and say, okay, this is the column I'm going to take out and use to build my new tree. You do this until you have rebuilt a new alignment that's the same size as the original alignment, but built with a different set of columns, including possibly multiple copies of the same column, with some other columns missing, because it's sampling with replacement: you just throw that die, and whatever column you get from the die is the one that you choose. In this example here you could choose columns six, one, six and eight, and so six is getting chosen twice. If you duplicate some columns, obviously you're going to be missing some others. You rebuild your tree with that resampled alignment, then you look at every node in the tree and you say, do I have the same descendants under this node in these two trees, or are they different? If they're the same, then you add a little check mark, and if they're different, then you don't. You do that a hundred times, and you keep adding check marks every time you get a node that contains the same set of descendants. If a large number of the trees have the same set of descendants arising from that same common ancestor, then that's good evidence that you have a robust tree-building procedure and that the
trees that you're building are correct trees. So that's kind of the idea there. If you build a tree a hundred times and the bootstrapping information is presented on it, typically what you see is some number attached to each node, maybe something like 50 or 75 or 80, and what that means is that 80 of my hundred trees all had the same set of descendants under this node. Are there questions about bootstrapping? No? Okay. So moving on to evolutionary models. We had a small introduction with the maximum likelihood trees when we looked at transitions versus transversions. What the evolutionary models are doing for us is trying to increase the accuracy of the trees by incorporating known information about the modifications and the rates of divergence that can occur between the organisms that we study. This manifests itself in different, and hopefully more accurate, branch lengths compared with simple approaches. So the simplest model is, oh, when we transferred the slides, some of my equations did not come out perfectly in this version, but if you have the electronic or printed versions they should look okay. They're not too bad, but I just wanted to warn you that they didn't come out perfectly. Anyways, the simplest model is called the p-distance, and really it's just telling us: let's take a look at my multiple sequence alignment and look at the number of substitutions that I find over the entire alignment. So if you have an alignment of a hundred columns and you find that there are five substitutions, then that would be five substitutions per hundred columns, a p-distance of 0.05. So that's a very simplistic measurement, basically the number of positions that differ over the length of the alignment. So let's call it your p-distance, and it's just
differences over length. Okay, so there's a better one called the Poisson-corrected distance, and this takes into account what's known: that you can have multiple substitutions that occur at the same site. These are hidden mutations that you don't see, but they still have occurred in the history of the evolution of that set of organisms and their ancestors, so you can assume that there are maybe more mutations than the ones that you actually see in the multiple sequence alignment that's presented to you. The Poisson distance correction takes that into account and adjusts the branch length for you. The Poisson statistic is a kind of statistic that says, for rare events, how often would we expect them to occur in a given period of time? So if you were to watch trains go by a train station, you might want to ask, in one hour, what are the odds that five trains will pass through the station? You can use the Poisson statistic to calculate the odds of that happening. The gamma distance correction, in terms of sequence correction, takes into account the fact that you may have different rates of substitution. The Poisson distance correction just says every column is mutating at the same rate, say once every 20 generations or something like that. The gamma distance correction says different nucleotides may be evolving at different rates, like the wobble nucleotide, which is the third nucleotide of a codon in coding sequences and typically evolves faster than the other two, because that's the one that contains the redundancy, so you have more silent mutations occurring in your third position. In the train analogy, it's like asking what the distribution of time is between the five trains that are on, say, a 20
minute schedule at this train station. You expect them to come through every 20 minutes, but sometimes they come through at 19, sometimes at 21 or 22. If you want to take a look at that distribution, that's the gamma distribution: the variation between events that occur in a certain period of time. The gamma correction is characterized by a parameter called the alpha parameter, which determines the amount of rate variation among sites. You can choose this parameter, but it gives rise to considerably different shapes in the distribution, so you kind of have to have an idea of what the alpha value is for the data set that you're analyzing. That's easier said than done, because the values can range between about 0.2 or 0.3 and 3.5, which gives rise to really significantly different distributions. Anyways, those are the distance correction measurements. There are also the substitution models, like the one we mentioned with transitions versus transversions, that you can employ. They are a way to capture the differential rates of variation between the different types of substitution, and they can incorporate other kinds of evolutionary information in a statistical model. So that's why you'll sometimes see people publishing their trees with something like GTR plus gamma: they've got the gamma correction there, plus they're using a model of evolution called the general time reversible model. There are a bunch of different models out there that incorporate different evolutionary theories; here's a small list of them. The best model? Well, that's hard to know. There are programs where you can give the multiple sequence alignment to the program and it will check all the different models that it has access to
and say, this is the one that best matches the data set. That is sometimes what people choose to incorporate into their tree building; some people argue that that's not a very good method. Okay, Bayesian trees. So these are kind of a newer method of generating trees, developed just around the turn of the century, and it's becoming increasingly popular, especially because you can incorporate a lot of information in it, and that makes your ability to generate trees much more powerful than with other types of approaches like maximum parsimony, maximum likelihood or the distance-based methods. But it is a bit of a complicated procedure, so I'm going to try and simplify it for you and just go through the basics so you get a bit of an understanding of what's going on. So, well, it uses Bayes' theorem, and the focus of Bayes' theorem is on conditional probabilities. So if I were to ask you, what are the odds that it's raining, you would give me a probability. But if I were to ask you, what are the odds that it's raining given that it's cloudy, you would give me a different probability, because now you are including additional prior information in the calculation of that probability. That's essentially what Bayes' theorem is: it formalizes the process of working with conditional probabilities. So here's a public health example. You wake up in the morning and you have spots, and you're worried that you may have smallpox. So you go to the doctor, who is not a great doctor, and he looks in the book and he finds that 90 percent of people with smallpox present with spots on their face, just like you do. So should you be worried that you have smallpox? Who thinks that you should be worried? Maybe; it kind of depends. The problem is that the doctor hasn't really answered
your question, right? The doctor provided you with the probability of having spots given the hypothesis that you have smallpox: he says 90 percent of the people who have smallpox present with these spots. But there are other ways for you to have spots. You could have measles or, like here, bed bug bites. The real question that you want answered is, what are the odds that I have smallpox given that I have spots? And that's what this notation is showing us here, the conditional probabilities. The doctor tells you the probability that you have spots given that you have smallpox; you really want the probability that you have smallpox given that you have spots. So, well, how do you calculate that? You can calculate it with Bayes' theorem, but you need to know a little bit of extra information, because we're reversing the data and the hypothesis here. If we want to find out what the chances are that we have smallpox given that we have spots, it's useful to know what the prevalence of smallpox is out in the general population. Now, we know it's zero, but for the purposes of this contrived example let's just say that it's 0.001. We also need to know just the rate of having spots: if you drew a random person from the population, what is the probability that they're going to have spots for any reason? It could be from measles, it could be from smallpox, it could be from bed bug bites, it could be for any number of reasons. So that's an additional piece of information that you need in order to switch between the hypothesis and the data, between what has been given to you by the doctor and what you're really asking. So the calculation got messed up a little bit here on the slide, but what we're trying to say is that the probability that you have smallpox given that you have spots is equal to the probability that you
have spots given that you have smallpox, times the probability of smallpox, and you have to divide that by the probability of spots in the wild. So the actual probability that you have smallpox given that you have spots is going to be 0.9, which is what the doctor told you, times 0.001, which is the prevalence in the population, and you divide that by 0.1, for, say, 10 percent of people in the wild having spots, and this gives you your answer, 0.009. So you have a 0.9 percent chance that you have smallpox, which isn't great, but it's better than 90, right? So these conditional probabilities are important because they can weigh in on the accuracy of your answer. So, just repeating this formula here in different ways: the probability of the disease given the symptoms is equal to the probability of the symptoms given the disease, times the probability of the disease, over the probability of the symptoms. In the vernacular for Bayes' theorem, the posterior probability, which is what you're interested in, is equal to the likelihood, which is the probability of the data given the hypothesis, times the prior probability, which is the information you already know about the background rate of smallpox, divided by the marginal likelihood, which is basically just that rate of spots out in the population. So the third way to state that is: the probability of the hypothesis given the data is equal to the probability of the data given the hypothesis, times the probability of that hypothesis, divided by the probability of that data. So we've done the exercise of using Bayes' theorem to calculate the probability that you have smallpox given that you have spots. How do we use this for trees? Well, we have a phylogenetic tree, which we can denote tau, and we also have the data, which is our multiple sequence alignment. And the tree that we present
is our hypothesis: is this the correct tree given the data? But we don't have that answer, right? We have the data, and we can generate trees and essentially take a look and see what the probability is that this is the correct tree given the data, using Bayes' theorem. In order to do that we have some priors, which are just like the prevalence, the rate of smallpox out in the wild. The priors are basically the extra information that you know about the organisms that you're studying and their phylogenetic relationship to each other within that tree. Some of the priors, the things that you may already know and can use to help you determine the correct tree, include the coalescent, which we'll look at in a second; gamma shape parameters, and we talked a little bit about the gamma already; the epidemiology that you already know from shoe-leather epidemiology; and then, well, ignorance. Coalescent theory is essentially the study of the reproduction of organisms in a population where the population size is changing over time, and that can affect the tree that you generate. Here we can see that if the population is increasing you can get one type of tree, and if the population is decreasing over time you can get a very different type of tree. So if you have information about whether you're studying organisms that are in a steady state, or whether there's an outbreak or some type of geometric expansion involved, you can use that information to help you nail down what might be the correct tree. Remember we talked about that alpha parameter for the gamma distribution, and the gamma distribution is, remember, the distribution of the time between scheduled train stops at a train
station, or the variation in the rate of mutation across the sites in your multiple sequence alignment. So if you know about the rate variation and the shape of that distribution, then you can use that to assist in finding the right tree. Then there's the epidemiology. If you know that there's a certain relationship among some of the organisms, that some are, say, evolutionarily related and others are not, then that's additional information that you can use to help you nail down the correct tree, or to decide which tree may be correct or not. And then the final one is ignorance, and this is actually used quite a bit in Bayesian analysis, because there is a lot of controversy over the way that you may be biasing the trees that you're building depending on the priors that you choose. If your priors are biased because of something that you think is true but is not really true, then you're going to be pushing the trees over in one direction, which may not give you the optimal trees. So most tree-building programs that use Bayesian methods, unless they have some very good prior information, will at least start out with what's called a uniform distribution for their priors. We'll take a look at this. Okay, so let's build a Bayesian tree. Here we've got Darwin and a couple of non-human primates, and we want to know about their phylogenetic relationship. So we can have a couple of different trees here that are possible trees, and we want to know which is the most likely one. Well, if we don't have any information about them at all, then they're probably all equally likely. But we do have some prior information about the evolutionary relationship between humans and other primates, so bringing those observations into the process will change the odds of which tree is correct.
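To make that a bit more concrete, here is a minimal Python sketch of how a prior shifts the odds across three candidate trees. The likelihood values and the informed prior are made-up numbers for illustration only, not values from the slides, and `posterior` is just a hypothetical helper name.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem over a finite set of hypotheses (here, tree topologies)."""
    # Numerator of Bayes' theorem for each tree: P(data | tree) * P(tree)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: the marginal probability of the data, summed over all trees
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Hypothetical likelihoods P(alignment | tree) for three candidate topologies
likelihoods = [0.020, 0.030, 0.010]

# "Ignorance": a uniform prior, every topology equally likely beforehand
uniform_prior = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical informed prior that favours the middle topology
informed_prior = [0.2, 0.7, 0.1]

post_uniform = posterior(uniform_prior, likelihoods)
post_informed = posterior(informed_prior, likelihoods)

print("uniform prior :", [round(p, 3) for p in post_uniform])
print("informed prior:", [round(p, 3) for p in post_informed])
```

With these assumed numbers, the middle tree has a posterior probability of 0.5 under the uniform prior and about 0.81 under the informed prior, so the same data supports it more strongly once the prior information is brought in.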
Here we see that this one in the middle has a higher chance of being the correct tree, once we include our prior information, than these other two possible trees. Now, we're going to have to look at every tree in order to find out what the correct topology is, and that's easy enough to do when you only have a limited number of taxa. But it's not just the topology we want to look at, it's also the branch lengths, and if we want to look at all possible branch lengths, well, there's an infinite number of possible branch lengths for every tree. That makes it a little bit harder, but it turns out that Bayes' formula allows you to do something called integrating over nuisance variables. Essentially, if you integrate over all possible branch lengths, then you change that discrete probability that you see here into a continuous one that looks like this. So you have a continuous probability space now that you can start to sample, to see if you can find the best possible combination of topology and branch length. So how do you do that? I mean, when there is an infinite number of possible trees in that space, how do you select the right one? Well, the way to do it is using something called a Markov chain Monte Carlo sampling algorithm. You can think of it as a kind of hill-climbing algorithm, and the idea is that you don't really know which tree is the correct tree, so you just select a tree at random and use Bayes' theorem to provide a measure of the likelihood that it's the correct tree. You can calculate that posterior probability; it might not be good, but at least it's a value. You then make a switch: you can switch branches, or you can
switch a branch length, to create a new topology or a new branch length, and then you re-measure the probability that that's a correct tree. If the probability increases, then you accept that the new tree is a better representation of the true tree than the old one is. So as you are sampling the probability space you're heading towards an optimum; the maximum here is the most correct tree in that probability space, and by making small changes, comparing the two trees and seeing which one's better, you start essentially to ascend up to the top. The problem is that if you randomly started out over here or over here, on these slopes down here, then what would happen is you'd get to the top here, and then all other trees would look worse than this one spot, which is not the best. In those cases your algorithm is basically going to fail you, because it's not giving you the optimal tree; it's stuck in what's called a metastable state, a local maximum. So in order to get around that, the Monte Carlo algorithm will sometimes accept that a tree that is not as good is still worth choosing, so that you may be able to escape out of that local maximum. So here on this big hill you can see that if the probability goes up you always accept. If the probability goes down, what you'll do is essentially roll the dice, and depending on how much less likely that tree is versus the previous tree, you will either accept that worse tree or not. What happens is that once you get stuck in a little local maximum, it will allow you to start to head down the slope eventually, and then hopefully you can get out of that local maximum and then back up
on your way to the true global maximum, right here. Okay, that's how it works. It takes a lot of iterations and a lot of tree comparisons, but it's actually less than the maximum likelihood and the maximum parsimony approaches for large numbers of taxa. So that's Bayesian. Are there any questions about Bayesian trees? Yeah, I'm filled with questions about Bayesian trees, but I'm not going to make you suffer through them. So that's it for phylogenies, and now we're going to move on to whole-genome phylogenies. Are there any questions at all about the phylogenetic trees, the lecture that I've given so far? No? Okay. Like I said, if you find that anything is unclear, just ask any one of the TAs or the instructors and they'll be able to set you straight. So here I'm going to close quickly on the whole-genome phylogenies, which we're going to start to work with in the hands-on module using SNV-based approaches. The approaches I've shown you so far typically use a small sub-segment of a genome, on the scale of about a gene or maybe a couple of concatenated genes, and there's usually sufficient evolutionary signal in that section for you to be able to build a robust tree, but not all the time. And because we have widespread implementation of next-generation sequencing technology, it's trivial for us to generate a draft genome for just about any organism that we study, especially bacterial organisms; we crank them out all the time. So by expanding the amount of information that we collect to the entire genome, we can build much more robust trees. This helps out a lot in the world of genomic epidemiology, because we sometimes need to be able to harvest as much variation as possible to be able to make an epidemiological interpretation, especially during an outbreak scenario where you have a clonal organism that
say, contaminates the food supply system. So a single food processing facility gets contaminated with one clonal organism, and then it gets spread out to the whole country. The classical typing data may all look the same, and so it's very hard to distinguish between any of the isolates; they might look the same as some sporadic strains, or they may look a little bit different; it's kind of hard to tell. If you have access to whole-genome sequencing technology, then you can tell the difference between any two organisms that differ, in theory, by as little as one nucleotide over the typically millions of nucleotides that they harbor within their genomes. So it's a very, very high-resolution technology, and the amount of evolutionary signal in there is maximized, so it's very highly desirable to be able to build whole-genome trees. It was famously demonstrated during the Haiti cholera outbreak in 2010, where they used whole-genome sequencing to identify the source of the outbreak as being imported from Nepal, rather than the alternative hypothesis of an endemic outbreak scenario, whereas other technologies could not do this kind of discrimination. So distance methods can be applied, and character methods can be applied; all the phylogenetic tree-building methods can be applied for whole-genome trees just as well as they can for single-gene trees. So this is an example of trees that are being built from the same organisms. One uses a single-locus approach, a single-gene approach; that's the one you see here on the bottom, and you'll notice that there are a number of polytomies in this tree at the bottom that's built with the single-gene approach. And why do we have polytomies? When do we have polytomies? We don't have enough information for us to be able to do any
further discrimination, right? We can't do any further bifurcation; there's a limited amount of information available within that single gene. Here are the same trees built using a distance-based approach. We essentially do a BLAST analysis between the genomes and collect as much similar content as we can, and that becomes the distance measure. Using that approach you get much better resolution in the trees, including separation of single genomes. If you use reference mapping, now you can harvest the variants that are present in there, and you can use those to build a tree either with a distance-based approach, but one that's based on SNVs rather than just global alignments, or with character-based approaches. So the modern ways of doing whole-genome phylogenies typically incorporate either this reference mapping and variant selection, or they use a gene-by-gene approach. So just to remind you, reference mapping is where you take a typically high-quality reference genome that is evolutionarily related, that has high sequence similarity, to the collection of genomes from the organisms that you are interested in building a tree for, and for which you have whole-genome sequence data in the form of raw reads. For every read set from each organism, you map it to the reference, then you look to see where the differences are, and you collect those differences. I use the term single nucleotide variant to describe those informative differences. Not everybody does; some people refer to them as SNPs. But just to make sure that everybody understands what I'm talking about, a single nucleotide variant is formally described as a single base change that exists among two or more homologous sequences that are under comparison. So here you can see what I call a single nucleotide variant. Now, most people would say that also looks like a single nucleotide polymorphism, and it is, and those are
used interchangeably, although if you're pedantic like me you actually make a distinction. A single nucleotide polymorphism more formally refers to a variant that is fixed in the population and exists at a relative abundance of one percent or higher in that population. Variants can be ephemeral; they can just sort of appear and disappear. They don't necessarily lend any type of selective advantage or disadvantage to that organism; they're basically just popping in and out of existence. But on the time scale of an outbreak, that may be all you have access to. So they may or may not become polymorphisms, but they haven't really had enough time to be selected for to the point where they become fixed in the population at an abundance of one percent or higher. Okay, yeah, I agree. But anyways, the idea is that every SNP is also an SNV, okay? And so you're all-encompassing if you use the initialism SNV, and that's why I use it, and also because we named our software...
So these are polymorphisms, and there's a difference between a polymorphism and an insertion or deletion. Some people will take a single dash that you'll see in a multiple alignment and they might call that a polymorphism, a SNP, an indel SNP sometimes they'll call it, but that is not formally correct. Those are referred to as indels, or sometimes as DIVs, which stands for deletion/insertion variants.
Okay, so we are interested in harvesting all of the SNVs that we can from the collection of reads that we have for the organisms that we've recently sequenced and that we want to build a phylogenomic tree for. So we choose a reference and we start mapping our reads to that reference, and then we use variant detection software to find where the reads contain a variant that is not in the reference, and if it passes certain
quality checks that have to do with the quality of the aligned read, and maybe the number of reads that actually harbored that variant versus the ones that may have the wild type in it, then we can say, okay, this is high quality enough for us to choose. We're confident that this is real and not just some sequencing artifact or some other type of confounder, and we're going to choose this and use it to help build our tree. So these red sequences here are from genome one. Then we have the orange reads that are mapping to the same blue reference, so that's genome two. Now we do the same procedure: we're pulling out all of the variants that we see that pass our filtering criteria, and we repeat until we've collected all of the variants out of all the genomes in our collection. Now we can compress those into what's called an SNV alignment. Typically what happens is you just take out all the variants, and now you're going to build a multiple sequence alignment that just consists of the variation that you have extracted out of that collection of genomes by mapping them to the reference. Now once you have a multiple sequence alignment, you can build your phylogenetic tree using any of the methods that we've just been spending the last hour and a half talking about, okay? And that's it.
So reference mapping, just as a summary: it uses reads, and so there's no need for assembling the genomes first, extracting out genes, or anything like that. It's very handy: just take a collection of raw reads, map them against a reference, pull the variants out, and build them into a tree, and we're going to look at a pipeline that does exactly that for us. But if you're not careful, then you may be harvesting some SNVs or SNPs that you think may be evolutionarily informative, but it turns out that they are not. For example, you may have homoplastic SNVs or SNPs, like the ones that we talked
about, that can be incorporated into an alignment but are not derived from the ancestor: they're present in the progeny but not present in the progenitor, right? So essentially these are the variants that don't agree with the overall true phylogeny, and they are in there. They can occur from recombination, like homologous recombination, and so we're going to take a look at some of that.
The choice of reference sequence can be difficult, because you don't always have a handy reference sequence that's high quality and is closely related to the sequences that you want to study. There are some workarounds. You can choose one that's more distant, but that can affect the topology and the branch lengths that you generate, so you really do want to try and keep things as close as you can to the collection that you're looking at. An alternate way around that, if you don't have anything handy that you can extract from the public archives, would be to just take one of the isolates that you have, assemble that into a draft genome, and then map the rest of them against that draft genome. At least that way, as long as the collection is all closely evolutionarily related, that should suffice to build a starter tree, which might need some additional interpretation, right? And there are some additional gotchas, and those are some of the things that Phil is going to show us here in the next section. That's it.
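As a rough illustration of the reference-mapping workflow described above (filter the variants, collect them across genomes, compress them into an SNV alignment), here is a minimal Python sketch. It is not the pipeline covered in the next section. The threshold values, the field names (`pos`, `alt`, `qual`, `depth`, `alt_reads`), the function names, and the toy data are all assumptions made up for this example. It also assumes that a genome with no passing variant at a site carries the reference base, which ignores sites that simply had no read coverage.

```python
from itertools import combinations

# Hypothetical filtering thresholds, chosen only for illustration.
MIN_QUAL = 30        # minimum variant quality score
MIN_DEPTH = 10       # minimum number of reads covering the site
MIN_ALT_FRAC = 0.75  # minimum fraction of reads that carry the variant


def passes_filters(call):
    """Keep a variant call only if it clears every quality check."""
    return (
        call["qual"] >= MIN_QUAL
        and call["depth"] >= MIN_DEPTH
        and call["alt_reads"] / call["depth"] >= MIN_ALT_FRAC
    )


def build_snv_alignment(reference, calls_by_genome):
    """Compress filtered variants into one alignment column per variable site."""
    kept = {
        genome: {c["pos"]: c["alt"] for c in calls if passes_filters(c)}
        for genome, calls in calls_by_genome.items()
    }
    # Every position where at least one genome has a passing variant.
    positions = sorted({pos for variants in kept.values() for pos in variants})
    alignment = {"reference": "".join(reference[p] for p in positions)}
    for genome, variants in kept.items():
        # Assumption: no passing variant means the genome has the reference base.
        alignment[genome] = "".join(variants.get(p, reference[p]) for p in positions)
    return positions, alignment


def snv_distances(alignment):
    """Count differing columns for every pair of sequences in the alignment."""
    return {
        (a, b): sum(x != y for x, y in zip(alignment[a], alignment[b]))
        for a, b in combinations(sorted(alignment), 2)
    }


# Toy data: a 10-base reference (0-based positions) and two mapped genomes.
reference = "ACGTACGTAC"
calls_by_genome = {
    "genome1": [
        {"pos": 2, "alt": "T", "qual": 50, "depth": 40, "alt_reads": 39},
        {"pos": 7, "alt": "A", "qual": 12, "depth": 40, "alt_reads": 38},  # fails quality
    ],
    "genome2": [
        {"pos": 2, "alt": "T", "qual": 60, "depth": 30, "alt_reads": 30},
        {"pos": 5, "alt": "A", "qual": 45, "depth": 20, "alt_reads": 19},
    ],
}

positions, alignment = build_snv_alignment(reference, calls_by_genome)
distances = snv_distances(alignment)
```

The resulting `alignment` is the kind of input a character-based method would take, and `distances` is the kind of pairwise matrix a distance-based method would take. Neither step removes homoplastic or recombinant sites, so those still need to be handled separately.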